The Cloudera Support Organization has always strived not only to provide solutions to our customers but also to deliver helpful knowledge. One of the primary sources of that knowledge is our Knowledge Articles. This content is created and curated by our Support Staff based on real-world experience from support cases.
These Knowledge Articles have proven to be invaluable to our Support Staff over the years. While the content is also available to our customers to use in their own troubleshooting efforts, we want to do more to help bring the right Knowledge Articles to our customers at the right time. To that end, we have been working on improving the way our customers discover the collection of knowledge available in our Knowledge Articles.
We have recently released a new capability for the chatbot on our website and MyCloudera portal, CDP-3O. While engaging through chat, you can now search across our vast Knowledge Base and get immediate access to relevant articles, answers, and solutions. To deliver effective results in the context of a chat interaction, we had to rethink how people search. When using a typical search field on a website, users tend to adapt their word choice, since we have all been trained to use a search engine effectively. In a chat, however, the word choice is different and requires the search engine to adapt to more natural language.
Additionally, people are accustomed to relatively short and abbreviated sentences when using chat. This makes for a more difficult search environment because there is little context to work with. We addressed these two areas through a combination of tools and techniques, which we'll go into below.
The previous search index for Knowledge Articles relied on the content of the Knowledge Article alone to deliver results. This works because a Knowledge Article is written around a set of symptoms, a cause, and ultimately a solution. However, over time we see that a slightly different set of circumstances or symptoms may lead to the same cause and/or solution.
The previous search index, which was based on the Knowledge Article content alone and did not apply any NLP to the search input, yielded 22% accuracy for the Top 5 results.
The Support Team captures these other conditions during a process called the Cloudera Diagnostic Methodology. The CDM process has the goal of enabling our COEs (Customer Operations Engineers) to document the diagnostic path they followed when helping resolve the customer’s issue. This includes documenting any existing content that aided in either identifying the cause or in delivering the solution.
This provides us with:
- A Knowledge Article which contains information about the component and symptom it was originally written to help address
- A number of related (by way of links to Knowledge Articles) support cases that each also have information about the component(s) involved and additional symptoms that the customer experienced.
If we combine the content from the Knowledge Article with the content from the various support cases, we now have a richer pool of symptoms, components, and conditions which were all identified by our COEs as being related to the same underlying issue. Additionally, we know that they share, at least in part, the same solution described by the Knowledge Article.
Putting it together
To make this all work together, we used several pieces of technology available to us as part of the Cloudera Data Platform. The first step is to clean up the input data. This includes the following operations:
- Extract known technical entities from the support case (log lines, configs, etc.)
- Retrieve the relationship between support cases and knowledge articles
- Extract technical sentences and label the words
- Promote the results based on usage
Extract Known Technical Entities
We use a relatively straightforward Spark job to extract known technical content including log lines, stack traces, product configuration properties, Jiras and more. We categorize each of these for use later in the process.
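As a rough illustration of this kind of extraction, here is a minimal pure-Python sketch. The production job runs on Spark and its rules are not described here, so every regular expression and category name below is an assumption chosen for illustration only.

```python
import re
from collections import defaultdict

# Hypothetical patterns: the real Spark job's extraction rules are not published.
PATTERNS = {
    # Jira keys such as HDFS-1234
    "jira": re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b"),
    # Dotted lowercase names with at least three parts, e.g. dfs.datanode.max.transfer.threads.
    # The lookarounds skip fragments of Java class names such as org.apache.hadoop.hdfs.DataNode.
    "config_property": re.compile(
        r"(?<![\w.])[a-z][a-z0-9_-]*(?:\.[a-z0-9_-]+){2,}(?!\.?\w)"
    ),
    # Java stack frames: "at package.Class.method(File.java:123)"
    "stack_frame": re.compile(r"^[ \t]*at\s+[\w.$]+\([^)]*\)", re.MULTILINE),
    # Log lines that start with a timestamp followed by a log level
    "log_line": re.compile(
        r"^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}\S*\s+"
        r"(?:TRACE|DEBUG|INFO|WARN|ERROR|FATAL)\b.*$",
        re.MULTILINE,
    ),
}


def extract_entities(text):
    """Return a dict mapping each category to the matches found in the text."""
    found = defaultdict(list)
    for category, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found[category].append(match.group(0).strip())
    return dict(found)


sample = (
    "2021-03-04 10:15:32,123 ERROR org.apache.hadoop.hdfs.DataNode: Failed to start\n"
    "    at org.apache.hadoop.hdfs.DataNode.main(DataNode.java:2345)\n"
    "We changed dfs.datanode.max.transfer.threads as suggested in HDFS-1234.\n"
)
entities = extract_entities(sample)
```

Keeping the category next to each match is what allows the later steps to treat a stack trace differently from a configuration property.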
Retrieve Support Case and Knowledge Article relationships
Next, we use Hue to extract from Impala all support cases that reference a Knowledge Article in the CDM fields, and store them in Parquet files. This is done for easy transport and reference from Spark. The resulting set of cases becomes our new dataset for the next phase.
Extract Technical Sentences
For this step, we used Spark ML in CML (Cloudera Machine Learning) to train a Naïve Bayes model to identify what a technical sentence looks like in the context of a case comment. This began with taking a dataset containing 10k sentences and labeling them as one of the following:
- Technically Relevant – Contains technical content that’s relevant to the case discussion.
- Screen Share – A sentence related to scheduling or discussing a screen share session.
- Data Collection Request – A sentence requesting data from the customer.
- Not Relevant – A sentence not relevant to the technical content of the case.
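To make the classification step concrete, here is a small multinomial Naïve Bayes classifier with add-one smoothing, written in plain Python as a stand-in for the Spark ML model. The training sentences, label strings, and tokenizer are invented for illustration; the real model was trained on the 10k labeled sentences.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(sentence):
    """Lowercase and split into tokens, keeping dots inside names like dfs.replication."""
    tokens = (t.strip(".") for t in re.findall(r"[a-z0-9_.]+", sentence.lower()))
    return [t for t in tokens if t]


class NaiveBayesSentenceClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, sentences, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for sentence, label in zip(sentences, labels):
            tokens = tokenize(sentence)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, sentence):
        # Words never seen in training carry no evidence, so they are skipped.
        tokens = [t for t in tokenize(sentence) if t in self.vocab]
        total_sentences = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total_sentences)  # log prior
            denominator = sum(self.word_counts[label].values()) + len(self.vocab)
            for token in tokens:
                score += math.log((self.word_counts[label][token] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training data, two sentences per label.
training = [
    ("The NameNode log shows a java.io.IOException during checkpoint", "technically_relevant"),
    ("Setting dfs.replication to 3 fixed the under-replicated blocks error", "technically_relevant"),
    ("Can we schedule a screen share for tomorrow morning", "screen_share"),
    ("I am available for a webex screen share at 3pm", "screen_share"),
    ("Please upload the diagnostic bundle from the cluster", "data_collection_request"),
    ("Could you please attach the output of the command", "data_collection_request"),
    ("Thank you for your patience", "not_relevant"),
    ("Have a great weekend", "not_relevant"),
]
model = NaiveBayesSentenceClassifier().fit(
    [sentence for sentence, _ in training], [label for _, label in training]
)
prediction = model.predict("Are you free for a screen share today")
```

With this toy data, the example sentence is classified as `screen_share`, so it would be dropped before the technical content is indexed.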
Now that we have the trained model, we need to use it on the cases identified in the previous step. This starts once again using Spark ML and the model above to extract the relevant sentences from the cases. Spark ML’s TF-IDF feature extractor is then used to identify the 100 most important words and a weight is assigned to each of them.
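The weighting step can be sketched in a few lines of plain Python. This is not the Spark ML pipeline itself: it uses the smoothed IDF formula documented for Spark ML, log((N + 1) / (df + 1)), and it assumes that a word's importance is the sum of its TF-IDF scores across all relevant sentences, which is our guess at one reasonable aggregation.

```python
import math
from collections import Counter


def top_weighted_terms(documents, top_n=100):
    """Rank terms by summed TF-IDF across tokenized documents.

    documents: a list of token lists, one per relevant sentence.
    Returns up to top_n (term, weight) pairs, highest weight first.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    weights = Counter()
    for tokens in documents:
        for term, tf in Counter(tokens).items():
            idf = math.log((n_docs + 1) / (doc_freq[term] + 1))
            weights[term] += tf * idf
    return weights.most_common(top_n)


# Hypothetical tokenized sentences taken from one set of related cases.
sentences = [
    ["namenode", "checkpoint", "failed"],
    ["namenode", "restart"],
    ["namenode", "checkpoint", "timeout", "timeout"],
]
ranked = top_weighted_terms(sentences, top_n=100)
```

In this toy input, "namenode" appears in every sentence and therefore receives a weight of zero, while the rarer "timeout" ranks first. That is the behavior we want: words that distinguish one issue from another carry the most weight.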
The above is done separately for comments coming from the customer and those coming from Clouderans since we have found that the terminology and phrasing used can differ. This allows us to be more focused in the responses based on who is using the search index, customer vs internal use.
Promote the Results
We now have the case and Knowledge Article content extracted, normalized, and tokenized. The next step is to join it all together, along with some metrics to help promote the more relevant content.
To do this, we can again leverage the case content to help. Knowledge Articles that are regularly referenced in support cases are considered to be more likely to be relevant to the customer.
But we have more than just case references to measure Knowledge Article popularity. We can also pull in another data source: customer views. We again use Hue and Impala to extract the specific view of the data we want into Parquet files. We take the access counts for each Knowledge Article, filter out access from Clouderans, and rank the more popular articles higher.
The final step is to use Spark to join the Knowledge Article content, related case semantic context, and popularity metrics from support cases and article views into a single search index.
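The shape of that join can be sketched with plain Python dictionaries in place of Spark DataFrames. All field names and identifiers below are hypothetical, and the popularity formula (the sum of log-scaled case references and customer views) is an assumption for illustration, since the actual ranking formula is not described here.

```python
import math
from collections import Counter, defaultdict


def build_search_index(articles, case_links, case_terms, article_views):
    """Join article text, related case terms, and popularity metrics.

    articles:      {article_id: article text}
    case_links:    iterable of (case_id, article_id) pairs from the CDM fields
    case_terms:    {case_id: {term: weight}} from the weighting step
    article_views: iterable of {"article_id": ..., "is_internal": bool}
    """
    cases_by_article = defaultdict(list)
    for case_id, article_id in case_links:
        cases_by_article[article_id].append(case_id)

    # Only views from customers count toward popularity.
    customer_views = Counter(
        view["article_id"] for view in article_views if not view["is_internal"]
    )

    index = {}
    for article_id, body in articles.items():
        case_ids = cases_by_article.get(article_id, [])
        related_terms = Counter()
        for case_id in case_ids:
            related_terms.update(case_terms.get(case_id, {}))
        index[article_id] = {
            "body": body,
            "related_terms": dict(related_terms),
            "case_references": len(case_ids),
            "customer_views": customer_views[article_id],
            # Assumed scoring: log scaling keeps a few very popular articles
            # from dominating every result list.
            "popularity": math.log1p(len(case_ids))
            + math.log1p(customer_views[article_id]),
        }
    return index


index = build_search_index(
    articles={"KA-1": "Checkpoint fails with timeout", "KA-2": "Rotate keytabs"},
    case_links=[("C-1", "KA-1"), ("C-2", "KA-1")],
    case_terms={"C-1": {"timeout": 1.5}, "C-2": {"timeout": 0.5, "kerberos": 1.0}},
    article_views=[
        {"article_id": "KA-1", "is_internal": False},
        {"article_id": "KA-1", "is_internal": True},
        {"article_id": "KA-2", "is_internal": False},
    ],
)
```

Each index entry now carries the article's own text, the vocabulary of every case that referenced it, and the two popularity signals.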
Testing the Results
To test the validity of this approach, we performed tests at various stages of the process.
The first tests came during the Spark ML model training. We reserved 30% of the original 10k categorized sentences to see how well the model performs against a known dataset. When testing, we found that comments from COEs saw an 88% accuracy and those from Customers saw 90% accuracy.
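For readers unfamiliar with this kind of holdout test, the sketch below shows the general pattern: shuffle the labeled sentences, hold back 30%, train on the rest, and score the predictions on the held-back portion. The `train_majority` baseline and the toy examples are placeholders we made up for the illustration, not the Naïve Bayes model or the data described above.

```python
import random
from collections import Counter


def holdout_accuracy(examples, train, test_fraction=0.3, seed=42):
    """Train on part of the labeled data and measure accuracy on the held-out rest.

    examples: list of (text, label) pairs
    train:    function taking a training list and returning a predict(text) function
    Returns (accuracy, training set size, test set size).
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    test_set, train_set = shuffled[:n_test], shuffled[n_test:]

    predict = train(train_set)
    correct = sum(1 for text, label in test_set if predict(text) == label)
    return correct / len(test_set), len(train_set), len(test_set)


def train_majority(train_set):
    """Placeholder trainer: always predicts the most common training label."""
    majority_label = Counter(label for _, label in train_set).most_common(1)[0][0]
    return lambda text: majority_label


# Hypothetical labeled sentences: eight technical, two not relevant.
examples = [("technical sentence %d" % i, "technically_relevant") for i in range(8)]
examples += [("thanks %d" % i, "not_relevant") for i in range(2)]

accuracy, n_train, n_test = holdout_accuracy(examples, train_majority)
```

Any trainer with the same shape can be passed in place of the baseline, which makes it easy to compare a real model against a trivial one on the same split.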
We also tested the final index by taking a list of Knowledge Articles and having several individuals create short (3-10 word) statements that a customer might use in chat context and expect to see the article returned. With the new index, we saw that 73% of the searches returned the expected results in the Top 5 articles.
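The Top 5 measurement itself is simple to express. In this sketch, `search` stands for any function that returns a ranked list of article IDs for a query; the canned results and article IDs are hypothetical.

```python
def top_k_accuracy(search, test_queries, k=5):
    """Fraction of queries whose expected article appears in the top k results.

    search:       function mapping a query string to a ranked list of article IDs
    test_queries: list of (query, expected_article_id) pairs
    """
    hits = sum(
        1 for query, expected in test_queries if expected in search(query)[:k]
    )
    return hits / len(test_queries)


# Hypothetical canned results standing in for the real search index.
canned_results = {
    "hive query slow": ["KA-7", "KA-2"],
    "kerberos ticket expired": ["KA-9"],
    "namenode failover": ["KA-1", "KA-3", "KA-4", "KA-5", "KA-6", "KA-8"],
}


def toy_search(query):
    return canned_results.get(query, [])


test_queries = [
    ("hive query slow", "KA-2"),          # found at rank 2: a hit
    ("kerberos ticket expired", "KA-1"),  # not returned: a miss
    ("namenode failover", "KA-8"),        # found at rank 6: a miss for k=5
]
score = top_k_accuracy(toy_search, test_queries, k=5)
```

Here only one of the three expected articles lands in the Top 5, so the score is one third. The 73% figure above is the same ratio computed over the statements written by our testers.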
This new index is currently in use in the chat feature on our website, specifically when following the technical path to get additional assistance. However, we want to expand its use on MyCloudera and in our products. To do this, we will be introducing additional content alongside the Knowledge Articles into the index, such as documentation, whitepapers, and training material.
We will also look at introducing new use cases such as Just-In-Time Support where we will use the index to help provide relevant content during the case creation process. We have done some initial testing with this and found that using the case summary and case descriptions is giving us 80% accuracy in returning relevant Knowledge Articles within the Top 5 results.
By surfacing these Knowledge Articles during case creation, we may be able to offer the customer a solution without needing to contact a COE. In the instances where the case is still needed, knowing which articles were presented, reviewed, and ultimately rejected by the customer can also help in the troubleshooting process and expedite the path to a solution.
We are excited to see where the use of Machine Learning and Natural Language Processing can take us in offering our customers a more natural interaction and deliver more meaningful results.
Learn more about how you can use NLP for question answering in this Cloudera Fast Forward research: https://qa.fastforwardlabs.com/