Analyzing and Improving a Watson Assistant Solution Part 3: Recipes for common analytic patterns

Andrew R. Freed
IBM Data Science in Practice
Mar 13, 2020


Follow these recipes to analyze your Watson Assistant! Photo by Maarten van den Heuvel on Unsplash

In previous posts we explored what analysts want to discover about their virtual assistants and some building blocks for assembling analytics. In this post I will demonstrate some common recipes tailored to Watson Assistant logs.

Base layer: Get raw log events

First we extract the raw log events and store them on the file system. This requires the apikey and URL for your skill. For a single-skill assistant you will also need the workspace ID (extractable from the “Legacy v1 Workspace URL”); for a multi-skill assistant there are other IDs you can filter on (described in the Watson Assistant list log events API).

How to find the API Details for your Watson Assistant workspace

Example API details:

Skill Name: Sample Skill
Skill ID: 5f34f4e3-c579-4a59-aa87-9fc45098133d
Legacy v1 Workspace URL: https://gateway-wdc.watsonplatform.net/assistant/api/v1/workspaces/5f34f4e3-c579-4a59-aa87-9fc45098133d/message

The corresponding notebook code for this recipe looks like the following, which extracts up to 10,000 log events (20 pages of 500) between November 1 and November 30, 2019:

import getAllLogs

iam_apikey="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
url="https://gateway-wdc.watsonplatform.net/assistant/api"
workspace_id="5f34f4e3-c579-4a59-aa87-9fc45098133d"
log_filter="response_timestamp>=2019-11-01,response_timestamp<2019-11-30"
page_size_limit=500
page_num_limit=20
version="2018-09-20"

rawLogsJson = getAllLogs.getLogs(iam_apikey, url, workspace_id, log_filter, page_size_limit, page_num_limit, version)

Base layer: Get fields of interest from log events

Next we convert the raw log events into a format suitable for exploration and data science. We will extract fields of interest and correlate the raw log events into conversations.

A correlating conversation ID is required. For a single-skill assistant this is “response.context.conversation_id”. The most commonly analyzed event fields are automatically extracted. You may optionally provide a list of additional fields to include in your analysis. The output of this step is a Pandas data frame representing your Watson Assistant conversation data.

This recipe looks as follows for a single-skill assistant using only the automatically extracted fields. The last statement prints a sample from the data frame, confirming the log extraction succeeded.

import extractConversations

primaryLogKey = "response.context.conversation_id"
conversationKey='conversation_id'
customFieldNames = None

allLogsDF = extractConversations.extractConversationData(rawLogsJson, primaryLogKey, customFieldNames)
allLogsDF.head()

Example output:

Sample log events

Recipe — how many conversations?

The business analyst will start with a basic question: how many conversations did the assistant have? This is handled by the following code.

print("Total log events:",len(allLogsDF))print("Total conversations:",len(allLogsDF[conversationKey].unique()))

Example output:

Total log events: 10000
Total conversations: 1311

Recipe — how frequently is a dialog node visited?

The business analyst will also want to understand how many times key events occur in a conversation. The events may include satisfaction, escalation, or any other type of event. These events are correlated to specific nodes in the Watson Assistant dialog. Given the node id for a dialog node, we can count how many times it is visited.

When using the Watson Assistant dialog editor, the node id for a given dialog node can be found by clicking on the dialog node and inspecting the end of the URL.

node_to_search="node id here like node_5_1546894657426"node_visits_as_frame = allLogsDF[[node_to_search in x for x in allLogsDF['nodes_visited']]]print("Total visits to target node:",len(node_visits_as_frame))print("Unique visitors to target node:",len(node_visits_as_frame[conversationKey].unique()))

Example output:

Total visits to target node: 595
Unique visitors to target node: 586

Recipe — gather user utterances to train and test the classifier

The intent analyst wants to make sure user utterances are being classified into the right intents. This process takes three steps overall:

1. Gather the user utterances which should map to intents

2. Have subject matter experts provide or validate the intents

3. Test accuracy and improve the classifier

Step 1 is demonstrated below, step 2 is custom to your assistant, and step 3 is handled in much greater detail by the Dialog Skill Analysis notebook.

It is important to recognize that not all user utterances contain intents. Yes/no questions like “Was that helpful?” or data entry questions like “What’s the order ID?” are generally not coded with intents. Thus you need to know where to look for utterances that contain intents.

The simplest source of user utterances is the first time the user sends text to the assistant. This requires knowing which dialog turn number corresponds to the user's first utterance, represented by the variable USER_FIRST_TURN_COUNTER. If the user speaks first, USER_FIRST_TURN_COUNTER=1; if the assistant speaks first, USER_FIRST_TURN_COUNTER=2.

Example code:

USER_FIRST_TURN_COUNTER=2

userFirstTurnView = allLogsDF[allLogsDF['dialog_turn_counter']==USER_FIRST_TURN_COUNTER]
userFirstTurnDF = userFirstTurnView[["input.text","intent","intent_confidence"]]
userFirstTurnDF.to_csv("utterances.csv",index=False,header=["Utterance", "Predicted Intent", "Prediction Confidence"])
userFirstTurnDF.head()

Example output:

Sample user utterances with intent and intent confidence

Similarly we can extract the utterances given in response to a specific dialog node by specifying its node identifier. In this recipe we use the prev_nodes_visited field, which holds the nodes_visited from the previous message, since the utterances we want are responses to that message.

Example code:

import pandas as pd

def responses_to_node(df:pd.DataFrame, node_to_search:str):
    # Keep only messages that came right after the target node was visited
    responses_df = df[[node_to_search in x for x in df['prev_nodes_visited']]]
    # Remove turns where the user gave no text response
    responses_df = responses_df[responses_df['input.text'] != '']
    return responses_df[['input.text','intent','intent_confidence']]

nodeResponsesDF = responses_to_node(allLogsDF, node_to_search)
nodeResponsesDF.head()

Example output:

User responses to a target node and how they were classified

Recipe — gather voice responses to a given dialog node

The speech analyst will want to review the transcribed utterances for a given dialog node in order to determine how well the speech model is working for a given response type. As in the recipe above used by the intent analyst, we can extract additional fields suitable for reviewing audio files and transcriptions. The fields message_start and message_end represent the relative times within the call when the user was prompted to respond and when they responded. The additional context variable STT_CONFIDENCE needs to be added via the assistant's orchestration layer in order to pass along the Speech to Text engine's transcription confidence; a sketch of that pattern follows the recipe code below.

Example code:

def speech_responses_to_node(df:pd.DataFrame, conversationKey:str, node_to_search:str):
    # Keep only messages that came right after the target node was visited
    responses_df = df[[node_to_search in x for x in df['prev_nodes_visited']]]
    # Remove turns where the user gave no text response
    responses_df = responses_df[responses_df['input.text'] != '']
    return responses_df[[conversationKey,'message_start','message_end', 'input.text','intent','intent_confidence','STT_CONFIDENCE']]

voiceResponsesDF = speech_responses_to_node(allLogsDF, conversationKey, node_to_search)
voiceResponsesDF.head()

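The orchestration code itself is beyond these recipes, but the pattern is straightforward: read the confidence from the Speech to Text result and copy it into the Watson Assistant context before the next message call. A minimal sketch, assuming stt_result holds the JSON returned by a Speech to Text recognize request (the helper name add_stt_confidence is illustrative):

def add_stt_confidence(stt_result: dict, assistant_context: dict) -> dict:
    # Take the confidence of the top alternative of the first transcription result
    results = stt_result.get('results', [])
    if results and results[0].get('alternatives'):
        assistant_context['STT_CONFIDENCE'] = results[0]['alternatives'][0].get('confidence', 0.0)
    else:
        assistant_context['STT_CONFIDENCE'] = 0.0
    return assistant_context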

Recipe — summarize intent confidence

The intent analyst will want a broad summary of the intents selected by the assistant, including their frequency and average confidence. Training and maintenance efforts should focus on the highest-frequency intents with the lowest confidence. In the example below we note that many utterances are not classified to any intent (these should always be reviewed). We can also see that the “reset_password” and “ticket_new_request” intents occur with high frequency and low confidence, so training updates should focus on these trouble spots.

Example code:

import intent_heatmap

userIntentSummaryDF = userFirstTurnDF.groupby('intent',as_index=False).agg({
    'input.text': ['count'],
    'intent_confidence': ['mean']
})
# Rename columns for future ease of use
userIntentSummaryDF.columns=["intent","count","confidence"]
# Relabel the empty intent; this increases readability of dashboard reports
if userIntentSummaryDF.loc[0,"intent"]=="":
    userIntentSummaryDF.loc[0,"intent"]="(no intent found)"
# Printing userIntentSummaryDF or userIntentSummaryDF.head() produces a table,
# but the same data can be displayed visually:
intent_heatmap.generateTreemap(userIntentSummaryDF, 'count', 'confidence', 'intent', 'Classifier confidence by intent')

Example output:

Tree map visualization of intent frequency and confidence
Contents of userIntentSummaryDF.head()

What’s next after these recipes?

This post outlined two basic types of recipes: building descriptive “dashboard” metrics and extracting data to test and train your models.

For “dashboard” metrics the next step is to determine which metrics your business requires and start implementing them. Common dashboard metrics include how many conversations start, how many escalate, and how many complete. These metrics are often categorized by type of intent. Many of these dashboard metrics are achievable by counting how many conversations reach (or do not reach) a given dialog node, as sketched below. The Measure Notebook is a good launching point for this investigation.
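A minimal sketch of such a reach metric, reusing allLogsDF and conversationKey from the recipes above (the node id is hypothetical):

node_to_search = "node_5_1546894657426"  # hypothetical node id

all_conversations = set(allLogsDF[conversationKey].unique())
reached_df = allLogsDF[[node_to_search in x for x in allLogsDF['nodes_visited']]]
reached = set(reached_df[conversationKey].unique())

print("Conversations reaching node:", len(reached))
print("Conversations not reaching node:", len(all_conversations - reached))
print("Reach rate: {:.1%}".format(len(reached) / len(all_conversations)))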

For “test/train” recipes the goal is to extract output from a model and assess how well the model is performing. This generally requires review from a subject matter expert. With a set of utterances and intents from both SMEs and Watson Assistant you can determine the accuracy of the assistant and where the intent training needs to be improved. With a set of audio files, Watson transcriptions and SME transcriptions you can assess the performance of a Speech to Text model. Dialog Skill Analysis and WA-Testing-Tool are good next steps once you have collected this kind of data.

Note that the analytics you wish to execute may require a modification of your Watson Assistant. One common pattern is to add extra context variables to the assistant for use in analytics. Our addition of STT_CONFIDENCE to measure Speech to Text transcription confidence was one example. You may also add variables indicating the beginning, end, or result of some business process. Many analytics are possible using just node ids, intents, and confidence values, but you can augment these analytics by adding your own context variables and analyzing them, as sketched below.
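For instance, custom context variables can be pulled into the analysis data frame through the customFieldNames parameter we left as None earlier. A minimal sketch, assuming the variables are stored in the conversation context; the exact field paths depend on your assistant, so the names below are illustrative:

# Illustrative field paths; use the names your orchestration layer
# and dialog actually set in the context
customFieldNames = [
    "response.context.STT_CONFIDENCE",
    "response.context.BUSINESS_PROCESS_RESULT",
]
allLogsDF = extractConversations.extractConversationData(rawLogsJson, primaryLogKey, customFieldNames)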

The code used in this blog series is available on the WA-Testing-Tool site at https://github.com/cognitive-catalyst/WA-Testing-Tool/tree/master/log_analytics.

Thanks to the following reviewers and co-contributors to this post: Eric Wayne, Aishwarya Hariharan, Audrey Holloman, Mohammad Gorji-Sefidmazgi, and Daniel Zyska.

For more help in analyzing and improving your Watson Assistant reach out to IBM Data and AI Expert Labs and Learning.

Andrew R. Freed
Technical lead in IBM Watson. Author: Conversational AI (manning.com, 2021). All views are only my own.