New Speech Testing Utilities for Conversational AI Projects

Andrew R. Freed
IBM Watson Speech Services
Jul 27, 2022


Figure 1 Testing conversational AI requires connecting a few principles! Photo by Museums Victoria on Unsplash

Teams building Conversational AI solutions face two testing questions early in their project development:

1. How will the dialog sound when spoken by the AI (via Text to Speech)?

2. How accurately will user utterances be transcribed (via Speech to Text)?

I’ve previously written about strategies for voice-based testing. Recent updates to the open-source TTS-Python and STT-WER-Python repositories provide scripts that help answer both of those questions.

Verify how dialogue sounds

Imagine you’ve built a dialog tree in Watson Assistant. You want to make sure each message you’ve coded sounds good when the assistant speaks it. You’ve tuned your messages with SSML and phonetics, or with “Tune by Example”, and now you want to hear all the messages. You could make a lot of phone calls to your assistant, trying different conversational paths on each call, but that approach quickly becomes tedious and redundant.

Instead, you can use two capabilities from TTS-Python as shown below.

Figure 2 Pattern for verifying Text-to-Speech synthesis of Watson Assistant dialog text

The steps are:

1. Extract the output text from your Watson Assistant skill. This builds a CSV of all your dialog messages with their associated Watson Assistant node.

2. Synthesize each message to an audio file.

3. Listen to each audio file.

Extracting text from your skill

The TTS-Python tool’s extract_skill_text.py reads a Watson Assistant skill either through API parameters or from an exported JSON file. It then iterates through all dialog nodes and records each output dialog message in a two-column CSV file.

Figure 3 Example configuration for extract_skill_text.py

This file can be used as-is, or you can edit its contents manually (perhaps to create multiple variations of the same dialog text).
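
If you’re curious what the extraction looks like under the hood, here is a minimal sketch of the same idea using the ibm-watson Python SDK directly. The API key, service URL, and skill ID are placeholders, and the real extract_skill_text.py handles more response types and configuration options than this:

```python
import csv

from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson import AssistantV1

# Placeholder credentials -- substitute your own service details.
authenticator = IAMAuthenticator("YOUR_APIKEY")
assistant = AssistantV1(version="2021-06-14", authenticator=authenticator)
assistant.set_service_url("https://api.us-south.assistant.watson.cloud.ibm.com")

# Export the full skill, including its dialog tree.
workspace = assistant.get_workspace(
    workspace_id="YOUR_SKILL_ID", export=True
).get_result()

with open("dialog_text.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "text"])
    for node in workspace.get("dialog_nodes", []):
        for response in (node.get("output") or {}).get("generic", []):
            if response.get("response_type") != "text":
                continue
            # A text response can carry several variations in "values";
            # record each one against its dialog node ID.
            for i, value in enumerate(response.get("values", [])):
                if value.get("text"):
                    writer.writerow([f"{node['dialog_node']}-{i}", value["text"]])
```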

Synthesizing the text to audio

The TTS-Python tool’s synthesize.py takes any CSV file (with “id” and “text” columns) and produces audio files, using “id” to build the filename and “text” as the content to synthesize. The configuration for this script requires picking an output file type (.wav, .mp3, etc.) and selecting one or more voices to use for the synthesis.

When synthesize.py is combined with extract_skill_text.py, the output audio files are easily traced back to the original Watson Assistant skill. The Watson Assistant node ID is carried through to the output file name, and that node ID helps you quickly get back to the message you are testing.
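
As a rough sketch of that flow (not the script’s actual configuration format), synthesizing each CSV row with the ibm-watson SDK might look like the following; the voice, service URL, and output directory here are my assumptions:

```python
import csv
from pathlib import Path

from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson import TextToSpeechV1

authenticator = IAMAuthenticator("YOUR_APIKEY")
tts = TextToSpeechV1(authenticator=authenticator)
tts.set_service_url("https://api.us-south.text-to-speech.watson.cloud.ibm.com")

out_dir = Path("audio_out")
out_dir.mkdir(exist_ok=True)

with open("dialog_text.csv", newline="") as f:
    for row in csv.DictReader(f):
        audio = tts.synthesize(
            row["text"],
            voice="en-US_AllisonV3Voice",  # any supported voice works here
            accept="audio/wav",
        ).get_result()
        # The "id" column (the Watson Assistant node ID) becomes the
        # filename, so every audio file traces back to its dialog node.
        (out_dir / f"{row['id']}.wav").write_bytes(audio.content)
```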

Listen to each audio file

You can use automation to build the audio files, but there’s no substitute (yet?) for having a human listen to each message and verify that it sounds good. You can distribute the audio files across your team, sharing the load for this manual verification task. Or you can combine all the files into a single playlist. Either way, listening to these audio files saves you a lot of Text to Speech testing time compared to trying to hear every message over a phone call.
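
If you go the playlist route, a few lines of Python can assemble one. This sketch assumes the synthesized files landed in the audio_out directory from the previous example:

```python
from pathlib import Path

# Collect every synthesized file into a single .m3u playlist so a
# reviewer can audition all messages back-to-back in a media player.
wav_files = sorted(Path("audio_out").glob("*.wav"))

with open("review_playlist.m3u", "w") as playlist:
    playlist.write("#EXTM3U\n")
    for wav in wav_files:
        playlist.write(f"{wav.resolve()}\n")
```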

Figure 4 Listen to the text your conversational AI solution will speak. Photo by Soundtrap on Unsplash

Verify speech to text transcription accuracy

There’s no substitute for using audio data collected from humans. But if you are in a pinch, or have small gaps in your audio test set, you can use Text to Speech to generate synthetic test audio that supplements your human-collected audio. Even better, you can use your Watson Assistant skill to figure out what text you need to generate. This pattern combines two capabilities from TTS-Python and two from STT-WER-Python as shown below.

Figure 5 Pattern for verifying Speech-to-Text transcription of Watson Assistant intent and entity training data

The steps are:

1. Extract the user utterance training text (from intents and entities) from your Watson Assistant skill.

2. Synthesize each message to an audio file using Text to Speech.

3. Transcribe each audio file to text using Speech to Text.

4. Analyze the Speech to Text transcription accuracy.

Extract user utterance training text

This example starts like the previous one: we simply toggle the TTS-Python tool’s extract_skill_text.py configuration to extract intent and entity examples (user input) instead of dialog text (system output).

Figure 6 Extracting training examples from Watson Assistant

The output CSV file can be passed directly to the next step, or can be manually manipulated (with additions, modifications, or deletions) before proceeding.

Synthesizing the text to audio

The TTS-Python tool’s synthesize.py has been updated with two capabilities that make it useful for bootstrapping Speech-to-Text test data:

1. It can generate output audio for each of multiple voices. (Speech-to-Text transcription accuracy varies by voice!)

2. While synthesizing, it also builds a reference file suitable for use as “ground truth” by STT-WER-Python.

Figure 7 Example use of synthesize.py

The output audio and ground-truth reference file can be used as-is, or appended to an existing Speech to Text test set and its ground truth.
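
A rough sketch of that bootstrapping step follows. The voices, filenames, and reference-file column names here are illustrative; check the STT-WER-Python documentation for the exact ground-truth format its scripts expect:

```python
import csv
from pathlib import Path

from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson import TextToSpeechV1

authenticator = IAMAuthenticator("YOUR_APIKEY")
tts = TextToSpeechV1(authenticator=authenticator)
tts.set_service_url("https://api.us-south.text-to-speech.watson.cloud.ibm.com")

# Transcription accuracy varies by voice, so synthesize each utterance
# with several voices and keep a reference transcript for every file.
voices = ["en-US_AllisonV3Voice", "en-US_MichaelV3Voice", "en-US_LisaV3Voice"]
out_dir = Path("stt_test_audio")
out_dir.mkdir(exist_ok=True)

with open("training_text.csv", newline="") as f, \
     open("reference.csv", "w", newline="") as ref:
    writer = csv.writer(ref)
    writer.writerow(["audio_file", "reference_text"])  # illustrative column names
    for row in csv.DictReader(f):
        for voice in voices:
            filename = out_dir / f"{row['id']}-{voice}.wav"
            audio = tts.synthesize(
                row["text"], voice=voice, accept="audio/wav"
            ).get_result()
            filename.write_bytes(audio.content)
            writer.writerow([filename.name, row["text"]])
```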

Transcribe and analyze accuracy

I’ve combined these last two steps because Marco Noel’s previous guide “New Python Scripts to Measure Word Error Rate on Watson Speech to Text” still covers them well. The files you generate with TTS-Python are explicitly crafted so that they are compatible with STT-WER-Python.
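
To give a feel for the shape of those two steps, here is a simplified sketch: transcribe one file with the ibm-watson SDK, then score it with a textbook word error rate. The STT-WER-Python scripts do this in bulk, with text normalization and reporting that this sketch omits; the audio filename and reference text below are hypothetical:

```python
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson import SpeechToTextV1

authenticator = IAMAuthenticator("YOUR_APIKEY")
stt = SpeechToTextV1(authenticator=authenticator)
stt.set_service_url("https://api.us-south.speech-to-text.watson.cloud.ibm.com")

def transcribe(path):
    # Send one audio file to Speech to Text and join the transcript segments.
    with open(path, "rb") as audio:
        result = stt.recognize(audio=audio, content_type="audio/wav").get_result()
    return " ".join(
        seg["alternatives"][0]["transcript"].strip() for seg in result["results"]
    )

def word_error_rate(reference, hypothesis):
    # WER = (substitutions + deletions + insertions) / reference word count,
    # computed with a standard word-level edit distance.
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical file and reference text, for illustration only.
hypothesis = transcribe("stt_test_audio/node_1-0-en-US_AllisonV3Voice.wav")
print(word_error_rate("i want to check my balance", hypothesis))
```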

Wrapping up

In this article you’ve learned how to use the open-source TTS-Python and STT-WER-Python repositories to answer two common testing questions for Conversational AI solutions. The patterns and tools described in this article are field-tested by IBM practitioners across a wide variety of speech and AI engagements. We hope you download these tools from GitHub and find them useful as you build your own AI solutions!


Andrew R. Freed
Technical lead in IBM Watson. Author: Conversational AI (manning.com, 2021). All views are only my own.