A mental model for Speech to Text training

Andrew R. Freed
IBM Watson Speech Services
Mar 14, 2023

The following two anecdotes capture how I think about customizing a speech to text model: first by using a “corpus” of words and phrases in context, and then by recognizing words without context.

Figure 1 How do you feel when you can’t understand the words? Photo by Jr Korpa on Unsplash.

Building a corpus of words and phrases in context

As a young boy I took a Spanish course where the teacher talked in nothing but Spanish. This “fully immersive” experience was hard for me as someone who much prefers reading and seeing words. I want to learn a new language first by reading and then sounding it out. How could I work in reverse?

My teacher said “cómo te llamas” (what’s your name) but I couldn’t even make out the words.

Did she say:

· Como tejamas

· Comotayamas

· Comote yamas

How could I know?

Figure 2 Cómo te llamas? Photo by Tim Mossholder on Unsplash

I eventually heard “cómo te va” (how are you doing) and at least learned that the “cómo te” sound was separate from both “llamas” and “va”. (Though I still didn’t know how to spell any of them.)

Finally I heard “cómo estás” and learned that “cómo” is a unique word. My language corpus at this point was three phrases:

· cómo te llamas

· cómo te va

· cómo estás

There’s a lot of information encoded in this corpus:

· “cómo” is a word that often starts a phrase

· “cómo” is often followed by “te”

· “cómo”, “te”, “llamas”, “va”, and “estás” are words in this domain

· “cómo te llamas”, “cómo te va”, and “cómo estás” are phrases in this domain

This is the same philosophy behind customizing a speech to text model with a language corpus.
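
The facts listed above can be tallied mechanically from the three phrases. The following is a minimal sketch of that idea, assuming simple whitespace tokenization and plain counting; the variable names are made up for illustration, and this is not how any particular speech service builds its language model internally.

```python
from collections import Counter

# The three-phrase "corpus" from the anecdote.
corpus = [
    "cómo te llamas",
    "cómo te va",
    "cómo estás",
]

vocabulary = set()       # words in this domain
first_words = Counter()  # how often each word starts a phrase
bigrams = Counter()      # how often one word directly follows another

for phrase in corpus:
    words = phrase.split()
    vocabulary.update(words)
    first_words[words[0]] += 1
    bigrams.update(zip(words, words[1:]))

print(sorted(vocabulary))
print("phrases starting with 'cómo':", first_words["cómo"])
print("'cómo' followed by 'te':", bigrams[("cómo", "te")])
```

Even with three phrases, the counts show that “cómo” always starts a phrase and is followed by “te” more often than by anything else.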

A corpus is a collection of words and phrases (one phrase per line) that a language model is likely to encounter while transcribing audio into text. The model needs to learn a specific language — generally the language of your domain — and the corpus tells it what words it is likely to encounter and even what order they occur in.
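
As a rough sketch of the “one phrase per line” layout, the helper below turns a list of phrases into corpus text. The function name `corpus_text` and the cleanup rules (trimming whitespace, dropping blanks and duplicates) are assumptions for illustration, not requirements taken from any service documentation.

```python
def corpus_text(phrases):
    """Return corpus text with one cleaned phrase per line."""
    seen = set()
    lines = []
    for phrase in phrases:
        # Collapse runs of whitespace and trim the ends.
        cleaned = " ".join(phrase.split())
        # Skip empty lines and phrases already included.
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            lines.append(cleaned)
    return "\n".join(lines) + "\n"


text = corpus_text(["cómo te llamas", " cómo te va ", "", "cómo te va", "cómo estás"])
print(text, end="")
```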

This is especially helpful if the model is highly confident in one part of a phrase it hears but not the entire phrase. Let’s say the model thought it heard “cómo ay va”: it’s confident in the “cómo” part but not the rest. The corpus tells the model that “cómo te” and “te va” are known word sequences that would fit this transcription, while “cómo ay” and “ay va” are not. It’s possible the model heard “cómo ay va”, but the evidence leans toward “cómo te va”.
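
To make the “evidence leans toward” idea concrete, here is a toy sketch that scores two candidate transcriptions by how many of their adjacent word pairs appear in the corpus. The function names and the scoring rule are assumptions for illustration; a real recognizer weighs acoustic evidence together with far richer language model probabilities.

```python
from collections import Counter

CORPUS = ["cómo te llamas", "cómo te va", "cómo estás"]


def corpus_bigrams(phrases):
    """Count adjacent word pairs across all corpus phrases."""
    counts = Counter()
    for phrase in phrases:
        words = phrase.split()
        counts.update(zip(words, words[1:]))
    return counts


def corpus_support(candidate, bigram_counts):
    """Sum the corpus counts of each adjacent word pair in the candidate."""
    words = candidate.split()
    return sum(bigram_counts[pair] for pair in zip(words, words[1:]))


BIGRAMS = corpus_bigrams(CORPUS)
candidates = ["cómo ay va", "cómo te va"]
best = max(candidates, key=lambda c: corpus_support(c, BIGRAMS))

for candidate in candidates:
    print(candidate, "->", corpus_support(candidate, BIGRAMS))
print("best supported:", best)
```

Neither word pair in “cómo ay va” appears in the corpus, so it gets no support, while both pairs in “cómo te va” do.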

A language model corpus is great for helping a model understand words and phrases in context. It uses word patterns and sequences to produce a coherent phrase. In other words, it gives linguistic cues.

Figure 3 Words and phrases. Photo by Andreas Fickl on Unsplash

Understanding words without context

What if there is no context? How do you recognize a strange word?

In the medical insurance domain, “eligibility” is a common word (roughly meaning “does a patient currently have medical insurance?”). This word is often stated by itself. For instance:

· Watson: What do you need help with?

· User: eligibility

At six syllables, this word is a mouthful! It’s difficult to say and it’s difficult to recognize if you are unfamiliar with the word.

It is possible to use a language model corpus to show that “eligibility” can be a complete phrase. But if you’ve never heard that word before, there are several ways you might mis-hear “eligibility”:

· hello ability

· legibility

· knowledgeability

· allergy ability

· L E G B D T

These are all built from legitimate English words or letters, and some of them even make sense in a medical domain!

Figure 4 Words are constructed from letters and sounds. Photo by Clarissa Watson on Unsplash

If we treat these “mistakes” as clues, we can nudge our model in the right direction. These “mistaken” phrases sound like the word we want. Because the incorrect phrases sound so much like the correct word, we can tell the model to make a phonetic replacement.

This is the philosophy behind training custom words into a language model. We train a custom word for “eligibility” with a JSON body {"sounds_like": ["legibility", "knowledgeability", "allergy ability", "L E G B D T"]}. Custom words give phonetic cues.
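
A minimal sketch of building that JSON body in code follows. Only the `sounds_like` field and its values come from the text above; the helper name `custom_word_body` is hypothetical, and the request that would carry this body (endpoint, authentication) is deliberately left out rather than guessed.

```python
import json


def custom_word_body(sounds_like):
    """Serialize the mis-heard variants of a word into a JSON request body."""
    return json.dumps({"sounds_like": list(sounds_like)})


# Mis-hearings of "eligibility" collected from transcripts.
body = custom_word_body([
    "legibility",
    "knowledgeability",
    "allergy ability",
    "L E G B D T",
])
print(body)
```

Generating the body with `json.dumps` guarantees straight double quotes and proper escaping, which hand-typed JSON copied from a word processor often lacks.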

Put them together!

Teach your Speech to Text model how to recognize the language of your domain by first teaching it phrases in context (through a corpus) and then teaching it words (through custom words) without that context.

Figure 5 Use a blend of corpus and custom words to train a great model! Photo by Louis Reed on Unsplash

Use our Speech customization tools to train your model and assess its accuracy. This GitHub repository has scripts and methodologies you can use plus links to additional Speech to Text education.

Andrew R. Freed
IBM Watson Speech Services

Technical lead in IBM Watson. Author: Conversational AI (manning.com, 2021). All views are only my own.