Building custom language models to supercharge speech-to-text performance for Amazon Transcribe

Amazon Transcribe is a fully-managed automatic speech recognition service (ASR) that makes it easy to add speech-to-text capabilities to voice-enabled applications. As our service grows, so does the diversity of our customer base, which now spans domains such as insurance, finance, law, real estate, media, hospitality, and more. Naturally, customers in different market segments have asked Amazon Transcribe for more customization options to further enhance transcription performance.

We’re excited to introduce Custom Language Models (CLM). The new feature allows you to submit a corpus of text data to train custom language models that target domain-specific use cases. Using CLM is easy because it capitalizes on existing data that you already possess (such as marketing assets, website content, and training manuals).

In this post, we show you how to best use your available data to train a custom language model tailored for your speech-to-text use case. Although our walkthrough uses a transcription example from the video gaming industry, you can use CLM to enhance custom speech recognition for any domain of your choosing. This post assumes that you’re already familiar with how to use Amazon Transcribe, and focuses on demonstrating how to use the new CLM feature. Additional resources for using the service are available at the end.

Establishing context for evaluating CLM transcription performance

To evaluate how powerful CLM can be in terms of enhancing transcription accuracy, there are few steps we want to take. First we need to establish a baseline. To do that, we recorded an audio sample that contains speech content and lingo commonly found in video game chats. We also have a human-generated transcription to benchmark the ground truth. This ground truth transcript serves as the reference transcript we use to compare the general transcription output of Amazon Transcribe and the output of CLM. The following is a partial snippet of this reference transcript.

The 2020 holiday season is right around the corner. And with the way that the year’s been going, we can all hope for a little excitement around the next-gen video game consoles coming out soon. So, what’s the difference in hardware specs between the upcoming Playstation 5 and Xbox Series X? Well, let’s take a look under the hoods of each next-gen gaming console. The PS5 features an AMD Zen 2 CPU with up to 3.5 GHz frequency. It sports an AMD Radeon GPU that touts 10.3 teraflops, running up to 2.23 GHz. Memory and storage respectively dial in at 16 GB and 825 GB. The PS5 supports both PS The PS5 supports both 4K and 8K resolution screens. The Xbox Series X also features an AMD Zen 2 CPU, but clocks in at 3.8 GHz instead. The console boasts a similar AMD custom GPU with 12 teraflops and 1.825 GHz.

Memory is the same as that of the PS5’s, coming in at 16 GB. But the default storage is where the system has an edge, bringing out a massive 1 TB hard drive. Like the PS5, the Series X also supports 4K and 8K resolution screens as well. Those of course, are just the numbers. Therefore, it remains to be seen exactly how the performance plays out in practice. It’s worth noting that both systems have incorporated ray-tracing technology, something that’s used to make light and shadows look better in-game. Both systems also offer 3D audio output for immersive experiences.

Next, we want to run the sample audio through Amazon Transcribe using its generic speech engine and compare the text output to the ground truth transcript. We’ve produced a partial snippet of the side-by-side comparison and highlighted errors for visibility. Compared to the ground truth transcript, the default Amazon Transcribe transcript showed a Word Error Rate (WER) of 31.87%. This WER shouldn’t be interpreted as a full representation of the Amazon Transcribe service performance. It is just one instance for a very specific example audio. Note that accuracy is a measure of 100 minus WER. So the lower the WER, the higher the accuracy. For more information about calculating WER, see Word Error Rate on Wikipedia.

*Note that this WER is not a representation of the Amazon Transcribe service performance. It is just one instance for a very specific and limited test example. All figures are for single-case and limited scope illustration purposes.

The following text is the human-generated ground-truth reference transcript:

The 2020 holiday season is right around the corner. And with the way that the year’s been going, we can all hope for a little excitement around the next-gen video game consoles coming out soon. So, what’s the difference in hardware specs between the upcoming Playstation 5 and Xbox Series X? Well, let’s take a look under the hoods of each next-gen gaming console. The PS5 features an AMD Zen 2 CPU with up to 3.5 GHz frequency. It sports an AMD Radeon GPU that touts 10.3 teraflops, running up to 2.23 GHz. Memory and storage respectively dial in at 16 GB and 825 GB. The PS5 supports both PS The PS5 supports both 4K and 8K resolution screens. Meanwhile, the Xbox Series X also features an AMD Zen 2 CPU, but clocks in at 3.8 GHz instead. The console boasts a similar AMD custom GPU with 12 teraflops and 1.825 GHz. Memory is the same as that of the PS5’s, coming in at 16 GB. But the default storage is where the system has an edge, bringing out a massive 1 TB hard drive. Like the PS5, the Series X also supports 4K and 8K resolution screens as well. Those of course, are just the numbers. Therefore, it remains to be seen exactly how the performance plays out in practice. It’s worth noting that both systems have incorporated ray-tracing technology, something that’s used to make light and shadows look better in-game. Both systems also offer 3D audio output for immersive experiences.

The following text is the machine-generated transcript by the Amazon Transcribe generic speech engine:

the 2020 holiday season is right around the corner. And with the way that the years been going, we can all hope for a little excitement around the next Gen video game consoles coming out soon. So what’s the difference in heart respects between the upcoming PlayStation five and Xbox Series X? Well, let’s take a look under the hood of each of these Consul’s. The PS five features in a M descend to CPU with up to 3.5 gigahertz frequency is sports and AM the radio on GPU that tells 10.3 teraflops running up to 2.23 gigahertz memory and storage, respectively. Dahlin at 16 gigabytes in 825 gigabytes. The PS five supports both PS. The PS five supports both four K and A K resolutions. Meanwhile, the Xbox Series X also features an AM descend to CPU, but clocks in at three point take. It hurts instead, the cost. Almost a similar AMG custom GPU, with 12 teraflops and 1.825 gigahertz memory, is the same as out of the PS. Fives come in at 16 gigabytes, but the default storage is where the system has an edge. Bring out a massive one terabyte hard drive. Like the PS five. The Siri’s X also supports four K and eight K resolution screens as well. Those, of course, are just the numbers. Therefore, it remains to be seen exactly how the performance plays out. In practice. It’s worth noting that both systems have incorporated Ray tracing technology, something that’s used to make light and shadows look better in game. Both systems also offer three D audio output for immersive experiences.

Although the Amazon Transcribe generic speech model has done a decent job of transcribing the audio, it’s more likely that a tailored custom model can yield higher transcription accuracy.

Solution overview

In this post, we walk you through how to train a custom language model and evaluate how its transcription output compares to the reference transcript. You complete the following high-level steps:

Prepare your training data.
Train your CLM.
Use your CLM to transcribe audio.
Evaluate the results by comparing the CLM transcript against the generic model transcript’s accuracy.

Preparing your training data

Before we begin, it’s important to distinguish between training data and tuning data.

Training data for CLM typically includes text data that is specific to your domain. Some examples of training data could include relevant text content from your website, training manuals, sales and marketing collateral, or other text sources.

Meanwhile, human-annotated audio transcripts of actual phone calls or media content that are directly relevant to your use case can be used as tuning data. Ideally, both training and tuning data ought to be domain-specific, but in practice you may only have a small amount of audio transcriptions available. In that case, transcriptions should be used as tuning data. If more transcription data is available, it can and should be used as part of the training set as well. For more information about the difference between training and tuning data, see Improving Domain-Specific Transcription Accuracy with Custom Language Models.

Like many domains, video gaming has its own set of technical jargon, syntax, and speech dynamics that may make the use of a general speech recognition engine suboptimal. To build a custom model, we first need data that’s representative of the domain. For this use case, we want free form text from the video gaming industry. We use a variety of publicly available information from a variety of sources about video gaming. For convenience, we’ve compiled that training and tuning set, which you can download to follow along. Keep in mind that the nature, quality, and quantity of your training data has a dramatic impact on the resultant custom model you build. All else equal, it’s better to have more data than less.

As a general guideline, your training and tuning datasets should meet the following parameters:

Is in plain text (it’s not a file such as a Microsoft Word document, CSV file, or PDF).
Has a single sentence per line.
Is encoded in UTF-8.
Doesn’t contain any formatting characters, such as HTML tags.
Is less than 2 GB in size if you intend to use the file as training data. You can provide a maximum of 2 GB of training data.
Is less than 200 MB in size if you intend to use the file as tuning data. You can provide a maximum of 200 MB of optional tuning data.

The following test is a partial snippet of our example training set:

The PS5 will feature a custom eight-core AMD Zen 2 CPU clocked at 3.5GHz (variable frequency) and a custom GPU based on AMD’s RDNA 2 architecture hardware that promises 10.28 teraflops and 36 compute units clocked at 2.23GHz (also variable frequency).

It’ll also have 16GB of GDDR6 RAM and a custom 825GB SSD that Sony has previously promised will offer super-fast loading times in gameplay, via Eurogamer.

One of the biggest technical updates in the PS5 was already announced last year: a switch to SSD storage for the console’s main hard drive, which Sony says will result in dramatically faster load times.

A previous demo showed Spider-Man loading levels in less than a second on the PS5, compared to the roughly eight seconds it took on a PS4.PlayStation hardware lead Mark Cerny dove into some of the details about those SSD goals at the announcement.

Where it took a PS4 around 20 seconds to load a single gigabyte of data, the goal with the PS5’s SSD was to enable loading

five gigabytes of data in a single second.

Training your custom language model

In the following steps, we show you how to train a CLM using base training data and a tuning split. Using a tuning split is entirely optional. If you prefer not to train a CLM using a tuning split, you can skip step 5.

Upload your training data (and/or tuning data) to their respective Amazon Simple Storage Service (Amazon S3) buckets.
On the Amazon Transcribe console, for Name, enter a name for your custom model so you can reference it for later use.
For Base model, choose a model type that matches your use case, based on audio sample rate.
For Training data, enter the appropriate S3 bucket where you previously uploaded your training data.
For Tuning data, enter the appropriate S3 bucket where you uploaded your tuning data. (This step is optional)
For Access permissions, designate the appropriate access permissions.
Choose Train model.

Amazon Transcribe service does the heavy lifting of automatically training a custom model for you.

To track the progress of the model training, go to the Custom language models page on the Amazon Transcribe console. The status indicates if the training is in progress, complete, or failed.

Using your custom language model to transcribe audio

When your custom model training is complete, you’re ready to use it. Simply start a typical transcription job as you would using Amazon Transcribe. However, this time, we want to invoke the custom language model we just trained, as opposed to using the default speech engine. For this post, we assume you already have familiarity with how to run a typical transcription job, so we only call out the new component: for Model type, select Custom language model.

Evaluating the results

When your transcription job is complete, it’s time to see how well the CLM performed. We can evaluate the output transcript against the human-annotated reference transcript, just as we had compared the generic Amazon Transcribe machine output against the human-annotated reference transcript. We actually made two CLMs (one without tuning split and one with tuning split), which we showcase as a full summary in the following table.

Transcription Type	WER (%)	Accuracy (100-WER)
Amazon Transcribe generic model	31.34%	68.66%
Amazon Transcribe CLM with no tuning split	26.19%	73.81%
Amazon Transcribe CLM with tuning split	20.23%	79.77%

A lower WER is better. These WERs aren’t representative of overall Amazon Transcribe performance. All numbers are relative to demonstrate the point of using custom models over generic models, and are specific only to this singular audio sample.

The WER reductions are pretty significant! As you can see, although AmazonTranscribe’s generic engine performed decently in transcribing the sample audio from the video gaming domain, the CLM we built using training data performed 5% better. And the CLM built using training data and a tuning split performed approximately 11% better. These comparative results are unsurprising because the more relevant training and tuning that a model experiences, the more tailored it is to the specific domain and use case.

To give a qualitative visual comparison, we’ve taken a snippet of the transcript from each CLM’s output and put it against the original human annotated reference transcript to share a qualitative comparison of the different terms that were recognized by each model. We used green highlights to show the progressive accuracy improvements in each iteration.

The following text is the human-generated ground truth reference transcript:

The 2020 holiday season is right around the corner. And with the way that the year’s been going, we can all hope for a little excitement around the next-gen video game consoles coming out soon. So, what’s the difference in hardware specs between the upcoming Playstation 5 and Xbox Series X? Well, let’s take a look under the hoods of each next-gen gaming console. The PS5 features an AMD Zen 2 CPU with up to 3.5 GHz frequency. It sports an AMD Radeon GPU that touts 10.3 teraflops, running up to 2.23 GHz. Memory and storage respectively dial in at 16 GB and 825 GB. The PS5 supports both PS The PS5 supports both 4K and 8K resolution screens. Meanwhile, the Xbox Series X also features an AMD Zen 2 CPU, but clocks in at 3.8 GHz instead. The console boasts a similar AMD custom GPU with 12 teraflops and 1.825 GHz. Memory is the same as that of the PS5’s, coming in at 16 GB. But the default storage is where the system has an edge, bringing out a massive 1 TB hard drive. Like the PS5, the Series X also supports 4K and 8K resolution screens as well. Those of course, are just the numbers. Therefore, it remains to be seen exactly how the performance plays out in practice. It’s worth noting that both systems have incorporated ray-tracing technology, something that’s used to make light and shadows look better in-game. Both systems alsooffer 3D audio output for immersive experiences.

The following text is the machine transcription output by Amazon Transcribe’s generic speech engine:

the 2020 holiday season is right around the corner. And with the way that the years been going, we can all hope for a little excitement around the next Gen video game consoles coming out soon. So what’s the difference in heart respects between the upcoming PlayStation five and Xbox Series X? Well, let’s take a look under the hood of each of these Consul’s. The PS five features in a M descend to CPU with up to 3.5 gigahertz frequency is sports and AM the radio on GPU that tells 10.3 teraflops running up to 2.23 gigahertz memory and storage, respectively. Dahlin at 16 gigabytes in 825 gigabytes. The PS five supports both PS. The PS five supports both four K and A K resolutions. Meanwhile, the Xbox Series X also features an AM descend to CPU, but clocks in at three point take. It hurts instead, the cost. Almost a similar AMG custom GPU, with 12 teraflops and 1.825 gigahertz memory, is the same as out of the PS. Fives come in at 16 gigabytes, but the default storage is where the system has an edge. Bring out a massive one terabyte hard drive. Like the PS five. The Siri’s X also supports four K and eight K resolution screens as well. Those, of course, are just the numbers. Therefore, it remains to be seen exactly how the performance plays out. In practice. It’s worth noting that both systems have incorporated Ray tracing technology, something that’s used to make light and shadows look better in game. Both systems also offer three D audio output for immersive experiences.

The following text is the machine transcription output by CLM (base training, no tuning split):

the 2020 holiday season is right around the corner. And with the way that the years been going, we can all hope for a little excitement around the next Gen videogame consoles coming out soon. So what’s the difference in hardware specs between the upcoming PlayStation five and Xbox Series X? Well, let’s take a look under the hood of each of these consoles. The PS five features an A M D Zen two CPU with up to 3.5 gigahertz frequency it sports and am the radio GPU. That tells 10.3 teraflops running up to 2.23 gigahertz memory and storage, respectively, Dahlin at 16 gigabytes and 825 gigabytes, The PS five supports both PS. The PS five supports both four K and eight K resolutions. Meanwhile, the Xbox Series X also features an A M D Zen two CPU, but clocks in at 3.8 hertz. Instead. The console boasts a similar a AMD custom GPU, with 12 teraflops and 1.825 hertz. Memory is the same as out of the PS five’s come in at 16 gigabytes, but the default storage is where the system has an edge. Bring out a massive one terabyte hard drive. Like the PS five, the Series X also supports four K and eight K resolution screens as well. Those, of course, are just the numbers. Therefore, it remains to be seen exactly how the performance plays out. In practice. It’s worth noting that both systems have incorporated Ray tracing technology, something that’s used to make light and shadows look better in game. Both systems also offer three D audio output for immersive experiences

The following text is the machine transcription output by CLM (base training, with tuning split):

the 2020 holiday season is right around the corner. And with the way that the year’s been going, we can all hope for a little excitement around the next Gen video game consoles coming out soon. So what’s the difference in hardware specs between the upcoming PlayStation five and Xbox Series X? Well, let’s take a look under the hoods of each of these consoles. The PS five features an AMG Zen two CPU with up to 3.5 givers frequency it sports an AMD Radeon GPU that touts 10.3 teraflops running up to 2.23 gigahertz memory and storage, respectively. Dial in at 16 gigabytes and 825 gigabytes. The PS five supports both PS. The PS five supports both four K and eight K resolutions. Meanwhile, the Xbox Series X also features an AMG Zen two CPU, but clocks in at 3.8 had hurts. Instead, the console boasts a similar AMD custom. GPU, with 12 teraflops and 1.825 gigahertz memory is the same as out of the PS five’s coming in at 16 gigabytes. But the default storage is where the system has an edge, bringing out a massive one terabyte hard drive. Like the PS five, the Series X also supports four K and eight K resolution screens as well. Those, of course, are just the numbers. Therefore, it remains to be seen exactly how the performance plays out. In practice. It’s worth noting that both systems have incorporated Ray tracing technology, something that’s used to make light and shadows look better in game. Both systems also offer three D audio output for immersive experiences.

The difference in improvement varies according to your use case and the quality of your training data and tuning set. Experimentation is encouraged. In general, the more training and tuning that your custom model undergoes, the better the performance. CLM doesn’t guarantee 100% accuracy, but it can offer significant performance improvements over generic speech recognition models.

Best practices

It’s important to note that the resultant custom language model depends directly on what you use as your training dataset. All else equal, the closer the representation of your training data to real use cases, the more performant your custom model is. Moreover, more data is always preferred. For more information about the general guidelines, see Improving Domain-Specific Transcription Accuracy with Custom Language Models.

The Amazon Transcribe CLM doesn’t charge for model training, so feel free to experiment. In a single AWS account, you can train up to 10 custom models to address different domains, use cases, or new training datasets. After you have your CLM, you can choose which transcription jobs to utilize your CLM. You only incur an additional CLM charge for the transcription jobs in which you apply a custom language model.

Conclusion

CLMs can be a powerful capability when it comes to improving transcription accuracy for domain-specific use cases. The new feature is available in all AWS Regions where Amazon Transcribe already operates. At the time of this writing, the feature only supports US English. Additional language support will come with time. Start training your own custom models by visiting Amazon Transcribe and checking out Improving Domain-Specific Transcription Accuracy with Custom Language Models.

Related Resources

For additional resources, see the following:

About the Authors

Paul Zhao is Lead Product Manager at AWS Machine Learning. He manages speech recognition services like Amazon Transcribe and Amazon Transcribe Medical. He was formerly a serial entrepreneur, having launched, operated, and exited two successful businesses in the areas of IoT and FinTech, respectively.

Vivek Govindan is Senior Software Development engineer at AWS Machine Learning. Outside of work, Vivek is an ardent soccer fan.

Vedere AI