Improve transcription accuracy of customer-agent calls with custom vocabulary in Amazon Transcribe

Many AWS customers have been successfully using Amazon Transcribe to accurately, efficiently, and automatically convert their customer audio conversations to text, and extract actionable insights from them. These insights can help you continuously enhance the processes and products that directly improve the quality and experience for your customers.

In many countries, such as India, English is not the primary language of communication. Indian customer conversations contain regional languages like Hindi, with English words and phrases spoken randomly throughout the calls. In the source media files, there can be proper nouns, domain-specific acronyms, words, or phrases that the default Amazon Transcribe model isn’t aware of. Transcriptions for such media files can have inaccurate spellings for those words.

In this post, we demonstrate how you can provide more information to Amazon Transcribe with custom vocabularies to update the way Amazon Transcribe handles transcription of your audio files with business-specific terminology. We show the steps to improve the accuracy of transcriptions for Hinglish calls (Indian Hindi calls containing Indian English words and phrases). You can use the same process to transcribe audio calls with any language supported by Amazon Transcribe. After you create custom vocabularies, you can transcribe audio calls with accuracy and at scale by using our post call analytics solution, which we discuss more later in this post.

Solution overview

We use the following Indian Hindi audio call (SampleAudio.wav) with random English words to demonstrate the process.

We then walk you through the following high-level steps:

  1. Transcribe the audio file using the default Amazon Transcribe Hindi model.
  2. Measure model accuracy.
  3. Train the model with custom vocabulary.
  4. Measure the accuracy of the trained model.

Prerequisites

Before we get started, we need to confirm that the input audio file meets the Amazon Transcribe data input requirements.

A monophonic recording, also referred to as mono, contains one audio signal, in which all the audio elements of the agent and the customer are combined into one channel. A stereophonic recording, also referred to as stereo, contains two audio signals to capture the audio elements of the agent and the customer in two separate channels. Each agent-customer recording file contains two audio channels, one for the agent and one for the customer.

Low-fidelity audio recordings, such as telephone recordings, typically use 8,000 Hz sample rates. Amazon Transcribe supports processing mono recordings as well as high-fidelity audio files with sample rates between 16,000 and 48,000 Hz.

For improved transcription results and to clearly distinguish the words spoken by the agent and the customer, we recommend using audio files that are recorded at an 8,000 Hz sample rate and are stereo channel separated.

You can use a tool like ffmpeg to validate your input audio files from the command line:

ffmpeg -i SampleAudio.wav

In the returned response, check the line starting with Stream in the Input section, and confirm that the audio files are 8,000 Hz and stereo channel separated:

Input #0, wav, from 'SampleAudio.wav':
Duration: 00:01:06.36, bitrate: 256 kb/s
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 8000 Hz, stereo, s16, 256 kb/s

When you build a pipeline to process a large number of audio files, you can automate this step to filter files that don’t meet the requirements.
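
The automated check could look like the following minimal sketch. It assumes the inputs are uncompressed PCM WAV files, which Python's standard wave module can read; other formats would still need a tool such as ffmpeg. The function names are hypothetical and not part of any AWS tooling.

```python
import wave

REQUIRED_SAMPLE_RATE_HZ = 8000  # sample rate recommended in this post
REQUIRED_CHANNELS = 2           # stereo: one channel each for agent and customer


def meets_requirements(path):
    """Return True if the WAV file at `path` is 8,000 Hz and stereo."""
    with wave.open(path, "rb") as wav_file:
        return (
            wav_file.getframerate() == REQUIRED_SAMPLE_RATE_HZ
            and wav_file.getnchannels() == REQUIRED_CHANNELS
        )


def filter_valid_files(paths):
    """Split paths into (valid, rejected); unreadable files are rejected."""
    valid, rejected = [], []
    for path in paths:
        try:
            ok = meets_requirements(path)
        except (wave.Error, OSError, EOFError):
            ok = False  # missing file, or not a PCM WAV file
        (valid if ok else rejected).append(path)
    return valid, rejected
```

Calling filter_valid_files(["SampleAudio.wav"]) returns the file in the first list only when it is 8,000 Hz stereo, so a pipeline can route the second list to a conversion or review step.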

As an additional prerequisite step, create an Amazon Simple Storage Service (Amazon S3) bucket to host the audio files to be transcribed. For instructions, refer to Create your first S3 bucket. Then upload the audio file to the S3 bucket.

Transcribe the audio file with the default model

Now we can start an Amazon Transcribe call analytics job using the audio file we uploaded. In this example, we use the AWS Management Console to transcribe the audio file. You can also use the AWS Command Line Interface (AWS CLI) or AWS SDK.

  1. On the Amazon Transcribe console, choose Call analytics in the navigation pane.
  2. Choose Call analytics jobs.
  3. Choose Create job.
  4. For Name, enter a name.
  5. For Language settings, select Specific language.
  6. For Language, choose Hindi, IN (hi-IN).
  7. For Model type, select General model.
  8. For Input file location on S3, browse to the S3 bucket containing the uploaded audio file.
  9. In the Output data section, leave the defaults.
  10. In the Access permissions section, select Create an IAM role.
  11. Create a new AWS Identity and Access Management (IAM) role named HindiTranscription that provides Amazon Transcribe service permissions to read the audio files from the S3 bucket and use the AWS Key Management Service (AWS KMS) key to decrypt.
  12. In the Configure job section, leave the defaults, including Custom vocabulary deselected.
  13. Choose Create job to transcribe the audio file.
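
For readers who prefer the AWS SDK, the console steps above roughly correspond to the following sketch using the AWS SDK for Python (Boto3) and its start_call_analytics_job call. Several details are assumptions rather than something this post confirms: that a single entry in LanguageOptions selects a specific language, that the agent is on channel 0, and the bucket name and role ARN, which are placeholders. Check the API reference before relying on it.

```python
def build_call_analytics_request(job_name, media_uri, role_arn):
    """Build keyword arguments for the StartCallAnalyticsJob API."""
    return {
        "CallAnalyticsJobName": job_name,
        "Media": {"MediaFileUri": media_uri},
        "DataAccessRoleArn": role_arn,
        # Assumption: one language code here requests that specific language.
        "Settings": {"LanguageOptions": ["hi-IN"]},
        # Assumption: agent audio is on channel 0, customer audio on channel 1.
        "ChannelDefinitions": [
            {"ChannelId": 0, "ParticipantRole": "AGENT"},
            {"ChannelId": 1, "ParticipantRole": "CUSTOMER"},
        ],
    }


def start_job(request):
    """Submit the request; needs AWS credentials and the boto3 package."""
    import boto3  # imported lazily so building a request needs no AWS setup

    client = boto3.client("transcribe")
    return client.start_call_analytics_job(**request)


# Placeholder values for illustration only.
request = build_call_analytics_request(
    job_name="SampleAudio",
    media_uri="s3://example-bucket/SampleAudio.wav",
    role_arn="arn:aws:iam::123456789012:role/HindiTranscription",
)
```

Keeping the request builder separate from the submit call makes it easy to review or log the parameters before any job is started.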

When the status of the job is Complete, you can review the transcription by choosing the job (SampleAudio).

The customer and the agent sentences are clearly separated out, which helps us identify whether the customer or the agent spoke any specific words or phrases.

Measure model accuracy

Word error rate (WER) is the recommended and most commonly used metric for evaluating the accuracy of Automatic Speech Recognition (ASR) systems. The goal is to reduce the WER as much as possible to improve the accuracy of the ASR system.

To calculate WER, complete the following steps. This post uses the open-source asr-evaluation tool to calculate WER, but other tools such as SCTK or JiWER are also available.

  1. Install the asr-evaluation tool, which makes the wer script available on your command line.
    Use a command line on macOS or Linux platforms to run the wer commands shown later in the post.
  2. Copy the transcript from the Amazon Transcribe job details page to a text file named hypothesis.txt.
    When you copy the transcription from the console, you’ll notice a new line character between the words Agent :, Customer :, and the Hindi script.
    The new line characters have been removed to save space in this post. If you choose to use the text as is from the console, make sure that the reference text file you create also has the new line characters, because the wer tool compares line by line.
  3. Review the entire transcript and identify any words or phrases that need to be corrected:
    Customer : हेलो,
    Agent : गुड मोर्निग इंडिया ट्रेवल एजेंसी सेम है। लावन्या बात कर रही हूँ किस तरह से मैं आपकी सहायता कर सकती हूँ।
    Customer : मैं बहुत दिनों उनसे हैदराबाद ट्रेवल के बारे में सोच रहा था। क्या आप मुझे कुछ अच्छे लोकेशन के बारे में बता सकती हैं?
    Agent :हाँ बिल्कुल। हैदराबाद में बहुत सारे प्लेस है। उनमें से चार महीना गोलकुंडा फोर सलार जंग म्यूजियम और बिरला प्लेनेटोरियम मशहूर है।
    Customer : हाँ बढिया थैंक यू मैं अगले सैटरडे और संडे को ट्राई करूँगा।
    Agent : एक सजेशन वीकेंड में ट्रैफिक ज्यादा रहने के चांसेज है।
    Customer : सिरियसली एनी टिप्स चिकन शेर
    Agent : आप टेक्सी यूस कर लो ड्रैब और पार्किंग का प्राब्लम नहीं होगा।
    Customer : ग्रेट आइडिया थैंक्यू सो मच।
    The highlighted words are the ones that the default Amazon Transcribe model didn’t render correctly.
  4. Create another text file named reference.txt, replacing the highlighted words with the desired words you expect to see in the transcription:
    Customer : हेलो,
    Agent : गुड मोर्निग सौथ इंडिया ट्रेवल एजेंसी से मैं । लावन्या बात कर रही हूँ किस तरह से मैं आपकी सहायता कर सकती हूँ।
    Customer : मैं बहुत दिनोंसे हैदराबाद ट्रेवल के बारे में सोच रहा था। क्या आप मुझे कुछ अच्छे लोकेशन के बारे में बता सकती हैं?
    Agent : हाँ बिल्कुल। हैदराबाद में बहुत सारे प्लेस है। उनमें से चार मिनार गोलकोंडा फोर्ट सालार जंग म्यूजियम और बिरला प्लेनेटोरियम मशहूर है।
    Customer : हाँ बढिया थैंक यू मैं अगले सैटरडे और संडे को ट्राई करूँगा।
    Agent : एक सजेशन वीकेंड में ट्रैफिक ज्यादा रहने के चांसेज है।
    Customer : सिरियसली एनी टिप्स यू केन शेर
    Agent : आप टेक्सी यूस कर लो ड्रैव और पार्किंग का प्राब्लम नहीं होगा।
    Customer : ग्रेट आइडिया थैंक्यू सो मच।
  5. Use the following command to compare the reference and hypothesis text files that you created:
    wer -i reference.txt hypothesis.txt

    You get the following output:

    REF: customer : हेलो,
    
    HYP: customer : हेलो,
    
    SENTENCE 1
    
    Correct = 100.0% 3 ( 3)
    
    Errors = 0.0% 0 ( 3)
    
    REF: agent : गुड मोर्निग सौथ इंडिया ट्रेवल एजेंसी से मैं । लावन्या बात कर रही हूँ किस तरह से मैं आपकी सहायता कर सकती हूँ।
    
    HYP: agent : गुड मोर्निग *** इंडिया ट्रेवल एजेंसी ** सेम है। लावन्या बात कर रही हूँ किस तरह से मैं आपकी सहायता कर सकती हूँ।
    
    SENTENCE 2
    
    Correct = 84.0% 21 ( 25)
    
    Errors = 16.0% 4 ( 25)
    
    REF: customer : मैं बहुत ***** दिनोंसे हैदराबाद ट्रेवल के बारे में सोच रहा था। क्या आप मुझे कुछ अच्छे लोकेशन के बारे में बता सकती हैं?
    
    HYP: customer : मैं बहुत दिनों उनसे हैदराबाद ट्रेवल के बारे में सोच रहा था। क्या आप मुझे कुछ अच्छे लोकेशन के बारे में बता सकती हैं?
    
    SENTENCE 3
    
    Correct = 96.0% 24 ( 25)
    
    Errors = 8.0% 2 ( 25)
    
    REF: agent : हाँ बिल्कुल। हैदराबाद में बहुत सारे प्लेस है। उनमें से चार मिनार गोलकोंडा फोर्ट सालार जंग म्यूजियम और बिरला प्लेनेटोरियम मशहूर है।
    
    HYP: agent : हाँ बिल्कुल। हैदराबाद में बहुत सारे प्लेस है। उनमें से चार महीना गोलकुंडा फोर सलार जंग म्यूजियम और बिरला प्लेनेटोरियम मशहूर है।
    
    SENTENCE 4
    
    Correct = 83.3% 20 ( 24)
    
    Errors = 16.7% 4 ( 24)
    
    REF: customer : हाँ बढिया थैंक यू मैं अगले सैटरडे और संडे को ट्राई करूँगा।
    
    HYP: customer : हाँ बढिया थैंक यू मैं अगले सैटरडे और संडे को ट्राई करूँगा।
    
    SENTENCE 5
    
    Correct = 100.0% 14 ( 14)
    
    Errors = 0.0% 0 ( 14)
    
    REF: agent : एक सजेशन वीकेंड में ट्रैफिक ज्यादा रहने के चांसेज है।
    
    HYP: agent : एक सजेशन वीकेंड में ट्रैफिक ज्यादा रहने के चांसेज है।
    
    SENTENCE 6
    
    Correct = 100.0% 12 ( 12)
    
    Errors = 0.0% 0 ( 12)
    
    REF: customer : सिरियसली एनी टिप्स यू केन शेर
    
    HYP: customer : सिरियसली एनी टिप्स ** चिकन शेर
    
    SENTENCE 7
    
    Correct = 75.0% 6 ( 8)
    
    Errors = 25.0% 2 ( 8)
    
    REF: agent : आप टेक्सी यूस कर लो ड्रैव और पार्किंग का प्राब्लम नहीं होगा।
    
    HYP: agent : आप टेक्सी यूस कर लो ड्रैब और पार्किंग का प्राब्लम नहीं होगा।
    
    SENTENCE 8
    
    Correct = 92.9% 13 ( 14)
    
    Errors = 7.1% 1 ( 14)
    
    REF: customer : ग्रेट आइडिया थैंक्यू सो मच।
    
    HYP: customer : ग्रेट आइडिया थैंक्यू सो मच।
    
    SENTENCE 9
    
    Correct = 100.0% 7 ( 7)
    
    Errors = 0.0% 0 ( 7)
    
    Sentence count: 9
    
    WER: 9.848% ( 13 / 132)
    
    WRR: 90.909% ( 120 / 132)
    
    SER: 55.556% ( 5 / 9)

The wer command compares text from the files reference.txt and hypothesis.txt. It reports errors for each sentence and also the total number of errors (WER: 9.848% ( 13 / 132)) in the entire transcript.

From the preceding output, wer reported 13 errors out of 132 words in the transcript. These errors can be of three types:

  • Substitution errors – These occur when Amazon Transcribe writes one word in place of another. For example, in our transcript, the word “महीना (Mahina)” was written instead of “मिनार (Minar)” in sentence 4.
  • Deletion errors – These occur when Amazon Transcribe misses a word entirely in the transcript. In our transcript, the word “सौथ (South)” was missed in sentence 2.
  • Insertion errors – These occur when Amazon Transcribe inserts a word that wasn’t spoken. In our transcript, wer counted one insertion in sentence 3, where “दिनोंसे” was transcribed as the two words “दिनों उनसे”.
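
To make the three error types concrete, the following is a minimal sketch of how a word-level edit distance yields substitution, deletion, and insertion counts. It is not the asr-evaluation implementation; it assumes words are separated by whitespace and uses a standard dynamic-programming alignment. The example strings are sentence 7 from the output above.

```python
def word_errors(reference, hypothesis):
    """Return (substitutions, deletions, insertions) for two strings."""
    ref, hyp = reference.split(), hypothesis.split()
    # table[i][j] = (total, subs, dels, ins) for ref[:i] against hyp[:j]
    table = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, 0, 0, j)  # every hypothesis word inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                table[i][j] = table[i - 1][j - 1]  # match costs nothing
                continue
            t, s, d, n = table[i - 1][j - 1]
            substitution = (t + 1, s + 1, d, n)
            t, s, d, n = table[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = table[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            table[i][j] = min(substitution, deletion, insertion)
    _, subs, dels, ins = table[-1][-1]
    return subs, dels, ins


def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference words."""
    return sum(word_errors(reference, hypothesis)) / len(reference.split())


ref = "customer : सिरियसली एनी टिप्स यू केन शेर"
hyp = "customer : सिरियसली एनी टिप्स चिकन शेर"
print(word_errors(ref, hyp))      # (1, 1, 0)
print(word_error_rate(ref, hyp))  # 0.25
```

The result agrees with the tool's report for sentence 7: two errors out of eight reference words, or 25%.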

Observations from the transcript created by the default model

We can make the following observations based on the transcript:

  • The total WER is 9.848%. The wer output also reports a word recognition rate (WRR) of 90.909%, meaning 120 of the 132 reference words are transcribed accurately.
  • The default Hindi model transcribed most of the English words accurately. This is because the default model is trained to recognize the most common English words out of the box. The model is also trained to recognize Hinglish language, where English words randomly appear in Hindi conversations. For example:
    • गुड मोर्निग – Good morning (sentence 2).
    • ट्रेवल एजेंसी – Travel agency (sentence 2).
    • ग्रेट आइडिया थैंक्यू सो मच – Great idea thank you so much (sentence 9).
  • Sentence 4 has the most errors, which are the names of places in the Indian city Hyderabad:
    • हाँ बिल्कुल। हैदराबाद में बहुत सारे प्लेस है। उनमें से चार महीना गोलकुंडा फोर सलार जंग म्यूजियम और बिरला प्लेनेटोरियम मशहूर है।

In the next step, we demonstrate how to correct the highlighted words in the preceding sentence using custom vocabulary in Amazon Transcribe:

  • चार महीना (Char Mahina) should be चार मिनार (Char Minar)
  • गोलकुंडा फोर (Golcunda Four) should be गोलकोंडा फोर्ट (Golconda Fort)
  • सलार जंग (Salar Jung) should be सालार जंग (Saalar Jung)

Train the default model with a custom vocabulary

To create a custom vocabulary, you need to build a text file in a tabular format with the words and phrases to train the default Amazon Transcribe model. Your table must contain all four columns (Phrase, SoundsLike, IPA, and DisplayAs), but the Phrase column is the only one that must contain an entry on each row. You can leave the other columns empty. Each column must be separated by a tab character, even if some columns are left empty. For example, if you leave the IPA and SoundsLike columns empty for a row, the Phrase and DisplayAs columns in that row must be separated with three tab characters (between Phrase and IPA, IPA and SoundsLike, and SoundsLike and DisplayAs).
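
Because tab characters are easy to lose in a text editor, it can help to generate the table programmatically. The following is a minimal sketch under the rules stated above; the helper names are hypothetical, and the entries are the ones used in this post's HindiCustomVocabulary.txt example.

```python
HEADER = ["Phrase", "IPA", "SoundsLike", "DisplayAs"]


def vocabulary_row(phrase, ipa="", sounds_like="", display_as=""):
    """Return one tab-separated row, enforcing the table rules."""
    if not phrase:
        raise ValueError("Phrase is required on every row")
    if " " in phrase or " " in sounds_like:
        raise ValueError("use hyphens, not spaces, in Phrase and SoundsLike")
    if ipa and sounds_like:
        raise ValueError("fill in either IPA or SoundsLike, not both")
    # Joining four fields always yields three tabs, even for empty fields.
    return "\t".join([phrase, ipa, sounds_like, display_as])


def build_vocabulary_table(entries):
    """Return the table text: a header line plus one line per entry."""
    lines = ["\t".join(HEADER)]
    lines += [vocabulary_row(**entry) for entry in entries]
    return "\n".join(lines) + "\n"


ENTRIES = [
    {"phrase": "गोलकुंडा-फोर", "display_as": "गोलकोंडा फोर्ट"},
    {"phrase": "सालार-जंग", "sounds_like": "सा-लार-जंग", "display_as": "सालार जंग"},
    {"phrase": "चार-महीना", "display_as": "चार मिनार"},
]

table = build_vocabulary_table(ENTRIES)
```

The resulting text can be written to a file with open("HindiCustomVocabulary.txt", "w", encoding="utf-8") so that the Devanagari characters are preserved.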

To train the model with a custom vocabulary, complete the following steps:

  1. Create a file named HindiCustomVocabulary.txt with the following content.
    Phrase	IPA	SoundsLike	DisplayAs
    गोलकुंडा-फोर			गोलकोंडा फोर्ट
    सालार-जंग		सा-लार-जंग	सालार जंग
    चार-महीना			चार मिनार

    You can only use characters that are supported for your language. Refer to your language’s character set for details.

    The columns contain the following information:

    1. Phrase – Contains the words or phrases that you want to transcribe accurately. The highlighted words or phrases in the transcript created by the default Amazon Transcribe model appear in this column. These words are generally acronyms, proper nouns, or domain-specific words and phrases that the default model isn’t aware of. This is a mandatory field for every row in the custom vocabulary table. In our transcript, to correct “गोलकुंडा फोर (Golcunda Four)” from sentence 4, use “गोलकुंडा-फोर (Golcunda-Four)” in this column. If your entry contains multiple words, separate each word with a hyphen (-); do not use spaces.
    2. IPA – Contains the words or phrases representing speech sounds in the written form. The column is optional; you can leave its rows empty. This column is intended for phonetic spellings using only characters in the International Phonetic Alphabet (IPA). Refer to Hindi character set for the allowed IPA characters for the Hindi language. In our example, we’re not using IPA. If you have an entry in this column, your SoundsLike column must be empty.
    3. SoundsLike – Contains words or phrases broken down into smaller pieces (typically based on syllables or common words) to provide a pronunciation for each piece based on how that piece sounds. This column is optional; you can leave the rows empty. Only add content to this column if your entry includes a non-standard word, such as a brand name, or to correct a word that is being incorrectly transcribed. In our transcript, to correct “सलार जंग (Salar Jung)” from sentence 4, use “सा-लार-जंग (Saa-lar-jung)” in this column. Do not use spaces in this column. If you have an entry in this column, your IPA column must be empty.
    4. DisplayAs – Contains words or phrases with the spellings you want to see in the transcription output for the words or phrases in the Phrase field. This column is optional; you can leave the rows empty. If you don’t specify this field, Amazon Transcribe uses the contents of the Phrase field in the output file. For example, in our transcript, to correct “गोलकुंडा फोर (Golcunda Four)” from sentence 4, use “गोलकोंडा फोर्ट (Golconda Fort)” in this column.
  2. Upload the text file (HindiCustomVocabulary.txt) to an S3 bucket.
    Now we create a custom vocabulary in Amazon Transcribe.
  3. On the Amazon Transcribe console, choose Custom vocabulary in the navigation pane.
  4. For Name, enter a name.
  5. For Language, choose Hindi, IN (hi-IN).
  6. For Vocabulary input source, select S3 location.
  7. For Vocabulary file location on S3, enter the S3 path of the HindiCustomVocabulary.txt file.
  8. Choose Create vocabulary.
  9. Transcribe the SampleAudio.wav file with the custom vocabulary, with the following parameters:
    1. For Job name, enter SampleAudioCustomVocabulary.
    2. For Language, choose Hindi, IN (hi-IN).
    3. For Input file location on S3, browse to the location of SampleAudio.wav.
    4. For IAM role, select Use an existing IAM role and choose the role you created earlier.
    5. In the Configure job section, select Custom vocabulary and choose the custom vocabulary HindiCustomVocabulary.
  10. Choose Create job.
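
The vocabulary creation steps above can also be scripted. The following sketch uses the Boto3 Transcribe client's create_vocabulary and get_vocabulary calls and waits until the vocabulary reaches the READY state, because a vocabulary that is still PENDING can't be used in a job. The helper names, polling interval, and S3 path are assumptions for illustration.

```python
import time


def build_create_vocabulary_request(name, vocabulary_file_uri, language_code="hi-IN"):
    """Build keyword arguments for the CreateVocabulary API."""
    return {
        "VocabularyName": name,
        "LanguageCode": language_code,
        "VocabularyFileUri": vocabulary_file_uri,
    }


def create_vocabulary_and_wait(client, request, poll_seconds=10, max_polls=60):
    """Create the vocabulary, then poll until it is READY or FAILED.

    `client` is expected to be boto3.client("transcribe").
    Returns the last state seen ("READY", "FAILED", or "PENDING").
    """
    client.create_vocabulary(**request)
    state = "PENDING"
    for _ in range(max_polls):
        response = client.get_vocabulary(VocabularyName=request["VocabularyName"])
        state = response["VocabularyState"]
        if state in ("READY", "FAILED"):
            break
        time.sleep(poll_seconds)
    return state


# Placeholder S3 path for illustration only.
request = build_create_vocabulary_request(
    "HindiCustomVocabulary",
    "s3://example-bucket/HindiCustomVocabulary.txt",
)
```

If the returned state is FAILED, the get_vocabulary response includes a FailureReason field that usually points to the offending row or character in the table.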

Measure model accuracy after using custom vocabulary

Copy the transcript from the Amazon Transcribe job details page to a text file named hypothesis-custom-vocabulary.txt:

Customer : हेलो,

Agent : गुड मोर्निग इंडिया ट्रेवल एजेंसी सेम है। लावन्या बात कर रही हूँ किस तरह से मैं आपकी सहायता कर सकती हूँ।

Customer : मैं बहुत दिनों उनसे हैदराबाद ट्रेवल के बारे में सोच रहा था। क्या आप मुझे कुछ अच्छे लोकेशन के बारे में बता सकती हैं?

Agent : हाँ बिल्कुल। हैदराबाद में बहुत सारे प्लेस है। उनमें से चार मिनार गोलकोंडा फोर्ट सालार जंग म्यूजियम और बिरला प्लेनेटोरियम मशहूर है।

Customer : हाँ बढिया थैंक यू मैं अगले सैटरडे और संडे को ट्राई करूँगा।

Agent : एक सजेशन वीकेंड में ट्रैफिक ज्यादा रहने के चांसेज है।

Customer : सिरियसली एनी टिप्स चिकन शेर

Agent : आप टेक्सी यूस कर लो ड्रैव और पार्किंग का प्राब्लम नहीं होगा।

Customer : ग्रेट आइडिया थैंक्यू सो मच।

Note that the highlighted words are transcribed as desired.

Run the wer command again with the new transcript:

wer -i reference.txt hypothesis-custom-vocabulary.txt

You get the following output:

REF: customer : हेलो,

HYP: customer : हेलो,

SENTENCE 1

Correct = 100.0% 3 ( 3)

Errors = 0.0% 0 ( 3)

REF: agent : गुड मोर्निग सौथ इंडिया ट्रेवल एजेंसी से मैं । लावन्या बात कर रही हूँ किस तरह से मैं आपकी सहायता कर सकती हूँ।

HYP: agent : गुड मोर्निग *** इंडिया ट्रेवल एजेंसी ** सेम है। लावन्या बात कर रही हूँ किस तरह से मैं आपकी सहायता कर सकती हूँ।

SENTENCE 2

Correct = 84.0% 21 ( 25)

Errors = 16.0% 4 ( 25)

REF: customer : मैं बहुत ***** दिनोंसे हैदराबाद ट्रेवल के बारे में सोच रहा था। क्या आप मुझे कुछ अच्छे लोकेशन के बारे में बता सकती हैं?

HYP: customer : मैं बहुत दिनों उनसे हैदराबाद ट्रेवल के बारे में सोच रहा था। क्या आप मुझे कुछ अच्छे लोकेशन के बारे में बता सकती हैं?

SENTENCE 3

Correct = 96.0% 24 ( 25)

Errors = 8.0% 2 ( 25)

REF: agent : हाँ बिल्कुल। हैदराबाद में बहुत सारे प्लेस है। उनमें से चार मिनार गोलकोंडा फोर्ट सालार जंग म्यूजियम और बिरला प्लेनेटोरियम मशहूर है।

HYP: agent : हाँ बिल्कुल। हैदराबाद में बहुत सारे प्लेस है। उनमें से चार मिनार गोलकोंडा फोर्ट सालार जंग म्यूजियम और बिरला प्लेनेटोरियम मशहूर है।

SENTENCE 4

Correct = 100.0% 24 ( 24)

Errors = 0.0% 0 ( 24)

REF: customer : हाँ बढिया थैंक यू मैं अगले सैटरडे और संडे को ट्राई करूँगा।

HYP: customer : हाँ बढिया थैंक यू मैं अगले सैटरडे और संडे को ट्राई करूँगा।

SENTENCE 5

Correct = 100.0% 14 ( 14)

Errors = 0.0% 0 ( 14)

REF: agent : एक सजेशन वीकेंड में ट्रैफिक ज्यादा रहने के चांसेज है।

HYP: agent : एक सजेशन वीकेंड में ट्रैफिक ज्यादा रहने के चांसेज है।

SENTENCE 6

Correct = 100.0% 12 ( 12)

Errors = 0.0% 0 ( 12)

REF: customer : सिरियसली एनी टिप्स यू केन शेर

HYP: customer : सिरियसली एनी टिप्स ** चिकन शेर

SENTENCE 7

Correct = 75.0% 6 ( 8)

Errors = 25.0% 2 ( 8)

REF: agent : आप टेक्सी यूस कर लो ड्रैव और पार्किंग का प्राब्लम नहीं होगा।

HYP: agent : आप टेक्सी यूस कर लो ड्रैव और पार्किंग का प्राब्लम नहीं होगा।

SENTENCE 8

Correct = 100.0% 14 ( 14)

Errors = 0.0% 0 ( 14)

REF: customer : ग्रेट आइडिया थैंक्यू सो मच।

HYP: customer : ग्रेट आइडिया थैंक्यू सो मच।

SENTENCE 9

Correct = 100.0% 7 ( 7)

Errors = 0.0% 0 ( 7)

Sentence count: 9

WER: 6.061% ( 8 / 132)

WRR: 94.697% ( 125 / 132)

SER: 33.333% ( 3 / 9)

Observations from the transcript created with custom vocabulary

The total WER is 6.061%. The wer output also reports a WRR of 94.697%, meaning 125 of the 132 reference words are transcribed accurately.

Let’s compare the wer output for sentence 4 with and without custom vocabulary. The following is without custom vocabulary:

REF: agent : हाँ बिल्कुल। हैदराबाद में बहुत सारे प्लेस है। उनमें से चार मिनार गोलकोंडा फोर्ट सालार जंग म्यूजियम और बिरला प्लेनेटोरियम मशहूर है।

HYP: agent : हाँ बिल्कुल। हैदराबाद में बहुत सारे प्लेस है। उनमें से चार महीना गोलकुंडा फोर सलार जंग म्यूजियम और बिरला प्लेनेटोरियम मशहूर है।

SENTENCE 4

Correct = 83.3% 20 ( 24)

Errors = 16.7% 4 ( 24)

The following is with custom vocabulary:

REF: agent : हाँ बिल्कुल। हैदराबाद में बहुत सारे प्लेस है। उनमें से चार मिनार गोलकोंडा फोर्ट सालार जंग म्यूजियम और बिरला प्लेनेटोरियम मशहूर है।

HYP: agent : हाँ बिल्कुल। हैदराबाद में बहुत सारे प्लेस है। उनमें से चार मिनार गोलकोंडा फोर्ट सालार जंग म्यूजियम और बिरला प्लेनेटोरियम मशहूर है।

SENTENCE 4

Correct = 100.0% 24 ( 24)

Errors = 0.0% 0 ( 24)

There are no errors in sentence 4. The names of the places are transcribed accurately with the help of custom vocabulary, thereby reducing the overall WER from 9.848% to 6.061% for this audio file. This means that the accuracy of transcription improved by nearly 4 percentage points.
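
When reporting an improvement like this, it helps to separate the absolute change (in percentage points) from the relative change (as a fraction of the original error rate). The following short sketch works through the arithmetic with the error counts from the two wer runs; the function name is hypothetical.

```python
def wer_change(errors_before, errors_after, reference_words):
    """Return the absolute and relative WER reduction, both in percent."""
    before = errors_before / reference_words
    after = errors_after / reference_words
    return {
        "absolute_points": (before - after) * 100,
        "relative_percent": (before - after) / before * 100,
    }


# 13 errors before and 8 errors after, out of 132 reference words.
change = wer_change(13, 8, 132)
print(round(change["absolute_points"], 2))   # 3.79
print(round(change["relative_percent"], 2))  # 38.46
```

In other words, the custom vocabulary removed 5 of the 13 original errors, which is a drop of about 3.79 percentage points in WER and about 38% of the errors made by the default model on this file.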

How custom vocabulary improved the accuracy

We used the following custom vocabulary:

Phrase	IPA	SoundsLike	DisplayAs
गोलकुंडा-फोर			गोलकोंडा फोर्ट
सालार-जंग		सा-लार-जंग	सालार जंग
चार-महीना			चार मिनार

Amazon Transcribe checks if there are any words in the audio file that sound like the words mentioned in the Phrase column. Then the model uses the entries in the IPA, SoundsLike, and DisplayAs columns for those specific words to transcribe with the desired spellings.

With this custom vocabulary, when Amazon Transcribe identifies a word that sounds like “गोलकुंडा-फोर (Golcunda-Four),” it transcribes that word as “गोलकोंडा फोर्ट (Golconda Fort).”

Recommendations

The accuracy of transcription also depends on parameters like the speakers’ pronunciation, overlapping speakers, talking speed, and background noise. Therefore, we recommend that you follow the process with a variety of calls (with different customers, agents, interruptions, and so on) that cover the most commonly used domain-specific words so that you can build a comprehensive custom vocabulary.

In this post, we learned the process to improve accuracy of transcribing one audio call using custom vocabulary. To process thousands of your contact center call recordings every day, you can use post call analytics, a fully automated, scalable, and cost-efficient end-to-end solution that takes care of most of the heavy lifting. You simply upload your audio files to an S3 bucket, and within minutes, the solution provides call analytics like sentiment in a web UI. Post call analytics provides actionable insights to spot emerging trends, identify agent coaching opportunities, and assess the general sentiment of calls. Post call analytics is an open-source solution that you can deploy using AWS CloudFormation.

Note that custom vocabularies don’t use the context in which the words were spoken; they only focus on individual words that you provide. To further improve the accuracy, you can use custom language models. Unlike custom vocabularies, which associate pronunciation with spelling, custom language models learn the context associated with a given word. This includes how and when a word is used, and the relationship a word has with other words. To create a custom language model, you can use the transcriptions derived from the process we learned for a variety of calls, and combine them with content from your websites or user manuals that contains domain-specific words and phrases.

To achieve the highest transcription accuracy with batch transcriptions, you can use custom vocabularies in conjunction with your custom language models.

Conclusion

In this post, we provided detailed steps to accurately process Hindi audio files containing English words using call analytics and custom vocabularies in Amazon Transcribe. You can use these same steps to process audio calls with any language supported by Amazon Transcribe.

After you derive the transcriptions with your desired accuracy, you can improve your agent-customer conversations by training your agents. You can also understand your customer sentiments and trends. With the help of speaker diarization, loudness detection, and vocabulary filtering features in the call analytics, you can identify whether it was the agent or customer who raised their tone or spoke any specific words. You can categorize calls based on domain-specific words, capture actionable insights, and run analytics to improve your products. Finally, you can translate your transcripts to English or other supported languages of your choice using Amazon Translate.


About the Authors

Sarat Guttikonda is a Sr. Solutions Architect in AWS World Wide Public Sector. Sarat enjoys helping customers automate, manage, and govern their cloud resources without sacrificing business agility. In his free time, he loves building Legos with his son and playing table tennis.

Lavanya Sood is a Solutions Architect in AWS World Wide Public Sector based out of New Delhi, India. Lavanya enjoys learning new technologies and helping customers in their cloud adoption journey. In her free time, she loves traveling and trying different foods.


Detect audio events with Amazon Rekognition

When most people think of using machine learning (ML) with audio data, the use case that usually comes to mind is transcription, also known as speech-to-text. However, there are other useful applications, including using ML to detect sounds.

Using software to detect a sound is called audio event detection, and it has a number of applications. For example, suppose you want to monitor the sounds from a noisy factory floor, listening for an alarm bell that indicates a problem with a machine. In a healthcare environment, you can use audio event detection to passively listen for sounds from a patient that indicate an acute health problem. Media workloads are a good fit for this technique, for example to detect when a referee’s whistle is blown in a sports video. And of course, you can use this technique in a variety of surveillance workloads, like listening for a gunshot or the sound of a car crash from a microphone mounted above a city street.

This post describes how to detect sounds in an audio file even if there are significant background sounds happening at the same time. What’s more, perhaps surprisingly, we use computer vision-based techniques to do the detection, using Amazon Rekognition.

Using audio data with machine learning

The first step in detecting audio events is understanding how audio data is represented. For the purposes of this post, we deal only with recorded audio, although these techniques work with streaming audio.

Recorded audio is typically stored as a sequence of sound samples, which measure the intensity of the sound waves that struck the microphone during recording, over time. There are a wide variety of formats with which to store these samples, but a common approach is to store 10,000, 20,000, or even 40,000 samples per second, with each sample being an integer from 0–65535 (two bytes). Because each sample measures only the intensity of sound waves at a particular moment, the sound data generally isn’t helpful for ML processes because it doesn’t have any useful features in its raw state.

To make that data useful, the sound sample is converted into an image called a spectrogram, which is a representation of the audio data that shows the intensity of different frequency bands over time. The following image shows an example.

The X axis of this image represents time, meaning that the left edge of the image is the very start of the sound, and the right edge of the image is the end. Each row of the image represents a different frequency band (indicated by the scale on the left side of the image), and the color at each point represents the intensity of that frequency at that moment in time.
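
To illustrate how raw samples become this time-by-frequency grid, the following is a minimal sketch that computes a linear-frequency spectrogram with a naive discrete Fourier transform, using only the Python standard library. It is not the method used to produce the images in this post; a real workload would typically use a signal-processing library and a Mel scale. The frame size, hop, and test tone are assumptions chosen for illustration.

```python
import cmath
import math


def spectrogram(samples, frame_size=64, hop=32):
    """Return one list of frequency-bin magnitudes per time frame."""
    # A Hann window reduces leakage between neighbouring frequency bins.
    window = [
        0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
        for n in range(frame_size)
    ]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        magnitudes = []
        for k in range(frame_size // 2 + 1):  # bins from 0 Hz up to Nyquist
            total = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n in range(frame_size)
            )
            magnitudes.append(abs(total))
        frames.append(magnitudes)
    return frames


def bin_frequency(k, sample_rate, frame_size=64):
    """Centre frequency in Hz of frequency bin k."""
    return k * sample_rate / frame_size


# A 1,000 Hz test tone sampled at 8,000 Hz (256 samples).
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(256)]
frames = spectrogram(tone)
loudest_bin = max(range(len(frames[0])), key=frames[0].__getitem__)
print(bin_frequency(loudest_bin, SAMPLE_RATE))  # 1000.0
```

Each entry in frames is one moment in time (a position along the X axis), and each value within it is the intensity of one frequency band (a position along the Y axis), which is what the colors in the image encode.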

Vertical scaling for spectrograms can be changed to other representations. For example, linear scaling means that the Y axis is evenly divided over frequencies, logarithmic scaling uses a log scale, and so forth. The problem with using these representations is that the frequencies in a sound file are usually not evenly distributed, so most of the information we might be interested in ends up being clustered near the bottom of the image (the lower frequencies).

To solve that problem, our sample image is an example of a Mel spectrogram, which is scaled to closely approximate how human beings perceive sound. Notice the frequency indicators along the left side of the image—they give an idea of how they are distributed vertically, and it’s clear that it’s a non-linear scale.

Additionally, we can modify how intensity is measured across frequency and time to enhance various features of the audio being measured. Just as a Mel spectrogram rescales the Y axis, other spectrogram types emphasize features such as the intensity of the 12 distinctive pitch classes that are used to study music (chroma). Another class emphasizes horizontal (harmonic) features or vertical (percussive) features. The type of sound that is being detected should drive the type of spectrogram used for the detection system.

The earlier example spectrogram represents a music clip that is just over 2 minutes long. Zooming in reveals more detail, as is shown in the following image.

The numbers along the top of the image show the number of seconds from the start of the audio file. You can clearly see a sequence of sounds that seems to repeat more than four times per second, indicated by the bright colors near the bottom of the image.

As you can see, this is one of the benefits of converting audio to a spectrogram—distinct sounds are often easily visible with the naked eye, and even if they aren’t, they can frequently be detected using computer vision object detection algorithms. In fact, this is exactly the process we follow in order to detect sounds.

Looking for discrete sounds in a spectrogram

Depending on the length of the audio file that we’re searching, finding a discrete sound that lasts just a second or two is a challenge. Refer to the first spectrogram we shared—because we’re viewing an entire 3:30 minutes of data, details that last only a second or so aren’t visible. We zoomed in a great deal in order to see the rhythm that is shown in the second image. Clearly, with larger sound files (and therefore much larger spectrograms), we quickly run into problems unless we use a different approach. That approach is called windowing.

Windowing refers to using a sliding window that moves across the entire spectrogram, isolating a few seconds (or less) at a time. By repeatedly isolating portions of the overall image, we get smaller images that are searchable for the presence of the sound to be detected. Because each window could result in only part of the image we’re looking for (as in the case of searching for a sound that doesn’t start exactly at the start of a window), windowing is often performed with succeeding windows being overlapped. For example, the first window starts at 0:00 and extends 2 seconds, then the second window starts at 0:01 and extends 2 seconds, and the third window starts at 0:02 and extends 2 seconds, and so on.
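
The overlapping windows described above can be generated with a few lines of code. This is a minimal sketch; the function name is hypothetical, and the 2-second window with a 1-second step simply mirrors the example in the text.

```python
def sliding_windows(duration_s, window_s=2.0, hop_s=1.0):
    """Return (start, end) times in seconds for overlapping windows.

    Windows that would extend past the end of the audio are dropped.
    """
    windows = []
    index = 0
    while index * hop_s + window_s <= duration_s:
        start = index * hop_s
        windows.append((start, start + window_s))
        index += 1
    return windows


print(sliding_windows(5))
# [(0.0, 2.0), (1.0, 3.0), (2.0, 4.0), (3.0, 5.0)]
```

Each (start, end) pair can then be converted to pixel columns in the spectrogram image and cropped out as its own smaller image. A smaller hop gives more overlap, and therefore more images to search, in exchange for a lower chance of cutting the target sound in half.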

Windowing splits a spectrogram image horizontally. We can improve the effectiveness of the detection process by isolating certain frequency bands by cropping or searching only certain vertical parts of the image. For example, if you know that the alarm bell you want to detect creates sounds that range from one specific frequency to another, you can modify the current window to only consider those frequency ranges. That vastly reduces the amount of data to be manipulated, and results in a much faster search. It also improves accuracy, because it eliminates possible false positive matches occurring in frequency bands outside of the desired range. The following images compare a full Y axis (left) with a limited Y axis (right).

Full Y Axis

Limited Y Axis
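Limiting the Y axis amounts to keeping only the spectrogram rows whose frequency falls inside the band of interest. The following sketch assumes the spectrogram is a list of rows ordered from the lowest to the highest frequency, with a matching list of bin frequencies; the function names and the sample band are hypothetical.

```python
def band_row_range(bin_frequencies_hz, low_hz, high_hz):
    """Return (first, last_exclusive) row indices that fall inside the band."""
    rows = [i for i, f in enumerate(bin_frequencies_hz) if low_hz <= f <= high_hz]
    if not rows:
        raise ValueError("no frequency bins fall inside the requested band")
    return rows[0], rows[-1] + 1


def crop_to_band(spectrogram, bin_frequencies_hz, low_hz, high_hz):
    """Keep only the spectrogram rows between low_hz and high_hz."""
    first, last = band_row_range(bin_frequencies_hz, low_hz, high_hz)
    return spectrogram[first:last]


# Five frequency bins, three time steps (made-up magnitudes).
frequencies = [0, 1000, 2000, 3000, 4000]
spectrogram = [
    [0.1, 0.1, 0.1],
    [0.2, 0.2, 0.2],
    [0.3, 0.3, 0.3],
    [0.9, 0.8, 0.9],
    [0.7, 0.9, 0.7],
]
print(crop_to_band(spectrogram, frequencies, 2500, 4500))
```

The same crop is applied to every window, both when generating training images and at inference time, so the model always sees the same frequency range.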

Now that we know how to iterate over a spectrogram with a windowing approach and filter to certain frequency bands, the next step is to do the actual search for the sound. For that, we use Amazon Rekognition Custom Labels. The Rekognition Custom Labels feature builds off of the existing capabilities of Amazon Rekognition, which is already trained on tens of millions of images across many categories. Instead of thousands of images, you simply need to upload a small set of training images (typically a few hundred images, but optimal training dataset size should be arrived at experimentally based on the specific use case to avoid under- or over-training the model) that are specific to your use case via the Rekognition Custom Labels console.

If your images are already labeled, Amazon Rekognition training is accessible with just a few clicks. Alternatively, you can label the images directly within the Amazon Rekognition labeling interface, or use Amazon SageMaker Ground Truth to label them for you. When Amazon Rekognition begins training from your image set, it produces a custom image analysis model for you in just a few hours. Behind the scenes, Rekognition Custom Labels automatically loads and inspects the training data, selects the right ML algorithms, trains a model, and provides model performance metrics. You can then use your custom model via the Rekognition Custom Labels API and integrate it into your applications.

Assembling training data and training a Rekognition Custom Labels model

In the GitHub repo associated with this post, you’ll find code that shows how to listen for the sound of a smoke alarm going off, regardless of background noise. In this case, our Rekognition Custom Labels model is a binary classification model, meaning that the results are either “smoke alarm sound was detected” or “smoke alarm sound was not detected.”

To create a custom model, we need training data. That training data consists of two main types: environmental sounds, and the sounds you wish to detect (like a smoke alarm going off).

The environmental data should represent a wide variety of soundscapes that are typical for the environment you want to detect the sound in. For example, if you want to detect a smoke alarm sound in a factory environment, start with sounds recorded in that factory environment under a variety of situations (without the smoke alarm sounding, of course).

The sounds you want to detect should be isolated if possible, meaning the recordings should just be the sound itself without any environmental background sounds. For our example, that’s a sound of a smoke alarm going off.

After you’ve collected these sounds, the code in the GitHub repo shows how to combine the environmental sounds with the smoke alarm sounds in various ways (and then convert them to spectrograms) in order to create a number of images that represent the environmental sounds with and without the smoke alarm sounds overlaid on them. The following image is an example of some environmental sounds with a smoke alarm sound (the bright horizontal bars) overlaid on top of it.
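To make the idea of overlaying concrete, the following is a minimal sketch of mixing an isolated alarm recording into an environmental recording before the spectrogram is generated. It is not the repo's implementation: it assumes both sounds are lists of float samples in the range -1.0 to 1.0 at the same sample rate, and the function name, offset, and gain parameters are hypothetical.

```python
def overlay(environment, alarm, offset, alarm_gain=1.0):
    """Mix `alarm` samples into a copy of `environment`, starting at `offset`."""
    if offset < 0:
        raise ValueError("offset must not be negative")
    mixed = list(environment)
    for i, sample in enumerate(alarm):
        position = offset + i
        if position >= len(mixed):
            break  # the alarm runs past the end of the clip; drop the rest
        value = mixed[position] + alarm_gain * sample
        # Clip to the valid sample range so the mix doesn't overflow.
        mixed[position] = max(-1.0, min(1.0, value))
    return mixed


environment = [0.0, 0.25, 0.25, 0.25]
alarm = [0.5, 0.5]
print(overlay(environment, alarm, offset=2))
```

Varying the offset and gain across many environmental clips is one way to produce a varied set of positive examples, while the unmodified environmental clips serve as negative examples.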

The training and test data is stored in an Amazon Simple Storage Service (Amazon S3) bucket. The following directory structure is a good starting point to organize data within the bucket.

The sample code in the GitHub repo allows you to choose how many training images to create. Rekognition Custom Labels doesn’t require a large number of training images. A training set of 200–500 images should be sufficient.

Creating a Rekognition Custom Labels project requires that you specify the URIs of the S3 folder that contains the training data, and (optionally) test data. When specifying the data sources for the training job, one of the options is Automatic labeling, as shown in the following screenshot.

Using this option means that Amazon Rekognition uses the names of the folders as the label names. For our smoke alarm detection use case, the folder structure inside of the train and test folders looks like the following screenshot.

The training data images go into those folders, with spectrograms containing the sound of the smoke alarm going in the alarm folder, and spectrograms that don’t contain the smoke alarm sound in the no_alarm folder. Amazon Rekognition uses those names as the output class names for the custom labels model.
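As a quick sanity check before training, you can confirm which label each image will receive from its folder. The sketch below only mimics the folder-name convention described above; the object keys are made up, and the helper is not part of any AWS SDK.

```python
from collections import Counter
from pathlib import PurePosixPath


def label_for_key(key):
    """Return the name of the folder that directly contains the object."""
    return PurePosixPath(key).parent.name


# Hypothetical S3 object keys following the train/<label>/<image> layout.
keys = [
    "train/alarm/clip_0001.png",
    "train/alarm/clip_0002.png",
    "train/no_alarm/clip_0003.png",
]
print(Counter(label_for_key(key) for key in keys))
```

Counting images per label this way also makes it easy to spot a badly unbalanced training set.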

Training a Rekognition Custom Labels model usually takes 30–90 minutes. At the end of that training, you must start the trained model so it becomes available for use.

End-to-end architecture for sound detection

After we create our model, the next step is to set up an inference pipeline, so we can use the model to detect if a smoke alarm sound is present in an audio file. To do this, the input sound must be turned into a spectrogram and then windowed and filtered by frequency, as was done for the training process. Each window of the spectrogram is given to the model, which returns a classification that indicates if the smoke alarm sounded or not.

The following diagram shows an example architecture that implements this inference pipeline.

This architecture waits for an audio file to be placed into an S3 bucket, which then causes an AWS Lambda function to be invoked. Lambda is a serverless, event-driven compute service that lets you run code for virtually any type of application or backend service without provisioning or managing servers. You can trigger a Lambda function from over 200 AWS services and software as a service (SaaS) applications, and only pay for what you use.

The Lambda function receives the name of the bucket and the name of the key (or file name) of the audio file. The function downloads the file from Amazon S3 into memory, converts it into a spectrogram, and performs windowing and frequency filtering. Each windowed portion of the spectrogram is then sent to Amazon Rekognition, which uses the previously trained Rekognition Custom Labels model to detect the sound. If that sound is found, the Lambda function signals that by using an Amazon Simple Notification Service (Amazon SNS) notification. Amazon SNS offers a pub/sub approach where notifications can be sent to Amazon Simple Queue Service (Amazon SQS) queues, Lambda functions, HTTPS endpoints, email addresses, mobile push, and more.
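Two small pieces of that Lambda function can be sketched without any AWS calls: reading the bucket and key from the S3 event, and deciding whether a Rekognition Custom Labels response indicates the alarm. This is an illustrative sketch rather than the repo's handler. The label name alarm follows the folder naming used for training, and the 80% confidence threshold is an assumption you would tune.

```python
from urllib.parse import unquote_plus


def audio_location(event):
    """Extract (bucket, key) from an S3 event notification."""
    s3_record = event["Records"][0]["s3"]
    # Object keys arrive URL-encoded in S3 event notifications.
    return s3_record["bucket"]["name"], unquote_plus(s3_record["object"]["key"])


def alarm_detected(response, label="alarm", min_confidence=80.0):
    """Return True if the response contains `label` above the threshold.

    `response` has the shape returned by the Rekognition
    detect_custom_labels operation: {"CustomLabels": [{"Name", "Confidence"}]}.
    """
    return any(
        item["Name"] == label and item["Confidence"] >= min_confidence
        for item in response.get("CustomLabels", [])
    )


# Hypothetical event and response used only to exercise the helpers.
event = {"Records": [{"s3": {"bucket": {"name": "audio-in"},
                             "object": {"key": "incoming/kitchen+clip.wav"}}}]}
response = {"CustomLabels": [{"Name": "alarm", "Confidence": 93.5}]}
print(audio_location(event), alarm_detected(response))
```

In the full function, the handler would call alarm_detected once per spectrogram window and publish an SNS message as soon as any window returns True.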

Conclusion

You can use machine learning with audio data to determine when certain sounds occur, even when other sounds are occurring at the same time. Doing so requires converting the sound into a spectrogram image, and then homing in on different parts of that spectrogram by windowing and filtering by frequency band. Rekognition Custom Labels makes it easy to train a custom model for sound detection.

You can use the GitHub repo containing the example code for this post as a starting point for your own experiments. For more information about audio event detection, refer to Sound Event Detection: A Tutorial.


About the authors

Greg Sommerville is a Senior Prototyping Architect on the AWS Prototyping and Cloud Engineering team, where he helps AWS customers implement innovative solutions to challenging problems with machine learning, IoT and serverless technologies. He lives in Ann Arbor, Michigan and enjoys practicing yoga, catering to his dogs, and playing poker.

Jeff Harman is a Senior Prototyping Architect on the AWS Prototyping and Cloud Engineering team, where he helps AWS customers implement innovative solutions to challenging problems. He lives in Unionville, Connecticut and enjoys woodworking, blacksmithing, and Minecraft.

How The Chefz serves the perfect meal with Amazon Personalize

This is a guest post by Ramzi Alqrainy, Chief Technology Officer, The Chefz.

The Chefz is a Saudi-based online food delivery startup, founded in 2016. At the core of The Chefz’s business model is enabling its customers to order food and sweets from top elite restaurants, bakeries, and chocolate shops. In this post, we explain how The Chefz uses Amazon Personalize filters to apply business rules on recommendations to end-users, increasing revenue by 35%.

Food delivery is a growing industry but at the same time is extremely competitive. The biggest challenge in the industry is maintaining customer loyalty. This requires a comprehensive understanding of the customer’s preferences, the ability to provide excellent response time in terms of on-time delivery, and good food quality. These three factors determine the most important metric for The Chefz’s customer satisfaction. Demand at The Chefz fluctuates, especially with spikes in order volumes at lunch and dinner times. Demand also fluctuates during special days such as Mother’s Day, the football final, Ramadan pre-dawn (Suhoor) and sundown (Iftar) times, or Eid festive holidays. During these times, demand can increase by up to 300%, which adds one more critical challenge: recommending the perfect meal based on the time of day, especially in Ramadan.

The perfect meal at the right time

To make the ordering process more deterministic and to cater to peak demand times, The Chefz team decided to divide the day into different periods. For example, during Ramadan season, days are divided into Iftar and Suhoor. On regular days, days consist of four periods: breakfast, lunch, dinner, and dessert. The technology that underpins this deterministic ordering process is Amazon Personalize, a powerful recommendation engine. Amazon Personalize takes these grouped periods along with the location of the customer to provide a perfect recommendation.

This ensures the customer receives restaurant and meal recommendations based on their preference and from a nearby location so that it arrives quickly at their doorstep.

This recommendation engine based on Amazon Personalize is the key ingredient in how The Chefz’s customers enjoy personalized restaurant meal recommendations, rather than random recommendations for categories of favorites.

The personalization journey

The Chefz started its personalization journey by offering restaurant recommendations for customers using Amazon Personalize based on previous interactions, user metadata (such as age, nationality, and diet), restaurant metadata (such as category and food types offered), and live tracking of customer interactions on The Chefz mobile application and web portal. The initial deployment phases of Amazon Personalize led to a 10% increase in customer interactions with the portal.

Although that was a milestone step, delivery time was still a problem for many customers, particularly during rush hour. To address this, the data science team added location as an additional feature to user metadata so recommendations would take into consideration both user preference and location for improved delivery time.

The next step in the recommendation journey was to consider annual timing, especially Ramadan, and the time of day. These considerations ensured The Chefz could recommend heavy meals or restaurants that provide Iftar meals during Ramadan sundown, and lighter meals in the late evening. To solve this challenge, the data science team used Amazon Personalize filters updated by AWS Lambda functions, which were invoked on a schedule by an Amazon CloudWatch event with a cron expression.

The following architecture shows the automated process for applying the filters:

  1. A CloudWatch event uses a cron expression to schedule when a Lambda function is invoked.
  2. When the Lambda function is triggered, it attaches the filter to the recommendation engine to apply business rules.
  3. Recommended meals and restaurants are delivered to end-users on the application.
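The decision the scheduled Lambda function makes in step 2 can be illustrated with a small helper that maps the current hour to one of the periods described earlier. This is only a sketch: the hour boundaries below are invented for illustration and are not The Chefz’s actual business rules, and the function name is hypothetical.

```python
def meal_period(hour, is_ramadan=False):
    """Map an hour of the day (0-23) to the period used to pick a filter."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    if is_ramadan:
        # During Ramadan the day is divided into Suhoor and Iftar only.
        return "suhoor" if hour < 5 else "iftar"
    if 5 <= hour < 11:
        return "breakfast"
    if 11 <= hour < 16:
        return "lunch"
    if 16 <= hour < 21:
        return "dinner"
    return "dessert"


print(meal_period(12), meal_period(19, is_ramadan=True))
```

The returned period would then select which Amazon Personalize filter the function attaches when requesting recommendations.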

Conclusion

Amazon Personalize enabled The Chefz to apply context about individual customers and their circumstances, and deliver customized recommendations based on business rules such as special deals and offers through our mobile application. This increased revenue by 35% per month and doubled customer orders at recommended restaurants.

“The customer is at the heart of everything we do at The Chefz, and we’re working tirelessly to improve and enhance their experience. With Amazon Personalize, we are able to achieve personalization at scale across our entire customer base, which was previously impossible.”

-Ramzi Alqrainy, CTO at The Chefz.


About the authors

Ramzi Alqrainy is Chief Technology Officer at The Chefz. Ramzi is a contributor to Apache Solr and Slack and a technical reviewer, and has published many papers in IEEE focusing on search and data functions.

Mohamed Ezzat is a Senior Solutions Architect at AWS with a focus in machine learning. He works with customers to address their business challenges using cloud technologies. Outside of work, he enjoys playing table tennis.

Distributed training with Amazon EKS and Torch Distributed Elastic

Distributed deep learning model training is becoming increasingly important as data sizes are growing in many industries. Many applications in computer vision and natural language processing now require training of deep learning models, which are growing exponentially in complexity and are often trained with hundreds of terabytes of data. It then becomes important to use a vast cloud infrastructure to scale the training of such large models.

Developers can use open-source frameworks such as PyTorch to easily design intuitive model architectures. However, scaling the training of these models across multiple nodes can be challenging due to increased orchestration complexity.

Distributed model training mainly consists of two paradigms:

  • Model parallel – In model parallel training, the model itself is so large that it can’t fit in the memory of a single GPU, and multiple GPUs are needed to train the model. OpenAI’s GPT-3 model with 175 billion trainable parameters (approximately 350 GB in size) is a good example of this.
  • Data parallel – In data parallel training, the model can reside in a single GPU, but because the data is so large, it can take days or weeks to train a model. Distributing the data across multiple GPU nodes can significantly reduce the training time.
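The 350 GB figure quoted for GPT-3 can be reproduced with simple arithmetic if we assume each parameter is stored in 2 bytes (16-bit floating point); that storage format is our assumption, not something stated above. The sketch also estimates how many GPUs are needed just to hold the weights, assuming 40 GB of memory per GPU.

```python
import math


def parameter_memory_gb(num_parameters, bytes_per_parameter):
    """Memory needed to hold the model weights, in GB (10^9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


def min_gpus_for_weights(model_gb, gpu_memory_gb):
    """Smallest number of GPUs whose combined memory can hold the weights."""
    return math.ceil(model_gb / gpu_memory_gb)


model_gb = parameter_memory_gb(175e9, 2)  # 175 billion parameters, 2 bytes each
print(model_gb)                            # 350.0
print(min_gpus_for_weights(model_gb, 40))  # 9
```

Training needs considerably more memory than the weights alone (gradients, optimizer state, activations), so this is a lower bound that only illustrates why such a model can't live on one GPU.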

In this post, we provide an example architecture to train PyTorch models using the Torch Distributed Elastic framework in a distributed data parallel fashion using Amazon Elastic Kubernetes Service (Amazon EKS).

Prerequisites

To replicate the results reported in this post, the only prerequisite is an AWS account. In this account, we create an EKS cluster and an Amazon FSx for Lustre file system. We also push container images to an Amazon Elastic Container Registry (Amazon ECR) repository in the account. Instructions to set up these components are provided as needed throughout the post.

EKS clusters

Amazon EKS is a managed container service to run and scale Kubernetes applications on AWS. With Amazon EKS, you can efficiently run distributed training jobs using the latest Amazon Elastic Compute Cloud (Amazon EC2) instances without needing to install, operate, and maintain your own control plane or nodes. It is a popular orchestrator for machine learning (ML) and AI workflows. A typical EKS cluster in AWS looks like the following figure.

We have released an open-source project, AWS DevOps for EKS (aws-do-eks), which provides a large collection of easy-to-use and configurable scripts and tools to provision EKS clusters and run distributed training jobs. This project is built following the principles of the Do Framework: Simplicity, Flexibility, and Universality. You can configure your desired cluster by using the eks.conf file and then launch it by running the eks-create.sh script. Detailed instructions are provided in the GitHub repo.

Train PyTorch models using Torch Distributed Elastic

Torch Distributed Elastic (TDE) is a native PyTorch library for training large-scale deep learning models where it’s critical to scale compute resources dynamically based on availability. The TorchElastic Controller for Kubernetes is a native Kubernetes implementation for TDE that automatically manages the lifecycle of the pods and services required for TDE training. It allows for dynamically scaling compute resources during training as needed. It also provides fault-tolerant training by recovering jobs from node failure.

In this post, we discuss the steps to train PyTorch EfficientNet-B7 and ResNet50 models using ImageNet data in a distributed fashion with TDE. We use the PyTorch DistributedDataParallel API and the Kubernetes TorchElastic controller, and run our training jobs on an EKS cluster containing multiple GPU nodes. The following diagram shows the architecture diagram for this model training.

TorchElastic for Kubernetes consists mainly of two components: the TorchElastic Kubernetes Controller (TEC) and the parameter server (etcd). The controller is responsible for monitoring and managing the training jobs, and the parameter server keeps track of the worker nodes for distributed synchronization and peer discovery.

In order for the training pods to access the data, we need a shared data volume that can be mounted by each pod. Some options for shared volumes through Container Storage Interface (CSI) drivers included in AWS DevOps for EKS are Amazon Elastic File System (Amazon EFS) and FSx for Lustre.

Cluster setup

In our cluster configuration, we use one c5.2xlarge instance for system pods. We use three p4d.24xlarge instances as worker nodes to train an EfficientNet model. For ResNet50 training, we use p3.8xlarge instances as worker nodes. Additionally, we use an FSx shared file system to store our training data and model artifacts.

AWS p4d.24xlarge instances are equipped with Elastic Fabric Adapter (EFA) to provide networking between nodes. We discuss EFA more later in the post. To enable communication through EFA, we need to configure the cluster setup through a .yaml file. An example file is provided in the GitHub repository.

After this .yaml file is properly configured, we can launch the cluster using the script provided in the GitHub repo:

./eks-create.sh

Refer to the GitHub repo for detailed instructions.

There is practically no difference between running jobs on p4d.24xlarge and p3.8xlarge. The steps described in this post work for both. The only difference is the availability of EFA on p4d.24xlarge instances. For smaller models like ResNet50, standard networking compared to EFA networking has minimal impact on the speed of training.

FSx for Lustre file system

FSx is designed for high-performance computing workloads and provides sub-millisecond latency using solid-state drive storage volumes. We chose FSx because it provided better performance as we scaled to a large number of nodes. An important detail to note is that FSx can only exist in a single Availability Zone. Therefore, all nodes accessing the FSx file system should exist in the same Availability Zone as the FSx file system. One way to achieve this is to specify the relevant Availability Zone in the cluster .yaml file for the specific node groups before creating the cluster. Alternatively, we can modify the network part of the auto scaling group for these nodes after the cluster is set up, and limit it to using a single subnet. This can be easily done on the Amazon EC2 console.

Assuming that the EKS cluster is up and running, and the subnet ID for the Availability Zone is known, we can set up an FSx file system by providing the necessary information in the fsx.conf file as described in the readme and running the deploy.sh script in the fsx folder. This sets up the correct policy and security group for accessing the file system. The script also installs the CSI driver for FSx as a daemonset. Finally, we can create the FSx persistent volume claim in Kubernetes by applying a single .yaml file:

kubectl apply -f fsx-pvc-dynamic.yaml

This creates an FSx file system in the Availability Zone specified in the fsx.conf file, and also creates a persistent volume claim fsx-pvc, which can be mounted by any of the pods in the cluster in a read-write-many (RWX) fashion.

In our experiment, we used complete ImageNet data, which contains more than 1.2 million training images divided into 1,000 classes. The data can be downloaded from the ImageNet website. The original tarball has several directories, but for our model training, we’re only interested in ILSVRC/Data/CLS-LOC/, which includes the train and val subdirectories. Before training, we need to rearrange the images in the val subdirectory to match the directory structure required by the PyTorch ImageFolder class. This can be done using a simple Python script after the data is copied to the persistent volume in the next step.
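The ImageFolder class expects one subdirectory per class, with that class’s images inside it. The following is a minimal sketch of the rearrangement, not the imagenet_data_prep.py script from the repo. It assumes you already have a mapping from validation file name to class name (how you build that mapping depends on the annotation files in your copy of the dataset), and the function name is hypothetical.

```python
import os
import shutil


def arrange_val_images(val_dir, class_of_image):
    """Move each image in val_dir into val_dir/<class>/, as ImageFolder expects.

    class_of_image maps a file name to its class name.
    Returns the number of files that were moved.
    """
    moved = 0
    for filename, class_name in class_of_image.items():
        source = os.path.join(val_dir, filename)
        if not os.path.isfile(source):
            continue  # already moved, or missing from this copy of the data
        class_dir = os.path.join(val_dir, class_name)
        os.makedirs(class_dir, exist_ok=True)
        shutil.move(source, os.path.join(class_dir, filename))
        moved += 1
    return moved
```

Because files that are no longer in the top-level directory are skipped, the function can be rerun safely if the copy job is interrupted.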

To copy the data from an Amazon Simple Storage Service (Amazon S3) bucket to the FSx file system, we create a Docker image that includes scripts for this task. An example Dockerfile and a shell script are included in the csi folder within the GitHub repo. We can build the image using the build.sh script and then push it to Amazon ECR using the push.sh script. Before using these scripts, we need to provide the correct URI for the ECR repository in the .env file in the root folder of the GitHub repo. After we push the Docker image to Amazon ECR, we can launch a pod to copy the data by applying the relevant .yaml file:

kubectl apply -f fsx-data-prep-pod.yaml

The pod automatically runs the script data-prep.sh to copy the data from Amazon S3 to the shared volume. Because the ImageNet data has more than 1.2 million files, the copy process takes a couple of hours. The Python script imagenet_data_prep.py is also run to rearrange the val dataset as expected by PyTorch.

Network acceleration

We can use Elastic Fabric Adapter (EFA) in combination with supported EC2 instance types to accelerate network traffic between the GPU nodes in your cluster. This can be useful when running large distributed training jobs where standard network communication may be a bottleneck. Scripts to deploy and test the EFA device plugin in the EKS cluster that we use here are included in the efa-device-plugin folder in the GitHub repo. To enable a job with EFA in your EKS cluster, in addition to the cluster nodes having the necessary hardware and software, the EFA device plugin needs to be deployed to the cluster, and your job container needs to have compatible CUDA and NCCL versions installed.

To demonstrate running NCCL tests and evaluating the performance of EFA on p4d.24xlarge instances, we first deploy the Kubeflow MPI operator by running the corresponding deploy.sh script in the mpi-operator folder. Then we run the deploy.sh script and update the test-efa-nccl.yaml manifest so limits and requests for the resource vpc.amazonaws.com/efa are set to 4. The four available EFA adapters in the p4d.24xlarge nodes get bundled together to provide maximum throughput.

Run kubectl apply -f ./test-efa-nccl.yaml to apply the test and then display the logs of the test pod. The following line in the log output confirms that EFA is being used:

NCCL INFO NET/OFI Selected Provider is efa

The test results should look similar to the following output:

[1,0]<stdout>:#                                                       out-of-place                       in-place
[1,0]<stdout>:#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
[1,0]<stdout>:#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
[1,0]<stdout>:           8             2     float     sum    629.7    0.00    0.00  2e-07    631.4    0.00    0.00  1e-07
[1,0]<stdout>:          16             4     float     sum    630.5    0.00    0.00  1e-07    628.1    0.00    0.00  1e-07
[1,0]<stdout>:          32             8     float     sum    627.6    0.00    0.00  1e-07    628.2    0.00    0.00  1e-07
[1,0]<stdout>:          64            16     float     sum    633.6    0.00    0.00  1e-07    628.4    0.00    0.00  6e-08
[1,0]<stdout>:         128            32     float     sum    627.5    0.00    0.00  6e-08    632.0    0.00    0.00  6e-08
[1,0]<stdout>:         256            64     float     sum    634.5    0.00    0.00  6e-08    636.5    0.00    0.00  6e-08
[1,0]<stdout>:         512           128     float     sum    634.8    0.00    0.00  6e-08    635.2    0.00    0.00  6e-08
[1,0]<stdout>:        1024           256     float     sum    646.6    0.00    0.00  2e-07    643.6    0.00    0.00  2e-07
[1,0]<stdout>:        2048           512     float     sum    745.0    0.00    0.01  5e-07    746.0    0.00    0.01  5e-07
[1,0]<stdout>:        4096          1024     float     sum    958.2    0.00    0.01  5e-07    955.8    0.00    0.01  5e-07
[1,0]<stdout>:        8192          2048     float     sum    963.0    0.01    0.02  5e-07    954.5    0.01    0.02  5e-07
[1,0]<stdout>:       16384          4096     float     sum    955.0    0.02    0.03  5e-07    955.5    0.02    0.03  5e-07
[1,0]<stdout>:       32768          8192     float     sum    975.5    0.03    0.06  5e-07   1009.0    0.03    0.06  5e-07
[1,0]<stdout>:       65536         16384     float     sum   1353.4    0.05    0.09  5e-07   1343.5    0.05    0.09  5e-07
[1,0]<stdout>:      131072         32768     float     sum   1395.9    0.09    0.18  5e-07   1392.6    0.09    0.18  5e-07
[1,0]<stdout>:      262144         65536     float     sum   1476.7    0.18    0.33  5e-07   1536.3    0.17    0.32  5e-07
[1,0]<stdout>:      524288        131072     float     sum   1560.3    0.34    0.63  5e-07   1568.3    0.33    0.63  5e-07
[1,0]<stdout>:     1048576        262144     float     sum   1599.2    0.66    1.23  5e-07   1595.3    0.66    1.23  5e-07
[1,0]<stdout>:     2097152        524288     float     sum   1671.1    1.25    2.35  5e-07   1672.5    1.25    2.35  5e-07
[1,0]<stdout>:     4194304       1048576     float     sum   1785.1    2.35    4.41  5e-07   1780.3    2.36    4.42  5e-07
[1,0]<stdout>:     8388608       2097152     float     sum   2133.6    3.93    7.37  5e-07   2135.0    3.93    7.37  5e-07
[1,0]<stdout>:    16777216       4194304     float     sum   2650.9    6.33   11.87  5e-07   2649.9    6.33   11.87  5e-07
[1,0]<stdout>:    33554432       8388608     float     sum   3422.0    9.81   18.39  5e-07   3478.7    9.65   18.09  5e-07
[1,0]<stdout>:    67108864      16777216     float     sum   4783.2   14.03   26.31  5e-07   4782.6   14.03   26.31  5e-07
[1,0]<stdout>:   134217728      33554432     float     sum   7216.9   18.60   34.87  5e-07   7240.9   18.54   34.75  5e-07
[1,0]<stdout>:   268435456      67108864     float     sum    12738   21.07   39.51  5e-07    12802   20.97   39.31  5e-07
[1,0]<stdout>:   536870912     134217728     float     sum    24375   22.03   41.30  5e-07    24403   22.00   41.25  5e-07
[1,0]<stdout>:  1073741824     268435456     float     sum    47904   22.41   42.03  5e-07    47893   22.42   42.04  5e-07
[1,4]<stdout>:test-efa-nccl-worker-0:33:33 [4] NCCL INFO comm 0x7fd4a0000f60 rank 4 nranks 16 cudaDev 4 busId 901c0 - Destroy COMPLETE
[1,0]<stdout>:# Out of bounds values : 0 OK
[1,0]<stdout>:# Avg bus bandwidth    : 8.23785

We can observe in the test results that the maximum throughput is about 42 GB/s and the average bus bandwidth is approximately 8 GB/s.
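The algbw and busbw columns in the log can be recomputed from the size and time columns. The sketch below assumes the test is an all-reduce, for which the NCCL tests documentation defines bus bandwidth as algorithm bandwidth multiplied by 2(n-1)/n, and it uses the 16 ranks reported in the log; the function names are ours.

```python
def algorithm_bandwidth_gbps(size_bytes, time_us):
    """Algorithm bandwidth in GB/s: bytes moved divided by elapsed time."""
    return size_bytes / time_us / 1e3  # bytes per microsecond -> GB/s


def allreduce_bus_bandwidth_gbps(algbw_gbps, nranks):
    """Bus bandwidth for all-reduce, assuming the 2(n-1)/n correction factor."""
    return algbw_gbps * 2 * (nranks - 1) / nranks


# Last out-of-place row of the log: 1073741824 bytes in 47904 microseconds.
algbw = algorithm_bandwidth_gbps(1073741824, 47904)
busbw = allreduce_bus_bandwidth_gbps(algbw, nranks=16)
print(round(algbw, 2), round(busbw, 2))  # 22.41 42.03
```

The average of about 8 GB/s is much lower than the maximum because it averages over all message sizes, including the small ones where latency dominates and bandwidth is close to zero.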

We also conducted experiments with a single EFA adapter enabled as well as no EFA adapters. All results are summarized in the following table.

Number of EFA Adapters   Net/OFI Selected Provider   Avg. Bandwidth (GB/s)   Max. Bandwidth (GB/s)
4                        efa                         8.24                    42.04
1                        efa                         3.02                    5.89
0                        socket                      0.97                    2.38

We also found that for relatively small models like ResNet50, the use of accelerated networking reduces the training time per epoch by only 5–8% at a batch size of 64. For larger models and smaller batch sizes, when increased network communication of weights is needed, the use of accelerated networking has greater impact. We observed a 15–18% decrease in epoch training time for EfficientNet-B7 with batch size 1. The actual impact of EFA on your training will depend on the size of your model.

GPU monitoring

Before running the training job, we can also set up Amazon CloudWatch metrics to visualize the GPU utilization during training. It can be helpful to know whether the resources are being used optimally or potentially identify resource starvation and bottlenecks in the training process.

The relevant scripts to set up CloudWatch are located in the gpu-metrics folder. First, we create a Docker image with amazon-cloudwatch-agent and nvidia-smi. We can use the Dockerfile in the gpu-metrics folder to create this image. Assuming that the ECR registry is already set in the .env file from the previous step, we can build and push the image using build.sh and push.sh. After this, running the deploy.sh script automatically completes the setup. It launches a daemonset with amazon-cloudwatch-agent and pushes various metrics to CloudWatch. The GPU metrics appear under the CWAgent namespace on the CloudWatch console. The rest of the cluster metrics show under the ContainerInsights namespace.

Model training

All the scripts needed for PyTorch training are located in the elasticjob folder in the GitHub repo. Before launching the training job, we need to run the etcd server, which is used by the TEC for worker discovery and parameter exchange. The deploy.sh script in the elasticjob folder does exactly that.

To take advantage of EFA in p4d.24xlarge instances, we need to use a specific Docker image available in the Amazon ECR Public Gallery that supports NCCL communication through EFA. We just need to copy our training code to this Docker image. The Dockerfile under the samples folder creates an image to be used when running a training job on p4d instances. As always, we can use the build.sh and push.sh scripts in the folder to build and push the image.

The imagenet-efa.yaml file describes the training job. This .yaml file sets up the resources needed for running the training job and also mounts the persistent volume with the training data set up in the previous section.

A couple of things are worth pointing out here. The number of replicas should be set to the number of nodes available in the cluster. In our case, we set this to 3 because we had three p4d.24xlarge nodes. In the imagenet-efa.yaml file, the nvidia.com/gpu parameter under resources and nproc_per_node under args should be set to the number of GPUs per node, which in the case of p4d.24xlarge is 8. Also, the worker argument for the Python script sets the number of CPUs per process. We chose this to be 4 because, in our experiments, this provides optimal performance when running on p4d.24xlarge instances. These settings are necessary in order to maximize the use of all the hardware resources available in the cluster.
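The relationship between these settings can be checked with a little arithmetic. The sketch below is illustrative only: the function names are ours, and the per-process share assumes the default behavior of the PyTorch DistributedSampler, which rounds the share up so every process gets the same number of samples.

```python
import math


def world_size(replicas, nproc_per_node):
    """Total number of training processes (one per GPU) across the cluster."""
    return replicas * nproc_per_node


def loader_workers_per_node(nproc_per_node, workers_per_process):
    """Data-loading worker processes started on each node."""
    return nproc_per_node * workers_per_process


def samples_per_process(dataset_size, total_processes):
    """Samples each process sees per epoch when the data is split evenly."""
    return math.ceil(dataset_size / total_processes)


processes = world_size(replicas=3, nproc_per_node=8)
print(processes)                        # 24
print(loader_workers_per_node(8, 4))    # 32
print(samples_per_process(1000, processes))  # 42
```

With three p4d.24xlarge nodes, the job therefore runs 24 training processes, and each node starts 32 data-loading workers, which fits within the 96 vCPUs of that instance type.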

When the job is running, we can observe the GPU usage in CloudWatch for all the GPUs in the cluster. The following is an example from one of our training jobs with three p4d.24xlarge nodes in the cluster. Here we’ve selected one GPU from each node. With the settings mentioned earlier, the GPU usage is close to 100% during the training phase of the epoch for all of the nodes in the cluster.

For training a ResNet50 model using p3.8xlarge instances, we need exactly the same steps as described for the EfficientNet training using p4d.24xlarge. We can also use the same Docker image. As mentioned earlier, p3.8xlarge instances aren’t equipped with EFA. However, for the ResNet50 model, this is not a significant drawback. The imagenet-fsx.yaml script provided in the GitHub repository sets up the training job with appropriate resources for the p3.8xlarge node type. The job uses the same dataset from the FSx file system.

GPU scaling

We ran some experiments to observe how the training time scales for the EfficientNet-B7 model by increasing the number of GPUs. To do this, we changed the number of replicas from 1 to 3 in our training .yaml file for each training run. We only observed the time for a single epoch while using the complete ImageNet dataset. The following figure shows the results for our GPU scaling experiment. The red dotted line represents the expected scaling, that is, how the training time should decrease relative to the 8-GPU run as the number of GPUs increases. As we can see, the measured scaling is quite close to what is expected.
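To make the comparison against the expected curve concrete, here is a minimal sketch. It assumes that the expected curve is ideal inverse-proportional scaling from the 8-GPU baseline (the post doesn't state the formula), and the epoch times used are hypothetical, not measurements from our runs.

```python
def ideal_epoch_time(baseline_time, baseline_gpus, gpus):
    """Epoch time under ideal scaling: time shrinks in proportion to GPU count."""
    return baseline_time * baseline_gpus / gpus


def scaling_efficiency(baseline_time, baseline_gpus, measured_time, gpus):
    """Ratio of ideal to measured epoch time (1.0 means perfect scaling)."""
    return ideal_epoch_time(baseline_time, baseline_gpus, gpus) / measured_time


# Hypothetical timings in minutes: 60 on 8 GPUs, 21 on 24 GPUs.
ideal_24 = ideal_epoch_time(60.0, 8, 24)
efficiency_24 = scaling_efficiency(60.0, 8, 21.0, 24)
print(ideal_24, round(efficiency_24, 3))
```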

Similarly, we obtained the GPU scaling plot for ResNet50 training on p3.8xlarge instances. For this case, we changed the replicas in our .yaml file from 1 to 4. The results of this experiment are shown in the following figure.

Clean up

It’s important to spin down resources after model training in order to avoid costs associated with running idle instances. With each script that creates resources, the GitHub repo provides a matching script to delete them. To clean up our setup, we must delete the FSx file system before deleting the cluster because it’s associated with a subnet in the cluster’s VPC. To delete the FSx file system, we just need to run the following command (from inside the fsx folder):

kubectl delete -f fsx-pvc-dynamic.yaml
./delete.sh

Note that this will not only delete the persistent volume, it will also delete the FSx file system, and all the data on the file system will be lost. When this step is complete, we can delete the cluster by using the following script in the eks folder:

./eks-delete.sh

This will delete all the existing pods, remove the cluster, and delete the VPC created in the beginning.

Conclusion

In this post, we detailed the steps needed for running PyTorch distributed data parallel model training on EKS clusters. This task may seem daunting, but the AWS DevOps for EKS project created by the ML Frameworks team at AWS provides all the necessary scripts and tools to simplify the process and make distributed model training easily accessible.

For more information on the technologies used in this post, visit Amazon EKS and Torch Distributed Elastic. We encourage you to apply the approach described here to your own distributed training use cases.

About the authors

Imran Younus is a Principal Solutions Architect for the ML Frameworks team at AWS. He focuses on large-scale machine learning and deep learning workloads across AWS services like Amazon EKS and AWS ParallelCluster. He has extensive experience in applications of Deep Learning in Computer Vision and Industrial IoT. Imran obtained his PhD in High Energy Particle Physics, where he was involved in analyzing experimental data at petabyte scales.

Alex Iankoulski is a full-stack software and infrastructure architect who likes to do deep, hands-on work. He is currently a Principal Solutions Architect for Self-managed Machine Learning at AWS. In his role he focuses on helping customers with containerization and orchestration of ML and AI workloads on container-powered AWS services. He is also the author of the open source Do framework and a Docker captain who loves applying container technologies to accelerate the pace of innovation while solving the world’s biggest challenges. During the past 10 years, Alex has worked on combating climate change, democratizing AI and ML, making travel safer, healthcare better, and energy smarter.

Learn How Amazon SageMaker Clarify Helps Detect Bias

Bias detection in data and model outcomes is a fundamental requirement for building responsible artificial intelligence (AI) and machine learning (ML) models. Unfortunately, detecting bias isn’t an easy task for the vast majority of practitioners due to the large number of ways in which it can be measured and different factors that can contribute to a biased outcome. For instance, an imbalanced sampling of the training data may result in a model that is less accurate for certain subsets of the data. Bias may also be introduced by the ML algorithm itself—even with a well-balanced training dataset, the outcomes might favor certain subsets of the data as compared to the others.

To detect bias, you must have a thorough understanding of different types of bias and the corresponding bias metrics. For example, at the time of this writing, Amazon SageMaker Clarify offers 21 different metrics to choose from.

In this post, we use an income prediction use case (predicting user incomes from input features like education and number of hours worked per week) to demonstrate different types of biases and the corresponding metrics in SageMaker Clarify. We also develop a framework to help you decide which metrics matter for your application.

Introduction to SageMaker Clarify

ML models are being increasingly used to help make decisions across a variety of domains, such as financial services, healthcare, education, and human resources. In many situations, it’s important to understand why the ML model made a specific prediction and also whether the predictions were impacted by bias.

SageMaker Clarify provides tools for both of these needs, but in this post we only focus on the bias detection functionality. To learn more about explainability, check out Explaining Bundesliga Match Facts xGoals using Amazon SageMaker Clarify.

SageMaker Clarify is a part of Amazon SageMaker, which is a fully managed service to build, train, and deploy ML models.

Examples of questions about bias

To ground the discussion, the following are some sample questions that ML builders and their stakeholders may have regarding bias. The list consists of some general questions that may be relevant for several ML applications, as well as questions about specific applications like document retrieval.

You might ask, given the groups of interest in the training data (for example, men vs. women), which metrics should I use to answer the following questions:

  • Does the group representation in the training data reflect the real world?
  • Do the target labels in the training data favor one group over the other by assigning it more positive labels?
  • Does the model have different accuracy for different groups?
  • In a model whose purpose is to identify qualified candidates for hiring, does the model have the same precision for different groups?
  • In a model whose purpose is to retrieve documents relevant to an input query, does the model retrieve relevant documents from different groups in the same proportion?

In the rest of this post, we develop a framework for how to consider answering these questions and others through the metrics available in SageMaker Clarify.

Use case and context

This post uses an existing example of a SageMaker Clarify job from the Fairness and Explainability with SageMaker Clarify notebook and explains the generated bias metric values. The notebook trains an XGBoost model on the UCI Adult dataset (Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science).

The ML task in this dataset is to predict whether a person has a yearly income of more or less than $50,000. The following table shows some instances along with their features. Measuring bias in income prediction is important because we could use these predictions to inform decisions like discount offers and targeted marketing.

Bias terminology

Before diving deeper, let’s review some essential terminology. For a complete list of terms, see Amazon SageMaker Clarify Terms for Bias and Fairness.

  • Label – The target feature that the ML model is trained to predict. An observed label refers to the label value observed in the data used to train or test the model. A predicted label is the value predicted by the ML model. Labels could be binary, and are often encoded as 0 and 1. We assume 1 to represent a favorable or positive label (for example, income more than or equal to $50,000), and 0 to represent an unfavorable or negative label. Labels could also consist of more than two values. Even in these cases, one or more of the values constitute favorable labels. For the sake of simplicity, this post only considers binary labels. For details on handling labels with more than two values and labels with continuous values (for example, in regression), see Amazon AI Fairness and Explainability Whitepaper.
  • Facet – A column or feature with respect to which bias is measured. In our example, the facet is sex and takes two values: woman and man, encoded as female and male in the data (this data is extracted from the 1994 Census and enforces a binary option). Although the post considers a single facet with only two values, for more complex cases involving multiple facets or facets with more than two values, see Amazon AI Fairness and Explainability Whitepaper.
  • Bias – A significant imbalance in the input data or model predictions across different facet values. What constitutes “significant” depends on your application. For most metrics, a value of 0 implies no imbalance. Bias metrics in SageMaker Clarify are divided into two categories:

    • Pretraining – When present, pretraining bias indicates imbalances in the data only.
    • Posttraining – Posttraining bias additionally considers the predictions of the models.

Let’s examine each category separately.

Pretraining bias

Pretraining bias metrics in SageMaker Clarify answer the following question: Do all facet values have equal (or similar) representation in the data? It’s important to inspect the data for pretraining bias because it may translate into posttraining bias in the model predictions. For instance, a model trained on imbalanced data where one facet value appears very rarely can exhibit substantially worse accuracy for that facet value. Equal representation can be calculated over the following:

  • The whole training data irrespective of the labels
  • The subset of the training data with positive labels only
  • Each label separately

The following figure provides a summary of how each metric fits into each of the three categories.

Some categories consist of more than one metric. The basic metrics (grey boxes) answer the question about bias in that category in the simplest form. Metrics in white boxes additionally cover special cases (for example, Simpson’s paradox) and user preferences (for example, focusing on certain parts of the population when computing predictive performance).

Facet value representation irrespective of labels

The only metric in this category is Class Imbalance (CI). The goal of this metric is to measure if all the facet values have equal representation in the data.

CI is the difference in the fraction of the data constituted by the two facet values. In our example dataset, for the facet sex, the breakdown (shown in the pie chart) shows that women constitute 32.4% of the training data, whereas men constitute 67.6%. As a result:

CI = 0.676 - 0.324 = 0.352

A severely high class imbalance could lead to worse predictive performance for the facet value with smaller representation.
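The CI calculation can be sketched as follows. The function name is an assumption, and the counts (676 and 324 per 1,000 rows) are hypothetical values chosen only to match the percentages quoted above.

```python
def class_imbalance(n_a, n_d):
    """Difference in the fractions of the data held by the two facet values."""
    return (n_a - n_d) / (n_a + n_d)


# Hypothetical counts matching 67.6% men and 32.4% women.
ci = class_imbalance(n_a=676, n_d=324)
print(round(ci, 3))
```

A value of 0 would mean both facet values are equally represented, and values near 1 or -1 mean that one facet value dominates the data.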

Facet value representation at the level of positive labels only

Another way to measure equal representation is to check whether all facet values contain a similar fraction of samples with positive observed labels. Positive labels consist of favorable outcomes (for example, loan granted, selected for the job), so analyzing positive labels separately helps assess if the favorable decisions are distributed evenly.

In our example dataset, the observed labels break down into positive and negative values, as shown in the following figure.

11.4% of all women and 31.4% of all men have the positive label (dark shaded region in the left and right bars). The Difference in Positive Proportions in Labels (DPL) measures this difference.

DPL = 0.314 - 0.114 = 0.20
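As an illustration, DPL can be computed directly from a label column and a facet column. The helper names and the tiny dataset below are made up for this sketch and are not taken from the Adult dataset.

```python
def positive_rate(labels, facets, facet_value):
    """Fraction of samples with the given facet value that have label 1."""
    selected = [y for y, f in zip(labels, facets) if f == facet_value]
    return sum(selected) / len(selected)


def dpl(labels, facets, facet_a, facet_d):
    """Difference in positive proportions between two facet values."""
    return (positive_rate(labels, facets, facet_a)
            - positive_rate(labels, facets, facet_d))


# Toy data: 3 of 4 men and 1 of 4 women have the positive label.
labels = [1, 0, 0, 1, 1, 0, 0, 1]
facets = ["male", "male", "female", "male",
          "female", "female", "female", "male"]
toy_dpl = dpl(labels, facets, facet_a="male", facet_d="female")
print(toy_dpl)
```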

The advanced metric in this category, Conditional Demographic Disparity in Labels (CDDL), measures the differences in the positive labels, but stratifies them with respect to another variable. This metric helps control for Simpson’s paradox, a case where a computation over the whole data shows bias, but the bias disappears when grouping the data with respect to some side-information.

The 1973 UC Berkeley Admissions Study provides an example. According to the data, men were admitted at a higher rate than women. However, when examined at the level of individual university departments, women were admitted at a similar or higher rate in each department. This observation can be explained by Simpson’s paradox, which arose here because women applied to schools that were more competitive. As a result, fewer women were admitted overall compared to men, even though school by school they were admitted at a similar or higher rate.
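The effect is easy to reproduce with a small example. The admission counts below are hypothetical, chosen only to show the paradox; they are not the Berkeley figures.

```python
# department -> group -> (admitted, applied); all numbers are hypothetical
applications = {
    "A (less competitive)": {"men": (48, 80), "women": (13, 20)},
    "B (more competitive)": {"men": (4, 20), "women": (20, 80)},
}


def rate(admitted, applied):
    """Admission rate for one group in one department."""
    return admitted / applied


def overall_rate(group):
    """Admission rate for a group pooled over all departments."""
    admitted = sum(dept[group][0] for dept in applications.values())
    applied = sum(dept[group][1] for dept in applications.values())
    return admitted / applied


for name, dept in applications.items():
    print(name, "men:", rate(*dept["men"]), "women:", rate(*dept["women"]))
print("overall men:", overall_rate("men"), "women:", overall_rate("women"))
```

Women have the higher admission rate in each department, yet the lower rate overall, because most of them applied to the more competitive department. Stratifying by department, as CDDL does with its grouping variable, exposes this.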

For more detail on how CDDL is computed, see Amazon AI Fairness and Explainability Whitepaper.

Facet value representation at the level of each label separately

Equality in representation can also be measured for each individual label, not just the positive label.

Metrics in this category compute the difference in the label distribution of different facet values. The label distribution for a facet value contains all the observed label values, along with the fraction of samples with that label’s value. For instance, in the figure showing label distributions, 88.6% of women have a negative observed label and 11.4% have a positive observed label. So the label distribution for women is [0.886, 0.114] and for men is [0.686, 0.314].

The basic metric in this category, Kullback-Leibler divergence (KL), measures this difference as:

KL = [0.686 x log(0.686/0.886)] + [0.314 x log(0.314/0.114)] = 0.143
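The following sketch reproduces this value. It assumes the natural logarithm, which is what yields 0.143 for these inputs; the function name is illustrative.

```python
import math


def kl_divergence(p, q):
    """KL divergence of p from q, using the natural logarithm."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue            # a term with p_i = 0 contributes nothing
        if q_i == 0:
            return math.inf     # p puts mass where q has none
        total += p_i * math.log(p_i / q_i)
    return total


label_dist_men = [0.686, 0.314]
label_dist_women = [0.886, 0.114]
kl = kl_divergence(label_dist_men, label_dist_women)
print(round(kl, 3))
```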

The advanced metrics in this category, Jensen-Shannon divergence (JS), Lp-norm (LP), Total Variation Distance (TVD), and Kolmogorov-Smirnov (KS), also measure the difference between the distributions but have different mathematical properties. Barring special cases, they will deliver insights similar to KL. For example, although the KL value can be infinity when a facet value contains no samples with a certain label (for example, no men with a negative label), JS avoids these infinite values. For more detail into these differences, see Amazon AI Fairness and Explainability Whitepaper.
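A minimal sketch of the contrast between KL and JS follows. It uses the textbook definition of JS (the average of the two KL divergences to the midpoint distribution), which is an assumption about the exact form SageMaker Clarify uses, and a hypothetical label distribution in which one facet value has no negative labels.

```python
import math


def kl_divergence(p, q):
    """KL divergence of p from q; infinite if q is zero where p is not."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue
        if q_i == 0:
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL to the midpoint distribution."""
    m = [(p_i + q_i) / 2 for p_i, q_i in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


women = [0.886, 0.114]
men_no_negatives = [0.0, 1.0]   # hypothetical: no men with a negative label

kl_value = kl_divergence(women, men_no_negatives)   # measured against men
js_value = js_divergence(women, men_no_negatives)
print(kl_value, js_value)
```

Because the midpoint distribution is nonzero wherever either input is nonzero, JS stays finite (at most log 2 with natural logarithms) and is symmetric in its two arguments.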

Relationship between DPL (Category 2) and distribution-based metrics of KL/JS/LP/TVD/KS (Category 3)

Distribution-based metrics are more naturally applicable to non-binary labels. For binary labels, owing to the fact that imbalance in the positive label can be used to compute the imbalance in negative label, the distribution metrics deliver the same insights as DPL. Therefore, you can just use DPL in such cases.

Posttraining bias

Posttraining bias metrics in SageMaker Clarify help us answer two key questions:

  • Are all facet values represented at a similar rate in positive (favorable) model predictions?
  • Does the model have similar predictive performance for all facet values?

The following figure shows how the metrics map to each of these questions. The second question can be further broken down depending on which label the performance is measured with respect to.

Equal representation in positive model predictions

Metrics in this category check if all facet values contain a similar fraction of samples with positive predicted label by the model. This class of metrics is very similar to the pretraining metrics of DPL and CDDL—the only difference is that this category considers predicted labels instead of observed labels.

In our example dataset, 4.5% of all women are assigned the positive label by the model, and 13.7% of all men are assigned the positive label.

The basic metric in this category, Difference in Positive Proportions in Predicted Labels (DPPL), measures the difference in the positive class assignments.

DPPL = 0.137 - 0.045 = 0.092

Notice how in the training data, a higher fraction of men had a positive observed label. In a similar manner, a higher fraction of men are assigned a positive predicted label.

Moving on to the advanced metrics in this category, Disparate Impact (DI) measures the same disparity in positive class assignments, but instead of the difference, it computes the ratio:

DI = 0.045 / 0.137 = 0.328

Both DI and DPPL convey qualitatively similar insights but differ at some corner cases. For instance, ratios tend to explode to very large numbers if the denominator is small. Take an example of the numbers 0.1 and 0.0001. The ratio is 0.1/0.0001 = 1,000 whereas the difference is 0.1 - 0.0001 ≈ 0.1. Unlike the other metrics where a value of 0 implies no bias, for DI, no bias corresponds to a value of 1.
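Both metrics are simple to compute from the two positive prediction rates. The function names in this sketch are illustrative; the inputs are the rates quoted above.

```python
def dppl(q_a, q_d):
    """Difference in positive proportions in predicted labels."""
    return q_a - q_d


def disparate_impact(q_a, q_d):
    """Ratio of positive prediction rates; 1 means no disparity."""
    return q_d / q_a


q_men, q_women = 0.137, 0.045
print(round(dppl(q_men, q_women), 3))
print(round(disparate_impact(q_men, q_women), 3))

# Corner case from the text: a tiny denominator inflates the ratio,
# while the difference stays close to the larger number.
corner_ratio = 0.1 / 0.0001
corner_difference = 0.1 - 0.0001
print(round(corner_ratio), round(corner_difference, 4))
```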

Conditional Demographic Disparity in Predicted Labels (CDDPL) measures the disparity in facet value representation in the positive label, but just like the pretraining metric of CDDL, it also controls for Simpson’s paradox.

Counterfactual Fliptest (FT) measures if similar samples from the two facet values receive similar decisions from the model. A model assigning different decisions to two samples that are similar to each other but differ in the facet values could be considered biased against the facet value being assigned the unfavorable (negative) label. Given the first facet value (women), it assesses whether similar members with the other facet value (men) have a different model prediction. Similar members are chosen based on the k-nearest neighbor algorithm.

Equal performance

The model predictions might have similar representation in positive labels from different facet values, yet the model performance on these groups might significantly differ. In many applications, having a similar predictive performance across different facet values can be desirable. The metrics in this category measure the difference in predictive performance across facet values.

Because the data can be sliced in many different ways based on the observed or predicted labels, there are many different ways to measure predictive performance.

Equal predictive performance irrespective of labels

You could consider the model performance on the whole data, irrespective of the observed or the predicted labels – that is, the overall accuracy.

The following figures show how the model classifies inputs from the two facet values in our example dataset. True negatives (TN) are cases where both the observed and predicted label were 0. False positives (FP) are misclassifications where the observed label was 0 but the predicted label was 1. True positives (TP) and false negatives (FN) are defined similarly.

For each facet value, the overall model performance, that is, the accuracy for that facet value, is:

Accuracy = (TN + TP) / (TN + FP + FN + TP)

With this formula, the accuracy for women is 0.930 and for men is 0.815. This leads to the only metric in this category, Accuracy Difference (AD):

AD = 0.815 - 0.930 = -0.115

AD = 0 means that the accuracy for both groups is the same. Larger (positive or negative) values indicate larger differences in accuracy.
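The accuracy formula and AD can be sketched as follows. The confusion-matrix counts are hypothetical (the post reports only the resulting accuracies), and the function names are illustrative.

```python
def accuracy(tn, fp, fn, tp):
    """Overall accuracy from the four confusion-matrix counts."""
    return (tn + tp) / (tn + fp + fn + tp)


def accuracy_difference(counts_a, counts_d):
    """Accuracy for the first facet value minus accuracy for the second."""
    return accuracy(**counts_a) - accuracy(**counts_d)


# Hypothetical counts: accuracy 0.80 for facet a and 0.90 for facet d.
facet_a = {"tn": 70, "fp": 5, "fn": 15, "tp": 10}
facet_d = {"tn": 85, "fp": 1, "fn": 9, "tp": 5}
ad = accuracy_difference(facet_a, facet_d)
print(round(ad, 3))
```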

Equal performance on positive labels only

You could restrict the model performance analysis to positive labels only. For instance, if the application is about detecting defects on an assembly line, it may be desirable to check that non-defective parts (positive label) of different kinds (facet values) are classified as non-defective at the same rate. This quantity is referred to as recall, or true positive rate:

Recall = TP / (TP + FN)

In our example dataset, the recall for women is 0.389, and the recall for men is 0.425. This leads to the basic metric in this category, the Recall Difference (RD):

RD = 0.425 - 0.389 = 0.036

Now let’s consider the three advanced metrics in this category, see which user preferences they encode, and how they differ from the basic metric of RD.

First, instead of measuring the performance on the positive observed labels, you could measure it on the positive predicted labels. Given a facet value, such as women, and all the samples with that facet value that are predicted to be positive by the model, how many are actually correctly classified as positive? This quantity is referred to as acceptance rate (AR), or precision:

AR = TP / (TP + FP)

In our example, the AR for women is 0.977, and the AR for men is 0.970. This leads to the Difference in Acceptance Rate (DAR):

DAR = 0.970 - 0.977 = -0.007

Another way to measure bias is by combining the previous two metrics and measuring how many more positive predictions the models assign to a facet value as compared to the observed positive labels. SageMaker Clarify measures this advantage by the model as the ratio between the number of observed positive labels for that facet value, and the number of predicted positive labels, and refers to it as conditional acceptance (CA):

CA = (TP + FN) / (TP + FP)

In our example, the CA for women is 2.510 and for men is 2.283. The difference in CA leads to the final metric in this category, Difference in Conditional Acceptance (DCA):

DCA = 2.283 - 2.510 = -0.227
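The three quantities on positive labels are related: conditional acceptance equals acceptance rate divided by recall, which is consistent with the values above (for women, 0.977 / 0.389 ≈ 2.51). The following sketch computes all three and their differences. The confusion counts are hypothetical and the function names are illustrative.

```python
def recall(tp, fn):
    """True positive rate: correct positives among observed positives."""
    return tp / (tp + fn)


def acceptance_rate(tp, fp):
    """Precision: correct positives among predicted positives."""
    return tp / (tp + fp)


def conditional_acceptance(tp, fn, fp):
    """Observed positives divided by predicted positives."""
    return (tp + fn) / (tp + fp)


def positive_label_differences(a, d):
    """RD, DAR, and DCA, each computed as facet a minus facet d."""
    return {
        "RD": recall(a["tp"], a["fn"]) - recall(d["tp"], d["fn"]),
        "DAR": acceptance_rate(a["tp"], a["fp"]) - acceptance_rate(d["tp"], d["fp"]),
        "DCA": conditional_acceptance(a["tp"], a["fn"], a["fp"])
               - conditional_acceptance(d["tp"], d["fn"], d["fp"]),
    }


# Hypothetical confusion counts for the two facet values.
facet_a = {"tp": 40, "fn": 60, "fp": 10}
facet_d = {"tp": 30, "fn": 70, "fp": 10}
diffs = positive_label_differences(facet_a, facet_d)
print({name: round(value, 3) for name, value in diffs.items()})
```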

Equal performance on negative labels only

In a manner similar to positive labels, bias can also be computed as the performance difference on the negative labels. Considering negative labels separately can be important in certain applications. For instance, in our defect detection example, we might want to detect defective parts (negative label) of different kinds (facet value) at the same rate.

The basic metric in this category, specificity, is analogous to the recall (true positive rate) metric. Specificity computes the accuracy of the model on samples with this facet value that have an observed negative label:

Specificity = TN / (TN + FP)

In our example (see the confusion tables), the specificity for women and men is 0.999 and 0.994, respectively. Consequently, the Specificity Difference (SD) is:

SD = 0.994 - 0.999 = -0.005

Moving on, just like the acceptance rate metric, the analogous quantity for negative labels—the rejection rate (RR)—is:

RR = TN / (TN + FN)

The RR for women is 0.927 and for men is 0.791, leading to the Difference in Rejection Rate (DRR) metric:

DRR = 0.927 - 0.791 = 0.136

Finally, the negative label analogue of conditional acceptance, the conditional rejection (CR), is the ratio between the number of observed negative labels for that facet value, and the number of predicted negative labels:

CR = (TN + FP) / (TN + FN)

The CR for women is 0.928 and for men is 0.796. The final metric in this category is Difference in Conditional Rejection (DCR):

DCR = 0.928 - 0.796 = 0.132
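The negative-label quantities mirror the positive-label ones, including the relation that conditional rejection equals rejection rate divided by specificity (for women, 0.927 / 0.999 ≈ 0.928). This sketch computes the per-facet values only and leaves the subtraction order to the reader. The counts are hypothetical and the function name is illustrative.

```python
def negative_label_rates(tn, fp, fn):
    """Specificity, rejection rate, and conditional rejection for one facet value."""
    return {
        "specificity": tn / (tn + fp),             # correct among observed negatives
        "rejection_rate": tn / (tn + fn),          # correct among predicted negatives
        "conditional_rejection": (tn + fp) / (tn + fn),
    }


# Hypothetical confusion counts for the two facet values.
rates_a = negative_label_rates(tn=80, fp=20, fn=20)
rates_d = negative_label_rates(tn=90, fp=10, fn=30)
print(rates_a)
print(rates_d)
```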

Equal performance on positive vs. negative labels

SageMaker Clarify combines the previous two categories by considering the model performance ratio on the positive and negative labels. Specifically, for each facet value, SageMaker Clarify computes the ratio between false negatives (FN) and false positives (FP). In our example, the FN/FP ratio for women is 679/10 = 67.9 and for men is 3678/84 = 43.786. This leads to the Treatment Equality (TE) metric, which measures the difference between the FN/FP ratios of the two facet values:

TE = 67.9 - 43.786 = 24.114
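The TE value can be reproduced from the FN and FP counts quoted above. The function name is illustrative, and returning infinity when there are no false positives is a choice made for this sketch, not documented SageMaker Clarify behavior.

```python
import math


def fn_fp_ratio(fn, fp):
    """Ratio of false negatives to false positives for one facet value."""
    if fp == 0:
        return math.inf   # sketch-only convention for an undefined ratio
    return fn / fp


ratio_women = fn_fp_ratio(fn=679, fp=10)
ratio_men = fn_fp_ratio(fn=3678, fp=84)
te = ratio_women - ratio_men
print(round(ratio_women, 3), round(ratio_men, 3), round(te, 3))
```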

The following screenshot shows how you can use SageMaker Clarify with Amazon SageMaker Studio to show the values as well as ranges and short descriptions of different bias metrics.

Questions about bias: Which metrics to start with?

Recall the sample questions about bias at the start of this post. Having gone through the metrics from different categories, consider the questions again. To answer the first question, which concerns the representations of different groups in the training data, you could start with the Class Imbalance (CI) metric. Similarly, for the remaining questions, you can start by looking into Difference in Positive Proportions in Labels (DPL), Accuracy Difference (AD), Difference in Acceptance Rate (DAR), and Recall Difference (RD), respectively.

Bias without facet values

For ease of exposition, this description of posttraining metrics excluded the Generalized Entropy Index (GE) metric. This metric measures bias without considering the facet value, and can be helpful in assessing how the model errors are distributed. For details, refer to Generalized entropy (GE).

Conclusion

In this post, you saw how the 21 different metrics in SageMaker Clarify measure bias at different stages of the ML pipeline. You learned about various metrics via an income prediction use case, how to choose metrics for your use case, and which ones you could start with.

Get started with your responsible AI journey by assessing bias in your ML models by using the demo notebook Fairness and Explainability with SageMaker Clarify. You can find the detailed documentation for SageMaker Clarify, including the formal definition of metrics, at What Is Fairness and Model Explainability for Machine Learning Predictions. For the open-source implementation of the bias metrics, refer to the aws-sagemaker-clarify GitHub repository. For a detailed discussion including limitations, refer to Amazon AI Fairness and Explainability Whitepaper.


About the authors

Bilal Zafar is an Applied Scientist at AWS, working on Fairness, Explainability and Security in Machine Learning.

Denis V. Batalov is a Solutions Architect for AWS, specializing in Machine Learning. He’s been with Amazon since 2005. Denis has a PhD in the field of AI. Follow him on Twitter: @dbatalov.

Michele Donini is a Sr Applied Scientist at AWS. He leads a team of scientists working on Responsible AI and his research interests are Algorithmic Fairness and Explainable Machine Learning.

Create a batch recommendation pipeline using Amazon Personalize with no code

With personalized content more likely to drive customer engagement, businesses continuously seek to provide tailored content based on their customers’ profiles and behavior. Recommendation systems in particular seek to predict the preference an end-user would give to an item. Some common use cases include product recommendations on online retail stores, personalizing newsletters, generating music playlist recommendations, or even discovering similar content on online media services.

However, it can be challenging to create an effective recommendation system due to complexities in model training, algorithm selection, and platform management. Amazon Personalize enables developers to improve customer engagement through personalized product and content recommendations with no machine learning (ML) expertise required. Developers can start to engage customers right away by using captured user behavior data. Behind the scenes, Amazon Personalize examines this data, identifies what is meaningful, selects the right algorithms, trains and optimizes a personalization model that is customized for your data, and provides recommendations via an API endpoint.

Although providing recommendations in real time can help boost engagement and satisfaction, sometimes this might not actually be required, and performing this in batch on a scheduled basis can simply be a more cost-effective and manageable option.

This post shows you how to use AWS services to not only create recommendations but also operationalize a batch recommendation pipeline. We walk through the end-to-end solution without a single line of code. We discuss two topics in detail:

Solution overview

In this solution, we use the MovieLens dataset. This dataset includes 86,000 ratings of movies from 2,113 users. We attempt to use this data to generate recommendations for each of these users.

Data preparation is very important to ensure we get customer behavior data into a format that is ready for Amazon Personalize. The architecture described in this post uses AWS Glue, a serverless data integration service, to perform the transformation of raw data into a format that is ready for Amazon Personalize to consume. The solution uses Amazon Personalize to create batch recommendations for all users by using a batch inference. We then use a Step Functions workflow so that the automated workflow can be run by calling Amazon Personalize APIs in a repeatable manner.

The following diagram demonstrates this solution.

We will build this solution with the following steps:

  1. Build a data transformation job to transform our raw data using AWS Glue.
  2. Build an Amazon Personalize solution with the transformed dataset.
  3. Build a Step Functions workflow to orchestrate the generation of batch inferences.

Prerequisites

You need the following for this walkthrough:

Build a data transformation job to transform raw data with AWS Glue

With Amazon Personalize, input data needs to have a specific schema and file format. Data from interactions between users and items must be in CSV format with specific columns, whereas the list of users for which you want to generate recommendations must be in JSON format. In this section, we use AWS Glue Studio to transform raw input data into the required structures and format for Amazon Personalize.

AWS Glue Studio provides a graphical interface that is designed for easy creation and running of extract, transform, and load (ETL) jobs. You can visually create data transformation workloads through simple drag-and-drop operations.

We first prepare our source data in Amazon Simple Storage Service (Amazon S3), then we transform the data without code.

  1. On the Amazon S3 console, create an S3 bucket with three folders: raw, transformed, and curated.
  2. Download the MovieLens dataset and upload the uncompressed file named user_ratingmovies-timestamp.dat to your bucket under the raw folder.
  3. On the AWS Glue Studio console, choose Jobs in the navigation pane.
  4. Select Visual with a source and target, then choose Create.
  5. Choose the first node called Data source – S3 bucket. This is where we specify our input data.
  6. On the Data source properties tab, select S3 location and browse to your uploaded file.
  7. For Data format, choose CSV, and for Delimiter, choose Tab.
  8. We can choose the Output schema tab to verify that the schema has inferred the columns correctly.
  9. If the schema doesn’t match your expectations, choose Edit to edit the schema.

Next, we transform this data to follow the schema requirements for Amazon Personalize.

  1. Choose the Transform – Apply Mapping node and, on the Transform tab, update the target key and data types.
    Amazon Personalize, at minimum, expects the following structure for the interactions dataset:
    • user_id (string)
    • item_id (string)
    • timestamp (long, in Unix epoch time format)

In this example, we exclude the poorly rated movies in the dataset.

  1. To do so, remove the last node called S3 bucket and add a filter node on the Transform tab.
  2. Choose Add condition and filter out data where rating < 3.5.

We now write the output back to Amazon S3.

  1. Expand the Target menu and choose Amazon S3.
  2. For S3 Target Location, choose the folder named transformed.
  3. Choose CSV as the format and suffix the Target Location with interactions/.

Next, we output a list of users that we want to get recommendations for.

  1. Choose the ApplyMapping node again, and then expand the Transform menu and choose ApplyMapping.
  2. Drop all fields except for user_id and rename that field to userId. Amazon Personalize expects that field to be named userId.
  3. Expand the Target menu again and choose Amazon S3.
  4. This time, choose JSON as the format, and then choose the transformed S3 folder and suffix it with batch_users_input/.

This produces a JSON list of users as input for Amazon Personalize. We should now have a diagram that looks like the following.

AWS Glue Studio - Entire Workflow
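
As a rough illustration of what the batch_users_input/ output contains, the following sketch builds the same kind of file in plain Python: one JSON object per line, each with a single userId element. The function name is hypothetical, and the de-duplication of user IDs is an addition of this sketch, not something the visual job above performs.

```python
import csv
import io
import json


def build_batch_users_input(interactions_csv: str) -> str:
    """Return JSON Lines text with one {"userId": ...} object per distinct user."""
    reader = csv.DictReader(io.StringIO(interactions_csv))
    seen = set()
    lines = []
    for row in reader:
        user_id = row["user_id"]
        if user_id in seen:
            continue  # assumption of this sketch: list each user once
        seen.add(user_id)
        lines.append(json.dumps({"userId": user_id}) + "\n")
    return "".join(lines)


# Made-up interactions, for illustration only.
users_jsonl = build_batch_users_input(
    "user_id,item_id,timestamp\n75,32,1\n75,40,2\n78,5,3\n"
)
print(users_jsonl)
```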

We are now ready to run our transform job.

  1. On the IAM console, create a role called glue-service-role and attach the following managed policies:
    • AWSGlueServiceRole
    • AmazonS3FullAccess

For more information on how to create IAM service roles, refer to Creating a role to delegate permissions to an AWS service.
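
If you prefer to script the role instead of using the console, a minimal sketch might look like the following. It assumes the caller passes in an IAM client created with boto3.client("iam"); the function name create_glue_role is hypothetical, and the trust policy shown is the usual one that lets AWS Glue assume the role.

```python
import json

ROLE_NAME = "glue-service-role"

# AWS managed policies named in the step above.
MANAGED_POLICY_ARNS = [
    "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
]

# Trust policy that allows the AWS Glue service to assume the role.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}


def create_glue_role(iam_client) -> str:
    """Create the role, attach the managed policies, and return the role ARN.

    iam_client is expected to be boto3.client("iam"), created by the caller.
    """
    role = iam_client.create_role(
        RoleName=ROLE_NAME,
        AssumeRolePolicyDocument=json.dumps(TRUST_POLICY),
    )
    for policy_arn in MANAGED_POLICY_ARNS:
        iam_client.attach_role_policy(RoleName=ROLE_NAME, PolicyArn=policy_arn)
    return role["Role"]["Arn"]
```

AmazonS3FullAccess keeps the walkthrough simple; a narrower policy scoped to your bucket is usually preferable outside a demo.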

  1. Navigate back to your AWS Glue Studio job, and choose the Job details tab.
  2. Set the job name as batch-personalize-input-transform-job.
  3. Choose the newly created IAM role.
  4. Keep the default values for everything else.
    AWS Glue Studio - Job details
  5. Choose Save.
  6. When you’re ready, choose Run and monitor the job in the Runs tab.
  7. When the job is complete, navigate to the Amazon S3 console to validate that your output file has been successfully created.

We have now shaped our data into the format and structure that Amazon Personalize requires. The transformed dataset should have the following fields and format:

  • Interactions dataset – CSV format with fields USER_ID, ITEM_ID, TIMESTAMP
  • User input dataset – JSON format with element userId

Build an Amazon Personalize solution with the transformed dataset

With our interactions dataset and user input data in the right format, we can now create our Amazon Personalize solution. In this section, we create our dataset group, import our data, and then create a batch inference job. A dataset group is a container that organizes related Amazon Personalize resources, such as datasets and solutions.

  1. On the Amazon Personalize console, choose Create dataset group.
  2. For Domain, select Custom.
  3. Choose Create dataset group and continue.
    Amazon Personalize - create dataset group

Next, create the interactions dataset.

  1. Enter a dataset name and select Create new schema.
  2. Choose Create dataset and continue.
    Amazon Personalize - create interactions dataset
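
The schema created in this step is an Avro-style JSON document. A minimal sketch of an interactions schema with only the three required fields is shown below, built as a Python dictionary so it can be serialized and pasted into the console; treat it as an illustration and compare it with the default schema that the console proposes.

```python
import json

# Minimal interactions schema with the three required fields.
INTERACTIONS_SCHEMA = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "TIMESTAMP", "type": "long"},
    ],
    "version": "1.0",
}

schema_json = json.dumps(INTERACTIONS_SCHEMA, indent=2)
print(schema_json)
```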

We now import the interactions data that we had created earlier.

  1. Navigate to the S3 bucket in which we created our interactions CSV dataset.
  2. On the Permissions tab, add the following bucket access policy so that Amazon Personalize has access. Update the policy to include your bucket name.
    {
       "Version":"2012-10-17",
       "Id":"PersonalizeS3BucketAccessPolicy",
       "Statement":[
          {
             "Sid":"PersonalizeS3BucketAccessPolicy",
             "Effect":"Allow",
             "Principal":{
                "Service":"personalize.amazonaws.com"
             },
             "Action":[
                "s3:GetObject",
                "s3:ListBucket",
                "s3:PutObject"
             ],
             "Resource":[
                "arn:aws:s3:::<your-bucket-name>",
                "arn:aws:s3:::<your-bucket-name>/*"
             ]
          }
       ]
    }
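
To avoid hand-editing mistakes in the resource ARNs, the policy can also be generated from the bucket name. The helper below is a hypothetical sketch that produces the same policy document and rejects bucket names containing whitespace.

```python
import json


def personalize_bucket_policy(bucket_name: str) -> str:
    """Return the bucket policy JSON that grants Amazon Personalize access."""
    if not bucket_name or any(ch.isspace() for ch in bucket_name):
        raise ValueError("bucket name must be non-empty and contain no whitespace")
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    policy = {
        "Version": "2012-10-17",
        "Id": "PersonalizeS3BucketAccessPolicy",
        "Statement": [
            {
                "Sid": "PersonalizeS3BucketAccessPolicy",
                "Effect": "Allow",
                "Principal": {"Service": "personalize.amazonaws.com"},
                "Action": ["s3:GetObject", "s3:ListBucket", "s3:PutObject"],
                # The bucket itself (for ListBucket) and every object in it.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            }
        ],
    }
    return json.dumps(policy, indent=2)


# "my-example-bucket" is a placeholder name.
print(personalize_bucket_policy("my-example-bucket"))
```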

Navigate back to Amazon Personalize and choose Create your dataset import job. Our interactions dataset should now be importing into Amazon Personalize. Wait for the import job to complete with a status of Active before continuing to the next step. This should take approximately 8 minutes.

  1. On the Amazon Personalize console, choose Overview in the navigation pane and choose Create solution.
    Amazon Personalize - Dashboard
  2. Enter a solution name.
  3. For Solution type, choose Item recommendation.
  4. For Recipe, choose the aws-user-personalization recipe.
  5. Choose Create and train solution.
    Amazon Personalize - create solution

The solution now trains against the interactions dataset that was imported with the user personalization recipe. Monitor the status of this process under Solution versions. Wait for it to complete before proceeding. This should take approximately 20 minutes.
Amazon Personalize - Status
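
Instead of watching the console, you could poll the status from a script. The helper below is a generic sketch: wait_until_active is a hypothetical name, and the status getter is supplied by the caller, for example a lambda that wraps the boto3 Personalize describe_solution_version call and reads the status field from its response.

```python
import time


def wait_until_active(get_status, poll_seconds=60, max_polls=60, sleep=time.sleep):
    """Poll get_status() until it returns ACTIVE, fails, or the attempts run out."""
    for _ in range(max_polls):
        status = get_status()
        if status == "ACTIVE":
            return status
        if status.endswith("FAILED"):
            raise RuntimeError(f"resource creation failed with status {status!r}")
        sleep(poll_seconds)  # still pending or in progress
    raise TimeoutError("resource did not become ACTIVE in time")


# Example wiring, assuming a boto3 Personalize client created by the caller:
#   wait_until_active(
#       lambda: personalize.describe_solution_version(
#           solutionVersionArn=arn)["solutionVersion"]["status"])
```

The same helper works for the dataset import job if the getter wraps describe_dataset_import_job instead.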

We now create our batch inference job, which generates recommendations for each of the users present in the JSON input.

  1. In the navigation pane, under Custom resources, choose Batch inference jobs.
  2. Enter a job name, and for Solution, choose the solution created earlier.
  3. Choose Create batch inference job.
    Amazon Personalize - create batch inference job
  4. For Input data configuration, enter the S3 path of where the batch_users_input file is located.

This is the JSON file that contains userId.

  1. For Output data configuration path, choose the curated path in S3.
  2. Choose Create batch inference job.

This process takes approximately 30 minutes. When the job is finished, recommendations for each of the users specified in the user input file are saved in the S3 output location.
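
The console form above corresponds to a single API request. As a hedged sketch, the following helper assembles the parameters that the boto3 Personalize create_batch_inference_job call accepts; the function name and all ARNs and paths are placeholders, and numResults is set to 25 to mirror the default mentioned in this post.

```python
def batch_inference_job_params(job_name, solution_version_arn, input_path,
                               output_path, role_arn, num_results=25):
    """Build keyword arguments for a batch inference job request."""
    for path in (input_path, output_path):
        if not path.startswith("s3://"):
            raise ValueError(f"expected an s3:// path, got {path!r}")
    return {
        "jobName": job_name,
        "solutionVersionArn": solution_version_arn,
        "numResults": num_results,
        "jobInput": {"s3DataSource": {"path": input_path}},
        "jobOutput": {"s3DataDestination": {"path": output_path}},
        "roleArn": role_arn,
    }


# Placeholder values, for illustration only.
params = batch_inference_job_params(
    job_name="batch-recommendations",
    solution_version_arn="<solution-version-arn>",
    input_path="s3://<your-bucket-name>/transformed/batch_users_input/",
    output_path="s3://<your-bucket-name>/curated/",
    role_arn="<personalize-role-arn>",
)
# With a boto3 client: personalize.create_batch_inference_job(**params)
```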

We have successfully generated a set of recommendations for all of our users. However, we have only implemented the solution using the console so far. To make sure that this batch inferencing runs regularly with the latest set of data, we need to build an orchestration workflow. In the next section, we show you how to create an orchestration workflow using Step Functions.

Build a Step Functions workflow to orchestrate the batch inference workflow

To orchestrate your pipeline, complete the following steps:

  1. On the Step Functions console, choose Create State Machine.
  2. Select Design your workflow visually, then choose Next.
    AWS Step Functions - Create workflow
  3. Drag the CreateDatasetImportJob node from the left (you can search for this node in the search box) onto the canvas.
  4. Choose the node, and you should see the configuration API parameters on the right. Record the ARN.
  5. Enter your own values in the API Parameters text box.

This calls the CreateDatasetImportJob API with the parameter values that you specify.

AWS Step Functions Workflow

  1. Drag the CreateSolutionVersion node onto the canvas.
  2. Update the API parameters with the ARN of the solution that you noted down.

This creates a new solution version with the newly imported data by calling the CreateSolutionVersion API.

  1. Drag the CreateBatchInferenceJob node onto the canvas and similarly update the API parameters with the relevant values.

Make sure that you use the $.SolutionVersionArn syntax to retrieve the solution version ARN parameter from the previous step. These API parameters are passed to the CreateBatchInferenceJob API.

AWS Step Functions Workflow

We need to build wait logic into the Step Functions workflow to make sure the batch inference job finishes before the workflow completes.

  1. Find and drag in a Wait node.
  2. In the configuration for Wait, enter 300 seconds.

This is an arbitrary value; you should alter this wait time according to your specific use case.

  1. Choose the CreateBatchInferenceJob node again and navigate to the Error handling tab.
  2. For Catch errors, enter Personalize.ResourceInUseException.
  3. For Fallback state, choose Wait.

This step enables us to periodically check the status of the job and it only exits the loop when the job is complete.

  1. For ResultPath, enter $.errorMessage.

This effectively means that when the “resource in use” exception is received, the workflow waits for the configured interval (300 seconds in this example) before trying again with the same inputs.

AWS Step Functions Workflow
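
For reference, the workflow assembled in the visual editor corresponds roughly to the Amazon States Language definition sketched below, expressed as a Python dictionary so it can be serialized to JSON. This is an approximation rather than the exact definition the editor generates: every ARN, job name, and S3 path is a placeholder, and the parameter casing follows the convention of Step Functions AWS SDK integrations as an assumption.

```python
import json

SDK = "arn:aws:states:::aws-sdk:personalize:"  # AWS SDK integration prefix

STATE_MACHINE = {
    "StartAt": "CreateDatasetImportJob",
    "States": {
        "CreateDatasetImportJob": {
            "Type": "Task",
            "Resource": SDK + "createDatasetImportJob",
            "Parameters": {
                "JobName": "<import-job-name>",
                "DatasetArn": "<dataset-arn>",
                "DataSource": {"DataLocation": "s3://<your-bucket-name>/transformed/interactions/"},
                "RoleArn": "<personalize-role-arn>",
            },
            "Next": "CreateSolutionVersion",
        },
        "CreateSolutionVersion": {
            "Type": "Task",
            "Resource": SDK + "createSolutionVersion",
            "Parameters": {"SolutionArn": "<solution-arn>"},
            "Next": "CreateBatchInferenceJob",
        },
        "CreateBatchInferenceJob": {
            "Type": "Task",
            "Resource": SDK + "createBatchInferenceJob",
            "Parameters": {
                "JobName": "<batch-job-name>",
                # Read the ARN returned by the previous state.
                "SolutionVersionArn.$": "$.SolutionVersionArn",
                "JobInput": {"S3DataSource": {"Path": "s3://<your-bucket-name>/transformed/batch_users_input/"}},
                "JobOutput": {"S3DataDestination": {"Path": "s3://<your-bucket-name>/curated/"}},
                "RoleArn": "<personalize-role-arn>",
            },
            # On "resource in use", keep the input and go wait before retrying.
            "Catch": [
                {
                    "ErrorEquals": ["Personalize.ResourceInUseException"],
                    "ResultPath": "$.errorMessage",
                    "Next": "Wait",
                }
            ],
            "End": True,
        },
        "Wait": {"Type": "Wait", "Seconds": 300, "Next": "CreateBatchInferenceJob"},
    },
}

definition_json = json.dumps(STATE_MACHINE, indent=2)
print(definition_json)
```

Because the Catch block writes the error to $.errorMessage instead of replacing the state input, the retry after the Wait state still sees the original SolutionVersionArn.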

  1. Choose Save, and then choose Start the execution.

We have successfully orchestrated our batch recommendation pipeline for Amazon Personalize. As an optional step, you can use Amazon EventBridge to schedule a trigger of this workflow on a regular basis. For more details, refer to EventBridge (CloudWatch Events) for Step Functions execution status changes.
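
As one possible way to set up that schedule programmatically, the sketch below creates an EventBridge rule with a rate expression and points it at the state machine. It assumes the caller supplies a client created with boto3.client("events"); the function name, rule name, target ID, and daily rate are hypothetical choices, and the role must allow EventBridge to start executions of the state machine.

```python
def schedule_workflow(events_client, state_machine_arn, role_arn,
                      rule_name="batch-personalize-schedule",
                      schedule="rate(1 day)"):
    """Create a scheduled rule and register the state machine as its target.

    events_client is expected to be boto3.client("events"), created by the caller.
    """
    rule = events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=schedule,
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        Targets=[
            {
                "Id": "batch-personalize-state-machine",
                "Arn": state_machine_arn,
                "RoleArn": role_arn,
            }
        ],
    )
    return rule["RuleArn"]
```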

Clean up

To avoid incurring future charges, delete the resources that you created for this walkthrough.

Conclusion

In this post, we demonstrated how to create a batch recommendation pipeline by using a combination of AWS Glue, Amazon Personalize, and Step Functions, without needing a single line of code or ML experience. We used AWS Glue to prep our data into the format that Amazon Personalize requires. Then we used Amazon Personalize to import the data, create a solution with a user personalization recipe, and create a batch inferencing job that generates a default of 25 recommendations for each user, based on past interactions. We then orchestrated these steps using Step Functions so that we can run these jobs automatically.

As a next step, consider exploring user segmentation, one of the newer recipes in Amazon Personalize, which creates user segments for each row of the input data. For more details, refer to Getting batch recommendations and user segments.


About the author

Maxine Wee

Maxine Wee is an AWS Data Lab Solutions Architect. Maxine works with customers on their use cases, designs solutions to solve their business problems, and guides them through building scalable prototypes. Prior to her journey with AWS, Maxine helped customers implement BI, data warehousing, and data lake projects in Australia.
