Amazon Polly breathes life into text by converting it into lifelike speech. This empowers developers and businesses to create applications that can converse in real time, thereby offering an enhanced interactive experience. Text-to-speech (TTS) in Amazon Polly supports a variety of languages and locales, which enables you to perform TTS conversion according to your preferences. Multiple factors guide this choice, such as geographic location and language locales.
Amazon Polly uses advanced deep learning technologies to synthesize text to speech in real time in various output formats, such as MP3, Ogg Vorbis, JSON, or PCM, across standard and neural engines. Speech Synthesis Markup Language (SSML) support in Amazon Polly lets you further customize speech with options such as controlling speech rate and volume, adding pauses, and emphasizing certain words or phrases.
In today’s world, businesses continue to expand across multiple geographic locations, and they’re continuously looking for mechanisms to improve personalized end-user engagement. For instance, you may require accurate pronunciation of certain words in a specific style pertaining to different geographical locations. Your business may also need to pronounce certain words and phrases in certain ways depending on their intended meaning. You can achieve this with the help of SSML tags provided by Amazon Polly.
This post aims to assist you in customizing pronunciation when dealing with a truly global customer base.
Modify pronunciation using phonemes
A phoneme can be considered the smallest unit of speech. The <phoneme> SSML tag in Amazon Polly helps customize pronunciation based on phonemes using the International Phonetic Alphabet (IPA) or the Extended Speech Assessment Methods Phonetic Alphabet (X-SAMPA). X-SAMPA is a representation of IPA in ASCII encoding. Phoneme tags are available and fully supported in both the standard and neural TTS engines. For example, the word “lead” can be pronounced as the present tense verb, or it can refer to the chemical element lead. We discuss this example later in this post.
International Phonetic Alphabet
The IPA is used to portray sounds across different languages. For a list of phonemes Amazon Polly supports, refer to Phoneme and Viseme Tables for Supported Languages.
By default, Amazon Polly picks one pronunciation for a word that can be spoken in more than one way. Let’s use the example of the word “lead,” which is pronounced differently when it refers to the chemical element than when it is the present tense verb. When we provide the word “lead” as input without any customizing SSML tags, Amazon Polly speaks it in the present tense verb form.
To return the pronunciation of the chemical element lead (which sounds the same as “led,” the past tense of the verb), we can use phonemes along with IPA or X-SAMPA. IPA is generally used to customize the pronunciation of a word in a given language using phonemes:
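The following is a minimal sketch of the IPA case, assuming an en-US voice. The transcriptions lid (verb) and lɛd (chemical element) are our own picks from the en-US phoneme set, and the carrier sentences are made up. Python is used only to hold the SSML and confirm it is well-formed; the string itself would be sent to Amazon Polly as SSML input (for example, with TextType set to ssml in a SynthesizeSpeech request).

```python
import xml.etree.ElementTree as ET

# Assumed en-US IPA transcriptions: "lid" = verb, "lɛd" = chemical element.
ssml_ipa = (
    "<speak>"
    'Please <phoneme alphabet="ipa" ph="lid">lead</phoneme> the way. '
    'This pipe is made of <phoneme alphabet="ipa" ph="lɛd">lead</phoneme>.'
    "</speak>"
)

# Parsing raises xml.etree.ElementTree.ParseError if the markup is malformed.
root = ET.fromstring(ssml_ipa)
for phoneme in root.iter("phoneme"):
    print(phoneme.get("alphabet"), phoneme.get("ph"), phoneme.text)
```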
Modify pronunciation by specifying parts of speech
If we consider the same example of pronouncing “lead,” we can also differentiate between the chemical element and the verb by specifying the parts of speech using the <w> SSML tag.
The <w> tag allows us to customize pronunciation by specifying parts of speech. You can mark a word as a verb (present simple or past tense), noun, adjective, preposition, or determiner. See the following example:
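As a hedged sketch of the <w> tag, the SSML below marks one occurrence of “lead” with the role amazon:VB (present simple verb) and another with amazon:SENSE_1 (a non-default sense of the word). That amazon:SENSE_1 selects the chemical-element reading for this word and voice is an assumption, so listen to the synthesized audio before relying on it. The sentences are illustrative.

```python
import xml.etree.ElementTree as ET

# amazon:VB marks the present simple verb. amazon:SENSE_1 requests a
# non-default sense; that it yields the chemical element here is an assumption.
ssml_roles = (
    "<speak>"
    'I will <w role="amazon:VB">lead</w> the team. '
    'The pipe is made of <w role="amazon:SENSE_1">lead</w>.'
    "</speak>"
)

# Parsing fails loudly if the markup is not well-formed XML.
root = ET.fromstring(ssml_roles)
roles = [w.get("role") for w in root.iter("w")]
print(roles)
```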
Additionally, you can use the <sub> tag to indicate the pronunciation of acronyms and abbreviations:
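A small sketch of the <sub> tag: the alias attribute holds the text to be spoken in place of the written text inside the tag. The sentence and the choice of acronym are illustrative.

```python
import xml.etree.ElementTree as ET

# The written form stays "AWS"; the alias is what gets spoken instead.
ssml_sub = (
    "<speak>"
    'This workload runs on <sub alias="Amazon Web Services">AWS</sub>.'
    "</speak>"
)

sub = ET.fromstring(ssml_sub).find("sub")
print(f"written: {sub.text!r}, spoken: {sub.get('alias')!r}")
```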
Extended Speech Assessment Methods Phonetic Alphabet
The X-SAMPA transcription scheme extends the various language-specific SAMPA phoneme sets into a single ASCII-based scheme that covers the IPA.
The following snippet shows how you can use X-SAMPA to pronounce different variations of the word “lead”:
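One way this could look, assuming an en-US voice: the X-SAMPA strings lid (verb) and lEd (chemical element) are our own transcriptions, and the carrier sentences are made up. Python only wraps the SSML and checks that it parses.

```python
import xml.etree.ElementTree as ET

# Assumed en-US X-SAMPA transcriptions: "lid" = verb, "lEd" = chemical element.
ssml_xsampa = (
    "<speak>"
    'Please <phoneme alphabet="x-sampa" ph="lid">lead</phoneme> the way. '
    'This pipe is made of <phoneme alphabet="x-sampa" ph="lEd">lead</phoneme>.'
    "</speak>"
)

root = ET.fromstring(ssml_xsampa)
variants = [p.get("ph") for p in root.iter("phoneme")]
print(variants)
```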
The stress mark in IPA is ˈ. An apostrophe is often typed in its place, which can produce different output than expected. In X-SAMPA, the stress mark is the double quotation mark, so we should enclose the ph attribute value in single quotation marks and specify the phonetic alphabet. See the following example:
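A sketch of the IPA stress mark in use, with the word “pecan” and the transcription pɪˈkɑːn chosen for illustration. The check at the end is our own addition: it flags the common mistake of typing an apostrophe where the IPA stress mark (U+02C8) belongs.

```python
import xml.etree.ElementTree as ET

IPA_PRIMARY_STRESS = "\u02c8"  # the IPA stress mark, not an apostrophe

ssml_stress_ipa = (
    "<speak>"
    f'<phoneme alphabet="ipa" ph="pɪ{IPA_PRIMARY_STRESS}kɑːn">pecan</phoneme>'
    "</speak>"
)

ph = ET.fromstring(ssml_stress_ipa).find("phoneme").get("ph")
# Detect an ASCII or typographic apostrophe used in place of the stress mark.
uses_apostrophe = "'" in ph or "\u2019" in ph
print(ph, "| apostrophe used:", uses_apostrophe)
```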
The preceding example uses the character ˈ to mark stress. Similarly, the stress mark in X-SAMPA is the double quotation mark, as shown in the following example:
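A sketch of the X-SAMPA counterpart, again using “pecan” as an illustrative word with our own transcription. Because the stress mark is a double quotation mark, the ph value is enclosed in single quotation marks so the markup stays well-formed.

```python
import xml.etree.ElementTree as ET

# The ph value is single-quoted because it contains the X-SAMPA stress mark (").
ssml_stress_xsampa = """<speak><phoneme alphabet="x-sampa" ph='pI"kA:n'>pecan</phoneme></speak>"""

ph = ET.fromstring(ssml_stress_xsampa).find("phoneme").get("ph")
print(ph)
```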
Modify pronunciations using other SSML tags
You can use the <say-as> tag to modify pronunciation, for example by spelling out a word character by character. It also controls how digits, fractions, units, dates, times, addresses, telephone numbers, cardinals, and ordinals are spoken, and it can censor the text enclosed within the tag. For more information, refer to Controlling How Special Types of Words Are Spoken. Let’s look at examples of these attributes.
Date
By default, Amazon Polly decides on its own how to read each piece of input text. For specific content such as dates, you can use the date attribute to have them spoken in the required format, such as month-day-year or day-month-year.
Without the date attribute, Amazon Polly provides the following output when speaking out dates:
However, if you want the dates spoken in a specific format, the date attribute in the <say-as> tags helps customize the pronunciation:
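As a sketch, the SSML below reads the same calendar day in two orders by pairing interpret-as="date" with a format value (mdy for month-day-year, dmy for day-month-year). The date itself is arbitrary.

```python
import xml.etree.ElementTree as ET

# The format attribute tells Amazon Polly how to read the numeric date.
ssml_date = (
    "<speak>"
    'US order: <say-as interpret-as="date" format="mdy">09-21-2021</say-as>. '
    'European order: <say-as interpret-as="date" format="dmy">21-09-2021</say-as>.'
    "</speak>"
)

root = ET.fromstring(ssml_date)
formats = {s.get("format"): s.text for s in root.iter("say-as")}
print(formats)
```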
Cardinal
This attribute represents a number in its cardinal format. For example, 124456 is pronounced “one hundred twenty four thousand four hundred fifty six”:
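A minimal sketch of the cardinal case, using the number from the text; the surrounding sentence is made up.

```python
import xml.etree.ElementTree as ET

ssml_cardinal = (
    "<speak>"
    'The total is <say-as interpret-as="cardinal">124456</say-as>.'
    "</speak>"
)

say_as = ET.fromstring(ssml_cardinal).find("say-as")
print(say_as.get("interpret-as"), say_as.text)
```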
Ordinal
This attribute represents a number in its ordinal format. Without the ordinal attribute, the number is pronounced in its cardinal form:
If we want to pronounce 1242 as “one thousand two hundred forty second,” we can use the ordinal attribute:
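A minimal sketch of the ordinal case for the number 1242; the carrier sentence is illustrative.

```python
import xml.etree.ElementTree as ET

ssml_ordinal = (
    "<speak>"
    'You are the <say-as interpret-as="ordinal">1242</say-as> visitor.'
    "</speak>"
)

say_as = ET.fromstring(ssml_ordinal).find("say-as")
print(say_as.get("interpret-as"), say_as.text)
```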
Digits
The digits attribute speaks a number digit by digit. For example, “1234” is pronounced as “one two three four”:
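A minimal sketch of the digits case; the sentence around the number is made up.

```python
import xml.etree.ElementTree as ET

ssml_digits = (
    "<speak>"
    'Your code is <say-as interpret-as="digits">1234</say-as>.'
    "</speak>"
)

say_as = ET.fromstring(ssml_digits).find("say-as")
print(say_as.get("interpret-as"), say_as.text)
```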
Fraction
The fraction attribute is used to speak text as a fraction:
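A minimal sketch of the fraction case, with 3/20 as an arbitrary common fraction.

```python
import xml.etree.ElementTree as ET

ssml_fraction = (
    "<speak>"
    'Add <say-as interpret-as="fraction">3/20</say-as> of a liter.'
    "</speak>"
)

say_as = ET.fromstring(ssml_fraction).find("say-as")
print(say_as.get("interpret-as"), say_as.text)
```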
Time
The time attribute is used to speak a duration in minutes and seconds:
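A minimal sketch of the time case, where 1'21" stands for one minute and twenty-one seconds. The sentence is illustrative; the quotation marks are plain element text, so they need no escaping in the SSML.

```python
import xml.etree.ElementTree as ET

# Triple quotes let the Python string hold both ' and " without escapes.
ssml_time = """<speak>The lap took <say-as interpret-as="time">1'21"</say-as>.</speak>"""

say_as = ET.fromstring(ssml_time).find("say-as")
duration = say_as.text
print(say_as.get("interpret-as"), duration)
```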
Expletive
The expletive attribute censors the text enclosed within the tags:
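A minimal sketch of the expletive case; the enclosed word is a mild placeholder for whatever text should be censored.

```python
import xml.etree.ElementTree as ET

# Only the text inside the say-as element is censored; the rest is spoken.
ssml_expletive = (
    "<speak>"
    'Well, <say-as interpret-as="expletive">darn</say-as>, that was close.'
    "</speak>"
)

say_as = ET.fromstring(ssml_expletive).find("say-as")
print(say_as.get("interpret-as"), say_as.text)
```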
Telephone
You can use the telephone attribute to speak text as a telephone number instead of as standalone digits or a cardinal number:
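A minimal sketch of the telephone case. The number is a made-up placeholder in the 555 range, not a real contact.

```python
import xml.etree.ElementTree as ET

# Placeholder number for illustration only.
ssml_telephone = (
    "<speak>"
    'Call us at <say-as interpret-as="telephone">2025550123</say-as>.'
    "</speak>"
)

say_as = ET.fromstring(ssml_telephone).find("say-as")
print(say_as.get("interpret-as"), say_as.text)
```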
Address
The address attribute is used to customize the pronunciation of an address so that it follows a specific format:
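A minimal sketch of the address case. The street address is invented for illustration, and how abbreviations such as “St” are expanded depends on the voice and language, so check the audio output.

```python
import xml.etree.ElementTree as ET

# Invented address; any resemblance to a real location is unintended.
ssml_address = (
    "<speak>"
    'Ship it to <say-as interpret-as="address">123 Main St, Springfield, IL 62701</say-as>.'
    "</speak>"
)

say_as = ET.fromstring(ssml_address).find("say-as")
print(say_as.get("interpret-as"), say_as.text)
```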
Lexicons
We’ve looked at some of the SSML tags readily available in Amazon Polly. Other use cases might require a higher degree of control over pronunciation, which lexicons provide. You can use lexicons when certain words need to be pronounced in a way that is uncommon in that specific language.
Another use case for lexicons is with the use of numeronyms, which are abbreviations formed with the help of numbers. For example, Y2K is pronounced as the “year 2000.” You can use lexicons to customize these pronunciations.
Amazon Polly supports lexicon files in .pls and .xml formats. For more information, see Managing Lexicons.
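As a sketch of what a lexicon file could contain, the snippet below holds a small Pronunciation Lexicon Specification (PLS) document that maps the numeronym Y2K to the spoken alias “year two thousand.” The entry and the en-US language setting are our own choices. Python only parses the document and lists its entries; in practice the content would be uploaded to Amazon Polly as a lexicon and then referenced by name in a synthesis request.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"

# Hypothetical lexicon: speak the numeronym "Y2K" as "year two thousand".
lexicon_pls = f"""<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" xmlns="{PLS_NS}" alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>Y2K</grapheme>
    <alias>year two thousand</alias>
  </lexeme>
</lexicon>"""

# Parse as bytes so the encoding named in the XML declaration is honored.
root = ET.fromstring(lexicon_pls.encode("utf-8"))
entries = {
    lexeme.findtext(f"{{{PLS_NS}}}grapheme"): lexeme.findtext(f"{{{PLS_NS}}}alias")
    for lexeme in root.iter(f"{{{PLS_NS}}}lexeme")
}
print(entries)
```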
Conclusion
Amazon Polly SSML tags can help you customize pronunciation in a variety of ways. We hope that this post gives you a head start into the world of speech synthesis and powers your applications to provide more lifelike human interactions.
About the Authors
Abilashkumar P C is a Cloud Support Engineer at AWS. He works with customers providing technical troubleshooting guidance, helping them achieve their workloads at scale. Outside of work, he loves driving, following cricket, and reading.
Abhishek Soni is a Partner Solutions Architect at AWS. He works with customers to provide technical guidance for the best outcome of workloads on AWS.