Discover insights from Zendesk with Amazon Kendra intelligent search

Customer relationship management (CRM) is a critical tool that organizations maintain to manage customer interactions and build business relationships. Zendesk is a CRM tool that makes it easy for customers and businesses to keep in sync. Zendesk captures a wealth of customer data, such as support tickets created and updated by customers and service agents, community discussions, and helpful guides. With such a wealth of complex data, simple keyword searches don’t suffice when it comes to discovering meaningful, accurate customer information.

Now you can use the Amazon Kendra Zendesk connector to index your Zendesk service tickets, help guides, and community posts, and perform intelligent search powered by machine learning (ML). Amazon Kendra smartly and efficiently answers natural language-based queries using advanced natural language processing (NLP) techniques. It can learn effectively from your Zendesk data, extracting meaning and context.

This post shows how to configure the Amazon Kendra Zendesk connector to index your Zendesk domain and take advantage of Amazon Kendra intelligent search. We use an illustrative Zendesk domain that discusses technical topics related to AWS services as our example.

Overview of solution

Amazon Kendra was built for intelligent search using NLP. You can use Amazon Kendra to ask factoid and descriptive questions and to perform keyword searches. You can use the Amazon Kendra connector for Zendesk to crawl your Zendesk domain and index service tickets, guides, and community posts to discover answers to your questions faster.

In this post, we show how to use the Amazon Kendra connector for Zendesk to index data from your Zendesk domain for intelligent search.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account
  • Administrator level access to your Zendesk domain
  • Privileges to create an Amazon Kendra index, AWS resources, and AWS Identity and Access Management (IAM) roles and policies
  • Basic knowledge of AWS services and working knowledge of Zendesk

Configure your Zendesk domain

Your Zendesk domain has a domain owner or administrator, service group administrators, and customers. Sample service tickets, community posts, and guides have been created for the purpose of this walkthrough. We register a Zendesk API client with the unique identifier amazon_kendra to create an OAuth token that Amazon Kendra uses to access your Zendesk domain for crawling and indexing. The following screenshot shows the details of the OAuth configuration for the Zendesk API client.

Configure the data source using the Amazon Kendra connector for Zendesk

You can add the Zendesk connector data source to an existing Amazon Kendra index or create a new index. Then complete the following steps to configure the Zendesk connector:

  1. On the Amazon Kendra console, open the index and choose Data sources in the navigation pane.
  2. Under Zendesk, choose Add connector.
  3. Choose Add connector.
  4. In the Specify data source details section, enter the name and description of your data source and choose Next.
  5. In the Define access and security section, for Zendesk URL, enter the URL to your Zendesk domain. Use the URL format https://<domain>.zendesk.com/.
  6. Under Authentication, you can either choose Create to add a new secret using the user OAuth token created for the amazon_kendra API client, or use an existing AWS Secrets Manager secret that has the user OAuth token for the Zendesk domain that you want the connector to access.
  7. Optionally, configure a new AWS secret for Zendesk API access.
  8. For IAM role, you can choose Create a new role or choose an existing IAM role configured with appropriate IAM policies to access the Secrets Manager secret, Amazon Kendra index, and data source.
  9. Choose Next.
  10. In the Configure sync settings section, provide information regarding your sync scope and run schedule.
  11. Choose Next.
  12. In the Set field mappings section, you can optionally configure the field mappings, or how the Zendesk field names are mapped to Amazon Kendra attributes or facets.
  13. Choose Next.
  14. Review your settings and confirm to add the data source.
  15. After the data source is created, select the data source and choose Sync Now.
  16. Choose Facet definition in the navigation pane.
  17. Select the check box in the Facetable column for the facet _category.

Run queries with the Amazon Kendra search console

Now that the data is synced, we can run a few search queries on the Amazon Kendra search console by navigating to the Search indexed content page.

For the first query, we ask Amazon Kendra a general question related to AWS service durability. The following screenshot shows the response. Amazon Kendra provides the correct suggested answer to the query by applying natural language comprehension.

For our second query, let’s use Amazon Kendra to search for product issues reported in Zendesk service tickets. The following screenshot shows the response for the search, along with facets showing the various categories of documents included in the result.

Notice the search result includes the URL to the source document as well. Choosing the URL takes us directly to the Zendesk service ticket page, as shown in the following screenshot.
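
You can also run the same queries programmatically against the index. The following is a minimal boto3 sketch; the index ID and query text are placeholders, and the _category facet is the one we marked as facetable earlier:

import boto3

kendra = boto3.client("kendra")

# index_id is a placeholder; use the ID of the index that contains the Zendesk data source
response = kendra.query(
    IndexId=index_id,
    QueryText="What is the durability of Amazon S3?",   # illustrative natural language query
    Facets=[{"DocumentAttributeKey": "_category"}],      # request counts for the _category facet
)

for item in response["ResultItems"]:
    print(item["Type"])
    print(item.get("DocumentTitle", {}).get("Text", ""))
    print(item.get("DocumentExcerpt", {}).get("Text", ""))
    print(item.get("DocumentURI", ""))

The ANSWER result corresponds to the suggested answer shown on the search console, and DocumentURI is the link back to the source Zendesk ticket, guide, or post.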

Clean up

To avoid incurring future charges, clean up any resources created as part of this solution. Delete the Zendesk connector data source so any data indexed from the source is removed from the index. If you created a new Amazon Kendra index, delete the index as well.

Conclusion

In this post, we discussed how to configure the Amazon Kendra connector for Zendesk to crawl and index service tickets, community posts, and help guides. We showed how Amazon Kendra ML-based search enables your business leaders and agents to discover insights from your Zendesk content quicker and respond to customer needs faster.

To learn more about the Amazon Kendra connector for Zendesk, refer to the Amazon Kendra Developer Guide.


About the author

Rajesh Kumar Ravi is a Senior AI Services Solution Architect at Amazon Web Services specializing in intelligent document search with Amazon Kendra. He is a builder and problem solver, and contributes to the development of new ideas. He enjoys walking and loves to go on short hiking trips outside of work.

Read More

Amazon SageMaker Automatic Model Tuning now provides up to three times faster hyperparameter tuning with Hyperband

Amazon SageMaker Automatic Model Tuning introduces Hyperband, a multi-fidelity technique to tune hyperparameters as a faster and more efficient way to find an optimal model. In this post, we show how automatic model tuning with Hyperband can provide faster hyperparameter tuning—up to three times as fast.

The benefits of Hyperband

Hyperband presents two advantages over existing black-box tuning strategies: efficient resource utilization and faster time to convergence.

Machine learning (ML) training is increasingly compute-intensive, involving complex models and large datasets, and finding the optimal hyperparameters requires significant effort and resources. Traditional black-box search strategies, such as Bayesian optimization, random search, or grid search, tend to scale linearly with the complexity of the ML problem at hand, requiring longer training time.

To speed up hyperparameter tuning and optimize training cost, Hyperband uses the Asynchronous Successive Halving Algorithm (ASHA), a strategy that massively parallelizes hyperparameter tuning and automatically stops training jobs early by using previously evaluated configurations to predict whether a specific candidate is promising or not.

As we demonstrate in this post, Hyperband converges to the optimal objective metric faster than most black-box strategies and therefore saves training time. This allows you to tune larger-scale models where evaluating each hyperparameter configuration requires running an expensive training loop, such as in computer vision and natural language processing (NLP) applications. If you’re interested in finding the most accurate models, Hyperband also allows you to run your tuning jobs with more resources and converge to a better solution.

Hyperband with SageMaker

The new Hyperband strategy for hyperparameter tuning introduces a few new data elements in the AWS API calls. Configuration via the AWS Management Console is not available at this time. Let’s look at the data elements introduced for Hyperband:

  • Strategy – This defines the hyperparameter approach you want to choose. A new value for Hyperband is introduced with this change. Valid values are Bayesian, random, and Hyperband.
  • MinResource – Defines the minimum number of epochs or iterations to be used for a training job before a decision is made to stop the training.
  • MaxResource – Specifies the maximum number of epochs or iterations to be used for a training job to achieve the objective. This parameter isn’t required if the number of training epochs is already defined as a hyperparameter in the tuning job.

Implementation

The following sample code shows the tuning job config; the training job definition is configured separately:

tuning_job_config = {
    "Strategy": "Hyperband",
    "StrategyConfig": {
        "HyperbandStrategyConfig": {
            "MinResource": 10,
            "MaxResource": 100
        }
    },
    "ResourceLimits": {
        "MaxNumberOfTrainingJobs": 20,
        "MaxParallelTrainingJobs": 5
    },
    "HyperParameterTuningJobObjective": {
        "MetricName": "validation:auc",
        "Type": "Maximize"
    },
    "ParameterRanges": {
        "CategoricalParameterRanges": [],
        "ContinuousParameterRanges": [
            {"Name": "eta", "MinValue": "0", "MaxValue": "1"},
            {"Name": "min_child_weight", "MinValue": "1", "MaxValue": "10"},
            {"Name": "alpha", "MinValue": "0", "MaxValue": "2"}
        ],
        "IntegerParameterRanges": [
            {"Name": "max_depth", "MinValue": "1", "MaxValue": "10"}
        ]
    }
}

The preceding code defines the strategy as Hyperband and also defines the lower and upper resource bounds inside the strategy configuration using HyperbandStrategyConfig, which serves as a lever to control the training runtime. For more details on how to configure and run automatic model tuning, refer to Specify the Hyperparameter Tuning Job Settings.
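
To launch the tuning job with this configuration, you pass it to the CreateHyperParameterTuningJob API together with a training job definition that describes your algorithm, inputs, and static hyperparameters. The following is a minimal boto3 sketch; the job name is illustrative and training_job_definition is assumed to be defined elsewhere:

import boto3

sm = boto3.client("sagemaker")

sm.create_hyper_parameter_tuning_job(
    HyperParameterTuningJobName="hyperband-demo",       # illustrative name
    HyperParameterTuningJobConfig=tuning_job_config,    # the config shown above
    TrainingJobDefinition=training_job_definition,      # your algorithm, data channels, and static hyperparameters
)

# Poll the job status while the tuning job runs
status = sm.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName="hyperband-demo"
)["HyperParameterTuningJobStatus"]
print(status)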

Hyperband compared to black-box search

In this section, we perform two experiments to compare Hyperband to a black-box search.

First experiment

In the first experiment, given a binary classification task, we aim to optimize a three-layer fully connected network on a synthetic dataset that contains 5,000 data points. The hyperparameters for the network include the number of units per layer, the learning rate of the Adam optimizer, the L2 regularization parameter, and the batch size. The ranges of these parameters are:

  • Number of units per layer – 10 to 1e3
  • Learning rate – 1e-4 to 0.1
  • L2 regularization parameter – 1e-6 to 2
  • Batch size – 10 to 200

The following graph shows our findings.

Given a target accuracy of 96%, Hyperband achieves this target in 357 seconds, whereas Bayesian optimization needs 560 seconds and random search needs 614 seconds. Hyperband consistently finds a more optimal solution at any given wall-clock time budget. This shows a clear advantage of a multi-fidelity optimization algorithm.

Second experiment

In this second experiment, we consider the CIFAR-10 dataset and train ResNet-20, a popular network architecture for computer vision tasks. We run all experiments on ml.g4dn.xlarge instances and optimize the neural network with SGD. We also apply standard image augmentation, including random flip and random crop. The search space contains the following parameters:

  • Mini-batch size: an integer from 4 to 512
  • Learning rate of SGD: a float from 1e-6 to 1e-1
  • Momentum of SGD: a float from 0.01 to 0.99
  • Weight decay: a float from 1e-5 to 1

The following graph illustrates our findings.

Given a target validation accuracy of 0.87, Hyperband can reach it in fewer than 2,000 seconds, whereas random search and Bayesian optimization require about 10,000 and almost 9,000 seconds, respectively. This amounts to a speed-up factor of roughly 5x and 4.5x for Hyperband on this task compared to the random and Bayesian strategies. This shows a clear advantage for the multi-fidelity optimization algorithm, which significantly reduces the wall-clock time it takes to tune your deep learning models.

Conclusion

SageMaker Automatic Model Tuning allows you to reduce the time to tune a model by automatically searching for the best hyperparameter configuration within the ranges that you specify. You can find the best version of your model by running training jobs on your dataset with several search strategies, such as black-box or multi-fidelity strategies.

In this post, we discussed how you can now use a multi-fidelity strategy called Hyperband in SageMaker to find the best model. The support for Hyperband makes it possible for SageMaker Automatic Model Tuning to tune larger-scale models where evaluating each hyperparameter configuration requires running an expensive training loop, such as in computer vision and NLP applications.

Finally, we saw how Hyperband further optimizes runtime compared to black-box strategies with early stopping by using previously evaluated configurations to predict whether a specific candidate is promising and, if not, stop the evaluation to reduce the overall time and compute cost. Using Hyperband in SageMaker also allows you to specify the minimum and maximum resource in the HyperbandStrategyConfig parameter for further runtime controls.

To learn more, visit Perform Automatic Model Tuning with SageMaker.


About the authors

Doug Mbaya is a Senior Partner Solution architect with a focus in data and analytics. Doug works closely with AWS partners, helping them integrate data and analytics solutions in the cloud.

Gopi Mudiyala is a Senior Technical Account Manager at AWS. He helps customers in the Financial Services industry with their operations in AWS. As a machine learning specialist, Gopi works to help customers succeed in their ML journey.

Xingchen Ma is an Applied Scientist at AWS. He works in the team owning the service for SageMaker Automatic Model Tuning.

Read More

Read webpages and highlight content using Amazon Polly

In this post, we demonstrate how to use Amazon Polly—a leading cloud service that converts text into lifelike speech—to read the content of a webpage and highlight the content as it’s being read. Adding audio playback to a webpage improves the accessibility and visitor experience of the page. Audio-enhanced content is more impactful and memorable, draws more traffic to the page, and taps into the spending power of visitors. It also improves the brand of the company or organization that publishes the page. Text-to-speech technology makes these business benefits attainable. We accelerate that journey by demonstrating how to achieve this goal using Amazon Polly.

This capability improves accessibility for visitors with disabilities, and could be adopted as part of your organization’s accessibility strategy. Just as importantly, it enhances the page experience for visitors without disabilities. Both groups have significant spending power and spend more freely from pages that use audio enhancement to grab their attention.

Overview of solution

PollyReadsThePage (PRTP)—as we refer to the solution—allows a webpage publisher to drop an audio control onto their webpage. When the visitor chooses Play on the control, the control reads the page and highlights the content. PRTP uses the general capability of Amazon Polly to synthesize speech from text. It invokes Amazon Polly to generate two artifacts for each page:

  • The audio content in a format playable by the browser: MP3
  • A speech marks file that indicates for each sentence of text:
    • The time during playback that the sentence is read
    • The location on the page the sentence appears

When the visitor chooses Play, the browser plays the MP3 file. As the audio is read, the browser checks the time, finds in the marks file which sentence to read at that time, locates it on the page, and highlights it.

PRTP allows the visitor to listen in different voices and languages. Each voice requires its own pair of files. PRTP uses neural voices. For a list of supported neural voices and languages, see Neural Voices. For a full list of standard and neural voices in Amazon Polly, see Voices in Amazon Polly.

We consider two types of webpages: static and dynamic pages. In a static page, the content is contained within the page and changes only when a new version of the page is published. The company might update the page daily or weekly as part of its web build process. For this type of page, it’s possible to pre-generate the audio files at build time and place them on the web server for playback. As the following figure shows, the script PRTP Pre-Gen invokes Amazon Polly to generate the audio. It takes as input the HTML page itself and, optionally, a configuration file that specifies which text from the page to extract (Text Extract Config). If the extract config is omitted, the pre-gen script makes a sensible choice of text to extract from the body of the page. Amazon Polly outputs the files in an Amazon Simple Storage Service (Amazon S3) bucket; the script copies them to your web server. When the visitor plays the audio, the browser downloads the MP3 directly from the web server. For highlights, a drop-in library, PRTP.js, uses the marks file to highlight the text being read.

The content of a dynamic page changes in response to the visitor interaction, so audio can’t be pre-generated but must be synthesized dynamically. As the following figure shows, when the visitor plays the audio, the page uses PRTP.js to generate the audio in Amazon Polly, and it highlights the synthesized audio using the same approach as with static pages. To access AWS services from the browser, the visitor requires an AWS identity. We show how to use an Amazon Cognito identity pool to allow the visitor just enough access to Amazon Polly and the S3 bucket to render the audio.

Dynamic Content

Generating both the MP3 audio and the speech marks requires Amazon Polly to synthesize the same input twice. Refer to the Amazon Polly pricing page to understand the cost implications. Pre-generation saves costs because synthesis is performed at build time rather than on demand for each visitor interaction.

The code accompanying this post is available as an open-source repository on GitHub.

To explore the solution, we follow these steps:

  1. Set up the resources, including the pre-gen build server, S3 bucket, web server, and Amazon Cognito identity.
  2. Run the static pre-gen build and test static pages.
  3. Test dynamic pages.

Prerequisites

To run this example, you need an AWS account with permission to use Amazon Polly, Amazon S3, Amazon Cognito, and (for demo purposes) AWS Cloud9.

Provision resources

We share an AWS CloudFormation template to create in your account a self-contained demo environment to help you follow along with the post. If you prefer to set up PRTP in your own environment, refer to instructions in README.md.

To provision the demo environment using CloudFormation, first download a copy of the CloudFormation template. Then complete the following steps:

  1. On the AWS CloudFormation console, choose Create stack.
  2. Choose With new resources (standard).
  3. Select Upload a template file.
  4. Choose Choose file to upload the local copy of the template that you downloaded. The name of the file is prtp.yml.
  5. Choose Next.
  6. Enter a stack name of your choosing. Later you enter this again as a replacement for <StackName>.
  7. You may keep default values in the Parameters section.
  8. Choose Next.
  9. Continue through the remaining sections.
  10. Read and select the check boxes in the Capabilities section.
  11. Choose Create stack.
  12. When the stack is complete, find the value for BucketName in the stack outputs.

We encourage you to review the stack with your security team prior to using it in a production environment.

Set up the web server and pre-gen server in an AWS Cloud9 IDE

Next, on the AWS Cloud9 console, locate the environment PRTPDemoCloud9 created by the CloudFormation stack. Choose Open IDE to open the AWS Cloud9 environment. Open a terminal window and run the following commands, which clone the PRTP code, set up pre-gen dependencies, and start a web server to test with:

#Obtain PRTP code
cd /home/ec2-user/environment
git clone https://github.com/aws-samples/amazon-polly-reads-the-page.git

# Navigate to that code
cd amazon-polly-reads-the-page/setup

# Install Saxon and html5 Python lib. For pre-gen.
sh ./setup.sh <StackName>

# Run Python simple HTTP server
cd ..
./runwebserver.sh <IngressCIDR> 

For <StackName>, use the name you gave the CloudFormation stack. For <IngressCIDR>, specify a range of IP addresses allowed to access the web server. To restrict access to the browser on your local machine, find your IP address using https://whatismyipaddress.com/ and append /32 to specify the range. For example, if your IP is 10.2.3.4, use 10.2.3.4/32. The server listens on port 8080. The public IP address on which the server listens is given in the output. For example:

Public IP is

3.92.33.223

Test static pages

In your browser, navigate to PRTPStaticDefault.html. (If you’re using the demo, the URL is http://<cloud9host>:8080/web/PRTPStaticDefault.html, where <cloud9host> is the public IP address that you discovered in setting up the IDE.) Choose Play on the audio control at the top. Listen to the audio and watch the highlights. Explore the control by changing speeds, changing voices, pausing, fast-forwarding, and rewinding. The following screenshot shows the page; the text “Skips hidden paragraph” is highlighted because it is currently being read.

Try the same for PRTPStaticConfig.html and PRTPStaticCustom.html. The results are similar. For example, all three read the alt text for the photo of the cat (“Random picture of a cat”). All three read NE, NW, SE, and SW as full words (“northeast,” “northwest,” “southeast,” “southwest”), taking advantage of Amazon Polly lexicons.

Notice the main differences in audio:

  • PRTPStaticDefault.html reads all the text in the body of the page, including the wrapup portion at the bottom with “Your thoughts in one word,” “Submit Query,” “Last updated April 1, 2020,” and “Questions for the dev team.” PRTPStaticConfig.html and PRTPStaticCustom.html don’t read these because they explicitly exclude the wrapup from speech synthesis.
  • PRTPStaticCustom.html reads the QB Best Sellers table differently from the others. It reads the first three rows only, and reads the row number for each row. It repeats the columns for each row. PRTPStaticCustom.html uses a custom transformation to tailor the readout of the table. The other pages use default table rendering.
  • PRTPStaticCustom.html reads “Tom Brady” at a louder volume than the rest of the text. It uses the speech synthesis markup language (SSML) prosody tag to tailor the reading of Tom Brady. The other pages don’t tailor in this way.
  • PRTPStaticCustom.html, thanks to a custom transformation, reads the main tiles in NW, SW, NE, SE order; that is, it reads “Today’s Articles,” “Quote of the Day,” “Photo of the Day,” “Jokes of the Day.” The other pages read the tiles in the natural NW, NE, SW, SE order in which they appear in the HTML: “Today’s Articles,” “Photo of the Day,” “Quote of the Day,” “Jokes of the Day.”

Let’s dig deeper into how the audio is generated, and how the page highlights the text.

Static pre-generator

Our GitHub repo includes pre-generated audio files for the PRPTStatic pages, but if you want to generate them yourself, from the bash shell in the AWS Cloud9 IDE, run the following commands:

# navigate to examples
cd /home/ec2-user/environment/amazon-polly-reads-the-page/pregen/examples

# Set env var for my S3 bucket. Example, I called mine prtp-output
S3_BUCKET=prtp-output # Use output BucketName from CloudFormation

#Add lexicon for pronunciation of NE NW SE SW
#Script invokes aws polly put-lexicon
./addlexicon.sh

#Gen each variant
./gen_default.sh
./gen_config.sh
./gen_custom.sh

Now let’s look at how those scripts work.

Default case

We begin with gen_default.sh:

cd ..
python FixHTML.py ../web/PRTPStaticDefault.html  
   example/tmp_wff.html
./gen_ssml.sh example/tmp_wff.html generic.xslt example/tmp.ssml
./run_polly.sh example/tmp.ssml en-US Joanna 
   ../web/polly/PRTPStaticDefault compass
./run_polly.sh example/tmp.ssml en-US Matthew 
   ../web/polly/PRTPStaticDefault compass

The script begins by running the Python program FixHTML.py to make the source HTML file PRTPStaticDefault.html well-formed. It writes the well-formed version of the file to example/tmp_wff.html. This step is crucial for two reasons:

  • Most source HTML is not well formed; for example, many HTML pages don’t close P elements. This step repairs such issues so the HTML is well formed.
  • We keep track of where in the HTML page we find text. We need to track locations using the same document object model (DOM) structure that the browser uses. For example, the browser automatically adds a TBODY to a TABLE. The Python program applies the same well-forming repairs as the browser. (A sketch of this well-forming step follows this list.)
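
FixHTML.py ships with the repo. As a rough illustration of the idea (not the actual program), a well-forming pass can be written in a few lines with html5lib, the HTML5 Python library that setup.sh installs:

# Sketch of the well-forming step: parse the page the way a browser would
# and write it back out as well-formed markup. The real FixHTML.py may differ.
import sys
import html5lib

src, dst = sys.argv[1], sys.argv[2]

with open(src, encoding="utf-8") as f:
    # html5lib applies the same repairs a browser does, such as closing
    # unclosed P elements and inserting TBODY into TABLE
    dom = html5lib.parse(f, treebuilder="dom", namespaceHTMLElements=False)

with open(dst, "w", encoding="utf-8") as f:
    f.write(dom.toxml())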

gen_ssml.sh takes the well-formed HTML as input, applies an XSLT (Extensible Stylesheet Language Transformations) transform to it, and outputs an SSML file. (SSML is the markup language Amazon Polly uses to control how audio is rendered from text.) In the current example, the input is example/tmp_wff.html. The output is example/tmp.ssml. The transform’s job is to decide what text to extract from the HTML and feed to Amazon Polly. generic.xslt is a sensible default XSLT transform for most webpages. In the following example code snippet, it excludes the audio control, the HTML header, and HTML elements such as script and form. It also excludes elements with the hidden attribute. It includes elements that typically contain text, such as P, H1, and SPAN. For these, it renders both a mark, including the full XPath expression of the element, and the value of the element.

<!-- skip the header -->
<xsl:template match="html/head">
</xsl:template>

<!-- skip the audio itself -->
<xsl:template match="html/body/table[@id='prtp-audio']">
</xsl:template>

<!-- For the body, work through it by applying its templates. This is the default. -->
<xsl:template match="html/body">
<speak>
      <xsl:apply-templates />
</speak>
</xsl:template>

<!-- skip these -->
<xsl:template match="audio|option|script|form|input|*[@hidden='']">
</xsl:template>

<!-- include these -->
<xsl:template match="p|h1|h2|h3|h4|li|pre|span|a|th/text()|td/text()">
<xsl:for-each select=".">
<p>
      <mark>
          <xsl:attribute name="name">
          <xsl:value-of select="prtp:getMark(.)"/>
          </xsl:attribute>
      </mark>
      <xsl:value-of select="normalize-space(.)"/>
</p>
</xsl:for-each>
</xsl:template>

The following is a snippet of the SSML that is rendered. This is fed as input to Amazon Polly. Notice, for example, that the text “Skips hidden paragraph” is to be read in the audio, and we associate it with a mark, which tells us that this text occurs in the location on the page given by the XPath expression /html/body[1]/div[2]/ul[1]/li[1].

<speak>
<p><mark name="/html/body[1]/div[1]/h1[1]"/>PollyReadsThePage Normal Test Page</p>
<p><mark name="/html/body[1]/div[2]/p[1]"/>PollyReadsThePage is a test page for audio readout with highlights.</p>
<p><mark name="/html/body[1]/div[2]/p[2]"/>Here are some features:</p>
<p><mark name="/html/body[1]/div[2]/ul[1]/li[1]"/>Skips hidden paragraph</p>
<p><mark name="/html/body[1]/div[2]/ul[1]/li[2]"/>Speaks but does not highlight collapsed content</p>
…
</speak>

To generate audio in Amazon Polly, we call the script run_polly.sh. It runs the AWS Command Line Interface (AWS CLI) command aws polly start-speech-synthesis-task twice: once to generate the MP3 audio, and again to generate the marks file. Because the generation is asynchronous, the script polls until it finds the output in the specified S3 bucket. When it finds the output, it downloads it to the build server and copies the files to the web/polly folder. The following is a listing of the web folders:

  • PRTPStaticDefault.html
  • PRTPStaticConfig.html
  • PRTPStaticCustom.html
  • PRTP.js
  • polly/PRTPStaticDefault/Joanna.mp3, Joanna.marks, Matthew.mp3, Matthew.marks
  • polly/PRTPStaticConfig/Joanna.mp3, Joanna.marks, Matthew.mp3, Matthew.marks
  • polly/PRTPStaticCustom/Joanna.mp3, Joanna.marks, Matthew.mp3, Matthew.marks

Each page has its own set of voice-specific MP3 and marks files. These files are the pre-generated files. The page doesn’t need to invoke Amazon Polly at runtime; the files are part of the web build.
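
run_polly.sh itself wraps the AWS CLI; the following is a rough boto3 equivalent of the two synthesis calls it makes. It assumes ssml_text holds the contents of example/tmp.ssml, bucket is the BucketName output from the CloudFormation stack, and the lexicon name matches the argument passed to the script:

import time
import boto3

polly = boto3.client("polly")

common = dict(
    Engine="neural",
    LanguageCode="en-US",
    VoiceId="Joanna",
    TextType="ssml",
    Text=ssml_text,                # contents of example/tmp.ssml
    OutputS3BucketName=bucket,     # BucketName output from the CloudFormation stack
    LexiconNames=["compass"],      # lexicon added by addlexicon.sh (name assumed from the script arguments)
)

# One task for the MP3 audio, one for the sentence and SSML speech marks
audio_task = polly.start_speech_synthesis_task(OutputFormat="mp3", **common)
marks_task = polly.start_speech_synthesis_task(
    OutputFormat="json", SpeechMarkTypes=["sentence", "ssml"], **common
)

# The tasks are asynchronous; poll until each one completes and report its S3 location
for task in (audio_task, marks_task):
    task_id = task["SynthesisTask"]["TaskId"]
    while True:
        result = polly.get_synthesis_task(TaskId=task_id)["SynthesisTask"]
        if result["TaskStatus"] in ("completed", "failed"):
            print(result["TaskStatus"], result.get("OutputUri", ""))
            break
        time.sleep(5)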

Config-driven case

Next, consider gen_config.sh:

cd ..
python FixHTML.py ../web/PRTPStaticConfig.html 
  example/tmp_wff.html
python ModGenericXSLT.py example/transform_config.json 
  example/tmp.xslt
./gen_ssml.sh example/tmp_wff.html example/tmp.xslt 
  example/tmp.ssml
./run_polly.sh example/tmp.ssml en-US Joanna 
  ../web/polly/PRTPStaticConfig compass
./run_polly.sh example/tmp.ssml en-US Matthew 
  ../web/polly/PRTPStaticConfig compass

The script is similar to the script in the default case; the main difference is the additional ModGenericXSLT.py step and the generated example/tmp.xslt transform it produces. Our approach is config-driven. We tailor the content to be extracted from the page by specifying what to extract through configuration, not code. In particular, we use the JSON file transform_config.json, which specifies that the content to be included is the elements with IDs title, main, maintable, and qbtable. The element with ID wrapup should be excluded. See the following code:

{
 "inclusions": [ 
 	{"id" : "title"} , 
 	{"id": "main"}, 
 	{"id": "maintable"}, 
 	{"id": "qbtable" }
 ],
 "exclusions": [
 	{"id": "wrapup"}
 ]
}

We run the Python program ModGenericXSLT.py to modify generic.xslt, used in the default case, to use the inclusions and exclusions that we specify in transform_config.json. The program writes the results to a temp file (example/tmp.xslt), which it passes to gen_ssml.sh as its XSLT transform.

This is a low-code option. The web publisher doesn’t need to know how to write XSLT. But they do need to understand the structure of the HTML page and the IDs used in its main organizing elements.

Customization case

Finally, consider gen_custom.sh:

cd ..
python FixHTML.py ../web/PRTPStaticCustom.html 
   example/tmp_wff.html
./gen_ssml.sh example/tmp_wff.html example/custom.xslt  
   example/tmp.ssml
./run_polly.sh example/tmp.ssml en-US Joanna 
   ../web/polly/PRTPStaticCustom compass
./run_polly.sh example/tmp.ssml en-US Matthew 
   ../web/polly/PRTPStaticCustom compass

This script is nearly identical to the default script, except it uses its own XSLT—example/custom.xslt—rather than the generic XSLT. The following is a snippet of the XSLT:

<!-- Use NW, SW, NE, SE order for main tiles! -->
<xsl:template match="*[@id='maintable']">
    <mark>
        <xsl:attribute name="name">
        <xsl:value-of select="stats:getMark(.)"/>
        </xsl:attribute>
    </mark>
    <xsl:variable name="tiles" select="./tbody"/>
    <xsl:variable name="tiles-nw" select="$tiles/tr[1]/td[1]"/>
    <xsl:variable name="tiles-ne" select="$tiles/tr[1]/td[2]"/>
    <xsl:variable name="tiles-sw" select="$tiles/tr[2]/td[1]"/>
    <xsl:variable name="tiles-se" select="$tiles/tr[2]/td[2]"/>
    <xsl:variable name="tiles-seq" select="($tiles-nw,  $tiles-sw, $tiles-ne, $tiles-se)"/>
    <xsl:for-each select="$tiles-seq">
         <xsl:apply-templates />  
    </xsl:for-each>
</xsl:template>   

<!-- Say Tom Brady load! -->
<xsl:template match="span[@style = 'color:blue']" >
<p>
      <mark>
          <xsl:attribute name="name">
          <xsl:value-of select="prtp:getMark(.)"/>
          </xsl:attribute>
      </mark>
      <prosody volume="x-loud">Tom Brady</prosody>
</p>
</xsl:template>

If you want to study the code in detail, refer to the scripts and programs in the GitHub repo.

Browser setup and highlights

The static pages include an HTML5 audio control, which takes as its audio source the MP3 file generated by Amazon Polly and residing on the web server:

<audio id="audio" controls>
  <source src="polly/PRTPStaticDefault/en/Joanna.mp3" type="audio/mpeg">
</audio>

At load time, the page also loads the Amazon Polly-generated marks file. This occurs in the PRTP.js file, which the HTML page includes. The following is a snippet of the marks file for PRTPStaticDefault:

{"time":11747,"type":"sentence","start":289,"end":356,"value":"PollyReadsThePage is a test page for audio readout with highlights."}
{"time":15784,"type":"ssml","start":363,"end":403,"value":"/html/body[1]/div[2]/p[2]"}
{"time":16427,"type":"sentence","start":403,"end":426,"value":"Here are some features:"}
{"time":17677,"type":"ssml","start":433,"end":480,"value":"/html/body[1]/div[2]/ul[1]/li[1]"}
{"time":18344,"type":"sentence","start":480,"end":502,"value":"Skips hidden paragraph"}
{"time":19894,"type":"ssml","start":509,"end":556,"value":"/html/body[1]/div[2]/ul[1]/li[2]"}
{"time":20537,"type":"sentence","start":556,"end":603,"value":"Speaks but does not highlight collapsed content"}

During audio playback, there is an audio timer event handler in PRTP.js that checks the audio’s current time, finds the text to highlight, finds its location on the page, and highlights it. The text to be highlighted is an entry of type sentence in the marks file. The location is the XPath expression in the name attribute of the entry of type SSML that precedes the sentence. For example, if the time is 18400, according to the marks file, the sentence to be highlighted is “Skips hidden paragraph,” which starts at 18344. The location is the SSML entry at time 17677: /html/body[1]/div[2]/ul[1]/li[1].
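
The actual handler lives in PRTP.js and is written in JavaScript; the following small Python illustration shows the same lookup logic, using the marks entries shown above:

import json
from bisect import bisect_right

# Each line of the marks file is a standalone JSON object
with open("Joanna.marks") as f:
    marks = [json.loads(line) for line in f if line.strip()]

sentences = [m for m in marks if m["type"] == "sentence"]
ssml_marks = [m for m in marks if m["type"] == "ssml"]

def find_highlight(current_ms):
    """Return the (sentence, XPath location) to highlight at the given playback time."""
    i = bisect_right([s["time"] for s in sentences], current_ms) - 1
    if i < 0:
        return None
    sentence = sentences[i]
    # The location is the last ssml entry that precedes the sentence entry
    locations = [m for m in ssml_marks if m["time"] <= sentence["time"]]
    return sentence["value"], locations[-1]["value"] if locations else None

print(find_highlight(18400))
# ('Skips hidden paragraph', '/html/body[1]/div[2]/ul[1]/li[1]')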

Test dynamic pages

The page PRTPDynamic.html demonstrates dynamic audio readback using default, configuration-driven, and custom audio extraction approaches.

Default case

In your browser, navigate to PRTPDynamic.html. The page has one query parameter, dynOption, which accepts values default, config, and custom. It defaults to default, so you may omit it in this case. The page has two sections with dynamic content:

  • Latest Articles – Changes frequently throughout the day
  • Greek Philosophers Search By Date – Allows the visitor to search for Greek philosophers by date and shows the results in a table

Create some content in the Greek Philosopher section by entering a date range of -800 to 0, as shown in the example. Then choose Find.

Now play the audio by choosing Play in the audio control.

Behind the scenes, the page runs the following code to render and play the audio:

   buildSSMLFromDefault();
   chooseRenderAudio();
   setVoice();

First it calls the function buildSSMLFromDefault in PRTP.js to extract most of the text from the HTML page body. That function walks the DOM tree, looking for text in common elements such as p, h1, pre, span, and td. It ignores text in elements that usually don’t contain text to be read aloud, such as audio, option, and script. It builds SSML markup to be input to Amazon Polly. The following is a snippet showing extraction of the first row from the philosopher table:

<speak>
...
  <p><mark name="/HTML[1]/BODY[1]/DIV[3]/DIV[1]/DIV[1]/TABLE[1]/TBODY[1]/TR[2]/TD[1]"/>Thales</p>
  <p><mark name="/HTML[1]/BODY[1]/DIV[3]/DIV[1]/DIV[1]/TABLE[1]/TBODY[1]/TR[2]/TD[2]"/>-624 to -546</p>
  <p><mark name="/HTML[1]/BODY[1]/DIV[3]/DIV[1]/DIV[1]/TABLE[1]/TBODY[1]/TR[2]/TD[3]"/>Miletus</p>
  <p><mark name="/HTML[1]/BODY[1]/DIV[3]/DIV[1]/DIV[1]/TABLE[1]/TBODY[1]/TR[2]/TD[4]"/>presocratic</p>
...
</speak>

The chooseRenderAudio function in PRTP.js begins by initializing the AWS SDK for Amazon Cognito, Amazon S3, and Amazon Polly. This initialization occurs only once. If chooseRenderAudio is invoked again because the content of the page has changed, the initialization is skipped. See the following code:

AWS.config.region = env.REGION
AWS.config.credentials = new AWS.CognitoIdentityCredentials({
            IdentityPoolId: env.IDP});
audioTracker.sdk.connection = {
   polly: new AWS.Polly({apiVersion: '2016-06-10'}),
   s3: new AWS.S3()
};

It generates MP3 audio from Amazon Polly. The generation is synchronous for small SSML inputs and asynchronous (with output sent to the S3 bucket) for large SSML inputs (greater than 6,000 characters). In the synchronous case, we ask Amazon Polly to provide the MP3 file using a presigned URL. When the synthesized output is ready, we set the src attribute of the audio control to that URL and load the control. We then request the marks file and load it the same way as in the static case. See the following code:

// create signed URL
const signer = new AWS.Polly.Presigner(pollyAudioInput, audioTracker.sdk.connection.polly);

// call Polly to get MP3 into signed URL
signer.getSynthesizeSpeechUrl(pollyAudioInput, function(error, url) {
  // Audio control uses signed URL
  audioTracker.audioControl.src =
    audioTracker.sdk.audio[audioTracker.voice];
  audioTracker.audioControl.load();

  // call Polly to get marks
  audioTracker.sdk.connection.polly.synthesizeSpeech(
    pollyMarksInput, function(markError, markData) {
    const marksStr = new
      TextDecoder().decode(markData.AudioStream);
    // load marks into page the same as with static
    doLoadMarks(marksStr);
  });
});

Config-driven case

In your browser, navigate to PRTPDynamic.html?dynOption=config. Play the audio. The audio playback is similar to the default case, but there are minor differences. In particular, some content is skipped.

Behind the scenes, when using the config option, the page extracts content differently than in the default case. In the default case, the page uses buildSSMLFromDefault. In the config-driven case, the page specifies the sections it wants to include and exclude:

const ssml = buildSSMLFromConfig({
	 "inclusions": [ 
	 	{"id": "title"}, 
	 	{"id": "main"}, 
	 	{"id": "maintable"}, 
	 	{"id": "phil-result"},
	 	{"id": "qbtable"}, 
	 ],
	 "exclusions": [
	 	{"id": "wrapup"}
	 ]
	});

The buildSSMLFromConfig function, defined in PRTP.js, walks the DOM tree in each of the sections whose ID is provided under inclusions. It extracts content from each and combines them together, in the order specified, to form an SSML document. It excludes the sections specified under exclusions. It extracts content from each section in the same way buildSSMLFromDefault extracts content from the page body.

Customization case

In your browser, navigate to PRTPDynamic.html?dynOption=custom. Play the audio. There are three noticeable differences. Let’s note these and consider the custom code that runs behind the scenes:

  • It reads the main tiles in NW, SW, NE, SE order. The custom code gets each of these cell blocks from maintable and adds them to the SSML in NW, SW, NE, SE order:
const nw = getElementByXpath("//*[@id='maintable']//tr[1]/td[1]");
const sw = getElementByXpath("//*[@id='maintable']//tr[2]/td[1]");
const ne = getElementByXpath("//*[@id='maintable']//tr[1]/td[2]");
const se = getElementByXpath("//*[@id='maintable']//tr[2]/td[2]");
[nw, sw, ne, se].forEach(dir => buildSSMLSection(dir, []));
  • “Tom Brady” is spoken loudly. The custom code puts “Tom Brady” text inside an SSML prosody tag:
if (cellText == "Tom Brady") {
   addSSMLMark(getXpathOfNode( node.childNodes[tdi]));
   startSSMLParagraph();
   startSSMLTag("prosody", {"volume": "x-loud"});
   addSSMLText(cellText);
   endSSMLTag();
   endSSMLParagraph();
}
  • It reads only the first three rows of the quarterback table. It reads the column headers for each row. Check the code in the GitHub repo to discover how this is implemented.

Clean up

To avoid incurring future charges, delete the CloudFormation stack.

Conclusion

In this post, we demonstrated a technical solution to a high-value business problem: how to use Amazon Polly to read the content of a webpage and highlight the content as it’s being read. We showed this using both static and dynamic pages. To extract content from the page, we used DOM traversal and XSLT. To facilitate highlighting, we used the speech marks capability in Amazon Polly.

Learn more about Amazon Polly by visiting its service page.

Feel free to ask questions in the comments.


About the authors

Mike Havey is a Solutions Architect for AWS with over 25 years of experience building enterprise applications. Mike is the author of two books and numerous articles. Visit his Amazon author page to read more.

Vineet Kachhawaha is a Solutions Architect at AWS with expertise in machine learning. He is responsible for helping customers architect scalable, secure, and cost-effective workloads on AWS.

Read More

Use Amazon SageMaker Data Wrangler for data preparation and Studio Labs to learn and experiment with ML

Amazon SageMaker Studio Lab is a free machine learning (ML) development environment based on open-source JupyterLab for anyone to learn and experiment with ML using AWS ML compute resources. It’s based on the same architecture and user interface as Amazon SageMaker Studio, but with a subset of Studio capabilities.

When you begin working on ML initiatives, you need to perform exploratory data analysis (EDA) or data preparation before proceeding with model building. Amazon SageMaker Data Wrangler is a capability of Amazon SageMaker that makes it faster for data scientists and engineers to prepare data for ML applications via a visual interface. Data Wrangler reduces the time it takes to aggregate and prepare data for ML from weeks to minutes.

A key accelerator of feature preparation in Data Wrangler is the Data Quality and Insights Report. This report checks data quality and helps detect abnormalities in your data, so that you can perform the required data engineering to fix your dataset. You can use the Data Quality and Insights Report to perform an analysis of your data to gain insights into your dataset such as the number of missing values and number of outliers. If you have issues with your data, such as target leakage or imbalance, the insights report can bring those issues to your attention and help you identify the data preparation steps you need to perform.

Studio Lab users can benefit from Data Wrangler because data quality and feature engineering are critical for the predictive performance of your model. Data Wrangler helps with data quality and feature engineering by giving insights into data quality issues and easily enabling rapid feature iteration and engineering using a low-code UI.

In this post, we show you how to perform exploratory data analysis, prepare and transform data using Data Wrangler, and export the transformed and prepared data to Studio Lab to carry out model building.

Solution overview

The solution includes the following high-level steps:

  1. Create an AWS account and admin user (prerequisite).
  2. Download the dataset churn.csv.
  3. Load the dataset to Amazon Simple Storage Service (Amazon S3).
  4. Create a SageMaker Studio domain and launch Data Wrangler.
  5. Import the dataset into the Data Wrangler flow from Amazon S3.
  6. Create the Data Quality and Insights Report and draw conclusions on necessary feature engineering.
  7. Perform the necessary data transforms in Data Wrangler.
  8. Download the Data Quality and Insights Report and the transformed dataset.
  9. Upload the data to a Studio Lab project for model training.

The following diagram illustrates this workflow.

Prerequisites

To use Data Wrangler and Studio Lab, you need the following prerequisites:

  • An AWS account and an admin user (see step 1 of the solution overview)
  • An Amazon SageMaker Studio domain to run Data Wrangler
  • An Amazon SageMaker Studio Lab account for model building

Build a data preparation workflow with Data Wrangler

To get started, complete the following steps:

  1. Upload your dataset to Amazon S3.
  2. On the SageMaker console, under Control panel in the navigation pane, choose Studio.
  3. On the Launch app menu next to your user profile, choose Studio.

    After you successfully log in to Studio, you should see a development environment like the following screenshot.
  4. To create a new Data Wrangler workflow, on the File menu, choose New, then choose Data Wrangler Flow.

    The first step in Data Wrangler is to import your data. You can import data from multiple data sources, such as Amazon S3, Amazon Athena, Amazon Redshift, Snowflake, and Databricks. In this example, we use Amazon S3. If you just want to see how Data Wrangler works, you can always choose Use sample dataset.
  5. Choose Import data.
  6. Choose Amazon S3.
  7. Choose the dataset you uploaded and choose Import.

    Data Wrangler enables you to either import the entire dataset or sample a portion of it.
  8. To quickly get insights on the dataset, choose First K for Sampling and enter 50000 for Sample size.

Understand data quality and get insights

Let’s use the Data Quality and Insights Report to perform an analysis of the data that we imported into Data Wrangler. You can use the report to understand what steps you need to take to clean and process your data. This report provides information such as the number of missing values and the number of outliers. If you have issues with your data, such as target leakage or imbalance, the insights report can bring those issues to your attention.

  1. Choose the plus sign next to Data types and choose Get data insights.
  2. For Analysis type, choose Data Quality and Insights Report.
  3. For Target column, choose Churn?.
  4. For Problem type, select Classification.
  5. Choose Create.

You’re presented with a detailed report that you can review and download. The report includes several sections such as quick model, feature summary, feature correlation, and data insights. The following screenshots provide examples of these sections.

Observations from the report

From the report, we can make the following observations:

  • No duplicate rows were found.
  • The State column appears to be quite evenly distributed, so the data is balanced in terms of state population.
  • The Phone column has too many unique values to be of any practical use, so we can drop it in our transformation.
  • Based on the feature correlation section of the report, the Mins and Charge features are highly correlated. We can remove one of each pair.

Transformation

Based on our observations, we want to make the following transformations:

  • Remove the Phone column because it has many unique values.
  • We also see several features that essentially have 100% correlation with one another. Including these feature pairs in some ML algorithms can create undesired problems, whereas in others it will only introduce minor redundancy and bias. Let’s remove one feature from each of the highly correlated pairs: Day Charge from the pair with Day Mins, Eve Charge from the pair with Eve Mins, Night Charge from the pair with Night Mins, and Intl Charge from the pair with Intl Mins.
  • Convert True or False in the Churn column to be a numerical value of 1 or 0.
  1. Return to the data flow and choose the plus sign next to Data types.
  2. Choose Add transform.
  3. Choose Add step.
  4. You can search for the transform you’re looking for (in our case, manage columns).
  5. Choose Manage columns.
  6. For Transform, choose Drop column.
  7. For Columns to drop, choose Phone, Day Charge, Eve Charge, Night Charge, and Intl Charge.
  8. Choose Preview, then choose Update.

    Let’s add another transform to perform a categorical encode on the Churn? column.
  9. Choose the transform Encode categorical.
  10. For Transform, choose Ordinal encode.
  11. For Input columns, choose the Churn? column.
  12. For Invalid handling strategy, choose Replace with NaN.
  13. Choose Preview, then choose Update.

Now True and False are converted to 1 and 0, respectively.

Now that we have a good understanding of the data and have prepared and transformed it for model building, we can move the data to Studio Lab for model building.

Upload the data to Studio Lab

To start using the data in Studio Lab, complete the following steps:

  1. Choose Export data to export to an S3 bucket.
  2. For Amazon S3 location, enter your S3 path.
  3. Specify the file type.
  4. Choose Export data.
  5. After you export the data, you can download the data from the S3 bucket to your local computer.
  6. Now you can go to Studio Lab and upload the file to Studio Lab.

    Alternatively, you can connect to Amazon S3 from Studio Lab. For more information, refer to Use external resources in Amazon SageMaker Studio Lab.
  7. Let’s install the SageMaker SDK and import pandas.
  8. Import all other libraries as required.
  9. Now we can read the CSV file.
  10. Let’s print the churn DataFrame to confirm the dataset is correct. (A consolidated sketch of steps 7–10 follows this list.)
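
The following cell is a consolidated sketch of steps 7–10; the CSV file name is a placeholder for the file you exported from Data Wrangler:

# Run once in the Studio Lab notebook to install the libraries
%pip install sagemaker pandas

import sagemaker
import pandas as pd

# Read the transformed dataset exported from Data Wrangler (file name is a placeholder)
churn = pd.read_csv("churn_transformed.csv")

# Print the first rows to confirm the dataset is correct
churn.head()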

Now that you have the processed dataset in Studio Lab, you can carry out further steps required for model building.

Data Wrangler pricing

You can perform all the steps in this post for EDA or data preparation within Data Wrangler and pay standard instance, job, and storage pricing based on usage or consumption. No upfront or licensing fees are required.

Clean up

When you’re not using Data Wrangler, it’s important to shut down the instance on which it runs to avoid incurring additional fees. To avoid losing work, save your data flow before shutting Data Wrangler down.

  1. To save your data flow in Studio, choose File, then choose Save Data Wrangler Flow.
    Data Wrangler automatically saves your data flow every 60 seconds.
  2. To shut down the Data Wrangler instance, in Studio, choose Running Instances and Kernels.
  3. Under RUNNING APPS, choose the shutdown icon next to the sagemaker-data-wrangler-1.0 app.
  4. Choose Shut down all to confirm.

Data Wrangler runs on an ml.m5.4xlarge instance. This instance disappears from RUNNING INSTANCES when you shut down the Data Wrangler app.

After you shut down the Data Wrangler app, it has to restart the next time you open a Data Wrangler flow file. This can take a few minutes.

Conclusion

In this post, we saw how you can gain insights into your dataset, perform exploratory data analysis, prepare and transform data using Data Wrangler within Studio, and export the transformed and prepared data to Studio Lab and carry out model building and other steps.

With SageMaker Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization from a single visual interface.


About the authors

Rajakumar Sampathkumar is a Principal Technical Account Manager at AWS, providing customers guidance on business-technology alignment and supporting the reinvention of their cloud operation models and processes. He is passionate about the cloud and machine learning. Raj is also a machine learning specialist and works with AWS customers to design, deploy, and manage their AWS workloads and architectures.

Meenakshisundaram Thandavarayan is a Senior AI/ML specialist with a passion for designing, creating, and promoting human-centered data and analytics experiences. He supports AWS strategic customers on their transformation towards becoming data-driven organizations.

James Wu is a Senior AI/ML Specialist Solution Architect at AWS, helping customers design and build AI/ML solutions. James’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. Prior to joining AWS, James was an architect, developer, and technology leader for over 10 years, including 6 years in engineering and 4 years in the marketing and advertising industries.

Read More

Announcing Visual Conversation Builder for Amazon Lex

Amazon Lex is a service for building conversational interfaces using voice and text. Amazon Lex provides high-quality speech recognition and language understanding capabilities. With Amazon Lex, you can add sophisticated, natural language bots to new and existing applications. Amazon Lex reduces multi-platform development efforts, allowing you to easily publish your speech or text chatbots to mobile devices and multiple chat services, like Facebook Messenger, Slack, Kik, or Twilio SMS.

Today, we added a Visual Conversation Builder (VCB) to Amazon Lex—a drag-and-drop conversation builder that allows users to define bot information by manipulating visual objects, which are used to design and edit conversation flows in a no-code environment. There are three main benefits of the VCB:

  • It’s easier to collaborate through a single pane of glass
  • It simplifies conversational design and testing
  • It reduces code complexity

In this post, we introduce the VCB, how to use it, and share customer success stories.

Overview of the Visual Conversation Builder

In addition to the already available menu-based editor and Amazon Lex APIs, the visual builder gives a single view of an entire conversation flow in one location, simplifying bot design and reducing dependency on development teams. Conversational designers, UX designers, and product managers—anyone with an interest in building a conversation on Amazon Lex—can utilize the builder.

Designers and developers can now collaborate and build conversations easily in the VCB without coding the business logic behind the conversation. The visual builder helps accelerate time to market for Amazon Lex-based solutions by providing better collaboration, easier iterations of the conversation design, and reduced code complexity.

With the visual builder, it’s now possible to quickly view the entire conversation flow of the intent at a glance and get visual feedback as changes are made. Changes to your design are instantly reflected in the view, and any effects to dependencies or branching logic is immediately apparent to the designer. You can use the visual builder to make any changes to the intent, such as adding utterances, slots, prompts, or responses. Each block type has its own settings that you can configure to tailor the flow of the conversation.

Previously, complex branching of conversations required implementation of AWS Lambda—a serverless, event-driven compute service—to achieve the desired pathing. The visual builder reduces the need for Lambda integrations, and designers can perform conversation branching without the need for Lambda code, as shown in the following example. This helps to decouple conversation design activities from Lambda business logic and integrations. You can still use the existing intent editor in conjunction with the visual builder, or switch between them at any time when creating and modifying intents.

The VCB is a no-code method of designing complex conversations. For example, you can now add a confirmation prompt in an intent and branch based on a Yes or No response to different paths in the flow without code. Where future Lambda business logic is needed, conversation designers can add placeholder blocks into the flow so developers know what needs to be addressed through code. Code hook blocks with no Lambda functions attached automatically take the Success pathway so testing of the flow can continue until the business logic is completed and implemented. In addition to branching, the visual builder offers designers the ability to go to another intent as part of the conversation flow.

Upon saving, the VCB automatically scans the conversation flow to detect any errors. In addition, the VCB auto-detects missing failure paths and provides the capability to auto-add those paths into the flow, as shown in the following example.

Using the Visual Conversation Builder

You can access the VCB via the Amazon Lex console by going to a bot and editing or creating a new intent. On the intent page, you can now switch between the visual builder interface and the traditional intent editor, as shown in the following screenshot.

The visual builder displays existing intents graphically on the canvas, giving you a view of what has already been designed in a visual layout. New intents start with a blank canvas: simply drag the components you want onto the canvas and connect them together to create the conversation flow.

The visual builder has three main components: blocks, ports, and edges. Let’s get into how these are used in conjunction to create a conversation from beginning to end within an intent.

The basic building unit of a conversation flow is called a block. The top menu of the visual builder contains all the blocks you are able to use. To add a block to a conversation flow, drag it from the top menu onto the flow.

Each block has a specific functionality to handle different use cases of a conversation. The currently available block types are as follows:

  • Start – The root or first block of the conversation flow that can also be configured to send an initial response
  • Get slot value – Tries to elicit a value for a single slot
  • Condition – Can contain up to four custom branches (with conditions) and one default branch
  • Dialog code hook – Handles invocation of the dialog Lambda function and includes bot responses based on dialog Lambda functions succeeding, failing, or timing out
  • Confirmation – Queries the customer prior to fulfillment of the intent and includes bot responses based on the customer saying yes or no to the confirmation prompt
  • Fulfillment – Handles fulfillment of the intent and can be configured to invoke Lambda functions and respond with messages if fulfillment succeeds or fails
  • Closing response – Allows the bot to respond with a message before ending the conversation
  • Wait for user input – Captures input from the customer and switches to another intent based on the utterance
  • End conversation – Indicates the end of the conversation flow

Take the Order Flowers bot as an example. The OrderFlowers intent, when viewed in the visual builder, uses five blocks: Start, three different Get slot value blocks, and Confirmation.

Each block can contain one or more ports, which are used to connect one block to another. Blocks contain an input port and one or more output ports based on the desired paths for states such as success, timeout, and error.

The connection between the output port of one block and the input port of another block is referred to as an edge.

In the OrderFlowers intent, when the conversation starts, the Start output port is connected to the Get slot value: FlowerType input port using an edge. Each Get slot value block is connected using ports and edges to create a sequence in the conversation flow, which ensures the intent has all the slot values it needs to put in the order.

Notice that currently there is no edge connected to the failure output port of these blocks, but the builder will automatically add these if you choose Save intent and then choose Confirm in the pop-up Auto add block and edges for failure paths. The visual builder then adds an End conversation block and a Go to intent block, connecting the failure and error output ports to Go to intent and connecting the Yes/No ports of the Confirmation block to End conversation.

After the builder adds the blocks and edges, the intent is saved and the conversation flow can be built and tested. Let’s add a Welcome intent to the bot using the visual builder. From the OrderFlowers intent visual builder, choose Back to intents list in the navigation pane. On the Intents page, choose Add intent followed by Add empty intent. In the Intent name field, enter Welcome and choose Add.

Switch to the Visual builder tab and you will see an empty intent, with only the Start block currently on the canvas. To start, add some utterances to this intent so that the bot will be able to direct users to the Welcome intent. Choose the edit button of the Start block and scroll down to Sample utterances. Add the following utterances to this intent and then close the block:

  • Can you help me?
  • Hi
  • Hello
  • I need help

Now let’s add a response for the bot to give when it hits this intent. Because the Welcome intent won’t be processing any logic, we can drag a Closing response block into the canvas to add this message. After you add the block, choose the edit icon on the block and enter the following response:

Hi! I am the Order Flowers Bot. How can I help you today?

The canvas should now have two blocks, but they aren’t connected to each other. We can connect the ports of these two blocks using an edge.

To connect the two ports, simply click and drag from the No response output port of the Start block to the input port of the Closing response block.

At this point, you can complete the conversation flow in two different ways:

  • First, you can manually add the End conversation block and connect it to the Closing response block.
  • Alternatively, choose Save intent and then choose Confirm to have the builder create this block and connection for you.

After the intent is saved, choose Build and wait for the build to complete, then choose Test.

The bot will now properly greet the customer if an utterance matches this newly created intent.

Customer stories

NeuraFlash is an Advanced AWS Partner with over 40 collective years of experience in the voice and automation space. With a dedicated team of Conversational Experience Designers, Speech Scientists, and AWS developers, NeuraFlash helps customers take advantage of the power of Amazon Lex in their contact centers.

“One of our key focus areas is helping customers leverage AI capabilities for developing conversational interfaces. These interfaces often require specialized bot configuration skills to build effective flows. With the Visual Conversation Builder, our designers can quickly and easily build conversational interfaces, allowing them to experiment at a faster rate and deliver quality products for our customers without requiring developer skills. The drag-and-drop UI and the visual conversation flow is a game-changer for reinventing the contact center experience.”

The SmartBots ML-powered platform lies at the core of the design, prototyping, testing, validating, and deployment of AI-driven chatbots. This platform supports the development of custom enterprise bots that can easily integrate with any application—even an enterprise’s custom application ecosystem.

“The Visual Conversation Builder’s easy-to-use drag-and-drop interface enables us to easily onboard Amazon Lex, and build complex conversational experiences for our customers’ contact centers. With this new functionality, we can improve Interactive Voice Response (IVR) systems faster and with minimal effort. Implementing new technology can be difficult with a steep learning curve, but we found that the drag-and-drop features were easy to understand, allowing us to realize value immediately.”

Conclusion

The Visual Conversation Builder for Amazon Lex is now generally available, for free, in all AWS Regions where Amazon Lex V2 operates.

Additionally, on August 17, 2022, Amazon Lex V2 released a change to the way conversations are managed with the user. This change gives you more control over the path that the user takes through the conversation. For more information, see Understanding conversation flow management. Note that bots created before August 17, 2022, do not support the VCB for creating conversation flows.

To learn more, see Amazon Lex FAQs and the Amazon Lex V2 Developer Guide. Please send feedback to AWS re:Post for Amazon Lex or through your usual AWS support contacts.


About the authors

Thomas Rindfuss is a Sr. Solutions Architect on the Amazon Lex team. He invents, develops, prototypes, and evangelizes new technical features and solutions for Language AI services that improve the customer experience and ease adoption.

Austin Johnson is a Solutions Architect at AWS, helping customers on their cloud journey. He is passionate about building and utilizing conversational AI platforms to add sophisticated, natural language interfaces to their applications.

Read More

Get better insight from reviews using Amazon Comprehend

“85% of buyers trust online reviews as much as a personal recommendation” – Gartner

Consumers are increasingly engaging with businesses through digital surfaces and multiple touchpoints. Statistics show that the majority of shoppers use reviews to determine what products to buy and which services to use. As per Spiegel Research Centre, the purchase likelihood for a product with five reviews is 270% greater than the purchase likelihood of a product with no reviews. Reviews have the power to influence consumer decisions and strengthen brand value.

In this post, we use Amazon Comprehend to extract meaningful information from product reviews, analyze it to understand how users of different demographics are reacting to products, and discover aggregated information on user affinity towards a product. Amazon Comprehend is a fully managed and continuously trained natural language processing (NLP) service that can extract insight about content of a document or text.

Solution overview

Today, reviews can be provided by customers in various ways, such as star ratings, free text or natural language, or social media shares. Free text or natural language reviews help build trust, as it’s an independent opinion from consumers. It’s often used by product teams to interact with customers through review channels. It’s a proven fact that when customers feel heard, their feeling about the brand improves. Whereas it’s comparatively easier to analyze star ratings or social media shares, natural language or free text reviews pose multiple challenges, like identifying keywords or phrases, topics or concepts, and sentiment or entity-level sentiments. The challenge is mainly due to the variability of length in written text and plausible presence of both signals and noise. Furthermore, the information can either be very clear and explicit (for example, with keywords and key phrases) or unclear and implicit (abstract topics and concepts). Even more challenging is understanding different types of sentiments and relating them to appropriate products and services. Nevertheless, it’s highly critical to understand this information and textual signals in order to provide a frictionless customer experience.

In this post, we use a publicly available NLP – fast.ai dataset to analyze the product reviews provided by customers. We start by using an unsupervised machine learning (ML) technique known as topic modeling. This is a popular unsupervised technique that discovers abstract topics occurring in a collection of text reviews. Topic modeling is an unsupervised clustering problem, meaning that the models have no knowledge of possible target variables (such as topics in a review). The topics are represented as clusters. Often, the number of clusters in a corpus of documents is decided with the help of domain experts or by using some standard statistical analysis. The model outputs generally have three components: numbered clusters (topic 0, topic 1, and so on), keywords associated with each cluster, and representative clusters for each document (or review in our case). By their inherent nature, topic models don’t generate human-readable labels for the clusters or topics, which is a common misconception. Something to note about topic modeling in general is that it’s a mixed-membership model: every document in the model may have some resemblance to every topic. The topic model learns in an iterative Bayesian process to determine the probability that each document is associated with a given theme or topic. The model output depends on selecting the number of topics optimally. A small number of topics can result in topics that are too broad, and a larger number of topics may result in redundant or highly similar topics. There are a number of ways to evaluate topic models:

  • Human judgment – Observation-based, interpretation-based
  • Quantitative metrics – Perplexity, coherence calculations
  • Mixed approach – A combination of judgment-based and quantitative approaches

Perplexity is calculated by splitting a dataset into two parts: a training set and a test set. Likelihood is usually calculated as a logarithm, so this metric is sometimes referred to as the held-out log-likelihood. Perplexity is a predictive metric. It assesses a topic model’s ability to predict a test set after having been trained on a training set. One of the shortcomings of perplexity is that it doesn’t capture context, meaning that it doesn’t capture the relationship between words in a topic or topics in a document. However, the idea of semantic context is important for human understanding. Measures such as the conditional likelihood of the co-occurrence of words in a topic can be helpful. These approaches are collectively referred to as coherence. For this post, we focus on the human judgment (observation-based) approach, namely observing the top n words in a topic.
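Although we rely on human judgment in this post, held-out perplexity is commonly written in the following form, stated here only for reference (the exact normalization can vary between implementations, so treat this as an illustrative formulation rather than the one Amazon Comprehend uses internally):

\[
\mathrm{perplexity}(D_{\mathrm{test}}) = \exp\left(-\frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d}\right)
\]

Here, M is the number of held-out documents, w_d is the sequence of words in document d, and N_d is the number of words in document d; lower values indicate that the trained model predicts the held-out reviews better.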

The solution consists of the following high-level steps:

  1. Set up an Amazon SageMaker notebook instance.
  2. Create a notebook.
  3. Perform exploratory data analysis.
  4. Run your Amazon Comprehend topic modeling job.
  5. Generate topics and understand sentiment.
  6. Use Amazon QuickSight to visualize data and generate reports.

You can use this solution in any AWS Region, but you need to make sure that the Amazon Comprehend APIs and SageMaker are in the same Region. For this post, we use the Region US East (N. Virginia).

Set up your SageMaker notebook instance

You can interact with Amazon Comprehend via the AWS Management Console, AWS Command Line Interface (AWS CLI), or Amazon Comprehend API. For more information, refer to Getting started with Amazon Comprehend. We use a SageMaker notebook and Python (Boto3) code throughout this post to interact with the Amazon Comprehend APIs.

  1. On the Amazon SageMaker console, under Notebook in the navigation pane, choose Notebook instances.
  2. Choose Create notebook instance.
  3. Specify a notebook instance name and set the instance type as ml.r5.2xlarge.
  4. Leave the rest of the default settings.
  5. Create an AWS Identity and Access Management (IAM) role with AmazonSageMakerFullAccess and access to any necessary Amazon Simple Storage Service (Amazon S3) buckets and Amazon Comprehend APIs.
  6. Choose Create notebook instance.
    After a few minutes, your notebook instance is ready.
  7. To access Amazon Comprehend from the notebook instance, you need to attach the ComprehendFullAccess policy to your IAM role.

For a security overview of Amazon Comprehend, refer to Security in Amazon Comprehend.

Create a notebook

After you open the notebook instance that you provisioned, on the Jupyter console, choose New and then Python 3 (Data Science). Alternatively, you can access the sample code file in the GitHub repo. You can upload the file to the notebook instance to run it directly or clone it.

The GitHub repo contains three notebooks:

  • data_processing.ipynb
  • model_training.ipynb
  • topic_mapping_sentiment_generation.ipynb

Perform exploratory data analysis

We use the first notebook (data_processing.ipynb) to explore and process the data. We start by simply loading the data from an S3 bucket into a DataFrame.

import os

import pandas as pd

# Bucket containing the data
BUCKET = 'clothing-shoe-jewel-tm-blog'

# Item ratings and metadata
S3_DATA_FILE = 'Clothing_Shoes_and_Jewelry.json.gz' # Gzip-compressed JSON
S3_META_FILE = 'meta_Clothing_Shoes_and_Jewelry.json.gz' # Gzip-compressed JSON

S3_DATA = 's3://' + BUCKET + '/' + S3_DATA_FILE
S3_META = 's3://' + BUCKET + '/' + S3_META_FILE

# Transformed review, input for Comprehend
LOCAL_TRANSFORMED_REVIEW = os.path.join('data', 'TransformedReviews.txt')
S3_OUT = 's3://' + BUCKET + '/out/' + 'TransformedReviews.txt'

# Final dataframe where topics and sentiments are going to be joined
S3_FEEDBACK_TOPICS = 's3://' + BUCKET + '/out/' + 'FinalDataframe.csv'

def convert_json_to_df(path):
    """Reads a subset of a json file in a given path in chunks, combines, and returns
    """
    # Creating chunks from 500k data points each of chunk size 10k
    chunks = pd.read_json(path, orient='records', 
                                lines=True, 
                                nrows=500000, 
                                chunksize=10000, 
                                compression='gzip')
    # Creating a single dataframe from all the chunks
    load_df = pd.DataFrame()
    for chunk in chunks:
        load_df = pd.concat([load_df, chunk], axis=0)
    return load_df

# Review data
original_df = convert_json_to_df(S3_DATA)

# Metadata
original_meta = convert_json_to_df(S3_META)

In the following section, we perform exploratory data analysis (EDA) to understand the data. We start by exploring the shape of the data and metadata. For authenticity, we use verified reviews only.

# Shape of reviews and metadata
print('Shape of review data: ', original_df.shape)
print('Shape of metadata: ', original_meta.shape)

# We are interested in verified reviews only
# Also checking the amount of missing values in the review data
print('Frequency of verified/non verified review data: ', original_df['verified'].value_counts())
print('Frequency of missing values in review data: ', original_df.isna().sum())

We further explore the count of each category, and see if any duplicate data is present.

# Count of each category for EDA
print('Frequency of different item categories in metadata: ', original_meta['category'].value_counts())

# Checking null values for metadata
print('Frequency of missing values in metadata: ', original_meta.isna().sum())

# Checking if there are duplicated data. There are indeed duplicated data in the dataframe.
print('Duplicate items in metadata: ', original_meta[original_meta['asin'].duplicated()])

When we’re satisfied with the results, we move to the next step of preprocessing the data. Amazon Comprehend recommends providing at least 1,000 documents in each topic modeling job, with each document at least three sentences long. Documents must be in UTF-8 formatted text files. In the following step, we make sure that data is in the recommended UTF-8 format and each input is no more than 5,000 bytes in size.

def clean_text(df):
    """Preprocessing review text.
    The text becomes Comprehend compatible as a result.
    This is the most important preprocessing step.
    """
    # Encode and decode reviews, dropping characters that can't be represented
    df['reviewText'] = df['reviewText'].str.encode('ascii', 'ignore')
    df['reviewText'] = df['reviewText'].str.decode('ascii')

    # Replacing line breaks, tabs, and line separators with whitespace
    df['reviewText'] = df['reviewText'].replace(r'\r+|\n+|\t+|\u2028', ' ', regex=True)

    # Removing punctuation
    df['reviewText'] = df['reviewText'].str.replace(r'[^\w\s]', '', regex=True)

    # Lowercasing reviews
    df['reviewText'] = df['reviewText'].str.lower()
    return df

def prepare_input_data(df):
    """Encoding and getting reviews in byte size.
    Review gets encoded to utf-8 format and getting the size of the reviews in bytes. 
    Comprehend requires each review input to be no more than 5000 Bytes
    """
    df['review_size'] = df['reviewText'].apply(lambda x:len(x.encode('utf-8')))
    df = df[(df['review_size'] > 0) & (df['review_size'] < 5000)]
    df = df.drop(columns=['review_size'])
    return df

# Only data points with a verified review will be selected and the review must not be missing
filter = (original_df['verified'] == True) & (~original_df['reviewText'].isna())
filtered_df = original_df[filter]

# Only a subset of fields are selected in this experiment. 
filtered_df = filtered_df[['asin', 'reviewText', 'summary', 'unixReviewTime', 'overall', 'reviewerID']]

# Just in case, once again, dropping data points with missing review text
filtered_df = filtered_df.dropna(subset=['reviewText'])
print('Shape of review data: ', filtered_df.shape)

# Dropping duplicate items from metadata
original_meta = original_meta.drop_duplicates(subset=['asin'])

# Only a subset of fields are selected in this experiment. 
original_meta = original_meta[['asin', 'category', 'title', 'description', 'brand', 'main_cat']]

# Clean reviews using text cleaning pipeline
df = clean_text(filtered_df)

# Dataframe where Comprehend outputs (topics and sentiments) will be added
df = prepare_input_data(df)

We then save the data to Amazon S3 and also keep a local copy in the notebook instance.

# Saving dataframe on S3
df.to_csv(S3_FEEDBACK_TOPICS, index=False)

# Reviews are transformed per Comprehend guideline- one review per line
# The txt file will be used as input for Comprehend
# We first save the input file locally
with open(LOCAL_TRANSFORMED_REVIEW, "w") as outfile:
    outfile.write("\n".join(df['reviewText'].tolist()))

# Transferring the transformed review (input to Comprehend) to S3
!aws s3 mv {LOCAL_TRANSFORMED_REVIEW} {S3_OUT}

This completes our data processing phase.

Run an Amazon Comprehend topic modeling job

We then move to the next phase, where we use the preprocessed data to run a topic modeling job using Amazon Comprehend. At this stage, you can either use the second notebook (model_training.ipynb) or use the Amazon Comprehend console to run the topic modeling job. For instructions on using the console, refer to Running analysis jobs using the console. If you’re using the notebook, you can start by creating an Amazon Comprehend client using Boto3, as shown in the following example.

import json
import os
import tarfile
import time

import boto3
import pandas as pd

# Client and session information
session = boto3.Session()
s3 = boto3.resource('s3')

# Account id. Required downstream.
account_id = boto3.client('sts').get_caller_identity().get('Account')

# Initializing Comprehend client
comprehend = boto3.client(service_name='comprehend', 
                          region_name=session.region_name)

You can submit your documents for topic modeling in two ways: one document per file, or one document per line.

We start with five topics (the k value) and use the one-document-per-line format. There is no single standard practice for selecting k, the number of topics; you can try different values of k and select the one that yields the largest held-out likelihood.

# Number of topics set to 5 after having a human-in-the-loop
# This needs to be fully aligned with topicMaps dictionary in the third script 
NUMBER_OF_TOPICS = 5

# Input file format of one review per line
input_doc_format = "ONE_DOC_PER_LINE"

# Role arn (Hard coded, masked)
data_access_role_arn = "arn:aws:iam::XXXXXXXXXXXX:role/service-role/AmazonSageMaker-ExecutionRole-XXXXXXXXXXXXXXX"

Our Amazon Comprehend topic modeling job requires an InputDataConfig dictionary with the S3Uri of the input data and the InputFormat, an OutputDataConfig dictionary with the S3Uri where results are written, and a DataAccessRoleArn that grants Amazon Comprehend access to the data. For more information, refer to the Boto3 documentation for start_topics_detection_job.

# Constants for S3 bucket and input data file
BUCKET = 'clothing-shoe-jewel-tm-blog'
input_s3_url = 's3://' + BUCKET + '/out/' + 'TransformedReviews.txt'
output_s3_url = 's3://' + BUCKET + '/out/' + 'output/'

# Final dataframe where we will join Comprehend outputs later
S3_FEEDBACK_TOPICS = 's3://' + BUCKET + '/out/' + 'FinalDataframe.csv'

# Local copy of Comprehend output (same comprehend-out directory used below)
LOCAL_COMPREHEND_OUTPUT_DIR = os.path.join('comprehend-out', '')
LOCAL_COMPREHEND_OUTPUT_FILE = os.path.join(LOCAL_COMPREHEND_OUTPUT_DIR, 'output.tar.gz')

INPUT_CONFIG={
    # The S3 URI where Comprehend input is placed.
    'S3Uri':    input_s3_url,
    # Document format
    'InputFormat': input_doc_format,
}
OUTPUT_CONFIG={
    # The S3 URI where Comprehend output is placed.
    'S3Uri':    output_s3_url,
}

You can then start an asynchronous topic detection job by passing the number of topics, input configuration object, output configuration object, and an IAM role, as shown in the following example.

# Reading the Comprehend input file just to double-check that the number of reviews
# and the number of lines in the input file match exactly.
obj = s3.Object(BUCKET, 'out/TransformedReviews.txt')
comprehend_input = obj.get()['Body'].read().decode('utf-8')
comprehend_input_lines = len(comprehend_input.split('\n'))

# Reviews where Comprehend outputs will be merged
df = pd.read_csv(S3_FEEDBACK_TOPICS)
review_df_length = df.shape[0]

# The two lengths must be equal
assert comprehend_input_lines == review_df_length

# Start Comprehend topic modelling job.
# Specifies the number of topics, input and output config and IAM role ARN 
# that grants Amazon Comprehend read access to data.
start_topics_detection_job_result = comprehend.start_topics_detection_job(
                                                    NumberOfTopics=NUMBER_OF_TOPICS,
                                                    InputDataConfig=INPUT_CONFIG,
                                                    OutputDataConfig=OUTPUT_CONFIG,
                                                    DataAccessRoleArn=data_access_role_arn)

print('start_topics_detection_job_result: ' + json.dumps(start_topics_detection_job_result))

# Job ID is required downstream for extracting the Comprehend results
job_id = start_topics_detection_job_result["JobId"]
print('job_id: ', job_id)

You can track the current status of the job by calling the DescribeTopicDetectionJob operation. The status of the job can be one of the following:

  • SUBMITTED – The job has been received and is queued for processing
  • IN_PROGRESS – Amazon Comprehend is processing the job
  • COMPLETED – The job was successfully completed and the output is available
  • FAILED – The job didn’t complete

# Topic detection takes a while to complete.
# We can track the current status by calling the DescribeTopicDetectionJob operation.
# Keeping track if Comprehend has finished its job
description = comprehend.describe_topics_detection_job(JobId=job_id)

topic_detection_job_status = description['TopicsDetectionJobProperties']["JobStatus"]
print(topic_detection_job_status)
while topic_detection_job_status not in ["COMPLETED", "FAILED"]:
    time.sleep(120)
    topic_detection_job_status = comprehend.describe_topics_detection_job(JobId=job_id)['TopicsDetectionJobProperties']["JobStatus"]
    print(topic_detection_job_status)

topic_detection_job_status = comprehend.describe_topics_detection_job(JobId=job_id)['TopicsDetectionJobProperties']["JobStatus"]
print(topic_detection_job_status)

When the job is successfully complete, it returns a compressed archive containing two files: topic-terms.csv and doc-topics.csv. The first output file, topic-terms.csv, is a list of topics in the collection. For each topic, the list includes, by default, the top terms by topic according to their weight. The second file, doc-topics.csv, lists the documents associated with a topic and the proportion of the document that is concerned with the topic. Because we specified ONE_DOC_PER_LINE earlier in the input_doc_format variable, the document is identified by the file name and the 0-indexed line number within the file. For more information on topic modeling, refer to Topic modeling.

The outputs of Amazon Comprehend are copied locally for our next steps.

# Bucket prefix where model artifacts are stored
prefix = f'{account_id}-TOPICS-{job_id}'

# Model artifact zipped file
artifact_file = 'output.tar.gz'

# Location on S3 where model artifacts are stored
target = f's3://{BUCKET}/out/output/{prefix}/{artifact_file}'

# Copy Comprehend output from S3 to local notebook instance
! aws s3 cp {target}  ./comprehend-out/

# Unzip the Comprehend output file. 
# Two files are now saved locally- 
#       (1) comprehend-out/doc-topics.csv and 
#       (2) comprehend-out/topic-terms.csv

comprehend_tars = tarfile.open(LOCAL_COMPREHEND_OUTPUT_FILE)
comprehend_tars.extractall(LOCAL_COMPREHEND_OUTPUT_DIR)
comprehend_tars.close()

Because the number of topics is much smaller than the vocabulary associated with the document collection, the topic space representation can also be viewed as a dimensionality reduction process. You may use this topic space representation of documents to perform clustering. Alternatively, you can analyze the frequency of words in each cluster to determine the topic associated with each cluster. For this post, we don’t perform any other techniques like clustering.

Generate topics and understand sentiment

We use the third notebook (topic_mapping_sentiment_generation.ipynb) to find how users of different demographics are reacting to products, and also analyze aggregated information on user affinity towards a particular product.

We can combine the outputs from the previous notebook to get the topics and associated terms for each topic. However, the topics are numbered and may lack explainability. Therefore, we prefer to use a human-in-the-loop with enough domain knowledge and subject matter expertise to name the topics by looking at their associated terms. This process can be considered a mapping from topic numbers to topic names. However, it’s noteworthy that the lists of terms for different topics can overlap and therefore may suggest multiple plausible mappings. The human-in-the-loop should formalize the mappings based on the context of the use case. Otherwise, downstream performance may be impacted.

We start by declaring the variables. For each review, there can be multiple topics. We count their frequency and select a maximum of three most frequent topics. These topics are reported as the representative topics of a review. First, we define a variable TOP_TOPICS to hold the maximum number of representative topics. Second, we define and set values to the language_code variable to support the required language parameter of Amazon Comprehend. Finally, we create topicMaps, which is a dictionary that maps topic numbers to topic names.

import os
from collections import Counter

import boto3
import pandas as pd

# boto3 session to access service
session = boto3.Session()
comprehend = boto3.client(  'comprehend',
                            region_name=session.region_name)

# S3 bucket
BUCKET = 'clothing-shoe-jewel-tm-blog'

# Local copies of the Comprehend output files
DOC_TOPIC_FILE = os.path.join('comprehend-out', 'doc-topics.csv')
TOPIC_TERM_FILE = os.path.join('comprehend-out', 'topic-terms.csv')

# Final dataframe where we will join Comprehend outputs later
S3_FEEDBACK_TOPICS = 's3://' + BUCKET + '/out/' + 'FinalDataframe.csv'

# Final output
S3_FINAL_OUTPUT = 's3://' + BUCKET + '/out/' + 'reviewTopicsSentiments.csv'

# Top 3 topics per product will be aggregated
TOP_TOPICS = 3

# Working on English language only. 
language_code = 'en'

# Topic names for 5 topics created by human-in-the-loop or SME feed
topicMaps = {
    0: 'Product comfortability',
    1: 'Product Quality and Price',
    2: 'Product Size',
    3: 'Product Color',
    4: 'Product Return',
}

Next, we use the topic-terms.csv file generated by Amazon Comprehend to connect the unique terms associated with each topic. Then, by applying the mapping dictionary on this topic-term association, we connect the unique terms to the topic names.

# Loading documents and topics assigned to each of them by Comprehend
docTopics = pd.read_csv(DOC_TOPIC_FILE)
docTopics.head()

# Creating a field with doc number. 
# This doc number is the line number of the input file to Comprehend.
docTopics['doc'] = docTopics['docname'].str.split(':').str[1]
docTopics['doc'] = docTopics['doc'].astype(int)
docTopics.head()

# Load topics and associated terms from topic-terms.csv
topicTerms = pd.read_csv(TOPIC_TERM_FILE)

# Consolidate terms for each topic
aggregatedTerms = topicTerms.groupby('topic')['term'].aggregate(lambda term: term.unique().tolist()).reset_index()

# Sneak peek
aggregatedTerms.head(10)
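The snippets above stop at the numbered topics; a minimal sketch of the mapping step, assuming the topicMaps dictionary and the aggregatedTerms DataFrame defined earlier, might look like the following. The TopicNames column it creates is the one used later to build the composite topic-sentiment key.

# Attach the human-readable names chosen by the human-in-the-loop to each topic.
# The topic column is cast to int in case it was read from the CSV as a zero-padded string.
aggregatedTerms['TopicNames'] = aggregatedTerms['topic'].astype(int).map(topicMaps)

# Each topic now carries its list of terms and a readable name
aggregatedTerms.head(10)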

This mapping improves the readability and explainability of the topics generated by Amazon Comprehend, as we can see in the following DataFrame.

Topic to Topic Names Mapping

Furthermore, we join the topic number, terms, and names to the initial input data, as shown in the following steps.

This returns topic terms and names corresponding to each review. The topic numbers and terms are joined with each review and then further joined back to the original DataFrame we saved in the first notebook.

# Load final dataframe where Comprehend results will be merged to 
feedbackTopics = pd.read_csv(S3_FEEDBACK_TOPICS)

# Joining topic numbers to main data
# The index of feedbackTopics is referring to doc field of docTopics dataframe
feedbackTopics = pd.merge(feedbackTopics, 
                          docTopics, 
                          left_index=True, 
                          right_on='doc', 
                          how='left')

# Reviews will now have topic numbers, associated terms and topics names
feedbackTopics = feedbackTopics.merge(aggregatedTerms, 
                                      on='topic', 
                                      how='left')
feedbackTopics.head()

We generate sentiment for the review text using detect_sentiment. It inspects text and returns an inference of the prevailing sentiment (POSITIVE, NEUTRAL, MIXED, or NEGATIVE).

def detect_sentiment(text, language_code):
    """Detects sentiment for a given text and language
    """
    comprehend_json_out = comprehend.detect_sentiment(Text=text, LanguageCode=language_code)
    return comprehend_json_out

# Comprehend output for sentiment in raw json 
feedbackTopics['comprehend_sentiment_json_out'] = feedbackTopics['reviewText'].apply(lambda x: detect_sentiment(x, language_code))

# Extracting the exact sentiment from raw Comprehend Json
feedbackTopics['sentiment'] = feedbackTopics['comprehend_sentiment_json_out'].apply(lambda x: x['Sentiment'])

# Sneak peek
feedbackTopics.head(2)

Both topics and sentiments are tightly coupled with reviews. Because we will be aggregating topics and sentiments at product level, we need to create a composite key by combining the topics and sentiments generated by Amazon Comprehend.

# Creating a composite key of topic name and sentiment.
# This is because we are counting frequency of this combination.
feedbackTopics['TopicSentiment'] = feedbackTopics['TopicNames'] + '_' + feedbackTopics['sentiment']

Afterwards, we aggregate at product level and count the composite keys for each product.

This final step helps us better understand the granularity of the reviews per product and categorize them per topic in an aggregated manner. For instance, consider the values shown for the topicDF DataFrame. For the first product, across all of its reviews, customers overall had a positive experience with product return, size, and comfort. For the second product, customers had a mostly mixed-to-positive experience with product return and a positive experience with product size.

# Create product id group
asinWiseDF = feedbackTopics.groupby('asin')

# Each product now has a list of topics and sentiment combo (topics can appear multiple times)
topicDF = asinWiseDF['TopicSentiment'].apply(lambda x:list(x)).reset_index()

# Count appearances of each topic-sentiment combo for the product
topicDF['TopTopics'] = topicDF['TopicSentiment'].apply(Counter)

# Sorting topics-sentiment combo based on their appearance
topicDF['TopTopics'] = topicDF['TopTopics'].apply(lambda x: sorted(x, key=x.get, reverse=True))

# Select Top k topics-sentiment combo for each product/review
topicDF['TopTopics'] = topicDF['TopTopics'].apply(lambda x: x[:TOP_TOPICS])

# Sneak peek
topicDF.head()

Top Topics per Product

Our final DataFrame consists of this topic and sentiment information joined back to the review data that we saved on Amazon S3 in our first notebook.

# Adding the topic-sentiment combo back to the review data saved in the first notebook
finalDF = pd.read_csv(S3_FEEDBACK_TOPICS).merge(topicDF, on='asin', how='left')

# Only selecting a subset of fields
finalDF = finalDF[['asin', 'TopTopics', 'category', 'title']]

# Saving the final output to S3
finalDF.to_csv(S3_FINAL_OUTPUT, index=False)

Use Amazon QuickSight to visualize the data

You can use QuickSight to visualize the data and generate reports. QuickSight is a business intelligence (BI) service that you can use to consume data from many different sources and build intelligent dashboards. In this example, we generate a QuickSight analysis using the final dataset we produced, as shown in the following example visualizations.

QuickSight Visualization

To learn more about Amazon QuickSight, refer to Getting started with Amazon QuickSight.

Cleanup

At the end, shut down the notebook instance used in this experiment from the Amazon SageMaker console so that you don’t continue to incur charges.
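If you prefer to do this programmatically, the following minimal sketch uses the Boto3 SageMaker client; the instance name shown is a placeholder for whatever name you chose when you created the instance.

import boto3

sagemaker = boto3.client('sagemaker')

# Stop the notebook instance so it no longer accrues charges
# (replace the placeholder name with your own instance name)
sagemaker.stop_notebook_instance(NotebookInstanceName='comprehend-reviews-notebook')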

Conclusion

In this post, we demonstrated how to use Amazon Comprehend to analyze product reviews and find the top topics using topic modeling as a technique. Topic modeling enables you to look through multiple topics and organize, understand, and summarize them at scale. You can quickly and easily discover hidden patterns that are present across the data, and then use that insight to make data-driven decisions. You can use topic modeling to solve numerous business problems, such as automatically tagging customer support tickets, routing conversations to the right teams based on topic, detecting the urgency of support tickets, getting better insights from conversations, creating data-driven plans, creating problem-focused content, improving sales strategy, and identifying customer issues and frictions.

These are just a few examples, but you can think of many more business problems that you face in your organization on a daily basis, and how you can use topic modeling with other ML techniques to solve those.


About the Authors

Gurpreet Cheema is a Data Scientist with AWS Professional Services based out of Canada. She is passionate about helping customers innovate with Machine Learning and Artificial Intelligence technologies to tap business value and insights from data. In her spare time, she enjoys hiking outdoors and reading books.

Rushdi Shams is a Data Scientist with AWS Professional Services, Canada. He builds machine learning products for AWS customers. He loves to read and write science fiction.

Wrick Talukdar is a Senior Architect with the Amazon Comprehend service team. He works with AWS customers to help them adopt machine learning on a large scale. Outside of work, he enjoys reading and photography.

Read More

Prepare data at scale in Amazon SageMaker Studio using serverless AWS Glue interactive sessions

Amazon SageMaker Studio is the first fully integrated development environment (IDE) for machine learning (ML). It provides a single, web-based visual interface where you can perform all ML development steps, including preparing data and building, training, and deploying models.

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, ML, and application development. AWS Glue enables you to seamlessly collect, transform, cleanse, and prepare data for storage in your data lakes and data pipelines using a variety of capabilities, including built-in transforms.

Data engineers and data scientists can now interactively prepare data at scale using their Studio notebook’s built-in integration with serverless Spark sessions managed by AWS Glue. Starting in seconds and automatically stopping compute when idle, AWS Glue interactive sessions provide an on-demand, highly-scalable, serverless Spark backend to achieve scalable data preparation within Studio. Notable benefits of using AWS Glue interactive sessions on Studio notebooks include:

  • No clusters to provision or manage
  • No idle clusters to pay for
  • No up-front configuration required
  • No resource contention for the same development environment
  • The exact same serverless Spark runtime and platform as AWS Glue extract, transform, and load (ETL) jobs

In this post, we show you how to prepare data at scale in Studio using serverless AWS Glue interactive sessions.

Solution overview

To implement this solution, you complete the following high-level steps:

  1. Update your AWS Identity and Access Management (IAM) role permissions.
  2. Launch an AWS Glue interactive session kernel.
  3. Configure your interactive session.
  4. Customize your interactive session and run a scalable data preparation workload.

Update your IAM role permissions

To start, you need to update your Studio user’s IAM execution role with the required permissions. For detailed instructions, refer to Permissions for Glue interactive sessions in SageMaker Studio.

You first add the managed policies to your execution role:

  1. On the IAM console, choose Roles in the navigation pane.
  2. Find the Studio execution role that you will use, and choose the role name to go to the role summary page.
  3. On the Permissions tab, on the Add Permissions menu, choose Attach policies.
  4. Select the managed policies AmazonSageMakerFullAccess and AwsGlueSessionUserRestrictedServiceRole.
  5. Choose Attach policies.
    The summary page shows your newly added managed policies. Now you add a custom policy and attach it to your execution role.
  6. On the Add Permissions menu, choose Create inline policy.
  7. On the JSON tab, enter the following policy:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": [
                    "iam:GetRole",
                    "iam:PassRole",
                    "sts:GetCallerIdentity"
                ],
                "Resource": "*"
            }
        ]
    }

  8. Modify your role’s trust relationship:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": [
                        "glue.amazonaws.com",
                        "sagemaker.amazonaws.com"
                    ]
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }

Launch an AWS Glue interactive session kernel

If you already have existing users within your Studio domain, you may need to have them shut down and restart their Jupyter Server to pick up the new notebook kernel images.

Upon reloading, you can create a new Studio notebook and select your preferred kernel. The built-in SparkAnalytics 1.0 image should now be available, and you can choose your preferred AWS Glue kernel (Glue Scala Spark or Glue PySpark).

Configure your interactive session

You can easily configure your AWS Glue interactive session with notebook cell magics prior to initialization. Magics are small commands prefixed with % at the start of Jupyter cells that provide shortcuts to control the environment. In AWS Glue interactive sessions, magics are used for all configuration needs, including:

  • %region – The AWS Region in which to initialize a session. The default is the Studio Region.
  • %iam_role – The IAM role ARN to run your session with. The default is the user’s SageMaker execution role.
  • %worker_type – The AWS Glue worker type. The default is standard.
  • %number_of_workers – The number of workers that are allocated when a job runs. The default is five.
  • %idle_timeout – The number of minutes of inactivity after which a session will time out. The default is 2,880 minutes.
  • %additional_python_modules – A comma-separated list of additional Python modules to include in your cluster. This can be from PyPi or Amazon Simple Storage Service (Amazon S3).
  • %%configure – A JSON-formatted dictionary consisting of AWS Glue-specific configuration parameters for a session.

For a comprehensive list of configurable magic parameters for this kernel, use the %help magic within your notebook.

Your AWS Glue interactive session will not start until the first non-magic cell is run.

Customize your interactive session and run a data preparation workload

As an example, the following notebook cells show how you can customize your AWS Glue interactive session and run a scalable data preparation workload. In this example, we perform an ETL task to aggregate air quality data for a given city, grouping by the hour of the day.

We configure our session to save our Spark logs to an S3 bucket for real-time debugging, which we see later in this post. Be sure that the iam_role that is running your AWS Glue session has write access to the specified S3 bucket.

%help

%session_id_prefix air-analysis-
%glue_version 3.0
%idle_timeout 60
%%configure
{
"--enable-spark-ui": "true",
"--spark-event-logs-path": "s3://<BUCKET>/gis-spark-logs/"
}

Next, we load our dataset directly from Amazon S3. Alternatively, you could load data using your AWS Glue Data Catalog.

from pyspark.sql.functions import split, lower, hour
print(spark.version)
day_to_analyze = "2022-01-05"
df = spark.read.json(f"s3://openaq-fetches/realtime-gzipped/{day_to_analyze}/1641409725.ndjson.gz")
df_air = spark.read.schema(df.schema).json(f"s3://openaq-fetches/realtime-gzipped/{day_to_analyze}/*")

Finally, we filter the data for the city and pollutant of interest, compute the hourly average NO2 readings, and write the transformed dataset to an output bucket location that we defined:

df_city = df_air.filter(lower((df_air.city)).contains('delhi')).filter(df_air.parameter == "no2").cache()
df_avg = df_city.withColumn("Hour", hour(df_city.date.utc)).groupBy("Hour").avg("value").withColumnRenamed("avg(value)", "no2_avg")
df_avg.sort("Hour").show()

# Examples of reading / writing to other data stores: 
# https://github.com/aws-samples/aws-glue-samples/tree/master/examples/notebooks

df_avg.write.parquet(f"s3://<BUCKET>/{day_to_analyze}.parquet")

After you’ve completed your work, you can end your AWS Glue interactive session immediately by simply shutting down the Studio notebook kernel, or you could use the %stop_session magic.

Debugging and Spark UI

In the preceding example, we specified the "--enable-spark-ui": "true" argument along with a "--spark-event-logs-path" location. This configures our AWS Glue session to record the session logs so that we can use a Spark UI to monitor and debug our AWS Glue job in real time.

For the steps to launch and read those Spark logs, refer to Launching the Spark history server. In the following screenshot, we’ve launched a local Docker container that has permission to read the S3 bucket that contains our logs. Optionally, you could host an Amazon Elastic Compute Cloud (Amazon EC2) instance to do this, as described in the preceding linked documentation.

Pricing

When you use AWS Glue interactive sessions on Studio notebooks, you’re charged separately for resource usage on AWS Glue and Studio notebooks.

AWS charges for AWS Glue interactive sessions based on how long the session is active and the number of Data Processing Units (DPUs) used. You’re charged an hourly rate for the number of DPUs used to run your workloads, billed in increments of 1 second. AWS Glue interactive sessions assign a default of 5 DPUs and require a minimum of 2 DPUs. There is also a 1-minute minimum billing duration for each interactive session. To see the AWS Glue rates and pricing examples, or to estimate your costs using the AWS Pricing Calculator, see AWS Glue pricing.
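As a rough, illustrative calculation (the per-DPU-hour rate varies by Region, so the $0.44 figure below is an assumption rather than a quoted price), a 10-minute session with the default allocation of 5 DPUs works out to roughly $0.37 of AWS Glue charges, in addition to the Studio notebook instance charges for the same period:

# Illustrative estimate only; check AWS Glue pricing for the rate in your Region
dpus = 5                    # default DPU allocation for an interactive session
session_minutes = 10        # example session length
rate_per_dpu_hour = 0.44    # assumed rate in USD per DPU-hour

glue_cost = dpus * (session_minutes / 60) * rate_per_dpu_hour
print(f"Approximate AWS Glue charge: ${glue_cost:.2f}")  # about $0.37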

Your Studio notebook runs on an EC2 instance and you’re charged for the instance type you choose, based on the duration of use. Studio assigns you a default EC2 instance type of ml.t3.medium when you select the SparkAnalytics image and associated kernel. You can change the instance type of your Studio notebook to suit your workload. For information about SageMaker Studio pricing, see Amazon SageMaker Pricing.

Conclusion

The native integration of Studio notebooks with AWS Glue interactive sessions facilitates seamless and scalable serverless data preparation for data scientists and data engineers. We encourage you to try out this new functionality in Studio!

See Prepare Data using AWS Glue Interactive Sessions for more information.


About the authors

Sean Morgan is a Senior ML Solutions Architect at AWS. He has experience in the semiconductor and academic research fields, and uses his experience to help customers reach their goals on AWS. In his free time, Sean is an active open source contributor/maintainer and is the special interest group lead for TensorFlow Addons.

Sumedha Swamy is a Principal Product Manager at Amazon Web Services. He leads the SageMaker Studio team, building it into the IDE of choice for interactive data science and data engineering workflows. He has spent the past 15 years building customer-obsessed consumer and enterprise products using Machine Learning. In his free time, he likes photographing the amazing geology of the American Southwest.

Read More