Roundup of re:Invent 2021 Amazon SageMaker announcements

At re:Invent 2021, AWS announced several new Amazon SageMaker features that make machine learning (ML) accessible to new types of users while continuing to increase performance and reduce cost for data scientists and ML experts. In this post, we provide a summary of these announcements, along with resources for you to get more details on each one.

ML for all

As ML adoption grows, ML skills are in higher demand. To help meet this growing demand, AWS is expanding the reach of ML beyond data scientists and developers to the broader business user community, including line-of-business analysts supporting finance, marketing, operations, and HR teams. AWS announced that Amazon SageMaker Canvas is expanding access to ML by providing business analysts with a visual point-and-click interface that lets them generate accurate ML predictions on their own—without requiring any ML experience or having to write a single line of code. Get started with a two-month free trial that includes up to 10 ML models and up to 1 million cells of data.

Processing structured and unstructured data at scale

As more people start using ML in their daily work, the need to label datasets for training grows and data science teams can’t keep up with the growing demand. AWS announced Amazon SageMaker Ground Truth Plus to make it easy to create high-quality training datasets without having to build labeling applications or manage labeling workforces on your own. SageMaker Ground Truth Plus provides an expert workforce that is trained on ML tasks and can help meet your data security, privacy, and compliance requirements. Simply upload your data, and SageMaker Ground Truth Plus creates and manages the data labeling workflows on your behalf. Request a pilot to get started.

Optimize the performance and cost of building, training, and deploying ML models

AWS is also continuing to make it easier and cheaper for data scientists and developers to prepare data and build, train, and deploy ML models.

First, for building ML models, AWS released enhancements to Amazon SageMaker Studio so that you can now do data processing, analytics, and ML workflows in one unified notebook. From this universal notebook, you can access a wide range of data sources and write code for any transformation for a variety of data workloads.

To make training faster, AWS launched a new compiler, Amazon SageMaker Training Compiler, which can accelerate training by up to 50% through graph- and kernel-level optimizations that use GPUs more efficiently. SageMaker Training Compiler is integrated with versions of TensorFlow and PyTorch in SageMaker, so you can speed up training in these popular frameworks with minimal code changes.

And lastly, for inference, AWS announced two features to reduce inference costs. Amazon SageMaker Serverless Inference (preview) lets you deploy ML models on pay-per-use pricing without worrying about servers or clusters for use cases with intermittent traffic patterns. In addition, Amazon SageMaker Inference Recommender helps you choose the best available compute instance and configuration to deploy ML models for optimal inference performance and cost.

Learn ML for free

Amazon SageMaker Studio Lab (preview) is a free ML notebook environment that makes it easy for anyone to experiment with building and training ML models without needing to configure infrastructure or manage identity and access. SageMaker Studio Lab accelerates model building through GitHub integration, and it comes preconfigured with the most popular ML tools, frameworks, and libraries to get you started immediately. SageMaker Studio Lab offers 15 GB of dedicated storage for your ML projects and automatically saves your work so that you don’t need to restart in between sessions. It’s as easy as closing your laptop and coming back later. All you need is a valid email ID to get started with SageMaker Studio Lab.

To learn more about these features, visit the Amazon SageMaker website.


About the Author

Kimberly Madia is the Sr. Manager of Product Marketing, AWS, heading up product marketing for AWS Machine Learning services. Her goal is to make it easy for customers to build, train, and deploy ML models using Amazon SageMaker. For fun outside of work, Kimberly likes to cook, read, and run on the San Francisco Bay Trail.

Read More

Enrich your content and metadata to enhance your search experience with custom document enrichment in Amazon Kendra

Amazon Kendra customers can now enrich document metadata and content during the document ingestion process using custom document enrichment (CDE). Amazon Kendra is an intelligent search service powered by machine learning (ML). Amazon Kendra reimagines search for your websites and applications so your employees and customers can easily find the content they’re looking for, even when it’s scattered across multiple locations and content repositories within your organization.

You can further enhance the accuracy and search experience of Amazon Kendra by improving the quality of documents indexed in it. Documents with precise content and rich metadata are more searchable and yield more accurate results. Organizations often have large repositories of raw documents that can be improved for search by modifying content or adding metadata before indexing. So how does CDE help? By simplifying the process of creating, modifying, or deleting document metadata and content before they’re ingested into Amazon Kendra. This can include detecting entities from text, extracting text from images, transcribing audio and video, and more by creating custom logic or using services like Amazon Comprehend, Amazon Textract, Amazon Transcribe, Amazon Rekognition, and others.

In this post, we show you how to use CDE in Amazon Kendra using custom logic or with AWS services like Amazon Textract, Amazon Transcribe, and Amazon Comprehend. We demonstrate CDE using simple examples and provide a step-by-step guide for you to experience CDE in an Amazon Kendra index in your own AWS account.

CDE overview

CDE enables you to create, modify, or delete document metadata and content when you ingest your documents into Amazon Kendra. Let’s understand the Amazon Kendra document ingestion workflow in the context of CDE.

The following diagram illustrates the CDE workflow.

The path a document takes depends on the presence of different CDE components:

  • Path taken when no CDE is present – Steps 1 and 2
  • Path taken with only CDE basic operations – Steps 3, 4, and 2
  • Path taken with only CDE advanced operations – Steps 6, 7, 8, and 9
  • Path taken when both CDE basic operations and advanced operations are present – Steps 3, 5, 7, 8, and 9

The CDE basic operations and advanced operations components are optional. For more information on the CDE basic operations and advanced operations with the preExtraction and postExtraction AWS Lambda functions, refer to the Custom Document Enrichment section in the Amazon Kendra Developer Guide.

In this post, we walk you through four use cases:

  • Automatically assign category attributes based on the subdirectory of the document being ingested
  • Automatically extract text while ingesting scanned image documents to make them searchable
  • Automatically create a transcription while ingesting audio and video files to make them searchable
  • Automatically generate facets based on entities in a document to enhance the search experience

Prerequisites

You can follow the step-by-step guide in your AWS account to get a first-hand experience of using CDE. Before getting started, complete the following prerequisites:

  1. Download the sample data files AWS_Whitepapers.zip, GenMeta.zip, and Media.zip to a local drive on your computer.
  2. In your AWS account, create a new Amazon Kendra index, Developer Edition. For more information and instructions, refer to the Getting Started chapter in the Amazon Kendra Essentials workshop and Creating an index.
  3. Open the AWS Management Console, and make sure that you’re logged in to your AWS account.
  4. Create an Amazon Simple Storage Service (Amazon S3) bucket to use as a data source. Refer to Amazon S3 User Guide for more information.
  5. Launch the provided AWS CloudFormation template to deploy the preExtraction and postExtraction Lambda functions and the required AWS Identity and Access Management (IAM) roles. This opens the AWS CloudFormation console.
    1. Provide a unique name for your CloudFormation stack and the name of the bucket you just created as a parameter.
    2. Choose Next, select the acknowledgement check boxes, and choose Create stack.
    3. After the stack creation is complete, note the contents of the Outputs. We use these values later.
  6. Configure the S3 bucket as a data source using the S3 data source connector in the Amazon Kendra index you created. When configuring the data source, in the Additional configurations section, define the Include pattern to be Data/. For more information and instructions, refer to the Using Amazon Kendra S3 Connector subsection of the Ingesting Documents section in the Amazon Kendra Essentials workshop and Getting Started with an Amazon S3 data source (console).
  7. Extract the contents of the data file AWS_Whitepapers.zip to your local machine and upload them to the S3 bucket you created at the path s3://<YOUR-DATASOURCE-BUCKET>/Data/ while preserving the subdirectory structure.
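If you prefer to script steps 4 and 7 instead of using the console, the following is a minimal Python (boto3) sketch; the bucket name and local folder path are placeholders for illustration.

import os
import boto3

s3 = boto3.client('s3')

bucket = '<YOUR-DATASOURCE-BUCKET>'   # placeholder bucket name
local_root = './AWS_Whitepapers'      # folder with the extracted whitepapers

# Create the data source bucket (outside us-east-1, also pass a
# CreateBucketConfiguration with your Region's LocationConstraint)
s3.create_bucket(Bucket=bucket)

# Upload the extracted files under the Data/ prefix, preserving subdirectories
for dirpath, _, filenames in os.walk(local_root):
    for name in filenames:
        local_path = os.path.join(dirpath, name)
        relative_path = os.path.relpath(local_path, local_root)
        key = 'Data/' + relative_path.replace(os.sep, '/')
        s3.upload_file(local_path, bucket, key)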

Automatically assign category attributes based on the subdirectory of the document being ingested

The documents in the sample data are stored in subdirectories Best_Practices, Databases, General, Machine_Learning, Security, and Well_Architected. The S3 bucket used as the data source looks like the following screenshot.

We use CDE basic operations to automatically set the category attribute based on the subdirectory a document belongs to while the document is being ingested.
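The console steps that follow configure these operations on the Document enrichment page. If you prefer to configure the equivalent basic operations programmatically, they can be supplied as part of the data source’s CustomDocumentEnrichmentConfiguration. The following is a sketch of one of the six operations, assuming the document URI (_source_uri) is used as the condition; the index ID, data source ID, and role ARN are placeholders.

import boto3

kendra = boto3.client('kendra')

# One inline (basic) operation: if the document URI contains 'Best_Practices',
# set the _category attribute to 'Best Practices'. Repeat for each subdirectory.
inline_config = {
    'Condition': {
        'ConditionDocumentAttributeKey': '_source_uri',
        'Operator': 'Contains',
        'ConditionOnValue': {'StringValue': 'Best_Practices'}
    },
    'Target': {
        'TargetDocumentAttributeKey': '_category',
        'TargetDocumentAttributeValue': {'StringValue': 'Best Practices'}
    }
}

kendra.update_data_source(
    Id='<YOUR-DATASOURCE-ID>',
    IndexId='<YOUR-INDEX-ID>',
    CustomDocumentEnrichmentConfiguration={
        'InlineConfigurations': [inline_config],
        'RoleArn': '<CDE-ROLE-ARN>'
    }
)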

  1. On the Amazon Kendra console, open the index you created.
  2. Choose Data sources in the navigation pane.
  3. Choose the data source used in this example.
  4. Copy the data source ID.
  5. Choose Document enrichment in the navigation pane.
  6. Choose Add document enrichment.
  7. For Data Source ID, enter the ID you copied.
  8. Enter six basic operations, one corresponding to each subdirectory.

  1. Choose Next.
  2. Leave the configuration for both Lambda functions blank.
  3. For Service permissions, choose Enter custom role ARN and enter the CDERoleARN value (available on the stack’s Outputs tab).

  1. Choose Next.

  1. Review all the information and choose Add document enrichment.
  2. Browse back to the data source we’re using by choosing Data sources in the navigation pane and choose the data source.
  3. Choose Sync now to start data source sync.

The data source sync can take up to 10–15 minutes to complete.

  1. While waiting for the data source sync to complete, choose Facet definition in the navigation pane.
  2. For the Index field of _category, select Facetable, Searchable, and Displayable to enable these properties.
  3. Choose Save.
  4. Browse back to the data source page and wait for the sync to complete.
  5. When the data source sync is complete, choose Search indexed content in the navigation pane.
  6. Enter the query Which service provides 11 9s of durability?.
  7. After you get the search results, choose Filter search results.

The following screenshot shows the results.

For each of the documents that were ingested, the category attribute values set by the CDE basic operations are seen as selectable facets.

Note the Document fields link for each of the results. When you choose it, it shows the fields or attributes of the document included in that result, as seen in the following screenshot.

From the selectable facets, you can select a category, such as Best Practices, to filter your search results to be only from the Best Practices category, as shown in the following screenshot. The search experience improved significantly without requiring additional manual steps during document ingestion.

Automatically extract text while ingesting scanned image documents to make them searchable

In order for documents that are scanned as images to be searchable, you first need to extract the text from such documents and ingest that text in an Amazon Kendra index. The pre-extraction Lambda function from the CDE advanced operations provides a place to implement text extraction and modification logic. The pre-extraction function we configure has the code to extract the text from images using Amazon Textract. The function code is embedded in the CloudFormation template we used earlier. You can choose the Template tab of the template on the AWS CloudFormation console and review the code for PreExtractionLambda.
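The exact function code is in the CloudFormation template; the following is only a simplified sketch of the general pattern such a pre-extraction function might follow to extract text from a scanned image with Amazon Textract. The helper name and placeholder bucket and key are illustrative assumptions, not the actual template code.

import boto3

textract = boto3.client('textract')

def extract_text_from_image(bucket, key):
    # Illustrative helper: run synchronous text detection on a scanned image in S3
    response = textract.detect_document_text(
        Document={'S3Object': {'Bucket': bucket, 'Name': key}}
    )
    lines = [
        block['Text']
        for block in response['Blocks']
        if block['BlockType'] == 'LINE'
    ]
    return '\n'.join(lines)

# Example usage (placeholder bucket and key):
# text = extract_text_from_image('<YOUR-DATASOURCE-BUCKET>', 'Data/Media/Yosemite.png')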

We now configure CDE advanced operations to try out this and additional examples.

  1. On the Amazon Kendra console, choose Document enrichments in the navigation pane.
  2. Select the CDE we configured.
  3. On the Actions menu, choose Edit.
  4. Choose Add basic operations.

You can view all the basic operations you added.

  1. Add two more operations: one for Media and one for GEN_META.

  1. Choose Next.

In this step, you need the ARNs of the preExtraction and postExtraction functions (available on the Outputs tab of the CloudFormation stack). We use the same bucket that you’re using as the data source bucket.

  1. Enter the conditions, ARN, and bucket details for the pre-extraction and post-extraction functions.
  2. For Service permissions, choose Enter custom role ARN and enter the CDERoleARN value (available on the stack’s Outputs tab).

  1. Choose Next. 
  2. Choose Add document enrichment.

Now we’re ready to ingest scanned images into our index. The sample data file Media.zip you downloaded earlier contains two image files: Yosemite.png and Yellowstone.png. These are scanned pictures of the Wikipedia pages of Yosemite National Park and Yellowstone National Park, respectively.

  1. Upload these to the S3 bucket being used as the data source in the folder s3://<YOUR-DATASOURCE-BUCKET>/Data/Media/.
  2. Open the data source on the Amazon Kendra console and start a data source sync.
  3. When the data source sync is complete, browse to Search indexed content and enter the query Where is Yosemite National Park?.

The following screenshot shows the search results.

  1. Choose the link from the top search result.

The scanned image pops up, as in the following screenshot.

You can experiment with similar questions related to Yellowstone.

Automatically create a transcription while ingesting audio or video files to make them searchable

Similar to images, audio and video content needs to be transcribed in order to be searchable. The pre-extraction Lambda function also contains the code to call Amazon Transcribe for audio and video files to transcribe them and extract a time-marked transcript. Let’s try it out.

The maximum runtime allowed for a CDE pre-extraction Lambda function is 5 minutes (300 seconds), so you can only use it to transcribe audio or video files of short duration, about 10 minutes or less. For longer files, you can use the approach described in Make your audio and video files searchable using Amazon Transcribe and Amazon Kendra.
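As with the Textract example, the actual pre-extraction code is in the CloudFormation template; the following is a simplified sketch of how a function could call Amazon Transcribe and wait for the transcript within that time limit. The job name, polling interval, and media format shown are illustrative assumptions.

import time
import json
import urllib.request
import boto3

transcribe = boto3.client('transcribe')

def transcribe_media(bucket, key, job_name):
    # Start a transcription job for a short audio/video file stored in S3
    transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={'MediaFileUri': f's3://{bucket}/{key}'},
        MediaFormat='mp4',
        LanguageCode='en-US'
    )

    # Poll until the job finishes (must complete within the 5-minute CDE limit)
    while True:
        job = transcribe.get_transcription_job(TranscriptionJobName=job_name)
        status = job['TranscriptionJob']['TranscriptionJobStatus']
        if status in ('COMPLETED', 'FAILED'):
            break
        time.sleep(10)

    if status == 'COMPLETED':
        uri = job['TranscriptionJob']['Transcript']['TranscriptFileUri']
        with urllib.request.urlopen(uri) as f:
            result = json.loads(f.read())
        return result['results']['transcripts'][0]['transcript']
    return ''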

The sample data file Media.zip contains a video file How_do_I_configure_a_VPN_over_AWS_Direct_Connect_.mp4, which has a video tutorial.

  1. Upload this file to the S3 bucket being used as the data source in the folder s3://<YOUR-DATASOURCE-BUCKET>/Data/Media/.
  2. On the Amazon Kendra console, open the data source and start a data source sync.
  3. When the data source sync is complete, browse to Search indexed content and enter the query What is the process to configure VPN over AWS Direct Connect?.

The following screenshot shows the search results.

  1. Choose the link in the answer to start the video.

If you seek to an offset of 84.44 seconds (1 minute, 24 seconds), you’ll hear exactly what the excerpt shows.

Automatically generate facets based on entities in a document to enhance the search experience

Relevant facets, such as entities in documents like places, people, and events, when presented as part of search results, provide an interactive way for a user to filter search results and find what they’re looking for. Amazon Kendra metadata, when populated correctly, can provide these facets and enhance the user experience.

The post-extraction Lambda function allows you to implement the logic to process the text extracted by Amazon Kendra from the ingested document, then create and update the metadata. The post-extraction function we configured implements the code to invoke Amazon Comprehend to detect entities from the text extracted by Amazon Kendra, and uses them to update the document metadata, which is presented as facets in an Amazon Kendra search. The function code is embedded in the CloudFormation template we used earlier. You can choose the Template tab of the stack on the CloudFormation console and review the code for PostExtractionLambda.

The maximum runtime allowed for a CDE post-extraction function is 60 seconds, so you can only use it to implement tasks that can be completed in that time.
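The following is a simplified sketch of the core idea behind such a post-extraction function: detect entities with Amazon Comprehend and group them by type so they can be written into the document attributes that back the facets. The helper name and attribute mapping are illustrative assumptions, not the actual template code.

from collections import defaultdict
import boto3

comprehend = boto3.client('comprehend')

def entities_as_attributes(text):
    # Detect entities in the text that Amazon Kendra extracted from the document.
    # Note: detect_entities accepts a limited amount of text per call, so longer
    # documents would need to be chunked.
    response = comprehend.detect_entities(Text=text[:4500], LanguageCode='en')

    attributes = defaultdict(set)
    for entity in response['Entities']:
        attributes[entity['Type']].add(entity['Text'])  # e.g. LOCATION, PERSON, ...

    # Shape the result as StringList document attributes for the Kendra index
    return [
        {'Name': entity_type, 'Value': {'StringListValue': sorted(values)}}
        for entity_type, values in attributes.items()
    ]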

Before we can try out this example, we need to define the entity types that we detect using Amazon Comprehend as facets in our Amazon Kendra index.

  1. On the Amazon Kendra console, choose the index we’re working on.
  2. Choose Facet definition in the navigation pane.
  3. Choose Add field and add fields for COMMERCIAL_ITEM, DATE, EVENT, LOCATION, ORGANIZATION, OTHER, PERSON, QUANTITY, and TITLE of type StringList.
  4. Make LOCATION, ORGANIZATION, and PERSON facetable by selecting Facetable.
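If you would rather define these index fields programmatically than through the console, the following is a minimal sketch using the UpdateIndex API; the index ID is a placeholder, and only three of the nine fields are shown.

import boto3

kendra = boto3.client('kendra')

kendra.update_index(
    Id='<YOUR-INDEX-ID>',
    DocumentMetadataConfigurationUpdates=[
        {
            'Name': 'LOCATION',
            'Type': 'STRING_LIST_VALUE',
            'Search': {'Facetable': True}
        },
        {
            'Name': 'ORGANIZATION',
            'Type': 'STRING_LIST_VALUE',
            'Search': {'Facetable': True}
        },
        {
            'Name': 'PERSON',
            'Type': 'STRING_LIST_VALUE',
            'Search': {'Facetable': True}
        }
    ]
)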

  1. Extract the contents of the GenMeta.zip data file and upload the files United_Nations_Climate_Change_conference_Wikipedia.pdf, United_Nations_General_Assembly_Wikipedia.pdf, United_Nations_Security_Council_Wikipedia.pdf, and United_Nations_Wikipedia.pdf to the S3 bucket being used as the data source in the folder s3://<YOUR-DATASOURCE-BUCKET>/Data/GEN_META/.
  2. Open the data source on the Amazon Kendra console and start a data source sync.
  3. When the data source sync is complete, browse to Search indexed content and enter the query What is Paris agreement?.
  4. After you get the results, choose Filter search results in the navigation pane.

The following screenshot shows the faceted search results.

All the facets of the type ORGANIZATION, LOCATION, and PERSON are automatically generated by the post-extraction Lambda function with the detected entities using Amazon Comprehend. You can use these facets to interactively filter the search results. You can also try a few more queries and experiment with the facets.

Clean up

After you have experimented with the Amazon Kendra index and the features of CDE, delete the infrastructure you provisioned in your AWS account while working on the examples in this post:

  • CloudFormation stack
  • Amazon Kendra index
  • S3 bucket

Conclusion

Enhancing data and metadata can improve the effectiveness of search results and improve the search experience. You can use the custom document enrichment (CDE) feature of Amazon Kendra to easily automate the CDE process by creating, modifying, or deleting the metadata using the basic operations. You can also use the advanced operations with pre-extraction and post-extraction Lambda functions to implement the logic to manipulate the data and metadata.

We demonstrated using subdirectories to assign categories, using Amazon Textract to extract text from scanned images, using Amazon Transcribe to generate a transcript of audio and video files, and using Amazon Comprehend to detect entities that are added as metadata and later available as facets to interact with the search results. This is just an illustration of how you can use CDE to create a differentiated search experience for your users.

For a deeper dive into what you can achieve by combining other AWS services with Amazon Kendra, refer to Make your audio and video files searchable using Amazon Transcribe and Amazon Kendra, Build an intelligent search solution with automated content enrichment, and other posts on the Amazon Kendra blog.


About the Authors

Abhinav Jawadekar is a Senior Partner Solutions Architect at Amazon Web Services. Abhinav works with AWS Partners to help them in their cloud journey.

Read More

Continuously improve search application effectiveness with Amazon Kendra Analytics Dashboard

Unstructured data belonging to enterprises continues to grow, making it a challenge for customers and employees to get the information they need. Amazon Kendra is a highly accurate intelligent search service powered by machine learning (ML). It helps you easily find the content you’re looking for, even when it’s scattered across multiple locations and content repositories.

Amazon Kendra provides mechanisms such as relevance tuning, filtering, and submitting feedback for incremental learning to improve the effectiveness of the search solution based on specific use cases. As the data, users, and user expectations evolve, there is a need to continuously measure and recalibrate the search effectiveness, by adjusting the search configuration.

Amazon Kendra analytics provides a snapshot of how your users interact with your Amazon Kendra-powered search application in the form of key metrics. You can view the analytics data in a visual dashboard on the Amazon Kendra console, via the API, or using the AWS Command Line Interface (AWS CLI). These metrics enable administrators and content creators to better understand the ease of finding relevant information, the quality of the search results, gaps in the content, and the role of instant answers in providing answers to a user’s questions.

This post illustrates how you can dive deep into search trends and user behavior to identify insights and bring clarity to potential areas of improvement and the specific actions to take.

Overview of the Amazon Kendra analytics dashboard

Let’s start with reviewing the Search Analytics Dashboard of the Amazon Kendra index we use during this post. To view the Amazon Kendra analytics dashboard, open the Amazon Kendra management console, choose your index, and then choose Analytics in the navigation pane.

At the top of the dashboard, there is a trend of an increasing number of queries, implying growing application adoption. There is little change since the last period in the clickthrough rate, zero click rate, zero search result rate, and instant answer rate, signifying that the usage pattern of the new queries, and potentially new users, is consistent with that of the previous period.

Let’s look at the other macro trend charts available on the dashboard (see the following screenshots).

All the charts show a flat trend, meaning that the usage pattern is steady.

The clickthrough rate is at single digits with a slight downward trend. This either means that users are finding the information through instant answers, FAQs, or document excerpts, or it could indicate that the results are not interesting to the users.

The top zero click queries hover a little below 10%, which also means that the users are finding the information through instant answers, FAQs, or document excerpts, or the results are uninteresting to the users.

The instant answer rate is above 90%, which means that the overall quality of content is good and contains the information users are looking for.

The top zero result queries are lower than 5%, which is a good indicator that, for the most part, the users are finding the information they’re looking for.

Now let’s look at the drill-down charts starting with top queries, sorted high to low by Count.

The most important insight here is that the top queried items matter most to the users. The organization can use this information to potentially change their business priorities to focus more on these items of interest. It can also be an indicator to add more content on these topics.

When looking at the top queries sorted low to high on the Instant answer (%) column, we get the following results.

This provides insights into the items that the users are looking for but can’t find answers to. Depending on the query count, this may be a good indicator to add more content with specific information that answers the queries.

Now let’s look at the top clicked documents, sorted on the Count column from high to low.

These items indicate topics of interest to the users, not just for answers but also for detailed information. It could be an indicator to add more content on these topics, but it might also be a business indicator to arrange training on these topics.

Let’s continue with the top zero click queries, sorted on 0 click count high to low.

This shows items of high interest that coincide with a high instant answer rate, implying that the users quickly find the answers through instant answers.

Now let’s look at the same chart sorted on Instant answer rate, low to high.

This indicates that there is a lack of information on these topics that are of interest to users, and that the content owners need to add more content on these topics.

Now let’s look at the top zero result queries, sorted on the 0 result count column from high to low.

This is an indicator of a gap in content, because the users are looking for information that can’t be found. The content owners can fix this by adding content on these topics.

Using the AWS CLI and API to get Amazon Kendra analytics data

So far, we have used the visual dashboard on the Amazon Kendra management console to view all the available charts. The same metrics are also available via the API or the AWS CLI, which you can use to integrate this information into your applications as well as the tools of your choice for analytics and dashboards. You can use the following AWS CLI command to get the top queries this week based on their count:

aws kendra get-snapshots --index-id <YOUR-INDEX-ID> --interval "THIS_WEEK" --metric-type "QUERIES_BY_COUNT"

The output looks similar to the following:

{
  "SnapShotTimeFilter": {
    "StartTime": "2021-11-14T08:00:00+00:00",
    "EndTime": "2021-11-20T07:00:00+00:00"
  },
  "SnapshotsDataHeader": [
    "query_content", "count", "ctr", "zero_click_rate", "click_depth", "instant_answer", "confidence"
  ],
  "SnapshotsData": [
    ["what is Kendra", 3216, 3.70, 96.30, 27.71, 97.01, "HIGH"],
    ["NBA game schedule", 1632, 4.47, 95.53, 24.19, 95.47, "MEDIUM"],
    ["Most popular search", 1603, 3.49, 96.51, 29.43, 94.14, "MEDIUM"],
    ["New York City", 1551, 3.68, 96.32, 33.40, 94.58, "MEDIUM"],
    ["how many weeks in a year", 1310, 2.21, 97.79, 42.10, 96.03, "LOW"],
    ["what is my ip address", 859, 2.56, 97.44, 48.45, 96.97, "MEDIUM"],
    ["how to draw", 857, 2.80, 97.20, 36.33, 96.38, "HIGH"],
    ["what is love", 855, 2.46, 97.54, 27.33, 96.73, "MEDIUM"],
    ["equal opportunity bill", 855, 5.26, 94.74, 23.62, 94.62, "MEDIUM"],
    ["when are the nba playoffs", 836, 3.35, 96.65, 32.32, 92.34, "LOW"]
  ],
  "NextToken": "uVu4IDozCVdFz5klt0h9+YPTTNcCGGwGujsYChp1/vPp5nPdC+reHO8TRvg5ANhWQu10jvKltuM8KzUvYCvBGi7mWJdpOF7LFiBjFcIuY6cabYI9nb2b0u3AU3565RC9kCytG6RjeVcU/NjBAxLMyB96+WdEYv+jFCbejnM6YjWa0LRL+MmvlnXEkFMWvmgyrdF22JXWklTZc77NJILR+BTsCB5Xg34OJ4149968kDdb2CNhH4Bzk+qOGph+KoFDW/CpmQ=="
}

You can also get similar output using the following Python code:

import boto3

kendra = boto3.client('kendra')

index_id = '${indexID}'
interval = 'THIS_WEEK'
metric_type = 'QUERIES_BY_COUNT'

snapshots_response = kendra.get_snapshots(
    IndexId = index_id,
    Interval = interval,
    MetricType = metric_type
)

print("Top queries data: " + str(snapshots_response['SnapshotsData']))

Conclusion

The growth of data and information, along with evolving user needs, makes it imperative that the effectiveness of the search application also evolves. The metrics provided by Amazon Kendra analytics empower you to dive deep into search trends and user behavior to identify insights. They help bring clarity to potential areas of improvement for Amazon Kendra-powered search applications. If you have already implemented an Amazon Kendra-powered search solution, start looking at the Analytics Dashboard with the usage metrics for the last few weeks and get insights on how you can improve search effectiveness. For new Amazon Kendra-powered search applications, the Analytics Dashboard is a great place to get immediate feedback with actionable insights on search effectiveness. For a hands-on experience with Amazon Kendra, see the Kendra Essentials workshop. For a deeper dive into Amazon Kendra use cases, see the Amazon Kendra blog.


About the Author

Abhinav Jawadekar is a Senior Partner Solutions Architect at Amazon Web Services. Abhinav works with AWS Partners to help them in their cloud journey.

Read More

Expedite conversation design with the automated chatbot designer in Amazon Lex

Today, we’re launching the Amazon Lex automated chatbot designer (preview), which reduces the time and effort it takes for customers to design a chatbot by automating the process using existing conversation transcripts. Amazon Lex helps you build, test, and deploy chatbots and virtual assistants on contact center services (such as Amazon Connect), websites, and messaging channels (such as Facebook Messenger). The automated chatbot designer expands the usability of Amazon Lex to the design phase. It uses machine learning (ML) to provide an initial bot design that you can then refine and launch conversational experiences faster. With the automated chatbot designer, Amazon Lex customers and partners get an easy and intuitive way of designing chatbots and can reduce bot design time from weeks to hours.

Conversation design

Organizations are rapidly adopting chatbots to increase self-service and improve customer experience at scale. Contact center chatbots automate common user queries and free up human agents to focus on solving more complex issues. You can use Amazon Lex to build chatbots that deliver engaging user experiences and lifelike conversations. Amazon Lex provides automatic speech recognition and language understanding technologies to build effective chatbots through an easy-to-use console. But before you can build a chatbot, you have to design it. The design phase of the chatbot building process is still manual, time-consuming, and one that requires conversational design expertise.

Conversation design is the discipline of designing conversational interfaces, including the purpose, experience, and interactions. The discipline is still evolving and requires a deep understanding of spoken language and human interactions.

Creating a chatbot needs equal parts technology and business knowledge. The first step of designing a bot is conducting user research based on business needs and identifying the user requests or intents to focus on. Customers often start with analyzing transcripts of conversations between agents and users to discover and track the most frequently occurring intents. An intent signifies the key reason for customer contact or a goal the customer is trying to achieve. For example, a person contacting an insurance company to file a claim might say, “My basement is flooded, I need to start a new claim.” The intent in this case is “file a new claim.” It can take a team of business analysts, product owners, and developers multiple weeks to analyze thousands of lines of transcripts and find the right intents while designing chatbots for their contact center flows. This is time-consuming and may lead to missing intents. The second step is to remove ambiguity among intents. For example, if a user says “I want to file a claim,” it is important to distinguish if the user is trying to file a home or auto claim. The typical trial-and-error approach to identify such overlaps across intents can be error-prone. The third and final step is compiling a list of valid values of information required to fulfill different intents. For example, to fulfill the intent “file a new claim,” developers need a list of different policy types (auto, home, and travel). A chatbot with missing, incomplete, or overlapping intents will fail to resolve user requests accurately, resulting in frustrated customers.

Automated chatbot designer simplifies the design process

The automated chatbot designer builds on the simplicity and ease of use of Amazon Lex by automatically surfacing an initial bot design. It uses ML to analyze conversation transcripts between callers and agents, and semantically clusters them around the most common intents and related information. Instead of starting your design from scratch, you can use the intents surfaced by the chatbot designer, iterate on the design, and achieve your target experience faster.

In the example of an insurance chatbot, the automated chatbot designer first analyzes transcripts to identify intents such as “file a new claim” automatically from phrases, such as “My basement is flooded, I need to start a new claim” or “I want to file a roof damage claim.” The automated chatbot designer can analyze thousands of lines of transcripts within a few hours, minimizing effort and reducing chatbot design time. This helps make sure that the intents are well defined and well separated by automatically removing any overlaps between them. This way, the bot can understand the user better and avoid frustration. Finally, the automated chatbot designer compiles information, such as policy ID or claim type, needed to fulfill all identified intents.

By reducing manual effort and human error from every step of chatbot design, the automated chatbot designer helps create bots that understand user requests without confusion, improving the end user experience.

NeuraFlash, a certified Amazon Services Delivery Partner, provides a full range of professional services to companies worldwide. “We specialize in building solutions grounded in data that transform and improve the customer journey across any use case in the contact center. We often analyze large amounts of conversational data to chart the optimal conversational experience for our clients,” says Dennis Thomas, CTO at NeuraFlash. “With the automated chatbot designer, we can identify different paths in calls quickly based on the conversational data. The automated discovery accelerates our time to market across our client engagements and helps us deliver better customer experiences. We are excited to partner with AWS and help organizations transform their businesses with AI-powered experiences.”

Create a bot with the automated chatbot designer

Getting started with the automated chatbot designer is very easy. Developers can access it on the Amazon Lex console and upload transcripts to automatically create the bot design.

  1. On the Amazon Lex V2 console, choose Bots.
  2. Choose Create bot.
  3. Select Start with transcripts as the creation method.
  4. Give the bot a name (for this example, InsuranceBot) and provide a description.
  5. Select Create a role with basic Amazon Lex permissions and use this as your runtime role.
  6. After you fill out the other fields, choose Next to proceed to the language configuration.

As of this writing, the automated chatbot designer is only available in US English.

  1. Choose the language and voice for your interaction.

Next, you specify the Amazon Simple Storage Service (Amazon S3) location of the transcripts. Amazon Connect customers using Contact Lens can use the transcripts in their original format. Conversation transcripts from other transcription services may require a simple conversion.

  1. Choose the S3 bucket and the path where the transcripts are located.

In the case of the Contact Lens for Amazon Connect format, the files should be located at /Analysis/Voice. If you have redacted transcripts, you can provide /Analysis/Voice/Redacted as well. For this post, you can use the following sample transcripts. Note that fields like names and phone numbers included in these sample transcripts or in our examples consist of synthetic (fake) data.

If you plan to use the sample transcripts, you first have to upload them to an S3 bucket: unzip the files to a local folder, then navigate to the Amazon S3 console, provide a bucket name, and choose Create bucket. After the bucket is created, choose the bucket name and choose Add folder to provide the location of the unzipped files. Finally, choose Upload to upload the conversation transcripts.
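Alternatively, you can script the upload with Python (boto3); the following minimal sketch assumes the bucket already exists, and the bucket name, local folder, and prefix are placeholders.

import os
import boto3

s3 = boto3.client('s3')

bucket = '<YOUR-TRANSCRIPTS-BUCKET>'    # placeholder bucket name
local_folder = './sample_transcripts'   # folder with the unzipped transcripts
prefix = 'transcripts/'                 # illustrative prefix to point Amazon Lex at

# Upload every transcript file, preserving the folder structure under the prefix
for dirpath, _, filenames in os.walk(local_folder):
    for name in filenames:
        local_path = os.path.join(dirpath, name)
        key = prefix + os.path.relpath(local_path, local_folder).replace(os.sep, '/')
        s3.upload_file(local_path, bucket, key)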

  1. Choose your AWS Key Management Service (AWS KMS) key for access permissions.
  2. Apply a filter (date range) for your input transcripts.
  3. Choose Done.

You can use the status bar on the console to track the analysis. Within a few hours, the automated chatbot designer surfaces a chatbot design that includes user intents, sample phrases associated with those intents, and a list of all the information required to fulfill them. The amount of time it takes to complete training depends on several factors, including the volume of transcripts and the complexity of the conversations. Typically, 600 lines of transcript are analyzed every minute.

  1. Choose Review to view the intents and slot types discovered by the automated chatbot designer.

The Intents tab lists all the intents along with sample phrases and slots, and the Slot types tab provides a list of all the slot types along with slot type values.

You can choose any of the intents to review the sample utterances and slots. For example, in the following screenshot, we choose ChangePassword to view the utterances.

  1. You can click on the associated transcripts to review the conversations used to identify the intents.
  2. After you review the results, you can select the intents and slot types relevant to your use case and choose Add.

This adds the selected intents and slot types to the bot. You can now iterate on this design by making changes such as adding prompts, merging intents or slot types, and renaming slots.

In summary, the chatbot designer analyzes a conversation transcript to surface common intents, associated phrases, and information the chatbot needs to capture to resolve issues (such as customer policy number, claim type, and so on). You still have to iterate on the design to fit your business needs, add chatbot prompts and responses, integrate business logic to fulfill user requests,  and then build, test, and deploy the chatbot in Amazon Lex. The automated chatbot designer automates a significant portion of the bot design, minimizing effort and reducing the overall time it takes to design a chatbot.

Things to know

The automated chatbot designer is launching today as a preview, and you can get started with it right away for free.  After the preview, you pay the prices listed on the Amazon Lex pricing page. Pricing is based on the time it takes to analyze the transcripts and discover intents.

The automated chatbot designer is available in English (US) in all the AWS Regions where Amazon Lex V2 operates. With the automated chatbot designer in Amazon Lex, you can streamline the lengthy design process and create chatbots that understand customer requests and improve customer experiences. For more information, refer to the Amazon Lex documentation.


About the Authors

Priyanka Tiwari is a product marketing lead for AWS data and machine learning where she focuses on educating decision makers on the impact of data, analytics, and machine learning. In her spare time, she enjoys reading and exploring the beautiful New England area with her family.

As a Product Manager on the Amazon Lex team, Harshal Pimpalkhute spends his time trying to get machines to engage (nicely) with humans.

Read More

Quickly build custom search applications without writing code using Amazon Kendra Experience Builder

Amazon Kendra is an intelligent search service powered by machine learning (ML). Amazon Kendra reimagines search for your websites and applications so your employees and customers can easily find the content they’re looking for, even when it’s scattered across multiple locations and content repositories within your organization. With Amazon Kendra, you don’t need to click through multiple documents to find the answer you’re looking for. It gives you the exact answer to your query.

Getting started with Amazon Kendra is quick and simple; you can index and start searching your content from the Amazon Kendra console in less than 10 minutes. You now have multiple ways to deploy your search application. You can use APIs to integrate with an existing search application. You can also use the downloadable React code snippets to build your own experience. Or you can use the out-of-the-box search interface with the Amazon Kendra Experience Builder to quickly configure your own custom search experience and make it available to your users.

With the new Experience Builder, you can deploy a fully functional and customizable search experience with Amazon Kendra in a few clicks, without any coding or ML experience. Experience Builder delivers an intuitive visual workflow to quickly build, customize, and launch your Amazon Kendra-powered search application, securely on the cloud. You can start with the ready-to-use search experience template in the builder, which you can customize by simply dragging and dropping the components you want, such as filters or sorting. You can invite others to collaborate or test your application for feedback, and then share the project with all users when you’re ready to deploy the experience. The Experience Builder comes with AWS Single Sign-On (AWS SSO) integration, which supports popular identity providers (IdPs) such as Azure AD and Okta, so you can deliver secure end user SSO authentication while accessing the search experience.

In this post, we discuss how to build a custom search application quickly with the Amazon Kendra Experience Builder.

Solution overview

The following are the steps to build your own custom search interface using Experience Builder.

Configure your index

To build your custom search application using Amazon Kendra Experience Builder, first sign in to the Amazon Kendra console and create an index. After you create an index, add data sources to your index, such as Amazon Simple Storage Service (Amazon S3), SharePoint, or Confluence. You can skip these steps if you already have an index and data sources set up.

Create your experience

To create your experience, complete the following steps:

  1. On the Amazon Kendra console, navigate to your index.
  2. Choose Create experience.

  1. For Experience name, enter a name.
  2. Under Content sources, select the data sources you want to search.
  3. For IAM role, choose your AWS Identity and Access Management (IAM) role to grant Amazon Kendra access permissions.
  4. Choose Next.

The Experience Builder comes with AWS SSO integration, supporting popular IdPs such as Azure AD and Okta, and automatically detects AWS SSO directories in your account.

  1. In the Confirm your identity from an AWS SSO directory section, select your identity.
  2. Choose Next.

If you don’t have AWS SSO, Amazon Kendra provides an easy step to enable it and add yourself as an owner. You can then add additional lists of users or groups to your directory and assign access permissions. For example, you can assign owner or viewer permissions to users and groups as you add them to your experience. Users with viewer permissions are your end users; they’re authorized to load your search application and perform searches. Users with owner permissions are authorized to configure, design, tune, manage access, and share search experiences.

  1. After you configure your SSO and assign yourself as owner, review the settings, and choose Create experience and open Experience Builder.

After you launch the Experience Builder, you’re redirected to the URL that was generated for your experience. Here, the experience verifies if you have a valid authenticated session. If you do, you’re redirected to the Experience Builder; if not, you’re redirected to your IdP via AWS SSO to authenticate you. After authentication is successful, the IdP redirects you back to the Experience Builder.
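As noted later in this post, you can also configure Experience Builder with the AWS CLI and SDKs. The following is a hedged boto3 sketch of creating an experience for an existing index; the name, IDs, and role ARN are placeholders, and additional configuration (such as identity settings) is omitted.

import boto3

kendra = boto3.client('kendra')

response = kendra.create_experience(
    Name='my-search-experience',             # placeholder experience name
    IndexId='<YOUR-INDEX-ID>',
    RoleArn='<YOUR-EXPERIENCE-ROLE-ARN>',
    Description='Custom search experience built with Experience Builder',
    Configuration={
        'ContentSourceConfiguration': {
            'DataSourceIds': ['<YOUR-DATASOURCE-ID>']
        }
    }
)

print('Experience ID: ' + response['Id'])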

Customize, tune, and share your experience

Now, inside the Experience Builder, you can start customizing the default template, which already comes preconfigured with most key features, such as the search box, Amazon Kendra suggested answers, FAQ matches, and recommended documents. You can customize the experience by dragging and dropping these components from the components panel onto your page canvas. You can also configure the content rendered inside each component.

For example, if you want to customize filters, choose Filter in the Design pane.

You can customize which fields you want your application to facet search results on, and assign display labels if needed.

Similarly, you can customize other UI components, including the search bar, sort, suggested answers, FAQ, and document ranking.

Optionally, you can further improve relevancy by boosting the search results using relevancy tuning.

Choose Preview to visualize your search experience without any editor tool distractions. When you’re happy with the changes, choose Publish to push the changes you made to the live or production version of the search experience.

You have successfully built and deployed a custom search application. Users with viewer permissions can now start searching by going to the search experience URL that was generated when you first created the search experience.

Conclusion

The Amazon Kendra Experience Builder enables you to configure your own custom search experience and make it available to your users in a few clicks, without any coding or ML experience.

You can use Experience Builder today in the following Regions: US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Singapore), Asia Pacific (Sydney), Europe (Ireland), and Canada (Central). For the most up-to-date information about Amazon Kendra Region availability, see AWS Regional Services.  You can configure Experience Builder using the AWS Command Line Interface (AWS CLI), AWS SDKs, and the AWS Management Console. There is no additional charge for using Experience Builder. For more information about pricing, see Amazon Kendra pricing.

To learn more about Experience Builder, visit the Amazon Kendra Developer Guide.


About the Authors

Jean-Pierre Dodel leads product management for Amazon Kendra, a new ML-powered enterprise search service from AWS. He brings 15 years of Enterprise Search and ML solutions experience to the team, having worked at Autonomy, HP, and search startups for many years prior to joining Amazon four years ago. JP has led the Amazon Kendra team from its inception, defining vision, roadmaps, and delivering transformative semantic search capabilities to customers like Dow Jones, Liberty Mutual, 3M, and PwC.

Read More

Create and manage Amazon EMR Clusters from SageMaker Studio to run interactive Spark and ML workloads – Part 2

In Part 1 of this series, we offered step-by-step guidance for creating, connecting, stopping, and debugging Amazon EMR clusters from Amazon SageMaker Studio in a single-account setup.

In this post, we dive deep into how you can use the same functionality in certain enterprise-ready, multi-account setups. As described in the AWS Well-Architected Framework, separating workloads across accounts enables your organization to set common guardrails while isolating environments. This can be particularly useful for certain security requirements, as well as to simplify cost allocation between projects and teams.

Solution overview

In this post, we go through the process to achieve the following architectural setup. We present the same simple interface as we saw in Part 1 for our data workers, abstracting away multi-account details from their day-to-day workflow when not needed.

We first describe how to set up your cross-account networks in order to connect to Amazon EMR from Studio. To start, we need to make sure that some prerequisites are set correctly. For our example, a DevOps admin needs to configure an Amazon SageMaker domain with an elastic network interface to a private VPC and specify the security group ID to attach.

Set up the network

After we set up the Studio domain, we need to configure our network settings to allow communication between accounts.

VPC peering

We start with VPC peering between the accounts in order to facilitate traffic back and forth.

  1. From our Studio account, on the Amazon Virtual Private Cloud (Amazon VPC) console, choose Peering connections.
  2. Choose Create peering connection.
  3. Create your request to peer the Studio VPC within the Amazon EMR account’s VPC.

After you make the peering request, the admin can accept this request from the second account.

When peering private subnets, you should enable private IP DNS resolution at the VPC peering connection level.
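If you script these networking steps instead of using the console, a minimal boto3 sketch of the peering request, acceptance, and DNS resolution settings might look like the following; the VPC IDs, account ID, and Region are placeholders, and the acceptance and accepter-side calls must run with credentials for the Amazon EMR account.

import boto3

# In the Studio account: request peering with the EMR account's VPC
studio_ec2 = boto3.client('ec2', region_name='us-east-1')
peering = studio_ec2.create_vpc_peering_connection(
    VpcId='<STUDIO-VPC-ID>',
    PeerVpcId='<EMR-VPC-ID>',
    PeerOwnerId='<EMR-ACCOUNT-ID>'
)
peering_id = peering['VpcPeeringConnection']['VpcPeeringConnectionId']

# In the EMR account: accept the peering request
emr_ec2 = boto3.client('ec2', region_name='us-east-1')  # EMR account credentials
emr_ec2.accept_vpc_peering_connection(VpcPeeringConnectionId=peering_id)

# Enable private IP DNS resolution across the peering connection (both sides)
studio_ec2.modify_vpc_peering_connection_options(
    VpcPeeringConnectionId=peering_id,
    RequesterPeeringConnectionOptions={'AllowDnsResolutionFromRemoteVpc': True}
)
emr_ec2.modify_vpc_peering_connection_options(
    VpcPeeringConnectionId=peering_id,
    AccepterPeeringConnectionOptions={'AllowDnsResolutionFromRemoteVpc': True}
)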

Route tables

After you establish the peering connection, you must enable the flow of traffic by manually adding routes to the private subnet route tables in both accounts. We do this to enable creation and connection of EMR clusters from the Studio account to the remote account’s private subnet.

These routes point to the IP address range of the peered VPC’s private subnets and are set by going to the Route Tables tab found on the subnet page. Here the admin on each account can edit the routes.

The following route table of a Studio subnet shows traffic outbound from the Studio account for 2.0.1.0/24 through a peering connection.

The following route table of an Amazon EMR subnet shows traffic outbound from the Amazon EMR account to Studio for 10.0.20.0/24 through a peering connection.
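A minimal sketch of adding the corresponding routes programmatically follows; the route table IDs, CIDR ranges, and peering connection ID are placeholders, and each call must run in its own account.

import boto3

ec2 = boto3.client('ec2')

# In the Studio account: route traffic bound for the EMR subnet via the peering connection
ec2.create_route(
    RouteTableId='<STUDIO-PRIVATE-SUBNET-ROUTE-TABLE-ID>',
    DestinationCidrBlock='<EMR-PRIVATE-SUBNET-CIDR>',
    VpcPeeringConnectionId='<PEERING-CONNECTION-ID>'
)

# In the EMR account: route traffic bound for the Studio subnet via the peering connection
ec2.create_route(
    RouteTableId='<EMR-PRIVATE-SUBNET-ROUTE-TABLE-ID>',
    DestinationCidrBlock='<STUDIO-PRIVATE-SUBNET-CIDR>',
    VpcPeeringConnectionId='<PEERING-CONNECTION-ID>'
)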

Security groups

Lastly, the security group that is attached to your Studio domain must allow outbound traffic, and the security group of the Amazon EMR primary node must allow inbound TCP traffic from the Studio instance security group.

The following screenshot shows the inbound rules configuration in your SageMaker account.

The following screenshot shows the inbound rules configuration in your Amazon EMR account.
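A minimal sketch of the equivalent inbound rule in the Amazon EMR account follows, allowing TCP traffic from the Studio security group; the group IDs and account ID are placeholders, and the port range shown is illustrative.

import boto3

ec2 = boto3.client('ec2')  # credentials for the Amazon EMR account

# Allow inbound TCP traffic from the Studio security group in the peered account
ec2.authorize_security_group_ingress(
    GroupId='<EMR-PRIMARY-NODE-SECURITY-GROUP-ID>',
    IpPermissions=[{
        'IpProtocol': 'tcp',
        'FromPort': 0,
        'ToPort': 65535,
        'UserIdGroupPairs': [{
            'GroupId': '<STUDIO-SECURITY-GROUP-ID>',
            'UserId': '<STUDIO-ACCOUNT-ID>'
        }]
    }]
)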

Set up permissions

We need to create an AWS Identity and Access Management (IAM) role in the secondary Amazon EMR account that has the same Amazon EMR visibility permission as we saw in Part 1.

The following code shows the specific permissions for the IAM role. It’s the same as in Part 1, but includes the additional statement AllowRoleAssumptionForCrossAccountDiscovery:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowPresignedUrl",
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:DescribeCluster",
                "elasticmapreduce:ListInstanceGroups",
                "elasticmapreduce:CreatePersistentAppUI",
                "elasticmapreduce:DescribePersistentAppUI",
                "elasticmapreduce:GetPersistentAppUIPresignedURL",
                "elasticmapreduce:GetOnClusterAppUIPresignedURL"
            ],
            "Resource": [
                "arn:aws:elasticmapreduce:<region>:<account-id>:cluster/*"
            ]
        },
        {
            "Sid": "AllowClusterDetailsDiscovery",
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:DescribeCluster",
                "elasticmapreduce:ListInstanceGroups"
            ],
            "Resource": [
                "arn:aws:elasticmapreduce:<region>:<account-id>:cluster/*"
            ]
        },
        {
            "Sid": "AllowClusterDiscovery",
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:ListClusters"
            ],
            "Resource": "*"
        },
        { 
            "Sid": "AllowRoleAssumptionForCrossAccountDiscovery", 
            "Effect": "Allow", 
            "Action": "sts:AssumeRole", 
            "Resource": ["arn:aws:iam::<cross-account>:role/<studio-execution-role>" ]
        },
        {
            "Sid": "AllowEMRTemplateDiscovery",
            "Effect": "Allow",
            "Action": [
              "servicecatalog:SearchProducts"
            ],
            "Resource": "*"
        }
    ]
}

This assumable role also needs a trust relationship with the Studio account (be sure to modify the account ID):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<account-id>:root"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

User journey

The following diagram illustrates the user journey for a unified notebook experience after you connect your various accounts. Just as in the previous post, the DevOps persona creates an AWS Service Catalog product and portfolio within the Studio account, from which data workers can provision templated EMR clusters.

Again, it’s worth noting that you can modify the full set of properties for Amazon EMR when creating AWS CloudFormation templates that can be deployed through Studio. This means that you can enable Spot Instances, auto scaling, and other popular configurations through your Service Catalog product.

You can parameterize the preset CloudFormation template, which creates the EMR cluster, so that end-users can modify different aspects of the cluster to match their workloads. For example, the data scientist or data engineer may want to specify the number of core nodes on the cluster, and the creator of the template can specify AllowedValues to set guardrails.

Discover EMR clusters across accounts

To enable cluster discovery across accounts, we need to provide the previously created remote IAM role ARN to the Studio execution role. The Studio execution role assumes that remote role to discover and connect to EMR clusters in the remote account. The ARN of this assumable cross-account role is loaded by the Studio Jupyter server at launch and determines which role to use for cross-account cluster discoverability. To set and modify these user-specific ARNs, admins can create a Lifecycle Configuration (LCC), associated with the Jupyter server (not the kernel gateway app), which writes the role ARN onto the Amazon Elastic File System (Amazon EFS) home directory for each user. You can apply this LCC to the entire set of users, or it can be specific to individuals, so that you have granular control over which clusters can be viewed through assumed roles.

When the Jupyter server starts, lifecycle configurations run before the role ARNs are read from the config file. This enables administrators to overwrite the file and fully control which cross-account ARNs are used at runtime. After the LCC runs and the files are written, the server reads the file /home/sagemaker-user/.cross-account-configuration-DO_NOT_DELETE/emr-discovery-iam-role-arns-DO_NOT_DELETE.json and stores that cross-account ARN. The following is an example LCC bash script:

#!/bin/bash
# This script creates the file that informs SageMaker Studio that the role
# "arn:aws:iam::123456789012:role/ASSUMABLE-ROLE" in remote account "123456789012"
# must be assumed to list and describe EMR clusters in the remote account.

set -eux

FILE_DIRECTORY="/home/sagemaker-user/.cross-account-configuration-DO_NOT_DELETE"
FILE_NAME="emr-discovery-iam-role-arns-DO_NOT_DELETE.json"
FILE="$FILE_DIRECTORY/$FILE_NAME"

mkdir -p $FILE_DIRECTORY

cat > "$FILE" <<- "EOF"
{
  "123456789012": "arn:aws:iam::123456789012:role/ASSUMABLE-ROLE"
}
EOF

At this point, a user can log in to their account and, although they can modify this file, there’s no impact to the admin’s ARN designation. This is because the value is already stored by this point, and the file is overwritten when the server is restarted, because the LCC runs every time the Jupyter server app is started.

This configuration process can be completely abstracted away from data workers who discover and connect to clusters within Studio. The only noticeable difference for cross-account clusters is that on the browsing tab, there is a column for the account ID in which the cluster is housed.

Use EMR clusters across accounts

After you establish cross-account visibility, the process for creating and stopping clusters remains the same as in Part 1. Refer to our GitHub repository for example cross-account CloudFormation stacks.

After you deploy the Service Catalog product, the process for end-users to spin up a cluster remains the same. Simply go to the Clusters page and choose Create cluster.

After cluster creation, we connect to our cluster using the Clusters graphical interface in Studio Notebooks. This creates an auto-populated magic cell that appears largely the same as with a single account, but with an appended parameter for the assumable cross-account role.

After the connection is made, we can proceed with the demo as before. You can clone our GitHub example repo and run through the notebook example just as in Part 1.

Conclusion

In this second and final part of our series, we showed how Studio users can create, connect, debug, and stop EMR clusters in cross-account setups. After you set up the networking and permissions, the end-user experience is just as we saw in Part 1. We encourage you to utilize this new functionality of Studio in your multi-account workloads today!


About the Authors

Sumedha Swamy is a Principal Product Manager at Amazon Web Services. He leads the SageMaker Studio team to build it into the IDE of choice for interactive data science and data engineering workflows. He has spent the past 15 years building customer-obsessed consumer and enterprise products using Machine Learning. In his free time, he likes photographing the amazing geology of the American Southwest.

Prateek Mehrotra is a Senior SDE working for SageMaker Studio at Amazon Web Services. He is focused on building interactive ML solutions which simplify usability by abstracting away complexity. In his spare time, Prateek enjoys spending time with his family and likes to explore the world with them.

Sriharsha M S is an AI/ML specialist solutions architect in the Strategic Specialist team at Amazon Web Services. He works with strategic AWS customers who are taking advantage of AI/ML to solve complex business problems. He provides technical guidance and design advice to implement AI/ML applications at scale. His expertise spans application architecture, big data, analytics, and machine learning.

Sean Morgan is a Senior ML Solutions Architect at AWS. He has experience in the semiconductor and academic research fields, and uses his experience to help customers reach their goals on AWS. In his free time, Sean is an active open source contributor/maintainer and is the special interest group lead for TensorFlow Addons.

Ruchir Tewari is a Senior Solutions Architect specializing in security and is a member of the ML TFC. For several years he has helped customers build secure architectures for a variety of hybrid, big data and AI/ML applications. He enjoys spending time with family, music and hikes in nature.

Luna Wang is a UX designer at AWS who has a background in computer science and interaction design. She is passionate about building customer-obsessed products and solving complex technical and business problems by using design methods. She is now working with a cross-functional team to build a set of new capabilities for interactive ML in SageMaker Studio.


Create and manage Amazon EMR Clusters from SageMaker Studio to run interactive Spark and ML workloads – Part 1

Amazon SageMaker Studio is the first fully integrated development environment (IDE) for machine learning (ML). It provides a single, web-based visual interface where you can perform all ML development steps required to prepare data, as well as build, train, and deploy models. We recently introduced the ability to visually browse and connect to Amazon EMR clusters right from the Studio notebook. Starting today, you can now monitor and debug your Spark jobs running on Amazon EMR from Studio notebooks with just a single click. Additionally, you can now discover, connect to, create, stop, and manage EMR clusters directly from Studio.

We demonstrate these newly introduced capabilities in this two-part post.

Analyzing, transforming, and preparing large amounts of data is a foundational step of any data science and ML workflow. Data workers such as data scientists and data engineers use Apache Spark, Hive, and Presto running on Amazon EMR for fast data preparation. Until today, these data workers could easily discover and connect to EMR clusters running in the same account as Studio but were unable to do so across accounts—a configuration common among several customer setups. Furthermore, when data workers needed to create EMR clusters tailored to their specific interactive workloads on demand, they had to switch interfaces to either request their administrator to create one or use detailed technical knowledge of DevOps to create it by themselves. This process was not only difficult and disruptive to their workflow, but also distracted data workers from focusing on their data preparation tasks. Consequently, although uneconomical, many customers kept persistent clusters running in anticipation of incoming workload regardless of active usage. Finally, monitoring and debugging Spark jobs running on Amazon EMR required setting up complex security rules and web proxies, adding significant friction to the data workers’ workflow.

Starting today, data workers can easily discover and connect to EMR clusters in single-account and cross-account configurations directly from Studio. Furthermore, you now have one-click access to the Spark UI to monitor and debug Spark jobs running on Amazon EMR right from Studio notebooks, which greatly simplifies your Spark debugging workflow. Finally, you can use the AWS Service Catalog to define and roll out preconfigured templates to select data workers to enable them to create EMR clusters right from Studio. You can fully control the organizational, security, compute, and networking guardrails to be adhered to when data workers use these templates. Data workers can visually browse through a set of templates made available to them, customize them for their specific workloads, create EMR clusters on demand, and stop them with just a few clicks in Studio. This feature considerably simplifies the data preparation workflow and enables you to more optimally use EMR clusters for interactive workloads from Studio.

In Part 1 of our series, we dive into the details of how DevOps administrators can use the AWS Service Catalog to define parameterized templates that data workers can use to create EMR clusters directly from the Studio interface. We provide an AWS CloudFormation template to create an AWS Service Catalog product for creating EMR clusters within an existing Amazon SageMaker domain, as well as a new CloudFormation template to stand up a SageMaker domain, Studio user profile, and Service Catalog product shared with that user so you can get started from scratch. As part of the solution, we utilize a single-click Spark UI interface to debug and monitor our ETL jobs. We use the transformed data to train and deploy an ML model using SageMaker training and hosting services.

As a follow-up, Part 2 provides a deep dive into cross-account setups. These multi-account setups are common amongst customers and are a best practice for many enterprise account setups, as mentioned in our AWS Well-Architected Framework.

Solution overview

We first describe how to communicate with Amazon EMR from Studio, as shown in the post Perform interactive data engineering and data science workflows from Amazon SageMaker Studio notebooks. In our solution, we utilize a SageMaker domain that has been configured with an elastic network interface through private VPC mode. That connected VPC is where we spin up our EMR clusters for this demo. For more information about the prerequisites, see our documentation.

The following diagram shows the complete user journey. A DevOps persona creates the Service Catalog product within a portfolio that is accessible to the Studio execution roles.

It’s important to note that you can use the full set of CloudFormation properties for Amazon EMR when creating templates that can be deployed through Studio. This means that you can enable Spot Instances, automatic scaling, and other popular configurations through your Service Catalog product.

You can parameterize the preset CloudFormation template (which creates the EMR cluster) so that end users can modify different aspects of the cluster to match their workloads. For example, the data scientist or data engineer may want to specify the number of core nodes on the cluster, and the creator of the template can specify AllowedValues to set guardrails.

The following template snippet shows some commonly used parameters:

"Parameters": {
    "EmrClusterName": {
      "Type": "String",
      "Description": "EMR cluster Name."
    },
    "CoreInstanceType": {
      "Type": "String",
      "Description": "Instance type of the EMR core nodes.",
      "Default": "m5.xlarge",
      "AllowedValues": [
        "m5.xlarge",
        "m3.2xlarge"
      ]
    },
    "CoreInstanceCount": {
      "Type": "String",
      "Description": "Number of core instances in the EMR cluster.",
      "Default": "2",
      "AllowedValues": [
        "2",
        "5",
        "10"
      ]
    },
    "EmrReleaseVersion": {
      "Type": "String",
      "Description": "The release version of EMR to launch.",
      "Default": "emr-5.33.1",
      "AllowedValues": [
        "emr-5.33.1",
        "emr-6.4.0"
      ]
    }
  }
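To show how these parameters might be consumed, the following is a minimal, illustrative Resources section for the same template. The subnet ID and IAM roles are placeholders, and a production template will typically include additional configuration (security groups, bootstrap actions, and the steps that create the Hive tables and load the demo data, as in our repository example):

```
"Resources": {
    "EmrCluster": {
      "Type": "AWS::EMR::Cluster",
      "Properties": {
        "Name": { "Ref": "EmrClusterName" },
        "ReleaseLabel": { "Ref": "EmrReleaseVersion" },
        "Applications": [ { "Name": "Spark" }, { "Name": "Hive" }, { "Name": "Livy" } ],
        "VisibleToAllUsers": true,
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "Instances": {
          "Ec2SubnetId": "subnet-EXAMPLE",
          "MasterInstanceGroup": {
            "InstanceCount": 1,
            "InstanceType": "m5.xlarge"
          },
          "CoreInstanceGroup": {
            "InstanceCount": { "Ref": "CoreInstanceCount" },
            "InstanceType": { "Ref": "CoreInstanceType" }
          }
        }
      }
    }
  }
```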

For the product to be visible within the Studio interface, we need to set the following tags on the Service Catalog product:

sagemaker:studio-visibility:emr true
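The stacks provided with this post apply this tag for you, but for reference, here is one way an admin could set it when creating the product with the AWS CLI. The product name, owner, and template URL below are placeholders:

```bash
# Product name, owner, and template URL are placeholders; substitute your own.
aws servicecatalog create-product \
    --name "SageMaker Studio Domain No Auth EMR" \
    --owner "DevOps" \
    --product-type CLOUD_FORMATION_TEMPLATE \
    --provisioning-artifact-parameters \
        'Name=v1,Type=CLOUD_FORMATION_TEMPLATE,Info={LoadTemplateFromURL=https://example-bucket.s3.amazonaws.com/emr-cluster-template.json}' \
    --tags Key=sagemaker:studio-visibility:emr,Value=true
```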

Lastly, the CloudFormation template in the Service Catalog product must have the following mandatory stack parameters:

```
SageMakerProjectName:
  Type: String
  Description: Name of the project

SageMakerProjectId:
  Type: String
  Description: Service generated Id of the project
```

Both values for these parameters are automatically injected when the stack is launched, so you don’t need to fill them in. They’re part of the template because SageMaker projects are utilized as part of the integration between the Service Catalog and Studio.

The second part of the single-account user journey (as shown in the architecture diagram) is from the data worker’s perspective within Studio. As shown in the post Perform interactive data engineering and data science workflows from Amazon SageMaker Studio notebooks, Studio users can browse existing EMR clusters and seamlessly connect to them using Kerberos, LDAP, HTTP, or no-auth mechanisms. Now, you can also create new EMR clusters through provisioning of templates, as shown in the following architecture diagram.

For Studio users to browse the available clusters, we need to attach an AWS Identity and Access Management (IAM) policy that permits Amazon EMR discoverability. For more information, see our existing documentation.

Deploy resources with AWS CloudFormation

For this post, we’ve provided two CloudFormation stacks, available in our GitHub repository, to demonstrate the Studio and Amazon EMR capabilities.

The first stack provides an end-to-end CloudFormation template that stands up a private VPC, a SageMaker domain attached to that VPC, and a SageMaker user with visibility to the pre-created Service Catalog product.

The second stack is intended for users with existing Studio private VPC setups who want to utilize a CloudFormation stack to deploy a Service Catalog product and make it visible to an existing SageMaker user.

You will be charged for Studio and Amazon EMR resources used when you launch the following stacks. For more information, see Amazon SageMaker Pricing and Amazon EMR pricing.

Follow the instructions in the cleanup sections at the end of this post to make sure that you don’t continue to be charged for these resources.

To launch the end-to-end stack, choose the stack for your desired Region.

ap-northeast-1
ap-northeast-2
ap-south-1
ap-southeast-1
ca-central-1
eu-central-1
eu-north-1
eu-west-1
eu-west-2
eu-west-3
sa-east-1
us-east-1
us-east-2
us-west-1
us-west-2

This stack is intended to be a from-scratch setup, so the admin doesn’t need to input account-specific parameters when launching it. However, because our subsequent Amazon EMR stack uses the outputs of this stack, we need to provide a deterministic stack name so that it can be referenced. The preceding link provides the stack name expected by this demo, and it should not be modified.

After we launch the stack, we can see that our Studio domain has been created, and studio-user is attached to an execution role that was created with visibility to our Service Catalog product.

If you choose to run the end-to-end stack, skip the following existing domain information.

If you have an existing domain stack, launch the following stack in your preferred Region.

ap-northeast-1
ap-northeast-2
ap-south-1
ap-southeast-1
ca-central-1
eu-central-1
eu-north-1
eu-west-1
eu-west-2
eu-west-3
sa-east-1
us-east-1
us-east-2
us-west-1
us-west-2

Because this stack is intended for accounts with existing domains that are attached to a private subnet, the admin fills in the required parameters during the stack launch. This is intended to simplify the experience for downstream data workers, and we abstract this networking information away from them.

Again, because the subsequent Amazon EMR stack utilizes the parameters the admin inputs here, we need to provide a deterministic stack name so that they can be referenced. The preceding stack link provides the stack name as expected by this demo.

If you’re using the second stack with an existing domain and users, you need to complete one additional step to make sure the Spark UI functionality is available and that your user can browse EMR clusters and spin them up and down. Simply attach the following policy to the SageMaker execution role that you input as a parameter, providing the Region and account ID as needed:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowPresignedUrl",
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:DescribeCluster",
                "elasticmapreduce:ListInstanceGroups",
                "elasticmapreduce:CreatePersistentAppUI",
                "elasticmapreduce:DescribePersistentAppUI",
                "elasticmapreduce:GetPersistentAppUIPresignedURL",
                "elasticmapreduce:GetOnClusterAppUIPresignedURL"
            ],
            "Resource": [
                "arn:aws:elasticmapreduce:<region>:<account-id>:cluster/*"
            ]
        },
        {
            "Sid": "AllowClusterDetailsDiscovery",
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:DescribeCluster",
                "elasticmapreduce:ListInstanceGroups"
            ],
            "Resource": [
                "arn:aws:elasticmapreduce:<region>:<account-id>:cluster/*"
            ]
        },
        {
            "Sid": "AllowClusterDiscovery",
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:ListClusters"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllowSagemakerProjectManagement",
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateProject",
                "sagemaker:DeleteProject"
            ],
            "Resource": "arn:aws:sagemaker:<region>:<account-id>:project/*"
        },
        {
            "Sid": "AllowEMRTemplateDiscovery",
            "Effect": "Allow",
            "Action": [
              "servicecatalog:SearchProducts"
            ],
            "Resource": "*"
        }
    ]
}
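If you prefer the AWS CLI to the console, you can attach this policy as an inline policy on the execution role. The role name, policy name, and file name below are placeholders:

```bash
# Save the policy JSON above as emr-studio-access.json, then attach it inline.
# Role name and policy name are placeholders; use your SageMaker execution role.
aws iam put-role-policy \
    --role-name <your-sagemaker-execution-role-name> \
    --policy-name StudioEmrDiscoveryAndPresignedUrl \
    --policy-document file://emr-studio-access.json
```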

Review the AWS Service Catalog product

After you launch your stack, you can see that an IAM role was created as a launch constraint, which provisions our EMR cluster. Both stacks also generated the AWS Service Catalog product and the association to our Studio execution role.

On the list of AWS Service Catalog products, we see the product name, which is later visible from the Studio interface.

This product has a launch constraint that governs the role that creates the cluster.

Note that our product has been tagged appropriately for visibility within the Studio interface.

If we look into the template that was provisioned, we can see the CloudFormation template that initializes our cluster, creates the Hive tables, and loads them with the demo data.

Create an EMR cluster from Studio

After the Service Catalog product has been created in your account through the stack that fits your setup, we can continue the demonstration from the data worker’s perspective.

  1. Launch a Studio notebook.
  2. Under SageMaker resources, choose Clusters on the drop-down menu.
  3. Choose Create cluster.
  4. From the available templates, choose the provisioned template SageMaker Studio Domain No Auth EMR.
  5. Enter your desired configurable parameters and choose Create cluster.

You can now monitor the deployment on the Clusters management tab. As part of the template, our cluster instantiates Hive tables with some data that we can use as part of our example.

Connect to an EMR cluster from Studio

After your cluster has entered the Running/Waiting status, you can connect to the cluster in the same way as was described in the post Perform interactive data engineering and data science workflows from Amazon SageMaker Studio notebooks.

First, we clone our GitHub repo.

As of this writing, only a subset of kernels support connecting to an existing EMR cluster. For the full list of supported kernels, and information on building your own Studio images with connectivity capabilities, see our documentation. For this post, we use the SparkMagic kernel from the PySpark image and run the smstudio-pyspark-hive-sentiment-analysis.ipynb notebook from the repository.

For simplicity, the template that we deploy uses a no-auth authentication mechanism, but as shown in our previous post, this works seamlessly with Kerberos, LDAP, and HTTP auth as well.

After a connection is made, there is a hyperlink for the Spark UI, which we use to debug and monitor our demonstration. We dive into the technical details later in the post, but you can open this in a new tab now.

Next, we show the functionality from our previous post where we can query the newly instantiated tables using PySpark, write transformed data to Amazon Simple Storage Service (Amazon S3), and launch SageMaker training and hosting jobs all from the same smstudio-pyspark-hive-sentiment-analysis.ipynb notebook.
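For a sense of what those notebook cells look like, here is a minimal, illustrative PySpark sketch. The table and bucket names are hypothetical; use the tables created by your template and your own S3 location:

```python
# Runs on the connected EMR cluster through the SparkMagic (PySpark) kernel,
# where the `spark` session is already available.
spark.sql("SHOW TABLES").show()

# "movie_reviews" is an illustrative table name; substitute a table from your template.
df = spark.sql("SELECT * FROM movie_reviews LIMIT 10")
df.show()

# Write transformed data to Amazon S3 (bucket name is a placeholder).
df.write.mode("overwrite").parquet("s3://example-bucket/transformed/movie_reviews/")
```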

The following screenshots demonstrate preprocessing the data.

The following screenshots show the process of training the model.

The following screenshots demonstrate deploying the model.

Monitor and debug with the Spark UI

As mentioned before, the process for viewing the Spark UI has been greatly simplified, and a presigned URL is generated at the time of connection to your cluster. Each presigned URL has a time to live of 5 minutes.

You can use this UI for monitoring your Spark run and shuffling, among other things. For more information, see the documentation.

Stop an EMR cluster from Studio

After we’re done with our analysis and model building, we can use the Studio interface to stop our cluster. Because this runs DELETE STACK under the hood, users only have access to stop clusters that were launched using provisioned Service Catalog templates and can’t stop existing clusters that were created outside of Studio.
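The Studio interface is the intended path, but for reference, stopping a cluster this way is roughly equivalent to terminating the underlying Service Catalog provisioned product. The provisioned product name below is a placeholder:

```bash
# Roughly what happens under the hood when you stop the cluster from Studio;
# the provisioned product name is a placeholder.
aws servicecatalog terminate-provisioned-product \
    --provisioned-product-name my-studio-emr-cluster
```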

Clean up the end-to-end stack

If you deployed the end-to-end stack, complete the following steps to clean up resources deployed for this solution:

  1. Stop your cluster, as shown in the previous section.

This also deletes the S3 bucket, so you should copy the contents in the bucket to a backup location if you want to retain the data for later use.

  2. On the Studio console, choose your user name (studio-user).
  3. Delete all the apps listed under Apps by choosing Delete app.
  4. Wait until the status shows as Completed.

Next, you delete your Amazon Elastic File System (Amazon EFS) volume.

  5. On the Amazon EFS console, delete the file system that SageMaker created.

You can confirm it’s the correct volume by choosing the file system ID and confirming the tag is ManagedByAmazonSageMakerResource.
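If you want to double-check from the command line before deleting, a query like the following (illustrative only) lists the file systems carrying that tag:

```bash
# List file system IDs that carry the ManagedByAmazonSageMakerResource tag.
aws efs describe-file-systems \
    --query "FileSystems[?Tags[?Key=='ManagedByAmazonSageMakerResource']].FileSystemId" \
    --output text
```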

Finally, you delete the CloudFormation template.

  6. On the AWS CloudFormation console, choose Stacks.
  7. Select the stack you deployed for this solution.
  8. Choose Delete.

Clean up the existing domain stack

The second stack has a simpler cleanup because we’re leaving the Studio resources in place as they were prior to starting this tutorial.

  1. Stop your cluster as shown in the previous cleanup instructions.
  2. Remove the attached policy you added to the SageMaker execution role that permitted Amazon EMR browsing and PresignedURL access.
  3. On the AWS CloudFormation console, choose Stacks.
  4. Select the stack you deployed for this solution.
  5. Choose Delete.

Conclusion

In this post, we demonstrated a unified notebook-centric experience to create and manage EMR clusters, run analytics on those clusters, and train and deploy SageMaker models, all from the Studio interface. We also showed a one-click interface for debugging and monitoring Amazon EMR jobs through the Spark UI. We encourage you to try out this new functionality in Studio yourself, and check out Part 2 of this post, which dives deep into how data workers can discover, connect to, create, and stop clusters in a multi-account setup.


About the Authors

Sumedha Swamy is a Principal Product Manager at Amazon Web Services. He leads the SageMaker Studio team to build it into the IDE of choice for interactive data science and data engineering workflows. He has spent the past 15 years building customer-obsessed consumer and enterprise products using Machine Learning. In his free time, he likes photographing the amazing geology of the American Southwest.

Prateek Mehrotra is a Senior SDE working for SageMaker Studio at Amazon Web Services. He is focused on building interactive ML solutions which simplify usability by abstracting away complexity. In his spare time, Prateek enjoys spending time with his family and likes to explore the world with them.

Sriharsha M S is an AI/ML specialist solutions architect in the Strategic Specialist team at Amazon Web Services. He works with strategic AWS customers who are taking advantage of AI/ML to solve complex business problems. He provides technical guidance and design advice to implement AI/ML applications at scale. His expertise spans application architecture, big data, analytics, and machine learning.

Sean Morgan is a Senior ML Solutions Architect at AWS. He has experience in the semiconductor and academic research fields, and uses his experience to help customers reach their goals on AWS. In his free time, Sean is an active open source contributor/maintainer and is the special interest group lead for TensorFlow Addons.

Ruchir Tewari is a Senior Solutions Architect specializing in security and is a member of the ML TFC. For several years he has helped customers build secure architectures for a variety of hybrid, big data and AI/ML applications. He enjoys spending time with family, music and hikes in nature.

Luna Wang is a UX designer at AWS who has a background in computer science and interaction design. She is passionate about building customer-obsessed products and solving complex technical and business problems by using design methods. She is now working with a cross-functional team to build a set of new capabilities for interactive ML in SageMaker Studio.
