Creating a multi-department enterprise search using custom attributes in Amazon Kendra


An enterprise typically houses multiple departments, such as engineering, finance, legal, and marketing, which together produce a growing volume of documents and content that employees need to access. Creating a search experience that intuitively delivers the right information according to an employee’s role and department is critical to driving productivity and ensuring security.

Amazon Kendra is a highly accurate and easy-to-use enterprise search service powered by machine learning. Amazon Kendra delivers powerful natural language search capabilities to your websites and applications. These capabilities help your end-users easily find the information they need within the vast amount of content spread across your company.

With Amazon Kendra, you can index the content from multiple departments and data sources into one Amazon Kendra index. To tailor the search experience by user role and department, you can add metadata to your documents and FAQs using Kendra’s built-in attributes and custom attributes and apply user context filters.

For search queries issued from a specific department’s webpage, you can set Kendra to only return content from that department filtered by the employee’s access level. For example, an associate role may only access a subset of restricted documents. In contrast, the department manager might have access to all the documents.

This post provides a solution for indexing content from multiple departments into one Amazon Kendra index. To manage content access, the organization can create restrictions based on an employee’s role and department, or provide page-level filtering of search results. We demonstrate how content is filtered based on the webpage location and individual user groups.

Solution architecture

The following architecture comprises two primary components: document ingestion into Amazon Kendra and document query using Amazon API Gateway.

Architecture diagram depicting a pattern for multi-department enterprise search

The preceding diagram depicts a fictitious enterprise environment with two departments: Marketing and Legal. Each department has its own webpage on their internal website. Every department has two employee groups: associates and managers. Managers are entitled to see all the documents, but associates can only see a subset.

When employees from Marketing issue a search query on their department page, they only see the documents they are entitled to within their department (pink documents without the key). In contrast, the Marketing Manager sees all Marketing documents (all pink documents).

When employees from Legal search on a Marketing department page, they don’t see any documents. When all employees search on the internal website’s main page, they see the public documents common to all departments (yellow).

The following table shows the types of documents an employee gets for the various query combinations of webpage, department, and access roles.

Access Control Table

Ingesting documents into Amazon Kendra

The document ingestion step consists of ingesting content and metadata from different departments’ specific S3 buckets, indexed by Amazon Kendra. Content can comprise structured data like FAQs and unstructured content like HTML, Microsoft PowerPoint, Microsoft Word, plain text, and PDF documents. For ingesting FAQ documents into Amazon Kendra, you can provide the questions, answers, and optional custom and access control attributes in either CSV or JSON format.

You can add metadata to your documents and FAQs using the built-in attributes in Amazon Kendra, custom attributes, and user context filters. You can filter content using a combination of these custom attributes and user context filters. For this post, we index each document and FAQ with:

  1. Built-in attribute _category to represent the web page.
  2. User context filter attribute for the employee access level.
  3. Custom attribute department representing the employee department.

The following code is an example of the FAQ document for the Marketing webpage:

{
  "SchemaVersion": 1,
  "FaqDocuments": [
    {
      "Question": "What is the reimbursement policy for business related expenses?",
      "Answer": "All expenses must be submitted within 2 weeks.",
      "Attributes": {
        "_category": "page_marketing",
        "department": "marketing"
      },
      "AccessControlList": [
        {
          "Name": "associate",
          "Type": "GROUP",
          "Access": "ALLOW"
        },
        {
          "Name": "manager",
          "Type": "GROUP",
          "Access": "ALLOW"
        }
      ]
    },
    {
      "Question": "What are the manager guidelines for employee promotions?",
      "Answer": "Guidelines for employee promotions can be found on the manager portal.",
      "Attributes": {
        "_category": "page_marketing",
        "department": "marketing"
      },
      "AccessControlList": [
        {
          "Name": "manager",
          "Type": "GROUP",
          "Access": "ALLOW"
        }
      ]
    }
  ]
}

The following code is an example of the metadata document for the Legal webpage:

{
  "DocumentId": "doc1",
  "Title": "What is the immigration policy?",
  "ContentType": "PLAIN_TEXT",
  "Attributes": {
    "_category": "page_legal",
    "department": "legal"
  },
  "AccessControlList": [
    {
      "Name": "associate",
      "Type": "GROUP",
      "Access": "ALLOW"
    }
  ]
}

Document search by department

The search capability is exposed to the client application through an API Gateway endpoint. The API accepts an optional path parameter identifying the webpage on which the query was issued. If the query comes from the Marketing-specific page, the request path looks like /search/dept/marketing. For a comprehensive website search covering all the departments, you leave out the path parameter, and the request path is simply /search. Every API request also carries two header values: EMP_ROLE, representing the employee access level, and EMP_DEPT, representing the department name. In this post, we don’t describe how to authenticate users; we assume that you populate these two header values after authenticating the user with Amazon Cognito or your own custom solution.

The AWS Lambda function that serves the API Gateway parses the path parameters and headers and issues an Amazon Kendra query call with AttributeFilters set to the category name from the path parameter (if present), the employee access level, and department from the headers. Amazon Kendra returns the FAQs and documents for that particular category and filters them by the employee access level and department. The Lambda function constructs a response with these search results and sends the FAQ and document search results back to the client application.
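The filter-building logic described above can be sketched in Python. This is a minimal illustration, not the solution’s actual Lambda code: the function and variable names are assumptions, while the attribute keys (`_category`, `department`) and the group names follow the metadata used in this post. The resulting dictionary would be passed to `kendra.query(**args)` via boto3.

```python
# Hypothetical sketch of how the Lambda might assemble the Amazon Kendra
# query call from the path parameter and the EMP_ROLE / EMP_DEPT headers.
def build_query_args(index_id, query_text, category, emp_role, emp_dept):
    """Build kwargs for kendra.query(). category may be None for the
    site-wide /search endpoint (no page-level filtering)."""
    filters = [{"EqualsTo": {"Key": "department",
                             "Value": {"StringValue": emp_dept}}}]
    if category:  # present only for department-specific pages
        filters.append({"EqualsTo": {"Key": "_category",
                                     "Value": {"StringValue": category}}})
    return {
        "IndexId": index_id,
        "QueryText": query_text,
        "AttributeFilter": {"AndAllFilters": filters},
        # User context filter: only documents whose AccessControlList
        # allows this group are returned.
        "UserContext": {"Groups": [emp_role]},
    }

args = build_query_args("my-index-id", "financial targets",
                        "page_marketing", "associate", "marketing")
# kendra = boto3.client("kendra"); response = kendra.query(**args)
```

Note how the department filter always applies, while the `_category` filter is added only when the query originates from a department page, matching the behavior in the access control table above.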

Deploying the AWS CloudFormation template

  1. You can deploy this architecture using the provided AWS CloudFormation template in us-east-1. Choose the launch link to get started.

CloudFormation Stack

  1. Choose Next.
  2. Provide a stack name and choose Next.
  3. In the Capabilities and transforms section, select all three check boxes to acknowledge that AWS CloudFormation will create IAM resources and expand the template.

Acknowledgement section of CloudFormation template
  4. Choose Create stack.

This process might take 15 minutes or more to complete and creates the following resources:

  • An Amazon Kendra index
  • Three S3 buckets representing the departments: Legal, Marketing, and Public
  • Three Amazon Kendra data sources that connect to the S3 buckets
  • A Lambda function and an API Gateway endpoint that is called by the client application

After the CloudFormation template finishes deploying the above infrastructure, you will see the following Outputs.

CloudFormation Outputs Section

API Key and Usage Plan

  1. The KendraQueryAPI requires an API key. The CloudFormation output ApiGWKey refers to the name of the API key. This API key is associated with a usage plan that allows 2,000 requests per month.
  2. Click the link in the Value column corresponding to the key ApiGWKey. This opens the API Keys section of the API Gateway console.
  3. Click Show next to the API key.
  4. Copy the API key. We will use it when testing the API.

API Key section in API Gateway Console

  5. You can manage the usage plan by following the instructions in Create, configure, and test usage plans with the API Gateway console.
  6. You can also add fine-grained authentication and authorization to your APIs. For more information, see Controlling and managing access to a REST API in API Gateway.

Uploading sample documents and FAQ

Add your documents and FAQs file to their corresponding S3 buckets. We’ve also provided some sample document files and a sample FAQs file for you to download.

Upload each document file to the S3 bucket whose name matches the file name prefix; these buckets were created as part of the CloudFormation stack. For example, all Marketing documents and their corresponding metadata files go into the kendra-blog-data-source-marketing-[STACK_NAME] bucket. Upload the FAQ document into the kendra-blog-faqs-[STACK_NAME] bucket.

Creating the facet definition for custom attributes

In this step, you add a facet definition to the index.

  1. On the Amazon Kendra console, choose the index created in the previous step.
  2. Choose Facet definition.
  3. Choose Add field.
  4. For Field name, enter department.
  5. For Data type, choose String.
  6. For Usage types, select Facetable, Searchable, Displayable, and Sortable.
  7. Choose Add.

Adding a Facet to Kendra index

  1. On the Amazon Kendra console, choose the newly created index.
  2. Choose Data sources.
  3. Sync kendra-blog-data-source-legal-[STACK_NAME], kendra-blog-data-source-marketing-[STACK_NAME], and kendra-blog-data-source-public-[STACK_NAME] by selecting the data source name and choosing Sync now. You can sync multiple data sources simultaneously.

This should start the indexing process of the documents in the S3 buckets.
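If you prefer to script this step instead of using the console, the same action maps to the StartDataSourceSyncJob API. The sketch below builds the request parameters; the data source IDs are placeholders that you would look up with ListDataSources.

```python
# Hypothetical sketch: scripting the console sync step. The data source
# IDs below are placeholders, not the actual IDs from the stack.
def sync_job_requests(index_id, data_source_ids):
    """One start_data_source_sync_job request per data source."""
    return [{"Id": ds, "IndexId": index_id} for ds in data_source_ids]

requests_ = sync_job_requests(
    "my-index-id", ["ds-legal", "ds-marketing", "ds-public"])
# kendra = boto3.client("kendra")
# for r in requests_:
#     kendra.start_data_source_sync_job(**r)
```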

Adding FAQ documents

After you create your index, you can add your FAQ data.

  1. On the Amazon Kendra console, choose the new index.
  2. Choose FAQs.
  3. Choose Add FAQ.
  4. For FAQ name, enter a name, such as demo-faqs-with-metadata.
  5. For FAQ file format, choose JSON file.
  6. For S3, browse Amazon S3 to find kendra-blog-faqs-[STACK_NAME], and choose the faqs.json file.
  7. For IAM role, choose Create a new role to allow Amazon Kendra to access your S3 bucket.
  8. For Role name, enter a name, such as AmazonKendra-blog-faq-role.
  9. Choose Add.

Setting up FAQs in Amazon Kendra
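The console steps above correspond to the CreateFaq API. Here is a hedged sketch of the equivalent call; the bucket name, key, and role ARN are placeholders for the values from your stack.

```python
# Hypothetical sketch of the CreateFaq request matching the console steps.
def create_faq_request(index_id, role_arn, bucket, key,
                       name="demo-faqs-with-metadata"):
    return {
        "IndexId": index_id,
        "Name": name,
        "RoleArn": role_arn,                      # must allow Kendra to read the bucket
        "S3Path": {"Bucket": bucket, "Key": key},
        "FileFormat": "JSON",                     # matches the FAQ file format chosen above
    }

req = create_faq_request(
    "my-index-id",
    "arn:aws:iam::123456789012:role/AmazonKendra-blog-faq-role",
    "kendra-blog-faqs-mystack", "faqs.json")
# kendra = boto3.client("kendra"); kendra.create_faq(**req)
```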

Testing the solution

You can test the various combinations of page and user-level attributes on the API Gateway console. You can refer to Test a method with API Gateway console to learn about how to test your API using the API Gateway console.

The following screenshot is an example of testing the scenario where an associate from the Marketing department searches on the department-specific page.

You will have to pass the following parameters while testing the above scenario.

  1. Path: page_marketing
  2. Query String: queryInput="financial targets"
  3. Headers:
    1. x-api-key: << Your API Key copied earlier from the CloudFormation step >>
    2. EMP_ROLE:associate
    3. EMP_DEPT:marketing
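The same test can be issued programmatically. The following sketch uses only the Python standard library; the endpoint URL and API key are placeholders (you would take them from the CloudFormation outputs), and the /search/dept/{department} path follows the routes described earlier in this post.

```python
import urllib.parse
import urllib.request

# Hypothetical sketch of calling the search API outside the console.
def build_request(api_url, api_key, query, dept, emp_role, emp_dept):
    """Build the GET request. dept is the page's department segment,
    or None for the site-wide /search endpoint."""
    path = f"/search/dept/{dept}" if dept else "/search"
    qs = urllib.parse.urlencode({"queryInput": query})
    return urllib.request.Request(
        f"{api_url}{path}?{qs}",
        headers={"x-api-key": api_key,   # copied from the API Gateway console
                 "EMP_ROLE": emp_role,   # employee access level
                 "EMP_DEPT": emp_dept})  # employee department

req = build_request(
    "https://example.execute-api.us-east-1.amazonaws.com/prod",
    "MY_API_KEY", "financial targets", "marketing", "associate", "marketing")
# response = urllib.request.urlopen(req)  # body contains the search results JSON
```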

You will see a JSON response with a FAQ result matching the above conditions.

…
"DocumentExcerpt": {"Text": "Please work with your manager to understand the goals for your department.", 
…

You can keep the queryInput="financial targets" but change the EMP_ROLE from associate to manager, and you should see a different answer.

…
"DocumentExcerpt": { "Text": "The plan is to achieve 2x the sales in the next quarter.",
…

Cleaning up

To remove all resources created throughout this process and prevent additional costs, complete the following steps:

  1. Delete all the files from the S3 buckets.
  2. On the AWS CloudFormation console, delete the stack you created. This removes the resources the CloudFormation template created.

Conclusion

In this post, you learned how to use Amazon Kendra to deploy a cognitive search service across multiple departments in your organization and filter documents using custom attributes and user context filters. You don’t need any previous ML or AI experience to get started with Amazon Kendra. Use Amazon Kendra to give your employees faster access to information that is spread across your organization.


About the Authors

Shanthan Kesharaju is a Senior Architect in the AWS ProServe team. He helps our customers with AI/ML strategy and architecture, and with developing products with a purpose. Shanthan has an MBA in Marketing from Duke University and an MS in Management Information Systems from Oklahoma State University.

 

 

Marty Jiang is a Conversational AI Consultant with AWS Professional Services. Outside of work, he loves spending time outdoors with his family and exploring new technologies.


Getting started with AWS DeepRacer community races


AWS DeepRacer allows you to get hands-on with machine learning (ML) through a fully autonomous 1/18th scale race car driven by reinforcement learning, a 3D racing simulator on the AWS DeepRacer console, a global racing league, and hundreds of customer-initiated community races.

With AWS DeepRacer community races, you can create your own race and invite your friends and colleagues to compete. The AWS DeepRacer console now supports object avoidance and head-to-bot races in addition to time trial racing formats, enabling racers at all skill levels to engage and learn about ML and challenge their friends. There’s never been a better time to get rolling with AWS DeepRacer!

The Accenture Grand Prix

We have worked with partners all over the world to bring ML to their employees and customers by enabling them to host their own races. One of these partners, Accenture, has been hosting its own internal AWS DeepRacer event since 2019. Accenture enables customers all over the world to build artificial intelligence (AI) and ML-powered solutions through their team of more than 8,000 AWS-trained technologists. They’re always looking for new and engaging ways to develop their teams with hands-on training.

In November 2019, Accenture launched their own internal AWS DeepRacer League. The Accenture league was planned to run throughout 2020, spanning 30 global locations in 17 countries, with a physical and virtual track at each location, for their employees to compete for the title of Accenture AWS DeepRacer Champion. At the start of their league season, Accenture hosted some in-person local events, which were well-attended and received, but as the COVID-19 pandemic unfolded, Accenture pivoted to all virtual events. This was made possible with AWS DeepRacer community races. Accenture quickly set up and customized races, selecting the date, time, and track, and inviting participants.

This fall, Accenture takes their racing to a new level with their 2-month-long finals championship, the Accenture Grand Prix. This event takes advantage of the latest update to community races as of October 1, 2020: the addition of object avoidance and head-to-bot racing formats. In object avoidance races, you use sensors to detect and avoid obstacles placed on the track. In head-to-bot, you race against another AWS DeepRacer bot on the same track and try to avoid it while still turning in the best lap time. The car uses visual information to sense and avoid objects it approaches on the track.

Amanda Jensen, Associate Director in the Accenture AWS Business Group, is heading up the Accenture Grand Prix. Making sure their employees are trained with the right skills is crucial to their business of helping other organizations unlock the advantages of ML.

“The skills most relevant are a combination of basic cloud skills as well as programming, including languages such as Python and R, statistics and regression, and data science,” Jensen says. “One of the largest obstacles in training for employees staffed on non-AI or ML projects is the opportunity to apply or grow skills in a setting where they can visualize how data science works. Applying algorithms on paper or reading about them isn’t the same.”

That’s where AWS DeepRacer comes in. It’s a great way for teams to get started in ML training, see it come to life, and enable team building. AWS DeepRacer makes the experience of learning fun and accessible.

“One of our team members mentioned that before getting hands-on with DeepRacer, she didn’t have any background in ML,” Jensen says. “The console, models, and training module for AWS DeepRacer made it easy to visualize the steps and understand how the model was being trained in the background without getting too deep into the complicated mathematics. With the added bonus of having the physical car, she was able to actually see in real time the changes, failures, and successes of the model.”

Jensen also sees the added head-to-bot format as a key new feature to elevate the AWS DeepRacer competition experience.

“In our global competition last year, it quickly became apparent that the competition between the locations was really the driving force behind the engagement,” Jensen says. “People wanted their office location to be on the board. This will bring that level of competition to the individual races and get people enthusiastic.”

Starting your own race

Whether or not you have competed in races before, creating and hosting a community race may be what you’re looking for to get you started with AWS DeepRacer and ML. Anyone can start a community race for free and invite participants to join.

With community races, you can select your own track, race date, time, and who you want to invite to participate. Hosting your own race provides an opportunity for you to build your own community and provide team-building events for friends and work colleagues. Community races are another exciting way AWS DeepRacer provides an opportunity for you to compete, meet fellow enthusiasts, and get started with ML!

In this section, we walk you through setting up your own community race. All you need to do is sign up for an AWS account (if you don’t already have one) and go to the AWS DeepRacer console.

  1. On the AWS DeepRacer console, choose Community races.
  2. Choose Create a race.
  3. For Choose race type, select the type of race you want. For this post, we select Time Trial.
  4. For Name of the racing event, enter a name for your race.
  5. For Race dates, enter the start and end dates for your race.

In the Race customization section, you can optionally customize your race details with track and evaluation criteria.

  1. For Competition tracks, select your track. For this post, we select Cumulo Turnpike.
  2. Customize the remaining race track options as desired.
  3. Choose Next.

  1. Review your race settings and choose Submit.

An invitation page and link is generated for you to copy and send out to your friends and colleagues you want to invite to compete in your race.

Now that the race is created, you’re ready to host your own event. Make sure that everyone you invited takes the proper training to build, train, and evaluate their model before the race. When everyone is ready, you’re all set to start racing with your friends!

Who can host an event?

Community races are hosted by all kinds of people and groups, from large companies like Accenture to ML enthusiasts who want to test their skills.

Juv Chan, a community hero for AWS DeepRacer, recently hosted his own event. “I was the main organizer for the AWS DeepRacer Beginner Challenge virtual community race, which started on April 3, 2020, and ended May 31, 2020,” Chan says. “It was the first community race that was organized exclusively for the DeepRacer beginner community globally.”

After Chan decided he wanted to get more beginner-level developers involved in racing and learning ML, he set out to create his own event through the AWS DeepRacer console.

“My first experience on setting up a new community race in the AWS DeepRacer console was easy, fast, and straightforward,” Chan says. “I was able to create my first community race in less than 3 minutes when I had all the necessary requirements and details to create the race. I would recommend new users who want to create a new community race to create a mock race in advance to get familiar with all the required details and limitations. It’s possible to edit race details after creating the event too.”

After you set up the race, you need to invite other developers to create an account, train models, and compete in your race. Chan worked with AWS and the AWS ML community to convince racers to join the fun.

“Getting beginner racers to participate was my next challenge,” Chan says. “I worked with AWS and AWS Machine Learning Community Heroes to create a community race event landing page and step-by-step race guide blog post on how to get started and participate the race. I have promoted the events through various AWS, autonomous driving, reinforcement learning, and relevant AI user groups and social media channels in different regions globally. I also created a survey to get feedback from the communities.”

Overall, Chan had a great experience hosting the race. For more information about his experiences and best-kept secrets, see Train a Viable Model in 45 minutes for AWS DeepRacer Beginner Challenge Virtual Community Race 2020.

Join the race to win glory and prizes!

As you can see, there are plenty of ways to compete against your fellow racers right now! If you think you’re ready to create your own community race and invite fellow racers to create a model and compete, it’s easy to get started.

If you’re new to AWS DeepRacer but still want to compete, you can create your own model on the console and submit it to compete in the AWS DeepRacer Virtual Circuit, where you can compete in time trial, object avoidance, and head-to-head racing formats. Hundreds of developers have extended their ML journey by competing in the Virtual Circuit races in 2020.

For more information about an AWS DeepRacer competition from earlier in the year, check out the AWS DeepRacer League F1 ProAm event. You can also learn more about AWS DeepRacer in upcoming AWS Summit Online events. Sign in to the AWS DeepRacer console now to learn more, start your ML journey, and get rolling with AWS DeepRacer!


About the Author

Dan McCorriston is a Senior Product Marketing Manager for AWS Machine Learning. He is passionate about technology, collaborating with developers, and creating new methods of expanding technology education. Out of the office he likes to hike, cook and spend time with his family.

 


Onboarding Amazon SageMaker Studio with AWS SSO and Okta Universal Directory


In 2019, AWS announced Amazon SageMaker Studio, a unified integrated development environment (IDE) for machine learning (ML) development. You can write code, track experiments, visualize data, and perform debugging and monitoring within a single, integrated visual interface.

Amazon SageMaker Studio supports a single sign-on experience with AWS Single Sign-On (AWS SSO) authentication. External identity providers (IdPs) such as Azure Active Directory and Okta Universal Directory can be integrated with AWS SSO to serve as the source of truth for Amazon SageMaker Studio. Users are given access to Amazon SageMaker Studio via a unique login URL that directly opens Amazon SageMaker Studio, and they can sign in with their existing corporate credentials. Administrators can continue to manage users and groups in their existing identity systems, which can then be synchronized with AWS SSO. For instance, AWS SSO enables administrators to connect their on-premises Active Directory (AD) or their AWS Managed Microsoft AD directory, as well as other supported identity providers. For more information, see The Next Evolution in AWS Single Sign-On and Single Sign-On between Okta Universal Directory and AWS.

In this post, we walk you through setting up SSO with Amazon SageMaker Studio and enabling SSO with Okta Universal Directory. We also demonstrate the SSO experience for system administrators and Amazon SageMaker Studio users.

Prerequisites

To use the same Okta user login for Amazon SageMaker Studio, you need to set up AWS SSO and connect to Okta Universal Directory. The high-level steps are as follows:

  1. Enable AWS SSO on the AWS Management Console. Create this AWS SSO account in the same AWS Region as Amazon SageMaker Studio.
  2. Add AWS SSO as an application Okta users can connect to.
  3. Configure the mutual agreement between AWS SSO and Okta, download IdP metadata in Okta, and configure an external IdP in AWS SSO.
  4. Enable identity synchronization between Okta and AWS SSO.

For instructions, see Single Sign-On between Okta Universal Directory and AWS.

This setup makes sure that when a new account is added to Okta and connected to AWS SSO, a corresponding AWS SSO user is created automatically.

After you complete these steps, you can see the users assigned on the Okta console.

You can also see the users on the AWS SSO console, on the Users page.

Creating Amazon SageMaker Studio with AWS SSO authentication

We now need to create Amazon SageMaker Studio with AWS SSO as the authentication method. Complete the following steps:

  1. On the Amazon SageMaker console, choose Amazon SageMaker Studio.
  2. Select Standard setup.
  3. For Authentication method, select AWS Single Sign-On (SSO).
  4. For Permission, choose the Amazon SageMaker execution role.

If you don’t have this role already, choose Create role. Amazon SageMaker creates a new AWS Identity and Access Management (IAM) role with the AmazonSageMakerFullAccess policy attached.

  1. Optionally, you can specify other settings such as notebook sharing configuration, networking and storage, and tags.

  1. Choose Submit to create Amazon SageMaker Studio.

A few moments after initialization, the Amazon SageMaker Studio Control Panel appears.

  1. Choose Assign users.

The Assign users page contains a list of all the users from AWS SSO (synchronized from your Okta Universal Directory).

  1. Select the users that are authorized to access Amazon SageMaker Studio.
  2. Choose Assign users.

You can now see these users listed on the Amazon SageMaker Studio Control Panel.

On the AWS SSO console, under Applications, you can see the detailed information about the newly created Amazon SageMaker Studio.

In addition, you can view the assigned users.

Amazon SageMaker Studio also automatically creates a user profile with the domain execution role for each SSO user. A user profile represents a single user within a domain, and is the main way to reference a user for the purposes of sharing, reporting, and other user-oriented features such as allowed instance types. You can use the UpdateUserProfile API to associate a different role for a user, allowing fine-grained permission control so the user can pass this associated IAM role when creating a training job, hyperparameter tuning job, or a model. For more information about available Amazon SageMaker SDK API references, see Amazon SageMaker API Reference.
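As a sketch of that fine-grained control, an UpdateUserProfile request might be assembled as follows. The domain ID, profile name, and role ARN are placeholders, not values from this walkthrough.

```python
# Hypothetical sketch: associating a different execution role with an
# SSO user's profile via the UpdateUserProfile API.
def update_user_profile_request(domain_id, user_profile_name, role_arn):
    return {
        "DomainId": domain_id,
        "UserProfileName": user_profile_name,
        # The role the user passes when creating training jobs, tuning
        # jobs, or models from Studio.
        "UserSettings": {"ExecutionRole": role_arn},
    }

req = update_user_profile_request(
    "d-xxxxxxxxxxxx", "data-scientist-jane",
    "arn:aws:iam::123456789012:role/SageMakerRestrictedRole")
# sm = boto3.client("sagemaker"); sm.update_user_profile(**req)
```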

Using Amazon SageMaker Studio via SSO

As a user, you can start in one of three ways:

  1. Start from the Okta user portal page, select AWS SSO application, and choose Amazon SageMaker Studio
  2. Start from the AWS SSO user portal (the URL is on the AWS SSO Settings page), redirect to Okta login page, choose Amazon SageMaker Studio
  3. Bookmark the Amazon SageMaker Studio address (the URL is on the Amazon SageMaker Studio page), the page redirects automatically to Okta login page

For this post, we start in the AWS SSO user portal and are redirected to the Okta login page.

After you log in, you see an application named Amazon SageMaker Studio.

When you choose the application, the Amazon SageMaker Studio welcome page launches.

Now data scientists and ML builders can rely on this web-based IDE and use Amazon SageMaker to quickly and easily build and train ML models, and directly deploy them into a production-ready hosted environment. To learn more about the key features of Amazon SageMaker Studio, see Amazon SageMaker Studio Tour.

Conclusion

In this post, we showed how you can take advantage of the new AWS SSO capabilities to use Okta identities to open Amazon SageMaker Studio. Administrators can now use a single source of truth to manage their users, and users no longer need to manage an additional identity and password to sign in to their AWS accounts and applications.

AWS SSO with Okta is free to use and available in all Regions where AWS SSO is available. Amazon SageMaker Studio is now generally available in US East (Ohio), US East (N. Virginia), US West (Oregon), EU (Ireland) and China (Beijing and Ningxia), with additional Regions coming soon. Please read the product documentation to learn more.


About the Author

Yanwei Cui, PhD, is a Machine Learning Specialist Solution Architect at AWS. He started machine learning research at IRISA (Research Institute of Computer Science and Random Systems), and has several years of experience building artificial intelligence powered industrial applications in computer vision, natural language processing and online user behavior prediction. At AWS, he shares the domain expertise and helps customers to unlock business potentials, and to drive actionable outcomes with machine learning at scale. Outside of work, he enjoys reading and traveling.


Halloween-themed AWS DeepComposer Chartbusters Challenge: Track or Treat


We are back with a spooktacular AWS DeepComposer Chartbusters challenge, Track or Treat! In this challenge, you can interactively collaborate with the ghost in the machine (learning) and compose spooky music! Chartbusters is a global monthly challenge where you can use AWS DeepComposer to create original compositions on the console using machine learning techniques, compete to top the charts, and win prizes. This challenge launches today and participants can submit their compositions until October 23, 2020.

Participation is easy: you can generate spooky compositions using one of the supported generative AI techniques and models on the AWS DeepComposer console. You can add or remove notes and interactively collaborate with AI using the Edit melody feature, and can include spooky instruments in your composition.

How to compete

To participate in Track or Treat, just do the following:

  1. Go to AWS DeepComposer Music Studio and create a melody with the keyboard, import a melody, or choose a sample melody on the console.
  2. Under Generative AI technique, for Model parameters, choose Autoregressive.
  3. For Model, choose Autoregressive CNN Bach.

You have four advanced parameters that you can choose to adjust: Maximum notes to add, Maximum notes to remove, Sampling iterations, and Creative risk.

  1. After adjusting the values to your liking, choose Enhance input melody.

  1. Choose Edit melody to add or remove notes.
  2. You can also change the note duration and pitch.

  1. When finished, choose Apply changes.

  1. Repeat these steps until you’re satisfied with the generated music.
  2. To add accompaniments to your melody, switch the Generative AI technique to Generative Adversarial Networks, then choose Generate composition.
  3. Choose the right arrow next to an accompaniment track (green bar).
  4. For Instrument type, choose Spooky.

  1. When you’re happy with your composition, choose Download composition.

You can choose to post-process your composition; however, one of the judging criteria is how close your final submission is to the track generated using AWS DeepComposer.

  1. In the navigation pane, choose Chartbusters.
  2. Choose Submit a composition.
  3. Select Import a post-processed audio track and upload your composition.
  4. Provide a track name for your composition and choose Submit.

AWS DeepComposer then submits your composition to the Track or Treat playlist on SoundCloud.

Conclusion

You’ve successfully submitted your composition to the AWS DeepComposer Chartbusters challenge Track or Treat. Now you can invite your friends and family to listen to your creation on SoundCloud, vote for their favorite, and join the fun by participating in the competition.

Although you don’t need a physical keyboard to compete, you can buy the AWS DeepComposer keyboard for $99.00 to enhance your music generation experience. To learn more about the different generative AI techniques supported by AWS DeepComposer, check out the learning capsules available on the AWS DeepComposer console.


About the Author

Maryam Rezapoor is a Senior Product Manager with AWS AI Ecosystem team. As a former biomedical researcher and entrepreneur, she finds her passion in working backward from customers’ needs to create new impactful solutions. Outside of work, she enjoys hiking, photography, and gardening.

 

This month in AWS Machine Learning: September 2020 edition

Every day there is something new going on in the world of AWS Machine Learning—from launches to new use cases to interactive trainings. We’re packaging some of the not-to-miss information from the ML Blog and beyond for easy perusing each month. Check back at the end of each month for the latest roundup.

Launches

This month we announced native support for TorchServe in Amazon SageMaker, launched a new NFL Next Gen Stat, and enhanced our language services including Amazon Transcribe and Amazon Comprehend. Read on for our September launches:

  • TorchServe is now natively supported in Amazon SageMaker as the default model server for PyTorch inference to help you bring models to production quickly without having to write custom code.

  • With the new automatic language detection feature for Amazon Transcribe, you can now simply provide audio files and Amazon Transcribe detects the dominant language from the speech signal and generates transcriptions in the identified language. Amazon Transcribe now also expands support for Channel Identification to streaming audio transcription. With Channel Identification, you can process live audio from multiple channels, and produce a single transcript of the conversation with channel labels.
  • You can now use Amazon Comprehend to detect and redact personally identifiable information (PII) in customer emails, support tickets, product reviews, social media, and more.
  • To kick off football season, AWS and the National Football League announced new Next Gen Stats powered by AWS. One of those stats, Expected Rushing Yards, was developed at the NFL’s Big Data Bowl, powered by AWS. Expected Rushing Yards is designed to show how many rushing yards a ball carrier is expected to gain on a given carry based on the relative location, speed, and direction of blockers and defenders.
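The new Amazon Comprehend PII detection mentioned above returns character offsets for each detected entity, which you can use to redact text. The following is a minimal sketch of that redaction logic; the Entities/Type/BeginOffset/EndOffset fields follow the Comprehend API response shape, but the sample text and entity values here are fabricated for illustration.

```python
# Redact PII spans from text given Amazon Comprehend detect_pii_entities-style
# output. The entity dicts below are fabricated sample values for illustration.

def redact_pii(text, entities):
    # Replace each detected span with its entity type, working right to left
    # so earlier offsets stay valid as the string changes length.
    for e in sorted(entities, key=lambda e: e['BeginOffset'], reverse=True):
        text = text[:e['BeginOffset']] + '[' + e['Type'] + ']' + text[e['EndOffset']:]
    return text

sample_text = 'Contact Jane at jane@example.com for a refund.'
sample_entities = [
    {'Type': 'NAME', 'Score': 0.99, 'BeginOffset': 8, 'EndOffset': 12},
    {'Type': 'EMAIL', 'Score': 0.99, 'BeginOffset': 16, 'EndOffset': 32},
]
print(redact_pii(sample_text, sample_entities))
# → Contact [NAME] at [EMAIL] for a refund.
```

In a real application, the entities list would come from a `boto3` Comprehend client call rather than hard-coded values.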

Use cases

Get ideas and architectures from AWS customers, partners, ML Heroes, and AWS experts on how to apply ML to your use case.

Explore more ML stories

Want more news about developments in ML? Check out the following stories:

  • DBS Bank is training 3,000 employees in AI with the help of the AWS DeepRacer League. Watch to see how they are developing ML skills and fostering community for their workforce.
  • Amazon scientist Amaia Salvador shared her views on the challenges facing women in computer vision in a recent Science article.
  • Aella Credit is making banking more accessible using AWS ML.
  • Read about how ML is identifying and tracking pandemics like COVID-19, and check out this VentureBeat Webinar featuring Michelle Lee, VP of the Amazon ML Solutions Lab, on how companies are finding new ways to operate and respond in this pandemic with AI/ML.
  • Build, Train and Deploy a real-world flower classifier of 102 flower types with instruction from this tutorial by AWS ML Community Builder Juv Chan.
  • Earlier this month, we announced that Lena Taupie, a software developer for Blubrr, won the AWS DeepComposer Chartbusters challenge. Learn about Lena’s experience competing in the challenge and how she drew from her own background, having grown up in St. Lucia, an island with a rich history of oral and folk music traditions, to develop a new custom genre model using generative AI.

Mark your calendars

Join us for the following exciting ML events:

  • SageMaker Fridays is back with a bigger season 2 to help you get started faster with machine learning using Amazon SageMaker. We’ll discuss practical use cases and applications throughout the season. Join us for the season premiere on October 9; we’ll then meet every week with a new use case.
  • On October 18, join Quantiphi for a conversation on how you can add AI to your contact centers to drive better customer engagement. Register now!
  • AWS is sponsoring the WSJ Pro AI Forum on October 28. Register to learn strategies for bringing AI and ML to your organization.

About the Author

Laura Jones is a product marketing lead for AWS AI/ML where she focuses on sharing the stories of AWS’s customers and educating organizations on the impact of machine learning. As a Florida native living and surviving in rainy Seattle, she enjoys coffee, attempting to ski and enjoying the great outdoors.

Using Amazon Rekognition Custom Labels and Amazon A2I for detecting pizza slices and augmenting predictions

Customers need machine learning (ML) models to detect objects that are interesting for their business. In most cases, doing so is hard because these models need thousands of labeled images and deep learning expertise. Gathering this data can take months and can require large teams of labelers to prepare it for use. In addition, setting up a workflow for auditing or reviewing model predictions to validate adherence to your requirements can further add to the overall complexity.

With Amazon Rekognition Custom Labels, which is built on the existing capabilities of Amazon Rekognition, you can identify the objects and scenes in images that are specific to your business needs. For example, you can find your logo in social media posts, identify your products on store shelves, classify machine parts in an assembly line, distinguish healthy and infected plants, or detect animated characters in videos.

But what if the custom label model you trained can’t recognize images with a high level of confidence, or you need your team of experts to validate the results from your model during the testing phase or review the results in production? You can easily send predictions from Amazon Rekognition Custom Labels to Amazon Augmented AI (Amazon A2I). Amazon A2I makes it easy to integrate a human review into your ML workflow. This allows you to automatically have humans step into your ML pipeline to review results below a confidence threshold, set up review and auditing workflows, and augment the prediction results to improve model accuracy.

In this post, we show you how to build a custom object detection model trained to detect pepperoni slices in a pizza using Amazon Rekognition Custom Labels with a dataset labeled using Amazon SageMaker Ground Truth. We then show how to create your own private workforce and set up an Amazon A2I workflow definition to conditionally trigger human loops for review and augmenting tasks. You can use the annotations created by Amazon A2I for model retraining.

We walk you through the following steps using the accompanying Amazon SageMaker Jupyter Notebook:

  1. Complete prerequisites.
  2. Create an Amazon Rekognition custom model.
  3. Set up an Amazon A2I workflow and send predictions to an Amazon A2I human loop.
  4. Evaluate results and retrain the model.

Prerequisites

Before getting started, set up your Amazon SageMaker notebook. Follow the steps in the accompanying Amazon SageMaker Jupyter Notebook to create your human workforce and download the datasets.

  1. Create a notebook instance in Amazon SageMaker.

Make sure your Amazon SageMaker notebook has the necessary AWS Identity and Access Management (IAM) roles and permissions mentioned in the prerequisite section of the notebook.

  2. When the notebook is active, choose Open Jupyter.
  3. On the Jupyter dashboard, choose New, and choose Terminal.
  4. In the terminal, enter the following code:
cd SageMaker
git clone https://github.com/aws-samples/amazon-a2i-sample-jupyter-notebooks
  5. Open the notebook by choosing Amazon-A2I-Rekognition-Custom-Label.ipynb in the root folder.

For this post, you create a private work team and add one user (you) to it.

  6. Create your human workforce by following the appropriate section of the notebook.

Alternatively, you can create your private workforce using Amazon Cognito. For instructions, see Create an Amazon Cognito Workforce Using the Labeling Workforces Page.

  7. After you create the private workforce, find the workforce ARN and enter the ARN in step 4 of the notebook.

The dataset is composed of 12 training images that contain pepperoni pizza, in the data/images folder, and 3 test images in the data/testimages folder. We sourced our images from pexels.com and labeled the dataset for this post using Ground Truth custom labeling workflows. A Ground Truth labeled manifest template is available in data/manifest.

  8. Download the dataset.
  9. Run the notebook steps Download the Amazon SageMaker GroundTruth object detection manifest to Amazon S3 to process and upload the manifest in your Amazon Simple Storage Service (Amazon S3) bucket.

Creating an Amazon Rekognition custom model

To create our custom model, we follow these steps:

  1. Create a project in Amazon Rekognition Custom Labels.
  2. Create a dataset with images containing one or more pizzas and label them by applying bounding boxes.

Because we already have a dataset labeled using Ground Truth, we just point to that labeled dataset in this step. Alternatively, you can label the images using the user interface provided by Amazon Rekognition Custom Labels.

  3. Train the model.
  4. Evaluate the model’s performance.
  5. Test the deployed model using a sample image.

Creating a project

In this step, we create a project to detect pepperoni pizza slices. For instructions on creating a project, see Creating an Amazon Rekognition Custom Labels Project.

  1. On the Amazon Rekognition console, choose Custom Labels.
  2. Choose Get started.
  3. For Project name, enter a2i-rekog-pepperoni-pizza.
  4. Choose Create project.

Creating a dataset

To create your dataset, complete the following steps:

  1. On the Amazon Rekognition Custom Labels console, choose Datasets.
  2. Choose Create dataset.
  3. For Dataset name, enter rekog-a2i-pizza-dataset.
  4. Select Import images labeled by Amazon SageMaker Ground Truth.
  5. For .manifest file location, enter the S3 bucket location of your .manifest file.

  6. Choose Submit.

When you provide the S3 path, you should get a prompt to apply the required S3 bucket policy. For more information about these steps, see SageMaker Ground Truth.

After you create the dataset, you should see the images with the bounding boxes and labels, as in the following screenshot.

Make sure to upload the images for our dataset (that you downloaded in the Prerequisites section) to the console S3 bucket for Amazon Rekognition Custom Labels.

Training an Amazon Rekognition custom model

You’re now ready to train your model.

  1. On the Amazon Rekognition console, choose Train model.
  2. For Choose project, choose your newly created project.
  3. For Choose training dataset, choose your newly created dataset.
  4. Select Split training dataset.
  5. Choose Train.

The training takes approximately 45 minutes to complete.

Checking training status

Run the following notebook cell to get information about the project you created using the describe-projects API:

!aws rekognition describe-projects

Use the accompanying Amazon SageMaker Jupyter notebook to follow the steps in this post.

To get the project version ARN using the describe-project-versions API, run the following cell:

# Replace the project-arn below with the project-arn of your project from the describe-projects output above
!aws rekognition describe-project-versions --project-arn "<project-arn>"

Enter the ARN in place of <project-version-arn-of-your-model> in the following code and run the cell to start the model version using the start-project-version API:

# Copy/paste the ProjectVersionArn for your model from the describe-project-versions cell output above to the --project-version-arn parameter here
!aws rekognition start-project-version \
    --project-version-arn <project-version-arn-of-your-model> \
    --min-inference-units 1 \
    --region us-east-1

Evaluating performance

When training is complete, you can evaluate the performance of the model. To help you, Amazon Rekognition Custom Labels provides summary metrics and evaluation metrics for each label. For information about the available metrics, see Metrics for Evaluating Your Model. To improve your model using metrics, see Improving an Amazon Rekognition Custom Labels Model.

The following screenshot shows our evaluation results and label performance.

Custom Labels determines the assumed threshold for each label based on maximum precision and recall achieved. Your model defaults to returning predictions above that threshold. You can reset this when you start your model. For information about adjusting your assumed threshold (such as looking at predictions either below or above your assumed threshold) and optimizing the model for your use case, see Training a custom single class object detection model with Amazon Rekognition Custom Labels.
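To experiment with thresholds other than the assumed one, you can request predictions at a low MinConfidence and filter them client-side. The following is a minimal sketch of that filtering; the CustomLabels list shape follows the DetectCustomLabels response used later in this post, but the sample label values are fabricated for illustration.

```python
# Filter DetectCustomLabels results by a confidence threshold chosen for your
# use case. The sample labels below are fabricated values for illustration.

def above_threshold(custom_labels, threshold):
    return [label for label in custom_labels if label['Confidence'] >= threshold]

sample_labels = [
    {'Name': 'pepperoni pizza slice', 'Confidence': 81.2},
    {'Name': 'pepperoni pizza slice', 'Confidence': 42.7},
]
print(above_threshold(sample_labels, 60))  # keeps only the 81.2% detection
```

Comparing precision and recall at a few candidate thresholds on a held-out test set is a simple way to pick the value that best fits your use case.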

For a demonstration of the Amazon Rekognition Custom Labels solution, see Custom Labels Demonstration. The demo shows you how you can analyze an image from your local computer.

Testing the deployed model

Run the following notebook cell to load the test image. Use the accompanying Amazon SageMaker Jupyter notebook to follow the steps in this post.

from PIL import Image, ImageDraw, ExifTags, ImageColor, ImageFont
image=Image.open('./data/images/pexels_brett_jordan_825661__1_.jpg')
display(image)

Enter the project version model ARN from the previous aws rekognition start-project-version step and run the following cell to analyze the response from the detect_custom_labels API:

model_arn = '<project-version-arn-of-your-model>'
min_confidence = 40

# Call DetectCustomLabels on the test photo in Amazon S3
response = client.detect_custom_labels(
    Image={'S3Object': {'Bucket': BUCKET, 'Name': PREFIX + test_photo}},
    MinConfidence=min_confidence,
    ProjectVersionArn=model_arn)

Run the next cell in the notebook to display results (see the following screenshot).

The model detected the pepperoni pizza slices in our test image and drew bounding boxes. We can use Amazon A2I to send the prediction results from our model to a human loop consisting of our private workforce to review and audit the results.
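DetectCustomLabels returns each bounding box as ratios of the image dimensions (Left, Top, Width, Height), so drawing boxes like those in the screenshot requires converting to pixel coordinates first. The following is a minimal sketch of that conversion; the BoundingBox field names follow the Rekognition API, and the sample values are made up.

```python
# Convert a Rekognition ratio bounding box to pixel coordinates
# (left, top, right, bottom) for drawing on the source image.

def box_to_pixels(bbox, img_width, img_height):
    left = int(bbox['Left'] * img_width)
    top = int(bbox['Top'] * img_height)
    right = int((bbox['Left'] + bbox['Width']) * img_width)
    bottom = int((bbox['Top'] + bbox['Height']) * img_height)
    return left, top, right, bottom

# Hypothetical box from a detection on a 200x100 image
print(box_to_pixels({'Left': 0.1, 'Top': 0.2, 'Width': 0.5, 'Height': 0.25}, 200, 100))
# → (20, 20, 120, 45)
```

You can pass the resulting coordinates to Pillow’s ImageDraw rectangle method, as the notebook does when rendering results.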

Setting up an Amazon A2I human loop

In this section, you set up a human review loop for low-confidence detection in Amazon A2I. It includes the following steps:

  1. Create a human task UI.
  2. Create a worker task template and workflow definition.
  3. Send predictions to Amazon A2I human loops.
  4. Check the human loop status.

Use the accompanying Amazon SageMaker Jupyter notebook to follow the steps in this post.

Creating a human task UI

Create a human task UI resource with a UI template in liquid HTML. This template is used whenever a human loop is required.

For this post, we take an object detection UI and fill in the object categories in the labels variable in the template. Run the following code to create the human task UI for object detection:

def create_task_ui():
    '''
    Creates a Human Task UI resource.

    Returns:
    struct: HumanTaskUiArn
    '''
    response = sagemaker_client.create_human_task_ui(
        HumanTaskUiName=taskUIName,
        UiTemplate={'Content': template})
    return response
# Create task UI
humanTaskUiResponse = create_task_ui()
humanTaskUiArn = humanTaskUiResponse['HumanTaskUiArn']
print(humanTaskUiArn)

For over 70 pre-built UIs, see the Amazon Augmented AI Sample Task UIs GitHub repo.
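The template variable passed to create_human_task_ui in the preceding code isn’t shown in this post. The following is a minimal sketch of what a bounding-box template could look like, based on the crowd-bounding-box element from the sample task UIs repo; the header text and instructions are our own placeholders, and the element name annotatedResult matches the output field used later in this post.

```python
# Minimal Liquid HTML template for a bounding-box review task, assigned to the
# `template` variable that create_human_task_ui expects. Header text and
# instructions below are placeholders for illustration.
template = r"""
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
  <crowd-bounding-box
    name="annotatedResult"
    src="{{ task.input.taskObject | grant_read_access }}"
    header="Draw a box around each pepperoni pizza slice"
    labels="['pepperoni pizza slice']"
  >
    <full-instructions header="Bounding box instructions">
      Draw a tight box around every pepperoni pizza slice you see.
    </full-instructions>
    <short-instructions>Draw boxes around pepperoni pizza slices.</short-instructions>
  </crowd-bounding-box>
</crowd-form>
"""
print('crowd-bounding-box' in template)
```

The grant_read_access filter gives workers temporary access to the S3 object referenced by taskObject.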

Creating a worker task template and workflow definition

Workflow definitions allow you to specify the following:

  • The workforce that your tasks are sent to
  • The instructions that your workforce receives

This post uses the following API to create a workflow definition:

create_workflow_definition_response = sagemaker.create_flow_definition(
        FlowDefinitionName= flowDefinitionName,
        RoleArn= ROLE,
        HumanLoopConfig= {
            "WorkteamArn": WORKTEAM_ARN,
            "HumanTaskUiArn": humanTaskUiArn,
            "TaskCount": 1,
            "TaskDescription": "Identify custom labels in the image",
            "TaskTitle": "Identify custom image"
        },
        OutputConfig={
            "S3OutputPath" : OUTPUT_PATH
        }
    )
flowDefinitionArn = create_workflow_definition_response['FlowDefinitionArn'] # let's save this ARN for future use

Optionally, you can create this workflow definition on the Amazon A2I console. For instructions, see Create a Human Review Workflow.

Sending predictions to Amazon A2I human loops

We now loop through our test images and invoke the trained model endpoint to detect the custom label pepperoni pizza slice using the detect_custom_labels API. If the confidence score is less than 60% for the detected labels in our test dataset, we send that data for human review and start the Amazon A2I human review loop with the start-human-loop API. When using Amazon A2I for a custom task, a human loop starts when StartHumanLoop is called in your application. Use the accompanying Amazon SageMaker Jupyter notebook to follow the steps in this post.

You can change the value of the SCORE_THRESHOLD based on what confidence level you want to trigger the human review. See the following code:

import uuid

human_loops_started = []
SCORE_THRESHOLD = 60

folderpath = r"data/testimages" # make sure to put the 'r' in front and provide the folder where your files are
filepaths  = [os.path.join(folderpath, name) for name in os.listdir(folderpath) if not name.startswith('.')] # do not select hidden directories
for path in filepaths:
    # Call the custom labels endpoint; MinConfidence=20 drops detections below 20% confidence
    response = client.detect_custom_labels(Image={'S3Object': {'Bucket': BUCKET, 'Name': PREFIX+'/'+path}},
        MinConfidence=20,
        ProjectVersionArn=model_arn)    
  
    #Get the custom labels
    labels=response['CustomLabels']
    if labels and labels[0]['Confidence'] < SCORE_THRESHOLD: 
        s3_fname='s3://%s/%s' % (BUCKET, PREFIX+'/'+path)
        print("Images with labels less than 60% confidence score: " + s3_fname)
        humanLoopName = str(uuid.uuid4())
        inputContent = {
            "initialValue": 'null',
            "taskObject": s3_fname
        }
        start_loop_response = a2i.start_human_loop(
            HumanLoopName=humanLoopName,
            FlowDefinitionArn=flowDefinitionArn,
            HumanLoopInput={
                "InputContent": json.dumps(inputContent)
            }
        )
        human_loops_started.append(humanLoopName)
        print(f'Starting human loop with name: {humanLoopName} \n')

The following screenshot shows the output.

Checking the status of the human loop

Run the steps in the notebook to check the status of the human loop. You can use the accompanying Amazon SageMaker Jupyter notebook to follow the steps in this post.

  1. Run the following notebook cell to get a login link to navigate to the private workforce portal:
workteamName = WORKTEAM_ARN[WORKTEAM_ARN.rfind('/') + 1:]
print("Navigate to the private worker portal and do the tasks. Make sure you've invited yourself to your workteam!")
print('https://' + sagemaker_client.describe_workteam(WorkteamName=workteamName)['Workteam']['SubDomain'])
  2. Choose the login link to the private worker portal.

After you log in, you can start working on the task assigned.

  3. Draw bounding boxes with respect to the label as required and choose Submit.

  4. To check if the workers have completed the labeling task, enter the following code:
completed_human_loops = []
for human_loop_name in human_loops_started:
    resp = a2i.describe_human_loop(HumanLoopName=human_loop_name)
    print(f'HumanLoop Name: {human_loop_name}')
    print(f'HumanLoop Status: {resp["HumanLoopStatus"]}')
    print(f'HumanLoop Output Destination: {resp["HumanLoopOutput"]}')
    print('\n')
    
    if resp["HumanLoopStatus"] == "Completed":
        completed_human_loops.append(resp)

The following screenshot shows the output.

Evaluating the results and retraining your model

When the labeling work is complete, your results should be available in the Amazon S3 output path specified in the human review workflow definition. The human answer, label, and bounding box are returned and saved in the JSON file. Run the notebook cell to get the results from Amazon S3. The following screenshot shows the Amazon A2I annotation output JSON file.
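The fields used in the conversion steps that follow can be pulled out of that JSON file directly. The sketch below parses a pared-down sample of an Amazon A2I output record; the field names match those referenced in this post, but the sample values are fabricated for illustration.

```python
import json

# Parse a pared-down Amazon A2I output record and pull out the reviewed
# bounding boxes. Field names follow the A2I output referenced in this post;
# the values below are fabricated sample data.
sample_output = json.loads('''{
  "humanLoopName": "abc-123",
  "inputContent": {"taskObject": "s3://my-bucket/data/testimages/pizza1.jpg"},
  "humanAnswers": [{
    "submissionTime": "2020-10-01T00:00:00Z",
    "answerContent": {"annotatedResult": {
      "inputImageProperties": {"width": 640, "height": 480},
      "boundingBoxes": [
        {"label": "pepperoni pizza slice", "left": 10, "top": 20, "width": 100, "height": 80}
      ]}}}]
}''')

boxes = sample_output['humanAnswers'][0]['answerContent']['annotatedResult']['boundingBoxes']
for box in boxes:
    print(box['label'], box['left'], box['top'], box['width'], box['height'])
# → pepperoni pizza slice 10 20 100 80
```

In practice you would read each record from the S3 output path of the human review workflow rather than from an inline string.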

Because we created our training set using Ground Truth, it’s in the form of an output.manifest file in the data/manifest folder. We need to do the following:

  1. Convert the Amazon A2I labeled JSON output to a .manifest file for retraining.
  2. Merge the output.manifest file with the existing training dataset .manifest file in data/manifest.
  3. Train the new model using the augmented file.

Converting the JSON output to an augmented .manifest file

You can loop through all the Amazon A2I output, convert the JSON file, and concatenate them into a JSON Lines file, in which each line represents the results of one image. Run the following code in the Amazon SageMaker Jupyter notebook to convert the Amazon A2I results into an augmented manifest:

object_categories = ['pepperoni pizza slice'] # if you have more labels, add them here
object_categories_dict = {str(i): j for i, j in enumerate(object_categories)}

dsname = 'pepperoni_pizza'

def convert_a2i_to_augmented_manifest(a2i_output):
    annotations = []
    confidence = []
    for i, bbox in enumerate(a2i_output['humanAnswers'][0]['answerContent']['annotatedResult']['boundingBoxes']):
        object_class_key = [key for (key, value) in object_categories_dict.items() if value == bbox['label']][0]
        obj = {'class_id': int(object_class_key), 
               'width': bbox['width'],
               'top': bbox['top'],
               'height': bbox['height'],
               'left': bbox['left']}
        annotations.append(obj)
        confidence.append({'confidence': 1})

    # Change the attribute name to the dataset-name_BB for this dataset. This will later be used in setting the training data
    augmented_manifest={'source-ref': a2i_output['inputContent']['taskObject'],
                        dsname+'_BB': {'annotations': annotations,
                                           'image_size': [{'width': a2i_output['humanAnswers'][0]['answerContent']['annotatedResult']['inputImageProperties']['width'],
                                                           'depth':3,
                                                           'height': a2i_output['humanAnswers'][0]['answerContent']['annotatedResult']['inputImageProperties']['height']}]},
                        dsname+'_BB-metadata': {'job-name': 'a2i/%s' % a2i_output['humanLoopName'],
                                                    'class-map': object_categories_dict,
                                                    'human-annotated':'yes',
                                                    'objects': confidence,
                                                    'creation-date': a2i_output['humanAnswers'][0]['submissionTime'],
                                                    'type':'groundtruth/object-detection'}}
    return augmented_manifest

Merging the augmented manifest with the existing training manifest

You now need to merge the output.manifest file, which consists of the existing training dataset manifest in data/manifest, with the Amazon A2I augmented .manifest file you generated from the JSON output.

Run the following notebook cell in the Amazon SageMaker Jupyter notebook to generate this file for training a new model:

f4 = open('./augmented-temp.manifest', 'r')
with open('augmented.manifest', 'w') as outfl:
    for lin1 in f4:
        z_json = json.loads(lin1)
        done_json = json.loads(lin1)
        done_json['source-ref'] = 'a'
        f3 = open('./data/manifest/output.manifest', 'r')
        for lin2 in f3:
            x_json = json.loads(lin2)
            if z_json['source-ref'] == x_json['source-ref']:
                print("replacing the annotations")
                x_json[dsname+'_BB'] = z_json[dsname+'_BB']
                x_json[dsname+'_BB-metadata'] = z_json[dsname+'_BB-metadata']
            elif done_json['source-ref'] != z_json['source-ref']:
                print("This is a net new annotation to augmented file")
                json.dump(z_json, outfl)
                outfl.write('\n')
                print(str(z_json))
                done_json = z_json
            json.dump(x_json, outfl)
            outfl.write('\n')
        f3.close()       
f4.close()    

Training a new model

To train a new model with the augmented manifest, enter the following code:

# upload the manifest file to S3
s3r.meta.client.upload_file('./augmented.manifest', BUCKET, PREFIX+'/'+'data/manifest/augmented.manifest')

We uploaded the augmented .manifest file from Amazon A2I to the S3 bucket. To train a new model, create a new dataset with this augmented manifest by following the steps in the Creating a dataset section of this post. The following screenshot shows some of the results from the dataset.

Follow the instructions provided in the Creating an Amazon Rekognition custom model section of this post. After you train a new model using this augmented dataset, you can inspect the model metrics for accuracy improvement.

The following screenshot shows the metrics for the trained model using the Amazon A2I labeled dataset.

We noticed an overall improvement in the metrics after retraining using the augmented dataset. Moreover, we used only 12 images to achieve these performance metrics.

Cleaning up

To avoid incurring unnecessary charges, delete the resources used in this walkthrough when not in use, such as the Amazon SageMaker notebook instance and the model version you started with start-project-version. For instructions, see the documentation for each service.

Conclusion

This post demonstrated how you can use Amazon Rekognition Custom Labels and Amazon A2I to train models to detect objects and images unique to your business and define conditions to send the predictions to a human workflow with labelers to review and update the results. You can use the human labeled output to augment the training dataset for retraining, which improves model accuracy, or you can send it to downstream applications for analytics and insights.


About the Authors

Mona Mona is an AI/ML Specialist Solutions Architect based out of Arlington, VA. She works with the World Wide Public Sector team and helps customers adopt machine learning on a large scale. She is passionate about NLP and ML Explainability areas in AI/ML.

 

 

 

Prem Ranga is an Enterprise Solutions Architect based out of Houston, Texas. He is part of the Machine Learning Technical Field Community and loves working with customers on their ML and AI journey. Prem is passionate about robotics, is an autonomous vehicles researcher, and also built the Alexa-controlled Beer Pours in Houston and other locations.

 

 

 

Neel Sendas is a Senior Technical Account Manager at Amazon Web Services. Neel works with enterprise customers to design, deploy, and scale cloud applications to achieve their business goals. He has worked on various ML use cases, ranging from anomaly detection to predictive product quality for manufacturing and logistics optimization. When he is not helping customers, he dabbles in golf and salsa dancing.

Building custom language models to supercharge speech-to-text performance for Amazon Transcribe

Amazon Transcribe is a fully managed automatic speech recognition (ASR) service that makes it easy to add speech-to-text capabilities to voice-enabled applications. As our service grows, so does the diversity of our customer base, which now spans domains such as insurance, finance, law, real estate, media, hospitality, and more. Naturally, customers in different market segments have asked Amazon Transcribe for more customization options to further enhance transcription performance.

We’re excited to introduce Custom Language Models (CLM). The new feature allows you to submit a corpus of text data to train custom language models that target domain-specific use cases. Using CLM is easy because it capitalizes on existing data that you already possess (such as marketing assets, website content, and training manuals).

In this post, we show you how to best use your available data to train a custom language model tailored for your speech-to-text use case. Although our walkthrough uses a transcription example from the video gaming industry, you can use CLM to enhance custom speech recognition for any domain of your choosing. This post assumes that you’re already familiar with how to use Amazon Transcribe, and focuses on demonstrating how to use the new CLM feature. Additional resources for using the service are available at the end.

Establishing context for evaluating CLM transcription performance

To evaluate how much CLM can enhance transcription accuracy, there are a few steps we want to take. First, we need to establish a baseline. To do that, we recorded an audio sample that contains speech content and lingo commonly found in video game chats. We also have a human-generated transcription that serves as the ground truth. This ground truth transcript is the reference against which we compare the general transcription output of Amazon Transcribe and the output of CLM. The following is a partial snippet of this reference transcript.

The 2020 holiday season is right around the corner. And with the way that the year’s been going, we can all hope for a little excitement around the next-gen video game consoles coming out soon. So, what’s the difference in hardware specs between the upcoming Playstation 5 and Xbox Series X? Well, let’s take a look under the hoods of each next-gen gaming console. The PS5 features an AMD Zen 2 CPU with up to 3.5 GHz frequency. It sports an AMD Radeon GPU that touts 10.3 teraflops, running up to 2.23 GHz. Memory and storage respectively dial in at 16 GB and 825 GB. The PS5 supports both PS The PS5 supports both 4K and 8K resolution screens. The Xbox Series X also features an AMD Zen 2 CPU, but clocks in at 3.8 GHz instead. The console boasts a similar AMD custom GPU with 12 teraflops and 1.825 GHz.

Memory is the same as that of the PS5’s, coming in at 16 GB. But the default storage is where the system has an edge, bringing out a massive 1 TB hard drive. Like the PS5, the Series X also supports 4K and 8K resolution screens as well. Those of course, are just the numbers. Therefore, it remains to be seen exactly how the performance plays out in practice. It’s worth noting that both systems have incorporated ray-tracing technology, something that’s used to make light and shadows look better in-game. Both systems also offer 3D audio output for immersive experiences.

Next, we run the sample audio through Amazon Transcribe using its generic speech engine and compare the text output to the ground truth transcript. We’ve produced a partial snippet of the side-by-side comparison and highlighted errors for visibility. Compared to the ground truth transcript, the default Amazon Transcribe transcript showed a Word Error Rate (WER) of 31.87%. This WER shouldn’t be interpreted as a full representation of Amazon Transcribe service performance; it is just one instance for a very specific example audio. Note that accuracy is 100 minus WER, so the lower the WER, the higher the accuracy. For more information about calculating WER, see Word Error Rate on Wikipedia.

*Note that this WER is not a representation of the Amazon Transcribe service performance. It is just one instance for a very specific and limited test example. All figures are for single-case and limited scope illustration purposes.
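To make the comparison concrete, WER is the word-level edit distance (substitutions + insertions + deletions) between hypothesis and reference, divided by the reference length. The following is a minimal sketch of that calculation; it assumes simple whitespace tokenization, whereas real evaluations also normalize casing and punctuation first:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# One deletion against a six-word reference: WER = 1/6
print(f"{wer('the cat sat on the mat', 'the cat sat on mat'):.3f}")  # → 0.167
```

Accuracy, as used in this post, is then simply 100 minus the WER expressed as a percentage.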

The following text is the human-generated ground-truth reference transcript:

The 2020 holiday season is right around the corner. And with the way that the year’s been going, we can all hope for a little excitement around the next-gen video game consoles coming out soon. So, what’s the difference in hardware specs between the upcoming Playstation 5 and Xbox Series X? Well, let’s take a look under the hoods of each next-gen gaming console. The PS5 features an AMD Zen 2 CPU with up to 3.5 GHz frequency. It sports an AMD Radeon GPU that touts 10.3 teraflops, running up to 2.23 GHz. Memory and storage respectively dial in at 16 GB and 825 GB. The PS5 supports both PS The PS5 supports both 4K and 8K resolution screens. Meanwhile, the Xbox Series X also features an AMD Zen 2 CPU, but clocks in at 3.8 GHz instead. The console boasts a similar AMD custom GPU with 12 teraflops and 1.825 GHz. Memory is the same as that of the PS5’s, coming in at 16 GB. But the default storage is where the system has an edge, bringing out a massive 1 TB hard drive. Like the PS5, the Series X also supports 4K and 8K resolution screens as well. Those of course, are just the numbers. Therefore, it remains to be seen exactly how the performance plays out in practice. It’s worth noting that both systems have incorporated ray-tracing technology, something that’s used to make light and shadows look better in-game. Both systems also offer 3D audio output for immersive experiences.

The following text is the machine-generated transcript by the Amazon Transcribe generic speech engine:

the 2020 holiday season is right around the corner. And with the way that the years been going, we can all hope for a little excitement around the next Gen video game consoles coming out soon. So what’s the difference in heart respects between the upcoming PlayStation five and Xbox Series X? Well, let’s take a look under the hood of each of these Consul’s. The PS five features in a M descend to CPU with up to 3.5 gigahertz frequency is sports and AM the radio on GPU that tells 10.3 teraflops running up to 2.23 gigahertz memory and storage, respectively. Dahlin at 16 gigabytes in 825 gigabytes. The PS five supports both PS. The PS five supports both four K and A K resolutions. Meanwhile, the Xbox Series X also features an AM descend to CPU, but clocks in at three point take. It hurts instead, the cost. Almost a similar AMG custom GPU, with 12 teraflops and 1.825 gigahertz memory, is the same as out of the PS. Fives come in at 16 gigabytes, but the default storage is where the system has an edge. Bring out a massive one terabyte hard drive. Like the PS five. The Siri’s X also supports four K and eight K resolution screens as well. Those, of course, are just the numbers. Therefore, it remains to be seen exactly how the performance plays out. In practice. It’s worth noting that both systems have incorporated Ray tracing technology, something that’s used to make light and shadows look better in game. Both systems also offer three D audio output for immersive experiences.

Although the Amazon Transcribe generic speech model has done a decent job of transcribing the audio, it’s more likely that a tailored custom model can yield higher transcription accuracy.

Solution overview

In this post, we walk you through how to train a custom language model and evaluate how its transcription output compares to the reference transcript. You complete the following high-level steps:

  1. Prepare your training data.
  2. Train your CLM.
  3. Use your CLM to transcribe audio.
  4. Evaluate the results by comparing the CLM transcript against the generic model transcript’s accuracy.

Preparing your training data

Before we begin, it’s important to distinguish between training data and tuning data.

Training data for CLM typically includes text data that is specific to your domain. Some examples of training data could include relevant text content from your website, training manuals, sales and marketing collateral, or other text sources.

Meanwhile, human-annotated audio transcripts of actual phone calls or media content that are directly relevant to your use case can be used as tuning data. Ideally, both training and tuning data ought to be domain-specific, but in practice you may only have a small amount of audio transcriptions available. In that case, transcriptions should be used as tuning data. If more transcription data is available, it can and should be used as part of the training set as well. For more information about the difference between training and tuning data, see Improving Domain-Specific Transcription Accuracy with Custom Language Models.

Like many domains, video gaming has its own set of technical jargon, syntax, and speech dynamics that can make a general speech recognition engine suboptimal. To build a custom model, we first need data that’s representative of the domain. For this use case, we want free-form text from the video gaming industry. We use publicly available information from a variety of sources about video gaming. For convenience, we’ve compiled that training and tuning set, which you can download to follow along. Keep in mind that the nature, quality, and quantity of your training data has a dramatic impact on the resulting custom model. All else equal, it’s better to have more data than less.

As a general guideline, each of your training and tuning data files should meet the following criteria:

  • Is in plain text (it’s not a file such as a Microsoft Word document, CSV file, or PDF).
  • Has a single sentence per line.
  • Is encoded in UTF-8.
  • Doesn’t contain any formatting characters, such as HTML tags.
  • Is less than 2 GB in size if you intend to use the file as training data. You can provide a maximum of 2 GB of training data.
  • Is less than 200 MB in size if you intend to use the file as tuning data. You can provide a maximum of 200 MB of optional tuning data.
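As a sanity check before uploading, most of these guidelines can be verified programmatically. The following is a rough sketch (the function name and the limits expressed as constants are our own; the single-sentence-per-line rule still needs manual review):

```python
import re

MAX_TRAINING_BYTES = 2 * 1024**3   # training data limit: 2 GB
MAX_TUNING_BYTES = 200 * 1024**2   # tuning data limit: 200 MB

def check_clm_data(raw: bytes, tuning: bool = False) -> list:
    """Return a list of guideline violations found in a candidate data file."""
    problems = []
    limit = MAX_TUNING_BYTES if tuning else MAX_TRAINING_BYTES
    if len(raw) > limit:
        problems.append(f"file exceeds the {limit}-byte size limit")
    try:
        text = raw.decode("utf-8")  # must be UTF-8 encoded plain text
    except UnicodeDecodeError:
        return problems + ["file is not valid UTF-8 plain text"]
    for n, line in enumerate(text.splitlines(), start=1):
        if re.search(r"<[^>]+>", line):  # formatting characters such as HTML tags
            problems.append(f"line {n} contains HTML-like markup")
    return problems

print(check_clm_data("The PS5 supports 4K and 8K screens.\n".encode()))  # → []
```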

The following text is a partial snippet of our example training set:

The PS5 will feature a custom eight-core AMD Zen 2 CPU clocked at 3.5GHz (variable frequency) and a custom GPU based on AMD’s RDNA 2 architecture hardware that promises 10.28 teraflops and 36 compute units clocked at 2.23GHz (also variable frequency).

It’ll also have 16GB of GDDR6 RAM and a custom 825GB SSD that Sony has previously promised will offer super-fast loading times in gameplay, via Eurogamer.

One of the biggest technical updates in the PS5 was already announced last year: a switch to SSD storage for the console’s main hard drive, which Sony says will result in dramatically faster load times.

A previous demo showed Spider-Man loading levels in less than a second on the PS5, compared to the roughly eight seconds it took on a PS4. PlayStation hardware lead Mark Cerny dove into some of the details about those SSD goals at the announcement.

Where it took a PS4 around 20 seconds to load a single gigabyte of data, the goal with the PS5’s SSD was to enable loading five gigabytes of data in a single second.

Training your custom language model

In the following steps, we show you how to train a CLM using base training data and a tuning split. Using a tuning split is entirely optional. If you prefer not to train a CLM using a tuning split, you can skip step 5.

  1. Upload your training data (and/or tuning data) to their respective Amazon Simple Storage Service (Amazon S3) buckets.
  2. On the Amazon Transcribe console, for Name, enter a name for your custom model so you can reference it for later use.
  3. For Base model, choose a model type that matches your use case, based on audio sample rate.
  4. For Training data, enter the appropriate S3 bucket where you previously uploaded your training data.
  5. For Tuning data, enter the appropriate S3 bucket where you uploaded your tuning data. (This step is optional.)
  6. For Access permissions, designate the appropriate access permissions.
  7. Choose Train model.

The Amazon Transcribe service does the heavy lifting of automatically training the custom model for you.

To track the progress of the model training, go to the Custom language models page on the Amazon Transcribe console. The status indicates if the training is in progress, complete, or failed.
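The same console steps can also be scripted with the AWS SDK for Python (boto3) via the CreateLanguageModel API. The following sketch uses hypothetical bucket names, model name, and IAM role ARN; it only builds the request and leaves the actual call commented out:

```python
def build_clm_request(model_name, training_s3_uri, role_arn,
                      tuning_s3_uri=None, base_model="WideBand"):
    """Assemble keyword arguments for transcribe.create_language_model()."""
    input_config = {"S3Uri": training_s3_uri, "DataAccessRoleArn": role_arn}
    if tuning_s3_uri:  # tuning data is optional
        input_config["TuningDataS3Uri"] = tuning_s3_uri
    return {
        "LanguageCode": "en-US",       # CLM supports US English at this time
        "BaseModelName": base_model,   # "WideBand" for audio sampled at 16 kHz or higher
        "ModelName": model_name,
        "InputDataConfig": input_config,
    }

def create_clm(request):
    """Submit the request to Amazon Transcribe (requires AWS credentials)."""
    import boto3
    transcribe = boto3.client("transcribe")
    transcribe.create_language_model(**request)
    # Track progress with transcribe.describe_language_model(ModelName=...)

request = build_clm_request(
    model_name="video-gaming-clm",                                   # hypothetical
    training_s3_uri="s3://my-clm-bucket/training/",                  # hypothetical
    tuning_s3_uri="s3://my-clm-bucket/tuning/",                      # hypothetical
    role_arn="arn:aws:iam::111122223333:role/TranscribeDataAccess",  # hypothetical
)
# create_clm(request)  # uncomment to actually start training
```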

Using your custom language model to transcribe audio

When your custom model training is complete, you’re ready to use it. Simply start a typical transcription job as you would using Amazon Transcribe. However, this time, we want to invoke the custom language model we just trained instead of the default speech engine. For this post, we assume you’re already familiar with how to run a typical transcription job, so we only call out the new component: for Model type, select Custom language model.
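Programmatically, the only CLM-specific part of StartTranscriptionJob is the ModelSettings parameter. A sketch with hypothetical job, bucket, and model names (the call itself is left commented out):

```python
def build_clm_job(job_name, media_s3_uri, model_name):
    """Keyword arguments for transcribe.start_transcription_job() with a CLM."""
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": "en-US",
        "Media": {"MediaFileUri": media_s3_uri},
        # The only CLM-specific part: select your custom model by name
        "ModelSettings": {"LanguageModelName": model_name},
    }

def start_job(request):
    """Submit the job to Amazon Transcribe (requires AWS credentials)."""
    import boto3
    boto3.client("transcribe").start_transcription_job(**request)

request = build_clm_job(
    job_name="gaming-audio-clm-job",                     # hypothetical
    media_s3_uri="s3://my-clm-bucket/audio/sample.mp4",  # hypothetical
    model_name="video-gaming-clm",                       # hypothetical
)
# start_job(request)  # uncomment to run
```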

Evaluating the results

When your transcription job is complete, it’s time to see how well the CLM performed. We can evaluate its output against the human-annotated reference transcript, just as we compared the generic Amazon Transcribe output against that reference earlier. We actually built two CLMs (one without a tuning split and one with), which we summarize in the following table.

Transcription Type WER (%) Accuracy (100-WER)
Amazon Transcribe generic model 31.34% 68.66%
Amazon Transcribe CLM with no tuning split 26.19% 73.81%
Amazon Transcribe CLM with tuning split 20.23% 79.77%

 A lower WER is better. These WERs aren’t representative of overall Amazon Transcribe performance. All numbers are relative to demonstrate the point of using custom models over generic models, and are specific only to this singular audio sample.

The WER reductions are pretty significant! As you can see, although Amazon Transcribe’s generic engine performed decently in transcribing the sample audio from the video gaming domain, the CLM we built using training data alone reduced WER by roughly 5 percentage points, and the CLM built using training data and a tuning split reduced it by approximately 11 percentage points. These comparative results are unsurprising: the more relevant training and tuning a model receives, the more tailored it is to the specific domain and use case.
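The improvement figures above are absolute percentage-point reductions in WER; the relative reductions are even larger. A quick check using the table’s numbers:

```python
generic, clm_base, clm_tuned = 31.34, 26.19, 20.23  # WER (%) from the table

for label, model_wer in (("CLM, no tuning split", clm_base),
                         ("CLM, with tuning split", clm_tuned)):
    points = generic - model_wer          # absolute drop in percentage points
    relative = points / generic * 100     # relative WER reduction
    print(f"{label}: -{points:.2f} points ({relative:.1f}% relative)")
```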

To give a qualitative visual comparison, we’ve taken a snippet of the transcript from each CLM’s output and placed it against the original human-annotated reference transcript to compare the terms recognized by each model. Green highlights show the progressive accuracy improvements in each iteration.

The following text is the human-generated ground truth reference transcript:

The 2020 holiday season is right around the corner. And with the way that the year’s been going, we can all hope for a little excitement around the next-gen video game consoles coming out soon. So, what’s the difference in hardware specs between the upcoming Playstation 5 and Xbox Series X? Well, let’s take a look under the hoods of each next-gen gaming console. The PS5 features an AMD Zen 2 CPU with up to 3.5 GHz frequency. It sports an AMD Radeon GPU that touts 10.3 teraflops, running up to 2.23 GHz. Memory and storage respectively dial in at 16 GB and 825 GB. The PS5 supports both PS The PS5 supports both 4K and 8K resolution screens. Meanwhile, the Xbox Series X also features an AMD Zen 2 CPU, but clocks in at 3.8 GHz instead. The console boasts a similar AMD custom GPU with 12 teraflops and 1.825 GHz. Memory is the same as that of the PS5’s, coming in at 16 GB. But the default storage is where the system has an edge, bringing out a massive 1 TB hard drive. Like the PS5, the Series X also supports 4K and 8K resolution screens as well. Those of course, are just the numbers. Therefore, it remains to be seen exactly how the performance plays out in practice. It’s worth noting that both systems have incorporated ray-tracing technology, something that’s used to make light and shadows look better in-game. Both systems also offer 3D audio output for immersive experiences.

The following text is the machine transcription output by Amazon Transcribe’s generic speech engine:

the 2020 holiday season is right around the corner. And with the way that the years been going, we can all hope for a little excitement around the next Gen video game consoles coming out soon. So what’s the difference in heart respects between the upcoming PlayStation five and Xbox Series X? Well, let’s take a look under the hood of each of these Consul’s. The PS five features in a M descend to CPU with up to 3.5 gigahertz frequency is sports and AM the radio on GPU that tells 10.3 teraflops running up to 2.23 gigahertz memory and storage, respectively. Dahlin at 16 gigabytes in 825 gigabytes. The PS five supports both PS. The PS five supports both four K and A K resolutions. Meanwhile, the Xbox Series X also features an AM descend to CPU, but clocks in at three point take. It hurts instead, the cost. Almost a similar AMG custom GPU, with 12 teraflops and 1.825 gigahertz memory, is the same as out of the PS. Fives come in at 16 gigabytes, but the default storage is where the system has an edge. Bring out a massive one terabyte hard drive. Like the PS five. The Siri’s X also supports four K and eight K resolution screens as well. Those, of course, are just the numbers. Therefore, it remains to be seen exactly how the performance plays out. In practice. It’s worth noting that both systems have incorporated Ray tracing technology, something that’s used to make light and shadows look better in game. Both systems also offer three D audio output for immersive experiences.

The following text is the machine transcription output by CLM (base training, no tuning split):

the 2020 holiday season is right around the corner. And with the way that the years been going, we can all hope for a little excitement around the next Gen videogame consoles coming out soon. So what’s the difference in hardware specs between the upcoming PlayStation five and Xbox Series X? Well, let’s take a look under the hood of each of these consoles. The PS five features an A M D Zen two CPU with up to 3.5 gigahertz frequency it sports and am the radio GPU. That tells 10.3 teraflops running up to 2.23 gigahertz memory and storage, respectively, Dahlin at 16 gigabytes and 825 gigabytes, The PS five supports both PS. The PS five supports both four K and eight K resolutions. Meanwhile, the Xbox Series X also features an A M D Zen two CPU, but clocks in at 3.8 hertz. Instead. The console boasts a similar a AMD custom GPU, with 12 teraflops and 1.825 hertz. Memory is the same as out of the PS five’s come in at 16 gigabytes, but the default storage is where the system has an edge. Bring out a massive one terabyte hard drive. Like the PS five, the Series X also supports four K and eight K resolution screens as well. Those, of course, are just the numbers. Therefore, it remains to be seen exactly how the performance plays out. In practice. It’s worth noting that both systems have incorporated Ray tracing technology, something that’s used to make light and shadows look better in game. Both systems also offer three D audio output for immersive experiences

The following text is the machine transcription output by CLM (base training, with tuning split):

the 2020 holiday season is right around the corner. And with the way that the year’s been going, we can all hope for a little excitement around the next Gen video game consoles coming out soon. So what’s the difference in hardware specs between the upcoming PlayStation five and Xbox Series X? Well, let’s take a look under the hoods of each of these consoles. The PS five features an AMG Zen two CPU with up to 3.5 givers frequency it sports an AMD Radeon GPU that touts 10.3 teraflops running up to 2.23 gigahertz memory and storage, respectively. Dial in at 16 gigabytes and 825 gigabytes. The PS five supports both PS. The PS five supports both four K and eight K resolutions. Meanwhile, the Xbox Series X also features an AMG Zen two CPU, but clocks in at 3.8 had hurts. Instead, the console boasts a similar AMD custom. GPU, with 12 teraflops and 1.825 gigahertz memory is the same as out of the PS five’s coming in at 16 gigabytes. But the default storage is where the system has an edge, bringing out a massive one terabyte hard drive. Like the PS five, the Series X also supports four K and eight K resolution screens as well. Those, of course, are just the numbers. Therefore, it remains to be seen exactly how the performance plays out. In practice. It’s worth noting that both systems have incorporated Ray tracing technology, something that’s used to make light and shadows look better in game. Both systems also offer three D audio output for immersive experiences.

The difference in improvement varies according to your use case and the quality of your training data and tuning set. Experimentation is encouraged. In general, the more training and tuning that your custom model undergoes, the better the performance. CLM doesn’t guarantee 100% accuracy, but it can offer significant performance improvements over generic speech recognition models.

Best practices

It’s important to note that the resultant custom language model depends directly on what you use as your training dataset. All else equal, the closer the representation of your training data to real use cases, the more performant your custom model is. Moreover, more data is always preferred. For more information about the general guidelines, see Improving Domain-Specific Transcription Accuracy with Custom Language Models.

Amazon Transcribe doesn’t charge for CLM model training, so feel free to experiment. In a single AWS account, you can train up to 10 custom models to address different domains, use cases, or new training datasets. After you have your CLM, you can choose which transcription jobs use it. You only incur an additional CLM charge for the transcription jobs in which you apply a custom language model.

Conclusion

CLMs can be a powerful capability when it comes to improving transcription accuracy for domain-specific use cases. The new feature is available in all AWS Regions where Amazon Transcribe already operates. At the time of this writing, the feature only supports US English. Additional language support will come with time. Start training your own custom models by visiting Amazon Transcribe and checking out Improving Domain-Specific Transcription Accuracy with Custom Language Models.

About the Authors

Paul Zhao is Lead Product Manager at AWS Machine Learning. He manages speech recognition services like Amazon Transcribe and Amazon Transcribe Medical. He was formerly a serial entrepreneur, having launched, operated, and exited two successful businesses in the areas of IoT and FinTech, respectively.

Vivek Govindan is Senior Software Development engineer at AWS Machine Learning. Outside of work, Vivek is an ardent soccer fan.

 

Read More

Running on-demand, serverless Apache Spark data processing jobs using Amazon SageMaker managed Spark containers and the Amazon SageMaker SDK

Apache Spark is a unified analytics engine for large-scale, distributed data processing. Typically, businesses with Spark-based workloads on AWS use their own stack built on top of Amazon Elastic Compute Cloud (Amazon EC2), or Amazon EMR, to run and scale Apache Spark, Hive, Presto, and other big data frameworks. This is useful for persistent workloads, in which you want these Spark clusters to be up and running 24/7; otherwise, you would have to come up with an architecture to spin up and spin down the cluster on a schedule or on demand.

Amazon SageMaker Processing lets you easily run preprocessing, postprocessing, model evaluation, or other fairly generic transform workloads on a fully managed infrastructure. Previously, Amazon SageMaker Processing included a built-in container for Scikit-learn style preprocessing. To use other libraries like Spark, you have the flexibility to bring your own Docker containers. Amazon SageMaker Processing jobs can also be part of your Step Functions workflow for ML involving pre- and post-processing steps. For more information, see AWS Step Functions adds support for Amazon SageMaker Processing.

Several machine learning (ML) workflows involve preprocessing data with Spark (or other libraries) and then passing training data to a training step. The following workflow shows an extract, transform, and load (ETL) step that leads to model training and finally to model endpoint deployment using AWS Step Functions.

Including Spark steps in such workflows requires additional steps to provision and set up these clusters. Alternatively, you can do this using AWS Glue, a fully managed ETL service that makes it easy for customers to write Python or Scala based scripts to preprocess data for ML training.

We’re happy to add a managed Spark container and associated SDK enhancements to Amazon SageMaker Processing, which lets you perform large scale, distributed processing on Spark by simply submitting a PySpark or Java/Scala Spark application. You can use this feature in Amazon SageMaker Studio and Amazon SageMaker notebook instances.

To demonstrate, the following code example runs a PySpark script on Amazon SageMaker Processing by using the PySparkProcessor:

from sagemaker.spark.processing import PySparkProcessor

spark_processor = PySparkProcessor(
    base_job_name="sm-spark",
    framework_version="2.4",
    role=role,
    instance_count=2,
    instance_type="ml.c5.xlarge",
    max_runtime_in_seconds=1200,
) 

spark_processor.run(
    submit_app="./path/to/your/preprocess.py",
    arguments=['s3_input_bucket', bucket,
               's3_input_key_prefix', input_prefix,
               's3_output_bucket', bucket,
               's3_output_key_prefix', input_preprocessed_prefix],
    spark_event_logs_s3_uri='s3://' + bucket + '/' + prefix + '/spark_event_logs',
    logs=False
)

Let’s look at this example in more detail. The PySpark script, named preprocess.py and shown following, loads a large CSV file from Amazon Simple Storage Service (Amazon S3) into a Spark dataframe, fits and transforms this dataframe into an output dataframe, and converts and saves a CSV back to Amazon S3:

import os
import sys

import pyspark
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.sql.types import StructField, StructType, StringType, DoubleType
from pyspark.ml.feature import StringIndexer, VectorIndexer, OneHotEncoder, VectorAssembler
from pyspark.sql.functions import *


def csv_line(data):
    r = ','.join(str(d) for d in data[1])
    return str(data[0]) + "," + r


def main():
    spark = SparkSession.builder.appName("PySparkApp").getOrCreate()
    
    # Convert command line args into a map of args
    args_iter = iter(sys.argv[1:])
    args = dict(zip(args_iter, args_iter))

    spark.sparkContext._jsc.hadoopConfiguration().set("mapred.output.committer.class",
                                                      "org.apache.hadoop.mapred.FileOutputCommitter")
    
    # Defining the schema corresponding to the input data. The input data does not contain the headers
    schema = StructType([StructField("sex", StringType(), True), 
                         StructField("length", DoubleType(), True),
                         StructField("diameter", DoubleType(), True),
                         StructField("height", DoubleType(), True),
                         StructField("whole_weight", DoubleType(), True),
                         StructField("shucked_weight", DoubleType(), True),
                         StructField("viscera_weight", DoubleType(), True), 
                         StructField("shell_weight", DoubleType(), True), 
                         StructField("rings", DoubleType(), True)])

    # Downloading the data from S3 into a Dataframe
    total_df = spark.read.csv(('s3://' + os.path.join(args['s3_input_bucket'], args['s3_input_key_prefix'],'abalone.csv')), header=False, schema=schema)

    #StringIndexer on the sex column which has categorical value
    sex_indexer = StringIndexer(inputCol="sex", outputCol="indexed_sex")
    
    #one-hot-encoding is being performed on the string-indexed sex column (indexed_sex)
    sex_encoder = OneHotEncoder(inputCol="indexed_sex", outputCol="sex_vec")

    #vector-assembler will bring all the features to a 1D vector for us to save easily into CSV format
    assembler = VectorAssembler(inputCols=["sex_vec", 
                                           "length", 
                                           "diameter", 
                                           "height", 
                                           "whole_weight", 
                                           "shucked_weight", 
                                           "viscera_weight", 
                                           "shell_weight"], 
                                outputCol="features")
    
    # The pipeline comprises of the steps added above
    pipeline = Pipeline(stages=[sex_indexer, sex_encoder, assembler])
    
    # This step trains the feature transformers
    model = pipeline.fit(total_df)
    
    # This step transforms the dataset with information obtained from the previous fit
    transformed_total_df = model.transform(total_df)
    
    # Split the overall dataset into 80-20 training and validation
    (train_df, validation_df) = transformed_total_df.randomSplit([0.8, 0.2])
    
    # Convert the train dataframe to RDD to save in CSV format and upload to S3
    train_rdd = train_df.rdd.map(lambda x: (x.rings, x.features))
    train_lines = train_rdd.map(csv_line)
    train_lines.saveAsTextFile('s3://' + os.path.join(args['s3_output_bucket'], args['s3_output_key_prefix'], 'train'))
    
    # Convert the validation dataframe to RDD to save in CSV format and upload to S3
    validation_rdd = validation_df.rdd.map(lambda x: (x.rings, x.features))
    validation_lines = validation_rdd.map(csv_line)
    validation_lines.saveAsTextFile('s3://' + os.path.join(args['s3_output_bucket'], args['s3_output_key_prefix'], 'validation'))


if __name__ == "__main__":
    main()

You can easily start a Spark-based processing job by using the PySparkProcessor() class as shown below:

from time import gmtime, strftime

from sagemaker.spark.processing import PySparkProcessor

# Upload the raw input dataset to S3
timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
prefix = 'sagemaker/spark-preprocess-demo/' + timestamp_prefix
input_prefix_abalone = prefix + '/input/raw/abalone'
input_preprocessed_prefix_abalone = prefix + '/input/preprocessed/abalone'
model_prefix = prefix + '/model'

sagemaker_session.upload_data(path='./data/abalone.csv', bucket=bucket, key_prefix=input_prefix_abalone)

# Run the processing job
spark_processor = PySparkProcessor(
    base_job_name="sm-spark",
    framework_version="2.4",
    role=role,
    instance_count=2,
    instance_type="ml.c5.xlarge",
    max_runtime_in_seconds=1200,
)

spark_processor.run(
    submit_app="./code/preprocess.py",
    arguments=['s3_input_bucket', bucket,
               's3_input_key_prefix', input_prefix_abalone,
               's3_output_bucket', bucket,
               's3_output_key_prefix', input_preprocessed_prefix_abalone],
    spark_event_logs_s3_uri='s3://' + bucket + '/' + prefix + '/spark_event_logs',
    logs=False
)

When you run this in Amazon SageMaker Studio or an Amazon SageMaker notebook instance, the output shows the job’s progress:

Job Name:  sm-spark-<...>
Inputs:  [{'InputName': 'code', 'S3Input': {'S3Uri': 's3://<bucketname>/<prefix>/preprocess.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'output-1', 'S3Output': {'S3Uri': 's3://<bucketname>/<prefix>', 'LocalPath': '/opt/ml/processing/spark-events/', 'S3UploadMode': 'Continuous'}}]

In Amazon SageMaker Studio, you can describe your processing jobs and view relevant details by choosing the processing job name (right-click), and choosing Open in trial details.

You can also track the processing job’s settings, logs, and metrics on the Amazon SageMaker console as shown in the following screenshot.

After a job completes, if spark_event_logs_s3_uri was specified in the run() function, you can view the Spark UI by running the history server:

spark_processor.start_history_server()

If run from an Amazon SageMaker Notebook instance, the output will include a proxy URL where the history server can be accessed:

Starting history server...
History server is up on https://<your-notebook>.notebook.us-west-2.sagemaker.aws/proxy/15050

Visiting this URL will bring you to the history server web interface as shown in the screenshot below:

Additional Python and JAR file dependencies can also be specified in your Spark jobs. For example, if you want to serialize an MLeap model, you can specify these additional dependencies by modifying the call to the run() function of PySparkProcessor:

spark_processor.run(
    submit_app="./code/preprocess-mleap.py",
    submit_py_files=["./spark-mleap/mleap-0.15.0.zip"],
    submit_jars=["./spark-mleap/mleap-spark-assembly.jar"],
    arguments=['s3_input_bucket', bucket,
               's3_input_key_prefix', input_prefix_abalone,
               's3_output_bucket', bucket,
               's3_output_key_prefix', input_preprocessed_prefix_abalone],
    logs=False
)

Finally, overriding the Spark configuration is crucial for several tasks, such as tuning your Spark application or configuring the Hive metastore. You can override Spark, Hive, and Hadoop configurations using the Python SDK.

For example, the following code overrides spark.executor.memory and spark.executor.cores:

spark_processor = PySparkProcessor(
    base_job_name="sm-spark",
    framework_version="2.4",
    role=role,
    instance_count=2,
    instance_type="ml.c5.xlarge",
    max_runtime_in_seconds=1200,
)

configuration = [{
  "Classification": "spark-defaults",
  "Properties": {"spark.executor.memory": "2g", "spark.executor.cores": "1"},
}]

spark_processor.run(
    submit_app_py="./code/preprocess.py",
    arguments=['s3_input_bucket', bucket,
               's3_input_key_prefix', input_prefix_abalone,
               's3_output_bucket', bucket,
               's3_output_key_prefix', input_preprocessed_prefix_abalone],
    configuration=configuration,
    logs=False
)

Try out this example on your own by navigating to the examples tab in your Amazon SageMaker notebook instance, or by cloning the Amazon SageMaker examples directory and navigating to the folder with Amazon SageMaker Processing examples.

Additionally, you can set up an end-to-end Spark workflow for your use cases using Amazon SageMaker and other AWS services.

Conclusion

Amazon SageMaker makes extensive use of Docker containers to let users build a runtime environment for data preparation, training, and inference code. The built-in Spark container for Amazon SageMaker Processing provides a managed Spark runtime, including all of the library components and dependencies needed to run distributed data processing workloads. The example discussed in this post shows how developers and data scientists can take advantage of the built-in Spark container to focus on preparing and preprocessing their data, rather than on tuning, scaling, or managing Spark infrastructure.


About the Authors

Shreyas Subramanian is an AI/ML Specialist Solutions Architect who helps customers solve their business challenges using machine learning on the AWS platform.

Andrew Packer is a Software Engineer in Amazon AI where he is excited about building scalable, distributed machine learning infrastructure for the masses. In his spare time, he likes playing guitar and exploring the PNW.

Vidhi Kastuar is a Sr. Product Manager for Amazon SageMaker, focusing on making machine learning and artificial intelligence simple, easy to use, and scalable for all users and businesses. Prior to AWS, Vidhi was Director of Product Management at Veritas Technologies. For fun outside work, Vidhi loves to sketch and paint, work as a career coach, and spend time with her family and friends.


AWS Inferentia is now available in 11 AWS Regions, with best-in-class performance for running object detection models at scale

AWS has expanded the availability of Amazon EC2 Inf1 instances to four new AWS Regions, bringing the total number of supported Regions to 11: US East (N. Virginia, Ohio), US West (Oregon), Asia Pacific (Mumbai, Singapore, Sydney, Tokyo), Europe (Frankfurt, Ireland, Paris), and South America (São Paulo).

Amazon EC2 Inf1 instances are powered by AWS Inferentia chips, which are custom-designed to provide you with the lowest cost per inference in the cloud and lower the barriers for everyday developers to use machine learning (ML) at scale. Customers using models such as YOLO v3 and YOLO v4 can get up to 1.85 times higher throughput and up to 40% lower cost per inference compared to the EC2 G4 GPU-based instances.
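Cost per inference is simply an instance's hourly price divided by the number of inferences it completes in an hour, so a throughput gain compounds with any price difference. The quick sketch below illustrates the arithmetic; the prices and throughputs are hypothetical figures for illustration only, not published benchmarks:

```python
def cost_per_million_inferences(hourly_price_usd, throughput_per_sec):
    """USD per one million inferences at a sustained throughput."""
    inferences_per_hour = throughput_per_sec * 3600
    return hourly_price_usd / inferences_per_hour * 1_000_000

# Hypothetical figures for illustration only.
g4_cost = cost_per_million_inferences(hourly_price_usd=1.00, throughput_per_sec=100)
inf1_cost = cost_per_million_inferences(hourly_price_usd=1.11, throughput_per_sec=185)  # 1.85x throughput
savings = 1 - inf1_cost / g4_cost  # fraction saved per inference
```

With these made-up numbers, 1.85 times the throughput more than offsets the higher hourly price, yielding a 40% lower cost per inference.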

As you scale your use of deep learning across new applications, you may be bound by the high cost of running trained ML models in production. In many cases, up to 90% of the infrastructure cost spent on developing and running an ML application is on inference, making the need for high-performance, cost-effective ML inference infrastructure critical. Inf1 instances are built from the ground up to deliver faster performance and more cost-effective ML inference than comparable GPU-based instances. This gives you the performance and cost structure you need to confidently deploy your deep learning models across a broad set of applications.

AWS Neuron SDK performance and support for new ML models

You can deploy your ML models to Inf1 instances natively with popular ML frameworks such as TensorFlow, PyTorch, and MXNet. You can deploy your existing models to Amazon EC2 Inf1 instances with minimal code changes by using the AWS Neuron SDK, which is integrated with these popular ML frameworks. This gives you the freedom to maintain hardware portability and take advantage of the latest technologies without being tied to vendor-specific software libraries.

Since its launch, the Neuron SDK has seen a dramatic improvement in the breadth of models that deliver best-in-class performance at a fraction of the cost. This includes natural language processing models like the popular BERT, image classification models (ResNet and VGG), and object detection models (OpenPose and SSD). The latest Neuron release (1.8.0) provides optimizations that improve performance of YOLO v3 and v4, VGG16, SSD300, and BERT. It also improves operational deployments of large-scale inference applications, with a session management agent incorporated into all supported ML frameworks and a new Neuron tool that allows you to easily scale monitoring of large fleets of inference applications.

Customer success stories

Since the launch of Inf1 instances, a broad spectrum of customers, from large enterprises to startups, as well as Amazon services, have begun using them to run production workloads.

Anthem is one of the nation’s leading health benefits companies, serving the healthcare needs of over 40 million members across dozens of states. They use deep learning to automate the generation of actionable insights from customer opinions via natural language models.

“Our application is computationally intensive and needs to be deployed in a highly performant manner,” says Numan Laanait, PhD, Principal AI/Data Scientist at Anthem. “We seamlessly deployed our deep learning inferencing workload onto Amazon EC2 Inf1 instances powered by the AWS Inferentia processor. The new Inf1 instances provide two times higher throughput than GPU-based instances and allowed us to streamline our inference workloads.”

Condé Nast, another AWS customer, has a global portfolio that encompasses over 20 leading media brands, including Wired, Vogue, and Vanity Fair.

“Within a few weeks, our team was able to integrate our recommendation engine with AWS Inferentia chips,” says Paul Fryzel, Principal Engineer in AI Infrastructure at Condé Nast. “This union enables multiple runtime optimizations for state-of-the-art natural language models on SageMaker’s Inf1 instances. As a result, we observed a 72% reduction in cost compared to the previously deployed GPU instances.”

Getting started

The easiest and quickest way to get started with Inf1 instances is via Amazon SageMaker, a fully managed service for building, training, and deploying ML models. If you prefer to manage your own ML application development platforms, you can get started either by launching Inf1 instances with AWS Deep Learning AMIs, which include the Neuron SDK, or by using Inf1 instances via Amazon Elastic Kubernetes Service (Amazon EKS) or Amazon Elastic Container Service (Amazon ECS) for containerized ML applications.

For more information, see Amazon EC2 Inf1 Instances.


About the Author

Gadi Hutt is a Sr. Director, Business Development at AWS. Gadi has over 20 years of experience in engineering and business disciplines. He started his career as an embedded software engineer and later moved into product lead positions. Since 2013, Gadi has led Annapurna Labs technical business development and product management, focused on hardware acceleration software and hardware products such as the EC2 FPGA F1 instances and AWS Inferentia, alongside its Neuron SDK, accelerating machine learning in the cloud.
