Live call analytics for your contact center with Amazon language AI services

Your contact center connects your business to your community, enabling customers to order products, callers to request support, clients to make appointments, and much more. When calls go well, callers retain a positive image of your brand, and are likely to return and recommend you to others. And the converse, of course, is also true.

Naturally, you want to do what you can to ensure that your callers have a good experience. There are two aspects to this:

  • Help supervisors assess the quality of your callers’ experiences in real time – For example, your supervisors need to know if initially unhappy callers become happier as the call progresses. And if not, why? What actions can be taken, before the call ends, to help the agent improve the customer experience for calls that aren’t going well?
  • Help agents optimize the quality of your callers’ experiences – For example, can you deploy live call transcription? This removes the need for your agents to take notes during calls, freeing them to focus more attention on providing positive customer interactions.

Contact Lens for Amazon Connect provides real-time supervisor and agent assist features that could be just what you need, but you may not yet be using Amazon Connect. You need a solution that works with your existing contact center.

Amazon Machine Learning (ML) services like Amazon Transcribe and Amazon Comprehend provide feature-rich APIs that you can use to transcribe and extract insights from your contact center audio at scale. Although you could build your own custom call analytics solution using these services, that requires time and resources. In this post, we introduce our new sample solution for live call analytics.

Solution overview

Our new sample solution, Live Call Analytics (LCA), does most of the heavy lifting associated with providing an end-to-end solution that can plug into your contact center and provide the intelligent insights that you need.

It has a call summary user interface, as shown in the following screenshot.

It also has a call detail user interface.

LCA currently supports the following features:

  • Accurate streaming transcription with support for personally identifiable information (PII) redaction and custom vocabulary
  • Sentiment detection
  • Automatic scaling to handle call volume changes
  • Call recording and archiving
  • A dynamically updated web user interface for supervisors and agents:
    • A call summary page that displays a list of in-progress and completed calls, with call timestamps, metadata, and summary statistics like duration and sentiment trend
    • Call detail pages showing live turn-by-turn transcription of the caller/agent dialog, turn-by-turn sentiment, and sentiment trend
  • Standards-based telephony integration with your contact center using Session Recording Protocol (SIPREC)
  • A built-in standalone demo mode that allows you to quickly install and try out LCA for yourself, without needing to integrate with your contact center telephony
  • Easy-to-install resources with a single AWS CloudFormation template

This is just the beginning! We expect to add many more exciting features over time, based on your feedback.

Deploy the CloudFormation stack

Start your LCA experience by using AWS CloudFormation to deploy the sample solution with the built-in demo mode enabled.

The demo mode downloads, builds, and installs a small virtual PBX server on an Amazon Elastic Compute Cloud (Amazon EC2) instance in your AWS account (using the free open-source Asterisk project) so you can make test phone calls right away and see the solution in action. You can integrate it with your contact center later after evaluating the solution’s functionality for your unique use case.

  1. Use the appropriate Launch Stack button for the AWS Region in which you’ll use the solution. We expect to add support for additional Regions over time.
    • US East (N. Virginia) us-east-1
    • US West (Oregon) us-west-2
  2. For Stack name, use the default value, LiveCallAnalytics.
  3. For Install Demo Asterisk Server, use the default value, true.
  4. For Allowed CIDR Block for Demo Softphone, use the IP address of your local computer with a network mask of /32.

To find your computer’s IP address, you can use the website checkip.amazonaws.com.
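
If you prefer to script this step, the following minimal sketch fetches your public IP from checkip.amazonaws.com and prints the /32 CIDR value to paste into the parameter:

# Fetch your public IP and format it as the /32 CIDR the parameter expects
import urllib.request

ip = urllib.request.urlopen("https://checkip.amazonaws.com").read().decode().strip()
print(f"{ip}/32")  # for example, 203.0.113.10/32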

Later, you can optionally install a softphone application on your computer, which you can register with LCA’s demo Asterisk server. This allows you to experiment with LCA using real two-way phone calls.

If that seems like too much hassle, don’t worry! Simply leave the default value for this parameter and elect not to register a softphone later. You will still be able to test the solution. When the demo Asterisk server doesn’t detect a registered softphone, it automatically simulates the agent side of the conversation using a built-in audio recording.

  5. For Allowed CIDR List for SIPREC Integration, leave the default value.

This parameter isn’t used for demo mode installation. Later, when you want to integrate LCA with your contact center audio stream, you use this parameter to specify the IP address of your SIPREC source hosts, such as your Session Border Controller (SBC) servers.

  6. For Authorized Account Email Domain, use the domain name part of your corporate email address (this allows others with email addresses in the same domain to sign up for access to the UI).
  7. For Call Audio Recordings Bucket Name, leave the value blank to have an Amazon Simple Storage Service (Amazon S3) bucket for your call recordings automatically created for you. Otherwise, use the name of an existing S3 bucket where you want your recordings to be stored.
  8. For all other parameters, use the default values.

If you want to customize the settings later, for example to apply PII redaction or custom vocabulary to improve accuracy, you can update the stack for these parameters.

  9. Check the two acknowledgement boxes, and choose Create stack.

The main CloudFormation stack uses nested stacks to create the solution’s resources in your AWS account.

The stacks take about 20 minutes to deploy. The main stack status shows CREATE_COMPLETE when everything is deployed.
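
If you’re scripting the deployment, a boto3 sketch like the following can wait for the stack to finish and print its outputs (the stack and output names match those used in this post; adjust the Region to your deployment):

# Wait for the LiveCallAnalytics stack and print its outputs
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")
cfn.get_waiter("stack_create_complete").wait(StackName="LiveCallAnalytics")

stack = cfn.describe_stacks(StackName="LiveCallAnalytics")["Stacks"][0]
for output in stack.get("Outputs", []):
    print(output["OutputKey"], "=", output["OutputValue"])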

Create a user account

We now open the web user interface and create a user account.

  1. On the AWS CloudFormation console, choose the main stack, LiveCallAnalytics, and choose the Outputs tab.
  2. Open your web browser to the URL shown as CloudfrontEndpoint in the outputs.

You’re directed to the login page.

  3. Choose Create account.
  4. For Username, use your email address that belongs to the email address domain you provided earlier.
  5. For Password, use a sequence that has a length of at least 8 characters, and contains uppercase and lowercase characters, plus numbers and special characters.
  6. Choose CREATE ACCOUNT.

The Confirm Sign up page appears.


Your confirmation code has been emailed to the email address you used as your username. Check your inbox for an email from no-reply@verificationemail.com with subject “Account Verification.”

  7. For Confirmation Code, copy and paste the code from the email.
  8. Choose CONFIRM.

You’re now logged in to LCA.

Make a test phone call

Call the number shown as DemoPBXPhoneNumber in the AWS CloudFormation outputs for the main LiveCallAnalytics stack.

You haven’t yet registered a softphone app, so the demo Asterisk server picks up the call and plays a recording. Listen to the recording, and answer the questions when prompted. Your call is streamed to the LCA application, and is recorded, transcribed, and analyzed. When you log in to the UI later, you can see a record of this call.

Optional: Install and register a softphone

If you want to use LCA with live two-person phone calls instead of the demo recording, you can register a softphone application with your new demo Asterisk server.

The README in the LCA GitHub repository has step-by-step instructions for downloading, installing, and registering a free (for non-commercial use) softphone on your local computer. The registration is successful only if Allowed CIDR Block for Demo Softphone correctly reflects your local machine’s IP address. If you got it wrong, or if your IP address has changed, you can choose the LiveCallAnalytics stack in AWS CloudFormation, and choose Update to provide a new value for Allowed CIDR Block for Demo Softphone.

If you still can’t successfully register your softphone, and you are connected to a VPN, disconnect and update Allowed CIDR Block for Demo Softphone—corporate VPNs can restrict IP voice traffic.

When your softphone is registered, call the phone number again. Now, instead of playing the default recording, the demo Asterisk server causes your softphone to ring. Answer the call on the softphone, and have a two-way conversation with yourself! Better yet, ask a friend to call your Asterisk phone number, so you can simulate a contact center call by role playing as caller and agent.

Explore live call analysis features

Now, with LCA successfully installed in demo mode, you’re ready to explore the call analysis features.

  1. Open the LCA web UI using the URL shown as CloudfrontEndpoint in the main stack outputs.

We suggest bookmarking this URL—you’ll use it often!

  2. Make a test phone call to the demo Asterisk server (as you did earlier).
    1. If you registered a softphone, it rings on your local computer. Answer the call, or better, have someone else answer it, and use the softphone to play the agent role in the conversation.
    2. If you didn’t register a softphone, the Asterisk server demo audio plays the role of agent.

Your phone call almost immediately shows up at the top of the call list on the UI, with the status In progress.

The call has the following details:

  • Call ID – A unique identifier for this telephone call
  • Initiation Timestamp – Shows the time the telephone call started
  • Caller Phone Number – Shows the number of the phone from which you made the call
  • Status – Indicates that the call is in progress
  • Caller Sentiment – The average caller sentiment
  • Caller Sentiment Trend – The caller sentiment trend
  • Duration – The elapsed time since the start of the call
  3. Choose the call ID of your In progress call to open the live call detail page.

As you talk on the phone from which you made the call, your voice and the voice of the agent are transcribed in real time and displayed in the auto scrolling Call Transcript pane.

Each turn of the conversation (customer and agent) is annotated with a sentiment indicator. As the call continues, the sentiment for both caller and agent is aggregated over a rolling time window, so it’s easy to see if sentiment is trending in a positive or negative direction.
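
The solution computes these trends in its Call Event Stream Processor; conceptually, the rolling aggregation works like the following illustrative sketch (not the solution’s actual code):

# Illustrative rolling-window average of per-turn sentiment scores
# (e.g. +1 positive, 0 neutral, -1 negative); not the solution's code
from collections import deque

class SentimentTrend:
    def __init__(self, window_size=5):
        self.window = deque(maxlen=window_size)  # keep only the most recent turns

    def add_turn(self, score: float) -> float:
        self.window.append(score)
        return sum(self.window) / len(self.window)

trend = SentimentTrend()
for score in [-1, -1, 0, 1, 1, 1]:   # a caller who gradually becomes happier
    print(round(trend.add_turn(score), 2))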

  4. End the call.
  5. Navigate back to the call list page by choosing Calls at the top of the page.

Your call is now displayed in the list with the status Done.

  6. To display call details for any call, choose the call ID to open the details page, or select the call to display the Calls list and Call Details pane on the same page.

You can change the orientation to a side-by-side layout using the Call Details settings tool (gear icon).

You can make a few more phone calls to become familiar with how the application works. With the softphone installed, ask someone else to call your Asterisk demo server phone number: pick up their call on your softphone and talk with them while watching the turn-by-turn transcription update in real time. Observe the low latency. Assess the accuracy of transcriptions and sentiment annotation—you’ll likely find that it’s not perfect, but it’s close! Transcriptions are less accurate when you use technical or domain-specific jargon, but you can use custom vocabulary to teach Amazon Transcribe new words and terms.

Processing flow overview

How did LCA transcribe and analyze your test phone calls? Let’s take a quick look at how it works.

The following diagram shows the main architectural components and how they fit together at a high level.

The demo Asterisk server is configured to use Voice Connector, which provides the phone number and SIP trunking needed to route inbound and outbound calls. When you configure LCA to integrate with your contact center instead of the demo Asterisk server, Voice Connector is configured to integrate instead with your existing contact center using SIP-based media recording (SIPREC) or network-based recording (NBR). In both cases, Voice Connector streams audio to Kinesis Video Streams using two streams per call, one for the caller and one for the agent.

When a new video stream is initiated, an event is fired using EventBridge. This event triggers a Lambda function, which uses an Amazon Simple Queue Service (Amazon SQS) queue to initiate a new call processing job in Fargate, a serverless compute service for containers. A single container instance processes multiple calls simultaneously. Automatic scaling provisions and de-provisions additional containers dynamically as needed to handle changing call volumes.

The Fargate container immediately creates a streaming connection with Amazon Transcribe and starts consuming and relaying audio fragments from Kinesis Video Streams to Amazon Transcribe.
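
The solution’s container code handles this relay for you; as an illustration of the underlying pattern, here is a minimal Python sketch using the amazon-transcribe streaming package (pip install amazon-transcribe). Here, get_audio_chunks is a hypothetical async source of 8 kHz PCM audio, such as fragments read from Kinesis Video Streams:

# Relay PCM audio chunks to Amazon Transcribe streaming and print results
import asyncio
from amazon_transcribe.client import TranscribeStreamingClient
from amazon_transcribe.handlers import TranscriptResultStreamHandler
from amazon_transcribe.model import TranscriptEvent

class PrintHandler(TranscriptResultStreamHandler):
    async def handle_transcript_event(self, transcript_event: TranscriptEvent):
        for result in transcript_event.transcript.results:
            for alt in result.alternatives:
                print(alt.transcript)

async def transcribe(get_audio_chunks):
    client = TranscribeStreamingClient(region="us-east-1")
    stream = await client.start_stream_transcription(
        language_code="en-US", media_sample_rate_hz=8000, media_encoding="pcm"
    )

    async def writer():
        # Send audio to Transcribe as it arrives, then close the stream
        async for chunk in get_audio_chunks():
            await stream.input_stream.send_audio_event(audio_chunk=chunk)
        await stream.input_stream.end_stream()

    # Run the audio writer and the transcript reader concurrently
    await asyncio.gather(writer(), PrintHandler(stream.output_stream).handle_events())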

The container writes the streaming transcription results in real time to a DynamoDB table.

A Lambda function, the Call Event Stream Processor, fed by DynamoDB streams, processes and enriches call metadata and transcription segments. The event processor function interfaces with AWS AppSync to persist changes (mutations) in DynamoDB and to send real-time updates to logged-in web clients.

The LCA web UI assets are hosted on Amazon S3 and served via CloudFront. Authentication is provided by Amazon Cognito. In demo mode, user identities are configured in an Amazon Cognito user pool. In a production setting, you would likely configure Amazon Cognito to integrate with your existing identity provider (IdP) so authorized users can log in with their corporate credentials.

When the user is authenticated, the web application establishes a secure GraphQL connection to the AWS AppSync API, and subscribes to receive real-time events such as new calls and call status changes for the calls list page, and new or updated transcription segments and computed analytics for the call details page.

The entire processing flow, from ingested speech to live webpage updates, is event driven, and so the end-to-end latency is small—typically just a few seconds.

Monitoring and troubleshooting

AWS CloudFormation reports deployment failures and causes on the relevant stack Events tab. See Troubleshooting CloudFormation for help with common deployment problems. Look out for deployment failures caused by limit exceeded errors; the LCA stacks create NAT gateways, Elastic IP addresses, and other resources that are subject to default account and Region service quotas.

Amazon Transcribe has a default limit of 25 concurrent transcription streams, which limits LCA to 12 concurrent calls (two streams per call). Request an increase for the number of concurrent HTTP/2 streams for streaming transcription if you need to handle a larger number of concurrent calls.
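
To check your current streaming quotas programmatically, you can use the Service Quotas API; the following boto3 sketch lists Amazon Transcribe quotas whose names mention streams (exact quota names can vary by Region):

# List Amazon Transcribe quotas related to streaming
import boto3

sq = boto3.client("service-quotas", region_name="us-east-1")
for page in sq.get_paginator("list_service_quotas").paginate(ServiceCode="transcribe"):
    for quota in page["Quotas"]:
        if "stream" in quota["QuotaName"].lower():
            print(quota["QuotaName"], "=", quota["Value"])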

LCA provides runtime monitoring and logs for each component using CloudWatch:

  • Call trigger Lambda function – On the Lambda console, open the LiveCallAnalytics-AISTACK-transcribingFargateXXX function. Choose the Monitor tab to see function metrics. Choose View logs in CloudWatch to inspect function logs.
  • Call processing Fargate task – On the Amazon ECS console, choose the LiveCallAnalytics cluster. Open the LiveCallAnalytics service to see container health metrics. Choose the Logs tab to inspect container logs.
  • Call Event Stream Processor Lambda function – On the Lambda console, open the LiveCallAnalytics-AISTACK-CallEventStreamXXX function. Choose the Monitor tab to see function metrics. Choose View logs in CloudWatch to inspect function logs.
  • AWS AppSync API – On the AWS AppSync console, open the CallAnalytics-LiveCallAnalytics-XXX API. Choose Monitoring in the navigation pane to see API metrics. Choose View logs in CloudWatch to inspect AWS AppSync API logs.

Cost assessment

This solution has hourly cost components and usage cost components.

The hourly costs add up to about $0.15 per hour, or $0.22 per hour with the demo Asterisk server enabled. For more information about the services that incur an hourly cost, see AWS Fargate Pricing, Amazon VPC pricing (for the NAT gateway), and Amazon EC2 pricing (for the demo Asterisk server).

The hourly cost components comprise the following:

  • Fargate container – 2 vCPU at $0.08/hour and 4 GB memory at $0.02/hour = $0.10/hour
  • NAT gateways – Two at $0.09/hour
  • EC2 instance – t4g.large at $0.07/hour (for demo Asterisk server)

The usage costs add up to about $0.30 for a 5-minute call, although this can vary based on total usage, because usage affects Free Tier eligibility and volume tiered pricing for many services. For more information about the services that incur usage costs, see the pricing pages for Amazon Transcribe, Amazon Comprehend, Amazon Kinesis Video Streams, Amazon DynamoDB, AWS AppSync, AWS Lambda, and Amazon S3.

To explore LCA costs for yourself, use AWS Cost Explorer or choose Bill Details on the AWS Billing Dashboard to see your month-to-date spend by service.
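
For example, the following boto3 sketch queries Cost Explorer for month-to-date spend grouped by service (assuming your credentials are allowed to call the Cost Explorer API):

# Print month-to-date unblended cost per service via Cost Explorer
import boto3
from datetime import date

ce = boto3.client("ce", region_name="us-east-1")
result = ce.get_cost_and_usage(
    TimePeriod={"Start": date.today().replace(day=1).isoformat(),
                "End": date.today().isoformat()},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)
for group in result["ResultsByTime"][0]["Groups"]:
    print(group["Keys"][0], group["Metrics"]["UnblendedCost"]["Amount"])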

Integrate with your contact center

To deploy LCA to analyze real calls to your contact center using AWS CloudFormation, update the existing LiveCallAnalytics demo stack, changing the parameters to disable demo mode.

Alternatively, delete the existing LiveCallAnalytics demo stack, and deploy a new LiveCallAnalytics stack (use the stack options from the previous section).

You could also deploy a new LiveCallAnalytics stack in a different AWS account or Region.

Use these parameters to configure LCA for contact center integration:

  1. For Install Demo Asterisk Server, enter false.
  2. For Allowed CIDR Block for Demo Softphone, leave the default value.
  3. For Allowed CIDR List for SIPREC Integration, use the CIDR blocks of your SIPREC source hosts, such as your SBC servers. Use commas to separate CIDR blocks if you enter more than one.

When you deploy LCA, a Voice Connector is created for you. Use the Voice Connector documentation as guidance to configure this Voice Connector and your PBX/SBC for SIP-based media recording (SIPREC) or network-based recording (NBR). The Voice Connector Resources page provides some vendor-specific example configuration guides, including:

  • SIPREC Configuration Guide: Cisco Unified Communications Manager (CUCM) and Cisco Unified Border Element (CUBE)
  • SIPREC Configuration Guide: Avaya Aura Communication Manager and Session Manager with Sonus SBC 521

The LCA GitHub repository has additional vendor-specific notes that you may find helpful; see SIPREC.md.

Customize your deployment

Use the following CloudFormation template parameters when creating or updating your stack to customize your LCA deployment:

  • To use your own S3 bucket for call recordings, use Call Audio Recordings Bucket Name and Audio File Prefix.
  • To redact PII from the transcriptions, set IsContentRedactionEnabled to true. For more information, see Redacting or identifying PII in a real-time stream.
  • To improve transcription accuracy for technical and domain-specific acronyms and jargon, set UseCustomVocabulary to the name of a custom vocabulary that you already created in Amazon Transcribe. For more information, see Custom vocabularies.
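
For example, the following boto3 sketch (illustrative; review your stack’s full parameter list before running it, and pass UsePreviousValue for any parameter you don’t want reset) updates the deployed stack to enable redaction and attach a custom vocabulary:

# Update the LiveCallAnalytics stack parameters in place
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")
cfn.update_stack(
    StackName="LiveCallAnalytics",
    UsePreviousTemplate=True,
    Parameters=[
        {"ParameterKey": "IsContentRedactionEnabled", "ParameterValue": "true"},
        {"ParameterKey": "UseCustomVocabulary", "ParameterValue": "my-custom-vocabulary"},
        # In a real update, also pass {"ParameterKey": ..., "UsePreviousValue": True}
        # for every other stack parameter you want to leave unchanged.
    ],
    Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM", "CAPABILITY_AUTO_EXPAND"],
)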

LCA is an open-source project. You can fork the LCA GitHub repository, enhance the code, and send us pull requests so we can incorporate and share your improvements!

Clean up

When you’re finished experimenting with this solution, clean up your resources by opening the AWS CloudFormation console and deleting the LiveCallAnalytics stacks that you deployed. This deletes the resources that were created by deploying the solution, except for the recording S3 buckets, the DynamoDB table, and the CloudWatch log groups, which are retained after the stack is deleted to avoid deleting your data.

Post Call Analytics: Companion solution

Our companion solution, Post Call Analytics (PCA), offers additional insights and analytics capabilities by using the Amazon Transcribe Call Analytics batch API to detect common issues, interruptions, silences, speaker loudness, call categories, and more. Unlike LCA, which transcribes and analyzes streaming audio in real time, PCA transcribes and analyzes your call recordings after the call has ended. Configure LCA to store call recordings in PCA’s ingestion S3 bucket, and use the two solutions together to get the best of both worlds. For more information, see Post call analytics for your contact center with Amazon language AI services.

Conclusion

The Live Call Analytics (LCA) sample solution offers a scalable, cost-effective approach to provide live call analysis with features to assist supervisors and agents to improve focus on your callers’ experience. It uses Amazon ML services like Amazon Transcribe and Amazon Comprehend to transcribe and extract real-time insights from your contact center audio.

The sample LCA application is provided as open source—use it as a starting point for your own solution, and help us make it better by contributing back fixes and features via GitHub pull requests. For expert assistance, AWS Professional Services and other AWS Partners are here to help.

We’d love to hear from you. Let us know what you think in the comments section, or use the issues forum in the LCA GitHub repository.


About the Authors

Bob Strahan is a Principal Solutions Architect in the AWS Language AI Services team.

Oliver Atoa is a Principal Solutions Architect in the AWS Language AI Services team.

Sagar Khasnis is a Senior Solutions Architect focused on building applications for Productivity Applications. He is passionate about building innovative solutions using AWS services to help customers achieve their business objectives. In his free time, you can find him reading biographies, hiking, working out at a fitness studio, and geeking out on his personal rig at home.

Court Schuett is a Chime Specialist SA with a background in telephony and now likes to build things that build things.


Post call analytics for your contact center with Amazon language AI services

Your contact center connects your business to your community, enabling customers to order products, callers to request support, clients to make appointments, and much more. Each conversation with a caller is an opportunity to learn more about that caller’s needs, and how well those needs were addressed during the call. You can uncover insights from these conversations that help you manage script compliance and find new opportunities to satisfy your customers, perhaps by expanding your services to address reported gaps, improving the quality of reported problem areas, or by elevating the customer experience delivered by your contact center agents.

Contact Lens for Amazon Connect provides call transcriptions with rich analytics capabilities that can provide these kinds of insights, but you may not currently be using Amazon Connect. You need a solution that works with your existing contact center call recordings.

Amazon Machine Learning (ML) services like Amazon Transcribe Call Analytics and Amazon Comprehend provide feature-rich APIs that you can use to transcribe and extract insights from your contact center audio recordings at scale. Although you could build your own custom call analytics solution using these services, that requires time and resources. In this post, we introduce our new sample solution for post call analytics.

Solution overview

Our new sample solution, Post Call Analytics (PCA), does most of the heavy lifting associated with providing an end-to-end solution that can process call recordings from your existing contact center. PCA provides actionable insights to spot emerging trends, identify agent coaching opportunities, and assess the general sentiment of calls.

You provide your call recordings, and PCA automatically processes them using Transcribe Call Analytics and other AWS services to extract valuable intelligence such as customer and agent sentiment, call drivers, entities discussed, and conversation characteristics such as non-talk time, interruptions, loudness, and talk speed. Transcribe Call Analytics detects issues using built-in ML models that have been trained using thousands of hours of conversations. With the automated call categorization capability, you can also tag conversations based on keywords or phrases, sentiment, and non-talk time. And you can optionally redact sensitive customer data such as names, addresses, credit card numbers, and social security numbers from both transcript and audio files.

PCA’s web user interface has a home page showing all your calls, as shown in the following screenshot.

You can choose a record to see the details of the call, such as speech characteristics.

You can also scroll down to see annotated turn-by-turn call details.

You can search for calls based on dates, entities, or sentiment characteristics.

You can also search your call transcriptions.

Lastly, you can query detailed call analytics data from your preferred business intelligence (BI) tool.

PCA currently supports the following features:

  • Transcription

  • Analytics

    • Caller and agent sentiment details and trends
    • Talk and non-talk time for both caller and agent
    • Configurable Transcribe Call Analytics categories based on the presence or absence of keywords or phrases, sentiment, and non-talk time
    • Caller issue detection using built-in ML models in Transcribe Call Analytics
    • Entity detection using Amazon Comprehend standard or custom entity detection models, or simple configurable string matching
    • Interruption detection for both caller and agent
    • Speaker loudness
  • Search

    • Search on call attributes such as time range, sentiment, or entities
    • Search transcriptions
  • Other

    • Metadata detection from audio file names, such as call GUID, agent’s name, and call date/time
    • Automatic scaling to handle variable call volumes
    • Bulk loading of large archives of older recordings, while maintaining capacity to process new recordings as they arrive
    • Sample recordings so you can quickly try out PCA for yourself
    • Easy installation with a single AWS CloudFormation template

This is just the beginning! We expect to add many more exciting features over time, based on your feedback.

Deploy the CloudFormation stack

Start your PCA experience by using AWS CloudFormation to deploy the solution with sample recordings loaded.

  1. Use the following Launch Stack button to deploy the PCA solution in the us-east-1 (N. Virginia) AWS Region.

The source code is available in our GitHub repository. Follow the directions in the README to deploy PCA to additional Regions supported by Amazon Transcribe.

  2. For Stack name, use the default value, PostCallAnalytics.
  3. For AdminUsername, use the default value, admin.
  4. For AdminEmail, use a valid email address—your temporary password is emailed to this address during the deployment.
  5. For loadSampleAudioFiles, change the value to true.
  6. For EnableTranscriptKendraSearch, change the value to Yes, create new Kendra Index (Developer Edition).

If you have previously used your Amazon Kendra Free Tier allowance, you incur an hourly cost for this index (more information on cost later in this post). Amazon Kendra transcript search is an optional feature, so if you don’t need it and are concerned about cost, use the default value of No.

  7. For all other parameters, use the default values.

If you want to customize the settings later, for example to apply custom vocabulary to improve accuracy, or to customize entity detection, you can update the stack to set these parameters.

  8. Select the two acknowledgement boxes, and choose Create stack.

The main CloudFormation stack uses nested stacks to create the solution’s resources in your AWS account.

The stacks take about 20 minutes to deploy. The main stack status shows as CREATE_COMPLETE when everything is deployed.

Set your password

After you deploy the stack, you need to open the PCA web user interface and set your password.

  1. On the AWS CloudFormation console, choose the main stack, PostCallAnalytics, and choose the Outputs tab.
  2. Open your web browser to the URL shown as WebAppURL in the outputs.

You’re redirected to a login page.

  3. Open the email you received, at the email address you provided, with the subject “Welcome to the Amazon Transcribe Post Call Analytics (PCA) Solution!”

This email contains a generated temporary password that you can use to log in (as user admin) and create your own password.

  4. Set a new password.

Your new password must have a length of at least eight characters, and contain uppercase and lowercase characters, plus numbers and special characters.

You’re now logged in to PCA. Because you set loadSampleAudioFiles to true, your PCA deployment now has three sample calls pre-loaded for you to explore.

Optional: Open the transcription search web UI and set your permanent password

Follow these additional steps to log in to the companion transcript search web app, which is deployed only if you enabled EnableTranscriptKendraSearch when you launched the stack.

  1. On the AWS CloudFormation console, choose the main stack, PostCallAnalytics, and choose the Outputs tab.
  2. Open your web browser to the URL shown as TranscriptionMediaSearchFinderURL in the outputs.

You’re redirected to the login page.

  3. Open the email you received, at the email address you provided, with the subject “Welcome to Finder Web App.”

This email contains a generated temporary password that you can use to log in (as user admin).

  4. Create your own password, just like you already did for the PCA web application.

As before, your new password must have a length of at least eight characters, and contain uppercase and lowercase characters, plus numbers and special characters.

You’re now logged in to the transcript search Finder application. The sample audio files are indexed already, and ready for search.

Explore post call analytics features

Now, with PCA successfully installed, you’re ready to explore the call analysis features.

Home page

To explore the home page, open the PCA web UI using the URL shown as WebAppURL in the main stack outputs (bookmark this URL, you’ll use it often!)

You already have three calls listed on the home page, sorted in descending time order (most recent first). These are the sample audio files.

The calls have the following key details:

  • Job Name – Assigned from the recording audio file name; serves as a unique job name for this call
  • Timestamp – Parsed from the audio file name if possible; otherwise assigned the time at which PCA processed the recording
  • Customer Sentiment and Customer Sentiment Trend – Show the overall caller sentiment and, importantly, whether the caller was more positive at the end of the call than at the beginning
  • Language Code – Shows the specified language or the automatically detected dominant language of the call

Call details

Choose the most recently received call to open and explore the call detail page. You can review the call information and analytics such as sentiment, talk time, interruptions, and loudness.

Scroll down to see the following details:

  • Entities grouped by entity type. Entities are detected by Amazon Comprehend and the sample entity recognizer string map.
  • Categories detected by Transcribe Call Analytics. By default, there are no categories; see Call categorization for more information.
  • Issues detected by the Transcribe Call Analytics built-in ML model. Issues succinctly capture the main reasons for the call. For more information, see Issue detection.

Scroll further to see the turn-by-turn transcription for the call, with annotations for speaker, time marker, sentiment, interruptions, issues, and entities.

Use the embedded media player to play the call audio from any point in the conversation. Set the position by choosing the time marker annotation on the transcript or by using the player time control. The audio player remains visible as you scroll down the page.

PII is redacted from both transcript and audio—redaction is enabled using the CloudFormation stack parameters.

Search based on call attributes

To try PCA’s built-in search, choose Search at the top of the screen. Under Sentiment, choose Average, Customer, and Negative to select the calls that had average negative customer sentiment.

Choose Clear to try a different filter. For Entities, enter Hyundai and then choose Search. Select the call from the search results and verify from the transcript that the customer was indeed calling about their Hyundai.

Search call transcripts

Transcript search is an experimental, optional, add-on feature powered by Amazon Kendra.

Open the transcript web UI using the URL shown as TranscriptionMediaSearchFinderURL in the main stack outputs. To find a recent call, enter the search query customer hit the wall.

The results show transcription extracts from relevant calls. Use the embedded audio player to play the associated section of the call recording.

You can expand Filter search results to refine the search results with additional filters. Choose Open Call Analytics to open the PCA call detail page for this call.

Query call analytics using SQL

You can integrate PCA call analytics data into a reporting or BI tool such as Amazon QuickSight by using Amazon Athena SQL queries. To try it, open the Athena query editor. For Database, choose pca.

Note the parsedresults table. This table contains all the turn-by-turn transcriptions and analysis for each call, using nested structures.

You can also review flattened result sets, which are simpler to integrate into your reporting or analytics application. Use the query editor to preview the data.
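
For example, the following boto3 sketch (with a placeholder Athena results bucket that you must replace) runs a simple query against the pca database and prints the rows:

# Run a query against the pca database with Athena and print the results
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")
qid = athena.start_query_execution(
    QueryString="SELECT * FROM parsedresults LIMIT 10",
    QueryExecutionContext={"Database": "pca"},
    ResultConfiguration={"OutputLocation": "s3://YOUR-ATHENA-RESULTS-BUCKET/"},
)["QueryExecutionId"]

# Poll until the query leaves the QUEUED/RUNNING states
while athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"] in ("QUEUED", "RUNNING"):
    time.sleep(1)

for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
    print([col.get("VarCharValue") for col in row["Data"]])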

Processing flow overview

How did PCA transcribe and analyze your phone call recordings? Let’s take a quick look at how it works.

The following diagram shows the main data processing components and how they fit together at a high level.

Call recording audio files are uploaded to the S3 bucket and folder, identified in the main stack outputs as InputBucket and InputBucketPrefix, respectively. The sample call recordings are automatically uploaded because you set the parameter loadSampleAudioFiles to true when you deployed PCA.

As each recording file is added to the input bucket, an S3 Event Notification triggers a Lambda function that initiates a workflow in Step Functions to process the file. The workflow orchestrates the steps to start an Amazon Transcribe batch job and process the results by doing entity detection and additional preparation of the call analytics results. Processed results are stored as JSON files in another S3 bucket and folder, identified in the main stack outputs as OutputBucket and OutputBucketPrefix.
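
The solution deploys this trigger for you; as an illustration of the pattern (not the solution’s actual function), a handler like the following starts a Step Functions execution for each uploaded recording, with STATE_MACHINE_ARN as a placeholder environment variable:

# Lambda handler: start a Step Functions execution per S3 object created
import json
import os
import boto3

sfn = boto3.client("stepfunctions")

def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        sfn.start_execution(
            stateMachineArn=os.environ["STATE_MACHINE_ARN"],
            input=json.dumps({"bucket": bucket, "key": key}),
        )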

As the Step Functions workflow creates each JSON results file in the output bucket, an S3 Event Notification triggers a Lambda function, which loads selected call metadata into a DynamoDB table.

The PCA UI web app queries the DynamoDB table to retrieve the list of processed calls to display on the home page. The call detail page reads additional detailed transcription and analytics from the JSON results file for the selected call.

Amazon S3 Lifecycle policies delete recordings and JSON files from both input and output buckets after a configurable retention period, defined by the deployment parameter RetentionDays. S3 Event Notifications and Lambda functions keep the DynamoDB table synchronized as files are both created and deleted.
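
The solution manages these policies for you via the RetentionDays parameter; for illustration, an equivalent rule could be applied by hand with boto3 (bucket name, prefix, and retention period are placeholders):

# Expire objects under a prefix after a retention period
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="YOUR-PCA-INPUT-BUCKET",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "ExpireRecordings",
            "Status": "Enabled",
            "Filter": {"Prefix": "YOUR-INPUT-PREFIX/"},
            "Expiration": {"Days": 90},
        }]
    },
)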

When the EnableTranscriptKendraSearch parameter is true, the Step Functions workflow also adds time markers and metadata attributes to the transcription, which are loaded into an Amazon Kendra index. The transcription search web application is used to search call transcriptions. For more information on how this works, see Make your audio and video files searchable using Amazon Transcribe and Amazon Kendra.

Monitoring and troubleshooting

AWS CloudFormation reports deployment failures and causes on the stack Events tab. See Troubleshooting CloudFormation for help with common deployment problems.

PCA provides runtime monitoring and logs for each component using CloudWatch:

  • Step Functions workflow – On the Step Functions console, open the workflow PostCallAnalyticsWorkflow. The Executions tab shows the status of each workflow run. Choose any run to see details. Choose CloudWatch Logs from the Execution event history to examine logs for any Lambda function that was invoked by the workflow.
  • PCA server and UI Lambda functions – On the Lambda console, filter by PostCallAnalytics to see all the PCA-related Lambda functions. Choose your function, and choose the Monitor tab to see function metrics. Choose View logs in CloudWatch to inspect function logs.

Cost assessment

For pricing information for the main services used by PCA, see the pricing pages for Amazon Transcribe Call Analytics, Amazon Comprehend, AWS Lambda, AWS Step Functions, Amazon DynamoDB, Amazon S3, and Amazon Kendra.

When transcription search is enabled, you incur an hourly cost for the Amazon Kendra index: $1.125/hour for the Developer Edition (first 750 hours are free), or $1.40/hour for the Enterprise Edition (recommended for production workloads).

All other PCA costs are incurred based on usage, and are Free Tier eligible. After the Free Tier allowance is consumed, usage costs add up to about $0.15 for a 5-minute call recording.

To explore PCA costs for yourself, use AWS Cost Explorer or choose Bill Details on the AWS Billing Dashboard to see your month-to-date spend by service.

Integrate with your contact center

You can configure your contact center to enable call recording. If possible, configure recordings for two channels (stereo), with customer audio on one channel (for example, channel 0) and the agent audio on the other channel (channel 1).

Via the AWS Command Line Interface (AWS CLI) or SDK, copy your contact center recording files to the PCA input bucket folder, identified in the main stack outputs as InputBucket and InputBucketPrefix. Alternatively, if you already save your call recordings to Amazon S3, use deployment parameters InputBucketName and InputBucketRawAudio to configure PCA to use your existing S3 bucket and prefix, so you don’t have to copy the files again.
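
For example, with boto3 (the bucket and prefix are placeholders for your stack’s InputBucket and InputBucketPrefix output values):

# Copy a call recording into the PCA input location
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="call-recording-2022-01-01.wav",
    Bucket="YOUR-INPUT-BUCKET",
    Key="YOUR-INPUT-PREFIX/call-recording-2022-01-01.wav",
)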

Customize your deployment

Use the following CloudFormation template parameters when creating or updating your stack to customize your PCA deployment:

  • To enable or disable the optional (experimental) transcription search feature, use EnableTranscriptKendraSearch.
  • To use your existing S3 bucket for incoming call recordings, use InputBucket and InputBucketPrefix.
  • To configure automatic deletion of recordings and call analysis data when using auto-provisioned S3 input and output buckets, use RetentionDays.
  • To detect call timestamp, agent name, or call identifier (GUID) from the recording file name, use FilenameDatetimeRegex, FilenameDatetimeFieldMap, FilenameGUIDRegex, and FilenameAgentRegex.
  • To use the standard Amazon Transcribe API instead of the default call analytics API, use TranscribeApiMode. PCA automatically reverts to the standard mode API for audio recordings that aren’t compatible with the call analytics API (for example, mono channel recordings). When using the standard API, some call analytics metrics, such as issue detection and speaker loudness, aren’t available.
  • To set the list of supported audio languages, use TranscribeLanguages.
  • To mask unwanted words, use VocabFilterMode and set VocabFilterName to the name of a vocabulary filter that you already created in Amazon Transcribe. See Vocabulary filtering for more information.
  • To improve transcription accuracy for technical and domain-specific acronyms and jargon, set VocabularyName to the name of a custom vocabulary that you already created in Amazon Transcribe. See Custom vocabularies for more information.
  • To configure PCA to use single-channel audio by default, and to identify speakers using speaker diarization rather than channel identification, use SpeakerSeparationType and MaxSpeakers. The default is to use channel identification with stereo files, using the Transcribe Call Analytics APIs to generate the richest analytics and most accurate speaker labeling.
  • To redact PII from the transcriptions or from the audio, set CallRedactionTranscript or CallRedactionAudio to true. See Redaction for more information.
  • To customize entity detection using Amazon Comprehend, or to provide your own CSV file to define entities, use the Entity detection parameters.

See the README on GitHub for more details on configuration options and operations for PCA.

PCA is an open-source project. You can fork the PCA GitHub repository, enhance the code, and send us pull requests so we can incorporate and share your improvements!

Clean up

When you’re finished experimenting with this solution, clean up your resources by opening the AWS CloudFormation console and deleting the PostCallAnalytics stacks that you deployed. This deletes the resources that were created by deploying the solution, except for the S3 buckets containing your audio recordings and analytics, and the CloudWatch log groups, which are retained after the stack is deleted to avoid deleting your data.

Live Call Analytics: Companion solution

Our companion solution, Live Call Analytics (LCA), offers real-time transcription and analytics capabilities by using the Amazon Transcribe and Amazon Comprehend real-time APIs. Unlike PCA, which transcribes and analyzes recorded audio after the call has ended, LCA transcribes and analyzes your calls as they are happening and provides real-time updates to supervisors and agents. You can configure LCA to store call recordings in PCA’s ingestion S3 bucket, and use the two solutions together to get the best of both worlds. See Live call analytics for your contact center with Amazon language AI services for more information.

Conclusion

The Post Call Analytics solution offers a scalable, cost-effective approach to provide call analytics with features to help improve your callers’ experience. It uses Amazon ML services like Transcribe Call Analytics and Amazon Comprehend to transcribe and extract rich insights from your customer conversations.

The sample PCA application is provided as open source—use it as a starting point for your own solution, and help us make it better by contributing back fixes and features via GitHub pull requests. For expert assistance, AWS Professional Services and other AWS Partners are here to help.

We’d love to hear from you. Let us know what you think in the comments section, or use the issues forum in the PCA GitHub repository.


About the Authors

Bob Strahan is a Principal Solutions Architect in the AWS Language AI Services team.

Dr. Andrew Kane is an AWS Principal WW Tech Lead (AI Language Services) based out of London. He focuses on the AWS Language and Vision AI services, helping our customers architect multiple AI services into a single use-case driven solution. Before joining AWS at the beginning of 2015, Andrew spent two decades working in the fields of signal processing, financial payments systems, weapons tracking, and editorial and publishing systems. He is a keen karate enthusiast (just one belt away from Black Belt) and is also an avid home-brewer, using automated brewing hardware and other IoT sensors.

Steve Engledow is a Solutions Engineer working with internal and external AWS customers to build reusable solutions to common problems.

Connor Kirkpatrick is an AWS Solutions Engineer based in the UK. Connor works with the AWS Solution Architects to create standardised tools, code samples, demonstrations, and quickstarts. He is an enthusiastic rower, wobbly cyclist, and occasional baker.

Franco Rezabek is an AWS Solutions Engineer based in London, UK. Franco works with AWS Solution Architects to create standardized tools, code samples, demonstrations, and quick starts.


Build custom Amazon SageMaker PyTorch models for real-time handwriting text recognition

In many industries, including financial services, banking, healthcare, legal, and real estate, automating document handling is an essential part of the business and customer service. In addition, strict compliance regulations make it necessary for businesses to handle sensitive documents, especially customer data, properly. Documents can come in a variety of formats, including digital forms or scanned documents (either PDF or images), and can include typed, handwritten, or embedded forms and tables. Manually extracting data and insight from these documents can be error-prone, expensive, time-consuming, and not scalable to a high volume of documents.

Optical character recognition (OCR) technology for recognizing typed characters has been around for years. Many companies still extract data from scanned documents like PDFs, images, tables, and forms manually, or use simple OCR software that requires manual configuration and often needs to be reconfigured when a form changes.

The digital document is often a scan or image of a document, and therefore you can use machine learning (ML) models to automatically extract text and information (such as tables, images, captions, and key-pair values) from the document. For example, Amazon Textract, an API-based AI-enabled service, offers such capabilities with built-in trained models, which you can use in applications without the need for any ML skills. At the same time, custom ML models use computer vision (CV) techniques to automate text extraction from images; this is particularly helpful when handwritten text needs to be extracted, which is a challenging problem. This is also known as handwriting recognition (HWR), or handwritten text recognition (HTR). HTR can lead to making documents with handwritten content searchable or for storing the content of older documents and forms in modern databases.

Unlike standard text recognition that can be trained on documents with typed content or synthetic datasets that are easy to generate and inexpensive to obtain, HWR comes with many challenges. These challenges include variability in writing styles, low quality of old scanned documents, and collecting good quality labeled training datasets, which can be expensive or hard to collect.

In this post, we share the processes, scripts, and best practices to develop a custom ML model in Amazon SageMaker that applies deep learning (DL) techniques based on the concept outlined in the paper GNHK: A Dataset for English Handwriting in the Wild to transcribe text in images of handwritten passages into strings. If you have your own data, you can use this solution to label your data and train a new model with it. The solution also deploys the trained models as endpoints that you can use to perform inference on actual documents and convert handwriting script to text. We explain how you can create a secure public gateway to your endpoint by using Amazon API Gateway.

Prerequisites

To try out the solution in your own account, make sure that you have the following in place:

We recommend using the JumpStart solution, which creates resources that are properly set up and configured to run the solution successfully.

You can also use your own data to train the models, in which case you need to have images of handwritten text stored in Amazon Simple Storage Service (Amazon S3).

Solution overview

In the next sections, we walk you through each step of creating the resources outlined in the following architecture. However, launching the solution with SageMaker JumpStart in your account automatically creates these resources for you.

Launching this solution creates multiple resources in your account, including seven sample notebooks, multiple accompanying custom scripts that we use in training models and inference, and two pre-built demo endpoints that you can use for real-time inference if you don’t want to do the end-to-end training and hosting. The notebooks are as follows:

  • Demo notebook – Shows you how to use the demo endpoints for real-time handwritten text recognition
  • Introduction – Explains the architecture and the different stages of the solution
  • Labeling your own data – Shows you how to use Amazon SageMaker Ground Truth to label your own dataset
  • Data visualization – Visualizes the outcomes of data labeling
  • Model training – Trains custom PyTorch models with GNHK data
  • Model training with your own data – Allows you to use your own labeled data for training models
  • Endpoints – Creates SageMaker endpoints with custom trained models

GNHK data overview

This solution uses the GoodNotes Handwriting Kollection (GNHK) dataset released by GoodNotes under CC-BY-4.0 License. This dataset is presented in a paper titled GNHK: A Dataset for English Handwriting in the Wild at the International Conference on Document Analysis and Recognition (ICDAR) in 2021, with the following citation:

@inproceedings{Lee2021,
  author={Lee, Alex W. C. and Chung, Jonathan and Lee, Marco},
  booktitle={International Conference on Document Analysis and Recognition (ICDAR)},
  title={GNHK: A Dataset for English Handwriting in the Wild},
  year={2021},
}

The GNHK dataset includes images of English handwritten text to allow ML practitioners and researchers to investigate new handwritten text recognition techniques. You can download the data for SageMaker training and testing in manifest format, which includes images, bounding box coordinates, and text strings for each bounding box. The following figure shows one of the images that is part of the training dataset.

Use your own labeled dataset

If you don’t want to use the GNHK dataset for training, you can train the models with your own data. If your data is labeled with the bounding box coordinates, you can create a custom manifest training file with the following format and readily use it for training the models. In this manifest file format, each line is a JSON that includes the following content:

{"source-ref": "FILE_NAME.jpg",
 "annotations":
  {"texts":
   [{"text": "FIRST_BOUNDING_BOX_CONTENT_TEXT",
     "polygon": [{"x": 178, "y": 253},
      {"x": 172, "y": 350},
      {"x": 627, "y": 313},
      {"x": 615, "y": 421}]},
    {"text": "SECOND_BOUNDING_BOX_CONTENT_TEXT",
     "polygon": [{"x": 713, "y": 307},
      {"x": 990, "y": 322},
      {"x": 710, "y": 404},
      {"x": 950, "y": 413}]},
...
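
If you generate this file programmatically, a small sketch like the following (the field names mirror the format above) writes one JSON line per labeled image:

# Write one JSON manifest line per labeled image
import json

def manifest_line(image_name, labels):
    """labels: list of (text, polygon) pairs, polygon as [{'x':..,'y':..}, ...]."""
    record = {
        "source-ref": image_name,
        "annotations": {
            "texts": [{"text": t, "polygon": p} for t, p in labels]
        },
    }
    return json.dumps(record)

with open("train.manifest", "w") as f:
    f.write(manifest_line("sample.jpg",
                          [("hello", [{"x": 10, "y": 10}, {"x": 90, "y": 10},
                                      {"x": 10, "y": 40}, {"x": 90, "y": 40}])]) + "\n")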

Label your raw data using Ground Truth

If you don’t have a labeled training dataset, you can use Ground Truth to label your data using your private workforce or external resources such as Amazon Mechanical Turk. Ground Truth is a fully managed data labeling service that makes it easy to build highly accurate training datasets for ML. Ground Truth offers built-in workflows that support a variety of use cases, including text, images, and video. In addition, Ground Truth offers automatic data labeling, which uses an ML model to label your data. The following figure illustrates how Ground Truth works.

The JumpStart solution that is launched in your account creates a sample notebook (label_own_data.ipynb) that allows you to create Ground Truth labeling jobs to label your data using your private workforce. For details on how to set up labeling jobs for images as well as training and tutorial resources, see SageMaker Ground Truth Data Labeling Resources.

When data labeling is complete, you can use the accompanying data_visualization.ipynb notebook to visualize the results of the data labeling.

Train word segmentation and handwriting text recognition models

Now that the labeled data is prepared, you can use that to train a model that can recognize handwritten passages and return the text string equivalents. In this section, we walk you through this process and explain each step of building and training the models. We use PyTorch to take advantage of state-of-the-art frameworks for object detection and text recognition. You can also develop the same approach using other deep learning frameworks, such as TensorFlow or MXNet. SageMaker provides pre-built Docker images that include deep learning framework libraries and other dependencies needed for training and inference. For a complete list of pre-built Docker images, see Available Deep Learning Containers Images.

Train and test datasets

Before we get started with model training, we need to have a training dataset and a test (or validation) dataset to validate the trained model performance. The GNHK dataset already offers two separate datasets for training and testing in SageMaker manifest format, and this solution uses these datasets. If you want to use your own labeled dataset, the easiest way is to split a labeled data manifest file into train and test sets (for example, 80% training and 20% test).

SageMaker reads the training and test datasets from Amazon S3. After splitting the data, you need to store the manifest files and the corresponding images to Amazon S3, and then use the URI links in the training scripts, as outlined in the following sections.
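
For example, the following sketch (bucket and prefix names are placeholders) shuffles a single manifest into an 80/20 split and uploads both files to Amazon S3:

# Split a manifest file 80/20 into train/test sets and upload them to S3
import random
import boto3

s3 = boto3.client("s3")

with open("all.manifest") as f:
    lines = f.readlines()

random.shuffle(lines)
split = int(0.8 * len(lines))
for name, subset in (("train.manifest", lines[:split]), ("test.manifest", lines[split:])):
    with open(name, "w") as out:
        out.writelines(subset)
    s3.upload_file(name, "YOUR-DATA-BUCKET", f"htr-data/{name}")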

Train the word segmentation model

For inference on images of handwritten text that consists of multiple lines and each line of multiple words, we need to create two models. The first model segments the image into single words by using bounding box prediction (or localization); the second model runs text recognition on each segment separately. Each model is hosted on a SageMaker inference endpoint for real-time inference. Both models use PyTorch framework containers for version 1.6.0. For more details on training and deploying models with PyTorch, including requirements for training and inference scripts, see Use PyTorch with the SageMaker Python SDK. For training purposes, we use the SageMaker PyTorch estimator class. For more details, see Create an Estimator. For training, you need a custom training script as the entry point. When you launch this JumpStart solution in your account, SageMaker automatically adds all accompanying custom training and inference code to your files. For the localization model, we use the custom 1_train_localisation.py code under the src_localisation folder. The estimator uses one GPU-based instance for training.

In the following code snippet, we define model performance metrics and create a PyTorch estimator class with the entry point directing to the training script directory in the code repository. At the end, we launch the training by calling the .fit method on the estimator with the training dataset and validation on the test dataset.

# Define model performance metrics (regexes extract values from the training logs)
metric_definitions = [
    {
        "Name": "iter",
        "Regex": ".*iter:\s([0-9\.]+)\s*"
    },
    {
        "Name": "total_loss",
        "Regex": ".*total_loss:\s([0-9\.]+)\s*"
    }
]

# Define PyTorch estimator class, and then call .fit method to launch training
import sagemaker
from sagemaker.pytorch import PyTorch

session = sagemaker.session.Session()
# sagemaker_config is created for you when you launch the JumpStart solution
role = sagemaker_config["SageMakerIamRole"]

localization_estimator = PyTorch(entry_point='1_train_localisation.py',
                                 source_dir='src_localisation',
                                 dependencies=['utils', 'configs'],
                                 role=role,
                                 instance_type=sagemaker_config["SageMakerTrainingInstanceType"],
                                 instance_count=1,
                                 output_path=output_path_s3_url,
                                 framework_version='1.6.0',
                                 py_version='py3',
                                 metric_definitions=metric_definitions,
                                 base_job_name='htr-word-segmentation',
                                 sagemaker_session=session
                                )

# Launch training; wait=False returns immediately so both jobs can run in parallel
localization_estimator.fit({"train": train_dataset_s3_uri,
                            "test": test_dataset_s3_uri},
                           wait=False)

Train the handwriting text recognition model

After the word segments are determined by the previous model, the next piece of the inference pipeline is to run handwriting recognition inference on each segment. The process is the same, but this time we use a different custom training script, 2_train_recogniser.py under src_recognition, as the entry point for the estimator, and train a new model. Like the previous model, this model is trained on the train dataset and its performance is evaluated on the test dataset. If you launch the JumpStart solution in your account, SageMaker automatically adds these source files to your files in your Studio domain.

# Define model performance metrics
metric_definitions = [
    {'Name': 'Iteration', 'Regex': 'Iteration ([-+]?[0-9]*[.]?[0-9]+([eE][-+]?[0-9]+)?)'},
    {'Name': 'train_loss', 'Regex': 'Train loss ([-+]?[0-9]*[.]?[0-9]+([eE][-+]?[0-9]+)?)'},
    {'Name': 'test_loss',  'Regex': 'Test loss ([-+]?[0-9]*[.]?[0-9]+([eE][-+]?[0-9]+)?)'}
]

# Define PyTorch estimator class, and then call .fit method to launch training
recognition_estimator = PyTorch(entry_point='2_train_recogniser.py',
                                source_dir='src_recognition',
                                dependencies=['utils', 'configs'],
                                role=role,
                                instance_type=sagemaker_config["SageMakerTrainingInstanceType"],
                                instance_count=1,
                                output_path=output_path_s3_url,
                                framework_version='1.6.0',
                                py_version='py3',
                                metric_definitions=metric_definitions,
                                base_job_name='htr-text-recognition',
                                sagemaker_session=session
                               )

recognition_estimator.fit({"train": train_dataset_s3_uri,
                           "test": test_dataset_s3_uri},
                          wait=False)

Next, we attach the estimators to the training jobs and wait until training is complete before deploying the models. If the status of a training job is Completed, attach returns immediately and the estimator can be deployed to create a SageMaker endpoint and return a predictor; if the job is still in progress, attach blocks and displays log messages from the training job until it completes. Each training job may take around 1 hour to complete.

localization_estimator = PyTorch.attach(training_job_name=localization_estimator.latest_training_job.name,
                                        sagemaker_session=session)
recognition_estimator = PyTorch.attach(training_job_name=recognition_estimator.latest_training_job.name,
                                       sagemaker_session=session)

When both model trainings are complete, you can move on to the next step, which is creating an endpoint for real-time inference on images using the two models we just trained.

Create SageMaker endpoints for real-time inference

The next step in building this solution is to create endpoints with the trained models that we can use for real-time inference on handwritten text. We walk you through the steps of downloading the model artifacts, creating model containers, deploying the containers, and finally using the deployed models to make real-time inference on a demo image or your own image.

First, we need to retrieve the trained model artifacts from Amazon S3. After each training job, SageMaker stores the trained model as a tarball (.tar.gz) in Amazon S3, which you can download and use inside or outside of SageMaker. For this purpose, the following code snippet uses three utility functions (get_latest_training_job, get_model_data, and parse_model_data) from the sm_utils folder, which is automatically added to your files in Studio when you launch the JumpStart solution in your account. The script shows how to download the PyTorch word segmentation (or localization) model data, repackage it into a tarball, and copy it to Amazon S3 for building the model later. You can repeat this process for the text recognition model.

import os

from utils.sm_utils import get_latest_training_job, get_model_data, parse_model_data

# Download word segmentation model, rename it for packaging
os.mkdir("model_word_seg")

word_seg_training_job = get_latest_training_job('htr-word-segmentation')
word_seg_s3 = get_model_data(word_seg_training_job)
parse_model_data(word_seg_s3, "model_word_seg")

os.rename("model_word_seg/mask_rcnn/model_final.pth", "model_word_seg/mask_rcnn/model.pth")

# Repackage the model and copy to S3 for building the model later
!tar -czvf model.tar.gz --directory=model_word_seg/mask_rcnn/ model.pth
!aws s3 cp model.tar.gz s3://<YOUR-S3-BUCKET>/custom_data/artifacts/word-seg/model.tar.gz

Now that we have the trained model files, it’s easy to create a model container in SageMaker. Because we trained the model with the PyTorch estimator class, we can use the PyTorch model class to create a model container that uses a custom inference script. See Deploy PyTorch Models for more details. After we create the model, we can create a predictor by deploying the model as an endpoint for real-time inference. You can change the number of instances for your endpoint or select a different accelerated computing (GPU) instance from the list of available instances for real-time inference. The PyTorch model class uses the inference.py script for each model that is added to your files when you launch the JumpStart solution in your Studio domain. In the following code, we create the word segmentation model. You can follow the same approach for creating the text recognition model.

from sagemaker.pytorch import PyTorchModel

# Create word segmentation model
# (model_name is a name of your choice, for example 'htr-word-seg-model')
seg_model = PyTorchModel(model_data='s3://<YOUR-S3-BUCKET>/custom_data/artifacts/word-seg/model.tar.gz',
                         role=role,
                         source_dir="src_localisation",
                         entry_point="inference.py",
                         framework_version="1.6.0",
                         name=model_name,
                         py_version="py3"
                        )

Now we can build the endpoint by calling the .deploy method on the model, which creates a predictor. We also attach the serializer and deserializer to the endpoint. You can follow the same approach for the second model, for text recognition.

# Deploy word segmentation model to an endpoint
# (endpoint names can't contain underscores, so we use hyphens)
localisation_predictor = seg_model.deploy(instance_type=sagemaker_config["SageMakerInferenceInstanceType"],
                                          endpoint_name='word-segmentation-endpoint',
                                          initial_instance_count=1,
                                          deserializer=sagemaker.deserializers.JSONDeserializer(),
                                          serializer=sagemaker.serializers.JSONSerializer(),
                                          wait=False)

Endpoint creation should take about 6–7 minutes to complete. The following code creates a waiter that returns when the endpoint is in service.

import boto3

client = boto3.client('sagemaker')
waiter = client.get_waiter('endpoint_in_service')
waiter.wait(EndpointName='word-segmentation-endpoint')

When the model deployments are complete, you can send an image of a handwritten passage to the first endpoint to get the bounding boxes and their coordinates for each word. Then use those coordinates to crop each bounding box and send them to the second endpoint individually and get the recognized text string for each bounding box. You can then take the outputs of the two endpoints and overlay the bounding boxes and the text on the raw image, or use the outputs in your downstream processes.
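
To make this flow concrete, the following is a minimal sketch of chaining the two endpoints with boto3. The payload shape (an 'image' field), the bounding box key names, and the 'text-recognition-endpoint' name are illustrative assumptions; the actual request/response contract is defined by each model's inference.py script.

import json

import boto3
from PIL import Image

runtime = boto3.client('sagemaker-runtime')

def invoke(endpoint_name, payload):
    # Send a JSON payload to a real-time endpoint and parse the JSON response
    response = runtime.invoke_endpoint(EndpointName=endpoint_name,
                                       ContentType='application/json',
                                       Body=json.dumps(payload))
    return json.loads(response['Body'].read())

image = Image.open('passage.png').convert('RGB')  # an image of a handwritten passage

# Step 1: get word bounding boxes from the segmentation endpoint
boxes = invoke('word-segmentation-endpoint', {'image': list(image.getdata())})

# Step 2: crop each box and run text recognition on it, one word at a time
words = [invoke('text-recognition-endpoint',
                {'image': list(image.crop((b['left'], b['top'], b['right'], b['bottom'])).getdata())})
         for b in boxes]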

The following diagram illustrates the overall process workflow.

Extensions

Now that you have working endpoints that are making real-time inference, you can use them for your applications or website. However, your SageMaker endpoints are still not public facing; you need to create an Amazon API Gateway API to allow external traffic to reach your SageMaker endpoints. API Gateway is a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale. You can use API Gateway to present an external-facing, single point of entry for SageMaker endpoints, and provide security, throttling, authentication, and firewall capabilities through AWS WAF, and much more. With API Gateway mapping templates, you can invoke your SageMaker endpoint with a REST API request and receive an API response back. Mapping templates enable you to integrate your API Gateway directly with SageMaker endpoints without the need for any intermediate AWS Lambda function, making your online applications faster and cheaper. To create an API Gateway and use it to make real-time inference with your SageMaker endpoints (as in the following architecture), see Creating a machine learning-powered REST API with Amazon API Gateway mapping templates and Amazon SageMaker.

Conclusion

In this post, we explained an end-to-end solution for recognizing handwritten text using SageMaker custom models. The solution included labeling training data using Ground Truth, training models with PyTorch estimator classes and custom scripts, and creating SageMaker endpoints for real-time inference. We also explained how you can create a public API Gateway that can be securely used with your mobile applications or website.

For more SageMaker examples, visit the GitHub repository. In addition, you can find more PyTorch bring-your-own-script examples.

For more SageMaker Python examples for MXNet, TensorFlow, and PyTorch, visit Amazon SageMaker Pre-Built Framework Containers and the Python SDK.

For more Ground Truth examples, visit Introduction to Ground Truth Labeling Jobs. Additional information about SageMaker can be found in the technical documentation.


About the Authors

Jonathan Chung is an Applied Scientist at Halo Health tech. He works on applying classical signal processing and deep learning techniques to time series and biometrics data. Previously, he was an Applied Scientist at AWS. He enjoys cooking and visiting historical cities around the world.

Dr. Nick Minaie is the Manager of Data Science and Business Intelligence at Amazon, leading innovative machine learning product development for Amazon’s Time and Attendance team. Previously, he served as a Senior AI/ML Solutions Architect at AWS, helping customers on their journey to well-architected machine learning solutions at scale. In his spare time, Nick enjoys family time, abstract painting, and exploring nature.

Dr. Li Zhang is a Principal Product Manager-Technical for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms, services that help data scientists and machine learning practitioners get started with training and deploying their models, including with reinforcement learning on Amazon SageMaker. His past work as a principal research staff member and master inventor at IBM Research won the Test of Time Paper Award at IEEE INFOCOM.

Shenghua Yue is a Software Development Engineer at Amazon SageMaker. She focuses on building machine learning tools and products for customers. Outside of work, she enjoys the outdoors, yoga, and hiking.

Read More

Achieve 35% faster training with Hugging Face Deep Learning Containers on Amazon SageMaker

Natural language processing (NLP) has been a hot topic in the AI field for some time. As current NLP models get larger and larger, data scientists and developers struggle to set up the infrastructure for such growth of model size. For faster training time, distributed training across multiple machines is a natural choice for developers. However, distributed training comes with extra node communication overhead, which negatively impacts the efficiency of model training.

This post shows how to pretrain an NLP model (ALBERT) on Amazon SageMaker by using the Hugging Face Deep Learning Container (DLC) and the transformers library. We also demonstrate how the SageMaker distributed data parallel (SMDDP) library can provide up to a 35% faster training time compared with PyTorch’s distributed data parallel (DDP) library.

SageMaker and Hugging Face

SageMaker is a cloud machine learning (ML) platform from AWS. It helps data scientists and developers prepare, build, train, and deploy high-quality ML models by bringing together a broad set of capabilities purpose-built for ML.

Hugging Face’s transformers library is the most popular open-source library for innovative NLP and computer vision. It provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, and text generation in over 100 languages.

AWS and Hugging Face collaborated to create an Amazon SageMaker Hugging Face DLC for training and inference. With SageMaker, you can scale training from a small cluster to a large one without the need to manage the infrastructure on your own. With the help of the SageMaker enhancement libraries and AWS Deep Learning Containers, we can significantly speed up NLP model training.

Solution overview

In this section, we discuss the various components to set up our model training.

ALBERT model

ALBERT, released in 2019, is an optimized version of BERT. ALBERT-large uses 18 times fewer parameters in size and is 1.7 times faster in training speed than BERT-large. For more details, refer to the original paper, ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.

Compared with BERT, ALBERT applies two parameter-reduction techniques:

  • Factorized embedding parameterization – Decomposes the large vocabulary embedding matrix into two smaller matrices, which decouples the embedding size from the hidden layer size
  • Cross-layer parameter sharing – Shares all parameters across layers, which reduces the total parameter count by a factor of 18

Pretrain task

In this post, we train the ALBERT-base model (11 million parameters) using the most commonly used task in NLP pretraining: masked language modeling (MLM). MLM replaces input tokens with mask tokens randomly and trains the model to predict the masked ones. To simplify the training procedure, we removed the sentence order prediction task and kept the MLM task.
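
As an illustration of what MLM data preparation looks like in code, the following is a minimal sketch using the transformers library's DataCollatorForLanguageModeling, which by default randomly replaces 15% of input tokens with the mask token; the model name here is just an example.

from transformers import AlbertTokenizerFast, DataCollatorForLanguageModeling

tokenizer = AlbertTokenizerFast.from_pretrained('albert-base-v2')
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True,
                                           mlm_probability=0.15)

# Tokenize a sample sentence and let the collator mask tokens at random;
# 'labels' keeps the original ids so the model learns to recover them
batch = collator([tokenizer("Distributed training reduces wall-clock time.")])
print(batch['input_ids'])  # some tokens replaced by tokenizer.mask_token_id
print(batch['labels'])     # original ids at masked positions, -100 elsewhere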

Set up the number of training steps and global batch sizes at different scales

To make a fair comparison across different training scales (namely, different numbers of nodes), we train with different numbers of nodes but on the same number of examples. For example, if we set the single-GPU batch size to 16:

  • Two nodes (16 GPUs) train for 2,500 steps with a global batch size of 256
  • Four nodes (32 GPUs) train for 1,250 steps with a global batch size of 512

Dataset

As in the original ALBERT paper, the dataset we used for the ALBERT pretraining is the English Wikipedia Dataset and Book Corpus. This collection is taken from English-language Wikipedia and more than 11,000 English-language books. After being preprocessed and tokenized from the text, the total dataset size is around 75 GB and stored in an Amazon Simple Storage Service (Amazon S3) bucket.

In practice, we used the Amazon S3 plugin to stream the data. The S3 plugin is a high-performance PyTorch dataset library that can directly and efficiently access datasets stored in S3 buckets.
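
A minimal sketch of streaming training data this way might look like the following; the import path and constructor arguments follow the plugin's documentation at the time, and the bucket URL is a placeholder.

from awsio.python.lib.io.s3.s3dataset import S3IterableDataset
from torch.utils.data import DataLoader

# Stream shards directly from S3 instead of downloading the full ~75 GB first
dataset = S3IterableDataset('s3://<your-bucket>/albert/wiki-bookcorpus/', shuffle_urls=True)
loader = DataLoader(dataset, batch_size=16, num_workers=12)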

Metrics

In this post, we focus on two performance metrics:

  • Throughput – How many samples are processed per second
  • Scaling efficiency – Defined as T(N) / (N × T(1)), where T(1) is the throughput of one node and T(N) is the throughput of N nodes; see the sketch after this list
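
As a quick sanity check of the definition, the following computes scaling efficiency from throughput measurements; applied to the table in the next section, it reproduces the reported values when the 2-node run is taken as the baseline.

def scaling_efficiency(throughput_n, throughput_base, scale_factor):
    """T(N) / (N * T(1)): how close we are to perfectly linear scaling."""
    return throughput_n / (scale_factor * throughput_base)

# 4 nodes vs. the 2-node baseline (2x the nodes): matches 0.84043 in the table
print(scaling_efficiency(5436.92949, 3234.6185, 2))  # ~0.8404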

Infrastructure

We use P4d instances in SageMaker to train our model. P4d instances are powered by the latest NVIDIA A100 Tensor Core GPUs and deliver exceptionally high throughput and low latency networking. These instances are the first in the cloud to support 400 Gbps instance networking.

For SageMaker training, we prepared a Docker container image based on AWS Deep Learning Containers, which has PyTorch 1.8.1 and Elastic Fabric Adapter (EFA) enabled.

Tuning the number of data loader workers

The num_workers parameter indicates how many subprocesses to use for data loading. This parameter is set to zero by default. Zero means that the data is loaded in the main process.

In the early stage of our experiments, we scaled the SageMaker distributed data parallel library trainings from 2 nodes to 16 nodes and kept the default num_workers parameter unchanged. We noticed that scaling efficiency kept reducing, as shown in the following table.

Nodes | Algorithm | Train time (s) | Throughput (samples/s) | num_workers | Scaling efficiency | max_step | global_batch_size
2     | SMDDP     | 197.8595       | 3234.6185              | 0           | 1.00000            | 2500     | 256
4     | SMDDP     | 117.7135       | 5436.92949             | 0           | 0.84043            | 1250     | 512
8     | SMDDP     | 79.0686        | 8094.23716             | 0           | 0.62559            | 625      | 1024
16    | SMDDP     | 59.2767        | 10814.09728            | 0           | 0.41724            | 313      | 2048

Increasing the num_workers parameter of the data loader can let more CPU cores handle data preparation for GPU computation, which helps the training run faster. An AWS p4d instance has 96 vCPUs, which gives us plenty of space to tune the number of data loader workers.
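
For context, num_workers is set on the PyTorch DataLoader; the following is a minimal sketch, where the dataset is a placeholder standing in for the tokenized corpus.

import os

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for the tokenized Wikipedia/BookCorpus data
train_dataset = TensorDataset(torch.randn(1024, 128))

# Heuristic discussed below: vCPUs divided by the number of GPU processes
# (96 vCPUs / 8 GPUs = 12 on a p4d instance)
num_workers = (os.cpu_count() or 1) // max(torch.cuda.device_count(), 1)

loader = DataLoader(train_dataset,
                    batch_size=16,            # single-GPU batch size used in our runs
                    num_workers=num_workers,  # subprocesses preparing batches
                    pin_memory=True)          # faster host-to-GPU copies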

We designed our experiments to find the best value for num_workers. We gradually increased the data loader worker number under different training scales. These results are generated using the SageMaker distributed data parallel library.

We tuned the following parameters:

  • Data loader number of workers: 0, 4, 8, 12, 16
  • Node number: 2, 4, 8, 16
  • Single GPU batch size: 16

As we can see from the following results, throughput and scaling efficiency saturated when the data loader worker number reached 12.

Next, we wanted to see if this situation would be similar if the global batch size changed, which indicates that the upper bound of the throughput changed. We set the single GPU batch size to 32 and retrained the models. We tuned the following parameters:

  • Data loader number of workers: 0, 4, 8, 12, 16
  • Node number: 2, 4, 8, 16
  • Single GPU batch size: 32

The following figure shows our results.

We got similar results from this second set of experiments, with 12 data loader workers still providing the best result.

From the two preceding sets of results, we can see that a good starting point is to set the data loader worker number equal to the number of vCPUs available per GPU process. For example, the P4d instance has 96 vCPUs and 8 GPU processes, or 12 vCPUs per process on average, so we can set the data loader worker number to 12.

This is a good empirical rule. However, different hardware and local batch sizes may introduce some variance, so we suggest tuning the data loader worker number for your own use case.

Finally, let’s look at how much improvement we got from tuning the number of data loader workers.

The following graphs show the throughput comparison with a batch size of 16 and 32. In both cases, we can observe consistent throughput gains by increasing the data loader worker number from zero to 12.

The following table summarizes our throughput increase when we compare throughput between data loader worker numbers equal to 0 and 12.

Nodes | Throughput increase (local batch size 16) | Throughput increase (local batch size 32)
2     | 15.53%                                    | 25.24%
4     | 23.98%                                    | 40.89%
8     | 41.14%                                    | 65.15%
16    | 60.37%                                    | 102.16%

Compared with the default data loader worker number setup, setting the data loader worker number to 12 results in a 102% throughput increase. This means we made training twice as fast simply by using the ample hardware resources of the P4d instance.

SMDDP vs. DDP

The SageMaker distributed data parallel library for PyTorch implements torch.distributed APIs, optimizing network communication by using the AWS network infrastructure and topology. In particular, SMDDP optimizes key collective communication primitives used throughout the training loop. SMDDP is available through the Amazon Elastic Container Registry (Amazon ECR) in the SageMaker training platform. You can start a SageMaker training job from the SageMaker Python SDK or the SageMaker APIs through the AWS SDK for Python (Boto3) or the AWS Command Line Interface (AWS CLI).
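
Enabling the library in a SageMaker training job is a small change to the estimator. The following is a minimal sketch; the entry point script is a placeholder, and role and the S3 path are assumed to be defined for your account.

from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point='train_albert.py',   # placeholder training script
                    role=role,
                    instance_type='ml.p4d.24xlarge',
                    instance_count=2,                # 2 nodes = 16 A100 GPUs
                    framework_version='1.8.1',
                    py_version='py36',
                    # Swap PyTorch DDP for the SageMaker data parallel library
                    distribution={'smdistributed': {'dataparallel': {'enabled': True}}})

estimator.fit('s3://<your-bucket>/albert/dataset/')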

The distributed data parallel library is PyTorch’s data parallelism module. It implements data parallelism at the module level, which can run across multiple machines.

We compared SMDDP and DDP. As previous sections suggested, we set hyperparameters as follows:

  • Data loader worker number: 12
  • Single GPU batch size: 16

The following graphs compare throughput and scaling efficiency.

The following table summarizes our throughput speed increase.

Nodes | SMDDP throughput speed increase
2     | 13.96%
4     | 33.07%
8     | 34.94%
16    | 31.64%

From 2 nodes (16 A100 GPUs) to 16 nodes (128 A100 GPUs) in the ALBERT trainings, SMDDP consistently performed better than DDP. When we have more nodes and GPUs, we benefit from using SMDDP.

Summary

In this post, we demonstrated how we can use SageMaker to scale our NLP training jobs from 16 GPUs to 128 GPUs by changing a few lines of code. We also discussed why it’s important to tune the data loader worker number parameter. It provides up to a 102.16% increase in the 16-node training case, and setting that parameter to the vCPU number divided by the number of processes can be a good starting point. In our tests, SMDDP performed much better (almost 35% better) than DDP when increasing the training scale. The larger the scale we use, the more time and money SMDDP can save.

For detailed instructions on how to run the training in this post, we will provide the open-source training code in the AWS Samples GitHub repo soon.


About the Authors

Yu Liu is a Software Developer with AWS Deep Learning. He focuses on optimizing distributed Deep Learning models and systems. In his spare time, he enjoys traveling, singing and exploring new technologies. He is also a metaverse believer.

Roshani Nagmote is a Software Developer for AWS Deep Learning. She focuses on building distributed Deep Learning systems and innovative tools to make Deep Learning accessible for all. In her spare time, she enjoys hiking, exploring new places and is a huge dog lover.

Khaled ElGalaind is the engineering manager for AWS Deep Engine Benchmarking, focusing on performance improvements for Amazon Machine Learning customers. Khaled is passionate about democratizing deep learning. Outside of work, he enjoys volunteering with the Boy Scouts, BBQ, and hiking in Yosemite.

Michele Monclova is a principal product manager at AWS on the SageMaker team. She is a native New Yorker and Silicon Valley veteran. She is passionate about innovations that improve our quality of life.

Philipp Schmid is a Machine Learning Engineer and Tech Lead at Hugging Face, where he leads the collaboration with the Amazon SageMaker team. He is passionate about democratizing, optimizing, and productionizing cutting-edge NLP models and improving the ease of use for Deep Learning.

Jeff Boudier builds products at Hugging Face, creator of Transformers, the leading open-source ML library. Previously Jeff was a co-founder of Stupeflix, acquired by GoPro, where he served as director of Product Management, Product Marketing, Business Development and Corporate Development.

Read More

Build a computer vision model using Amazon Rekognition Custom Labels and compare the results with a custom trained TensorFlow model

Building accurate computer vision models to detect objects in images requires deep knowledge of each step in the process—from labeling, processing, and preparing the training and validation data, to making the right model choice and tuning the model’s hyperparameters adequately to achieve the maximum accuracy. Fortunately, these complex steps are simplified by Amazon Rekognition Custom Labels, a service of Amazon Rekognition that enables you to build your own custom computer vision models for image classification and object detection tasks without requiring any prior computer vision expertise or advanced programming skills.

In this post, we showcase how we can train a model to detect bees in images using Amazon Rekognition Custom Labels. We also compare these results against a custom-trained TensorFlow model (DIY model). We use Amazon SageMaker as the platform to develop and train our model. Finally, we demonstrate how to build a serverless architecture to process new images using Amazon Rekognition APIs.

When and where to use each model

Before diving deeper, it is important to understand the use cases that drive the decision of which model to use, whether it’s an Amazon Rekognition Custom Labels model or a DIY model.

Amazon Rekognition Custom Labels models are a great choice when the goal is to achieve maximum-quality results quickly. These models are heavily optimized and fine-tuned to deliver high precision and recall. This is a cloud service, so after the model is trained, images must be uploaded to the cloud to be analyzed. A great advantage of this service is that the user doesn’t need deep expertise to run the training pipeline: you can do it on the AWS Management Console with just a few clicks, and the service takes care of the heavy lifting of training and fine-tuning the model for you. It then offers a simple set of API calls, tailored to this specific model, for you to use when needed.

DIY models are the choice for advanced users with expertise in machine learning (ML). They allow you to control every aspect of the model, and tune the training data and the necessary parameters as needed. This requires advanced coding skills. These models trade off accuracy for latency: you can run them faster at the expense of lower qualitative performance. This lower latency fits really well in low bandwidth scenarios where the model needs to be deployed on the edge. For instance, IoT devices that support these models can host and run them and only upload the inference results to the cloud, which reduces the amount of data sent upstream.

Overview of solution

To build our DIY model, we follow the solution from the GitHub repo TensorFlow 2 Object Detection API SageMaker, which consists of these steps:

  1. Download and prepare our bee dataset.
  2. Train the model using a SageMaker custom container instance.
  3. Test the model using a SageMaker model endpoint.

After we have our DIY model, we can proceed with the steps to build our bee detection model using Amazon Rekognition Custom Labels:

  1. Deploy a serverless architecture using AWS CloudFormation.
  2. Download and prepare our bee dataset.
  3. Create a project in Amazon Rekognition Custom Labels and import the dataset.
  4. Train the Amazon Rekognition Custom Labels model.
  5. Test the Amazon Rekognition Custom Labels model through the automatically generated API endpoint, triggered by Amazon Simple Storage Service (Amazon S3) events.

Amazon Rekognition Custom Labels lets you manage the ML model training process on the Amazon Rekognition console, which simplifies the end-to-end process. After we train both models, we can compare them.

Set up the environment

We prepare our serverless environment using the CloudFormation template on GitHub. On the AWS CloudFormation console, we create a new stack and use the template.yaml file present in the root folder of our code repository. We provide a unique Amazon S3 bucket name when prompted, where our images are downloaded for further processing. We also provide a name for the inference processing Amazon Simple Queue Service (Amazon SQS) queue, as well as an AWS Key Management Service (AWS KMS) alias to securely encrypt the inference pipeline.

The architecture diagram is as follows, and it is used for detecting objects in new images as they are uploaded to our bucket.

Following the first notebook (1_prepare_data), we download and store our images in a bucket in Amazon S3. The dataset is already curated and annotated, and the images used have been licensed under CC0. For convenience, the dataset is stored in a single .zip archive: dataset.zip.

Inside the dataset folder, the manifest file output.manifest contains the bounding box annotations of the dataset. The Amazon S3 references of these images belong to a different S3 bucket where the images were annotated originally. To import this manifest in Amazon Rekognition Custom Labels, the notebook rewrites the manifest file according to the bucket name we chose.

Train your DIY model

To establish a comparison between a DIY and Amazon Rekognition Custom Labels model, we follow the steps in the following public repository that demonstrates how to train a TensorFlow2 model using the same dataset.

We follow the steps described in this repository to train an EfficientNet object detector using our bee dataset. We modify the training notebook so that it runs for 10,000 steps. The model trains for about 2 hours, achieving an average precision of 83% and a recall of 56%.

Create your Amazon Rekognition Custom Labels project

To create your bee detection project, complete the following steps:

  1. On the Amazon Rekognition console, choose Amazon Rekognition Custom Labels.
  2. Choose Get Started.
  3. For Project name, enter bee-detection.
  4. Choose Create project.

Import your dataset

We created a manifest using the first notebook (1_prepare_data) that contains the Amazon S3 URIs of our image annotations. We follow these steps to import our manifest into Amazon Rekognition Custom Labels:

  1. On the Amazon Rekognition Custom Labels console, choose Create dataset.
  2. Select Import images labeled by Amazon SageMaker Ground Truth.
  3. Name your dataset (for example, bee_dataset).
  4. Enter the Amazon S3 URI of the manifest file that we created.
  5. Copy the bucket policy that appears on the console.
  6. Open the Amazon S3 console in a new tab and access the bucket where the images are stored.
  7. On the Permissions tab, enter the bucket policy to allow Amazon Rekognition Custom Labels to access the dataset.
  8. Go back to the dataset creation console and choose Submit.

Train your model

After the dataset is imported into Amazon Rekognition Custom Labels, we can train a model immediately.

  1. Choose Train Model from the dataset page.
  2. For Choose project, choose your bee-detection project.
  3. For Choose training dataset, choose your bee_dataset dataset.

As part of model training, Amazon Rekognition Custom Labels requires a labeled test dataset to validate the model training. Amazon Rekognition Custom Labels uses the test dataset to verify how well your trained model predicts the correct labels and to generate evaluation metrics. Images in the test dataset are not used to train your model and should represent the same types of images you use your model to analyze.

  1. For Create test set, select how you want to provide your test dataset.

Amazon Rekognition Custom Labels provides three options:

  • Choose an existing test dataset
  • Create a new test dataset
  • Split training dataset

For this post, we choose to split our training dataset, which sets aside 20% of our dataset for testing the model.

  1. Select Split training dataset.
  2. Choose Train.

Our model took approximately 1.5 hours to train. The model achieved an average precision of 99% with a recall of 90% on the test data. The training time required for your model depends on many factors, including the number of images provided in the dataset and the complexity of the model. When training is complete, Amazon Rekognition Custom Labels outputs key quality metrics including F1 score, precision, recall, and the assumed threshold for each label. For more information about metrics, see Metrics for evaluating your model.

Serverless inference architecture

After our model is trained, Amazon Rekognition Custom Labels provides the API calls for starting, using, and stopping your model. In the environment setup section, we set up a serverless architecture to process test images that are uploaded to our S3 bucket via Amazon S3 events. It uses an AWS Lambda function to call the inference API, and manages these API calls using Amazon SQS.

We’re ready now to start applying our trained model to new images. We first need to start the project model version via the Amazon Rekognition Custom Labels console.

We take note of our model’s ARN and update the Lambda function bee-detection-inference with it. This indicates which endpoint we must invoke to retrieve the object detection results. We can also change the assumed threshold to accept or reject results with a low confidence score.
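
As a rough sketch of what the core of that function does, the following assumes the S3-to-SQS wiring from the CloudFormation template and a MODEL_ARN environment variable; both names and the event parsing are illustrative, not the exact implementation.

import json
import os

import boto3

rekognition = boto3.client('rekognition')
s3 = boto3.client('s3')

MODEL_ARN = os.environ['MODEL_ARN']  # ARN of the started model version

def handler(event, context):
    for record in event['Records']:  # SQS records wrapping S3 events
        s3_event = json.loads(record['body'])
        for rec in s3_event['Records']:
            bucket = rec['s3']['bucket']['name']
            key = rec['s3']['object']['key']
            result = rekognition.detect_custom_labels(
                ProjectVersionArn=MODEL_ARN,
                Image={'S3Object': {'Bucket': bucket, 'Name': key}},
                MinConfidence=90)  # the adjustable assumed threshold
            # Store the JSON response next to the image, with a .json suffix
            s3.put_object(Bucket=bucket, Key=key + '.json',
                          Body=json.dumps(result))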

Now it’s time to start uploading our test images to our S3 bucket prefix (s3://your-bucket/test_images). We can either use the Amazon S3 console or the AWS Command Line Interface (AWS CLI). We choose some test images present in our bee detection dataset and upload them using the console. As the images are uploaded, they’re queued in Amazon SQS and then processed by our Lambda function, leaving the result with the same file name, plus the .json suffix.
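
For example, uploading a local folder of test images with the AWS CLI might look like the following (your-bucket is a placeholder for the bucket created by the stack):

aws s3 cp ./test_images/ s3://your-bucket/test_images/ --recursive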

We visualize the results of the JSON response from our Amazon Rekognition Custom Labels model using the second notebook (2_visualize_images). The following is an example of a response output:

{'CustomLabels': [{'Name': 'bee',
                   'Confidence': 99.9679946899414,
                   'Geometry': {'BoundingBox': {'Width': 0.17472000420093536,
                                                'Height': 0.23267999291419983,
                                                'Left': 0.34907999634742737,
                                                'Top': 0.36125999689102173}}}],
 'ResponseMetadata': {'RequestId': '4f98fdc8-a7d3-4251-b21e-484baf958efb',
                      'HTTPStatusCode': 200,
                      'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',
                                      'date': 'Thu, 11 Mar 2021 15:23:39 GMT',
                                      'x-amzn-requestid': '4f98fdc8-a7d3-4251-b21e-484baf958efb',
                                      'content-length': '202',
                                      'connection': 'keep-alive'},
                      'RetryAttempts': 0}}

This bee is detected with a confidence of 99.97%.

In the following image on the left, we find six bees over 99.4% confidence, which is our optimal threshold. The image on the right shows the same result with a threshold of 90% (15 bees).

Clean up

When you’re done, remember to follow these steps to avoid incurring unnecessary charges:

  1. Stop the model version on the Amazon Rekognition Custom Labels console.
  2. Empty the S3 bucket that was created where images were uploaded.
  3. Delete the CloudFormation stack to remove all provisioned resources.

Comparison with a custom DIY model

The performance of our Amazon Rekognition Custom Labels model is quantitatively better than our DIY model, achieving almost perfect precision (99%). It is also noticeably better at preventing false negatives, yielding a very robust recall of 90%, far above the 56% recall of our DIY model. This is partly due to the optimized tuning that Amazon Rekognition Custom Labels applies to the model, and the assumed thresholds that it yields after training to achieve the best performance at test time.

For the first example, our single bee is detected at a much lower confidence score (64%), and with a rather large bounding box that doesn’t accurately reflect the size of the bee.

For the more challenging picture, we must lower our threshold to 81% to find the very first detection (left), and lower it even more to 50% to find 7 bees (right).

Playing with this threshold can be risky. Setting a very low threshold can detect more bees (better recall), but at the same time find false detections, lowering our model precision. However, Amazon Rekognition Custom Labels can detect bees with a much higher confidence, which allows us to set a higher threshold for a much better overall performance.

Conclusion

In this post, we showed you how to create a computer vision object detection model with Amazon Rekognition Custom Labels using annotated data, and compared the results with a custom DIY model. Amazon Rekognition Custom Labels brings a great advantage over building your own models: it enables you to build and optimize specialized computer vision models to detect unique objects without the need for advanced programming knowledge.

With more experiments with other model architectures and hyperparameters, an ML scientist can improve the DIY model we tested in this post. The Amazon Rekognition Custom Labels value proposition is that it does these experiments on your behalf, thereby reducing the time to get a usable model and its development costs. Finally, we also showed how to set up a minimal serverless architecture to process new images using our trained model.

For more information about using custom labels, see What Is Amazon Rekognition Custom Labels?


About the Author

Raúl Díaz García is a Sr Data Scientist in the EMEA SDT IoT Team. Raúl works with customers across the EMEA region, where he helps them enable solutions related to Computer Vision and Machine Learning in the IoT space.

Read More

Build GAN with PyTorch and Amazon SageMaker

A generative adversarial network (GAN) is a generative ML model that is widely used in advertising, games, entertainment, media, pharmaceuticals, and other industries. You can use it to create fictional characters and scenes, simulate facial aging, change image styles, produce synthetic data such as chemical formulas, and more.

For example, the following images show the effect of picture-to-picture conversion.

The following images show the effect of synthesizing scenery based on semantic layout.

This post walks you through building your first GAN model using Amazon SageMaker. It is a journey of learning GAN from the perspective of practical engineering, as well as an introduction to a new AI/ML domain: generative models.

We also introduce a use case of one of the hottest GAN applications in the synthetic data generation area. We hope this gives you a tangible sense on how GAN is used in real-life scenarios.

Overview of solution

Among the following two pictures of handwritten digits, one of them is actually generated by a GAN model. Can you tell which one?

The main topic of this article is using ML techniques to generate synthetic handwritten digits. To achieve this goal, you experience the training of a GAN model firsthand. Generating synthetic handwritten digits follows the same basic principles and engineering process as portrait generation, although the data, algorithm complexity, and accuracy requirements differ.

Generative Adversarial Networks, introduced by Ian Goodfellow et al., is a deep neural network architecture consisting of a generator network and a discriminator network. The generator synthesizes data and tries to deceive the discriminator, whereas the discriminator authenticates the data and tries to correctly identify all synthesized data. In the process of training iterations, the two networks continue to evolve and compete until they reach an equilibrium state (Nash equilibrium). At that point, the discriminator can no longer recognize synthesized data, and the training process is over.
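
Formally, this adversarial game is captured by the minimax objective from the original paper, where G tries to minimize the value that D tries to maximize:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]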

To train a GAN model, we need to start with some tools and services that are efficient and necessary for ML practices on AWS. As the working environment, SageMaker is a fully managed ML service. It offers all mainstream ML frameworks as managed container images, such as Scikit-Learn, XGBoost, MXNet, TensorFlow, PyTorch, and more. The SageMaker SDK is an open-source development kit for SageMaker that allows you to use SageMaker and other AWS services, for example, accessing data in an Amazon Simple Storage Service (Amazon S3) bucket, or training a model with a managed Amazon Elastic Compute Cloud (Amazon EC2) instance.

With SageMaker end-to-end ML functionality, you can focus on the model building work and easily train a variety of GAN models, without overheads in infrastructure and framework maintenance.

The following diagram illustrates our architecture.

The training data comes from the S3 storage bucket, and is loaded into the local storage of the training instance. The managed training frameworks and managed algorithms serve in the form of container images in Amazon Elastic Container Registry (Amazon ECR), which are combined with the custom training code when the training container is launched. The training output is collected and sent to a specified S3 bucket. In the following sections, we learn how to use these resources via the SageMaker SDK.

We use AWS services such as Amazon SageMaker and Amazon S3, which incur certain cloud resource usage fees.

Set up the development environment

SageMaker provides a managed Jupyter notebook instance, for model building, training, and more. You can carry out ML activities effectively and easily via Jupyter notebooks. For instructions on setting up your Jupyter notebook working environment, see Get Started with Amazon SageMaker Notebook Instances.

Alternatively, you may want to work with Amazon SageMaker Studio. For instructions, see Get Started with Studio Notebooks.

Download the source code

The source code is available in the SageMaker Examples GitHub repository.

  1. On the Git menu, choose Clone a Repository.
  2. Enter the clone URI of the repository (https://github.com/aws/amazon-sagemaker-examples.git).
  3. Choose Clone.

When the download is complete, browse the source code structure through the file browser.

  1. Open the notebook build_gan_with_pytorch.ipynb, which is under the folder /amazon-sagemaker-examples/advanced_functionality/pytorch_bring_your_own_gan/.
  2. In the Select Kernel pop-up, choose conda_pytorch_latest_p36.

If using a Studio environment, select the Python3 (PyTorch 1.6 Python 3.6 GPU Optimized) kernel instead.

The code and notebooks used in this post are available on GitHub, and are all verified with Python 3.6, PyTorch 1.5, and SageMaker-managed JupyterLab.

Deep convolutional generative adversarial network (DCGAN)

In 2016, Alec Radford et al. published the paper “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”, which pioneered the application of convolutional neural networks to GANs. In the algorithm design, the fully connected layers are replaced with convolutional layers, which improves the stability of training in image generation scenarios.

Network structure

The generator network uses strided transposed convolutional layers to increase the resolution of the tensor. The input shape is (batch_size, 100) and the output shape is (batch_size, 64, 64, 3). In other words, the network accepts a 100-dimensional uniform distribution vector and transforms it through successive layers until the final image is generated.

The discriminator network receives pictures in (64, 64, 3) format, uses 2D convolutional layers for downsampling, and finally passes them to a fully connected layer for classification.
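
As a rough illustration (not the exact layer configuration used in train.py), a DCGAN generator and discriminator for 64×64 images might be sketched as follows; the layer sizes are the standard DCGAN defaults, not the tuned values.

import torch.nn as nn

nz, ngf, ndf, nc = 100, 64, 64, 3  # illustrative sizes

generator = nn.Sequential(
    # 100-dim noise vector -> 4x4 feature map, upsampled step by step to 64x64
    nn.ConvTranspose2d(nz, ngf * 8, 4, 1, 0, bias=False), nn.BatchNorm2d(ngf * 8), nn.ReLU(True),
    nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False), nn.BatchNorm2d(ngf * 4), nn.ReLU(True),
    nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False), nn.BatchNorm2d(ngf * 2), nn.ReLU(True),
    nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False), nn.BatchNorm2d(ngf), nn.ReLU(True),
    nn.ConvTranspose2d(ngf, nc, 4, 2, 1, bias=False), nn.Tanh())

discriminator = nn.Sequential(
    # 64x64 image -> single real/fake score via strided convolutions
    nn.Conv2d(nc, ndf, 4, 2, 1, bias=False), nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False), nn.BatchNorm2d(ndf * 2), nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False), nn.BatchNorm2d(ndf * 4), nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False), nn.BatchNorm2d(ndf * 8), nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf * 8, 1, 4, 1, 0, bias=False), nn.Sigmoid())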

The training process of the DCGAN model can be roughly divided into three sub-processes, as sketched in the code after this paragraph.

First, the generator network uses a random vector as input to generate a synthetic picture. Then, the authentic picture and the synthetic picture are used to train the discriminator network and update its parameters. Finally, the generator network parameters are updated.
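
The following is a minimal sketch of one such training step, reusing the generator and discriminator sketched above; the loss functions and optimizers follow the standard DCGAN recipe, while the actual implementation lives in the DCGAN.train_step method of train.py.

import torch
import torch.nn as nn

criterion = nn.BCELoss()
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))

def train_step(real_images):
    batch = real_images.size(0)
    noise = torch.randn(batch, 100, 1, 1)  # random input for the generator
    fake_images = generator(noise)         # 1) synthesize pictures

    # 2) train the discriminator on real (label 1) and synthetic (label 0) images
    opt_d.zero_grad()
    loss_d = criterion(discriminator(real_images).view(-1), torch.ones(batch)) + \
             criterion(discriminator(fake_images.detach()).view(-1), torch.zeros(batch))
    loss_d.backward()
    opt_d.step()

    # 3) update the generator so its output is scored as real by the discriminator
    opt_g.zero_grad()
    loss_g = criterion(discriminator(fake_images).view(-1), torch.ones(batch))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()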

Code structure

The file structure of the project directory pytorch_bring_your_own_gan is as follows:

├── data
├── src
│   └── train.py
├── tmp
└── build_gan_with_pytorch.ipynb

The file train.py contains three classes: the generator network Generator, the discriminator network Discriminator, and a wrapper class for a single batch training process. See the following code:

class Generator(nn.Module):
...

class Discriminator(nn.Module):
...

class DCGAN(object):
    """
    A wrapper class for Generator and Discriminator,
    'train_step' method is for single batch training.
    """
...

The train.py file also contains several functions that facilitate training of the Generator and Discriminator networks. Some of the major functions are as follows:

def parse_args():
...

def get_datasets(dataset_name, ...):
...

def train(dataloader, hps, ...):
...

Model development

During development, you may run the train.py script directly from the Linux command line. You can specify input data channels, model hyperparameters, and training output storage via command line arguments (for more information, see Use PyTorch with the SageMaker Python SDK):

python src/train.py --dataset mnist \
        --model-dir '/home/myhome/byos-pytorch-gan/model' \
        --output-dir '/home/myhome/byos-pytorch-gan/tmp' \
        --data-dir '/home/myhome/byos-pytorch-gan/data' \
        --hps '{"beta1":0.5, "dataset":"mnist", "epochs":18,
            "learning-rate":0.0002, "log-interval":64, "nz":100, "ngf":28, "ndf":28}'

Designing the training script to accept command line parameters not only provides a convenient debugging method, but is also the protocol and prerequisite for integration with SageMaker training containers. This balances flexibility in model development with portability of the training environment.

Model training and validation

Find and open the notebook file build_gan_with_pytorch.ipynb, which introduces and runs the training process. Some of the code in this section is omitted; refer to the notebook for details.

Download data

Many public datasets are available on the internet that are very helpful for ML engineering and scientific research, such as algorithm study and evaluation. We use the MNIST dataset, which is a handwritten digits dataset, to train a DCGAN model, and eventually generate some synthetic handwritten digits. See the following code:

from sagemaker.s3 import S3Downloader as s3down
s3down.download('s3://sagemaker-sample-files/datasets/image/MNIST/pytorch/', './data')

Prepare the data

The PyTorch framework has a torchvision.datasets package, which provides access to several datasets. You can use the following commands to read the pre-downloaded MNIST dataset from local storage, for later use:

from torchvision import datasets

dataroot = './data'
trainset = datasets.MNIST(root=dataroot, train=True, download=False)
testset = datasets.MNIST(root=dataroot, train=False, download=False)

The SageMaker SDK creates a default S3 bucket for you to access various files and data that you may need in the ML engineering lifecycle. We can get the name of this bucket through the default_bucket method of the sagemaker.session.Session class in the SageMaker SDK:

from sagemaker.session import Session

sess = Session()

# S3 bucket for saving code and model artifacts.
# Feel free to specify a different bucket here if you wish.
bucket = sess.default_bucket()

The SageMaker SDK provides tools for operating AWS services. For example, the S3Downloader class is used to download objects in Amazon S3, and S3Uploader is used to upload local files to Amazon S3. You upload the dataset files to Amazon S3 for model training. During model training, we don’t download data from the internet in order to avoid network latency caused by fetching data from the internet. This also avoids possible security risks due to direct access to the internet. See the following code:

import os
from sagemaker.s3 import S3Uploader as s3up

# Key prefix under which the data is stored (example value)
prefix = 'byos-pytorch-gan'

s3_data_location = s3up.upload(os.path.join(dataroot, "MNIST"),
    f"s3://{bucket}/{prefix}/data/mnist")

Train the model

Via the sagemaker.get_execution_role() method, the notebook can get the role pre-assigned to the notebook instance. This role is used to obtain training resources, such as downloading training framework images, allocating EC2 instances, and so on.

The hyperparameters used in the model training task can be defined in the notebook so that it’s separated from the algorithm and training code. The hyperparameters are passed in when the training task is created and dynamically combined with the training task. See the following code:

import json

hps = {
         'seed': 0,
         'learning-rate': 0.0002,
         'epochs': 18,
         'pin-memory': 1,
         'beta1': 0.5,
         'nz': 100,
         'ngf': 28,
         'ndf': 28,
         'batch-size': 128,
         'log-interval': 20,
     }

The PyTorch class from the sagemaker.pytorch package is an estimator for the PyTorch framework. You can use it to create and run training tasks. In the parameter list, instance_type specifies the type of the training instance, such as CPU or GPU instances. The directory containing the training script and model code is specified by source_dir, and the training script name must be clearly defined by entry_point. These parameters are passed to the training job along with other parameters, and they determine the environment settings of the training task. See the following code:

from sagemaker.pytorch import PyTorch

my_estimator = PyTorch(role=role,
                        entry_point='train.py',
                        source_dir='src',
                        output_path=s3_model_artifacts_location,
                        code_location=s3_custom_code_upload_location,
                        instance_count=1,
                        instance_type='ml.g4dn.2xlarge',
                        use_spot_instances=False,
                        framework_version='1.5.0',
                        py_version='py3',
                        hyperparameters=hps)

Pay special attention to the use_spot_instances parameter. Setting it to True means that you want to use Spot Instances to train the model, as in the sketch that follows. Because ML training usually requires a large amount of computing resources to run for a long time, using Spot Instances can help you control your cost. Spot Instances may save up to 90% versus On-Demand Instances. Depending on the instance type, Region, and time, the actual savings may vary.
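
The following is a minimal sketch of the Spot-related parameters; the durations are illustrative, and max_wait must be at least as large as max_run.

my_spot_estimator = PyTorch(role=role,
                            entry_point='train.py',
                            source_dir='src',
                            instance_count=1,
                            instance_type='ml.g4dn.2xlarge',
                            framework_version='1.5.0',
                            py_version='py3',
                            use_spot_instances=True,  # train on spare capacity
                            max_run=3600,             # training timeout (seconds)
                            max_wait=7200,            # total time, incl. waiting for Spot capacity
                            hyperparameters=hps)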

You have created a PyTorch object, and you can use it to fit pre-uploaded training data on Amazon S3. The following command initiates the training job, and the training data is loaded into the training instance local storage in the form of an input channel named MNIST. When the training task starts, the training data is already available on the local file system of the training instance, and the training script train.py can access the data from the local disk afterwards.

# Start training
my_estimator.fit({'MNIST': s3_data_location}, wait=False)

Depending on the training instance you choose, the training process may last from dozens of minutes to hours. We recommend setting the wait parameter to False, which detaches the notebook from the training job. In scenarios with long training time and many training logs, it can prevent the notebook context from being lost due to network interruption or session timeout. After the notebook is detached from the training task, the output is temporarily invisible. Run the following code to allow the notebook to obtain and resume the previous training session:

%%time
from sagemaker.estimator import Estimator

# Attaching previous training session
training_job_name = my_estimator.latest_training_job.name
attached_estimator = Estimator.attach(training_job_name)

Because the model was designed to use GPU power to accelerate training, it’s much faster on GPU instances than on CPU instances. For example, the g4dn.2xlarge instance takes about 12 minutes, whereas the c5.xlarge instance may take more than 6 hours. The current model doesn’t support multi-instance training, so an instance_count parameter with a value greater than 1 doesn’t bring extra benefits in training time optimization.

When the training job is complete, the trained model is collected and uploaded to Amazon S3. The upload location is specified by the output_path parameter, which is provided when creating the PyTorch object.

Test the model

You download the trained model from Amazon S3 to the local file system of the notebook instance, where this Jupyter notebook is running. The following code loads and runs the model, and then generates a picture of handwritten digits from a random number as input:

import matplotlib.pyplot as plt
import numpy as np
import torch
from src.train import Generator

# load_model and generate_fake_handwriting are helper functions
# defined in the accompanying notebook
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

params = {'nz': hps['nz'], 'nc': 1, 'ngf': hps['ngf']}
model = load_model("./tmp/model.pth", model_cls=Generator, params=params, device=device, strict=False)
img = generate_fake_handwriting(model, num_images=64, nz=hps['nz'], device=device)

plt.imshow(np.asarray(img))

Use case: Synthetic data boosting handwritten text recognition

GAN and DCGAN have been derived into a remarkable number of variants that address different problems in their respective domains. Let’s look at one use case, which is designed to reduce the effort and cost in data collection and annotation, as well as improve the performance of a handwriting text recognition system.

ScrabbleGAN (see also the GitHub repo), introduced by scientists from Amazon, is a semi-supervised approach to synthesize handwritten text images that are versatile both in style and lexicon. It relies on a novel generative model that can generate images of words with an arbitrary length. The generator can manipulate the resulting text style, for instance, whether the text is cursive, or how thin the pen stroke is.

Problem definition

Optical character recognition (OCR), especially handwritten text recognition (HTR) systems, have seen significant performance improvements in the deep learning era. However, deep learning-based HTR is limited by the number of training examples. In other words, data gathering and labeling are challenging and costly tasks.

Targeting the lack of versatile, annotated handwritten text, and the difficulty to obtain it, Amazon scientists introduced a semi-supervised learning solution by creating realistic, synthesized text, reducing the need for annotations and enriching the variety of training data in both style and lexicon.

Network architecture

In contrast to the vast majority of text-related networks that rely on recurrent neural networks (RNNs), ScrabbleGAN introduces a novel fully convolutional handwritten text generation architecture, which allows for arbitrarily long outputs. This architecture learns character embeddings without the need for character-level annotation.

Handwriting is a local process—each letter is influenced by its predecessor and successor. The attention of the synthesizer is focused on the immediate neighbors of the current letter, and the generator G is designed to mimic this process. Instead of generating the image out of an entire word representation, each convolutional-upsampling layer widens the receptive field, as well as the overlap between two neighboring characters. This overlap allows adjacent characters to interact, and creates a smooth transition. The style of each image is controlled by a noise vector z given as input to the network. To generate the same style for the entire word or sentence, this noise vector is kept constant throughout the generation of all the characters in the input.

The purpose of the discriminator D is to identify synthetic images generated by G from the real ones. It also discriminates between such images based on the handwriting output style. The discriminator architecture has to account for the varying length of the generated image, therefore it’s designed to be convolutional, and is essentially a concatenation of separate binary classifiers with overlapping receptive fields. Because it’s designed not to rely on character-level annotations, it doesn’t use class supervision for each of these classifiers, therefore unlabeled images can be used to train D. A pooling layer aggregates scores from all classifiers into the final discriminator output.

While discriminator D promotes real-looking images, the recognizer R promotes readable text, in essence identifying between gibberish and real text. Generated images are penalized by comparing the recognized text in the output of R to the one that was given as input to G. R is trained only on real, labeled, handwritten samples.

Most recognition networks use a recurrent module, which learns an implicit language model that helps it identify the correct character even if it’s not written clearly. Although this quality is usually desired in a handwriting recognition model, in this synthetic data case, it may lead the network to correctly read characters that weren’t written clearly by the generator G. Therefore, the recurrent head of the recognition network is excluded, and only the convolutional backbone is used.

Conclusion

The PyTorch framework, one of the most popular deep learning frameworks, has been advancing rapidly, and is widely recognized and applied in recent years. More and more new models have been composed with PyTorch, and a remarkable number of existing models are being migrated from other frameworks to PyTorch. It has already become one of the de facto mainstream deep learning frameworks.

SageMaker is closely integrated with a variety of AWS services, such as EC2 instances of various types, Amazon S3, and Amazon ECR. It provides an end-to-end, consistent ML experience for ML practitioners of all frameworks. SageMaker continues to support mainstream ML frameworks, including PyTorch. ML algorithms and models developed with PyTorch can be easily transplanted to a SageMaker environment by using the fully managed Jupyter notebook, Spot training instances, Amazon ECR, the SageMaker SDK, and more. This lowers the overhead of ML engineering and infrastructure operation, improves productivity and efficiency, and reduces operation and maintenance costs.

Synthetic data generated by a GAN is rich and versatile in features, and can be produced in substantial amounts, so you can use it to improve the performance of a model by enriching the training set. Moreover, this technique can reduce the effort and cost of data gathering and labeling.

DCGAN is a landmark in the field of generative adversarial networks, and it's the cornerstone of many of today's more complex GANs. We explore some of the most recent and interesting variants of GANs in later posts. The introduction and engineering practices discussed in this post can help you understand the principles and engineering methods for GANs in general. Try out your first generative model, available as a SageMaker example, have fun, and see you next time.


About the Author

Laurence Miao, Solutions Architect at AWS. Laurence specializes in AI/ML and helps customers empower their business with AI/ML on AWS. Before AWS, Laurence served in a variety of software projects and organizations. His tech spectrum covers high-performance internet applications, enterprise information system integration, DevOps, cloud computing, and machine learning.

Read More

Process Amazon Redshift data and schedule a training pipeline with Amazon SageMaker Processing and Amazon SageMaker Pipelines

Customers in many different domains tend to work with multiple sources for their data: object-based storage like Amazon Simple Storage Service (Amazon S3), relational databases like Amazon Relational Database Service (Amazon RDS), or data warehouses like Amazon Redshift. Because of the frameworks they use, machine learning (ML) practitioners often prefer to work with objects and files instead of databases and tables, and they prefer local copies of such files in order to reduce the latency of accessing them.

Nevertheless, ML engineers and data scientists might be required to directly extract data from data warehouses with SQL-like queries to obtain the datasets that they can use for training their models.

In this post, we use the Amazon SageMaker Processing API to run a query against an Amazon Redshift cluster, create CSV files, and perform distributed processing. As an extra step, we also train a simple model to predict the total sales for new events, and build a pipeline with Amazon SageMaker Pipelines to schedule it.

Prerequisites

This post uses the sample data that is available when creating a Free Tier cluster in Amazon Redshift. As a prerequisite, you should create your cluster and attach to it an AWS Identity and Access Management (IAM) role with the correct permissions. For instructions on creating the cluster with the sample dataset, see Using a sample dataset. For instructions on associating the role with the cluster, see Authorizing access to the Amazon Redshift Data API.

You can then use your IDE of choice to open the notebooks. This content has been developed and tested using SageMaker Studio on an ml.t3.medium instance. For more information about using Studio, refer to the Amazon SageMaker Studio documentation.

Define the query

Now that your Amazon Redshift cluster is up and running, and loaded with the sample dataset, we can define the query to extract data from our cluster. According to the documentation for the sample database, this application helps analysts track sales activity for the fictional TICKIT website, where users buy and sell tickets online for sporting events, shows, and concerts. In particular, analysts can identify ticket movement over time, success rates for sellers, and the best-selling events, venues, and seasons.

Analysts may be tasked to solve a very common ML problem: predict the number of tickets sold given the characteristics of an event. Because we have two fact tables and five dimensions in our sample database, we have some data that we can work with. For the sake of this example, we try to use information from the venue in which the event takes place as well as its date. The SQL query looks like the following:

SELECT SUM(s.qtysold) AS total_sold, e.venueid, e.catid, d.caldate, d.holiday
FROM sales s, event e, date d
WHERE s.eventid = e.eventid AND e.dateid = d.dateid
GROUP BY e.venueid, e.catid, d.caldate, d.holiday

We can run this query in the query editor to test the outcomes and change it to include additional information if needed.
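
If you'd rather test the query from a notebook than from the query editor, one option is the Amazon Redshift Data API. The following is a sketch; the cluster identifiers are placeholders for your own values.

import time
import boto3

client = boto3.client("redshift-data")
resp = client.execute_statement(
    ClusterIdentifier="YOUR-CLUSTER-ID",
    Database="YOUR-DATABASE",
    DbUser="YOUR-DB-USER",
    Sql="""SELECT SUM(s.qtysold) AS total_sold, e.venueid, e.catid, d.caldate, d.holiday
           FROM sales s, event e, date d
           WHERE s.eventid = e.eventid AND e.dateid = d.dateid
           GROUP BY e.venueid, e.catid, d.caldate, d.holiday
           LIMIT 10""",
)

# The Data API is asynchronous, so poll until the statement finishes
while client.describe_statement(Id=resp["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(2)

for record in client.get_statement_result(Id=resp["Id"])["Records"]:
    print(record)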

Extract the data from Amazon Redshift and process it with SageMaker Processing

Now that we’re happy with our query, we need to make it part of our training pipeline.

A typical training pipeline consists of three phases:

  • Preprocessing – This phase reads the raw dataset and transforms it into a format that matches the input required by the model for its training
  • Training – This phase reads the processed dataset and uses it to train the model
  • Model registration – In this phase, we save the model for later usage

Our first task is to use a SageMaker Processing job to load the dataset from Amazon Redshift, preprocess it, and store it in Amazon S3 for the training step to pick up. SageMaker Processing can read data directly from different sources, including Amazon S3, Amazon Athena, and Amazon Redshift. It allows us to configure access to the cluster by providing the cluster and database information, and to use our previously defined SQL query as part of a RedshiftDatasetDefinition. We use the SageMaker Python SDK to create this object; you can check its definition and the required parameters on the GitHub page. See the following code:

from sagemaker.dataset_definition.inputs import RedshiftDatasetDefinition

rdd = RedshiftDatasetDefinition(
    cluster_id="THE-NAME-OF-YOUR-CLUSTER",
    database="THE-NAME-OF-YOUR-DATABASE",
    db_user="YOUR-DB-USERNAME",
    query_string="THE-SQL-QUERY-FROM-THE-PREVIOUS-STEP",
    cluster_role_arn="THE-IAM-ROLE-ASSOCIATED-TO-YOUR-CLUSTER",
    output_format="CSV",
    output_s3_uri="WHERE-IN-S3-YOU-WANT-TO-STORE-YOUR-DATA"
)

Then, you can define the DatasetDefinition. This object is responsible for defining how SageMaker Processing uses the dataset loaded from Amazon Redshift:

from sagemaker.dataset_definition.inputs import DatasetDefinition

dd = DatasetDefinition(
    data_distribution_type='ShardedByS3Key', # This tells SM Processing to shard the data across instances
    local_path='/opt/ml/processing/input/data/', # Where SM Processing will save the data in the container
    redshift_dataset_definition=rdd # Our RedshiftDatasetDefinition from the previous step
)

Finally, you can use this object as input of your processor of choice. For this post, we wrote a very simple scikit-learn script that cleans the dataset, performs some transformations, and splits the dataset for training and testing. You can check the code in the file processing.py.
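
The real processing.py is in the repository; the following rough sketch only illustrates what such a script could look like. It assumes the Redshift extract arrives as CSV files under /opt/ml/processing/input/data/ and writes to the paths that the ProcessingOutput definitions below expect.

import glob
import os

import pandas as pd
from sklearn.model_selection import train_test_split

# Read every CSV shard that SageMaker Processing mounted into the container
input_files = glob.glob("/opt/ml/processing/input/data/*.csv")
df = pd.concat((pd.read_csv(f) for f in input_files), ignore_index=True)

# Minimal cleaning and feature transformation (illustrative only)
df = df.dropna()
df["holiday"] = df["holiday"].astype(int)

train, test = train_test_split(df, test_size=0.2, random_state=42)

# Write where the ProcessingOutput definitions expect the results
os.makedirs("/opt/ml/processing/train", exist_ok=True)
os.makedirs("/opt/ml/processing/test", exist_ok=True)
train.to_csv("/opt/ml/processing/train/train.csv", index=False)
test.to_csv("/opt/ml/processing/test/test.csv", index=False)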

We can now instantiate the SKLearnProcessor object, where we define the framework version that we plan to use, the number and type of instances that we spin up as part of our processing cluster, and the execution role that contains the right permissions. Then, we can pass the parameter dataset_definition as the input of the run() method. This method accepts our processing.py script as the code to run, given some inputs (namely, our RedshiftDatasetDefinition), generates some outputs (a train and a test dataset), and stores both in Amazon S3. We run this operation synchronously thanks to the parameter wait=True:

from sagemaker.sklearn import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker import get_execution_role

skp = SKLearnProcessor(
    framework_version='0.23-1',
    role=get_execution_role(),
    instance_type='ml.m5.large',
    instance_count=1
)
skp.run(
    code='processing/processing.py',
    inputs=[ProcessingInput(
        dataset_definition=dd,
        destination='/opt/ml/processing/input/data/',
        s3_data_distribution_type='ShardedByS3Key'
    )],
    outputs = [
        ProcessingOutput(
            output_name="train", 
            source="/opt/ml/processing/train"
        ),
        ProcessingOutput(
            output_name="test", 
            source="/opt/ml/processing/test"
        ),
    ],
    wait=True
)

With the outputs created by the processing job, we can move to the training step, by means of the sagemaker.sklearn.SKLearn() Estimator:

from sagemaker.sklearn import SKLearn

s = SKLearn(
    entry_point='training/script.py',
    framework_version='0.23-1',
    instance_type='ml.m5.large',
    instance_count=1,
    role=get_execution_role()
)
s.fit({
    'train':skp.latest_job.outputs[0].destination, 
    'test':skp.latest_job.outputs[1].destination
})

To learn more about the SageMaker Training API and Scikit-learn Estimator, see Using Scikit-learn with the SageMaker Python SDK.

Define a training pipeline

Now that we have proven that we can read data from Amazon Redshift, preprocess it, and use it to train a model, we can define a pipeline that reproduces these steps, and schedule it to run. To do so, we use SageMaker Pipelines. Pipelines is the first purpose-built, easy-to-use continuous integration and continuous delivery (CI/CD) service for ML. With Pipelines, you can create, automate, and manage end-to-end ML workflows at scale.

Pipelines are composed of steps. These steps define the actions that the pipeline takes, and the relationships between steps using properties. We already know that our pipeline is composed of three steps: preprocessing, training, and model registration.

Furthermore, to make the pipeline definition dynamic, Pipelines allows us to define parameters, which are values that we can provide at runtime when the pipeline starts.

The following code is a snippet that shows the definition of a processing step. The step requires the definition of a processor, which is very similar to the one defined during the preprocessing discovery phase, but this time using the parameters of Pipelines. The other parameters (code, inputs, and outputs) are the same as we defined previously:

#### PROCESSING STEP #####

# IMPORTS
from sagemaker.workflow.parameters import ParameterString, ParameterInteger
from sagemaker.workflow.steps import ProcessingStep

# PARAMETERS
processing_instance_type = ParameterString(name='ProcessingInstanceType', default_value='ml.m5.large')
processing_instance_count = ParameterInteger(name='ProcessingInstanceCount', default_value=2)

# PROCESSOR
skp = SKLearnProcessor(
    framework_version='0.23-1',
    role=get_execution_role(),
    instance_type=processing_instance_type,
    instance_count=processing_instance_count
)

# DEFINE THE STEP
processing_step = ProcessingStep(
    name='ProcessingStep',
    processor=skp,
    code='processing/processing.py',
    inputs=[ProcessingInput(
        dataset_definition=dd,
        destination='/opt/ml/processing/input/data/',
        s3_data_distribution_type='ShardedByS3Key'
    )],
    outputs = [
        ProcessingOutput(output_name="train", source="/opt/ml/processing/output/train"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/output/test"),
    ]
)

Very similarly, we can define the training step, but we use the outputs from the processing step as inputs:

# TRAINING STEP
from sagemaker.workflow.steps import TrainingStep
from sagemaker.inputs import TrainingInput

training_step = TrainingStep(
    name='TrainingStep',
    estimator=s,
    inputs={
        "train": TrainingInput(s3_data=processing_step.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri),
        "test": TrainingInput(s3_data=processing_step.properties.ProcessingOutputConfig.Outputs["test"].S3Output.S3Uri)
    }
)

Finally, let’s add the model step, which registers the model to SageMaker for later use (for real-time endpoints and batch transform):

# MODEL STEP
from sagemaker.workflow.steps import CreateModelStep
from sagemaker.inputs import CreateModelInput

model_step = CreateModelStep(
    name="Model",
    model=model,  # a sagemaker.model.Model; see the sketch that follows
    inputs=CreateModelInput(instance_type='ml.m5.xlarge')
)
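
The model object used in the model step isn't defined in the snippets shown here. The following is a minimal sketch of how it could be constructed from the training step's artifacts; reusing the estimator's container image and this particular Model wiring are assumptions, not the original notebook's code.

from sagemaker.model import Model
from sagemaker import get_execution_role

# Hypothetical construction: reuse the scikit-learn container image from the
# estimator s and point model_data at the artifacts the training step produces
model = Model(
    image_uri=s.training_image_uri(),
    model_data=training_step.properties.ModelArtifacts.S3ModelArtifacts,
    role=get_execution_role(),
)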

With all the pipeline steps now defined, we can define the pipeline itself as a Pipeline object comprising a series of those steps. Steps without data dependencies run in parallel, and condition steps are also possible. Then we can update and insert (upsert) the definition to Pipelines with the .upsert() command:

#### PIPELINE ####
from sagemaker.workflow.pipeline import Pipeline

pipeline = Pipeline(
    name = 'Redshift2Pipeline',
    parameters = [
        # The training and inference parameters are defined in the same way
        # as the processing parameters shown earlier
        processing_instance_type, processing_instance_count,
        training_instance_type, training_instance_count,
        inference_instance_type
    ],
    steps = [
        processing_step, 
        training_step,
        model_step
    ]
)
pipeline.upsert(role_arn=get_execution_role())
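
After upserting, you can sanity-check the JSON definition that was sent to the service; for example, the following prints the step names:

import json

# The definition() method returns the pipeline definition as a JSON string
definition = json.loads(pipeline.definition())
print([step["Name"] for step in definition["Steps"]])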

After we upsert the definition, we can start the pipeline with the pipeline object’s start() method, and wait for the end of its run:

execution = pipeline.start()
execution.wait()
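
You can also track the progress programmatically; each element returned by the execution's list_steps() method describes one step of the run:

for step in execution.list_steps():
    print(step["StepName"], "-", step["StepStatus"])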

After the pipeline starts running, we can view the run on the SageMaker console. In the navigation pane, under Components and registries, choose Pipelines. Choose the Redshift2Pipeline pipeline, and then choose the specific run to see its progress. You can choose each step to see additional details such as the output, logs, and additional information. Typically, this pipeline should take about 10 minutes to complete.
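
To schedule the pipeline, one option is an Amazon EventBridge rule that starts an execution on a fixed schedule. The following is a sketch with placeholder names and ARNs; the role passed to EventBridge is an assumption and must be allowed to call sagemaker:StartPipelineExecution on the pipeline.

import boto3

events = boto3.client("events")

# Trigger once a day; cron expressions are also supported
events.put_rule(
    Name="Redshift2PipelineDaily",
    ScheduleExpression="rate(1 day)",
)

# Placeholder ARNs: point the rule at the pipeline with a role that
# EventBridge can assume to start the execution
events.put_targets(
    Rule="Redshift2PipelineDaily",
    Targets=[{
        "Id": "RunRedshift2Pipeline",
        "Arn": "arn:aws:sagemaker:REGION:ACCOUNT_ID:pipeline/redshift2pipeline",
        "RoleArn": "arn:aws:iam::ACCOUNT_ID:role/EventBridgeStartPipelineRole",
        "SageMakerPipelineParameters": {"PipelineParameterList": []},
    }],
)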

Conclusions

In this post, we created a SageMaker pipeline that reads data from Amazon Redshift natively, without requiring additional configuration or services, processes it via SageMaker Processing, and trains a scikit-learn model.



About the Author

Davide Gallitelli is a Specialist Solutions Architect for AI/ML in the EMEA region. He is based in Brussels and works closely with customers throughout Benelux. He has been a developer since he was very young, starting to code at the age of 7. He started learning AI/ML at university, and has fallen in love with it since then.

Read More

Add AutoML functionality with Amazon SageMaker Autopilot across accounts

AutoML is a powerful capability, provided by Amazon SageMaker Autopilot, that allows non-experts to create machine learning (ML) models to invoke in their applications.

The problem that we want to solve arises when, due to governance constraints, Amazon SageMaker resources can’t be deployed in the same AWS account where they are used.

Examples of such a situation are:

  • A multi-account enterprise setup of AWS where the Autopilot resources must be deployed in a specific AWS account (the trusting account), and should be accessed from trusted accounts
  • A software as a service (SaaS) provider that offers AutoML to its users and deploys the resources in the customer's AWS account, so that the billing is associated with the end customer

This post walks through an implementation using the SageMaker Python SDK. It’s divided into two sections:

  • Create the AWS Identity and Access Management (IAM) resources needed for cross-account access
  • Perform the Autopilot job, deploy the top model, and make predictions from the trusted account accessing the trusting account

The solution described in this post is provided in the Jupyter notebook available in this GitHub repository.

For a full explanation of Autopilot, you can refer to the examples available in GitHub, particularly Top Candidates Customer Churn Prediction with Amazon SageMaker Autopilot and Batch Transform (Python SDK).

Prerequisites

We have two AWS accounts:

  • Customer (trusting) account – Where the SageMaker resources are deployed
  • SaaS (trusted) account – Drives the training and prediction activities

You have to create a user in each account, with programmatic access enabled and the IAMFullAccess managed policy attached.

You have to configure the user profiles in the .aws/credentials file:

  • customer_config for the user configured in the customer account
  • saas_config for the user configured in the SaaS account

To update the SageMaker SDK, run the following command in your Python environment:

!pip install --upgrade sagemaker

The procedure has been tested in the SageMaker environment conda_python3.
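
You can verify the installed version afterward:

import sagemaker
print(sagemaker.__version__)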

Common modules and initial definitions

Import common Python modules used in the script:

import boto3
import json
import sagemaker
from botocore.exceptions import ClientError

Let’s define the AWS Region that will host the resources:

REGION = boto3.Session().region_name

and the reference to the dataset for the training of the model:

DATASET_URI = "s3://sagemaker-sample-files/datasets/tabular/synthetic/churn.txt"

Set up the IAM resources

The following diagram illustrates the IAM entities that we create, which allow the cross-account implementation of the Autopilot job.

On the customer account, we define the single role customer_trusting_saas, which consolidates the permissions for Amazon Simple Storage Service (Amazon S3) and SageMaker access needed for the following:

  • The local SageMaker service that performs the Autopilot actions
  • The principal in the SaaS account that initiates the actions in the customer account

On the SaaS account, we define the following:

  • The AutopilotUsers group with the policy required to assume the customer_trusting_saas role via AWS Security Token Service (AWS STS)
  • The saas_user, which is a member of the AutopilotUsers group and is the actual principal triggering the Autopilot actions

For additional security, in the cross-account trust relationship, we use the external ID to mitigate the confused deputy problem.

Let’s proceed with the setup.

For each of the two accounts, we complete the following tasks:

  1. Create the Boto3 session with the profile of the respective configuration user.
  2. Retrieve the AWS account ID by means of AWS STS.
  3. Create the IAM client that performs the configuration steps in the account.

For the customer account, use the following code:

customer_config_session = boto3.session.Session(profile_name="customer_config")
CUSTOMER_ACCOUNT_ID = customer_config_session.client("sts").get_caller_identity()["Account"]
customer_iam_client = customer_config_session.client("iam")

Use the following code in the SaaS account:

saas_config_session = boto3.session.Session(profile_name="saas_config")
SAAS_ACCOUNT_ID = saas_config_session.client("sts").get_caller_identity()["Account"]
saas_iam_client = saas_config_session.client("iam")

Set up the IAM entities in the customer account

Let’s first define the role needed to perform cross-account tasks from the SaaS account in the customer account.

For simplicity, the same role is adopted for trusting SageMaker in the customer account. Ideally, consider splitting this role into two roles with fine-grained permissions, in line with the principle of least privilege.

The role name and the references to the ARN of the SageMaker AWS managed policies are as follows:

CUSTOMER_TRUST_SAAS_ROLE_NAME = "customer_trusting_saas"
CUSTOMER_TRUST_SAAS_ROLE_ARN = "arn:aws:iam::{}:role/{}".format(CUSTOMER_ACCOUNT_ID, CUSTOMER_TRUST_SAAS_ROLE_NAME)
SAGEMAKERFULLACCESS_POLICY_ARN = "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"

The following customer managed policy gives the role the permissions to access the Amazon S3 resources that are needed for the SageMaker tasks and for the cross-account copy of the dataset.

We restrict the access to the S3 buckets dedicated to SageMaker in the AWS Region for the customer account. See the following code:

CUSTOMER_S3_POLICY_NAME = "customer_s3"
CUSTOMER_S3_POLICY = {
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
         "s3:GetObject",
         "s3:PutObject",
         "s3:DeleteObject",
         "s3:ListBucket"
      ],
      "Resource": [
         "arn:aws:s3:::sagemaker-{}-{}".format(REGION, CUSTOMER_ACCOUNT_ID),
         "arn:aws:s3:::sagemaker-{}-{}/*".format(REGION, CUSTOMER_ACCOUNT_ID)
      ]
    }
  ]
}

Then we define the external ID to mitigate the confused deputy problem:

EXTERNAL_ID = "XXXXX"

The trust relationships policy allows the principals from the trusted account and SageMaker to assume the role:

CUSTOMER_TRUST_SAAS_POLICY = {
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::{}:root".format(SAAS_ACCOUNT_ID)
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": EXTERNAL_ID
        }
      }
    },
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "sagemaker.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

For simplicity, we don’t include the management of the exceptions in the following snippets. See the Jupyter notebook for the full code.

We create the customer managed policy in the customer account, create the new role, and attach the two policies. We use the maximum session duration parameter to manage long-running jobs. See the following code:

MAX_SESSION_DURATION = 10800
create_policy_response = customer_iam_client.create_policy(PolicyName=CUSTOMER_S3_POLICY_NAME,
                                                           PolicyDocument=json.dumps(CUSTOMER_S3_POLICY))
customer_s3_policy_arn = create_policy_response["Policy"]["Arn"]

create_role_response = customer_iam_client.create_role(RoleName=CUSTOMER_TRUST_SAAS_ROLE_NAME,
                                                       AssumeRolePolicyDocument=json.dumps(CUSTOMER_TRUST_SAAS_POLICY),
                                                       MaxSessionDuration=MAX_SESSION_DURATION)

customer_iam_client.attach_role_policy(RoleName=CUSTOMER_TRUST_SAAS_ROLE_NAME,
                                       PolicyArn=customer_s3_policy_arn)
customer_iam_client.attach_role_policy(RoleName=CUSTOMER_TRUST_SAAS_ROLE_NAME,
                                       PolicyArn=SAGEMAKERFULLACCESS_POLICY_ARN)

Set up IAM entities in the SaaS account

We define the following in the SaaS account:

  • A group of users allowed to perform the Autopilot job in the customer account
  • A policy associated with the group for assuming the role defined in the customer account
  • A policy associated with the group for uploading data to Amazon S3 and managing bucket policies
  • A user that is responsible for the implementation of the Autopilot jobs – the user has programmatic access
  • A user profile to store the user access key and secret in the file for the credentials

Let’s start with defining the name of the group (AutopilotUsers):

SAAS_USER_GROUP_NAME = "AutopilotUsers"

The first policy refers to the customer account ID and the role:

SAAS_ASSUME_ROLE_POLICY_NAME = "saas_assume_customer_role"
SAAS_ASSUME_ROLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": "arn:aws:iam::{}:role/{}".format(CUSTOMER_ACCOUNT_ID, CUSTOMER_TRUST_SAAS_ROLE_NAME)
        }
    ]
}

The second policy is needed to download the dataset, and to manage the Amazon S3 bucket used by SageMaker:

SAAS_S3_POLICY_NAME = "saas_s3"
SAAS_S3_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::{}".format(DATASET_URI.split('://')[1])
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:CreateBucket",
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:PutBucketPolicy",
                "s3:DeleteBucketPolicy"
            ],
            "Resource": [
                "arn:aws:s3:::sagemaker-{}-{}".format(REGION, SAAS_ACCOUNT_ID),
                "arn:aws:s3:::sagemaker-{}-{}/*".format(REGION, SAAS_ACCOUNT_ID)
            ]
        }
    ]
}

For simplicity, we give the same value to the user name and user profile:

SAAS_USER_PROFILE = SAAS_USER_NAME = "saas_user"

Now we create the two new managed policies. Next, we create the group, attach the policies to the group, create the user with programmatic access, and insert the user into the group. See the following code:

create_policy_response = saas_iam_client.create_policy(PolicyName=SAAS_ASSUME_ROLE_POLICY_NAME,
                                                       PolicyDocument=json.dumps(SAAS_ASSUME_ROLE_POLICY))      
saas_assume_role_policy_arn = create_policy_response["Policy"]["Arn"]

create_policy_response = saas_iam_client.create_policy(PolicyName=SAAS_S3_POLICY_NAME,
                                                       PolicyDocument=json.dumps(SAAS_S3_POLICY))
saas_s3_policy_arn = create_policy_response["Policy"]["Arn"]

saas_iam_client.create_group(GroupName=SAAS_USER_GROUP_NAME)

saas_iam_client.attach_group_policy(GroupName=SAAS_USER_GROUP_NAME,PolicyArn=saas_assume_role_policy_arn)
saas_iam_client.attach_group_policy(GroupName=SAAS_USER_GROUP_NAME,PolicyArn=saas_s3_policy_arn)

saas_iam_client.create_user(UserName=SAAS_USER_NAME)
# Keep the response: the access key is written to the credentials file later
create_akey_response = saas_iam_client.create_access_key(UserName=SAAS_USER_NAME)

saas_iam_client.add_user_to_group(GroupName=SAAS_USER_GROUP_NAME,UserName=SAAS_USER_NAME)

Update the credentials file

Create the user profile for saas_user in the .aws/credentials file:

from pathlib import Path
import configparser

credentials_config = configparser.ConfigParser()
credentials_config.read(str(Path.home()) + "/.aws/credentials")

if not credentials_config.has_section(SAAS_USER_PROFILE):
    credentials_config.add_section(SAAS_USER_PROFILE)
    
credentials_config[SAAS_USER_PROFILE]["aws_access_key_id"] = create_akey_response["AccessKey"]["AccessKeyId"]
credentials_config[SAAS_USER_PROFILE]["aws_secret_access_key"] = create_akey_response["AccessKey"]["SecretAccessKey"]

with open(str(Path.home()) + "/.aws/credentials", "w") as configfile:
    credentials_config.write(configfile, space_around_delimiters=False)

This completes the configuration of IAM entities that are needed for the cross-account implementation of the Autopilot job.

Autopilot cross-account access

This is the core objective of the post, where we demonstrate the main differences with respect to the single-account scenario.

First, we prepare the dataset the Autopilot job uses for training the models.

Data

We reuse the same dataset adopted in the SageMaker example: Top Candidates Customer Churn Prediction with Amazon SageMaker Autopilot and Batch Transform (Python SDK).

For a full explanation of the data, refer to the original example.

We skip the data inspection and proceed directly to the focus of this post, which is the cross-account Autopilot job invocation.

Download the churn dataset with the following AWS Command Line Interface (AWS CLI) command:

!aws s3 cp $DATASET_URI ./ --profile saas_user

Split the dataset for the Autopilot job and the inference phase

After you load the dataset, split it into two parts:

  • 80% for the Autopilot job to train the top model
  • 20% for testing the model that we deploy

Autopilot applies a cross-validation resampling procedure, on the dataset passed as input, to all candidate algorithms to test their ability to predict data they have not been trained on.

Split the dataset with the following code:

import pandas as pd
import numpy as np

churn = pd.read_csv("./churn.txt")
train_data = churn.sample(frac=0.8,random_state=200)
test_data = churn.drop(train_data.index)
test_data_no_target = test_data.drop(columns=["Churn?"])

Let's save the training data locally into a file that we pass to the fit method of the AutoML estimator:

train_file = "train_data.csv"
train_data.to_csv(train_file, index=False, header=True)

Autopilot training job, deployment, and prediction overview

The training, deployment, and prediction process is illustrated in the following diagram.

The following are the steps for the cross-account invocation:

  1. Initiate a session as saas_user in the SaaS account and load the profile from the credentials.
  2. Assume the role in the customer account via the AWS STS.
  3. Set up and train the AutoML estimator in the customer account.
  4. Deploy the top candidate model proposed by AutoML in the customer account.
  5. Invoke the deployed model endpoint for the prediction on test data.

Initiate the user session in the SaaS account

The setup procedure of IAM entities, explained at the beginning of the post, created the saas_user, identified by the saas_user profile in the .aws/credentials file. We initiate a Boto3 session with this profile:

saas_user_session = boto3.session.Session(profile_name=SAAS_USER_PROFILE, 
                                          region_name=REGION)

The saas_user inherits from the AutopilotUsers group the permission to assume the customer_trusting_saas role in the customer account.

Assume the role in the customer account via AWS STS

AWS STS provides the credentials for a temporary session that is initiated in the customer account:

saas_sts_client = saas_user_session.client("sts", region_name=REGION)

The default session duration (the DurationSeconds parameter) is 1 hour. We set it to the maximum session duration value set for the role. If the session expires, you can recreate it by performing the following steps again. See the following code:

assumed_role_object = saas_sts_client.assume_role(RoleArn=CUSTOMER_TRUST_SAAS_ROLE_ARN,
                                                  RoleSessionName="sagemaker_autopilot",
                                                  ExternalId=EXTERNAL_ID,
                                                  DurationSeconds=MAX_SESSION_DURATION)

assumed_role_credentials = assumed_role_object["Credentials"]
			 
assumed_role_session = boto3.Session(aws_access_key_id=assumed_role_credentials["AccessKeyId"],
                                     aws_secret_access_key=assumed_role_credentials["SecretAccessKey"],
                                     aws_session_token=assumed_role_credentials["SessionToken"],
                                     region_name=REGION)
									 
sagemaker_session = sagemaker.Session(boto_session=assumed_role_session)

The sagemaker_session parameter is needed for using the high-level AutoML estimator.

Set up and train the AutoML estimator in the customer account

We use the AutoML estimator from the SageMaker Python SDK to invoke the Autopilot job to train a set of candidate models for the training data.

The setup of the AutoML object is similar to the single-account scenario, but with the following differences for the cross-account invocation:

  • The role for SageMaker access in the customer account is CUSTOMER_TRUST_SAAS_ROLE_ARN
  • The sagemaker_session is the temporary session created by AWS STS

See the following code:

from sagemaker import AutoML
from time import gmtime, strftime, sleep

timestamp_suffix = strftime("%d-%H-%M-%S", gmtime())
base_job_name = "automl-churn-sdk-" + timestamp_suffix

target_attribute_name = "Churn?"
target_attribute_values = np.unique(train_data[target_attribute_name])
target_attribute_true_value = target_attribute_values[1] # 'True.'

automl = AutoML(role=CUSTOMER_TRUST_SAAS_ROLE_ARN,
                target_attribute_name=target_attribute_name,
                base_job_name=base_job_name,
                sagemaker_session=sagemaker_session,
                max_candidates=10)

We now launch the Autopilot job by calling the fit method of the AutoML estimator in the same way as in the single-account example. We consider the following alternative options for providing the training dataset to the estimator.

First option: upload a local file and train by fit method

We simply pass the training dataset by referring to the local file that the fit method uploads into the default Amazon S3 bucket used by SageMaker in the customer account:

automl.fit(train_file, job_name=base_job_name, wait=False, logs=False)

Second option: cross-account copy

Most likely, the training dataset is located in an Amazon S3 bucket owned by the SaaS account. We copy the dataset from the SaaS account into the customer account and refer to the URI of the copy in the fit method.

  1. Upload the dataset into a local bucket of the SaaS account. For convenience, we use the SageMaker default bucket in the Region.
    DATA_PREFIX = "auto-ml-input-data"
    local_session = sagemaker.Session(boto_session=saas_user_session)
    local_session_bucket = local_session.default_bucket()
    train_data_s3_path = local_session.upload_data(path=train_file,key_prefix=DATA_PREFIX)

  2. To allow the cross-account copy, we set the following policy in the local bucket, only for the time needed for the copy operation:
    train_data_s3_arn = "arn:aws:s3:::{}/{}/{}".format(local_session_bucket,DATA_PREFIX,train_file)
    bucket_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": CUSTOMER_TRUST_SAAS_ROLE_ARN
                },
                "Action": "s3:GetObject",
                "Resource": train_data_s3_arn
            }
        ]
    }
    bucket_policy = json.dumps(bucket_policy)
    
    saas_s3_client = saas_user_session.client("s3")
    saas_s3_client.put_bucket_policy(Bucket=local_session_bucket,Policy=bucket_policy)

  3. Then the copy is performed by the assumed role in the customer account:
    assumed_role_s3_client = boto3.client("s3",
                                           aws_access_key_id=assumed_role_credentials["AccessKeyId"],
                                           aws_secret_access_key=assumed_role_credentials["SecretAccessKey"],
                                           aws_session_token=assumed_role_credentials["SessionToken"])
    target_train_key = "{}/{}".format(DATA_PREFIX, train_file)
    assumed_role_s3_client.copy_object(Bucket=sagemaker_session.default_bucket(), 
                                       CopySource=train_data_s3_path.split("://")[1], 
                                       Key=target_train_key)

  4. Delete the bucket policy, so that access is granted only for the duration of the copy:
    saas_s3_client.delete_bucket_policy(Bucket=local_session_bucket)

  5. Finally, we launch the Autopilot job, passing the URI of the object copy:
    target_train_uri = "s3://{}/{}".format(sagemaker_session.default_bucket(), 
                                           target_train_key)
    automl.fit(target_train_uri, job_name=base_job_name, wait=False, logs=False)

Another option is to refer directly to the URI of the source dataset in the bucket in the SaaS account. In this case, the bucket policy must remain in place for the whole duration of the training, and it should also allow the s3:ListBucket action for the source bucket, with a statement like the following:

{
  "Effect": "Allow",
  "Principal": {
     "AWS": "arn:aws:iam::CUSTOMER_ACCOUNT_ID:role/customer_trusting_saas"
  },
  "Action": "s3:ListBucket",
  "Resource": "arn:aws:s3:::sagemaker-REGION-SAAS_ACCOUNT_ID"
}

We can use the describe_auto_ml_job method to track the status of our SageMaker Autopilot job:

describe_response = automl.describe_auto_ml_job()
print(describe_response["AutoMLJobStatus"] + " - " + describe_response["AutoMLJobSecondaryStatus"])
job_run_status = describe_response["AutoMLJobStatus"]

while job_run_status not in ("Failed", "Completed", "Stopped"):
    describe_response = automl.describe_auto_ml_job()
    job_run_status = describe_response["AutoMLJobStatus"]
    
    print(describe_response["AutoMLJobStatus"] + " - " + describe_response["AutoMLJobSecondaryStatus"])
    sleep(30)

Because an Autopilot job can take a long time, if the session token expires during the fit, you can create a new session following the steps described earlier and retrieve the current Autopilot job reference by implementing the following code:

automl = AutoML.attach(auto_ml_job_name=base_job_name,sagemaker_session=sagemaker_session)

Deploy the top candidate model proposed by AutoML

The Autopilot job trains and returns a set of trained candidate models, identifying among them the top candidate that optimizes the evaluation metric related to the ML problem.

In this post, we only demonstrate the deployment of the top candidate proposed by AutoML, but you can choose a different candidate that better fits your business criteria.

First, we review the performance achieved by the top candidate in the cross-validation:

best_candidate = automl.describe_auto_ml_job()["BestCandidate"]
best_candidate_name = best_candidate["CandidateName"]
print("n")
print("CandidateName: " + best_candidate_name)
print("FinalAutoMLJobObjectiveMetricName: " + best_candidate["FinalAutoMLJobObjectiveMetric"]["MetricName"])
print("FinalAutoMLJobObjectiveMetricValue: " + str(best_candidate["FinalAutoMLJobObjectiveMetric"]["Value"]))

If the performance is good enough for our business criteria, we deploy the top candidate in the customer account:

from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import CSVDeserializer

inference_response_keys = ["predicted_label", "probability"]

predictor = automl.deploy(initial_instance_count=1,
                          instance_type="ml.m5.large",
                          inference_response_keys=inference_response_keys,
                          predictor_cls=Predictor,
                          serializer=CSVSerializer(),
                          deserializer=CSVDeserializer())

print("Created endpoint: {}".format(predictor.endpoint_name))

The instance is deployed and billed to the customer account.

Prediction on test data

Finally, we access the model endpoint for the prediction of the label output for the test data:

predictions = predictor.predict(test_data_no_target.to_csv(sep=",", 
                                                           header=False, 
                                                           index=False))
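
The CSVDeserializer returns the response as rows of [predicted_label, probability] strings. As a quick illustrative check (not part of the original notebook), you can compare the predictions with the ground truth held out earlier:

# Rough accuracy check on the held-out 20% split; the predicted label strings
# ('False.'/'True.') match the values of the Churn? target column
y_pred = [row[0] for row in predictions]
y_true = test_data[target_attribute_name].astype(str).tolist()
accuracy = sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)
print("Test set accuracy: {:.3f}".format(accuracy))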

If the session token expires after the deployment of the endpoint, you can recreate a new session following the steps described earlier and connect to the already deployed endpoint by implementing the following code:

predictor = Predictor(predictor.endpoint_name, 
                      sagemaker_session = sagemaker_session,
                      serializer=CSVSerializer(), 
                      deserializer=CSVDeserializer())

Clean up

To avoid incurring unnecessary charges, delete the endpoint and the other resources created when deploying the model as soon as they're no longer needed.

Delete the model endpoint

The model endpoint is deployed in a container that is always active. We delete it first to avoid incurring ongoing charges:

predictor.delete_endpoint()

Delete the artifacts generated by the Autopilot job

Delete all the artifacts created by the Autopilot job, such as the generated candidate models, scripts, and notebook.

We use the high-level resource for Amazon S3 to simplify the operation:

assumed_role_s3_resource = boto3.resource("s3",
                                          aws_access_key_id=assumed_role_credentials["AccessKeyId"],
                                          aws_secret_access_key=assumed_role_credentials["SecretAccessKey"],
                                          aws_session_token=assumed_role_credentials["SessionToken"])

s3_bucket = assumed_role_s3_resource.Bucket(automl.sagemaker_session.default_bucket())
s3_bucket.objects.filter(Prefix=base_job_name).delete()

Delete the training dataset copied into the customer account

Delete the training dataset in the customer account with the following code:

from urllib.parse import urlparse

train_data_uri = automl.describe_auto_ml_job()["InputDataConfig"][0]["DataSource"]["S3DataSource"]["S3Uri"]

o = urlparse(train_data_uri, allow_fragments=False)
assumed_role_s3_resource.Object(o.netloc, o.path.lstrip("/")).delete()

Clean up IAM resources

We delete the IAM resources in reverse order to the creation phase.

  1. Remove the user from the group, remove the profile from the credentials file, delete the user's access keys, and delete the user:
    saas_iam_client.remove_user_from_group(GroupName = SAAS_USER_GROUP_NAME,
                                           UserName = SAAS_USER_NAME)
                                          
    credentials_config.remove_section(SAAS_USER_PROFILE)
    with open(str(Path.home()) + "/.aws/credentials", "w") as configfile:
        credentials_config.write(configfile, space_around_delimiters=False)
        
    user_access_keys = saas_iam_client.list_access_keys(UserName=SAAS_USER_NAME)
    for AccessKeyId in [element["AccessKeyId"] for element in user_access_keys["AccessKeyMetadata"]]:
        saas_iam_client.delete_access_key(UserName=SAAS_USER_NAME, AccessKeyId=AccessKeyId)
    	
    saas_iam_client.delete_user(UserName=SAAS_USER_NAME)

  2. Detach the policies from the group in the SaaS account, and delete the group and policies:
    attached_group_policies = saas_iam_client.list_attached_group_policies(GroupName=SAAS_USER_GROUP_NAME)
    for PolicyArn in [element["PolicyArn"] for element in attached_group_policies["AttachedPolicies"]]:
        saas_iam_client.detach_group_policy(GroupName=SAAS_USER_GROUP_NAME, PolicyArn=PolicyArn)
        
    saas_iam_client.delete_group(GroupName=SAAS_USER_GROUP_NAME)
    saas_iam_client.delete_policy(PolicyArn=saas_assume_role_policy_arn)
    saas_iam_client.delete_policy(PolicyArn=saas_s3_policy_arn)

  3. Detach the AWS policies from the role in the customer account, then delete the role and the policy:
    attached_role_policies = customer_iam_client.list_attached_role_policies(RoleName=CUSTOMER_TRUST_SAAS_ROLE_NAME)
    for PolicyArn in [element["PolicyArn"] for element in attached_role_policies["AttachedPolicies"]]:
        customer_iam_client.detach_role_policy(RoleName=CUSTOMER_TRUST_SAAS_ROLE_NAME, PolicyArn=PolicyArn)
    
    customer_iam_client.delete_role(RoleName=CUSTOMER_TRUST_SAAS_ROLE_NAME)
    customer_iam_client.delete_policy(PolicyArn=customer_s3_policy_arn)

Conclusion

This post described a possible implementation, using the SageMaker Python SDK, of an Autopilot training job, model deployment, and prediction in a cross-account configuration. The originating account owns the training data and delegates the activities to the account hosting the SageMaker resources.

You can use the API calls shown in this post to incorporate AutoML capabilities into a SaaS application, by delegating the management and billing of SageMaker resources to the customer account.

SageMaker decouples the environment where the data scientist drives the analysis from the containers that perform each phase of the ML process.

This capability simplifies other cross-account scenarios. For example, a SaaS provider that owns sensitive data could, instead of sharing that data with the customer, expose certified training algorithms and generate models on the customer's behalf. The customer receives the trained model at the end of the Autopilot job.



About the Authors

Francesco Polimeni is a Sr Solutions Architect at AWS with focus on Machine Learning. He has over 20 years of experience in professional services and pre-sales organizations for IT management software solutions.

Mehmet Bakkaloglu is a Sr Solutions Architect at AWS. He has vast experience in data analytics and cloud architecture, having provided technical leadership for transformation programs and pre-sales activities in a variety of sectors.

Read More