The University of Oxford’s Michael Wooldridge and Amazon’s Zachary Lipton on the topic of Wooldridge’s AAAI keynote — and the road ahead for AI research.
English-language Alexa voice learns to speak Spanish
Neural text-to-speech enables new multilingual model to use the same voice for Spanish and English responses.
Automating complex deep learning model training using Amazon SageMaker Debugger and AWS Step Functions
Amazon SageMaker Debugger can monitor ML model parameters, metrics, and computation resources as the model optimization is in progress. You can use it to identify issues during training, gain insights, and take actions like stopping the training or sending notifications through built-in or custom actions. Debugger is particularly useful in training challenging deep learning model architectures, which often require multiple rounds of manual tweaks to model architecture or training parameters to rectify training issues. You can also use AWS Step Functions, our powerful event-driven function orchestrator, to automate manual workflows with pre-planned steps in reaction to anticipated events.
In this post, we show how we can use Debugger with Step Functions to automate monitoring, training, and tweaking deep learning models with complex architectures and challenging training convergence characteristics. Designing deep neural networks often involves manual trials where the model is modified based on training convergence behavior to arrive at a baseline architecture. In these trials, new layers may be added or existing layers removed to stabilize unwanted behaviors like gradients becoming too large (exploding) or too small (vanishing), or different learning methods or parameters may be tried to speed up training or improve performance. This manual monitoring and adjusting is a time-consuming part of the model development workflow, exacerbated by the typically long duration of deep learning training jobs.
Instead of manually inspecting the training trajectory, you can configure Debugger to monitor convergence, and the new Debugger built-in actions can, for example, stop training if any of the specified rules are triggered. Furthermore, we can use Debugger as part of an iterative Step Functions workflow that modifies the model architecture and training strategy until it arrives at a successfully trained model. In such an architecture, we use Debugger to identify potential issues like misbehaving gradients or activation units, and Step Functions orchestrates modifying the model in response to events produced by Debugger.
Overview of the solution
A common challenge in training very deep convolutional neural networks is exploding or vanishing gradients, where gradients grow too large or too small during training, respectively. Debugger supports a number of useful built-in rules to monitor training issues like exploding gradients, dead activation units, or overfitting, and can take actions through built-in or custom actions. Debugger also allows custom rules, although the built-in rules are quite comprehensive and offer good insight into what to look for when training doesn’t yield the desired results.
We build this post’s example around the seminal 2016 paper “Deep Residual Networks with Exponential Linear Unit” by Shah et al., which investigates exponential linear unit (ELU) activation, instead of the combination of ReLU activation with batch normalization layers, for the challenging ResNet family of very deep residual network models. Several architectures are explored in their paper; in particular, the ELU-Conv-ELU-Conv architecture (Section 3.2.2 and Figure 3b in the paper) is reported to be among the more challenging constructs, suffering from exploding gradients. To stabilize training, the paper modifies the architecture by adding batch normalization before the addition layers.
For this post, we use Debugger to monitor the training process for exploding gradients, and use Debugger’s built-in stop training and notification actions to automatically stop the training and notify us if issues occur. As the next step, we devise a Step Functions workflow to address training issues on the fly with pre-planned strategies that we can try each time training fails during the model development process. Our workflow attempts to stabilize the training first by trying different warmup parameters to stabilize the starting point of training, and if that fails, resorts to Shah et al.’s approach of adding batch normalization before the addition layers. You can use the workflow and model code as a template to add other strategies, for example swapping the activation units or trying different gradient-descent optimizers like Adam or RMSprop.
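For reference, attaching a built-in rule with built-in actions to a SageMaker estimator looks roughly like the following. This is a minimal sketch using the SageMaker Python SDK; the estimator settings, role ARN, S3 path, and email address are placeholders rather than the exact configuration used in this post.

from sagemaker.debugger import Rule, rule_configs
from sagemaker.tensorflow import TensorFlow

# Built-in actions: stop the training job and send an email when the rule fires
actions = rule_configs.ActionList(
    rule_configs.StopTraining(),
    rule_configs.Email("alerts@example.com")  # placeholder address
)

# Built-in rule watching for exploding tensors/gradients
rules = [Rule.sagemaker(rule_configs.exploding_tensor(), actions=actions)]

estimator = TensorFlow(
    entry_point='model.py',                                    # illustrative training script
    role='arn:aws:iam::123456789012:role/SageMakerRole',       # placeholder execution role ARN
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    framework_version='2.2',
    py_version='py37',
    rules=rules
)
estimator.fit({'training': 's3://your-bucket/train'})          # placeholder S3 input path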
The workflow
The following diagram shows a schematic of the workflow.
The main components are state, model, train, and monitor, which we discuss in more detail in this section.
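To make the orchestration concrete, the following is a minimal sketch of how such a train/wait/monitor loop could be defined and created with boto3. It is illustrative only: the Lambda ARNs, role ARN, and state names are placeholders, not the actual workflow definition in the repository.

import json
import boto3

sfn = boto3.client('stepfunctions')

# Placeholder ARNs: replace with the Lambda functions and IAM role from your account
TRAIN_LAMBDA_ARN = 'arn:aws:lambda:us-west-2:123456789012:function:launch-training'
MONITOR_LAMBDA_ARN = 'arn:aws:lambda:us-west-2:123456789012:function:monitor-training'
WORKFLOW_ROLE_ARN = 'arn:aws:iam::123456789012:role/step-function-basic-role'

definition = {
    "StartAt": "Train",
    "States": {
        "Train": {"Type": "Task", "Resource": TRAIN_LAMBDA_ARN, "Next": "Wait"},
        "Wait": {"Type": "Wait", "Seconds": 60, "Next": "Monitor"},
        "Monitor": {"Type": "Task", "Resource": MONITOR_LAMBDA_ARN, "Next": "CheckNextAction"},
        "CheckNextAction": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.state.next_action", "StringEquals": "launch_new", "Next": "Train"},
                {"Variable": "$.state.next_action", "StringEquals": "monitor", "Next": "Wait"}
            ],
            "Default": "Done"
        },
        "Done": {"Type": "Succeed"}
    }
}

response = sfn.create_state_machine(
    name='debugger-model-automation',
    definition=json.dumps(definition),
    roleArn=WORKFLOW_ROLE_ARN
)
print(response['stateMachineArn'])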
State component
The state component is a JSON collection that keeps track of the history of models or training parameters tried, current training status, and what to try next when an issue is observed. Each step of the workflow receives this state payload, possibly modifies it, and passes it to the next step. See the following code:
{
    "state": {
        "history": {
            "num_warmup_adjustments": int,
            "num_batch_layer_adjustments": int,
            "num_retraining": int,
            "latest_job_name": str,
            "num_learning_rate_adjustments": int,
            "num_monitor_transitions": int
        },
        "next_action": "<launch_new|monitor|end>",
        "job_status": str,
        "run_spec": {
            "warmup_learning_rate": float,
            "learning_rate": float,
            "add_batch_norm": int,
            "bucket": str,
            "base_job_name": str,
            "instance_type": str,
            "region": str,
            "sm_role": str,
            "num_epochs": int,
            "debugger_save_interval": int
        }
    }
}
Model component
Faithful to Shah et al.’s paper, the model is a residual network of (configurable) depth 20, with hooks to insert additional layers, change activation units, or change the learning behavior via input configuration parameters. See the following code:
def generate_model(input_shape=(32, 32, 3), activation='elu',
                   add_batch_norm=False, depth=20, num_classes=10, num_filters_layer0=16):
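The add_batch_norm flag is the hook used later in the workflow. The following is a minimal sketch (assuming TensorFlow/Keras) of what an ELU-Conv-ELU-Conv residual block with an optional batch normalization layer before the addition might look like; the layer choices are illustrative, not the exact code in the repository.

from tensorflow.keras.layers import Activation, Add, BatchNormalization, Conv2D

def residual_block(x, num_filters, add_batch_norm=False):
    """ELU-Conv-ELU-Conv block with an optional BatchNorm before the addition."""
    shortcut = x
    y = Activation('elu')(x)
    y = Conv2D(num_filters, kernel_size=3, padding='same')(y)
    y = Activation('elu')(y)
    y = Conv2D(num_filters, kernel_size=3, padding='same')(y)
    if add_batch_norm:
        # Normalizes the residual branch before it is added back, to stabilize gradients
        y = BatchNormalization()(y)
    return Add()([shortcut, y])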
Train component
The train step reads the model and training parameters that the monitor step specified to be tried next, and uses an AWS Lambda step to launch the training job using the SageMaker API. See the following code:
def lambda_handler(event, context):
    try:
        state = event['state']
        params = state['run_spec']
    except KeyError as e:
        ...

    ...
    ...

    try:
        job_name = params['base_job_name'] + '-' + \
            datetime.datetime.now().strftime('%Y-%b-%d-%Hh-%Mm-%S')
        sm_client.create_training_job(
            TrainingJobName=job_name,
            RoleArn=params['sm_role'],
            AlgorithmSpecification={
                'TrainingImage': sm_tensorflow_image,
                'TrainingInputMode': 'File',
                'EnableSageMakerMetricsTimeSeries': True,
                'MetricDefinitions': [{'Name': 'loss', 'Regex': 'loss: (.+?)'}]
            },
            ...
Monitor component
The monitor step uses another Lambda step that queries the status of the latest training job and plans the next step of the workflow: wait if there are no changes, or stop and relaunch with new parameters if training issues are found. See the following code:
if rule['RuleEvaluationStatus'] == "IssuesFound":
    logger.info(
        'Evaluation of rule configuration {} resulted in "IssuesFound". '
        'Attempting to stop training job {}'.format(
            rule.get("RuleConfigurationName"), job_name
        )
    )
    stop_job(job_name)
    logger.info('Planning a new launch')
    state = plan_launch_spec(state)
    logger.info(f'New training spec {json.dumps(state["run_spec"])}')
    state["rule_status"] = "ExplodingTensors"
The monitor step is also responsible for publishing updates about the status of the workflow to an Amazon Simple Notification Service (Amazon SNS) topic:
if state["next_action"] == "launch_new":
sns.publish(TopicArn=topic_arn, Message=f'Retraining. n'
f='State: {json.dumps(state)}')
Prerequisites
To launch this walkthrough, you only need to have an AWS account and basic familiarity with SageMaker notebooks.
Solution code
The entire code for this solution can be found in the following GitHub repository. This notebook serves as the entry point to the repository, and includes all necessary code to deploy and run the workflow. Use this AWS CloudFormation Stack to create a SageMaker notebook linked to the repository, together with the required AWS Identity and Access Management (IAM) roles to run the notebook. Besides the notebook and the IAM roles, the other resources like the Step Functions workflow are created inside the notebook itself.
In summary, to run the workflow, complete the following steps:
- Launch this CloudFormation stack. This stack creates a SageMaker notebook with the necessary IAM roles and clones the solution’s repository.
- Follow the steps in the notebook to create the resources and step through the workflow.
Creating the required resources manually without using the above CloudFormation stack
To manually create and run our workflow through a SageMaker notebook, we need to be able to create and run Step Functions, and create Lambda functions and SNS topics. The Step Functions workflow also needs an IAM policy to invoke Lambda functions. We also define a role for our Lambda functions to be able to access SageMaker. If you do not have permission to use the CloudFormation stack, you can create the roles on the IAM console.
The IAM policy for our notebook can be found in the solution’s repository here. Create an IAM role named sagemaker-debugger-notebook-execution and attach this policy to it. Our Lambda functions need permissions to create or stop training jobs and check their status. Create an IAM role for Lambda, name it lambda-sagemaker-train, and attach the policy provided here to it. We also need to add sagemaker.amazonaws.com as a trusted principal in addition to lambda.amazonaws.com for this role.
Finally, the Step Functions workflow only requires access to invoke Lambda functions. Create an IAM role for the workflow, name it step-function-basic-role, and attach the default AWS managed policy AWSLambdaRole. The following screenshot shows the policy on the IAM console.
Next, use the SageMaker console to create a SageMaker notebook. Use the default settings except for what we specify in this post. For the IAM role, use the sagemaker-debugger-notebook-execution role we created earlier. This role allows our notebook to create the services we need, run our workflow, and clean up the resources at the end. You can link the project’s GitHub repository to the notebook, or alternatively, you can clone the repository using a terminal inside the notebook into the /home/ec2-user/SageMaker folder.
Final results
Step through the notebook. At the end, you will get a link to the Step Functions workflow. Follow the link to navigate to the AWS Step Functions workflow dashboard.
The following diagram shows the workflow’s state machine schematic diagram.
As the workflow runs through its steps, it sends SNS notifications with the latest training parameters. When the workflow is complete, we receive a final notification that includes the final training parameters and the final status of the training job. The output of the workflow shows the final state of the state payload, where we can see the workflow completed seven retraining iterations, and settled at the end with lowering the warmup learning rate to 0.003125 and adding a batch normalization layer to the model ("add_batch_norm": 1). See the following code:
{
    "state": {
        "history": {
            "num_warmup_adjustments": 5,
            "num_batch_layer_adjustments": 1,
            "num_retraining": 7,
            "latest_job_name": "complex-resnet-model-2021-Jan-27-06h-45m-19",
            "num_learning_rate_adjustments": 0,
            "num_monitor_transitions": 16
        },
        "next_action": "end",
        "job_status": "Completed",
        "run_spec": {
            "sm_role": "arn:aws:iam::xxxxxxx:role/lambda-sagemaker-train",
            "bucket": "xxxxxxx-sagemaker-debugger-model-automation",
            "add_batch_norm": 1,
            "warmup_learning_rate": 0.003125,
            "base_job_name": "complex-resnet-model",
            "region": "us-west-2",
            "learning_rate": 0.1,
            "instance_type": "ml.m5.xlarge",
            "num_epochs": 5,
            "debugger_save_interval": 100
        },
        "rule_status": "InProgress"
    }
}
Cleaning up
Follow the steps in the notebook under the Clean Up section to delete the resources created. The notebook’s final step deletes the notebook itself as a consequence of deleting the CloudFormation stack. Alternatively, you can delete the SageMaker notebook via the SageMaker console.
Conclusion
Debugger provides a comprehensive set of tools to develop and train challenging deep learning models. Debugger can monitor the training process for hardware resource usage and training problems like dead activation units, misbehaving gradients, or stalling performance, and can automatically respond through built-in and custom actions, for example by stopping the training job or sending notifications. Furthermore, you can easily devise Step Functions workflows around Debugger events to change the model architecture, try different training strategies, or tweak optimizer parameters and algorithms, while tracking the history of recipes tried, together with detailed notification messaging to keep data scientists in full control. The combination of the Debugger and Step Functions toolchains significantly reduces experimentation turnaround and saves on development and infrastructure costs.
About the Authors
Peyman Razaghi is a data scientist at AWS. He holds a PhD in information theory from the University of Toronto and was a post-doctoral research scientist at the University of Southern California (USC), Los Angeles. Before joining AWS, Peyman was a staff systems engineer at Qualcomm contributing to a number of notable international telecommunication standards. He has authored several peer-reviewed scientific research articles in the statistics and systems-engineering areas, and enjoys parenting and road cycling outside work.
Ross Claytor is a Sr Data Scientist on the ProServe Intelligence team at AWS. He works on the application of machine learning and orchestration to real world problems across industries including media and entertainment, life sciences, and financial services.
Setting up an IVR to collect customer feedback via phone using Amazon Connect and AWS AI Services
As many companies place their focus on customer centricity, customer feedback becomes a top priority. However, as new laws are formed, for instance GDPR in Europe, collecting feedback from customers can become increasingly difficult. One means of collecting this feedback is via phone. When a customer calls an agency or call center, feedback may be obtained by forwarding them to an Interactive Voice Response (IVR) system that records their review star rating with open text feedback. If the customer is willing to stay on the line for this, valuable feedback is captured automatically, quickly, and conveniently while complying with modern regulations.
In this post, we share a solution that can be implemented very quickly and leverages AWS artificial intelligence (AI) services, such as Amazon Transcribe and Amazon Comprehend, to further analyze spoken customer feedback. These services provide insights into the sentiment and key phrases used by the caller, redact personally identifiable information (PII), and automate call analysis. Amazon Comprehend extracts the sentiment and key phrases from the open feedback quickly and automatically, and ensures that PII is redacted before the data is stored.
Solution overview
We guide you through the following steps, all of which may be done in only a few minutes:
- Upload the content to an Amazon Simple Storage Service (Amazon S3) bucket in your account.
- Run an AWS CloudFormation template.
- Set up your Amazon Connect contact center.
- Generate a phone number.
- Attach a contact flow to this number.
- Publish your first IVR phone feedback system.
The following diagram shows the serverless architecture that you build.
This post makes use of the following services:
- Amazon Connect is your contact center in the cloud. It allows you to set up contact centers and contact flows. We use a template that you can upload that helps you create your first contact flow easily.
- Amazon Kinesis Video Streams records the spoken feedback of your customers.
- Amazon Simple Queue Service (Amazon SQS) queues the video streams and triggers an AWS Lambda function that extracts the audio stream from the video stream.
- Amazon Transcribe converts the spoken feedback into written text.
- Amazon Comprehend extracts the sentiment and key phrases from the open feedback quickly and automatically.
- Amazon DynamoDB is the NoSQL data storage for your feedback.
- Amazon S3 stores the WAV files generated by Amazon Transcribe and the JSON files generated by Amazon Comprehend.
- Lambda extracts audio streams from video streams, loads feedback data into DynamoDB, and orchestrates usage of the AI services (a sketch of this analysis step follows this list). One of the AWS Lambda functions is a modification of the following code on GitHub.
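For illustration, the following is a minimal sketch (not the solution’s actual Lambda code) of how a transcribed feedback string could be analyzed with Amazon Comprehend and stored in DynamoDB; the table name and attribute names are placeholders.

import boto3

comprehend = boto3.client('comprehend')
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('ivr-feedback')  # placeholder table name

def store_feedback(contact_id, transcript, rating):
    """Analyze the transcribed feedback and persist it alongside the star rating."""
    sentiment = comprehend.detect_sentiment(Text=transcript, LanguageCode='en')
    key_phrases = comprehend.detect_key_phrases(Text=transcript, LanguageCode='en')

    table.put_item(Item={
        'ContactId': contact_id,                                          # placeholder key name
        'Rating': rating,
        'Transcript': transcript,
        'Sentiment': sentiment['Sentiment'],
        'KeyPhrases': [p['Text'] for p in key_phrases['KeyPhrases']],
    })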
Download the GitHub repository
As a first step, download the GitHub repository for this post. It contains the following folder structure:
- cloudformation – Contains the CloudFormation template
- contactflow – A 30-second silence WAV file and the Amazon Connect contact flow
- src – The source code of the Lambda functions
You also have a file named build.sh. We use this file to deploy all the resources in your account. You need the AWS Command Line Interface (AWS CLI) and Gradle to compile, upload, and deploy your resources.
Running the build script
Before we can run the build, we need to open the build.sh file and set a Region, S3 bucket name, and other information. The script performs the following steps for you:
- Creates an S3 bucket that hosts your CloudFormation template and Lambda source code
- Builds the Gradle project (a Java-based Lambda function) to extract voice from a video stream (see also the following GitHub repository)
- Zips all Lambda functions from the src folder
- Uploads the CloudFormation template and the zipped Lambda functions
- Creates a CloudFormation stack
The top of this script looks like the following screenshot.
Provide your preferred setup parameters and make sure you comply with the allowed pattern. You fill out the following fields:
- ApplicationRegion – The Region in which your IVR is deployed.
- S3SourceBucket – The name of the S3 bucket that is created during the build phase. Your CloudFormation template and Lambda code resources are uploaded to this bucket as part of this script.
- S3RecordingsBucketName – This is where the open feedback recordings (WAV files) from your IVR feedback channel are stored.
- S3TranscriptionBucketName – After the WAV files are transcribed, the JSON output is saved here and a Lambda function—as part of your CloudFormation stack—is triggered.
- DynamoDBTableName – The name of your DynamoDB table where the feedback data is stored.
- SQSQueueName – The SQS queue that orchestrates the extraction of the customer open feedback.
- CloudFormationStack – The name of the CloudFormation stack that is created to deploy all the resources.
After you fill in the variables with the proper values, you can run the script. Open your bash terminal on your laptop or computer and navigate to the downloaded folder. Then run the following code:
> ./build.sh
After performing this step, all the necessary resources are deployed and you’re ready to set up your Amazon Connect instance.
Creating an Amazon Connect instance
In our next step, we create an Amazon Connect instance.
- First, we need to give it a name.
You can also link the Amazon Connect instance to an existing account or use SAML for authentication. We named our application ivr-phone-feedback-system. This name appears in the login for Amazon Connect, which is not managed from within the AWS Management Console.
- For the rest of the setup, you can leave the default values, but don’t forget to create an administrator login.
- After the instance is created (it takes just a few moments), go back to the Amazon Connect console.
- Choose your instance alias and choose Data Storage.
- Choose Enable live media streaming.
- For Prefix, enter a prefix.
- For Encryption, select Select KMS key by name.
- For KMS master key, choose aws/kinesisvideo.
- Choose Save.
- In the navigation pane, choose Contact Flows.
- In the AWS Lambda section, add two of the functions created by the CloudFormation stack.
Setting up the contact flow
In this section, we use the files from the contactflow folder, also downloaded from the Knowledge Mine repository in our very first step.
- In the Amazon Connect contact center, on the Routing menu, choose Prompts.
- Choose Create new prompt.
- Upload the 30_seconds_silence.wav file.
You also use this for the open feedback section that your customers can interact with to provide verbal feedback.
- On the Routing menu, choose Contact flows.
- Choose Create contact flow.
- On the drop-down menu, choose Import flow.
- Upload the contact flow from the contactflow folder (ivr-feedback-flow.json, included in the Knowledge Mine repo).
After you import the contact flow, you have the architecture as shown in the following screenshot.
For more information about this functionality, see Amazon Connect resources.
To make this contact flow work with your account, you only need to set the Lambda functions that are invoked. You can ignore the warning icons when adding them; they disappear after you save your settings.
- Choose your respective contact flow box.
- In the pop-up window, for Function ARN, select Select a function.
- For box 1, choose IVR-AddRating2DynamoDB.
- For box 2, choose IVR-SendStream2SQS.
- Choose Save.
- Choose Publish.
Your contact flow is now ready to use.
Creating your phone number
Your IVR phone feedback system is now ready. You only need to claim a phone number and associate your contact flow with that number before you’re done.
- On the Routing menu, choose Phone Numbers.
- Choose Claim a number.
- Select a number to associate.
- For Contact flow/IVR, choose your contact flow.
- Choose Save.
Congratulations! You can now call the number you claimed and test your system. Results end up in the DynamoDB table that you created earlier.
Every time open feedback is provided, it’s stored in DynamoDB and analyzed by our AWS AI services. You can extend this system to ask for more ratings as well (such as service dimensions). You can also encrypt customer feedback. For more information, see Creating a secure IVR solution with Amazon Connect.
Conclusion
Gathering customer feedback is an important step in improving the customer experience, but can be difficult to implement in a rapidly changing legal landscape.
In this post, we described how to not only collect feedback but process it, with the customer’s consent, via automation on the AWS platform. We used a CloudFormation template to generate all the serverless backend services necessary, created an Amazon Connect instance, and created a contact flow therein that was associated with a phone number.
With this setup, we’re ready to collect critical customer feedback ratings and identify areas for improvement. This solution is meant to serve as only a basic example; we encourage you to customize your contact flow to best serve your business needs. For more information, see Create Amazon Connect contact flows.
The AI services used in this post have countless applications. For more information, see AI Services.
About the Authors
Michael Wallner is a Global Data Scientist with AWS Professional Services and is passionate about enabling customers on their AI/ML journey in the cloud to become AWSome. Besides having a deep interest in Amazon Connect, he likes sports and enjoys cooking.
Chris Boomhower is a Machine Learning Engineer for AWS Professional Services. He loves helping enterprise customers around the world develop and automate impactful AI/ML solutions to their most challenging business problems. When he’s not tackling customers’ problems head-on, you’ll likely find him tending to his hobby farm with his family or barbecuing something delicious.
Does the Turing Test pass the test of time?
Four Amazon scientists weigh in on whether the famed mathematician’s definition of artificial intelligence is still applicable, and what might surprise him most today.
This month in AWS Machine Learning: January edition
Hello and welcome to our first “This month in AWS Machine Learning” of 2021! Every day there is something new going on in the world of AWS Machine Learning—from launches to new use cases to interactive trainings. We’re packaging some of the not-to-miss information from the ML Blog and beyond for easy perusing each month. Check back at the end of each month for the latest roundup.
Launches
We ended the year with more than 250 features launched in 2020, and January has kicked us off with even more new features for you to enjoy.
- AWS Contact Center Intelligence solutions are now available through multiple partners in EMEA, including contact center providers. Avaya, Talkdesk, Salesforce, and 8×8 now join Genesys as technology partners for AWS CCI.
- Reach new audiences, have more natural conversations, and develop and iterate faster, even in more than one language, with the new Amazon Lex V2 APIs. Check it out along with information on the new console.
Use cases
Get ideas and architectures from AWS customers, partners, ML Heroes, and AWS experts on how to apply ML to your use case:
- Learn how Talkspace, a virtual therapy platform, integrated Amazon SageMaker and other AWS services to improve the quality of the mental healthcare it provides.
- Learn how AWS ML Hero Agustinus Nalwan helped make his toddler’s dream of flying come true with Amazon SageMaker.
- AWS ML can help you automate Paycheck Protection Program (PPP) loans and save small businesses across the US days of waiting for relief. Amazon Textract, Amazon Comprehend, Amazon Augmented AI (Amazon A2I), and Amazon SageMaker are helping our customers like BlueVine, Kabbage, Baker Tilly, and Biz2Credit process payment protection loans in hours versus days. Learn more about how you can automate loan processing.
- Amazon Fulfillment Technologies migrated from a legacy custom solution for identifying misplaced inventory to SageMaker, reducing AWS infrastructure costs by a projected 40% per month and simplifying its architecture.
- Learn how to build predictive disease models using SageMaker with data stored in Amazon HealthLake using two example predictive disease models.
- deepset explains how they’re building a next-level search engine for business documents using AWS and NVIDIA, achieving a 3.9x speedup and a 12.8x cost reduction when training NLP models.
Explore more ML stories
Want more news about developments in ML? Check out the following stories:
- In our AWS Innovators series, we feature Chris Miller, who created a computer-controlled camera that uses a machine learning algorithm to detect and deter dogs who are pooping on your lawn.
- Commercial buildings are responsible for 40% of U.S. emissions. Learn how Carbon Lighthouse uses machine learning on AWS to develop insights that deliver energy savings and decrease CO2 emissions in commercial real estate.
- Explore how AWS customers like Koch, Woodside, and Bayer have leveraged machine learning in this WSJ article, The Next Industrial Revolution Is Powered by Machine Learning. And get more in-depth information on how Bayer helps farmers achieve more bountiful and sustainable harvests in this technical deep dive.
Mark your calendars
- If you missed AWS re:Invent 2020, you can watch sessions on demand and check out the first-ever ML keynote with Swami Sivasubramanian, VP of Machine Learning at AWS. And our AWS Heroes break down the keynote.
- The AWS DeepRacer pre-season launches today (February 1)! Register here and read more in this post.
- On Feb. 24, we are hosting the AWS Innovate Online Conference – AI & Machine Learning Edition, a free virtual event designed to inspire and empower you to accelerate your AI/ML journey. Whether you are new to AI/ML or an advanced user, AWS Innovate has the right sessions for you to apply AI/ML to your organization and take your skills to the next level. Register here.
About the Author
Laura Jones is a product marketing lead for AWS AI/ML where she focuses on sharing the stories of AWS’s customers and educating organizations on the impact of machine learning. As a Florida native living and surviving in rainy Seattle, she enjoys coffee, attempting to ski and enjoying the great outdoors.
Get ready to roll! AWS DeepRacer pre-season racing is now open
AWS DeepRacer allows you to get hands on with machine learning (ML) through a fully autonomous 1/18th scale race car driven by reinforcement learning, a 3D racing simulator on the AWS DeepRacer console, a global racing league, and hundreds of customer-initiated community races.
Pre-season qualifying underway
We’re excited to announce that racing action is right around the next turn as the 2021 AWS DeepRacer League season starts March 1. But as of today, you can start training your models to get racing fit! February 1 is the kickoff of the official pre-season, where racers with the fastest qualifying Time Trial race results earn a spot to commence the official season (March 1) in the new AWS DeepRacer League Pro division.
After midnight GMT on February 28, the league will calculate the top 10% of times recorded from February 1 through February 28. The developers who make those times will be our first group of Pro division racers and will start the official 2021 season in that division.
Introducing new racing divisions and digital rewards
The 2021 season will introduce new skill-based Open and Pro racing divisions, where developers have five times more opportunities to win rewards and prizes than in the 2020 season! The Open division is available to all developers who want to train their reinforcement learning (RL) model and compete in the Time Trial format. The Pro division is for those racers who have earned a top 10% Time Trial result from the previous month. Racers in the Pro division can earn bigger rewards and win qualifying seats for the 2021 AWS re:Invent Championship Cup.
The new league structure splits the current Virtual Circuit monthly leaderboard into two skill-based divisions, each with their own prizes to maintain a high level of competitiveness in the League. The Open division is where all racers begin their ML learning journey, and rewards participation each month with new digital rewards.
The digital rewards feature, coming soon, enables you to earn and accumulate rewards that recognize achievements along your ML journey. Rewards include vehicle customizations, badges, and avatar accessories that recognize achievements like races completed and fastest times earned. The top racers in the Open division can earn their way into the Pro division each month by finishing in the top 10% of Time Trial results. Similar to previous seasons, winners of the Pro division’s monthly race automatically qualify for the Championship Cup with a trip to AWS re:Invent for a chance to lift the 2021 Cup and receive $10,000 in AWS credits and an F1 experience or a $20,000 value ML education sponsorship.
Racing your model to faster and faster time results in Open and Pro division races can earn you digital rewards like this new racing skin for your virtual racing fun!
“The DeepRacer League has been a fantastic way for thousands of people to test out their newly learnt machine learning skills,” says AWS Hero and AWS Machine Learning Community Founder Lyndon Leggate. “Everyone’s competitive spirit quickly shows through, and the DeepRacer community has seen tremendous engagement from members keen to learn from each other, refine their skills, and move up the ranks. The new 2021 League format looks incredible, and the Open and Pro divisions bring an interesting new dimension to racing! It’s even more fantastic that everyone will get more chances for their efforts to be rewarded, regardless of how long they’ve been racing. This will make it much more engaging for everyone, and I can’t wait to take part!”
Follow your progress during each month’s race and compare how you stack up against the competition in either the Open or Pro division.
Start training your model today and get ready to race!
We’re excited for the 2021 AWS DeepRacer League season to get underway on March 1. Take advantage of pre-season racing to get your model into racing shape. With more opportunities to earn rewards and win prizes through the new skill-based Open and Pro racing divisions, there has never been a better time to get rolling with the AWS DeepRacer League. Start racing today!
About the Author
Dan McCorriston is a Senior Product Marketing Manager for AWS Machine Learning. He is passionate about technology, collaborating with developers, and creating new methods of expanding technology education. Out of the office he likes to hike, cook and spend time with his family.
Columbia Center of AI Technology announces faculty research awards and two PhD student fellowships
Amazon is providing $5 million in funding over five years to support research, education, and outreach programs.
Performing anomaly detection on industrial equipment using audio signals
Industrial companies have been collecting a massive amount of time-series data about operating processes, manufacturing production lines, and industrial equipment. You might store years of data in historian systems or in your factory information system at large. Whether you’re looking to prevent equipment breakdowns that would stop a production line, avoid catastrophic failures in a power generation facility, or improve end product quality by adjusting your process parameters, having the ability to process time-series data is a challenge that modern cloud technologies are up to. However, not everything is about the cloud itself: your factory edge capability must also allow you to stream the appropriate data to the cloud (bandwidth, connectivity, protocol compatibility, putting data in context, and more).
What if you had a frugal way to qualify your equipment health with little data? Such an approach also helps you build robust and easier-to-maintain edge-to-cloud blueprints. In this post, we focus on a tactical approach industrial companies can use to help reduce the impact of machine breakdowns by reducing how unpredictable they are.
Machine failures are often addressed by either reactive action (stop the line and repair) or costly preventive maintenance, where you have to build the proper replacement parts inventory and schedule regular maintenance activities. Skilled machine operators are the most valuable assets in such settings: years of experience allow them to develop a fine knowledge of how the machinery should operate. They become expert listeners, and can detect unusual behavior and sounds in rotating and moving machines. However, production lines are becoming more and more automated, and augmenting these machine operators with AI-generated insights is a way to maintain and develop the fine expertise needed to prevent a reactive-only posture when dealing with machine breakdowns.
In this post, we compare and contrast two different approaches to identify a malfunctioning machine, provided you have sound recordings from its operation. We start by building a neural network based on an autoencoder architecture, and then use an image-based approach where we feed images of sound (namely spectrograms) to an automated machine learning (ML) image classification service.
Services overview
Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy ML models quickly. SageMaker removes the heavy lifting from each step of the ML process to make it easier to develop high-quality models.
Amazon Rekognition Custom Labels is an automated ML service that enables you to quickly train your own custom models for detecting business-specific objects from images. For example, you can train a custom model to classify unique machine parts in an assembly line or to support visual inspection at quality gates to detect surface defects.
Amazon Rekognition Custom Labels builds off the existing capabilities of Amazon Rekognition, which is already trained on tens of millions of images across many categories. Instead of thousands of images, you simply need to upload a small set of training images (typically a few hundred images) that are specific to your use case. If you already labeled your images, Amazon Rekognition Custom Labels can begin training in just a few clicks. If not, you can label them directly within the labeling tool provided by the service or use Amazon SageMaker Ground Truth.
After Amazon Rekognition trains from your image set, it can produce a custom image analysis model for you in just a few hours. Amazon Rekognition Custom Labels automatically loads and inspects the training data, selects the right ML algorithms, trains a model, and provides model performance metrics. You can then use your custom model via the Amazon Rekognition Custom Labels API and integrate it into your applications.
Solution overview
In this use case, we use sounds recorded in an industrial environment to perform anomaly detection on industrial equipment. After the dataset is downloaded, it takes roughly an hour and a half to go through this project from start to finish.
To achieve this, we explore and leverage the Malfunctioning Industrial Machine Investigation and Inspection (MIMII) dataset for anomaly detection purposes. It contains sounds from several types of industrial machines (valves, pumps, fans, and slide rails). For this post, we focus on the fans. For more information about the sound capture procedure, see MIMII Dataset: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection.
In this post, we implement the area in red of the following architecture. This is a simplified extract of the Connected Factory Solution with AWS IoT. For more information, see Connected Factory Solution based on AWS IoT for Industry 4.0 success.
We walk you through the following steps using the Jupyter notebooks provided with this post:
- We first focus on data exploration to get familiar with sound data. This data is particular time-series data, and exploring it requires specific approaches.
- We then use SageMaker to build an autoencoder that we use as a classifier to discriminate between normal and abnormal sounds.
- We take on a more novel approach in the last part of this post: we transform the sound files into spectrogram images and feed them directly to an image classifier. We use Amazon Rekognition Custom Labels to perform this classification task and leverage SageMaker for the data preprocessing and to drive the Amazon Rekognition Custom Labels training and evaluation process.
Both approaches require an equal amount of effort to complete. Although the models obtained in the end aren’t comparable, this gives you an idea of how much of a kick-start you may get when using an applied AI service.
Introducing the machine sound dataset
You can use the data exploration work available in the first companion notebook from the GitHub repo. The first thing we do is plot the waveforms of normal and abnormal signals (see the following screenshot).
From there, you see how to leverage the short Fourier transformation to build a spectrogram of these signals.
These images have interesting features; this is exactly the kind of feature that a neural network can try to uncover and structure. We now build two types of feature extractors based on this data exploration work and feed them to different types of architectures.
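As a reference, a mel spectrogram of one of these recordings can be computed along the following lines. This is a minimal sketch with illustrative parameter values; the companion notebooks use their own sound_tools.py helpers.

import librosa
import numpy as np

def melspectrogram_db(wav_path, n_mels=64, n_fft=1024, hop_length=512):
    """Load a sound file and return its mel spectrogram on a decibel scale."""
    signal, sr = librosa.load(wav_path, sr=None)
    mel = librosa.feature.melspectrogram(
        y=signal, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    return librosa.power_to_db(mel, ref=np.max)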
Building a custom autoencoder architecture
The autoencoder architecture is a neural network with the same number of neurons in the input and the output layers. This kind of architecture learns to generate the identity transformation between inputs and outputs. The second notebook of our series goes through these different steps:
- To feed the spectrogram to an autoencoder, build a tabular dataset and upload it to Amazon Simple Storage Service (Amazon S3).
- Create a TensorFlow autoencoder model and train it in script mode by using the TensorFlow/Keras existing container.
- Evaluate the model to obtain a confusion matrix highlighting the classification performance between normal and abnormal sounds.
Building the dataset
For this post, we use the librosa library, which is a Python package for audio analysis. A feature extraction function based on the spectrogram-generation steps described earlier is central to the dataset generation process. This function is in the sound_tools.py library.
We train our autoencoder only on the normal signals: we want our model to learn how to reconstruct these signals (learning the identity transformation). The main idea is to leverage this for classification later; when we feed this trained model with abnormal sounds, the reconstruction error is a lot higher than when trying to reconstruct normal sounds. We use an error threshold to discriminate abnormal and normal sounds.
Creating the autoencoder
To build our autoencoder, we use Keras and assemble a simple autoencoder architecture with three hidden layers:
from tensorflow.keras import Input
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense
def autoencoder_model(input_dims):
    inputLayer = Input(shape=(input_dims,))
    h = Dense(64, activation="relu")(inputLayer)
    h = Dense(64, activation="relu")(h)
    h = Dense(8, activation="relu")(h)
    h = Dense(64, activation="relu")(h)
    h = Dense(64, activation="relu")(h)
    h = Dense(input_dims, activation=None)(h)
    return Model(inputs=inputLayer, outputs=h)
We put this in a training script (model.py) and use the SageMaker TensorFlow estimator to configure our training job and launch the training:
tf_estimator = TensorFlow(
    base_job_name='sound-anomaly',
    entry_point='model.py',
    source_dir='./autoencoder/',
    role=role,
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    framework_version='2.2',
    py_version='py37',
    hyperparameters={
        'epochs': 30,
        'batch-size': 512,
        'learning-rate': 1e-3,
        'n_mels': n_mels,
        'frame': frames
    },
    debugger_hook_config=False
)

tf_estimator.fit({'training': training_input_path})
Training over 30 epochs takes a few minutes on a p3.2xlarge instance. At this stage, this costs you a few cents. If you plan to use a similar approach on the whole MIMII dataset or use hyperparameter tuning, you can further reduce this training cost by using Managed Spot Training. For more information, see Amazon SageMaker Spot Training Examples.
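Enabling Managed Spot Training only requires a few extra estimator arguments; the following is a sketch with illustrative run and wait limits, reusing the same settings as the estimator above.

from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(
    base_job_name='sound-anomaly-spot',
    entry_point='model.py',
    source_dir='./autoencoder/',
    role=role,                        # assumes the same SageMaker execution role as above
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    framework_version='2.2',
    py_version='py37',
    use_spot_instances=True,          # request Spot capacity for the training job
    max_run=3600,                     # maximum training time, in seconds (illustrative)
    max_wait=7200,                    # maximum total time, including waiting for Spot capacity
    debugger_hook_config=False
)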
Evaluating the model
We now deploy the autoencoder behind a SageMaker endpoint:
tf_endpoint_name = 'sound-anomaly-'+time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
tf_predictor = tf_estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.c5.large',
    endpoint_name=tf_endpoint_name
)
This operation creates a SageMaker endpoint that continues to incur costs as long as it’s active. Don’t forget to shut it down at the end of this experiment.
Our test dataset has an equal share of normal and abnormal sounds. We loop through this dataset and send each test file to this endpoint. Because our model is an autoencoder, we evaluate how good the model is at reconstructing the input. The higher the reconstruction error, the greater the chance that we have identified an anomaly. See the following code:
y_true = test_labels
reconstruction_errors = []

for index, eval_filename in tqdm(enumerate(test_files), total=len(test_files)):
    # Load signal
    signal, sr = sound_tools.load_sound_file(eval_filename)

    # Extract features from this signal:
    eval_features = sound_tools.extract_signal_features(
        signal,
        sr,
        n_mels=n_mels,
        frames=frames,
        n_fft=n_fft,
        hop_length=hop_length
    )

    # Get predictions from our autoencoder:
    prediction = tf_predictor.predict(eval_features)['predictions']

    # Estimate the reconstruction error:
    mse = np.mean(np.mean(np.square(eval_features - prediction), axis=1))
    reconstruction_errors.append(mse)
The following plot shows that the distribution of the reconstruction error for normal and abnormal signals differs significantly. The overlap between these histograms means we have to compromise between the metrics we want to optimize for (fewer false positives or fewer false negatives).
Let’s explore the recall-precision tradeoff for a reconstruction error threshold varying between 5.0 and 10.0 (this range encompasses most of the overlap we can see in the preceding plot). First, let’s visualize how this threshold range separates our signals on a scatter plot of all the testing samples.
If we plot the number of samples flagged as false positives and false negatives, we can see that the best compromise is to use a threshold set around 6.3 for the reconstruction error (assuming we aren’t looking to minimize either the false positive or false negative occurrences specifically).
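The sweep itself can be as simple as counting false positives and false negatives at each candidate threshold. The following sketch assumes the reconstruction_errors and y_true lists from the evaluation loop above, with label 1 marking abnormal samples (an assumption about the labeling convention):

import numpy as np

thresholds = np.arange(5.0, 10.0, 0.1)
errors = np.array(reconstruction_errors)
labels = np.array(y_true)  # assumes 1 = abnormal, 0 = normal

for threshold in thresholds:
    predictions = (errors > threshold).astype(int)
    false_positives = int(np.sum((predictions == 1) & (labels == 0)))
    false_negatives = int(np.sum((predictions == 0) & (labels == 1)))
    print(f'threshold={threshold:.1f}  FP={false_positives}  FN={false_negatives}')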
For this threshold (6.3), we obtain the following confusion matrix.
The metrics associated to this matrix are as follows:
- Precision – 92.1%
- Recall – 92.1%
- Accuracy – 88.5%
- F1 score – 92.1%
Cleaning up
Let’s not forget to delete our endpoint to prevent any additional costs by using the delete_endpoint() API.
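With the SageMaker Python SDK predictor created earlier, this is a one-liner:

tf_predictor.delete_endpoint()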
Autoencoder improvement and further exploration
The spectrogram approach requires defining the spectrogram square dimensions (the number of Mel cell defined in the data exploration notebook), which is a heuristic. In contrast, deep learning networks with a CNN encoder can learn the best representation to perform the task at hand (anomaly detection). The following are further steps to investigate to improve on this first result:
- Experimenting with several more or less complex autoencoder architectures, training for a longer time, performing hyperparameter tuning with different optimizers, or tuning the data preparation sequence (sound discretization parameters).
- Leveraging high-resolution spectrograms and feeding them to a CNN encoder to uncover the most appropriate representation of the sound.
- Using an end-to-end model architecture with an encoder-decoder, which has been known to give good results on waveform datasets.
- Using deep learning models with multi-context temporal and channel (eight microphones) attention weights.
- Using time-distributed 2D convolution layers to encode features across the eight channels. You could feed these encoded features as sequences across time steps to an LSTM or GRU layer. From there, multiplicative sequence attention weights can be learnt on the output sequence from the RNN layer.
- Exploring the appropriate image representation for multi-variate time-series signals that aren’t waveform. You could replace spectrograms with Markov transition fields, recurrence plots, or network graphs to achieve the same goals for non-sound time-based signals.
Using Amazon Rekognition Custom Labels
For our second approach, we feed the spectrogram images directly into an image classifier. The third notebook of our series goes through these different steps:
- Build the datasets. For this use case, we just use images, so we don’t need to prepare tabular data to feed into an autoencoder. We then upload them to Amazon S3.
- Create an Amazon Rekognition Custom Labels project:
- Associate the project with the training data, validation data, and output locations.
- Train a project version with these datasets.
- Start the model. This provisions an endpoint and deploys the model behind it. We can then do the following:
- Query the endpoint for inference for the validation and testing datasets.
- Evaluate the model to obtain a confusion matrix highlighting the classification performance between normal and abnormal sounds.
Building the dataset
Previously, we had to train our autoencoder on only normal signals. In this use case, we build a more traditional split of training and testing datasets. Based on the fans sound database, this yields the following:
- 4,390 signals for the training dataset, including 3,210 normal signals and 1,180 abnormal signals
- 1,110 signals for the testing dataset, including 815 normal signals and 295 abnormal signals
We generate and store the spectrogram of each signal and upload them in either a train or test bucket.
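A minimal sketch of this step follows; the bucket, key, and spectrogram parameters are placeholders, and the notebook has its own implementation.

import boto3
import librosa
import matplotlib.pyplot as plt
import numpy as np

s3 = boto3.client('s3')

def upload_spectrogram(wav_path, bucket, key, n_mels=64, n_fft=1024, hop_length=512):
    """Render a mel spectrogram as a PNG and upload it to S3."""
    signal, sr = librosa.load(wav_path, sr=None)
    mel = librosa.feature.melspectrogram(
        y=signal, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    image = librosa.power_to_db(mel, ref=np.max)
    local_png = '/tmp/spectrogram.png'
    plt.imsave(local_png, image, origin='lower', cmap='viridis')
    s3.upload_file(local_png, bucket, key)

# Example call (file name, bucket, and key are placeholders):
# upload_spectrogram('fan_normal_001.wav', 'my-bucket', 'train/normal/001.png')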
Creating an Amazon Rekognition Custom Labels model
The first step is to create a project with the Rekognition Custom Labels boto3 API:
# Initialization, get a Rekognition client:
PROJECT_NAME = 'sound-anomaly-detection'
reko_client = boto3.client("rekognition")
# Let's try to create a Rekognition project:
try:
    project_arn = reko_client.create_project(ProjectName=PROJECT_NAME)['ProjectArn']

# If the project already exists, we get its ARN:
except reko_client.exceptions.ResourceInUseException:
    # List all the existing projects:
    print('Project already exists, collecting the ARN.')
    reko_project_list = reko_client.describe_projects()

    # Loop through all the Rekognition projects:
    for project in reko_project_list['ProjectDescriptions']:
        # Get the project name (the string after the first delimiter in the ARN)
        project_name = project['ProjectArn'].split('/')[1]

        # Once we find it, we store the ARN and break out of the loop:
        if (project_name == PROJECT_NAME):
            project_arn = project['ProjectArn']
            break

print(project_arn)
We need to tell Amazon Rekognition where to find the training data and testing data, and where to output its results:
TrainingData = {
    'Assets': [{
        'GroundTruthManifest': {
            'S3Object': {
                'Bucket': BUCKET_NAME,
                'Name': f'{PREFIX_NAME}/manifests/train.manifest'
            }
        }
    }]
}

TestingData = {
    'AutoCreate': True
}

OutputConfig = {
    'S3Bucket': BUCKET_NAME,
    'S3KeyPrefix': f'{PREFIX_NAME}/output'
}
Now we can create a project version. Creating a project version builds and trains a model within this Amazon Rekognition project for the data previously configured. Project creation can fail if Amazon Rekognition can’t access the bucket you selected. Make sure the right bucket policy is applied to your bucket (check the notebooks to see the recommended policy).
The following code launches a new model training, and you have to wait approximately 1 hour (less than $1 from a cost perspective) for the model to be trained:
version = 'experiment-1'
VERSION_NAME = f'{PROJECT_NAME}.{version}'
# Let's try to create a new project version in the current project:
try:
    project_version_arn = reko_client.create_project_version(
        ProjectArn=project_arn,        # Project ARN
        VersionName=VERSION_NAME,      # Name of this version
        OutputConfig=OutputConfig,     # S3 location for the output artefact
        TrainingData=TrainingData,     # S3 location of the manifest describing the training data
        TestingData=TestingData        # The validation split is auto-created from the training data
    )['ProjectVersionArn']

# If a project version with this name already exists, we get its ARN:
except reko_client.exceptions.ResourceInUseException:
    # List all the project versions (= models) for this project:
    print('Project version already exists, collecting the ARN:', end=' ')
    reko_project_versions_list = reko_client.describe_project_versions(ProjectArn=project_arn)

    # Loop through them:
    for project_version in reko_project_versions_list['ProjectVersionDescriptions']:
        # Get the project version name (the string after the third delimiter in the ARN)
        project_version_name = project_version['ProjectVersionArn'].split('/')[3]

        # Once we find it, we store the ARN and break out of the loop:
        if (project_version_name == VERSION_NAME):
            project_version_arn = project_version['ProjectVersionArn']
            break

print(project_version_arn)

status = reko_client.describe_project_versions(
    ProjectArn=project_arn,
    VersionNames=[project_version_arn.split('/')[3]]
)['ProjectVersionDescriptions'][0]['Status']
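The notebook then waits for training to finish; a minimal polling loop could look like the following (the status values come from the Rekognition Custom Labels API):

import time

while True:
    description = reko_client.describe_project_versions(
        ProjectArn=project_arn,
        VersionNames=[project_version_arn.split('/')[3]]
    )['ProjectVersionDescriptions'][0]
    status = description['Status']
    print('Current status:', status)
    if status in ('TRAINING_COMPLETED', 'TRAINING_FAILED'):
        break
    time.sleep(60)  # check again in one minute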
Deploying and evaluating the model
First, we deploy our model by using the ARN collected earlier (see the following code). This deploys an endpoint that costs you around $4 per hour. Don’t forget to decommission it when you’re done.
# Start the model
print('Starting model: ' + model_arn)
response = reko_client.start_project_version(
    ProjectVersionArn=model_arn,
    MinInferenceUnits=min_inference_units
)

# Wait for the model to be in the running state:
project_version_running_waiter = reko_client.get_waiter('project_version_running')
project_version_running_waiter.wait(ProjectArn=project_arn, VersionNames=[version_name])

# Get the running status
describe_response = reko_client.describe_project_versions(ProjectArn=project_arn, VersionNames=[version_name])
for model in describe_response['ProjectVersionDescriptions']:
    print("Status: " + model['Status'])
    print("Message: " + model['StatusMessage'])
When the model is running, you can start querying it for predictions. The notebook contains the function get_results(), which queries a given model with a list of pictures sitting in a given path. This takes a few minutes to run all the test samples and costs less than $1 (for approximately 3,000 test samples). See the following code:
predictions_ok = rt.get_results(project_version_arn, BUCKET, s3_path=f'{BUCKET}/{PREFIX}/test/normal', label='normal', verbose=True)
predictions_ko = rt.get_results(project_version_arn, BUCKET, s3_path=f'{BUCKET}/{PREFIX}/test/abnormal', label='abnormal', verbose=True)

def get_results(project_version_arn, bucket, s3_path, label=None, verbose=True):
    """
    Sends a list of pictures located in an S3 path to
    the endpoint to get the associated predictions.
    """
    fs = s3fs.S3FileSystem()
    data = {}
    predictions = pd.DataFrame(columns=['image', 'normal', 'abnormal'])

    for file in fs.ls(path=s3_path, detail=True, refresh=True):
        if file['Size'] > 0:
            image = '/'.join(file['Key'].split('/')[1:])
            if verbose == True:
                print('.', end='')

            labels = show_custom_labels(project_version_arn, bucket, image, 0.0)
            for L in labels:
                data[L['Name']] = L['Confidence']

            predictions = predictions.append(pd.Series({
                'image': file['Key'].split('/')[-1],
                'abnormal': data['abnormal'],
                'normal': data['normal'],
                'ground truth': label
            }), ignore_index=True)

    return predictions
def show_custom_labels(model, bucket, image, min_confidence):
    # Call DetectCustomLabels from the Rekognition API: this will give us the list
    # of labels detected for this picture and their associated confidence level:
    reko_client = boto3.client('rekognition')
    try:
        response = reko_client.detect_custom_labels(
            Image={'S3Object': {'Bucket': bucket, 'Name': image}},
            MinConfidence=min_confidence,
            ProjectVersionArn=model
        )
    except Exception as e:
        print(f'Exception encountered when processing {image}')
        print(e)

    # Returns the list of custom labels for the image passed as an argument:
    return response['CustomLabels']
Let’s plot the confusion matrix associated to this test set (see the following diagram).
The metrics associated to this matrix are as follows:
- Precision – 100.0%
- Recall – 99.8%
- Accuracy – 99.8%
- F1 score – 99.9%
Without much effort (and no ML knowledge!), we can get impressive results. With such a low false negative rate and without any false positives, we can leverage such a model in even the most challenging industrial context.
Cleaning up
We need to stop the running model to avoid incurring costs while the endpoint is live:
response = reko_client.stop_project_version(ProjectVersionArn=model_arn)
Results comparison
Let’s display the results side by side. The following matrix shows the results of the unsupervised custom TensorFlow model, which achieves an F1 score of 92.1%.
The data preparation effort includes the following:
- No need to collect abnormal signals; only normal signals are used to build the training dataset
- Generating spectrograms
- Building sequences of sound frames
- Building training and testing datasets
- Uploading datasets to the S3 bucket
The modeling and improvement effort include the following:
- Designing the autoencoder architecture
- Writing a custom training script
- Writing the distribution code (to ensure scalability)
- Performing hyperparameter tuning
The following matrix shows the results of the supervised Amazon Rekognition Custom Labels model, which achieves an F1 score of 99.9%.
The data preparation effort includes the following:
- Collecting a balanced dataset with enough abnormal signals
- Generating spectrograms
- Making the train and test split
- Uploading spectrograms to the respective S3 training and testing buckets
The modeling and improvement effort include the following:
- None!
Determining which approach to use
Which approach should you use? You might use both! As expected, using a supervised approach yields better results. However, the unsupervised approach is perfect to start curating your collected data to easily identify abnormal situations. When you have enough abnormal signals to build a more balanced dataset, you can switch to the supervised approach. Your overall process looks something like the following:
- Start collecting the sound data.
- When you have enough data, train an unsupervised model and use the results to start issuing warnings to a pilot team, who annotates (confirms) abnormal conditions and sets them aside.
- When you have enough data characterizing abnormal conditions, train a supervised model.
- Deploy the supervised approach to a larger scale (especially if you can tune it to limit the undesired false negative to a minimum number).
- Continue collecting sound signals for normal and abnormal conditions, and monitor potential drift between the recent data and the data used for training. Optionally, you can also further detail the anomalies to detect different types of abnormal conditions.
Conclusion
A major challenge factory managers face in taking advantage of the most recent progress in AI and ML is the amount of customization needed. Training anomaly detection models that can be adapted to many different industrial machines, in order to reduce maintenance effort, reduce rework or waste, increase product quality, or improve the overall equipment efficiency (OEE) of product lines, is a massive amount of work.
SageMaker and Amazon Applied AI services such as Amazon Rekognition Custom Labels enable manufacturers to build AI models without having access to a versatile team of data scientists sitting next to each production line. These services allow you to focus on collecting good-quality data to augment your factory and provide machine operators, process engineers, and lean manufacturing practitioners with high-quality insights.
Building upon this solution, you could record 10-second sound snippets of your machines and send them to the cloud every 5 minutes, for instance. After you train a model, you can use its predictions to feed custom notifications that you can send back to the supervision screens sitting in the factory.
Can you apply the same process to actual time series as captured by machine sensors? In those cases, spectrograms might not be the best visual representation. What about multivariate time series? How can we generalize this approach? Stay tuned for future posts and samples on this impactful topic!
If you’re an ML practitioner passionate about industrial use cases, head over to the Performing anomaly detection on industrial equipment using audio signals GitHub repo for more examples. The solution in this post features an industrial use case, but you can use sound classification ML models in a variety of other settings, for example to analyze animal behavior in agriculture, or to detect anomalous urban sounds such as gunshots, accidents, or dangerous driving. Don’t hesitate to test these services, and let us know what you built!
About the Author
Michaël Hoarau is an AI/ML specialist solutions architect at AWS who alternates between being a data scientist and a machine learning architect, depending on the moment. He is passionate about bringing the power of AI/ML to the shop floors of his industrial customers and has worked on a wide range of ML use cases, ranging from anomaly detection to predictive product quality and manufacturing optimization. When not helping customers develop the next best machine learning experiences, he enjoys observing the stars, traveling, or playing the piano.
Amazon Scholars Michael Kearns and Aaron Roth win 2021 PROSE Award
Authors’ book, The Ethical Algorithm, wins award in computer and information sciences category.