The tech behind the Bundesliga Match Facts xGoals: How machine learning is driving data-driven insights in soccer

It’s quite common to be watching a soccer match and, when seeing a player score a goal, surmise how difficult scoring that goal was. Your opinions may be further confirmed if you’re watching the match on television and hear the broadcaster exclaim how hard it was for that shot to find the back of the net. Previously, that judgment relied on the naked eye, colored by assumptions about the number of defenders present, where the goalkeeper was positioned, and whether the player was in front of the net or angled to the side. Now, with xGoals (short for “Expected Goals”), one of the Bundesliga Match Facts powered by AWS, it’s possible to put data and insights behind the wow factor, showing the fans the exact probability of a player scoring a goal when shooting from any position on the playing field.

Deutsche Fußball Liga (DFL) is responsible for the organization and marketing of Germany’s professional soccer league, Bundesliga and Bundesliga 2. In every match, DFL collects more than 3.6 million data points for deeper insights into what’s happening on the playing field. The vision is to become the most innovative sports league by enhancing the experiences of over 500 million Bundesliga fans and more than 70 media partners around the globe. DFL aims to achieve its vision by using technology in new ways to provide real-time statistics driven by machine learning (ML), build personalized content for the fans, and turn data into insights and action.

xGoals is one of the two new Match Facts (Average Positions being the second) that DFL and AWS officially launched at the end of May 2020, enhancing global fan engagement with Germany’s Bundesliga, the top soccer league in the country, and the league with the highest average number of goals per match. Using Amazon SageMaker, a fully managed service to build, train, and deploy ML models, xGoals can objectively evaluate the goal-scoring chances of Bundesliga players shooting from any position on the playing field. xGoals can also determine if a pass helped open up a better opportunity than if the player had taken a shot or passed the ball to a player in a better position to shoot.

xGoals and other Bundesliga Match Facts are setting new standards by providing data-driven insights in the world of soccer.

Quantifying goal-scoring chances

The xGoals Match Facts debuted on May 26, 2020, during the Borussia Dortmund vs. FC Bayern Munich match, which was broadcast in over 200 countries worldwide. In a game where little was given away and everything had to be fought for, FC Bayern Munich’s Joshua Kimmich managed to take a remarkable shot. Given the distance to goal, angle of the strike, number of surrounding players, and other factors, his goal-scoring probability in this specific situation was only 6%.

The xGoals ML model produces a probability figure between 0 and 1, after which the values are displayed as a percentage. For example, an evaluation of ML models trained on Bundesliga matches showed that every penalty kick has an xGoals (or “xG”) value of 0.77—meaning that the goal-scoring probability is 77%. xGoals introduces a value that quantitatively measures how well a player or a team converts goal-scoring chances and provides information about their performance.

At the end of each match, an aggregation of xGoals values by both teams is also shown. This way, viewers get an objective metric of goal-scoring chances. The particular match mentioned before had a high probability of a draw, if not for that one successful shot from Kimmich. xGoals can enhance the viewing experience and provide insights in several ways, keeping fans engaged and enabling them to understand the potential of players and teams throughout a match or a season.

Given the highly dynamic circumstances around scoring attempts prior to a goal, it’s very hard to achieve an xG value above 70%. Player positions constantly change, and players must make split-second decisions with limited information, mostly relying on intuition. Thus, even when positioned in close proximity to the goal, depending on the situation, the difficulty to score may vary significantly. Therefore, it’s important to have a data-driven, holistic view of all the events on the playing field at any given moment. Only then is it possible to make accurate predictions by also taking into account other players’ positions when feeding this information into the xGoals ML model.

It all starts with data

To bring Match Facts to life, several checks and processes happen before, during, and after a match. Various stakeholders are involved in data acquisition, data processing, graphics, content creation (such as TV feed editing), and live commentary. Each one of the Bundesliga soccer stadiums is equipped with up to 20 cameras for automatic optical tracking of player and ball positions. An editorial team processes additional video data and picks the ideal camera angles and scenes to broadcast. This also includes the decision of when exactly to display Match Facts on TV.

Nearly all match events, such as penalty kicks and shots at goals, are documented live and sent to the DFL systems for remote verification. Human annotators categorize and supplement events with additional situation-specific information. For example, they can add player and team assignments and the type of the shot taken (such as blocking or assisting).

Eventually, all the raw match data is ingested into the Bundesliga Match Facts system on AWS to calculate the xGoals values, which are then distributed worldwide for broadcasting.

In the case of the official Bundesliga app and website, Match Facts are continuously displayed on end-user devices as soon as possible. The same applies to other external customers of DFL with third-party digital platforms, which also offer the latest insights and advanced statistics to soccer fans around the globe.

Real-time content distribution and fan engagement are especially important now, because Bundesliga matches are being played in empty stadiums, which has impacted the in-person soccer viewing experience.

Our ML journey: Bringing code to production

DFL’s leadership, management, and developers have been working hand-in-hand with AWS Professional Services teams through this cloud-adoption journey, enabling ML for an enhanced viewer experience. The mission of AWS Data Science consultants is to accelerate customer business outcomes through the effective use of ML. Customer engagements start with an initial assessment and taking a closer look at desired outcomes and feasibility from both a business and technical perspective. AWS Professional Services consultants supplement customers’ existing teams with specialized skill sets and industry experience, developing proof of concepts (POCs), minimum viable products (MVPs), and bringing ML solutions to production. At the same time, continued learning and knowledge transfer drive sustainable and directly attributable business value.

In addition to in-house experimentation and prototyping performed at DFL’s subsidiary Sportec Solutions, a well-established research community is already working on refining the performance and accuracy of xGoals calculations. Combining this domain knowledge with the right tech stack and establishing best practices allows for faster innovation and execution at scale while ensuring operational excellence, security, reliability, performance efficiency, and cost optimization.

Historical soccer match data is the foundation of state-of-the-art ML-based xGoals model-training approaches. We can use this data to train ML models to infer xGoals outcomes based on given conditions on the playing field. For data quality evaluations and initial experimentations, we need to perform exploratory data analysis, data visualization, data transformation, and data validation. As an example, this can be done in Amazon SageMaker notebooks. The next natural step is to move the ML workloads from research to development. Deploying ML models to production requires an interdisciplinary engineering approach involving a combination of data engineering, data science, and software development. Production settings require error handling, failover, and recovery plans. Overall, ML system development and operations (MLOps) necessitates code refactoring, re-engineering and optimization, automation, setting up the foundational cloud infrastructure, implementing DevOps and security patterns, end-to-end testing, monitoring, and proper system design. The goal should always be to automate as many system components as possible to minimize manual intervention and reduce the need for maintenance.

In the next sections, we further explore the tech stack behind the Bundesliga Match Facts powered by AWS and underlying considerations when streamlining the path to bring xGoals to production.

xGoals model training with Amazon SageMaker

Traditional xGoals ML models are based on event data only. This means that only the approximate position of a player and their distance to a goal are taken into account when evaluating goal-scoring chances. In the case of the Bundesliga, shot-at-goal events are combined with additional high-precision positional data obtained with a 25 Hz frame rate. This comes with additional overhead in data cleaning and data preprocessing within the necessary data stream analytics pipeline. However, the benefits of having more accurate results clearly outweigh the necessary engineering effort and complexity introduced. Based on the ball and player positions, which are constantly being tracked, the model can determine an array of additional features, such as the distance of a player to the goal, angle to the goal, a player’s speed, number of defenders in the line of shot, and goalkeeper coverage.
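To make the feature engineering concrete, the following is a minimal, illustrative sketch (not DFL’s actual feature code) that derives two such features, distance to goal and shot angle, from a shot position, assuming a standard 105 x 68 m pitch with the goal centered on the goal line:

    import math

    # Hypothetical pitch coordinates in meters: goal line at x = 105,
    # standard 7.32 m wide goal centered at y = 34.
    GOAL_X = 105.0
    POST_LOW_Y = 34.0 - 7.32 / 2
    POST_HIGH_Y = 34.0 + 7.32 / 2

    def shot_features(x, y):
        """Derive two simple xGoals-style features from a shot position (x, y)."""
        distance = math.hypot(GOAL_X - x, 34.0 - y)
        # Angle subtended by the two goal posts as seen from the shot, via the law of cosines
        a = math.hypot(GOAL_X - x, POST_LOW_Y - y)
        b = math.hypot(GOAL_X - x, POST_HIGH_Y - y)
        goal_width = POST_HIGH_Y - POST_LOW_Y
        cos_angle = (a * a + b * b - goal_width * goal_width) / (2 * a * b)
        angle_deg = math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
        return {"distance_to_goal_m": round(distance, 2), "shot_angle_deg": round(angle_deg, 2)}

    print(shot_features(x=94.0, y=30.0))  # a shot from just inside the penalty area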

For xGoals, we used the Amazon SageMaker XGBoost algorithm to train an ML model on over 40,000 historical shots at goals in the Bundesliga since 2017. This can either be performed with the default training script (XGBoost as a built-in algorithm) or extended by adding preprocessing and postprocessing scripts (XGBoost as a framework). The Amazon SageMaker Python SDK makes it easy to perform training programmatically with built-in scaling. It also abstracts away the complexity of resource deployment and management needed for automatic XGBoost hyperparameter optimization. It’s advisable to start developing with small subsets of the available data for faster experimentation and gradually evolve and optimize towards more complex ML models trained on the full dataset.

An xGoals training job consists of a binary classification task with Area Under the ROC Curve (AUC) as the objective metric and a highly imbalanced training and validation dataset of shots at goals, which either did or didn’t lead to a goal being scored.
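As a rough illustration of what such a training job can look like with the SageMaker Python SDK, consider the following sketch; the IAM role, bucket paths, and hyperparameter values are placeholders rather than DFL’s actual configuration:

    import sagemaker
    from sagemaker import image_uris
    from sagemaker.estimator import Estimator

    session = sagemaker.Session()
    role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder role ARN

    # Managed XGBoost container image for the current region
    container = image_uris.retrieve("xgboost", session.boto_region_name, version="1.0-1")

    estimator = Estimator(
        image_uri=container,
        role=role,
        instance_count=1,
        instance_type="ml.m5.xlarge",
        output_path="s3://example-bucket/xgoals/output",  # placeholder bucket
        sagemaker_session=session,
    )
    # Binary classification with AUC as the evaluation metric
    estimator.set_hyperparameters(objective="binary:logistic", eval_metric="auc", num_round=200)

    estimator.fit({
        "train": "s3://example-bucket/xgoals/train.csv",            # placeholder training data
        "validation": "s3://example-bucket/xgoals/validation.csv",  # placeholder validation data
    })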

Given the various ML model candidates from the Bayesian search-based hyperparameter optimization job, the best-performing one is picked for deployment on an Amazon SageMaker endpoint. Due to differing resource requirements and longevity, ML model training is decoupled from hosting. The endpoint can be invoked from within applications such as AWS Lambda functions or from within Amazon SageMaker notebooks using an API call for real-time inference.
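Continuing the sketch above, the tuning, deployment, and invocation steps might look like the following; the hyperparameter ranges and the feature vector are again illustrative placeholders:

    import boto3
    from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

    # Bayesian search over a few common XGBoost hyperparameters (illustrative ranges)
    tuner = HyperparameterTuner(
        estimator=estimator,
        objective_metric_name="validation:auc",
        objective_type="Maximize",
        hyperparameter_ranges={
            "eta": ContinuousParameter(0.01, 0.3),
            "max_depth": IntegerParameter(3, 10),
            "min_child_weight": ContinuousParameter(1, 10),
        },
        strategy="Bayesian",
        max_jobs=20,
        max_parallel_jobs=2,
    )
    tuner.fit({
        "train": "s3://example-bucket/xgoals/train.csv",
        "validation": "s3://example-bucket/xgoals/validation.csv",
    })

    # Deploy the best-performing candidate to a real-time endpoint
    predictor = tuner.deploy(initial_instance_count=1, instance_type="ml.m5.large")

    # From a Lambda function, the endpoint can be invoked through the SageMaker runtime
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=predictor.endpoint_name,
        ContentType="text/csv",
        Body="11.7,33.1,6.2,2,1",  # placeholder feature vector for one shot
    )
    print(float(response["Body"].read()))  # goal-scoring probability between 0 and 1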

However, training an ML model using Amazon SageMaker isn’t enough. Other infrastructure components are necessary to handle the full cloud ML pipeline, which consists of data integration, data cleaning, data preprocessing, feature engineering, and ML model training and deployment. In addition, other application-specific cloud components need to be integrated.

xGoals architecture: Serverless ML

Before designing the application architecture, we put a continuous integration and continuous delivery/deployment (CI/CD) pipeline in place. In accordance with the guidelines stated in the AWS Well-Architected Framework whitepaper, we followed a multi-account setup approach for independent development, staging, and production CI/CD pipeline stages. We paired this with an infrastructure as code (IaC) approach to provision these environments and have predictable deployments for each code change. This allows the team to have segregated environments, shortens release cycles, and improves the testability of code. After the developer tools were in place, we started to draft the architecture for the application. The following diagram illustrates this architecture.

Data is ingested in two separate ways: AWS Fargate (a serverless compute engine for containers) is used for receiving positional and event data streams, and Amazon API Gateway for receiving additional metadata such as team compositions and player names. This incoming data triggers a Lambda function that takes care of a variety of short-lived, one-time tasks such as automatic de-provisioning of idle resources; data preprocessing; simple extract, transform, and load (ETL) jobs; and several data quality tests that run every time new match data is consumed. We also use Lambda to invoke the Amazon SageMaker endpoint to retrieve the xGoals predictions for a given set of input features.

We use two databases to store the match states: Amazon DynamoDB, a key-value database, and Amazon DocumentDB (with MongoDB compatibility), a document database. The latter makes it easy to query and index position and event data in JSON format with nested structures. This is especially suitable if workloads require a flexible schema for fast, iterative development. For central storage of official match data, we use Amazon Simple Storage Service (Amazon S3). Amazon S3 stores the historical data from all match days, which is used to iteratively improve the xGoals model. Amazon S3 also stores metadata on model performance, model monitoring, and security metrics.

To monitor the performance of the application, we use an AWS Amplify web application. This gives the operations team and business stakeholders an overview of the system health and status of Match Facts calculations and its underlying cloud infrastructure in the form of a user-friendly dashboard. Such operational insights are important to capture and incorporate in post-match retrospective analyses to ensure continuous improvements of the current system. This dashboard also allows us to collect metrics to measure and evaluate the achievement of desired business outcomes. Continuous monitoring of relevant KPIs, such as overall system load and performance, end-to-end latency, and other non-functional requirements, ensures a holistic view of the current system from both business and technical perspectives.

The xGoals architecture is built in a fully serverless fashion for improved scalability and ease of use. Fully-managed services remove the undifferentiated heavy lifting of managing servers and other basic infrastructure components. The architecture allows us to dynamically support demand when matches start and release the resources at the end of the game without the need for manual actions, which reduces application costs and operational overhead.

Summary

Since the Bundesliga named AWS its official technology provider in January 2020, the league and AWS have embarked on a journey together to bring advanced analytics to life for soccer fans and broadcasters in over 200 countries. Bundesliga Match Facts powered by AWS helps audiences better understand the strategy involved in decision-making on the pitch. xGoals allows soccer viewers to quantitatively evaluate goal-scoring probabilities based on several conditions on the playing field. Other use cases include aggregating scoring chances into individual players’ and goalkeepers’ performance metrics, and objectively evaluating whether the scoreline of a match is a fair reflection of what took place on the playing field.

AWS Professional Services has been working hand-in-hand with DFL and its subsidiary Sportec Solutions, advancing its digital transformation, accelerating business outcomes, and ensuring continuous innovation. Over the course of the coming seasons, DFL will introduce new Bundesliga Match Facts powered by AWS to keep fans engaged, entertained, and provide them with a world-class soccer viewing experience.

“We at Bundesliga are able to use this advanced technology from AWS, including statistics, analytics, and machine learning, to interpret the data and deliver more in-depth insights and better understanding of the split-second decisions made on the pitch. The use of Bundesliga Match Facts enables viewers to gain deeper insights into the key decisions in each match.”

— Andreas Heyden, Executive Vice President of Digital Innovations for the DFL Group


About the Authors

Marcelo Aberle is a Data Scientist in the AWS Professional Services team, working with customers to accelerate their business outcomes through the use of AI/ML. He was the lead developer of the Bundesliga Match Facts xGoals. He enjoys traveling for extended periods of time and is an avid admirer of minimalist design and architecture.

Mirko Janetzke is the Head of IT Development at Sportec Solutions GmbH, the DFL subsidiary responsible for data gathering, data and statistics systems, and soccer analytics within the DFL group. Mirko loves soccer and has been following the Bundesliga and his home team since he was a young boy. In his spare time, he likes to go hiking in the Bavarian Alps with his family and friends.

Lina Mongrand is a Senior Enterprise Services Manager at AWS Professional Services. Lina focuses on helping Media & Entertainment customers build their cloud strategies and approaches and guiding them through their transformation journeys. She is passionate about emerging technologies such as AI/ML and especially how these can help customers achieve their business outcomes. In her spare time, Lina enjoys mountaineering in the nearby Alps (she lives in Munich) with friends and family.


Developing NER models with Amazon SageMaker Ground Truth and Amazon Comprehend

Named entity recognition (NER) involves sifting through text data to locate noun phrases called named entities and categorizing each with a label, such as person, organization, or brand. For example, in the statement “I recently subscribed to Amazon Prime,” Amazon Prime is the named entity and can be categorized as a brand. Building an accurate in-house custom entity recognizer can be a complex process, and requires preparing large sets of manually annotated training documents and selecting the right algorithms and parameters for model training.

This post explores an end-to-end pipeline to build a custom NER model using Amazon SageMaker Ground Truth and Amazon Comprehend.

Amazon SageMaker Ground Truth enables you to efficiently and accurately label the datasets required to train machine learning (ML) systems. Ground Truth provides built-in labeling workflows that take human labelers step by step through tasks, along with the tools needed to build the annotated NER datasets required for model training.

Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to find insights and relationships in text. Amazon Comprehend processes any text file in UTF-8 format. It develops insights by recognizing the entities, key phrases, language, sentiments, and other common elements in a document. To use this custom entity recognition service, you need to provide a dataset for model training purposes, with either a set of annotated documents or a list of entities and their type label (such as PERSON) and a set of documents containing those entities. The service automatically tests for the best and most accurate combination of algorithms and parameters to use for model training.
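For reference, the annotations option expects a CSV file that points to character offsets within the training documents, which contain one document per line. For the example sentence above, a single annotation row could look like the following; the file name is hypothetical, and the offsets 25 and 37 mark where “Amazon Prime” begins and ends on line 0:

    File,Line,Begin Offset,End Offset,Type
    documents.txt,0,25,37,BRAND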

The following diagram illustrates the solution architecture.

The end-to-end process is as follows:

  1. Upload a set of text files to Amazon Simple Storage Service (Amazon S3).
  2. Create a private work team and a NER labeling job in Ground Truth.
  3. The private work team labels all the text documents.
  4. On completion, Ground Truth creates an augmented manifest named manifest in Amazon S3.
  5. Parse the augmented output manifest file and create the annotations (CSV) and documents files in the format that Amazon Comprehend accepts. We mainly focus on a pipeline that automatically converts the augmented output manifest file; you can deploy this pipeline with one click using an AWS CloudFormation template. In addition, we show how to use the convertGroundtruthToComprehendERFormat.sh script from the Amazon Comprehend GitHub repo to parse the augmented output manifest file and create the annotations and documents files. Although you need only one method for the conversion, we highly encourage you to explore both options.
  6. On the Amazon Comprehend console, launch a custom NER training job, using the dataset generated by the AWS Lambda function.

To minimize the time spent on manual annotations while following this post, we recommend using the small accompanying corpus example. Although this doesn’t necessarily lead to performant models, you can quickly experience the whole end-to-end process, after which you can further experiment with a larger corpus or even replace the Lambda function with another AWS service.

Setting up

You need to install the AWS Command Line Interface (AWS CLI) on your computer. For instructions, see Installing the AWS CLI.

You create a CloudFormation stack that creates an S3 bucket and the conversion pipeline. Although the pipeline allows automatic conversions, you can also use the conversion script directly and the setup instructions described later in this post.

Setting up the conversion pipeline

This post provides a CloudFormation template that performs much of the initial setup work for you:

  • Creates a new S3 bucket
  • Creates a Lambda function with Python 3.8 runtime and a Lambda layer for additional dependencies
  • Configures the S3 bucket to automatically trigger the Lambda function on the arrival of output.manifest files

The pipeline source code is hosted in the GitHub repo. To deploy from the template, in another browser window or tab, sign in to your AWS account in us-east-1.

Launch the following stack:

Complete the following steps:

  1. For Amazon S3 URL, enter the URL for the template.
  2. Choose Next.
  3. For Stack name, enter a name for your stack.
  4. For S3 Bucket Name, enter a name for your new bucket.
  5. Leave the remaining parameters at their default.
  6. Choose Next.
  7. On the Configure stack options page, choose Next.
  8. Review your stack details.
  9. Select the three check-boxes acknowledging that AWS CloudFormation might create additional resources and capabilities.
  10. Choose Create stack.

CloudFormation is now creating your stack. When it completes, you should see something like the following screenshot.

Setting up the conversion script

To set up the conversion script on your computer, complete the following steps:

  1. Download and install Git on your computer.
  2. Decide where you want to store the repository on your local machine. We recommend making a dedicated folder so you can easily navigate to it using the command prompt later.
  3. In your browser, navigate to the Amazon Comprehend GitHub repo.
  4. Under Contributors, choose Clone or download.
  5. Under Clone with HTTPS, choose the clipboard icon to copy the repo URL.

To clone the repository using an SSH key, including a certificate issued by your organization’s SSH certificate authority, choose Use SSH and choose the clipboard icon to copy the repo URL to your clipboard.

  6. In the terminal, navigate to the location in which you want to store the cloned repository. You can do so by entering $ cd <directory>.
  7. Enter the following code:
    $ git clone <repo-url>

After you clone the repository, follow the steps in the README file on how to use the script to integrate the Ground Truth NER labeling job with Amazon Comprehend custom entity recognition.

Uploading the sample unlabeled corpus

Run the following commands to copy the sample data files to your S3 bucket:

$ aws s3 cp s3://aws-ml-blog/artifacts/blog-groundtruth-comprehend-ner/sample-data/groundtruth/doc-00.txt s3://<your-bucket>/raw/

$ aws s3 cp s3://aws-ml-blog/artifacts/blog-groundtruth-comprehend-ner/sample-data/groundtruth/doc-01.txt s3://<your-bucket>/raw/

$ aws s3 cp s3://aws-ml-blog/artifacts/blog-groundtruth-comprehend-ner/sample-data/groundtruth/doc-02.txt s3://<your-bucket>/raw/

The sample data shortens the time you spend annotating, and isn’t necessarily optimized for best model performance.

Running the NER labeling job

This step involves three manual steps:

  1. Create a private work team.
  2. Create a labeling job.
  3. Annotate your data.

You can reuse the private work team over different jobs.

When the job is complete, it writes an output.manifest file, which a Lambda function picks up automatically. The function converts this augmented manifest file into two files: .csv and .txt. Assuming that the output manifest is s3://<your-bucket>/gt/<gt-jobname>/manifests/output/output.manifest, the two files for Amazon Comprehend are located under s3://<your-bucket>/gt/<gt-jobname>/manifests/output/comprehend/.

Creating a private work team

For this use case, you form a private work team with your own email address as the only worker. Ground Truth also allows you to use Amazon Mechanical Turk or a vendor workforce.

  1. On the Amazon SageMaker console, under Ground Truth, choose Labeling workforces.
  2. On the Private tab, choose Create private team.
  3. For Team name, enter a name for your team.
  4. For Add workers, select Invite new workers by email.
  5. For Email addresses, enter your email address.
  6. For Organization name, enter a name for your organization.
  7. For Contact email, enter your email.
  8. Choose Create private team.
  9. Go to the new private team to find the URL of the labeling portal.

    You also receive an enrollment email with the URL, user name, and a temporary password, if this is your first time being added to this team (or if you set up Amazon Simple Notification Service (Amazon SNS) notifications).
  10. Sign in to the labeling URL.
  11. Change the temporary password to a new password.

Creating a labeling job

The next step is to create a NER labeling job. This post highlights the key steps. For more information, see Adding a data labeling workflow for named entity recognition with Amazon SageMaker Ground Truth.

To reduce the amount of annotation time, use the sample corpus that you have copied to your S3 bucket as the input to the Ground Truth job.

  1. On the Amazon SageMaker console, under Ground Truth, choose Labeling jobs.
  2. Choose Create labeling job.
  3. For Job name, enter a job name.
  4. Select I want to specify a label attribute name different from the labeling job name.
  5. For Label attribute name, enter ner.
  6. Choose Create manifest file.

This step lets Ground Truth automatically convert your text corpus into a manifest file.

A pop-up window appears.

  7. For Input dataset location, enter the Amazon S3 location.
  8. For Data type, select Text.
  9. Choose Create.

You see a message that your manifest is being created.

  10. When the manifest creation is complete, choose Create.

You can also prepare the input manifest yourself. Be aware that the NER labeling job requires its input manifest in the {"source": "embedded text"} format rather than the reference style {"source-ref": "s3://bucket/prefix/file-01.txt"}. In addition, the generated input manifest automatically splits on line breaks (\n) and generates one JSON line per line of each document, whereas if you generate it on your own, you may decide on one JSON line per document (although you may still need to preserve the \n characters for your downstream processing).
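For example, a two-line text document converted by Ground Truth would produce a two-line input manifest similar to the following (the sentences are illustrative):

    {"source": "I recently subscribed to Amazon Prime."}
    {"source": "The delivery arrived the next day."}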

  11. For IAM role, choose the role the CloudFormation template created.
  12. For Task selection, choose Named entity recognition.
  13. Choose Next.
  14. On the Select workers and configure tool page, for Worker types, select Private.
  15. For Private teams, choose the private work team you created earlier.
  16. For Number of workers per dataset object, make sure the number of workers (1) matches the size of your private work team.
  17. In the text box, enter labeling instructions.
  18. Under Labels, add your desired labels. See the next screenshot for the labels needed for the recommended corpus.
  19. Choose Create.

The job has been created, and you can now track the status of the individual labeling tasks on the Labeling jobs page.

The following screenshot shows the details of the job.

Labeling data

If you use the recommended corpus, you should complete this section in a few minutes.

After you create the job, the private work team can see this job listed in their labeling portal, and can start annotating the tasks assigned.

The following screenshot shows the worker UI.

When all tasks are complete, the labeling job status shows as Completed.

Post-check

The CloudFormation template configures the S3 bucket to emit an Amazon S3 put event to a Lambda function whenever a new object whose key ends with manifests/output/output.manifest lands in the bucket. However, AWS recently added Amazon CloudWatch Events support for labeling jobs, which you can use as another mechanism to trigger the conversion. For more information, see Amazon SageMaker Ground Truth Now Supports Multi-Label Image and Text Classification and Amazon CloudWatch Events.

The Lambda function loads the augmented manifest and converts the augmented manifest into comprehend/output.csv and comprehend/output.txt, located under the same prefix as the output.manifest. For example, s3://<your_bucket>/gt/<gt-jobname>/manifests/output/output.manifest yields the following:

s3://<your_bucket>/gt/<gt-jobname>/manifests/output/comprehend/output.csv
s3://<your_bucket>/gt/<gt-jobname>/manifests/output/comprehend/output.txt

You can further track the Lambda execution context in CloudWatch Logs, starting by inspecting the tags that the Lambda function adds to output.manifest, which you can do on the Amazon S3 console or AWS CLI.

To track on the Amazon S3 console, complete the following steps:

  1. On the Amazon S3 console, navigate to the location of the output.manifest file.
  2. Select output.manifest.
  3. On the Properties tab, choose Tags.
  4. View the tags added by the Lambda function.

To use the AWS CLI, enter the following code (the log stream tag __LATEST_xxx denotes CloudWatch log stream [$LATEST]xxx, because [$] aren’t valid characters for Amazon S3 tags, and therefore substituted by the Lambda function):

$ aws s3api get-object-tagging --bucket gtner-blog --key gt/test-gtner-blog-004/manifests/output/output.manifest
{
    "TagSet": [
        {
            "Key": "lambda_log_stream",
            "Value": "2020/02/25/__LATEST_24497900b44f43b982adfe2fb1a4fbe6"
        },
        {
            "Key": "lambda_req_id",
            "Value": "08af8228-794e-42d1-aa2b-37c00499bbca"
        },
        {
            "Key": "lambda_log_group",
            "Value": "/aws/lambda/samtest-ConllFunction-RRQ698841RYB"
        }
    ]
}

You can now go to the CloudWatch console and trace the actual log groups, log streams, and the RequestId. See the following screenshot.

Training a custom NER model on Amazon Comprehend

Amazon Comprehend requires the input corpus to obey these minimum requirements per entity:

  • 1000 samples
  • Corpus size of 5120 bytes
  • 200 annotations

The sample corpus you used for Ground Truth doesn’t meet this minimum requirement. Therefore, we have provided you with additional pre-generated Amazon Comprehend input. This sample data is meant to let you quickly start training your custom model, and is not necessarily optimized for model performance.

On your computer, enter the following code to upload the pre-generated data to your bucket:

$ aws s3 cp s3://aws-ml-blog/artifacts/blog-groundtruth-comprehend-ner/sample-data/comprehend/output-x112.txt s3://<your_bucket>/gt/<gt-jobname>/manifests/output/comprehend/documents/

$ aws s3 cp s3://aws-ml-blog/artifacts/blog-groundtruth-comprehend-ner/sample-data/comprehend/output-x112.csv s3://<your_bucket>/gt/<gt-jobname>/manifests/output/comprehend/annotations/

Your s3://<your_bucket>/gt/<gt-jobname>/manifests/output/comprehend/documents/ folder should end up with two files: output.txt and output-x112.txt.

Your s3://<your_bucket>/gt/<gt-jobname>/manifests/output/comprehend/annotations/ folder should contain output.csv and output-x112.csv.

You can now start your custom NER training.

  1. On the Amazon Comprehend console, under Customization, choose Custom entity recognition.
  2. Choose Train recognizer.
  3. For Recognizer name, enter a name.
  4. For Custom entity type, enter your labels.

Make sure the custom entity types match what you used in the Ground Truth job.

  5. For Training type, select Using annotations and training docs.
  6. Enter the Amazon S3 locations for your annotations and training documents.
  7. For IAM role, if this is the first time you’re using Amazon Comprehend, select Create an IAM role.
  8. For Permissions to access, choose Input and output (if specified) S3 bucket.
  9. For Name suffix, enter a suffix.
  10. Choose Train.

You can now see your recognizer listed.

Training can take up to 1 hour. The following screenshot shows your view when training is complete.
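If you prefer to start the same training job programmatically instead of through the console, a minimal boto3 sketch could look like the following; the recognizer name, role ARN, and entity types are placeholders, and the entity types must match the labels you used in the Ground Truth job:

    import boto3

    comprehend = boto3.client("comprehend")

    response = comprehend.create_entity_recognizer(
        RecognizerName="gt-ner-recognizer",  # placeholder name
        LanguageCode="en",
        DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendDataAccessRole",  # placeholder role
        InputDataConfig={
            # Entity types must match the labels used in the Ground Truth job
            "EntityTypes": [{"Type": "PERSON"}, {"Type": "BRAND"}],
            "Documents": {"S3Uri": "s3://<your_bucket>/gt/<gt-jobname>/manifests/output/comprehend/documents/"},
            "Annotations": {"S3Uri": "s3://<your_bucket>/gt/<gt-jobname>/manifests/output/comprehend/annotations/"},
        },
    )
    print(response["EntityRecognizerArn"])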

Cleaning up

When you finish this exercise, remove your resources with the following steps:

  1. Empty the S3 bucket (or delete the bucket).
  2. Delete the CloudFormation stack.

Conclusion

You have learned how to use Ground Truth to build a NER training dataset and automatically convert the produced augmented manifest into the format that Amazon Comprehend can readily digest.

As always, AWS welcomes feedback. Please submit any comments or questions.


About the Authors

Verdi March is a Senior Data Scientist with AWS Professional Services, where he works with customers to develop and implement machine learning solutions on AWS. In his spare time, he enjoys honing his coffee-making skills and spending time with his family.


Jyoti Bansal is a software development engineer on the AWS Comprehend team where she is working on implementing and improving NLP based features for Comprehend Service. In her spare time, she loves to sketch and read books.


Nitin Gaur is a Software Development Engineer with AWS Comprehend, where he works implementing and improving NLP based features for AWS. In his spare time, he enjoys recording and playing music.


Generating compositions in the style of Bach using the AR-CNN algorithm in AWS DeepComposer

AWS DeepComposer gives you a creative way to get started with machine learning (ML) and generative AI techniques. AWS DeepComposer recently launched a new generative AI algorithm called autoregressive convolutional neural network (AR-CNN), which allows you to generate music in the style of Bach. In this blog post, we show a few examples of how you can use the AR-CNN algorithm to generate interesting compositions in the style of Bach and explain how the algorithm’s parameters impact the characteristics of the generated composition.

The AR-CNN algorithm provided in the AWS DeepComposer console offers a variety of parameters for generating unique compositions, such as the number of sampling iterations and the maximum number of notes to add to or remove from the input melody. The parameter values directly impact the extent to which the input melody is modified. For example, setting the maximum number of notes to add to a high value allows the algorithm to add additional notes it predicts are suitable for a composition in the style of Bach music.

The AR-CNN algorithm allows you to collaborate iteratively with the machine learning algorithm by experimenting with the parameters; you can use the output from one iteration of the AR-CNN algorithm as input to the next iteration.

For more information about the algorithm’s concepts, see the Introduction to autoregressive convolutional neural network learning capsule available in the AWS DeepComposer console. Learning capsules provide easy-to-consume, bite-size modules to help you learn the concepts of generative AI algorithms.

Bach is widely regarded as one of the greatest composers of all time. His compositions represent the best of the Baroque era. Listen to a few examples from original Bach compositions to familiarize yourself with his music:

Composition 1:

Composition 2:

Composition 3:

The AR-CNN algorithm enhances the original input melody by adding or removing notes from the input melody. If the algorithm detects off-key or extraneous notes, it may choose to remove them. If it identifies certain notes that are highly probable in a Bach composition, it may choose to add them. Listen to the following example of an enhanced composition that is generated by applying the AR-CNN algorithm.

Input:

Enhanced composition:

You can change the values of the AR-CNN algorithm parameters to generate music with different characteristics.

AR-CNN parameters in AWS DeepComposer

In the previous section, you heard an example of a composition created in the style of Bach music using AR-CNN algorithm. The following section explores how the algorithm parameters provided in the AWS DeepComposer console influence the generated compositions. The following parameters are available in the console: Sampling iterations, Maximum number of notes to remove or add, and Creative risk.

Sampling iterations

This parameter controls the number of times the input melody is passed through the algorithm for it to add or remove notes. As the sampling iterations parameter increases, the model gets more chances to add or remove notes from the input melody to make the composition sound more Bach-like.

On the AWS DeepComposer console, the Music Studio has a limit of 100 sampling iterations in a single run. You can increase the sampling iterations beyond 100 by feeding the generated music back into the algorithm as input. Listen to the generated output for the following input melody, “Me and My Jar,” at different iterations.

Me and My Jar original input melody:

Me and My Jar output at iteration 100:

Me and My Jar output at iteration 500:

After a certain number of sampling iterations, you can observe that the generated music doesn’t change much, even after going through more iterations. At this stage, the model has improved the input melody as much as it can. Further iterations may cause it to add notes, and promptly remove them, or vice versa. Thus, the generated music remains mostly unchanged with more iterations after this stage.

Maximum number of notes to remove

This parameter allows you to specify the maximum percentage of your original composition that the algorithm can remove. Setting this number to 0% ensures the input melody is completely preserved. Setting this number to 100% removes a majority of the original notes. Even with a value of 100%, the algorithm may choose to retain parts of your input melody depending on the quality and similarity to Bach music. Listen to the following example of the generated music for “Me and My Jar” when the maximum notes to remove parameter is set to 100%.

Me and My Jar original input melody:

Me and My Jar at 100% removal:

Me and My Jar at 0% removal:

At 100% removal, it’s difficult to detect the original composition in the generated composition. At 0%, the algorithm preserves your original input melody but the algorithm is limited in its ability to enhance the music. For instance, the algorithm can’t remove off-key notes in the input melody.

Maximum number of notes to add

This parameter specifies the maximum number of notes that the algorithm can add to your input melody. Setting a low value fills in missing notes; setting a high value adds more notes to the input melody. If you don’t limit the number of notes the algorithm can add, the model attempts to make the melody as close to a Bach composition as possible.

Me and My Jar original input melody:

Me and My Jar at max 350 notes addition:

Me and My Jar at max 50 notes addition:

Notice how you can’t hear the original soundtrack anymore when a maximum of 350 notes is added. By limiting the number of notes that the algorithm can add, you can prevent the original input melody from being drowned out by the addition of new notes. The downside is that the algorithm becomes limited in its ability to generate music, which may result in less than desirable results. When we picked a maximum of 50 notes to add, there are significantly fewer notes added, so you can hear the input melody. However, the music quality isn’t high because the algorithm is limited in the number of notes it can add.

You can choose how you’d like to balance your original composition and allow the algorithm freedom in generating music by using the parameters of maximum notes to remove or add.

Creative risk

This parameter contributes to the surprise element or the creativity factor in a composition. Setting this value too low leads to more predictable music because the model only chooses safe, high-probability notes. A very high creative risk can result in more unique, and less predictable compositions. However, it may sometimes produce poor quality melodies due to the model choosing notes that are less likely to occur.

The algorithm identifies the notes by sampling from a probability distribution of what it believes is the best note to add next. The creative risk parameter allows you to adjust the shape of that probability distribution. As you increase the creative risk, the probability distribution flattens, which means that several less likely notes are chosen and receive a higher probability than before to be added to the composition. The opposite occurs as you turn down creative risk, which makes sure that the algorithm focuses on notes that it’s sure are correct. This parameter is also known as temperature in machine learning.
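To build intuition for how this flattening works, the following small sketch (with made-up note scores, not the actual DeepComposer implementation) shows how dividing the scores by the creative risk value before applying a softmax changes the sampling probabilities:

    import math

    def note_probabilities(scores, creative_risk=1.0):
        """Turn raw note scores into sampling probabilities.

        Dividing by creative_risk (the temperature) before the softmax flattens
        the distribution for high values and sharpens it for low values.
        """
        scaled = [s / creative_risk for s in scores]
        m = max(scaled)                       # for numerical stability
        exps = [math.exp(s - m) for s in scaled]
        total = sum(exps)
        return [round(e / total, 3) for e in exps]

    # Made-up scores for four candidate notes
    scores = [4.0, 2.0, 1.0, 0.5]
    for risk in (0.5, 1.0, 2.0, 6.0):
        print(risk, note_probabilities(scores, risk))

At a creative risk of 0.5, the highest-scoring note dominates the output; at 6, the four notes become nearly equally likely to be sampled.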

Output after 1000 iterations with a creative risk of 1:

Output after 1000 iterations with a creative risk of 2:

Output after 1000 iterations with a creative risk (temperature) of 6:

A creative risk of 1 is usually the baseline. As the creative risk increases, there are more and more scattered notes. These can add flair to your music. However, the scattered notes eventually devolve into noise because a completely flattened probability distribution is no different than random choice.

To experiment, you can turn up the creative risk for a few iterations to generate some scattered notes and then turn down the creative risk to have the algorithm use those notes when generating new music.

Conclusion

Congratulations! You’ve now learned how each parameter can affect the characteristics of your composition. In addition to changing the hyperparameters, we encourage you to try out the following next steps:

  • Change the input melody. Play your own input melody using the virtual or physical AWS DeepComposer keyboard. Don’t worry if you’re not a musical expert; the AR-CNN algorithm automatically fixes mistakes.
  • Crop your input melodies and see how the algorithm fills in missing sections.
  • Feed your autoregressive composition into GANs to create accompaniments.

We are excited for you to try out various combinations to generate your creative musical piece. Start composing now!


About the Authors

Jyothi Nookula is a Principal Product Manager for AWS AI devices. She loves to build products that delight her customers. In her spare time, she loves to paint and host charity fund raisers for her art exhibitions.


Rahul Suresh is an Engineering Manager with the AWS AI org, where he has been working on AI based products for making machine learning accessible for all developers. Prior to joining AWS, Rahul was a Senior Software Developer at Amazon Devices and helped launch highly successful smart home products. Rahul is passionate about building machine learning systems at scale and is always looking for getting these advanced technologies in the hands of customers. In addition to his professional career, Rahul is an avid reader and a history buff.


Prachi Kumar is a ML Engineer and AI Researcher at AWS, where she has been working on AI based products to help teach users about Machine Learning. Prior to joining AWS, Prachi was a graduate student in Computer Science at UCLA, and has been focusing on ML projects and courses throughout her masters and undergrad. In her spare time, she enjoys reading and watching movies.


Wayne Chi is a ML Engineer and AI Researcher at AWS. He works on researching interesting Machine Learning problems to teach new developers and then bringing those ideas into production. Prior to joining AWS he was a Software Engineer and AI Researcher at JPL, NASA where he worked on AI Planning and Scheduling systems for the Mars 2020 Rover (Perseverance). In his spare time he enjoys playing tennis, watching movies, and learning more about AI.


Enoch Chen is a Senior Technical Program Manager for AWS AI Devices. He is a big fan of machine learning and loves to explore innovative AI applications. Recently he helped bring DeepComposer to thousands of developers. Outside of work, Enoch enjoys playing piano and listening to classical music.


Discover the latest in voice technology at Alexa Live, a free virtual event for builders and business leaders

Are you interested in the latest advancements in voice and natural language processing (NLP) technology? Maybe you’re curious about how you can use these technologies to build with Alexa? We’re excited to share that Alexa Live—a free virtual developer education event hosted by the Alexa team—is back for its second year on July 22, 2020. Alexa Live is a half-day event focused on the latest advancements in voice technology and how to build delightful customer experiences with them. Attendees will learn about the latest in voice and NLP, including our AI-driven dialog manager, Alexa Conversations, and will dive deep into implementing these technologies through expert-led breakout sessions.

Explore the cutting edge of voice and NLP

Alexa Live will kick off with a keynote address from VP of Alexa Devices Nedim Fresko, who will share the latest advancements in voice technology and spotlight developers who are getting started with these technologies. You’ll see demonstrations, learn about product updates and developer tools, and hear from guest speakers who are innovating with Alexa skills and devices.

You can roll up your sleeves and dive into these technologies through breakout sessions in five dedicated tracks. During the event, you can also interact with Alexa experts through live chat.

What you’ll learn at Alexa Live

No matter how you build—or are interested in building—for Alexa, you’ll get an inside look at the latest in voice and come away with practical tips for implementing voice into your brand, device, or other project. Alexa Live features five dedicated tracks that fall into three topical areas.

Skill Development

These tracks focus on the Alexa Skills Kit (ASK). You’ll hear from Alexa product teams and third parties about creating engaging Alexa skills for multiple use cases. There are three Skill Development tracks:

  • Advanced Voice – Dive deep into advanced applications of voice technology—especially our AI-driven dialog manager, Alexa Conversations.
  • Multimodal Experiences – Learn how to create visually rich Alexa skills, including games.
  • On-the-Go Skills – Learn how to engage customers in new places with Alexa skills.

Device Development

The Device Development track is dedicated to enabling smart home devices and other device types with Alexa. You’ll learn from speakers from inside and outside of Amazon about how to get started or advance your device development for multiple use cases. Technologies covered include the following:

Development Services

The Development Services track is focused on Alexa solution providers that can help businesses create and implement a voice strategy through skills or devices. You’ll learn about the services these solution providers offer and hear from brands that have worked with them to bring their projects to market faster.

Sessions will explore how to develop a voice strategy with a skill-building agency and how to work with Alexa original design manufacturers (ODMs) and systems integrators for Alexa devices.

Register for Alexa Live today

Whether you’re a builder or a business leader, Alexa Live will help you understand the latest advancements in voice technology and equip you to add them to your own projects. Join us on July 22, 2020, for this half-day event packed with inspiration and education.

Ready to start planning your experience? Register now for free to get instant access to the full speaker and session catalog.

Looking for more? Check out the Alexa developer blog and community resources, and engage with the community on Twitter, LinkedIn, and Facebook using the hashtag #AlexaLive. We can’t wait to see you on July 22!


About the Author

Drew Meyer is a Product Marketing Manager at Alexa. He helps developers understand the latest in AI and ML capabilities for voice so they can build exciting new experiences with Alexa. In his spare time, he plays with his twin boys, rides mountain bikes, and plays guitar.


Building a speech-to-text notification system in different languages with Amazon Transcribe and an IoT device

Have you ever wished that people visiting your home could leave you a message if you’re not there? What if that solution could support your native language? Here is a straightforward and cost-effective solution that you can build yourself, and you only pay for what you use.

This post demonstrates how to build a notification system to detect a person, record audio, transcribe the audio to text, and send the text to your mobile device in your preferred language. The solution uses the following services:

Prerequisites

To complete this walkthrough, you need the following:

Workflow and architecture

When the sensor detects a person within a specified range, the speaker attached to the Raspberry Pi plays the initial greeting and asks the user to record a message. This recording is sent to Amazon S3, which triggers a Lambda function to transcribe the speech to text using Amazon Transcribe. When the transcription is complete, the user receives a text notification of the transcript from Amazon SNS.

The following diagram illustrates the solution workflow.

Amazon Transcribe uses a deep learning process called automatic speech recognition (ASR) to convert speech to text quickly and accurately in the language of your choice. It automatically adds punctuation and formatting so that the output matches the quality of any manual transcription. You can configure Amazon Transcribe with custom vocabulary for more accurate transcriptions (for example, the names of people living in your home). You can also configure it to remove specific words from transcripts (such as profane or offensive words). Amazon Transcribe supports many different languages. For more information, see What Is Amazon Transcribe?
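The CloudFormation template in the next section provisions the Lambda function that performs this step for you. Purely as an illustration of the flow, a minimal handler that starts a transcription job for an uploaded recording might look like the following sketch; the job naming scheme and the LANGUAGE_CODE environment variable are assumptions. Publishing the finished transcript with sns.publish(PhoneNumber=..., Message=...) then completes the notification.

    import json
    import os
    import time
    import urllib.parse

    import boto3

    transcribe = boto3.client("transcribe")

    def lambda_handler(event, context):
        # Triggered by the S3 put event for the uploaded .wav recording
        record = event["Records"][0]["s3"]
        bucket = record["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["object"]["key"])

        job_name = "visitor-message-{}".format(int(time.time()))  # assumed naming scheme
        transcribe.start_transcription_job(
            TranscriptionJobName=job_name,
            Media={"MediaFileUri": "s3://{}/{}".format(bucket, key)},
            MediaFormat="wav",
            LanguageCode=os.environ.get("LANGUAGE_CODE", "en-US"),  # e.g., en-US or de-DE
        )
        return {"statusCode": 200, "body": json.dumps({"transcription_job": job_name})}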

Uploading the CloudFormation stack

This post provides a CloudFormation template that creates an input S3 bucket that triggers a Lambda function to transcribe from audio to text, an SNS notification to send the text to the user, and the permissions around it.

  1. Download the CloudFormation template.
  2. On the AWS CloudFormation console, choose Upload a template file.
  3. Choose the file you downloaded.
  4. Choose Next.
  5. For Stack Name, enter the name of your stack.
  6. Under Parameters, update the parameters for the template with the following inputs:
    • MobileNumber (requires input) – A valid mobile number to receive SNS notifications
    • LanguageCode (requires input) – The language code of your audio file, such as English US
    • SourceS3Bucket (requires input) – A unique bucket name
  7. Choose Next.
  8. On the Options page, choose Next.
  9. On the Review page, review and confirm the settings.
  10. Select the check box that acknowledges that the template creates IAM resources.
  11. Choose Create.

You can view the status of the stack on the AWS CloudFormation console. You should see the status CREATE_COMPLETE in approximately 5 minutes.

  12. Record the BucketName and RaspberryPiUserName values from the Outputs tab.

Downloading the greeting message

To download the greeting message, complete the following steps:

  1. On the Amazon Polly console, on the Plain text tab, enter your greeting.
  2. For Language and Region, choose your preferred language.
  3. Choose Download MP3.
  4. Rename the file to greetings.mp3.
  5. Move the file to the /home/pi/Downloads/ folder on the Raspberry Pi.
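If you prefer to script this step instead of using the console, you can generate the same MP3 with the Amazon Polly API. A small sketch follows; the greeting text and voice are examples that you would adapt to your preferred language:

    import boto3

    polly = boto3.client("polly")
    response = polly.synthesize_speech(
        Text="Hello! I'm not home right now. Please leave a message after the tone.",  # example greeting
        OutputFormat="mp3",
        VoiceId="Joanna",  # choose a voice that matches your preferred language
    )
    with open("greetings.mp3", "wb") as f:
        f.write(response["AudioStream"].read())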

Setting up AWS IoT credentials provider

Set up your AWS IoT credentials to securely authenticate your IoT device and obtain temporary AWS credentials. For instructions, see How to Eliminate the Need for Hardcoded AWS Credentials in Devices by Using the AWS IoT Credentials Provider. In step 3 of that post, use the following policy instead, so the device can upload the recording to Amazon S3 rather than update an Amazon DynamoDB table:

    {
        "Version": "2012-10-17",
        "Statement": {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::<sourceS3Bucket>/*"
        }
    }

Setting up Raspberry Pi

To set up Raspberry Pi, complete the following steps:

  1. On the Raspberry Pi, open the terminal and install the AWS CLI.
  2. Create a Python file with code for the sensor to detect if a person is within a specific range (for example, 30 to 200 cm), play the greeting message, record audio for a specified period (for example, 20 seconds), and send the recording to Amazon S3. See the following example code.
    import os
    import time

    import RPi.GPIO as GPIO  # requires the RPi.GPIO library on the Raspberry Pi

    while True:
        GPIO.setmode(GPIO.BOARD)
        # Set the trigger and echo pins of the ultrasonic sensor
        PIN_TRIGGER = 7
        PIN_ECHO = 11
        GPIO.setup(PIN_TRIGGER, GPIO.OUT)
        GPIO.setup(PIN_ECHO, GPIO.IN)
        GPIO.output(PIN_TRIGGER, GPIO.LOW)

        print("Waiting for sensor to settle")
        time.sleep(2)

        print("Calculating distance")
        GPIO.output(PIN_TRIGGER, GPIO.HIGH)
        time.sleep(0.00001)
        GPIO.output(PIN_TRIGGER, GPIO.LOW)
        while GPIO.input(PIN_ECHO) == 0:
            pulse_start_time = time.time()
        while GPIO.input(PIN_ECHO) == 1:
            pulse_end_time = time.time()
        pulse_duration = pulse_end_time - pulse_start_time

        # Calculate the distance in cm based on the duration of the pulse
        distance = round(pulse_duration * 17150, 2)
        print("Distance:", distance, "cm")

        if 30 <= distance <= 200:
            # Play the greeting downloaded from Amazon Polly
            cmd = "ffplay -nodisp -autoexit /home/pi/Downloads/greetings.mp3"
            print("Starting Recorder")
            os.system(cmd)
            # Record for 20 seconds, add a timestamp to the filename, and copy the file to S3
            # (replace homeautomation12121212 with your SourceS3Bucket name)
            cmd1 = 'DATE_HREAD=$(date "+%s");arecord /home/pi/Desktop/$DATE_HREAD.wav -D sysdefault:CARD=1 -d 20 -r 48000;aws s3 cp /home/pi/Desktop/$DATE_HREAD.wav s3://homeautomation12121212'
            os.system(cmd1)
        else:
            print("Nothing detected")

  3. Run the Python file.

The ultrasonic sensor continuously looks for a person approaching your home. When it detects a person, the speaker plays the greeting and asks the guest to record a message. The recording is then sent to Amazon S3.

If your Raspberry Pi has more than one audio device connected, such as HDMI and USB, configure the .asoundrc file to select which speaker and microphone to use.

Testing the solution

Place the Raspberry Pi in your house at a location where it can sense the person and record their audio.

When a person appears in front of the Raspberry Pi, they hear the welcome message, can record a message, and leave. You should then receive the text of the recorded audio as a notification.

Conclusion

This post demonstrated how to build a secure voice-to-text notification solution using AWS services. You can integrate this solution the next time you need a voice-to-text feature in your application, in a variety of languages. If you have questions or comments, please share your feedback in the comments section.


About the Author

Vikas Shah is an Enterprise Solutions Architect at Amazon Web Services. He is a technology enthusiast who enjoys helping customers find innovative solutions to complex business challenges. His areas of interest are ML, IoT, robotics, and storage. In his spare time, Vikas enjoys building robots, hiking, and traveling.

Anusha Dharmalingam is a Solutions Architect at Amazon Web Services with a passion for application development and big data solutions. Anusha works with enterprise customers to help them architect, build, and scale applications to achieve their business goals.

Read More

Building AI-powered forecasting automation with Amazon Forecast by applying MLOps

Building AI-powered forecasting automation with Amazon Forecast by applying MLOps

This post demonstrates how to create a serverless Machine Learning Operations (MLOps) pipeline to develop and visualize a forecasting model built with Amazon Forecast. Because Machine Learning (ML) workloads need to scale, it’s important to break down the silos among different stakeholders to capture business value. The MLOps model makes sure that the data science, production, and operations teams work together seamlessly across workflows that are as automated as possible, ensuring smooth deployments and effective ongoing monitoring.

Similar to the DevOps model in software development, the MLOps model in ML helps build code and integration across ML tools and frameworks. You can automate, operationalize, and monitor data pipelines without having to rewrite custom code or rethink existing infrastructures. MLOps helps scale existing distributed storage and processing infrastructures to deploy and manage ML models at scale. It can also be implemented to track and visualize drift over time for all models across the organization in one central location and implement automatic data validation policies.

MLOps combines best practices from DevOps and the ML world by applying continuous integration, continuous deployment, and continuous training. MLOps helps streamline the lifecycle of ML solutions in production. For more information, see the whitepaper Machine Learning Lens: AWS Well-Architected Framework.

In the following sections, you build, train, and deploy a time series forecasting model using an MLOps pipeline that encompasses Amazon Forecast, AWS Lambda, and AWS Step Functions. To visualize the generated forecast, you use the AWS serverless analytics services Amazon Athena and Amazon QuickSight.

Solution architecture

In this section, you deploy an MLOps architecture that you can use as a blueprint to automate your Amazon Forecast usage and deployments. The provided architecture and sample code help you build an MLOps pipeline for your time series data, enabling you to generate forecasts to define future business strategies and fulfill customer needs.

You can build this serverless architecture using AWS-managed services, which means you don’t need to worry about infrastructure management while creating your ML pipeline. This makes it easier to iterate on new datasets and adjust your model by tuning features and hyperparameters to optimize performance.

The following diagram illustrates the components you will build throughout this post.

In the preceding diagram, the serverless MLOps pipeline is deployed using a Step Functions workflow, in which Lambda functions are stitched together to orchestrate the steps required to set up Amazon Forecast and export the results to Amazon Simple Storage Service (Amazon S3).

The architecture contains the following components:

  • The time series dataset is uploaded to the Amazon S3 cloud storage under the /train directory (prefix).
  • The uploaded file triggers a Lambda function, which initiates the MLOps pipeline built using a Step Functions state machine (a minimal sketch of this trigger appears after this list).
  • The state machine stitches together a series of Lambda functions to build, train, and deploy an ML model in Amazon Forecast. You learn more about the state machine’s Lambda components in the next section.
  • For log analysis, the state machine uses Amazon CloudWatch, which captures Forecast metrics. You use Amazon Simple Notification Service (Amazon SNS) to send email notifications when the final forecasts become available in the source Amazon S3 bucket in the /forecast directory. The ML pipeline saves any old forecasts in the /history directory.
  • Finally, you will use Athena and QuickSight to provide a visual presentation of the current forecast.
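
For illustration, the trigger in the second step can be as small as the following sketch; the environment variable name and payload shape are assumptions rather than the exact code in the sample repository:

    # Minimal sketch of the S3-triggered Lambda function that starts the state machine
    import json
    import os

    import boto3

    sfn = boto3.client("stepfunctions")

    def lambda_handler(event, context):
        record = event["Records"][0]["s3"]
        payload = {"bucket": record["bucket"]["name"], "key": record["object"]["key"]}
        response = sfn.start_execution(
            stateMachineArn=os.environ["STATE_MACHINE_ARN"],
            input=json.dumps(payload),
        )
        return {"executionArn": response["executionArn"]}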

In this post, you use the Individual household electric power consumption dataset available in the UCI Machine Learning Repository. The time series dataset aggregates hourly energy consumption for various customer households and shows spikes in energy use on weekdays. You can replace the sample data as needed for later use cases.

Now that you are familiar with the architecture, you’re ready to explore the details of each Lambda component in the state machine.

Building an MLOps pipeline using Step Functions

In the previous section, you learned that the Step Functions state machine is the core of the architecture automating the entire MLOps pipeline. The following diagram illustrates the workflow deployed using the state machine.

As shown in the preceding diagram, the Lambda functions in the Step Functions workflow are as follows (the steps also highlight the mapping between Lambda functions and the parameters used from the params.json file stored in Amazon S3):

  • Create-Dataset – Creates a Forecast dataset. The information about the dataset helps Forecast understand how to consume the data for model training.
  • Create-DatasetGroup – Creates a dataset group.
  • Import-Data – Imports your data into a dataset that resides inside a dataset group.
  • Create-Predictor – Creates a predictor with a forecast horizon that the parameters file specifies.
  • Create-Forecast – Creates the forecast and starts an export job to Amazon S3, including quantiles specified in the parameters file.
  • Update-Resources – Creates the necessary Athena resources and transforms the exported forecasts to the same format as the input dataset.
  • Notify Success – Sends an email alert when the job is finished by posting a message to Amazon SNS.
  • Strategy-Choice – Checks whether to delete the Forecast resources, according to the parameters file.
  • Delete-Forecast – Deletes the forecast and keeps the exported data.
  • Delete-Predictor – Deletes the predictor.
  • Delete-ImportJob – Deletes the Import-Data job in Forecast.

In Amazon Forecast, a dataset group is an abstraction that contains all the datasets for a particular collection of forecasts. There is no information sharing between dataset groups; to try out various alternatives to the schemas, you create a new dataset group and make changes inside its corresponding datasets. For more information, see Datasets and Dataset Groups. For this use case, the workflow imports a target time series dataset into a dataset group.

After completing these steps, the workflow triggers the predictor training job. A predictor is a Forecast-trained model used for making forecasts based on time series data. For more information, see Predictors.

When your predictor is trained, the workflow triggers the creation of a forecast using that predictor. During forecast creation, Amazon Forecast trains a model on the entire dataset before hosting the model and making inferences. For more information, see Forecasts.

The state machine sends a notification email to the address you specified at deployment when the forecast export succeeds. After exporting your forecast, the Update-Resources step reformats the exported data so Athena and QuickSight can easily consume it.

You can reuse this MLOps pipeline to build, train, and deploy other ML models by replacing the algorithms and datasets in the Lambda function for each step.
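
As a concrete illustration, a step such as Create-Predictor essentially wraps a single Forecast API call. The following sketch shows roughly what that Lambda function does; in the sample repository the horizon, frequency, and dataset group ARN come from params.json and the state machine input, so the hard-coded values here are only placeholders:

    # Rough sketch of the Create-Predictor step
    import boto3

    forecast = boto3.client("forecast")

    def lambda_handler(event, context):
        response = forecast.create_predictor(
            PredictorName="energy_consumption_predictor",
            ForecastHorizon=72,                    # hours to forecast
            PerformAutoML=True,                    # let Forecast choose the algorithm
            InputDataConfig={"DatasetGroupArn": event["DatasetGroupArn"]},
            FeaturizationConfig={"ForecastFrequency": "H"},
        )
        return {"PredictorArn": response["PredictorArn"]}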

Prerequisites

Before you deploy the architecture, complete the following prerequisite steps:

  1. Install Git.
  2. Install the AWS Serverless Application Model (AWS SAM) CLI on your system. For instructions, see Installing the AWS SAM CLI. Verify that the latest version is installed by running the following command:
    sam --version 

Deploying the sample architecture to your AWS account

To simplify deployment, this post provides the entire architecture as infrastructure as code using AWS CloudFormation and is available in the Forecast Visualization Automation Blogpost GitHub repo. You use AWS SAM to deploy this solution.

  1. Clone the Git repo. See the following code:
    git clone https://github.com/aws-samples/amazon-forecast-samples.git

    The code is also available on the Forecast Visualization Automation Blogpost GitHub repo.

  2. Navigate to the newly created amazon-forecast-samples/ml_ops/visualization_blog directory and enter the following code to start solution deployment:
    cd amazon-forecast-samples/ml_ops/visualization_blog
    
    sam build && sam deploy --guided
    

    At this stage, AWS SAM builds a CloudFormation template change set. After a few seconds, AWS SAM prompts you to deploy the CloudFormation stack.

  3. Provide parameters for the stack deployment. This post uses the following parameters; you can keep the default parameters:
    Setting default arguments for 'sam deploy'
    =========================================
    Stack Name [ForecastSteps]: <Enter Stack Name e.g. forecast-blog-stack>
    AWS Region [us-east-1]: <Enter region e.g. us-east-1>
    Parameter Email [youremail@yourprovider.com]: <Enter valid e-mail id>
    Parameter ParameterFile [params.json]: <Leave Default>
    #Shows you the resource changes to be deployed and requires a 'Y' to initiate the deploy
    Confirm changes before deploy [Y/n]: y
    #SAM needs permission to create roles to connect to the resources in your template
    Allow SAM CLI IAM role creation [Y/n]: y
    Save arguments to samconfig.toml [Y/n]: n


    AWS SAM creates an AWS CloudFormation change set and asks for confirmation.

  4. Enter Y.

For more information about change sets, see Updating Stacks Using Change Sets.

After a successful deployment, you see the following output:

CloudFormation outputs from the deployed stack
------------------------------------------------------------
Outputs                                                                                                                                
-------------------------------------------------------------
Key                 AthenaBucketName                                                                                                   
Description         Athena bucket name to drop your files                                                                              
Value               forecast-blog-stack-athenabucket-1v6qnz7n5f13w                                                                     

Key                 StepFunctionsName                                                                                                  
Description         Step Functions Name                                                                                                
Value               arn:aws:states:us-east-1:789211807855:stateMachine:DeployStateMachine-5qfVJ1kycEOj                                 

Key                 ForecastBucketName                                                                                                 
Description         Forecast bucket name to drop your files                                                                            
Value               forecast-blog-stack-forecastbucket-v61qpov2cy8c                                                                    
-------------------------------------------------------------
Successfully created/updated stack - forecast-blog-stack in us-east-1
  1. On the AWS CloudFormation console, on the Outputs tab, record the value of ForecastBucketName, which you use in the testing step.

Testing the sample architecture

The following steps outline how to test the sample architecture. To trigger the Step Functions workflow, you need to upload two files to the newly created S3 bucket: a parameter file and the time series training dataset.

  1. Under the same directory in which you cloned the GitHub repo, enter the following code, replacing YOURBUCKETNAME with the value from the AWS CloudFormation Outputs tab that you copied earlier:
    aws s3 cp ./testing-data/params.json s3://{YOURBUCKETNAME}

    The preceding command copies the parameters file that the Lambda functions use to configure your Forecast API calls.

  2. Upload the time series dataset by entering the following code:
    aws s3 sync ./testing-data/ s3://{YOURBUCKETNAME}

  3. On the Step Functions dashboard, locate the state machine named DeployStateMachine-<random string>.
  4. Choose the state machine to explore the workflow in execution.

In the preceding screenshot, all successfully executed steps (Lambda functions) are in a green box. The blue box indicates that a step is still in progress. All boxes without colors are steps that are pending execution. It can take up to 2 hours to complete all the steps of this workflow.

After the successful completion of the workflow, you can go to the Amazon S3 console and find an Amazon S3 bucket with the following directories:

/params.json    # Your parameters file.
/train/         # Where the training CSV files are stored
/history/       # Where the previous forecasts are stored
/history/raw/   # Contains the raw Amazon Forecast exported files
/history/clean/ # Contains the previous processed Amazon Forecast exported files
/quicksight/    # Contains the most updated forecasts according to the train dataset
/tmp/           # Where the Amazon Forecast files are temporarily stored before processing

The parameter file params.json stores attributes to call Forecast APIs from the Lambda functions. These parameter configurations contain information such as forecast type, predictor setting, and dataset setting, in addition to forecast domain, frequency, and dimension. For more information about API actions, see Amazon Forecast Service.
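
A purely illustrative excerpt of such a file is shown below; the authoritative key names and structure are defined by the params.json in the GitHub repository, so treat this only as a sketch of the kind of information it carries:

    {
      "DatasetGroup": { "Domain": "CUSTOM" },
      "Dataset": {
        "DataFrequency": "H",
        "Schema": {
          "Attributes": [
            { "AttributeName": "timestamp", "AttributeType": "timestamp" },
            { "AttributeName": "target_value", "AttributeType": "float" },
            { "AttributeName": "item_id", "AttributeType": "string" }
          ]
        }
      },
      "Predictor": { "ForecastHorizon": 72, "PerformAutoML": true },
      "Forecast": { "ForecastTypes": ["0.10", "0.50", "0.90"] }
    }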

Now that your data is in Amazon S3, you can visualize your results.

Analyzing forecasted data with Athena and QuickSight

To complete your forecast pipeline, you need to query and visualize your data. Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. QuickSight is a fast, cloud-powered business intelligence service that makes it easy to uncover insights through data visualization. To start analyzing your data, you first ingest your data into QuickSight using Athena as a data source.

If you’re new to AWS, first set up QuickSight to create a QuickSight account. If you already have an AWS account, subscribe to QuickSight to create an account.

If this is your first time using Athena on QuickSight, you have to give permissions to QuickSight to query Amazon S3 using Athena. For more information, see Insufficient Permissions When Using Athena with Amazon QuickSight.

  1. On the QuickSight console, choose New Analysis.
  2. Choose New Data Set.
  3. Choose Athena.
  4. In the New Athena data source window, for Data source name, enter a name; for example, Utility Prediction.
  5. Choose Validate connection.
  6. Choose Create data source.

The Choose your table window appears.

  1. Choose Use custom SQL.
  2. In the Enter custom SQL query window, enter a name for your query; for example, Query to merge Forecast result with training data.
  3. Enter the following code into the query text box:
    SELECT LOWER(forecast.item_id) as item_id,
             forecast.target_value,
             date_parse(forecast.timestamp, '%Y-%m-%d %H:%i:%s') as timestamp,
             forecast.type
    FROM default.forecast
    UNION ALL
    SELECT LOWER(train.item_id) as item_id,
             train.target_value,
             date_parse(train.timestamp, '%Y-%m-%d %H:%i:%s') as timestamp,
             'history' as type
    FROM default.train
    

  4. Choose Confirm query.

You now have the option to import your data to SPICE or query your data directly.

  1. Choose either option, then choose Visualize.

You see the following fields under Fields list:

  • item_id
  • target_value
  • timestamp
  • type

The exported forecast contains the following fields:

  • item_id
  • date
  • The requested quantiles (P10, P50, P90)

The type field contains the quantile (P10, P50, or P90) for your forecasted window and the value history for your training data. The custom query produces this consistent format so you can plot one continuous line across your historical data and the exported forecast.

You can customize the quantiles by using the optional ForecastTypes parameter of the CreateForecast API. For this post, you configure this in the params.json file in Amazon S3.
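
A minimal sketch of that call looks like the following (the predictor ARN is a placeholder):

    import boto3

    forecast = boto3.client("forecast")
    forecast.create_forecast(
        ForecastName="energy_consumption_forecast",
        PredictorArn="<your-predictor-arn>",
        ForecastTypes=["0.10", "0.50", "0.90"],  # p10, p50, p90
    )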

  1. For X axis, choose timestamp.
  2. For Value, choose target_value.
  3. For Color, choose type.

In your parameters, you specified a 72-hour horizon. To visualize results, you need to aggregate the timestamp field on an hourly frequency.

  1. From the timestamp drop-down menu, choose Aggregate and Hour.

The following screenshot is your final forecast prediction. The graph shows a future projection at the quantiles p10, p50, and p90, for which the probabilistic forecasts were generated.

Conclusion

Every organization can benefit from more accurate forecasting to better predict product demand, optimize planning and supply chains, and more. Forecasting demand is a challenging task, and ML can narrow the gap between predictions and reality.

This post showed you how to create a repeatable, AI-powered, automated forecast generation process. You also learned how to implement an ML operation pipeline using serverless technologies and used a managed analytics service to get data insights by querying the data and creating a visualization.

There’s even more that you can do with Forecast. For more information about energy consumption predictions, see Making accurate energy consumption predictions with Amazon Forecast. For more information about quantiles, see Amazon Forecast now supports the generation of forecasts at a quantile of your choice.

If this post helps you or inspires you to solve a problem, share your thoughts and questions in the comments. You can use and extend the code on the GitHub repo.


About the Author

Luis Lopez Soria is an AI/ML specialist solutions architect working with the AWS machine learning team. He works with AWS customers to help them adopt machine learning on a large scale. He enjoys playing sports, traveling around the world, and exploring new foods and cultures.

Saurabh Shrivastava is a solutions architect leader and AI/ML specialist working with global systems integrators. He works with AWS partners and customers to provide them with architectural guidance for building scalable architecture in hybrid and AWS environments. He enjoys spending time with his family outdoors and traveling to new destinations to discover new cultures.

Pedro Sola Pimentel is an R&D solutions architect working with the AWS Brazil commercial team. He works with AWS to innovate and develop solutions using new technologies and services. He’s interested in recent computer science research topics and enjoys traveling and watching movies.

Read More

Enhancing enterprise search with Amazon Kendra

Enhancing enterprise search with Amazon Kendra

Amazon Kendra is an easy-to-use enterprise search service that allows you to add search capabilities to your applications so end-users can easily find information stored in different data sources within your company. This could include invoices, business documents, technical manuals, sales reports, corporate glossaries, internal websites, and more. You can harvest this information from storage solutions like Amazon Simple Storage Service (Amazon S3) and OneDrive; applications such as Salesforce, SharePoint, and ServiceNow; or relational databases like Amazon Relational Database Service (Amazon RDS).

When you type a question, the service uses machine learning (ML) algorithms to understand the context and return the most relevant results, whether that’s a precise answer or an entire document. Most importantly, you don’t need to have any ML experience to do this—Amazon Kendra also provides you with the code that you need to easily integrate with your new or existing applications.

This post shows you how to create your internal enterprise search by using the capabilities of Amazon Kendra. This enables you to build a solution to create and query your own search index. For this post, you use Amazon.com help documents in HTML format as the data source, but Amazon Kendra also supports MS Office (.doc, .ppt), PDF, and text formats.

Overview of solution

This post provides the steps to help you create an enterprise search engine on AWS using Amazon Kendra. You can provision a new Amazon Kendra index in under an hour without deep technical knowledge or ML experience.

The post also demonstrates how to configure Amazon Kendra for a customized experience by adding FAQs, deploying Amazon Kendra in custom applications, and synchronizing data sources. The subsequent sections walk through each of these areas.

Prerequisites

For this walkthrough, you need an AWS account with access to the AWS Management Console.

Creating and configuring your document repository

Before you can create an index in Amazon Kendra, you need to load documents into an S3 bucket. This section contains instructions to create an S3 bucket, get the files, and load them into the bucket. After completing all the steps in this section, you have a data source that Amazon Kendra can use.

  1. On the AWS Management Console, in the Region list, choose US East (N. Virginia) or any Region of your choice that Amazon Kendra is available in.
  2. Choose Services.
  3. Under Storage, choose S3.
  4. On the Amazon S3 console, choose Create bucket.
  5. Under General configuration, provide the following information:
    • Bucket name: kendrapost-{your account id}.
    • Region: Choose the same Region that you use to deploy your Amazon Kendra index (this post uses US East (N. Virginia) us-east-1).
  6. Under Bucket settings for Block Public Access, leave everything with the default values.
  7. Under Advanced settings, leave everything with the default values.
  8. Choose Create bucket.
  9. Download amazon_help_docs.zip and unzip the files.
  10. On the Amazon S3 console, select the bucket that you just created and choose Upload.
  11. Upload the unzipped files.

Inside your bucket, you should now see two folders: amazon_help_docs (with 3,100 objects) and faqs (with one object).

The following screenshot shows the contents of amazon_help_docs.

The following screenshot shows the contents of faqs.

Creating an index

An index is the Amazon Kendra component that provides search results for documents and frequently asked questions. After completing all the steps in this section, you have an index ready to consume documents from different data sources. For more information about indexes, see Index.

To create your first Amazon Kendra index, complete the following steps:

  1. On the console, choose Services.
  2. Under Machine Learning, choose Amazon Kendra.
  3. On the Amazon Kendra main page, choose Create an Index.
  4. In the Index details section, for Index name, enter kendra-blog-index.
  5. For Description, enter My first Kendra index.
  6. For IAM role, choose Create a new role.
  7. For Role name, enter index-role (the console adds the prefix AmazonKendra-YourRegion- to the role name).
  8. For Encryption, don’t select Use an AWS KMS managed encryption key.

(Your data is encrypted with an Amazon Kendra-owned key by default.)

  1. Choose Next.

For more information about the IAM roles Amazon Kendra creates, see Prerequisites.

Amazon Kendra offers two editions. Kendra Enterprise Edition provides a high-availability service for production workloads. Kendra Developer Edition is suited for building a proof-of-concept and experimentation. For this post, you use the Developer edition.

  1. In the Provisioning editions section, select Developer edition.
  2. Choose Create.

For more information on the free tier, document size limits, and total storage for each Amazon Kendra edition, see Amazon Kendra pricing.

The index creation process can take up to 30 minutes. When the creation process is complete, you see a message at the top of the page that you successfully created your index.
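
If you prefer to script the index creation instead of using the console, the equivalent API call looks roughly like the following (the role ARN is a placeholder):

    import boto3

    kendra = boto3.client("kendra")
    response = kendra.create_index(
        Name="kendra-blog-index",
        Description="My first Kendra index",
        Edition="DEVELOPER_EDITION",
        RoleArn="arn:aws:iam::<account-id>:role/<your-kendra-index-role>",
    )
    print(response["Id"])  # the index ID you use in later steps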

Adding a data source

A data source is a location that stores the documents for indexing. You can synchronize data sources automatically with an Amazon Kendra index to make sure that searches correctly reflect new, updated, or deleted documents in the source repositories.

After completing all the steps in this section, you have a data source linked to Amazon Kendra. For more information, see Adding documents from a data source.

Before continuing, make sure that the index creation is complete and the index shows as Active.

  1. On the kendra-blog-index page, choose Add data sources.

Amazon Kendra supports six types of data sources: Amazon S3, SharePoint Online, ServiceNow, OneDrive, Salesforce online, and Amazon RDS. For this post, you use Amazon S3.

  1. Under Amazon S3, choose Add connector.

For more information about the different data sources that Amazon Kendra supports, see Adding documents from a data source.

  1. In the Define attributes section, for Data source name, enter amazon_help_docs.
  2. For Description, enter AWS services documentation.
  3. Choose Next.
  4. In the Configure settings section, for Enter the data source location, enter the S3 bucket you created: kendrapost-{your account id}.
  5. Leave Metadata files prefix folder location empty.

By default, metadata files are stored in the same directory as the documents. If you want to place these files in a different folder, you can add a prefix. For more information, see S3 document metadata.

  1. For Select decryption key, leave it deselected.
  2. For Role name, enter source-role (your role name is prefixed with AmazonKendra-).
  3. For Additional configuration, you can add a pattern to include or exclude certain folders or files. For this post, keep the default values.
  4. For Frequency, choose Run on demand.

This step defines the frequency with which the data source is synchronized with the Amazon Kendra index. For this walkthrough, you do this manually (one time only).

  1. Choose Next.
  2. On the Review and create page, choose Create.
  3. After you create the data source, choose Sync now to synchronize the documents with the Amazon Kendra index.

The duration of this process depends on the number of documents that you index. For this use case, it may take 15 minutes, after which you should see a message that the sync was successful.

In the Sync run history section, you can see that 3,099 documents were synchronized.
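
The same data source setup and sync can also be scripted; the following is a rough equivalent using the Amazon Kendra API, with the index ID, account ID, bucket name, and role name as placeholders:

    import boto3

    kendra = boto3.client("kendra")
    data_source = kendra.create_data_source(
        IndexId="<your-index-id>",
        Name="amazon_help_docs",
        Type="S3",
        RoleArn="arn:aws:iam::<account-id>:role/<your-kendra-source-role>",
        Configuration={"S3Configuration": {"BucketName": "kendrapost-<your-account-id>"}},
    )
    kendra.start_data_source_sync_job(Id=data_source["Id"], IndexId="<your-index-id>")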

Exploring the search index using the search console

The goal of this section is to let you explore possible search queries via the built-in Amazon Kendra console.

To search the index you created above, complete the following steps:

  1. Under Indexes, choose kendra-blog-index.
  2. Choose Search console.

Amazon Kendra can answer three types of questions: factoid, descriptive, and keyword. For more information, see Amazon Kendra FAQs. You can ask some questions using the Amazon.com help documents that you uploaded earlier.

In the search field, enter What is Amazon music unlimited?

With a factoid question (who, what, when, where), Amazon Kendra can answer and also offer a link to the source document.

As a keyword search, enter shipping rates to Canada. The following screenshot shows the answer Amazon Kendra gives.

Adding FAQs

You can also upload a list of FAQs to provide direct answers to common questions your end-users ask. To do this, you need to load a .csv file with the information related to the questions. This section contains instructions to create and configure that file and load it into Amazon Kendra.

  1. On the Amazon Kendra console, navigate to your index.
  2. Under Data management, choose FAQs.
  3. Choose Add FAQ.
  4. In the Define FAQ project section, for FAQ name, enter kendra-post-faq.
  5. For Description, enter My first FAQ list.

Amazon Kendra accepts .csv files formatted with each row beginning with a question followed by its answer. For example, see the following table.

Question | Answer | URL (optional)
What is the height of the Space Needle? | 605 feet | https://www.spaceneedle.com/
How tall is the Space Needle? | 605 feet | https://www.spaceneedle.com/
What is the height of the CN Tower? | 1815 feet | https://www.cntower.ca/
How tall is the CN Tower? | 1815 feet | https://www.cntower.ca/

The .csv file included for this use case looks like the following:

"How do I sign up for the Amazon Prime free Trial?"," To sign up for the Amazon Prime free trial, your account must have a current, valid credit card. Payment options such as an Amazon.com Corporate Line of Credit, checking accounts, pre-paid credit cards, or gift cards cannot be used. "," https://www.amazon.com/gp/help/customer/display.html/ref=hp_left_v4_sib?ie=UTF8&nodeId=201910190"
  1. Under FAQ settings, for S3, enter s3://kendrapost-{your account id}/faqs/kendrapost.csv.
  2. For IAM role, choose Create a new role.
  3. For Role name, enter faqs-role (your role name is prefixed with AmazonKendra-).
  4. Choose Add.
  5. Wait until you see the status show as Active.

You can now see how the FAQ works on the search console.

  1. Under Indexes, choose your index.
  2. Under Data management, choose Search console.
  3. In the search field, enter How do I sign up for the Amazon Prime free Trial?
  4. The following screenshot shows that Amazon Kendra added the FAQ that you uploaded previously to the results list, and provides an answer and a link to the related documentation.

Using Amazon Kendra in your own applications

You can add the following components from the search console in your application:

  • Main search page – The main page that contains all the components. This is where you integrate your application with the Amazon Kendra API.
  • Search bar – The component where you enter a search term and that calls the search function.
  • Results – The component that displays the results from Amazon Kendra. It has three components: suggested answers, FAQ results, and recommended documents.
  • Pagination – The component that paginates the response from Amazon Kendra.

Amazon Kendra provides source code that you can deploy in your website. This is offered free of charge under a modified MIT license so you can use it as is or change it for your own needs.
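
If you want to call Amazon Kendra directly from your own backend rather than through the sample React components, a minimal query looks like the following sketch (the index ID is a placeholder):

    import boto3

    kendra = boto3.client("kendra")
    result = kendra.query(
        IndexId="<your-index-id>",
        QueryText="What is Amazon music unlimited?",
    )
    for item in result["ResultItems"]:
        print(item["Type"], item.get("DocumentTitle", {}).get("Text"))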

This section contains instructions to deploy Amazon Kendra search to your website. You use a Node.js demo application that runs locally on your machine. This use case is based on a macOS environment.

To run this demo, you need Node.js and npm installed on your machine. Then complete the following steps:

  1. Download amazon_aws-kendra-sample-app-master.zip and unzip the file.
  2. Open a terminal window and go to the aws-kendra-sample-app-master folder:
    cd /{folder path}/aws-kendra-sample-app-master

  3. Create a copy of the .env.development.local.example file as .env.development.local:
    cp .env.development.local.example .env.development.local

  4. Edit the .env.development.local file and add the following connection parameters:
    • REACT_APP_INDEX – Your Amazon Kendra index ID (you can find this number on the Index home page)
    • REACT_APP_AWS_ACCESS_KEY_ID – Your account access key
    • REACT_APP_AWS_SECRET_ACCESS_KEY – Your account secret access key
    • REACT_APP_AWS_SESSION_TOKEN – Leave it blank for this use case
    • REACT_APP_AWS_DEFAULT_REGION – The Region that you used to deploy the Kendra index (for example, us-east-1)
  5. Save the changes.
  6. Install the Node.js dependencies:
    npm install

  7. Launch the local development server:
    npm start

  8. View the demo app at http://localhost:3000/. You should see a page like the following screenshot.
  9. Enter the same question you used to test the FAQs: How do I sign up for the Amazon Prime free Trial?

The following screenshot shows that the result is the same as the one you got from the Amazon Kendra console, even though the demo webpage is running locally on your machine.

Cleaning up

To avoid incurring future charges and to clean out unused roles and policies, delete the resources you created: the Amazon Kendra index, S3 bucket, and corresponding IAM roles.

  1. To delete the Amazon Kendra index, under Indexes, choose kendra-blog-index.
  2. In the index settings section, from the Actions drop-down menu, choose Delete.
  3. To confirm deletion, enter Delete in the field and choose Delete.

Wait until you get the confirmation message; the process can take up to 15 minutes.

For instructions on deleting your S3 bucket, see How do I delete an S3 Bucket?

Conclusion

In this post, you learned how to use Amazon Kendra to deploy an enterprise search service. You can use Amazon Kendra to improve the search experience in your company, powered by ML. You can enable fast, natural language search across your documents, without any previous ML/AI experience. For more information about Amazon Kendra, see AWS re:Invent 2019 – Keynote with Andy Jassy on YouTube, Amazon Kendra FAQs, and What is Amazon Kendra?


About the Author

Leonardo Gómez is a Big Data Specialist Solutions Architect at AWS. Based in Toronto, Canada, he works with customers across Canada to design and build big data architectures.

Read More