Automating business processes with machine learning in the COVID-19 pandemic

COVID-19 has changed our world significantly. All of this change has been almost instantaneous, forcing companies to pivot quickly and find new ways to operate. Automation is playing an increasingly important role to help companies adjust. The ability to automate business processes with machine learning (ML) is unlocking new efficiencies and allowing companies to move faster where they might have otherwise been stuck using antiquated systems. What might have previously taken an organization years is now happening in weeks. In this post, we discuss how AWS customers are applying ML in areas such as document processing and forecasting to quickly respond to the challenges at hand.

Automating document processing

The ability to automate document processing remotely has proven essential as companies face new challenges in this pandemic. Demand for services like loan processing and grocery delivery has spiked in areas that no one could have predicted and the ability to quickly respond to those demands remains vital.

In April 2020, the US federal government announced the Paycheck Protection Program (PPP) to provide small businesses with funds to cover up to 8 weeks of payroll, mortgage, rent, and utility expenses. With phenomenal demand and over $349 billion allocated in just the first 13 days of the program, small business owners were scrambling to qualify.

BlueVine, a fintech company that provides small business banking, used their technology and engineering expertise to help process billions in loans. They chose Amazon Textract, a fully managed ML service that automatically extracts text and data from documents, to help automate the loan application process. In just a few days, they were up and running, analyzing tens of thousands of pages with high accuracy. In just 4 months, they were able to serve more than 155,000 small businesses with over $4.5 billion in loans. They delivered services to those who needed them most: 68% of loans went to customers with fewer than 10 employees, and 90% of loans were under $60,000, serving small businesses struggling to stay afloat. BlueVine also worked closely with DoorDash as a strategic partner to serve many stressed small independent restaurants and to simplify and accelerate the loan process. By using ML to automate loan application processing, BlueVine was able to scale quickly to meet the unprecedented demand, and the company estimates they helped save 470,000 jobs as a result.

Other areas of the economy also experienced unprecedented demand and needed to staff up quickly. However, processing new hire employment paperwork at the rate required was a challenge. A typical PDF form has about 50 form fields; to recreate it as a digital form, the customer had to drag and drop data to the right location on each form, a particularly time-consuming task. Enter HelloSign, a Dropbox company that automates the signature process. HelloWorks is a HelloSign product that turns PDFs into mobile-friendly forms. It uses Amazon Textract to automate document processing and save customers hundreds of hours. A popular on-demand grocery delivery service was able to onboard millions of shoppers using HelloWorks in a few weeks; HelloWorks helped the company scale its onboarding paperwork quickly by automating document processing with ML. A New York-based urgent care clinic started using HelloSign to register new patients, and an ambulance service started using HelloWorks to send out COVID-19 test applications. What's more, this was all happening online. As organizations continue to limit in-person interactions, demand for HelloWorks has surged, with users creating 3x more forms than before. With Amazon Textract, HelloWorks was able to automate the process and automatically create all of the fields and components, saving customers time and keeping them safe.
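
To make this concrete, the following is a minimal, illustrative sketch of calling Amazon Textract to extract form fields (key-value pairs) from a scanned document. The file name and the simple field count are hypothetical; a production workflow like HelloWorks would go further and map each extracted key to its value before building the digital form.

import boto3

textract = boto3.client("textract")

# Read a scanned form (image or single-page PDF) and ask Textract for form data
with open("onboarding-form.png", "rb") as f:  # file name is illustrative
    response = textract.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["FORMS"],  # request key-value pair detection
    )

# Count the detected form fields; KEY blocks are linked to VALUE blocks
# through relationships in the response
keys = [block for block in response["Blocks"]
        if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", [])]
print(f"Detected {len(keys)} form fields")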

Forecasting the pandemic

Forecasting is a growing challenge as supply chains and demand have been disrupted on a global scale. Amazon Forecast, a fully managed service that uses ML to deliver highly accurate forecasts, is helping customers deliver forecasts for everything from product demand to financial performance. Many forecasting tools only look at a historical series of data to predict the future, with the assumption that the future is determined by the past. When faced with irregular trends, this approach falters, as shown by how difficult it has been for companies to build models that accurately capture the complexities of the real world since the beginning of the COVID-19 pandemic. With Amazon Forecast, you can integrate irregular trends and a multitude of other variables, on top of your historical series of data, to deliver the most accurate forecast possible with no ML experience required.
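
As a sketch of that idea, the following code uses the Amazon Forecast API through boto3 to train a DeepAR+ predictor on a dataset group that already contains a target time-series plus a related time-series (for example, mobility or promotions data). The dataset group ARN, predictor name, and horizon are placeholders for illustration, not values from the customer work described in this post.

import boto3

forecast = boto3.client("forecast")

# Placeholder ARN for a dataset group that holds the target time-series
# and any related time-series you want the model to learn from
dataset_group_arn = "arn:aws:forecast:us-east-1:111122223333:dataset-group/demo"

response = forecast.create_predictor(
    PredictorName="demand_predictor",                          # placeholder name
    AlgorithmArn="arn:aws:forecast:::algorithm/Deep_AR_Plus",  # DeepAR+ algorithm
    ForecastHorizon=14,                                        # predict 14 future periods
    InputDataConfig={"DatasetGroupArn": dataset_group_arn},
    FeaturizationConfig={"ForecastFrequency": "D"},            # daily frequency
)
print(response["PredictorArn"])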

One of the largest challenges when it comes to forecasting has been understanding the projection of the disease itself. How quickly will it spread? When will it spike next? How many hospital beds will be needed to accommodate that spike? Forecasting models have the potential to assess disease trends and the course of the COVID-19 pandemic. However, the nature of the COVID-19 time-series makes forecasting extremely challenging, given the variations we've observed in disease spread across multiple communities and populations. COVID-19 remains a relatively unknown disease with no historical data from which to predict trends, such as seasonality and vulnerable sections of the population.

To better understand and forecast the disease, Rackspace Technology, the University of California, Irvine, Scientific Systems, and Plan4Co came together to introduce a new COVID-19 forecasting model that delivers greater accuracy using Amazon Forecast. The team of medical, academic, data science modeling, and forecasting experts used Amazon Forecast DeepAR+ to incorporate related time-series and build more powerful forecasting models. Their model used deep learning to learn patterns between related time-series, like mobility data, and the target time-series. As a result, the model outperformed other approaches, such as those provided by the well-known IHME model.

With Amazon Forecast, the team was able to preprocess the time-series, train dozens of models quickly, compare model performance, and quantify the best forecasts. These forecasts can be produced on a daily and weekly basis and are available for countries, states, counties, and zip codes. This information can, for example, help forecast new cases in the short and long term by learning from real-world data, like time to peak and rate of transmission. This information is critical because government agencies frequently use the occurrence of new cases in a population over a specified period of time to help determine when it's safe to reopen sectors of the economy.

Conclusion

As the pandemic continues, new challenges will inevitably arise. When time was of the essence, these organizations looked to ML technology and automation to serve their customers’ needs and find new ways to operate. This use of new technology will not only help them respond to the pandemic today, but also set them up to thrive in the future.

To learn about other ways AWS is working toward solutions in the COVID-19 pandemic, check out Introducing the COVID-19 Simulator and Machine Learning Toolkit for Predicting COVID-19 Spread and Intelligently connect to customers using machine learning in the COVID-19 pandemic.

About the Author

Taha A. Kass-Hout, MD, MS, is director of machine learning and chief medical officer at Amazon Web Services (AWS). Taha received his medical training at Beth Israel Deaconess Medical Center, Harvard Medical School, and during his time there was part of the BOAT clinical trial. He holds a doctor of medicine and a master of science in bioinformatics from the University of Texas Health Science Center at Houston.

Helping small businesses deliver personalized experiences with the Amazon Personalize extension for Magento

This is a guest post by Jeff Finkelstein, founder of Customer Paradigm, a full-service interactive media firm and Magento solutions partner.

Many small retailers use Magento, an open-source ecommerce platform, to create websites or mobile applications to sell their products online. Personalization is key to creating high-quality ecommerce experiences, but small businesses often lack access to the resources required to implement a scalable, sophisticated personalization solution, especially one that is powered by machine learning (ML). An ML-based solution results in better end-user engagement, higher conversion, and increased sales compared to rudimentary traditional techniques like static rules. With the Amazon Personalize extension for Magento, we're now making the same ML personalization technology used by Amazon.com accessible to small businesses that use Magento.

This is the case with Hoopologie, a small hula hoop supply company based in Boulder, Colorado. Founder Melina Rider started the business in 2013 and has an extensive product line of over 1,000 SKUs. For Hoopologie, like many other small ecommerce merchants, creating sophisticated personalized experiences has been out of reach. Hiring ML experts to implement an ML solution isn’t affordable, and using a rules-based system requires a lot of manual maintenance, limiting scale and performance.

“Creating personalized recommendations for every type of user on the site would take us hours and hours every month. It’s not affordable, and I’d rather have our limited staff help our customers in other ways,” Rider says.

Hoopologie uses Magento to power their website, and has seen an increase of 40.5% in sales and an average order value increase of just over $50 per order by using the Amazon Personalize extension for Magento. In this post, we show you how to implement the Amazon Personalize extension for Magento to start creating ML-powered personalized experiences for your customers to improve engagement, conversion, and revenue.

Amazon Personalize for Magento

Amazon Personalize is a fully managed ML service that leverages over 20 years of experience at Amazon.com to help you deliver personalized experiences faster. The Amazon Personalize extension allows Magento merchants to take advantage of the benefits of Amazon Personalize and use its algorithms to power a Magento store’s product recommendations. All data and ML models are stored privately and securely in the merchant’s AWS account.

It’s easy to install the extension on your site, create an AWS account, and authorize the extension to access your AWS account. For instructions, see Amazon Personalize for Magento 2: Installation & Configuration Instructions. After you complete those steps, you see the Amazon Personalize extension in your Magento admin area, under Stores, Configuration.

Entering configuration details

For instructions on configuring the extension, watch the video clip Amazon Personalize for Magento 2: Extension Configuration on YouTube.

On the configuration page, you can provide all the values that tie your Magento site into Amazon Personalize in your AWS account: 

  1. For License Key, enter the license key that you received for the extension.

If you installed the extension but don’t have a license key, you can start a 15-day free trial.

If your license is active, the License Active field shows as yes.

  2. If the license isn't activated, add a valid access key.
  3. For Module Enabled, you can enable or disable the module from the admin area.

This is helpful if you need to troubleshoot a site, or if you have a copy of the site on a test server.

When your Amazon Personalize campaign (an ML-powered recommender trained on your data) is active in Amazon Personalize, Campaign Active shows as yes.

The system automatically uploads, ingests, and trains the data. This is a read-only display; this isn’t something you can change to turn on or off the campaign.

  4. For File owner home directory, enter a file directory outside your web root (such as ../../keys).

Amazon’s security requirements mandate that your AWS credentials not be stored in the Magento database. Instead, your keys need to be stored in a directory outside the web root, usually up a level or two from where your Magento site is stored in your file system.

  5. For AWS Region, enter the Region where you want the extension to access Amazon Personalize.

Because Amazon Personalize may not be available in all Regions, be sure to enter a Region code where Amazon Personalize is available and that is located geographically closest to your Magento server’s physical location.

  6. For AWS Account Number, enter your AWS account number.
  7. For Access Key, enter the access key ID for the user that you created in the authorization part of the setup.
  8. For Secret Key, enter the secret access key for the user that you created in the authorization part of the setup.

 

For instructions on finding your AWS account number, access key, and secret key, watch the video clip Amazon Personalize for Magento 2: Creating IAM User in AWS on YouTube.

  9. Choose Save Config.

This saves the access information and writes the access key and secret key to a file outside the web root.

Starting training

To begin the training process, choose Start Process.

Make sure that your Magento cron is running. If it’s not running, it’s highly likely that the process won’t move from one step to the next.

The process includes the following high-level steps:

  • Exporting historical data from your Magento site into CSV files.
  • Creating a private Amazon Simple Storage Service (Amazon S3) bucket in your AWS account to stage CSV files.
  • Uploading the CSV files to the S3 bucket in your AWS account.
  • Instructing Amazon Personalize to create a custom solution and campaign based on your data. The result is a private ML model hosted in your AWS account.

The following screenshot shows the progress tracker of the individual steps.

You can restart the process at any time by choosing Reset Process. Doing so starts the process over from the beginning, and may incur additional data upload and training costs.

A/B split testing

To evaluate the effectiveness of the Amazon Personalize campaign created on your Magento store, we built in an A/B split testing system. A/B testing allows you to expose two subsets of your users to two variations of a user experience on your site and measure which variation results in the highest conversion rate. Control is default Magento; Test is personalized recommendations from your Amazon Personalize campaign.
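
Conceptually, a split like this works by deterministically bucketing each visitor and comparing the bucket against the configured test percentage. The following Python sketch illustrates the idea only; it is not the extension's actual implementation, and the function name and visitor ID are hypothetical.

import hashlib

def ab_group(visitor_id: str, test_percentage: int) -> str:
    """Assign a visitor to 'control' or 'test' deterministically.

    test_percentage is the share of traffic (0-100) that should see
    recommendations from the Amazon Personalize campaign.
    """
    # Hash the visitor ID into a stable bucket between 0 and 99
    bucket = int(hashlib.md5(visitor_id.encode()).hexdigest(), 16) % 100
    return "test" if bucket < test_percentage else "control"

# Example: a 50% Control / 50% test split
print(ab_group("visitor-42", 50))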

  1. For the system to be active, make sure that Enabled is set to Yes.

If Enabled is set to No, Magento doesn’t use the extension.

  2. For Set Percentage, choose from the following A/B split test options:
    1. 100% Control / 0% test (Amazon Personalize isn’t used at all)
    2. 75% Control / 25% test (Amazon Personalize is used 25% of the time)
    3. 50% Control / 50% test (Amazon Personalize is used 50% of the time)
    4. 25% Control / 75% test (Amazon Personalize is used 75% of the time)
    5. 10% Control / 90% test (Amazon Personalize is used 90% of the time)
    6. 0% Control / 100% test (Amazon Personalize is used 100% of the time)

  3. Choose Save Config to save the settings.

Next you allow Amazon Personalize to train. Depending on the amount of historical data in your Magento system, this process can take a few hours to complete. You can revisit this page to see the progress of the training.

After you complete this process the first time, you can retrain new versions of the model while continuing to use the active campaign to provide product recommendations. To contain costs and create the most relevant dataset for your site, we’ve limited historical data to the previous 6 months. (Interaction data older than 6 months provides less value in making relevant product recommendations.)

After the Amazon Personalize system is enabled, the system automatically does the following:

  • Displays personalized product recommendations to your end-users.
  • Adds a “We also suggest” list of recommended products to your product page.
  • Automatically adds a Google Analytics tag (if your site uses Google Analytics) for any order that was placed on the site when Amazon Personalize was active. This uses the existing Google Analytics system. If you’re not using Google Analytics, this step is omitted.
  • Adds a field to the Magento database that indicates if an order was placed when Amazon Personalize was active.
  • Adds real-time data interaction indicators, allowing Amazon Personalize to learn from users when they add a product to their cart, wishlist, or complete a purchase.

To add Amazon Personalize to additional pages, choose Content, Pages, Select a Page.

For help troubleshooting, see Amazon Personalize Extension for Magento 2.

Conclusion

In just a few easy steps, you can add the Amazon Personalize extension for Magento to an ecommerce site. Hoopologie started testing the Amazon Personalize extension in January 2020. They began by using an A/B split test to show the personalized recommendations to 50% of their users. After 6 weeks, users seeing personalized recommendations spent $50.02 more per transaction, and overall revenue from users in the test was up by 42%. Hoopologie continues to use the Amazon Personalize extension to create personalized experiences for their customers. They intend to expand usage by continuously evaluating the system, retraining as new products are added, and adding personalization to additional areas of the site.

Advanced personalization is no longer beyond the reach of many ecommerce merchants using Magento. The results from Hoopologie are a testament to the positive impact of implementing a scalable, sophisticated personalization solution powered by ML. Learn more about the Amazon Personalize extension for Magento from Customer Paradigm and enjoy a complimentary 15-day free trial.

About the Author

Jeff Finkelstein is the founder of Customer Paradigm, based in Boulder, Colorado. Founded in 2002, their team has completed more than 12,600 web development and marketing projects for ecommerce and other clients throughout the world. The Customer Paradigm team previously built an integration between Magento and Amazon Marketplace, which was incorporated into the core Magento framework in 2017. More information can be found at https://www.CustomerParadigm.com.

Detecting hidden but non-trivial problems in transfer learning models using Amazon SageMaker Debugger

Rapid development of deep learning technology has produced an abundance of open-source, pre-trained models in computer vision and natural language processing. As a result, transfer learning has become a popular approach in deep learning. Transfer learning is a machine learning technique where a model pre-trained on one task is fine-tuned on a new task. Given the significant compute and time resources required to develop neural network models, adapting a pre-trained model to new data is compelling in business applications. If you're new to deep learning, transfer learning is also a good starting point because you don't have to build a model from scratch. But one question you may have is: now that my data is in the form of pictures or text, how do I systematically examine model predictions to see what mistakes the model made?

In this post, we show you an end-to-end example of doing transfer learning by using Amazon SageMaker Debugger to detect hidden problems that may have serious consequences. Debugger doesn’t incur additional costs if you’re running training on Amazon SageMaker. Moreover, you can enable the built-in rules with just a few lines of code when you call the Amazon SageMaker estimator function. For our use case, we do transfer learning using a ResNet model to recognize German traffic signs [1].

In this post, we focus on issues that occur during training. For more information about using Debugger for inference and explainability, see Detecting and analyzing incorrect model predictions with Amazon SageMaker Model Monitor and Debugger.

Setting up a transfer learning training job

For our use case, we want to adapt a pre-trained computer vision model to recognize traffic signs in Germany. We use the GTSRB dataset [1] for this new task. You can find the notebook and training script in the GitHub repo.

Applying preprocessing on the dataset

We first apply some typical preprocessing for a ResNet model on our dataset (see the complete notebook for where to download the dataset). To improve model generalization, we apply data augmentation (RandomResizedCrop and RandomHorizontalFlip). These operations ensure that an image looks different in each epoch. Lastly, we normalize the data: because the model has been pre-trained on the ImageNet dataset, we apply the same preprocessing and normalization (subtract the mean and divide by the standard deviation of the ImageNet dataset). See the following code:

from torchvision import datasets, models, transforms

# Define pre-processing
train_transform =  transforms.Compose([
                                        transforms.RandomResizedCrop(224),
                                        transforms.RandomHorizontalFlip(),
                                        transforms.ToTensor(),
                                        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
                                    ])

We use PyTorch's ImageFolder dataset class, which takes a local folder, loads all images located in the subdirectories, and encodes each directory name as a label. Next we specify the dataloader, which takes the batch size and the dataset. We use the dataloader during training to provide new batches in each iteration. See the following code:

import torch

# Apply the pre-processing to the training dataset
dataset = datasets.ImageFolder(root='GTSRB/Training', transform=train_transform)
train_dataloader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True)

For the validation dataset, we don’t apply data augmentation and only resize images to the appropriate size:

# Apply the pre-processing to validation dataset
val_transform = transforms.Compose([
                                        transforms.Resize(256),
                                        transforms.CenterCrop(224),
                                        transforms.ToTensor(),
                                        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
                                        ])

dataset_test = datasets.ImageFolder(root='GTSRB/Final_Test', transform=val_transform)
val_dataloader = torch.utils.data.DataLoader(dataset_test, batch_size=64, shuffle=False)

Loading a pre-trained ResNet model

Because we have a limited variety of traffic signs, we pick a simpler ResNet model for this task: resnet18. You can load a ResNet18 from the PyTorch model zoo with pre-trained weights using just one line of code:

#get pretrained ResNet model
model = models.resnet18(pretrained=True)

The model has been pre-trained on the ImageNet dataset, which consists of 1,000 image classes. For our use case, we fine-tune it on a dataset that only has 43 classes. We adjust the last layer, which is a fully connected Linear layer:

#traffic sign dataset has 43 classes
nfeatures = model.fc.in_features
model.fc = torch.nn.Linear(nfeatures, 43)

Because we train a multi-class classification model, we use the cross-entropy loss function:

# loss for multi-class classification
loss_function = torch.nn.CrossEntropyLoss()

Next we specify the optimizer that takes the model parameters and learning rate. Here we use the stochastic gradient descent optimizer:

# optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

Defining the training loop

The following code block defines the training loop. We iterate over 10 epochs, perform the forward and backward passes, and update the model parameters.

for epoch in range(10):  # loop over the entire dataset 10 times

    epoch_loss = 0.0

    for data in train_dataloader:

        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward pass
        outputs = model(inputs)

        # compute the loss
        loss = loss_function(outputs, labels)

        # backward pass
        loss.backward()

        # optimize
        optimizer.step()

        # get predictions
        _, preds = torch.max(outputs, 1)

        # statistics
        epoch_loss += loss.item()

    print('Epoch {}/{} Loss: {:.4f}'.format(epoch + 1, 10, epoch_loss))

If you run the preceding code as is, the training runs only on your Amazon SageMaker notebook instance. To make the most out of Amazon SageMaker, you want to use the pre-built AWS Deep Learning Containers, which come with optimized performance and let you access the full feature set of Debugger at no additional cost. By running on Amazon SageMaker, we can easily train our models at scale. Most deep learning models are trained on GPUs due to the computational intensity. With Amazon SageMaker, GPU instances are automatically created and torn down after training completes, so you only pay for the time the resources were used.

Making a training script compatible with Amazon SageMaker

To run training on Amazon SageMaker, you need to change the location variable in your pre-processing code to the generic Amazon SageMaker environment variables. When Amazon SageMaker spins up the training instance, it automatically downloads the training and validation data from Amazon Simple Storage Service (Amazon S3) into a local folder on the training instance. We can retrieve the local path with os.environ['SM_CHANNEL_TRAIN'] and os.environ['SM_CHANNEL_TEST']:

# update environment variable for training and testing data sets
dataset = datasets.ImageFolder(os.environ['SM_CHANNEL_TRAIN'], transform=train_transform)
dataset_test = datasets.ImageFolder(os.environ['SM_CHANNEL_TEST'], transform=val_transform)

After the change, you should save all the model code as a separate script called train.py.

Uploading data to the S3 bucket

As mentioned in the previous step, Amazon SageMaker automatically downloads training and validation data into the training instance. We need to upload the data to Amazon S3 first. You can find detailed instructions on how to do that in the notebook.
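
As a rough sketch, the upload can be done with the SageMaker Python SDK's upload_data helper. The local folder names below mirror the GTSRB folders used earlier; the exact paths and prefixes in the notebook may differ.

import sagemaker

session = sagemaker.Session()
bucket = session.default_bucket()

# Upload training and validation images; the key prefixes should match the
# channel names passed to estimator.fit() later ('train' and 'test')
train_s3 = session.upload_data(path="GTSRB/Training", bucket=bucket, key_prefix="train")
test_s3 = session.upload_data(path="GTSRB/Final_Test", bucket=bucket, key_prefix="test")
print(train_s3, test_s3)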

Setting up Debugger

Now that we have defined the training script and uploaded the data, we’re ready to start training on Amazon SageMaker. We run the training with several Debugger built-in rules enabled. Via the Amazon SageMaker Python SDK and the rule_configs module, we can select any of the 20 available built-in rules, which run at no additional cost. For demonstration purposes, we select loss_not_decreasing, class_imbalance and dead_relu. We can configure several parameters for these rules: for instance, most rules take a threshold parameter that can be adjusted to define when a rule should trigger. We can also define the set of tensors the rules should run on.

The class imbalance rule takes the inputs into the loss function and counts the number of samples per class that the model has seen throughout training. To create the rule, we specify rule_configs.class_imbalance() and the rule runs on the inputs of the loss function. To fine-tune the model, we use the cross entropy loss function, which takes predictions and labels and outputs a loss value. See the following code:

from sagemaker.debugger import Rule, CollectionConfig, rule_configs
class_imbalance_rule = Rule.sagemaker(base_config=rule_configs.class_imbalance(),
                                    rule_parameters={"labels_regex": "CrossEntropyLoss_input_1"}
                                    )

Next we define the loss_not_decreasing rule. It determines if the training or validation loss is decreasing and raises an issue if the loss has not decreased by a certain percentage in the last few iterations. In contrast to the previous rule, this rule runs on the outputs of the loss function (CrossEntropyLoss_output_0). See the following code:

loss_not_decreasing_rule = Rule.sagemaker(base_config=rule_configs.loss_not_decreasing(),
                             rule_parameters={"tensor_regex": "CrossEntropyLoss_output_0",
                                             "mode": "TRAIN"})

The dead_relu rule identifies how many rectified linear unit (ReLU) activations are outputting zero values. ReLU is a non-linear activation function used in many state-of-the-art models. It increases linearly for increasing positive values and outputs zero otherwise. A model can suffer from the dying ReLU problem, where the gradients become zero due to the activation output being zero. If the majority of ReLU activations output zero values, the model can’t effectively learn because weights are no longer getting updated. We instantiate the rule by specifying rule_configs.dead_relu(), and the rule runs on all tensors that captured outputs from ReLU activations:

dead_relu_rule = Rule.sagemaker(base_config=rule_configs.dead_relu(),
                                rule_parameters={"tensor_regex": "relu_output"})

To record additional tensors, we can specify a debugger hook configuration. We can either use default collections such as weights and gradients or define our own custom collection. The following collection saves model inputs and loss function inputs and outputs. We just need to specify a regular expression of tensor names. We save the tensors every 500 steps, and a step is one forward and backward pass. So we get tensors for step 0, 500, 1,000, and so on. See the following code:

from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

debugger_hook_config = DebuggerHookConfig(
      collection_configs=[ 
          CollectionConfig(
                name="custom_collection",
                parameters={ "include_regex": "*ResNet_input|*CrossEntropyLoss",
                             "save_interval": "500" })])

For a full list of collections and rules Debugger offers, see List of Debugger Built-in Rules. Debugger captures the tensor collections you specified throughout the training steps and automatically analyzes them against the rules.

Calling the Amazon SageMaker training API

We define the PyTorch estimator that takes the separate training script we saved earlier and specify the instance type that Amazon SageMaker creates for us. To run the training with Debugger and built-in rules, we only have to pass the list of rules and the debugger hook configuration:

from sagemaker.pytorch import PyTorch

pytorch_estimator = PyTorch(entry_point='train.py',
                            role=sagemaker.get_execution_role(),
                            train_instance_type='ml.p2.xlarge',
                            train_instance_count=1,
                            framework_version='1.6.0',
                            py_version='py3',
                            debugger_hook_config=debugger_hook_config,
                            rules=[class_imbalance_rule, dead_relu_rule, loss_not_decreasing_rule]
                           )

Now we start the training on Amazon SageMaker by calling fit(). The function takes a dictionary that specifies the location of the training and validation data in Amazon S3. The keys of the dictionary are the name of data channels Amazon SageMaker creates in the training instance. See the following code:

pytorch_estimator.fit(inputs={'train': 's3://{}/train'.format(bucket), 
                              'test': 's3://{}/test'.format(bucket)}, 
                      wait=True)

While the training is in progress, we can monitor the rule status in real time in Amazon SageMaker Studio. It turns out that the loss_not_decreasing and class_imbalance rules are triggered. The training runs for 10 epochs and reaches a final test accuracy of 96.3%.

This seems good, but why were the rules triggered? Let’s dive into the data Debugger captured to find out the root causes.

Using SageMaker Debugger rules and data to uncover hidden problems

In this section, we investigate the data to find any hidden problems, create custom rules to fix the model, and rerun the training.

Inspecting loss_not_decreasing

We use Debugger to investigate what triggered the loss_not_decreasing rule. We use the smdebug library, which provides all the functionality to read and access Debugger data. First we create a trial object that takes the path where the Debugger data is stored as input. This can be either a local or an Amazon S3 path. With just a few lines of code, we can retrieve and visualize the loss values while training is still in progress. With trial.steps(), we retrieve the list of recorded steps; a step is one forward and backward pass. We can also specify a mode to retrieve data from the training (modes.TRAIN) or validation phase (modes.EVAL). Debugger's default save interval is 500 steps, so we get loss values for step 0, 500, 1,000, and so on.

To access the loss values, we pass the name of the loss into the trial.tensor() function. The cross-entropy loss function we picked measures the performance of a multi-class classification model. It takes two inputs: the model outputs and the ground truth labels. We can access its outputs via trial.tensor('CrossEntropyLoss_output_0').values():

trial.tensor('CrossEntropyLoss_output_0').values(mode=modes.TRAIN)

{0: array(3.9195325, dtype=float32),
 500: array(0.8488243, dtype=float32),
 1000: array(0.54870504, dtype=float32),
 1500: array(0.25874993, dtype=float32),
 2000: array(0.20406848, dtype=float32),
 2500: array(0.29052508, dtype=float32),
 3000: array(0.18074727, dtype=float32),
 3500: array(0.1956144, dtype=float32),
 4000: array(0.2597512, dtype=float32)}

This code returns a dictionary in which the keys are the step numbers and the values are the loss values. We can now easily visualize the loss values as training is still in progress. See the following code:

import matplotlib.pyplot as plt
from smdebug import modes
from smdebug.trials import create_trial

path = pytorch_estimator.latest_job_debugger_artifacts_path()
trial = create_trial(path)

plt.ylabel('Train Loss')
plt.xlabel('Steps')
plt.plot(trial.steps(mode=modes.TRAIN),
         list(trial.tensor('CrossEntropyLoss_output_0').values(mode=modes.TRAIN).values()))
plt.show()

The blue curve in the following graph shows that the default training configuration ran the training for too long. Instead of training for 4,000 steps, early_stopping should have been applied after 1,000 steps. We can use Debugger to enable auto-termination, which stops the training when a rule triggers. For our use case, doing so reduces compute time by more than half (orange curve).

Debugger can auto-terminate training jobs. Metrics are sent to Amazon CloudWatch, so you can set up a CloudWatch alarm and AWS Lambda function that stops a training job if a rule triggers.
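
The following is a minimal sketch of such a Lambda function. It assumes the training job name is supplied through an environment variable; in practice, you would connect the CloudWatch alarm for the rule's metric to this function and pass or look up the job name accordingly.

import os
import boto3

sm = boto3.client("sagemaker")

def lambda_handler(event, context):
    # Assumption: the training job name is provided via an environment variable
    # (it could also be parsed from the alarm event payload)
    job_name = os.environ["TRAINING_JOB_NAME"]
    status = sm.describe_training_job(TrainingJobName=job_name)["TrainingJobStatus"]
    if status == "InProgress":
        sm.stop_training_job(TrainingJobName=job_name)
    return {"job": job_name, "status": status}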

For more information about how the auto-termination feature helped one customer reduce compute costs by 70%, see Autodesk optimizes visual similarity search model in Fusion 360 with Amazon SageMaker Debugger.

Inspecting class_imbalance

Real-world datasets are often imbalanced and noisy. If the model training doesn't account for these factors, it produces a model that has low or no predictive power for the classes with few samples. You can address this in different ways: during data loading, you can draw more samples from the under-represented classes, or you can adjust the loss function to assign a higher penalty to incorrect predictions using class weights.

To investigate the class imbalance issue, we retrieve the inputs of the loss function (previously we retrieved the outputs). The loss function takes the model predictions and the ground truth labels as inputs. We use the latter (CrossEntropyLoss_input_1) to count the number of samples the model has seen during training:

from collections import Counter
import numpy as np

labels = []
for step in trial.steps(mode=modes.TRAIN):
    labels.extend(trial.tensor("CrossEntropyLoss_input_1").value(step, mode=modes.TRAIN))

# count samples per class and plot them in class order (43 classes)
label_counts = Counter(labels)
plt.bar(np.arange(0, 43), [label_counts.get(c, 0) for c in range(43)])

The following visualization shows a high imbalance and several classes with fewer than a hundred samples.

To fix the class imbalance issue, we change the default configuration of the dataloaders to take the class weights into account and draw more samples from classes with fewer samples. Therefore, we define WeightedRandomSampler:

# per-sample weights: the inverse of each sample's class frequency, so that
# under-represented classes are drawn more often (one common weighting scheme)
class_counts = np.bincount(dataset.targets, minlength=43)
weights = [1.0 / class_counts[label] for label in dataset.targets]

sampler = torch.utils.data.sampler.WeightedRandomSampler(weights, len(weights))

train_dataloader = torch.utils.data.DataLoader(dataset, 
                                                batch_size=64,
                                                sampler=sampler)

During training, the dataloader now draws more samples from classes with lower counts. Class imbalance may lead to the problem where the model performs well on classes with a lot of samples but poorly on classes with fewer counts. Because we trained the model without WeightedRandomSampler, let’s see which classes had particularly low accuracy by looking at the confusion matrix.

Visualizing the confusion matrix in real time

To evaluate the performance of our model, we retrieve labels and predictions and create the confusion matrix:

from sklearn.metrics import confusion_matrix
import seaborn as sns
import numpy as np

predictions = []
labels = []
for step in trial.steps(mode=modes.EVAL):
    predictions.extend(np.argmax(trial.tensor("CrossEntropyLoss_input_0").value(step, mode=modes.EVAL), axis=1))
    labels.extend(trial.tensor("CrossEntropyLoss_input_1").value(step, mode=modes.EVAL))

cm = confusion_matrix(labels, predictions)
fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(cm, ax=ax, cbar=True)

Each row in the matrix corresponds to the actual class, and the column indicates the predicted classes. For example, the first row shows class 0 and how often it was predicted as class 0, class 1, class 2, and so on. Ideally, we want high counts on the diagonal, because these are correctly predicted classes. Elements not on the diagonal are incorrect predictions. The confusion matrix helps us determine if particular classes in our dataset get confused more often with each other. This can happen, for instance, because samples from two different classes may be very similar. Debugger also provides a confusion built-in rule that computes the confusion matrix while the training is in progress and triggers if the ratio of data on-diagonal values and off-diagonal values exceeds a pre-defined threshold.

The following image shows that in most cases our model is predicting the correct classes, but there are a few outliers. You can use Debugger to look more closely into those outliers.

Inspecting incorrect model predictions

To find out what is causing those outliers in the confusion matrix, we investigate the examples on which the model made false predictions. To do this analysis, we take both inputs into the loss function into account: CrossEntropyLoss_input_0 contains the model predictions, and CrossEntropyLoss_input_1 contains the labels. We also retrieve the model inputs, ResNet_input_0, which contains the input images. We perform the analysis on data recorded during the validation phase, so we specify mode=modes.EVAL.

We iterate over the predictions and model inputs saved by Debugger and select those where the label and prediction do not match. Then we plot the predictions and corresponding images:

for step in trial.steps(mode=modes.EVAL):

    predictions = np.argmax(trial.tensor('CrossEntropyLoss_input_0').value(step, mode=modes.EVAL), axis=1)
    labels = trial.tensor('CrossEntropyLoss_input_1').value(step, mode=modes.EVAL)
    images = trial.tensor('ResNet_input_0').value(step, mode=modes.EVAL)

    for prediction, label, image in zip(predictions, labels, images):
        if prediction != label:
            print(f"Predicted: '{signnames[prediction]}' Groundtruth: '{signnames[label]}' ")
            # images are stored channels-first, so transpose to (H, W, C) for plotting
            plt.imshow(np.transpose(image, (1, 2, 0)))

The following images show the result of the code segment. The analysis reveals that the model is often confused by traffic signs that involve a direction. Clearly this is a severe model bug despite the model achieving a decent test accuracy of 96.3%.

The root cause of this is the data augmentation pipeline, which performs a random horizontal flip on the training data. This data augmentation step is typically used when ResNet models are trained from scratch, but it causes a problem in our use case where the dataset contains similar classes where images just differ in their direction.

Running training with a custom rule

With Debugger, we can easily write a custom rule that checks how often the model was confused about directions. For example, we take the image “Dangerous curve to left” (class 19) and count how often it was mistaken as “Dangerous curve to right” (class 20), or vice versa. We just need to implement the function invoke_at_step, which Debugger calls every time data for a new step is available. Like before, we access the inputs of the loss function, check whether class 19 or 20 is present, and count how often one was mistaken for the other. If this happens more than 10 times, the rule triggers. See the following code:

import numpy as np
from smdebug.rules.rule import Rule

class MyCustomRule(Rule):
    def __init__(self, base_trial):
        super().__init__(base_trial)
        self.counter = 0

    def invoke_at_step(self, step):

        predictions = np.argmax(self.base_trial.tensor('CrossEntropyLoss_input_0').value(step), axis=1)
        labels = self.base_trial.tensor('CrossEntropyLoss_input_1').value(step)

        for prediction, label in zip(predictions, labels):

            if (prediction == 19 and label == 20) or (prediction == 20 and label == 19):
                self.counter += 1
                if self.counter > 10:
                    self.logger.info(f'Found {self.counter} cases where class 19 was mistaken as class 20 and vice versa')
                    return True

        return False

We can easily test and run the custom rule locally by creating the trial object and invoking the rule on the data:

from smdebug.rules import invoke_rule
from smdebug.exceptions import *

rule = MyCustomRule(trial)
try:
    invoke_rule(rule, raise_eval_cond=True)
except RuleEvaluationConditionMet as e:
    print(e)

Running the code cell in the notebook gives the following output:

[2020-10-18 18:51:24.588 28f0f34b9e29:12513 INFO rule_invoker.py:15] Started execution of rule MyCustomRule at step 0
[2020-10-18 18:53:11.846 28f0f34b9e29:12513 INFO <ipython-input-69-cae132ce9a97>:19] Found 11 cases where class 19 was mistaken as class 20 and vice versa
Evaluation of the rule MyCustomRule at step 1812 resulted in the condition being met

The rule triggered at step 1812. After the rule has been tested locally, we can run it as part of our Amazon SageMaker training job. First we need to save the rule in a separate file and then define the following configuration where we indicate on which instance type the rule should run:

from sagemaker.debugger import Rule, CollectionConfig

custom_rule = Rule.custom(
    name='MyCustomRule',
    image_uri='759209512951.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rule-evaluator:latest', 
    instance_type='ml.t3.medium',     
    source='my_custom_rule.py',
    volume_size_in_gb=10, 
    rule_to_invoke='MyCustomRule',     
)

After we define the configuration, we add the custom_rule to the list of rules in the estimator object.
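
For example, the estimator call from earlier might look like the following with the custom rule appended; all other arguments stay the same and reuse the objects already defined in this post.

from sagemaker.pytorch import PyTorch

pytorch_estimator = PyTorch(entry_point='train.py',
                            role=sagemaker.get_execution_role(),
                            train_instance_type='ml.p2.xlarge',
                            train_instance_count=1,
                            framework_version='1.6.0',
                            py_version='py3',
                            debugger_hook_config=debugger_hook_config,
                            rules=[class_imbalance_rule,
                                   dead_relu_rule,
                                   loss_not_decreasing_rule,
                                   custom_rule]  # the custom rule defined above
                           )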

Fixing the model and rerunning training

Now that Debugger has helped us identify some critical issues in our model, we apply the fixes and rerun the training. As mentioned before, weighted re-sampling allows us to fix the class imbalance problem. We also change the data augmentation pipeline and remove the horizontal flip. We reduce the number of epochs from 10 to 3, because we have seen that the loss doesn’t decrease after roughly 1,000 iterations.
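
The revised preprocessing might look like the following sketch, which simply drops RandomHorizontalFlip from the augmentation pipeline defined earlier and lowers the epoch count; the normalization constants are unchanged.

from torchvision import transforms

# Drop RandomHorizontalFlip: mirroring direction-sensitive signs (such as
# "dangerous curve to left/right") would teach the model the wrong labels
train_transform = transforms.Compose([
                                        transforms.RandomResizedCrop(224),
                                        transforms.ToTensor(),
                                        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
                                    ])

# Train for fewer epochs, because the loss stopped decreasing after roughly 1,000 steps
num_epochs = 3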

With Debugger, we can now compare data from different training jobs and see if the issues persist or not. We just need to create a new trial object, read data from both trials, and compare their tensors:

trial1 = create_trial("s3://bucket/training-job/debug-output")
trial2 = create_trial("s3://bucket/improved-training-job/debug-output")

The following visualization shows the label counts for the original training job and the one where we applied weighted re-sampling (orange). We see that there is no longer a class imbalance issue and the model sees roughly the same amount of instances per class.
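
A comparison like this can be produced by counting the labels recorded in each trial, similar to the earlier class imbalance analysis. The following is a sketch that assumes both trials saved CrossEntropyLoss_input_1 during training.

from collections import Counter
import numpy as np
import matplotlib.pyplot as plt
from smdebug import modes

def per_class_counts(trial):
    # Gather the labels seen during training and count them per class
    labels = []
    for step in trial.steps(mode=modes.TRAIN):
        labels.extend(trial.tensor("CrossEntropyLoss_input_1").value(step, mode=modes.TRAIN))
    counts = Counter(labels)
    return [counts.get(c, 0) for c in range(43)]

x = np.arange(43)
plt.bar(x - 0.2, per_class_counts(trial1), width=0.4, label="original")
plt.bar(x + 0.2, per_class_counts(trial2), width=0.4, label="weighted re-sampling")
plt.xlabel("Class")
plt.ylabel("Samples seen")
plt.legend()
plt.show()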

We run the training with the same built-in rules as before and add our own custom rule. We monitor the status of the rules and can now see that none of them trigger.

Summary

In this post, we have shown an end-to-end example of how to use Amazon SageMaker Debugger to automatically find, inspect, and fix issues when training a deep neural network.

As state-of-the-art models grow in size and complexity, finding issues early in the prototyping phase is critical to save time and costs. Model bugs may not always be obvious, and as we have shown in this post, a suboptimal model may still achieve good overall accuracy.

In our use case, we not only found critical bugs, but also reduced training time by a factor of two and improved model performance.

There is no extra cost for running Debugger built-in rules, and you benefit from having them enabled because you may discover non-obvious model issues. If you want to learn more, check out the Amazon SageMaker Debugger documentation.

References

[1] Johannes Stallkamp, Marc Schlipsing, Jan Salmen, Christian Igel, The German traffic sign recognition benchmark: A multi-class classification competition, The 2011 International Joint Conference on Neural Networks, 2011

 


About the Authors

Nathalie Rauschmayr is an Applied Scientist at AWS, where she helps customers develop deep learning applications.

Lu Huang is a Senior Product Manager on the AWS Deep Engine team, managing SageMaker Debugger.

Satadal Bhattacharjee is Principal Product Manager at AWS AI. He leads the machine learning engine PM team on projects such as SageMaker and optimizes machine learning frameworks such as TensorFlow, PyTorch, and MXNet.

Using Transformers to create music in AWS DeepComposer Music studio

AWS DeepComposer provides a creative and hands-on experience for learning generative AI and machine learning (ML). We recently launched a Transformer-based model that iteratively extends your input melody by up to 20 seconds. The newly created extension uses the style and musical motifs found in your input melody to create additional notes that sound like they belong to it. In this post, we show you how the Transformer model extends the duration of your existing compositions. You can create new and interesting musical scores by using various parameters, including the Edit melody feature.

Introduction to Transformers

The Transformer is a recent deep learning model for use with sequential data such as text, time series, music, and genomes. Whereas older sequence models such as recurrent neural networks (RNNs) or long short-term memory networks (LSTMs) process data sequentially, the Transformer processes data in parallel. This allows Transformers to process massive amounts of available training data by using powerful GPU-based compute resources.

Furthermore, traditional RNNs and LSTMs can have difficulty modeling the long-term dependencies of a sequence because they can forget earlier parts of the sequence. Transformers use an attention mechanism to overcome this memory shortcoming by directing each step of the output sequence to pay “attention” to relevant parts of the input sequence. For example, when a Transformer-based conversational AI model is asked “How is the weather now?” and the model replies “It is warm and sunny today,” the attention mechanism guides the model to focus on the word “weather” when answering with “warm” and “sunny,” and to focus on “now” when answering with “today.” This is different from traditional RNNs and LSTMs, which process sentences from left to right and forget the context of each word as the distance between the words increases.

Training a Transformer model to generate music

To work with musical datasets, the first step is to convert the data into a sequence of tokens. Each token represents a distinct musical event in the score. A token might represent something like the timestamp at which a note is struck, or the pitch of a note. These tokens relate to a musical score in much the same way that words relate to a sentence or paragraph: tokens in music can represent notes or other musical features, just as tokens in language can represent words or punctuation. This differs from previous models supported by AWS DeepComposer, such as GAN and AR-CNN, which treat music generation like an image generation problem.

These sequences of tokens are then used to train the Transformer model. During training, the model attempts to learn a distribution that matches the underlying distribution of the training dataset. During inference, the model generates a sequence of tokens by sampling from the distribution learned during training. The new musical score is created by turning the sequence of tokens back into music. Music Transformer and MuseNet are examples of other algorithms that use the Transformer architecture for music generation.

In AWS DeepComposer, we use the TransformerXL architecture to generate music because it’s capable of capturing long-term dependencies that are 4.5 times longer than a traditional Transformer. Furthermore, it has been shown to be 18 times faster than a traditional Transformer during inference. This means that AWS DeepComposer can provide you with higher quality musical compositions at lower latency when generating new compositions.

Extending your input melody using AWS DeepComposer

The Transformers technique extends your input melody by up to 20 seconds. The following screenshot shows the view of your input melody on the AWS DeepComposer console.

To extend your input melody, complete the following steps: 

  1. On the AWS DeepComposer console, in the navigation pane, choose Music studio.
  2. Choose the arrow next to Input melody to expand that section.
  3. For Sample track, choose a melody specifically recommended for the Transformers technique.

These options represent the kinds of complex classical-genre melodies that will work best with the Transformer technique. You can also import a MIDI file or create your own melody using a MIDI keyboard.

  4. Under Generative AI technique, for Model parameters, choose Transformers.

The available model, TransformerXLClassical, is preselected.

  5. Under Advanced parameters, you have seven model parameters that you can adjust (more details about these parameters are provided in the next section of this post).
  6. Choose Extend input melody.
  7. To listen to your new composition, choose Play (►).

This model works by extending your input melody by up to 20 seconds.

  8. After performing inference, you can use the Edit melody tool to add or remove notes, or change the pitch or the length of notes in the generated track.
  9. You can repeat these steps to create compositions up to 2 minutes long.

The following compositions were created using the TransformerXLClassical model in AWS DeepComposer:

Beethoven:

Mozart:

Bach:

In the next section of this post, we look at how different inference parameters affect your output and how we can effectively use these parameters to create interesting and diverse music.

Configuring the advanced parameters for Transformers

In AWS DeepComposer Music studio, you can choose from seven different Advanced parameters that can be used to change how your extended melody is created:

  • Sampling technique and sampling threshold
  • Creative risk
  • Input duration
  • Track extension duration
  • Maximum rest time
  • Maximum note length

Sampling technique and sampling threshold

You have three sampling techniques to choose from: TopK, Nucleus, and Random. You can also set the Sampling threshold value for your chosen technique. We first discuss each technique and provide some examples of how it affects the output below.

TopK sampling

When you choose the TopK sampling technique, the model chooses the K-tokens that have the highest probability of occurring. To set the value for K, change the Sampling threshold.

If your sampling threshold is set high, the number of available tokens (K) is large. A large number of available tokens means the model can choose from a wider variety of musical tokens. In your extended melody, this means the generated notes are likely to be more diverse, but it comes at the cost of potentially creating less coherent music.

On the other hand, if you choose a threshold value that is too low, the model is limited to choosing from a smaller set of tokens (that the model believes has a higher probability of being correct). In your extended melody, you might notice less musical diversity and more repetitive results. 

Nucleus sampling

At a high level, Nucleus sampling is very similar to TopK. Setting a higher sampling threshold allows for more diversity at the cost of coherence or consistency. There is a subtle difference between the two approaches. Nucleus sampling chooses the top probability tokens that sum up to the value set for the sampling threshold. We do this by sorting the probabilities from greatest to least, and calculating a cumulative sum for each token.

For example, we might have six musical tokens with the probabilities {0.3, 0.3, 0.2, 0.1, 0.05, 0.05}. If we choose TopK with a sampling threshold equal to 0.5, we choose three tokens (six total musical tokens * 0.5). Then we sample between the tokens with probabilities equal to 0.3, 0.3, and 0.2. If we choose Nucleus sampling with a 0.5 sampling threshold, we only sample between two tokens {0.3, 0.3} as the cumulative probability (0.6) exceeds the threshold (0.5).
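
The following Python snippet illustrates that difference on the example distribution above. It is a conceptual illustration of TopK and Nucleus truncation only, not AWS DeepComposer's internal sampling code.

import numpy as np

probs = np.array([0.3, 0.3, 0.2, 0.1, 0.05, 0.05])  # example token probabilities

def top_k_candidates(p, threshold):
    """Keep the K most likely tokens, where K = threshold * number of tokens."""
    k = max(1, int(round(threshold * len(p))))
    return np.argsort(p)[::-1][:k]

def nucleus_candidates(p, threshold):
    """Keep the smallest set of top tokens whose cumulative probability
    reaches the threshold."""
    order = np.argsort(p)[::-1]
    cumulative = np.cumsum(p[order])
    cutoff = np.searchsorted(cumulative, threshold) + 1
    return order[:cutoff]

print(top_k_candidates(probs, 0.5))    # the three most likely tokens
print(nucleus_candidates(probs, 0.5))  # the two most likely tokens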

Random sampling

Random sampling is the most basic sampling technique. With random sampling, the model is free to sample from all the available tokens, drawing “randomly” from the output distribution. The output of this sampling technique is identical to that of TopK or Nucleus sampling when the sampling threshold is set to 1. The following are some audio clips generated using different sampling thresholds paired with the TopK sampling technique.

The following audio uses TopK and a Sampling threshold equal to 0.1:

Notice how the notes quickly start to form a pattern.

The following audio uses TopK and a Sampling threshold equal to 0.9: 

You can decide which one sounds better, but did you hear the difference?

The notes are very diverse, but as a whole the notes lose coherence and sound somewhat random at times. This general trend holds for Nucleus sampling as well, but the results differ from TopK depending on the shape of the output distribution. Play around and see for yourself!

Creative risk

Creative risk is a parameter used to control the randomness of predictions. A low creative risk makes the model more confident but also more conservative in its samples (it’s less likely to sample from unlikely candidate tokens). On the other hand, a high creative risk produces a softer (flatter) probability distribution over the list of musical tokens, so the model takes more risks in its samples (it’s more likely to sample from unlikely candidate tokens), resulting in more diversity and probably more mistakes. Mistakes might include creating longer or shorter notes, longer or shorter periods of rest in the generated melody, or adding wrong notes to the generated melody.
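
If you think of creative risk as behaving like a softmax temperature over the model's raw scores (an assumption for illustration, not a statement about DeepComposer's internals), the effect looks like the following.

import numpy as np

def sampling_distribution(logits, creative_risk):
    """Higher creative risk flattens the distribution (more diverse, riskier samples);
    lower creative risk sharpens it (more conservative samples)."""
    scaled = np.asarray(logits, dtype=float) / creative_risk
    exp = np.exp(scaled - scaled.max())  # numerically stable softmax
    return exp / exp.sum()

logits = [2.0, 1.0, 0.2]
print(sampling_distribution(logits, 0.5))  # peaked distribution: low creative risk
print(sampling_distribution(logits, 2.0))  # flatter distribution: high creative risk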

Input duration

This parameter tells the model what portion of the input melody to use during inference. The portion used is defined as the number of seconds selected counting backwards from the end of the input track. When extending the melody, the model conditions the output it generates based on the portion of the input melody you provide. For example, if you choose 5 seconds as the input duration, the model only uses the last 5 seconds of the input melody for conditioning and ignores the remaining portion when performing inference. The following audio clips were generated using different input durations.

The following audio has an input duration of 5 seconds:

The following audio has an input duration of 30 seconds:

The output conditioned on 30 seconds of input draws more inspiration from the input melody.

Track extension duration

When extending the melody, the Transformer continuously generates tokens until the generated portion reaches the track extension duration you selected. The reason the model sometimes generates less than the duration you selected is that the model generates values in terms of tokens, not time. A token, however, can represent different lengths of time. For example, a token could represent a note duration of 0.1 seconds or 1 second depending on what the model thinks is appropriate, yet it takes the same amount of run time for the model to generate. Because the model can generate hundreds of tokens, this difference adds up. To make sure the model doesn't hit extreme runtime latencies, it sometimes stops before generating your entire requested output.

Maximum rest time

During inference, the Transformers model can create musical artifacts. Changing the value of maximum rest time limits the periods of silence, in seconds, the model can generate while performing inference.

Maximum note length

Changing the value of maximum note length limits the amount of time a single note can be held during inference. The following audio clips are example tracks generated using different maximum rest time and maximum note length values.

In the first example audio, we set the maximum note length to 10 seconds.

In the second sample, we set it to 1 second, but set the maximum rest period to 11 seconds.

In the third sample, we set the maximum note length to 1 second and maximum rest period to 2 seconds.

The first sample contains extremely long notes. The second sample doesn't contain long notes, but it contains long gaps of silence in the music. The third sample contains both shorter notes and shorter gaps.

Creating compositions using different AWS DeepComposer techniques

What’s great about AWS DeepComposer is that you can mix and match the new Transformers technique with the other techniques found in AWS DeepComposer, such as the AR-CNN and GAN techniques.

To create a sample, we completed the following steps: 

  1. Choose the sample melody Pathétique.
  2. Use the Transformers technique to extend the melody.

For this track, we extended the melody to 11 bars. Transformers tries to extend the melody up to the value you choose for the extension duration.

  3. The AR-CNN and GAN techniques only work with eight bars of input, so use the Edit melody feature to cut the track down to eight bars.

  4. Use the AR-CNN technique to fill in notes and enhance the melody.

For this post, we set Sampling iterations equal to 100.

  5. Use the GAN technique, paired with the MuseGAN algorithm and the Rock model, to generate accompaniments.

The following audio is our final output:

We think the output sounds pretty impressive. What do you think? Play around and see what kind of composition you can create yourself!

Conclusion

You’ve now learned about the Transformer model and how AWS DeepComposer uses it to extend your input melody. You also have a better understanding of how each parameter for the Transformers technique affects the characteristics of your composition.

To continue exploring AWS DeepComposer, consider some of the following:

  • Choose a different input melody. You can try importing a track or recording your own.
  • Use the Edit melody feature to assist your AI or correct mistakes.
  • Try feeding the output of the AR-CNN model into the Transformers model.
  • Iteratively extend your melody to create a musical composition up to 2 minutes long.

Although you don’t need a physical device to experiment with AWS DeepComposer, you can take advantage of a limited-time offer and purchase the AWS DeepComposer keyboard at a special price of $79.20 (20% off) on Amazon.com. The pricing includes the keyboard and a 3-month free trial of AWS DeepComposer.

We’re excited for you to try out various combinations to generate your creative musical piece. Start composing in the AWS DeepComposer Music Studio now!

 


About the Authors

Rahul Suresh is an Engineering Manager with the AWS AI org, where he has been working on AI-based products that make machine learning accessible to all developers. Prior to joining AWS, Rahul was a Senior Software Developer at Amazon Devices and helped launch highly successful smart home products. Rahul is passionate about building machine learning systems at scale and is always looking to get these advanced technologies into the hands of customers. In addition to his professional career, Rahul is an avid reader and a history buff.

 

Wayne Chi is a ML Engineer and AI Researcher at AWS. He works on researching interesting Machine Learning problems to teach new developers and then bringing those ideas into production. Prior to joining AWS he was a Software Engineer and AI Researcher at JPL, NASA where he worked on AI Planning and Scheduling systems for the Mars 2020 Rover (Perseverance). In his spare time he enjoys playing tennis, watching movies, and learning more about AI.

 

Liang Li is an AI Researcher at AWS, where she works on AI-based products that bring cutting-edge deep learning ideas to developers. Prior to joining AWS, Liang graduated from the University of Tennessee, Knoxville with a Ph.D. in EE, and she has been focusing on ML projects since graduation. In her spare time, she enjoys cooking and hiking.

 

 

Suri Yaddanapudi  is an AI Researcher and ML Engineer at AWS. He works on researching and implementing modern machine learning algorithms across different domains and teaching them to customers in a fun way. Prior to joining AWS, Suri graduated with his Ph.D. degree from University of Cincinnati and his thesis was focused on implementing AI techniques to Drug Repurposing. In his spare time, he enjoys reading, watching anime and playing futsal.

 

 

Aashiq Muhamed is an AI Researcher and ML Engineer at AWS. He believes that AI can change the world and that democratizing AI is key to making this happen. At AWS, he works on creating meaningful AI products and translating ideas from academia into industry. Prior to joining AWS, he was a graduate student at Stanford where he worked on model reduction in robotics, learning and control. In his spare time he enjoys playing the violin and thinking about healthcare on MARS.

 

 

Patrick L. Cavins is a Programmer Writer for DeepComposer and DeepLens. Previously, he worked in radiochemistry using isotopically labelled compounds to study how plants communicate. In his spare time, he enjoys skiing, playing the piano, and writing.

 

 

 

Maryam Rezapoor is a Senior Product Manager with AWS AI Devices team. As a former biomedical researcher and entrepreneur, she finds her passion in working backward from customers’ needs to create new impactful solutions. Outside of work, she enjoys hiking, photography, and gardening.

Read More

Store output in custom Amazon S3 bucket and encrypt using AWS KMS for multi-page document processing with Amazon Textract

Store output in custom Amazon S3 bucket and encrypt using AWS KMS for multi-page document processing with Amazon Textract

Amazon Textract is a fully managed machine learning (ML) service that makes it easy to process documents at scale by automatically extracting printed text, handwriting, and other data from virtually any type of document. Amazon Textract goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables. This enables businesses across many industries, including financial, medical, legal, and real estate, to easily process large numbers of documents for different business operations. Healthcare providers, for example, can use Amazon Textract to extract patient information from an insurance claim or values from a table in a scanned medical chart without requiring customization or human intervention. The blog post Automatically extract text and structured data from documents with Amazon Textract shows how to use Amazon Textract to automatically extract text and data from scanned documents without any machine learning (ML) experience.

Amazon Textract provides both synchronous and asynchronous API actions to extract document text and analyze the document text data. You can use synchronous APIs for single-page documents and low latency use cases such as mobile capture. Asynchronous APIs can process single-page or multi-page documents such as PDF documents with thousands of pages.

In this post, we show how to control the output location and the AWS Key Management Service (AWS KMS) key used to encrypt the output data when you use the Amazon Textract asynchronous API.

Amazon Textract asynchronous API

Amazon Textract provides asynchronous APIs to extract text and structured data from single-page documents (JPEG, PNG, or PDF) or multi-page documents in PDF format. Processing documents asynchronously allows your application to complete other tasks while it waits for the process to complete. You can use StartDocumentTextDetection and GetDocumentTextDetection to detect lines and words in a document, or use StartDocumentAnalysis and GetDocumentAnalysis to detect lines, words, forms, and table data from a document.

The following diagram shows the workflow of an asynchronous API action. We use AWS Lambda as an example of the compute environment calling Amazon Textract, but the general concept applies to other compute environments as well.

  1. You start by calling the StartDocumentTextDetection or StartDocumentAnalysis API with an Amazon Simple Storage Service (Amazon S3) object location that you want to process, and a few additional parameters.
  2. Amazon Textract gets the document from the S3 bucket and starts a job to process the document.
  3. As the document is processed, Amazon Textract internally saves and encrypts the inference results and notifies you using an Amazon Simple Notification Service (Amazon SNS) topic.
  4. You can then call the corresponding GetDocumentTextDetection or GetDocumentAnalysis API to get the results in JSON format.

Store and encrypt output of asynchronous API in custom S3 bucket

When you start an Amazon Textract job by calling StartDocumentTextDetection or StartDocumentAnalysis, you can provide an optional parameter called OutputConfig. This parameter allows you to specify the S3 bucket for storing the output. Another optional input parameter, KMSKeyId, allows you to specify the AWS KMS customer master key (CMK) to use to encrypt the output. The user calling the Start operation must have permission to use the specified CMK.

The following diagram shows the overall workflow when you use the output preference parameter with the Amazon Textract asynchronous API.

  1. You start by calling the StartDocumentTextDetection or StartDocumentAnalysis API with an S3 object location, an output S3 bucket name, an output prefix for the S3 path, a KMS key ID, and a few additional parameters.
  2. Amazon Textract gets the document from the S3 bucket and starts a job to process the document.
  3. As the document is processed, Amazon Textract stores the JSON output at the path in the output bucket and encrypts it using the KMS CMK that was specified in the start call.
  4. You get a job completion notification via Amazon SNS.
  5. You can then call the corresponding GetDocumentTextDetection or GetDocumentAnalysis API to get the JSON result. You can also get the JSON result directly from the output S3 bucket at the path with the following format: s3://{S3Bucket}/{S3Prefix}/{TextractJobId}/*.

 

Starting the asynchronous job with OutputConfig

The following code shows how you can start the asynchronous API job to analyze a document and store encrypted inference output in a custom S3 bucket:

import boto3
client = boto3.client('textract')
response = client.start_document_analysis(
    # Location of the input document in Amazon S3
    DocumentLocation={
        'S3Object': {
            'Bucket': 'string',
            'Name': 'string',
            'Version': 'string'
        }
    },
    # Additional parameters (such as FeatureTypes) omitted here
    ...
    # Custom S3 location for the JSON output
    OutputConfig={
        'S3Bucket': 'string',
        'S3Prefix': 'string'
    },
    # AWS KMS CMK used to encrypt the output
    KMSKeyId='string'
)

The following code shows how you can get the results of the document analysis job:

response = client.get_document_analysis(JobId='string',MaxResults=123,NextToken='string')
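For large documents, the results span multiple response pages, so you typically loop until no NextToken is returned. The following is a minimal sketch of that loop:

import boto3

client = boto3.client('textract')

def get_all_analysis_pages(job_id):
    """Collect every page of results for a completed analysis job by following NextToken."""
    pages = []
    next_token = None
    while True:
        kwargs = {'JobId': job_id, 'MaxResults': 1000}
        if next_token:
            kwargs['NextToken'] = next_token
        response = client.get_document_analysis(**kwargs)
        pages.append(response)
        next_token = response.get('NextToken')
        if not next_token:
            return pages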

You can also use the AWS SDK to download the output directly from your custom S3 bucket.
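For example, when you provided OutputConfig, the JSON files can be read straight from the output path shown earlier. The following sketch assumes placeholder values for the bucket, prefix, and job ID:

import json
import boto3

s3 = boto3.client('s3')
bucket = 'your-output-bucket'        # the S3Bucket value from OutputConfig
prefix = 'your-output-prefix'        # the S3Prefix value from OutputConfig
job_id = 'REPLACE_WITH_JOB_ID'

results = []
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix=f'{prefix}/{job_id}/'):
    for obj in page.get('Contents', []):
        body = s3.get_object(Bucket=bucket, Key=obj['Key'])['Body'].read()
        try:
            results.append(json.loads(body))
        except ValueError:
            pass  # skip any non-JSON marker objects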

The following table shows how the Amazon Textract output is stored and encrypted based on the provided input parameters of OutputConfig and KMSKeyId.

| OutputConfig | KMSKeyId | Amazon Textract Output |
| --- | --- | --- |
| None | None | Stored internally by Amazon Textract and encrypted using an AWS owned CMK. |
| Customer’s S3 bucket | None | Stored in the customer’s S3 bucket and encrypted using SSE-S3. |
| None | Customer managed CMK | Stored internally by Amazon Textract and encrypted using the customer managed CMK. |
| Customer’s S3 bucket | Customer managed CMK | Stored in the customer’s S3 bucket and encrypted using the customer managed CMK. |

IAM permissions

When you use the Amazon Textract APIs to start an analysis or detection job, you must have access to the S3 object specified in your call. To take advantage of output preferences to write the output to an encrypted object in Amazon S3, you must have the necessary permissions for both the target S3 bucket and the CMK specified when you call the analysis or detection APIs.

The following example AWS Identity and Access Management (IAM) identity policy allows you to get objects from the textract-input S3 bucket with a prefix:

{
"Sid":"AllowTextractUserToReadInputData",
"Action":["s3:GetObject"],
"Effect":"Allow",
"Resource":["arn:aws:s3:::textract-input/documents/*"]
}

The following IAM identity policy allows you to write output to the textract-output S3 bucket with a prefix:

{
"Sid":"AllowTextractUserToWriteOutputData",
"Action":["s3:PutObject"],
"Effect":"Allow",
"Resource":["arn:aws:s3:::textract-output/output/*"]
}

When placing objects into Amazon S3 using SSE-KMS, you need specific permissions on the CMK. The following CMK policy language allows a user (textract-start) to use the CMK to protect the output files from an Amazon Textract analysis or detection job:

{
  "Sid": "Allow use of the key to write Textract output to S3",
  "Effect": "Allow",
  "Principal": {"AWS":"arn:aws:iam::111122223333:user/textract-start"},
  "Action": ["kms:DescribeKey","kms:GenerateDataKey", "kms:ReEncrypt", "kms:Decrypt"],
  "Resource": "*"
}

The following KMS key policy allows a user (textract-get) to get the output file that’s backed by SSE-KMS.

{
"Sid": "Allow use of the key to read S3 objects for output",
"Effect": "Allow",
"Principal": {"AWS": "arn:aws:iam::111122223333:user/textract-get"},
"Action": ["kms:Decrypt","kms:DescribeKey"],
"Resource": "*"
}

You must still have separate sections of the key policy to allow the management of the key.

For some workloads, you may need to provide a record of actions taken by a user, role, or an AWS service in Amazon Textract. Amazon Textract is integrated with AWS CloudTrail, which captures all API calls for Amazon Textract as events. For more information, see Logging Amazon Textract API Calls with AWS CloudTrail.

AWS KMS and Amazon S3 provide similar integration with CloudTrail. For more information, see Logging AWS KMS API calls with AWS CloudTrail and Logging Amazon S3 API calls using AWS CloudTrail, respectively. To get log visibility into Amazon S3 GET and PUT operations, you can enable data event logging for Amazon S3 in CloudTrail. This gives you end-to-end visibility into your document-processing lifecycle.

Conclusion

In this post, we showed you how to use the Amazon Textract asynchronous API and your S3 bucket and AWS KMS CMK to store and encrypt the results of Amazon Textract output. We also highlighted how you can use CloudTrail integration to get visibility into your overall document processing lifecycle.

For more information about different security controls in Amazon Textract, see Security in Amazon Textract.

 


About the Authors

Kashif Imran is a Principal Solutions Architect at Amazon Web Services. He works with some of the largest AWS customers who are taking advantage of AI/ML to solve complex business problems. He provides technical guidance and design advice to implement computer vision applications at scale. His expertise spans application architecture, serverless, containers, NoSQL and machine learning.

 

 

 

Peter M. O’Donnell is an AWS Principal Solutions Architect, specializing in security, risk, and compliance with the Strategic Accounts team. Formerly dedicated to a major US commercial bank customer, Peter now supports some of AWS’s largest and most complex strategic customers in security and security-related topics, including data protection, cryptography, incident response, and CISO engagement.

Read More

Incorporating your enterprise knowledge graph into Amazon Kendra

Incorporating your enterprise knowledge graph into Amazon Kendra

For many organizations, consolidating information assets and making them available to employees when needed remains a challenge. Commonly used technologies like spreadsheets, relational databases, and NoSQL databases exacerbate this issue by creating more and more unconnected, unstructured data.

Knowledge graphs can make this data easier to access and understand by organizing it and capturing dataset semantics, properties, and relationships. Although some organizations build knowledge graphs in graph databases like Amazon Neptune to add structure to their data, they still lack a targeted search engine that users can leverage to search this information.

Amazon Kendra is an intelligent search service powered by machine learning. Kendra reimagines enterprise search for your websites and applications so your employees and customers can easily find the content they are looking for, even when it’s scattered across multiple locations and content repositories within your organization.

This solution illustrates how to create an intelligent search engine on AWS using Amazon Kendra to search a knowledge graph stored in Amazon Neptune. We illustrate how you can provision a new Amazon Kendra index in just a few clicks – no prior Machine Learning (ML) experience required! We then show how an existing knowledge graph stored in Amazon Neptune can be connected to Amazon Kendra’s pipeline as metadata as well as into the knowledge panels to build a targeted search engine. Knowledge panels are information boxes that appear on search engines when you search for entities (people, places, organizations, things) that are contained in the knowledge graph. Finally, this solution provides you with a ranked list of the top notable entities that match certain criteria and finds more relevant search results extracted from the knowledge graph.

The following are several common use cases for integrating an enterprise search engine with a knowledge graph:

  • Add a knowledge graph as metadata to Amazon Kendra to return more relevant results
  • Derive a ranked list of the top notable entities that match certain criteria
  • Predictively complete entities in a search box
  • Annotate or organize content using the knowledge graph entities by querying in Neptune

Required Services

In order to complete this solution, you will require the following:

Sample dataset

For this solution, we use a subset of the Arbitration Awards Online database, which is publicly available. Arbitration is an alternative to litigation or mediation when resolving a dispute. Arbitration panels are composed of one or three arbitrators who are selected by the parties. They read the pleadings filed by the parties, listen to the arguments, study the documentary and testimonial evidence, and render a decision.

The panel’s decision, called an award, is final and binding on all the parties. All parties must abide by the award, unless it’s successfully challenged in court within the statutory time period. Arbitration is generally confidential, and documents submitted in arbitration are not publicly available, unlike court-related filings.

However, if an award is issued at the conclusion of the case, the Financial Industry Regulatory Authority (FINRA) posts it in its Arbitration Awards Online database, which is publicly available. We use a subset of this dataset for our use case under FINRA licensing (©2020 FINRA. All rights reserved. FINRA is a registered trademark of the Financial Industry Regulatory Authority, Inc. Reprinted with permission from FINRA) to create a knowledge graph for awards.

Configuring your document repository

Before you can create an index in Amazon Kendra, you need to load documents into an S3 bucket. This section contains instructions to create an S3 bucket, get the files, and load them into the bucket. After completing all the steps in this section, you have a data source that Amazon Kendra can use.

  1. On the AWS Management Console, in the Region list, choose US East (N. Virginia) or any Region of your choice that Amazon Kendra is available in.
  2. Choose Services.
  3. Under Storage, choose S3.
  4. On the Amazon S3 console, choose Create bucket.
  5. Under General configuration, provide the following information:
    1. Bucket name – enterprise-search-poc-ds-UNIQUE-SUFFIX
    2. Region – Choose the same Region that you use to deploy your Amazon Kendra index (this post uses US East (N. Virginia) us-east-1)
  6. Under Bucket settings for Block Public Access, leave everything with the default values.
  7. Under Advanced settings, leave everything with the default values.
  8. Choose Create bucket.
  9. Download kendra-graph-blog-data and unzip the files.
  10. Upload the index_data and graph_data folders from the unzipped files.

Inside your bucket, you should now see two folders: index_data (with 20 objects) and graph_data (with two objects).

The following screenshot shows the contents of enterprise-search-poc-ds-UNIQUE-SUFFIX.

The following screenshot shows the contents of index_data.

The index_data folder contains two files: an arbitration PDF file and arbitration metadata file.

The following code is an example for arbitration metadata. DocumentId is the arbitration case number, and we use this identifier to create a correlation between the Amazon Kendra index and a graph dataset that we load into Neptune.

{
  "DocumentId": "17-00486",
  "ContentType": "PDF",
  "Title": "17-00486",
  "Attributes": {
    "_source_uri": "https://www.finra.org/sites/default/files/aao_documents/17-00486.pdf"
  }
}

The following screenshot shows the contents of graph_data.

Setting up an Amazon Kendra index

In this section, we set up an Amazon Kendra index and configure an S3 bucket as the data source.

  1. Sign in to the console and confirm that you have set the Region to us-east-1.
  2. Navigate to the Amazon Kendra service and choose Launch Amazon Kendra.
  3. For Index name, enter enterprise-search-poc.
  4. For IAM role, choose Create a new role.
  5. For Role name, enter poc-role.
  6. Leave Use an AWS KMS managed encryption key at its default setting.
  7. Choose Create.

The index creation may take some time. For more information about AWS Identity and Access Management (IAM) access roles, see IAM access roles for Amazon Kendra.

  1. When index creation is complete, on the Amazon Kendra console, choose your new index.
  2. In the Index settings section, locate the index ID.

You use the index ID in a later step.

Adding a data source

To add your data source, complete the following steps:

  1. On the Amazon Kendra console, choose your index.
  2. Choose Add data source.
  3. For Select connector type for your data source, choose Add connector under Amazon S3.
  4. For Data source name, enter a name for your data source (for example, ent-search-poc-ds-001).
  5. For Enter the data source location, choose Browse S3 and choose the bucket you created earlier.
  6. For IAM role, choose Create new role.
  7. For Role name, enter poc-ds-role.
  8. In the Additional configuration section, on the Include pattern tab, add index_data.
  9. In the Set sync run schedule section, choose Run on demand.
  10. Choose Next.

  1. Review your details and choose Create.
  2. In the details page of your data source, choose Sync now.

Amazon Kendra starts crawling and indexing the data source from Amazon S3 and prepares the index.

You can also monitor the process on Amazon CloudWatch.

Searching the ingested documents

To test the index, complete the following steps:

  1. On the Amazon Kendra console, navigate to your index.
  2. Choose Search console.

  1. Enter a question (for example, how much was initial claim fees for case 17-00486?).

The following screenshot shows your search results.

Knowledge graph

This section describes a knowledge graph of the entities and relationships that participate in arbitration panels. We use Apache TinkerPop Gremlin format to load the data to Neptune. For more information, see Gremlin Load Data Format.

To load Apache TinkerPop Gremlin data using the CSV format, you must specify the vertices and the edges in separate files. The loader can load from multiple vertex files and multiple edge files in a single load job.
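For illustration, a vertex file and an edge file in this format look roughly like the following. The property columns and values here are simplified placeholders based on the ontology described below, not the exact files used in this post.

Vertex file:

~id,~label,problem:String,firm:String
17-00486,award,Example problem type,Example firm
14-02936,award,Example problem type,Example firm

Edge file:

~id,~from,~to,~label
e-1,17-00486,14-02936,is_related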

The following diagram shows the graph ontology. Each award has properties, such as Problem, Customer, Representative, Firm, and Subtype. The is_related edge shows relationships between awards.


You can access the CSV files for both vertices and edges in the enterprise-search-poc.zip file. The following screenshot shows a tabular view of the vertex file.

The following screenshot shows a tabular view of the edge file.

Launching the Neptune-SageMaker stack

You can launch the Neptune-SageMaker stack from the AWS CloudFormation console by choosing Launch Stack. The stack used in this post launches in the US East (N. Virginia) Region.


Acknowledge that AWS CloudFormation will create IAM resources, and choose Create.

The Neptune and Amazon SageMaker resources described here incur costs. With Amazon SageMaker hosted notebooks, you pay simply for the Amazon Elastic Compute Cloud (Amazon EC2) instance that hosts the notebook. For this post, we use an ml.t2.medium instance, which is eligible for the AWS free tier.

The solution creates five stacks, as shown in the following screenshot.

Browsing and running the content

After the stacks are created, you can browse your notebook instance and run the content.

  1. On the Amazon SageMaker console, choose Notebook instances on the navigation pane.
  2. Select your instance and from the Actions menu, choose Open Jupyter.

  1. In the Jupyter window, in the Neptune directory, open the Getting-Started directory.

The Getting-Started directory contains three notebooks:

  • 01-Introduction.ipynb
  • 02-Labelled-Property-Graph.ipynb
  • 03-Graph-Recommendations.ipynb

The first two introduce Neptune and the property graph data model. The third contains a runnable example of an arbitration knowledge graph recommendation engine. When you run the content, the notebook populates Neptune with a sample award dataset and issues several queries to generate related-case recommendations.

  1. To see this in action, open 03-Graph-Recommendations.ipynb.
  2. Change the bulkLoad Amazon S3 location to your created Amazon S3 location (enterprise-search-poc-ds-UNIQUE-SUFFIX).

  1. Run each cell in turn, or choose Run All from the Cell drop-down menu.

You should see the results of each query printed below each query cell (as in the following screenshot).

Architecture overview

DocumentId is the key that connects the Amazon Kendra index and the knowledge graph data: the DocumentId in the Amazon Kendra index should be the same as the ~id of the corresponding graph node. This creates the association between the Amazon Kendra results and graph nodes. The following diagram shows the architecture for integrating Amazon Kendra and Neptune.

The architecture workflow includes the following steps:

  1. The search user interface (UI) sends the query to Amazon Kendra.
  2. Amazon Kendra returns results based on its best match.
  3. The UI component calls Neptune via Amazon API Gateway and AWS Lambda with the docId as the request parameter.
  4. Neptune runs the query and returns all related cases for the requested docId (a minimal Lambda sketch follows this list).
  5. The UI component renders the knowledge panel with the graph responses.
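The following is a minimal sketch of steps 3 and 4 (not the exact Lambda function deployed by the CloudFormation stack in this post). It queries Neptune over Gremlin for awards connected to the requested docId through is_related edges; the NEPTUNE_ENDPOINT environment variable is a placeholder.

import json
import os

from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

NEPTUNE_ENDPOINT = os.environ.get('NEPTUNE_ENDPOINT', 'your-neptune-cluster-endpoint')

def lambda_handler(event, context):
    doc_id = event['docId']
    conn = DriverRemoteConnection(f'wss://{NEPTUNE_ENDPOINT}:8182/gremlin', 'g')
    try:
        g = traversal().withRemote(conn)
        # Awards connected to this case number through is_related edges
        related = [str(v.id) for v in g.V(doc_id).both('is_related').toList()]
    finally:
        conn.close()
    return {'statusCode': 200, 'body': json.dumps({'docId': doc_id, 'relatedCases': related})}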

Testing Neptune via API Gateway

To test Neptune via API Gateway, complete the following steps:

  1. On the API Gateway console, choose APIs.
  2. Choose KendraGraphAPI to open the API page.
  3. Select the POST method and choose Test.
  4. Enter the following sample event data:
    {
      "docId": "14-02936",
      "repCrd": "5048331",
      "firm": ["23131","29604"]
    }

  5. Choose Test.

This sends an HTTP POST request to the endpoint, using the sample event data in the request body. In the following screenshot, the response shows the related cases for award 14-02936.

  1. On the navigation pane, choose Stages.
  2. Choose dev.
  3. Copy the Invoke URL value and save it to use in the next step.

  1. To test the HTTP, enter the following CURL command. Replace the endpoint with your API Gateway invoke URL.
    curl --location --request POST 'REPLACE_WITH_API_GATEWAY_ENDPOINT' \
    --header 'Content-Type: application/json' \
    --data-raw '{
    "docId": "14-02936",
    "repCrd": "5048331",
    "firm": ["23131","29604"]
    }'
    

Testing Neptune via Lambda

Another way to test Neptune is with a Lambda function. This section outlines how to invoke the Lambda function using the sample event data provided.

  1. On the Lambda console, choose Kendra-Neptune-Graph-AddLamb-NeptuneLambdaFunction.
  2. Choose Test.
  3. In the Configure test event page, choose Create new test event.
  4. For Event template, keep the default Hello World template.
  5. For Event name, enter a name and use the following sample event:
    {
      "docId": "14-02936",
      "repCrd": "5048331",
      "firm": ["23131","29604"]
    }

  1. Choose Create.

  1. Choose Test.

Each user can create up to 10 test events per function. Those test events aren’t available to other users.

Lambda runs your function on your behalf. The handler in your Lambda function receives and processes the sample event.

  1. After the function runs successfully, view the results on the Lambda console.

The results have the following sections:

  • Execution result – Shows the run status as succeeded and also shows the function run results, returned by the return statement.
  • Summary – Shows the key information reported in the Log output section (the REPORT line in the run log).
  • Log output – Shows the logs Lambda generates for each run. These logs are written to CloudWatch in the log group that corresponds to the Lambda function; the Lambda console shows them for your convenience, and the Click here link opens the logs on the CloudWatch console.

Developing the web app

In this section, we develop a web app with a search interface to search the documents. We use AWS Cloud9 as our integrated development environment (IDE) and Amplify to build and deploy the web app.

AWS Cloud9 is a cloud-based IDE that lets you write, run, and debug your code with just a browser. It includes a code editor, debugger, and a terminal. AWS Cloud9 comes prepackaged with essential tools for popular programming languages, including JavaScript, Python, PHP, and more, so you don’t need to install files or configure your development machine to start new projects.

The AWS Cloud9 workspace should be built by an IAM user with administrator privileges, not the root account user. Please ensure you’re logged in as an IAM user, not the root account user.

Ad blockers, JavaScript disablers, and tracking blockers should be disabled for the AWS Cloud9 domain, otherwise connecting to the workspace might be impacted.

Creating a new environment

To create your environment, complete the following steps:

  1. On the AWS Cloud9 console, make sure you’re using one of the following Regions:
    • US East (N. Virginia)
    • US West (Oregon)
    • Asia Pacific (Singapore)
    • Europe (Ireland)
  2. Choose Create environment.
  3. Name the environment kendrapoc.
  4. Choose Next step.
  5. Choose Create a new instance for environment (EC2) and choose small.
  6. Leave all the environment settings as their defaults and choose Next step.
  7. Choose Create environment.

Preparing the environment

To prepare your environment, make sure you’re in the default directory in an AWS Cloud9 terminal window (~/environment) before completing the following steps:

  1. Enter the following code to download the code for the Amazon Kendra sample app and extract it in a temporary directory:
    mkdir tmp
    cd tmp
    aws s3 cp s3://aws-ml-blog/artifacts/Incorporating-your-enterprise-knowledge-graph-into-Amazon-Kendra/kendra-graph-blog-data/kendrasgraphsampleui.zip .
    unzip kendrasgraphsampleui.zip
    rm kendrasgraphsampleui.zip
    cd ..

  1. We build our app in ReactJS with the following code:
    echo fs.inotify.max_user_watches=524288 | sudo tee -a /etc/sysctl.conf && sudo sysctl -p 
    npx create-react-app kendra-poc

  1. Change the working directory to kendra-poc and ensure that you’re in the /home/ec2-user/environment/kendra-poc directory:

    cd kendra-poc

  1. Install a few prerequisites with the following code:
    npm install --save node-sass typescript bootstrap react-bootstrap @types/lodash aws-sdk
    npm install --save semantic-ui-react
    npm install aws-amplify @aws-amplify/ui-react
    npm install -g @aws-amplify/cli

  2. Copy the source code to the src directory:
    cp -r ../tmp/kendrasgraphsampleui/* src/

Initializing Amplify

To initialize Amplify, complete the following steps:

  1. On the command line, in the kendra-poc directory, enter the following code:
    amplify init

  2. Choose Enter.
  3. Accept the default project name kendrapoc.
  4. Enter dev for the environment name.
  5. Choose None for the default editor (we use AWS Cloud9).
  6. Choose JavaScript and React when prompted.
  7. Accept the default values for paths and build commands.
  8. Choose the default profile when prompted.

Your run should look like the following screenshot.

Adding authentication

To add authentication to the app, complete the following steps:

  1. Enter the following code:
    amplify add auth

  1. Choose Default Configuration when asked if you want to use the default authentication and security configuration.
  2. Choose Username when asked how you want users to sign in.
  3. Choose No, I am done. when asked about advanced settings.

This session should look like the following screenshot.

  1. To create these changes in the cloud, enter:
    amplify push

  1. Confirm you want Amplify to make changes in the cloud for you.

Provisioning takes a few minutes to complete. The Amplify CLI takes care of provisioning the appropriate cloud resources and updates src/aws-exports.js with all the configuration data we need to use the cloud resources in our app.

Amazon Cognito lets you add user sign-up, sign-in, and access control to your web and mobile apps quickly and easily. We made a user pool, which is a secure user directory that lets our users sign in with the user name and password pair they create during registration. Amazon Cognito (and the Amplify CLI) also supports configuring sign-in with social identity providers, such as Facebook, Google, and Amazon, and enterprise identity providers via SAML 2.0. For more information, see the Amazon Cognito Developer Guide and the Amplify Authentication documentation.

Configuring an IAM role for authenticated users

To configure your IAM role for authenticated users, complete the following steps:

  1. In the AWS Cloud9 IDE, in the left panel, browse to the file kendra-poc/amplify/team-provider-info.json and open it (double-click).
  2. Note the value of AuthRoleName.
  1. On the IAM console, choose Roles.
  2. Search for the AuthRole using the value from the previous step and open that role.
  1. Choose Add inline policy.
  2. Choose JSON and replace the contents with the following policy. Replace enterprise-search-poc-ds-UNIQUE-SUFFIX with the name of the S3 bucket that is configured as the data source (leave the * after the bucket name, which allows the policy to access any object in the bucket). Replace ACCOUNT-NUMBER with the AWS account number and KENDRA-INDEX-ID with the index ID of your index.
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": "kendra:ListIndices",
                "Resource": "*"
            },
            {
                "Sid": "VisualEditor1",
                "Effect": "Allow",
                "Action": [
                    "kendra:Query",
                    "s3:ListBucket",
                    "s3:GetObject",
                    "kendra:ListFaqs",
                    "kendra:ListDataSources",
                    "kendra:DescribeIndex",
                    "kendra:DescribeFaq",
                    "kendra:DescribeDataSource"
                ],
                "Resource": [
                    "arn:aws:s3:::enterprise-search-poc-ds-UNIQUE-SUFFIX*",
                    "arn:aws:kendra:us-east-1:ACCOUNT-NUMBER:index/KENDRA-INDEX-ID"
                ]
            }
        ]
    }

  3. Choose Review policy.
  4. Enter a policy name, such as my-kendra-poc-policy.
  5. Choose Create policy.
  6. Browse back to the role and confirm that my-kendra-poc-policy is present.
  1. Create a user (poctester) and add it to the customer group by entering the following at the command prompt:
USER_POOL_ID=`grep user_pools_id src/aws-exports.js | awk 'BEGIN {FS = "\""} {print $4}'`
aws cognito-idp create-group --group-name customer --user-pool-id $USER_POOL_ID
aws cognito-idp admin-create-user --user-pool-id $USER_POOL_ID --username poctester --temporary-password AmazonKendra
aws cognito-idp admin-add-user-to-group --user-pool-id $USER_POOL_ID --username poctester --group-name customer

Configuring the application

You’re now ready to configure the application.

  1. In the AWS Cloud9 environment, browse to the file kendra-poc/src/search/Search.tsx and open it for editing.

A new window opens.

  1. Replace REPLACE_WITH_KENDRA-INDEX-ID with your index ID.
  1. In the AWS Cloud9 environment, browse to the file kendra-poc/src/properties.js and open it for editing.
  2. In the new window that opens, replace REPLACE_WITH_API_GATEWAY_ENDPOINT with the API Gateway invoke URL value from earlier.
  1. Start the application in the AWS Cloud9 environment by entering the following code in the command window in the ~/environment/kendra-poc directory:
    npm start

Compiling the code and starting takes a few minutes.

  1. Preview the running application by choosing Preview on the AWS Cloud9 menu bar.
  2. From the drop-down menu, choose Preview Running Application.

A new browser window opens.

  1. Log in with any of the users we configured earlier (poctester) with the temporary password AmazonKendra.

Amazon Cognito forces a password reset upon first login.

Using the application

Now we can try out the app we developed by making a few search queries, such as “how much was initial claim fees for case 17-00486?”

The following screenshot shows the Amazon Kendra results and the knowledge panel, which shows the related cases for each result. The knowledge panel details on the left are populated from the graph database and show all the related cases for each search item.

Conclusion

This post demonstrated how to build a targeted and flexible cognitive search engine with a knowledge graph stored in Neptune and integrated with Amazon Kendra. You can enable rapid search for your documents and graph data using natural language, without any previous AI or ML experience. Finally, you can create an ensemble of other content types, including any combination of structured and unstructured documents, to make your archives indexable and searchable for harvesting knowledge and gaining insight. For more information about Amazon Kendra, see AWS re:Invent 2019 – Keynote with Andy Jassy on YouTube, Amazon Kendra FAQs, and What is Amazon Kendra?

 


About the Authors

Dr. Yazdan Shirvany is a Senior Solutions Architect holding all 12 AWS certifications, with deep experience in AI/ML, IoT, and big data technologies including NLP, knowledge graphs, application reengineering, and optimizing software to leverage the cloud. Dr. Shirvany has 20+ scientific publications and several issued patents in the AI/ML field. Dr. Shirvany holds an M.S. and Ph.D. in Computer Science from Chalmers University of Technology.

 

Dipto Chakravarty is a leader in Amazon’s Alexa engineering group and heads up the Personal Mobility team in HQ2 utilizing AI, ML and IoT to solve local search analytics challenges. He has 12 patents issued to date and has authored two best-selling books on computer architecture and operating systems published by McGraw-Hill and Wiley. Dipto holds a B.S and M.S in Computer Science and Electrical Engineering from U. of Maryland, an EMBA from Wharton School, U. Penn, and a GMP from Harvard Business School.

 

Mohit Mehta is a leader in the AWS Professional Services Organization with expertise in AI/ML and Big Data technologies. Mohit holds a M.S in Computer Science, all 12 AWS certifications, MBA from College of William and Mary and GMP from Michigan Ross School of Business.

Read More

Predicting qualification ranking based on practice session performance for Formula 1 Grand Prix

Predicting qualification ranking based on practice session performance for Formula 1 Grand Prix

If you’re a Formula 1 (F1) fan, have you ever wondered why F1 teams have very different performances between qualifying and practice sessions? Why do they have multiple practice sessions in the first place? Can practice session results actually tell something about the upcoming qualifying race? In this post, we answer these questions and more. We show you how we can predict qualifying results based on practice session performances by harnessing the power of data and machine learning (ML). These predictions are being integrated into the new “Qualifying Pace” insight for each F1 Grand Prix (GP). This work is part of the continuous collaboration between F1 and the Amazon ML Solutions Lab to generate new F1 Insights powered by AWS.

Each F1 GP consists of several stages. The event starts with three practice sessions (P1, P2, and P3), followed by a qualifying (Q) session, and then the final race. Teams approach practice and qualifying sessions differently because these sessions serve different purposes. The practice sessions are the teams’ opportunities to test out strategies and tire compounds to gather critical data in preparation for the final race. They observe the car’s performance with different strategies and tire compounds, and use this to determine their overall race strategy.

In contrast, qualifying sessions determine the starting position of each driver on race day. Teams focus solely on obtaining the fastest lap time. Because of this shift in tactics, Friday and Saturday practice session results often fail to accurately predict the qualifying order.

In this post, we introduce deterministic and probabilistic methods to model the time difference between the fastest lap time in practice sessions and the qualifying session (∆t = t_q - t_p). The goal is to more accurately predict the upcoming qualifying standings based on the practice sessions.

Error sources of ∆t

The delta of the fastest lap time between practice and qualifying sessions (∆t) comes primarily from variations in fuel level and tire grip.

A higher fuel level adds weight to the car and reduces the speed of the car. For practice sessions, teams vary the fuel level as they please. For the second practice session (P2), it’s common to begin with a low fuel level and run with more fuel in the latter part of the session. During qualifying, teams use minimal fuel levels in order to record the fastest lap time. The impact of fuel on lap time varies from circuit to circuit, depending on how many straights the circuit has and how long these straights are.

Tires also play a significant role in an F1 car’s performance. During each GP event, the tire supplier brings various tire types with varying compounds suitable for different racing conditions. Two of these are for wet circuit conditions: intermediate tires for light standing water and wet tires for heavy standing water. The remaining dry running tires can be categorized into three compound types: hard, medium, and soft. These tire compounds provide different grips to the circuit surface. The more grip the tire provides, the faster the car can run.

Past racing results showed that car performance dropped significantly when wet tires were used. For example, in the 2018 Italy GP, because the P1 session was wet and the qualifying session was dry, the fastest lap time in P1 was more than 10 seconds slower than the qualifying session.

Among the dry running types, the hard tire provides the least grip but is the most durable, whereas the soft tire has the most grip but is the least durable. Tires degrade over the course of a race, which reduces the tire grip and slows down the car. Track temperature and moisture affects the progression of degradation, which in turn changes the tire grip. As in the case with fuel level, tire impact on lap time changes from circuit to circuit.

Data and attempted approaches

Given this understanding of factors that can impact lap time, we can use fuel level and tire grip data to estimate the final qualifying lap time based on known practice session performance. However, as of this writing, data records to directly infer fuel level and tire grip during the race are not available. Therefore, we take an alternative approach with data we can currently obtain.

The data we used in the modeling were records of fastest lap times for each GP since 1950 and partial years of weather data for the corresponding sessions. The lap times data included the fastest lap time for each session (P1, P2, P3, and Q) of each GP with the driver, car and team, and circuit name (publicly available on F1’s website). Track wetness and temperature for each corresponding session was available in the weather data.

We explored two implicit methods with the following model inputs: the team and driver name, and the circuit name. Method one was a rule-based empirical model that attributed the observed ∆t to circuits and teams. We estimated the latent parameter values (fuel level and tire grip differences specific to each team and circuit) based on their known lap time sensitivities. These sensitivities were provided by F1 and calculated through simulation runs on each circuit track. Method two was a regression model with driver and circuit indicators. The regression model learned the sensitivity of ∆t for each driver on each circuit without explicitly knowing the fuel level and tire grip exerted. We developed and compared deterministic models using XGBoost and AutoGluon, and probabilistic models using PyMC3.

We built models using race data from 2014 to 2019, and tested against race data from 2020. We excluded data from before 2014 because there were significant car development and regulation changes over the years. We removed races in which either the practice or qualifying session was wet because ∆t for those sessions were considered outliers.

Managed model training with Amazon SageMaker

We trained our regression models on Amazon SageMaker.

Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy ML models quickly. Specifically for model training, it provides many features to assist with the process.

For our use case, we explored multiple iterations on the choices of model feature sets and hyperparameters. Recording and comparing the model metrics of interest was critical to choosing the most suitable model. The Amazon SageMaker API allowed customized metrics definition prior to launching a model training job, and easy retrieval after the training job was complete. Using the automatic model tuning feature reduced the mean squared error (MSE) metric on the test data by 45% compared to the default hyperparameter choice.

We trained an XGBoost model using Amazon SageMaker’s built-in implementation. The built-in implementation allowed us to run model training through a general estimator interface. This approach provided better logging, superior hyperparameter validation, and a larger set of metrics than the original implementation.
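The following is a condensed sketch of that setup using the SageMaker Python SDK. The bucket paths, role ARN, hyperparameters, and tuning ranges are placeholders, not the values used to build this Insight.

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

session = sagemaker.Session()
role = 'arn:aws:iam::111122223333:role/SageMakerExecutionRole'  # placeholder execution role

# Built-in XGBoost container for the current Region
container = image_uris.retrieve('xgboost', session.boto_region_name, version='1.0-1')

xgb = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    output_path='s3://your-bucket/qualifying-pace/output',  # placeholder bucket
    sagemaker_session=session,
)
xgb.set_hyperparameters(objective='reg:squarederror', num_round=200)

# Automatic model tuning: minimize the validation error emitted by the built-in algorithm
tuner = HyperparameterTuner(
    estimator=xgb,
    objective_metric_name='validation:rmse',
    objective_type='Minimize',
    hyperparameter_ranges={
        'eta': ContinuousParameter(0.01, 0.3),
        'max_depth': IntegerParameter(2, 8),
        'subsample': ContinuousParameter(0.5, 1.0),
    },
    max_jobs=20,
    max_parallel_jobs=2,
)

tuner.fit({
    'train': TrainingInput('s3://your-bucket/qualifying-pace/train.csv', content_type='text/csv'),
    'validation': TrainingInput('s3://your-bucket/qualifying-pace/validation.csv', content_type='text/csv'),
})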

Rule-based model

In the rule-based approach, we reason that the differences of lap times ∆t primarily come from systematic variations of tire grip for each circuit and fuel level for each team between practice and qualifying sessions. After accounting for these known variations, we assume residuals are random small numbers with a mean of zero. ∆t can be modeled with the following equation:
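A form consistent with the description that follows is (in LaTeX notation):

\Delta t = \Delta t_f(c)\, f(t, c) + \Delta t_g(c)\, g(c) + \epsilon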

∆t_f(c) and ∆t_g(c) are the known sensitivities of lap time to fuel mass and tire grip, and ϵ is the residual. A hierarchy exists among the factors contained in the equation. We assume grip variations for each circuit (g(c)) are at the top level. Under each circuit, there are variations of fuel level across teams (f(t,c)).

To further simplify the model, we neglect ϵ because we assume it is small. We further assume the fuel variation for each team is the same across all circuits (that is, f(t,c) = f(t)). We can simplify the model to the following:
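Under these assumptions, the model reduces to roughly:

\Delta t = \Delta t_f(c)\, f(t) + \Delta t_g(c)\, g(c)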

Because ∆t_f(c) and ∆t_g(c) are known, we can estimate f(t) and g(c), the team fuel variations and circuit tire grip variations, from the data.

The differences in the sensitivities depend on the characteristics of the circuits. From the following track maps, we can observe that the Italian GP circuit has fewer corners and longer straight sections compared to the Singapore GP circuit. Additional tire grip therefore gives a larger advantage on the Singapore GP circuit.

 

ML regression model

For the ML regression method, we don’t directly model the relation between ∆t and the fuel level and grip variations. Instead, we fit the following regression model with just the circuit, team, and driver indicator variables:
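A regression of roughly this form matches that description, with coefficients β learned from the data:

\Delta t = \sum_{c} \beta_c I_c + \sum_{t} \beta_t I_t + \sum_{d} \beta_d I_d + \epsilon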

I_c, I_t, and I_d represent the indicator variables for circuits, teams, and drivers.

Hierarchical Bayesian model

Another challenge with modeling the race pace was due to noisy measurements in lap times. The magnitude of random effect (ϵ) of ∆t could be non-negligible. Such randomness might come from drivers’ accidental drift from their normal practice at the turns or random variations of drivers’ efforts during practice sessions. With deterministic approaches, such random effect wasn’t appropriately captured. Ideally, we wanted a model that could quantify uncertainty about the predictions. Therefore, we explored Bayesian sampling methods.

With a hierarchical Bayesian model, we account for the hierarchical structure of the error sources. As with the rule-based model, we assume grip variations for each circuit (g(c)) are at the top level. The additional benefit of a hierarchical Bayesian model is that it incorporates individual-level variations when estimating group-level coefficients. It’s a middle ground between two extreme views of the data. One extreme is to pool data for every group (circuit and driver) without considering the intrinsic variations among groups. The other extreme is to train a regression model for each circuit or driver; with 21 circuits, this amounts to 21 regression models. With a hierarchical model, we have a single model that considers the variations simultaneously at the group and individual level.

We can mathematically describe the underlying statistical model for the hierarchical Bayesian approach as the following varying intercepts model:
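One plausible form, consistent with the variable definitions below (the exact covariate terms and priors are assumptions here), is:

\Delta t_i = \mu_{j[i]k[i]} + \beta_p\, w_{p,i} + \beta_q\, w_{q,i} + \beta_T\, \Delta T_i + \epsilon_i, \qquad \mu_{jk} \sim \mathcal{N}(\theta_k, \sigma_\mu^2), \qquad \theta_k \sim \mathcal{N}(\mu_0, \sigma_\theta^2)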

Here, i represents the index of each data observation, j represents the index of each driver, and k represents the index of each circuit. μ_jk represents the varying intercept for each driver under each circuit, and θ_k represents the varying intercept for each circuit. w_p and w_q represent the wetness level of the track during the practice and qualifying sessions, and ∆T represents the track temperature difference.
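The following PyMC3 sketch shows how a varying-intercepts model of this shape can be specified. The data arrays are synthetic placeholders and the priors are generic choices, not the ones used for this Insight.

import numpy as np
import pymc3 as pm

# Synthetic placeholder data with the same shape as the inputs described above
rng = np.random.default_rng(0)
n_obs, n_drivers, n_circuits = 200, 20, 21
driver_idx = rng.integers(0, n_drivers, n_obs)
circuit_idx = rng.integers(0, n_circuits, n_obs)
w_p = rng.random(n_obs)            # practice-session wetness
w_q = rng.random(n_obs)            # qualifying-session wetness
d_temp = rng.normal(0, 5, n_obs)   # track temperature difference
y = rng.normal(-1.0, 0.5, n_obs)   # observed delta t (placeholder values)

with pm.Model() as model:
    # Circuit-level intercepts sit at the top of the hierarchy
    sigma_theta = pm.HalfNormal('sigma_theta', 1.0)
    theta = pm.Normal('theta', mu=0.0, sigma=sigma_theta, shape=n_circuits)

    # Driver-within-circuit intercepts, partially pooled toward their circuit's mean
    sigma_mu = pm.HalfNormal('sigma_mu', 1.0)
    mu = pm.Normal('mu', mu=theta, sigma=sigma_mu, shape=(n_drivers, n_circuits))

    # Coefficients for the wetness and temperature covariates
    beta = pm.Normal('beta', mu=0.0, sigma=1.0, shape=3)

    mean = (mu[driver_idx, circuit_idx]
            + beta[0] * w_p + beta[1] * w_q + beta[2] * d_temp)
    sigma = pm.HalfNormal('sigma', 1.0)
    pm.Normal('delta_t', mu=mean, sigma=sigma, observed=y)

    trace = pm.sample(1000, tune=1000, target_accept=0.9)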

Test models in the 2020 races

After predicting ∆t, we added it into the practice lap times to generate predictions of qualifying lap times. We determined the final ranking based on the predicted qualifying lap times. Finally, we compared predicted lap times and rankings with the actual results.

The following figure compares the predicted rankings and the actual rankings for all three practice sessions for the Austria, Hungary, and Great Britain GPs in 2020 (we exclude P2 for the Hungary GP because the session was wet).

For the Bayesian model, we generated predictions with an uncertainty range based on the posterior samples. This enabled us to predict the ranking of the drivers relatively with the median while accounting for unexpected outcomes in the drivers’ performances.

The following figure shows an example of predicted qualifying lap times (in seconds) with an uncertainty range for selected drivers at the Austria GP. If two drivers’ prediction profiles are very close (such as MAG and GIO), it’s not surprising that either driver might be the faster one in the upcoming qualifying session.

Metrics on model performance

To compare the models, we used mean squared error (MSE) and mean absolute error (MAE) for lap time errors. For ranking errors, we used rank discounted cumulative gain (RDCG). Because only the top 10 drivers gain points during a race, we used RDCG to apply more weight to errors in the higher rankings. For the Bayesian model output, we used median posterior value to generate the metrics.

The following table shows the resulting metrics of each modeling approach for the test P2 and P3 sessions. The best model by each metric for each session is highlighted.

| Model | MSE P2 | MSE P3 | MAE P2 | MAE P3 | RDCG P2 | RDCG P3 |
| --- | --- | --- | --- | --- | --- | --- |
| Practice raw | 2.822 | 1.053 | 1.544 | 0.949 | 0.92 | 0.95 |
| Rule-based | 0.349 | 0.186 | 0.462 | 0.346 | 0.88 | 0.95 |
| XGBoost | 0.358 | 0.141 | 0.472 | 0.297 | 0.91 | 0.95 |
| AutoGluon | 0.567 | 0.351 | 0.591 | 0.459 | 0.90 | 0.96 |
| Hierarchical Bayesian | 0.431 | 0.186 | 0.521 | 0.332 | 0.87 | 0.92 |

All models reduced the qualifying lap time prediction errors significantly compared to directly using the practice session results. Using practice lap times directly without considering pace correction, the MSE on the predicted qualifying lap time was up to 2.8 seconds. With machine learning methods which automatically learned pace variation patterns for teams and drivers on different circuits, we brought the MSE down to smaller than half a second. The resulting prediction was a more accurate representation of the pace in the qualifying session. In addition, the models improved the prediction of rankings by a small margin. However, there was no one single approach that outperformed all others. This observation highlighted the effect of random errors on the underlying data.

Summary

In this post, we described a new Insight developed by the Amazon ML Solutions Lab in collaboration with Formula 1 (F1).

This work is part of the six new F1 Insights powered by AWS that are being released in 2020, as F1 continues to use AWS for advanced data processing and ML modeling. Fans can expect to see this new Insight unveiled at the 2020 Turkish GP to provide predictions for the upcoming qualifying races at practice sessions.

If you’d like help accelerating the use of ML in your products and services, please contact the Amazon ML Solutions Lab.

 


About the Author

Guang Yang is a data scientist at the Amazon ML Solutions Lab where he works with customers across various verticals and applies creative problem solving to generate value for customers with state-of-the-art ML/AI solutions.

Read More