Hierarchical Forecasting using Amazon SageMaker

Time series forecasting is a common problem in machine learning (ML) and statistics. Some common day-to-day use cases of time series forecasting involve predicting product sales, item demand, component supply, and service tickets, all as a function of time. More often than not, time series data follows a hierarchical aggregation structure. For example, in retail, weekly sales for a Stock Keeping Unit (SKU) at a store can roll up to different geographical hierarchies at the city, state, or country level. In such cases, we must make sure that the sales estimates are in agreement when rolled up to a higher level. This is where hierarchical forecasting comes in: it is the process of generating coherent forecasts (or reconciling incoherent forecasts) so that individual time series can be forecast individually while still preserving the relationships within the hierarchy. Hierarchical time series often arise when various smaller geographies combine to form a larger one. For example, the following figure shows the case of a hierarchical structure in time series for store sales in the state of Texas. Individual store sales are depicted in the lowest level (level 2) of the tree, followed by sales aggregated on the city level (level 1), and finally all of the city sales aggregated on the state level (level 0).

In this post, we will first review the concept of hierarchical forecasting, including different reconciliation approaches. Then, we will take an example of demand forecasting on synthetic retail data to show you how to train and tune multiple hierarchical time series models across algorithms and hyperparameter combinations using the scikit-hts toolkit on Amazon SageMaker, which is the most comprehensive and fully managed ML service. Amazon SageMaker lets data scientists and developers quickly and easily build and train ML models, and then directly deploy them into a production-ready hosted environment.

The forecasts at all of the levels must be coherent. The forecast for Texas in the previous figure should break down accurately into forecasts for the cities, and the forecasts for the cities should in turn break down accurately into forecasts for the individual stores. There are various approaches to combining and disaggregating forecasts at different levels. The most common of these methods, as discussed in detail in Hyndman and Athanasopoulos, are as follows (a small numerical sketch follows the list):

  • Bottom-Up:

In this method, the forecasts are carried out at the bottom-most level of the hierarchy and then summed going up. For example, in the preceding figure, by using the bottom-up method, the time series for the individual stores (level 2) are used to build forecasting models. The outputs of the individual models are then summed to generate the forecasts for the cities. For example, forecasts for Store 1 and Store 2 are summed to get the forecast for Austin. Finally, forecasts for all of the cities are summed to generate the forecast for Texas.

  • Top-down:

In top-down approaches, the forecast is first generated for the top level (Texas in the preceding figure) and then disaggregated down the hierarchy. Disaggregate proportions are used in conjunction with the top level forecast to generate forecasts at the bottom level of the hierarchy. There are multiple methods to generate these disaggregate proportions, such as average historical proportions, proportions of the historical averages, and forecast proportions. These methods are briefly described in the following section. For a detailed discussion, please see Hyndman and Athanasopoulos.

    • Average historical proportions:

In this method, the bottom level series is generated by using the average of the historical proportions of the series at the bottom level (stores in the figure preceding), relative to the series at the top level (Texas in the preceding figure).

    • Proportions of the historical averages:

The average historical value of the series at the bottom level (stores in the preceding figure) relative to the average historical value of the series at the top level (Texas in the preceding figure) is used as the disaggregation proportion.

While both of the preceding top-down approaches are simple to implement and use, they are generally very accurate for the top level and are less accurate for lower levels. This is due to the loss of information and the inability to take advantage of characteristics of individual time series at lower levels. Furthermore, these methods also fail to account for how the historical proportions may change over time.

    • Forecast proportions:

In this method, instead of historical data, proportions based on forecasts are used for disaggregation. Forecasts are first generated for each individual series. These forecasts are not used directly, since they are not coherent at different levels of hierarchy. At each level, the proportions of these initial forecasts to that of the aggregate of all initial forecasts at the level are calculated. Then, these forecast proportions are used to disaggregate the top level forecast into individual forecasts at various levels. This method does not rely on outdated historical proportions and uses the current data to calculate the appropriate proportions. Due to this reason, forecast proportions often result in much more accurate forecasts as compared to the average historical proportions and proportions of the historical averages top-down approaches.

  • Middle-out:

In this method, forecasts are first generated for all of the series at a “middle level” (for example, Austin, Dallas, Houston, and San Antonio in the preceding figure). From these forecasts, the bottom-up approach is used to generate the aggregated forecasts for the levels above this middle level. For the levels below the middle level, a top-down approach is used.

  • Ordinary least squares (OLS):

In OLS reconciliation, base forecasts are first generated for every series in the hierarchy, and a least squares estimator is then used to compute the reconciliation weights that combine them into coherent forecasts.
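
To make these reconciliation ideas concrete, the following is a small numerical sketch (our own illustration, not scikit-hts code) on a toy version of the Texas hierarchy; the base forecasts and historical store shares are made-up numbers.

import numpy as np

# Toy hierarchy: Texas -> {Austin, Dallas}, Austin -> {Store1, Store2}, Dallas -> {Store3, Store4}.
# "Base" (incoherent) forecasts for every node, ordered:
# [Texas, Austin, Dallas, Store1, Store2, Store3, Store4]
base = np.array([100.0, 55.0, 48.0, 30.0, 22.0, 25.0, 20.0])

# Summing matrix S maps the bottom-level series to every node in the hierarchy.
S = np.array([
    [1, 1, 1, 1],   # Texas  = Store1 + Store2 + Store3 + Store4
    [1, 1, 0, 0],   # Austin = Store1 + Store2
    [0, 0, 1, 1],   # Dallas = Store3 + Store4
    [1, 0, 0, 0],   # Store1
    [0, 1, 0, 0],   # Store2
    [0, 0, 1, 0],   # Store3
    [0, 0, 0, 1],   # Store4
])

# Bottom-up: keep the store-level forecasts and sum them upward.
bottom_up = S @ base[3:]

# Top-down (average historical proportions): split the Texas forecast using
# historical store shares (assumed here to be 0.3, 0.2, 0.3, 0.2).
proportions = np.array([0.3, 0.2, 0.3, 0.2])
top_down = S @ (base[0] * proportions)

# OLS reconciliation: project the base forecasts onto the coherent subspace,
# reconciled = S (S'S)^-1 S' base.
ols = S @ np.linalg.solve(S.T @ S, S.T @ base)

print("bottom-up :", bottom_up)
print("top-down  :", top_down)
print("OLS       :", ols)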

Solution overview

In this post, we take the example of demand forecasting on synthetic retail data to fine tune multiple hierarchical time series models across algorithms and hyper-parameter combinations. We are using the scikit-hts toolkit on Amazon SageMaker, which is the most comprehensive and fully managed ML service. SageMaker lets data scientists and developers quickly and easily build and train ML models, and then directly deploy them into a production-ready hosted environment.

First, we will show you how to set up scikit-hts on SageMaker using the SKLearn estimator, train multiple models, and track and organize experiments using SageMaker Experiments. We will walk you through the following steps:

  1. Prerequisites
  2. Prepare Time Series Data
  3. Set up the scikit-hts training script
  4. Set up the SKLearn estimator
  5. Set up Amazon SageMaker Experiment and Trials
  6. Evaluate metrics and select a winning candidate
  7. Run time series forecasts
  8. Visualize the forecasts:
    • Visualization at Region Level
    • Visualization at State Level

Prerequisites

The following is needed to follow along with this post and run the associated code:

Prepare Time Series Data

For this post, we use synthetic retail clothing data. We perform feature engineering steps to clean the data, and then convert it into the hierarchical representation required by the scikit-hts package.

The retail clothing data consists of daily sales quantities for six item categories: men’s clothing, men’s shoes, women’s clothing, women’s shoes, kids’ clothing, and kids’ shoes. The date range for the data is 11/25/1997 through 7/28/2009. Each row of the data corresponds to the quantity of sales for an item category in a state (18 US states in total) for a specific date in the date range. Furthermore, the 18 states are also categorized into five US regions. The data is synthetically generated using repeatable patterns (for seasonality) with random noise added for each day.

First, let’s read the data into a Pandas DataFrame.

df_raw = pd.read_csv("retail-usa-clothing.csv",
                          parse_dates=True,
                          header=0,
                          names=['date', 'state',
                                   'item', 'quantity', 'region',
                                   'country']
                    )


Define the S3 bucket and folder locations to store the test and training data. This should be within the same region as SageMaker Studio.
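
The notebook defines the bucket, prefix, and S3 client before the upload; the following is a minimal sketch of those definitions, assuming the default SageMaker bucket and a hypothetical prefix value:

import boto3
import sagemaker

sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()    # S3 bucket in the same Region as SageMaker Studio
pref = "scikit-hts/hierarchical-forecast"      # hypothetical prefix used for this example
s3_client = boto3.client("s3")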

Now, let’s divide the raw data into train and test samples, and save them in their respective S3 folder locations using the Pandas DataFrame query function. We can check the first few entries of the train and test dataset. Both datasets should have the same fields, as in the following code:

df_train = df_raw.query(f'date <= "2009-04-29"').copy()
df_train.to_csv("train.csv")
s3_client.upload_file("train.csv", bucket, pref+"/train.csv")

df_test = df_raw.query(f'date > "2009-04-29"').copy()
df_test.to_csv("test.csv")
s3_client.upload_file("test.csv", bucket, pref+"/test.csv")

Convert data into Hierarchical Representation

scikit-hts requires that each column in our DataFrame be its own time series, for all hierarchy levels. To achieve this, we have created a dataset_prep.py script, which performs the following steps:

  1. Transform the dataset into a column-oriented one.
  2. Create the hierarchy representation as a dictionary.

For a complete description of how this is done under the hood, and for a sense of what the API accepts, see the scikit-hts’ docs.
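
The following is a simplified sketch of the kind of transformation dataset_prep.py performs (see the repo for the actual script, which also handles details such as cleaning state names); column names follow the raw CSV shown earlier:

import pandas as pd

def hierarchy_prep(df):
    """Pivot the long-format sales data into one column per hierarchy node
    and build the scikit-hts hierarchy dictionary (total -> regions -> region_states)."""
    df = df.copy()
    df["date"] = pd.to_datetime(df["date"])
    df["region_state"] = df["region"] + "_" + df["state"]

    # One time series per column: bottom level (region_state) plus region aggregates.
    bottom = df.pivot_table(index="date", columns="region_state", values="quantity", aggfunc="sum")
    region = df.pivot_table(index="date", columns="region", values="quantity", aggfunc="sum")
    product = pd.concat([bottom, region], axis=1)
    product["total"] = region.sum(axis=1)

    # Hierarchy dictionary: each key is a parent node, each value its list of children.
    hierarchy = {"total": sorted(df["region"].unique())}
    for reg in df["region"].unique():
        hierarchy[reg] = sorted(df.loc[df["region"] == reg, "region_state"].unique())
    return hierarchy, product

# Example usage with the training split loaded earlier:
# train_hierarchy, train_product_bottom_level = hierarchy_prep(df_train)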

Once we have created the hierarchy representation as a dictionary, we can visualize the data as a tree structure:

from hts.hierarchy import HierarchyTree
ht = HierarchyTree.from_nodes(nodes=train_hierarchy, df=train_product_bottom_level)
- total
   |- Mid-Alantic
   |  |- Mid-Alantic_NewJersey
   |  |- Mid-Alantic_NewYork
   |  - Mid-Alantic_Pennsylvania
   |- SouthCentral
   |  |- SouthCentral_Alabama
   |  |- SouthCentral_Kentucky
   |  |- SouthCentral_Mississippi
   |  - SouthCentral_Tennessee
   |- Pacific
   |  |- Pacific_Alaska
   |  |- Pacific_California
   |  |- Pacific_Hawaii
   |  - Pacific_Oregon
   |- EastNorthCentral
   |  |- EastNorthCentral_Illinois
   |  |- EastNorthCentral_Indiana
   |  - EastNorthCentral_Ohio
   - NewEngland
      |- NewEngland_Connecticut
      |- NewEngland_Maine
      |- NewEngland_RhodeIsland
      - NewEngland_Vermont

Set up the scikit-hts training script

We use a Python entry script to import the necessary SKLearn libraries, set up the scikit-hts estimators using the model packages for our algorithms of interest, and pass in our algorithm and hyper-parameter preferences from the SKLearn estimator that we set up in the notebook. In this post and associated code, we show the implementation and results for the bottom-up approach and the top-down approach with the average historical proportions division method. Note that the user can change these to select different hierarchical methods from the package. In addition, for the hyperparameters, we used additive and multiplicative seasonality with both the bottom-up and top-down approaches. The script uses the train and test data files that we uploaded to Amazon S3 to create the corresponding SKLearn datasets for training and evaluation. When training is complete, the script runs an evaluation to generate metrics, which we use to choose a winning model. For further analysis, the metrics are also available via the SageMaker trial component analytics (discussed later in this post). Then, the model is serialized for storage and future retrieval.

For more details, refer to the entry script “train.py” that is available in the GitHub repo. From the accompanying notebook, you can also run the cell in Step 3 to review the script. The following code shows the train function calling HTSRegressor with the Prophet algorithm along with the hierarchical method and seasonality mode:

def train(bucket, seasonality_mode, algo, daily_seasonality, changepoint_prior_scale, revision_method):
    print('**************** Training Script ***********************')
    # create train dataset
    df = pd.read_csv(filepath_or_buffer=os.environ['SM_CHANNEL_TRAIN'] + "/train.csv", header=0, index_col=0)
    hierarchy, data, region_states = prepare_data(df)
    regions = df["region"].unique().tolist()
    # create test dataset
    df_test = pd.read_csv(filepath_or_buffer=os.environ['SM_CHANNEL_TEST'] + "/test.csv", header=0, index_col=0)
    test_hierarchy, test_df, region_states = prepare_data(df_test)
    print("************** Create Root Edges *********************")
    print(hierarchy)
    print('*************** Data Type for Hierarchy *************', type(hierarchy))
    # determine estimators##################################
    if algo == "Prophet":
        print('************** Started Training Prophet Model ****************')
        estimator = HTSRegressor(model='prophet', 
                                 revision_method=revision_method, 
                                 n_jobs=4, 
                                 daily_seasonality=daily_seasonality, 
                                 changepoint_prior_scale = changepoint_prior_scale,
                                 seasonality_mode=seasonality_mode,
                                )
        # train the model
        print("************** Calling fit method ************************")
        model = estimator.fit(data, hierarchy)
        print("Prophet training is complete SUCCESS")
        
        # evaluate the model on test data
        evaluate(model, test_df, regions, region_states)
    
    ###################################################
 
    mainpref = "scikit-hts/models/"
    prefix = mainpref + "/"
    print('************************ Saving Model *************************')
    joblib.dump(estimator, os.path.join(os.environ['SM_MODEL_DIR'], "model.joblib"))
    print('************************ Model Saved Successfully *************************')

    return model

Set up Amazon SageMaker Experiment and Trials

SageMaker Experiments automatically tracks the inputs, parameters, configurations, and results of your iterations as trials. You can assign, group, and organize these trials into experiments. SageMaker Experiments is integrated with SageMaker Studio. This provides a visual interface to browse your active and past experiments, compare trials on key performance metrics, and identify the best-performing models. SageMaker Experiments comes with its own Experiments SDK, which makes the analytics capabilities easily accessible in SageMaker notebooks. Because SageMaker Experiments enables tracking of all the steps and artifacts that go into creating a model, you can quickly revisit the origins of a model when you’re troubleshooting issues in production or auditing your models for compliance verifications. You can create your experiment with the following code:

from datetime import datetime
from smexperiments.experiment import Experiment

#name of experiment
timestep = datetime.now()
timestep = timestep.strftime("%d-%m-%Y-%H-%M-%S")
experiment_name = "hierarchical-forecast-models-" + timestep

#create experiment
Experiment.create(
    experiment_name=experiment_name,
    description="Hierarchical Timeseries models",
    sagemaker_boto_client=sagemaker_boto_client)

For each job, we define a new Trial component within that experiment:

from smexperiments.trial import Trial
trial = Trial.create(
    experiment_name=experiment_name,
    sagemaker_boto_client=sagemaker_boto_client
)
print(trial)

Next, we define an experiment config, which is a dictionary that we pass into the fit() method of the SKLearn estimator later on. This makes sure that the training job is associated with that experiment and trial. For the full code block for this step, refer to the accompanying notebook. In the notebook, we use the bottom-up and top-down (with average historical proportions) approaches, along with additive and multiplicative seasonality as the seasonality hyperparameter values. This lets us train four different models. The code can be modified easily to use the rest of the hierarchical forecasting approaches discussed in the previous sections, since they are also implemented in the scikit-hts package.

Set up the SKLearn estimator

You can run SKLearn training scripts on SageMaker’s fully managed training environment by creating an SKLearn estimator. Let’s set up the actual training runs with a combination of parameters and encapsulate the training jobs within SageMaker experiments.

We will use scikit-hts to fit the FBProphet model to our data and compare the results.

  • FBProphet
    • daily_seasonality: By default, daily seasonality is set to False, so we explicitly set it to True.
    • changepoint_prior_scale: If the trend changes are being overfit (too much flexibility) or underfit (not enough flexibility), you can adjust the strength of the sparse prior using the input argument changepoint_prior_scale. By default, this parameter is set to 0.05. Increasing it will make the trend more flexible.
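
The estimator below also references a metric_definitions list, which tells SageMaker which regular expressions to use to scrape metrics from the training logs. The following is a minimal sketch that assumes the training script prints a line such as total mse: <value>; check train.py in the repo for the exact log format:

# Hypothetical metric definition; the regex must match what train.py actually prints.
metric_definitions = [
    {"Name": "total_mse", "Regex": "total mse: ([0-9\\.]+)"}
]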

See the following code:

import sagemaker
from sagemaker.sklearn import SKLearn

for idx, row in df_hps_combo.iterrows():
    trial = Trial.create(
        experiment_name=experiment_name,
        sagemaker_boto_client=sagemaker_boto_client
    )

    experiment_config = { "ExperimentName": experiment_name, 
                      "TrialName":  trial.trial_name,
                      "TrialComponentDisplayName": "Training"}
    
    sklearn_estimator = SKLearn('train.py',
                                source_dir='code',
                                instance_type='ml.m4.xlarge',
                                framework_version='0.23-1',
                                role=sagemaker.get_execution_role(),
                                debugger_hook_config=False,
                                hyperparameters = {'bucket': bucket,
                                                   'algo': "Prophet", 
                                                   'daily_seasonality': True,
                                                   'changepoint_prior_scale': 0.5,
                                                   'seasonality_mode': row['seasonality_mode'],
                                                   'revision_method' : row['revision_method']
                                                  },
                                metric_definitions = metric_definitions,
                               )

After specifying our estimator with all of the necessary hyperparameters, we can train it using our training dataset. We train it by invoking the fit() method of the SKLearn estimator. We pass the location of the train and test data, as well as the experiment configuration. The training algorithm returns a fitted model that we can use to construct forecasts. See the following code:

sklearn_estimator.fit({'train': s3_train_channel, "test": s3_test_channel},
                     experiment_config=experiment_config, wait=False)

We start four training jobs in this case corresponding to the combinations of two hierarchical forecasting methods and two seasonality modes. These jobs are run in parallel using SageMaker training. The average runtime for these training jobs in this example was approximately 450 seconds on ml.m4.xlarge instances. You can review the job parameters and metrics from the trial component view in SageMaker Studio (see the following screenshot):

Evaluate metrics and select a winning candidate

Amazon SageMaker Studio provides an experiments browser that you can use to view the lists of experiments, trials, and trial components. You can choose one of these entities to view detailed information about the entity, or choose multiple entities for comparison. For more details, refer to the documentation. Once the training jobs are running, we can use the experiment view in Studio (see the following screenshot) or the ExperimentAnalytics module to track the status of our training jobs and their metrics.

In the training script, we used SKLearn Metrics to calculate the mean squared error (MSE) and stored it in the experiment. We can access the recorded metrics via the ExperimentAnalytics module and convert them to a Pandas DataFrame. The training job with the lowest MSE is the winner.

from sagemaker.analytics import ExperimentAnalytics

trial_component_analytics = ExperimentAnalytics(experiment_name=experiment_name)
tc_df = trial_component_analytics.dataframe()

total_mse, model_url = [], []
for name in tc_df['sagemaker_job_name']:
    # Strip the surrounding quotes from the stored job name before describing the job
    description = sagemaker_boto_client.describe_training_job(TrainingJobName=name[1:-1])
    total_mse.append(description['FinalMetricDataList'][0]['Value'])
    model_url.append(description['ModelArtifacts']['S3ModelArtifacts'])
tc_df['total_mse'] = total_mse
new_df = tc_df[['sagemaker_job_name', 'algo', 'changepoint_prior_scale', 'revision_method', 'total_mse', 'seasonality_mode']]
mse_min = new_df['total_mse'].min()
df_winner = new_df[new_df['total_mse'] == mse_min]

Let’s select the winner model and download it for running forecasts:

for name in df_winner['sagemaker_job_name']:
    model_dir = sagemaker_boto_client.describe_training_job(TrainingJobName = name[1:-1])['ModelArtifacts']['S3ModelArtifacts']
key = model_dir.split('s3://{}/'.format(bucket))
s3_client.download_file(bucket, key[1], 'model.tar.gz')

Run time series forecasts

Now, let’s extract the downloaded model archive, load the model, and make forecasts 90 days into the future:

import tarfile
import joblib

# The training job packages the serialized model.joblib file inside model.tar.gz
with tarfile.open('model.tar.gz') as tar:
    tar.extractall()

def model_fn(model_dir):
    clf = joblib.load(model_dir)
    return clf

model = model_fn('model.joblib')
predictions = model.predict(steps_ahead=90)

Visualize the forecasts

Let’s visualize the model results and fitted values for all of the states:

import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objects as go
def plot_results(cols, axes, preds):
    axes = np.hstack(axes)
    for ax, col in zip(axes, cols):
        preds[col].plot(ax=ax, label="Predicted")
        train_product_bottom_level[col].plot(ax=ax, label="Observed")
        ax.legend()
        ax.set_title(col)
        ax.set_xlabel("Date")
        ax.set_ylabel("Quantity")  

Visualization at Region Level

Visualization at State Level

The following screenshot is for some of the states. For a full list of state visualizations, execute the visualization section of the notebook.

Clean up

Make sure to shut down the Studio notebook. You can reach the Running Terminals and Kernels pane by choosing its icon on the left side of Amazon SageMaker Studio. The Running Terminals and Kernels pane consists of four sections. Each section lists all of the resources of that type. You can shut down each resource individually or shut down all of the resources in a section at the same time.

Conclusion

Hierarchical forecasting is important where time series data can be grouped or aggregated at various levels in a hierarchical fashion. For accurate forecasting/prediction at various levels of hierarchy, methods that generate coherent forecasts at these different levels are needed. In this post, we demonstrated how we can leverage Amazon SageMaker’s training capabilities to carry out hierarchical forecasting. We used synthetic retail data and showed how to train hierarchical forecasting models using the scikit-hts package. We used the FBProphet model along with bottom-up and top-down (average historical proportions) hierarchical aggregation and disaggregation methods (see code). Furthermore, we used SageMaker Experiments to train multiple models and picked the best model out of the four trained models. While we only demonstrated this approach on a synthetic retail dataset, the code provided can easily be used with any time-series dataset that exhibits a similar hierarchical structure.

References

Hyndman, R.J., and Athanasopoulos, G. Forecasting: Principles and Practice. OTexts.

About the Authors

Mani Khanuja is an Artificial Intelligence and Machine Learning Specialist SA at Amazon Web Services (AWS). She helps customers use machine learning to solve their business challenges with AWS. She spends most of her time diving deep and teaching customers on AI/ML projects related to computer vision, natural language processing, forecasting, ML at the edge, and more. She is passionate about ML at the edge. She has created her own lab with a self-driving kit and prototype manufacturing production line, where she spends a lot of her free time.

Farooq Sabir is a Senior Artificial Intelligence and Machine Learning Specialist Solutions Architect at AWS. He holds PhD and MS degrees in Electrical Engineering from The University of Texas at Austin and a MS in Computer Science from Georgia Institute of Technology. He has over 15 years of work experience and also likes to teach and mentor college students. At AWS, he helps customers formulate and solve their business problems in data science, machine learning, computer vision, artificial intelligence, numerical optimization and related domains. Based in Dallas, Texas, he and his family love to travel and make long road trips.

Neha Gupta is a Solutions Architect at AWS and has 16 years of experience as a Database architect/ DBA. Apart from work, she’s outdoorsy and loves to dance.


Live transcriptions of F1 races using Amazon Transcribe

The Formula 1 (F1) live streaming service, F1 TV, has live automated closed captions in three different languages: English, Spanish, and French.

For the 2021 season, FORMULA 1 has achieved another technological breakthrough, building a fully automated workflow to create closed captions in three languages and broadcasting to 85 territories using Amazon Transcribe. Amazon Transcribe is an automatic speech recognition (ASR) service that allows you to generate audio transcription.

In this post, we share how Formula 1 joined forces with the AWS Professional Services team to make it happen. We discuss how they used Amazon Transcribe and its custom vocabulary feature as well as custom-built postprocessing logic to improve their live transcription accuracy in three languages.

The challenge

For F1, everything is about extreme speed: with pit stops as short as 2 seconds, speeds of up to 375 KPH (233 MPH), and 5g forces on drivers under braking and through corners. In this fast-paced and dynamic environment, milliseconds dictate the difference between pole position or second on the grid. The role of the race commentators is to weave the multitude of parallel events and information into a single exciting narrative. This form of commentary greatly increases the engagement and excitement of viewers.

F1 has a strong affinity for cutting-edge technology, and partnered with AWS to build a scalable and sustainable closed caption solution for F1 TV, their Over-the-top (OTT) platform, that can support a growing calendar and language portfolio. F1 now provides real-time live captions in three languages across four series: F1 in British English, US Spanish, and French; and F2, F3, and Porsche Supercup in British English and US Spanish. This was achieved using Amazon Transcribe to automatically convert the commentary into subtitles.

This task presents many unique challenges. With the excitement of an F1 race, it’s common to have commentators with differing accents move quickly from one topic to another as the race unfolds. Because the sport is steeped in technology, commentators often refer to F1 domain-specific terminology such as DRS (Drag Reduction System), aerodynamics, downforce, or halo (a safety device). Moreover, F1 is a global sport, traveling across the world and drawing drivers from many different countries. Looking only at the 2021 season, 16/20 drivers had non-English names and 17/20 had non-Spanish or non-French names. With the advanced customization features available in Amazon Transcribe, we tailored the underlying language models to recognize domain-specific terms that are rare in general language use, which boosted transcription accuracy.

In the following sections, we take a deep dive into how AWS Professional Services partnered with F1 to build a robust, state-of-the-art, real-time race commentary captioning system by enhancing Amazon Transcribe to understand the particularities of the F1 world. You will learn how to utilize Amazon Transcribe in real-time broadcasts and supercharge live captioning for your use case with custom vocabularies, postprocessing steps, and a human-in-the-loop validation layer.

Solution overview

The solution works as a proxy to Amazon Transcribe. Custom vocabularies are passed as parameters to Amazon Transcribe, and the resulting captions are postprocessed. The postprocessed text is then moderated by an F1 moderator before being transformed to captions that are displayed to the viewers. The following diagram shows the sequential process.

Live transcriptions: Understanding use case specific terminology and context

The output of Automatic Speech Recognition (ASR) systems is highly context-dependent. ASR language models benefit from utilizing the words across a fully spoken sentence. For example, in the following sentence, the system uses the words ‘WORLD CHAMPIONSHIP’ towards the end of the sentence to inform context and allow ‘FORMER ONE’ to be correctly transcribed as ‘FORMULA 1’.

GOOD AFTERNOON EVERYBODY WELCOME ALONG TO ROUND 4 OF THE FORMER ONE

 

GOOD AFTERNOON EVERYBODY WELCOME ALONG TO ROUND 4 OF THE FORMULA 1 WORLD CHAMPIONSHIP IN 2019

Amazon Transcribe supports both batch and streaming transcription models. In batch transcription, the model issues a transcription using the full context provided in the audio segment. Amazon Transcribe streaming transcription enables you to send an audio stream and receive a transcription stream in real time. Generating subtitles for a live broadcast requires a streaming model because transcriptions should appear on screen shortly after the commentary is spoken. This real-time need presents unique challenges compared to batch transcriptions and often affects the quality of the results because the language model has limited knowledge of the future context.

Amazon Transcribe is pre-trained to capture a wide range of use cases. However, F1 domain-specific terminology, names, and locations aren’t present in the Amazon Transcribe general language model. Getting those words correct is nevertheless crucial for the understanding of the narrative, such as who is leading the race, circuit corners, and technical details.

Amazon Transcribe allows you to develop with custom vocabularies and custom language models to improve transcription accuracy. You can use them separately for streaming transcriptions or together for batch transcriptions.

Custom vocabularies consist of a list of specific words that you want Amazon Transcribe to recognize in the audio input. These are generally domain-specific words and phrases, such as proper nouns. You can inform Amazon Transcribe how to pronounce these terms with information such as SoundsLike (in regular orthography) or the IPA (International Phonetic Alphabet) description of the term. Custom vocabularies are available for all languages supported by Amazon Transcribe. Custom vocabularies improve the ability of Amazon Transcribe to recognize terms without using the context in which they’re spoken.

The following table shows some examples of a custom vocabulary.

| Phrase          | DisplayAs       | SoundsLike         | IPA                 |
|-----------------|-----------------|--------------------|---------------------|
| Charles-Leclerc | Charles Leclerc |                    | ʃ ɑ ɹ l l ə k l ɛ ɹ |
| Charles-Leclerc | Charles Leclerc | shal-luh-klurk     |                     |
| Lewis-Hamilton  | Lewis Hamilton  | loo-is-ha-muhl-tn  |                     |
| Lewis-Hamilton  | Lewis Hamilton  | loo-uhs-ha-muhl-tn |                     |
| Ferrari         | Ferrari         |                    | f ɝ ɹ ɑ ɹ ɪ         |
| Ferrari         | Ferrari         | fuh-rehr-ee        |                     |
| Mercedes        | Mercedes        | mer-sey-deez       |                     |
| Mercedes        | Mercedes        |                    | m ɛ ɹ s eɪ d i z    |

The custom vocabulary includes the following details:

  • Phrase – The term that should be recognized.
  • DisplayAs – How the word or phrase looks when it’s output. If not declared, the output would be the phrase.
  • SoundsLike – The term broken into small pieces with the respective pronunciations in the specified language using standard orthography.
  • IPA – The International Phonetic Alphabet representation for the term.
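
To illustrate how such a table is used (a sketch with hypothetical resource names, not F1’s production setup), you can create the custom vocabulary with the Amazon Transcribe API and then reference it by name when starting a transcription:

import boto3

transcribe = boto3.client("transcribe")

# Create a custom vocabulary from a Phrase/DisplayAs/SoundsLike/IPA table that has
# been uploaded to S3; the bucket and key names here are hypothetical.
transcribe.create_vocabulary(
    VocabularyName="f1-vocabulary-en",
    LanguageCode="en-GB",
    VocabularyFileUri="s3://my-bucket/vocabularies/f1-vocabulary-en.txt",
)

# Reference the vocabulary when transcribing; for batch jobs this is done through the
# Settings parameter (streaming uses the equivalent vocabulary-name parameter).
transcribe.start_transcription_job(
    TranscriptionJobName="f1-commentary-sample",
    Media={"MediaFileUri": "s3://my-bucket/audio/commentary-sample.wav"},
    MediaFormat="wav",
    LanguageCode="en-GB",
    Settings={"VocabularyName": "f1-vocabulary-en"},
)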

Custom language models are valuable when there are larger corpora of text data that can be used to train models. With the additional data, the models learn to predict the probabilities of sequences of words in the domain-specific context. For this project, F1 chose to use custom vocabularies given the specialized words and phrases that are unique to F1 racing.

Postprocessing: the final layer of performance boosting

Due to the fast-paced nature of F1 commentary with rapidly changing context as well as commentator accents, inaccurate transcriptions may still occur. However, recurring mistakes can be easily fixed using text replacement. For example, “Kvyat and Albon” may be misunderstood as “create an album” by the British English language model. Because “create an album” is an unlikely term to occur in F1 commentaries, we can safely replace them with their assumed real meanings in a postprocessing routine. On top of that, postprocessing terms can be defined as general, or based on location and race series filters. Such selection allows for more specific term replacement, reducing the chance of erroneous replacements with this approach.

For this project, we gathered thousands of replacements for each language using hours of real-life F1 audio commentary that was analyzed by F1 domain specialists. On top of that, during every live event, F1 runs a transcribed commentary through a human-in-the-loop tool (described in the next section), which allows sentence rejection before the subtitles appear on screen. This data is used later to continuously improve the custom vocabulary and postprocessing rules. The following table shows examples of postprocessing rules for English captions. The location filter is a replacement filter based on race location, and the race series filter is based on the race series.

| Original Term         | Replacement          | Location Filter | Race Series Filter |
|-----------------------|----------------------|-----------------|--------------------|
| CHARLOTTE CLAIRE      | CHARLES LECLERC      |                 | FORMULA 1          |
| CREATE AN ALBUM       | KVYAT AND ALBON      |                 | FORMULA 1          |
| SCHWARTZMAN           | SHWARTZMAN           |                 | FORMULA 2          |
| CURVE A PARABOLIC     | CURVA PARABOLICA     | Italy           |                    |
| CIRCUIT THE CATALONIA | CIRCUIT DE CATALUNYA | Spain           |                    |
| TYPE COMPOUNDS        | TYRE COMPOUNDS       |                 |                    |
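
The following is a simplified sketch of this kind of rule-based postprocessing (our own illustration, not F1’s production code); it applies location- and series-filtered replacements and the lap-time digit formatting described in the next paragraphs:

import re

# Each rule: (original term, replacement, location filter or None, race series filter or None).
RULES = [
    ("CREATE AN ALBUM", "KVYAT AND ALBON", None, "FORMULA 1"),
    ("CURVE A PARABOLIC", "CURVA PARABOLICA", "Italy", None),
    ("TYPE COMPOUNDS", "TYRE COMPOUNDS", None, None),
]

def postprocess(text, location, race_series):
    # Apply only the rules whose filters match the current race.
    for original, replacement, loc, series in RULES:
        if loc not in (None, location) or series not in (None, race_series):
            continue
        text = text.replace(original, replacement)
    # Format spelled-out lap times such as "ONE 15 632" -> "1:15.632".
    words_to_digits = {"ONE": "1", "TWO": "2", "THREE": "3"}  # abbreviated for the sketch
    for word, digit in words_to_digits.items():
        text = re.sub(rf"\b{word}\b(?= \d{{2}} \d{{3}}\b)", digit, text)
    text = re.sub(r"\b(\d) (\d{2}) (\d{3})\b", r"\1:\2.\3", text)
    return text

print(postprocess("THE GERMAN DRIVER COMPLETED THE LAST LAP IN ONE 15 632",
                  location="Italy", race_series="FORMULA 1"))
# -> THE GERMAN DRIVER COMPLETED THE LAST LAP IN 1:15.632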

Another important function of postprocessing is the standardization and formatting of numbers. When generating transcriptions for live broadcasts such as television, it’s a best practice to use digits when displaying numbers because they’re faster to read and occupy less space on screen. In English, Amazon Transcribe automatically displays numbers bigger than 10 as digits, and numbers from 0 to 10 are converted to digits under specific conditions, such as when there is more than one in a row. For example, “three four five” converts to 345. To standardize number transcriptions, we convert all numbers to digits.

As of August 8, 2021, transcriptions only output numbers as digits instead of words for a defined list of languages in both batch and streaming (for more information, see Transcribing numbers and punctuation). Notably, this list doesn’t include Spanish (es-US and es-ES) or French (fr-FR and fr-CA). With the postprocessing routine, numbers were also formatted to handle integers, decimals, and ordinals, as well F1-specific lap time formatting.

The following shows an example of number postprocessing for different languages that were built for F1.

Human in the loop: Continuous improvement and adaptation

Amazon Transcribe custom vocabularies and postprocessing boost the service’s real-time performance significantly. However, the fast-paced and quickly changing environment remains a challenge for automated transcriptions. It’s better for a person reliant on closed captions to miss out on a phase of commentary, rather than see an incorrect transcription that may be misleading. To this end, F1 employs a human in the loop as a final validation, where a moderator has a number of seconds to verify if a word or an entire sentence should be removed before it’s included in the video stream. Any removed sentences are then used to improve the custom vocabularies and postprocessing step for the next races.

Evaluation

Minor grammatical errors don’t greatly decrease the understandability of a sentence. However, using the wrong F1 terminology breaks a sentence. Usually ASR systems are evaluated on word error rate (WER), which quantifies how many insertions, deletions, and substitutions are required to change the predicted sentence to the correct one.

Although WER is important, F1-specific terms are even more crucial. For this, we created an accuracy score that measures the accuracy of people names (such as Charles Leclerc), teams (McLaren), locations (Hungaroring), and other F1 terms (DRS) transcribed in a commentary. These scores allow us to evaluate how understandable the transcriptions are to F1 fans and, combined with WER, allow us to maintain high-quality transcriptions and improvements in Amazon Transcribe.
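
As an illustration (our own sketch, not the exact F1 evaluation code), WER can be computed with an open-source package such as jiwer, and a simple term-level accuracy can be derived by checking which reference F1 terms appear in the hypothesis:

import jiwer

reference = "SEBASTIAN VETTEL COMPLETED THE LAST LAP IN 1:15.632"
hypothesis = "THE BASTION BETTER COMPLETED THE LAST LAP IN ONE 15 632"

# Word error rate between the reference and the predicted transcription.
wer = jiwer.wer(reference, hypothesis)

# Simple F1-term accuracy: fraction of reference terms that also appear in the hypothesis.
f1_terms = ["SEBASTIAN VETTEL", "DRS", "MCLAREN"]
in_reference = [term for term in f1_terms if term in reference]
present = [term for term in in_reference if term in hypothesis]
term_accuracy = len(present) / len(in_reference) if in_reference else None

print(f"WER: {wer:.2%}, term accuracy: {term_accuracy}")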

Results

The F1 TV enhanced live transcriptions system was released on March 26, 2021, during the Formula 1 Gulf Air Bahrain Grand Prix. By the first race, the solution had already achieved a strong reduction in WER and F1-specific accuracy improvements for all three languages, compared to the Amazon Transcribe standard model. In the following tables, we highlight the WER and F1-specific accuracy improvements for the different languages. The numbers compare the developed solution (Amazon Transcribe with custom vocabularies and postprocessing) against the generic Amazon Transcribe model. The lower the WER, the better.

| Language | Standard Amazon Transcribe WER | Amazon Transcribe with CV and Postprocessing WER | WER Improvement |
|----------|--------------------------------|--------------------------------------------------|-----------------|
| English  | 18.95%                         | 11.37%                                           | 39.99%          |
| Spanish  | 25.95%                         | 16.21%                                           | 37.16%          |
| French   | 37.40%                         | 16.80%                                           | 55.08%          |

| Language | Accuracy Group | Standard Amazon Transcribe Accuracy | Amazon Transcribe with CV and Postprocessing Accuracy | Accuracy Improvement |
|----------|----------------|-------------------------------------|--------------------------------------------------------|----------------------|
| English  | People Names   | 40.17%                              | 92.25%                                                 | 129.68%              |
| English  | Teams          | 56.33%                              | 95.28%                                                 | 69.15%               |
| English  | Locations      | 61.82%                              | 94.33%                                                 | 52.59%               |
| English  | Other F1 terms | 81.47%                              | 90.89%                                                 | 11.55%               |
| Spanish  | People Names   | 45.31%                              | 95.43%                                                 | 110.62%              |
| Spanish  | Teams          | 39.40%                              | 95.46%                                                 | 142.28%              |
| Spanish  | Locations      | 58.32%                              | 87.58%                                                 | 50.17%               |
| Spanish  | Other F1 terms | 63.87%                              | 85.25%                                                 | 33.47%               |
| French   | People Names   | 39.12%                              | 92.38%                                                 | 136.15%              |
| French   | Teams          | 33.20%                              | 90.84%                                                 | 173.61%              |
| French   | Locations      | 55.34%                              | 89.33%                                                 | 61.42%               |
| French   | Other F1 terms | 61.15%                              | 86.77%                                                 | 41.90%               |

Although the approach significantly improves the WER measures, its main influence is seen on F1 names, teams, and locations. Because the F1-specific terms are often in local languages, custom vocabularies and postprocessing steps can quickly teach Amazon Transcribe to consider those terms and correctly transcribe them. The postprocessing step then further adapts the output transcriptions to F1’s domain to provide highly accurate automated transcriptions. In the following examples, we present phrases in English, Spanish, and French where Amazon Transcribe custom vocabularies, postprocessing, and number handling techniques successfully improved the transcription accuracy.

For Spanish, we have the original Amazon Transcribe output “EL PILOTO BRITÁNICO LORIS JAMIL TODOS ESTÁ A DOS SEGUNDOS PUNTO TRES DEL LIDER. COMPLETÓ SU ÚLTIMA VUELTA EN UNO VEINTINUEVE DOSCIENTOS TREINTA Y CUATRO” compared to the final transcription “EL PILOTO BRITÁNICO LEWIS HAMILTON ESTÁ A 2.3 s DEL LIDER. COMPLETÓ SU ÚLTIMA VUELTA EN 1:29.234.”

The custom vocabulary and postprocessing combination converted “LORIS JAMIL TODOS” to “LEWIS HAMILTON,” and the number handling routine converted the lap time to digits and added the appropriate punctuation (1:29.234).

For English, compare the original output “THE GERMAN DRIVER THE BASTION BETTER COMPLETED THE LAST LAP IN ONE 15 632” to the final transcription “THE GERMAN DRIVER SEBASTIAN VETTEL COMPLETED THE LAST LAP IN 1:15.632.”

The custom vocabulary and postprocessing combination converted “THE BASTION BETTER” to “SEBASTIAN VETTEL.”

In French, we can compare the original output “VICTOIRE POUR LES MISS MILLE TONNE DIX-HUIT POLE CENT TROIS PODIUM QUATRE VICTOIRES ICI” to the final output “VICTOIRE POUR LEWIS HAMILTON 18 POLE 103 PODIUM 4 VICTOIRES ICI.”

The custom vocabulary and postprocessing combination converted “LES MISS MILLE TONNE” to “LEWIS HAMILTON,” and the number handling routine converted the numbers to digits.

The following short video shows live captions in action during the Formula 1 Gulf Air Bahrain Grand Prix 2021.

Summary

In this post, we explained how F1 is now able to provide live closed captions on their OTT (Over-The-Top) platform to benefit viewers with accessibility needs and those who want to ensure they do not miss any live commentary.

In collaboration with AWS Professional Services, F1 has set up live transcriptions in English, Spanish, and French by using Amazon Transcribe and applying enhancements to capture domain-specific terminology.

Whether for sport broadcasting, streaming educational content, or conferences and webinars, AWS Professional Services is ready to help your team develop a real-time captioning system that is accurate and customizable by making full use of your domain-specific knowledge and the advanced features of Amazon Transcribe. For more information, see AWS Professional Services or reach out through your account manager to get in touch.


About the Authors

Beibit Baktygaliyev is a Senior Data Scientist with AWS Professional Services. As a technical lead, he helps customers to attain their business goals through innovative technology. In his spare time, Beibit enjoys sports and spending time with his family and friends.

Maira Ladeira Tanke is a Data Scientist at AWS Professional Services. She works with customers across industries to help them achieve business outcomes with AI and ML technologies. In her spare time, Maira likes to play with her cat Smila. She also loves to travel and spend time with her family and friends.

Sara Kazdagli is a Professional Services consultant specialized in Data Analytics and Machine Learning. She helps customers across different industries to build innovative solutions and make data-driven decisions. Sara holds a MSc in Software Engineering and a MSc in Data Science. In her spare time, she like to go on hikes and walks with her Australian shepherd dog Kiba.

Pablo Hermoso Moreno is a Data Scientist in the AWS Professional Services Team. He works with clients across industries using Machine Learning to tell stories with data and reach more informed engineering decisions faster. Pablo’s background is in Aerospace Engineering and having worked in the motorsport industry he has an interest in bridging physics and domain expertise with ML. In his spare time, he enjoys rowing and playing guitar.


AWS Deep Learning AMIs: New framework-specific DLAMIs for production complement the original multi-framework DLAMIs

Since its launch in November 2017, the AWS Deep Learning Amazon Machine Image (DLAMI) has been the preferred method for running deep learning frameworks on Amazon Elastic Compute Cloud (Amazon EC2). For deep learning practitioners and learners who want to accelerate deep learning in the cloud, the DLAMI comes pre-installed with AWS-optimized deep learning (DL) frameworks and their dependencies so you can get started right away with conducting research, developing machine learning (ML) applications, or educating yourself about deep learning. DLAMIs also make it easy to get going on instance types based on AWS-built processors such as Inferentia, Trainium, and Graviton, with all the necessary dependencies pre-installed.

The original DLAMI contained several popular frameworks such as PyTorch, TensorFlow, and MXNet, all in one bundle that AWS tested and supported on AWS instances. Although the multiple-framework DLAMI enables developers to explore various frameworks in a single image, some use cases require a smaller DLAMI that contains only a single framework. To support these use cases, we recently released DLAMIs that each contain a single framework. These framework-specific DLAMIs have less complexity and smaller size, making them more optimized for production environments.

In this post, we describe the components of the framework-specific DLAMIs and compare the use cases of the framework-specific and multi-framework DLAMIs.

All the DLAMIs contain similar libraries. The PyTorch DLAMI and the TensorFlow DLAMI each contain all the drivers necessary to run the framework on AWS instance types such as P3 and P4, as well as Trainium- and Graviton-based instances. The following table compares DLAMIs and components. More information can be found in the release notes.

| Component    | Framework-specific PyTorch 1.9.0 | Framework-specific TensorFlow 2.5.0 | Multi-framework (AL2 – v50) |
|--------------|----------------------------------|-------------------------------------|-----------------------------|
| PyTorch      | 1.9.0                            | N/A                                 | 1.4.0 & 1.8.1               |
| TensorFlow   | N/A                              | 2.5.0                               | 2.4.2, 2.3.3 & 1.15.5       |
| NVIDIA CUDA  | 11.1.1                           | 11.2.2                              | 10.x, 11.x                  |
| NVIDIA cuDNN | 8.0.5                            | 8.1.1                               | N/A                         |

Eliminating other frameworks and their associated components makes each framework-specific DLAMI approximately 60% smaller (approximately 45 GB vs. 110 GB). As described in the following section, this reduction in complexity and size has advantages for certain use cases.

DLAMI use cases

The multi-framework DLAMI has, until now, been the default for AWS developers doing deep learning on EC2, because it simplifies the experience for developers looking to explore and compare different frameworks within a single AMI. The multi-framework DLAMI remains a great solution for use cases focused on research, development, and education, because it comes preinstalled with the deep learning infrastructure for TensorFlow, PyTorch, and MXNet. Developers don’t have to spend any time installing deep learning libraries and components specific to any of these frameworks, and can experiment with the latest versions of each of the most popular frameworks. This one-stop shop means that you can focus on your deep learning-related tasks instead of MLOps and driver configurations. Having multiple frameworks in the DLAMI provides flexibility and options for practitioners looking to explore multiple deep learning frameworks.

Some examples of use cases for the multi-framework DLAMI include:

  • Medical research – Research scientists want to develop models that detect malignant tumors and want to compare performance between deep learning frameworks to achieve the highest performance metrics possible
  • Deep learning college course – College students learning to train deep learning models can choose from the multiple frameworks installed on the DLAMI in a Jupyter environment
  • Developing a model for a mobile app – Developers use the multi-framework DLAMI to develop multiple models for their voice assistant mobile app using a combination of deep learning frameworks

When deploying in a production environment, however, developers may only require a single framework and its related dependencies. The lightweight, framework-specific DLAMIs provide a more streamlined image that minimizes dependencies. In addition to a smaller footprint, the framework-specific DLAMIs minimize the surface area for security attacks and provide more consistent compatibility across versions due to the limited number of included libraries. The framework-specific DLAMIs also have less complexity, which makes them more reliable as developers increment versions in production environments.

Some examples of use cases for framework-specific DLAMIs include:

  • Deploying an ML-based credit underwriting model – A finance startup wants to deploy an inference endpoint with high reliability and availability with faster auto scaling during demand spikes
  • Batch processing of video – A film company creates a command line application that increases the resolution of low-resolution digital video files using deep learning by interpolating pixels
  • Training a framework-specific model – A mobile app startup needs to train a model using TensorFlow because their app development stack requires a TensorFlow Lite compiled model

Conclusion

DLAMIs have become the go-to image for deep learning on EC2. Now, framework-specific DLAMIs build on that success by providing images that are optimized for production use cases. Like multi-framework DLAMIs, the single-framework images remove the heavy lifting necessary for developers to build and maintain deep learning applications. With the launch of the new, lightweight framework-specific DLAMIs, developers now have more choices for accelerated Deep Learning on EC2.

Get started with single-framework DLAMIs today by following this tutorial and selecting a framework-specific Deep Learning AMI in the Launch Wizard.
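
If you prefer to look up AMIs programmatically, the following sketch uses the EC2 DescribeImages API to list AWS-owned Deep Learning AMIs matching a name pattern (the pattern shown is an example; exact AMI names vary by framework version, OS, and Region):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# The name pattern is an example; adjust it to the framework and OS you need.
response = ec2.describe_images(
    Owners=["amazon"],
    Filters=[{"Name": "name", "Values": ["Deep Learning AMI GPU PyTorch*"]}],
)
# Print the five most recently created matching AMIs.
for image in sorted(response["Images"], key=lambda i: i["CreationDate"], reverse=True)[:5]:
    print(image["ImageId"], image["Name"])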


About the Authors

Francisco Calderon is a Data Scientist in the Amazon ML Solutions Lab. As a member of the ML Solutions Lab, he helps solve critical business problems for AWS customers using deep learning. In his spare time, Francisco likes to play music and guitar, play soccer with his daughters, and enjoy time with his family.

Corey Barrett is a Data Scientist in the Amazon ML Solutions Lab. As a member of the ML Solutions Lab, he uses machine learning and deep learning to solve critical business problems for AWS customers. Outside of work, you can find him enjoying the outdoors, sipping on scotch, and spending time with his family.


Clinical text mining using the Amazon Comprehend Medical new SNOMED CT API

Mining medical concepts from written clinical text, such as patient encounters, plays an important role in clinical analytics and decision-making applications, such as population analytics for providers, pre-authorization for payers, and adverse-event detection for pharma companies. Medical concepts include medical conditions, medications, procedures, and other clinical events. Extracting medical concepts is a complicated process due to the specialist knowledge required and the broad use of synonyms in the medical field. Furthermore, to make detected concepts useful for large-scale analytics and decision-making applications, they have to be codified. This is a process where a specialist looks up matching codes from a medical ontology, often containing tens to hundreds of thousands of concepts.

To solve these problems, Amazon Comprehend Medical provides a fast and accurate way to automatically extract medical concepts from the written text found in clinical documents. You can now also use a new feature to automatically standardize and link detected concepts to the SNOMED CT (Systematized Nomenclature of Medicine—Clinical Terms) ontology. SNOMED CT provides a comprehensive clinical healthcare terminology and accompanying clinical hierarchy, and is used to encode medical conditions, procedures, and other medical concepts to enable big data applications.

This post details how to use the new SNOMED CT API to link SNOMED CT codes to medical concepts (or entities) in natural written text that can then be used to accelerate research and clinical application building. After reading this post, you will be able to detect and extract medical terms from unstructured clinical text and map them to the SNOMED CT ontology (US edition). You will also be able to retrieve and manipulate information from a clinical database, including electronic health record (EHR) systems, and, if your EHR system uses an ontology other than SNOMED CT, map SNOMED CT concepts to other ontologies using the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM).

Solution overview

Amazon Comprehend Medical is a HIPAA-eligible natural language processing (NLP) service that uses machine learning (ML) to extract clinical data from unstructured medical text—no ML experience required—and automatically map them to SNOMED CT, ICD10, or RxNorm ontologies with a simple API call. You can then add the ontology codes to your EHR database to augment patient data or link to other ontologies as desired through OMOP CDM. For this post, we demonstrate the solution workflow as shown in the following diagram with code based on the example sentence “Patient X was diagnosed with insomnia.”

To use clinical concept codes based on a text input, we detect and extract clinical terms, connect to the clinical database, transform SNOMED CT codes to OMOP CDM concept IDs, and use them within our records.

For this post, we use the OMOP CDM as an example database schema. Historically, healthcare institutions in different regions and countries have used their own terminologies and classifications for their own purposes, which prevents the interoperability of the systems. While SNOMED CT standardizes medical concepts with a clinical hierarchy, the OMOP CDM provides a standardization mechanism to move from one ontology to another, with an accompanying data model. The OMOP CDM standardizes the format and content of observational data so that standardized applications, tools, and methods can be applied across different datasets. In addition, the OMOP CDM makes it easier to convert codes from one vocabulary to another by having maps between medical concepts in different hierarchical ontologies and vocabularies. The ontology hierarchy is set such that descendants are more specific than ancestors. For example, non-small cell lung cancer is a descendant of malignant neoplastic disease. This allows querying and retrieving concepts and all their hierarchical descendants, and also enables interoperability between ontologies.
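
To make the mapping step concrete, here is a minimal sketch of looking up the OMOP concept_id for a SNOMED CT code, assuming a psycopg2 connection such as the one returned by the connect_to_db utility defined later in this post and the standard OMOP CDM concept table (the remaining steps of the post implement this end to end):

# Sketch: look up the OMOP concept_id for a SNOMED CT code in the OMOP vocabulary.
# Assumes `conn` is an open psycopg2 connection to the OMOP CDM database; qualify
# the table name with your OMOP schema if needed.
snomed_code = "193462001"  # Insomnia (disorder), detected later in this post

query = """
    SELECT concept_id, concept_name
    FROM concept
    WHERE vocabulary_id = 'SNOMED' AND concept_code = %s
"""
with conn.cursor() as cur:
    cur.execute(query, (snomed_code,))
    print(cur.fetchall())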

We demonstrate implementing this solution with the following steps:

  1. Extract concepts with Amazon Comprehend Medical SNOMED CT and link them to the SNOMED CT (US edition) ontology.
  2. Connect to the OMOP CDM.
  3. Map the SNOMED CT code to OMOP CDM concept IDs.
  4. Use the structured information to perform the following actions:
    1. Retrieve the number of patients with the disease.
    2. Traverse the ontology.
    3. Map to other ontologies.

Prerequisites

Before you get started, make sure you have the following:

  • Access to an AWS account.
  • Permissions to create an AWS CloudFormation stack.
  • Permissions to call Amazon Comprehend Medical from Amazon SageMaker.
  • Permissions to query Amazon Redshift from SageMaker.
  • The SNOMED CT license. SNOMED International is a strong member-owned and driven organization with free use of SNOMED CT within the member’s territory. Members manage the release, distribution, and sub-licensing of SNOMED CT and other products of the association within their territory.

This post assumes that you have an OMOP CDM database set up in Amazon Redshift. See Create data science environments on AWS for health analysis using OHDSI to set up a sample OMOP CDM in your AWS account using CloudFormation templates.

Extract concepts with Amazon Comprehend Medical SNOMED CT

You can extract SNOMED CT codes using Amazon Comprehend Medical with two lines of code. Assume you have a document, paragraph, or sentence:

clinical_note = "Patient X was diagnosed with insomnia."

First, we instantiate the Amazon Comprehend Medical client in boto3. Then, we simply call Amazon Comprehend Medical’s SNOMED CT API:

import boto3
cm_client = boto3.client("comprehendmedical")
response = cm_client.infer_snomedct(Text=clinical_note)

Done! In our example, the response is as follows:

{'Characters': {'OriginalTextCharacters': 38},
 'Entities': [{'Attributes': [],
               'BeginOffset': 29,
               'Category': 'MEDICAL_CONDITION',
               'EndOffset': 37,
               'Id': 0,
               'SNOMEDCTConcepts': [{'Code': '193462001',
                                     'Description': 'Insomnia (disorder)',
                                     'Score': 0.7997841238975525},
                                    {'Code': '191997003',
                                     'Description': 'Persistent insomnia '
                                                    '(disorder)',
                                     'Score': 0.6464713215827942},
                                    {'Code': '762348004',
                                     'Description': 'Acute insomnia (disorder)',
                                     'Score': 0.6253700256347656},
                                    {'Code': '59050008',
                                     'Description': 'Initial insomnia '
                                                    '(disorder)',
                                     'Score': 0.6112624406814575},
                                    {'Code': '24121004',
                                     'Description': 'Insomnia disorder related '
                                                    'to another mental '
                                                    'disorder (disorder)',
                                     'Score': 0.6014388203620911}],
               'Score': 0.9989109039306641,
               'Text': 'insomnia',
               'Traits': [{'Name': 'DIAGNOSIS', 'Score': 0.7624053359031677}],
               'Type': 'DX_NAME'}],
 'ModelVersion': '0.0.1',
 'ResponseMetadata': {'HTTPHeaders': {'content-length': '873',
                                      'content-type': 'application/x-amz-json-1.1',
                                      'date': 'Mon, 20 Sep 2021 18:32:04 GMT',
                                      'x-amzn-requestid': 'e9188a79-3884-4d3e-b73e-4f63ed831b0b'},
                      'HTTPStatusCode': 200,
                      'RequestId': 'e9188a79-3884-4d3e-b73e-4f63ed831b0b',
                      'RetryAttempts': 0},
 'SNOMEDCTDetails': {'Edition': 'US',
                     'Language': 'en',
                     'VersionDate': '20200901'}}

The response contains the following:

  • Characters – Total number of characters. In this case, we have 38 characters.
  • Entities – List of detected medical concepts, or entities, from Amazon Comprehend Medical. The main elements in each entity are:

    • Text – Original text from the input data.
    • BeginOffset and EndOffset – The beginning and ending locations of the text in the input note, respectively.
    • Category – Category of the detected entity. For example, MEDICAL_CONDITION for medical condition.
    • SNOMEDCTConcepts – Top five predicted SNOMED CT concept codes with the model’s confidence scores (in descending order). Each linked concept code has the following:

      • Code – SNOMED CT concept code.
      • Description – SNOMED CT concept description.
      • Score – Confidence score of the linked SNOMED CT concept.
  • ModelVersion – Version of the model used for the inference.
  • ResponseMetadata – API call metadata.
  • SNOMEDCTDetails – Edition, language, and date of the SNOMED CT version used.
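
As a minimal sketch (assuming the response dictionary shown above), you can walk the detected entities and their linked concepts as follows:

# Print each detected entity with its top-ranked SNOMED CT concept.
# Assumes `response` is the dict returned by infer_snomed_ct above.
for entity in response["Entities"]:
    if not entity["SNOMEDCTConcepts"]:
        continue  # entity without linked SNOMED CT concepts
    top_concept = entity["SNOMEDCTConcepts"][0]  # concepts are ordered by confidence score
    print(
        f"{entity['Text']} ({entity['Category']}): "
        f"{top_concept['Description']} [{top_concept['Code']}], "
        f"score={top_concept['Score']:.3f}"
    )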

For more information, refer to the Amazon Comprehend Medical Developer Guide. By default, the API links detected entities to the SNOMED CT US edition. To request support for your edition, for example the UK edition, contact us via AWS Support or the Amazon Comprehend Medical forum.

In our example, Amazon Comprehend Medical identifies “insomnia” as a clinical term and provides the five most likely SNOMED CT concepts and codes for it, ordered by confidence. Here, Amazon Comprehend Medical correctly ranks Insomnia (disorder) as the most likely concept. Therefore, the next step is to extract the top prediction from the response. See the following code:

#Get top predicted SNOMED CT Concept
pred_snomed = response['Entities'][0]['SNOMEDCTConcepts'][0]

The content of pred_snomed is as follows, with its predicted SNOMED concept code, concept description, and prediction score (probability):

{
 'Description': 'Insomnia (disorder)',
 'Code': '193462001',
 'Score': 0.803254246711731
}

We have identified clinical terms in our text and linked them to SNOMED CT concepts. We can now use SNOMED CT’s hierarchical structure and relations to other ontologies to accelerate clinical analytics and decision-making application development.

Before we access the database, let’s define some utility functions that are helpful in our operations. First, we must import the necessary Python packages:

import pandas as pd
import psycopg2

The following code is a function to connect to the Amazon Redshift database:

def connect_to_db(redshift_parameters, user, password):
    """Connect to the database and return the connection.
    Args:
        redshift_parameters (dict): Redshift connection parameters.
        user (str): Redshift user required to connect.
        password (str): Password associated with the user.
    Returns:
        Connection: psycopg2 connection to the Amazon Redshift database.
    """

    try:
        conn = psycopg2.connect(
            host=redshift_parameters["url"],
            port=redshift_parameters["port"],
            user=user,
            password=password,
            database=redshift_parameters["database"],
        )

        return conn

    except psycopg2.Error:
        raise ValueError("Failed to open database connection.")

The following code is a function to run a given query on the Amazon Redshift database:

def execute_query(cursor, query, limit=None):
    """Execute a query and return the results.
    Args:
        cursor (psycopg2 cursor): Cursor with an established connection to Amazon Redshift.
        query (str): SQL query.
        limit (int): Maximum number of rows returned in the data frame. Defaults to 'None' for no limit.
    Returns:
        pd.DataFrame: Data frame with the query results.
    """
    try:
        cursor.execute(query)
    except psycopg2.Error:
        return None

    columns = [c.name for c in cursor.description]
    results = cursor.fetchall()
    if limit:
        results = results[:limit]

    out = pd.DataFrame(results, columns=columns)

    return out

In the next sections, we connect to the database and run our queries.

Connect to the OMOP CDM

EHRs are often stored in databases using a specific ontology. In our case, we use the OMOP CDM, which contains a large number of ontologies (SNOMED, ICD10, RxNorm, and more), but you can extend the solution to other data models by modifying the queries. The first step is to connect to Amazon Redshift where the EHR data is stored.

Let’s define the variables used to connect to the database. Substitute the placeholder values in the following code with the actual values for your Amazon Redshift database:

#Connect to Amazon Redshift Database
REDSHIFT_PARAMS = {
                    "url": "<database-url>", 
                    "port": "<database-port>",
                    "database": "<database-name>",
                  }
REDSHIFT_USER = "<user-name>"
REDSHIFT_PASSWORD = "<user-password>"

conn = connect_to_db(REDSHIFT_PARAMS, REDSHIFT_USER, REDSHIFT_PASSWORD)
cursor = conn.cursor()

Map the SNOMED CT code to OMOP CDM concept IDs

The OMOP CDM uses its own concept IDs as data model identifiers across ontologies. Those differ from specific ontology codes such as SNOMED CT’s codes, but you can retrieve them from SNOMED CT codes using pre-built OMOP CDM maps. To retrieve the concept_id of SNOMED CT code 193462001, we use the following query:

query1 = f"
SELECT DISTINCT concept_id 
FROM cmsdesynpuf23m.concept 
WHERE vocabulary_id='SNOMED' AND concept_code='{pred_snomed['Code']}';
"

out_df = execute_query(cursor, query1)
concept_id = out_df['concept_id'][0]
print(concept_id)

The output OMOP CDM concept_id is 436962. The concept ID uniquely identifies a given medical concept in the OMOP CDM database and is used as a primary key in the concept table. This enables linking of each code with patient information in other tables.

Use the structured information

Now that we have OMOP’s concept_id, we can run many queries from the database. When we find the particular concept, we can use it for different use cases. For example, we can use it to query population statistics with a given condition, traverse ontologies to bridge operability gaps, and extract the unique hierarchical structure of concepts to achieve the right queries. In this section, we walk you through a few examples.

Retrieve the number of patients with a disease

The first example is retrieving the total number of patients with the insomnia condition that we linked to its appropriate ontology concept using Amazon Comprehend Medical. The following code formulates and runs the corresponding SQL query:

query2 = f"
SELECT COUNT(DISTINCT person_id) 
FROM cmsdesynpuf23m.condition_occurrence 
WHERE condition_concept_id='{concept_id}';
"
out_df = execute_query(cursor, query2)
print(out_df)

In our sample records described in the prerequisites section, the total number of patients in the database that have been diagnosed with insomnia is 26,528.

Traverse the ontology

One of the advantages of using SNOMED CT is that we can exploit its hierarchical taxonomy. Let’s illustrate this with a few examples.

Ancestors: Going up the hierarchy

First, let’s find the immediate ancestors and descendants of the concept insomnia. We use concept_ancestor and concept tables to get the parent (ancestor) and children (descendants) of the given concept code. The following code is the SQL statement to output the parent information:

query3 = f"
SELECT DISTINCT concept_code, concept_name 
FROM cmsdesynpuf23m.concept 
WHERE concept_id IN (SELECT ancestor_concept_id 
FROM cmsdesynpuf23m.concept_ancestor 
WHERE descendant_concept_id='{concept_id}' AND max_levels_of_separation=1);
"
out_df = execute_query(cursor, query3)
print(out_df)

In the preceding example, we used max_levels_of_separation=1 to limit the results to immediate ancestors. You can increase this value to go further up the hierarchy. The following table summarizes our results.

concept_code concept_name
44186003 Dyssomnia
194437008 Disorders of initiating and maintaining sleep

SNOMED CT offers a polyhierarchical classification, which means a concept can have more than one parent. This hierarchy is also called a directed acyclic graph (DAG).
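
As a sketch that reuses the same concept and concept_ancestor tables (and the cursor, execute_query, and concept_id defined earlier), you can relax the level restriction to retrieve the full ancestor chain rather than only the immediate parents:

# Retrieve all ancestors of the concept, not just the immediate parents.
query_all_ancestors = f"""
SELECT DISTINCT c.concept_code, c.concept_name, ca.max_levels_of_separation
FROM cmsdesynpuf23m.concept c
JOIN cmsdesynpuf23m.concept_ancestor ca ON c.concept_id = ca.ancestor_concept_id
WHERE ca.descendant_concept_id='{concept_id}' AND ca.max_levels_of_separation >= 1
ORDER BY ca.max_levels_of_separation;
"""
print(execute_query(cursor, query_all_ancestors))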

Descendants: Going down the hierarchy

We can use a similar logic to retrieve the children of the code insomnia:

query4 = f"SELECT DISTINCT concept_code, concept_name 
FROM cmsdesynpuf23m.concept 
WHERE concept_id IN (SELECT descendant_concept_id 
FROM cmsdesynpuf23m.concept_ancestor 
WHERE ancestor_concept_id='{concept_id}' AND max_levels_of_separation=1);
"
out_df = execute_query(cursor, query4)
print(out_df)

As a result, we get 26 descendant codes; the following table shows the first 10 rows.

concept_code concept_name
24121004 Insomnia disorder related to another mental disorder
191997003 Persistent insomnia
198437004 Menopausal sleeplessness
88982005 Rebound insomnia
90361000119105 Behavioral insomnia of childhood
41975002 Insomnia with sleep apnea
268652009 Transient insomnia
81608000 Insomnia disorder related to known organic factor
162204000 Late insomnia
248256006 Not getting enough sleep

We can then use these codes to query a broader set of patients (parent concept) or a more specific one (child concept).

Finding the concept at the appropriate hierarchy level is important; otherwise, your queries can return misleading statistics. For example, suppose you only want the number of patients whose insomnia is specifically related to not getting enough sleep. Querying with the parent concept for general insomnia gives you a different answer than querying with the specific descendant concept code for not getting enough sleep.
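
For instance, the following sketch counts only the patients linked to the more specific descendant concept Not getting enough sleep (SNOMED CT code 248256006 from the preceding table), reusing the lookup-then-count pattern shown earlier:

# Count patients for a specific descendant concept rather than the parent.
descendant_code = "248256006"  # "Not getting enough sleep", from the preceding table

query_descendant_id = f"""
SELECT DISTINCT concept_id
FROM cmsdesynpuf23m.concept
WHERE vocabulary_id='SNOMED' AND concept_code='{descendant_code}';
"""
descendant_concept_id = execute_query(cursor, query_descendant_id)["concept_id"][0]

query_descendant_count = f"""
SELECT COUNT(DISTINCT person_id)
FROM cmsdesynpuf23m.condition_occurrence
WHERE condition_concept_id='{descendant_concept_id}';
"""
print(execute_query(cursor, query_descendant_count))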

Map to other ontologies

We can also map the SNOMED CT concept code to other ontologies, such as ICD10CM for conditions and RxNorm for medications. Because insomnia is a condition, let’s find the corresponding ICD10CM concept codes for the insomnia SNOMED CT concept code. The following code is the SQL statement and function call to find the ICD10CM concept codes:

query5 = f"
SELECT DISTINCT concept_code, concept_name, vocabulary_id 
FROM cmsdesynpuf23m.concept 
WHERE vocabulary_id='ICD10CM' AND 
concept_id IN (SELECT concept_id_2 
FROM cmsdesynpuf23m.concept_relationship 
WHERE concept_id_1='{concept_id}' AND relationship_id='Mapped from');
"
out_df = execute_query(cursor, query5)
print(out_df)

The following table lists the corresponding ICD10 concept codes with their descriptions.

concept_code concept_name vocabulary_id
G47.0 Insomnia ICD10CM
G47.00 Insomnia, unspecified ICD10CM
G47.09 Other insomnia ICD10CM

When we’re done running SQL queries, let’s close the connection to the database:

conn.close()

Conclusion

Now that you have reviewed this example, you’re ready to apply Amazon Comprehend Medical to your clinical text to extract and link SNOMED CT concepts. We also provided concrete examples of how to use this information with your medical records in an OMOP CDM database to run SQL queries and get patient information related to the medical concepts. Finally, we showed how to extract the different hierarchies of medical concepts and convert SNOMED CT concepts to other standardized vocabularies such as ICD10CM.

The Amazon ML Solutions Lab pairs your team with ML experts to help you identify and implement your organization’s highest value ML opportunities. If you’d like help accelerating your use of ML in your products and processes, please contact the Amazon ML Solutions Lab.


About the Author

Tesfagabir Meharizghi is a Data Scientist at the Amazon ML Solutions Lab where he helps customers across different industries accelerate their use of machine learning and AWS Cloud services to solve their business challenges.

Miguel Romero Calvo is an Applied Scientist at the Amazon ML Solutions Lab where he partners with AWS internal teams and strategic customers to accelerate their business through ML and cloud adoption.

Lin Lee Cheong is a Senior Scientist and Manager with the Amazon ML Solutions Lab team at Amazon Web Services. She works with strategic AWS customers to explore and apply artificial intelligence and machine learning to discover new insights and solve complex problems.

Read More

Plan the locations of green car charging stations with an Amazon SageMaker built-in algorithm

While the fuel economy of new gasoline or diesel-powered vehicles improves every year, green vehicles are considered even more environmentally friendly because they’re powered by alternative fuel or electricity. Hybrid electric vehicles (HEVs), battery only electric vehicles (BEVs), fuel cell electric vehicles (FCEVs), hydrogen cars, and solar cars are all considered types of green vehicles.

Charging stations for green vehicles are similar to the gas pump in a gas station. They can be fixed on the ground or wall and installed in public buildings (shopping malls, public parking lots, and so on), residential district parking lots, or charging stations. They can be based on different voltage levels and charge various types of electric vehicles.

As a charging station vendor, you should consider many factors when building a charging station. The location of charging stations is a complicated problem. Customer convenience, urban setting, and other infrastructure needs are all important considerations.

In this post, we use machine learning (ML) with Amazon SageMaker and Amazon Location Service to provide guidance for charging station vendors looking to choose optimal charging station locations.

Solution overview

In this solution, we use SageMaker training jobs to train the clustering model and a SageMaker endpoint to deploy the model. We use Amazon Location Service to display the map and the clustering results.

We also use Amazon Simple Storage Service (Amazon S3) to store the training data and model artifacts.

The following figure illustrates the architecture of the solution.

Data preparation

GPS data is highly sensitive information because it can be used to track the historical movement of an individual. In this post, we use the trip-simulator tool to generate GPS data that simulates a taxi driver’s driving behavior.

We choose Nashville, Tennessee, as our location. The following script simulates 1,000 agents and generates 14 hours of driving data starting September 15, 2020, 8:00 AM:

trip-simulator \
  --config scooter \
  --pbf nash.osm.pbf \
  --graph nash.osrm \
  --agents 1000 \
  --start 1600128000000 \
  --seconds 50400 \
  --traces ./traces.json \
  --probes ./probes.json \
  --changes ./changes.json \
  --trips ./trips.json

The preceding script generates three output files. We use changes.json, which includes the cars’ driving GPS data as well as pickup and drop-off information. The file format looks like the following:

{
  "vehicle_id": "PLC-4375",
  "event_time": 1600128001000,
  "event_type": "available",
  "event_type_reason": "service_start",
  "event_location": {
    "type": "Feature",
    "properties": {},
    "geometry": {
      "type": "Point",
      "coordinates": [
        -86.7967066040155,
        36.17115028383999
      ]
    }
  }
}

The field event_type_reason has four main values:

  • service_start – The driver receives a ride request, and drives to the designated location
  • user_pick_up – The driver picks up a passenger
  • user_drop_off – The driver reaches the destination and drops off the passenger
  • maintenance – The driver is not in service mode and doesn’t receive the request

In this post, we only collect the location data with the status user_pick_up and user_drop_off as the algorithm’s input. In real-life situations, you should also consider features such as the passenger’s information and business district information.

pandas is a Python library for data analysis. The following script uses pandas to convert the data from JSON to CSV format:

import pandas as pd

df=pd.read_json('./data/changes.json', lines=True)
df_event=df.event_location.apply(pd.Series)
df_geo=df_event.geometry.apply(pd.Series)
df_coord=df_geo.coordinates.apply(pd.Series)
result = pd.concat([df, df_coord], axis=1)
result = result.drop("event_location",axis = 1)
result.columns=["vehicle_id","event_time","event_type","event_reason","longitude","latitude"]
result.to_csv('./data/result.csv',index=False,sep=',')
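
The conversion above keeps every event type. As a small sketch (assuming the result DataFrame built above, where event_type_reason was renamed to event_reason), you can keep only the pickup and drop-off rows before saving the training input:

# Keep only pickup and drop-off rows as the clustering input.
result = result[result["event_reason"].isin(["user_pick_up", "user_drop_off"])]
result.to_csv('./data/result.csv', index=False, sep=',')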

The following table shows our results.

The original GPS data contains noise; for example, some pickup and drop-off coordinates fall in the lake. The generated GPS data also follows a uniform distribution that doesn’t account for business districts, no-stop areas, or depopulated zones. In practice, there is no standard process for data preprocessing. You can simplify data preprocessing and feature engineering with Amazon SageMaker Data Wrangler.

Data exploration

To better observe and analyze the simulated track data, we use Amazon Location for data visualization. Amazon Location provides frontend SDKs for Android, iOS, and the web. For more information about Amazon Location, see the Developer Guide.

We start by creating a map on the Amazon Location console.

We use the MapLibre GL JS SDK for our map display. The following script displays a map of Nashville, Tennessee, and renders a specific car’s driving route (or trace) line:

async function initializeMap() {
  // Load credentials and set them up to refresh
  await credentials.getPromise();

  // Initialize the map
  map = new maplibregl.Map({
    container: "map",
    center: [-86.792845, 36.16378], // initial map centerpoint
    zoom: 10, // initial map zoom
    style: mapName,
    transformRequest,
  });
}

map.addSource('route', {
  'type': 'geojson',
  'data': {
    'type': 'Feature',
    'properties': {},
    'geometry': {
      'type': 'LineString',
      'coordinates': [
        [-86.85009051679292, 36.144774042081494],
        [-86.85001827659116, 36.14473133061205],
        [-86.85004741661184, 36.1446756197635],
        [-86.85007975396945, 36.14465452846737],
        [-86.85005249508677, 36.14469518290888]
        // ...more coordinates
      ]
    }
  }
});

The following graph displays a taxi’s 14-hour driving route.

The following script displays the car’s route distribution:

map.addSource('car-location', {
  'type': 'geojson',
  'data': {
    'type': 'FeatureCollection',
    'features': [
      {'type': 'Feature', 'geometry': {'type': 'Point', 'coordinates': [-86.79417828985571, 36.1742558685242]}},
      {'type': 'Feature', 'geometry': {'type': 'Point', 'coordinates': [-86.76932509874324, 36.18006513143749]}},
      // ...more features
      {'type': 'Feature', 'geometry': {'type': 'Point', 'coordinates': [-86.84082991448976, 36.14558741886923]}}
    ]
  }
});

The following map visualization shows our results.

Algorithm selection

K-means is an unsupervised learning algorithm. It attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups.

SageMaker uses a modified version of the web-scale k-means clustering algorithm. Compared to the original version of the algorithm, the version SageMaker uses is more accurate. Like the original algorithm, it scales to massive datasets and delivers improvements in training time. To do this, it streams mini-batches (small, random subsets) of the training data.

The k-means algorithm expects tabular data. In this solution, the GPS coordinate data (longitude, latitude) is the input training data. See the following code:

import io
import os

import boto3
import pandas as pd
import sagemaker.amazon.common as smac

df = pd.read_csv('./data/result.csv', sep=',', header=0, usecols=['longitude', 'latitude'])

# Routine that converts the training data into the protobuf format required by SageMaker k-means.
def write_to_s3(bucket, prefix, channel, file_prefix, X):
    buf = io.BytesIO()
    smac.write_numpy_to_dense_tensor(buf, X.astype('float32'))
    buf.seek(0)
    boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, channel, file_prefix + '.data')).upload_fileobj(buf)

# Prepare the training data and save it to Amazon S3.
def prepare_train_data(bucket, prefix, file_prefix, save_to_s3=True):
    train_data = df.to_numpy()
    if save_to_s3:
        write_to_s3(bucket, prefix, 'train', file_prefix, train_data)
    return train_data

# Convert and upload the dataset
train_data = prepare_train_data(bucket, prefix, 'train', save_to_s3=True)

# SageMaker k-means ECR image URIs
images = {'us-west-2': '174872318107.dkr.ecr.us-west-2.amazonaws.com/kmeans:latest',
          'us-east-1': '382416733822.dkr.ecr.us-east-1.amazonaws.com/kmeans:latest',
          'us-east-2': '404615174143.dkr.ecr.us-east-2.amazonaws.com/kmeans:latest',
          'eu-west-1': '438346466558.dkr.ecr.eu-west-1.amazonaws.com/kmeans:latest'}

image = images[boto3.Session().region_name]

Train the model

Before you train your model, consider the following:

  • Data format – Both protobuf recordIO and CSV formats are supported for training. In this solution, we use protobuf format and File mode as the training data input.
  • EC2 instance selection – AWS suggests using an Amazon Elastic Compute Cloud (Amazon EC2) CPU instance when selecting the k-means algorithm. We use two ml.c5.2xlarge instances for training.
  • Hyperparameters – Hyperparameters are closely related to the dataset; you can adjust them according to the actual situation to get the best results:

    • k – The number of required clusters (k). Because we don’t know the number of clusters in advance, we train many models with different values (k).
    • init_method – The method by which the algorithm chooses the initial cluster centers. A valid value is random or kmeans++.
    • epochs – The number of passes done over the training data. We set this to 10.
    • mini_batch_size – The number of observations per mini-batch for the data iterator. We tried 50, 100, 200, 500, 800, and 1,000 in our dataset.

We train our models with the following code. To get results faster, we launch the SageMaker training jobs concurrently, with two instances per job. The value of k ranges from 3 to 15, each training job generates one model, and the model artifacts are saved to an S3 bucket.

K = range(3, 16, 1)  # Try different values of k, increasing by 1 from 3 to 15
INSTANCE_COUNT = 2   # Use two CPU instances per training job
run_parallel_jobs = True  # Set to False to run jobs one at a time, for example if you don't
                          # want to create too many EC2 instances at once and hit service limits.
job_names = []

# Launch a training job for each k
for k in K:
    print('starting train job:' + str(k))
    output_location = 's3://{}/kmeans_example/output/'.format(bucket) + output_folder
    print('training artifacts will be uploaded to: {}'.format(output_location))
    job_name = output_folder + str(k)
    job_names.append(job_name)

    create_training_params = {
        "AlgorithmSpecification": {
            "TrainingImage": image,
            "TrainingInputMode": "File"
        },
        "RoleArn": role,
        "OutputDataConfig": {
            "S3OutputPath": output_location
        },
        "ResourceConfig": {
            "InstanceCount": INSTANCE_COUNT,
            "InstanceType": "ml.c4.xlarge",
            "VolumeSizeInGB": 20
        },
        "TrainingJobName": job_name,
        "HyperParameters": {
            "k": str(k),
            "feature_dim": "2",
            "epochs": "100",
            "init_method": "kmeans++",
            "mini_batch_size": "800"
        },
        "StoppingCondition": {
            "MaxRuntimeInSeconds": 60 * 60
        },
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": "s3://{}/{}/train/".format(bucket, prefix),
                        "S3DataDistributionType": "FullyReplicated"
                    }
                },
                "CompressionType": "None",
                "RecordWrapperType": "None"
            }
        ]
    }

    sagemaker = boto3.client('sagemaker')

    sagemaker.create_training_job(**create_training_params)
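
A minimal sketch for waiting on the launched jobs, assuming the job_names list collected in the loop above:

import time

# Poll each launched training job until it finishes.
for job_name in job_names:
    while True:
        status = sagemaker.describe_training_job(TrainingJobName=job_name)["TrainingJobStatus"]
        if status in ("Completed", "Failed", "Stopped"):
            print(f"{job_name}: {status}")
            break
        time.sleep(60)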

Evaluate the model

The number of clusters (k) is the most important hyperparameter in k-means clustering. Because we don’t know the value of k, we can use various methods to find the optimal value of k. In this section, we discuss two methods.

Elbow method

The elbow method is an empirical method to find the optimal number of clusters for a dataset. In this method, we select a range of candidate values of k, then apply k-means clustering using each of the values of k. We find the average distance of each point in a cluster to its centroid, and represent it in a plot. We select the value of k where the average distance falls suddenly. See the following code:

import mxnet as mx
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist

models = {}
distortions = []
for k in K:
    s3_client = boto3.client('s3')
    key = 'kmeans_example/output/' + output_folder + '/' + output_folder + str(k) + '/output/model.tar.gz'
    s3_client.download_file(bucket, key, 'model.tar.gz')
    print("Model for k={} ({})".format(k, key))
    !tar -xvf model.tar.gz
    kmeans_model = mx.ndarray.load('model_algo-1')
    kmeans_numpy = kmeans_model[0].asnumpy()
    print(kmeans_numpy)
    # Average distance of each point to its closest centroid (distortion)
    distortions.append(sum(np.min(cdist(train_data, kmeans_numpy, 'euclidean'), axis=1)) / train_data.shape[0])
    models[k] = kmeans_numpy

# Plot the elbow
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('distortion')
plt.title('Elbow graph')
plt.show()

We select a k range from 3–15 and train the models with the built-in k-means clustering algorithm. The graph shows an elbow shape at 10 clusters, which suggests 10 is an optimal cluster number.

Silhouette method

The silhouette method is another way to find the optimal number of clusters and to interpret and validate the consistency of the clustering. It computes a silhouette coefficient for each point that measures how similar the point is to its own cluster compared to other clusters, and provides a succinct graphical representation of how well each object has been classified.

The silhouette value measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The value ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, the clustering configuration is appropriate. If many points have a low or negative value, the clustering configuration may have too many or too few clusters.

First, we must deploy the model and predict the y value as silhouette input:

import json

runtime = boto3.Session().client('runtime.sagemaker')
endpointName = "kmeans-30-2021-08-06-00-48-38-963"
# The CSV body contains one "longitude,latitude" row per point to score;
# the points scored here must match the rows of X used for the silhouette below.
response = runtime.invoke_endpoint(EndpointName=endpointName,
                                   ContentType='text/csv',
                                   Body=b"-86.77971153,36.16336978\n-86.77971153,36.16336978")
r = response['Body'].read()
response_json = json.loads(r)
y_km = []
for item in response_json['predictions']:
    y_km.append(int(item['closest_cluster']))
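
The endpoint name above assumes a k-means model has already been deployed. A minimal deployment sketch for one of the trained models looks like the following (the model, endpoint, and S3 names are placeholders; image and role are the training image and IAM role used earlier):

# Deploy one trained k-means model behind a real-time endpoint.
sm = boto3.client("sagemaker")
model_name = "kmeans-model-k10"             # placeholder name
endpoint_config_name = "kmeans-config-k10"  # placeholder name
endpoint_name = "kmeans-endpoint-k10"       # placeholder name
model_data = "s3://<bucket>/kmeans_example/output/<job-name>/output/model.tar.gz"  # placeholder path

sm.create_model(
    ModelName=model_name,
    PrimaryContainer={"Image": image, "ModelDataUrl": model_data},
    ExecutionRoleArn=role,
)
sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
    }],
)
sm.create_endpoint(EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name)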

Next, we call the silhouette:

import numpy as np
from matplotlib import cm
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score, silhouette_samples

# X is the matrix of (longitude, latitude) points scored by the endpoint;
# convert y_km to an array so it can be used for boolean indexing
y_km = np.asarray(y_km)
cluster_labels = np.unique(y_km)
print(cluster_labels)
n_clusters = cluster_labels.shape[0]
silhouette_score_cluster_10 = silhouette_score(X, y_km)
print("Silhouette Score When Cluster Number Set to 10: %.3f" % silhouette_score_cluster_10)
silhouette_vals = silhouette_samples(X, y_km, metric='euclidean')
y_ax_lower, y_ax_upper = 0, 0
yticks = []
for i, c in enumerate(cluster_labels):
    c_silhouette_vals = silhouette_vals[y_km == c]
    c_silhouette_vals.sort()
    y_ax_upper += len(c_silhouette_vals)
    color = cm.jet(float(i) / n_clusters)
    plt.barh(range(y_ax_lower, y_ax_upper),
             c_silhouette_vals,
             height=1.0,
             edgecolor='none',
             color=color)
    yticks.append((y_ax_lower + y_ax_upper) / 2.0)
    y_ax_lower += len(c_silhouette_vals)

silhouette_avg = np.mean(silhouette_vals)
plt.axvline(silhouette_avg,
            color='red',
            linestyle='--')
plt.yticks(yticks, cluster_labels + 1)
plt.ylabel("Cluster")
plt.xlabel("Silhouette Coefficients k=10, Score=%.3f" % silhouette_score_cluster_10)
plt.savefig('./figure.png')
plt.show()

A silhouette score closer to 1 means the clusters are well separated from each other. In the following experiment result, the clusters are well separated when k is set to 8.

Different model evaluation methods can suggest different values for the best k. In our experiment, we choose k=10 as the optimal number of clusters.

Now we can display the k-means clustering result via Amazon Location. The following code marks selected locations on the map:

new maplibregl.Marker().setLngLat([-86.755974, 36.19235]).addTo(map);
new maplibregl.Marker().setLngLat([-86.710972, 36.203389]).addTo(map);
new maplibregl.Marker().setLngLat([-86.733895, 36.150209]).addTo(map);
new maplibregl.Marker().setLngLat([-86.795974, 36.165639]).addTo(map);
new maplibregl.Marker().setLngLat([-86.786743, 36.222799]).addTo(map);
new maplibregl.Marker().setLngLat([-86.701209, 36.267679]).addTo(map);
new maplibregl.Marker().setLngLat([-86.820134, 36.209863]).addTo(map);
new maplibregl.Marker().setLngLat([-86.769743, 36.131246]).addTo(map);
new maplibregl.Marker().setLngLat([-86.803346, 36.142358]).addTo(map);
new maplibregl.Marker().setLngLat([-86.833890, 36.113466]).addTo(map);
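
The marker coordinates above are the cluster centers selected by the model. As a sketch, assuming the models dictionary from the elbow step holds the k=10 centroids, you could generate these marker lines instead of typing them by hand:

# Emit one maplibre Marker line per cluster center (longitude, latitude).
for lon, lat in models[10]:
    print(f"new maplibregl.Marker().setLngLat([{lon:.6f}, {lat:.6f}]).addTo(map);")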

The following map visualization shows our results, with 10 clusters.

We also need to consider the scale of each charging station. Here, we divide the number of points around each cluster center by a coefficient (for example, a coefficient of 100 means every 100 cars share one charging pile); a small sketch of this sizing rule follows. The final visualization in this section includes the charging station scale.
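
A hedged sketch of the sizing rule, assuming y_all is a hypothetical array holding the closest-cluster index for every GPS point (obtained from the endpoint in the same way as above):

import math
from collections import Counter

# Number of charging piles per station: points in the cluster / 100, rounded up.
cars_per_pile = 100
cluster_sizes = Counter(y_all)
piles_per_station = {cluster: math.ceil(count / cars_per_pile)
                     for cluster, count in cluster_sizes.items()}
print(piles_per_station)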

Conclusion

In this post, we explained an end-to-end scenario for creating a clustering model in SageMaker based on simulated driving data. The solution includes training an MXNet model and creating an endpoint for real-time model hosting. We also explained how you can display the clustering results via the Amazon Location SDK.

You should also consider charging type and quantity. Plug-in charging is categorized by voltage and power levels, leading to different charging times. Slow charging usually takes several hours to charge, whereas fast charging can achieve a 50% charge in 10–15 minutes. We cover these factors in a later post.

Many other industries are also affected by location planning problems, including retail stores and warehouses. If you have feedback about this post, submit comments in the Comments section below.


About the Author

Zhang Zheng is a Sr. Partner Solutions Architect with AWS, helping industry partners on their journey to well-architected machine learning solutions at scale.

Read More

AWS computer vision and Amazon Rekognition: AWS recognized as an IDC MarketScape Leader in Asia Pacific (excluding Japan), up to 38% price cut, and major new features

Computer vision, the automatic recognition and description of documents, images, and videos, has far-reaching applications, from identifying defects in high-speed assembly lines, to intelligently automating document processing workflows, and identifying products and people in social media. AWS computer vision services, including Amazon Lookout for Vision, AWS Panorama, Amazon Rekognition, and Amazon Textract, help developers automate image, video, and text analysis without requiring machine learning (ML) experience. As a result, you can implement solutions faster and decrease your time to value.

As customers continue to expand their use of computer vision, we have been investing in all of our services to make them easier to apply to use cases, easier to implement with fewer data requirements, and more cost-effective. Recently, AWS was named a Leader in the IDC MarketScape: Asia/Pacific (Excluding Japan) Vision AI Software Platform 2021 Vendor Assessment (Doc # AP47490521, October 2021). The IDC MarketScape evaluated our product functionality, service delivery, research and innovation strategy, and more for three vision AI use cases: productivity, end-user experience, and decision recommendation. They found that our offerings have a product-market fit for all three use cases. The IDC MarketScape recommends that computer vision decision-makers consider AWS for Vision AI services when you need to centrally plan vision AI capabilities in a large-scope initiative, such as digital transformation (DX), or want flexible ways to control costs.

“Vision AI is one of the emerging technology markets,” says Christopher Lee Marshall, Associate Vice President, Artificial Intelligence and Analytics Strategies at IDC Asia Pacific. “AWS is placed in the Leader’s Category in IDC MarketScape: Asia/Pacific (Excluding Japan) Vision AI Software Platform 2021 Vendor Assessment. It’s critical to watch the major vendors and more mature market solutions, as the early movers tend to consolidate their strengths with greater access to training data, more iterations of algorithm variations, deeper understanding of the operational contexts, and more systematic approaches to work with solution partners in the ecosystem.”

A key service of focus in the report was Amazon Rekognition. We’re excited to announce several enhancements to make Amazon Rekognition more cost-effective, more accurate, and easier to implement. First, we’re lowering prices for image APIs. Next, we’re enriching Amazon Rekognition with new features for content moderation, text-in-image analysis, and automated machine learning (AutoML). The new capabilities enable more accurate content moderation workflows, optical character recognition for a broader range of scenarios, and simplified training and deployment of custom computer vision models.

These latest announcements add to the Amazon Textract innovations we introduced recently, where we added TIFF file support, lowered the latency of asynchronous operations by 50%, and reduced prices by up to 32% in eight AWS Regions. The Amazon Textract innovations make it easier, faster, and less expensive to process documents at scale using computer vision on AWS.

Let’s dive deeper into the Amazon Rekognition announcements and product improvements.

Up to 38% price reduction for Amazon Rekognition Image APIs

We want to help you get a better return on investment for computer vision workflows. Therefore, we’re lowering the price for all Amazon Rekognition Image APIs by up to 38%. This price reduction applies to all 14 Regions where the Amazon Rekognition service endpoints are available.

We offer four pricing tiers based on usage volume for Amazon Rekognition Image APIs today: up to 1 million, 1 – 10M, 10 – 100M, and above 100M images processed per month. The price points for these tiers are $0.001, $0.0008, $0.0006, and $0.0004 per image. With this price reduction, we lowered the API volumes that unlock lower prices:

  • We lowered the threshold from 10 million images per month to 5 million images per month for Tier 2. As a result, you can now benefit from a lower Tier 3 price of $0.0006 per image after 5 million images.
  • We lowered the Tier 4 threshold from 100 million images per month to 35 million images per month.

We summarize the volume threshold changes in the following table.

Tier | Old volume (images processed per month) | New volume (images processed per month)
Tier 1 | First 1 million images | First 1 million images (unchanged)
Tier 2 | Next 9 million images | Next 4 million images
Tier 3 | Next 90 million images | Next 30 million images
Tier 4 | Over 100 million images | Over 35 million images

Finally, we’re lowering the price per image for the highest-volume tier from $0.0004 to $0.00025 per image for select APIs. The prices in the following table are for the US East (N. Virginia) Region. In summary, the new prices are as follows.

Pricing tier | Volume (images per month) | Price per image: Group 1 APIs (CompareFaces, IndexFaces, SearchFacebyImage, SearchFaces) | Price per image: Group 2 APIs (DetectFaces, DetectModerationLabels, DetectLabels, DetectText, RecognizeCelebrities)
Tier 1 | First 1 million images | $0.00100 | $0.00100
Tier 2 | Next 4 million images | $0.00080 | $0.00080
Tier 3 | Next 30 million images | $0.00060 | $0.00060
Tier 4 | Over 35 million images | $0.00040 | $0.00025
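
To make the tier math concrete, here is how the 12-million-images-per-month figures in the savings table below follow from these tiers (Group 1 and Group 2 prices are identical at this volume):

# Monthly cost for 12 million images under the old and new tier boundaries
# (US East (N. Virginia) prices from the tables above).
old_cost = 1_000_000 * 0.001 + 9_000_000 * 0.0008 + 2_000_000 * 0.0006  # $9,400
new_cost = 1_000_000 * 0.001 + 4_000_000 * 0.0008 + 7_000_000 * 0.0006  # $8,400
print(old_cost, new_cost)  # 9400.0 8400.0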

Your savings will vary based on your usage. The following table provides example savings for a few scenarios in the US East (N. Virginia) Region.

API Volumes Group 1 & 2 Image APIs: Old Price Group 1 Image APIs Group 2 Image APIs
New Price % Reduction New Price % Reduction
12 Million in a month $9,400 $8,400 -10.6% $8,400 -10.6%
12M Annual (1M in a month) $12,000 $12,000 0.0% $12,000 0.0%
60M in a month $38,200 $32,200 -15.7% $28,450 -25.5%
60M Annual (5M in a month) $50,400 $50,400 0.0% $50,400 0.0%
120M in a month $70,200 $56,200 -19.9% $43,450 -38.1%
120M Annual (10M in a month) $98,400 $86,400 -12.2% $86,400 -12.2%
420M in a month $190,200 $176,200 -7.4% $118,450 -37.7%
420M Annual (35M in a month) $278,400 $266,400 -4.3% $266,400 -4.3%
1.2 Billion in a month $502,200 $488,200 -2.8% $313,450 -37.6%
1.2B Annual (100M in a month) $746,400 $578,400 -22.5% $461,400 -38.2%

Learn more about the price reduction by visiting the pricing page.

Accuracy improvements for content moderation

Organizations need a scalable solution to make sure users aren’t exposed to inappropriate content from user-generated and third-party content in social media, ecommerce, and photo-sharing applications.

The Amazon Rekognition Content Moderation API helps you automatically detect inappropriate or unwanted content to streamline moderation workflows.

With the Amazon Rekognition Content Moderation API, you now get improved accuracy across all ten top-level categories (such as explicit nudity, violence, and tobacco) and all 35 subcategories.

The improvements in image model moderation reduce false positive rates across all moderation categories. Lower false positive rates lead to lower volumes of images flagged for further review by human moderators, reducing their workload and improving efficiency. When combined with a price reduction for image APIs, you get more value for your content moderation solution at lower prices. Learn more about the improved Content Moderation API by visiting Moderating content.
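
A minimal sketch of calling the Content Moderation API with boto3 (the bucket and image names are placeholders):

import boto3

# Detect moderation labels in an image stored in Amazon S3.
rekognition = boto3.client("rekognition")
response = rekognition.detect_moderation_labels(
    Image={"S3Object": {"Bucket": "my-bucket", "Name": "photo.jpg"}},  # placeholder bucket/object
    MinConfidence=60,
)
for label in response["ModerationLabels"]:
    print(label["Name"], label["ParentName"], round(label["Confidence"], 1))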

11 Street is an online shopping company. They’re using Amazon Rekognition to automate the review of images and videos. “As part of 11st’s interactive experience, and to empower our community to express themselves, we have a feature where users can submit a photo or video review of the product they have just purchased. For example, a user could submit a photo of themselves wearing the new makeup they just bought. To make sure that no images or videos contain content that is prohibited by our platform guidelines, we originally resorted to manual content moderation. We quickly found that this was costly, error-prone, and not scalable. We then turned to Amazon Rekognition for Content Moderation, and found that it was easy to test, deploy, and scale. We are now able to automate the review of more than 7,000 uploaded images and videos every day with Amazon Rekognition, saving us time and money. We look forward to the new model update that the Amazon Rekognition team is releasing soon.” – 11 Street Digital Transformation team

Flipboard is a content recommendation platform that enables publishers, creators, and curators to share stories with readers to help them stay up to date on their passions and interests. Says Anuj Ahooja, Senior Engineering Manager at Flipboard: “On average, Flipboard processes approximately 90 million images per day. To maintain a safe and inclusive environment and to confirm that all images comply with platform guidelines at scale, it is crucial to implement a content moderation workflow using ML. However, building models for this system internally was labor-intensive and lacked the accuracy necessary to meet the high-quality standards Flipboard users expect. This is where Amazon Rekognition became the right solution for our product. Amazon Rekognition is a highly accurate, easily deployed, and performant content moderation platform that provides a robust moderation taxonomy. Since putting Amazon Rekognition into our workflows, we’ve been catching approximately 63,000 images that violate our standards per day. Moreover, with frequent improvements like the latest content moderation model update, we can be confident that Amazon Rekognition will continue to help make Flipboard an even more inclusive and safe environment for our users over time.”

Yelp connects people with great local businesses. With unmatched local business information, photos, and review content, Yelp provides a one-stop local platform for consumers to discover, connect, and transact with local businesses of all sizes by making it easy to request a quote, join a waitlist, and make a reservation, appointment, or purchase. Says Alkis Zoupas, Head of Trust and Safety Engineering at Yelp: “Yelp’s mission is to connect people with great local businesses, and we take significant measures to give people access to reliable and useful information. As part of our multi-stage, multi-model approach to photo classification, we use Amazon Rekognition to tune our systems for various outcomes and levels of filtering. Amazon Rekognition has helped reduce development time, allowing us to be more effective with our resource utilization and better prioritize what our teams should focus on.”

Support for seven more languages and accuracy improvements for text analysis

Customers use Amazon Rekognition text detection for a variety of applications, such as ensuring that images comply with corporate policies, analyzing marketing assets, and reading street signs. With the Amazon Rekognition DetectText API, you can detect text in images and check it against your list of inappropriate words and phrases. In addition, you can further enable content redaction by using the detected text bounding box area to blur sensitive information.

The newest version of the DetectText API now supports Arabic, French, German, Italian, Portuguese, Russian, and Spanish languages in addition to English. The DetectText API also provides improved accuracy for detecting curved and vertical text in images. With the expanded language support and higher accuracy for curved and vertical text, you can scale and improve your content moderation, text moderation, and other text detection workflows.
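
A short sketch of the DetectText API with boto3, checking detections against a custom word list (the bucket, image, and word list are placeholders):

import boto3

# Detect text in an image and flag words from a custom block list.
rekognition = boto3.client("rekognition")
blocked_words = {"spam", "scam"}  # placeholder list

response = rekognition.detect_text(
    Image={"S3Object": {"Bucket": "my-bucket", "Name": "ad.png"}}  # placeholder bucket/object
)
for detection in response["TextDetections"]:
    if detection["Type"] == "WORD" and detection["DetectedText"].lower() in blocked_words:
        box = detection["Geometry"]["BoundingBox"]  # region that could be blurred/redacted
        print("Flagged:", detection["DetectedText"], box)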

OLX Group is one of the world’s fastest-growing networks of trading platforms, with operations in over 30 countries and over 20 brands worldwide. Says Jaroslaw Szymczak, Data Science Manager at OLX Group: “As a leader in the classifieds marketplace sector, and to foster a safe, inclusive, and vibrant buying and selling community, it is paramount that we make sure that all products listed on our platforms comply with our rules for product display and authenticity. To do that, among other aspects of the ads, we have placed focus on analyzing the non-organic text featured on images uploaded by our users. We tested Amazon Rekognition’s text detection functionality for this purpose and found that it was highly accurate and augmented our in-house violations detection systems, helping us improve our moderation workflows. Using Amazon Rekognition for text detection, we were able to flag 350,000 policy violations last year. It has also helped us save significant amounts in development costs and has allowed us to refocus data science time on other projects. We are very excited about the upcoming text model update as it will even further expand our capabilities for text analysis.”

VidMob is a leading creative analytics platform that uses data to understand the audience, improve ads, and increase marketing performance. Says James Kupernick, Chief Technology Officer at VidMob: “At VidMob, our goal is to maximize ROI for our customers by leveraging real-time insights into creative content. We have been working with the Amazon Rekognition team for years to extract meaningful visual metadata from creative content, helping us drive data-driven outcomes for our customers. It is of the utmost importance that our customers get actionable data signals. In turn, we have used Amazon Rekognition’s text detection feature to determine when there is overlaid text in a creative and classify that text in a way that creates unique insights. We can scale this process using the Amazon Rekognition Text API, allowing our data science and engineers teams to create differentiated value. In turn, we are very excited about the new text model update and the addition of new languages so that we can better support our international clients.”

Simplicity and scalability for AutoML

Amazon Rekognition Custom Labels is an AutoML service that allows you to build custom computer vision models to detect objects and scenes in images specific to your business needs. For example, with Rekognition Custom Labels, you can develop solutions for detecting brand logos, proprietary machine parts, and items on store shelves without the need for in-depth ML expertise. Instead, your critical ML experts can continue working on higher-value projects.

With the new capabilities in Rekognition Custom Labels, you can simplify and scale your workflows for custom computer vision models.

First, you can train your computer vision model in four simple steps with a few clicks. You get a guided step-by-step console experience with directions for creating projects, creating image datasets, annotating and labeling images, and training models.

Next, we improved our underlying ML algorithms. As a result, you can now build high-quality models with less training data to detect vehicles, their make, or possible damages to vehicles.

Finally, we have introduced seven new APIs to make it even easier for you to build and train computer vision models programmatically. With the new APIs, you can do the following:

  • Create, copy, or delete datasets
  • List the contents and get details of the datasets
  • Modify datasets and auto-split them to create a test dataset

For more information, visit the Rekognition Custom Labels Guide.
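
As a hedged sketch of the programmatic flow (the project name and manifest location are placeholders, and the request shapes shown are assumptions to verify against the Rekognition Custom Labels Guide):

import boto3

# Create a Custom Labels project and a training dataset from an existing manifest.
rekognition = boto3.client("rekognition")

project_arn = rekognition.create_project(ProjectName="my-custom-labels-project")["ProjectArn"]  # placeholder name

rekognition.create_dataset(
    ProjectArn=project_arn,
    DatasetType="TRAIN",
    DatasetSource={  # assumed shape: a SageMaker Ground Truth manifest in S3
        "GroundTruthManifest": {
            "S3Object": {"Bucket": "my-bucket", "Name": "train/manifest.json"}  # placeholders
        }
    },
)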

Prodege, LLC is a cutting-edge marketing and consumer insights platform that leverages its global audience of reward program members to power its business solutions. Prodege uses Rekognition Custom Labels to detect anomalies in store receipts. Says Arun Gupta, Director, Business Intelligence at Prodege: “By using Rekognition Custom Labels, Prodege was able to detect anomalies with high precision across store receipt images being uploaded by our valued members as part of our rewards program offerings. The best part of Rekognition Custom Labels is that it’s easy to set up and requires only a small set of pre-classified images (a couple of hundred in our case) to train the ML model for high confidence image detection. The model’s endpoints can be easily accessed using the API. Rekognition Custom Labels has been an extremely effective solution to enable the smooth functioning of our validated receipt scanning product and helped us save a lot of time and resources performing manual detection. The new console experience of Rekognition Custom Labels has made it even easier to build and train a model, especially with the added capability of updating and deleting an existing dataset. This will significantly improve our constant iteration of training models as we grow and add more data in the pursuit of enhancing our model performance. I can’t even thank the AWS Support Team enough, who has been diligently helping us with all aspects of the product through this journey.”

Says Arnav Gupta, Global AWS Practice Lead at Quantiphi: “As an advanced consulting partner for AWS, Quantiphi has been leveraging Amazon’s computer vision services such as Amazon Rekognition and Amazon Textract to solve some of our customer’s most pressing business challenges. The simplified and guided experience offered by the updated Rekognition Custom Labels console and the new APIs has made it easier for us to build and train computer vision models, significantly reducing the time to deliver solutions from months to weeks for our customers. We have also built our document processing solution Qdox on top of Amazon Textract, which has enabled us to provide our own industry-specific document processing solutions to customers.”

Get started with Amazon Rekognition

With the new features we’re announcing today, you can increase the accuracy of your content moderation workflows, deploy text moderation solutions across a broader range of scenarios and languages, and simplify your AutoML implementation. In addition, you can use the price reduction on the image APIs to analyze more images with your existing budget. Get started with Amazon Rekognition today.


About the Author

Roger Barga is the GM of Computer Vision at AWS.

Read More