Extract granular sentiment in text with Amazon Comprehend Targeted Sentiment

Amazon Comprehend is a natural language processing (NLP) service that uses machine learning (ML) to discover insights from text. As a fully managed service, Amazon Comprehend requires no ML expertise and can scale to large volumes of data. Amazon Comprehend provides several different APIs to easily integrate NLP into your applications. You can simply call the APIs in your application and provide the location of the source document or text. The APIs output entities, key phrases, sentiment, document classification, and language in an easy-to-use format for your application or business.

The sentiment analysis APIs provided by Amazon Comprehend help businesses determine the sentiment of a document. You can gauge the overall sentiment of a document as positive, negative, neutral, or mixed. However, to understand the sentiment associated with specific products or brands, businesses have had to employ workarounds such as chunking the text into logical blocks and inferring the sentiment expressed towards a specific product.

To help simplify this process, starting today, Amazon Comprehend is launching the Targeted Sentiment feature for sentiment analysis. It identifies groups of mentions (co-reference groups) that correspond to a single real-world entity or attribute, provides the sentiment associated with each entity mention, and classifies each entity against a pre-determined list of entity types.

This post provides an overview of how you can get started with Amazon Comprehend targeted sentiment, demonstrates what you can do with the output, and walks through three common targeted sentiment use cases.

Solution overview

The following is an example of targeted sentiment:

“Spa” is the primary entity, identified as type facility, and is mentioned two more times, referred to by the pronoun “it.” The Targeted Sentiment API provides the sentiment towards each entity. Positive sentiment is shown in green, negative in red, and neutral in blue. We can also see how the sentiment towards the spa changes throughout the sentence. We dive deeper into the API later in the post.

This capability opens up several possibilities for businesses. Marketing teams can track sentiment toward their brands on social media over time. Ecommerce merchants can understand which specific attributes of their products were best- and worst-received by customers. Call center operators can use the feature to mine transcripts for escalation issues and to monitor customer experience. Restaurants, hotels, and other hospitality industry organizations can use the service to turn broad ratings categories into rich descriptions of good and bad customer experiences.

Targeted sentiment use cases

The Targeted Sentiment API in Amazon Comprehend takes text data such as social media posts, application reviews, and call center transcriptions as input. Then it analyzes the input using the power of NLP algorithms to extract entity-level sentiment automatically. An entity is a textual reference to the unique name of a real-world object, such as people, places, and commercial items, in addition to precise references to measures such as dates and quantities. For a full list of supported entities, refer to Targeted Sentiment Entities.

We use the Targeted Sentiment API to enable the following use cases:

  • A business can identify parts of the employee/customer experience that are enjoyable and parts that may be improved.
  • Contact centers and customer service teams can analyze call transcriptions or chat logs to gauge agent training effectiveness and surface conversation details such as specific reactions from a customer and the phrases or words that were used to elicit that response.
  • Product owners and UI/UX developers can identify features of their product that users enjoy and parts that require improvement. This can support product roadmap discussions and prioritizations.

The following diagram illustrates the targeted sentiment process:

In this post, we demonstrate this process using the following three sample reviews:

  • Sample 1: Business and product review – “I really like how thick the jacket is. I wear a large jacket because I have broad shoulders and that’s what I ordered and it fits perfectly there. I almost feel like it balloons out from the chest down. I thought I would use the strings in the bottom of the jacket to help close it and bring it in, but those don’t work. The jacket feels very bulky.”
  • Sample 2: Contact center transcription – “Hi there, there is a fraud block on my credit card, can you remove it for me. My credit card keeps getting flagged for fraud. It is quite annoying, every time I go to use it, I keep getting declined. I’m going to cancel the card if this happens again.”
  • Sample 3: Employer feedback survey – “I’m glad management is upskilling the team. But the instructor did not go over the basics well. Management should do more due diligence on everyone’s skill level for future sessions.”

Prepare the data

To get started, download the sample files containing the example text using the AWS Command Line Interface (AWS CLI) by running the following command:

aws s3 cp s3://aws-blogs-artifacts-public/artifacts/ML-8148/ts-sample-data.zip .

Create an Amazon Simple Storage Service (Amazon S3) bucket, unzip the downloaded file, and upload the folder containing the three sample files. Make sure you’re using the same Region throughout.
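
If you prefer to script this step, the following is a minimal boto3 sketch; the bucket name and Region are placeholder assumptions, and the equivalent AWS CLI commands (aws s3 mb and aws s3 cp --recursive) work just as well:

import os
import boto3

region = "us-east-1"                      # use the same Region throughout
bucket = "my-targeted-sentiment-demo"     # placeholder; bucket names must be globally unique

s3 = boto3.client("s3", region_name=region)
if region == "us-east-1":
    s3.create_bucket(Bucket=bucket)
else:
    s3.create_bucket(Bucket=bucket, CreateBucketConfiguration={"LocationConstraint": region})

# Upload every file in the unzipped local ts-sample-data folder, preserving the prefix
for root, _, files in os.walk("ts-sample-data"):
    for name in files:
        local_path = os.path.join(root, name)
        key = os.path.relpath(local_path).replace(os.sep, "/")
        s3.upload_file(local_path, bucket, key)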

You can now access the three sample text files in your S3 bucket.

Create a job in Amazon Comprehend

After you upload the files to your S3 bucket, complete the following steps:

  1. On the Amazon Comprehend console, choose Analysis jobs in the navigation pane.
  2. Choose Create job.
  3. For Name, enter a name for your job.
  4. For Analysis type, choose Targeted sentiment.
  5. Under Input data, enter the Amazon S3 location of the ts-sample-data folder.
  6. For Input format, choose One document per file.

You can change this configuration if your data is in a single file delimited by lines.

  7. Under Output location, enter the Amazon S3 location where you want to save the job output.
  8. Under Access permissions, for IAM role, choose an existing AWS Identity and Access Management (IAM) role or create one that has permissions to the S3 bucket.
  9. Leave the other options as default and choose Create job.

After you start the job, you can review your job details. The total job runtime depends on the size of the input data.

  10. When the job is complete, under Output, choose the link to the output data location.

Here you can find a compressed output file.

  11. Download and decompress the file.

You can now inspect the output files for each sample text. Open the files in your preferred text editor to review the API response structure. We describe this in more detail in the next section.
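
The console steps above can also be automated. The following boto3 sketch starts the same analysis job with the StartTargetedSentimentDetectionJob API; the bucket paths and IAM role ARN are placeholders:

import time
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

response = comprehend.start_targeted_sentiment_detection_job(
    JobName="ts-sample-job",
    LanguageCode="en",
    DataAccessRoleArn="arn:aws:iam::<account-id>:role/<comprehend-data-access-role>",  # placeholder
    InputDataConfig={
        "S3Uri": "s3://<your-bucket>/ts-sample-data/",  # placeholder
        "InputFormat": "ONE_DOC_PER_FILE",
    },
    OutputDataConfig={"S3Uri": "s3://<your-bucket>/ts-output/"},  # placeholder
)

# Poll until the job finishes, then download the output from the S3 location in the job properties
while True:
    job = comprehend.describe_targeted_sentiment_detection_job(JobId=response["JobId"])
    status = job["TargetedSentimentDetectionJobProperties"]["JobStatus"]
    if status in ("COMPLETED", "FAILED"):
        break
    time.sleep(60)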

API response structure

The Targeted Sentiment API provides a simple way to consume the output of your jobs. It provides a logical grouping of the entities (entity groups) detected, along with the sentiment for each entity. The following are some definitions of the fields that are in the response:

  • Entities – The significant parts of the document. For example, Person, Place, Date, Food, or Taste.
  • Mentions – The references or mentions of the entity in the document. These can be pronouns or common nouns such as “it,” “him,” “book,” and so on. These are organized in order by location (offset) in the document.
  • DescriptiveMentionIndex – The index in Mentions that gives the best depiction of the entity group. For example, “ABC Hotel” instead of “hotel,” “it,” or other common noun mentions.
  • GroupScore – The confidence that all the entities mentioned in the group are related to the same entity (such as “I,” “me,” and “myself” referring to one person).
  • Text – The text in the document that depicts the entity.
  • Type – A description of what the entity depicts.
  • Score – The model confidence that this is a relevant entity.
  • MentionSentiment – The actual sentiment found for the mention.
  • Sentiment – The string value of positive, neutral, negative, or mixed.
  • SentimentScore – The model confidence for each possible sentiment.
  • BeginOffset – The offset into the document text where the mention begins.
  • EndOffset – The offset into the document text where the mention ends.

To demonstrate this visually, let’s take the output of the third use case, the employer feedback survey, and walk through the entity groups that represent the employee completing the survey, management, and the instructor.

Let’s first look at all the mentions of the co-reference entity group associated with “I” (the employee writing the response) and the location of the mention in the text. DescriptiveMentionIndex represents indexes of the entity mentions that best depict the co-reference entity group (in this case I):

{
      "DescriptiveMentionIndex": [
        0
      ],
      "Mentions": [
        {
          "BeginOffset": 0,
          "EndOffset": 1,
          "Score": 0.999997,
          "GroupScore": 1,
          "Text": "I",
          "Type": "PERSON",
          "MentionSentiment": {
            "Sentiment": "NEUTRAL",
            "SentimentScore": {
              "Mixed": 0,
              "Negative": 0,
              "Neutral": 1,
              "Positive": 0
            }
          }
        }
      ]
    }

The next group of entities provides all mentions of the co-reference entity group associated with management, along with its location in the text. DescriptiveMentionIndex represents indexes of the entity mentions that best depict the co-reference entity group (in this case management). Something to observe in this example is the sentiment shift towards management. You can use this data to infer what parts of management’s actions were perceived as positive, and what parts were perceived as negative and therefore can be improved upon.

{
      "DescriptiveMentionIndex": [
        0,
        1
      ],
      "Mentions": [
        {
          "BeginOffset": 9,
          "EndOffset": 19,
          "Score": 0.999984,
          "GroupScore": 1,
          "Text": "management",
          "Type": "ORGANIZATION",
          "MentionSentiment": {
            "Sentiment": "POSITIVE",
            "SentimentScore": {
              "Mixed": 0,
              "Negative": 0,
              "Neutral": 0,
              "Positive": 1
            }
          }
        },
        {
          "BeginOffset": 103,
          "EndOffset": 113,
          "Score": 0.999998,
          "GroupScore": 0.999896,
          "Text": "Management",
          "Type": "ORGANIZATION",
          "MentionSentiment": {
            "Sentiment": "NEGATIVE",
            "SentimentScore": {
              "Mixed": 0.000149,
              "Negative": 0.990075,
              "Neutral": 0.000001,
              "Positive": 0.009775
            }
          }
        }
      ]
    }

To conclude, let’s observe all mentions of the instructor and the location in the text. DescriptiveMentionIndex represents indexes of the entity mentions that best depict the co-reference entity group (in this case instructor):

{
      "DescriptiveMentionIndex": [
        0
      ],
      "Mentions": [
        {
          "BeginOffset": 52,
          "EndOffset": 62,
          "Score": 0.999996,
          "GroupScore": 1,
          "Text": "instructor",
          "Type": "PERSON",
          "MentionSentiment": {
            "Sentiment": "NEGATIVE",
            "SentimentScore": {
              "Mixed": 0,
              "Negative": 0.999997,
              "Neutral": 0.000001,
              "Positive": 0.000001
            }
          }
        }
      ]
    }
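
To analyze this output at scale, it helps to flatten it into one row per entity mention. The following Python sketch assumes the decompressed output is a JSON Lines file in which each line contains an Entities array shaped like the preceding examples:

import json

rows = []
with open("output", "r") as f:  # path to the decompressed output file (placeholder)
    for line in f:
        doc = json.loads(line)
        for group in doc.get("Entities", []):
            mentions = group["Mentions"]
            # Use the first descriptive mention as the display name for the entity group
            name = mentions[group["DescriptiveMentionIndex"][0]]["Text"]
            for m in mentions:
                rows.append({
                    "entity": name,
                    "mention": m["Text"],
                    "type": m["Type"],
                    "sentiment": m["MentionSentiment"]["Sentiment"],
                    "begin": m["BeginOffset"],
                    "end": m["EndOffset"],
                })

for r in rows:
    print(r)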

Reference architecture

You can apply targeted sentiment to many scenarios and use cases to drive business value, such as the following:

  • Determine efficacy of marketing campaigns and feature launches by detecting the entities and mentions that contain the most positive or negative feedback
  • Query output to determine which entities and mentions relate to a corresponding entity (positive, negative, or neutral)
  • Analyze sentiment across the customer interaction lifecycle in contact centers to demonstrate efficacy of process or training changes

The following diagram depicts an end-to-end process:

Conclusion

Understanding the interactions and feedback organizations receive from customers about their products and services remains crucial in developing better products and customer experiences. As such, more granular details are required to infer better outcomes.

In this post, we provided some examples of how using these granular details can help organizations improve products, customer experiences, and training while also incentivizing and validating positive attributes. There are many use cases across industries where you can experiment with and gain value from targeted sentiment.

We encourage you to try this new feature with your use cases. For more information and to get started, refer to Targeted Sentiment.


About the Authors

Raj Pathak is a Solutions Architect and Technical advisor to Fortune 50 and Mid-Sized FSI (Banking, Insurance, Capital Markets) customers across Canada and the United States. Raj specializes in Machine Learning with applications in Document Extraction, Contact Center Transformation and Computer Vision.

Sanjeev Pulapaka is a Senior Solutions Architect in the U.S. Fed Civilian SA team at Amazon Web Services (AWS). He works closely with customers in building and architecting mission critical solutions. Sanjeev has extensive experience in leading, architecting and implementing high-impact technology solutions that address diverse business needs in multiple sectors including commercial, federal, state and local governments. He has an undergraduate degree in engineering from the Indian Institute of Technology and an MBA from the University of Notre Dame.

Read More

Amazon SageMaker Autopilot now supports time series data

Amazon SageMaker Autopilot automatically builds, trains, and tunes the best machine learning (ML) models based on your data, while allowing you to maintain full control and visibility. We have recently announced support for time series data in Autopilot. You can use Autopilot to tackle regression and classification tasks on time series data, or sequence data in general. Time series data is a special type of sequence data where data points are collected at even time intervals.

Manually preparing the data, selecting the right ML model, and optimizing its parameters is a complex task, even for an expert practitioner. Although automated approaches exist that can find the best models and their parameters, these typically can’t handle data that comes as sequences, such as network traffic, electricity consumption, or household expenses recorded over time. Because this data takes the form of observations acquired at different time points, consecutive observations can’t be treated as independent of each other and need to be processed as a whole. You can use Autopilot for a wide range of problems dealing with sequential data. For example, you can classify network traffic recorded over time to identify malicious activities, or determine if individuals qualify for a mortgage based on their credit history. You provide a dataset containing time series data and Autopilot handles the rest, processing the sequential data through specialized feature transforms and finding the best model on your behalf.

Autopilot eliminates the heavy lifting of building ML models, and helps you automatically build, train, and tune the best ML model based on your data. Autopilot runs several algorithms on your data and tunes their hyperparameters on a fully managed compute infrastructure. In this post, we demonstrate how you can use Autopilot to solve classification and regression problems on time series data. For instructions on creating and training an Autopilot model, see Customer Churn Prediction with Amazon SageMaker Autopilot.

Time series data classification using Autopilot

As a running example, we consider a multi-class problem on the time series dataset UWaveGestureLibraryX, containing equidistant readings of accelerometer sensors while performing one of eight predefined hand gestures. For simplicity, we consider only the X dimension of the accelerometer. The task is to build a classification model to map the time series data from the sensor readings to the predefined gestures. The following figure shows the first rows of the dataset in CSV format. The entire table consists of 896 rows and two columns: the first column is a gesture label and the second column is a time series of sensor readings.

Convert data to the right format with Amazon SageMaker Data Wrangler

On top of accepting numerical, categorical, and standard text columns, Autopilot now also accepts a sequence input column. If your time series data doesn’t follow this format, you can easily convert it through Amazon SageMaker Data Wrangler. Data Wrangler reduces the time it takes to aggregate and prepare data for ML from weeks to minutes. With Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization from a single visual interface. For instance, consider the same dataset but in a different input format: each gesture (specified by ID) is a sequence of equidistant measurements of the accelerometer. When stored vertically, each row contains a timestamp and one value. The following figure compares this data in its original format and a sequence format.

To convert this dataset to the format described earlier using Data Wrangler, load the dataset from Amazon Simple Storage Service (Amazon S3). Then use the time series Group by transform, as shown in the following screenshot, and export the data back to Amazon S3 in CSV format.
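
For illustration, a rough pandas equivalent of this Group by reshaping might look like the following sketch; the column names (id, label, timestamp, value) and the comma-separated sequence encoding are assumptions, and Data Wrangler takes care of producing the exact format Autopilot expects:

import pandas as pd

vertical = pd.read_csv("uwave_vertical.csv")  # assumed columns: id, label, timestamp, value

sequences = (
    vertical.sort_values(["id", "timestamp"])
    .groupby(["id", "label"])["value"]
    .apply(lambda s: ",".join(str(v) for v in s))  # one sequence string per gesture
    .reset_index(name="sensor_readings")
)

sequences[["label", "sensor_readings"]].to_csv("uwave_sequences.csv", index=False)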

When the dataset is in its designated format, you can proceed with Autopilot. To check out other time series transformers in Data Wrangler, refer to Prepare time series data with Amazon SageMaker Data Wrangler.

Launch an AutoML job

As with other input types supported by Autopilot, each row of the dataset is a different observation and each column is a feature. In this example, we have a single column containing time series data, but you can have multiple time series columns. You can also have multiple columns with different input types, such as time series, text, and numerical.

To create an Autopilot experiment, place the dataset in an S3 bucket and create a new experiment within Amazon SageMaker Studio. As shown in the following screenshot, you must specify the name of the experiment, the S3 location of the dataset, the S3 location for the output artifacts, and the column name to predict.
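
Alternatively, you can launch the same experiment programmatically with the SageMaker Python SDK. The following is a minimal sketch; the S3 paths and the target column name (label) are placeholder assumptions:

import sagemaker
from sagemaker.automl.automl import AutoML

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # or an explicit IAM role ARN

automl = AutoML(
    role=role,
    target_attribute_name="label",                    # column to predict (assumed name)
    output_path="s3://<your-bucket>/uwave-output/",   # placeholder
    problem_type="MulticlassClassification",
    job_objective={"MetricName": "Accuracy"},
    max_candidates=250,
    sagemaker_session=session,
)

automl.fit(
    inputs="s3://<your-bucket>/uwave-input/uwave_sequences.csv",  # placeholder
    job_name="uwave-autopilot-experiment",
)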

Autopilot analyzes the data, generates ML pipelines, and runs a default 250 iterations of hyperparameter optimization on this classification task. As shown in the following model leaderboard, Autopilot reaches 0.821 accuracy, and you can deploy the best model in just one click.

In addition, Autopilot generates a data exploration report, where you can visualize and explore your data.

Transparency is foundational for Autopilot. You can inspect and modify generated ML pipelines within the candidate definition notebook. The following screenshot demonstrates how Autopilot recommends a range of pipelines, combining the time series transformer TSFeatureExtractor with different ML algorithms, such as gradient boosted decision trees and linear models. The TSFeatureExtractor extracts hundreds of time series features for you, which are then fed to the downstream algorithms to make predictions. For the full list of time series features, refer to Overview on extracted features.

Conclusion

In this post, we demonstrated how to use SageMaker Autopilot to solve time series classification and regression problems in just a few clicks.

For more information about Autopilot, see Amazon SageMaker Autopilot. To explore related features of SageMaker, see Amazon SageMaker Data Wrangler.


About the Authors

Nikita Ivkin is an Applied Scientist, Amazon SageMaker Data Wrangler.

Anne Milbert is a Software Development engineer working on Amazon SageMaker Automatic Model Tuning.

Valerio Perrone is an Applied Science Manager working on Amazon SageMaker Automatic Model Tuning and Autopilot.

Meghana Satish is a Software Development engineer working on Amazon SageMaker Automatic Model Tuning.

Ali Takbiri is an AI/ML specialist Solutions Architect, and helps customers by using Machine Learning to solve their business challenges on the AWS Cloud.

Read More

Enable Amazon SageMaker JumpStart for custom IAM execution roles

With an Amazon SageMaker Domain, you can onboard users with an AWS Identity and Access Management (IAM) execution role different than the Domain execution role. In such a case, the onboarded Domain user can’t create projects using templates and Amazon SageMaker JumpStart solutions. This post outlines an automated approach to enable JumpStart for Domain users with a custom execution role. We walk you through two different use cases for enabling JumpStart and how to solve these cases programmatically. The automated solution can help you scale your process to enable JumpStart for Domain users with custom roles, increasing the productivity of your data science team and Amazon SageMaker Studio administrators.

JumpStart is a feature within Studio that helps you quickly and easily get started with machine learning (ML). As more and more customers adopt ML and Amazon SageMaker, JumpStart is making it easier for data science and ML teams to access and fine-tune more than 150 popular open-source models, such as natural language processing, object detection, and image classification models.

Solution overview

JumpStart requires a SageMaker Domain with project templates enabled for the account and Studio users, as shown in the following screenshot.

If enabled, this setting allows users (configured to use the Domain execution role) to create projects using templates and JumpStart solutions. In the scenario where the user’s execution role is different than the Domain execution role, JumpStart remains disabled for that user even when it’s enabled on the Domain. We address this custom role scenario and the automated solution in the following sections.

In this solution, we address the issue for the following two cases:

  • Use case 1 – Enabling JumpStart in an automated manner for existing Domain users with custom roles regardless of apps assigned
  • Use case 2 – Providing a reference script that you can use to programmatically enable JumpStart while onboarding a new Domain user with a custom role

Domain user onboarding

After you create a Domain, you can onboard users to launch apps (such as Studio, RStudio, or Canvas). You must assign a default execution role to a Domain user during the creation process, as shown in the following screenshot.

You can choose a role different than the Domain execution role for a user. However, this may disable JumpStart for such users even when it’s enabled on the Domain. This behavior is because SageMaker makes no assumptions about a custom role and its permissions boundary. The required permissions and policies have to be assigned explicitly to access templates and JumpStart solutions published by SageMaker in AWS Service Catalog.

You can enable SageMaker Projects and JumpStart manually for every user by selecting the user profile on the SageMaker Domain control panel. However, this process can be time-consuming if a user already has some apps assigned. The Edit button at the bottom right is only enabled when no apps are assigned to that user (see the following screenshot). You have to delete the assigned apps first in order to edit a user profile.

The cause of the disabled JumpStart feature is evident during Step 2 of editing a user profile, where a message states “If there are individual users using custom execution roles in your organization, you need to enable them on the user profile page.”

In the following sections, we walk you through two automated solutions that cover use cases for both existing and new Domain users.

Prerequisites

The steps described as part of this solution have the following prerequisites:

  • You have created a SageMaker Domain
  • The SageMaker Domain authentication method is IAM
  • Custom roles assigned to the SageMaker Domain users have the AmazonSageMakerFullAccess policy attached

In order for JumpStart Solutions to be enabled for users, the AWS Service Catalog portfolio Amazon SageMaker Solutions and ML Ops products must be imported into the account, and this portfolio must be associated with the role that runs SageMaker. The role association is necessary so that Studio can invoke AWS Service Catalog APIs associated with the Solutions portfolio.

As a general best practice, we recommend testing the process in a non-production environment followed by validation tests to make sure everything is configured and operating as per your expectations before making changes to the production environment.

Use case 1: Enable JumpStart for all existing Domain users with a custom role

Let’s first consider the use case for existing users and enable JumpStart for those users in an automated way.

To achieve this, we have created an AWS CloudFormation template that you can run in the same Region where the SageMaker Domain exists.

The CloudFormation stack contained in the attached jumpstart_solutions_resources.template.yaml file has the following components:

  • AmazonSageMakerServiceCatalogProductsLaunchRole and AmazonSageMakerServiceCatalogProductsUseRole – Creates these two IAM roles, if they don’t already exist.
  • 1PProductUseRolePolicy – Creates this policy used by AmazonSageMakerServiceCatalogProductsUseRole, if this role doesn’t already exist.
  • setup_solutions_tests_portfolio – An AWS Lambda function that performs the AWS Service Catalog portfolio import and role association by calling Boto3 APIs. This function is called once during CloudFormation stack creation.
  • LambdaIAMRole role – Used by the function setup_solutions_tests_portfolio for calling AWS Service Catalog and SageMaker APIs.
  • SetupPortfolioInvoker – Invokes the function setup_solutions_tests_portfolio.

After the Lambda function runs as part of the CloudFormation deployment, it retrofits all the existing SageMaker Domain users to enable JumpStart and Projects for them. For more information on creating and monitoring a CloudFormation stack, refer to How does AWS CloudFormation work.
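
The core of that retrofit logic is to associate each existing user profile's execution role with the Solutions portfolio. The following simplified boto3 sketch (not the actual Lambda code from the template) shows the idea, with a placeholder Domain ID and no error handling:

import boto3

sagemaker_client = boto3.client("sagemaker")
sc_client = boto3.client("servicecatalog")

domain_id = "d-xxxxxxxxxxxx"  # placeholder SageMaker Domain ID

# Import the SageMaker Solutions portfolio into the account (idempotent)
sagemaker_client.enable_sagemaker_servicecatalog_portfolio()

# Find the portfolio shared by Amazon SageMaker
portfolio_id = None
for portfolio in sc_client.list_accepted_portfolio_shares()["PortfolioDetails"]:
    if portfolio["ProviderName"] == "Amazon SageMaker":
        portfolio_id = portfolio["Id"]

# Associate every existing user profile's execution role with the portfolio
paginator = sagemaker_client.get_paginator("list_user_profiles")
for page in paginator.paginate(DomainIdEquals=domain_id):
    for profile in page["UserProfiles"]:
        details = sagemaker_client.describe_user_profile(
            DomainId=domain_id, UserProfileName=profile["UserProfileName"]
        )
        role_arn = details.get("UserSettings", {}).get("ExecutionRole")
        if role_arn:
            sc_client.associate_principal_with_portfolio(
                PortfolioId=portfolio_id,
                PrincipalARN=role_arn,
                PrincipalType="IAM",
            )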

Use case 2: Enable JumpStart for a single Domain user with a custom role

Many customers prefer to scale the Domain user onboarding process by automating it programmatically. In this section, we provide a Python script reference that you can use as part of the onboarding process to enable JumpStart for a new user with a custom role. This Python script performs the required association for the given user role. The automated process calling this script must have permission to use AWS Service Catalog and SageMaker APIs. See the following code:

import boto3

sagemaker_client = boto3.client("sagemaker")
sc_client = boto3.client("servicecatalog")

# Function to return the 'Amazon SageMaker' Solutions portfolio ID
def get_solutions_portfolio_id(sc_client):
    portfolio_shares = sc_client.list_accepted_portfolio_shares()
    for portfolio in portfolio_shares['PortfolioDetails']:
        if portfolio['ProviderName'] == 'Amazon SageMaker':
            return portfolio['Id']

portfolio_id = get_solutions_portfolio_id(sc_client)

# Import the Solutions Service Catalog portfolio into the account
sagemaker_client.enable_sagemaker_servicecatalog_portfolio()

# Associate the user's custom execution role with the portfolio
custom_role_arn = "arn:aws:iam::<account-id>:role/<custom-execution-role>"  # replace with the custom role ARN

sc_client.associate_principal_with_portfolio(
    PortfolioId=portfolio_id,
    PrincipalARN=custom_role_arn,
    PrincipalType='IAM'
)

You can either call the script independently or embed it as a step within an automated process to create a user profile for onboarding to Studio. For more information on using Boto3, refer to Boto3 reference.
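
For example, a hypothetical onboarding step could create the user profile with the custom role and then run the association, reusing the clients and portfolio_id from the script above; the Domain ID, profile name, and role ARN are placeholders:

custom_role_arn = "arn:aws:iam::<account-id>:role/<custom-execution-role>"  # placeholder

sagemaker_client.create_user_profile(
    DomainId="d-xxxxxxxxxxxx",                 # placeholder
    UserProfileName="new-data-scientist",      # placeholder
    UserSettings={"ExecutionRole": custom_role_arn},
)

sc_client.associate_principal_with_portfolio(
    PortfolioId=portfolio_id,
    PrincipalARN=custom_role_arn,
    PrincipalType="IAM",
)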

Clean up

After all the custom roles are enabled to use JumpStart, we can clean up the resources no longer needed. You can delete the Lambda function setup_solutions_tests_portfolio and the IAM role LambdaIAMRole created by the CloudFormation template. The other two IAM roles, AmazonSageMakerServiceCatalogProductsLaunchRole and AmazonSageMakerServiceCatalogProductsUseRole, and the associated policy 1PProductUseRolePolicy (if created) must not be deleted because they need to exist for accessing JumpStart.

Conclusion

In this post, we shared the steps to enable JumpStart for a custom role for existing users as well as new users programmatically. As always, make sure to validate the steps mentioned in this solution in a non-production environment before deploying to production.

Try it out and let us know if you have any questions in the comments section!

Additional resources

For more information, see the following:


About the Authors

Nikhil Jha is a Senior Technical Account Manager at Amazon Web Services. His focus areas include AI/ML, and analytics. In his spare time, he enjoys playing badminton with his daughter and exploring the outdoors.

Evan Kravitz is a software engineer at Amazon Web Services, working on SageMaker JumpStart. He enjoys cooking and going on runs in New York City.

Read More

Predict residential real estate prices at ImmoScout24 with Amazon SageMaker

This is a guest post by Oliver Frost, data scientist at ImmoScout24, in partnership with Lukas Müller, AWS Solutions Architect.

In 2010, ImmoScout24 released a price index for residential real estate in Germany: the IMX. It was based on ImmoScout24 listings. Besides the price, listings typically contain a lot of specific information such as the construction year, the plot size, or the number of rooms. This information allowed us to build a so-called hedonic price index, which considers the particular features of a real estate property.

When we released the IMX, our goal was to establish it as the standard index for real estate prices in Germany. However, it struggled to capture the price increase in the German property market since the financial crisis of 2008. In addition, like a stock market index, it was an abstract figure that can’t be interpreted directly. The IMX was therefore difficult to grasp for non-experts.

At ImmoScout24, our mission is to make complex decisions easy, and we realized that we needed a new concept to fulfill it. Instead of another index, we decided to build a market report that everyone can easily understand: the WohnBarometer. It’s based on our listings data and takes object properties into account. The key difference from the IMX is that the WohnBarometer shows rent and sale prices in Euro per square meter for specific residential real estate types over time. The figures therefore can be directly interpreted and allow our customers to answer questions such as “Do I pay too much rent?” or “Is the apartment I am about to buy reasonably priced?” or “Which city in my region is the most promising one for investing?” Currently, the WohnBarometer is reported for Germany as a whole, the seven biggest cities, and alternating local markets.

The following graph shows an example of the WohnBarometer, with sale prices for Berlin and the development per quarter.

This post discusses how ImmoScout24 used Amazon SageMaker to create the model for the WohnBarometer in order to make it relevant for our customers. It discusses the underlying data model, hyperparameter tuning, and technical setup. This post also shows how SageMaker supported one data scientist to complete the WohnBarometer within 2 months. It took a whole team 2 years to develop the first version of the IMX. Such an investment was not an option for the WohnBarometer.

About ImmoScout24

ImmoScout24 is the leading online platform for residential and commercial real estate in Germany. For over 20 years, ImmoScout24 has been revolutionizing the real estate market and supports over 20 million users each month on its online marketplace or in its app to find new homes or commercial spaces. That’s why 99% of our target customer group know ImmoScout24. With its digital solutions, the online marketplace coordinates and brings owners, realtors, tenants, and buyers together successfully. ImmoScout24 is working towards the goal of digitizing the process of real estate transactions and thereby making complex decisions easy. Since 2012, ImmoScout24 has also been active in the Austrian real estate market, reaching around 3 million users monthly.

From on-premises to AWS Data Pipeline to SageMaker

In this section, we discuss the previous setup and its challenges, and why we decided to use SageMaker for our new model.

The previous setup

When the first version of the IMX was published in 2010, the cloud was still a mystery to most businesses, including ImmoScout24. The field of machine learning (ML) was in its infancy and only a handful of experts knew how to code a model (for the sake of illustration, the first public release of Scikit-Learn was in February 2010). It’s no surprise that the development of the IMX took more than 2 years and cost a seven-figure sum.

In 2015, ImmoScout24 started its AWS migration, and rebuilt IMX on AWS infrastructure. With the data in our Amazon Simple Storage Service (Amazon S3) data lake, both the data preprocessing and the model training were now done on Amazon EMR clusters orchestrated by AWS Data Pipeline. While the former was a PySpark ETL application, the latter was several Python scripts using classical ML packages (such as Scikit-Learn).

Issues with this setup

Although this setup proved quite stable, troubleshooting the infrastructure or improving the model wasn’t easy. A key problem with the model was its complexity, because some components had begun a life on their own: in the end, the code of the outlier detection was almost twice as long as the code of the core IMX model itself.

The core model, in fact, wasn’t one model, but hundreds: one model per residential real estate type and region, with the definition varying from a single neighborhood in a big city to several villages in rural areas. We had, for example, one model for apartments for sale in the middle of Berlin and one model for houses for sale in a suburb of Munich. Because setting up the training of all these models took a lot of time, we omitted the hyperparameter tuning, which likely led to the models underperforming.

Why we decided on SageMaker

Given these issues and our ambition of having a market report with practical benefits, we had to decide between rewriting large parts of the existing code or starting from scratch. As you can infer from this post, we opted for the latter. But why SageMaker?

Most of our time spent on the IMX went into troubleshooting the infrastructure, not improving the model. For the new market report, we wanted to flip this around, with the focus on the statistical performance of the model. We also wanted to have the flexibility to quickly replace individual components of the model, such as the optimization of the hyperparameters. What if a new superior boosting algorithm comes around (think about how XGBoost hit the stage in 2014)? Of course, we want to adopt it as one of the first!

In SageMaker, the major components of the classical ML workflow—preprocessing, training, hyperparameter tuning, and inference—are neatly separated on the API level and also on the AWS Management Console. Modifying them individually isn’t difficult.

The new model

In this section, we discuss the components of the new model, including its input data, algorithm, hyperparameter tuning, and technical setup.

Input data

The WohnBarometer is based on a sliding window of 5 years of ImmoScout24 listings of residential real estate located in Germany. After we remove outliers and fraudulent listings, we’re left with approximately 4 million listings that are split into train (60 %), validation (20 %), and test data (20 %). The relationship between listings and objects is not necessarily 1:1; over the course of 5 years, it’s likely that the same object is inserted multiple times (by multiple people).

We use 13 listing attributes, such as the location of the property (WGS84 coordinates), the real estate type (house or apartment, sale or rent), its age (years), its size (square meters), or its condition (for example, new or refurbished). Given that each listing typically comes with dozens of attributes, the question arises: which to include in the model? On the one hand, we used domain knowledge; for example, it’s well known that location is a key factor, and in almost all markets new properties are more expensive than existing ones. On the other hand, we relied on our experiences with the IMX and similar models. There we learned that including dozens of attributes doesn’t significantly improve the model.

Depending on the real estate type of the listing, the target variable of our model is either the rent per square meter or the sale price per square meter (we explain later why this choice wasn’t ideal). Unlike the IMX, the WohnBarometer is therefore a number that can be directly interpreted and acted upon by our customers.

Model description

When using SageMaker, you can choose between different strategies of implementing your algorithm:

  • Use one of SageMaker’s built-in algorithms. There are almost 20 and they cover all major ML problem types.
  • Customize a pre-made Docker image based on a standard ML framework (such as Scikit-Learn or PyTorch).
  • Build your own algorithm and deploy it as a Docker image.

For the WohnBarometer, we wanted a solution that is easy to maintain and allows us to focus on improving the model itself, not the underlying infrastructure. Therefore, we decided on the first option: use a fully-managed algorithm with proper documentation and fast support if needed. Next, we needed to pick the algorithm itself. Again, the decision wasn’t difficult: we went for the XGBoost algorithm because it’s one of the most renowned ML algorithms for regression type problems, and we have already successfully used it in several projects.

Hyperparameter tuning

Most ML algorithms come with a myriad of parameters to tweak. Boosting algorithms, for example, have many parameters specifying how exactly the trees are built: Do the trees have at maximum 20 or 30 leaves? Is each tree based on all rows and columns or only samples? How heavily to prune the trees? Finding the optimal values of those parameters (as measured by an evaluation metric of your choice), the so-called hyperparameter tuning, is critical to building a powerful ML model.

A key question in hyperparameter tuning is which parameters to tune and how to set the search ranges. You might ask, why not check all possible combinations? Although in theory this sounds like a good idea, it would result in an enormous hyperparameter space with way too many points to evaluate them all at a reasonable price. That is why ML practitioners typically select a small number of hyperparameters known to have a strong impact on the performance of the chosen algorithm.

After the hyperparameter space is defined, the next task is to find the best combination of values in it. The following techniques are commonly employed:

  • Grid search – Divide the space in a discrete grid and then evaluate all points in the grid with cross-validation.
  • Random search – Randomly draw combinations from the space. With this approach, you’ll most likely miss the best combination, but it serves as a good benchmark.
  • Bayesian optimization – Build a probabilistic model of the objective function and use this model to generate new combinations. The model is updated after each combination, leading quickly to good results.

In recent years, thanks to cheap compute power, Bayesian optimization has become the gold standard in hyperparameter tuning, and is the default setting in SageMaker.
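
To illustrate what this looks like in the SageMaker Python SDK, here is a minimal tuning sketch with the built-in XGBoost algorithm; the hyperparameter ranges, S3 paths, and objective are placeholder assumptions, not our exact WohnBarometer configuration:

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

session = sagemaker.Session()
role = sagemaker.get_execution_role()

xgb = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.2-1"),
    role=role,
    instance_count=2,
    instance_type="ml.m5.12xlarge",
    output_path="s3://<your-bucket>/model-output/",  # placeholder
    sagemaker_session=session,
)
xgb.set_hyperparameters(objective="reg:squarederror", num_round=300)

tuner = HyperparameterTuner(
    estimator=xgb,
    objective_metric_name="validation:rmse",
    objective_type="Minimize",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
        "subsample": ContinuousParameter(0.5, 1.0),
    },
    strategy="Bayesian",  # the default
    max_jobs=30,
    max_parallel_jobs=2,
)

tuner.fit({
    "train": TrainingInput("s3://<your-bucket>/train/", content_type="text/csv"),            # placeholder
    "validation": TrainingInput("s3://<your-bucket>/validation/", content_type="text/csv"),  # placeholder
})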

Technical setup

As with many other AWS services, you can create SageMaker jobs on the console, with the AWS Command Line Interface (AWS CLI), or via code. We chose the third option, the SageMaker Python SDK to be precise, because it allows for a highly automated setup: the WohnBarometer lives in a Python software project that is command-line executable. For example, all steps of the ML pipeline such as the preprocessing or the model training can be triggered via Bash commands. Those Bash commands, in turn, are orchestrated with a Jenkins pipeline powered by AWS Fargate.

Let’s look at the steps and the underlying infrastructure:

  • Preprocessing – The preprocessing is done with the built-in Scikit-Learn library in SageMaker. Because it involves joining data frames with millions of rows, we need an ml.m5.24xlarge machine here, the largest you can get in the ml.m family. Alternatively, we could have used multiple smaller machines with a distributed framework like Dask, but we wanted to keep it as simple as possible.
  • Training – We use the default SageMaker XGBoost algorithm. The training is done with two ml.m5.12xlarge machines. It’s worth mentioning that our train.py containing the code of the model training and the hyperparameter tuning has less than 100 rows.
  • Hyperparameter tuning – Following the principle of less is more, we only tune 11 hyperparameters (for example, the number of boosting rounds and the learning rate), which gives us time to carefully choose their ranges and inspect how they interact with each other. With only a few hyperparameters, each training job runs relatively fast; in our case the jobs take between 10–20 minutes. With a maximal number of 30 training jobs and 2 concurrent jobs, the total training time is around 3 hours.
  • Inference – SageMaker offers multiple options to serve your model. We use batch transform jobs (see the sketch after this list) because we only need the WohnBarometer numbers once a quarter. We didn’t use an endpoint because it would be idle most of the time. Each batch job (approximately 6.8 million rows) is served by a single ml.m5.4xlarge machine in less than 10 minutes.
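
The quarterly batch transform mentioned in the inference step can be expressed with the SageMaker Python SDK roughly as follows; the model name, S3 paths, and CSV input format are placeholder assumptions:

from sagemaker.transformer import Transformer

transformer = Transformer(
    model_name="wohnbarometer-xgboost",                            # placeholder model name
    instance_count=1,
    instance_type="ml.m5.4xlarge",
    output_path="s3://<your-bucket>/wohnbarometer/predictions/",   # placeholder
)

transformer.transform(
    data="s3://<your-bucket>/wohnbarometer/inference-input/",      # placeholder (~6.8 million rows)
    content_type="text/csv",
    split_type="Line",
)
transformer.wait()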

We can easily debug these steps on the SageMaker console. If, for example, a training job is taking longer than expected, we navigate to the Training page, locate the training job in question, and review Amazon CloudWatch metrics of the underlying machines.

The following architecture diagram shows the infrastructure of the WohnBarometer:

Challenges and learnings

In the beginning everything went smoothly: within a few days we set up the software project and trained a miniature version of our model in SageMaker. We had high hopes for the first run on the full dataset and the hyperparameter tuning in place. Unfortunately, the results weren’t satisfying. We had the following key issues:

  • The predictions of the model were too low, both for rent and sale objects. For Berlin, for example, the sale prices predicted for our reference objects were roughly 50% below the market prices.
  • According to the model, there was no significant price difference between new and existing buildings. The truth is that new buildings are almost always significantly more expensive than existing buildings.
  • The effect of the location on the price wasn’t captured correctly. We know, for example, that apartments for sale in Frankfurt am Main, are, on average, more expensive than in Berlin (although Berlin is catching up); our model, however, predicted it the other way round.

What was the problem and how did we solve it?

Sampling of the features

At first glance, it looks like the issues aren’t related, but indeed they are. By default, XGBoost builds each tree with a random sample of the features. Let’s say a model has 10 features F1, F2, … F10, then the algorithm might use F1, F4, and F7 for one tree, and F3, F4, and F8 for another. While in general this behavior effectively prevents overfitting, it can be problematic if the number of features is small and some of them have a big effect on the target variable. In this case, many trees will miss the crucial features.

XGBoost’s sampling of our 13 features led to many trees including neither of the crucial features—real estate type, location, and new or existing buildings—and as a consequence caused these issues. Luckily, there is a parameter to control the sampling: colsample_bytree (in fact, there are two more parameters to control the sampling, but we didn’t touch them). When we checked our code, we saw that colsample_bytree was set to 0.5, a value we carried over from past projects. As soon as we set it to the default value of 1, the preceding issues were gone.
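
Because we use the built-in XGBoost algorithm, the fix is a single hyperparameter change, roughly like this sketch (other hyperparameters omitted, values illustrative):

hyperparameters = {
    "objective": "reg:squarederror",
    "num_round": 300,            # placeholder value
    "colsample_bytree": 1.0,     # the default: every tree can consider all 13 features (we had carried over 0.5)
}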

One model vs. multiple models

Unlike the IMX, the WohnBarometer model really is only one model. Although this minimizes the maintenance effort, it’s not ideal from a statistical point of view. Because our training data contains both sale and rent objects, the spread in the target variable is huge: it ranges from below 5 Euro for some rent apartments to well above 10,000 Euro for houses for sale in first-class locations. The big challenge for the model is to understand that an error of 5 Euro is fantastic for sale objects, but disastrous for rent objects.

In hindsight, knowing how easy it is to maintain multiple models in SageMaker, we would have built at least two models: one for rent and one for sale objects. This would make it easier to capture the peculiarities of both markets. For example, the price of unrented apartments for sale is typically 20–30% higher than for rented apartments for sale. Therefore, encoding this information as a dummy variable in the sale model makes a lot of sense; for the rent model on the other hand, you could leave it out.

Conclusion

Did the WohnBarometer meet the goal of being relevant to our customers? Taking media coverage as an indication, the answer is a clear yes: as of November 2021, more than 700 newspaper articles and TV or radio reports on the WohnBarometer have been published. The list includes national newspapers such as Frankfurter Allgemeine Zeitung, Tagesspiegel, and Handelsblatt, and local newspapers that often ask for WohnBarometer figures for their region. Because we calculate the figures for all regions of Germany anyway, we’re happy to take such requests. With the old IMX, this level of granularity wasn’t possible.

The WohnBarometer outperforms the IMX in regards to statistical performance, in particular when it comes to costs: the IMX was generated by an EMR cluster with 10 task nodes running almost half a day. In contrast, all WohnBarometer steps take less than 5 hours using medium-sized machines. This results in cost savings of almost 75%.

Thanks to SageMaker, we were able to bring a complex ML model in production with one data scientist in less than 2 months. This is remarkable. 10 years earlier, when ImmoScout24 built the IMX, reaching the same milestone took more than 2 years and involved a whole team.

How could we be so efficient? SageMaker allowed us to focus on the model instead of the infrastructure, and SageMaker promotes a microservice architecture that is easy to maintain. If we got stuck with something, we could call on AWS support. In the past, when one of our IMX data pipelines failed, we would sometimes spend days to debug it. Since we started publishing WohnBarometer figures in April 2021, the SageMaker infrastructure hasn’t failed a single time.

To learn more about the WohnBarometer, check out WohnBarometer and WohnBarometer: Angebotsmieten stiegen 2021 bundesweit wieder stärker an. To learn more about using the SageMaker Scikit-Learn library for preprocessing, see Preprocess input data before making predictions using Amazon SageMaker inference pipelines and Scikit-learn. Please send us feedback, either on the AWS forum for Amazon SageMaker, or through your AWS support contacts.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.


About the Authors

Oliver Frost joined ImmoScout24 in 2017 as a business analyst. Two years later, he became a data scientist in a team whose job it is to turn ImmoScout24 data into veritable data products. Before building the WohnBarometer model, he ran smaller SageMaker projects. Oliver holds several AWS certificates, including the Machine Learning Specialty.

Lukas Müller is a Solutions Architect at AWS. He works with customers in the sports, media, and entertainment industries. He is always looking for ways to combine technical enablement with cultural and organizational enablement to help customers achieve business value with cloud technologies.

Read More

Transforming qualitative research by automating speech-to-text and text analytics

This post is authored by Satish Jha, Intelligent Automation Manager, Matt Docherty, Data Science Manager, Jayesh Muley, Associate Consultant and Tapan Vora, Rapid Prototyping, from ZS Associates.

At ZS Associates, we do a significant amount of qualitative market research. The work involves interviewing relevant subjects (such as healthcare professionals and sales representatives) and developing bespoke analytics on the interview data. We’ve taken advantage of the advances in AI, machine learning (ML), and cloud computing to reimagine qualitative market research and developed a scalable solution that is equipped to perform speech-to-text conversion and natural language processing (NLP) on the audio recordings of interviewed subjects. The solution is better, cheaper, and faster than the current ways of working (manual interpretation), giving a competitive advantage in this space.

This post discusses how ZS used Amazon Transcribe, Amazon Comprehend Medical, and custom NLP for text summarization and graph visualization to create a scalable, automated solution that helps us provide insights in a faster, better, and more efficient way.

Background assessment

The traditional method of performing qualitative market research requires human intervention and interpretation, which is highly subjective in nature. We used advanced AI and ML to develop a platform that is capable of the following:

  • Performing speech-to-text conversion; specifically with high precision, converting interview audio recordings conducted for the purpose of qualitative market research
  • Drawing analytical insights from the converted text using a state-of-the-art NLP model

To achieve this, we combined state-of-the-art AWS AI services and cloud computing capabilities with our proprietary NLP and text summarization algorithms to drive impact at scale.

Solution overview

To build our solution, we adopted the methodology of starting small, highlighting value, and scaling fast. We identified a key user group and defined phase one of the solution to do automated speech-to-text and analytics. We defined a key user interface and developed the technology architecture for the solution. Because ZS is an AWS Partner and has already been using multiple AWS Cloud services for our enterprise products and solutions, AWS was the preferred choice for this project. We used Amazon Transcribe and Amazon Comprehend Medical for transcription and theme identification purposes. For hosting custom NLP analytics APIs, we used serverless infrastructure built on Amazon API Gateway, AWS Lambda, and Amazon Elastic Container Service (Amazon ECS) with AWS Fargate. These services are HIPAA-eligible and compliant with pharma regulatory requirements.

The process includes the following stages:

  • File upload to Amazon S3 – The process starts when the user uploads one or more audio recording files for transcription to the site on which our tool is hosted. To upload the files to Amazon Simple Storage Service (Amazon S3), the user is provided with a temporary token or pre-signed URL through API Gateway, which grants Amazon S3 access (see the sketch after this list).
  • Audio transcription – Depending on the type of file uploaded, different triggers are in place to initiate the appropriate workflow:

    • Audio files uploaded without a dictionary file – If the user didn’t provide a dictionary file, the tool processes the audio file using Amazon Transcribe.
    • Audio files uploaded with a dictionary file – If the user provided a dictionary file, certain AWS Step Functions steps are triggered, followed by processing the dictionary file using Amazon Transcribe. When the dictionary processing is complete, the tool transcribes the audio file using Amazon Transcribe.
  • Transcript file generation – In either of the preceding two cases, when the transcription is in progress, the tool uses Amazon CloudWatch Events to update the transcription status. Lambda functions trigger the tool to update the status on the RDBMS and convey the status to the user through the tool’s UI using sockets. When the transcription is complete, the final output file is stored in Amazon S3.
  • File type conversion – After the output file is generated, the tool uses triggers to create a .doc or .xlsx file, stored again in Amazon S3.
  • Generating analytical insights – With Amazon Comprehend Medical and certain ZS in-house NLP tools, the tool generates analytics based on the transcribed data and updates dashboards on our site to access them in real time.
  • Audio streaming with Amazon Transcribe – We use Amazon CloudFront audio streaming paired with our final output file, which is generated from Amazon Transcribe. The user can simultaneously listen to the recording and read the transcript.
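
The upload, transcription, and insight stages above map to a handful of API calls. The following simplified boto3 sketch illustrates the flow; bucket, key, job, and vocabulary names are placeholders, and the Step Functions, CloudWatch Events, and socket wiring described above is omitted:

import boto3

s3 = boto3.client("s3")
transcribe = boto3.client("transcribe")
comprehend_medical = boto3.client("comprehendmedical")

bucket = "qual-research-audio"  # placeholder

# 1. Pre-signed URL handed to the browser so the user can upload the recording directly to S3
upload_url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": bucket, "Key": "uploads/interview-001.mp3"},
    ExpiresIn=900,
)

# 2. Start transcription, optionally with a custom vocabulary built from the user's dictionary file
transcribe.start_transcription_job(
    TranscriptionJobName="interview-001",
    Media={"MediaFileUri": f"s3://{bucket}/uploads/interview-001.mp3"},
    MediaFormat="mp3",
    LanguageCode="en-US",
    OutputBucketName=bucket,
    Settings={"VocabularyName": "interview-dictionary"},  # omit if no dictionary was uploaded
)

# 3. When the transcript text is available, extract entities for the analytics dashboards
entities = comprehend_medical.detect_entities_v2(Text="transcribed interview text goes here")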

The following diagram shows the high-level architecture and workflow.

The platform is designed to process a large number of files in real time. Therefore, the solution greatly augments the work of our current ZS qualitative research team by making the process more efficient and giving it an entirely new dimension!

Overall, our solution has the following features:

  • The ability to upload single or multiple audio files
  • Automated speech-to-text conversion, with the ability to add a custom dictionary
  • The ability to listen to the uploaded audio and refine text
  • Text summarization and analytics

Process map

The following diagram gives a high-level visualization of our developed solution, with the following stages:

  • Upload audio – The process starts with the user uploading their audio recording (with or without a dictionary file) to the tool
  • Speech to text – These uploaded audio files are transcribed by converting speech to text
  • Listen and refine – The user can simultaneously listen to the recording and read the transcript and make changes wherever necessary
  • Speech-to-text output – The consolidated file includes the converted transcript and its corresponding analytics

It took us approximately 5–6 months to develop this solution end to end with a four-member team. Today it is being used by over 300 people, and the tool has processed thousands of hours of audio.

AWS services used

The solution uses multiple AWS services:

  • AWS Lambda and API Gateway – Hosted the serverless APIs and functions.

    • We developed multiple API Gateways to ensure loose coupling and easy integration with external APIs. Custom authorizers were implemented to enable token-based authentication and restrict unauthorized access to the web content.
    • We also built Lambda APIs (using Python and NodeJS) that could easily interact with a website hosted on ECS containers and can also be easily linked with Amazon Relational Database Service (Amazon RDS) for PostgreSQL. The use of Lambda functions in our solution helped us avoid the effort of load balancing and of starting and stopping clusters, and reduced overall costs because compute only ran while the functions were running. Additionally, we were able to easily scale our solution because of the serverless architecture.
  • Amazon Transcribe – Provided us options to easily configure the batch processing of audio files up to 100 at a time and even scale a larger load using its built-in queuing mechanism. It also allowed us to load a custom dictionary to transcribe the audio data more accurately.
  • Amazon Comprehend Medical – Generated analytical insights from the text data using its built-in NLP capabilities to sort through text for valuable information.
  • AWS CloudFormation – We used AWS CloudFormation to deploy the Lambda functions and APIs across environments (various S3 buckets and multiple environments in the same bucket, such as production and development) using stage variables.
  • AWS CodeBuild, AWS CodeDeploy, and AWS CodePipeline – We used AWS CodeBuild, AWS CodeDeploy, and AWS CodePipeline to perform continuous deployment of the front end and analytics backend to ECS clusters.
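
To make the Amazon Transcribe piece more concrete, the following is a minimal sketch of starting a batch transcription job with a custom vocabulary using boto3. The job name, S3 locations, and vocabulary name are placeholders; the platform wires calls like this into its Lambda-based workflow rather than running them as a standalone script.

import boto3

transcribe = boto3.client("transcribe")

# Start a batch transcription job for one uploaded recording
# (job name, S3 locations, and vocabulary name are placeholders).
transcribe.start_transcription_job(
    TranscriptionJobName="interview-0001",
    Media={"MediaFileUri": "s3://example-input-bucket/audio/interview-0001.mp3"},
    MediaFormat="mp3",
    LanguageCode="en-US",
    OutputBucketName="example-output-bucket",
    Settings={
        "VocabularyName": "custom-dictionary",  # optional custom dictionary
        "ShowSpeakerLabels": True,              # label interviewer vs. respondent
        "MaxSpeakerLabels": 2,
    },
)

# The job runs asynchronously; poll for completion and then read the
# JSON transcript that Amazon Transcribe writes to the output bucket.
status = transcribe.get_transcription_job(TranscriptionJobName="interview-0001")
print(status["TranscriptionJob"]["TranscriptionJobStatus"])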

The following diagram illustrates the architecture of these services.

Conclusion

We used AWS services to develop a platform that helped our teams apply cutting-edge AI to their projects. It has helped our teams do the following:

  • Automate the process of speech-to-text conversion and focus manual effort only on low-accuracy sections.
  • Drive automation of insights with NLP algorithms.
  • Drive self-service. Because we do not need to launch any particular server, we can easily create Lambda functions, make changes to the code on the fly, and provide key ML services as plug and play so that users don’t need to be data scientists.

Today the solution is used by over 300 people, and we have processed thousands of hours of audio. We’re now integrating our solution with other applications to provide users with the flexibility to either upload audio files for transcription or directly upload transcribed files for drawing analytical insights.

We also derived multiple benefits from building our platform with AWS:

  • Using an end-to-end cloud-based architecture proved beneficial in terms of managing environments for business applications
  • With management tools such as CloudWatch, AWS CloudFormation, CodeBuild, CodeDeploy, and CodePipeline, it was easier to monitor, track, and deploy development changes
  • We used AWS’s built-in security with virtual private clouds and identity management with customized policies
  • We were able to reduce load on valuable microservices, with the additional benefit of quick hosting and deployment

About ZS

ZS Associates is a consulting and professional services firm focusing on consulting, software, and technology, headquartered in Evanston, Illinois, that provides services for clients in pharma, healthcare, and technology. The firm has more than 10,000 employees in 30 offices across North America, South America, Europe, and Asia. ZS works with 49 of the 50 largest drug makers and 17 of the 20 largest medical device makers, and serves the consumer products, financial services, industrial products, telecommunications, transportation, and logistics industries.

Disclaimer: AWS is not responsible for the content or accuracy of this post. The content and opinions in this post are solely those of the third-party author. It is each customer's responsibility to determine whether they are subject to HIPAA, and if so, how best to comply with HIPAA and its implementing regulations. Before using AWS in connection with protected health information, customers must enter into an AWS Business Associate Addendum (BAA) and follow its configuration requirements.


About the Authors

Satish Jha is a Manager with ZS Associates. He is a leader in the firm’s Intelligent Automation Practice, where he works side by side with several pharma clients to transform operations and drive impact.

Matt Docherty is a Data Science Manager with ZS Associates in the Philadelphia office. He is focused on applying data science in the pharmaceutical industry.

Jayesh Muley is an Associate Consultant for Process Excellence & Transformation with ZS Associates. He has 4 years of experience advising pharma clients in the forecasting, process excellence, and digital transformation spaces. He played a critical role in establishing ZS’s automation center of excellence. He is keen on learning new technologies and is always evolving in his role.

Tapan Vora is a Manager for Rapid Prototyping with ZS Associates. Tapan has over 14 years of technology and engineering management experience. He plays multiple roles in the team, such as business analyst, people manager, solution designer, data analyst, and product leader.

Read More

How The Barcode Registry detects counterfeit products using object detection and Amazon SageMaker

This is a guest post authored by Andrew Masek, Software Engineer at The Barcode Registry and Erik Quisling, CEO of The Barcode Registry.

Product counterfeiting is the single largest criminal enterprise in the world. Growing over 10,000% in the last two decades, sales of counterfeit goods now total $1.7 trillion per year worldwide, which is more than the illegal drug and human trafficking trades. Although traditional methods of counterfeit prevention like unique barcodes and product verification can be very effective, new machine learning (ML) technologies such as object detection seem very promising. With object detection, you can now snap a picture of a product and know almost instantly if that product is likely to be legitimate or fraudulent.

The Barcode Registry (in conjunction with its partner Buyabarcode.com) is a full-service solution that helps customers prevent product fraud and counterfeiting. It does this by selling unique GS1-registered barcodes, verifying product ownership, and registering users’ products and barcodes in a comprehensive database. Their latest offering, which we discuss in this post, uses Amazon SageMaker to create object detection models to help instantly recognize counterfeit products.

Overview of solution

To use these object detection models, you first need to collect data to train them. Companies upload annotated pictures of their products to The Barcode Registry website. After this data is uploaded to Amazon Simple Storage Service (Amazon S3) and processed by AWS Lambda functions, you can use it to train a SageMaker object detection model. This model is hosted on a SageMaker endpoint, which the website connects to the end user.

There are three key steps The Barcode Registry uses to create a custom object detection model with SageMaker:

  1. Create a training script for SageMaker to run.
  2. Build a Docker container from the training script and upload it to Amazon ECR.
  3. Use the SageMaker console to train a model with the custom algorithm.

Product data

As prerequisites for training an object detection model, you need an AWS account and training images: at least 100 high-quality pictures of your object (high resolution, captured under multiple lighting conditions). As with any ML model, high-quality data is paramount. To train an object detection model, we need images containing the relevant products as well as bounding boxes describing where the products are in the images, as shown in the following example.

To train an effective model, pictures of each of a brand’s products with different backgrounds and lighting conditions are needed—approximately 30–100 unique annotated images for each product.

After the images are uploaded to the web server, they’re uploaded to Amazon S3 using the AWS SDK for PHP. A Lambda function is triggered by an S3 event each time an image is uploaded. The function removes the Exif metadata from the images, which can sometimes cause them to appear rotated when they’re opened by the ML libraries later used to train the model. The associated bounding box data is stored in JSON files and uploaded to Amazon S3 to accompany the images.
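
The following is a minimal sketch of such a function, assuming an S3 PUT trigger and the Pillow library packaged with the function; the event handling and bucket layout are illustrative rather than The Barcode Registry's production code.

import io

import boto3
from PIL import Image

s3 = boto3.client("s3")

def handler(event, context):
    # Triggered by an S3 PUT event for a newly uploaded product image
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    image = Image.open(io.BytesIO(body))

    # Rebuild the image from raw pixel data so Exif metadata
    # (including orientation tags) is dropped
    clean = Image.new(image.mode, image.size)
    clean.putdata(list(image.getdata()))

    buffer = io.BytesIO()
    clean.save(buffer, format=image.format or "JPEG")

    # Overwrite the original object with the cleaned image
    s3.put_object(Bucket=bucket, Key=key, Body=buffer.getvalue())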

SageMaker for object detection models

SageMaker is a managed ML service that includes a variety of tools for building, training, and hosting models in the cloud. In particular, The Barcode Registry uses SageMaker for its object detection service because of SageMaker’s reliable and scalable ML model training and hosting services. This means that many brands can have their own object detection models trained and hosted, and even if usage spikes unpredictably, there won’t be any downtime.

The Barcode Registry uses custom Docker containers uploaded to Amazon Elastic Container Registry (Amazon ECR) to have more fine-grained control of the object detection algorithm employed for training and inference, as well as support for Multi Model Server (MMS). MMS is very important for the counterfeit detection use case because it allows multiple brands’ models to be cost-effectively hosted on the same server. Alternatively, you can use the built-in object detection algorithm to quickly deploy standard models developed by AWS.

Train a custom object detection model with SageMaker

First, you need to add your object detection algorithm. In this case, upload a Docker container featuring scripts to train a Yolov5 object detection model to Amazon ECR:

  1. On the SageMaker console, under Notebook in the navigation pane, choose Notebook instances.
  2. Choose Create notebook instance.
  3. Enter a name for the notebook instance and under Permissions and encryption choose an AWS Identity and Access Management (IAM) role with the necessary permissions.
  4. Open the Git repositories menu.
  5. Select Clone a public Git repository to this notebook instance only and paste the following Git repository URL: https://github.com/portoaj/SageMakerObjectDetection
  6. Choose Create notebook instance and wait about five minutes for the instance’s status to change from Pending to InService on the Notebook instances menu.
  7. When the notebook is InService, select it and choose Actions and Open Jupyter to launch the notebook instance in a new tab.
  8. Choose the SageMakerObjectDetection directory and then choose sagemakerobjectdetection.ipynb to launch the Jupyter notebook.
  9. Choose the conda_python3 kernel and choose Set Kernel.
  10. Select the code cell and set the aws_account_id variable to your AWS account ID.
  11. Choose Run to begin building the Docker container and uploading it to Amazon ECR. This process may take about 20 minutes to complete.
  12. After the Docker container has been uploaded, return to the Notebook instances menu, select your instance, and choose Actions and Stop to shut down your notebook instance.

After the algorithm is built and pushed to Amazon ECR, you can use it to train a model via the SageMaker console.

  1. On the SageMaker console, under Training in the navigation pane, choose Training jobs.
  2. Choose Create training job.
  3. Enter a name for the job and choose an IAM role with the necessary permissions.
  4. For Algorithm source, select Your own algorithm container in ECR.
  5. For Container, enter the registry path.
  6. Under Resource configuration, a single ml.p2.xlarge instance should be sufficient for training a Yolov5 model.
  7. Specify Amazon S3 locations for both your input data and output path and any other settings such as configuring a VPC via Amazon Virtual Private Cloud (Amazon VPC) or enabling Managed Spot Training.
  8. Choose Create training job.

You can track the model’s training progress on the SageMaker console.
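
If you'd rather script these steps than use the console, the following is a rough boto3 equivalent; the job name, ECR image URI, IAM role, and S3 paths are placeholders for your own values.

import boto3

sagemaker = boto3.client("sagemaker")

sagemaker.create_training_job(
    TrainingJobName="yolov5-brand-training-001",
    AlgorithmSpecification={
        # The custom algorithm container pushed to Amazon ECR earlier
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/sagemaker-object-detection:latest",
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    InputDataConfig=[
        {
            "ChannelName": "training",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://example-bucket/brand-images/",
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
        }
    ],
    OutputDataConfig={"S3OutputPath": "s3://example-bucket/models/"},
    ResourceConfig={
        "InstanceType": "ml.p2.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 50,
    },
    StoppingCondition={"MaxRuntimeInSeconds": 4 * 60 * 60},
)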

Automated model training

The following diagram illustrates the automated model training workflow:

To make SageMaker start training the object detection model as soon as a user finishes uploading their data, the web server uses Amazon API Gateway to notify a Lambda function that the brand has finished uploading and that a training job should begin.

When a brand’s model is successfully trained, Amazon EventBridge calls a Lambda function that moves the trained model into the live endpoint’s S3 bucket, where it’s finally ready for inference. A newer alternative worth considering for moving models through the MLOps lifecycle is Amazon SageMaker Pipelines.
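
A sketch of that Lambda function could look like the following, assuming it's subscribed to the SageMaker Training Job State Change event in EventBridge and that the event detail mirrors the DescribeTrainingJob output; the destination bucket and key layout are hypothetical.

import boto3
from urllib.parse import urlparse

s3 = boto3.client("s3")

# Hypothetical bucket that backs the live multi-model endpoint
ENDPOINT_BUCKET = "example-live-endpoint-models"

def handler(event, context):
    detail = event["detail"]
    if detail.get("TrainingJobStatus") != "Completed":
        return  # ignore in-progress or failed jobs

    # Location of the model artifact produced by the training job
    artifact_uri = detail["ModelArtifacts"]["S3ModelArtifacts"]
    parsed = urlparse(artifact_uri)
    source_bucket = parsed.netloc
    source_key = parsed.path.lstrip("/")

    # Copy the artifact into the endpoint's bucket, keyed by job name
    dest_key = f"{detail['TrainingJobName']}/model.tar.gz"
    s3.copy(
        CopySource={"Bucket": source_bucket, "Key": source_key},
        Bucket=ENDPOINT_BUCKET,
        Key=dest_key,
    )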

Host the model for inference

The following diagram illustrates the inference workflow:

To use the trained models, SageMaker requires an inference model to be hosted by an endpoint. The endpoint is the server or group of servers that hosts the inference model. Similar to the training container that we created, a Docker container for inference is hosted in Amazon ECR. The inference model uses that Docker container to take the image the user captured with their phone, run it through the trained object detection model, and output the result.

Again, The Barcode Registry uses custom Docker containers for the inference model to enable the use of Multi Model Server, but if only one model is needed, it can easily be hosted through the built-in object detection algorithm.
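
For example, invoking a multi-model endpoint from Python could look roughly like the following; the endpoint name, target model path, and payload/response formats are assumptions, because the actual contract is defined by the inference container.

import json

import boto3

runtime = boto3.client("sagemaker-runtime")

# Read the photo captured by the user (path is illustrative)
with open("product-photo.jpg", "rb") as f:
    payload = f.read()

response = runtime.invoke_endpoint(
    EndpointName="counterfeit-detection",   # hypothetical endpoint name
    TargetModel="brand-a/model.tar.gz",     # selects one brand's model on the multi-model endpoint
    ContentType="application/x-image",
    Body=payload,
)

# Here we assume the container returns JSON with detected boxes and scores
detections = json.loads(response["Body"].read())
print(detections)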

Conclusion

The Barcode Registry (in conjunction with its partner Buyabarcode.com) uses AWS for its entire object detection pipeline. The web server reliably stores data in Amazon S3 and uses API Gateway and Lambda functions to connect the web server to the cloud. SageMaker readily trains and hosts ML models, which means a user can take a picture of a product on their phone and see if the product is a counterfeit. This post shows how to create and host an object detection model using SageMaker, as well as how to automate the process.

In testing, the model was able to achieve over 90% accuracy on a training set of 62 images and a testing set of 32 images, which is impressive for a model trained without any human intervention. To get started training object detection models yourself, check out the official documentation or learn how to deploy an object detection model to the edge using AWS IoT Greengrass.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.


About the Authors

Andrew Masek, Software Engineer at The Barcode Registry.

Erik Quisling, CEO of The Barcode Registry.

Read More

Build a cold start time series forecasting engine using AutoGluon

Whether you’re allocating resources more efficiently for web traffic, forecasting patient demand for staffing needs, or anticipating sales of a company’s products, forecasting is an essential tool across many businesses. One particular use case, known as cold start forecasting, builds forecasts for a time series that has little or no existing historical data, such as a new product that just entered the market in the retail industry. Traditional time series forecasting methods such as autoregressive integrated moving average (ARIMA) or exponential smoothing (ES) rely heavily on historical time series of each individual product, and therefore aren’t effective for cold start forecasting.

In this post, we demonstrate how to build a cold start forecasting engine using AutoGluon AutoML for time series forecasting, an open-source Python package to automate machine learning (ML) on image, text, tabular, and time series data. AutoGluon provides an end-to-end automated machine learning (AutoML) pipeline for beginners to experienced ML developers, making it the most accurate and easy-to-use fully automated solution. We use the free Amazon SageMaker Studio Lab service for this demonstration.

Introduction to AutoGluon time series

AutoGluon is a leading open-source AutoML library for text, image, and tabular data, allowing you to produce highly accurate models from raw data with just one line of code. Recently, the team has been working to extend these capabilities to time series data, and has developed an automated forecasting module that is publicly available on GitHub. The autogluon.forecasting module automatically processes raw time series data into the appropriate format, and then trains and tunes various state-of-the-art deep learning models to produce accurate forecasts. In this post, we demonstrate how to use autogluon.forecasting and apply it to cold start forecasting tasks.

Solution overview

Because AutoGluon is an open-source Python package, you can implement this solution locally on your laptop or on Amazon SageMaker Studio Lab. We walk through the following steps:

  1. Set up AutoGluon for Amazon SageMaker Studio Lab.
  2. Prepare the dataset.
  3. Define training parameters using AutoGluon.
  4. Train a cold start forecasting engine for time series forecasting.
  5. Visualize cold start forecasting predictions.

The key assumption of cold start forecasting is that items with similar characteristics should have similar time series trajectories, which is what allows cold start forecasting to make predictions on items without historical data, as illustrated in the following figure.

In our walkthrough, we use a synthetic dataset based on electricity consumption, which consists of the hourly time series for 370 items, each with an item_id from 0–369. Within this synthetic dataset, each item_id is also associated with a static feature (a feature that doesn’t change over time). We train a DeepAR model using AutoGluon to learn the typical behavior of similar items, and transfer such behavior to make predictions on new items (item_id 370–373) that don’t have historical time series data. Although we’re demonstrating the cold start forecasting approach with only one static feature, in practice, having informative and high-quality static features is the key for a good cold start forecast.

The following diagram provides a high-level overview of our solution. The open-source code is available on the GitHub repo.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Log in to your Amazon SageMaker Studio Lab account and set up the environment using the terminal:

cd sagemaker-studiolab-notebooks/ 
git clone https://github.com/whosivan/amazon-sagemaker-studio-lab-cold-start-forecasting-using-autogluon
conda env create -f autogluon.yml
conda activate autogluon
git clone https://github.com/yx1215/autogluon.git
cd autogluon/
git checkout --track origin/add_forecasting_predictor

These instructions should also work from your laptop if you don’t have access to Amazon SageMaker Studio Lab (we recommend installing Anaconda on your laptop first).

When you have the virtual environment fully set up, launch the notebook AutoGluon-cold-start-demo.ipynb and select the custom environment .conda-autogluon:Python kernel.

Prepare the target time series and item meta dataset

Download the following datasets to your notebook instance if they’re not included, and save them under the directory data/. You can find these datasets on our GitHub repo:

  • Test.csv.gz
  • coldStartTargetData.csv
  • itemMetaData.csv

Run the following snippet to load the target time series dataset into the kernel:

zipLocalFilePath = "data/test.csv.gz"
localFilePath = "data/test.csv"
util.extract_gz(zipLocalFilePath, localFilePath)

# Read the extracted CSV (the target time series for items with history)
tdf = pd.read_csv(localFilePath, dtype = object)
tdf['target_value'] = tdf['target_value'].astype('float')
tdf.head()

AutoGluon time series requires static features to be represented in numerical format. This can be achieved by applying LabelEncoder() to our static feature, type, where we encode A=0, B=1, C=2, D=3 (see the following code). By default, AutoGluon infers the static feature to be either ordinal or categorical. You can also overwrite this by converting the static feature column to the object/string data type for categorical features, or the integer/float data type for ordinal features.

localItemMetaDataFilePath = "data/itemMetaData.csv"
imdf = pd.read_csv(localItemMetaDataFilePath, dtype = object)

# requires: from sklearn.preprocessing import LabelEncoder
# Encode the static feature 'type' as integers (A=0, B=1, C=2, D=3),
# then cast to string so AutoGluon treats it as categorical
labelencoder = LabelEncoder()
imdf['type'] = labelencoder.fit_transform(imdf['type'])
imdf['type'] = imdf['type'].astype(str)

# Split the item metadata into items with history and cold start items
imdf_without_coldstart_item = imdf[imdf.item_id.isin(tdf.item_id.tolist())]
imdf_without_coldstart_item.to_csv('data/itemMetaDatawithoutColdstart.csv', index=False)

imdf_with_coldstart_item = imdf[~imdf.item_id.isin(tdf.item_id.tolist())]
imdf_with_coldstart_item.to_csv('data/itemMetaDataOnlyColdstart.csv', index=False)

Set up and start AutoGluon model training

We need to specify save_path = ‘autogluon-coldstart-demo’ as the model artifact folder name (see the following code). We also set our eval_metric to mean absolute percentage error, or ‘MAPE’ for short, and define prediction_length as 24 hours. If not specified, AutoGluon by default produces probabilistic forecasts and scores them via the weighted quantile loss. We only look at the DeepAR model in our demo, because the DeepAR algorithm supports cold start forecasting by design. We set one of the DeepAR hyperparameters arbitrarily and pass it to the ForecastingPredictor().fit() call, which makes AutoGluon consider only the specified model. For a full list of tunable hyperparameters, refer to the gluonts.model.deepar package.

save_path = 'autogluon-coldstart-demo'
eval_metric = 'MAPE'
deepar_params = {
    "scaling": True
}

ag_predictor = ForecastingPredictor(path=save_path, eval_metric=eval_metric).fit(
    tdf,
    static_features=imdf_without_coldstart_item,
    prediction_length=24,  # how far out in the future we wish to forecast
    index_column="item_id",
    target_column="target_value",
    time_column="timestamp",
    quantiles=[0.1, 0.5, 0.9],
    hyperparameters={"DeepAR": deepar_params},
)

The training takes 30–45 minutes. You can get the model summary by calling the following function:

ag_predictor.fit_summary()

Forecast on the cold start item

Now we’re ready to generate forecasts for the cold start items. We recommend having at least five rows for each item_id. Therefore, for any item_id that has fewer than five observations, we fill in with NaNs. In our demo, both item_id 370 and 372 have zero observations, a pure cold start problem, whereas the other two have five target values.
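
The coldStartTargetData.csv provided in the repo already follows this convention. If you assemble your own cold start file, a helper along these lines could generate the placeholder rows; the start timestamp, frequency, and item IDs below are illustrative.

import numpy as np
import pandas as pd

def make_cold_start_rows(item_ids, start, periods=5, freq="H"):
    # Build `periods` rows of NaN targets for each cold start item
    timestamps = pd.date_range(start=start, periods=periods, freq=freq)
    rows = [
        {"item_id": item_id, "timestamp": ts, "target_value": np.nan}
        for item_id in item_ids
        for ts in timestamps
    ]
    return pd.DataFrame(rows)

# item_id 370 and 372 have no history at all in this demo
placeholder_rows = make_cold_start_rows(["370", "372"], start="2015-01-01 00:00:00")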

Load in the cold start target time series dataset with the following code:

localColdStartDataFilePath = "data/coldStartTargetData.csv"
cstdf = pd.read_csv(localColdStartDataFilePath, dtype = object)
cstdf.head(20)

We feed the cold start target time series into our AutoGluon model, along with the item meta dataset for the cold start item_id:

cold_start_prediction = ag_predictor.predict(cstdf, static_features=imdf_with_coldstart_item)

Visualize the predictions

We can create a plotting function to visualize the cold start forecasts, as shown in the following graph.
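
A minimal sketch of such a plotting function is shown below. It assumes the predict call returns a mapping from item_id to a DataFrame of forecasts with one column per requested quantile; adjust the keys and column names to match what your version of AutoGluon actually returns.

import matplotlib.pyplot as plt

def plot_cold_start_forecast(predictions, item_id):
    # Assumed format: DataFrame indexed by timestamp with quantile columns
    forecast = predictions[item_id]
    plt.figure(figsize=(10, 4))
    plt.plot(forecast.index, forecast["0.5"], label="median forecast")
    plt.fill_between(
        forecast.index,
        forecast["0.1"],
        forecast["0.9"],
        alpha=0.3,
        label="10%-90% interval",
    )
    plt.title(f"Cold start forecast for item_id {item_id}")
    plt.xlabel("timestamp")
    plt.ylabel("target_value")
    plt.legend()
    plt.show()

plot_cold_start_forecast(cold_start_prediction, "370")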

Clean up

To optimize resource usage, consider stopping the runtime on Amazon SageMaker Studio Lab after you have fully explored the notebook.

Conclusion

In this post, we showed how to build a cold start forecasting engine using AutoGluon AutoML for time series data on Amazon SageMaker Studio Lab. For those wondering about the difference between Amazon Forecast and AutoGluon for time series: Amazon Forecast is a fully managed and supported service that uses machine learning (ML) to generate highly accurate forecasts without requiring any prior ML experience, whereas AutoGluon is an open-source, community-supported project that incorporates the latest research contributions. We walked through an end-to-end example to demonstrate what AutoGluon for time series is capable of, and provided a dataset and use case.

AutoGluon for time series data is an open-source Python package, and we hope that this post, together with our code example, gives you a straightforward solution to tackle challenging cold start forecasting problems. You can access the entire example on our GitHub repo. Try it out, and let us know what you think!


About the Authors

Ivan Cui is a Data Scientist with AWS Professional Services, where he helps customers build and deploy solutions using machine learning on AWS. He has worked with customers across diverse industries, including software, finance, pharmaceutical, and healthcare. In his free time, he enjoys reading, spending time with his family, and maximizing his stock portfolio.

Jonas Mueller is a Senior Applied Scientist in the AI Research and Education group at AWS, where he develops new algorithms to improve deep learning and develop automated machine learning. Before joining AWS to democratize ML, he completed his PhD at the MIT Computer Science and Artificial Intelligence Lab. In his free time, he enjoys exploring mountains and the outdoors.

Wenming Ye is a Research Product Manager at AWS AI. He is passionate about helping researchers and enterprise customers rapidly scale their innovations through open-source and state-of-the-art machine learning technology. Wenming has diverse R&D experience from Microsoft Research, the SQL engineering team, and successful startups.

Read More