Next Gen Stats Decision Guide: Predicting fourth-down conversion

It is fourth-and-one on the Texans’ 36-yard line with 3:21 remaining on the clock in a tie game. Should the Colts’ head coach Frank Reich send out kicker Rodrigo Blankenship to attempt a 54-yard field goal or rely on his offense to convert a first down? Frank chose to go for it, leading to a first-down conversion and an eventual touchdown to seal the win. Was this the optimal call or a gamble that ended up working? Through a collaboration between the NFL’s Next Gen Stats team and AWS, NFL fans can now get an answer to this question.

Like the Colts-Texans example, the decision of what to do on a fourth down late in the game can be the difference between a win and a loss. While it's tempting to focus on fourth downs late in the game, fourth-down decisions made early can be just as important: their effects reverberate and compound over the course of a game or season. Head coaches who consistently make the right call on fourth down put their teams in the best possible position to win, but how does a coach know what the right call is? What factors do they have to weigh, and how can a computer give fans insight into this complicated decision-making process?

The problem can be represented as a tree of choices and their respective potential outcomes. On any fourth down, a team has three main options: punt, kick a field goal, or go for it. If a team punts, their opponent generally gains possession of the ball at some point farther down the field. On a field goal attempt, the two main outcomes are the offensive team either makes the field goal or misses the field goal. If they make the field goal, they gain three points. If they miss the field goal, the defense gains possession of the ball at the location of the attempt. Similarly, if a team chooses to go for it, there are two main outcomes. Either the team gains enough yards for a first-down (or potentially a touchdown), or the defense gains possession of the ball at the end of the play.

When coaches decide what to do on a fourth-down, they must weigh all the potential outcomes and the impact of these outcomes on the odds of winning the game. To help fans understand a coach’s decision, the NFL and AWS partnered to create the Next Gen Stats Decision Guide. The Next Gen Stats Decision Guide is a suite of machine learning (ML) models designed to determine the optimal fourth-down call. The decision guide does this by predicting the odds of each potential fourth-down outcome and the resulting odds of winning the game. By comparing the odds of winning the game for each fourth-down choice, the Next Gen Stats Decision Guide provides a data-driven answer to that optimal fourth-down call.

Going back to Frank Reich’s decision, the Colts needed 0.25 yards to gain a first down. What is the probability that they convert? As shown in the following figure, our fourth-down conversion probability model predicts an 81% chance. When paired with the updated win probability of 75% if they convert, we get an expected win probability of 69%. However, if they choose to kick a field goal, the chance of making the field goal is around 42%. Paired with the win probability of 71% if successful, we get an expected win probability of 56%. Based on these expected probabilities, the Next Gen Stats Decision Guide recommends going for it, a difference of 13 percentage points.

In addition to fourth-down decisions, coaches must decide what to do after scoring a touchdown. The team can kick an extra point (+1 point) or elect to attempt a two-point conversion (+2 points). The application of the Next Gen Stats Decision Guide to fourth-down plays and after-touchdown plays has been presented before, and is a good primer for this discussion. In this post, we focus on the models that determine the probability of converting on fourth down. We share how we engineered features, developed the ML model, and chose the metrics used to evaluate the quality of its predictions.

Go-for-it model

If a team chooses to go for it on a fourth-down, the team must gain enough yards to make a first-down on that single play. This means that not all fourth-downs are equal. Some require the offense to gain less than a yard, while others may occasionally require the offense to gain more than 10 yards. The location on the field, time left on the clock, and relative strengths of the teams are among the important parameters in understanding the odds of success. In building the Go-for-it model, we examine these and other factors to determine which features are most important in constructing a performant model.

Problem formulation

Estimating the odds of converting on a fourth down can be formulated as a multi-class classification problem. In this formulation, each class represents the offense gaining some number of yards on the play. The probability of each class is used as the odds that the team will gain that number of yards on the play. The following histogram shows the yards gained on third- and fourth-down plays from 2016–2020. An initial approach might be to make each class in the model represent an integer number of yards gained, but the histogram shows that this approach will be difficult. Classes in the long tail of the graph (roughly 40–100 yards) occur infrequently, and this sort of class imbalance can be difficult to account for in model training.

To combat the potential class imbalance, we used an unequal mapping of yards gained to classes. Instead of each yard gained being an individual class, we used 17 different classes to encompass all the potential outcomes shown in the graph.

As shown in the following table, we use one class for all negative or zero-yards-gained results. Between 1–15 yards gained, we use one class for each potential outcome. The reason for this breakdown is that 88% of fourth-down plays have somewhere between 1–15 yards to go. This enables the model to capture a large majority of fourth-down situations with high fidelity. To address plays with more than 15 yards to go, we employ a decay factor to represent the decreasing probability of getting more yards on a single play.

Yards gained                 Model class (17 classes total)
Less than or equal to 0      0
1–15 yards                   1–15 (one class per yard; 15 classes)
16+ yards                    16

The following equation shows the decay factor used, where the probability of converting (Pconversion) is the probability of gaining 16 or more yards (P16+) divided by the actual distance needed for a first down (d) minus 15 yards:

Pconversion = P16+ / (d - 15)
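
The following sketch shows how the class probabilities and the decay factor could be combined into a single conversion probability. It's a minimal illustration of the formulation described above, assuming class_probs holds the model's 17 predicted class probabilities; the function and variable names are ours, not the production implementation.

import numpy as np

def conversion_probability(class_probs, yards_to_go):
    """Turn 17-class probabilities into P(conversion) for a given distance.

    class_probs: length-17 array; index 0 = zero or negative yards gained,
    indices 1-15 = exact yards gained, index 16 = 16 or more yards gained.
    """
    d = yards_to_go
    if d <= 15:
        # The play converts if the yards gained are at least the distance needed.
        return class_probs[int(np.ceil(d)):].sum()
    # Beyond 15 yards, apply the decay factor: P(16+) / (d - 15).
    return class_probs[16] / (d - 15)

# Example: fourth-and-20 with a 10% chance of a 16+ yard gain gives 0.10 / 5 = 0.02.
probs = np.zeros(17)
probs[16] = 0.10
print(conversion_probability(probs, 20))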

Features

Just as a coach needs to consider many factors when deciding what to do in a game, the conversion probability models also have many potential features to use. Part of the modeling process involved determining which features to incorporate into the model. We used feature importance measures like correlation to help us identify several high-value features (see the following table). These features include the actual yards-to-go, the Vegas spread, and the historical aggregations of expected points added (EPA) by team and quarterback.

The actual yards-to-go is arguably the most important feature for this model, aligning with general football knowledge. The more yards a team needs to gain, the less likely the team is to achieve that outcome. What makes the actual yards-to-go metric even more valuable in this model is that it is derived from the NGS tracking data. Traditional NFL datasets often represent the yards-to-go as an integer, which obscures the variable nature of the game. With the NGS tracking data, we can get a measurement of the football’s location with sub-foot accuracy. This allows our model to understand the difference between fourth and inches versus fourth and 1 yard.

Although the actual yards-to-go is a clear metric to provide the model, some information is harder to quantify immediately and provide to the model. For example, a coach understands the unique skillsets of their team and the opposition, both on that day and historically. To assess coaching decisions, the model needs a way to use similar information. The Vegas lines are a useful condensation of vast amounts of situational and historical knowledge about the teams into a small set of numbers. Specifically, the point spread and the total points lines capture information about prevailing beliefs regarding the relative strengths of the teams, and the model found these values useful.

Input Features Description
actualYardsToGo The yards to go as measured using NGS tracking data between the ball at snap and the yards-to-go marker
isCalledPass Is the play predicted to be a pass or a rush?
totalLine The closing total (over/under) line for the game
possessionTeamLine The number of points the possession team is favored by according to Vegas
possessionTeamTotal The number of total points the possession team is expected to score as indicated by the Vegas total and spread lines
offEpa A team offense’s average expected points added per play over the last X number of plays in similar situations
defEpa A team defense’s average expected points added allowed per play over the last X number of plays in similar situations
qbEpa A team offense’s average expected points added per play over the last X number of plays when the quarterback on the field attempted a pass, run, or was sacked
qbSuccessEpa Quarterback success EPA for the last N similar plays

Similar to how the Vegas lines provide game-level insight into relative team strengths, we can use EPA values to provide insight into relative team strengths at a more granular level. These EPA values, calculated using other NGS models, provide insight into how the team has performed in similar situations in the past. The EPA models can be broken down by the offense, defense, and quarterback. This provides the model with information about how successful the respective teams have been in the past in addition to how successful the current quarterback has been. The following figure shows the relative importance of the features after HPO. As discussed earlier, this feature importance makes intuitive sense.
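
As an illustration of how such rolling EPA features can be built, the following pandas sketch computes a team's average offensive EPA over its most recent plays. The column names, file path, and window size are hypothetical, and the filtering to similar situations used by the production features is omitted here.

import pandas as pd

# Hypothetical play-by-play table with one row per play.
# Assumed columns: 'game_date', 'posteam' (offense), 'epa' (expected points added).
plays = pd.read_csv("plays.csv").sort_values("game_date")

N = 100  # look-back window of recent plays

# Average offensive EPA over each team's previous N plays. The shift(1) excludes
# the current play so the feature only uses information available before the snap.
plays["offEpa"] = (
    plays.groupby("posteam")["epa"]
         .transform(lambda s: s.shift(1).rolling(N, min_periods=10).mean())
)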

Model training

To train the model, we used all the data from third- and fourth-down plays from 2016–2019 regular seasons as the training set. We held out the data from 2020 for the testing set.

For model architecture, we compared a handful of different models, including XGBoost, PyTorch Tabular, and AutoML-based models. Of these options, the XGBoost model provided the best results. Its predictions can also be explained using Shapley Additive Explanations (SHAP) feature importance measures. Because our goal is to optimize for conversion probabilities, we used the Brier score (a probabilistic loss function) to measure the performance of our models. The Brier score measures the mean squared difference between the predicted probabilities assigned to the possible outcomes and the actual outcomes. A lower Brier score is considered better.
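
The following snippet shows one way to compute a multi-class Brier score with NumPy; it's a minimal sketch rather than the exact evaluation code used for the model.

import numpy as np

def multiclass_brier(y_true, y_prob):
    """Mean squared difference between predicted class probabilities and
    one-hot encoded actual outcomes; lower is better."""
    n_classes = y_prob.shape[1]
    one_hot = np.eye(n_classes)[y_true]            # shape (n_samples, n_classes)
    return np.mean(np.sum((y_prob - one_hot) ** 2, axis=1))

# Toy example with 3 classes: the first prediction is confident and correct,
# the second is correct but less confident, so the average score stays low.
y_true = np.array([2, 0])
y_prob = np.array([[0.1, 0.1, 0.8],
                   [0.7, 0.2, 0.1]])
print(multiclass_brier(y_true, y_prob))  # 0.10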

To optimize our models, we used Amazon SageMaker hyperparameter optimization (HPO) to fine-tune XGBoost parameters like learning rate, max depth, subsamples, alpha, and gamma. The SageMaker-managed HPO service helped us run multiple experiments in parallel to identify optimal hyperparameter configurations. Each experiment took only a few minutes because tuning jobs are distributed across 10 instances. In addition, we used SageMaker features, including automatic early stopping and warm starting from previous tuning jobs. This, combined with custom metrics, improved the performance of the model within minutes. Examples of various SageMaker-based HPO tuning jobs are available on GitHub.
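
As a rough illustration, a tuning job like this can be set up with the SageMaker Python SDK as shown below. The role, S3 paths, parameter ranges, and the use of a standard built-in metric (rather than the custom Brier-based metric described above) are assumptions for the sketch.

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

session = sagemaker.Session()
region = session.boto_region_name
image_uri = sagemaker.image_uris.retrieve("xgboost", region, version="1.2-1")

estimator = Estimator(
    image_uri=image_uri,
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder role
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    output_path="s3://my-bucket/go-for-it/output/",  # placeholder bucket
)
estimator.set_hyperparameters(objective="multi:softprob", num_class=17, num_round=200)

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name="validation:mlogloss",
    objective_type="Minimize",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),       # learning rate
        "max_depth": IntegerParameter(3, 10),
        "subsample": ContinuousParameter(0.5, 1.0),
        "alpha": ContinuousParameter(0, 10),
        "gamma": ContinuousParameter(0, 5),
    },
    max_jobs=30,
    max_parallel_jobs=10,          # spread trials across 10 instances
    early_stopping_type="Auto",    # stop unpromising trials early
)
tuner.fit({
    "train": TrainingInput("s3://my-bucket/go-for-it/train/", content_type="text/csv"),
    "validation": TrainingInput("s3://my-bucket/go-for-it/validation/", content_type="text/csv"),
})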

Go-for-it model results

After training and HPO, the XGBoost model achieved a Brier score of 0.21. In addition to the Brier score, we examined the model predictions to ensure they recreate known aspects of the game. For example, the odds of converting on a fourth-down play decrease as the number of yards needed for a first down increases. The following figure shows the model’s predicted conversion probabilities as a function of the yards-to-go. We can observe two key trends. First, as expected, the conversion probability decreases as the yards-to-go increases. Second, a team is generally better off running the ball in short yards-to-go situations and passing the ball in long yards-to-go situations.

For the Next Gen Stats Decision Guide, it’s not sufficient for the model to make correct predictions. It must also assign valid probabilities to those predictions. To examine the validity of the model probabilities, we compare the probabilities against the aggregate play outcomes, as shown in the following graph. The model predictions were binned into 10%-wide categories from 0–90%. For each bin, the fraction of plays that were converted was calculated (bar height). For an ideal model, the bin heights should be roughly the midpoint of each bin (solid line). The following graph shows that when the model provides a conversion probability between 0–60%, the actual aggregate outcomes of these plays closely match the model’s predictions. For model predictions between 60–90%, the model appears to slightly underestimate the offense’s probability of converting (most notably between 60–70%). In situations where the agreement is poor, we can use postprocessing techniques to increase the agreement between play outcomes and the model probabilities. For an example involving deep learning models, see Quantifying uncertainty in deep learning systems.
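
This kind of reliability check takes only a few lines with scikit-learn; the following sketch bins held-out predictions into 10 buckets and compares each bucket against the observed conversion rate. The arrays here are placeholders for the model's predictions and the actual play outcomes.

import numpy as np
from sklearn.calibration import calibration_curve

# Placeholders: y_true is 1 if the play converted, 0 otherwise;
# y_prob is the model's predicted conversion probability for each play.
rng = np.random.default_rng(0)
y_prob = rng.uniform(0, 1, size=1000)
y_true = rng.binomial(1, y_prob)

# Bin predictions into 10 equal-width buckets and compute, for each bucket,
# the fraction of plays that actually converted versus the mean predicted probability.
frac_converted, mean_predicted = calibration_curve(y_true, y_prob, n_bins=10)
for predicted, observed in zip(mean_predicted, frac_converted):
    print(f"predicted ~{predicted:.2f} -> observed {observed:.2f}")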

ML production pipeline

For the model in production, we used SageMaker for preprocessing, training, and postprocessing. The model is hosted on Amazon Elastic Kubernetes Service (Amazon EKS), which provides a highly scalable, available, and secure environment for production use. The following figure shows a high-level diagram of the production pipeline. All steps are automated and require minimal maintenance.

Summary

AWS and the NFL NGS team jointly developed the Next Gen Stats Decision Guide, which helps fans understand the choices coaches make at pivotal moments in the game. The odds of converting on a fourth-down play are a key component of the Next Gen Stats Decision Guide. In this post, we provided insight into how AWS helped the NFL create the model powering fourth-down conversions and discussed methods to assess model performance.

The NGS team will be hosting these models as part of the 2021 NFL season. Keep an eye out for the Next Gen Stats Decision Guide during the next NFL game.

You can find full examples of creating custom training jobs, implementing HPO, and deploying models on SageMaker at the AWS Labs GitHub repo. If you would like us to help and accelerate your use of ML, contact the Amazon ML Solutions Lab program.


About the Authors

Selvan Senthivel is a Senior ML Engineer with the Amazon ML Solutions Lab team at AWS, focusing on helping customers with machine learning and deep learning problems and end-to-end ML solutions. He was the founding engineering lead of the Amazon Comprehend Medical service and contributed to the design and architecture of multiple AWS AI services.

Lin Lee Cheong is a Senior Scientist and Manager with the Amazon ML Solutions Lab team at Amazon Web Services. She works with strategic AWS customers to explore and apply artificial intelligence and machine learning to discover new insights and solve complex problems.

Tyler Mullenbach is a Principal Data Science Manager with AWS Professional Services. He leads a global team of data science consultants focusing on helping customers turn their data into insights and bring ML models to production.

Ankit Tyagi is a Senior Software Engineer with the NFL’s Next Gen Stats team. He focuses on backend data pipelines and machine learning for delivering stats to fans. Outside of work, you can find him playing tennis, experimenting with brewing beer, or playing guitar.

Mike Band is the Lead Analyst for NFL’s Next Gen Stats. He contributes to the ideation, development, and communication of advanced football performance metrics for the NFL Media Group, NFL Broadcast Partners, and fans.

Juyoung Lee is a Senior Software Engineer with the NFL’s Next Gen Stats team. Her work focuses on designing and developing machine learning models to create stats for fans. In her spare time, she enjoys being active by playing Ultimate Frisbee and doing CrossFit.

Michael Schaefer was the Director of Product and Analytics for NFL’s Next Gen Stats. His work focused on the design and execution of statistics, applications, and content delivered to NFL Media, NFL Broadcaster Partners, and fans.

Michael Chi is the Director of Technology for NFL’s Next Gen Stats. He is responsible for all technical aspects of the platform which is used by all 32 clubs, NFL Media and Broadcast Partners. In his free time, he enjoys being outdoors and spending time with his family.

Read More

Chain custom Amazon SageMaker Ground Truth jobs for image processing

Amazon SageMaker Ground Truth supports many different types of labeling jobs, including several image-based labeling workflows like image-level labels, bounding box-specific labels, or pixel-level labeling. For situations not covered by these standard approaches, Ground Truth also supports custom image-based labeling, which allows you to create a labeling workflow with a completely unique UI and associated processing. Beyond that, you can chain different Ground Truth labeling jobs together so that the output of one job acts as the input to another job, to allow even more flexibility in a labeling workflow by breaking the job into multiple stages.

In this post, we show how to chain two custom Ground Truth jobs together to perform advanced image manipulations, including isolating portions of images, and de-skewing images that were photographed from an angle. Additionally, we demonstrate several techniques for augmenting source images, which are helpful for situations where you have a limited number of source images.

Extracting regions of an image

Suppose we’re tasked with creating a machine learning (ML) model that processes an image of a shelving unit and determines whether any of the bins in that shelving unit need restocking. Due to the size of the storage room, a single camera is used to capture images of several shelving units, each from a different angle. The following image is an example of such a shelving unit.

Figure 1: A shelving unit with many bins full, photographed from an angle

For training or inference, we need images of individual bins, rather than the overall shelving unit. The model we’re developing takes an image of a single bin and returns a classification of Empty or Full. This classification feeds into an automated restocking system, allowing us to maintain stock levels at the bin level without the trouble of someone physically checking the levels.

Unfortunately, because the shelf images are taken at an angle, each bin is skewed and has a different size and shape. Because any bin images extracted from the main image are rectangular, the extracted images include undesirable content, as shown in the following image of two adjoining bins.

Figure 2: A closeup of a single bin, which shows two adjoining bins

In this example, we’ve isolated a rectangular region that bounds a given bin, but because the image was taken from an angle, portions of the bins on the left and right are also partially included. Because a rectangular section includes information from other bins, an image like this performs poorly when used for training or for inference.

To solve this, we can select a non-rectangular section of the original image and warp it to create a new image. The following image demonstrates the results of a warp transformation applied to the original image.

Figure 3: Original shelving unit with just the bins isolated, and the image warped to make it orthogonal

This warping accomplishes two tasks. First, we’ve selected just the shelving unit, cropping out the nearby walls, floor, and any other irrelevant areas near the edges of the shelves. Second, the warping of the image results in each bin being more rectangular than the original version.
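
The post doesn't prescribe a specific library for this step, but the warp itself is a standard perspective transform. The following is a minimal OpenCV sketch of the idea; the corner coordinates, output size, and file names are hypothetical.

import cv2
import numpy as np

# Four corners of the shelving unit in the source photo (as selected in the labeling UI),
# ordered top-left, top-right, bottom-right, bottom-left. Values are placeholders.
src_corners = np.float32([[412, 318], [1685, 205], [1744, 1379], [380, 1297]])

# Target rectangle for the de-skewed image.
out_w, out_h = 1200, 900
dst_corners = np.float32([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]])

image = cv2.imread("shelving_unit.jpg")                       # placeholder file name
matrix = cv2.getPerspectiveTransform(src_corners, dst_corners)
deskewed = cv2.warpPerspective(image, matrix, (out_w, out_h))
cv2.imwrite("shelving_unit_deskewed.jpg", deskewed)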

This warped image doesn’t have any new content—it’s just a distortion of the original image. But by performing this warping, each bin can be selected using a rectangular bounding box, which provides needed consistency, no matter what position a bin is in. Compare the following two bin images: the image on the left is extracted from the original image, and the image on the right is the same bin, extracted from the de-skewed image.

Figure 4: A single bin from the original image (left) compared with the bin from the warped image (right)

The bottom opening of the bin was originally at an angle, and now it’s horizontal. Overall, we’ve reduced the amount of the bin shown, and increased the proportion of the contents of the bin within the image. This improves our ML training process, because each bin image has less superfluous content.

Ground Truth jobs

Each custom Ground Truth labeling job is defined with a web-based user interface and two associated AWS Lambda functions (for more information, see Processing with AWS Lambda). One function runs prior to each image displayed by the UI, and the other runs after the user finishes the labeling job for all the images. Ground Truth offers several pre-made user interfaces (like bounding box-based selection), but you can also create your own custom UI if needed, as we do for this example.
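
As a simple illustration, a pre-annotation Lambda function for a custom job might look like the following sketch. It passes each image's S3 URI from the input manifest through to the UI template; the exact task input fields depend on your template, so the taskObject name here is just an example.

# Pre-annotation Lambda: called once per data object before it is shown in the UI.
def lambda_handler(event, context):
    # The data object from the input manifest typically carries the image's S3 URI
    # under "source-ref".
    source_ref = event["dataObject"].get("source-ref", "")
    return {
        "taskInput": {
            # Referenced in the custom UI template, for example as task.input.taskObject.
            "taskObject": source_ref
        }
    }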

When Ground Truth jobs are chained together, the output of one job is used as the input of another job. For this task, we use two chained jobs to process our images, as illustrated in the following diagram.

Figure 5: Architecture diagram showing two chained Ground Truth jobs, each with a Pre- and Post- UI Lambda function

Images that need to be labeled are stored in Amazon Simple Storage Service (Amazon S3). The first Ground Truth job retrieves images from Amazon S3 and displays them one at a time, waiting for the user to specify the four corners of the shelving unit within the image, using a custom UI. When that step is complete, the post-UI Lambda function uses the corner coordinates to warp or de-skew each image, which is then saved to the same S3 bucket that the original image resides in. Note that this step isn’t necessary during inference: when the camera is in a fixed location, you can save those corner coordinates for later use during inference.

After the first Ground Truth job has de-skewed the source image, the second job uses simple bounding boxes to label each bin within the de-skewed image. The post-UI Lambda function then extracts the individual bin images, augments them with rotations, flipping, and color and brightness alterations, and writes the resulting data to Amazon S3, where it can be used for model training or other purposes.

You can find example code and deployment instructions in the GitHub repo.

Custom user interface

From a labeler’s perspective, after they log in and select a job, they use the custom UI to select the four corners of a bin.

Figure 6: The custom Ground Truth UI for the first labeling job

For custom Ground Truth user interfaces, a set of custom tags is available, known as Crowd tags. These tags include bounding boxes, lines, points, and other user interface elements that you can use to build a labeling UI. In this case, we use the crowd-polygon tag, which is displayed as a yellow polygon.

After the labeler draws a polygon with four corners on the UI for all source images, they exit the UI by choosing Done. At this point, the post-UI Lambda function is run and each de-skewed image is saved to Amazon S3. When the function is complete, control is passed to the next chained Ground Truth job.

Generally, chained Ground Truth jobs reuse an output manifest file as the input manifest file for the next (chained) labeling job. In this case, we created a new image, so we modify the pre-UI Lambda function so it passes in the correct (de-skewed) file name, rather than the original, skewed image file name.

The second job in the chain uses the bounding box-based labeling functionality that is built in to Ground Truth. The bounding boxes don’t cover the entire contents of each bin, but they do cover the openings of the bins. This provides enough data to create a model to detect whether a bin is full or empty.

Figure 7: De-skewed image with bounding boxes from the second chained Ground Truth labeling job

After the labeler selects all the bins, they exit the UI by choosing Done. At this point, the post-UI Lambda function runs and crops out each bin image, makes variations of it for image augmentation purposes, and saves the variations into a folder structure in Amazon S3 based on classification. The top level of the folder structure is named training_data, with two subfolders: empty and full. Each subfolder contains images of bins that are either empty or full, suitable for use in model training.

Image augmentation

Image augmentation is a technique sometimes used in image-based ML workloads. It’s especially helpful when the number of source images is low, or limited in the number of variants. Typically, image augmentation is performed by taking a source image and creating multiple variants of it, altering factors like brightness and contrast, coloring, and even cropping or rotating images. These variations help the resulting model be more robust and capable of handling images that are dissimilar to the original training images.

In this example, we use image augmentation methods in the post-UI Lambda function of the second Ground Truth job. The labeler has specified the bounding boxes for each bin image in the Ground Truth UI, and that data is used to extract portions of the overall image. Those extracted portions are of the individual bins, and these smaller images are used as input into our image augmentation process.

In our case, we create 14 variants of each bin image, with variations of brightness, contrast, and sharpness, as well as horizontal flipping combined with these variations. With this approach, a single source image of a shelving unit with 24 bins generates 14 variants for each bin image, for a total of 336 images that can be used for training a model. The following shows an original bin image (upper left) and each of its variants.
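
The augmentation step can be implemented in a few lines with Pillow, as in the following sketch. The enhancement factors and file names are illustrative and don't necessarily reproduce the exact 14 variants used in this example.

from PIL import Image, ImageEnhance, ImageOps

def augment(bin_image):
    """Return brightness, contrast, and sharpness variants of a bin image,
    plus a horizontally flipped copy of each variant."""
    variants = []
    for factor in (0.7, 1.3):  # darker/brighter, lower/higher contrast, softer/sharper
        variants.append(ImageEnhance.Brightness(bin_image).enhance(factor))
        variants.append(ImageEnhance.Contrast(bin_image).enhance(factor))
        variants.append(ImageEnhance.Sharpness(bin_image).enhance(factor))
    variants += [ImageOps.mirror(v) for v in variants]  # horizontal flips
    return variants

source = Image.open("bin_0.png")  # placeholder file name
for i, variant in enumerate(augment(source)):
    variant.save(f"bin_0_variant_{i}.png")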

Conclusion

Custom Ground Truth jobs provide a great deal of flexibility, and using them with images allows advanced functionality like cropping and de-skewing images, as well as performing custom image augmentation. The supplied Crowd HTML tags support many different labeling approaches like polygons, lines, text boxes, modal alerts, key point placement, and others. Combined with the power of pre-UI and post-UI Lambda functions, a custom Ground Truth job allows you to construct complex labeling jobs to support a wide variety of use cases, and combining different custom jobs by chaining them together provides even more options.

You can use the GitHub repo associated with this post as a starting point for your own chained image labeling jobs. You can also extend the code to support additional image augmentation methods (like cropping or rotating the source images), or modify it to fit your particular use case.

To learn more about chained Ground Truth jobs, see Chaining Labeling Jobs.

For more information about the Crowd tags you can use in the Ground Truth UI, see Crowd HTML Elements Reference.


About the Author

Greg Sommerville is a Senior Prototyping Architect on the AWS Envision Engineering Americas Prototyping team, where he helps AWS customers implement innovative solutions to challenging problems with machine learning, IoT and serverless technologies. He lives in Ann Arbor, Michigan and enjoys practicing yoga, catering to his dogs, and playing poker.

Read More

Accelerate data preparation using Amazon SageMaker Data Wrangler for diabetic patient readmission prediction

Patient readmission to hospital after prior visits for the same disease results in an additional burden on healthcare providers, the health system, and patients. Machine learning (ML) models, if built and trained properly, can help understand reasons for readmission, and predict readmission accurately. ML could allow providers to create better treatment plans and care, which would translate to a reduction of both cost and mental stress for patients. However, ML is a complex technique, and building ML workloads has been out of reach for organizations that don’t have the resources to recruit a team of data engineers and scientists. In this post, we show you how to build an ML model based on the XGBoost algorithm to predict diabetic patient readmission easily and quickly with a graphical interface from Amazon SageMaker Data Wrangler.

Data Wrangler is an Amazon SageMaker Studio feature designed to allow you to explore and transform tabular data for ML use cases without coding. Data Wrangler is the fastest and easiest way to prepare data for ML. It gives you the ability to use a visual interface to access data and perform exploratory data analysis (EDA) and feature engineering. It also seamlessly operationalizes your data preparation steps by allowing you to export your data flow into Amazon SageMaker Pipelines, a Data Wrangler job, Python file, or Amazon SageMaker Feature Store.

Data Wrangler comes with over 300 built-in transforms and supports custom transformations written in Python (Pandas), PySpark, or SparkSQL. It also comes with built-in data analysis capabilities for charts (such as scatter plots or histograms) and time-saving model analysis capabilities such as feature importance, target leakage, and model explainability.

In this post, we explore the key capabilities of Data Wrangler using the UCI diabetic patient readmission dataset. We showcase how you can build ML data transformation steps without writing sophisticated coding, and how to create a model training, feature store, or ML pipeline with reproducibility for a diabetic patient readmission prediction use case.

We also have published a related GitHub project repo that includes the end-to-end ML workflow steps and relevant assets, including Jupyter notebooks.

We walk you through the following high-level steps:

  • Studio prerequisites and input dataset setup
  • Design your Data Wrangler flow file
  • Create processing and training jobs for model building
  • Host a trained model for real-time inference

Studio prerequisites and input dataset setup

To use Studio and Studio notebooks, you must complete the Studio onboarding process. Although you can choose from a few authentication methods, the simplest way to create a Studio domain is to follow the Quick start instructions. The Quick start uses the same default settings as the standard Studio setup. You can also choose to onboard using AWS Single Sign-On (AWS SSO) for authentication (see Onboard to Amazon SageMaker Studio Using AWS SSO).

Dataset

The patient readmission dataset captures 10 years (1999–2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes with about 100,000 observations.

You can start by downloading the public dataset and uploading it to an Amazon Simple Storage Service (Amazon S3) bucket. For demonstration purposes, we split the dataset into four tables based on feature categories: diabetic_data_hospital_visits.csv, diabetic_data_demographic.csv, diabetic_data_labs.csv, and diabetic_data_medication.csv. Review and run the code in datawrangler_workshop_pre_requisite.ipynb. If you leave everything at its default inside the notebook, the CSV files will be available in s3://sagemaker-${region}-${account_number}/sagemaker/demo-diabetic-datawrangler/.

Design your Data Wrangler flow file

To get started, on the Studio File menu, choose New, then choose Data Wrangler Flow.

This launches a Data Wrangler instance and configures it with the Data Wrangler app. The process takes a few minutes to complete.

Load the data from Amazon S3 into Data Wrangler

To load the data into Data Wrangler, complete the following steps:

  1. On the Import tab, choose Amazon S3 as the data source.
  2. Choose Add data source.

You could also import data from Amazon Athena, Amazon Redshift, or Snowflake. For more information about the currently supported import sources, see Import.

  1. Select the CSV files from the bucket s3://sagemaker-${region}-${account_number}/sagemaker/demo-diabetic-datawrangler/ one at a time.
  2. Choose Import for each file.

When the import is complete, data in an S3 bucket is available inside Data Wrangler for preprocessing.

Join the CSV files

Now that we have imported multiple CSV source datasets, let’s join them into a consolidated dataset.

  1. On the Data flow tab, for Data types, choose the plus sign.
  2. On the menu, choose Join.
  3. Choose the diabetic_data_hospital_visits.csv dataset as the Right dataset.
  4. Choose Configure to set up the join criteria.
  5. For Name, enter a name for the join.
  6. For Join type, choose a join type (for this post, Inner).
  7. Choose the columns for Left and Right.
  8. Choose Apply to preview the joined dataset.
  9. Choose Add to add it to the data flow file.

Built-in analysis

Before we apply any transformations on the input source, let’s perform a quick analysis of the dataset. Data Wrangler provides several built-in analysis types, like histogram, scatter plot, target leakage, bias report, and quick model. For more information about analysis types, see Analyze and Visualize.

Target leakage

Target leakage occurs when information in an ML training dataset is strongly correlated with the target label, but isn’t available when the model is used for prediction. You might have a column in your dataset that serves as a proxy for the column you want to predict with your model. For classification tasks, Data Wrangler calculates the prediction quality metric of ROC-AUC, which is computed individually for each feature column via cross-validation to generate a target leakage report.
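
Conceptually, the per-feature check resembles the following scikit-learn sketch, which scores each numeric column on its own with cross-validated ROC-AUC. This is an illustration of the idea, not Data Wrangler's internal implementation, and the file and column names are placeholders.

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("joined_dataset.csv")   # placeholder for the joined dataset
y = df["readmitted"]

scores = {}
for col in df.drop(columns=["readmitted"]).select_dtypes("number").columns:
    clf = DecisionTreeClassifier(max_depth=3)
    # Cross-validated one-vs-rest ROC-AUC using this single feature only.
    X_col = df[[col]].fillna(0)
    scores[col] = cross_val_score(clf, X_col, y, cv=5, scoring="roc_auc_ovr").mean()

# Features scoring near 1.0 are suspicious (possible leakage); scores near 0.5
# indicate the feature is uninformative on its own.
print(sorted(scores.items(), key=lambda kv: -kv[1])[:10])

To generate the target leakage report in Data Wrangler, complete the following steps: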

  1. On the Data Flow tab, for Join, choose the plus sign.
  2. Choose Add analysis.
  3. For Analysis type, choose Target Leakage.
  4. For Analysis name, enter a name.
  5. For Max features, enter 50.
  6. For Problem Type, choose classification.
  7. For Target, choose readmitted.
  8. Choose Preview to generate the report.

As shown in the preceding screenshot, there is no indication of target leakage in our input dataset. However, a few features like encounter_id_1, encounter_id_0, weight, and payer_code are marked as possibly redundant, with a predictive ability (ROC-AUC) of 0.5. This means these features by themselves aren’t providing any useful information towards predicting the target. Before making the decision to drop these uninformative features, you should consider whether they could add value when used in tandem with other features. For our use case, we keep them as is and move to the next step.

  1. Choose Save to save the analysis into your Data Wrangler data flow file.

Bias report

AI/ML systems are only as good as the data we put into them. ML-based systems are more accessible than ever before, and with the growth of adoption throughout various industries, further questions arise surrounding fairness and how it is ensured across these ML systems. Understanding how to detect and avoid bias in ML models is imperative and complex. With the built-in bias report in Data Wrangler, data scientists can quickly detect bias during the data preparation stage of the ML workflow. Bias report analysis uses Amazon SageMaker Clarify to perform bias analysis.

To generate a bias report, you must specify the target column that you want to predict and a facet or column that you want to inspect for potential biases. For example, we can generate a bias report on the gender feature for Female values to see whether there is any class imbalance.

  1. On the Analysis tab, choose Create new analysis.
  2. For Analysis type, choose Bias Report.
  3. For Analysis name, enter a name.
  4. For Select the column your model predicts, choose readmitted.
  5. For Predicted value, enter NO.
  6. For Column to analyze for bias, choose gender.
  7. For Column value to analyze for bias, choose Female.
  8. Leave remaining settings at their default.
  9. Choose Check for bias to generate the bias report.

As shown in the bias report, there is no significant bias in our input dataset, which means the dataset has a fair amount of representation by gender. For our dataset, we can move forward with a hypothesis that there is no inherent bias in our dataset. However, based on your use case and dataset, you might want to run similar bias reporting on other features of your dataset to identify any potential bias. If any bias is detected, you can consider applying a suitable transformation to address that bias.

  1. Choose Save to add this report to the data flow file.

Histogram

In this section, we use a histogram to gain insights into the target label patterns inside our input dataset.

  1. On the Analysis tab, choose Create new analysis.
  2. For Analysis type, choose Histogram.
  3. For Analysis name, enter a name.
  4. For X axis, choose readmitted.
  5. For Color by, choose race.
  6. For Facet by, choose gender.
  7. Choose Preview to generate a histogram.

This ML problem is a multi-class classification problem. However, we can observe a major target class imbalance between patients readmitted <30 days, readmitted >30 days, and not readmitted at all (NO). We can also see that these classes are proportionally distributed across gender and race. To improve our potential model predictability, we can merge <30 and >30 into a single positive class. Merging the target label classes this way turns our ML problem into a binary classification. As we demonstrate in the next section, we can do this easily by adding the respective transformations.

Transformations

When it comes to training an ML model for structured or tabular data, decision tree-based algorithms are considered best in class. This is because they use ensemble tree methods that combine many weak learners through gradient boosting.

For our medical source dataset, we use the SageMaker built-in XGBoost algorithm because it’s one of the most popular decision tree-based ensemble ML algorithms. The XGBoost algorithm can only accept numerical values as input, therefore as a prerequisite we must apply categorical feature transformations on our source dataset.

Data Wrangler comes with over 300 built-in transforms, which require no coding. Let’s use built-in transforms to apply a few key transformations and prepare our training dataset.

Handle missing values

To address missing values, complete the following steps:

  1. Switch to the Data tab to bring up all the built-in transforms.
  2. Expand Handle missing in the list of transforms.
  3. For Transform, choose Impute.
  4. For Column type, choose Numeric.
  5. For Input column, choose diag_1.
  6. For Imputing strategy, choose Mean.
  7. By default, the operation is performed in-place, but you can provide an optional Output column name, which creates a new column with imputed values. For this post, we keep the default in-place update.
  8. Choose Preview to preview the results.
  9. Choose Add to include this transformation step into the data flow file.
  10. Repeat these steps for the diag_2 and diag_3 features and impute missing values.

Search and edit features with special characters

Because our source dataset has features with special characters, we need to clean them before training. Let’s use the search and edit transform.

  1. Expand Search and edit in the list of transforms.
  2. For Transform, choose Find and replace substring.
  3. For Input column, choose race.
  4. For Pattern, enter ?.
  5. For Replacement string, enter Other.
  6. Leave Output column blank for in-place replacements.
  7. Choose Preview.
  8. Choose Add to add the transform to your data flow.
  9. Repeat the same steps for other features to replace weight and payer_code with 0 and medical_specialty with Other.

One-hot encoding for categorical features

To use one-hot encoding for categorical features, complete the following steps:

  1. Expand Encode categorical in the list of transforms.
  2. For Transform, choose One-hot encode.
  3. For Input column, choose race.
  4. For Output style, choose Columns.
  5. Choose Preview.
  6. Choose Add to add the change to the data flow.
  7. Repeat these steps for age and medical_specialty_filler to one-hot encode those categorical features as well.

Ordinal encoding for categorical features

To use ordinal encoding for categorical features, complete the following steps:

  1. Expand Encode categorical in the list of transforms.
  2. For Transform, choose Ordinal encode.
  3. For Input column, choose gender.
  4. For Invalid handling strategy, choose Keep.
  5. Choose Preview.
  6. Choose Add to add the change to the data flow.

Custom transformations: Add new features to your dataset

If we decide to store our transformed features in Feature Store, a prerequisite is to insert the eventTime feature into the dataset. We can easily do that using a custom transformation.

  1. Expand Custom Transform in the list of transforms.
  2. Choose Python (Pandas) and enter the following line of code:
    # Table is available as variable `df`
    import time
    df['eventTime'] = time.time()

  3. Choose Preview to view the results.
  4. Choose Add to add the change to the data flow.

Transform the target label

The target label readmitted has three classes: NO readmission, readmitted <30 days, and readmitted >30 days. We saw in our histogram analysis that there is a strong class imbalance because the majority of the patients didn’t readmit. We can combine the latter two classes into a positive class to denote the patients being readmitted, and turn the classification problem into a binary case instead of multi-class. Let’s use the search and edit transform to convert string values to binary values.

  1. Expand Search and edit in the list of transforms.
  2. For Transform, choose Find and replace substring.
  3. For Input column, choose readmitted.
  4. For Pattern, enter >30|<30.
  5. For the Replacement string, enter 1.

This converts all the values that have either >30 or <30 values to 1.

  1. Choose Preview to view the results.
  2. Choose Add to add this transform to the data flow.

Let’s repeat the same steps to convert NO values to 0.

  1. Expand Search and edit in the list of transforms.
  2. For Transform, choose Find and replace substring.
  3. For Input column, choose readmitted.
  4. For Pattern, enter NO.
  5. For Replacement string, enter 0.
  6. Choose Preview to review the converted column.
  7. Choose Add to add the transform to our data flow.

Now our target label readmitted is ready for ML training.
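
If you prefer, the same two replacements can be expressed as a single custom transform (Python Pandas) instead of two find-and-replace steps; a minimal sketch follows.

# Table is available as variable `df`
# Map the three original classes to a binary label in one step.
df['readmitted'] = df['readmitted'].replace({'<30': 1, '>30': 1, 'NO': 0})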

Position the target label as the first column for the XGBoost algorithm

Because we’re going to use the XGBoost built-in SageMaker algorithm to train the model, the algorithm assumes that the target label is in the first column. Let’s position the target label as such in order to use this algorithm.

  1. Expand Manage columns in the list of transforms.
  2. For Transform, choose Move column.
  3. For Move type, choose Move to start.
  4. For Column to move, choose readmitted.
  5. Choose Preview.
  6. Choose Add to add the change to your data flow.

Drop redundant columns

Next, we drop any redundant columns.

  1. Expand Manage columns in the list of transforms.
  2. For Transform, choose Drop column.
  3. For Column to drop, choose encounter_id_0.
  4. Choose Preview.
  5. Choose Add to add the changes to the flow file.
  6. Repeat these steps for the other redundant columns: patient_nbr_0, encounter_id_1, and patient_nbr_1.

At this stage, we have done a few analyses and applied a few transformations on our raw input dataset. If we choose to preserve the transformed state of the input dataset, like checkpoint, you can do so by choosing Export data. This option allows you to persist the transformed dataset to an S3 bucket.

Quick Model analysis

Now that we have applied transformations to our initial dataset, let’s explore the Quick Model analysis feature. Quick model helps you quickly evaluate the training dataset and produce importance scores for each feature. A feature importance score indicates how useful a feature is at predicting the target label. The feature importance score is between 0 and 1; a higher number indicates that the feature is more important to the whole dataset. Because our use case is a classification problem, the quick model also generates an F1 score for the current dataset.

  1. Switch back to the Analysis tab and choose Create new analysis to bring up the built-in analyses.
  2. For Analysis type, choose Quick Model.
  3. Enter a name for your analysis.
  4. For Label, choose readmitted.
  5. Choose Preview and wait for the model to be trained and the results to appear.

The resulting quick model F1 score shows 0.618 (your generated score might be different) with the transformed dataset. Data Wrangler performs several steps to generate the F1 score, including preprocessing, training, evaluating, and finally calculating feature importance. For more details about these steps, see Quick Model.

With the quick model analysis feature, data scientists can iterate through applicable transformations until they have their desired transformed dataset that can potentially lead to better business accuracy and expectations.

  1. Choose Save to add the quick model analysis to the data flow.

Export options

We’re now ready to export our data flow for further processing.

  1. Navigate back to the data flow designer by choosing Back to data flow at the top left.
  2. On the Export tab, choose Steps to reveal the Data Wrangler flow steps.
  3. Choose the last step to mark it with a check.
  4. Choose Export step to reveal the export options.

As of this writing, you have four export options:

  • Save to S3 – Save the data to an S3 bucket using a SageMaker processing job
  • Pipeline – Export a Jupyter notebook that creates a SageMaker pipeline with your data flow
  • Python Code – Export your data flow to Python code
  • Feature Store – Export a Jupyter notebook that creates a Feature Store feature group and adds features to an offline or online feature store
  1. Choose Save to S3 to generate a fully implemented Jupyter notebook that creates a processing job using your data flow file.

Run processing and training jobs for model building

In this section, we show how to run processing and training jobs using the generated Jupyter notebook from Data Wrangler.

Submit a processing job

We’re now ready to submit a SageMaker processing job using our data flow file.

Run all the cells up to and including the Create Processing Job cell inside the exported notebook.

The cell Create Processing Job triggers a new SageMaker processing job by provisioning managed infrastructure and running the required Data Wrangler Docker container on that infrastructure.

You can check the status of the submitted processing job by running the next cell Job Status & S3 Output Location.

You can also check the status of the submitted processing job on the SageMaker console.

Train a model with SageMaker

Now that the data has been processed, let’s train a model using the data. The same notebook has sample steps to train a model using the SageMaker built-in XGBoost algorithm. Because our use case is a binary classification ML problem, we need to change the objective to binary:logistic inside the sample training steps.
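
For reference, the core of those training steps looks roughly like the following SageMaker Python SDK sketch. The bucket, prefix, and container version are placeholders, and the exported notebook remains the authoritative version of this code.

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
region = session.boto_region_name
image_uri = sagemaker.image_uris.retrieve("xgboost", region, version="1.2-1")

xgb = Estimator(
    image_uri=image_uri,
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    output_path="s3://my-bucket/readmission/xgboost-output/",  # placeholder bucket/prefix
)
# Binary classification objective for the merged readmission label.
xgb.set_hyperparameters(objective="binary:logistic", num_round=100)

train_input = TrainingInput(
    "s3://my-bucket/readmission/processing-output/",  # placeholder processing job output
    content_type="text/csv",
)
xgb.fit({"train": train_input})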

Now we’re ready to run our training job using the SageMaker managed infrastructure. Run the cell Start the Training Job.

You can monitor the status of the submitted training job on the SageMaker console, on the Training jobs page.

Host a trained model for real-time inference

We now use another notebook available on GitHub under the project folder hosting/Model_deployment_Steps.ipynb. This is a simple notebook with two cells: the first cell has code for deploying your model to a persistent endpoint. You need to update model_url with your training job output S3 model artifact.

The second cell in the notebook runs inference on the sample test file under test_data/test_data_UCI_sample.csv. As you can see, we are able to generate predictions for our synthetic observations inside the CSV file. That concludes the ML workflow.
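
The two cells correspond roughly to the following sketch. The model artifact path, endpoint name, and container version are placeholders, and the test file is assumed to contain feature rows only (no header or target column).

import sagemaker
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer

session = sagemaker.Session()
region = session.boto_region_name
image_uri = sagemaker.image_uris.retrieve("xgboost", region, version="1.2-1")

# Cell 1: deploy the trained model to a persistent endpoint.
model = Model(
    image_uri=image_uri,
    model_data="s3://my-bucket/readmission/xgboost-output/model.tar.gz",  # placeholder model_url
    role=sagemaker.get_execution_role(),
    sagemaker_session=session,
)
model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="diabetic-readmission-endpoint",  # placeholder endpoint name
)

# Cell 2: send the sample CSV rows to the endpoint for real-time inference.
predictor = Predictor(
    endpoint_name="diabetic-readmission-endpoint",
    sagemaker_session=session,
    serializer=CSVSerializer(),
)
with open("test_data/test_data_UCI_sample.csv") as f:
    payload = f.read().strip()
print(predictor.predict(payload))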

Clean up

After you have experimented with the steps in this post, perform the following cleanup steps to stop incurring charges:

  1. On the SageMaker console, under Inference in the navigation pane, choose Endpoints.
  2. Select your hosted endpoint.
  3. On the Actions menu, choose Delete.
  4. On the SageMaker Studio Control Panel, navigate to your SageMaker user profile.
  5. Under Apps, locate your Data Wrangler app and choose Delete app.

Conclusion

In this post, we explored Data Wrangler capabilities using a public medical dataset related to patient readmission and demonstrated how to perform feature transformations using built-in transforms and quick analysis. We showed how, without much coding, to generate the required steps to trigger data processing and ML training. This no-code/low-code capability of Data Wrangler accelerates training data preparation and increases data scientist agility with faster iterative data preparation. In the end, we hosted our trained model and ran inferences against synthetic test data. We encourage you to check out our GitHub repository to get hands-on practice and find new ways to improve model accuracy! To learn more about SageMaker, visit the SageMaker Development Guide.


About the Authors

Shyam Namavaram is a Senior Solutions Architect at AWS. He has over 20 years of experience architecting and building distributed, hybrid, and cloud-native applications. He passionately works with customers accelerating their AI/ML adoption by providing technical guidance and helping them innovate and build secure cloud solutions on AWS. He specializes in AI/ML, containers, and analytics technologies. Outside of work, he loves playing sports and exploring nature with trekking.

Michael Hsieh is a Senior AI/ML Specialist Solutions Architect. He works with customers to advance their ML journey with a combination of Amazon ML offerings and his ML domain knowledge. As a Seattle transplant, he loves exploring the nature the region has to offer, such as the hiking trails, scenic kayaking in South Lake Union, and the sunset at Shilshole Bay.

Read More

Use Amazon SageMaker ACK Operators to train and deploy machine learning models

AWS recently released the new Amazon SageMaker Operators for Kubernetes using the AWS Controllers for Kubernetes (ACK). ACK is a framework for building Kubernetes custom controllers, where each controller communicates with an AWS service API. These controllers allow Kubernetes users to provision AWS resources like databases or message queues simply by using the Kubernetes API. The new SageMaker ACK Operators make it easier for machine learning (ML) developers and data scientists who use Kubernetes as their control plane to train, tune, and deploy ML models in Amazon SageMaker without signing in to the SageMaker console.

Kubernetes and SageMaker

Building scalable ML workflows involves many iterative steps, including sourcing and preparing data, building ML models, training and evaluating these models, deploying them to production, and monitoring workloads after deployment.

SageMaker is a fully managed service designed and optimized specifically for managing these ML workflows. It removes the undifferentiated heavy lifting of infrastructure management and eliminates the need to invest in IT and DevOps to manage clusters for ML model building, training, and inference. Compute resources are only provisioned when requested, scaled as needed, and shut down automatically when jobs complete, thereby providing near 100% utilization. SageMaker provides many performance and cost optimizations for distributed training, spot training, automatic model tuning, inference latency, and multi-model endpoints.

Many AWS customers who have portability requirements implement a hybrid cloud approach, or implement on-premises and use Kubernetes, an open-source, general-purpose container orchestration system, to set up repeatable ML pipelines running training and inference workloads. However, to support ML workloads, these developers still need to write custom code to optimize the underlying ML infrastructure, provide high availability and reliability, provide data science productivity tools, and comply with appropriate security and regulatory requirements. Kubernetes customers therefore want to use fully managed ML services such as SageMaker for cost-optimized and managed infrastructure, but want platform and infrastructure teams to continue using Kubernetes for orchestration and managing pipelines to retain standardization and portability.

To address this need, AWS allows you to train, tune, and deploy models in SageMaker by using the new SageMaker ACK Operators, which includes a set of custom resource definitions for SageMaker resources that extends the Kubernetes API. With the SageMaker ACK Operators, you can take advantage of fully managed SageMaker infrastructure, tools, and optimizations natively from Kubernetes.

How did we get here?

In late 2019, AWS introduced the SageMaker Operators for Kubernetes to enable developers and data scientists to manage the end-to-end SageMaker training and production lifecycle using Kubernetes as the control plane. SageMaker operators were installed from the GitHub repo by downloading a YAML configuration file that configured your Kubernetes cluster with the custom resource definitions and operator controller service.

In 2020, AWS introduced ACK to facilitate a Kubernetes-native way of managing AWS Cloud resources. ACK includes a common controller runtime, a code generator, and a set of AWS service-specific controllers, one of which is the SageMaker controller.

Going forward, new functionality will be added to the SageMaker Operators for Kubernetes through the ACK project.

How does ACK work?

The following diagram illustrates how ACK works.

In this example, Alice is a Kubernetes user. She wants to run model training on SageMaker from within the Kubernetes cluster using the Kubernetes API. Alice issues a call to kubectl apply, passing in a file that describes a Kubernetes custom resource describing her SageMaker training job. kubectl apply passes this file, called a manifest, to the Kubernetes API server running in the Kubernetes controller node (Step 1 in the workflow diagram).

The Kubernetes API server receives the manifest with the SageMaker training job specification and determines whether Alice has permissions to create a custom resource of kind sagemaker.services.k8s.aws/TrainingJob, and whether the custom resource is properly formatted (Step 2).

If Alice is authorized and the custom resource is valid, the Kubernetes API server writes (Step 3) the custom resource to its etcd data store and then responds back (Step 4) to Alice that the custom resource has been created.

The SageMaker controller, which is running on a Kubernetes worker node within the context of a normal Kubernetes Pod, is notified (Step 5) that a new custom resource of kind sagemaker.services.k8s.aws/TrainingJob has been created.

The SageMaker controller then communicates (Step 6) with the SageMaker API, calling the SageMaker CreateTrainingJob API to create the training job in AWS. After communicating with the SageMaker API, the SageMaker controller calls the Kubernetes API server to update (Step 7) the custom resource’s status with information it received from SageMaker. The SageMaker controller therefore provides the same information to the developers that they would have received using the AWS SDK. This results in a better and consistent developer experience.

Machine learning use case

For this post, we follow the SageMaker example provided in the following notebook. However, you can reuse the components in this example with your preference of SageMaker built-in or custom algorithms and your own datasets.

We use the Abalone dataset originally from the UCI data repository [1]. In the libsvm converted version, the nominal feature (male/female/infant) has been converted into a real valued feature. The age of abalone is to be predicted from eight physical measurements. This dataset is already processed and stored in Amazon Simple Storage Service (Amazon S3). We train an XGBoost model on the UCI Abalone dataset to replicate the flow in the example Jupyter notebook.
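
For reference, each record in the libsvm-converted dataset is a single line: the numeric label followed by space-separated index:value pairs for the eight measurements. The following short Python snippet parses one such record; the record shown is illustrative, not copied from the actual dataset files.

# Minimal sketch: parse a libsvm-format record (label followed by index:value pairs).
# The record below is illustrative, not taken from the actual dataset files.
record = "12 1:0.455 2:0.365 3:0.095 4:0.514 5:0.2245 6:0.101 7:0.15 8:1"
parts = record.split()
label = float(parts[0])
features = {int(idx): float(val) for idx, val in (pair.split(":") for pair in parts[1:])}
print(label, features)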

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account.

  • An existing Amazon Elastic Kubernetes Service (Amazon EKS) cluster. It should be Kubernetes version 1.16+. For automated cluster creation using eksctl, see Getting started with Amazon EKS – eksctl and create your cluster with Amazon EC2 Linux managed nodes.

Install the following tools on the client machine used to access your Kubernetes cluster (you can use AWS Cloud9, a cloud-based integrated development environment (IDE) for the Kubernetes cluster setup):

  • kubectl – A command line tool for working with Kubernetes clusters.
  • Helm version 3.7+ – A tool for installing and managing Kubernetes applications.
  • AWS Command Line Interface (AWS CLI) – A command line tool for interacting with AWS services.
  • eksctl – A command line tool for working with Amazon EKS clusters that automates many individual tasks.
  • yq – A command line YAML processor. (For Linux environments, use the wget plain binary installation).

Set up IAM role-based authentication for the controller Pod

IAM roles for service accounts (IRSA) allows fine-grained roles at the Kubernetes Pod level by combining an OpenID Connect (OIDC) identity provider with Kubernetes service account annotations. In this section, we associate the Amazon EKS cluster with an OIDC provider and create an AWS Identity and Access Management (IAM) role that is assumed by the ACK controller Pod via its service account to access AWS services.

Create a cluster and OIDC ID provider

Make sure you’re connected to the right cluster. Substitute the values for CLUSTER_NAME and CLUSTER_REGION below:

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

# Set the cluster name, region where the cluster exists
export CLUSTER_NAME=<CLUSTER_NAME>
export CLUSTER_REGION=<CLUSTER_REGION>
export RANDOM_VAR=$RANDOM

aws eks update-kubeconfig --name $CLUSTER_NAME --region $CLUSTER_REGION
kubectl config get-contexts 

# Ensure cluster has compute
kubectl get nodes

Set up the OIDC ID provider (IdP) in AWS and associate it with your Amazon EKS cluster:

eksctl utils associate-iam-oidc-provider --cluster ${CLUSTER_NAME} \
--region ${CLUSTER_REGION} --approve

Get the identity issuer URL by running the following code:

export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)
OIDC_PROVIDER_URL=$(aws eks describe-cluster --name $CLUSTER_NAME --region $CLUSTER_REGION --query "cluster.identity.oidc.issuer" --output text | cut -c9-)

Set up an IAM role

Next, let’s set up the IAM role that defines the access to the SageMaker and Application Auto Scaling services. For this, we also need to have an IAM trust policy in place, allowing the specified Kubernetes service account (for example, ack-sagemaker-controller) to assume the IAM role.

Create a file named trust.json and insert the following trust relationship code block required for the IAM role:

printf '{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::'$AWS_ACCOUNT_ID':oidc-provider/'$OIDC_PROVIDER_URL'"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "'$OIDC_PROVIDER_URL':aud": "sts.amazonaws.com",
          "'$OIDC_PROVIDER_URL':sub": [
            "system:serviceaccount:ack-system:ack-sagemaker-controller",
            "system:serviceaccount:ack-system:ack-applicationautoscaling-controller"
          ]
        }
      }
    }
  ]
}
' > ./trust.json

Updating an Application Auto Scaling Scalable Target requires additional permissions. First, create a service-linked role for Application Auto Scaling.

aws iam create-service-linked-role --aws-service-name sagemaker.application-autoscaling.amazonaws.com

Create a file named pass_role_policy.json with the policy required for the IAM role:

printf '{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::'$AWS_ACCOUNT_ID':role/aws-service-role/sagemaker.application-autoscaling.amazonaws.com/AWSServiceRoleForApplicationAutoScaling_SageMakerEndpoint"
    }
  ]
}
' > ./pass_role_policy.json

Run the following command to create a role with the trust relationship defined in trust.json. This trust relationship is required so that Amazon EKS (via a webhook) can inject the necessary environment variables and mount volumes into the Pod that are required by the AWS SDK to assume this role.

OIDC_ROLE_NAME=ack-controller-role-$CLUSTER_NAME

aws iam create-role --role-name $OIDC_ROLE_NAME --assume-role-policy-document file://trust.json

# Attach the AmazonSageMakerFullAccess Policy to the Role. This policy provides full access to 
# Amazon SageMaker. Also provides select access to related services (e.g., Application Autoscaling,
# S3, ECR, CloudWatch Logs).
aws iam attach-role-policy --role-name $OIDC_ROLE_NAME --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess

# Attach the iam:PassRole policy required for updating ApplicationAutoscaling ScalableTarget
aws iam put-role-policy --role-name $OIDC_ROLE_NAME --policy-name "iam-pass-role-policy" --policy-document file://pass_role_policy.json

export IAM_ROLE_ARN_FOR_IRSA=$(aws iam get-role --role-name $OIDC_ROLE_NAME --output text --query 'Role.Arn')
echo $IAM_ROLE_ARN_FOR_IRSA

Install SageMaker and Application Auto Scaling controllers

Choose an AWS Region for the SageMaker and automatic scaling resources we create in this post. For convenience, we recommend using us-east-1:

export SERVICE_REGION="us-east-1"
# Namespace for controller
export ACK_K8S_NAMESPACE="ack-system"

Now, let’s install the SageMaker and Application Auto Scaling controllers using the following helper script. This script pulls the helm charts from ACK’s public Amazon Elastic Container Registry (Amazon ECR) repository and configures the values of the AWS account, the default Region for resources to be created, and the IAM role (created in the previous step) in the service account to be used by the controller Pod to assume the role. Create a file named install-controllers.sh and insert the following code block:

#!/usr/bin/env bash

# Deploy ACK Helm Charts
export HELM_EXPERIMENTAL_OCI=1
export ACK_K8S_NAMESPACE=${ACK_K8S_NAMESPACE:-"ack-system"}

function install_ack_controller() {
    local service="$1"
    local release_version="$2"
    local chart_export_path=/tmp/chart
    local chart_ref=$service-chart
    local chart_repo=public.ecr.aws/aws-controllers-k8s/$chart_ref
    local chart_package=$chart_ref-$release_version.tgz
    
    # Download helm chart
    mkdir -p $chart_export_path
    helm pull oci://"$chart_repo" --version "$release_version" -d $chart_export_path
    tar xvf "$chart_export_path"/"$chart_package" -C "$chart_export_path"

    # Update the values in helm chart
    pushd $chart_export_path/$service-chart
        yq e '.aws.region = env(SERVICE_REGION)' -i values.yaml 
        yq e '.serviceAccount.annotations."eks.amazonaws.com/role-arn" = env(IAM_ROLE_ARN_FOR_IRSA)' -i values.yaml
    popd

    # Create a namespace and install the helm chart
    helm install -n $ACK_K8S_NAMESPACE --create-namespace ack-$service-controller $chart_export_path/$service-chart
}

install_ack_controller "sagemaker" "v0.3.0"
install_ack_controller "applicationautoscaling" "v0.2.0"

Run the script:

chmod +x install-controllers.sh
./install-controllers.sh

The output contains the following:

Pulled: public.ecr.aws/aws-controllers-k8s/sagemaker-chart:v0.3.0
...

NAME: ack-sagemaker-controller
LAST DEPLOYED: Tue Nov 16 01:53:34 2021
NAMESPACE: ack-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
Pulled: public.ecr.aws/aws-controllers-k8s/applicationautoscaling-chart:v0.2.0
...

NAME: ack-applicationautoscaling-controller
LAST DEPLOYED: Tue Nov 16 01:53:35 2021
NAMESPACE: ack-system
STATUS: deployed
REVISION: 1
TEST SUITE: None

Next, we run the following commands to verify custom resource definitions were applied and controller Pods are running:

kubectl get crds | grep "services.k8s.aws"

The output of the command should contain a number of custom resource definitions related to SageMaker (such as trainingjobs or endpoints) and Application Auto Scaling (such as scalingpolicies and scalabletargets). Next, check that the controller Pods are running:

# Get pods in controller namespace
kubectl get pods -n $ACK_K8S_NAMESPACE

We see one controller Pod per service running in the ack-system namespace:

NAME                                                     READY   STATUS    RESTARTS   AGE
ack-applicationautoscaling-controller-7479dc78dd-ts9ng   1/1     Running   0          4m52s
ack-sagemaker-controller-788858fc98-6fgr6                1/1     Running   0          4m56s

Prepare SageMaker resources

Next, we create an S3 bucket and IAM role for SageMaker.

To train a model with SageMaker, we need an S3 bucket to store the dataset and artifacts from the training process. We use the preprocessed dataset at s3://sagemaker-sample-files/datasets/tabular/uci_abalone [1].

Let’s create a variable for the S3 bucket:

export SAGEMAKER_BUCKET=ack-sagemaker-bucket-$RANDOM_VAR

Create a file named create-bucket.sh and insert the following code block:

printf '#!/usr/bin/env bash
# create bucket
if [[ $SERVICE_REGION != "us-east-1" ]]; then
  aws s3api create-bucket --bucket "$SAGEMAKER_BUCKET" --region "$SERVICE_REGION" --create-bucket-configuration LocationConstraint="$SERVICE_REGION"
else
  aws s3api create-bucket --bucket "$SAGEMAKER_BUCKET" --region "$SERVICE_REGION"
fi
# sync dataset
aws s3 sync s3://sagemaker-sample-files/datasets/tabular/uci_abalone/train s3://"$SAGEMAKER_BUCKET"/datasets/tabular/uci_abalone/train
aws s3 sync s3://sagemaker-sample-files/datasets/tabular/uci_abalone/validation s3://"$SAGEMAKER_BUCKET"/datasets/tabular/uci_abalone/validation
' > ./create-bucket.sh

Run the script to create the S3 bucket and copy the dataset:

chmod +x create-bucket.sh
./create-bucket.sh

The SageMaker training job that we run later in the post needs an IAM role to access Amazon S3 and SageMaker. Run the following commands to create a SageMaker execution IAM role that is used by SageMaker to access AWS resources:

export SAGEMAKER_EXECUTION_ROLE_NAME=ack-sagemaker-execution-role-$RANDOM_VAR

TRUST="{ \"Version\": \"2012-10-17\", \"Statement\": [ { \"Effect\": \"Allow\", \"Principal\": { \"Service\": \"sagemaker.amazonaws.com\" }, \"Action\": \"sts:AssumeRole\" } ] }"
aws iam create-role --role-name ${SAGEMAKER_EXECUTION_ROLE_NAME} --assume-role-policy-document "$TRUST"
aws iam attach-role-policy --role-name ${SAGEMAKER_EXECUTION_ROLE_NAME} --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
aws iam attach-role-policy --role-name ${SAGEMAKER_EXECUTION_ROLE_NAME} --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess

SAGEMAKER_EXECUTION_ROLE_ARN=$(aws iam get-role --role-name ${SAGEMAKER_EXECUTION_ROLE_NAME} --output text --query 'Role.Arn')

echo $SAGEMAKER_EXECUTION_ROLE_ARN

Note down the execution role ARN to use in later steps.

Train an XGBoost model

Now, we create a training.yaml file to specify the parameters for a SageMaker training job. SageMaker training jobs enable remote training of ML models. You can customize each training job to run your own ML scripts with custom architectures, data loaders, hyperparameters, and more. To submit a SageMaker training job, we require a job name. Let’s create that variable first:

export JOB_NAME=ack-xgboost-training-job-$RANDOM_VAR

In the following code, we create a training.yaml file that contains the hyperparameters for the training job as well as the location of the training and validation data. It’s also where we specify the Amazon ECR image used for training.

Note: If your $SERVICE_REGION isn’t us-east-1, change the following image URI. For the XGBoost algorithm version 1.2-1 Region-specific image URI, see Docker Registry Paths and Example Code.

export XGBOOST_IMAGE=683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.2-1

printf '
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: TrainingJob
metadata:
  name: '$JOB_NAME'
spec:
  # Name that will appear in SageMaker console
  trainingJobName: '$JOB_NAME'
  hyperParameters: 
    max_depth: "5"
    gamma: "4"
    eta: "0.2"
    min_child_weight: "6"
    subsample: "0.7"
    objective: "reg:linear"
    num_round: "50"
    verbosity: "2"
  algorithmSpecification:
    trainingImage: '$XGBOOST_IMAGE'
    trainingInputMode: File
  roleARN: '$SAGEMAKER_EXECUTION_ROLE_ARN'
  outputDataConfig:
    # The output path of our model
    s3OutputPath: s3://'$SAGEMAKER_BUCKET'
  resourceConfig:
    instanceCount: 1
    instanceType: ml.m4.xlarge
    volumeSizeInGB: 5
  stoppingCondition:
    maxRuntimeInSeconds: 3600
  inputDataConfig:
    - channelName: train
      dataSource:
        s3DataSource:
          s3DataType: S3Prefix
          # The input path of our train data 
          s3URI: s3://'$SAGEMAKER_BUCKET'/datasets/tabular/uci_abalone/train/abalone.train
          s3DataDistributionType: FullyReplicated
      contentType: text/libsvm
      compressionType: None
    - channelName: validation
      dataSource:
        s3DataSource:
          s3DataType: S3Prefix
          # The input path of our validation data 
          s3URI: s3://'$SAGEMAKER_BUCKET'/datasets/tabular/uci_abalone/validation/abalone.validation
          s3DataDistributionType: FullyReplicated
      contentType: text/libsvm
      compressionType: None 
' > ./training.yaml

Now, we can create the training job:

kubectl apply -f training.yaml

You should see the following output:

trainingjob.sagemaker.services.k8s.aws/ack-xgboost-training-job-7420 created

You can watch the status of the training job. It takes a few minutes for STATUS to show as Completed.

kubectl get trainingjob.sagemaker --watch
NAME                            SECONDARYSTATUS   STATUS
ack-xgboost-training-job-7420   Starting          InProgress
ack-xgboost-training-job-7420   Downloading       InProgress
ack-xgboost-training-job-7420   Training          InProgress
ack-xgboost-training-job-7420   Completed         Completed
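
The controller surfaces the same job information in the custom resource status that the SageMaker API returns. As a cross-check, the following is a minimal boto3 sketch that describes the job directly in SageMaker; it assumes the JOB_NAME and SERVICE_REGION environment variables exported earlier are available to the Python process.

# Minimal sketch: confirm the job status and model artifact location from the SageMaker API.
# Assumes the JOB_NAME and SERVICE_REGION environment variables exported earlier.
import os
import boto3

sm = boto3.client("sagemaker", region_name=os.environ["SERVICE_REGION"])
job = sm.describe_training_job(TrainingJobName=os.environ["JOB_NAME"])

print(job["TrainingJobStatus"], job["SecondaryStatus"])
# This S3 path corresponds to the modelDataURL used in the deployment manifest in the next section
print(job["ModelArtifacts"]["S3ModelArtifacts"])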

Deploy the results of the SageMaker training job

To deploy the model, we need to specify a model name, an endpoint config name, and an endpoint name:

export MODEL_NAME=ack-xgboost-model-$RANDOM_VAR
export ENDPOINT_CONFIG_NAME=ack-xgboost-endpoint-config-$RANDOM_VAR
export ENDPOINT_NAME=ack-xgboost-endpoint-$RANDOM_VAR

We deploy this model on an ml.c5.large instance. In the following .yaml file, we define the model, the endpoint config, and the endpoint:

printf '
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: Model
metadata:
  name: '$MODEL_NAME'
spec:
  modelName: '$MODEL_NAME'
  primaryContainer:
    containerHostname: xgboost
    # The source of the model data
    modelDataURL: s3://'$SAGEMAKER_BUCKET'/'$JOB_NAME'/output/model.tar.gz
    image: '$XGBOOST_IMAGE'
  executionRoleARN: '$SAGEMAKER_EXECUTION_ROLE_ARN'
---
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: EndpointConfig
metadata:
  name: '$ENDPOINT_CONFIG_NAME'
spec:
  endpointConfigName: '$ENDPOINT_CONFIG_NAME'
  productionVariants:
  - modelName: '$MODEL_NAME'
    variantName: AllTraffic
    instanceType: ml.c5.large
    initialInstanceCount: 1
---
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: Endpoint
metadata:
  name: '$ENDPOINT_NAME'
spec:
  endpointName: '$ENDPOINT_NAME'
  endpointConfigName: '$ENDPOINT_CONFIG_NAME'
' > ./deploy.yaml

Now, we can deploy the model, endpoint config, and endpoint:

kubectl apply -f deploy.yaml

You should see the following output:

model.sagemaker.services.k8s.aws/ack-xgboost-model-7420 created
endpointconfig.sagemaker.services.k8s.aws/ack-xgboost-endpoint-config-7420 created
endpoint.sagemaker.services.k8s.aws/ack-xgboost-endpoint-7420 created

We can observe that the model and endpoint config were created. Deploying the endpoint may take some time:

kubectl describe models.sagemaker
kubectl describe endpointconfigs.sagemaker
kubectl describe endpoints.sagemaker

We can watch this process using the following command:

kubectl get endpoints.sagemaker --watch

After some time, the STATUS changes to InService:

NAME                        STATUS
ack-xgboost-endpoint-7420   Creating         
ack-xgboost-endpoint-7420   InService        

This indicates the deployed endpoint is ready for use.

Verify the inference capabilities of the trained model

We invoke the model endpoint using Python to emulate a typical use case. We reuse the code from the SageMaker example notebook.

We first download the test set from Amazon S3. Then we load a single sample from the test set and use it to invoke the endpoint we deployed in the previous section. Download the test file with the following code:

pip install boto3 numpy
aws s3 cp s3://sagemaker-sample-files/datasets/tabular/uci_abalone/test/abalone.test abalone.test
head -1 abalone.test > abalone.single.test

Use the Python interpreter to test inference. Make sure Python 3 is installed on the client machine and available on your shell’s search path.

Create a file named predict.py and insert the following code block:

printf '
import sys
import math
import json
import boto3
import numpy as np
import os

region = os.environ.get("SERVICE_REGION")
endpoint_name = os.environ.get("ENDPOINT_NAME")

runtime_client = boto3.client("runtime.sagemaker", region_name=region)

file_name = "abalone.single.test"
with open(file_name, "r") as f:
    payload = f.read().strip()

response = runtime_client.invoke_endpoint(
    EndpointName=endpoint_name, ContentType="text/x-libsvm", Body=payload
)

result = response["Body"].read().decode("utf-8").split(",")
result = [math.ceil(float(i)) for i in result]
label = payload.strip(" ").split()[0]
print("Label: " + label)
print("Prediction:" + str(result[0]))
' > ./predict.py
python predict.py

Running this sample should give us the following result:

Label: 12
Prediction: 13

The ML model estimates the age of the abalone in the test example to be 13; the actual age is 12. This suggests that our model has trained successfully and produces reasonable predictions. Note that we haven’t performed hyperparameter tuning or applied other methods to increase accuracy, which are outside the scope of this post.

Dynamically scale the endpoint according to the load

SageMaker ACK Operators support custom resource definitions for automatic scaling (using ScalableTarget and ScalingPolicy) for your hosted models. The following resources adjust the number of instances (from a minimum of 1 to a maximum of 20) provisioned for a model in response to changes in the SageMakerVariantInvocationsPerInstance metric, which tracks the average number of times per minute that each instance for a variant is invoked:

printf '
apiVersion: applicationautoscaling.services.k8s.aws/v1alpha1
kind: ScalableTarget
metadata:
  name: ack-scalable-target-predefined
spec:
  maxCapacity: 20
  minCapacity: 1
  resourceID: endpoint/'$ENDPOINT_NAME'/variant/AllTraffic
  scalableDimension: "sagemaker:variant:DesiredInstanceCount"
  serviceNamespace: sagemaker
---
apiVersion: applicationautoscaling.services.k8s.aws/v1alpha1
kind: ScalingPolicy
metadata:
  name: ack-scaling-policy-predefined
spec:
  policyName: ack-scaling-policy-predefined
  policyType: TargetTrackingScaling
  resourceID: endpoint/'$ENDPOINT_NAME'/variant/AllTraffic
  scalableDimension: "sagemaker:variant:DesiredInstanceCount"
  serviceNamespace: sagemaker
  targetTrackingScalingPolicyConfiguration:
    targetValue: 60
    scaleInCooldown: 700
    scaleOutCooldown: 300
    predefinedMetricSpecification:
        predefinedMetricType: SageMakerVariantInvocationsPerInstance
 ' > ./scale-endpoint.yaml

Apply with the following code:

kubectl apply -f scale-endpoint.yaml

You should see the following output:

scalabletarget.applicationautoscaling.services.k8s.aws/ack-scalable-target-predefined created
scalingpolicy.applicationautoscaling.services.k8s.aws/ack-scaling-policy-predefined created

We can observe that scalingpolicy was created:

kubectl describe scalingpolicy.applicationautoscaling

The output of scalingpolicy looks like the following:

Status:
  Ack Resource Metadata:
    Arn:               arn:aws:autoscaling:us-east-1:123456789012:scalingPolicy:b33d12b8-aa81-4cb8-855e-c7b6dcb9d6e7:resource/SageMaker/endpoint/ack-xgboost-endpoint/variant/AllTraffic:policyName/ack-scaling-policy-predefined
    Owner Account ID:  123456789012
  Alarms:
    Alarm ARN:   arn:aws:cloudwatch:us-east-1:123456789012:alarm:TargetTracking-endpoint/ack-xgboost-endpoint/variant/AllTraffic-AlarmHigh-966b8232-a9b9-467d-99f3-95436f5c0383
    Alarm Name:  TargetTracking-endpoint/ack-xgboost-endpoint/variant/AllTraffic-AlarmHigh-966b8232-a9b9-467d-99f3-95436f5c0383
    Alarm ARN:   arn:aws:cloudwatch:us-east-1:123456789012:alarm:TargetTracking-endpoint/ack-xgboost-endpoint/variant/AllTraffic-AlarmLow-71e39f85-1afb-401d-9703-b788cdc10a93
    Alarm Name:  TargetTracking-endpoint/ack-xgboost-endpoint/variant/AllTraffic-AlarmLow-71e39f85-1afb-401d-9703-b788cdc10a93
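
To confirm that the scalable target was registered with the capacity limits defined above, you can also query the Application Auto Scaling API directly. The following is a minimal boto3 sketch; it assumes the ENDPOINT_NAME and SERVICE_REGION environment variables exported earlier are available to the Python process.

# Minimal sketch: verify the registered scalable target and its capacity limits.
# Assumes the ENDPOINT_NAME and SERVICE_REGION environment variables exported earlier.
import os
import boto3

aas = boto3.client("application-autoscaling", region_name=os.environ["SERVICE_REGION"])
resource_id = "endpoint/" + os.environ["ENDPOINT_NAME"] + "/variant/AllTraffic"

targets = aas.describe_scalable_targets(
    ServiceNamespace="sagemaker",
    ResourceIds=[resource_id],
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
)
for target in targets["ScalableTargets"]:
    print(target["ResourceId"], target["MinCapacity"], target["MaxCapacity"])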

Clean up

Run the following commands to delete the resources created in this post:

kubectl delete -f scale-endpoint.yaml
kubectl delete -f deploy.yaml
kubectl delete -f training.yaml

Create a file named uninstall-controller.sh and insert the following code block required for deleting the controller and custom resource definitions:

printf '#!/usr/bin/env bash

# Uninstall Controller

export HELM_EXPERIMENTAL_OCI=1
export ACK_K8S_NAMESPACE=${ACK_K8S_NAMESPACE:-"ack-system"}

function uninstall_ack_controller() {
   local service="$1"
   local chart_export_path=/tmp/chart
   
   helm uninstall -n $ACK_K8S_NAMESPACE ack-$service-controller
   kubectl delete -f $chart_export_path/$service-chart/crds
}

uninstall_ack_controller "sagemaker"
uninstall_ack_controller "applicationautoscaling"
' > ./uninstall-controller.sh

Run the following commands to uninstall the controller and custom resource definitions, and delete the namespace, IAM roles, and S3 bucket you created:

# uninstall controller and remove CRDs
chmod +x uninstall-controller.sh
./uninstall-controller.sh

# Delete controller namespace
kubectl delete namespace $ACK_K8S_NAMESPACE

# Delete S3 bucket
aws s3 rb s3://$SAGEMAKER_BUCKET --force

# Delete SageMaker execution role
aws iam detach-role-policy --role-name $SAGEMAKER_EXECUTION_ROLE_NAME --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
aws iam detach-role-policy --role-name $SAGEMAKER_EXECUTION_ROLE_NAME --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam delete-role --role-name $SAGEMAKER_EXECUTION_ROLE_NAME

# Delete application autoscaling service linked role
aws iam delete-service-linked-role --role-name AWSServiceRoleForApplicationAutoScaling_SageMakerEndpoint

# Delete IAM role created for IRSA
aws iam detach-role-policy --role-name $OIDC_ROLE_NAME --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
aws iam delete-role-policy --role-name $OIDC_ROLE_NAME --policy-name "iam-pass-role-policy"
aws iam delete-role --role-name $OIDC_ROLE_NAME

Conclusion

SageMaker ACK Operators provide engineering teams with a native Kubernetes experience for creating and interacting with ML jobs on SageMaker, either through the Kubernetes API or through Kubernetes command line utilities such as kubectl. You can build automation, tooling, and custom interfaces for data scientists in Kubernetes by using these controllers—all without building, maintaining, or optimizing ML infrastructure. Data scientists and developers familiar with Kubernetes can compose and interact with fully managed SageMaker training, tuning, and inference jobs as they would with Kubernetes jobs running locally. Logs from SageMaker jobs stream back to Kubernetes, allowing you to natively view logs for your model training, tuning, and prediction jobs from the command line.

ACK is a community-driven project and will soon include service controllers for other AWS service APIs.

Links

[1] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.


About the Authors

Kanwaljit Khurmi is a Senior Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.

Suraj Kota is a Software Engineer specialized in Machine Learning infrastructure. He builds tools to easily get started and scale machine learning workloads on AWS. He worked on the AWS Deep Learning Containers, Deep Learning AMI, SageMaker Operators for Kubernetes, and other open source integrations like Kubeflow.

Archis Joglekar is an AI/ML Partner Solutions Architect in the Emerging Technologies team. He is interested in performant, scalable deep learning and scientific computing using the building blocks at AWS. His past experiences range from computational physics research to machine learning platform development in academia, national labs, and startups. His time away from the computer is spent playing soccer and with friends and family.


Postprocessing with Amazon Textract: Multi-page table handling

Amazon Textract is a machine learning (ML) service that automatically extracts printed text, handwriting, and other data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify and extract data from forms and tables.

Currently, thousands of customers are using Amazon Textract to process different types of documents. Many include tables across one or multiple pages, such as bank statements and financial reports.

Many developers expressed interest in merging Amazon Textract responses where tables exist across multiple pages. This post demonstrates how you can use the amazon-textract-response-parser utility to accomplish this and highlights a few tricks to optimize the process.

Solution overview

When tables span multiple pages, a series of steps and validations is required to correctly determine the linkage across pages.

These include analyzing table structure similarities across pages (columns, headers, margins) and determining whether any additional content, such as headers or footers, logically breaks the tables. These logical steps are separated into two major groups (page context and table structure), and you can adjust and optimize each step according to your use case.

This solution runs these tasks in series and only merges the results when all checks are completed and passed. The following diagram shows the solution workflow.

Implement the solution

To get started, you must install the amazon-textract-response-parser and amazon-textract-helper libraries. The Amazon Textract response parser library enables us to easily parse the Amazon Textract JSON response and provides constructs to work with different parts of the document effectively. This post focuses on the merge/link tables feature. The amazon-textract-helper library provides a collection of ready-to-use functions and sample implementations to speed up the evaluation and development of any project using Amazon Textract.

  1. Install the libraries with the following code:
!pip install amazon-textract-response-parser
!pip install amazon-textract-helper
  2. The postprocessing step to identify related tables and merge them is part of the trp.trp2 library, which you must import into your notebook along with its supporting modules:
import trp.trp2 as t2
from trp.t_pipeline import pipeline_merge_tables
from textractcaller.t_call import call_textract, Textract_Features
from trp.trp2 import TDocument, TDocumentSchema
from trp.t_tables import MergeOptions, HeaderFooterType
  3. Next, call Amazon Textract to process the document:
textract_json = call_textract(input_document=s3_uri_of_documents, features=[Textract_Features.TABLES], boto3_textract_client=textract_client)
  4. Finally, load the response JSON into a document and run the pipeline. The footer and header heights are configurable by the user. There are three values that can be used for HeaderFooterType: NONE, NARROW, and NORMAL.
t_document: t2.TDocument = t2.TDocumentSchema().load(textract_json)
t_document = pipeline_merge_tables(t_document, MergeOptions.MERGE, None, HeaderFooterType.NONE)

pipeline_merge_tables takes a merge option parameter that can be either MergeOptions.MERGE or MergeOptions.LINK.

MergeOptions.MERGE combines the tables and makes them appear as one for postprocessing, with the drawback that the geometry information is no longer in the correct location because you now have cells and tables from subsequent pages moved to the page with the first part of the table.
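
To see the effect of MERGE on downstream processing, the following is a minimal sketch that parses the merged output with the classic response parser and prints each table’s rows; it assumes the t_document produced by the MergeOptions.MERGE pipeline call shown earlier.

# Minimal sketch: after pipeline_merge_tables with MergeOptions.MERGE, each multi-page
# table appears as a single table to the classic response parser.
# Assumes t_document from the MergeOptions.MERGE pipeline call shown earlier.
from trp import Document
from trp.trp2 import TDocumentSchema

merged_doc = Document(TDocumentSchema().dump(t_document))
for page in merged_doc.pages:
    for table in page.tables:
        print("Table with", len(table.rows), "rows")
        for row in table.rows:
            print([cell.text for cell in row.cells])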

MergeOptions.LINK maintains the geometric structure and enriches the table information with links between the table elements. A custom['previous_table'] and custom['next_table'] attribute is added to the TABLE blocks in the Amazon Textract JSON schema.
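
As a minimal sketch of how you might consume the LINK output, the following walks the TABLE blocks and follows the custom["next_table"] references described above. The attribute names (blocks, block_type, id, custom) reflect the trp.trp2 data classes, and the exact shape of the custom value is an assumption here; inspect your own output to confirm it.

# Minimal sketch: after pipeline_merge_tables with MergeOptions.LINK, follow the
# custom["next_table"] references between TABLE blocks.
# Assumption: attribute names follow the trp.trp2 data classes and the custom value
# holds the linked table ID(s); confirm against your installed library version.
for block in t_document.blocks:
    if block.block_type == "TABLE" and block.custom:
        next_ref = block.custom.get("next_table")
        if next_ref:
            print("Table", block.id, "continues in:", next_ref)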

The following image represents a sample PDF file with a table that spans over two pages.

The following shows the Amazon Textract response without table merge postprocessing (left) and the response with table merge postprocessing (right).

Define a custom table merge validation function

The provided postprocessing API works for the majority of use cases; however, based on your specific use case, you can define a custom merge function to improve its accuracy.

This custom function is passed as the custom table detection parameter of the pipeline_merge_tables function to override the existing logic for identifying the tables to merge. The following steps represent the existing logic.

  1. Validate context between tables. Check if there are any line items between the first and second table except in the footer and header area. If there are any line items, tables are considered separate tables.
  2. Compare the column numbers. If the two tables don’t have the same number of columns, this is an indicator of separate logical tables.
  3. Compare the headers. If the two tables have the exact same columns (same cell number and cell labels), this is a very strong indication of the same logical table.
  4. Compare table dimensions. Verify that the two tables have the same left and right margins. An accuracy percentage parameter can be passed to allow for some degree of error (for example, if the pages are scanned from paper, tables on different pages may have slightly different widths).

If you have a different requirement, you can pass your own custom table detection function to the pipeline_merge_tables API as follows:

from typing import List
from trp import Document
from trp.t_pipeline import order_blocks_by_geo
from trp.trp2 import TDocument, TDocumentSchema

def CustomTableDetectionFunction(t_document: TDocument) -> List[List[str]]:
    table_ids_merge_list = []
    ordered_doc = order_blocks_by_geo(t_document)
    trp_doc = Document(TDocumentSchema().dump(ordered_doc))
    for current_page in trp_doc.pages:
        for table in current_page.tables:
            # Provide your custom logic here to determine which table IDs should merge into one table
            # if <custom logic>:
            #     table_ids_merge_list.append([table_id_1, table_id_2, ...])
            pass
    return table_ids_merge_list

t_document = pipeline_merge_tables(t_document, MergeOptions.MERGE, CustomTableDetectionFunction, HeaderFooterType.NORMAL)

Our current implementation for the table detection function and pipeline_merge_tables function in our Amazon Textract response parser library is available on GitHub. The custom table detection function returns a list of lists (of strings), which is required by the merge_table or link_table functions (based on the MergeOptions parameter) called internally by the pipeline_merge_tables API.

Run sample code

The Amazon Textract multi-page tables processing repository provides sample code on how to use the merge tables feature and covers common scenarios that you may encounter in your documents. To try the sample code, you first launch an Amazon SageMaker notebook instance with the code repository, then you can access the notebook to review the code samples.

Launch a SageMaker notebook instance with the code repository

To launch a SageMaker notebook instance, complete the following steps:

  1. Choose the following link to launch an AWS CloudFormation template that deploys a SageMaker notebook instance along with the sample code repository:

  2. Sign in to the AWS Management Console with your AWS Identity and Access Management (IAM) user name and password.

You arrive at the Create Stack page on the Specify Template step.

  3. Choose Next.
  4. For Specify Stack Name, enter a stack name.
  5. Choose Next.
  6. Choose Next.
  7. On the review page, acknowledge the IAM resource creation and choose Create stack.

Access the SageMaker notebook and review the code samples

When the stack creation is complete, you can access the notebook and review the code samples.

  1. On the Outputs tab of the stack, choose the link corresponding to the value of the NotebookInstanceName key.
  2. Choose Open Jupyter.
  3. Go to the home page of your Jupyter notebook and browse to the amazon-textract-multipage-tables-processing directory.
  4. Open the Jupyter notebook inside this directory and review the sample code provided.

Conclusion

This post demonstrated how to use the Amazon Textract response parser component to identify and merge tables that span multiple pages. You walked through generic checks that you can use to identify a multi-page table, learned how to build your own custom function, and reviewed the two options to merge tables in the Amazon Textract response JSON.

If this post helps you or inspires you to solve a problem, we would love to hear about it! The code for this solution is available on the GitHub repo for you to use and extend. Contributions are always welcome!


About the Authors

 Mehran Najafi, PhD, is a Senior Solutions Architect for AWS focused on AI/ML solutions and architectures at scale.

Keith Mascarenhas is a Solutions Architect and works with our small and medium sized customers in central Canada to help them grow and achieve outcomes faster with AWS. He is also passionate about machine learning and is a member of the Amazon Computer Vision Hero program.

Yuan Jiang is a Sr Solutions Architect with a focus in machine learning. He’s a member of the Amazon Computer Vision Hero program and the Amazon Machine Learning Technical Field Community.

Martin Schade is a Senior ML Product SA with the Amazon Textract team. He has over 20 years of experience with internet-related technologies, engineering, and architecting solutions, and joined AWS in 2014. He has guided some of the largest AWS customers on the most efficient and scalable use of AWS services, and later focused on AI/ML with a focus on computer vision. He is currently obsessed with extracting information from documents.


Machine learning inference at scale using AWS serverless

With the growing adoption of machine learning (ML) across industries, there is an increasing demand for faster and easier ways to run ML inference at scale. ML use cases, such as manufacturing defect detection, demand forecasting, fraud surveillance, and many others, involve tens or thousands of datasets, including images, videos, files, documents, and other artifacts. These inference use cases typically require the workloads to scale to tens of thousands of parallel processing units. The simplicity and automated scaling offered by AWS serverless solutions make them a great choice for running ML inference at scale. Using serverless, you can run inference without provisioning or managing servers, while paying only for the time it takes to run. ML practitioners can easily bring their own ML models and inference code to AWS by using containers.

This post shows you how to run and scale ML inference using AWS serverless solutions: AWS Lambda and AWS Fargate.

Solution overview

The following diagram illustrates the solution architecture for both batch and real-time inference options. The solution is demonstrated using a sample image classification use case. Source code for this sample is available on GitHub.

The diagram illustrates the solution architecture for batch and real-time inference. Batch inference uses AWS Fargate and AWS Batch, along with Amazon S3 and Amazon ECR. Real-time inference uses AWS Lambda and Amazon API Gateway.

AWS Fargate: Lets you run batch inference at scale using serverless containers. The Fargate task loads the container image with the inference code for image classification.

AWS Batch: Provides job orchestration for batch inference by dynamically provisioning Fargate containers as per job requirements.

AWS Lambda: Lets you run real-time ML inference at scale. The Lambda function loads the inference code for image classification. A Lambda function is also used to submit batch inference jobs.

Amazon API Gateway: Provides a REST API endpoint for the inference Lambda function.

Amazon Simple Storage Service (S3): Stores input images and inference results for batch inference.

Amazon Elastic Container Registry (ECR): Stores the container image with inference code for Fargate containers.

Deploying the solution

We have created an AWS Cloud Development Kit (CDK) template to define and configure the resources for the sample solution. CDK lets you provision the infrastructure and build deployment packages for both the Lambda function and the Fargate container. The packages include commonly used ML libraries, such as Apache MXNet, along with Python and their dependencies. The solution runs the inference code using a ResNet-50 model trained on the ImageNet dataset to recognize objects in an image. The model can classify images into 1,000 object categories, such as keyboard, pointer, pencil, and many animals. The inference code downloads the input image and performs the prediction, returning the five classes the image most closely relates to, along with their respective probabilities.
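
To illustrate the shape of that last step, the following is a simplified sketch of top-5 selection over a vector of class probabilities. It is not the repository’s actual inference handler; the labels list and probability values are stand-ins.

# Simplified sketch of top-5 class selection over a probability vector.
# Not the repository's actual handler; the labels and probabilities are stand-ins.
import numpy as np

labels = ["keyboard", "pointer", "pencil", "tabby cat", "golden retriever"]  # stand-in for the 1,000 ImageNet classes
probabilities = np.array([0.02, 0.01, 0.03, 0.64, 0.30])                     # stand-in model output

top5 = np.argsort(probabilities)[::-1][:5]
for i in top5:
    print(labels[i], round(float(probabilities[i]), 4))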

To follow along and run the solution, you need access to an AWS account.

To deploy the solution, open your terminal window and complete the following steps.

  1. Clone the GitHub repo
    $ git clone https://github.com/aws-samples/aws-serverless-for-machine-learning-inference

  2. Navigate to the project directory and deploy the CDK application.
$ ./install.sh
or
$ ./cloud9_install.sh #If you are using AWS Cloud9

Enter Y to proceed with the deployment.

This performs the following steps to deploy and configure the required resources in your AWS account. It may take around 30 minutes for the initial deployment, as it builds the Docker image and other artifacts. Subsequent deployments typically complete within a few minutes.

  • Creates a CloudFormation stack (“MLServerlessStack”).
  • Creates a container image from the Dockerfile and the inference code for batch inference.
  • Creates an ECR repository and publishes the container image to this repo.
  • Creates a Lambda function with the inference code for real-time inference.
  • Creates a batch job configuration with Fargate compute environment in AWS Batch.
  • Creates an S3 bucket to store inference images and results.
  • Creates a Lambda function to submit batch jobs in response to image uploads to S3 bucket.

Running inference

The sample solution lets you get predictions for either a set of images using batch inference or for a single image at a time using real-time API endpoint. Complete the following steps to run inferences for each scenario.

Batch inference

Get batch predictions by uploading image files to Amazon S3.

  1. Using the Amazon S3 console or the AWS CLI, upload one or more image files to the S3 bucket path ml-serverless-bucket-<acct-id>-<aws-region>/input.
    $ aws s3 cp <path to jpeg files> s3://ml-serverless-bucket-<acct-id>-<aws-region>/input/ --recursive

  2. This triggers the batch job, which spins up Fargate tasks to run the inference. You can monitor the job status on the AWS Batch console.
  3. When the job is complete (this may take a few minutes), you can access the inference results from the ml-serverless-bucket-<acct-id>-<aws-region>/output path.

Real-time inference

Get real-time predictions by invoking the REST API endpoint with an image payload.

  1. Navigate to the CloudFormation console and find the API endpoint URL (httpAPIUrl) from the stack output.
  2. Use an API client, like Postman or the curl command, to send a POST request to the /predict API endpoint with the image file as the payload (a Python alternative is sketched after this list).
    $ curl --request POST -H "Content-Type: application/jpeg" --data-binary @<your jpg file name> <your-api-endpoint-url>/predict

  3. Inference results are returned in the API response.
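
If you prefer Python to curl, the following is a minimal sketch using the requests library; the endpoint URL and image file name are placeholders you replace with your own values.

# Minimal sketch: send an image to the /predict endpoint with Python instead of curl.
# The endpoint URL and image file name are placeholders.
import requests

api_url = "https://<your-api-endpoint-url>"  # httpAPIUrl from the CloudFormation stack output
with open("<your jpg file name>", "rb") as f:
    image_bytes = f.read()

response = requests.post(
    api_url + "/predict",
    data=image_bytes,
    headers={"Content-Type": "application/jpeg"},
)
print(response.status_code)
print(response.text)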

Additional recommendations and tips

Here are some additional recommendations and options to consider for fine-tuning the sample to meet your specific requirements:

  • Scaling – Update AWS Service Quotas in your account and Region as per your scaling and concurrency needs to run the solution at scale. For example, if your use case requires scaling beyond the default Lambda concurrent executions limit, then you must increase this limit to reach the desired concurrency. You also need to size your VPC and subnets with a wide enough IP address range to allow the required concurrency for Fargate tasks.
  • Performance – Perform load tests and fine tune performance across each layer to meet your needs.
  • Use container images with Lambda – This lets you use containers with both AWS Lambda and AWS Fargate, and you can simplify source code management and packaging.
  • Use AWS Lambda for batch inferences – You can use Lambda functions for batch inferences as well if the inference storage and processing times are within Lambda limits.
  • Use Fargate Spot – This lets you run interruption tolerant tasks at a discounted rate compared to the Fargate price, and reduce the cost for compute resources.
  • Use Amazon ECS container instances with Amazon EC2 – For use cases that need a specific type of compute, you can make use of EC2 instances instead of Fargate.

Cleaning up

Navigate to the project directory from the terminal window and run the following command to destroy all resources and avoid incurring future charges.

$ cdk destroy

Conclusion

This post demonstrated how to bring your own ML models and inference code and run them at scale using serverless solutions on AWS. The solution deploys your inference code to AWS Fargate and AWS Lambda, provides an API endpoint using Amazon API Gateway for real-time inference, and orchestrates batch jobs using AWS Batch for batch inference. Effectively, this solution lets you focus on building ML models by providing an efficient and cost-effective way to serve predictions at scale.

Try it out today, and we look forward to seeing the exciting machine learning applications that you bring to AWS Serverless!



About the Authors

Poornima Chand is a Senior Solutions Architect in the Strategic Accounts Solutions Architecture team at AWS. She works with customers to help solve their unique challenges using AWS technology solutions. She focuses on Serverless technologies and enjoys architecting and building scalable solutions.

Greg Medard is a Solutions Architect with AWS Business Development and Strategic Industries. He helps customers with the architecture, design, and development of cloud-optimized infrastructure solutions. His passion is to influence cultural perceptions by adopting DevOps concepts that withstand organizational challenges along the way. Outside of work, you may find him spending time with his family, playing with a new gadget, or traveling to explore new places and flavors.

Mani Khanuja is an Artificial Intelligence and Machine Learning Specialist SA at Amazon Web Services (AWS). She helps customers use machine learning to solve their business challenges on AWS. She spends most of her time diving deep and teaching customers on AI/ML projects related to computer vision, natural language processing, forecasting, ML at the edge, and more. She is passionate about ML at the edge, and has therefore created her own lab with a self-driving kit and a prototype manufacturing production line, where she spends a lot of her free time.

Vasu Sankhavaram is a Senior Manager of Solutions Architecture in Amazon Web Services (AWS). He leads Solutions Architects dedicated to Hitech accounts. Vasu holds an MBA from U.C. Berkeley, and a Bachelor’s degree in Engineering from University of Mysore, India. Vasu and his wife have their hands full with a son who’s a sophomore at Purdue, twin daughters in third grade, and a golden doodle with boundless energy.

Chitresh Saxena is a Senior Technical Account Manager at Amazon Web Services. He has a strong background in ML, Data Analytics and Web technologies. His passion is solving customer problems, building efficient and effective solutions on the cloud with AI, Data Science and Machine Learning.
