Predict residential real estate prices at ImmoScout24 with Amazon SageMaker

This is a guest post by Oliver Frost, data scientist at ImmoScout24, in partnership with Lukas Müller, AWS Solutions Architect.

In 2010, ImmoScout24 released a price index for residential real estate in Germany: the IMX. It was based on ImmoScout24 listings. Besides the price, listings typically contain a lot of specific information such as the construction year, the plot size, or the number of rooms. This information allowed us to build a so-called hedonic price index, which considers the particular features of a real estate property.

When we released the IMX, our goal was to establish it as the standard index for real estate prices in Germany. However, it struggled to capture the price increases in the German property market since the financial crisis of 2008. In addition, like a stock market index, it was an abstract figure that couldn’t be interpreted directly. The IMX was therefore difficult for non-experts to grasp.

At ImmoScout24, our mission is to make complex decisions easy, and we realized that we needed a new concept to fulfill it. Instead of another index, we decided to build a market report that everyone can easily understand: the WohnBarometer. It’s based on our listings data and takes object properties into account. The key difference from the IMX is that the WohnBarometer shows rent and sale prices in Euro per square meter for specific residential real estate types over time. The figures therefore can be directly interpreted and allow our customers to answer questions such as “Do I pay too much rent?” or “Is the apartment I am about to buy reasonably priced?” or “Which city in my region is the most promising one for investing?” Currently, the WohnBarometer is reported for Germany as a whole, the seven biggest cities, and alternating local markets.

The following graph shows an example of the WohnBarometer, with sale prices for Berlin and the development per quarter.

This post discusses how ImmoScout24 used Amazon SageMaker to create the model behind the WohnBarometer and make it relevant for our customers. It covers the underlying data model, hyperparameter tuning, and technical setup, and shows how SageMaker enabled a single data scientist to complete the WohnBarometer within 2 months. By comparison, it took a whole team 2 years to develop the first version of the IMX; such an investment was not an option for the WohnBarometer.

About ImmoScout24

ImmoScout24 is the leading online platform for residential and commercial real estate in Germany. For over 20 years, ImmoScout24 has been revolutionizing the real estate market, supporting over 20 million users each month on its online marketplace and in its app as they search for new homes or commercial spaces. That’s why 99% of our target customer group know ImmoScout24. With its digital solutions, the online marketplace coordinates and brings owners, realtors, tenants, and buyers together successfully. ImmoScout24 is working towards the goal of digitizing the process of real estate transactions and thereby making complex decisions easy. Since 2012, ImmoScout24 has also been active in the Austrian real estate market, reaching around 3 million users monthly.

From on-premises to AWS Data Pipeline to SageMaker

In this section, we discuss the previous setup and its challenges, and why we decided to use SageMaker for our new model.

The previous setup

When the first version of the IMX was published in 2010, the cloud was still a mystery to most businesses, including ImmoScout24. The field of machine learning (ML) was in its infancy and only a handful of experts knew how to code a model (for the sake of illustration, the first public release of Scikit-Learn was in February 2010). It’s no surprise that the development of the IMX took more than 2 years and cost a seven-figure sum.

In 2015, ImmoScout24 started its AWS migration and rebuilt the IMX on AWS infrastructure. With the data in our Amazon Simple Storage Service (Amazon S3) data lake, both the data preprocessing and the model training were now done on Amazon EMR clusters orchestrated by AWS Data Pipeline. While the former was a PySpark ETL application, the latter was several Python scripts using classical ML packages (such as Scikit-Learn).

Issues with this setup

Although this setup proved quite stable, troubleshooting the infrastructure or improving the model wasn’t easy. A key problem with the model was its complexity, because some components had begun a life of their own: in the end, the code of the outlier detection was almost twice as long as the code of the core IMX model itself.

The core model, in fact, wasn’t one model, but hundreds: one model per residential real estate type and region, with the definition varying from a single neighborhood in a big city to several villages in rural areas. We had, for example, one model for apartments for sale in the middle of Berlin and one model for houses for sale in a suburb of Munich. Because setting up the training of all these models took a lot of time, we omitted the hyperparameter tuning, which likely led to the models underperforming.

Why we decided on SageMaker

Given these issues and our ambition of having a market report with practical benefits, we had to decide between rewriting large parts of the existing code or starting from scratch. As you can infer from this post, we opted for the latter. But why SageMaker?

Most of our time spent on the IMX went into troubleshooting the infrastructure, not improving the model. For the new market report, we wanted to flip this around, with the focus on the statistical performance of the model. We also wanted to have the flexibility to quickly replace individual components of the model, such as the optimization of the hyperparameters. What if a new superior boosting algorithm comes around (think about how XGBoost hit the stage in 2014)? Of course, we want to adopt it as one of the first!

In SageMaker, the major components of the classical ML workflow—preprocessing, training, hyperparameter tuning, and inference—are neatly separated on the API level and also on the AWS Management Console. Modifying them individually isn’t difficult.

The new model

In this section, we discuss the components of the new model, including its input data, algorithm, hyperparameter tuning, and technical setup.

Input data

The WohnBarometer is based on a sliding window of 5 years of ImmoScout24 listings of residential real estate located in Germany. After we remove outliers and fraudulent listings, we’re left with approximately 4 million listings that are split into training (60%), validation (20%), and test (20%) data. The relationship between listings and objects is not necessarily 1:1; over the course of 5 years, it’s likely that the same object is inserted multiple times (by multiple people).

We use 13 listing attributes, such as the location of the property (WGS84 coordinates), the real estate type (house or apartment, sale or rent), its age (years), its size (square meters), or its condition (for example, new or refurbished). Given that each listing typically comes with dozens of attributes, the question arises: which to include in the model? On the one hand, we used domain knowledge; for example, it’s well known that location is a key factor, and in almost all markets new property is more expensive than existing ones. On the other hand, we relied on our experience with the IMX and similar models. There we learned that including dozens of attributes doesn’t significantly improve the model.

Depending on the real estate type of the listing, the target variable of our model is either the rent per square meter or the sale price per square meter (we explain later why this choice wasn’t ideal). Unlike the IMX, the WohnBarometer is therefore a number that can be directly interpreted and acted upon by our customers.

Model description

When using SageMaker, you can choose between different strategies of implementing your algorithm:

  • Use one of SageMaker’s built-in algorithms. There are almost 20 and they cover all major ML problem types.
  • Customize a pre-made Docker image based on a standard ML framework (such as Scikit-Learn or PyTorch).
  • Build your own algorithm and deploy it as a Docker image.

For the WohnBarometer, we wanted a solution that is easy to maintain and allows us to focus on improving the model itself, not the underlying infrastructure. Therefore, we decided on the first option: use a fully-managed algorithm with proper documentation and fast support if needed. Next, we needed to pick the algorithm itself. Again, the decision wasn’t difficult: we went for the XGBoost algorithm because it’s one of the most renowned ML algorithms for regression type problems, and we have already successfully used it in several projects.
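
The following is a minimal sketch of how such a training job can be set up with the SageMaker Python SDK and the built-in XGBoost container. The bucket, IAM role, and hyperparameter values are placeholders rather than our production configuration; the instance setup mirrors the one described later in this post.

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
region = session.boto_region_name
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder role ARN

# Retrieve the image URI of the built-in XGBoost algorithm
container = sagemaker.image_uris.retrieve("xgboost", region, version="1.5-1")

estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=2,
    instance_type="ml.m5.12xlarge",
    output_path="s3://my-bucket/wohnbarometer/output",  # placeholder bucket
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="reg:squarederror", num_round=500)

estimator.fit({
    "train": TrainingInput("s3://my-bucket/wohnbarometer/train", content_type="text/csv"),
    "validation": TrainingInput("s3://my-bucket/wohnbarometer/validation", content_type="text/csv"),
})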

Hyperparameter tuning

Most ML algorithms come with a myriad of parameters to tweak. Boosting algorithms, for example, have many parameters specifying how exactly the trees are built: Do the trees have at maximum 20 or 30 leaves? Is each tree based on all rows and columns or only samples? How heavily to prune the trees? Finding the optimal values of those parameters (as measured by an evaluation metric of your choice), the so-called hyperparameter tuning, is critical to building a powerful ML model.

A key question in hyperparameter tuning is which parameters to tune and how to set the search ranges. You might ask, why not check all possible combinations? Although in theory this sounds like a good idea, it would result in an enormous hyperparameter space with way too many points to evaluate them all at a reasonable price. That is why ML practitioners typically select a small number of hyperparameters known to have a strong impact on the performance of the chosen algorithm.

After the hyperparameter space is defined, the next task is to find the best combination of values in it. The following techniques are commonly employed:

  • Grid search – Divide the space in a discrete grid and then evaluate all points in the grid with cross-validation.
  • Random search – Randomly draw combinations from the space. With this approach, you’ll most likely miss the best combination, but it serves as a good benchmark.
  • Bayesian optimization – Build a probabilistic model of the objective function and use this model to generate new combinations. The model is updated after each combination, leading quickly to good results.

In recent years, thanks to cheap compute power, Bayesian optimization has become the gold standard in hyperparameter tuning, and is the default setting in SageMaker.
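
To illustrate, here is a hedged sketch of how a Bayesian tuning job can be defined with the SageMaker Python SDK. It reuses the estimator from the previous sketch; the parameter names and ranges are illustrative rather than the actual WohnBarometer configuration, while the job counts match the setup described below.

from sagemaker.inputs import TrainingInput
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

# Illustrative search space; the real ranges depend on careful inspection of the data
hyperparameter_ranges = {
    "eta": ContinuousParameter(0.01, 0.3),
    "max_depth": IntegerParameter(4, 10),
    "subsample": ContinuousParameter(0.5, 1.0),
}

tuner = HyperparameterTuner(
    estimator=estimator,                      # the built-in XGBoost estimator from above
    objective_metric_name="validation:rmse",  # built-in metric of the XGBoost algorithm
    objective_type="Minimize",
    hyperparameter_ranges=hyperparameter_ranges,
    strategy="Bayesian",                      # the SageMaker default
    max_jobs=30,
    max_parallel_jobs=2,
)

tuner.fit({
    "train": TrainingInput("s3://my-bucket/wohnbarometer/train", content_type="text/csv"),
    "validation": TrainingInput("s3://my-bucket/wohnbarometer/validation", content_type="text/csv"),
})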

Technical setup

As with many other AWS services, you can create SageMaker jobs on the console, with the AWS Command Line Interface (AWS CLI), or via code. We chose the third option, the SageMaker Python SDK to be precise, because it allows for a highly automated setup: the WohnBarometer lives in a Python software project that is command-line executable. For example, all steps of the ML pipeline such as the preprocessing or the model training can be triggered via Bash commands. Those Bash commands, in turn, are orchestrated with a Jenkins pipeline powered by AWS Fargate.

Let’s look at the steps and the underlying infrastructure:

  • Preprocessing – The preprocessing is done with the built-in Scikit-Learn library in SageMaker. Because it involves joining data frames with millions of rows, we need an ml.m5.24xlarge machine here, the largest you can get in the ml.m family. Alternatively, we could have used multiple smaller machines with a distributed framework like Dask, but we wanted to keep it as simple as possible.
  • Training – We use the default SageMaker XGBoost algorithm. The training is done with two ml.m5.12xlarge machines. It’s worth mentioning that our train.py, containing the code of the model training and the hyperparameter tuning, has fewer than 100 lines.
  • Hyperparameter tuning – Following the principle of less is more, we only tune 11 hyperparameters (for example, the number of boosting rounds and the learning rate), which gives us time to carefully choose their ranges and inspect how they interact with each other. With only a few hyperparameters, each training job runs relatively fast; in our case, the jobs take 10–20 minutes. With a maximum of 30 training jobs and 2 concurrent jobs, the total training time is around 3 hours.
  • Inference – SageMaker offers multiple options to serve your model. We use batch transform jobs because we only need the WohnBarometer numbers once a quarter. We didn’t use an endpoint because it would be idle most of the time. Each batch job (approximately 6.8 million rows) is served by a single ml.m5.4xlarge machine in less than 10 minutes.
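
The following is a minimal sketch of the batch transform step with the SageMaker Python SDK, assuming a trained estimator like the one sketched earlier (in practice, the best model from the tuning job). Bucket names are placeholders; the instance size matches the setup above.

# Create a transformer from the trained estimator (one ml.m5.4xlarge, as above)
transformer = estimator.transformer(
    instance_count=1,
    instance_type="ml.m5.4xlarge",
    output_path="s3://my-bucket/wohnbarometer/predictions",  # placeholder bucket
    strategy="MultiRecord",
    assemble_with="Line",
)

# Score the quarterly inference dataset (approximately 6.8 million rows)
transformer.transform(
    data="s3://my-bucket/wohnbarometer/inference-input",
    content_type="text/csv",
    split_type="Line",
)
transformer.wait()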

We can easily debug these steps on the SageMaker console. If, for example, a training job is taking longer than expected, we navigate to the Training page, locate the training job in question, and review Amazon CloudWatch metrics of the underlying machines.

The following architecture diagram shows the infrastructure of the WohnBarometer:

Challenges and learnings

In the beginning everything went smoothly: within a few days we set up the software project and trained a miniature version of our model in SageMaker. We had high hopes for the first run on the full dataset and the hyperparameter tuning in place. Unfortunately, the results weren’t satisfying. We had the following key issues:

  • The predictions of the model were too low, both for rent and sale objects. For Berlin, for example, the sale prices predicted for our reference objects were roughly 50% below the market prices.
  • According to the model, there was no significant price difference between new and existing buildings. The truth is that new buildings are almost always significantly more expensive than existing buildings.
  • The effect of the location on the price wasn’t captured correctly. We know, for example, that apartments for sale in Frankfurt am Main are, on average, more expensive than in Berlin (although Berlin is catching up); our model, however, predicted it the other way round.

What was the problem and how did we solve it?

Sampling of the features

At first glance, it looks like the issues aren’t related, but indeed they are. By default, XGBoost builds each tree with a random sample of the features. Let’s say a model has 10 features F1, F2, … F10, then the algorithm might use F1, F4, and F7 for one tree, and F3, F4, and F8 for another. While in general this behavior effectively prevents overfitting, it can be problematic if the number of features is small and some of them have a big effect on the target variable. In this case, many trees will miss the crucial features.

XGBoost’s sampling of our 13 features led to many trees including none of the crucial features—real estate type, location, and new or existing buildings—and as a consequence caused these issues. Luckily, there is a parameter to control the sampling: colsample_bytree (in fact, there are two more parameters to control the sampling, but we didn’t touch them). When we checked our code, we saw that colsample_bytree was set to 0.5, a value we carried over from past projects. As soon as we set it to the default value of 1, the preceding issues were gone.
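
With the SageMaker built-in XGBoost algorithm, this is just another hyperparameter on the estimator. A minimal sketch follows; values other than colsample_bytree are illustrative.

estimator.set_hyperparameters(
    objective="reg:squarederror",
    num_round=500,          # illustrative value
    colsample_bytree=1.0,   # default: every tree sees all 13 features
    subsample=0.8,          # row sampling can stay in place
    eta=0.1,
)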

One model vs. multiple models

Unlike the IMX, the WohnBarometer model really is only one model. Although this minimizes the maintenance effort, it’s not ideal from a statistical point of view. Because our training data contains both sale and rent objects, the spread in the target variable is huge: it ranges from below 5 Euro for some rent apartments to well above 10,000 Euro for houses for sale in first-class locations. The big challenge for the model is to understand that an error of 5 Euro is fantastic for sale objects, but disastrous for rent objects.

In hindsight, knowing how easy it is to maintain multiple models in SageMaker, we would have built at least two models: one for rent and one for sale objects. This would make it easier to capture the peculiarities of both markets. For example, the price of unrented apartments for sale is typically 20–30% higher than for rented apartments for sale. Therefore, encoding this information as a dummy variable in the sale model makes a lot of sense; for the rent model on the other hand, you could leave it out.

Conclusion

Did the WohnBarometer meet the goal of being relevant to our customers? Taking media coverage as an indication, the answer is a clear yes: as of November 2021, more than 700 newspaper articles and TV or radio reports on the WohnBarometer have been published. The list includes national newspapers such as Frankfurter Allgemeine Zeitung, Tagesspiegel, and Handelsblatt, and local newspapers that often ask for WohnBarometer figures for their region. Because we calculate the figures for all regions of Germany anyway, we’re happy to take such requests. With the old IMX, this level of granularity wasn’t possible.

The WohnBarometer outperforms the IMX in terms of statistical performance and, in particular, in terms of cost: the IMX was generated by an EMR cluster with 10 task nodes running almost half a day. In contrast, all WohnBarometer steps take less than 5 hours using medium-sized machines. This results in cost savings of almost 75%.

Thanks to SageMaker, we were able to bring a complex ML model into production with one data scientist in less than 2 months. This is remarkable: 10 years earlier, when ImmoScout24 built the IMX, reaching the same milestone took more than 2 years and involved a whole team.

How could we be so efficient? SageMaker allowed us to focus on the model instead of the infrastructure, and SageMaker promotes a microservice architecture that is easy to maintain. If we got stuck with something, we could call on AWS support. In the past, when one of our IMX data pipelines failed, we would sometimes spend days to debug it. Since we started publishing WohnBarometer figures in April 2021, the SageMaker infrastructure hasn’t failed a single time.

To learn more about the WohnBarometer, check out WohnBarometer and WohnBarometer: Angebotsmieten stiegen 2021 bundesweit wieder stärker an. To learn more about using the SageMaker Scikit-Learn library for preprocessing, see Preprocess input data before making predictions using Amazon SageMaker inference pipelines and Scikit-learn. Please send us feedback, either on the AWS forum for Amazon SageMaker, or through your AWS support contacts.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.


About the Authors

Oliver Frost joined ImmoScout24 in 2017 as a business analyst. Two years later, he became a data scientist in a team whose job it is to turn ImmoScout24 data into veritable data products. Before building the WohnBarometer model, he ran smaller SageMaker projects. Oliver holds several AWS certificates, including the Machine Learning Specialty.

Lukas Müller is a Solutions Architect at AWS. He works with customers in the sports, media, and entertainment industries. He is always looking for ways to combine technical enablement with cultural and organizational enablement to help customers achieve business value with cloud technologies.

Read More

Transforming qualitative research by automating speech to text-to-text analytics

This post is authored by Satish Jha, Intelligent Automation Manager, Matt Docherty, Data Science Manager, Jayesh Muley, Associate Consultant and Tapan Vora, Rapid Prototyping, from ZS Associates.

At ZS Associates, we do a significant amount of qualitative market research. The work involves interviewing relevant subjects (such as healthcare professionals and sales representatives) and developing bespoke analytics on the interview data. We’ve taken advantage of the advances in AI, machine learning (ML), and cloud computing to reimagine qualitative market research and developed a scalable solution that is equipped to perform speech-to-text conversion and natural language processing (NLP) on the audio recordings of interviewed subjects. The solution is better, cheaper, and faster than the current ways of working (manual interpretation), giving a competitive advantage in this space.

This post discusses how ZS used Amazon Transcribe, Amazon Comprehend Medical, and custom NLP for text summarization and graph visualization to create a scalable, automated solution that helps us provide insights in a faster, better, and more efficient way.

Background assessment

The traditional method of performing qualitative market research requires human intervention and interpretation, which is highly subjective in nature. We used advanced AI and ML to develop a platform that is capable of the following:

  • Performing speech-to-text conversion with high precision, specifically converting audio recordings of interviews conducted for the purpose of qualitative market research
  • Drawing analytical insights from the converted text using a state-of-the-art NLP model

To achieve this, we combined state-of-the-art AWS AI services and cloud computing capabilities with our propriety NLP and text summarization algorithms to drive impact at scale.

Solution overview

To build our solution, we adopted the methodology of starting small, highlighting value, and scaling fast. We identified a key user group and defined phase one of the solution to do automated speech-to-text and analytics. We defined a key user interface and developed the technology architecture for the solution. Because ZS is an AWS Partner and has already been using multiple AWS Cloud services for our enterprise products and solutions, AWS was the preferred choice for this project. We used Amazon Transcribe and Amazon Comprehend Medical for transcription and theme identification purposes. For hosting custom NLP analytics APIs, we used a serverless infrastructure built on Amazon API Gateway, AWS Lambda, and Amazon Elastic Container Service (Amazon ECS) with AWS Fargate. These services are HIPAA-eligible and compliant with pharma regulatory requirements.

The process includes the following stages:

  • File upload to Amazon S3 – The process starts when the user uploads one or more audio recording files for transcription to the site on which our tool is hosted. To upload the files to Amazon Simple Storage Service (Amazon S3), the user is provided with a temporary write token or pre-signed URL through API Gateway, which grants Amazon S3 access.
  • Audio transcription – Depending on the type of file uploaded, different triggers are in place to initiate the appropriate workflow:

    • Audio files uploaded without a dictionary file – If the user didn’t provide a dictionary file, the tool processes the audio file using Amazon Transcribe.
    • Audio files uploaded with a dictionary file – If the user provided a dictionary file, certain AWS Step Functions steps are triggered, followed by processing the dictionary file using Amazon Transcribe. When the dictionary processing is complete, the tool transcribes the audio file using Amazon Transcribe (see the sketch after this list).
  • Transcript file generation – In either of the preceding two cases, when the transcription is in progress, the tool uses Amazon CloudWatch Events to update the transcription status. Lambda functions trigger the tool to update the status on the RDBMS and convey the status to the user through the tool’s UI using sockets. When the transcription is complete, the final output file is stored in Amazon S3.
  • File type conversion – After the output file is generated, the tool uses triggers to create a .doc or .xlsx file, stored again in Amazon S3.
  • Generating analytical insights – With Amazon Comprehend Medical and certain ZS in-house NLP tools, the tool generates analytics based on the transcribed data and updates dashboards on our site to access them in real time.
  • Audio streaming with Amazon Transcribe – We use Amazon CloudFront audio streaming paired with our final output file, which is generated from Amazon Transcribe. The user can simultaneously listen to the recording and read the transcript.
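
Relating to the dictionary-file path above, the following is a hedged sketch of how a transcription job with a custom vocabulary can be started with the AWS SDK for Python (Boto3). The bucket, job, and vocabulary names are placeholders, not our actual configuration.

import boto3

transcribe = boto3.client("transcribe")

# Register the uploaded dictionary file as a custom vocabulary (placeholder names)
transcribe.create_vocabulary(
    VocabularyName="interview-project-vocab",
    LanguageCode="en-US",
    VocabularyFileUri="s3://zs-audio-uploads/dictionaries/project-vocab.txt",
)

# Start the transcription job once the vocabulary is ready
transcribe.start_transcription_job(
    TranscriptionJobName="interview-42",
    LanguageCode="en-US",
    MediaFormat="mp3",
    Media={"MediaFileUri": "s3://zs-audio-uploads/recordings/interview-42.mp3"},
    OutputBucketName="zs-transcripts",
    Settings={"VocabularyName": "interview-project-vocab"},
)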

The following diagram shows the high-level architecture and workflow.

The platform is designed to process a large number of files in real time. Therefore, the solution greatly augments the work of our current ZS qualitative research team by making the process more efficient and giving it an entirely new dimension!

Overall, our solution has the following features:

  • The ability to upload single or multiple audio files
  • Automated speech-to-text conversion, with the ability to add a custom dictionary
  • The ability to listen to the uploaded audio and refine text
  • Text summarization and analytics

Process map

The following diagram gives a high-level visualization of our developed solution, with the following stages:

  • Upload audio – The process starts with the user uploading their audio recording (with or without a dictionary file) to the tool
  • Speech to text – These uploaded audio files are transcribed by converting speech to text
  • Listen and refine – The user can simultaneously listen to the recording and read the transcript and make changes wherever necessary
  • Speech-to-text output – The consolidated file includes the converted transcript and its corresponding analytics

It took us approximately 5–6 months to develop this solution end to end with a four-member team. Today it is being used by over 300 people, and the tool has processed thousands of hours of audio.

AWS services used

The solution uses multiple AWS services:

  • AWS Lambda and API Gateway – Hosted the serverless APIs and functions.

    • We developed multiple API Gateways to ensure loose coupling and easy integration with external APIs. Custom authorizers were implemented to enable token-based authentication and restrict unauthorized access to the web content.
    • We also built the Lambda APIs (using Python and NodeJS) that could easily interact with a website hosted on ECS containers and could also be easily linked with Amazon Relational Database Service (Amazon RDS) for PostgreSQL. The use of Lambda functions in our solution helped us avoid the effort of load balancing, restoring, and stopping clusters, and reduced overall costs, because the clusters only ran when the functions were running. Additionally, we were able to easily scale our solution because of the serverless architecture.
  • Amazon Transcribe – Provided us options to easily configure the batch processing of audio files up to 100 at a time and even scale a larger load using its built-in queuing mechanism. It also allowed us to load a custom dictionary to transcribe the audio data more accurately.
  • Amazon Comprehend Medical – Generated analytical insights from the text data using its built-in NLP capabilities to sort through text for valuable information (see the sketch after this list).
  • AWS CloudFormation – We used AWS CloudFormation to deploy the Lambda functions and APIs across environments (various S3 buckets and multiple environments in the same bucket, such as production and development) using stage variables.
  • AWS CodeBuild, AWS CodeDeploy, and AWS CodePipeline – We used AWS CodeBuild, AWS CodeDeploy, and AWS CodePipeline to perform continuous deployment of the front end and analytics backend to ECS clusters.
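
As a hedged illustration of the Amazon Comprehend Medical item above, the following sketch extracts medical entities from a transcript with Boto3. The sample text is made up, and the slice is a simple guard against the service’s per-request text size limit.

import boto3

comprehend_medical = boto3.client("comprehendmedical")

# Hypothetical transcript snippet produced by Amazon Transcribe
transcript_text = "The physician prescribed 200 mg ibuprofen for the patient's chronic knee pain."

response = comprehend_medical.detect_entities_v2(Text=transcript_text[:20000])

for entity in response["Entities"]:
    print(entity["Category"], entity["Type"], entity["Text"], round(entity["Score"], 2))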

The following diagram illustrates the architecture of these services.

Conclusion

We used AWS services to develop a platform that helped our teams apply cutting-edge AI to their projects. It has helped our teams do the following:

  • Automate the process of speech-to-text conversion and only focus on low-accuracy aspects.
  • Drive automation of insights with NLP algorithms.
  • Drive self-service. Because we do not need to launch any particular server, we can easily create Lambda functions, make changes to the code on the fly, and provide key ML services as plug and play so that users don’t need to be data scientists.

Today the solution is used by over 300 people, and we have processed thousands of hours of audio. We’re now integrating our solution with other applications to provide users with the flexibility to either upload audio files for transcription or directly upload transcribed files for drawing analytical insights.

We also derived multiple benefits from building our platform with AWS:

  • Using an end-to-end cloud-based architecture proved beneficial in terms of managing environments for business applications
  • With management tools such as CloudWatch, AWS CloudFormation, CodeBuild, CodeDeploy, and CodePipeline, it was easier to monitor, track, and deploy development changes
  • We used AWS’s built-in security with virtual private clouds and identity management with customized policies
  • We were able to reduce load on valuable microservices, with the additional benefit of quick hosting and deployment

About ZS

ZS Associates is a consulting and professional services firm focusing on consulting, software, and technology, headquartered in Evanston, Illinois, that provides services for clients in pharma, healthcare, and technology. The firm employs more than 10,000 employees in 30 offices in North America, South America, Europe, and Asia. ZS works with 49 of the 50 largest drug-makers and 17 of the 20 largest medical device makers and serves consumer products, financial services, industrial products, telecommunications, transportation, and logistics industries.

Disclaimer: AWS is not responsible for the content or accuracy of this post. The content and opinions in this post are solely those of the third-party author. It is each customers’ responsibility to determine whether they are subject to HIPAA, and if so, how best to comply with HIPAA and its implementing regulations. Before using AWS in connection with protected health information, customers must enter an AWS Business Associate Addendum (BAA) and follow its configuration requirements.


About the Authors

Satish Jha is a Manager with ZS Associates. He is a leader in the firm’s Intelligent Automation Practice, where he works side by side with several pharma clients to transform operations and drive impact.

Matt Docherty is a Data Science Manager with ZS Associates in the Philadelphia office. He is focused on applying data science in the pharmaceutical industry.

Jayesh Muley is an Associate Consultant for Process Excellence & Transformation with ZS Associates. He has 4 years of experience advising pharma clients in the forecasting, process excellence, and digital transformation spaces. He played a critical role in establishing ZS’s automation center of excellence. He is always keen on learning new technologies and is always evolving in his role.

Tapan Vora is a Manager for Rapid Prototyping with ZS Associates. Tapan has over 14 years of technology and engineering management experience. He plays multiple roles in the team, such as business analyst, people manager, solution designer, data analyst, and product leader.

Read More

How The Barcode Registry detects counterfeit products using object detection and Amazon SageMaker

This is a guest post authored by Andrew Masek, Software Engineer at The Barcode Registry and Erik Quisling, CEO of The Barcode Registry.

Product counterfeiting is the single largest criminal enterprise in the world. Growing over 10,000% in the last two decades, sales of counterfeit goods now total $1.7 trillion per year worldwide, which is more than drugs and human trafficking. Although traditional methods of counterfeit prevention like unique barcodes and product verification can be very effective, new machine learning (ML) technologies such as object detection seem very promising. With object detection, you can now snap a picture of a product and know almost instantly if that product is likely to be legitimate or fraudulent.

The Barcode Registry (in conjunction with its partner Buyabarcode.com) is a full-service solution that helps customers prevent product fraud and counterfeiting. It does this by selling unique GS1-registered barcodes, verifying product ownership, and registering users’ products and barcodes in a comprehensive database. Their latest offering, which we discuss in this post, uses Amazon SageMaker to create object detection models to help instantly recognize counterfeit products.

Overview of solution

To use these object detection models, you first need to collect data to train them. Companies upload annotated pictures of their products to The Barcode Registry website. After this data is uploaded to Amazon Simple Storage Service (Amazon S3) and processed by AWS Lambda functions, you can use it to train a SageMaker object detection model. The model is hosted on a SageMaker endpoint, which the website uses to serve results to the end user.

There are three key steps The Barcode Registry uses to create a custom object detection model with SageMaker:

  1. Create a training script for SageMaker to run.
  2. Build a Docker container from the training script and upload it to Amazon ECR.
  3. Use the SageMaker console to train a model with the custom algorithm.

Product data

As a prerequisite to training an object detection model, you need an AWS account and training images: at least 100 high-quality pictures of your object (high-resolution and taken in multiple lighting conditions). As with any ML model, high-quality data is paramount. To train an object detection model, we need images containing the relevant products as well as bounding boxes describing where the products are in the images, as shown in the following example.

To train an effective model, pictures of each of a brand’s products with different backgrounds and lighting conditions are needed—approximately 30–100 unique annotated images for each product.

After the images are uploaded to the web server, they’re uploaded to Amazon S3 using the AWS SDK for PHP. A Lambda event is triggered each time an image is uploaded. The function removes the Exif metadata from the images, which can sometimes cause them to appear rotated when they’re opened by the ML libraries later used to train the model. The associated bounding box data is stored in JSON files and uploaded to Amazon S3 to accompany the images.
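
The following is a minimal sketch of such an Exif-stripping Lambda function, assuming a standard S3 upload notification as the trigger; it illustrates the idea rather than The Barcode Registry’s actual code, and the output prefix is a placeholder.

import io
import urllib.parse

import boto3
from PIL import Image

s3 = boto3.client("s3")

def handler(event, context):
    # Standard S3 event notification shape
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    obj = s3.get_object(Bucket=bucket, Key=key)
    image = Image.open(io.BytesIO(obj["Body"].read()))

    # Copy only the pixel data into a new image, which drops the Exif block
    clean = Image.new(image.mode, image.size)
    clean.putdata(list(image.getdata()))

    buffer = io.BytesIO()
    clean.save(buffer, format=image.format or "JPEG")

    # Write to a separate prefix so the upload trigger doesn't fire again
    s3.put_object(Bucket=bucket, Key="processed/" + key, Body=buffer.getvalue())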

SageMaker for object detection models

SageMaker is a managed ML service that includes a variety of tools for building, training, and hosting models in the cloud. In particular, The Barcode Registry uses SageMaker for its object detection service because of SageMaker’s reliable and scalable ML model training and hosting services. This means that many brands can have their own object detection models trained and hosted, and even if usage spikes unpredictably, there won’t be any downtime.

The Barcode Registry uses custom Docker containers uploaded to Amazon Elastic Container Registry (Amazon ECR) in order to have more fine-grained control of the object detection algorithm employed for training and inference, as well as support for Multi Model Server (MMS). MMS is very important for the counterfeit detection use case because it allows multiple brands’ models to be cost-effectively hosted on the same server. Alternatively, you can use the built-in object detection algorithm to quickly deploy standard models developed by AWS.

Train a custom object detection model with SageMaker

First, you need to add your object detection algorithm. In this case, upload a Docker container featuring scripts to train a Yolov5 object detection model to Amazon ECR:

  1. On the SageMaker console, under Notebook in the navigation pane, choose Notebook instances.
  2. Choose Create notebook instance.
  3. Enter a name for the notebook instance and under Permissions and encryption choose an AWS Identity and Access Management (IAM) role with the necessary permissions.
  4. Open the Git repositories menu.
  5. Select Clone a public Git repository to this notebook instance only and paste the following Git repository URL: https://github.com/portoaj/SageMakerObjectDetection
  6. Click Create notebook instance and wait about five minutes for the instance’s status to update from Pending to InService in the Notebook instance menu.
  7. Once the notebook is InService, select it and click Actions and Open Jupyter to launch the notebook instance in a new tab.
  8. Select the SageMakerObjectDetection directory and then click on sagemakerobjectdetection.ipynb to launch the Jupyter notebook.
  9. Select the conda_python3 kernel and click Set Kernel.
  10. Select the code cell and set the aws_account_id variable to your AWS Account ID.
  11. Click Run to begin the process of building a Docker container and uploading it to Amazon ECR. This process may take about 20 minutes to complete.
  12. Once the Docker container has been uploaded, return to the Notebook instances menu, select your instance, and click Actions and Stop to shut your notebook instance down.

After the algorithm is built and pushed to Amazon ECR, you can use it to train a model via the SageMaker console.

  1. On the SageMaker console, under Training in the navigation pane, choose Training jobs.
  2. Choose Create training job.
  3. Enter a name for the job and choose the AWS Identity and Access Management (IAM) role with the necessary permissions.
  4. For Algorithm source, select Your own algorithm container in ECR.
  5. For Container, enter the registry path.
  6. Setting a single ml.p2.xlarge instance under the resource configuration should be sufficient for training a Yolov5 model.
  7. Specify Amazon S3 locations for both your input data and output path and any other settings such as configuring a VPC via Amazon Virtual Private Cloud (Amazon VPC) or enabling Managed Spot Training.
  8. Choose Create training job.

You can track the model’s training progress on the SageMaker console.

Automated model training

The following diagram illustrates the automated model training workflow:

To make SageMaker start training the object detection model as soon as a user finishes uploading their data, the web server uses Amazon API Gateway to notify a Lambda function that the brand has finished uploading and that a training job should begin.

When a brand’s model is successfully trained, Amazon EventBridge calls a Lambda function that moves the trained model into the live endpoint’s S3 bucket, where it’s finally ready for inference. A newer alternative to Amazon EventBridge for moving models through the MLOps lifecycle that you should consider is SageMaker Pipelines.
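
The following is a hedged sketch of the Lambda function that API Gateway invokes to start the training job through Boto3. The event shape, bucket names, ECR image, and IAM role are placeholders; the instance type follows the training setup described earlier.

import time

import boto3

sagemaker = boto3.client("sagemaker")

def handler(event, context):
    brand_id = event["brand_id"]  # hypothetical request field
    job_name = f"barcode-registry-{brand_id}-{int(time.time())}"

    sagemaker.create_training_job(
        TrainingJobName=job_name,
        AlgorithmSpecification={
            "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/yolov5-train:latest",
            "TrainingInputMode": "File",
        },
        RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
        InputDataConfig=[{
            "ChannelName": "training",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": f"s3://barcode-registry-training/{brand_id}/",
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        OutputDataConfig={"S3OutputPath": f"s3://barcode-registry-models/{brand_id}/"},
        ResourceConfig={
            "InstanceType": "ml.p2.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
        },
        StoppingCondition={"MaxRuntimeInSeconds": 4 * 3600},
    )
    return {"trainingJobName": job_name}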

Host the model for inference

The following diagram illustrates the inference workflow:

To use the trained models, SageMaker requires an inference model to be hosted by an endpoint. The endpoint is the server or array of servers that are used to actually host the inference model. Similar to the training container that we created, a Docker container for inference is hosted in Amazon ECR. The inference model uses that Docker container and takes the input image the user took with their phone, runs it through the trained object detection model, and outputs the result.

Again, The Barcode Registry uses custom Docker containers for the inference model to enable the use of Multi Model Server, but if only one model is needed, it can be easily hosted through the built-in object detection algorithm.
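
As a hedged illustration of how such a multi-model endpoint is called at inference time, the following sketch sends an image to the endpoint and selects a specific brand’s model with the TargetModel parameter. The endpoint name, model path, and image file are placeholders.

import boto3

runtime = boto3.client("sagemaker-runtime")

with open("product-photo.jpg", "rb") as f:
    image_bytes = f.read()

response = runtime.invoke_endpoint(
    EndpointName="barcode-registry-detection",  # placeholder endpoint name
    TargetModel="brand-1234/model.tar.gz",      # selects this brand's model
    ContentType="application/x-image",
    Body=image_bytes,
)
detections = response["Body"].read()  # raw output; format depends on the inference container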

Conclusion

The Barcode Registry (in conjunction with its partner Buyabarcode.com) uses AWS for its entire object detection pipeline. The web server reliably stores data in Amazon S3 and uses API Gateway and Lambda functions to connect the web server to the cloud. SageMaker readily trains and hosts ML models, which means a user can take a picture of a product on their phone and see if the product is a counterfeit. This post shows how to create and host an object detection model using SageMaker, as well as how to automate the process.

In testing, the model was able to achieve over 90% accuracy on a training set of 62 images and a testing set of 32 images, which is pretty impressive for a model trained without any human intervention. To get started training object detection models yourself check out the official documentation or learn how to deploy an object detection model to the edge using AWS IoT Greengrass.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.


About the Authors

Andrew Masek, Software Engineer at The Barcode Registry.

Erik Quisling, CEO of The Barcode Registry.

Read More

Build a cold start time series forecasting engine using AutoGluon

Whether you’re allocating resources more efficiently for web traffic, forecasting patient demand for staffing needs, or anticipating sales of a company’s products, forecasting is an essential tool across many businesses. One particular use case, known as cold start forecasting, builds forecasts for a time series that has little or no existing historical data, such as a new product that just entered the market in the retail industry. Traditional time series forecasting methods such as autoregressive integrated moving average (ARIMA) or exponential smoothing (ES) rely heavily on historical time series of each individual product, and therefore aren’t effective for cold start forecasting.

In this post, we demonstrate how to build a cold start forecasting engine using AutoGluon AutoML for time series forecasting, an open-source Python package to automate machine learning (ML) on image, text, tabular, and time series data. AutoGluon provides an end-to-end automated machine learning (AutoML) pipeline for beginners to experienced ML developers, making it the most accurate and easy-to-use fully automated solution. We use the free Amazon SageMaker Studio Lab service for this demonstration.

Introduction to AutoGluon time series

AutoGluon is a leading open-source library for AutoML for text, image, and tabular data, allowing you to produce highly accurate models from raw data with just one line of code. Recently, the team has been working to extend these capabilities to time series data, and has developed an automated forecasting module that is publicly available on GitHub. The autogluon.forecasting module automatically processes raw time series data into the appropriate format, and then trains and tunes various state-of-the-art deep learning models to produce accurate forecasts. In this post, we demonstrate how to use autogluon.forecasting and apply it to cold start forecasting tasks.

Solution overview

Because AutoGluon is an open-source Python package, you can implement this solution locally on your laptop or on Amazon SageMaker Studio Lab. We walk through the following steps:

  1. Set up AutoGluon for Amazon SageMaker Studio Lab.
  2. Prepare the dataset.
  3. Define training parameters using AutoGluon.
  4. Train a cold start forecasting engine for time series forecasting.
  5. Visualize cold start forecasting predictions.

The key assumption of cold start forecasting is that items with similar characteristics should have similar time series trajectories, which is what allows cold start forecasting to make predictions on items without historical data, as illustrated in the following figure.

In our walkthrough, we use a synthetic dataset based on electricity consumption, which consists of the hourly time series for 370 items, each with an item_id from 0–369. Within this synthetic dataset, each item_id is also associated with a static feature (a feature that doesn’t change over time). We train a DeepAR model using AutoGluon to learn the typical behavior of similar items, and transfer such behavior to make predictions on new items (item_id 370–373) that don’t have historical time series data. Although we’re demonstrating the cold start forecasting approach with only one static feature, in practice, having informative and high-quality static features is the key for a good cold start forecast.

The following diagram provides a high-level overview of our solution. The open-source code is available on the GitHub repo.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Log in to your Amazon SageMaker Studio Lab account and set up the environment using the terminal:

cd sagemaker-studiolab-notebooks/ 
git clone https://github.com/whosivan/amazon-sagemaker-studio-lab-cold-start-forecasting-using-autogluon
conda env create -f autogluon.yml
conda activate autogluon
git clone https://github.com/yx1215/autogluon.git
cd autogluon/
git checkout --track origin/add_forecasting_predictor

These instructions should also work from your laptop if you don’t have access to Amazon SageMaker Studio Lab (we recommend installing Anaconda on your laptop first).

When you have the virtual environment fully set up, launch the notebook AutoGluon-cold-start-demo.ipynb and select the custom environment .conda-autogluon:Python kernel.

Prepare the target time series and item meta dataset

Download the following datasets to your notebook instance if they’re not included, and save them under the directory data/. You can find these datasets on our GitHub repo:

  • Test.csv.gz
  • coldStartTargetData.csv
  • itemMetaData.csv

Run the following snippet to load the target time series dataset into the kernel:

zipLocalFilePath = "data/test.csv.gz"
localFilePath = "data/test.csv"
util.extract_gz(zipLocalFilePath, localFilePath)

tdf = pd.read_csv(zipLocalFilePath, dtype = object)
tdf['target_value'] = tdf['target_value'].astype('float')
tdf.head()

AutoGluon time series requires static features to be represented in numerical format. This can be achieved through applying LabelEncoder() on our static feature type, where we encode A=0, B=1, C=2, D=3 (see the following code). By default, AutoGluon infers the static feature to be either ordinal or categorical. You can also overwrite this by converting the static feature column to be the object/string data type for categorical features, or integer/float data type for ordinal features.

localItemMetaDataFilePath = "data/itemMetaData.csv"
imdf = pd.read_csv(localItemMetaDataFilePath, dtype = object)

labelencoder = LabelEncoder()
imdf['type'] = labelencoder.fit_transform(imdf['type'])

# Split the item metadata into the items present in the training data and the cold start items
imdf_without_coldstart_item = imdf[imdf.item_id.isin(tdf.item_id.tolist())]
imdf_without_coldstart_item['type'] = imdf_without_coldstart_item['type'].astype(str)
imdf_without_coldstart_item.to_csv('data/itemMetaDatawithoutColdstart.csv', index=False)

imdf_with_coldstart_item = imdf[~imdf.item_id.isin(tdf.item_id.tolist())]
imdf_with_coldstart_item.to_csv('data/itemMetaDataOnlyColdstart.csv', index=False)

Set up and start AutoGluon model training

We need to specify save_path = 'autogluon-coldstart-demo' as the model artifact folder name (see the following code). We also set our eval_metric to mean absolute percentage error, or 'MAPE' for short, and defined prediction_length as 24 hours. If not specified, AutoGluon by default produces probabilistic forecasts and scores them via the weighted quantile loss. We only look at the DeepAR model in our demo, because we know the DeepAR algorithm allows cold start forecasting by design. We set one of the DeepAR hyperparameters arbitrarily and pass that hyperparameter to the ForecastingPredictor().fit() call. This allows AutoGluon to look only into the specified model. For a full list of tunable hyperparameters, refer to the gluonts.model.deepar package.

save_path = 'autogluon-coldstart-demo'
eval_metric = 'MAPE'
deepar_params = {
    "scaling": True
}

ag_predictor = ForecastingPredictor(path=save_path, eval_metric=eval_metric).fit(
    tdf,
    static_features=imdf_without_coldstart_item,
    prediction_length=24,  # how far out in the future we wish to forecast
    index_column="item_id",
    target_column="target_value",
    time_column="timestamp",
    quantiles=[0.1, 0.5, 0.9],
    hyperparameters={"DeepAR": deepar_params},
)

The training takes 30–45 minutes. You can get the model summary by calling the following function:

ag_predictor.fit_summary()

Forecast on the cold start item

Now we’re ready to generate forecasts for the cold start items. We recommend having at least five rows for each item_id. Therefore, for item_ids that have fewer than five observations, we fill in the missing rows with NaNs. In our demo, both item_id 370 and 372 have zero observations, a pure cold start problem, whereas the other two have five target values.

Load in the cold start target time series dataset with the following code:

localColdStartDataFilePath = "data/coldStartTargetData.csv"
cstdf = pd.read_csv(localColdStartDataFilePath, dtype = object)
cstdf.head(20)
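
If any of the cold start item_ids comes with fewer than the recommended five rows, a small padding step like the following hedged sketch can add NaN-filled rows before prediction. The helper function, item_id list, and placeholder timestamps are illustrative and not part of the original notebook; it assumes cstdf has the same item_id, timestamp, and target_value columns as the training data.

import numpy as np

def pad_to_five_rows(df, item_ids, freq="H"):
    """Ensure every cold start item_id has at least five rows, filling targets with NaN."""
    frames = []
    for item_id in item_ids:
        group = df[df["item_id"] == item_id]
        missing = 5 - len(group)
        if missing > 0:
            filler = pd.DataFrame({
                "item_id": str(item_id),
                # placeholder timestamps directly before the forecast horizon
                "timestamp": pd.date_range("2014-12-31 19:00:00", periods=missing, freq=freq).astype(str),
                "target_value": np.nan,
            })
            group = pd.concat([filler, group])
        frames.append(group)
    return pd.concat(frames, ignore_index=True)

cstdf = pad_to_five_rows(cstdf, item_ids=["370", "371", "372", "373"])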

We feed the cold start target time series into our AutoGluon model, along with the item meta dataset for the cold start item_id:

cold_start_prediction = ag_predictor.predict(cstdf, static_features=imdf_with_coldstart_item)

Visualize the predictions

We can create a plotting function to generate a visualization of the cold start forecasts, as shown in the following graph.

Clean up

To optimize resource usage, consider stopping the runtime on Amazon SageMaker Studio Lab after you have fully explored the notebook.

Conclusion

In this post, we showed how to build a cold start forecasting engine using AutoGluon AutoML for time series data on Amazon SageMaker Studio Lab. For those of you who are wondering about the difference between Amazon Forecast and AutoGluon (time series): Amazon Forecast is a fully managed and supported service that uses ML to generate highly accurate forecasts without requiring any prior ML experience, whereas AutoGluon is an open-source, community-supported project that incorporates the latest research contributions. We walked through an end-to-end example to demonstrate what AutoGluon for time series is capable of, and provided a dataset and use case.

AutoGluon for time series data is an open-source Python package, and we hope that this post, together with our code example, gives you a straightforward solution to tackle challenging cold start forecasting problems. You can access the entire example on our GitHub repo. Try it out, and let us know what you think!


About the Authors

Ivan Cui is a Data Scientist with AWS Professional Services, where he helps customers build and deploy solutions using machine learning on AWS. He has worked with customers across diverse industries, including software, finance, pharmaceutical, and healthcare. In his free time, he enjoys reading, spending time with his family, and maximizing his stock portfolio.

Jonas Mueller is a Senior Applied Scientist in the AI Research and Education group at AWS, where he develops new algorithms to improve deep learning and develop automated machine learning. Before joining AWS to democratize ML, he completed his PhD at the MIT Computer Science and Artificial Intelligence Lab. In his free time, he enjoys exploring mountains and the outdoors.

Wenming Ye is a Research Product Manager at AWS AI. He is passionate about helping researchers and enterprise customers rapidly scale their innovations through open-source and state-of-the-art machine learning technology. Wenming has diverse R&D experience from Microsoft Research, the SQL engineering team, and successful startups.

Read More

Enable the visually impaired to hear documents using Amazon Textract and Amazon Polly

At the 2021 AWS re:Invent conference in Las Vegas, we demoed Read For Me at the AWS Builders Fair—a website that helps the visually impaired hear documents.

For better quality, view the video here.

Adaptive technology and accessibility features are often expensive, if they’re available at all. Audio books help the visually impaired read. Audio description makes movies accessible. But what do you do when the content isn’t already digitized?

This post focuses on the AWS AI services Amazon Textract and Amazon Polly, which empower those with impaired vision. Read For Me was co-developed by Jack Marchetti, who is visually impaired.

Solution overview

Through an event-driven, serverless architecture and a combination of multiple AI services, we can create natural-sounding audio files in multiple languages from a picture of a document, or any image with text. For example, a letter from the IRS, a holiday card from family, or even the opening titles to a film.

The following reference architecture, published in the AWS Architecture Center, shows the workflow of a user taking a picture with their phone and playing an MP3 of the content found within that document.

The workflow includes the following steps:

  1. Static content (HTML, CSS, JavaScript) is hosted on AWS Amplify.
  2. Temporary access is granted for anonymous users to backend services via an Amazon Cognito identity pool.
  3. The image files are stored in Amazon Simple Storage Service (Amazon S3).
  4. A user makes a POST request through Amazon API Gateway to the audio service, which proxies the request to an AWS Step Functions Express workflow.
  5. The Step Functions workflow includes the following steps:
    1. Amazon Textract extracts text from the image.
    2. Amazon Comprehend detects the language of the text.
    3. If the target language differs from the detected language, Amazon Translate translates to the target language.
    4. Amazon Polly creates an audio file as output using the text.
  6. The AWS Step Functions workflow creates an audio file as output and stores it in Amazon S3 in MP3 format.
  7. A pre-signed URL with the location of the audio file stored in Amazon S3 is sent back to the user’s browser through API Gateway. The user’s mobile device plays the audio file using the pre-signed URL.
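
The following is a minimal sketch of that last step, generating a pre-signed URL for the MP3 file with Boto3; the bucket and key are placeholders.

import boto3

s3 = boto3.client("s3")

audio_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "read-for-me-audio", "Key": "output/example.mp3"},
    ExpiresIn=3600,  # valid for one hour
)
# audio_url is returned to the browser through API Gateway and played by the device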

In the following sections, we discuss the reasons for why we chose the specific services, architecture pattern, and service features for this solution.

AWS AI services

Several AI services are wired together to power Read For Me:

  • Amazon Textract identifies the text in the uploaded picture.
  • Amazon Comprehend determines the language.
  • If the user chooses a different spoken language than the language in the picture, we translate it using Amazon Translate.
  • Amazon Polly creates the MP3 file. We take advantage of the Amazon Polly neural engine, which creates a more natural, lifelike audio recording.
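
As a small illustration of the Amazon Polly step, the following hedged sketch synthesizes an MP3 with the neural engine; the voice, language, and text are placeholders.

import boto3

polly = boto3.client("polly")

response = polly.synthesize_speech(
    Engine="neural",
    VoiceId="Joanna",
    LanguageCode="en-US",
    OutputFormat="mp3",
    Text="Hello from Read For Me!",
)

with open("output.mp3", "wb") as f:
    f.write(response["AudioStream"].read())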

One of the main benefits of using these AI services is the ease of adoption with little or no core machine learning experience required. The services expose APIs that clients can invoke using SDKs made available in multiple programming languages, such as Python and Java.

With Read For Me, we wrote the underlying AWS Lambda functions in Python.

AWS SDK for Python (Boto3)

The AWS SDK for Python (Boto3) makes interacting with AWS services simple. For example, the following lines of Python code return the text found in the image or document you provide:

import boto3

client = boto3.client('textract')
response = client.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': 'bucket-name',
            'Name': 's3-key'
        }
    }
)
# do something with the response

All Python code is run within individual Lambda functions. There are no servers to provision and no infrastructure to maintain.

Architecture patterns

In this section, we discuss the different architecture patterns used in the solution.

Serverless

We implemented a serverless architecture for two main reasons: speed to build and cost. With no underlying hardware to maintain or infrastructure to deploy, we focused entirely on the business logic code and nothing else. This allowed us to get a functioning prototype up and running in a matter of days. If users aren’t actively uploading pictures and listening to recordings, nothing is running, and therefore nothing is incurring costs outside of storage. An S3 lifecycle management rule deletes uploaded images and MP3 files after 1 day, so storage costs are low.

Synchronous workflow

When you’re building serverless workflows, it’s important to understand when a synchronous call makes more sense from the architecture and user experience than an asynchronous process. With Read For Me, we initially went down the asynchronous path and planned on using WebSockets to bi-directionally communicate with the front end. Our workflow would include a step to find the connection ID associated with the Step Functions workflow and upon completion, alert the front end. For more information about this process, refer to From Poll to Push: Transform APIs using Amazon API Gateway REST APIs and WebSockets.

We ultimately chose not to do this and instead used Express Step Functions workflows, which are synchronous. Users understand that processing an image won’t be instant, but also know it won’t take 30 seconds or a minute. We were in a space where a few seconds was satisfactory to the end user, and we didn’t need the benefit of WebSockets. This simplified the workflow overall.

Express Step Functions workflow

The ability to break out your code into smaller, isolated functions allows for fine-grained control, easier maintenance, and the ability to scale more accurately. For instance, if we determined that the Lambda function that triggered Amazon Polly to create the audio file was running slower than the function that determined the language, we could vertically scale that function, adding more memory, without having to do so for the others. Similarly, you limit the blast radius of what your Lambda function can do or access when you limit its scope and reach.

One of the benefits of orchestrating your workflow with Step Functions is the ability to introduce decision flow logic without having to write any code.

Our Step Functions workflow isn’t complex. It’s linear until the translation step. If we don’t need to call a translation Lambda function, that means less cost for us and a faster experience for the user. We can use the visual designer on the Step Functions console to check for a specific key in the input payload using JSONPath and, if it’s present, call one function instead of the other. For example, our payload includes a key called translate:

{
  "extracted_text": "hello world",
  "target_language": "es",
  "source_language": "en",
  "translate": true
}

Within the Step Functions visual designer, we find the translate key and set up rules to match, as in the sketch that follows.
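As an illustration, the Choice state behind this rule could look roughly like the following Amazon States Language fragment, written here as a Python dict; the state names TranslateText and SynthesizeSpeech are hypothetical:

# Hypothetical state names; the Choice rule mirrors the payload shown above
should_translate_state = {
    "Type": "Choice",
    "Choices": [
        {
            "Variable": "$.translate",   # JSONPath into the input payload
            "BooleanEquals": True,
            "Next": "TranslateText"      # call the translation Lambda function
        }
    ],
    "Default": "SynthesizeSpeech"        # skip translation and go straight to Amazon Polly
}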

Headless architecture

Amplify hosts the front-end code. The front end is written in React, and the source code is checked into AWS CodeCommit. Amplify solves a few problems for users trying to deploy and manage static websites. If you were doing this manually (using an S3 bucket set up for static website hosting and fronting it with Amazon CloudFront), you’d have to expire the cache yourself after each deployment and build your own CI/CD pipeline. Amplify handles this for you.

This allows for a headless architecture, where the front-end code is decoupled from the backend and each layer can be managed and scaled independently of the other.

Analyze ID

In the preceding section, we discussed the architecture patterns for processing the uploaded picture and creating an MP3 file from it. Having a document read back to you is a great first step, but what if you only want to know something specific without having the whole thing read back to you? For instance, you need to fill out a form online and provide your state ID or passport number, or perhaps its expiration date. You then have to take a picture of your ID and, while having it read back to you, wait for that specific part. Alternatively, you could use Analyze ID.

Analyze ID is a feature of Amazon Textract that extracts information from identity documents. Read For Me contains a drop-down menu where you can specifically ask for the expiration date, date of issue, or document number. You can use the same workflow to create an MP3 file that provides the answer to your specific question.
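A minimal Boto3 sketch of such a query might look like the following; the bucket, key, and choice of the EXPIRATION_DATE field are illustrative, not the exact Read For Me implementation:

import boto3

textract = boto3.client('textract')

# Ask Analyze ID for the normalized fields of an identity document stored in Amazon S3
response = textract.analyze_id(
    DocumentPages=[{
        'S3Object': {'Bucket': 'read-for-me-uploads', 'Name': 'uploads/passport.jpg'}
    }]
)

# Pull out a single field, such as the expiration date, to synthesize into audio
fields = response['IdentityDocuments'][0]['IdentityDocumentFields']
expiration_date = next(
    (f['ValueDetection']['Text'] for f in fields if f['Type']['Text'] == 'EXPIRATION_DATE'),
    None
)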

You can demo the Analyze ID feature at readforme.io/analyze.

Additional Amazon Polly features

  • Read For Me offers multiple neural voices in different languages and dialects. Amazon Polly provides several other voices that we did not implement; when a new voice becomes available, an update to the front-end code and a Lambda function is all it takes to take advantage of it.
  • Amazon Polly also offers options that we have yet to include in Read For Me, such as adjusting the speed of the voices and speech marks (a sketch follows this list).
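As a rough sketch of how these options could be used (the voice, rate, and file name are examples, not the Read For Me implementation), SSML lets you slow the speech down while still using the neural engine:

import boto3

polly = boto3.client('polly')

# Neural voice with SSML to reduce the speech rate
response = polly.synthesize_speech(
    Engine='neural',
    VoiceId='Joanna',
    OutputFormat='mp3',
    TextType='ssml',
    Text='<speak><prosody rate="90%">hello world</prosody></speak>'
)

with open('speech.mp3', 'wb') as audio_file:
    audio_file.write(response['AudioStream'].read())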

Conclusion

In this post, we discussed how to use numerous AWS services, including AI and serverless, to aid the visually impaired. You can learn more about the Read For Me project and use it by visiting readforme.io. You can also find Amazon Textract examples on the GitHub repo. To learn more about Analyze ID, check out Announcing support for extracting data from identity documents using Amazon Textract.

The source code for this project will be open-sourced and added to AWS’s public GitHub soon.


About the Authors

Jack Marchetti is a Senior Solutions Architect at AWS. With a background in software engineering, Jack is primarily focused on helping customers implement serverless, event-driven architectures. He built his first distributed, cloud-based application in 2013 after attending the second AWS re:Invent conference and has been hooked ever since. Prior to AWS, Jack spent the bulk of his career in the ad agency space building experiences for some of the largest brands in the world. Jack is legally blind and resides in Chicago with his wife Erin and cat Minou. He is also a screenwriter and director with a primary focus on Christmas movies and horror. View Jack’s filmography at his IMDb page.

Alak Eswaradass is a Solutions Architect at AWS based in Chicago, Illinois. She is passionate about helping customers design cloud architectures utilizing AWS services to solve business challenges. She has a Master’s degree in computer science engineering. Before joining AWS, she worked for different healthcare organizations, and she has in-depth experience architecting complex systems, technology innovation, and research. She hangs out with her daughters and explores the outdoors in her free time.

Swagat Kulkarni is a Senior Solutions Architect at AWS and an AI/ML enthusiast. He is passionate about solving real-world problems for customers with cloud native services and machine learning. Outside of work, Swagat enjoys travel, reading and meditating.

Read More

Bundesliga Match Fact Set Piece Threat: Evaluating team performance in set pieces on AWS

The importance of set pieces in football (or soccer in the US) has been on the rise in recent years: now more than one quarter of all goals are scored via set pieces. Free kicks and corners generally create the most promising situations, and some professional teams have even hired specific coaches for those parts of the game.

In this post, we share how the Bundesliga Match Fact Set Piece Threat helps evaluate team performance in set pieces. As teams look to capitalize more and more on these dead ball situations, Set Piece Threat helps the viewer understand how well teams are leveraging them. In addition, this post explains how AWS services are used to compute these statistics in real time.

Bundesliga’s Union Berlin is a great example of the relevance of set pieces. The team managed to rise from Bundesliga 2 to qualification for a European competition in just 2 years. They finished third in Bundesliga 2 during the 18/19 season, earning themselves a slot in the relegation playoffs to the Bundesliga. In that season, they scored 28 goals from open play, ranking just ninth in the league. However, they ranked second for goals scored through set pieces (16 goals).

Tellingly, in the first relegation playoff match against VfB Stuttgart, Union secured a 2:2 draw, scoring a header after a corner. And in the return match, Stuttgart was disallowed a free kick goal due to a passive offside, allowing Union to enter the Bundesliga with a 0:0 draw.

The relevance of set pieces for Union’s success doesn’t end there. Union finished their first two Bundesliga seasons in a strong eleventh and seventh place, ranking third and first in the number of set piece goals (scoring 15 goals from set pieces in both seasons). For comparison, FC Bayern München, the league champion, only managed to score 10 goals from set pieces in both seasons. The success that Union Berlin had with their set pieces allowed them to secure seventh place in the 20/21 Bundesliga season, which meant qualification for the UEFA Europa Conference League, going from Bundesliga 2 to Europe just 2 years after earning promotion. Unsurprisingly, in the deciding match, they scored one of their two goals after a corner. At the time of this writing, Union Berlin ranks fourth in the Bundesliga (matchday 20) and first in corner performance, a statistic we explain later.

Union Berlin’s path to Europe clearly demonstrates the influential role of offensive and defensive performance during set pieces. Until now, however, it was difficult for fans and broadcasters to properly quantify this performance, unless they wanted to dissect massive tables on analytics websites. Bundesliga and AWS have worked together to illustrate the threat that a team produces with its set pieces and the threat posed by set pieces against the team, and came up with the new Bundesliga Match Fact: Set Piece Threat.

How does Set Piece Threat work?

To determine the threat a team poses with their set pieces, we take into account different facets of their set piece performance. It’s important to note that we only consider corners and free kicks as set pieces, and compute the threat for each category independently.

Facet 1: Outcome of a set piece: Goals, shots, or nothing

First, we consider the outcome of a set piece, that is, whether it results in a goal. Because the outcome is often decided by fine margins, such as a great save by the goalkeeper or a shot that brushes the post instead of going in, we also categorize the quality of any shot that results from the set piece. Shots fall into the following categories:

  • Goal: A successful shot that led to a goal
  • Outstanding: A shot that almost led to a goal, such as a shot that hit the post
  • Decent: Other noteworthy goal-scoring scenes
  • Average: The remaining chances that carry a relevant goal threat and would be counted in a chance ratio
  • None: No real goal threat and not considered a real chance, such as a header that barely touched the ball or a blocked shot
  • No shot: No shot taken at all

The above video shows examples of shot outcome categories in the following order: outstanding, decent, average, none.

Facet 2: Potential of a shot

Second, our algorithm considers the potential of a shot, that is, how likely it was to result in a goal, taking the actual performance of the shot-taker out of the equation. In other words, we quantify the goal potential of the situation in which the shot was taken. This is captured by the expected goals (xGoals) value of the shot. We remove not only the influence of luck, or lack thereof, but also the quality of the strike or header.

Facet 3: Quantity of set pieces

Next, we consider the pure quantity of set pieces that a team gets. Our definition of Set Piece Threat measures the threat on a per-set-piece basis. Instead of summing up all outcomes and xGoals values of a team over the course of a season, the values are aggregated so that they represent the average threat per set piece. That way, the corner threat, for example, represents the team’s danger for each corner and doesn’t consider a team more dangerous simply because it has more corners than other teams (and therefore potentially more shots or goals).

Facet 4: Development over time

The last aspect to consider is the development of a team’s threat over time. Consider for example a team that scored three goals from corners in the first three matchdays but fails to deliver any considerable threat over the next 15 matchdays. This team should not be considered to pose a significant threat from corners on matchday 19, despite it already having scored three times, which may still be a good return. We account for this (positive or negative) development of a team’s set piece quality by assigning a discount to each set piece, depending on how long ago it occurred. In other words, a free kick that was taken 10 matchdays ago has less influence on the computed threat than one that was taken during the last or even current game.

Score: Per set piece aggregation

All four facets we’ve described are aggregated into two values for each team, one for corners and one for free kicks, which describe the danger that a corresponding set piece by that team would currently pose. The value is defined as the weighted average of the scores of each set piece, where the score of a set piece is defined as (0.7 * shot-outcome + 0.3 * xG-value) if the set piece resulted in a shot, and 0 otherwise. The shot-outcome is 1 if the team scored and lower for other outcomes, such as a shot that went wide, depending on its quality. The weight for each set piece is determined by how long ago it was taken, as described earlier. Overall, the values lie between 0 and 1, where 1 is the perfect score.
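A minimal sketch of this aggregation follows; the decay factor and the example numbers are illustrative assumptions, not the production values:

# Score of a single set piece, as defined above
def set_piece_score(shot_outcome, xg, resulted_in_shot):
    return 0.7 * shot_outcome + 0.3 * xg if resulted_in_shot else 0.0

# Weighted average over all set pieces, discounting older ones
def team_threat_score(set_pieces, decay=0.9):
    # set_pieces: list of (matchdays_ago, shot_outcome, xg, resulted_in_shot)
    weights, scores = [], []
    for matchdays_ago, outcome, xg, shot in set_pieces:
        weights.append(decay ** matchdays_ago)  # older set pieces count less
        scores.append(set_piece_score(outcome, xg, shot))
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

corners = [(0, 1.0, 0.35, True), (3, 0.6, 0.12, True), (10, 0.0, 0.0, False)]
print(team_threat_score(corners))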

Set piece threat

Next, the values for each team are compared to the league average. The exact formula is score(team)/avg_score(league) - 1. This value is what we call the Set Piece Threat value. A team has a threat value of 0 if it’s exactly as good as the league average. A value of -1 (or -100%) describes a team that poses no threat at all, and a value of +1 (+100%) describes a team that is twice as dangerous as the league average. With those values, we compute a ranking that orders the teams from 1 to 18 according to their offensive threat from corners and free kicks, respectively.

We use the same data and similar calculations to compute a defensive threat that measures how well a team defends set pieces. Instead of computing a score per own set piece, the algorithm computes a score per opponent set piece. Just like the offensive threat, the score is compared to the league average, but the value is reversed: -score(team)/avg_score(league) + 1. This way, a threat of +1 (+100%) is achieved if a team allows opponents no shots at all, whereas a team with a defensive threat of -1 (-100%) is twice as susceptible to opponents’ set pieces as the league average. Again, a team with a threat of 0 is as good as the league average.
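In code, the normalization against the league average could be expressed as follows; the example scores are made up:

# Offensive and defensive threat relative to the league average, as defined above
def offensive_threat(team_score, league_avg_score):
    return team_score / league_avg_score - 1

def defensive_threat(team_score_against, league_avg_score):
    return -team_score_against / league_avg_score + 1

# A team scoring 0.15 per corner in a league averaging 0.10 is 50% more threatening
print(offensive_threat(0.15, 0.10))   # 0.5 -> +50%
# A team conceding only 0.05 per opponent corner is 50% better than the league average defensively
print(defensive_threat(0.05, 0.10))   # 0.5 -> +50%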

Set Piece Threat findings

An important aspect of Set Piece Threat is that we estimate threat rather than simply count goals scored and conceded via set pieces. If we take SC Freiburg and Union Berlin at matchday 21 as an example, over the course of this season Freiburg has scored seven goals via corners compared to four for Union Berlin. Our threat ranking still rates both teams as roughly equal. In fact, we predict a corner by Freiburg (rank 3) to be 7% less threatening than a corner by Union Berlin (rank 1). The main reason is that Union Berlin created a similar number of great chances from their corners but failed to convert them into goals, whereas Freiburg was vastly more efficient with their chances. Such a discrepancy between chance quality and actual goals can happen in a high-variance sport like football.

The following graph shows Union Berlin’s offensive corner ranking (blue) and score (red) from matchdays 6–21. At matchday 12, Union scored a goal from a corner and had a great chance from a second corner that didn’t result in a goal but was rated as a high threat by our algorithm. In addition, Union had a shot on target from five of seven corner kicks on matchday 12. As a result, Union immediately jumped from twelfth to fifth place in the ranking, and their score value increased along with the league average. As Union produced more and more high-threat chances from corners in the later matchdays, they gradually claimed first place in the corner threat ranking. The score is always relative to the current league average, meaning that Union’s threat from corners at matchday 21 is 50% higher than the average threat across all teams in the league.

Implementation and architecture

Bundesliga Match Facts run as independent AWS Fargate containers inside Amazon Elastic Container Service (Amazon ECS). Previous Bundesliga Match Facts consume raw event and positional data to calculate advanced statistics. This changes with the release of Set Piece Threat, which analyzes data produced by an existing Bundesliga Match Fact (xGoals) to calculate its rankings. Therefore, we created an architecture to exchange messages between different Bundesliga Match Facts during live matches in real time.

To guarantee the latest data is reflected in the set piece threat calculations, we use Amazon Managed Streaming for Apache Kafka (Amazon MSK). This message broker service allows different Bundesliga Match Facts to send and receive the newest events and updates in real time. By consuming a match and Bundesliga Match Fact-specific topic from Kafka, we can receive the most up-to-date data from all systems involved while retaining the ability to replay and reprocess messages sent earlier.

The following diagram illustrates the solution architecture:

We introduced Amazon MSK to this project to replace all internal message passing for the Bundesliga Match Facts platform. It handles the ingestion of positional and event data, which can amount to over 3.6 million data points per match. With Amazon MSK, we can use the underlying persistent storage of messages, which allows us to replay games from any point in time. However, for Set Piece Threat, the focus lies on the specific use case of passing events produced by Bundesliga Match Facts to other Bundesliga Match Facts that are running in parallel.

To facilitate this, we distinguish between two types of Kafka topics: global and match-specific. Each Bundesliga Match Fact has its own global topic, which handles all messages created by that Bundesliga Match Fact. In addition, there is a match-specific topic per Bundesliga Match Fact and match, which handles all messages created by that Bundesliga Match Fact for that specific match. When multiple live matches run in parallel, each message is first produced and sent to the Bundesliga Match Fact-specific global topic.

A dispatcher AWS Lambda function is subscribed to every Bundesliga Match Fact-specific global topic and has two tasks:

  1. Write the incoming data to a database provisioned through Amazon Relational Database Service (Amazon RDS).
  2. Redistribute the messages that can be consumed by other Bundesliga Match Facts to a Bundesliga Match Fact-specific topic.

The left side of the architecture diagram shows the different Bundesliga Match Facts running independently from each other for every match and producing messages to the global topic. The new Set Piece Threat Bundesliga Match Fact can now consume the latest xGoals values for each shot in a specific match (right side of the diagram) to immediately compute the threat produced by the set piece that resulted in one or more shots.
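For illustration, a consumer of such a match-specific topic could look roughly like the following sketch using the kafka-python library; the topic name, broker address, and message format are hypothetical:

import json
from kafka import KafkaConsumer

# Hypothetical match-specific topic; brokers come from the Amazon MSK cluster
consumer = KafkaConsumer(
    'xgoals.match-12345',
    bootstrap_servers=['broker-1.msk.example.com:9092'],
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
    auto_offset_reset='earliest'  # allows replaying messages sent earlier
)

for message in consumer:
    shot = message.value
    # 'shot' would carry the latest xGoals value for a shot, for example:
    # {"match_id": 12345, "set_piece": "corner", "xg": 0.21}
    print(shot)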

Summary

We’re excited about the launch of Set Piece Threat and the patterns commentators and fans will uncover using this brand-new insight. As teams look to capitalize more and more on these dead ball situations, Set Piece Threat helps the viewer understand which team is doing this successfully and which team still has ground to cover, adding suspense before each of these set piece situations. The new Bundesliga Match Fact is available to Bundesliga’s broadcasters to uncover new perspectives and stories of a match, and team rankings can be viewed at any time in the Bundesliga app.

We’re excited to learn what patterns you will uncover. Share your insights with us: @AWScloud on Twitter, with the hashtag #BundesligaMatchFacts.


About the Authors

Simon Rolfes played 288 Bundesliga games as a central midfielder, scored 41 goals and won 26 caps for Germany. Currently Rolfes serves as Sporting Director at Bayer 04 Leverkusen where he oversees and develops the pro player roster, the scouting department and the club’s youth development. Simon also writes weekly columns on Bundesliga.com about the latest Bundesliga Match Facts powered by AWS.

Luuk Figdor is a Senior Sports Technology Specialist in the AWS Professional Services team. He works with players, clubs, leagues, and media companies such as Bundesliga and Formula 1 to help them tell stories with data using machine learning. In his spare time, he likes to learn all about the mind and the intersection between psychology, economics, and AI.

Jan Bauer is a Cloud Application Architect at AWS Professional Services. His interests are serverless computing, machine learning, and everything that involves cloud computing. He works with clients across industries to help them be successful on their cloud journey.

Pascal Kühner is a Cloud Application Developer in the AWS Professional Services Team. He works with customers across industries to help them achieve their business outcomes via application development, DevOps, and infrastructure. He loves ball sports and in his spare time likes to play basketball and football.

Uwe Dick is a Data Scientist at Sportec Solutions AG. He works to enable Bundesliga clubs and media to optimize their performance using advanced stats and data—before, after, and during matches. In his spare time, he settles for less and just tries to last the full 90 minutes for his recreational football team.

Javier Poveda-Panter is a Data Scientist for EMEA sports customers within the AWS Professional Services team. He enables customers in the area of spectator sports to innovate and capitalize on their data, delivering high quality user and fan experiences through machine learning and data science. He follows his passion for a broad range of sports, music and AI in his spare time.

Read More