Introducing TorchRec, and other domain library updates in PyTorch 1.11

We are introducing the beta release of TorchRec and a number of improvements to the current PyTorch domain libraries, alongside the PyTorch 1.11 release. These updates demonstrate our focus on developing common and extensible APIs across all domains to make it easier for our community to build ecosystem projects on PyTorch. Highlights include:

  • TorchRec, a PyTorch domain library for Recommendation Systems, is available in beta. View it on GitHub.
  • TorchAudio – Added Emformer- and RNN-T-based models and recipes to support the full development lifecycle of a streaming ASR model. See the release notes here.
  • TorchText – Added beta support for RoBERTa and XLM-R models, byte-level BPE tokenizer, and text datasets backed by TorchData. See the release notes here.
  • TorchVision – Added 4 new model families and 14 new classification datasets such as CLEVR, GTSRB, FER2013. See the release notes here.

TorchRec 0.1

We announced TorchRec a few weeks ago, and we are excited to release the beta version today. To recap, TorchRec is a PyTorch domain library for Recommendation Systems. This new library provides common sparsity and parallelism primitives, enabling researchers to build state-of-the-art personalization models and deploy them in production. TorchRec was used to train a 1.25 trillion parameter model, which was pushed to production in January 2022.

In particular, the library includes:

  • Modeling primitives, such as embedding bags and jagged tensors, that enable easy authoring of large, performant multi-device/multi-node models using hybrid data-parallelism and model-parallelism (see the sketch after this list).
  • Optimized RecSys kernels powered by FBGEMM, including support for sparse and quantized operations.
  • A sharder which can partition embedding tables with a variety of different strategies including data-parallel, table-wise, row-wise, table-wise-row-wise, and column-wise sharding.
  • A planner which can automatically generate optimized sharding plans for models.
  • Pipelining to overlap dataloading device transfer (copy to GPU), inter-device communications (input_dist), and computation (forward, backward) for increased performance.
  • GPU inference support.
  • Common modules for RecSys, such as models and public datasets (Criteo & Movielens).
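
To make these primitives concrete, below is a minimal sketch of building and querying an EmbeddingBagCollection with a KeyedJaggedTensor. The table names, sizes, and feature names are illustrative, and the exact API may differ slightly across TorchRec versions:

import torch
from torchrec import EmbeddingBagCollection, EmbeddingBagConfig
from torchrec.sparse.jagged_tensor import KeyedJaggedTensor

# Two embedding tables, one per sparse feature (sizes are illustrative).
ebc = EmbeddingBagCollection(
    tables=[
        EmbeddingBagConfig(name="t_user", embedding_dim=16,
                           num_embeddings=1000, feature_names=["user"]),
        EmbeddingBagConfig(name="t_movie", embedding_dim=16,
                           num_embeddings=1000, feature_names=["movie"]),
    ],
    device=torch.device("cpu"),
)

# A jagged batch of two examples: lengths gives the number of ids per
# feature per example, so no padding is needed. Values are ordered
# key-major: user ids [1], [2], then movie ids [3, 4], [5].
kjt = KeyedJaggedTensor(
    keys=["user", "movie"],
    values=torch.tensor([1, 2, 3, 4, 5]),
    lengths=torch.tensor([1, 1, 2, 1]),
)

pooled = ebc(kjt)                       # one pooled embedding per feature
print(pooled.to_dict()["movie"].shape)  # torch.Size([2, 16])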

Please check out the TorchRec announcement post and video tutorial, follow the install instructions, test drive the feature through this tutorial, and refer to the reference documentation.

TorchAudio 0.11

TorchAudio: Building Blocks for Audio and Speech Processing

We published a paper, TorchAudio: Building Blocks for Audio and Speech Processing, that provides an overview of the TorchAudio library. If you find TorchAudio useful for your research, please help us share it with the community by citing our paper.

(Beta) RNN-T & (Prototype) Emformer Models and Recipes

Emformer is an efficient memory-transformer-based streaming acoustic model that has demonstrated state-of-the-art streaming automatic speech recognition (ASR) performance in low-latency, resource-constrained scenarios, such as on-device applications (citation: https://arxiv.org/abs/2010.10759).

The TorchAudio v0.11 release includes the following beta features:

  • Implementation of Emformer (docs)
  • Recurrent neural network transducer (RNN-T) streaming ASR model that uses Emformer for its transcription network (docs)
  • RNN-T beam search decoder with TorchScript support (docs)
  • LibriSpeech Emformer RNN-T training recipe (GitHub) and corresponding pre-trained streaming ASR inference pipeline (docs), demonstrated in the sketch below
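
As a rough sketch of the pre-trained pipeline in non-streaming mode (the audio path is a placeholder, and method names follow the 0.11 pipeline docs):

import torch
import torchaudio

bundle = torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH

feature_extractor = bundle.get_feature_extractor()
decoder = bundle.get_decoder()                 # RNN-T beam search decoder
token_processor = bundle.get_token_processor()

waveform, sample_rate = torchaudio.load("sample.wav")  # placeholder path
assert sample_rate == bundle.sample_rate

with torch.no_grad():
    features, length = feature_extractor(waveform.squeeze())
    hypotheses = decoder(features, length, 10)          # beam width of 10

print(token_processor(hypotheses[0][0]))       # tokens of the best hypothesis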

There are also prototype features available from nightly builds or the main branch:

  • Emformer RNN-T training recipes for the MuST-C and TED-LIUM3 datasets. (GitHub)
  • Pre-trained pipelines corresponding to the recipes. (docs)
  • A tutorial that steps through performing online speech recognition with the Emformer RNN-T model. (docs)

Collectively, these features cover the full development lifecycle of a streaming ASR model, from definition through training and inference, and enable users to easily develop their own Emformer- and RNN-T-based models.

Special thanks to Yangyang Shi, Jay Mahadeokar, and Gil Keren for their code contributions and guidance.

(Beta) HuBERT Pretrain Model

The masked prediction training of the HuBERT model requires the masked logits, unmasked logits, and feature norm as outputs. The logits are used for cross-entropy losses and the feature norm for the penalty loss. This release adds HuBERTPretrainModel and corresponding factory functions (hubert_pretrain_base, hubert_pretrain_large, and hubert_pretrain_xlarge) to enable training from scratch.
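
A minimal instantiation sketch follows; the batch shapes and pseudo-label values are illustrative, and in practice the labels must match the model's frame rate and number of clusters:

import torch
import torchaudio

# Build a base HuBERT model for pre-training from scratch.
model = torchaudio.models.hubert_pretrain_base()

waveforms = torch.randn(2, 16000)        # two 1-second, 16 kHz clips
lengths = torch.tensor([16000, 16000])
labels = torch.randint(0, 100, (2, 49))  # frame-level cluster ids (illustrative)

# Returns masked logits, unmasked logits, and the feature penalty.
logit_m, logit_u, feature_penalty = model(waveforms, labels, lengths)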

(Prototype) CTC Beam Search Decoder

In recent releases, TorchAudio has added support for ASR models fine-tuned on CTC loss. The addition of an inference-time CTC beam search decoder enables running end-to-end ASR evaluation using TorchAudio utils.

The CTC decoder in TorchAudio supports customizable beam search decoding with a lexicon constraint. It also has optional KenLM language model support.
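
As a rough sketch of the decoder interface (file paths are placeholders; in 0.11 the API lives under a prototype namespace, and it later stabilized as torchaudio.models.decoder.ctc_decoder):

import torch
from torchaudio.models.decoder import ctc_decoder

# Lexicon-constrained beam search with an optional KenLM language model.
decoder = ctc_decoder(
    lexicon="lexicon.txt",    # placeholder: maps words to token sequences
    tokens="tokens.txt",      # placeholder: acoustic model vocabulary
    lm="kenlm.bin",           # placeholder: optional KenLM binary
    beam_size=50,
)

emissions = torch.randn(1, 100, 32)      # (batch, frames, num_tokens) CTC output
hypotheses = decoder(emissions)
print(" ".join(hypotheses[0][0].words))  # words of the best hypothesis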

For more details, please check out the API tutorial and documentation. This prototype feature is available through nightly builds.

(Prototype) Streaming API

TorchAudio started as a collection of simple audio I/O APIs that supplement PyTorch. With the recent addition of ASR models and training recipes, the project has received requests to support high-level application development.

The Streaming API makes it easy to develop and test models for online inference. It uses FFmpeg under the hood and enables reading media from online services and hardware devices, decoding media incrementally, and applying filters and preprocessing.
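
A minimal sketch of incremental decoding follows (the source URL and the process function are placeholders; in 0.11 the class lives under a prototype namespace, while later releases expose it as torchaudio.io.StreamReader):

from torchaudio.io import StreamReader

# Read from a local file, a network stream, or a hardware device.
streamer = StreamReader("https://example.com/stream.mp3")  # placeholder source
streamer.add_basic_audio_stream(frames_per_chunk=16000, sample_rate=16000)

# Decode incrementally, one chunk at a time.
for (chunk,) in streamer.stream():
    process(chunk)  # placeholder: feature extraction, inference, etc.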

Please check out the API tutorial and the documentation. There are also the streaming ASR tutorial and the device streaming ASR tutorial. This feature is available from nightly releases; please refer to pytorch.org for how to install nightly builds.

TorchText 0.12

(Beta) RoBERTa and XLM-R Models

TorchText has added support for pre-trained RoBERTa and XLM-R models. This allows users to train end-to-end Transformer-encoder-based models on standard NLP tasks using TorchText.

More specifically:

  • The models are torchscriptable and hence can be employed for production use cases.
  • The model APIs let users easily attach custom task-specific heads to pre-trained encoders.
  • The API also comes equipped with data pre-processing transforms to match the pre-trained weights and model configuration.

We have added a tutorial to demonstrate the SST-2 binary text classification task with the pre-trained XLM-R base architecture.
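
The following sketch, adapted from the pattern in the documentation, attaches a binary classification head to the XLM-R base encoder; the padding value and input dimension are assumptions based on the bundled configuration:

import torchtext.functional as F
from torchtext.models import XLMR_BASE_ENCODER, RobertaClassificationHead

# Attach a task-specific head to the pre-trained encoder.
classifier_head = RobertaClassificationHead(num_classes=2, input_dim=768)
model = XLMR_BASE_ENCODER.get_model(head=classifier_head)

# The bundle also ships the matching text pre-processing transform.
transform = XLMR_BASE_ENCODER.transform()
batch = ["a sentence to classify", "another one"]
tokens = F.to_tensor(transform(batch), padding_value=1)

logits = model(tokens)  # shape: (batch, num_classes)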

For additional details on model APIs and usage examples, please refer to the documentation.

(Beta) byte-level BPE tokenizer

TorchText has added support for a Byte-Level BPE tokenizer, as used in GPT-2. This tokenizer is also used for tokenizing inputs to the pre-trained RoBERTa models described previously. In addition to the RoBERTa vocab, users can also load their own custom BPE vocab to use the tokenizer. Furthermore, the tokenizer is fully torchscriptable and hence can be employed for production use cases. For additional details on model APIs and usage examples, please refer to the documentation.
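
A minimal usage sketch, where the encoder and vocab paths are placeholders for your own artifacts:

from torchtext.transforms import GPT2BPETokenizer

tokenizer = GPT2BPETokenizer(
    encoder_json_path="gpt2_bpe/encoder.json",  # placeholder path
    vocab_bpe_path="gpt2_bpe/vocab.bpe",        # placeholder path
)
print(tokenizer("Hello world!"))  # list of BPE token ids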

(Beta) Text datasets backed by TorchData

TorchText has modernized its datasets by migrating from older-style Iterable Datasets to TorchData’s DataPipes. TorchData is a library that provides modular/composable primitives, allowing users to load and transform data in performant data pipelines.

These DataPipes work out of the box with the PyTorch DataLoader and enable new functionality such as auto-sharding. Users can now easily do data manipulation and pre-processing using user-defined functions and transformations in a functional programming style. Datasets backed by DataPipes also enable standard flow control such as batching, collation, shuffling, and bucketizing.

Collectively, DataPipes provide a comprehensive experience for data preprocessing and tensorization needs in a Pythonic and flexible way for model training. We have added a tutorial to demonstrate data-processing pipelining using the modernized dataset for binary text classification.

You can learn more about TorchData DataPipe APIs in its official documentation.

TorchVision 0.12

New Models

Four new model families have been released in the latest version along with pre-trained weights for their variants.

#1 Object Detection

FCOS is a popular, fully convolutional, anchor-free model for object detection. In this release we include a community-contributed model implementation as well as pre-trained weights. The model was trained on COCO train2017 and can be used as follows:

import torch
from torchvision import models

# Detection models take a list of 3D tensors, one per image.
x = [torch.rand(3, 224, 224)]
fcos = models.detection.fcos_resnet50_fpn(pretrained=True).eval()
predictions = fcos(x)

The box AP of the pre-trained model on COCO val2017 is 39.2 (see #4961 for more details).

We would like to thank Hu Ye and Zhiqiang Wang for contributing to the model implementation and initial training. This was the first community-contributed model in a long while, and given its success, we decided to use the learnings from this process to create new model contribution guidelines.

#2 Optical Flow support and RAFT model

TorchVision now supports optical flow! Optical Flow models try to predict movement in a video: given two consecutive frames, the model predicts where each pixel of the first frame ends up in the second frame. Check out our new tutorial on Optical Flow!

We implemented a torchscript-compatible RAFT model with pre-trained weights (both normal and “small” versions), and added support for training and evaluating optical flow models. Our training scripts support distributed training across processes and nodes, leading to much faster training time than the original implementation. We also added 5 new optical flow datasets: Flying Chairs, Flying Things, Sintel, Kitti, and HD1K.
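
A minimal sketch of running the model (input sizes are illustrative and must be divisible by 8):

import torch
from torchvision.models.optical_flow import raft_large

model = raft_large(pretrained=True).eval()

# Two consecutive frames as batched tensors.
img1 = torch.rand(1, 3, 520, 960)
img2 = torch.rand(1, 3, 520, 960)

with torch.no_grad():
    flow_predictions = model(img1, img2)  # list of iterative refinements
predicted_flow = flow_predictions[-1]     # (1, 2, 520, 960) final flow field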

#3 Image Classification

Vision Transformer (ViT) and ConvNeXt are two popular architectures which can be used as image classifiers or as backbones for downstream vision tasks. In this release we include 8 pre-trained weights for their classification variants. The models were trained on ImageNet and can be used as follows:

import torch
from torchvision import models

# Classification builders live directly under torchvision.models.
x = torch.rand(1, 3, 224, 224)
vit = models.vit_b_16(pretrained=True).eval()
convnext = models.convnext_tiny(pretrained=True).eval()
predictions1 = vit(x)
predictions2 = convnext(x)

The accuracies of the pre-trained models obtained on ImageNet val are shown below:

Model            Acc@1    Acc@5
vit_b_16         81.072   95.318
vit_b_32         75.912   92.466
vit_l_16         79.662   94.638
vit_l_32         76.972   93.07
convnext_tiny    82.52    96.146
convnext_small   83.616   96.65
convnext_base    84.062   96.87
convnext_large   84.414   96.976

The above models have been trained using an adjusted version of our new training recipe, which allows us to offer models with accuracies significantly higher than those reported in the original papers.

#4 GPU Video Decoding

In this release, we add support for GPU video decoding in the video reading API. To use hardware-accelerated decoding, we just need to pass a CUDA device to the video reading API, as shown below:

import torchvision

reader = torchvision.io.VideoReader(file_name, device="cuda:0")
for frame in reader:
    print(frame)

We also support seeking to any frame or a keyframe in the video before reading, as shown below:

reader.seek(seek_time)

New Datasets

We have implemented 14 new classification datasets: CLEVR, GTSRB, FER2013, SUN397, Country211, Flowers102, FGVC Aircraft, Oxford-IIIT Pet, DTD, Food-101, Rendered SST-2, Stanford Cars, PCAM, and EuroSAT.

As part of our work on Optical Flow support (see above for more details), we also added 5 new optical flow datasets: Flying Chairs, Flying Things, Sintel, Kitti, and HD1K.

Other Updates

  • New documentation layout: Each function / class is now documented in a separate page, clearing up some space in the per-module pages, and easing the discovery of the proposed APIs. Compare e.g. our previous docs vs the new ones. Please let us know if you have any feedback!
  • New model contribution guidelines have been published following the success of the FCOS model which was contributed by the community. These guidelines aim to be an overview of the model contribution process for anyone who would like to suggest, implement and train a new model.
  • Upcoming Prototype API – We are currently working on a prototype API which adds multi-weight support to all of our model builder methods. This will enable us to offer multiple pre-trained weights, associated with their metadata and inference transforms. The API is still under review and thus was not included in the release, but you can read more about it in our blog post and provide your feedback on the dedicated GitHub issue.
  • Changes in our deprecation policy – Up until now, torchvision would almost never remove deprecated APIs. In order to be more aligned and consistent with PyTorch core, we are updating our deprecation policy. We are now following a 2-release deprecation cycle: deprecated APIs will raise a warning for 2 versions, and will be removed after that. To reflect these changes and to smooth the transition, we have decided to:

    • Remove all APIs that had been deprecated before or on v0.8, released 1.5 years ago.
    • Update the removal timeline of all other deprecated APIs to v0.14, to reflect the new 2-cycle policy starting now in v0.12.

Captum 0.5

Captum is a PyTorch library for model interpretability. For this release, we expanded Captum with influential instances and added support for both similarity-based influences and novel algorithms: TracIn and its variants. TracIn variants offer faster approximation of influence scores based on random projections for fully connected layers.

More specifically, the new influence subsection of Captum includes:

  • SimilarityInfluence computes similarity scores between test and training examples using default (cosine or Euclidean) or custom user-defined metrics w.r.t. given input model layers.
  • TracInCP approximates the influence score of each training example on a given test example based on the dot-product similarity between loss gradients w.r.t. model parameters for test and training examples (see the sketch after this list). Note that if we use training examples as test examples, we compute self-influence. This method and its variants described below also return top-k proponents and opponents, which are the top-k largest positive and negative influential examples, respectively.
  • TracInCPFast is an approximation of TracInCP that avoids computing the gradients w.r.t. large parameter matrices. It approximates influence score based on the dot products between last fully connected layer activations and loss gradients w.r.t. that layer for training and test examples.
  • TracInCPFastRandProj uses a nearest neighbor approximation library such as Annoy to compute the dot product between the training and test quantities. In order to reduce the dimensionality of layer activations and corresponding gradients, this method additionally allows projecting those vectors into a lower-dimensional space using random projection matrices.
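
A rough sketch of the TracInCP interface follows; the model, dataset, checkpoint paths, and batch size are placeholders, and the exact signature may differ across Captum versions:

import torch
from captum.influence import TracInCP

# model and train_dataset are placeholders for your own objects.
tracin = TracInCP(
    model=model,
    train_dataset=train_dataset,
    checkpoints=["checkpoints/ckpt_1.pt", "checkpoints/ckpt_2.pt"],
    loss_fn=torch.nn.CrossEntropyLoss(),
    batch_size=8,
)

# Top-k proponents (most positively influential training examples)
# for a batch of test inputs.
results = tracin.influence(inputs=test_inputs, targets=test_labels, k=5)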

More about the implementation of influential instances can be found on our GitHub page and tutorials.

Thanks for reading. If you’re interested in these updates and want to join the PyTorch community, we encourage you to join the discussion forums and open GitHub issues. To get the latest news from PyTorch, follow us on Twitter, Medium, YouTube, and LinkedIn.

Cheers!

Team PyTorch

Read More

PyTorch 1.11, TorchData, and functorch are now available

We are excited to announce the release of PyTorch 1.11 (release notes). This release is composed of over 3,300 commits since 1.10, made by 434 contributors. Along with 1.11, we are releasing beta versions of TorchData and functorch.

Summary:

  • TorchData is a new library for common modular data loading primitives for easily constructing flexible and performant data pipelines. View it on GitHub.
  • functorch, a library that adds composable function transforms to PyTorch, is now available in beta. View it on GitHub.
  • Distributed Data Parallel (DDP) static graph optimizations available in stable.

Introducing TorchData

We are delighted to present the Beta release of TorchData. This is a library of common modular data loading primitives for easily constructing flexible and performant data pipelines. Based on community feedback, we have found that the existing DataLoader bundles too many features together and can be difficult to extend. Moreover, different use cases often have to rewrite the same data loading utilities over and over again. The goal here is to enable composable data loading through Iterable-style and Map-style building blocks called “DataPipes,” which work well out of the box with PyTorch’s DataLoader.

A DataPipe takes in some access function over Python data structures, __iter__ for IterDataPipe and __getitem__ for MapDataPipe, and returns a new access function with a slight transformation applied. You can chain multiple DataPipes together to form a data pipeline that performs all the necessary data transformation.
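
For illustration, here is a small pipeline chained from basic DataPipes; the in-memory source stands in for a real file-opening DataPipe:

from torchdata.datapipes.iter import IterableWrapper

# Chain DataPipes functionally: each step wraps the previous one.
dp = IterableWrapper(["1,2", "3,4", "5,6"])
dp = dp.map(lambda line: [int(x) for x in line.split(",")])
dp = dp.shuffle()
dp = dp.batch(2)

for batch in dp:
    print(batch)  # e.g., [[3, 4], [1, 2]]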

We have implemented over 50 DataPipes that provide different core functionalities, such as opening files, parsing texts, transforming samples, caching, shuffling, and batching. For users who are interested in connecting to cloud providers (such as Google Drive or AWS S3), the fsspec and iopath DataPipes will allow you to do so. The documentation provides detailed explanations and usage examples of each IterDataPipe and MapDataPipe.

In this release, some of the PyTorch domain libraries have migrated their datasets to use DataPipes. In TorchText, the popular datasets provided by the library are implemented using DataPipes, and a section of its SST-2 binary text classification tutorial demonstrates how you can use DataPipes to preprocess data for your model. There are also other prototype implementations of datasets with DataPipes in TorchVision (available in nightly releases) and in TorchRec. You can find more specific examples here.

The documentation for TorchData is now live. It contains a tutorial that covers how to use DataPipes, use them with DataLoader, and implement custom ones. FAQs and future plans related to DataLoader are described in our project’s README file.

Introducing functorch

We’re excited to announce the first beta release of functorch. Heavily inspired by Google JAX, functorch is a library that adds composable function transforms to PyTorch. It aims to provide composable vmap (vectorization) and autodiff transforms that work with PyTorch modules and PyTorch autograd with good eager-mode performance.

Composable function transforms can help with a number of use cases that are tricky to do in PyTorch today:

  • computing per-sample-gradients (or other per-sample quantities)
  • running ensembles of models on a single machine
  • efficiently batching together tasks in the inner-loop of MAML
  • efficiently computing Jacobians and Hessians as well as batched ones

Composing vmap (vectorization), vjp (reverse-mode AD), and jvp (forward-mode AD) transforms allows us to effortlessly express the above without designing a separate library for each.
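
For example, per-sample gradients can be expressed by composing grad and vmap; a minimal sketch using functorch's functional API:

import torch
from functorch import grad, make_functional, vmap

model = torch.nn.Linear(3, 1)
fmodel, params = make_functional(model)

def loss_fn(params, x, y):
    pred = fmodel(params, x)
    return ((pred - y) ** 2).mean()

x = torch.randn(8, 3)  # batch of 8 samples
y = torch.randn(8, 1)

# grad differentiates the loss for one sample; vmap maps it over the batch.
per_sample_grads = vmap(grad(loss_fn), in_dims=(None, 0, 0))(params, x, y)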

For more details, please see our documentation, tutorials, and installation instructions.

Distributed Training

(Stable) DDP static graph

DDP static graph assumes that your model employs the same set of used and unused parameters in every iteration, so that after the first iteration it can deterministically know states such as which hooks will fire, how many times the hooks will fire, and the order in which gradients become ready. Static graph caches these states in the first iteration, and can therefore support features that DDP could not support in previous releases, e.g., multiple activation checkpoints on the same parameters regardless of whether there are unused parameters. The static graph feature also applies performance optimizations when there are unused parameters, e.g., it avoids traversing the graph to search for unused parameters every iteration, and enables dynamic bucketing order. These optimizations brought a 10% QPS gain for some recommendation models.

To enable static graph, simply set static_graph=True in the DDP API like this:

ddp_model = DistributedDataParallel(model, static_graph=True)

For more details, please see our documentation and tutorials.

Thanks for reading. If you’re interested in these updates and want to join the PyTorch community, we encourage you to join the discussion forums and open GitHub issues. To get the latest news from PyTorch, follow us on Twitter, Medium, YouTube, and LinkedIn.

Cheers!

Team PyTorch

Read More

Extract granular sentiment in text with Amazon Comprehend Targeted Sentiment

Amazon Comprehend is a natural language processing (NLP) service that uses machine learning (ML) to discover insights from text. As a fully managed service, Amazon Comprehend requires no ML expertise and can scale to large volumes of data. Amazon Comprehend provides several different APIs to easily integrate NLP into your applications. You can simply call the APIs in your application and provide the location of the source document or text. The APIs output entities, key phrases, sentiment, document classification, and language in an easy-to-use format for your application or business.

The sentiment analysis APIs provided by Amazon Comprehend help businesses determine the sentiment of a document. You can gauge the overall sentiment of a document as positive, negative, neutral, or mixed. However, to understand the sentiment associated with specific products or brands, businesses have had to employ workarounds like chunking the text into logical blocks and inferring the sentiment expressed towards a specific product.

To help simplify this process, starting today, Amazon Comprehend is launching the Targeted Sentiment feature for sentiment analysis. This provides the capability to identify groups of mentions (co-reference groups) corresponding to a single real-world entity or attribute, provide the sentiment associated with each entity mention, and provide the classification of the real-world entity based on a pre-determined list of entities.

This post provides an overview of how you can get started with Amazon Comprehend targeted sentiment, demonstrates what you can do with the output, and walks through three common targeted sentiment use cases.

Solution overview

The following is an example of targeted sentiment:

“Spa” is the primary entity, identified as type facility, and is mentioned two more times, referred to by the pronoun “it.” The Targeted Sentiment API provides the sentiment towards each entity. Positive sentiment is green, negative is red, and neutral is blue. We can also determine how the sentiment towards the spa changes throughout the sentence. We dive deeper into the API later in the post.

This feature opens up several possibilities for businesses. Marketing teams can track popular sentiments toward their brands in social media over time. Ecommerce merchants can understand which specific attributes of their products were best- and worst-received by customers. Call center operators can use the feature to mine transcripts for escalation issues and to monitor customer experience. Restaurants, hotels, and other hospitality industry organizations can use the service to turn broad ratings categories into rich descriptions of good and bad customer experiences.

Targeted sentiment use cases

The Targeted Sentiment API in Amazon Comprehend takes text data such as social media posts, application reviews, and call center transcriptions as input. Then it analyzes the input using the power of NLP algorithms to extract entity-level sentiment automatically. An entity is a textual reference to the unique name of a real-world object, such as people, places, and commercial items, in addition to precise references to measures such as dates and quantities. For a full list of supported entities, refer to Targeted Sentiment Entities.

We use the Targeted Sentiment API to enable the following use cases:

  • A business can identify parts of the employee/customer experience that are enjoyable and parts that may be improved.
  • Contact centers and customer service teams can analyze call transcriptions or chat logs to identify agent training effectiveness, and conversation details such as specific reactions from a customer and the phrases or words used to elicit that response.
  • Product owners and UI/UX developers can identify features of their product that users enjoy and parts that require improvement. This can support product roadmap discussions and prioritizations.

The following diagram illustrates the targeted sentiment process:

In this post, we demonstrate this process using the following three sample reviews:

  • Sample 1: Business and product review – “I really like how thick the jacket is. I wear a large jacket because I have broad shoulders and that’s what I ordered and it fits perfectly there. I almost feel like it balloons out from the chest down. I thought I would use the strings in the bottom of the jacket to help close it and bring it in, but those don’t work. The jacket feels very bulky.”
  • Sample 2: Contact center transcription – “Hi there, there is a fraud block on my credit card, can you remove it for me. My credit card keeps getting flagged for fraud. It is quite annoying, every time I go to use it, I keep getting declined. I’m going to cancel the card if this happens again.”
  • Sample 3: Employer feedback survey – “I’m glad management is upskilling the team. But the instructor did not go over the basics well. Management should do more due diligence on everyone’s skill level for future sessions.”

Prepare the data

To get started, download the sample files containing the example text using the AWS Command Line Interface (AWS CLI) by running the following command:

aws s3 cp s3://aws-blogs-artifacts-public/artifacts/ML-8148/ts-sample-data.zip .

Create an Amazon Simple Storage Service (Amazon S3) bucket, unzip the file, and upload the folder containing the three sample files. Make sure you’re using the same Region throughout.

You can now access the three sample text files in your S3 bucket.

Create a job in Amazon Comprehend

After you upload the files to your S3 bucket, complete the following steps:

  1. On the Amazon Comprehend console, choose Analysis jobs in the navigation pane.
  2. Choose Create job.
  3. For Name, enter a name for your job.
  4. For Analysis type, choose Targeted sentiment.
  5. Under Input data, enter the Amazon S3 location of the ts-sample-data folder.
  6. For Input format, choose One document per file.

You can change this configuration if your data is in a single file delimited by lines.

  7. Under Output location, enter the Amazon S3 location where you want to save the job output.
  8. Under Access permissions, for IAM role, choose an existing AWS Identity and Access Management (IAM) role or create one that has permissions to the S3 bucket.
  9. Leave the other options as default and choose Create job.

After you start the job, you can review your job details. The total job runtime depends on the size of the input data.

  10. When the job is complete, under Output, choose the link to the output data location.

Here you can find a compressed output file.

  11. Download and decompress the file.

You can now inspect the output files for each sample text. Open the files in your preferred text editor to review the API response structure. We describe this in more detail in the next section.
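
If you prefer to automate these console steps, the following is a hedged boto3 sketch of starting the same analysis job; the bucket paths and role ARN are placeholders:

import boto3

comprehend = boto3.client("comprehend")

response = comprehend.start_targeted_sentiment_detection_job(
    JobName="ts-sample-job",
    LanguageCode="en",
    InputDataConfig={
        "S3Uri": "s3://your-bucket/ts-sample-data/",  # placeholder
        "InputFormat": "ONE_DOC_PER_FILE",
    },
    OutputDataConfig={"S3Uri": "s3://your-bucket/ts-output/"},  # placeholder
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendS3Access",  # placeholder
)
print(response["JobId"])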

API response structure

The Targeted Sentiment API provides a simple way to consume the output of your jobs. It provides a logical grouping of the entities (entity groups) detected, along with the sentiment for each entity. The following are some definitions of the fields that are in the response:

  • Entities – The significant parts of the document. For example, Person, Place, Date, Food, or Taste.
  • Mentions – The references or mentions of the entity in the document. These can be pronouns or common nouns such as “it,” “him,” “book,” and so on. These are organized in order by location (offset) in the document.
  • DescriptiveMentionIndex – The index in Mentions that gives the best depiction of the entity group. For example, “ABC Hotel” instead of “hotel,” “it,” or other common noun mentions.
  • GroupScore – The confidence that all the entities mentioned in the group are related to the same entity (such as “I,” “me,” and “myself” referring to one person).
  • Text – The text in the document that depicts the entity.
  • Type – A description of what the entity depicts.
  • Score – The model confidence that this is a relevant entity.
  • MentionSentiment – The actual sentiment found for the mention.
  • Sentiment – The string value of positive, neutral, negative, or mixed.
  • SentimentScore – The model confidence for each possible sentiment.
  • BeginOffset – The offset into the document text where the mention begins.
  • EndOffset – The offset into the document text where the mention ends.
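
The following is a short sketch for walking this structure; it assumes the decompressed job output contains one JSON document per line with a top-level Entities list, per the definitions above:

import json

# Print each entity group's best mention and the sentiment of every mention.
with open("output", "r") as f:  # decompressed job output
    for line in f:
        doc = json.loads(line)
        for group in doc.get("Entities", []):
            mentions = group["Mentions"]
            best = mentions[group["DescriptiveMentionIndex"][0]]
            print(f"Entity group: {best['Text']} ({best['Type']})")
            for m in mentions:
                sentiment = m["MentionSentiment"]["Sentiment"]
                print(f"  {m['BeginOffset']}-{m['EndOffset']}: "
                      f"{m['Text']!r} -> {sentiment}")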

To demonstrate this visually, let’s take the output of the third use case, the employer feedback survey, and walk through the entity groups that represent the employee completing the survey, management, and the instructor.

Let’s first look at all the mentions of the co-reference entity group associated with “I” (the employee writing the response) and the location of the mention in the text. DescriptiveMentionIndex represents indexes of the entity mentions that best depict the co-reference entity group (in this case I):

{
      "DescriptiveMentionIndex": [
        0
      ],
      "Mentions": [
        {
          "BeginOffset": 0,
          "EndOffset": 1,
          "Score": 0.999997,
          "GroupScore": 1,
          "Text": "I",
          "Type": "PERSON",
          "MentionSentiment": {
            "Sentiment": "NEUTRAL",
            "SentimentScore": {
              "Mixed": 0,
              "Negative": 0,
              "Neutral": 1,
              "Positive": 0
            }
          }
        }
      ]
    }

The next group of entities provides all mentions of the co-reference entity group associated with management, along with its location in the text. DescriptiveMentionIndex represents indexes of the entity mentions that best depict the co-reference entity group (in this case management). Something to observe in this example is the sentiment shift towards management. You can use this data to infer what parts of management’s actions were perceived as positive, and what parts were perceived as negative and therefore can be improved upon.

{
      "DescriptiveMentionIndex": [
        0,
        1
      ],
      "Mentions": [
        {
          "BeginOffset": 9,
          "EndOffset": 19,
          "Score": 0.999984,
          "GroupScore": 1,
          "Text": "management",
          "Type": "ORGANIZATION",
          "MentionSentiment": {
            "Sentiment": "POSITIVE",
            "SentimentScore": {
              "Mixed": 0,
              "Negative": 0,
              "Neutral": 0,
              "Positive": 1
            }
          }
        },
        {
          "BeginOffset": 103,
          "EndOffset": 113,
          "Score": 0.999998,
          "GroupScore": 0.999896,
          "Text": "Management",
          "Type": "ORGANIZATION",
          "MentionSentiment": {
            "Sentiment": "NEGATIVE",
            "SentimentScore": {
              "Mixed": 0.000149,
              "Negative": 0.990075,
              "Neutral": 0.000001,
              "Positive": 0.009775
            }
          }
        }
      ]
    }

To conclude, let’s observe all mentions of the instructor and the location in the text. DescriptiveMentionIndex represents indexes of the entity mentions that best depict the co-reference entity group (in this case instructor):

{
      "DescriptiveMentionIndex": [
        0
      ],
      "Mentions": [
        {
          "BeginOffset": 52,
          "EndOffset": 62,
          "Score": 0.999996,
          "GroupScore": 1,
          "Text": "instructor",
          "Type": "PERSON",
          "MentionSentiment": {
            "Sentiment": "NEGATIVE",
            "SentimentScore": {
              "Mixed": 0,
              "Negative": 0.999997,
              "Neutral": 0.000001,
              "Positive": 0.000001
            }
          }
        }
      ]
    }

Reference architecture

You can apply targeted sentiment to many scenarios and use cases to drive business value, such as the following:

  • Determine efficacy of marketing campaigns and feature launches by detecting the entities and mentions that contain the most positive or negative feedback
  • Query output to determine which entities and mentions relate to a corresponding entity (positive, negative, or neutral)
  • Analyze sentiment across the customer interaction lifecycle in contact centers to demonstrate efficacy of process or training changes

The following diagram depicts an end-to-end process:

Conclusion

Understanding the interactions and feedback organizations receive from customers about their products and services remains crucial in developing better products and customer experiences. As such, more granular details are required to infer better outcomes.

In this post, we provided some examples of how using these granular details can help organizations improve products, customer experiences, and training while also incentivizing and validating positive attributes. There are many use cases across industries where you can experiment with and gain value from targeted sentiment.

We encourage you to try this new feature with your use cases. For more information and to get started, refer to Targeted Sentiment.


About the Authors

Raj Pathak is a Solutions Architect and Technical advisor to Fortune 50 and Mid-Sized FSI (Banking, Insurance, Capital Markets) customers across Canada and the United States. Raj specializes in Machine Learning with applications in Document Extraction, Contact Center Transformation and Computer Vision.

Sanjeev Pulapaka is a Senior Solutions Architect in the U.S. Fed Civilian SA team at Amazon Web Services (AWS). He works closely with customers in building and architecting mission critical solutions. Sanjeev has extensive experience in leading, architecting and implementing high-impact technology solutions that address diverse business needs in multiple sectors including commercial, federal, state and local governments. He has an undergraduate degree in engineering from the Indian Institute of Technology and an MBA from the University of Notre Dame.

Read More

Amazon SageMaker Autopilot now supports time series data

Amazon SageMaker Autopilot automatically builds, trains, and tunes the best machine learning (ML) models based on your data, while allowing you to maintain full control and visibility. We have recently announced support for time series data in Autopilot. You can use Autopilot to tackle regression and classification tasks on time series data, or sequence data in general. Time series data is a special type of sequence data where data points are collected at even time intervals.

Manually preparing the data, selecting the right ML model, and optimizing its parameters is a complex task, even for an expert practitioner. Although automated approaches exist that can find the best models and their parameters, these typically can’t handle data that comes as sequences, such as network traffic, electricity consumption, or household expenses recorded over time. Because this data takes the form of observations acquired at different time points, consecutive observations can’t be treated as independent of each other and need to be processed as a whole. You can use Autopilot for a wide range of problems dealing with sequential data. For example, you can classify network traffic recorded over time to identify malicious activities, or determine if individuals qualify for a mortgage based on their credit history. You provide a dataset containing time series data and Autopilot handles the rest, processing the sequential data through specialized feature transforms and finding the best model on your behalf.

Autopilot eliminates the heavy lifting of building ML models, and helps you automatically build, train, and tune the best ML model based on your data. Autopilot runs several algorithms on your data and tunes their hyperparameters on a fully managed compute infrastructure. In this post, we demonstrate how you can use Autopilot to solve classification and regression problems on time series data. For instructions on creating and training an Autopilot model, see Customer Churn Prediction with Amazon SageMaker Autopilot.

Time series data classification using Autopilot

As a running example, we consider a multi-class problem on the time series dataset UWaveGestureLibraryX, containing equidistant readings of accelerometer sensors while performing one of eight predefined hand gestures. For simplicity, we consider only the X dimension of the accelerometer. The task is to build a classification model to map the time series data from the sensor readings to the predefined gestures. The following figure shows the first rows of the dataset in CSV format. The entire table consists of 896 rows and two columns: the first column is a gesture label and the second column is a time series of sensor readings.

Convert data to the right format with Amazon SageMaker Data Wrangler

On top of accepting numerical, categorical, and standard text columns, Autopilot now also accepts a sequence input column. If your time series data doesn’t follow this format, you can easily convert it through Amazon SageMaker Data Wrangler. Data Wrangler reduces the time it takes to aggregate and prepare data for ML from weeks to minutes. With Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization from a single visual interface. For instance, consider the same dataset but in a different input format: each gesture (specified by ID) is a sequence of equidistant measurements of the accelerometer. When stored vertically, each row contains a timestamp and one value. The following figure compares this data in its original format and a sequence format.

To convert this dataset to the format described earlier using Data Wrangler, load the dataset from Amazon Simple Storage Service (Amazon S3). Then use the time series Group by transform, as shown in the following screenshot, and export the data back to Amazon S3 in CSV format.

When the dataset is in the required format, you can proceed with Autopilot. To explore other time series transforms in Data Wrangler, refer to Prepare time series data with Amazon SageMaker Data Wrangler.

Launch an AutoML job

As with other input types supported by Autopilot, each row of the dataset is a different observation and each column is a feature. In this example, we have a single column containing time series data, but you can have multiple time series columns. You can also have multiple columns with different input types, such as time series, text, and numerical.

To create an Autopilot experiment, place the dataset in an S3 bucket and create a new experiment within Amazon SageMaker Studio. As shown in the following screenshot, you must specify the name of the experiment, the S3 location of the dataset, the S3 location for the output artifacts, and the column name to predict.

Autopilot analyzes the data, generates ML pipelines, and runs 250 iterations of hyperparameter optimization by default on this classification task. As shown in the following model leaderboard, Autopilot reaches 0.821 accuracy, and you can deploy the best model in just one click.
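
The same experiment can also be launched programmatically; the following boto3 sketch uses placeholder paths, a placeholder role ARN, and the label column of this dataset:

import boto3

sm = boto3.client("sagemaker")

sm.create_auto_ml_job(
    AutoMLJobName="uwave-gesture-classification",
    InputDataConfig=[{
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://your-bucket/uwave/train.csv",  # placeholder
        }},
        "TargetAttributeName": "gesture",                 # placeholder label column
    }],
    OutputDataConfig={"S3Uri": "s3://your-bucket/uwave/output/"},  # placeholder
    ProblemType="MulticlassClassification",
    AutoMLJobObjective={"MetricName": "Accuracy"},
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
)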

In addition, Autopilot generates a data exploration report, where you can visualize and explore your data.

Transparency is foundational for Autopilot. You can inspect and modify generated ML pipelines within the candidate definition notebook. The following screenshot demonstrates how Autopilot recommends a range of pipelines, combining the time series transformer TSFeatureExtractor with different ML algorithms, such as gradient boosted decision trees and linear models. The TSFeatureExtractor extracts hundreds of time series features for you, which are then fed to the downstream algorithms to make predictions. For the full list of time series features, refer to Overview on extracted features.

Conclusion

In this post, we demonstrated how to use SageMaker Autopilot to solve time series classification and regression problems in just a few clicks.

For more information about Autopilot, see Amazon SageMaker Autopilot. To explore related features of SageMaker, see Amazon SageMaker Data Wrangler.


About the Authors

Nikita Ivkin is an Applied Scientist, Amazon SageMaker Data Wrangler.

Anne Milbert is a Software Development Engineer working on Amazon SageMaker Automatic Model Tuning.

Valerio Perrone is an Applied Science Manager working on Amazon SageMaker Automatic Model Tuning and Autopilot.

Meghana Satish is a Software Development Engineer working on Amazon SageMaker Automatic Model Tuning.

Ali Takbiri is an AI/ML specialist Solutions Architect, and helps customers by using Machine Learning to solve their business challenges on the AWS Cloud.

Read More

Optimizing Airline Tail Assignments for Cleaner Skies

Airlines around the world are exploring several tactics to meet aggressive CO2 commitments set by the International Civil Aviation Organization (ICAO). This effort has been emphasized in Europe, where aviation accounts for 13.9% of the transportation industry’s carbon emissions. The largest push comes from the European Green Deal, which aims to decrease carbon emissions from transportation by 90% by 2050. The Lufthansa Group has gone even further, committing to a 50% reduction in emissions compared to 2019 by the year 2030 and to reach net-zero emissions by 2050.

One unexpected approach that airlines can use to lower carbon emissions is optimizing their tail assignment, i.e., how to assign aircraft (identified by the aircraft registration painted on their tails) to legs in a way that minimizes the total operating cost, of which fuel is a major contributor. More fuel needed to operate the aircraft means higher operating costs and more carbon emitted into the atmosphere. For example, a typical long-haul flight (longer than ~4,100 km or ~2,500 mi) emits about a ton of CO2.

The amount of fuel needed to fly between origin and destination can vary widely — e.g., larger aircraft weigh more and therefore require more fuel, while modern and younger aircraft tend to be more fuel-efficient because they use newer technology. The mass of the fuel itself is also significant. Aircraft are less fuel-efficient early in their flights when their fuel tanks are full than later when the volume of fuel is reduced. Another important factor for the tail assignment is the number of passengers on board; as the number of bookings changes, a smaller or larger aircraft might be required. Other factors can affect fuel consumption, both negative (e.g., headwinds or the age of the engines) and positive (e.g., tailwinds, sharklets, skin).

During the past year, Google’s Operations Research team has been working with the Lufthansa Group to optimize their tail assignment to reduce carbon emissions and the cost of operating their flights. As part of this collaboration, we developed and launched a mathematical tail assignment solver that has been fully integrated to optimize the fleet schedule for SWISS International Air Lines (a Lufthansa Group subsidiary), which we estimate will result in significant reductions in carbon emissions. This solver is the first step of a multi-phase project that started at SWISS.

A Mathematical Model for Tail Assignment
We structure the task of tail assignment optimization as a network flow problem, which is essentially a directed graph characterized by a set of nodes and a set of arcs, with additional constraints related to the problem at hand. Nodes may have either a supply or a demand for a commodity, while arcs have a flow capacity and a cost per unit of flow. The goal is to determine flows for every arc that minimize the total flow cost of each commodity, while maintaining flow balance in the network.

We decided to use a flow network because it is the most common way of modeling this problem in the literature, and the commodities, arcs, and nodes of the flow network have a simple one-to-one correspondence to tails, legs, and airports in the real-life problem. In this case, the arcs of the network correspond to each leg of the flight schedule, and each individual tail is a single instance of a commodity that “flows” along the network. Each leg and tail pair in the network has an associated assignment cost, and the model’s objective is to pick valid leg and tail pairs such that these assignment costs are minimized.

A simple example of the tail assignment problem. There are four legs in this schedule and four possible tails that one can assign to those legs. Each tail and leg pair has an associated operational cost. For example, for Leg 1, it costs $50 to assign Tail 1 to it but $100 to assign Tail 2. The optimal solution, with the minimum cost, is to assign Tail 4 to Legs 3 and 2 and Tail 1 to Legs 1 and 4.
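
To make the formulation concrete, here is a minimal sketch of the toy instance above using the OR-Tools min-cost-flow solver mentioned at the end of this post. The two costs given in the caption are kept; all other costs are invented for illustration, and real-world constraints such as turnaround times and airport connectivity are omitted:

from ortools.graph import pywrapgraph  # newer OR-Tools exposes this under ortools.graph.python

# Nodes: 0 = source, 1-4 = tails, 5-8 = legs, 9 = sink.
costs = [
    [50, 100, 90, 70],   # leg 1: cost of assigning tails 1-4 (first two from the caption)
    [80,  60, 70, 40],   # leg 2 (illustrative)
    [90,  80, 60, 30],   # leg 3 (illustrative)
    [60,  90, 80, 70],   # leg 4 (illustrative)
]

smcf = pywrapgraph.SimpleMinCostFlow()
for tail in range(1, 5):
    smcf.AddArcWithCapacityAndUnitCost(0, tail, 2, 0)  # each tail covers <= 2 legs here
    for leg in range(5, 9):
        smcf.AddArcWithCapacityAndUnitCost(tail, leg, 1, costs[leg - 5][tail - 1])
for leg in range(5, 9):
    smcf.AddArcWithCapacityAndUnitCost(leg, 9, 1, 0)   # every leg needs exactly one tail

smcf.SetNodeSupply(0, 4)   # four legs to cover
smcf.SetNodeSupply(9, -4)

if smcf.Solve() == smcf.OPTIMAL:
    print("total cost:", smcf.OptimalCost())
    for arc in range(smcf.NumArcs()):
        if 1 <= smcf.Tail(arc) <= 4 and smcf.Head(arc) >= 5 and smcf.Flow(arc) > 0:
            print(f"tail {smcf.Tail(arc)} -> leg {smcf.Head(arc) - 4}")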

Aside from the standard network flow constraints, the model takes into account additional airline-specific constraints so that the solution is tailored to Lufthansa Group airlines. For example, aircraft turnaround times — i.e., the amount of time an aircraft spends on the ground between two consecutive flights — are airline-specific and can vary for a number of reasons. Catering might be loaded at an airline’s hub, reducing the turnaround time needed at outstations, or a route could have a higher volume of vacation travelers who often take longer to board and disembark than business travelers. Another constraint is that each aircraft must be on the ground for a nightly check at a specified airport’s maintenance hub to receive mandated maintenance work or cleaning. Furthermore, each airline has their own maintenance schedule, which can require aircraft to undergo routine maintenance checks every few nights, in part to help maintain the aircraft’s fuel efficiency.

Preliminary Results & Next Steps
After using our solver to optimize their fleet schedule in Europe, SWISS estimates annual savings of over 3.5 million Swiss francs and a 6,500-ton reduction in CO2 emissions. We expect these savings to multiply when the model is rolled out to the rest of the airlines in the Lufthansa Group, and again when traffic returns to pre-COVID levels. Future work will include ensuring this model is usable with larger sets of data, and adding crew and passenger assignment to the optimization system to improve the flight schedules for both passengers and flight crew.

If you are interested in experimenting with your own network flow models, check out OR-Tools, our open source software suite that can be used to build optimization solutions similar to the solver presented in this post. Refer to OR-Tools related documentation for more information.

Acknowledgements
Thanks to Jon Orwant for collaborating extensively on this blog post and for establishing the partnership with Lufthansa and SWISS, along with Alejandra Estanislao. Thanks also to the Operations Research team and the folks at SWISS; this work would not have been possible without their hard work and contributions.

Read More

Enable Amazon SageMaker JumpStart for custom IAM execution roles

With an Amazon SageMaker Domain, you can onboard users with an AWS Identity and Access Management (IAM) execution role different from the Domain execution role. In such cases, the onboarded Domain user can’t create projects using templates and Amazon SageMaker JumpStart solutions. This post outlines an automated approach to enable JumpStart for Domain users with a custom execution role. We walk you through two different use cases for enabling JumpStart and how to solve these cases programmatically. The automated solution can help you scale your process to enable JumpStart for Domain users with custom roles, increasing the productivity of your data science team and Amazon SageMaker Studio administrators.

JumpStart is a feature within Studio that helps you quickly and easily get started with machine learning (ML). With more and more customers using ML and adopting Amazon SageMaker, JumpStart is making it easier for data science and ML teams to access and fine-tune more than 150 popular open-source models, such as natural language processing, object detection, and image classification models.

Solution overview

JumpStart requires a SageMaker Domain with project templates enabled for the account and Studio users, as shown in the following screenshot.

If enabled, this setting allows users (configured to use the Domain execution role) to create projects using templates and JumpStart solutions. In the scenario where the user’s execution role is different from the Domain execution role, JumpStart remains disabled for that user even when it’s enabled on the Domain. We address this custom role scenario and the automated solution in the following sections.

In this solution, we address the issue for the following two cases:

  • Use case 1 – Enabling JumpStart in an automated manner for existing Domain users with custom roles regardless of apps assigned
  • Use case 2 – Providing a reference script that you can use to programmatically enable JumpStart while onboarding a new Domain user with a custom role

Domain user onboarding

After you create a Domain, you can onboard users to launch apps (such as Studio, RStudio, or Canvas). You must assign a default execution role to a Domain user during the creation process, as shown in the following screenshot.

You can choose a role different from the Domain execution role for a user. However, this may disable JumpStart for such users even when it’s enabled on the Domain. This behavior is due to the fact that SageMaker makes no assumptions about a custom role and its permission boundary. The required permissions and policies have to be assigned explicitly to access templates and JumpStart solutions published by SageMaker in AWS Service Catalog.

You can enable SageMaker Projects and JumpStart manually for every user by selecting the user profile on the SageMaker Domain control panel. However, this process can be time-consuming if a user already has some apps assigned. The Edit button at the bottom right is only enabled when no apps are assigned to that user (see the following screenshot). You have to delete the assigned apps first in order to edit a user profile.

The cause of the disabled JumpStart feature is evident during Step 2 of editing a user profile, where a message states “If there are individual users using custom execution roles in your organization, you need to enable them on the user profile page.”

In the following sections, we walk you through two automated solutions that cover use cases for both existing and new Domain users.

Prerequisites

The steps described as part of this solution have the following prerequisites:

  • You have created a SageMaker Domain
  • The SageMaker Domain authentication method is IAM
  • Custom roles assigned to the SageMaker Domain users have the AmazonSageMakerFullAccess policy attached

In order for JumpStart Solutions to be enabled for users, the AWS Service Catalog portfolio Amazon SageMaker Solutions and ML Ops products must be imported into the account, and this portfolio must be associated with the role that runs SageMaker. The role association is necessary so that Studio can invoke AWS Service Catalog APIs associated with the Solutions portfolio.

As a general best practice, we recommend testing the process in a non-production environment followed by validation tests to make sure everything is configured and operating as per your expectations before making changes to the production environment.

Use case 1: Enable JumpStart for all existing Domain users with a custom role

Let’s first consider the use case for existing users and enable JumpStart for those users in an automated way.

To achieve this, we have created an AWS CloudFormation template that you can run in the same Region where the SageMaker Domain exists.

The CloudFormation stack contained in the attached jumpstart_solutions_resources.template.yaml file has the following components:

  • AmazonSageMakerServiceCatalogProductsLaunchRole and AmazonSageMakerServiceCatalogProductsUseRole – Creates these two IAM roles, if they don’t already exist.
  • 1PProductUseRolePolicy – Creates this policy used by AmazonSageMakerServiceCatalogProductsUseRole, if this role doesn’t already exist.
  • setup_solutions_tests_portfolio – An AWS Lambda function that performs the AWS Service Catalog portfolio import and role association by calling Boto3 APIs. This function is called once during CloudFormation stack creation.
  • LambdaIAMRole role – Used by the function setup_solutions_tests_portfolio for calling AWS Service Catalog and SageMaker APIs.
  • SetupPortfolioInvoker – Invokes the function setup_solutions_tests_portfolio.

After the Lambda function runs as part of the CloudFormation deployment, it retrofits all the existing SageMaker Domain users to enable JumpStart and Projects for them. For more information on creating and monitoring a CloudFormation stack, refer to How does AWS CloudFormation work.

Use case 2: Enable JumpStart for a single Domain user with a custom role

Many customers prefer to scale the Domain user onboarding process by automating it programmatically. In this section, we provide a Python script reference that you can use as part of the onboarding process to enable JumpStart for a new user with a custom role. This Python script performs the required association for the given user role. The automated process calling this script must have permission to use AWS Service Catalog and SageMaker APIs. See the following code:

import boto3

sagemaker_client = boto3.client("sagemaker")
sc_client = boto3.client("servicecatalog")

# Function to return the 'Amazon SageMaker' portfolio ID
def get_solutions_portfolio_id(sc_client):
    portfolio_shares = sc_client.list_accepted_portfolio_shares()
    for portfolio in portfolio_shares["PortfolioDetails"]:
        if portfolio["ProviderName"] == "Amazon SageMaker":
            return portfolio["Id"]

portfolio_id = get_solutions_portfolio_id(sc_client)

# Import the Solutions Service Catalog portfolio into the account
sagemaker_client.enable_sagemaker_servicecatalog_portfolio()

# Associate the custom execution role with the portfolio
sc_client.associate_principal_with_portfolio(
    PortfolioId=portfolio_id,
    PrincipalARN=custom_role_arn,  # placeholder: ARN of the custom execution role
    PrincipalType="IAM",
)

You can either call the script independently or embed it as a step within an automated process to create a user profile for onboarding to Studio. For more information on using Boto3, refer to Boto3 reference.

Clean up

After all the custom roles are enabled to use JumpStart, we can clean up the resources no longer needed. You can delete the Lambda function setup_solutions_tests_portfolio and the IAM role LambdaIAMRole created by the CloudFormation template. The other two IAM roles, AmazonSageMakerServiceCatalogProductsLaunchRole and AmazonSageMakerServiceCatalogProductsUseRole, and the associated policy 1PProductUseRolePolicy (if created) must not be deleted because they need to exist for accessing JumpStart.

Conclusion

In this post, we shared the steps to enable JumpStart for a custom role for existing users as well as new users programmatically. As always, make sure to validate the steps mentioned in this solution in a non-production environment before deploying to production.

Try it out and let us know if you have any questions in the comments section!


About the Authors

Nikhil Jha is a Senior Technical Account Manager at Amazon Web Services. His focus areas include AI/ML, and analytics. In his spare time, he enjoys playing badminton with his daughter and exploring the outdoors.

Evan Kravitz is a software engineer at Amazon Web Services, working on SageMaker JumpStart. He enjoys cooking and going on runs in New York City.

Read More

Predict residential real estate prices at ImmoScout24 with Amazon SageMaker

This is a guest post by Oliver Frost, data scientist at ImmoScout24, in partnership with Lukas Müller, AWS Solutions Architect.

In 2010, ImmoScout24 released a price index for residential real estate in Germany: the IMX. It was based on ImmoScout24 listings. Besides the price, listings typically contain a lot of specific information such as the construction year, the plot size, or the number of rooms. This information allowed us to build a so-called hedonic price index, which considers the particular features of a real estate property.

When we released the IMX, our goal was to establish it as the standard index for real estate prices in Germany. However, it struggled to capture the price increase in the German property market since the financial crisis of 2008. In addition, like a stock market index, it was an abstract figure that couldn’t be interpreted directly. The IMX was therefore difficult to grasp for non-experts.

At ImmoScout24, our mission is to make complex decisions easy, and we realized that we needed a new concept to fulfill it. Instead of another index, we decided to build a market report that everyone can easily understand: the WohnBarometer. It’s based on our listings data and takes object properties into account. The key difference from the IMX is that the WohnBarometer shows rent and sale prices in Euro per square meter for specific residential real estate types over time. The figures therefore can be directly interpreted and allow our customers to answer questions such as “Do I pay too much rent?” or “Is the apartment I am about to buy reasonably priced?” or “Which city in my region is the most promising one for investing?” Currently, the WohnBarometer is reported for Germany as a whole, the seven biggest cities, and alternating local markets.

The following graph shows an example of the WohnBarometer, with sale prices for Berlin and the development per quarter.

This post discusses how ImmoScout24 used Amazon SageMaker to create the model behind the WohnBarometer and make it relevant for our customers. It covers the underlying data model, hyperparameter tuning, and technical setup, and shows how SageMaker enabled a single data scientist to complete the WohnBarometer within 2 months. For comparison, it took a whole team 2 years to develop the first version of the IMX; such an investment was not an option for the WohnBarometer.

About ImmoScout24

ImmoScout24 is the leading online platform for residential and commercial real estate in Germany. For over 20 years, ImmoScout24 has been revolutionizing the real estate market and supports over 20 million users each month on its online marketplace or in its app to find new homes or commercial spaces. That’s why 99% of our target customer group know ImmoScout24. With its digital solutions, the online marketplace coordinates and brings owners, realtors, tenants, and buyers together successfully. ImmoScout24 is working towards the goal of digitizing the process of real estate transactions and thereby making complex decisions easy. Since 2012, ImmoScout24 has also been active in the Austrian real estate market, reaching around 3 million users monthly.

From on-premises to AWS Data Pipeline to SageMaker

In this section, we discuss the previous setup and its challenges, and why we decided to use SageMaker for our new model.

The previous setup

When the first version of the IMX was published in 2010, the cloud was still a mystery to most businesses, including ImmoScout24. The field of machine learning (ML) was in its infancy and only a handful of experts knew how to code a model (for the sake of illustration, the first public release of Scikit-Learn was in February 2010). It’s no surprise that the development of the IMX took more than 2 years and cost a seven-figure sum.

In 2015, ImmoScout24 started its AWS migration, and rebuilt IMX on AWS infrastructure. With the data in our Amazon Simple Storage Service (Amazon S3) data lake, both the data preprocessing and the model training were now done on Amazon EMR clusters orchestrated by AWS Data Pipeline. While the former was a PySpark ETL application, the latter was several Python scripts using classical ML packages (such as Scikit-Learn).

Issues with this setup

Although this setup proved quite stable, troubleshooting the infrastructure or improving the model wasn’t easy. A key problem with the model was its complexity, because some components had taken on a life of their own: in the end, the code for the outlier detection was almost twice as long as the code of the core IMX model itself.

The core model, in fact, wasn’t one model, but hundreds: one model per residential real estate type and region, with the definition varying from a single neighborhood in a big city to several villages in rural areas. We had, for example, one model for apartments for sale in the middle of Berlin and one model for houses for sale in a suburb of Munich. Because setting up the training of all these models took a lot of time, we omitted the hyperparameter tuning, which likely led to the models underperforming.

Why we decided on SageMaker

Given these issues and our ambition of having a market report with practical benefits, we had to decide between rewriting large parts of the existing code or starting from scratch. As you can infer from this post, we opted for the latter. But why SageMaker?

Most of our time spent on the IMX went into troubleshooting the infrastructure, not improving the model. For the new market report, we wanted to flip this around, with the focus on the statistical performance of the model. We also wanted to have the flexibility to quickly replace individual components of the model, such as the optimization of the hyperparameters. What if a new superior boosting algorithm comes around (think about how XGBoost hit the stage in 2014)? Of course, we want to adopt it as one of the first!

In SageMaker, the major components of the classical ML workflow—preprocessing, training, hyperparameter tuning, and inference—are neatly separated on the API level and also on the AWS Management Console. Modifying them individually isn’t difficult.

The new model

In this section, we discuss the components of the new model, including its input data, algorithm, hyperparameter tuning, and technical setup.

Input data

The WohnBarometer is based on a sliding window of 5 years of ImmoScout24 listings of residential real estate located in Germany. After we remove outliers and fraudulent listings, we’re left with approximately 4 million listings that are split into training (60%), validation (20%), and test (20%) sets. The relationship between listings and objects is not necessarily 1:1; over the course of 5 years, the same object is likely to be listed multiple times (by multiple people).
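
For illustration only, the split could be implemented along these lines; listings_df is a placeholder for a pandas DataFrame holding the cleaned listings:

from sklearn.model_selection import train_test_split

# 60% training data; the remaining 40% is split in half below
train_df, holdout_df = train_test_split(listings_df, test_size=0.4, random_state=42)
# 20% validation and 20% test
val_df, test_df = train_test_split(holdout_df, test_size=0.5, random_state=42)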

We use 13 listing attributes, such as the location of the property (WGS84 coordinates), the real estate type (house or apartment, sale or rent), its age (years), its size (square meters), or its condition (for example, new or refurbished). Given that each listing typically comes with dozens of attributes, the question arises: which to include in the model? On the one hand, we used domain knowledge; for example, it’s well known that location is a key factor, and in almost all markets new properties are more expensive than existing ones. On the other hand, we relied on our experiences with the IMX and similar models. There we learned that including dozens of attributes doesn’t significantly improve the model.

Depending on the real estate type of the listing, the target variable of our model is either the rent per square meter or the sale price per square meter (we explain later why this choice wasn’t ideal). Unlike the IMX, the WohnBarometer is therefore a number that can be directly interpreted and acted upon by our customers.

Model description

When using SageMaker, you can choose between different strategies of implementing your algorithm:

  • Use one of SageMaker’s built-in algorithms. There are almost 20 and they cover all major ML problem types.
  • Customize a pre-made Docker image based on a standard ML framework (such as Scikit-Learn or PyTorch).
  • Build your own algorithm and deploy it as a Docker image.

For the WohnBarometer, we wanted a solution that is easy to maintain and allows us to focus on improving the model itself, not the underlying infrastructure. Therefore, we decided on the first option: use a fully-managed algorithm with proper documentation and fast support if needed. Next, we needed to pick the algorithm itself. Again, the decision wasn’t difficult: we went for the XGBoost algorithm because it’s one of the most renowned ML algorithms for regression type problems, and we have already successfully used it in several projects.
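
The post doesn’t reproduce our training code, but a minimal sketch of using the built-in XGBoost algorithm via the SageMaker Python SDK looks roughly as follows. The execution role, S3 paths, container version, and hyperparameter values are assumptions, not our production configuration:

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()

# Resolve the container image of the built-in XGBoost algorithm
image_uri = image_uris.retrieve("xgboost", region=session.boto_region_name, version="1.5-1")

estimator = Estimator(
    image_uri=image_uri,
    role="<execution-role-ARN>",     # placeholder
    instance_count=2,                # two ml.m5.12xlarge machines (see Technical setup)
    instance_type="ml.m5.12xlarge",
    output_path="s3://<bucket>/wohnbarometer/model/",  # placeholder
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="reg:squarederror", num_round=500)

train_input = TrainingInput("s3://<bucket>/train/", content_type="text/csv")
validation_input = TrainingInput("s3://<bucket>/validation/", content_type="text/csv")
estimator.fit({"train": train_input, "validation": validation_input})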

Hyperparameter tuning

Most ML algorithms come with a myriad of parameters to tweak. Boosting algorithms, for example, have many parameters specifying how exactly the trees are built: Do the trees have at maximum 20 or 30 leaves? Is each tree based on all rows and columns or only samples? How heavily to prune the trees? Finding the optimal values of those parameters (as measured by an evaluation metric of your choice), the so-called hyperparameter tuning, is critical to building a powerful ML model.

A key question in hyperparameter tuning is which parameters to tune and how to set the search ranges. You might ask, why not check all possible combinations? Although in theory this sounds like a good idea, it would result in an enormous hyperparameter space with way too many points to evaluate them all at a reasonable price. That is why ML practitioners typically select a small number of hyperparameters known to have a strong impact on the performance of the chosen algorithm.

After the hyperparameter space is defined, the next task is to find the best combination of values in it. The following techniques are commonly employed:

  • Grid search – Divide the space in a discrete grid and then evaluate all points in the grid with cross-validation.
  • Random search – Randomly draw combinations from the space. With this approach, you’ll most likely miss the best combination, but it serves as a good benchmark.
  • Bayesian optimization – Build a probabilistic model of the objective function and use this model to generate new combinations. The model is updated after each combination, leading quickly to good results.

In recent years, thanks to cheap compute power, Bayesian optimization has become the gold standard in hyperparameter tuning, and is the default setting in SageMaker.
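
In the SageMaker Python SDK, such a tuning job can be expressed with a HyperparameterTuner. The sketch below reuses the estimator and inputs from the previous sketch; the ranges shown are illustrative, not the 11 hyperparameters we actually tune:

from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

tuner = HyperparameterTuner(
    estimator=estimator,                       # XGBoost estimator from the previous sketch
    objective_metric_name="validation:rmse",
    objective_type="Minimize",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),     # learning rate
        "num_round": IntegerParameter(100, 1000),  # number of boosting rounds
        "max_depth": IntegerParameter(3, 10),
    },
    strategy="Bayesian",   # SageMaker's default search strategy
    max_jobs=30,
    max_parallel_jobs=2,
)
tuner.fit({"train": train_input, "validation": validation_input})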

Technical setup

As with many other AWS services, you can create SageMaker jobs on the console, with the AWS Command Line Interface (AWS CLI), or via code. We chose the third option, the SageMaker Python SDK to be precise, because it allows for a highly automated setup: the WohnBarometer lives in a Python software project that is command-line executable. For example, all steps of the ML pipeline such as the preprocessing or the model training can be triggered via Bash commands. Those Bash commands, in turn, are orchestrated with a Jenkins pipeline powered by AWS Fargate.

Let’s look at the steps and the underlying infrastructure:

  • Preprocessing – The preprocessing is done with SageMaker’s built-in Scikit-Learn container (a sketch of this setup follows the list). Because it involves joining data frames with millions of rows, we need an ml.m5.24xlarge machine here, the largest you can get in the ml.m family. Alternatively, we could have used multiple smaller machines with a distributed framework like Dask, but we wanted to keep it as simple as possible.
  • Training – We use the default SageMaker XGBoost algorithm. The training is done with two ml.m5.12xlarge machines. It’s worth mentioning that our train.py containing the code of the model training and the hyperparameter tuning has less than 100 rows.
  • Hyperparameter tuning – Following the principle of less is more, we only tune 11 hyperparameters (for example, the number of boosting rounds and the learning rate), which gives us time to carefully choose their ranges and inspect how they interact with each other. With only a few hyperparameters, each training job runs relatively fast; in our case, the jobs take between 10 and 20 minutes. With a maximum of 30 training jobs and 2 concurrent jobs, the total training time is around 3 hours.
  • Inference – SageMaker offers multiple options to serve your model. We use batch transform jobs because we only need the WohnBarometer numbers once a quarter. We didn’t use an endpoint because it would be idle most of the time. Each batch job (approximately 6.8 million rows) is served by a single ml.m5.4xlarge machine in less than 10 minutes.
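
To make the preprocessing and inference steps concrete, here is a minimal sketch using the SageMaker Python SDK. The script name, model name, framework version, and S3 paths are placeholders:

from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.transformer import Transformer

# Preprocessing on a single large instance
processor = SKLearnProcessor(
    framework_version="0.23-1",      # assumption
    role="<execution-role-ARN>",
    instance_type="ml.m5.24xlarge",
    instance_count=1,
)
processor.run(
    code="preprocess.py",            # hypothetical script
    inputs=[ProcessingInput(source="s3://<bucket>/raw/", destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output", destination="s3://<bucket>/prepared/")],
)

# Quarterly batch scoring with a transform job instead of a persistent endpoint
transformer = Transformer(
    model_name="<model-name>",       # model created from the training job
    instance_count=1,
    instance_type="ml.m5.4xlarge",
    output_path="s3://<bucket>/predictions/",
)
transformer.transform(data="s3://<bucket>/inference-input/", content_type="text/csv", split_type="Line")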

We can easily debug these steps on the SageMaker console. If, for example, a training job is taking longer than expected, we navigate to the Training page, locate the training job in question, and review Amazon CloudWatch metrics of the underlying machines.

The following architecture diagram shows the infrastructure of the WohnBarometer:

Challenges and learnings

In the beginning everything went smoothly: within a few days we set up the software project and trained a miniature version of our model in SageMaker. We had high hopes for the first run on the full dataset and the hyperparameter tuning in place. Unfortunately, the results weren’t satisfying. We had the following key issues:

  • The predictions of the model were too low, both for rent and sale objects. For Berlin, for example, the sale prices predicted for our reference objects were roughly 50% below the market prices.
  • According to the model, there was no significant price difference between new and existing buildings. The truth is that new buildings are almost always significantly more expensive than existing buildings.
  • The effect of the location on the price wasn’t captured correctly. We know, for example, that apartments for sale in Frankfurt am Main, are, on average, more expensive than in Berlin (although Berlin is catching up); our model, however, predicted it the other way round.

What was the problem and how did we solve it?

Sampling of the features

At first glance, it looks like the issues aren’t related, but indeed they are. By default, XGBoost builds each tree with a random sample of the features. Let’s say a model has 10 features F1, F2, … F10, then the algorithm might use F1, F4, and F7 for one tree, and F3, F4, and F8 for another. While in general this behavior effectively prevents overfitting, it can be problematic if the number of features is small and some of them have a big effect on the target variable. In this case, many trees will miss the crucial features.

XGBoost’s sampling of our 13 features led to many trees including none of the crucial features (real estate type, location, and new or existing building), which caused these issues. Luckily, there is a parameter to control the sampling: colsample_bytree (in fact, there are two more parameters to control the sampling, but we didn’t touch them). When we checked our code, we saw that colsample_bytree was set to 0.5, a value we carried over from past projects. As soon as we set it to the default value of 1, the preceding issues were gone.
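
In terms of code, the fix is a one-line change on the estimator from the earlier sketch; the objective is repeated only for completeness:

estimator.set_hyperparameters(
    objective="reg:squarederror",
    colsample_bytree=1.0,  # back to the XGBoost default: every tree sees all 13 features
)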

One model vs. multiple models

Unlike the IMX, the WohnBarometer model really is only one model. Although this minimizes the maintenance effort, it’s not ideal from a statistical point of view. Because our training data contains both sale and rent objects, the spread in the target variable is huge: it ranges from below 5 Euro for some rent apartments to well above 10,000 Euro for houses for sale in first-class locations. The big challenge for the model is to understand that an error of 5 Euro is fantastic for sale objects, but disastrous for rent objects.

In hindsight, knowing how easy it is to maintain multiple models in SageMaker, we would have built at least two models: one for rent and one for sale objects. This would make it easier to capture the peculiarities of both markets. For example, the price of unrented apartments for sale is typically 20–30% higher than for rented apartments for sale. Therefore, encoding this information as a dummy variable in the sale model makes a lot of sense; for the rent model on the other hand, you could leave it out.

Conclusion

Did the WohnBarometer meet the goal of being relevant to our customers? Taking media coverage as an indication, the answer is a clear yes: as of November 2021, more than 700 newspaper articles and TV or radio reports on the WohnBarometer have been published. The list includes national newspapers such as Frankfurter Allgemeine Zeitung, Tagesspiegel, and Handelsblatt, and local newspapers that often ask for WohnBarometer figures for their region. Because we calculate the figures for all regions of Germany anyway, we’re happy to take such requests. With the old IMX, this level of granularity wasn’t possible.

The WohnBarometer also outperforms the IMX with regard to statistical performance, and especially with regard to cost: the IMX was generated by an EMR cluster with 10 task nodes running for almost half a day, whereas all WohnBarometer steps take less than 5 hours using medium-sized machines. This results in cost savings of almost 75%.

Thanks to SageMaker, we were able to bring a complex ML model into production with one data scientist in less than 2 months. This is remarkable: 10 years earlier, when ImmoScout24 built the IMX, reaching the same milestone took more than 2 years and involved a whole team.

How could we be so efficient? SageMaker allowed us to focus on the model instead of the infrastructure, and SageMaker promotes a microservice architecture that is easy to maintain. If we got stuck with something, we could call on AWS Support. In the past, when one of our IMX data pipelines failed, we would sometimes spend days debugging it. Since we started publishing WohnBarometer figures in April 2021, the SageMaker infrastructure hasn’t failed a single time.

To learn more about the WohnBarometer, check out WohnBarometer and WohnBarometer: Angebotsmieten stiegen 2021 bundesweit wieder stärker an. To learn more about using the SageMaker Scikit-Learn library for preprocessing, see Preprocess input data before making predictions using Amazon SageMaker inference pipelines and Scikit-learn. Please send us feedback, either on the AWS forum for Amazon SageMaker, or through your AWS support contacts.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.


About the Authors

Oliver Frost joined ImmoScout24 in 2017 as a business analyst. Two years later, he became a data scientist in a team whose job it is to turn ImmoScout24 data into veritable data products. Before building the WohnBarometer model, he ran smaller SageMaker projects. Oliver holds several AWS certificates, including the Machine Learning Specialty.

Lukas Müller is a Solutions Architect at AWS. He works with customers in the sports, media, and entertainment industries. He is always looking for ways to combine technical enablement with cultural and organizational enablement to help customers achieve business value with cloud technologies.

Read More

Transforming qualitative research by automating speech to text-to-text analytics

This post is authored by Satish Jha, Intelligent Automation Manager, Matt Docherty, Data Science Manager, Jayesh Muley, Associate Consultant and Tapan Vora, Rapid Prototyping, from ZS Associates.

At ZS Associates, we do a significant amount of qualitative market research. The work involves interviewing relevant subjects (such as healthcare professionals and sales representatives) and developing bespoke analytics on the interview data. We’ve taken advantage of the advances in AI, machine learning (ML), and cloud computing to reimagine qualitative market research, developing a scalable solution that performs speech-to-text conversion and natural language processing (NLP) on the audio recordings of interviewed subjects. The solution is better, cheaper, and faster than the previous way of working (manual interpretation), giving us a competitive advantage in this space.

This post discusses how ZS used Amazon Transcribe, Amazon Comprehend Medical, and custom NLP for text summarization and graph visualization to create a scalable, automated solution that helps us provide insights in a faster, better, and more efficient way.

Background assessment

The traditional method of performing qualitative market research requires human intervention and interpretation, which is highly subjective in nature. We used advanced AI and ML to develop a platform that is capable of the following:

  • Performing high-precision speech-to-text conversion of audio recordings from interviews conducted for qualitative market research
  • Drawing analytical insights from the converted text using a state-of-the-art NLP model

To achieve this, we combined state-of-the-art AWS AI services and cloud computing capabilities with our proprietary NLP and text summarization algorithms to drive impact at scale.

Solution overview

To build our solution, we adopted the methodology of starting small, highlighting value, and scaling fast. We identified a key user group and defined phase one of the solution to do automated speech-to-text and analytics. We defined a key user interface and developed the technology architecture for the solution. Because ZS is an AWS Partner and has already been using multiple AWS Cloud services for our enterprise products and solutions, AWS was the preferred choice for this project. We used Amazon Transcribe and Amazon Comprehend Medical for transcription and theme identification purposes. For hosting custom NLP analytics APIs, we used a serverless infrastructure using Amazon API Gateway, AWS Lambda, and Amazon Elastic Container Service (Amazon ECS) with AWS Fargate. These services are HIPAA-eligible and compliant with pharma regulatory requirements.

The process includes the following stages:

  • File upload to Amazon S3 – The process starts when the user uploads one or more audio recording files for transcription to the site on which our tool is hosted. To upload the files to Amazon Simple Storage Service (Amazon S3), the user is provided with a temporary token or pre-signed URL through API Gateway, which grants Amazon S3 access (see the sketch after this list).
  • Audio transcription – Depending on the type of file uploaded, different triggers are in place to initiate the appropriate workflow:

    • Audio files uploaded without a dictionary file – If the user didn’t provide a dictionary file, the tool processes the audio file using Amazon Transcribe.
    • Audio files uploaded with a dictionary file – If the user provided a dictionary file, certain AWS Step Functions steps are triggered, followed by processing the dictionary file using Amazon Transcribe. When the dictionary processing is complete, the tool transcribes the audio file using Amazon Transcribe.
  • Transcript file generation – In either of the preceding two cases, when the transcription is in progress, the tool uses Amazon CloudWatch Events to update the transcription status. Lambda functions trigger the tool to update the status on the RDBMS and convey the status to the user through the tool’s UI using sockets. When the transcription is complete, the final output file is stored in Amazon S3.
  • File type conversion – After the output file is generated, the tool uses triggers to create a .doc or .xlsx file, stored again in Amazon S3.
  • Generating analytical insights – With Amazon Comprehend Medical and certain ZS in-house NLP tools, the tool generates analytics based on the transcribed data and updates dashboards on our site to access them in real time.
  • Audio streaming with Amazon Transcribe – We use Amazon CloudFront audio streaming paired with our final output file, which is generated from Amazon Transcribe. The user can simultaneously listen to the recording and read the transcript.
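
For illustration, the upload and transcription steps might look like the following boto3 sketch. Bucket names, object keys, the job name, and the vocabulary name are placeholders, not our actual configuration:

import boto3

s3 = boto3.client("s3")
transcribe = boto3.client("transcribe")

# Pre-signed URL the browser can use to PUT the recording into Amazon S3
upload_url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "<upload-bucket>", "Key": "recordings/interview-001.mp3"},
    ExpiresIn=3600,
)

# Start the transcription job, optionally constrained by a custom vocabulary
transcribe.start_transcription_job(
    TranscriptionJobName="interview-001",
    Media={"MediaFileUri": "s3://<upload-bucket>/recordings/interview-001.mp3"},
    MediaFormat="mp3",
    LanguageCode="en-US",
    Settings={"VocabularyName": "<custom-vocabulary>"},  # omit if no dictionary was provided
    OutputBucketName="<output-bucket>",
)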

The following diagram shows the high-level architecture and workflow.

The platform is designed to process a large number of files in real time. Therefore, the solution greatly augments the work of our current ZS qualitative research team by making the process more efficient and giving it an entirely new dimension!

Overall, our solution has the following features:

  • The ability to upload single or multiple audio files
  • Automated speech-to-text conversion, with the ability to add a custom dictionary
  • The ability to listen to the uploaded audio and refine text
  • Text summarization and analytics

Process map

The following diagram gives a high-level visualization of our developed solution, with the following stages:

  • Upload audio – The process starts with the user uploading their audio recording (with or without a dictionary file) to the tool
  • Speech to text – These uploaded audio files are transcribed by converting speech to text
  • Listen and refine – The user can simultaneously listen to the recording and read the transcript and make changes wherever necessary
  • Speech-to-text output – The consolidated file includes the converted transcript and its corresponding analytics

It took us approximately 5–6 months to develop this solution end to end with a four-member team. Today it is being used by over 300 people, and the tool has processed thousands of hours of audio.

AWS services used

The solution uses multiple AWS services:

  • AWS Lambda and API Gateway – Hosted the serverless APIs and functions.

    • We developed multiple API Gateways to ensure loose coupling and easy integration with external APIs. Custom authorizers were implemented to enable token-based authentication and restrict unauthorized access to the web content.
    • We also built Lambda APIs (using Python and Node.js) that interact with the website hosted on ECS containers and integrate easily with Amazon Relational Database Service (Amazon RDS) for PostgreSQL. Using Lambda functions helped us avoid the effort of load balancing and of starting and stopping clusters, and reduced overall costs, because compute only runs while the functions run. Additionally, the serverless architecture allowed us to scale the solution easily.
  • Amazon Transcribe – Provided us options to easily configure the batch processing of audio files up to 100 at a time and even scale a larger load using its built-in queuing mechanism. It also allowed us to load a custom dictionary to transcribe the audio data more accurately.
  • Amazon Comprehend Medical – Generated analytical insights from the text data using its built-in NLP capabilities to sort through text for valuable information.
  • AWS CloudFormation – We used AWS CloudFormation to deploy the Lambda functions and APIs across environments (various S3 buckets and multiple environments in the same bucket, such as production and development) using stage variables.
  • AWS CodeBuild, AWS CodeDeploy, and AWS CodePipeline – We used AWS CodeBuild, AWS CodeDeploy, and AWS CodePipeline to perform continuous deployment of the front end and analytics backend to ECS clusters.

The following diagram illustrates the architecture of these services.

Conclusion

We used AWS services to develop a platform that helped our teams apply cutting-edge AI to their projects. It has helped our teams do the following:

  • Automate the process of speech-to-text conversion, so human review can focus only on low-accuracy portions.
  • Drive automation of insights with NLP algorithms.
  • Drive self-service. Because we do not need to launch any particular server, we can easily create Lambda functions, make changes to the code on the fly, and provide key ML services as plug and play so that users don’t need to be data scientists.

Today the solution is used by over 300 people, and we have processed thousands of hours of audio. We’re now integrating our solution with other applications to provide users with the flexibility to either upload audio files for transcription or directly upload transcribed files for drawing analytical insights.

We also derived multiple benefits from building our platform with AWS:

  • Using an end-to-end cloud-based architecture proved beneficial in terms of managing environments for business applications
  • With management tools such as CloudWatch, AWS CloudFormation, CodeBuild, CodeDeploy, and CodePipeline, it was easier to monitor, track, and deploy development changes
  • We used AWS’s built-in security with virtual private clouds and identity management with customized policies
  • We were able to reduce load on valuable microservices, with the additional benefit of quick hosting and deployment

About ZS

ZS Associates is a consulting and professional services firm focusing on consulting, software, and technology, headquartered in Evanston, Illinois, that provides services for clients in pharma, healthcare, and technology. The firm employs more than 10,000 employees in 30 offices in North America, South America, Europe, and Asia. ZS works with 49 of the 50 largest drug-makers and 17 of the 20 largest medical device makers and serves consumer products, financial services, industrial products, telecommunications, transportation, and logistics industries.

Disclaimer: AWS is not responsible for the content or accuracy of this post. The content and opinions in this post are solely those of the third-party author. It is each customer’s responsibility to determine whether they are subject to HIPAA, and if so, how best to comply with HIPAA and its implementing regulations. Before using AWS in connection with protected health information, customers must enter an AWS Business Associate Addendum (BAA) and follow its configuration requirements.


About the Authors

Satish Jha is a Manager with ZS Associates. He is a leader in the firm’s Intelligent Automation Practice, where he works side by side with several pharma clients to transform operations and drive impact.

Matt Docherty is a Data Science Manager with ZS Associates in the Philadelphia office. He is focused on applying data science in the pharmaceutical industry.

Jayesh Muley is an Associate Consultant for Process Excellence & Transformation with ZS Associates. He has 4 years of experience advising pharma clients in the forecasting, process excellence, and digital transformation spaces. He played a critical role in establishing ZS’s automation center of excellence. He is always keen on learning new technologies and is always evolving in his role.

Tapan Vora is a Manager for Rapid Prototyping with ZS Associates. Tapan has over 14 years of technology and engineering management experience. He plays multiple roles in the team, such as business analyst, people manager, solution designer, data analyst, and product leader.

Read More

Predicting the past with Ithaca

The birth of human writing marked the dawn of History and is crucial to our understanding of past civilisations and the world we live in today. For example, more than 2,500 years ago, the Greeks began writing on stone, pottery, and metal to document everything from leases and laws to calendars and oracles, giving a detailed insight into the Mediterranean region. Unfortunately, it’s an incomplete record. Many of the surviving inscriptions have been damaged over the centuries or moved from their original location. In addition, modern dating techniques, such as radiocarbon dating, cannot be used on these materials, making inscriptions difficult and time-consuming to interpret.

Read More