Process larger and wider datasets with Amazon SageMaker Data Wrangler
Amazon SageMaker Data Wrangler reduces the time to aggregate and prepare data for machine learning (ML) from weeks to minutes in Amazon SageMaker Studio. Data Wrangler can simplify your data preparation and feature engineering processes and help you with data selection, cleaning, exploration, and visualization. Data Wrangler has over 300 built-in transforms written in PySpark, so you can process datasets up to hundreds of gigabytes efficiently on the default instance, ml.m5.4xlarge.
However, when you work with datasets up to terabytes of data using built-in transforms, you might experience longer processing times or potential out-of-memory errors. Based on your data requirements, you can now use additional Amazon Elastic Compute Cloud (Amazon EC2) M5 and R5 instances. For example, you can start with the default instance (ml.m5.4xlarge) and then switch to ml.m5.24xlarge or ml.r5.24xlarge. You can pick different instance types and find the best trade-off between running cost and processing time. The next time you're working on time series transformations or running heavy transforms to balance your data, you can right-size your Data Wrangler instance to run these processes faster.
When processing tens of gigabytes or even more with a custom Pandas transform, you might experience out-of-memory errors. You can switch from the default instance (ml.m5.4xlarge) to ml.m5.24xlarge, and the transform will finish without any errors. We thoroughly benchmarked and observed linear speedup as we increased instance size across a portfolio of datasets.
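For reference, the following is a minimal sketch of the kind of Pandas aggregation used as a custom transform in tests like these (the column names are illustrative placeholders, not the benchmark schema). In a Data Wrangler custom Pandas transform, the input dataset is exposed as a DataFrame named df, and the result is read back from df:

```python
# Hypothetical Data Wrangler custom transform (Python/Pandas).
# Data Wrangler provides the input dataset as the DataFrame `df`;
# column names here are placeholders, not from the benchmark dataset.
df = (
    df.groupby("customer_id", as_index=False)
      .agg(total_amount=("amount", "sum"), transaction_count=("amount", "count"))
)
```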
In this post, we share our findings from two benchmark tests to demonstrate how you can process larger and wider datasets with Data Wrangler.
Data Wrangler benchmark tests
Let’s review two tests we ran, aggregation queries and one-hot encoding, with different instance types using PySpark built-in transforms and custom Pandas transforms. Transformations that don’t require aggregation finish quickly and work well with the default instance type, so we focused on aggregation queries and transformations with aggregation. We stored our test dataset on Amazon Simple Storage Service (Amazon S3). This dataset’s expanded size is around 100 GB, with 80 million rows and 300 columns. We used UI metrics to time the benchmark tests and measure end-to-end customer-facing latency. When importing our test dataset, we disabled sampling. Sampling is enabled by default, and Data Wrangler only processes the first 100 rows when enabled.
As we increased the Data Wrangler instance size, we observed a roughly linear speedup for Data Wrangler built-in transforms and custom Spark SQL. The Pandas aggregation query tests only finished on ml.m5.16xl or larger instances, because Pandas needed 180 GB of memory to process the aggregation queries for this dataset.
The following table summarizes the aggregation query test results.
| Instance | vCPU | Memory (GiB) | Data Wrangler built-in Spark transform time | Pandas time (custom transform) |
| --- | --- | --- | --- | --- |
| ml.m5.4xl | 16 | 64 | 229 seconds | Out of memory |
| ml.m5.8xl | 32 | 128 | 130 seconds | Out of memory |
| ml.m5.16xl | 64 | 256 | 52 seconds | 30 minutes |
The following table summarizes the one-hot encoding test results.
| Instance | vCPU | Memory (GiB) | Data Wrangler built-in Spark transform time | Pandas time (custom transform) |
| --- | --- | --- | --- | --- |
| ml.m5.4xl | 16 | 64 | 228 seconds | Out of memory |
| ml.m5.8xl | 32 | 128 | 130 seconds | Out of memory |
| ml.m5.16xl | 64 | 256 | 52 seconds | Out of memory |
Switch the instance type of a data flow
To switch the instance type of your flow, complete the following steps:
- On the Amazon SageMaker Data Wrangler console, navigate to the data flow that you’re currently using.
- Choose the instance type on the navigation bar.
- Select the instance type that you want to use.
- Choose Save.
A progress message appears.
When the switch is complete, a success message appears.
Data Wrangler uses the selected instance type for data analysis and data transformations. Both the default instance and the instance you switched to (in this example, ml.m5.16xlarge) keep running. You can change the instance type or switch back to the default instance before running a specific transformation.
Shut down unused instances
You are charged for all running instances. To avoid incurring additional charges, manually shut down the instances that you aren’t using. To shut down a running instance, complete the following steps:
- On your data flow page, choose the instance icon in the left pane of the UI under Running instances.
- Choose Shut down.
If you shut down an instance used to run a flow, you temporarily can’t access that flow. If you get an error when opening a flow whose instance you previously shut down, wait approximately 5 minutes and try opening it again.
Conclusion
In this post, we demonstrated how to process larger and wider datasets with Data Wrangler by switching instances to larger M5 or R5 instance types. M5 instances offer a balance of compute, memory, and networking resources. R5 instances are memory-optimized instances. Both M5 and R5 provide instance types to optimize cost and performance for your workloads.
To learn more about using data flows with Data Wrangler, refer to Create and Use a Data Wrangler Flow and Amazon SageMaker Pricing. To get started with Data Wrangler, see Prepare ML Data with Amazon SageMaker Data Wrangler.
About the Authors
Haider Naqvi is a Solutions Architect at AWS. He has extensive software development and enterprise architecture experience. He focuses on enabling customers to achieve business outcomes with AWS. He is based out of New York.
Huong Nguyen is a Sr. Product Manager at AWS. She is leading the data ecosystem integration for SageMaker, with 14 years of experience building customer-centric and data-driven products for both enterprise and consumer spaces.
Meenakshisundaram Thandavarayan is a Senior AI/ML specialist with AWS. He helps hi-tech strategic accounts on their AI and ML journey. He is very passionate about data-driven AI.
Sriharsha M Sr is an AI/ML Specialist Solutions Architect in the Strategic Specialist team at Amazon Web Services. He works with strategic AWS customers who are taking advantage of AI/ML to solve complex business problems. He provides technical guidance and design advice to implement AI/ML applications at scale. His expertise spans application architecture, big data, analytics, and machine learning.
Nikita Ivkin is an Applied Scientist, Amazon SageMaker Data Wrangler.
Fine-tune transformer language models for linguistic diversity with Hugging Face on Amazon SageMaker
Approximately 7,000 languages are in use today. Despite attempts in the late 19th century to invent constructed languages such as Volapük or Esperanto, there is no sign of unification. People still choose to create new languages (think about your favorite movie character who speaks Klingon, Dothraki, or Elvish).
Today, natural language processing (NLP) examples are dominated by the English language, the native language for only 5% of the human population and spoken only by 17%.
The digital divide is defined as the gap between those who can access digital technologies and those who can’t. Lack of access to knowledge or education due to language barriers also contributes to the digital divide, not only between people who don’t speak English, but also for the English-speaking people who don’t have access to non-English content, which reduces diversity of thought and knowledge. There is so much to learn mutually.
In this post, we summarize the challenges of low-resource languages and experiment with different solution approaches covering over 100 languages using Hugging Face transformers on Amazon SageMaker.
We fine-tune various pre-trained transformer-based language models for a question answering task. We use Turkish in our example, but you could apply this approach to other supported languages. Our focus is on BERT [1] variants, because a great feature of BERT is its unified architecture across different tasks.
We demonstrate several benefits of using Hugging Face transformers on Amazon SageMaker, such as training and experimentation at scale, and increased productivity and cost-efficiency.
Overview of NLP
There have been several major developments in NLP since 2017. The emergence of deep learning architectures such as transformers [2], the unsupervised learning techniques to train such models on extremely large datasets, and transfer learning have significantly improved the state-of-the-art in natural language understanding. The arrival of pre-trained model hubs has further democratized access to the collective knowledge of the NLP community, removing the need to start from scratch.
A language model is an NLP model that learns to predict the next word (or any masked word) in a sequence. The genuine beauty of language models as a starting point is three-fold. First, research has shown that language models trained on a large text corpus learn more complex meanings of words than previous methods. For instance, to be able to predict the next word in a sentence, the language model has to be good at understanding the context, the semantics, and also the grammar. Second, training a language model doesn’t require labeled data (which is scarce and expensive) during pre-training. This is important because an enormous amount of unlabeled text data is publicly available on the web in many languages. Third, it has been demonstrated that once the language model is smart enough to predict the next word for any given sentence, it’s relatively easy to perform other NLP tasks such as sentiment analysis or question answering with very little labeled data, because fine-tuning reuses representations from a pre-trained language model [3].
Fully managed NLP services have also accelerated the adoption of NLP. Amazon Comprehend is a fully managed service that enables text analytics to extract insights from the content of documents, and it supports a variety of languages. Amazon Comprehend supports custom classification and custom entity recognition and enables you to build custom NLP models that are specific to your requirements, without the need for any ML expertise.
Challenges and solutions for low-resource languages
The main challenge for a large number of languages is that they have relatively less data available for training. These are called low-resource languages. The m-BERT paper [4] and XLM-R paper [7] refer to Urdu and Swahili as low-resource languages.
The following figure shows the ISO codes of over 80 languages and the difference in size (in log scale) between the two major pre-training corpora [7]. In Wikipedia (orange), there are only 18 languages with over 1 million articles and 52 languages with over 100,000 articles, but 164 languages with only 1–10,000 articles [9]. The CommonCrawl corpus (blue) increases the amount of data for low-resource languages by two orders of magnitude. Nevertheless, they are still relatively small compared to high-resource languages such as English, Russian, or German.
In terms of the number of Wikipedia articles, Turkish ranks 28th, in the same group of languages with over 100,000 articles, together with Urdu (54th). Compared with Urdu, Turkish would be regarded as a mid-resource language. Turkish has some interesting characteristics that pose challenges in linguistics and tokenization, and could therefore make language models more powerful. It’s an agglutinative language with a very free word order, a complex morphology, and tenses without English equivalents. Phrases formed of several words in languages like English can be expressed with a single word form, as shown in the following example.
| Turkish | English |
| --- | --- |
| Kedi | Cat |
| Kediler | Cats |
| Kedigiller | Family of cats |
| Kedigillerden | Belonging to the family of cats |
| Kedileştirebileceklerimizdenmişçesineyken | As if it were one of those that we could turn into a cat |
Two main solution approaches are language-specific models or multilingual models (with or without cross-language supervision):
- Monolingual language models – The first approach is to apply a BERT variant to a specific target language. The more training data, the better the model performance.
- Multilingual masked language models – The other approach is to pre-train large transformer models on many languages. Multilingual language modeling aims to solve the lack of data challenge for low-resource languages by pre-training on a large number of languages so that NLP tasks learned from one language can be transferred to other languages. Multilingual masked language models (MLMs) have pushed the state-of-the-art on cross-lingual understanding tasks. Two examples are:
- Multilingual BERT – The multilingual BERT model was trained in 104 different languages using the Wikipedia corpus. However, it has been shown that it only generalizes well across similar linguistic structures and typological features (for example, languages with similar word order). Its multilinguality is diminished especially for languages with different word orders (for example, subject/object/verb) [4].
- XLM-R – Cross-lingual language models (XLMs) are trained with a cross-lingual objective using parallel datasets (the same text in two different languages) or without a cross-lingual objective using monolingual datasets [6]. Research shows that low-resource languages benefit from scaling to more languages. XLM-RoBERTa is a transformer-based model inspired by RoBERTa [5], and its starting point is the proposition that multilingual BERT and XLM are under-tuned. It’s trained on 100 languages using both the Wikipedia and CommonCrawl corpus, so the amount of training data for low-resource languages is approximately two orders of magnitude larger compared to m-BERT [7].
Another challenge of multilingual language models for low-resource languages is vocabulary size and tokenization. Because all languages use the same shared vocabulary in multilingual language models, there is a trade-off between increasing the vocabulary size (which increases the compute requirements) and decreasing it (words not present in the vocabulary would be marked as unknown, and using characters instead of words as tokens would ignore any structure). The word-piece tokenization algorithm combines the benefits of both approaches. For instance, it effectively handles out-of-vocabulary words by splitting a word into subwords until it is present in the vocabulary or until the individual character is reached. Character-based tokenization isn’t very useful except for certain languages, such as Chinese. Techniques exist to address challenges for low-resource languages, such as sampling with certain distributions [6].
The following table depicts how three different tokenizers behave for the word “kedileri” (meaning “its cats”). For certain languages and NLP tasks, this would make a difference. For instance, for the question answering task, the model returns the span of the start token index and end token index; returning “kediler” (“cats”) or “kedileri” (“its cats”) would lose some context and lead to different evaluation results for certain metrics.
| Pretrained Model | Vocabulary size | | Tokenization for “Kedileri”* | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- |
| dbmdz/bert-base-turkish-uncased | 32,000 | Tokens | [CLS] | kediler | ##i | [SEP] | |
| | | Input IDs | 2 | 23714 | 1023 | 3 | |
| bert-base-multilingual-uncased | 105,879 | Tokens | [CLS] | ked | ##iler | ##i | [SEP] |
| | | Input IDs | 101 | 30210 | 33719 | 10116 | 102 |
| deepset/xlm-roberta-base-squad2 | 250,002 | Tokens | <s> | ▁Ke | di | leri | </s> |
| | | Input IDs | 0 | 1345 | 428 | 1341 | |

*In English: (its) cats
Therefore, although low-resource languages benefit from multilingual language models, performing tokenization across a shared vocabulary may ignore some linguistic features for certain languages.
In the next section, we compare three approaches by fine-tuning them for a question answering task using a QA dataset for Turkish: BERTurk [8], multilingual BERT [4], and XLM-R [7].
Solution overview
Our workflow is as follows:
- Prepare the dataset in an Amazon SageMaker Studio notebook environment and upload it to Amazon Simple Storage Service (Amazon S3).
- Launch parallel training jobs on SageMaker training deep learning containers by providing the fine-tuning script.
- Collect metadata from each experiment.
- Compare results and identify the most appropriate model.
The following diagram illustrates the solution architecture.
For more information on Studio notebooks, refer to Dive deep into Amazon SageMaker Studio Notebooks architecture. For more information on how Hugging Face is integrated with SageMaker, refer to AWS and Hugging Face collaborate to simplify and accelerate adoption of Natural Language Processing models.
Prepare the dataset
The Hugging Face Datasets library provides powerful data processing methods to quickly get a dataset ready for training a deep learning model. The following code loads the Turkish QA dataset and explores what’s inside:
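The exact loading code isn’t reproduced here; the following is a minimal sketch assuming the Turkish QA data has been flattened to one question-answer record per line in a local JSON Lines file (the file name is a placeholder):

```python
from datasets import load_dataset

# Placeholder file name; assumes one question/answer record per JSON line.
dataset = load_dataset("json", data_files="turkish_qa_train.jsonl")
print(dataset)           # shows the splits and number of rows
print(dataset["train"][0])
```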
There are about 9,000 samples.
The input dataset is slightly transformed into the format expected by the pre-trained models and contains the following columns: id, title, context, question, and answers (answer_start and answer_text).
The English translation of the output is as follows:
- context – Resit Emre Kongar (b. 13 October 1941, Istanbul), Turkish sociologist, professor.
- question – What is the academic title of Emre Kongar?
- answer – Professor
Fine-tuning script
The Hugging Face Transformers library provides example code to fine-tune a model for a question answering task, called run_qa.py. The following code initializes the trainer:
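The initialization isn’t reproduced verbatim here; run_qa.py uses a Trainer subclass specialized for question answering, and the following is a simplified sketch of the pattern (the model, tokenizer, and datasets come from the steps described below):

```python
from transformers import Trainer, TrainingArguments, default_data_collator

# Simplified sketch; the actual script wires in QA-specific post-processing
# and evaluation. model, tokenizer, and the datasets are created earlier.
training_args = TrainingArguments(output_dir="./output", num_train_epochs=2)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    data_collator=default_data_collator,
    tokenizer=tokenizer,
)
```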
Let’s review the building blocks on a high level.
Tokenizer
The script loads a tokenizer using the AutoTokenizer class, which takes care of returning the correct tokenizer that corresponds to the model:
The following is an example of how the tokenizer works:
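A minimal sketch (the exact tokens and IDs depend on the checkpoint’s vocabulary; the checkpoint name is one of the models compared in this post):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-uncased")

encoded = tokenizer("Kedileri")
# Token IDs and their corresponding word pieces, e.g. [CLS], kediler, ##i, [SEP]
print(encoded["input_ids"])
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```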
Model
The script loads a model. AutoModel classes (for example, AutoModelForQuestionAnswering) directly create a class with the weights, configuration, and vocabulary of the relevant architecture, given the name and path of the pre-trained model. Thanks to the abstraction by Hugging Face, you can easily switch to a different model using the same code, just by providing the model’s name. See the following example code:
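A minimal sketch of this pattern (switching experiments only requires changing the checkpoint name, for example to bert-base-multilingual-uncased or deepset/xlm-roberta-base-squad2):

```python
from transformers import AutoModelForQuestionAnswering

# Loads the architecture, weights, and configuration that match the checkpoint.
model = AutoModelForQuestionAnswering.from_pretrained("dbmdz/bert-base-turkish-uncased")
```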
Preprocessing and training
The prepare_train_features() and prepare_validation_features() methods preprocess the training and validation datasets, respectively. The code iterates over the input dataset and builds a sequence from the context and the current question, with the correct model-specific token type IDs (numerical representations of tokens) and attention masks. The sequence is then passed through the model. This outputs a range of scores, for both the start and end positions, as shown in the following table.
| Input Dataset Fields | Preprocessed Training Dataset Fields for QuestionAnsweringTrainer |
| --- | --- |
| id | input_ids |
| title | attention_mask |
| context | start_positions |
| question | end_positions |
| answers { answer_start, answer_text } | |
Evaluation
The compute_metrics() method takes care of calculating metrics. We use the following popular metrics for question answering tasks (a small computation sketch follows this list):
- Exact match – Measures the percentage of predictions that match any one of the ground truth answers exactly.
- F1 score – Measures the average overlap between the prediction and the ground truth answer. The F1 score is the harmonic mean of precision and recall:
  - Precision – The ratio of the number of shared words to the total number of words in the prediction.
  - Recall – The ratio of the number of shared words to the total number of words in the ground truth.
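The following is a small illustrative sketch of the token-overlap F1 computation (simplified whitespace tokenization; the actual evaluation script also normalizes text and takes the maximum over multiple ground truth answers):

```python
from collections import Counter

def f1_score(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 between a predicted and a ground truth answer."""
    pred_tokens = prediction.lower().split()
    gt_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_shared = sum(common.values())
    if num_shared == 0:
        return 0.0
    precision = num_shared / len(pred_tokens)
    recall = num_shared / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

print(f1_score("emre kongar", "resit emre kongar"))  # 0.8
```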
Managed training on SageMaker
Setting up and managing custom machine learning (ML) environments can be time-consuming and cumbersome. With AWS Deep Learning Containers (DLCs) for the Hugging Face Transformers library, we have access to prepackaged and optimized deep learning frameworks, which makes it easy to run our script across multiple training jobs with minimal additional code.
We just need to use the Hugging Face Estimator available in the SageMaker Python SDK with the following inputs:
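A hedged sketch of the estimator setup (the entry point, source directory, framework versions, hyperparameters, and S3 path are placeholders for the values used in our experiments):

```python
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

hyperparameters = {
    "model_name_or_path": "deepset/xlm-roberta-base-squad2",  # swapped per experiment
    "num_train_epochs": 2,
    "per_device_train_batch_size": 16,
}

huggingface_estimator = HuggingFace(
    entry_point="run_qa.py",
    source_dir="./scripts",          # placeholder location of the fine-tuning script
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=role,
    transformers_version="4.6",      # illustrative framework version pairing
    pytorch_version="1.7",
    py_version="py36",
    hyperparameters=hyperparameters,
)

huggingface_estimator.fit({"train": "s3://your-bucket/turkish-qa/"})  # placeholder S3 path
```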
Evaluate the results
When the fine-tuning jobs for the Turkish question answering task are complete, we compare the model performance of the three approaches:
- Monolingual language model – The pre-trained model fine-tuned on the Turkish question answering text is called bert-base-turkish-uncased [8]. It achieves an F1 score of 75.63 and an exact match score of 56.17 in only two epochs and with 9,000 labeled items. However, this approach is not suitable for a low-resource language when a pre-trained language model doesn’t exist, or there is little data available for training from scratch.
- Multilingual language model with multilingual BERT – The pre-trained model is called bert-base-multilingual-uncased. The multilingual BERT paper [4] has shown that it generalizes well across languages. Compared with the monolingual model, it performs worse (F1 score 71.73, exact match 50.45), but note that this model handles over 100 other languages, leaving less room for representing the Turkish language.
- Multilingual language model with XLM-R – The pre-trained model is called xlm-roberta-base-squad2. The XLM-R paper shows that it is possible to have a single large model for over 100 languages without sacrificing per-language performance [7]. For the Turkish question answering task, it outperforms the multilingual BERT and monolingual BERT F1 scores by 5% and 2%, respectively (F1 score 77.14, exact match 56.39).
Our comparison doesn’t take into consideration other differences between models such as the model capacity, training datasets used, NLP tasks pre-trained on, vocabulary size, or tokenization.
Additional experiments
The provided notebook contains additional experiment examples.
SageMaker provides a wide range of training instance types. We fine-tuned the XLM-R model on p3.2xlarge (GPU: Nvidia V100 GPU, GPU architecture: Volta (2017)), p3.16xlarge (GPU: 8 Nvidia V100 GPUs), and g4dn.xlarge (GPU: Nvidia T4 GPU, GPU architecture: Turing (2018)), and observed the following:
- Training duration – According to our experiment, the XLM-R model took approximately 24 minutes to train on p3.2xlarge and 30 minutes on g4dn.xlarge (about 23% longer). We also performed distributed fine-tuning on two p3.16xlarge instances, and the training time decreased to 10 minutes. For more information on distributed training of a transformer-based model on SageMaker, refer to Distributed fine-tuning of a BERT Large model for a Question-Answering Task using Hugging Face Transformers on Amazon SageMaker.
- Training costs – We used the AWS Pricing API to fetch SageMaker on-demand prices to calculate it on the fly. According to our experiment, training cost approximately $1.58 on p3.2xlarge, and about four times less on g4dn.xlarge ($0.37). Distributed training on two p3.16xlarge instances using 16 GPUs cost $9.68.
To summarize, although the g4dn.xlarge was the least expensive machine, it also took about three times longer to train than the most powerful instance type we experimented with (two p3.16xlarge). Depending on your project priorities, you could choose from a wide variety of SageMaker training instance types.
Conclusion
In this post, we explored fine-tuning pre-trained transformer-based language models for a question answering task for a mid-resource language (in this case, Turkish). You can apply this approach to over 100 other languages using a single model. As of this writing, scaling up a model to cover all of the world’s 7,000 languages is still prohibitive, but the field of NLP provides an opportunity to widen our horizons.
Language is the principal method of human communication, and a means of communicating values and sharing the beauty of a cultural heritage. Linguistic diversity strengthens intercultural dialogue and builds inclusive societies.
ML is a highly iterative process; over the course of a single project, data scientists train hundreds of different models, datasets, and parameters in search of maximum accuracy. SageMaker offers the most complete set of tools to harness the power of ML and deep learning. It lets you organize, track, compare, and evaluate ML experiments at scale.
Hugging Face is integrated with SageMaker to help data scientists develop, train, and tune state-of-the-art NLP models more quickly and easily. We demonstrated several benefits of using Hugging Face transformers on Amazon SageMaker, such as training and experimentation at scale, and increased productivity and cost-efficiency.
You can experiment with NLP tasks on your preferred language in SageMaker in all AWS Regions where SageMaker is available. The example notebook code is available in GitHub.
To learn how Amazon SageMaker Training Compiler can accelerate the training of deep learning models by up to 50%, see New – Introducing SageMaker Training Compiler.
The authors would like to express their deepest appreciation to Mariano Kamp and Emily Webber for reviewing drafts and providing advice.
References
- J. Devlin et al., “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding”, (2018).
- A. Vaswani et al., “Attention Is All You Need”, (2017).
- J. Howard and S. Ruder, “Universal Language Model Fine-Tuning for Text Classification”, (2018).
- T. Pires et al., “How multilingual is Multilingual BERT?”, (2019).
- Y. Liu et al., “RoBERTa: A Robustly Optimized BERT Pretraining Approach”, (2019).
- G. Lample, and A. Conneau, “Cross-Lingual Language Model Pretraining”, (2019).
- A. Conneau et al., “Unsupervised Cross-Lingual Representation Learning at Scale”, (2019).
- Stefan Schweter. BERTurk – BERT models for Turkish (2020).
- Multilingual Wiki Statistics https://en.wikipedia.org/wiki/Wikipedia:Multilingual_statistics
About the Authors
Arnav Khare is a Principal Solutions Architect for Global Financial Services at AWS. His primary focus is helping financial services institutions build and design analytics and machine learning applications in the cloud. Arnav holds an MSc in Artificial Intelligence from Edinburgh University and has 18 years of industry experience, ranging from small startups he founded to large enterprises like Nokia and Bank of America. Outside of work, Arnav loves spending time with his two daughters, finding new independent coffee shops, reading, and traveling. You can find him on LinkedIn and in Surrey, UK, in real life.
Hasan-Basri AKIRMAK (BSc and MSc in Computer Engineering and Executive MBA in Graduate School of Business) is a Senior Solutions Architect at Amazon Web Services. He is a business technologist advising enterprise segment clients. His area of specialty is designing architectures and business cases for large-scale data processing systems and machine learning solutions. Hasan has delivered business development, systems integration, and program management for clients in Europe, the Middle East, and Africa. Since 2016, he has mentored hundreds of entrepreneurs at startup incubation programs pro bono.
Heiko Hotz is a Senior Solutions Architect for AI & Machine Learning and leads the Natural Language Processing (NLP) community within AWS. Prior to this role, he was the Head of Data Science for Amazon’s EU Customer Service. Heiko helps our customers be successful in their AI/ML journey on AWS and has worked with organizations in many industries, including insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. In his spare time, Heiko travels as much as possible.
Build a custom Q&A dataset using Amazon SageMaker Ground Truth to train a Hugging Face Q&A NLU model
In recent years, natural language understanding (NLU) has increasingly found business value, fueled by model improvements as well as the scalability and cost-efficiency of cloud-based infrastructure. Specifically, the Transformer deep learning architecture, often implemented in the form of BERT models, has been highly successful, but training, fine-tuning, and optimizing these models has proven to be a challenging problem. Thanks to the AWS and Hugging Face collaboration, it’s now simpler to train and optimize NLU models on Amazon SageMaker using the SageMaker Python SDK, but sourcing labeled data for these models is still difficult and time-consuming.
One NLU problem of particular business interest is the task of question answering. In this post, we demonstrate how to build a custom question answering dataset using Amazon SageMaker Ground Truth to train a Hugging Face question answering NLU model.
Question answering challenges
Question answering entails a model automatically producing an answer to a query given some body of text that may or may not contain the answer. For example, given the following question, “What workflows does SageMaker Ground Truth support?” a model should be able to identify the segment “annotation consolidation and audit” in the following paragraph:
SageMaker Ground Truth helps improve the quality of labels through annotation consolidation and audit workflows. Annotation consolidation is the process of collecting label inputs from two or more data labelers and combining them to create a single data label for your machine learning model. With built-in audit and review workflows, workers can perform label verification and make adjustments to improve accuracy.
This problem is challenging because it requires a model to comprehend the meaning of a question, rather than simply perform keyword search. Accurate models in this area can reduce customer support costs through powering intelligent chatbots, delivering high-quality voice assistant products, and driving online store revenue through personalized product question answering. One large dataset in this area is the Stanford Question Answering Dataset (SQuAD), a diverse question answering dataset that presents a model with short text passages and requires the model to predict the location of the answering text span in the passage. SQuAD is a reading comprehension dataset, consisting of questions posed by crowd workers on a set of Wikipedia articles, where the answer to every question is either a span of text from the corresponding passage, or otherwise marked impossible to answer.
One challenge in adapting SQuAD for business use cases is generating domain-specific custom datasets. This process of creating new question and answer datasets requires a specialized user interface that allows annotators to highlight spans and add questions to those spans. It must also be able to support the addition of impossible questions to support SQuAD 2.0 format, which includes non-answerable questions. These impossible questions help models gain additional understanding around which queries can’t be answered using the given passage. The custom worker templates in Ground Truth simplify the generation of these datasets by providing workers with a tailored annotation experience for creating question and answer datasets.
Solution overview
This solution creates and manages Ground Truth labeling jobs to label a domain-specific custom question-answer dataset using a custom annotation user interface. We use SageMaker to train, fine-tune, optimize, and deploy a Hugging Face BERT model built with PyTorch on a custom question answering dataset.
You can implement the solution by deploying the provided AWS CloudFormation template in your AWS account. AWS CloudFormation handles deploying the AWS Lambda functions that support pre-annotation and annotation consolidation for the annotation user interface. It also creates an Amazon Simple Storage Service (Amazon S3) bucket and the AWS Identity and Access Management (IAM) roles to use when creating a labeling job.
This post walks you through how to do the following:
- Create your own question answering dataset, or augment an existing one using Ground Truth
- Use Hugging Face datasets to combine and tokenize text
- Fine-tune a BERT model on your question answering data using SageMaker training
- Deploy your model to a SageMaker endpoint and visualize your results
Annotation user interface
We use a new custom worker task template with Ground Truth to add new annotations to the existing SQuAD dataset. This solution offers a worker task template as well as a pre-annotation Lambda function (which handles putting data into the user interface) and post-annotation Lambda function (which extracts results from the user interface after labeling is complete).
This custom worker task template gives you the ability to highlight text in the right pane, then add a corresponding question in the left pane that relates to the highlighted text. Highlighted text on the right pane can also be added to any previously created question. Moreover, you can add impossible questions according to SQuAD 2.0 format. Impossible questions allow models to reduce the number of unreliable false positive guesses when the passage is unable to answer a query.
This user interface uses the same JSON schema as the SQuAD 2.0 dataset, which means it can operate over multiple articles and paragraphs, displaying one paragraph at a time using the Previous and Next buttons. The user interface makes it easy to monitor and determine the labeling work each annotator needs to complete during the task submission step.
Because the annotation UI is contained in a single Liquid HTML file, you can customize the labeling experience with knowledge of basic JavaScript. You can also modify Liquid tags to pass additional information into the labeling UI, and you can modify the template itself to include more detailed worker instructions.
Estimated costs
Deploying this solution can incur a maximum cost of around $20, not accounting for human labeling costs. Amazon S3, Lambda, SageMaker, and Ground Truth all offer the AWS Free Tier, with charges for additional usage. For more information, see the following pricing pages:
- Amazon S3 Pricing
- AWS Lambda Pricing
- Amazon SageMaker Pricing
- Amazon SageMaker Data Labeling Pricing – This fee depends on the type of workforce that you use. If you’re a new user of Ground Truth, we suggest using a private workforce and including yourself as a worker to test your labeling job configuration.
Prerequisites
To implement this solution, you should have the following prerequisites:
- An AWS account.
- Familiarity with Ground Truth. For more information, refer to Use Amazon SageMaker Ground Truth to Label Data.
- Familiarity with AWS CloudFormation. For more information, refer to the AWS CloudFormation User Guide.
- A SageMaker workforce. For this demonstration, we use a private workforce. You can create a workforce on the SageMaker console.
The following GIF demonstrates how to create a private workforce. For instructions, see Create an Amazon Cognito Workforce Using the Labeling Workforces Page.
Launch the CloudFormation Stack
Now that you’ve seen the structure of the solution, you deploy it into your account so you can run an example workflow. All the deployment steps related to the labeling pipeline are managed by AWS CloudFormation. This means AWS CloudFormation creates your pre-annotation and annotation consolidation Lambda functions, as well as an S3 bucket to store input and output data.
You can launch the stack in AWS Region us-east-1 on the AWS CloudFormation console using the Launch Stack button. To launch the stack in a different Region, use the instructions found in the README of the GitHub repository.
Operate the notebook
After the solution has been deployed to your account, a notebook instance named gt-hf-squad-notebook is available. To start operating the notebook, complete the following steps:
- On the Amazon SageMaker console, navigate to the notebook instance page.
- Choose Open JupyterLab to open the instance.
- Inside the instance, browse to the repository hf-gt-custom-qa and open the notebook hf_squad_finetuning.ipynb.
- Choose conda_pytorch_p38 as your kernel.
Now that you’ve created a notebook instance and opened the notebook, you can run cells in the notebook to operate the solution. The remainder of this post provides additional details to each section in the notebook as you go along.
Download and inspect the data
The SQuAD dataset contains a training dataset as well as test and development datasets. The notebook downloads the SQuAD 2.0 dataset for you, but you can choose which version of SQuAD to use by modifying the notebook cell under Download and inspect the data.
SQuAD was created by Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. For more information, refer to the original paper and dataset. SQuAD has been licensed by the authors under the Creative Commons Attribution-ShareAlike 4.0 International Public License.
Let’s look at an example question and answer pair from SQuAD:
Paragraph title: Immune_system
The immune system is a system of many biological structures and processes within an organism that protects against disease. To function properly, an immune system must detect a wide variety of agents, known as pathogens, from viruses to parasitic worms, and distinguish them from the organism’s own healthy tissue. In many species, the immune system can be classified into subsystems, such as the innate immune system versus the adaptive immune system, or humoral immunity versus cell-mediated immunity. In humans, the blood–brain barrier, blood–cerebrospinal fluid barrier, and similar fluid–brain barriers separate the peripheral immune system from the neuroimmune system which protects the brain.
Question: The immune system protects organisms against what?
Answer: disease
Load model
Now that you’ve viewed an example question and answer pair in SQuAD, you can download a model that you can fine-tune for question answering. Hugging Face allows you to easily download a base model that has undergone large-scale pre-training and reinitialize it for a different downstream task. In this case, you download the distilbert-base-uncased model and repurpose it for question answering using the AutoModelForQuestionAnswering class from Hugging Face. You also utilize the AutoTokenizer class to retrieve the model’s pre-trained tokenizer. We dive deeper into the model we use later in the post.
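A minimal sketch of this step:

```python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model_name = "distilbert-base-uncased"

# Reinitialize the pre-trained base model with a question answering head and
# fetch its matching tokenizer.
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```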
View BERT input
BERT requires you to transform text data into a numeric representation known as tokens. There are a variety of tokenizers available; the following tokens were created by a tokenizer specifically designed for BERT that you instantiate with a set vocabulary. Each token maps to a word in the vocabulary. Let’s look at the transformed immune system question and context you supply BERT for inference.
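A sketch of producing that input, reusing the tokenizer loaded above (the context here is truncated for brevity):

```python
question = "The immune system protects organisms against what?"
context = (
    "The immune system is a system of many biological structures and "
    "processes within an organism that protects against disease."
)

# Encode the question/context pair into token IDs plus an attention mask.
inputs = tokenizer(question, context, return_tensors="pt")
print(inputs["input_ids"])
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))
```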
Model inference
Now that you’ve seen what BERT takes as input, let’s look at how you can get inference results from the model. The following code demonstrates how to use the previously generated tokenized input and return inference results from the model. Similar to how BERT can’t accept raw text as input, it doesn’t generate raw text as output either. You translate BERT’s output by identifying the start and end points in the paragraph that BERT identified as the answer. Then you map that output to our tokens and back to English text.
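A simplified sketch of that flow, using the tokenized inputs from the previous section (a fine-tuned model is assumed; a freshly initialized QA head would not yet produce meaningful answers):

```python
import torch

with torch.no_grad():
    outputs = model(**inputs)

# Take the most likely start and end token positions, then decode that span
# back into text.
start_index = int(torch.argmax(outputs.start_logits))
end_index = int(torch.argmax(outputs.end_logits))
answer_tokens = inputs["input_ids"][0][start_index : end_index + 1]
print(tokenizer.decode(answer_tokens))
```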
The translated results are as follows:
Question: The immune system protects organisms against what?
Answer: disease
Augment SQuAD
Next, to obtain additional labeled data, we use a custom worker task template in Ground Truth. We can first create a new article in SQuAD format. The notebook copies this file from the repo to Amazon S3, but feel free to make any edits before running the Augment SQuAD cell. The format of SQuAD is shown in the following code. Each SQuAD JSON file contains multiple articles stored in the data key. Each article has a title field and one or more paragraphs. These paragraphs contain segments of text called context and any associated questions in the qas list. Because we’re annotating from scratch, we can leave the qas list empty and just provide context. The user interface is able to loop across both paragraphs and articles, allowing you to make each worker task as large or small as desired.
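A minimal sketch of that structure (the title and context text are placeholders):

```json
{
  "version": "v2.0",
  "data": [
    {
      "title": "Example article",
      "paragraphs": [
        {
          "context": "Text passage that workers will read and annotate.",
          "qas": []
        }
      ]
    }
  ]
}
```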
After we generate a sample SQuAD data file, we need to create a Ground Truth augmented manifest file that refers to our input data. We do this by generating a JSON lines-formatted file with a “source” key corresponding to the location in Amazon S3 where we stored our input SQuAD data:
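A sketch of a single line in that manifest (the bucket and key are placeholders):

```json
{"source": "s3://your-bucket-name/input/squad_input.json"}
```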
Access labeling portal
After you send the job to Ground Truth, you can view the generated labeling job on the Ground Truth console.
To perform labeling, you need to log in to the worker portal account you created as a part of the prerequisite steps. Your job is available in the worker portal after a few minutes of pre-processing. After opening the task, you’re presented with the custom worker template for Q&A annotation. You can add questions by highlighting sections of text in the context, then choosing Add Question.
Check labeling job status
After submission, you can run the Check labeling job status cell to see if your labeling job is complete. Wait for completion before proceeding to further cells.
Load labeled data
After labeling, the output manifest contains an entry with your label attribute name (in this case, squad-1626282229) containing an S3 URI to SQuAD-formatted data that you can use during training. See the following output manifest contents:
Each line in the manifest corresponds to a single worker task.
Load SQuAD train set
Hugging Face has a dataset package that provides you with the ability to download and preprocess SQuAD, but to add our custom questions and answers, we need to do a bit of processing. SQuAD is structured around sets of topics. Each topic has a variety of different context statements and each context statement has question and answer pairs. Because we want to create our own questions for training, we need to combine our questions with SQuAD. Luckily for us, our annotations are already in SQuAD format, so we can take our example labels and append them as a new topic to the existing SQuAD data.
Create a Hugging Face Dataset object
To get our data into Hugging Face’s dataset format, we have several options. We can use the load_dataset option, in which case we can supply a CSV, JSON, or text file that is loaded as a dataset object. We can also supply load_dataset with a processing script to convert the file into the desired format. For this post, we instead use the Dataset.from_dict() method, which allows us to supply an in-memory dictionary to create a dataset object. We also define our dataset features. We can view the features by using Hugging Face’s dataset viewer, as shown in the following screenshot.
Our features are as follows:
- ID – The ID of the text
- title – The associated title for the topic
- context – The context statement the model must search to find an answer
- question – The question the model is being asked
- answer – The accepted answer text and location in the context statement
Hugging Face datasets easily allow us to define this schema:
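A hedged sketch of that definition (the field names mirror the features listed above; the example record and its answer_start offset are illustrative, and in the notebook the dictionary is built from the combined SQuAD and custom annotations):

```python
from datasets import Dataset, Features, Sequence, Value

features = Features({
    "id": Value("string"),
    "title": Value("string"),
    "context": Value("string"),
    "question": Value("string"),
    "answers": Sequence({
        "text": Value("string"),
        "answer_start": Value("int32"),
    }),
})

# Tiny illustrative in-memory dictionary: one column list per feature.
examples_dict = {
    "id": ["0001"],
    "title": ["Immune_system"],
    "context": ["The immune system protects against disease."],
    "question": ["The immune system protects organisms against what?"],
    "answers": [{"text": ["disease"], "answer_start": [35]}],
}

custom_dataset = Dataset.from_dict(examples_dict, features=features)
print(custom_dataset)
```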
After we create our dataset object, we have to tokenize the text. Because models can’t accept raw text as an input, we need to convert our text into a numeric input that it can understand, otherwise known as tokenization. Tokenization is model specific, so let’s understand the model we’re going to fine-tune. We’re using a distilbert-base-uncased model. It looks very similar to BERT: it uses input embeddings, multi-head attention (for more information about this operation, refer to The Illustrated Transformer), and feed forward layers, but has half the parameters of the original BERT base model. See the following initial model layers:
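A minimal sketch of how to view them (model is the distilbert-base-uncased question answering model loaded earlier):

```python
# Printing the model lists the embedding layers, the transformer blocks
# (multi-head attention and feed forward layers), and the QA head.
print(model)
```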
Let’s break down each component of the model’s name. The name distilbert denotes the fact that this is a distilled version of the BERT base model, which is obtained through a process called knowledge distillation. Knowledge distillation allows us to train a smaller student model on not only the training data but also the responses to the same training set from a larger pre-trained teacher model. base refers to the size of the model; in this case, the model was distilled from a BERT base model (as opposed to a BERT large model). uncased refers to the text it was trained on; the text didn’t account for case, so all the text it was trained on was lowercase. The uncased aspect directly affects the way we tokenize our text. Thankfully, in addition to providing easy access to downloading transformer models, Hugging Face also provides the model’s accompanying tokenizer. We also downloaded a customized tokenizer for our distilbert-base-uncased model that we now use to transform our text.
Another feature of the dataset class is it allows us to run preprocessing and tokenization in parallel with its map function. We define a processing function and then pass it to the map method.
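A minimal sketch of that pattern (prepare_train_features is the custom preprocessing function described in the notebook and below; batched=True lets map process chunks of examples in one call):

```python
# Tokenize and add answer start/end positions for every example, dropping the
# original text columns that the model doesn't consume.
tokenized_dataset = custom_dataset.map(
    prepare_train_features,
    batched=True,
    remove_columns=custom_dataset.column_names,
)
```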
For question answering, Hugging Face needs several components (which are also defined in the glossary):
- attention mask – A mask indicating to the model which tokens to pay attention to, used primarily for differentiating between actual text and padding tokens
- start_positions – The start position of the answer in the text
- end_positions – The end position of the answer in the text
- input_ids – The token indices mapping the tokens to the vocabulary
Our tokenizer will tokenize the text, but we need to explicitly capture the start and end positions of our answer, which is why we have defined a custom preprocessing function. Now that we have our inputs ready, let’s start training!
Launch training job
We can run training in our notebook, but the types of instances we need to train our Q&A model in a reasonable amount of time, p3 and p4 instances, are rather powerful. These instances tend to be overkill for running a notebook or as a persistent Amazon Elastic Compute Cloud (Amazon EC2) instance. This is where SageMaker training comes in. SageMaker training allows you to launch a training job on a specified instance or instances that are only up for the duration of the training job. This allows us to run on larger instances like the p4d.24xlarge, with 8 NVIDIA A100 GPUs, but without worrying about running up a huge bill in case we forget to turn it off. It also gives us easy access to other SageMaker functionalities, like SageMaker Experiments for tracking your ML training runs and SageMaker Debugger for understanding and profiling your training jobs.
Local training
Let’s start by understanding how training a model in Hugging Face works locally, then go over the adjustments we make to run it in SageMaker.
Hugging Face makes training easy through the use of their trainer class. The trainer class allows us to pass in our model, our train and validation datasets, our hyperparameters, and even our tokenizer. Because we already have our model as well as our training and validation sets, we only need to define our hyperparameters. We can do this through the TrainingArguments class, which allows us to specify things like the learning rate, batch size, number of epochs, and more in-depth parameters like weight decay or a learning rate scheduling strategy. After we define our TrainingArguments, we can pass in our model, training set, validation set, and arguments to instantiate our trainer class. Then we can simply call trainer.train() to start training our model. The following code block demonstrates how to run local training:
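A hedged sketch of that local training loop (hyperparameter values are illustrative, and the model, tokenizer, and dataset variables come from the preprocessing steps above):

```python
from transformers import Trainer, TrainingArguments, default_data_collator

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=2,
    per_device_train_batch_size=16,
    learning_rate=4e-5,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_validation,   # placeholder validation split
    data_collator=default_data_collator,
    tokenizer=tokenizer,
)

trainer.train()
```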
Send data to S3
Doing the same thing in SageMaker training is straightforward. The first step is putting our data in Amazon S3 so that our model can access it. SageMaker training allows you to specify a data source; you can use sources like Amazon S3, Amazon Elastic File System (Amazon EFS), or Amazon FSx for Lustre for high-performance data ingestion. In our case, our augmented SQuAD dataset isn’t particularly large, so Amazon S3 is a good choice. We upload our training data to a folder in Amazon S3 and when SageMaker spins up our training instance, it downloads the data from our specified location.
Instantiate the model
To launch our training job, we can use the built-in Hugging Face estimator in the SageMaker SDK. SageMaker uses the estimator class to define the parameters for a training job as well as the number and type of instances to use for training. SageMaker training is built around the use of Docker containers. You can use the default containers in SageMaker or supply your own custom container for training. In the case of Hugging Face models, SageMaker has built-in Hugging Face containers with all the dependencies you need to run Hugging Face training jobs. All we need to do is define our training script, which our Hugging Face container uses as its entry point.
In this training script, we define our arguments, which we pass to our entry point in the form of a set of hyperparameters, as well as our training code. Our training code is the same as if we were running it locally; we can simply use the TrainingArguments and then pass them to a trainer object. The only difference is we need to specify the output location for our model to be /opt/ml/model so that SageMaker training can take it, package it, and send it to Amazon S3. The following code block shows how to instantiate our Hugging Face estimator:
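A hedged sketch of the estimator (the entry point name, source directory, framework versions, and hyperparameter names are placeholders):

```python
from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    entry_point="train.py",          # training script described above
    source_dir="./src",              # placeholder script location
    instance_type="ml.p3.8xlarge",
    instance_count=1,
    role=role,                       # SageMaker execution role
    transformers_version="4.6",      # illustrative version pairing
    pytorch_version="1.7",
    py_version="py36",
    hyperparameters={
        "epochs": 2,
        "train_batch_size": 16,
        "learning_rate": 4e-5,
        "fp16": True,                # mixed precision training
    },
)
```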
Fine-tune the model
For our specific training job, we use a p3.8xlarge instance consisting of 4 V100 GPUs. The trainer class automatically supports training on multi-GPU instances, so we don’t need any additional setup to account for this. We train our model for two epochs, with a batch size of 16 and a learning rate of 4e-5. We’re also enabling mixed precision training, which uses mixed precision in areas where we can reduce numerical precision without impacting our model’s accuracy. This increases our available memory and training speeds. To launch the training job, we call the fit method on our huggingface_estimator object.
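A one-line sketch of that call (the channel name and S3 path are placeholders for the training data uploaded earlier):

```python
huggingface_estimator.fit({"train": "s3://your-bucket/qa-train-data/"})
```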
When our model is done training, we can download the model locally and load it into our notebook’s memory to test it, which is demonstrated in the notebook. We will focus on another option, deploying it as a SageMaker endpoint!
Deploy trained model
In addition to providing utilities for training, SageMaker also allows data scientists and ML engineers to easily deploy REST endpoints for their trained models. You can deploy models trained in or outside of SageMaker. For more information, refer to Deploy a Model in Amazon SageMaker.
Because our model was trained in SageMaker, it’s already in the correct format to deploy as an endpoint. Similar to training, we define a SageMaker model class that specifies the model, the serving code, and the number and type of instances we want to deploy as endpoints. Also similar to training, serving is based on Docker containers, and we can use one of the built-in SageMaker containers or supply our own. For this post, we use a built-in PyTorch serving container, so we simply need to define a few things to get our endpoint up and running. Our serving code needs four functions:
- model_fn – Defines how the endpoint loads the model (it only does this once, and then keeps it in memory for subsequent predictions)
- input_fn – Defines how the input is deserialized and processed
- predict_fn – Defines how our model makes predictions on our input
- output_fn – Defines how the endpoint formats and sends back the output data to the client making the request
After we define these functions, we can deploy our endpoint, pass it context statements and questions, and get back its predicted answers:
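A hedged sketch of the deployment and a sample request (the inference script name, instance type, and payload shape are illustrative; the four serving functions above live in that script):

```python
from sagemaker.pytorch import PyTorchModel
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

pytorch_model = PyTorchModel(
    model_data=huggingface_estimator.model_data,  # S3 path to the trained model artifact
    role=role,
    entry_point="inference.py",      # defines model_fn, input_fn, predict_fn, output_fn
    framework_version="1.7",
    py_version="py36",
)

predictor = pytorch_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

response = predictor.predict({
    "context": "SageMaker Ground Truth helps improve the quality of labels ...",
    "question": "What workflows does SageMaker Ground Truth support?",
})
print(response)
```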
Visualize model results
Because we deployed a SageMaker endpoint that allows us to send context statements and receive answers, we can go back and visualize the resulting inferences within the original SQuAD viewer to better visualize what our model found in the passage context. We do this by reformatting the results of inference back into SQuAD format, then replacing the Liquid tags in the worker template with the SQuAD-formatted JSON. We can then iframe the resulting UI inside our worker template to iteratively review results within the context of a single notebook, as shown in the following screenshot. Each question on the left can be clicked to highlight the spans of text on the right matching the query. With no question selected, all text spans are highlighted on the right as shown below.
Clean up
To avoid incurring future charges, run the Clean up section of the notebook to delete all the resources, including the SageMaker endpoints, the S3 objects that contain the raw and processed dataset, and the CloudFormation stack. When the deletion is complete, make sure to stop and delete the notebook instance that is hosting the current notebook script.
Conclusion
In this post, you learned how to create your own question answering dataset using Ground Truth and combine it with SQuAD to train and deploy your own question answering model using SageMaker. After you complete the notebook, you have a deployed SageMaker endpoint that was trained on your custom Q&A dataset. This endpoint is ready for integration into your production NLU workflows, because SageMaker endpoints are available through standard REST APIs. You also have an annotated custom dataset in SQuAD 2.0 format, which allows you to retrain your existing model or try training other question answering model architectures. Finally, you have a mechanism to quickly visualize the results from your inference by loading the worker template in your local notebook.
Try out the notebook, augment it with your own questions, and train and deploy your own custom question answering model for your NLU use cases!
Happy building!
About the Authors
Jeremy Feltracco is a Software Development Engineer with the Amazon ML Solutions Lab at Amazon Web Services. He uses his background in computer vision, robotics, and machine learning to help AWS customers accelerate their AI adoption.
Vidya Sagar Ravipati is a Manager at the Amazon ML Solutions Lab, where he leverages his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption. Previously, he was a Machine Learning Engineer in Connectivity Services at Amazon who helped to build personalization and predictive maintenance platforms.
Isaac Privitera is a Senior Data Scientist at the Amazon Machine Learning Solutions Lab, where he develops bespoke machine learning and deep learning solutions to address customers’ business problems. He works primarily in the computer vision space, focusing on enabling AWS customers with distributed training and active learning.
Use custom vocabulary in Amazon Lex to enhance speech recognition
In our daily conversations, we come across new words or terms that we may not know. Perhaps these are related to a new domain that we’re just getting familiar with, and we pick them up as we understand more about the domain. For example, home loan terminology (“curtailment”), shortened words (“refi”, “comps”), and acronyms (“HELOC”) are likely new to first-time home buyers. Similarly, brand names can be tricky when you first hear of them. For example, with the sentence “I’d like to get a Cyetus 7601”, unless you’re a coffee aficionado, you may not realize the speaker is talking about a coffee machine brand. Some of these words may not even be present in the dictionary. Once we understand their usage, meaning, and spelling, they become part of our vocabulary so we can have effective conversations.
Similarly, bots have to understand and add such words to their vocabulary. Starting today, you can give Amazon Lex additional information about how to process speech input by creating a custom vocabulary. You can create a list with domain-specific words or brand names for improved speech recognition and end-user experience.
In this post, we review the custom vocabulary capability and use it for home loan customer service transactions and queries.
Overview of the custom vocabulary capability
You define the custom vocabulary for a language in the bot. In this case, we add brand names and domain-specific words and phrases in English (GB). The bot uses these phrases for transcription of speech input when it’s trying to understand the user’s request (or intent classification). You can also apply the custom vocabulary to information you have to collect (or slot elicitation) by configuring a setting on the slot type. You can specify the degree of boosting for each word or phrase by configuring a weight of 1, 2, or 3. We recommend you try out different weights; words that are usually difficult to recognize or pronounce should have a higher weight.
The following is an example conversation with our bot:
Virtual Agent: Welcome to ACME bank. How can I help you?
Caller: Do you offer CapitaRise loans?
Virtual Agent: We offer CapitaRise loans for personal purposes. Do you want to get an application started?
Caller: No, thanks.
Virtual Agent: Is there anything else I can help you with today?
Caller: How about your refi rates?
Virtual Agent: For refinancing, we offer 4.5% with no closing costs for a 30-year fixed loan. Is there any other information you might be interested in?
Caller: Yes, how about SBA?
Virtual Agent: We offer a variety of small business admin loans. You can find more information on our website. Anything else I can do for you?
Caller: I would like to do a payment on my HELOC.
Virtual Agent: To proceed with a payment, what is the account ID?
Caller: 1234
Virtual Agent: Thanks. What’s the payment amount?
Caller: 120
Virtual Agent: Thank you. The payment should be processed within 1 to 2 business days. Anything else I can help you with?
Caller: No, that’s it.
Virtual Agent: Thank you. Have a great day.
Solution overview
Let’s review the overall architecture for the solution (see the following diagram):
- We use an Amazon Lex bot integrated with an Amazon Connect contact flow to deliver the conversational experience
- We define the custom vocabulary for the English (GB) language by adding words such as “CapitaRise,” “HELOC,” and “refi”, along with weights
- After the caller is authenticated, the control is passed to the bot to perform transactions (for example, to process payment)
The custom vocabulary file is a tab-separated list of values that contain the phrase to recognize and a weight to give the boost. Phrases with a higher boost value are more likely to be used when they appear in the audio input.
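To make the file format concrete, the following is a minimal sketch that writes such a tab-separated file for the phrases used in this post. The column headers, file name, and packaging requirements are assumptions here, so verify them against the Amazon Lex custom vocabulary documentation before importing.

```python
# Illustrative sketch only: write a tab-separated custom vocabulary file for the
# phrases in this post. Header names and file name are assumptions -- check the
# Amazon Lex custom vocabulary documentation before importing.
rows = [
    ("phrase", "weight"),   # header row
    ("CapitaRise", "3"),    # hard-to-recognize brand name, so a higher weight
    ("HELOC", "2"),
    ("refi", "1"),
]

with open("custom_vocabulary.tsv", "w", encoding="utf-8") as f:
    for row in rows:
        f.write("\t".join(row) + "\n")
```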
Deploy the sample Amazon Lex bot
To create the sample bot and configure the custom vocabulary, perform the following steps. This creates an Amazon Lex bot called FinanceBot, with intents PersonalLoan, BusinessLoan, InterestRateRefinancing, InterestRateCredit, Payment, Welcome, and Goodbye, as well as two slot types (accountNumber and confirmationSlot).
- Download the Amazon Lex bot.
- On the Amazon Lex console, choose Actions, Import.
- Choose the FinanceBot.zip file that you downloaded, and choose Import.
- In the IAM Permissions section, for Runtime role, choose Create a new role with basic Amazon Lex permissions.
- On the Amazon Lex console, navigate to the bot FinanceBot.
- Download the .zip file with the phrases that you want to add to the custom vocabulary.
- On the bot detail page, in the Add languages section, choose View languages.
- From the list of languages, choose English (GB).
- In the Custom vocabulary section, choose Import.
- Browse to the file to import, enter a password if necessary, and then choose Import.
- Choose Build.
- Download the supporting AWS Lambda code (a simplified sketch of such a handler appears after these steps).
- On the Lambda console, create a new function and select Author from scratch.
- For Function name, enter FinanceBotEnglish.
- For Runtime, choose Python 3.8.
- Choose Create function.
- In the Code source section, open lambda_function.py and delete the existing code.
- Download the code and open it in a text editor.
- Copy and paste the code into the empty lambda_function.py tab.
- Choose Deploy.
- On the Amazon Lex console, open FinanceBot.
- Choose Deployment and then Aliases, followed by TestBotAlias.
- On the Aliases page, in the Languages section, navigate to English (GB).
- For Source, select FinanceBotEnglish.
- For Lambda version or alias, enter $LATEST.
- On the Amazon Connect console, choose Contact flows.
- Download the contact flow to integrate with the Amazon Lex bot.
- In the Amazon Lex section, select your Amazon Lex bot and make it available for use in the Amazon Connect contact flows.
- Select the contact flow to load it into the application.
- Make sure the right bot is configured in the “Get Customer Input” block.
- Choose a queue in the “Set working queue” block.
- Add a phone number to the contact flow.
- Test the IVR flow by calling in to the phone number.
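For reference, the supporting Lambda function you configured earlier handles the bot’s fulfillment. The following is a heavily simplified sketch of a Lex V2 fulfillment handler, not the downloaded FinanceBot code; the closing message and branching are only illustrative.

```python
# Simplified, illustrative Lex V2 fulfillment handler (not the downloaded
# FinanceBot code). It closes the conversation with a confirmation message.
def lambda_handler(event, context):
    intent = event["sessionState"]["intent"]

    # Illustrative response for a fulfilled Payment intent; a real handler would
    # branch on intent["name"] and read slot values from intent["slots"].
    message = "Thank you. The payment should be processed within 1 to 2 business days."

    return {
        "sessionState": {
            "dialogAction": {"type": "Close"},
            "intent": {"name": intent["name"], "state": "Fulfilled"},
        },
        "messages": [{"contentType": "PlainText", "content": message}],
    }
```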
Test the solution
You can call in to the Amazon Connect phone number and interact with the bot.
Conclusion
Custom vocabulary enables improved recognition of domain-specific words and brand names in speech input. You can easily define the custom vocabulary for your Amazon Lex bot and add it to the bot definition. With improved recognition, you can enable more effective conversations across a broader set of use cases. You can configure custom vocabulary using the Amazon Lex V2 console or via the API. The capability is available for English (US) and English (GB) in all AWS Regions where Amazon Lex operates. To learn more, refer to the custom vocabulary documentation.
About the Authors
Kai Loreck is a professional services Amazon Connect consultant. He works on designing and implementing scalable customer experience solutions. In his spare time, he can be found playing sports, snowboarding, or hiking in the mountains.
Anubhav Mishra is a Product Manager with AWS. He spends his time understanding customers and designing product experiences to address their business challenges.
Mebz Qazi is a Senior Consultant working on global projects for AWS. He very much enjoys working on technological innovation in natural language and AI/ML.
Sravan Bodapati is an Applied Science Manager at AWS Lex. He focuses on building cutting edge Artificial Intelligence and Machine Learning solutions for AWS customers in ASR and NLP space. In his spare time, he enjoys hiking, learning economics, watching TV shows and spending time with his family.
Predict customer churn with no-code machine learning using Amazon SageMaker Canvas
Understanding customer behavior is top of mind for every business today. Gaining insights into why and how customers buy can help grow revenue. But losing customers (also called customer churn) is always a risk, and insights into why customers leave can be just as important for maintaining revenues and profits. Machine learning (ML) can help with insights, but until now you needed ML experts to build models that predict churn, and that dependency could delay insight-driven actions to retain customers.
In this post, we show you how business analysts can build a customer churn ML model with Amazon SageMaker Canvas, no code required. Canvas provides business analysts with a visual point-and-click interface that allows you to build models and generate accurate ML predictions on your own—without requiring any ML experience or having to write a single line of code.
Overview of solution
For this post, we assume the role of a marketing analyst in the marketing department of a mobile phone operator. We have been tasked with identifying customers that are potentially at risk of churning. We have access to service usage and other customer behavior data, and want to know if this data can help explain why a customer would leave. If we can identify factors that explain churn, then we can take corrective actions to change predicted behavior, such as running targeted retention campaigns.
To do this, we use the data we have in a CSV file, which contains information about customer usage and churn. We use Canvas to perform the following steps:
- Import the churn dataset from Amazon Simple Storage Service (Amazon S3).
- Train and build the churn model.
- Analyze the model results.
- Test predictions against the model.
For our dataset, we use a synthetic dataset from a telecommunications mobile phone carrier. This sample dataset contains 5,000 records, where each record uses 21 attributes to describe the customer profile. The attributes are as follows:
- State – The US state in which the customer resides, indicated by a two-letter abbreviation; for example, OH or NJ
- Account Length – The number of days that this account has been active
- Area Code – The three-digit area code of the customer’s phone number
- Phone – The remaining seven-digit phone number
- Int’l Plan – Whether the customer has an international calling plan (yes/no)
- VMail Plan – Whether the customer has a voice mail feature (yes/no)
- VMail Message – The average number of voice mail messages per month
- Day Mins – The total number of calling minutes used during the day
- Day Calls – The total number of calls placed during the day
- Day Charge – The billed cost of daytime calls
- Eve Mins, Eve Calls, Eve Charge – The minutes used, calls placed, and billed cost for evening calls
- Night Mins, Night Calls, Night Charge – The minutes used, calls placed, and billed cost for nighttime calls
- Intl Mins, Intl Calls, Intl Charge – The minutes used, calls placed, and billed cost for international calls
- CustServ Calls – The number of calls placed to customer service
- Churn? – Whether the customer left the service (true/false)
The last attribute, Churn?, is the attribute that we want the ML model to predict. The target attribute is binary, meaning our model predicts the output as one of two categories (True or False).
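If you want to sanity-check the file locally before importing it (entirely optional, because the Canvas workflow itself requires no code), a quick look with pandas might look like the following sketch. It assumes the file is named churn.csv and that the target column is Churn?, as described above.

```python
# Optional local sanity check of the churn dataset before importing it into
# Canvas. Assumes churn.csv is in the working directory.
import pandas as pd

df = pd.read_csv("churn.csv")

print(df.shape)                     # expect roughly 5,000 rows and 21 columns
print(df["Churn?"].value_counts())  # distribution of the binary target
```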
Prerequisites
A cloud admin with an AWS account with appropriate permissions is required to complete the following prerequisites:
- Deploy an Amazon SageMaker domain. For instructions, see Onboard to Amazon SageMaker Domain.
- Deploy Canvas. For instructions, see Setting up and managing Amazon SageMaker Canvas (for IT administrators).
- Configure cross-origin resource sharing (CORS) policies for Canvas. For instructions, see Give your users the ability to upload local files.
Create a customer churn model
First, let’s download the churn dataset and review the file to make sure all the data is there. Then complete the following steps:
- Sign in to the AWS Management Console, using an account with the appropriate permissions to access Canvas.
- Log in to the Canvas console.
This is where we can manage our datasets and create models.
- Choose Import.
- Choose Upload and select the churn.csv file.
- Choose Import data to upload it to Canvas.
The import process takes approximately 10 seconds (this can vary depending on dataset size). When it’s complete, we can see the dataset is in Ready status.
- To preview the first 100 rows of the dataset, hover your mouse over the eye icon.
A preview of the dataset appears. Here we can verify that our data is correct.
After we confirm that the imported dataset is ready, we create our model.
- Choose New model.
- Select the churn.csv dataset and choose Select dataset.
Now we configure the build model process.
- For Target column, choose the Churn? column.

For Model type, Canvas automatically recommends the model type, in this case 2 category prediction (what a data scientist would call binary classification). This is suitable for our use case because we have only two possible prediction values: True or False, so we go with the recommendation Canvas made.
We now validate some assumptions. We want a quick view into whether our target column can be predicted by the other columns, along with the model’s estimated accuracy and column impact (the estimated importance of each column in predicting the target column).
- Select all 21 columns and choose Preview model.
This feature uses a subset of our dataset and only a single pass at modeling. For our use case, the preview model takes approximately 2 minutes to build.
As shown in the following screenshot, the Phone and State columns have much less impact on our prediction. We want to be careful when removing text input because it can contain important discrete, categorical features contributing to our prediction. Here, the phone number is just the equivalent of an account number and has no value in predicting churn, and the customer’s state doesn’t impact our model much.
- We remove these columns because they have little impact on our prediction.
- After we remove the Phone and State columns, let’s run the preview again.

As shown in the following screenshot, the model accuracy increased by 0.1%. Our preview model has a 95.9% estimated accuracy, and the columns with the biggest impact are Night Calls, Eve Mins, and Night Charge. This gives us insight into which columns impact the performance of our model the most. We need to be careful with feature selection here: if a single feature has an outsized impact on a model’s outcome, that is often a primary indicator of target leakage, meaning the feature may not actually be available at the time of prediction. In this case, several columns showed very similar impact, so we continue building our model.
Canvas offers two build options:
- Standard build – Builds the best model from an optimized process powered by AutoML; speed is exchanged for greatest accuracy
- Quick build – Builds a model in a fraction of the time compared to a standard build; potential accuracy is exchanged for speed
- For this post, we choose the Standard build option because we want the best possible model and are willing to spend additional time waiting for the result.
The build process can take 2–4 hours. During this time, Canvas tests hundreds of candidate pipelines, selecting the best model to present to us. In the following screenshot, we can see the expected build time and progress.
Evaluate model performance
When the model building process is complete, the model predicted churn correctly 97.9% of the time. This seems fine, but as analysts we want to dive deeper and see whether we can trust the model enough to make decisions based on it. On the Scoring tab, we can review a visual plot of our predictions mapped to their outcomes, which gives us deeper insight into the model.
Canvas separates the dataset into training and test sets. The training dataset is the data Canvas uses to build the model. The test set is used to see if the model performs well with new data. The Sankey diagram in the following screenshot shows how the model performed on the test set. To learn more, refer to Evaluating Your Model’s Performance in Amazon SageMaker Canvas.
To get more detailed insights beyond what is displayed in the Sankey diagram, business analysts can use a confusion matrix analysis for their business solutions. For example, we want to better understand the likelihood of the model making false predictions. We can see this in the Sankey diagram, but want more insights, so we choose Advanced metrics. We’re presented with a confusion matrix, which displays the performance of a model in a visual format with the following values, specific to the positive class. We’re measuring based on whether customers will in fact churn, so our positive class is True in this example:
- True Positive (TP) – The number of True results that were correctly predicted as True
- True Negative (TN) – The number of False results that were correctly predicted as False
- False Positive (FP) – The number of False results that were wrongly predicted as True
- False Negative (FN) – The number of True results that were wrongly predicted as False
We can use this matrix to determine not only how accurate our model is, but also how often it is wrong and in what way.
The advanced metrics look good, so we can trust the model. We see very low false positives and false negatives. A false positive occurs when the model predicts that a customer will churn but they actually don’t; a false negative occurs when the model predicts that a customer won’t churn but they actually do. High numbers for either might make us think harder about whether we can use the model to make decisions.
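To make the relationship between these counts and the advanced metrics concrete, the following sketch computes a few common metrics from a confusion matrix. The counts are made-up placeholders, not values from this model.

```python
# Illustrative only: the counts below are placeholders, not Canvas output.
tp, tn, fp, fn = 120, 830, 10, 15

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # of predicted churners, how many actually churn
recall = tp / (tp + fn)                     # of actual churners, how many the model catches
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
```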
Let’s go back to the Overview tab to review the impact of each column. This information can help the marketing team gain insights that lead to actions that reduce customer churn. For example, we can see that both low and high CustServ Calls values increase the likelihood of churn. The marketing team can take actions to prevent customer churn based on these learnings. Examples include creating a detailed FAQ on websites to reduce customer service calls, and running education campaigns with customers on the FAQ that can keep engagement up.
Our model looks pretty accurate. We can perform an interactive prediction directly on the Predict tab, either as a batch or as a single (real-time) prediction. In this example, we made a few changes to certain column values and performed a real-time prediction. Canvas shows us the prediction result along with the confidence level.
Let’s say we have an existing customer who has the following usage: Night Mins is 40 and Eve Mins is 40. We can run a prediction, and our model returns a confidence score of 93.2% that this customer will churn (True). We might now choose to provide promotional discounts to retain this customer.
Running one prediction is great for individual what-if analysis, but we also need to run predictions on many records at once. Canvas is able to run batch predictions, which allows you to run predictions at scale.
Conclusion
In this post, we showed how a business analyst can create a customer churn model with SageMaker Canvas using sample data. Canvas allows your business analysts to create accurate ML models and generate predictions using a no-code, visual, point-and-click interface. A marketing analyst can now use this information to run targeted retention campaigns and test new campaign strategies faster, leading to a reduction in customer churn.
Analysts can take this to the next level by sharing their models with data scientist colleagues. The data scientists can view the Canvas model in Amazon SageMaker Studio, where they can explore the choices Canvas AutoML made, validate model results, and even productionalize the model with a few clicks. This can accelerate ML-based value creation and help scale improved outcomes faster.
To learn more about using Canvas, see Build, Share, Deploy: how business analysts and data scientists achieve faster time-to-market using no-code ML and Amazon SageMaker Canvas. For more information about creating ML models with a no-code solution, see Announcing Amazon SageMaker Canvas – a Visual, No Code Machine Learning Capability for Business Analysts.
About the Author
Henry Robalino is a Solutions Architect at AWS, based out of NJ. He is passionate about cloud and machine learning, and the role they can play in society. He achieves this by working with customers to help them achieve their business goals using the AWS Cloud. Outside of work, you can find Henry traveling or exploring the outdoors with his fur daughter Arly.
Chaoran Wang is a Solution Architect at AWS, based in Dallas, TX. He has been working at AWS since graduating from the University of Texas at Dallas in 2016 with a master’s in Computer Science. Chaoran helps customers build scalable, secure, and cost-effective applications and find solutions to solve their business challenges on the AWS Cloud. Outside work, Chaoran loves spending time with his family and two dogs, Biubiu and Coco.
Amazon and UCLA announce recipients of gift awards, graduate fellowships
The UCLA Science Hub seeks to address challenges to humanity through research using artificial intelligence, bringing together academic and industry scientists.Read More
Deploy and manage machine learning pipelines with Terraform using Amazon SageMaker
AWS customers are relying on Infrastructure as Code (IaC) to design, develop, and manage their cloud infrastructure. IaC ensures that customer infrastructure and services are consistent, scalable, and reproducible, while being able to follow best practices in the area of development operations (DevOps).
One possible approach to manage AWS infrastructure and services with IaC is Terraform, which allows developers to organize their infrastructure in reusable code modules. This approach is gaining importance in the area of machine learning (ML). Developing and managing ML pipelines, including training and inference, with Terraform as IaC lets you easily scale for multiple ML use cases or Regions without having to develop the infrastructure from scratch. Furthermore, it provides consistency for the infrastructure (for example, instance type and size) for training and inference across different implementations of the ML pipeline. This lets you route requests and incoming traffic to different Amazon SageMaker endpoints.
In this post, we show you how to deploy and manage ML pipelines using Terraform and Amazon SageMaker.
Solution overview
This post provides code and walks you through the steps necessary to deploy AWS infrastructure for ML pipelines with Terraform for model training and inference using Amazon SageMaker. The ML pipeline is managed via AWS Step Functions to orchestrate the different steps implemented in the ML pipeline, as illustrated in the following figure.
Step Functions starts an AWS Lambda function, generating a unique job ID, which is then used when starting a SageMaker training job. Step Functions also creates a model, endpoint configuration, and endpoint used for inference. Additional resources include the following:
- AWS Identity and Access Management (IAM) roles and policies attached to the resources in order to enable interaction with other resources
- Amazon Simple Storage Service (Amazon S3) buckets for training data and model output
- An Amazon Elastic Container Registry (Amazon ECR) repository for the Docker image containing the training and inference logic
The ML-related code for training and inference with a Docker image relies mainly on existing work in the following GitHub repository.
The following diagram illustrates the solution architecture:
We walk you through the following high-level steps:
- Deploy your AWS infrastructure with Terraform.
- Push your Docker image to Amazon ECR.
- Run the ML pipeline.
- Invoke your endpoint.
Repository structure
You can find the code and data used for this post in the following GitHub repository.
The repository includes the following directories:
- /terraform – Consists of the following subfolders:
  - ./infrastructure – Contains the main.tf file calling the ML pipeline module, in addition to variable declarations that we use to deploy the infrastructure
  - ./ml-pipeline-module – Contains the Terraform ML pipeline module, which we can reuse
- /src – Consists of the following subfolders:
  - ./container – Contains example code for training and inference with the definitions for the Docker image
  - ./lambda_function – Contains the Python code for the Lambda function generating configurations, such as a unique job ID for the SageMaker training job
- /data – Contains the following file:
  - ./iris.csv – Contains data for training the ML model
Prerequisites
For this walkthrough, you should have the following prerequisites:
- An AWS account
- Terraform version 0.13.5 or greater
- AWS Command Line Interface (AWS CLI) v2
- Python 3.7 or greater
- Docker
Deploy your AWS infrastructure with Terraform
To deploy the ML pipeline, you need to adjust a few variables and names according to your needs. The code for this step is in the /terraform directory.
When initializing for the first time, open the file terraform/infrastructure/terraform.tfvars
and adjust the variable project_name to the name of your project, in addition to the variable region if you want to deploy in another Region. You can also change additional variables such as instance types for training and inference.
Then deploy the infrastructure with Terraform: run terraform init to initialize the working directory and modules, followed by terraform plan and terraform apply to create the resources.
Check the output and make sure that the planned resources appear correctly, and confirm with yes in the apply stage if everything is correct. Then go to the Amazon ECR console (or check the output of Terraform in the terminal) and get the URL for your ECR repository that you created via Terraform.
The Terraform output includes, among other values, the URL of the ECR repository that was created.
Push your Docker image to Amazon ECR
For the ML pipeline and SageMaker to train and provision a SageMaker endpoint for inference, you need to provide a Docker image and store it in Amazon ECR. You can find an example in the directory src/container. If you have already applied the AWS infrastructure from the earlier step, you can push the Docker image as described. After your Docker image is developed, authenticate Docker to your registry (for example, with aws ecr get-login-password), then build, tag, and push the image to the ECR repository URL from the Terraform output (adjust the Amazon ECR URL according to your needs). If you have already applied the AWS infrastructure with Terraform, you can push changes to your code and Docker image directly to Amazon ECR without deploying via Terraform again.
Run the ML pipeline
To train and run the ML pipeline, go to the Step Functions console and start an execution of the state machine. You can check the progress of each step in the visualization of the state machine. You can also check the SageMaker training job progress and the status of your SageMaker endpoint.
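If you prefer to start the pipeline programmatically instead of from the console, a sketch with the AWS SDK for Python (Boto3) could look like the following. The state machine ARN is a placeholder, and whether the execution needs input parameters depends on the state machine definition in the repository.

```python
# Illustrative sketch: start the Step Functions state machine with Boto3.
# Replace the placeholder ARN with the one from the Terraform output or the
# Step Functions console.
import json
import boto3

sfn = boto3.client("stepfunctions")

execution = sfn.start_execution(
    stateMachineArn="arn:aws:states:<region>:<account-id>:stateMachine:<your-state-machine>",
    input=json.dumps({}),  # adjust if your state machine expects input parameters
)
print(execution["executionArn"])
```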
After you successfully run the state machine in Step Functions, you can see that the SageMaker endpoint has been created. On the SageMaker console, choose Inference in the navigation pane, then Endpoints. Make sure to wait for the status to change to InService.
Invoke your endpoint
To invoke your endpoint (in this example, for the iris dataset), you can use the following Python script with the AWS SDK for Python (Boto3). You can do this from a SageMaker notebook, or embed the following code snippet in a Lambda function:
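A minimal sketch of such an invocation is shown below. The endpoint name is a placeholder for the endpoint created by the Step Functions run, and the CSV payload format assumes the convention used by the example container in src/container.

```python
# Illustrative sketch: invoke the deployed SageMaker endpoint with one iris sample.
# Replace the placeholder endpoint name with the one created by the pipeline.
import boto3

runtime = boto3.client("sagemaker-runtime")

payload = "5.1,3.5,1.4,0.2"  # one row of iris features as CSV (assumed input format)

response = runtime.invoke_endpoint(
    EndpointName="<your-sagemaker-endpoint-name>",
    ContentType="text/csv",
    Body=payload,
)

print(response["Body"].read().decode("utf-8"))
```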
Clean up
You can destroy the infrastructure created by Terraform with the command terraform destroy, but you need to delete the data and files in the S3 buckets first. Furthermore, the SageMaker endpoint (or multiple SageMaker endpoints if run multiple times) is created via Step Functions and not managed via Terraform. This means that the deployment happens when running the ML pipeline with Step Functions. Therefore, make sure you delete the SageMaker endpoint or endpoints created via the Step Functions ML pipeline as well to avoid unnecessary costs. Complete the following steps:
- On the Amazon S3 console, delete the dataset in the S3 training bucket.
- Delete all the models you trained via the ML pipeline in the S3 models bucket, either via the Amazon S3 console or the AWS CLI.
- Destroy the infrastructure created via Terraform by running terraform destroy.
- Delete the SageMaker endpoints, endpoint configuration, and models created via Step Functions, either on the SageMaker console or via the AWS CLI.
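For the last step, the following sketch uses the AWS SDK for Python (Boto3) as an alternative to the console or the AWS CLI. The resource names are placeholders for the ones created by your Step Functions run.

```python
# Illustrative cleanup sketch: delete the SageMaker resources created by the
# Step Functions run. Replace the placeholder names with your actual resources.
import boto3

sm = boto3.client("sagemaker")

sm.delete_endpoint(EndpointName="<endpoint-created-by-the-pipeline>")
sm.delete_endpoint_config(EndpointConfigName="<endpoint-config-created-by-the-pipeline>")
sm.delete_model(ModelName="<model-created-by-the-pipeline>")
```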
Conclusion
Congratulations! You’ve deployed an ML pipeline using SageMaker with Terraform. This example solution shows how you can easily deploy AWS infrastructure and services for ML pipelines in a reusable fashion. This allows you to scale for multiple use cases or Regions, and enables training and deploying ML models with one click in a consistent way. Furthermore, you can run the ML pipeline multiple times, for example, when new data is available or you want to change the algorithm code. You can also choose to route requests or traffic to different SageMaker endpoints.
I encourage you to explore adding security features and adopting security best practices according to your needs and potential company standards. Additionally, embedding this solution into your CI/CD pipelines will give you further capabilities in adopting and establishing DevOps best practices and standards according to your requirements.
About the Author
Oliver Zollikofer is a Data Scientist at Amazon Web Services. He enables global enterprise customers to build, train and deploy machine learning models, as well as managing the ML model lifecycle with MLOps. Further, he builds and architects related cloud solutions.
The science behind ultrasonic motion sensing for Echo
Reducing false positives for rare events, adapting Echo hardware to ultrasound sensing, and enabling concurrent ultrasound sensing and music playback are just a few challenges Amazon researchers addressed.Read More