Easily create and store features in Amazon SageMaker without code

Data scientists and machine learning (ML) engineers often prepare their data before building ML models. Data preparation typically includes data preprocessing and feature engineering. You preprocess data by transforming data into the right shape and quality for training, and you engineer features by selecting, transforming, and creating variables when building a predictive model.

Amazon SageMaker helps you perform these tasks by simplifying feature preparation with Amazon SageMaker Data Wrangler and storage and feature serving with Amazon SageMaker Feature Store. You can prepare your data and engineer features using over 300 built-in transformations with Data Wrangler. Then you can persist those features to a purpose-built feature store for ML with Feature Store. These services help you build automatic and repeatable processes to streamline your data preparation tasks, all without writing code.

We’re excited to announce a new capability that seamlessly integrates Data Wrangler with Feature Store. You can now easily create features with Data Wrangler and store those features in Feature Store with just a few clicks in Amazon SageMaker Studio.

In this post, we demonstrate creating features with Data Wrangler and persisting them in Feature Store using the hotel booking demand dataset. We focus on the data preparation and feature engineering tasks to show how easily you can create and store features in SageMaker without code using Data Wrangler. After the features are stored, they can be used for training and inference by multiple models and teams.

Solution overview

To demonstrate feature engineering and feature storage, we use a hotel booking demand dataset. You can download the dataset and view the full description of each variable. The dataset contains information such as when a hotel booking was made, the booking location, the length of stay, the number of parking spaces, and other features.

Our goal is to engineer features to predict if a user will cancel a booking.

We host the dataset in an Amazon Simple Storage Service (Amazon S3) bucket. We also open a Studio domain to utilize the native Data Wrangler and Feature Store capabilities. We import the dataset into a Data Wrangler flow and define the data transformation steps we want to apply using the Data Wrangler user interface (UI). We then have SageMaker run our feature engineering steps and store the features in Feature Store.

The following diagram illustrates the solution workflow.

To demonstrate Data Wrangler’s feature engineering steps, we assume we’ve already conducted exploratory data analysis (EDA). EDA helps you understand your data by identifying patterns in your data. For example, we might find that customers who book resort hotels tend to stay longer than city hotels. Or customers that stay over the weekend purchase more meals. Because these patterns aren’t evident with data in tables, data scientists use visualization tools to help identify patterns. EDA is often a necessary step to determine which features to create, delete, and transform.

If you already have features ready to export to Feature Store, you can navigate to the Save features to Feature Store section to learn how you can easily save your prepared features to Feature Store.

Prerequisites

If you want to follow along with this post, you should have the following prerequisites:

Create features with Data Wrangler

To create features with Data Wrangler, complete the following steps:

  1. Enter your Studio domain.
  2. Choose Data Wrangler as your resource to view.
  3. Choose New flow.
  4. Choose Import and import your data.

You can see a preview of the data in the Data Wrangler UI when selecting your dataset. You can also choose a sampling method. Because our dataset is relatively small, we choose not to sample our data. The flow editor now shows two steps in the UI, representing the step you took to import the data and a data validation step Data Wrangler automatically completes for you.

  1. Choose the plus sign next to Data types and choose Add transform.

Assuming we’ve spent time in EDA, we can remove redundant columns that contribute to target leakage. Target leakage occurs when some data in a training dataset is strongly correlated with the target label, but isn’t available in real-world data. After we conduct a target leakage analysis, we determine we should drop redundant columns. Data Wrangler helped identify 10 columns to drop.

  1. Add a step and choose the Drop column transform step.

Additionally, we determine we can remove columns like agent and adults after a multicollinearity analysis. Multicollinearity is the presence of high correlations between two or more independent variables. We usually want to avoid variables that are highly correlated with each other because they can lead to misleading and inaccurate models.

We also want to drop duplicate rows. In our case, nearly 28% of all rows in our dataset are duplicates. Because duplicates may have undesirable effects on our model, we use a transform to remove them.

  1. Add a new transform and choose Manage rows from the list of available transforms.
  2. Choose Drop duplicates on the Transform drop-down menu.
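For readers who want to see what this transform does conceptually, the following is a minimal plain-Python sketch. It is not Data Wrangler's implementation; the sample rows and the drop_duplicates helper are hypothetical and only illustrate keeping the first occurrence of each fully identical row.

```python
# Hypothetical sample rows; column names follow the hotel booking dataset.
rows = [
    {"hotel": "Resort Hotel", "lead_time": 342, "children": 0},
    {"hotel": "City Hotel", "lead_time": 7, "children": 1},
    {"hotel": "Resort Hotel", "lead_time": 342, "children": 0},  # exact duplicate
]


def drop_duplicates(records):
    """Keep the first occurrence of each fully identical row."""
    seen = set()
    unique = []
    for record in records:
        # A row is identified by all of its (column, value) pairs.
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


deduplicated = drop_duplicates(rows)
duplicate_share = 1 - len(deduplicated) / len(rows)
print(f"{len(deduplicated)} unique rows, {duplicate_share:.0%} duplicates removed")
```

The share of removed rows computed this way corresponds to the duplicate percentage quoted for the dataset.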

Next, we want to handle missing values. We find that many hotel guests didn’t travel with children, and have a blank value for the children column. We can replace this blank value with 0.

  1. Choose Handle missing as the transform step and Fill missing as the transform type.
  2. Add a transform to fill blank values with the 0 value by choosing children as the input column.

From our EDA, we see that there are many missing values for the country column. However, the data reveals most of the hotel guests are from Europe. We determine that missing country column values can be replaced with the most commonly occurring country—Portugal (PRT).

  1. Choose the Handle missing transform step and choose Fill missing as the transform type.
  2. Choose country as the input column, and enter PRT as the Fill value.
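The two Fill missing steps above can be sketched in plain Python as follows. This is an illustration under simple assumptions (missing values are represented as None, and fill_missing is a hypothetical helper), not the code that Data Wrangler runs.

```python
# Hypothetical rows with missing values represented as None.
rows = [
    {"children": None, "country": "GBR"},
    {"children": 2, "country": None},
    {"children": None, "country": None},
]


def fill_missing(records, column, fill_value):
    """Return copies of the records with None in `column` replaced by `fill_value`."""
    return [
        {**record, column: fill_value if record[column] is None else record[column]}
        for record in records
    ]


# Blank children values become 0; blank countries become the most common value, PRT.
filled = fill_missing(rows, "children", 0)
filled = fill_missing(filled, "country", "PRT")
print(filled)
```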

ML algorithms that use gradient descent as an optimization technique, such as linear regression, logistic regression, and neural networks, generally need their input data to be scaled. Normalization (also known as min-max scaling) is a scaling technique that transforms values to be in the range of 0–1. Standardization is another scaling technique, where the values are centered around the mean and rescaled to unit standard deviation. In our case, we normalize the numeric feature columns to the range [0, 1].

  1. Choose the Process numeric transform step and Scale values as the transform type.
  2. Choose Min-max scaler as the scaler and lead_time, booking_changes, adr, and others as the input columns.
  3. Leave 0 as Min and 1 as Max default values.
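To make the difference between the two scaling techniques concrete, here is a small sketch of both formulas in plain Python. The function names and the sample lead_time values are assumptions for illustration; Data Wrangler applies the equivalent transform for you.

```python
from statistics import mean, pstdev


def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Normalization: x' = new_min + (x - min) * (new_max - new_min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no range; map every value to new_min.
        return [new_min for _ in values]
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]


def standardize(values):
    """Standardization: x' = (x - mean) / standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


lead_time = [0, 50, 100, 200]  # hypothetical values
scaled = min_max_scale(lead_time)
z_scores = standardize(lead_time)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

With Min set to 0 and Max set to 1, the smallest value in each column maps to 0 and the largest to 1.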

We also want to handle categorical data by representing it as numeric values. For example, if your categories are Dog and Cat, you may encode this information into two vectors, [1,0] to represent Dog, and [0,1] to represent Cat. For our dataset, we use one-hot encoding, which represents each category in a column as a binary vector with a single 1 in the position assigned to that category.

  1. Choose the One-hot encode transform type from the Encode categorical transform.
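The Dog and Cat example can be reproduced with a few lines of Python. This is only a sketch of the idea; one_hot_encode is a hypothetical helper, and the order of the positions (first-seen order here) is an arbitrary choice.

```python
def one_hot_encode(values):
    """Map each category to a binary vector with a single 1."""
    # Preserve first-seen order so the positions are deterministic.
    categories = list(dict.fromkeys(values))
    index = {category: position for position, category in enumerate(categories)}
    vectors = []
    for value in values:
        vector = [0] * len(categories)
        vector[index[value]] = 1
        vectors.append(vector)
    return categories, vectors


categories, vectors = one_hot_encode(["Dog", "Cat", "Dog"])
print(categories)  # ['Dog', 'Cat']
print(vectors)     # [[1, 0], [0, 1], [1, 0]]
```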

ML models are sensitive to the distribution and range of your feature values. Outliers can negatively impact model accuracy and lead to longer training times. For our dataset, we apply the standard deviation numeric outliers transform with a set of configuration values as shown in the following screenshot. We apply this transform on the numeric columns.

  1. Choose the Standard Deviation Numeric Outliers transform type from the Handle outliers transform.
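As a rough illustration of a standard-deviation-based outlier rule, the sketch below clips values that fall more than a chosen number of standard deviations from the mean. The threshold, the clipping strategy, and the sample adr values are assumptions; the transform in Data Wrangler is configured in the UI and may handle outliers differently depending on the options you choose.

```python
from statistics import mean, pstdev


def clip_outliers(values, num_std=3.0):
    """Clip values outside mean +/- num_std * standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    lower, upper = mu - num_std * sigma, mu + num_std * sigma
    return [min(max(v, lower), upper) for v in values]


# Hypothetical average daily rates with one extreme value.
adr = [80.0, 95.0, 100.0, 105.0, 110.0, 5400.0]
clipped = clip_outliers(adr, num_std=2.0)
print(clipped)
```

Only the extreme value is pulled back to the upper bound; the remaining values are unchanged.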

Lastly, we want to balance the target variable for class imbalance. In Data Wrangler, we can handle class imbalance using three different techniques:

  • Random undersample
  • Random oversample
  • SMOTE

  1. In the Data Wrangler transform pane, choose Balance data as the group and choose Random oversample for the Transform field.

The ratio of positive to negative cases is around 0.38 before balancing.

After oversampling and balancing the dataset, the ratio equates to 1.
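The effect of random oversampling on the class ratio can be sketched as follows. The helper and the synthetic bookings (38 canceled, 100 not canceled, chosen to mirror the 0.38 ratio) are hypothetical; the sketch resamples minority-class rows with replacement until every class is as large as the biggest one.

```python
import random


def random_oversample(records, label_key, seed=0):
    """Duplicate randomly chosen minority rows until all classes have equal size."""
    rng = random.Random(seed)
    by_class = {}
    for record in records:
        by_class.setdefault(record[label_key], []).append(record)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap to the largest class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


def positive_ratio(records, label_key):
    positives = sum(1 for r in records if r[label_key] == 1)
    return positives / (len(records) - positives)


bookings = [{"is_canceled": 1}] * 38 + [{"is_canceled": 0}] * 100
balanced = random_oversample(bookings, "is_canceled")
print(positive_ratio(bookings, "is_canceled"), positive_ratio(balanced, "is_canceled"))
```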

Now that we’ve completed our feature engineering tasks, we’re ready to export our features to Feature Store with one click.

Save features to Feature Store

You can easily export your generated features to SageMaker Feature Store by selecting it as the destination.

You can save the features into an existing feature group or create a new one. For this post, we create a new feature group. Studio directs you to a new tab where you can create a new feature group.

  1. Choose the plus sign, choose Export to, and choose SageMaker Feature Store.

  1. Choose Create Feature Group.

  1. Optionally, select Create “EventTime” column.
  2. Choose Next.

  1. Copy the JSON schema, then choose Create.

  1. Provide a feature group name and an optional description for your feature group.
  2. Select a feature group storage configuration that is either online or offline, or both.

Online stores serve features with low millisecond latency for real-time inference, whereas offline stores are ideal for retrieving your features for training models or for batch scoring. Additionally, you can run queries on your offline feature stores by registering your features in an AWS Glue Data Catalog. For more information, see Query Feature Store with Athena and AWS Glue.

  1. Choose Continue.

Next, you specify the feature definitions. You specify the data type (string, integral, fractional) for each feature definition.

  1. Enter the JSON schema from the previous step to define your feature definitions.
  2. Choose Continue.
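To illustrate how the three data types relate to the values in a record, the following sketch derives a list of feature definitions from one sample row. The FeatureName and FeatureType keys mirror the pairs used by the Feature Store API, but the exact JSON schema that Data Wrangler generates may look different, and the record, column names, and helper functions here are hypothetical.

```python
import json


def infer_feature_type(value):
    """Map a Python value to one of the three Feature Store data types."""
    if isinstance(value, bool) or isinstance(value, int):
        return "Integral"
    if isinstance(value, float):
        return "Fractional"
    return "String"


def build_feature_definitions(record):
    return [
        {"FeatureName": name, "FeatureType": infer_feature_type(value)}
        for name, value in record.items()
    ]


# Hypothetical prepared row, including an event time stored as a Unix timestamp.
sample_record = {
    "distribution_channel": "Direct",
    "lead_time": 0.46,
    "booking_changes": 3,
    "EventTime": 1646092800.0,
}
feature_definitions = build_feature_definitions(sample_record)
print(json.dumps(feature_definitions, indent=2))
```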

  1. Next, you specify a record identifier name and a timestamp to uniquely identify a record within a feature group.

The record identifier name must refer to one of the names of a feature defined in the feature group’s feature definition. In our case, we use the existing identifier, distribution-channel, which was in our source dataset, and EventTime.

  1. Choose Continue.

  1. Lastly, apply any relevant tags and review your feature group details.
  2. Choose Create feature group to finalize the process.

  1. After we create our feature group, we can return to the Data Wrangler flow UI.
  2. Choose the plus sign, choose Add destination, and choose SageMaker Feature Store.

  1. We choose the desired destination feature group to ensure that the features we’re storing match the feature group schema.

If the newly created feature group doesn’t show up in the UI, refresh the list to reload the groups.

  1. Choose the message under the Validation column to have Data Wrangler validate the schema of the dataset with the schema of the feature group.

If you missed specifying the event time column, Data Wrangler will notify you of an error and request that you add one to your dataset.

Once validated, Data Wrangler informs you that the data frame matches the feature group schema.

  1. If you enabled both the online and offline stores for the feature group, you can optionally select Write to offline store only to only ingest data to the offline store.

This is helpful for historical data backfilling scenarios.

  1. Choose Add to add another step to our Data Wrangler flow.
  2. With all our steps defined, choose Create job to run our ML workflow from feature engineering to ingesting features into our feature group.

  1. Give the job a name, then provide the job specifications like the type and number of instances.
  2. Choose Run.

Congratulations! You’ve successfully engineered features using Data Wrangler and stored them in a persistent feature store without writing any code. You can easily explore features, see details of your feature group, and update the feature group schema when necessary.

Conclusion

In this post, we created features with Data Wrangler, and easily stored those features in Feature Store. We showed an example workflow for feature engineering in the Data Wrangler UI. Then we saved those features into Feature Store directly from Data Wrangler by creating a new feature group. Finally, we ran a processing job to ingest those features into Feature Store. These services helped us build automatic and repeatable processes to streamline our data preparation tasks, all without writing code.

With this new integration, you can accelerate your ML tasks with a more streamlined experience between feature engineering and feature ingestion. For more information, refer to Get Started with Data Wrangler and Get started with Amazon SageMaker Feature Store.


About the Authors

Peter Chung is a Solutions Architect for AWS, and is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions in both the public and private sectors. He holds all AWS certifications as well as two GCP certifications. He enjoys coffee, cooking, staying active, and spending time with his family.

Patrick Lin is a Software Development Engineer with Amazon SageMaker Data Wrangler. He is committed to making Amazon SageMaker Data Wrangler the number one data preparation tool for productionized ML workflows. Outside of work, you can find him reading, listening to music, having conversations with friends, and serving at his church.

Ziyao Huang is a Software Development Engineer with Amazon SageMaker Data Wrangler. He is passionate about building great products that make ML easy for customers. Outside of work, Ziyao likes to read and hang out with his friends.


Create train, test, and validation splits on your data for machine learning with Amazon SageMaker Data Wrangler

In this post, we talk about how to split a machine learning (ML) dataset into train, test, and validation datasets with Amazon SageMaker Data Wrangler so you can easily split your datasets with minimal to no code.

Data used for ML is typically split into the following datasets:

  • Training – Used to train an algorithm or ML model. The model iteratively uses the data and learns to provide the desired result.
  • Validation – Introduces new data to the trained model. You can use a validation set to periodically measure model performance as training is happening, and also tune any hyperparameters of the model. However, validation datasets are optional.
  • Test – Used on the final trained model to assess its performance on unseen data. This helps determine how well the model generalizes.

Data Wrangler is a capability of Amazon SageMaker that helps data scientists and data engineers quickly and easily prepare data for ML applications using a visual interface. It contains over 300 built-in data transformations so you can quickly normalize, transform, and combine features without having to write any code.

Today, we’re excited to announce a new data transformation to split datasets for ML use cases within Data Wrangler. This transformation splits your dataset into training, test, and optionally validation datasets without having to write any code.

Overview of the split data transformation

The split data transformation includes four commonly used techniques to split the data for training the model, validating the model, and testing the model:

  • Random split – Splits data randomly into train, test, and, optionally, validation datasets using the percentage specified for each dataset. It ensures that the distribution of the data is similar in all datasets. Choose this option when you don’t need to preserve the order of your input data. For example, consider a movie dataset where the dataset is sorted by genre and you’re predicting the genre of the movie. A random split on this dataset ensures that the distribution of the data includes all genres in all three datasets.
  • Ordered split – Splits data in order, using the percentage specified for each dataset. An ordered split ensures that the data in each split is non-overlapping while preserving the order of the data. When training, we want to avoid past or future information leaking across datasets. The ordered split option prevents data leakage. For example, consider a scenario where you have customer engagement data for the first few months and you want to use this historical data to predict customer engagement in the next month. You can perform this split by providing an optional input column (numeric column). This operation uses the values of a numeric column to ensure that the data in each split doesn’t overlap while preserving the order. This helps avoid data leakage across splits. If no input column is provided, the order of the rows is used, so the data in each split still comes before the data in the next split. This is useful where the rows of the dataset are already ordered (for example, by date) and the model may need to be fit to earlier data and tested on later data.
  • Stratified split – Splits the dataset so that each split is similar with respect to a column specifying different categories for your data, for example, size or country. This split ensures that the train, test, and validation datasets have the same proportions for each category as the input dataset. This is useful with classification problems where we’re trying to ensure that the train and test sets have approximately the same percentage of samples of each target class. Choose this option if you have imbalanced data across different categories and you need to have it balanced across split datasets.
  • Split by key – Takes one or more columns as input (the key) and ensures that no combination of values across the input columns occurs in more than one of the splits (split by key). This is useful to avoid data leakage for unordered data. Choose this option if your data for key columns needs to be in the same split. For example, consider customer transactions split by customer ID; the split ensures that customer IDs don’t overlap across split datasets.

Solution overview

For this post, we demonstrate how to split data into train, test, and validation datasets using the four new split options in Data Wrangler. We use a hotel booking dataset available publicly on Kaggle, which has the year, month, and date that bookings were made, along with reservation statuses, cancellations, repeat customers, and other features.

Prerequisites

Before getting started, upload the dataset to an Amazon Simple Storage Service (Amazon S3) bucket, then import it into Data Wrangler. For instructions, refer to Import data from Amazon S3.

Random split

After we import the data into Data Wrangler, we start the transformation. We first demonstrate a random split.

  1. On the Data Wrangler console, choose the plus sign and choose Add transform.
  2. To add the split data transformation, choose Add step.

    You’re redirected to the page where all transformations are displayed.
  3. Scroll down the list and choose Split data.

    The split data transformation has a drop-down menu that lists the available transformations to split your data, which include random, ordered, stratified, and split by key. By default, Randomized split is displayed.
  4. Choose the default value Randomized split.
  5. In the Splits section, enter the name Train with a 0.8 split percentage, and Test with a 0.2 percentage.
  6. Choose the plus sign to add an additional split.
  7. Add the Validation split with 0.2, and adjust Train to 0.7 and Test to 0.1.

    The split percentage can be any value you want, provided all three splits sum to 1 (100%).

    We can also specify optional fields like Error threshold and Random seed. We can achieve an exact split by setting the error threshold to 0. A smaller error threshold can lead to more processing time for splitting the data. This allows you to control the trade-off between time and accuracy on the operation. The Random seed option is for reproducibility. If not specified, Data Wrangler uses a default random seed value. We leave it blank for the purpose of this post.
  8. To preview your data split, choose Preview.

    The preview page displays the data split. You can choose Train, Test, or Validation on the drop-down menu to review the details of each split.
  9. When you’re satisfied with your data split, choose Add to add the transformation to your Data Wrangler flow.

To analyze the train dataset, choose Add analysis.

You can perform a similar analysis on the validation and test datasets.
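The idea behind a seeded random split can be sketched in plain Python as shown below. This is an illustration only, not how Data Wrangler performs the split internally (it has no equivalent of the Error threshold option); random_split and the integer rows are hypothetical.

```python
import random


def random_split(rows, fractions, seed=None):
    """Shuffle the rows and cut them into named splits by fraction."""
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("Split fractions must sum to 1")
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    names = list(fractions)
    splits, start = {}, 0
    for position, name in enumerate(names):
        is_last = position == len(names) - 1
        # The last split takes whatever remains so no row is lost to rounding.
        end = len(shuffled) if is_last else start + round(fractions[name] * len(shuffled))
        splits[name] = shuffled[start:end]
        start = end
    return splits


rows = list(range(100))
splits = random_split(rows, {"Train": 0.7, "Test": 0.1, "Validation": 0.2}, seed=42)
print({name: len(part) for name, part in splits.items()})
```

Calling the function twice with the same seed returns the same splits, which is the purpose of the Random seed field.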

Ordered split

We now use the hotel bookings dataset to demonstrate an ordered split transformation. The hotel dataset contains rows ordered by date.

  1. Repeat the steps to add a split, and choose Ordered split on the drop-down menu.
  2. Specify your three splits and desired percentages.
  3. Preview your data and choose Add to add the transformation to the Data Wrangler flow.
  4. Use the Add analysis option to verify the splits.
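An ordered split can be pictured as sorting by the optional input column (or keeping the existing row order) and then cutting the rows into consecutive blocks. The following sketch makes that concrete; ordered_split, the arrival_day column, and the sample rows are assumptions for illustration.

```python
def ordered_split(rows, fractions, order_key=None):
    """Cut rows into consecutive blocks, optionally sorting by a numeric key first."""
    ordered = sorted(rows, key=order_key) if order_key is not None else list(rows)
    names = list(fractions)
    splits, start = {}, 0
    for position, name in enumerate(names):
        is_last = position == len(names) - 1
        end = len(ordered) if is_last else start + round(fractions[name] * len(ordered))
        splits[name] = ordered[start:end]
        start = end
    return splits


# Hypothetical bookings, deliberately out of order.
bookings = [{"arrival_day": day} for day in (5, 1, 9, 3, 7, 2, 10, 4, 8, 6)]
splits = ordered_split(
    bookings,
    {"Train": 0.7, "Test": 0.1, "Validation": 0.2},
    order_key=lambda row: row["arrival_day"],
)
for name, part in splits.items():
    print(name, [row["arrival_day"] for row in part])
```

Every row in an earlier split comes before every row in a later split, which is what prevents future information from leaking into the training data.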

Stratified split

In the hotel booking dataset, we have an is_canceled column, which indicates whether the booking was canceled or not. We want to use this column to split the data. A stratified split ensures that the train, test, and validation datasets have the same percentage of samples of is_canceled.

  1. Repeat the steps to add a transformation, and choose Stratified split.
  2. Specify your three splits and desired percentages.
  3. For Input column, choose is_canceled.
  4. Preview your data and choose Add to add the transformation to the Data Wrangler flow.
  5. Use the Add analysis option to verify the splits.
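One way to think about a stratified split is to split each category separately with the same percentages and then combine the pieces. The sketch below follows that approach; it is a simplified illustration with hypothetical data, not Data Wrangler's implementation.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key, fractions, seed=0):
    """Split each class separately so every split keeps the class proportions."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)
    names = list(fractions)
    splits = {name: [] for name in names}
    for group in groups.values():
        rng.shuffle(group)
        start = 0
        for position, name in enumerate(names):
            is_last = position == len(names) - 1
            end = len(group) if is_last else start + round(fractions[name] * len(group))
            splits[name].extend(group[start:end])
            start = end
    return splits


# Hypothetical bookings: 25% canceled.
bookings = [{"id": i, "is_canceled": 1 if i % 4 == 0 else 0} for i in range(200)]
splits = stratified_split(bookings, "is_canceled", {"Train": 0.7, "Test": 0.1, "Validation": 0.2})
for name, part in splits.items():
    share = sum(row["is_canceled"] for row in part) / len(part)
    print(name, len(part), share)
```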

Split by key

The split by key transformation splits the data by the key or multiple keys we specify. This split is useful to avoid having the same data in the split datasets created during transformation and to avoid data leakage.

  1. Repeat the steps to add a transformation, and choose Split by key.
  2. Specify your three splits and desired percentages.
  3. For Key column, we can specify the columns to form the key. For this post, choose the following columns:
    1. is_canceled
    2. arrival_date_year
    3. arrival_date_month
    4. arrival_date_week_number
    5. reservation_status
  4. Preview your data and choose Add to add the transformation to the Data Wrangler flow.
  5. Use the Add analysis option to verify the splits.
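The guarantee that a key never appears in more than one split can be sketched by assigning distinct key values, rather than individual rows, to splits. The example below uses a hypothetical customer_id key; note that in this simplified version the percentages apply to the number of distinct keys, so the row counts per split are only approximate when keys have different numbers of rows.

```python
import random


def split_by_key(rows, key_columns, fractions, seed=0):
    """Assign every distinct key combination to exactly one split."""
    def key_of(row):
        return tuple(row[column] for column in key_columns)

    keys = sorted({key_of(row) for row in rows})
    random.Random(seed).shuffle(keys)
    names = list(fractions)
    assignment, start = {}, 0
    for position, name in enumerate(names):
        is_last = position == len(names) - 1
        end = len(keys) if is_last else start + round(fractions[name] * len(keys))
        for key in keys[start:end]:
            assignment[key] = name
        start = end
    splits = {name: [] for name in names}
    for row in rows:
        splits[assignment[key_of(row)]].append(row)
    return splits


# Hypothetical transactions: 10 customers with 5 transactions each.
transactions = [{"customer_id": i % 10, "amount": i} for i in range(50)]
splits = split_by_key(transactions, ["customer_id"], {"Train": 0.7, "Test": 0.1, "Validation": 0.2})
customers = {name: {row["customer_id"] for row in part} for name, part in splits.items()}
print(customers)
```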

Considerations

The node labeled as Data types cannot be deleted. Deleting a split node deletes all its datasets and downstream datasets and its nodes.

Conclusion

In this post, we demonstrated how to split an input dataset into train, test, and validation datasets with Data Wrangler using the split techniques random, ordered, stratified, and split by key.

To learn more about using data flows with Data Wrangler, refer to Create and Use a Data Wrangler Flow. To get started with Data Wrangler, see Prepare ML Data with Amazon SageMaker Data Wrangler.


About the Authors

Gopi Mudiyala is a Senior Technical Account Manager at AWS. He helps customers in the Financial Services Industry with their operations in AWS. As a machine learning specialist, Gopi works to help customers succeed in their ML journey.

Patrick Lin is a Software Development Engineer with Amazon SageMaker Data Wrangler. He is committed to making Amazon SageMaker Data Wrangler the number one data preparation tool for productionized ML workflows. Outside of work, you can find him reading, listening to music, having conversations with friends, and serving at his church.

Xiyi Li is a Front End Engineer at Amazon SageMaker Data Wrangler. She helps support Amazon SageMaker Data Wrangler and is passionate about building products that provide a great user experience. Outside of work, she enjoys hiking and listening to classical music.

Vishaal Kapoor is a Senior Applied Scientist with AWS AI. He is passionate about helping customers understand their data in Data Wrangler. In his spare time, he mountain bikes, snowboards, and spends time with his family.


How InfoJobs (Adevinta) improves NLP model prediction performance with AWS Inferentia and Amazon SageMaker

This is a guest post co-written by Juan Francisco Fernandez, ML Engineer in Adevinta Spain, and AWS AI/ML Specialist Solutions Architects Antonio Rodriguez and João Moura.

InfoJobs, a subsidiary company of the Adevinta group, provides the perfect match between candidates looking for their next job position and employers looking for the best hire for the openings they need to fill. For this goal, we use natural language processing (NLP) models such as BERT through PyTorch to automatically extract relevant information from users’ CVs at the moment they upload these to our portal.

Performing inference with NLP models can take several seconds when hosted on typical CPU-based instances given the complexity and variety of the fields. This affects the user experience in the job listing web portal. Alternatively, hosting these models on GPU-based instances can prove costly, which makes the solution not feasible for our business. For this solution, we were looking for a way to optimize the latency of predictions, while keeping the costs at a minimum.

To solve this challenge, we initially considered some possible solutions along two axes:

  • Vertical scaling by using bigger general-purpose instances as well as GPU-powered instances.
  • Optimizing our models using openly available techniques such as quantization or open tools such as ONNX.

Neither option, whether individually or combined, was able to provide the needed performance at an affordable cost. After benchmarking our full range of options with the help of AWS AI/ML Specialists, we found that compiling our PyTorch models with AWS Neuron and using AWS Inferentia to host them on Amazon SageMaker endpoints offered a reduction of up to 92% in prediction latency, at 75% lower cost when compared to our best initial alternatives. It was, in other words, like having the best of GPU power at CPU cost.

Amazon Comprehend is a plug-and-play managed NLP service that uses machine learning to automatically uncover valuable insights and connections in text. However, in this particular case we wanted to use fine-tuned models for the task.

In this post, we share a summary of the benchmarks performed and an example of how to use AWS Inferentia with SageMaker to compile and host NLP models. We also describe how InfoJobs is using this solution to optimize the inference performance of NLP models, extracting key information from users’ CVs in a cost-efficient way.

Overview of solution

First, we had to evaluate the different options available on AWS to find the best balance between performance and cost to host our NLP models. The following diagram summarizes the most common alternatives for real-time inference, most of which were explored during our collaboration with AWS.

Inference options diagram

Hosting options benchmark on SageMaker

We started our tests with a publicly available pre-trained model from the Hugging Face model hub, bert-base-multilingual-uncased. This is the same base model used by InfoJobs’s CV key value extraction model. For this purpose, we deployed this model to a SageMaker endpoint using different combinations of instance types: CPU-based, GPU-based, or AWS Inferentia-based. We also explored optimization with Amazon SageMaker Neo and compilation with AWS Neuron where appropriate.

In this scenario, deploying our model to a SageMaker endpoint with an AWS Inferentia instance yielded 96% faster inference times compared to CPU instances and 44% faster inference times compared to GPU instances in the same range of cost and specs. This allows us to respond to 15 times more inferences than using CPU instances, or 4 times more inferences than using GPU instances at the same cost.

Based on the encouraging first results, our next step was to validate our tests on the actual model used by InfoJobs. This is a more complex model that requires PyTorch quantization for performance improvement, so we expected worse results compared to the previous standard case with bert-base-multilingual-uncased. The results of our tests for this model are summarized in the following table (based on public pricing in Region us-east-1 as of February 20, 2022).

Category         Mode        Instance type (example)   p50 inference latency (ms)   TPS   Cost per hour (USD)   Inferences per hour   Cost per million inferences (USD)
CPU              Normal      m5.xlarge                 1400                         2     0.23                  5606                  41.03
CPU              Optimized   m5.xlarge                 1105                         2     0.23                  7105                  32.37
GPU              Normal      g4dn.xlarge               800                          18    0.736                 64800                 11.36
GPU              Optimized   g4dn.xlarge               700                          21    0.736                 75600                 9.74
AWS Inferentia   Compiled    inf1.xlarge               57                           33    0.297                 120000                2.48

The following graph shows real-time inference response times for the InfoJobs model (less is better). In this case, inference latency is 75-92% lower compared to the CPU and GPU options.

Inference latency graph

This also means the cost of running inferences is 4-13 times lower compared to the CPU and GPU options, as shown in the following graph of cost per million inferences.

Inference cost graph
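The cost column in the benchmark table follows directly from the hourly price and the hourly throughput: cost per million inferences = cost per hour / inferences per hour × 1,000,000. The short sketch below recomputes it from the figures in the table; the dictionary labels and helper name are only for illustration.

```python
def cost_per_million(cost_per_hour_usd, inferences_per_hour):
    """Cost in USD of serving one million inferences at the measured throughput."""
    return cost_per_hour_usd / inferences_per_hour * 1_000_000


# (cost per hour in USD, inferences per hour), copied from the benchmark table.
benchmarks = {
    "CPU normal (m5.xlarge)": (0.23, 5606),
    "CPU optimized (m5.xlarge)": (0.23, 7105),
    "GPU normal (g4dn.xlarge)": (0.736, 64800),
    "GPU optimized (g4dn.xlarge)": (0.736, 75600),
    "AWS Inferentia compiled (inf1.xlarge)": (0.297, 120000),
}

costs = {name: cost_per_million(*figures) for name, figures in benchmarks.items()}
for name, cost in costs.items():
    print(f"{name}: {cost:.3f} USD per million inferences")
```

Dividing the CPU or GPU figures by the AWS Inferentia figure gives the cost ratio between the options.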

We must highlight that no further optimizations were made to the inference code during these non-extensive tests. However, the performance and cost benefits we saw from using AWS Inferentia exceeded our initial expectations, and enabled us to proceed to production. In the future, we will continue to optimize with other features of Neuron, such as NeuronCore Pipeline or the PyTorch-specific DataParallel API. We encourage you to explore and compare the results for your specific use case and model.

Compiling for AWS Inferentia with SageMaker Neo

You don’t need to use the Neuron SDK directly to compile your model and be able to host it on AWS Inferentia instances.

SageMaker Neo automatically optimizes machine learning (ML) models for inference on cloud instances and edge devices to run faster with no loss in accuracy. In particular, Neo is capable of compiling a wide variety of transformer-based models, making use of the Neuron SDK in the background. This allows you to get the benefit of AWS Inferentia by using APIs that are integrated with the familiar SageMaker SDK, with no required context switch.

In this section, we go through an example in which we show you how to compile a BERT model with Neo for AWS Inferentia. We then deploy that model to a SageMaker endpoint. You can find a sample notebook describing the whole process in detail on GitHub.

First, we need to create a sample input to trace our model with PyTorch and create a tar.gz file, with the model being its only content. This is a required step to have Neo compile our model artifact (for more information, see Prepare Model for Compilation). For demonstration purposes, the model is initialized as a mock model for sequence classification that hasn’t been fine-tuned on the task at all. In reality, you would replace the model identifier with your selected model from the Hugging Face model hub or a locally saved model artifact. See the following code:

import tarfile

import torch
import transformers

# Demonstration model only: replace the identifier with your own model
# from the Hugging Face model hub or a locally saved model artifact.
tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert-base-multilingual-uncased")
model = transformers.AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-multilingual-uncased", return_dict=False
)

seq_0 = "This is just sample text for model tracing, the length of the sequence does not matter because we will pad to the max length that Bert accepts."
seq_1 = seq_0
max_length = 512

# Pad the sample input to the maximum length so the traced graph has a fixed shape.
tokenized_sequence_pair = tokenizer.encode_plus(
    seq_0, seq_1, max_length=max_length, padding="max_length", truncation=True, return_tensors="pt"
)

example = tokenized_sequence_pair["input_ids"], tokenized_sequence_pair["attention_mask"]

# Trace the model in evaluation mode and save the TorchScript artifact.
traced_model = torch.jit.trace(model.eval(), example)
traced_model.save("model.pth")

# Package the traced model as the only content of the tar.gz file.
# The with statement closes the archive, so no explicit close() is needed.
with tarfile.open("model.tar.gz", "w:gz") as f:
    f.add("model.pth")

It’s important to set the return_dict parameter to False when loading a pre-trained model, because Neuron compilation does not support dictionary-based model outputs. We upload our model.tar.gz file to Amazon Simple Storage Service (Amazon S3), saving its location in a variable named traced_model_url.

We then use the PyTorchModel SageMaker API to instantiate and compile our model:

from sagemaker.pytorch.model import PyTorchModel
from sagemaker.predictor import Predictor
import json

traced_sm_model = PyTorchModel(
    model_data=traced_model_url,
    predictor_cls=Predictor,
    framework_version="1.5.1",
    role=role,
    sagemaker_session=sagemaker_session,
    entry_point="inference_inf1.py",
    source_dir="code",
    py_version="py3",
    name="inf1-bert-base-multilingual-uncased",
)

compiled_inf1_model = traced_sm_model.compile(
    target_instance_family="ml_inf1",
    input_shape={"input_ids": [1, 512], "attention_mask": [1, 512]},
    job_name="testing_inf1_neo",
    role=role,
    framework="pytorch",
    framework_version="1.5.1",
    output_path=f"s3://{sm_bucket}/{your_model_destination}",
    compiler_options=json.dumps("--dtype int64"),
)

Compilation may take a few minutes. As you can see, our entry_point to model inference is our inference_inf1.py script. It determines how our model is loaded, how input and output are preprocessed, and how the model is used for prediction. Check out the full script on GitHub.
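To make the role of the entry point more concrete, the sketch below shows what the input and output hooks of such a script might look like. It is not the script from the GitHub repository. The input_fn and output_fn names follow the SageMaker PyTorch serving convention, while the payload layout (a JSON array of two strings) and the response fields are assumptions made for illustration. The model loading and prediction hooks (model_fn and predict_fn) are omitted because they depend on the tokenizer and the compiled model.

```python
import json

JSON_CONTENT_TYPE = "application/json"


def input_fn(serialized_input, content_type=JSON_CONTENT_TYPE):
    """Deserialize the request body into a pair of text sequences."""
    if content_type != JSON_CONTENT_TYPE:
        raise ValueError(f"Unsupported content type: {content_type}")
    payload = json.loads(serialized_input)
    if not isinstance(payload, list) or len(payload) != 2:
        raise ValueError("Expected a JSON array with exactly two sequences")
    seq_0, seq_1 = payload
    return str(seq_0), str(seq_1)


def output_fn(prediction, accept=JSON_CONTENT_TYPE):
    """Serialize the prediction, assumed here to be a list of class scores."""
    if accept != JSON_CONTENT_TYPE:
        raise ValueError(f"Unsupported accept type: {accept}")
    # "scores" and "label" are hypothetical response fields.
    scores = [float(score) for score in prediction]
    return json.dumps({"scores": scores, "label": scores.index(max(scores))})
```

Keeping these hooks free of model-specific code makes them easy to check locally before deploying the endpoint.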

Finally, we can deploy our model to a SageMaker endpoint on an AWS Inferentia instance, and get predictions from it in real time:

from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

compiled_inf1_predictor = compiled_inf1_model.deploy(
    instance_type="ml.inf1.xlarge",
    initial_instance_count=1,
    endpoint_name="test-neo-inf1-bert",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

payload = seq_0, seq_1
print(compiled_inf1_predictor.predict(payload))

As you can see, we were able to get all the benefits of using AWS Inferentia instances on SageMaker by using simple APIs that complement the standard flow of the SageMaker SDK.

Final solution

The following architecture illustrates the solution deployed in AWS.

Architecture diagram

All the testing and evaluation analysis described in this post was done with the help of AWS AI/ML Specialist Solutions Architects in under 3 weeks, thanks to the ease of use of SageMaker and AWS Inferentia.

Conclusion

In this post, we shared how InfoJobs (Adevinta) uses AWS Inferentia with SageMaker endpoints to optimize the performance of NLP model inference in a cost-effective way, reducing inference times by up to 92% at a 75% lower cost than the initial best alternative. You can follow the process and code shared here to compile and deploy your own models using SageMaker, the Neuron SDK for PyTorch, and AWS Inferentia.

The results of the benchmarking tests performed between AWS AI/ML Specialist Solutions Architects and InfoJobs engineers were also validated in InfoJobs’s environment. This solution is now being deployed in production, handling the processing of all the CVs uploaded by users to the InfoJobs portal in real time.

As a next step, we will be exploring ways to optimize model training and our ML pipeline with SageMaker by relying on the Hugging Face integration with SageMaker and SageMaker Training Compiler, among other features.

We encourage you to try out AWS Inferentia with SageMaker, and connect with AWS to discuss your specific ML needs. For more examples on SageMaker and AWS Inferentia, you can also check out SageMaker examples on GitHub and AWS Neuron tutorials.


About the Authors

Juan Francisco Fernandez is an ML Engineer with Adevinta Spain. He joined InfoJobs to tackle the challenge of automating model development, thereby providing more time for data scientists to think about new experiments and models and freeing them of the burden of engineering tasks. In his spare time, he enjoys spending time with his son, playing basketball and video games, and learning languages.

Antonio Rodriguez is an AI & ML Specialist Solutions Architect at Amazon Web Services. He helps companies solve their challenges through innovation with the AWS Cloud and AI/ML services. Apart from work, he loves to spend time with his family and play sports with his friends.

João Moura is an AI & ML Specialist Solutions Architect at Amazon Web Services. He focuses mostly on NLP use cases and helping customers optimize deep learning model deployments.


Amazon SageMaker Studio and SageMaker Notebook Instance now come with JupyterLab 3 notebooks to boost developer productivity

Amazon SageMaker comes with two options to spin up fully managed notebooks for exploring data and building machine learning (ML) models. The first option is fast-start, collaborative notebooks accessible within Amazon SageMaker Studio – a fully integrated development environment (IDE) for machine learning. You can quickly launch notebooks in Studio, easily dial up or down the underlying compute resources without interrupting your work, and even share your notebook as a link in a few simple clicks. In addition to creating notebooks, you can perform all the ML development steps to build, train, debug, track, deploy, and monitor your models in a single pane of glass in Studio. The second option is Amazon SageMaker Notebook Instance – a single, fully managed ML compute instance running notebooks in the cloud, offering customers more control over their notebook configurations.

Today, we’re excited to announce that SageMaker Studio and SageMaker Notebook Instance now come with JupyterLab 3 notebooks. The new notebooks provide data scientists and developers a modern IDE complete with developer productivity tools for code authoring, refactoring and debugging, and support for the latest open-source Jupyter extensions. AWS is a major contributor to the Jupyter open-source community and we’re happy to bring the latest Jupyter capabilities to our customers.

In this post, we showcase some of the exciting new features built into SageMaker notebooks and call attention to some of our favorite open-source extensions that improve the developer experience when using SageMaker to build, train, and deploy your ML models.

What’s new with notebooks on SageMaker

The new notebooks come with several features out of the box that improve the SageMaker developer experience, including the following:

  • An integrated debugger with support for breakpoints and variable inspection
  • A table of contents panel to more easily navigate notebooks
  • A filter bar for the file browser
  • Support for multiple display languages
  • The ability to install extensions through pip, Conda, and Mamba

With the integrated debugger, you can inspect variables and step through breakpoints while you interactively build your data science and ML code. You can access the debugger by simply choosing the debugger icon on the notebook toolbar.

As of this writing, the debugger is available for our newly launched Base Python 2.0 and Data Science 2.0 images in SageMaker Studio and amazonei_pytorch_latest_p37, pytorch_p38, and tensorflow2_p38 kernels in SageMaker Notebook Instance, with plans to support more in the near future.

The table of contents panel provides an excellent utility to navigate notebooks and more easily share your findings with colleagues.

JupyterLab extensions

With the upgraded notebooks in SageMaker, you can take advantage of the ever-growing community of open-source JupyterLab extensions. In this section, we highlight a few that fit naturally into the SageMaker developer workflow, but we encourage you to browse the available extensions or even create your own.

The first extension we highlight is the Language Server Protocol extension. This open-source extension enables modern IDE functionality such as tab completion, syntax highlighting, jump to reference, variable renaming across notebooks and modules, diagnostics, and much more. This extension is very useful for those developers who want to author Python modules as well as notebooks.

Another useful extension for the SageMaker developer workflow is the jupyterlab-s3-browser. This extension picks up your SageMaker execution role’s credentials and allows you to browse, load, and write files directly to Amazon Simple Storage Service (Amazon S3).

Install extensions

JupyterLab 3 now makes the process of packaging and installing extensions significantly easier. You can install the aforementioned extensions through bash scripts. For example, in SageMaker Studio, open the system terminal from the Studio launcher and run the following commands. Note that the upgraded Studio has a separate, isolated Conda environment for managing the Jupyter Server runtime, so you need to install extensions into the studio Conda environment. To install extensions in SageMaker Notebook Instance, there is no need to switch Conda environments.

In addition, you can automate the installation of these extensions using lifecycle configurations so they’re persisted between Studio restarts. You can configure this for all the users in the domain or at an individual user level.

For Python Language Server, use the following code to install the extensions:

conda init
conda activate studio
pip install jupyterlab-lsp
pip install 'python-lsp-server[all]'
conda deactivate
nohup supervisorctl -c /etc/supervisor/conf.d/supervisord.conf restart jupyterlabserver

For the Amazon S3 file browser, use the following:

conda init
conda activate studio
pip install jupyterlab_s3_browser
jupyter serverextension enable --py jupyterlab_s3_browser
conda deactivate
nohup supervisorctl -c /etc/supervisor/conf.d/supervisord.conf restart jupyterlabserver

Be sure to refresh your browser after installation.

For more information about writing similar lifecycle scripts for SageMaker Notebook Instance, refer to Customize a Notebook Instance Using a Lifecycle Configuration Script and Customize your Amazon SageMaker notebook instances with lifecycle configurations and the option to disable internet access. Additionally, for more information on extension management, including how to write lifecycle configurations that work for both versions 1 and 3 of JupyterLab notebooks for backward compatibility, see Installing JupyterLab and Jupyter Server extensions.
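As an illustration of the lifecycle configuration approach for Studio, the following sketch packages the Language Server commands from this post into the Base64-encoded form that the SageMaker CreateStudioLifecycleConfig API expects. The configuration name is hypothetical, the commands may need adjusting to run non-interactively, and the final boto3 call is left commented out because it requires AWS credentials. Treat it as a starting point under those assumptions rather than a tested configuration.

```python
import base64

# Commands copied from the Python Language Server instructions in this post.
LSP_INSTALL_SCRIPT = """#!/bin/bash
conda init
conda activate studio
pip install jupyterlab-lsp
pip install 'python-lsp-server[all]'
conda deactivate
nohup supervisorctl -c /etc/supervisor/conf.d/supervisord.conf restart jupyterlabserver
"""


def build_lifecycle_config_request(name: str, script: str) -> dict:
    """Build keyword arguments for boto3's create_studio_lifecycle_config."""
    encoded = base64.b64encode(script.encode("utf-8")).decode("ascii")
    return {
        "StudioLifecycleConfigName": name,
        "StudioLifecycleConfigContent": encoded,
        "StudioLifecycleConfigAppType": "JupyterServer",
    }


# "install-lsp-extension" is a hypothetical configuration name.
request = build_lifecycle_config_request("install-lsp-extension", LSP_INSTALL_SCRIPT)

# With AWS credentials configured, the request could be submitted like this:
# import boto3
# boto3.client("sagemaker").create_studio_lifecycle_config(**request)
```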

Get started with JupyterLab 3 notebooks in Studio

If you’re creating a new Studio domain, you can specify the default notebook version directly from the AWS Management Console or using the API.

On the SageMaker Control Panel, change your notebook version when editing your domain settings, in the Jupyter Lab version section.

To use the API, configure the JupyterServerAppSettings parameter as follows:

aws --region <REGION> \
sagemaker create-domain \
--domain-name <NEW_DOMAIN_NAME> \
--auth-mode <AUTHENTICATION_MODE> \
--subnet-ids <SUBNET-IDS> \
--vpc-id <VPC-ID> \
--default-user-settings '{
  "JupyterServerAppSettings": {
    "DefaultResourceSpec": {
      "SageMakerImageArn": "arn:aws:sagemaker:<REGION>:<ACCOUNT_ID>:image/jupyter-server-3",
      "InstanceType": "system"
    }
  }
}'
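
If you script domain creation, generating the default user settings programmatically avoids shell quoting problems. The sketch below builds the same JSON document in Python. The function name is illustrative, and the Region and account ID values are placeholders that you would replace with your own.

```python
import json


def default_user_settings(region: str, account_id: str) -> dict:
    """Return default user settings that select the JupyterLab 3 server image."""
    image_arn = f"arn:aws:sagemaker:{region}:{account_id}:image/jupyter-server-3"
    return {
        "JupyterServerAppSettings": {
            "DefaultResourceSpec": {
                "SageMakerImageArn": image_arn,
                "InstanceType": "system",
            }
        }
    }


# Placeholder values; substitute the Region and account ID that apply to you.
settings_json = json.dumps(default_user_settings("us-east-1", "123456789012"))
print(settings_json)
```

The resulting string can be passed as the value of the --default-user-settings option.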

If you’re an existing Studio user, you can modify your notebook version by choosing your user profile on the SageMaker Control Panel and choosing Edit.

Then choose your preferred version in the Jupyter Lab version section.

For more information, see JupyterLab Versioning.

Get started with JupyterLab 3 on SageMaker Notebook Instance

SageMaker Notebook Instance users can also specify the default notebook version both from the console and using our API. If using the console, note that the option to choose JupyterLab 3 notebooks is only available for the latest generation of SageMaker Notebook Instance, which comes with Amazon Linux 2.

On the SageMaker console, choose your version while creating your notebook instance, under Platform identifier.

If using the API, use the following code:

aws sagemaker create-notebook-instance --notebook-instance-name <NEW_NOTEBOOK_NAME> \
--instance-type <INSTANCE_TYPE> \
--role-arn <YOUR_ROLE_ARN> \
--platform-identifier notebook-al2-v2

For more information, see Creating a notebook with your JupyterLab version.

Conclusion

SageMaker Studio and SageMaker Notebook Instance now offer an upgraded notebook experience to users. We encourage you to try out the new capabilities and further boost developer productivity with these enhancements!


About the Authors

Sean Morgan is an AI/ML Solutions Architect at AWS. He has experience in the semiconductor and academic research fields, and uses his experience to help customers reach their goals on AWS. In his free time, Sean is an active open-source contributor/maintainer and is the special interest group lead for TensorFlow Add-ons.

Arkaprava De is a Senior Software Engineer at AWS. He has been at Amazon for over 7 years and is currently working on improving the Amazon SageMaker Studio IDE experience.

Kunal Jha is a Senior Product Manager at AWS. He is focused on building Amazon SageMaker Studio as the IDE of choice for all ML development steps. In his spare time, Kunal enjoys skiing and exploring the Pacific Northwest. You can find him on LinkedIn.


Reinventing retail with no-code machine learning: Sales forecasting using Amazon SageMaker Canvas

Retail businesses are data-driven—they analyze data to get insights about consumer behavior, understand shopping trends, make product recommendations, optimize websites, plan for inventory, and forecast sales.

A common approach for sales forecasting is to use historical sales data to predict future demand. Forecasting future demand is critical for planning and impacts inventory, logistics, and even marketing campaigns. Sales forecasting is generated at many levels such as product, sales channel (store, website, partner), warehouse, city, or country.

Sales managers and planners have domain expertise and knowledge of sales history, but lack data science and programming skills to create machine learning (ML) models to generate accurate sales forecasts. They need an intuitive, easy-to-use tool to create ML models without writing code.

To help achieve the agility and effectiveness that business analysts seek, we’ve introduced Amazon SageMaker Canvas, a no-code ML solution that helps companies accelerate delivery of ML solutions down to hours or days. Canvas enables analysts to easily use available data in data lakes, data warehouses, and operational data stores; build ML models; and use them to make predictions interactively and for batch scoring on bulk datasets—all without writing a single line of code.

In this post, we show how to use Canvas to generate sales forecasts at the retail store level.

Solution overview

As of this writing, Canvas can import data from a local file, Amazon Simple Storage Service (Amazon S3), Amazon Redshift, and Snowflake.

In this post, we use Amazon Redshift cluster-based data with Canvas to build ML models to generate sales forecasts. Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Retail industry customers use Amazon Redshift to store and analyze large-scale, enterprise-level structured and semi-structured business data. It helps them accelerate data-driven business decisions in a performant and scalable way.

Generally, data engineers are responsible for ingesting and curating sales data in Amazon Redshift. Many retailers have a data lake where this has been done, but we show the steps here for clarity, and to illustrate how the data engineer can help the business analyst (such as the sales manager) by curating data for their use. This allows the data engineers to enable self-service data for use by business analysts.

In this post, we use a sample dataset that consists of two tables: storesales and storepromotions. You can prepare this sample dataset using your own sales data.

The storesales table keeps historical time series sales data for the stores. The table details are as follows:

Column Name | Data Type
store | INT
saledate | TIMESTAMP
totalsales | DECIMAL

The storepromotions table contains historical data from the stores regarding promotions and school holidays, on a daily time frame. The table details are as follows:

Column Name | Data Type
store | INT
saledate | TIMESTAMP
promo | INT (0/1)
schoolholiday | INT (0/1)

We combine data from these two tables to train an ML model that can generate forecasts for the store sales.

Canvas is a visual, point-and-click service that makes it easy to build ML models and generate accurate predictions. There are four steps involved in building the forecasting model:

  1. Select data from the data source (Amazon Redshift in this case).
  2. Configure and build (train) your model.
  3. View model insights such as accuracy and column impact on the prediction.
  4. Generate predictions (sales forecasts in this case).

Before we can start using Canvas, we need to prepare our data and configure an AWS Identity and Access Management (IAM) role for Canvas.

Create tables and load sample data

To use the sample dataset, complete the following steps:

  1. Upload the storesales and storepromotions sample data files, store_sales.csv and store_promotions.csv, to an Amazon S3 bucket. Make sure the bucket is in the same Region as your Amazon Redshift cluster.
  2. Create an Amazon Redshift cluster (if you don’t already have one running).
  3. Access the Amazon Redshift query editor.
  4. Create the tables and run the COPY command to load data. Use the appropriate IAM role for the Amazon Redshift cluster in the following code:
create table storesales
(
store INT,
saledate VARCHAR,
totalsales DECIMAL
);

create table storepromotions
(
store INT,
saledate VARCHAR,
promo INT,
schoolholiday INT
);

copy storesales (store,saledate,totalsales)
from 's3://<YOUR_BUCKET_NAME>/store_sales.csv'
iam_role '<REDSHIFT_IAM_ROLE_ARN>'
CSV
IGNOREHEADER 1;

copy storepromotions (store,saledate,promo,schoolholiday)
from 's3://<YOUR_BUCKET_NAME>/store_promotions.csv'
iam_role '<REDSHIFT_IAM_ROLE_ARN>'
CSV
IGNOREHEADER 1;

By default, the sample data is loaded in the storesales and storepromotions tables in the public schema of the dev database. But you can choose to use a different database and schema.

Create an IAM role for Canvas

Canvas uses an IAM role to access other AWS services. To configure your role, complete the following steps:

  1. Create your role. For instructions, refer to Give your users permissions to perform time series forecasting.
  2. Replace the code in the Trusted entities field on the Trust relationships tab.

The following code is the new trust policy for the IAM role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": [ "sagemaker.amazonaws.com", 
            "forecast.amazonaws.com"]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

  3. Grant the IAM role permissions to access Amazon Redshift. For instructions, refer to Give users permissions to import Amazon Redshift data.

The following screenshot shows your permission policies.

The IAM role should be assigned as the execution role for Canvas in the Amazon SageMaker domain configuration.

  4. On the SageMaker console, assign the IAM role you created as the execution role when configuring your SageMaker domain.

Both the data in the Amazon Redshift cluster database and the Canvas configuration are now ready. You can now use Canvas to build the forecasting model.

Launch Canvas

After the data engineers prepare the data in the Amazon Redshift data warehouse, the sales managers can use Canvas to generate forecasts.

To launch Canvas, the AWS account administrator first performs the following steps:

  • Create a SageMaker domain.
  • Create user profiles for the SageMaker domain.

For instructions, refer to Getting started with using Amazon SageMaker Canvas or contact your AWS account administrator for guidance.

Launch the Canvas app from the SageMaker console. Make sure to launch Canvas in the same AWS Region where the Amazon Redshift cluster is.

When Canvas is launched, you can start with the first step of selecting data from the data source.

Import data in Canvas

To import your data, complete the following steps:

  1. In the Canvas application, on the Datasets menu, choose Import.
  2. On the Import page, choose the Add connection menu and choose Redshift.

    The data engineer or cloud administrator can provide Amazon Redshift connection information to the sales manager. We show an example of the connection information in this post.
  3. For Type, choose IAM.
  4. For Cluster identifier, enter your Amazon Redshift cluster ID.
  5. For Database name, enter dev.
  6. For Database user, enter awsuser.
  7. For Unload IAM role, enter the IAM role you created earlier for the Amazon Redshift cluster.
  8. For Connection name, enter redshiftconnection.
  9. Choose Add connection.

    The connection between Canvas and the Amazon Redshift cluster is established. You can see the redshiftconnection icon on the top of the page.
  10. Drag and drop storesales and storepromotions tables under the public schema to the right panel.
    It automatically creates an inner join between the tables on their matching column names store and saledate.

    You can update joins and decide which fields to select from each table to create your desired dataset. You can configure the joins and field selection in two ways: by dragging and dropping tables in the Canvas user interface, or, if the sales manager knows SQL, by updating the SQL script in Canvas. We include an example of editing SQL for completeness, and for the many business analysts who have been trained in SQL. The end goal is to prepare a SQL statement that provides the desired dataset that can be imported to Canvas.
  11. Choose Edit in SQL to see the SQL script used for the join.
  12. Modify the SQL statement with the following code:
    WITH DvtV AS (SELECT store, saledate, promo, schoolholiday FROM dev.public."storepromotions"),
         L394 AS (SELECT store, saledate, totalsales FROM dev.public."storesales")
    SELECT
        DvtV.promo,
        DvtV.schoolholiday,
        L394.totalsales,
        DvtV.saledate AS saledate,
        DvtV.store AS store
    FROM DvtV
    INNER JOIN L394
        ON DvtV.saledate = L394.saledate AND DvtV.store = L394.store;

  13. Choose Run SQL to run the query.

    When the query is complete, you can see a preview of the output. This is the final data that you want to import in Canvas for the ML model and forecasting purposes.
  14. Choose Import data to import the data into Canvas.

When importing the data, provide a suitable name for the dataset, such as store_daily_sales_dataset.
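
To see what the inner join on store and saledate produces, you can reproduce it locally. The following sketch runs an equivalent query against an in-memory SQLite database instead of Amazon Redshift. The sample rows are invented for illustration and are not taken from the sample dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE storesales (store INT, saledate TEXT, totalsales REAL);
    CREATE TABLE storepromotions (store INT, saledate TEXT, promo INT, schoolholiday INT);
    -- Invented rows: store 3 has sales but no matching promotions row.
    INSERT INTO storesales VALUES
        (1, '2015-07-01', 5263.0), (2, '2015-07-01', 6064.0), (3, '2015-07-01', 8314.0);
    INSERT INTO storepromotions VALUES
        (1, '2015-07-01', 1, 0), (2, '2015-07-01', 1, 1);
    """
)

JOIN_SQL = """
SELECT p.promo, p.schoolholiday, s.totalsales, p.saledate AS saledate, p.store AS store
FROM storepromotions AS p
INNER JOIN storesales AS s
    ON p.saledate = s.saledate AND p.store = s.store
ORDER BY p.store
"""

rows = conn.execute(JOIN_SQL).fetchall()
for row in rows:
    print(row)
```

Only stores that appear in both tables for a given date survive the inner join, so the row for store 3 is dropped. If your own data has gaps in one of the tables, the imported dataset will have matching gaps.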

The dataset is ready in Canvas. Now you can start training a model to forecast total sales across stores.

Configure and train the model

To configure model training in Canvas, complete the following steps:

  1. Choose the Models menu option and choose New Model.
  2. For the new model, give a suitable name such as store_sales_forecast_model.
  3. Select the dataset store_daily_sales_dataset.
  4. Choose Select dataset.

    On the Build tab, you can see data and column-level statistics as well as the configuration area for the model training.
  5. Select totalsales for the target column.
    Canvas automatically selects Time series forecasting as the model type.
  6. Choose Configure to start configuration of the model training.
  7. In the Time series forecasting configuration section, choose store as the unique identity column because we want to generate forecasts for the store.
  8. Choose saledate for the time stamps column because it represents historical time series.
  9. Enter 120 as the number of days because we want to forecast sales for a horizon of roughly 4 months.
  10. Choose Save.
  11. When the model training configuration is complete, choose Standard build to start the model training.

The Quick build and Preview model options aren’t available for the time series forecasting model type at the time of this writing. After you choose the standard build, the Analyze tab shows the estimated time for the model training.

Model training can take 1–4 hours to complete depending on the data size. For the sample data used in this post, model training took around 3 hours. When the model is ready, you can use it for generating forecasts.

Analyze results and generate forecasts

When the model training is complete, Canvas shows the prediction accuracy of the model on the Analyze tab. For this example, it shows prediction accuracy as 79.13%. We can also see the impact of the columns on the prediction; in this example, promo and schoolholiday don’t influence the prediction. Column impact information is useful in fine-tuning the dataset and optimizing the model training.

The forecasts are generated on the Predict tab. You can generate forecasts for all the items (all stores) or for the selected single item (single store). It also shows the date range for which the forecasts can be generated.

As an example, we choose to view a single item and enter 2 as the store to generate sales forecasts for store 2 for the date range 2015-07-31 00:00:00 through 2015-11-28 00:00:00.
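
The end of the date range follows from the 120-day horizon configured during model setup. A quick check with the Python standard library, assuming the horizon is counted from the first forecast timestamp (the function name is illustrative):

```python
from datetime import date, timedelta


def forecast_end(start: date, horizon_days: int) -> date:
    """Return the date that lies horizon_days after the forecast start."""
    return start + timedelta(days=horizon_days)


end = forecast_end(date(2015, 7, 31), 120)
print(end.isoformat())  # 2015-11-28
```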

The generated forecasts show the average forecast as well as the upper and lower bounds of the forecasts. The bounds help you decide between a more aggressive and a more balanced approach when acting on the forecasts.

You can also download the generated forecasts as a CSV file or image. The generated forecasts CSV file is generally used to work offline with the forecast data.
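
As an example of working with the forecasts offline, the sketch below totals the forecast values over the horizon for each bound. The column names (lower, mean, upper) and the sample values are hypothetical; check the header of the file you download from Canvas and adjust them accordingly.

```python
import csv
import io

# Hypothetical excerpt of a downloaded forecast file.
SAMPLE_CSV = """store,saledate,lower,mean,upper
2,2015-07-31,4100.0,5000.0,5900.0
2,2015-08-01,3900.0,4800.0,5700.0
"""


def total_by_bound(csv_text: str) -> dict:
    """Sum the lower bound, mean, and upper bound forecasts across all rows."""
    totals = {"lower": 0.0, "mean": 0.0, "upper": 0.0}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for key in totals:
            totals[key] += float(row[key])
    return totals


totals = total_by_bound(SAMPLE_CSV)
print(totals)
```

Comparing the lower and upper totals gives a rough sense of how much stock or staffing a cautious plan and an aggressive plan would differ by.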

The forecasts are generated based on time series data for a period of time. When the new baseline of data becomes available for the forecasts, you can upload a new baseline dataset and change the dataset in Canvas to retrain the forecast model using new data.

You can retrain the model multiple times as new source data is available.

Conclusion

Generating sales forecasts using Canvas is a configuration-driven, easy-to-use process. We showed you how data engineers can help curate data for business analysts to use, and how business analysts can gain insights from their data. The business analyst can now connect to data sources such as local disk, Amazon S3, Amazon Redshift, or Snowflake to import data and join data across multiple tables to train an ML forecasting model, which is then used to generate sales forecasts. As the historical sales data updates, you can retrain the forecast model to maintain forecast accuracy.

Sales managers and operations planners can use Canvas without expertise in data science and programming. This expedites decision-making time, enhances productivity, and helps build operational plans.

To get started and learn more about Canvas, refer to the following resources:


About the Authors

Brajendra Singh is a Solutions Architect at Amazon Web Services working with enterprise customers. He has a strong developer background and is a keen enthusiast for data and machine learning solutions.

Davide Gallitelli is a Specialist Solutions Architect for AI/ML in the EMEA region. He is based in Brussels and works closely with customers throughout Benelux. He has been a developer since he was very young, starting to code at the age of 7. He started learning AI/ML at university, and has fallen in love with it since then.


Train machine learning models using Amazon Keyspaces as a data source

Many applications meant for industrial equipment maintenance, trade monitoring, fleet management, and route optimization are built using open-source Cassandra APIs and drivers to process data at high speeds and low latency. Managing Cassandra tables yourself can be time consuming and expensive. Amazon Keyspaces (for Apache Cassandra) lets you set up, secure, and scale Cassandra tables in the AWS Cloud without managing additional infrastructure.

In this post, we’ll walk you through AWS Services related to training machine learning (ML) models using Amazon Keyspaces at a high level, and provide step by step instructions for ingesting data from Amazon Keyspaces into Amazon SageMaker and training a model which can be used for a specific customer segmentation use case.

AWS has multiple services to help businesses implement ML processes in the cloud.

The AWS ML stack has three layers. In the middle layer is SageMaker, which provides developers, data scientists, and ML engineers with the ability to build, train, and deploy ML models at scale. It removes the complexity from each step of the ML workflow so that you can more easily deploy your ML use cases. This includes anything from predictive maintenance to computer vision to predicting customer behaviors. Customers achieve up to 10 times improvement in data scientists’ productivity with SageMaker.

Apache Cassandra is a popular choice for read-heavy use cases with unstructured or semi-structured data. For example, a popular food delivery business estimates time of delivery, and a retail customer could persist frequently used product catalog information in the Apache Cassandra database. Amazon Keyspaces is a scalable, highly available, and managed serverless Apache Cassandra–compatible database service. You don’t need to provision, patch, or manage servers, and you don’t need to install, maintain, or operate software. Tables can scale up and down automatically, and you only pay for the resources that you use. Amazon Keyspaces lets you run your Cassandra workloads on AWS using the same Cassandra application code and developer tools that you use today.

SageMaker provides a suite of built-in algorithms to help data scientists and ML practitioners get started training and deploying ML models quickly. In this post, we’ll show you how a retail customer can use customer purchase history in the Keyspaces Database and target different customer segments for marketing campaigns.

K-means is an unsupervised learning algorithm. It attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups. You define the attributes that you want the algorithm to use to determine similarity. SageMaker uses a modified version of the web-scale k-means clustering algorithm. As compared with the original version of the algorithm, the version used by SageMaker is more accurate. However, like the original algorithm, it scales to massive datasets and delivers improvements in training time.
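
To make the grouping idea concrete, here is a minimal sketch of the classic k-means iteration (Lloyd's algorithm) in plain Python. It is not the web-scale variant that SageMaker uses, and the points and starting centroids are made up for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points: List[Point], centroids: List[Point]) -> List[int]:
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: squared_distance(p, centroids[i]))
        for p in points
    ]


def update(points: List[Point], labels: List[int], centroids: List[Point]) -> List[Point]:
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == i]
        if not members:
            new_centroids.append(old)  # keep the centroid of an empty cluster
            continue
        new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
    return new_centroids


def kmeans(points: List[Point], initial_centroids: List[Point], max_iter: int = 100):
    """Alternate assignment and update steps until the labels stop changing."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return centroids, labels


# Made-up two-dimensional points forming two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 8.0)]
centroids, labels = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids, labels)
```

In a customer segmentation setting, each point would hold the attributes you chose for measuring similarity, and each resulting label would identify a customer segment.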

Solution overview

The instructions assume that you are using SageMaker Studio to run the code. The associated code has been shared on the AWS Samples GitHub repository. By following the instructions in the lab, you can do the following:

  • Install necessary dependencies.
  • Connect to Amazon Keyspaces, create a Table, and ingest sample data.
  • Build a clustering (k-means) ML model using the data in Amazon Keyspaces.
  • Explore model results.
  • Clean up newly created resources.

Once complete, you’ll have integrated SageMaker with Amazon Keyspaces to train ML models as shown in the following image.

Now you can follow the step-by-step instructions in this post to ingest raw data stored in Amazon Keyspaces using SageMaker and use the retrieved data for ML processing.

Prerequisites

First, navigate to SageMaker.

Next, if this is the first time that you’re using SageMaker, select Get Started.

Next, select Set up SageMaker Domain.

Next, create a new user profile with Name – sagemakeruser, and select Create New Role in the Default Execution Role sub section.

Next, in the screen that pops up, select any Amazon Simple Storage Service (Amazon S3) bucket, and select Create role.

This role will be used in the following steps to allow SageMaker to access Keyspaces Table using temporary credentials from the role. This eliminates the need to store a username and password in the notebook.

Next, retrieve the role associated with the sagemakeruser that was created in the previous step from the summary section.

Then, navigate to the AWS Console and look up AWS Identity and Access Management (IAM). Within IAM, navigate to Roles. Within Roles, search for the execution role identified in the previous step.

Next, select the role identified in the previous step and select Add Permissions. In the drop-down that appears, select Create Inline Policy. IAM lets you provide a granular level of access that restricts what actions a user or application can perform based on business requirements.

Then, select the JSON tab and copy the policy from the Note section of the GitHub page. This policy allows the SageMaker notebook to connect to Keyspaces and retrieve data for further processing.

Then, select Add permissions again and, from the drop-down, select Attach Policy.

Look up the AmazonKeyspacesFullAccess policy, select the checkbox next to the matching result, and select Attach Policies.

Verify that the permissions policies section includes AmazonS3FullAccess, AmazonSageMakerFullAccess, AmazonKeyspacesFullAccess, as well as the newly added inline policy.

Next, navigate to SageMaker in the AWS Console and select SageMaker Studio. Once there, select Launch App and select Studio.

Notebook walkthrough

The preferred way to connect to Keyspaces from a SageMaker notebook is to authenticate with temporary credentials using the AWS Signature Version 4 (SigV4) process. In this scenario, we do not need to generate or store Keyspaces credentials; the SigV4 plugin authenticates with the temporary credentials instead. Temporary security credentials consist of an access key ID and a secret access key, plus a security token that indicates when the credentials expire. In this post, we’ll create an IAM role and generate temporary security credentials.

First, we install a driver (cassandra-sigv4). This driver enables you to add authentication information to your API requests using the AWS Signature Version 4 Process (SigV4). Using the plugin, you can provide users and applications with short-term credentials to access Amazon Keyspaces (for Apache Cassandra) using IAM users and roles. Next, you import a required certificate along with additional package dependencies. Finally, you allow the notebook to assume the role to talk to Keyspaces.

# Install missing packages and import dependencies
# Installing Cassandra SigV4
%pip install  cassandra-sigv4

# Get Security certificate
!curl https://certs.secureserver.net/repository/sf-class2-root.crt -O

# Import dependencies (duplicates removed, grouped by origin)
import csv
import time
import datetime as dt
from datetime import datetime, timedelta
from ssl import SSLContext, PROTOCOL_TLSv1_2, CERT_REQUIRED

import boto3
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra_sigv4.auth import SigV4AuthProvider
from pandas import DataFrame
from sagemaker import get_execution_role
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Getting credentials from the role
client = boto3.client("sts")

# Get notebook Role
role = get_execution_role()
role_info = {"RoleArn": role, "RoleSessionName": "session1"}
print(role_info)

credentials = client.assume_role(**role_info)
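
The response from assume_role nests the temporary credentials under a Credentials key, together with an Expiration timestamp. As a minimal sketch, the hypothetical helper below (it is not part of the original notebook) reports how long a set of credentials remains valid. The sample response uses placeholder values rather than real keys, and only mirrors the shape of the STS response.

```python
from datetime import datetime, timedelta, timezone


def seconds_until_expiry(sts_response, now=None):
    """Return seconds until the temporary credentials expire (negative if expired)."""
    now = now or datetime.now(timezone.utc)
    expiration = sts_response["Credentials"]["Expiration"]
    return (expiration - now).total_seconds()


# Placeholder response with the same shape that STS AssumeRole returns.
fake_now = datetime(2022, 1, 1, 12, 0, tzinfo=timezone.utc)
sample_response = {
    "Credentials": {
        "AccessKeyId": "ASIAEXAMPLE",
        "SecretAccessKey": "example-secret",
        "SessionToken": "example-token",
        "Expiration": fake_now + timedelta(hours=1),
    }
}

remaining = seconds_until_expiry(sample_response, now=fake_now)
print(f"Credentials valid for another {remaining:.0f} seconds")
```

In a notebook, you could pass the credentials variable from the preceding cell to this helper to decide whether to call assume_role again before a long-running job.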

Next, connect to Amazon Keyspaces and read system data from Keyspaces into a pandas DataFrame to validate the connection.

# Connect to Cassandra Database from SageMaker Notebook 
# using temporary credentials from the Role.
# The boto3 session is named boto_session so that it isn't overwritten
# by the Cassandra session created below.
boto_session = boto3.session.Session()

###
### You can also pass specific credentials to the session
###
#boto_session = boto3.session.Session(
# aws_access_key_id=credentials["Credentials"]["AccessKeyId"],
# aws_secret_access_key=credentials["Credentials"]["SecretAccessKey"],
# aws_session_token=credentials["Credentials"]["SessionToken"],
#)

region_name = boto_session.region_name

# Set Context
ssl_context = SSLContext(PROTOCOL_TLSv1_2)
ssl_context.load_verify_locations("sf-class2-root.crt")
ssl_context.verify_mode = CERT_REQUIRED

auth_provider = SigV4AuthProvider(boto_session)
keyspaces_host = "cassandra." + region_name + ".amazonaws.com"

cluster = Cluster([keyspaces_host], ssl_context=ssl_context, auth_provider=auth_provider, port=9142)
session = cluster.connect()

# Read data from a Keyspaces system table.
# Keyspaces is a serverless database, so you don't have to create it ahead of time.
r = session.execute("select * from system_schema.keyspaces")

# Read Keyspaces rows into a pandas DataFrame
df = DataFrame(r)
print(df)

Next, prepare the raw data set for training. The Python notebook associated with this post uses a retail data set downloaded from here and processes it. Our business objective given the data set is to cluster the customers using a specific metric called RFM. The RFM model is based on three quantitative factors:

  • Recency: How recently a customer has made a purchase.
  • Frequency: How often a customer makes a purchase.
  • Monetary Value: How much money a customer spends on purchases.

RFM analysis numerically ranks a customer in each of these three categories, generally on a scale of 1 to 5 (the higher the number, the better the result). The “best” customer would receive a top score in every category. We’ll use pandas’s quantile-based discretization function (qcut), which discretizes values into equal-sized buckets based on rank or on sample quantiles.
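
To make the scoring concrete, here is a minimal sketch of qcut on made-up values (the numbers are hypothetical and not taken from the retail data set). It also shows why the notebook ranks the frequency values first: when many customers share the same value, the quantile edges repeat and qcut can't build five distinct buckets.

```python
import pandas as pd

# Hypothetical recency values (days since last purchase) for ten customers.
recency = pd.Series([1, 3, 7, 14, 21, 30, 45, 60, 90, 180])

# Five quantile buckets; labels are reversed so recent customers score highest.
recency_score = pd.qcut(recency, 5, labels=[5, 4, 3, 2, 1])

# Hypothetical purchase counts with many ties. Calling qcut on these directly
# would raise a ValueError because several bin edges would be identical.
frequency = pd.Series([1, 1, 1, 1, 1, 1, 2, 2, 3, 10])

# Ranking with method="first" breaks ties by order of appearance,
# so every customer gets a distinct rank and the buckets stay equal-sized.
frequency_score = pd.qcut(frequency.rank(method="first"), 5, labels=[1, 2, 3, 4, 5])

print(pd.DataFrame({
    "Recency": recency,
    "RecencyScore": recency_score,
    "Frequency": frequency,
    "FrequencyScore": frequency_score,
}))
```

Each score bucket ends up with two of the ten customers, which is what "equal-sized" means here.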

# Prepare Data
r = session.execute("select * from " + keyspaces_schema + ".online_retail")

df = DataFrame(r)
df.head(100)

df.count()
df["description"].nunique()
df["totalprice"] = df["quantity"] * df["price"]
df.groupby("invoice").agg({"totalprice": "sum"}).head()

df.groupby("description").agg({"price": "max"}).sort_values("price", ascending=False).head()
df.sort_values("price", ascending=False).head()
df["country"].value_counts().head()
df.groupby("country").agg({"totalprice": "sum"}).sort_values("totalprice", ascending=False).head()

returned = df[df["invoice"].str.contains("C", na=False)]
returned.sort_values("quantity", ascending=True).head()

df.isnull().sum()
df.dropna(inplace=True)
df.isnull().sum()
df.describe([0.05, 0.01, 0.25, 0.50, 0.75, 0.80, 0.90, 0.95, 0.99]).T
df.drop(df.loc[df["customer_id"] == ""].index, inplace=True)

# Recency Metric
import datetime as dt

today_date = dt.date(2011, 12, 9)
df["customer_id"] = df["customer_id"].astype(int)

# Get the most recent invoice date for each customer
temp_df = df.groupby("customer_id").agg({"invoice_date": "max"})
temp_df["invoice_date"] = temp_df["invoice_date"].astype(str)
temp_df["invoice_date"] = pd.to_datetime(temp_df["invoice_date"]).dt.date
temp_df["Recency"] = (today_date - temp_df["invoice_date"]).dt.days
recency_df = temp_df.drop(columns=["invoice_date"])
recency_df.head()

# Frequency Metric
temp_df = df.groupby(["customer_id", "invoice"]).agg({"invoice": "count"})
freq_df = temp_df.groupby("customer_id").agg({"invoice": "count"})
freq_df.rename(columns={"invoice": "Frequency"}, inplace=True)

# Monetary Metric
monetary_df = df.groupby("customer_id").agg({"totalprice": "sum"})
monetary_df.rename(columns={"totalprice": "Monetary"}, inplace=True)
rfm = pd.concat([recency_df, freq_df, monetary_df], axis=1)

df = rfm
df["RecencyScore"] = pd.qcut(df["Recency"], 5, labels=[5, 4, 3, 2, 1])
df["FrequencyScore"] = pd.qcut(df["Frequency"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5])
df["Monetary"] = df["Monetary"].astype(int)
df["MonetaryScore"] = pd.qcut(df["Monetary"], 5, labels=[1, 2, 3, 4, 5])
df["RFM_SCORE"] = (
    df["RecencyScore"].astype(str)
    + df["FrequencyScore"].astype(str)
    + df["MonetaryScore"].astype(str)
)
# Map the two-character "<RecencyScore><FrequencyScore>" string to a segment name
seg_map = {
    r"[1-2][1-2]": "Hibernating",
    r"[1-2][3-4]": "At Risk",
    r"[1-2]5": "Can't Lose",
    r"3[1-2]": "About to Sleep",
    r"33": "Need Attention",
    r"[3-4][4-5]": "Loyal Customers",
    r"41": "Promising",
    r"51": "New Customers",
    r"[4-5][2-3]": "Potential Loyalists",
    r"5[4-5]": "Champions",
}

df["Segment"] = df["RecencyScore"].astype(str) + rfm["FrequencyScore"].astype(str)
df["Segment"] = df["Segment"].replace(seg_map, regex=True)
df.head()
rfm = df.loc[:, "Recency":"Monetary"]
df.groupby("customer_id").agg({"Segment": "sum"}).head()
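
The segment assignment in the preceding code works by concatenating the recency and frequency scores into a two-character string and then matching that string against the regular-expression keys of seg_map. The following minimal sketch illustrates the mechanism with made-up scores and only three of the patterns.

```python
import pandas as pd

# A subset of the segment map, for illustration only.
demo_seg_map = {
    r"[1-2][1-2]": "Hibernating",
    r"[3-4][4-5]": "Loyal Customers",
    r"5[4-5]": "Champions",
}

# Made-up scores for three customers.
scores = pd.DataFrame({"RecencyScore": [1, 4, 5], "FrequencyScore": [2, 5, 4]})

# "12", "45", and "54": first digit is recency, second is frequency.
scores["Segment"] = scores["RecencyScore"].astype(str) + scores["FrequencyScore"].astype(str)

# With regex=True, each dictionary key is treated as a pattern.
scores["Segment"] = scores["Segment"].replace(demo_seg_map, regex=True)
print(scores)
```

Because the patterns in the full map cover every combination of two scores from 1 to 5 exactly once, each customer lands in exactly one named segment.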

In this example, we use CQL to read records from the Keyspaces table. In some ML use cases, you may need to read the same data from the same Keyspaces table multiple times. In this case, we recommend that you save your data into an Amazon S3 bucket to avoid incurring additional costs reading from Amazon Keyspaces. Depending on your scenario, you may also use Amazon EMR to ingest a very large Amazon S3 file into SageMaker.

## Optional code to save the pandas DataFrame to S3
from io import StringIO

import sagemaker  # needed for sagemaker.Session(); only get_execution_role was imported earlier

sess = sagemaker.Session()
bucket = sess.default_bucket()  # Use the session's default S3 bucket
print(bucket)

# Serialize the DataFrame to CSV in memory, then upload it
csv_buffer = StringIO()
df.to_csv(csv_buffer)
s3_resource = boto3.resource("s3")
s3_resource.Object(bucket, "out/saved_online_retail.csv").put(Body=csv_buffer.getvalue())

Next, we train an ML model using the KMeans algorithm and make sure that the clusters are created. In this particular scenario, you would see that the created clusters are printed, showing that the customers in the raw data set have been grouped together based on various attributes in the data set. This cluster information can be used for targeted marketing campaigns.

# Training

sc = MinMaxScaler((0, 1))
df = sc.fit_transform(rfm)

# Clustering
kmeans = KMeans(n_clusters=6).fit(df)

# Result
segment = kmeans.labels_

# Visualize the clusters
final_df = pd.DataFrame({"customer_id": rfm.index, "Segment": segment})
# Count customers per segment (no .head(), so all six clusters are kept)
bucket_data = final_df.groupby("Segment").agg({"customer_id": "count"})
index_data = final_df.groupby("Segment").agg({"Segment": "max"})
index_data["Segment"] = index_data["Segment"].astype(int)
dataFrame = pd.DataFrame(data=bucket_data["customer_id"], index=index_data["Segment"])
dataFrame.rename(columns={"customer_id": "Total Customers"}).plot.bar(
    rot=70, title="RFM clustering"
)
plt.show(block=True)
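
To see the scaling and clustering steps in isolation, here is a minimal sketch on made-up RFM values. The customers, the choice of two clusters, and the random_state and n_init settings are assumptions made for a small, reproducible illustration; they are not taken from the notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Made-up RFM values: three recent, frequent, high-spend customers and three lapsed ones.
toy_rfm = pd.DataFrame(
    {
        "Recency": [2, 5, 3, 300, 280, 310],
        "Frequency": [40, 35, 38, 1, 2, 1],
        "Monetary": [5000, 4200, 4800, 50, 80, 30],
    },
    index=pd.Index([101, 102, 103, 201, 202, 203], name="customer_id"),
)

# Rescale every column to [0, 1] so Monetary doesn't dominate the distance metric.
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(toy_rfm)

# random_state and n_init are fixed here only for reproducibility.
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

# Attach the cluster label of each row back to its customer ID.
toy_segments = pd.Series(toy_kmeans.labels_, index=toy_rfm.index, name="Segment")
print(toy_segments)
print(toy_segments.value_counts())
```

The cluster numbers themselves carry no meaning; only the grouping does. Without the scaling step, the Monetary column, with values in the thousands, would contribute far more to the distances than Recency or Frequency.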

(Optional) Next, we save the customer segments that have been identified by the ML model back to an Amazon Keyspaces table for targeted marketing. A batch job could read this data and run targeted campaigns to customers in specific segments.

# Create ml_clustering_results table to store results 
createTable = """CREATE TABLE IF NOT EXISTS %s.ml_clustering_results ( 
 run_id text,
 segment int,
 total_customers int,
 run_date date,
    PRIMARY KEY (run_id, segment));
"""
cr = session.execute(createTable % keyspaces_schema)
time.sleep(20)
print("Table 'ml_clustering_results' created")
    
insert_ml = (
    "INSERT INTO "
    + keyspaces_schema
    + '.ml_clustering_results'  
    + '("run_id","segment","total_customers","run_date") ' 
    + 'VALUES (?,?,?,?); '
)

prepared = session.prepare(insert_ml)
prepared.consistency_level = ConsistencyLevel.LOCAL_QUORUM

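run_id = "101"
# Named run_date so it doesn't shadow the dt alias of the datetime module
run_date = datetime.now()

# Insert one row per segment with its customer count
for ind in dataFrame.index:
    total_customers = int(dataFrame["customer_id"][ind])
    print(ind, total_customers)
    r = session.execute(prepared, (run_id, int(ind), total_customers, run_date))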

Finally, we clean up the resources created during this tutorial to avoid incurring additional charges.

# Delete blog keyspace and tables
deleteKeyspace = "DROP KEYSPACE IF EXISTS blog"
dr = session.execute(deleteKeyspace)

time.sleep(5)
print("Dropping %s keyspace. It may take a few seconds to a minute to complete deletion of the keyspace and its tables." % keyspaces_schema)

It may take a few seconds to a minute to complete the deletion of keyspace and tables. When you delete a keyspace, the keyspace and all of its tables are deleted and you stop accruing charges from them.

Conclusion

This post showed you how to ingest customer data from Amazon Keyspaces into SageMaker and train a clustering model that segments your customers. You can use this information for targeted marketing, which can help improve your business KPIs. To learn more about Amazon Keyspaces, review the following resources:


About the Authors

Vadim Lyakhovich is a Senior Solutions Architect at AWS in the San Francisco Bay Area helping customers migrate to AWS. He is working with organizations ranging from large enterprises to small startups to support their innovations. He is also helping customers to architect scalable, secure, and cost-effective solutions on AWS.

Parth Patel is a Solutions Architect at AWS in the San Francisco Bay Area. Parth guides customers to accelerate their journey to the cloud and helps them adopt the AWS cloud successfully. He focuses on ML and Application Modernization.

Ram Pathangi is a Solutions Architect at AWS in the San Francisco Bay Area. He has helped customers in Agriculture, Insurance, Banking, Retail, Health Care & Life Sciences, Hospitality, and Hi-Tech verticals to run their business successfully on AWS cloud. He specializes in Databases, Analytics and ML.


Improve organizational diversity, equity, and inclusion initiatives with Amazon Polly

Organizational diversity, equity and inclusion (DEI) initiatives are at the forefront of companies across the globe. By constructing inclusive spaces with individuals from diverse backgrounds and experiences, businesses can better represent our mutual societal needs and deliver on objectives. In the article How Diversity Can Drive Innovation, Harvard Business Review states that companies that focus on multiple dimensions of diversity are 45% more likely to grow their market share and 70% more likely to capture new markets.

DEI initiatives can be difficult and complex to scale, taking long periods of time to show impact. As such, organizations should plan initiatives in phases, similar to an agile delivery process. Achieving small but meaningful wins at each phase can contribute towards larger organizational goals. An example of such an initiative at Amazon is the “Say my Name” tool.

Amazon’s global workforce, with offices in over 30 countries, requires the consistent innovation of inclusive tools to foster an environment that dispels unconscious bias. “Say my Name” was created to help Amazon employees share the correct pronunciation of their names and practice saying the names of their colleagues in a culturally competent manner. Incorrect name pronunciation can alienate team members and can have adverse effects on performance and team morale. A study by Catalyst.org reported that employees are more innovative when they feel more included. In India, 62% of innovation is driven by employee perceptions of inclusion. Adding this pronunciation guide to written names aims to create a more inclusive and respectful professional environment for employees.

The following screenshots show examples of pronunciations generated by “Say my Name”.

Say my name tool interface- practice any name

The application is powered by Amazon Polly. Amazon Polly provides users a text-to-speech (TTS) service that uses advanced deep learning technologies to synthesize natural-sounding human speech. Amazon Polly provides users with dozens of lifelike voices across a broad set of languages, allowing users to select the voice, ethnicity, and accent they would like to share with their colleagues.

In this post, we show how to deploy this name pronunciation application in your AWS environment, along with ways to scale the application across the organization.

Solution overview

The application follows a serverless architecture. The front end is built from a static React app hosted in an Amazon Simple Storage Service (Amazon S3) bucket behind an Amazon CloudFront distribution. The backend runs behind Amazon API Gateway, implemented as AWS Lambda functions to interface with Amazon Polly. Here, the application is fully downloaded to the client and rendered in a web browser. The following diagram shows the solution architecture.


To view a sample demo without using the AWS Management Console, navigate to our demo site.

The site allows users to do the following:

  • Hear how their name and colleagues’ names sound with the different voices of Amazon Polly.
  • Generate MP3 files to put in email signatures or profiles.
  • Generate shareable links to provide colleagues or external partners with accurate pronunciation of names.

To deploy the application in your environment, continue following along with this post.

Prerequisites

You must complete the following prerequisites to implement this solution:

  1. Install Node.js version 16.14.0 or above.
  2. Install the AWS Cloud Development Kit (AWS CDK) version 2.16.0 or above.
  3. Configure AWS Command Line Interface (AWS CLI).
  4. Install Docker and have Docker Daemon running.
  5. Install and configure Git.

The solution works best in the Chrome, Safari, and Firefox web browsers.

Implement the solution

  1. To get started, clone the repository:
    git clone https://github.com/aws-samples/aws-name-pronunciation

The repository consists of two main folders:

    • /cdk – Code to deploy the solution
    • /pronounce_app – Front-end and backend application code

  2. We build the application components and then deploy them via the AWS CDK. To get started, run the following commands in your terminal window:
    cd aws-name-pronunciation/pronounce_app/
    npm install
    npm run build 
    cd ../cdk
    npm install
    cdk bootstrap
    cdk deploy ApiStack --outputs-file ../pronounce_app/src/config.json

This step should produce the endpoints for your backend services using API Gateway. See the following sample output:

Outputs:
ApiStack.getVoicesApiUrl = {endpoint_dns}
ApiStack.synthesizeSpeechApiUrl = {endpoint_dns}

  3. You can now deploy the front end:
    cd ../pronounce_app
    npm run build
    cd ../cdk
    cdk deploy FrontendStack

This step should produce the URL for your CloudFront distribution, along with the S3 bucket storing your React application. See the following sample output:

FrontendStack.Bucket = {your_bucket_name}
FrontendStack.CloudFrontReactAppURL = {your_cloudfront_distribution}

You can validate that all the deployment steps worked correctly by navigating to the AWS CloudFormation console. You should see three stacks, as shown in the following screenshot.

To access “Say my Name”, use the value from the FrontendStack.CloudFrontReactAppURL AWS CDK output. Alternatively, choose the stack FrontendStack on the AWS CloudFormation console, and on the Outputs tab, choose the value for CloudFrontReactAppURL.

CloudFormation outputs

You’re redirected to the name pronunciation application.

Name pronunciation tool interface

In the event that Amazon Polly is unable to correctly pronounce the name entered, we suggest users check out Speech Synthesis Markup Language (SSML) with Amazon Polly. Using SSML-enhanced text gives you additional control over how Amazon Polly generates speech from the text you provide.

For example, you can include a long pause within your text, or change the speech rate or pitch. Other options include:

  • Emphasizing specific words or phrases.
  • Using phonetic pronunciation.
  • Including breathing sounds.
  • Whispering.
  • Using the Newscaster speaking style.

For complete details on the SSML tags supported by Amazon Polly and how to use them, see Supported SSML Tags.
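
As a minimal sketch of the phonetic pronunciation and pause options, the following Python helper builds an SSML string for a name. The helper name build_name_ssml, its parameters, and the sample name with its IPA string are hypothetical and for illustration only; they are not part of the “Say my Name” code base.

```python
from xml.sax.saxutils import escape, quoteattr


def build_name_ssml(name, ipa=None, pause_ms=0):
    """Wrap a name in SSML, optionally with an IPA pronunciation and a leading pause."""
    # Escape characters such as & and < so the SSML stays well-formed.
    spoken = escape(name)
    if ipa:
        # <phoneme> tells the speech engine how to pronounce the enclosed text.
        spoken = f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{spoken}</phoneme>'
    # <break> inserts a pause of the given length before the name.
    pause = f'<break time="{int(pause_ms)}ms"/>' if pause_ms > 0 else ""
    return f"<speak>{pause}{spoken}</speak>"


# The name and IPA string below are illustrative only.
ssml = build_name_ssml("Siobhan", ipa="ʃɪˈvɔːn", pause_ms=300)
print(ssml)
```

A string like this can be sent to the Amazon Polly SynthesizeSpeech operation with the text type set to ssml instead of plain text.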

Conclusion

Organizations have a responsibility to facilitate more inclusive and accessible spaces as workforces grow to be increasingly diverse and globalized. There are numerous use-cases for teaching the correct pronunciation of names in an organization:

  • Helping pronounce the names of new colleagues and team members.
  • Offering the correct pronunciation of your name via an MP3 or audio stream prior to meetings.
  • Providing sales teams mechanisms to learn names of clients and stakeholders prior to customer meetings.

Although this is a small step in creating a more equitable and inclusive workforce, accurate name pronunciations can have profound impacts on how people feel in their workplace. If you have ideas for features or improvements, please raise a pull request on our GitHub repo or leave a comment on this post.

To learn more about the work AWS is doing in DEI, check out AWS Diversity, Equity & Inclusion. To learn more about Amazon Polly, please refer to our resources to get started with Amazon Polly.


About the Authors

Aditi Rajnish is a second-year software engineering student at University of Waterloo. Her interests include computer vision, natural language processing, and edge computing. She is also passionate about community-based STEM outreach and advocacy. In her spare time, she can be found playing badminton, learning new songs on the piano, or hiking in North America’s national parks.

Raj Pathak is a Solutions Architect and Technical advisor to Fortune 50 and Mid-Sized FSI (Banking, Insurance, Capital Markets) customers across Canada and the United States. Raj specializes in Machine Learning with applications in Document Extraction, Contact Center Transformation and Computer Vision.

Mason Force is a Solutions Architect based in Seattle. He specializes in Analytics and helps enterprise customers across the western and central United States develop efficient data strategies. Outside of work, Mason enjoys bouldering, snowboarding and exploring the wilderness across the Pacific Northwest.
