Fine-tune and deploy the ProtBERT model for protein classification using Amazon SageMaker
Proteins, the fundamental macromolecules that govern processes in biological organisms, are composed of amino acids. The 20 standard amino acids, each represented by a capital letter, combine to form a protein sequence, which can be used to predict the subcellular localization (the location of a protein within a cell) and the structure of the protein.
Figure 1: Protein Sequence
The study of protein localization is important for understanding protein function, which is essentially to structure, regulate, and support the body’s tissues and organs. Protein localization is also highly relevant to drug design and other applications. For example, we can investigate methods to disrupt the binding of the spike S1 protein of the SARS-CoV-2 virus to the human receptor ACE2, the binding mechanism that led to the COVID-19 pandemic [1]. Localization also plays an important role in characterizing the cellular function of hypothetical and newly discovered proteins [2].
Figure 2: SARS-CoV-2 virus binding to the ACE2 human receptor
Protein sequences are constrained to adopt particular 3D shapes (referred to as the protein 3D structure) optimized for accomplishing particular functions. These constraints mirror the rules of grammar and meaning in natural language, allowing us to map algorithms from natural language processing (NLP) directly onto protein sequences. During training, the language model learns to extract those constraints from millions of examples and stores the derived knowledge in its weights [1]. Although existing solutions in protein bioinformatics [11, 12, 13, 14, 15, 16] usually have to search for evolutionarily related proteins in exponentially growing databases, language models offer a potential alternative to this increasingly time-consuming database search because they extract features directly from single protein sequences. Additionally, the performance of existing solutions deteriorates if a sufficient number of related sequences can’t be found; for example, the quality of predicted protein structures correlates strongly with the number of effective sequences found in today’s databases [17].
Several research endeavors currently aim to localize whole proteomes by using high-throughput approaches [2, 3, 4]. These large datasets provide important information about protein function, and more generally global cellular processes. However, they currently don’t achieve 100% coverage of proteomes, and the methodology used can in some cases cause mislocalization of subsets of proteins [5, 6]. Therefore, complementary methods are necessary to address these problems.
In this post, we use NLP techniques for protein sequence classification. The idea is to interpret protein sequences as sentences and their constituent amino acids as single words [7]. More specifically, we fine-tune the PyTorch ProtBERT model from the Hugging Face library using Amazon SageMaker.
What is ProtBERT?
ProtBERT is a model pretrained on protein sequences with a masked language modeling objective. It’s based on the BERT architecture and is pretrained on a large corpus of protein sequences in a self-supervised fashion. This means it was pretrained on raw protein sequences only, with no human labeling (which is why it can use large amounts of publicly available data), using an automatic process to generate inputs and labels from those sequences [8]. For more information about ProtBERT, see ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing.
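As a quick illustration of how the model is consumed (a minimal sketch using the public Rostlab/prot_bert checkpoint from the Hugging Face Hub; this is not part of the original notebook), you can load the tokenizer and encoder and embed a space-separated sequence:

import torch
from transformers import BertModel, BertTokenizer

# ProtBERT uses an uppercase amino acid vocabulary, so lower-casing is disabled
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert")

# Amino acids are treated as words, so the sequence is space separated
sequence = "M V K G P G L Y T E I"
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, number of tokens, hidden size)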
Solution overview
The post focuses on fine-tuning the PyTorch ProtBERT model (see the following diagram). We first extend the pretrained ProtBERT model to classify the protein sequences.
We then deploy the model using SageMaker, which is the most comprehensive and fully managed machine learning (ML) service. With SageMaker, data scientists and developers can quickly and easily build and train ML models, and then directly deploy them into a production-ready hosted environment. During training, we use the SageMaker distributed data parallel (SDP) feature, which extends SageMaker’s training capabilities for deep learning models with near-linear scaling efficiency, achieving a fast time to train with minimal code changes.
The notebook and code from this post are available on GitHub. To run it yourself, clone the GitHub repository and open the Jupyter notebook file.
Dataset
In this post, we use the open-source, public DeepLoc [10] dataset of protein sequences to train the model. The dataset is a FASTA file in which each record consists of a header and a protein sequence. The header is composed of the accession number from UniProt, the annotated subcellular localization, and possibly a description field indicating whether the protein was part of the test set. The subcellular localization includes an additional label, where S indicates soluble, M membrane, and U unknown [9]. The following is a sample of the data:
>Q9SMX3 Mitochondrion-M test
MVKGPGLYTEIGKKARDLLYRDYQGDQKFSVTTYSSTGVAITTTGTNKGSLFLGDVATQVKNNNFTADVKVST
DSSLLTTLTFDEPAPGLKVIVQAKLPDHKSGKAEVQYFHDYAGISTSVGFTATPIVNFSGVVGTNGLSLGTDV
AYNTESGNFKHFNAGFNFTKDDLTASLILNDKGEKLNASYYQIVSPSTVVGAEISHNFTTKENAITVGTQHAL
DPLTTVKARVNNAGVANALIQHEWRPKSFFTVSGEVDSKAIDKSAKVGIALALKP
A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The definition line (defline) is distinguished from the sequence data by a greater-than (>) symbol at the beginning. The word following the > symbol is the identifier of the sequence, and the rest of the line is the description.
We download the FASTA formatted dataset and read it, directly filtering out the columns that are of interest (a minimal parsing sketch follows the column list below). The dataset consists of 14,000 sequences and 6 columns in total. The columns are as follows:
- id – Unique identifier given to each sequence in the dataset.
- sequence – Protein sequence, with each character separated by a space. This is useful for the BERT tokenizer.
- sequence_length – Character length of each protein sequence.
- location – Classification (subcellular localization) given to each sequence. The dataset has 10 unique classes.
- is_train – Indicates whether the record should be used for training or test. It is also used to separate the dataset for training and validation.
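The following is a minimal sketch (not the exact parsing code from the notebook; the local file name is hypothetical) of how the FASTA records could be read into a pandas DataFrame with these columns:

import pandas as pd

records = []
with open("deeploc_data.fasta") as f:  # hypothetical local file name
    header, seq = None, []
    for line in f:
        line = line.strip()
        if line.startswith(">"):
            if header:
                records.append((header, "".join(seq)))
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header:
        records.append((header, "".join(seq)))

rows = []
for header, seq in records:
    parts = header.split()
    rows.append({
        "id": parts[0],
        "location": parts[1].split("-")[0],   # e.g. "Mitochondrion-M" -> "Mitochondrion"
        "is_train": "test" not in parts[2:],  # the header may carry a "test" flag
        "sequence": " ".join(seq),            # space-separate residues for the BERT tokenizer
        "sequence_length": len(seq),
    })
df = pd.DataFrame(rows)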
When we plot the sequence lengths of each record as a histogram, we observe the following distribution.
This is an important observation because the ProtBERT model receives a fixed sentence length as input. Usually, the maximum length of a sentence depends on the data we’re working with. For sentences shorter than this maximum length, we have to add padding (empty tokens) to bring them up to that length.
In the preceding plot, most of the sequences are under 1,500 characters in length, so max_length = 1536 would be a good choice; however, that increases the training time for this sample notebook, so we use max_length = 512.
When we retrieve each sequence record using the PyTorch DataLoader during training, we must ensure that each sequence is tokenized and truncated, and that the necessary padding is added so that all sequences match the max_length value. To encapsulate this process, we define the ProteinSequenceDataset class, which uses the encode_plus() API provided by the Hugging Face transformers library:
#data_prep.py
import torch
from torch import nn
import torch.utils.data
import torch.utils.data.distributed
from torch.utils.data import Dataset, DataLoader, RandomSampler, TensorDataset

class ProteinSequenceDataset(Dataset):
    def __init__(self, sequence, targets, tokenizer, max_len):
        self.sequence = sequence
        self.targets = targets
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.sequence)

    def __getitem__(self, item):
        sequence = str(self.sequence[item])
        target = self.targets[item]
        encoding = self.tokenizer.encode_plus(
            sequence,
            truncation=True,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            padding='max_length',
            return_attention_mask=True,
            return_tensors='pt',
        )
        return {
            'protein_sequence': sequence,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'targets': torch.tensor(target, dtype=torch.long)
        }
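The following is a minimal usage sketch (the training DataFrame df_train and its integer-encoded label column are assumptions, as is the use of the Rostlab/prot_bert_bfd tokenizer) showing how this dataset class might be wired into a DataLoader:

from torch.utils.data import DataLoader
from transformers import BertTokenizer

MAX_LEN = 512
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert_bfd", do_lower_case=False)

train_dataset = ProteinSequenceDataset(
    sequence=df_train.sequence.to_numpy(),         # space-separated protein sequences
    targets=df_train.location_encoded.to_numpy(),  # hypothetical integer-encoded localization labels
    tokenizer=tokenizer,
    max_len=MAX_LEN,
)
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True, num_workers=2)

batch = next(iter(train_loader))
print(batch["input_ids"].shape, batch["targets"].shape)  # torch.Size([4, 512]) torch.Size([4])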
Next, we divide the dataset into training and test sets. We can use the is_train column to do the split, which results in 11,231 records for the training set and 2,773 records for the test set (about an 80:20 split). Finally, we upload the test and train data to our Amazon Simple Storage Service (Amazon S3) location to accommodate model training on SageMaker.
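A minimal sketch of that split and upload (assuming the DataFrame df from the parsing sketch above; file names and S3 key prefixes are illustrative) could look like the following. The returned S3 URIs are what we later pass to the estimator’s fit() call as inputs_train and inputs_test.

import sagemaker

df_train = df[df.is_train].copy()
df_test = df[~df.is_train].copy()

df_train.to_csv("deeploc_per_protein_train.csv", index=False)
df_test.to_csv("deeploc_per_protein_test.csv", index=False)

sagemaker_session = sagemaker.Session()
inputs_train = sagemaker_session.upload_data("deeploc_per_protein_train.csv", key_prefix="protbert/train")
inputs_test = sagemaker_session.upload_data("deeploc_per_protein_test.csv", key_prefix="protbert/test")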
ProtBERT fine-tuning
In computational biology and bioinformatics, we have gold mines of data from protein sequences, but we need high computing resources to train the models, which can be limiting and costly. One way to overcome this challenge is to use transfer learning.
Transfer learning is an ML method in which a pretrained model, such as a pretrained BERT model for text classification, is reused as the starting point for a different but related problem. By reusing parameters from pretrained models, you can save significant amounts of training time and cost.
In our notebook, we use the pretrained prot_bert_bfd_localization model on the DeepLoc dataset for predicting protein subcellular localization by adding a classification layer, as shown in the following code:
#model_def.py
from transformers import BertModel, BertTokenizer, AdamW, get_linear_schedule_with_warmup
import torch
import torch.nn.functional as F
import torch.nn as nn

PRE_TRAINED_MODEL_NAME = 'Rostlab/prot_bert_bfd_localization'

class ProteinClassifier(nn.Module):
    def __init__(self, n_classes):
        super(ProteinClassifier, self).__init__()
        self.bert = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)
        self.classifier = nn.Sequential(nn.Dropout(p=0.2),
                                        nn.Linear(self.bert.config.hidden_size, n_classes),
                                        nn.Tanh())

    def forward(self, input_ids, attention_mask):
        output = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        return self.classifier(output.pooler_output)
We use the ProteinClassifier defined in the model_def.py script for training.
Training script
We use the PyTorch-Transformers library, which contains PyTorch implementations and pretrained model weights for many NLP models, including BERT. As mentioned earlier, we use the ProtBERT model, which is pretrained on protein sequences.
We also use the SageMaker distributed data parallel feature, launched in December 2020, to speed up training by distributing the data across multiple GPUs. The training script is very similar to a PyTorch training script you might run outside of SageMaker, but modified to run with SDP. SDP’s PyTorch client provides an alternative to PyTorch’s native DDP. For details about how to use SDP in your native PyTorch script, see Get Started with Distributed Training.
The following script saves the model artifacts learned during training to a file path, model_dir, as mandated by the SageMaker PyTorch image:
# SageMaker Distributed code.
from smdistributed.dataparallel.torch.parallel.distributed import DistributedDataParallel as DDP
import smdistributed.dataparallel.torch.distributed as dist

# initializes the process group for distributed training
dist.init_process_group()
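The rest of the distributed setup follows the usual SDP pattern. The following continuation sketch is illustrative rather than verbatim training-script code (variable names such as args.num_labels and train_dataset are assumptions): each process is pinned to one GPU, the model is wrapped with the smdistributed DistributedDataParallel class imported above, and the data is sharded across workers with a distributed sampler.

# Pin each process to a single GPU based on its local rank
local_rank = dist.get_local_rank()
torch.cuda.set_device(local_rank)

# Wrap the classifier with the SDP DistributedDataParallel wrapper
model = ProteinClassifier(args.num_labels).to(local_rank)
model = DDP(model)

# Shard the training data across all workers
train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=dist.get_world_size(), rank=dist.get_rank()
)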
When training is complete, SageMaker uploads the model artifacts saved in model_dir to Amazon S3 so they’re available for deployment. The following code in the script saves the trained model artifacts:
def save_model(model, model_dir):
    path = os.path.join(model_dir, 'model.pth')
    # recommended way from http://pytorch.org/docs/master/notes/serialization.html
    torch.save(model.state_dict(), path)
    logger.info(f"Saving model: {path} \n")
Because PyTorch-Transformers isn’t included natively in SageMaker PyTorch images, we have to provide a requirements.txt file so that SageMaker installs the library for training and inference. A requirements.txt file is a text file that contains a list of packages to install by using pip install; you can also pin the version of each package. To install PyTorch-Transformers and the other libraries we need, we add the following lines to the requirements.txt file:
transformers
torch-optimizer
sagemaker==2.19.0
boto3
You can view the entire file in the GitHub repo; it goes into the code/ directory. For more information about the format of a requirements.txt file, see Requirements Files.
Train on SageMaker
We use SageMaker to train and deploy a model using our custom PyTorch code. The SageMaker Python SDK makes it easy to run a PyTorch script in SageMaker using its PyTorch estimator. After that, we can use the SageMaker Python SDK to deploy the trained model and run predictions. For more information on how to use this SDK with PyTorch, see Use PyTorch with the SageMaker Python SDK.
To start, we use the PyTorch estimator class to train our model. When creating our estimator, we make sure to specify a few things:
- entry_point – The name of our PyTorch script. It contains our training script, which loads data from the input channels, configures training with hyperparameters, trains a model, and saves the model. It also contains code to load and run the model during inference.
- source_dir – The location of our training scripts and the requirements.txt file. The requirements file lists the packages you want to use with your script.
- framework_version – The PyTorch version we want to use.
The PyTorch estimator supports both single-machine and multi-machine distributed PyTorch training using SDP. Our training script supports distributed training on GPU instances only.
Instance types
SDP supports model training on SageMaker with the following instance types only:
- p3.16xlarge
- p3dn.24xlarge (Recommended)
- p4d.24xlarge (Recommended)
Instance count
To get the best performance out of SDP, you should use at least two instances, but you can also use one instance to test this example. In that case, the script runs in single-instance, multi-GPU mode and takes advantage of the eight GPUs on the instance to train faster and more cheaply.
Distribution strategy
To use DDP mode, you update the distribution strategy and set it to use smdistributed dataparallel.
After we create the estimator, we call fit(), which launches a training job. We use the Amazon S3 URIs where we uploaded the training data earlier. See the following code:
from sagemaker.pytorch import PyTorch

TRAINING_JOB_NAME = "protbert-training-pytorch-{}".format(time.strftime("%m-%d-%Y-%H-%M-%S"))
print('Training job name: ', TRAINING_JOB_NAME)

estimator = PyTorch(
    entry_point="train.py",
    source_dir="code",
    role=role,
    framework_version="1.6.0",
    py_version="py36",
    instance_count=1,  # this script supports distributed training for GPU instances only
    instance_type="ml.p3.16xlarge",
    distribution={'smdistributed': {
        'dataparallel': {
            'enabled': True
        }
    }},
    debugger_hook_config=False,
    hyperparameters={
        "epochs": 3,
        "num_labels": num_classes,
        "batch-size": 4,
        "test-batch-size": 4,
        "log-interval": 100,
        "frozen_layers": 15,
    },
    metric_definitions=[
        {'Name': 'train:loss', 'Regex': 'Training Loss: ([0-9\.]+)'},
        {'Name': 'test:accuracy', 'Regex': 'Validation Accuracy: ([0-9\.]+)'},
        {'Name': 'test:loss', 'Regex': 'Validation loss: ([0-9\.]+)'},
    ]
)

estimator.fit({"training": inputs_train, "testing": inputs_test}, job_name=TRAINING_JOB_NAME)
With max_length=512 and running the model for only three epochs, we get a validation accuracy of around 65%, which is pretty decent. You can optimize it further by trying a longer sequence length, increasing the number of epochs, and tuning other hyperparameters. Make sure to increase the GPU memory or reduce the batch size when you increase the sequence length; otherwise, you might get a CUDA out-of-memory error.
For more details on optimizing the model, see ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing.
Deploy the model on SageMaker
After we train our model, we host it on a SageMaker endpoint. To make the endpoint load the model and serve predictions, we implement a few methods in inference.py:
- model_fn() – Loads the saved model and returns a model object that can be used for model serving. The SageMaker PyTorch model server loads our model by invoking model_fn:
def model_fn(model_dir):
    logger.info('model_fn')
    print('Loading the trained model...')
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = ProteinClassifier(10)  # pass the number of classes, in our case 10
    with open(os.path.join(model_dir, 'model.pth'), 'rb') as f:
        model.load_state_dict(torch.load(f, map_location=device))
    return model.to(device)
- input_fn() – Deserializes and prepares the prediction input. In this example, our request body is first serialized to JSON and then sent to the model serving endpoint. Therefore, in input_fn(), we first deserialize the JSON-formatted request body and return the input as a torch.tensor, as required for the ProtBERT model:
def input_fn(request_body, request_content_type):
    """An input_fn that deserializes a JSON-formatted request body"""
    if request_content_type == "application/json":
        sequence = json.loads(request_body)
        print("Input protein sequence: ", sequence)
        encoded_sequence = tokenizer.encode_plus(
            sequence,
            max_length=MAX_LEN,
            add_special_tokens=True,
            return_token_type_ids=False,
            padding='max_length',
            return_attention_mask=True,
            return_tensors='pt'
        )
        input_ids = encoded_sequence['input_ids']
        attention_mask = encoded_sequence['attention_mask']
        return input_ids, attention_mask
    raise ValueError("Unsupported content type: {}".format(request_content_type))
- predict_fn() – Performs the prediction and returns the result. To deploy our endpoint, we call deploy() on our PyTorch estimator object, passing in our desired number of instances and instance type:
def predict_fn(input_data, model):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()
    input_id, input_mask = input_data
    input_id = input_id.to(device)
    input_mask = input_mask.to(device)
    with torch.no_grad():
        output = model(input_id, input_mask)
        _, prediction = torch.max(output, dim=1)
    return prediction
Create a model object
You define the model object by using the SageMaker SDK’s PyTorchModel
and pass in the model from the estimator and the entry_point
. The function loads the model and sets it to use a GPU, if available. See the following code:
import sagemaker
from sagemaker.pytorch import PyTorchModel

ENDPOINT_NAME = "protbert-inference-pytorch-1-{}".format(time.strftime("%m-%d-%Y-%H-%M-%S"))
print("Endpoint name: ", ENDPOINT_NAME)

model = PyTorchModel(model_data=model_data, source_dir='code',
                     entry_point='inference.py', role=role, framework_version='1.6.0', py_version='py3')
Deploy the model on an endpoint
You create a predictor by using the model.deploy function. You can optionally change both the instance count and instance type:
%%time
predictor = model.deploy(initial_instance_count=1, instance_type='ml.m5.2xlarge', endpoint_name=ENDPOINT_NAME)
Predict protein subcellular localization
Now that we have deployed the model endpoint, we can provide some protein sequences and let the model endpoint identify their subcellular localization, using the predictor we created:
prediction = predictor.predict(protein_sequence)
print(prediction)
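Because input_fn() expects a JSON request body, the predictor needs a JSON serializer. The following is a minimal sketch (assuming the SageMaker Python SDK v2 serializer classes; the sample sequence is illustrative and truncated):

from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

predictor.serializer = JSONSerializer()
predictor.deserializer = JSONDeserializer()

# Amino acids are space separated, matching the tokenizer's expectations
protein_sequence = "M G K K D A S T T R T P V D Q Y R K Q I G R Q D Y K K N K P V L K A T R L"
prediction = predictor.predict(protein_sequence)
print(prediction)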
The following table summarizes some of our results.
| Sequence | Ground Truth | Prediction |
| --- | --- | --- |
| M G K K D A S T T R T P V D Q Y R K Q I G R Q D Y K K N K P V L K A T R L K A E A K K A A I G I K E V I L V T I A I L V L L F A F Y A F F F L N L T K T D I Y E D S N N | Endoplasmic.reticulum | Endoplasmic.reticulum |
| M S M T I L P L E L I D K C I G S N L W V I M K S E R E F A G T L V G F D D Y V N I V L K D V T E Y D T V T G V T E K H S E M L L N G N G M C M L I P G G K P E | Nucleus | Nucleus |
| M G G P T R R H Q E E G S A E C L G G P S T R A A P G P G L R D F H F T T A G P S K A D R L G D A A Q I H R E R M R P V Q C G D G S G E R V F L Q S P G S I G T L Y I R L D L N S Q R S T C C C L L N A G T K G M C | Cytoplasm | Cytoplasm |
Clean up resources
Remember to delete the SageMaker endpoint and SageMaker notebook instance created to avoid charges. See the following code:
predictor.delete_endpoint()
Conclusion
In this post, we used a pretrained ProtBERT model (prot_bert_bfd_localization) as a starting point and fine-tuned it for the downstream task of identifying the subcellular localization of protein sequences. We used SageMaker capabilities to train, deploy, and run inference on the model. Furthermore, we explored the SageMaker data parallel feature to make our training process efficient. You can apply the same concept to other downstream tasks, such as amino acid-level classification like predicting the secondary structure of a protein. For more about using PyTorch with SageMaker, see Using PyTorch with the SageMaker Python SDK.
References
- [1] ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v2.full.pdf)
- [2] Protein sequence diagram: https://www.technologynetworks.com/applied-sciences/articles/essential-amino-acids-chart-abbreviations-and-structure-324357
- [3] Refining Protein Subcellular Localization (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1289393/)
- [4] Kumar A, Agarwal S, Heyman JA, Matson S, Heidtman M, et al. Subcellular localization of the yeast proteome. Genes Dev. 2002;16:707–719. [PMC free article] [PubMed] [Google Scholar]
- [5] Huh WK, Falvo JV, Gerke LC, Carroll AS, Howson RW, et al. Global analysis of protein localization in budding yeast. Nature. 2003;425:686–691. [PubMed] [Google Scholar]
- [6] Wiemann S, Arlt D, Huber W, Wellenreuther R, Schleeger S, et al. From ORFeome to biology: A functional genomics pipeline. Genome Res. 2004;14:2136–2144. [PMC free article] [PubMed] [Google Scholar]
- [7] Davis TN. Protein localization in proteomics. Curr Opin Chem Biol. 2004;8:49–53. [PubMed] [Google Scholar]
- [8] Scott MS, Thomas DY, Hallett MT. Predicting subcellular localization via protein motif co-occurrence. Genome Res. 2004;14:1957–1966. [PMC free article] [PubMed] [Google Scholar]
- [9] ProtBERT Hugging Face (https://huggingface.co/Rostlab/prot_bert)
- [10] DeepLoc-1.0: Eukaryotic protein subcellular localization predictor (http://www.cbs.dtu.dk/services/DeepLoc-1.0/data.php)
- [11] M. S. Klausen, M. C. Jespersen et al., “NetSurfP-2.0: Improved prediction of protein structural features by integrated deep learning,” Proteins: Structure, Function, and Bioinformatics, vol. 87, no. 6, pp. 520–527, 2019, _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/prot.25674.
- [12] J. J. Almagro Armenteros, C. K. Sønderby et al., “DeepLoc: Prediction of protein subcellular localization using deep learning,” Bioinformatics, vol. 33, no. 21, pp. 3387–3395, Nov. 2017.
- [13] J. Yang, I. Anishchenko et al., “Improved protein structure prediction using predicted interresidue orientations,” Proceedings of the National Academy of Sciences, vol. 117, no. 3, pp. 1496–1503, Jan. 2020.
- [14] A. Kulandaisamy, J. Zaucha et al., “Pred-MutHTP: Prediction of disease-causing and neutral mutations in human transmembrane proteins,” Human Mutation, vol. 41, no. 3, pp. 581–590, 2020, _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/humu.23961.
- [15] M. Schelling, T. A. Hopf, and B. Rost, “Evolutionary couplings and sequence variation effect predict protein binding sites,” Proteins: Structure, Function, and Bioinformatics, vol. 86, no. 10, pp. 1064–1074, 2018, _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/prot.25585.
- [16] M. Bernhofer, E. Kloppmann et al., “TMSEG: Novel prediction of transmembrane helices,” Proteins: Structure, Function, and Bioinformatics, vol. 84, no. 11, pp. 1706–1716, 2016, _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/prot.25155.
- [17] D. S. Marks, L. J. Colwell et al., “Protein 3D Structure Computed from Evolutionary Sequence Variation,” PLOS ONE, vol. 6, no. 12, p. e28766, Dec. 2011.
About the Authors
Mani Khanuja is an Artificial Intelligence and Machine Learning Specialist SA at Amazon Web Services (AWS). She helps customers use machine learning to solve their business challenges with AWS. She spends most of her time diving deep and teaching customers on AI/ML projects related to computer vision, natural language processing, forecasting, ML at the edge, and more. She is passionate about ML at the edge and has created her own lab with a self-driving kit and a prototype manufacturing production line, where she spends a lot of her free time.
Shamika Ariyawansa is a Solutions Architect at AWS helping customers run a variety of applications on AWS and machine learning workloads in particular. He is based out of Denver, Colorado. In his spare time, he enjoys off-roading adventures in the Colorado mountains and competing in machine learning competitions.
Vaijayanti Joshi is a Boston-based Solutions Architect for AWS. She is passionate about technology and enjoys helping customers find innovative solutions to complex business challenges. Her core areas of focus are machine learning and analytics. When she’s not working with customers on their journey to the cloud, she enjoys biking, swimming, and exploring new places.
Predicting answers to product questions using similar products
A new method predicts with high accuracy the answer to a product question, based on knowledge from similar products.
Gain valuable ML skills with the AWS Machine Learning Engineer Nanodegree Scholarship from Udacity
Amazon Web Services is partnering with Udacity to help educate developers of all skill levels on machine learning (ML) concepts with the AWS Machine Learning Scholarship Program by Udacity by offering 425 scholarships, with a focus on women and underrepresented groups.
Machine learning is an exciting and rapidly developing technology that has the power to create millions of jobs and transform our daily lives. According to the Future of Jobs Report 2020 by the World Economic Forum, by 2025, 97 million new roles may be created as a result of ML innovation. However, developers need the skills to act on these opportunities now. Proximity to high-quality education, the cost of traditional education, and the time required to start and complete new learning projects all make learning ML more complicated.
To address this challenge, AWS invests in educating developers, data scientists, and ML developers with a variety of education solutions, such as exploring reinforcement learning concepts with AWS DeepRacer, training and validation with the AWS Certified Machine Learning – Specialty certification, and hands-on tutorials from the AWS Machine Learning Community.
Up-leveling ML skills and opening new career opportunities
With the AWS Machine Learning Engineer Nanodegree by Udacity, developers can learn valuable skills for an ML career path with an interactive, cost-effective, and accessible ML education. All students who enroll in the scholarship program have access to AWS Machine Learning Foundations, a free course covering an introduction to ML concepts, including reinforcement learning, computer vision, and generative artificial intelligence, with expert-led interactive tutorials using AWS AI devices such as AWS DeepRacer, AWS DeepLens, and AWS DeepComposer.
The AWS Machine Learning Scholarship Program is open to all for registration starting May 26, 2021, through June 23, 2021. Your learning journey begins with the free AWS Machine Learning Foundations course on June 28, 2021, in which you learn the fundamental aspects of ML, ML techniques and algorithms, programming best practices, Python coding, and interactive tutorials with AWS AI devices. You have 3 months to study and complete your assessment by October 11, 2021, with the top 425 students eligible for a scholarship to the AWS Machine Learning Engineer Nanodegree.
The AWS Machine Learning Foundations Course (free) includes the following objectives:
- Learn the fundamentals of ML
- Learn object-oriented programming best practices
- Learn computer vision with AWS DeepLens, reinforcement learning with AWS DeepRacer, and generative AI with AWS DeepComposer.
- Dedicate 3–5 hours a week on the course and work towards earning one of the follow-up Nanodegree program scholarships
In the Machine Learning Engineer Nanodegree program (a $1,000 value course), you learn advanced ML techniques and algorithms, including how to package and deploy models to a production environment.
“The AWS Machine Learning Nanodegree Program enabled me to learn and achieve valuable machine learning skills at my own pace with interactive modules that made learning fun and effective,” said Juv Chan, AWS ML Hero and AWS Machine Learning Nanodegree Program Alumni. “Carving out time to learn machine learning can be very hard, especially under the demanding schedules that software engineers work from. The flexibility offered by Udacity Nanodegrees lets me learn new skills on a timetable that works for me.”
This year, we added 100 additional scholarships on top of the 325 scholarships allocated in 2020, and updated the content for students with advanced ML techniques and algorithms and expert-led tutorials on deploying ML models at scale with Amazon SageMaker.
AWS is also collaborating with several nonprofit organizations through the We Power Tech Program to increase the diversity and talent in technical roles, including organizations like Girls In Tech and the National Society of Black Engineers. As part of these ongoing relationships, the nonprofit organizations will help encourage women and underrepresented groups to participate in the AWS Machine Learning Engineer Nanodegree Scholarship Program. Organizations like these develop programs to inspire, support, train, and empower people from underrepresented groups to pursue careers in tech.
“AWS strives to help level the playing field for women and people of color, who have been underrepresented in the tech industry for far too long. We are thrilled to collaborate with Udacity to make this sort of technical training more widely available and accessible,” said LaDavia Drane, global head of Inclusion, Diversity & Equity at AWS. “We look forward to seeing the incredible innovations in machine learning that are sure to come from this initiative.”
“Tech needs representation from women, BIPOC, and other marginalized communities in every aspect of our industry. Companies must make meaningful and measurable change in the areas of diversity, equity, and inclusion to reach their greatest potential, and skills training programs uniquely tailored to increase representation from these groups are necessary for technology to achieve all that it’s capable of. Girls in Tech applauds our collaborator AWS, as well as Udacity, for breaking down the barriers that so often leave women behind in tech. Together, we aim to give everyone a seat at the table.” Adriana Gascoigne, Founder and CEO, Girls in Tech.
How the AWS Machine Learning Engineer Nanodegree Scholarship works
Scholarship enrollment is open from May 26, 2021, through June 23, 2021. Students begin their learning journey with the AWS Machine Learning Foundations Course from June 28, 2021, through October 11, 2021. At the end of the course, learners take an assessment, from which top students are selected for 425 scholarships, who then start the AWS Machine Learning Engineer Nanodegree from October 25, 2021, through January 25, 2022.
Get started and leverage the community
Get started by enrolling now and accelerate your ML journey! Connect with experts and like-minded aspiring ML developers on the AWS Machine Learning Slack channel.
About the Author
Cameron Peron is Senior Marketing Manager for AWS AI/ML Education and the AWS AI/ML community. He evangelizes how AI/ML innovation solves complex challenges facing communities, enterprises, and startups alike. Out of the office, he enjoys staying active with kettlebell sport, spending time with his family and friends, and is an avid fan of Euro-league basketball.
Ten university teams selected to participate in Alexa Prize TaskBot Challenge
Teams from three continents will compete to develop agents that assist customers in completing multi-step tasks.
How Contentsquare reduced TensorFlow inference latency with TensorFlow Serving on Amazon SageMaker
In this post, we present the results of a model serving experiment made by Contentsquare scientists with an innovative DL model trained to analyze HTML documents. We show how the Amazon SageMaker TensorFlow Serving solution helped Contentsquare address several serving challenges.
Contentsquare’s challenge
Contentsquare is a fast-growing French technology company empowering brands to build better digital experiences. In their own words, “Our experience analytics platform tracks and visualizes billions of digital behaviors, delivering intelligent recommendations that everyone can use to grow revenue, increase loyalty, and fuel innovation.”
Contentsquare scientists developed several ML and deep learning models, and wanted to find solutions for cost-effective and performant real-time model serving. For this experiment, they chose a custom multi-input, multi-task deep neural network developed with TensorFlow-backed Keras, which can answer several questions in a single inference on large payloads consisting of HTML pages.
Baseline deployment: Flask on Amazon EC2
As a baseline deployment, the Contentsquare team served the TensorFlow-backed Keras model from a Flask server hosted on an Amazon Elastic Compute Cloud (Amazon EC2) p2.xlarge GPU machine. Flask is a popular Python web framework (52,000 stars on GitHub) appreciated for its simplicity and large community. The EC2 p2.xlarge instance was fitted with a NVIDIA Tesla K80 GPU card. On a reference input payload used as a benchmark, this design provided a single-request inference latency of approximately 5 seconds.
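For reference, a baseline of this kind typically looks like the following minimal Flask sketch (hypothetical file and model names; this is not Contentsquare’s actual serving code):

import numpy as np
from flask import Flask, jsonify, request
from tensorflow import keras

app = Flask(__name__)
model = keras.models.load_model("model.h5")  # hypothetical Keras model artifact

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()  # e.g. preprocessed features for one HTML page
    predictions = model.predict(np.array(payload["instances"]))
    return jsonify({"predictions": predictions.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)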
Optimized deployment: TensorFlow Serving on SageMaker
To reduce management overhead and get a simpler deployment experience, the Contentsquare team experimented with Amazon SageMaker. SageMaker is a managed service supporting the development lifecycle of custom models, from annotation up to production deployment and monitoring. Beyond enabling a faster time to market, SageMaker provides state-of-the-art open-source pre-written serving containers for XGBoost (container, SDK), Scikit-Learn (container, SDK), PyTorch (container, SDK), TensorFlow (container, SDK) and Apache MXNet (container, SDK). In particular, the SageMaker TensorFlow serving container is built on top of TensorFlow Serving (TensorFlow-Serving: Flexible, High-Performance ML Serving, Olston et al.), the official, high-performance serving stack for TensorFlow. The SageMaker team further improved the TensorFlow Serving experience by adding the option to run custom inference code in front of TensorFlow Serving (for example, for pre or postprocessing).
Slim Frikha, a Contentsquare scientist, says, “That is one of the reasons why we use TensorFlow Serving on SageMaker: TensorFlow Serving runs performant inference, SageMaker provides easy deployment, and the combination of both brings the extra possibility to do preprocessing and postprocessing with TensorFlow Serving.”
Preprocessing and postprocessing are important capabilities that ML practitioners look for when choosing an ML serving solution. To use the custom processing capacity of SageMaker TensorFlow Serving, developers can provide a custom inference.py script containing handling functions. For more information, see Create Python Scripts for Custom Input and Output Formats.
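The following is a minimal sketch of such an inference.py, following the handler interface documented for the SageMaker TensorFlow Serving container (the preprocessing step itself is hypothetical):

import json

def input_handler(data, context):
    """Pre-process the request before it is forwarded to TensorFlow Serving."""
    if context.request_content_type == "application/json":
        payload = json.loads(data.read().decode("utf-8"))
        # ... custom preprocessing of the HTML-derived features would go here ...
        return json.dumps({"instances": payload["instances"]})
    raise ValueError("Unsupported content type: {}".format(context.request_content_type))

def output_handler(response, context):
    """Post-process the TensorFlow Serving output before returning it to the client."""
    if response.status_code != 200:
        raise ValueError(response.content.decode("utf-8"))
    return response.content, context.accept_header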
The following figures show a high-level view of the internal architecture of the current SageMaker TensorFlow Serving container. Two web servers are collocated in each instance of the endpoint instance fleet. An NGINX server handles the communication with the requesting client and can optionally run ad hoc data processing via an inference.py script running in Gunicorn. A TensorFlow Serving server internally exposes TensorFlow models for consumption by the Gunicorn server. In-server communication between Gunicorn and TensorFlow Serving can be done in REST or gRPC when using an inference.py custom inference script, and with REST when using the default setup without the custom inference script. In both cases, external requests are done with REST.
The Contentsquare team tested both gRPC and HTTP for internal communication with TensorFlow Serving, and found gRPC to be much faster than HTTP, because HTTP required a JSON dump of the very large preprocessed input. On the specific benchmark inference payload, deploying in SageMaker TensorFlow Serving on an ml.p2.xlarge hosting instance reduced the global serving latency from 5 seconds to 3 seconds, compared to Keras deployed in Flask on Amazon EC2 p2.xlarge instance—a 40% improvement! This gain is driven by serving optimizations internal to TensorFlow Serving and decoding inputs to TensorFlow tensors, which can be faster if using gRPC.
Conclusion
Contentsquare scientists successfully completed their benchmark and found a cost-effective, high-performance serving solution for their custom TensorFlow model that reduced latency by 40% vs. a reasonable baseline. Another axis of improvement, not evaluated in this benchmark but worth consideration for extra gains, would be to evaluate different instance types. For example, the EC2 G4 instances, more recent than the P2, demonstrated great performance and economics in several inference cases. If you are interested in learning more about TensorFlow Serving on SageMaker, you can find guidance in the documentation, view the container source code on GitHub and navigate our examples gallery.
About the Author
Olivier Cruchant is a Machine Learning Specialist Solutions Architect at AWS, based in Lyon, France. Olivier helps French customers – from small startups to large enterprises – develop and deploy production-grade machine learning applications. In his spare time, he enjoys reading research papers and exploring the wilderness with friends and family.
Host multiple TensorFlow computer vision models using Amazon SageMaker multi-model endpoints
Amazon SageMaker helps data scientists and developers prepare, build, train, and deploy high-quality machine learning (ML) models quickly by bringing together a broad set of capabilities purpose-built for ML. SageMaker accelerates innovation within your organization by providing purpose-built tools for every step of ML development, including labeling, data preparation, feature engineering, statistical bias detection, AutoML, training, tuning, hosting, explainability, monitoring, and workflow automation.
Companies are increasingly training ML models based on individual user data. For example, an image sharing service designed to enable discovery of information on the internet trains custom models based on each user’s uploaded images and browsing history to personalize recommendations for that user. The company can also train custom models based on search topics for recommending images per topic. Building custom ML models for each use case leads to higher inference accuracy, but increases the cost of deploying and managing models. These challenges become more pronounced when not all models are accessed at the same rate but still need to be available at all times.
SageMaker multi-model endpoints provide a scalable and cost-effective way to deploy large numbers of ML models in the cloud. SageMaker multi-model endpoints enable you to deploy multiple ML models behind a single endpoint and serve them using a single serving container. Your application simply needs to include an API call with the target model to this endpoint to achieve low-latency, high-throughput inference. Instead of paying for a separate endpoint for every single model, you can host many models for the price of a single endpoint. For more information about SageMaker multi-model endpoints, see Save on inference costs by using Amazon SageMaker multi-model endpoints.
In this post, we demonstrate how to use SageMaker multi-model endpoints to host two computer vision models with different model architectures and datasets for image classification. In practice, you can deploy tens of thousands of models on multi-model endpoints.
Overview of solution
SageMaker multi-model endpoints work with several frameworks, such as TensorFlow, PyTorch, MXNet, and sklearn, and you can build your own container with a multi-model server. Multi-model endpoints are also supported natively in the following popular SageMaker built-in algorithms: XGBoost, Linear Learner, Random Cut Forest (RCF), and K-Nearest Neighbors (KNN). You can directly use the SageMaker-provided containers while using these algorithms without having to build your own custom container.
The following diagram is a simplified illustration of how you can host multiple (for this post, six) models using SageMaker multi-model endpoints. In practice, multi-model endpoints can accommodate hundreds to tens of thousands of ML models behind an endpoint. In our architecture, if we host more models using model artifacts stored in Amazon Simple Storage Service (Amazon S3), multi-model endpoints dynamically unload some of the least-used models to accommodate newer models.
In this post, we show how to host two computer vision models trained using the TensorFlow framework behind a single SageMaker multi-model endpoint. We use the TensorFlow Serving container enabled for multi-model endpoints to host these models. For our first model, we train a smaller version of AlexNet CNN to classify images from the CIFAR-10 dataset. For the second model, we use a VGG16 CNN model pretrained on the ImageNet dataset and fine-tuned on the Sign Language Digits Dataset to classify hand symbol images. We also provide a fully functional notebook to demonstrate all the steps.
Model 1: CIFAR-10 image classification
CIFAR-10 is a benchmark dataset for image classification in computer vision and ML. CIFAR images are colored (three channels) with dramatic variation in how the objects appear. It consists of 32 × 32 color images in 10 classes, with 6,000 images per class. It contains 50,000 training images and 10,000 test images. The following image shows a sample of the images grouped by the labels.
To build the image classifier, we use a simplified version of the classical AlexNet CNN. The network is composed of five convolutional and pooling layers, and three fully connected layers. Our simplified architecture stacks three convolutional layers and two fully connected (dense) layers.
The first step is to load the dataset into train and test objects. The TensorFlow framework provides the CIFAR dataset for us to load using the load_data() method. Next, we rescale the input images by dividing the pixel values by 255: [0,255] ⇒ [0,1]. We also need to prepare the labels using one-hot encoding. One-hot encoding is a process by which categorical variables are converted into a numerical form. The following code snippet shows these steps in action:
import numpy as np
from tensorflow.keras import utils
from tensorflow.keras.datasets import cifar10

# load dataset
(X_train, y_train), (X_test, y_test) = cifar10.load_data()

# rescale input images
X_train = X_train.astype('float32')/255
X_test = X_test.astype('float32')/255

# one hot encode target labels
num_classes = len(np.unique(y_train))
y_train = utils.to_categorical(y_train, num_classes)
y_test = utils.to_categorical(y_test, num_classes)
After the dataset is prepared and ready for training, it’s saved to Amazon S3 to be used by SageMaker. The model architecture and the training code for the image classifier are assembled into a training script (cifar_train.py). To generate batches of tensor image data for the training process, we use ImageDataGenerator. This enables us to apply data augmentation transformations like rotation, width, and height shifts to our training data.
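A minimal sketch of that generator (the augmentation parameters here are illustrative, not necessarily the values used in cifar_train.py) looks like this:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=15,       # random rotations of up to 15 degrees
    width_shift_range=0.1,   # random horizontal shifts of up to 10% of the width
    height_shift_range=0.1,  # random vertical shifts of up to 10% of the height
    horizontal_flip=True,
)
train_generator = datagen.flow(X_train, y_train, batch_size=64)
# The model can then be trained with model.fit(train_generator, ...)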
In the next step, we use the training script to create a TensorFlow estimator using the SageMaker SDK (see the following code). We use the estimator to fit the CNN model on CIFAR-10 inputs. When the training is complete, the model is saved to Amazon S3.
from sagemaker.tensorflow import TensorFlow

model_name = 'cifar-10'
hyperparameters = {'epochs': 50}

estimator_parameters = {'entry_point': 'cifar_train.py',
                        'instance_type': 'ml.m5.2xlarge',
                        'instance_count': 2,
                        'model_dir': f'/opt/ml/model',
                        'role': role,
                        'hyperparameters': hyperparameters,
                        'output_path': f's3://{BUCKET}/{PREFIX}/cifar_10/out',
                        'base_job_name': f'mme-cv-{model_name}',
                        'framework_version': TF_FRAMEWORK_VERSION,
                        'py_version': 'py37',
                        'script_mode': True}

estimator_1 = TensorFlow(**estimator_parameters)
estimator_1.fit(inputs)
Later, we demonstrate how to host this model using a SageMaker multi-model endpoint alongside our second model (the sign language digits classifier).
Model 2: Sign language digits classification
For our second model, we use the sign language digits dataset. This dataset distinguishes the sign language digits from 0–9. The following image shows a sample of the dataset.
The dataset contains 100 x 100 images in RGB color and has 10 classes (digits 0–9). The training set contains 1,712 images, the validation set 300, and the test set 50.
This dataset is very small. Training a network from scratch on this small a dataset doesn’t achieve good results. To achieve higher accuracy, we use transfer learning. Transfer learning is usually the go-to approach when starting a classification project, especially when you don’t have much training data. It migrates the knowledge learned from the source dataset to the target dataset, to save training time and computational cost.
To train this model, we use a pretrained VGG16 CNN model trained on the ImageNet dataset and fine-tune it to work on our sign language digits dataset. A pretrained model is a network that has been previously trained on a large dataset, typically on a large-scale image classification task. The VGG16 model architecture we use has 13 convolutional layers in total. For the sign language dataset, because its domain is different from the source domain of the ImageNet dataset, we only fine-tune the last few layers. Fine-tuning here refers to freezing a few of the network layers that are used for feature extraction, and jointly training both the non-frozen layers and the newly added classifier layers of the pretrained model.
The training script (sign_language_train.py) encapsulates the model architecture and the training logic for the sign language digits classifier. First, we load the pretrained weights from the VGG16 network trained on the ImageNet dataset. Next, we freeze part of the feature extractor part, followed by adding the new classifier layers. Finally, we compile the network and run the training process to optimize the model for the smaller dataset.
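To make the fine-tuning recipe concrete, the following is a minimal sketch (the frozen layer count and classifier head are illustrative; see sign_language_train.py in the repository for the actual architecture):

from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Load the VGG16 convolutional base pretrained on ImageNet, without its classifier head
base_model = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

# Freeze the early feature-extraction layers and fine-tune only the last block
for layer in base_model.layers[:-4]:
    layer.trainable = False

model = models.Sequential([
    base_model,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),  # 10 classes: sign language digits 0-9
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])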
Next, we use this training script to create a TensorFlow estimator using the SageMaker SDK. This estimator is used to fit the sign language digits classifier on the supplied inputs. When the training is complete, the model is saved to Amazon S3 to be hosted by SageMaker multi-model endpoints. See the following code:
model_name = 'sign-language'
hyperparameters = {'epochs': 50}

estimator_parameters = {'entry_point': 'sign_language_train.py',
                        'instance_type': 'ml.m5.2xlarge',
                        'instance_count': 2,
                        'hyperparameters': hyperparameters,
                        'model_dir': f'/opt/ml/model',
                        'role': role,
                        'output_path': f's3://{BUCKET}/{PREFIX}/sign_language/out',
                        'base_job_name': f'cv-{model_name}',
                        'framework_version': TF_FRAMEWORK_VERSION,
                        'py_version': 'py37',
                        'script_mode': True}

estimator_2 = TensorFlow(**estimator_parameters)
estimator_2.fit({'train': train_input, 'val': val_input})
Deploy a multi-model endpoint
SageMaker multi-model endpoints provide a scalable and cost-effective solution to deploy large numbers of models. It uses a shared serving container that is enabled to host multiple models. This reduces hosting costs by improving endpoint utilization compared to using single-model endpoints. It also reduces deployment overhead because SageMaker manages loading models in memory and scaling them based on the traffic patterns to them.
To create the multi-model endpoint, first we need to copy the trained models for the individual estimators (1 and 2) from their saved S3 locations to a common S3 prefix that can be used by the multi-model endpoint:
tf_model_1 = estimator_1.model_data
output_1 = f's3://{BUCKET}/{PREFIX}/mme/cifar.tar.gz'
tf_model_2 = estimator_2.model_data
output_2 = f's3://{BUCKET}/{PREFIX}/mme/sign-language.tar.gz'
!aws s3 cp {tf_model_1} {output_1}
!aws s3 cp {tf_model_2} {output_2}
After the models are copied to the common location designated by the S3 prefix, we create a serving model using the TensorFlowModel class from the SageMaker SDK. The serving model is created for one of the models to be hosted under the multi-model endpoint. In this case, we use the first model (the CIFAR-10 image classifier). Next, we use the MultiDataModel class from the SageMaker SDK to create a multi-model data model using the serving model for model-1, which we created in the previous step:
from sagemaker.tensorflow.serving import TensorFlowModel
from sagemaker.multidatamodel import MultiDataModel

model_1 = TensorFlowModel(model_data=output_1,
                          role=role,
                          image_uri=IMAGE_URI)

mme = MultiDataModel(name=f'mme-tensorflow-{current_time}',
                     model_data_prefix=model_data_prefix,
                     model=model_1,
                     sagemaker_session=sagemaker_session)
Finally, we deploy the MultiDataModel by calling the deploy() method, providing the attributes needed to create the hosting infrastructure required to back the multi-model endpoint:
predictor = mme.deploy(initial_instance_count=2,
                       instance_type='ml.m5.2xlarge',
                       endpoint_name=f'mme-tensorflow-{current_time}')
The deploy call returns a predictor instance, which we can use to make inference calls. We see this in the next section.
Test the multi-model endpoint for real-time inference
Multi-model endpoints enable sharing memory resources across your models. If the model to be referenced is already cached, multi-model endpoints run inference immediately. On the other hand, if the particular requested model isn’t cached, SageMaker has to download the model, which increases latency for that initial request. However, this takes only a fraction of the time it would take to launch an entirely new infrastructure (instances) to host the model individually on SageMaker. After a model is cached in the multi-model endpoint, subsequent requests are initiated in real time (unless the model is removed). As a result, you can run many models from a single instance, effectively decoupling our quantity of models from our cost of deployment. This makes it easy to manage ML deployments at scale and lowers your model deployment costs through increased usage of the endpoint and its underlying compute instances. For more information and a demonstration of cost savings of over 90% for a 1,000-model example, see Save on inference costs using Amazon SageMaker multi-model endpoints.
Multi-model endpoints also unload unused models from the container when the instances backing the endpoint reach memory capacity and more models need to be loaded into its container. SageMaker deletes unused model artifacts from the instance storage volume when the volume is reaching capacity and new models need to be downloaded. The first invocation to a newly added model takes longer because the endpoint takes time to download the model from Amazon S3 to the container’s memory of the instances backing the multi-model endpoint. Models that are unloaded remain on the instance’s storage volume and can be loaded into the container’s memory later without being downloaded again from the S3 bucket.
Let’s see how to make an inference from the CIFAR-10 image classifier (model-1) hosted under the multi-model endpoint. First, we load a sample image from one of the classes (airplane) and prepare it to be sent to the multi-model endpoint using the predictor we created in the previous step.
With this predictor, we can call the predict() method along with the initial_args parameter, which specifies the name of the target model to invoke. In this case, the target model is cifar.tar.gz. The following snippet demonstrates this process in detail:
img = load_img('./data/cifar_10/raw_images/airplane.png', target_size=(32, 32))
data = img_to_array(img)
data = data.astype('float32')
data = data / 255.0
data = data.reshape(1, 32, 32, 3)
payload = {'instances': data}
y_pred = predictor.predict(data=payload, initial_args={'TargetModel': 'cifar.tar.gz'})
predicted_label = CIFAR10_LABELS[np.argmax(y_pred)]
print(f'Predicted Label: [{predicted_label}]')
Running the preceding code returns the prediction output as the label airplane, which is correctly interpreted by our served model:
Predicted Label: [airplane]
Next, let’s see how to dynamically load the sign language digit classifier (model-2) into a multi-model endpoint by invoking the endpoint with sign-language.tar.gz as the target model.
We use the following sample image of the hand sign digit 0.
The following snippet shows how to invoke the multi-model endpoint with the sample image to get back the correct response:
test_path = './data/sign_language/test'
img = mpimg.imread(f'{test_path}/0/IMG_4159.JPG')

def path_to_tensor(img_path):
    # loads RGB image as PIL.Image.Image type
    img = image.load_img(img_path, target_size=(224, 224))
    # convert PIL.Image.Image type to 3D tensor with shape (224, 224, 3)
    x = image.img_to_array(img)
    # convert 3D tensor to 4D tensor with shape (1, 224, 224, 3) and return 4D tensor
    return np.expand_dims(x, axis=0)

data = path_to_tensor(f'{test_path}/0/IMG_4159.JPG')
payload = {'instances': data}

y_pred = predictor.predict(data=payload, initial_args={'TargetModel': 'sign-language.tar.gz'})
predicted_label = np.argmax(y_pred)
print(f'Predicted Label: [{predicted_label}]')
The following code is our response, with the label 0:
Predicted Label: [0]
Conclusion
In this post, we demonstrated how to use the SageMaker multi-model endpoints feature to optimize inference costs. Multi-model endpoints are useful when you’re dealing with hundreds to tens of thousands of models and don’t need to deploy each model as an individual endpoint. Models are loaded and unloaded dynamically, according to usage and the amount of memory available on the endpoint.
This post discussed how to host multiple computer vision models trained using the TensorFlow framework under one SageMaker multi-model endpoint. The image classification models were of different model architectures and trained on different datasets. The notebook included with the post provides detailed instructions on training and hosting the models.
Give SageMaker multi-model endpoints a try for your use case and leave your feedback in the comments.
About the Authors
Arunprasath Shankar is an Artificial Intelligence and Machine Learning (AI/ML) Specialist Solutions Architect with AWS, helping global customers scale their AI solutions effectively and efficiently in the cloud. In his spare time, Arun enjoys watching sci-fi movies and listening to classical music.
Mark Roy is a Principal Machine Learning Architect for AWS, helping AWS customers design and build AI/ML solutions. Mark’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including Insurance, Financial Services, Media and Entertainment, Healthcare, Utilities, and Manufacturing. Mark holds six AWS certifications, including the ML Specialty Certification. Prior to joining AWS, Mark was an architect, developer, and technology leader for 25+ years, including 19 years in financial services.
Your Guide to the AWS Machine Learning Summit
We’re about a week away from the AWS Machine Learning Summit and if you haven’t registered yet, you better get on it! On June 2, 2021 (Americas) and June 3, 2021 (Asia-Pacific, Japan, Europe, Middle East, and Africa), don’t miss the opportunity to hear from some of the brightest minds in machine learning (ML) at the free virtual AWS Machine Learning Summit. This Summit, which is open to all, brings together industry luminaries, AWS customers, and leading ML experts to share the latest in ML. You’ll learn about science breakthroughs in ML, how ML is impacting business, best practices in building ML, and how to get started now without prior ML expertise. This post is your guide to navigating the Summit.
The day kicks off with a keynote from ML leaders from across AWS, Amazon, and the industry, including Swami Sivasubramanian, VP of AI and Machine Learning, AWS; Bratin Saha, VP of Machine Learning, AWS; and Yoelle Maarek, VP of Research, Alexa Shopping, who will share how we’re applying customer-obsessed science to advance ML. You’ll also hear from Ashok Srivastava, Senior Vice President and Chief Data Officer at Intuit, about how the company is scaling its ML and AI to create new customer experiences.
Next, tune in for an exclusive fireside chat with Andrew Ng, founder and CEO of Landing AI and founder of deeplearning.ai, and Swami Sivasubramanian about the future of ML, the skills that are fundamental for the next generation of ML practitioners, and how we can bridge the gap from proof of concept to production in ML.
From there, pick the track that best matches your interests or mix and match throughout the day. We’ll also have expert Q&A available from 11:00am–3:00pm local time.
The science of machine learning
If you’re an advanced practitioner or just really interested in the science of ML, this track provides a technical deep dive into the groundbreaking work that ML scientists within AWS, Amazon, and beyond are doing to advance the science of ML in areas including computer vision, natural language processing, bias, and more.
Speakers include two Amazon Scholars, Michael Kearns and Kathleen McKeown. Kearns is a professor in the Computer and Information Science department at the University of Pennsylvania, where he holds the National Center Chair. He is co-author of the book “The Ethical Algorithm: The Science of Socially Aware Algorithm Design,” and joined Amazon as a scholar in June 2020. McKeown is the Henry and Gertrude Rothschild Professor of Computer Science at Columbia University, and the founding director of the school’s Data Science Institute. She joined Amazon as a scholar in 2019.
You’ll also get an inside look at trends in deep learning and natural language in a powerhouse fireside chat with Amazon distinguished scientists Alex Smola and Bernhard Schölkopf, and Alexa AI Senior Principal Scientist Dilek Hakkani-Tur.
The impact of machine learning
If you’re a technical business leader, you won’t want to miss this track where you’ll learn from AWS customers that are leading the way in ML adoption. Customers including 3M, AstraZeneca, Vanguard, Carbon Lighthouse, ADP, and Bundesliga will share how they’re applying ML to create efficiencies, deliver new revenue streams, and launch entirely new products and business models. You’ll get best practices for scaling ML in an organization and showing impact.
How machine learning is done
If you’re a data scientist or ML developer, join this track for practical deep dives into tools that can speed up the entire ML lifecycle, from building to training to deploying ML models. Sessions cover topics such as choosing the right algorithms, making data preparation faster and more accurate, explaining model predictions, and more.
Machine learning: No expertise required
If you’re a developer who wants to apply ML and AI to a use case but you don’t have the expertise, this track is for you. Learn how to use AWS AI services and other tools to get started with your ML project right away, for use cases including contact center intelligence, personalization, intelligent document processing, business metrics analysis, computer vision, and more. You’ll also hear from customers like Fidelity about how they’re applying ML to business problems like DevOps.
For more details, visit the website and we’ll see you there!
About the Author
Laura Jones is a product marketing lead for AWS AI/ML where she focuses on sharing the stories of AWS customers and educating organizations on the impact of machine learning. As a Florida native living and surviving in rainy Seattle, she enjoys coffee, attempting to ski, and spending time in the great outdoors.
3 questions with Kathleen McKeown: Controlling model hallucinations in natural language generation
McKeown is a featured speaker at the first virtual Amazon Web Services Machine Learning Summit on June 2.
It’s a wrap for Amazon SageMaker Month, 30 days of content, discussions, and news
Did you miss SageMaker Month? Don’t look any further than this round-up post to get caught up. In this post, we share key highlights and learning materials to accelerate your machine learning (ML) innovation.
On April 20, 2021, we launched the first ever Amazon SageMaker Month, 30 days of hands-on workshops, tech talks, Twitch sessions, blog posts, and playbooks. Our goal with SageMaker Month was to connect you with AWS experts, getting started resources, workshops, and learning content to be successful with ML. The following is a summary of what you can access on-demand to get started on your ML journey with Amazon SageMaker.
Introducing SageMaker Savings Plans
To kick off SageMaker Month, we introduced Amazon SageMaker Savings Plans, a flexible, usage-based pricing model for SageMaker. The goal of SageMaker Savings Plans is to offer you the flexibility to save up to 64% on SageMaker ML instance usage in exchange for a commitment of consistent usage for a 1-year or 3-year term. In addition, to help you save even more, we announced a price drop on SageMaker CPU and GPU instances.
To help customers save more on SageMaker, we also hosted a SageMaker Friday Twitch session with Greg Coquillo, the second-most influential speaker according to LinkedIn Top Voices 2020: Data Science & AI, along with Julien Simon and Segolene Dessertine-Panhard, outlining cost-optimization techniques using SageMaker and SageMaker Savings Plans.
SageMaker Savings Plans enhance the productivity and cost-optimizing capabilities already available in Amazon SageMaker Studio, which can improve your data science team’s productivity up to 10 times. Studio provides a single visual interface where you can perform all your ML development steps. Studio also gives you complete access, control, and visibility into each step required to build, train, and deploy models. To enable your teams to move faster and boost productivity, learn how to customize your Studio notebooks.
Getting started with ML
SageMaker is the most comprehensive ML service, purpose-built for every step of the ML development lifecycle. SageMaker provides all the components used for ML in a single service, so you can prepare data and build, train, and deploy models.
Data preparation is the first step of building an ML model. It’s a time-consuming and involved process that is largely undifferentiated. We hear from our customers that it constitutes up to 80% of their time during ML development. Data preparation has always been considered tedious and resource intensive, due to the inherent nature of data being “dirty” and not ready for ML in its raw form. “Dirty” data could include missing or erroneous values, outliers, and more. Feature engineering is often needed to transform the inputs to deliver more accurate and efficient ML models. To help with feature engineering, Amazon SageMaker Feature Store offers a purpose-built repository to store, update, retrieve, and share ML features within development teams.
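To illustrate that last point, here is a minimal sketch, assuming the SageMaker Python SDK and a small pandas DataFrame of engineered features; the feature group name, columns, and S3 location are illustrative placeholders.

import time
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

# Engineered features; Feature Store needs a record identifier and an event time column
df = pd.DataFrame({
    'customer_id': ['c-001', 'c-002'],
    'credit_utilization': [0.35, 0.72],
    'event_time': [time.time(), time.time()],
})
df['customer_id'] = df['customer_id'].astype('string')  # Feature Store can't infer the 'object' dtype

feature_group = FeatureGroup(name='credit-risk-features', sagemaker_session=sess)
feature_group.load_feature_definitions(data_frame=df)  # infer feature names and types from the DataFrame

feature_group.create(
    s3_uri=f's3://{sess.default_bucket()}/feature-store',  # offline store location
    record_identifier_name='customer_id',
    event_time_feature_name='event_time',
    role_arn=role,
    enable_online_store=True,  # also serve low-latency lookups at inference time
)

# Creation is asynchronous; wait until the feature group is active before writing to it
while feature_group.describe()['FeatureGroupStatus'] == 'Creating':
    time.sleep(5)

# Ingest the features so they can be stored, retrieved, and shared across teams
feature_group.ingest(data_frame=df, max_workers=2, wait=True)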
Another challenge with data preparation is that it often requires multiple steps. Although most standalone data preparation tools provide data transformation, feature engineering, and visualization, few tools provide built-in model validation. And all of these data preparation steps are considered separate from ML. What’s needed is a framework that provides all these capabilities in one place and is tightly integrated with the rest of the ML pipeline. Most standalone tools for data preparation treat it as an extract, transform, and load (ETL) workload, making it tedious to iteratively prepare data, validate the model on test datasets, deploy it in production, and go back to ingesting new data sources and performing additional feature engineering. Most iterative data preparation is divorced from deployment. Therefore, data preparation modules need curation and integration before they’re deployed in production. These practices in ML are sometimes referred to as MLOps.
To help you overcome these challenges, you can use Amazon SageMaker Data Wrangler, a capability that simplifies data preparation and feature engineering, letting you complete each step of the data preparation workflow, including data selection, cleansing, and exploration, on a single visual interface. As part of SageMaker Month, we created a step-by-step tutorial on how you can prepare data for ML with Data Wrangler. In addition, you can learn how financial customers use SageMaker every day to predict credit risk and approve loans. This example uses Data Wrangler and Amazon SageMaker Clarify to detect bias during the data preparation stage.
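For a rough idea of what that bias check looks like in code, the following is a minimal sketch of a SageMaker Clarify pre-training bias report; the dataset paths, headers, label, and facet column are hypothetical stand-ins for a credit risk dataset.

import sagemaker
from sagemaker import clarify

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    sagemaker_session=sess,
)

# Where the training data lives and where to write the bias report (placeholder paths)
bias_data_config = clarify.DataConfig(
    s3_data_input_path='s3://my-bucket/credit-risk/train.csv',
    s3_output_path='s3://my-bucket/credit-risk/clarify-output',
    label='loan_approved',
    headers=['age', 'income', 'gender', 'loan_approved'],
    dataset_type='text/csv',
)

# Check whether outcomes are skewed with respect to the 'gender' facet before training
bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],  # the favorable label value
    facet_name='gender',
)

clarify_processor.run_pre_training_bias(
    data_config=bias_data_config,
    data_bias_config=bias_config,
    methods='all',  # compute all supported pre-training bias metrics
)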
Another part of the data preparation stage is labeling data. Data labeling is the task of identifying objects in raw data, such as text, images, and videos, and tagging them with labels that help your ML model make accurate predictions and estimations. For example, in an autonomous vehicle use case, Light Detection and Ranging (LIDAR) devices are commonly used to capture and generate three-dimensional point cloud data, which represents the physical space at a single point in time. For this use case, you need to label data captured in both 2D and 3D spaces to produce highly accurate predictions of vehicles, lanes, and pedestrians. Amazon SageMaker Ground Truth, a fully managed data labeling service, makes it easy to build highly accurate training datasets for ML in 2D and 3D spaces using custom or built-in data labeling workflows. To help you label your data, we created how-to blog posts that showcase how to annotate 3D point cloud data and automate data labeling workflows for an autonomous vehicle use case with Ground Truth.
After you build your ML model, you must train and tune it to achieve the highest accuracy. Improving a model’s performance is an experimental and iterative process. For SageMaker Month, we consolidated a few techniques and best practices on how to train and tune high-quality deep learning models with complete visibility using SageMaker.
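As an example of what that looks like in practice, here is a minimal sketch of automatic model tuning with the SageMaker Python SDK; the training script, metric regex, hyperparameter ranges, and S3 paths are illustrative placeholders rather than the setup from the linked post.

import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

role = sagemaker.get_execution_role()

# A placeholder TensorFlow training job; train.py is a hypothetical script that logs val_accuracy
estimator = TensorFlow(
    entry_point='train.py',
    role=role,
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    framework_version='2.3.1',
    py_version='py37',
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name='val_accuracy',
    objective_type='Maximize',
    hyperparameter_ranges={
        'learning_rate': ContinuousParameter(1e-5, 1e-2),
        'batch_size': IntegerParameter(16, 128),
    },
    metric_definitions=[{'Name': 'val_accuracy', 'Regex': 'val_accuracy: ([0-9\\.]+)'}],
    max_jobs=10,          # total training jobs to explore
    max_parallel_jobs=2,  # jobs to run concurrently
)

# Launch the tuning job; SageMaker searches the ranges and tracks the best-performing model
tuner.fit({'train': 's3://my-bucket/train', 'validation': 's3://my-bucket/validation'})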
When you’re satisfied with your model’s accuracy, understanding how to deploy and manage models at scale is key. For model deployment and management, we showcase an example where an application developer uses SageMaker multi-model endpoints to host thousands of models, and pipelines to automate retraining, in order to improve recommendations across different US cities.
When it’s time to deploy your model and make predictions, a process called inference, you can use SageMaker for inference in the cloud or on edge devices. Amazon SageMaker Neo automatically compiles ML models for any ML framework and any target hardware. A Neo-compiled model can make YOLOv4 inference up to twice as fast. You can also reduce ML inference costs on SageMaker with hardware and software acceleration.
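As a rough sketch of what Neo compilation looks like in code, assuming estimator is an already trained SageMaker framework estimator (such as the TensorFlow estimator sketched earlier) and that the target, input shape, and framework version are replaced with a combination Neo supports for your model:

# Compile the trained model for a specific hardware target (all values are placeholders)
compiled_model = estimator.compile_model(
    target_instance_family='ml_c5',              # hardware family to optimize for
    input_shape={'input_1': [1, 224, 224, 3]},   # placeholder input tensor name and shape
    output_path='s3://my-bucket/neo-compiled/',
    framework='tensorflow',
    framework_version='2.4',                     # must be a Neo-supported version
)

# Deploy the compiled artifact like any other SageMaker model
neo_predictor = compiled_model.deploy(
    initial_instance_count=1,
    instance_type='ml.c5.xlarge',
)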
As part of SageMaker Month, we also launched an example use case that shows how you can use Amazon SageMaker Edge Manager, a capability to optimize, secure, monitor, and maintain ML models on fleets of smart cameras, robots, personal computers, and mobile devices. This blog outlines how to manage and monitor models on edge devices such as wind turbines.
Finally, to bring all our SageMaker capabilities together and help you move from model ideation to production, we created an on-demand introduction to SageMaker workshop similar to the virtual hands-on workshops we conducted live and during recent AWS Summits. It includes everything you need to get started with SageMaker at your own pace.
ML through our Partners
As part of SageMaker Month, we partnered with Tableau and DOMO to empower data and business analysts with ML-powered insights without needing any ML expertise. With the right data readily available, you can use ML and business intelligence (BI) tools to help make predictions needed to automate and speed up critical business processes and workflows.
We partnered with DOMO to enable ML for everyone with SageMaker. Domo AutoML, powered by Amazon SageMaker Autopilot, provides insights into complex business problems and automates the end-to-end decision-making process. This helps organizations improve decision-making and adapt faster to business changes.
We also partnered with Tableau to create a blog post and tech talk that showcases an end-to-end demo and new Quick Start solution that makes it easy for data analysts to use ML models deployed on SageMaker directly in their Tableau dashboards without writing any custom integration code.
What’s next
SageMaker Month focused on cost savings and optimization, getting started with ML, and learning content to accelerate ML innovation. As we wrap up SageMaker Month, we’re excited to share the upcoming and first ever virtual AWS Machine Learning Summit on June 2, 2021. The summit brings together industry-leading scientists, AWS customers, and experts to dive deep into the art, science, and impact of ML. Attend for free, learn from more than 30 sessions, and interact with leaders in a live Q&A.
About the Author
Shashank Murthy is a Senior Product Marketing Manager with AWS Machine Learning. His goal is to make it easy for customers to build, train, and deploy machine learning models using Amazon SageMaker. For fun outside work, Shashank likes to hike the Pacific Northwest, play soccer, and run obstacle course races.