Human-evaluation studies validate metrics, and experiments show evidence of bias in popular language models.
Enable feature reuse across accounts and teams using Amazon SageMaker Feature Store
Amazon SageMaker Feature Store is a new capability of Amazon SageMaker that helps data scientists and machine learning (ML) engineers securely store, discover, and share curated data used in training and prediction workflows. As organizations build data-driven applications using ML, they’re constantly assembling and moving features between more and more functional teams. This constant movement of data can lead to inconsistencies in features and become a bottleneck when designing ML initiatives spanning multiple teams. For example, an ecommerce company might have several data science and engineering teams working on different aspects of their platform. The Core Search team focuses on query understanding and information retrieval tasks. The Product Success team solves problems involving customer reviews and feedback signals. The Personalization team uses clickstream and session data to create ML models for personalized recommendations. Additionally, data engineering teams like the Data Curation team can curate and validate user-specific information, which is an essential component that other teams can use. A Feature Store works as a unified interface between these teams, enabling one team to leverage the features generated by other teams, which minimizes the operational overhead of replicating and moving features across teams.
Training a production-ready ML model typically involves access to a diverse set of features that aren’t always owned and maintained by the team that is building the model. A common practice for organizations that apply ML is to think of these data science teams as individual groups that work independently with limited collaboration. This results in ML workflows with no standardized way to share features across teams, which becomes a crucial limiting factor for data science productivity and makes it harder for data scientists to build new and complex models. With a shared feature store, organizations can achieve economies of scale. As more shared features become available, it becomes easier and cheaper for teams to build and maintain new models. These models can reuse features that are already developed, tested, and offered using a centralized feature store.
This post captures the essential cross-account architecture patterns for Feature Store that can be implemented in an organization with many data engineering and data science teams operating in different AWS accounts. We share how to enable sharing of features between accounts through a step-by-step example, which you can try out yourself with the code in our GitHub repo.
SageMaker Feature Store overview
By default, a SageMaker Feature Store is local to the account in which it is created, but it can also be centralized and shared by many accounts. An organization with multiple teams can have one centralized feature store that is shared across teams, as well as separate feature stores for use by individual teams. The separate stores can either hold feature groups that are of a sensitive nature or that are specific to a unique ML workload.
In this post, you first learn about the centralized feature store pattern. This pattern prescribes a central interface through which teams can create and publish new features, and from which other teams (or systems) can consume features. It also ensures that you have a single source of truth for feature data across your organization and simplifies resource management.
Next, you learn about the combined feature store pattern, which allows teams to maintain their own feature stores local to their account, while still being able to access shared features from the centralized feature store. These local feature stores are usually built for data science experimentation. By combining shared features from the centralized store with local features, teams can derive new enhanced features that can help when building more complex ML models. You can also use the local stores to store sensitive data that can’t be shared across the organization for regulatory and compliance reasons.
Lastly, we briefly cover a less common pattern involving replication of feature data.
Centralized feature store
Organizations can maximize the benefits of a feature store when it’s centralized. The centralized feature store pattern demonstrates how feature pipelines from multiple accounts can write to one centralized feature store, and how multiple other accounts can consume these features. This is a common pattern among mid- to large-sized enterprises where multiple teams manage different types of data or different parts of an application.
The process of hypothesizing, selecting, and transforming data inputs into a usable form suitable for ML models is called feature engineering. A feature pipeline encapsulates all the steps of the feature engineering process needed to convert raw data into useful features that ML models take as input for predictions. Maintaining feature pipelines is an expensive, time-consuming, and error-prone process. Also, replicating feature recipes and transformations across accounts can lead to inconsistencies and skew in feature characteristics. Because a centralized feature store facilitates knowledge sharing, teams don’t have to recreate feature recipes and rewrite pipelines from scratch in every project.
In this pattern, instead of writing features locally to an account-specific feature store, features are written to a centralized feature store. The centralized store serves as the central vault and creates a standardized way to access and maintain features for cross-team collaboration. It acts as an enabler and accelerator for AI adoption, reducing time to market for ML solutions, and allows for centralized governance and access control to ML features. You can grant access to external accounts, users, or roles to read and write individual feature groups in keeping with your data access policies. AWS recommends enforcing least privilege access to only the feature groups that you need for your job function. This is managed by the underlying AWS Identity and Access Management (IAM) policies. You can further refine access control with feature group tags and IAM conditions to decide which principals can perform specific actions. When you’re using a centralized store at scale, it’s important to also implement proper feature governance to ensure feature groups are well designed, have feature pipelines that are documented and supported, and have processes in place to ensure feature quality. This type of governance helps earn the trust required for feature reuse across teams.
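To make the least-privilege idea concrete, the following is a minimal sketch of a policy document that grants read-only access to a single feature group and only when it carries a given tag. The helper name, account ID, feature group name, and tag are hypothetical, and you should confirm in the IAM documentation for SageMaker that the actions you list support tag-based conditions before relying on this pattern.

```python
import json


def feature_group_read_policy(region, account_id, feature_group, tag_key, tag_value):
    """Build an IAM policy allowing read-only access to one tagged feature group.

    Hypothetical helper for illustration; it is not part of any AWS SDK.
    """
    feature_group_arn = (
        f"arn:aws:sagemaker:{region}:{account_id}:feature-group/{feature_group}"
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadSingleTaggedFeatureGroup",
                "Effect": "Allow",
                # Only describe the group and read records; no write or delete actions
                "Action": [
                    "sagemaker:DescribeFeatureGroup",
                    "sagemaker:GetRecord",
                ],
                "Resource": feature_group_arn,
                # Access applies only while the feature group carries this tag
                "Condition": {
                    "StringEquals": {f"aws:ResourceTag/{tag_key}": tag_value}
                },
            }
        ],
    }


# Example values are placeholders, not real resources
policy = feature_group_read_policy(
    "us-east-1", "111122223333", "users", "team", "data-curation"
)
print(json.dumps(policy, indent=2))
```

A producer role would get a similar statement with write actions such as `sagemaker:PutRecord`, scoped to the feature groups that its pipeline owns.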
Before walking through an example, let’s identify some key feature store concepts. First, feature groups are logical groups of features, typically originating from the same feature pipeline. An offline store contains large volumes of historical feature data used to create training and testing data for model development, or by batch applications for model scoring. The purpose of the online store is to serve these same features in real time with low latency. Unlike the offline store, which is append-only, the goal of the online store is to serve the most recent feature values. Behind the scenes, Feature Store automatically carries out data synchronization between the two stores. If you ingest new feature values into the online store, they’re automatically appended to the offline store. However, you can also create offline and online stores separately if this is a requirement for your team or project.
The following diagram depicts three functional teams, each with its own feature pipeline writing to a feature group in a centralized feature store.
The Personalization account manages user session data collected from a customer-facing application and owns a feature pipeline that produces a feature group called Sessions with features derived from the session data. This pipeline writes the generated feature values to the centralized feature store. Likewise, a feature pipeline in the Product Success account is responsible for producing features in the Reviews feature group, and the Data Curation account produces features in the Users feature group.
The centralized feature store account holds all features received from the three producer accounts, mapped to three feature groups: Sessions, Reviews, and Users. Feature pipelines can write to the centralized feature store by assuming a specific IAM role that is created in the centralized store account. We discuss how to enable this cross-account role later in this post. External accounts can also query features from the feature groups in the centralized store for training or inference, as shown in the preceding architecture diagram. For training, you can assume the IAM role from the centralized store and run cross-account Amazon Athena queries (as shown in the diagram), or initiate an Amazon EMR or SageMaker Processing job to create training datasets. In case of real-time inference, you can read online features directly via the same assumed IAM role for cross-account access.
In this model, the centralized feature store usually resides in a production account. Applications using this store can either live in this account or in other accounts with cross-account access to the centralized feature store. You can replicate this entire structure in lower environments, such as development or staging, for testing infrastructure changes before promoting them to production.
Combined feature store
In this section, we discuss a variant of the centralized feature store pattern called the combined feature store pattern. In feature engineering, a common practice is to combine existing features to derive new features. When teams combine shared features from the centralized store with local features in their own feature store, they can derive new enhanced features to help build more complex data models. We know from the previous section that the centralized store makes it easy for any data science team to access external features and use them with their existing pool of features to compound and evolve new features.
Security and compliance is another use case for teams to maintain a team-specific feature store in addition to accessing features from the centralized store. Many teams require specific access rights that aren’t granted to everyone in the organization. For example, it might not be feasible to publish features that are extracted from sensitive data to a centralized feature store within the organization.
In the following architecture diagram, the centralized feature store is the account that collects and catalogs all the features received from multiple feature pipelines into one central repository. In this example, the account of the combined store belongs to the Core Search team. This account is the consumer of the shareable features from the centralized store. In addition, this account manages user keyword data collected via a customer-facing search application.
This account maintains its own local offline and online stores. These local stores are populated by a feature pipeline set up locally to ingest user query keyword data and generate features. These features are grouped under a feature group named Keywords. Feature Store by default automatically creates an AWS Glue table for this feature group, which is registered in the AWS Glue Data Catalog in this account. The metadata of this table points to the Amazon S3 location of the feature group in this account’s offline store.
The combined store account can also access feature groups Sessions, Reviews, and Users from the centralized store. You can enable cross-account access by role, which we discuss in the next sections. Data scientists and researchers can use Athena to query feature groups created locally and join these internal features with external features derived from the centralized store for data science experiments.
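As one possible illustration of combining local and shared features, the sketch below inner-joins two sets of query results on a common key after they have been fetched. It assumes the rows are already available as lists of dictionaries (for example, parsed from Athena result files); the function name and the feature names are hypothetical.

```python
def join_features(local_rows, shared_rows, key):
    """Inner-join two lists of feature rows (dicts) on a shared key column."""
    # Index the shared rows by key so each local row is matched in constant time
    shared_by_key = {row[key]: row for row in shared_rows}
    joined = []
    for row in local_rows:
        match = shared_by_key.get(row[key])
        if match is not None:
            # Local values win if both sides define the same feature name
            joined.append({**match, **row})
    return joined


# Hypothetical rows: Keywords is local, Users comes from the centralized store
keywords = [
    {"user_id": "u1", "top_keyword": "running shoes"},
    {"user_id": "u2", "top_keyword": "laptop"},
]
users = [
    {"user_id": "u1", "age_bucket": "25-34"},
    {"user_id": "u3", "age_bucket": "35-44"},
]

combined = join_features(keywords, users, key="user_id")
print(combined)
```

For large feature sets, the same join is usually better expressed in SQL or in a distributed processing job rather than in memory.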
Cross-account access overview
This section provides an overview of how to enable cross-account access for Feature Store between two accounts using an assumed role via AWS Security Token Service (AWS STS). AWS STS is a web service that enables you to request temporary, limited-privilege credentials for IAM users. AWS STS returns a set of temporary security credentials that you can use to access AWS resources that you might not normally have access to. These temporary credentials consist of an access key ID, secret access key, and security token.
To demonstrate this process, assume we have two accounts, A and B, as shown in the following diagram.
Account B maintains a centralized online and offline feature store. Account A needs access to both online and offline stores contained in Account B. To enable this, we create a role in Account B and let Account A assume that role using AWS STS. This enables Account A to behave like Account B, with permissions to perform specific actions identified by the role. AWS services like SageMaker (processing and training jobs, endpoints) and AWS Lambda used from Account A can assume the IAM role created in Account B by using an AWS STS client (see code block later in this post). This grants them the needed permissions to access resources like Amazon S3, Athena, and the AWS Glue Data Catalog inside Account B. After the services in Account A acquire the necessary permissions to the resources, they can access both the offline and online store in Account B. Depending on the choice of your service, you also need to add the IAM execution role for that service to the trust policy of the cross-account IAM role in Account B. We discuss this in detail in the following section.
The preceding architecture diagram shows how Account A assumes a role from Account B to read and write to both online and offline stores contained within Account B. The seven steps in the diagram are as follows:
- Account B creates a role that can be assumed by others (for our use case, Account A).
- Account A assumes the IAM role from Account B using AWS STS. Account A can now generate temporary credentials that can be used to create AWS service clients that behave as if they are inside Account B.
- In Account A, SageMaker and other service clients (such as Amazon S3 and Athena) are created using the temporary credentials via the assumed role.
- The service clients in Account A can now create feature groups and populate feature values into Account B’s centralized online store using the AWS SDK.
- The online store in Account B automatically syncs with the offline store, also in Account B.
- The Athena service client inside Account A runs cross-account queries to read, group, and materialize feature sets using Athena tables inside Account B. Because the offline store exists in Account B, the corresponding AWS Glue tables, metadata catalog entries, and S3 objects all reside within Account B. Account A can use the AWS STS assume role to query the offline features (S3 objects) inside Account B.
- Athena query results are returned back as feature datasets into Account A’s S3 bucket.
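The last two steps can be sketched as a small helper that starts an Athena query and waits for it to finish. This is a minimal sketch, not the code from the example notebooks: the function name is hypothetical, and it assumes you pass in an Athena client created with the temporary credentials of the assumed role, plus an S3 URI in Account A's results bucket.

```python
import time


def run_athena_query(athena_client, query, database, output_s3_uri, poll_seconds=2.0):
    """Start an Athena query and block until it reaches a terminal state.

    athena_client is expected to be a boto3 Athena client built from the
    assumed-role credentials; output_s3_uri is where Athena writes results.
    """
    started = athena_client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3_uri},
    )
    query_id = started["QueryExecutionId"]

    # Poll until Athena reports that the query finished, failed, or was cancelled
    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)
```

On success, the result file is written under the output location, from where a training job in Account A can read it.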
The temporary credentials returned by the AWS STS AssumeRole API expire after 1 hour by default. You can extend the duration of your session by using RefreshableCredentials, a Botocore class that can automatically refresh the credentials to work with your long-running applications beyond the 1-hour timeframe. An example notebook demonstrating this is available in our GitHub repo.
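The following is a minimal sketch of how RefreshableCredentials is commonly wired up; it is not necessarily how the example notebook does it. The two function names are hypothetical, and the sketch sets the private `_credentials` attribute on the Botocore session, which is a widely used workaround rather than a public API.

```python
def make_refresh_callback(sts_client, role_arn, session_name):
    """Return a callable that fetches fresh credentials in Botocore's metadata format."""

    def refresh():
        response = sts_client.assume_role(
            RoleArn=role_arn, RoleSessionName=session_name
        )
        creds = response["Credentials"]
        return {
            "access_key": creds["AccessKeyId"],
            "secret_key": creds["SecretAccessKey"],
            "token": creds["SessionToken"],
            # Botocore expects the expiry as an ISO 8601 string
            "expiry_time": creds["Expiration"].isoformat(),
        }

    return refresh


def refreshable_boto3_session(role_arn, session_name, region):
    """Build a boto3 session whose assumed-role credentials renew themselves."""
    # Imported here so the callback above can be tested without the AWS SDK
    import boto3
    from botocore.credentials import RefreshableCredentials
    from botocore.session import get_session

    refresh = make_refresh_callback(boto3.client("sts"), role_arn, session_name)
    credentials = RefreshableCredentials.create_from_metadata(
        metadata=refresh(),
        refresh_using=refresh,
        method="sts-assume-role",
    )
    botocore_session = get_session()
    botocore_session._credentials = credentials  # private attribute (assumption)
    botocore_session.set_config_variable("region", region)
    return boto3.Session(botocore_session=botocore_session)
```

Clients created from the returned session call the refresh callback again shortly before the current credentials expire.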
Create cross-account access
This section details all the steps to create the cross-account access roles, policies, and permissions to enable shareability of features between Accounts A and B according to our architecture.
Create a Feature Store access role
From Account B, we create a Feature Store access role. This is the role assumed by AWS services inside Account A to gain access to resources in Account B.
- On the IAM console, in the navigation pane, choose Roles.
- Choose Create role.
- Choose Another AWS account.
- For Account ID, enter the 12-digit account ID of Account B.
- Choose Next: Permissions.
- In the Permissions section, search for and attach the following AWS managed policies:
  - AmazonSageMakerFullAccess (you can further restrict this to least privileges based on your use case)
  - AmazonSageMakerFeatureStoreAccess
- Create and attach a custom policy to this new role (provide the S3 bucket name in Account A where the Athena query results collected in Account B are written):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AthenaResultsS3BucketCrossAccountAccessPolicy",
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:PutObjectAcl",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::<ATHENA RESULTS BUCKET NAME IN ACCOUNT A>",
                "arn:aws:s3:::<ATHENA RESULTS BUCKET NAME IN ACCOUNT A>/*"
            ]
        }
    ]
}
When you use this new AWS STS cross-account role from Account A, it can run Athena queries against the offline store content in Account B. The custom policy allows Athena (inside Account B) to write back the results to a results bucket in Account A. Make sure that this results bucket is created in Account A before you create the preceding policy.
Alternatively, you can let the centralized feature store in Account B maintain all the Athena query results in an S3 bucket. In this case, you have to set up cross-account Amazon S3 read access policies for external accounts to read the saved results (S3 objects).
- After you attach the policies, choose Next.
- Enter a name for this role (for example, cross-account-assume-role).
- On the Summary page for the created role, under Trust relationships, choose Edit trust relationship.
- Edit the access control policy document as shown in the following code:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::<ACCOUNT A ID>:root"
                ],
                "Service": [
                    "sagemaker.amazonaws.com",
                    "athena.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole",
            "Condition": {}
        }
    ]
}
The preceding code adds SageMaker and Athena as services in the Principal section. If you want more external accounts or roles to assume this role, you can add their corresponding ARNs in this section.
Create a SageMaker notebook instance
From Account A, create a SageMaker notebook instance with an IAM execution role. This role grants the SageMaker notebook in Account A the needed permissions to run actions on the feature store inside Account B. Alternatively, if you’re using Lambda instead of a SageMaker notebook, you need to create a role for Lambda with the same attached policies as shown in this section.
By default, the following policies are attached when you create a new execution role for a SageMaker notebook:
- AmazonSageMaker-ExecutionPolicy
- AmazonSageMakerFullAccess
We need to create and attach two additional custom policies. First, create a custom policy with the following code, which allows the execution role in Account A to perform certain S3 actions needed to interact with the offline store in Account B:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "FeatureStoreS3AccessPolicy",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetBucketAcl",
                "s3:GetObjectAcl"
            ],
            "Resource": [
                "arn:aws:s3:::<OFFLINE STORE BUCKET NAME IN ACCOUNT B>",
                "arn:aws:s3:::<OFFLINE STORE BUCKET NAME IN ACCOUNT B>/*"
            ]
        }
    ]
}
You can also attach the AWS managed policy AmazonSageMakerFeatureStoreAccess, if your offline store S3 bucket name contains the SageMaker keyword.
Second, create the following custom policy, which allows the SageMaker notebook in Account A to assume the role (cross-account-assume-role) created in Account B:
{
    "Version": "2012-10-17",
    "Statement": {
        "Effect": "Allow",
        "Action": "sts:AssumeRole",
        "Resource": "arn:aws:iam::<ACCOUNT B ID>:role/cross-account-assume-role"
    }
}
We know Account A can access the online and offline store in Account B. When Account A assumes the cross-account AWS STS role of Account B, it can run Athena queries inside Account B against its offline store. However, the results of these queries (feature datasets) need to be saved in Account A’s S3 bucket in order to enable model training. Therefore, we need to create a bucket in Account A that can store the Athena query results as well as create a bucket policy (see the following code). This policy allows the cross-account AWS STS role to write and read objects in this bucket:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "MyStatementSid",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::<ACCOUNT B ID>:role/cross-account-assume-role"
                ]
            },
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::<ATHENA RESULTS BUCKET NAME IN ACCOUNT A>",
                "arn:aws:s3:::<ATHENA RESULTS BUCKET NAME IN ACCOUNT A>/*"
            ]
        }
    ]
}
Modify the trust relationship policy
Because we created an IAM execution role in Account A, we use the ARN of this role to modify the trust relationships policy of the cross-account assume role in Account B:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "<ARN OF SAGEMAKER EXECUTION ROLE CREATED IN ACCOUNT A>"
                ],
                "Service": [
                    "sagemaker.amazonaws.com",
                    "athena.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole",
            "Condition": {}
        }
    ]
}
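If you prefer to script this change instead of editing the document on the console, a minimal sketch could look like the following. The two helper names are hypothetical; the sketch assumes you pass in a boto3 IAM client that runs with permissions in Account B and uses the IAM `update_assume_role_policy` call to replace the role's trust policy.

```python
import json


def build_trust_policy(principal_arns, services=()):
    """Build a trust policy that lets the given principals and services assume a role."""
    principal = {"AWS": list(principal_arns)}
    if services:
        principal["Service"] = list(services)
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": principal,
                "Action": "sts:AssumeRole",
            }
        ],
    }


def apply_trust_policy(iam_client, role_name, trust_policy):
    """Replace the trust policy of an existing role (run with Account B credentials)."""
    iam_client.update_assume_role_policy(
        RoleName=role_name,
        PolicyDocument=json.dumps(trust_policy),
    )


# Placeholder ARN; substitute the execution role created in Account A
trust_policy = build_trust_policy(
    ["arn:aws:iam::111122223333:role/sagemaker-execution-role"],
    services=["sagemaker.amazonaws.com", "athena.amazonaws.com"],
)
print(json.dumps(trust_policy, indent=2))
```

Listing a specific role ARN instead of the account root narrows who can assume the cross-account role.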
Validate the setup process
After you set up all the roles and accompanying policies, you can validate the setup by running the example notebooks in the GitHub repo. The following code block is an excerpt from the example notebook and must be run in a SageMaker notebook running within Account A. It demonstrates how you can assume the cross-account role from Account B using AWS STS via the AssumeRole API call. This call returns a set of temporary credentials that Account A can use to create any service clients. When you use these clients, your code uses the permissions of the assumed role, and acts as if it belongs to Account B. For more information, see assume_role in the AWS SDK for Python (Boto 3) documentation.
import boto3

# Create STS client
sts = boto3.client('sts')

# Account A assumes the cross-account role that lives in Account B
CROSS_ACCOUNT_ASSUME_ROLE = 'arn:aws:iam::<ACCOUNT B ID>:role/cross-account-assume-role'
metadata = sts.assume_role(RoleArn=CROSS_ACCOUNT_ASSUME_ROLE,
                           RoleSessionName='FeatureStoreCrossAccountAccessDemo')

# Get temporary credentials
access_key_id = metadata['Credentials']['AccessKeyId']
secret_access_key = metadata['Credentials']['SecretAccessKey']
session_token = metadata['Credentials']['SessionToken']

region = boto3.Session().region_name
boto_session = boto3.Session(region_name=region)

# Create SageMaker client that acts with the assumed role's permissions
sagemaker_client = boto3.client('sagemaker',
                                aws_access_key_id=access_key_id,
                                aws_secret_access_key=secret_access_key,
                                aws_session_token=session_token)

# Create SageMaker Feature Store runtime client
sagemaker_featurestore_runtime_client = boto3.client(service_name='sagemaker-featurestore-runtime',
                                                     aws_access_key_id=access_key_id,
                                                     aws_secret_access_key=secret_access_key,
                                                     aws_session_token=session_token)
. . .
# The offline store bucket lives in Account B
offline_config = {'OfflineStoreConfig': {'S3StorageConfig': {'S3Uri': f's3://{OFFLINE_STORE_BUCKET}'}}}

# Create the feature group in Account B's centralized store
sagemaker_client.create_feature_group(FeatureGroupName=FEATURE_GROUP_NAME,
                                      RecordIdentifierFeatureName=record_identifier_feature_name,
                                      EventTimeFeatureName=event_time_feature_name,
                                      FeatureDefinitions=feature_definitions,
                                      Description='< DESCRIPTION >',
                                      # Tags must be a list of Key/Value dictionaries
                                      Tags=[{'Key': '< TAG KEY >', 'Value': '< TAG VALUE >'}],
                                      OnlineStoreConfig={'EnableOnlineStore': True},
                                      RoleArn=CROSS_ACCOUNT_ASSUME_ROLE,
                                      **offline_config)
. . .
# Ingest a single record into the feature group
sagemaker_featurestore_runtime_client.put_record(FeatureGroupName=FEATURE_GROUP_NAME,
                                                 Record=record)
After you create the SageMaker clients as per the preceding code example in Account A, you can create feature groups and populate features into Account B’s centralized online and offline store. For more information about how to create, describe, and delete feature groups, see create_feature_group in the Boto3 documentation. You can also use the Feature Store runtime client to put and get feature records to and from feature groups.
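The runtime client exchanges records as lists of name/value pairs rather than plain dictionaries. As a small illustration, the sketch below converts between the two shapes; the helper names and the feature names are hypothetical.

```python
def to_record(features):
    """Convert a plain dict of features into the list format put_record expects."""
    return [
        {"FeatureName": name, "ValueAsString": str(value)}
        for name, value in features.items()
    ]


def from_record(record):
    """Convert the 'Record' list returned by get_record back into a dict of strings."""
    return {item["FeatureName"]: item["ValueAsString"] for item in record}


# Hypothetical feature values for one user
record = to_record(
    {"user_id": "u1", "age": 31, "event_time": "2021-03-01T12:00:00Z"}
)
print(record)

# With the runtime client from the preceding excerpt, usage might look like:
#   client.put_record(FeatureGroupName=FEATURE_GROUP_NAME, Record=record)
#   response = client.get_record(FeatureGroupName=FEATURE_GROUP_NAME,
#                                RecordIdentifierValueAsString="u1")
#   features = from_record(response["Record"])
```

All values travel as strings, so numeric features need to be cast back to their original types after a read.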
Offline store replication
Reproducibility is the ability to recreate an ML model exactly, so if you use the same features as input, the model returns the same output as the original model. This is essentially what we strive to achieve between the models we develop in a research environment and deploy in a production environment. Replicating feature engineering pipelines across accounts is a complex and time-consuming process that can introduce model discrepancies if not implemented properly. If the feature set used to train a model changes after the training phase, it may be difficult or impossible to reproduce a model.
Applications that reside on AWS usually have several distinct environments and accounts, such as development, testing, staging, and production. To achieve automated deployment of the application across different environments, we use CI/CD pipelines. Organizations often need to maintain isolated work environments and multiple copies of data in the same or different AWS Regions, or across different AWS accounts. In the context of Feature Store, some companies may want to replicate offline feature store data. Offline store replication via Amazon S3 replication can be a useful pattern in this case. This pattern enables isolated environments and accounts to retrain ML models using full feature sets without using cross-account roles or permissions.
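As a rough sketch of what such a replication setup might involve, the following builds a replication configuration for the offline store bucket in the shape accepted by the Amazon S3 `put_bucket_replication` call. The helper name, role ARN, bucket name, and account ID are placeholders, and the sketch assumes that versioning is already enabled on both the source and destination buckets, which S3 replication requires.

```python
def offline_store_replication_config(
    replication_role_arn, destination_bucket, destination_account_id, prefix=""
):
    """Build an S3 replication configuration for copying offline store objects."""
    return {
        # Role that Amazon S3 assumes to replicate objects on your behalf
        "Role": replication_role_arn,
        "Rules": [
            {
                "ID": "ReplicateOfflineStore",
                "Priority": 1,
                "Status": "Enabled",
                # An empty prefix replicates every object in the bucket
                "Filter": {"Prefix": prefix},
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {
                    "Bucket": f"arn:aws:s3:::{destination_bucket}",
                    "Account": destination_account_id,
                    # Hand ownership of the replicas to the destination account
                    "AccessControlTranslation": {"Owner": "Destination"},
                },
            }
        ],
    }


# Placeholder values for illustration only
config = offline_store_replication_config(
    "arn:aws:iam::111122223333:role/offline-store-replication",
    "offline-store-replica-bucket",
    "444455556666",
)

# With a boto3 S3 client in the source account, this could be applied as:
#   s3.put_bucket_replication(Bucket="<SOURCE BUCKET>",
#                             ReplicationConfiguration=config)
```

Replication copies only the S3 objects, so the destination account still needs its own AWS Glue table definitions before it can query the replicated data with Athena.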
Conclusion
In this post, we demonstrated various architecture patterns like the centralized feature store, combined feature store, and other design considerations for SageMaker Feature Store that are essential to cross-functional data science collaboration. We also showed how to set up cross-account access using AWS STS.
To learn more about Feature Store capabilities and use cases, see Understanding the key capabilities of Amazon SageMaker Feature Store and Using streaming ingestion with Amazon SageMaker Feature Store to make ML-backed decisions in near-real time.
If you have any comments or questions, please leave them in the comments section.
About the Authors
Arunprasath Shankar is an Artificial Intelligence and Machine Learning (AI/ML) Specialist Solutions Architect with AWS, helping global customers scale their AI solutions effectively and efficiently in the cloud. In his spare time, Arun enjoys watching sci-fi movies and listening to classical music.
Mark Roy is a Principal Machine Learning Architect for AWS, helping AWS customers design and build AI/ML solutions. Mark’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including Insurance, Financial Services, Media and Entertainment, Healthcare, Utilities, and Manufacturing. Mark holds 6 AWS certifications, including the ML Specialty Certification. Prior to joining AWS, Mark was an architect, developer, and technology leader for 25+ years, including 19 years in financial services.
Stefan Natu is a Sr. AI/ML Specialist Solutions Architect at Amazon Web Services. He is focused on helping financial services customers build end-to-end machine learning solutions on AWS. In his spare time, he enjoys reading machine learning blogs, playing the guitar, and exploring the food scene in New York City.
How one computer scientist and his team aim to bring genome data search to the next level
Politecnico di Milano professor Stefano Ceri is working to integrate genomic datasets into a single accessible system with the support of an Amazon Machine Learning Research Award.
AWS and Hugging Face Collaborate to Simplify and Accelerate Adoption of Natural Language Processing Models
Just like computer vision a few years ago, the decades-old field of natural language processing (NLP) is experiencing a fascinating renaissance. Not a month goes by without a new breakthrough! Indeed, thanks to the scalability and cost-efficiency of cloud-based infrastructure, researchers are finally able to train complex deep learning models on very large text datasets, in order to solve business problems such as question answering, sentence comparison, or text summarization.
In this respect, the Transformer deep learning architecture has proven very successful, and has spawned several state-of-the-art model families:
- Bidirectional Encoder Representations from Transformers (BERT): 340 million parameters [1]
- Text-To-Text Transfer Transformer (T5): over 10 billion parameters [2]
- Generative Pre-Training (GPT): over 175 billion parameters [3]
As amazing as these models are, training and optimizing them remains a challenging endeavor that requires a significant amount of time, resources, and skills, all the more when different languages are involved. Unfortunately, this complexity prevents most organizations from using these models effectively, if at all. Instead, wouldn’t it be great if we could just start from pre-trained versions and put them to work immediately?
This is the exact challenge that Hugging Face is tackling. Founded in 2016, this startup based in New York and Paris makes it easy to add state-of-the-art Transformer models to your applications. Thanks to their popular transformers, tokenizers, and datasets libraries, you can download and predict with over 7,000 pre-trained models in 164 languages. What do I mean by ‘popular’? Well, with over 42,000 stars on GitHub and 1 million downloads per month, the transformers library has become the de facto place for developers and data scientists to find NLP models.
At AWS, we’re also working hard on democratizing machine learning in order to put it in the hands of every developer, data scientist and expert practitioner. In particular, tens of thousands of customers now use Amazon SageMaker, our fully managed service for machine learning. Thanks to its managed infrastructure and its advanced machine learning capabilities, customers can build and run their machine learning workloads quicker than ever at any scale. As NLP adoption grows, so does the adoption of Hugging Face models, and customers have asked us for a simpler way to train and optimize them on AWS.
Working with Hugging Face Models on Amazon SageMaker
Today, we’re happy to announce that you can now work with Hugging Face models on Amazon SageMaker. Thanks to the new HuggingFace estimator in the SageMaker SDK, you can easily train, fine-tune, and optimize Hugging Face models built with TensorFlow and PyTorch. This should be extremely useful for customers interested in customizing Hugging Face models to increase accuracy on domain-specific language: financial services, life sciences, media and entertainment, and so on.
Here’s a code snippet fine-tuning the DistilBERT model for a single epoch.
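from sagemaker.huggingface import HuggingFace

# `role` is the ARN of your SageMaker execution role, defined elsewhere
huggingface_estimator = HuggingFace(
    entry_point='train.py',
    pytorch_version='1.6.0',
    transformers_version='4.4',
    py_version='py36',  # Python version of the training container
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    hyperparameters={
        'epochs': 1,
        'train_batch_size': 32,
        'model_name': 'distilbert-base-uncased'
    }
)

# Launch the training job with the training and test channels
huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})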
As usual on SageMaker, the train.py script uses Script Mode to retrieve hyperparameters as command line arguments. Then, thanks to the transformers library API, it downloads the appropriate Hugging Face model, configures the training job, and runs it with the Trainer API. Here’s a code snippet showing these steps.
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
...
# Download the pre-trained model named in the hyperparameters
model = AutoModelForSequenceClassification.from_pretrained(args.model_name)

# Configure the training job from the command line arguments
training_args = TrainingArguments(
    output_dir=args.model_dir,
    num_train_epochs=args.epochs,
    per_device_train_batch_size=args.train_batch_size,
    per_device_eval_batch_size=args.eval_batch_size,
    warmup_steps=args.warmup_steps,
    evaluation_strategy="epoch",
    logging_dir=f"{args.output_data_dir}/logs",
    learning_rate=float(args.learning_rate)
)

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)

# Run fine-tuning
trainer.train()
As you can see, this integration makes it easier and quicker to train advanced NLP models, even if you don’t have a lot of machine learning expertise.
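The training snippet above elides how train.py obtains its args object. The following is a minimal sketch of that step, assuming the script uses argparse and the standard SageMaker environment variables (SM_MODEL_DIR, SM_OUTPUT_DATA_DIR, SM_CHANNEL_TRAIN, SM_CHANNEL_TEST). The parse_args helper, the training_dir and test_dir argument names, and the default values for eval_batch_size, warmup_steps, and learning_rate are illustrative assumptions, not taken from the original script.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the hyperparameters that SageMaker passes as command line arguments."""
    parser = argparse.ArgumentParser()

    # Hyperparameters set in the estimator's `hyperparameters` dictionary.
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--model_name", type=str, default="distilbert-base-uncased")

    # Assumed extra hyperparameters referenced by the training snippet;
    # these default values are illustrative only.
    parser.add_argument("--eval_batch_size", type=int, default=64)
    parser.add_argument("--warmup_steps", type=int, default=500)
    parser.add_argument("--learning_rate", type=str, default="5e-5")

    # Locations that the SageMaker training container exposes as environment variables.
    parser.add_argument("--model_dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--output_data_dir", type=str,
                        default=os.environ.get("SM_OUTPUT_DATA_DIR", "/opt/ml/output/data"))
    parser.add_argument("--training_dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test_dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TEST"))

    # Ignore any arguments this sketch does not know about.
    args, _ = parser.parse_known_args(argv)
    return args
```

Calling parse_args() with no arguments reads the command line that SageMaker builds from the hyperparameters dictionary; passing a list makes the function easy to exercise locally.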
Customers are already using Hugging Face models on Amazon SageMaker. For example, Quantum Health is on a mission to make healthcare navigation smarter, simpler, and more cost-effective for everybody. Says Jorge Grisman, NLP Data Scientist at Quantum Health: “we use Hugging Face and Amazon SageMaker a lot for many NLP use cases such as text classification, text summarization, and Q&A with the goal of helping our agents and members. For some use cases, we just use the Hugging Face models directly and for others we fine tune them on SageMaker. We are excited about the integration of Hugging Face Transformers into Amazon SageMaker to make use of the distributed libraries during training to shorten the training time for our larger datasets”.
Kustomer is a customer service CRM platform for managing high support volume effortlessly. Says Victor Peinado, ML Software Engineering Manager at Kustomer: “Kustomer is a customer service CRM platform for managing high support volume effortlessly. In our business, we use machine learning models to help customers contextualize conversations, remove time-consuming tasks, and deflect repetitive questions. We use Hugging Face and Amazon SageMaker extensively, and we are excited about the integration of Hugging Face Transformers into SageMaker since it will simplify the way we fine tune machine learning models for text classification and semantic search”.
Training Hugging Face Models at Scale on Amazon SageMaker
As mentioned earlier, NLP datasets can be huge, which may lead to very long training times. In order to help you speed up your training jobs and make the most of your AWS infrastructure, we’ve worked with Hugging Face to add the SageMaker Data Parallelism Library to the transformers library (details are available in the Trainer API documentation).
Adding a single parameter to your HuggingFace estimator is all it takes to enable data parallelism, letting your Trainer-based code use it automatically.
huggingface_estimator = HuggingFace(. . .
    distribution={'smdistributed': {'dataparallel': {'enabled': True}}}
)
That’s it. In fact, the Hugging Face team used this capability to speed up their experiment process by over four times!
Getting Started
You can start using Hugging Face models on Amazon SageMaker today, in all AWS Regions where SageMaker is available. Sample notebooks are available on GitHub. In order to enjoy automatic data parallelism, please make sure to use version 4.3.0 of the transformers library (or newer) in your training script.
Please give it a try, and let us know what you think. As always, we’re looking forward to your feedback. You can send it to your usual AWS Support contacts, or in the AWS Forum for SageMaker.
[1] “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova.
[2] “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”, Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.
[3] “Improving Language Understanding by Generative Pre-Training”, Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever.
About the Author
Julien is the Artificial Intelligence & Machine Learning Evangelist for EMEA. He focuses on helping developers and enterprises bring their ideas to life. In his spare time, he reads the works of JRR Tolkien again and again.
Announcing AWS Media Intelligence Solutions
Today, we’re pleased to announce the availability of AWS Media Intelligence (AWS MI) solutions, a combination of services that empower you to easily integrate AI into your media content workflows. AWS MI allows you to analyze your media, improve content engagement rates, reduce operational costs, and increase the lifetime value of media content. With AWS MI, you can choose turnkey solutions from participating AWS Partners or use AWS Solutions to enable rapid prototyping. These solutions cover four important use cases: content search and discovery, captioning and localization, compliance and brand safety, and content monetization.
The demand for media content in the form of audio, video and images is growing at an unprecedented rate. Consumers are relying on content not only to entertain, but also to educate and facilitate purchasing decisions. To meet this demand, media content production is exploding. However, the process of producing, distributing, and monetizing this content is often complex, expensive and time consuming. Applying AI and machine learning (ML) capabilities like image and video analysis, audio transcription, machine translation, and text analytics can solve many of these problems. AWS continues to invest with leading technology and consulting partners to meet customer requests for ready-to-use ML capabilities across the most popular use cases.
AWS Media Intelligence use cases
AWS MI solutions are powered by AWS AI services, like Amazon Rekognition for image and video analysis, Amazon Transcribe for audio transcription, Amazon Comprehend for natural language comprehension, and Amazon Translate for language translation.
As a result, AWS MI can process and analyze media assets to automatically generate metadata from images, video, and audio content for downstream workloads at scale. Together, these solutions support AWS MI’s four primary use cases:
Search and discovery
Search and discovery use cases use AWS image and speech recognition capabilities to perform automatic metadata tagging on media assets and to identify objects such as logos and characters, creating searchable indexes. Historically, humans would manually review and annotate content to facilitate search. Automating these steps reduces the amount of human-led annotation, content production costs, and highlight generation efforts. AWS Partners CLIPr, Dalet, EditShare, Empress Media Asset Management, Evertz, GrayMeta, IMT, KeyCore, Quantiphi, PromoMii, SDVI, Starchive, Synchronized, and TrackIt all offer off-the-shelf solutions built on top of AWS AI/ML technology to meet highly specific customer needs.
Customers who are using Search and discovery solutions include TF1, a French free-to-air television channel owned by TF1 Group. They work with Synchronized, an AWS MI Technology Partner who provides turnkey search and discovery solutions for broadcasters and over the top (OTT) platforms. Nicolas Lemaitre, the Digital Director for the TF1 Group says, “Due to the ever-increasing size of the MYTF1 catalog, manual delivery of editorial tasks, such as the creation of thumbnails, is no longer scalable. Synchronized’s Media Intelligence solution and its applied artificial intelligence allow us to automate part of these tasks and enhance our premium content. We can now process bulk video content rapidly and with high quality control reducing cost and time to market.”
Subtitling and localization
Subtitling and localization use cases increase user engagement and reach by using our speech recognition and machine translation services to produce subtitles in the languages that cater to a diverse audience. CaptionHub and EEG are AWS Partners that combine their own expertise with AWS AI/ML capabilities to offer rich subtitling and localization solutions.
As an international financial services provider, Allianz SE offers over 100 million customers worldwide products and solutions in insurance and asset management. They work with CaptionHub, an AWS Technology Partner who provides an AI-powered platform for enterprise subtitles powered by Amazon Transcribe and Amazon Translate. Steve Flynn, the marketing manager at Allianz says, “Video is a fast growing tool for internal education and external marketing for Allianz, globally. We had a large backlog of videos that we needed to transcribe where speed of production was the imperative factor. With AWS Partner CaptionHub, we have been able to speed up the initial transcription and translation of captions by using speech recognition and machine translation. I am producing the captions in 1/8th of the time I did before. The speech recognition accuracy has been impressive and adds a new dimension of professionalism to our videos.”
Compliance and brand safety
With fully managed image, video or speech analysis APIs, compliance and brand safety capabilities make it easy to detect inappropriate, unwanted, or offensive content to increase brand safety and comply with content standards or regional regulations. AWS Partners offering Compliance and brand safety solutions include Dalet, EditShare, GrayMeta, Quantiphi and SDVI.
SmugMug is a photo sharing, photo hosting service, and online video company that operates two large Internet-scale platforms: SmugMug & Flickr. Don MacAskill, the co-founder, CEO, & Chief Geek at SmugMug says, “As a large, global platform, unwanted content is extremely risky to the health of our community and can alienate photographers. We use Amazon Rekognition’s content moderation feature to find and properly flag unwanted content, enabling a safe and welcoming experience for our community. Especially at Flickr’s huge scale, doing this without Amazon Rekognition is nearly impossible. Now, thanks to content moderation with Amazon Rekognition, our platform can automatically discover and highlight amazing photography that more closely matches our members’ expectations, enabling our mission to inspire, connect, and share.”
Content monetization
Lastly, you can boost ROI with our newest content monetization solutions that are contextual and effective. This is an area in which ML drives innovation that aims to deliver the right ads at the right time to the right consumer based on context. Contextual advertising increases advertisers’ return on investment while minimizing disruption to viewers. With content monetization solutions, you can generate rich metadata from audio and visual assets that allows you to optimize ad insertion placement and automatically identify sponsorship opportunities. Any organization seeking online media advertising optimization can benefit from solutions from AWS Partners Infinitive, Quantiphi, TrackIt, or TripleLift.
TripleLift is an AWS Technology Partner that provides a programmatic advertising platform powered by AWS MI. Tastemade, a leading food and lifestyle brand, has deployed TripleLift ad experiences in over 200 episodes of their programming. Jeff Imberman, Tastemade’s Head of Sales and Brand Partnerships, says “TripleLift’s deep learning-based video analysis provides us with a scalable solution for finding thousands of moments for inserting integrated ad experiences in programming, supplementing a high touch marketing function with artificial intelligence.”
AWS Media Intelligence Partners
Swami Sivasubramanian, the VP of Amazon Machine Learning at AWS says, “The volume of media content that our customers are creating and managing is growing exponentially. Customers can dramatically reduce the time and cost requirements to produce, distribute and monetize media content at scale with AWS Media Intelligence and its underlying AI services. We are excited to announce these solutions backed by a strong group of AWS Technology and Consulting Partners. They are committed to delivering turnkey solutions that enable our customers to easily achieve the benefits of ML transformation while removing the heavy lifting needed to integrate ML capabilities into their existing media content workflows.”
AWS Partners that provide AWS MI solutions include AWS Technology Partners: CaptionHub, CLIPr, Dalet, EditShare, EEG Video, Empress, Evertz, GrayMeta, IMT, PromoMii, SDVI, Starchive, Synchronized, and TripleLift. For customers who require a custom solution or seek additional assistance with AWS MI, AWS Consulting Partners: Infinitive, KeyCore, Quantiphi, and TrackIt can help.
For more than 50 years, Evertz has specialized in connecting content creators and audiences by simplifying complex broadcast and new-media workflows, including delivering captioning and subtitling services via the cloud. Martin Whittaker, Technical Director of MAM and Automation at Evertz, says, “Effective captioning is increasingly used as a tool to drive engagement with audiences who are streaming video content in public spaces or in other regions around the world. Using Amazon Transcribe, Evertz’s Emmy Award-winning Mediator-X cloud playout platform delivers a cost-effective way to generate and receive speech-to-text caption and subtitle files at scale. Mediator’s Render-X transcode engine converts caption files into different formats for use across all terrestrial, direct to consumer (D2C) and OTT TV channels or video on demand (VOD) deliveries, eliminating redundancies in resourcing.”
PromoMii provides video content editing tools powered by AWS AI services such as Amazon Comprehend, Amazon Rekognition, and Amazon Transcribe. Michael Moss, Co-founder and CEO of PromoMii says, “AI creates a paradigm shift in virtually every sector, especially in video editing. With Nova, our video editing platform powered by AWS Media Intelligence, our customers are able to save 70% on video editing time and reduce costs by up to 97%. Now Disney and other customers are able to produce a quantitatively higher amount of quality content. Working closely with AWS MI helps us to develop and deploy our technology at a faster pace to meet market demand. What would have taken us weeks before, now only takes a matter of days or even hours.”
AWS solutions and open source frameworks for Media Intelligence
AWS Solutions for MI provide an accelerator and pattern for partners or customers who want to build internally without starting from the ground up. Media2Cloud is a serverless, open source solution for ingesting media into AWS and exporting assets into a Media Asset Management (MAM) system used by many of our MI partners. Media2Cloud has pre-built mechanisms to augment content with ML-generated descriptive metadata powered by AWS AI services for search and discovery.
The AWS Content Analysis solution enables customers to perform automated video content analysis using a serverless application model to generate meaningful insights through machine learning (ML) generated metadata. The solution leverages AWS image and video analysis, speech transcription, machine translation, and text analytics.
The custom brand detection solution is specifically built to help content producers prepare a dataset and train models to identify and detect specific brands on videos and images. This combines Amazon Rekognition Custom Labels and Amazon SageMaker Ground Truth.
You can also create an ad insertion solution by combining Amazon Rekognition with AWS Elemental MediaTailor. MediaTailor is a channel assembly and personalized ad insertion solution for video providers to create linear OTT channels using existing video content and monetize those channels, or other live streams and VOD content.
Get started with AWS Media Intelligence
With AWS, you have access to a range of options, including AWS Technology and Consulting Partners and open source projects, that help you get started with AWS Media Intelligence:
- Contact a participating AWS Partner by visiting the AWS MI Partners page for more information, demos, and contact details.
- Attend a Tech Talk webinar and learn more about AWS Media Intelligence on April 26 at 11am PST. “Increase the lifetime value of media content, while reducing costs with AWS Media Intelligence solutions.” Registration will open in early April.
- Want to build it yourself? Get started with our pre-built AWS solutions used by many MI Partners:
  - Media2Cloud, an automated video asset ingestion and metadata tagging solution
  - AWS Content Analysis, a video analytics solution for search, subtitling, and translation
  - Amazon Rekognition Custom Labels, Amazon Transcribe vocabulary filters, Amazon Transcribe content redaction, and Amazon SageMaker Ground Truth for content compliance
  - Ad insertion solution, an ad insertion solution to help monetize your video content
- Learn more about the underlying AI services: Amazon Rekognition, Amazon Transcribe, Amazon Translate, and Amazon Comprehend.
About the Authors
Vasi Philomin is the GM for Machine Learning & AI at AWS, responsible for Amazon Lex, Polly, Transcribe, Translate and Comprehend.
Esther Lee is a Product Manager for AWS Language AI Services. She is passionate about the intersection of technology and education. Out of the office, Esther enjoys long walks along the beach, dinners with friends and friendly rounds of Mahjong.
Create forecasting systems faster with automated workflows and notifications in Amazon Forecast
You can now enable notifications for workflow status changes while using Amazon Forecast, allowing you to work seamlessly without the disruption of having to check if a particular workflow has completed. Additionally, you can now automate workflows through the notifications to increase work efficiency. Forecast uses machine learning (ML) to generate more accurate demand forecasts, without requiring any prior ML experience. Forecast brings the same technology used at Amazon.com to developers as a fully managed service, removing the need to manage resources or rebuild your systems.
Previously, you had to proactively check to see if a job was complete at the end of each stage, whether it was importing your data, training the predictor, or generating the forecast. The time needed to import your data or train a predictor can vary depending on the size and contents of your data. The wait time can feel even longer when you have to constantly check the status before being able to proceed to the next task. The workflow disruption can negatively impact the entire day’s work. Additionally, if you were integrating Forecast into software solutions, you had to build notifications yourself, creating additional work.
Now, with a one-time setup of workflow notifications, you can choose to either be notified when a specific step is complete or set up sequential workflow tasks after the preceding workflow is complete, which eliminates administrative overhead. Forecast enables notifications by onboarding to Amazon EventBridge, which lets you activate these notifications either directly through the Forecast console or through APIs. You can customize the notification based on your preference of rules and selected events. You can also use EventBridge notifications to fully automate the forecasting cycle end to end, allowing for an even more streamlined experience using Forecast. Software as a service (SaaS) providers can set up routing rules to determine where to send generated forecasts to build applications that react in real time to the data that is received.
EventBridge allows you to build event-driven Forecast workflows. For example, you can create a rule that when data has been imported into Forecast, the completion of this event triggers the next step of training a predictor through AWS Lambda functions. We explore using Lambda functions to automate the Forecast workflow through events in the next section. Or, after the predictor has been trained, you can set up a new rule to receive an SMS text message notification through Amazon Simple Notification Service (Amazon SNS), reminding you to return to Forecast to evaluate the accuracy metrics of the predictor before proceeding to the next step. For this post, we use Lambda with Amazon Simple Email Service (Amazon SES) to send notification messages. For more information, see How do I send email using Lambda and Amazon SES?
Solution overview
In this section, we provide an example of how you can automate Forecast workflows using EventBridge notifications, from importing data to training a predictor and generating forecasts.
It starts by creating rules in EventBridge that can be accessed through the API, SDK, CLI, and the Forecast console. You can also see the demonstration in the next section. For this use case, we select the target for all the rules as a Lambda function. For instructions on creating the functions and adding the necessary permissions, see Steps 1 and 2 in Tutorial: Schedule AWS Lambda Functions Using EventBridge.
You create rules for the following:
- Dataset import – Checks whether the status field in the event is ACTIVE and invokes the Forecast Create Predictor API
- Predictor – Checks whether the status field in the event is ACTIVE and invokes the Forecast Create Forecast API
- Forecast – Checks whether the status field in the event is ACTIVE and invokes the Forecast Create Forecast Export API
- Forecast Export – Checks whether the status field in the event is ACTIVE and invokes Amazon SES to send an email. At this point, the forecast results are already exported to your Amazon Simple Storage Service (Amazon S3) bucket.
After you set up the rules, you can start with your first workflow of calling the dataset import job API. Forecast starts sending status change events with statuses like CREATE_IN_PROGRESS, ACTIVE, CREATE_FAILED, and CREATE_STOPPED to your account. After the event gets matched to the rule, it invokes the target Lambda function configured on the rule, and moves to the next steps of training a predictor, creating a forecast, and finally exporting the forecasts. After the forecasts are exported, you receive an email notification.
The following diagram illustrates this architecture.
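The rule-to-function chain described above can be sketched as a single Lambda handler that inspects each event and decides what to do next. This is a minimal illustration under stated assumptions, not the implementation used in this post: the status field is assumed to sit under the event's detail key, and only the "Forecast Dataset Import Job State Change" detail type appears in this post. The other three detail-type strings, the NEXT_STEP table, and the next_step helper are hypothetical and should be checked against the events your account actually receives.

```python
# Hypothetical mapping from the EventBridge detail-type to the next workflow step.
# Only the dataset import detail-type is named in this post; the rest are assumptions.
NEXT_STEP = {
    "Forecast Dataset Import Job State Change": "create_predictor",
    "Forecast Predictor State Change": "create_forecast",
    "Forecast Forecast State Change": "create_forecast_export",
    "Forecast Forecast Export Job State Change": "send_email",
}


def next_step(event):
    """Return the name of the next step, or None if nothing should run yet."""
    detail = event.get("detail", {})
    # Assumption: the status field lives under "detail" in the event.
    # Only continue once the resource has been created successfully.
    if detail.get("status") != "ACTIVE":
        return None
    return NEXT_STEP.get(event.get("detail-type"))


def lambda_handler(event, context):
    """Lambda entry point: report which step this event should trigger."""
    step = next_step(event)
    if step is None:
        return {"action": "none"}
    # A real function would call the matching Forecast API or Amazon SES here,
    # for example through a boto3 client, with parameters for your own resources.
    return {"action": step}
```

Keeping the decision logic in a pure function such as next_step makes it possible to test the workflow with sample events before wiring in any AWS calls.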
Create rules for Forecast notifications through EventBridge
To create your rules for notifications, complete the following steps:
- On the Forecast console, choose your dataset.
- In the Dataset imports section, choose Configure notifications.
Links to additional information about setting up notifications are available in the help pane.
You’re redirected to the EventBridge console, where you now create your notification.
- In the navigation pane, under Events, choose Rules.
- Choose Create rule.
- For Name, enter a name.
- Under Define pattern, select Event pattern.
- For Event matching patterns, select Pre-defined pattern by service.
- For Event type, choose your event on the drop-down menu.
For this post, we choose Forecast Dataset Import Job State Change because we’re interested in knowing when the dataset import is complete.
When you choose your event, the appropriate event pattern is populated in the Event pattern section.
- Under Select event bus, select AWS default event bus.
- Confirm that Enable the rule on the selected event bus is enabled.
- For Target, choose Lambda function.
- For Function, choose the function you created.
- Choose Create.
Make sure that the rule and targets are in the same Region.
You’re redirected to the Rules page on the EventBridge console, where you can see a confirmation that your rule was created successfully.
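The same rule can also be created programmatically. The sketch below assumes the AWS SDK for Python (boto3) and uses its EventBridge put_rule and put_targets calls. The source value aws.forecast, the optional status filter under detail, the target ID, and the helper names are assumptions for illustration; compare the generated pattern with the one the console populates in the Event pattern section before relying on it.

```python
import json


def forecast_event_pattern(detail_type, statuses=None):
    """Build an EventBridge event pattern for a Forecast state change event."""
    pattern = {
        "source": ["aws.forecast"],  # assumed source value for Forecast events
        "detail-type": [detail_type],
    }
    if statuses:
        # Optional filter so that the rule only matches the listed statuses.
        pattern["detail"] = {"status": list(statuses)}
    return pattern


def create_rule(rule_name, lambda_arn, pattern):
    """Create an enabled rule on the default event bus and point it at a Lambda function."""
    import boto3  # imported here so the pattern helper works without the AWS SDK

    events = boto3.client("events")
    events.put_rule(Name=rule_name, EventPattern=json.dumps(pattern), State="ENABLED")
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "forecast-next-step", "Arn": lambda_arn}],  # hypothetical target ID
    )


pattern = forecast_event_pattern("Forecast Dataset Import Job State Change", ["ACTIVE"])
```

When you create the rule this way rather than through the console, the Lambda function also needs a resource-based permission that allows EventBridge to invoke it.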
Conclusion
You can now enable notifications for workflow status changes while using Forecast. With a one-time setup of workflow notifications, you can choose to either get notified or set up sequential workflow tasks after the preceding workflow has completed, eliminating administrative overhead.
To get started with this capability, see Setting Up Job Status Notifications. You can use this capability in all Regions where Forecast is publicly available. For more information about Region availability, see AWS Regional Services.
About the Authors
Alex Kim is a Sr. Product Manager for Amazon Forecast. His mission is to deliver AI/ML solutions to all customers who can benefit from it. In his free time, he enjoys all types of sports and discovering new places to eat.
Ranjith Kumar Bodla is an SDE in the Amazon Forecast team. He works as a backend developer within a distributed environment with a focus on AI/ML and leadership. During his spare time, he enjoys playing table tennis, traveling, and reading.
Raj Vippagunta is a Senior SDE at AWS AI Services. He leverages his vast experience in large-scale distributed systems and his passion for machine learning to build practical service offerings in the AI space. He has helped build various solutions for AWS and Amazon. In his spare time, he likes reading books and watching travel and cuisine vlogs from across the world.
Shannon Killingsworth is a UX Designer for Amazon Forecast and Amazon Personalize. His current work is creating console experiences that are usable by anyone, and integrating new features into the console experience. In his spare time, he is a fitness and automobile enthusiast.
Lab126, University of Maryland collaborate to develop reliability models to build resilient devices
Amazon Lab126 and the Center for Risk and Reliability will study how devices are accidentally damaged, and how to help ensure they survive more of those incidents.
RAPIDS and Amazon SageMaker: Scale up and scale out to tackle ML challenges
In this post, we combine the powers of NVIDIA RAPIDS and Amazon SageMaker to accelerate hyperparameter optimization (HPO). HPO runs many training jobs on your dataset using different settings to find the best-performing model configuration.
HPO helps data scientists reach top performance, and is applied when models go into production, or to periodically refresh deployed models as new data arrives. However, HPO can feel out of reach on non-accelerated platforms as dataset sizes continue to grow.
With RAPIDS and SageMaker working together, workloads like HPO are GPU scaled up (multi-GPU) within a node and cloud scaled out over parallel instances. With this collaboration of technologies, machine learning (ML) jobs like HPO complete in hours instead of days, while also reducing costs.
The Amazon Packaging Experience Team (CPEX) recently found similar speedups using our HPO demo framework on their gradient boosted models for selecting minimal packaging materials based on product features. For more information about their relentless quest to shrink packaging and reduce waste with AI, see Inside Amazon’s quest to use less cardboard.
Getting started
We encourage you to get hands-on and launch a SageMaker notebook so you can replicate this demo or use your own dataset. This RAPIDS with SageMaker HPO example is part of the amazon-sagemaker-examples GitHub repository, which is integrated into the SageMaker UX, making it very simple to launch. We also have a video walkthrough of this material.
The key ingredients for cloud HPO are a dataset, a RAPIDS ML workflow containerized as a SageMaker estimator, and a SageMaker HPO tuner definition. We go through each element in order and provide benchmarking results.
Dataset
Our hope is that you can use your own dataset for this walkthrough, so we’ve tried to make this easy by supporting any tabular dataset as input, such as Parquet or CSV format, stored on Amazon Simple Storage Service (Amazon S3).
For this post, we use the airline dataset to set up a classification workflow for whether a flight will arrive more than 15 minutes late. This dataset has been collected by the US Bureau of Transportation Statistics for over 30 years and includes 14 features (such as distance, source, origin, carrier ID, and scheduled vs. actual departure and arrival).
The following graph shows that for the past 20 years, 81% of flights arrived on time—meaning less than 15 minutes late to arrive at their destination. 2020 is at 90% due to less congestion in the sky.
The following graph shows the number of domestic US flights (in millions) for the last 20 years. We can see that although 2020 counts have only been reported through September, the year is going to come in below the running average.
The following image shows 10,000 flights out of Atlanta. The arch height represents delays. Flights out of most airports arrive late when covering a great distance. Atlanta is an outlier, with delays common even for short flights.
SageMaker estimator
Now that we have our dataset, we build a RAPIDS ML workflow and package it using the SageMaker Training API into an interface called an estimator. Our estimator is essentially a container image that holds our code as well as some additional software (sagemaker-training-toolkit), which helps make sure everything is correctly hooking up to the AWS Cloud. SageMaker uses our estimator image as a way to deploy the same logic to all the parallel instances that participate in the HPO search process.
RAPIDS ML workflow
For this post, we built a lightweight RAPIDS ML workflow that doesn’t delve into data augmentation or feature engineering, but rather offers the bare essentials so that everything is simple and the focus remains on HPO. The steps of the workflow include data ingestion, model training, prediction, and scoring.
We offer four variations of the workflow, which unlock increasing amounts of parallelism and allow for experimentation with different libraries and instance types. The curious reader is welcome to dive into the code for each option:
At a high level, all the workflows accomplish the same goal; however, in the GPU case, we replace the CPU Pandas and CPU SKLearn libraries with RAPIDS cuDF and cuML, respectively. Because the dataset scales into very large numbers of samples (over 10 years of airline data), we recommend using the multi-CPU and multi-GPU workflows, which add Dask and enable data and computation to be distributed among parallel workers. Our recommendations are captured in the notebook, which offers an on-the-fly instance type recommendation based on the choice of CPU vs. GPU as well as dataset size.
HPO tuning
Now that we have our dataset and estimator prepared, we can turn our attention to defining how we want the hyperparameter optimization process to unfold. Specifically, we should now decide on the following:
- Hyperparameter ranges
- The strategy for searching through the ranges
- How many experiments to run in parallel and the total experiments to run
Hyperparameter ranges
The hyperparameter ranges are at the heart of HPO. Choosing large ranges for parameters allows the search to consider many model configurations and increase its probability of finding a champion model.
In this post, we focus on tuning model size and complexity by varying the maximum depth and the number of trees for XGBoost and Random Forest. To guard against overfitting, we use cross-validation so that each configuration is retested with different splits of the train and test data.
Search strategy
In terms of HPO search strategy, SageMaker offers Bayesian and random search. For more information, see How Hyperparameter Tuning Works. For this post, we use the random search strategy.
HPO sizing
Lastly, in terms of sizing, we set the notebook defaults to a relatively small HPO search of 10 experiments, running two at a time so that everything runs quickly end-to-end. For a more realistic use case, we used the same code but ramped up the number of experiments to 100, which is what we have benchmarked in the next section.
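To make the three choices above concrete, here is a minimal pure-Python sketch of random search with the notebook's default sizing (10 experiments, two at a time). It only illustrates the sampling idea and is not how SageMaker implements its tuner. The RANGES values and the helper names are hypothetical, and grouping jobs into fixed waves is a simplification, since a managed tuner can start a new job as soon as one finishes.

```python
import random

# Hypothetical integer ranges for the two hyperparameters tuned in this post.
RANGES = {"max_depth": (5, 25), "n_estimators": (100, 1000)}


def sample_configs(ranges, max_jobs, seed=0):
    """Random search: draw each hyperparameter uniformly from its range."""
    rng = random.Random(seed)
    return [
        {name: rng.randint(low, high) for name, (low, high) in ranges.items()}
        for _ in range(max_jobs)
    ]


def in_waves(configs, max_parallel_jobs):
    """Group configurations into batches that would run at the same time."""
    return [
        configs[i:i + max_parallel_jobs]
        for i in range(0, len(configs), max_parallel_jobs)
    ]


# Notebook defaults from this post: 10 experiments in total, two in parallel.
configs = sample_configs(RANGES, max_jobs=10)
waves = in_waves(configs, max_parallel_jobs=2)
```

In the SageMaker Python SDK, the corresponding knobs are the max_jobs and max_parallel_jobs arguments of HyperparameterTuner, together with its strategy setting.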
Results and benchmarks
In our benchmarking, we tested 100 XGBoost HPO runs with 10 years of the airline dataset (approximately 60 million flights). On the ml.p3.8xlarge instances (4x Volta100 GPUs), we see a 14 times reduction in runtime (6 hours vs. over 3 days) and a 4.5 times cost reduction compared with the ml.m5.24xlarge instances.
Production grade HPO jobs running on the CPU may time out because they exceed the 24-hour runtime limit we added as a safeguard (in our run, 12 out of 100 CPU jobs were stopped).
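The speedup and cost figures quoted above are linked by simple arithmetic, which the following sketch spells out. It assumes that cost is hourly price multiplied by hours and that both runs use the same number of instances; the function name is hypothetical, and no actual instance prices are used.

```python
def implied_price_ratio(speedup, cost_reduction):
    """Hourly price ratio (GPU / CPU) implied by a speedup and a cost reduction.

    With cost = hourly_price * hours:
        cost_cpu / cost_gpu = (price_cpu / price_gpu) * (hours_cpu / hours_gpu)
    so  price_gpu / price_cpu = speedup / cost_reduction.
    """
    return speedup / cost_reduction


# Figures reported in this post: 14 times faster and 4.5 times cheaper.
ratio = implied_price_ratio(speedup=14, cost_reduction=4.5)
```

Under these assumptions, the GPU instance can cost roughly three times as much per hour as the CPU instance and still come out 4.5 times cheaper for the whole search.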
As a final benchmarking exercise to showcase what’s happening on each training run, we show an example of a single fold of cross-validation on the entire airline dataset (33 years going back to 1987) for both XGBoost and Random Forest with a middle-of-the-pack model complexity (max_depth is 15, n_estimators is 500).
We can see the computational advantage of the GPU for model training and how this advantage grows along with the parallelism inherent in the algorithm used (Random Forest is embarrassingly parallel, whereas XGBoost builds trees sequentially).
Deploying our best model with the Forest Inference Library
As a final touch, we also offer model serving. This is done in the serve.py code, where a Flask server loads the best model found during HPO and uses the Forest Inference Library (FIL) for GPU-accelerated large batch inference. The FIL works for both XGBoost and Random Forest, and can be 28 times faster relative to CPU-based inference.
Conclusion
We hope that after reading this post, you’re inspired to try combining RAPIDS and SageMaker for HPO. We’re sure you’ll benefit from the tremendous acceleration made possible by GPUs at cloud scale. AWS also recently launched the Ampere100 GPUs in the form of p4d instances, which are the fastest ML nodes in the cloud, and should be coming to SageMaker soon.
At NVIDIA and AWS, we hope to continue working to democratize high performance computing both in terms of ease of use (such as SageMaker notebooks that spawn large compute workloads) and in terms of total cost of ownership. If you run into any issues, let us know via GitHub. You can also get in touch with us via Slack, Google Groups, or Twitter. We look forward to hearing from you!
About the Authors
Wenming Ye is an AI and ML specialist architect at Amazon Web Services, helping researchers and enterprise customers use cloud-based machine learning services to rapidly scale their innovations. Previously, Wenming gained diverse R&D experience at Microsoft Research, the SQL engineering team, and successful startups.
Miro Enev, PhD is a Principal Solution Architect at NVIDIA.
Establishing a new standard in answer selection precision
A model that uses both local and global context improves on the state of the art by 6% and 11% on two benchmark datasets.
Helmet detection error analysis in football videos using Amazon SageMaker
The National Football League (NFL) is America’s most popular sports league. Founded in 1920, the NFL developed the model for the successful modern sports league and is committed to advancing progress in the diagnosis, prevention, and treatment of sports-related injuries. Health and safety efforts include support for independent medical research and engineering advancements in addition to a commitment to better protect players and make the game safer. This includes enhancements to medical protocols and improvements to how our game is taught and played. For more information about the NFL’s health and safety efforts, see NFL Player Health and Safety.
We have partnered with AWS to develop the Digital Athlete program, where we use AWS machine learning (ML) services to identify potential risks coming from helmet-to-helmet, helmet-to-shoulder and other body parts, and helmet-to-ground collisions. As of this writing, there is no automated way to identify these collisions. An expert needs to review hours of game footage to visually identify impacts and compare that to the actual collisions reported during the game. Our team, in collaboration with AWS Professional Services and BioCore, is developing computer vision algorithms to analyze All-22 videos using Amazon SageMaker to help shape the future of American football and its players.
We planned to accomplish this objective in three steps: detect helmets, track detected helmets, and identify impacts to tracked helmets on the field. The tracking and impact detection workflows are beyond the scope of this post. This discussion focuses on helmet detection even under challenging conditions such as when players are obscured by other players for several frames and when video quality and video zoom effects change as the cameras track the action.
In this post, we discuss how state-of-the-art object detection model metrics don’t provide the full picture of where detection goes wrong, and how that motivated us to create a custom visualization for the entire play that shows the full story of helmet detection performance as a function of time within the play. This visualization has significantly improved our understanding of when and how our helmet detection algorithms fail.
Detection challenge
The challenges of a helmet detector model with respect to team play are three-fold:
- Helmet size is small compared to the image size in a typical clip of sideline or end zone view
- Precise detection is important to subsequently track the same helmet in future clips to correctly identify an impact, if any
- State-of-the-art object detection metrics collected from models don’t provide the full picture in the context of game plays
To address the first two challenges, we considered object detection algorithms that work well on relatively small objects and emphasize accuracy over speed.
To address the third challenge, we introduced a custom visualization technique that focused on some of the shortcomings of the conventional model metrics, specifically the following:
- A frame-wise error analysis that captures missed and false detections
- A visual summary of stacked true positives, false positives, and false negatives per frame over time to assess model performance for the entire play
Dataset and modeling
We recently announced a Kaggle competition (NFL 1st and Future – Impact Detection) for ML experts around the world to contribute towards NFL research addressing the need for a computer vision system to detect on-field helmet impacts as part of the Digital Athlete platform. In this post, we use static images from the competition data as an example to build a helmet detection model. We used Amazon SageMaker Ground Truth to create the computer vision dataset that is as accurate as possible to build a solid platform.
We used the Kaggle API to download the data within the SageMaker notebook instance. For instructions on creating a notebook instance, see Create a Notebook Instance. We used an ml.p3.2xlarge instance with one GPU and a 50 GB EBS volume for better data manipulation and training. For more information about instance types, see Available Instance Types.
We started with some basic EDA to explore the static images and corresponding annotations. The labeled image dataset consists of 9,947 labeled images (with 4,958 sideline and 4,989 end zone) and a CSV file named image_labels.csv that contains the labeled bounding boxes for all images. The labeled file contains 193,736 helmets (114,986 sideline and 78,750 end zone) with 9,825 unique plays.
There are five different helmet labels, including Blurred, Sideline, Partial, and Difficult. The following table summarizes each label’s percentage of occurrence.
Helmet label type | Percentage of occurrence |
Helmet | 66.98% |
Helmet-Blurred | 17.31% |
Helmet-Sideline | 7.76% |
Helmet-Partial | 4.55% |
Helmet-Difficult | 3.39% |
We considered all Helmet types to be the same for simplicity and did an 80/20 split to train and test in the modeling phase.
Next, we used FasterRCNN with ResNet50 FPN as our helmet detection model and used a pretrained model based on COCO data within a PyTorch framework. For more information about object detection in TorchVision, see TorchVision Object Detection Finetuning Tutorial. The network seemed like an ideal choice because it detects objects of relatively smaller size and has performed very well in multiple standard object detection competitions. The goal was not to build an award-winning helmet detection model, but to identify errors in specific images within an entire play with a relatively high-performing model.
Model performance metrics
We trained the model using the default PyTorch Conda environment pytorch_p36 within a SageMaker notebook instance. At the end of 10 epochs, the Average Precision (AP) @[IoU=0.50:0.95] on the test set was 0.498 and the Average Recall @[IoU=0.50:0.95] was 0.56, which we deemed excellent for an object detector.
We took the saved model and evaluated it frame by frame on an entire play (for example, 57583_000082_Endzone), using the annotation labels for the entire play as ground truth. The following graph is a plot of precision vs. recall for all the frames, with an mAP of 93.12%, computed using an object detection metrics package.
As evident from the plot, this is an excellent model and only fails if the helmet is either blurred or too difficult to detect even with expert eyes.
Next, we calculated the number of true positives, false positives, and false negatives for each frame of the 57583_000082_Endzone play. To match the predicted detection with ground truth annotations, we only considered predictions with scores higher than 0.9 and a 0.25 IoU threshold between ground truth and the predicted bounding boxes. The conflicts between multiple detections for the same ground truth bounding boxes were resolved using a confidence score. Essentially, we only considered the highest confidence detections for multiple detections.
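The matching rule described above can be sketched as follows. This is a minimal illustration rather than the code used for the analysis: the function names are hypothetical, boxes are assumed to be (x1, y1, x2, y2) tuples, and a greedy highest-confidence-first pass stands in for the conflict resolution.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_detections(gt_boxes, detections, conf_thres=0.9, iou_threshold=0.25):
    """Count (tp, fp, fn) for one frame.

    detections is a list of (score, box) pairs. Only detections scoring above
    conf_thres are kept, and each ground truth box can be claimed once, by the
    most confident detection that overlaps it by more than iou_threshold.
    """
    kept = sorted(
        (d for d in detections if d[0] > conf_thres),
        key=lambda d: d[0],
        reverse=True,
    )
    matched_gt = set()
    tp = 0
    for _, box in kept:
        best_idx, best_iou = None, iou_threshold
        for idx, gt in enumerate(gt_boxes):
            if idx in matched_gt:
                continue  # already claimed by a more confident detection
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_idx, best_iou = idx, overlap
        if best_idx is not None:
            matched_gt.add(best_idx)
            tp += 1
    fp = len(kept) - tp      # confident detections with no ground truth match
    fn = len(gt_boxes) - tp  # ground truth helmets that were never matched
    return tp, fp, fn


# Made-up boxes for illustration only
gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [
    (0.95, (1, 1, 10, 10)),
    (0.92, (0, 0, 9, 9)),
    (0.50, (20, 20, 30, 30)),
    (0.99, (50, 50, 60, 60)),
]
print(match_detections(gt, dets))
```

In this toy frame, the second detection on the first helmet counts as a false positive because a more confident detection already claimed that ground truth box, and the second helmet is a false negative because its only detection falls below the score cutoff.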
The number of ground truth helmets in each frame can vary between 18–22 for 57583_000082_Endzone, whereas our model predicted anywhere between 15–23 helmets. Therefore, even though our model is an excellent one, it did miss some helmets and made wrong predictions. Because false negatives or missed detections are more important for proper tracking of the players, we looked into the frames where we got too many false negatives.
The following image shows an example where the model predicted every helmet correctly (depicted by the cyan boxes).
This next image shows where the model missed a few helmets (depicted by red boxes) and made wrong predictions (depicted by blue boxes).
To identify where and why a model is underperforming, it’s imperative to calculate the precision, recall, and F1-score for each frame and for the overall play. We got a precision of 0.97, recall of 0.93, and F1-score of 0.95 for the overall play, which definitely doesn’t provide the full picture of errors in a team play context. The following plot shows the number of false positives and false negatives on the right y-axis and the precision and recall on the left y-axis against the individual frame number. It’s clear that our model did an excellent job overall except in the frames between approximately 100–300, where tackling typically happens in football plays. Unfortunately, most impacts or collisions happen in these frame ranges, and therefore we dug deeper into the error cases.
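For reference, the per-frame and overall scores follow from the true positive, false positive, and false negative counts as sketched below. The function name and the per-frame counts are made up for illustration; only the formulas are standard.

```python
def detection_scores(tp, fp, fn):
    """Return (precision, recall, f1) from detection counts, guarding against empty frames."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical (tp, fp, fn) counts for three frames of one play
frame_counts = [(20, 0, 0), (18, 1, 2), (15, 2, 5)]

# Scores for each frame individually
per_frame = [detection_scores(*counts) for counts in frame_counts]

# Scores for the whole play: sum the counts first, then apply the same formulas
total_tp = sum(c[0] for c in frame_counts)
total_fp = sum(c[1] for c in frame_counts)
total_fn = sum(c[2] for c in frame_counts)
overall = detection_scores(total_tp, total_fp, total_fn)

for frame_id, scores in enumerate(per_frame, start=1):
    print(frame_id, scores)
print("overall", overall)
```

The F1-score is the harmonic mean of precision and recall, which is consistent with the figures reported above: 2 × 0.97 × 0.93 / (0.97 + 0.93) ≈ 0.95. A high overall score can still hide individual frames with much lower recall, such as the third frame in this toy example.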
The following plot is a stacked bar representation of true positives (green area), false negatives (red area), and false positives (blue area) against individual frame numbers. The black bold line represents the total number of ground truth helmets in each frame. The dotted vertical black line represents the snap frame. An ideal helmet detector should detect each and every helmet in each frame, thereby covering the entire area with green. However, as you can see in the visualization, our model had limitations, which are clearly depicted both qualitatively and quantitatively in the visualization.
Therefore, this novel visualization gives us a tool to distinguish between an excellent helmet detector and a perfect helmet detector. It also provides a quick visual summary that allows us to compare the performance of the detector in different plays and quickly identify the temporal location and type of error the models are propagating. This can further be leveraged to assess improved helmet detector models after retraining.
To improve the helmet detector model, we could retrain it after adding frames that are harder to detect to the training set, train for more epochs, apply hyperparameter tuning, implement additional augmentation techniques, or incorporate other modeling strategies. At every step, we can use this stacked bar plot as a tool to assess the model quality from a team game perspective because it provides a visual summary that depicts where and how models are failing to perform against a perfect benchmark.
Prerequisites
To reproduce this analysis in your own environment, you must complete the following prerequisites:
It’s recommended to use an instance with GPU support, for example ml.p3.2xlarge. The EBS volume size should be around 50 GB in order to store all necessary data.
- Download the data from Kaggle using the Kaggle API.
Refer to the API credentials to retrieve and save the kaggle.json file on SageMaker within /home/ec2-user/.kaggle. For security reasons, make sure to restrict the file permissions so that other users can’t read your credentials. See the following code:
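pip install kaggle
# -p avoids an error if the directory already exists
mkdir -p /home/ec2-user/.kaggle
mv kaggle.json /home/ec2-user/.kaggle
# make the credentials file readable and writable by the owner only
chmod 600 /home/ec2-user/.kaggle/kaggle.json
kaggle competitions download -c nfl-impact-detection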
Building the helmet detection model
The following code snippet shows the custom dataset class for helmets:
import os

import cv2
import numpy as np
import torch
from torch.utils.data import Dataset

# Note: `args` is assumed to hold the parsed command-line arguments at module level

class DatasetHelmet(Dataset):
    def __init__(self, marking, image_ids, transforms=None, test=False):
        super().__init__()
        self.image_ids = image_ids
        self.marking = marking
        self.transforms = transforms
        self.test = test

    def __getitem__(self, index: int):
        image_id = self.image_ids[index]
        image, boxes, labels = self.load_image_and_boxes(index)
        num_boxes = len(boxes)
        if num_boxes > 0:
            target = {}
            new_boxes = torch.as_tensor(boxes, dtype=torch.float32)
            # there is only one class
            labels = torch.ones((num_boxes,), dtype=torch.int64)
            # area = height * width of each (x1, y1, x2, y2) box
            area = (new_boxes[:, 3] - new_boxes[:, 1]) * (new_boxes[:, 2] - new_boxes[:, 0])
            # suppose all instances are not crowd
            iscrowd = torch.zeros((num_boxes,), dtype=torch.int64)
            target['boxes'] = new_boxes
            target['labels'] = labels
            target['image_id'] = torch.tensor([index])
            target["area"] = area
            target["iscrowd"] = iscrowd
        else:
            # image without annotations
            target = {}
        if self.transforms is not None:
            image, target = self.transforms(image, target)
        return image, target

    def __len__(self) -> int:
        return self.image_ids.shape[0]

    def load_image_and_boxes(self, index):
        image_id = self.image_ids[index]
        TRAIN_ROOT_PATH = os.path.join(args.train, "images")
        # read the image as BGR, convert to RGB, and scale pixel values to [0, 1]
        image = cv2.imread(f'{TRAIN_ROOT_PATH}/{image_id}', cv2.IMREAD_COLOR).copy().astype(np.float32)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB).astype(np.float32)
        image /= 255.0
        records = self.marking[self.marking['image'] == image_id]
        boxes = records[['left', 'top', 'width', 'height']].values
        # convert (left, top, width, height) to (x1, y1, x2, y2)
        boxes[:, 2] = boxes[:, 0] + boxes[:, 2]
        boxes[:, 3] = boxes[:, 1] + boxes[:, 3]
        labels = records['label'].values
        return image, boxes, labels
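The dataset class above converts each annotation from (left, top, width, height) to the (x1, y1, x2, y2) corner format and then computes the box area. The following standalone sketch shows that conversion on a single made-up box; the helper names are hypothetical and are not part of the original code.

```python
def ltwh_to_xyxy(box):
    """Convert (left, top, width, height) to corner format (x1, y1, x2, y2)."""
    left, top, width, height = box
    return (left, top, left + width, top + height)


def box_area(box):
    """Area of a box given in (x1, y1, x2, y2) corner format."""
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1)


# Made-up annotation: a 30x40 pixel helmet whose top-left corner is at (10, 20)
annotation = (10, 20, 30, 40)
corners = ltwh_to_xyxy(annotation)
print(corners, box_area(corners))
```

The area computed from the corner format equals width times height of the original annotation, which is a quick sanity check after the conversion.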
The following code shows the main training function:
import logging
import os

import numpy as np
import pandas as pd
import torch

# get_transform, get_model, train_one_epoch, evaluate, and utils_torchvision
# are helpers from the project code and are not shown in this post

def main(args):
    # Read images label csv file
    image_labels = pd.read_csv('/home/ec2-user/SageMaker/helmet_detection/input/image_labels.csv')
    # Split annotations into train and validation
    np.random.seed(0)
    image_names = np.random.permutation(image_labels.image.unique())
    valid_image_len = int(len(image_names)*0.2)
    images_valid = image_names[:valid_image_len]
    images_train = image_names[valid_image_len:]
    logging.info(f"images_valid {images_valid},\nimages_train {images_train}")
    # Define train and validation datasets and data loaders
    TRAIN_ROOT_PATH = args.train
    train_dataset = DatasetHelmet(
        image_ids=images_train,
        marking=image_labels,
        transforms=get_transform(train=True),
        test=False,
    )
    validation_dataset = DatasetHelmet(
        image_ids=images_valid,
        marking=image_labels,
        transforms=get_transform(train=False),
        test=True,
    )
    data_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=args.batch_size, shuffle=True, num_workers=1,
        collate_fn=utils_torchvision.collate_fn
    )
    data_loader_valid = torch.utils.data.DataLoader(
        validation_dataset, batch_size=args.batch_size, shuffle=False, num_workers=1,
        collate_fn=utils_torchvision.collate_fn
    )
    print(f"We have {len(train_dataset)} images for training and {len(validation_dataset)} for validation")
    # Set up model
    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
    ## Our dataset has two classes only - helmet and not helmet
    num_classes = 2
    ## Get the model using our helper function
    model = get_model(num_classes)
    print("Loaded model")
    # Set up training
    start_epoch = 0
    end_epoch = args.epochs
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=0.005,
                                momentum=0.9, weight_decay=0.0005)
    lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                                   step_size=3,
                                                   gamma=0.1)
    print("Loaded model parameters")
    ## if retraining from a checkpoint file
    if args.retrain:
        checkpoint = torch.load(os.path.join(args.model_dir, "model_checkpoint.pt"))
        model.load_state_dict(checkpoint['model_state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        start_epoch = checkpoint['epoch'] + 1
        end_epoch = start_epoch + args.epochs
        print('\nLoaded checkpoint from epoch %d.\n' % start_epoch)
        print(start_epoch, end_epoch)
    # Train model
    loss_epoch = []
    for epoch in range(start_epoch, end_epoch):
        # train for one epoch, printing every 1 iterations
        print(f"Training epoch {epoch}")
        train_one_epoch(model, optimizer, data_loader, data_loader_valid, device, epoch, loss_epoch, print_freq=1)
        # update the learning rate
        lr_scheduler.step()
        # evaluate on the test dataset
        evaluate(model, data_loader_valid, device=device, print_freq=1)
        # save checkpoint model after each epoch
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict()
        }, os.path.join(args.model_dir, "model_checkpoint.pt"))
    # Save final model
    torch.save(model.state_dict(), os.path.join(args.model_dir, "model_helmet_frcnn.pt"))
    # Save the per-epoch training and validation losses
    loss_df = pd.DataFrame(loss_epoch, columns=["train_loss", "val_loss"])
    loss_df.reset_index(inplace=True)
    loss_df = loss_df.rename(columns={'index': 'Epoch'})
    print(loss_df)
    loss_df.to_csv(os.path.join(args.model_dir, "loss_epoch.csv"), index=False, header=True)
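To make the learning rate schedule in the training function concrete, the following sketch reproduces the step decay in plain Python using the same settings (initial rate 0.005, step size 3, gamma 0.1). The function name is hypothetical, and the sketch assumes training starts from epoch 0 with one scheduler step per epoch, as in the loop above.

```python
def step_lr(initial_lr, epoch, step_size=3, gamma=0.1):
    """Learning rate used during a zero-indexed epoch under step decay."""
    # The rate is multiplied by gamma once for every completed block of step_size epochs
    return initial_lr * gamma ** (epoch // step_size)


# Learning rate for each of the 10 training epochs
schedule = [step_lr(0.005, epoch) for epoch in range(10)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.1e}")
```

With these settings, the rate drops by a factor of 10 every three epochs, so the last of the 10 epochs trains with a rate 1,000 times smaller than the first.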
Evaluating helmet detection model
Use the saved model to run predictions on an entire play. The following code is an example function to run evaluations:
import os

import cv2
import pandas as pd

# ObjectDetector and FramesDataset are helpers from the project code and are not shown in this post

def run_detection_eval_video(video_in, gtfile_name, model_path, full_video=True, subset_video=60, conf_thres=0.9, iou_threshold = 0.5):
    """ Run detection on video
    Args:
        video_in: Input video path
        gtfile_name: Ground Truth annotation json file name
        model_path: Location of the pretrained model.pt
        full_video: Bool to indicate whether to run the whole video, default = True
        subset_video: Number of frames to run detection on
        conf_thres = Only consider detections with score higher than conf_thres, default = 0.9
        iou_threshold = Match detection with ground truth if iou is higher than iou_threshold, default = 0.5
    Returns:
        Predicted detection for all the frames in a video, evaluation for detection, a dataframe with bounding boxes for false negatives and false positives
        df_predictions (pandas.DataFrame): prediction of detected object for all frames
            with columns ['frame_id', 'class_id', 'score', 'x1', 'y1', 'x2', 'y2']
        eval_results (pandas.DataFrame): Count of total number of objects in gt and det, and tp, fn, fp for all frames
            with columns ['frame_id', 'num_object_gt', 'num_object_det', 'tp', 'fn', 'fp']
        fns (pandas.DataFrame): False negative records in a Pandas Dataframe for all frames
            with columns ['frame_id','class_id','x1','y1','x2','y2'],
            return empty if no false negatives
        fps (pandas.DataFrame): False positive records in a Pandas Dataframe for all frames
            with columns ['frame_id','class_id', 'score', 'x1','y1','x2','y2'],
            return empty if no false positives
    """
    # Capture the input video
    vid = cv2.VideoCapture(video_in)
    # Get video title
    vid_title = os.path.splitext(os.path.basename(video_in))[0]
    # Get total number of frames
    num_frames = vid.get(cv2.CAP_PROP_FRAME_COUNT)
    # load model
    num_classes = 2
    model = ObjectDetector.load_custom_model(model_path=model_path, num_classes=num_classes)
    print("Pretrained model loaded")
    # Get GT annotations
    gt_labels = pd.read_csv('/home/ec2-user/SageMaker/helmet_detection/input/train_labels.csv')
    video = os.path.basename(video_in)
    print("Processing video: ",video)
    labels = gt_labels[gt_labels['video']==video]
    # if running for the whole video, then change the size of subset_video with total number of frames
    if full_video:
        subset_video = int(num_frames)
    df_predictions = [] # predictions for whole video
    eval_results = [] # detection evaluations for the whole video
    fns = [] # false negative detections for the whole video
    fps = [] # false positive detections for the whole video
    for i in range(subset_video):
        ret, frame = vid.read()
        # stop early if the frame could not be read (for example, at the end of the video)
        if not ret:
            break
        print("Processing frame#: {} running detection and evaluation for videos".format(i+1))
        # Get detection for this frame
        list_frame = [frame]
        dataset_frame = FramesDataset(list_frame)
        prediction = ObjectDetector.run_detection(dataset_frame, model)
        df_prediction = ObjectDetector.to_dataframe_highconf(prediction, conf_thres, i)
        df_predictions.append(df_prediction)
        # Get label for this frame
        cur_label = labels[labels['frame']==i+1] # get this frame's record
        cur_boxes = cur_label[['left','width','top','height']].values
        gt = ObjectDetector.get_gt_frame(i+1, cur_boxes)
        # Evaluate detection for this frame
        eval_result, fn, fp = ObjectDetector.evaluate_detections_iou(gt, df_prediction, iou_threshold)
        eval_results.append(eval_result)
        if fn is not None:
            fns.append(fn)
        if fp is not None:
            fps.append(fp)
    # Release the video handle once all frames are processed
    vid.release()
    # Concatenate predictions, evaluation results, fns and fps for all frames of the video
    df_predictions = pd.concat(df_predictions)
    eval_results = pd.concat(eval_results)
    # Concatenate fns if not empty, otherwise create an empty dataframe
    if not fns:
        fns = pd.DataFrame()
    else:
        fns = pd.concat(fns)
    # Concatenate fps if not empty, otherwise create an empty dataframe
    if not fps:
        fps = pd.DataFrame()
    else:
        fps = pd.concat(fps)
    return df_predictions, eval_results, fns, fps
After we have evaluation results saved in a Pandas DataFrame, we can use the following code snippet to plot the stacked bar figure we described earlier:
pal = ["g","r","b"]
plt.figure(figsize=(12,8))
plt.stackplot(eval_det['frame_id'], eval_det['tp'], eval_det['fn'], eval_det['fp'],
labels=['TP','FN','FP'], colors=pal)
plt.plot(eval_det['frame_id'], eval_det['num_object_gt'], color='k', linewidth=6, label='Total Helmets')
plt.legend(loc='best', fontsize=12)
plt.xlabel('Frame ID', fontsize=12)
plt.ylabel(' # of TPs, FNs, FPs', fontsize=12)
plt.axvline(x=snap_time, color='k', linestyle='--')
plt.savefig('/home/ec2-user/SageMaker/helmet_detection/output/stacked.png')
Conclusion
In this post, we showed how we used Amazon SageMaker to build a helmet detector model, ran error analysis on a team play context, and improved the detector model with better precision in the frames where it matters the most. With the visualization tool that we created, we could qualitatively and quantitatively assess the model accuracy in the entire play context. Furthermore, we could introduce additional training images and improve the model accuracy as depicted by both traditional state-of-the-art object detector metrics and our custom visualization.
With a near-perfect helmet detector model, our team is ready for the next step, which is tracking the players on the ground and detecting impacts using computer vision techniques. This will be discussed in a future post.
Readers are welcome to check out the Kaggle competition website and should be able to reproduce the results presented here with the code included in the post.
About the Authors
Sam Huddleston is a Sr. Data Scientist at Biocore LLC, who serves as the Technology Lead for the NFL’s Digital Athlete program. Biocore is a team of world-class engineers based in Charlottesville, Virginia, that provides research, testing, biomechanics expertise, modeling and other engineering services to clients dedicated to the understanding and reduction of injury.
Jayeeta Ghosh is a Data Scientist who works on AI/ML projects for AWS customers and helps solve customer business problems across industries using deep learning and cloud expertise.