Red Teaming Language Models with Language Models
In our recent paper, we show that it is possible to automatically find inputs that elicit harmful text from language models by generating inputs using language models themselves. Our approach provides one tool for finding harmful model behaviours before users are impacted, though we emphasize that it should be viewed as one component alongside many other techniques that will be needed to find harms and mitigate them once found.
An explorer in the sprawling universe of possible chemical combinations
The direct conversion of methane gas to liquid methanol at the site where it is extracted from the Earth holds enormous potential for addressing a number of significant environmental problems. Developing a catalyst for that conversion has been a critical focus for Associate Professor Heather Kulik and the lab she directs at MIT.
As important as that research is, however, it is just one example of the innumerable possibilities of Kulik’s work. Ultimately, her focus is far broader, the scope of her exploration infinitely more vast.
“All of our research is dedicated toward the same practical goal,” she says. “Namely, we aim to be able to predict and understand using computational tools why catalysts or materials behave the way they do so that we can overcome limitations in present understanding or existing materials.”
Simply put, Kulik wants to apply novel simulation and machine-learning technologies she and her lab have developed to rapidly investigate the sprawling world of possible chemical combinations. In the process, the team is mapping out how chemical structures relate to chemical properties, in order to create new materials tailored to particular applications.
“Once you realize the sheer scale of how many materials we could or should be studying to solve outstanding problems, you realize the only way to make a dent is to do things at a larger and faster scale than has ever been done before,” Kulik says. “Thanks to both machine-learning models and heterogeneous computing that has accelerated first-principles modeling, we are now able to start asking and answering questions that we could never have addressed before.”
Despite Kulik’s many awards and consistent recognition for her research, the New Jersey native was not always destined to be a scientist. Her parents were not particularly interested in math and science and, although she was mathematically precocious and did arithmetic as a toddler and college-level classes in middle school, she pursued other interests into her teens, including creative writing, graphic design, art, and photography.
Majoring in chemical engineering at the Cooper Union, Kulik says she wanted to occupy her mind, do something useful, and “make an okay living.” Chemical engineering was one of the highest-paying professions for undergraduates, she says.
The first thing she remembers hearing about graduate school was from a teaching assistant in her undergraduate physics class, who explained that being in academia meant “not having a real job until you’re at least 30” and working long hours.
“I thought that sounded like a terrible idea!” Kulik says.
Luckily, some of her classroom experiences at the Cooper Union, as well as encouragement from her quantum mechanics professor, Robert Topper, led her toward research.
“While I wanted to be useful, I kept being drawn to these fundamental questions of how knowing where the atoms and electrons were located explained the world around us,” she says. “Ultimately, I obtained my PhD in computational materials science to become a scientist who works with electrons every day for that reason. Since what I do hardly ever feels like a chore, I now have a greater appreciation for the fact that this path allowed me to ‘not have a real job.’”
Kulik credits MIT professor of chemistry and biology Cathy Drennan, whom Kulik collaborated with during graduate school, with “helping me see past the short-term barriers that come up in academia” and “showing me what a career in science could look like.” She also mentions Nicola Marzari, her PhD advisor, then an associate professor in MIT’s Department of Materials Science and Engineering, and her postdoc advisor at Stanford University, Todd Martinez, “who gave me a glimpse of what an independent career might look like.”
Kulik works hard to pass on her ethics and her ideas about work-life balance to students in her lab, and she teaches them to rely on each other, referring to the group as a “tight-knit community all with the same goals.” Twice a month, she holds meetings at which she encourages students to share how they have come up with solutions when working through research problems. “We can each see and learn from different problem-solving strategies others in the group have tried and help each other out along the way.”
She also encourages a light atmosphere. The lab’s web page says its members “embrace very #random (but probably fairly uncool) jokes in our Slack channels. We are computational researchers after all!”
“We like to keep it lighthearted,” Kulik says.
Nonetheless, Kulik and her lab have achieved major breakthroughs, including making the setup of multiscale simulations in computational chemistry more systematic while exponentially accelerating the process of materials discovery. Over the years, the lab has developed and honed an open-source code called molSimplify, which researchers can use to build and simulate new compounds. Combined with machine-learning models, the automated method enabled by the software has led to “structure-property maps” that explain why materials behave as they do, in a more comprehensive manner than was ever before possible.
For her efforts, Kulik has won grants from the MIT Energy Initiative, a Burroughs Wellcome Fund Career Award at the Scientific Interface, the American Chemical Society OpenEye Outstanding Junior Faculty Award, an Office of Naval Research Young Investigator Award, a DARPA Young Faculty Award and Director’s Fellowship, the AAAS Marion Milligan Mason Award, the Physical Chemistry B Lectureship, and a CAREER award from the National Science Foundation, among others. This year, she was named a Sloan Research Fellow and was granted tenure.
When not hard at work on her next accomplishment, Kulik enjoys listening to music and taking walks around Cambridge and Boston, where she lives in the Beacon Hill neighborhood with her partner, who was a fellow graduate student at MIT.
Each year for the past three to four years, Kulik has spent at least two weeks on a wintertime vacation in a sunny climate.
“I reflect on what I’ve been doing at work as well as what my priorities might be both in life and in work in the upcoming year,” she says. “This helps to inform any decisions I make about how to prioritize my time and efforts each year and helps me to make sure I’ve put everything in perspective.”
CAIT announces new fellowships, faculty research awards
The Columbia Center of AI Technology announced its inaugural recipients last year.
Josh Miele: Amazon’s resident MacArthur Fellow
Miele has merged a lifelong passion for science with a mission to make the world more accessible for people with disabilities.
2 Powerful 2 Be Stopped: ‘Dying Light 2 Stay Human’ Arrives on GeForce NOW’s Second Anniversary
Great things come in twos. Techland’s Dying Light 2 Stay Human arrives with RTX ON and is streaming from the cloud tomorrow, Feb. 4.
Plus, in celebration of the second anniversary of GeForce NOW, February is packed full of membership rewards in Eternal Return, World of Warships and more. There are also 30 games joining the GeForce NOW library this month, with four streaming this week.
Take a Bite Out of Dying Light 2
It’s the end of the world. Get ready for Techland’s Dying Light 2 Stay Human, releasing on Feb. 4 on GeForce NOW with RTX ON.
Civilization has fallen to the virus, and The City, one of the last large human sanctuaries, is torn by conflict in a world ravaged by the infected dead. You are a wanderer with the power to determine the fate of The City in your search to learn the truth.
Struggle to survive by using clever thinking, daring parkour, resourceful traps and creative weapons to make helpful allies and overcome enemies with brutal combat. Discover the secrets of a world thrown into darkness and make decisions that determine your destiny. There’s one thing you can never forget — stay human.
We’ve teamed up with Techland to bring real-time ray tracing to Dying Light 2 Stay Human, delivering the highest level of realistic lighting and depth into every scene of the new game.
Dying Light 2 Stay Human features a dynamic day-night cycle that is enhanced with the power of ray tracing, including fully ray-traced lighting throughout. Ray-traced global illumination, reflections and shadows bring the world to life with significantly better diffused lighting and improved shadows, as well as more accurate and crisp reflections.
“We’ve been working with NVIDIA to expand the world of Dying Light 2 with ray-tracing technology so that players can experience our newest game with unprecedented image quality and more immersion than ever,” said Tomasz Szałkowski, rendering director at Techland. “Now, gamers can play Dying Light 2 Stay Human streaming on GeForce NOW to enjoy our game in the best way possible and exactly as intended with the RTX 3080 membership, even when playing on underpowered devices.”
Enter the dark ages and stream Dying Light 2 Stay Human from both Steam and Epic Games Store on GeForce NOW at launch tomorrow.
An Anniversary Full of Rewards
Happy anniversary, members. We’re celebrating GeForce NOW turning two with three rewards for members, including in-game content for Eternal Return, World of Warships and more.
Check in on GFN Thursdays throughout February for updates on the upcoming rewards, and make sure you’re opted in by checking the box for Rewards in the GeForce NOW account portal.
Thanks to the cloud, this past year, over 15 million members around the world have streamed nearly 180 million hours of their favorite games like Apex Legends, Rust and more from the ever-growing GeForce NOW library.
Members have played with the power of the new six-month RTX 3080 memberships, delivering cinematic graphics with RTX ON in supported games like Cyberpunk 2077 and Control. They’ve also experienced gaming with ultra-low latency and maximized eight-hour session lengths across their devices.
It’s All Fun and Games in February
The party doesn’t stop there. It’s also the first GFN Thursday of the month, which means a whole month packed full of games.
Gear up for the 30 new titles coming to the cloud in February, with four games ready to stream this week:
- Life is Strange Remastered (New release on Steam, Feb. 1)
- Life is Strange: Before the Storm Remastered (New release on Steam, Feb. 1)
- Dying Light 2 Stay Human (New release on Steam and Epic Games Store, Feb. 4)
- Warm Snow (Steam)
Also coming in February:
- Werewolf: The Apocalypse – Earthblood (New release on Steam, Feb. 7)
- Sifu (New release on Epic Games Store, Feb. 8)
- Diplomacy is Not an Option (New release on Steam, Feb. 9)
- SpellMaster: The Saga (New release on Steam, Feb. 16)
- Destiny 2: The Witch Queen Deluxe Edition (New release on Steam, Feb. 22)
- SCP: Pandemic (New release on Steam, Feb. 22)
- Martha is Dead (New release on Steam and Epic Games Store, Feb. 24)
- Ashes of the Singularity: Escalation (Steam)
- AWAY: The Survival Series (Epic Games Store)
- Citadel: Forged With Fire (Steam)
- Escape Simulator (Steam)
- Galactic Civilizations III (Steam)
- Haven (Steam)
- Labyrinthine Dreams (Steam)
- March of Empires (Steam)
- Modern Combat 5 (Steam)
- Parkasaurus (Steam)
- People Playground (Steam)
- Police Simulator: Patrol Officers (Steam)
- Sins of a Solar Empire: Rebellion (Steam)
- Train Valley 2 (Steam)
- TROUBLESHOOTER: Abandoned Children (Steam)
- Truberbrook (Steam and Epic Games Store)
- Two Worlds Epic Edition (Steam)
- Valley (Steam)
- The Vanishing of Ethan Carter (Epic Games Store)
We make every effort to launch games on GeForce NOW as close to their release as possible, but, in some instances, games may not be available immediately.
Extra Games From January
As the cherry on top of the games announced in January, an extra nine made it to the cloud. Don’t miss any of these titles that snuck their way onto GeForce NOW last month:
- Assassin’s Creed III Deluxe Edition (Ubisoft Connect)
- Blacksmith Legends (Steam)
- Daemon X Machina (Epic Games Store)
- Expeditions: Rome (Steam and Epic Games Store)
- Galactic Civilizations 3 (Epic Games Store)
- Hitman 3 (Steam)
- Supraland Six Inches Under (Steam)
- Tropico 6 (Epic Games Store)
- WARNO (Steam)
With another anniversary for the cloud and all of these updates, there’s never too much fun. Share your favorite GeForce NOW memories from the past year and talk to us on Twitter.
so we thought you should know that…
today is 2/2/22
2⃣morrow we kick off celebrations for our 2-year anniversary
on Friday, @DyingLight 2 launches in the cloud… and yes, we’ve been dying 2 tell you all of this
— NVIDIA GeForce NOW (@NVIDIAGFN) February 2, 2022
Rain or Shine: Radar Vision Sees Through Clouds to Support Emergency Flood Relief
Flooding usually comes with various bad weather conditions, such as thick clouds, heavy rain and blustering winds.
GPU-powered data science systems can now help researchers and emergency flood response teams to see through it all.
John Murray, visiting professor in the Geographic Data Science Lab at the University of Liverpool, developed cuSAR, a platform that can monitor ground conditions using radar data from the European Space Agency.
cuSAR uses the satellite data to create images that portray accurate geographic information about what’s happening on the ground, below the bad weather conditions.
To create the radar vision platform, Murray used the NVIDIA RAPIDS suite of software libraries and the CUDA parallel computing platform, as well as NVIDIA GPUs.
Emergency Flood Response
The platform was originally designed for the property insurance sector, as mortgage and insurance providers need to assess risks that affect properties, including flooding.
Using satellite data in this way requires clear visuals of the ground, but obtaining analyzable images meant potentially waiting weeks for breaks in Britain’s infamous cloud cover. With cuSAR, users can gain insights in near real time.
Use cases for the radar vision platform have now expanded to the safety sector.
The North Wales Regional Emergency Planning Service first contacted the Geographic Data Science Lab for help with serious flooding that occurred in the Dee Valley a couple years ago. Low, dense clouds hung over the valley, making it impossible for the team to fly helicopters. And drones weren’t able to give a sufficient overview of how the floodplains along the river were behaving.
Using the NVIDIA GPU-powered image analysis platform, Murray was able to provide high-quality renders of the affected areas each day of the flooding. The emergency planning service used this information to allocate its limited resources to critical areas, adjusting its efforts as the flooding progressed.
Last year, the lab provided radar data to monitor a vaccine factory under threat from rising water levels. This time, weather conditions allowed emergency response teams to send helicopters to the exact locations from which to best combat the flooding.
Correcting a Distorted View
Creating analyzable images from radar data is no simple task.
Due to the earth’s curvature, the perspective of satellite images is distorted. This distortion needs to be mathematically corrected and overlaid with location data, using a process called rubbersheeting, for precise geolocation.
A typical radar manifest contains half a billion data points, presented as a grid.
“You can’t just take radar data and make an image from it,” said Murray. “There’s a lot of processing and math involved, and that’s where the GPUs come in.”
Murray wrote the code for cuSAR using NVIDIA RAPIDS and Python Numba CUDA, which matches the radar and location data seamlessly.
Traditional Java or Python code would usually take around 40 minutes to provide an output. Backed by an NVIDIA GPU, it takes only four seconds.
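As a rough illustration (this is not cuSAR’s actual code; the affine transform, coefficients, and grid size are placeholder assumptions), a Numba CUDA kernel that maps each cell of a radar grid to corrected ground coordinates might look like the following:

```python
import numpy as np
from numba import cuda

@cuda.jit
def rubbersheet_kernel(coeffs_x, coeffs_y, out_x, out_y):
    # One thread per grid cell: map (row, col) indices to corrected
    # ground coordinates. A real rubbersheeting transform is fitted to
    # ground control points and uses higher-order terms; this affine
    # form is only a placeholder.
    i, j = cuda.grid(2)
    if i < out_x.shape[0] and j < out_x.shape[1]:
        out_x[i, j] = coeffs_x[0] + coeffs_x[1] * i + coeffs_x[2] * j
        out_y[i, j] = coeffs_y[0] + coeffs_y[1] * i + coeffs_y[2] * j

rows, cols = 4096, 4096  # stand-in for a radar manifest's grid
coeffs_x = cuda.to_device(np.array([0.0, 10.0, 0.1]))
coeffs_y = cuda.to_device(np.array([0.0, 0.1, 10.0]))
out_x = cuda.device_array((rows, cols))
out_y = cuda.device_array((rows, cols))

threads = (16, 16)
blocks = ((rows + threads[0] - 1) // threads[0],
          (cols + threads[1] - 1) // threads[1])
rubbersheet_kernel[blocks, threads](coeffs_x, coeffs_y, out_x, out_y)
corrected_x, corrected_y = out_x.copy_to_host(), out_y.copy_to_host()
```

Because every cell is independent, the GPU can process the half-billion points of a typical manifest in parallel, which is where the minutes-to-seconds speedup comes from.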
Once the data has been processed, the platform outputs an image with accurate geographic information that corresponds to Ordnance Survey grid coordinates.
Within 15 minutes of receiving the satellite data, it can be placed in the hands of the emergency relief teams, giving them the knowledge to effectively react to a rapidly evolving situation on the ground.
Flood Protection for the Future
In the last decade, the U.K. has seen several of its wettest months on record. Notably, 2020 was the first year on record that fell in the top 10 for all three key weather rankings — warmest, wettest and sunniest. The Met Office predicts that severe flash flooding could be nearly five times more likely in 50 years’ time.
Technology like cuSAR enables researchers and emergency responders to monitor and react to disasters in a timely manner, protecting homes and businesses that are most vulnerable to worsening weather conditions.
Learn more about technology breakthroughs at GTC, running March 21-24.
Feature image courtesy of Copernicus Sentinel data, processed by ESA CC BY-SA 3.0.
Announcing the launch of the model copy feature for Amazon Comprehend custom models
Technology trends and advancements in digital media in the past decade or so have resulted in the proliferation of text-based data. The potential benefits of mining this text to derive insights, both tactical and strategic, are enormous. The field that enables this is natural language processing (NLP). You can use NLP, for example, to analyze your product reviews for customer sentiments, train a custom entity recognizer model to identify product types of interest based on customer comments, or train a custom text classification model to determine the most popular product categories.
Amazon Comprehend is an NLP service with ready-made intelligence to extract insights about the content of documents. It develops insights by recognizing the entities, key phrases, language, sentiments, and other common elements in a document. Amazon Comprehend Custom uses automatic machine learning (Auto ML) to build NLP models on your behalf using your own data. This enables you to detect entities unique to your business or classify text or documents as per your requirements. Additionally, you can automate your entire NLP workflow with easy-to-use APIs.
Today we’re happy to announce the launch of the Amazon Comprehend custom model copy feature, which allows you to automatically copy your Amazon Comprehend custom models from a source account to designated target accounts in the same Region without requiring access to the datasets that the model was trained and evaluated on. Starting today, you can use the AWS Management Console, AWS Command Line Interface (AWS CLI), or the boto3 APIs (Python SDK for AWS) to copy trained custom models from a source account to a designated target account. This new feature is available for both Amazon Comprehend custom classification and custom entity recognition models.
Benefits of the model copy feature
This new feature has the following benefits:
- Multi-account MLOps strategy – Train a model one time and ensure predictable deployment in multiple environments in different accounts.
- Faster deployment – You can quickly copy a trained model between accounts, avoiding the time taken to retrain in every account.
- Protect sensitive datasets – Now you no longer need to share the datasets between different accounts or users. The training data needs to be available only on the account where the training is done. This is very important for certain industries like financial services, where data isolation and sandboxing are essential to meet regulatory requirements.
- Easy collaboration – Partners or vendors can now easily train in Amazon Comprehend Custom and share the models with their customers.
How model copy works
With the new model copy feature, you can copy custom models between AWS accounts in the same Region in a two-stage process. First, a user in one AWS account (account A), shares a custom model that’s in their account. Then, a user in another AWS account (account B) imports the model into their account.
Share a model
To share a custom model in account A, the user attaches an AWS Identity and Access Management (IAM) resource-based policy to a model version. This policy authorizes an entity in account B, such as an IAM user or role, to import the model version into Amazon Comprehend in their AWS account. You can configure a resource-based policy either through the console or with the Amazon Comprehend custom PutResourcePolicy API.
Import a model
To import the model to account B, the user of this account provides Amazon Comprehend with the necessary details, such as the Amazon Resource Name (ARN) of the model. When they import the model, this user creates a new custom model in their AWS account that replicates the model that they imported. This model is fully trained and ready for inference jobs, such as document classification or named entity recognition. If the model is encrypted with an AWS Key Management Service (AWS KMS) key in the source, then the service role specified while importing the model needs to have access to the KMS key in order to decrypt the model during import. The target account can also specify a KMS key to encrypt the model during import. The importing of the shared model is also available both on the console and as an API.
Solution overview
To demonstrate the functionality of the model copy feature, we show you how to train, share, and import an Amazon Comprehend custom entity recognition model using both the Amazon Comprehend console and the AWS CLI. For this demonstration, we use two different accounts. The steps are applicable to Amazon Comprehend custom classification as well. The required steps are as follows:
- Train an Amazon Comprehend custom entity recognition model in the source account.
- Define the IAM resource policy for the trained model to allow cross-account access.
- Copy the trained model from the source account to the target account.
- Test the copied model through a batch job.
Train an Amazon Comprehend custom entity recognition model in the source account
The first step is to train an Amazon Comprehend custom entity recognition model in the source account. As an input dataset for the training, we use a CSV entity list and training documents for recognizing AWS service offerings in a given document. Make sure that the entity list and training documents are in an Amazon Simple Storage Service (Amazon S3) bucket in the source account. For instructions, see Adding Documents to Amazon S3.
Create an IAM role for Amazon Comprehend and provide required access to the S3 bucket with the training data. Note the role ARN and S3 bucket paths to use in later steps.
Train a model with the AWS CLI
Create an entity recognizer using the following AWS CLI command. Substitute your parameters for the S3 paths, IAM role, and Region. The response returns the EntityRecognizerArn.
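A sketch of the command (account IDs, bucket names, and role names are placeholders):

```bash
aws comprehend create-entity-recognizer \
    --recognizer-name aws-offering-recognizer \
    --language-code en \
    --data-access-role-arn arn:aws:iam::111111111111:role/ComprehendDataAccessRole \
    --input-data-config '{
        "EntityTypes": [{"Type": "AWS_OFFERING"}],
        "Documents": {"S3Uri": "s3://DOC-EXAMPLE-BUCKET/training-docs/"},
        "EntityList": {"S3Uri": "s3://DOC-EXAMPLE-BUCKET/entity-list.csv"}
    }' \
    --region us-east-1
```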
The status of the training job can be monitored by calling describe-entity-recognizer and checking the Status in the response.
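For example (the recognizer ARN is a placeholder):

```bash
aws comprehend describe-entity-recognizer \
    --entity-recognizer-arn arn:aws:comprehend:us-east-1:111111111111:entity-recognizer/aws-offering-recognizer \
    --region us-east-1
```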
Train a model via the console
To train a model via the console, complete the following steps:
- On the Amazon Comprehend console, under Customization, create a new custom entity recognizer model.
- Provide a model name and version.
- For Language, choose English.
- For Custom entity type, add AWS_OFFERING.
To train a custom entity recognition model, you can choose one of two ways to provide data to Amazon Comprehend: annotations or entity lists. For simplicity, use the entity list method.
- For Data format, select CSV file.
- For Training type, select Using entity list and training docs.
- Provide the S3 location paths for the entity list CSV and training data.
- To grant permissions to Amazon Comprehend to access your S3 bucket, create an IAM service-linked role.
In the Resource-based policy section, you can authorize access for the model version. The accounts you grant access to can import this model into their account. We skip this step for now and add the policy after the model is trained and we’re satisfied with the model performance.
- Choose Create.
This submits your custom entity recognizer for training, during which Amazon Comprehend evaluates a number of candidate models, tunes hyperparameters, and performs cross-validation to make sure that your model is robust. These are all the same activities that data scientists perform.
Define the IAM resource policy for the trained model to allow cross-account access
When we’re satisfied with the training performance, we can go ahead and share the specific model version by adding a resource policy.
Add a resource-based policy from the AWS CLI
Authorize importing the model from the target account by adding a resource policy on the model, as shown in the following code. The policy can be tightly scoped to a particular model version and target principal. Substitute your trained entity recognizer ARN and the target account you want to provide access to.
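One possible shape for that command (the ARNs and account IDs are placeholders):

```bash
aws comprehend put-resource-policy \
    --resource-arn arn:aws:comprehend:us-east-1:111111111111:entity-recognizer/aws-offering-recognizer/version/v1 \
    --resource-policy '{
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::222222222222:root"},
            "Action": "comprehend:ImportModel",
            "Resource": "arn:aws:comprehend:us-east-1:111111111111:entity-recognizer/aws-offering-recognizer/version/v1"
        }]
    }'
```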
Add a resource-based policy via the console
When the training is complete, a custom entity recognition model version is generated. We can choose the trained model and version to view the training details, including performance of the trained model.
To update the policy, complete the following steps:
- On the Tags, VPC & Policy tab, edit the resource-based policy.
- Provide the policy name, Amazon Comprehend service principal (comprehend.amazonaws.com), target account ID, and IAM users in the target account authorized to import the model version.
We specify root as the IAM entity to authorize all users in the target account.
Copy the trained model from the source account to the target account
Now the model is trained and shared from the source account. The authorized target account user can import the model and create a copy of the model in their own account.
To import a model, you need to specify the source model ARN and service role for Amazon Comprehend to perform the copy action on your account. You can specify an optional AWS KMS ID to encrypt the model in your target account.
Import the model through AWS CLI
To import your model with the AWS CLI, enter the following code:
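A sketch with placeholder values:

```bash
aws comprehend import-model \
    --source-model-arn arn:aws:comprehend:us-east-1:111111111111:entity-recognizer/aws-offering-recognizer/version/v1 \
    --model-name aws-offering-recognizer-copy \
    --version-name v1 \
    --data-access-role-arn arn:aws:iam::222222222222:role/ComprehendImportRole \
    --region us-east-1
```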
Import the model via the console
To import the model via the console, complete the following steps:
- On the Amazon Comprehend console, under Custom entity recognition, choose Import version.
- For Model version ARN, enter the ARN for the model trained in the source account.
- Enter a model name and version for the target.
- Provide a service account role and choose Confirm to start the model import process.
After the model status changes to Imported, we can view the model details, including the performance details of the trained model.
Test the copied model through a batch job
We test the copied model in the target account by detecting custom entities with a batch job. To test the model, download the test file and place it in an S3 bucket in your target account. Create an IAM role for Amazon Comprehend and provide the required access to the S3 bucket with the test data. You use the role ARN and S3 bucket paths that you noted earlier.
When the job is complete, you can verify the inference data in the specified output S3 bucket.
Test the model with the AWS CLI
To test the model using the AWS CLI, enter the following code:
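A sketch with placeholder values:

```bash
aws comprehend start-entities-detection-job \
    --job-name test-imported-model \
    --entity-recognizer-arn arn:aws:comprehend:us-east-1:222222222222:entity-recognizer/aws-offering-recognizer-copy/version/v1 \
    --language-code en \
    --data-access-role-arn arn:aws:iam::222222222222:role/ComprehendDataAccessRole \
    --input-data-config '{"S3Uri": "s3://DOC-EXAMPLE-BUCKET/test/", "InputFormat": "ONE_DOC_PER_LINE"}' \
    --output-data-config '{"S3Uri": "s3://DOC-EXAMPLE-BUCKET/output/"}' \
    --region us-east-1
```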
Test the model via the console
To test the model via the console, complete the following steps:
- On the Amazon Comprehend console, choose Analysis jobs and choose Create job.
- For Name, enter a name for the job.
- For Analysis type, choose Custom entity recognition.
- Choose the model name and version of the imported model.
- Provide the S3 paths for the test file for the job and the output location where Amazon Comprehend stores the result.
- Choose or create an IAM role with permission to access the S3 buckets.
- Choose Create job.
When your analysis job is complete, you have JSON files in your output S3 bucket path, which you can download to verify the results of the entity recognition from the imported model.
Conclusion
In this post, we demonstrated the Amazon Comprehend custom entity model copy feature. This feature gives you the ability to train an Amazon Comprehend custom entity recognition or classification model in one account and then share the model with another account in the same Region. This simplifies a multi-account strategy where the model can be trained one time and shared between accounts within the same Region without having to retrain or share the training datasets. This allows for predictable deployment in every account as part of your MLOps workflow. For more information, see our documentation on Comprehend custom copy, or try out the walkthrough in this post either via the console or using a cloud shell with the AWS CLI.
As of this writing, the model copy feature in Amazon Comprehend is available in the following Regions:
- US East (Ohio)
- US East (N. Virginia)
- US West (Oregon)
- Asia Pacific (Mumbai)
- Asia Pacific (Seoul)
- Asia Pacific (Singapore)
- Asia Pacific (Sydney)
- Asia Pacific (Tokyo)
- EU (Frankfurt)
- EU (Ireland)
- EU (London)
- AWS GovCloud (US-West)
Give the feature a try, and please send us feedback either via the AWS forum for Amazon Comprehend or through your usual AWS support contacts.
About the Authors
Premkumar Rangarajan is an AI/ML specialist solutions architect at Amazon Web Services and has previously authored the book Natural Language Processing with AWS AI services. He has 26 years of experience in the IT industry in a variety of roles, including delivery lead, integration specialist, and enterprise architect. He helps enterprises of all sizes adopt AI and ML to solve their real-world challenges.
Chethan Krishna is a Senior Partner Solutions Architect in India. He works with Strategic AWS Partners for establishing a robust cloud competency, adopting AWS best practices and solving customer challenges. He is a builder and enjoys experimenting with AI/ML, IoT, and analytics.
Sriharsha M S is an AI/ML specialist solutions architect in the Strategic Specialist team at Amazon Web Services. He works with strategic AWS customers who are taking advantage of AI/ML to solve complex business problems. He provides technical guidance and design advice to implement AI/ML applications at scale. His expertise spans application architecture, big data, analytics, and machine learning.
Balance your data for machine learning with Amazon SageMaker Data Wrangler
Amazon SageMaker Data Wrangler is a new capability of Amazon SageMaker that makes it faster for data scientists and engineers to prepare data for machine learning (ML) applications by using a visual interface. It contains over 300 built-in data transformations so you can quickly normalize, transform, and combine features without having to write any code.
Today, we’re excited to announce new transformations that allow you to balance your datasets easily and effectively for ML model training. We demonstrate how these transformations work in this post.
New balancing operators
The newly announced balancing operators are grouped under the Balance data transform type in the ADD TRANSFORM pane.
Currently, the transform operators support only binary classification problems. In binary classification problems, the classifier is tasked with classifying each sample to one of two classes. When the number of samples in the majority class (bigger) is considerably larger than the number of samples in the minority (smaller) class, the dataset is considered imbalanced. This skew is challenging for ML algorithms and classifiers because the training process tends to be biased towards the majority class.
Balancing schemes, which augment the data to be more balanced before training the classifier, were proposed to address this challenge. The simplest balancing methods are either oversampling the minority class by duplicating minority samples or undersampling the majority class by removing majority samples. The idea of adding synthetic minority samples to tabular data was first proposed in the Synthetic Minority Oversampling Technique (SMOTE), where synthetic minority samples are created by interpolating pairs of the original minority points. SMOTE and other balancing schemes were extensively studied empirically and shown to improve prediction performance in various scenarios, as per the publication To SMOTE, or not to SMOTE.
Data Wrangler now supports the following balancing operators as part of the Balance data transform:
- Random oversampler – Randomly duplicate minority samples
- Random undersampler – Randomly remove majority samples
- SMOTE – Generate synthetic minority samples by interpolating real minority samples
Let’s now discuss the different balancing operators in detail.
Random oversample
Random oversampling involves selecting random examples from the minority class with replacement and supplementing the training data with multiple copies of those instances; a single instance may therefore be selected multiple times. With the Random oversample transform type, Data Wrangler automatically oversamples the minority class for you by duplicating the minority samples in your dataset.
Random undersample
Random undersampling is the opposite of random oversampling. This method seeks to randomly select and remove samples from the majority class, consequently reducing the number of examples in the majority class in the transformed data. The Random undersample transform type lets Data Wrangler automatically undersample the majority class for you by removing majority samples in your dataset.
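Under the hood, both random schemes are simple. A minimal sketch (not Data Wrangler’s implementation; the function names are illustrative, and both balance to a 1:1 ratio) using pandas:

```python
import pandas as pd

def random_oversample(df, target_col):
    # Duplicate randomly chosen minority rows (with replacement)
    # until both classes have the same count.
    counts = df[target_col].value_counts()
    minority = counts.idxmin()
    n_extra = counts.max() - counts.min()
    extra = df[df[target_col] == minority].sample(
        n=n_extra, replace=True, random_state=0)
    return pd.concat([df, extra], ignore_index=True)

def random_undersample(df, target_col):
    # Keep only a random subset of majority rows, matching the
    # minority count.
    counts = df[target_col].value_counts()
    majority = counts.idxmax()
    kept_majority = df[df[target_col] == majority].sample(
        n=counts.min(), random_state=0)
    return pd.concat(
        [df[df[target_col] != majority], kept_majority],
        ignore_index=True)
```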
SMOTE
In SMOTE, synthetic minority samples are added to the data to achieve the desired ratio between majority and minority samples. The synthetic samples are generated by interpolation of pairs of the original minority points. The SMOTE transform supports balancing datasets including numeric and non-numeric features. Numeric features are interpolated by weighted average. However, you can’t apply weighted average interpolation to non-numeric features (it’s impossible to average “dog” and “cat”, for example). Instead, non-numeric features are copied from either original minority sample according to the averaging weight.
For example, consider two samples, A and B:
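For illustration, take the feature values to be as follows (the first two features are numeric and the last two are categorical; the numbers are assumed here, chosen to be consistent with the averages below):

- Sample A: 1, 2, “dog”, “carnivore”
- Sample B: 0, 0, “cow”, “herbivore”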
Assume the samples are interpolated with weights 0.3 for sample A and 0.7 for sample B. Therefore, the numeric fields are averaged with these weights to yield 0.3 and 0.6, respectively. The next field is filled with “dog” with probability 0.3 and “cow” with probability 0.7. Similarly, the next one equals “carnivore” with probability 0.3 and “herbivore” with probability 0.7. The random copying is done independently for each feature, so sample C below is a possible result:
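- Sample C: 0.3, 0.6, “dog”, “herbivore”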
This example demonstrates how the interpolation process could result in unrealistic synthetic samples, such as an herbivore dog. This is more common with categorical features but can occur in numeric features as well. Even though some synthetic samples may be unrealistic, SMOTE could still improve classification performance.
To heuristically generate more realistic samples, SMOTE interpolates only pairs that are close in features space. Technically, each sample is interpolated only with its k-nearest neighbors, where a common value for k is 5. In our implementation of SMOTE, only the numeric features are used to calculate the distances between points (the distances are used to determine the neighborhood of each sample). It’s common to normalize the numeric features before calculating distances. Note that the numeric features are normalized only for the purpose of calculating the distance; the resulting interpolated features aren’t normalized.
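To make the interpolation rule concrete, here is a minimal sketch in Python (not Data Wrangler’s implementation; the helper name and samples are illustrative):

```python
import random

def smote_interpolate(sample_a, sample_b, weight_a, numeric_idx):
    """Build one synthetic sample from two minority samples.

    Numeric features get the weighted average; each non-numeric
    feature is copied from sample_a with probability weight_a,
    else from sample_b, independently per feature.
    """
    weight_b = 1.0 - weight_a
    synthetic = []
    for i, (a, b) in enumerate(zip(sample_a, sample_b)):
        if i in numeric_idx:
            synthetic.append(weight_a * a + weight_b * b)
        else:
            synthetic.append(a if random.random() < weight_a else b)
    return synthetic

# Samples A and B from the example above; features 0 and 1 are numeric.
A = [1, 2, "dog", "carnivore"]
B = [0, 0, "cow", "herbivore"]
print(smote_interpolate(A, B, weight_a=0.3, numeric_idx={0, 1}))
# One possible output: [0.3, 0.6, 'dog', 'herbivore'] -- an "herbivore dog"
```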
Let’s now balance the Adult dataset (also known as the Census Income dataset) using the built-in SMOTE transform provided by Data Wrangler. This multivariate dataset includes six numeric features and eight string features. The associated task is binary classification: predicting whether an individual’s income exceeds $50,000 per year based on census data.
You can also see the distribution of the classes visually by creating a histogram using the histogram analysis type in Data Wrangler. The target distribution is imbalanced and the ratio of records with >50K to <=50K is about 1:4.
We can balance this data using the SMOTE operator found under the Balance data transform in Data Wrangler with the following steps:
- Choose income as the target column.
We want the distribution of this column to be more balanced.
- Set the desired ratio to 0.66.
Therefore, the ratio between the number of minority and majority samples is 2:3 (instead of the raw ratio of 1:4).
- Choose SMOTE as the transform to use.
- Leave the default values for Number of neighbors to average and whether or not to normalize.
- Choose Preview to get a preview of the applied transformation and choose Add to add the transform to your data flow.
Now we can create a new histogram similar to what we did before to see the realigned distribution of the classes. The following figure shows the histogram of the income column after balancing the dataset. The distribution of samples is now 3:2, as was intended.
We can now export this new balanced data and train a classifier on it, which could yield superior prediction quality.
Conclusion
In this post, we demonstrated how to balance imbalanced binary classification data using Data Wrangler. Data Wrangler offers three balancing operators: random undersampling, random oversampling, and SMOTE to rebalance data in your unbalanced datasets. All three methods offered by Data Wrangler support multi-modal data including numeric and non-numeric features.
As next steps, we recommend you replicate the example in this post in your Data Wrangler data flow to see what we discussed in action. If you’re new to Data Wrangler or SageMaker Studio, refer to Get Started with Data Wrangler. If you have any questions related to this post, please add it in the comment section.
About the Authors
Yotam Elor is a Senior Applied Scientist at Amazon SageMaker. His research interests are in machine learning, particularly for tabular data.
Arunprasath Shankar is an Artificial Intelligence and Machine Learning (AI/ML) Specialist Solutions Architect with AWS, helping global customers scale their AI solutions effectively and efficiently in the cloud. In his spare time, Arun enjoys watching sci-fi movies and listening to classical music.
Launch processing jobs with a few clicks using Amazon SageMaker Data Wrangler
Amazon SageMaker Data Wrangler makes it faster for data scientists and engineers to prepare data for machine learning (ML) applications by using a visual interface. Previously, when you created a Data Wrangler data flow, you could choose different export options to easily integrate that data flow into your data processing pipeline. Data Wrangler offers export options to Amazon Simple Storage Service (Amazon S3), SageMaker Pipelines, and SageMaker Feature Store, or as Python code. The export options create a Jupyter notebook and require you to run the code to start a processing job facilitated by SageMaker Processing.
We’re excited to announce the general release of destination nodes and the Create Job feature in Data Wrangler. This feature gives you the ability to export all the transformations that you made to a dataset to a destination node with just a few clicks. This allows you to create data processing jobs and export to Amazon S3 purely via the visual interface without having to generate, run, or manage Jupyter notebooks, thereby enhancing the low-code experience. To demonstrate this new feature, we use the Titanic dataset and show how to export your transformations to a destination node.
Prerequisites
Before we learn how to use destination nodes with Data Wrangler, you should already understand how to access and get started with Data Wrangler. You also need to know what a data flow means with context to Data Wrangler and how to create one by importing your data from the different data sources Data Wrangler supports.
Solution overview
Consider the following data flow named example-titanic.flow:
- It imports the Titanic dataset three times. You can see these different imports as separate branches in the data flow.
- For each branch, it applies a set of transformations and visualizations.
- It joins the branches into a single node with all the transformations and visualizations.
With this flow, you might want to process and save parts of your data to a specific branch or location.
In the following steps, we demonstrate how to create destination nodes, export them to Amazon S3, and create and launch a processing job.
Create a destination node
You can use the following procedure to create destination nodes and export them to an S3 bucket:
- Determine what parts of the flow file (transformations) you want to save.
- Choose the plus sign next to the nodes that represent the transformations that you want to export. (If it’s a collapsed node, you must select the options icon (three dots) for the node).
- Hover over Add destination.
- Choose Amazon S3.
- Specify the fields as shown in the following screenshot.
- For the second join node, follow the same steps to add Amazon S3 as a destination and specify the fields.
You can repeat these steps as many times as you need for as many nodes as you want in your data flow. Later on, you pick which destination nodes to include in your processing job.
Launch a processing job
Use the following procedure to create a processing job and choose the destination node where you want to export to:
- On the Data Flow tab, choose Create job.
- For Job name, enter the name of the export job.
- Select the destination nodes you want to export.
- Optionally, specify the AWS Key Management Service (AWS KMS) key ARN.
The KMS key is a cryptographic key that you can use to protect your data. For more information about KMS keys, see the AWS Key Management Service Developer Guide.
- Choose Next, 2. Configure job.
- Optionally, you can configure the job as per your needs by changing the instance type or count, or adding any tags to associate with the job.
- Choose Run to run the job.
A success message appears when the job is successfully created.
View the final data
Finally, you can use the following steps to view the exported data:
A new tab opens showing the processing job on the SageMaker console.
- When the job is complete, review the exported data on the Amazon S3 console.
You should see a new folder with the job name you chose.
FAQ
In this section, we address a few frequently asked questions about this new feature:
- What happened to the Export tab? With this new feature, we removed the Export tab from Data Wrangler. You can still facilitate the export functionality via the Data Wrangler generated Jupyter notebooks from any nodes you created in the data flow.
- How many destination nodes can I include in a job? There is a maximum of 10 destinations per processing job.
- How many destination nodes can I have in a flow file? You can have as many destination nodes as you want.
- Can I add transformations after my destination nodes? No, the idea is destination nodes are terminal nodes that have no further steps afterwards.
- What destinations are supported for destination nodes? As of this writing, we only support Amazon S3 as a destination. Support for more destination types will be added in the future. Please reach out if there is a specific one you would like to see.
Summary
In this post, we demonstrated how to use the newly launched destination nodes to create processing jobs and save your transformed datasets directly to Amazon S3 via the Data Wrangler visual interface. With this additional feature, we have enhanced the tool-driven low-code experience of Data Wrangler.
As next steps, we recommend you try the example demonstrated in this post. If you have any questions or want to learn more, see Export or leave a question in the comment section.
About the Authors
Alfonso Austin-Rivera is a Front End Engineer at Amazon SageMaker Data Wrangler. He is passionate about building intuitive user experiences that spark joy. In his spare time, you can find him fighting gravity at a rock-climbing gym or outside flying his drone.
Parsa Shahbodaghi is a Technical Writer in AWS specializing in machine learning and artificial intelligence. He writes the technical documentation for Amazon SageMaker Data Wrangler and Amazon SageMaker Feature Store. In his free time, he enjoys meditating, listening to audiobooks, weightlifting, and watching stand-up comedy. He will never be a stand-up comedian, but at least his mom thinks he’s funny.
Balaji Tummala is a Software Development Engineer at Amazon SageMaker. He helps support Amazon SageMaker Data Wrangler and is passionate about building performant and scalable software. Outside of work, he enjoys reading fiction and playing volleyball.
Arunprasath Shankar is an Artificial Intelligence and Machine Learning (AI/ML) Specialist Solutions Architect with AWS, helping global customers scale their AI solutions effectively and efficiently in the cloud. In his spare time, Arun enjoys watching sci-fi movies and listening to classical music.