Image augmentation pipeline for Amazon Lookout for Vision
Amazon Lookout for Vision provides a machine learning (ML)-based anomaly detection service that distinguishes normal images (i.e., images of objects without defects) from anomalous images (i.e., images of objects with defects), and identifies the types of anomalies (e.g., a missing piece) and their locations. As a result, Lookout for Vision is popular among customers looking for automated solutions for industrial quality inspection (e.g., detecting abnormal products). However, customers' datasets usually face two problems:
- The number of images with anomalies can be very low and might not meet the minimum number of anomalies per defect type (~20) that Lookout for Vision requires.
- Normal images might not have enough diversity, which can cause the model to fail when environmental conditions such as lighting change in production.
To overcome these problems, this post introduces an image augmentation pipeline that targets both issues: it generates synthetic anomalous images by removing objects from images, and it generates additional normal images by introducing controlled augmentations such as Gaussian noise, hue, saturation, and pixel value scaling. For the first problem, we use Amazon SageMaker Ground Truth to generate object removal masks and the LaMa algorithm to remove objects with image inpainting (object removal) techniques. For the second problem, we use the imgaug library to introduce augmentations that generate additional anomalous and normal images.
The rest of the post is organized as follows. In Section 3, we present the image augmentation pipeline for normal images. In Section 4, we present the image augmentation pipeline for abnormal images (also known as synthetic defect generation). Section 5 illustrates the Lookout for Vision training results using the augmented dataset. Section 6 demonstrates how the Lookout for Vision model trained on synthetic data performs against real defects. In Section 7, we discuss the cost estimation for this solution. All of the code we used for this post can be accessed here.
1. Solution overview
ML diagram
The following is the diagram of the proposed image augmentation pipeline for Lookout for Vision anomaly localization model training:
As shown in the diagram, the pipeline starts by collecting a series of images (step 1). We augment the dataset by augmenting the normal images (step 3) and by using object removal algorithms (steps 2, 5-6). We then package the data in a format that can be consumed by Amazon Lookout for Vision (steps 7-8). Finally, in step 9, we use the packaged data to train a Lookout for Vision localization model.
This image augmentation pipeline gives customers the flexibility to generate synthetic defects from a limited sample dataset, as well as to add quantity and variety to normal images. It boosts the performance of the Lookout for Vision service by addressing the lack of customer data and making the automated quality inspection process smoother.
2. Data preparation
From here to the end of the post, we use the public FICS-PCB: A Multi-Modal Image Dataset for Automated Printed Circuit Board Visual Inspection dataset licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) License to illustrate the image augmentation pipeline and the consequent Lookout for Vision training and testing. This dataset is designed to support the evaluation of automated PCB visual inspection systems. It was collected at the SeCurity and AssuraNce (SCAN) lab at the University of Florida. It can be accessed here.
We start with the hypothesis that the customer provides only a single normal image of a PCB board (an s10 PCB sample) as the dataset, shown as follows:
3. Image augmentation for normal images
The Lookout for Vision service requires at least 20 normal images and 20 anomalies per defect type. Since there is only one normal image from the sample data, we must generate more normal images using image augmentation techniques. From the ML standpoint, feeding multiple image transformations using different augmentation techniques can improve the accuracy and robustness of the model.
We’ll use imgaug for image augmentation of normal images. Imgaug is an open-source Python package that lets you augment images in ML experiments.
First, we’ll install the imgaug library in an Amazon SageMaker notebook.
Next, we install the Python package named ‘IPyPlot’.
Then, we perform image augmentation of the original image using transformations including GammaContrast, SigmoidContrast, and LinearContrast, and by adding Gaussian noise to the image.
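The following is a minimal sketch of this augmentation step, assuming the imgaug and IPyPlot packages installed above and a local copy of the single normal image; the file name and parameter ranges are illustrative assumptions, not the exact values used in this post.

```python
# pip install imgaug ipyplot imageio
import imageio
import imgaug.augmenters as iaa
import ipyplot

image = imageio.imread("normal_pcb.jpg")  # placeholder file name for the single normal image

# The four transformations described above; parameter ranges are illustrative assumptions.
augmenters = {
    "GammaContrast": iaa.GammaContrast(gamma=(0.5, 2.0)),
    "SigmoidContrast": iaa.SigmoidContrast(gain=(5, 10), cutoff=(0.4, 0.6)),
    "LinearContrast": iaa.LinearContrast((0.6, 1.4)),
    "GaussianNoise": iaa.AdditiveGaussianNoise(scale=(10, 30)),
}

# Generate 10 augmented variants per transformation (40 normal images in total).
augmented_images = []
for name, augmenter in augmenters.items():
    augmented_images.extend(augmenter(images=[image] * 10))

ipyplot.plot_images(augmented_images, max_images=20, img_width=150)
```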
Since we need at least 20 normal images, and the more the better, we generated 10 augmented images for each of the 4 transformations shown above as our normal image dataset. In the future, we also plan to transform the images to be positioned at different locations and different angles so that the trained model is less sensitive to the placement of the object relative to the fixed camera.
4. Synthetic defect generation for augmentation of abnormal images
In this section, we present a synthetic defect generation pipeline to augment the number of images with anomalies in the dataset. Note that, as opposed to the previous section where we create new normal samples from existing normal samples, here we create new anomaly images from normal samples. This is an attractive feature for customers whose datasets completely lack this kind of image, for example, a normal PCB board with a component removed. This synthetic defect generation pipeline has three steps. First, we generate synthetic masks from source (normal) images using Amazon SageMaker Ground Truth. In this post, we target a specific defect type: missing component. This mask generation step provides a mask image and a manifest file. Second, the manifest file is modified and converted into an input file for a SageMaker endpoint. Third, the input file is passed to an Object Removal SageMaker endpoint responsible for removing the parts of the normal image indicated by the mask. This endpoint returns the resulting abnormal image.
4.1 Generate synthetic defect masks using Amazon SageMaker Ground Truth
Amazon SageMaker Ground Truth for data labeling
Amazon SageMaker Ground Truth is a data labeling service that makes it easy to label data and gives you the option to use human annotators through Amazon Mechanical Turk, third-party vendors, or your own private workforce. You can follow this tutorial to set up a labeling job.
In this section, we’ll show how we use Amazon SageMaker Ground Truth to mark specific “components” in normal images to be removed in the next step. Note that a key contribution of this post is that we don’t use Amazon SageMaker Ground Truth in its traditional way (that is, to label training images). Here, we use it to generate a mask for future removal in normal images. These removals in normal images will generate the synthetic defects.
For the purpose of this post, in our labeling job we’ll artificially remove up to three components from the PCB board: IC, resistor1, and resistor2. After entering the labeling job as a labeler, you can select the label name and draw a mask of any shape around the component that you want to remove from the image as a synthetic defect. Note that you can’t include ‘_’ in the label name for this experiment, since we use ‘_’ to separate different metadata in the defect name later in the code.
In the following picture, we draw a green mask around IC (Integrated Circuit), a blue mask around resistor 1, and an orange mask around resistor 2.
After we select the submit button, Amazon SageMaker Ground Truth will generate an output mask with a white background and a manifest file as follows:
Note that so far we haven’t generated any abnormal images. We just marked the three components that will be artificially removed and whose removal will generate abnormal images. Later, we’ll use both (1) the mask image above, and (2) the information from the manifest file as inputs for the abnormal image generation pipeline. The next section shows how to prepare the input for the SageMaker endpoint.
4.2 Prepare Input for SageMaker endpoint
Transform the Amazon SageMaker Ground Truth manifest into a SageMaker endpoint input file
First, we set up an Amazon Simple Storage Service (Amazon S3) bucket to store all of the input and output for the image augmentation pipeline. In the post, we use an S3 bucket named qualityinspection. Then we generate all of the augmented normal images and upload them to this S3 bucket.
Next, we download the mask from Amazon SageMaker Ground Truth and upload it to a folder named ‘mask’ in that S3 bucket.
After that, we download the manifest file from the Amazon SageMaker Ground Truth labeling job and read it as JSON lines.
Lastly, we generate an input dictionary which records the input image’s S3 location, mask location, mask information, etc., save it as a .txt file, and then upload it to the ‘input’ folder of the target S3 bucket.
The following is a sample input file:
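A hedged illustration of such a record and how it might be written and uploaded follows; all field names, S3 keys, and color values here are hypothetical placeholders rather than the pipeline's actual schema.

```python
import json

import boto3

# Hypothetical input record; the real pipeline records the source image location,
# the mask location, and the mask information parsed from the Ground Truth manifest.
input_record = {
    "img_s3_uri": "s3://qualityinspection/normal/0_im.jpg",       # placeholder key
    "mask_s3_uri": "s3://qualityinspection/mask/0_im_mask.png",   # placeholder key
    "mask_info": {                                                # label-to-color map (assumed values)
        "IC": "#2ca02c",
        "resistor1": "#1f77b4",
        "resistor2": "#ff7f0e",
    },
    "output_s3_prefix": "s3://qualityinspection/synthetic_defects/",  # placeholder prefix
}

# Save the record as a .txt file and upload it to the 'input' folder of the S3 bucket.
with open("input_1.txt", "w") as f:
    json.dump(input_record, f)

boto3.client("s3").upload_file("input_1.txt", "qualityinspection", "input/input_1.txt")
```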
4.3 Create Asynchronous SageMaker endpoint to generate synthetic defects with missing components
4.3.1 LaMa Model
To remove components from the original image, we’re using an open-source PyTorch model called LaMa, from LaMa: Resolution-robust Large Mask Inpainting with Fourier Convolutions. It’s a resolution-robust large-mask inpainting model with Fourier convolutions developed by Samsung AI. The inputs for the model are an image and a black-and-white mask, and the output is an image with the objects inside the mask removed. The LaMa model application is demonstrated as follows: we use Amazon SageMaker Ground Truth to create the original mask, and then transform it to the black-and-white mask the model requires.
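A minimal sketch of this color-filtering conversion follows, turning the colored Ground Truth mask into one black-and-white mask per component; the input file name and the RGB color values are assumptions, not the exact values used in this post.

```python
import numpy as np
from PIL import Image

# Load the colored Ground Truth output mask (hypothetical file name).
mask = np.array(Image.open("0_im_mask.png").convert("RGB"))

component_colors = {
    "IC": (44, 160, 44),          # green (assumed RGB value)
    "resistor1": (31, 119, 180),  # blue (assumed RGB value)
    "resistor2": (255, 127, 14),  # orange (assumed RGB value)
}

for name, color in component_colors.items():
    # Pixels matching this component's color become white (region to remove);
    # everything else becomes black (region to keep).
    binary = np.all(mask == color, axis=-1).astype(np.uint8) * 255
    Image.fromarray(binary).save(f"mask_{name}.png")
```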
4.3.2 Introducing Amazon SageMaker Asynchronous inference
Amazon SageMaker Asynchronous Inference is a new inference option in Amazon SageMaker that queues incoming requests and processes them asynchronously. Asynchronous inference enables users to save on costs by autoscaling the instance count to zero when there are no requests to process. This means that you only pay when your endpoint is processing requests. The new asynchronous inference option is ideal for workloads where the request sizes are large (up to 1GB) and inference processing times are in the order of minutes. The code to deploy and invoke the endpoint is here.
4.3.3 Endpoint deployment
To deploy the asynchronous endpoint, first we must get the IAM role and set up some environment variables.
As we mentioned before, we’re using the open-source PyTorch model from LaMa: Resolution-robust Large Mask Inpainting with Fourier Convolutions, and the pre-trained model has been uploaded to s3://qualityinspection/model/big-lama.tar.gz. The image_uri points to a Docker container with the required framework and Python versions.
Then, we must specify additional asynchronous inference specific configuration parameters while creating the endpoint configuration.
Next, we deploy the endpoint on a ml.g4dn.xlarge instance by running the following code:
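A minimal sketch of this deployment with the SageMaker Python SDK follows; the inference.py handler script, the framework and Python versions, and the output prefix are assumptions rather than the exact values used in this post.

```python
import sagemaker
from sagemaker.async_inference import AsyncInferenceConfig
from sagemaker.pytorch import PyTorchModel

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

# Docker image with the required framework and Python versions (assumed versions).
image_uri = sagemaker.image_uris.retrieve(
    framework="pytorch",
    region=sess.boto_region_name,
    version="1.12",
    py_version="py38",
    image_scope="inference",
    instance_type="ml.g4dn.xlarge",
)

model = PyTorchModel(
    model_data="s3://qualityinspection/model/big-lama.tar.gz",
    role=role,
    image_uri=image_uri,
    entry_point="inference.py",  # hypothetical handler script wrapping the LaMa model
    sagemaker_session=sess,
)

# Asynchronous-inference-specific configuration.
async_config = AsyncInferenceConfig(
    output_path="s3://qualityinspection/output/",  # placeholder output prefix
    max_concurrent_invocations_per_instance=2,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
    async_inference_config=async_config,
)
```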
After approximately 6-8 minutes, the endpoint is created successfully, and it will show up in the SageMaker console.
4.3.4 Invoke the endpoint
Next, we use the input txt file we generated earlier as the input of the endpoint and invoke the endpoint using the following code:
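A minimal sketch of the invocation follows, assuming the predictor object from the deployment step and a hypothetical S3 key for the uploaded input file.

```python
# Submit the asynchronous inference request; the call returns immediately.
response = predictor.predict_async(
    input_path="s3://qualityinspection/input/input_1.txt",  # placeholder key
)

# S3 location where the inference output will be written once processing finishes.
print(response.output_path)
```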
The above command will finish execution immediately. However, the inference will continue for several minutes until it completes all of the tasks and returns all of the outputs in the S3 bucket.
4.3.5 Check the inference result of the endpoint
After you select the endpoint, you’ll see the Monitor section. Select ‘View logs’ to check the inference results in the console.
Two log records will show up in Log streams. The one named data-log will show the final inference result, while the other log record shows the details of the inference, which is usually used for debugging purposes. If the inference request succeeds, you’ll see the message Inference request succeeded. in the data-log, along with information about the total model latency, total process time, and so on. If the inference fails, check the other log to debug. You can also check the result by polling the status of the inference request. Learn more about Amazon SageMaker Asynchronous Inference here.
4.3.6 Generating synthetic defects with missing components using the endpoint
We’ll complete four tasks in the endpoint:
- The Lookout for Vision anomaly localization service requires one defect per image in the training dataset to optimize model performance. Therefore, we must separate the masks for different defects in the endpoint by color filtering.
- Split the train/test dataset to satisfy the following requirements:
  - at least 10 normal images and 10 anomalies for the train dataset
  - one defect per image in the train dataset
  - at least 10 normal images and 10 anomalies for the test dataset
  - multiple defects per image are allowed for the test dataset
- Generate synthetic defects and upload them to the target S3 locations.
We generate one defect per image and more than 20 defects per class for the train dataset, as well as 1-3 defects per image and more than 20 defects per class for the test dataset.
The following is an example of the source image and its synthetic defects with three components missing: IC, resistor1, and resistor2.
original image
40_im_mask_IC_resistor1_resistor2.jpg (the defect name indicates the missing components)
- Generate manifest files for train/test dataset recording all of the above information.
Finally, we’ll generate train/test manifests to record information, such as synthetic defect S3 location, mask S3 location, defect class, mask color, etc.
The following are sample json lines for an anomaly and a normal image in the manifest.
For anomaly:
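The field names below are approximations meant only to illustrate the kind of information such a line records (source image, anomaly label, mask reference, and mask color map); the authoritative schema is the Lookout for Vision manifest format, and the S3 keys are placeholders.

```json
{
  "source-ref": "s3://qualityinspection/train/40_im_mask_IC.jpg",
  "anomaly-label": 1,
  "anomaly-label-metadata": {
    "class-name": "anomaly",
    "confidence": 1,
    "human-annotated": "yes",
    "type": "groundtruth/image-classification"
  },
  "anomaly-mask-ref": "s3://qualityinspection/mask/40_im_mask_IC.png",
  "anomaly-mask-ref-metadata": {
    "internal-color-map": {
      "1": {"class-name": "IC", "hex-color": "#2ca02c"}
    },
    "type": "groundtruth/semantic-segmentation"
  }
}
```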
Amazon SageMaker JumpStart now offers Amazon Comprehend notebooks for custom classification and custom entity detection
Amazon Comprehend is a natural language processing (NLP) service that uses machine learning (ML) to discover insights from text. Amazon Comprehend provides customized features such as custom entity recognition and custom classification, as well as pre-trained APIs such as key phrase extraction, sentiment analysis, entity recognition, and more, so you can easily integrate NLP into your applications.
We recently added Amazon Comprehend related notebooks to Amazon SageMaker JumpStart that can help you quickly get started using the Amazon Comprehend custom classifier and custom entity recognizer. You can use custom classification to organize documents into categories (classes) that you define. Custom entity recognition extends the capability of the Amazon Comprehend pre-trained entity detection API by helping you identify entity types that are unique to your domain or business and that aren’t in the preset generic entity types.
In this post, we show you how to use JumpStart to build Amazon Comprehend custom classification and custom entity detection models as part of your enterprise NLP needs.
SageMaker JumpStart
The Amazon SageMaker Studio landing page provides the option to use JumpStart. JumpStart provides a quick way to get started by providing pre-trained models for a variety of problem types. You can train and tune these models. JumpStart also provides other resources like notebooks, blogs, and videos.
JumpStart notebooks are essentially sample code that you can use as a starting point to get started quickly. Currently, we provide you with over 40 notebooks that you can use as is or customize as needed. You can find your notebooks by using search or the tabbed view panel. After you find the notebook you want to use, you can import the notebook, customize it for your requirements, and select the infrastructure and environment to run the notebook on.
Get started with JumpStart notebooks
To get started with JumpStart, go to the Amazon SageMaker console and open Studio. Refer to Get Started with SageMaker Studio for instructions on how to get started with Studio. Then complete the following steps:
- In Studio, go to the launch page of JumpStart and choose Go to SageMaker JumpStart.
You’re offered multiple ways to search. You may either use tabs on the top to get to what you want, or use the search box as shown in the following screenshot.
- To find notebooks, we go to the Notebooks tab.
At the time of writing, JumpStart offers 47 notebooks. You can use filters to find Amazon Comprehend related notebooks.
- On the Content Type drop-down menu, choose Notebook.
As you can see in the following screenshot, we currently have two Amazon Comprehend notebooks.
In the following sections, we explore both notebooks.
Amazon Comprehend Custom Classifier
In this notebook, we demonstrate how to use the custom classifier API to create a document classification model.
The custom classifier is a fully managed Amazon Comprehend feature that lets you build custom text classification models that are unique to your business, even if you have little or no ML expertise. The custom classifier builds on the existing capabilities of Amazon Comprehend, which are already trained on tens of millions of documents. It abstracts much of the complexity required to build an NLP classification model. The custom classifier automatically loads and inspects the training data, selects the right ML algorithms, trains your model, finds the optimal hyperparameters, tests the model, and provides model performance metrics. The Amazon Comprehend custom classifier also provides an easy-to-use console for the entire ML workflow, including labeling text using Amazon SageMaker Ground Truth, training and deploying a model, and visualizing the test results. With an Amazon Comprehend custom classifier, you can build the following models:
- Multi-class classification model – In multi-class classification, each document can have one and only one class assigned to it. The individual classes are mutually exclusive. For example, a movie can be classed as a documentary or as science fiction, but not both at the same time.
- Multi-label classification model – In multi-label classification, individual classes represent different categories, but these categories are somehow related and not mutually exclusive. As a result, each document has at least one class assigned to it, but can have more. For example, a movie can simply be an action movie, or it can be an action movie, a science fiction movie, and a comedy, all at the same time.
This notebook requires no ML expertise to train a model with the example dataset or with your own business specific dataset. You can use the API operations discussed in this notebook in your own applications.
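As a hedged illustration (not the JumpStart notebook's exact code), creating a multi-class custom classifier with the boto3 API might look like the following sketch; the classifier name, role ARN, and S3 location are placeholders.

```python
import boto3

comprehend = boto3.client("comprehend")

response = comprehend.create_document_classifier(
    DocumentClassifierName="news-topic-classifier",  # placeholder name
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendDataAccessRole",  # placeholder
    InputDataConfig={"S3Uri": "s3://my-bucket/comprehend/train.csv"},  # placeholder training data
    LanguageCode="en",
    Mode="MULTI_CLASS",  # or "MULTI_LABEL" for multi-label classification
)
print(response["DocumentClassifierArn"])
```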
Amazon Comprehend Custom Entity Recognizer
In this notebook, we demonstrate how to use the custom entity recognition API to create an entity recognition model.
Custom entity recognition extends the capabilities of Amazon Comprehend by helping you identify your specific entity types that aren’t in the preset generic entity types. This means that you can analyze documents and extract entities like product codes or business-specific entities that fit your particular needs.
Building an accurate custom entity recognizer on your own can be a complex process, requiring preparation of large sets of manually annotated training documents and selecting the right algorithms and parameters for model training. Amazon Comprehend helps reduce the complexity by providing automatic annotation and model development to create a custom entity recognition model.
The example notebook takes the training dataset in CSV format and runs inference against text input. Amazon Comprehend also supports an advanced use case that takes Ground Truth annotated data for training and allows you to directly run inference on PDFs and Word documents. For more information, refer to Build a custom entity recognizer for PDF documents using Amazon Comprehend.
Amazon Comprehend has lowered the annotation limits, allowing you to get more stable results, especially for few-shot subsamples. For more information about this improvement, refer to Amazon Comprehend announces lower annotation limits for custom entity recognition.
This notebook requires no ML expertise to train a model with the example dataset or with your own business specific dataset. You can use the API operations discussed in this notebook in your own applications.
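As a similarly hedged illustration (not the JumpStart notebook's exact code), creating a custom entity recognizer from an entity list (CSV) with boto3 might look like this sketch; the recognizer name, role ARN, entity type, and S3 locations are placeholders.

```python
import boto3

comprehend = boto3.client("comprehend")

response = comprehend.create_entity_recognizer(
    RecognizerName="product-code-recognizer",  # placeholder name
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendDataAccessRole",  # placeholder
    LanguageCode="en",
    InputDataConfig={
        "EntityTypes": [{"Type": "PRODUCT_CODE"}],  # placeholder entity type
        "Documents": {"S3Uri": "s3://my-bucket/comprehend/raw_documents.txt"},  # placeholder
        "EntityList": {"S3Uri": "s3://my-bucket/comprehend/entity_list.csv"},   # placeholder
    },
)
print(response["EntityRecognizerArn"])
```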
Use, customize, and deploy Amazon Comprehend JumpStart notebooks
After you select the Amazon Comprehend notebook you want to use, choose Import notebook. As you do that, you can see the notebook kernel starting.
Importing your notebook triggers selection of the notebook instance, kernel, and image that is used to run the notebook. After the default infrastructure is provisioned, you can change the selections as per your requirements.
Now, go over the outline of the notebook and carefully read the sections for prerequisites setup, data setup, training the model, running inference, and stopping the model. Feel free to customize the generated code per your needs.
Based on your requirements, you may want to customize the following sections:
- Permissions – For a production application, we recommend restricting access policies to only those needed to run the application. Permissions can be restricted based on the use case, such as training or inference, and specific resource names, such as a full Amazon Simple Storage Service (Amazon S3) bucket name or an S3 bucket name pattern. You should also restrict access to the custom classifier or SageMaker operations to just those that your application needs.
- Data and location – The example notebook provides you sample data and S3 locations. Based on your requirements, you may use your own data for training, validation, and testing, and use different S3 locations as needed. Similarly, when the model is created, you can choose to keep the model at different locations. Just make sure you have provided the right permissions to access S3 buckets.
- Preprocessing steps – If you’re using different data for training and testing, you may want to adjust the preprocessing steps per your requirements.
- Testing data – You can bring your own inference data for testing.
- Clean up – Delete the resources launched by the notebook to avoid recurring charges.
Conclusion
In this post, we showed you how to use JumpStart to learn and fast-track using Amazon Comprehend APIs by making it convenient to find and run Amazon Comprehend related notebooks from Studio while having the option to modify the code as needed. The notebooks use sample datasets with AWS product announcements and sample news articles. You may use this notebook to learn how to use Amazon Comprehend APIs in a Python notebook, or you may use it as a starting point and expand the code further for your unique requirements and production deployments.
You can start using JumpStart and take advantage of over 40 notebooks in various topics in all Regions where Studio is available at no additional cost.
About the Authors
Lana Zhang is a Sr. Solutions Architect at the AWS WWSO AI Services team with expertise in AI and ML for Content Moderation and Rekognition. She is passionate about promoting AWS AI services and helping customers transform their business solutions.
Meenakshisundaram Thandavarayan is a Senior AI/ML specialist with AWS. He helps hi-tech strategic accounts on their AI and ML journey. He is very passionate about data-driven AI.
Rachna Chadha is a Principal Solution Architect AI/ML in Strategic Accounts at AWS. Rachna is an optimist who believes that ethical and responsible use of AI can improve society in the future and bring economic and social prosperity. In her spare time, Rachna likes spending time with her family, hiking, and listening to music.
Prepare data from Amazon EMR for machine learning using Amazon SageMaker Data Wrangler
Data preparation is a principal component of machine learning (ML) pipelines. In fact, it is estimated that data professionals spend about 80 percent of their time on data preparation. In this intensely competitive market, teams want to analyze data and extract more meaningful insights quickly. Customers are adopting more efficient and visual ways to build data processing systems.
Amazon SageMaker Data Wrangler simplifies the data preparation and feature engineering process, reducing the time it takes from weeks to minutes by providing a single visual interface for data scientists to select, clean data, create features, and automate data preparation in ML workflows without writing any code. You can import data from multiple data sources, such as Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, and Snowflake. You can now also use Amazon EMR as a data source in Data Wrangler to easily prepare data for ML.
Analyzing, transforming, and preparing large amounts of data is a foundational step of any data science and ML workflow. Data professionals such as data scientists want to leverage the power of Apache Spark, Hive, and Presto running on Amazon EMR for fast data preparation, but the learning curve is steep. Our customers wanted the ability to connect to Amazon EMR to run ad hoc SQL queries on Hive or Presto to query data in the internal metastore or external metastore (e.g., AWS Glue Data Catalog), and prepare data within a few clicks.
This blog article will discuss how customers can now find and connect to existing Amazon EMR clusters using a visual experience in SageMaker Data Wrangler. They can visually inspect the database, tables, schema, and Presto queries to prepare for modeling or reporting. They can then quickly profile data using a visual interface to assess data quality, identify abnormalities or missing or erroneous data, and receive information and recommendations on how to address these issues. Additionally, they can analyze, clean, and engineer features with the aid of more than a dozen additional built-in analyses and 300+ extra built-in transformations backed by Spark without writing a single line of code.
Solution overview
Data professionals can quickly find and connect to existing EMR clusters using SageMaker Studio configurations. Additionally, data professionals can terminate EMR clusters with only a few clicks from SageMaker Studio using predefined templates and on-demand creation of EMR clusters. With the help of these tools, customers may jump right into the SageMaker Studio universal notebook and write code in Apache Spark, Hive, Presto, or PySpark to perform data preparation at scale. Due to a steep learning curve for creating Spark code to prepare data, not all data professionals are comfortable with this procedure. With Amazon EMR as a data source for Amazon SageMaker Data Wrangler, you can now quickly and easily connect to Amazon EMR without writing a single line of code.
The following diagram represents the different components used in this solution.
We demonstrate two authentication options that can be used to establish a connection to the EMR cluster. For each option, we deploy a unique stack of AWS CloudFormation templates.
The CloudFormation template performs the following actions when each option is selected:
- Creates a Studio Domain in VPC-only mode, along with a user profile named studio-user.
- Creates building blocks, including the VPC, endpoints, subnets, security groups, EMR cluster, and other required resources to successfully run the examples.
- For the EMR cluster, connects the AWS Glue Data Catalog as metastore for EMR Hive and Presto, creates a Hive table in EMR, and fills it with data from a US airport dataset.
- For the LDAP CloudFormation template, creates an Amazon Elastic Compute Cloud (Amazon EC2) instance to host the LDAP server to authenticate the Hive and Presto LDAP user.
Option 1: Lightweight Directory Access Protocol
For the LDAP authentication CloudFormation template, we provision an Amazon EC2 instance with an LDAP server and configure the EMR cluster to use this server for authentication. This setup is TLS enabled.
Option 2: No-Auth
In the No-Auth authentication CloudFormation template, we use a standard EMR cluster with no authentication enabled.
Deploy the resources with AWS CloudFormation
Complete the following steps to deploy the environment:
- Sign in to the AWS Management Console as an AWS Identity and Access Management (IAM) user, preferably an admin user.
- Choose Launch Stack to launch the CloudFormation template for the appropriate authentication scenario. Make sure the Region used to deploy the CloudFormation stack has no existing Studio Domain. If you already have a Studio Domain in a Region, you may choose a different Region.
- Choose Next.
- For Stack name, enter a name for the stack (for example, dw-emr-blog).
- Leave the other values as default.
- To continue, choose Next from the stack details page and stack options. The LDAP stack uses the following credentials:
  - username: david
  - password: welcome123
- On the review page, select the check box to confirm that AWS CloudFormation might create resources.
- Choose Create stack. Wait until the status of the stack changes from CREATE_IN_PROGRESS to CREATE_COMPLETE. The process usually takes 10–15 minutes.
Note: If you would like to try multiple stacks, please follow the steps in the Clean up section. Remember that you must delete the SageMaker Studio Domain before the next stack can be successfully launched.
Set up the Amazon EMR as a data source in Data Wrangler
In this section, we cover connecting to the existing Amazon EMR cluster created through the CloudFormation template as a data source in Data Wrangler.
Create a new data flow
To create your data flow, complete the following steps:
- On the SageMaker console, choose Amazon SageMaker Studio in the navigation pane.
- Choose Open studio.
- In the Launcher, choose New data flow. Alternatively, on the File drop-down, choose New, then choose Data Wrangler flow.
- Creating a new flow can take a few minutes. After the flow has been created, you see the Import data page.
Add Amazon EMR as a data source in Data Wrangler
On the Add data source menu, choose Amazon EMR.
You can browse all the EMR clusters that your Studio execution role has permissions to see. You have two options to connect to a cluster: one is through the interactive UI, and the other is to first create a secret using AWS Secrets Manager with a JDBC URL that includes the EMR cluster information, and then provide the stored AWS secret ARN in the UI to connect to Presto. In this blog, we follow the first option. Select one of the clusters that you want to use, click Next, and select endpoints.
Select Presto, connect to Amazon EMR, create a name to identify your connection, and click Next.
Select Authentication type, either LDAP or No Authentication, and click Connect.
- For Lightweight Directory Access Protocol (LDAP), provide username and password to be authenticated.
- For No Authentication, you will be connected to EMR Presto within the VPC without providing user credentials, and you then land on Data Wrangler’s SQL explorer page for EMR.
Once connected, you can interactively view the database tree and preview tables or schemas. You can also query, explore, and visualize data from EMR. For a preview, you see a limit of 100 records by default. For a customized query, you can provide SQL statements in the query editor box; once you click the Run button, the query is executed on EMR’s Presto engine.
The Cancel query button allows ongoing queries to be canceled if they are taking an unusually long time.
The last step is to import the data. Once you are ready with the queried data, you have the option to update the sampling settings, choosing the sampling type (FirstK, Random, or Stratified) and sampling size for importing data into Data Wrangler.
Click Import. The prepare page will be loaded, allowing you to add various transformations and essential analysis to the dataset.
Navigate to DataFlow from the top screen and add more steps to the flow as needed for transformations and analysis. You can run a data insight report to identify data quality issues and get recommendations to fix those issues. Let’s look at some example transforms.
Go to your data flow, and you should see the following screen. It shows that we are using EMR as a data source through the Presto connector.
Let’s click on the + button to the right of Data types and select Add transform. When you do that, the following screen should pop up:
Let’s explore the data. We see that it has multiple features such as iata_code, airport, city, state, country, latitude, and longitude. We can see that the entire dataset is based in one country, which is the US, and there are missing values in Latitude and Longitude. Missing data can cause bias in the estimation of parameters, and it can reduce the representativeness of the samples, so we need to perform some imputation and handle missing values in our dataset.
Let’s click on the Add Step button on the navigation bar to the right. Select Handle missing. The configurations can be seen in the following screenshots. Under Transform, select Impute. Select the column type as Numeric and column names Latitude and Longitude. We will be imputing the missing values using an approximate median value. Preview and add the transform.
Let us now look at another example transform. When building a machine learning model, columns are removed if they are redundant or don’t help your model. The most common way to remove a column is to drop it. In our dataset, the feature country can be dropped since the dataset is specifically for US airport data. Let’s see how we can manage columns. Let’s click on the Add step button on the navigation bar to the right. Select Manage columns. The configurations can be seen in the following screenshots. Under Transform, select Drop column, and under Columns to drop, select Country.
You can continue adding steps based on the different transformations required for your dataset. Let us go back to our data flow. You will now see two more blocks showing the transforms that we performed. In our scenario, you can see Impute and Drop column.
ML practitioners spend a lot of time crafting feature engineering code, applying it to their initial datasets, training models on the engineered datasets, and evaluating model accuracy. Given the experimental nature of this work, even the smallest project will lead to multiple iterations. The same feature engineering code is often run again and again, wasting time and compute resources on repeating the same operations. In large organizations, this can cause an even greater loss of productivity because different teams often run identical jobs or even write duplicate feature engineering code because they have no knowledge of prior work. To avoid the reprocessing of features, we will now export our transformed features to Amazon SageMaker Feature Store. Let’s click on the + button to the right of Drop column. Select Export to and choose SageMaker Feature Store (via Jupyter notebook).
You can easily export your generated features to SageMaker Feature Store by selecting it as the destination. You can save the features into an existing feature group or create a new one.
We have now created features with Data Wrangler and easily stored those features in Feature Store. We showed an example workflow for feature engineering in the Data Wrangler UI. Then we saved those features into Feature Store directly from Data Wrangler by creating a new feature group. Finally, we ran a processing job to ingest those features into Feature Store. Data Wrangler and Feature Store together helped us build automatic and repeatable processes to streamline our data preparation tasks with minimum coding required. Data Wrangler also provides us flexibility to automate the same data preparation flow using scheduled jobs. We can also automate training or feature engineering with SageMaker Pipelines (via Jupyter Notebook) and deploy to the Inference endpoint with SageMaker inference pipeline (via Jupyter Notebook).
Clean up
If your work with Data Wrangler is complete, select the stack created from the CloudFormation page and delete it to avoid incurring additional fees.
Conclusion
In this post, we went over how to set up Amazon EMR as a data source in Data Wrangler, how to transform and analyze a dataset, and how to export the results to a data flow for use in a Jupyter notebook. After visualizing our dataset using Data Wrangler’s built-in analytical features, we further enhanced our data flow. The fact that we created a data preparation pipeline without writing a single line of code is significant.
To get started with Data Wrangler, see Prepare ML Data with Amazon SageMaker Data Wrangler, and see the latest information on the Data Wrangler product page.
About the authors
Ajjay Govindaram is a Senior Solutions Architect at AWS. He works with strategic customers who are using AI/ML to solve complex business problems. His experience lies in providing technical direction as well as design assistance for modest to large-scale AI/ML application deployments. His knowledge ranges from application architecture to big data, analytics, and machine learning. He enjoys listening to music while resting, experiencing the outdoors, and spending time with his loved ones.
Isha Dua is a Senior Solutions Architect based in the San Francisco Bay Area. She helps AWS enterprise customers grow by understanding their goals and challenges, and guides them on how they can architect their applications in a cloud-native manner while making sure they are resilient and scalable. She’s passionate about machine learning technologies and environmental sustainability.
Rui Jiang is a Software Development Engineer at AWS based in the New York City area. She is a member of the SageMaker Data Wrangler team helping develop engineering solutions for AWS enterprise customers to achieve their business needs. Outside of work, she enjoys exploring new foods, life fitness, outdoor activities, and traveling.
Exafunction supports AWS Inferentia to unlock best price performance for machine learning inference
Across all industries, machine learning (ML) models are getting deeper, workflows are getting more complex, and workloads are operating at larger scales. Significant effort and resources are put into making these models more accurate since this investment directly results in better products and experiences. On the other hand, making these models run efficiently in production is a non-trivial undertaking that’s often overlooked, despite being key to achieving performance and budget goals. In this post we cover how Exafunction and AWS Inferentia work together to unlock easy and cost-efficient deployment for ML models in production.
Exafunction is a start-up focused on enabling companies to perform ML at scale as efficiently as possible. One of their products is ExaDeploy, an easy-to-use SaaS solution to serve ML workloads at scale. ExaDeploy efficiently orchestrates your ML workloads across mixed resources (CPU and hardware accelerators) to maximize resource utilization. It also takes care of auto scaling, compute colocation, network issues, fault tolerance, and more, to ensure efficient and reliable deployment. AWS Inferentia-based Amazon EC2 Inf1 instances are purpose built to deliver the lowest cost-per-inference in the cloud. ExaDeploy now supports Inf1 instances, which allows users to get both the hardware-based savings of accelerators and the software-based savings of optimized resource virtualization and orchestration at scale.
Solution overview
How ExaDeploy solves for deployment efficiency
To ensure efficient utilization of compute resources, you need to consider proper resource allocation, auto scaling, compute co-location, network cost and latency management, fault tolerance, versioning and reproducibility, and more. At scale, any inefficiencies materially affect costs and latency, and many large companies have addressed these inefficiencies by building internal teams and expertise. However, it’s not practical for most companies to assume this financial and organizational overhead of building generalizable software that isn’t the company’s desired core competency.
ExaDeploy is designed to solve these deployment efficiency pain points, including those seen in some of the most complex workloads such as those in Autonomous Vehicle and natural language processing (NLP) applications. On some large batch ML workloads, ExaDeploy has reduced costs by over 85% without sacrificing on latency or accuracy, with integration time as low as one engineer-day. ExaDeploy has been proven to auto scale and manage thousands of simultaneous hardware accelerator resource instances without any system degradation.
Key features of ExaDeploy include:
- Runs in your cloud: None of your models, inputs, or outputs ever leave your private network. Continue to use your cloud provider discounts.
- Shared accelerator resources: ExaDeploy optimizes the accelerators used by enabling multiple models or workloads to share accelerator resources. It can also identify if multiple workloads are deploying the same model, and then share the model across those workloads, thereby optimizing the accelerator used. Its automatic rebalancing and node draining capabilities maximize utilization and minimize costs.
- Scalable serverless deployment model: ExaDeploy auto scales based on accelerator resource saturation. Dynamically scale down to 0 or up to thousands of resources.
- Support for a variety of computation types: You can offload deep learning models from all major ML frameworks as well as arbitrary C++ code, CUDA kernels, custom ops, and Python functions.
- Dynamic model registration and versioning: New models or model versions can be registered and run without having to rebuild or redeploy the system.
- Point-to-point execution: Clients connect directly to remote accelerator resources, which enables low latency and high throughput. They can even store the state remotely.
- Asynchronous execution: ExaDeploy supports asynchronous execution of models, which allows clients to parallelize local computation with remote accelerator resource work.
- Fault-tolerant remote pipelines: ExaDeploy allows clients to dynamically compose remote computations (models, preprocessing, etc.) into pipelines with fault tolerance guarantee. The ExaDeploy system handles pod or node failures with automatic recovery and replay, so that the developers never have to think about ensuring fault tolerance.
- Out-of-the-box monitoring: ExaDeploy provides Prometheus metrics and Grafana dashboards to visualize accelerator resource usage and other system metrics.
ExaDeploy supports AWS Inferentia
AWS Inferentia-based Amazon EC2 Inf1 instances are designed for deep learning specific inference workloads. These instances deliver up to 2.3x higher throughput and up to 70% lower cost per inference compared to current-generation GPU-based inference instances.
ExaDeploy now supports AWS Inferentia, and together they unlock the increased performance and cost-savings achieved through purpose-built hardware-acceleration and optimized resource orchestration at scale. Let’s look at the combined benefits of ExaDeploy and AWS Inferentia by considering a very common modern ML workload: batched, mixed-compute workloads.
Hypothetical workload characteristics:
- 15 ms of CPU-only pre-process/post-process
- Model inference (15 ms on GPU, 5 ms on AWS Inferentia)
- 10 clients, each making a request every 20 ms
- Approximate relative cost of CPU:Inferentia:GPU is 1:2:4 (Based on Amazon EC2 On-Demand pricing for c5.xlarge, inf1.xlarge, and g4dn.xlarge)
The table below shows how each of the options shapes up. Cost is expressed in relative units derived from the 1:2:4 CPU:Inferentia:GPU ratio above; for example, the GPU-without-ExaDeploy row uses 20 CPUs and 20 GPUs, giving 20×1 + 20×4 = 100.
| Setup | Resources needed | Cost | Latency |
| --- | --- | --- | --- |
| GPU without ExaDeploy | 2 CPU, 2 GPU per client (total 20 CPU, 20 GPU) | 100 | 30 ms |
| GPU with ExaDeploy | 8 GPUs shared across 10 clients, 1 CPU per client | 42 | 30 ms |
| AWS Inferentia without ExaDeploy | 1 CPU, 1 AWS Inferentia per client (total 10 CPU, 10 Inferentia) | 30 | 20 ms |
| AWS Inferentia with ExaDeploy | 3 AWS Inferentia shared across 10 clients, 1 CPU per client | 16 | 20 ms |
ExaDeploy on AWS Inferentia example
In this section, we go over the steps to configure ExaDeploy through an example with inf1 nodes on a BERT PyTorch model. We saw an average throughput of 1140 samples/sec for the bert-base model, which demonstrates that little to no overhead was introduced by ExaDeploy for this single model, single workload scenario.
Step 1: Set up an Amazon Elastic Kubernetes Service (Amazon EKS) cluster
An Amazon EKS cluster can be brought up with our Terraform AWS module. For our example, we used an inf1.xlarge for AWS Inferentia.
Step 2: Set up ExaDeploy
The second step is to set up ExaDeploy. In general, the deployment of ExaDeploy on inf1 instances is straightforward. Setup mostly follows the same procedure as it does on graphics processing unit (GPU) instances. The primary difference is to change the model tag from GPU to AWS Inferentia and recompile the model. For example, moving from g4dn to inf1 instances using ExaDeploy’s application programming interfaces (APIs) required only approximately 10 lines of code to be changed.
- One simple method is to use Exafunction’s Terraform AWS Kubernetes module or Helm chart. These deploy the core ExaDeploy components to run in the Amazon EKS cluster.
- Compile the model into a serialized format (e.g., TorchScript, TF SavedModel, ONNX). For AWS Inferentia, we followed this tutorial (see the compilation sketch after this list).
- Register the compiled model in ExaDeploy’s module repository.
- Prepare the data for the model (i.e., not ExaDeploy-specific).
- Run the model remotely from the client.
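The compilation step referenced above follows the pattern of the public torch-neuron tutorials and is not ExaDeploy-specific; a hedged sketch for a bert-base model follows, where the model name, sequence length, and versions are assumptions.

```python
import torch
import torch_neuron  # importing torch_neuron makes the torch.neuron namespace available
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", torchscript=True
)
model.eval()

# Build example inputs with an assumed fixed sequence length of 128.
encoding = tokenizer(
    "example input",
    max_length=128,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
example_inputs = (encoding["input_ids"], encoding["attention_mask"])

# Trace/compile the model for the Inferentia NeuronCore and save the artifact.
model_neuron = torch.neuron.trace(model, example_inputs)
model_neuron.save("bert_base_neuron.pt")
```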
ExaDeploy and AWS Inferentia: Better together
AWS Inferentia is pushing the boundaries of throughput for model inference and delivering lowest cost-per-inference in the cloud. That being said, companies need the proper orchestration to enjoy the price-performance benefits of Inf1 at scale. ML serving is a complex problem that, if addressed in-house, requires expertise that’s removed from company goals and often delays product timelines. ExaDeploy, which is Exafunction’s ML deployment software solution, has emerged as the industry leader. It serves even the most complex ML workloads, while providing smooth integration experiences and support from a world-class team. Together, ExaDeploy and AWS Inferentia unlock increased performance and cost-savings for inference workloads at scale.
Conclusion
In this post, we showed you how Exafunction supports AWS Inferentia for performance ML. For more information on building applications with Exafunction, visit Exafunction. For best practices on building deep learning workloads on Inf1, visit Amazon EC2 Inf1 instances.
About the Authors
Nicholas Jiang, Software Engineer, Exafunction
Jonathan Ma, Software Engineer, Exafunction
Prem Nair, Software Engineer, Exafunction
Anshul Ramachandran, Software Engineer, Exafunction
Shruti Koparkar, Sr. Product Marketing Manager, AWS