EMNLP papers examine constrained generation of rewrite candidates and automatic selection of information-rich training data.
Introducing Amazon SageMaker Data Wrangler’s new embedded visualizations
Manually inspecting data quality and cleaning data is a painful and time-consuming process that can take a huge chunk of a data scientist’s time on a project. According to a 2020 survey of data scientists conducted by Anaconda, data scientists spend approximately 66% of their time on data preparation and analysis tasks, including loading (19%), cleaning (26%), and visualizing data (21%). Amazon SageMaker offers a range of data preparation tools to meet different customer needs and preferences. For users who prefer a GUI-based interactive interface, SageMaker Data Wrangler offers 300+ built-in visualizations, analyses, and transformations to efficiently process data backed by Spark without writing a single line of code.
Data visualization in machine learning (ML) is an iterative process and requires continuous visualization of the dataset for discovery, investigation, and validation. Putting data into perspective entails examining each column to identify possible data errors, missing values, wrong data types, misleading or incorrect data, outliers, and more.
In this post, we’ll show you how Amazon SageMaker Data Wrangler automatically generates key visualizations of data distribution, detects data quality issues, and surfaces data insights such as outliers for each feature without writing a single line of code. It helps improve the data grid experience with automatic quality warnings (for example, missing values or invalid values). The automatically-generated visualizations are also interactive. For example, you can show a tabulation of the top five most frequent items ordered by percent, and hover over the bar to switch between count and percentage.
Prerequisites
Amazon SageMaker Data Wrangler is a SageMaker feature available within SageMaker Studio. You can follow the Studio onboarding process to spin up the Studio environment and notebooks. Although you can choose from a few authentication methods, the simplest way to create a Studio domain is to follow the Quick start instructions. The Quick start uses the same default settings as the standard Studio setup. You can also choose to onboard using AWS Identity and Access Management (IAM) Identity Center (successor to AWS Single Sign-On) for authentication (see Onboard to Amazon SageMaker Domain Using IAM Identity Center).
Solution Walkthrough
Start your SageMaker Studio Environment and create a new Data Wrangler flow. You can either import your own dataset or use a sample dataset (Titanic) as seen in the following image. These two nodes (the source node and the data type node) are clickable – when you double-click these two nodes, Data Wrangler will display the table.
In our case, let’s right-click on the Data Types icon and Add a transform:
You should now see visualizations on top of each column. Please allow for some time for the charts to load. The latency depends on the size of the dataset (for the Titanic dataset, it should take 1-2 seconds in the default instance).
Hover over the horizontal bar at the top of each column to see its tooltip. Now that the charts have loaded, you can see the data distribution, invalid values, and missing values. Outliers and missing values are characteristics of erroneous data, and it’s critical to identify them because they could affect your results: if your data came from an unrepresentative sample, your findings may not generalize to situations outside of your study. The classification of values can be seen on the charts at the bottom, where valid values are represented in white, invalid values in blue, and missing values in purple. You can also look at the outliers depicted by the blue dots to the left or right of a chart.
All the visualizations come in the form of histograms. For non-categorical data, a bucket set is defined for each bin. For categorical data, each unique value is treated as a bin. On top of the histogram, there’s a bar chart showing you the invalid and missing values. We can view the ratio of valid values for Numeric, Categorical, Binary, Text, and Datetime types, as well as the ratio of missing values based on the total null and empty cells and, finally, the ratio of invalid values. Let’s look at some examples to understand how you can see these using Data Wrangler’s pre-loaded sample Titanic Dataset.
Example 1 – We can look at the 20% missing values for the AGE feature/column. It’s crucial to deal with missing data in the field of data-related research/ML, either by removing it or imputing it (handling the missing values with some estimation).
You can process missing values using the Handle missing values transform group. Use the Impute missing transform to generate imputed values where missing values were found in the input column. The configuration depends on your data type.
In this example, the AGE column has a numeric data type. For the imputing strategy, we can choose to impute the mean or the approximate median over the values that are present in your dataset.
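Outside Data Wrangler, the same idea can be expressed in a few lines of pandas. This is only a conceptual sketch of mean imputation (assuming a local copy of the Titanic dataset with an Age column), not the code that Data Wrangler generates:

```python
import pandas as pd

df = pd.read_csv("titanic.csv")  # hypothetical local copy of the Titanic dataset

# Mean imputation of the AGE column, conceptually what the "Impute missing" transform
# does when the mean strategy is selected for a numeric column.
df["Age"] = df["Age"].fillna(df["Age"].mean())

print(df["Age"].isna().sum())  # 0 missing values remain
```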
Now that we have added the transformation, we can see that the AGE column no longer has missing values.
Example 2 – We can look at the 27% invalid values for the TICKET feature/column, which is of the STRING type. Invalid data can produce biased estimates, which can reduce a model’s accuracy and result in false conclusions. Let us explore some transforms that we can utilize to handle the invalid data in the TICKET column.
Looking at the screenshot, we see that some of the inputs are written in a format that contains letters before the numerals, such as “PC 17318”, while others are just numerals, such as “11769”.
We can choose to apply a transform to search for and edit specific patterns within strings such as “PC” and replace them. Next, we can cast our string column to a new type such as Long for ease of use.
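As a rough code analogue of those two transforms (search and edit with a pattern, then cast), assuming the same hypothetical pandas DataFrame as before:

```python
import pandas as pd

df = pd.read_csv("titanic.csv")  # hypothetical local copy of the Titanic dataset

# Remove alphabetic prefixes such as "PC" so "PC 17318" becomes "17318",
# then cast the cleaned strings to a nullable 64-bit integer (Long) type.
cleaned = df["Ticket"].str.replace(r"[^0-9]", "", regex=True)
df["Ticket"] = pd.to_numeric(cleaned, errors="coerce").astype("Int64")
```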
This still leaves us with 19% missing values on the TICKET feature. Similar to example 1, we can now impute the missing values using the mean or approximate median. The TICKET feature should no longer have invalid or missing values, as shown in the following image.
Clean Up
To make sure that you don’t incur charges after following this tutorial, make sure that you shut down the Data Wrangler app.
Conclusion
In this post, we presented the new Amazon SageMaker Data Wrangler widget, which helps remove undifferentiated heavy lifting for end users during data preparation by automatically surfacing visualizations and data profiling insights for each feature. This widget makes it easy to visualize data (for example, categorical/non-categorical histograms), detect data quality issues (for example, missing values and invalid values), and surface data insights (for example, outliers and top N items).
You can start using this capability today in all of the regions where SageMaker Studio is available. Give it a try, and let us know what you think. We’re always looking forward to your feedback, either through your usual AWS support contacts, or on the AWS Forum for SageMaker.
About the Authors
Isha Dua is a Senior Solutions Architect based in the San Francisco Bay Area. She helps AWS Enterprise customers grow by understanding their goals and challenges, and guides them on how they can architect their applications in a cloud-native manner while making sure they are resilient and scalable. She’s passionate about machine learning technologies and environmental sustainability.
Parth Patel is a Solutions Architect at AWS in the San Francisco Bay Area. Parth guides customers to accelerate their journey to the cloud and helps them adopt the AWS Cloud successfully. He focuses on ML and application modernization.
How we count carbon emissions from electricity matters
Amazon advocates for updating carbon accounting to measure where renewable-energy projects will have the greatest impact.
Start your successful journey with time series forecasting with Amazon Forecast
Organizations of all sizes are striving to grow their business, improve efficiency, and serve their customers better than ever before. Even though the future is uncertain, a data-driven, science-based approach can help anticipate what lies ahead to successfully navigate through a sea of choices.
Every industry uses time series forecasting to address a variety of planning needs, including but not limited to:
- Developing a cash flow projection based on future expected revenues and expenses
- Estimating how many items to manufacture or purchase from suppliers to meet future demand
- Knowing where to stock inventory in retail settings to meet on-shelf availability while also minimizing stock-outs and product waste
- In wholesale or ecommerce settings, knowing where to position inventory within the supply chain network to maximize regional availability while also minimizing final mile delivery costs
- Having a system for detecting outliers in which future actuals far exceed or fall short of the expected plan
- Establishing specialized workforces in response to anticipated customer foot traffic, call center operations, manufacturing plans, and other similar workforce demand curves
In this post, we outline five best practices to get started with Amazon Forecast, and apply the power of highly-accurate machine learning (ML) forecasting to your business.
Why Amazon Forecast
AWS offers a fully managed time series forecasting service called Amazon Forecast that allows you to generate and maintain ongoing automated time series forecasts without requiring ML expertise. In addition, you can build and deploy repeatable forecasting operations without the need to write code, build ML models, or manage infrastructure.
The capabilities of Forecast allow it to serve a wide range of customer roles, from analysts and supply chain managers to developers and ML experts. There are several reasons why customers favor Forecast: it offers high accuracy, repeatable results, and the ability to self-serve without waiting on specialized technical resource availability. Forecast is also selected by data science experts because it provides highly accurate results, based on an ensemble of self-tuned models, and the flexibility to experiment quickly without having to deploy or manage clusters of any particular size. Its ML models also make it easier to support forecasts for a large number of items, and can generate accurate forecasts for cold-start items with no history.
Five best practices when getting started with Forecast
Forecast provides high accuracy and quick time to market for developers and data scientists. Although Forecast makes developing highly accurate time series models easy, a successful forecasting journey depends on multiple factors, some subtle, and usually calls for a little rigor and a couple of rounds of experimentation. This post provides best practices to speed up your onboarding and time to value.
These are some key items you should consider when starting to work with Forecast.
Start simple
As shown in the following flywheel, consider beginning with a simple model that uses a target time series dataset to develop a baseline as you propose your first set of input data. Subsequent experiments can add other temporal features and static metadata with the goal of improving model accuracy. Each time a change is made, you can measure and learn how much the change has helped, if at all. Depending on your assessment, you may decide to keep the new set of features, or pivot and try another option.
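For a first baseline, the CreateAutoPredictor API can be called with only the target time series dataset group. The following is a minimal sketch, assuming a dataset group already exists; the ARN, horizon, and frequency are placeholders:

```python
import boto3

forecast = boto3.client("forecast")

# Baseline: an AutoPredictor trained on the target time series alone.
response = forecast.create_auto_predictor(
    PredictorName="baseline-target-only",
    ForecastHorizon=30,
    ForecastFrequency="D",
    DataConfig={"DatasetGroupArn": "arn:aws:forecast:us-east-1:123456789012:dataset-group/demo"},
)
print(response["PredictorArn"])
```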
Focus on the outliers
With Forecast, you can obtain accuracy statistics for the entire dataset. It’s important to recognize that although this top-level statistic is interesting, it should be viewed as only directionally correct. You should concentrate on item-level accuracy statistics rather than top-level statistics. Consider the following scatterplot as a guide. Some of the items in the dataset will have high accuracy; for these, no action is required.
While building a model, you should explore some of the points labeled as “exploratory time-series.” In these exploratory cases, determine how to improve accuracy by incorporating more input data, such as price variations, promotional spend, explicit seasonality features, and the inclusion of local, market, global, and other real-world events and conditions.
Review predictor accuracy before creating forecasts
Don’t create future-dated forecasts with Forecast until you have reviewed prediction accuracy during the backtest period. The preceding scatterplot illustrates time series level accuracy, which is your best indication of what future-dated predictions will look like, all other things being equal. If this period isn’t providing your required level of accuracy, don’t proceed with the future-dated forecast operation, because this may lead to inefficient spend. Instead, focus on augmenting your input data and trying another round at the innovation flywheel, as discussed earlier.
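One way to review backtest accuracy before generating forecasts is the GetAccuracyMetrics API. A minimal sketch, assuming the predictor ARN from the previous step:

```python
import boto3

forecast = boto3.client("forecast")

# Inspect backtest accuracy metrics before spending on future-dated forecasts.
metrics = forecast.get_accuracy_metrics(
    PredictorArn="arn:aws:forecast:us-east-1:123456789012:predictor/baseline-target-only"
)
for result in metrics["PredictorEvaluationResults"]:
    for window in result["TestWindows"]:
        print(window["Metrics"])
```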
Reduce training time
You can reduce training time through two mechanisms. First, use Forecast’s retrain function to help reduce training time through transfer learning. Second, prevent model drift with predictor monitoring by training only when necessary.
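Retraining is requested through CreateAutoPredictor with a ReferencePredictorArn. A minimal sketch with placeholder names:

```python
import boto3

forecast = boto3.client("forecast")

# Retraining reuses the prior predictor's configuration (transfer learning),
# which typically shortens training time compared with training from scratch.
response = forecast.create_auto_predictor(
    PredictorName="baseline-target-only-retrained",
    ReferencePredictorArn="arn:aws:forecast:us-east-1:123456789012:predictor/baseline-target-only",
)
print(response["PredictorArn"])
```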
Build repeatable processes
We encourage you not to build Forecast workflows through the AWS Management Console or using APIs from scratch until you have at least evaluated our AWS samples GitHub repo. Our mission with GitHub samples is to help remove friction and expedite your time-to-market with repeatable workflows that have already been thoughtfully designed. These workflows are serverless and can be scheduled to run on a regular schedule.
Visit our official GitHub repo, where you can quickly deploy our solution guidance by following the steps provided. As shown in the following figure, the workflow provides a complete end-to-end pipeline that can retrieve historical data, import it, build models, and produce inference against the models—all without needing to write code.
The following figure offers a deeper view into just one module, which is able to harvest historical data for model training from a myriad of database sources that are supported by Amazon Athena Federated Query.
Get started today
You can implement a fully automated production workflow in a matter of days to weeks, especially when paired with our workflow orchestration pipeline available at our GitHub sample repository.
This re:Invent video highlights a use case of a customer who automated their workflow using this GitHub model:
Forecast has many built-in capabilities to help you achieve your business goals through highly accurate ML-based forecasting. We encourage you to contact your AWS account team if you have any questions, and to let them know you would like to speak with a time series specialist who can provide guidance and direction. We can also offer workshops to assist you in learning how to use Forecast.
We are here to support you and your organization as you endeavor to automate and improve demand forecasting in your company. A more accurate forecast can result in higher sales, a significant reduction in waste, a reduction in idle inventory, and ultimately higher levels of customer service.
Take action today; there is no better time than the present to begin creating a better tomorrow.
About the Author
Charles Laughlin is a Principal AI/ML Specialist Solution Architect and works inside the Time Series ML team at AWS. He helps shape the Amazon Forecast service roadmap and collaborates daily with diverse AWS customers to help transform their businesses using cutting-edge AWS technologies and thought leadership. Charles holds an M.S. in Supply Chain Management and has spent the past decade working in the consumer packaged goods industry.
Dan Sinnreich is a Sr. Product Manager for Amazon Forecast. He is focused on democratizing low-code/no-code machine learning and applying it to improve business outcomes. Outside of work, he can be found playing hockey, trying to improve his tennis serve, scuba diving, and reading science fiction.
Chronomics detects COVID-19 test results with Amazon Rekognition Custom Labels
Chronomics is a tech-bio company that uses biomarkers—quantifiable information taken from the analysis of molecules—alongside technology to democratize the use of science and data to improve the lives of people. Their goal is to analyze biological samples and give actionable information to help you make decisions—about anything where knowing more about the unseen is important. Chronomics’s platform enables providers to seamlessly implement at-home diagnostics at scale—all without sacrificing efficiency or accuracy. It has already processed millions of tests through this platform and delivers a high-quality diagnostics experience.
During the COVID-19 pandemic, Chronomics sold lateral flow tests (LFT) for detecting COVID-19. The users register the test on the platform by uploading a picture of the test cassette and entering a manual reading of the test (positive, negative or invalid). With the increase in the number of tests and users, it quickly became impractical to manually verify if the reported result matched the result in the picture of the test. Chronomics wanted to build a scalable solution that uses computer vision to verify the results.
In this post, we share how Chronomics used Amazon Rekognition to automatically detect the results of a COVID-19 lateral flow test.
Preparing the data
The following image shows the picture of a test cassette uploaded by a user. The dataset consists of images like this one. These images are to be classified as positive, negative, or invalid, corresponding to the outcome of a COVID-19 test.
The main challenges with the dataset were the following:
- Imbalanced dataset – The dataset was extremely skewed. More than 90% of the samples were from the negative class.
- Unreliable user inputs – Readings that were manually reported by the users were not reliable. Around 40% of the readings didn’t match the actual result from the picture.
To create a high-quality training dataset, Chronomics engineers decided to follow these steps:
- Manual annotation – Manually select and label 1,000 images to ensure that the three classes are evenly represented
- Image augmentation – Augment the labeled images to increase the number to 10,000
Image augmentation was performed using Albumentations, an open-source Python library. A number of transformations like rotation, rescale, and brightness were performed to generate 9,000 synthetic images. These synthetic images were added to the original images to create a high-quality dataset.
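The post doesn’t list the exact Albumentations transforms, so the following is only an illustrative sketch of such a pipeline (rotation, rescaling, brightness), with a hypothetical file name:

```python
import albumentations as A
import cv2

# Illustrative augmentation pipeline: rotation, rescaling, and brightness changes.
transform = A.Compose([
    A.Rotate(limit=15, p=0.7),
    A.RandomScale(scale_limit=0.1, p=0.5),
    A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.0, p=0.5),
])

image = cv2.imread("cassette.jpg")  # hypothetical source image of a test cassette
augmented = transform(image=image)["image"]
cv2.imwrite("cassette_aug_0.jpg", augmented)
```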
Building a custom computer vision model with Amazon Rekognition
Chronomics’s engineers turned towards Amazon Rekognition Custom Labels, a feature of Amazon Rekognition with AutoML capabilities. After training images are provided, it can automatically load and inspect the data, select the right algorithms, train a model, and provide model performance metrics. This significantly accelerates the process of training and deploying a computer vision model, making it the primary reason for Chronomics to adopt Amazon Rekognition. With Amazon Rekognition, we were able to get a highly accurate model in 3–4 weeks as opposed to spending 4 months trying to build a custom model to achieve the desired performance.
The following diagram illustrates the model training pipeline. The annotated images were first preprocessed using an AWS Lambda function. This preprocessing step ensured that the images were in the appropriate file format and also performed some additional steps like resizing the image and converting the image from RGB to grayscale. It was observed that this improved the performance of the model.
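The Lambda code isn’t shown in the post; a hedged sketch of the core preprocessing steps (resize and RGB-to-grayscale) with Pillow might look like the following, with the event shape, bucket, and key treated as placeholders:

```python
import io

import boto3
from PIL import Image

s3 = boto3.client("s3")

def handler(event, context):
    # Placeholders: the real event shape and bucket/key names depend on how the pipeline is wired.
    bucket, key = event["bucket"], event["key"]

    obj = s3.get_object(Bucket=bucket, Key=key)
    image = Image.open(io.BytesIO(obj["Body"].read()))

    processed = image.convert("L").resize((512, 512))  # grayscale + fixed size (illustrative)

    buffer = io.BytesIO()
    processed.save(buffer, format="PNG")
    s3.put_object(Bucket=bucket, Key=f"processed/{key}", Body=buffer.getvalue())
```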
After the model has been trained, it can be deployed for inference using just a single click or API call.
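The API route is a single StartProjectVersion call; a minimal sketch with a placeholder project version ARN:

```python
import boto3

rekognition = boto3.client("rekognition")

# Start the trained Rekognition Custom Labels model so it can serve inference requests.
rekognition.start_project_version(
    ProjectVersionArn="arn:aws:rekognition:us-east-1:123456789012:project/lft/version/1/1234567890123",
    MinInferenceUnits=1,
)
```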
Model performance and fine-tuning
The model yielded an accuracy of 96.5% and an F1 score of 97.9% on a set of out-of-sample images. The F1 score is a measure that uses both precision and recall to measure the performance of a classifier. The DetectCustomLabels API is used to detect the labels of a supplied image during inference. The API also returns the confidence that Rekognition Custom Labels has in the accuracy of the predicted label. The following chart has the distribution of the confidence scores of the predicted labels for the images. The x-axis represents the confidence score multiplied by 100, and the y-axis is the count of the predictions in log scale.
By setting a threshold on the confidence score, we can filter out predictions that have a lower confidence. A threshold of 0.99 resulted in an accuracy of 99.6%, and 5% of the predictions were discarded. A threshold of 0.999 resulted in an accuracy of 99.87%, with 27% of the predictions discarded. In order to deliver the right business value, Chronomics picked a threshold of 0.99 to maximize the accuracy and minimize the rejection of predictions. For more information, see Analyzing an image with a trained model.
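A hedged sketch of inference with DetectCustomLabels and the 0.99 confidence cut-off follows; note that the API expects MinConfidence as a percentage, and the ARN and S3 names are placeholders:

```python
import boto3

rekognition = boto3.client("rekognition")

response = rekognition.detect_custom_labels(
    ProjectVersionArn="arn:aws:rekognition:us-east-1:123456789012:project/lft/version/1/1234567890123",
    Image={"S3Object": {"Bucket": "my-test-bucket", "Name": "cassette.jpg"}},
    MinConfidence=99.0,  # drop low-confidence predictions; route them to human review instead
)

if response["CustomLabels"]:
    top = response["CustomLabels"][0]
    print(top["Name"], top["Confidence"])
else:
    print("Prediction below threshold; send to Amazon A2I for manual review")
```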
The discarded predictions can also be routed to a human in the loop using Amazon Augmented AI (Amazon A2I) for manually processing the image. For more information on how to do this, refer to Use Amazon Augmented AI with Amazon Rekognition.
The following image is an example where the model has correctly identified the test as invalid with a confidence of 0.999.
Conclusion
In this post, we showed how Chronomics quickly built and deployed a scalable computer vision-based solution that uses Amazon Rekognition to detect the result of a COVID-19 lateral flow test. The Amazon Rekognition API makes it very easy for practitioners to accelerate the process of building computer vision models.
Learn about how you can train computer vision models for your specific business use case by visiting Getting started with Amazon Rekognition custom labels and by reviewing the Amazon Rekognition Custom Labels Guide.
About the Authors
Mattia Spinelli is a Senior Machine Learning Engineer at Chronomics, a biomedical company. Chronomics’s platform enables providers to seamlessly implement at-home diagnostics at scale—all without sacrificing efficiency or accuracy.
Pinak Panigrahi works with customers to build machine learning driven solutions to solve strategic business problems on AWS. When not occupied with machine learning, he can be found taking a hike, reading a book or catching up with sports.
Jay Rao is a Principal Solutions Architect at AWS. He enjoys providing technical and strategic guidance to customers and helping them design and implement solutions on AWS.
Pashmeen Mistry is a Senior Product Manager at AWS. Outside of work, Pashmeen enjoys adventurous hikes, photography, and spending time with his family.
reMARS revisited: Human-like reasoning for an AI
Learn what goes into Amazon’s effort to develop human-like reasoning for Alexa.
“I want machines to write as fluently as humans”
Amazon Machine Learning Fellow Jiao Sun works on strategies to control text generation.
Image augmentation pipeline for Amazon Lookout for Vision
Amazon Lookout for Vision provides a machine learning (ML)-based anomaly detection service to identify normal images (i.e., images of objects without defects) vs. anomalous images (i.e., images of objects with defects), types of anomalies (e.g., missing piece), and the location of these anomalies. Therefore, Lookout for Vision is popular among customers looking for automated solutions for industrial quality inspection (e.g., detecting abnormal products). However, customers’ datasets usually face two problems:
- The number of images with anomalies could be very low and might not reach the minimum number of anomalies per defect type imposed by Lookout for Vision (~20).
- Normal images might not have enough diversity and might result in the model failing when environmental conditions such as lighting change in production
To overcome these problems, this post introduces an image augmentation pipeline that targets both: it generates synthetic anomalous images by removing objects from images, and it generates additional normal images by introducing controlled augmentation such as Gaussian noise, hue, saturation, and pixel value scaling. For the second problem, we use the imgaug library to generate additional anomalous and normal images through controlled augmentation. For the first problem, we use Amazon SageMaker Ground Truth to generate object removal masks and the LaMa algorithm to remove objects using image inpainting (object removal) techniques.
The rest of the post is organized as follows. In Section 3, we present the image augmentation pipeline for normal images. In Section 4, we present the image augmentation pipeline for abnormal images (aka synthetic defect generation). Section 5 illustrates the Lookout for Vision training results using the augmented dataset. Section 6 demonstrates how the Lookout for Vision model trained on synthetic data performs against real defects. In Section 7, we talk about cost estimation for this solution. All of the code we used for this post can be accessed here.
1. Solution overview
The following is the diagram of the proposed image augmentation pipeline for Lookout for Vision anomaly localization model training:
The diagram above starts by collecting a series of images (step 1). We augment the dataset by augmenting the normal images (step 3) and by using object removal algorithms (steps 2, 5-6). We then package the data in a format that can be consumed by Amazon Lookout for Vision (steps 7-8). Finally, in step 9, we use the packaged data to train a Lookout for Vision localization model.
This image augmentation pipeline gives customers flexibility to generate synthetic defects from a limited sample dataset, as well as add more quantity and variety to normal images. It boosts the performance of the Lookout for Vision service, addressing the lack of customer data and making the automated quality inspection process smoother.
2. Data preparation
From here to the end of the post, we use the public FICS-PCB: A Multi-Modal Image Dataset for Automated Printed Circuit Board Visual Inspection dataset licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) License to illustrate the image augmentation pipeline and the consequent Lookout for Vision training and testing. This dataset is designed to support the evaluation of automated PCB visual inspection systems. It was collected at the SeCurity and AssuraNce (SCAN) lab at the University of Florida. It can be accessed here.
We start with the assumption that the customer provides only a single normal image of a PCB board (an s10 PCB sample) as the dataset, which can be seen as follows:
3. Image augmentation for normal images
The Lookout for Vision service requires at least 20 normal images and 20 anomalies per defect type. Since there is only one normal image from the sample data, we must generate more normal images using image augmentation techniques. From the ML standpoint, feeding multiple image transformations using different augmentation techniques can improve the accuracy and robustness of the model.
We’ll use imgaug for image augmentation of normal images. Imgaug is an open-source Python package that lets you augment images in ML experiments.
First, we’ll install the imgaug library in an Amazon SageMaker notebook.
Next, we can install the Python package named ‘IPyPlot’.
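Both installs fit in a single notebook cell; a minimal sketch (pin versions as needed):

```python
# Install the augmentation and plotting libraries used in this walkthrough
!pip install imgaug ipyplot
```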
Then, we perform image augmentation of the original image using transformations including GammaContrast, SigmoidContrast, and LinearContrast, and adding Gaussian noise to the image.
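A sketch of those four augmenters with imgaug follows; the parameter ranges are illustrative rather than the values used in the original notebook, and the file name is a placeholder:

```python
import imageio.v2 as imageio
import imgaug.augmenters as iaa

image = imageio.imread("s10_pcb.jpg")  # hypothetical file name for the single normal PCB image

augmenters = {
    "gamma": iaa.GammaContrast((0.5, 2.0)),
    "sigmoid": iaa.SigmoidContrast(gain=(3, 10), cutoff=(0.4, 0.6)),
    "linear": iaa.LinearContrast((0.4, 1.6)),
    "gaussian_noise": iaa.AdditiveGaussianNoise(scale=(0, 0.1 * 255)),
}

# Generate 10 augmented variants per transformation (40 normal images in total).
for name, aug in augmenters.items():
    for i in range(10):
        imageio.imwrite(f"normal_{name}_{i}.jpg", aug(image=image))
```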
Since we need at least 20 normal images, and the more the better, we generated 10 augmented images for each of the 4 transformations shown above as our normal image dataset. In the future, we plan to also transform the images to be positioned at different locations and different angles so that the trained model can be less sensitive to the placement of the object relative to the fixed camera.
4. Synthetic defect generation for augmentation of abnormal images
In this section, we present a synthetic defect generation pipeline to augment the number of images with anomalies in the dataset. Note that, as opposed to the previous section where we create new normal samples from existing normal samples, here we create new anomaly images from normal samples. This is an attractive feature for customers that completely lack this kind of image in their datasets, e.g., removing a component of the normal PCB board. This synthetic defect generation pipeline has three steps: first, we generate synthetic masks from source (normal) images using Amazon SageMaker Ground Truth. In this post, we target a specific defect type: missing component. This mask generation provides a mask image and a manifest file. Second, the manifest file must be modified and converted to an input file for a SageMaker endpoint. And third, the input file is fed to an Object Removal SageMaker endpoint responsible for removing the parts of the normal image indicated by the mask. This endpoint provides the resulting abnormal image.
4.1 Generate synthetic defect masks using Amazon SageMaker Ground Truth
Amazon SageMaker Ground Truth for data labeling
Amazon SageMaker Ground Truth is a data labeling service that makes it easy to label data and gives you the option to use human annotators through Amazon Mechanical Turk, third-party vendors, or your own private workforce. You can follow this tutorial to set up a labeling job.
In this section, we’ll show how we use Amazon SageMaker Ground Truth to mark specific “components” in normal images to be removed in the next step. Note that a key contribution of this post is that we don’t use Amazon SageMaker Ground Truth in its traditional way (that is, to label training images). Here, we use it to generate a mask for future removal in normal images. These removals in normal images will generate the synthetic defects.
For the purpose of this post, in our labeling job we’ll artificially remove up to three components from the PCB board: IC, resistor1, and resistor2. After entering the labeling job as a labeler, you can select the label name and draw a mask of any shape around the component that you want to remove from the image as a synthetic defect. Note that you can’t include ‘_’ in the label name for this experiment, since we use ‘_’ to separate different metadata in the defect name later in the code.
In the following picture, we draw a green mask around IC (Integrated Circuit), a blue mask around resistor 1, and an orange mask around resistor 2.
After we select the submit button, Amazon SageMaker Ground Truth will generate an output mask with white background and a manifest file as follows:
Note that so far we haven’t generated any abnormal images. We just marked the three components that will be artificially removed and whose removal will generate abnormal images. Later, we’ll use both (1) the mask image above, and (2) the information from the manifest file as inputs for the abnormal image generation pipeline. The next section shows how to prepare the input for the SageMaker endpoint.
4.2 Prepare Input for SageMaker endpoint
Transform the Amazon SageMaker Ground Truth manifest into a SageMaker endpoint input file
First, we set up an Amazon Simple Storage Service (Amazon S3) bucket to store all of the input and output for the image augmentation pipeline. In the post, we use an S3 bucket named qualityinspection. Then we generate all of the augmented normal images and upload them to this S3 bucket.
Next, we download the mask from Amazon SageMaker Ground Truth and upload it to a folder named ‘mask’ in that S3 bucket.
After that, we download the manifest file from Amazon SageMaker Ground Truth labeling job and read it as json lines.
Lastly, we generate an input dictionary that records the input image’s S3 location, mask location, mask information, and so on, save it as a txt file, and then upload it to the ‘input’ folder of the target S3 bucket.
The following is a sample input file:
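The exact schema isn’t reproduced here, but a hypothetical record illustrating the kind of fields described above (image location, mask location, and mask metadata) might look like the following; all names and values are placeholders:

```json
{
  "input_image": "s3://qualityinspection/normal/10_im.jpg",
  "mask_image": "s3://qualityinspection/mask/10_im_mask.png",
  "mask_colors": {"IC": "green", "resistor1": "blue", "resistor2": "orange"},
  "output_prefix": "s3://qualityinspection/synthetic_defects/"
}
```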
4.3 Create Asynchronous SageMaker endpoint to generate synthetic defects with missing components
4.3.1 LaMa Model
To remove components from the original image, we’re using an open-source PyTorch model called LaMa from LaMa: Resolution-robust Large Mask Inpainting with Fourier Convolutions. It’s a resolution-robust large mask inpainting model with Fourier convolutions developed by Samsung AI. The inputs for the model are an image and a black and white mask, and the output is an image with the objects inside the mask removed. We use Amazon SageMaker Ground Truth to create the original mask, and then transform it to a black and white mask as required. The LaMa model application is demonstrated as follows:
4.3.2 Introducing Amazon SageMaker Asynchronous inference
Amazon SageMaker Asynchronous Inference is a new inference option in Amazon SageMaker that queues incoming requests and processes them asynchronously. Asynchronous inference enables users to save on costs by autoscaling the instance count to zero when there are no requests to process. This means that you only pay when your endpoint is processing requests. The new asynchronous inference option is ideal for workloads where the request sizes are large (up to 1GB) and inference processing times are in the order of minutes. The code to deploy and invoke the endpoint is here.
4.3.3 Endpoint deployment
To deploy the asynchronous endpoint, first we must get the IAM role and set up some environment variables.
As we mentioned before, we’re using the open-source PyTorch model LaMa: Resolution-robust Large Mask Inpainting with Fourier Convolutions, and the pre-trained model has been uploaded to s3://qualityinspection/model/big-lama.tar.gz. The image_uri points to a Docker container with the required framework and Python versions.
Then, we must specify additional asynchronous inference specific configuration parameters while creating the endpoint configuration.
Next, we deploy the endpoint on a ml.g4dn.xlarge instance by running the following code:
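The deployment code isn’t reproduced here; a hedged sketch with the SageMaker Python SDK might look like the following, where inference.py is a placeholder for the custom LaMa handler and the SDK resolves the container image from the framework and Python versions:

```python
import sagemaker
from sagemaker.async_inference import AsyncInferenceConfig
from sagemaker.pytorch import PyTorchModel

role = sagemaker.get_execution_role()

model = PyTorchModel(
    model_data="s3://qualityinspection/model/big-lama.tar.gz",
    role=role,
    entry_point="inference.py",   # placeholder: custom handler that runs LaMa inpainting
    framework_version="1.10",
    py_version="py38",
)

async_config = AsyncInferenceConfig(
    output_path="s3://qualityinspection/output/",  # where asynchronous results are written
    max_concurrent_invocations_per_instance=2,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
    async_inference_config=async_config,
)
```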
After approximately 6-8 minutes, the endpoint is created successfully, and it will show up in the SageMaker console.
4.3.4 Invoke the endpoint
Next, we use the input txt file we generated earlier as the input of the endpoint and invoke the endpoint using the following code:
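A hedged sketch of the invocation with the low-level SageMaker runtime client, assuming the input file was uploaded to the bucket’s input folder and using a placeholder endpoint name:

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

# Asynchronous invocation returns immediately; the result appears later in the
# endpoint's configured S3 output location.
response = runtime.invoke_endpoint_async(
    EndpointName="lama-object-removal-endpoint",  # placeholder endpoint name
    InputLocation="s3://qualityinspection/input/input.txt",
    ContentType="application/json",
)
print(response["OutputLocation"])
```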
The above command will finish execution immediately. However, the inference will continue for several minutes until it completes all of the tasks and returns all of the outputs in the S3 bucket.
4.3.5 Check the inference result of the endpoint
After you select the endpoint, you’ll see the Monitor section. Select ‘View logs’ to check the inference results in the console.
Two log records will show up in Log streams. The one named data-log will show the final inference result, while the other log record will show the details of the inference, which is usually used for debug purposes.
If the inference request succeeds, then you’ll see the message Inference request succeeded. in the data-log, along with information such as the total model latency and total process time. If the inference fails, then check the other log to debug. You can also check the result by polling the status of the inference request. Learn more about Amazon SageMaker Asynchronous Inference here.
4.3.6 Generating synthetic defects with missing components using the endpoint
We’ll complete four tasks in the endpoint:
- The Lookout for Vision anomaly localization service requires one defect per image in the training dataset to optimize model performance. Therefore, we must separate the masks for different defects in the endpoint by color filtering.
- Split the dataset into train and test sets to satisfy the following requirements:
- at least 10 normal images and 10 anomalies for train dataset
- one defect/image in train dataset
- at least 10 normal images and 10 anomalies for test dataset
- multiple defects per image is allowed for the test dataset
- Generate synthetic defects and upload them to the target S3 locations.
We generate one defect per image and more than 20 defects per class for the train dataset, as well as 1-3 defects per image and more than 20 defects per class for the test dataset.
The following is an example of the source image and its synthetic defects with three components: IC, resistor1, and resistor 2 missing.
original image
40_im_mask_IC_resistor1_resistor2.jpg (the defect name indicates the missing components)
- Generate manifest files for train/test dataset recording all of the above information.
Finally, we’ll generate train/test manifests to record information, such as synthetic defect S3 location, mask S3 location, defect class, mask color, etc.
The following are sample json lines for an anomaly and a normal image in the manifest.
For anomaly:
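The manifest itself isn’t reproduced here; a hypothetical anomaly record in the Ground Truth-style format consumed by Lookout for Vision segmentation models might look like the following (shown pretty-printed, although each record occupies a single line in the manifest; all paths, dates, colors, and class names are illustrative):

```json
{
  "source-ref": "s3://qualityinspection/train/anomaly/40_im_mask_IC.jpg",
  "anomaly-label": 1,
  "anomaly-label-metadata": {
    "class-name": "anomaly",
    "confidence": 1,
    "human-annotated": "yes",
    "creation-date": "2022-10-01T00:00:00.000",
    "type": "groundtruth/image-classification"
  },
  "anomaly-mask-ref": "s3://qualityinspection/mask/40_im_mask_IC.png",
  "anomaly-mask-ref-metadata": {
    "internal-color-map": {
      "0": {"class-name": "BACKGROUND", "hex-color": "#ffffff", "confidence": 0.0},
      "1": {"class-name": "missing-IC", "hex-color": "#2ca02c", "confidence": 0.0}
    },
    "type": "groundtruth/semantic-segmentation",
    "human-annotated": "yes",
    "creation-date": "2022-10-01T00:00:00.000"
  }
}
```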
Amazon SageMaker JumpStart now offers Amazon Comprehend notebooks for custom classification and custom entity detection
Amazon Comprehend is a natural language processing (NLP) service that uses machine learning (ML) to discover insights from text. Amazon Comprehend provides customized features such as custom entity recognition and custom classification, as well as pre-trained APIs such as key phrase extraction, sentiment analysis, entity recognition, and more, so you can easily integrate NLP into your applications.
We recently added Amazon Comprehend related notebooks in Amazon SageMaker JumpStart notebooks that can help you quickly get started using the Amazon Comprehend custom classifier and custom entity recognizer. You can use custom classification to organize documents into categories (classes) that you define. Custom entity recognition extends the capability of the Amazon Comprehend pre-trained entity detection API by helping you identify entity types that are unique to your domain or business that aren’t in the preset generic entity types.
In this post, we show you how to use JumpStart to build Amazon Comprehend custom classification and custom entity detection models as part of your enterprise NLP needs.
SageMaker JumpStart
The Amazon SageMaker Studio landing page provides the option to use JumpStart. JumpStart provides a quick way to get started by providing pre-trained models for a variety of problem types. You can train and tune these models. JumpStart also provides other resources like notebooks, blogs, and videos.
JumpStart notebooks are essentially sample code that you can use as a starting point. Currently, we provide you with over 40 notebooks that you can use as is or customize as needed. You can find notebooks by using search or the tabbed view panel. After you find the notebook you want to use, you can import it, customize it for your requirements, and select the infrastructure and environment to run it on.
Get started with JumpStart notebooks
To get started with JumpStart, go to the Amazon SageMaker console and open Studio. Refer to Get Started with SageMaker Studio for instructions on how to get started with Studio. Then complete the following steps:
- In Studio, go to the launch page of JumpStart and choose Go to SageMaker JumpStart.
You’re offered multiple ways to search. You may either use tabs on the top to get to what you want, or use the search box as shown in the following screenshot.
- To find notebooks, we go to the Notebooks tab.
At the time of writing, JumpStart offers 47 notebooks. You can use filters to find Amazon Comprehend related notebooks.
- On the Content Type drop-down menu, choose Notebook.
As you can see in the following screenshot, we currently have two Amazon Comprehend notebooks.
In the following sections, we explore both notebooks.
Amazon Comprehend Custom Classifier
In this notebook, we demonstrate how to use the custom classifier API to create a document classification model.
The custom classifier is a fully managed Amazon Comprehend feature that lets you build custom text classification models that are unique to your business, even if you have little or no ML expertise. The custom classifier builds on the existing capabilities of Amazon Comprehend, which are already trained on tens of millions of documents. It abstracts much of the complexity required to build an NLP classification model. The custom classifier automatically loads and inspects the training data, selects the right ML algorithms, trains your model, finds the optimal hyperparameters, tests the model, and provides model performance metrics. The Amazon Comprehend custom classifier also provides an easy-to-use console for the entire ML workflow, including labeling text using Amazon SageMaker Ground Truth, training and deploying a model, and visualizing the test results. With an Amazon Comprehend custom classifier, you can build the following models:
- Multi-class classification model – In multi-class classification, each document can have one and only one class assigned to it. The individual classes are mutually exclusive. For example, a movie can be classed as a documentary or as science fiction, but not both at the same time.
- Multi-label classification model – In multi-label classification, individual classes represent different categories, but these categories are somehow related and not mutually exclusive. As a result, each document has at least one class assigned to it, but can have more. For example, a movie can simply be an action movie, or it can be an action movie, a science fiction movie, and a comedy, all at the same time.
This notebook requires no ML expertise to train a model with the example dataset or with your own business specific dataset. You can use the API operations discussed in this notebook in your own applications.
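The notebook’s core training call is the CreateDocumentClassifier API; a minimal sketch with placeholder names and paths, showing multi-class mode:

```python
import boto3

comprehend = boto3.client("comprehend")

# Train a multi-class custom classifier from a two-column CSV (label, document text) in S3.
response = comprehend.create_document_classifier(
    DocumentClassifierName="news-topic-classifier",  # placeholder name
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendDataAccess",
    InputDataConfig={"S3Uri": "s3://my-bucket/comprehend/train/train.csv"},
    LanguageCode="en",
    Mode="MULTI_CLASS",
)
print(response["DocumentClassifierArn"])
```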
Amazon Comprehend Custom Entity Recognizer
In this notebook, we demonstrate how to use the custom entity recognition API to create an entity recognition model.
Custom entity recognition extends the capabilities of Amazon Comprehend by helping you identify your specific entity types that aren’t in the preset generic entity types. This means that you can analyze documents and extract entities like product codes or business-specific entities that fit your particular needs.
Building an accurate custom entity recognizer on your own can be a complex process, requiring preparation of large sets of manually annotated training documents and selecting the right algorithms and parameters for model training. Amazon Comprehend helps reduce the complexity by providing automatic annotation and model development to create a custom entity recognition model.
The example notebook takes the training dataset in CSV format and runs inference against text input. Amazon Comprehend also supports an advanced use case that takes Ground Truth annotated data for training and allows you to directly run inference on PDFs and Word documents. For more information, refer to Build a custom entity recognizer for PDF documents using Amazon Comprehend.
Amazon Comprehend has lowered the annotation limits and allowed you to get more stable results, especially for few-shot subsamples. For more information about this improvement, refer to Amazon Comprehend announces lower annotation limits for custom entity recognition.
This notebook requires no ML expertise to train a model with the example dataset or with your own business specific dataset. You can use the API operations discussed in this notebook in your own applications.
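Likewise, the entity recognizer notebook centers on the CreateEntityRecognizer API; a minimal sketch using an entity list, with placeholder names and paths:

```python
import boto3

comprehend = boto3.client("comprehend")

# Train a custom entity recognizer from raw documents plus an entity list CSV (Text, Type columns).
response = comprehend.create_entity_recognizer(
    RecognizerName="product-code-recognizer",  # placeholder name
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendDataAccess",
    LanguageCode="en",
    InputDataConfig={
        "EntityTypes": [{"Type": "PRODUCT_CODE"}],
        "Documents": {"S3Uri": "s3://my-bucket/comprehend/docs/"},
        "EntityList": {"S3Uri": "s3://my-bucket/comprehend/entity_list.csv"},
    },
)
print(response["EntityRecognizerArn"])
```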
Use, customize, and deploy Amazon Comprehend JumpStart notebooks
After you select the Amazon Comprehend notebook you want to use, choose Import notebook. As you do that, you can see the notebook kernel starting.
Importing your notebook triggers selection of the notebook instance, kernel, and image that is used to run the notebook. After the default infrastructure is provisioned, you can change the selections as per your requirements.
Now, go over the outline of the notebook and carefully read the sections for prerequisites setup, data setup, training the model, running inference, and stopping the model. Feel free to customize the generated code per your needs.
Based on your requirements, you may want to customize the following sections:
- Permissions – For a production application, we recommend restricting access policies to only those needed to run the application. Permissions can be restricted based on the use case, such as training or inference, and specific resource names, such as a full Amazon Simple Storage Service (Amazon S3) bucket name or an S3 bucket name pattern. You should also restrict access to the custom classifier or SageMaker operations to just those that your application needs.
- Data and location – The example notebook provides you sample data and S3 locations. Based on your requirements, you may use your own data for training, validation, and testing, and use different S3 locations as needed. Similarly, when the model is created, you can choose to keep the model at different locations. Just make sure you have provided the right permissions to access S3 buckets.
- Preprocessing steps – If you’re using different data for training and testing, you may want to adjust the preprocessing steps per your requirements.
- Testing data – You can bring your own inference data for testing.
- Clean up – Delete the resources launched by the notebook to avoid recurring charges.
Conclusion
In this post, we showed you how to use JumpStart to learn and fast-track using Amazon Comprehend APIs by making it convenient to find and run Amazon Comprehend related notebooks from Studio while having the option to modify the code as needed. The notebooks use sample datasets with AWS product announcements and sample news articles. You may use this notebook to learn how to use Amazon Comprehend APIs in a Python notebook, or you may use it as a starting point and expand the code further for your unique requirements and production deployments.
You can start using JumpStart and take advantage of over 40 notebooks in various topics in all Regions where Studio is available at no additional cost.
About the Authors
Lana Zhang is a Sr. Solutions Architect at the AWS WWSO AI Services team with expertise in AI and ML for Content Moderation and Rekognition. She is passionate about promoting AWS AI services and helping customers transform their business solutions.
Meenakshisundaram Thandavarayan is a Senior AI/ML specialist with AWS. He helps hi-tech strategic accounts on their AI and ML journey. He is very passionate about data-driven AI.
Rachna Chadha is a Principal Solution Architect AI/ML in Strategic Accounts at AWS. Rachna is an optimist who believes that ethical and responsible use of AI can improve society in the future and bring economic and social prosperity. In her spare time, Rachna likes spending time with her family, hiking, and listening to music.
AWS VP Bratin Saha: ML is becoming a ‘mainstream endeavor’
Vice president of ML and AI Services says more than 100,000 customers are doing machine learning on AWS.