From forecasting demand to ordering – An automated machine learning approach with Amazon Forecast to decrease stockouts, excess inventory, and costs

This post is a guest joint collaboration by Supratim Banerjee of More Retail Limited and Shivaprasad KT and Gaurav H Kankaria of Ganit Inc.

More Retail Ltd. (MRL) is one of India’s top four grocery retailers, with a revenue in the order of several billion dollars. It has a store network of 22 hypermarkets and 624 supermarkets across India, supported by a supply chain of 13 distribution centers, 7 fruits and vegetables collection centers, and 6 staples processing centers.

With such a large network, it’s critical for MRL to deliver the right product quality at the right economic value, while meeting customer demand and keeping operational costs to a minimum. MRL collaborated with Ganit as its AI analytics partner to forecast demand with greater accuracy and build an automated ordering system to overcome the bottlenecks and deficiencies of manual judgment by store managers. MRL used Amazon Forecast to increase their forecasting accuracy from 24% to 76%, leading to a reduction in wastage by up to 30% in the fresh produce category, improving in-stock rates from 80% to 90%, and increasing gross profit by 25%.

We were successful in achieving these business results and building an automated ordering system because of two primary reasons:

  • Ability to experiment – Forecast provides a flexible and modular platform through which we ran more than 200 experiments using different regressors and types of models, which included both traditional and ML models. The team followed a Kaizen approach, learning from previously unsuccessful models, and deploying models only when they were successful. Experimentation continued on the side while winning models were deployed.
  • Change management – We asked category owners who were used to placing orders using business judgment to trust the ML-based ordering system. A systemic adoption plan ensured that the tool’s results were stored, and the tool was operated with a disciplined cadence, so that in filled and current stock were identified and recorded on time.

Complexity in forecasting the fresh produce category

Forecasting demand for the fresh produce category is challenging because fresh products have a short shelf life. With over-forecasting, stores end up selling stale or over-ripe products, or throw away most of their inventory (termed as shrinkage). If under-forecasted, products may be out of stock, which affects customer experience. Customers may abandon their cart if they can’t find key items in their shopping list, because they don’t want to wait in checkout lines for just a handful of products. To add to this complexity, MRL has many SKUs across its over 600 supermarkets, leading to more than 6,000 store-SKU combinations.

By the end of 2019, MRL was using traditional statistical methods to create forecasting models for each store-SKU combination, which resulted in an accuracy as low as 40%. The forecasts were maintained through multiple individual models, making it computationally and operationally expensive.

Demand forecasting to order placement

In early 2020, MRL and Ganit started working together to further improve the accuracy for forecasting the fresh category, known as Fruits and Vegetables (F&V), and reduce shrinkage.

Ganit advised MRL to break their problem into two parts:

  • Forecast demand for each store-SKU combination
  • Calculate order quantity (indents)

We go into more detail of each aspect in the following sections.

Forecast demand

In this section, we discuss the steps of forecasting demand for each store-SKU combination.

Understand drivers of demand

Ganit’s team started their journey by first understanding the factors that drove demand within stores. This included multiple on-site store visits, discussions with category managers, and cadence meetings with the supermarket’s CEO coupled with Ganit’s own in-house forecasting expertise on several other aspects like seasonality, stock-out, socio-economic, and macro-economic factors.

After the store visits, approximately 80 hypotheses on multiple factors were formulated to study their impact on F&V demand. The team performed comprehensive hypotheses testing using techniques like correlation, bivariate and univariate analysis, and statistical significance tests (Student’s t-test, Z tests) to establish the relationship between demand and relevant factors such as festival dates, weather, promotions, and many more.

Data segmentation

The team emphasized developing a granular model that could accurately forecast a store-SKU combination for each day. A combination of the sales contribution and ease of prediction was built as an ABC-XYZ framework, with ABC indicating the sales contribution (A being the highest) and XYZ indicating the ease of prediction (Z being the lowest). For model building, the first line of focus was on store-SKU combinations that had a high contribution to sales and were the most difficult to predict. This was done to ensure that improving forecasting accuracy has the maximum business impact.

Data treatment

MRL’s transaction data was structured like conventional point of sale data, with fields like mobile number, bill number, item code, store code, date, bill quantity, realized value, and discount value. The team used daily transactional data for the last 2 years for model building. Analyzing historical data helped identity two challenges:

  • The presence of numerous missing values
  • Some days had extremely high or low sales at bill levels, which indicated the presence of outliers in the data

Missing value treatment

A deep dive into the missing values identified reasons such as no stock available in the store (no supply or not in season) and stores being closed due to planned holiday or external constraints (such as a regional or national shutdown, or construction work). The missing values were replaced with 0, and appropriate regressors or flags were added to the model so the model could learn from this for any such future events.

Outlier treatment

The team treated the outliers at the most granular bill level, which ensured that factors like liquidation, bulk buying (B2B), and bad quality were considered. For example, bill-level treatment may include observing a KPI for each store-SKU combination at a day level, as in the following graph.

We can then flag dates on which abnormally high quantities are sold as outliers, and dive deeper into those identified outliers. Further analysis shows that these outliers are pre-planned institutional purchases.

These bill-level outliers are then capped with the maximum sales quantity for that date. The following graphs show the difference in bill-level demand.

Forecasting process

The team tested multiple forecasting techniques like time series models, regression-based models, and deep learning models before choosing Forecast. The primary reason for choosing Forecast was the difference in performance when comparing forecast accuracies in the XY bucket against the Z bucket, which was the most difficult to predict. Although most conventional techniques provided higher accuracies in the XY bucket, only the ML algorithms in Forecast provided a 10% incremental accuracy compared to other models. This was primarily due to Forecast’s ability to learn other SKUs (XY) patterns and apply those learnings to highly volatile items in the Z bucket. Through AutoML, the Forecast DeepAR+ algorithm was the winner and chosen as the forecast model.

Iterating to further improve forecasting accuracy

After the team identified Deep AR+ as the winning algorithm, they ran several experiments with additional features to further improve accuracy. They performed multiple iterations on a smaller sample set with different combinations like pure target time series data (with and without outlier treatment), regressors like festivals or store closures, and store-item metadata (store-item hierarchy) to understand the best combination for improving forecast accuracy. The combination of outlier treated target time series along with store-item metadata and regressors returned the highest accuracy. This was scaled back to the original set of 6,230 store-SKU combinations to get the final forecast.

Order quantity calculation

After the team developed the forecasting model, the immediate next step was to use this to decide how much inventory to buy and place orders. Order generation is influenced by forecasted demand, current stock on hand, and other relevant in-store factors.

The following formula served as the basis for designing the order construct.

The team also considered other indent adjustment parameters for the automatic ordering system, such as minimum order quantity, service unit factor, minimum closing stock, minimum display stock (based on planogram), and fill rate adjustment, thereby bridging the gap between machine and human intelligence.

Balance under-forecast and over-forecast scenarios

To optimize the output cost of shrinkage with the cost of stockouts and lost sales, the team used the quantiles feature of Forecast to move the forecast response from the model.

In the model design, three forecasts were generated at p40, p50, and p60 quantiles, with p50 being the base quantile. The selection of quantiles was programmed to be based on stockouts and wastage in stores in the recent past. For example, higher quantiles were automatically chosen if a particular store-SKU combination faced continuous stockouts in the last 3 days, and lower quantiles were automatically chosen if the store-SKU had witnessed high wastage. The quantum of increasing and decreasing quantiles was based on the magnitude of stockout or shrinkage within the store.

Automated order placement through Oracle ERP

MRL deployed Forecast and the indent ordering systems in production by integrating them with Oracle’s ERP system, which MRL uses for order placements. The following diagram illustrates the final architecture.

To deploy the ordering system into production, all MRL data was migrated into AWS. The team set up ETL jobs to move live tables to Amazon Redshift (data warehouse for business intelligence work), so Amazon Redshift became the single source of input for future all data processing.

The entire data architecture was divided into two parts:

  • Forecasting engine:
    • Used historical demand data (1-day demand lag) present in Amazon Redshift
    • Other regressor inputs like last bill time, price, and festivals were maintained in Amazon Redshift
    • An Amazon Elastic Compute Cloud (Amazon EC2) instance was set up with customized Python scripts to wrangle transaction, regressors, and other metadata
    • Post-data wrangling, the data was moved to an Amazon Simple Storage Service (Amazon S3) bucket to generate forecasts (T+2 forecasts for all store-SKU combinations)
    • The final forecast output was stored in a separate folder in an S3 bucket
  • Order (indent) engine:
    • All data required to convert forecasts into orders (such as stock on hand, received to store quantity, last 2 days of orders placed to receive, service unit factor, and planogram-based minimum opening and closing stock) was stored and maintained in Amazon Redshift
    • Order quantity was calculated through Python scripts run on EC2 instances
    • Orders were then moved to Oracle’s ERP system, which placed an order to vendors

The entire ordering system was decoupled into multiple key segments. The team set up Apache Airflow’s scheduler email notifications for each process to notify respective stakeholders upon successful completion or failure, so that they could take immediate action. The orders placed through the ERP system were then moved to Amazon Redshift tables for calculating the next days’ orders. The ease of integration between AWS and ERP systems led to a complete end-to-end automated ordering system with zero human intervention.

Conclusion

An ML-based approach unlocked the true power of data for MRL. With Forecast, we created two national models for different store formats, as opposed to over 1,000 traditional models that we had been using.

Forecast also learns across time series. ML algorithms within Forecast enable cross-learning between store-SKU combinations, which helps improve forecast accuracies.

Additionally, Forecast allows you to add related time series and item metadata, such as customers who send demand signals based on the mix of items in their basket. Forecast considers all the incoming demand information and arrives at a single model. Unlike conventional models, where the addition of variables leads to overfitting, Forecast enriches the model, providing accurate forecasts based on business context. MRL gained the ability to categorize products based on factors like shelf life, promotions, price, type of stores, affluent cluster, competitive store, and stores throughput.  We recommend that you try Amazon Forecast to improve your supply chain operations. You can learn more about Amazon Forecast here. To learn more about Ganit and our solutions, reach out at info@ganitinc.com to learn more.

 

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.


About the Authors

 Supratim Banerjee is the Chief Transformational Officer at More Retail Limited. He is an experienced professional with a demonstrated history of working in the venture capital and private equity industries. He was a consultant with KPMG and worked with organizations like A.T. Kearney and India Equity Partners. He holds an MBA focused on Finance, General from Indian School of Business, Hyderabad.

 

Shivaprasad KT is the Co-Founder & CEO at Ganit Inc. He has a 17+ years of experience in delivering top-line and bottom-line impact using data science in the US, Australia, Asia, and India. He has advised CXOs at companies like Walmart, Sam’s Club, Pfizer, Staples, Coles, Lenovo, and Citibank. He holds an MBA from SP Jain, Mumbai, and a bachelor’s degree in Engineering from NITK Surathkal.

 

Gaurav H Kankaria is the Senior Data Scientist at Ganit Inc. He has over 6 years of experience in designing and implementing solutions to help organizations in retail, CPG, and BFSI domains make data-driven decisions. He holds a bachelor’s degree from VIT University, Vellore.

Read More

How Latent Space used the Amazon SageMaker model parallelism library to push the frontiers of large-scale transformers

This blog is co-authored by Sarah Jane Hong CSO, Darryl Barnhart CTO, and Ian Thompson CEO of Latent Space and Prem Ranga of AWS.

Latent space is a hidden representation of abstract ideas that machine learning (ML) models learn. For example, “dog,” “flower,” or “door” are concepts or locations in latent space. At Latent Space, we’re working on an engine that allows you to manipulate and explore this space with both language and visual prompts. The Latent Space team comes from two fields that have long had little overlap: graphics and natural language processing (NLP). Traditionally, the modalities of images and text have been handled separately, each with their own history of complex, expensive, and fragile feature engineering. NLP tasks like document understanding or question answering have usually had little in common with vision tasks like scene understanding or rendering, and usually we use very different approaches and models for each task. But this is rapidly changing.

This merging of modalities in a single shared latent space unlocks a new generation of creative and commercial applications, from gaming to document understanding. But unlocking these new applications in a single model opens up new scaling challenges, as highlighted in “The Bitter Lesson” by Richard Sutton, and the exciting work in the last few years on scaling laws. To make this possible, Latent Space is working on cutting-edge research to fuse these modalities in a single model, but also to scale and do so efficiently. This is where model parallelism comes in.

Amazon SageMaker‘s unique automated model partitioning and efficient pipelining approach made our adoption of model parallelism possible with little engineering effort, and we scaled our training of models beyond 1 billion parameters (we use the p4d.24xlarge A100 instances), which is an important requirement for us. Furthermore, we observed that when training with a 16 node, eight GPU training setup with the SageMaker model parallelism library, we recorded a 38% improvement in efficiency compared to our previous training runs.

Challenges with training large-scale transformers

At Latent Space, we’re fusing language and vision in transformer models with billions of parameters to support “out of distribution” use cases from a user’s imagination or that would occur in the real world but not in our training data. We’re handling the challenges inherent in scaling to billions of parameters and beyond in two different ways:

Information retrieval techniques have long been a key component of search engines and QA tasks. Recently, exciting progress has been made combining classic IR techniques with modern transformers, specifically for question answering tasks where a model is trained jointly with a neural retriever that learns to retrieve relevant documents to help answer questions. For an overview, see the recent work from FAIR in Retrieval Augmented Generation: Streamlining the creation of intelligent natural language processing models and Fusion-in-Decoder, Google Brain’s REALM, and Nvidia’s Neural Retriever for question answering.

While retrieval-augmented techniques help with costs and efficiency, we are still unable to fit the model on a single GPU for our largest model. This means that we need to use model parallelism to train it. However, due to the nature of our retrieval architecture, designing our model splitting was challenging because of interdependencies between retrieved contexts across training inputs. Furthermore, even if we determine how we split our model, introducing model parallelism was a significant engineering task for us to do manually across our research and development lifecycle.

The SageMaker model parallelism library

Model parallelism is the process of splitting a model up between multiple devices or nodes (such as GPU-equipped instances) and creating an efficient pipeline to train the model across these devices to maximize GPU utilization. The model parallelism library in SageMaker makes model parallelism more accessible by providing automated model splitting, also referred to as automated model partitioning and sophisticated pipeline run scheduling. The model splitting algorithms can optimize for speed or memory consumption. The library uses a partitioning algorithm that balances memory, minimizes communication between devices, and optimizes performance.

Automated model partitioning

For our PyTorch use case, the model parallel library internally runs a tracing step (in the first training step) that constructs the model graph and determines the tensor and parameter shapes. It then constructs a tree, which consists of the nested nn.Module objects in the model, as well as additional data gathered from tracing, such as the amount of stored nn.Parameters, and runtime for each nn.Module.

The library then traverses this tree from the root and runs a partitioning algorithm that balances computational load and memory use, and minimizes communication between instances. If multiple nn.Modules share the same nn.Parameter, these modules are placed on the same device to avoid maintaining multiple versions of the same parameter. After the partitioning decision is made, the assigned modules and weights are loaded to their devices.

Pipeline run scheduling

Another core feature of the SageMaker distributed model parallel library is pipelined runs, which determine the order in which computations are made and data is processed across devices during model training. Pipelining is based on splitting a mini-batch into microbatches, which are fed into the training pipeline one by one and follow a run schedule defined by the library runtime.

The microbatch pipeline ensures that all the GPUs are fully utilized, which is something we would have to build ourselves, but with the model parallelism library this happens neatly behind the scenes. Lastly, we can use Amazon FSx, which is important to ensure our read speeds are fast given the number of files being read during the training of a multimodal model with retrieval.

Training architecture

The following diagram represents how we set up our training architecture. Our primary objectives were to improve training speed and reduce costs. The image and language transformers we are training are highly complex, with a significantly large number of layers and weights inside, running to billions of parameters, all of which makes them unable to fit in the memory of a single node. Each node carries a subset of the model, through which the data flows and the transformations are shared and compiled. We setup 16 p4d.24xlarge instances each with eight GPUs using the following architecture representation:

As we scale up our models, a common trend is to have everything stored in the weights of the network. However, for practical purposes, we want to augment our models to learn how to look for relevant contexts to help with the task of rendering. This enables us to keep our serving costs down without compromising on image quality. We use a large transformer-based NLP model and as mentioned before, we observed a 38% increase in training efficiency with the SageMaker model parallelism library as shown by the following:

  • We need an allreduce for every computation in the case of tensor level parallelism. This takes O(log_2 n) parallel steps. That is n machines taking O(n) steps, for O(n log_2 n) total operations.
  • For pipeline parallelism, we require O(1) parallel steps for passing data down the pipeline
  • Given 16 machines with eight GPUs, we have O(1) cost for pipeline parallel, and O(log_2(8)) = O(3) cost for depth-wise model parallel.
  • In this case, we see that the network cost is reduced to 1/3rd by switching to pipeline parallel that what we use with SageMaker model parallelism, and the overall training cost reduces to 1/2 + 1/2 * 1/log_2(16) = 0.625 of the original cost leading to a corresponding efficiency improvement.

In general, when the need warrants distributed training (issues with scaling model size or training data), we can follow a set of best practices to determine what approach works best.

Best practices for distributed training

Based on our experience, we suggest starting with a distributed data parallel approach. Distributed data parallelism such as the SageMaker distributed data parallel library resolves most of the networking issues with model replicas, so you should fit models into the smallest number of nodes, then replicate to scale batch size as needed.

If you run out of memory during training, as we did in this scenario, you may want to switch to a model parallel approach. However, consider these alternatives before trying model parallel training:

  • On NVIDIA Tensor Core-equipped hardware, use mixed-precision training to create speedup and reduce memory consumption.
  • Reduce the batch size (or reduce image resolution or NLP sequence length, if possible).

Additionally, we prefer model designs that do not have batch normalization as described in High-performance large-scale image recognition without normalization. If it cannot be avoided, ensure batch normalization is synced across devices. When you use distributed training, your batch is split across GPUs, so accurate batch statistics require synchronization across all devices. Without this, the normalization will have increased error and thereby impair convergence.

Start with model parallel training when you have the following constraints:

  • Your model doesn’t fit on a single device
  • Due to your model size, you’re facing limitations in choosing larger batch sizes, such as if your model weights take up most of your GPU memory and you’re forced to choose a smaller, suboptimal batch size

When optimizing for performance, do the following:

  • Use pipelining for inter-node communications to minimize latency and increase throughput
  • Keep pipelines as short as possible to minimize any bubbles. The number of microbatches should be tuned to balance computational efficiency with bubble size, and be at least the pipeline length. If needed you can form microbatches at the token level as described in TeraPipe: Token Level Pipeline Parallelism for training large-scale language models

When optimizing for cost, use SageMaker managed Spot Instances for training. This can optimize the cost of training models up to 90% over On-Demand instances. SageMaker manages the Spot interruptions on your behalf.

Other factors to consider:

  • Within a node when there is a fast interconnect, it’s more nuanced. If there is ample intra-node network capacity, reshuffling data for more optimal compute may show a benefit.
  • If activations are much larger than weight tensors, a sharded optimizer may also help. Please refer to ZeRO for more details.

The following table provides some common training scaleup scenarios and how you can configure them on AWS.

Scenario When does it apply? Solution
Scaling from a single GPU to many GPUs When the amount of training data or the size of the model is too large Change to a multi-GPU instance such as p3.16xlarge, which has eight GPUs, with the data and processing split across the eight GPUs, and producing a near-linear speedup in the time it takes to train your model.
Scaling from a single instance to multiple instances When the scaling needs extend beyond changing the instance size Scale the number of instances with the SageMaker Python SDK’s estimator function by setting your instance_type to p3.16xlarge and instance_count to 2. Instead of the eight GPUs on a single p3.16xlarge, you have 16 GPUs across two identical instances. Consider using the SageMaker distributed data parallel library.
Selecting a model parallel approach for training When encountering out of memory errors during training Switch to a model parallel approach using the SageMaker distributed model parallel library.
Network performance for inter-node communications For distributed training with multiple instances (for example, communication between the nodes in the cluster when doing an AllReduce operation) Your instances need to be in the same Region and same Availability Zone. When you use the SageMaker Python SDK, this is handled for you. Your training data should also be in the same Availability Zone. Consider using the SageMaker distributed data parallel library.
Optimized GPU, network, and Storage For large scale distributed training needs The p4d.24xlarge instance type was designed for fast local storage and a fast network backplane with up to 400 gigabits, and we highly recommend it as the most performant option for distributed training.

Conclusion

With the model parallel library in SageMaker, we get a lot of the benefits out of the box, such as automated model partitioning and efficient pipelining. In this post, we shared our challenges with our ML use case, our considerations on different training approaches, and how we used the Amazon SageMaker model parallelism library to speed up our training. Best of all, it can now take only a few hours to adopt best practices for model parallelism and performance improvements described here. If this post helps you or inspires you to solve a problem, we would love to hear about it! Please share your comments and feedback.

References

For more information, please see following:


About the Authors

Prem Ranga is an Enterprise Solutions Architect based out of Atlanta, GA. He is part of the Machine Learning Technical Field Community and loves working with customers on their ML and AI journey. Prem is passionate about robotics, is an autonomous vehicles researcher, and also built the Alexa-controlled Beer Pours in Houston and other locations.

 

 

Sarah Jane Hong is the co-founder and Chief Science Officer at Latent Space. Her background lies at the intersection of human-computer interaction and machine learning. She previously led NLP research at Sonar (acquired by Marchex), which serves businesses in the conversational AI space. She is also an esteemed AR/VR developer, having received awards and fellowships from Oculus, Mozilla Mixed Reality, and Microsoft Hololens.

 

Darryl Barnhart is the co-founder and Chief Technology Officer at Latent Space. He is a seasoned developer with experience in GPU acceleration, computer graphics, large-scale data, and machine learning. Other passions include mathematics, game development, and the study of information.

 

 

Ian Thompson is the founder and CEO at Latent Space. Ian is an engineer and researcher inspired by the “adjacent possible” — technologies about to have a big impact on our lives. Currently focused on simplifying and scaling multimodal representation learning to help build safe and creative AI. He previously helped build companies in graphics/virtual reality (AltspaceVR, acquired by Microsoft) and education/NLP (HSE).

Read More

PDF document pre-processing with Amazon Textract: Visuals detection and removal

Amazon Textract is a fully managed machine learning (ML) service that automatically extracts printed text, handwriting, and other data from scanned documents that goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables. Amazon Textract can detect text in a variety of documents, including financial reports, medical records, and tax forms.

In many use cases, you need to extract and analyze documents with various visuals, such as logos, photos, and charts. These visuals contain embedded text that convolutes Amazon Textract output or isn’t required for your downstream process. For example, many real estate evaluation forms or documents contain pictures of houses or trends of historical prices. This information isn’t needed in downstream processes, and you have to remove it before using Amazon Textract to analyze the document. In this post, we illustrate two effective methods to remove these visuals as part of your preprocessing.

Solution overview

For this post, we use a PDF that contains a logo and a chart as an example. We use two different types of processes to convert and detect these visuals, then redact them.

In the first method, we use the OpenCV library canny edge detector to detect the edge of the visuals. For the second method, we write a custom pixel concentration analyzer to detect the location of these visuals.

You can extract these visuals out for further processing, and easily modify the code to fit your use case.

Searchable PDFs are native PDF files usually generated by other applications, such as text processors, virtual PDF printers, and native editors. These types of PDFs retain metadata, text, and image information inside the document. You can easily use libraries like PyMuPDF/fitz to navigate the PDF structure and identify images and text. In this post, we focus on non-searchable or image-based documents.

Option 1: Detecting visuals with OpenCV edge detector

In this approach, we convert the PDF into PNG format, then grayscale the document with the OpenCV-Python library and use the Canny Edge Detector to detect the visual locations. You can follow the detailed steps in the following notebook.

  1. Convert the document to grayscale.

  1. Apply the Canny Edge algorithm to detect contours in the Canny-Edged document.
  2. Identify the rectangular contours with relevant dimensions.

You can further tune and optimize a few parameters to increase detection accuracy depending on your use case:

  • Minimum height and width – These parameters define the minimum height and width thresholds for visual detection. It’s expressed in percentage of the page size.
  • Padding – When a rectangle contour is detected, we define the extra padding area to have some flexibility on the total area of the page to be redacted. This is helpful in cases where the texts in the visuals aren’t inside clearly delimited rectangular areas.

Advantages and disadvantages

This approach has the following advantages:

  • It satisfies most use cases
  • It’s easy to implement, and quick to get up and running
  • Its optimum parameters yield good results

However, the approach has the following drawbacks:

  • For visuals without a bounding box or surrounding edges, the performance may vary depending on the type of visuals
  • If a block of text is inside large bounding boxes, the whole text block may be considered a visual and get removed using this logic

Option 2: Pixel concentration analysis

We implement our second approach by analyzing the image pixels. Normal text paragraphs retain a concentration signature in its lines. We can measure and analyze the pixel densities to identify areas with pixel densities that aren’t similar to the rest of document. You can follow the detailed steps in the following notebook.

  1. Convert the document to grayscale.
  2. Convert gray areas to white.
  3. Collapse the pixels horizontally to calculate the concentration of black pixels.
  4. Split the document into horizontal stripes or segments to identify those that aren’t full text (extending across the whole page).

  1. For all horizontal segments that aren’t full text, identify the areas that are text vs. areas that are images. This is done by filtering out sections using minimum and maximum black pixel concentration thresholds.
  2. Remove areas identified as non-full text.

You can tune the following parameters to optimize the accuracy of identifying non-text areas:

  • Non-text horizontal segment thresholds – Define the minimum and maximum black pixel concentration thresholds used to detect non-text horizontal segments in the page.
  • Non-text vertical segment thresholds – Define the minimum and maximum black pixel concentration thresholds used to detect non-text vertical segments in the page.
  • Window size – Controls how the page is split in horizontal and vertical segments for analysis (X_WINDOW, Y_WINDOW). It’s defined in number of pixels.
  • Minimum visual area – Defines the smallest area that can be considered as a visual to be removed. It’s defined in pixels.
  • Gray range threshold – The threshold for shades of gray to be removed.

Advantages and disadvantages

This approach is highly customizable. However, it has the following drawbacks:

  • Optimum parameters take longer and to achieve a deeper understanding of the solution
  • If the document isn’t perfectly rectified (image taken by camera with an angle), this method may fail.

Conclusion

In this post, we showed how you can implement two approaches to redact visuals from different documents. Both approaches are easy to implement. You can get high-quality results and customize either method according to your use case.

To learn more about different techniques in Amazon Textract, visit the public AWS Samples GitHub repo.


About the Authors

 Yuan Jiang is a Sr Solution Architect with a focus in machine learning. He’s a member of the Amazon Computer Vision Hero program and the Amazon Machine Learning Technical Field Community.

 

 

 

Victor Rojo is a Sr Partner Solution Architect with Conversational AI focus. He’s also a member of the Amazon Computer Vision Hero program.

 

 

 

Luis Pineda is a Sr Partner Management Solution Architect. He’s also a member of the Amazon Computer Vision Hero program.

 

 

 

Miguel Romero Calvo is a Data Scientist from the AWS Machine Learning Solution Lab.

Read More

Batch image processing with Amazon Rekognition Custom Labels 

Amazon Rekognition is a computer vision service that makes it easy to add image and video analysis to your applications using proven, highly scalable, deep learning technology that requires no machine learning (ML) expertise to use. With Amazon Rekognition, you can identify objects, people, text, scenes, and activities in images and videos, as well as detect any inappropriate content. Amazon Rekognition also provides highly accurate facial analysis and facial search capabilities that you can use to detect, analyze, and compare faces for a wide variety of use cases.

Amazon Rekognition Custom Labels allows you to identify the objects and scenes in images that are specific to your business needs. For example, you can find your logo in social media posts, identify your products on store shelves, classify machine parts in an assembly line, distinguish healthy and infected plants, and more. The blog post Building your own brand detection shows how to use Amazon Rekognition Custom Labels to build an end-to-end solution to detect brand logos in images and videos.

Amazon Rekognition Custom Labels provides a simple end-to-end experience where you start by labeling a dataset, and Amazon Rekognition Custom Labels builds a custom ML model for you by inspecting the data and selecting the right ML algorithm. After your model is trained, you can start using it immediately for image analysis. If you want to process images in batches (such as once a day or week, or at scheduled times during the day), you can provision your custom model at scheduled times.

In this post, we show how you can build a cost-optimal batch solution with Amazon Rekognition Custom Labels that provisions your custom model at scheduled times, processes all your images, and deprovisions your resources to avoid incurring extra cost.

Overview of solution

The following architecture diagram shows how you can design a cost-effective and highly scalable workflow to process images in batches with Amazon Rekognition Custom Labels. It takes advantage of AWS services such as Amazon EventBridge, AWS Step Functions, Amazon Simple Queue Service (Amazon SQS), AWS Lambda, and Amazon Simple Storage Service (Amazon S3).

This solution uses a serverless architecture and managed services, so it can scale on demand and doesn’t require provisioning and managing any servers. The Amazon SQS queue increases the overall fault tolerance of the solution by decoupling image ingestion from the image processing and enabling reliable delivery of messages for each ingested image. Step Functions makes it easy to build visual workflows to orchestrate a series of individual tasks, such as checking if an image is available for processing and managing the state lifecycle of the Amazon Rekognition Custom Labels project. Although the following architecture shows how you can build a batch processing solution for Amazon Rekognition Custom Labels using AWS Lambda, you can build a similar architecture using services such as AWS Fargate.

The following steps describe the overall workflow:

  1. As an image is stored in Amazon S3 bucket, it triggers a message that gets stored in an Amazon SQS queue.
  2. Amazon EventBridge is configured to trigger an AWS Step Functions workflow at a certain frequency (1 hour by default).
  3. As the workflow runs, it performs the following actions:
    1. It checks the number of items in the Amazon SQS queue. If there are no items to process in the queue, the workflow ends.
    2. If there are items to process in the queue, the workflow starts the Amazon Rekognition Custom Labels model.
    3. The workflow enables Amazon SQS integration with an AWS Lambda function to process those images.
  4. As the integration between the Amazon SQS queue and AWS Lambda is enabled, the following events occur:
    1. AWS Lambda starts processing messages with the image details from Amazon SQS.
    2. The AWS Lambda function uses the Amazon Rekognition Custom Labels project to process the images.
    3. The AWS Lambda function then places the JSON file containing the inferenced labels in the final bucket. The image is also moved from the source bucket to the final bucket.
  5. When all the images are processed, the AWS Step Functions workflow does the following:
    1. It stops the Amazon Rekognition Custom Labels model.
    2. It disables integration between the Amazon SQS queue and the AWS Lambda function by disabling the trigger.

The following diagram illustrates the AWS Step Functions state machine for this solution.

Prerequisites

To deploy this solution, you need the following prerequisites:

  • An AWS account with permission to deploy the solution using AWS CloudFormation, which creates AWS Identity and Access Management (IAM) roles and other resources.
  •  The Amazon Resource Name (ARN) of the Amazon Rekognition Custom Labels project (referred as ProjectArn) and the Amazon Resource Name (ARN) of the model version that was created after training the model (referenced as ProjectVersionArn). These values are required to check the status of the model and also to analyze images using the model.

To learn how to train a model, see Getting Started with Amazon Rekognition Custom Labels.

Deployment

To deploy the solution using AWS CloudFormation in your AWS account, follow the steps in the GitHub repo. It creates the following resources:

  • Amazon S3 bucket
  • Amazon SQS queue
  • AWS Step Functions workflow
  • Amazon EventBridge rules to trigger the workflow
  • IAM roles
  • AWS Lambda Functions

You can see the names of different resources created by the solution in the output section of the CloudFormation stack.

Testing the workflow

To test your workflow, complete the following steps:

  1. Upload sample images to the input S3 bucket that was created by the solution (for example, xxxx-sources3bucket-xxxx).
  2. On the Step Functions console, choose the state machine created by the solution (for example, CustomCVStateMachine-xxxx).

You should see the state machine is triggered by the Amazon EventBridge rule every hour.

  1. You can manually start the workflow by choosing Start execution.
  2. As images are processed, you can go to the output S3 bucket (for example, xxxx-finals3bucket-xxxx) to see the JSON output for each image.

The following screenshot shows the contents of the final S3 bucket with the images, along with their corresponding JSON output from Amazon Rekognition Custom Labels.

Conclusion

In this post, we showed how you can build a cost-optimal batch solution with Amazon Rekognition Custom Labels that can provision your custom model at scheduled times, process all your images, and deprovision your resources to avoid incurring extra cost. Depending on your use case, you can easily adjust the scheduled time window at which the solution should process the batch. For more information about how to create, train, evaluate, and use a model that detects objects, scenes, and concepts in images see getting started with Amazon Rekognition Custom Labels.

While the solution described in this post showed how you can process batch images with Amazon Rekognition Custom Labels, you can easily tweak the solution to process batch images with Amazon Lookout for Vision for defects and anomalies detection. With Amazon Lookout for Vision, manufacturing companies can increase quality and reduce operational costs by quickly identifying differences in images of objects at scale. For example, Amazon Lookout for Vision can be used to identify missing components in products, damage to vehicles or structures, irregularities in production lines, minuscule defects in silicon wafers, and other similar problems. To learn more about Amazon Lookout for Vision, see the developer guide.


About the Authors

Rahul Srivastava is a Senior Solutions Architect at Amazon Web Services and is based in the United Kingdom. He has extensive architecture experience working with large enterprise customers. He is helping our customers with architecture, cloud adoption, developing products with a purpose and take advantage of AI/ ML to solve real world business problems.

 

Kashif Imran is a Principal Solutions Architect at Amazon Web Services. He works with some of the largest AWS customers who are taking advantage of AI/ML to solve complex business problems. He provides technical guidance and design advice to implement computer vision applications at scale. His expertise spans application architecture, serverless, containers, NoSQL and machine learning.

Read More

Translate video captions and subtitles using Amazon Translate

Video is a highly effective a highly effective way to educate, entertain, and engage users. Your company might carry a large collection of videos that include captions or subtitles. To make these videos accessible to a larger audience, you can provide translated captions and subtitles in multiple languages. In this post, we show you how to create an automated and serverless pipeline to translate captions and subtitles using Amazon Translate, without losing their context during translation.

Captions and subtitles help make videos accessible for those hard of hearing, provide flexibility to users in noisy or quiet environments, and assist non-native speakers. Captions or subtitles are normally represented in SRT (.srt) or WebVTT (.vtt) format. SRT stands for SubRip Subtitle, and is the most common file format for subtitles and captions. WebVTT stands for Web Video Text Track, and is becoming a popular format for the same purpose.

Multi-language video subtitling and captioning solution

This solution uses Amazon Translate, a neural machine translation service that delivers fast, high-quality, and affordable language translation. Amazon Translate supports the ability to ignore tags and only translate text content in HTML documents. The following diagram illustrates the workflow of our solution.

The following diagram illustrates the workflow of our solution.

The workflow includes the following steps:

  1. Extract caption text from a WebVTT or SRT file and create a delimited text file using an HTML tag.
  2. Translate this delimited file using the asynchronous batch processing capability in Amazon Translate.
  3. Recreate the WebVTT or SRT files using the translated delimited file.

We provide a more detailed architecture in the next section.

Solution architecture

This solution is based on an event-driven and serverless pipeline architecture, and uses managed services so that it’s scalable and cost-effective. The following diagram illustrates the serverless pipeline architecture.

The following diagram illustrates the serverless pipeline architecture.

The pipeline contains the following steps:

  1. Users upload one or more caption files in the WebVTT (.vtt) or the SRT (.srt) format to an Amazon Simple Storage Service (Amazon S3) bucket.
  2. The upload triggers an AWS Lambda function.
  3. The function extracts text captions from each file, creates a corresponding HTML tag delimited text file, and stores them in Amazon S3.
  4. The function invokes Amazon Translate in batch mode to translate the delimited text files into the target language.
  5. The AWS Step Functions based job poller polls for the translation job to complete.
  6. The Step Functions workflow sends an Amazon Simple Notification Service (Amazon SNS) notification when the translation is complete.
  7. A Lambda function reads the translated delimited files from Amazon S3, creates the caption files in the WebVTT (.vtt) or SRT(.srt) format with the translated text captions, and stores them back in Amazon S3.

We explain Steps 3–7 in more detail in the following sections.

Convert caption files to delimited files

In this architecture, uploading the file with triggerFileName triggers the Lambda function <Stack name>-S3CaptionsFileEventProcessor-<Random string>. The function iterates through the WebVTT and SRT files in the input folder and for each file, it extracts the caption text, converts it into a delimited text file using an HTML (<span>) tag, and places it in the captions-in folder of the Amazon S3 bucket. See the following function code:

try:
        captions = Captions()
        #filter only the VTT and SRT file for processing in the input folder
        objs = S3Helper().getFilteredFileNames(bucketName,"input/",["vtt","srt"])
        for obj in objs:
            try:
                vttObject = {}
                vttObject["Bucket"] = bucketName
                vttObject["Key"] = obj
                captions_list =[]
                #based on the file type call the method that coverts them into python list object
                if(obj.endswith("vtt")):
                    captions_list =  captions.vttToCaptions(vttObject)
                elif(obj.endswith("srt")):
                    captions_list =  captions.srtToCaptions(vttObject)
                #convert the text captions in the list object to a delimited file
                delimitedFile = captions.ConvertToDemilitedFiles(captions_list)
                fileName = obj.split("/")[-1]
                newObjectKey = "captions-in/{}.delimited".format(fileName)
                S3Helper().writeToS3(str(delimitedFile),bucketName,newObjectKey)   
                output = "Output Object: {}/{}".format(bucketName, newObjectKey)

The solution uses a Python library webvtt-py to load, parse, and generate the WebVTT and SRT file formats. All the operations related to the library are abstracted within the Captions module. Also, all Amazon S3 operations are abstracted within the S3Helper module.

Batch translation of delimited files

After the delimited files are stored in the captions-in folder of the Amazon S3 bucket, the Lambda function <Stack name>-S3CaptionsFileEventProcessor-<Random string> invokes the Amazon Translate job startTextTranslationJob with the following parameters:

  • The captions-in folder in the S3 bucket is the input location for files to be translated
  • The captions-out folder in the S3 bucket is the output location for translated files
  • Source language code
  • Destination language code
  • An AWS Identity and Access Management (IAM) role ARN with necessary policy permissions to read and write to the S3 bucket

See the following job code:

translateContext = {}
translateContext["sourceLang"] = sourceLanguageCode
translateContext["targetLangList"] = [targetLanguageCode]
translateContext["roleArn"] = access_role 
translateContext["bucket"] = bucketName
translateContext["inputLocation"] = "captions-in/"
translateContext["outputlocation"] = "captions-out/"
translateContext["jobPrefix"] = "TranslateJob-captions"
#Call Amazon Translate to translate the delimited files in the captions-in folder
jobinfo = captions.TranslateCaptions(translateContext)

Poll the Amazon Translate batch translate job

The solution uses a Step Functions workflow to periodically poll the Amazon Translate service for the status of the submitted job using a Lambda function. When the job is complete, the workflow creates an Amazon SNS notification with details of the Amazon Translate job as the notification payload. For more details on the Step Functions job definition and the Lambda code, see Getting a batch job completion message from Amazon Translate.

Create WebVTT and SRT files from the delimited files

The Amazon SNS notification from the job poller step triggers the Lambda function <Stack name>-TranslateCaptionsJobSNSEventProcessor-<Random string>. The function iterates through the each of the translated delimited files generated in the captions-out folder based on the event details available from the Amazon SNS notification event. See the following code:

output = ""
    logger.info("request: {}".format(request))
    up = urlparse(request["s3uri"], allow_fragments=False)
    accountid = request["accountId"]
    jobid =  request["jobId"]
    bucketName = up.netloc
    objectkey = up.path.lstrip('/')
    basePrefixPath = objectkey  + accountid + "-TranslateText-" + jobid + "/";
    languageCode = request["langCode"]
    logger.debug("Base Prefix Path:{}".format(basePrefixPath))
    captions = Captions()
    #filter only the delimited files with .delimited suffix
    objs = S3Helper().getFilteredFileNames(bucketName,basePrefixPath,["delimited"])
    for obj in objs:
        try:
            #Read the Delimited file contents
            content = S3Helper().readFromS3(bucketName,obj)
            fileName = FileHelper().getFileName(obj)

The solution generates the WebVTT or SRT file using the original WebVTT or SRT file from the input folder for the time markers, but replaces the captions with the translated caption text from the delimited files. See the following code:

logger.debug("SourceFileKey:{}.processed".format(sourceFileName))
            soureFileKey = "input/{}.processed".format(sourceFileName)
            vttObject = {}
            vttObject["Bucket"] = bucketName
            vttObject["Key"] = soureFileKey
            captions_list = []
            #Based on the file format, call the right method to load the file as python object
            if(fileName.endswith("vtt")):
                    captions_list =  captions.vttToCaptions(vttObject)
            elif(fileName.endswith("srt")):
                captions_list =  captions.srtToCaptions(vttObject)
            # Replace the text captions with the translated content
            translatedCaptionsList = captions.DelimitedToWebCaptions(captions_list,content,"<span>",15)
            translatedText = ""
            # Recreate the Caption files in VTT or SRT format
            if(fileName.endswith("vtt")):
                translatedText =  captions.captionsToVTT(translatedCaptionsList)
            elif(fileName.endswith("srt")):
                translatedText =  captions.captionsToSRT(translatedCaptionsList)

The function then writes the new WebVTT or SRT files as S3 objects in the output folder with the following naming convention: TargetLanguageCode-<inputFileName>.vtt or TargetLanguageCode-<inputFileName>.srt. See the following code:

newObjectKey = "output/{}".format(fileName)
# Write the VTT or SRT file into the output S3 folder
S3Helper().writeToS3(str(translatedText),bucketName,newObjectKey)

Solution deployment

You can either deploy the solution using an AWS CloudFormation template or by cloning the GitHub repository.

Deployment using the CloudFormation template

The CloudFormation template provisions the necessary resources needed for the solution, including the IAM roles, IAM policies, and Amazon SNS topics. The template creates the stack the us-east-1 Region.

  1. Launch the CloudFormation template by choosing Launch Stack:

  1. For Stack name, enter a unique stack name for this account; for example, translate-captions-stack.
  2. For SourceLanguageCode, enter the language code for the current language of the caption text; for example, en for English.
  3. For TargetLanguageCode, enter the language code that you want your translated text in; for example, es for Spanish.

For more information about supported languages, see Supported Languages and Language Codes.

  1. For TriggerFileName, enter the name of the file that triggers the translation serverless pipeline (the default is triggerfile).
  2. In the Capabilities and transforms section, and select the check boxes to acknowledge that CloudFormation will create IAM resources and transform the AWS Serverless Application Model (AWS SAM) template.

AWS SAM templates simplify the definition of resources needed for serverless applications. When deploying AWS SAM templates in AWS CloudFormation, AWS CloudFormation performs a transform to convert the AWS SAM template into a CloudFormation template. For more information, see Transform.

  1. Choose Create stack.

Choose Create stack.

The stack creation may take up to 10 minutes, after which the status changes to CREATE_COMPLETE. You can see the name of the newly created S3 bucket along with other AWS resources created on the Outputs tab.

You can see the name of the newly created S3 bucket along with other AWS resources created on the Outputs tab.

Deployment using the GitHub repository

To deploy the solution using GitHub, visit the GitHub repo and follow the instructions in the README.md file. The solution uses AWS SAM to make it easy to deploy in your AWS account.

Test the solution

To test the solution, upload one or more WebVTT (.vtt) or SRT (.srt) files to the input folder. Because this is a batch operation, we recommend uploading multiple files at the same time. The following code shows a sample SRT file:

1
00:00:00,500 --> 00:00:07,000
Hello. My name is John Doe. Welcome to the blog demonstrating the ability to

2
00:00:07,000 --> 00:00:11,890
translate from one language to another using Amazon Translate. Amazon Translate is a neural machine translation service that delivers fast, high-quality, and affordable language translation. 

3
00:00:11,890 --> 00:00:16,320
Neural machine translation is a form of language translation automation that uses deep learning models to deliver more accurate and natural-sounding translation than traditional statistical and rule-based translation algorithms.

4
00:00:16,320 --> 00:00:21,580
The translation service is trained on a wide variety of content across different use cases and domains to perform well on many kinds of content.

5
00:00:21,580 --> 00:00:23,880
Its asynchronous batch processing capability enables you to translate a large collection of text or HTML documents with a single API call.

After you upload all the WebVTT or SRT documents, upload the file that triggers the translation workflow. This file can be a zero-byte file, but the filename should match the TriggerFileName parameter in the CloudFormation stack. The default name for the file is triggerfile.

After you upload all the WebVTT or SRT documents, upload the file that triggers the translation workflow.

After a short time (15–20 minutes), check the output folder to see the WebVTT or SRT files with the following naming convention: TargetLanguageCode-<inputFileName>.vtt or TargetLanguageCode-<inputFileName>.srt.

After a short time (15–20 minutes), check the output folder to see the WebVTT or SRT files

The following snippet shows the SRT file translated into Spanish:

1
00:00:00,500 --> 00:00:07,000
Hola. Mi nombre es John Doe. Bienvenido al blog que demuestra la capacidad de

2
00:00:07,000 --> 00:00:11,890
traducir de un idioma a otro utilizando Amazon Translate. Amazon Translate es un servicio de traducción automática neuronal que ofrece traducción de idiomas rápida, de alta calidad y asequible. 

3
00:00:11,890 --> 00:00:16,320
La traducción automática neuronal es una forma de automatización de la traducción de idiomas que utiliza modelos de aprendizaje profundo para ofrecer una traducción más precisa y natural que los algoritmos de traducción basados en reglas y estadísticas tradicionales. 

4
00:00:16,320 --> 00:00:21,579
El servicio de traducción está capacitado en una amplia variedad de contenido en diferentes casos de uso y dominios para funcionar bien en muchos tipos de contenido. 

5
00:00:21,579 --> 00:00:23,879
Su capacidad de procesamiento por lotes asincrónico le permite traducir una gran colección de documentos de texto o HTML con una sola llamada a la API.

You can monitor the progress of the solution pipeline by checking the Amazon CloudWatch logs generated for each Lambda function that is part of the solution. For more information, see Accessing Amazon CloudWatch logs for AWS Lambda.

To do a translation for a different source-target language combination, you can update the SOURCE_LANG_CODE and TARGET_LANG_CODE environment variable for the <Stack name>-S3CaptionsFileEventProcessor-<Random string> function and trigger the solution pipeline by uploading WebVTT or SRT documents and the TriggerFileName into the input folder.

To do a translation for a different source-target language combination, you can update the SOURCE_LANG_CODE and TARGET_LANG_CODE environment variable

Conclusion

In this post, we demonstrated how to translate video captions and subtitles in WebVTT and SRT file formats using Amazon Translate asynchronous batch processing. This process can be used in several industry verticals, including education, media and entertainment, travel and hospitality, healthcare, finance, law, or any organization with a large collection of subtitled or captioned video assets that wants these translated to their customers in multiple languages.

You can easily integrate the approach into your own pipelines as well as handle large volumes of caption and subtitle text with this scalable architecture. This methodology works for translating captions and subtitles between over 70 languages supported by Amazon Translate (as of this writing). Because this solution uses asynchronous batch processing, you can customize your machine translation output using parallel data. For more information on using parallel data, see Customizing Your Translations with Parallel Data (Active Custom Translation). For a low-latency, low-throughput solution translating smaller caption files, you can perform the translation through the real-time Amazon Translate API. For more information, see Translating documents with Amazon Translate, AWS Lambda, and the new Batch Translate API. If your organization has a large collection of videos that need to be captioned or subtitled, you can use this AWS Subtitling solution.


About the Authors

Siva Rajamani is a Boston-based Enterprise Solutions Architect at AWS. He enjoys working closely with customers and supporting their digital transformation and AWS adoption journey. His core areas of focus are serverless, application integration, and security. Outside of work, he enjoys outdoors activities and watching documentaries.

 

 

Raju Penmatcha is a Senior AI/ML Specialist Solutions Architect at AWS. He works with education, government, and non-profit customers on machine learning and artificial intelligence related projects, helping them build solutions using AWS. Outside of work, he likes exploring new places.

Read More

Active learning workflow for Amazon Comprehend custom classification models – Part 2

This is the second in a two part series on Amazon Comprehend custom classification models. In Part 1 of this series, we looked at how to build an AWS Step Functions workflow to automatically build, test, and deploy Amazon Comprehend custom classification models and endpoints. In Part 2, we look at real-time classification APIs, feedback loops, and human review workflows that help with continuous model training to keep the model up-to-date with new data and patterns. You can find Part 1 here

The Amazon Comprehend custom classification API enables you to easily build custom text classification models using your business-specific labels without learning machine learning (ML). For example, your customer support organization can use custom classification to automatically categorize inbound requests by problem type based on how the customer described the issue. You can use custom classifiers to automatically label support emails with appropriate issue types, thereby routing customer phone calls to the right agents and categorizing social media posts into user segments.

In Part 1 of this series, we looked at how to build an AWS Step Functions workflow to automatically build, test, and deploy Amazon Comprehend custom classification models and endpoints. In this post, we cover the real-time classification APIs, feedback loops, and human review workflows that help with continuous model training to keep it up to date with new data and patterns.

Solution architecture

This post describes a reference architecture for retraining custom classification models. The architecture comprises real-time classification, feedback pipelines, human review workflows using Amazon Augmented AI (Amazon A2I) , preparing new training data from the human review data, and triggering the model building flow that we covered in Part 1 of this series.

The following diagram illustrates this architecture covering the last three components. In the following sections, we walk you through each step in the workflow.

The following diagram illustrates this architecture covering the last three components.

Real-time classification

To use custom classification in Amazon Comprehend in real time, you need to create an API that calls the custom classification model endpoint with the text that needs to be classified. This stage is represented by Steps 1–3 in the preceding architecture:

  1. The end user application calls an Amazon API Gateway endpoint with a text that needs to be classified.
  2. The API Gateway endpoint then calls an AWS Lambda function configured to call an Amazon Comprehend endpoint.
  3. The Lambda function calls the Amazon Comprehend endpoint, which returns the unlabeled text classification and a confidence score.

Feedback collection

When the endpoint returns the classification and the confidence score during the real-time classification, you can send instances with low confidence scores to human review. This type of feedback is called implicit feedback.

  1. The Lambda function sends the implicit feedback to an Amazon Kinesis Data Firehose delivery stream.

The other type of feedback is called explicit feedback, and comes from the application’s end users that use the custom classification feature. This type of feedback comprises the instances of text where the user wasn’t happy with the prediction. You can send explicit feedback either in real time through an API or a batch process.

  1. End users of the application submit explicit real-time feedback through an API Gateway endpoint.
  2. The Lambda function backing the API endpoint transforms the data into a standard feedback format and writes it to the Kinesis Data Firehose delivery stream.
  3. End users of the application can also submit explicit feedback as a batch file by uploading it to an S3 bucket.
  4. A trigger configured on the S3 bucket triggers a Lambda function.
  5. The Lambda function transforms the data into a standard feedback format and writes it to the delivery stream.
  6. Both the implicit and explicit feedback data get sent to a delivery stream in a standard format. All this data is buffered and written to an S3 bucket.

Human classification

The human classification stage includes the following steps:

  1. A trigger configured on the feedback bucket in Step 10 invokes a Lambda function.
  2. The Lambda function creates Amazon A2I human review tasks for all the feedback data received.
  3. Workers assigned to the classification jobs log in to the human review portal and either approve the classification by the model or classify the text with the right labels.
  4. After the human review, all these instances are stored in an S3 bucket and used for retraining the models.

Retraining workflow

The retraining workflow stage includes the following steps:

  1. A trigger configured on the human-reviewed data bucket in Step 14 invokes a Lambda function.
  2. The function transforms the human-reviewed data payload to a comma-separated training data format, required by Amazon Comprehend custom classification models. After transformation, this data is written to a Firehose delivery stream, which acts as an accumulator.
  3. Depending on the time frame set for retraining models, the delivery stream flushes the data into the training bucket that was created in Part 1 of this series. For this post, we set the buffer conditions to 1 MiB or 60 seconds. For your own use case, you might want to adjust these settings so model retraining occurs according to your time or size requirements. This completes the active learning loop, and starts the Step Functions workflow for retraining models.

Solution overview

The next few sections of the post go over how to set up this architecture in your AWS account. We classify news into four categories: World, Sports, Business, and Sci/Tech, using the AG News dataset for custom classification, and set up the implicit and explicit feedback loop. You need to complete two manual steps:

  1. Create an Amazon Comprehend custom classifier and an endpoint.
  2. Create an Amazon SageMaker private workforce, worker task template, and human review workflow.

After this, you run the provided AWS CloudFormation template to set up the rest of the architecture.

Prerequisites

If you’re continuing from Part 1 of this series, you can skip to the step Create a private workforce, worker task template, and human review workflow.

Create a custom classifier and an endpoint

Before you get started, download the dataset and upload it to Amazon S3. This dataset comprises a collection of news articles and their corresponding category labels. We have created a training dataset called train.csv from the original dataset and made it available for download.

The following screenshot shows a sample of the train.csv file.

The following screenshot shows a sample of the train.csv file.

After you download the train.csv file, upload it to an S3 bucket in your account for reference during training. For more information about uploading files, see How do I upload files and folders to an S3 bucket?

To create your classifier for classifying news, complete the following steps:

  1. On the Amazon Comprehend console, choose Custom Classification.
  2. Choose Train classifier.
  3. For Name, enter news-classifier-demo.
  4. Select Using Multi-class mode.
  5. For Training data S3 location, enter the path for train.csv in your S3 bucket, for example, s3://<your-bucketname>/train.csv.
  6. For Output data S3 location, enter the S3 bucket path where you want the output, such as s3://<your-bucketname>/.
  7. For IAM role, select Create an IAM role.
  8. For Permissions to access, choose Input and output (if specified) S3 bucket.
  9. For Name suffix, enter ComprehendCustom.

For Name suffix, enter ComprehendCustom

  1. Scroll down and choose Train Classifier to start the training process.

The training takes some time to complete. You can either wait to create an endpoint or come back to this step later after finishing the steps in the section Create a private workforce, worker task template, and human review workflow.

Create a custom classifier real-time endpoint

To create your endpoint, complete the following steps:

  1. On the Amazon Comprehend console, choose Custom Classification.
  2. From the Classifiers list, choose the name of the custom model for which you want to create the endpoint and select your model news-classifier-demo.
  3. On the Actions drop-down menu, choose Create endpoint.
  4. For Endpoint name, enter classify-news-endpoint and give it one inference unit.
  5. Choose Create endpoint.
  6. Copy the endpoint ARN as shown in the following screenshot. You use it when running the CloudFormation template in a future step.

Copy the endpoint ARN as shown in the following screenshot.

Create a private workforce, worker task template, and human review workflow

This section walks you through creating a private workforce in SageMaker, a worker task template, and your human review workflow.

Create a labeling workforce

For this post, you create a private work team and add only one user (you) to it. For instructions, see Create a Private Workforce (Amazon SageMaker Console).

After the user accepts the invitation, you add them to the workforce. For instructions, see the Add a Worker to a Work Team section the Manage a Workforce (Amazon SageMaker Console).

Create a worker task template

To create a worker task template, complete the following steps:

  1. On the Amazon A2I console, choose Worker task templates.
  2. Choose to Create a template.
  3. For Template name, enter custom-classification-template.
  4. For Template type, choose Custom,
  5. In the Template editor, enter the following GitHub UI template code.
  6. Choose Create.

Choose Create.

Create a human review workflow

To create your human review workflow, complete the following steps:

  1. On the Amazon A2I console, choose Human review workflows.
  2. Choose Create human review workflow.
  3. For Name, enter classify-workflow.
  4. Create a S3 bucket to store the human review output. Make a note of this bucket, because we use this in the later part of the post.
  1. Specify an S3 bucket to store output: s3://<your bucketname>/. Use the bucket created earlier.
  2. For IAM role, select Create a new role.
  3. For Task type, choose Custom.
  4. Under Worker task template creation, select the custom classification template you created.
  5. For Task description, enter Read the instructions and review the document.
  6. Under Workers, select Private.
  7. Use the drop-down list to choose the private team that you created.
  8. Choose Create.
  9. Copy the workflow ARN (see the following screenshot). You will use it when initializing the CloudFormation template parameters in a later step.

Copy the workflow ARN

Deploy the CloudFormation template to set up active learning feedback

Now that you have completed the manual steps, you can run the CloudFormation template to set up this architecture’s building blocks, including the real-time classification, feedback collection, and the human classification.

Before deploying the CloudFormation template, make sure you have the following to pass as parameters:

  • Custom classifier endpoint ARN
  • Amazon A2I workflow ARN
  1. Choose Launch Stack:

  1. You must set this parameter, only if you’re continuing from Part 1 of this series. For BucketFromBlogPart1, enter the bucket name that was created for storing training data in Part 1 of this blog series.
  2. You must set this parameter, only if you’re continuing from Part 1 of this series. For ComprehendEndpointParameterKey, enter /<<StackName of Part1 Blog>>/CURRENT_CLASSIFIER_ENDPOINT. This parameter can be found in the Parameter Store section of the Systems Manager.
  1. You’re not required to set this parameter if you’re continuing from Part 1.For ComprehendEndpointARN, enter the endpoint ARN of your Amazon Comprehend custom classification model.
  1. For HumanReviewWorkflowARN, enter the workflow ARN you copied.
  2. For ComrehendClassificationScoreThreshold, enter 0.5, which means a 50% threshold for low confidence scores.

For ComrehendClassificationScoreThreshold, enter 0.5

  1. Choose Next until the Capabilities
  2. Select the check box to provide acknowledgment to AWS CloudFormation to create AWS Identity and Access Management (IAM) resources and expand the template.

For more information about these resources, see AWS IAM resources.

  1. Choose Create stack.

Choose Create stack.

Wait until the status of the stack changes from CREATE_IN_PROGRESS to CREATE_COMPLETE.

Wait until the status of the stack changes from CREATE_IN_PROGRESS to CREATE_COMPLETE.

  1. On the Outputs tab of the stack (see the following screenshot), copy the value for BatchUploadS3Bucket, FeedbackAPIGatewayID, and TextClassificationAPIGatewayID to interact with the feedback loop.

Both the TextClassificationAPI and FeedbackAPI require an API key to interact with them. The CloudFormation stack output ApiGWKey refers to the name of the API key. As of this writing, this API key is associated with a usage plan that allows 2,000 requests per month.

  1. On the API Gateway console, choose either the TextClassificationAPI or the FeedbackAPI.
  2. In the navigation pane, choose API Keys.
  3. Expand the API key section and copy the value.

Expand the API key section and copy the value.

You can manage the usage plan by following the instructions on Create, configure, and test usage plans with the API Gateway console.

You can also add fine-grained authentication and authorization to your APIs. For more information on securing your APIs, see Controlling and managing access to a REST API in API Gateway.

Enable the trigger to start the retraining workflow

The last step of the process is to add a trigger to the S3 bucket that we created earlier to store the human-reviewed output. The trigger invokes the Lambda function that begins the payload transformation from the Amazon A2I human review output format to a CSV format required for training Amazon Comprehend custom classification models.

  1. Open the Lambda function HumanReviewTrainingDataTransformerFunction, created by running the CloudFormation template.
  2. In the Trigger configuration section, choose S3.
  3. For Bucket, enter the bucket you created earlier in the step 4 of Create a human review workflow section.

For Bucket, enter the bucket you created earlier.

Test the feedback loop

In this section, we walk you through testing your feedback loop, including real-time classification, implicit and explicit feedback, and human review tasks.

Real-time classification

To interact and test these APIs, you need to download Postman.

The API Gateway endpoint receives an unlabeled text document from a client application and internally calls the custom classification endpoint, which returns the predicted label and a confidence score.

  1. Open Postman and enter the TextClassificationAPIGateway URL in POST method.
  2. In the Headers section, configure the API key: x-api-key : << Your API key >>.
  3. In the text field, enter the following JSON code (make sure you have JSON selected and enable raw):
    {"classifier":"<your custom classifier name>", "sentence":"MS Dhoni retires and a billion people had mixed feelings."}

  1. Choose Send.

You get a response back with a confidence score and class, as seen in the following screenshot.

You get a response back with a confidence score and class, as seen in the following screenshot.

Implicit feedback

When the endpoint returns the classification and the confidence score during the real-time classification, you can route all the instances where the confidence score doesn’t meet the threshold to human review. This type of feedback is called implicit feedback. For this post, we set the threshold as 0.5 as an input to the CloudFormation stack parameter.

You can change this threshold when deploying the CloudFormation template based on your needs.

Explicit feedback

The explicit feedback comes from the end users of the application that uses the custom classification feature. This type of feedback comprises the instances of text where the user wasn’t happy with the prediction. You can send the predicted label by the model’s explicit feedback through the following methods:

  • Real time through an API, which is usually triggered through a like/dislike button on a UI
  • Batch process, where a file with a collection of misclassified utterances is put together based on a user survey conducted by the customer outreach team

Invoke the explicit real-time feedback loop

To test the Feedback API, complete the following steps:

  1. Open Postman and enter the FeedbackAPIGatewayID value from your CloudFormation stack output in POST method.
  2. In the Headers section, configure the API key: x-api-key : << Your API key >>.
  3. In the text field, enter the following JSON code (for classifier, enter the classifier you created, such as news-classifier-demo, and make sure you have JSON selected and enable raw):
    {"classifier":"<your custom classifier name>","sentence":"Sachin is Indian Cricketer."}

  1. Choose Send.

We recommend that you submit at least four test samples that will result in a confidence score lesser than your set threshold.

We recommend that you submit at least four test samples that will result in a confidence score lesser than your set threshold.

Submit explicit feedback as a batch file

Download the following test feedback JSON file, populate it with your data, and upload it into the BatchUploadS3Bucket created when you deployed your CloudFormation template. We recommend that you submit at least four feedback entries in this file. The following code shows some sample data in the file:

{
   "classifier":"news-classifier-demo",
   "sentences":[
      "US music firms take legal action against 754 computer users alleged to illegally swap music online.",
      "A gamer spends $26,500 on a virtual island that exists only in a PC role-playing game."
   ]
}

Uploading the file triggers the Lambda function that starts your human review loop.

Human review tasks

All the feedback collected through the implicit and explicit methods is sent for human classification. The labeling workforce can include Amazon Mechanical Turk, private teams, or AWS Marketplace vendors. For this post, we create a private workforce. The URL to the labeling portal is located on the SageMaker console, on the Labeling workforces page, on the Private tab.

For this post, we create a private workforce.

After you log in, you can see the human review tasks assigned to you. Select the task to complete and choose Start working.

Select the task to complete and choose Start working.

You see the tasks displayed based on the worker template used when creating the human workflow.

You see the tasks displayed based on the worker template used when creating the human workflow.

After you complete the human classification and submit the tasks, the human-reviewed data is stored in the S3 bucket you configured when creating the human review workflow. This bucket is located under Output location on the workflow details page.

his bucket is located under Output location on the workflow details page.

This human-reviewed data is used to retrain the custom classification model to learn newer patterns and improve its overall accuracy. The following screenshot shows the human-annotated output file output.json in the S3 bucket.

The following screenshot shows the human-annotated output file output.json in the S3 bucket.

This human-reviewed data is then converted to a custom classification model training data format, and transferred to the training bucket that was created in Part 1 of this series, which starts the Step Functions workflow for retraining models. The process of retraining the models with human-reviewed data, selecting the best model, and automatically deploying the new endpoints completes the active learning workflow.

Cleanup

To remove all resources created throughout this process and prevent additional costs, complete the following steps:

  1. On the Amazon S3 console, delete the S3 bucket that contains the training dataset.
  2. On the Amazon Comprehend console, delete the endpoint and the classifier.
  3. On the Amazon A2I console, delete the human review workflow, worker template, and private workforce.
  4. On the AWS CloudFormation console, delete the stack you created. (This removes the resources the CloudFormation template created.)

Conclusion

Amazon Comprehend helps you build scalable and accurate natural language processing capabilities without any ML experience. This post provides a reusable pattern and infrastructure for active learning workflows for custom classification models. The feedback pipelines and human review workflow help the custom classifier learn new data patterns continuously. To learn more about automatic model building, selection, and deployment of custom classification models, you can refer to Active learning workflow for Amazon Comprehend custom classification models – Part 1.

For more information, see Custom Classification. You can discover other Amazon Comprehend features and get inspiration from other AWS blog posts about how to use Amazon Comprehend beyond classification.


About the Authors

Shanthan Kesharaju is a Senior Architect in the AWS ProServe team. He helps our customers with AI/ML strategy, architecture, and developing products with a purpose. Shanthan has an MBA in Marketing from Duke University and an MS in Management Information Systems from Oklahoma State University.

 

 

Mona Mona is an AI/ML Specialist Solutions Architect based out of Arlington, VA. She works with the World Wide Public Sector team and helps customers adopt machine learning on a large scale. She is passionate about NLP and ML explainability areas in AI/ML.

 

 

Joyson Neville Lewis Joyson Neville Lewis obtained his masters in Information Technology from Rutgers University in 2018. He has worked as a Software/Data engineer before diving into the conversational AI domain in 2019, where he works with companies to connect the dots between business and AI using voice and chatbot solutions. Joyson joined Amazon Web Services in February of 2018 as a Big Data Consultant for the AWS Professional Services team in NYC.

Read More

Active learning workflow for Amazon Comprehend custom classification models – Part 1

This is the first in a two part series on Amazon Comprehend custom classification models. In Part 1 of this series, we look at how to build an AWS Step Functions workflow to automatically build, test, and deploy Amazon Comprehend custom classification models and endpoints. In Part 2, we will look at real-time classification APIs, feedback loops, and human review workflows that help with continuous model training to keep the model up-to-date with new data and patterns. You can find Part 2 here

The Amazon Comprehend custom classification API enables you to easily build custom text classification models using your business-specific labels without learning ML. For example, your customer support organization can use custom classification to automatically categorize inbound requests by problem type based on how the customer has described the issue. You can use custom classifiers to automatically label support emails with appropriate issue types, route customer phone calls to the right agents, and categorize social media posts into user segments.

For custom classification, you start by creating a training job with a ground truth dataset comprising a collection of text and corresponding category labels. When the job is complete, you have a classifier that can classify any new text into one or more named categories. When the custom classification model classifies a new unlabeled text document, it predicts what it has learned from the training data. Sometimes you may not have a training dataset with various language patterns, or once you deploy the model, you start seeing completely new data patterns. In these cases, the model may not be able to classify these new data patterns accurately. How can we ensure continuous model training to keep it up to date with new data and patterns?

Feedback loops play a pivotal role in keeping the models up to date. This feedback helps the models learn about their misclassifications and learn the right ones. This process of teaching the models continuously through feedback and deploying them is called active learning.

Solution architecture

In this two-part series, we discuss an architecture pattern that allows you to build an active learning workflow for Amazon Comprehend custom classification models. The first post will cover an AWS Step Functions workflow that automates model building, selecting the best model, and deploying an endpoint of the chosen model. The second post describes a workflow comprising real-time classification, feedback pipelines, and human review workflows using Amazon Augmented AI (Amazon A2I).

Step Functions workflow

The following diagram shows the Step Functions workflow for automatic model building, endpoint creation, and deploying Amazon Comprehend custom classification models.

In the following sections, we discuss the six steps in more detail:

  • Model building: Steps 1–2
  • Model selection: Step 3
  • Model deployment: Steps 4–6

Model building

Steps 1–2 in the workflow cover model building, which includes incorporating new data into the ground truth dataset and retraining the model.  If the model is being built for the first time, the new dataset will be marked as ground truth dataset, and the Model selection step would be skipped. This new data can come from different sources, including feedback data that was human reviewed and reclassified, as discussed in Part 2 of this series. The new data is uploaded to an Amazon Simple Storage Service (Amazon S3) bucket, which starts a workflow that includes merging the new data with the ground truth dataset and starting a custom classification model training job that uses the newly merged dataset.

Model selection

Step 3 covers model selection, which includes testing the newly created model with a validation dataset, computing the test results, comparing the results of the new model with the current model in production, and finally selecting the model that performs best with respect to a chosen metric like accuracy, precision, recall, or F1 score. All these steps are orchestrated using the same Step Functions workflow after the model is built.

Model deployment

Steps 4–6 cover model deployment. If the new model outperforms the current model in production, the Step Functions workflow continues to the next step, where a new endpoint is created for the newly created custom classification model, and then updates the AWS Systems Manager Parameter Store values to this new classifier ARN and the new endpoint ARN to be used by the real-time classification API, upgrades the newly merged training dataset as the primary training dataset, and deletes the endpoint of the previous model that was in production. If the newer model doesn’t perform well in production, you can roll back to the previous model and endpoint by manually updating the Parameter Store values to refer to the earlier model and endpoint ARNs.

Deploying the AWS CloudFormation template

You can deploy this architecture using the provided AWS CloudFormation template in us-east-1.

  1. Choose Launch Stack:

  1. For Stack Name, enter the name of your CloudFormation stack.
  2. For StepFunctionName, enter the Step Functions name for automatic model building, endpoint creation, and deploying Amazon Comprehend custom classification models (can be left at the default value of ComprehendModelStepFunction).
  3. For TestThresholdParameterName, choose Accuracy, Precision, Recall, or F1score. (For this post, we leave it at the default value of F1 Score).

We use this metric to check if the newer model is better than the previous model.

  1. Choose Next.
  2. Choose Next again.
  3. In the Capabilities and transforms section, select all three check boxes to provide acknowledgment to AWS CloudFormation to create AWS Identity and Access Management (IAM) resources and expand the template.
  4. Choose Create stack.

This process might take 15 minutes or more to complete, and creates the following resources:

  • Systems Manager parameters to store intermediate parameters, like the current classifier endpoint ARN
  • An S3 bucket for the custom classification model training data
  • An S3 bucket for the custom classification model prediction job on the test data
  • Amazon DynamoDB tables for model predictions on the test data and status of custom classification prediction jobs
  • AWS Lambda functions:
    • StartCustomClassificationModelBuilding – Starts the custom classification model building
    • GetCustomClassificationModelStatus – Gets the status of the model building stage
    • StartCustomClassificationJob – Starts the prediction job using the test dataset
    • GetCustomClassificationJobStatus – Gets the status of the prediction job
    • CustomClassificationModelSelection – Starts the custom classification model selection stage
    • StartCustomClassificationEndpointBuilding – Starts the custom classification model endpoint building stage
    • GetCustomClassificationEndpointStatus – Gets the status of model endpoint building stage
    • DeleteCustomClassificationEndpoint – Deletes the old model endpoint
  • Step Functions to automate the workflow of building, testing, selecting, and deleting the models
  • IAM roles:
    • A Lambda execution role to run Amazon Comprehend custom classification jobs
    • A Lambda execution role to trigger Step Functions
    • A Step Functions role to trigger Lambda functions
    • An Amazon Comprehend data access role to give Amazon Comprehend access to the training data in the S3 bucket

Testing

For testing this blog, you can use your own training dataset or you can download the news dataset and upload it to Amazon S3. The news dataset comprises a collection of news articles and their corresponding category labels.

  1. Find the S3 bucket created by the CloudFormation to store the training data. This can be found by going to the Resources section of the CloudFormation and looking up ComprehendInputDataS3Bucket.
  2. On the Amazon S3 console, inside the input data bucket, create a folder named train.
  3. Upload the training data to the train folder.
  4. On the Step Functions console, choose the new state machine you created.
  5. In the Executions section, choose the latest run.

In the Executions section, choose the latest run.

The following screenshot shows the Graph inspector view. On the Details tab, you can check that Step Functions ran successfully.

The following screenshot shows the Graph inspector view.

It takes approximately 1 hour for the Step Functions state machine to complete.

  1. After Step Functions has successfully ran, on the Systems Manager console, in the navigation pane, under Application Management, choose Parameter Store.

You can check the updated classifier and endpoint from the Parameter Store.

You can check the updated classifier and endpoint from the Parameter Store.

Cleaning up

To avoid incurring any charges in the future, delete the CloudFormation stack. This removes all the resources you created as part of this post.

Conclusion

Active learning in custom classification models ensures that your models are kept up to date with new data and patterns. This two-part series provides you with a reference architecture to build an active learning workflow comprised of real-time classification APIs, feedback loops, a human review workflow, model building, model selection, and model deployment. For more information about the feedback loops and human review workflow, see the second part of this blog series, Active learning workflow for Amazon Comprehend custom classification models – Part 2.

For more information about custom classification in Amazon Comprehend, see Custom Classification. You can discover other Amazon Comprehend features and get inspiration from other AWS blog posts about how to use Amazon Comprehend beyond classification.


About the Authors

Shanthan Kesharaju is a Senior Architect in the AWS ProServe team. He helps our customers with AI/ML strategy, architecture, and developing products with a purpose. Shanthan has an MBA in Marketing from Duke University and an MS in Management Information Systems from Oklahoma State University.

 

 

Marty Jiang is a Conversational AI Consultant with AWS Professional Services. Outside of work, he loves spending time outdoors with his family and exploring new technologies.

Read More