Workshop at ICLR 2021 unites communities investigating synthetic data generation to improve machine learning and protect privacy.
Introducing hierarchical deletion to easily clean up unused resources in Amazon Forecast
Amazon Forecast just launched the ability to hierarchically delete resources at a parent level without having to locate the child resources. You can stay focused on building value-adding forecasting systems and not worry about trying to manage individual resources that are created in your workflow. Forecast uses machine learning (ML) to generate more accurate demand forecasts, without requiring any prior ML experience. Forecast brings the same technology used at Amazon.com to developers as a fully managed service, removing the need to manage resources or rebuild your systems.
When importing data, training a predictor, and creating forecasts, Forecast generates resources related to the dataset group. For example, when a predictor is generated using a dataset group, the predictor is the child resource and the dataset group is the parent resource. Previously, it was difficult to delete resources while building your forecasting system because you had to delete the child resources first, and then delete the parent resources. This was especially difficult and time-consuming because deleting resources required you to understand the various resource hierarchies, which weren’t immediately visible.
As you experiment and create multiple dataset groups, predictors, and forecasts, the resource hierarchy can become complicated. However, this streamlined hierarchical deletion method allows you to quickly clean up resources without having to worry about understanding the resource hierarchy.
In this post, we walk through the Forecast console experience of deleting all the resource types that are supported by Forecast. You can also perform hierarchical deletion by referencing the Deleting Resources page. To delete individual or child resources one at a time, you can continue to use the existing APIs such as DeleteDataset, DeleteDatasetGroup, DeleteDatasetImportJob, DeleteForecast, DeleteForecastExportJob, DeletePredictor, and DeletePredictorBacktestExportJob.
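As a minimal boto3 sketch (not the exact code from the documentation), the following deletes a dataset group and all of its child resources in one call, assuming the DeleteResourceTree API that backs this capability; the ARN is a placeholder you would replace with your own.

import boto3

forecast = boto3.client("forecast")

# Placeholder ARN: replace with the ARN of your own dataset group
dataset_group_arn = "arn:aws:forecast:us-east-1:123456789012:dataset-group/my_dataset_group"

# Deletes the dataset group and all of its child resources
# (predictors, predictor backtest exports, forecasts, and forecast exports)
forecast.delete_resource_tree(ResourceArn=dataset_group_arn)

# Child resources can still be deleted individually, for example:
# forecast.delete_forecast(ForecastArn="arn:aws:forecast:us-east-1:123456789012:forecast/my_forecast")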
Delete dataset group resources
To delete a dataset group when it doesn’t have any child resources, a simple dialog is displayed. You can delete the chosen resource by entering delete and choosing Delete.
When a dataset group has underlying child resources such as predictors, predictor backtest export jobs, forecasts, and forecast export jobs, a different dialog is displayed. After you enter delete and choose Delete, all these child resources are deleted, including the selected dataset group resource.
Delete dataset resources
For a dataset resource without child resources, a simple dialog is displayed during the delete operation.
When a dataset has child dataset import jobs, the following dialog is displayed.
Delete predictor resources
For a predictor resource without child resources, the following simple dialog is displayed.
When the predictor resource has underlying child resources such as predictor backtest export jobs, forecasts, or forecast export jobs, the following dialog is displayed. If you proceed with the delete action, all these child resources are deleted, including the selected predictor resource.
Delete a forecast resource
For a forecast resource without child resources, the following dialog is displayed.
When a forecast resource has underlying child resources such as forecast export jobs, the following dialog is displayed.
Delete dataset import job, predictor backtest export job, or forecast export job resources
The dataset import job, predictor backtest export job, and forecast export job resources don’t have any child resources. Therefore, when you choose to delete any of these resources via the Forecast console, a simple delete dialog is displayed. When you proceed with the delete, only the selected resources are deleted.
For example, when deleting a dataset import job resource, the following dialog is displayed.
Conclusion
You now have more flexibility when deleting a resource or an entire hierarchy of resources. To get started with this capability, see the Deleting Resources page and go through the notebook in our GitHub repo that walks you through how to perform hierarchical deletion. You can use this capability in all Regions where Forecast is publicly available. For more information about Region availability, see AWS Regional Services.
About the Authors
Alex Kim is a Sr. Product Manager for Amazon Forecast. His mission is to deliver AI/ML solutions to all customers who can benefit from it. In his free time, he enjoys all types of sports and discovering new places to eat.
Ranga Reddy Pallelra works as an SDE on the Amazon Forecast team. In his current role, he works on large-scale distributed systems with a focus on AI/ML. In his free time, he enjoys listening to music, watching movies, and playing racquetball.
Shannon Killingsworth is a UX Designer for Amazon Forecast and Amazon Personalize. His current work is creating console experiences that are usable by anyone, and integrating new features into the console experience. In his spare time, he is a fitness and automobile enthusiast.
Caltech names eight AI4Science fellows supported by Amazon
Amazon is collaborating with Caltech to support research, education, and outreach programs that help build bridges between AI and other areas of science and engineering.
Translate All: Automating multiple file type batch translation with AWS CloudFormation
This is a guest post by Cyrus Wong, an AWS Machine Learning Hero. You can learn more about and connect with AWS Machine Learning Heroes at the community page.
On July 29, 2020, AWS announced that Amazon Translate now supports Microsoft Office documents, including .docx, .xlsx, and .pptx.
The world is full of bilingual countries and cities like Hong Kong. I find myself always needing to prepare Office documents and presentation slides in both English and Chinese. Previously, it could be quite time-consuming to prepare the translated documents manually, and this approach can also lead to more errors. If I try to just select all, copy, and paste into a translation tool, then copy and paste the result into a new file, I lose all the formatting and images! My old method was to copy content piece by piece, translate it, then copy and paste it into the original document, over and over again. The new support for Office documents in Amazon Translate is really great news for teachers like me. It saves you a lot of time!
Still, we have to sort the documents by their file types and call Amazon Translate separately for different file types. For example, if I have notes in .docx files, presentations in .pptx files, and data in .xlsx files, I still have to sort them by their file type and send different TextTranslation API calls. In this post, I show how to sort content by document type and make batch translation calls. This solution automates the undifferentiated task of sorting files.
For my workflow, I need to upload all my course materials in a single Amazon Simple Storage Service (Amazon S3) bucket. This bucket often includes different types of files in one folder and subfolders.
However, when I start the Amazon Translate job on the console, I have to choose the file content type. The problem arises when different file types are in one folder without subfolders.
Therefore, our team developed a solution that we’ve called Translate All—a simple AWS serverless application to resolve those challenges and make it easier to integrate with other projects. A serverless architecture is a way to build and run applications and services without having to manage infrastructure. Your application still runs on servers, but all the server management is done by AWS. You no longer have to provision, scale, and maintain servers to run your applications, databases, and storage systems. For more information about serverless computing, see Serverless on AWS.
Solution overview
We use the following AWS services to run this solution:
- AWS Lambda – This serverless compute service lets you run code without provisioning or managing servers, creating workload-aware cluster scaling logic, maintaining event integrations, or managing runtimes. With Lambda, you can run code for virtually any type of application or backend service—all with zero administration.
- Amazon Simple Notification Service – Amazon SNS is a fully managed messaging service for both application-to-application (A2A) and application-to-person (A2P) communication. The A2A pub/sub functionality provides topics for high-throughput, push-based, many-to-many messaging between distributed systems, microservices, and event-driven serverless applications.
- Amazon Simple Queue Service – Amazon SQS is a fully managed message queuing service that enables you to decouple and scale microservices, distributed systems, and serverless applications. Amazon SQS eliminates the complexity and overhead associated with managing and operating message-oriented middleware, and empowers developers to focus on differentiating work.
- AWS Step Functions – This serverless function orchestrator makes it easy to sequence Lambda functions and multiple AWS services into business-critical applications. Through its visual interface, you can create and run a series of checkpointed and event-driven workflows that maintain the application state. The output of one step acts as an input to the next. Each step in your application runs in order, as defined by your business logic.
In our solution, if a JSON message is sent to the SQS queue, it triggers a Lambda function to start a Step Functions state machine (see the following diagram).
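The Lambda function that bridges Amazon SQS and Step Functions can be very small. The following is a minimal sketch, not the actual function in the application; the STATE_MACHINE_ARN environment variable is a hypothetical name used here for illustration.

import os

import boto3

sfn = boto3.client("stepfunctions")

def handler(event, context):
    # Each SQS record body is the JSON translation request shown later in this post
    for record in event["Records"]:
        sfn.start_execution(
            stateMachineArn=os.environ["STATE_MACHINE_ARN"],  # hypothetical env var
            input=record["body"],
        )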
The state machine includes the following high-level steps:
- In the Copy to Type Folder stage, the state machine gets all the keys under InputS3Uri and copies each type of file into a type-specific, individual folder. If subfolders exist, it replaces / with _ForwardSlash_.
The following screenshot shows the code for this step.
The following screenshot shows the output files.
- The Parallel Map stage arranges the data according to contentTypes and starts the translation job workflow in parallel.
- The Start Translation Job workflow starts the translation jobs and loops, checking completion status, until all translation jobs are complete (a sketch of the underlying Amazon Translate call follows this list).
- In the Copy to Parent Folder stage, the state machine reconstructs the original input folder structure and generates a signed URL that remains valid for 7 days.
- The final stage publishes the results to Amazon SNS.
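Under the hood, each branch of the Parallel Map stage calls the Amazon Translate batch API for its content type. The following is a minimal sketch of such a call, not the exact code from the application; the buckets, prefix, and role ARN are placeholders.

import boto3

translate = boto3.client("translate")

# Placeholders: substitute your own buckets, prefix, and IAM role ARN
translate.start_text_translation_job(
    JobName="testing-docx",
    InputDataConfig={
        "S3Uri": "s3://my-input-bucket/test/docx/",
        # One job per content type; this example is for .docx files
        "ContentType": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    },
    OutputDataConfig={"S3Uri": "s3://my-output-bucket/test/"},
    DataAccessRoleArn="arn:aws:iam::123456789012:role/TranslateDataAccessRole",
    SourceLanguageCode="en",
    TargetLanguageCodes=["zh-TW"],
)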
Additional considerations
When implementing this solution, consider the following:
- As of this writing, we just handle the happy path and assume that all jobs are in Completed status at the end
- The default job completion maximum period is 180 minutes; you can change the NumberOfIteration variable to extend it as needed
- You can't use the following reserved words in file or folder names: !!plain!!, !!html!!, !!document!!, !!presentation!!, !!sheet!!, or _ForwardSlash_
Deploy the solution
To deploy this solution, complete the following steps:
- Open the Serverless Application Repository link.
- Select I acknowledge that this app creates custom IAM roles.
- Choose Deploy.
- When the AWS CloudFormation console appears, note the input and output parameters on the Outputs tab.
Test the solution
In this section, we walk you through using the application.
- On the Amazon SNS console, subscribe your email address to TranslateCompletionSNSTopic.
- Upload all your files to the InputBucket bucket.
- Send a message to TranlateQueue. See the following example code:
{
"JobName": "testing",
"InputBucket": "//enter your InputBucket from the CloudFormation console//",
"InputS3Uri": "test",
"OutputBucket": "//enter your OutputBucket from the CloudFormation console//",
"SourceLanguageCode": "en",
"TargetLanguageCodes": [
"zh-TW"
]
}
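If you'd rather send this message programmatically than through the console, a minimal boto3 sketch follows; the bucket names and queue URL are placeholders you would take from the deployed stack's outputs.

import json

import boto3

sqs = boto3.client("sqs")

message = {
    "JobName": "testing",
    "InputBucket": "my-input-bucket",    # from the CloudFormation outputs
    "InputS3Uri": "test",
    "OutputBucket": "my-output-bucket",  # from the CloudFormation outputs
    "SourceLanguageCode": "en",
    "TargetLanguageCodes": ["zh-TW"],
}

# Placeholder queue URL: use the TranlateQueue URL created by the stack
sqs.send_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/TranlateQueue",
    MessageBody=json.dumps(message),
)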
You receive the translation job result as an email.
The email contains a presigned URL with a 7-day validity period. You can share the translated file without having to sign in to the AWS Management Console.
Conclusion
With this solution, my colleagues and I easily resolved our course materials translation problem. We saved a lot of time compared to opening the files one by one and copy-and-pasting repeatedly. The translation quality is good and eliminates the potential for errors that can often come with undifferentiated manual workflows. Now we can just use the AWS Command Line Interface (AWS CLI) to run the Amazon S3 sync command to upload our files into an S3 bucket and translate all the course materials at once. Using this tool to leverage a suite of powerful AWS services has empowered my team to spend less time processing course materials and more time educating the next generation of cloud technology professionals!
Project collaborators include Mike Ng, Technical Program Intern at AWS, Brian Cheung, Sam Lam, and Pearly Law from the IT114115 Higher Diploma in Cloud and Data Centre Administration. This post was edited with Greg Rushing’s contribution.
The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.
About the Author
Cyrus Wong is Data Scientist of Cloud Innovation Centre at the IT Department of the Hong Kong Institute of Vocational Education (Lee Wai Lee). He has achieved all 13 AWS Certifications and actively promotes the use of AWS in different media and events. His projects received four Hong Kong ICT Awards in 2014, 2015, and 2016, and all winning projects are running solely on AWS with Data Science and Machine Learning.
Scale session-aware real-time product recommendations on Shopify with Amazon Personalize and Amazon EventBridge
This is a guest post by Jeff McKelvey, Principal Development Lead at HiConversion. The team at HiConversion has collaborated closely with James Jory, Applied AI Services Solutions Architect at AWS, and Matt Chwastek, Senior Product Manager for Amazon Personalize at AWS. In their own words, “HiConversion is the eCommerce Intelligence platform helping merchants personalize and optimize shopping experiences for every visitor session.”
Shopify powers over 1 million online businesses worldwide. It’s an all-in-one commerce platform to start, run, and grow a brand. Shopify’s mission is to reduce the barriers to business ownership, making commerce more equitable for everyone.
With over 50% of ecommerce sales coming from mobile shoppers, one of the challenges limiting future growth for Shopify’s merchants is effective product discovery. If visitors can’t quickly find products of interest on a merchant’s site, they leave, often for good.
That’s why we introduced HiConversion Recommend, a Shopify Plus certified application powered by Amazon Personalize. This application helps Shopify merchants deliver personalized product discovery experiences based on a user’s in-session behavior and interests directly on their own storefront.
We chose to integrate Amazon Personalize into the HiConversion Recommend application because it makes the same machine learning (ML) technology used by Amazon.com accessible to more Shopify merchants. This enables merchants to generate product recommendations that adapt to visitor actions and behavioral context in real time.
In this post, we describe the architectures used in our application for serving recommendations as well as synchronizing events and catalog updates in real time. We also share some of the results for session-based personalization from a customer using the application.
Private, fully managed recommendation systems
Amazon Personalize is an AI service from AWS that provides multiple ML algorithms purpose-built for personalized recommendation use cases. When a Shopify merchant installs the HiConversion Recommend application, HiConversion provisions a dedicated, private environment, represented as a dataset group within Amazon Personalize, for that merchant.
Then data from the merchant’s catalog as well as the browsing and purchase history of their shoppers is uploaded into datasets within the dataset group. Private ML models are trained using that data and deployed to unique API endpoints to provide real-time recommendations. Therefore, each merchant has their own private ML-based recommendation system, isolated from other merchants, which is fully managed by HiConversion.
HiConversion also creates and manages the resources needed to stream new events and catalog updates from a merchant’s Shopify storefront directly into Amazon Personalize. This enables the real-time capabilities of Amazon Personalize, such as learning the interests of new shoppers to the storefront, adapting to each evolving shopper intent, and incorporating new products in recommendations.
We can also apply business rules using Amazon Personalize filters, enabling merchants to tailor recommendations to a particular category of products, exclude recently purchased products from being recommended, and more.
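As an illustration (not the application's actual code), a filter that excludes recently purchased items could be created roughly like this; the filter name, dataset group ARN, and event type value are placeholders.

import boto3

personalize = boto3.client("personalize")

# Placeholder dataset group ARN for a single merchant's private environment
personalize.create_filter(
    name="exclude-purchased-items",
    datasetGroupArn="arn:aws:personalize:us-east-1:123456789012:dataset-group/merchant-123",
    # Exclude items the shopper has already purchased in this dataset group
    filterExpression='EXCLUDE ItemID WHERE Interactions.EVENT_TYPE IN ("purchase")',
)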
Serving millions of online shoppers in real time
Creating a premium, self-service Shopify application based on Amazon Personalize required the automation of many processes. Our goal was to democratize access to an advanced product discovery solution, making it easy to use by anyone running their store on Shopify.
To provide a seamless, real-time personalized user experience, an event-driven approach was needed to ensure that Shopify, Amazon Personalize, and HiConversion had the same picture of the visitor and product catalog at all times. For this, we chose to use Shopify’s integration with Amazon EventBridge as well as Amazon Simple Queue Service (Amazon SQS) and AWS Lambda.
The following high-level diagram illustrates how HiConversion Recommend manages the data connections between users and their product recommendations.
As shown in our diagram, AWS Lambda@Edge connects with three independent systems that provide the essential application capabilities:
- Amazon Personalize campaign – A custom API endpoint enabling real-time product recommendations based on Amazon Personalize algorithms.
- HiConversion Rich Data endpoint – This enables hybrid product recommendations based on a mix of HiConversion visitor and web analytics, and Amazon Personalize ranking algorithms.
- Amazon CloudFront endpoint – This enables rapid access to product metadata, like product images, pricing, and inventory, in combination with Amazon Simple Storage Service (Amazon S3).
When a Shopify merchant activates the HiConversion Recommend application, all of this infrastructure is automatically provisioned and training of the Amazon Personalize ML models is initiated—dramatically reducing the time to go live.
Why Lambda@Edge?
According to a 2017 Akamai study, a 100-millisecond delay in website load time can hurt conversion rates by up to 7%. Bringing an application to Shopify’s global network of stores means we had to prioritize performance globally.
We use Lambda@Edge on the front end of our application to track visitor activity and contextual data, allowing us to deliver product recommendations to visitors on Shopify-powered ecommerce sites with low latency. Putting our code as close as possible to shoppers improves overall performance and leads to reduced latency.
We chose Lambda@Edge to maximize the availability of our content delivery network. Lambda@Edge also removes the need to provision or manage infrastructure in multiple locations around the world; it allows us to reduce costs while providing a highly scalable system.
Scaling with EventBridge for Shopify
Our application launched during the busiest time of the year—the holiday shopping season. One thing that stood out was the massive increase in promotional and business activity from our live customers. Due to those promotions, the frequency of catalog changes in our clients’ Shopify stores rapidly increased.
Our original implementation relied on Shopify webhooks, allowing us to take specific actions in response to connected events. Due to the increasing volume of data through our application, we realized that keeping the product metadata and the real-time product recommendations in sync was becoming problematic.
This was particularly common when large merchants started to use our application, or when merchants launched flash sales. The subsequent firehose of incoming data meant that our application infrastructure was at risk of not being able to keep up with the onslaught of web traffic, leading to broken shopping experiences for customers.
We needed a separate, more scalable solution.
We needed a solution that would scale with our customer base and our customers’ traffic. Enter EventBridge: a serverless, event-driven alternative to receiving webhooks via standard HTTP. Integrating with EventBridge meant that Shopify could directly send event data securely to AWS, instead of handling all that traffic within our own application.
Event-driven solutions like EventBridge provide a scalable buffer between our application and our addressable market of hundreds of thousands of live Shopify stores. It allows us to process events at the rate that works for our tech stack without getting overwhelmed. It’s highly scalable and resilient, is able to accept more event-based traffic, and reduces our infrastructure cost and complexity.
The following diagram illustrates how HiConversion Recommend uses EventBridge to enable our real-time product recommendation architecture with Amazon Personalize.
The architecture includes the following components:
- The Amazon Personalize putEvents() API enables product recommendations that consider real-time visitor actions and context. Visitor activity and contextual data are captured by the HiConversion web analytics module and sent to a Lambda@Edge function. We then use Amazon SQS and a Lambda function to stream events to an Amazon Personalize event tracker endpoint (a sketch of this call follows the list).
- EventBridge notifies Amazon Personalize about product catalog changes via a Lambda function dedicated to that purpose. For example, Amazon Personalize can recommend new products even when they don’t have prior order history.
- EventBridge also keeps Shopify product metadata in sync with HiConversion’s metadata stored in Amazon S3 for real-time delivery via Amazon CloudFront.
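The following is a minimal sketch of the putEvents() call that streams a shopper interaction to an Amazon Personalize event tracker; the tracking ID, identifiers, and event type are placeholders, not values from the HiConversion application.

from datetime import datetime, timezone

import boto3

personalize_events = boto3.client("personalize-events")

# Placeholders: the tracking ID comes from the merchant's event tracker
personalize_events.put_events(
    trackingId="11111111-2222-3333-4444-555555555555",
    userId="visitor-123",
    sessionId="session-456",
    eventList=[
        {
            "eventType": "product_viewed",
            "itemId": "sku-789",
            "sentAt": datetime.now(timezone.utc),
        }
    ],
)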
Ultimately, EventBridge replaced an undifferentiated custom implementation within our architecture with a fully managed solution able to automatically scale, allowing us to focus on building features that deliver differentiated value to our customers.
Measuring the effectiveness of session-based product recommendations on Shopify
Many product recommendation solutions are available to Shopify merchants, each using different types of algorithms. To measure the effectiveness of session-based recommendations from Amazon Personalize, and to indulge our data-curious team culture, we ran an objective experiment.
Selecting a technology in today's economy is challenging, so we designed this experiment to assist merchants in determining for themselves the effectiveness of Amazon Personalize session-based algorithms.
We started with a hypothesis: If session-based recommendation algorithms can adapt to visitors’ actions and context in real time, they should produce improved results when visitor intent and preferences suddenly shift.
To test our hypothesis, we identified a predictable shift in intent and preferences driven by the following factors:
- Visitor preferences – Before the holiday season, visitors are typically buying something for themselves, whereas during the holiday season, visitors are also buying things for others.
- Visitor profiles – Visitor profiles before the holiday season are different than during the holidays. The holiday season sees more new visitors who have never purchased before and whose preferences are unknown.
- Brand promotions – During the holiday season, brands aggressively promote their offerings, which impact visitor behavior and decision-making.
To evaluate our hypothesis, we compared pre-holiday product recommendation results with results from the peak holiday period. One of our clients, a large and successful cosmetics brand, found that product recommendations improved Revenue Per Visitor (RPV) by 113% when compared to the pre-holiday period.
Pre-holiday product recommendation results
First, we looked at the percentage of overall revenue impacted by personalized product recommendations. For this client, only 7.7% of all revenues were influenced by personalized product recommendations compared to non-personalized experiences.
Second, we looked at the RPV—the most important metric for measuring new ecommerce revenue growth. Conversion Rate (CR) or Average Order Value (AOV) only tell us part of the story and can be misleading if relied on alone.
For example, a merchant can increase the site conversion rate with aggressive promotions that actually lead to a decline in the average order value, netting a drop in overall revenue.
Based on our learnings, at HiConversion we evangelize RPV as the metric to use when measuring the effectiveness of an ecommerce product recommendation solution.
In this example, visitors who engaged with recommended products had over 175% higher RPV than visitors who did not.
Our analysis illustrates that session-based product recommendations are very effective. If recommendations weren’t effective, visitors who engaged with recommendations wouldn’t have seen a higher RPV when compared with those that didn’t engage with recommended products.
Peak-holiday product recommendation results
A leading indicator that session-based recommendations were working was the increase in the percentage of overall sales influenced by personalized recommendations. It grew from 7.7% before the holidays to 14.6% during the holiday season.
This data is even more impressive when we looked at RPV lift. Visitors who engaged with personalized recommendations had over 259% higher RPV than those who didn’t.
In comparison, session-based recommendations during the holidays outperformed the pre-holiday RPV lift, as the following table shows.
| | Before Holidays | During Holidays | Relative Lift |
| --- | --- | --- | --- |
| RPV Lift | 175.04% | 258.61% | 47.74% |
New revenue calculations
Based on the preceding data points, we can calculate new revenues attributable directly to HiConversion Recommend.
| | Before Holidays | During Holidays |
| --- | --- | --- |
| RPV (personalized) | $6.84 | $14.70 |
| RPV (non-personalized) | $2.49 | $4.10 |
| Visits (personalized) | 33,862 | 97,052 |
| Visits (non-personalized) | 1,143,147 | 2,126,693 |
| Revenue attributable to HiConversion Recommend | $147,300 | $1,028,751 |
| % of all revenue attributable to HiConversion Recommend | 5% | 10% |
These calculations make a strong case for the high ROI of using HiConversion Recommend, when considering the new revenue potential created by session-based recommendations.
Conclusions
Product recommendations for Shopify—powered by Amazon Personalize—are an effective way of engaging and converting more new shoppers. To prove it, we have built a challenge to show you how quickly you can achieve measurable, positive ROI. To get started, sign up for the 7-Day Product Recommendation Challenge.
A well-designed and scalable solution is particularly important when serving a massive, global customer base. And because session-based, real-time personalization is a differentiator that drives ecommerce growth, it's extremely important to consider the best technology partner for your business.
About the Authors
Jeff McKelvey is the Principal Development Lead at HiConversion.
James Jory is a Solutions Architect in Applied AI with AWS. He has a special interest in personalization and recommender systems and a background in ecommerce, marketing technology, and customer data analytics. In his spare time, he enjoys camping and auto racing simulation.
Matt Chwastek is a Senior Product Manager for Amazon Personalize. He focuses on delivering products that make it easier to build and use machine learning solutions. In his spare time, he enjoys reading and photography.
Annotate dense point cloud data using SageMaker Ground Truth
Autonomous vehicle companies typically use LiDAR sensors to generate a 3D understanding of the environment around their vehicles. For example, they mount a LiDAR sensor on their vehicles to continuously capture point-in-time snapshots of the surrounding 3D environment. The LiDAR sensor output is a sequence of 3D point cloud frames (the typical capture rate is 10 frames per second). Amazon SageMaker Ground Truth makes it easy to label objects in a single 3D frame or across a sequence of 3D point cloud frames for building machine learning (ML) training datasets. Ground Truth also supports sensor fusion of camera and LiDAR data with up to eight video camera inputs.
As LiDAR sensors become more accessible and cost-effective, customers are increasingly using point cloud data in new spaces like robotics, signal mapping, and augmented reality. Some new mobile devices even include LiDAR sensors, one of which supplied the data for this post! The growing availability of LiDAR sensors has increased interest in point cloud data for ML tasks, like 3D object detection and tracking, 3D segmentation, 3D object synthesis and reconstruction, and even using 3D data to validate 2D depth estimation.
Although dense point cloud data is rich in information (often over 1 million points), it's challenging to label because labeling workstations often have limited memory and graphics capabilities, and annotators tend to be geographically distributed, which can increase latency. Even when large numbers of points are renderable in a labeler's workstation, labeler throughput can be reduced due to rendering time when dealing with multi-million-point clouds, greatly increasing labeling costs and reducing efficiency.
A way to reduce these costs and time is to convert point cloud labeling jobs into smaller, more easily rendered tasks that preserve most of the point cloud’s original information for annotation. We refer to these approaches broadly as downsampling, similar to downsampling in the signal processing domain. Like in the signal processing domain, point cloud downsampling approaches attempt to remove points while preserving the fidelity of the original point cloud. When annotating downsampled point clouds, you can use the output 3D cuboids for object tracking and object detection tasks directly for training or validation on the full-size point cloud with little to no impact on model performance while saving labeling time. For other modalities, like semantic segmentation, in which each point has its own label, you can use your downsampled labels to predict the labels on each point in the original point cloud, allowing you to perform a tradeoff between labeler cost (and therefore amount of labeled data) and a small amount of misclassifications of points in the full-size point cloud.
In this post, we walk through how to perform downsampling techniques to prepare your point cloud data for labeling, then demonstrate how to upsample your output labels to apply to your original full-size dataset using some in-sample inference with a simple ML model. To accomplish this, we use Ground Truth and Amazon SageMaker notebook instances to perform labeling and all preprocessing and postprocessing steps.
The data
The data we use in this post is a scan of an apartment building rooftop generated using the 3D Scanner App on an iPhone12 Pro. The app allows you to use the built-in LiDAR scanners on mobile devices to scan a given area and export a point cloud file. In this case, the point cloud data is in xyzrgb format, an accepted format for a Ground Truth point cloud. For more information about the data types allowed in a Ground Truth point cloud, see Accepted Raw 3D Data Formats.
The following image shows our 3D scan.
Methods
We first walk through a few approaches to reduce dataset size for labeling point clouds: tiling, fixed step sample, and voxel mean. We demonstrate why downsampling techniques can increase your labeling throughput without significantly sacrificing annotation quality, and then we demonstrate how to use labels created on the downsampled point cloud and apply them to your original point cloud with an upsampling approach.
Downsampling approaches
Downsampling is taking your full-size dataset and either choosing a subset of points from it to label, or creating a representative set of new points that aren’t necessarily in the original dataset, but are close enough to allow labeling.
Tiling
One naive approach is to break your point cloud space into 3D cubes, otherwise known as voxels, of (for example) 500,000 points each that are labeled independently in parallel. This approach, called tiling, effectively reduces the scene size for labeling.
However, it can greatly increase labeling time and costs, because a typical 8-million-point scene may need to be broken up into over 16 sub-scenes. The large number of independent tasks that result from this method means more annotator time is spent on context switching between tasks, and workers may lose context when the scene is too small, resulting in mislabeled data.
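The notebook in this post doesn't implement tiling, but a rough NumPy sketch of the idea, splitting a point cloud into fixed-size 3D tiles keyed by voxel index, might look like the following; the tile size is an arbitrary illustrative value.

import numpy as np

def tile_point_cloud(pc, tile_size=5.0):
    """Group points into cubic tiles of side length tile_size (same units as pc).

    pc is an (N, D) array whose first three columns are x, y, z.
    Returns a dict mapping an (i, j, k) tile index to the points in that tile.
    """
    tile_ids = np.floor(pc[:, :3] / tile_size).astype(int)
    tiles = {}
    for idx in np.unique(tile_ids, axis=0):
        mask = np.all(tile_ids == idx, axis=1)
        tiles[tuple(idx)] = pc[mask]
    return tiles

# Each tile could then be exported as its own, smaller Ground Truth labeling task
# tiles = tile_point_cloud(pc, tile_size=5.0)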
Fixed step sample
An alternative approach is to select or create a reduced number of points by a linear subsample, called a fixed step sample. Let's say you want to hit a target of 500,000 points (we have observed this is generally renderable on a consumer laptop—see Accepted Raw 3D Data Formats), but you have a point cloud with 10 million points. You can calculate your step size as step = 10,000,000 / 500,000 = 20. After you have a step size, you can select every 20th point in your dataset, creating a new point cloud. If your point cloud data is of high enough density, labelers should still be able to make out any relevant features for labeling even though you may only have 1 point for every 20 in the original scene.
The downside of this approach is that not all points contribute to the final downsampled result, meaning that if a point is one of few important ones, but not part of the sample, your annotators may miss the feature entirely.
Voxel mean
An alternate form of downsampling that uses all points to generate a downsampled point cloud is to perform grid filtering. Grid filtering means you break the input space into regular 3D boxes (or voxels) across the point cloud and replace all points within a voxel with a single representative point (the average point, for example). The following diagram shows an example voxel (the red box).
If no points exist from the input dataset within a given voxel, no point is added to the downsampled point cloud for that voxel. Grid filtering differs from a fixed step sample because you can use it to reduce noise and further tune it by adjusting the kernel size and averaging function to result in slightly different final point clouds. The following point clouds show the results of simple (fixed step sample) and advanced (voxel mean) downsampling. The point cloud downsampled using the advanced method is smoother; this is particularly noticeable when comparing the red brick wall in the back of both scenes.
Upsampling approach
After downsampling and labeling your data, you may want to see the labels produced on the smaller, downsampled point cloud projected onto the full-size point cloud, which we call upsampling. Object detection or tracking jobs don’t require post-processing to do this. Labels in the downsampled point cloud (like cuboids) are directly applicable to the larger point cloud because they’re defined in a world coordinate space shared by the full-size point cloud (x, y, z, height, width, length). These labels are minimally susceptible to very small errors along the boundaries of objects when a boundary point wasn’t in the downsampled dataset, but such occasional and minor errors are outweighed by the number of extra, correctly labeled points within the cuboid that can also be trained on.
For 3D point cloud semantic segmentation jobs, however, the labels aren’t directly applicable to the full-size dataset. We only have a subset of the labels, but we want to predict the rest of the full dataset labels based on this subset. To do this, we can use a simple K-Nearest Neighbors (K-NN) classifier with the already labeled points serving as the training set. K-NN is a simple supervised ML algorithm that predicts the label of a point using the “K” closest labeled points and a weighted vote. With K-NN, we can predict the point class of the rest of the unlabeled points in the full-size dataset based on the majority class of the three closest (by Euclidean distance) points. We can further refine this approach by varying the hyperparameters of a K-NN classifier, like the number of closest points to consider as well as the distance metric and weighting scheme of points.
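The notebook later shows a basic K-NN upsampling step. As an illustration of the hyperparameter tuning mentioned above (not code from the notebook), you could compare uniform and distance-weighted voting on a held-out labeled tile roughly like this; the input arrays are assumed to come from your own labeled data.

from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

def tune_knn(train_xyz, train_labels, tile_xyz, tile_labels):
    """Compare K-NN settings using a held-out labeled tile as ground truth."""
    results = {}
    for weights in ("uniform", "distance"):
        for k in (3, 5, 10):
            knn = KNeighborsClassifier(n_neighbors=k, weights=weights)
            knn.fit(train_xyz, train_labels)
            acc = accuracy_score(tile_labels, knn.predict(tile_xyz))
            results[(k, weights)] = acc
            print(f"k={k}, weights={weights}: accuracy={acc:.3f}")
    return results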
After you map the sample labels to the full dataset, you can visualize tiles within the full-size dataset to see how well the upsampling strategy worked.
Now that we’ve reviewed the methods used in this post, we demonstrate these techniques in a SageMaker notebook on an example semantic segmentation point cloud scene.
Prerequisites
To walk through this solution, you need the following:
- An AWS account.
- A notebook AWS Identity and Access Management (IAM) role with the permissions required to complete this walkthrough. Your IAM role must have the following AWS managed policies attached:
AmazonS3FullAccess
AmazonSageMakerFullAccess
- An Amazon Simple Storage Service (Amazon S3) bucket where the notebook artifacts (input data and labels) are stored.
- A SageMaker work team. For this post, we use a private work team. You can create a work team on the SageMaker console.
Notebook setup
We use the notebook ground_truth_annotation_dense_point_cloud_tutorial.ipynb
in the SageMaker Examples section of a notebook instance to demonstrate these downsampling and upsampling approaches. This notebook contains all code required to perform preprocessing, labeling, and postprocessing.
To access the notebook, complete the following steps:
- Create a notebook instance. You can use the ml.t2.xlarge instance type to launch the notebook instance. Choose an instance with at least 16 GB of RAM.
- You need to use the notebook IAM role you created earlier. This role allows your notebook to upload your dataset to Amazon S3 and call the solution APIs.
- Open Jupyter Lab or Jupyter to access your notebook instance.
- In Jupyter, choose the SageMaker Examples tab. In JupyterLab, choose the SageMaker icon.
- Choose Ground Truth Labeling Jobs and then choose the ipynb notebook.
- If you're using Jupyter, choose Use to copy the notebook to your instance and run it. If you're in JupyterLab, choose Create a Copy.
Provide notebook inputs
First, we modify the notebook to add our private work team ARN and the bucket location we use to store our dataset as well as our labels.
Section 1: Retrieve the dataset and visualize the point cloud
We download our data by running Section 1 of our notebook, which downloads our dataset from Amazon S3 and loads the point cloud into our notebook instance. We download custom prepared data from an AWS owned bucket. An object called rooftop_12_49_41.xyz
should be in the root of the S3 bucket. This data is a scan of an apartment building rooftop custom generated on a mobile device. In this case, the point cloud data is in xyzrgb format.
We can visualize our point cloud using the Matplotlib scatter3d function. The point cloud file contains all the correct points but isn’t rotated correctly. We can rotate the object around its axis by multiplying the point cloud by a rotation matrix. We can obtain a rotation matrix using scipy and specify the degree changes we want to make to each axis using the from_euler
method:
!aws s3 cp s3://smgt-downsampling-us-east-1-322552456788/rooftop_12_49_41.xyz pointcloud.xyz

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.transform import Rotation

# Let's read our dataset into a numpy array
pc = np.loadtxt("pointcloud.xyz", delimiter=",")
print(f"Loaded points of shape {pc.shape}")

# playing with view of 3D scene
def plot_pointcloud(pc, rot=[[30, 90, 60]], color=True, title="Simple Downsampling 1", figsize=(50, 25), verbose=False):
    if rot:
        rot1 = Rotation.from_euler('zyx', rot, degrees=True)
        R1 = rot1.as_matrix()
        if verbose:
            print('Rotation matrix:', '\n', R1)
        # matrix multiplication between our rotation matrix and pointcloud
        pc_show = np.matmul(R1, pc.copy()[:, :3].transpose()).transpose()
        if color:
            try:
                rot_color1 = np.matmul(R1, pc.copy()[:, 3:].transpose()).transpose().squeeze()
            except:
                rot_color1 = np.matmul(R1, np.tile(pc.copy()[:, 3], (3, 1))).transpose().squeeze()
    else:
        pc_show = pc
    fig = plt.figure(figsize=figsize)
    ax = fig.add_subplot(111, projection="3d")
    ax.set_title(title, fontdict={'fontsize': 20})
    if color:
        ax.scatter(pc_show[:, 0], pc_show[:, 1], pc_show[:, 2], c=rot_color1[:, 0], s=0.05)
    else:
        ax.scatter(pc_show[:, 0], pc_show[:, 1], pc_show[:, 2], c='blue', s=0.05)

# rotate in z direction 30 degrees, y direction 90 degrees, and x direction 60 degrees
rot1 = Rotation.from_euler('zyx', [[30, 90, 60]], degrees=True)
print('Rotation matrix:', '\n', rot1.as_matrix())

plot_pointcloud(pc, rot=[[30, 90, 60]], color=True, title="Full pointcloud", figsize=(50, 30))
Section 2: Downsample the dataset
Next, we downsample the dataset to less than 500,000 points, which is an ideal number of points for visualizing and labeling. For more information, see the Point Cloud Resolution Limits in Accepted Raw 3D Data Formats. Then we plot the results of our downsampling by running Section 2.
As we discussed earlier, the simplest form of downsampling is to choose values using a fixed step size based on how large we want our resulting point cloud to be.
A more advanced approach is to break the input space into cubes, otherwise known as voxels, and choose a single point per box using an averaging function. A simple implementation of this is shown in the following code.
You can tune the target number of points and box size used to see the reduction in point cloud clarity as more aggressive downsampling is performed.
import scipy.stats  # provides binned_statistic_dd for the grid (voxel) filter below

# Basic Approach
target_num_pts = 500_000
subsample = int(np.ceil(len(pc) / target_num_pts))
pc_downsample_simple = pc[::subsample]
print(f"We've subsampled to {len(pc_downsample_simple)} points")

# Advanced Approach
boxsize = 0.013  # 1.3 cm box size.
mins = pc[:, :3].min(axis=0)
maxes = pc[:, :3].max(axis=0)
volume = maxes - mins
num_boxes_per_axis = np.ceil(volume / boxsize).astype('int32').tolist()
num_boxes_per_axis.extend([1])
print(num_boxes_per_axis)

# For each voxel or "box", use the mean of the points in that box as the representative downsampled point.
means, _, _ = scipy.stats.binned_statistic_dd(
    pc[:, :4],
    [pc[:, 0], pc[:, 1], pc[:, 2], pc[:, 3]],
    statistic="mean",
    bins=num_boxes_per_axis,
)
x_means = means[0, ~np.isnan(means[0])].flatten()
y_means = means[1, ~np.isnan(means[1])].flatten()
z_means = means[2, ~np.isnan(means[2])].flatten()
c_means = means[3, ~np.isnan(means[3])].flatten()
pc_downsample_adv = np.column_stack([x_means, y_means, z_means, c_means])
print(pc_downsample_adv.shape)
Section 3: Visualize the 3D rendering
We can visualize point clouds using a 3D scatter plot of the points. Although our point clouds have color, our transforms have different effects on color, so comparing them in a single color provides a better comparison. We can see that the advanced voxel mean method creates a smoother point cloud because averaging has a noise reduction effect. In the following code, we can look at our point clouds from two separate perspectives by multiplying our point clouds by different rotation matrices.
When you run Section 3 in the notebook, you also see a comparison of a linear step approach versus a box grid approach, specifically in how the box grid filter has a slight smoothing effect on the overall point cloud. This smoothing could be important depending on the noise level of your dataset. Modifying the grid filtering function from mean to median or some other averaging function can also improve the final point cloud clarity. Look carefully at the back wall of the simple (fixed step size) and advanced (voxel mean) downsampled examples, and notice the smoothing effect the voxel mean method has compared to the fixed step size method.
rot1 = Rotation.from_euler('zyx', [[30,90,60]], degrees=True)
R1 = rot1.as_matrix()
simple_rot1 = pc_downsample_simple.copy()
simple_rot1 = np.matmul(R1, simple_rot1[:,:3].transpose() ).transpose()
advanced_rot1 = pc_downsample_adv.copy()
advanced_rot1 = np.matmul(R1, advanced_rot1[:,:3].transpose() ).transpose()
fig = plt.figure( figsize=(50, 30))
ax = fig.add_subplot(121, projection="3d")
ax.set_title("Simple Downsampling 1", fontdict={'fontsize':20})
ax.scatter(simple_rot1[:,0], simple_rot1[:,1], simple_rot1[:,2], c='blue', s=0.05)
ax = fig.add_subplot(122, projection="3d")
ax.set_title("Voxel Mean Downsampling 1", fontdict={'fontsize':20})
ax.scatter(advanced_rot1[:,0], advanced_rot1[:,1], advanced_rot1[:,2], c='blue', s=0.05)
# to look at any of the individual pointclouds or rotate the pointcloud, use the following function
plot_pointcloud(pc_downsample_adv, rot = [[30,90,60]], color=True, title="Advanced Downsampling", figsize=(50,30))
Section 4: Launch a Semantic Segmentation Job
Run Section 4 in the notebook to take this point cloud and launch a Ground Truth point cloud semantic segmentation labeling job using it. These cells generate the required input manifest file and format the point cloud in a Ground Truth compatible representation.
To learn more about the input format of Ground Truth as it relates to point cloud data, see Input Data and Accepted Raw 3D Data Formats.
In this section, we also perform the labeling in the worker portal. We label a subset of the point cloud to have some annotations to perform upsampling with. When the job is complete, we load the annotations from Amazon S3 into a NumPy array for our postprocessing. The following is a screenshot from the Ground Truth point cloud semantic segmentation tool.
Section 5: Perform label upsampling
Now that we have the downsampled labels, we train a K-NN classifier from SKLearn to predict the full dataset labels by treating our annotated points as training data and performing inference on the remainder of the unlabeled points in our full-size point cloud.
You can tune the number of points used as well as the distance metric and weighting scheme to influence how label inference is performed. If you label a few tiles in the full-size dataset, you can use those labeled tiles as ground truth to evaluate the accuracy of the K-NN predictions. You can then use this accuracy metric for hyperparameter tuning of K-NN or to try different inference algorithms to reduce your number of misclassified points between object boundaries, resulting in the lowest possible in-sample error rate. See the following code:
# There's a lot of room to tune K-NN further:
# 1) Prevent classification of points far away from all other points (random unfiltered ground points)
# 2) Perform a non-uniform weighted vote
# 3) Tweak the number of neighbors
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)
print(f"Training on {len(pc_downsample_adv)} labeled points")
knn.fit(pc_downsample_adv[:, :3], annotations)

print(f"Upsampled to {len(pc)} labeled points")
annotations_full = knn.predict(pc[:, :3])
Section 6: Visualize the upsampled labels
Now that we have performed upsampling of our labeled data, we can visualize a tile of the original full-size point cloud. We aren’t rendering all of the full-size point cloud because that may prevent our visualization tool from rendering. See the following code:
pc_downsample_annotated = np.column_stack((pc_downsample_adv[:,:3], annotations))
pc_annotated = np.column_stack((pc[:,:3], annotations_full))
labeled_area = pc_downsample_annotated[pc_downsample_annotated[:,3] != 255]
min_bounds = np.min(labeled_area, axis=0)
max_bounds = np.max(labeled_area, axis=0)
min_bounds = [-2, -2, -4.5, -1]
max_bounds = [2, 2, -1, 256]
def extract_tile(point_cloud, min_bounds, max_bounds):
    return point_cloud[
        (point_cloud[:, 0] > min_bounds[0])
        & (point_cloud[:, 1] > min_bounds[1])
        & (point_cloud[:, 2] > min_bounds[2])
        & (point_cloud[:, 0] < max_bounds[0])
        & (point_cloud[:, 1] < max_bounds[1])
        & (point_cloud[:, 2] < max_bounds[2])
    ]
tile_downsample_annotated = extract_tile(pc_downsample_annotated, min_bounds, max_bounds)
tile_annotated = extract_tile(pc_annotated, min_bounds, max_bounds)
rot1 = Rotation.from_euler('zyx', [[30,90,60]], degrees=True)
R1 = rot1.as_matrix()
down_rot = tile_downsample_annotated.copy()
down_rot = np.matmul(R1, down_rot[:,:3].transpose() ).transpose()
down_rot_color = np.matmul(R1, np.tile(tile_downsample_annotated.copy()[:,3],(3,1))).transpose().squeeze()
full_rot = tile_annotated.copy()
full_rot = np.matmul(R1, full_rot[:,:3].transpose() ).transpose()
full_rot_color = np.matmul(R1, np.tile(tile_annotated.copy()[:,3],(3,1))).transpose().squeeze()
fig = plt.figure(figsize=(50, 20))
ax = fig.add_subplot(121, projection="3d")
ax.set_title("Downsampled Annotations", fontdict={'fontsize':20})
ax.scatter(down_rot[:,0], down_rot[:,1], down_rot[:,2], c=down_rot_color[:,0], s=0.05)
ax = fig.add_subplot(122, projection="3d")
ax.set_title("Upsampled Annotations", fontdict={'fontsize':20})
ax.scatter(full_rot[:,0], full_rot[:,1], full_rot[:,2], c=full_rot_color[:,0], s=0.05)
Because our dataset is dense, we can visualize the upsampled labels within a tile to see the downsampled labels upsampled to the full-size point cloud. Although a small number of misclassifications may exist along boundary regions between objects, you also have many more correctly labeled points in the full-size point cloud than the initial point cloud, meaning your overall ML accuracy may improve.
Cleanup
Notebook instance: You have two options if you do not want to keep the created notebook instance running. If you would like to save it for later, you can stop it rather than delete it.
- To stop a notebook instance: click the Notebook instances link in the left pane of the SageMaker console home page. Next, click the Stop link under the ‘Actions’ column to the left of your notebook instance’s name. After the notebook instance is stopped, you can start it again by clicking the Start link. Keep in mind that if you stop rather than delete it, you will be charged for the storage associated with it.
- To delete a notebook instance: first stop it per the instruction above. Next, click the radio button next to your notebook instance, then select Delete from the Actions drop down menu.
Conclusion
Downsampling point clouds can be a viable method when preprocessing data for object detection and object tracking labeling. It can reduce labeling costs while still generating high-quality output labels, especially for 3D object detection and tracking tasks. In this post, we demonstrated how the downsampling method can affect the clarity of the point cloud for workers, and showed a few approaches that have tradeoffs based on the noise level of the dataset.
Finally, we showed that you can perform 3D point cloud semantic segmentation jobs on downsampled datasets and map the labels to the full-size point cloud through in-sample prediction. We accomplished this by training a classifier to do inference on the remaining full dataset size points, using the already labeled points as training data. This approach enables cost-effective labeling of highly dense point cloud scenes while still maintaining good overall label quality.
Test out this notebook with your own dense point cloud scenes in Ground Truth, try out new downsampling techniques, and even try new models beyond K-NN for final in-sample prediction to see if downsampling and upsampling techniques can reduce your labeling costs.
About the Authors
Vidya Sagar Ravipati is a Deep Learning Architect at the Amazon ML Solutions Lab, where he leverages his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption. Previously, he was a Machine Learning Engineer in Connectivity Services at Amazon who helped to build personalization and predictive maintenance platforms.
Isaac Privitera is a Machine Learning Specialist Solutions Architect and helps customers design and build enterprise-grade computer vision solutions on AWS. Isaac has a background in using machine learning and accelerated computing for computer vision and signals analysis. Isaac also enjoys cooking, hiking, and keeping up with the latest advancements in machine learning in his spare time.
Jeremy Feltracco is a Software Development Engineer with the Amazon ML Solutions Lab at Amazon Web Services. He uses his background in computer vision, robotics, and machine learning to help AWS customers accelerate their AI adoption.
ICLR: What representation learning means in the data center
Amazon Scholar Aravind Srinivasan on the importance of machine learning for real-time and offline resource management.
Two Amazon Scholars elected to prestigious science organizations
Yale economics professor Dirk Bergemann elected to American Academy of Arts & Sciences; University of Pennsylvania computer science professor Michael Kearns elected to National Academy of Sciences.
Amazon open-sources library for prediction over large output spaces
Framework improves efficiency, accuracy of applications that search for a handful of solutions in a huge space of candidates.
Intelligent governance of document processing pipelines for regulated industries
Processing large documents like PDFs and static images is a cornerstone of today’s highly regulated industries. From healthcare information like doctor-patient visits and bills of health, to financial documents like loan applications, tax filings, research reports, and regulatory filings, these documents are integral to how these industries conduct business. The mechanisms by which these documents are processed and analyzed, however, are often manual, time-consuming, error-prone, expensive, and not easily scalable.
Fortunately, recent innovations in this space are helping companies improve these methods. Machine learning (ML) techniques such as optical character recognition (OCR) and natural language processing (NLP) enable firms to digitize and extract text from millions of documents and understand the content, including contextual nuances of the language within them. Furthermore, you can then transform the extracted text by merging it with supplemental data to produce additional business insights.
This step-by-step method is called a document processing pipeline. The pipeline includes various components to extract, transform, enrich, and conform the data. New data domains are often introduced and used for numerous downstream business purposes. For example, in financial services, you could be identifying connected financial events, calculating environmental risk scores, and developing risk models. Because these documents help inform or even dictate such important data-driven decisions, it’s imperative for regulated industry companies to establish and maintain a robust data governance framework as part of these document processing pipelines. Without governance, pipelines become a dumping ground where documents are inconsistently stored, duplicated, and processed, and the business is unable to explain to potential auditors where the data that fed their decisions came from, or what that data was used for.
A data governance framework is made up of people, processes, and technology. It enables business users to work collaboratively with technologists to drive clean, certified, and trusted data. It consists of several components including data quality, data catalog, data ownership, data lineage, operation, and compliance. In this post, we discuss data catalog, data ownership, and data lineage, and how they tie together with building document processing pipelines for regulated industries.
For more information about design patterns on data quality, see How to Architect Data Quality on the AWS Cloud.
Data lineage
Data lineage is the part of data governance that provides GPS-like tracking for your data. At any point in time, it can explain where the data originated, what happened to it, what its latest status is, and where it's headed from this point on.
It provides visibility while simplifying the ability to trace financial numbers back to their origin, and provides transparency on potential errors and their root cause analyses.
Furthermore, you can use data lineage captured over time as analytical inputs to drive accuracy scores.
It’s imperative for a document processing pipeline to have a well-defined data lineage framework. The framework should include an end-to-end lifecycle, responsibility model, and the technology to enable data transformation transparency. Without lineage, the data can’t be trusted.
To illustrate this end-to-end data lineage concept, we walk you through creating an NLP-powered document search engine with built-in lineage at each step. Every object and piece of data processed by this ML pipeline can be traced back to the original document.
Each processing component can be replaced by your choice of tooling or bespoke ML model. Furthermore, you can customize the solution to include other use cases, such as central document data lakes or supplemental tabular data feed to an online transaction processing (OLTP) application.
The solution follows an event-driven architecture in which the completion of each stage within the pipeline triggers the next step, while providing self-service lineage for traceability and monitoring. In addition, hooks have been included to provide capabilities to extend the pipeline to additional workloads.
This design uses the following AWS services (you can also follow along in the GitHub repo):
- Amazon Comprehend – An NLP service that uses ML to find insights and relationships in text, and can do so in multiple languages.
- Amazon DynamoDB – A key-value and document database that delivers single-digit millisecond performance at any scale.
- Amazon DynamoDB Streams – A change data capture (CDC) service. It captures an ordered flow of information about changes to items in a DynamoDB table. When you enable a stream on a table, DynamoDB captures information about every modification to data items in the table.
- Amazon Elasticsearch Service (Amazon ES) – A fully managed service that makes it possible for you to deploy, secure, and run Elasticsearch cost-effectively and at scale. You can build, monitor, and troubleshoot your applications using the tools you love, at the scale you need.
- AWS Lambda – A serverless compute service that runs code in response to triggers such as changes in data, shifts in system state, or user actions. Because Amazon S3 can directly trigger a Lambda function, you can build a variety of real-time serverless data-processing systems.
- Amazon Simple Notification Service (Amazon SNS) – An AWS managed service for application-to-application communications, with a pub/sub model enabling high-throughput, low-latency message relaying.
- Amazon Simple Queue Service (Amazon SQS) – A fully managed message queuing service that enables you to decouple and scale microservices, distributed systems, and serverless applications.
- Amazon Simple Storage Service (Amazon S3) – An object storage service that stores your documents and allows for central management with fine-tuned access controls.
- Amazon Textract – A fully managed ML service that automatically extracts printed text, handwriting, and other data from scanned documents, going beyond simple OCR to identify, understand, and extract data from forms and tables.
Architecture
The overall design is grouped into five segments:
- Metadata services module
- Ingestion module
- OCR module
- NLP module
- Analytics module
All components interact via asynchronous events to allow for scalability. The following diagram illustrates the conceptual design.
The following diagram illustrates the physical design.
Metadata services
This is an encapsulated module to register, track, and trace incoming documents. It’s designed to be used across many different document processing pipelines. In your organization, one team might decide to use the OCR and NLP modules designed in this post. Another team might decide to use a different pipeline. However, governance practices of each pipeline should be consistent, and documents should be registered one time with full transparency on movement and downstream usage. Each document can be processed several times. You can extend the catalog and lineage services designed in this post to keep track of many pipelines, from multiple sources of data.
At the core, the metadata services module contains four reference tables, an SNS topic, three SQS queues, and three self-contained Lambda functions. Tables are created in DynamoDB, and schemas can be easily extended to include additional data attributes deemed important for your pipeline.
In addition, you can extend this design to include additional data governance components such as data quality.
The tables are defined as follows.
| Table Name | Purpose | DynamoDB Stream Enabled? | Data Governance Component | Sample Use |
| --- | --- | --- | --- | --- |
| Document Registry | Keeps track of all incoming documents. Each document is assigned a unique document ID and registered one time in this table. | Yes | Catalog | Provides the ability to quickly look up and understand the document source and context metadata. |
| Document Ownership | Covers the responsibility model of the data governance in which each document acquired by the pipeline has a defined owner. | No | Ownership | Provides notification services and can be extended to manage data quality controls. |
| Document Lineage | Keeps track of all data movements. It provides detailed lineage information that includes the source S3 bucket name, destination S3 bucket name, source file name, target file name, ARN of the AWS service that processed the document, and timestamp. | No | Lineage | A simple PartiQL query against this table based on the document ID returns every step the original document has taken. Query output can include the document ID, original document name, timestamp, source S3 bucket, source file name, destination S3 bucket, and destination file name. |
| Pipeline Operations | Keeps a record of all pipeline actions taken on a document ID, including the current pipeline stage and its status, and keeps a timeline of the stages in chronological order. | Yes | Operation | An operational query on a document ID to determine where in the pipeline the current document processing is. |
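As a concrete illustration of the lineage query described above, the following minimal sketch runs a PartiQL statement against a Document Lineage table with boto3. The table name (DocumentLineage) and attribute names are assumptions for illustration; the deployed pipeline in the GitHub repo may use different identifiers.

```python
import boto3

# Minimal sketch of the lineage trace query described above.
# Table and attribute names (DocumentLineage, DocumentId, Timestamp) are
# illustrative; use the names defined in your deployment of the pipeline.
dynamodb = boto3.client("dynamodb")

def trace_document(document_id: str):
    """Return every recorded movement for a document, oldest first."""
    response = dynamodb.execute_statement(
        Statement='SELECT * FROM "DocumentLineage" WHERE DocumentId = ?',
        Parameters=[{"S": document_id}],
    )
    # Each item carries the source/destination bucket and file names plus
    # the ARN of the AWS service that processed the document.
    items = response["Items"]
    return sorted(items, key=lambda item: item.get("Timestamp", {}).get("S", ""))

if __name__ == "__main__":
    for step in trace_document("doc-0001"):
        print(step)
```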
DynamoDB Streams allows downstream application code to react to updates to objects in DynamoDB. It provides a mechanism to keep an event-based microservices architecture in place by triggering subsequent steps of a workflow whenever new documents are written to our Document Registry table, and subsequently when new document references are created in the Pipeline Operations table.
In addition, DynamoDB Streams provides developer teams with an efficient way of connecting their application logic to various updates in the tables (for example, to keep track of a particular document ID based on owner tags, or to alert when certain unexpected problems arise while processing some documents).
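For illustration, here is a minimal sketch of a Lambda handler subscribed to the Document Registry table’s stream. The attribute names (DocumentId, ContentType) are hypothetical and stand in for whatever schema your registry table uses.

```python
import json

# Sketch of a Lambda handler subscribed to the Document Registry table's
# DynamoDB stream. Requires a stream view type that includes new images.
def handler(event, context):
    for record in event.get("Records", []):
        # Only react to newly registered documents.
        if record["eventName"] != "INSERT":
            continue
        new_image = record["dynamodb"]["NewImage"]
        document_id = new_image["DocumentId"]["S"]
        content_type = new_image.get("ContentType", {}).get("S", "unknown")
        # Downstream logic (for example, the Document Classification function)
        # would decide here whether the OCR segment should run for this document.
        print(json.dumps({"document_id": document_id, "content_type": content_type}))
    return {"processed": len(event.get("Records", []))}
```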
The Lambda functions provide microservices API call capabilities for the document pipeline to self-register its movements and actions undertaken by the pipeline code:
- Document Arrival Register API – Registers the incoming document’s metadata and location within Document Registry table
- Document Lineage API – Registers the lineage information within Document Lineage table
- Pipeline Operations API – Provides up-to-date information on the state of the pipeline
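As a sketch of what the Document Lineage API could record for each movement, the snippet below writes one lineage item to DynamoDB with boto3. The table and attribute names are illustrative assumptions, not the exact names used in the repo.

```python
import boto3
from datetime import datetime, timezone

# Sketch of the lineage-registration call pipeline functions could make.
# Table and attribute names are illustrative placeholders.
table = boto3.resource("dynamodb").Table("DocumentLineage")

def register_lineage(document_id, source_bucket, source_key,
                     dest_bucket, dest_key, processor_arn):
    """Record one movement of a document through the pipeline."""
    table.put_item(
        Item={
            "DocumentId": document_id,
            "Timestamp": datetime.now(timezone.utc).isoformat(),
            "SourceBucket": source_bucket,
            "SourceFileName": source_key,
            "DestinationBucket": dest_bucket,
            "DestinationFileName": dest_key,
            "ProcessorArn": processor_arn,
        }
    )
```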
The SNS topic is used as a sink for incoming messages from all pipeline movements and document registrations. It disseminates the messages to each downstream subscribed SQS queue according to what type of message was received. In this model, the number of consumers of the messages coming through the SNS topic could be greatly expanded as needed, and all messages are guaranteed to stay in order, because both the SNS topics and SQS queues are created in a First-In-First-Out (FIFO) configuration to prevent duplicates and maintain single-threaded processing in the pipeline.
Using Amazon SNS in the design provides scalability by creating a pub/sub architecture. A pub/sub architecture design is a pattern that provides a framework to decouple the services that produce an event from services that process the event. Many subscribers can subscribe to the same event and trigger different pipelines. For example, this design can easily be extended to process incoming XML file formats by subscribing an additional XML process pipeline for the same event.
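The following sketch shows one way to wire up that FIFO fan-out with boto3: a FIFO topic, a FIFO queue subscription, and an ordered publish keyed by document ID. The topic and queue names are placeholders, and the SQS access policy that authorizes the topic to deliver messages is omitted for brevity.

```python
import boto3

sns = boto3.client("sns")
sqs = boto3.client("sqs")

# FIFO topic and queue names must end in ".fifo"; names here are illustrative.
topic = sns.create_topic(
    Name="document-pipeline-events.fifo",
    Attributes={"FifoTopic": "true", "ContentBasedDeduplication": "true"},
)
queue = sqs.create_queue(
    QueueName="lineage-registration.fifo",
    Attributes={"FifoQueue": "true", "ContentBasedDeduplication": "true"},
)
queue_arn = sqs.get_queue_attributes(
    QueueUrl=queue["QueueUrl"], AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# The queue also needs an access policy that allows this topic to send
# messages to it; that policy is omitted here for brevity.
sns.subscribe(TopicArn=topic["TopicArn"], Protocol="sqs", Endpoint=queue_arn)

# Publishing to a FIFO topic requires a MessageGroupId so ordering is
# preserved per document.
sns.publish(
    TopicArn=topic["TopicArn"],
    Message='{"DocumentId": "doc-0001", "Event": "REGISTERED"}',
    MessageGroupId="doc-0001",
)
```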
The following table provides schema information. The document ID is identical and unique for each document and is part of the composite primary key used to identify movement of each document within the pipeline.
The following diagram shows the architecture of our metadata services.
Ingestion module
The ingestion workload is triggered when a new document is uploaded to the NLP/Raw S3 bucket (or the bucket where raw documents are placed from users or front-end applications).
The ingestion module follows a four-step process (as shown in the following diagram):
- A document is uploaded to the NLP/Raw S3 bucket.
- The Document Registrar Lambda function is invoked, which calls the metadata services API to register the document and receive a unique ID. This ID is added to the document as a tag, and the metadata is registered within the DynamoDB table Document Registry.
- After the document metadata is registered with Metadata Services, the DynamoDB Document Registration stream is invoked to start the Document Classification Lambda function. This function examines the metadata registered on the document and determines if the downstream OCR segment should be invoked on this document. The result of this examination is written back to the metadata services.
- The metadata registration of the previous step invokes the DynamoDB Pipeline Operations stream, which invokes the Document Extension Detector Lambda function. This function examines the incoming file formats and separates the image files from PDF documents.
All steps are registered in metadata services. The red dotted lines in the following diagram represent the metadata asynchronous API calls.
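The following minimal sketch approximates the Document Registrar step: it assigns a UUID, records the document in a registry table, and tags the uploaded S3 object with the ID. The bucket, table, and attribute names are illustrative assumptions.

```python
import uuid
import boto3

s3 = boto3.client("s3")
registry = boto3.resource("dynamodb").Table("DocumentRegistry")

# Sketch of the Document Registrar step: an S3 upload event arrives, the
# document receives a unique ID, the ID is written to the registry table,
# and the S3 object is tagged with it. Names are illustrative.
def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        document_id = str(uuid.uuid4())

        registry.put_item(
            Item={"DocumentId": document_id, "Bucket": bucket, "Key": key}
        )
        s3.put_object_tagging(
            Bucket=bucket,
            Key=key,
            Tagging={"TagSet": [{"Key": "document-id", "Value": document_id}]},
        )
```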
OCR module
This module detects the incoming file format and uses Amazon Textract in this implementation to convert the incoming documents into text. Amazon Textract can process image files synchronously, and PDF and other documents asynchronously, to allow time for the service to complete its analysis.
The OCR module consists of the following process, as illustrated in the architecture diagram:
- Image files are uploaded to the NLP/image S3 bucket and the Sync Processor Lambda function is invoked. The function synchronously points Amazon Textract to the S3 location of the image file, and waits for a response.
- Amazon Textract transforms the format to text and deposits the text output in the NLP/Textract S3 bucket. This step concludes OCR processing of the image file types.
- PDF files are placed within the NLP/PDF S3 bucket. This bucket invokes the Async Processor Lambda function, which submits the document to Amazon Textract for asynchronous analysis and registers that state with the metadata services.
- When the Amazon Textract document analysis is complete, an SNS message is sent to a specified SNS topic, notifying downstream consumers of the job completion. In this implementation, an SQS queue captures that message.
- The SQS queue message is the event that triggers the Result Processor Lambda function.
- The function retrieves the document analysis results from Amazon Textract and formats them according to the type of text analyzed (forms, tables, and raw text).
- The results are pushed to the NLP/Textract S3 bucket, page by page for every type of text, and as a complete JSON response.
All the progress is registered in metadata services. The red dotted lines in the diagram represent the metadata asynchronous API calls.
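The sketch below shows the two Amazon Textract entry points this module relies on: a synchronous call for images and an asynchronous job for PDFs that reports completion to an SNS topic. The ARNs and function names are placeholders.

```python
import boto3

textract = boto3.client("textract")

# Synchronous path (image files): the Sync Processor waits for the response.
def process_image(bucket, key):
    return textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )

# Asynchronous path (PDFs): the Async Processor starts a job, and Textract
# notifies an SNS topic on completion; the ARNs below are placeholders.
def process_pdf(bucket, key):
    response = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES", "FORMS"],
        NotificationChannel={
            "SNSTopicArn": "arn:aws:sns:us-east-1:111122223333:textract-complete",
            "RoleArn": "arn:aws:iam::111122223333:role/TextractPublishRole",
        },
    )
    return response["JobId"]
```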
NLP module
This module detects key phrases and entities within the document by using the text output from the OCR module. A key phrase is a string containing a noun phrase that describes a particular thing. It generally consists of a noun and the modifiers that distinguish it. For example, “day” is a noun; “a beautiful day” is a noun phrase that includes an article (“a”) and an adjective (“beautiful”).
After key phrases are extracted, indexing them in an analytical tool lets you find the source document quickly and accurately. For example, if you want to analyze corporate social responsibility (CSR) reports, you can find attributes such as “reducing carbon footprints,” “improving labor policies,” “participating in fair-trade,” and “charitable giving” by indexing the results of this module.
We use Amazon Comprehend to perform this function in this pipeline. However, as we explained earlier, you can easily swap the tooling used for this design with your preferred tool. For example, you can replace Amazon Comprehend with an Amazon SageMaker custom model as an alternative to extract key phrases and entities in a more domain-focused way. SageMaker is an ML service that you can use to build, train, and deploy ML models for virtually any use case.
Amazon Comprehend is called on a synchronous basis to extract key phrases in the following steps (as illustrated in the following diagram):
- The incoming text file uploaded to the NLP/Textract S3 bucket invokes the Sync Comprehend Processor Lambda function.
- The function feeds the incoming file to Amazon Comprehend for processing.
- The results from Amazon Comprehend, in JSON format, are deposited in the NLP/JSON S3 bucket.
- The results from Amazon Comprehend are sent to Amazon ES, the service we incorporate as our document search engine.
All steps are registered in metadata services. The red dotted lines in the diagram represent the metadata asynchronous API calls.
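For reference, a synchronous Amazon Comprehend call for one chunk of OCR text might look like the following sketch, which builds the JSON document that is later stored and indexed. The field names are assumptions, and Comprehend’s per-request text size limits mean long documents need to be chunked first.

```python
import boto3

comprehend = boto3.client("comprehend")

def analyze_text(text, document_id):
    """Extract key phrases and entities for one chunk of OCR output.

    Comprehend's synchronous APIs have a per-request text size limit, so
    long documents should be chunked before calling this function.
    """
    key_phrases = comprehend.detect_key_phrases(Text=text, LanguageCode="en")
    entities = comprehend.detect_entities(Text=text, LanguageCode="en")
    return {
        "DocumentId": document_id,
        "KeyPhrases": [kp["Text"] for kp in key_phrases["KeyPhrases"]],
        "Entities": [
            {"Text": e["Text"], "Type": e["Type"]} for e in entities["Entities"]
        ],
    }

# The resulting JSON document is what gets written to the NLP/JSON bucket and
# indexed into the Amazon ES domain (the index request itself must be signed
# with SigV4 or authorized per your domain's access policy).
```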
Analytics module
This module is responsible for the consumption and analytics segment of the pipeline. The steps are illustrated in the following diagram:
- The output from Amazon Comprehend, in JSON format, can be fed to Amazon Neptune, which allows end users to discover relationships across documents. This is an example of a downstream analytics application that is not implemented in this post.
- The end users have access to the original document in four formats (CSV, JSON, original, text), and can search key phrases using Amazon ES. They can identify relationships using Neptune. A JSON version of the document is available in the NLP/JSON S3 bucket. The original document is available in the NLP/Raw S3 bucket.
- Full lineage can be obtained from the Document Lineage table in DynamoDB.
The analytics module has many potential implementations. For example, you can use a relational datastore like Amazon Relational Database Service (Amazon RDS) or Amazon Aurora to analyze extracted tabular data using SQL.
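As an illustration of the key-phrase search mentioned above, the sketch below queries a hypothetical documents index on the Amazon ES domain through its _search API. The endpoint, index name, and field names are assumptions, and requests to a real domain must be signed or otherwise authorized per its access policy.

```python
import requests

# Illustrative search against the Amazon ES domain used by the pipeline.
# The endpoint, index name, and field names depend on your deployment;
# production requests must be signed (SigV4) or otherwise authorized.
ES_ENDPOINT = "https://my-nlp-domain.us-east-1.es.amazonaws.com"

def search_key_phrase(phrase):
    query = {"query": {"match": {"KeyPhrases": phrase}}}
    response = requests.post(f"{ES_ENDPOINT}/documents/_search", json=query)
    response.raise_for_status()
    return [hit["_source"] for hit in response.json()["hits"]["hits"]]

if __name__ == "__main__":
    print(search_key_phrase("reducing carbon footprints"))
```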
Conclusion
In this post, we architected an end-to-end document processing pipeline using AWS managed ML services. In addition, we introduced metadata services to help organizations create a centralized document repository where documents are stored one time but processed multiple times. A data governance framework as illustrated in this design provides the necessary guardrails to ensure documents are governed in a standard fashion across the organization, while giving lines of business the autonomy to choose their own NLP and OCR models and tooling.
The architecture discussed in this post has been coded and is available for deployment in the GitHub repo. You can download the code and create your pipeline within a few days.
About the Authors
David Kheyman is a Solutions Architect at Amazon Web Services based out of New York City, where he designs and implements repeatable AWS architecture patterns and solutions for large organizations.
Mojgan Ahmadi is a Principal Solutions Architect with Amazon Web Services based in New York, where she guides global financial services customers to build highly secure, scalable, reliable, and cost-efficient applications on the cloud. She brings over 20 years of technology experience on Software Development and Architecture, Data Governance and Engineering, and IT Management.
Anirudh Menon is a Solutions Architect with Amazon Web Services based in New York, where he helps financial services customers drive innovation with AWS solutions and industry-specific patterns.