Integrate SaaS platforms with Amazon SageMaker to enable ML-powered applications

Amazon SageMaker is an end-to-end machine learning (ML) platform with wide-ranging features to ingest, transform, and measure bias in data, and to train, deploy, and manage models in production. It provides best-in-class compute and services such as Amazon SageMaker Data Wrangler, Amazon SageMaker Studio, Amazon SageMaker Canvas, Amazon SageMaker Model Registry, Amazon SageMaker Feature Store, Amazon SageMaker Pipelines, Amazon SageMaker Model Monitor, and Amazon SageMaker Clarify. Many organizations choose SageMaker as their ML platform because it provides a common set of tools for developers and data scientists. A number of AWS independent software vendor (ISV) partners have already built integrations for users of their software as a service (SaaS) platforms to utilize SageMaker and its various features, including training, deployment, and the model registry.

In this post, we cover the benefits for SaaS platforms to integrate with SageMaker, the range of possible integrations, and the process for developing these integrations. We also deep dive into the most common architectures and AWS resources to facilitate these integrations. This is intended to accelerate time-to-market for ISV partners and other SaaS providers building similar integrations and inspire customers who are users of SaaS platforms to partner with SaaS providers on these integrations.

Benefits of integrating with SageMaker

There are a number of benefits for SaaS providers to integrate their SaaS platforms with SageMaker:

  • Users of the SaaS platform can take advantage of a comprehensive ML platform in SageMaker
  • Users can build ML models with data that is in or outside of the SaaS platform and put these ML models to use
  • It provides users with a seamless experience between the SaaS platform and SageMaker
  • Users can utilize foundation models available in Amazon SageMaker JumpStart to build generative AI applications
  • Organizations can standardize on SageMaker
  • SaaS providers can focus on their core functionality and offer SageMaker for ML model development
  • It equips SaaS providers with a basis to build joint solutions and go to market with AWS

SageMaker overview and integration options

SageMaker has tools for every step of the ML lifecycle. SaaS platforms can integrate with SageMaker across the ML lifecycle, from data labeling and preparation to model training, hosting, monitoring, and management, using its various components, as shown in the following figure. Depending on your needs, any or all parts of the ML lifecycle can be run in either the customer AWS account or the SaaS AWS account, and data and models can be shared across accounts using AWS Identity and Access Management (IAM) policies or third-party user-based access tools. This flexibility makes SageMaker an ideal platform for customers and SaaS providers to standardize on.

SageMaker overview

Integration process and architectures

In this section, we break the integration process into four main stages and cover the common architectures. Note that there can be other integration points in addition to these, but those are less common.

  • Data access – How data that is in the SaaS platform is accessed from SageMaker
  • Model training – How the model is trained
  • Model deployment and artifacts – Where the model is deployed and what artifacts are produced
  • Model inference – How the inference happens in the SaaS platform

The diagrams in the following sections assume SageMaker is running in the customer AWS account. Most of the options explained are also applicable if SageMaker is running in the SaaS AWS account. In some cases, an ISV may deploy their software in the customer AWS account. This is usually in a dedicated customer AWS account, meaning there still needs to be cross-account access to the customer AWS account where SageMaker is running.

There are a few different ways in which authentication across AWS accounts can be achieved when data in the SaaS platform is accessed from SageMaker and when the ML model is invoked from the SaaS platform. The recommended method is to use IAM roles. An alternative is to use AWS access keys consisting of an access key ID and secret access key.
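For example, a SaaS backend could assume an IAM role in the customer AWS account and use the resulting temporary credentials to call SageMaker. The following is a minimal Python (boto3) sketch of this pattern; the role ARN and external ID are placeholders for illustration, not values from an actual integration.

import boto3

# Assume a role in the customer AWS account (role ARN and external ID are placeholders)
sts = boto3.client("sts")
assumed = sts.assume_role(
    RoleArn="arn:aws:iam::111122223333:role/SaaSSageMakerAccessRole",
    RoleSessionName="saas-integration",
    ExternalId="example-external-id",  # an external ID is recommended for cross-account access
)
creds = assumed["Credentials"]

# Create a SageMaker Runtime client with the temporary credentials
sagemaker_runtime = boto3.client(
    "sagemaker-runtime",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)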

Data access

There are multiple options for how data in the SaaS platform can be accessed from SageMaker. Data can be accessed from a SageMaker notebook, from SageMaker Data Wrangler (where users can prepare data for ML), or from SageMaker Canvas. The most common data access options are:

  • SageMaker Data Wrangler built-in connector – The SageMaker Data Wrangler connector enables data to be imported from a SaaS platform to be prepared for ML model training. The connector is developed jointly by AWS and the SaaS provider. Current SaaS platform connectors include Databricks and Snowflake.
  • Amazon Athena Federated Query for the SaaS platform – Federated queries enable users to query the platform from a SageMaker notebook via Amazon Athena using a custom connector that is developed by the SaaS provider (see the sketch after this list).
  • Amazon AppFlow – With Amazon AppFlow, you can use a custom connector to extract data into Amazon Simple Storage Service (Amazon S3) which subsequently can be accessed from SageMaker. The connector for a SaaS platform can be developed by AWS or the SaaS provider. The open-source Custom Connector SDK enables the development of a private, shared, or public connector using Python or Java.
  • SaaS platform SDK – If the SaaS platform has an SDK (Software Development Kit), such as a Python SDK, this can be used to access data directly from a SageMaker notebook.
  • Other options – In addition to these, there can be other options depending on whether the SaaS provider exposes their data via APIs, files or an agent. The agent can be installed on Amazon Elastic Compute Cloud (Amazon EC2) or AWS Lambda. Alternatively, a service such as AWS Glue or a third-party extract, transform, and load (ETL) tool can be used for data transfer.
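To illustrate the federated query option, the following hedged Python (boto3) sketch runs a query against a SaaS data source from a SageMaker notebook through Amazon Athena. The catalog, database, table, and S3 output location are placeholders that would correspond to the connector deployed for your SaaS platform.

import time
import boto3

athena = boto3.client("athena")

# Catalog, database, table, and output location are placeholders for illustration
query = athena.start_query_execution(
    QueryString="SELECT * FROM sales_orders LIMIT 100",
    QueryExecutionContext={
        "Catalog": "saas_federated_catalog",  # the data source registered for the connector
        "Database": "saas_database",
    },
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
query_id = query["QueryExecutionId"]

# Poll until the query finishes, then fetch the results
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]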

The following diagram illustrates the architecture for data access options.

Data access

Model training

The model can be trained in SageMaker Studio by a data scientist, with Amazon SageMaker Autopilot by a non-data scientist, or in SageMaker Canvas by a business analyst. SageMaker Autopilot takes away the heavy lifting of building ML models, including feature engineering, algorithm selection, and hyperparameter settings, and it is also relatively straightforward to integrate directly into a SaaS platform. SageMaker Canvas provides a no-code visual interface for training ML models.

In addition, data scientists can use pre-trained models available in SageMaker JumpStart, including foundation models from sources such as Alexa, AI21 Labs, Hugging Face, and Stability AI, and tune them for their own generative AI use cases.

Alternatively, the model can be trained in a third-party or partner-provided tool, service, and infrastructure, including on-premises resources, provided the model artifacts are accessible and readable.

The following diagram illustrates these options.

Model training

Model deployment and artifacts

After you have trained and tested the model, you can either deploy it to a SageMaker model endpoint in the customer account, or export it from SageMaker and import it into the SaaS platform storage. The model can be stored and imported in standard formats supported by the common ML frameworks, such as pickle, joblib, and ONNX (Open Neural Network Exchange).
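For instance, a model exported from SageMaker typically arrives as a model.tar.gz archive in Amazon S3. The following hedged Python sketch downloads the archive and loads a scikit-learn model serialized with joblib; the bucket, key, and file name inside the archive are placeholders for illustration.

import tarfile
import boto3
import joblib

s3 = boto3.client("s3")

# Bucket, key, and archive contents are placeholders for illustration
s3.download_file("example-bucket", "models/model.tar.gz", "model.tar.gz")

with tarfile.open("model.tar.gz") as tar:
    tar.extractall(path="model")

model = joblib.load("model/model.joblib")  # assumes the model was saved with joblib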

If the ML model is deployed to a SageMaker model endpoint, additional model metadata can be stored in the SageMaker Model Registry, SageMaker Model Cards, or in a file in an S3 bucket. This can be the model version, model inputs and outputs, model metrics, model creation date, inference specification, data lineage information, and more. Where there isn’t a property available in the model package, the data can be stored as custom metadata or in an S3 file.
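As an example, the following hedged Python (boto3) sketch registers a model version in the SageMaker Model Registry and attaches custom metadata; the model package group, container image, and model data location are placeholders for illustration.

import boto3

sm = boto3.client("sagemaker")

# Group name, image URI, and model data URL are placeholders for illustration
sm.create_model_package(
    ModelPackageGroupName="saas-churn-models",
    ModelPackageDescription="Churn model trained on SaaS platform data",
    ModelApprovalStatus="PendingManualApproval",
    InferenceSpecification={
        "Containers": [
            {
                "Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/example-image:latest",
                "ModelDataUrl": "s3://example-bucket/models/model.tar.gz",
            }
        ],
        "SupportedContentTypes": ["text/csv"],
        "SupportedResponseMIMETypes": ["text/csv"],
    },
    # Properties without a dedicated field can be stored as custom metadata
    CustomerMetadataProperties={
        "model_version": "1.3.0",
        "training_dataset": "s3://example-bucket/datasets/2023-06/",
    },
)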

Creating such metadata can help SaaS providers manage the end-to-end lifecycle of the ML model more effectively. This information can be synced to the model log in the SaaS platform and used to track changes and updates to the ML model. Subsequently, this log can be used to determine whether to refresh downstream data and applications that use that ML model in the SaaS platform.

The following diagram illustrates this architecture.

Model deployment and artifacts

Model inference

SageMaker offers four options for ML model inference: real-time inference, serverless inference, asynchronous inference, and batch transform. For the first three, the model is deployed to a SageMaker model endpoint and the SaaS platform invokes the model using the AWS SDKs. The recommended option is to use the Python SDK. The inference pattern for each of these is similar in that the predictor’s predict() or predict_async() methods are used. Cross-account access can be achieved using role-based access.
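As an example of the real-time option, the following hedged sketch invokes an existing SageMaker endpoint with the SageMaker Python SDK. The endpoint name and payload shape are placeholders, and the underlying boto3 session could be created with assumed-role credentials, as shown earlier, for cross-account access.

import boto3
import sagemaker
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# The session could wrap assumed-role credentials for cross-account access
session = sagemaker.Session(boto_session=boto3.Session())

# Endpoint name and payload shape are placeholders for illustration
predictor = Predictor(
    endpoint_name="saas-churn-endpoint",
    sagemaker_session=session,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

prediction = predictor.predict({"features": [0.12, 3.4, 7, 1]})
print(prediction)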

It’s also possible to protect the backend with Amazon API Gateway, which calls the endpoint via a Lambda function that runs in a protected private network.

For batch transform, data from the SaaS platform first needs to be exported in batch into an S3 bucket in the customer AWS account, then the inference is done on this data in batch. The inference is done by first creating a transformer job or object, and then calling the transform() method with the S3 location of the data. Results are imported back into the SaaS platform in batch as a dataset, and joined to other datasets in the platform as part of a batch pipeline job.
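A batch transform job could look like the following hedged sketch; the model name, S3 locations, and instance type are placeholders for illustration.

from sagemaker.transformer import Transformer

# Model name, S3 paths, and instance type are placeholders for illustration
transformer = Transformer(
    model_name="saas-churn-model",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://example-bucket/batch-output/",
)

transformer.transform(
    data="s3://example-bucket/batch-input/export.csv",
    content_type="text/csv",
    split_type="Line",
)
transformer.wait()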

Another option for inference is to do it directly in the SaaS account compute cluster. This would be the case when the model has been imported into the SaaS platform. In this case, SaaS providers can choose from a range of EC2 instances that are optimized for ML inference.

The following diagram illustrates these options.

Model inference

Example integrations

Several ISVs have built integrations between their SaaS platforms and SageMaker. To learn more about some example integrations, refer to the following:

Conclusion

In this post, we explained why and how SaaS providers should integrate SageMaker with their SaaS platforms by breaking the process into four parts and covering the common integration architectures. SaaS providers looking to build an integration with SageMaker can utilize these architectures. If there are any custom requirements beyond what has been covered in this post, including with other SageMaker components, get in touch with your AWS account teams. Once the integration has been built and validated, ISV partners can join the AWS Service Ready Program for SageMaker and unlock a variety of benefits.

We also ask customers who are users of SaaS platforms to register their interest in an integration with Amazon SageMaker with their AWS account teams, as this can help inspire and progress the development for SaaS providers.


About the Authors

Mehmet Bakkaloglu is a Principal Solutions Architect at AWS, focusing on Data Analytics, AI/ML and ISV partners.

Raj Kadiyala is a Principal AI/ML Evangelist at AWS.


Highlight text as it’s being spoken using Amazon Polly

Amazon Polly is a service that turns text into lifelike speech. It enables the development of a whole class of applications that can convert text into speech in multiple languages.

This service can be used by chatbots, audio books, and other text-to-speech applications in conjunction with other AWS AI or machine learning (ML) services. For example, Amazon Lex and Amazon Polly can be combined to create a chatbot that engages in a two-way conversation with a user and performs certain tasks based on the user’s commands. Amazon Transcribe, Amazon Translate, and Amazon Polly can be combined to transcribe speech to text in the source language, translate it to a different language, and speak it.

In this post, we present an interesting approach for highlighting text as it’s being spoken using Amazon Polly. This solution can be used in many text-to-speech applications to do the following:

  • Add visual capabilities to audio in books, websites, and blogs
  • Increase comprehension when customers are trying to understand the text rapidly as it’s being spoken

Our solution gives the client (the browser, in this example) the ability to know what text (word or sentence) is being spoken by Amazon Polly at any instant. This enables the client to dynamically highlight the text as it’s being spoken. Such a capability is useful for providing a visual aid to speech for the use cases mentioned previously.

Our solution can be extended to perform additional tasks besides highlighting text. For example, the browser can show images, play music, or perform other animations on the front end as the text is being spoken. This capability is useful for creating dynamic audio books, educational content, and richer text-to-speech applications.

Solution overview

At its core, the solution uses Amazon Polly to convert a string of text into speech. The text can be input from the browser or through an API call to the endpoint exposed by our solution. The speech generated by Amazon Polly is stored as an audio file (MP3 format) in an Amazon Simple Storage Service (Amazon S3) bucket.

However, using the audio file alone, the browser can’t find what parts of the text are being spoken at any instant because we don’t have granular information on when each word is spoken.

Amazon Polly provides a way to obtain this using speech marks. Speech marks are stored in a text file that shows the time (measured in milliseconds from start of the audio) when each word or sentence is spoken.

Amazon Polly returns speech mark objects in a line-delimited JSON stream. A speech mark object contains the following fields:

  • Time – The timestamp in milliseconds from the beginning of the corresponding audio stream
  • Type – The type of speech mark (sentence, word, viseme, or SSML)
  • Start – The offset in bytes (not characters) of the start of the object in the input text (not including viseme marks)
  • End – The offset in bytes (not characters) of the object’s end in the input text (not including viseme marks)
  • Value – This varies depending on the type of speech mark:
    • SSML – <mark> SSML tag
    • Viseme – The viseme name
    • Word or sentence – A substring of the input text as delimited by the start and end fields

For example, the sentence “Mary had a little lamb” can give you the following speech marks file if you use SpeechMarkTypes = [“word”, “sentence”] in the API call to obtain the speech marks:

{"time":0,"type":"sentence","start":0,"end":23,"value":"Mary had a little lamb."}
{"time":6,"type":"word","start":0,"end":4,"value":"Mary"}
{"time":373,"type":"word","start":5,"end":8,"value":"had"}
{"time":604,"type":"word","start":9,"end":10,"value":"a"}
{"time":643,"type":"word","start":11,"end":17,"value":"little"}
{"time":882,"type":"word","start":18, "end":22,"value":"lamb"}

The word “had” (at the end of line 3) begins 373 milliseconds after the audio stream begins, starts at byte 5, and ends at byte 8 of the input text.
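To obtain both outputs, the solution calls Amazon Polly twice: once for the audio and once for the speech marks. A minimal sketch of these calls with boto3 (shown in Python for brevity; the solution’s Lambda code uses JavaScript) might look like the following, where the voice ID is an illustrative assumption.

import boto3

polly = boto3.client("polly")
text = "Mary had a little lamb."

# Request the audio stream (MP3)
audio = polly.synthesize_speech(
    Text=text,
    OutputFormat="mp3",
    VoiceId="Joanna",  # the voice choice is an illustrative assumption
)

# Request the speech marks as line-delimited JSON
marks = polly.synthesize_speech(
    Text=text,
    OutputFormat="json",
    SpeechMarkTypes=["word", "sentence"],
    VoiceId="Joanna",
)

audio_bytes = audio["AudioStream"].read()
speech_marks = marks["AudioStream"].read().decode("utf-8")
print(speech_marks)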

Architecture overview

The architecture of our solution is presented in the following diagram.

Architecture Diagram

Highlight Text as it’s spoken, using Amazon Polly

Our website for the solution is stored on Amazon S3 as static files (JavaScript, HTML), which are hosted in Amazon CloudFront (1) and served to the end-user’s browser (2).

When the user enters text in the browser through a simple HTML form, it’s processed by JavaScript in the browser. This calls an API (3) through Amazon API Gateway, which invokes an AWS Lambda function (4). The Lambda function calls Amazon Polly (5) to generate the speech (audio) and speech marks (JSON) files. Two calls are made to Amazon Polly, using JavaScript async functions, to fetch the audio and speech marks files, which are then stored in Amazon S3 (6a). To minimize the chances of multiple users overwriting each other’s files in the S3 bucket, the files are stored in a folder named with a timestamp. For a production release, we could employ more robust approaches to segregate users’ files based on user ID or other unique characteristics.

The Lambda function creates pre-signed URLs for the speech and speech marks files and returns them to the browser in the form of an array (7, 8, 9).
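Generating a pre-signed URL for an object in Amazon S3 is a single boto3 call, sketched below in Python for brevity; the bucket, key, and expiry are placeholders for illustration.

import boto3

s3 = boto3.client("s3")

# Bucket, key, and expiry are placeholders for illustration
audio_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "example-speech-bucket", "Key": "2023-06-01T12-00-00/speech.mp3"},
    ExpiresIn=3600,  # the URL stays valid for one hour
)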

When the browser sends the text to the API endpoint (3), it gets back the two pre-signed URLs for the audio file and the speech marks file in one synchronous invocation (9). This is indicated by the key symbol next to the arrow.

A JavaScript function in the browser fetches the speech marks file and the audio from their URL handles (10). It sets up the audio player to play the audio. (The HTML audio tag is used for this purpose).

When the user clicks the play button, the JavaScript code parses the speech marks retrieved in the earlier step to create a series of timed events using timeouts. The events invoke a callback function, which is another JavaScript function used to highlight the spoken text in the browser. Simultaneously, the JavaScript function streams the audio file from its URL handle.

The result is that the events are run at the appropriate times to highlight the text as it’s spoken while the audio is being played. The use of JavaScript timeouts provides us the synchronization of the audio with the highlighted text.

Prerequisites

To run this solution, you need an AWS account with an AWS Identity and Access Management (IAM) user who has permission to use Amazon CloudFront, Amazon API Gateway, Amazon Polly, Amazon S3, AWS Lambda, and AWS Step Functions.

Use Lambda to generate speech and speech marks

The following code invokes the Amazon Polly synthesize speech operation two times to fetch the audio and speech marks files. The calls run as asynchronous functions and are coordinated to return their results together using promises.

const p1 = new Promise(doSynthesizeSpeechmarks);
const p2 = new Promise(doSynthesizeSpeech);
var result;

await Promise.all([p1, p2])
.then((values) => {
     //return array of presigned urls
     console.log('Values:', values);
     result = { "output" : values };
})
.catch((err) => {
     console.log("Error:" + err);
     result = err;
});

On the JavaScript side, the text highlighting is done by highlighter(start, finish, word) and the timed events are set by setTimers():

function highlighter(start, finish, word) {
     let textarea = document.getElementById("postText");
     //console.log(start + "," + finish + "," + word);
     textarea.focus();
     textarea.setSelectionRange(start, finish);
}

function setTimers() {
     let speechMarksStr = sessionStorage.getItem("speechmarks");
     //read through the speech marks file and set timers for every word
     console.log(speechMarksStr);
     let speechMarks = speechMarksStr.split("\n");
     for (let i = 0; i < speechMarks.length; i++) {
          //console.log(i + ":" + speechMarks[i]);
          if (speechMarks[i].length == 0) {
               continue;
          }

          smjson = JSON.parse(speechMarks[i]);
          t = smjson["time"];
          s = smjson["start"];
          f = smjson["end"];
          word = smjson["value"];
          setTimeout(highlighter, t, s, f, word);
     }
}

Alternative approaches

Instead of the previous approach, you can consider a few alternatives:

  • Create both the speech marks and audio files inside a Step Functions state machine. The state machine can invoke the parallel branch condition to invoke two different Lambda functions: one to generate speech and another to generate speech marks. The code for this can be found in the using-step-functions subfolder in the Github repo.
  • Invoke Amazon Polly asynchronously to generate the audio and speech marks. This approach can be used if the text content is large or the user doesn’t need a real-time response. For more details about creating long audio files, refer to Creating Long Audio Files.
  • Have Amazon Polly create the presigned URL directly using the generate_presigned_url call on the Amazon Polly client in Boto3. If you go with this approach, Amazon Polly generates the audio and speech marks newly every time. In our current approach, we store these files in Amazon S3. Although these stored files aren’t accessible from the browser in our version of the code, you can modify the code to play previously generated audio files by fetching them from Amazon S3 (instead of regenerating the audio for the text again using Amazon Polly). We have more code examples for accessing Amazon Polly with Python in the AWS Code Library.

Create the solution

The entire solution is available from our Github repo. To create this solution in your account, follow the instructions in the README.md file. The solution includes an AWS CloudFormation template to provision your resources.

Cleanup

To clean up the resources created in this demo, perform the following steps:

  1. Delete the S3 buckets created to store the CloudFormation template (Bucket A), the source code (Bucket B) and the website (pth-cf-text-highlighter-website-[Suffix]).
  2. Delete the CloudFormation stack pth-cf.
  3. Delete the S3 bucket containing the speech files (pth-speech-[Suffix]). This bucket was created by the CloudFormation template to store the audio and speech marks files generated by Amazon Polly.

Summary

In this post, we showed an example of a solution that can highlight text as it’s being spoken using Amazon Polly. It was developed using the Amazon Polly speech marks feature, which provides us markers for the place each word or sentence begins in an audio file.

The solution is available as a CloudFormation template. It can be deployed as is to any web application that performs text-to-speech conversion. This would be useful for adding visual capabilities to audio in books, avatars with lip-sync capabilities (using viseme speech marks), websites, and blogs, and for aiding people with hearing impairments.

It can be extended to perform additional tasks besides highlighting text. For example, the browser can show images, play music, and perform other animations on the front end while the text is being spoken. This capability can be useful for creating dynamic audio books, educational content, and richer text-to-speech applications.

We welcome you to try out this solution and learn more about the relevant AWS services from the following links. You can extend the functionality for your specific needs.


About the Author

Varad G Varadarajan is a Trusted Advisor and Field CTO for Digital Native Businesses (DNB) customers at AWS. He helps them architect and build innovative solutions at scale using AWS products and services. Varad’s areas of interest are IT strategy consulting, architecture, and product management. Outside of work, Varad enjoys creative writing, watching movies with family and friends, and traveling.


Predict vehicle fleet failure probability using Amazon SageMaker JumpStart

Predictive maintenance is critical in the automotive industry because it can avoid unexpected mechanical failures and reactive maintenance activities that disrupt operations. By predicting vehicle failures and scheduling maintenance and repairs, you’ll reduce downtime, improve safety, and boost productivity levels.

What if we could apply deep learning techniques to common areas that drive vehicle failures, unplanned downtime, and repair costs?

In this post, we show you how to train and deploy a model to predict vehicle fleet failure probability using Amazon SageMaker JumpStart. SageMaker JumpStart is the machine learning (ML) hub of Amazon SageMaker, providing pre-trained, publicly available models for a wide range of problem types to help you get started with ML. The solution outlined in this post is available on GitHub.

SageMaker JumpStart solution templates

SageMaker JumpStart provides one-click, end-to-end solutions for many common ML use cases. Explore the following use cases for more information on available solution templates:

The SageMaker JumpStart solution templates cover a variety of use cases, and several solution templates are offered under each use case (the solution in this post, Predictive Maintenance for Vehicle Fleets, is in the Solutions section). Choose the solution template that best fits your use case from the SageMaker JumpStart landing page. For more information on the specific solutions under each use case and how to launch a SageMaker JumpStart solution, see Solution Templates.

Solution overview

The AWS predictive maintenance solution for automotive fleets applies deep learning techniques to common areas that drive vehicle failures, unplanned downtime, and repair costs. It serves as an initial building block for you to get to a proof of concept in a short period of time. This solution contains data preparation and visualization functionality within SageMaker and allows you to train and optimize the hyperparameters of deep learning models for your dataset. You can use your own data or try the solution with a synthetic dataset as part of this solution. This version processes vehicle sensor data over time. A subsequent version will process maintenance record data.

The following diagram demonstrates how you can use this solution with SageMaker components. As part of the solution, the following services are used:

  • Amazon S3 – We use Amazon Simple Storage Service (Amazon S3) to store datasets
  • SageMaker notebook – We use a notebook to preprocess and visualize the data, and to train the deep learning model
  • SageMaker endpoint – We use the endpoint to deploy the trained model

Solution overview

The workflow includes the following steps:

  1. An extract of historical data is created from the Fleet Management System containing vehicle data and sensor logs.
  2. After the ML model is trained, the SageMaker model artifact is deployed.
  3. The connected vehicle sends sensor logs to AWS IoT Core (alternatively, via an HTTP interface).
  4. Sensor logs are persisted via Amazon Kinesis Data Firehose.
  5. Sensor logs are sent to AWS Lambda for querying against the model to make predictions.
  6. Lambda sends sensor logs to the SageMaker model endpoint for predictions.
  7. Predictions are persisted in Amazon Aurora.
  8. Aggregate results are displayed on an Amazon QuickSight dashboard.
  9. Real-time notifications on the predicted probability of failure are sent to Amazon Simple Notification Service (Amazon SNS).
  10. Amazon SNS sends notifications back to the connected vehicle.

The solution consists of six notebooks:

  • 0_demo.ipynb – A quick preview of our solution
  • 1_introduction.ipynb – Introduction and solution overview
  • 2_data_preparation.ipynb – Prepare a sample dataset
  • 3_data_visualization.ipynb – Visualize our sample dataset
  • 4_model_training.ipynb – Train a model on our sample dataset to detect failures
  • 5_results_analysis.ipynb – Analyze the results from the model we trained

Prerequisites

Amazon SageMaker Studio is the integrated development environment (IDE) within SageMaker that provides us with all the ML features that we need in a single pane of glass. Before we can run SageMaker JumpStart, we need to set up SageMaker Studio. You can skip this step if you already have your own version of SageMaker Studio running.

The first thing we need to do before we can use any AWS services is to make sure we have signed up for and created an AWS account. Then we create an administrative user and a group. For instructions on both steps, refer to Set Up Amazon SageMaker Prerequisites.

The next step is to create a SageMaker domain. A domain sets up all the storage and allows you to add users to access SageMaker. For more information, refer to Onboard to Amazon SageMaker Domain. This demo is created in the AWS Region us-east-1.

Finally, you launch SageMaker Studio. For this post, we recommend launching a user profile app. For instructions, refer to Launch Amazon SageMaker Studio.

To run this SageMaker JumpStart solution and have the infrastructure deployed to your AWS account, you need to create an active SageMaker Studio instance (see Onboard to Amazon SageMaker Studio). When your instance is ready, use the instructions in SageMaker JumpStart to launch the solution. The solution artifacts are included in this GitHub repository for reference.

Launch the SageMaker JumpStart solution

To get started with the solution, complete the following steps:

  1. On the SageMaker Studio console, choose JumpStart.
    choose jumpstart
  2. On the Solutions tab, choose Predictive Maintenance for Vehicle Fleets.
    choose predictive maintenance
  3. Choose Launch.
    launch solution
    It takes a few minutes to deploy the solution.
  4. After the solution is deployed, choose Open Notebook.
    open notebook

If you’re prompted to select a kernel, choose PyTorch 1.8 Python 3.6 for all notebooks in this solution.

Solution preview

We first work on the 0_demo.ipynb notebook. In this notebook, you can get a quick preview of what the outcome will look like when you complete the full notebook for this solution.

Choose Run and Run All Cells to run all cells in SageMaker Studio (or Cell and Run All in a SageMaker notebook instance). You can run all the cells in each notebook one after the other. Ensure all the cells finish processing before moving to the next notebook.

run all cells

This solution relies on a config file to run the provisioned AWS resources. We generate the file as follows:

import boto3
import os
import json

client = boto3.client('servicecatalog')
cwd = os.getcwd().split('/')
i = cwd.index('S3Downloads')
pp_name = cwd[i + 1]
pp = client.describe_provisioned_product(Name=pp_name)
record_id = pp['ProvisionedProductDetail']['LastSuccessfulProvisioningRecordId']
record = client.describe_record(Id=record_id)

keys = [x['OutputKey'] for x in record['RecordOutputs'] if 'OutputKey' in x and 'OutputValue' in x]
values = [x['OutputValue'] for x in record['RecordOutputs'] if 'OutputKey' in x and 'OutputValue' in x]
stack_output = dict(zip(keys, values))

with open(f'/root/S3Downloads/{pp_name}/stack_outputs.json', 'w') as f:
    json.dump(stack_output, f)

We have some sample time series input data consisting of a vehicle’s battery voltage and battery current over time. Next, we load and visualize the sample data. As shown in the following screenshots, the voltage and current values are on the Y axis and the readings (19 readings recorded) are on the X axis.

volt

current

volt and current

We have previously trained a model on this voltage and current data that predicts the probability of vehicle failure and have deployed the model as an endpoint in SageMaker. We will call this endpoint with some sample data to determine the probability of failure in the next time period.

Given the sample input data, the predicted probability of failure is 45.73%.

To move to the next stage, choose Click here to continue.

next stage

Introduction and solution overview

The 1_introduction.ipynb notebook provides an overview of the solution and stages, and a look into the configuration file that has content definition, data sampling period, train and test sample count, parameters, location, and column names for generated content.

After you review this notebook, you can move to the next stage.

Prepare a sample dataset

We prepare a sample dataset in the 2_data_preparation.ipynb notebook.

We first generate the configuration file for this solution:

import boto3
import os
import json

client = boto3.client('servicecatalog')
cwd = os.getcwd().split('/')
i = cwd.index('S3Downloads')
pp_name = cwd[i + 1]
pp = client.describe_provisioned_product(Name=pp_name)
record_id = pp['ProvisionedProductDetail']['LastSuccessfulProvisioningRecordId']
record = client.describe_record(Id=record_id)

keys = [x['OutputKey'] for x in record['RecordOutputs'] if 'OutputKey' in x and 'OutputValue' in x]
values = [x['OutputValue'] for x in record['RecordOutputs'] if 'OutputKey' in x and 'OutputValue' in x]
stack_output = dict(zip(keys, values))

with open(f'/root/S3Downloads/{pp_name}/stack_outputs.json', 'w') as f:
    json.dump(stack_output, f)

from source.config import Config
from source.preprocessing import pivot_data, sample_dataset
from source.dataset import DatasetGenerator

config = Config(filename="config/config.yaml", fetch_sensor_headers=False)
config

The config properties are as follows:

fleet_info_fn=data/example_fleet_info.csv
fleet_sensor_logs_fn=data/example_fleet_sensor_logs.csv
vehicle_id_column=vehicle_id
timestamp_column=timestamp
target_column=target
period_ms=30000
dataset_size=25000
window_length=20
chunksize=10000
processing_chunksize=2500
fleet_dataset_fn=data/processed/fleet_dataset.csv
train_dataset_fn=data/processed/train_dataset.csv
test_dataset_fn=data/processed/test_dataset.csv
period_column=period_ms

You can define your own dataset or use our scripts to generate a sample dataset:

if should_generate_data:
    fleet_statistics_fn = "data/generation/fleet_statistics.csv"
    generator = DatasetGenerator(fleet_statistics_fn=fleet_statistics_fn,
                                 fleet_info_fn=config.fleet_info_fn, 
                                 fleet_sensor_logs_fn=config.fleet_sensor_logs_fn, 
                                 period_ms=config.period_ms, 
                                 )
    generator.generate_dataset()

assert os.path.exists(config.fleet_info_fn), "Please copy your data to {}".format(config.fleet_info_fn)
assert os.path.exists(config.fleet_sensor_logs_fn), "Please copy your data to {}".format(config.fleet_sensor_logs_fn)

You can merge the sensor data and fleet vehicle data together:

pivot_data(config)
sample_dataset(config)

We can now move to data visualization.

Visualize our sample dataset

We visualize our sample dataset in 3_data_visualization.ipynb. This solution relies on a config file to run the provisioned AWS resources, so we generate the file as in the previous notebook.

The following screenshot shows our dataset.

dataset

Next, let’s build the dataset:

train_ds = PMDataset_torch(
    config.train_dataset_fn,
    sensor_headers=config.sensor_headers,
    target_column=config.target_column,
    standardize=True)

properties = train_ds.vehicle_properties_headers.copy()
properties.remove('vehicle_id')
properties.remove('timestamp')
properties.remove('period_ms')

Now that the dataset is ready, let’s visualize the data statistics. The following screenshot shows the data distribution based on vehicle make, engine type, vehicle class, and model.

visualize

Comparing the log data, let’s look at an example of the mean voltage across different years for Make E and C (random).

The mean of voltage and current is on the Y axis and the number of readings is on the X axis.

  • Possible values for log_target: [‘make’, ‘model’, ‘year’, ‘vehicle_class’, ‘engine_type’]
    • Randomly assigned value for log_target: make
  • Possible values for log_target_value1: [‘Make A’, ‘Make B’, ‘Make E’, ‘Make C’, ‘Make D’]
    • Randomly assigned value for log_target_value1: Make B
  • Possible values for log_target_value2: [‘Make A’, ‘Make B’, ‘Make E’, ‘Make C’, ‘Make D’]
    • Randomly assigned value for log_target_value2: Make D

Based on the above, we assume log_target: make, log_target_value1: Make B and log_target_value2: Make D

make b and d

The following graphs break down the mean of the log data.

engine g h e

The following graphs visualize an example of different sensor log values against voltage and current.

volt current 2

Train a model on our sample dataset to detect failures

In the 4_model_training.ipynb notebook, we train a model on our sample dataset to detect failures.

Let’s generate the configuration file similar to the previous notebook, and then proceed with training configuration:

sage_session = sagemaker.session.Session()
s3_bucket = sagemaker_configs["S3Bucket"]  
s3_output_path = 's3://{}/'.format(s3_bucket)
print("S3 bucket path: {}".format(s3_output_path))

# run in local_mode on this machine, or as a SageMaker TrainingJob
local_mode = False

if local_mode:
    instance_type = 'local'
else:
    instance_type = sagemaker_configs["SageMakerTrainingInstanceType"]
    
role = sagemaker.get_execution_role()
print("Using IAM role arn: {}".format(role))
# only run from SageMaker notebook instance
if local_mode:
    !/bin/bash ./setup.sh
cpu_or_gpu = 'gpu' if instance_type.startswith('ml.p') else 'cpu'


We can now define the data and initiate hyperparameter optimization:

%%time

estimator = PyTorch(entry_point="train.py",
                    source_dir='source',                    
                    role=role,
                    dependencies=["source/dl_utils"],
                    instance_type=instance_type,
                    instance_count=1,
                    output_path=s3_output_path,
                    framework_version="1.5.0",
                    py_version='py3',
                    base_job_name=job_name_prefix,
                    metric_definitions=metric_definitions,
                    hyperparameters= {
                        'epoch': 100,  # tune it according to your need
                        'target_column': config.target_column,
                        'sensor_headers': json.dumps(config.sensor_headers),
                        'train_input_filename': os.path.basename(config.train_dataset_fn),
                        'test_input_filename': os.path.basename(config.test_dataset_fn),
                        }
                     )

if local_mode:
    estimator.fit({'train': training_data, 'test': testing_data})
%%time

tuner = HyperparameterTuner(estimator,
                            objective_metric_name='test_auc',
                            objective_type='Maximize',
                            hyperparameter_ranges=hyperparameter_ranges,
                            metric_definitions=metric_definitions,
                            max_jobs=max_jobs,
                            max_parallel_jobs=max_parallel_jobs,
                            base_tuning_job_name=job_name_prefix)
tuner.fit({'train': training_data, 'test': testing_data})

Analyze the results from the model we trained

In the 5_results_analysis.ipynb notebook, we get data from our hyperparameter tuning job, visualize metrics of all the jobs to identify the best job, and build an endpoint for the best training job.

Let’s generate the configuration file similar to the previous notebook and visualize the metrics of all the jobs. The following plot visualizes test accuracy vs. epoch.

test accuracy

The following screenshot shows the hyperparameter tuning jobs we ran.

hyperparameter tuning jobs

You can now visualize data from the best training job (out of the four training jobs) based on the test accuracy (red).

As we can see in the following screenshots, the test loss declines and AUC and accuracy increase with epochs.

auc and accuracy

auc and accuracy 2

Based on the visualizations, we can now build an endpoint for the best training job:

%%time

role = sagemaker.get_execution_role()

model = PyTorchModel(model_data=model_artifact,
                     role=role,
                     entry_point="inference.py",
                     source_dir="source/dl_utils",
                     framework_version='1.5.0',
                     py_version = 'py3',
                     name=sagemaker_configs["SageMakerModelName"],
                     code_location="s3://{}/endpoint".format(s3_bucket)
                    )

endpoint_instance_type = sagemaker_configs["SageMakerInferenceInstanceType"]

predictor = model.deploy(initial_instance_count=1, instance_type=endpoint_instance_type, endpoint_name=sagemaker_configs["SageMakerEndpointName"])

def custom_np_serializer(data):
    return json.dumps(data.tolist())
    
def custom_np_deserializer(np_bytes, content_type='application/x-npy'):
    out = np.array(json.loads(np_bytes.read()))
    return out

predictor.serializer = custom_np_serializer
predictor.deserializer = custom_np_deserializer

After we build the endpoint, we can test the predictor by passing it sample sensor logs:

import botocore

config = botocore.config.Config(read_timeout=200)
runtime = boto3.client('runtime.sagemaker', config=config)

data = np.ones(shape=(1, 20, 2)).tolist()
payload = json.dumps(data)

response = runtime.invoke_endpoint(EndpointName=sagemaker_configs["SageMakerEndpointName"],
                                   ContentType='application/json',
                                   Body=payload)
out = json.loads(response['Body'].read().decode())[0]

print("Given the sample input data, the predicted probability of failure is {:0.2f}%".format(100*(1.0-out[0])))

Given the sample input data, the predicted probability of failure is 34.60%.

Clean up

When you’ve finished with this solution, make sure that you delete all unwanted AWS resources. On the Predictive Maintenance for Vehicle Fleets page, under Delete solution, choose Delete all resources to delete all the resources associated with the solution.

clean up

You need to manually delete any extra resources that you may have created in this notebook. Some examples include extra S3 buckets (in addition to the solution’s default bucket) and extra SageMaker endpoints (using a custom name).

Customize the solution

Our solution is simple to customize. To modify the input data visualizations, refer to sagemaker/3_data_visualization.ipynb. To customize the machine learning, refer to sagemaker/source/train.py and sagemaker/source/dl_utils/network.py. To customize the dataset processing, refer to sagemaker/1_introduction.ipynb for how to define the config file.

Additionally, you can change the configuration in the config file. The default configuration is as follows:

fleet_info_fn=data/example_fleet_info.csv
fleet_sensor_logs_fn=data/example_fleet_sensor_logs.csv
vehicle_id_column=vehicle_id
timestamp_column=timestamp
target_column=target
period_ms=30000
dataset_size=10000
window_length=20
chunksize=10000
processing_chunksize=1000
fleet_dataset_fn=data/processed/fleet_dataset.csv
train_dataset_fn=data/processed/train_dataset.csv
test_dataset_fn=data/processed/test_dataset.csv
period_column=period_ms

The config file has the following parameters:

  • fleet_info_fn, fleet_sensor_logs_fn, fleet_dataset_fn, train_dataset_fn, and test_dataset_fn define the location of dataset files
  • vehicle_id_column, timestamp_column, target_column, and period_column define the headers for columns
  • dataset_size, chunksize, processing_chunksize, period_ms, and window_length define the properties of the dataset

Conclusion

In this post, we showed you how to train and deploy a model to predict vehicle fleet failure probability using SageMaker JumpStart. The solution is based on ML and deep learning models and allows a wide variety of input data including any time-varying sensor data. Because every vehicle has different telemetry on it, you can fine-tune the provided model to the frequency and type of data that you have.

To learn more about what you can do with SageMaker JumpStart, refer to the following:

Resources


About the Authors

Rajakumar Sampathkumar is a Principal Technical Account Manager at AWS, providing customers guidance on business-technology alignment and supporting the reinvention of their cloud operation models and processes. He is passionate about cloud and machine learning. Raj is also a machine learning specialist and works with AWS customers to design, deploy, and manage their AWS workloads and architectures.


Retain original PDF formatting to view translated documents with Amazon Textract, Amazon Translate, and PDFBox

Companies across various industries create, scan, and store large volumes of PDF documents. In many cases, the content is text-heavy, often written in a different language, and requires translation. To address this, you need an automated solution to extract the contents of these PDFs and translate them quickly and cost-efficiently.

Many businesses have diverse global users and need to translate text to enable cross-lingual communication between them. This is a manual, slow, and expensive human effort. There’s a need to find a scalable, reliable, and cost-effective solution to translate documents while retaining the original document formatting.

For verticals such as healthcare, due to regulatory requirements, the translated documents require an additional human in the loop to verify the validity of the machine-translated document.

If the translated document doesn’t retain the original formatting and structure, it loses its context. This can make it difficult for a human reviewer to validate and make corrections.

In this post, we demonstrate how to create a new translated PDF from a scanned PDF while retaining the original document structure and formatting using a geometry-based approach with Amazon Textract, Amazon Translate, and Apache PDFBox.

Solution overview

The solution presented in this post uses the following components:

  • Amazon Textract – A fully managed machine learning (ML) service that automatically extracts printed text, handwriting, and other data from scanned documents that goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables. Amazon Textract can detect text in a variety of documents, including financial reports, medical records, and tax forms.
  • Amazon Translate – A neural machine translation service that delivers fast, high-quality, and affordable language translation. Amazon Translate provides high-quality on-demand and batch translation capabilities across more than 2,970 language pairs, while decreasing your translation costs.
  • PDF Translate – An open-source library written in Java and published on AWS Samples in GitHub. This library contains logic to generate translated PDF documents in your desired language with Amazon Textract and Amazon Translate. It also uses the open-source Java library Apache PDFBox to create PDF documents. There are similar PDF processing libraries available in other programming languages, for example Node PDFBox.

While performing machine translations, you may have situations where you wish to preserve specific sections of text from being translated, such as names or unique identifiers. Amazon Translate allows tag modifications, which allows you to specify what text should not be translated. Amazon Translate also supports formality customization, which allows you to customize the level of formality in your translation output.
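For illustration, the following hedged Python (boto3) sketch translates a string while excluding a product name from translation using the translate="no" attribute and requesting formal output. Formality is honored only for certain target languages, and the text, product name, and language codes here are assumptions.

import boto3

translate = boto3.client("translate")

# The span marked translate="no" is left untranslated; text and languages are illustrative
response = translate.translate_text(
    Text='Please restart <span translate="no">AnyCompany Analyzer</span> before continuing.',
    SourceLanguageCode="en",
    TargetLanguageCode="es",
    Settings={"Formality": "FORMAL"},  # honored only for supported target languages
)
print(response["TranslatedText"])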

For details on Amazon Textract limits, refer to Quotas in Amazon Textract.

The solution is restricted to the languages that can be extracted by Amazon Textract, which currently supports English, Spanish, Italian, Portuguese, French, and German. These languages are also supported by Amazon Translate. For the full list of languages supported by Amazon Translate, refer to Supported languages and language codes.

We use the following PDF to demonstrate translating the text from English to Spanish. The solution also supports generating the translated document without any formatting. The position of the translated text is maintained. The source and translated PDF documents can also be found in the AWS Samples GitHub repo.

In the following sections, we demonstrate how to run the translation code on a local machine and look at the translation code in more detail.

Prerequisites

Before you get started, set up your AWS account and the AWS Command Line Interface (AWS CLI). Access to AWS services such as Amazon Textract and Amazon Translate requires appropriate IAM permissions; we recommend using least privilege permissions. To learn more about IAM permissions, see Policies and permissions in IAM, How Amazon Textract works with IAM, and How Amazon Translate works with IAM.

Run the translation code on a local machine

This solution focuses on the standalone Java code to extract and translate a PDF document. This is for easier testing and customizations to get the best-rendered translated PDF document. The code can then be integrated into an automated solution to deploy and run in AWS. See Translating PDF documents using Amazon Translate and Amazon Textract for a sample architecture that uses Amazon Simple Storage Service (Amazon S3) to store the documents and AWS Lambda to run the code.

To run the code on a local machine, complete the following steps. The code examples are available on the GitHub repo.

  1. Clone the GitHub repo:
    git clone https://github.com/aws-samples/amazon-translate-pdf

  2. Run the following command:
    cd amazon-translate-pdf

  3. Run the following command to translate from English to Spanish:
    java -jar target/translate-pdf-1.0.jar --source en --translated es

Two translated PDF documents are created in the documents folder, with and without the original formatting (SampleOutput-es.pdf and SampleOutput-min-es.pdf).

Code to generate the translated PDF

The following code snippets show how to take a PDF document and generate a corresponding translated PDF document. It extracts the text using Amazon Textract and creates the translated PDF by adding the translated text as a layer to the image. It builds on the solution shown in the post Generating searchable PDFs from scanned documents automatically with Amazon Textract.

The code first gets each line of text with Amazon Textract. Amazon Translate is used to get translated text and save the geometry of the translated text.

Region region = Region.US_EAST_1;
TextractClient textractClient = TextractClient.builder()
        .region(region)
        .build();

// Get the input Document object as bytes
Document pdfDoc = Document.builder()
        .bytes(SdkBytes.fromByteBuffer(imageBytes))
        .build();

TranslateClient translateClient = TranslateClient.builder()
        .region(region)
        .build();

DetectDocumentTextRequest detectDocumentTextRequest = DetectDocumentTextRequest.builder()
        .document(pdfDoc)
        .build();

// Invoke the Detect operation
DetectDocumentTextResponse textResponse = textractClient.detectDocumentText(detectDocumentTextRequest);

List<Block> blocks = textResponse.blocks();
List<TextLine> lines = new ArrayList<>();
BoundingBox boundingBox;

for (Block block : blocks) {
    if ((block.blockType()).equals(BlockType.LINE)) {
        String source = block.text();

        TranslateTextRequest requestTranslate = TranslateTextRequest.builder()
                .sourceLanguageCode(sourceLanguage)
                .targetLanguageCode(destinationLanguage)
                .text(source)
                .build();

        TranslateTextResponse resultTranslate = translateClient.translateText(requestTranslate);

        boundingBox = block.geometry().boundingBox();
        lines.add(new TextLine(boundingBox.left(),
                boundingBox.top(),
                boundingBox.width(),
                boundingBox.height(),
                resultTranslate.translatedText(),
                source));
    }
}
return lines;

The font size is calculated as follows and can easily be configured:

int fontSize = 20;
float textWidth = font.getStringWidth(text) / 1000 * fontSize;
float textHeight = font.getFontDescriptor().getFontBoundingBox().getHeight() / 1000 * fontSize;
 
if (textWidth > bbWidth) {
    while (textWidth > bbWidth) {
        fontSize -= 1;
        textWidth = font.getStringWidth(text) / 1000 * fontSize;
        textHeight = font.getFontDescriptor().getFontBoundingBox().getHeight() / 1000 * fontSize;
     }
} else if (textWidth < bbWidth) {
     while (textWidth < bbWidth) {
         fontSize += 1;
         textWidth = font.getStringWidth(text) / 1000 * fontSize;
         textHeight = font.getFontDescriptor().getFontBoundingBox().getHeight() / 1000 * fontSize;
      }
}

The translated PDF is created from the saved geometry and translated text. Changes to the color of the translated text can easily be configured.

float width = image.getWidth();
float height = image.getHeight();
 
PDRectangle box = new PDRectangle(width, height);
PDPage page = new PDPage(box);
page.setMediaBox(box);
this.document.addPage(page); //org.apache.pdfbox.pdmodel.PDDocument
 
PDImageXObject pdImage;
 
if(imageType == ImageType.JPEG){
    pdImage = JPEGFactory.createFromImage(this.document, image);
} else {
    pdImage = LosslessFactory.createFromImage(this.document, image);
}
 
PDPageContentStream contentStream = new PDPageContentStream(document, page, PDPageContentStream.AppendMode.OVERWRITE, false);
 
contentStream.drawImage(pdImage, 0, 0);
contentStream.setRenderingMode(RenderingMode.FILL);
 
for (TextLine cline : lines){
    String clinetext = cline.text;
    String clinetextOriginal = cline.originalText;
                       
    FontInfo fontInfo = calculateFontSize(clinetextOriginal, (float) cline.width * width, (float) cline.height * height, font);
    //config to include original document structure - overlay with original
    contentStream.setNonStrokingColor(Color.WHITE);
    contentStream.addRect((float) cline.left * width, (float) (height - height * cline.top - fontInfo.textHeight), (float) cline.width * width, (float) cline.height * height);
    contentStream.fill();
 
    fontInfo = calculateFontSize(clinetext, (float) cline.width * width, (float) cline.height * height, font);
    //config to include original document structure - overlay with translated
    contentStream.setNonStrokingColor(Color.WHITE);
    contentStream.addRect((float) cline.left * width, (float) (height - height * cline.top - fontInfo.textHeight), (float) cline.width * width, (float) cline.height * height);
    contentStream.fill();
    //change the output text color here
    fontInfo = calculateFontSize(clinetext.length() <= clinetextOriginal.length() ? clinetextOriginal : clinetext, (float) cline.width * width, (float) cline.height * height, font);
    contentStream.setNonStrokingColor(Color.BLACK);
    contentStream.beginText();
    contentStream.setFont(font, fontInfo.fontSize);
    contentStream.newLineAtOffset((float) cline.left * width, (float) (height - height * cline.top - fontInfo.textHeight));
    contentStream.showText(clinetext);
    contentStream.endText();
}
contentStream.close();

The following image shows the document translated into Spanish with the original formatting (SampleOutput-es.pdf).

The following image shows the translated PDF in Spanish without any formatting (SampleOutput-min-es.pdf).

Processing time

Extracting, translating, and rendering the translated PDF for the employment application took about 10 seconds. Processing a text-heavy document such as the Declaration of Independence PDF took less than a minute.

Cost

With Amazon Textract, you pay as you go based on the number of pages and images processed. With Amazon Translate, you pay as you go based on the number of text characters that are processed. Refer to Amazon Textract pricing and Amazon Translate pricing for actual costs.

Conclusion

This post showed how to use Amazon Textract and Amazon Translate to generate translated PDF documents while retaining the original document structure. You can optionally postprocess Amazon Textract results to improve the quality of the translation, for example extracted words can be passed through ML-based spellchecks such as SymSpell for data validation, or clustering algorithms can be used to preserve reading order. You can also use Amazon Augmented AI (Amazon A2I) to build human review workflows where you can use your own private workforce to review the original and translated PDF documents to provide more accuracy and context. See Designing human review workflows with Amazon Translate and Amazon Augmented AI and Building a multi-lingual document translation workflow with domain-specific and language-specific customization to get started.


About the Authors

Anubha Singhal is a Senior Cloud Architect at Amazon Web Services in the AWS Professional Services organization.

Sean Lawrence was formerly a Front End Engineer at AWS. He specialized in front end development in the AWS Professional Services organization and the Amazon Privacy team.
