How Süddeutsche Zeitung optimized their audio narration process with Amazon Polly

This is a guest post by Jakob Kohl, a Software Developer at the Süddeutsche Zeitung. Süddeutsche Zeitung is one of the leading quality dailies in Germany when it comes to paid subscriptions and unique users. Its website, SZ.de, reaches more than 15 million monthly unique users as of October 2021.

Thanks to smart speakers and podcasts, the audio industry has experienced a real boom in recent years. At Süddeutsche Zeitung, we’re constantly looking for new ways to make our diverse journalism even more accessible. As pioneers in digital journalism, we want to open up more opportunities for Süddeutsche Zeitung readers to consume articles. We started looking for solutions that could provide high-quality audio narration for our articles. Our ultimate goal was to launch a “listen to the article” feature.

In this post, we share how we optimized our audio narration process with Amazon Polly, a service that turns text into lifelike speech using advanced deep learning technologies.

Why Amazon Polly?

We believe that Vicki, the German neural Amazon Polly voice, is currently the best German voice on the market. Amazon Polly also offers an impressive ability to switch between languages, correctly pronouncing, for example, English movie titles as well as personal names in different languages (for an example, listen to the article Schall und Wahn on our website).

A big part of our infrastructure already runs on AWS, so using Amazon Polly was a perfect fit. We can combine Amazon Polly with the following components:

  • An Amazon Simple Notification Service (Amazon SNS) topic to which we can subscribe for articles. The articles are sent to this topic by the CMS whenever they’re saved by an editor.
  • An Amazon CloudFront distribution with Lambda@Edge to paywall premium articles, which we can reuse for audio versions of articles.

The Amazon Polly API is easy to use and well documented. It took us less than a week to get our proof of concept to work.

The challenge

Hundreds of new articles are published every day on SZ.de. After initial publication, they might get updated several times for various reasons—new paragraphs are added in news-driven articles, typos are fixed, teasers are changed, or metadata is optimized for search engines.

Generating speech for the initial publication of an article is straightforward, because the whole text needs to be synthesized. But how can we quickly generate the audio for updated versions of articles without paying twice for the same content? Our biggest challenge was to prevent sending the whole text to Amazon Polly repeatedly for every single update.

Our technical solution

Every time an editor saves an article, the new version of the article is published to an SNS topic. An AWS Lambda function is subscribed to this topic and called for every new version of an article. This function runs the following steps:

  1. Check if the new version of the article has already been completely synthesized. If so, the function stops immediately (this may happen when only metadata is changed that doesn’t affect the audio).
  2. Convert the article into multiple SSML documents, roughly one for each text paragraph.
  3. For each SSML document, the function checks if it has already been synthesized to audio using calculated hashes. For example:
    1. If an article is saved for the first time, all SSML documents must be synthesized.
    2. If a typo has been fixed in a single paragraph, only the SSML document for this paragraph must be re-synthesized.
    3. If a new paragraph is added to the article, only the SSML document for this new paragraph must be synthesized.
  4. Send all not-yet-synthesized SSML documents separately to Amazon Polly.

These checks help optimize performance and reduce cost by preventing the synthesis of an entire article multiple times. We avoid incurring additional charges due to minor changes such as a title edit or metadata adjustments for SEO reasons.
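
The core idea can be sketched as follows. This is a minimal sketch, not Süddeutsche Zeitung's actual code; the bucket name, the choice of Polly API, and the way previously synthesized hashes are stored are assumptions for illustration.

import hashlib
import boto3

polly = boto3.client("polly")

def synthesize_changed_paragraphs(ssml_documents, already_synthesized_hashes):
    """Send only changed or new SSML paragraphs to Amazon Polly.

    ssml_documents: list of SSML strings, roughly one per article paragraph.
    already_synthesized_hashes: hashes of fragments synthesized for a previous
    version of the article (how they are persisted is an assumption here).
    """
    current_hashes = set()
    for ssml in ssml_documents:
        fragment_hash = hashlib.sha256(ssml.encode("utf-8")).hexdigest()
        current_hashes.add(fragment_hash)
        if fragment_hash in already_synthesized_hashes:
            continue  # Paragraph unchanged: reuse the existing audio fragment.
        # Only changed or new paragraphs incur Amazon Polly charges.
        polly.start_speech_synthesis_task(
            Engine="neural",
            VoiceId="Vicki",
            TextType="ssml",
            Text=ssml,
            OutputFormat="mp3",
            OutputS3BucketName="audio-fragments-bucket",     # placeholder bucket
            OutputS3KeyPrefix=f"fragments/{fragment_hash}",  # fragment keyed by its hash
        )
    return current_hashes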

The following diagram illustrates the solution workflow.

After Amazon Polly synthesizes the SSML documents, the audio files are sent to an output bucket in Amazon Simple Storage Service (Amazon S3). A second Lambda function listens for object creation on that bucket, waits for all audio fragments of an article to be complete, and merges them into a final audio file using FFmpeg from a Lambda layer. The final audio is sent to another S3 bucket, which is used as the origin in our CloudFront distribution. In CloudFront, we reuse the existing paywall for premium articles to protect the corresponding audio versions.
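
The merge step might look like the following minimal sketch, assuming the FFmpeg binary from the Lambda layer is on the PATH and the fragments have already been downloaded to local storage in paragraph order (names are illustrative, not our production code):

import os
import subprocess
import tempfile

def merge_fragments(fragment_paths, output_path):
    """Concatenate MP3 fragments in paragraph order with FFmpeg's concat demuxer."""
    # FFmpeg reads the list of input files from a small text file.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as list_file:
        for path in fragment_paths:
            list_file.write(f"file '{path}'\n")
        list_name = list_file.name
    try:
        subprocess.run(
            ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
             "-i", list_name, "-c", "copy", output_path],
            check=True,
        )
    finally:
        os.remove(list_name)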

Based on our freemium model, we provide a shortened audio version of premium articles. Non-subscribers are able to listen to the first paragraph for free, but are required to purchase a subscription to access the full article.

Conclusion

Integration of Amazon Polly into our existing infrastructure was very straightforward. Our content requires minimal customization because we only include paragraphs and some additional breaks. The most challenging part was performance and cost optimization, which we achieved by splitting the article up into multiple SSML documents corresponding to paragraphs, checking for changes in each SSML document, and building the whole audio file by merging the fragments. With these optimizations, we are able to achieve the following:

  • Decrease the number of synthesized characters by at least 50% by synthesizing only real changes.
  • Reduce the time it takes for a change in the article text to appear in the audio because there is less audio to synthesize.
  • Add arbitrary audio files between paragraphs without re-synthesizing the whole article. For example, we can include a sound file in the shortened audio version of a premium article to separate the first paragraph from the ensuing note that a subscription is needed to listen to the full version.

In the first month after the launch of the “listen to the article” feature in our SZ.de articles, we received a lot of positive user feedback. We reached almost 30,000 users during the first 2 months after launch. Of these users, approximately 200 converted to a paid subscription solely from listening to the teaser of an article behind our paywall. The “listen to the article” feature isn’t behind our paywall, but users can only listen to premium articles in full if they have a subscription. Our website also offers free articles without a paywall. In the future, we will expand the feature to other SZ platforms, especially our mobile news apps.


About the Author

Jakob Kohl is a Software Developer at the Süddeutsche Zeitung, where he enjoys working with modern technologies on an agile website team. He is one of the main developers of the “listen to an SZ article” feature. In his leisure time, he likes building wooden furniture, where technical and visual design is as important as in web development.

Read More

Reduce costs and complexity of ML preprocessing with Amazon S3 Object Lambda

Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. Often, customers have objects in S3 buckets that need further processing to be used effectively by consuming applications. Data engineers must support these application-specific data views with trade-offs between persisting derived copies or transforming data at the consumer level. Neither solution is ideal because it introduces operational complexity, causes data consistency challenges, and wastes more expensive computing resources.

These trade-offs broadly apply to many machine learning (ML) pipelines that train on unstructured data, such as audio, video, and free-form text, among other sources. In each example, the training job must download data from S3 buckets, prepare an application-specific view, and then use an AI algorithm. This post demonstrates a design pattern for reducing costs, complexity, and centrally managing this second step. It uses the concrete example of image processing, though the approach broadly applies to any workload. The economic benefits are also most pronounced when the transformation step doesn’t require GPU, but the AI algorithm does.

The proposed solution also centralizes data transformation code and enables just-in-time (JIT) transformation. Furthermore, the approach uses a serverless infrastructure to reduce operational overhead and undifferentiated heavy lifting.

Solution overview

When ML algorithms process unstructured data like images and video, the data requires various normalization tasks (such as grey-scaling and resizing). This step exists to accelerate model convergence, avoid overfitting, and improve prediction accuracy. You often perform these preprocessing steps on the instances that later run the AI training. That approach creates inefficiencies, because those resources typically have more expensive processors (for example, GPUs) than these tasks require. Instead, our solution externalizes those operations to economical, horizontally scalable Amazon S3 Object Lambda functions.

This design pattern has three critical benefits. First, it centralizes the shared data transformation steps, such as image normalization, and removes duplicated code from ML pipelines. Second, S3 Object Lambda functions avoid data consistency issues in derived data through JIT conversions. Third, the serverless infrastructure reduces operational overhead, increases access time, and limits costs to the per-millisecond time running your code.

An elegant solution exists in which you can centralize these data preprocessing and data conversion operations with S3 Object Lambda. S3 Object Lambda enables you to add code that modifies data from Amazon S3 before returning it to an application. The code runs within an AWS Lambda function, a serverless compute service. Lambda can instantly scale to tens of thousands of parallel runs while supporting dozens of programming languages and even custom containers. For more information, see Introducing Amazon S3 Object Lambda – Use Your Code to Process Data as It Is Being Retrieved from S3.

The following diagram illustrates the solution architecture.

Normalize datasets used to train machine learning model

In this solution, you have an S3 bucket that contains the raw images to be processed. Next, you create an S3 Access Point for these images. If you build multiple ML models, you can create separate S3 Access Points for each model. Alternatively, AWS Identity and Access Management (IAM) policies for access points support sharing reusable functions across ML pipelines. Then you attach a Lambda function that has your preprocessing business logic to the S3 Access Point. After you retrieve the data, you call the S3 Access Point to perform JIT data transformations. Finally, you update your ML model to use the new S3 Object Lambda Access Point to retrieve data from Amazon S3.

Create the normalization access point

This section walks through the steps to create the S3 Object Lambda access point.

Raw data is stored in an S3 bucket. To give users the right set of permissions to access this data, while avoiding complex bucket policies that can cause unexpected impacts on other applications, you need to create S3 Access Points. S3 Access Points are unique host names that you can use to reach S3 buckets. With S3 Access Points, you can create individual access control policies for each access point to control access to shared datasets easily and securely.

  1. Create your access point.
  2. Create a Lambda function that performs the image resizing and conversion. See the following Python code:
import boto3
import cv2
import numpy as np
import requests
import io

def lambda_handler(event, context):
    print(event)
    object_get_context = event["getObjectContext"]
    request_route = object_get_context["outputRoute"]
    request_token = object_get_context["outputToken"]
    s3_url = object_get_context["inputS3Url"]

    # Get object from S3
    response = requests.get(s3_url)
    nparr = np.frombuffer(response.content, np.uint8)
    img = cv2.imdecode(nparr, flags=1)

    # Transform object
    new_shape=(256,256)
    resized = cv2.resize(img, new_shape, interpolation= cv2.INTER_AREA)
    gray_scaled = cv2.cvtColor(resized,cv2.COLOR_BGR2GRAY)

    # Encode the transformed image back to JPEG
    is_success, buffer = cv2.imencode(".jpg", gray_scaled)
    if not is_success:
        raise ValueError('Unable to imencode()')

    transformed_object = io.BytesIO(buffer).getvalue()

    # Write object back to S3 Object Lambda
    s3 = boto3.client('s3')
    s3.write_get_object_response(
        Body=transformed_object,
        RequestRoute=request_route,
        RequestToken=request_token)

    return {'status_code': 200}
  3. Create an Object Lambda access point using the supporting access point from Step 1.

The Lambda function uses the supporting access point to download the original objects.

  4. Update Amazon SageMaker to use the new S3 Object Lambda access point to retrieve data from Amazon S3. See the following bash code:
aws s3api get-object --bucket arn:aws:s3-object-lambda:us-west-2:12345678901:accesspoint/image-normalizer --key images/test.png test.png
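
Alternatively, you can retrieve the transformed object with the AWS SDK for Python (Boto3); recent Boto3 versions accept the Object Lambda access point ARN as the Bucket parameter. The following sketch reuses the ARN and key from the preceding command:

import boto3

s3 = boto3.client("s3")

# The Bucket parameter accepts the Object Lambda access point ARN, so the object
# is normalized by the Lambda function before it is returned to the caller.
response = s3.get_object(
    Bucket="arn:aws:s3-object-lambda:us-west-2:12345678901:accesspoint/image-normalizer",
    Key="images/test.png",
)
normalized_image_bytes = response["Body"].read()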

Cost savings analysis

Traditionally, ML pipelines copy images and other files from Amazon S3 to SageMaker instances and then perform normalization. However, performing these transformations on training instances is inefficient. Lambda functions, by contrast, scale horizontally to handle bursts and then shrink elastically, charging per millisecond only while the code runs. Many preprocessing steps don’t require GPUs and can even run on ARM64. That creates an incentive to move that processing to more economical compute, such as Lambda functions powered by AWS Graviton2 processors.

Using an example from the Lambda pricing calculator, you can configure the function with 256 MB of memory and compare the costs for both x86 and Graviton (ARM64). We chose this size because it’s sufficient for many single-image data preparation tasks. Next, use the SageMaker pricing calculator to compute expenses for an ml.p2.xlarge instance. This is the smallest supported SageMaker training instance with GPU support. These results show up to 90% compute savings for operations that don’t use GPUs and can shift to Lambda. The following table summarizes these findings.

 

            | Lambda with x86 | Lambda with Graviton2 (ARM) | SageMaker ml.p2.xlarge
Memory (GB) | 0.25            | 0.25                        | 61
CPU         | —               | —                           | 4
GPU         | —               | —                           | 1
Cost/hour   | $0.061          | $0.049                      | $0.90

Conclusion

You can build modern applications to unlock insights into your data. These different applications have unique data view requirements, such as formatting and preprocessing actions. Addressing these other use cases can result in data duplication, increasing costs, and more complexity to maintain consistency. This post offers a solution for efficiently handling these situations using S3 Object Lambda functions.

Not only does this remove the need for duplication, but it also provides a path to scale these actions horizontally across less expensive compute! Even optimizing the transformation code for the ml.p2.xlarge instance would still be significantly more costly because of the idle GPUs.

For more ideas on using serverless and ML, see Machine learning inference at scale using AWS serverless and Deploying machine learning models with serverless templates.


About the Authors

Nate Bachmeier is an AWS Senior Solutions Architect who nomadically explores New York, one cloud integration at a time. He specializes in migrating and modernizing customers’ workloads. Besides this, Nate is a full-time student and has two kids.

Marvin Fernandes is a Solutions Architect at AWS, based in the New York City area. He has over 20 years of experience building and running financial services applications. He is currently working with large enterprise customers to solve complex business problems by crafting scalable, flexible, and resilient cloud architectures.

Read More

Extract entities from insurance documents using Amazon Comprehend named entity recognition

Intelligent document processing (IDP) is a common use case for customers on AWS. You can use Amazon Comprehend and Amazon Textract for a variety of use cases, such as document extraction, data classification, and entity extraction. One specific industry that uses IDP is insurance. Insurers use IDP to automate data extraction for common use cases such as claims intake, policy servicing, quoting, payments, and next best actions. However, in some cases, an office receives a document with complex, label-less information. This is normally difficult for optical character recognition (OCR) software to capture, and identifying relationships and key entities becomes a challenge. The solution often requires manual human entry to ensure high accuracy.

In this post, we demonstrate how you can use named entity recognition (NER) for documents in their native formats in Amazon Comprehend to address these challenges.

Solution overview

In an insurance scenario, an insurer might receive a demand letter from an attorney’s office. The demand letter includes information such as what law office is sending the letter, who their client is, and what actions are required to satisfy their requests, as shown in the following example:

Because of the varied locations that this information could be found in a demand letter, these documents are often forwarded to an individual adjuster, who takes the time to read through the letter to determine all the necessary information required to proceed with a claim. The document may have multiple names, addresses, and requests that each need to be classified. If the client is mixed up with the beneficiary, or the addresses are switched, delays could add up and negative consequences could impact the company and customers. Because there are often small differences between categories like addresses and names, the documents are often processed by humans rather than using an IDP approach.

The preceding example document has many instances of overlapping entity values (entities that share similar properties but aren’t related). Examples of this are the address of the law office vs. the address of the insurance company or the names of the different individuals (attorney name, beneficiary, policy holder). Additionally, there is positional information (where the entity is positioned within the document) that a traditional text-only algorithm might miss. Therefore, traditional recognition techniques may not meet requirements.

In this post, we use named entity recognition in Amazon Comprehend to solve these challenges. The benefit of using this method is that the custom entity recognition model uses both the natural language and positional information of the text to accurately extract custom entities that may otherwise be impacted when flattening a document, as demonstrated in our preceding example of overlapping entity values. For this post, we use an AWS artificially created dataset of legal requisition and demand letters for life insurance, but you can use this approach across any industry and document that may benefit from spatial data in custom NER training. The following diagram depicts the solution architecture:

We implement the solution with the following high-level steps:

  1. Clone the repository containing the sample dataset.
  2. Create an Amazon Simple Storage Service (Amazon S3) bucket.
  3. Create and train your custom entity recognition model.
  4. Use the model by running an asynchronous batch job.

Prerequisites

You need to complete the following prerequisites to use this solution:

  1. Install Python 3.8.x.
  2. Make sure you have pip installed.
  3. Install and configure the AWS Command Line Interface (AWS CLI).
  4. Configure your AWS credentials.

Annotate your documents

To train a custom entity recognition model that can be used on your PDF, Word, and plain text documents, you need to first annotate PDF documents using a custom Amazon SageMaker Ground Truth annotation template that is provided by Amazon Comprehend. For instructions, see Custom document annotation for extracting named entities in documents using Amazon Comprehend.

We recommend a minimum of 250 documents and 100 annotations per entity to ensure good quality predictions. With more training data, you’re more likely to produce a higher-quality model.

When you’ve finished annotating, you can train a custom entity recognition model and use it to extract custom entities from PDF, Word, and plain text documents for batch (asynchronous) processing.

For this post, we have already labeled our sample dataset, and you don’t have to annotate the documents provided. However, if you want to use your own documents or adjust the entities, you have to annotate the documents. For instructions, see Custom document annotation for extracting named entities in documents using Amazon Comprehend.

We extract the following entities (which are case sensitive):

  • Law Firm
  • Law Office Address
  • Insurance Company
  • Insurance Company Address
  • Policy Holder Name
  • Beneficiary Name
  • Policy Number
  • Payout
  • Required Action
  • Sender

The dataset provided is entirely artificially generated. Any mention of names, places, and incidents are either products of the author’s imagination or are used fictitiously. Any resemblance to actual events or locales or persons, living or dead, is entirely coincidental.

Clone the repository

Start by cloning the repository by running the following command:

git clone https://github.com/aws-samples/aws-legal-entity-extraction

The repository contains the following files:

aws-legal-entity-extraction
	/source
	/annotations
	output.manifest
	sample.pdf
	bucketnamechange.py

Create an S3 bucket

To create an S3 bucket to use for this example, complete the following steps:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose Create bucket.
  3. Note the name of the bucket you just created.

To reuse the annotations that we already made for the dataset, we have to modify the output.manifest file and reference the bucket we just created.

  1. Modify the file by running the following commands:
    cd aws-legal-entity-extraction
    python3 bucketnamechange.py
    Enter the name of your bucket: <Enter the name of the bucket you created>

When the script is finished running, you receive the following message:

The manifest file is updated with the correct bucket

We can now begin training our model.

Create and train the model

To start training your model, complete the following steps:

  1. On the Amazon S3 console, upload the /source folder, /annotations folder, output.manifest, and sample.pdf files.

Your bucket should look similar to the following screenshot.

  1. On the Amazon Comprehend console, under Customization in the navigation pane, choose Custom entity recognition.
  2. Choose Create new model.
  3. For Model name, enter a name.
  4. For Language, choose English.
  5. For Custom entity type, add the following case-sensitive entities:
    1. Law Firm
    2. Law Office Address
    3. Insurance Company
    4. Insurance Company Address
    5. Policy Holder Name
    6. Beneficiary Name
    7. Policy Number
    8. Payout
    9. Required Action
    10. Sender

  6. In Data specifications, for Data format, select Augmented manifest to reference the manifest we created when we annotated the documents.
  7. For Training model type, select PDF, Word Documents.

This specifies the type of documents you’re using for training and inference.

  8. For SageMaker Ground Truth augmented manifest file S3 location, enter the location of the output.manifest file in your S3 bucket.
  9. For S3 prefix for Annotation data files, enter the path to the annotations folder.
  10. For S3 prefix for Source documents, enter the path to the source folder.
  11. For Attribute names, enter legal-entity-label-job-labeling-job-20220104T172242.

The attribute name corresponds to the name of the labeling job you create for annotating the documents. For the pre-annotated documents, we use the name legal-entity-label-job-labeling-job-20220104T172242. If you choose to annotate your documents, substitute this value with the name of your annotation job.

  12. Create a new AWS Identity and Access Management (IAM) role and give it permissions to the bucket that contains all your data.
  13. Finish creating the model (select the Autosplit option for your data source to see similar metrics to those in the following screenshots).

Now your recognizer model is visible on the dashboard with the model training status and metrics.

The model may take several minutes to train.

The following screenshot shows your model metrics when the training is complete.

Use the custom entity recognition model

To use the custom entity recognition models trained on PDF documents, we create a batch job to process them asynchronously.

  1. On the Amazon Comprehend console, choose Analysis jobs.
  2. Choose Create job.
  3. Under Input data, enter the Amazon S3 location of the annotated PDF documents to process (for this post, the sample.pdf file).
  4. For Input format, select One document per file.
  5. Under Output Data, enter the Amazon S3 location where you want the results to be stored. For this post, we create a new folder called analysis-output in the S3 bucket containing all source PDF documents, annotated documents, and manifest.
  6. Use an IAM role with permissions to the sample.pdf folder.

You can use the role created earlier.

  7. Choose Create job.

This is an asynchronous job, so it may take a few minutes to complete processing. When the job is complete, you get a link to the output. When you open this output, you see a series of files as follows:

You can open the file sample.pdf.out in your preferred text editor. If you search for the Entities Block, you can find the entities identified in the document. The following table shows an example.

Type | Text | Score
Insurance Company | Budget Mutual Insurance Company | 0.999984086
Insurance Company Address | 9876 Infinity Aven Springfield, MI 65541 | 0.999982051
Law Firm | Bill & Carr | 0.99997298
Law Office Address | 9241 13th Ave SWn Spokane, Washington (WA),99217 | 0.999274625
Beneficiary Name | Laura Mcdaniel | 0.999972464
Policy Holder Name | Keith Holt | 0.999781546
Policy Number | (#892877136) | 0.999950143
Payout | $15,000 | 0.999980728
Sender | Angela Berry | 0.999723455
Required Action | We are requesting that you forward the full policy amount of Please forward ann acknowledgement of our demand and please forward the umbrella policy information if one isn applicable. Please send my secretary any information regarding liens on his policy. | 0.999989449
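
As a minimal sketch, you could also read these values programmatically. This assumes the output file contains a top-level Entities list with Type, Text, and Score fields, as shown in the preceding table:

import json

with open("sample.pdf.out") as f:
    result = json.load(f)

for entity in result.get("Entities", []):
    print(f'{entity["Type"]}: {entity["Text"]} (score: {entity["Score"]:.4f})')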

Expand the solution

You can choose from a myriad of possibilities for what to do with the detected entities, such as the following:

  • Ingest them into a backend system of record
  • Create a searchable index based on the extracted entities
  • Enrich machine learning and analytics using extracted entity values as parameters for model training and inference
  • Configure back-office flows and triggers based on detected entity value (such as specific law firms or payout values)

The following diagram depicts these options:

Conclusion

Complex document types can often be impediments to full-scale IDP automation. In this post, we demonstrated how you can build and use custom NER models directly from PDF documents. This method is especially powerful for instances where positional information is especially pertinent (similar entity values and varied document formats). Although we demonstrated this solution by using legal requisition letters in insurance, you can extrapolate this use case across healthcare, manufacturing, retail, financial services, and many other industries.

To learn more about Amazon Comprehend, visit the Amazon Comprehend Developer Guide.


About the Authors

Raj Pathak is a Solutions Architect and Technical advisor to Fortune 50 and Mid-Sized FSI (Banking, Insurance, Capital Markets) customers across Canada and the United States. Raj specializes in Machine Learning with applications in Document Extraction, Contact Center Transformation and Computer Vision.

Enzo Staton is a Solutions Architect with a passion for working with companies to increase their cloud knowledge. He works closely as a trusted advisor and industry specialist with customers around the country.

Read More

Implement MLOps using AWS pre-trained AI Services with AWS Organizations

The AWS Machine Learning Operations (MLOps) framework is an iterative and repetitive process for evolving AI models over time. Like DevOps, practitioners gain efficiencies promoting their artifacts through various environments (such as quality assurance, integration, and production) for quality control. In parallel, customers rapidly adopt multi-account strategies through AWS Organizations and AWS Control Tower to create secure, isolated environments. This combination can introduce challenges for implementing MLOps with AWS pre-trained AI Services, such as Amazon Rekognition Custom Labels. This post discusses design patterns for reducing that complexity while still maintaining security best practices.

Overview

Customers across every industry vertical recognize the value of operationalizing machine learning (ML) efficiently and reducing the time to deliver business value. Most AWS pre-trained AI Services address this situation through out-of-the-box capabilities for computer vision, translation, and fraud detection, among other common use cases. Many use cases require domain-specific predictions that go beyond generic answers. The AI Services can fine-tune the predictive model results using customer-labeled data for those scenarios.

Over time, the domain-specific vocabulary changes and evolves. For example, suppose a tool manufacturer creates a computer vision model to detect its products in images (such as hammers and screwdrivers). In a future release, the business adds support for wrenches and saws. These new labels necessitate code changes on the manufacturer’s websites and custom applications. Now, there are dependencies that both artifacts must release simultaneously.

The AWS MLOps framework addresses these release challenges through iterative and repetitive processes. Before reaching production end-users, the model artifacts must traverse various quality gates like application code. You typically implement those quality gates using multiple AWS accounts within an AWS organization. This approach gives the flexibility to centrally manage these application domains and enforce guardrails and business requirements. It’s becoming increasingly common to have tens or even hundreds of accounts within your organization. However, you must balance your workload isolation needs against the team size and complexity.

MLOps practitioners have standard procedures for promoting artifacts between accounts (such as QA to production). These patterns are straightforward to implement, relying on copying code and binary resources between Amazon Simple Storage Service (Amazon S3) buckets. However, AWS pre-trained AI Services don’t currently support copying the trained custom model across AWS accounts. Until such a mechanism exists, you need to retrain the models in each AWS account using the same dataset. This approach involves time and cost for retraining the model in a new account. This mechanism can be a viable option for some customers. However, in this post, we demonstrate the means to define and evolve these custom models centrally while securely sharing them across an AWS organization’s accounts.

Solution overview

This post discusses design patterns for securely sharing AWS pre-trained AI Service domain-specific models. These services include Amazon Fraud Detector, Amazon Transcribe, and Amazon Rekognition, to name a few. Although these strategies are broadly applicable, we focus on Rekognition Custom Labels as a concrete example. We intentionally avoid diving too deep into Rekognition Custom Labels-specific nuances.

The architecture begins with a configured AWS Control Tower in the management account. AWS Control Tower provides the easiest way to set up and govern a secure, multi-account AWS environment. As shown in the following diagram, we use Account Factory in AWS Control Tower to create five AWS accounts:

  • CI/CD account for deployment orchestration (for example, with AWS CodeStar)
  • Production account for external end-users (for example, a public website)
  • Quality assurance account for internal development teams (such as preproduction)
  • ML account for custom models and supporting systems
  • AWS Lake House account holding proprietary customer data

This configuration might be too granular or coarse, depending on your regulatory requirements, industry, and size. Refer to Managing the multi-account environment using AWS Organizations and AWS Control Tower for more guidance.

Example AWS Organizations account configuration

Create the Rekognition Custom Labels model

The first step to creating a Rekognition Custom Labels model is choosing the AWS account to host it. You might begin your ML journey using a single ML account. This approach consolidates any tooling and procedures into one place. However, this centralization can cause bloat in the individual account and lead to monolithic environments. More mature enterprises segment this ML account by team or workload. Regardless of the granularity, the objective is the same: define models centrally and train them once.

This post demonstrates using a Rekognition Custom Labels model with a single ML account and a separate data lake account (see the following diagram). When the data resides in a different account, you must configure a resource policy to provide cross-account access to the S3 bucket objects. This procedure securely shares the bucket’s contents with the ML account. See the quick start samples for more information on creating an Amazon Rekognition domain-specific model.

Sharing data across accounts
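
The following example bucket policy grants the Amazon Rekognition service the read and write access it needs on the training data bucket; DOC-EXAMPLE-BUCKET is a placeholder for your bucket name: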

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AWSRekognitionS3AclBucketRead20191011",
            "Effect": "Allow",
            "Principal": {
                "Service": "rekognition.amazonaws.com"
            },
            "Action": [
                "s3:GetBucketAcl",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::DOC-EXAMPLE-BUCKET"
        },
        {
            "Sid": "AWSRekognitionS3GetBucket20191011",
            "Effect": "Allow",
            "Principal": {
                "Service": "rekognition.amazonaws.com"
            },
            "Action": [
                "s3:GetObject",
                "s3:GetObjectAcl",
                "s3:GetObjectVersion",
                "s3:GetObjectTagging"
            ],
            "Resource": "arn:aws:s3:::DOC-EXAMPLE-BUCKET/*"
        },
        {
            "Sid": "AWSRekognitionS3ACLBucketWrite20191011",
            "Effect": "Allow",
            "Principal": {
                "Service": "rekognition.amazonaws.com"
            },
            "Action": "s3:GetBucketAcl",
            "Resource": "arn:aws:s3:::DOC-EXAMPLE-BUCKET"
        },
        {
            "Sid": "AWSRekognitionS3PutObject20191011",
            "Effect": "Allow",
            "Principal": {
                "Service": "rekognition.amazonaws.com"
            },
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::DOC-EXAMPLE-BUCKET/*",
            "Condition": {
                "StringEquals": {
                    "s3:x-amz-acl": "bucket-owner-full-control"
                }
            }
        }
    ]
}

Enable cross-account access

After you build and deploy the model, the endpoint is only available within the ML account. Do not use a static key to share access. You must delegate access to the production (or QA) account using AWS Identity and Access Management (IAM) roles. To create a cross-account role in the ML account, complete the following steps:

  1. On the Rekognition Custom Labels console, choose Projects and choose your project name.
  2. Choose Models and your model name.
  3. On the Use model tab, scroll down to the Use your model section.
  4. Copy the model Amazon Resource Name (ARN). It should be formatted as follows: arn:aws:rekognition:region-name:account-id:project/model-name/version/version-id/timestamp.
  5. Create a role with rekognition:DetectCustomLabels permissions to the model ARN and a trust policy allowing sts:AssumeRole from the production (or QA) account (for example, arn:aws:iam::PROD_ACCOUNT_ID_HERE:root).
  6. Optionally, attach additional policies for any workload-specific actions (such as accessing S3 buckets).
  7. Optionally, configure the condition element to enforce additional delegation requirements.
  8. Record the new role’s ARN to use in the next section.
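
If you prefer to script steps 5 and 6, the following Boto3 sketch creates an equivalent role. The role name, account IDs, and model ARN are placeholders, not values from this post:

import json
import boto3

iam = boto3.client("iam")

prod_account_id = "PROD_ACCOUNT_ID_HERE"
model_arn = "arn:aws:rekognition:region-name:ML_ACCOUNT_ID:project/model-name/version/version-id/timestamp"

# Trust policy: allow principals in the production account to assume this role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{prod_account_id}:root"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="RekognitionCrossAccountAccess",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Permissions policy: allow calling DetectCustomLabels on this specific model only.
iam.put_role_policy(
    RoleName="RekognitionCrossAccountAccess",
    PolicyName="DetectCustomLabelsOnly",
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "rekognition:DetectCustomLabels",
            "Resource": model_arn,
        }],
    }),
)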

Invoke the endpoint

With the security policies in place, it’s time to test the configuration. A simple approach involves creating an Amazon Elastic Compute Cloud (Amazon EC2) instance and using the AWS Command Line Interface (AWS CLI). Invoke the endpoint with the following steps:

  1. In the production (or QA) account, create a role for Amazon EC2.
  2. Attach a policy allowing sts:AssumeRole to the ML account’s cross-role ARN.
  3. Launch an Amazon Linux 2 instance with the role from the previous step.
  4. Wait for it to provision, then connect to the Linux instance using SSH.
  5. Invoke the command aws sts assume-role to switch to the cross-account role from the previous section.
  6. Start the model endpoint, if not already running, using the Rekognition console or the start-project-version AWS CLI command.
  7. Invoke the command aws rekognition detect-custom-labels to test the operation.

You can also perform this test using the AWS SDK and another compute resource (for example, AWS Lambda).
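
For example, the following Boto3 sketch performs the same test programmatically; the role ARN, model ARN, and S3 object names are placeholders:

import boto3

sts = boto3.client("sts")

# Assume the cross-account role created in the ML account.
credentials = sts.assume_role(
    RoleArn="arn:aws:iam::ML_ACCOUNT_ID:role/RekognitionCrossAccountAccess",
    RoleSessionName="cross-account-inference-test",
)["Credentials"]

rekognition = boto3.client(
    "rekognition",
    aws_access_key_id=credentials["AccessKeyId"],
    aws_secret_access_key=credentials["SecretAccessKey"],
    aws_session_token=credentials["SessionToken"],
)

# Call the shared model endpoint hosted in the ML account.
response = rekognition.detect_custom_labels(
    ProjectVersionArn="arn:aws:rekognition:region-name:ML_ACCOUNT_ID:project/model-name/version/version-id/timestamp",
    Image={"S3Object": {"Bucket": "my-test-images", "Name": "tool-photo.jpg"}},
    MinConfidence=80,
)
for label in response["CustomLabels"]:
    print(label["Name"], label["Confidence"])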

Avoiding the public internet

In the previous section, the detect-custom-labels request uses the virtual private cloud’s (VPC) internet gateway and traverses the public internet. TLS/SSL encryption sufficiently secures the communication channel for many workloads. You can use AWS PrivateLink to enable connections between the VPC and supporting services without requiring an internet gateway, NAT device, VPN connection, transit gateway, or AWS Direct Connect connection. Then the detect-custom-labels request never leaves the AWS network and is never exposed to the public internet. AWS PrivateLink supports all services used within this post. You can also use IAM conditions in the cross-account role policy to enforce that the pre-trained AI Services are used only over private connectivity. This control adds another level of protection that prevents misconfigured clients from using the pre-trained AI Service’s internet-facing endpoint. For additional information, see Using Amazon Rekognition with Amazon VPC endpoints, AWS PrivateLink for Amazon S3, and Using AWS STS interface VPC endpoints.

The following diagram illustrates the VPC endpoint configuration between the production account, ML account, and QA account.

Using VPC Endpoints across accounts

Build a CI/CD pipeline for promoting models

AWS recommends continuously providing more training and test data to the Rekognition Custom Labels project dataset to improve models. After you add more data to a project, training a new model version can enhance accuracy or alter labels.

In MLOps, model artifacts must be consistent. To accomplish this with pre-trained AI Services, AWS recommends promoting the model endpoint by updating the code’s reference to the new model version’s ARN. This approach avoids retraining the domain-specific model in each environment (such as QA and production accounts). Your applications can use the new model’s ARN as a runtime variable using AWS Systems Manager within a multi-account or multi-stage environment.

Three granularity levels limit access to the cross-account model, specifically at the account, project, and model version level. Model versions are immutable and have a unique ARN that maps to a specific point-in-time training run: arn:aws:rekognition:region:account:project/project_name/version/name/timestamp.

The following diagram illustrates the model rotation from QA to production.

Promoting the model version

In the preceding architecture, the production and QA applications make API calls to use the v2 or v3 model endpoints through their respective VPC endpoints. They receive the ARN from a configuration store (for example, AWS Systems Manager Parameter Store or AWS AppConfig). This process works with any number of environments, but we demonstrate it with only two accounts for simplicity. Optionally, removing the superseded model versions prevents additional consumption of those resources.
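
For example, a deployment step and an application could exchange the promoted model version ARN through Parameter Store as in the following sketch; the parameter name and ARN are illustrative assumptions:

import boto3

ssm = boto3.client("ssm")

# CI/CD pipeline side: record the newly promoted model version ARN.
ssm.put_parameter(
    Name="/prod/rekognition/model-version-arn",
    Value="arn:aws:rekognition:region-name:ML_ACCOUNT_ID:project/model-name/version/v3/timestamp",
    Type="String",
    Overwrite=True,
)

# Application side: resolve the ARN at runtime instead of hard-coding it.
model_arn = ssm.get_parameter(Name="/prod/rekognition/model-version-arn")["Parameter"]["Value"]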

The ML account has an IAM role for each environment-specific account (such as the production account) that requires access. As part of the deployment, the CI/CD pipeline alters the inline policy of that IAM role to allow access to the appropriate model.

Consider the scenario of promoting Model-v2 from the QA account to the production account. This process requires the following steps:

  1. On the Rekognition Custom Labels console, transition the Model-v2 endpoint into a running state.
  2. Grant the IAM cross-account role in the ML account access to the new version of Model-v2.

Note that the resource element supports wildcards in the ARN.

  3. Send a test invocation to Model-v2 from the production application using the delegation role.
  4. Optionally, remove the cross-account role’s access to Model-v1.
  5. Optionally, repeat steps 2–3 for each additional AWS account.
  6. Optionally, stop the Model-v1 endpoint to avoid incurring costs.

Global policy propagation from the IAM control plane to the IAM data plane in every Region is an eventually consistent operation. This design can create slight delays for multi-Regional configurations.

Create guardrails through service control policies

Using cross-account roles creates a secure mechanism for sharing pre-trained managed AI resources. But what happens when that role’s policy is too permissive? You can mitigate these risks by using service control policies (SCPs) to set permission guardrails across accounts. Guardrails specify the maximum permissions available for an IAM identity. These capabilities can prevent a model consumer account from, for example, stopping the shared Amazon Rekognition endpoint. After defining appropriate guardrail requirements, organizational units within Organizations allow centrally managing those policies across multiple accounts.
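
The following example SCP denies member accounts the ability to create, delete, start, or stop Rekognition Custom Labels projects, while still allowing inference calls such as DetectCustomLabels: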

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyModifyingRekognitionProjects",
      "Effect": "Deny",
      "Action": [
        "rekognition:CreateProject*",
        "rekognition:DeleteProject*",
        "rekognition:StartProject*",
        "rekognition:StopProject*"
      ],
      "Resource": [
        "arn:aws:rekognition:*:*:project/*"
      ]
    }
  ]
}

You can also configure detective controls to monitor their configuration and make sure it doesn’t drift out of compliance. AWS IAM Access Analyzer supports assessing policies across the organization and reporting unused permissions. Additionally, AWS Config enables assessing, auditing, and evaluating configurations of AWS resources. This capability supports standard security and compliance requirements, such as verifying and remediating the S3 bucket’s encryption settings.

Conclusion

You need out-of-the-box solutions to add ML capabilities like computer vision, translation, and fraud detection. You also need security boundaries that isolate your different environments for quality control, compliance, and regulatory purposes. AWS pre-trained AI services and AWS Control Tower deliver that functionality in a manner that is easily accessible and secure.

AWS pre-trained AI services don’t currently support copying the trained custom model across AWS accounts. Until such a mechanism exists, you need to retrain the models in each AWS account using the same dataset. This post demonstrates an alternative design approach using IAM cross-account policies to share model endpoints while maintaining robust security control. Furthermore, you can stop paying for redundant training jobs! For more information on cross-account policies, see IAM tutorial: Delegate access across AWS accounts using IAM roles.


About the Authors

Nate Bachmeier is an AWS Senior Solutions Architect that nomadically explores New York, one cloud integration at a time. He specializes in migrating and modernizing customers’ workloads. Besides this, Nate is a full-time student and has two kids.

Mario Bourgoin is a Senior Partner Solutions Architect for AWS, an AI/ML specialist, and the global tech lead for MLOps.  He works with enterprise customers and partners deploying AI solutions in the cloud.  He has more than 30 years experience doing machine learning and AI at startups and in enterprises, starting with creating one of the first commercial machine learning systems for big data.  Mario spends the balance of his time playing with his three Belgian Tervurens, cooking dinners for his family, and learning about mathematics and cosmology.

Tim Murphy is a Senior Solutions Architect for AWS, working with enterprise customers in various industries to build business based solutions in the cloud. He has spent the last decade working with startups, non-profits, commercial enterprise, and government agencies, deploying infrastructure at scale. In his spare time when he isn’t tinkering with technology, you’ll most likely find him in far flung areas of the earth hiking mountains, surfing waves, or biking through a new city.

Read More

Improve high-value research with Hugging Face and Amazon SageMaker asynchronous inference endpoints

Many of our AWS customers provide research, analytics, and business intelligence as a service. This type of research and business intelligence enables their end customers to stay ahead of markets and competitors, identify growth opportunities, and address issues proactively. For example, some of our financial services sector customers do research for equities, hedge funds, and investment management companies to help them understand trends and identify portfolio strategies. In the health industry, an increasingly large portion of health research is now information-based. A great deal of research entails the analysis of data that was initially collected for diagnostic, treatment, or for other research projects, and is now being used for new research purposes. These forms of health research have led to effective primary prevention to avoid new cases, secondary prevention for early detection, and prevention for better disease management. The research outcomes not only improve the quality of life but also help reduce healthcare expenses.

Customers tend to digest information from public and private sources. They then apply established or custom natural language processing (NLP) models to summarize the information, identify trends, and generate insights. The NLP models used for these types of research tasks are large, and the articles to be summarized are long given the size of the corpus, so they typically run on dedicated endpoints, which aren’t cost-optimized at the moment. These applications also receive bursts of incoming traffic at different times of the day.

We believe customers would greatly benefit from the ability to scale down to zero and ramp up their inference capability on an as-needed basis. This optimizes research costs without compromising the quality of inferences. This post discusses how Hugging Face along with Amazon SageMaker asynchronous inference can help achieve this.

You can build text summarization models with multiple deep-learning frameworks like TensorFlow, PyTorch, and Apache MXNet. These models typically have a large input payload of multiple text documents of varying size. Advanced deep learning models require compute-intensive preprocessing before model inference. Processing times can be as long as a few minutes, which removes the option to run real-time inference by passing payloads over an HTTP API. Instead, you need to process input payloads asynchronously from an object store like Amazon Simple Storage Service (Amazon S3) with automatic queuing and a predefined concurrency threshold. The system should be able to receive status notifications and reduce unnecessary costs by cleaning up resources when the tasks are complete.

SageMaker helps data scientists and developers prepare, build, train, and deploy high-quality machine learning (ML) models quickly by bringing together a broad set of capabilities purpose-built for ML. SageMaker provides the most advanced open-source model-serving containers for XGBoost (container, SDK), Scikit-Learn (container, SDK), PyTorch (container, SDK), TensorFlow (container, SDK), and Apache MXNet (container, SDK).

SageMaker provides four options to deploy trained ML models for generating inferences on new data.
  1. Real-time inference endpoints are suitable for workloads that need to be processed with low latency requirements in the order of ms to seconds.
  2. Batch transform is ideal for offline predictions on large batches of data.
  3. Amazon SageMaker Serverless Inference (in preview mode and not recommended for production workloads as of this writing) is a purpose-built inference option that makes it easy for you to deploy and scale ML models. Serverless Inference is ideal for workloads that have idle periods between traffic spurts and can tolerate cold starts.
  4. Asynchronous Inference endpoints queue incoming requests. They’re ideal for workloads where the request sizes are large (up to 1 GB) and inference processing times are in the order of minutes (up to 15 minutes). Asynchronous inference enables you to save on costs by auto scaling the instance count to zero when there are no requests to process.

Solution overview

In this post, we deploy a PEGASUS model that was pre-trained to do text summarization from Hugging Face to SageMaker hosting services. We use the model as is from Hugging Face for simplicity. However, you can fine-tune the model based on a custom dataset. You can also try out other models available in the Hugging Face Model Hub. We also provision an asynchronous inference endpoint that hosts this model, from which you can get predictions.

The asynchronous inference endpoint’s inference handler expects an article as input payload. The summarized text of the article is the output. The output is stored in a database for trend analysis or fed downstream for further analytics. This downstream analysis derives data insights that help with the research.

We demonstrate how asynchronous inference endpoints enable you to have user-defined concurrency and completion notifications. We configure auto scaling of instances behind the endpoint to scale down to zero when traffic subsides and scale back up as the request queue fills up.

We also use Amazon CloudWatch metrics to monitor the queue size, total processing time, and invocations processed.

In the following diagram, we show the steps involved while performing inference using an asynchronous inference endpoint.

  1. Our pre-trained PEGASUS ML model is first hosted on the scaling endpoint.
  2. The user uploads the article to be summarized to an input S3 bucket.
  3. The asynchronous inference endpoint is invoked using an API.
  4. After the inference is complete, the result is saved to the output S3 bucket.
  5. An Amazon Simple Notification Service (Amazon SNS) notification is sent to the user notifying them of the completed success or failure.

Create an asynchronous inference endpoint

We create the asynchronous inference endpoint similar to a real-time hosted endpoint. The steps include creating a SageMaker model, followed by configuring the endpoint and deploying the endpoint. The difference between the two types of endpoints is that the asynchronous inference endpoint configuration contains an AsyncInferenceConfig section. Here we specify the S3 output path for the results from the endpoint invocation and optionally include SNS topics for notifications on success and failure. We also specify the maximum number of concurrent invocations per instance as determined by the customer. See the following code:

AsyncInferenceConfig={
        "OutputConfig": {
            "S3OutputPath": f"s3://{bucket}/{bucket_prefix}/output",
            #  Optionally specify Amazon SNS topics for notifications
            "NotificationConfig": {
              "SuccessTopic": success_topic,
              "ErrorTopic": error_topic,
            }
        },
        "ClientConfig": {
            "MaxConcurrentInvocationsPerInstance": 2 #increase this value up to throughput peak for ideal performance
        }
    }

For details on the API to create an endpoint configuration for asynchronous inference, see Create an Asynchronous Inference Endpoint.
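
If you use the SageMaker Python SDK, the deployment can be sketched as follows. The Hugging Face model ID and container versions shown here are assumptions (pick a supported combination for your environment), and bucket, bucket_prefix, success_topic, and error_topic are the variables defined elsewhere in this post:

import sagemaker
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.async_inference import AsyncInferenceConfig

role = sagemaker.get_execution_role()

# Pull a PEGASUS summarization model directly from the Hugging Face Hub.
huggingface_model = HuggingFaceModel(
    env={"HF_MODEL_ID": "google/pegasus-xsum", "HF_TASK": "summarization"},
    role=role,
    transformers_version="4.12",
    pytorch_version="1.9",
    py_version="py38",
)

async_predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    async_inference_config=AsyncInferenceConfig(
        output_path=f"s3://{bucket}/{bucket_prefix}/output",
        max_concurrent_invocations_per_instance=2,
        notification_config={"SuccessTopic": success_topic, "ErrorTopic": error_topic},
    ),
)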

Invoke the asynchronous inference endpoint

The following screenshot shows a brief article that we use as our input payload:

The following code uploads the article as an input.json file to Amazon S3:

sm_session.upload_data(
        input_location, 
        bucket=sm_session.default_bucket(),
        key_prefix=prefix, 
        extra_args={"ContentType": "text/plain"})

We use the Amazon S3 URI to the input payload file to invoke the endpoint. The response object contains the output location in Amazon S3 to retrieve the results after completion:

response = sm_runtime.invoke_endpoint_async(EndpointName=endpoint_name, 
InputLocation=input_1_s3_location)
output_location = response['OutputLocation']

The following screenshot shows the sample output post summarization:

For details on the API to invoke an asynchronous inference endpoint, see Invoke an Asynchronous Inference Endpoint.

Queue the invocation requests with user-defined concurrency

The asynchronous inference endpoint automatically queues the invocation requests. This is a fully managed queue with various monitoring metrics and doesn’t require any further configuration. It uses the MaxConcurrentInvocationsPerInstance parameter in the preceding endpoint configuration to process new requests from the queue after previous requests are complete. MaxConcurrentInvocationsPerInstance is the maximum number of concurrent requests sent by the SageMaker client to the model container. If no value is provided, SageMaker chooses an optimal value for you.

Auto scaling instances within the asynchronous inference endpoint

We set the auto scaling policy with a minimum capacity of zero and a maximum capacity of five instances. Unlike real-time hosted endpoints, asynchronous inference endpoints support scaling down instances to zero by setting the minimum capacity to zero. We use the ApproximateBacklogSizePerInstance metric for the scaling policy configuration with a target queue backlog of five per instance to scale out further. We set the cooldown period for ScaleInCooldown to 120 seconds and the ScaleOutCooldown to 120 seconds. The value for ApproximateBacklogSizePerInstance is chosen based on the traffic and your sensitivity to scaling speed. The faster you scale in, the less cost you incur, but the more likely you’ll have to scale up again when new requests come in. The slower you scale in, the more cost you incur, but you’re less likely to have a request come in when you’re under-scaled.

client = boto3.client('application-autoscaling')  # Common class representing Application Auto Scaling for SageMaker among other services

resource_id = 'endpoint/' + endpoint_name + '/variant/' + 'variant1'  # This is the format in which Application Auto Scaling references the endpoint

response = client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=0,
    MaxCapacity=5
)

response = client.put_scaling_policy(
    PolicyName='Invocations-ScalingPolicy',
    ServiceNamespace='sagemaker',  # The namespace of the AWS service that provides the resource
    ResourceId=resource_id,  # Endpoint name
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',  # SageMaker supports only Instance Count
    PolicyType='TargetTrackingScaling',  # 'StepScaling'|'TargetTrackingScaling'
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 5.0,  # The target value for the metric
        'CustomizedMetricSpecification': {
            'MetricName': 'ApproximateBacklogSizePerInstance',
            'Namespace': 'AWS/SageMaker',
            'Dimensions': [{'Name': 'EndpointName', 'Value': endpoint_name}],
            'Statistic': 'Average',
        },
        'ScaleInCooldown': 120,  # The amount of time, in seconds, after a scale-in activity completes before another scale-in activity can start
        'ScaleOutCooldown': 120  # The amount of time, in seconds, after a scale-out activity completes before another scale-out activity can start
        # 'DisableScaleIn': True|False - indicates whether scale in by the target tracking policy is disabled.
        # If the value is true, scale-in is disabled and the target tracking policy won't remove capacity from the scalable resource.
    }
)

For details on the API to auto scale an asynchronous inference endpoint, see the Autoscale an Asynchronous Inference Endpoint.

Configure notifications from the asynchronous inference endpoint

We create two separate SNS topics for success and error notifications for each endpoint invocation result:

sns_client = boto3.client('sns')
response = sns_client.create_topic(Name="Async-Demo-ErrorTopic2")
error_topic = response['TopicArn']
response = sns_client.create_topic(Name="Async-Demo-SuccessTopic2")
success_topic = response['TopicArn']

Other options for notifications include periodically checking the output S3 bucket, or using S3 bucket notifications to invoke an AWS Lambda function on file upload. SNS notifications are included in the endpoint configuration section as previously described.

For details on how to set up notifications from an asynchronous inference endpoint, see Check Prediction Results.
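
As a minimal sketch, you could subscribe an email address to both topics (the address is a placeholder):

# Subscribe an email address to both topics so the user is notified when a
# summarization request succeeds or fails.
for topic_arn in (success_topic, error_topic):
    sns_client.subscribe(
        TopicArn=topic_arn,
        Protocol="email",
        Endpoint="analyst@example.com",
    )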

Monitor the asynchronous inference endpoint

We monitor the asynchronous inference endpoint with built-in additional CloudWatch metrics specific to asynchronous inference. For example, we monitor the queue length in each instance with ApproximateBacklogSizePerInstance and total queue length with ApproximateBacklogSize.

For a complete list of metrics, refer to Monitoring Asynchronous Inference Endpoints.

We can optimize the endpoint configuration to get the most cost-effective instance with high performance. For example, we can use an instance with Amazon Elastic Inference or AWS Inferentia. We can also gradually increase the concurrency level up to the throughput peak while adjusting other model server and container parameters.

CloudWatch graphs

We simulated traffic of 10,000 inference requests flowing in over a period of time to the asynchronous inference endpoint, with the auto scaling policy described in the previous section enabled.

The following screenshot shows instance metrics before requests started flowing in. We start with a live endpoint with zero instances running:

The following graph showcases how the ApproximateBacklogSize and ApproximateBacklogSizePerInstance metrics change as the auto scaling kicks in and the load on the endpoint is shared by multiple instances that were provisioned as part of the auto scaling process.

As shown in the following screenshot, the number of instances increased as the inference count scaled up:

The following screenshot shows how the scaling in brings back the endpoint to the initial state of zero running instances:

Clean up

After all the requests are complete, we can delete the endpoint similar to deleting real-time hosted endpoints. Note that if we set the minimum capacity of asynchronous inference endpoints to zero, there are no instance charges incurred after it scales down to zero.

If you enabled auto scaling for your endpoint, make sure you deregister the endpoint as a scalable target before deleting the endpoint. To do this, run the following code:

response = client.deregister_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount'
)

Remember to delete your endpoint after use as you will be charged for the instances used in this demo.

sm_client.delete_endpoint(EndpointName=endpoint_name)

You also need to delete the S3 objects and SNS topics. If you created any other AWS resources to consume and action on the SNS notifications, you may also want to delete them.

Conclusion

In this post, we demonstrated how to use the new asynchronous inference capability from SageMaker to process a typical large input payload that is part of a summarization task. For inference, we used a model from Hugging Face and deployed it on an asynchronous inference endpoint. We explained the common challenges of burst traffic, high model processing times, and large payloads involved with research analytics. The asynchronous inference endpoint’s inherent ability to manage internal queues, enforce predefined concurrency limits, configure response notifications, and automatically scale down to zero helped us address these challenges. The complete code for this example is available on GitHub.

To get started with SageMaker asynchronous inference, check out Asynchronous Inference.


About the Authors

Dinesh Kumar Subramani is a Senior Solutions Architect with the UKIR SMB team, based in Edinburgh, Scotland. He specializes in artificial intelligence and machine learning. Dinesh enjoys working with customers across industries to help them solve their problems with AWS services. Outside of work, he loves spending time with his family, playing chess and enjoying music across genres.

Raghu Ramesha is an ML Solutions Architect with the Amazon SageMaker Service team. He focuses on helping customers build, deploy and migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in Computer Science from UT Dallas. In his free time, he enjoys traveling and photography.

Read More