Customize Amazon Textract with business-specific documents using Custom Queries

Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents. Queries is a feature that enables you to extract specific pieces of information from varying, complex documents using natural language. Custom Queries provides a way for you to customize the Queries feature for your business-specific, non-standard documents such as auto lending contracts, checks, and pay statements, in a self-service way. By customizing the feature to recognize the unique terms, structures, and key information specific to these document types, you can meet your downstream processing needs with greater precision and minimal human intervention. Custom Queries is easy to integrate into your existing Amazon Textract pipeline, and you continue to benefit from the fully managed intelligent document processing features of Amazon Textract without having to invest in ML expertise or infrastructure management.

In this post, we show how Custom Queries can accurately extract data from checks that are complex, non-standard documents. In addition, we discuss the benefits of Custom Queries and share best practices for effectively using this feature.

Solution overview

When starting with a new use case, you can evaluate how Textract Queries performs on your documents by navigating to the Textract console and using the Analyze Document Demo or Bulk Document Uploader. Refer to Best Practices for Queries to draft queries applicable to your use case. If you identify errors in the query responses due to the nature of your business documents, you can use Custom Queries to improve accuracy. Within hours, you can annotate your sample documents using the AWS Management Console and train an adapter. Adapters are components that plug in to the Amazon Textract pre-trained deep learning model, customizing its output based on your annotated documents. You can use the adapter for inference by passing the adapter identifier as an additional parameter to the Analyze Document Queries API request.

Let’s examine how Custom Queries can improve extraction accuracy in a challenging real-world scenario such as extraction of data from checks. The primary challenge when processing checks arises from their high degree of variation depending on the type (e.g., personal or cashier’s checks), financial institution, and country (e.g., MICR line format). These variations can include the placement of the payee’s name, the amount in numbers and words, the date, and the signature. Recognizing and adapting to these variations can be a complex task during data extraction. To improve data extraction, organizations often employ manual verification and validation processes, which increases the cost and time of the extraction process.

Custom Queries addresses these challenges by enabling you to customize the pre-trained Queries features on the different variations of checks. Customization of the pre-trained feature helps you achieve a high data extraction accuracy on the specific variety of layouts that you process.

In our use case, a financial institution wants to extract the following fields from a check: payee name, payer name, account number, routing number, payment amount (in numbers), payment amount (in words), check number, date, and memo.

Let’s explore the process of generating an adapter (component that customizes the output) for checks processing. Adapters can be created via the console or programmatically via the API. This post details the console experience; however, if you’d like to programmatically create the adapter, refer to the code samples in the custom-queries-checks-blog.ipynb Jupyter notebook (Option 2).

The adapter generation process involves five high-level steps: create an adapter, upload sample documents, annotate the documents, train the adapter, and evaluate performance metrics.

Create an adapter

On the Amazon Textract console, create a new adapter by providing a name, description, and optional tags that can help you identify the adapter. You have the option to enable automatic updates, which allows Amazon Textract to update your adapter when the underlying Queries feature is updated with new capabilities.
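
Adapters can also be created programmatically. The following is a minimal sketch using the AWS SDK for Python (Boto3); the adapter name, description, and tag value are placeholders, not values from this walkthrough.

import boto3

textract = boto3.client("textract")

# Create a Custom Queries adapter for the Queries feature.
# AutoUpdate lets Amazon Textract update the adapter when the
# underlying Queries feature is updated with new capabilities.
response = textract.create_adapter(
    AdapterName="checks-processing-adapter",  # placeholder name
    Description="Adapter for checks processing",
    FeatureTypes=["QUERIES"],
    AutoUpdate="ENABLED",
    Tags={"project": "custom-queries-checks"},  # optional
)

adapter_id = response["AdapterId"]
print(f"Created adapter: {adapter_id}")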

After the adapter is created, you will see an adapter details page with a list of steps in the How it works section. Each step in this section becomes active as you complete the preceding one.

Upload sample documents

The initial phase in adapter generation involves the careful selection of an appropriate set of sample documents for annotation, training, and testing. We have an option to auto split the documents into test and train datasets; however, for this process, we manually split the dataset.

It’s important to note that you can construct an adapter with as few as five test and five training samples, but it’s essential to ensure that this sample set is diverse and representative of the workload encountered in a production environment.

For this tutorial, we have curated sample check datasets that you can download. Our dataset includes variations such as personal checks, cashier’s checks, stimulus checks, and checks embedded within pay stubs. We also included handwritten and printed checks, along with variations in fields such as the memo line.

Annotate sample documents

As a next step, you annotate the sample documents by associating queries with their corresponding answers via the console. You can initiate annotation via auto labeling or manual labeling. Auto labeling uses Amazon Textract Queries to pre-label the dataset. We recommend using auto labeling to fast-track the annotation process.

For this checks processing use case, we use the following queries. If your use case involves other document types, refer to Best Practices for Queries to draft queries applicable to your use case.

  • Who is the payee?
  • What is the check#?
  • What is the payee address?
  • What is the date?
  • What is the account#?
  • What is the check amount in words?
  • What is the account name/payer/drawer name?
  • What is the dollar amount?
  • What is the bank name/drawee name?
  • What is the bank routing number?
  • What is the MICR line?
  • What is the memo?

When the auto labeling process is complete, you have the option to review and make edits to the answers provided for each document. Choose Start reviewing to review the annotations against each image.

If the response to a query is missing or wrong, you can add or edit the response either by drawing a bounding box or entering the response manually.

To accelerate your walkthrough, we have pre-annotated the checks samples for you to copy to your AWS account. Run the custom-queries-checks-blog.ipynb Jupyter notebook within the Amazon Textract code samples library to automatically update your annotations.

Train the adapter

After you’ve reviewed all the sample documents to ensure the accuracy of the annotations, you can begin the adapter training process. During this step, you need to designate a storage location where the adapter should be saved. The duration of the training process will vary depending on the size of the dataset used for training. You can also invoke the training API programmatically if you prefer to use your own annotation tool and pass the relevant input files to the API. Refer to Custom Queries for more details.
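
For reference, the following is a minimal sketch of starting training programmatically with the CreateAdapterVersion API; the adapter ID, bucket names, manifest key, and output prefix are placeholders, and the manifest must reference your annotated training and test documents.

import boto3

textract = boto3.client("textract")

# Placeholders: adapter ID from CreateAdapter, plus your own S3 locations
response = textract.create_adapter_version(
    AdapterId="111111111",
    DatasetConfig={
        "ManifestS3Object": {
            "Bucket": "my-annotations-bucket",
            "Name": "checks/manifest.jsonl",
        }
    },
    OutputConfig={
        "S3Bucket": "my-adapter-output-bucket",
        "S3Prefix": "adapters/checks",
    },
)

print(f"Training adapter version: {response['AdapterVersion']}")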

Evaluate performance metrics

After the adapter has completed training, you can assess its performance by examining evaluation metrics such as F1 score, precision, and recall. You can analyze these metrics either collectively or on a per-document basis. Using our sample checks dataset, you will see the accuracy metric (F1 score) improve from 68% to 92% with the trained adapter.
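
These metrics can also be retrieved programmatically after training completes. The following sketch assumes the GetAdapterVersion API and an EvaluationMetrics field in its response; the adapter ID and version are placeholders.

import boto3

textract = boto3.client("textract")

# Placeholders: the adapter ID and version returned by the create/train calls
response = textract.get_adapter_version(
    AdapterId="111111111",
    AdapterVersion="1",
)

# Each entry reports metrics such as F1 score, precision, and recall
for metric in response.get("EvaluationMetrics", []):
    print(metric)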

Additionally, you can test the adapter’s output on new documents by choosing Try Adapter.

Following the evaluation, you can choose to enhance the adapter’s performance by either incorporating additional sample documents into the training dataset or by re-annotating documents with scores that are lower than your threshold. To re-annotate documents, choose Verify documents on the adapter details page, select the document, and choose Review annotations.

Programmatically test the adapter

With the training successfully completed, you can now use the adapter in your AnalyzeDocument API calls. The API request is similar to the Amazon Textract Queries API request, with the addition of the AdaptersConfig object.

You can run the following sample code or directly run it within the custom-queries-checks-blog.ipynb Jupyter notebook. The sample notebook also provides code to compare results between Amazon Textract Queries and Amazon Textract Custom Queries.

Create an AdaptersConfig object with the adapter ID and adapter version, and optionally include the pages you want the adapter to be applied to:

!python -m pip install amazon-textract-caller --upgrade
!python -m pip install amazon-textract-response-parser --upgrade

import boto3
from textractcaller.t_call import call_textract, Textract_Features, Query, QueriesConfig, Adapter, AdaptersConfig
import trp.trp2 as t2
from tabulate import tabulate

# Create AdaptersConfig
adapter1 = Adapter(adapter_id="111111111", version="1", pages=["*"])
adapters_config = AdaptersConfig(adapters=[adapter1])

Create a QueriesConfig object with the queries you trained the adapter with and call the Amazon Textract API. Note that you can also include additional queries that the adapter has not been trained on. Amazon Textract will automatically use the Queries feature for these questions and not Custom Queries, thereby providing you with the flexibility of using Custom Queries only where needed.

# Create QueriesConfig
queries = []
queries.append(Query(text="What is the check#?", alias="CHECK_NUMBER", pages=["*"]))
queries.append(Query(text="What is the date?", alias="DATE", pages=["*"]))
queries.append(Query(text="What is the check amount in words?", alias="CHECK_AMOUNT_WORDS", pages=["*"]))
queries.append(Query(text="What is the dollar amount?", alias="DOLLAR_AMOUNT", pages=["*"]))
queries.append(Query(text="Who is the payee?", alias="PAYEE_NAME", pages=["*"]))
queries.append(Query(text="What is the customer account#", alias="ACCOUNT_NUMBER", pages=["*"]))
queries.append(Query(text="what is the payee address?", alias="PAYEE_ADDRESS", pages=["*"]))
queries.append(Query(text="What is the bank routing number?", alias="BANK_ROUTING_NUMBER", pages=["*"]))
queries.append(Query(text="What is the memo", alias="MEMO", pages=["*"]))
queries.append(Query(text="What is the account name/payer/drawer name?", alias="ACCOUNT_NAME", pages=["*"]))
queries.append(Query(text="What is the bank name/drawee name?", alias="BANK_NAME", pages=["*"]))
queries_config = QueriesConfig(queries=queries)

document_name = "<image_name>"

textract_json_with_adapter = call_textract(input_document=document_name,
                  boto3_textract_client=textract_client,
                  features=[Textract_Features.QUERIES],
                  queries_config=queries_config,
                  adapters_config=adapters_config)

Finally, we tabulate our results for better readability:

def tabulate_query_answers(textract_json):
    d = t2.TDocumentSchema().load(textract_json)
    for page in d.pages:
        query_answers = d.get_query_answers(page=page)
        print(tabulate(query_answers, tablefmt="github"))

tabulate_query_answers(textract_json_with_adapter)

Clean up

To clean up your resources, complete the following steps:

  1. On the Amazon Textract console, choose Custom Queries in the navigation pane.
  2. Select the adapter you want to delete.
  3. Choose Delete.

Adapter management

You can regularly improve your adapters by creating new versions of a previously generated adapter. To create a new version of an adapter, you add new sample documents to an existing adapter, label the documents, and perform training. You can simultaneously maintain multiple versions of an adapter for use in your development pipelines. To update your adapters seamlessly, do not make changes to or delete your Amazon Simple Storage Service (Amazon S3) bucket where the files needed for adapter generation are saved.

Best practices

When using Custom Queries on your documents, refer to Best practices for Amazon Textract Custom Queries for additional considerations and best practices.

Benefits of Custom Queries

Custom Queries offers the following benefits:

  • Enhanced document understanding – Through its ability to extract and normalize data with high accuracy, Custom Queries reduces reliance on manual reviews and audits, and enables you to build more reliable automation for your intelligent document processing workflows.
  • Faster time to value – When you encounter new document types where you need higher accuracy, you can use Custom Queries to generate an adapter in a self-service manner within a few hours. You don’t have to wait for a pre-trained model update when you encounter new document types or variations of existing ones in your workflow. You have complete control over your pipeline and don’t need to depend on Amazon Textract to support your new document types.
  • Data privacy – Custom Queries does not retain or use the data employed in generating adapters to enhance our general pretrained models available to all customers. The adapter is limited to the customer’s account or other accounts explicitly designated by the customer, ensuring that only such accounts can access the improvements made using the customer’s data.
  • Convenience – Custom Queries provides a fully managed inference experience similar to Queries. Adapter training is free, and you pay only for inference. Custom Queries saves you the overhead and expenses of training and operating custom models.

Conclusion

In this post, we discussed the benefits of Custom Queries, showed how Custom Queries can accurately extract data from checks, and shared best practices for effectively utilizing this feature. In just a few hours, you can create an adapter using the console and use it in the AnalyzeDocument API for your data extraction needs. For more information, refer to Custom Queries.


About the authors

Shibin Michaelraj is a Sr. Product Manager with the Amazon Textract team. He is focused on building AI/ML-based products for AWS customers. He is excited about helping customers solve their complex business challenges by leveraging AI and ML technologies. In his spare time, he enjoys running, tuning into podcasts, and refining his amateur tennis skills.

Keith Mascarenhas is a Sr. Solutions Architect with the Amazon Textract service team. He is passionate about solving business problems at scale using machine learning, and currently helps our worldwide customers automate their document processing to achieve faster time to market with reduced operational costs.

Stream large language model responses in Amazon SageMaker JumpStart

We are excited to announce that Amazon SageMaker JumpStart can now stream large language model (LLM) inference responses. Token streaming allows you to see the model response output as it is being generated instead of waiting for LLMs to finish the response generation before it is made available for you to use or display. The streaming capability in SageMaker JumpStart can help you build applications with better user experience by creating a perception of low latency to the end-user.

In this post, we walk through how to deploy and stream the response from a Falcon 7B Instruct model endpoint.

At the time of this writing, the following LLMs available in SageMaker JumpStart support streaming:

  • Mistral AI 7B, Mistral AI 7B Instruct
  • Falcon 180B, Falcon 180B Chat
  • Falcon 40B, Falcon 40B Instruct
  • Falcon 7B, Falcon 7B Instruct
  • Rinna Japanese GPT NeoX 4B Instruction PPO
  • Rinna Japanese GPT NeoX 3.6B Instruction PPO

To check for updates on the list of models supporting streaming in SageMaker JumpStart, search for “huggingface-llm” in the Built-in Algorithms with pre-trained Model Table.

Note that you can use the streaming feature of Amazon SageMaker hosting out of the box for any model deployed using the SageMaker TGI Deep Learning Container (DLC) as described in Announcing the launch of new Hugging Face LLM Inference containers on Amazon SageMaker.

Foundation models in SageMaker

SageMaker JumpStart provides access to a range of models from popular model hubs, including Hugging Face, PyTorch Hub, and TensorFlow Hub, which you can use within your ML development workflow in SageMaker. Recent advances in ML have given rise to a new class of models known as foundation models, which are typically trained on billions of parameters and can be adapted to a wide category of use cases, such as text summarization, generating digital art, and language translation. Because these models are expensive to train, customers want to use existing pre-trained foundation models and fine-tune them as needed, rather than train these models themselves. SageMaker provides a curated list of models that you can choose from on the SageMaker console.

You can now find foundation models from different model providers within SageMaker JumpStart, enabling you to get started with foundation models quickly. SageMaker JumpStart offers foundation models based on different tasks or model providers, and you can easily review model characteristics and usage terms. You can also try these models using a test UI widget. When you want to use a foundation model at scale, you can do so without leaving SageMaker by using prebuilt notebooks from model providers. Because the models are hosted and deployed on AWS, you can trust that your data, whether used for evaluating the model or using it at scale, won’t be shared with third parties.

Token streaming

Token streaming allows the inference response to be returned as it’s being generated by the model. This way, you can see the response generated incrementally rather than wait for the model to finish before providing the complete response. Streaming can help enable a better user experience because it decreases the latency perception for the end-user. You can start seeing the output as it’s generated and therefore can stop generation early if the output isn’t looking useful for your purposes. Streaming can make a big difference, especially for long-running queries, because you can start seeing outputs as they’re generated, which can create a perception of lower latency even though the end-to-end latency stays the same.

As of this writing, you can use streaming in SageMaker JumpStart for models that utilize Hugging Face LLM Text Generation Inference DLC.

The following examples compare a response without streaming and a response with streaming.

Solution overview

For this post, we use the Falcon 7B Instruct model to showcase the SageMaker JumpStart streaming capability.

You can use the following code to find other models in SageMaker JumpStart that support streaming:

from sagemaker.jumpstart.notebook_utils import list_jumpstart_models
from sagemaker.jumpstart.filters import And

filter_value = And("task == llm", "framework == huggingface")
model_ids = list_jumpstart_models(filter=filter_value)
print(model_ids)

We get the following model IDs that support streaming:

['huggingface-llm-bilingual-rinna-4b-instruction-ppo-bf16', 'huggingface-llm-falcon-180b-bf16', 'huggingface-llm-falcon-180b-chat-bf16', 'huggingface-llm-falcon-40b-bf16', 'huggingface-llm-falcon-40b-instruct-bf16', 'huggingface-llm-falcon-7b-bf16', 'huggingface-llm-falcon-7b-instruct-bf16', 'huggingface-llm-mistral-7b', 'huggingface-llm-mistral-7b-instruct', 'huggingface-llm-rinna-3-6b-instruction-ppo-bf16']

Prerequisites

Before running the notebook, there are some initial steps required for setup. Run the following commands:

%pip install --upgrade sagemaker --quiet

Deploy the model

As a first step, use SageMaker JumpStart to deploy a Falcon 7B Instruct model. For full instructions, refer to Falcon 180B foundation model from TII is now available via Amazon SageMaker JumpStart. Use the following code:

from sagemaker.jumpstart.model import JumpStartModel

my_model = JumpStartModel(model_id="huggingface-llm-falcon-7b-instruct-bf16")
predictor = my_model.deploy()

Query the endpoint and stream response

Next, construct a payload to invoke your deployed endpoint with. Importantly, the payload should contain the key/value pair "stream": True. This indicates to the text generation inference server to generate a streaming response.

payload = {
    "inputs": "How do I build a website?",
    "parameters": {"max_new_tokens": 256},
    "stream": True
}

Before you query the endpoint, you need to create an iterator that can parse the bytes stream response from the endpoint. Data for each token is provided as a separate line in the response, so this iterator returns a token each time a new line is identified in the streaming buffer. This iterator is minimally designed, and you might want to adjust its behavior for your use case; for example, while this iterator returns token strings, the line data contains other information, such as token log probabilities, that could be of interest.

import io
import json

class TokenIterator:
    def __init__(self, stream):
        self.byte_iterator = iter(stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()
            if line and line[-1] == ord("\n"):
                self.read_pos += len(line)
                full_line = line[:-1].decode("utf-8")
                line_data = json.loads(full_line.lstrip("data:").rstrip("\n"))
                return line_data["token"]["text"]
            chunk = next(self.byte_iterator)
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk["PayloadPart"]["Bytes"])

Now you can use the Boto3 invoke_endpoint_with_response_stream API on the endpoint that you created and enable streaming by iterating over a TokenIterator instance:

import boto3

client = boto3.client("runtime.sagemaker")
response = client.invoke_endpoint_with_response_stream(
    EndpointName=predictor.endpoint_name,
    Body=json.dumps(payload),
    ContentType="application/json",
)

for token in TokenIterator(response["Body"]):
    print(token, end="")

Specifying an empty end parameter to the print function will enable a visual stream without new line characters inserted. This produces the following output:

Building a website can be a complex process, but it generally involves the following steps:

1. Determine the purpose and goals of your website
2. Choose a domain name and hosting provider
3. Design and develop your website using HTML, CSS, and JavaScript
4. Add content to your website and optimize it for search engines
5. Test and troubleshoot your website to ensure it is working properly
6. Maintain and update your website regularly to keep it running smoothly.

There are many resources available online to guide you through these steps, including tutorials and templates. It may also be helpful to seek the advice of a web developer or designer if you are unsure about any of these steps.<|endoftext|>

You can use this code in a notebook or other applications like Streamlit or Gradio to see the streaming in action and the experience it provides for your customers.

Clean up

Finally, remember to clean up your deployed model and endpoint to avoid incurring additional costs:

predictor.delete_model()
predictor.delete_endpoint()

Conclusion

In this post, we showed you how to use the newly launched streaming feature in SageMaker JumpStart. We hope you will use the token streaming capability to build interactive applications that require low latency for a better user experience.


About the authors

Rachna Chadha is a Principal Solutions Architect for AI/ML in Strategic Accounts at AWS. Rachna is an optimist who believes that the ethical and responsible use of AI can improve society in the future and bring economic and social prosperity. In her spare time, Rachna likes spending time with her family, hiking, and listening to music.

Dr. Kyle Ulrich is an Applied Scientist with the Amazon SageMaker built-in algorithms team. His research interests include scalable machine learning algorithms, computer vision, time series, Bayesian non-parametrics, and Gaussian processes. His PhD is from Duke University and he has published papers in NeurIPS, Cell, and Neuron.

Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker built-in algorithms and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.

Bundesliga Match Facts Shot Speed – Who fires the hardest shots in the Bundesliga?

There’s a kind of magic that surrounds a soccer shot so powerful, it leaves spectators, players, and even commentators in a momentary state of awe. Think back to a moment when the sheer force of a strike left an entire Bundesliga stadium buzzing with energy. What exactly captures our imagination with such intensity? While there are many factors that contribute to an iconic goal, there’s a particular magnetism to shots that blaze through the air, especially those taken from a distance.

Over the years, the Bundesliga has witnessed players who’ve become legends, not just for their skill but for their uncanny ability to unleash thunderbolts. Bernd Nickel, a standout figure in Eintracht Frankfurt’s famed squads of the 1970s and 1980s, earned the title “Dr. Hammer” from ardent fans. Over his illustrious career, he netted 141 times in 426 matches.

Beyond his shooting prowess, another feat of Nickel’s that stands out is his ability to directly score from corner kicks. In fact, he holds the unique distinction of having scored from all four corner positions at Frankfurt’s Waldstadion. An example was witnessed by Frankfurt’s fans in May 1971, during a high-stakes game against Kickers Offenbach when he unveiled a masterclass.

Nickel scored a stunning goal in the 17th minute, which eventually led Frankfurt to a 2:0 victory. What made this goal even more memorable was the manner in which it was executed—a spectacular sideways scissors-kick from the penalty spot, fitting perfectly into the top corner. This goal would later be recognized as the “Goal of the Month” for May 1971. Nickel’s impact on the field was undeniable, and during the time he represented Eintracht Frankfurt, the club won the DFB-Pokal three times (in 1974, 1975, and 1981) and the UEFA Cup once in 1980.

Similarly, Thomas “the Hammer” Hitzlsperger has etched his name into Bundesliga folklore with his stunning left-footed rockets. His 2009 strike against Leverkusen at a speed of 125 km/h is one that is vividly remembered because the sheer velocity of Hitzlsperger’s free-kick was enough to leave Germany’s number one goalkeeper, René Adler, seemingly petrified.

Struck during the fifty-first minute of the game from a distance of 18 meters, the ball soared past Adler, leaving him motionless, and bulged the net, making the score 2:0. This remarkable goal not only showcased Hitzlsperger’s striking ability but also demonstrated the awe-inspiring power that such high-velocity goals can have on a match.

Historical data has shown us a few instances where the ball’s velocity exceeded the 130 km/h mark in Bundesliga, with the all-time record being a jaw-dropping shot at 137 km/h by Bayern’s Roy Makaay.

With all this in mind, it becomes even clearer why the speed and technique behind every shot matters immensely. Although high shooting speed excites soccer fans, it has not been measured regularly in the Bundesliga until now. Recognizing this, we are excited to introduce the new Bundesliga Match Facts: Shot Speed. This new metric aims to shed light on the velocity behind these incredible goals, enhancing our understanding and appreciation of the game even further.

How it works

Have you ever wondered just how fast a shot from your favorite Bundesliga player travels? The newly introduced Bundesliga Match Facts (BMF) Shot Speed now allows fans to satisfy their curiosity by providing insights into the incredible power and speed behind shots. Shot speed is more than just a number; it’s a window into the awe-inspiring athleticism and skill of the Bundesliga players.

Shot speed has a captivating effect on fans, igniting debates about which player possesses the most potent shot in the league and who consistently delivers lightning-fast strikes. Shot speed data is the key to resolving these questions.

Besides that, the new BMF helps to highlight memorable moments. The fastest shots often result in spectacular goals that live long in the memory of fans. Shot speed helps immortalize these moments, allowing fans to relive the magic of those lightning-fast strikes.

But how does this work? Let’s delve into the details.

Data collection process

A foundation of shot speed calculation lies in an organized data collection process. This process comprises two key components: event data and optical tracking data.

Event data collection entails gathering the fundamental building blocks of the game. Shots, goals, assists, fouls, and substitutions provide vital context for understanding what happens on the pitch. In our specific case, we focus on shots, their variations, and the players responsible for them.

On the flip side, optical tracking data is collected using advanced camera systems. These systems record player movements and ball positions, offering a high level of precision. This data serves as the bedrock for comprehensive analysis of player performance, tactical intricacies, and team strategies. When it comes to calculating shot speed, this data is essential for tracking the velocity of the ball.

These two streams of data originate from distinct sources, and their synchronization in time is not guaranteed. For the precision needed in shot speed calculations, we must ensure that the ball’s position aligns precisely with the moment of the event. This eliminates any discrepancies that might arise from the manual collection of event data. To achieve this, our process uses a synchronization algorithm that is trained on a labeled dataset. This algorithm robustly associates each shot with its corresponding tracking data.
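
As a simplified illustration of this alignment, the following sketch matches a manually recorded event timestamp to the closest 25 Hz tracking frame. The data structures and tolerance value are assumptions for illustration, not the production algorithm, which is trained on a labeled dataset.

import numpy as np

TRACKING_HZ = 25  # optical tracking sampling rate

def align_event_to_frame(event_time_s, frame_times_s, max_offset_s=0.4):
    """Return the index of the tracking frame closest to an event timestamp.

    event_time_s: manually recorded event time (seconds)
    frame_times_s: sorted array of tracking frame timestamps (seconds)
    max_offset_s: reject matches further away than this tolerance
    """
    frame_times_s = np.asarray(frame_times_s)
    idx = int(np.argmin(np.abs(frame_times_s - event_time_s)))
    if abs(frame_times_s[idx] - event_time_s) > max_offset_s:
        return None  # no reliable match; such a shot would be skipped
    return idx

# Example: frames every 1/25 s, shot event logged at t = 12.31 s
frames = np.arange(0, 60, 1 / TRACKING_HZ)
print(align_event_to_frame(12.31, frames))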

Shot speed calculation

The heart of determining shot speed lies in a precise timestamp given by our synchronization algorithm. Imagine a player getting ready to take a shot. Our event gatherers are ready to record the moment, and cameras closely track the ball’s movement. The magic happens exactly when the player decides to pull the trigger.

An accurate timestamp measurement helps us figure out how fast the shot was right from the start. We measure shot speed for shots that end up as goals, those that hit the post, or get saved. To make sure we’re consistent, we don’t include headers or shots that get blocked. These can get a bit tricky due to deflections.

Let’s break down how we transform the collected data into the shot speed you see:

  1. Extracting shot trajectory – After recording the event and tracking the ball’s movement, we extract the trajectory of the shot. This means we map out the path the ball takes from the moment it leaves the player’s foot.
  2. Smoothing velocity curve – The data we get is detailed but can sometimes have tiny variations due to factors like camera sensitivity. To ensure accuracy, we smooth out the velocity curve. This means we remove any minor bumps or irregularities in the data to get a more reliable speed measurement.
  3. Calculating maximum speed – With a clean velocity curve in hand, we then calculate the maximum speed the ball reaches during its flight. This is the key number that represents the shot’s speed and power.
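
To make these three steps concrete, the following minimal sketch derives a velocity curve from 25 Hz ball positions, smooths it with a short moving average, and takes the maximum as the shot speed. It is an assumption-based illustration on simplified data, not the production implementation.

import numpy as np

TRACKING_HZ = 25  # ball positions sampled 25 times per second

def shot_speed_kmh(ball_positions_m, window=3):
    """Estimate shot speed from a shot trajectory.

    ball_positions_m: array of shape (n, 2) or (n, 3) with ball coordinates
                      in meters for the frames following the shot event.
    """
    positions = np.asarray(ball_positions_m, dtype=float)

    # 1. Extract the shot trajectory: frame-to-frame displacement in meters
    displacements = np.linalg.norm(np.diff(positions, axis=0), axis=1)
    velocity_ms = displacements * TRACKING_HZ  # meters per second

    # 2. Smooth the velocity curve with a simple moving average
    kernel = np.ones(window) / window
    smoothed = np.convolve(velocity_ms, kernel, mode="valid")

    # 3. Maximum speed during flight, converted to km/h
    return float(smoothed.max() * 3.6)

# Example: a ball covering roughly 1.2 m per frame (~30 m/s) for 10 frames
steps = np.tile(np.array([1.2, 0.0]), (10, 1))
trajectory = np.cumsum(steps, axis=0)
print(round(shot_speed_kmh(trajectory), 1))  # ~108 km/h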

We analyzed around 215 matches from the Bundesliga 2022–2023 season. The following plot shows the number of fast shots (>100 km/h) by player. The 263 players with at least one fast shot (>100 km/h) have, on average, 3.47 fast shots. As the graph shows, some players have a frequency way above average, with around 20 fast shots.

Let’s look at some examples from the current season (2023–2024)

The following videos show examples of measured shots that achieved top-speed values.

Example 1

Measured with top shot speed 118.43 km/h with a distance to goal of 20.61 m

Example 2

Measured with top shot speed 123.32 km/h with a distance to goal of 21.19 m

Example 3

Measured with top shot speed 121.22 km/h with a distance to goal of 25.44 m

Example 4

Measured with top shot speed 113.14 km/h with a distance to goal of 24.46 m

How it’s implemented

In our quest to accurately determine shot speed during live matches, we’ve implemented a cutting-edge solution using Amazon Managed Streaming for Apache Kafka (Amazon MSK). This robust platform serves as the backbone for seamlessly streaming positional data at a rapid 25 Hz sampling rate, enabling real-time updates of shot speed. Through Amazon MSK, we’ve established a centralized hub for data streaming and messaging, facilitating seamless communication between containers for sharing a wealth of Bundesliga Match Facts.

The following diagram outlines the entire workflow for measuring shot speed from start to finish.

Match-related data is gathered and brought into the system via DFL’s DataHub. To process match metadata, we use an AWS Lambda function called MetaDataIngestion, while positional data is brought in using an AWS Fargate container known as MatchLink. These Lambda functions and Fargate containers then make this data available for further use in the appropriate MSK topics.

At the heart of the BMF Shot Speed lies a dedicated Fargate container named BMF ShotSpeed. This container is active throughout the duration of the match, continuously pulling in all the necessary data from Amazon MSK. Its algorithm responds instantly to every shot taken during the game, calculating the shot speed in real time. Moreover, we have the capability to recompute shot speed should any underlying data undergo updates.

Once the shot speeds have undergone their calculations, the next phase in our data journey is the distribution. The shot speed metrics are transmitted back to DataHub, where they are made available to various consumers of Bundesliga Match Facts.

Simultaneously, the shot speed data finds its way to a designated topic within our MSK cluster. This allows other components of Bundesliga Match Facts to access and take advantage of this metric. We’ve implemented an AWS Lambda function with the specific task of retrieving the calculated shot speed from the relevant Kafka topic. Once the Lambda function is triggered, it stores the data in an Amazon Aurora Serverless database. This database houses the shot speed data, which we then use to create interactive, near real-time visualizations using Amazon QuickSight.
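
As a rough illustration, a Lambda handler for this last step could look like the following sketch. The event shape follows the Amazon MSK trigger format; the environment variables, table name, SQL statement, and message fields are assumptions for illustration rather than the production code.

import base64
import json
import os

import boto3

rds_data = boto3.client("rds-data")

# Placeholders supplied through environment variables
CLUSTER_ARN = os.environ["AURORA_CLUSTER_ARN"]
SECRET_ARN = os.environ["AURORA_SECRET_ARN"]
DATABASE = os.environ.get("DATABASE_NAME", "matchfacts")

def handler(event, context):
    """Triggered by an Amazon MSK event source mapping."""
    for records in event.get("records", {}).values():
        for record in records:
            # MSK delivers each Kafka message value base64-encoded
            message = json.loads(base64.b64decode(record["value"]))
            rds_data.execute_statement(
                resourceArn=CLUSTER_ARN,
                secretArn=SECRET_ARN,
                database=DATABASE,
                sql=(
                    "INSERT INTO shot_speeds (match_id, player_id, speed_kmh) "
                    "VALUES (:match_id, :player_id, :speed)"
                ),
                parameters=[
                    {"name": "match_id", "value": {"stringValue": message["matchId"]}},
                    {"name": "player_id", "value": {"stringValue": message["playerId"]}},
                    {"name": "speed", "value": {"doubleValue": float(message["speedKmh"])}},
                ],
            )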

Beyond this, we have a dedicated component specifically designed to calculate a seasonal ranking of shot speeds. This allows us to keep track of the fastest shots throughout the season, ensuring that we always have up-to-date information about the fastest shots and their respective rankings after each shot is taken.

Summary

In this blog post, we’re excited to introduce the all-new Bundesliga Match Facts: Shot Speed, a metric that allows us to quantify and objectively compare the velocity of shots taken by different Bundesliga players. This statistic will provide commentators and fans with valuable insights into the power and speed of shots on goal.

The development of the Bundesliga Match Facts is the result of extensive analysis conducted by a collaborative team of soccer experts and data scientists from the Bundesliga and AWS. Notable shot speeds will be displayed in real time on the live ticker during matches, accessible through the official Bundesliga app and website. Additionally, this data will be made readily available to commentators via the Data Story Finder and visually presented to fans at key moments during broadcasts.

We’re confident that the introduction of this brand-new Bundesliga Match Fact will enhance your understanding of the game and add a new dimension to your viewing experience. To delve deeper into the partnership between AWS and Bundesliga, please visit Bundesliga on AWS!

We’re eagerly looking forward to the insights you uncover with this new Shot Speed metric. Share your findings with us on X: @AWScloud, using the hashtag #BundesligaMatchFacts.


About the Authors

Tareq Haschemi is a consultant within AWS Professional Services. His skills and areas of expertise include application development, data science, and machine learning (ML). He supports customers in developing data-driven applications within the AWS Cloud. Prior to joining AWS, he was also a consultant in various industries, such as aviation and telecommunications. He is passionate about enabling customers on their data and artificial intelligence (AI) journey to the cloud.

Jean-Michel Lourier is a Senior Data Scientist within AWS Professional Services. He leads teams implementing data-driven applications side-by-side with AWS customers to generate business value out of their data. He’s passionate about diving into tech and learning about AI, ML, and their business applications. He is also an enthusiastic cyclist, taking long bike-packing trips.

Luc Eluère is a Data Scientist within Sportec Solutions AG. His mission is to develop and provide valuable KPIs to the soccer industry. At university, he learned the statistical theory with a goal: to apply its concepts to the beautiful game. Even though he was promised a nice career in table soccer, his passion for data science took over, and he chose computers as a career path.

Javier Poveda-Panter is a Senior Data and Machine Learning Engineer for EMEA sports customers within the AWS Professional Services team. He enables customers in the area of spectator sports to innovate and capitalize on their data, delivering high-quality user and fan experiences through ML, data science, and analytics. He follows his passion for a broad range of sports, music, and AI in his spare time.

Deploy ML models built in Amazon SageMaker Canvas to Amazon SageMaker real-time endpoints

Amazon SageMaker Canvas now supports deploying machine learning (ML) models to real-time inferencing endpoints, allowing you to take your ML models to production and drive action based on ML-powered insights. SageMaker Canvas is a no-code workspace that enables analysts and citizen data scientists to generate accurate ML predictions for their business needs.

Until now, SageMaker Canvas provided the ability to evaluate an ML model, generate bulk predictions, and run what-if analyses within its interactive workspace. But now you can also deploy the models to Amazon SageMaker endpoints for real-time inferencing, making it effortless to consume model predictions and drive actions outside the SageMaker Canvas workspace. Having the ability to directly deploy ML models from SageMaker Canvas eliminates the need to manually export, configure, test, and deploy ML models into production, thereby reducing complexity and saving time. It also makes operationalizing ML models more accessible to individuals, without the need to write code.

In this post, we walk you through the process to deploy a model in SageMaker Canvas to a real-time endpoint.

Overview of solution

For our use case, we are assuming the role of a business user in the marketing department of a mobile phone operator, and we have successfully created an ML model in SageMaker Canvas to identify customers with the potential risk of churn. Thanks to the predictions generated by our model, we now want to move this from our development environment to production. To streamline the process of deploying our model endpoint for inference, we directly deploy ML models from SageMaker Canvas, thereby eliminating the need to manually export, configure, test, and deploy ML models into production. This helps reduce complexity, saves time, and also makes operationalizing ML models more accessible to individuals, without the need to write code.

The workflow steps are as follows:

  1. Upload a new dataset with the current customer population into SageMaker Canvas. For the full list of supported data sources, refer to Import data into Canvas.
  2. Build ML models and analyze their performance metrics. For instructions, refer to Build a custom model and Evaluate Your Model’s Performance in Amazon SageMaker Canvas.
  3. Deploy the approved model version as an endpoint for real-time inferencing.

You can perform these steps in SageMaker Canvas without writing a single line of code.

Prerequisites

For this walkthrough, make sure that the following prerequisites are met:

  1. To deploy model versions to SageMaker endpoints, the SageMaker Canvas admin must give the necessary permissions to the SageMaker Canvas user, which you can manage in the SageMaker domain that hosts your SageMaker Canvas application. For more information, refer to Permissions Management in Canvas.
  2. Implement the prerequisites mentioned in Predict customer churn with no-code machine learning using Amazon SageMaker Canvas.

You should now have three model versions trained on historical churn prediction data in Canvas:

  • V1 trained with all 21 features and quick build configuration with a model score of 96.903%
  • V2 trained with 19 features (removed the phone and state features) and quick build configuration, with an improved model score of 97.403%
  • V3 trained with standard build configuration with 97.103% model score

Use the customer churn prediction model

Enable Show advanced metrics on the model details page and review the objective metrics associated with each model version so that you can select the best-performing model for deploying to SageMaker as an endpoint.

Based on the performance metrics, we select version 2 to be deployed.

Configure the model deployment settings—deployment name, instance type, and instance count.

As a starting point, Canvas will automatically recommend the best instance type and the number of instances for your model deployment. You can change it as per your workload needs.

You can test the deployed SageMaker inference endpoint directly from within SageMaker Canvas.

You can change input values using the SageMaker Canvas user interface to infer additional churn prediction.

Now let’s navigate to Amazon SageMaker Studio and check out the deployed endpoint.

Open a notebook in SageMaker Studio and run the following code to infer the deployed model endpoint. Replace the model endpoint name with your own model endpoint name.

import boto3, sys
import pandas as pd

endpoint_name = "canvas-customer-churn-prediction-model"
sm_rt = boto3.Session().client('runtime.sagemaker')

payload = [['PA', 163, 806, '403-2562', 'no', 'yes', 300, 8.16, 3, 7.57, 3.93, 4, 6.5, 4.07, 100, 5.11, 4.92, 6, 5.67, 3]]
body = pd.DataFrame(payload).to_csv(header=False, index=False).encode("utf-8")

response = sm_rt.invoke_endpoint(EndpointName=endpoint_name, Body=body, ContentType="text/csv",Accept="application/json")

response = response['Body'].read().decode("utf-8")
print(response)

Our original model endpoint is using an ml.m5.xlarge instance and 1 instance count. Now, let’s assume you expect the number of end-users inferencing your model endpoint will increase and you want to provision more compute capacity. You can accomplish this directly from within SageMaker Canvas by choosing Update configuration.
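
If you prefer to make the same change programmatically (for example, from a SageMaker Studio notebook), the following is a hedged sketch using the SageMaker API; the model name, endpoint configuration name, and instance settings are illustrative assumptions based on our example.

import boto3

sm = boto3.client("sagemaker")

# Placeholders: the endpoint deployed from Canvas and a new configuration name
endpoint_name = "canvas-customer-churn-prediction-model"
new_config_name = "canvas-churn-config-2-instances"

# Create a new endpoint configuration with more capacity ...
sm.create_endpoint_config(
    EndpointConfigName=new_config_name,
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "<canvas-model-name>",  # the model created by Canvas
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 2,
        }
    ],
)

# ... then point the live endpoint at it
sm.update_endpoint(EndpointName=endpoint_name, EndpointConfigName=new_config_name)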

Clean up

To avoid incurring future charges, delete the resources you created while following this post. This includes logging out of SageMaker Canvas and deleting the deployed SageMaker endpoint. SageMaker Canvas bills you for the duration of the session, and we recommend logging out of SageMaker Canvas when you’re not using it. Refer to Logging out of Amazon SageMaker Canvas for more details.

Conclusion

In this post, we discussed how SageMaker Canvas can deploy ML models to real-time inferencing endpoints, allowing you to take your ML models to production and drive action based on ML-powered insights. In our example, we showed how an analyst can quickly build a highly accurate predictive ML model without writing any code, deploy it on SageMaker as an endpoint, and test the model endpoint from SageMaker Canvas, as well as from a SageMaker Studio notebook.

To start your low-code/no-code ML journey, refer to Amazon SageMaker Canvas.

Special thanks to everyone who contributed to the launch: Prashanth Kurumaddali, Abishek Kumar, Allen Liu, Sean Lester, Richa Sundrani, and Alicia Qi.


About the Authors

Janisha Anand is a Senior Product Manager in the Amazon SageMaker Low/No Code ML team, which includes SageMaker Canvas and SageMaker Autopilot. She enjoys coffee, staying active, and spending time with her family.

Indy Sawhney is a Senior Customer Solutions Leader with Amazon Web Services. Always working backward from customer problems, Indy advises AWS enterprise customer executives through their unique cloud transformation journey. He has over 25 years of experience helping enterprise organizations adopt emerging technologies and business solutions. Indy is an area-of-depth specialist in AWS’s Technical Field Community for AI/ML, specializing in generative AI and low-code/no-code Amazon SageMaker solutions.

Develop generative AI applications to improve teaching and learning experiences

Recently, teachers and institutions have looked for different ways to incorporate artificial intelligence (AI) into their curriculums, whether it be teaching about machine learning (ML) or incorporating it into creating lesson plans, grading, or other educational applications. Generative AI models, in particular large language models (LLMs), have dramatically sped up AI’s impact on education. Generative AI and natural language processing (NLP) models have great potential to enhance teaching and learning by generating personalized learning content and providing engaging learning experiences for students.

In this post, we create a generative AI solution for teachers to create course materials and for students to learn English words and sentences. When students provide answers, the solution provides real-time assessments and offers personalized feedback and guidance for students to improve their answers.

Specifically, teachers can use the solution to do the following:

  • Create an assignment for students by generating questions and answers from a prompt
  • Create an image from the prompt to represent the assignment
  • Save the new assignment to a database
  • Browse existing assignments from the database

Students can use the solution to do the following:

  • Select and review an assignment from the assignment database
  • Answer the questions of the selected assignment
  • Check the grading scores of the answers in real time
  • Review the suggested grammatical improvements to their answers
  • Review the suggested sentence improvements to their answers
  • Read the recommended answers

We walk you through the steps of creating the solution using Amazon Bedrock, Amazon Elastic Container Service (Amazon ECS), Amazon CloudFront, Elastic Load Balancing (ELB), Amazon DynamoDB, Amazon Simple Storage Service (Amazon S3), and AWS Cloud Development Kit (AWS CDK).

Solution overview

The following diagram shows the resources and services used in the solution.

The solution runs as a scalable service. Teachers and students use their browsers to access the application. The content is served through an Amazon CloudFront distribution with an Application Load Balancer as its origin. It saves the generated images to an S3 bucket, and saves the teacher’s assignments and the students’ answers and scores to separate DynamoDB tables.

The solution uses Amazon Bedrock to generate questions, answers, and assignment images, and to grade students’ answers. Amazon Bedrock is a fully managed service that makes foundation models from leading AI startups and Amazon available via easy-to-use API interfaces. The solution also uses the grammatical error correction API and the paraphrase API from AI21 to recommend word and sentence corrections.

You can find the implementation details in the following sections. The source code is available in the GitHub repository.

Prerequisites

You should have some knowledge of generative AI, ML, and the services used in this solution, including Amazon Bedrock, Amazon ECS, Amazon CloudFront, Elastic Load Balancing, Amazon DynamoDB, and Amazon S3.

We use AWS CDK to build and deploy the solution. You can find the setup instructions in the readme file.

Create assignments

Teachers can create an assignment from an input text using the following GUI page. An assignment comprises an input text, the questions and answers generated from the text, and an image generated from the input text to represent the assignment.

For our example, a teacher inputs the Kids and Bicycle Safety guidelines from the United States Department of Transportation. For the input text, we use the file bike.safe.riding.tips.txt.

The following is the generated image output.

The following are the generated questions and answers:

"question": "What should you always wear when riding a bicycle?",
"answer": "You should always wear a properly fitted bicycle helmet when riding a bicycle. A helmet protects your brain and can save your life in a crash."

"question": "How can you make sure drivers can see you when you are bicycling?",
"answer": "To make sure drivers can see you, wear bright neon or fluorescent colors. Also use reflective tape, markings or flashing lights so you are visible."

"question": "What should you do before riding your bicycle?",
"answer": "Before riding, you should inspect your bicycle to make sure all parts are secure and working properly. Check that tires are inflated, brakes work properly, and reflectors are in place."

"question": "Why is it more dangerous to ride a bicycle at night?",
"answer": "It is more dangerous to ride at night because it is harder for other people in vehicles to see you in the dark."

"question": "How can you avoid hazards while bicycling?",
"answer": "Look ahead for hazards like potholes, broken glass, and dogs. Point out and yell about hazards to bicyclists behind you. Avoid riding at night when it is harder to see hazards."

The teacher expects the students to complete the assignment by reading the input text and then answering the generated questions.

The portal uses Amazon Bedrock to create questions, answers, and images. Amazon Bedrock speeds up the development of generative AI solutions by exposing the foundation models through API interfaces. You can find the source code in the file 1_Create_Assignments.py.

The portal invokes two foundation models:

  • Stable Diffusion XL to generate images using the function query_generate_image_endpoint
  • Anthropic Claude v2 to generate questions and answers using the function query_generate_questions_answers_endpoint
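
As a simplified illustration of the second call, the following sketch invokes Claude v2 through the Amazon Bedrock runtime to generate question-and-answer pairs. The prompt wording, parameters, and response handling are assumptions for illustration and differ from the portal’s query_generate_questions_answers_endpoint implementation.

import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def generate_questions_answers(input_text, num_questions=5):
    """Ask Claude v2 on Amazon Bedrock to produce Q&A pairs for an assignment."""
    prompt = (
        "\n\nHuman: Read the following text and write "
        f"{num_questions} question-and-answer pairs about it as a JSON list "
        'of objects with "question" and "answer" keys.\n\n'
        f"{input_text}\n\nAssistant:"
    )
    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-v2",
        contentType="application/json",
        accept="application/json",
        body=json.dumps({"prompt": prompt, "max_tokens_to_sample": 2000, "temperature": 0.2}),
    )
    completion = json.loads(response["body"].read())["completion"]
    return completion  # parse and validate the JSON list in your application

print(generate_questions_answers("Always wear a properly fitted bicycle helmet when riding."))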

The portal saves generated images to an S3 bucket using the function load_file_to_s3. It creates an assignment based on the input text, the teacher ID, the generated questions and answers, and the S3 bucket link for the loaded image. It saves the assignment to the DynamoDB table assignments using the function insert_record_to_dynamodb.

You can find the AWS CDK code that creates the DynamoDB table in the file cdk_stack.py.

Show assignments

Teachers can browse assignments and the generated artifacts using the following GUI page.

The portal uses the function get_records_from_dynamodb to retrieve the assignments from the DynamoDB table assignments. It uses the function download_image to download an image from the S3 bucket. You can find the source code in the file 2_Show_Assignments.py.

Answer questions

A student selects and reads a teacher’s assignment and then answers the questions of the assignment.

The portal provides an engaging learning experience. For example, when the student provides the answer “I should waer hat protect brain in crash” the portal grades the answer in real time by comparing the answer with the correct answer. The portal also ranks all students’ answers to the same question and shows the top three scores. You can find the source code in the file 3_Complete_Assignments.py.

The portal saves the student’s answers to a DynamoDB table called answers. You can find the AWS CDK code that creates the DynamoDB table in the file cdk_stack.py.

To grade a student’s answer, the portal invokes the Amazon Titan Embeddings model to translate the student’s answer and the correct answer into numerical representations and then compute their similarity as a score. You can find the solution in the file 3_Complete_Assignments.py.
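
The following is a minimal sketch of this scoring approach, assuming the amazon.titan-embed-text-v1 model ID on Amazon Bedrock and plain cosine similarity; the code in 3_Complete_Assignments.py may differ in its details.

import json

import boto3
import numpy as np

bedrock_runtime = boto3.client("bedrock-runtime")

def embed(text):
    """Get a Titan Embeddings vector for a piece of text."""
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        contentType="application/json",
        accept="application/json",
        body=json.dumps({"inputText": text}),
    )
    return np.array(json.loads(response["body"].read())["embedding"])

def grade_answer(student_answer, correct_answer):
    """Score a student's answer by cosine similarity with the correct answer."""
    a, b = embed(student_answer), embed(correct_answer)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

score = grade_answer(
    "I should waer hat protect brain in crash",
    "You should always wear a properly fitted bicycle helmet when riding a bicycle.",
)
print(round(score, 2))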

The portal generates suggested grammatical corrections and sentence improvements for the student’s answer. Finally, the portal shows the correct answer to the question.

The portal uses the grammatical error correction API and the paraphrase API from AI21 to generate the recommended grammatical and sentence improvements. The AI21 paraphrase model is available as a foundation model in SageMaker. You can deploy the AI21 paraphrase model as an inference point in SageMaker and invoke the inference point to generate sentence improvements.

The functions generate_suggestions_sentence_improvements and generate_suggestions_word_improvements in the file 3_Complete_Assignments.py show an alternative way of using the AI21 REST API endpoints. You need to create an AI21 account and find the API key associated with your account to invoke the APIs. You will have to pay for the invocations after the trial period.
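
The following sketch shows one way to call these REST endpoints with the requests library. The endpoint paths, request fields, and response keys are assumptions based on AI21’s public Studio API documentation at the time of writing and should be verified against your AI21 account; the API key is read from an environment variable.

import os

import requests

AI21_API_KEY = os.environ["AI21_API_KEY"]  # from your AI21 account
HEADERS = {"Authorization": f"Bearer {AI21_API_KEY}"}

def suggest_grammar_corrections(text):
    """Call the AI21 grammatical error correction (GEC) endpoint."""
    response = requests.post(
        "https://api.ai21.com/studio/v1/gec",
        headers=HEADERS,
        json={"text": text},
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("corrections", [])

def suggest_paraphrases(text):
    """Call the AI21 paraphrase endpoint for sentence improvements."""
    response = requests.post(
        "https://api.ai21.com/studio/v1/paraphrase",
        headers=HEADERS,
        json={"text": text, "style": "general"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("suggestions", [])

print(suggest_grammar_corrections("I should waer hat protect brain in crash"))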

Conclusion

This post showed you how to use an AI-assisted solution to improve the teaching and learning experience by using multiple generative AI and NLP models. You can use the same approach to develop other generative AI prototypes and applications.

If you’re interested in the fundamentals of generative AI and how to work with foundation models, including advanced prompting techniques, check out the hands-on course Generative AI with LLMs. It’s an on-demand, 3-week course for data scientists and engineers who want to learn how to build generative AI applications with LLMs. It’s a good foundation to start building with Amazon Bedrock. Visit the Amazon Bedrock Features page and sign up to learn more about Amazon Bedrock.


About the Authors

Jeff Li is a Senior Cloud Application Architect with the Professional Services team at AWS. He is passionate about diving deep with customers to create solutions and modernize applications that support business innovations. In his spare time, he enjoys playing tennis, listening to music, and reading.

Isaac Privitera is a Senior Data Scientist at the Generative AI Innovation Center, where he develops bespoke generative AI based solutions to address customers’ business problems. He works primarily on building responsible AI systems using retrieval augmented generation (RAG) and chain of thought reasoning. In his spare time he enjoys golf, football, and walking with his dog Barry.

Harish Vaswani is a Principal Cloud Application Architect at Amazon Web Services. He specializes in architecting and building cloud native applications and enables customers with best practices in their cloud transformation journey. Outside of work, Harish and his wife, Simin, are award-winning independent short film producers and love spending their time with their 5-year old son, Karan.

Dialogue-guided visual language processing with Amazon SageMaker JumpStart

Visual language processing (VLP) is at the forefront of generative AI, driving advancements in multimodal learning that encompasses language intelligence, vision understanding, and processing. Combined with large language models (LLMs) and Contrastive Language-Image Pre-Training (CLIP) trained with a large quantity of multimodal data, visual language models (VLMs) are particularly adept at tasks like image captioning, object detection and segmentation, and visual question answering. Their use cases span various domains, from media entertainment to medical diagnostics and quality assurance in manufacturing.

Key strengths of VLP include the effective utilization of pre-trained VLMs and LLMs, enabling zero-shot or few-shot predictions without necessitating task-specific modifications, and categorizing images from a broad spectrum through casual multi-round dialogues. Augmented by Grounded Segment Anything, VLP exhibits prowess in visual recognition, with object detection and segmentation being particularly notable. The potential exists to fine-tune VLMs and LLMs further using domain-specific data, aiming to boost precision and mitigate hallucination. However, like other nascent technologies, obstacles remain in managing model intricacy, harmonizing diverse modalities, and formulating uniform evaluation metrics.

Courtesy of NOMIC for OBELICS, HuggingFaceM4 for IDEFICS, Charles Bensimon for Gradio and Amazon Polly for TTS

In this post, we explore the technical nuances of VLP prototyping using Amazon SageMaker JumpStart in conjunction with contemporary generative AI models. Through multi-round dialogues, we highlight the capabilities of instruction-oriented zero-shot and few-shot vision language processing, emphasizing its versatility and aiming to capture the interest of the broader multimodal community. The demo implementation code is available in the following GitHub repo.

Solution overview

The proposed VLP solution integrates a suite of state-of-the-art generative AI modules to yield accurate multimodal outputs. Central to the architecture are the fine-tuned VLM and LLM, both instrumental in decoding visual and textual data streams. The TGI framework underpins the model inference layer, providing RESTful APIs for robust integration and effortless accessibility. Supplementing our auditory data processing, the Whisper ASR is also furnished with a RESTful API, enabling streamlined voice-to-text conversions. Addressing complex challenges like image-to-text segmentation, we use the containerized Grounded Segment Anything module, synergizing with the Grounded DINO and Segment Anything Model (SAM) mechanism for text-driven object detection and segmentation. The system is further refined with DistilBERT, optimizing our dialogue-guided multi-class classification process. Orchestrating these components is the LangChain processing pipeline, a sophisticated mechanism proficient in dissecting text or voice inputs, discerning user intentions, and methodically delegating sub-tasks to the relevant services. The synthesis of these operations produces aggregated outputs, delivering pinpoint and context-aware multimodal answers.

The following diagram illustrates the architecture of our dialogue-guided VLP solution.

Text Generation Inference

Text Generation Inference (TGI) is an open-source toolkit developed by Hugging Face for deploying LLMs as well as VLMs for inference. It enables high-performance text generation using tensor parallelism, model parallelism, and dynamic batching, and supports leading open-source LLMs such as Falcon and Llama V2 as well as VLMs like IDEFICS. Utilizing the latest Hugging Face LLM modules on Amazon SageMaker, AWS customers can now tap into the power of SageMaker deep learning containers (DLCs). This allows for the seamless deployment of LLMs from the Hugging Face Hub via pre-built SageMaker DLCs supporting TGI. This inference setup not only offers exceptional performance but also eliminates the heavy lifting of managing GPU infrastructure. Additionally, you benefit from advanced features like auto scaling of inference endpoints, enhanced security, and built-in model monitoring.

TGI offers text generation speeds up to 100 times faster than traditional inference methods and scales efficiently to handle increased requests. Its design ensures compatibility with various LLMs and, being open-source, democratizes advanced features for the tech community. TGI’s versatility extends across domains, enhancing chatbots, improving machine translations, summarizing texts, and generating diverse content, from poetry to code. Therefore, TGI emerges as a comprehensive solution for text generation challenges. TGI is implemented in Python and uses the PyTorch framework. It’s open-source and available on GitHub. It also supports PEFT with QLoRA for faster performance and logits warping to control generated text attributes, such as determining its length and diversity, without modifying the underlying model.

You can build a customized TGI Docker container directly from the following Dockerfile and then push the container image to Amazon Elastic Container Registry (ECR) for inference deployment. See the following code:

%%sh
# Define the Docker image name and the container's Amazon Resource Name on ECR
container_name="tgi1.03"
region=`aws configure get region`
account=`aws sts get-caller-identity --query "Account" --output text`
full_name="${account}.dkr.ecr.${region}.amazonaws.com/${container_name}:latest"

# Get the login command from ECR and execute it directly
aws ecr get-login-password --region ${region} | docker login --username AWS \
    --password-stdin ${account}.dkr.ecr.${region}.amazonaws.com

# Build the TGI docker image locally
docker build . -f Dockerfile -t ${container_name}
docker tag ${container_name} ${full_name}
docker push ${full_name}

LLM inference with TGI

The VLP solution in this post employs the LLM in tandem with LangChain, harnessing the chain-of-thought (CoT) approach for more accurate intent classification. CoT processes queries to discern intent and trigger the associated sub-tasks that meet the query’s goals. Llama-2-7b-chat-hf (license agreement) is the streamlined version of the Llama-2 line, designed for dialogue contexts. The inference of Llama-2-7b-chat-hf is powered by the TGI container image, making it available as an API-enabled service.

For Llama-2-7b-chat-hf inference, a g5.2xlarge (24G VRAM) is recommended to achieve peak performance. For applications necessitating a more robust LLM, the Llama-v2-13b models fit well with a g5.12xlarge (96G VRAM) instance. For the Llama-2-70b models, consider either the GPU [2xlarge] – 2x Nvidia A100 utilizing bitsandbytes quantization or the g5.48xlarge. Notably, employing bitsandbytes quantization can reduce the required inference GPU VRAM by 50%.

You can use SageMaker DLCs with the TGI container image detailed earlier to deploy Llama-2-7b-chat-hf for inference (see the following code). Alternatively, you can stand up a quick local inference for a proof of concept on a g5.2xlarge instance using a Docker container.

import json
from time import gmtime, strftime
from sagemaker.huggingface import get_huggingface_llm_image_uri
from sagemaker.huggingface import HuggingFaceModel
from sagemaker import get_execution_role

# Prerequisite: create a unique model name
model_name = 'Llama-7b-chat-hf' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

# Retrieve the LLM image URI of the SageMaker pre-built DLC for TGI v1.0.3
tgi_image_ecr_uri = get_huggingface_llm_image_uri(
  "huggingface",
  version="1.0.3"
)

# Number of GPUs on the target instance (1 for ml.g5.2xlarge)
number_of_gpu = 1

# Define Model and Endpoint configuration parameters
hf_config = {
  'HF_MODEL_ID': "meta-llama/Llama-2-7b-chat-hf", # Matching model_id on the Hugging Face Hub
  'SM_NUM_GPUS': json.dumps(number_of_gpu),
  'MAX_TOTAL_TOKENS': json.dumps(1024),
  'HF_MODEL_QUANTIZE': "bitsandbytes", # Use quantization to reduce VRAM requirements; comment out if not needed
}

# create HuggingFaceModel with the SageMaker pre-built DLC TGI image uri
sm_llm_model = HuggingFaceModel(
  role=get_execution_role(),
  image_uri=tgi_image_ecr_uri,
  env=hf_config
)

# Deploy the model
llm = sm_llm_model.deploy(
  initial_instance_count=1,
  instance_type="ml.g5.2xlarge",
  container_startup_health_check_timeout=300, # in sec. Allow 5 minutes to be able to load the model
)

# define inference payload
prompt="""<|prompter|>How to select a right LLM for your generative AI project?<|endoftext|><|assistant|>"""

# hyperparameters for llm
payload = {
  "inputs": prompt,
  "parameters": {
    "best_of": 1,
    "decoder_input_details": True,
    "details": True,
    "do_sample": True,
    "max_new_tokens": 20,
    "repetition_penalty": 1.03,
    "return_full_text": False,
    "seed": None,
    "stop": [
      "photographer"
    ],
    "temperature": 0.5,
    "top_k": 10,
    "top_p": 0.95,
    "truncate": None,
    "typical_p": 0.95,
    "watermark": True
  },
  "stream": False
}

# send request to endpoint
response = llm.predict(payload)

Fine-tune and customize your LLM

SageMaker JumpStart offers numerous notebook samples that demonstrate the use of Parameter Efficient Fine Tuning (PEFT), including QLoRA for training and fine-tuning LLMs. QLoRA maintains the pre-trained model weights in a static state and introduces trainable rank decomposition matrices into each layer of the Transformer structure. This method substantially decreases the number of trainable parameters needed for downstream tasks.
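The following is a minimal sketch of how a QLoRA-style setup is typically configured with the Hugging Face transformers and peft libraries; the base model ID and hyperparameters are placeholders, not values prescribed by the JumpStart notebooks.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base_model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder base model

# Load the frozen base model in 4-bit precision (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(base_model_id, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# Attach small trainable rank-decomposition matrices (LoRA adapters) to the attention layers
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable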

Alternatively, you can explore Direct Preference Optimization (DPO), which obviates the necessity for setting up a reward model, drawing samples during fine-tuning from the LLM, or extensive hyperparameter adjustments. Recent research has shown that DPO’s fine-tuning surpasses RLHF in managing sentiment generation and enhances the quality of summaries and single-conversation responses, all while being considerably easier to set up and train. There are three main steps to the DPO training process (refer to the GitHub repo for details):

  1. Perform supervised fine-tuning of a pre-trained base LLM to create a fine-tuned LLM.
  2. Run the DPO trainer using the fine-tuned model to create a reinforcement learning model.
  3. Merge the adaptors from DPO into the base LLM model for text generation inference.

You can deploy the merged model for inference using the TGI container image.
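A minimal sketch of the final merge step (step 3) follows, assuming the DPO adapter was saved with peft; merge_and_unload folds the adapter weights back into the base model so the merged checkpoint can be served by TGI. The paths and model ID are placeholders.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder base model
adapter_path = "./dpo-adapter"                   # placeholder path to the trained DPO adapter
merged_path = "./llama2-7b-dpo-merged"

# Load the base model, attach the DPO adapter, and fold the adapter weights into the base weights
base_model = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype="auto")
merged_model = PeftModel.from_pretrained(base_model, adapter_path).merge_and_unload()

# Save the merged checkpoint; point TGI's --model-id at this directory for inference
merged_model.save_pretrained(merged_path)
AutoTokenizer.from_pretrained(base_model_id).save_pretrained(merged_path)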

Visual language model

Visual language models (VLMs), which combine the vision and language modalities, have shown increasing effectiveness in generalization, leading to various practical use cases with zero-shot or few-shot prompts with instructions. A VLM typically consists of three key elements: an image encoder, a text encoder, and a strategy to fuse information from the two encoders. These key elements are tightly coupled together because the loss functions are designed around both the model architecture and the learning strategy. Many state-of-the-art VLMs use CLIP/ViT (such as OpenCLIP) and LLMs (such as Llama-v1) and are trained on multiple publicly available datasets such as Wikipedia, LAION, and Public Multimodal Dataset.

This demo used a pre-trained IDEFICS-9b-instruct model developed by HuggingFaceM4, a fine-tuned version of IDEFICS-9b, following the training procedure laid out in Flamingo by combining the two pre-trained models (laion/CLIP-ViT-H-14-laion2B-s32B-b79K and huggyllama/llama-7b) with modified Transformer blocks. IDEFICS-9b was trained on the OBELICS, Wikipedia, LAION, and PMD multimodal datasets with a total of 150 billion tokens and 1.582 billion images at 224×224 resolution each. IDEFICS-9b was based on Llama-7b with a 1.31 million effective batch size. IDEFICS-9b-instruct was then fine-tuned on 6.8 million multimodal instruction datasets created from augmentation using generative AI by unfreezing all the parameters (vision encoder, language model, cross-attentions). The fine-tuning datasets include the pre-training data with the following sampling ratios: 5.1% of image-text pairs and 30.7% of OBELICS multimodal web documents.

The training software is built on top of Hugging Face Transformers and Accelerate, and DeepSpeed ZeRO-3 for training, plus WebDataset and Image2DataSets for data loading. The pre-training of IDEFICS-9b took 350 hours to complete on 128 Nvidia A100 GPUs, whereas fine-tuning of IDEFICS-9b-instruct took 70 hours on 128 Nvidia A100 GPUs, both on AWS p4.24xlarge instances.

With SageMaker, you can seamlessly deploy IDEFICS-9b-instruct on a g5.2xlarge instance for inference tasks. The following code snippet illustrates how to launch a tailored deep learning local container integrated with the customized TGI Docker image:

%%sh
llm_model='HuggingFaceM4/idefics-9b-instruct'
docker_rt_name='idefics-9b-instruct'
docker_image_name='tgi1.03'
docker run --gpus="1,2,3,4" --shm-size 20g -p 8080:80 --restart unless-stopped --name ${docker_rt_name} ${docker_image_name} --model-id ${llm_model}

# Test the LLM API using curl
curl -X 'POST' 'http://<hostname_or_ip>:8080/' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{  
        "inputs": "User:![](http://<image_url>/image.png)Which device produced this image? Please explain the main clinical purpose of such image?Can you write a radiology report based on this image?<end_of_utterance>", 
        "parameters": {    
            "best_of": 1,    "decoder_input_details": true,   
            "details": true,    "do_sample": true,    "max_new_tokens": 20,  
            "repetition_penalty": 1.03,    "return_full_text": false,    
            "seed": null,    "stop": [      "photographer"    ],    
            "temperature": 0.5,    "top_k": 10,    "top_p": 0.95,   
            "truncate": null,    "typical_p": 0.95,    "watermark": true  },  
        "stream": false 
        }'

You can fine-tune IDEFICS or other VLMs including Open Flamingo with your own domain-specific data with instructions. Refer to the following README for multimodality dataset preparation and the fine-tuning script for further details.

Intent classification with chain-of-thought

A picture is worth a thousand words; therefore, the VLM requires guidance to generate an accurate caption from a given image and question. We can use few-shot prompting to enable in-context learning, where we provide demonstrations in the prompt to steer the model to better performance. The demonstrations serve as conditioning for subsequent examples where we would like the model to generate a response.
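For illustration, a few-shot prompt for an instruction-tuned VLM such as IDEFICS might look like the following sketch; the image URLs and example captions are placeholders, and the dialogue markers follow the format used in the inference example later in this post.

# Illustrative few-shot prompt; image URLs and captions are placeholders
few_shot_prompt = (
    "User:![](https://<image_host>/example_cat.png)Describe this image.<end_of_utterance>\n"
    "Assistant: A gray cat sitting on a windowsill in the sun.<end_of_utterance>\n"
    "User:![](https://<image_host>/example_xray.png)Describe this image.<end_of_utterance>\n"
    "Assistant:"
)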

Standard few-shot prompting works well for many tasks but is still not a perfect technique, especially when dealing with more complex reasoning tasks. The few-shot prompting template alone is not enough to get reliable responses. It can help to break the problem down into steps and demonstrate that to the model. More recently, chain-of-thought (CoT) prompting has been popularized to address more complex arithmetic, commonsense, and symbolic reasoning tasks.

Automatic CoT (Auto-CoT) eliminates manual effort by using LLMs with a “Let’s think step by step” prompt to generate reasoning chains for the demonstrations one by one. However, this automatic process can still end up with mistakes in the generated chains. To mitigate the effect of those mistakes, the diversity of the demonstrations matters. Auto-CoT samples questions with diversity and generates reasoning chains to construct the demonstrations, and consists of two main stages:

  • Question clustering – Partition questions of a given dataset into a few clusters
  • Demonstration sampling – Select a representative question from each cluster and generate its reasoning chain using zero-shot CoT with simple heuristics

See the following code snippet:

from langchain.llms import HuggingFaceTextGenInference
from langchain import PromptTemplate, LLMChain

inference_server_url_local = "<Your_local_url_for_llm_on_tgi:port>"

llm_local = HuggingFaceTextGenInference(
    inference_server_url=inference_server_url_local,
    max_new_tokens=512,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.1,
    repetition_penalty=1.05,
)

template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. 
Use ten five maximum and keep the answer as subtle as possible. List all actionable sub-tasks step by step in detail. Be cautious to avoid phrasing that might replicate previous 
inquiries. This will help in obtaining an accurate and detailed answer. Avoid repetition for clarity.

Question: {question}
Answer: Understand the intent of the question then break down the {question} in to sub-tasks. """

prompt = PromptTemplate(
    template=template, 
    input_variables= ["question"]
)

llm_chain_local = LLMChain(prompt=prompt, llm=llm_local)
llm_chain_local("Can you describe the nature of this image? Do you think it's real??")

Automatic Speech Recognition

The VLP solution incorporates Whisper, an Automatic Speech Recognition (ASR) model by OpenAI, to handle audio queries. Whisper can be effortlessly deployed via SageMaker JumpStart using its template. SageMaker JumpStart, known for its straightforward setup, high performance, scalability, and dependability, is ideal for developers aiming to craft exceptional voice-driven applications. The following GitHub repo demonstrates how to harness SageMaker real-time inference endpoints to fine-tune and host Whisper for instant audio-to-text transcription, showcasing the synergy between SageMaker hosting and generative models.
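As a rough sketch of a JumpStart-based deployment, the following code deploys a Whisper model as a real-time endpoint; the model ID, instance type, and request payload are assumptions, so check the JumpStart model catalog and the model’s inference handler for the exact values.

from sagemaker.jumpstart.model import JumpStartModel

# Assumed JumpStart model ID for a Whisper ASR model
whisper_model = JumpStartModel(model_id="huggingface-asr-whisper-large-v2")
asr_predictor = whisper_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # assumed instance type
)

# Send audio for transcription; the expected payload format depends on the model's inference handler
with open("dgvlp_3_5.mp3", "rb") as audio_file:
    transcription = asr_predictor.predict(audio_file.read())
print(transcription)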

Alternatively, you can directly download the Dockerfile.gpu from GitHub developed by ahmetoner, which includes a pre-configured RESTful API. You can then construct a Docker image and run the container on a GPU-powered Amazon Elastic Compute Cloud (EC2) instance for a quick proof of concept. See the following code:

%%sh
docker_image_name='whisper-asr-webservice-gpu'
docker build -f Dockerfile.gpu -t ${docker_image_name} .
docker run -d --gpus all -p 8083:9000 --restart unless-stopped -e ASR_MODEL=base ${docker_image_name}

curl -X 'POST' 'http://<asr_api_hostname>:<port>/asr?task=transcribe&encode=true&output=txt' \
    -H 'accept: application/json' \
    -H 'Content-Type: multipart/form-data' \
    -F 'audio_file=@dgvlp_3_5.mp3;type=audio/mpeg'

In the provided example, port 8083 is selected to host the Whisper API, with inbound network security rules activated. To test, direct a web browser to http://<IP_or_hostname>:8083/docs and initiate a POST request test to the ASR endpoint. As an alternative, run the given command or employ the whisper-live module to verify API connectivity.

!pip install whisper-live
from whisper_live.client import TranscriptionClient
client = TranscriptionClient("<whisper_hostname_or_IP>", 8083, is_multilingual=True, lang="zh", translate=True)
client(audio_file_path) # Use an audio file
client() # Use the microphone for transcription

Multi-class text classification and keyword extraction

Multi-class classification plays a pivotal role in text prompt-driven object detection and segmentation. The distilbert-base-uncased-finetuned-sst-2-english model is a refined checkpoint of DistilBERT-base-uncased, optimized on the Stanford Sentiment Treebank (SST2) dataset by Hugging Face. This model achieves a 91.3% accuracy on the development set, while its counterpart bert-base-uncased boasts an accuracy of 92.7%. The Hugging Face Hub provides access to over 1,000 pre-trained text classification models. For those seeking enhanced precision, SageMaker JumpStart provides templates to fine-tune DistilBERT using custom annotated datasets for more tailored classification tasks.

import torch
from transformers import pipeline

def mclass(text_prompt, top_k=3, topics = ['Mask creation', 'Object detection',
        'Inpainting', 'Segmentation', 'Upscaling', 'Creating an image from another one', 'Generating an image from text'],
        model='distilbert-base-uncased-finetuned-sst-2-english'):

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    # Define the hypothesis template and the candidate topics for entailment/contradiction
    template = 'The topic is {}'
    # Pipeline abstraction from Hugging Face
    pipe = pipeline(task='zero-shot-classification', model=model, tokenizer=model, device=device)
    # Run the pipeline with the text prompt
    prediction = pipe(text_prompt, topics, hypothesis_template=template)
    # Return the top k topics predicted in the zero-shot regime
    return zip(prediction['labels'][0:top_k], prediction['scores'][0:top_k])

top_3_intend = mclass(text_prompt=user_prompt_str, topics=['Others', 'Create image mask', 'Image segmentation'], top_k=3) 

The keyword extraction process employs the KeyBERT module, a streamlined and user-friendly method that harnesses BERT embeddings to generate keywords and key phrases closely aligned with a document—in this case, the objects specified in the query:

# Keyword extraction
from keybert import KeyBERT
kw_model = KeyBERT()
words_list = kw_model.extract_keywords(docs=<user_prompt_str>, keyphrase_ngram_range=(1,3))

Text prompt-driven object detection and classification

The VLP solution employs dialogue-guided object detection and segmentation by analyzing the semantic meaning of the text and identifying the action and objects from the text prompt. Grounded-SAM is an open-source package created by IDEA-Research to detect and segment anything in a given image with text inputs. It combines the strengths of Grounding DINO and Segment Anything to build a powerful pipeline for solving complex problems.

The following figure illustrates how Grounded-SAM can detect objects and conduct instance segmentation by comprehending textual input.

SAM stands out as a robust segmentation model, though it requires prompts, such as bounding boxes or points, to produce high-quality object masks. Grounding DINO excels as a zero-shot detector, adeptly creating high-quality boxes and labels using free-form text prompts. When these two models are combined, they offer the remarkable capability to detect and segment any object purely through text inputs. The Python utility script dino_sam_inpainting.py was developed to integrate Grounded-SAM methods:

!pip install git+https://github.com/facebookresearch/segment-anything.git
import os
import cv2
import torch
import matplotlib.pyplot as plt
import dino_sam_inpainting as D

def dino_sam(image_path, text_prompt, text_threshold=0.4, box_threshold=0.5, output_dir='/tmp/gradio/outputs'):
    config_file = 'GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py'  # change the path of the model config file
    grounded_checkpoint = './models/groundingdino_swint_ogc.pth'  # change the path of the model
    sam_checkpoint = './models/sam_vit_h_4b8939.pth'
    sam_hq_checkpoint = ''  # set to a high-quality checkpoint such as sam_hq_vit_h.pth to use HQ-SAM
    use_sam_hq = ''
    device = 'cuda'

    # make dir
    os.makedirs(output_dir, exist_ok=True)
    # load image
    image_pil, image = D.load_image(image_path)
    # load model
    model = D.load_model(config_file, grounded_checkpoint, device=device)

    output_file_name = f'{format(os.path.basename(image_path))}'

    # visualize raw image
    image_pil.save(os.path.join(output_dir, output_file_name))

    # run grounding dino model
    boxes_filt, pred_phrases = D.get_grounding_output(
        model, image, text_prompt, box_threshold, text_threshold, device=device
    )
    
    # initialize SAM
    if use_sam_hq:
        predictor = D.SamPredictor(D.build_sam_hq(checkpoint=sam_hq_checkpoint).to(device))
    else:
        predictor = D.SamPredictor(D.build_sam(checkpoint=sam_checkpoint).to(device))
    image = cv2.imread(image_path)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    predictor.set_image(image)


    size = image_pil.size
    H, W = size[1], size[0]
    for i in range(boxes_filt.size(0)):
        boxes_filt[i] = boxes_filt[i] * torch.Tensor([W, H, W, H])
        boxes_filt[i][:2] -= boxes_filt[i][2:] / 2
        boxes_filt[i][2:] += boxes_filt[i][:2]

    boxes_filt = boxes_filt.cpu()
    transformed_boxes = predictor.transform.apply_boxes_torch(boxes_filt, image.shape[:2]).to(device)

    masks, _, _ = predictor.predict_torch(
        point_coords = None,
        point_labels = None,
        boxes = transformed_boxes.to(device),
        multimask_output = False,
    )

    # draw output image
    plt.figure(figsize=(10, 10))
    plt.imshow(image)
    for mask in masks:
        D.show_mask(mask.cpu().numpy(), plt.gca(), random_color=True)
    for box, label in zip(boxes_filt, pred_phrases):
        D.show_box(box.numpy(), plt.gca(), label)

    output_file_name = f'{format(os.path.basename(image_path))}'
    plt.axis('off')
    plt.savefig(
        os.path.join(output_dir, f'grounded_sam_{output_file_name}'),
        bbox_inches="tight", dpi=300, pad_inches=0.0
    )

    D.save_mask_data(output_dir, masks, boxes_filt, pred_phrases)
    return f'grounded_sam_{output_file_name}'
    
filename = dino_sam(image_path=<image_path_str>, text_prompt=<object_name_str>, output_dir=<output_image_filename_path_str>, box_threshold=0.5, text_threshold=0.55)

You can choose HQ-SAM to upgrade SAM for high-quality zero-shot segmentation. Refer to the following paper and code sample on GitHub for more details.

VLP processing pipeline

The main objective of the VLP processing pipeline is to combine the strengths of different models, creating a sophisticated workflow specialized for VLP. It’s important to highlight that this setup prioritizes the integration of top-tier models across visual, text, and voice domains. Each segment of the pipeline is modular, facilitating either standalone use or combined operation. Furthermore, the design ensures flexibility, enabling the replacement of components with more advanced models yet to come, while supporting multithreading and robust error handling.

The following figure illustrates a VLP pipeline data flow and service components.

In our exploration of the VLP pipeline, we design one that can process both free-form text prompts and casual voice inputs from microphones. The audio processing is facilitated by Whisper, capable of multilingual speech recognition and translation. The transcribed text is then channeled to an intent classification module, which discerns the semantic essence of the prompts. This works in tandem with a LangChain-driven CoT engine, dissecting the main intent into finer sub-tasks for more detailed information retrieval and generation. If image processing is inferred from the input, the pipeline commences a keyword extraction process, selecting the top N keywords by cross-referencing objects detected in the original image. Subsequently, these keywords are routed to the Grounded-SAM engine, which generates bounding boxes. These bounding boxes are then supplied to the SAM model, which crafts precise segmentation masks, pinpointing each unique object instance in the source image. The final step involves overlaying the masks and bounding boxes onto the original image, yielding a processed image that is presented as a multimodal output.

When the input query seeks to interpret an image, the pipeline engages the LLM to organize the sub-tasks and refine the query with targeted goals. Subsequently, the outcome is directed to the VLM API, accompanied by few-shot instructions, the URL of the input image, and the rephrased text prompt. In response, the VLM provides the textual output. The VLP pipeline can be implemented using a Python-based workflow pipeline or alternative orchestration utilities. Such pipelines operate by chaining a sequential set of sophisticated models into a structured modeling procedure. The pipeline integrates with the Gradio engine for demonstration purposes:

def vlp_text_pipeline(input_text: str, original_image_path: str, chat_history: list) -> list:
   intent_class = intent_classification(input_text)
   key_words = keyword_extraction(input_text)
   image_caption = vlm(input_text, original_image_path)
   chat_history.append(image_caption)
   if intent_class in supported_intents:
        object_bounding_box = object_detection(intent_class, key_words, original_image_path)
        mask_image_path = image_segmentation(object_bounding_box, key_words, original_image_path)
        chat_history.append(mask_image_path)
   return chat_history

def vlp_voice_pipeline(audio_file_path: str, original_image_path: str, chat_history: list) -> list:
   asr_text = whisper_transcribe(audio_file_path)
   chat_history.append(asr_text)
   return vlp_text_pipeline(asr_text, original_image_path, chat_history)

chat_history = vlp_text_pipeline(input_text, original_image_path, chat_history) \
               if audio_file_path is None \
               else vlp_voice_pipeline(audio_file_path, original_image_path, chat_history)

Limitations

Using pre-trained VLMs for VLP has demonstrated promising potential for image understanding. Along with language-based object detection and segmentation, VLP can produce useful outputs with reasonable quality. However, VLP still suffers from inconsistent results, can miss details in pictures, and might even hallucinate. Moreover, models might produce factually incorrect text and should not be relied on to produce factually accurate information. Because none of the referenced pre-trained VLM, SAM, or LLM models has been trained or fine-tuned for domain-specific, production-grade applications, this solution is not designed for mission-critical applications that might impact livelihood or cause material losses.

With prompt engineering, the IDEFICS model sometimes can recognize extra details after a text hint; however, the result is far from consistent and reliable. It can be persistent in maintaining inaccuracies and may be unable or unwilling to make corrections even when users highlight those during a conversation. Enhancing the backbone model by integrating Swin-ViT and fusing it with CNN-based models like DualToken-ViT, along with training using more advanced models like Llama-v2, could potentially address some of these limitations.

Next steps

The VLP solution is poised for notable progress. As we look ahead, there are several key opportunities to advance VLP solutions:

  • Prioritize integrating dynamic prompt instructions and few-shot learning hints. These improvements will enable more accurate AI feedback.
  • Intent classification teams should focus efforts on refining the classifier to pick up on nuanced, domain-specific intents from open prompts. Being able to understand precise user intents will be critical.
  • Implement an agent tree of thoughts model into the reasoning pipeline. This structure will allow for explicit reasoning steps to complete sub-tasks.
  • Pilot fine-tuning initiatives on leading models. Tailoring VLM, LLM, and SAM models to key industries and use cases through fine-tuning will be pivotal.

Acknowledgment

The authors extend their gratitude to Vivek Madan and Ashish Rawat for their insightful feedback and review of this post.


About the authors

Alfred Shen is a Senior AI/ML Specialist at AWS. He has been working in Silicon Valley, holding technical and managerial positions in diverse sectors including healthcare, finance, and high-tech. He is a dedicated applied AI/ML researcher, concentrating on CV, NLP, and multimodality. His work has been showcased in publications such as EMNLP, ICLR, and Public Health.

Dr. Li Zhang is a Principal Product Manager-Technical for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms, a service that helps data scientists and machine learning practitioners get started with training and deploying their models, and uses reinforcement learning with Amazon SageMaker. His past work as a principal research staff member and master inventor at IBM Research has won the test of time paper award at IEEE INFOCOM.

Dr. Changsha Ma is an AI/ML Specialist at AWS. She is a technologist with a PhD in Computer Science, a master’s degree in Education Psychology, and years of experience in data science and independent consulting in AI/ML. She is passionate about researching methodological approaches for machine and human intelligence. Outside of work, she loves hiking, cooking, hunting food, mentoring college students for entrepreneurship, and spending time with friends and families.

Xin Huang is a Senior Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A.


How Reveal’s Logikcull used Amazon Comprehend to detect and redact PII from legal documents at scale

How Reveal’s Logikcull used Amazon Comprehend to detect and redact PII from legal documents at scale

Today, personally identifiable information (PII) is everywhere. PII is in emails, slack messages, videos, PDFs, and so on. It refers to any data or information that can be used to identify a specific individual. PII is sensitive in nature and includes various types of personal data, such as name, contact information, identification numbers, financial information, medical information, biometric data, date of birth, and so on.

Finding and redacting PII is essential to safeguarding privacy, ensuring data security, complying with laws and regulations, and maintaining trust with customers and stakeholders. It’s a critical component of modern data management and cybersecurity practices. But finding PII among the morass of electronic data can present challenges for an organization. These challenges arise due to the vast volume and variety of data, data fragmentation, encryption, data sharing, dynamic content, false positives and negatives, contextual understanding, legal complexities, resource constraints, evolving data, user-generated content, and adaptive threats. However, failure to accurately detect and redact PII can lead to severe consequences for organizations. Consequences might encompass legal penalties, lawsuits, reputation damage, data breach costs, regulatory probes, operational disruption, trust erosion, and sanctions.

In the legal system, discovery is the legal process governing the right to obtain and the obligation to produce non-privileged matter relevant to any party’s claims or defenses in litigation. Electronic discovery, also known as eDiscovery, is the electronic aspect of identifying, collecting, and producing electronically stored information (ESI) in response to a request for production in a lawsuit or investigation. If organizations are dealing with eDiscovery for litigation or subpoena responses, they’re probably concerned about accidentally sharing PII. Many organizations, including government agencies, school districts, and legal professionals, face the challenge of detecting and redacting PII accurately at scale. For government organizations in particular, redacting PII from documents released under the Freedom of Information Act and Digital Services Act is crucial for protecting individual privacy, ensuring compliance with data protection laws, preventing identity theft, and maintaining trust and transparency in government and digital services. It strikes a balance between transparency and privacy while mitigating legal and security risks.

Organizations can search for PII using methods such as keyword searches, pattern matching, data loss prevention tools, machine learning (ML), metadata analysis, data classification software, optical character recognition (OCR), document fingerprinting, and encryption.

Now a part of Reveal’s AI-powered eDiscovery platform, Logikcull is a self-service solution that allows legal professionals to process, review, tag, and produce electronic documents as part of a lawsuit or investigation. This unique offering helps attorneys discover valuable information related to the matter in hand while reducing costs, speeding up resolutions, and mitigating risks.

In this post, Reveal experts showcase how they used Amazon Comprehend in their document processing pipeline to detect and redact individual pieces of PII. Amazon Comprehend is a fully managed and continuously trained natural language processing (NLP) service that can extract insight about the content of a document or text. You can use Amazon Comprehend ML capabilities to detect and redact PII in customer emails, support tickets, product reviews, social media, and more.

Overview of solution

The overarching goal for the engineering team is to detect and redact PII from millions of legal documents for their customers. Using Reveal’s Logikcull solution, the engineering team implemented two processes, namely first pass PII detection and second pass PII detection and redaction. This two-pass solution was made possible by using the ContainsPiiEntities and DetectPiiEntities APIs.

First pass PII detection

The goal of first pass PII detection is to find the documents that might contain PII.

  1. Users upload the files on which they would like to perform PII detection and redaction through Logikcull’s public website into a project folder. These files can be in the form of office documents, .pdf files, emails, or a .zip file containing all the supported file types.
  2. Logikcull stores these project folders securely inside an Amazon Simple Storage Service (Amazon S3) bucket. The files then pass through Logikcull’s massively parallel processing pipeline hosted on Amazon Elastic Compute Cloud (Amazon EC2), which processes the files, extracts the metadata, and generates artifacts in text format for data review. Logikcull’s processing pipeline supports text extraction for a wide variety of forms and files, including audio and video files.
  3. After the files are available in text format, Logikcull passes the input text along with the language code, which is English, through Amazon Comprehend by making the ContainsPiiEntities API call (a minimal API sketch follows this list). The processing pipeline servers hosted on Amazon EC2 make the Amazon Comprehend ContainsPiiEntities API call by passing the request parameters as text and language code. The ContainsPiiEntities API call analyzes input text for the presence of PII and returns the labels of identified PII entity types, such as name, address, bank account number, or phone number. The API response also includes a confidence score, which indicates the level of confidence that Amazon Comprehend has assigned to the detection accuracy. The confidence score has a value between 0 and 1, with 1 signifying 100 percent confidence. Logikcull uses this confidence score to assign the tag PII Detected to the documents. Logikcull only assigns this tag to documents that have a confidence score of over 0.75.
  4. PII Detected tagged documents are fed into Logikcull’s search index cluster for their users to quickly identify documents that contain PII entities.
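A minimal sketch of this first-pass check with boto3 follows; the client setup, helper name, and the 0.75 threshold mirror the tagging rule described above but are illustrative rather than Logikcull’s actual implementation.

import boto3

comprehend = boto3.client("comprehend")

def first_pass_contains_pii(document_text: str, threshold: float = 0.75) -> bool:
    # Check the text for the presence of PII entity types (name, address, bank account number, and so on)
    response = comprehend.contains_pii_entities(Text=document_text, LanguageCode="en")
    # Tag the document as "PII Detected" only if any label exceeds the confidence threshold
    return any(label["Score"] > threshold for label in response["Labels"])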

Second pass PII detection and redaction

The first pass PII detection process narrows down the scope of the dataset by identifying which documents contain PII. This speeds up the PII detection process and also reduces the overall cost. The goal of the second pass PII detection is to identify the individual instances of PII and redact them from the documents tagged in the first pass.

  1. Users search for documents that contain PII through Logikcull’s website using Logikcull’s advanced search filters feature.
  2. The request is handled by Logikcull’s application servers hosted on Amazon EC2, and the servers communicate with the search index cluster to find the documents.
  3. The Logikcull application servers identify the individual instances of PII by making the DetectPiiEntities API call (a minimal API sketch follows this list). The servers make the API call by passing the text and language code of the input documents. The DetectPiiEntities API action inspects the input text for entities that contain PII. For each entity, the response provides the entity type, where the entity text begins and ends, and the level of confidence that Amazon Comprehend has in its detection.
  4. The users then select the specific entities that they want to redact using Logikcull’s web interface. The application servers send these requests to Logikcull’s processing pipeline. The following is a screenshot of a PDF that was uploaded to Logikcull’s application. In the screenshot, you can see that different PII entities, such as name, address, phone number, and email address, have been highlighted.

  5. The PII redaction is safely applied inside Logikcull’s processing pipeline using custom business logic. In the screenshot that follows, you can see that users can select either specific PII entity types or all PII entity types that they want to redact and then, with a single click, redact all the PII information.
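The following minimal sketch shows the second-pass call with boto3, reusing the Amazon Comprehend client from the earlier sketch; the helper name and the returned fields are illustrative, not Logikcull’s production code.

def second_pass_detect_pii(document_text: str) -> list:
    # Return each detected PII entity with its type, the matching text span, and the confidence score
    response = comprehend.detect_pii_entities(Text=document_text, LanguageCode="en")
    return [
        {
            "type": entity["Type"],
            "text": document_text[entity["BeginOffset"]:entity["EndOffset"]],
            "score": entity["Score"],
        }
        for entity in response["Entities"]
    ]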

Results

Logikcull, a Reveal technology, is currently processing over 20 million documents each week and was able to narrow down the scope of detection using the ContainsPiiEntities API and display individual instances of PII entities to their customers by using the DetectPiiEntities API.

“With Amazon Comprehend, Logikcull has been able to rapidly deploy powerful NLP capabilities in a fraction of the time a custom-built solution would have required.”

– Steve Newhouse, VP of Product for Logikcull.

Conclusion

Amazon Comprehend allows Reveal’s Logikcull technology to run PII detection at large scale for a relatively low cost. The ContainsPiiEntities API is used to do an initial scan of millions of documents. The DetectPiiEntities API is used to run a detailed analysis of thousands of documents and identify individual pieces of PII in their documents.

Take a look at all the Amazon Comprehend features. Give the features a try and send us feedback either through the AWS forum for Amazon Comprehend or through your usual AWS support contacts.


About the Authors

Aman Tiwari is a General Solutions Architect working with Worldwide Commercial Sales at AWS. He works with customers in the Digital Native Business segment and helps them design innovative, resilient, and cost-effective solutions using AWS services. He holds a master’s degree in Telecommunications Networks from Northeastern University. Outside of work, he enjoys playing lawn tennis and reading books.

Jeff Newburn is a Senior Software Engineering Manager leading the Data Engineering team at Logikcull – A Reveal Technology.  He oversees the company’s data initiatives, including data warehouses, visualizations, analytics, and machine learning.  With experience spanning development and management in areas from ride sharing to data systems, he enjoys leading teams of brilliant engineers to exciting products.

Søren Blond Daugaard is a Staff Engineer in the Data Engineering team at Logikcull – A Reveal Technology. He implements highly scalable AI and ML solutions into the Logikcull product, enabling our customers to do their work more efficiently and with higher precision. His expertise spans data pipelines, web-based systems, and machine learning systems.

Kevin Lufkin is a Senior Software Engineer on the Search Engineering team at Logikcull – A Reveal Technology, where he focuses on developing customer facing and search-related features. His extensive expertise in UI/UX is complemented by a background in full-stack web development, with a strong focus on bringing product visions to life.
