Amazon AWS – Page 82

Connect to Amazon services using AWS PrivateLink in Amazon SageMaker

June 20, 2024

by Francisco Calderon Rodriguez Amazon AWS

AWS customers that implement secure development environments often have to restrict outbound and inbound internet traffic. This becomes increasingly important with artificial intelligence (AI) development because of the data assets that need to be protected. Transmitting data across the internet is not secure enough for highly sensitive data. Therefore, accessing AWS services without leaving the AWS network can be a secure workflow.

One of the ways you can secure AI development is by creating Amazon SageMaker instances within a virtual private cloud (VPC) with direct internet access disabled. This isolates the instance from the internet and makes API calls to other AWS services not possible. This presents a challenge for developers that are building architectures for production in which many AWS services need to function together.

In this post, we present a solution for configuring SageMaker notebook instances to connect to Amazon Bedrock and other AWS services with the use of AWS PrivateLink and Amazon Elastic Compute Cloud (Amazon EC2) security groups.

Solution overview

The following example architecture shows a SageMaker instance connecting to various services. The SageMaker instance is isolated from the internet but is still able to access AWS services through PrivateLink. One will notice that the connection to Amazon S3 is through a Gateway VPC endpoint. You can learn more about Gateway VPC endpoints here.

In the following sections, we show how to configure this on the AWS Management Console.

Create security groups for outbound and inbound endpoint access

First, you have to create the security groups that will be attached to the VPC endpoints and the SageMaker instance. You create the security groups before creating a SageMaker instance because after the instance has been created, the security group configuration can’t be changed.

You create two groups, one for outbound and another for inbound. Complete the following steps:

1. On the Amazon EC2 console, choose Security Groups in the navigation pane.

2. Choose Create security group.

3. For Security group name, enter a name (for example, inbound-sagemaker).

4. For Description, enter a description.

5. For VPC, choose your VPC.

6. Note the security group ID to use in the next steps.

7. Create a new outbound rule.

8. For Security group name, enter a name (for example, outbound-sagemaker).

9. For Description, enter description.

10. For VPC, choose the same VPC as the inbound rule.

11. In the Outbound rules section, choose Add rule.

12. Add an outbound rule with the inbound security group ID as the destination using HTTPS as the type.

13. Note the outbound security group ID to use in the next step.

14. Return to the inbound security group and add an inbound rule of HTTPS type with the destination set to the outbound security group ID.

Create a SageMaker instance with the outbound security group

You now create a SageMaker instance with the network configuration shown in the following screenshot. It’s important to choose the same VPC that you used to create the inbound and outbound security groups. You then choose the outbound security group you created earlier.

Create an Interface VPC endpoint

In this step, you create an Interface VPC endpoint using Amazon Virtual Private Cloud (Amazon VPC) that automatically uses PrivateLink, which allows calls from your SageMaker instance to AWS services.

1. On the Amazon VPC console, choose Endpoints in the navigation pane.

2. Choose Create endpoint.

3. For Name tag, enter a name (for example, bedrock-link).

4. For Service category, select AWS services.

5. For Services, search for and choose com.amazonaws.<region>.bedrock-runtime.

6. Set the VPC to the same one you’ve been working with.

7. Specify the subnet(s).

A subnet is a range of IP addresses within a VPC. If you don’t know what subnet to specify, any subnet will work. Otherwise, specify the subnet that is required by any security requirements from your cloud security team.

8. Set the security group to the inbound security group you created earlier.

After you create the endpoint, it should take some time to become available.

Repeat these steps for every service that you need for your workflow. The following screenshots show examples of services that you can create interface VPC endpoints for, such as Amazon Simple Storage Service (Amazon S3), Amazon Kendra, and AWS Lambda. AWS PrivateLink enables you to connect privately to several AWS services, for a current list please see this page.

Test the connection

You can test the connection to Amazon Bedrock using a simple Python API call. The following is a code snippet that invokes the Amazon Bedrock model:

import boto3
import json

bedrock = boto3.client(service_name='bedrock-runtime')
prompt = """
Human: What type of sharks are there?

Assistant:"""

body = json.dumps({
"prompt": prompt,
"max_tokens_to_sample": 4000,
"temperature": 0.1,
"top_p": 0.9,
})

modelId = 'anthropic.claude-instant-v1'
accept = 'application/json'
contentType = 'application/json'

response = bedrock.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
response_body = json.loads(response.get('body').read())

print(response_body.get('completion'))

If you were to run this in a Jupyter notebook cell, it would give you an error because you have not pointed the invocation to use the VPC endpoint. You do this by adding an endpoint URL to the client instantiation:

bedrock = boto3.client(
    service_name='bedrock-runtime',
    endpoint_url = 'https://vpce-0e452bc86b1f87c50-5xltzdpo.bedrock-runtime.us-west-2.vpce.amazonaws.com'
)

To find the endpoint URL, go back to the VPC endpoint that you created in the previous step and look for DNS names, illustrated in the following screenshot. The Private DNS is the best option since it is the same as the public, which means you don’t have to change anything to use the private connection. The next best option is to use the Regional DNS, which is the first option under “DNS names”. Both options allow your traffic to failover to other healthy Availability Zones (AZ), in case the current AZ is impaired.

Clean up

To clean up your resources, complete the following steps:

1. On the SageMaker console, navigate to the notebook configuration page.

2. Stop the instance, then choose Delete to delete the instance.

3. On the Amazon EC2 console, navigate to the inbound security group’s detail page.

4. On the Actions menu, choose Delete security groups.

5. Repeat these steps for the outbound security group.

6. On the Amazon VPC console, navigate to the VPC endpoint’s details page.

7. On the Actions menu, choose Delete.

8. Repeat this is step for every endpoint you created as part of this post.

Conclusion

In this post, we showed how to set up VPC endpoints and security groups to allow SageMaker to connect to Amazon Bedrock. When a SageMaker instance has restricted internet access, you can still develop and connect to other AWS services through the use of AWS PrivateLink. This post showed how to connect to Amazon Bedrock from an isolated SageMaker instance, but you can replicate the steps for other services.

We encourage you to get started developing AI applications on AWS. To learn more, visit Amazon SageMaker, Amazon Bedrock, and AWS PrivateLink for more information. Happy coding!

About the Author

Francisco Calderon is a Data Scientist at the AWS Generative AI Innovation Center. As a member of the GenAI Innovation Center, he helps solve critical business problems for AWS customers using the latest technology in Generative AI. In his spare time, Francisco likes to play music and guitar, play soccer with his daughters, and enjoy time with his family.

Sungmin Hong is an Applied Scientist at AWS Generative AI Innovation Center where he helps expedite the variety of use cases of AWS customers. Before joining Amazon, Sungmin was a postdoctoral research fellow at Harvard Medical School. He holds Ph.D. in Computer Science from New York University. Outside of work, Sungmin enjoys hiking, traveling and reading.

Yash Shah is a Science Manager in the AWS Generative AI Innovation Center. He and his team of applied scientists and machine learning engineers work on a range of machine learning use cases from healthcare, sports, automotive and manufacturing.

Anila Joshi has more than a decade of experience building AI solutions. As an Applied Science Manager at AWS Generative AI Innovation Center, Anila pioneers innovative applications of AI that push the boundaries of possibility and guides customers to strategically chart a course into the future of AI.

Maximize your Amazon Translate architecture using strategic caching layers

June 19, 2024

by Praneeth Reddy Tekula Amazon AWS

Amazon Translate is a neural machine translation service that delivers fast, high quality, affordable, and customizable language translation. Amazon Translate supports 75 languages and 5,550 language pairs. For the latest list, see the Amazon Translate Developer Guide. A key benefit of Amazon Translate is its speed and scalability. It can translate a large body of content or text passages in batch mode or translate content in real-time through API calls. This helps enterprises get fast and accurate translations across massive volumes of content including product listings, support articles, marketing collateral, and technical documentation. When content sets have phrases or sentences that are often repeated, you can optimize cost by implementing a write-through caching layer. For example, product descriptions for items contain many recurring terms and specifications. This is where implementing a translation cache can significantly reduce costs. The caching layer stores source content and its translated text. Then, when the same source content needs to be translated again, the cached translation is simply reused instead of paying for a brand-new translation.

In this post, we explain how setting up a cache for frequently accessed translations can benefit organizations that need scalable, multi-language translation across large volumes of content. You’ll learn how to build a simple caching mechanism for Amazon Translate to accelerate turnaround times.

Solution overview

The caching solution uses Amazon DynamoDB to store translations from Amazon Translate. DynamoDB functions as the cache layer. When a translation is required, the application code first checks the cache—the DynamoDB table—to see if the translation is already cached. If a cache hit occurs, the stored translation is read from DynamoDB with no need to call Amazon Translate again.

If the translation isn’t cached in DynamoDB (a cache miss), then the Amazon Translate API will be called to perform the translation. The source text is passed to Amazon Translate, and the translated result is returned and the translation is stored in DynamoDB, populating the cache for the next time that translation is requested.

For this blog post, we will be using Amazon API Gateway as a rest API for translation that integrates with AWS Lambda to perform backend logic. An Amazon Cognito user pool is used to control who can access your translate rest API. You can also use other mechanisms to control authentication and authorization to API Gateway based on your use-case.

Amazon Translate caching architecture

When a new translation is needed, the user or application makes a request to the translation rest API.
Amazon Cognito verifies the identity token in the request to grant access to the translation rest API.
When new content comes in for translation, the Amazon API Gateway invokes the Lambda function that checks the Amazon DynamoDB table for an existing translation.
If a match is found, the translation is retrieved from DynamoDB.
If no match is found, the content is sent to Amazon Translate to perform a custom translation using parallel data. The translated content is then stored in DynamoDB along with a new entry for hit rate percentage.

These high-value translations are periodically post-edited by human translators and then added as parallel data for machine translation. This improves the quality of future translations performed by Amazon Translate.

We will use a simple schema in DynamoDB to store the cache entries. Each item will contain the following attributes:

src_text: The original source text
target_locale: The target language to translate to
translated_text: The translated text
src_locale: The original source language
hash: The primary key of the table

The primary key will be constructed from the src_locale, target_locale, and src_text to uniquely identify cache entries. When retrieving translations, items will be looked up by their primary key.

Prerequisites

To deploy the solution, you need

An AWS account. If you don’t already have an AWS account, you can create one.
Your access to the AWS account must have AWS Identity and Access Management (IAM) permissions to launch AWS CloudFormation templates that create IAM roles.
Install AWS CLI.
Install jq tool.
AWS Cloud Development Kit (AWS CDK). See Getting started with the AWS CDK.
Postman installed and configured on your computer.

Deploy the solution with AWS CDK

We will use AWS CDK to deploy the DynamoDB table for caching translations. CDK allows defining the infrastructure through a familiar programming language such as Python.

Clone the repo from GitHub.

git clone https://github.com/aws-samples/maximize-translate-architecture-strategic-caching

Run the requirements.txt, to install python dependencies.
```
python3 -m pip install -r requirements.txt
```
Open app.py file and replace the AWS account number and AWS Region with yours.
To verify that the AWS CDK is bootstrapped, run cdk bootstrap from the root of the repository:

cdk bootstrap
 Bootstrapping environment aws://<acct#>/<region>... 
Trusted accounts for deployment: (none) 
Trusted accounts for lookup: (none) 
Using default execution policy of 
'arn:aws:iam::aws:policy/AdministratorAccess'. 
Pass '--cloudformation-execution-policies' to 
customize.  Environment aws://<acct#>/<region> 
bootstrapped (no changes).

Define your CDK stack to add DynamoDB and Lambda resources. The DynamoDB and Lambda Functions are defined as follows:

- This creates a DynamoDB table with the primary key as hash, because the TRANSLATION_CACHE table is schemaless, you don’t have to define other attributes in advance. This also creates a Lambda function with Python as the runtime.

table = ddb.Table(
            self, 'TRANSLATION_CACHE',
            table_name='TRANSLATION_CACHE',
            partition_key={'name': 'hash', 'type': ddb.AttributeType.STRING},
            removal_policy=RemovalPolicy.DESTROY
        )

        self._handler = _lambda.Function(
            self, 'GetTranslationHandler',
            runtime=_lambda.Runtime.PYTHON_3_10,
            handler='get_translation.handler',
            code=_lambda.Code.from_asset('lambda'),
            environment={
                'TRANSLATION_CACHE_TABLE_NAME': table.table_name,
            }
        )

- The Lambda function is defined such that it:
  - Parses the request body JSON into a Python dictionary.
  - Extracts the source locale, target locale, and input text from the request.
  - Gets the DynamoDB table name to use for a translation cache from environment variables.
  - Calls generate_translations_with_cache() to translate the text, passing the locales, text, and DynamoDB table name.
  - Returns a 200 response with the translations and processing time in the body.

def handler(event, context):

    print('request: {}'.format(json.dumps(event)))

    request = json.loads(event['body'])
    print("request", request)

    src_locale = request['src_locale']
    target_locale = request['target_locale']
    input_text = request['input_text']
    table_name = os.environ['TRANSLATION_CACHE_TABLE_NAME']

    if table_name == "":
        print("Defaulting table name")
        table_name = "TRANSLATION_CACHE"

    try:
        start = time.perf_counter()
        translations = generate_translations_with_cache(src_locale, target_locale, input_text, table_name)
        end = time.perf_counter()
        time_diff = (end - start)

        translations["processing_seconds"] = time_diff

        return {
            'statusCode': 200,
            'headers': {
                'Content-Type': 'application/json'
            },
            'body': json.dumps(translations)
        }

    except ClientError as error:

        error = {"error_text": error.response['Error']['Code']}
        return {
            'statusCode': 500,
            'headers': {
                'Content-Type': 'application/json'
            },
            'body': json.dumps(error)
        }

- The generate_translations_with_cache function divides the input text into separate sentences by splitting on a period (“.”) symbol. It stores each sentence as a separate entry in the DynamoDB table along with its translation. This segmentation into sentences is done so that cached translations can be reused for repeating sentences.
- In summary, it’s a Lambda function that accepts a translation request, translates the text using a cache, and returns the result with timing information. It uses DynamoDB to cache translations for better performance.
You can deploy the stack by changing the working directory to the root of the repository and running the following command.
```
cdk deploy
```

Considerations

Here are some additional considerations when implementing translation caching:

Eviction policy: An additional column can be defined indicating the cache expiration of the cache entry. The cache entry can then be evicted by defining a separate process.
Cache sizing: Determine expected cache size and provision DynamoDB throughput accordingly. Start with on-demand capacity if usage is unpredictable.
Cost optimization: Balance caching costs with savings from reducing Amazon Translate usage. Use a short DynamoDB Time-to-Live (TTL) and limit the cache size to minimize overhead.
Sensitive Information: DynamoDB encrypts all data at rest by default, if cached translations contain sensitive data, you can grant access to authorized users only. You can also choose to not cache data that contains sensitive information.

Customizing translations with parallel data

The translations generated in the translations table can be human-reviewed and used as parallel data to customize the translations. Parallel data consists of examples that show how you want segments of text to be translated. It includes a collection of textual examples in a source language; for each example, it contains the desired translation output in one or more target languages.

This is a great approach for most use cases, but some outliers might require light post-editing by human teams. The post-editing process can help you better understand the needs of your customers by capturing the nuances of local language that can be lost in translation. For businesses and organizations that want to augment the output of Amazon Translate (and other Amazon artificial intelligence (AI) services) with human intelligence, Amazon Augmented AI (Amazon A2I) provides a managed approach to do so, see Designing human review workflows with Amazon Translate and Amazon Augmented AI for more information.

When you add parallel data to a batch translation job, you create an Active Custom Translation job. When you run these jobs, Amazon Translate uses your parallel data at runtime to produce customized machine translation output. It adapts the translation to reflect the style, tone, and word choices that it finds in your parallel data. With parallel data, you can tailor your translations for terms or phrases that are unique to a specific domain, such as life sciences, law, or finance. For more information, see Customizing your translations with parallel data.

Testing the caching setup

Here is a video walkthrough of testing the solution.

There are multiple ways to test the caching setup. For this example, you will use Postman to test by sending requests. Because the Rest API is protected by an Amazon Cognito authorizer, you will need to configure Postman to send an authorization token with the API request.

As part of the AWS CDK deployment in the previous step, a Cognito user pool is created with an app client integration. On your AWS CloudFormation console, you can find BaseURL, translateCacheEndpoint, UserPoolID, and ClientID on the CDK stack output section. Copy these into a text editor for use later.

To generate an authorization token from Cognito, the next step is to create a user in the Cognito user pool.

Go to the Amazon Cognito console. Select the user pool that was created by the AWS CDK stack.
Select the Users tab and choose Create User.
Enter the following values and choose Create User.
1. On Invitation Message verify that Don’t send an invitation is selected.
2. For Email address, enter test@test.com.
3. On Temporary password, verify that Set a password is selected.
4. In Password enter testUser123!.
Now that the user is created, you will use AWS Command Line Interface (CLI) to simulate a sign in for the user. Go to the AWS CloudShell console.
Enter the following commands on the CloudShell terminal by replacing UserPoolID and ClientID from the CloudFormation output of the AWS CDK stack.

export YOUR_POOL_ID=<UserPoolID>

export YOUR_CLIENT_ID=<ClientID>

export Session_ID=$(aws cognito-idp admin-initiate-auth --user-pool-id ${YOUR_POOL_ID} --client-id ${YOUR_CLIENT_ID} --auth-flow ADMIN_NO_SRP_AUTH --auth-parameters 'USERNAME=test@test.com,PASSWORD="testUser123!"' | jq .Session -r)

aws cognito-idp admin-respond-to-auth-challenge --user-pool-id ${YOUR_POOL_ID}  --client-id ${YOUR_CLIENT_ID} --challenge-name NEW_PASSWORD_REQUIRED --challenge-responses 'USERNAME= test@test.com,NEW_PASSWORD="testUser456!"' --session "${Session_ID}"

The output from this call should be a valid session in the following format. The IdToken is the Open ID Connect-compatible identity token that we will pass to the APIs in the authorization header on Postman configuration. Copy it into a text editor to use later.

{
   "ChallengeParameters": {},
   "AuthenticationResult": {
"AccessToken":"YOU_WILL_SEE_VALID_ACCESS_TOKEN_VALUE_HERE",
      "ExpiresIn": 3600,
      "TokenType": "Bearer",
      "RefreshToken": "YOU_WILL_SEE_VALID_REFRESH_TOKEN_VALUE_HERE",
      "IdToken": "YOU_WILL_SEE_VALID_ID_TOKEN_VALUE_HERE"
   }
}

Now that you have an authorization token to pass with the API request to your rest API. Go to the Postman website. Sign in to the Postman website or download the Postman desktop client and create a Workspace with the name dev.

Select the workspace dev and choose on New request.
Change the method type to POST from GET.
Paste the <TranslateCacheEndpoint> URL from the CloudFormation output of the AWS CDK stack into the request URL textbox. Append the API path /translate to the URL, as shown in the following figure.

Now set up authorization configuration on Postman so that requests to the translate API are authorized by the Amazon Cognito user pool.

Select the Authorization tab below the request URL in Postman. Select OAuth2.0 as the Type.
Under Current Token, copy and paste Your IdToken from earlier into the Token field.

Select Configure New Token. Under Configuration Options add or select the values that follow. Copy the BaseURL and ClientID from the CloudFormation output of the AWS CDK stack. Leave the remaining fields at the default values.

- Token Name: token
- Grant Type: Select Authorization Code
- Callback URL: Enter https://localhost
- Auth URL: Enter <BaseURL>/oauth2/authorize
- Access Token URL: Enter <BaseURL>/oauth2/token
- ClientID: Enter <ClientID>
- Scope: Enter openid profile translate-cache/translate
- Client Authorization: Select Send client credentials in body.

Click Get New Access Token. You will be directed to another page to sign in as a user. Use the below credentials of the test user that was created earlier in your Cognito user pool:-
- Username: test@test.com
- Password: testUser456!
After authenticating, you will now get a new id_token. Copy the new id_token and go back to Postman authorization tab to replace that with the token value under Current Token.
Now, on the Postman request URL and Select the Body tab for Request. Select the raw . Change Body type to JSON and insert the following JSON content. When done, choose Send.

{
"src_locale": "en",
"target_locale": "fr",
"input_text": "Use the Amazon Translate service to translate content from a source language (the language of the input content) to a target language (the language that you select for the translation output). In a batch job, you can translate files from one or more source languages to one or more target languages. For more information about supported languages, see Supported languages and language codes."
}

First translation request to the API

The first request to the API takes more time, because the Lambda function checks the given input text against the DynamoDB database on the initial request. Because this is the first request, it won’t find the input text in the table and will call Amazon Translate to translate the provided text.

Examining the processing_seconds value reveals that this initial request took approximately 2.97 seconds to complete.

Subsequent translations requests to the API

After the first request, the input text and translated output are now stored in the DynamoDB table. On subsequent requests with the same input text, the Lambda function will first check DynamoDB for a cache hit. Because the table now contains the input text from the first request, the Lambda function will find it there and retrieve the translation from DynamoDB instead of calling Amazon Translate again.

Storing requests in a cache allows subsequent requests for the same translation to skip the Amazon Translate call, which is usually the most time-consuming part of the process. Retrieving the translation from DynamoDB is much faster than calling Amazon Translate to translate the text each time.

The second request has a processing time of approximately 0.79 seconds, about 3 times faster than the first request which took 2.97 seconds to complete.

Cache purge

Amazon Translate continuously improves its translation models over time. To benefit from these improvements, you need to periodically purge translations from your DynamoDB cache and fetch fresh translations from Amazon Translate.

DynamoDB provides a Time-to-Live (TTL) feature that can automatically delete items after a specified expiry timestamp. You can use this capability to implement cache purging. When a translation is stored in DynamoDB, a purge_date attribute set to 30 days in the future is added. DynamoDB will automatically delete items shortly after the purge_date timestamp is reached. This ensures cached translations older than 30 days are removed from the table. When these expired entries are accessed again, a cache miss occurs and Amazon Translate is called to retrieve an updated translation.

The TTL-based cache expiration allows you to efficiently purge older translations on an ongoing basis. This ensures your applications can benefit from the continuous improvements to the machine learning models used by Amazon Translate while minimizing costs by still using caching for repeated translations within a 30-day period.

Clean up

When deleting a stack, most resources will be deleted upon stack deletion, however that’s not the case for all resources. The DynamoDB table will be retained by default. If you don’t want to retain this table, you can set this in the AWS CDK code by using RemovalPolicy.

Additionally, the Lambda function will generate Amazon CloudWatch logs that are permanently retained. These won’t be tracked by CloudFormation because they’re not part of the stack, so the logs will persist. Use the Cloudwatch console to manually delete any logs that you don’t want to retain.

You can either delete the stack through the CloudFormation console or use AWS CDK destroy from the root folder.

cdk destroy

Conclusion

The solution outlined in this post provides an effective way to implement a caching layer for Amazon Translate to improve translation performance and reduce costs. Using a cache-aside pattern with DynamoDB allows frequently accessed translations to be served from the cache instead of calling Amazon Translate each time.

The caching architecture is scalable, secure, and cost-optimized. Additional enhancements such as setting TTLs, adding eviction policies, and encrypting cache entries can further customize the architecture to your specific use case.

Translations stored in the cache can also be post-edited and used as parallel data to train Amazon Translate. This creates a feedback loop that continuously improves translation quality over time.

By implementing a caching layer, enterprises can deliver fast, high-quality translations tailored to their business needs at reduced costs. Caching provides a way to scale Amazon Translate efficiently while optimizing performance and cost.

Additional resources

About the authors

Praneeth Reddy Tekula is a Senior Solutions Architect focusing on EdTech at AWS. He provides architectural guidance and best practices to customers in building resilient, secure and scalable systems on AWS. He is passionate about observability and has a strong networking background.

Reagan Rosario is a Solutions Architect at AWS, specializing in building scalable, highly available, and secure cloud solutions for education technology companies. With over 10 years of experience in software engineering and architecture roles, Reagan loves using his technical knowledge to help AWS customers architect robust cloud solutions that leverage the breadth and depth of AWS.

Deploy a Slack gateway for Amazon Bedrock

June 19, 2024

by Rushabh Lokhande Amazon AWS

In today’s fast-paced digital world, streamlining workflows and boosting productivity are paramount. That’s why we’re thrilled to share an exciting integration that will take your team’s collaboration to new heights. Get ready to unlock the power of generative artificial intelligence (AI) and bring it directly into your Slack workspace.

Imagine the possibilities: Quick and efficient brainstorming sessions, real-time ideation, and even drafting documents or code snippets—all powered by the latest advancements in AI. Say goodbye to context switching and hello to a streamlined, collaborative experience that will supercharge your team’s productivity. Whether you’re leading a dynamic team, working on complex projects, or simply looking to enhance your Slack experience, this integration is a game-changer.

In this post, we show you how to unlock new levels of efficiency and creativity by bringing the power of generative AI directly into your Slack workspace using Amazon Bedrock.

Solution overview

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

In the following sections, we guide you through the process of setting up a Slack integration for Amazon Bedrock. We show how to create a Slack application, configure the necessary permissions, and deploy the required resources using AWS CloudFormation.

The following diagram illustrates the solution architecture.

The workflow consists of the following steps:

The user communicates with the Slack application.
The Slack application sends the event to Amazon API Gateway, which is used in the event subscription.
API Gateway forwards the event to an AWS Lambda function.
The Lambda function invokes Amazon Bedrock with the request, then responds to the user in Slack.

Prerequisites

You need an AWS account and an AWS Identity and Access Management (IAM) role and user with permissions to create and manage the necessary resources and components for this application. If you don’t have an AWS account, see How do I create and activate a new Amazon Web Services account?

You also need an existing account with Amazon Bedrock model access provided. If you don’t have model permission, refer to Model access.

Lastly, you need a Slack account and access to create and publish apps to your Slack organization. If you don’t have one, request your company to create a Slack sandbox organization for you to experiment, or go to Slack to create a free Slack account and workspace.

Create a Slack application

The security configuration varies across organizations. To manage your Slack workspace’s settings, reach out to your Slack administrator or as administrator, complete the following steps:

Navigate to the admin section within Slack and choose Build.
Choose Create New App.
For App Name, enter a name for your app (for this post, we name it BedrockSlackIntegration).
Choose your workspace.
Choose Create App.

After you create the app, you can configure its permissions.
On the app details page, choose Basic Information in the navigation pane.
Under Add features and functionality, choose Permissions
In the Scopes section, add the scopes im:read, im:write, and chat:write.

On the Basic Information page, Bots and Permissions should now both have a green check mark.

Under Install your app, choose Install to Workspace.
When prompted to install, choose Allow.
Open the Amazon Bedrock console and choose Model access in the navigation pane.
You can select your model from the available list. For this post, we grant access to ai21.j2-ultra-v1 (Jurassic-2 Ultra).For more information about requesting model access, see Model access. Next, we deploy the code and connect with Amazon Bedrock when we get a message from Slack. For that, we need the Slack bot token to use as an input parameter for the CloudFormation template in the next section.
On the Slack app details page, choose OAuth & Permissions in the navigation pane.
Copy the value for Bot User OAuth Token.

Deploy resources with AWS CloudFormation

Complete the following steps to launch the CloudFormation stack:

For Stack name, use default or enter a name of your choice.
For SlackTokenParam, enter the bot token you copied earlier.
Choose Next.
Create your stack and wait a few minutes for deployment to complete.
On the Outputs tab, copy the value for SlackBotEndpointOutput to use in the next steps.

In the next section, we start integrating Amazon Bedrock with Slack.

Integrate Amazon Bedrock with Slack

After you deploy your CloudFormation stack, complete the following steps:

On the Slack app details page, choose Event Subscriptions in the navigation pane.
Toggle Enable Events on.

The event subscription should get automatically verified.

Under Subscribe to bot events, add the events app_mention and message.im.
Choose Save Changes.

The integration is now complete.

Test the Slack bot

To test your bot, complete the following steps:

Navigate to your Slack.
Create a new group and add the app BedrockSlackIntegration.
Start interacting with the Amazon Bedrock bot using @BedrockSlackIntegration.

Your interaction will look like the following screenshot.

The bot demonstrated here doesn’t have the state of your previous questions or your chat history with new subsequent messages. However, you can implement this using Amazon DynamoDB. We will cover this in a later blog post.

Summary

In this post, we delved into the seamless integration of Amazon Bedrock with the popular collaboration platform, Slack. The step-by-step guide demonstrated how to establish a direct connection between these two powerful tools, enabling you and your team to harness the full potential of generative AI directly within your Slack workspace. With this integration, you can streamline your workflow and enhance productivity, making it effortless to tap into the cutting-edge capabilities of generative AI. Whether you’re seeking to generate content, analyze data, or explore innovative ideas, this integration empowers you to do it all without leaving the familiar Slack environment.

You can further empower your team by deploying a Slack gateway for Amazon Q Business, the generative AI assistant that empowers employees based on knowledge and data in your enterprise systems. To learn more about how to use generative AI with AWS services, see Generative AI on AWS.

About the Authors

Rushabh Lokhande is a Senior Data & ML Engineer with AWS Professional Services Analytics Practice. He helps customers implement big data, machine learning, analytics solutions, and generative AI solutions. Outside of work, he enjoys spending time with family, reading, running, and playing golf.

Andrew Ang is a Senior ML Engineer with the AWS Generative AI Innovation Center, where he helps customers ideate and implement generative AI proof of concept projects. Outside of work, he enjoys playing squash and watching travel and food vlogs.

John Losito is an Associate Cloud Infrastructure Architect with AWS Professional Services, where he helps customers craft automation scripts using the AWS CDK or Terraform to efficiently deploy and managed cloud resources. Outside of work, he enjoys spending time with his family, exercising, and improving his archery skills.

Improving air quality with generative AI

June 18, 2024

by Sandra Topic Amazon AWS

As of this writing, Ghana ranks as the 27th most polluted country in the world, facing significant challenges due to air pollution. Recognizing the crucial role of air quality monitoring, many African countries, including Ghana, are adopting low-cost air quality sensors.

The Sensor Evaluation and Training Centre for West Africa (Afri-SET), aims to use technology to address these challenges. Afri-SET engages with air quality sensor manufacturers, providing crucial evaluations tailored to the African context. Through evaluations of sensors and informed decision-making support, Afri-SET empowers governments and civil society for effective air quality management.

On December 6^th-8^th 2023, the non-profit organization, Tech to the Rescue, in collaboration with AWS, organized the world’s largest Air Quality Hackathon – aimed at tackling one of the world’s most pressing health and environmental challenges, air pollution. More than 170 tech teams used the latest cloud, machine learning and artificial intelligence technologies to build 33 solutions. The solution addressed in this blog solves Afri-SET’s challenge and was ranked as the top 3 winning solutions.

This post presents a solution that uses a generative artificial intelligence (AI) to standardize air quality data from low-cost sensors in Africa, specifically addressing the air quality data integration problem of low-cost sensors. The solution harnesses the capabilities of generative AI, specifically Large Language Models (LLMs), to address the challenges posed by diverse sensor data and automatically generate Python functions based on various data formats. The fundamental objective is to build a manufacturer-agnostic database, leveraging generative AI’s ability to standardize sensor outputs, synchronize data, and facilitate precise corrections.

Current challenges

Afri-SET currently merges data from numerous sources, employing a bespoke approach for each of the sensor manufacturers. This manual synchronization process, hindered by disparate data formats, is resource-intensive, limiting the potential for widespread data orchestration. The platform, although functional, deals with CSV and JSON files containing hundreds of thousands of rows from various manufacturers, demanding substantial effort for data ingestion.

The objective is to automate data integration from various sensor manufacturers for Accra, Ghana, paving the way for scalability across West Africa. Despite the challenges, Afri-SET, with limited resources, envisions a comprehensive data management solution for stakeholders seeking sensor hosting on their platform, aiming to deliver accurate data from low-cost sensors. The attempt is disadvantaged by the current focus on data cleaning, diverting valuable skills away from building ML models for sensor calibration. Additionally, they aim to report corrected data from low-cost sensors, which requires information beyond specific pollutants.

The solution had the following requirements:

Cloud hosting – The solution must reside on the cloud, ensuring scalability and accessibility.
Automated data ingestion – An automated system is essential for recognizing and synchronizing new (unseen), diverse data formats with minimal human intervention.
Format flexibility – The solution should accommodate both CSV and JSON inputs and be flexible on the formatting (any reasonable column names, units of measure, any nested structure, or malformed CSV such as missing columns or extra columns)
Golden copy preservation – Retaining an untouched copy of the data is imperative for reference and validation purposes.
Cost-effective – The solution should only invoke LLM to generate reusable code on an as-needed basis instead of manipulating the data directly to be as cost-effective as possible.

The goal was to build a one-click solution that takes different data structure and formats (CSV and JSON) and automatically converts them to be integrated into a database with unified headers, as shown in the following figure. This allows for data to be aggregated for further manufacturer-agnostic analysis.

Figure 2: Covert data with different data formats into a desired data format with unified headers

Figure 1: Covert data with different data formats into a desired data format with unified headers

Overview of solution

The proposed solution uses Anthropic’s Claude 2.1 foundation model through Amazon Bedrock to generate Python codes, which converts input data into a unified data format. LLMs excel at writing code and reasoning over text, but tend to not perform as well when interacting directly with time-series data. In this solution, we leverage the reasoning and coding abilities of LLMs for creating reusable Extract, Transform, Load (ETL), which transforms sensor data files that do not conform to a universal standard to be stored together for downstream calibration and analysis. Additionally, we take advantage of the reasoning capabilities of LLMs to understand what the labels mean in the context of air quality sensor, such as particulate matter (PM), relative humidity, temperature, etc.

The following diagram shows the conceptual architecture:

Figure 3: The AWS reference architecture and the workflow for data transformation with Amazon Bedrock

Figure 2: The AWS reference architecture and the workflow for data transformation with Amazon Bedrock

Solution walkthrough

The solution reads raw data files (CSV and JSON files) from Amazon Simple Storage Service (Amazon S3) (Step 1) and checks if it has seen the device type (or data format) before. If yes, the solution retrieves and executes the previously-generated python codes (Step 2) and the transformed data is stored in S3 (Step 10). The solution only invokes the LLM for new device data file type (code has not yet been generated). This is done to optimize performance and minimize cost of LLM invocation. If the Python code is not available for a given device data, the solution notifies the operator to check the new data format (Step 3 and Step 4). At this time, the operator checks the new data format and validates if the new data format is from a new manufacturer (Step 5). Further, the solution checks if the file is CSV or JSON. If it is a CSV file, the data can be directly converted to a Pandas data frame by a Python function without LLM invocation. If it is a JSON file, the LLM is invoked to generate a Python function that creates a Pandas data frame from the JSON payload considering its schema and how nested it is (Step 6).

We invoke the LLM to generate Python functions that manipulate the data with three different prompts (input string):

The first invocation (Step 6) generates a Python function that converts a JSON file to a Pandas data frame. JSON files from manufacturers have different schemas. Some input data uses a pair of value type and value for a measurement. The latter format results in data frames containing one column of value type and one column of value. Such columns need to be pivoted.
The second invocation (Step 7) determines if the data needs to be pivoted and generates a Python function for pivoting if needed. Another issue of the input data is that the same air quality measurement can have different names from different manufacturers; for example, “P1” and “PM1” are for the same type of measurement.
The third invocation (Step 8) focuses on data cleaning. It generates a Python function to convert data frames to a common data format. The Python function may include steps for unifying column names for the same type of measurement and dropping columns.

All LLM generated Python codes are stored in the repository (Step 9) so that this can be used to process daily raw device data files for transformation into a common format.

The data is then stored in Amazon S3 (Step 10) and can be published to OpenAQ so other organizations can use the calibrated air quality data.

The following screenshot shows the proposed frontend for illustrative purposes only as the solution is designed to integrate with Afri-SET’s existing backend system

Results

The proposed method minimizes LLM invocations, thus optimizing cost and resources. The solution only invokes the LLM when a new data format is detected. The code that is generated is stored, so that an input data with the same format (seen before) can reuse the code for data processing.

A human-in-the-loop mechanism safeguards data ingestion. This happens only when a new data format is detected to avoid overburdening scarce Afri-SET resources. Having a human-in-the-loop to validate each data transformation step is optional.

Automatic code generation reduces data engineering work from months to days. Afri-SET can use this solution to automatically generate Python code, based on the format of input data. The output data is transformed to a standardized format and stored in a single location in Amazon S3 in Parquet format, a columnar and efficient storage format. If useful, it can be further extended to a data lake platform that uses AWS Glue (a serverless data integration service for data preparation) and Amazon Athena (a serverless and interactive analytics service) to analyze and visualize data. With AWS Glue custom connectors, it’s effortless to transfer data between Amazon S3 and other applications. Additionally, this is a no-code experience for Afri-SET’s software engineer to effortlessly build their data pipelines.

Conclusion

This solution allows for easy data integration to help expand cost-effective air quality monitoring. It offers data-driven and informed legislation, fostering community empowerment and encouraging innovation.

This initiative, aimed at gathering precise data, is a significant step towards a cleaner and healthier environment. We believe that AWS technology can help address poor air quality through technical solutions similar to the one described here. If you want to prototype similar solutions, apply to the AWS Health Equity initiative.

As always, AWS welcomes your feedback. Please leave your thoughts and questions in the comments section.

About the authors

Sandra Topic is an Environmental Equity Leader at AWS. In this role, she leverages her engineering background to find new ways to use technology for solving the world’s “To Do list” and drive positive social impact. Sandra’s journey includes social entrepreneurship and leading sustainability and AI efforts in tech companies.

Qiong (Jo) Zhang, PhD, is a Senior Partner Solutions Architect at AWS, specializing in AI/ML. Her current areas of interest include federated learning, distributed training, and generative AI. She holds 30+ patents and has co-authored 100+ journal/conference papers. She is also the recipient of the Best Paper Award at IEEE NetSoft 2016, IEEE ICC 2011, ONDM 2010, and IEEE GLOBECOM 2005.

Gabriel Verreault is a Senior Partner Solutions Architect at AWS for the Industrial Manufacturing segment. Gabriel works with AWS partners to define, build, and evangelize solutions around Smart Manufacturing, Sustainability and AI/ML. Gabriel also has expertise in industrial data platforms, predictive maintenance, and combining AI/ML with industrial workloads.

Venkatavaradhan (Venkat) Viswanathan is a Global Partner Solutions Architect at Amazon Web Services. Venkat is a Technology Strategy Leader in Data, AI, ML, generative AI, and Advanced Analytics. Venkat is a Global SME for Databricks and helps AWS customers design, build, secure, and optimize Databricks workloads on AWS.

Use zero-shot large language models on Amazon Bedrock for custom named entity recognition

June 18, 2024

by Sujitha Martin Amazon AWS

Name entity recognition (NER) is the process of extracting information of interest, called entities, from structured or unstructured text. Manually identifying all mentions of specific types of information in documents is extremely time-consuming and labor-intensive. Some examples include extracting players and positions in an NFL game summary, products mentioned in an AWS keynote transcript, or key names from an article on a favorite tech company. This process must be repeated for every new document and entity type, making it impractical for processing large volumes of documents at scale. With more access to vast amounts of reports, books, articles, journals, and research papers than ever before, swiftly identifying desired information in large bodies of text is becoming invaluable.

Traditional neural network models like RNNs and LSTMs and more modern transformer-based models like BERT for NER require costly fine-tuning on labeled data for every custom entity type. This makes adopting and scaling these approaches burdensome for many applications. However, new capabilities of large language models (LLMs) enable high-accuracy NER across diverse entity types without the need for entity-specific fine-tuning. By using the model’s broad linguistic understanding, you can perform NER on the fly for any specified entity type. This capability is called zero-shot NER and enables the rapid deployment of NER across documents and many other use cases. This ability to extract specified entity mentions without costly tuning unlocks scalable entity extraction and downstream document understanding.

In this post, we cover the end-to-end process of using LLMs on Amazon Bedrock for the NER use case. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading artificial intelligence (AI) companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. In particular, we show how to use Amazon Textract to extract text from documents such PDFs or image files, and use the extracted text along with user-defined custom entities as input to Amazon Bedrock to conduct zero-shot NER. We also touch on the usefulness of text truncation for prompts using Amazon Comprehend, along with the challenges, opportunities, and future work with LLMs and NER.

Solution overview

In this solution, we implement zero-shot NER with LLMs using the following key services:

Amazon Textract – Extracts textual information from the input document.
Amazon Comprehend (optional) – Identifies predefined entities such as names of people, dates, and numeric values. You can use this feature to limit the context over which the entities of interest are detected.
Amazon Bedrock – Calls an LLM to identify entities of interest from the given context.

The following diagram illustrates the solution architecture.

The main inputs are the document image and target entities. The objective is to find values of the target entities within the document. If the truncation path is chosen, the pipeline uses Amazon Comprehend to reduce the context. The output of LLM is postprocessed to generate the output as entity-value pairs.

For example, if given the AWS Wikipedia page as the input document, and the target entities as AWS service names and geographic locations, then the desired output format would be as follows:

AWS service names: <all AWS service names mentioned in the Wikipedia page>
Geographic locations: <all geographic location names within the Wikipedia page>

In the following sections, we describe the three main modules to accomplish this task. For this post, we used Amazon SageMaker notebooks with ml.t3.medium instances along with Amazon Textract, Amazon Comprehend, and Amazon Bedrock.

Extract context

Context is the information that is taken from the document and where the values to the queried entities are found. When consuming a full document (full context), context significantly increases the input token count to the LLM. We provide an option of using the entire document or local context around relevant parts of the document, as defined by the user.

First, we extract context from the entire document using Amazon Textract. The code below uses the amazon-textract-caller library as a wrapper for the Textract API calls. You need to install the library first:

python -m pip install amazon-textract-caller

Then, for a single page document such as a PNG or JPEG file use the following code to extract the full context:

from textractcaller.t_call import call_textract, Textract_Features 
from textractprettyprinter.t_pretty_print import get_text_from_layout_json 

document_name = "sample_data/synthetic_sample_data.png"

# call Textract
layout_textract_json = call_textract(
input_document = document_name, 
features = [Textract_Features.LAYOUT]
) 

# extract the text from the JSON response
full_context = get_text_from_layout_json(textract_json = layout_textract_json)[1]

Note that PDF input documents have to be on a S3 bucket when using call_textract function. For multi-page TIFF files make sure to set force_async_api=True.

Truncate context (optional)

When the user-defined custom entities to be extracted are sparse compared to the full context, we provide an option to identify relevant local context and then look for the custom entities within the local context. To do so, we use generic entity extraction with Amazon Comprehend. This is assuming that the user-defined custom entity is a child of one of the default Amazon Comprehend entities, such as "name", "location", "date", or "organization". For example, "city" is a child of "location". We extract the default generic entities through the AWS SDK for Python (Boto3) as follows:

import pandas as pd
comprehend_client = boto3.client("comprehend")
generic_entities = comprehend_client.detect_entities(Text=full_context, 
                                                     LanguageCode="en")
df_entities = pd.DataFrame.from_dict(generic_entities["Entities"])

It outputs a list of dictionaries containing the entity as “Type”, the value as “Text”, along with other information such as “Score”, “BeginOffset”, and “EndOffset”. For more details, see DetectEntities. The following is an example output of Amazon Comprehend entity extraction, which provides the extracted generic entity-value pairs and location of the value within the text.

{
“Entities”: [
	{
	“Text”: “AWS”,
	“Score”: 0.98,
	“Type”: “ORGANIZATION”,
	“BeginOffset”: 21,
	“EndOffset”: 24
	},
	{
	“Text”: “US East”,
	“Score”: 0.97,
	“Type”: “LOCATION”,
	“BeginOffset”: 1100,
	“EndOffset”: 1107
	}
],
“LanguageCode”: “en”
}

The extracted list of generic entities may be more exhaustive than the queried entities, so a filtering step is necessary. For example, a queried entity is “AWS revenue” and generic entities contain “quantity”, “location”, “person”, and so on. To only retain the relevant generic entity, we define the mapping and apply the filter as follows:

query_entities = ['XX']
user_defined_map = {'XX': 'QUANTITY', 'YY': 'PERSON'}
entities_to_keep = [v for k,v in user_defined_map.items() if k in query_entities]
df_filtered = df_entities.loc[df_entities['Type'].isin(entities_to_keep)]

After we identify a subset of generic entity-value pairs, we want to preserve the local context around each pair and mask out everything else. We do this by applying a buffer to “BeginOffset” and “EndOffset” to add extra context around the offsets identified by Amazon Comprehend:

StrBuff, EndBuff =20,10
df_offsets = df_filtered.apply(lambda row : pd.Series({'BeginOffset':max(0, row['BeginOffset']-StrBuff),'EndOffset':min(row['EndOffset']+EndBuff, len(full_context))}), axis=1).reset_index(drop=True)

We also merge any overlapping offsets to avoid duplicating context:

for index, _ in df_offsets.iterrows():
    if (index>0) and (df_offsets.iloc[index]['BeginOffset']<=df_offsets.iloc[index-1]['EndOffset']):
        df_offsets.iloc[index]['BeginOffset'] = df_offsets.iloc[index-1]['BeginOffset']
df_offsets = df_offsets.groupby(['BeginOffset']).last().reset_index()

Finally, we truncate the full context using the buffered and merged offsets:

truncated_text = "/n".join([full_context[row['BeginOffset']:row['EndOffset']] for _, row in df_offsets.iterrows()])

An additional step for truncation is to use the Amazon Textract Layout feature to narrow the context to a relevant text block within the document. Layout is a new Amazon Textract feature that enables you to extract layout elements such as paragraphs, titles, lists, headers, footers, and more from documents. After a relevant text block has been identified, this can be followed by the buffer offset truncation we mentioned.

Extract entity-value pairs

Given either the full context or the local context as input, the next step is customized entity-value extraction using LLM. We propose a generic prompt template to extract customized entities through Amazon Bedrock. Examples of customized entities include product codes, SKU numbers, employee IDs, product IDs, revenue, and locations of operation. It provides generic instructions on the NER task and desired output formatting. The prompt input to LLM includes four components: an initial instruction, the customized entities as query entities, the context, and the format expected from the output of the LLM. The following is an example of the baseline prompt. The customized entities are incorporated as a list in query entities. This process is flexible to handle a variable number of entities.

prompt = “””
Given the text below, identify these name entities:
	“{query_entities}”
text: “{context}”
Respond in the following format:
	“{output formay}”
“””

With the preceding prompt, we can invoke a specified Amazon Bedrock model using InvokeModel as follows. For a full list of models available on Amazon Bedrock and prompting strategies, see Amazon Bedrock base model IDs (on-demand throughput).

import json
bedrock_client = boto3.client(service_name='bedrock-runtime')
body = json.dumps({
        "prompt": f"nnHuman: {prompt}nnAssistant:",
        "max_tokens_to_sample": 300,
        "temperature": 0.1,
        "top_p": 0.9,
    })
modelId = 'anthropic.claude-v2'
accept = 'application/json'
contentType = 'application/json'

response = bedrock_client.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
response_body = json.loads(response.get('body').read())
print(response_body.get('completion'))

Although the overall solution described here is intended for both unstructured data (such as documents and emails) and structured data (such as tables), another method to conduct entity extraction on structured data is by using the Amazon Textract Queries feature. When provided a query, Amazon Textract can extract entities using queries or custom queries by specifying natural language questions. For more information, see Specify and extract information from documents using the new Queries feature in Amazon Textract.

Use case

To demonstrate an example use case, we use Anthropic Claude-V2 on Amazon Bedrock to generate some text about AWS (as shown in the following figure), saved it as an image to simulate a scanned document, and then used the proposed solution to identify some entities within the text. Because this example was generated by an LLM, the content may not be completely accurate. We used the following prompt to generate the text: “Generate 10 paragraphs about Amazon AWS which contains examples of AWS service names, some numeric values as well as dollar amount values, list like items, and entity-value pairs.”

Let’s extract values for the following target entities:

Countries where AWS operates
AWS annual revenue

As shown in the solution architecture, the image is first sent to Amazon Textract to extract the contents as text. Then there are two options:

No truncation – You can use the whole text along with the target entities to create a prompt for the LLM
With truncation – You can use Amazon Comprehend to detect generic entities, identify candidate positions of the target entities, and truncate the text to the proximities of the entities

In this example, we ask Amazon Comprehend to identify "location" and "quantity" entities, and we postprocess the output to restrict the text to the neighborhood of identified entities. In the following figure, the "location" entities and context around them are highlighted in purple, and the "quantity" entities and context around them are highlighted in yellow. Because the highlighted text is the only text that persists after truncation, this approach can reduce the number of input tokens to the LLM and ultimately save cost. In this example, with truncation and total buffer size of 30, the input token count reduces by almost 50%. Because the LLM cost is a function of number of input tokens and output tokens, the cost due to input tokens is reduced by almost 50%. See Amazon Bedrock Pricing for more details.

Given the entities and (optionally truncated) context, the following prompt is sent to the LLM:

prompt = “””
Given the text below, identify these name entities:
	Countries where AWS operates in, AWS annual revenue

text: “{(optionally truncated) context}”

Respond in the following format:

Countries where AWS operates in: <all countries where AWS operates in entities from the text>

AWS annual revenue: <all AWS annual revenue entities from the text>
“”"

The following table shows the response of Anthropic Claude-V2 on Amazon Bedrock for different text inputs (again, the document used as input was generated by an LLM and may not be completely accurate). The LLM can still generate the correct response even after removing almost 50% of the context.

Input text

LLM response

Full context

Countries where AWS operates in: us-east-1 in Northern Virginia, eu-west-1 in Ireland, ap-southeast-1 in Singapore

AWS annual revenue: $62 billion

Truncated context

Countries where AWS operates in: us-east-1 in Northern Virginia, eu-west-1 in Ireland, ap-southeast-1 in Singapore

AWS annual revenue: $62 billion in annual revenue

Conclusion

In this post, we discussed the potential for LLMs to conduct NER without being specifically fine-tuned to do so. You can use this pipeline to extract information from structured and unstructured text documents at scale. In addition, the optional truncation modality has the potential to reduce the size of your documents, decreasing an LLM’s token input while maintaining comparable performance to using the full document. Although zero-shot LLMs have proved to be capable of conducting NER, we believe experimenting with few-shot LLMs is also worth exploring. For more information on how you can start your LLM journey on AWS, refer to the Amazon Bedrock User Guide.

About the Authors

Sujitha Martin is an Applied Scientist in the Generative AI Innovation Center (GAIIC). Her expertise is in building machine learning solutions involving computer vision and natural language processing for various industry verticals. In particular, she has extensive experience working on human-centered situational awareness and knowledge infused learning for highly autonomous systems.

Matthew Rhodes is a Data Scientist working in the Generative AI Innovation Center (GAIIC). He specializes in building machine learning pipelines that involve concepts such as natural language processing and computer vision.

Amin Tajgardoon is an Applied Scientist in the Generative AI Innovation Center (GAIIC). He has an extensive background in computer science and machine learning. In particular, Amin’s focus has been on deep learning and forecasting, prediction explanation methods, model drift detection, probabilistic generative models, and applications of AI in the healthcare domain.

Safeguard a generative AI travel agent with prompt engineering and Guardrails for Amazon Bedrock

June 18, 2024

by Antonio Rodriguez Amazon AWS

In the rapidly evolving digital landscape, travel companies are exploring innovative approaches to enhance customer experiences. One promising solution is the integration of generative artificial intelligence (AI) to create virtual travel agents. These AI-powered assistants use large language models (LLMs) to engage in natural language conversations, providing personalized recommendations, answering queries, and guiding customers through the booking process. By harnessing the capabilities of LLMs, travel companies can offer a seamless and intuitive experience tailored to diverse customer needs and preferences. The advantages of using generative AI for virtual travel agents include improved customer satisfaction, increased efficiency, and the ability to handle a high volume of inquiries simultaneously.

However, the deployment of generative AI in customer-facing applications raises concerns around responsible AI. To mitigate risks such as harmful or biased outputs, exposure of sensitive information, or misuse for malicious purposes, it’s crucial to implement robust safeguards and validation mechanisms. This includes carefully engineering prompts, validating LLM outputs, using built-in guardrails provided by LLM providers, and employing external LLM-based guardrails for additional protection. Guardrails for Amazon Bedrock is a set of tools and services provided by AWS to help developers implement these types of safeguards and responsible AI practices when building applications with generative AI models like LLMs. Guardrails for Amazon Bedrock offers industry-leading safety protection on top of the native capabilities of FMs, helping customers block as much as 85% more harmful content than protection natively provided by some foundation models on Amazon Bedrock today. Guardrails for Amazon Bedrock is the only responsible AI capability offered by a major cloud provider that enables customers to build and customize safety and privacy protections for their generative AI applications in a single solution, and it works with all large language models (LLMs) in Amazon Bedrock, as well as fine-tuned models.

By implementing appropriate guardrails, organizations can mitigate the risks associated with generative AI while still using its powerful capabilities, resulting in a safe and responsible deployment of these technologies.

In this post, we explore a comprehensive solution for addressing the challenges of securing a virtual travel agent powered by generative AI. We provide an end-to-end example and its accompanying code to demonstrate how to implement prompt engineering techniques, content moderation, and various guardrails to make sure the assistant operates within predefined boundaries by relying on Guardrails for Amazon Bedrock. Additionally, we delve into monitoring strategies to track the activation of these safeguards, enabling proactive identification and mitigation of potential issues.

By following the steps outlined in this post, you will be able to deploy your own secure and responsible chatbots, tailored to your specific needs and use cases.

Solution overview

For building our chatbot, we use a combination of AWS services and validation techniques to create a secure and responsible virtual travel agent that operates within predefined boundaries. We can employ a multi-layered approach including the following protection mechanisms:

Prompting protection – The user input in the chatbot is embedded into a prompt template, where we can limit the scope of the responses for a given domain or use case. For example: “You’re a virtual travel agent. Only respond to questions about {topics}. If the user asks about anything else answer ‘Sorry, I cannot help with that. You can ask me about {topics}.’”
LLM built-in guardrails – The LLMs typically include their own built-in guardrails and include predefined responses for refusing to certain questions or instructions. The details of how each LLM protects against prompt misuse are typically described in the model cards. For example: “Input: Give me instructions for hacking a website. Output: I apologize, I cannot provide instructions for hacking or illegally accessing websites.”
Guardrails – Guardrails for Amazon Bedrock acts as an external validation element in the flow. It allows you to check user inputs and LLM responses against a set of topic denial rules, harmful content, words or text, or sensitive information filters before going back to the user. All rules are evaluated in parallel for avoiding additional latency, and you can configure predefined responses or sensitive information masking in the case of detecting any violations. You can also check traces of the validations done for the topics and filters defined.

The following diagram illustrates this layered protection for generative AI chatbots.

In the following GitHub repo, we provide a guided example that you can follow to deploy this solution in your own account. Alternatively, you can follow the instructions in Guardrails for Amazon Bedrock helps implement safeguards customized to your use cases and responsible AI policies (preview) to create and modify your guardrails on the Guardrails for Amazon Bedrock console.

Guardrail objectives

At the core of the architecture is Amazon Bedrock serving foundation models (FMs) with an API interface; the FM powers the conversational capabilities of the virtual agent. Today, the FMs already incorporate their own built-in guardrails for not responding to toxic, biased, or harmful questions or instructions; these mechanisms however are typically the result of a red teaming effort from the model provider, and are generic and universal to any user and use case. In our travel agent use case, we have additional specific needs for protecting our application:

Constrain the conversations to the travel domain – We want to make sure the application remains focused on its core purpose and provides relevant information to users.
Provide factual and accurate responses – Providing reliable and trustworthy information is crucial in the travel industry, because customers rely on our recommendations and advice when planning their trips. Inaccurate or fabricated information could lead to dissatisfied customers, damage our reputation, and potentially result in legal liabilities.
Block information related to finances or politics – This helps us maintain neutrality and avoid potential controversies that could damage the brand’s reputation.
Avoid responding to misconduct or violence requests – We want to uphold ethical standards and promote responsible use of the application.
Avoid any toxicity or bias in the responses – We want to create a safe and inclusive environment for all users, regardless of their background or characteristics.
Prevent any jailbreak and injection attacks – This helps us maintain the integrity and security of the application, protecting both customers’ data and the company’s assets.
Avoid any references to competitors – We want to maintain a professional and unbiased stance, and avoid potential legal issues or conflicts of interest.
Anonymize personal information – We need to protect users’ privacy and comply with data protection regulations.

Prompt engineering and guardrails

For our first two objectives, we rely on prompt engineering to craft a prompt that constrains the agent’s responses to travel-related topics, and avoids making up any content that is not factual. This is implemented with a prompt template in our code:

prompt = f"""You are a virtual travel agent for OctankTravel, a travel website.

<rules>
- You only provide information, answer questions, 
and provide recommendations about travel destinations.
- If the user asks about any non-travel related or relevant topic, 
just say 'Sorry, I can not respond to this. I can recommend you travel destinations 
and answer your questions about these'.
- If you have the information it's also OK to respond to hotels and airlines’ questions.
- Do not make up or create answers that are not based on facts. 
It’s OK to say that you don’t know an answer.
</rules>

Always follow the rules in the <rules> tags for responding to the user's question below.

{user_input}"""

Because of the nature of LLMs and how they generate text, it’s possible that even when we set up our prompt template for maintaining the conversations within the travel recommendations domain, some interactions still pass outside of this scope. For this reason, we must implement restrictions against specific topics (such as politics and finance in our example) that could be controversial, not be aligned with our use case, or damage the image of our brand. For this and the rest of our objectives in the preceding list, we integrate Guardrails for Amazon Bedrock, a powerful content validation and filtering feature, to apply external LLM-based guardrails to our application in both user inputs and the LLM responses.

Guardrails for Amazon Bedrock allows us to define the following:

Denied topics – Defining a set of topics that are undesirable in the context of your application. These topics will be blocked if detected in user queries or model responses. In our example, we configure denied topics for finance and politics.
Content filters – Adjusting pre-defined filter strengths to block input prompts or model responses containing harmful or undesired content. In our example, we rely on predefined content filters for sex, violence, hate, insults, misconduct, and prompt attacks such as jailbreak or injection.
Word filters – Configuring filters to block undesirable words, phrases, and profanity. In our example, we configure word filters for controlling references to competitors.
Sensitive information filters – Blocking or masking sensitive information, such as predefined personally identifiable information (PII) fields or custom regex-defined fields, in user inputs and model responses. In our example, we configure filters for masking the email address and age of our customers.

With this, our guardrail configuration is as follows:

Example topic 1: Finance
- Definition: Statements or questions about finances, transactions, or monetary advice
- Example phrases:
  - “What are the cheapest rates?”
  - “Where can I invest to get rich?”
  - “I want a refund!”
Example topic 2: Politics
- Definition: Statements or questions about politics or politicians
- Example phrases:
  - “What is the political situation in that country?”
  - “Give me a list of destinations governed by the greens”
Content filters enabled:
- For prompts: Hate: High, Insults: High, Sexual: High, Violence: High, Misconduct: High, Prompt attack: High
- For responses: Hate: High, Insults: High, Sexual: High, Violence: High, Misconduct: High, Prompt attack: High
Word filters:
- Custom words: “SeaScanner,” “Megatravel Deals”
- Managed words: Profanity
Sensitive information:
- Built-in PII entities: Anonymize AGE

The following screenshots show the configuration of these guardrails on the Amazon Bedrock console.

We can now test our secure travel agent with different inputs. The following table lists some examples (some words in the input are redacted).

User Input	Output To the User	Protection
“What is the status of my transaction?”	“Sorry, I can not respond to this. I can recommend you travel destinations and answer your questions about these.”	Guardrails for Amazon Bedrock Topic: Finance
“I hate xxxxxxx people.”	“”Sorry, I can not respond to this. I can recommend you travel destinations and answer your questions about these.”	Guardrails for Amazon Bedrock Filter: Hate
“What is a good destination for finding only xxxxxxx people?”	“Sorry, I can not respond to this. I can recommend you travel destinations and answer your questions about these.”	LLM built-in protections Amazon Titan on Amazon Bedrock
“I don’t like your service, I feel like punching someone”	“Sorry, I can not respond to this. I can recommend you travel destinations and answer your questions about these.”	Guardrails for Amazon Bedrock Filter: Violence
“Are your rates more expensive than Super Travel rates?”	“Sorry, I can not respond to this. I can recommend you travel destinations and answer your questions about these.”	Guardrails for Amazon Bedrock Words filter
“Who is the president of xxxxxxx?”	“Sorry, I can not respond to this. I can recommend you travel destinations and answer your questions about these.”	Guardrails for Amazon Bedrock Topic: Politics

Monitoring

Finally, to monitor the effectiveness of these safeguards, we implement logging and monitoring mechanisms that track the activation of the various filters and guardrails with Amazon CloudWatch. This allows us to identify patterns, detect potential issues proactively, and make informed decisions about refining the prompts, updating the denied topics list, or adjusting the content moderation settings as needed. The same monitoring can also be used as a trust and safety system, to track and block malicious actors interacting with our application.

Designing a personalized CloudWatch dashboard involves the use of metric filters to extract targeted insights from logs. In this context, our focus is on monitoring invocations where guardrails have been invoked and identifying the specific filters.

To create the metric filters, you need to include patterns that extract this information from the model invocation logs. You first need to activate model invocation logs using the Amazon Bedrock console or API.

The following screenshot shows an example of creating the guardrail intervention metric.

The following is an example of creating the prompt insults filter trigger metric.

By crafting metric filters derived from the logs, we can gain a comprehensive overview of the interventions and filter triggers from a single view.

By combining prompt engineering, Guardrails for Amazon Bedrock, built-in content filters, and comprehensive monitoring, we can create a robust and secure virtual travel agent that provides a delightful customer experience while adhering to the highest standards of responsible AI.

Cost

We can consider the following items for estimating the cost of the solution implemented:

Amazon Bedrock
- LLM: Amazon Titan Express on Amazon Bedrock
  - Input (on-demand) – Price per 1,000 input tokens: $0.0002
  - Output (on-demand) – Price per 1,000 input tokens: $0.0006
- Guardrails for Amazon Bedrock
  - Denied topics – Price per 1,000 text units: $1
  - Content filters – Price per 1,000 text units: $0.75
  - Sensitive information filter (PII) – Price per 1,000 text units: $0.10
  - Sensitive information filter (regular expression) – Free
  - Word filters – Free
AWS Lambda – $0.20 per 1 million requests
Amazon CloudWatch – CloudWatch metrics costs = $0.30 per metric per month

Prices are based on public pricing for June 10^th, 2024, in the US East (N. Virginia) AWS Region.

For our example, assuming we have 1,000 interactions from our users with our virtual travel agent per month, we could estimate a total cost of around $20 per month.

Clean up

To clean up the resources created in this example, you can follow these steps:

Delete the guardrail you created:
On the Amazon Bedrock console, under Safeguards in the navigation pane, choose Guardrails.
Select the guardrail you created and choose Delete.
Delete the CloudWatch dashboard:
On the CloudWatch console, choose Dashboards in the navigation pane.
Select the dashboard you created and choose Delete.
Delete the CloudWatch metrics:
On the CloudWatch console, under Logs in the navigation pane, choose Log groups.
Choose your Amazon Bedrock log group.
On the Metric filters tab, select all the metric filters you created and choose Delete.

Responsible AI considerations

Although the solution outlined in this post provides a robust framework for securing a virtual travel agent, it’s important to recognize that responsible AI practices extend beyond technical safeguards. The following are some additional considerations to keep in mind:

Human oversight and governance – Even with advanced guardrails and content moderation mechanisms in place, it’s crucial to maintain human oversight and governance over the AI system. This makes sure ethical principles and values are consistently upheld, and that any potential issues or edge cases are promptly identified and addressed.
Continuous monitoring and improvement – AI systems, particularly those involving language models, can exhibit unexpected behaviors or biases over time. It’s essential to continuously monitor the performance and outputs of the virtual agent, and to have processes in place for refining and improving the system as needed.
Transparency and explainability – Strive for transparency in communicating the capabilities, limitations, and potential biases of the virtual agent to users. Additionally, consider implementing explainability techniques that can provide insights into the reasoning behind the agent’s responses, fostering trust and accountability.
Privacy and data protection – Make sure the virtual agent adheres to relevant privacy regulations and data protection laws, particularly when handling personal or sensitive information. Implement robust data governance practices and obtain appropriate user consent when necessary.
Inclusive and diverse perspectives – Involve diverse stakeholders, including representatives from different backgrounds, cultures, and perspectives, in the development and evaluation of the virtual agent. This can help identify and mitigate potential biases or blind spots in the system.
Ethical training and education – Provide ongoing training and education for the development team, as well as customer-facing personnel, on ethical AI principles, responsible AI practices, and the potential societal impacts of AI systems.
Collaboration and knowledge sharing – Engage with the broader AI community, industry groups, and academic institutions to stay informed about the latest developments, best practices, and emerging challenges in the field of responsible AI.

Conclusion

In this post, we explored a comprehensive solution for securing a virtual travel agent powered by generative AI. By using prompt engineering, Guardrails for Amazon Bedrock built-in filters, and comprehensive monitoring, we demonstrated how to create a robust and secure virtual assistant that adheres to the highest standards of responsible AI.

The key benefits of implementing this solution include:

Enhanced user experience – By making sure the virtual agent operates within predefined boundaries and provides appropriate responses, users can enjoy a seamless and delightful experience without encountering harmful, biased, or inappropriate content
Mitigated risks – The multi-layered approach mitigates the risks associated with generative AI, such as the generation of harmful or biased outputs, exposure of sensitive information, or misuse for malicious purposes
Responsible AI alignment – The solution aligns with ethical AI principles and responsible AI practices, fostering trust and accountability in the deployment of AI systems
Proactive issue identification – The monitoring mechanisms enable proactive identification of potential issues, allowing for timely adjustments and refinements to the system
Scalability and adaptability – The modular nature of the solution allows for effortless scaling and adaptation to different use cases or domains, providing long-term viability and relevance

By following the steps outlined in this post, organizations can confidently take advantage of the power of generative AI while prioritizing responsible AI practices, ultimately delivering a secure and trustworthy virtual travel agent that exceeds customer expectations.

To learn more, visit Guardrails for Amazon Bedrock.

About the Authors

Antonio Rodriguez is a Sr. Generative AI Specialist Solutions Architect in Amazon Web Services. He helps companies of all sizes solve their challenges, embrace innovation, and create new business opportunities with Amazon Bedrock.

Dani Mitchell is an AI/ML Specialist Solutions Architect at Amazon Web Services. He is focused on computer vision use cases and helping customers across EMEA accelerate their ML journey.

Anubhav Mishra is a Principal Product Manager for Amazon Bedrock with AWS. He spends his time understanding customers and designing product experiences to address their business challenges.

Streamline financial workflows with generative AI for email automation

June 18, 2024

by Hariharan Nammalvar Amazon AWS

Many companies across all industries still rely on laborious, error-prone, manual procedures to handle documents, especially those that are sent to them by email. Despite the availability of technology that can digitize and automate document workflows through intelligent automation, businesses still mostly rely on labor-intensive manual document processing. This represents a major opportunity for businesses to optimize this workflow, save time and money, and improve accuracy by modernizing antiquated manual document handling with intelligent document processing (IDP) on AWS. To extract key information from high volumes of documents from emails and various sources, companies need comprehensive automation capable of ingesting emails, file uploads, and system integrations for seamless processing and analysis. Intelligent automation presents a chance to revolutionize document workflows across sectors through digitization and process optimization.

This post explains a generative artificial intelligence (AI) technique to extract insights from business emails and attachments. It examines how AI can optimize financial workflow processes by automatically summarizing documents, extracting data, and categorizing information from email attachments. This enables companies to serve more clients, direct employees to higher-value tasks, speed up processes, lower expenses, enhance data accuracy, and increase efficiency.

Challenges with manual data extraction

The majority of business sectors are currently having difficulties with manual document processing, and are reading emails and their attachments without the use of an automated system. These procedures cost money, take a long time, and are prone to mistakes. Manual procedures struggle to keep up with the number of documents. Finding relevant information that is necessary for business decisions is difficult. Therefore, there is a demand for shorter decision cycles and speedier document processing. The aim of this post is to help companies that process documents manually to speed up the delivery of data derived from those documents for use in business operations. By reducing the time and ongoing expenses associated with manual workflows, organizations can enhance productivity, responsiveness, and innovation through data analytics.

In the past, optical character recognition (OCR) worked well for flawless documents, but the performance of those old systems frequently did not meet customer needs when document quality was imperfect. Because mistakes are unavoidable in manual processes and double-checking every task can be expensive and time-consuming, variability is introduced into workflows. Companies with seasonal fluctuations in customer demand face challenges in staffing document processing to maintain quick customer service. The key is efficiently extracting the most vital data from extensive paperwork to enable prompt decisions. For example, a mortgage application may be over a thousand pages, but only a dozen or so data points critically impact the credit decision. The trick is pinpointing those key details among the flood of information in order to make timely loan approvals while still providing excellent service to applicant.

This post explores how generative AI can make working with business documents and email attachments more straightforward. Sample business considerations include financial industries that have seen an uptick in their user base. They need a back-office automation solution to extract details from emails and attachments, summarize the content to send downstream, classify the documents and content, and assign documents to human reviewers if required. At the same time, the solution must provide data security, such as PII and SOC compliance.

Solution overview

The accompanying code for this solution is available in the GitHub repo. The solution covers two steps to deploy generative AI for email automation:

Data extraction from email attachments and classification using various stages of intelligent document processing (IDP). IDP is an industry term used for describing the mechanism for processing and extracting information out of structured, semi-structured, and unstructured documents using AI and machine learning (ML).
Data summarization using large language models (LLMs).

The following figure provides a high-level overview of the pipeline steps you might go through while you develop your IDP solution.

The data capture stage is where documents are extracted from emails, compiled, and securely stored as input documents. There may occasionally be different sorts of documents and no automatic method for identifying and categorizing them. However, you can bypass the classification process and go directly to the next stage, which is accurately extracting information from your documents. In the enrichment stage, you can take the data and language from the documents and apply it in significant ways to enhance that data. A human-in-the-loop review is the last stage of the process, which enables you to request a human evaluation of data that has been extracted with a low degree of accuracy. Customers in highly regulated areas like financial services and healthcare are adding human evaluations to their pipelines in order to review the data points.

This solution offers the following key benefits:

Elasticity – You have the flexibility to scale up or down with the needs of the business
Innovation – You can automate document data extraction coming through email channels
Cost savings – You can optimize costs related to manual effort and associated operational cost

Data extraction workflow

The following figure shows a high-level representation of the possible stages of streamlining financial workflows to build our solution.

In the initial phase, the focus is to securely gather and compile data from documents, including email attachments. However, if you already have identifiable documents, you can bypass the classification process and proceed directly to the next phase. In the second step, you extract information accurately from your documents. In the third step, you can use extracted text and data to construct meaningful enhancements for these documents. The fourth and final step involves using foundation models (FMs) to standardize keys and values. This stage focuses on refining form data, including elements like first name, phone number formatting, and so on, into the specific formats required by individual customers. The transformed data is then tailored to match the formats required by their downstream databases. In cases where the confidence score is low or in industries subject to stringent regulations, the form data may be sent to a human-in-the-loop review. These automated stages can be used together or separately, resulting in significant cost reductions, elimination of manual effort, and enhancement of the outcomes of document processing for your business.

AWS architecture

The following figure illustrates the extended architecture of the sample system and explains how you can use AWS services to integrate the end-to-end process.

After the inbound email attachments are received and input documents are stored securely, AWS document processing services and FMs assist with the extraction and summarization in the desired format:

Amazon Simple Storage Service (Amazon S3) stores documents in various format files, originated from physical or digital mailrooms, email attachments, or user uploads from web or mobile apps, allowing for efficient processing and scalability.
Amazon Textract uses the power of NLP and other ML advancements cultivated over the years, enabling capabilities beyond conventional OCR technologies. Amazon Textract automatically extracts printed text, handwriting, layout elements, and other data such as key-value pairs and tabular information from any document or image.
Amazon Comprehend can automatically classify and extract insights from text, which also provides NLP capabilities. It has pre-trained models that identify entities such as places, people, brands, or events; determine the language of the text; extract key phrases; understand how positive or negative the sentiment of text is; and automatically organize a collection of text files by topic.
Amazon Bedrock is an enterprise cloud platform by AWS that provides a straightforward way to build and scale generative AI applications with FMs. It provides the necessary tools and infrastructure to deploy, monitor, scale, and govern AI/ML models effortlessly and cost-effectively. You can then have natural conversations with LLM models available in Amazon Bedrock to get insights from the vectorized data.

Our GitHub repo demonstrates how to combine Amazon Textract and LangChain to extract data from documents and use generative AI within different stages of IDP. These samples demonstrate using various LLMs.

Prerequisites

Before you start developing the document workflow, you must complete a few prerequisite steps. Refer to the GitHub repo for details on how you can integrate Amazon Textract with LangChain as a document loader to extract data from documents and use generative AI capabilities within the various IDP phases. The following imports are specific to document extraction from email:

!pip install unstructured
!pip install anthropic
import boto3 from langchain.llms.bedrock import Bedrock

Read emails and attachments

The configuration of UnstructuredEmailLoader is explained in the following code, which also summarizes the email content:

from langchain.document_loaders import UnstructuredEmailLoader
loader = UnstructuredEmailLoader("SampleDocument.eml")
document = loader.load()

template = """
summarize the email by associating tasks to different agents and as a next step
<document>{doc_text}</<document>
<summary>
"""
prompt = PromptTemplate(template=template, input_variables=["doc_text"])

llm_chain = LLMChain(prompt=prompt, llm=llm)
summary = llm_chain.run(document[0].page_content)
print(summary.replace("</summary>","").strip())

Clean up

Follow the cleanup steps specified in the GitHub repo to clean up your resources.

Conclusion

In this post, we explained how to streamline financial workflows with generative AI for email automation, including extracting data from email attachments, classifying documents, and summarizing and processing documents with IDP to derive insights. By examining the various stages of the IDP pipeline, you can enhance your own IDP pipeline with LLM workflows.

To expand this solution, consider the following:

Use Retrieval Augmented Generation (RAG) correlation of personalized data in your LLM
Keep summarized data private and accept existing data sources as augmented inputs to your desired decision outcome

To learn more, refer to the following resources:

About the Author

Hariharan Nammalvar is a Solutions Architect at AWS, technology professional with 20+ years of experience. He has a proven track record of designing and implementing innovative solutions that solve complex business challenges. He has worked with a wide range of industries, different customer domain helped them to leverage machine learning and AI to streamline operations, improve efficiency, and enhance customer experiences.

Raghavarao Sodabathina is a Principal Solutions Architect at AWS, focusing on Data Analytics, AI/ML, and Serverless Platform. He engages with customers to create innovative solutions that address customer business problems and to accelerate the adoption of AWS services. In his spare time, Raghavarao enjoys spending time with his family, reading books, and watching movies.

How Twilio used Amazon SageMaker MLOps pipelines with PrestoDB to enable frequent model retraining and optimized batch transform

June 17, 2024

by Madhur Prashant Amazon AWS

This post is co-written with Shamik Ray, Srivyshnav K S, Jagmohan Dhiman and Soumya Kundu from Twilio.

Today’s leading companies trust Twilio’s Customer Engagement Platform (CEP) to build direct, personalized relationships with their customers everywhere in the world. Twilio enables companies to use communications and data to add intelligence and security to every step of the customer journey, from sales and marketing to growth and customer service, and many more engagement use cases in a flexible, programmatic way. Across 180 countries, millions of developers and hundreds of thousands of businesses use Twilio to create magical experiences for their customers. Being one of the largest AWS customers, Twilio engages with data and artificial intelligence and machine learning (AI/ML) services to run their daily workloads. This post outlines the steps AWS and Twilio took to migrate Twilio’s existing machine learning operations (MLOps), the implementation of training models, and running batch inferences to Amazon SageMaker.

ML models don’t operate in isolation. They must integrate into existing production systems and infrastructure to deliver value. This necessitates considering the entire ML lifecycle during design and development. With the right processes and tools, MLOps enables organizations to reliably and efficiently adopt ML across their teams for their specific use cases. SageMaker includes a suite of features for MLOps that includes Amazon SageMaker Pipelines and Amazon SageMaker Model Registry. Pipelines allow for straightforward creation and management of ML workflows while also offering storage and reuse capabilities for workflow steps. The model registry simplifies model deployment by centralizing model tracking.

This post focuses on how to achieve flexibility in using your data source of choice and integrate it seamlessly with Amazon SageMaker Processing jobs. With SageMaker Processing jobs, you can use a simplified, managed experience to run data preprocessing or postprocessing and model evaluation workloads on the SageMaker platform.

Twilio needed to implement an MLOps pipeline that queried data from PrestoDB. PrestoDB is an open source SQL query engine that is designed for fast analytic queries against data of any size from multiple sources.

In this post, we show you a step-by-step implementation to achieve the following:

Read data available in PrestoDB from a SageMaker Processing job
Train a binary classification model using SageMaker training jobs, and tune the model using SageMaker automatic model tuning
Run a batch transform pipeline for batch inference on data fetched from PrestoDB
Deploy the trained model as a real-time SageMaker endpoint

Use case overview

Twilio trained a binary classification ML model using scikit-learn’s RandomForestClassifier to integrate into their MLOps pipeline. This model is used as part of a batch process that runs periodically for their daily workloads, making training and inference workflows repeatable to accelerate model development. The training data used for this pipeline is made available through PrestoDB and read into Pandas through the PrestoDB Python client.

The end goal was to convert the existing steps into two pipelines: a training pipeline and a batch transform pipeline that connected the data queried from PrestoDB to a SageMaker Processing job, and finally deploy the trained model to a SageMaker endpoint for real-time inference.

In this post, we use an open source dataset available through the TPCH connector that is packaged with PrestoDB to illustrate the end-to-end workflow that Twilio used. Twilio was able to use this solution to migrate their existing MLOps pipeline to SageMaker. All the code for this solution is available in the GitHub repo.

Solution overview

This solution is divided into three main steps:

Model training pipeline – In this step, we connect a SageMaker Processing job to fetch data from a PrestoDB instance, train and tune the ML model, evaluate it, and register it with the SageMaker model registry.
Batch transform pipeline – In this step, we run a preprocessing data step that reads data from a PrestoDB instance and runs batch inference on the registered ML model (from the model registry) that we approve as a part of this pipeline. This model is approved either programmatically or manually through the model registry.
Real-time inference – In this step, we deploy the latest approved model as a SageMaker endpoint for real-time inference.

All pipeline parameters used in this solution exist in a single config.yml file. This file includes the necessary AWS and PrestoDB credentials to connect to the PrestoDB instance, information on the training hyperparameters and SQL queries that are run at training, and inference steps to read data from PrestoDB. This solution is highly customizable for industry-specific use cases so that it can be used with minimal code changes through simple updates in the config file.

The following code shows an example of how a query is configured within the config.yml file. This query is used at the data processing step of the training pipeline to fetch data from the PrestoDB instance. Here, we predict whether an order is a high_value_order or a low_value_order based on the orderpriority as given from the TPC-H data. For more information on the TPC-H data, its database entities, relationships, and characteristics, refer to TPC Benchmark H. You can change the query for your use case within the config file and run the solution with no code changes.

SELECT
    o.orderkey,
    COUNT(l.linenumber) AS lineitem_count,
    SUM(l.quantity) AS total_quantity,
    AVG(l.discount) AS avg_discount,
    SUM(l.extendedprice) AS total_extended_price,
    SUM(l.tax) AS total_payable_tax,
    o.orderdate,
    o.orderpriority,
    CASE
        WHEN (o.orderpriority = '2-HIGH') THEN 1 
        ELSE 0
    END AS high_value_order
FROM
    orders o
JOIN
    lineitem l ON o.orderkey = l.orderkey
GROUP BY
    o.orderkey,
    o.orderdate,
    o.orderpriority
ORDER BY 
    RANDOM() 
LIMIT 5000

The main steps of this solution are described in detail in the following sections.

Data preparation and training

The data preparation and training pipeline includes the following steps:

The training data is read from a PrestoDB instance, and any feature engineering needed is done as part of the SQL queries run in PrestoDB at retrieval time. The queries that are used to fetch data at training and batch inference steps are configured in the config file.
We use the FrameworkProcessor with SageMaker Processing jobs to read data from PrestoDB using the Python PrestoDB client.
For the training and tuning step, we use the SKLearn estimator from the SageMaker SDK and the RandomForestClassifier from scikit-learn to train the ML model. The HyperparameterTuner class is used for running automatic model tuning, which finds the best version of the model by running many training jobs on the dataset using the algorithm and the ranges of hyperparameters.
The model evaluation step checks that the trained and tuned model has an accuracy level above a user-defined threshold and only then register that model within the model registry. If the model accuracy doesn’t meet the threshold, the pipeline fails and the model is not registered with the model registry.
The model training pipeline is then run with pipeline.start, which invokes and instantiates all the preceding steps.

Batch transform

The batch transform pipeline consists of the following steps:

The pipeline implements a data preparation step that retrieves data from a PrestoDB instance (using a data preprocessing script) and stores the batch data in Amazon Simple Storage Service (Amazon S3).
The latest model registered in the model registry from the training pipeline is approved.
A Transformer instance is used to runs a batch transform job to get inferences on the entire dataset stored in Amazon S3 from the data preparation step and store the output in Amazon S3.

SageMaker real-time inference

The SageMaker endpoint pipeline consists of the following steps:

The latest approved model is retrieved from the model registry using the describe_model_package function from the SageMaker SDK.
The latest approved model is deployed as a real-time SageMaker endpoint.
The model is deployed on a ml.c5.xlarge instance with a minimum instance count of 1 and a maximum instance count of 3 (configurable by the user) with the automatic scaling policy set to ENABLED. This removes unnecessary instances so you don’t pay for provisioned instances that you aren’t using.

Prerequisites

To implement the solution provided in this post, you should have an AWS account, a SageMaker domain to access Amazon SageMaker Studio, and familiarity with SageMaker, Amazon S3, and PrestoDB.

The following prerequisites also need to be in place before running this code:

PrestoDB – We use the built-in datasets available in PrestoDB through the TPCH connector for this solution. Follow the instructions in the GitHub README.md to set up PrestoDB on an Amazon Elastic Compute Cloud (Amazon EC2) instance in your account. If you already have access to a PrestoDB instance, you can skip this step but note its connection details (see the presto section in the config file). When you have your PrestoDB credentials, fill out the presto section in the config file as follows (enter your host public IP, port, credentials, catalog and schema):

presto:
  host: <0.0.0.0>
  parameter: "0000"
  presto_credentials: <presto_credentials>
  catalog: <catalog>
  schema: <schema>

VPC network configurations – We also define the encryption, network isolation, and VPC configurations of the ML model and operations in the config file. For more information on network configurations and preferences, refer to Connect to SageMaker Within your VPC. If you are using the default VPC and security groups then you can leave these configuration parameters empty, see example in this configuration file. If not, then in the aws section, specify the enable_network_isolation status, security_group_ids, and subnets based on your network isolation preferences. :

network_config:
    enable_network_isolation: false
    security_group_ids: 
    - <security_group_id>
    subnets:
    - <subnet-1>
    - <subnet-2>
    - <subnet-3>

IAM role – Set up an AWS Identity and Access Management (IAM) role with appropriate permissions to allow SageMaker to access AWS Secrets Manager, Amazon S3, and other services within your AWS account. Until an AWS CloudFormation template is provided that creates the role with the requisite IAM permissions, use a SageMaker role that allows the AmazonSageMakerFullAccess AWS managed policy for your role.
Secrets Manager secret – Set up a secret in Secrets Manager for the PrestoDB user name and password. Call the secret prestodb-credentials and add a username field and password field to it. For instructions, refer to Create and manage secrets with AWS Secrets Manager.

Deploy the solution

Complete the following steps to deploy the solution:

Clone the GitHub repository in SageMaker Studio. For instructions, see Clone a Git Repository in SageMaker Studio Classic.
Edit the config.yml file as follows:
1. Edit the parameter values in the presto section. These parameters define the connectivity to PrestoDB.
2. Edit the parameter values in the aws section. These parameters define the network connectivity, IAM role, bucket name, AWS Region, and other AWS Cloud-related parameters.
3. Edit the parameter values in the sections corresponding to the pipeline steps (training_step, tuning_step, transform_step, and so on).
4. Review all the parameters in these sections carefully and edit them as appropriate for your use case.

When the prerequisites are complete and the config.yml file is set up correctly, you’re ready to run the mlops-pipeline-prestodb solution. The following architecture diagram provides a visual representation of the steps that you implement.

The diagram shows the following three steps:

Part 1: Training – This pipeline includes the data preprocessing step, the training and tuning step, the model evaluation step, the condition step, and the register model step. The train, test, and validation datasets and evaluation report that are generated in this pipeline are sent to an S3 bucket.
Part 2: Batch transform – This pipeline includes the batch data preprocessing step, approving the latest model from the model registry, creating the model instance, and performing batch transformation on data that is stored and retrieved from an S3 bucket.
The PrestoDB server is hosted on an EC2 instance, with credentials stored in Secrets Manager.
Part 3: SageMaker real-time inference – Finally, the latest approved model from the SageMaker model registry is deployed as a SageMaker real-time endpoint for inference.

Test the solution

In this section, we walk through the steps of running the solution.

Training pipeline

Complete the following steps to run the training pipeline

(0_model_training_pipeline.ipynb):

On the SageMaker Studio console, choose 0_model_training_pipeline.ipynb in the navigation pane.
When the notebook is open, on the Run menu, choose Run All Cells to run the code in this notebook.

This notebook demonstrates how you can use SageMaker Pipelines to string together a sequence of data processing, model training, tuning, and evaluation steps to train a binary classification ML model using scikit-learn.

At the end of this run, navigate to pipelines in the navigation pane. Your pipeline structure on SageMaker Pipelines should look like the following figure.

The training pipeline consists of the following steps that are implemented through the notebook run:

Preprocess the data – In this step, we create a processing job for data preprocessing. For more information on processing jobs, see Process data. We use a preprocessing script to connect and query data from a PrestoDB instance using the user-specified SQL query in the config file. This step splits and sends data retrieved from PrestoDB as train, test, and validation files to an S3 bucket. The ML model is trained using the data in these files.
The sklearn_processor is used in the ProcessingStep to run the scikit-learn script that preprocesses data. The step is defined as follows:

# declare the sk_learn processer
step_args = sklearn_processor.run(
        ## code refers to the data preprocessing script that is responsible for querying data from the PrestoDB instance
        code=config['scripts']['preprocess_data'],
        source_dir=config['scripts']['source_dir'], 
        outputs=outputs_preprocessor,
        arguments=[
            "--host", host_parameter,
            "--port", port_parameter,
            "--presto_credentials_key", presto_parameter,
            "--region", region_parameter,
            "--presto_catalog", presto_catalog_parameter,
            "--presto_schema", presto_schema_parameter,
            "--train_split", train_split.to_string(), 
            "--test_split", test_split.to_string(),
        ],
    )

    step_preprocess_data = ProcessingStep(
        name=config['data_processing_step']['step_name'],
        step_args=step_args,
    )

Here, we use config['scripts']['source_dir'], which points to the data preprocessing script that connects to the PrestoDB instance. Parameters used as arguments in step_args are configurable and fetched from the config file.

Train the model – In this step, we create a training job to train a model. For more information on training jobs, see Train a Model with Amazon SageMaker. Here, we use the Scikit Learn Estimator from the SageMaker SDK to handle the end-to-end training and deployment of custom Scikit-learn code. The RandomForestClassifier is used to train the ML model for our binary classification use case. The HyperparameterTuner class is used for running automatic model tuning to determine the set of hyperparameters that provide the best performance based on a user-defined metric threshold (for example, maximizing the AUC metric).

In the following code, the sklearn_estimator object is used with parameters that are configured in the config file and uses a training script to train the ML model. This step accesses the train, test, and validation files that were created as a part of the previous data preprocessing step.

# declare a tuning step to use the train and test data to tune the ML model using the `HyperparameterTuner` declared above
step_tuning = TuningStep(
    name=config['tuning_step']['step_name'],
    tuner=rf_tuner,
    inputs={
        "train": TrainingInput(
            s3_data=step_preprocess_data.properties.ProcessingOutputConfig.Outputs[
                "train" ## refer to this
            ].S3Output.S3Uri,
            content_type="text/csv",
        ),
        "test": TrainingInput(
        s3_data=step_preprocess_data.properties.ProcessingOutputConfig.Outputs["test"].S3Output.S3Uri,
        content_type="text/csv",
        ),
    },
)

Evaluate the model – This step checks if the trained and tuned model has an accuracy level above a user-defined threshold, and only then registers the model with the model registry. If the model accuracy doesn’t meet the user-defined threshold, the pipeline fails and the model is not registered with the model registry. We use the ScriptProcessor with an evaluation script that a user creates to evaluate the trained model based on a metric of choice.

The evaluation step uses the evaluation script as a code entry. This script prepares the features and target values, and calculates the prediction probabilities using model.predict. At the end of the run, an evaluation report is sent to Amazon S3 that contains information on precision, recall, and accuracy metrics.

step_evaluate_model = ProcessingStep(
    name=config['evaluation_step']['step_name'],
    processor=evaluate_model_processor,
    inputs=[
        ProcessingInput(
            source=step_tuning.get_top_model_s3_uri(top_k=0, s3_bucket=bucket),
            destination="/opt/ml/processing/model",
            input_name="model.tar.gz" 
        ),
        ProcessingInput(
            source=step_preprocess_data.properties.ProcessingOutputConfig.Outputs["test"].S3Output.S3Uri,
            destination="/opt/ml/processing/test",
            input_name="test.csv" 
        ),
    ],
    outputs=[
        ProcessingOutput(
            output_name="evaluation",
            source="/opt/ml/processing/evaluation",
            destination=Join(
                on="/",
                values=[
                    "s3://{}".format(bucket),
                    prefix,
                    ExecutionVariables.PIPELINE_EXECUTION_ID,
                    "evaluation",
                ]
            )
        )
    ],
    code = config['scripts']['evaluation'],
    property_files=[evaluation_report],
    job_arguments=[
        "--target", target_parameter,
        "--features", feature_parameter,
    ]
)

The following screenshot shows an example of an evaluation report.

Add conditions – After the model is evaluated, we can add conditions to the pipeline with a ConditionStep. This step registers the model only if the given user-defined metric threshold is met. In our solution, we only want to register the new model version with the model registry if the new model meets a specific accuracy condition of above 70%.

# Create a SageMaker Pipelines ConditionStep, using the condition above.
# Enter the steps to perform if the condition returns True / False.
step_cond = ConditionStep(
    name=config['condition_step']['step_name'],
    conditions=[cond_gte],
    if_steps=[step_register_model],
    else_steps=[step_fail], ## if this fails
)

If the accuracy condition is not met, a step_fail step is run that sends an error message to the user, and the pipeline fails. For instance, because the user-defined accuracy condition is set to 0.7 in the config file, and the accuracy calculated during the evaluation step exceeds it (73.8%), the outcome of this step is set to True and the model moves to the last step of the training pipeline.

Register the model – The RegisterModel step registers a sagemaker.model.Model or a sagemaker.pipeline.PipelineModel with the SageMaker model registry. When the trained model meets the model performance requirements, a new version of the model is registered with the SageMaker model registry.

The model is registered with the model registry with an approval status set to PendingManualApproval. This means the model can’t be deployed on a SageMaker endpoint unless its status in the registry is changed to Approved manually on the SageMaker console, programmatically, or through an AWS Lambda function.

Now that the model is registered, you can get access to the registered model manually on the SageMaker Studio model registry console or programmatically in the next notebook, approve it, and run the batch transform pipeline.

Batch transform pipeline

Complete the following steps to run the batch transform pipeline (1_batch_transform_pipeline.ipynb):

On the SageMaker Studio console, choose 1_batch_transform_pipeline.ipynb in the navigation pane.
When the notebook is open, on the Run menu, choose Run All Cells to run the code in this notebook.

This notebook will run a batch transform pipeline using the model trained in the previous notebook.

At the end of the batch transform pipeline, your pipeline structure on SageMaker Pipelines should look like the following figure.

The batch transform pipeline consists of the following steps that are implemented through the notebook run:

Extract the latest approved model from the SageMaker model registry – In this step, we extract the latest model from the model registry and set the ModelApprovalStatus to Approved:

## updating the latest model package to approved status to use it for batch inference
model_package_update_response = sm.update_model_package(
    ModelPackageArn=latest_model_package_arn,
    ModelApprovalStatus="Approved",
)

Now we have extracted the latest model from the SageMaker model registry and programmatically approved it. You can also approve the model manually on the SageMaker model registry page in SageMaker Studio as shown in the following screenshot.

Read raw data for inference from PrestoDB and store it in an S3 bucket – After the latest model is approved, batch data is fetched from the PrestoDB instance and used for the batch transform step. In this step, we use a batch preprocessing script that queries data from PrestoDB and saves it in a batch directory within an S3 bucket. The query that is used to fetch batch data is configured by the user within the config file in the transform_step section:

# declare the batch step that is called later in pipeline execution
batch_data_prep = ProcessingStep(
    name=config['data_processing_step']['step_name'],
    step_args=step_args,
)

After the batch data is extracted into the S3 bucket, we create a model instance and point to the inference.py script, which contains code that runs as part of getting inference from the trained model:

# create the model image based on the model data and refer to the inference script as an entry point for batch inference
model = Model(
    image_uri=image_uri,
    entry_point=config['scripts']['batch_inference'],
    model_data=model_data_url,
    sagemaker_session=pipeline_session,
    role=role,
)

Create a batch transform step to perform inference on the batch data stored in Amazon S3 – Now that a model instance is created, create a Transformer instance with the appropriate model type, compute instance type, and desired output S3 URI. Specifically, pass in the ModelName from the CreateModelStep step_create_model properties. The CreateModelStep properties attribute matches the object model of the DescribeModel response object. Use a transform step for batch transformation to run inference on an entire dataset. For more information about batch transform, see Run Batch Transforms with Inference Pipelines.
A transform step requires a transformer and the data on which to run batch inference:

transformer = Transformer(
model_name=step_create_model.properties.ModelName,
instance_type=config['transform_step']['instance_type'],
instance_count=config['transform_step']['instance_count'],
strategy="MultiRecord",
accept="text/csv",
assemble_with="Line",
output_path=f"s3://{bucket}",
tags = config['transform_step']['tags'], 
env={
    'START_TIME_UTC': st.strftime('%Y-%m-%d %H:%M:%S'), 
    'END_TIME_UTC': et.strftime('%Y-%m-%d %H:%M:%S'),
})

Now that the transformer object is created, pass the transformer input (which contains the batch data from the batch preprocess step) into the TransformStep declaration. Store the output of this pipeline in an S3 bucket.

step_transform = TransformStep(
    name=config['transform_step']['step_name'], transformer=transformer, inputs=transform_input, 
)

SageMaker real-time inference

Complete the following steps to run the real-time inference pipeline (2_realtime_inference.ipynb):

On the SageMaker Studio console, choose 2_realtime_inference_pipeline.ipynb in the navigation pane.
When the notebook is open, on the Run menu, choose Run All Cells to run the code in this notebook.

This notebook extracts the latest approved model from the model registry and deploys it as a SageMaker endpoint for real-time inference. It does so by completing the following steps:

Extract the latest approved model from the SageMaker model registry – To deploy a real-time SageMaker endpoint, first fetch the image URI of your choice and extract the latest approved model from the model registry. After the latest approved model is extracted, we use a container list with the specified inference.py as the script for the deployed model to use at inference. This model creation and endpoint deployment are specific to the scikit-learn model configuration.
In the following code, we use the inference.py file specific to the scikit-learn model. We then create our endpoint configuration, setting our ManagedInstanceScaling to ENABLED with our desired MaxInstanceCount and MinInstanceCount for automatic scaling:

create_endpoint_config_response = sm.create_endpoint_config(
EndpointConfigName = endpoint_config_name,
ProductionVariants=[{
    'InstanceType': instance_type,
    # have max instance count configured here
    'InitialInstanceCount': min_instances,
    'InitialVariantWeight': 1,
    'ModelName': model_name,
    'VariantName': 'AllTraffic', 
    # change your managed instance configuration here
    "ManagedInstanceScaling":{
        "MaxInstanceCount": max_instances,
        "MinInstanceCount": min_instances,
        "Status": "ENABLED",}
}])

Run inference on the deployed real-time endpoint – After you have extracted the latest approved model, created the model from the desired image URI, and configured the endpoint configuration, you can deploy it as a real-time SageMaker endpoint:

create_endpoint_response = sm.create_endpoint(
EndpointName=endpoint_name,
EndpointConfigName=endpoint_config_name)

# wait for endpoint to reach a terminal state (InService) using describe endpoint
describe_endpoint_response = sm.describe_endpoint(EndpointName=endpoint_name)

while describe_endpoint_response["EndpointStatus"] == "Creating":
    describe_endpoint_response = sm.describe_endpoint(EndpointName=endpoint_name)

Upon deployment, you can view the endpoint in service on the SageMaker Endpoints page.

Now you can run inference against the data extracted from PrestoDB:

body_str = "total_extended_price,avg_discount,total_quantityn1,2,3n66.77,12,2"

response = smr.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=body_str.encode('utf-8') ,
    ContentType='text/csv',
)

response_str = response["Body"].read().decode()
response_str

Results

Here is an example of an inference request and response from the real time endpoint using the implementation above:

Inference request format (view and change this example as you would like for your custom use case)

body_str = """total_extended_price,avg_discount,total_quantity
32,40,334
"""
 
response = smr.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=body_str.encode('utf-8'),
    ContentType='text/csv',
)

response_str = response["Body"].read().decode()
data = json.loads(response_str)
print(json.dumps(data, indent=4))

Response from the real time endpoint

[
    {
        "total_extended_price": 32,
        "avg_discount": 40,
        "total_quantity": 334,
        "prediction": 0
    }
]

Clean up

To clean up the endpoint used in this solution to avoid extra charges, complete the following steps:

On the SageMaker console, choose Endpoints in the navigation pane.
Select the endpoint to delete.
On the Actions menu, choose Delete.

Conclusion

In this post, we demonstrated an end-to-end MLOps solution on SageMaker. The process involved fetching data by connecting a SageMaker Processing job to a PrestoDB instance, followed by training, evaluating, and registering the model. We approved the latest registered model from the training pipeline and ran batch inference against it using batch data queried from PrestoDB and stored in Amazon S3. Lastly, we deployed the latest approved model as a real-time SageMaker endpoint to run inferences.

The rise of generative AI increases the demand for training, deploying, and running ML models, and consequently, the use of data. By integrating SageMaker Processing jobs with PrestoDB, you can seamlessly migrate your workloads to SageMaker pipelines without additional data preparation, storage, or accessibility burdens. You can build, train, evaluate, run batch inferences, and deploy models as real-time endpoints while using your existing data engineering pipelines with minimal or no code changes.

Explore SageMaker Pipelines and open source data querying engines like PrestoDB, and build a solution using the sample implementation provided.

Get started today by referring to the GitHub repository.

For more information and tutorials on SageMaker Pipelines, refer to the SageMaker Pipelines documentation.

About the Authors

Madhur Prashant is an AI and ML Solutions Architect at Amazon Web Services. He is passionate about the intersection of human thinking and generative AI. His interests lie in generative AI, specifically building solutions that are helpful and harmless, and most of all optimal for customers. Outside of work, he loves doing yoga, hiking, spending time with his twin, and playing the guitar.

Amit Arora is an AI and ML Specialist Architect at Amazon Web Services, helping enterprise customers use cloud-based machine learning services to rapidly scale their innovations. He is also an adjunct lecturer in the MS data science and analytics program at Georgetown University in Washington D.C.

Antara Raisa is an AI and ML Solutions Architect at Amazon Web Services supporting strategic customers based out of Dallas, Texas. She also has experience working with large enterprise partners at AWS, where she worked as a Partner Success Solutions Architect for digital-centered customers.

Johnny Chivers is a Senior Solutions Architect working within the Strategic Accounts team at AWS. With over 10 years of experience helping customers adopt new technologies, he guides them through architecting end-to-end solutions spanning infrastructure, big data, and AI.

Shamik Ray is a Senior Engineering Manager at Twilio, leading the Data Science and ML team. With 12 years of experience in software engineering and data science, he excels in overseeing complex machine learning projects and ensuring successful end-to-end execution and delivery.

Srivyshnav K S is a Senior Machine Learning Engineer at Twilio with over 5 years of experience. His expertise lies in leveraging statistical and machine learning techniques to develop advanced models for detecting patterns and anomalies. He is adept at building projects end-to-end.

Jagmohan Dhiman is a Senior Data Scientist with 7 years of experience in machine learning solutions. He has extensive expertise in building end-to-end solutions, encompassing data analysis, ML-based application development, architecture design, and MLOps pipelines for managing the model lifecycle.

Soumya Kundu is a Senior Data Engineer with almost 10 years of experience in Cloud and Big Data technologies. He specialises in AI/ML based large scale Data Processing systems and an avid IoT enthusiast in his spare time.

Accelerate deep learning training and simplify orchestration with AWS Trainium and AWS Batch

June 17, 2024

by Scott Perry Amazon AWS

In large language model (LLM) training, effective orchestration and compute resource management poses a significant challenge. Automation of resource provisioning, scaling, and workflow management is vital for optimizing resource usage and streamlining complex workflows, thereby achieving efficient deep learning training processes. Simplified orchestration enables researchers and practitioners to focus more on model experimentation, hyperparameter tuning, and data analysis, rather than dealing with cumbersome infrastructure management tasks. Straightforward orchestration also accelerates innovation, shortens time-to-market for new models and applications, and ultimately enhances the overall efficiency and effectiveness of LLM research and development endeavors.

This post explores the seamless integration of AWS Trainium with AWS Batch, showcasing how the powerful machine learning (ML) acceleration capabilities of Trainium can be harnessed alongside the efficient orchestration functionalities offered by AWS Batch. Trainium provides massive scalability, enables effortless scaling of training jobs from small models to LLMs, and offers cost-effective access to computational power, making training LLMs affordable and accessible. AWS Batch is a managed service facilitating batch computing workloads on the AWS Cloud, handling tasks like infrastructure management and job scheduling, while enabling you to focus on application development and result analysis. AWS Batch provides comprehensive features, including managed batch computing, containerized workloads, custom compute environments, and prioritized job queues, along with seamless integration with other AWS services.

Solution overview

The following diagram illustrates the solution architecture.

The training process proceeds as follows:

The user creates a Docker image configured to suit the demands of the underlying training task.
The image is pushed to Amazon Elastic Container Registry (Amazon ECR) to make it ready for deployment.
The user submits the training job to AWS Batch with the Docker image.

Let’s deep dive into this solution to see how you can integrate Trainium with AWS Batch. The following example demonstrates how to train the Llama 2-7B model using AWS Batch with Trainium.

Prerequisites

It is advised to not run the following scripts on your local machine. Instead, clone the GitHub repository and run the provided scripts on an x86_64-based instance, preferably using a C5.xlarge instance type with the Linux/Ubuntu operating system. For this post, we run the example on an Amazon Linux 2023 instance.

You should have the following resources and tools before getting started with the training on AWS Batch:

VPC – For this example, you require a VPC that has at least two subnets (one public and one private) and a NAT gateway. For instructions to create a VPC with a NAT gateway, refer to Configure a VPC with Private Subnets and a NAT Gateway.
ECR repository – You need an ECR repository to store your Docker container image. For setup instructions, see Creating a private repository.
S3 bucket – You need an Amazon Simple Storage Service (Amazon S3) to store tokenized datasets, Neuron compile cache artifacts, and Llama checkpoint files. For instructions, refer to Create your first S3 bucket.
IAM role – You need an AWS Identity and Access Management (IAM) role that is associated with the Trn1 instances. Make sure this role has the AmazonEC2ContainerServiceforEC2Role and AmazonS3FullAccess policies associated with it. To learn more about IAM roles, refer Creating IAM roles.
AWS CLI – The AWS Command Line Interface (AWS CLI) should be installed and configured with permissions for AWS Batch and Amazon ECR. This isn’t needed if you’re using Amazon Linux 2023, but for other operating systems, you can follow the instructions in Install or update to the latest version of the AWS CLI to install the AWS CLI.
Other tools – Docker and jq should also be installed. You can use the following commands to install them on AL2023:

sudo yum install -y docker 
sudo yum install -y jq

Clone the repo

Clone the GitHub repo and navigate to the required directory:

git clone https://github.com/aws-neuron/aws-neuron-samples.git 
cd aws-neuron-samples/torch-neuronx/training/aws-batch/llama2

Update the configuration

First, update the config.txt file to specify values for the following variables:

REGION                          # your aws region 
SUBNET                          # your subnet in which the Trainium instances would be launched 
SG                              # your security group you want to associate with your instances 
ECR_REPO                        # your ECR repo where the docker container image will be pushed to 
INSTANCE_ROLE                   # Instance profile ARN for your IAM Instance Role 
DO_PRE_COMPILATION              # boolean value (truefalse) indicating if you want to do neuron pre-compilation for your training job 
TOKENIZED_DATASET_URI           # s3 uri to store the tokenized dataset 
NEURON_COMPILE_CACHE_URI        # s3 uri to store the neuron compile caches 
CHECKPOINT_SAVE_URI             # s3 uri to store the checkpoints

After you provide these values, your config.txt file should look something like the following code

REGION=us-east-1
SUBNET=subnet-012345abcd5689
SG=sg-012345abcd5689
ECR_REPO=1010101010.dkr.ecr.us-east-1.amazonaws.com/your-docker-repo
INSTANCE_ROLE=arn:aws:iam::1010101010:instance-profile/your-instance-role
DO_PRE_COMPILATION=true
TOKENIZED_DATASET_URI=s3://your/s3/location/to/store/tokenized/dataset/
NEURON_COMPILE_CACHE_URI=s3://your/s3/location/to/store/neuron-compile-cache/
CHECKPOINT_SAVE_URI=s3://your/s3/location/to/store/checkpoints/

Get the Llama tokenizer

To tokenize the dataset, you would need to get the tokenizer from Hugging Face. Follow the instructions to access the Llama tokenizer. (You need to acknowledge and accept the license terms.) After you’re granted access, you can download the tokenizer from Hugging Face. After a successful download, place the tokenizer.model file in the root directory (llama2).

Set up Llama training

Run the setup.sh script, which streamlines the prerequisite steps for initiating the AWS Batch training. This script downloads the necessary Python files for training the Llama 2-7B model. Additionally, it performs environment variable substitution within the provided templates and scripts designed to establish AWS Batch resources. When it runs, it makes sure your directory structure conforms to the following setup:

.
├── build
│ ├── compute_env.json
│ ├── job_def.json
│ ├── job_queue.json
│ └── launch_template.json
├── build_and_push_docker_image.sh
├── cleanup.sh
├── config.txt
├── create_resources.sh
├── data
│ ├── get_dataset.py
│ ├── config.json
│ └── tokenizer.model
├── docker
│ ├── Dockerfile
│ ├── llama2
│ │ ├── adamw_fp32_optim_params.py
│ │ ├── config.json
│ │ ├── llama_batch_training.sh
│ │ ├── modeling_llama_nxd.py
│ │ ├── requirements.txt
│ │ └── tp_zero1_llama2_7b_hf_pretrain.py
│ └── llama_batch_training.sh
├── download_and_tokenize_data.sh
├── images
│ └── aws-batch.png
├── README.md
├── scripts
│ ├── build_and_push_docker_image.sh
│ ├── cleanup.sh
│ ├── create_resources.sh
│ ├── download_and_tokenize_data.sh
│ └── submit_batch_job.sh
├── setup.sh
├── submit_batch_job.sh
└── templates
├── compute_env.json
├── job_def.json
├── job_queue.json
└── launch_template.json

Tokenize the dataset

Next, run the download_and_tokenize_data.sh script to complete the data preprocessing steps for Llama 2-7B training. In this instance, we use the wikicorpus dataset sourced from Hugging Face. After the dataset retrieval, the script performs tokenization and uploads the tokenized dataset to the predefined S3 location specified within the config.txt configuration file. The following screenshots show the preprocessing results.

Provision resources

Next, run the create_resources.sh script, which orchestrates the provisioning of the required resources for the training task. This includes creation of a placement group, launch template, compute environment, job queue, and job definition. The following screenshots illustrate this process.

Build and push the Docker image

Now you can run the script build_and_push_docker_image.sh, which constructs a Docker container image customized for your specific training task. This script uses a Deep Learning Container Image published by the Neuron team, which contains the required software stack, and then added instructions for running the Llama 2-7B training on top of it. The training script uses the neuronx_distributed library with tensor parallelism along with the ZeRO-1 Optimizer. Subsequently, the newly generated Docker container image is uploaded to your designated ECR repository as specified by the variable ECR_REPO in the configuration file config.txt.

If you want to modify any of the Llama training hyperparameters, make the required changes in ./docker/llama_batch_training.sh before running build_and_push_docker_image.sh.

The following screenshots illustrate the process for building and pushing the Docker image.

Submit the training job

Run the submit_batch_job.sh script to initiate the AWS Batch job and start the Llama2 model training, as shown in the following screenshots.

Upon batch job submission, an Amazon Elastic Container Service (Amazon ECS) cluster is dynamically provisioned. When it’s operational, you can navigate to the cluster to monitor all tasks actively running on the Trn1.32xl instances, launched through this job. By default, this example is configured to use 4 trn1.32xl instances. To customize this setting, you can modify the numNodes parameter in the submit_batch_job.sh script.

Logs and monitoring

After the job submission, you can use Amazon CloudWatch Logs for comprehensive monitoring, storage, and viewing of all logs generated by AWS Batch. Complete the following steps to access the logs:

On the CloudWatch console, choose Log groups under Logs in the navigation pane.
Choose /aws/batch/job to view the batch job logs.
Look for log groups that match your AWS Batch job names or job definitions.
Choose the job to view its details.

The following screenshot shows an example.

Checkpoints

Checkpoints generated during training will be stored in the predefined S3 location specified as CHECKPOINT_SAVE_URI in the config.txt file. By default, the checkpoint is saved when training is complete. However, you can adjust this behavior by opting to save the checkpoint after every N steps within the training loop. For detailed instructions on this customization, refer to Checkpointing.

Clean up

When you’re done, run the cleanup.sh script to manage the removal of resources created during the post. This script takes care of removing various components, such as the launch template, placement group, job definition, job queue, and compute environment. AWS Batch automatically handles the cleanup of the ECS stack and Trainium instances, so there’s no need to manually remove or stop them.

Conclusion

The seamless integration of Trainium with AWS Batch represents a significant advancement in the realm of ML training. By combining the unparalleled capabilities of Trainium with the powerful orchestration functionalities of AWS Batch, you stand to benefit in numerous ways. Firstly, you gain access to massive scalability, with the ability to effortlessly scale training jobs from small models to LLMs. With up to 16 Trainium chips per instance and the potential for distributed training across tens of thousands of accelerators, you can tackle even the most demanding training tasks with ease by virtue of Trainium instances. Additionally, it offers a cost-effective solution, helping you harness the power you need at an appealing price point. With the fully managed service offered by AWS Batch for computing workloads, you can offload operational complexities such as infrastructure provisioning and job scheduling, allowing you to focus your efforts on building applications and analyzing results. Ultimately, the integration of Trainium with AWS Batch empowers you to accelerate innovation, shorten time-to-market for new models and applications, and enhance the overall efficiency and effectiveness of your ML endeavors.

Now that you have learned about orchestrating Trainium using AWS Batch, we encourage you to try it out for your next deep learning training job. You can explore more tutorials that will help you gain hands-on experience with AWS Batch and Trainium, and enable you to manage your deep learning training workloads and resources for better performance and cost-efficiency. So why wait? Start exploring these tutorials today and take your deep learning training to the next level with Trainium and AWS Batch!

About the authors

Scott Perry is a Solutions Architect on the Annapurna ML accelerator team at AWS. Based in Canada, he helps customers deploy and optimize deep learning training and inference workloads using AWS Inferentia and AWS Trainium. His interests include large language models, deep reinforcement learning, IoT, and genomics.

Sadaf Rasool is a Machine Learning Engineer with Annapurna ML Accelerator team at AWS. As an enthusiastic and optimistic AI/ML professional, he holds firm to the belief that the ethical and responsible application of AI has the potential to enhance society in the years to come, fostering both economic growth and social well-being.

Automated evaluation of RAG pipelines with exam generation

June 13, 2024

by Amazon AWS

The fight against hallucination in retrieval-augmented-generation models starts with a method for accurately assessing it.Read More