Amazon AWS – Page 118

Promote pipelines in a multi-environment setup using Amazon SageMaker Model Registry, HashiCorp Terraform, GitHub, and Jenkins CI/CD

November 9, 2023

by Gayatri Ghanakota Amazon AWS

Building out a machine learning operations (MLOps) platform in the rapidly evolving landscape of artificial intelligence (AI) and machine learning (ML) for organizations is essential for seamlessly bridging the gap between data science experimentation and deployment while meeting the requirements around model performance, security, and compliance.

In order to fulfill regulatory and compliance requirements, the key requirements when designing such a platform are:

Address data drift
Monitor model performance
Facilitate automatic model retraining
Provide a process for model approval
Keep models in a secure environment

In this post, we show how to create an MLOps framework to address these needs while using a combination of AWS services and third-party toolsets. The solution entails a multi-environment setup with automated model retraining, batch inference, and monitoring with Amazon SageMaker Model Monitor, model versioning with SageMaker Model Registry, and a CI/CD pipeline to facilitate promotion of ML code and pipelines across environments by using Amazon SageMaker, Amazon EventBridge, Amazon Simple Notification Service (Amazon S3), HashiCorp Terraform, GitHub, and Jenkins CI/CD. We build a model to predict the severity (benign or malignant) of a mammographic mass lesion trained with the XGBoost algorithm using the publicly available UCI Mammography Mass dataset and deploy it using the MLOps framework. The full instructions with code are available in the GitHub repository.

Solution overview

The following architecture diagram shows an overview of the MLOps framework with the following key components:

Multi account strategy – Two different environments (dev and prod) are set up in two different AWS accounts following the AWS Well-Architected best practices, and a third account is set up in the central model registry:
- Dev environment – Where an Amazon SageMaker Studio domain is set up to allow model development, model training, and testing of ML pipelines (train and inference), before a model is ready to be promoted to higher environments.
- Prod environment – Where the ML pipelines from dev are promoted to as a first step, and scheduled and monitored over time.
- Central model registry – Amazon SageMaker Model Registry is set up in a separate AWS account to track model versions generated across the dev and prod environments.
CI/CD and source control – The deployment of ML pipelines across environments is handled through CI/CD set up with Jenkins, along with version control handled through GitHub. Code changes merged to the corresponding environment git branch triggers a CI/CD workflow to make appropriate changes to the given target environment.
Batch predictions with model monitoring – The inference pipeline built with Amazon SageMaker Pipelines runs on a scheduled basis to generate predictions along with model monitoring using SageMaker Model Monitor to detect data drift.
Automated retraining mechanism – The training pipeline built with SageMaker Pipelines is triggered whenever a data drift is detected in the inference pipeline. After it’s trained, the model is registered into the central model registry to be approved by a model approver. When it’s approved, the updated model version is used to generate predictions through the inference pipeline.
Infrastructure as code – The infrastructure as code (IaC), created using HashiCorp Terraform, supports the scheduling of the inference pipeline with EventBridge, triggering of the train pipeline based on an EventBridge rule and sending notifications using Amazon Simple Notification Service (Amazon SNS) topics.

The MLOps workflow includes the following steps:

Access the SageMaker Studio domain in the development account, clone the GitHub repository, go through the process of model development using the sample model provided, and generate the train and inference pipelines.
Run the train pipeline in the development account, which generates the model artifacts for the trained model version and registers the model into SageMaker Model Registry in the central model registry account.
Approve the model in SageMaker Model Registry in the central model registry account.
Push the code (train and inference pipelines, and the Terraform IaC code to create the EventBridge schedule, EventBridge rule, and SNS topic) into a feature branch of the GitHub repository. Create a pull request to merge the code into the main branch of the GitHub repository.
Trigger the Jenkins CI/CD pipeline, which is set up with the GitHub repository. The CI/CD pipeline deploys the code into the prod account to create the train and inference pipelines along with Terraform code to provision the EventBridge schedule, EventBridge rule, and SNS topic.
The inference pipeline is scheduled to run on a daily basis, whereas the train pipeline is set up to run whenever data drift is detected from the inference pipeline.
Notifications are sent through the SNS topic whenever there is a failure with either the train or inference pipeline.

Prerequisites

For this solution, you should have the following prerequisites:

Three AWS accounts (dev, prod, and central model registry accounts)
A SageMaker Studio domain set up in each of the three AWS accounts (see Onboard to Amazon SageMaker Studio or watch the video Onboard Quickly to Amazon SageMaker Studio for setup instructions)
Jenkins (we use Jenkins 2.401.1) with administrative privileges installed on AWS
Terraform version 1.5.5 or later installed on Jenkins server

For this post, we work in the us-east-1 Region to deploy the solution.

Provision KMS keys in dev and prod accounts

Our first step is to create AWS Key Management Service (AWS KMS) keys in the dev and prod accounts.

Create a KMS key in the dev account and give access to the prod account

Complete the following steps to create a KMS key in the dev account:

On the AWS KMS console, choose Customer managed keys in the navigation pane.
Choose Create key.
For Key type, select Symmetric.
For Key usage, select Encrypt and decrypt.
Choose Next.
Enter the production account number to give the production account access to the KMS key provisioned in the dev account. This is a required step because the first time the model is trained in the dev account, the model artifacts are encrypted with the KMS key before being written to the S3 bucket in the central model registry account. The production account needs access to the KMS key in order to decrypt the model artifacts and run the inference pipeline.
Choose Next and finish creating your key.

After the key is provisioned, it should be visible on the AWS KMS console.

Create a KMS key in the prod account

Go through the same steps in the previous section to create a customer managed KMS key in the prod account. You can skip the step to share the KMS key to another account.

Set up a model artifacts S3 bucket in the central model registry account

Create an S3 bucket of your choice with the string sagemaker in the naming convention as part of the bucket’s name in the central model registry account, and update the bucket policy on the S3 bucket to give permissions from both the dev and prod accounts to read and write model artifacts into the S3 bucket.

The following code is the bucket policy to be updated on the S3 bucket:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AddPerm",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<dev-account-id>:root"
            },
            "Action": [
                "s3:PutObject",
                "s3:PutObjectAcl",
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": "arn:aws:s3:::<s3-bucket-in-central-model-registry-account>/*"
        },
        {
            "Sid": "AddPerm1",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<dev-account-id>:root"
            },
            "Action": "s3:ListBucket",
            "Resource": [
                "arn:aws:s3:::<s3-bucket-in-central-model-registry-account>",
                "arn:aws:s3:::<s3-bucket-in-central-model-registry-account>/*"
            ]
        },
        {
            "Sid": "AddPerm2",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<prod-account-id>:root"
            },
            "Action": [
                "s3:PutObject",
                "s3:PutObjectAcl",
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": "arn:aws:s3:::<s3-bucket-in-central-model-registry-account>/*"
        },
        {
            "Sid": "AddPerm3",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<prod-account-id>:root"
            },
            "Action": "s3:ListBucket",
            "Resource": [
                "arn:aws:s3:::<s3-bucket-in-central-model-registry-account>",
                "arn:aws:s3:::<s3-bucket-in-central-model-registry-account>/*"
            ]
        }
    ]
}

Set up IAM roles in your AWS accounts

The next step is to set up AWS Identity and Access Management (IAM) roles in your AWS accounts with permissions for AWS Lambda, SageMaker, and Jenkins.

Lambda execution role

Set up Lambda execution roles in the dev and prod accounts, which will be used by the Lambda function run as part of the SageMaker Pipelines Lambda step. This step will run from the inference pipeline to fetch the latest approved model, using which inferences are generated. Create IAM roles in the dev and prod accounts with the naming convention arn:aws:iam::<account-id>:role/lambda-sagemaker-role and attach the following IAM policies:

Policy 1 – Create an inline policy named cross-account-model-registry-access, which gives access to the model package set up in the model registry in the central account:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "sagemaker:ListModelPackages",
            "Resource": "arn:aws:sagemaker:us-east-1:<central-model-registry-account-id>:model-package/mammo-severity-model-package/*"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": "sagemaker:DescribeModelPackageGroup",
            "Resource": "arn:aws:sagemaker:us-east-1:<central-model-registry-account-id>:model-package-group/mammo-severity-model-package"
        }
    ]
}

Policy 2 – Attach AmazonSageMakerFullAccess, which is an AWS managed policy that grants full access to SageMaker. It also provides select access to related services, such as AWS Application Auto Scaling, Amazon S3, Amazon Elastic Container Registry (Amazon ECR), and Amazon CloudWatch Logs.
Policy 3 – Attach AWSLambda_FullAccess, which is an AWS managed policy that grants full access to Lambda, Lambda console features, and other related AWS services.

Policy 4 – Use the following IAM trust policy for the IAM role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "lambda.amazonaws.com",
                    "sagemaker.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

SageMaker execution role

The SageMaker Studio domains set up in the dev and prod accounts should each have an execution role associated, which can be found on the Domain settings tab on the domain details page, as shown in the following screenshot. This role is used to run training jobs, processing jobs, and more within the SageMaker Studio domain.

Add the following policies to the SageMaker execution role in both accounts:

Policy 1 – Create an inline policy named cross-account-model-artifacts-s3-bucket-access, which gives access to the S3 bucket in the central model registry account, which stores the model artifacts:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": "arn:aws:s3:::<s3-bucket-in-central-model-registry-account>/*"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<s3-bucket-in-central-model-registry-account>",
                "arn:aws:s3:::<s3-bucket-in-central-model-registry-account>/*"
            ]
        }
    ]
}

Policy 2 – Create an inline policy named cross-account-model-registry-access, which gives access to the model package in the model registry in the central model registry account:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "sagemaker:CreateModelPackageGroup",
            "Resource": "arn:aws:sagemaker:us-east-1:<central-model-registry-account-id>:model-package-group/mammo-severity-model-package"
        }
    ]
}

Policy 3 – Create an inline policy named kms-key-access-policy, which gives access to the KMS key created in the previous step. Provide the account ID in which the policy is being created and the KMS key ID created in that account.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowUseOfKeyInThisAccount",
            "Effect": "Allow",
            "Action": [
                "kms:Encrypt",
                "kms:Decrypt",
                "kms:ReEncrypt*",
                "kms:GenerateDataKey*",
                "kms:DescribeKey"
            ],
            "Resource": "arn:aws:kms:us-east-1:<account-id>:key/<kms-key-id>"
        }
    ]
}

Policy 4 – Attach AmazonSageMakerFullAccess, which is an AWS managed policy that grants full access to SageMaker and select access to related services.
Policy 5 – Attach AWSLambda_FullAccess, which is an AWS managed policy that grants full access to Lambda, Lambda console features, and other related AWS services.
Policy 6 – Attach CloudWatchEventsFullAccess, which is an AWS managed policy that grants full access to CloudWatch Events.

Policy 7 – Add the following IAM trust policy for the SageMaker execution IAM role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "events.amazonaws.com",
                    "sagemaker.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

Policy 8 (specific to the SageMaker execution role in the prod account) – Create an inline policy named cross-account-kms-key-access-policy, which gives access to the KMS key created in the dev account. This is required for the inference pipeline to read model artifacts stored in the central model registry account where the model artifacts are encrypted using the KMS key from the dev account when the first version of the model is created from the dev account.
```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowUseOfKeyInDevAccount",
            "Effect": "Allow",
            "Action": [
                "kms:Encrypt",
                "kms:Decrypt",
                "kms:ReEncrypt*",
                "kms:GenerateDataKey*",
                "kms:DescribeKey"
            ],
            "Resource": "arn:aws:kms:us-east-1:<dev-account-id>:key/<dev-kms-key-id>"
        }
    ]
}
```

Cross-account Jenkins role

Set up an IAM role called cross-account-jenkins-role in the prod account, which Jenkins will assume to deploy ML pipelines and corresponding infrastructure into the prod account.

Add the following managed IAM policies to the role:

CloudWatchFullAccess
AmazonS3FullAccess
AmazonSNSFullAccess
AmazonSageMakerFullAccess
AmazonEventBridgeFullAccess
AWSLambda_FullAccess

Update the trust relationship on the role to give permissions to the AWS account hosting the Jenkins server:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "events.amazonaws.com",
                "AWS": "arn:aws:iam::<jenkins-account-id>:root"
            },
            "Action": "sts:AssumeRole",
            "Condition": {}
        }
    ]
}

Update permissions on the IAM role associated with the Jenkins server

Assuming that Jenkins has been set up on AWS, update the IAM role associated with Jenkins to add the following policies, which will give Jenkins access to deploy the resources into the prod account:

Policy 1 – Create the following inline policy named assume-production-role-policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": "arn:aws:iam::<prod-account-id>:role/cross-account-jenkins-role"
        }
    ]
}

Policy 2 – Attach the CloudWatchFullAccess managed IAM policy.

Set up the model package group in the central model registry account

From the SageMaker Studio domain in the central model registry account, create a model package group called mammo-severity-model-package using the following code snippet (which you can run using a Jupyter notebook):

import boto3 

model_package_group_name = "mammo-severity-model-package"
sm_client = boto3.Session().client("sagemaker")

create_model_package_group_response = sm_client.create_model_package_group(
    ModelPackageGroupName=model_package_group_name,
    ModelPackageGroupDescription="Cross account model package group for mammo severity model",

)

print('ModelPackageGroup Arn : {}'.format(create_model_package_group_response['ModelPackageGroupArn']))

Set up access to the model package for IAM roles in the dev and prod accounts

Provision access to the SageMaker execution roles created in the dev and prod accounts so you can register model versions within the model package mammo-severity-model-package in the central model registry from both accounts. From the SageMaker Studio domain in the central model registry account, run the following code in a Jupyter notebook:

import json 
import boto3 

model_package_group_name = "mammo-severity-model-package"
# Convert the policy from JSON dict to string
model_package_group_policy = dict(
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AddPermModelPackageGroupCrossAccount",
      "Effect": "Allow",
      "Principal": {
        "AWS": ["arn:aws:iam::<dev-account-id>:root", "arn:aws:iam::<prod-account-id>:root"]
      },
      "Action": [
        "sagemaker:DescribeModelPackageGroup"      
        ],
      "Resource": "arn:aws:sagemaker:us-east-1:<central-model-registry-account>:model-package-group/mammo-severity-model-package"    
    },
    {
      "Sid": "AddPermModelPackageVersionCrossAccount",
      "Effect": "Allow",
      "Principal": {
        "AWS": ["arn:aws:iam::<dev-account-id>:root", "arn:aws:iam::<prod-account-id>:root"] 
      },
      "Action": [
        "sagemaker:DescribeModelPackage",
        "sagemaker:ListModelPackages",
        "sagemaker:UpdateModelPackage",
        "sagemaker:CreateModelPackage",
        "sagemaker:CreateModel"      
      ],
      "Resource": "arn:aws:sagemaker:us-east-1:<central-model-registry-account>:model-package/mammo-severity-model-package/*"
    }
  ]
})
model_package_group_policy = json.dumps(model_package_group_policy)
# Add Policy to the model package group
sm_client = boto3.Session().client("sagemaker")
response = sm_client.put_model_package_group_policy(
    ModelPackageGroupName = model_package_group_name,
    ResourcePolicy = model_package_group_policy)

Set up Jenkins

In this section, we configure Jenkins to create the ML pipelines and the corresponding Terraform infrastructure in the prod account through the Jenkins CI/CD pipeline.

On the CloudWatch console, create a log group named jenkins-log within the prod account to which Jenkins will push logs from the CI/CD pipeline. The log group should be created in the same Region as where the Jenkins server is set up.
Install the following plugins on your Jenkins server:
1. Job DSL
2. Git
3. Pipeline
4. Pipeline: AWS Steps
5. Pipeline Utility Steps
Set up AWS credentials in Jenkins using the cross-account IAM role (cross-account-jenkins-role) provisioned in the prod account.
For System Configuration, choose AWS.
Provide the credentials and CloudWatch log group you created earlier.
Set up GitHub credentials within Jenkins.
Create a new project in Jenkins.
Enter a project name and choose Pipeline.
On the General tab, select GitHub project and enter the forked GitHub repository URL.
Select This project is parameterized.
On the Add Parameter menu, choose String Parameter.
For Name, enter prodAccount.
For Default Value, enter the prod account ID.
Under Advanced Project Options, for Definition, select Pipeline script from SCM.
For SCM, choose Git.
For Repository URL, enter the forked GitHub repository URL.
For Credentials, enter the GitHub credentials saved in Jenkins.
Enter main in the Branches to build section, based on which the CI/CD pipeline will be triggered.
For Script Path, enter Jenkinsfile.
Choose Save.

The Jenkins pipeline should be created and visible on your dashboard.

Provision S3 buckets, collect and prepare data

Complete the following steps to set up your S3 buckets and data:

Create an S3 bucket of your choice with the string sagemaker in the naming convention as part of the bucket’s name in both dev and prod accounts to store datasets and model artifacts.
Set up an S3 bucket to maintain the Terraform state in the prod account.
Download and save the publicly available UCI Mammography Mass dataset to the S3 bucket you created earlier in the dev account.
Fork and clone the GitHub repository within the SageMaker Studio domain in the dev account. The repo has the following folder structure:
- /environments – Configuration script for prod environment
- /mlops-infra – Code for deploying AWS services using Terraform code
- /pipelines – Code for SageMaker pipeline components
- Jenkinsfile – Script to deploy through Jenkins CI/CD pipeline
- setup.py – Needed to install the required Python modules and create the run-pipeline command
- mammography-severity-modeling.ipynb – Allows you to create and run the ML workflow
Create a folder called data within the cloned GitHub repository folder and save a copy of the publicly available UCI Mammography Mass dataset.
Follow the Jupyter notebook mammography-severity-modeling.ipynb.

Run the following code in the notebook to preprocess the dataset and upload it to the S3 bucket in the dev account:

import boto3
import sagemaker
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

#Replace the values based on the resoures created
default_bucket = "<s3-bucket-in-dev-account>"
model_artifacts_bucket = "<s3-bucket-in-central-model-registry-account>"
region = "us-east-1"
model_name = "mammography-severity-model"
role = sagemaker.get_execution_role()
lambda_role = "arn:aws:iam::<dev-account-id>:role/lambda-sagemaker-role"
kms_key = "arn:aws:kms:us-east-1:<dev-account-id>:key/<kms-key-id-in-dev-account>"
model_package_group_name="arn:aws:sagemaker:us-east-1:<central-model-registry-account-id>:model-package-group/mammo-severity-model-package"

feature_columns_names = [
    'BIRADS',
    'Age',
    'Shape',
    'Margin',
    'Density',
]
feature_columns_dtype = {
    'BIRADS': np.float64,
    'Age': np.float64,
    'Shape': np.float64,
    'Margin': np.float64,
    'Density': np.float64,
}

# read raw dataset
mammographic_data = pd.read_csv("data/mammographic_masses.data",header=None)

# split data into batch and raw datasets
batch_df =mammographic_data.sample(frac=0.05,random_state=200)
raw_df =mammographic_data.drop(batch_df.index)

# Split the raw datasets to two parts, one of which will be used to train
#the model initially and then other dataset will be leveraged when 
#retraining the model
train_dataset_part2 =raw_df.sample(frac=0.1,random_state=200)
train_dataset_part1 =raw_df.drop(train_dataset_part2.index)

# save the train datasets 
train_dataset_part1.to_csv("data/mammo-train-dataset-part1.csv",index=False)
train_dataset_part2.to_csv("data/mammo-train-dataset-part2.csv",index=False)

# remove label column from the batch dataset which will be used to generate inferences
batch_df.drop(5,axis=1,inplace=True)

# create a copy of the batch dataset 
batch_modified_df = batch_df

def preprocess_batch_data(feature_columns_names,feature_columns_dtype,batch_df):
    batch_df.replace("?", "NaN", inplace = True)
    batch_df.columns = feature_columns_names
    batch_df = batch_df.astype(feature_columns_dtype)
    numeric_transformer = Pipeline( 
        steps=[("imputer", SimpleImputer(strategy="median"))]
        )
    numeric_features = list(feature_columns_names)
    preprocess = ColumnTransformer(
        transformers=[
            ("num", numeric_transformer, numeric_features)
        ]
    ) 
    batch_df = preprocess.fit_transform(batch_df)
    return batch_df

# save the batch dataset file
batch_df = preprocess_batch_data(feature_columns_names,feature_columns_dtype,batch_df)
pd.DataFrame(batch_df).to_csv("data/mammo-batch-dataset.csv", header=False, index=False)

# modify batch dataset to introduce missing values
batch_modified_df.replace("?", "NaN", inplace = True)
batch_modified_df.columns = feature_columns_names
batch_modified_df = batch_modified_df.astype(feature_columns_dtype)

# save the batch dataset with outliers file
batch_modified_df.to_csv("data/mammo-batch-dataset-outliers.csv",index=False)

The code will generate the following datasets:

- data/ mammo-train-dataset-part1.csv – Will be used to train the first version of model.
- data/ mammo-train-dataset-part2.csv – Will be used to train the second version of model along with the mammo-train-dataset-part1.csv dataset.
- data/mammo-batch-dataset.csv – Will be used to generate inferences.
- data/mammo-batch-dataset-outliers.csv – Will introduce outliers into the dataset to fail the inference pipeline. This will enable us to test the pattern to trigger automated retraining of the model.

Upload the dataset mammo-train-dataset-part1.csv under the prefix mammography-severity-model/train-dataset, and upload the datasets mammo-batch-dataset.csv and mammo-batch-dataset-outliers.csv to the prefix mammography-severity-model/batch-dataset of the S3 bucket created in the dev account:

import boto3
s3_client = boto3.resource('s3')
s3_client.Bucket(default_bucket).upload_file("data/mammo-train-dataset-part1.csv","mammography-severity-model/data/train-dataset/mammo-train-dataset-part1.csv")
s3_client.Bucket(default_bucket).upload_file("data/mammo-batch-dataset.csv","mammography-severity-model/data/batch-dataset/mammo-batch-dataset.csv")
s3_client.Bucket(default_bucket).upload_file("data/mammo-batch-dataset-outliers.csv","mammography-severity-model/data/batch-dataset/mammo-batch-dataset-outliers.csv")

Upload the datasets mammo-train-dataset-part1.csv and mammo-train-dataset-part2.csv under the prefix mammography-severity-model/train-dataset into the S3 bucket created in the prod account through the Amazon S3 console.
Upload the datasets mammo-batch-dataset.csv and mammo-batch-dataset-outliers.csv to the prefix mammography-severity-model/batch-dataset of the S3 bucket in the prod account.

Run the train pipeline

Under <project-name>/pipelines/train, you can see the following Python scripts:

scripts/raw_preprocess.py – Integrates with SageMaker Processing for feature engineering
scripts/evaluate_model.py – Allows model metrics calculation, in this case auc_score
train_pipeline.py – Contains the code for the model training pipeline

Complete the following steps:

Upload the scripts into Amazon S3:

import boto3
s3_client = boto3.resource('s3')
s3_client.Bucket(default_bucket).upload_file("pipelines/train/scripts/raw_preprocess.py","mammography-severity-model/scripts/raw_preprocess.py")
s3_client.Bucket(default_bucket).upload_file("pipelines/train/scripts/evaluate_model.py","mammography-severity-model/scripts/evaluate_model.py")

Get the train pipeline instance:

from pipelines.train.train_pipeline import get_pipeline

train_pipeline = get_pipeline(
                        region=region,
                        role=role,
                        default_bucket=default_bucket,
                        model_artifacts_bucket=model_artifacts_bucket,
                        model_name = model_name,
                        kms_key = kms_key,
                        model_package_group_name= model_package_group_name,
                        pipeline_name="mammo-severity-train-pipeline",
                        base_job_prefix="mammo-severity",
                    )

train_pipeline.definition()

Submit the train pipeline and run it:

train_pipeline.upsert(role_arn=role)
train_execution = train_pipeline.start()

The following figure shows a successful run of the training pipeline. The final step in the pipeline registers the model in the central model registry account.

Approve the model in the central model registry

Log in to the central model registry account and access the SageMaker model registry within the SageMaker Studio domain. Change the model version status to Approved.

Once approved, the status should be changed on the model version.

Run the inference pipeline (Optional)

This step is not required but you can still run the inference pipeline to generate predictions in the dev account.

Under <project-name>/pipelines/inference, you can see the following Python scripts:

scripts/lambda_helper.py – Pulls the latest approved model version from the central model registry account using a SageMaker Pipelines Lambda step
inference_pipeline.py – Contains the code for the model inference pipeline

Complete the following steps:

Upload the script to the S3 bucket:

import boto3
s3_client = boto3.resource('s3')
s3_client.Bucket(default_bucket).upload_file("pipelines/inference/scripts/lambda_helper.py","mammography-severity-model/scripts/lambda_helper.py")

Get the inference pipeline instance using the normal batch dataset:

from pipelines.inference.inference_pipeline import get_pipeline

inference_pipeline = get_pipeline(
                        region=region,
                        role=role,
                        lambda_role = lambda_role,
                        default_bucket=default_bucket,
                        kms_key=kms_key,
                        model_name = model_name,
                        model_package_group_name= model_package_group_name,
                        pipeline_name="mammo-severity-inference-pipeline",
                        batch_dataset_filename = "mammo-batch-dataset"
                    )

Submit the inference pipeline and run it:

inference_pipeline.upsert(role_arn=role)
inference_execution = inference_pipeline.start()

The following figure shows a successful run of the inference pipeline. The final step in the pipeline generates the predictions and stores them in the S3 bucket. We use MonitorBatchTransformStep to monitor the inputs into the batch transform job. If there are any outliers, the inference pipeline goes into a failed state.

Run the Jenkins pipeline

The environment/ folder within the GitHub repository contains the configuration script for the prod account. Complete the following steps to trigger the Jenkins pipeline:

Update the config script prod.tfvars.json based on the resources created in the previous steps:

{
    "env_group": "prod",
    "aws_region": "us-east-1",
    "event_bus_name": "default",
    "pipelines_alert_topic_name": "mammography-model-notification",
    "email":"admin@org.com",
    "lambda_role":"arn:aws:iam::<prod-account-id>:role/lambda-sagemaker-role",
    "default_bucket":"<s3-bucket-in-prod-account>",
    "model_artifacts_bucket": "<s3-bucket-in-central-model-registry-account>",
    "kms_key": "arn:aws:kms:us-east-1:<prod-account-id>:key/<kms-key-id-in-prod-account>",
    "model_name": "mammography-severity-model",
    "model_package_group_name":"arn:aws:sagemaker:us-east-1:<central-model-registry-account-id>:model-package-group/mammo-severity-model-package",
    "train_pipeline_name":"mammo-severity-train-pipeline",
    "inference_pipeline_name":"mammo-severity-inference-pipeline",
    "batch_dataset_filename":"mammo-batch-dataset",
    "terraform_state_bucket":"<s3-bucket-terraform-state-in-prod-account>",
    "train_pipeline": {
            "name": "mammo-severity-train-pipeline",
            "arn": "arn:aws:sagemaker:us-east-1:<prod-account-id>:pipeline/mammo-severity-train-pipeline",
            "role_arn": "arn:aws:iam::<prod-account-id>:role/service-role/<sagemaker-execution-role-in-prod-account>"
        },
    "inference_pipeline": {
            "name": "mammo-severity-inference-pipeline",
            "arn": "arn:aws:sagemaker:us-east-1:<prod-account-id>:pipeline/mammo-severity-inference-pipeline",
            "cron_schedule": "cron(0 23 * * ? *)",
            "role_arn": "arn:aws:iam::<prod-account-id>:role/service-role/<sagemaker-execution-role-in-prod-account>"
        }

}

Once updated, push the code into the forked GitHub repository and merge the code into main branch.
Go to the Jenkins UI, choose Build with Parameters, and trigger the CI/CD pipeline created in the previous steps.

When the build is complete and successful, you can log in to the prod account and see the train and inference pipelines within the SageMaker Studio domain.

Additionally, you will see three EventBridge rules on the EventBridge console in the prod account:

Schedule the inference pipeline
Send a failure notification on the train pipeline
When the inference pipeline fails to trigger the train pipeline, send a notification

Finally, you will see an SNS notification topic on the Amazon SNS console that sends notifications through email. You’ll get an email asking you to confirm the acceptance of these notification emails.

Test the inference pipeline using a batch dataset without outliers

To test if the inference pipeline is working as expected in the prod account, we can log in to the prod account and trigger the inference pipeline using the batch dataset without outliers.

Run the pipeline via the SageMaker Pipelines console in the SageMaker Studio domain of the prod account, where the transform_input will be the S3 URI of the dataset without outliers (s3://<s3-bucket-in-prod-account>/mammography-severity-model/data/mammo-batch-dataset.csv).

The inference pipeline succeeds and writes the predictions back to the S3 bucket.

Test the inference pipeline using a batch dataset with outliers

You can run the inference pipeline using the batch dataset with outliers to check if the automated retraining mechanism works as expected.

Run the pipeline via the SageMaker Pipelines console in the SageMaker Studio domain of the prod account, where the transform_input will be the S3 URI of the dataset with outliers (s3://<s3-bucket-in-prod-account>/mammography-severity-model/data/mammo-batch-dataset-outliers.csv).

The inference pipeline fails as expected, which triggers the EventBridge rule, which in turn triggers the train pipeline.

After a few moments, you should see a new run of the train pipeline on the SageMaker Pipelines console, which picks up the two different train datasets (mammo-train-dataset-part1.csv and mammo-train-dataset-part2.csv) uploaded to the S3 bucket to retrain the model.

You will also see a notification sent to the email subscribed to the SNS topic.

To use the updated model version, log in to the central model registry account and approve the model version, which will be picked up during the next run of the inference pipeline triggered through the scheduled EventBridge rule.

Although the train and inference pipelines use a static dataset URL, you can have the dataset URL passed to the train and inference pipelines as dynamic variables in order to use updated datasets to retrain the model and generate predictions in a real-world scenario.

Clean up

To avoid incurring future charges, complete the following steps:

Remove the SageMaker Studio domain across all the AWS accounts.
Delete all the resources created outside SageMaker, including the S3 buckets, IAM roles, EventBridge rules, and SNS topic set up through Terraform in the prod account.
Delete the SageMaker pipelines created across accounts using the AWS Command Line Interface (AWS CLI).

Conclusion

Organizations often need to align with enterprise-wide toolsets to enable collaboration across different functional areas and teams. This collaboration ensures that your MLOps platform can adapt to evolving business needs and accelerates the adoption of ML across teams. This post explained how to create an MLOps framework in a multi-environment setup to enable automated model retraining, batch inference, and monitoring with Amazon SageMaker Model Monitor, model versioning with SageMaker Model Registry, and promotion of ML code and pipelines across environments with a CI/CD pipeline. We showcased this solution using a combination of AWS services and third-party toolsets. For instructions on implementing this solution, see the GitHub repository. You can also extend this solution by bringing in your own data sources and modeling frameworks.

About the Authors

Gayatri Ghanakota is a Sr. Machine Learning Engineer with AWS Professional Services. She is passionate about developing, deploying, and explaining AI/ ML solutions across various domains. Prior to this role, she led multiple initiatives as a data scientist and ML engineer with top global firms in the financial and retail space. She holds a master’s degree in Computer Science specialized in Data Science from the University of Colorado, Boulder.

Sunita Koppar is a Sr. Data Lake Architect with AWS Professional Services. She is passionate about solving customer pain points processing big data and providing long-term scalable solutions. Prior to this role, she developed products in internet, telecom, and automotive domains, and has been an AWS customer. She holds a master’s degree in Data Science from the University of California, Riverside.

Saswata Dash is a DevOps Consultant with AWS Professional Services. She has worked with customers across healthcare and life sciences, aviation, and manufacturing. She is passionate about all things automation and has comprehensive experience in designing and building enterprise-scale customer solutions in AWS. Outside of work, she pursues her passion for photography and catching sunrises.

Customizing coding companions for organizations

November 9, 2023

by Qing Sun Amazon AWS

Generative AI models for coding companions are mostly trained on publicly available source code and natural language text. While the large size of the training corpus enables the models to generate code for commonly used functionality, these models are unaware of code in private repositories and the associated coding styles that are enforced when developing with them. Consequently, the generated suggestions may require rewriting before they are appropriate for incorporation into an internal repository.

We can address this gap and minimize additional manual editing by embedding code knowledge from private repositories on top of a language model trained on public code. This is why we developed a customization capability for Amazon CodeWhisperer. In this post, we show you two possible ways of customizing coding companions using retrieval augmented generation and fine-tuning.

Our goal with CodeWhisperer customization capability is to enable organizations to tailor the CodeWhisperer model using their private repositories and libraries to generate organization-specific code recommendations that save time, follow organizational style and conventions, and avoid bugs or security vulnerabilities. This benefits enterprise software development and helps overcome the following challenges:

Sparse documentation or information for internal libraries and APIs that forces developers to spend time examining previously written code to replicate usage.
Lack of awareness and consistency in implementing enterprise-specific coding practices, styles and patterns.
Inadvertent use of deprecated code and APIs by developers.

By using internal code repositories for additional training that have already undergone code reviews, the language model can surface the use of internal APIs and code blocks that overcome the preceding list of problems. Because the reference code is already reviewed and meets the customer’s high bar, the likelihood of introducing bugs or security vulnerabilities is also minimized. And, by carefully selecting of the source files used for customization, organizations can reduce the use of deprecated code.

Design challenges

Customizing code suggestions based on an organization’s private repositories has many interesting design challenges. Deploying large language models (LLMs) to surface code suggestions has fixed costs for availability and variable costs due to inference based on the number of tokens generated. Therefore, having separate customizations for each customer and hosting them individually, thereby incurring additional fixed costs, can be prohibitively expensive. On the other hand, having multiple customizations simultaneously on the same system necessitates multi-tenant infrastructure to isolate proprietary code for each customer. Furthermore, the customization capability should surface knobs to enable the selection of the appropriate training subset from the internal repository using different metrics (for example, files with a history of fewer bugs or code that is recently committed into the repository). By selecting the code based on these metrics, the customization can be trained using higher-quality code which can improve the quality of code suggestions. Finally, even with continuously evolving code repositories, the cost associated with customization should be minimal to help enterprises realize cost savings from increased developer productivity.

A baseline approach to building customization could be to pretrain the model on a single training corpus composed of of the existing (public) pretraining dataset along with the (private) enterprise code. While this approach works in practice, it requires (redundant) individual pretraining using the public dataset for each enterprise. It also requires redundant deployment costs associated with hosting a customized model for each customer that only serves client requests originating from that customer. By decoupling the training of public and private code and deploying the customization on a multi-tenant system, these redundant costs can be avoided.

How to customize

At a high level, there are two types of possible customization techniques: retrieval-augmented generation (RAG) and fine-tuning (FT).

Retrieval-augmented generation: RAG finds matching pieces of code within a repository that is similar to a given code fragment (for example, code that immediately precedes the cursor in the IDE) and augments the prompt used to query the LLM with these matched code snippets. This enriches the prompt to help nudge the model into generating more relevant code. There are a few techniques explored in the literature along these lines. See Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, REALM, kNN-LM and RETRO.

Fine-tuning: FT takes a pre-trained LLM and trains it further on a specific, smaller codebase (compared to the pretraining dataset) to adapt it for the appropriate repository. Fine-tuning adjusts the LLM’s weights based on this training, making it more tailored to the organization’s unique needs.

Both RAG and fine-tuning are powerful tools for enhancing the performance of LLM-based customization. RAG can quickly adapt to private libraries or APIs with lower training complexity and cost. However, searching and augmenting retrieved code snippets to the prompt increases latency at runtime. Instead, fine-tuning does not require any augmentation of the context because the model is already trained on private libraries and APIs. However, it leads to higher training costs and complexities in serving the model, when multiple custom models have to be supported across multiple enterprise customers. As we discuss later, these concerns can be remedied by optimizing the approach further.

Retrieval augmented generation

There are a few steps involved in RAG:

Indexing

Given a private repository as input by the admin, an index is created by splitting the source code files into chunks. Put simply, chunking turns the code snippets into digestible pieces that are likely to be most informative for the model and are easy to retrieve given the context. The size of a chunk and how it is extracted from a file are design choices that affect the final result. For example, chunks can be split based on lines of code or based on syntactic blocks, and so on.

Administrator Workflow

Contextual search

Search a set of indexed code snippets based on a few lines of code above the cursor and retrieve relevant code snippets. This retrieval can happen using different algorithms. These choices might include:

Bag of words (BM25) – A bag-of-words retrieval function that ranks a set of code snippets based on the query term frequencies and code snippet lengths.

BM25-based retrieval

The following figure illustrates how BM25 works. In order to use BM25, an inverted index is built first. This is a data structure that maps different terms to the code snippets that those terms occur in. At search time, we look up code snippets based on the terms present in the query and score them based on the frequency.

Semantic retrieval [Contriever, UniXcoder] – Converts query and indexed code snippets into high-dimensional vectors and ranks code snippets based on semantic similarity. Formally, often k-nearest neighbors (KNN) or approximate nearest neighbor (ANN) search is often used to find other snippets with similar semantics.

Semantic retrieval

BM25 focuses on lexical matching. Therefore, replacing “add” with “delete” may not change the BM25 score based on the terms in the query, but the retrieved functionality may be the opposite of what is required. In contrast, semantic retrieval focuses on the functionality of the code snippet even though variable and API names may be different. Typically, a combination of BM25 and semantic retrievals can work well together to deliver better results.

Augmented inference

When developers write code, their existing program is used to formulate a query that is sent to the retrieval index. After retrieving multiple code snippets using one of the techniques discussed above, we prepend them to the original prompt. There are many design choices here, including the number of snippets to be retrieved, the relative placement of the snippets in the prompt, and the size of the snippet. The final design choice is primarily driven by empirical observation by exploring various approaches with the underlying language model and plays a key role in determining the accuracy of the approach. The contents from the returned chunks and the original code are combined and sent to the model to get customized code suggestions.

Developer workflow

Fine tuning:

Fine-tuning a language model is done for transfer learning in which the weights of a pre-trained model are trained on new data. The goal is to retain the appropriate knowledge from a model already trained on a large corpus and refine, replace, or add new knowledge from the new corpus — in our case, a new codebase. Simply training on a new codebase leads to catastrophic forgetting. For example, the language model may “forget” its knowledge of safety or the APIs that are sparsely used in the enterprise codebase to date. There are a variety of techniques like experience replay, GEM, and PP-TF that are employed to address this challenge.

Fine tuning

There are two ways of fine-tuning. One approach is to use the additional data without augmenting the prompt to fine-tune the model. Another approach is to augment the prompt during fine-tuning by retrieving relevant code suggestions. This helps improve the model’s ability to provide better suggestions in the presence of retrieved code snippets. The model is then evaluated on a held-out set of examples after it is trained. Subsequently, the customized model is deployed and used for generating the code suggestions.

Despite the advantages of using dedicated LLMs for generating code on private repositories, the costs can be prohibitive for small and medium-sized organizations. This is because dedicated compute resources are necessary even though they may be underutilized given the size of the teams. One way to achieve cost efficiency is serving multiple models on the same compute (for example, SageMaker multi-tenancy). However, language models require one or more dedicated GPUs across multiple zones to handle latency and throughput constraints. Hence, multi-tenancy of full model hosting on each GPU is infeasible.

We can overcome this problem by serving multiple customers on the same compute by using small adapters to the LLM. Parameter-efficient fine-tuning (PEFT) techniques like prompt tuning, prefix tuning, and Low-Rank Adaptation (LoRA) are used to lower training costs without any loss of accuracy. LoRA, especially, has seen great success at achieving similar (or better) accuracy than full-model fine-tuning. The basic idea is to design a low-rank matrix that is then added to the matrices with the original matrix weight of targeted layers of the model. Typically, these adapters are then merged with the original model weights for serving. This leads to the same size and architecture as the original neural network. Keeping the adapters separate, we can serve the same base model with many model adapters. This brings the economies of scale back to our small and medium-sized customers.

Low-Rank Adaptation (LoRA)

Measuring effectiveness of customization

We need evaluation metrics to assess the efficacy of the customized solution. Offline evaluation metrics act as guardrails against shipping customizations that are subpar compared to the default model. By building datasets out of a held-out dataset from within the provided repository, the customization approach can be applied to this dataset to measure effectiveness. Comparing the existing source code with the customized code suggestion quantifies the usefulness of the customization. Common measures used for this quantification include metrics like edit similarity, exact match, and CodeBLEU.

It is also possible to measure usefulness by quantifying how often internal APIs are invoked by the customization and comparing it with the invocations in the pre-existing source. Of course, getting both aspects right is important for a successful completion. For our customization approach, we have designed a tailor-made metric known as Customization Quality Index (CQI), a single user-friendly measure ranging between 1 and 10. The CQI metric shows the usefulness of the suggestions from the customized model compared to code suggestions with a generic public model.

Summary

We built Amazon CodeWhisperer customization capability based on a mixture of the leading technical techniques discussed in this blog post and evaluated it with user studies on developer productivity, conducted by Persistent Systems. In these two studies, commissioned by AWS, developers were asked to create a medical software application in Java that required use of their internal libraries. In the first study, developers without access to CodeWhisperer took (on average) ~8.2 hours to complete the task, while those who used CodeWhisperer (without customization) completed the task 62 percent faster in (on average) ~3.1 hours.

In the second study with a different set of developer cohorts, developers using CodeWhisperer that had been customized using their private codebase completed the task in 2.5 hours on average, 28 percent faster than those who were using CodeWhisperer without customization and completed the task in ~3.5 hours on average. We strongly believe tools like CodeWhisperer that are customized to your codebase have a key role to play in further boosting developer productivity and recommend giving it a run. For more information and to get started, visit the Amazon CodeWhisperer page.

About the authors

Qing Sun is a Senior Applied Scientist in AWS AI Labs and work on AWS CodeWhisperer, a generative AI-powered coding assistant. Her research interests lie in Natural Language Processing, AI4Code and generative AI. In the past, she had worked on several NLP-based services such as Comprehend Medical, a medical diagnosis system at Amazon Health AI and Machine Translation system at Meta AI. She received her PhD from Virginia Tech in 2017.

Arash Farahani is an Applied Scientist with Amazon CodeWhisperer. His current interests are in generative AI, search, and personalization. Arash is passionate about building solutions that resolve developer pain points. He has worked on multiple features within CodeWhisperer, and introduced NLP solutions into various internal workstreams that touch all Amazon developers. He received his PhD from University of Illinois at Urbana-Champaign in 2017.

Xiaofei Ma is an Applied Science Manager in AWS AI Labs. He joined Amazon in 2016 as an Applied Scientist within SCOT organization and then later AWS AI Labs in 2018 working on Amazon Kendra. Xiaofei has been serving as the science manager for several services including Kendra, Contact Lens, and most recently CodeWhisperer and CodeGuru Security. His research interests lie in the area of AI4Code and Natural Language Processing. He received his PhD from University of Maryland, College Park in 2010.

Murali Krishna Ramanathan is a Principal Applied Scientist in AWS AI Labs and co-leads AWS CodeWhisperer, a generative AI-powered coding companion. He is passionate about building software tools and workflows that help improve developer productivity. In the past, he built Piranha, an automated refactoring tool to delete code due to stale feature flags and led code quality initiatives at Uber engineering. He is a recipient of the Google faculty award (2015), ACM SIGSOFT Distinguished paper award (ISSTA 2016) and Maurice Halstead award (Purdue 2006). He received his PhD in Computer Science from Purdue University in 2008.

Ramesh Nallapati is a Senior Principal Applied Scientist in AWS AI Labs and co-leads CodeWhisperer, a generative AI-powered coding companion, and Titan Large Language Models at AWS. His interests are mainly in the areas of Natural Language Processing and Generative AI. In the past, Ramesh has provided science leadership in delivering many NLP-based AWS products such as Kendra, Quicksight Q and Contact Lens. He held research positions at Stanford, CMU and IBM Research, and received his Ph.D. in Computer Science from University of Massachusetts Amherst in 2006.

Build a medical imaging AI inference pipeline with MONAI Deploy on AWS

November 8, 2023

by Ming Qin Amazon AWS

This post is cowritten with Ming (Melvin) Qin, David Bericat and Brad Genereaux from NVIDIA.

Medical imaging AI researchers and developers need a scalable, enterprise framework to build, deploy, and integrate their AI applications. AWS and NVIDIA have come together to make this vision a reality. AWS, NVIDIA, and other partners build applications and solutions to make healthcare more accessible, affordable, and efficient by accelerating cloud connectivity of enterprise imaging. MONAI Deploy is one of the key modules within MONAI (Medical Open Network for Artificial Intelligence) developed by a consortium of academic and industry leaders, including NVIDIA. AWS HealthImaging (AHI) is a HIPAA-eligible, highly scalable, performant, and cost-effective medical imagery store. We have developed a MONAI Deploy connector to AHI to integrate medical imaging AI applications with subsecond image retrieval latencies at scale powered by cloud-native APIs. The MONAI AI models and applications can be hosted on Amazon SageMaker, which is a fully managed service to deploy machine learning (ML) models at scale. SageMaker takes care of setting up and managing instances for inference and provides built-in metrics and logs for endpoints that you can use to monitor and receive alerts. It also offers a variety of NVIDIA GPU instances for ML inference, as well as multiple model deployment options with automatic scaling, including real-time inference, serverless inference, asynchronous inference, and batch transform.

In this post, we demonstrate how to deploy a MONAI Application Package (MAP) with the connector to AWS HealthImaging, using a SageMaker multi-model endpoint for real-time inference and asynchronous inference. These two options cover a majority of near-real-time medical imaging inference pipeline use cases.

Solution overview

The following diagram illustrates the solution architecture.

Prerequisites

Complete the following prerequisite steps:

Use an AWS account with one of the following Regions, where AWS HealthImaging is available: North Virginia (us-east-1), Oregon (us-west-2), Ireland (eu-west-1), and Sydney (ap-southeast-2).
Create an Amazon SageMaker Studio domain and user profile with AWS Identity and Access Management (IAM) permission to access AWS HealthImaging.
Enable the JupyterLab v3 extension and install Imjoy-jupyter-extension if you want to visualize medical images on SageMaker notebook interactively using itkwidgets.

MAP connector to AWS HealthImaging

AWS HealthImaging imports DICOM P10 files and converts them into ImageSets, which are a optimized representation of a DICOM series. AHI provides API access to ImageSet metadata and ImageFrames. Metadata contains all DICOM attributes in a JSON document. ImageFrames are returned encoded in the High-Throughput JPEG2000 (HTJ2K) lossless format, which can be decoded extremely fast. ImageSets can be retrieved by using the AWS Command Line Interface (AWS CLI) or the AWS SDKs.

MONAI is a medical imaging AI framework that takes research breakthroughs and AI applications into clinical impact. MONAI Deploy is the processing pipeline that enables the end-to-end workflow, including packaging, testing, deploying, and running medical imaging AI applications in clinical production. It comprises the MONAI Deploy App SDK, MONAI Deploy Express, Workflow Manager, and Informatics Gateway. The MONAI Deploy App SDK provides ready-to-use algorithms and a framework to accelerate building medical imaging AI applications, as well as utility tools to package the application into a MAP container. The built-in standards-based functionalities in the app SDK allow the MAP to smoothly integrate into health IT networks, which requires the use of standards such as DICOM, HL7, and FHIR, and across data center and cloud environments. MAPs can use both predefined and customized operators for DICOM image loading, series selection, model inference, and postprocessing

We have developed a Python module using the AWS HealthImaging Python SDK Boto3. You can pip install it and use the helper function to retrieve DICOM Service-Object Pair (SOP) instances as follows:

!pip install -q AHItoDICOMInterface
from AHItoDICOMInterface.AHItoDICOM import AHItoDICOM
helper = AHItoDICOM()
instances = helper.DICOMizeImageSet(datastore_id=datastoreId , image_set_id=next(iter(imageSetIds)))

The output SOP instances can be visualized using the interactive 3D medical image viewer itkwidgets in the following notebook. The AHItoDICOM class takes advantage of multiple processes to retrieve pixel frames from AWS HealthImaging in parallel, and decode the HTJ2K binary blobs using the Python OpenJPEG library. The ImageSetIds come from the output files of a given AWS HealthImaging import job. Given the DatastoreId and import JobId, you can retrieve the ImageSetId, which is equivalent to the DICOM series instance UID, as follows:

imageSetIds = {}
try:
    response = s3.head_object(Bucket=OutputBucketName, Key=f"output/{res_createstore['datastoreId']}-DicomImport-{res_startimportjob['jobId']}/job-output-manifest.json")
    if response['ResponseMetadata']['HTTPStatusCode'] == 200:
        data = s3.get_object(Bucket=OutputBucketName, Key=f"output/{res_createstore['datastoreId']}-DicomImport-{res_startimportjob['jobId']}/SUCCESS/success.ndjson")
        contents = data['Body'].read().decode("utf-8")
        for l in contents.splitlines():
            isid = json.loads(l)['importResponse']['imageSetId']
            if isid in imageSetIds:
                imageSetIds[isid]+=1
            else:
                imageSetIds[isid]=1
except ClientError:
    pass

With ImageSetId, you can retrieve the DICOM header metadata and image pixels separately using native AWS HealthImaging API functions. The DICOM exporter aggregates the DICOM headers and image pixels into the Pydicom dataset, which can be processed by the MAP DICOM data loader operator. Using the DICOMizeImageSet()function, we have created a connector to load image data from AWS HealthImaging, based on the MAP DICOM data loader operator:

class AHIDataLoaderOperator(Operator):
    def __init__(self, ahi_client, must_load: bool = True, *args, **kwargs):
        self.ahi_client = ahi_client
        …
        def _load_data(self, input_obj: string):
            study_dict = {}
            series_dict = {}
            sop_instances = self.ahi_client.DICOMizeImageSet(input_obj['datastoreId'], input_obj['imageSetId'])

In the preceding code, ahi_client is an instance of the AHItoDICOM DICOM exporter class, with data retrieval functions illustrated. We have included this new data loader operator into a 3D spleen segmentation AI application created by the MONAI Deploy App SDK. You can first explore how to create and run this application on a local notebook instance, and then deploy this MAP application into SageMaker managed inference endpoints.

SageMaker asynchronous inference

A SageMaker asynchronous inference endpoint is used for requests with large payload sizes (up to 1 GB), long processing times (up to 15 minutes), and near-real-time latency requirements. When there are no requests to process, this deployment option can downscale the instance count to zero for cost savings, which is ideal for medical imaging ML inference workloads. Follow the steps in the sample notebook to create and invoke the SageMaker asynchronous inference endpoint. To create an asynchronous inference endpoint, you will need to create a SageMaker model and endpoint configuration first. To create a SageMaker model, you will need to load a model.tar.gz package with a defined directory structure into a Docker container. The model.tar.gz package includes a pre-trained spleen segmentation model.ts file and a customized inference.py file. We have used a prebuilt container with Python 3.8 and PyTorch 1.12.1 framework versions to load the model and run predictions.

In the customized inference.py file, we instantiate an AHItoDICOM helper class from AHItoDICOMInterface and use it to create a MAP instance in the model_fn() function, and we run the MAP application on every inference request in the predict_fn() function:

from app import AISpleenSegApp
from AHItoDICOMInterface.AHItoDICOM import AHItoDICOM
helper = AHItoDICOM()
def model_fn(model_dir, context):
    …
    monai_app_instance = AISpleenSegApp(helper, do_run=False,path="/home/model-server")

def predict_fn(input_data, model):
    with open('/home/model-server/inputImageSets.json', 'w') as f:
        f.write(json.dumps(input_data))
        output_folder = "/home/model-server/output"
        if not os.path.exists(output_folder):
            os.makedirs(output_folder)
            model.run(input='/home/model-server/inputImageSets.json', output=output_folder, workdir='/home/model-server', model='/opt/ml/model/model.ts')

To invoke the asynchronous endpoint, you will need to upload the request input payload to Amazon Simple Storage Service (Amazon S3), which is a JSON file specifying the AWS HealthImaging datastore ID and ImageSet ID to run inference on:

sess = sagemaker.Session()
InputLocation = sess.upload_data('inputImageSets.json', bucket=sess.default_bucket(), key_prefix=prefix, extra_args={"ContentType": "application/json"})
response = runtime_sm_client.invoke_endpoint_async(EndpointName=endpoint_name, InputLocation=InputLocation, ContentType="application/json", Accept="application/json")
output_location = response["OutputLocation"]

The output can be found in Amazon S3 as well.

SageMaker multi-model real-time inference

SageMaker real-time inference endpoints meet interactive, low-latency requirements. This option can host multiple models in one container behind one endpoint, which is a scalable and cost-effective solution to deploying several ML models. A SageMaker multi-model endpoint uses NVIDIA Triton Inference Server with GPU to run multiple deep learning model inferences.

In this section, we walk through how to create and invoke a multi-model endpoint adapting your own inference container in the following sample notebook. Different models can be served in a shared container on the same fleet of resources. Multi-model endpoints reduce deployment overhead and scale model inferences based on the traffic patterns to the endpoint. We used AWS developer tools including Amazon CodeCommit, Amazon CodeBuild, and Amazon CodePipeline to build the customized container for SageMaker model inference. We prepared a model_handler.py to bring your own container instead of the inference.py file in the previous example, and implemented the initialize(), preprocess(), and inference() functions:

from app import AISpleenSegApp
from AHItoDICOMInterface.AHItoDICOM import AHItoDICOM
class ModelHandler(object):
    def __init__(self):
        self.initialized = False
        self.shapes = None
    def initialize(self, context):
        self.initialized = True
        properties = context.system_properties
        model_dir = properties.get("model_dir")
        gpu_id = properties.get("gpu_id")
        helper = AHItoDICOM()
        self.monai_app_instance = AISpleenSegApp(helper, do_run=False, path="/home/model-server/")
    def preprocess(self, request):
        inputStr = request[0].get("body").decode('UTF8')
        datastoreId = json.loads(inputStr)['inputs'][0]['datastoreId']
        imageSetId = json.loads(inputStr)['inputs'][0]['imageSetId']
        with open('/tmp/inputImageSets.json', 'w') as f:
            f.write(json.dumps({"datastoreId": datastoreId, "imageSetId": imageSetId}))
        return '/tmp/inputImageSets.json'
    def inference(self, model_input):
        self.monai_app_instance.run(input=model_input, output="/home/model-server/output/", workdir="/home/model-server/", model=os.environ["model_dir"]+"/model.ts")

After the container is built and pushed to Amazon Elastic Container Registry (Amazon ECR), you can create SageMaker model with it, plus different model packages (tar.gz files) in a given Amazon S3 path:

model_name = "DEMO-MONAIDeployModel" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
model_url = "s3://{}/{}/".format(bucket, prefix)
container = "{}.dkr.ecr.{}.amazonaws.com/{}:dev".format( account_id, region, prefix )
container = {"Image": container, "ModelDataUrl": model_url, "Mode": "MultiModel"}
create_model_response = sm_client.create_model(ModelName=model_name, ExecutionRoleArn=role, PrimaryContainer=container)

It’s noteworthy that the model_url here only specifies the path to a folder of tar.gz files, and you specify which model package to use for inference when you invoke the endpoint, as shown in the following code:

Payload = {"inputs": [ {"datastoreId": datastoreId, "imageSetId": next(iter(imageSetIds))} ]}
response = runtime_sm_client.invoke_endpoint(EndpointName=endpoint_name, ContentType="application/json", Accept="application/json", TargetModel="model.tar.gz", Body=json.dumps(Payload))

We can add more models to the existing multi-model inference endpoint without having to update the endpoint or create a new one.

Clean up

Don’t forget to complete the Delete the hosting resources step in the lab-3 and lab-4 notebooks to delete the SageMaker inference endpoints. You should turn down the SageMaker notebook instance to save costs as well. Finally, you can either call the AWS HealthImaging API function or use the AWS HealthImaging console to delete the image sets and data store created earlier:

for s in imageSetIds.keys():
    medicalimaging.deleteImageSet(datastoreId, s)
medicalimaging.deleteDatastore(datastoreId)

Conclusion

In this post, we showed you how to create a MAP connector to AWS HealthImaging, which is reusable in applications built with the MONAI Deploy App SDK, to integrate with and accelerate image data retrieval from a cloud-native DICOM store to medical imaging AI workloads. The MONAI Deploy SDK can be used to support hospital operations. We also demonstrated two hosting options to deploy MAP AI applications on SageMaker at scale.

Go through the example notebooks in the GitHub repository to learn more about how to deploy MONAI applications on SageMaker with medical images stored in AWS HealthImaging. To know what AWS can do for you, contact an AWS representative.

For additional resources, refer to the following:

About the Authors

Ming (Melvin) Qin is an independent contributor on the Healthcare team at NVIDIA, focused on developing an AI inference application framework and platform to bring AI to medical imaging workflows. Before joining NVIDIA in 2018 as a founding member of Clara, Ming spent 15 years developing Radiology PACS and Workflow SaaS as lead engineer/architect at Stentor Inc., later acquired by Philips Healthcare to form its Enterprise Imaging.

David Bericat is a product manager for Healthcare at NVIDIA, where he leads the Project MONAI Deploy working group to bring AI from research to clinical deployments. His passion is to accelerate health innovation globally translating it to true clinical impact. Previously, David worked at Red Hat, implementing open source principles at the intersection of AI, cloud, edge computing, and IoT. His proudest moments include hiking to the Everest base camp and playing soccer for over 20 years.

Brad Genereaux is Global Lead, Healthcare Alliances at NVIDIA, where he is responsible for developer relations with a focus in medical imaging to accelerate artificial intelligence and deep learning, visualization, virtualization, and analytics solutions. Brad evangelizes the ubiquitous adoption and integration of seamless healthcare and medical imaging workflows into everyday clinical practice, with more than 20 years of experience in healthcare IT.

Gang Fu is a Healthcare Solutions Architect at AWS. He holds a PhD in Pharmaceutical Science from the University of Mississippi and has over 10 years of technology and biomedical research experience. He is passionate about technology and the impact it can make on healthcare.

JP Leger is a Senior Solutions Architect supporting academic medical centers and medical imaging workflows at AWS. He has over 20 years of expertise in software engineering, healthcare IT, and medical imaging, with extensive experience architecting systems for performance, scalability, and security in distributed deployments of large data volumes on premises, in the cloud, and hybrid with analytics and AI.

Chris Hafey is a Principal Solutions Architect at Amazon Web Services. He has over 25 years’ experience in the medical imaging industry and specializes in building scalable high-performance systems. He is the creator of the popular CornerstoneJS open source project, which powers the popular OHIF open source zero footprint viewer. He contributed to the DICOMweb specification and continues to work towards improving its performance for web-based viewing.

Optimize for sustainability with Amazon CodeWhisperer

November 8, 2023

by Isha Dua Amazon AWS

This post explores how Amazon CodeWhisperer can help with code optimization for sustainability through increased resource efficiency. Computationally resource-efficient coding is one technique that aims to reduce the amount of energy required to process a line of code and, as a result, aid companies in consuming less energy overall. In this era of cloud computing, developers are now harnessing open source libraries and advanced processing power available to them to build out large-scale microservices that need to be operationally efficient, performant, and resilient. However, modern applications often consist of extensive code, demanding significant computing resources. Although the direct environmental impact might not be obvious, sub-optimized code amplifies the carbon footprint of modern applications through factors like heightened energy consumption, prolonged hardware usage, and outdated algorithms. In this post, we discover how Amazon CodeWhisperer helps address these concerns and reduce the environmental footprint of your code.

Amazon CodeWhisperer is a generative AI coding companion that speeds up software development by making suggestions based on the existing code and natural language comments, reducing the overall development effort and freeing up time for brainstorming, solving complex problems, and authoring differentiated code. Amazon CodeWhisperer can help developers streamline their workflows, enhance code quality, build stronger security postures, generate robust test suites, and write computationally resource friendly code, which can help you optimize for environmental sustainability. It is available as part of the Toolkit for Visual Studio Code, AWS Cloud9, JupyterLab, Amazon SageMaker Studio, AWS Lambda, AWS Glue, and JetBrains IntelliJ IDEA. Amazon CodeWhisperer currently supports Python, Java, JavaScript, TypeScript, C#, Go, Rust, PHP, Ruby, Kotlin, C, C++, Shell scripting, SQL, and Scala.

Impact of unoptimized code on cloud computing and application carbon footprint

AWS’s infrastructure is 3.6 times more energy efficient than the median of surveyed US enterprise data centers and up to 5 times more energy efficient than the average European enterprise data center. Therefore, AWS can help lower the workload carbon footprint up to 96%. You can now use Amazon CodeWhisperer to write quality code with reduced resource usage and energy consumption, and meet scalability objectives while benefiting from AWS energy efficient infrastructure.

Increased resource usage

Unoptimized code can result in the ineffective usage of cloud computing resources. As a result, more virtual machines (VMs) or containers may be required, increasing resource allocation, energy use, and the related carbon footprint of the workload. You might encounter increases in the following:

CPU utilization – Unoptimized code often contains inefficient algorithms or coding practices that require excessive CPU cycles to run.
Memory consumption – Inefficient memory management in unoptimized code can result in unnecessary memory allocation, deallocation, or data duplication.
Disk I/O operations – Inefficient code can perform excessive input/output (I/O) operations. For example, if data is read from or written to disk more frequently than necessary, it can increase disk I/O utilization and latency.
Network usage – Due to ineffective data transmission techniques or duplicate communication, poorly optimized code may cause an excessive amount of network traffic. This can lead to higher latency and increased network bandwidth utilization. Increased network utilization may result in higher expenses and resource needs in situations where network resources are taxed based on usage, such as in cloud computing.

Higher energy consumption

Infrastructure-supporting applications with inefficient code uses more processing power. Overusing computing resources due to inefficient, bloated code can result in higher energy consumption and heat production, which subsequently necessitates more energy for cooling. Along with the servers, the cooling systems, the infrastructure for power distribution, and other auxiliary elements also consume energy.

Scalability challenges

In application development, scalability issues can be caused by unoptimized code. Such code may not scale effectively as the task grows, necessitating more resources and using more energy. This increases the energy consumed by these code fragments. As mentioned previously, inefficient or wasteful code has a compounding effect at scale.

The compounded energy savings from optimizing code that customers run in certain data centers is even further compounded when we take into consideration that cloud providers such as AWS have dozens of data centers around the world.

Amazon CodeWhisperer uses machine learning (ML) and large language models to provide code recommendations in real time based on the original code and natural language comments, and provides code recommendations that could be more efficient. The program’s infrastructure usage efficiency can be increased by optimizing the code using strategies including algorithmic advancements, effective memory management, and a reduction in pointless I/O operations.

Code generation, completion, and suggestions

Let’s examine several situations where Amazon CodeWhisperer can be useful.

By automating the development of repetitive or complex code, code generation tools minimize the possibility of human error while focusing on platform-specific optimizations. By using established patterns or templates, these programs may produce code that more consistently adheres to sustainability best practices. Developers can produce code that complies with particular coding standards, helping deliver more consistent and dependable code throughout the project. The resulting code may be more efficient and because it removes human coding variations, and can be more legible, improving development speed. It can automatically implement ways to reduce the application program size and length, such as deleting superfluous code, improving variable storage, or using compression methods. These optimizations can aid in memory consumption optimization and boosts overall system efficiency by shrinking the package size.

Generative AI has the potential to make programming more sustainable by optimizing resource allocation. Looking holistically at an application’s carbon footprint is important. Tools like Amazon CodeGuru Profiler can collect performance data to optimize latency between components. The profiling service examines code runs and identifies potential improvements. Developers can then manually refine the auto generated code based on these findings to further improve energy efficiency. The combination of generative AI, profiling, and human oversight creates a feedback loop that can continuously improve code efficiency and reduce environmental impact.

The following screenshot shows you results generated from CodeGuru Profiler in latency mode, which includes network and disk I/O. In this case, the application still spends most of its time in ImageProcessor.extractTasks (second bottom row), and almost all the time inside that is runnable, which means that it wasn’t waiting for anything. You can view these thread states by changing to latency mode from CPU mode. This can help you get a good idea of what is impacting the wall clock time of the application. For more information, refer to Reducing Your Organization’s Carbon Footprint with Amazon CodeGuru Profiler.

Generating test cases

Amazon CodeWhisperer can help suggest test cases and verify the code’s functionality by considering boundary values, edge cases, and other potential issues that may need to be tested. Also, Amazon CodeWhisperer can simplify creating repetitive code for unit testing. For example, if you need to create sample data using INSERT statements, Amazon CodeWhisperer can generate the necessary inserts based on a pattern. The overall resource requirements for software testing can also be decreased by identifying and optimizing resource-intensive test cases or removing redundant ones. Improved test suites have the potential to make the application become more environmentally friendly by increasing energy efficiency, decreasing resource consumption, minimizing waste, and reducing the workload carbon footprint.

For a more hands-on experience with Amazon CodeWhisperer, refer to Optimize software development with Amazon CodeWhisperer. The post showcases the code recommendations from Amazon CodeWhisperer in Amazon SageMaker Studio. It also demonstrates the suggested code based on comments for loading and analyzing a dataset.

Conclusion

In this post, we learned how Amazon CodeWhisperer can help developers write optimized, more sustainable code. Using advanced ML models, Amazon CodeWhisperer analyzes your code and provides personalized recommendations for improving efficiency, which can reduce costs and help decrease the carbon footprint.

By suggesting minor adjustments and alternative approaches, Amazon CodeWhisperer enables developers to significantly cut resource usage and emissions without sacrificing functionality. Whether you’re looking to optimize an existing code base or ensure new projects are resource efficient, Amazon CodeWhisperer can be an invaluable aid. To learn more about Amazon CodeWhisperer and AWS Sustainability resources for code optimization, consider the following next steps:

Get started with Amazon CodeWhisperer
Learn how Amazon CodeWhisperer can accelerate and enhance software development
Build a serverless app with Amazon CodeWhisperer
Learn about the AWS Well-Architected Pillar for Sustainability-Software and Architecture Pattern
After you have updated your code, make sure to review runtime performance data from your live applications, and recommendations that can help you fine-tune your application performance on AWS using AWS CodeGuru Profiler.

About the authors

Isha Dua is a Senior Solutions Architect based in the San Francisco Bay Area. She helps AWS enterprise customers grow by understanding their goals and challenges, and guides them on how they can architect their applications in a cloud-native manner while ensuring resilience and scalability. She’s passionate about machine learning technologies and environmental sustainability.

Ajjay Govindaram is a Senior Solutions Architect at AWS. He works with strategic customers who are using AI/ML to solve complex business problems. His experience lies in providing technical direction as well as design assistance for modest to large-scale AI/ML application deployments. His knowledge ranges from application architecture to big data, analytics, and machine learning. He enjoys listening to music while resting, experiencing the outdoors, and spending time with his loved ones.

Erick Irigoyen is a Solutions Architect at Amazon Web Services focusing on clients in the Semiconductors and Electronics industry. He works closely with customers to understand their business challenges and identify how AWS can be leveraged to achieve their strategic goals. His work has primarily focused on projects related to Artificial Intelligence and Machine Learning (AI/ML). Prior to joining AWS, he was a Senior Consultant at Deloitte’s Advanced Analytics practice where he led workstreams in several engagements across the United States focusing on Analytics and AI/ML. Erick holds a B.S. in Business from the University of San Francisco and an M.S. in Analytics from North Carolina State University.

Harnessing the power of enterprise data with generative AI: Insights from Amazon Kendra, LangChain, and large language models

November 7, 2023

by Jin Tan Ruan Amazon AWS

Large language models (LLMs) with their broad knowledge, can generate human-like text on almost any topic. However, their training on massive datasets also limits their usefulness for specialized tasks. Without continued learning, these models remain oblivious to new data and trends that emerge after their initial training. Furthermore, the cost to train new LLMs can prove prohibitive for many enterprise settings. However, it’s possible to cross-reference a model answer with the original specialized content, thereby avoiding the need to train a new LLM model, using Retrieval-Augmented Generation (RAG).

RAG empowers LLMs by giving them the ability to retrieve and incorporate external knowledge. Instead of relying solely on their pre-trained knowledge, RAG allows models to pull data from documents, databases, and more. The model then skillfully integrates this outside information into its generated text. By sourcing context-relevant data, the model can provide informed, up-to-date responses tailored to your use case. The knowledge augmentation also reduces the likelihood of hallucinations and inaccurate or nonsensical text. With RAG, foundation models become adaptable experts that evolve as your knowledge base grows.

Today, we are excited to unveil three generative AI demos, licensed under MIT-0 license:

Amazon Kendra with foundational LLM – Utilizes the deep search capabilities of Amazon Kendra combined with the expansive knowledge of LLMs. This integration provides precise and context-aware answers to complex queries by drawing from a diverse range of sources.
Embeddings model with foundational LLM – Merges the power of embeddings—a technique to capture semantic meanings of words and phrases—with the vast knowledge base of LLMs. This synergy enables more accurate topic modeling, content recommendation, and semantic search capabilities.
Foundation Models Pharma Ad Generator – A specialized application tailored for the pharmaceutical industry. Harnessing the generative capabilities of foundational models, this tool creates convincing and compliant pharmaceutical advertisements, ensuring content adheres to industry standards and regulations.

These demos can be seamlessly deployed in your AWS account, offering foundational insights and guidance on utilizing AWS services to create a state-of-the-art LLM generative AI question and answer bot and content generation.

In this post, we explore how RAG combined with Amazon Kendra or custom embeddings can overcome these challenges and provide refined responses to natural language queries.

Solution overview

By adopting this solution, you can gain the following benefits:

Improved information access – RAG allows models to pull in information from vast external sources, which can be especially useful when the pre-trained model’s knowledge is outdated or incomplete.
Scalability – Instead of training a model on all available data, RAG allows models to retrieve relevant information on the fly. This means that as new data becomes available, it can be added to the retrieval database without needing to retrain the entire model.
Memory efficiency – LLMs require significant memory to store parameters. With RAG, the model can be smaller because it doesn’t need to memorize all details; it can retrieve them when needed.
Dynamic knowledge update – Unlike conventional models with a set knowledge endpoint, RAG’s external database can undergo regular updates, granting the model access to up-to-date information. The retrieval function can be fine-tuned for distinct tasks. For example, a medical diagnostic task can source data from medical journals, ensuring the model garners expert and pertinent insights.
Bias mitigation – The ability to draw from a well-curated database offers the potential to minimize biases by ensuring balanced and impartial external sources.

Before diving into the integration of Amazon Kendra with foundational LLMs, it’s crucial to equip yourself with the necessary tools and system requirements. Having the right setup in place is the first step towards a seamless deployment of the demos.

Prerequisites

You must have the following prerequisites:

An AWS account.
The AWS Command Line Interface (AWS CLI) v2. For instructions, refer to Install or update the latest version of the AWS CLI.
Python 3.6 or later.
Node.js 18.x or later.
Docker v20.10 or later.
For this post, you use the AWS Cloud Development Kit (AWS CDK) using Python. Follow the instructions in Getting Started with the AWS CDK to set up your local environment and bootstrap your development account.
This AWS CDK project requires Amazon SageMaker instances (two ml.g5.48xlarge). You may need to request a quota increase.

Although it’s possible to set up and deploy the infrastructure detailed in this tutorial from your local computer, AWS Cloud9 offers a convenient alternative. Pre-equipped with tools like AWS CLI, AWS CDK, and Docker, AWS Cloud9 can function as your deployment workstation. To use this service, simply set up the environment via the AWS Cloud9 console.

With the prerequisites out of the way, let’s dive into the features and capabilities of Amazon Kendra with foundational LLMs.

Amazon Kendra with foundational LLM

Amazon Kendra is an advanced enterprise search service enhanced by machine learning (ML) that provides out-of-the-box semantic search capabilities. Utilizing natural language processing (NLP), Amazon Kendra comprehends both the content of documents and the underlying intent of user queries, positioning it as a content retrieval tool for RAG based solutions. By using the high-accuracy search content from Kendra as a RAG payload, you can get better LLM responses. The use of Amazon Kendra in this solution also enables personalized search by filtering responses according to the end-user content access permissions.

The following diagram shows the architecture of a generative AI application using the RAG approach.

Documents are processed and indexed by Amazon Kendra through the Amazon Simple Storage Service (Amazon S3) connector. Customer requests and contextual data from Amazon Kendra are directed to an Amazon Bedrock foundation model. The demo lets you choose between Amazon’s Titan, AI21’s Jurassic, and Anthropic’s Claude models supported by Amazon Bedrock. The conversation history is saved in Amazon DynamoDB, offering added context for the LLM to generate responses.

We have provided this demo in the GitHub repo. Refer to the deployment instructions within the readme file for deploying it into your AWS account.

The following steps outline the process when a user interacts with the generative AI app:

The user logs in to the web app authenticated by Amazon Cognito.
The user uploads one or more documents into Amazon S3.
The user runs an Amazon Kendra sync job to ingest S3 documents into the Amazon Kendra index.
The user’s question is routed through a secure WebSocket API hosted on Amazon API Gateway backed by a AWS Lambda function.
The Lambda function, empowered by the LangChain framework—a versatile tool designed for creating applications driven by AI language models—connects to the Amazon Bedrock endpoint to rephrase the user’s question based on chat history. After rephrasing, the question is forwarded to Amazon Kendra using the Retrieve API. In response, the Amazon Kendra index displays search outcomes, providing excerpts from pertinent documents sourced from the enterprise’s ingested data.
The user’s question along with the data retrieved from the index are sent as a context in the LLM prompt. The response from the LLM is stored as chat history within DynamoDB.
Finally, the response from the LLM is sent back to the user.

Document indexing workflow

The following is the procedure for processing and indexing documents:

Users submit documents via the user interface (UI).
Documents are transferred to an S3 bucket utilizing the AWS Amplify API.
Amazon Kendra indexes new documents in the S3 bucket through the Amazon Kendra S3 connector.

Benefits

The following list highlights the advantages of this solution:

Enterprise-level retrieval – Amazon Kendra is designed for enterprise search, making it suitable for organizations with vast amounts of structured and unstructured data.
Semantic understanding – The ML capabilities of Amazon Kendra ensure that retrieval is based on deep semantic understanding and not just keyword matches.
Scalability – Amazon Kendra can handle large-scale data sources and provides quick and relevant search results.
Flexibility – The foundational model can generate answers based on a wide range of contexts, ensuring the system remains versatile.
Integration capabilities – Amazon Kendra can be integrated with various AWS services and data sources, making it adaptable for different organizational needs.

Embeddings model with foundational LLM

An embedding is a numerical vector that represents the core essence of diverse data types, including text, images, audio, and documents. This representation not only captures the data’s intrinsic meaning, but also adapts it for a wide range of practical applications. Embedding models, a branch of ML, transform complex data, such as words or phrases, into continuous vector spaces. These vectors inherently grasp the semantic connections between data, enabling deeper and more insightful comparisons.

RAG seamlessly combines the strengths of foundational models, like transformers, with the precision of embeddings to sift through vast databases for pertinent information. Upon receiving a query, the system utilizes embeddings to identify and extract relevant sections from an extensive body of data. The foundational model then formulates a contextually precise response based on this extracted information. This perfect synergy between data retrieval and response generation allows the system to provide thorough answers, drawing from the vast knowledge stored in expansive databases.

In the architectural layout, based on their UI selection, users are guided to either the Amazon Bedrock or Amazon SageMaker JumpStart foundation models. Documents undergo processing, and vector embeddings are produced by the embeddings model. These embeddings are then indexed using FAISS to enable efficient semantic search. Conversation histories are preserved in DynamoDB, enriching the context for the LLM to craft responses.

The following diagram illustrates the solution architecture and workflow.

We have provided this demo in the GitHub repo. Refer to the deployment instructions within the readme file for deploying it into your AWS account.

Embeddings model

The responsibilities of the embeddings model are as follows:

This model is responsible for converting text (like documents or passages) into dense vector representations, commonly known as embeddings.
These embeddings capture the semantic meaning of the text, allowing for efficient and semantically meaningful comparisons between different pieces of text.
The embeddings model can be trained on the same vast corpus as the foundational model or can be specialized for specific domains.

Q&A workflow

The following steps describe the workflow of the question answering over documents:

The user logs in to the web app authenticated by Amazon Cognito.
The user uploads one or more documents to Amazon S3.
Upon document transfer, an S3 event notification triggers a Lambda function, which then calls the SageMaker embedding model endpoint to generate embeddings for the new document. The embeddings model converts the question into a dense vector representation (embedding). The resulting vector file is securely stored within the S3 bucket.
The FAISS retriever compares this question embedding with the embeddings of all documents or passages in the database to find the most relevant passages.
The passages, along with the user’s question, are provided as context to the foundational model. The Lambda function uses the LangChain library and connects to the Amazon Bedrock or SageMaker JumpStart endpoint with a context-stuffed query.
The response from the LLM is stored in DynamoDB along with the user’s query, the timestamp, a unique identifier, and other arbitrary identifiers for the item such as question category. Storing the question and answer as discrete items allows the Lambda function to easily recreate a user’s conversation history based on the time when questions were asked.
Finally, the response is sent back to the user via a HTTPs request through the API Gateway WebSocket API integration response.

Benefits

The following list describe the benefits of this solution:

Semantic understanding – The embeddings model ensures that the retriever selects passages based on deep semantic understanding, not just keyword matches.
Scalability – Embeddings allow for efficient similarity comparisons, making it feasible to search through vast databases of documents quickly.
Flexibility – The foundational model can generate answers based on a wide range of contexts, ensuring the system remains versatile.
Domain adaptability – The embeddings model can be trained or fine-tuned for specific domains, allowing the system to be adapted for various applications.

Foundation Models Pharma Ad Generator

In today’s fast-paced pharmaceutical industry, efficient and localized advertising is more crucial than ever. This is where an innovative solution comes into play, using the power of generative AI to craft localized pharma ads from source images and PDFs. Beyond merely speeding up the ad generation process, this approach streamlines the Medical Legal Review (MLR) process. MLR is a rigorous review mechanism in which medical, legal, and regulatory teams meticulously evaluate promotional materials to guarantee their accuracy, scientific backing, and regulatory compliance. Traditional content creation methods can be cumbersome, often requiring manual adjustments and extensive reviews to ensure alignment with regional compliance and relevance. However, with the advent of generative AI, we can now automate the crafting of ads that truly resonate with local audiences, all while upholding stringent standards and guidelines.

The following diagram illustrates the solution architecture.

In the architectural layout, based on their selected model and ad preferences, users are seamlessly guided to the Amazon Bedrock foundation models. This streamlined approach ensures that new ads are generated precisely according to the desired configuration. As part of the process, documents are efficiently handled by Amazon Textract, with the resultant text securely stored in DynamoDB. A standout feature is the modular design for image and text generation, granting you the flexibility to independently regenerate any component as required.

We have provided this demo in the GitHub repo. Refer to the deployment instructions within the readme file for deploying it into your AWS account.

Content generation workflow

The following steps outline the process for content generation:

The user chooses their document, source image, ad placement, language, and image style.
Secure access to the web application is ensured through Amazon Cognito authentication.
The web application’s front end is hosted via Amplify.
A WebSocket API, managed by API Gateway, facilitates user requests. These requests are authenticated through AWS Identity and Access Management (IAM).
Integration with Amazon Bedrock includes the following steps:
- A Lambda function employs the LangChain library to connect to the Amazon Bedrock endpoint using a context-rich query.
- The text-to-text foundational model crafts a contextually appropriate ad based on the given context and settings.
- The text-to-image foundational model creates a tailored image, influenced by the source image, chosen style, and location.
The user receives the response through an HTTPS request via the integrated API Gateway WebSocket API.

Document and image processing workflow

The following is the procedure for processing documents and images:

The user uploads assets via the specified UI.
The Amplify API transfers the documents to an S3 bucket.
After the asset is transferred to Amazon S3, one of the following actions takes place:
- If it’s a document, a Lambda function uses Amazon Textract to process and extract text for ad generation.
- If it’s an image, the Lambda function converts it to base64 format, suitable for the Stable Diffusion model to create a new image from the source.
The extracted text or base64 image string is securely saved in DynamoDB.

Benefits

The following list describes the benefits of this solution:

Efficiency – The use of generative AI significantly accelerates the ad generation process, eliminating the need for manual adjustments.
Compliance adherence – The solution ensures that generated ads adhere to specific guidance and regulations, such as the FDA’s guidelines for marketing.
Cost-effective – By automating the creation of tailored ads, companies can significantly reduce costs associated with ad production and revisions.
Streamlined MLR process – The solution simplifies the MLR process, reducing friction points and ensuring smoother reviews.
Localized resonance – Generative AI produces ads that resonate with local audiences, ensuring relevance and impact in different regions.
Standardization – The solution maintains necessary standards and guidelines, ensuring consistency across all generated ads.
Scalability – The AI-driven approach can handle vast databases of source images and PDFs, making it feasible for large-scale ad generation.
Reduced manual intervention – The automation reduces the need for human intervention, minimizing errors and ensuring consistency.

You can deploy the infrastructure in this tutorial from your local computer or you can use AWS Cloud9 as your deployment workstation. AWS Cloud9 comes pre-loaded with the AWS CLI, AWS CDK, and Docker. If you opt for AWS Cloud9, create the environment from the AWS Cloud9 console.

Clean up

To avoid unnecessary cost, clean up all the infrastructure created via the AWS CloudFormation console or by running the following command on your workstation:

$ cdk destroy —all.

Additionally, remember to stop any SageMaker endpoints you initiated via the SageMaker console. Remember, deleting an Amazon Kendra index doesn’t remove the original documents from your storage.

Conclusion

Generative AI, epitomized by LLMs, heralds a paradigm shift in how we access and generate information. These models, while powerful, are often limited by the confines of their training data. RAG addresses this challenge, ensuring that the vast knowledge within these models is consistently infused with relevant, current insights.

Our RAG-based demos provide a tangible testament to this. They showcase the seamless synergy between Amazon Kendra, vector embeddings, and LLMs, creating a system where information is not only vast but also accurate and timely. As you dive into these demos, you’ll explore firsthand the transformational potential of merging pre-trained knowledge with the dynamic capabilities of RAG, resulting in outputs that are both trustworthy and tailored to enterprise content.

Although generative AI powered by LLMs opens up a new way of gaining information insights, these insights must be trustworthy and confined to enterprise content using the RAG approach. These RAG-based demos enable you to be equipped with insights that are accurate and up to date. The quality of these insights is dependent on semantic relevance, which is enabled by using Amazon Kendra and vector embeddings.

If you’re ready to further explore and harness the power of generative AI, here are your next steps:

Engage with our demos – The hands-on experience is invaluable. Explore the functionalities, understand the integrations, and familiarize yourself with the interface.
Deepen your knowledge – Take advantage of the resources available. AWS offers in-depth documentation, tutorials, and community support to aid in your AI journey.
Initiate a pilot project – Consider starting with a small-scale implementation of generative AI in your enterprise. This will provide insights into the system’s practicality and adaptability within your specific context.

For more information about generative AI applications on AWS, refer to the following:

Remember, the landscape of AI is constantly evolving. Stay updated, remain curious, and always be ready to adapt and innovate.

About The Authors

Jin Tan Ruan is a Prototyping Developer within the AWS Industries Prototyping and Customer Engineering (PACE) team, specializing in NLP and generative AI. With a background in software development and nine AWS certifications, Jin brings a wealth of experience to assist AWS customers in materializing their AI/ML and generative AI visions using the AWS platform. He holds a master’s degree in Computer Science & Software Engineering from the University of Syracuse. Outside of work, Jin enjoys playing video games and immersing himself in the thrilling world of horror movies.

Aravind Kodandaramaiah is a Senior Prototyping full stack solution builder within the AWS Industries Prototyping and Customer Engineering (PACE) team. He focuses on helping AWS customers turn innovative ideas into solutions with measurable and delightful outcomes. He is passionate about a range of topics, including cloud security, DevOps, and AI/ML, and can be usually found tinkering with these technologies.

Arjun Shakdher is a Developer on the AWS Industries Prototyping (PACE) team who is passionate about blending technology into the fabric of life. Holding a master’s degree from Purdue University, Arjun’s current role revolves around architecting and building cutting-edge prototypes that span an array of domains, presently prominently featuring the realms of AI/ML and IoT. When not immersed in code and digital landscapes, you’ll find Arjun indulging in the world of coffee, exploring the intricate mechanics of horology, or reveling in the artistry of automobiles.

Use generative AI to increase agent productivity through automated call summarization

November 6, 2023

by Chris Lott Amazon AWS

Your contact center serves as the vital link between your business and your customers. Every call to your contact center is an opportunity to learn more about your customers’ needs and how well you are meeting those needs.

Most contact centers require their agents to summarize their conversation after every call. Call summarization is a valuable tool that helps contact centers understand and gain insights from customer calls. Additionally, accurate call summaries enhance the customer journey by eliminating the need for customers to repeat information when transferred to another agent.

In this post, we explain how to use the power of generative AI to reduce the effort and improve the accuracy of creating call summaries and call dispositions. We also show how to get started quickly using the latest version of our open source solution, Live Call Analytics with Agent Assist.

Challenges with call summaries

As contact centers collect more speech data, the need for efficient call summarization has grown significantly. However, most summaries are empty or inaccurate because manually creating them is time-consuming, impacting agents’ key metrics like average handle time (AHT). Agents report that summarizing can take up to a third of the total call, so they skip it or fill in incomplete information. This hurts the customer experience—long holds frustrate customers while the agent types, and incomplete summaries mean asking customers to repeat information when transferred between agents.

The good news is that automating and solving the summarization challenge is now possible through generative AI.

Generative AI is helping summarize customer calls accurately and efficiently

Generative AI is powered by very large machine learning (ML) models referred to as foundation models (FMs) that are pre-trained on vast amounts of data at scale. A subset of these FMs focused on natural language understanding are called large language models (LLMs) and are able to generate human-like, contextually relevant summaries. The best LLMs can process even complex, non-linear sentence structures with ease and determine various aspects, including topic, intent, next steps, outcomes, and more. Using LLMs to automate call summarization allows for customer conversations to be summarized accurately and in a fraction of the time needed for manual summarization. This in turn enables contact centers to deliver superior customer experience while reducing the documentation burden on their agents.

The following screenshot shows an example of the Live Call Analytics with Agent Assist call details page, which contains information about each call.

The following video shows an example of the Live Call Analytics with Agent Assist summarizing an in-progress call, summarizing after the call ends, and generating a follow-up email.

Solution overview

The following diagram illustrates the solution workflow.

The first step to generating abstractive call summaries is transcribing the customer call. Having accurate, ready-to-use transcripts is crucial to generate accurate and effective call summaries. Amazon Transcribe can help you create transcripts with high accuracy for your contact center calls. Amazon Transcribe is a feature-rich speech-to-text API with state-of-the-art speech recognition models that are fully managed and continuously trained. Customers such as New York Times, Slack, Zillow, Wix, and thousands of others use Amazon Transcribe to generate highly accurate transcripts to improve their business outcomes. A key differentiator for Amazon Transcribe is its ability to protect customer data by redacting sensitive information from the audio and text. Although protecting customer privacy and safety is important in general to contact centers, it’s even more important to mask sensitive information such as bank account information and Social Security numbers before generating automated call summaries, so they don’t get injected into the summaries.

For customers who are already using Amazon Connect, our omnichannel cloud contact center, Contact Lens for Amazon Connect provides real-time transcription and analytics features natively. However, if you want to use generative AI with your existing contact center, we have developed solutions that do most of the heavy lifting associated with transcribing conversations in real time or post-call from your existing contact center, and generating automated call summaries using generative AI. Additionally, the solution detailed in this section allows you to integrate with your Customer Relationship Management (CRM) system to automatically update your CRM of choice with generated call summaries. In this example, we use our Live Call Analytics with Agent Assist (LCA) solution to generate real-time call transcriptions and call summaries with LLMs hosted on Amazon Bedrock. You can also write an AWS Lambda function and provide LCA the function’s Amazon Resource Name (ARN) in the AWS CloudFormation parameters, and use the LLM of your choice.

The following simplified LCA architecture illustrates call summarization with Amazon Bedrock.

LCA is provided as a CloudFormation template that deploys the preceding architecture and allows you to transcribe calls in real time. The workflow steps are as follows:

Call audio can be streamed via SIPREC from your telephony system to Amazon Chime SDK Voice Connector, which buffers the audio in Amazon Kinesis Video Streams. LCA also supports other audio ingestion mechanisms, such Genesys Cloud Audiohook.
Amazon Chime SDK Call Analytics then streams the audio from Kinesis Video Streams to Amazon Transcribe, and writes the JSON output to Amazon Kinesis Data Streams.
A Lambda function processes the transcription segments and persists them to an Amazon DynamoDB table.
After the call ends, Amazon Chime SDK Voice Connector publishes an Amazon EventBridge notification that triggers a Lambda function that reads the persisted transcript from DynamoDB, generates an LLM prompt (more on this in the following section), and runs an LLM inference with Amazon Bedrock. The generated summary is persisted to DynamoDB and can be used by the agent in the LCA user interface. You can optionally provide a Lambda function ARN that will be run after the summary is generated to integrate with third-party CRM systems.

LCA also allows the option to call the summarization Lambda function during the call, because at any time the transcript can be fetched and a prompt created, even if the call is in progress. This can be useful for times when a call is transferred to another agent or escalated to a supervisor. Rather than putting the customer on hold and explaining the call, the new agent can quickly read an auto-generated summary, and it can include what the current issue is and what the previous agent tried to do to resolve it.

Example call summarization prompt

You can run LLM inferences with prompt engineering to generate and improve your call summaries. You can modify the prompt templates to see what works best for the LLM you select. The following is an example of the default prompt for summarizing a transcript with LCA. We replace the {transcript} placeholder with the actual transcript of the call.

Human: Answer the questions below, defined in <question></question> based on the transcript defined in <transcript></transcript>. If you cannot answer the question, reply with 'n/a'. Use gender neutral pronouns. When you reply, only respond with the answer.

<question>
What is a summary of the transcript?
</question>

<transcript>
{transcript}
</transcript>

Assistant:

LCA runs the prompt and stores the generated summary. Besides summarization, you can direct the LLM to generate almost any text that is important for agent productivity. For example, you can choose from a set of topics that were covered during the call (agent disposition), generate a list of required follow-up tasks, or even write an email to the caller thanking them for the call.

The following screenshot is an example of agent follow-up email generation in the LCA user interface.

With a well-engineered prompt, some LLMs have the ability to generate all of this information in a single inference as well, reducing inference cost and processing time. The agent can then use the generated response within a few seconds of ending the call for their after-contact work. You can also integrate the generated response automatically into your CRM system.

The following screenshot shows an example summary in the LCA user interface.

It’s also possible to generate a summary while the call is still ongoing (see the following screenshot), which can be especially helpful for long customer calls.

Prior to generative AI, agents would be required to pay attention while also taking notes and performing other tasks as required. By automatically transcribing the call and using LLMs to automatically create summaries, we can lower the mental burden on the agent, so they can focus on delivering a superior customer experience. This also leads to more accurate after-call work, because the transcription is an accurate representation of what occurred during the call—not just what the agent took notes on or remembered.

Summary

The sample LCA application is provided as open source—use it as a starting point for your own solution, and help us make it better by contributing back fixes and features via GitHub pull requests. For information about deploying LCA, refer to Live call analytics and agent assist for your contact center with Amazon language AI services. Browse to the LCA GitHub repository to explore the code, sign up to be notified of new releases, and check out the README for the latest documentation updates. For customers who are already on Amazon Connect, you can learn more about generative AI with Amazon Connect by referring to How contact center leaders can prepare for generative AI.

About the authors

Christopher Lott is a Senior Solutions Architect in the AWS AI Language Services team. He has 20 years of enterprise software development experience. Chris lives in Sacramento, California and enjoys gardening, aerospace, and traveling the world.

Smriti Ranjan is a Principal Product Manager in the AWS AI/ML team focusing on language and search services. Prior to joining AWS, she worked at Amazon Devices and other technology startups leading product and growth functions. Smriti lives in Boston, MA and enjoys hiking, attending concerts and traveling the world.

Customize Amazon Textract with business-specific documents using Custom Queries

November 6, 2023

by Shibin Michaelraj Amazon AWS

Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents. Queries is a feature that enables you to extract specific pieces of information from varying, complex documents using natural language. Custom Queries provides a way for you to customize the Queries feature for your business-specific, non-standard documents such as auto lending contracts, checks, and pay statements, in a self-service way. By customizing the feature to recognize the unique terms, structures, and key information specific to these document types, you can meet your downstream processing needs with greater precision and minimal human intervention. Custom Queries is easy to integrate in your existing Textract pipeline and you continue to benefit from the fully managed intelligent document processing features of Amazon Textract without having to invest in ML expertise or infrastructure management.

In this post, we show how Custom Queries can accurately extract data from checks that are complex, non-standard documents. In addition, we discuss the benefits of Custom Queries and share best practices for effectively using this feature.

Solution overview

When starting with a new use case, you can evaluate how Textract Queries performs on your documents by navigating to the Textract console and using the Analyze Document Demo or Bulk Document Uploader. Refer to Best Practices for Queries to draft queries applicable to your use case. If you identify errors in the query responses due to the nature of your business documents, you can use Custom Queries to improve accuracy. Within hours, you can annotate your sample documents using the AWS Management Console and train an adapter. Adapters are components that plug in to the Amazon Textract pre-trained deep learning model, customizing its output based on your annotated documents. You can use the adapter for inference by passing the adapter identifier as an additional parameter to the Analyze Document Queries API request.

Let’s examine how Custom Queries can improve extraction accuracy in a challenging real-world scenario such as extraction of data from checks. The primary challenge when processing checks arises from their high degree of variation depending on the type (e.g., personal or cashier’s checks), financial institution and country (e.g., MICR line format). . These variations can include the placement of the payee’s name, the amount in numbers and words, the date, and the signature. Recognizing and adapting to these variations can be a complex task during data extraction. To improve data extraction, organizations often employ manual verification and validation processes, which increases the cost and time of the extraction process.

Custom Queries addresses these challenges by enabling you to customize the pre-trained Queries features on the different variations of checks. Customization of the pre-trained feature helps you achieve a high data extraction accuracy on the specific variety of layouts that you process.

In our use case, a financial institution wants to extract the following fields from a check: payee name, payer name, account number, routing number, payment amount (in numbers), payment amount (in words), check number, date, and memo.

Let’s explore the process of generating an adapter (component that customizes the output) for checks processing. Adapters can be created via the console or programmatically via the API. This post details the console experience; however, if you’d like to programmatically create the adapter, refer to the code samples in the custom-queries-checks-blog.ipynb Jupyter notebook (Option 2).

The adapter generation process involves five high-level steps: create an adapter, upload sample documents, annotate the documents, train the adapter, and evaluate performance metrics.

Create an adapter

On the Amazon Textract console, create a new adapter by providing a name, description, and optional tags that can help you identify the adapter. You have the option to enable automatic updates, which allows Amazon Textract to update your adapter when the underlying Queries feature is updated with new capabilities.

After the adapter is created, you will see an adapter details page with a list of steps in the How it works section. This section will activate your next steps as you complete them sequentially.

Upload sample documents

The initial phase in adapter generation involves the careful selection of an appropriate set of sample documents for annotation, training, and testing. We have an option to auto split the documents into test and train datasets; however, for this process, we manually split the dataset.

It’s important to note that you can construct an adapter with as few as five test and five training samples, but it’s essential to ensure that this sample set is diverse and representative of the workload encountered in a production environment.

For this tutorial, we have curated sample check datasets that you can download. Our dataset includes variations such as personal checks, cashier’s checks, stimulus checks and checks embedded within pay stubs. We also included handwritten and printed checks; along with variations in fields such as the memo line.

Annotate sample documents

As a next step, you annotate the sample documents by associating queries with their corresponding answers via the console. You can initiate annotation via auto labeling or manual labeling. Auto labeling uses Amazon Textract Queries to pre-label the dataset. We recommend using auto labeling to fast-track the annotation process.

For this checks processing use case, we use the following queries. If your use case involves other document types, refer to Best Practices for Queries to draft queries applicable to your use case.

Who is the payee?
What is the check#?
What is the payee address?
What is the date?
What is the account#?
What is the check amount in words?
What is the account name/payer/drawer name?
What is the dollar amount?
What is the bank name/drawee name?
What is the bank routing number?
What is the MICR line?
What is the memo?

When the auto labeling process is complete, you have the option to review and make edits to the answers provided for each document. Choose Start reviewing to review the annotations against each image.

If the response to a query is missing or wrong, you can add or edit the response either by drawing a bounding box or entering the response manually.

To accelerate your walkthrough, we have pre-annotated the checks samples for you to copy to your AWS account. Run the custom-queries-checks-blog.ipynb Jupyter notebook within the Amazon Textract code samples library to automatically update your annotations.

Train the adapter

After you’ve reviewed all the sample documents to ensure the accuracy of the annotations, you can begin the adapter training process. During this step, you need to designate a storage location where the adapter should be saved. The duration of the training process will vary depending on the size of the dataset utilized for training. The training API can also be invoked programmatically if you choose to use an annotation tool of your own choice and pass the relevant input files to the API. Refer to Custom Queries for more details.

Evaluate performance metrics

After the adapter has completed training, you can assess its performance by examining evaluation metrics such as F1 score, precision, and recall. You can analyze these metrics either collectively or on a per-document basis. Using our sample checks dataset, you will see the accuracy metric (F1 score) improve from 68% to 92% with the trained adapter.

Additionally, you can test the adapter’s output on new documents by choosing Try Adapter.

Following the evaluation, you can choose to enhance the adapter’s performance by either incorporating additional sample documents into the training dataset or by re-annotating documents with scores that are lower than your threshold. To re-annotate documents, choose Verify documents on the adapter details page, select the document, and choose Review annotations.

Programmatically test the adapter

With the training successfully completed, you can now use the adapter in your AnalyzeDocument API calls. The API request is similar to the Amazon Textract Queries API request, with the addition of the AdaptersConfig object.

You can run the following sample code or directly run it within the custom-queries-checks-blog.ipynb Jupyter notebook. The sample notebook also provides code to compare results between Amazon Textract Queries and Amazon Textract Custom Queries.

Create an AdaptersConfig object with the adapter ID and adapter version, and optionally include the pages you want the adapter to be applied to:

!python -m pip install amazon-textract-caller --upgrade
!python -m pip install amazon-textract-response-parser –upgrade

import boto3
from textractcaller.t_call import call_textract, Textract_Features, Query, QueriesConfig, Adapter, AdaptersConfig
import trp.trp2 as t2
from tabulate import tabulate

# Create AdaptersConfig
adapter1 = Adapter(adapter_id=”111111111”, version="1", pages=["*"])
adapters_config = AdaptersConfig(adapters=[adapter1])

Create a QueriesConfig object with the queries you trained the adapter with and call the Amazon Textract API. Note that you can also include additional queries that the adapter has not been trained on. Amazon Textract will automatically use the Queries feature for these questions and not Custom Queries, thereby providing you with the flexibility of using Custom Queries only where needed.

# Create QueriesConfig
queries = []
queries.append(Query(text="What is the check#?", alias="CHECK_NUMBER", pages=["*"]))
queries.append(Query(text="What is the date?", alias="DATE", pages=["*"]))
queries.append(Query(text="What is the check amount in words?", alias="CHECK_AMOUNT_WORDS", pages=["*"]))
queries.append(Query(text="What is the dollar amount?", alias="DOLLAR_AMOUNT", pages=["*"]))
queries.append(Query(text="Who is the payee?", alias="PAYEE_NAME", pages=["*"]))
queries.append(Query(text="What is the customer account#", alias="ACCOUNT_NUMBER", pages=["*"]))
queries.append(Query(text="what is the payee address?", alias="PAYEE_ADDRESS", pages=["*"]))
queries.append(Query(text="What is the bank routing number?", alias="BANK_ROUTING_NUMBER", pages=["*"]))
queries.append(Query(text="What is the memo", alias="MEMO", pages=["*"]))
queries.append(Query(text="What is the account name/payer/drawer name?", alias="ACCOUNT_NAME", pages=["*"]))
queries.append(Query(text="What is the bank name/drawee name?", alias="BANK_NAME", pages=["*"]))
queries_config = QueriesConfig(queries=queries)

document_name = "<image_name>"

textract_json_with_adapter = call_textract(input_document=document_name,
                  boto3_textract_client=textract_client,
                  features=[Textract_Features.QUERIES],
                  queries_config=queries_config,
                  adapters_config=adapters_config)

Finally, we tabulate our results for better readability:

def tabulate_query_answers(textract_json):
    d = t2.TDocumentSchema().load(textract_json)
    for page in d.pages:
        query_answers = d.get_query_answers(page=page)
        print(tabulate(query_answers, tablefmt="github"))

tabulate_query_answers(textract_json_with_adapter)

Clean up

To clean up your resources, complete the following steps:

On the Amazon Textract console, choose Custom Queries in the navigation pane.
Select the adaptor you want to delete.
Choose Delete.

Adapter management

You can regularly improve your adapters by creating new versions of a previously generated adapter. To create a new version of an adapter, you add new sample documents to an existing adapter, label the documents, and perform training. You can simultaneously maintain multiple versions of an adapter for use in your development pipelines. To update your adapters seamlessly, do not make changes to or delete your Amazon Simple Storage Service (Amazon S3) bucket where the files needed for adapter generation are saved.

Best practices

When using Custom Queries on your documents, refer to Best practices for Amazon Textract Custom Queries for additional considerations and best practices.

Benefits of Custom Queries

Custom Queries offers the following benefits:

Enhanced document understanding – Through its ability to extract and normalize data with high accuracy, Custom Queries reduces reliance on manual reviews, and audits, and enables you to build more reliable automation for your intelligent document processing workflows.
Faster time to value – When you encounter new document types where you need higher accuracy, you can use Custom Queries to generate an adapter in a self-service manner within a few hours. You don’t have to wait for a pre-trained model update when you encounter new document types or variations of existing ones in your workflow. You have complete control over your pipeline and don’t need to depend on Amazon Textract to support your new document types.
Data privacy – Custom Queries does not retain or use the data employed in generating adapters to enhance our general pretrained models available to all customers. The adapter is limited to the customer’s account or other accounts explicitly designated by the customer, ensuring that only such accounts can access the improvements made using the customer’s data.
Convenience –Custom Queries provides a fully managed inference experience similar to Queries. The adapter training is free and you will only pay for inference. Custom Queries saves you the overhead and expenses of training and operating custom models.

Conclusion

In this post, we discussed the benefits of Custom Queries, showed how Custom Queries can accurately extract data from checks, and shared best practices for effectively utilizing this feature. In just a few hours, you can create an adapter using the console and use it in the AnalyzeDocument API for your data extraction needs. For more information, refer to Custom Queries.

About the authors

Shibin Michaelraj is a Sr. Product Manager with the Amazon Textract team. He is focused on building AI/ML-based products for AWS customers. He is excited helping customers solve their complex business challenges by leveraging AI and ML technologies. In his spare time, he enjoys running, tuning into podcasts, and refining his amateur tennis skills.

Keith Mascarenhas is a Sr. Solutions Architect with the Amazon Textract service team. He is passionate about solving business problems at scale using machine learning, and currently helps our worldwide customers automate their document processing to achieve faster time to market with reduced operational costs.

Stream large language model responses in Amazon SageMaker JumpStart

November 6, 2023

by Rachna Chadha Amazon AWS

We are excited to announce that Amazon SageMaker JumpStart can now stream large language model (LLM) inference responses. Token streaming allows you to see the model response output as it is being generated instead of waiting for LLMs to finish the response generation before it is made available for you to use or display. The streaming capability in SageMaker JumpStart can help you build applications with better user experience by creating a perception of low latency to the end-user.

In this post, we walk through how to deploy and stream the response from a Falcon 7B Instruct model endpoint.

At the time of this writing, the following LLMs available in SageMaker JumpStart support streaming:

Mistral AI 7B, Mistral AI 7B Instruct
Falcon 180B, Falcon 180B Chat
Falcon 40B, Falcon 40B Instruct
Falcon 7B, Falcon 7B Instruct
Rinna Japanese GPT NeoX 4B Instruction PPO
Rinna Japanese GPT NeoX 3.6B Instruction PPO

To check for updates on the list of models supporting streaming in SageMaker JumpStart, search for “huggingface-llm” at Built-in Algorithms with pre-trained Model Table.

Note that you can use the streaming feature of Amazon SageMaker hosting out of the box for any model deployed using the SageMaker TGI Deep Learning Container (DLC) as described in Announcing the launch of new Hugging Face LLM Inference containers on Amazon SageMaker.

Foundation models in SageMaker

SageMaker JumpStart provides access to a range of models from popular model hubs, including Hugging Face, PyTorch Hub, and TensorFlow Hub, which you can use within your ML development workflow in SageMaker. Recent advances in ML have given rise to a new class of models known as foundation models, which are typically trained on billions of parameters and can be adapted to a wide category of use cases, such as text summarization, generating digital art, and language translation. Because these models are expensive to train, customers want to use existing pre-trained foundation models and fine-tune them as needed, rather than train these models themselves. SageMaker provides a curated list of models that you can choose from on the SageMaker console.

You can now find foundation models from different model providers within SageMaker JumpStart, enabling you to get started with foundation models quickly. SageMaker JumpStart offers foundation models based on different tasks or model providers, and you can easily review model characteristics and usage terms. You can also try these models using a test UI widget. When you want to use a foundation model at scale, you can do so without leaving SageMaker by using prebuilt notebooks from model providers. Because the models are hosted and deployed on AWS, you trust that your data, whether used for evaluating or using the model at scale, won’t be shared with third parties.

Token streaming

Token streaming allows the inference response to be returned as it’s being generated by the model. This way, you can see the response generated incrementally rather than wait for the model to finish before providing the complete response. Streaming can help enable a better user experience because it decreases the latency perception for the end-user. You can start seeing the output as it’s generated and therefore can stop generation early if the output isn’t looking useful for your purposes. Streaming can make a big difference, especially for long-running queries, because you can start seeing outputs as it’s generated, which can create a perception of lower latency even though the end-to-end latency stays the same.

As of this writing, you can use streaming in SageMaker JumpStart for models that utilize Hugging Face LLM Text Generation Inference DLC.

Response with No Steaming	Response with Streaming

Solution overview

For this post, we use the Falcon 7B Instruct model to showcase the SageMaker JumpStart streaming capability.

You can use the following code to find other models in SageMaker JumpStart that support streaming:

from sagemaker.jumpstart.notebook_utils import list_jumpstart_models
from sagemaker.jumpstart.filters import And

filter_value = And("task == llm", "framework == huggingface")
model_ids = list_jumpstart_models(filter=filter_value)
print(model_ids)

We get the following model IDs that support streaming:

['huggingface-llm-bilingual-rinna-4b-instruction-ppo-bf16', 'huggingface-llm-falcon-180b-bf16', 'huggingface-llm-falcon-180b-chat-bf16', 'huggingface-llm-falcon-40b-bf16', 'huggingface-llm-falcon-40b-instruct-bf16', 'huggingface-llm-falcon-7b-bf16', 'huggingface-llm-falcon-7b-instruct-bf16', 'huggingface-llm-mistral-7b', 'huggingface-llm-mistral-7b-instruct', 'huggingface-llm-rinna-3-6b-instruction-ppo-bf16']

Prerequisites

Before running the notebook, there are some initial steps required for setup. Run the following commands:

%pip install --upgrade sagemaker –quiet

Deploy the model

As a first step, use SageMaker JumpStart to deploy a Falcon 7B Instruct model. For full instructions, refer to Falcon 180B foundation model from TII is now available via Amazon SageMaker JumpStart. Use the following code:

from sagemaker.jumpstart.model import JumpStartModel

my_model = JumpStartModel(model_id="huggingface-llm-falcon-7b-instruct-bf16")
predictor = my_model.deploy()

Query the endpoint and stream response

Next, construct a payload to invoke your deployed endpoint with. Importantly, the payload should contain the key/value pair "stream": True. This indicates to the text generation inference server to generate a streaming response.

payload = {
    "inputs": "How do I build a website?",
    "parameters": {"max_new_tokens": 256},
    "stream": True
}

Before you query the endpoint, you need to create an iterator that can parse the bytes stream response from the endpoint. Data for each token is provided as a separate line in the response, so this iterator returns a token each time a new line is identified in the streaming buffer. This iterator is minimally designed, and you might want to adjust its behavior for your use case; for example, while this iterator returns token strings, the line data contains other information, such as token log probabilities, that could be of interest.

import io
import json

class TokenIterator:
    def __init__(self, stream):
        self.byte_iterator = iter(stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()
            if line and line[-1] == ord("n"):
                self.read_pos += len(line) + 1
                full_line = line[:-1].decode("utf-8")
                line_data = json.loads(full_line.lstrip("data:").rstrip("/n"))
                return line_data["token"]["text"]
            chunk = next(self.byte_iterator)
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk["PayloadPart"]["Bytes"])

Now you can use the Boto3 invoke_endpoint_with_response_stream API on the endpoint that you created and enable streaming by iterating over a TokenIterator instance:

import boto3

client = boto3.client("runtime.sagemaker")
response = client.invoke_endpoint_with_response_stream(
    EndpointName=predictor.endpoint_name,
    Body=json.dumps(payload),
    ContentType="application/json",
)

for token in TokenIterator(response["Body"]):
    print(token, end="")

Specifying an empty end parameter to the print function will enable a visual stream without new line characters inserted. This produces the following output:

Building a website can be a complex process, but it generally involves the following steps:

1. Determine the purpose and goals of your website
2. Choose a domain name and hosting provider
3. Design and develop your website using HTML, CSS, and JavaScript
4. Add content to your website and optimize it for search engines
5. Test and troubleshoot your website to ensure it is working properly
6. Maintain and update your website regularly to keep it running smoothly.

There are many resources available online to guide you through these steps, including tutorials and templates. It may also be helpful to seek the advice of a web developer or designer if you are unsure about any of these steps.<|endoftext|>

You can use this code in a notebook or other applications like Streamlit or Gradio to see the streaming in action and the experience it provides for your customers.

Clean up

Finally, remember to clean up your deployed model and endpoint to avoid incurring additional costs:

predictor.delete_model()
predictor.delete_endpoint()

Conclusion

In this post, we showed you how to use newly launched feature of streaming in SageMaker JumpStart. We hope you will use the token streaming capability to build interactive applications requiring low latency for a better user experience.

About the authors

Rachna Chadha is a Principal Solution Architect AI/ML in Strategic Accounts at AWS. Rachna is an optimist who believes that ethical and responsible use of AI can improve society in future and bring economic and social prosperity. In her spare time, Rachna likes spending time with her family, hiking, and listening to music.

Dr. Kyle Ulrich is an Applied Scientist with the Amazon SageMaker built-in algorithms team. His research interests include scalable machine learning algorithms, computer vision, time series, Bayesian non-parametrics, and Gaussian processes. His PhD is from Duke University and he has published papers in NeurIPS, Cell, and Neuron.

Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker built-in algorithms and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.