Host ML models on Amazon SageMaker using Triton: TensorRT models

It can be very beneficial to use tools such as compilers that optimize your models for the best inference performance. In this post, we explore TensorRT and how to use it with Amazon SageMaker inference using NVIDIA Triton Inference Server. We look at how TensorRT works and how to host and optimize these models for performance and cost efficiency on SageMaker. SageMaker provides single model endpoints (SMEs), which allow you to deploy a single ML model, and multi-model endpoints (MMEs), which allow you to host multiple models behind a logical endpoint for higher resource utilization.

To serve models, Triton supports various backends as engines to support the running and serving of various ML models for inference. For any Triton deployment, it’s crucial to know how the backend behavior impacts your workloads and what to expect so that you can be successful. In this post, we help you understand the TensorRT backend that is supported by Triton on SageMaker so that you can make an informed decision for your workloads and get great results.

Deep dive into the TensorRT backend

TensorRT enables you to optimize inference using techniques such as quantization, layer and tensor fusion, kernel tuning, and others on NVIDIA GPUs. By adopting and compiling models to use TensorRT, you can optimize performance and utilization for your inference workloads. In some cases there are trade-offs, which is typical of techniques such as quantization, but the results can dramatically benefit performance by reducing latency and increasing the number of transactions that can be processed.

The TensorRT backend is used to run TensorRT models. TensorRT is an SDK developed by NVIDIA that provides a high-performance deep learning inference library. It’s optimized for NVIDIA GPUs and provides a way to accelerate deep learning inference in production environments. TensorRT supports major deep learning frameworks and includes a high-performance deep learning inference optimizer and runtime that delivers low latency, high-throughput inference for AI applications.

TensorRT is able to accelerate model performance by using a technique called graph optimization to optimize the computation graph generated by a deep learning model. It optimizes the graph to minimize the memory footprint by freeing unnecessary memory and efficiently reusing it. TensorRT compilation fuses the sparse operations inside the model graph to form a larger kernel to avoid the overhead of multiple small kernel launches. With kernel auto-tuning, the engine selects the best algorithm for the target GPU, maximizing hardware utilization. Additionally, TensorRT employs CUDA streams to enable parallel processing of models, further improving GPU utilization and performance. Finally, through quantization, TensorRT can use mixed-precision acceleration of Tensor cores, enabling the model to run in FP32, TF32, FP16, and INT8 precision for the best inference performance. However, although the reduced precision can generally improve the latency performance, it might come with possible instability and degradation in model accuracy. Overall, TensorRT’s combination of techniques results in faster inference and lower latency compared to other inference engines.

The TensorRT backend for Triton Inference Server is designed to take advantage of the powerful inference capabilities of NVIDIA GPUs. To use TensorRT as a backend for Triton Inference Server, you need to create a TensorRT engine from your trained model using the TensorRT API. This engine is then loaded into Triton Inference Server and used to perform inference on incoming requests. The following are the basic steps to use TensorRT as a backend for Triton Inference Server:

  1. Convert your trained model to the ONNX format. Triton Inference Server supports ONNX as a model format. ONNX is a standard for representing deep learning models, enabling them to be transferred between frameworks. If your model isn’t already in the ONNX format, you need to convert it using the appropriate framework-specific tool. For example, in PyTorch, this can be done using the torch.onnx.export method.
  2. Import the ONNX model into TensorRT and generate the TensorRT engine. There are several ways to build a TensorRT engine from your ONNX model. For this post, we use the trtexec CLI tool. trtexec is a tool to quickly utilize TensorRT without having to develop your own application. The trtexec tool has three main purposes:
    1. Benchmarking networks on random or user-provided input data.
    2. Generating serialized engines from models.
    3. Generating a serialized timing cache from the builder.
  3. Load the TensorRT engine in Triton Inference Server. After the TensorRT engine is generated, it can be loaded into Triton Inference Server by creating a model configuration file. The model configuration (config.pbtxt) file should include the path to the TensorRT engine file and the input and output shapes of the model.

Each model in a model repository must include a model configuration that provides required and optional information about the model. Typically, this configuration is provided in a config.pbtxt file specified as ModelConfig protobuf. There are several key points to note in this configuration file:

  • name – This field defines the model’s name and must be unique within the model repository.
  • platform – This field defines the type of the model: TensorRT engine, PyTorch, or something else.
  • max_batch_size – This specifies the maximum batch size that can be passed to this model. If the model’s batch dimension is the first dimension, and all inputs and outputs to the model have this batch dimension, then Triton can use its dynamic batcher or sequence batcher to automatically use batching with the model. In this case, max_batch_size should be set to a value greater than or equal to 1, which indicates the maximum batch size that Triton should use with the model. For models that don’t support batching, or don’t support batching in the specific ways we’ve described, max_batch_size must be set to 0.
  • Input and output – These fields are required because NVIDIA Triton needs metadata about the model. Essentially, it requires the names of your network’s input and output layers and the shape of said inputs and outputs.
  • instance_group – This determines how many instances of this model will be created and whether they will use the GPU or CPU.
  • dynamic_batching – Dynamic batching is a feature of Triton that allows inference requests to be combined by the server, so that a batch is created dynamically. The preferred_batch_size property indicates the batch sizes that the dynamic batcher should attempt to create. For most models, preferred_batch_size should not be specified, as described in Recommended Configuration Process. An exception is TensorRT models that specify multiple optimization profiles for different batch sizes. In this case, because some optimization profiles may give significant performance improvement compared to others, it may make sense to use preferred_batch_size for the batch sizes supported by those higher-performance optimization profiles. You can also reference the batch size that was previously used when running trtexec. Additionally, you can configure a delay time to allow requests to wait for a limited time in the scheduler so that other requests can join the dynamic batch.

The TensorRT backend has been improved to deliver significantly better performance. Improvements include reducing thread contention, using pinned memory for faster transfers between CPU and GPU, and increasing compute and memory copy overlap on GPUs. It also reduces the memory usage of TensorRT models in many cases by sharing weights across multiple model instances. Overall, the TensorRT backend for Triton Inference Server provides a powerful and flexible way to serve deep learning models with optimized TensorRT inference. By adjusting the configuration options, you can optimize performance and control behavior to suit your specific use case.

SageMaker provides Triton via SMEs and MMEs

SageMaker enables you to deploy both single and multi-model endpoints with Triton Inference Server. Triton supports a heterogeneous cluster with both GPUs and CPUs, which helps standardize inference across platforms and dynamically scales out to any CPU or GPU to handle peak loads. The following diagram illustrates the Triton Inference Server architecture. Inference requests arrive at the server via either HTTP/REST or by the C API, and are then routed to the appropriate per-model scheduler. Triton implements multiple scheduling and batching algorithms that can be configured on a model-by-model basis. Each model’s scheduler optionally performs batching of inference requests and then passes the requests to the backend corresponding to the model type. The framework backend performs inferencing using the inputs provided in the batched requests to produce the requested outputs. The outputs are then formatted and returned in the response. The model repository is a file system-based repository of the models that Triton will make available for inferencing.

Triton architecture

SageMaker takes care of traffic shaping to the MME endpoint and maintains optimal model copies on GPU instances for best price performance. It continues to route traffic to the instance where the model is loaded. If the instance resources reach capacity due to high utilization, SageMaker unloads the least-used models from the container to free up resources to load more frequently used models. SageMaker MMEs offer capabilities for running multiple deep learning or ML models on the GPU, at the same time, with Triton Inference Server, which has been extended to implement the MME API contract. MMEs enable sharing GPU instances behind an endpoint across multiple models, and dynamically load and unload models based on the incoming traffic. With this, you can easily achieve optimal price performance.

When a SageMaker MME receives an HTTP invocation request for a particular model using TargetModel in the request along with the payload, it routes traffic to the right instance behind the endpoint where the target model is loaded. SageMaker takes care of model management behind the endpoint. It dynamically downloads models from Amazon Simple Storage Service (Amazon S3) to the instance’s storage volume if the invoked model isn’t available on the instance storage volume. Then SageMaker loads the model to the NVIDIA Triton container’s memory on a GPU-accelerated instance and serves the inference request. The GPU core is shared by all the models in an instance. For more information about SageMaker MMEs on GPU, see Run multiple deep learning models on GPU with Amazon SageMaker multi-model endpoints.

SageMaker MMEs can horizontally scale using an auto scaling policy and provision additional GPU compute instances based on specified metrics. When configuring auto scaling for SageMaker endpoints, you may want to consider SageMakerVariantInvocationsPerInstance as the primary criterion to determine the scaling characteristics of your policy. In addition, based on whether your models are running on GPU or CPU, you may also consider using CPUUtilization or GPUUtilization as additional criteria. For single model endpoints, because the models deployed are all the same, it’s fairly straightforward to set proper policies to meet your SLAs. For multi-model endpoints, we recommend deploying similar models behind a given endpoint to have more steady, predictable performance. In use cases where models of varying sizes and requirements are used, you might want to separate those workloads across multiple multi-model endpoints or spend some time fine-tuning your auto scaling policy to obtain the best cost and performance balance.
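
The following is a minimal sketch of how such a target tracking policy might be registered with Application Auto Scaling using boto3; the endpoint name, variant name, capacity limits, and target value are illustrative assumptions to adapt to your workload:

import boto3

autoscaling = boto3.client("application-autoscaling")

# Placeholder resource ID; substitute your own endpoint and variant names
resource_id = "endpoint/<your-endpoint-name>/variant/AllTraffic"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

autoscaling.put_scaling_policy(
    PolicyName="mme-invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        # The target value is workload specific; 200 invocations per instance is only an example
        "TargetValue": 200.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)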

Solution overview

With the NVIDIA Triton container image on SageMaker, you can now use Triton’s TensorRT backend, which allows you to deploy TensorRT models. The TensorRT_backend repo contains the documentation and source for the backend. In the following sections, we walk you through the example notebook that demonstrates how to use NVIDIA Triton Inference Server on SageMaker MMEs with the GPU feature to deploy a BERT natural language processing (NLP) model.

Set up the environment

We begin by setting up the required environment. We install the dependencies required to package our model pipeline and run inferences using Triton Inference Server. We also define the AWS Identity and Access Management (IAM) role that gives SageMaker access to the model artifacts and the NVIDIA Triton Amazon Elastic Container Registry (Amazon ECR) image. You can use the following code example to retrieve the pre-built Triton ECR image:

import transformers
import boto3, json, sagemaker, time
from sagemaker import get_execution_role
sess = boto3.Session()
sm = sess.client("sagemaker")
sagemaker_session = sagemaker.Session(boto_session=sess)
role = get_execution_role()
client = boto3.client("sagemaker-runtime")
bucket = sagemaker_session.default_bucket()
print(bucket)

account_id_map = {
"us-east-1": "785573368785",
"us-east-2": "007439368137",
"us-west-1": "710691900526",
"us-west-2": "301217895009",
"eu-west-1": "802834080501",
"eu-west-2": "205493899709",
"eu-west-3": "254080097072",
"eu-north-1": "601324751636",
"eu-south-1": "966458181534",
"eu-central-1": "746233611703",
"ap-east-1": "110948597952",
"ap-south-1": "763008648453",
"ap-northeast-1": "941853720454",
"ap-northeast-2": "151534178276",
"ap-southeast-1": "324986816169",
"ap-southeast-2": "355873309152",
"cn-northwest-1": "474822919863",
"cn-north-1": "472730292857",
"sa-east-1": "756306329178",
"ca-central-1": "464438896020",
"me-south-1": "836785723513",
"af-south-1": "774647643957",
}

region = boto3.Session().region_name
if region not in account_id_map.keys():
    raise ("UNSUPPORTED REGION")
    
base = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
triton_image_uri = "{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:23.02-py3".format(
account_id=account_id_map[region], region=region, base=base
)

Add utility methods for preparing the request payload

We create functions to transform the sample text we’re using into the payload that can be sent to Triton Inference Server for inference. The tritonclient package, which was installed at the beginning, provides utility methods to generate the payload without having to know the details of the specification. We use the created methods to convert our inference request into a binary format, which provides lower latencies for inference. These functions are used during the inference step.
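
The following is a minimal sketch of the tokenization portion of these helpers, assuming the bert-base-uncased tokenizer and the fixed sequence length of 128 used by the model in this example; the binary serialization itself is handled by the tritonclient utilities mentioned above:

from transformers import BertTokenizer

# Assumes the same tokenizer family as the pre-trained model being served
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def tokenize_text(text):
    # Pad or truncate to the fixed sequence length of 128 expected by the model
    encoding = tokenizer(text, padding="max_length", max_length=128, truncation=True)
    return encoding["input_ids"], encoding["attention_mask"]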

Prepare the TensorRT model

In this step, we load the pre-trained BERT model and convert it to an ONNX representation using the torch ONNX exporter and the onnx_exporter.py script. After the ONNX model is created, we use the TensorRT trtexec command to create the model plan to be hosted with Triton. This is run as part of the generate_model.sh script from the following cell. Note that the cell takes around 30 minutes to complete.

!docker run --gpus=all --rm -it \
            -v `pwd`/workspace:/workspace nvcr.io/nvidia/pytorch:23.02-py3 \
            /bin/bash generate_models.sh

While waiting for the command to finish running, you can check the scripts used in this step. In the onnx_exporter.py script, we use the torch.onnx.export function for ONNX model creation:


    torch.onnx.export(
        model,
        dummy_inputs,
        args.save,
        export_params=True,
        opset_version=10,
        input_names=["token_ids", "attn_mask"],
        output_names=["output","pooled_output"],
        dynamic_axes={"token_ids": [0, 1], "attn_mask": [0, 1], "output": [0]},
    )

The command line in the generate_model.sh file creates the TensorRT model plan. For more information, refer to the trtexec command-line tool.

trtexec --onnx=model.onnx --saveEngine=model_bs16.plan --minShapes=token_ids:1x128,attn_mask:1x128 --optShapes=token_ids:16x128,attn_mask:16x128 --maxShapes=token_ids:128x128,attn_mask:128x128 --fp16 --verbose --workspace=14000 | tee conversion_bs16_dy.txt

Build a TensorRT NLP BERT model repository

Using Triton on SageMaker requires us to first set up a model repository folder containing the models we want to serve. For each model, we need to create a model directory consisting of the model artifact and define the config.pbtxt file to specify the model configuration that Triton uses to load and serve the model. To learn more about the config settings, refer to Model Configuration. The model repository structure for the BERT model is as follows:

Folder structure for model

Note that Triton has specific requirements for model repository layout. Within the top-level model repository directory, each model has its own subdirectory containing the information for the corresponding model. Each model directory in Triton must have at least one numeric subdirectory representing a version of the model. Here, the folder 1 represents version 1 of the BERT model. Each model is run by a specific backend, so within each version subdirectory there must be the model artifacts required by that backend. Here, we are using the TensorRT backend, which requires the TensorRT plan file that is used for serving (for this example, model.plan). If we were using a PyTorch backend, a model.pt file would be required. For more details on naming conventions for model files, refer to Model Files.
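
Concretely, a single-model repository following these conventions might look like the following layout (the model name bert matches the configuration shown next):

model_repo/
└── bert/
    ├── config.pbtxt
    └── 1/
        └── model.plan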

Every TensorRT model must provide a config.pbtxt file describing the model configuration. In order to use this backend, you must set the backend field of your model config.pbtxt file to tensorrt_plan. The following section of code shows an example of how to define the configuration file for the BERT model being served through Triton’s TensorRT backend:

name: "bert"
platform: "tensorrt_plan"
max_batch_size: 128
input [
  {
    name: "token_ids"
    data_type: TYPE_INT32
    dims: [128]
  },
  {
    name: "attn_mask"
    data_type: TYPE_INT32
    dims: [128]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [128, 768]
  },
  {
    name: "pooled_output"
    data_type: TYPE_FP32
    dims: [768]
  }
]
instance_group {
  count: 1
  kind: KIND_GPU
}
dynamic_batching {
  preferred_batch_size: 16
}

SageMaker expects a .tar.gz file containing each Triton model repository to be hosted on the multi-model endpoint. To simulate several similar models being hosted, you might think all it takes is to tar the model repository we have already built, and then copy it with different file names. However, Triton requires unique model names. Therefore, we first copy the model repo N times, changing the model directory names and their corresponding config.pbtxt files. You can change the value of N to have more copies of the model that can be dynamically loaded to the hosting endpoint to simulate the model load/unload action managed by SageMaker. See the following code:

import os
import shutil

N = 5
prefix = 'bert-mme'
model_repo_base = 'model_repo'

# Get model names from model_repo_0
model_names = [name for name in os.listdir(f'{model_repo_base}_0') if os.path.isdir(f'{model_repo_base}_0/{name}')]

for i in range(N):
    # Make a copy of the original model repo (model_repo_0) with an incremented id
    shutil.copytree(f'{model_repo_base}_0', f'{model_repo_base}_{i+1}')
    time.sleep(5)
    for name in model_names:
        model_dirs_path = f'{model_repo_base}_{i+1}/{name}'

        # Open each model's config file to increment model # id there 
        fin = open(f'{model_dirs_path}/config.pbtxt', "rt")
        data = fin.read()
        data = data.replace(name, name[:-1] + str(i+1))
        fin.close()
        fin = open(f'{model_dirs_path}/config.pbtxt', "wt")
        fin.write(data)
        fin.close()
    
        # Change model directory name to match new config
        os.rename(model_dirs_path,model_dirs_path[:-1]+str(i+1))
        time.sleep(2)
        
    if i == 0:
        tar_file_name = f'bert-{i}.tar.gz'
        model_repo_target = f'{model_repo_base}_{i}/'
        !tar -C $model_repo_target -czf $tar_file_name .
        sagemaker_session.upload_data(path=tar_file_name, key_prefix=prefix)

    tar_file_name = f'bert-{i+1}.tar.gz'
    model_repo_target = f'{model_repo_base}_{i+1}/'
    !tar -C $model_repo_target -czf $tar_file_name .
    sagemaker_session.upload_data(path=tar_file_name, key_prefix=prefix)
    !sudo rm -r "$tar_file_name" "$model_repo_target"

Create a SageMaker endpoint

Now that we have uploaded the model artifacts to Amazon S3, we can create the SageMaker model object, endpoint configuration, and endpoint.

First, we need to define the serving container. In the container definition, define the ModelDataUrl to specify the S3 directory that contains all the models that the SageMaker multi-model endpoint will use to load and serve predictions. Set Mode to MultiModel to indicate that SageMaker should create the endpoint with MME container specifications. See the following code:

container = {
"Image": triton_image_uri,
"ModelDataUrl": model_data_uri,
"Mode": "MultiModel",
}

Then we create the SageMaker model object using the create_model boto3 API by specifying the ModelName and container definition:

create_model_response = sm.create_model(
ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

We use this model to create an endpoint configuration where we can specify the type and number of instances we want in the endpoint. Here we are deploying to a g5.xlarge NVIDIA GPU instance:

create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": "ml.g5.xlarge",
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

With this endpoint configuration, we create a new SageMaker endpoint and wait for the deployment to finish. The status will change to InService when the deployment is successful.

endpoint_name = "triton-nlp-bert-trt-mme-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
create_endpoint_response = sm.create_endpoint(
EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
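
The following is a minimal sketch of how you might wait for that status using the boto3 waiter on the SageMaker client (sm) created earlier:

waiter = sm.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=endpoint_name)

resp = sm.describe_endpoint(EndpointName=endpoint_name)
print("Endpoint status:", resp["EndpointStatus"])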

Invoke your model hosted on the SageMaker endpoint

When the endpoint is running, we can use some sample raw data to perform inference using either JSON or binary+JSON as the payload format. For the inference request format, Triton uses the KFServing community standard inference protocols. We can send the inference request to the multi-model endpoint using the invoke_endpoint API. We specify the TargetModel in the invocation call and pass in the payload for each model type. Here we invoke the endpoint in a for loop to request the endpoint to dynamically load or unload models based on the requests:

text_triton = "Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs."
input_ids, attention_mask = tokenize_text(text_triton)

payload = {
    "inputs": [
        {"name": "token_ids", "shape": [1, 128], "datatype": "INT32", "data": input_ids},
        {"name": "attn_mask", "shape": [1, 128], "datatype": "INT32", "data": attention_mask},
    ]
}

for i in range(N):
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/octet-stream",
        Body=json.dumps(payload),
        TargetModel=f"bert-{i}.tar.gz",
    )

    print(json.loads(response["Body"].read().decode("utf8")))

You can monitor the model loading and unloading status using Amazon CloudWatch metrics and logs. SageMaker multi-model endpoints provide instance-level metrics to monitor; for more details, refer to Monitor Amazon SageMaker with Amazon CloudWatch. The LoadedModelCount metric shows the number of models loaded in the containers. The ModelCacheHit metric shows the number of invocations to models that are already loaded in the container, which helps you get model invocation-level insights. To check if models are unloaded from memory, you can look for the successful model unload entries in the endpoint’s CloudWatch logs.
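
As an illustration, the following sketch retrieves one of these metrics with boto3; the namespace and dimensions shown here are assumptions to verify against how the metric appears in your CloudWatch console:

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

# Namespace and dimensions are assumptions; confirm them in the CloudWatch console
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="ModelCacheHit",
    Dimensions=[
        {"Name": "EndpointName", "Value": endpoint_name},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])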

The notebook can be found in the GitHub repository.

Best practices

Before starting any optimization effort with TensorRT, it’s essential to determine what should be measured. Without measurements, it’s impossible to make reliable progress or measure whether success has been achieved. Here are some best practices to consider when using the TensorRT backend for Triton Inference Server:

  • Optimize your TensorRT model – Before deploying a model on Triton with the TensorRT backend, make sure to optimize the model following the TensorRT best practices guide. This will help you achieve better performance by reducing inference time and memory consumption.
  • Use TensorRT instead of other Triton backends when possible – TensorRT is designed to optimize deep learning models for deployment on NVIDIA GPUs, so using it can significantly improve inference performance compared to using other supported Triton backends.
  • Use the right precision – TensorRT supports multiple precisions (FP32, FP16, INT8), and selecting the right precision for your model can have a significant impact on performance. Consider using lower precision when possible.
  • Use batch sizes that fit your hardware – Make sure to choose batch sizes that fit your GPU’s memory and compute capabilities. Using batch sizes that are too large or too small can negatively impact performance.

Conclusion

In this post, we dove deep into the TensorRT backend that Triton Inference Server supports on SageMaker. This backend accelerates your TensorRT models on NVIDIA GPUs. There are many options to consider to get the best performance for inference, such as batch sizes, data input formats, and other factors that can be tuned to meet your needs. SageMaker allows you to take advantage of this capability using single model endpoints for guaranteed performance and multi-model endpoints to get a better balance of performance and cost savings. To get started with MME support for GPU, see Supported algorithms, frameworks, and instances.

We invite you to try Triton Inference Server containers in SageMaker, and share your feedback and questions in the comments.


 About the Authors

Melanie Li is a Senior AI/ML Specialist TAM at AWS based in Sydney, Australia. She helps enterprise customers to build solutions leveraging the state-of-the-art AI/ML tools on AWS and provides guidance on architecting and implementing machine learning solutions with best practices. In her spare time, she loves to explore nature outdoors and spend time with family and friends.

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences,  and staying up to date with the latest technology trends.

Jiahong Liu is a Solution Architect on the Cloud Service Provider team at NVIDIA. He assists clients in adopting machine learning and AI solutions that leverage NVIDIA accelerated computing to address their training and inference challenges. In his leisure time, he enjoys origami, DIY projects, and playing basketball.

Kshitiz Gupta is a Solutions Architect at NVIDIA. He enjoys educating cloud customers about the GPU AI technologies NVIDIA has to offer and assisting them with accelerating their machine learning and deep learning applications. Outside of work, he enjoys running, hiking and wildlife watching.

Build an image search engine with Amazon Kendra and Amazon Rekognition

In this post, we discuss a machine learning (ML) solution for complex image searches using Amazon Kendra and Amazon Rekognition. Specifically, we use the example of architecture diagrams for complex images due to their incorporation of numerous different visual icons and text.

With the internet, searching for and obtaining an image has never been easier. Most of the time, you can accurately locate your desired images, such as when searching for your next holiday getaway destination. Simple searches are often successful because they aren’t associated with many characteristics; beyond the basic image characteristics, the search criteria typically don’t require significant detail to locate the required result. For example, if a user searches for a specific type of blue bottle, results showing many different types of blue bottles will be displayed. However, the desired blue bottle may not be easily found due to generic search terms.

Interpreting search context also contributes to simplification of results. When users have a desired image in mind, they try to frame this into a text-based search query. Understanding the nuances between search queries for similar topics is important to provide relevant results and minimize the effort required from the user to manually sort through results. For example, the search query “Dog owner plays fetch” seeks to return image results showing a dog owner playing a game of fetch with a dog. However, the actual results generated may instead focus on a dog fetching an object without displaying an owner’s involvement. Users may have to manually filter out unsuitable image results when dealing with complex searches.

To address the problems associated with complex searches, this post describes in detail how you can achieve a search engine that is capable of searching for complex images by integrating Amazon Kendra and Amazon Rekognition. Amazon Kendra is an intelligent search service powered by ML, and Amazon Rekognition is an ML service that can identify objects, people, text, scenes, and activities from images or videos.

What images can be too complex to be searchable? One example is architecture diagrams, which can be associated with many search criteria depending on the use case complexity and number of technical services required, which results in significant manual search effort for the user. For example, if users want to find an architecture solution for the use case of customer verification, they will typically use a search query similar to “Architecture diagrams for customer verification.” However, generic search queries would span a wide range of services and across different content creation dates. Users would need to manually select suitable architectural candidates based on specific services and consider the relevance of the architecture design choices according to the content creation date and query date.

The following figure shows an example diagram that illustrates an orchestrated extract, transform, and load (ETL) architecture solution.

Users who are not familiar with the service offerings provided on the cloud platform may describe such a diagram in different, generic ways when searching for it. The following are some examples of how it could be searched:

  • “Orchestrate ETL workflow”
  • “How to automate bulk data processing”
  • “Methods to create a pipeline for transforming data”

Solution overview

We walk you through the following steps to implement the solution:

  1. Train an Amazon Rekognition Custom Labels model to recognize symbols in architecture diagrams.
  2. Incorporate Amazon Rekognition text detection to validate architecture diagram symbols.
  3. Use Amazon Rekognition inside a web crawler to build a repository for searching.
  4. Use Amazon Kendra to search the repository.

To easily provide users with a large repository of relevant results, the solution should provide an automated way of searching through trusted sources. Using architecture diagrams as an example, the solution needs to search through reference links and technical documents for architecture diagrams and identify the services present. Identifying keywords such as use cases and industry verticals in these sources also allows the information to be captured and for more relevant search results to be displayed to the user.

Considering the objective of how relevant diagrams should be searched, the image search solution needs to fulfill three criteria:

  • Enable simple keyword search
  • Interpret search queries based on use cases that users provide
  • Sort and order search results

Keyword search is simply searching for “Amazon Rekognition” and being shown architecture diagrams of how the service is used in different use cases. Alternatively, the search terms can be linked indirectly to the diagram through use cases and industry verticals that may be associated with the architecture. For example, searching for the terms “How to orchestrate ETL pipeline” returns results of architecture diagrams built with AWS Glue and AWS Step Functions. Sorting and ordering of search results based on attributes such as creation date ensures the architecture diagrams are still relevant in spite of service updates and releases. The following figure shows the architecture diagram for the image search solution.

As illustrated in the preceding diagram and in the solution overview, there are two main aspects of the solution. The first aspect is performed by Amazon Rekognition, which can identify objects, people, text, scenes, and activities from images or videos. It consists of pre-trained models that can be applied to analyze images and videos at scale. With its custom labels feature, Amazon Rekognition allows you to tailor the ML service to your specific business needs by labeling images collected by crawling architecture diagrams from trusted reference links and technical documents. By uploading a small set of training images, Amazon Rekognition automatically loads and inspects the training data, selects the right ML algorithms, trains a model, and provides model performance metrics. Therefore, users without ML expertise can enjoy the benefits of a custom labels model through an API call, because a significant amount of overhead is removed. The solution applies Amazon Rekognition Custom Labels to detect AWS service logos on architecture diagrams so that the diagrams are searchable by service name. After modeling, the detected services of each architecture diagram image and its metadata, such as the URL origin and image title, are indexed for future search purposes and stored in Amazon DynamoDB, a fully managed, serverless, key-value NoSQL database designed to run high-performance applications.

The second aspect is supported by Amazon Kendra, an intelligent enterprise search service powered by ML that allows you to search across different content repositories. With Amazon Kendra, you can search for results, such as images or documents, that have been indexed. These results can also be stored across different repositories because the search service employs built-in connectors. Keywords, phrases, and descriptions could be used for searching, which allows you to accurately search for diagrams that are related to a particular use case. Therefore, you can easily build an intelligent search service with minimal development costs.

With an understanding of the problem and solution, the subsequent sections dive into how to automate data sourcing through the crawling of architecture diagrams from credible sources. Following this, we walk through the process of generating a custom label ML model with a fully managed service. Lastly, we cover the data ingestion by an intelligent search service, powered by ML.

Create an Amazon Rekognition model with custom labels

Before obtaining any architecture diagrams, we need a tool to evaluate if an image can be identified as an architecture diagram. Amazon Rekognition Custom Labels provides a streamlined process to create an image recognition model that identifies objects and scenes in images that are specific to a business need. In this case, we use Amazon Rekognition Custom Labels to identify AWS service icons, then the images are indexed with the services for a more relevant search using Amazon Kendra. This model doesn’t differentiate whether a picture is an architecture diagram or not; it simply identifies service icons, if any. As such, there may be instances where images that aren’t architecture diagrams end up in the search results. However, such results are minimal.

The following figure shows the steps that this solution takes to create an Amazon Rekognition Custom Labels model.

This process involves uploading the datasets, generating a manifest file that references the uploaded datasets, followed by uploading this manifest file into Amazon Rekognition. A Python script is used to aid in the process of uploading the datasets and generating the manifest file. Upon successfully generating the manifest file, it’s then uploaded into Amazon Rekognition to begin the model training process. For details on the Python script and how to run it, refer to the GitHub repo.

To train the model, in the Amazon Rekognition project, choose Train model, select the project you want to train, then add any relevant tags and choose Train model. For instructions on starting an Amazon Rekognition Custom Labels project, refer to the available video tutorials. The model may take up to 8 hours to train with this dataset.

When the training is complete, you may choose the trained model to view the evaluation results. For more details on the different metrics such as precision, recall, and F1, refer to Metrics for evaluating your model. To use the model, navigate to the Use Model tab, leave the number of inference units at 1, and start the model. Then we can use an AWS Lambda function to send images to the model in base64, and the model returns a list of labels and confidence scores.
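
Although the Lambda function in this solution is written in Node.js, the underlying Amazon Rekognition call is the same across SDKs. The following is a minimal sketch of that call using boto3 for illustration; the function name, project version ARN variable, and confidence threshold are assumptions:

import base64
import boto3

rekognition = boto3.client("rekognition")

def detect_service_icons(image_base64, project_version_arn, min_confidence=40):
    # The API expects raw image bytes, so decode the base64 payload first
    image_bytes = base64.b64decode(image_base64)
    response = rekognition.detect_custom_labels(
        ProjectVersionArn=project_version_arn,
        Image={"Bytes": image_bytes},
        MinConfidence=min_confidence,
    )
    # Each custom label carries a name and a confidence score
    return [(label["Name"], label["Confidence"]) for label in response["CustomLabels"]]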

Upon successfully training an Amazon Rekognition model with Amazon Rekognition Custom Labels, we can use it to identify service icons in the architecture diagrams that have been crawled. To increase the accuracy of identifying services in the architecture diagram, we use another Amazon Rekognition feature called text detection. To use this feature, we pass in the same picture in base64, and Amazon Rekognition returns the list of text identified in the picture. In the following figures, we compare the original image and what it looks like after the services in the image are identified. The first figure shows the original image.

The following figure shows the original image with detected services.

To ensure scalability, we use a Lambda function, which will be exposed through an API endpoint created using Amazon API Gateway. Lambda is a serverless, event-driven compute service that lets you run code for virtually any type of application or backend service without provisioning or managing servers. Using a Lambda function eliminates a common concern about scaling up when large volumes of requests are made to the API endpoint. Lambda automatically runs the function for the specific API call and stops when the invocation is complete, thereby reducing the cost incurred by the user. Because the request is directed to the Amazon Rekognition endpoint, having only the Lambda function be scalable is not sufficient. For the Amazon Rekognition endpoint to be scalable, you can increase the inference units of the endpoint. For more details on configuring the inference units, refer to Inference units.

The following is a code snippet of the Lambda function for the image recognition process:

const AWS = require("aws-sdk");
const axios = require("axios");

// API to retrieve information about individual services
const SERVICE_API = process.env.SERVICE_API;
// ARN of Amazon Rekognition model
const MODEL_ARN = process.env.MODEL_ARN;

const rekognition = new AWS.Rekognition();

exports.handler = async (event) => {
  const body = JSON.parse(event["body"]);
  let base64Binary = "";

  // Checks if the payload contains a url to the image or the image in base64
  if (body.url) {
    const base64Res = await new Promise((resolve) => {
      axios
        .get(body.url, {
          responseType: "arraybuffer",
        })
        .then((response) => {
          resolve(Buffer.from(response.data, "binary").toString("base64"));
        });
    });

    base64Binary = Buffer.from(base64Res, "base64");
  } else if (body.byte) {
    const base64Cleaned = body.byte.split("base64,")[1];
    base64Binary = Buffer.from(base64Cleaned, "base64");
  }

  // Pass the contents through the trained Custom Labels model and text detection
  const [labels, text] = await Promise.all([
    detectLabels(rekognition, base64Binary, MODEL_ARN),
    detectText(rekognition, base64Binary),
  ]);
  const texts = text.TextDetections.map((text) => ({
    DetectedText: text.DetectedText,
    ParentId: text.ParentId,
  }));

  // Compare between overlapping labels and retain the label with the highest confidence
  let filteredLabels = removeOverlappingLabels(labels);

  // Sort all the labels from most to least confident
  filteredLabels = sortByConfidence(filteredLabels);

  // Remove duplicate services in the list
  const services = retrieveUniqueServices(filteredLabels, texts);

  // Pass each service into the reference document API to retrieve the URL to the documentation
  const refLinks = await getReferenceLinks(services);

  var responseBody = {
    labels: filteredLabels,
    text: texts,
    ref_links: refLinks,
  };

  console.log("Response: ", response_body);

  const response = {
    statusCode: 200,
    headers: {
      "Access-Control-Allow-Origin": "*", // Required for CORS to work
    },
    body: JSON.stringify(responseBody),
  };
  return response;
};

// Code removed to truncate section

After creating the Lambda function, we can proceed to expose it as an API using API Gateway. For instructions on creating an API with Lambda proxy integration, refer to Tutorial: Build a Hello World REST API with Lambda proxy integration.

Crawl the architecture diagrams

In order for the search feature to be feasible, we need a repository of architecture diagrams. However, these diagrams must originate from credible sources such as AWS Blog and AWS Prescriptive Guidance. Establishing the credibility of data sources ensures the underlying implementation and purpose of the use cases are accurate and well vetted. The next step is to set up a crawler that can help gather many architecture diagrams to feed into our repository. We created a web crawler to extract architecture diagrams and information, such as a description of the implementation, from the relevant sources. There are multiple ways that you could achieve building such a mechanism; for this example, we use a program that runs on Amazon Elastic Compute Cloud (Amazon EC2). The program first obtains links to blog posts from an AWS Blog API. The response returned from the API contains information about the post such as the title, URL, date, and the links to images found in the post.

The following is a code snippet of the JavaScript function for the web crawling process:

import axios from "axios";
import puppeteer from "puppeteer";
import {
  putItemDDB,
  identifyImageHighConfidence,
  getReferenceList,
} from "./utils.js";

/** Global variables */
const blogPostsApi = process.env.BLOG_POSTS_API;
const IMAGE_URL_PATTERN =
  "<pattern in the url that identified as link to image>";
const DDB_Table = process.env.DDB_Table;

// Function that retrieves URLs of records from a public API
function getURLs(blogPostsApi) {
  // Return a list of URLs
  return axios
    .get(blogPostsApi)
    .then((response) => {
      var data = response.data.items;
      console.log("RESPONSE:");
      const blogLists = data.map((blog) => [
        blog.item.additionalFields.link,
        blog.item.dateUpdated,
      ]);
      return blogLists;
    })
    .catch((error) => console.error(error));
}

// Function that crawls content of individual URLs
async function crawlFromUrl(urls) {
  const browser = await puppeteer.launch({
    executablePath: "/usr/bin/chromium-browser",
  });
  // const browser = await puppeteer.launch();

  const page = await browser.newPage();

  let numOfValidArchUrls = 0;

  for (let index = 0; index < urls.length; index++) {
    console.log("index: ", index);
    let blogURL = urls[index][0];
    let dateUpdated = urls[index][1];

    await page.goto(blogURL);
    console.log("blogUrl:", blogURL);
    console.log("date:", dateUpdated);

    // Identify and get image from post based on URL pattern
    const images = await page.evaluate(() =>
      Array.from(document.images, (e) => e.src)
    );
    const filter1 = images.filter((img) => img.includes(IMAGE_URL_PATTERN));
    console.log("all images:", filter1);

    // Validate if image is an architecture diagram
    for (let index_1 = 0; index_1 < filter1.length; index_1++) {
      const imageUrl = filter1[index_1];

      const rekog = await identifyImageHighConfidence(imageUrl);

      if (rekog) {
        if (rekog.labels.size >= 2) {
          console.log("Rekog.labels.size = ", rekog.labels.size);
          console.log("Selected image url  = ", imageUrl);

          let articleSection = [];
          let metadata = await page.$$('span[property="articleSection"]');

          for (let i = 0; i < metadata.length; i++) {
            const element = metadata[i];
            const value = await element.evaluate(
              (el) => el.textContent,
              element
            );
            console.log("value: ", value);
            articleSection.push(value);
          }

          const title = await page.title();
          const allRefLinks = await getReferenceList(
            rekog.labels,
            rekog.textServices
          );

          numOfValidArchUrls = numOfValidArchUrls + 1;

          putItemDDB(
            blogURL,
            dateUpdated,
            imageUrl,
            articleSection.toString(),
            rekog,
            { L: allRefLinks },
            title,
            DDB_Table
          );

          console.log("numOfValidArchUrls = ", numOfValidArchUrls);
        }
      }
      if (rekog && rekog.labels.size >= 2) {
        break;
      }
    }
  }
  console.log("valid arch : ", numOfValidArchUrls);
  await browser.close();
}

async function startCrawl() {
  // Get a list of URLs
  // Extract architecture image from those URLs
  const urls = await getURLs(blogPostsApi);

  if (urls) console.log("Crawling urls completed");
  else {
    console.log("Unable to crawl images");
    return;
  }
  await crawlFromUrl(urls);
}

startCrawl();

With this mechanism, we can easily crawl hundreds or thousands of images from different blogs. However, we need a filter that only accepts images that contain the content of an architecture diagram, which in our case are icons of AWS services, so that images that aren’t architecture diagrams are filtered out.

This is the purpose of our Amazon Rekognition model. The diagrams go through the image recognition process, which identifies service icons and determines if it could be considered as a valid architecture diagram.

The following is a code snippet of the function that sends images to the Amazon Rekognition model:

import axios from "axios";
import AWS from "aws-sdk";

// Configuration
AWS.config.update({ region: process.env.REGION });

/** Global variables */
// API to identify images
const LABEL_API = process.env.LABEL_API;
// API to get relevant documentations of individual services
const DOCUMENTATION_API = process.env.DOCUMENTATION_API;
// Create the DynamoDB service object
const dynamoDB = new AWS.DynamoDB({ apiVersion: "2012-08-10" });

// Function to identify image using an API that calls Amazon Rekognition model
function identifyImageHighConfidence(image_url) {
  return axios
    .post(LABEL_API, {
      url: image_url,
    })
    .then((res) => {
      let data = res.data;
      let rekogLabels = new Set();
      let rekogTextServices = new Set();
      let rekogTextMetadata = new Set();

      data.labels.forEach((element) => {
        if (element.Confidence >= 40) rekogLabels.add(element.Name);
      });

      data.text.forEach((element) => {
        if (
          element.DetectedText.includes("AWS") ||
          element.DetectedText.includes("Amazon")
        ) {
          rekogTextServices.add(element.DetectedText);
        } else {
          rekogTextMetadata.add(element.DetectedText);
        }
      });
      rekogTextServices.delete("AWS");
      rekogTextServices.delete("Amazon");
      return {
        labels: rekogLabels,
        textServices: rekogTextServices,
        textMetadata: Array.from(rekogTextMetadata).join(", "),
      };
    })
    .catch((error) => console.error(error));
}

After passing the image recognition check, the results returned from the Amazon Rekognition model and the information relevant to it are bundled into their own metadata. The metadata is then stored in a DynamoDB table where the record would be used to ingest into Amazon Kendra.

The following is a code snippet of the function that stores the metadata of the diagram in DynamoDB:

// Code removed to truncate section

// Function that PUTS item into Amazon DynamoDB table
function putItemDDB(
  originUrl,
  publishDate,
  imageUrl,
  crawlerData,
  rekogData,
  referenceLinks,
  title,
  tableName
) {
  console.log("WRITE TO DDB");
  console.log("originUrl :   ", originUrl);
  console.log("publishDate:  ", publishDate);
  console.log("imageUrl: ", imageUrl);
  let write_params = {
    TableName: tableName,
    Item: {
      OriginURL: { S: originUrl },
      PublishDate: { S: formatDate(publishDate) },
      ArchitectureURL: {
        S: imageUrl,
      },
      Metadata: {
        M: {
          crawler: {
            S: crawlerData,
          },
          Rekognition: {
            M: {
              labels: {
                S: Array.from(rekogData.labels).join(", "),
              },
              textServices: {
                S: Array.from(rekogData.textServices).join(", "),
              },
              textMetadata: {
                S: rekogData.textMetadata,
              },
            },
          },
        },
      },
      Reference: referenceLinks,
      Title: {
        S: title,
      },
    },
  };

  dynamoDB.putItem(write_params, function (err, data) {
    if (err) {
      console.log("*** DDB Error", err);
    } else {
      console.log("Successfuly inserted in DDB", data);
    }
  });
}

Ingest metadata into Amazon Kendra

After the architecture diagrams go through the image recognition process and the metadata is stored in DynamoDB, we need a way for the diagrams to be searchable while referencing the content in the metadata. The approach is to have a search engine that can be integrated with the application and can handle a large volume of search queries. Therefore, we use Amazon Kendra, an intelligent enterprise search service.

We use Amazon Kendra as the interactive component of the solution because of its powerful search capabilities, particularly with the use of natural language. This adds an additional layer of simplicity when users are searching for diagrams that are closest to what they’re looking for. Amazon Kendra offers a number of data source connectors for ingesting and connecting content. This solution uses a custom connector to ingest architecture diagram information from DynamoDB. To configure a data source for an Amazon Kendra index, you can use an existing index or create a new index.
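
If you’re creating a new index and custom data source programmatically, a minimal sketch with boto3 looks like the following; the index name, role ARN, and data source name are placeholders, and index creation runs asynchronously, so the index can take a while to become active:

import boto3

kendra = boto3.client("kendra")

# Create an index (placeholder name and role ARN); creation is asynchronous
index_response = kendra.create_index(
    Name="architecture-diagram-search",
    RoleArn="<kendra-index-role-arn>",
)
index_id = index_response["Id"]

# Register a custom data source so documents can be pushed with BatchPutDocument
ds_response = kendra.create_data_source(
    IndexId=index_id,
    Name="architecture-diagram-custom-source",
    Type="CUSTOM",
)
data_source_id = ds_response["Id"]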

The diagrams crawled then have to be ingested into the Amazon Kendra index that has been created. The following figure shows the flow of how the diagrams are indexed.

First, the diagrams inserted into DynamoDB create a Put event via Amazon DynamoDB Streams. The event triggers the Lambda function that acts as a custom data source for Amazon Kendra and loads the diagrams into the index. For instructions on creating a DynamoDB Streams trigger for a Lambda function, refer to Tutorial: Using AWS Lambda with Amazon DynamoDB Streams.

After we integrate the Lambda function with DynamoDB, we need to ingest the records of the diagrams sent to the function into the Amazon Kendra index. The index accepts data from various types of sources, and ingesting items into the index from the Lambda function means that it has to use the custom data source configuration. For instructions on creating a custom data source for your index, refer to Custom data source connector.

The following is a code snippet of the Lambda function for how a diagram could be indexed in a custom manner:

import json
import os
import boto3

KENDRA = boto3.client("kendra")
INDEX_ID = os.environ["INDEX_ID"]
DS_ID = os.environ["DS_ID"]


def lambda_handler(event, context):
    dbRecords = event["Records"]

    # Loop through items from Amazon DynamoDB
    for row in dbRecords:
        rowData = row["dynamodb"]["NewImage"]
        originUrl = rowData["OriginURL"]["S"]
        publishedDate = rowData["PublishDate"]["S"]
        architectureUrl = rowData["ArchitectureURL"]["S"]
        title = rowData["Title"]["S"]

        metadata = rowData["Metadata"]["M"]
        crawlerMetadata = metadata["crawler"]["S"]
        rekognitionMetadata = metadata["Rekognition"]["M"]
        rekognitionLabels = rekognitionMetadata["labels"]["S"]
        rekognitionServices = rekognitionMetadata["textServices"]["S"]

        concatenatedText = (
            f"{crawlerMetadata} {rekognitionLabels} {rekognitionServices}"
        )

        add_document(
            dsId=DS_ID,
            indexId=INDEX_ID,
            originUrl=originUrl,
            architectureUrl=architectureUrl,
            title=title,
            publishedDate=publishedDate,
            text=concatenatedText,
        )

    return


# Function to add the diagram into Kendra index
def add_document(dsId, indexId, originUrl, architectureUrl, title, publishedDate, text):
    document = get_document(
        dsId, indexId, originUrl, architectureUrl, title, publishedDate, text
    )
    documents = [document]
    result = KENDRA.batch_put_document(IndexId=indexId, Documents=documents)
    print("result:" + json.dumps(result))
    return True


# Frame the diagram into a document that Kendra accepts
def get_document(dsId, indexId, originUrl, architectureUrl, title, publishedDate, text):
    document = {
        "Id": originUrl,
        "Title": title,
        "Attributes": [
            {"Key": "_data_source_id", "Value": {"StringValue": dsId}},
            {"Key": "_source_uri", "Value": {"StringValue": architectureUrl}},
            {"Key": "_created_at", "Value": {"DateValue": publishedDate}},
            {"Key": "publish_date", "Value": {"DateValue": publishedDate}},
        ],
        "Blob": text,
    }

    return document

The important factor that enables diagrams to be searchable is the Blob key in a document. This is what Amazon Kendra looks into when users provide their search input. In this example code, the Blob key contains a summarized version of the use case of the diagram concatenated with the information detected from the image recognition process. This allows users to search for architecture diagrams based on use cases such as “Fraud Detection” or by service names like “Amazon Kendra.”

To illustrate an example of what the Blob key looks like, the following snippet references the initial ETL diagram that we introduced earlier in this post. It contains a description of the diagram that was obtained when it was crawled, as well as the services that were identified by the Amazon Rekognition model.

{
    ...,
    "Blob": "Build and orchestrate ETL pipelines using Amazon Athena and AWS Step Functions Amazon Athena, AWS Step Functions, Amazon S3, AWS Glue Data Catalog "
}

Search with Amazon Kendra

After we put all the components together, the results of an example search for “real time analytics” look like the following screenshot.

Searching for this use case produces different architecture diagrams, giving users different ways to implement the specific workload they’re trying to build.
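
Programmatically, the same search can be issued against the index with the Kendra Query API. The following is a minimal sketch with boto3, where the index ID is a placeholder:

import boto3

kendra = boto3.client("kendra")

response = kendra.query(
    IndexId="<your-kendra-index-id>",
    QueryText="real time analytics",
)

for item in response["ResultItems"]:
    # Each result carries the document title and the diagram URI that was indexed earlier
    print(item.get("DocumentTitle", {}).get("Text"), "->", item.get("DocumentURI"))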

Clean up

Complete the steps in this section to clean up the resources you created as part of this post:

  1. Delete the API:
    1. On the API Gateway console, select the API to be deleted.
    2. On the Actions menu, choose Delete.
    3. Choose Delete to confirm.
  2. Delete the DynamoDB table:
    1. On the DynamoDB console, choose Tables in the navigation pane.
    2. Select the table you created and choose Delete.
    3. Enter delete when prompted for confirmation.
    4. Choose Delete table to confirm.
  3. Delete the Amazon Kendra index:
    1. On the Amazon Kendra console, choose Indexes in the navigation pane.
    2. Select the index you created and choose Delete.
    3. Enter a reason when prompted for confirmation.
    4. Choose Delete to confirm.
  4. Delete the Amazon Rekognition project:
    1. On the Amazon Rekognition console, choose Use Custom Labels in the navigation pane, then choose Projects.
    2. Select the project you created and choose Delete.
    3. Enter Delete when prompted for confirmation.
    4. Choose Delete associated datasets and models to confirm.
  5. Delete the Lambda function:
    1. On the Lambda console, select the function to be deleted.
    2. On the Actions menu, choose Delete.
    3. Enter Delete when prompted for confirmation.
    4. Choose Delete to confirm.

Summary

In this post, we showed an example of how you can intelligently search information from images. This includes training an Amazon Rekognition ML model that acts as a filter for images, automating image crawling to ensure credibility and efficiency, and querying for diagrams by attaching a custom data source, which enables more flexible indexing of items. To dive deeper into the implementation of the code, refer to the GitHub repo.

Now that you understand how to deliver the backbone of a centralized search repository for complex searches, try creating your own image search engine. For more information on the core features, refer to Getting started with Amazon Rekognition Custom Labels, Moderating content, and the Amazon Kendra Developer Guide. If you’re new to Amazon Rekognition Custom Labels, try it out using our Free Tier, which lasts 3 months and includes 10 free training hours per month and 4 free inference hours per month.


About the Authors

Ryan See is a Solutions Architect at AWS. Based in Singapore, he works with customers to build solutions to solve their business problems as well as tailor a technical vision to help run more scalable and efficient workloads in the cloud.

James Ong Jia Xiang is a Customer Solutions Manager at AWS. He specializes in the Migration Acceleration Program (MAP) where he helps customers and partners successfully implement large-scale migration programs to AWS. Based in Singapore, he also focuses on driving modernization and enterprise transformation initiatives across APJ through scalable mechanisms. For leisure, he enjoys nature activities like trekking and surfing.

Hang Duong is a Solutions Architect at AWS. Based in Hanoi, Vietnam, she focuses on driving cloud adoption across her country by providing highly available, secure, and scalable cloud solutions for her customers. Additionally, she enjoys building and is involved in various prototyping projects. She is also passionate about the field of machine learning.

Trinh Vo is a Solutions Architect at AWS, based in Ho Chi Minh City, Vietnam. She focuses on working with customers across different industries and partners in Vietnam to craft architectures and demonstrations of the AWS platform that work backward from the customer’s business needs and accelerate the adoption of appropriate AWS technology. She enjoys caving and trekking for leisure.

Wai Kin Tham is a Cloud Architect at AWS. Based in Singapore, his day job involves helping customers migrate to the cloud and modernize their technology stack in the cloud. In his free time, he attends Muay Thai and Brazilian Jiu Jitsu classes.

Read More

Create high-quality datasets with Amazon SageMaker Ground Truth and FiftyOne

Create high-quality datasets with Amazon SageMaker Ground Truth and FiftyOne

This is a joint post co-written by AWS and Voxel51. Voxel51 is the company behind FiftyOne, the open-source toolkit for building high-quality datasets and computer vision models.

A retail company is building a mobile app to help customers buy clothes. To create this app, they need a high-quality dataset containing clothing images, labeled with different categories. In this post, we show how to repurpose an existing dataset via data cleaning, preprocessing, and pre-labeling with a zero-shot classification model in FiftyOne, and adjusting these labels with Amazon SageMaker Ground Truth.

You can use Ground Truth and FiftyOne to accelerate your data labeling project. We illustrate how to seamlessly use the two applications together to create high-quality labeled datasets. For our example use case, we work with the Fashion200K dataset, released at ICCV 2017.

Solution overview

Ground Truth is a fully self-served and managed data labeling service that empowers data scientists, machine learning (ML) engineers, and researchers to build high-quality datasets. FiftyOne by Voxel51 is an open-source toolkit for curating, visualizing, and evaluating computer vision datasets so that you can train and analyze better models by accelerating your use cases.

In the following sections, we demonstrate how to do the following:

  • Visualize the dataset in FiftyOne
  • Clean the dataset with filtering and image deduplication in FiftyOne
  • Pre-label the cleaned data with zero-shot classification in FiftyOne
  • Label the smaller curated dataset with Ground Truth
  • Inject labeled results from Ground Truth into FiftyOne and review labeled results in FiftyOne

Use case overview

Suppose you own a retail company and want to build a mobile application to give personalized recommendations to help users decide what to wear. Your prospective users are looking for an application that tells them which articles of clothing in their closet work well together. You see an opportunity here: if you can identify good outfits, you can use this to recommend new articles of clothing that complement the clothing a customer already owns.

You want to make things as easy as possible for the end-user. Ideally, someone using your application only needs to take pictures of the clothes in their wardrobe, and your ML models work their magic behind the scenes. You might train a general-purpose model or fine-tune a model to each user’s unique style with some form of feedback.

First, however, you need to identify what type of clothing the user is capturing. Is it a shirt? A pair of pants? Or something else? After all, you probably don’t want to recommend an outfit that has multiple dresses or multiple hats.

To address this initial challenge, you want to generate a training dataset consisting of images of various articles of clothing with various patterns and styles. To prototype with a limited budget, you want to bootstrap using an existing dataset.

To illustrate and walk you through the process in this post, we use the Fashion200K dataset released at ICCV 2017. It’s an established and well-cited dataset, but it isn’t directly suited for your use case.

Although articles of clothing are labeled with categories (and subcategories) and contain a variety of helpful tags that are extracted from the original product descriptions, the data is not systematically labeled with pattern or style information. Your goal is to turn this existing dataset into a robust training dataset for your clothing classification models. You need to clean the data, augmenting the labeling schema with style labels. And you want to do so quickly and with as little spend as possible.

Download the data locally

First, download the women.tar file and the labels folder (with all of its subfolders) following the instructions provided in the Fashion200K dataset GitHub repository. After you’ve extracted them both, create a parent directory fashion200k, and move the labels and women folders into it. Fortunately, these images have already been cropped to the object detection bounding boxes, so we can focus on classification, rather than worry about object detection.
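If you prefer to script this setup, the following is a minimal sketch using only the Python standard library; it assumes women.tar and the labels folder have already been downloaded into your working directory.

import os
import shutil
import tarfile

# Extract the women.tar archive into the current directory
with tarfile.open("women.tar") as tar:
    tar.extractall()

# Create the parent directory and move the extracted folders into it
os.makedirs("fashion200k", exist_ok=True)
shutil.move("women", "fashion200k/women")
shutil.move("labels", "fashion200k/labels")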

Despite the “200K” in its moniker, the women directory we extracted contains 338,339 images. To generate the official Fashion200K dataset, the dataset’s authors crawled more than 300,000 products online, and only products with descriptions containing more than four words made the cut. For our purposes, where the product description isn’t essential, we can use all of the crawled images.

Let’s look at how this data is organized: within the women folder, images are arranged by top-level article type (skirts, tops, pants, jackets, and dresses), and article type subcategory (blouses, t-shirts, long-sleeved tops).

Within the subcategory directories, there is a subdirectory for each product listing. Each of these contains a variable number of images. The cropped_pants subcategory, for instance, contains the following product listings and associated images.

The labels folder contains a text file for each top-level article type, for both train and test splits. Within each of these text files is a separate line for each image, specifying the relative file path, a score, and tags from the product description.

Because we’re repurposing the dataset, we combine all of the train and test images. We use these to generate a high-quality application-specific dataset. After we complete this process, we can randomly split the resulting dataset into new train and test splits.
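When we get to that point, the random split can be as simple as shuffling the dataset and tagging the samples. The following is a minimal sketch assuming a FiftyOne dataset like the one we build later in this post; the random_split helper and the 80/20 ratio are illustrative, not part of the original workflow.

import fiftyone as fo


def random_split(dataset, train_frac=0.8, seed=51):
    # Shuffle the samples, then tag the first train_frac of them as "train"
    # and the remainder as "test"
    shuffled = dataset.shuffle(seed=seed)
    n_train = int(train_frac * len(shuffled))

    train_view = shuffled.limit(n_train)
    test_view = shuffled.skip(n_train)

    train_view.tag_samples("train")
    test_view.tag_samples("test")
    return train_view, test_view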

Inject, view, and curate a dataset in FiftyOne

If you haven’t already done so, install open-source FiftyOne using pip:

pip install fiftyone

A best practice is to do so within a new virtual (venv or conda) environment. Then import the relevant modules. Import the base library, fiftyone, the FiftyOne Brain, which has built-in ML methods, the FiftyOne Zoo, from which we will load a model that will generate zero-shot labels for us, and the ViewField, which lets us efficiently filter the data in our dataset:

import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz
from fiftyone import ViewField as F

You also want to import the glob and os Python modules, which will help us work with paths and pattern match over directory contents:

from glob import glob
import os

Now we’re ready to load the dataset into FiftyOne. First, we create a dataset named fashion200k and make it persistent, which allows us to save the results of computationally intensive operations, so we only need to compute said quantities once.

dataset = fo.Dataset("fashion200k", persistent=True)

We can now iterate through all subcategory directories, adding all the images within the product directories. We add a FiftyOne classification label to each sample with the field name article_type, populated by the image’s top-level article category. We also add both category and subcategory information as tags:

# Map dir categories to article type labels
labels_map = {
    "dresses": "dress",
    "jackets": "jacket",
    "pants": "pants",
    "skirts": "skirt",
    "tops": "top",
}

dataset_dir = "./fashion200k"

for d in glob(os.path.join(dataset_dir, "women", "*", "*")):
    # The last two path components are the category and subcategory directories
    *_, category, subcategory = d.split("/")
    subcategory = subcategory.replace("_", " ")
    label = labels_map[category]

    dataset.add_samples(
        [
            fo.Sample(
                filepath=filepath,
                tags=[category, subcategory],
                article_type=fo.Classification(label=label),
            )
            for filepath in glob(os.path.join(d, "*", "*"))
        ]
    )

At this point, we can visualize our dataset in the FiftyOne app by launching a session:

session = fo.launch_app(dataset)

We can also print out a summary of the dataset in Python by running print(dataset):

Name:        fashion200k
Media type:  image
Num samples: 338339
Persistent:  True
Tags:        []
Sample fields:
    id:            fiftyone.core.fields.ObjectIdField
    filepath:      fiftyone.core.fields.StringField
    tags:          fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:      fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    article_type:  fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)

We can also add the tags from the labels directory to the samples in our dataset:

working_dir = os.getcwd()

tags = {
    f: set(t)
    for f, t in zip(*dataset.values(["filepath", "tags"]))
}


for label_file in glob("fashion200k/labels/*"):
    with open(label_file, 'r') as f:
        for line in f.readlines():
            line_list = line.split()
            fp = os.path.join(
                working_dir,
                dataset_dir,
                line_list[0]
            )

            # add new tags
            new_tags_for_fp = line_list[2:]
            tags[fp].update(new_tags_for_fp)

# Update tags
dataset.set_values("tags", tags, key_field="filepath")

Looking at the data, a few things become clear:

  • Some of the images are fairly grainy, with low resolution. This is likely because these images were generated by cropping initial images in object detection bounding boxes.
  • Some clothes are worn by a person, and some are photographed on their own. These details are encapsulated by the viewpoint property.
  • A lot of the images of the same product are very similar, so at least initially, including more than one image per product may not add much predictive power. For the most part, the first image of each product (ending in _0.jpeg) is the cleanest.

Initially, we might want to train our clothing style classification model on a controlled subset of these images. To this end, we use high-resolution images of our products, and limit our view to one representative sample per product.

First, we filter out the low-resolution images. We use the compute_metadata() method to compute and store image width and height, in pixels, for each image in the dataset. We then employ the FiftyOne ViewField to filter out images based on the minimum allowed width and height values. See the following code:

dataset.compute_metadata()

min_width = 200
min_height = 300

width_filter = F("metadata.width") > min_width
height_filter = F("metadata.height") > min_height


high_res_view = dataset.match(
    width_filter & height_filter
)

session.view = high_res_view

This high-resolution subset has just under 200,000 samples.

From this view, we can create a new view into our dataset containing only one representative sample (at most) for each product. We use the ViewField once again, pattern matching for file paths that end with _0.jpeg:

representative_view = high_res_view.match(
    F("filepath").ends_with("_0.jpeg")
)

Let’s view a randomly shuffled ordering of images in this subset:

session.view = representative_view.shuffle()

Remove redundant images in the dataset

This view contains 66,297 images, or just over 19% of the original dataset. When we look at the view, however, we see that there are many very similar products. Keeping all of these copies will likely only add cost to our labeling and model training, without noticeably improving performance. Instead, let’s get rid of the near duplicates to create a smaller dataset that still packs the same punch.

Because these images are not exact duplicates, we can’t check for pixel-wise equality. Fortunately, we can use the FiftyOne Brain to help us clean our dataset. In particular, we’ll compute an embedding for each image—a lower-dimensional vector representing the image—and then look for images whose embedding vectors are close to each other. The closer the vectors, the more similar the images.

We use a CLIP model to generate a 512-dimensional embedding vector for each image, and store these embeddings in the field embeddings on the samples in our dataset:

## load model
model = foz.load_zoo_model("clip-vit-base32-torch")

## compute embeddings
representative_view.compute_embeddings(
    model,
    embeddings_field="embedding"
)

Then we compute the closeness between embeddings, using cosine similarity, and assert that any two vectors whose similarity is greater than some threshold are likely to be near duplicates. Cosine similarity scores lie in the range [0, 1], and looking at the data, a threshold score of thresh=0.5 seems to be about right. Again, this doesn’t need to be perfect. A few near-duplicate images are not likely to ruin our predictive power, and throwing away a few non-duplicate images doesn’t materially impact model performance.

results = fob.compute_similarity(
    representative_view,
    embeddings="embedding",
    brain_key="sim",
    metric="cosine"
)

results.find_duplicates(thresh=0.5)

We can view the purported duplicates to verify that they are indeed redundant:

## view the duplicates, paired up, 
## to make sure it is doing what we think it is doing
dup_view = results.duplicates_view()
session = fo.launch_app(dup_view)

When we’re happy with the result and believe these images are indeed near duplicates, we can pick one sample from each set of similar samples to keep, and ignore the others:

## get one image from each group of duplicates
dup_rep_ids = list(results.neighbors_map.keys())

# get ids of non-duplicates
non_dup_ids = representative_view.exclude(
    dup_view.values("id")
).values("id")

# ids to keep
ids = dup_rep_ids + non_dup_ids

# create view from ids
non_dup_view = representative_view[ids]

Now this view has 3,729 images. By cleaning the data and identifying a high-quality subset of the Fashion200K dataset, FiftyOne lets us restrict our focus from more than 300,000 images to just under 4,000, representing a reduction by 98%. Using embeddings to remove near-duplicate images alone brought our total number of images under consideration down by more than 90%, with little if any effect on any models to be trained on this data.

Before pre-labeling this subset, we can better understand the data by visualizing the embeddings we have already computed. We can use the FiftyOne Brain’s built-in compute_visualization() method, which employs the uniform manifold approximation and projection (UMAP) technique to project the 512-dimensional embedding vectors into two-dimensional space so we can visualize them:

fob.compute_visualization(
    non_dup_view, 
    embeddings="embedding", 
    brain_key="vis"
)

Opening a new Embeddings panel in the FiftyOne app and coloring by article type, we can see that these embeddings roughly encode a notion of article type (among other things!).

Now we are ready to pre-label this data.

Inspecting these highly unique, high-resolution images, we can generate a decent initial list of styles to use as classes in our pre-labeling zero-shot classification. Our goal in pre-labeling these images is not to necessarily label each image correctly. Rather, our goal is to provide a good starting point for human annotators so we can reduce labeling time and cost.

styles = [
    "graphic",
    "lettered",
    "plain",
    "striped",
    "polka dot",
    "floral",
    "jersey",
    "checkered",
    "denim",
    "plaid",
    "houndstooth",
    "chevron",
    "paisley",
    "animal print",
    "quatrefoil",
    "camouflage",
]

We can then instantiate a zero-shot classification model for this application. We use a CLIP model, which is a general-purpose model trained on both images and natural language. We instantiate a CLIP model with the text prompt “Clothing in the style,” so that given an image, the model will output the class for which “Clothing in the style [class]” is the best fit. CLIP is not trained on retail or fashion-specific data, so this won’t be perfect, but it can save on labeling and annotation costs.

zero_shot_model = foz.load_zoo_model(
    "clip-vit-base32-torch",
    text_prompt="Clothing in the style ",
    classes=styles,
)

We then apply this model to our reduced subset and store the results in an article_style field:

non_dup_view.apply_model(
    zero_shot_model,
    label_field="article_style"
)

Launching the FiftyOne App once again, we can visualize the images with these predicted style labels. We sort by prediction confidence so we view the most confident style predictions first:

high_conf_view = non_dup_view.sort_by(
    "article_style.confidence", reverse=True
)

session.view = high_conf_view

We can see that the highest confidence predictions seem to be for “jersey,” “animal print,” “polka dot,” and “lettered” styles. This makes sense, because these styles are relatively distinct. It also seems like, for the most part, the predicted style labels are accurate.

We can also look at the lowest-confidence style predictions:

low_conf_view = non_dup_view.sort_by(
    "article_style.confidence"
)
session.view = low_conf_view

For some of these images, the appropriate style category is in the provided list, and the article of clothing is incorrectly labeled. The first image in the grid, for instance, should clearly be “camouflage” and not “chevron.” In other cases, however, the products don’t fit neatly into the style categories. The dress in the second image in the second row, for example, is not exactly “striped,” but given the same labeling options, a human annotator might also have been conflicted. As we build out our dataset, we need to decide whether to remove edge cases like these, add new style categories, or augment the dataset.
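One lightweight way to handle these edge cases is to route low-confidence predictions into a separate view for manual review while sending only confident predictions to annotators as pre-labels. The following is a minimal sketch that reuses the views defined above; the 0.5 cutoff is an arbitrary example to tune for your data.

conf_thresh = 0.5  # example cutoff; tune for your data

# Predictions we trust enough to use as pre-labels
confident_view = non_dup_view.match(F("article_style.confidence") >= conf_thresh)

# Everything else gets tagged so we can revisit the label schema later
review_view = non_dup_view.match(F("article_style.confidence") < conf_thresh)
review_view.tag_samples("needs_review")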

Export the final dataset from FiftyOne

Export the final dataset with the following code:

# The directory to which to write the exported dataset
export_dir = "200kFashionDatasetExportResult"

# The name of the sample field containing the label that you wish to export
# Used when exporting labeled datasets (e.g., classification or detection)
label_field = "article_style"  # for example

# The type of dataset to export
# Any subclass of `fiftyone.types.Dataset` is supported
dataset_type = fo.types.COCODetectionDataset  # for example

# Export the dataset
high_conf_view.export(
    export_dir=export_dir,
    dataset_type=dataset_type,
    label_field=label_field,
)

We can also export a smaller dataset, for example 16 images, to the folder 200kFashionDatasetExportResult-16Images and use it to create a Ground Truth adjustment job:

# The directory to which to write the exported dataset
export_dir = "200kFashionDatasetExportResult-16Images"

# The name of the sample field containing the label that you wish to export
# Used when exporting labeled datasets (e.g., classification or detection)
label_field = "article_style"  # for example

# The type of dataset to export
# Any subclass of `fiftyone.types.Dataset` is supported
dataset_type = fo.types.COCODetectionDataset  # for example

# Export the dataset
high_conf_view.take(16).export(
    export_dir=export_dir,
    dataset_type=dataset_type,
    label_field=label_field,
)

Upload the revised dataset, convert the label format to Ground Truth, upload to Amazon S3, and create a manifest file for the adjustment job

We can convert the labels in the dataset to match the output manifest schema of a Ground Truth bounding box job, and upload the images to an Amazon Simple Storage Service (Amazon S3) bucket to launch a Ground Truth adjustment job:

import json
import boto3

# open the labels.json file of Ground Truth bounding box
# labels from the exported dataset
f = open('200kFashionDatasetExportResult-16Images/labels.json')
data = json.load(f)

# provide your aws s3 bucket name, prefix, and aws credentials
bucket_name = 'sagemaker-your-preferred-s3-bucket'
s3_prefix = 'sagemaker-your-preferred-s3-prefix'

session = boto3.Session(
    aws_access_key_id='<AWS_ACCESS_KEY_ID>',
    aws_secret_access_key='<AWS_SECRET_ACCESS_KEY>'
)
s3 = session.resource('s3')

for image in data['images']:
    file_name = image['file_name']
    file_id = file_name[:-4]
    image_id = image['id']
    
    # upload the image to s3
    s3.meta.client.upload_file('200kFashionDatasetExportResult-16Images/data/'+image['file_name'], bucket_name, s3_prefix+'/'+image['file_name'])
    
    gt_annotations = []
    confidence = 0.00
    
    for annotation in data['annotations']:
        if annotation['image_id'] == image['id']:
            confidence = annotation['score']
            gt_annotation = {
                # gt_class_array and style_category are assumed to be defined
                # earlier (the list of style class names and the predicted style
                # for this image); here we convert the original bounding box
                # label to the predicted style label
                "class_id": gt_class_array.index(style_category),
                "left": annotation['bbox'][0],
                "top": annotation['bbox'][1],
                "width": annotation['bbox'][2],
                "height": annotation['bbox'][3]
            }
            
            gt_annotations.append(gt_annotation)
            break
    
    gt_metadata_objects = []
    for gt_annotation in gt_annotations:
        gt_metadata_objects.append({
            "confidence": confidence
        })
    
    gt_label_attribute_metadata = {
        # gt_class_map is assumed to be a class ID-to-style name mapping
        # defined earlier
        "class-map": gt_class_map,
        "objects": gt_metadata_objects,
        "type": "groundtruth/object-detection",
        "human-annotated": "yes",
        "creation-date": "2023-02-19T00:23:25.339582",
        "job-name": "labeling-job/200k-fashion-origin"
    }
    
    gt_output = {
        "source-ref": f"s3://{bucket_name}/{s3_prefix}/{image['file_name']}",
        "200k-fashion-origin": {
            "image_size": [
                {
                    "width": image['width'],
                    "height": image['height'],
                    "depth": 3
                  }
      
            ],
            "annotations": gt_annotations
        },
        "200k-fashion-origin-metadata": gt_label_attribute_metadata
    }
    

    # write to the manifest file
    with open('200k-fashion-output.manifest', 'a') as output_file:
        output_file.write(json.dumps(gt_output) + "\n")

Upload the manifest file to Amazon S3 with the following code:

s3.meta.client.upload_file('200k-fashion-output.manifest', bucket_name, s3_prefix+'/200k-fashion-output.manifest')

Create corrected styled labels with Ground Truth

To annotate your data with style labels using Ground Truth, complete the necessary steps to start a bounding box labeling job by following the procedure outlined in the Getting Started with Ground Truth guide with the dataset in the same S3 bucket.

  1. On the SageMaker console, create a Ground Truth labeling job.
  2. Set the Input dataset location to be the manifest that we created in the preceding steps.
  3. Specify an S3 path for Output dataset location.
  4. For IAM Role, choose Enter a custom IAM role ARN, then enter the role ARN.
  5. For Task category, choose Image and select Bounding box.
  6. Choose Next.
  7. In the Workers section, choose the type of workforce you would like to use.
    You can select a workforce through Amazon Mechanical Turk, third-party vendors, or your own private workforce. For more details about your workforce options, see Create and Manage Workforces.
  8. Expand Existing-labels display options and select I want to display existing labels from the dataset for this job.
  9. For Label attribute name, choose the name from your manifest that corresponds to the labels that you want to display for adjustment.
    You will only see label attribute names for labels that match the task type you selected in the previous steps.
  10. Manually enter the labels for the Bounding box labeling tool.
    The labels must contain the same labels used in the public dataset. You can add new labels. The following screenshot shows how you can choose the workers and configure the tool for your labeling job.
  11. Choose Preview to preview the image and original annotations.

We have now created a labeling job in Ground Truth. After our job is complete, we can load the newly generated labeled data into FiftyOne. Ground Truth produces output data in a Ground Truth output manifest. For more details on the output manifest file, see Bounding Box Job Output. The following code shows an example of this output manifest format:

{
    "source-ref": "s3://AWSDOC-EXAMPLE-BUCKET/example_image.png",
    "bounding-box-attribute-name":
    {
        "image_size": [{ "width": 500, "height": 400, "depth":3}],
        "annotations":
        [
            {"class_id": 0, "left": 111, "top": 134,
                    "width": 61, "height": 128},
            {"class_id": 5, "left": 161, "top": 250,
                     "width": 30, "height": 30},
            {"class_id": 5, "left": 20, "top": 20,
                     "width": 30, "height": 30}
        ]
    },
    "bounding-box-attribute-name-metadata":
    {
        "objects":
        [
            {"confidence": 0.8},
            {"confidence": 0.9},
            {"confidence": 0.9}
        ],
        "class-map":
        {
            "0": "jersey",
            "5": "polka dot"
        },
        "type": "groundtruth/object-detection",
        "human-annotated": "yes",
        "creation-date": "2018-10-18T22:18:13.527256",
        "job-name": "identify-fashion-set"
    },
    "adjusted-bounding-box":
    {
        "image_size": [{ "width": 500, "height": 400, "depth":3}],
        "annotations":
        [
            {"class_id": 0, "left": 110, "top": 135,
                    "width": 61, "height": 128},
            {"class_id": 5, "left": 161, "top": 250,
                     "width": 30, "height": 30},
            {"class_id": 5, "left": 10, "top": 10,
                     "width": 30, "height": 30}
        ]
    },
    "adjusted-bounding-box-metadata":
    {
        "objects":
        [
            {"confidence": 0.8},
            {"confidence": 0.9},
            {"confidence": 0.9}
        ],
        "class-map":
        {
            "0": "dog",
            "5": "bone"
        },
        "type": "groundtruth/object-detection",
        "human-annotated": "yes",
        "creation-date": "2018-11-20T22:18:13.527256",
        "job-name": "adjust-identify-fashion-set",
        "adjustment-status": "adjusted"
    }
 }

Review labeled results from Ground Truth in FiftyOne

After the job is complete, download the output manifest of the labeling job from Amazon S3.
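For example, you can download it with boto3; the bucket and key below are placeholders for the output location that you configured for the labeling job.

import boto3

s3 = boto3.client("s3")

# Placeholder values; use the output S3 path of your adjustment job
s3.download_file(
    "sagemaker-your-preferred-s3-bucket",
    "path/to/your-adjustment-job/manifests/output/output.manifest",
    "output.manifest",
)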

Read the output manifest file:

with open('<path-to-your-output.manifest>', 'r') as fh:
    adjustment_manifest_lines = fh.readlines()

Create a FiftyOne dataset and convert the manifest lines to samples in the dataset:

def get_classification_labels(manifest_line, dataset, attr_name) -> fo.Classifications:
    label_attribute_data = manifest_line.get(attr_name)
    metadata = manifest_line.get(f"{attr_name}-metadata")
 
    annotations = label_attribute_data.get("annotations")
 
    image_data = label_attribute_data.get("image_size")[0]
    width = image_data.get("width")
    height = image_data.get("height")

    predictions = []
    for i, annotation in enumerate(annotations):
        label = metadata.get("class-map").get(str(annotation.get("class_id")))

        confidence = metadata.get("objects")[i].get("confidence")
        
        prediction = fo.Classification(label=label, confidence=confidence)

        predictions.append(prediction)

    return fo.Classifications(classifications=predictions)

def get_bounding_box_labels(manifest_line, dataset, attr_name) -> fo.Detections:
    label_attribute_data = manifest_line.get(attr_name)
    metadata = manifest_line.get(f"{attr_name}-metadata")
 
    annotations = label_attribute_data.get("annotations")
 
    image_data = label_attribute_data.get("image_size")[0]
    width = image_data.get("width")
    height = image_data.get("height")

    detections = []
    for i, annotation in enumerate(annotations):
        label = metadata.get("class-map").get(str(annotation.get("class_id")))

        confidence = metadata.get("objects")[i].get("confidence")

        # Bounding box coordinates should be relative values
        # in [0, 1] in the following format:
        # [top-left-x, top-left-y, width, height]
        bounding_box = [
            annotation.get("left") / width,
            annotation.get("top") / height,
            annotation.get("width") / width,
            annotation.get("height") / height,
        ]

        detection = fo.Detection(
            label=label, bounding_box=bounding_box, confidence=confidence
        )
        
        detections.append(detection)

    return fo.Detections(detections=detections)
    
def get_sample_from_manifest_line(manifest_line, dataset, attr_name):
    """
    For each line in manifest, transform annotations into Fiftyone format
    Args:
        line: manifest line
    Output:
        Fiftyone image sample
    """
    file_name = manifest_line.get("source-ref")[5:].split("/")[-1]
    file_loc = f'200kFashionDatasetExportResult-16Images/data/{file_name}'

    sample = fo.Sample(filepath=file_loc)

    sample['ground_truth'] = get_bounding_box_labels(
        manifest_line=manifest_line, dataset=dataset, attr_name=attr_name
    )
    sample["prediction"] = get_classification_labels(
        manifest_line=manifest_line, dataset=dataset, attr_name=attr_name
    )

    return sample

adjustment_dataset = fo.Dataset("adjustment-job-dataset")

samples = [
            get_sample_from_manifest_line(
                manifest_line=json.loads(manifest_line), dataset=adjustment_dataset, attr_name='smgt-fiftyone-style-adjustment-job'
            )
            for manifest_line in adjustment_manifest_lines
        ]

adjustment_dataset.add_samples(samples)

session = fo.launch_app(adjustment_dataset)

You can now see high-quality labeled data from Ground Truth in FiftyOne.

Conclusion

In this post, we showed how to build high-quality datasets by combining the power of FiftyOne by Voxel51, an open-source toolkit that allows you to manage, track, visualize, and curate your datasets, with Ground Truth, a data labeling service that helps you efficiently and accurately label the datasets required for training ML systems. Ground Truth provides multiple built-in task templates and access to a diverse workforce through Mechanical Turk, third-party vendors, or your own private workforce.

We encourage you to try out this new functionality by installing a FiftyOne instance and using the Ground Truth console to get started. To learn more about Ground Truth, refer to Label Data, Amazon SageMaker Data Labeling FAQs, and the AWS Machine Learning Blog.

Connect with the Machine Learning & AI community if you have any questions or feedback!

Join the FiftyOne community!

Join the thousands of engineers and data scientists already using FiftyOne to solve some of the most challenging problems in computer vision today!


About the Authors

Shalendra Chhabra is currently Head of Product Management for Amazon SageMaker Human-in-the-Loop (HIL) Services. Previously, Shalendra incubated and led Language and Conversational Intelligence for Microsoft Teams Meetings, was EIR at Amazon Alexa Techstars Startup Accelerator, VP of Product and Marketing at Discuss.io, Head of Product and Marketing at Clipboard (acquired by Salesforce), and Lead Product Manager at Swype (acquired by Nuance). In total, Shalendra has helped build, ship, and market products that have touched more than a billion lives.

Jacob Marks is a Machine Learning Engineer and Developer Evangelist at Voxel51, where he helps bring transparency and clarity to the world’s data. Prior to joining Voxel51, Jacob founded a startup to help emerging musicians connect and share creative content with fans. Before that, he worked at Google X, Samsung Research, and Wolfram Research. In a past life, Jacob was a theoretical physicist, completing his PhD at Stanford, where he investigated quantum phases of matter. In his free time, Jacob enjoys climbing, running, and reading science fiction novels.

Jason Corso is co-founder and CEO of Voxel51, where he steers strategy to help bring transparency and clarity to the world’s data through state-of-the-art flexible software. He is also a Professor of Robotics, Electrical Engineering, and Computer Science at the University of Michigan, where he focuses on cutting-edge problems at the intersection of computer vision, natural language, and physical platforms. In his free time, Jason enjoys spending time with his family, reading, being in nature, playing board games, and all sorts of creative activities.

Brian Moore is co-founder and CTO of Voxel51, where he leads technical strategy and vision. He holds a PhD in Electrical Engineering from the University of Michigan, where his research was focused on efficient algorithms for large-scale machine learning problems, with a particular emphasis on computer vision applications. In his free time, he enjoys badminton, golf, hiking, and playing with his twin Yorkshire Terriers.

Zhuling Bai is a Software Development Engineer at Amazon Web Services. She works on developing large-scale distributed systems to solve machine learning problems.

Read More

Achieve high performance with lowest cost for generative AI inference using AWS Inferentia2 and AWS Trainium on Amazon SageMaker

Achieve high performance with lowest cost for generative AI inference using AWS Inferentia2 and AWS Trainium on Amazon SageMaker

The world of artificial intelligence (AI) and machine learning (ML) has been witnessing a paradigm shift with the rise of generative AI models that can create human-like text, images, code, and audio. Compared to classical ML models, generative AI models are significantly bigger and more complex. However, their increasing complexity also comes with high costs for inference and a growing need for powerful compute resources. The high cost of inference for generative AI models can be a barrier to entry for businesses and researchers with limited resources, necessitating the need for more efficient and cost-effective solutions. Furthermore, the majority of generative AI use cases involve human interaction or real-world scenarios, necessitating hardware that can deliver low-latency performance. AWS has been innovating with purpose-built chips to address the growing need for powerful, efficient, and cost-effective compute hardware.

Today, we are excited to announce that Amazon SageMaker supports AWS Inferentia2 (ml.inf2) and AWS Trainium (ml.trn1) based SageMaker instances to host generative AI models for real-time and asynchronous inference. ml.inf2 instances are available for model deployment on SageMaker in US East (Ohio) and ml.trn1 instances in US East (N. Virginia).

You can use these instances on SageMaker to achieve high performance at a low cost for generative AI models, including large language models (LLMs), Stable Diffusion, and vision transformers. In addition, you can use Amazon SageMaker Inference Recommender to help you run load tests and evaluate the price-performance benefits of deploying your model on these instances.

You can use ml.inf2 and ml.trn1 instances to run your ML applications on SageMaker for text summarization, code generation, video and image generation, speech recognition, personalization, fraud detection, and more. You can easily get started by specifying ml.trn1 or ml.inf2 instances when configuring your SageMaker endpoint. You can use ml.trn1 and ml.inf2 compatible AWS Deep Learning Containers (DLCs) for PyTorch, TensorFlow, Hugging Face, and large model inference (LMI) to easily get started. For the full list with versions, see Available Deep Learning Containers Images.

In this post, we show the process of deploying a large language model on AWS Inferentia2 using SageMaker, without requiring any extra coding, by taking advantage of the LMI container. We use GPT4All-J, a fine-tuned version of the 6-billion-parameter GPT-J model that provides chatbot-style interaction.

Overview of ml.trn1 and ml.inf2 instances

ml.trn1 instances are powered by the Trainium accelerator, which is purpose built mainly for high-performance deep learning training of generative AI models, including LLMs. However, these instances also support inference workloads for models that are even larger than what fits into Inf2. The largest instance size, trn1.32xlarge, features 16 Trainium accelerators with 512 GB of accelerator memory in a single instance, delivering up to 3.4 petaflops of FP16/BF16 compute power. The 16 Trainium accelerators are connected with ultra-high-speed NeuronLink-v2 for streamlined collective communications.

ml.inf2 instances are powered by the AWS Inferentia2 accelerator, a purpose-built accelerator for inference. It delivers three times higher compute performance, up to four times higher throughput, and up to 10 times lower latency compared to first-generation AWS Inferentia. The largest instance size, ml.inf2.48xlarge, features 12 AWS Inferentia2 accelerators with 384 GB of accelerator memory in a single instance, for a combined compute power of 2.3 petaflops for BF16/FP16. It enables you to deploy up to a 175-billion-parameter model in a single instance. For ultra-large models that don’t fit into a single accelerator, data flows directly between accelerators over NeuronLink, bypassing the CPU completely; Inf2 is the only inference-optimized instance family to offer this interconnect, a feature otherwise available only in more expensive training instances. With NeuronLink, Inf2 supports faster distributed inference and improves throughput and latency.

Both AWS Inferentia2 and Trainium accelerators have two NeuronCore-v2 cores, 32 GB of HBM memory, and dedicated collective-compute engines, which automatically optimize runtime by overlapping computation and communication when doing multi-accelerator inference. For more details on the architecture, refer to Trainium and Inferentia devices.

The following diagram shows an example architecture using AWS Inferentia2.

AWS Neuron SDK

AWS Neuron is the SDK used to run deep learning workloads on AWS Inferentia and Trainium based instances. AWS Neuron includes a deep learning compiler, runtime, and tools that are natively integrated into TensorFlow and PyTorch. With Neuron, you can develop, profile, and deploy high-performance ML workloads on ml.trn1 and ml.inf2.

The Neuron Compiler accepts ML models in various formats (TensorFlow, PyTorch, XLA HLO) and optimizes them to run on Neuron devices. The Neuron compiler is invoked within the ML framework, where ML models are sent to the compiler by the Neuron framework plugin. The resulting compiler artifact is called a NEFF file (Neuron Executable File Format) that in turn is loaded by the Neuron runtime to the Neuron device.

The Neuron runtime consists of kernel driver and C/C++ libraries, which provide APIs to access AWS Inferentia and Trainium Neuron devices. The Neuron ML frameworks plugins for TensorFlow and PyTorch use the Neuron runtime to load and run models on the NeuronCores. The Neuron runtime loads compiled deep learning models (NEFF) to the Neuron devices and is optimized for high throughput and low latency.
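To make the compilation step concrete, the following is a minimal sketch using the PyTorch Neuron (torch-neuronx) tracing API; the toy model and example input are placeholders, and exact APIs can vary across Neuron SDK versions.

import torch
import torch_neuronx

# Placeholder model; any traceable PyTorch model is handled similarly
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 2),
)
model.eval()
example = torch.rand(1, 128)

# Compile (trace) the model for Neuron devices; the artifact embeds the NEFF
neuron_model = torch_neuronx.trace(model, example)

# Save the compiled model so the runtime can load it at serving time
torch.jit.save(neuron_model, "model_neuron.pt")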

Host NLP models using SageMaker ml.inf2 instances

Before we dive deep into serving LLMs with transformers-neuronx, which is an open-source library that shards the model’s large weight matrices onto multiple NeuronCores, let’s briefly go through the typical deployment flow for a model that can fit onto a single NeuronCore.

Check the list of supported models to ensure the model is supported on AWS Inferentia2. Next, the model needs to be pre-compiled by the Neuron Compiler. You can use a SageMaker notebook or an Amazon Elastic Compute Cloud (Amazon EC2) instance to compile the model. You can use the SageMaker Python SDK to deploy models using popular deep learning frameworks such as PyTorch, as shown in the following code. You can deploy your model to SageMaker hosting services and get an endpoint that can be used for inference. These endpoints are fully managed and support auto scaling.

from sagemaker.pytorch.model import PyTorchModel

pytorch_model = PyTorchModel(
    model_data=s3_model_uri,
    role=role,
    source_dir="code",
    entry_point="inference.py",
    image_uri=ecr_image
)

predictor = pytorch_model.deploy(
    initial_instance_count=1, 
    instance_type="ml.inf2.xlarge"
)

Refer to Developer Flows for more details on typical development flows of Inf2 on SageMaker with sample scripts.

Host LLMs using SageMaker ml.inf2 instances

Large language models with billions of parameters are often too big to fit on a single accelerator. This necessitates the use of model parallel techniques for hosting LLMs across multiple accelerators. Another crucial requirement for hosting LLMs is the implementation of a high-performance model-serving solution. This solution should efficiently load the model, manage partitioning, and seamlessly serve requests via HTTP endpoints.

SageMaker includes specialized deep learning containers (DLCs), libraries, and tooling for model parallelism and large model inference. For resources to get started with LMI on SageMaker, refer to Model parallelism and large model inference. SageMaker maintains DLCs with popular open-source libraries for hosting large models such as GPT, T5, OPT, BLOOM, and Stable Diffusion on AWS infrastructure. These specialized DLCs are referred to as SageMaker LMI containers.

SageMaker LMI containers use DJLServing, a model server that is integrated with the transformers-neuronx library to support tensor parallelism across NeuronCores. To learn more about how DJLServing works, refer to Deploy large models on Amazon SageMaker using DJLServing and DeepSpeed model parallel inference. The DJL model server and transformers-neuronx library serve as core components of the container, which also includes the Neuron SDK. This setup facilitates the loading of models onto AWS Inferentia2 accelerators, parallelizes the model across multiple NeuronCores, and enables serving via HTTP endpoints.

The LMI container supports loading models from an Amazon Simple Storage Service (Amazon S3) bucket or Hugging Face Hub. The default handler script loads the model, compiles and converts it into a Neuron-optimized format, and loads it. To use the LMI container to host LLMs, we have two options:

  • No code (preferred) – This is the easiest way to deploy an LLM using an LMI container. In this method, you can use the provided default handler and just pass the model name and the parameters required in the serving.properties file to load and host the model. To use the default handler, we provide the entryPoint parameter as djl_python.transformers-neuronx.
  • Bring your own script – In this approach, you have the option to create your own model.py file, which contains the code necessary for loading and serving the model. This file acts as an intermediary between the DJLServing APIs and the transformers-neuronx APIs. To customize the model loading process, you can provide serving.properties with configurable parameters. For a comprehensive list of available configurable parameters, refer to All DJL configuration options. A sketch of what a model.py file can look like follows this list.
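The following is a minimal sketch of the shape a custom model.py handler can take, assuming the djl_python Input/Output API; the model loading and text generation shown here are placeholders for the transformers-neuronx calls you would make in practice.

from djl_python import Input, Output

model = None  # loaded lazily on the first request


def load_model(properties):
    # Placeholder: load and shard the model with transformers-neuronx here,
    # using values such as tensor_parallel_degree from serving.properties
    ...


def handle(inputs: Input):
    global model
    if model is None:
        model = load_model(inputs.get_properties())

    if inputs.is_empty():
        # Warmup or ping request from the model server
        return None

    data = inputs.get_as_json()
    prompt = data.get("inputs", "")

    # Placeholder: run generation with the loaded model
    generated_text = f"echo: {prompt}"

    return Output().add_as_json({"generated_text": generated_text})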

Runtime architecture

The tensor_parallel_degree property value determines the distribution of tensor parallel modules across multiple NeuronCores. For instance, inf2.24xlarge has six AWS Inferentia2 accelerators. Each AWS Inferentia2 accelerator has two NeuronCores, and each NeuronCore has a dedicated high bandwidth memory (HBM) of 16 GB for storing tensor parallel modules. With a tensor parallel degree of 4, the LMI allocates three copies of the model, each utilizing four NeuronCores. As shown in the following diagram, when the LMI container starts, the model is loaded and traced first in the CPU addressable memory. When the tracing is complete, the model is partitioned across the NeuronCores based on the tensor parallel degree.

LMI uses DJLServing as its model serving stack. After the container’s health check passes in SageMaker, the container is ready to serve inference requests. DJLServing launches multiple Python processes, equal to the total number of NeuronCores divided by the tensor parallel degree. Each Python process contains C++ threads equal to the tensor parallel degree, and each C++ thread holds one shard of the model on one NeuronCore.

When the server is invoked with multiple independent requests, each worker (Python process) would otherwise run inference on them sequentially. Although this is easier to set up, it usually doesn’t make the best use of the accelerator’s compute power. To address this, DJLServing offers built-in dynamic batching to combine these independent inference requests on the server side into a larger batch dynamically and increase throughput. All the requests reach the dynamic batcher first before entering the actual job queues to wait for inference. You can set your preferred batch sizes for dynamic batching using the batch_size setting in serving.properties. You can also configure max_batch_delay to specify the maximum delay time in the batcher to wait for other requests to join the batch, based on your latency requirements. The throughput also depends on the number of model copies and the Python process groups launched in the container. As shown in the following diagram, with the tensor parallel degree set to 4, the LMI container launches three Python process groups, each holding a full copy of the model. This allows you to increase the batch size and get higher throughput.
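As an illustration, dynamic batching is controlled by a couple of extra lines in serving.properties; the values below are examples to tune against your own latency and throughput targets, alongside the options shown in the next section.

engine=Python
option.entryPoint=djl_python.transformers-neuronx
option.tensor_parallel_degree=4
# example dynamic batching settings
batch_size=4
max_batch_delay=100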

SageMaker notebook for deploying LLMs

In this section, we provide a step-by-step walkthrough of deploying GPT4All-J, a 6-billion-parameter model that is 24 GB in FP32. GPT4All-J is a popular chatbot that has been trained on a vast variety of interaction content like word problems, dialogs, code, poems, songs, and stories. GPT4All-J is a fine-tuned GPT-J model that generates responses similar to human interactions.

The complete notebook for this example is provided on GitHub. We can use the SageMaker Python SDK to deploy the model to an Inf2 instance. We use the provided default handler to load the model, so we just need to provide a serving.properties file. This file has the required configurations for the DJL model server to download and host the model. We can specify the name of the Hugging Face model using the model_id parameter to download the model directly from the Hugging Face repo. Alternatively, you can download the model from Amazon S3 by providing the s3url parameter. The entryPoint parameter is configured to point to the library to load the model. For more details on djl_python.transformers-neuronx, refer to the GitHub code.

The tensor_parallel_degree property value determines the distribution of tensor parallel modules across multiple devices. For instance, with 12 NeuronCores and a tensor parallel degree of 4, LMI will allocate three model copies, each utilizing four NeuronCores. You can also define the precision type using the dtype property. The n_positions parameter defines the sum of the maximum input and output sequence lengths for the model. See the following code:

%%writefile serving.properties
engine=Python
option.entryPoint=djl_python.transformers-neuronx
#option.model_id=nomic-ai/gpt4all-j
option.s3url={{s3url}}
option.tensor_parallel_degree=2
option.model_loading_timeout=2400
option.n_positions=512

Construct the tarball containing serving.properties and upload it to an S3 bucket. Although the default handler is used in this example, you can develop a model.py file for customizing the loading and serving process. If there are any packages that need installation, include them in the requirements.txt file. See the following code:

%%sh
mkdir mymodel
mv serving.properties mymodel/
tar czvf mymodel.tar.gz mymodel/
rm -rf mymodel

s3_code_prefix = "large-model-lmi/code"
bucket = sess.default_bucket()  # bucket to house artifacts
code_artifact = sess.upload_data("mymodel.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")

Retrieve the DJL container image and create the SageMaker model:

##Retrieve djl container image
image_uri = image_uris.retrieve(
        framework="djl-deepspeed",
        region=sess.boto_session.region_name,
        version="0.21.0"
    )
image_uri = image_uri.split(":")[0] + ":" + "0.22.1-neuronx-sdk2.9.0"

model = Model(image_uri=image_uri, model_data=code_artifact, env=env, role=role)

Next, we create the SageMaker endpoint with the model configuration defined earlier. The container downloads the model into the /tmp space because SageMaker maps the /tmp to Amazon Elastic Block Store (Amazon EBS). We need to add a volume_size parameter to ensure the /tmp directory has enough space to download and compile the model. We set container_startup_health_check_timeout to 3,600 seconds to ensure the health check starts after the model is ready. We use the ml.inf2.8xlarge instance. See the following code:

instance_type = "ml.inf2.8xlarge"
endpoint_name = sagemaker.utils.name_from_base("lmi-model")


model.deploy(initial_instance_count=1,
             instance_type=instance_type,
             endpoint_name=endpoint_name,
             container_startup_health_check_timeout=3600,
             volume_size=256
            )

After the SageMaker endpoint has been created, we can make real-time predictions against SageMaker endpoints using the Predictor object:

# our requests and responses will be in json format so we specify the serializer and the deserializer
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
    deserializer=deserializers.JSONDeserializer(),
)

predictor.predict(
    {"inputs": "write a blog on new York", "parameters": {}}
)

Clean up

Delete the endpoints to save costs after you finish your tests:

# - Delete the end point
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
model.delete_model()

Conclusion

In this post, we showcased the newly launched capability of SageMaker, which now supports ml.inf2 and ml.trn1 instances for hosting generative AI models. We demonstrated how to deploy GPT4ALL-J, a generative AI model, on AWS Inferentia2 using SageMaker and the LMI container, without writing any code. We also showcased how you can use DJLServing and transformers-neuronx to load a model, partition it, and serve.

Inf2 instances provide the most cost-effective way to run generative AI models on AWS. For performance details, refer to Inf2 Performance.

Check out the GitHub repo for an example notebook. Try it out and let us know if you have any questions!


About the Authors

Vivek Gangasani is a Senior Machine Learning Solutions Architect at Amazon Web Services. He works with Machine Learning Startups to build and deploy AI/ML applications on AWS. He is currently focused on delivering solutions for MLOps, ML Inference and low-code ML. He has worked on projects in different domains, including Natural Language Processing and Computer Vision.

Hiroshi Tokoyo is a Solutions Architect at AWS Annapurna Labs. Based in Japan, he joined Annapurna Labs even before the acquisition by AWS and has consistently helped customers with Annapurna Labs technology. His recent focus is on Machine Learning solutions based on purpose-built silicon, AWS Inferentia and Trainium.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing, and Artificial Intelligence. He focuses on Deep learning including NLP and Computer Vision domains. He helps customers achieve high performance model inference on SageMaker.

Qing Lan is a Software Development Engineer in AWS. He has been working on several challenging products in Amazon, including high performance ML inference solutions and high performance logging system. Qing’s team successfully launched the first Billion-parameter model in Amazon Advertising with very low latency required. Qing has in-depth knowledge on the infrastructure optimization and Deep Learning acceleration.

Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.

Alan Tan is a Senior Product Manager with SageMaker leading efforts on large model inference. He’s passionate about applying Machine Learning to the area of Analytics. Outside of work, he enjoys the outdoors.

Varun Syal is a Software Development Engineer with AWS Sagemaker working on critical customer facing features for the ML Inference platform. He is passionate about working in the Distributed Systems and AI space. In his spare time, he likes reading and gardening.

Read More

Automate the deployment of an Amazon Forecast time-series forecasting model

Automate the deployment of an Amazon Forecast time-series forecasting model

Time series forecasting refers to the process of predicting future values of time series data (data that is collected at regular intervals over time). Simple methods for time series forecasting use historical values of the same variable whose future values need to be predicted, whereas more complex, machine learning (ML)-based methods use additional information, such as the time series data of related variables.

Amazon Forecast is an ML-based time series forecasting service that includes algorithms that are based on over 20 years of forecasting experience used by Amazon.com, bringing the same technology used at Amazon to developers as a fully managed service, removing the need to manage resources. Forecast uses ML to learn not only the best algorithm for each item, but also the best ensemble of algorithms for each item, automatically creating the best model for your data.

This post describes how to deploy recurring Forecast workloads (time series forecasting workloads) with no code using AWS CloudFormation, AWS Step Functions, and AWS Systems Manager. The method presented here helps you build a pipeline that allows you to use the same workflow starting from the first day of your time series forecasting experimentation through the deployment of the model into production.

Time series forecasting using Forecast

The workflow for Forecast involves the following common concepts:

  • Importing datasets – In Forecast, a dataset group is a collection of datasets, schema, and forecast results that go together. Each dataset group can have up to three datasets, one of each dataset type: target time series (TTS), related time series (RTS), and item metadata. A dataset is a collection of files that contain data that is relevant for a forecasting task. A dataset must conform to the schema defined within Forecast. For more details, refer to Importing Datasets.
  • Training predictors – A predictor is a Forecast-trained model used for making forecasts based on time series data. During training, Forecast calculates accuracy metrics that you use to evaluate the predictor and decide whether to use the predictor to generate a forecast. For more information, refer to Training Predictors.
  • Generating forecasts – You can then use the trained model for generating forecasts for a future time horizon, known as the forecasting horizon. Forecast provides forecasts at various specified quantiles. For example, a forecast at the 0.90 quantile will estimate a value that is lower than the observed value 90% of the time. By default, Forecast uses the following values for the predictor forecast types: 0.1 (P10), 0.5 (P50), and 0.9 (P90). Forecasts at various quantiles are typically used to provide a prediction interval (an upper and lower bound for forecasts) to account for forecast uncertainty.

You can implement this workflow in Forecast either from the AWS Management Console, the AWS Command Line Interface (AWS CLI), via API calls using Python notebooks, or via automation solutions. The console and AWS CLI methods are best suited for quick experimentation to check the feasibility of time series forecasting using your data. The Python notebook method is great for data scientists already familiar with Jupyter notebooks and coding, and provides maximum control and tuning. However, the notebook-based method is difficult to operationalize. Our automation approach facilitates rapid experimentation, eliminates repetitive tasks, and allows easier transition between various environments (development, staging, production).
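
For example, when calling the APIs directly, the quantiles described above map to the ForecastTypes parameter of the CreateForecast API. The following minimal boto3 sketch requests the default P10/P50/P90 quantiles; the forecast name and predictor ARN are placeholders you would replace with your own values:

    import boto3

    forecast = boto3.client("forecast")

    # Request predictions at the 0.1, 0.5, and 0.9 quantiles (P10/P50/P90).
    response = forecast.create_forecast(
        ForecastName="my-demand-forecast",  # placeholder name
        PredictorArn="arn:aws:forecast:us-east-1:111122223333:predictor/example",  # placeholder ARN
        ForecastTypes=["0.1", "0.5", "0.9"],
    )
    print(response["ForecastArn"])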

In this post, we describe an automation approach to using Forecast that allows you to use your own data and provides a single workflow that you can use seamlessly throughout the lifecycle of the development of your forecasting solution, from the first days of experimentation through the deployment of the solution in your production environment.

Solution overview

In the following sections, we describe a complete end-to-end workflow that serves as a template to follow for automated deployment of time series forecasting models using Forecast. This workflow creates forecasted data points from an open-source input dataset; however, you can use the same workflow for your own data, as long as you can format your data according to the steps outlined in this post. After you upload the data, we walk you through the steps to create Forecast dataset groups, import data, train ML models, and produce forecasted data points on future unseen time horizons from raw data. All of this is possible without having to write or compile code.

The following diagram illustrates the forecasting workflow.

Cyclical forecasting workflow

The solution is deployed using two CloudFormation templates: the dependencies template and the workload template. CloudFormation enables you to perform AWS infrastructure deployments predictably and repeatedly by using templates describing the resources to be deployed. A deployed template is referred to as a stack. We’ve taken care of defining the infrastructure in the solution for you in the two provided templates. The dependencies template defines prerequisite resources used by the workload template, such as an Amazon Simple Storage Service (Amazon S3) bucket for object storage and AWS Identity and Access Management (IAM) permissions for AWS API actions. The resources defined in the dependencies template may be shared by multiple workload templates. The workload template defines the resources used to ingest data, train a predictor, and generate a forecast.

Deployment workflow

Deploy the dependencies CloudFormation template

First, let’s deploy the dependencies template to create our prerequisite resources. The dependencies template deploys an optional S3 bucket, AWS Lambda functions, and IAM roles. Amazon S3 is a low-cost, highly available, resilient, object storage service. We use an S3 bucket in this solution to store source data and trigger the workflow, resulting in a forecast. Lambda is a serverless, event-driven compute service that lets you run code without provisioning or managing servers. The dependencies template includes functions to do things like create a dataset group in Forecast and purge objects within an S3 bucket before deleting the bucket. IAM roles define permissions within AWS for users and services. The dependencies template deploys a role to be used by Lambda and another for Step Functions, a workflow management service that will coordinate the tasks of data ingestion and processing, as well as predictor training and inference using Forecast.

Complete the following steps to deploy the dependencies template:

  1. On the console, select the desired Region supported by Forecast for solution deployment.
  2. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  3. Choose Create stack and choose With new resources (standard).
    Create stack
  4. For Template source, select Amazon S3 URL.
  5. Enter the template URL: https://amazon-forecast-samples.s3.us-west-2.amazonaws.com/ml_ops/forecast-mlops-dependency.yaml.
  6. Choose Next.
    Specify template
  7. For Stack name, enter forecast-mlops-dependency.
  8. Under Parameters, choose to use an existing S3 bucket or create a new one, then provide the name of the bucket.
  9. Choose Next.
  10. Choose Next to accept the default stack options.
  11. Select the check box to acknowledge the stack creates IAM resources, then choose Create stack to deploy the template.

You should see the template deploy as the forecast-mlops-dependency stack. When the status changes to CREATE_COMPLETE, you may move to the next step.
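
If you prefer to script this deployment, the same stack can be created with boto3. The following is a minimal sketch; the bucket parameter key is an assumption, so confirm the exact key names in the template’s Parameters section before running it:

    import boto3

    cfn = boto3.client("cloudformation", region_name="us-west-2")  # use a Forecast-supported Region

    cfn.create_stack(
        StackName="forecast-mlops-dependency",
        TemplateURL=(
            "https://amazon-forecast-samples.s3.us-west-2.amazonaws.com/"
            "ml_ops/forecast-mlops-dependency.yaml"
        ),
        Parameters=[
            # Assumed parameter key for the bucket name; verify against the template.
            {"ParameterKey": "S3Bucket", "ParameterValue": "my-forecast-bucket"},
        ],
        Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],  # the stack creates IAM resources
    )

    # Block until the stack reaches CREATE_COMPLETE.
    cfn.get_waiter("stack_create_complete").wait(StackName="forecast-mlops-dependency")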

Deploy the workload CloudFormation template

Next, let’s deploy the workload template to create our prerequisite resources. The workload template deploys Step Functions state machines for workflow management, AWS Systems Manager Parameter Store parameters to store parameter values from AWS CloudFormation and inform the workflow, an Amazon Simple Notification Service (Amazon SNS) topic for workflow notifications, and an IAM role for workflow service permissions.

The solution creates five state machines:

  • CreateDatasetGroupStateMachine – Creates a Forecast dataset group for data to be imported into.
  • CreateImportDatasetStateMachine – Imports source data from Amazon S3 into a dataset group for training.
  • CreateForecastStateMachine – Manages the tasks required to train a predictor and generate a forecast.
  • AthenaConnectorStateMachine – Enables you to write SQL queries with the Amazon Athena connector to land data in Amazon S3. This is an optional process to obtain historical data in the required format for Forecast by using Athena instead of placing files manually in Amazon S3.
  • StepFunctionWorkflowStateMachine – Coordinates calls out to the other four state machines and manages the overall workflow.

Parameter Store, a capability of Systems Manager, provides secure, hierarchical storage and programmatic retrieval of configuration data management and secrets management. Parameter Store is used to store parameters set in the workload stack as well as other parameters used by the workflow.
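
For example, once the workload stack is deployed, you can inspect every parameter the workflow uses from code. The following sketch lists the parameters stored under the workload’s prefix; replace <StackName> with your workload stack name:

    import boto3

    ssm = boto3.client("ssm")

    # List all workflow parameters stored under this workload's prefix.
    paginator = ssm.get_paginator("get_parameters_by_path")
    for page in paginator.paginate(Path="/forecast/<StackName>/", Recursive=True):
        for parameter in page["Parameters"]:
            print(parameter["Name"], "=", parameter["Value"])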

Complete the following steps to deploy the workload template:

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Choose Create stack and choose With new resources (standard).
  3. For Template source, select Amazon S3 URL.
  4. Enter the template URL: https://amazon-forecast-samples.s3.us-west-2.amazonaws.com/ml_ops/forecast-mlops-solution-guidance.yaml.
  5. Choose Next.
  6. For Stack name, enter a name.
  7. Accept the default values or modify the parameters.

Be sure to enter the S3 bucket name from the dependencies stack for S3 Bucket and a valid email address for SNSEndpoint even if you accept the default parameter values.

The following list describes each parameter, along with the related Forecast documentation where applicable.

  • DatasetGroupFrequencyRTS – The frequency of data collection for the RTS dataset.
  • DatasetGroupFrequencyTTS – The frequency of data collection for the TTS dataset.
  • DatasetGroupName – A short name for the dataset group, a self-contained workload. (See CreateDatasetGroup.)
  • DatasetIncludeItem – Specify if you want to provide item metadata for this use case.
  • DatasetIncludeRTS – Specify if you want to provide a related time series for this use case.
  • ForecastForecastTypes – When a CreateForecast job runs, this declares which quantiles to produce predictions for. You may choose up to five values in this array. Edit this value to include values according to need. (See CreateForecast.)
  • PredictorAttributeConfigs – For the target variable in TTS and each numeric field in the RTS datasets, a record must be created for each time interval for each item. This configuration helps determine how missing records are filled in: with 0, NaN, or otherwise. We recommend filling the gaps in the TTS with NaN instead of 0. With 0, the model might wrongly learn to bias forecasts toward 0. NaN is how the guidance is delivered. Consult with your AWS Solutions Architect with any questions on this. (See CreateAutoPredictor.)
  • PredictorExplainPredictor – Valid values are TRUE or FALSE. These determine whether explainability is enabled for your predictor. This can help you understand how values in the RTS and item metadata influence the model. (See Explainability.)
  • PredictorForecastDimensions – You may want to forecast at a finer grain than item. Here, you can specify dimensions such as location, cost center, or whatever your needs are. This needs to agree with the dimensions in your RTS and TTS. Note that if you have no dimension, the correct parameter is null, by itself and in all lowercase. null is a reserved word that lets the system know there is no parameter for the dimension. (See CreateAutoPredictor.)
  • PredictorForecastFrequency – Defines the time scale at which your model and predictions will be generated, such as daily, weekly, or monthly. The drop-down menu helps you choose allowed values. This needs to agree with your RTS time scale if you’re using RTS. (See CreateAutoPredictor.)
  • PredictorForecastHorizon – The number of time steps that the model predicts. The forecast horizon is also called the prediction length. (See CreateAutoPredictor.)
  • PredictorForecastOptimizationMetric – Defines the accuracy metric used to optimize the predictor. The drop-down menu helps you choose among metrics: weighted quantile loss balances over- or under-forecasting, RMSE is concerned with units, and WAPE/MAPE are concerned with percent errors. (See CreateAutoPredictor.)
  • PredictorForecastTypes – When a CreateAutoPredictor job runs, this declares which quantiles are used to train prediction points. You may choose up to five values in this array, allowing you to balance over- and under-forecasting. Edit this value to include values according to need. (See CreateAutoPredictor.)
  • S3Bucket – The name of the S3 bucket where input data and output data are written for this workload.
  • SNSEndpoint – A valid email address to receive notifications when the predictor and Forecast jobs are complete.
  • SchemaITEM – This defines the physical order, column names, and data types for your item metadata dataset. This is an optional file provided in the solution example. (See CreateDataset.)
  • SchemaRTS – This defines the physical order, column names, and data types for your RTS dataset. The dimensions must agree with your TTS. The time grain of this file governs the time grain at which predictions can be made. This is an optional file provided in the solution example. (See CreateDataset.)
  • SchemaTTS – This defines the physical order, column names, and data types for your TTS dataset, the only required dataset. The file must contain a target value, timestamp, and item at a minimum. (See CreateDataset.)
  • TimestampFormatRTS – Defines the timestamp format provided in the RTS file. (See CreateDatasetImportJob.)
  • TimestampFormatTTS – Defines the timestamp format provided in the TTS file. (See CreateDatasetImportJob.)
  1. Choose Next to accept the default stack options.
  2. Select the check box to acknowledge the stack creates IAM resources, then choose Create stack to deploy the template.

You should see the template deploy as the stack name you chose earlier. When the status changes to CREATE_COMPLETE, you may move to the data upload step.

Upload the data

In the previous section, you provided a stack name and an S3 bucket. This section describes how to deposit the publicly available dataset Food Demand in this bucket. If you’re using your own dataset, refer to Datasets to prepare your dataset in a format the deployment is expecting. The dataset needs to contain at least the target time series, and optionally, the related time series and the item metadata:

  • TTS is the time series data that includes the field that you want to generate a forecast for; this field is called the target field
  • RTS is time series data that doesn’t include the target field, but includes a related field
  • The item data file isn’t time series data, but includes metadata information about the items in the TTS or RTS datasets

Complete the following steps:

  1. If you’re using the provided sample dataset, download the dataset Food Demand to your computer and unzip the file, which creates three files inside three directories (rts, tts, item).
  2. On the Amazon S3 console, navigate to the bucket you created earlier.
  3. Choose Create folder.
  4. Use the same string as your workload stack name for the folder name.
  5. Choose Upload.
  6. Choose the three dataset folders, then choose Upload.

When the upload is complete, you should see something like the following screenshot. For this example, our folder is aiml42.

S3 folder structure
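
You can also perform the upload programmatically. The following sketch mirrors the console steps, assuming you unzipped the sample into local tts, rts, and item directories, aiml42 is your workload stack name, and my-forecast-bucket is the bucket from the dependencies stack:

    import os
    import boto3

    s3 = boto3.client("s3")
    bucket = "my-forecast-bucket"   # the bucket created or referenced by the dependencies stack
    prefix = "aiml42"               # must match your workload stack name

    # Upload each dataset folder (TTS is required; RTS and item metadata are optional).
    for folder in ("tts", "rts", "item"):
        for file_name in os.listdir(folder):
            s3.upload_file(
                Filename=os.path.join(folder, file_name),
                Bucket=bucket,
                Key=f"{prefix}/{folder}/{file_name}",
            )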

Create a Forecast dataset group

Complete the steps in this section to create a dataset group as a one-time event for each workload. Going forward, you should plan on running the import data, create predictor, and create forecast steps as appropriate, as a series, according to your schedule, which could be daily, weekly, or otherwise.

  1. On the Step Functions console, locate the state machine containing Create-Dataset-Group.
  2. On the state machine detail page, choose Start execution.
  3. Choose Start execution again to confirm.

The state machine takes about 1 minute to run. When it’s complete, the value under Execution Status should change from Running to Succeeded.

Execution status
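
You can start the same execution from code, which is useful once you schedule the workflow. The following is a minimal sketch that looks up the state machine by name:

    import boto3

    sfn = boto3.client("stepfunctions")

    # Find the Create-Dataset-Group state machine deployed by the workload stack.
    machines = sfn.list_state_machines()["stateMachines"]
    arn = next(m["stateMachineArn"] for m in machines if "Create-Dataset-Group" in m["name"])

    execution = sfn.start_execution(stateMachineArn=arn)
    print("Started:", execution["executionArn"])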

Import data into Forecast

Follow the steps in this section to import the data set that you uploaded to your S3 bucket into your dataset group:

  1. On the Step Functions console, locate the state machine containing Import-Dataset.
  2. On the state machine detail page, choose Start Execution.
  3. Choose Start execution again to confirm.

The amount of time the state machine takes to run depends on the dataset being processed.

Graph inspector

  1. While this is running, in your browser, open another tab and navigate to the Forecast console.
  2. On the Forecast console, choose View dataset groups and navigate to the dataset group with the name specified for DatasetGroupName from your workload stack.
  3. Choose View datasets.

You should see the data imports in progress.

Data imports in progress

When the state machine for Import-Dataset is complete, you can proceed to the next step to build your time series data model.

Create AutoPredictor (train a time series model)

This section describes how to train an initial predictor with Forecast. You may choose to create a new predictor (your first, baseline predictor) or retrain a predictor during each production cycle, which could be daily, weekly, or otherwise. You may also elect not to create a predictor each cycle and rely on predictor monitoring to guide you when to create one. The following figure visualizes the process of creating a production-ready Forecast predictor.

Production ready predictor workflow

To create a new predictor, complete the following steps:

  1. On the Step Functions console, locate the state machine containing Create-Predictor.
  2. On the state machine detail page, choose Start Execution.
  3. Choose Start execution again to confirm.
    The amount of runtime depends on the dataset being processed and could take up to an hour or more to complete.
  4. While this is running, in your browser, open another tab and navigate to the Forecast console.
  5. On the Forecast console, choose View dataset groups and navigate to the dataset group with the name specified for DatasetGroupName from your workload stack.
  6. Choose View predictors.

You should see the predictor training in progress (Training status shows “Create in progress…”).

Predictor training in progress

When the state machine for Create-Predictor is complete, you can evaluate its performance.

As part of the state machine, the system creates a predictor and also runs a BacktestExport job that writes time series-level predictor metrics to Amazon S3. These files are located in two S3 folders under the backtest-export folder:

  • accuracy-metrics-values – Provides item-level accuracy metric computations so you can understand the performance of a single time series. This allows you to investigate the spread rather than focusing on the global metrics alone.
  • forecasted-values – Provides step-level predictions for each time series in the backtest window. This enables you to compare the actual target value from a holdout test set to the predicted quantile values. Reviewing this helps formulate ideas on how to provide additional data features in RTS or item metadata to help better estimate future values, further reducing loss. You may download backtest-export files from Amazon S3 or query them in place with Athena.

S3 bucket contents

With your own data, you need to closely inspect the predictor outcomes and ensure the metrics meet your expected results by using the backtest export data. When satisfied, you can begin generating future-dated predictions as described in the next section.
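
One lightweight way to do that inspection is to pull the accuracy-metrics-values files into pandas. The following sketch is illustrative only: the bucket name and export prefix are placeholders, and you should confirm the metric column names from the file header before relying on them:

    import boto3
    import pandas as pd

    s3 = boto3.resource("s3")
    bucket = "my-forecast-bucket"                               # placeholder bucket name
    prefix = "aiml42/backtest-export/accuracy-metrics-values/"  # adjust to your actual export path

    # Download the first accuracy-metrics part file and inspect the per-item spread.
    obj = next(iter(s3.Bucket(bucket).objects.filter(Prefix=prefix)))
    s3.Bucket(bucket).download_file(obj.key, "accuracy-metrics.csv")

    metrics = pd.read_csv("accuracy-metrics.csv")
    print(metrics.columns.tolist())   # confirm the exact column names first
    print(metrics.describe())         # distribution of item-level accuracy metrics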

Generate a forecast (inference about future time horizons)

This section describes how to generate forecast data points with Forecast. Going forward, you should harvest new data from the source system, import the data into Forecast, and then generate forecast data points. Optionally, you may also insert a new predictor creation after import and before forecast. The following figure visualizes the process of creating production time series forecasts using Forecast.

Production time series forecast workflow

Complete the following steps:

  1. On the Step Functions console, locate the state machine containing Create-Forecast.
  2. On the state machine detail page, choose Start Execution.
  3. Choose Start execution again to confirm.
    This state machine finishes very quickly because the system isn’t configured to generate a forecast. It doesn’t know which predictor model you have approved for inference.
    Let’s configure the system to use your trained predictor.
  4. On the Forecast console, locate the ARN for your predictor.
  5. Copy the ARN to use in a later step.
    Predictor details
  6. In your browser, open another tab and navigate to the Systems Manager console.
  7. On the Systems Manager console, choose Parameter Store in the navigation pane.
  8. Locate the parameter related to your stack (/forecast/<StackName>/Forecast/PredictorArn).
  9. Enter the ARN you copied for your predictor.
    This is how you associate a trained predictor with the inference function of Forecast.
  10. Locate the parameter /forecast/<StackName>/Forecast/Generate and edit the value, replacing FALSE with TRUE.
    Now you’re ready to run a forecast job for this dataset group.
  11. On the Step Functions console, run the Create-Forecast state machine.

This time, the job runs as expected. As part of the state machine, the system creates a forecast and a ForecastExport job, which writes out time series predictions to Amazon S3. These files are located in the forecast folder.

Forecast folder contents

Inside the forecast folder, you will find predictions for your items, located in many CSV or Parquet files, depending on your selection. Each record contains the prediction for one time step of one time series, with all your chosen quantile values. You may download these files from Amazon S3, query them in place with Athena, or choose another strategy to use the data.

This wraps up the entire workflow. You can now visualize your output using any visualization tool of your choice, such as Amazon QuickSight. Alternatively, data scientists can use pandas to generate their own plots. If you choose to use QuickSight, you can connect your forecast results to QuickSight to perform data transformations, create one or more data analyses, and create visualizations.
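
For a quick look without a BI tool, you can load one of the exported files into pandas and plot a single item. This is a minimal sketch that assumes a CSV export downloaded locally with item_id, date, p10, p50, and p90 columns; confirm the headers against your own file, because they depend on the quantiles you chose:

    import pandas as pd
    import matplotlib.pyplot as plt

    # A forecast export part file downloaded from the forecast folder in S3 (assumed columns).
    forecasts = pd.read_csv("forecast-export-part0.csv", parse_dates=["date"])

    first_item = forecasts["item_id"].iloc[0]
    item = forecasts[forecasts["item_id"] == first_item]

    plt.plot(item["date"], item["p50"], label="p50")
    plt.fill_between(item["date"], item["p10"], item["p90"], alpha=0.3, label="p10-p90")
    plt.legend()
    plt.title(f"Forecast for item {first_item}")
    plt.show()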

This process provides a template to follow. You will need to adapt the sample to your schema and set the forecast horizon, time resolution, and so forth according to your use case. You will also need to set a recurring schedule to harvest data from the source system, import the data, and produce forecasts. If desired, you may insert a predictor task between the import and forecast steps.

Retrain the predictor

We have walked through the process of training a new predictor, but what about retraining a predictor? Retraining a predictor is one way to reduce the cost and time involved with training a predictor on the latest available data. Rather than create a new predictor and train it on the entire dataset, we can retrain the existing predictor by providing only the new incremental data made available since the predictor was last trained. Let’s walk through how to retrain a predictor using the automation solution:

  1. On the Forecast console, choose View dataset groups.
  2. Choose the dataset group associated with the predictor you want to retrain.
  3. Choose View predictors, then choose the predictor you want to retrain.
  4. On the Settings tab, copy the predictor ARN.
    We need to update a parameter used by the workflow to identify the predictor to retrain.
  5. On the Systems Manager console, choose Parameter Store in the navigation pane.
  6. Locate the parameter /forecast/<STACKNAME>/Forecast/Predictor/ReferenceArn.
  7. On the parameter detail page, choose Edit.
  8. For Value, enter the predictor ARN.
    This identifies the correct predictor for the workflow to retrain. Next, we need to update a parameter used by the workflow to change the training strategy.
  9. Locate the parameter /forecast/<STACKNAME>/Forecast/Predictor/Strategy.
  10. On the parameter detail page, choose Edit.
  11. For Value, enter RETRAIN.
    The workflow defaults to training a new predictor; however, we can modify that behavior to retrain an existing predictor or simply reuse an existing predictor without retraining by setting this value to NONE. You may want to forgo training if your data is relatively stable or you’re using automated predictor monitoring to decide when retraining is necessary.
  12. Upload the incremental training data to the S3 bucket.
  13. On the Step Functions console, locate the state machine <STACKNAME>-Create-Predictor.
  14. On the state machine detail page, choose Start execution to begin the retraining.

When the retraining is complete, the workflow will end and you will receive an SNS email notification to the email address provided in the workload template parameters.
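
The retraining steps can also be scripted. The following sketch updates the two Parameter Store values and then starts the Create-Predictor state machine; the stack name and predictor ARN are placeholders, and it assumes the parameters were created as String values:

    import boto3

    ssm = boto3.client("ssm")
    sfn = boto3.client("stepfunctions")

    stack = "<STACKNAME>"  # your workload stack name
    predictor_arn = "arn:aws:forecast:us-east-1:111122223333:predictor/example"  # predictor to retrain

    # Point the workflow at the existing predictor and switch the training strategy.
    ssm.put_parameter(Name=f"/forecast/{stack}/Forecast/Predictor/ReferenceArn",
                      Value=predictor_arn, Type="String", Overwrite=True)
    ssm.put_parameter(Name=f"/forecast/{stack}/Forecast/Predictor/Strategy",
                      Value="RETRAIN", Type="String", Overwrite=True)

    # Start the retraining run.
    machines = sfn.list_state_machines()["stateMachines"]
    arn = next(m["stateMachineArn"] for m in machines if f"{stack}-Create-Predictor" in m["name"])
    sfn.start_execution(stateMachineArn=arn)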

Clean up

When you’re done with this solution, follow the steps in this section to delete related resources.

Delete the S3 bucket

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Select the bucket where data was uploaded and choose Empty to delete all data associated with the solution, including source data.
  3. Enter permanently delete to delete the bucket contents permanently.
  4. On the Buckets page, select the bucket and choose Delete.
  5. Enter the name of the bucket to confirm the deletion and choose Delete bucket.

Delete Forecast resources

  1. On the Forecast console, choose View dataset groups.
  2. Select the dataset group name associated with the solution, then choose Delete.
  3. Enter delete to delete the dataset group and associated predictors, predictor backtest export jobs, forecasts, and forecast export jobs.
  4. Choose Delete to confirm.

Delete the CloudFormation stacks

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Select the workload stack and choose Delete.
  3. Choose Delete stack to confirm deletion of the stack and all associated resources.
  4. When the deletion is complete, select the dependencies stack and choose Delete.
  5. Choose Delete to confirm.

Conclusion

In this post, we discussed some different ways to get started using Forecast. We walked through an automated forecasting solution based on AWS CloudFormation for a rapid, repeatable deployment of a Forecast pipeline from data ingestion to inference, with little infrastructure knowledge required. Finally, we saw how the same workflow can retrain an existing predictor, reducing cost and training time.

There’s no better time than the present to start forecasting with Forecast. To start building and deploying an automated workflow, visit Amazon Forecast resources. Happy forecasting!


About the Authors

Aaron Fagan is a Principal Specialist Solutions Architect at AWS based in New York. He specializes in helping customers architect solutions in machine learning and cloud security.

Raju Patil is a Data Scientist in AWS Professional Services. He builds and deploys AI/ML solutions to assist AWS customers in overcoming their business challenges. His AWS engagements have covered a wide range of AI/ML use cases, such as computer vision, time series forecasting, and predictive analytics, across numerous industries, including financial services, telecom, and healthcare. Prior to this, he led data science teams in advertising technology and made significant contributions to numerous research and development initiatives in computer vision and robotics. Outside of work, he enjoys photography, hiking, travel, and culinary explorations.

Read More

Get started with generative AI on AWS using Amazon SageMaker JumpStart

Get started with generative AI on AWS using Amazon SageMaker JumpStart

Generative AI is gaining a lot of public attention at present, with talk around products such as GPT4, ChatGPT, DALL-E2, Bard, and many other AI technologies. Many customers have been asking for more information on AWS’s generative AI solutions. The aim of this post is to address those needs.

This post provides an overview of generative AI with a real customer use case, provides a concise description and outlines its benefits, references an easy-to-follow demo of AWS DeepComposer for creating new musical compositions, and outlines how to get started using Amazon SageMaker JumpStart for deploying GPT2, Stable Diffusion 2.0, and other generative AI models.

Generative AI overview

Generative AI is a specific field of artificial intelligence that focuses on generating new material. It’s one of the most exciting fields in the AI world, with the potential to transform existing businesses and allow completely new business ideas to come to market. You can use generative techniques for:

  • Creating new works of art using a model such as Stable Diffusion 2.0
  • Writing a best-selling book using a model such as GPT2, Bloom, or Flan-T5-XL
  • Composing your next symphony using the Transformers technique in AWS DeepComposer

AWS DeepComposer is an educational tool that helps you understand the key concepts associated with machine learning (ML) through the language of musical composition. To learn more, refer to Generate a jazz rock track using Generative Artificial Intelligence.

Stable Diffusion, GPT2, Bloom, and Flan-T5-XL are all ML models. They are simply mathematical algorithms that need to be trained to identify patterns within data. After the patterns are learned, a model is deployed onto an endpoint, ready for a process known as inference. New data that the model hasn’t seen is fed into the deployed model, and new creative material is produced.

For example, with image generation models such as Stable Diffusion, we can create stunning illustrations using a few words. With text generation models such as GPT2, Bloom, and Flan-T5-XL, we can generate new literary articles, and potentially books, from a simple human sentence.

Autodesk is an AWS customer using Amazon SageMaker to help their product designers sort through thousands of iterations of visual designs for various use cases and use ML to help choose the optimal design. Specifically, they have worked with Edera Safety to help develop a spinal cord protector that protects riders from accidents while participating in sporting events, such as mountain biking. For more information, check out the video AWS Machine Learning Enables Design Optimization.

To learn more about what AWS customers are doing with generative AI and fashion, refer to Virtual fashion styling with generative AI using Amazon SageMaker.

Now that we understand what generative AI is all about, let’s jump into a JumpStart demonstration to learn how to generate new text or images with AI.

Prerequisites

Amazon SageMaker Studio is the integrated development environment (IDE) within SageMaker that provides us with all the ML features that we need in a single pane of glass. Before we can run JumpStart, we need to set up Studio. You can skip this step if you already have your own version of Studio running.

The first thing we need to do before we can use any AWS services is to make sure we have signed up for and created an AWS account. Next is to create an administrative user and a group. For instructions on both steps, refer to Set Up Amazon SageMaker Prerequisites.

The next step is to create a SageMaker domain. A domain sets up all the storage and allows you to add users to access SageMaker. For more information, refer to Onboard to Amazon SageMaker Domain. This demo is created in the AWS Region us-east-1.

Finally, you launch Studio. For this post, we recommend launching a user profile app. For instructions, refer to Launch Amazon SageMaker Studio.

Choose a JumpStart solution

Now we come to the exciting part. You should now be logged in to Studio, and see a page similar to the following screenshot.

In the navigation pane, under SageMaker JumpStart, choose Models, notebooks, solutions.

You’re presented with a range of solutions, foundation models, and other artifacts that can help you get started with a specific model or a specific business problem or use case.

If you want to experiment in a particular area, you can use the search function. Or you can simply browse the artifacts to find the relevant model or business solution for your needs.

For example, if you’re interested in fraud detection solutions, enter fraud detection into the search bar.

Fraud Detection Screenshot

If you’re interested in text generation solutions, enter text generation into the search bar. A good place to start if you want to explore a range of text generation models is to select the Intro to JS – Text Generation notebook.

JS - Text Generation

Let’s dive into a specific demonstration of the GPT-2 model.

JumpStart GPT-2 model demo

GPT-2 is a language model that helps generate human-like text based on a given prompt. We can use this type of transformer model to create new sentences and help us automate writing. This can be used for content creation such as blogs, social media posts, and books.

The GPT-2 model is part of the Generative Pre-trained Transformer family and was the predecessor to GPT-3. At the time of writing, GPT-3 is used as the foundation for the OpenAI ChatGPT application.

To start exploring the GPT-2 model demo in JumpStart, complete the following steps:

  1. On JumpStart, search for and choose GPT 2.
  2. In the Deploy Model section, expand Deployment Configuration.
  3. For SageMaker hosting instance, choose your instance (for this post, we use ml.c5.2xlarge).

Different machine types have different price points attached. At the time of writing, the ml.c5.2xlarge that we selected incurs under $0.50 per hour. For the most up-to-date pricing, refer to Amazon SageMaker Pricing.

  1. For Endpoint name, enter demo-hf-textgeneration-gpt2.
  2. Choose Deploy.

Endpoint Name & Deploy

Wait for the ML endpoint to deploy (up to 15 minutes).

  1. When the endpoint is deployed, choose Open Notebook.

Endpoint Status

You’ll see a page similar to the following screenshot.
Python Code

The document we’re using to showcase our demonstration is a Jupyter notebook, which encompasses all the necessary Python code. Note that the code in this screenshot may be slightly different from the code you have, because AWS is constantly updating these notebooks and making sure they are secure, free of defects, and provide the best customer experience.

  1. Click into the first cell and press Ctrl+Enter to run the code block.

Code Block 1

An asterisk (*) appears to the left of the code block and then turns into a number. The asterisk indicates that the code is running and is complete when the number appears.

  1. In the next code block, enter some sample text, then press Ctrl+Enter.

Code Block 2

  1. Press Ctrl+Enter in the third code block to run it.

After about 30-60 seconds, you will see your inference results.

For the input text “Once upon a time there were 18 sandwiches,” we get the following generated text:

Once upon a time there were 18 sandwiches, four plates with some salad, and three sandwiches with some beef. One restaurant was so nice that the food was made by hand. There were people living at the beginning of the time who were waiting so that

For the input text “And for the final time Peter said to Mary,” we get the following generated text:

And for the final time Peter said to Mary that he was a saint.

11 But Peter said that it was not a blessing, but rather that it would be the death of Peter. And when Mary heard of that Peter said to him,

You can experiment with running this third code block multiple times, and you will notice that the model makes different predictions each time.
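
If you’d rather call the deployed endpoint from your own code than from the notebook, the following minimal sketch shows the pattern. The content type and response layout are assumptions based on the JumpStart text generation notebooks at the time of writing, so check the notebook you opened for the exact format:

    import json
    import boto3

    runtime = boto3.client("sagemaker-runtime")

    response = runtime.invoke_endpoint(
        EndpointName="demo-hf-textgeneration-gpt2",
        ContentType="application/x-text",  # assumed request format; confirm in the notebook
        Body="Once upon a time there were 18 sandwiches,".encode("utf-8"),
    )

    result = json.loads(response["Body"].read())
    print(result)   # inspect the payload; the generated text key can vary by model version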

To tailor the output using some of the advanced features, scroll down to experiment in the fourth code block.

To learn more about text generation models, refer to Run text generation with Bloom and GPT models on Amazon SageMaker JumpStart.

Clean up resources

Before we move on, don’t forget to delete your endpoint when you’re finished. On the previous tab, under Delete Endpoint, choose Delete.

Delete Endpoint

If you have accidentally closed this notebook, you can also delete your endpoint via the SageMaker console. Under Inference in the navigation pane, choose Endpoints.

Select the endpoint you used and on the Actions menu, choose Delete.

Delete Endpoint
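
If you prefer to clean up from code, a minimal boto3 sketch is shown below; deleting the endpoint stops the hourly instance charge (the endpoint configuration and model can be deleted separately if you no longer need them):

    import boto3

    sagemaker = boto3.client("sagemaker")

    # Stops instance charges for the hosted GPT-2 demo endpoint.
    sagemaker.delete_endpoint(EndpointName="demo-hf-textgeneration-gpt2")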

Now that we understand how to use our first JumpStart solution, let’s look at using a Stable Diffusion model.

JumpStart Stable Diffusion model demo

We can use the Stable Diffusion 2 model to generate images from a simple line of text. This can be used to generate content for things like social media posts, promotional material, album covers, or anything that requires creative artwork.

  1. Return to JumpStart, then search for and choose Stable Diffusion 2.

Stable Diffusion 2

  1. In the Deploy Model section, expand Deployment Configuration.
  2. For SageMaker hosting instance, choose your instance (for this post, we use ml.g5.2xlarge).
  3. For Endpoint name, enter demo-stabilityai-stable-diffusion-v2.
  4. Choose Deploy.

Because this is a larger model, it can take up to 25 minutes to deploy. When it’s ready, the endpoint status shows as In Service.

In Service

  1. Choose Open Notebook to open a Jupyter notebook with Python code.

Python Code

  1. Run the first and second code blocks.
  2. In the third code block, change the text prompt, then run the cell.

Code Block 1

Wait about 30–60 seconds for your image to appear. The following image is based on our example text.

Output Picture

Again, you can play with the advanced features in the next code block. The picture it creates is different every time.
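
The Stable Diffusion endpoint can be invoked from code in the same way as the text endpoint. The response handling below (a JSON body with the generated image returned as a nested RGB array) reflects the JumpStart notebook at the time of writing; treat it as an assumption and inspect the payload keys in your own notebook:

    import json
    import boto3
    import numpy as np
    import matplotlib.pyplot as plt

    runtime = boto3.client("sagemaker-runtime")

    response = runtime.invoke_endpoint(
        EndpointName="demo-stabilityai-stable-diffusion-v2",
        ContentType="application/x-text",  # assumed request format; confirm in the notebook
        Accept="application/json",
        Body="a watercolor painting of a lighthouse at sunrise".encode("utf-8"),
    )

    payload = json.loads(response["Body"].read())
    image = np.array(payload["generated_image"])   # assumed response key; check payload.keys()
    plt.imshow(image.astype(np.uint8))
    plt.axis("off")
    plt.show()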

Clean up resources

Again, don’t forget to delete your endpoint. This time, we’re using ml.g5.2xlarge, so it incurs slightly higher charges than before. At the time of writing, it was just over $1 per hour.

Finally, let’s move to AWS DeepComposer.

AWS DeepComposer

AWS DeepComposer is a great way to learn about generative AI. It allows you to use built-in melodies in your models to generate new forms of music. The model that you use determines how the input melody is transformed.

If you’re used to participating in AWS DeepRacer days to help your employees learn about reinforcement learning, consider augmenting and enhancing the day with AWS DeepComposer to learn about generative AI.

For a detailed explanation and easy-to-follow demonstration of three of the models in this post, refer to Generate a jazz rock track using Generative Artificial Intelligence.

Check out the following cool examples uploaded to SoundCloud using AWS DeepComposer.

We would love to see your experiments, so feel free to reach out via social media (@digitalcolmer) and share your learnings and experiments.

Conclusion

In this post, we talked about the definition of generative AI, illustrated by an AWS customer story. We then stepped you through how to get started with Studio and JumpStart, and showed you how to get started with GPT 2 and Stable Diffusion models. We wrapped up with a brief overview of AWS DeepComposer.

To explore JumpStart more, try using your own data to fine-tune an existing model. For more information, refer to Incremental training with Amazon SageMaker JumpStart. For information about fine-tuning Stable Diffusion models, refer to Fine-tune text-to-image Stable Diffusion models with Amazon SageMaker JumpStart.

To learn more about Stable Diffusion models, refer to Generate images from text with the stable diffusion model on Amazon SageMaker JumpStart.

We didn’t cover any information on the Flan-T5-XL model, so to learn more, refer to the following GitHub repo. The Amazon SageMaker Examples repo also includes a range of available notebooks on GitHub for the various SageMaker products, including JumpStart, covering a range of different use cases.

To learn more about AWS ML via a range of free digital assets, check out our AWS Machine Learning Ramp-Up Guide. You can also try our free ML Learning Plan to build on your current knowledge or have a clear starting point. To take an instructor-led course, we highly recommend the following courses:

It is truly an exciting time in the AI/ML space. AWS is here to support your ML journey, so please connect with us on social media. We look forward to seeing all your learning, experiments, and fun with the various ML services over the coming months and relish the opportunity to be your instructor on your ML journey.


About the Author

Paul Colmer is a Senior Technical Trainer at Amazon Web Services specializing in machine learning and generative AI. His passion is helping customers, partners, and employees develop and grow through compelling storytelling, shared experiences, and knowledge transfer. With over 25 years in the IT industry, he specializes in agile cultural practices and machine learning solutions. Paul is a Fellow of the London College of Music and Fellow of the British Computer Society.

Read More

Quickly build high-accuracy Generative AI applications on enterprise data using Amazon Kendra, LangChain, and large language models

Quickly build high-accuracy Generative AI applications on enterprise data using Amazon Kendra, LangChain, and large language models

Generative AI (GenAI) and large language models (LLMs), such as those available soon via Amazon Bedrock and Amazon Titan, are transforming the way developers and enterprises are able to solve traditionally complex challenges related to natural language processing and understanding. Some of the benefits offered by LLMs include the ability to create more capable and compelling conversational AI experiences for customer service applications, and improving employee productivity through more intuitive and accurate responses.

For these use cases, however, it’s critical for the GenAI applications implementing the conversational experiences to meet two key criteria: limit the responses to company data, thereby mitigating model hallucinations (incorrect statements), and filter responses according to the end-user content access permissions.

To restrict the GenAI application responses to company data only, we need to use a technique called Retrieval Augmented Generation (RAG). An application using the RAG approach retrieves information most relevant to the user’s request from the enterprise knowledge base or content, bundles it as context along with the user’s request as a prompt, and then sends it to the LLM to get a GenAI response. LLMs have limits on the maximum word count of the input prompt, so choosing the right passages among thousands or millions of documents in the enterprise has a direct impact on the LLM’s accuracy.
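
To make the RAG flow concrete, the following sketch retrieves the top passages from an Amazon Kendra index with boto3 and bundles them into a prompt. The index ID and the prompt wording are placeholders, and the final call to an LLM is left to whichever integration you choose:

    import boto3

    kendra = boto3.client("kendra")

    def build_rag_prompt(question, index_id, top_k=3):
        # Retrieve the passages most relevant to the user's request.
        results = kendra.query(IndexId=index_id, QueryText=question)
        excerpts = [
            item["DocumentExcerpt"]["Text"]
            for item in results["ResultItems"][:top_k]
            if "DocumentExcerpt" in item
        ]
        context = "\n\n".join(excerpts)
        # Bundle the retrieved context with the user's request as the LLM prompt.
        return (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
        )

    prompt = build_rag_prompt("What is Amazon Lex?", index_id="<YOUR-KENDRA-INDEX-ID>")
    # Send the prompt to the LLM of your choice, such as a SageMaker endpoint or a hosted API.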

In designing effective RAG, content retrieval is a critical step to ensure the LLM receives the most relevant and concise context from enterprise content to generate accurate responses. This is where the highly accurate, machine learning (ML)-powered intelligent search in Amazon Kendra plays an important role. Amazon Kendra is a fully managed service that provides out-of-the-box semantic search capabilities for state-of-the-art ranking of documents and passages. You can use the high-accuracy search in Amazon Kendra to source the most relevant content and documents to maximize the quality of your RAG payload, yielding better LLM responses than using conventional or keyword-based search solutions. Amazon Kendra offers easy-to-use deep learning search models that are pre-trained on 14 domains and don’t require any ML expertise, so there’s no need to deal with word embeddings, document chunking, and other lower-level complexities typically required for RAG implementations. Amazon Kendra also comes with pre-built connectors to popular data sources such as Amazon Simple Storage Service (Amazon S3), SharePoint, Confluence, and websites, and supports common document formats such as HTML, Word, PowerPoint, PDF, Excel, and pure text files. To filter responses based on only those documents that the end-user permissions allow, Amazon Kendra offers connectors with access control list (ACL) support. Amazon Kendra also offers AWS Identity and Access Management (IAM) and AWS IAM Identity Center (successor to AWS Single Sign-On) integration for user-group information syncing with customer identity providers such as Okta and Azure AD.

In this post, we demonstrate how to implement a RAG workflow by combining the capabilities of Amazon Kendra with LLMs to create state-of-the-art GenAI applications providing conversational experiences over your enterprise content. After Amazon Bedrock launches, we will publish a follow-up post showing how to implement similar GenAI applications using Amazon Bedrock, so stay tuned.

Solution overview

The following diagram shows the architecture of a GenAI application with a RAG approach.

We use an Amazon Kendra index to ingest enterprise unstructured data from data sources such as wiki pages, MS SharePoint sites, Atlassian Confluence, and document repositories such as Amazon S3. When a user interacts with the GenAI app, the flow is as follows:

  1. The user makes a request to the GenAI app.
  2. The app issues a search query to the Amazon Kendra index based on the user request.
  3. The index returns search results with excerpts of relevant documents from the ingested enterprise data.
  4. The app sends the user request along with the data retrieved from the index as context in the LLM prompt.
  5. The LLM returns a succinct response to the user request based on the retrieved data.
  6. The response from the LLM is sent back to the user.

With this architecture, you can choose the most suitable LLM for your use case. LLM options include our partners Hugging Face, AI21 Labs, Cohere, and others hosted on an Amazon SageMaker endpoint, as well as models by companies like Anthropic and OpenAI. With Amazon Bedrock, you will be able to choose Amazon Titan, Amazon’s own LLM, or partner LLMs such as those from AI21 Labs and Anthropic with APIs securely without the need for your data to leave the AWS ecosystem. The additional benefits that Amazon Bedrock will offer include a serverless architecture, a single API to call the supported LLMs, and a managed service to streamline the developer workflow.

For the best results, a GenAI app needs to engineer the prompt based on the user request and the specific LLM being used. Conversational AI apps also need to manage the chat history and the context. GenAI app developers can use open-source frameworks such as LangChain that provide modules to integrate with the LLM of choice, and orchestration tools for activities such as chat history management and prompt engineering. We have provided the KendraIndexRetriever class, which implements a LangChain retriever interface, which applications can use in conjunction with other LangChain interfaces such as chains to retrieve data from an Amazon Kendra index. We have also provided a few sample applications in the GitHub repo. You can deploy this solution in your AWS account using the step-by-step guide in this post.
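
The following condensed sketch shows the shape of what the sample applications do. The import path and constructor arguments for KendraIndexRetriever follow the sample GitHub repo at the time of writing and may differ in your version, and the OpenAI LLM is only one of the supported choices, so treat the whole snippet as illustrative rather than definitive:

    import os

    from langchain.chains import RetrievalQA
    from langchain.llms import OpenAI
    # Assumed import path; install the retriever interface from the sample GitHub repo first.
    from kendra_index_retriever import KendraIndexRetriever

    retriever = KendraIndexRetriever(
        kendraindex=os.environ["KENDRA_INDEX_ID"],   # assumed argument names from the sample repo
        awsregion=os.environ["AWS_REGION"],
        return_source_documents=True,
    )

    chain = RetrievalQA.from_chain_type(
        llm=OpenAI(temperature=0),   # or a SageMaker-hosted Flan-T5 endpoint
        chain_type="stuff",          # stuff the retrieved excerpts into the prompt as context
        retriever=retriever,
    )

    print(chain.run("What's SageMaker?"))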

Prerequisites

For this tutorial, you’ll need a bash terminal with Python 3.9 or higher installed on Linux, Mac, or Windows Subsystem for Linux, and an AWS account. We also recommend using an AWS Cloud9 instance or an Amazon Elastic Compute Cloud (Amazon EC2) instance.

Implement a RAG workflow

To configure your RAG workflow, complete the following steps:

  1. Use the provided AWS CloudFormation template to create a new Amazon Kendra index.

This template includes sample data containing AWS online documentation for Amazon Kendra, Amazon Lex, and Amazon SageMaker. Alternatively, if you have an Amazon Kendra index and have indexed your own dataset, you can use that. Launching the stack requires about 30 minutes, followed by about 15 minutes to synchronize it and ingest the data in the index. Therefore, wait for about 45 minutes after launching the stack. Note the index ID and AWS Region on the stack’s Outputs tab.

  1. For an improved GenAI experience, we recommend requesting an Amazon Kendra service quota increase for maximum DocumentExcerpt size, so that Amazon Kendra provides larger document excerpts to improve semantic context for the LLM.
  2. Install the AWS SDK for Python on the command line interface of your choice.
  3. If you want to use the sample web apps built using Streamlit, you first need to install Streamlit. This step is optional if you want to only run the command line versions of the sample applications.
  4. Install LangChain.
  5. The sample applications used in this tutorial require you to have access to one or more LLMs from Flan-T5-XL, Flan-T5-XXL, Anthropic Claude-V1, and OpenAI text-davinci-003.
    1. If you want to use Flan-T5-XL or Flan-T5-XXL, deploy them to an endpoint for inference using Amazon SageMaker Studio JumpStart.
    2. If you want to work with Anthropic Claude-V1 or OpenAI text-davinci-003, acquire the API keys for the LLMs of your interest from https://www.anthropic.com/ and https://openai.com/, respectively.
  6. Follow the instructions in the GitHub repo to install the KendraIndexRetriever interface and sample applications.
  7. Before you run the sample applications, you need to set environment variables with the Amazon Kendra index details and API keys of your preferred LLM or the SageMaker endpoints of your deployments for Flan-T5-XL or Flan-T5-XXL. The following is a sample script to set the environment variables:
    export AWS_REGION="<YOUR-AWS-REGION>"
    export KENDRA_INDEX_ID="<YOUR-KENDRA-INDEX-ID>"
    export FLAN_XL_ENDPOINT="<YOUR-SAGEMAKER-ENDPOINT-FOR-FLAN-T-XL>"
    export FLAN_XXL_ENDPOINT="<YOUR-SAGEMAKER-ENDPOINT-FOR-FLAN-T-XXL>"
    export OPENAI_API_KEY="<YOUR-OPEN-AI-API-KEY>"
    export ANTHROPIC_API_KEY="<YOUR-ANTHROPIC-API-KEY>"

  8. In a command line window, change to the samples subdirectory of where you have cloned the GitHub repository. You can run the command line apps as python <sample-file-name.py>. You can run the Streamlit web app by changing the directory to samples and running streamlit run app.py <anthropic|flanxl|flanxxl|openai>.
  9. Open the sample file kendra_retriever_flan_xxl.py in an editor of your choice.

Observe the statement result = run_chain(chain, "What’s SageMaker?"). This is the user query (“What’s SageMaker?”) that’s being run through the chain that uses Flan-T5-XXL as the LLM and Amazon Kendra as the retriever. When this file is run, you can observe the output as follows. The chain sent the user query to the Amazon Kendra index, retrieved the top three result excerpts, and sent them as the context in a prompt along with the query, to which the LLM responded with a succinct answer. It has also provided the sources (the URLs of the documents used in generating the answer).

~. python3 kendra_retriever_flan_xxl.py
Amazon SageMaker is a machine learning service that lets you train and deploy models in the cloud.
Sources:
https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-intro.html
https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects-whatis.html
https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html
  1. Now let’s run the web app app.py as streamlit run app.py flanxxl. For this specific run, we are using a Flan-T5-XXL model as the LLM.

It opens a browser window with the web interface. You can enter a query, which in this case is “What is Amazon Lex?” As seen in the following screenshot, the application responds with an answer, and the Sources section provides the URLs of the documents whose excerpts were retrieved from the Amazon Kendra index and sent to the LLM in the prompt as context along with the query.

  1. Now let’s run app.py again and get a feel for the conversational experience using streamlit run app.py anthropic. Here the underlying LLM used is Anthropic Claude-V1.

As you can see in the following video, the LLM provides a detailed answer to the user’s query based on the documents it retrieved from the Amazon Kendra index and then supports the answer with the URLs of the source documents that were used to generate the answer. Note that the subsequent queries don’t explicitly mention Amazon Kendra; however, the ConversationalRetrievalChain (a type of chain that’s part of the LangChain framework, provides an easy mechanism for developing conversational applications based on information retrieved from retriever instances, and is used in this LangChain application) manages the chat history and the context to get an appropriate response.

Also note that in the following screenshot, Amazon Kendra finds the extractive answer to the query and shortlists the top documents with excerpts. Then the LLM is able to generate a more succinct answer based on these retrieved excerpts.

In the following sections, we explore two use cases for using Generative AI with Amazon Kendra.

Use case 1: Generative AI for financial service companies

Financial organizations create and store data across various data repositories, including financial reports, legal documents, and whitepapers. They must adhere to strict government regulations and oversight, which means employees need to find relevant, accurate, and trustworthy information quickly. Additionally, searching and aggregating insights across various data sources is cumbersome and error prone. With Generative AI on AWS, users can quickly generate answers from various data sources and types, synthesizing accurate answers at enterprise scale.

We chose a solution using Amazon Kendra and AI21 Labs’ Jurassic-2 Jumbo Instruct LLM. With Amazon Kendra, you can easily ingest data from multiple data sources such as Amazon S3, websites, and ServiceNow. The solution then uses AI21 Labs’ Jurassic-2 Jumbo Instruct LLM to carry out inference activities on enterprise data such as data summarization, report generation, and more. Amazon Kendra augments LLMs to provide accurate and verifiable information to the end-users, which reduces hallucination issues with LLMs. With the proposed solution, financial analysts can make faster decisions using accurate data to quickly build detailed and comprehensive portfolios. We plan to make this solution available as an open-source project in the near future.

Example

Using the Kendra Chatbot solution, financial analysts and auditors can interact with their enterprise data (financial reports and agreements) to find reliable answers to audit-related questions. Kendra ChatBot provides answers along with source links and has the capability to summarize longer answers. The following screenshot shows an example conversation with Kendra ChatBot.

Architecture overview

The following diagram illustrates the solution architecture.

The workflow includes the following steps:

  1. Financial documents and agreements are stored on Amazon S3, and ingested to an Amazon Kendra index using the S3 data source connector.
  2. The LLM is hosted on a SageMaker endpoint.
  3. An Amazon Lex chatbot is used to interact with the user via the Amazon Lex web UI.
  4. The solution uses an AWS Lambda function with LangChain to orchestrate between Amazon Kendra, Amazon Lex, and the LLM.
  5. When users ask the Amazon Lex chatbot for answers from a financial document, Amazon Lex calls the LangChain orchestrator to fulfill the request.
  6. Based on the query, the LangChain orchestrator pulls the relevant financial records and paragraphs from Amazon Kendra.
  7. The LangChain orchestrator provides these relevant records to the LLM along with the query and relevant prompt to carry out the required activity.
  8. The LLM processes the request from the LangChain orchestrator and returns the result.
  9. The LangChain orchestrator gets the result from the LLM and sends it to the end-user through the Amazon Lex chatbot.

Use case 2: Generative AI for healthcare researchers and clinicians

Clinicians and researchers often analyze thousands of articles from medical journals or government health websites as part of their research. More importantly, they want trustworthy data sources they can use to validate and substantiate their findings. The process requires hours of intensive research, analysis, and data synthesis, lengthening the time to value and innovation. With Generative AI on AWS, you can connect to trusted data sources and run natural language queries to generate insights across these trusted data sources in seconds. You can also review the sources used to generate the response and validate its accuracy.

We chose a solution using Amazon Kendra and Flan-T5-XXL from Hugging Face. First, we use Amazon Kendra to identify text snippets from semantically relevant documents in the entire corpus. Then we use the power of an LLM such as Flan-T5-XXL to use the text snippets from Amazon Kendra as context and obtain a succinct natural language answer. In this approach, the Amazon Kendra index functions as the passage retriever component in the RAG mechanism. Lastly, we use Amazon Lex to power the front end, providing a seamless and responsive experience to end-users. We plan to make this solution available as an open-source project in the near future.

Example

The following screenshot is from a web UI built for the solution using the template available on GitHub. The text in pink are responses from the Amazon Kendra LLM system, and the text in blue are the user questions.

Architecture overview

The architecture and solution workflow for this solution are similar to that of use case 1.

Clean up

To save costs, delete all the resources you deployed as part of the tutorial. If you launched the CloudFormation stack, you can delete it via the AWS CloudFormation console. Similarly, you can delete any SageMaker endpoints you may have created via the SageMaker console.

Conclusion

Generative AI powered by large language models is changing how people acquire and apply insights from information. However, for enterprise use cases, the insights must be generated based on enterprise content to keep the answers in-domain and mitigate hallucinations, using the Retrieval Augmented Generation approach. In the RAG approach, the quality of the insights generated by the LLM depends on the semantic relevance of the retrieved information on which it is based, making it increasingly necessary to use solutions such as Amazon Kendra that provide high-accuracy semantic search results out of the box. With its comprehensive ecosystem of data source connectors, support for common file formats, and security, you can quickly start using Generative AI solutions for enterprise use cases with Amazon Kendra as the retrieval mechanism.

For more information on working with Generative AI on AWS, refer to Announcing New Tools for Building with Generative AI on AWS. You can start experimenting and building RAG proofs of concept (POCs) for your enterprise GenAI apps using the method outlined in this post. As mentioned earlier, once Amazon Bedrock is available, we will publish a follow-up post showing how you can build RAG using Amazon Bedrock.


About the authors

Abhinav Jawadekar is a Principal Solutions Architect focused on Amazon Kendra in the AI/ML language services team at AWS. Abhinav works with AWS customers and partners to help them build intelligent search solutions on AWS.

Jean-Pierre Dodel is the Principal Product Manager for Amazon Kendra and leads key strategic product capabilities and roadmap prioritization. He brings extensive enterprise search and ML/AI experience to the team, with prior leading roles at Autonomy, HP, and search startups before joining Amazon 7 years ago.

Mithil Shah is an ML/AI Specialist at AWS. Currently he helps public sector customers improve the lives of citizens by building machine learning solutions on AWS.

Firaz Akmal is a Sr. Product Manager for Amazon Kendra at AWS. He is a customer advocate, helping customers understand their search and generative AI use-cases with Kendra on AWS. Outside of work Firaz enjoys spending time in the mountains of the PNW or experiencing the world through his daughter’s perspective.

Abhishek Maligehalli Shivalingaiah is a Senior AI Services Solution Architect at AWS with a focus on Amazon Kendra. He is passionate about building applications using Amazon Kendra, generative AI, and NLP. He has around 10 years of experience in building data and AI solutions to create value for customers and enterprises. He has built a (personal) chatbot for fun to answer questions about his career and professional journey. Outside of work he enjoys making portraits of family and friends, and loves creating artwork.

Read More