Host ML models on Amazon SageMaker using Triton: Python backend

Amazon SageMaker provides a number of options for hosting machine learning (ML) models. Among these, one of the key capabilities SageMaker provides is real-time inference. Real-time inference workloads can have varying requirements and service level agreements (SLAs) in terms of latency and throughput. Regardless of the use case, SageMaker offers a number of options that allow you to find the right balance of cost and performance to meet your business objectives.

There are many factors to consider when choosing the right real-time inference option for your business. For example, your business may have a model that must meet the strictest SLAs for latency and throughput with very predictable performance. For that use case, SageMaker provides single model endpoints (SMEs), which allow you to deploy a single ML model against a logical endpoint. For other use cases, you can choose to manage cost and performance using SageMaker multi-model endpoints (MMEs), which allow you to specify multiple models to host behind a logical endpoint. Regardless of the option you choose, SageMaker endpoints provide a scalable mechanism for even the most demanding enterprise users while providing a rich set of features, including shadow variants, auto scaling, and native integration with Amazon CloudWatch (for more information, see CloudWatch Metrics for Multi-Model Endpoint Deployments).

One option supported by SageMaker single and multi-model endpoints is NVIDIA Triton Inference Server. Triton supports various backends as engines to support the running and serving of various ML models for inference. For any Triton deployment, it’s crucial to know how the backend behavior impacts your workloads and what to expect so that you can be successful. In this post, we help you understand the Python backend that is supported by Triton on SageMaker so that you can make an informed decision for your workloads and achieve great results.

SageMaker provides Triton via SMEs and MMEs

The Python backend is available through SageMaker, which enables you to deploy both single and multi-model endpoints with NVIDIA Triton Inference Server. SageMaker supports Triton on instance types with GPUs, CPUs, and AWS Inferentia chips, which allows you to maximize the performance for your workloads. The following diagram illustrates the NVIDIA Triton Inference Server architecture.

Triton Architecture

Inference requests arrive at the server via HTTP/REST or the C API and are then routed to the appropriate per-model scheduler. Triton implements multiple scheduling and batching algorithms that can be configured on a model-by-model basis and can help tune performance. Each model’s scheduler optionally performs batching of inference requests and then passes the requests to the backend corresponding to the model type. The framework backend performs inference using the inputs provided in the batched requests to produce the requested outputs. The outputs are then formatted and returned in the response. The model repository is an object-based repository for the models, backed by Amazon Simple Storage Service (Amazon S3), that Triton makes available for inferencing.

For MMEs, SageMaker takes care of traffic shaping to the endpoint and maintains optimal model copies on GPU instances for the best price performance. It continues to route traffic to the instance where the model is loaded. If the instance resources reach capacity due to high memory utilization, SageMaker unloads the least popular models from the container to free up resources to load more frequently used models. SageMaker MMEs offer capabilities for running multiple deep learning or ML models on the GPU at the same time with Triton Inference Server, which has been extended to implement the MME API contract. MMEs enable sharing GPU instances behind an endpoint across multiple models and dynamically load and unload models based on the incoming traffic. With this, you can easily achieve optimal price performance.

When a SageMaker MME receives an HTTP invocation request for a particular model using TargetModel in the request along with the payload, it routes traffic to the right instance behind the endpoint where the target model is loaded. SageMaker takes care of model management behind the endpoint. It dynamically downloads models from Amazon S3 to the instance’s storage volume if the invoked model isn’t available on the instance storage volume. Then SageMaker loads the model to the NVIDIA Triton container’s memory on a GPU accelerated instance and serves the inference request. The GPU core is shared by all the models in an instance. For more information about SageMaker MMEs on GPUs, see Run multiple deep learning models on GPU with Amazon SageMaker multi-model endpoints.

SageMaker MMEs can horizontally scale using an auto scaling policy and provision additional GPU compute instances based on specified metrics. When configuring auto scaling for SageMaker endpoints, consider SageMakerVariantInvocationsPerInstance as the primary criterion to determine the scaling characteristics of your auto scaling group. In addition, based on whether your models are running on GPU or CPU, you may also consider using CPUUtilization or GPUUtilization as additional criteria. Note that for single model endpoints, because the models deployed are all the same, it’s fairly straightforward to set proper policies to meet your SLAs. For multi-model endpoints, we recommend deploying similar models behind a given endpoint to get steadier, more predictable performance. In use cases where models of varying sizes and requirements are used, you may want to separate those workloads across multiple multi-model endpoints or spend some time fine-tuning your auto scaling policy to obtain the best cost and performance balance.
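
For example, a target tracking policy based on SageMakerVariantInvocationsPerInstance can be attached to an endpoint variant through the Application Auto Scaling API. The following is a minimal sketch; the endpoint and variant names, capacity limits, and target value are illustrative and should be tuned against your own SLAs.

import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/my-endpoint/variant/AllTraffic"  # placeholder names

# Register the variant's instance count as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Scale out when invocations per instance exceed the target value.
autoscaling.put_scaling_policy(
    PolicyName="invocations-per-instance",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 200.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)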

Python backend runtime architecture

As the name suggests, the Python backend is for running models that are written and run in the Python language. Various use cases fall into this category, such as preprocessing or postprocessing steps composing a model ensemble. In other cases, the Python backend may be used as a wrapper to call a Python-based model or framework. Later in this post, we show an example of how you can use the Python backend to call a PyTorch T5 model. This may not always be the most performant option, but it showcases the flexibility that the Python backend provides.

The Python backend creates a runtime environment that creates Python processes using the host’s CPU and memory. You can still attain GPU acceleration if it’s exposed by a Python front end of the framework running the inference. No additional GPU acceleration comes from using the Python backend itself, but it also imposes no compatibility constraints on the Python processes it runs.

On SageMaker, the Triton Python backend allocates 16 MB of shared memory by default and grows it in 1 MB increments. However, you can change this by setting the SageMaker environment variables SAGEMAKER_TRITON_SHM_DEFAULT_BYTE_SIZE and SAGEMAKER_TRITON_SHM_GROWTH_BYTE_SIZE. These variables are important because the Python backend exchanges tensors through this shared memory.
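
These variables are set as container environment variables when the SageMaker model is created. The following sketch uses placeholder values for the image URI and model data location, and the byte sizes are illustrative only; the container dictionary is later passed to create_model as the PrimaryContainer, as shown in the walkthrough.

# Shared memory settings are passed as container environment variables on the
# SageMaker model definition (byte sizes below are illustrative only).
container = {
    "Image": "<triton-image-uri>",          # the Triton image retrieved later in this post
    "ModelDataUrl": "<s3-model-data-url>",
    "Mode": "MultiModel",
    "Environment": {
        "SAGEMAKER_TRITON_SHM_DEFAULT_BYTE_SIZE": str(512 * 1024 * 1024),  # 512 MB
        "SAGEMAKER_TRITON_SHM_GROWTH_BYTE_SIZE": str(64 * 1024 * 1024),    # 64 MB
    },
}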

The following diagram shows the runtime architecture and the memory areas you can fine-tune, including the CPU-addressable shared memory used for inter-process communication between the Triton C++ process and the Python process to exchange input and output tensors.

Architecture Diagram

You can monitor resource utilization using CloudWatch, which has native integration with SageMaker.

To get started with the Python backend, you need to create a Python file with a structure similar to the following code, which shows the required class structure as well as how to work with parameters and return values. Note the point in the model lifecycle at which each method is called.

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """

    @staticmethod
    def auto_complete_config(auto_complete_model_config):
        """`auto_complete_config` is called only once when loading the model.
        Implementing this function is optional.

        Parameters
        ----------
        auto_complete_model_config : pb_utils.ModelConfig
          An object containing the existing model configuration. You can build
          upon the configuration given by this object when setting the
          properties for this model.

        Returns
        -------
        pb_utils.ModelConfig
          An object containing the auto-completed model configuration
        """
        return auto_complete_model_config

    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing `initialize` function is optional and allows you to
        do any initialization before execution. This function allows
        the model to initialize any state associated with the model.

        Parameters
        ----------
        args : dict
          Both keys and values are strings. The dictionary keys and values are:
          * model_config: A JSON string containing the model configuration
          * model_instance_kind: A string containing model instance kind
          * model_instance_device_id: A string containing model instance device
            ID
          * model_repository: Model repository path
          * model_version: Model version
          * model_name: Model name
        """

    def execute(self, requests):
        """`execute` must be implemented in every Python model. `execute`
        function receives a list of pb_utils.InferenceRequest as the only
        argument. This function is called when an inference is requested
        for this model.

        Parameters
        ----------
        requests : list
          A list of pb_utils.InferenceRequest

        Returns
        -------
        list
          A list of pb_utils.InferenceResponse. The length of this list must
          be the same as `requests`
        """

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` function is optional. This function allows
        the model to perform any necessary clean ups before exit.
        """

By using the model.py methods, you take on the responsibility of loading models on a specific device (CPU or GPU) by writing the code explicitly in the model.py file. Although other backends that Triton provides allow you to specify a KIND attribute in the config.pbtxt file to determine whether the backend runs on CPU or GPU, this isn’t applicable for the Python backend, because the model is loaded on the respective device by the code written in model.py, such as .to(device) in PyTorch. It’s important to note that if you explicitly load artifacts into memory or create temporary files, you should reclaim your resources by cleaning them up, which usually occurs in the finalize method. Otherwise, you may experience unwanted situations such as memory leaks.
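
As a minimal illustration (not the notebook’s actual code), the following sketch loads a TorchScript artifact onto the device reported by Triton in initialize and releases it in finalize. The artifact name model.pt and the use of torch.jit.load are assumptions for this example, and execute is omitted for brevity.

import os

import torch


class TritonPythonModel:
    def initialize(self, args):
        # The Python backend doesn't place the model for you; choose the device
        # explicitly based on the instance kind Triton reports.
        self.device = "cuda" if args["model_instance_kind"] == "GPU" else "cpu"
        model_dir = os.path.join(args["model_repository"], args["model_version"])
        # Hypothetical artifact name; adapt to your own packaging.
        self.model = torch.jit.load(os.path.join(model_dir, "model.pt"))
        self.model.to(self.device).eval()

    def finalize(self):
        # Drop the model reference and release cached GPU memory on unload.
        self.model = None
        if torch.cuda.is_available():
            torch.cuda.empty_cache()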

SageMaker notebook walkthrough

With the NVIDIA Triton container image on SageMaker, you can now use Triton’s Python backend, which allows you to write your model logic in Python. For example, you can use this backend to run preprocessing and postprocessing code written in Python, or run a PyTorch Python script directly (instead of first converting it to TorchScript and then using the PyTorch backend). The python_backend GitHub repo contains the documentation and source for the backend.

In this section, we walk you through the example notebook, which demonstrates how to use NVIDIA Triton Inference Server on an Amazon SageMaker MME with the GPU feature to deploy a T5 NLP model for translation.

Set up the environment

We begin by setting up the required environment. We install the dependencies required to package our model pipeline and run inferences using Triton Inference Server. We also define the AWS Identity and Access Management (IAM) role that gives SageMaker access to the model artifacts and the NVIDIA Triton Amazon Elastic Container Registry (Amazon ECR) image. You can use the following code example to retrieve the prebuilt Triton ECR image:

import boto3, json, sagemaker, time
from sagemaker import get_execution_role
import numpy as np
import os
 
os.environ["TOKENIZERS_PARALLELISM"] = "false"
 
# sagemaker variables
role = get_execution_role()
sm_client = boto3.client(service_name="sagemaker")
runtime_sm_client = boto3.client("sagemaker-runtime")
sagemaker_session = sagemaker.Session(boto_session=boto3.Session())
s3_client = boto3.client("s3")
bucket = sagemaker.Session().default_bucket()
prefix = "nlp-mme-gpu"
 
# account mapping for SageMaker MME Triton Image
account_id_map = {
    "us-east-1": "785573368785",
    "us-east-2": "007439368137",
    "us-west-1": "710691900526",
    "us-west-2": "301217895009",
    "eu-west-1": "802834080501",
    "eu-west-2": "205493899709",
    "eu-west-3": "254080097072",
    "eu-north-1": "601324751636",
    "eu-south-1": "966458181534",
    "eu-central-1": "746233611703",
    "ap-east-1": "110948597952",
    "ap-south-1": "763008648453",
    "ap-northeast-1": "941853720454",
    "ap-northeast-2": "151534178276",
    "ap-southeast-1": "324986816169",
    "ap-southeast-2": "355873309152",
    "cn-northwest-1": "474822919863",
    "cn-north-1": "472730292857",
    "sa-east-1": "756306329178",
    "ca-central-1": "464438896020",
    "me-south-1": "836785723513",
    "af-south-1": "774647643957",
}
 
region = boto3.Session().region_name
if region not in account_id_map:
    raise ValueError("UNSUPPORTED REGION")
 
base = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
mme_triton_image_uri = (
    "{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:23.02-py3".format(
        account_id=account_id_map[region], region=region, base=base
    )
)

Generate model artifacts

In this example, we host a pre-trained T5-small Hugging Face PyTorch model using Triton’s Python backend. Here we have the Python script model.py, which implements all the logic to initialize the T5 model and run inference for the translation task. There are three main functions in the script:

  • initialize – The initialize function is called one time when the model is being loaded. Implementing initialize is optional. initialize allows you to do any necessary initializations before running the model. This function allows the model to initialize any state associated with this model.
  • execute – The execute function is called whenever an inference request is made. Every Python model must implement the execute function. In the execute function, you’re given a list of InferenceRequest objects. There are two modes of implementing this function: default and decoupled mode. The default mode is the most common way to implement your model and requires the execute function to return exactly one response per request. The decoupled mode allows you to send multiple responses for a request or not send any responses for a request. The mode you choose should depend on your use case—that is, whether or not you want to return decoupled responses from this model. In this example notebook, we use the default mode (a simplified sketch follows this list).
  • finalize – Implementing finalize is optional. This function allows you to do any cleanup necessary before the model is unloaded from Triton Inference Server.
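
A simplified sketch of these functions for the T5 model is shown below. It is not the notebook’s exact code: the tensor names follow the config.pbtxt shown later, and the generation arguments are illustrative.

import numpy as np
import torch
import triton_python_backend_utils as pb_utils
from transformers import T5ForConditionalGeneration


class TritonPythonModel:
    def initialize(self, args):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = T5ForConditionalGeneration.from_pretrained("t5-small").to(self.device)

    def execute(self, requests):
        responses = []
        for request in requests:
            input_ids = pb_utils.get_input_tensor_by_name(request, "input_ids").as_numpy()
            attention_mask = pb_utils.get_input_tensor_by_name(request, "attention_mask").as_numpy()
            # Run generation for the translation task.
            output_ids = self.model.generate(
                input_ids=torch.as_tensor(input_ids, dtype=torch.long, device=self.device),
                attention_mask=torch.as_tensor(attention_mask, dtype=torch.long, device=self.device),
                max_new_tokens=64,  # illustrative value
            )
            output = pb_utils.Tensor("output", output_ids.cpu().numpy().astype(np.int32))
            # Default mode: exactly one response per request.
            responses.append(pb_utils.InferenceResponse(output_tensors=[output]))
        return responses

    def finalize(self):
        self.model = None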

Build the model repository

Using Triton on SageMaker requires us to first set up a model repository folder containing the models we want to serve. For each model, we need to create a model directory consisting of the model artifact and define the config.pbtxt file to specify the model configuration that Triton uses to load and serve the model. To learn more about the config settings, refer to Model Configuration. The model repository structure for the T5 model is as follows:

Directory structure

Note that Triton has specific requirements for the model repository layout. Within the top-level model repository directory, each model has its own subdirectory containing the information for the corresponding model. Each model directory in Triton must have at least one numeric subdirectory representing a version of the model. Here, that is 1, representing version 1 of our T5 PyTorch model. Each model is run by a specific backend, so each version subdirectory must contain the model artifact required by that backend. Here, we are using the Python backend, and it requires the Python file that is used for serving (model.py). If we were using a PyTorch backend, a model.pt file would be required. For more details on naming conventions for model files, refer to Model Files.
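
Based on the description above (with the Conda environment TAR file added in a later step), the layout looks like the following:

model_repository/
└── t5_pytorch/
    ├── config.pbtxt
    ├── hf_env.tar.gz
    └── 1/
        └── model.py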

Every Python Triton model must provide a config.pbtxt file describing the model configuration. To use this backend, you must set the backend field of your model config.pbtxt file to python. The following code shows how to define the config file for the T5 PyTorch model being served through Triton’s Python backend:

name: "t5_pytorch"
backend: "python"
max_batch_size: 8
input: [
    {
        name: "input_ids"
        data_type: TYPE_INT32
        dims: [ -1 ]
    },
    {
        name: "attention_mask"
        data_type: TYPE_INT32
        dims: [ -1 ]
    }
]
output [
  {
    name: "output"
    data_type: TYPE_INT32
    dims: [ -1 ]
  }
]
instance_group {
  count: 1
  kind: KIND_GPU
}
dynamic_batching {
}
parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "$$TRITON_MODEL_DIRECTORY/hf_env.tar.gz"}
}

In this configuration, we define the parameters section to provide a custom execution environment path. Serving the Hugging Face T5 PyTorch model through Triton’s Python backend requires PyTorch and Hugging Face transformers as dependencies, so we create a custom execution environment for the Python backend that includes all of them. The alternative is to install Python and all the dependencies in the local environment; the custom execution environment is only needed if you want portability across different systems that might not have the required Python environment to run the inference, which is why we use it in this example. Currently, the Python backend only supports conda-pack for this purpose. conda-pack ensures that your Conda environment is portable. We follow the instructions from the Triton documentation for packaging dependencies to be used in the Python backend as the Conda environment TAR file. Running the bash script create_hf_env.sh creates the Conda environment containing PyTorch and Hugging Face transformers, packages it as a TAR file, and then we move it into the t5_pytorch model directory:

!bash workspace/create_hf_env.sh
!mv hf_env.tar.gz model_repository/t5_pytorch/

After we create the TAR file from the Conda environment, we place it in the model folder. The following code in the model config.pbtxt file tells the Python backend to use this custom environment for your model:

parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "$$TRITON_MODEL_DIRECTORY/hf_env.tar.gz"}
}

Here, $$TRITON_MODEL_DIRECTORY helps provide the environment path relative to the model folder in the model repository, and is resolved to $pwd/model_repository/t5_pytorch. Finally, hf_env.tar.gz is the name we gave to our Conda environment file.

Next, we package our model as *.tar.gz files for uploading to Amazon S3:

!tar -C model_repository/ -czf t5_pytorch.tar.gz t5_pytorch
model_uri_t5_pytorch = sagemaker_session.upload_data(path="t5_pytorch.tar.gz", key_prefix=prefix)

Create a SageMaker endpoint

Now that we have uploaded the model artifacts to Amazon S3, we can create a SageMaker multi-model endpoint. To create a SageMaker endpoint, we need to first create the SageMaker model object and endpoint configuration.

First, we need to define the serving container. In the container definition, define the ModelDataUrl to specify the S3 directory that contains all the models that the SageMaker multi-model endpoint will use to load and serve predictions. Set Mode to MultiModel to indicate that SageMaker should create the endpoint with MME container specifications. See the following code:

container = {
    "Image": mme_triton_image_uri,
    "ModelDataUrl": model_data_url,
    "Mode": "MultiModel",
}

Then we create the SageMaker model object using the create_model boto3 API by specifying the ModelName and container definition:

create_model_response = sm_client.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

We use this model to create an endpoint configuration where we can specify the type and number of instances we want in the endpoint. Here we are deploying to a g5.2xlarge NVIDIA GPU instance:

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": "ml.g5.2xlarge",
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

We use this configuration to create a new SageMaker endpoint and wait for the deployment to finish:

endpoint_name = f"{prefix}-ep-{ts}-2xl"
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

The status will change to InService after the deployment is successful.
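
You can poll for this status with a simple loop such as the following (boto3 also provides an endpoint_in_service waiter that achieves the same thing):

import time

# Poll until the endpoint leaves the Creating state.
while True:
    status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
    print(f"Endpoint status: {status}")
    if status != "Creating":
        break
    time.sleep(30)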

Invoke your model hosted on the SageMaker endpoint

After the endpoint is running, we can use some sample raw data to perform inference using JSON as the payload format. For the inference request format, Triton uses the KFServing community standard inference protocols. We can send inference requests to the multi-model endpoint using the invoke_endpoint API. We specify the TargetModel in the invocation call and pass in the payload for each model type. See the following code:

texts_to_translate = ["translate English to German: The house is wonderful."]
batch_size = len(texts_to_translate)

t5_payload = get_text_payload("t5-small", texts_to_translate)
response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/octet-stream",
    Body=json.dumps(t5_payload),
    TargetModel="t5_pytorch.tar.gz",
)
response_body = json.loads(response["Body"].read().decode("utf8"))
output_ids = np.array(response_body["outputs"][0]["data"]).reshape(batch_size, -1)
t5_tokenizer = get_tokenizer("t5-small")
decoded_outputs = t5_tokenizer.batch_decode(output_ids, skip_special_tokens=True)
for text in decoded_outputs:
    print(text, "\n")
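
The get_text_payload and get_tokenizer helpers come from the notebook and aren’t shown here. A rough sketch of how they might be implemented, assuming the KServe v2 JSON request format that Triton expects and the Hugging Face tokenizer, looks like the following:

from transformers import T5Tokenizer


def get_tokenizer(model_name):
    return T5Tokenizer.from_pretrained(model_name)


def get_text_payload(model_name, texts):
    # Tokenize the batch and build a Triton/KServe v2 inference request body.
    encoding = get_tokenizer(model_name)(texts, padding=True, return_tensors="np")
    payload = {"inputs": []}
    for name in ("input_ids", "attention_mask"):
        array = encoding[name].astype("int32")
        payload["inputs"].append(
            {
                "name": name,
                "shape": list(array.shape),
                "datatype": "INT32",
                "data": array.flatten().tolist(),
            }
        )
    return payload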

The notebook can be found in the GitHub repository.

Best practices

When using the Python backend, it can sometimes be complicated to optimize the workload for throughput and latency. You should consider the options available through the SageMaker and Triton environment variables discussed previously, regarding batch sizes, max delay, and other factors. In addition, you should be aware of the Python backend-specific configuration and the configuration of the underlying framework. The following are some best practices:

  • If you use a PyTorch (or other deep learning framework) module in the Python backend, consider experimenting with different values for the intra-op and inter-op thread pool sizes, as shown in the sketch after this list. Because each Python backend model instance runs in a separate process, limiting the number of threads per process prevents oversubscribing the system resources when scaling up the instance count.
  • Even though the Python backend is highly flexible, it performs some extra data copies that can impact inference performance. For the best performance on GPU, consider using Triton’s TensorRT backend when possible.
  • When using Python backend models in an ensemble, refer to Interoperability and GPU Support for a possible zero-copy transfer of Python backend tensors to other frameworks.
  • You can also increase the count value in the instance_group section of the config.pbtxt file to add worker processes and increase throughput. Be aware that increasing this value increases resource consumption, including CPU and GPU utilization.
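
For the first point in this list, a PyTorch-based model.py can cap its thread pools early in initialize; the values of 1 below are just a starting point to experiment with:

import torch

# Call these early in TritonPythonModel.initialize, before any inference work,
# so each Python backend process uses a bounded number of CPU threads.
torch.set_num_threads(1)          # intra-op parallelism
torch.set_num_interop_threads(1)  # inter-op parallelism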

You can explore these options and parameters to get the desired performance characteristics you seek. As always, be aware that resources such as processor or memory consumption can change and should be monitored so you can fine-tune and optimize inference performance.

Conclusion

In this post, we dove deep into the Python backend that Triton Inference Server supports on SageMaker. This backend runs models that are written in Python and can target either CPU or GPU. There are many options to consider to get the best inference performance, such as batch sizes, data input formats, and other factors that can be tuned to meet your needs. SageMaker allows you to use single model endpoints for guaranteed performance and multi-model endpoints to get a better balance of performance and cost savings. To get started with MME support for GPU, see Supported algorithms, frameworks, and instances.

We invite you to try Triton Inference Server containers in SageMaker, and share your feedback and questions in the comments.


About the Authors

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences,  and staying up to date with the latest technology trends.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including NLP and computer vision domains. He helps customers achieve high-performance model inference on Amazon SageMaker.

Melanie Li is a Senior AI/ML Specialist TAM at AWS based in Sydney, Australia. She helps enterprise customers to build solutions leveraging the state-of-the-art AI/ML tools on AWS and provides guidance on architecting and implementing machine learning solutions with best practices. In her spare time, she loves to explore nature outdoors and spend time with family and friends.

Jiahong Liu is a Solution Architect on the Cloud Service Provider team at NVIDIA. He assists clients in adopting machine learning and AI solutions that leverage NVIDIA accelerated computing to address their training and inference challenges. In his leisure time, he enjoys origami, DIY projects, and playing basketball.

Kshitiz Gupta is a Solutions Architect at NVIDIA. He enjoys educating cloud customers about the GPU AI technologies NVIDIA has to offer and assisting them with accelerating their machine learning and deep learning applications. Outside of work, he enjoys running, hiking and wildlife watching.


Securing MLflow in AWS: Fine-grained access control with AWS native services

With Amazon SageMaker, you can manage the whole end-to-end machine learning (ML) lifecycle. It offers many native capabilities to help manage aspects of ML workflows, such as experiment tracking and model governance via the model registry. This post provides a solution tailored to customers that are already using MLflow, an open-source platform for managing ML workflows.

In a previous post, we discussed MLflow and how it can run on AWS and be integrated with SageMaker—in particular, when tracking training jobs as experiments and deploying a model registered in MLflow to the SageMaker managed infrastructure. However, the open-source version of MLflow doesn’t provide native user access control mechanisms for multiple tenants on the tracking server. This means any user with access to the server has admin rights and can modify experiments, model versions, and stages. This can be a challenge for enterprises in regulated industries that need to keep strong model governance for audit purposes.

In this post, we address these limitations by implementing the access control outside of the MLflow server and offloading authentication and authorization tasks to Amazon API Gateway, where we implement fine-grained access control mechanisms at the resource level using Identity and Access Management (IAM). By doing so, we can achieve robust and secure access to the MLflow server from both SageMaker managed infrastructure and Amazon SageMaker Studio, without having to worry about credentials and all the complexity behind credential management. The modular design proposed in this architecture makes modifying access control logic straightforward without impacting the MLflow server itself. Lastly, thanks to SageMaker Studio extensibility, we further improve the data scientist experience by making MLflow accessible within Studio, as shown in the following screenshot.

MLflow in Studio

MLflow’s Python SDK now supports request signing using AWS credentials in the upstream repository, improving the integration with SageMaker. The changes to the MLflow Python SDK are available to everyone starting with MLflow version 1.30.0.

At a high level, this post demonstrates the following:

  • How to deploy an MLflow server on a serverless architecture running on a private subnet not accessible directly from the outside. For this task, we build on top of the following GitHub repo: Manage your machine learning lifecycle with MLflow and Amazon SageMaker.
  • How to expose the MLflow server via private integrations to an API Gateway, and implement a secure access control for programmatic access via the SDK and browser access via the MLflow UI.
  • How to log experiments and runs, and register models to an MLflow server from SageMaker using the associated SageMaker execution roles to authenticate and authorize requests, and how to authenticate via Amazon Cognito to the MLflow UI. We provide examples demonstrating experiment tracking and using the model registry with MLflow from SageMaker training jobs and Studio, respectively, in the provided notebook.
  • How to use MLflow as a centralized repository in a multi-account setup.
  • How to extend Studio to enhance the user experience by rendering MLflow within Studio. For this task, we show how to take advantage of Studio extensibility by installing a JupyterLab extension.

Now let’s dive deeper into the details.

Solution overview

You can think about MLflow as three different core components working side by side:

  • A REST API for the backend MLflow tracking server
  • SDKs for you to programmatically interact with the MLflow tracking server APIs from your model training code
  • A React front end for the MLflow UI to visualize your experiments, runs, and artifacts

At a high level, the architecture we have envisioned and implemented is shown in the following figure.

Architecture

Prerequisites

Before deploying the solution, make sure you have access to an AWS account with admin permissions.

Deploy the solution infrastructure

To deploy the solution described in this post, follow the detailed instructions in the GitHub repository README. To automate the infrastructure deployment, we use the AWS Cloud Development Kit (AWS CDK). The AWS CDK is an open-source software development framework to create AWS CloudFormation stacks through automatic CloudFormation template generation. A stack is a collection of AWS resources that can be programmatically updated, moved, or deleted. AWS CDK constructs are the building blocks of AWS CDK applications, representing the blueprint to define cloud architectures.

We combine four stacks:

  • The MLFlowVPCStack stack performs the following actions:
    • Creates the dedicated VPC with private subnets and deploys the MLflow tracking server container on AWS Fargate (Amazon ECS).
    • Provisions the MLflow storage layer: an Aurora Serverless database for the backend store and an Amazon S3 bucket for the artifact store.
  • The RestApiGatewayStack stack performs the following actions:
    • Exposes the MLflow server via AWS PrivateLink to a REST API Gateway.
    • Deploys an Amazon Cognito user pool to manage the users accessing the UI (still empty after the deployment).
    • Deploys an AWS Lambda authorizer to verify the JWT token with the Amazon Cognito user pool ID keys and returns IAM policies to allow or deny a request. This authorization strategy is applied to <MLFlow-Tracking-Server-URI>/*.
    • Adds an IAM authorizer. This is applied to the <MLFlow-Tracking-Server-URI>/api/* resources, which takes precedence over the previous one.
  • The AmplifyMLFlowStack stack performs the following action:
    • Creates an app linked to the patched MLflow repository in AWS CodeCommit to build and deploy the MLflow UI.
  • The SageMakerStudioUserStack stack performs the following actions:
    • Deploys a Studio domain (if one doesn’t exist yet).
    • Adds three users, each one with a different SageMaker execution role implementing a different access level:
      • mlflow-admin – Has admin-like permission to any MLflow resources.
      • mlflow-reader – Has read-only admin permissions to any MLflow resources.
      • mlflow-model-approver – Has the same permissions as mlflow-reader, plus can register new models from existing runs in MLflow and promote existing registered models to new stages.

Deploy the MLflow tracking server on a serverless architecture

Our aim is to have a reliable, highly available, cost-effective, and secure deployment of the MLflow tracking server. Serverless technologies are the perfect candidate to satisfy all these requirements with minimal operational overhead. To achieve that, we build a Docker container image for the MLflow experiment tracking server, and we run it on AWS Fargate on Amazon ECS in its dedicated VPC running on a private subnet. MLflow relies on two storage components: the backend store and the artifact store. For the backend store, we use Aurora Serverless, and for the artifact store, we use Amazon S3. For the high-level architecture, refer to Scenario 4: MLflow with remote Tracking Server, backend and artifact stores. Extensive details on how to do this task can be found in the following GitHub repo: Manage your machine learning lifecycle with MLflow and Amazon SageMaker.

Secure MLflow via API Gateway

At this point, we still don’t have an access control mechanism in place. As a first step, we expose MLflow to the outside world using AWS PrivateLink, which establishes a private connection between the VPC and other AWS services, in our case API Gateway. Incoming requests to MLflow are then proxied via a REST API Gateway, giving us the possibility to implement several mechanisms to authorize incoming requests. For our purposes, we focus on only two:

  • Using IAM authorizers – With IAM authorizers, the requester must have the right IAM policy assigned to access the API Gateway resources. Every HTTP request must carry authentication information in the form of an AWS Signature Version 4 signature.
  • Using Lambda authorizers – This offers the greatest flexibility because it leaves full control over how a request can be authorized. Eventually, the Lambda authorizer must return an IAM policy, which in turn will be evaluated by API Gateway on whether the request should be allowed or denied.

For the full list of supported authentication and authorization mechanisms in API Gateway, refer to Controlling and managing access to a REST API in API Gateway.

MLflow Python SDK authentication (IAM authorizer)

The MLflow experiment tracking server implements a REST API to interact in a programmatic way with the resources and artifacts. The MLflow Python SDK provides a convenient way to log metrics, runs, and artifacts, and it interfaces with the API resources hosted under the namespace <MLflow-Tracking-Server-URI>/api/. We configure API Gateway to use the IAM authorizer for resource access control on this namespace, thereby requiring every request to be signed with AWS Signature Version 4.

To facilitate the request signing process, starting from MLflow 1.30.0, this capability can be seamlessly enabled. Make sure that the requests_auth_aws_sigv4 library is installed in the system and set the MLFLOW_TRACKING_AWS_SIGV4 environment variable to True. More information can be found in the official MLflow documentation.

At this point, the MLflow SDK only needs AWS credentials. Because requests_auth_aws_sigv4 uses Boto3 to retrieve credentials, we know that it can load credentials from the instance metadata when an IAM role is associated with an Amazon Elastic Compute Cloud (Amazon EC2) instance (for other ways to supply credentials to Boto3, see Credentials). This means that it can also load AWS credentials when running from a SageMaker managed instance from the associated execution role, as discussed later in this post.
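
For example, from any environment whose IAM role is allowed to invoke the API, enabling signing and logging a run looks like the following; the tracking URI is a placeholder for your API Gateway stage URL, and the parameter and metric values are illustrative:

import os

import mlflow

# Requires mlflow>=1.30.0 and the requests_auth_aws_sigv4 package.
os.environ["MLFLOW_TRACKING_AWS_SIGV4"] = "True"

mlflow.set_tracking_uri("https://<api-id>.execute-api.<region>.amazonaws.com/<stage>")
mlflow.set_experiment("demo-experiment")

with mlflow.start_run():
    mlflow.log_param("alpha", 0.5)
    mlflow.log_metric("rmse", 0.12)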

Configure IAM policies to access MLflow APIs via API Gateway

You can use IAM roles and policies to control who can invoke resources on API Gateway. For more details and IAM policy reference statements, refer to Control access for invoking an API.

The following code shows an example IAM policy that grants the caller permissions to all methods on all resources on the API Gateway shielding MLflow, practically giving admin access to the MLflow server:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "execute-api:Invoke",
      "Resource": "arn:aws:execute-api:<REGION>:<ACCOUNT_ID>:<MLFLOW_API_ID>/<STAGE>/*/*",
      "Effect": "Allow"
    }
  ]
}

If we want a policy that allows a user read-only access to all resources, the IAM policy would look like the following code:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "execute-api:Invoke",
      "Resource": [
        "arn:aws:execute-api:<REGION>:<ACCOUNT_ID>:<MLFLOW_API_ID>/<STAGE>/GET/*",
        "arn:aws:execute-api:<REGION>:<ACCOUNT_ID>:<MLFLOW_API_ID>/<STAGE>/POST/api/2.0/mlflow/runs/search/",
        "arn:aws:execute-api:<REGION>:<ACCOUNT_ID>:<MLFLOW_API_ID>/<STAGE>/POST/api/2.0/mlflow/experiments/search"
      ],
      "Effect": "Allow"
    }
  ]
}

Another example might be a policy to give specific users permissions to register models to the model registry and promote them later to specific stages (staging, production, and so on):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "execute-api:Invoke",
      "Resource": [
        "arn:aws:execute-api:<REGION>:<ACCOUNT_ID>:<MLFLOW_API_ID>/<STAGE>/GET/*",
        "arn:aws:execute-api:<REGION>:<ACCOUNT_ID>:<MLFLOW_API_ID>/<STAGE>/POST/api/2.0/mlflow/runs/search/",
        "arn:aws:execute-api:<REGION>:<ACCOUNT_ID>:<MLFLOW_API_ID>/<STAGE>/POST/api/2.0/mlflow/experiments/search",
        "arn:aws:execute-api:<REGION>:<ACCOUNT_ID>:<MLFLOW_API_ID>/<STAGE>/POST/api/2.0/mlflow/model-versions/*",
        "arn:aws:execute-api:<REGION>:<ACCOUNT_ID>:<MLFLOW_API_ID>/<STAGE>/POST/api/2.0/mlflow/registered-models/*"
      ],
      "Effect": "Allow"
    }
  ]
}

MLflow UI authentication (Lambda authorizer)

Browser access to the MLflow server is handled by the MLflow UI implemented with React. The MLflow UI hasn’t been designed to support authenticated users. Implementing a robust login flow might appear a daunting task, but luckily we can rely on the Amplify UI React components for authentication, which greatly reduces the effort to create a login flow in a React application, using Amazon Cognito for the identities store.

Amazon Cognito allows us to manage our own user base and also support third-party identity federation, making it feasible to build, for example, ADFS federation (see Building ADFS Federation for your Web App using Amazon Cognito User Pools for more details). Tokens issued by Amazon Cognito must be verified on API Gateway. Simply verifying the token is not enough for fine-grained access control, therefore the Lambda authorizer allows us the flexibility to implement the logic we need. We can then build our own Lambda authorizer to verify the JWT token and generate the IAM policies to let the API Gateway deny or allow the request. The following diagram illustrates the MLflow login flow.

MLflow UI auth steps

For more information about the actual code changes, refer to the patch file cognito.patch, applicable to MLflow version 2.3.1.

This patch introduces two capabilities:

  • Add the Amplify UI components and configure the Amazon Cognito details via environment variables that implement the login flow
  • Extract the JWT from the session and attach it as a bearer token in the Authorization header of the requests sent to the MLflow server

Although maintaining diverging code from the upstream always adds more complexity than relying on the upstream, it’s worth noting that the changes are minimal because we rely on the Amplify React UI components.

With the new login flow in place, let’s create the production build for our updated MLflow UI. AWS Amplify Hosting is an AWS service that provides a git-based workflow for CI/CD and hosting of web apps. The build step in the pipeline is defined by the buildspec.yaml, where we can inject as environment variables details about the Amazon Cognito user pool ID, the Amazon Cognito identity pool ID, and the user pool client ID needed by the Amplify UI React component to configure the authentication flow. The following code is an example of the buildspec.yaml file:

version: "1.0"
applications:
  - frontend:
      phases:
        preBuild:
          commands:
          - fallocate -l 4G /swapfile
          - chmod 600 /swapfile
          - mkswap /swapfile
          - swapon /swapfile
          - swapon -s
          - yarn install
        build:
          commands:
          - echo "REACT_APP_REGION=$REACT_APP_REGION" >> .env
          - echo "REACT_APP_COGNITO_USER_POOL_ID=$REACT_APP_COGNITO_USER_POOL_ID" >> .env
          - echo "REACT_APP_COGNITO_IDENTITY_POOL_ID=$REACT_APP_COGNITO_IDENTITY_POOL_ID" >> .env
          - echo "REACT_APP_COGNITO_USER_POOL_CLIENT_ID=$REACT_APP_COGNITO_USER_POOL_CLIENT_ID" >> .env
          - yarn run build
        artifacts:
          baseDirectory: build
          files:
          - "**/*"

Securely log experiments and runs using the SageMaker execution role

One of the key aspects of the solution discussed here is the secure integration with SageMaker. SageMaker is a managed service, and as such, it performs operations on your behalf. What SageMaker is allowed to do is defined by the IAM policies attached to the execution role that you associate to a SageMaker training job, or that you associate to a user profile working from Studio. For more information on the SageMaker execution role, refer to SageMaker Roles.

By configuring the API Gateway to use IAM authentication on the <MLFlow-Tracking-Server-URI>/api/* resources, we can define a set of IAM policies on the SageMaker execution role that will allow SageMaker to interact with MLflow according to the access level specified.

When setting the MLFLOW_TRACKING_AWS_SIGV4 environment variable to True while working in Studio or in a SageMaker training job, the MLflow Python SDK will automatically sign all requests, which will be validated by the API Gateway:

os.environ['MLFLOW_TRACKING_AWS_SIGV4'] = "True"
mlflow.set_tracking_uri(tracking_uri)
mlflow.set_experiment(experiment_name)

Test the SageMaker execution role with the MLflow SDK

If you access the Studio domain that was generated, you will find three users:

  • mlflow-admin – Associated to an execution role with similar permissions as the user in the Amazon Cognito group admins
  • mlflow-reader – Associated to an execution role with similar permissions as the user in the Amazon Cognito group readers
  • mlflow-model-approver – Associated to an execution role with similar permissions as the user in the Amazon Cognito group model-approvers

To test the three different roles, refer to the labs provided as part of this sample on each user profile.

The following diagram illustrates the workflow for Studio user profiles and SageMaker job authentication with MLflow.

SageMaker logging to MLflow

Similarly, when running SageMaker jobs on the SageMaker managed infrastructure, if you set the environment variable MLFLOW_TRACKING_AWS_SIGV4 to True, and the SageMaker execution role passed to the jobs has the correct IAM policy to access the API Gateway, you can securely interact with your MLflow tracking server without needing to manage the credentials yourself. When running SageMaker training jobs and initializing an estimator class, you can pass environment variables that SageMaker will inject and make available to the training script, as shown in the following code:

environment={
  "AWS_DEFAULT_REGION": region,
  "MLFLOW_EXPERIMENT_NAME": experiment_name,
  "MLFLOW_TRACKING_URI": tracking_uri,
  "MLFLOW_AMPLIFY_UI_URI": mlflow_amplify_ui,
  "MLFLOW_TRACKING_AWS_SIGV4": "true",
  "MLFLOW_USER": user
}

estimator = SKLearn(
  entry_point='train.py',
  source_dir='source_dir',
  role=role,
  metric_definitions=metric_definitions,
  hyperparameters=hyperparameters,
  instance_count=1,
  instance_type='ml.m5.large',
  framework_version='1.0-1',
  base_job_name='mlflow',
  environment=environment
)

Visualize runs and experiments from the MLflow UI

After the first deployment is complete, let’s populate the Amazon Cognito user pool with three users, each belonging to a different group, to test the permissions we have implemented. You can use this script add_users_and_groups.py to seed the user pool. After running the script, if you check the Amazon Cognito user pool on the Amazon Cognito console, you should see the three users created.

Cognito users

On the REST API Gateway side, the Lambda authorizer will first verify the signature of the token using the Amazon Cognito user pool key and verify the claims. Only after that will it extract the Amazon Cognito group the user belongs to from the claim in the JWT token (cognito:groups) and apply different permissions based on the group that we have programmed.

For our specific case, we have three groups:

  • admins – Can see and can edit everything
  • readers – Can only see everything
  • model-approvers – The same as readers, plus can register models, create versions, and promote model versions to the next stage

Depending on the group, the Lambda authorizer will generate different IAM policies. This is just an example of how authorization can be achieved; with a Lambda authorizer, you can implement any logic you need. We have opted to build the IAM policy at run time in the Lambda function itself; however, you can pregenerate appropriate IAM policies, store them in Amazon DynamoDB, and retrieve them at run time according to your own business logic. However, if you want to restrict only a subset of actions, you need to be aware of the MLflow REST API definition.
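
A heavily simplified sketch of such an authorizer is shown below. It assumes the JWT has already been validated against the user pool keys (the validate_jwt helper is hypothetical and omitted here), and the group-to-permission mapping is illustrative rather than the repository’s actual code; it only shows how group claims can be mapped to an IAM policy returned to API Gateway.

READ_ONLY_METHODS = [
    "GET/*",
    "POST/api/2.0/mlflow/runs/search/",
    "POST/api/2.0/mlflow/experiments/search",
]

GROUP_PERMISSIONS = {
    "admins": ["*/*"],
    "readers": READ_ONLY_METHODS,
    "model-approvers": READ_ONLY_METHODS
    + [
        "POST/api/2.0/mlflow/model-versions/*",
        "POST/api/2.0/mlflow/registered-models/*",
    ],
}


def lambda_handler(event, context):
    # validate_jwt() is a hypothetical helper that verifies the token signature
    # and claims against the Amazon Cognito user pool keys (e.g., with a JWT library).
    claims = validate_jwt(event["authorizationToken"])
    groups = claims.get("cognito:groups", [])

    # event["methodArn"] looks like arn:aws:execute-api:<region>:<acct>:<api-id>/<stage>/<VERB>/<path>
    api_arn_prefix = event["methodArn"].split("/", maxsplit=1)[0]

    allowed = []
    for group in groups:
        for method in GROUP_PERMISSIONS.get(group, []):
            allowed.append(f"{api_arn_prefix}/*/{method}")

    return {
        "principalId": claims.get("sub", "unknown"),
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": "execute-api:Invoke",
                    "Effect": "Allow" if allowed else "Deny",
                    "Resource": allowed or [event["methodArn"]],
                }
            ],
        },
    }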

You can explore the code for the Lambda authorizer on the GitHub repo.

Multi-account considerations

Data science workflows have to pass multiple stages as they progress from experimentation to production. A common approach involves separate accounts dedicated to different phases of the AI/ML workflow (experimentation, development, and production). However, sometimes it’s desirable to have a dedicated account that acts as central repository for models. Although our architecture and sample refer to a single account, it can be easily extended to implement this last scenario, thanks to the IAM capability to switch roles even across accounts.

The following diagram illustrates an architecture using MLflow as a central repository in an isolated AWS account.

MLflow sagemaker multi account

For this use case, we have two accounts: one for the MLflow server, and one for the experimentation accessible by the data science team. To enable cross-account access from a SageMaker training job running in the data science account, we need the following elements:

  • A SageMaker execution role in the data science AWS account with an IAM policy attached that allows assuming a different role in the MLflow account:
{
  "Version": "2012-10-17",
  "Statement": {
    "Effect": "Allow",
    "Action": "sts:AssumeRole",
    "Resource": "<ARN-ROLE-IN-MLFLOW-ACCOUNT>"
  }
}
  • An IAM role in the MLflow account with the right IAM policy attached that grants access to the MLflow tracking server, and allows the SageMaker execution role in the data science account to assume it:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "<ARN-SAGEMAKER-EXECUTION-ROLE-IN-DATASCIENCE-ACCOUNT>"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Within the training script running in the data science account, you can use this example before initializing the MLflow client. You need to assume the role in the MLflow account and store the temporary credentials as environment variables, because this new set of credentials will be picked up by a new Boto3 session initialized within the MLflow client.

import boto3

# Session using the SageMaker Execution Role in the Data Science Account
session = boto3.Session()
sts = session.client("sts")

response = sts.assume_role(
  RoleArn="<ARN-ROLE-IN-MLFLOW-ACCOUNT>",
  RoleSessionName="AssumedMLflowAdmin"
)

credentials = response['Credentials']
os.environ['AWS_ACCESS_KEY_ID'] = credentials['AccessKeyId']
os.environ['AWS_SECRET_ACCESS_KEY'] = credentials['SecretAccessKey']
os.environ['AWS_SESSION_TOKEN'] = credentials['SessionToken']

# set remote mlflow server and initialize a new boto3 session in the context
# of the assumed role
mlflow.set_tracking_uri(tracking_uri)
experiment = mlflow.set_experiment(experiment_name)

In this example, RoleArn is the ARN of the role you want to assume, and RoleSessionName is a name that you choose for the assumed session. The sts.assume_role method returns temporary security credentials that the MLflow client will use to initialize a new Boto3 session for the assumed role. The MLflow client then sends signed requests to API Gateway in the context of the assumed role.

Render MLflow within SageMaker Studio

SageMaker Studio is based on JupyterLab, and just as in JupyterLab, you can install extensions to boost your productivity. Thanks to this flexibility, data scientists working with MLflow and SageMaker can further improve their integration by accessing the MLflow UI from the Studio environment and immediately visualizing the experiments and runs logged. The following screenshot shows an example of MLflow rendered in Studio.

MLflow Iframe in Studio

For information about installing JupyterLab extensions in Studio, refer to Amazon SageMaker Studio and SageMaker Notebook Instance now come with JupyterLab 3 notebooks to boost developer productivity. For details on adding automation via lifecycle configurations, refer to Customize Amazon SageMaker Studio using Lifecycle Configurations.

In the sample repository supporting this post, we provide instructions on how to install the jupyterlab-iframe extension. After the extension has been installed, you can access the MLflow UI without leaving Studio using the same set of credentials you have stored in the Amazon Cognito user pool.

Next steps

There are several options for expanding upon this work. One idea is to consolidate the identity store for both SageMaker Studio and the MLflow UI. Another option would be to utilize a third-party identity federation service with Amazon Cognito, and then utilize AWS IAM Identity Center (successor to AWS Single Sign-On) to grant access to Studio using the same third-party identity. A third option is to introduce full automation using Amazon SageMaker Pipelines for the CI/CD part of the model building, and to use MLflow as a centralized experiment tracking server and model registry with strong governance capabilities, along with automation to deploy approved models to a SageMaker hosting endpoint.

Conclusion

The aim of this post was to provide enterprise-level access control for MLflow. To achieve this, we separated the authentication and authorization processes from the MLflow server and transferred them to API Gateway. We utilized two authorization methods offered by API Gateway, IAM authorizers and Lambda authorizers, to cater to the requirements of both the MLflow Python SDK and the MLflow UI. It’s important to understand that users are external to MLflow; therefore, consistent governance requires maintaining the IAM policies, especially in the case of very granular permissions. Finally, we demonstrated how to enhance the experience of data scientists by integrating MLflow into Studio through simple extensions.

Try out the solution on your own by accessing the GitHub repo and let us know if you have any questions in the comments!


About the Authors

Paolo Di Francesco is a Senior Solutions Architect at Amazon Web Services (AWS). He holds a PhD in Telecommunication Engineering and has experience in software engineering. He is passionate about machine learning and is currently focusing on using his experience to help customers reach their goals on AWS, in particular in discussions around MLOps. Outside of work, he enjoys playing football and reading.

Chris Fregly is a Principal Specialist Solution Architect for AI and machine learning at Amazon Web Services (AWS) based in San Francisco, California. He is co-author of the O’Reilly Book, “Data Science on AWS.” Chris is also the Founder of many global meetups focused on Apache Spark, TensorFlow, Ray, and KubeFlow. He regularly speaks at AI and machine learning conferences across the world including O’Reilly AI, Open Data Science Conference, and Big Data Spain.

Irshad Buchh is a Principal Solutions Architect at Amazon Web Services (AWS). Irshad works with large AWS Global ISV and SI partners and helps them build their cloud strategy and broad adoption of Amazon’s cloud computing platform. Irshad interacts with CIOs, CTOs and their Architects and helps them and their end customers implement their cloud vision. Irshad owns the strategic and technical engagements and ultimate success around specific implementation projects, and developing a deep expertise in the Amazon Web Services technologies as well as broad know-how around how applications and services are constructed using the Amazon Web Services platform.

Host ML models on Amazon SageMaker using Triton: TensorRT models

Sometimes it can be very beneficial to use tools such as compilers that can modify and compile your models for optimal inference performance. In this post, we explore TensorRT and how to use it with Amazon SageMaker inference using NVIDIA Triton Inference Server. We explore how TensorRT works and how to host and optimize these models for performance and cost efficiency on SageMaker. SageMaker provides single model endpoints (SMEs), which allow you to deploy a single ML model, or multi-model endpoints (MMEs), which allow you to specify multiple models to host behind a logical endpoint for higher resource utilization.

To serve models, Triton supports various backends as engines for running and serving different types of ML models for inference. For any Triton deployment, it’s crucial to know how the backend behavior impacts your workloads and what to expect so that you can be successful. In this post, we help you understand the TensorRT backend that is supported by Triton on SageMaker so that you can make an informed decision for your workloads and get great results.

Deep dive into the TensorRT backend

TensorRT enables you to optimize inference using techniques such as quantization, layer and tensor fusion, kernel tuning, and others on NVIDIA GPUs. By adopting and compiling models to use TensorRT, you can optimize performance and utilization for your inference workloads. In some cases, there are trade-offs, which is typical of techniques such as quantization, but the results can be dramatic in benefiting performance, addressing latency and the number of transactions that can be processed.

The TensorRT backend is used to run TensorRT models. TensorRT is an SDK developed by NVIDIA that provides a high-performance deep learning inference library. It’s optimized for NVIDIA GPUs and provides a way to accelerate deep learning inference in production environments. TensorRT supports major deep learning frameworks and includes a high-performance deep learning inference optimizer and runtime that delivers low latency, high-throughput inference for AI applications.

TensorRT is able to accelerate model performance by using a technique called graph optimization to optimize the computation graph generated by a deep learning model. It optimizes the graph to minimize the memory footprint by freeing unnecessary memory and efficiently reusing it. TensorRT compilation fuses the sparse operations inside the model graph to form a larger kernel to avoid the overhead of multiple small kernel launches. With kernel auto-tuning, the engine selects the best algorithm for the target GPU, maximizing hardware utilization. Additionally, TensorRT employs CUDA streams to enable parallel processing of models, further improving GPU utilization and performance. Finally, through quantization, TensorRT can use mixed-precision acceleration of Tensor cores, enabling the model to run in FP32, TF32, FP16, and INT8 precision for the best inference performance. However, although the reduced precision can generally improve the latency performance, it might come with possible instability and degradation in model accuracy. Overall, TensorRT’s combination of techniques results in faster inference and lower latency compared to other inference engines.

The TensorRT backend for Triton Inference Server is designed to take advantage of the powerful inference capabilities of NVIDIA GPUs. To use TensorRT as a backend for Triton Inference Server, you need to create a TensorRT engine from your trained model using the TensorRT API. This engine is then loaded into Triton Inference Server and used to perform inference on incoming requests. The following are the basic steps to use TensorRT as a backend for Triton Inference Server:

  1. Convert your trained model to the ONNX format. Triton Inference Server supports ONNX as a model format. ONNX is a standard for representing deep learning models, enabling them to be transferred between frameworks. If your model isn’t already in the ONNX format, you need to convert it using the appropriate framework-specific tool. For example, in PyTorch, this can be done using the torch.onnx.export method.
  2. Import the ONNX model into TensorRT and generate the TensorRT engine. There are several ways to build a TensorRT engine from your ONNX model. For this post, we use the trtexec CLI tool. trtexec is a tool to quickly use TensorRT without having to develop your own application. The trtexec tool has three main purposes:
    1. Benchmarking networks on random or user-provided input data.
    2. Generating serialized engines from models.
    3. Generating a serialized timing cache from the builder.
  3. Load the TensorRT engine in Triton Inference Server. After the TensorRT engine is generated, it can be loaded into Triton Inference Server by creating a model configuration file. The model configuration (config.pbtxt) file should include the path to the TensorRT engine file and the input and output shapes of the model.

Each model in a model repository must include a model configuration that provides required and optional information about the model. Typically, this configuration is provided in a config.pbtxt file specified as ModelConfig protobuf. There are several key points to note in this configuration file:

  • name – This field defines the model’s name and must be unique within the model repository.
  • platform – This field defines the type of the model: TensorRT engine, PyTorch, or something else.
  • max_batch_size – This specifies the maximum batch size that can be passed to this model. If the model’s batch dimension is the first dimension, and all inputs and outputs to the model have this batch dimension, then Triton can use its dynamic batcher or sequence batcher to automatically use batching with the model. In this case, max_batch_size should be set to a value greater than or equal to 1, which indicates the maximum batch size that Triton should use with the model. For models that don’t support batching, or don’t support batching in the specific ways we’ve described, max_batch_size must be set to 0.
  • Input and output – These fields are required because NVIDIA Triton needs metadata about the model. Essentially, it requires the names of your network’s input and output layers and the shape of said inputs and outputs.
  • instance_group – This determines how many instances of this model will be created and whether they will use the GPU or CPU.
  • dynamic_batching – Dynamic batching is a feature of Triton that allows inference requests to be combined by the server, so that a batch is created dynamically. The preferred_batch_size property indicates the batch sizes that the dynamic batcher should attempt to create. For most models, preferred_batch_size should not be specified, as described in Recommended Configuration Process. An exception is TensorRT models that specify multiple optimization profiles for different batch sizes. In this case, because some optimization profiles may give a significant performance improvement compared to others, it may make sense to use preferred_batch_size for the batch sizes supported by those higher-performance optimization profiles. You can also reference the batch size that was previously used when running trtexec. In addition, you can configure a delay time so that requests wait for a limited time in the scheduler, allowing other requests to join the dynamic batch (see the example stanza after this list).
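
As an illustration, the following dynamic_batching stanza combines a preferred batch size with a queue delay. The max_queue_delay_microseconds field is part of Triton’s standard model configuration; the values shown are placeholders to tune for your workload:

dynamic_batching {
  preferred_batch_size: 16
  max_queue_delay_microseconds: 100
}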

The TensorRT backend has been improved to deliver significantly better performance. Improvements include reducing thread contention, using pinned memory for faster transfers between CPU and GPU, and increasing compute and memory copy overlap on GPUs. It also reduces memory usage of TensorRT models in many cases by sharing weights across multiple model instances. Overall, the TensorRT backend for Triton Inference Server provides a powerful and flexible way to serve deep learning models with optimized TensorRT inference. By adjusting the configuration options, you can optimize performance and control behavior to suit your specific use case.

SageMaker provides Triton via SMEs and MMEs

SageMaker enables you to deploy both single and multi-model endpoints with Triton Inference Server. Triton supports a heterogeneous cluster with both GPUs and CPUs, which helps standardize inference across platforms and dynamically scales out to any CPU or GPU to handle peak loads. The following diagram illustrates the Triton Inference Server architecture. Inference requests arrive at the server via either HTTP/REST or by the C API, and are then routed to the appropriate per-model scheduler. Triton implements multiple scheduling and batching algorithms that can be configured on a model-by-model basis. Each model’s scheduler optionally performs batching of inference requests and then passes the requests to the backend corresponding to the model type. The framework backend performs inferencing using the inputs provided in the batched requests to produce the requested outputs. The outputs are then formatted and returned in the response. The model repository is a file system-based repository of the models that Triton will make available for inferencing.

Triton architecture

SageMaker takes care of traffic shaping to the MME endpoint and maintains optimal model copies on GPU instances for best price performance. It continues to route traffic to the instance where the model is loaded. If the instance resources reach capacity due to high utilization, SageMaker unloads the least-used models from the container to free up resources to load more frequently used models. SageMaker MMEs offer capabilities for running multiple deep learning or ML models on the GPU, at the same time, with Triton Inference Server, which has been extended to implement the MME API contract. MMEs enable sharing GPU instances behind an endpoint across multiple models, and dynamically load and unload models based on the incoming traffic. With this, you can easily achieve optimal price performance.

When a SageMaker MME receives an HTTP invocation request for a particular model using TargetModel in the request along with the payload, it routes traffic to the right instance behind the endpoint where the target model is loaded. SageMaker takes care of model management behind the endpoint. It dynamically downloads models from Amazon Simple Storage Service (Amazon S3) to the instance’s storage volume if the invoked model isn’t available on the instance storage volume. Then SageMaker loads the model to the NVIDIA Triton container’s memory on a GPU-accelerated instance and serves the inference request. The GPU core is shared by all the models in an instance. For more information about SageMaker MMEs on GPU, see Run multiple deep learning models on GPU with Amazon SageMaker multi-model endpoints.

SageMaker MMEs can horizontally scale using an auto scaling policy and provision additional GPU compute instances based on specified metrics. When configuring your auto scaling groups for SageMaker endpoints, you may want to consider SageMakerVariantInvocationsPerInstance as the primary criterion for determining the scaling characteristics of your auto scaling groups. In addition, based on whether your models are running on GPU or CPU, you may also consider using CPUUtilization or GPUUtilization as additional criteria. For single model endpoints, because the models deployed are all the same, it’s fairly straightforward to set proper policies to meet your SLAs. For multi-model endpoints, we recommend deploying similar models behind a given endpoint to have more steady, predictable performance. In use cases where models of varying sizes and requirements are used, you might want to separate those workloads across multiple multi-model endpoints or spend some time fine-tuning your auto scaling group policy to obtain the best cost and performance balance.
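
As a rough sketch, the following shows how you might register a target-tracking policy on the endpoint’s production variant with Application Auto Scaling. The endpoint and variant names, capacity bounds, and target value are placeholders to adapt to your own SLAs (the endpoint itself is created later in this walkthrough):

import boto3

autoscaling = boto3.client("application-autoscaling")

# SageMaker endpoint variants are identified by this resource ID format
resource_id = f"endpoint/{endpoint_name}/variant/AllTraffic"

# Register the variant as a scalable target (1 to 4 instances in this sketch)
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Track invocations per instance; the target value is workload specific
autoscaling.put_scaling_policy(
    PolicyName="triton-mme-invocations-policy",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)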

Solution overview

With the NVIDIA Triton container image on SageMaker, you can now use Triton’s TensorRT backend, which allows you to deploy TensorRT models. The TensorRT_backend repo contains the documentation and source for the backend. In the following sections, we walk you through the example notebook that demonstrates how to use NVIDIA Triton Inference Server on SageMaker MMEs with the GPU feature to deploy a BERT natural language processing (NLP) model.

Set up the environment

We begin by setting up the required environment. We install the dependencies required to package our model pipeline and run inferences using Triton Inference Server. We also define the AWS Identity and Access Management (IAM) role that gives SageMaker access to the model artifacts and the NVIDIA Triton Amazon Elastic Container Registry (Amazon ECR) image. You can use the following code example to retrieve the pre-built Triton ECR image:

import transformers
import boto3, json, sagemaker, time
from sagemaker import get_execution_role
sess = boto3.Session()
sm = sess.client("sagemaker")
sagemaker_session = sagemaker.Session(boto_session=sess)
role = get_execution_role()
client = boto3.client("sagemaker-runtime")
bucket = sagemaker_session.default_bucket()
print(bucket)

account_id_map = {
"us-east-1": "785573368785",
"us-east-2": "007439368137",
"us-west-1": "710691900526",
"us-west-2": "301217895009",
"eu-west-1": "802834080501",
"eu-west-2": "205493899709",
"eu-west-3": "254080097072",
"eu-north-1": "601324751636",
"eu-south-1": "966458181534",
"eu-central-1": "746233611703",
"ap-east-1": "110948597952",
"ap-south-1": "763008648453",
"ap-northeast-1": "941853720454",
"ap-northeast-2": "151534178276",
"ap-southeast-1": "324986816169",
"ap-southeast-2": "355873309152",
"cn-northwest-1": "474822919863",
"cn-north-1": "472730292857",
"sa-east-1": "756306329178",
"ca-central-1": "464438896020",
"me-south-1": "836785723513",
"af-south-1": "774647643957",
}

region = boto3.Session().region_name
if region not in account_id_map.keys():
    raise ValueError("UNSUPPORTED REGION")
    
base = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
triton_image_uri = "{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:23.02-py3".format(
    account_id=account_id_map[region], region=region, base=base
)

Add utility methods for preparing the request payload

We create the functions to transform the sample text we’re using for inference into the payload that can be sent for inference to Triton Inference Server. The tritonclient package, which was installed at the beginning, provides utility methods to generate the payload without having to know the details of the specification. We use the created methods to convert our inference request into a binary format, which provides lower latencies for inference. These functions are used during the inference step.
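
These helpers are defined in the example notebook; as a rough sketch, generating a binary+JSON request body with tritonclient might look like the following (the helper name is illustrative, and the input names and shapes match the BERT model configuration shown later in this post):

import numpy as np
import tritonclient.http as httpclient

def get_text_payload_binary(input_ids, attention_mask, max_seq_len=128):
    # Describe the two INT32 inputs expected by the BERT TensorRT model
    inputs = [
        httpclient.InferInput("token_ids", [1, max_seq_len], "INT32"),
        httpclient.InferInput("attn_mask", [1, max_seq_len], "INT32"),
    ]
    inputs[0].set_data_from_numpy(np.array(input_ids, dtype=np.int32).reshape(1, max_seq_len), binary_data=True)
    inputs[1].set_data_from_numpy(np.array(attention_mask, dtype=np.int32).reshape(1, max_seq_len), binary_data=True)

    # Request both outputs in binary form for lower serialization overhead
    outputs = [
        httpclient.InferRequestedOutput("output", binary_data=True),
        httpclient.InferRequestedOutput("pooled_output", binary_data=True),
    ]

    request_body, header_length = httpclient.InferenceServerClient.generate_request_body(
        inputs, outputs=outputs
    )
    return request_body, header_length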

Prepare the TensorRT model

In this step, we load the pre-trained BERT model and convert it to an ONNX representation using the torch ONNX exporter and the onnx_exporter.py script. After the ONNX model is created, we use the TensorRT trtexec command to create the model plan to be hosted with Triton. This is run as part of the generate_models.sh script from the following cell. Note that the cell takes around 30 minutes to complete.

!docker run --gpus=all --rm -it \
    -v `pwd`/workspace:/workspace nvcr.io/nvidia/pytorch:23.02-py3 \
    /bin/bash generate_models.sh

While waiting for the command to finish running, you can check the scripts used in this step. In the onnx_exporter.py script, we use the torch.onnx.export function for ONNX model creation:


    torch.onnx.export(
        model,
        dummy_inputs,
        args.save,
        export_params=True,
        opset_version=10,
        input_names=["token_ids", "attn_mask"],
        output_names=["output","pooled_output"],
        dynamic_axes={"token_ids": [0, 1], "attn_mask": [0, 1], "output": [0]},
    )

The command line in the generate_models.sh file creates the TensorRT model plan. For more information, refer to the trtexec command-line tool.

trtexec --onnx=model.onnx --saveEngine=model_bs16.plan --minShapes=token_ids:1x128,attn_mask:1x128 --optShapes=token_ids:16x128,attn_mask:16x128 --maxShapes=token_ids:128x128,attn_mask:128x128 --fp16 --verbose --workspace=14000 | tee conversion_bs16_dy.txt

Build a TensorRT NLP BERT model repository

Using Triton on SageMaker requires us to first set up a model repository folder containing the models we want to serve. For each model, we need to create a model directory consisting of the model artifact and define the config.pbtxt file to specify the model configuration that Triton uses to load and serve the model. To learn more about the config settings, refer to Model Configuration. The model repository structure for the BERT model is as follows:

Folder structure for model

Note that Triton has specific requirements for model repository layout. Within the top-level model repository directory, each model has its own subdirectory containing the information for the corresponding model. Each model directory in Triton must have at least one numeric subdirectory representing a version of the model. Here, the folder 1 represents version 1 of the BERT model. Each model is run by a specific backend, so within each version subdirectory there must be the model artifacts required by that backend. Here, we are using the TensorRT backend, which requires the TensorRT plan file that is used for serving (for this example, model.plan). If we were using a PyTorch backend, a model.pt file would be required. For more details on naming conventions for model files, refer to Model Files.
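
For reference, the on-disk layout for this example looks roughly like the following, using the model name from the configuration shown next (in the notebook, the repository and model directories also carry numeric suffixes):

model_repo/
    bert/
        config.pbtxt
        1/
            model.plan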

Every TensorRT model must provide a config.pbtxt file describing the model configuration. In order to use this backend, you must set the backend field of your model config.pbtxt file to tensorrt_plan. The following section of code shows an example of how to define the configuration file for the BERT model being served through Triton’s TensorRT backend:

name: "bert"
platform: "tensorrt_plan"
max_batch_size: 128
input [
  {
    name: "token_ids"
    data_type: TYPE_INT32
    dims: [128]
  },
  {
    name: "attn_mask"
    data_type: TYPE_INT32
    dims: [128]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [128, 768]
  },
  {
    name: "pooled_output"
    data_type: TYPE_FP32
    dims: [768]
  }
]
instance_group {
  count: 1
  kind: KIND_GPU
}
dynamic_batching {
  preferred_batch_size: 16
}

SageMaker expects a .tar.gz file containing each Triton model repository to be hosted on the multi-model endpoint. To simulate several similar models being hosted, you might think all it takes is to tar the model repository we have already built and then copy it with different file names. However, Triton requires unique model names. Therefore, we first copy the model repo N times, changing the model directory names and their corresponding config.pbtxt files. You can change the value of N to create more copies of the model that can be dynamically loaded to the hosting endpoint to simulate the model load/unload action managed by SageMaker. See the following code:

import os
import shutil

N = 5
prefix = 'bert-mme'
model_repo_base = 'model_repo'

# Get model names from model_repo_0
model_names = [name for name in os.listdir(f'{model_repo_base}_0') if os.path.isdir(f'{model_repo_base}_0/{name}')]

for i in range(N):
    # Make a copy of the base model repo (model_repo_0) with an incremented id
    shutil.copytree(f'{model_repo_base}_0', f'{model_repo_base}_{i+1}')
    time.sleep(5)
    for name in model_names:
        model_dirs_path = f'{model_repo_base}_{i+1}/{name}'

        # Open each model's config file to increment model # id there 
        fin = open(f'{model_dirs_path}/config.pbtxt', "rt")
        data = fin.read()
        data = data.replace(name, name[:-1] + str(i+1))
        fin.close()
        fin = open(f'{model_dirs_path}/config.pbtxt', "wt")
        fin.write(data)
        fin.close()
    
        # Change model directory name to match new config
        os.rename(model_dirs_path,model_dirs_path[:-1]+str(i+1))
        time.sleep(2)
        
    if i == 0:
        tar_file_name = f'bert-{i}.tar.gz'
        model_repo_target = f'{model_repo_base}_{i}/'
        !tar -C $model_repo_target -czf $tar_file_name .
        sagemaker_session.upload_data(path=tar_file_name, key_prefix=prefix)

    tar_file_name = f'bert-{i+1}.tar.gz'
    model_repo_target = f'{model_repo_base}_{i+1}/'
    !tar -C $model_repo_target -czf $tar_file_name .
    sagemaker_session.upload_data(path=tar_file_name, key_prefix=prefix)
    !sudo rm -r "$tar_file_name" "$model_repo_target"

Create a SageMaker endpoint

Now that we have uploaded the model artifacts to Amazon S3, we can create the SageMaker model object, endpoint configuration, and endpoint.

First, we need to define the serving container. In the container definition, define the ModelDataUrl to specify the S3 directory that contains all the models that the SageMaker multi-model endpoint will use to load and serve predictions. Set Mode to MultiModel to indicate that SageMaker should create the endpoint with MME container specifications. See the following code:

container = {
    "Image": triton_image_uri,
    "ModelDataUrl": model_data_uri,
    "Mode": "MultiModel",
}

Then we create the SageMaker model object using the create_model boto3 API by specifying the ModelName and container definition:

create_model_response = sm.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

We use this model to create an endpoint configuration where we can specify the type and number of instances we want in the endpoint. Here we deploy to an ml.g5.xlarge NVIDIA GPU instance:

create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": "ml.g5.xlarge",
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

With this endpoint configuration, we create a new SageMaker endpoint and wait for the deployment to finish. The status will change to InService when the deployment is successful.

endpoint_name = "triton-nlp-bert-trt-mme-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
create_endpoint_response = sm.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
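
To block until the deployment finishes, you can poll the endpoint status; a minimal sketch using the boto3 endpoint_in_service waiter follows:

waiter = sm.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=endpoint_name)
print(sm.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"])  # InService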

Invoke your model hosted on the SageMaker endpoint

When the endpoint is running, we can use some sample raw data to perform inference using either JSON or binary+JSON as the payload format. For the inference request format, Triton uses the KFServing community standard inference protocols. We can send the inference request to the multi-model endpoint using the invoke_endpoint API. We specify the TargetModel in the invocation call and pass in the payload for each model type. Here we invoke the endpoint in a for loop to request the endpoint to dynamically load or unload models based on the requests:

text_triton = "Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs."
input_ids, attention_mask = tokenize_text(text_triton)

payload = {
    "inputs": [
        {"name": "token_ids", "shape": [1, 128], "datatype": "INT32", "data": input_ids},
        {"name": "attn_mask", "shape": [1, 128], "datatype": "INT32", "data": attention_mask},
    ]
}

for i in range(N):
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/octet-stream",
        Body=json.dumps(payload),
        TargetModel=f"bert-{i}.tar.gz",
    )

    print(json.loads(response["Body"].read().decode("utf8")))

You can monitor the model loading and unloading status using Amazon CloudWatch metrics and logs. SageMaker multi-model endpoints provide instance-level metrics to monitor; for more details, refer to Monitor Amazon SageMaker with Amazon CloudWatch. The LoadedModelCount metric shows the number of models loaded in the containers. The ModelCacheHit metric shows the number of invocations to models that are already loaded in the container, which gives you model invocation-level insights. To check whether models are unloaded from memory, you can look for successful model unload entries in the endpoint’s CloudWatch logs.
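
As an illustration, the following sketch retrieves the LoadedModelCount metric with boto3. The namespace and dimensions shown follow the CloudWatch documentation for SageMaker multi-model endpoints; verify them against the metrics emitted by your own endpoint before relying on the numbers:

import datetime

cloudwatch = boto3.client("cloudwatch")

response = cloudwatch.get_metric_statistics(
    Namespace="/aws/sagemaker/Endpoints",
    MetricName="LoadedModelCount",
    Dimensions=[
        {"Name": "EndpointName", "Value": endpoint_name},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.datetime.utcnow() - datetime.timedelta(minutes=30),
    EndTime=datetime.datetime.utcnow(),
    Period=60,
    Statistics=["Average"],
)
print(response["Datapoints"])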

The notebook can be found in the GitHub repository.

Best practices

Before starting any optimization effort with TensorRT, it’s essential to determine what should be measured. Without measurements, it’s impossible to make reliable progress or measure whether success has been achieved. Here are some best practices to consider when using the TensorRT backend for Triton Inference Server:

  • Optimize your TensorRT model – Before deploying a model on Triton with the TensorRT backend, make sure to optimize the model following the TensorRT best practices guide. This will help you achieve better performance by reducing inference time and memory consumption.
  • Use TensorRT instead of other Triton backends when possible – TensorRT is designed to optimize deep learning models for deployment on NVIDIA GPUs, so using it can significantly improve inference performance compared to using other supported Triton backends.
  • Use the right precision – TensorRT supports multiple precisions (FP32, FP16, INT8), and selecting the right precision for your model can have a significant impact on performance. Consider using lower precision when possible.
  • Use batch sizes that fit your hardware – Make sure to choose batch sizes that fit your GPU’s memory and compute capabilities. Using batch sizes that are too large or too small can negatively impact performance.

Conclusion

In this post, we dove deep into the TensorRT backend that Triton Inference Server supports on SageMaker. This backend provides GPU-accelerated serving of your TensorRT models on NVIDIA GPUs. There are many options to consider to get the best performance for inference, such as batch sizes, data input formats, and other factors that can be tuned to meet your needs. SageMaker allows you to take advantage of this capability using single model endpoints for guaranteed performance and multi-model endpoints to get a better balance of performance and cost savings. To get started with MME support for GPU, see Supported algorithms, frameworks, and instances.

We invite you to try Triton Inference Server containers in SageMaker, and share your feedback and questions in the comments.


 About the Authors

Melanie Li is a Senior AI/ML Specialist TAM at AWS based in Sydney, Australia. She helps enterprise customers to build solutions leveraging the state-of-the-art AI/ML tools on AWS and provides guidance on architecting and implementing machine learning solutions with best practices. In her spare time, she loves to explore nature outdoors and spend time with family and friends.

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences,  and staying up to date with the latest technology trends.

Jiahong Liu is a Solution Architect on the Cloud Service Provider team at NVIDIA. He assists clients in adopting machine learning and AI solutions that leverage NVIDIA accelerated computing to address their training and inference challenges. In his leisure time, he enjoys origami, DIY projects, and playing basketball.

Kshitiz Gupta is a Solutions Architect at NVIDIA. He enjoys educating cloud customers about the GPU AI technologies NVIDIA has to offer and assisting them with accelerating their machine learning and deep learning applications. Outside of work, he enjoys running, hiking and wildlife watching.

Read More

Build an image search engine with Amazon Kendra and Amazon Rekognition

Build an image search engine with Amazon Kendra and Amazon Rekognition

In this post, we discuss a machine learning (ML) solution for complex image searches using Amazon Kendra and Amazon Rekognition. Specifically, we use the example of architecture diagrams for complex images due to their incorporation of numerous different visual icons and text.

With the internet, searching for and obtaining an image has never been easier. Most of the time, you can accurately locate your desired images, such as when searching for your next holiday getaway destination. Simple searches are often successful because they aren’t associated with many characteristics; beyond the desired image characteristics, the search criteria typically don’t require significant detail to locate the required result. For example, if a user searches for a specific type of blue bottle, results showing many different types of blue bottles will be displayed. However, the desired blue bottle may not be easy to find because the search terms are generic.

Interpreting search context also helps simplify results. When users have a desired image in mind, they try to frame it as a text-based search query. Understanding the nuances between search queries for similar topics is important to provide relevant results and minimize the effort required from the user to manually sort through results. For example, the search query “Dog owner plays fetch” seeks to return image results showing a dog owner playing a game of fetch with a dog. However, the actual results generated may instead focus on a dog fetching an object without displaying an owner’s involvement. Users may have to manually filter out unsuitable image results when dealing with complex searches.

To address the problems associated with complex searches, this post describes in detail how you can achieve a search engine that is capable of searching for complex images by integrating Amazon Kendra and Amazon Rekognition. Amazon Kendra is an intelligent search service powered by ML, and Amazon Rekognition is an ML service that can identify objects, people, text, scenes, and activities from images or videos.

What images can be too complex to be searchable? One example is architecture diagrams, which can be associated with many search criteria depending on the use case complexity and number of technical services required, which results in significant manual search effort for the user. For example, if users want to find an architecture solution for the use case of customer verification, they will typically use a search query similar to “Architecture diagrams for customer verification.” However, generic search queries would span a wide range of services and across different content creation dates. Users would need to manually select suitable architectural candidates based on specific services and consider the relevance of the architecture design choices according to the content creation date and query date.

The following figure shows an example diagram that illustrates an orchestrated extract, transform, and load (ETL) architecture solution.

For users who are not familiar with the service offerings that are provided on the cloud platform, they may provide different generic ways and descriptions when searching for such a diagram. The following are some examples of how it could be searched:

  • “Orchestrate ETL workflow”
  • “How to automate bulk data processing”
  • “Methods to create a pipeline for transforming data”

Solution overview

We walk you through the following steps to implement the solution:

  1. Train an Amazon Rekognition Custom Labels model to recognize symbols in architecture diagrams.
  2. Incorporate Amazon Rekognition text detection to validate architecture diagram symbols.
  3. Use Amazon Rekognition inside a web crawler to build a repository for searching.
  4. Use Amazon Kendra to search the repository.

To easily provide users with a large repository of relevant results, the solution should provide an automated way of searching through trusted sources. Using architecture diagrams as an example, the solution needs to search through reference links and technical documents for architecture diagrams and identify the services present. Identifying keywords such as use cases and industry verticals in these sources also allows the information to be captured and for more relevant search results to be displayed to the user.

Considering the objective of how relevant diagrams should be searched, the image search solution needs to fulfill three criteria:

  • Enable simple keyword search
  • Interpret search queries based on use cases that users provide
  • Sort and order search results

Keyword search is simply searching for “Amazon Rekognition” and being shown architecture diagrams that use the service in different use cases. Alternatively, the search terms can be linked indirectly to the diagram through use cases and industry verticals that may be associated with the architecture. For example, searching for the terms “How to orchestrate ETL pipeline” returns results of architecture diagrams built with AWS Glue and AWS Step Functions. Sorting and ordering of search results based on attributes such as creation date would ensure the architecture diagrams are still relevant in spite of service updates and releases. The following figure shows the architecture diagram of the image search solution.

As illustrated in the preceding diagram and in the solution overview, there are two main aspects of the solution. The first aspect is performed by Amazon Rekognition, which can identify objects, people, text, scenes, and activities from images or videos. It consists of pre-trained models that can be applied to analyze images and videos at scale. With its custom labels feature, Amazon Rekognition allows you to tailor the ML service to your specific business needs by labeling images collected from architecture diagrams in trusted reference links and technical documents. By uploading a small set of training images, Amazon Rekognition automatically loads and inspects the training data, selects the right ML algorithms, trains a model, and provides model performance metrics. Therefore, users without ML expertise can enjoy the benefits of a custom labels model through an API call, because much of the overhead is removed. The solution applies Amazon Rekognition Custom Labels to detect AWS service logos on architecture diagrams so that the diagrams become searchable by service names. After modeling, the detected services of each architecture diagram image and its metadata, like URL origin and image title, are indexed for future search purposes and stored in Amazon DynamoDB, a fully managed, serverless, key-value NoSQL database designed to run high-performance applications.

The second aspect is supported by Amazon Kendra, an intelligent enterprise search service powered by ML that allows you to search across different content repositories. With Amazon Kendra, you can search for results, such as images or documents, that have been indexed. These results can also be stored across different repositories because the search service employs built-in connectors. Keywords, phrases, and descriptions could be used for searching, which allows you to accurately search for diagrams that are related to a particular use case. Therefore, you can easily build an intelligent search service with minimal development costs.

With an understanding of the problem and solution, the subsequent sections dive into how to automate data sourcing through the crawling of architecture diagrams from credible sources. Following this, we walk through the process of generating a custom label ML model with a fully managed service. Lastly, we cover the data ingestion by an intelligent search service, powered by ML.

Create an Amazon Rekognition model with custom labels

Before obtaining any architecture diagrams, we need a tool to evaluate if an image can be identified as an architecture diagram. Amazon Rekognition Custom Labels provides a streamlined process to create an image recognition model that identifies objects and scenes in images that are specific to a business need. In this case, we use Amazon Rekognition Custom Labels to identify AWS service icons, then the images are indexed with the services for a more relevant search using Amazon Kendra. This model doesn’t differentiate whether a picture is an architecture diagram or not; it simply identifies service icons, if any. As such, there may be instances where images that aren’t architecture diagrams end up in the search results. However, such results are minimal.

The following figure shows the steps that this solution takes to create an Amazon Rekognition Custom Labels model.

This process involves uploading the datasets, generating a manifest file that references the uploaded datasets, followed by uploading this manifest file into Amazon Rekognition. A Python script is used to aid in the process of uploading the datasets and generating the manifest file. Upon successfully generating the manifest file, it’s then uploaded into Amazon Rekognition to begin the model training process. For details on the Python script and how to run it, refer to the GitHub repo.

To train the model, in the Amazon Rekognition project, choose Train model, select the project you want to train, then add any relevant tags and choose Train model. For instructions on starting an Amazon Rekognition Custom Labels project, refer to the available video tutorials. The model may take up to 8 hours to train with this dataset.
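
If you prefer to start training programmatically instead of from the console, a minimal boto3 sketch looks like the following; the project ARN, version name, and S3 output location are placeholders, and the project’s training and test datasets are assumed to already be attached:

import boto3

rekognition = boto3.client("rekognition")

# Start training a new model version; results and evaluation metrics land in the output bucket
response = rekognition.create_project_version(
    ProjectArn="arn:aws:rekognition:us-east-1:111122223333:project/architecture-icons/1111111111111",
    VersionName="architecture-icons.v1",
    OutputConfig={"S3Bucket": "amzn-s3-demo-bucket", "S3KeyPrefix": "evaluation"},
)
print(response["ProjectVersionArn"])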

When the training is complete, you may choose the trained model to view the evaluation results. For more details on the different metrics such as precision, recall, and F1, refer to Metrics for evaluating your model. To use the model, navigate to the Use Model tab, leave the number of inference units at 1, and start the model. Then we can use an AWS Lambda function to send images to the model in base64, and the model returns a list of labels and confidence scores.
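
As a reference, a minimal Python sketch of that call with boto3 might look like the following; the function name, model ARN variable, and confidence threshold are illustrative placeholders:

import base64
import boto3

rekognition = boto3.client("rekognition")

def detect_service_icons(image_base64, model_arn, min_confidence=40):
    # The Lambda function receives the image as base64; Rekognition expects raw bytes
    image_bytes = base64.b64decode(image_base64)
    response = rekognition.detect_custom_labels(
        ProjectVersionArn=model_arn,
        Image={"Bytes": image_bytes},
        MinConfidence=min_confidence,
    )
    # Each custom label carries a name and a confidence score
    return [(label["Name"], label["Confidence"]) for label in response["CustomLabels"]]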

Upon successfully training an Amazon Rekognition model with Amazon Rekognition Custom Labels, we can use it to identify service icons in the architecture diagrams that have been crawled. To increase the accuracy of identifying services in the architecture diagram, we use another Amazon Rekognition feature called text detection. To use this feature, we pass in the same picture in base64, and Amazon Rekognition returns the list of text identified in the picture. In the following figures, we compare the original image and what it looks like after the services in the image are identified. The first figure shows the original image.

The following figure shows the original image with detected services.

To ensure scalability, we use a Lambda function, which will be exposed through an API endpoint created using Amazon API Gateway. Lambda is a serverless, event-driven compute service that lets you run code for virtually any type of application or backend service without provisioning or managing servers. Using a Lambda function eliminates a common concern about scaling up when large volumes of requests are made to the API endpoint. Lambda automatically runs the function for the specific API call and stops when the invocation is complete, thereby reducing the cost incurred. Because the request is directed to the Amazon Rekognition endpoint, having only the Lambda function be scalable is not sufficient. In order for the Amazon Rekognition endpoint to be scalable, you can increase the number of inference units of the model. For more details on configuring inference units, refer to Inference units.
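
For reference, the number of inference units is specified when the trained model version is started; a minimal boto3 sketch (with a placeholder model version ARN) looks like this:

import boto3

rekognition = boto3.client("rekognition")

# Start the trained model version with two inference units to serve more parallel requests
rekognition.start_project_version(
    ProjectVersionArn="arn:aws:rekognition:us-east-1:111122223333:project/architecture-icons/version/architecture-icons.v1/1111111111111",
    MinInferenceUnits=2,
)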

The following is a code snippet of the Lambda function for the image recognition process:

const AWS = require("aws-sdk");
const axios = require("axios");

// API to retrieve information about individual services
const SERVICE_API = process.env.SERVICE_API;
// ARN of Amazon Rekognition model
const MODEL_ARN = process.env.MODEL_ARN;

const rekognition = new AWS.Rekognition();

exports.handler = async (event) => {
  const body = JSON.parse(event["body"]);
  let base64Binary = "";

  // Checks if the payload contains a url to the image or the image in base64
  if (body.url) {
    const base64Res = await new Promise((resolve) => {
      axios
        .get(body.url, {
          responseType: "arraybuffer",
        })
        .then((response) => {
          resolve(Buffer.from(response.data, "binary").toString("base64"));
        });
    });

    base64Binary = Buffer.from(base64Res, "base64");
  } else if (body.byte) {
    const base64Cleaned = body.byte.split("base64,")[1];
    base64Binary = Buffer.from(base64Cleaned, "base64");
  }

  // Pass the contents through the trained Custom Labels model and text detection
  const [labels, text] = await Promise.all([
    detectLabels(rekognition, base64Binary, MODEL_ARN),
    detectText(rekognition, base64Binary),
  ]);
  const texts = text.TextDetections.map((text) => ({
    DetectedText: text.DetectedText,
    ParentId: text.ParentId,
  }));

  // Compare between overlapping labels and retain the label with the highest confidence
  let filteredLabels = removeOverlappingLabels(labels);

  // Sort all the labels from most to least confident
  filteredLabels = sortByConfidence(filteredLabels);

  // Remove duplicate services in the list
  const services = retrieveUniqueServices(filteredLabels, texts);

  // Pass each service into the reference document API to retrieve the URL to the documentation
  const refLinks = await getReferenceLinks(services);

  var responseBody = {
    labels: filteredLabels,
    text: texts,
    ref_links: refLinks,
  };

  console.log("Response: ", response_body);

  const response = {
    statusCode: 200,
    headers: {
      "Access-Control-Allow-Origin": "*", // Required for CORS to work
    },
    body: JSON.stringify(responseBody),
  };
  return response;
};

// Code removed to truncate section

After creating the Lambda function, we can proceed to expose it as an API using API Gateway. For instructions on creating an API with Lambda proxy integration, refer to Tutorial: Build a Hello World REST API with Lambda proxy integration.

Crawl the architecture diagrams

In order for the search feature to work effectively, we need a repository of architecture diagrams. However, these diagrams must originate from credible sources such as AWS Blog and AWS Prescriptive Guidance. Establishing credibility of data sources ensures the underlying implementation and purpose of the use cases are accurate and well vetted. The next step is to set up a crawler that can help gather many architecture diagrams to feed into our repository. We created a web crawler to extract architecture diagrams and information such as a description of the implementation from the relevant sources. There are multiple ways to build such a mechanism; for this example, we use a program that runs on Amazon Elastic Compute Cloud (Amazon EC2). The program first obtains links to blog posts from an AWS Blog API. The response returned from the API contains information about the post such as title, URL, date, and the links to images found in the post.

The following is a code snippet of the JavaScript function for the web crawling process:

import axios from "axios";
import puppeteer from "puppeteer";
import {
  putItemDDB,
  identifyImageHighConfidence,
  getReferenceList,
} from "./utils.js";

/** Global variables */
const blogPostsApi = process.env.BLOG_POSTS_API;
const IMAGE_URL_PATTERN =
  "<pattern in the url that identified as link to image>";
const DDB_Table = process.env.DDB_Table;

// Function that retrieves URLs of records from a public API
function getURLs(blogPostsApi) {
  // Return a list of URLs
  return axios
    .get(blogPostsApi)
    .then((response) => {
      var data = response.data.items;
      console.log("RESPONSE:");
      const blogLists = data.map((blog) => [
        blog.item.additionalFields.link,
        blog.item.dateUpdated,
      ]);
      return blogLists;
    })
    .catch((error) => console.error(error));
}

// Function that crawls content of individual URLs
async function crawlFromUrl(urls) {
  const browser = await puppeteer.launch({
    executablePath: "/usr/bin/chromium-browser",
  });
  // const browser = await puppeteer.launch();

  const page = await browser.newPage();

  let numOfValidArchUrls = 0;

  for (let index = 0; index < urls.length; index++) {
    console.log("index: ", index);
    let blogURL = urls[index][0];
    let dateUpdated = urls[index][1];

    await page.goto(blogURL);
    console.log("blogUrl:", blogURL);
    console.log("date:", dateUpdated);

    // Identify and get image from post based on URL pattern
    const images = await page.evaluate(() =>
      Array.from(document.images, (e) => e.src)
    );
    const filter1 = images.filter((img) => img.includes(IMAGE_URL_PATTERN));
    console.log("all images:", filter1);

    // Validate if image is an architecture diagram
    for (let index_1 = 0; index_1 < filter1.length; index_1++) {
      const imageUrl = filter1[index_1];

      const rekog = await identifyImageHighConfidence(imageUrl);

      if (rekog) {
        if (rekog.labels.size >= 2) {
          console.log("Rekog.labels.size = ", rekog.labels.size);
          console.log("Selected image url  = ", imageUrl);

          let articleSection = [];
          let metadata = await page.$$('span[property="articleSection"]');

          for (let i = 0; i < metadata.length; i++) {
            const element = metadata[i];
            const value = await element.evaluate(
              (el) => el.textContent,
              element
            );
            console.log("value: ", value);
            articleSection.push(value);
          }

          const title = await page.title();
          const allRefLinks = await getReferenceList(
            rekog.labels,
            rekog.textServices
          );

          numOfValidArchUrls = numOfValidArchUrls + 1;

          putItemDDB(
            blogURL,
            dateUpdated,
            imageUrl,
            articleSection.toString(),
            rekog,
            { L: allRefLinks },
            title,
            DDB_Table
          );

          console.log("numOfValidArchUrls = ", numOfValidArchUrls);
        }
      }
      if (rekog && rekog.labels.size >= 2) {
        break;
      }
    }
  }
  console.log("valid arch : ", numOfValidArchUrls);
  await browser.close();
}

async function startCrawl() {
  // Get a list of URLs
  // Extract architecture image from those URLs
  const urls = await getURLs(blogPostsApi);

  if (urls) console.log("Crawling urls completed");
  else {
    console.log("Unable to crawl images");
    return;
  }
  await crawlFromUrl(urls);
}

startCrawl();

With this mechanism, we can easily crawl hundreds or thousands of images from different blogs. However, we need a filter that only accepts images containing the content of an architecture diagram, which in our case is the presence of AWS service icons, so that images that aren’t architecture diagrams are excluded.

This is the purpose of our Amazon Rekognition model. The diagrams go through the image recognition process, which identifies service icons and determines if it could be considered as a valid architecture diagram.

The following is a code snippet of the function that sends images to the Amazon Rekognition model:

import axios from "axios";
import AWS from "aws-sdk";

// Configuration
AWS.config.update({ region: process.env.REGION });

/** Global variables */
// API to identify images
const LABEL_API = process.env.LABEL_API;
// API to get relevant documentations of individual services
const DOCUMENTATION_API = process.env.DOCUMENTATION_API;
// Create the DynamoDB service object
const dynamoDB = new AWS.DynamoDB({ apiVersion: "2012-08-10" });

// Function to identify image using an API that calls Amazon Rekognition model
function identifyImageHighConfidence(image_url) {
  return axios
    .post(LABEL_API, {
      url: image_url,
    })
    .then((res) => {
      let data = res.data;
      let rekogLabels = new Set();
      let rekogTextServices = new Set();
      let rekogTextMetadata = new Set();

      data.labels.forEach((element) => {
        if (element.Confidence >= 40) rekogLabels.add(element.Name);
      });

      data.text.forEach((element) => {
        if (
          element.DetectedText.includes("AWS") ||
          element.DetectedText.includes("Amazon")
        ) {
          rekogTextServices.add(element.DetectedText);
        } else {
          rekogTextMetadata.add(element.DetectedText);
        }
      });
      rekogTextServices.delete("AWS");
      rekogTextServices.delete("Amazon");
      return {
        labels: rekogLabels,
        textServices: rekogTextServices,
        textMetadata: Array.from(rekogTextMetadata).join(", "),
      };
    })
    .catch((error) => console.error(error));
}

After passing the image recognition check, the results returned from the Amazon Rekognition model and the information relevant to it are bundled into their own metadata. The metadata is then stored in a DynamoDB table, and the record is later ingested into Amazon Kendra.

The following is a code snippet of the function that stores the metadata of the diagram in DynamoDB:

// Code removed to truncate section

// Function that PUTS item into Amazon DynamoDB table
function putItemDDB(
  originUrl,
  publishDate,
  imageUrl,
  crawlerData,
  rekogData,
  referenceLinks,
  title,
  tableName
) {
  console.log("WRITE TO DDB");
  console.log("originUrl :   ", originUrl);
  console.log("publishDate:  ", publishDate);
  console.log("imageUrl: ", imageUrl);
  let write_params = {
    TableName: tableName,
    Item: {
      OriginURL: { S: originUrl },
      PublishDate: { S: formatDate(publishDate) },
      ArchitectureURL: {
        S: imageUrl,
      },
      Metadata: {
        M: {
          crawler: {
            S: crawlerData,
          },
          Rekognition: {
            M: {
              labels: {
                S: Array.from(rekogData.labels).join(", "),
              },
              textServices: {
                S: Array.from(rekogData.textServices).join(", "),
              },
              textMetadata: {
                S: rekogData.textMetadata,
              },
            },
          },
        },
      },
      Reference: referenceLinks,
      Title: {
        S: title,
      },
    },
  };

  dynamoDB.putItem(write_params, function (err, data) {
    if (err) {
      console.log("*** DDB Error", err);
    } else {
      console.log("Successfuly inserted in DDB", data);
    }
  });
}

Ingest metadata into Amazon Kendra

After the architecture diagrams go through the image recognition process and the metadata is stored in DynamoDB, we need a way for the diagrams to be searchable while referencing the content in the metadata. The approach to this is to have a search engine that can be integrated with the application and can handle a large amount of search queries. Therefore, we use Amazon Kendra, an intelligent enterprise search service.

We use Amazon Kendra as the interactive component of the solution because of its powerful search capabilities, particularly its use of natural language. This adds a layer of simplicity when users are searching for diagrams that are closest to what they’re looking for. Amazon Kendra offers a number of data source connectors for ingesting and connecting content. This solution uses a custom connector to ingest architecture diagram information from DynamoDB. To configure a data source for an Amazon Kendra index, you can use an existing index or create a new one.

The diagrams crawled then have to be ingested into the Amazon Kendra index that has been created. The following figure shows the flow of how the diagrams are indexed.

First, the diagrams inserted into DynamoDB create a Put event via Amazon DynamoDB Streams. The event triggers the Lambda function that acts as a custom data source for Amazon Kendra and loads the diagrams into the index. For instructions on creating a DynamoDB Streams trigger for a Lambda function, refer to Tutorial: Using AWS Lambda with Amazon DynamoDB Streams.

After we integrate the Lambda function with DynamoDB, we need to ingest the records of the diagrams sent to the function into the Amazon Kendra index. The index accepts data from various types of sources, and ingesting items into the index from the Lambda function means that it has to use the custom data source configuration. For instructions on creating a custom data source for your index, refer to Custom data source connector.

The following is a code snippet of the Lambda function for how a diagram could be indexed in a custom manner:

import json
import os
import boto3

KENDRA = boto3.client("kendra")
INDEX_ID = os.environ["INDEX_ID"]
DS_ID = os.environ["DS_ID"]


def lambda_handler(event, context):
    dbRecords = event["Records"]

    # Loop through items from Amazon DynamoDB
    for row in dbRecords:
        rowData = row["dynamodb"]["NewImage"]
        originUrl = rowData["OriginURL"]["S"]
        publishedDate = rowData["PublishDate"]["S"]
        architectureUrl = rowData["ArchitectureURL"]["S"]
        title = rowData["Title"]["S"]

        metadata = rowData["Metadata"]["M"]
        crawlerMetadata = metadata["crawler"]["S"]
        rekognitionMetadata = metadata["Rekognition"]["M"]
        rekognitionLabels = rekognitionMetadata["labels"]["S"]
        rekognitionServices = rekognitionMetadata["textServices"]["S"]

        concatenatedText = (
            f"{crawlerMetadata} {rekognitionLabels} {rekognitionServices}"
        )

        add_document(
            dsId=DS_ID,
            indexId=INDEX_ID,
            originUrl=originUrl,
            architectureUrl=architectureUrl,
            title=title,
            publishedDate=publishedDate,
            text=concatenatedText,
        )

    return


# Function to add the diagram into Kendra index
def add_document(dsId, indexId, originUrl, architectureUrl, title, publishedDate, text):
    document = get_document(
        dsId, originUrl, architectureUrl, title, publishedDate, text
    )
    documents = [document]
    result = KENDRA.batch_put_document(IndexId=indexId, Documents=documents)
    print("result:" + json.dumps(result))
    return True


# Frame the diagram into a document that Kendra accepts
def get_document(dsId, originUrl, architectureUrl, title, publishedDate, text):
    document = {
        "Id": originUrl,
        "Title": title,
        "Attributes": [
            {"Key": "_data_source_id", "Value": {"StringValue": dsId}},
            {"Key": "_source_uri", "Value": {"StringValue": architectureUrl}},
            {"Key": "_created_at", "Value": {"DateValue": publishedDate}},
            {"Key": "publish_date", "Value": {"DateValue": publishedDate}},
        ],
        "Blob": text,
    }

    return document

The important factor that enables diagrams to be searchable is the Blob key in a document. This is what Amazon Kendra looks into when users provide their search input. In this example code, the Blob key contains a summarized version of the use case of the diagram concatenated with the information detected from the image recognition process. This allows users to search for architecture diagrams based on use cases such as “Fraud Detection” or by service names like “Amazon Kendra.”

To illustrate an example of what the Blob key looks like, the following snippet references the initial ETL diagram that we introduced earlier in this post. It contains a description of the diagram that was obtained when it was crawled, as well as the services that were identified by the Amazon Rekognition model.

{
    ...,
    "Blob": "Build and orchestrate ETL pipelines using Amazon Athena and AWS Step Functions Amazon Athena, AWS Step Functions, Amazon S3, AWS Glue Data Catalog "
}

Search with Amazon Kendra

After we put all the components together, the results of an example search of “real time analytics” look like the following screenshot.

Searching for this use case produces different architecture diagrams. Users are presented with different ways to implement the specific workload they’re trying to build.
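
For a programmatic equivalent of this search, a minimal boto3 sketch might look like the following (the index ID is a placeholder):

import boto3

kendra = boto3.client("kendra")

response = kendra.query(IndexId="<your-index-id>", QueryText="real time analytics")

# Each result carries the document title and the architecture diagram URL stored as _source_uri
for item in response["ResultItems"]:
    print(item.get("DocumentTitle", {}).get("Text"), "->", item.get("DocumentURI"))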

Clean up

Complete the steps in this section to clean up the resources you created as part of this post:

  1. Delete the API:
    1. On the API Gateway console, select the API to be deleted.
    2. On the Actions menu, choose Delete.
    3. Choose Delete to confirm.
  2. Delete the DynamoDB table:
    1. On the DynamoDB console, choose Tables in the navigation pane.
    2. Select the table you created and choose Delete.
    3. Enter delete when prompted for confirmation.
    4. Choose Delete table to confirm.
  3. Delete the Amazon Kendra index:
    1. On the Amazon Kendra console, choose Indexes in the navigation pane.
    2. Select the index you created and choose Delete.
    3. Enter a reason when prompted for confirmation.
    4. Choose Delete to confirm.
  4. Delete the Amazon Rekognition project:
    1. On the Amazon Rekognition console, choose Use Custom Labels in the navigation pane, then choose Projects.
    2. Select the project you created and choose Delete.
    3. Enter Delete when prompted for confirmation.
    4. Choose Delete associated datasets and models to confirm.
  5. Delete the Lambda function:
    1. On the Lambda console, select the function to be deleted.
    2. On the Actions menu, choose Delete.
    3. Enter Delete when prompted for confirmation.
    4. Choose Delete to confirm.

Summary

In this post, we showed an example of how you can intelligently search information from images. This includes the process of training an Amazon Rekognition ML model that acts as a filter for images, the automation of image crawling, which ensures credibility and efficiency, and querying for diagrams by attaching a custom data source that enables a more flexible manner to index items. To dive deeper into the implementation of the codes, refer to the GitHub repo.

Now that you understand how to deliver the backbone of a centralized search repository for complex searches, try creating your own image search engine. For more information on the core features, refer to Getting started with Amazon Rekognition Custom Labels, Moderating content, and the Amazon Kendra Developer Guide. If you’re new to Amazon Rekognition Custom Labels, try it out using our Free Tier, which lasts 3 months and includes 10 free training hours per month and 4 free inference hours per month.


About the Authors

Ryan See is a Solutions Architect at AWS. Based in Singapore, he works with customers to build solutions to solve their business problems as well as tailor a technical vision to help run more scalable and efficient workloads in the cloud.

James Ong Jia Xiang is a Customer Solutions Manager at AWS. He specializes in the Migration Acceleration Program (MAP) where he helps customers and partners successfully implement large-scale migration programs to AWS. Based in Singapore, he also focuses on driving modernization and enterprise transformation initiatives across APJ through scalable mechanisms. For leisure, he enjoys nature activities like trekking and surfing.

Hang Duong is a Solutions Architect at AWS. Based in Hanoi, Vietnam, she focuses on driving cloud adoption across her country by providing highly available, secure, and scalable cloud solutions for her customers. Additionally, she enjoys building and is involved in various prototyping projects. She is also passionate about the field of machine learning.

Trinh Vo is a Solutions Architect at AWS, based in Ho Chi Minh City, Vietnam. She focuses on working with customers across different industries and partners in Vietnam to craft architectures and demonstrations of the AWS platform that work backward from the customer’s business needs and accelerate the adoption of appropriate AWS technology. She enjoys caving and trekking for leisure.

Wai Kin Tham is a Cloud Architect at AWS. Based in Singapore, his day job involves helping customers migrate to the cloud and modernize their technology stack in the cloud. In his free time, he attends Muay Thai and Brazilian Jiu Jitsu classes.

Read More

Create high-quality datasets with Amazon SageMaker Ground Truth and FiftyOne

Create high-quality datasets with Amazon SageMaker Ground Truth and FiftyOne

This is a joint post co-written by AWS and Voxel51. Voxel51 is the company behind FiftyOne, the open-source toolkit for building high-quality datasets and computer vision models.

A retail company is building a mobile app to help customers buy clothes. To create this app, they need a high-quality dataset containing clothing images, labeled with different categories. In this post, we show how to repurpose an existing dataset via data cleaning, preprocessing, and pre-labeling with a zero-shot classification model in FiftyOne, and adjusting these labels with Amazon SageMaker Ground Truth.

You can use Ground Truth and FiftyOne to accelerate your data labeling project. We illustrate how to seamlessly use the two applications together to create high-quality labeled datasets. For our example use case, we work with the Fashion200K dataset, released at ICCV 2017.

Solution overview

Ground Truth is a fully self-serve and managed data labeling service that empowers data scientists, machine learning (ML) engineers, and researchers to build high-quality datasets. FiftyOne by Voxel51 is an open-source toolkit for curating, visualizing, and evaluating computer vision datasets so that you can train and analyze better models and accelerate your use cases.

In the following sections, we demonstrate how to do the following:

  • Visualize the dataset in FiftyOne
  • Clean the dataset with filtering and image deduplication in FiftyOne
  • Pre-label the cleaned data with zero-shot classification in FiftyOne
  • Label the smaller curated dataset with Ground Truth
  • Inject labeled results from Ground Truth into FiftyOne and review labeled results in FiftyOne

Use case overview

Suppose you own a retail company and want to build a mobile application to give personalized recommendations to help users decide what to wear. Your prospective users are looking for an application that tells them which articles of clothing in their closet work well together. You see an opportunity here: if you can identify good outfits, you can use this to recommend new articles of clothing that complement the clothing a customer already owns.

You want to make things as easy as possible for the end-user. Ideally, someone using your application only needs to take pictures of the clothes in their wardrobe, and your ML models work their magic behind the scenes. You might train a general-purpose model or fine-tune a model to each user’s unique style with some form of feedback.

First, however, you need to identify what type of clothing the user is capturing. Is it a shirt? A pair of pants? Or something else? After all, you probably don’t want to recommend an outfit that has multiple dresses or multiple hats.

To address this initial challenge, you want to generate a training dataset consisting of images of various articles of clothing with various patterns and styles. To prototype with a limited budget, you want to bootstrap using an existing dataset.

To illustrate and walk you through the process in this post, we use the Fashion200K dataset released at ICCV 2017. It’s an established and well-cited dataset, but it isn’t directly suited for your use case.

Although articles of clothing are labeled with categories (and subcategories) and contain a variety of helpful tags that are extracted from the original product descriptions, the data is not systematically labeled with pattern or style information. Your goal is to turn this existing dataset into a robust training dataset for your clothing classification models. You need to clean the data, augmenting the labeling schema with style labels. And you want to do so quickly and with as little spend as possible.

Download the data locally

First, download the women.tar archive and the labels folder (with all of its subfolders) following the instructions provided in the Fashion200K dataset GitHub repository. After you’ve extracted them both, create a parent directory fashion200k and move the labels and women folders into it. Fortunately, these images have already been cropped to the object detection bounding boxes, so we can focus on classification rather than worry about object detection.
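
If you prefer to script this setup, the following minimal Python sketch produces the same layout, assuming women.tar and the labels folder sit in your working directory and that the archive extracts to a top-level women directory:

import os
import shutil
import tarfile

# Create the parent directory and extract the archive into it
os.makedirs("fashion200k", exist_ok=True)
with tarfile.open("women.tar") as tar:
    tar.extractall("fashion200k")

# Move the labels folder (and its subfolders) alongside the women folder
shutil.move("labels", os.path.join("fashion200k", "labels"))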

Despite the “200K” in its moniker, the women directory we extracted contains 338,339 images. To generate the official Fashion200K dataset, the dataset’s authors crawled more than 300,000 products online, and only products with descriptions containing more than four words made the cut. For our purposes, where the product description isn’t essential, we can use all of the crawled images.

Let’s look at how this data is organized: within the women folder, images are arranged by top-level article type (skirts, tops, pants, jackets, and dresses), and article type subcategory (blouses, t-shirts, long-sleeved tops).

Within the subcategory directories, there is a subdirectory for each product listing. Each of these contains a variable number of images. The cropped_pants subcategory, for instance, contains the following product listings and associated images.

The labels folder contains a text file for each top-level article type, for both train and test splits. Within each of these text files is a separate line for each image, specifying the relative file path, a score, and tags from the product description.

Because we’re repurposing the dataset, we combine all of the train and test images. We use these to generate a high-quality application-specific dataset. After we complete this process, we can randomly split the resulting dataset into new train and test splits.
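
When we reach that point, one convenient option is FiftyOne's built-in random split utility. The following is a minimal sketch, where curated_dataset is a placeholder for the cleaned dataset we build over the rest of this walkthrough; the utility tags each sample with its split name:

import fiftyone.utils.random as four

# Randomly tag each sample as "train" or "test" (80/20 split)
four.random_split(curated_dataset, {"train": 0.8, "test": 0.2})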

Inject, view, and curate a dataset in FiftyOne

If you haven’t already done so, install open-source FiftyOne using pip:

pip install fiftyone

A best practice is to do so within a new virtual (venv or conda) environment. Then import the relevant modules: the base library, fiftyone; the FiftyOne Brain, which has built-in ML methods; the FiftyOne Zoo, from which we will load a model that generates zero-shot labels for us; and the ViewField, which lets us efficiently filter the data in our dataset:

import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz
from fiftyone import ViewField as F

You also want to import the glob and os Python modules, which will help us work with paths and pattern match over directory contents:

from glob import glob
import os

Now we’re ready to load the dataset into FiftyOne. First, we create a dataset named fashion200k and make it persistent, which allows us to save the results of computationally intensive operations, so we only need to compute said quantities once.

dataset = fo.Dataset("fashion200k", persistent=True)

We can now iterate through all subcategory directories, adding all the images within the product directories. We add a FiftyOne classification label to each sample with the field name article_type, populated by the image’s top-level article category. We also add both category and subcategory information as tags:

# Map dir categories to article type labels
labels_map = {
    "dresses": "dress",
    "jackets": "jacket",
    "pants": "pants",
    "skirts": "skirt",
    "tops": "top",
}

dataset_dir = "./fashion200k"

for d in glob(os.path.join(dataset_dir, "women", "*", "*")):
    # The last two path components are the category and subcategory directories
    *_, category, subcategory = d.split("/")
    subcategory = subcategory.replace("_", " ")
    label = labels_map[category]

    dataset.add_samples(
        [
            fo.Sample(
                filepath=filepath,
                tags=[category, subcategory],
                article_type=fo.Classification(label=label),
            )
            for filepath in glob(os.path.join(d, "*", "*"))
        ]
    )

At this point, we can visualize our dataset in the FiftyOne app by launching a session:

session = fo.launch_app(dataset)

We can also print out a summary of the dataset in Python by running print(dataset):

Name:        fashion200k
Media type:  image
Num samples: 338339
Persistent:  True
Tags:        []
Sample fields:
    id:            fiftyone.core.fields.ObjectIdField
    filepath:      fiftyone.core.fields.StringField
    tags:          fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:      fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    article_type:  fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)

We can also add the tags from the labels directory to the samples in our dataset:

working_dir = os.getcwd()

tags = {
    f: set(t)
    for f, t in zip(*dataset.values(["filepath", "tags"]))
}

for label_file in glob("fashion200k/labels/*"):
    with open(label_file, "r") as f:
        for line in f.readlines():
            line_list = line.split()
            fp = os.path.join(
                working_dir,
                dataset_dir,
                line_list[0],
            )

            # add new tags
            new_tags_for_fp = line_list[2:]
            tags[fp].update(new_tags_for_fp)

# Update tags
dataset.set_values("tags", tags, key_field="filepath")

Looking at the data, a few things become clear:

  • Some of the images are fairly grainy, with low resolution. This is likely because these images were generated by cropping the original images to their object detection bounding boxes.
  • Some clothes are worn by a person, and some are photographed on their own. These details are encapsulated by the viewpoint property.
  • A lot of the images of the same product are very similar, so at least initially, including more than one image per product may not add much predictive power. For the most part, the first image of each product (ending in _0.jpeg) is the cleanest.

Initially, we might want to train our clothing style classification model on a controlled subset of these images. To this end, we use high-resolution images of our products, and limit our view to one representative sample per product.

First, we filter out the low-resolution images. We use the compute_metadata() method to compute and store image width and height, in pixels, for each image in the dataset. We then employ the FiftyOne ViewField to filter out images based on the minimum allowed width and height values. See the following code:

dataset.compute_metadata()

min_width = 200
min_height = 300

width_filter = F("metadata.width") > min_width
height_filter = F("metadata.height") > min_height


high_res_view = dataset.match(
    width_filter & height_filter
)

session.view = high_res_view

This high-resolution subset has just under 200,000 samples.

From this view, we can create a new view into our dataset containing only one representative sample (at most) for each product. We use the ViewField once again, pattern matching for file paths that end with _0.jpeg:

representative_view = high_res_view.match(
    F("filepath").ends_with("_0.jpeg")
)

Let’s view a randomly shuffled ordering of images in this subset:

session.view = representative_view.shuffle()

Remove redundant images in the dataset

This view contains 66,297 images, or just over 19% of the original dataset. When we look at the view, however, we see that there are many very similar products. Keeping all of these copies will likely only add cost to our labeling and model training, without noticeably improving performance. Instead, let’s get rid of the near duplicates to create a smaller dataset that still packs the same punch.

Because these images are not exact duplicates, we can’t check for pixel-wise equality. Fortunately, we can use the FiftyOne Brain to help us clean our dataset. In particular, we’ll compute an embedding for each image—a lower-dimensional vector representing the image—and then look for images whose embedding vectors are close to each other. The closer the vectors, the more similar the images.

We use a CLIP model to generate a 512-dimensional embedding vector for each image, and store these embeddings in the embedding field on the samples in our dataset:

## load model
model = foz.load_zoo_model("clip-vit-base32-torch")

## compute embeddings
representative_view.compute_embeddings(
    model,
    embeddings_field="embedding"
)

Then we compute the closeness between embeddings, using cosine similarity, and assert that any two vectors whose similarity is greater than some threshold are likely to be near duplicates. Cosine similarity scores lie in the range [0, 1], and looking at the data, a threshold score of thresh=0.5 seems to be about right. Again, this doesn’t need to be perfect. A few near-duplicate images are not likely to ruin our predictive power, and throwing away a few non-duplicate images doesn’t materially impact model performance.

results = fob.compute_similarity(
    representative_view,
    embeddings="embedding",
    brain_key="sim",
    metric="cosine"
)

results.find_duplicates(thresh=0.5)

We can view the purported duplicates to verify that they are indeed redundant:

## view the duplicates, paired up, 
## to make sure it is doing what we think it is doing
dup_view = results.duplicates_view()
session = fo.launch_app(dup_view)

When we’re happy with the result and believe these images are indeed near duplicates, we can pick one sample from each set of similar samples to keep, and ignore the others:

## get one image from each group of duplicates
dup_rep_ids = list(results.neighbors_map.keys())

# get ids of non-duplicates
non_dup_ids = representative_view.exclude(
    dup_view.values("id")
).values("id")

# ids to keep
ids = dup_rep_ids + non_dup_ids

# create view from ids
non_dup_view = representative_view[ids]

Now this view has 3,729 images. By cleaning the data and identifying a high-quality subset of the Fashion200K dataset, FiftyOne lets us restrict our focus from more than 300,000 images to just under 4,000, a reduction of nearly 99%. Using embeddings to remove near-duplicate images alone brought our total number of images under consideration down by more than 90%, with little if any effect on any models to be trained on this data.

Before pre-labeling this subset, we can better understand the data by visualizing the embeddings we have already computed. We can use the FiftyOne Brain’s built-in compute_visualization() method, which employs the uniform manifold approximation and projection (UMAP) technique to project the 512-dimensional embedding vectors into two-dimensional space so we can visualize them:

fob.compute_visualization(
    non_dup_view, 
    embeddings="embedding", 
    brain_key="vis"
)

Opening a new Embeddings panel in the FiftyOne app and coloring by article type, we can see that these embeddings roughly encode a notion of article type (among other things!).

Now we are ready to pre-label this data.

Inspecting these highly unique, high-resolution images, we can generate a decent initial list of styles to use as classes in our pre-labeling zero-shot classification. Our goal in pre-labeling these images is not to necessarily label each image correctly. Rather, our goal is to provide a good starting point for human annotators so we can reduce labeling time and cost.

styles = [
    "graphic",
    "lettered",
    "plain",
    "striped",
    "polka dot",
    "floral",
    "jersey",
    "checkered",
    "denim",
    "plaid",
    "houndstooth",
    "chevron",
    "paisley",
    "animal print",
    "quatrefoil",
    "camouflage",
]

We can then instantiate a zero-shot classification model for this application. We use a CLIP model, which is a general-purpose model trained on both images and natural language. We instantiate a CLIP model with the text prompt “Clothing in the style,” so that given an image, the model will output the class for which “Clothing in the style [class]” is the best fit. CLIP is not trained on retail or fashion-specific data, so this won’t be perfect, but it can save on labeling and annotation costs.

zero_shot_model = foz.load_zoo_model(
    "clip-vit-base32-torch",
    text_prompt="Clothing in the style ",
    classes=styles,
)

We then apply this model to our reduced subset and store the results in an article_style field:

non_dup_view.apply_model(
    zero_shot_model,
    label_field="article_style"
)

Launching the FiftyOne App once again, we can visualize the images with these predicted style labels. We sort by prediction confidence so we view the most confident style predictions first:

high_conf_view = non_dup_view.sort_by(
    "article_style.confidence", reverse=True
)

session.view = high_conf_view

We can see that the highest confidence predictions seem to be for “jersey,” “animal print,” “polka dot,” and “lettered” styles. This makes sense, because these styles are relatively distinct. It also seems like, for the most part, the predicted style labels are accurate.

We can also look at the lowest-confidence style predictions:

low_conf_view = non_dup_view.sort_by(
    "article_style.confidence"
)
session.view = low_conf_view

For some of these images, the appropriate style category is in the provided list, and the article of clothing is incorrectly labeled. The first image in the grid, for instance, should clearly be “camouflage” and not “chevron.” In other cases, however, the products don’t fit neatly into the style categories. The dress in the second image in the second row, for example, is not exactly “striped,” but given the same labeling options, a human annotator might also have been conflicted. As we build out our dataset, we need to decide whether to remove edge cases like these, add new style categories, or augment the dataset.
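
One lightweight way to triage such cases is to split the view by prediction confidence, sending only the confident pre-labels to the adjustment job and setting the rest aside for manual review. The following sketch uses a hypothetical cutoff that you would tune by inspecting the data:

# Hypothetical confidence cutoff; tune it by inspecting the data in the app
conf_thresh = 0.5

confident_view = non_dup_view.match(F("article_style.confidence") >= conf_thresh)
edge_case_view = non_dup_view.match(F("article_style.confidence") < conf_thresh)

print(len(confident_view), "confident pre-labels,", len(edge_case_view), "edge cases to review")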

Export the final dataset from FiftyOne

Export the final dataset with the following code:

# The directory to which to write the exported dataset
export_dir = "200kFashionDatasetExportResult"

# The name of the sample field containing the label that you wish to export
# Used when exporting labeled datasets (e.g., classification or detection)
label_field = "article_style"  # for example

# The type of dataset to export
# Any subclass of `fiftyone.types.Dataset` is supported
dataset_type = fo.types.COCODetectionDataset  # for example

# Export the dataset
high_conf_view.export(
    export_dir=export_dir,
    dataset_type=dataset_type,
    label_field=label_field,
)

We can export a smaller dataset, for example, 16 images, to the folder 200kFashionDatasetExportResult-16Images. We create a Ground Truth adjustment job using it:

# The directory to which to write the exported dataset
export_dir = "200kFashionDatasetExportResult-16Images"

# The name of the sample field containing the label that you wish to export
# Used when exporting labeled datasets (e.g., classification or detection)
label_field = "article_style"  # for example

# The type of dataset to export
# Any subclass of `fiftyone.types.Dataset` is supported
dataset_type = fo.types.COCODetectionDataset  # for example

# Export the dataset
high_conf_view.take(16).export(
    export_dir=export_dir,
    dataset_type=dataset_type,
    label_field=label_field,
)

Upload the revised dataset, convert the label format to Ground Truth, upload to Amazon S3, and create a manifest file for the adjustment job

We can convert the labels in the dataset to match the output manifest schema of a Ground Truth bounding box job, and upload the images to an Amazon Simple Storage Service (Amazon S3) bucket to launch a Ground Truth adjustment job:

import json
import boto3

# open the labels.json file of Ground Truth bounding box
# labels from the exported dataset
f = open('200kFashionDatasetExportResult-16Images/labels.json')
data = json.load(f)

# provide your aws s3 bucket name, prefix, and aws credentials
bucket_name = 'sagemaker-your-preferred-s3-bucket'
s3_prefix = 'sagemaker-your-preferred-s3-prefix'

session = boto3.Session(
    aws_access_key_id='<AWS_ACCESS_KEY_ID>',
    aws_secret_access_key='<AWS_SECRET_ACCESS_KEY>'
)
s3 = session.resource('s3')

for image in data['images']:
    file_name = image['file_name']
    file_id = file_name[:-4]
    image_id = image['id']
    
    # upload the image to s3
    s3.meta.client.upload_file('200kFashionDatasetExportResult-16Images/data/'+image['file_name'], bucket_name, s3_prefix+'/'+image['file_name'])
    
    gt_annotations = []
    confidence = 0.00
    
    for annotation in data['annotations']:
        if annotation['image_id'] == image['id']:
            confidence = annotation['score']
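            # gt_class_array (the list of style classes) and style_category
            # (the style label for this annotation) are assumed to be defined
            # earlier in the full notebook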
            gt_annotation = {
                "class_id": gt_class_array.index(style_category), 
                # convert the original ground_truth bounding box 
                #label to predicted style label
                "left": annotation['bbox'][0],
                "top": annotation['bbox'][1],
                "width": annotation['bbox'][2],
                "height": annotation['bbox'][3]
            }
            
            gt_annotations.append(gt_annotation)
            break
    
    gt_metadata_objects = []
    for gt_annotation in gt_annotations:
        gt_metadata_objects.append({
            "confidence": confidence
        })
    
    gt_label_attribute_metadata = {
        "class-map": gt_class_map,
        "objects": gt_metadata_objects,
        "type": "groundtruth/object-detection",
        "human-annotated": "yes",
        "creation-date": "2023-02-19T00:23:25.339582",
        "job-name": "labeling-job/200k-fashion-origin"
    }
    
    gt_output = {
        "source-ref": f"s3://{bucket_name}/{s3_prefix}/{image['file_name']}",
        "200k-fashion-origin": {
            "image_size": [
                {
                    "width": image['width'],
                    "height": image['height'],
                    "depth": 3
                  }
      
            ],
            "annotations": gt_annotations
        },
        "200k-fashion-origin-metadata": gt_label_attribute_metadata
    }
    

    # write to the manifest file
    with open('200k-fashion-output.manifest', 'a') as output_file:
        output_file.write(json.dumps(gt_output) + "\n")

Upload the manifest file to Amazon S3 with the following code:

s3.meta.client.upload_file('200k-fashion-output.manifest', bucket_name, s3_prefix+'/200k-fashion-output.manifest')

Create corrected styled labels with Ground Truth

To annotate your data with style labels using Ground Truth, complete the necessary steps to start a bounding box labeling job by following the procedure outlined in the Getting Started with Ground Truth guide with the dataset in the same S3 bucket.

  1. On the SageMaker console, create a Ground Truth labeling job.
  2. Set the Input dataset location to be the manifest that we created in the preceding steps.
  3. Specify an S3 path for Output dataset location.
  4. For IAM Role, choose Enter a custom IAM role ARN, then enter the role ARN.
  5. For Task category, choose Image and select Bounding box.
  6. Choose Next.
  7. In the Workers section, choose the type of workforce you would like to use.
    You can select a workforce through Amazon Mechanical Turk, third-party vendors, or your own private workforce. For more details about your workforce options, see Create and Manage Workforces.
  8. Expand Existing-labels display options and select I want to display existing labels from the dataset for this job.
  9. For Label attribute name, choose the name from your manifest that corresponds to the labels that you want to display for adjustment.
    You will only see label attribute names for labels that match the task type you selected in the previous steps.
  10. Manually enter the labels for Bounding box labeling tool.
    The labels must contain the same labels used in the public dataset. You can add new labels. The following screenshot shows how you can choose the workers and configure the tool for your labeling job.
  11. Choose Preview to preview the image and original annotations.

We have now created a labeling job in Ground Truth. After our job is complete, we can load the newly generated labeled data into FiftyOne. Ground Truth produces output data in a Ground Truth output manifest. For more details on the output manifest file, see Bounding Box Job Output. The following code shows an example of this output manifest format:

{
    "source-ref": "s3://AWSDOC-EXAMPLE-BUCKET/example_image.png",
    "bounding-box-attribute-name":
    {
        "image_size": [{ "width": 500, "height": 400, "depth":3}],
        "annotations":
        [
            {"class_id": 0, "left": 111, "top": 134,
                    "width": 61, "height": 128},
            {"class_id": 5, "left": 161, "top": 250,
                     "width": 30, "height": 30},
            {"class_id": 5, "left": 20, "top": 20,
                     "width": 30, "height": 30}
        ]
    },
    "bounding-box-attribute-name-metadata":
    {
        "objects":
        [
            {"confidence": 0.8},
            {"confidence": 0.9},
            {"confidence": 0.9}
        ],
        "class-map":
        {
            "0": "jersey",
            "5": "polka dot"
        },
        "type": "groundtruth/object-detection",
        "human-annotated": "yes",
        "creation-date": "2018-10-18T22:18:13.527256",
        "job-name": "identify-fashion-set"
    },
    "adjusted-bounding-box":
    {
        "image_size": [{ "width": 500, "height": 400, "depth":3}],
        "annotations":
        [
            {"class_id": 0, "left": 110, "top": 135,
                    "width": 61, "height": 128},
            {"class_id": 5, "left": 161, "top": 250,
                     "width": 30, "height": 30},
            {"class_id": 5, "left": 10, "top": 10,
                     "width": 30, "height": 30}
        ]
    },
    "adjusted-bounding-box-metadata":
    {
        "objects":
        [
            {"confidence": 0.8},
            {"confidence": 0.9},
            {"confidence": 0.9}
        ],
        "class-map":
        {
            "0": "dog",
            "5": "bone"
        },
        "type": "groundtruth/object-detection",
        "human-annotated": "yes",
        "creation-date": "2018-11-20T22:18:13.527256",
        "job-name": "adjust-identify-fashion-set",
        "adjustment-status": "adjusted"
    }
 }

Review labeled results from Ground Truth in FiftyOne

After the job is complete, download the output manifest of the labeling job from Amazon S3.
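
For example, you can download it with boto3; the bucket and key below are placeholders for the output dataset location you configured for the labeling job:

import boto3

s3_client = boto3.client("s3")
s3_client.download_file(
    "sagemaker-your-preferred-s3-bucket",                   # placeholder bucket
    "sagemaker-your-preferred-s3-prefix/output.manifest",   # placeholder key
    "output.manifest",
)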

Read the output manifest file:

with open('<path-to-your-output.manifest>', 'r') as fh:
    adjustment_manifest_lines = fh.readlines()

Create a FiftyOne dataset and convert the manifest lines to samples in the dataset:

def get_classification_labels(manifest_line, dataset, attr_name) -> fo.Classifications:
    label_attribute_data = manifest_line.get(attr_name)
    metadata = manifest_line.get(f"{attr_name}-metadata")
 
    annotations = label_attribute_data.get("annotations")
 
    image_data = label_attribute_data.get("image_size")[0]
    width = image_data.get("width")
    height = image_data.get("height")

    predictions = []
    for i, annotation in enumerate(annotations):
        label = metadata.get("class-map").get(str(annotation.get("class_id")))

        confidence = metadata.get("objects")[i].get("confidence")
        
        prediction = fo.Classification(label=label, confidence=confidence)

        predictions.append(prediction)

    return fo.Classifications(classifications=predictions)

def get_bounding_box_labels(manifest_line, dataset, attr_name) -> fo.Detections:
    label_attribute_data = manifest_line.get(attr_name)
    metadata = manifest_line.get(f"{attr_name}-metadata")
 
    annotations = label_attribute_data.get("annotations")
 
    image_data = label_attribute_data.get("image_size")[0]
    width = image_data.get("width")
    height = image_data.get("height")

    detections = []
    for i, annotation in enumerate(annotations):
        label = metadata.get("class-map").get(str(annotation.get("class_id")))

        confidence = metadata.get("objects")[i].get("confidence")

        # Bounding box coordinates should be relative values
        # in [0, 1] in the following format:
        # [top-left-x, top-left-y, width, height]
        bounding_box = [
            annotation.get("left") / width,
            annotation.get("top") / height,
            annotation.get("width") / width,
            annotation.get("height") / height,
        ]

        detection = fo.Detection(
            label=label, bounding_box=bounding_box, confidence=confidence
        )
        
        detections.append(detection)

    return fo.Detections(detections=detections)
    
def get_sample_from_manifest_line(manifest_line, dataset, attr_name):
    """
    For each line in manifest, transform annotations into Fiftyone format
    Args:
        line: manifest line
    Output:
        Fiftyone image sample
    """
    file_name = manifest_line.get("source-ref")[5:].split("/")[-1]
    file_loc = f'200kFashionDatasetExportResult-16Images/data/{file_name}'

    sample = fo.Sample(filepath=file_loc)

    sample['ground_truth'] = get_bounding_box_labels(
        manifest_line=manifest_line, dataset=dataset, attr_name=attr_name
    )
    sample["prediction"] = get_classification_labels(
        manifest_line=manifest_line, dataset=dataset, attr_name=attr_name
    )

    return sample

adjustment_dataset = fo.Dataset("adjustment-job-dataset")

samples = [
    get_sample_from_manifest_line(
        manifest_line=json.loads(manifest_line),
        dataset=adjustment_dataset,
        attr_name='smgt-fiftyone-style-adjustment-job',
    )
    for manifest_line in adjustment_manifest_lines
]

adjustment_dataset.add_samples(samples)

session = fo.launch_app(adjustment_dataset)

You can now see high-quality labeled data from Ground Truth in FiftyOne.

Conclusion

In this post, we showed how to build high-quality datasets by combining the power of FiftyOne by Voxel51, an open-source toolkit that allows you to manage, track, visualize, and curate your dataset, with Ground Truth, a data labeling service that allows you to efficiently and accurately label the datasets required for training ML systems by providing access to multiple built-in task templates and to a diverse workforce through Mechanical Turk, third-party vendors, or your own private workforce.

We encourage you to try out this new functionality by installing a FiftyOne instance and using the Ground Truth console to get started. To learn more about Ground Truth, refer to Label Data, Amazon SageMaker Data Labeling FAQs, and the AWS Machine Learning Blog.

Connect with the Machine Learning & AI community if you have any questions or feedback!

Join the FiftyOne community!

Join the thousands of engineers and data scientists already using FiftyOne to solve some of the most challenging problems in computer vision today!


About the Authors

Shalendra Chhabra is currently Head of Product Management for Amazon SageMaker Human-in-the-Loop (HIL) Services. Previously, Shalendra incubated and led Language and Conversational Intelligence for Microsoft Teams Meetings, was EIR at Amazon Alexa Techstars Startup Accelerator, VP of Product and Marketing at Discuss.io, Head of Product and Marketing at Clipboard (acquired by Salesforce), and Lead Product Manager at Swype (acquired by Nuance). In total, Shalendra has helped build, ship, and market products that have touched more than a billion lives.

Jacob Marks is a Machine Learning Engineer and Developer Evangelist at Voxel51, where he helps bring transparency and clarity to the world’s data. Prior to joining Voxel51, Jacob founded a startup to help emerging musicians connect and share creative content with fans. Before that, he worked at Google X, Samsung Research, and Wolfram Research. In a past life, Jacob was a theoretical physicist, completing his PhD at Stanford, where he investigated quantum phases of matter. In his free time, Jacob enjoys climbing, running, and reading science fiction novels.

Jason Corso is co-founder and CEO of Voxel51, where he steers strategy to help bring transparency and clarity to the world’s data through state-of-the-art flexible software. He is also a Professor of Robotics, Electrical Engineering, and Computer Science at the University of Michigan, where he focuses on cutting-edge problems at the intersection of computer vision, natural language, and physical platforms. In his free time, Jason enjoys spending time with his family, reading, being in nature, playing board games, and all sorts of creative activities.

Brian Moore is co-founder and CTO of Voxel51, where he leads technical strategy and vision. He holds a PhD in Electrical Engineering from the University of Michigan, where his research was focused on efficient algorithms for large-scale machine learning problems, with a particular emphasis on computer vision applications. In his free time, he enjoys badminton, golf, hiking, and playing with his twin Yorkshire Terriers.

Zhuling Bai is a Software Development Engineer at Amazon Web Services. She works on developing large-scale distributed systems to solve machine learning problems.

Read More

Achieve high performance with lowest cost for generative AI inference using AWS Inferentia2 and AWS Trainium on Amazon SageMaker

Achieve high performance with lowest cost for generative AI inference using AWS Inferentia2 and AWS Trainium on Amazon SageMaker

The world of artificial intelligence (AI) and machine learning (ML) has been witnessing a paradigm shift with the rise of generative AI models that can create human-like text, images, code, and audio. Compared to classical ML models, generative AI models are significantly bigger and more complex. However, their increasing complexity also comes with high costs for inference and a growing need for powerful compute resources. The high cost of inference for generative AI models can be a barrier to entry for businesses and researchers with limited resources, necessitating the need for more efficient and cost-effective solutions. Furthermore, the majority of generative AI use cases involve human interaction or real-world scenarios, necessitating hardware that can deliver low-latency performance. AWS has been innovating with purpose-built chips to address the growing need for powerful, efficient, and cost-effective compute hardware.

Today, we are excited to announce that Amazon SageMaker supports AWS Inferentia2 (ml.inf2) and AWS Trainium (ml.trn1) based SageMaker instances to host generative AI models for real-time and asynchronous inference. ml.inf2 instances are available for model deployment on SageMaker in US East (Ohio) and ml.trn1 instances in US East (N. Virginia).

You can use these instances on SageMaker to achieve high performance at a low cost for generative AI models, including large language models (LLMs), Stable Diffusion, and vision transformers. In addition, you can use Amazon SageMaker Inference Recommender to help you run load tests and evaluate the price-performance benefits of deploying your model on these instances.

You can use ml.inf2 and ml.trn1 instances to run your ML applications on SageMaker for text summarization, code generation, video and image generation, speech recognition, personalization, fraud detection, and more. You can easily get started by specifying ml.trn1 or ml.inf2 instances when configuring your SageMaker endpoint. You can use ml.trn1 and ml.inf2 compatible AWS Deep Learning Containers (DLCs) for PyTorch, TensorFlow, Hugging Face, and large model inference (LMI) to easily get started. For the full list with versions, see Available Deep Learning Containers Images.

In this post, we show the process of deploying a large language model on AWS Inferentia2 using SageMaker, without requiring any extra coding, by taking advantage of the LMI container. We use GPT4All-J, a fine-tuned GPT-J 6B model that provides chatbot-style interactions.

Overview of ml.trn1 and ml.inf2 instances

ml.trn1 instances are powered by the Trainium accelerator, which is purpose-built primarily for high-performance deep learning training of generative AI models, including LLMs. However, these instances also support inference workloads for models that are even larger than what fits into Inf2. The largest instance size, trn1.32xlarge, features 16 Trainium accelerators with 512 GB of accelerator memory in a single instance, delivering up to 3.4 petaflops of FP16/BF16 compute power. The 16 Trainium accelerators are connected with ultra-high-speed NeuronLinkv2 for streamlined collective communications.

ml.inf2 instances are powered by the AWS Inferentia2 accelerator, a purpose-built accelerator for inference. It delivers three times higher compute performance, up to four times higher throughput, and up to 10 times lower latency compared to first-generation AWS Inferentia. The largest instance size, ml.inf2.48xlarge, features 12 AWS Inferentia2 accelerators with 384 GB of accelerator memory in a single instance for a combined compute power of 2.3 petaflops for BF16/FP16. It enables you to deploy up to a 175-billion-parameter model in a single instance. For ultra-large models that don’t fit into a single accelerator, data flows directly between accelerators over NeuronLink, bypassing the CPU completely. Inf2 is the only inference-optimized instance type to offer this interconnect, a feature otherwise available only in more expensive training instances. With NeuronLink, Inf2 supports faster distributed inference and improves throughput and latency.

Both AWS Inferentia2 and Trainium accelerators have two NeuronCore-v2 cores, 32 GB of HBM memory, and dedicated collective-compute engines, which automatically optimize runtime by overlapping computation and communication when doing multi-accelerator inference. For more details on the architecture, refer to Trainium and Inferentia devices.

The following diagram shows an example architecture using AWS Inferentia2.

AWS Neuron SDK

AWS Neuron is the SDK used to run deep learning workloads on AWS Inferentia and Trainium based instances. AWS Neuron includes a deep learning compiler, runtime, and tools that are natively integrated into TensorFlow and PyTorch. With Neuron, you can develop, profile, and deploy high-performance ML workloads on ml.trn1 and ml.inf2.

The Neuron Compiler accepts ML models in various formats (TensorFlow, PyTorch, XLA HLO) and optimizes them to run on Neuron devices. The Neuron compiler is invoked within the ML framework, where ML models are sent to the compiler by the Neuron framework plugin. The resulting compiler artifact is called a NEFF (Neuron Executable File Format) file, which is in turn loaded onto the Neuron device by the Neuron runtime.

The Neuron runtime consists of a kernel driver and C/C++ libraries, which provide APIs to access AWS Inferentia and Trainium Neuron devices. The Neuron ML framework plugins for TensorFlow and PyTorch use the Neuron runtime to load and run models on the NeuronCores. The Neuron runtime loads compiled deep learning models (NEFF files) onto the Neuron devices and is optimized for high throughput and low latency.
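
As a concrete, simplified illustration of this compile flow, the following sketch traces a toy PyTorch model with torch_neuronx on a Neuron-enabled instance; the model is illustrative only and not the one deployed later in this post:

import torch
import torch_neuronx

# A toy model used only to illustrate the compile step
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 2),
).eval()
example_inputs = torch.rand(1, 128)

# torch_neuronx.trace invokes the Neuron compiler and returns a TorchScript
# module backed by a NEFF artifact that the Neuron runtime loads onto the device
neuron_model = torch_neuronx.trace(model, example_inputs)
torch.jit.save(neuron_model, "model_neuron.pt")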

Host NLP models using SageMaker ml.inf2 instances

Before we dive deep into serving LLMs with transformers-neuronx, an open-source library that shards the model’s large weight matrices onto multiple NeuronCores, let’s briefly go through the typical deployment flow for a model that fits onto a single NeuronCore.

Check the list of supported models to ensure the model is supported on AWS Inferentia2. Next, the model needs to be pre-compiled by the Neuron Compiler. You can use a SageMaker notebook or an Amazon Elastic Compute Cloud (Amazon EC2) instance to compile the model. You can use the SageMaker Python SDK to deploy models using popular deep learning frameworks such as PyTorch, as shown in the following code. You can deploy your model to SageMaker hosting services and get an endpoint that can be used for inference. These endpoints are fully managed and support auto scaling.

from sagemaker.pytorch.model import PyTorchModel

pytorch_model = PyTorchModel(
    model_data=s3_model_uri,
    role=role,
    source_dir="code",
    entry_point="inference.py",
    image_uri=ecr_image
)

predictor = pytorch_model.deploy(
    initial_instance_count=1, 
    instance_type="ml.inf2.xlarge"
)

Refer to Developer Flows for more details on typical development flows of Inf2 on SageMaker with sample scripts.

Host LLMs using SageMaker ml.inf2 instances

Large language models with billions of parameters are often too big to fit on a single accelerator. This necessitates the use of model parallel techniques for hosting LLMs across multiple accelerators. Another crucial requirement for hosting LLMs is the implementation of a high-performance model-serving solution. This solution should efficiently load the model, manage partitioning, and seamlessly serve requests via HTTP endpoints.

SageMaker includes specialized deep learning containers (DLCs), libraries, and tooling for model parallelism and large model inference. For resources to get started with LMI on SageMaker, refer to Model parallelism and large model inference. SageMaker maintains DLCs with popular open-source libraries for hosting large models such as GPT, T5, OPT, BLOOM, and Stable Diffusion on AWS infrastructure. These specialized DLCs are referred to as SageMaker LMI containers.

SageMaker LMI containers use DJLServing, a model server that is integrated with the transformers-neuronx library to support tensor parallelism across NeuronCores. To learn more about how DJLServing works, refer to Deploy large models on Amazon SageMaker using DJLServing and DeepSpeed model parallel inference. The DJL model server and transformers-neuronx library serve as core components of the container, which also includes the Neuron SDK. This setup facilitates the loading of models onto AWS Inferentia2 accelerators, parallelizes the model across multiple NeuronCores, and enables serving via HTTP endpoints.

The LMI container supports loading models from an Amazon Simple Storage Service (Amazon S3) bucket or the Hugging Face Hub. The default handler script downloads the model, compiles and converts it into a Neuron-optimized format, and loads it onto the accelerators. To use the LMI container to host LLMs, we have two options:

  • No code (preferred) – This is the easiest way to deploy an LLM using an LMI container. In this method, you can use the provided default handler and just pass the model name and the parameters required in the serving.properties file to load and host the model. To use the default handler, we provide the entryPoint parameter as djl_python.transformers-neuronx.
  • Bring your own script – In this approach, you have the option to create your own model.py file, which contains the code necessary for loading and serving the model. This file acts as an intermediary between the DJLServing APIs and the transformers-neuronx APIs. To customize the model loading process, you can provide serving.properties with configurable parameters. For a comprehensive list of available configurable parameters, refer to All DJL configuration options. Here is an example of a model.py file.
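
The following is a minimal sketch of what such a model.py can look like. It follows the djl_python handle() contract, but the model loading shown here uses a generic Hugging Face pipeline purely for illustration rather than the transformers-neuronx loading logic used by the default handler:

from djl_python import Input, Output
from transformers import pipeline

generator = None

def handle(inputs: Input) -> Output:
    global generator
    if generator is None:
        # Placeholder model load; a real handler would load and partition the
        # model across NeuronCores according to serving.properties
        generator = pipeline("text-generation", model="gpt2")

    if inputs.is_empty():
        # Model-load / warm-up request
        return None

    data = inputs.get_as_json()
    result = generator(data["inputs"], max_new_tokens=64)
    return Output().add_as_json(result)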

Runtime architecture

The tensor_parallel_degree property value determines the distribution of tensor parallel modules across multiple NeuronCores. For instance, inf2.24xlarge has six AWS Inferentia2 accelerators. Each AWS Inferentia2 accelerator has two NeuronCores, and each NeuronCore has 16 GB of dedicated high bandwidth memory (HBM) for storing tensor parallel modules. With a tensor parallel degree of 4, LMI allocates three copies of the model, each utilizing four NeuronCores. As shown in the following diagram, when the LMI container starts, the model is loaded and traced first in CPU addressable memory. When the tracing is complete, the model is partitioned across the NeuronCores based on the tensor parallel degree.

LMI uses DJLServing as its model serving stack. After the container’s health check passes in SageMaker, the container is ready to serve inference requests. DJLServing launches a number of Python processes equal to TOTAL_NUMBER_OF_NEURON_CORES / TENSOR_PARALLEL_DEGREE. Each Python process contains a number of C++ threads equal to TENSOR_PARALLEL_DEGREE, and each C++ thread holds one shard of the model on one NeuronCore.

By default, each worker (Python process) runs inference sequentially when the server is invoked with multiple independent requests. Although this is easier to set up, it usually doesn’t fully utilize the accelerator’s compute power. To address this, DJLServing offers built-in dynamic batching to combine these independent inference requests on the server side into a larger batch dynamically and increase throughput. All the requests reach the dynamic batcher first before entering the actual job queues to wait for inference. You can set your preferred batch sizes for dynamic batching using the batch_size setting in serving.properties. You can also configure max_batch_delay to specify the maximum delay time in the batcher to wait for other requests to join the batch, based on your latency requirements. The throughput also depends on the number of model copies and the Python process groups launched in the container. As shown in the following diagram, with the tensor parallel degree set to 4, the LMI container launches three Python process groups, each holding a full copy of the model. This allows you to increase the batch size and get higher throughput.

SageMaker notebook for deploying LLMs

In this section, we provide a step-by-step walkthrough of deploying GPT4All-J, a 6-billion-parameter model that is 24 GB in FP32. GPT4All-J is a popular chatbot model that has been trained on a vast variety of interaction content like word problems, dialogs, code, poems, songs, and stories. It is a fine-tuned GPT-J model that generates responses similar to human interactions.

The complete notebook for this example is provided on GitHub. We can use the SageMaker Python SDK to deploy the model to an Inf2 instance. We use the provided default handler to load the model, so we just need to provide a serving.properties file. This file has the required configurations for the DJL model server to download and host the model. We can specify the name of the Hugging Face model using the model_id parameter to download the model directly from the Hugging Face repo. Alternatively, you can download the model from Amazon S3 by providing the s3url parameter. The entryPoint parameter is configured to point to the library to load the model. For more details on djl_python.transformers-neuronx, refer to the GitHub code.

The tensor_parallel_degree property value determines the distribution of tensor parallel modules across multiple devices. For instance, with 12 NeuronCores and a tensor parallel degree of 4, LMI will allocate three model copies, each utilizing four NeuronCores. You can also define the precision type using the dtype property. The n_positions parameter defines the sum of the maximum input and output sequence lengths for the model. See the following code:

%%writefile serving.properties
engine=Python
option.entryPoint=djl_python.transformers-neuronx
#option.model_id=nomic-ai/gpt4all-j
option.s3url = {{s3url}}
option.tensor_parallel_degree=2
option.model_loading_timeout=2400
option.n_positions=512

Construct the tarball containing serving.properties and upload it to an S3 bucket. Although the default handler is used in this example, you can develop a model.py file for customizing the loading and serving process. If there are any packages that need installation, include them in the requirements.txt file. See the following code:

%%sh
mkdir mymodel
mv serving.properties mymodel/
tar czvf mymodel.tar.gz mymodel/
rm -rf mymodel

s3_code_prefix = "large-model-lmi/code"
bucket = sess.default_bucket()  # bucket to house artifacts
code_artifact = sess.upload_data("mymodel.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")

Retrieve the DJL container image and create the SageMaker model:

##Retrieve djl container image
from sagemaker import image_uris
from sagemaker.model import Model

image_uri = image_uris.retrieve(
        framework="djl-deepspeed",
        region=sess.boto_session.region_name,
        version="0.21.0"
    )
image_uri = image_uri.split(":")[0] + ":" + "0.22.1-neuronx-sdk2.9.0"

model = Model(image_uri=image_uri, model_data=code_artifact, env=env, role=role)

Next, we create the SageMaker endpoint with the model configuration defined earlier. The container downloads the model into the /tmp space because SageMaker maps the /tmp to Amazon Elastic Block Store (Amazon EBS). We need to add a volume_size parameter to ensure the /tmp directory has enough space to download and compile the model. We set container_startup_health_check_timeout to 3,600 seconds to ensure the health check starts after the model is ready. We use the ml.inf2.8xlarge instance. See the following code:

instance_type = "ml.inf2.8xlarge"
endpoint_name = sagemaker.utils.name_from_base("lmi-model")


model.deploy(initial_instance_count=1,
             instance_type=instance_type,
             endpoint_name=endpoint_name,
             container_startup_health_check_timeout=3600,
             volume_size=256
            )

After the SageMaker endpoint has been created, we can make real-time predictions against SageMaker endpoints using the Predictor object:

import sagemaker
from sagemaker import serializers, deserializers

# our requests and responses will be in JSON format, so we specify the serializer and the deserializer
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
    deserializer=deserializers.JSONDeserializer(),
)

predictor.predict(
    {"inputs": "write a blog on new York", "parameters": {}}
)

Clean up

Delete the endpoints to save costs after you finish your tests:

# - Delete the end point
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
model.delete_model()

Conclusion

In this post, we showcased the newly launched capability of SageMaker, which now supports ml.inf2 and ml.trn1 instances for hosting generative AI models. We demonstrated how to deploy GPT4All-J, a generative AI model, on AWS Inferentia2 using SageMaker and the LMI container, without writing any code. We also showcased how you can use DJLServing and transformers-neuronx to load, partition, and serve a model.

Inf2 instances provide the most cost-effective way to run generative AI models on AWS. For performance details, refer to Inf2 Performance.

Check out the GitHub repo for an example notebook. Try it out and let us know if you have any questions!


About the Authors

Vivek Gangasani is a Senior Machine Learning Solutions Architect at Amazon Web Services. He works with Machine Learning Startups to build and deploy AI/ML applications on AWS. He is currently focused on delivering solutions for MLOps, ML Inference and low-code ML. He has worked on projects in different domains, including Natural Language Processing and Computer Vision.

Hiroshi Tokoyo is a Solutions Architect at AWS Annapurna Labs. Based in Japan, he joined Annapurna Labs even before the acquisition by AWS and has consistently helped customers with Annapurna Labs technology. His recent focus is on Machine Learning solutions based on purpose-built silicon, AWS Inferentia and Trainium.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing, and Artificial Intelligence. He focuses on Deep learning including NLP and Computer Vision domains. He helps customers achieve high performance model inference on SageMaker.

Qing Lan is a Software Development Engineer in AWS. He has been working on several challenging products in Amazon, including high performance ML inference solutions and high performance logging system. Qing’s team successfully launched the first Billion-parameter model in Amazon Advertising with very low latency required. Qing has in-depth knowledge on the infrastructure optimization and Deep Learning acceleration.

Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.

Alan Tan is a Senior Product Manager with SageMaker leading efforts on large model inference. He’s passionate about applying Machine Learning to the area of Analytics. Outside of work, he enjoys the outdoors.

Varun Syal is a Software Development Engineer with AWS Sagemaker working on critical customer facing features for the ML Inference platform. He is passionate about working in the Distributed Systems and AI space. In his spare time, he likes reading and gardening.

Read More

Automate the deployment of an Amazon Forecast time-series forecasting model

Automate the deployment of an Amazon Forecast time-series forecasting model

Time series forecasting refers to the process of predicting future values of time series data (data that is collected at regular intervals over time). Simple methods for time series forecasting use historical values of the same variable whose future values need to be predicted, whereas more complex, machine learning (ML)-based methods use additional information, such as the time series data of related variables.

Amazon Forecast is an ML-based time series forecasting service that includes algorithms that are based on over 20 years of forecasting experience used by Amazon.com, bringing the same technology used at Amazon to developers as a fully managed service, removing the need to manage resources. Forecast uses ML to learn not only the best algorithm for each item, but also the best ensemble of algorithms for each item, automatically creating the best model for your data.

This post describes how to deploy recurring Forecast workloads (time series forecasting workloads) with no code using AWS CloudFormation, AWS Step Functions, and AWS Systems Manager. The method presented here helps you build a pipeline that allows you to use the same workflow starting from the first day of your time series forecasting experimentation through the deployment of the model into production.

Time series forecasting using Forecast

The workflow for Forecast involves the following common concepts:

  • Importing datasets – In Forecast, a dataset group is a collection of datasets, schema, and forecast results that go together. Each dataset group can have up to three datasets, one of each dataset type: target time series (TTS), related time series (RTS), and item metadata. A dataset is a collection of files that contain data that is relevant for a forecasting task. A dataset must conform to the schema defined within Forecast. For more details, refer to Importing Datasets.
  • Training predictors – A predictor is a Forecast-trained model used for making forecasts based on time series data. During training, Forecast calculates accuracy metrics that you use to evaluate the predictor and decide whether to use the predictor to generate a forecast. For more information, refer to Training Predictors.
  • Generating forecasts – You can then use the trained model for generating forecasts for a future time horizon, known as the forecasting horizon. Forecast provides forecasts at various specified quantiles. For example, a forecast at the 0.90 quantile will estimate a value that is lower than the observed value 90% of the time. By default, Forecast uses the following values for the predictor forecast types: 0.1 (P10), 0.5 (P50), and 0.9 (P90). Forecasts at various quantiles are typically used to provide a prediction interval (an upper and lower bound for forecasts) to account for forecast uncertainty.

You can implement this workflow in Forecast either from the AWS Management Console, the AWS Command Line Interface (AWS CLI), via API calls using Python notebooks, or via automation solutions. The console and AWS CLI methods are best suited for quick experimentation to check the feasibility of time series forecasting using your data. The Python notebook method is great for data scientists already familiar with Jupyter notebooks and coding, and provides maximum control and tuning. However, the notebook-based method is difficult to operationalize. Our automation approach facilitates rapid experimentation, eliminates repetitive tasks, and allows easier transition between various environments (development, staging, production).
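
For reference, whichever method you choose, the underlying workflow boils down to a handful of Forecast API calls, sketched here with boto3. All names and ARNs are placeholders, and each call is asynchronous, so in practice you wait for each resource to become ACTIVE before starting the next step:

import boto3

forecast = boto3.client("forecast")

# 1. Create a dataset group and import the target time series data
#    (assumes the target time series dataset was already created with create_dataset)
forecast.create_dataset_group(DatasetGroupName="my_dataset_group", Domain="CUSTOM")
forecast.create_dataset_import_job(
    DatasetImportJobName="my_import_job",
    DatasetArn="arn:aws:forecast:<region>:<account>:dataset/my_tts",
    DataSource={"S3Config": {
        "Path": "s3://my-bucket/tts.csv",
        "RoleArn": "arn:aws:iam::<account>:role/MyForecastRole",
    }},
)

# 2. Train a predictor
forecast.create_auto_predictor(
    PredictorName="my_predictor",
    ForecastHorizon=14,
    ForecastFrequency="D",
    DataConfig={"DatasetGroupArn": "arn:aws:forecast:<region>:<account>:dataset-group/my_dataset_group"},
)

# 3. Generate forecasts at the default quantiles (P10, P50, P90)
forecast.create_forecast(
    ForecastName="my_forecast",
    PredictorArn="arn:aws:forecast:<region>:<account>:predictor/my_predictor",
)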

In this post, we describe an automation approach to using Forecast that allows you to use your own data and provides a single workflow that you can use seamlessly throughout the lifecycle of the development of your forecasting solution, from the first days of experimentation through the deployment of the solution in your production environment.

Solution overview

In the following sections, we describe a complete end-to-end workflow that serves as a template to follow for automated deployment of time series forecasting models using Forecast. This workflow creates forecasted data points from an open-source input dataset; however, you can use the same workflow for your own data, as long as you can format your data according to the steps outlined in this post. After you upload the data, we walk you through the steps to create Forecast dataset groups, import data, train ML models, and produce forecasted data points on future unseen time horizons from raw data. All of this is possible without having to write or compile code.

The following diagram illustrates the forecasting workflow.

Cyclical forecasting workflow

The solution is deployed using two CloudFormation templates: the dependencies template and the workload template. CloudFormation enables you to perform AWS infrastructure deployments predictably and repeatedly by using templates describing the resources to be deployed. A deployed template is referred to as a stack. We’ve taken care of defining the infrastructure in the solution for you in the two provided templates. The dependencies template defines prerequisite resources used by the workload template, such as an Amazon Simple Storage Service (Amazon S3) bucket for object storage and AWS Identity and Access Management (IAM) permissions for AWS API actions. The resources defined in the dependencies template may be shared by multiple workload templates. The workload template defines the resources used to ingest data, train a predictor, and generate a forecast.

Deployment workflow

Deploy the dependencies CloudFormation template

First, let’s deploy the dependencies template to create our prerequisite resources. The dependencies template deploys an optional S3 bucket, AWS Lambda functions, and IAM roles. Amazon S3 is a low-cost, highly available, resilient, object storage service. We use an S3 bucket in this solution to store source data and trigger the workflow, resulting in a forecast. Lambda is a serverless, event-driven compute service that lets you run code without provisioning or managing servers. The dependencies template includes functions to do things like create a dataset group in Forecast and purge objects within an S3 bucket before deleting the bucket. IAM roles define permissions within AWS for users and services. The dependencies template deploys a role to be used by Lambda and another for Step Functions, a workflow management service that will coordinate the tasks of data ingestion and processing, as well as predictor training and inference using Forecast.

Complete the following steps to deploy the dependencies template:

  1. On the console, select the desired Region supported by Forecast for solution deployment.
  2. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  3. Choose Create stack and choose With new resources (standard).
    Create stack
  4. For Template source, select Amazon S3 URL.
  5. Enter the template URL: https://amazon-forecast-samples.s3.us-west-2.amazonaws.com/ml_ops/forecast-mlops-dependency.yaml.
  6. Choose Next.
    Specify template
  7. For Stack name, enter forecast-mlops-dependency.
  8. Under Parameters, choose to use an existing S3 bucket or create a new one, then provide the name of the bucket.
  9. Choose Next.
  10. Choose Next to accept the default stack options.
  11. Select the check box to acknowledge the stack creates IAM resources, then choose Create stack to deploy the template.

You should see the template deploy as the forecast-mlops-dependency stack. When the status changes to CREATE_COMPLETE, you may move to the next step.
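If you prefer to script stack creation instead of clicking through the console, the steps above map to a single CreateStack call. The following is a minimal boto3 sketch; it assumes your default Region is one supported by Forecast, and any template parameters (such as the bucket choice in step 8) would be passed through the Parameters argument using the keys defined in the template.

```python
import boto3

# Uses your default Region; choose one that is supported by Forecast.
cfn = boto3.client("cloudformation")

cfn.create_stack(
    StackName="forecast-mlops-dependency",
    TemplateURL=(
        "https://amazon-forecast-samples.s3.us-west-2.amazonaws.com/"
        "ml_ops/forecast-mlops-dependency.yaml"
    ),
    # Acknowledges that the template creates IAM resources,
    # matching the check box in the console flow.
    Capabilities=["CAPABILITY_NAMED_IAM"],
)

# Block until the stack reaches CREATE_COMPLETE (the same signal as the console).
cfn.get_waiter("stack_create_complete").wait(StackName="forecast-mlops-dependency")
```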

Deploy the workload CloudFormation template

Next, let’s deploy the workload template to create our prerequisite resources. The workload template deploys Step Functions state machines for workflow management, AWS Systems Manager Parameter Store parameters to store parameter values from AWS CloudFormation and inform the workflow, an Amazon Simple Notification Service (Amazon SNS) topic for workflow notifications, and an IAM role for workflow service permissions.

The solution creates five state machines:

  • CreateDatasetGroupStateMachine – Creates a Forecast dataset group for data to be imported into.
  • CreateImportDatasetStateMachine – Imports source data from Amazon S3 into a dataset group for training.
  • CreateForecastStateMachine – Manages the tasks required to train a predictor and generate a forecast.
  • AthenaConnectorStateMachine – Enables you to write SQL queries with the Amazon Athena connector to land data in Amazon S3. This is an optional process to obtain historical data in the required format for Forecast by using Athena instead of placing files manually in Amazon S3.
  • StepFunctionWorkflowStateMachine – Coordinates calls out to the other four state machines and manages the overall workflow.

Parameter Store, a capability of Systems Manager, provides secure, hierarchical storage and programmatic retrieval of configuration data and secrets. Parameter Store is used to store parameters set in the workload stack as well as other parameters used by the workflow.
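If you want to inspect the workflow's configuration outside the console, a small boto3 sketch like the following lists every parameter the workload stack created. The /forecast/<StackName>/ path follows the convention used later in this post; aiml42 is a hypothetical stack name.

```python
import boto3

ssm = boto3.client("ssm")
stack_name = "aiml42"  # hypothetical workload stack name

# Walk the parameter hierarchy created by the workload stack.
paginator = ssm.get_paginator("get_parameters_by_path")
for page in paginator.paginate(Path=f"/forecast/{stack_name}/", Recursive=True):
    for parameter in page["Parameters"]:
        print(parameter["Name"], "=", parameter["Value"])
```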

Complete the following steps to deploy the workload template:

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Choose Create stack and choose With new resources (standard).
  3. For Template source, select Amazon S3 URL.
  4. Enter the template URL: https://amazon-forecast-samples.s3.us-west-2.amazonaws.com/ml_ops/forecast-mlops-solution-guidance.yaml.
  5. Choose Next.
  6. For Stack name, enter a name.
  7. Accept the default values or modify the parameters.

Be sure to enter the S3 bucket name from the dependencies stack for S3 Bucket and a valid email address for SNSEndpoint even if you accept the default parameter values.

The following list describes each parameter and, where applicable, the related Forecast documentation:

  • DatasetGroupFrequencyRTS – The frequency of data collection for the RTS dataset.
  • DatasetGroupFrequencyTTS – The frequency of data collection for the TTS dataset.
  • DatasetGroupName – A short name for the dataset group, a self-contained workload. For more information, see CreateDatasetGroup.
  • DatasetIncludeItem – Specify whether you want to provide item metadata for this use case.
  • DatasetIncludeRTS – Specify whether you want to provide a related time series for this use case.
  • ForecastForecastTypes – When a CreateForecast job runs, this declares which quantiles to produce predictions for. You may choose up to five values in this array. Edit this value to include values according to need. For more information, see CreateForecast.
  • PredictorAttributeConfigs – For the target variable in the TTS and each numeric field in the RTS datasets, a record must exist for each time interval for each item. This configuration determines how missing records are filled in: with 0, NaN, or otherwise. We recommend filling the gaps in the TTS with NaN instead of 0; with 0, the model might wrongly learn to bias forecasts toward 0. The provided guidance fills gaps with NaN by default. Consult with your AWS Solutions Architect with any questions on this. For more information, see CreateAutoPredictor.
  • PredictorExplainPredictor – Valid values are TRUE or FALSE. This determines whether explainability is enabled for your predictor, which can help you understand how values in the RTS and item metadata influence the model. For more information, see Explainability.
  • PredictorForecastDimensions – You may want to forecast at a finer grain than item. Here, you can specify dimensions such as location, cost center, or whatever your needs are. This needs to agree with the dimensions in your RTS and TTS. If you have no dimension, the correct value is null, by itself and in all lowercase. null is a reserved word that lets the system know there is no parameter for the dimension. For more information, see CreateAutoPredictor.
  • PredictorForecastFrequency – Defines the time scale at which your model and predictions will be generated, such as daily, weekly, or monthly. The drop-down menu helps you choose allowed values. This needs to agree with your RTS time scale if you're using an RTS. For more information, see CreateAutoPredictor.
  • PredictorForecastHorizon – The number of time steps that the model predicts. The forecast horizon is also called the prediction length. For more information, see CreateAutoPredictor.
  • PredictorForecastOptimizationMetric – Defines the accuracy metric used to optimize the predictor. The drop-down menu helps you choose: weighted quantile loss balances over- or under-forecasting, RMSE is concerned with units, and WAPE/MAPE are concerned with percent errors. For more information, see CreateAutoPredictor.
  • PredictorForecastTypes – When a CreateAutoPredictor job runs, this declares which quantiles are used to train prediction points. You may choose up to five values in this array, allowing you to balance over- and under-forecasting. Edit this value to include values according to need. For more information, see CreateAutoPredictor.
  • S3Bucket – The name of the S3 bucket where input data and output data are written for this workload.
  • SNSEndpoint – A valid email address to receive notifications when the predictor and Forecast jobs are complete.
  • SchemaITEM – Defines the physical order, column names, and data types for your item metadata dataset. This is an optional file provided in the solution example. For more information, see CreateDataset.
  • SchemaRTS – Defines the physical order, column names, and data types for your RTS dataset. The dimensions must agree with your TTS. The time grain of this file governs the time grain at which predictions can be made. This is an optional file provided in the solution example. For more information, see CreateDataset.
  • SchemaTTS – Defines the physical order, column names, and data types for your TTS dataset, the only required dataset. The file must contain a target value, timestamp, and item at a minimum. For more information, see CreateDataset.
  • TimestampFormatRTS – Defines the timestamp format provided in the RTS file. For more information, see CreateDatasetImportJob.
  • TimestampFormatTTS – Defines the timestamp format provided in the TTS file. For more information, see CreateDatasetImportJob.
  8. Choose Next to accept the default stack options.
  9. Select the check box to acknowledge the stack creates IAM resources, then choose Create stack to deploy the template.

You should see the template deploy as the stack name you chose earlier. When the status changes to CREATE_COMPLETE, you may move to the data upload step.

Upload the data

In the previous section, you provided a stack name and an S3 bucket. This section describes how to deposit the publicly available dataset Food Demand in this bucket. If you’re using your own dataset, refer to Datasets to prepare your dataset in a format the deployment is expecting. The dataset needs to contain at least the target time series, and optionally, the related time series and the item metadata:

  • TTS is the time series data that includes the field that you want to generate a forecast for; this field is called the target field
  • RTS is time series data that doesn’t include the target field, but includes a related field
  • The item data file isn’t time series data, but includes metadata information about the items in the TTS or RTS datasets

Complete the following steps:

  1. If you’re using the provided sample dataset, download the dataset Food Demand to your computer and unzip the file, which creates three files inside three directories (rts, tts, item).
  2. On the Amazon S3 console, navigate to the bucket you created earlier.
  3. Choose Create folder.
  4. Use the same string as your workload stack name for the folder name.
  5. Choose Upload.
  6. Choose the three dataset folders, then choose Upload.

When the upload is complete, you should see something like the following screenshot. For this example, our folder is aiml42.

S3 folder structure
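If you script your data preparation, a boto3 sketch like the following performs the same upload. The bucket name and local path are placeholders, and the key prefix must match your workload stack name.

```python
import os
import boto3

s3 = boto3.client("s3")
bucket = "<your-dependencies-bucket>"
stack_name = "aiml42"          # the prefix must match the workload stack name
local_root = "./food_demand"   # contains tts/, rts/, and item/ after unzipping

# Upload every file in the three dataset folders under <stack_name>/<folder>/.
for folder in ("tts", "rts", "item"):
    folder_path = os.path.join(local_root, folder)
    for file_name in os.listdir(folder_path):
        s3.upload_file(
            os.path.join(folder_path, file_name),
            bucket,
            f"{stack_name}/{folder}/{file_name}",
        )
```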

Create a Forecast dataset group

Complete the steps in this section to create a dataset group as a one-time event for each workload. Going forward, you should plan on running the import data, create predictor, and create forecast steps as appropriate, as a series, according to your schedule, which could be daily, weekly, or otherwise.

  1. On the Step Functions console, locate the state machine containing Create-Dataset-Group.
  2. On the state machine detail page, choose Start execution.
  3. Choose Start execution again to confirm.

The state machine takes about 1 minute to run. When it’s complete, the value under Execution Status should change from Running to Succeeded.

Execution status
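Once you start automating the recurring steps, you can start the same state machine with the Step Functions API instead of the console. The following sketch looks the state machine up by name; note that list_state_machines returns pages, so very large accounts may need pagination.

```python
import boto3

sfn = boto3.client("stepfunctions")

# Find the Create-Dataset-Group state machine deployed by the workload stack.
machines = sfn.list_state_machines()["stateMachines"]
target = next(m for m in machines if "Create-Dataset-Group" in m["name"])

execution = sfn.start_execution(stateMachineArn=target["stateMachineArn"])
print("Started execution:", execution["executionArn"])
```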

Import data into Forecast

Follow the steps in this section to import the dataset that you uploaded to your S3 bucket into your dataset group:

  1. On the Step Functions console, locate the state machine containing Import-Dataset.
  2. On the state machine detail page, choose Start Execution.
  3. Choose Start execution again to confirm.

The amount of time the state machine takes to run depends on the dataset being processed.

Graph inspector

  1. While this is running, in your browser, open another tab and navigate to the Forecast console.
  2. On the Forecast console, choose View dataset groups and navigate to the dataset group with the name specified for DatasetGroupName from your workload stack.
  3. Choose View datasets.

You should see the data imports in progress.

Data imports in progress

When the state machine for Import-Dataset is complete, you can proceed to the next step to build your time series data model.

Create AutoPredictor (train a time series model)

This section describes how to train an initial predictor with Forecast. You may choose to create a new predictor (your first, baseline predictor) or retrain a predictor during each production cycle, which could be daily, weekly, or otherwise. You may also elect not to create a predictor each cycle and rely on predictor monitoring to guide you when to create one. The following figure visualizes the process of creating a production-ready Forecast predictor.

Production ready predictor workflow

To create a new predictor, complete the following steps:

  1. On the Step Functions console, locate the state machine containing Create-Predictor.
  2. On the state machine detail page, choose Start Execution.
  3. Choose Start execution again to confirm.
    The runtime depends on the dataset being processed and could take an hour or more to complete.
  4. While this is running, in your browser, open another tab and navigate to the Forecast console.
  5. On the Forecast console, choose View dataset groups and navigate to the dataset group with the name specified for DatasetGroupName from your workload stack.
  6. Choose View predictors.

You should see the predictor training in progress (Training status shows “Create in progress…”).

Predictor training in progress

When the state machine for Create-Predictor is complete, you can evaluate its performance.

As part of the state machine, the system creates a predictor and also runs a BacktestExport job that writes out time series-level predictor metrics to Amazon S3. These files are located in two S3 folders under the backtest-export folder:

  • accuracy-metrics-values – Provides item-level accuracy metric computations so you can understand the performance of a single time series. This allows you to investigate the spread rather than focusing on the global metrics alone.
  • forecasted-values – Provides step-level predictions for each time series in the backtest window. This enables you to compare the actual target value from a holdout test set to the predicted quantile values. Reviewing this helps formulate ideas on how to provide additional data features in RTS or item metadata to help better estimate future values, further reducing loss. You may download backtest-export files from Amazon S3 or query them in place with Athena.

S3 bucket contents

With your own data, you need to closely inspect the predictor outcomes and ensure the metrics meet your expected results by using the backtest export data. When satisfied, you can begin generating future-dated predictions as described in the next section.
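As an example of working with the backtest export, the following pandas sketch concatenates the accuracy metrics files so you can examine the distribution of item-level errors. The bucket name and prefix are assumptions about the key layout, so check the actual paths under backtest-export in your bucket.

```python
import boto3
import pandas as pd

s3 = boto3.client("s3")
bucket = "<your-dependencies-bucket>"
prefix = "aiml42/backtest-export/accuracy-metrics-values/"  # hypothetical layout

# Read every CSV under the accuracy-metrics-values prefix into one DataFrame.
frames = []
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith(".csv"):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"]
            frames.append(pd.read_csv(body))

metrics = pd.concat(frames, ignore_index=True)
print(metrics.describe())  # spread of item-level accuracy metrics
```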

Generate a forecast (inference about future time horizons)

This section describes how to generate forecast data points with Forecast. Going forward, you should harvest new data from the source system, import the data into Forecast, and then generate forecast data points. Optionally, you may also insert a new predictor creation after import and before forecast. The following figure visualizes the process of creating production time series forecasts using Forecast.

Production time series forecast workflow

Complete the following steps:

  1. On the Step Functions console, locate the state machine containing Create-Forecast.
  2. On the state machine detail page, choose Start Execution.
  3. Choose Start execution again to confirm.
    This state machine finishes very quickly because the system isn’t configured to generate a forecast. It doesn’t know which predictor model you have approved for inference.
    Let’s configure the system to use your trained predictor.
  4. On the Forecast console, locate the ARN for your predictor.
  5. Copy the ARN to use in a later step.
    Predictor details
  6. In your browser, open another tab and navigate to the Systems Manager console.
  7. On the Systems Manager console, choose Parameter Store in the navigation pane.
  8. Locate the parameter related to your stack (/forecast/<StackName>/Forecast/PredictorArn).
  9. On the parameter detail page, choose Edit and enter the ARN you copied for your predictor as the value.
    This is how you associate a trained predictor with the inference function of Forecast.
  10. Locate the parameter /forecast/<StackName>/Forecast/Generate and edit the value, replacing FALSE with TRUE.
    Now you’re ready to run a forecast job for this dataset group.
  11. On the Step Functions console, run the Create-Forecast state machine.

This time, the job runs as expected. As part of the state machine, the system creates a forecast and a ForecastExport job, which writes out time series predictions to Amazon S3. These files are located in the forecast folder.
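If you prefer to script this approval flow, the console steps above reduce to two Parameter Store writes followed by a state machine run. The ARNs and stack name below are placeholders; the state machine ARN can be looked up as shown earlier.

```python
import boto3

stack_name = "aiml42"                       # hypothetical workload stack name
predictor_arn = "arn:aws:forecast:<region>:<account-id>:predictor/<name>"
create_forecast_arn = "arn:aws:states:<region>:<account-id>:stateMachine:<name>"

ssm = boto3.client("ssm")

# Approve the trained predictor for inference.
ssm.put_parameter(
    Name=f"/forecast/{stack_name}/Forecast/PredictorArn",
    Value=predictor_arn, Type="String", Overwrite=True,
)

# Tell the workflow to generate a forecast on its next run.
ssm.put_parameter(
    Name=f"/forecast/{stack_name}/Forecast/Generate",
    Value="TRUE", Type="String", Overwrite=True,
)

# Run the Create-Forecast state machine.
boto3.client("stepfunctions").start_execution(stateMachineArn=create_forecast_arn)
```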

Forecast folder contents

Inside the forecast folder, you will find predictions for your items, spread across many CSV or Parquet files, depending on your selection. Each record contains the prediction for a single time step and time series, with all your chosen quantile values. You may download these files from Amazon S3, query them in place with Athena, or choose another strategy to use the data.

This wraps up the entire workflow. You can now visualize your output using any visualization tool of your choice, such as Amazon QuickSight. Alternatively, data scientists can use pandas to generate their own plots. If you choose to use QuickSight, you can connect your forecast results to QuickSight to perform data transformations, create one or more data analyses, and create visualizations.
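For example, if you go the pandas route, a minimal sketch like the following plots one item's median forecast with its prediction interval. The file name and the column names (item_id, date, p10, p50, p90) follow a typical Forecast export layout and may differ in your files, so check the CSV header first.

```python
import pandas as pd
import matplotlib.pyplot as plt

# A single export part file downloaded from the forecast folder (hypothetical name).
df = pd.read_csv("forecast_export_part0.csv")

# Keep one item and sort it by time.
item = df[df["item_id"] == df["item_id"].iloc[0]].copy()
item["date"] = pd.to_datetime(item["date"])
item = item.sort_values("date")

plt.plot(item["date"], item["p50"], label="p50 (median forecast)")
plt.fill_between(item["date"], item["p10"], item["p90"], alpha=0.3,
                 label="p10-p90 prediction interval")
plt.legend()
plt.title("Forecast with prediction interval")
plt.show()
```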

This process provides a template to follow. You will need to adapt the sample to your schema, set the forecast horizon, time resolution, and so forth according to your use case. You will also need to set a recurring schedule to harvest data from the source system, import the data, and produce forecasts. If desired, you may insert a predictor task between the import and forecast steps.

Retrain the predictor

We have walked through the process of training a new predictor, but what about retraining a predictor? Retraining a predictor is one way to reduce the cost and time involved with training a predictor on the latest available data. Rather than create a new predictor and train it on the entire dataset, we can retrain the existing predictor by providing only the new incremental data made available since the predictor was last trained. Let’s walk through how to retrain a predictor using the automation solution:

  1. On the Forecast console, choose View dataset groups.
  2. Choose the dataset group associated with the predictor you want to retrain.
  3. Choose View predictors, then choose the predictor you want to retrain.
  4. On the Settings tab, copy the predictor ARN.
    We need to update a parameter used by the workflow to identify the predictor to retrain.
  5. On the Systems Manager console, choose Parameter Store in the navigation pane.
  6. Locate the parameter /forecast/<STACKNAME>/Forecast/Predictor/ReferenceArn.
  7. On the parameter detail page, choose Edit.
  8. For Value, enter the predictor ARN.
    This identifies the correct predictor for the workflow to retrain. Next, we need to update a parameter used by the workflow to change the training strategy.
  9. Locate the parameter /forecast/<STACKNAME>/Forecast/Predictor/Strategy.
  10. On the parameter detail page, choose Edit.
  11. For Value, enter RETRAIN.
    The workflow defaults to training a new predictor; however, we can modify that behavior to retrain an existing predictor or simply reuse an existing predictor without retraining by setting this value to NONE. You may want to forego training if your data is relatively stable or you’re using automated predictor monitoring to decide when retraining is necessary.
  12. Upload the incremental training data to the S3 bucket.
  13. On the Step Functions console, locate the state machine <STACKNAME>-Create-Predictor.
  14. On the state machine detail page, choose Start execution to begin the retraining.

When the retraining is complete, the workflow will end and you will receive an SNS email notification to the email address provided in the workload template parameters.

Clean up

When you’re done with this solution, follow the steps in this section to delete related resources.

Delete the S3 bucket

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Select the bucket where data was uploaded and choose Empty to delete all data associated with the solution, including source data.
  3. Enter permanently delete to delete the bucket contents permanently.
  4. On the Buckets page, select the bucket and choose Delete.
  5. Enter the name of the bucket to confirm the deletion and choose Delete bucket.

Delete Forecast resources

  1. On the Forecast console, choose View dataset groups.
  2. Select the dataset group name associated with the solution, then choose Delete.
  3. Enter delete to delete the dataset group and associated predictors, predictor backtest export jobs, forecasts, and forecast export jobs.
  4. Choose Delete to confirm.

Delete the CloudFormation stacks

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Select the workload stack and choose Delete.
  3. Choose Delete stack to confirm deletion of the stack and all associated resources.
  4. When the deletion is complete, select the dependencies stack and choose Delete.
  5. Choose Delete to confirm.

Conclusion

In this post, we discussed some different ways to get started using Forecast. We walked through an automated forecasting solution based on AWS CloudFormation for a rapid, repeatable deployment of a Forecast pipeline from data ingestion to inference, with little infrastructure knowledge required. Finally, we saw how to retrain an existing predictor using the same workflow, reducing cost and training time.

There’s no better time than the present to start forecasting with Forecast. To start building and deploying an automated workflow, visit Amazon Forecast resources. Happy forecasting!


About the Authors

Aaron Fagan is a Principal Specialist Solutions Architect at AWS based in New York. He specializes in helping customers architect solutions in machine learning and cloud security.

Raju Patil is a Data Scientist in AWS Professional Services. He builds and deploys AI/ML solutions to assist AWS customers in overcoming their business challenges. His AWS engagements have covered a wide range of AI/ML use cases such as computer vision, time-series forecasting, and predictive analytics, etc., across numerous industries, including financial services, telecom, health care, and more. Prior to this, he has led Data Science teams in Advertising Technology, and made significant contributions to numerous research and development initiatives in computer vision and robotics. Outside of work, he enjoys photography, hiking, travel, and culinary explorations.


Get started with generative AI on AWS using Amazon SageMaker JumpStart

Get started with generative AI on AWS using Amazon SageMaker JumpStart

Generative AI is gaining a lot of public attention at present, with talk around products such as GPT4, ChatGPT, DALL-E2, Bard, and many other AI technologies. Many customers have been asking for more information on AWS’s generative AI solutions. The aim of this post is to address those needs.

This post provides an overview of generative AI with a real customer use case, provides a concise description and outlines its benefits, references an easy-to-follow demo of AWS DeepComposer for creating new musical compositions, and outlines how to get started using Amazon SageMaker JumpStart for deploying GPT2, Stable Diffusion 2.0, and other generative AI models.

Generative AI overview

Generative AI is a specific field of artificial intelligence that focuses on generating new material. It’s one of the most exciting fields in the AI world, with the potential to transform existing businesses and allow completely new business ideas to come to market. You can use generative techniques for:

  • Creating new works of art using a model such as Stable Diffusion 2.0
  • Writing a best-selling book using a model such as GPT2, Bloom, or Flan-T5-XL
  • Composing your next symphony using the Transformers technique in AWS DeepComposer

AWS DeepComposer is an educational tool that helps you understand the key concepts associated with machine learning (ML) through the language of musical composition. To learn more, refer to Generate a jazz rock track using Generative Artificial Intelligence.

Stable Diffusion, GPT2, Bloom, and Flan-T5-XL are all ML models: mathematical algorithms that are trained to identify patterns within data. After the patterns are learned, the models are deployed onto endpoints, ready for a process known as inference. New data that the model hasn’t seen before is fed to the deployed model, and new creative material is produced.

For example, with image generation models such as Stable Diffusion, we can create stunning illustrations using a few words. With text generation models such as GPT2, Bloom, and Flan-T5-XL, we can generate new literary articles, and potentially books, from a simple human sentence.

Autodesk is an AWS customer using Amazon SageMaker to help their product designers sort through thousands of iterations of visual designs for various use cases and use ML to help choose the optimal design. Specifically, they have worked with Edera Safety to help develop a spinal cord protector that protects riders from accidents while participating in sporting events, such as mountain biking. For more information, check out the video AWS Machine Learning Enables Design Optimization.

To learn more about what AWS customers are doing with generative AI and fashion, refer to Virtual fashion styling with generative AI using Amazon SageMaker.

Now that we understand what generative AI is all about, let’s jump into a JumpStart demonstration to learn how to generate new text or images with AI.

Prerequisites

Amazon SageMaker Studio is the integrated development environment (IDE) within SageMaker that provides us with all the ML features that we need in a single pane of glass. Before we can run JumpStart, we need to set up Studio. You can skip this step if you already have your own version of Studio running.

The first thing we need to do before we can use any AWS services is to make sure we have signed up for and created an AWS account. Next is to create an administrative user and a group. For instructions on both steps, refer to Set Up Amazon SageMaker Prerequisites.

The next step is to create a SageMaker domain. A domain sets up all the storage and allows you to add users to access SageMaker. For more information, refer to Onboard to Amazon SageMaker Domain. This demo is created in the AWS Region us-east-1.

Finally, you launch Studio. For this post, we recommend launching a user profile app. For instructions, refer to Launch Amazon SageMaker Studio.

Choose a JumpStart solution

Now we come to the exciting part. You should now be logged in to Studio, and see a page similar to the following screenshot.

In the navigation pane, under SageMaker JumpStart, choose Models, notebooks, solutions.

You’re presented with a range of solutions, foundation models, and other artifacts that can help you get started with a specific model or a specific business problem or use case.

If you want to experiment in a particular area, you can use the search function. Or you can simply browse the artifacts to find the relevant model or business solution for your needs.

For example, if you’re interested in fraud detection solutions, enter fraud detection into the search bar.

Fraud Detection Screenshot

If you’re interested in text generation solutions, enter text generation into the search bar. A good place to start if you want to explore a range of text generation models is to select the Intro to JS – Text Generation notebook.

JS - Text Generation

Let’s dive into a specific demonstration of the GPT-2 model.

JumpStart GPT-2 model demo

GPT 2 is a language model that helps generate human-like text based on a given prompt. We can use this type of transformer model to create new sentences and help us automate writing. This can be used for content creation such as blogs, social media posts, and books.

The GPT-2 model is part of the Generative Pre-trained Transformer family and is the predecessor of GPT-3. At the time of writing, GPT-3 is used as the foundation for the OpenAI ChatGPT application.

To start exploring the GPT-2 model demo in JumpStart, complete the following steps:

  1. On JumpStart, search for and choose GPT 2.
  2. In the Deploy Model section, expand Deployment Configuration.
  3. For SageMaker hosting instance, choose your instance (for this post, we use ml.c5.2xlarge).

Different machine types have different price points attached. At the time of writing, the ml.c5.2xlarge that we selected incurs under $0.50 per hour. For the most up-to-date pricing, refer to Amazon SageMaker Pricing.

  4. For Endpoint name, enter demo-hf-textgeneration-gpt2.
  5. Choose Deploy.

Endpoint Name & Deploy

Wait for the ML endpoint to deploy (up to 15 minutes).

  6. When the endpoint is deployed, choose Open Notebook.

Endpoint Status

You’ll see a page similar to the following screenshot.
Python Code

The document we’re using to showcase our demonstration is a Jupyter notebook, which encompasses all the necessary Python code. Note that the code in this screenshot may be slightly different from the code you have, because AWS is constantly updating these notebooks and ensuring they are secure, free of defects, and provide the best customer experience.

  7. Click into the first cell and press Ctrl+Enter to run the code block.

Code Block 1

An asterisk (*) appears to the left of the code block while the code is running; when it turns into a number, the run is complete.

  8. In the next code block, enter some sample text, then press Ctrl+Enter.

Code Block 2

  9. Press Ctrl+Enter in the third code block to run it.

After about 30-60 seconds, you will see your inference results.

For the input text “Once upon a time there were 18 sandwiches,” we get the following generated text:

Once upon a time there were 18 sandwiches, four plates with some salad, and three sandwiches with some beef. One restaurant was so nice that the food was made by hand. There were people living at the beginning of the time who were waiting so that

For the input text “And for the final time Peter said to Mary,” we get the following generated text:

And for the final time Peter said to Mary that he was a saint.

11 But Peter said that it was not a blessing, but rather that it would be the death of Peter. And when Mary heard of that Peter said to him,

You can experiment with running this third code block multiple times, and you will notice that the model makes different predictions each time.

To tailor the output using some of the advanced features, scroll down to experiment in the fourth code block.
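You can also send prompts to the endpoint from your own code instead of the notebook. The following boto3 sketch assumes the endpoint name entered earlier and mirrors the request format used by the JumpStart notebook at the time of writing (a plain-text body and a JSON response); if the notebook has been updated, check its query code for the exact content type and response fields.

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="demo-hf-textgeneration-gpt2",
    ContentType="application/x-text",  # assumed request format; verify in the notebook
    Body="Once upon a time there were 18 sandwiches,".encode("utf-8"),
)

# The model returns JSON; print it to inspect the generated text fields.
result = json.loads(response["Body"].read())
print(result)
```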

To learn more about text generation models, refer to Run text generation with Bloom and GPT models on Amazon SageMaker JumpStart.

Clean up resources

Before we move on, don’t forget to delete your endpoint when you’re finished. On the previous tab, under Delete Endpoint, choose Delete.

Delete Endpoint

If you have accidentally closed this notebook, you can also delete your endpoint via the SageMaker console. Under Inference in the navigation pane, choose Endpoints.

Select the endpoint you used and on the Actions menu, choose Delete.

Delete Endpoint
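If you prefer to clean up programmatically, the following sketch removes the endpoint along with the endpoint configuration and model that were created for it. It looks up the attached resources first rather than assuming their names.

```python
import boto3

sm = boto3.client("sagemaker")
endpoint_name = "demo-hf-textgeneration-gpt2"

# Look up the config and model attached to the endpoint before deleting it.
config_name = sm.describe_endpoint(EndpointName=endpoint_name)["EndpointConfigName"]
model_name = sm.describe_endpoint_config(
    EndpointConfigName=config_name
)["ProductionVariants"][0]["ModelName"]

sm.delete_endpoint(EndpointName=endpoint_name)        # stops the hourly charge
sm.delete_endpoint_config(EndpointConfigName=config_name)
sm.delete_model(ModelName=model_name)
```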

Now that we understand how to use our first JumpStart solution, let’s look at using a Stable Diffusion model.

JumpStart Stable Diffusion model demo

We can use the Stable Diffusion 2 model to generate images from a simple line of text. This can be used to generate content for things like social media posts, promotional material, album covers, or anything that requires creative artwork.

  1. Return to JumpStart, then search for and choose Stable Diffusion 2.

Stable Diffusion 2

  2. In the Deploy Model section, expand Deployment Configuration.
  3. For SageMaker hosting instance, choose your instance (for this post, we use ml.g5.2xlarge).
  4. For Endpoint name, enter demo-stabilityai-stable-diffusion-v2.
  5. Choose Deploy.

Because this is a larger model, it can take up to 25 minutes to deploy. When it’s ready, the endpoint status shows as In Service.

In Service

  6. Choose Open Notebook to open a Jupyter notebook with Python code.

Python Code

  7. Run the first and second code blocks.
  8. In the third code block, change the text prompt, then run the cell.

Code Block 1

Wait about 30–60 seconds for your image to appear. The following image is based on our example text.

Output Picture

Again, you can play with the advanced features in the next code block. The picture it creates is different every time.
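As with the GPT-2 endpoint, you can also invoke the Stable Diffusion endpoint directly. The sketch below assumes the request and response format used by the JumpStart notebook at the time of writing, where the generated image comes back as a nested list of RGB values; verify against your notebook if the schema has changed.

```python
import json
import boto3
import numpy as np
import matplotlib.pyplot as plt

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="demo-stabilityai-stable-diffusion-v2",
    ContentType="application/x-text",  # assumed request format; verify in the notebook
    Body="a watercolor painting of a lighthouse at sunset".encode("utf-8"),
)
payload = json.loads(response["Body"].read())

# Assumed response field: the generated image as a nested list of RGB pixel values.
image = np.array(payload["generated_image"]).astype(np.uint8)
plt.imshow(image)
plt.axis("off")
plt.show()
```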

Clean up resources

Again, don’t forget to delete your endpoint. This time, we’re using ml.g5.2xlarge, so it incurs slightly higher charges than before. At the time of writing, it was just over $1 per hour.

Finally, let’s move to AWS DeepComposer.

AWS DeepComposer

AWS DeepComposer is a great way to learn about generative AI. It allows you to use built-in melodies in your models to generate new forms of music. The model that you use determines how the input melody is transformed.

If you’re used to participating in AWS DeepRacer days to help your employees learn about reinforcement learning, consider augmenting the day with AWS DeepComposer to learn about generative AI.

For a detailed explanation and easy-to-follow demonstration of three of the models in this post, refer to Generate a jazz rock track using Generative Artificial Intelligence.

Check out the following cool examples uploaded to SoundCloud using AWS DeepComposer.

We would love to see your experiments, so feel free to reach out via social media (@digitalcolmer) and share your learnings and experiments.

Conclusion

In this post, we talked about the definition of generative AI, illustrated by an AWS customer story. We then stepped you through how to get started with Studio and JumpStart, and showed you how to get started with GPT 2 and Stable Diffusion models. We wrapped up with a brief overview of AWS DeepComposer.

To explore JumpStart more, try using your own data to fine-tune an existing model. For more information, refer to Incremental training with Amazon SageMaker JumpStart. For information about fine-tuning Stable Diffusion models, refer to Fine-tune text-to-image Stable Diffusion models with Amazon SageMaker JumpStart.

To learn more about Stable Diffusion models, refer to Generate images from text with the stable diffusion model on Amazon SageMaker JumpStart.

We didn’t cover any information on the Flan-T5-XL model, so to learn more, refer to the following GitHub repo. The Amazon SageMaker Examples repo also includes a range of available notebooks on GitHub for the various SageMaker products, including JumpStart, covering a range of different use cases.

To learn more about AWS ML via a range of free digital assets, check out our AWS Machine Learning Ramp-Up Guide. You can also try our free ML Learning Plan to build on your current knowledge or have a clear starting point. To take an instructor-led course, we highly recommend the following courses:

It is truly an exciting time in the AI/ML space. AWS is here to support your ML journey, so please connect with us on social media. We look forward to seeing all your learning, experiments, and fun with the various ML services over the coming months and relish the opportunity to be your instructor on your ML journey.


About the Author

Paul Colmer is a Senior Technical Trainer at Amazon Web Services specializing in machine learning and generative AI. His passion is helping customers, partners, and employees develop and grow through compelling storytelling, shared experiences, and knowledge transfer. With over 25 years in the IT industry, he specializes in agile cultural practices and machine learning solutions. Paul is a Fellow of the London College of Music and Fellow of the British Computer Society.
