Optimize AWS Inferentia utilization with FastAPI and PyTorch models on Amazon EC2 Inf1 & Inf2 instances

When deploying Deep Learning models at scale, it is crucial to effectively utilize the underlying hardware to maximize performance and cost benefits. For production workloads requiring high throughput and low latency, the selection of the Amazon Elastic Compute Cloud (EC2) instance, model serving stack, and deployment architecture is very important. Inefficient architecture can lead to suboptimal utilization of the accelerators and unnecessarily high production cost.

In this post, we walk you through the process of deploying FastAPI model servers on AWS Inferentia devices (found on Amazon EC2 Inf1 and Amazon EC2 Inf2 instances). We also demonstrate hosting a sample model that is deployed in parallel across all NeuronCores to maximize hardware utilization.

Solution overview

FastAPI is an open-source web framework for serving Python applications that is much faster than traditional frameworks like Flask and Django. It utilizes an Asynchronous Server Gateway Interface (ASGI) instead of the widely used Web Server Gateway Interface (WSGI). ASGI processes incoming requests asynchronously, as opposed to WSGI, which processes requests sequentially. This makes FastAPI an ideal choice for handling latency-sensitive requests. You can use FastAPI to deploy a server that hosts an endpoint on an Inferentia (Inf1/Inf2) instance and listens to client requests through a designated port.

Our objective is to achieve the highest performance at the lowest cost through maximum utilization of the hardware. This allows us to handle more inference requests with fewer accelerators. Each AWS Inferentia1 device contains four NeuronCores-v1, and each AWS Inferentia2 device contains two NeuronCores-v2. The AWS Neuron SDK allows us to utilize each of the NeuronCores in parallel, which gives us more control to load and run inference on four or more models in parallel without sacrificing throughput.

With FastAPI, you have your choice of Python web server (Gunicorn, Uvicorn, Hypercorn, Daphne). These web servers provide an abstraction layer on top of the underlying machine learning (ML) model. The requesting client has the benefit of being oblivious to the hosted model. A client doesn’t need to know the name or version of the model that has been deployed under the server; the endpoint name is now just a proxy to a function that loads and runs the model. In contrast, in a framework-specific serving tool, such as TensorFlow Serving, the model’s name and version are part of the endpoint name. If the model changes on the server side, the client has to know this and change its API call to the new endpoint accordingly. Therefore, if you are continuously evolving model versions, such as in the case of A/B testing, then using a generic Python web server with FastAPI is a convenient way of serving models, because the endpoint name is static.

An ASGI server’s role is to spawn a specified number of workers that listen for client requests and run the inference code. An important capability of the server is to make sure the requested number of workers are available and active. In case a worker is killed, the server must launch a new worker. In this context, the server and workers may be identified by their Unix process ID (PID). For this post, we use a Hypercorn server, which is a popular choice for Python web servers.
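To make this pattern concrete, the following is a minimal sketch of a FastAPI application served by Hypercorn. The route names, port, and worker count are illustrative and are not taken from the repository discussed later.

from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
async def health():
    # Lightweight liveness check for clients or load balancers
    return {"status": "ok"}

@app.post("/predict")
async def predict(payload: dict):
    # In a real model server, a traced model loaded once at startup would be invoked here;
    # this stub simply echoes the request body.
    return {"received": payload}

# Launch with, for example: hypercorn app:app --bind 0.0.0.0:8080 --workers 2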

In this post, we share best practices to deploy deep learning models with FastAPI on AWS Inferentia NeuronCores. We show that you can deploy multiple models on separate NeuronCores that can be called concurrently. This setup increases throughput because multiple models can be inferred concurrently and NeuronCore utilization is fully optimized. The code can be found on the GitHub repo. The following figure shows the architecture of how to set up the solution on an EC2 Inf2 instance.

The same architecture applies to the EC2 Inf1 instance type, except that each Inferentia device has four NeuronCores, which changes the diagram slightly.

AWS Inferentia NeuronCores

Let’s dig a little deeper into the tools provided by AWS Neuron to engage with the NeuronCores. The following tables show the number of NeuronCores in each Inf1 and Inf2 instance type. The host vCPUs and the system memory are shared across all available NeuronCores.

Instance Size | # Inferentia Accelerators | # NeuronCores-v1 | vCPUs | Memory (GiB)
Inf1.xlarge | 1 | 4 | 4 | 8
Inf1.2xlarge | 1 | 4 | 8 | 16
Inf1.6xlarge | 4 | 16 | 24 | 48
Inf1.24xlarge | 16 | 64 | 96 | 192

Instance Size | # Inferentia Accelerators | # NeuronCores-v2 | vCPUs | Memory (GiB)
Inf2.xlarge | 1 | 2 | 4 | 32
Inf2.8xlarge | 1 | 2 | 32 | 32
Inf2.24xlarge | 6 | 12 | 96 | 192
Inf2.48xlarge | 12 | 24 | 192 | 384

Inf2 instances contain the new NeuronCores-v2, compared to the NeuronCores-v1 in the Inf1 instances. Despite having fewer cores per device, Inf2 instances are able to offer up to 4x higher throughput and up to 10x lower latency than Inf1 instances. Inf2 instances are ideal for deep learning workloads like generative AI, large language models (LLMs) in the OPT/GPT family, and vision models like Stable Diffusion.

The Neuron Runtime is responsible for running models on Neuron devices. Neuron Runtime determines which NeuronCore will run which model and how to run it. Configuration of Neuron Runtime is controlled through the use of environment variables at the process level. By default, Neuron framework extensions will take care of Neuron Runtime configuration on the user’s behalf; however, explicit configurations are also possible to achieve more optimized behavior.

Two popular environment variables are NEURON_RT_NUM_CORES and NEURON_RT_VISIBLE_CORES. With these environment variables, Python processes can be tied to specific NeuronCores. With NEURON_RT_NUM_CORES, a specified number of cores can be reserved for a process, and with NEURON_RT_VISIBLE_CORES, a range of NeuronCores can be reserved. For example, NEURON_RT_NUM_CORES=2 myapp.py will reserve two cores, and NEURON_RT_VISIBLE_CORES='0-2' myapp.py will reserve cores 0, 1, and 2 for myapp.py. You can reserve NeuronCores across devices (AWS Inferentia chips) as well. For example, NEURON_RT_VISIBLE_CORES='0-5' myapp.py will reserve all four cores on device 1 and two cores on device 2 on an EC2 Inf1 instance type. Similarly, on an EC2 Inf2 instance type, this configuration will reserve both cores on each of device 1, device 2, and device 3. The following table summarizes the configuration of these variables.

Name | Description | Type | Expected Values | Default Value | RT Version
NEURON_RT_VISIBLE_CORES | Range of specific NeuronCores needed by the process | Integer range (like 1-3) | Any value or range between 0 and the maximum NeuronCore in the system | None | 2.0+
NEURON_RT_NUM_CORES | Number of NeuronCores required by the process | Integer | A value from 1 to the maximum NeuronCore in the system | 0, which is interpreted as “all” | 2.0+

For a list of all environment variables, refer to Neuron Runtime Configuration.
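As an illustration of how a serving process might pin itself to a specific NeuronCore, the following sketch sets NEURON_RT_VISIBLE_CORES before the model is loaded. The environment variable name comes from the Neuron runtime documentation; the MY_CORE_ID launcher variable and the model file name are assumptions made for this example.

import os

# Must be set before the Neuron runtime initializes (that is, before the model is loaded)
core_id = os.environ.get("MY_CORE_ID", "0")  # hypothetical variable injected by the launcher
os.environ["NEURON_RT_VISIBLE_CORES"] = f"{core_id}-{core_id}"

import torch
import torch_neuronx  # use torch_neuron instead on Inf1 to register the Neuron ops

model = torch.jit.load("compiled-model-bs-1.pt")  # traced model; file name is illustrative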

By default, models get loaded onto NeuronCore 0 and then NeuronCore 1 unless explicitly stated by the preceding environment variables. As specified earlier, the NeuronCores share the available host vCPUs and system memory. Therefore, models deployed on each NeuronCore will compete for the available resources. This isn’t an issue if the model utilizes the NeuronCores to a large extent. But if a model runs only partly on the NeuronCores and the rest on host vCPUs, then considering CPU availability per NeuronCore becomes important. This affects the choice of instance as well.

The following tables show the number of host vCPUs and the system memory available per model if one model were deployed to each NeuronCore. Depending on your application’s NeuronCore, vCPU, and memory usage, it is recommended to run tests to find out which configuration is most performant for your application. The Neuron Top tool can help visualize core utilization as well as device and host memory utilization. Based on these metrics, an informed decision can be made. We demonstrate the use of Neuron Top at the end of this post.

Instance Size | # Inferentia Accelerators | # Models | vCPUs/Model | Memory/Model (GiB)
Inf1.xlarge | 1 | 4 | 1 | 2
Inf1.2xlarge | 1 | 4 | 2 | 4
Inf1.6xlarge | 4 | 16 | 1.5 | 3
Inf1.24xlarge | 16 | 64 | 1.5 | 3

Instance Size | # Inferentia Accelerators | # Models | vCPUs/Model | Memory/Model (GiB)
Inf2.xlarge | 1 | 2 | 2 | 8
Inf2.8xlarge | 1 | 2 | 16 | 64
Inf2.24xlarge | 6 | 12 | 8 | 32
Inf2.48xlarge | 12 | 24 | 8 | 32

To test out the Neuron SDK features yourself, check out the latest Neuron capabilities for PyTorch.

System setup

The following is the system setup used for this solution:

Set up the solution

There are a couple of things we need to do to set up the solution. Start by creating an IAM role that your EC2 instance will assume, allowing it to push and pull images from Amazon Elastic Container Registry (Amazon ECR).

Step 1: Set up the IAM role

  1. Start by logging in to the console and navigating to IAM > Roles > Create Role.
  2. Select AWS service as the trusted entity type.
  3. Select EC2 as the service under use case.
  4. Choose Next to see all available policies.
  5. For the purpose of this solution, we give our EC2 instance full access to Amazon ECR. Filter for AmazonEC2ContainerRegistryFullAccess and select it.
  6. Choose Next and name the role inf-ecr-access.

Note: The policy we attached gives the EC2 instance full access to Amazon ECR. We strongly recommend following the principle of least privilege for production workloads.

Step 2: Set up the AWS CLI

If you’re using the prescribed Deep Learning AMI listed above, it comes with the AWS CLI installed. If you’re using a different AMI (Amazon Linux 2023, base Ubuntu, and so on), install the CLI tools by following this guide.

Once you have the CLI tools installed, configure the CLI using the command aws configure. If you have access keys, you can add them here, but you don’t necessarily need them to interact with AWS services; we’re relying on the IAM role for that.

Note: We need to enter at least one value (default region or default output format) to create the default profile. For this example, we use us-east-2 as the region and json as the default output format.

Clone the GitHub repository

The GitHub repo provides all the scripts necessary to deploy models using FastAPI on NeuronCores on AWS Inferentia instances. This example uses Docker containers to ensure we can create reusable solutions. Included in this example is the following config.properties file for users to provide inputs.

# Docker Image and Container Name
docker_image_name_prefix=<Docker image name>
docker_container_name_prefix=<Docker container name>

# Deployment Setup
path_to_traced_models=<Path to traced model>
compiled_model=<Compiled model file name>
num_cores=<Number of NeuronCores to Deploy a Model Server>
num_models_per_server=<Number of Models to Be Loaded Per Server>

The configuration file needs user-defined name prefixes for the Docker image and Docker containers. The build.sh scripts in the fastapi and trace-model folders use these prefixes to create Docker images.
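For illustration, a filled-in config.properties might look like the following; all values are hypothetical and should be replaced with your own.

# Docker Image and Container Name
docker_image_name_prefix=bert-inf-fastapi
docker_container_name_prefix=bert-inf-fastapi-nc

# Deployment Setup
path_to_traced_models=/home/ubuntu/traced-models
compiled_model=compiled-model-bs-1.pt
num_cores=2
num_models_per_server=2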

Compile a model on AWS Inferentia

We start by tracing the model and producing a PyTorch TorchScript .pt file. Access the trace-model directory and modify the .env file; depending on the type of instance you chose, set the CHIP_TYPE within the .env file accordingly. As an example, we choose Inf2 as the guide. The same steps apply to the deployment process for Inf1.

Next, set the default region in the same file. This region will be used to create an ECR repository, and Docker images will be pushed to this repository. Also in this folder, we provide all the scripts necessary to trace a bert-base-uncased model on AWS Inferentia. This script can be used for most models available on Hugging Face. The Dockerfile has all the dependencies to run models with Neuron and runs the trace-model.py code as the entry point.

Neuron compilation explained

The Neuron SDK’s API closely resembles the PyTorch Python API. torch.jit.trace() from PyTorch takes the model and a sample input tensor as arguments. The sample inputs are fed to the model, and the operations invoked as that input makes its way through the model’s layers are recorded as TorchScript. To learn more about JIT tracing in PyTorch, refer to the official documentation.
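As a quick refresher, here is a minimal torch.jit.trace() sketch on a stand-in torchvision model (not the model used in this post); note that the traced graph is recorded for the exact input shape provided.

import torch
import torchvision.models as models

model = models.resnet18().eval()             # stand-in model for illustration only
example_inputs = torch.rand(1, 3, 224, 224)  # tracing records ops for this fixed shape
traced = torch.jit.trace(model, example_inputs)
traced.save("resnet18-traced.pt")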

Just like torch.jit.trace(), you can check whether your model can be compiled for AWS Inferentia with the following code for Inf1 instances.

import torch
import torch_neuron

model_traced = torch.neuron.trace(model,
                                  example_inputs,
                                  compiler_args=['--fast-math', 'fp32-cast-matmul',
                                                 '--neuron-core-pipeline-cores', '1'],
                                  optimizations=[torch_neuron.Optimization.FLOAT32_TO_FLOAT16])

For Inf2, the library is called torch_neuronx. Here’s how you can test your model compilation for Inf2 instances.

import torch
import torch_neuronx

model_traced = torch_neuronx.trace(model,
                                   example_inputs,
                                   compiler_args=['--fast-math', 'fp32-cast-matmul',
                                                  '--neuron-core-pipeline-cores', '1'],
                                   optimizations=[torch_neuronx.Optimization.FLOAT32_TO_FLOAT16])

After creating the traced model, we can pass an example tensor input like so:

answer_logits = model_traced(*example_inputs)

And finally, save the resulting TorchScript output to local disk:

model_traced.save(f'./compiled-model-bs-{batch_size}.pt')

As shown in the preceding code, you can use compiler_args and optimizations to optimize the deployment. For a detailed list of arguments for the torch.neuron.trace API, refer to PyTorch-Neuron trace python API.

Keep the following important points in mind:

  • The Neuron SDK doesn’t support dynamic tensor shapes as of this writing. Therefore, a model has to be compiled separately for each input shape (see the sketch following this list). For more information on running inference on variable input shapes with bucketing, refer to Running inference on variable input shapes with bucketing.
  • If you face out-of-memory issues when compiling a model, try compiling it on an AWS Inferentia instance with more vCPUs or memory, or even a large c6i or r6i instance, because compilation only uses CPUs. Once compiled, the traced model can likely be run on smaller AWS Inferentia instance sizes.
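To illustrate the first point about fixed input shapes, the following sketch compiles one artifact per batch size (a simple form of bucketing). It reuses the model object from the earlier tracing example and assumes an Inf2 instance with torch_neuronx and a model that takes a single token-ID tensor; the shapes and bucket sizes are illustrative.

import torch
import torch_neuronx

for batch_size in (1, 4, 8):  # hypothetical buckets
    example_inputs = torch.zeros(batch_size, 128, dtype=torch.long)  # illustrative shape
    traced = torch_neuronx.trace(model, example_inputs)
    traced.save(f"compiled-model-bs-{batch_size}.pt")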

Build process explanation

Now we build this container by running build.sh. The build script creates the Docker image by pulling a base Deep Learning Containers image and installing the Hugging Face transformers package. Based on the CHIP_TYPE specified in the .env file, the docker.properties file selects the appropriate BASE_IMAGE. This BASE_IMAGE points to a Deep Learning Containers image for Neuron Runtime provided by AWS.

The image is available through a private ECR repository, so before we can pull it, we need to log in and get temporary AWS credentials.

aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin 763104351884.dkr.ecr.<region>.amazonaws.com

Note: We need to replace the region in the command (both in the --region flag and within the repository URI) with the region we put in the .env file.

To make this process easier, we can use the fetch-credentials.sh script; the region is taken from the .env file automatically.

Next, we’ll push the image using the script push.sh. The push script creates a repository in Amazon ECR for you and pushes the container image.

Finally, when the image is built and pushed, we can run it as a container by running run.sh and tail the running logs with logs.sh. In the compiler logs (see the following screenshot), you see the percentage of arithmetic operators compiled on Neuron and the percentage of model sub-graphs successfully compiled on Neuron. The screenshot shows the compiler logs for the bert-base-uncased-squad2 model. The logs show that 95.64% of the arithmetic operators were compiled, and they also give a list of operators that were compiled on Neuron and those that aren’t supported.

Here is a list of all supported operators in the latest PyTorch Neuron package. Similarly, here is the list of all supported operators in the latest PyTorch Neuronx package.

Deploy models with FastAPI

After the models are compiled, the traced model will be present in the trace-model folder. In this example, we have placed the traced model for a batch size of 1. We consider a batch size of 1 here to account for those use cases where a higher batch size is not feasible or required. For use cases where higher batch sizes are needed, the torch.neuron.DataParallel (for Inf1) or torch_neuronx.DataParallel (for Inf2) API may also be useful.
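The following is a minimal sketch of the DataParallel idea on Inf2, assuming a traced model that takes a single input tensor (the actual BERT model in this post takes multiple inputs); DataParallel replicates the model and shards the batch across the visible NeuronCores. The file name and batch shape are illustrative.

import torch
import torch_neuronx

model = torch.jit.load("compiled-model-bs-1.pt")    # traced model; file name is illustrative
model_parallel = torch_neuronx.DataParallel(model)  # replicates the model across NeuronCores

batch = torch.zeros(8, 128, dtype=torch.long)       # illustrative batched input
outputs = model_parallel(batch)                     # the batch is split across the cores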

The fast-api folder provides all the necessary scripts to deploy models with FastAPI. To deploy the models without any changes, simply run the deploy.sh script; it will build a FastAPI container image, run containers on the specified number of cores, and deploy the specified number of models per server in each FastAPI model server. This folder also contains a .env file; modify it to reflect the correct CHIP_TYPE and AWS_DEFAULT_REGION.

Note: The FastAPI scripts rely on the same environment variables used to build, push, and run the images as containers. The FastAPI deployment scripts use the last known values of these variables, so if you last traced a model for the Inf1 instance type, that model will be deployed through these scripts.

The fastapi-server.py file, which is responsible for hosting the server and sending requests to the model, does the following (see the sketch following this list):

  • Reads the number of models per server and the location of the compiled model from the properties file
  • Sets visible NeuronCores as environment variables to the Docker container and reads the environment variables to specify which NeuronCores to use
  • Provides an inference API for the bert-base-uncased-squad2 model
  • With jit.load(), loads the number of models per server as specified in the config and stores the models and the required tokenizers in global dictionaries
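The following is a condensed sketch of that structure, not the actual fastapi-server.py from the repository; the config handling, model path, environment variable NUM_MODELS_PER_SERVER, and handler body are placeholders introduced for illustration.

import os
import torch
import torch_neuron  # or torch_neuronx on Inf2; registers the Neuron ops before jit.load
from fastapi import FastAPI
from transformers import AutoTokenizer

app = FastAPI()
models, tokenizers = {}, {}

core_id = os.environ["NEURON_RT_VISIBLE_CORES"].split("-")[0]   # set per container by run.sh
num_models = int(os.environ.get("NUM_MODELS_PER_SERVER", "1"))  # hypothetical variable name

for model_id in range(num_models):
    models[model_id] = torch.jit.load("/models/compiled-model-bs-1.pt")      # placeholder path
    tokenizers[model_id] = AutoTokenizer.from_pretrained("bert-base-uncased")

@app.get("/predictions_neuron_core_{core}/model_{model_id}")
async def predict(core: int, model_id: int):
    # A real handler would tokenize a question/context pair and run the traced model;
    # this stub only confirms which core and model served the request.
    return {"core": core_id, "model": model_id, "loaded": model_id in models}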

With this setup, it would be relatively easy to set up APIs that list which models and how many models are stored in each NeuronCore. Similarly, APIs could be written to delete models from specific NeuronCores.

The Dockerfile for building FastAPI containers is built on the Docker image we built for tracing the models. This is why the docker.properties file specifies the ECR path to the Docker image for tracing the models. In our setup, the Docker containers across all NeuronCores are similar, so we can build one image and run multiple containers from one image. To avoid any entry point errors, we specify ENTRYPOINT ["/usr/bin/env"] in the Dockerfile before running the startup.sh script, which looks like hypercorn fastapi-server:app -b 0.0.0.0:8080. This startup script is the same for all containers. If you’re using the same base image as for tracing models, you can build this container by simply running the build.sh script. The push.sh script remains the same as before for tracing models. The modified Docker image and container name are provided by the docker.properties file.

The run.sh file does the following:

  • Reads the Docker image and container name from the properties file, which in turn reads the config.properties file, which has a num_cores user setting
  • Starts a loop from 0 to num_cores and for each core:
    • Sets the port number and device number
    • Sets the NEURON_RT_VISIBLE_CORES environment variable
    • Specifies the volume mount
    • Runs a Docker container

For clarity, the Docker run command for deploying in NeuronCore 0 for Inf1 would look like the following code:

docker run -t -d \
    --name bert-inf-fastapi-nc-0 \
    --env NEURON_RT_VISIBLE_CORES="0-0" \
    --env CHIP_TYPE="inf1" \
    -p ${port_num}:8080 --device=/dev/neuron0 ${registry}/bert-inf-fastapi

The run command for deploying in NeuronCore 5 would look like the following code:

# On Inf1, NeuronCore 5 resides on the second Inferentia device (/dev/neuron1)
docker run -t -d \
    --name bert-inf-fastapi-nc-5 \
    --env NEURON_RT_VISIBLE_CORES="5-5" \
    --env CHIP_TYPE="inf1" \
    -p ${port_num}:8080 --device=/dev/neuron1 ${registry}/bert-inf-fastapi

After the containers are deployed, we use the run_apis.py script, which calls the APIs in parallel threads. The code is set up to call six deployed models, one on each NeuronCore, but it can easily be changed to a different setting. We call the APIs from the client side as follows:

import requests

url_template = "http://localhost:%i/predictions_neuron_core_%i/model_%i"

# NeuronCore 0
response = requests.get(url_template % (8081,0,0))

# NeuronCore 5
response = requests.get(url_template % (8086,5,0))
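A minimal sketch of calling several endpoints in parallel threads, in the spirit of run_apis.py but not its actual code, might look like the following; it assumes the port numbering starts at 8081 for NeuronCore 0, as in the examples above.

import requests
from concurrent.futures import ThreadPoolExecutor

url_template = "http://localhost:%i/predictions_neuron_core_%i/model_%i"

def call_core(core):
    # One request per NeuronCore; model 0 on each server is queried
    return requests.get(url_template % (8081 + core, core, 0)).json()

with ThreadPoolExecutor(max_workers=6) as pool:
    results = list(pool.map(call_core, range(6)))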

Monitor NeuronCore

After the model servers are deployed, we can monitor NeuronCore utilization with neuron-top, which shows the utilization percentage of each NeuronCore in real time. neuron-top is a CLI tool in the Neuron SDK that provides information such as NeuronCore, vCPU, and memory utilization. In a separate terminal, enter the following command:

neuron-top

Your output should be similar to the following figure. In this scenario, we have specified two NeuronCores and two models per server on an Inf2.xlarge instance. The following screenshot shows that two models of 287.8 MB each are loaded on each of the two NeuronCores. With a total of four models loaded, you can see that the device memory used is 1.3 GB. Use the arrow keys to move between the NeuronCores on different devices.

Similarly, on an Inf1.6xlarge instance type, we see a total of 12 models (2 models per core over 6 cores) loaded. A total memory of 2.1 GB is consumed, and every model is 177.2 MB in size.

After you run the run_apis.py script, you can see the percentage of utilization of each of the six NeuronCores (see the following screenshot). You can also see the system vCPU usage and runtime vCPU usage.

The following screenshot shows the Inf2 instance core usage percentage.

Similarly, this screenshot shows core utilization in an inf1.6xlarge instance type.

Clean up

To clean up all the Docker containers you created, we provide a cleanup.sh script that removes all running and stopped containers. This script will remove all containers, so don’t use it if you want to keep some containers running.

Conclusion

Production workloads often have high throughput, low latency, and cost requirements. Inefficient architectures that sub-optimally utilize accelerators could lead to unnecessarily high production costs. In this post, we showed how to optimally utilize NeuronCores with FastAPI to maximize throughput at minimum latency. We have published the instructions on our GitHub repo. With this solution architecture, you can deploy multiple models in each NeuronCore and operate multiple models in parallel on different NeuronCores without losing performance. For more information on how to deploy models at scale with services like Amazon Elastic Kubernetes Service (Amazon EKS), refer to Serve 3,000 deep learning models on Amazon EKS with AWS Inferentia for under $50 an hour.


About the authors

Ankur Srivastava is a Sr. Solutions Architect in the ML Frameworks Team. He focuses on helping customers with self-managed distributed training and inference at scale on AWS. His experience includes industrial predictive maintenance, digital twins, and probabilistic design optimization. He completed his doctoral studies in Mechanical Engineering at Rice University and postdoctoral research at the Massachusetts Institute of Technology.

K.C. Tung is a Senior Solutions Architect in AWS Annapurna Labs. He specializes in large deep learning model training and deployment at scale in the cloud. He has a Ph.D. in molecular biophysics from the University of Texas Southwestern Medical Center in Dallas. He has spoken at AWS Summits and AWS re:Invent. Today, he helps customers train and deploy large PyTorch and TensorFlow models in the AWS Cloud. He is the author of two books: Learn TensorFlow Enterprise and TensorFlow 2 Pocket Reference.

Pronoy Chopra is a Senior Solutions Architect with the Startups Generative AI team at AWS. He specializes in architecting and developing IoT and machine learning solutions. He co-founded two startups in the past and enjoys being hands-on with projects in the IoT, AI/ML, and serverless domains.

Analyze rodent infestation using Amazon SageMaker geospatial capabilities

Rodents such as rats and mice are associated with a number of health risks and are known to spread more than 35 diseases. Identifying regions of high rodent activity can help local authorities and pest control organizations plan for interventions effectively and exterminate the rodents.

In this post, we show how to monitor and visualize a rodent population using Amazon SageMaker geospatial capabilities. We then visualize rodent infestation effects on vegetation and bodies of water. Finally, we correlate and visualize the number of monkeypox cases reported with rodent sightings in a region. Amazon SageMaker makes it easier for data scientists and machine learning (ML) engineers to build, train, and deploy models using geospatial data. These capabilities make it easier to access geospatial data sources, run purpose-built processing operations, apply pre-trained ML models, and use built-in visualization tools faster and at scale.

Notebook

First, we use an Amazon SageMaker Studio notebook with a geospatial image by following the steps outlined in Getting Started with Amazon SageMaker geospatial capabilities.

Data access

The geospatial image comes preinstalled with SageMaker geospatial capabilities that make it easier to enrich data for geospatial analysis and ML. For our post, we use satellite images from Sentinel-2 and the rodent activity and monkeypox datasets from NYC Open Data.

First, we use the rodent activity dataset and extract the latitude and longitude of rodent sightings and inspections. Then we enrich this location information with human-readable street addresses. We create a vector enrichment job (VEJ) in the SageMaker Studio notebook to run a reverse geocoding operation, which converts geographic coordinates (latitude, longitude) to human-readable addresses, powered by Amazon Location Service. We create the VEJ as follows:

import boto3
import botocore
import sagemaker
import sagemaker_geospatial_map

region = boto3.Session().region_name
session = botocore.session.get_session()
execution_role = sagemaker.get_execution_role()

sg_client= session.create_client(
    service_name='sagemaker-geospatial',
    region_name=region
)
response = sg_client.start_vector_enrichment_job(
    ExecutionRoleArn=execution_role,
    InputConfig={
        'DataSourceConfig': {
            'S3Data': {
                'S3Uri': 's3://<bucket>/sample/rodent.csv'
            }
        },
        'DocumentType': 'CSV'
    },
    JobConfig={
        "ReverseGeocodingConfig": { 
         "XAttributeName": "longitude",
         "YAttributeName": "latitude"
      }
    },
    Name='vej-reversegeo',
)

my_vej_arn = response['Arn']

Visualize rodent activity in a region

Now we can use SageMaker geospatial capabilities to visualize rodent sightings. After the VEJ is complete, we export the output of the job to an Amazon S3 bucket.

sg_client.export_vector_enrichment_job(
    Arn=my_vej_arn,
    ExecutionRoleArn=execution_role,
    OutputConfig={
        'S3Data': {
            'S3Uri': 's3://<bucket>/reversegeo/'
        }
    }
)

When the export is complete, you will see the output CSV file in your Amazon Simple Storage Service (Amazon S3) bucket. It consists of your input data (longitude and latitude coordinates) along with additional columns appended at the end: the address number, country, label, municipality, neighborhood, postal code, and region of that location.

From the output file generated by the VEJ, we can use SageMaker geospatial capabilities to overlay the results on a base map and provide layered visualization to make collaboration easier. SageMaker geospatial capabilities provide built-in visualization tooling powered by Foursquare Studio, which works natively within a SageMaker notebook via the SageMaker geospatial Map SDK. Below, we visualize the rodent sightings and also get the human-readable address for each data point. The address information for each rodent sighting can be useful for rodent inspection and treatment purposes.

Analyze the effects of rodent infestation on vegetation and bodies of water

To analyze the effects of rodent infestation on vegetation and bodies of water, we need to classify each location as vegetation, water, and bare ground. Let’s look at how we can use these geospatial capabilities to perform this analysis.

The new geospatial capabilities in SageMaker offer easier access to geospatial data such as Sentinel-2 and Landsat 8. Built-in geospatial dataset access saves weeks of effort otherwise lost to collecting and processing data from various data providers and vendors. These capabilities also offer a pre-trained Land Use Land Cover (LULC) segmentation model to identify the physical material, such as vegetation, water, and bare ground, at the Earth’s surface.

We use this LULC ML model to analyze the effects of rodent population on vegetation and bodies of water.

In the following code snippet, we first define the area of interest coordinates (aoi_coords) of New York City. Then we create an Earth Observation Job (EOJ) and select the LULC operation. SageMaker downloads and preprocesses the satellite image data for the EOJ. Next, SageMaker automatically runs model inference for the EOJ. The runtime of the EOJ will vary from several minutes to hours depending on the number of images processed. You can monitor the status of EOJs using the get_earth_observation_job function, and visualize the input and output of the EOJ in the map.

aoi_coords = [
    [
            [
              -74.13513011934334,
              40.87856296920188
            ],
            [
              -74.13513011934334,
              40.565792636343616
            ],
            [
              -73.8247144462764,
              40.565792636343616
            ],
            [
              -73.8247144462764,
              40.87856296920188
            ],
            [
              -74.13513011934334,
              40.87856296920188
            ]
    ]
]

eoj_input_config = {
    "RasterDataCollectionQuery": {
        "RasterDataCollectionArn": "arn:aws:sagemaker-geospatial:us-west-2:378778860802:raster-data-collection/public/nmqj48dcu3g7ayw8",
        "AreaOfInterest": {
            "AreaOfInterestGeometry": {
                "PolygonGeometry": {
                    "Coordinates": aoi_coords
                }
            }
        },
        "TimeRangeFilter": {
            "StartTime": "2023-01-01T00:00:00Z",
            "EndTime": "2023-02-28T23:59:59Z",
        },
        "PropertyFilters": {
            "Properties": [{"Property": {"EoCloudCover": {"LowerBound": 0, "UpperBound": 2.0}}}],
            "LogicalOperator": "AND",
        },
    }
}
eoj_config = {
  "LandCoverSegmentationConfig": {}
}

response = sg_client.start_earth_observation_job(
    Name="eoj-rodent-infestation-lulc-example",
    InputConfig=eoj_input_config,
    JobConfig=eoj_config,
    ExecutionRoleArn=execution_role,
)
eoj_arn = response["Arn"]
eoj_arn

Map = sagemaker_geospatial_map.create_map()
Map.set_sagemaker_geospatial_client(sg_client)

Map.render()

time_range_filter = {
    "start_date": "2023-01-01T00:00:00Z",
    "end_date": "2023-02-28T23:59:59Z",
}


config = {"preset": "singleBand", "band_name": "mask"}
output_layer = Map.visualize_eoj_output(
    Arn=eoj_arn, config=config, time_range_filter=time_range_filter
)
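As mentioned earlier, you can monitor the EOJ with get_earth_observation_job. A simple polling loop might look like the following sketch; the status field name and terminal status values are assumptions for illustration.

import time

# Poll the Earth Observation Job until it finishes (sketch)
while True:
    eoj = sg_client.get_earth_observation_job(Arn=eoj_arn)
    status = eoj["Status"]  # assumed field name
    print(f"EOJ status: {status}")
    if status in ("COMPLETED", "FAILED", "STOPPED"):  # assumed terminal states
        break
    time.sleep(60)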

To visualize the rodent population with respect to vegetation, we overlay the rodent population and sighting data on the land cover segmentation model predictions. This visualization can help us locate the rodent population and analyze its effect on vegetation and bodies of water.

Visualize monkeypox cases and correlate with rodent data

To visualize the relation between the monkeypox cases and rodent sightings, we add the monkeypox dataset and the geoJSON file for New York City borough boundaries. See the following code:

import pandas as pd

nybb = pd.read_csv("./nybb.csv")
monkeypox = pd.read_csv("./monkeypox.csv")
dataset = Map.add_dataset({
    "data": nybb
}, auto_create_layers=False)
dataset = Map.add_dataset({
    "data": monkeypox
}, auto_create_layers=False)

Within a SageMaker Studio notebook, we can use the visualization tool powered by Foursquare to add layers and charts to the map. Here, we added the monkeypox data as a chart to show the number of monkeypox cases for each borough. To see the correlation between monkeypox cases and rodent sightings, we added the borough boundaries as a polygon layer and a heatmap layer that represents rodent activity. The borough boundary layer is colored to match the monkeypox data chart. As we can see, the borough of Manhattan exhibits a high concentration of rodent sightings and records the highest number of monkeypox cases, followed by Brooklyn.

This is supported by a simple statistical analysis of calculating the correlation between the concentration of rodent sightings and monkeypox cases in each borough. The calculation produced an r value of 0.714, which implies a positive correlation.

import numpy as np

r = np.corrcoef(
    borough_stats['Concentration (sightings per square km)'],
    borough_stats['Monkeypox Cases'],
)[0, 1]  # the off-diagonal element is the Pearson correlation coefficient

Conclusion

In this post, we demonstrated how you can use SageMaker geospatial capabilities to get detailed addresses of rodent sightings and visualize the rodent effects on vegetation and bodies of water. This can help local authorities and pest control organizations plan for interventions effectively and exterminate rodents. We also correlated the rodent sightings to monkeypox cases in the area with the built-in visualization tool. By utilizing vector enrichment and EOJs along with the built-in visualization tools, SageMaker geospatial capabilities eliminate the challenges of handling large-scale geospatial datasets, model training, and inference, and provide the ability to rapidly explore predictions and geospatial data on an interactive map using 3D accelerated graphics and built-in visualization tools.

You can get started with SageMaker geospatial capabilities in two ways:

To learn more, visit Amazon SageMaker geospatial capabilities and Getting Started with Amazon SageMaker geospatial capabilities. Also, visit our GitHub repo, which has several example notebooks on SageMaker geospatial capabilities.


About the authors

Bunny Kaushik is a Solutions Architect at AWS. He is passionate about building AI/ML solutions and helping customers innovate on the AWS platform. Outside of work, he enjoys hiking, rock climbing, and swimming.

Clarisse Vigal is a Sr. Technical Account Manager at AWS, focused on helping customers accelerate their cloud adoption journey. Outside of work, Clarisse enjoys traveling, hiking, and reading sci-fi thrillers.

Veda Raman is a Senior Specialist Solutions Architect for machine learning based in Maryland. Veda works with customers to help them architect efficient, secure and scalable machine learning applications. Veda is interested in helping customers leverage serverless technologies for Machine learning.

Enel automates large-scale power grid asset management and anomaly detection using Amazon SageMaker

This is a guest post by Mario Namtao Shianti Larcher, Head of Computer Vision at Enel.

Enel, which started as Italy’s national entity for electricity, is today a multinational company present in 32 countries and the first private network operator in the world with 74 million users. It is also recognized as the first renewables player with 55.4 GW of installed capacity. In recent years, the company has invested heavily in the machine learning (ML) sector by developing strong in-house know-how that has enabled them to realize very ambitious projects such as automatic monitoring of its 2.3 million kilometers of distribution network.

Every year, Enel inspects its electricity distribution network with helicopters, cars, or other means; takes millions of photographs; and reconstructs a 3D image of its network as a point cloud, obtained using LiDAR technology.

Examination of this data is critical for monitoring the state of the power grid, identifying infrastructure anomalies, and updating databases of installed assets. It allows granular control of the infrastructure down to the material and status of the smallest insulator installed on a given pole. Given the amount of data (more than 40 million images each year in Italy alone), the number of items to be identified, and their specificity, a completely manual analysis is very costly in both time and money, and error prone. Fortunately, thanks to enormous advances in computer vision and deep learning and the maturity and democratization of these technologies, it’s possible to automate this expensive process partially or even completely.

Of course, the task remains very challenging, and, like all modern AI applications, it requires computing power and the ability to handle large volumes of data efficiently.

Enel built its own ML platform (internally called the ML factory) based on Amazon SageMaker, and the platform is established as the standard solution to build and train models at Enel for different use cases, across different digital hubs (business units) with tens of ML projects being developed on Amazon SageMaker Training, Amazon SageMaker Processing, and other AWS services like AWS Step Functions.

Enel collects imagery and data from two different sources:

  1. Aerial network inspections:
    • LiDAR point clouds – They have the advantage of being an extremely accurate and geo-localized 3D reconstruction of the infrastructure, and therefore are very useful for calculating distances or taking measurements with an accuracy not obtainable from 2D image analysis.
    • High-resolution images – These images of the infrastructure are taken within seconds of each other. This makes it possible to detect elements and anomalies that are too small to be identified in the point cloud.
  2. Satellite images – Although these can be more affordable than a power line inspection (some are available for free or for a fee), their resolution and quality are often not on par with images taken directly by Enel. The characteristics of these images make them useful for certain tasks like evaluating forest density and macro-category or finding buildings.

In this post, we discuss the details of how Enel uses these three data sources, and share how Enel automates its large-scale power grid asset management and anomaly detection process using SageMaker.

Analyzing high-resolution photographs to identify assets and anomalies

As with other unstructured data collected during inspections, the photographs taken are stored on Amazon Simple Storage Service (Amazon S3). Some of these are manually labeled with the goal of training different deep learning models for different computer vision tasks.

Conceptually, the processing and inference pipeline involves a hierarchical approach with multiple steps: first, the regions of interest in the image are identified, then these are cropped, assets are identified within them, and finally these are classified according to the material or presence of anomalies on them. Because the same pole often appears in more than one image, it’s also necessary to be able to group its images to avoid duplicates, an operation called reidentification.

For all these tasks, Enel uses the PyTorch framework and the latest architectures for image classification and object detection, such as EfficientNet/EfficientDet, or others for the semantic segmentation of certain anomalies, such as oil leaks on transformers. For the reidentification task, when it can’t be done geometrically because camera parameters are missing, SimCLR-based self-supervised methods or Transformer-based architectures are used. It would be impossible to train all these models without access to a large number of instances equipped with high-performance GPUs, so all the models were trained in parallel using Amazon SageMaker Training jobs with GPU-accelerated ML instances. Inference has the same structure and is orchestrated by a Step Functions state machine that governs several SageMaker processing and training jobs, which, despite the name, are as usable in inference as in training.

The following is a high-level architecture of the ML pipeline with its main steps.

Architectural Diagram

This diagram shows the simplified architecture of the ODIN image inference pipeline, which extracts and analyzes ROIs (such as electricity posts) from dataset images. The pipeline further drills down on ROIs, extracting and analyzing electrical elements (transformers, insulators, and so on). After the components (ROIs and elements) are finalized, the reidentification process begins: images and poles in the network map are matched based on 3D metadata. This allows the clustering of ROIs referring to the same pole. After that, anomalies get finalized and reports are generated.

Extracting precise measurements using LiDAR point clouds

High-resolution photographs are very useful, but because they’re 2D, it’s impossible to extract precise measurements from them. LiDAR point clouds come to the rescue here, because they are 3D and each point in the cloud has a position with an associated error of less than a handful of centimeters.

However, in many cases, a raw point cloud is not useful, because you can’t do much with it if you don’t know whether a set of points represents a tree, a power line, or a house. For this reason, Enel uses KPConv, a semantic point cloud segmentation algorithm, to assign a class to each point. After the cloud is classified, it’s possible to figure out whether vegetation is too close to the power line, or to measure the tilt of poles. Due to the flexibility of SageMaker services, the pipeline of this solution is not much different from the one already described, with the only difference being that in this case it is necessary to use GPU instances for inference as well.

The following are some examples of point cloud images.

LiDAR image 1

LiDAR image2

Looking at the power grid from space: Mapping vegetation to prevent service disruptions

Inspecting the power grid with helicopters and other means is generally very expensive and can’t be done too frequently. On the other hand, having a system that monitors vegetation trends at short time intervals is extremely useful for optimizing one of the most expensive processes of an energy distributor: tree pruning. This is why Enel also included in its solution the analysis of satellite images, from which a multitask approach identifies where vegetation is present, its density, and the type of plants grouped into macro classes.

For this use case, after experimenting with different resolutions, Enel concluded that the free Sentinel-2 images provided by the Copernicus program had the best cost-benefit ratio. In addition to vegetation, Enel also uses satellite imagery to identify buildings, which is useful for spotting discrepancies between where buildings are present and where Enel delivers power, and therefore any irregular connections or problems in the databases. For the latter use case, the resolution of Sentinel-2, where one pixel represents a 10-meter square on the ground, is not sufficient, so paid images with 50-centimeter resolution are purchased. This solution also doesn’t differ much from the previous ones in terms of the services used and the flow.

The following is an aerial picture with identification of assets (pole and insulators).

Angela Italiano, Director of Data Science at ENEL Grid, says,

“At Enel, we use computer vision models to inspect our electricity distribution network by reconstructing 3D images of our network using tens of millions of high-quality images and LiDAR point clouds. The training of these ML models requires access to a large number of instances equipped with high-performance GPUs and the ability to handle large volumes of data efficiently. With Amazon SageMaker, we can quickly train all of our models in parallel without needing to manage the infrastructure as Amazon SageMaker training scales the compute resources up and down as needed. Using Amazon SageMaker, we are able to build 3D images of our systems, monitor for anomalies, and serve over 60 million customers efficiently.”

Conclusion

In this post, we saw how a top player in the energy world like Enel used computer vision models and SageMaker training and processing jobs to solve one of the main problems of those who have to manage an infrastructure of this colossal size: keeping track of installed assets and identifying anomalies and sources of danger for a power line, such as vegetation that is too close to it.

Learn more about the related features of SageMaker.


About the Authors

Mario Namtao Shianti Larcher is the Head of Computer Vision at Enel. He has a background in mathematics and statistics, profound expertise in machine learning and computer vision, and leads a team of over ten professionals. Mario’s role entails implementing advanced solutions that effectively utilize the power of AI and computer vision to leverage Enel’s extensive data resources. In addition to his professional endeavors, he nurtures a personal passion for both traditional and AI-generated art.

Cristian Gavazzeni is a Senior Solutions Architect at Amazon Web Services. He has more than 20 years of experience as a pre-sales consultant focusing on data management, infrastructure, and security. During his spare time, he likes playing golf with friends and traveling abroad with only fly-and-drive bookings.

Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering and ML experience, he works with customers of any size to deeply understand their business and technical needs and design AI and machine learning solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.

Efficiently train, tune, and deploy custom ensembles using Amazon SageMaker

Artificial intelligence (AI) has become an important and popular topic in the technology community. As AI has evolved, we have seen different types of machine learning (ML) models emerge. One approach, known as ensemble modeling, has been rapidly gaining traction among data scientists and practitioners. In this post, we discuss what ensemble models are and why their usage can be beneficial. We then provide an example of how you can train, optimize, and deploy your custom ensembles using Amazon SageMaker.

Ensemble learning refers to the use of multiple learning models and algorithms to gain more accurate predictions than any single, individual learning algorithm. Ensembles have been proven to be effective in diverse applications and learning settings such as cybersecurity [1] and fraud detection, remote sensing, predicting best next steps in financial decision-making, medical diagnosis, and even computer vision and natural language processing (NLP) tasks. We tend to categorize ensembles by the techniques used to train them, their composition, and the way they merge the different predictions into a single inference. These categories include:

  • Boosting – Training sequentially multiple weak learners, where each incorrect prediction from previous learners in the sequence is given a higher weight and input to the next learner, thereby creating a stronger learner. Examples include AdaBoost, Gradient Boosting, and XGBoost.
  • Bagging – Uses multiple models to reduce the variance of a single model. Examples include Random Forest and Extra Trees.
  • Stacking (blending) – Often uses heterogenous models, where predictions of each individual estimator are stacked together and used as input to a final estimator that handles the prediction. This final estimator’s training process often uses cross-validation.

There are multiple methods of combining the predictions into the single one that the model finally produces, for example, using a meta-estimator such as a linear learner, a voting method that makes a prediction based on majority voting for classification tasks, or ensemble averaging for regression.
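To make the combination step concrete, here is a small sketch (with toy numbers) of ensemble averaging for regression and majority voting for classification.

import numpy as np

# Regression: average the per-model predictions (ensemble averaging)
reg_preds = np.array([[2.1, 3.4],
                      [1.9, 3.8],
                      [2.3, 3.5]])            # 3 models x 2 samples, toy values
ensemble_regression = reg_preds.mean(axis=0)  # -> array([2.1, 3.566...])

# Classification: majority vote across the per-model class labels
cls_preds = np.array([[0, 1],
                      [0, 1],
                      [1, 1]])                # 3 models x 2 samples, toy labels
ensemble_vote = np.array([np.bincount(col).argmax() for col in cls_preds.T])  # -> array([0, 1])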

Although several libraries and frameworks provide implementations of ensemble models, such as XGBoost, CatBoost, or scikit-learn’s random forest, in this post we focus on bringing your own models and using them as a stacking ensemble. However, instead of using dedicated resources for each model (dedicated training and tuning jobs and hosting endpoints per model), we train, tune, and deploy a custom ensemble (multiple models) using a single SageMaker training job and a single tuning job, and deploy to a single endpoint, thereby reducing possible cost and operational overhead.

BYOE: Bring your own ensemble

There are several ways to train and deploy heterogenous ensemble models with SageMaker: you can train each model in a separate training job and optimize each model separately using Amazon SageMaker Automatic Model Tuning. When hosting these models, SageMaker provides various cost-effective ways to host multiple models on the same tenant infrastructure. Detailed deployment patterns for this kind of settings can be found in Model hosting patterns in Amazon SageMaker, Part 1: Common design patterns for building ML applications on Amazon SageMaker. These patterns include using multiple endpoints (for each trained model) or a single multi-model endpoint, or even a single multi-container endpoint where the containers can be invoked individually or chained in a pipeline. All these solutions include a meta-estimator (for example in an AWS Lambda function) that invokes each model and implements the blending or voting function.

However, running multiple training jobs might introduce operational and cost overhead, especially if your ensemble requires training on the same data. Similarly, hosting different models on separate endpoints or containers and combining their prediction results for better accuracy requires multiple invocations, and therefore introduces additional management, cost, and monitoring efforts. For example, SageMaker supports ensemble ML models using Triton Inference Server, but this solution requires the models or model ensembles to be supported by the Triton backend. Additionally, extra effort is required from the customer to set up the Triton server and additional learning to understand how different Triton backends work. Therefore, customers prefer a more straightforward way to implement solutions where they only need to send the invocation once to the endpoint and have the flexibility to control how the results are aggregated to generate the final output.

Solution overview

To address these concerns, we walk through an example of ensemble training using a single training job, optimizing the model’s hyperparameters and deploying it using a single container to a serverless endpoint. We use two models for our ensemble stack: CatBoost and XGBoost (both of which are boosting ensembles). For our data, we use the diabetes dataset [2] from the scikit-learn library: it consists of 10 features (age, sex, body mass index, blood pressure, and six blood serum measurements), and our model predicts the disease progression 1 year after the baseline features were collected (a regression task).

The full code repository can be found on GitHub.

Train multiple models in a single SageMaker job

For training our models, we use SageMaker training jobs in Script mode. With Script mode, you can write custom training (and later inference code) while using SageMaker framework containers. Framework containers enable you to use ready-made environments managed by AWS that include all necessary configuration and modules. To demonstrate how you can customize a framework container, as an example, we use the pre-built SKLearn container, which doesn’t include the XGBoost and CatBoost packages. There are two options to add these packages: either extend the built-in container to install CatBoost and XGBoost (and then deploy as a custom container), or use the SageMaker training job script mode feature, which allows you to provide a requirements.txt file when creating the training estimator. The SageMaker training job installs the listed libraries in the requirements.txt file during run time. This way, you don’t need to manage your own Docker image repository and it provides more flexibility to running training scripts that need additional Python packages.

The following code block shows the code we use to start the training. The entry_point parameter points to our training script. We also use two of the SageMaker SDK API’s compelling features:

  • First, we specify the local path to our source directory and dependencies in the source_dir and dependencies parameters, respectively. The SDK will compress and upload those directories to Amazon Simple Storage Service (Amazon S3) and SageMaker will make them available on the training instance under the working directory /opt/ml/code.
  • Second, we use the SDK SKLearn estimator object with our preferred Python and framework version, so that SageMaker will pull the corresponding container. We have also defined a custom training metric ‘validation:rmse‘, which will be emitted in the training logs and captured by SageMaker. Later, we use this metric as the objective metric in the tuning job.

from sagemaker.sklearn.estimator import SKLearn

hyperparameters = {"num_round": 6, "max_depth": 5}
estimator_parameters = {
    "entry_point": "multi_model_hpo.py",
    "source_dir": "code",
    "dependencies": ["my_custom_library"],
    "instance_type": training_instance_type,
    "instance_count": 1,
    "hyperparameters": hyperparameters,
    "role": role,
    "base_job_name": "xgboost-model",
    "framework_version": "1.0-1",
    "keep_alive_period_in_seconds": 60,
    "metric_definitions": [
        {'Name': 'validation:rmse', 'Regex': 'validation-rmse:(.*?);'}
    ]
}
estimator = SKLearn(**estimator_parameters)

Next, we write our training script (multi_model_hpo.py). Our script follows a simple flow: capture hyperparameters with which the job was configured and train the CatBoost model and XGBoost model. We also implement a k-fold cross validation function. See the following code:

import argparse
import os

if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    # Sagemaker specific arguments. Defaults are set in the environment variables.
    parser.add_argument("--output-data-dir", type=str, default=os.environ["SM_OUTPUT_DATA_DIR"])
    parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"])
    parser.add_argument("--train", type=str, default=os.environ["SM_CHANNEL_TRAIN"])
    parser.add_argument("--validation", type=str, default=os.environ["SM_CHANNEL_VALIDATION"])
    .
    .
    .
    
    """
    Train catboost
    """
    
    K = args.k_fold    
    catboost_hyperparameters = {
        "max_depth": args.max_depth,
        "eta": args.eta,
    }
    rmse_list, model_catboost = cross_validation_catboost(train_df, K, catboost_hyperparameters)
    .
    .
    .
    
    """
    Train the XGBoost model
    """

    hyperparameters = {
        "max_depth": args.max_depth,
        "eta": args.eta,
        "objective": args.objective,
        "num_round": args.num_round,
    }

    rmse_list, model_xgb = cross_validation(train_df, K, hyperparameters)

After the models are trained, we calculate the mean of both the CatBoost and XGBoost predictions. The result, pred_mean, is our ensemble’s final prediction. Then, we determine the mean_squared_error against the validation set. val_rmse is used for the evaluation of the whole ensemble during training. Notice that we also print the RMSE value in a pattern that fits the regex we used in the metric_definitions. Later, SageMaker Automatic Model Tuning will use that to capture the objective metric. See the following code:

pred_mean = np.mean(np.array([pred_catboost, pred_xgb]), axis=0)
val_rmse = mean_squared_error(y_validation, pred_mean, squared=False)
print(f"Final evaluation result: validation-rmse:{val_rmse}")

Finally, our script saves both model artifacts to the output folder located at /opt/ml/model.

When a training job is complete, SageMaker packages and copies the content of the /opt/ml/model directory as a single object in compressed TAR format to the S3 location that you specified in the job configuration. In our case, SageMaker bundles the two models in a TAR file and uploads it to Amazon S3 at the end of the training job. See the following code:

model_file_name = 'catboost-regressor-model.dump'

# Save CatBoost model
path = os.path.join(args.model_dir, model_file_name)
print('saving model file to {}'.format(path))
model.save_model(path)
.
.
.
# Save XGBoost model
model_location = args.model_dir + "/xgboost-model"
pickle.dump(model, open(model_location, "wb"))
logging.info("Stored trained model at {}".format(model_location))

In summary, note that with this procedure we downloaded the data once and trained two models within a single training job.

Automatic ensemble model tuning

Because we’re building a collection of ML models, exploring all of the possible hyperparameter permutations is impractical. SageMaker offers Automatic Model Tuning (AMT), which looks for the best model hyperparameters by focusing on the most promising combinations of values within ranges that you specify (it’s up to you to define the right ranges to explore). SageMaker supports multiple optimization methods for you to choose from.

We start by defining the two parts of the optimization process: the objective metric and hyperparameters we want to tune. In our example, we use the validation RMSE as the target metric and we tune eta and max_depth (for other hyperparameters, refer to XGBoost Hyperparameters and CatBoost hyperparameters):

from sagemaker.tuner import (
    IntegerParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

hyperparameter_ranges = {
    "eta": ContinuousParameter(0.2, 0.3),
    "max_depth": IntegerParameter(3, 4)
}
metric_definitions = [{"Name": "validation:rmse", "Regex": r"validation-rmse:([0-9\.]+)"}]
objective_metric_name = "validation:rmse"

We also need to ensure in the training script that our hyperparameters are not hardcoded and are pulled from the SageMaker runtime arguments:

catboost_hyperparameters = {
    "max_depth": args.max_depth,
    "eta": args.eta,
}

SageMaker also writes the hyperparameters to a JSON file, which can be read from /opt/ml/input/config/hyperparameters.json on the training instance.
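
For example, if you prefer not to rely on the argparse defaults, a minimal sketch of reading that file directly could look like the following (the key names and default values are illustrative):

import json

# Standard location where SageMaker writes the training job's hyperparameters
with open("/opt/ml/input/config/hyperparameters.json") as f:
    hp = json.load(f)

# All values arrive as strings, so cast them as needed
eta = float(hp.get("eta", 0.3))
max_depth = int(hp.get("max_depth", 5))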

Similarly, we capture the hyperparameters for the XGBoost model (notice that objective and num_round aren’t tuned):

hyperparameters = {
    "max_depth": args.max_depth,
    "eta": args.eta,
    "objective": args.objective,
    "num_round": args.num_round,
}

Finally, we launch the hyperparameter tuning job using these configurations:

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions=metric_definitions,
    max_jobs=4,
    max_parallel_jobs=2,
    objective_type='Minimize'
)
tuner.fit({"train": train_location, "validation": validation_location}, include_cls_metadata=False)

When the job is complete, you can retrieve the values for the best training job (with minimal RMSE):

job_name = tuner.latest_tuning_job.name
attached_tuner = HyperparameterTuner.attach(job_name)
attached_tuner.describe()["BestTrainingJob"]

For more information on AMT, refer to Perform Automatic Model Tuning with SageMaker.

Deployment

To deploy our custom ensemble, we need to provide a script to handle the inference request and configure SageMaker hosting. In this example, we used a single file that includes both the training and inference code (multi_model_hpo.py). SageMaker runs the code under if __name__ == "__main__" for training, and uses the functions model_fn, input_fn, and predict_fn when deploying and serving the model.

Inference script

As with training, we use the SageMaker SKLearn framework container with our own inference script. The script will implement three methods required by SageMaker.

First, the model_fn method reads our saved model artifact files and loads them into memory. In our case, the method returns our ensemble as all_model, which is a Python list, but you can also use a dictionary with model names as keys.

def model_fn(model_dir):
    catboost_model = CatBoostRegressor()
    catboost_model.load_model(os.path.join(model_dir, model_file_name))
    
    model_file = "xgboost-model"
    model = pickle.load(open(os.path.join(model_dir, model_file), "rb"))
    
    all_model = [catboost_model, model]
    return all_model

Second, the input_fn method deserializes the request input data to be passed to our inference handler. For more information about input handlers, refer to Adapting Your Own Inference Container.

def input_fn(input_data, content_type):
    dtype=None
    payload = StringIO(input_data)
    return np.genfromtxt(payload, dtype=dtype, delimiter=",")

Third, the predict_fn method is responsible for getting predictions from the models. The method takes the model and the data returned from input_fn as parameters and returns the final prediction. In our example, we get the CatBoost result from the model list first member (model[0]) and the XGBoost from the second member (model[1]), and we use a blending function that returns the mean of both predictions:

def predict_fn(input_data, model):
    # CatBoost prediction (first member of the ensemble list)
    predictions_catb = model[0].predict(input_data)

    # XGBoost prediction (second member of the ensemble list)
    dtest = xgb.DMatrix(input_data)
    predictions_xgb = model[1].predict(dtest,
                                       ntree_limit=getattr(model[1], "best_ntree_limit", 0),
                                       validate_features=False)

    # Blend the two models by averaging their predictions
    return np.mean(np.array([predictions_catb, predictions_xgb]), axis=0)

Now that we have our trained models and inference script, we can configure the environment to deploy our ensemble.

SageMaker Serverless Inference

Although there are many hosting options in SageMaker, in this example, we use a serverless endpoint. Serverless endpoints automatically launch compute resources and scale them in and out depending on traffic. This takes away the undifferentiated heavy lifting of managing servers. This option is ideal for workloads that have idle periods between traffic spurts and can tolerate cold starts.

Configuring the serverless endpoint is straightforward because we don’t need to choose instance types or manage scaling policies. We only need to provide two parameters: memory size and maximum concurrency. The serverless endpoint automatically assigns compute resources proportional to the memory you select. If you choose a larger memory size, your container has access to more vCPUs. You should always choose your endpoint’s memory size according to your model size. The second parameter we need to provide is maximum concurrency. For a single endpoint, this parameter can be set up to 200 (as of this writing, the limit for total number of serverless endpoints in a Region is 50). You should note that the maximum concurrency for an individual endpoint prevents that endpoint from taking up all the invocations allowed for your account, because any endpoint invocations beyond the maximum are throttled (for more information about the total concurrency for all serverless endpoints per Region, refer to Amazon SageMaker endpoints and quotas).

from sagemaker.serverless.serverless_inference_config import ServerlessInferenceConfig
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=6144,
    max_concurrency=1,
) 

Now that we configured the endpoint, we can finally deploy the model that was selected in our hyperparameter optimization job:

estimator = attached_tuner.best_estimator()
predictor = estimator.deploy(serverless_inference_config=serverless_config)
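
Once the endpoint is in service, you can invoke it with a comma-separated feature row, which matches what our input_fn expects. The following is a minimal sketch; the feature values are placeholders, and sending the payload with the CSV serializer is an assumption about how you choose to call the endpoint:

from sagemaker.serializers import CSVSerializer

# Send the payload as CSV so that input_fn can parse it with np.genfromtxt
predictor.serializer = CSVSerializer()

# Placeholder feature vector; replace with a row matching your training features
sample = [0.12, 3.4, 5, 1.0]
blended_prediction = predictor.predict(sample)
print(blended_prediction)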

Clean up

Even though serverless endpoints have zero cost when not being used, when you have finished running this example, you should make sure to delete the endpoint:

predictor.delete_endpoint()

Conclusion

In this post, we covered one approach to train, optimize, and deploy a custom ensemble. We detailed the process of using a single training job to train multiple models, how to use automatic model tuning to optimize the ensemble hyperparameters, and how to deploy a single serverless endpoint that blends the inferences from multiple models.

Using this method addresses potential cost and operational issues. The cost of a training job is based on the resources you use for the duration of usage. By downloading the data only once to train the two models, we halved the job’s data download phase and the volume used to store the data, thereby reducing the training job’s overall cost. Furthermore, the AMT job ran four training jobs, each with that same reduced time and storage, so the saving is multiplied by four. Finally, with model deployment on a serverless endpoint, you also pay for the amount of data processed, so by invoking the endpoint only once for two models, you pay half of the I/O data charges.
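
To make this reasoning concrete, the following back-of-the-envelope sketch uses purely hypothetical numbers (the download time, training time, and per-minute price are illustrative, not measured):

# Hypothetical values for illustration only
download_minutes = 4          # time to download the dataset once
training_minutes = 20         # time to train one model
price_per_minute = 0.02       # made-up instance price

# Two separate jobs: each downloads the data and trains one model
separate_jobs = 2 * (download_minutes + training_minutes) * price_per_minute

# One combined job: downloads once, trains both models
combined_job = (download_minutes + 2 * training_minutes) * price_per_minute

print(f"Separate jobs: ${separate_jobs:.2f}, combined job: ${combined_job:.2f}")
# The AMT job repeats this four times, so the per-job saving is multiplied by four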

Although this post only showed the benefits with two models, you can use this method to train, tune, and deploy numerous ensemble models to see an even greater effect.



About the Authors

Melanie Li, PhD, is a Senior AI/ML Specialist TAM at AWS based in Sydney, Australia. She helps enterprise customers to build solutions leveraging the state-of-the-art AI/ML tools on AWS and provides guidance on architecting and implementing machine learning solutions with best practices. In her spare time, she loves to explore nature outdoors and spend time with family and friends.

Uri Rosenberg is the AI & ML Specialist Technical Manager for Europe, Middle East, and Africa. Based out of Israel, Uri works to empower enterprise customers to design, build, and operate ML workloads at scale. In his spare time, he enjoys cycling, hiking, and minimizing RMSEs.

Read More

Use a generative AI foundation model for summarization and question answering using your own data

Use a generative AI foundation model for summarization and question answering using your own data

Large language models (LLMs) can be used to analyze complex documents and provide summaries and answers to questions. The post Domain-adaptation Fine-tuning of Foundation Models in Amazon SageMaker JumpStart on Financial data describes how to fine-tune an LLM using your own dataset. Once you have a solid LLM, you’ll want to expose that LLM to business users to process new documents, which could be hundreds of pages long. In this post, we demonstrate how to construct a real-time user interface to let business users process a PDF document of arbitrary length. Once the file is processed, you can summarize the document or ask questions about the content. The sample solution described in this post is available on GitHub.

Working with financial documents

Financial statements like quarterly earnings reports and annual reports to shareholders are often tens or hundreds of pages long. These documents contain a lot of boilerplate language like disclaimers and legal language. If you want to extract the key data points from one of these documents, you need both time and some familiarity with the boilerplate language so you can identify the interesting facts. And of course, you can’t ask an LLM questions about a document it has never seen.

LLMs used for summarization have a limit on the number of tokens (roughly, word pieces) passed into the model, and with some exceptions, this limit is typically no more than a few thousand tokens. That normally precludes the ability to summarize longer documents.

Our solution handles documents that exceed an LLM’s maximum token sequence length, and makes those documents available to the LLM for question answering.

Solution overview

Our design has three important pieces:

  • It has an interactive web application for business users to upload and process PDFs
  • It uses the langchain library to split a large PDF into more manageable chunks
  • It uses the retrieval augmented generation technique to let users ask questions about new data that the LLM hasn’t seen before

As shown in the following diagram, we use a front end implemented with React JavaScript hosted in an Amazon Simple Storage Service (Amazon S3) bucket fronted by Amazon CloudFront. The front-end application lets users upload PDF documents to Amazon S3. After the upload is complete, you can trigger a text extraction job powered by Amazon Textract. As part of the post-processing, an AWS Lambda function inserts special markers into the text indicating page boundaries. When that job is done, you can invoke an API that summarizes the text or answers questions about it.
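
The marker insertion is conceptually simple: walk the Textract output, emit a <PAGE> marker whenever the page number changes, and a <CHUNK> marker every N pages. A minimal sketch of that Lambda post-processing step might look like the following (the chunk size and the way results are paginated are assumptions):

import boto3

textract = boto3.client("textract")

def extract_text_with_markers(job_id, pages_per_chunk=5):
    """Joins Textract LINE blocks and inserts <PAGE>/<CHUNK> markers at page boundaries."""
    lines, current_page = [], 1
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        response = textract.get_document_text_detection(**kwargs)
        for block in response["Blocks"]:
            if block["BlockType"] != "LINE":
                continue
            page = block.get("Page", 1)
            if page != current_page:
                lines.append("<PAGE>")
                if page % pages_per_chunk == 1:
                    lines.append("<CHUNK>")
                current_page = page
            lines.append(block["Text"])
        next_token = response.get("NextToken")
        if not next_token:
            break
    return "\n".join(lines)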

Because some of these steps may take some time, the architecture uses a decoupled asynchronous approach. For example, the call to summarize a document invokes a Lambda function that posts a message to an Amazon Simple Queue Service (Amazon SQS) queue. Another Lambda function picks up that message and starts an Amazon Elastic Container Service (Amazon ECS) AWS Fargate task. The Fargate task calls the Amazon SageMaker inference endpoint. We use a Fargate task here because summarizing a very long PDF may take more time and memory than a Lambda function has available. When the summarization is done, the front-end application can pick up the results from an Amazon DynamoDB table.
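
As a sketch of this decoupling, the first Lambda function only has to enqueue a message, and the second one starts the Fargate task; the queue URL, cluster, task definition, container name, and network settings below are placeholders:

import json
import boto3

sqs = boto3.client("sqs")
ecs = boto3.client("ecs")

def request_summary(event, context):
    # Enqueue the work item instead of summarizing inline
    sqs.send_message(
        QueueUrl="<summarization-queue-url>",
        MessageBody=json.dumps({"document_key": event["document_key"]}),
    )

def start_summarization_task(event, context):
    # Triggered by the SQS queue; launches a Fargate task that calls the SageMaker endpoint
    for record in event["Records"]:
        document_key = json.loads(record["body"])["document_key"]
        ecs.run_task(
            cluster="<ecs-cluster-name>",
            taskDefinition="<summarization-task-def>",
            launchType="FARGATE",
            networkConfiguration={
                "awsvpcConfiguration": {
                    "subnets": ["<subnet-id>"],
                    "assignPublicIp": "DISABLED",
                }
            },
            overrides={
                "containerOverrides": [
                    {
                        "name": "<container-name>",
                        "environment": [{"name": "DOCUMENT_KEY", "value": document_key}],
                    }
                ]
            },
        )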

For summarization, we use AI21’s Summarize model, one of the foundation models available through Amazon SageMaker JumpStart. Although this model handles documents of up to 10,000 words (approximately 40 pages), we use langchain’s text splitter to make sure that each summarization call to the LLM is no more than 10,000 words long. For text generation, we use Cohere’s Medium model, and we use GPT-J for embeddings, both via JumpStart.

Summarization processing

When handling larger documents, we need to define how to split the document into smaller pieces. When we get the text extraction results back from Amazon Textract, we insert markers for larger chunks of text (a configurable number of pages), individual pages, and line breaks. Langchain will split based on those markers and assemble smaller documents that are under the token limit. See the following code:

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    separators=["<CHUNK>", "<PAGE>", "\n"],
    chunk_size=int(chunk_size),
    chunk_overlap=int(chunk_overlap))

with open(local_path) as f:
    doc = f.read()
texts = text_splitter.split_text(doc)
print(f"Number of splits: {len(texts)}")

llm = SageMakerLLM(endpoint_name=endpoint_name)

responses = []
for t in texts:
    r = llm(t)
    responses.append(r)
summary = "\n".join(responses)

The LLM in the summarization chain is a thin wrapper around our SageMaker endpoint:

from typing import List, Optional

import ai21
from langchain.llms.base import LLM  # langchain base class for custom LLM wrappers


class SageMakerLLM(LLM):

    endpoint_name: str

    @property
    def _llm_type(self) -> str:
        return "summarize"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        response = ai21.Summarize.execute(
            source=prompt,
            sourceType="TEXT",
            sm_endpoint=self.endpoint_name,
        )
        return response.summary

Question answering

In the retrieval augmented generation method, we first split the document into smaller segments. We create embeddings for each segment and store them in the open-source Chroma vector database via langchain’s interface. We save the database in an Amazon Elastic File System (Amazon EFS) file system for later use. See the following code:

documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
print(f"Number of splits: {len(texts)}")

embeddings = SMEndpointEmbeddings(
    endpoint_name=endpoint_name,
)
vectordb = Chroma.from_documents(texts, embeddings, persist_directory=persist_directory)
vectordb.persist()

When the embeddings are ready, the user can ask a question. We search the vector database for the text chunks that most closely match the question:

embeddings = SMEndpointEmbeddings(
    endpoint_name=endpoint_embed,
)
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embeddings)
docs = vectordb.similarity_search_with_score(question)

We take the closest matching chunk and use it as context for the text generation model to answer the question:

cohere_client = Client(endpoint_name=endpoint_qa)
context = docs[high_score_idx][0].page_content.replace("\n", "")
qa_prompt = f"Context={context}\nQuestion={question}\nAnswer="
response = cohere_client.generate(prompt=qa_prompt,
                                  max_tokens=512,
                                  temperature=0.25,
                                  return_likelihoods='GENERATION')
answer = response.generations[0].text.strip().replace('\n', '')

User experience

Although LLMs represent advanced data science, most of the use cases for LLMs ultimately involve interaction with non-technical users. Our example web application handles an interactive use case where business users can upload and process a new PDF document.

The following diagram shows the user interface. A user starts by uploading a PDF. After the document is stored in Amazon S3, the user is able to start the text extraction job. When that’s complete, the user can invoke the summarization task or ask questions. The user interface exposes some advanced options like the chunk size and chunk overlap, which would be useful for advanced users who are testing the application on new documents.

User interface

Next steps

LLMs provide significant new information retrieval capabilities. Business users need convenient access to those capabilities. There are two directions for future work to consider:

  • Take advantage of the powerful LLMs already available in JumpStart foundation models. With just a few lines of code, our sample application could deploy and make use of advanced LLMs from AI21 and Cohere for text summarization and generation.
  • Make these capabilities accessible to non-technical users. A prerequisite to processing PDF documents is extracting text from the document, and summarization jobs may take several minutes to run. That calls for a simple user interface with asynchronous backend processing capabilities, which is easy to design using cloud-native services like Lambda and Fargate.

We also note that a PDF document is semi-structured information. Important cues like section headings are difficult to identify programmatically, because they rely on font sizes and other visual indicators. Identifying the underlying structure of information helps the LLM process the data more accurately, at least until such time that LLMs can handle input of unbounded length.

Conclusion

In this post, we showed how to build an interactive web application that lets business users upload and process PDF documents for summarization and question answering. We saw how to take advantage of JumpStart foundation models to access advanced LLMs, and how to use text splitting and retrieval augmented generation techniques to process longer documents and make them available as information to the LLM.

At this point in time, there is no reason not to make these powerful capabilities available to your users. We encourage you to start using the JumpStart foundation models today.


About the author

Randy DeFauw is a Senior Principal Solutions Architect at AWS. He holds an MSEE from the University of Michigan, where he worked on computer vision for autonomous vehicles. He also holds an MBA from Colorado State University. Randy has held a variety of positions in the technology space, ranging from software engineering to product management. He entered the big data space in 2013 and continues to explore that area. He is actively working on projects in the ML space and has presented at numerous conferences, including Strata and GlueCon.

Read More

Integrate Amazon SageMaker Model Cards with the model registry

Integrate Amazon SageMaker Model Cards with the model registry

Amazon SageMaker Model Cards enable you to standardize how models are documented, thereby achieving visibility into the lifecycle of a model across design, building, training, and evaluation. Model cards are intended to be a single source of truth for business and technical metadata about the model that can reliably be used for auditing and documentation purposes. They provide a factsheet of the model that is important for model governance.

Until now, model cards were logically associated with a model in the Amazon SageMaker Model Registry using a model name match. However, when solving a business problem through a machine learning (ML) model, customers iterate on the problem and create multiple versions of the model, which they need to operationalize and govern. Therefore, they need the ability to associate a model card with a particular model version.

In this post, we discuss a new feature that supports integrating model cards with the model registry at the deployed model version level. We discuss the solution architecture and best practices for managing model card versions, and walk through how to set up, operationalize, and govern the model card integration with the model version in the model registry.

Solution overview

SageMaker model cards help you standardize documenting your models from a governance perspective, and the SageMaker model registry helps you deploy and operationalize ML models. The model registry supports a hierarchical structure for organizing and storing ML models with model metadata information.

When an organization solves a business problem using ML, such as a customer churn prediction, we recommend the following steps:

  1. Create a model card for the business problem to be solved.
  2. Create a model package group for the business problem to be solved.
  3. Build, train, evaluate, and register the first model package version (for example, Customer Churn V1).
  4. Update the model card, linking the model package version to the model card.
  5. As you iterate on a new model package version, clone the model card from the previous version and link it to the new model package version (for example, Customer Churn V2).

The following figure illustrates how a SageMaker model card integrates with the model registry.

As illustrated in the preceding diagram, the integration of SageMaker model cards and the model registry allows you to associate a model card with a specific model version in the model registry. This enables you to establish a single source of truth for your registered model versions, with comprehensive and standardized documentation across all stages of the model’s journey on SageMaker, facilitating discoverability and promoting governance, compliance, and accountability throughout the model lifecycle.

Best practices for managing model cards

Operating machine learning workloads with governance is a critical requirement for many enterprise organizations today, notably in highly regulated industries. As part of those requirements, AWS provides several services that enable reliable operation of the ML environment.

SageMaker model cards document critical details about your ML models in a single place for streamlined governance and reporting. Model cards help you capture details such as the intended use and risk rating of a model, training details and metrics, evaluation results and observations, and additional call-outs such as considerations, recommendations, and custom information.

Model cards need to be managed and updated as part of your development process, throughout the ML lifecycle. They are an important part of continuous delivery and pipelines in ML. In the same way that a Well-Architected ML project implements continuous integration and continuous delivery (CI/CD) under the umbrella of MLOps, a continuous ML documentation process is a critical capability in many regulated industries and for higher-risk use cases. Model cards are part of the best practices for responsible and transparent ML development.

The following diagram shows how model cards should be part of a development lifecycle.

Consider the following best practices:

  • We recommend creating model cards early in your project lifecycle. In the first phase of the project, when you are working on identifying the business goal and framing the ML problem, you should initiate the creation of the model card. As you work through the different steps of business requirements and important performance metrics, you can create the model card in a draft status and determine the business details and intended uses.
  • As part of your model development lifecycle phase, you should use the model registry to catalog models for production, manage model versions, and associate metadata with a model. The model registry enables lineage tracking.
  • After you have iterated successfully and are ready to deploy your model to production, it’s time to update the model card. In the deployment lifecycle phase, you can update the model details of the model card. You should also update training details, evaluation details, ethical considerations, and caveats and recommendations.

Model cards have versions associated with them. A given model card version is immutable across all attributes other than the model card status. If you make any other changes to the model card, such as evaluation metrics, description, or intended uses, SageMaker creates a new version of the model card to reflect the updated information. This ensures that a model card, once created, can’t be tampered with. Additionally, each unique model name can have only one associated model card, and this association can’t be changed after you create the model card.

ML models are dynamic and workflow automation components enable you to easily scale your ability to build, train, test, and deploy hundreds of models in production, iterate faster, reduce errors due to manual orchestration, and build repeatable mechanisms.

Therefore, the lifecycle of your model cards will look as described in the following diagram. Every time you update your model card through the model lifecycle, you automatically create a new version of the model card. Every time you iterate on a new model version, you create a new model card that can inherit some information from the previous model version’s card and that follows the same lifecycle.

Prerequisites

This post assumes that you already have models in your model registry. If you want to follow along, you can use the following SageMaker example on GitHub to populate your model registry: SageMaker Pipelines integration with Model Monitor and Clarify.

Integrate a model card with the model version in the model registry

In this example, we have the model-monitor-clarify-group package in our model registry.

In this package, two model versions are available.

For this example, we link Version 1 of the model to a new model card. In the model registry, you can see the details for Version 1.

We can now use the new feature in the SageMaker Python SDK. Using the ModelPackage class from the sagemaker.model_card module, you can select the specific model version in the model registry that you would like to link the model card to.

You can now create a new model card for the model version and specify the model_package_details parameter with the previous model package retrieved. You need to populate the model card with all the additional details necessary. For this post, we create a simple model card as an example.

You can then use that definition to create a model card using the SageMaker Python SDK.
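
The console screenshots aren’t reproduced here, but a minimal sketch of those steps with the SageMaker Python SDK could look like the following (the model package ARN and card name are placeholders, the import path follows the post’s description of the sagemaker.model_card module, and the card content is kept deliberately simple):

from sagemaker import Session
from sagemaker.model_card import ModelCard, ModelCardStatusEnum, ModelPackage

# Select the registered model version (Version 1 in our example) by its ARN
mp_details = ModelPackage.from_model_package_arn("<model-package-arn-version-1>")

# Create a simple model card linked to that model package version
my_card = ModelCard(
    name="<model_card_name>",
    status=ModelCardStatusEnum.DRAFT,
    model_package_details=mp_details,
    sagemaker_session=Session(),
)
my_card.create()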

When loading the model card again, you can see the associated model under model_package_details.

You also have the option to update an existing model card with the model package, as shown in the following code snippet:

my_card = ModelCard.load("<model_card_name>")
mp_details = ModelPackage.from_model_package_arn("<arn>")
my_card.model_package_details = mp_details
my_card.update()

Finally, when creating or updating a new model package version in an existing model package, if a model card already exists in that model package group, some information such as the business details and intended uses can be carried over to the new model card.

Clean up

If you created resources using the notebook mentioned in the prerequisites section, you are responsible for cleaning them up. Follow the instructions in the notebook to delete those resources.

Conclusion

In this post, we discussed how to integrate a SageMaker model card with a model version in the model registry. We shared the solution architecture with best practices for implementing a model card and showed how to set up and operationalize a model card to improve your model governance posture. We encourage you to try out this solution and share your feedback in the comments section.


About the Authors

Ram Vittal is a Principal ML Solutions Architect at AWS. He has over 20 years of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure and scalable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he rides his motorcycle and walks with his 2-year-old sheep-a-doodle!

Natacha Fort is the Government Data Science Lead for Public Sector Australia and New Zealand, Principal SA at AWS. She helps organizations navigate their machine learning journey, supporting them from framing the machine learning problem to deploying into production, all the while making sure the best architecture practices are in place to ensure their success. Natacha focuses with organizations on MLOps and responsible AI.

Read More