
Efficient Large-Scale Training with PyTorch FSDP and AWS

Cutting-edge AI models are becoming extremely large. The cost and overhead of training these models are increasing rapidly, and finding the right training regime involves large amounts of engineering and guesswork. FSDP reduces these costs significantly by enabling you to train much larger models with the same amount of resources. FSDP lowers the memory footprint on your GPUs, and is usable via a lightweight configuration that requires substantially less effort, typically just a few lines of code.

The main performance gains in FSDP come from maximizing the overlap between network communication and model computation, and eliminating the memory redundancy inherent in traditional data parallel training (DDP). PyTorch FSDP can train models approximately 4x larger on the same server resources as DDP and 20x larger if we combine activation checkpointing and activation offloading.

As of PyTorch 1.12, FSDP is in beta status, and has added a number of new features that can be tuned to further accelerate your model training.

In this series of blog posts, we will explain multiple performance optimizations you can run with FSDP to boost your distributed training speed and model sizes within the context of your available server resources. We use HuggingFace T5 3B and 11B, as well as DeepViT, in fine-tuning mode, as the running examples throughout the series.

As a preview of some of the optimizations discussed in this series, we show the before and after performance measured in TFLOPS below (note that these results can vary based on your server resources and model architecture).


*T5 3B performance measured on AWS A100 and A10 servers. Original with no optimizations and Tuned with the applied optimizations.*


*T5 11B performance measured on A100 servers. Original with no optimizations and Tuned with the applied optimizations.*

In this first post, we will provide a quick overview of FSDP and how it can make training large-scale AI models more efficient. We will briefly highlight the multiple performance options available, and dive deeper into the details on these in upcoming posts. We will then conclude with an overview of how to leverage AWS ParallelCluster for large-scale training with FSDP.

| Optimization | T5 Model | Throughput Improvement |
| --- | --- | --- |
| Mixed Precision | 3 B | 5x |
| | 11 B | 10x |
| Activation Checkpointing (AC) | 3 B | 10x |
| | 11 B | 100x |
| Transformer Wrapping Policy | 3 B | 2x |
| | 11 B | Unable to run the experiment without the transformer wrapping policy |
| Full Shard Strategy | 3 B | 1.5x |
| | 11 B | Not able to run with Zero2 |

Performance optimization gains on T5 models over non-optimized.

In our experiments with the T5 3B model, using the transformer wrapping policy resulted in >2x higher throughput measured in TFLOPS versus the default wrapping policy. Activation checkpointing resulted in a 10x improvement by reinvesting the memory freed by the checkpoints into a larger batch size. Mixed precision with BFloat16 resulted in a ~5x improvement versus FP32, and finally the full sharding strategy versus Zero2 (DDP) resulted in a 1.5x improvement.

We ran similar experiments for a larger model, T5 11B, but the larger model size resulted in some changes to the experiment space. Specifically, we found that two optimizations, the transformer wrapping policy and activation checkpointing, were needed to enable us to run these experiments on 3 nodes (each node had 8 A100 GPUs with 80 GB of memory). With these optimizations, we could fit a batch size of 50 and get higher throughput compared to removing either one of them. Thus, rather than toggling a single optimization on/off as with the 3B model, the larger model experiments were done with one of the three optimizations toggled on/off while always running the other two, in order to allow a usable batch size for both test states of each item.

Based on TFLOPS comparisons, with the 11B model we saw even more payoff from the optimizations. Mixed precision (~10x improvement) and activation checkpointing (~100x improvement) had a much larger impact with the 11B model compared to the 3B parameter model. With mixed precision we could fit ~2x larger batch sizes, and with activation checkpointing >15x larger batch sizes (from 3 with no activation checkpointing to 50 with it), which translated into large throughput improvements.

We also observed that for these larger models (>3B parameters), using the Zero2 sharding strategy left minimal room in memory for the batch data, forcing very small batch sizes (e.g., 1-2). This essentially makes the full sharding strategy a necessity to enable fitting larger batch sizes.
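
For reference, the sharding strategy is exposed as a single argument on the FSDP wrapper. Below is a minimal sketch (assuming `model` is already constructed and the process group is initialized); ShardingStrategy.FULL_SHARD shards parameters, gradients, and optimizer states, while ShardingStrategy.SHARD_GRAD_OP corresponds to the Zero2-style behavior discussed above.

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

# Full sharding: shard parameters, gradients, and optimizer states across ranks.
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # swap in ShardingStrategy.SHARD_GRAD_OP for Zero2-style sharding
    device_id=torch.cuda.current_device(),
)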

Note – this tutorial assumes a basic understanding of FSDP. To learn more about the basics of FSDP, please refer to the getting started and advanced FSDP tutorials.

**What is FSDP? How does it make Large-Scale Training More Efficient?**

FSDP expands upon distributed data parallel by parallelizing not just the data, but also sharding the model parameters, the optimizer states, and the gradients associated with the model. Specifically, each GPU only stores a subset of the entire model **and the associated subset of optimizer states and gradients.**

To show the evolution of distributed training, we can start from the beginning, where AI models were simply trained on a single GPU.

DDP (Distributed Data Parallel) was the initial step up from training with only a single GPU, and was an effort to address the data and model size growth, where multiple GPUs each housed their own copy of the same model. The gain here is that the data for each batch could be split and processed independently on each GPU, all at the same time, thus parallelizing the processing of the data set and scaling training speed with the number of GPUs. The tradeoff is the need to communicate the gradients between the GPUs to synchronize the models after the backward pass.

FSDP expands on scaling models by removing the redundant optimizer calculations and state storage, as well as the redundant gradient and model parameter storage, that are present in DDP. This redundancy reduction, along with increased communication overlap, where model parameter communication takes place at the same time as model computation, is what allows FSDP to train much larger models with the same resources as DDP.

A key point is that this efficiency also allows training AI models that are too large to fit on a single GPU. The model size available for training is now the aggregate memory of all GPUs, rather than the memory of a single GPU. (And as a point of note, FSDP can go beyond aggregate GPU memory by leveraging CPU memory as well, though we will not directly cover this aspect here.)

As discussed in a previous blog post, with DDP the largest model that we could train on 32 A100 GPUs with 40 GB of memory (4 nodes) was up to 3B parameters with a batch size of 128, with the help of activation checkpointing. By contrast, using FSDP we were able to train up to an 81B parameter model by combining activation checkpointing with activation and parameter offloading. In another experiment, we benchmarked a 1T parameter model with FSDP using 512 GPUs.


For intuition on the parameter level workings of FSDP, below we show an animation detailing how the model parameters are sharded and communicated assuming a two GPU scenario and a simple 8 parameter model:


Above – the animation walks through the steps involved in the initial sharding of the model amongst ranks, after which we start the all_gathers and the forward pass.


We continue through the model with the forward pass. After each FSDP unit completes, non-locally owned params are dropped to free memory, and optionally activations can be checkpointed. This continues until we finish the forward pass and compute the loss.


_During the backward pass, another all_gather is used to load the parameters and the gradients are computed. These gradients are then reduce_scattered so that the local owners of each param can aggregate and prepare to update the weights._


_Finally, each rank passes the summed gradients through the optimizer states and updates the weights to complete the mini-batch._

With the model now distributed across the entire set of available GPUs, the logical question is how data moves through the model given this sharding of model parameters.

This is accomplished by FSDP coordinating with all GPUs to effectively share (communicate) the respective parts of the model. The model is decomposed into FSDP units, and the parameters within each unit are flattened and then sharded across all GPUs. Within each FSDP unit, GPUs are assigned interleaving ownership of individual model parameters.

By interleaving, we mean the following: assuming two GPUs with IDs 1 and 2, the FSDP unit ownership pattern would be [12121212], rather than a contiguous chunk such as [11112222].

During training, an all_gather is initiated and the locally owned model parameters within a FSDP unit are shared by the owner GPU with the other non-owners, when they need it, on a ‘just in time’ type basis. FSDP prefetches parameters to overlap all_gather communication with computation.

When those requested parameters arrive, the GPU uses the delivered parameters, in combination with the parameters it already owns, to create a fully populated FSDP unit. Thus there is a moment where each GPU hits peak memory usage while holding a fully populated FSDP unit.

It then processes the data through the FSDP unit, and drops the parameters it received from other GPUs to free up memory for the next unit. The process continues over and over, proceeding through the entire model to complete the forward pass, and is then repeated (in general) for the backward pass.
(Note – this is a simplified version for understanding; there is additional complexity, but this should help construct a basic mental model of the FSDP process.)

This eliminates much of the memory redundancy present in DDP, but imposes the cost of higher amounts of network communication to shuttle these requested parameters back and forth amongst all the GPUs. **Overlapping the communication timing with the computation taking place is the basis of many of the performance improvements we’ll discuss in this series.** The key gains are frequently based on the fact that communication can often take place at the same time as computation. As you can surmise, having high communication speed is vital for FSDP performance.
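
As a small preview of how this overlap is exposed as a tuning knob (prefetching is covered in more depth later in the series), the sketch below enables FSDP's backward prefetch, which requests the next unit's parameters while gradients for the current unit are still being computed. This assumes `model` is already built and the process group is initialized.

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import BackwardPrefetch

# Prefetch the next FSDP unit's parameters during the backward pass so the
# all_gather communication overlaps with gradient computation.
model = FSDP(
    model,
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
    device_id=torch.cuda.current_device(),
)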

How do I optimize my training with FSDP?

There are four main performance improvements we will cover – the transformer wrapper, activation checkpointing, mixed precision, and selecting the proper sharding strategy. The flowchart below will help as a checklist for tuning options that we will discuss in this post.


Wrapping policy – for transformers, use the transformer wrapping policy

The first performance optimization is leveraging the FSDP transformer wrapper for transformer models.

One of the pre-defined wrapping policies is size_based_auto_wrap_policy. With this policy, FSDP traverses the module structure from the bottom up and creates a new FSDP unit once the current unit has at least the ‘min_num_params’ specified in the size policy (this defaults to 1e8, or 100M). If a module cannot be created as an FSDP unit, FSDP will continue to check its parent module. This size-based wrapping policy may not be ideal for some model structures; the PyTorch distributed team is actively working on a new default wrapping policy for the next release that is based on size as well as module execution order, so users can simply tune the size and achieve optimized performance.
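
For reference, here is a minimal sketch of configuring the size-based policy (assuming `model` is already built); min_num_params is the only knob:

import functools

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# Create a new FSDP unit once a module subtree reaches ~100M parameters.
size_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=int(1e8))

model = FSDP(
    model,
    auto_wrap_policy=size_policy,
    device_id=torch.cuda.current_device(),
)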

In the current release, you can greatly improve your performance when running Transformer models by using the ‘transformer wrapper’. You will need to provide the appropriate layer class for your model. Here, layer class is the class that houses the Multi-Head Attention and Feed Forward Network.

FSDP will then form the FSDP units around the layer class rather than arbitrary breaks based on parameter size. By sharding the model around layer classes that are uniformly repeated within the transformer, FSDP can create uniform FSDP units that better balance the overlap of computation and communication. By contrast, size-based wrapping can produce very uneven or skewed shards for models, which then have poorer matching of compute and communication overlap. As discussed earlier, the main driver of FSDP's high performance is the overlap of communication and computation, which is why the transformer wrapper provides improved performance. Note that the transformer wrapper can also be used for non-transformer models if those models have a list of uniform layers.

Let’s compare the performance difference on a T5, 3B parameter model when running under the default wrapper and the transformer wrapper.

For default wrapping, we don’t need to take any action – we simply pass the model to FSDP as shown:

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = FSDP(
    model,
    device_id=torch.cuda.current_device(),
)

In this case FSDP will simply wrap the whole model in a single FSDP unit.

Running on an NVIDIA A100-SXM4-40GB with 8 GPUs, we are able to reach 2.3 TFlops and 95% GPU memory utilization with a batch size of 14.

However, since T5 is a transformer model, we are better served to leverage the transformer wrapper for this model.

To use that, we need to isolate the layer class for the transformer, and then pass it in to create our transformer wrapper.

import functools

from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.t5.modeling_t5 import T5Block

And now we can create our Transformer wrapper:

transformer_auto_wrapper_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={
            T5Block,  # < ---- Your Transformer layer class
        },
    )

With our model-aware wrapper ready, we can initialize FSDP:

# invoke FSDP with your transformer wrapper policy:
model = FSDP(
        model,
        auto_wrap_policy=transformer_auto_wrapper_policy,
        device_id=torch.cuda.current_device(),  # streaming init
    )

Running this wrapped model, we can see some substantial performance gains.
We can fit nearly double the batch size, going to 28, and with better memory and communication efficiency, we see a TFlops increase to 5.07 from 2.3.

Thus, we’ve more than doubled our training throughput (2.19x) by providing greater model information to FSDP! The transformer wrapping policy results in more fine-grained and balanced FSDP units, each holding a layer class, which leads to more effective communication-computation overlap.


Above: Graphical comparison of TFlops based on wrapper type

If you are training a transformer model, it pays to configure your training with FSDP using the transformer wrapper. For more information on how to isolate your layer class, please see our in-depth video on transformer wrapping here, where we walk through a number of transformers and show where the layer class can be found.

Mixed precision – _use BF16 if you have an Ampere architecture GPU_

FSDP supports a flexible mixed precision policy that gives you granular control over parameters, gradients and buffer data types. This lets you easily leverage BFloat16 or FP16 to increase your training speed by up to 70%.

*Note that BFloat16 is only available on Ampere architecture GPUs. On AWS, this is available with p4d and g5 instances.

By way of comparison, we can show a 77% speed improvement when comparing fully tuned BFloat16 vs FP32 on an 8B DeepViT model.


We have obtained even greater acceleration using BFloat16 in fine-tuning a 3B HuggingFace T5 model as shown in the figures below. We observed that because of the lower precision the validation loss of BFloat16 is slightly behind in the first few epochs, but it is able to catch up and results in the same final accuracy as FP32.


To use mixed precision, we create a policy with our desired data types and pass it in during FSDP initialization.
To create the policy, we import the MixedPrecision class and then define our custom policy:

import torch
from torch.distributed.fsdp import MixedPrecision

bfSixteen = MixedPrecision(
   # Parameter precision.
   param_dtype=torch.bfloat16,
   # Gradient communication precision.
   reduce_dtype=torch.bfloat16,
   # Buffer precision.
   buffer_dtype=torch.bfloat16,
)

model = FSDP(
       model,
       auto_wrap_policy=transformer_auto_wrapper_policy,
       mixed_precision=bfSixteen)

You can mix and match the precision for parameters, gradients and buffers as you prefer:

comboPolicy = MixedPrecision(
        # Param precision
        param_dtype=torch.bfloat16,
        # Gradient communication precision.
        reduce_dtype=torch.float32,
        # Buffer precision.
        buffer_dtype=torch.float32,
    )

For training with FP16, you will also need to use the ShardedGradScaler, which we will cover in subsequent posts. For BFloat16, no gradient scaler is needed and the policy is a drop-in replacement.
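
As a preview, here is a minimal sketch of what an FP16 policy and training step could look like with the sharded scaler (the policy name and the training-loop variables such as `model`, `batch`, and `optimizer` are illustrative):

import torch
from torch.distributed.fsdp import MixedPrecision
from torch.distributed.fsdp.sharded_grad_scaler import ShardedGradScaler

fpSixteen = MixedPrecision(
    param_dtype=torch.float16,
    reduce_dtype=torch.float16,
    buffer_dtype=torch.float16,
)

scaler = ShardedGradScaler()

# inside the training loop:
loss = model(**batch).loss
scaler.scale(loss).backward()  # scale the loss to avoid FP16 gradient underflow
scaler.step(optimizer)         # unscale the gradients and step the optimizer
scaler.update()                # adjust the scale factor for the next iteration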

AnyPrecision Optimizer – _going beyond mixed precision with full BF16 training_

Mixed precision training, both in FSDP and elsewhere, maintains the working weights in the reduced datatype (BF16 or FP16) while keeping the master weights in full FP32. The reason for the master weights in FP32 is that running in pure BF16 will result in ‘weight stagnation’, where very small weight updates are lost due to the lower precision, and the accuracy flatlines over time while FP32 weights can continue to improve from these small updates.

In order to resolve this dilemma, we can use the new AnyPrecision optimizer available in TorchDistX (Torch Distributed Experimental) that allows you to successfully train and keep the master weights in pure BF16 instead of FP32. In addition, unlike the typical storage of optimizer states in FP32, AnyPrecision is able to maintain states in pure BF16 as well.

AnyPrecision enables pure BF16 training by maintaining an extra buffer that tracks the precision lost during the weight updates and re-applies that during the next update…effectively resolving the weight stagnation issue without requiring FP32.

As a comparison of the throughput gains available with pure BF16 training using AnyPrecision, we ran experiments using FSDP with the T5 11B model with regular FP32 training, mixed precision training with BF16, and pure BF16 training using the AnyPrecision optimizer, on 3 nodes with A100 GPUs as mentioned previously.

As shown above, training with AnyPrecision and pure BF16 resulted in 2x the throughput vs Mixed Precision, and over 20x improvement vs FP32.

The potential tradeoff is the impact on final accuracy – in the cases we tested, the accuracy was equal or better than FP32 due to a regularization effect from the slightly reduced precision, but your results may vary.

The AnyPrecision optimizer is available for you to test with here, and is a drop-in replacement for the AdamW optimizer.
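
As a rough sketch of how this can be wired up (the import path and keyword names below are taken from the TorchDistX repository at the time of writing and may change, so treat them as assumptions rather than a stable API):

import torch
from torchdistx.optimizers import AnyPrecisionAdamW  # assumed import path from TorchDistX

optimizer = AnyPrecisionAdamW(
    model.parameters(),
    lr=4e-4,                        # illustrative learning rate
    use_kahan_summation=True,       # compensation buffer that recovers the precision lost in BF16 updates
    momentum_dtype=torch.bfloat16,  # keep optimizer states in pure BF16 as well
    variance_dtype=torch.bfloat16,
)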

Activation checkpointing – increasing throughput by trading compute for memory


FSDP supports activation checkpointing once the model has been sharded, and makes it easy to implement. The graph above shows ~4x throughput improvement using activation checkpointing.

Activation checkpointing is where the intermediate activations are freed during the forward pass, and a checkpoint is left as a placeholder. This generally increases available GPU memory by over 30%.

The tradeoff is that during the backward pass, these previously removed intermediate activations must be re-calculated using information in the checkpoint (duplicate compute), but by leveraging the increased GPU memory, one can increase the batch size such that net throughput increases substantially.

# verify we have FSDP activation support ready by importing:
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
   checkpoint_wrapper,
   CheckpointImpl,
   apply_activation_checkpointing_wrapper,
)

The steps required to implement activation checkpointing are to first import the FSDP checkpointing functions (as above). We then declare our checkpoint wrapper type, which is non-reentrant, and create a check function to identify which layers to wrap, as follows:

from functools import partial

non_reentrant_wrapper = partial(
    checkpoint_wrapper,
    offload_to_cpu=False,
    checkpoint_impl=CheckpointImpl.NO_REENTRANT,
)

check_fn = lambda submodule: isinstance(submodule, T5Block)

apply_activation_checkpointing_wrapper(
       model, checkpoint_wrapper_fn=non_reentrant_wrapper, check_fn=check_fn
   )

Important note – this must be run after the model has been initialized with FSDP.

Hopefully you’ve seen how some initial tuning with FSDP options can have a large impact on your training performance.

With that, we turn our attention from how to scale within FSDP, to how to scale your server hardware for FSDP using AWS.

Large Scale Training with FSDP on AWS – for multi-node, prioritize a high speed network

AWS provides several services that can be used to run distributed training with FSDP:
Amazon EC2 Accelerated Computing instances, AWS ParallelCluster, and Amazon SageMaker.

In this series of blog posts, we used Amazon EC2 p4d instances in a single-instance multi-GPU configuration and in a multi-instance configuration using AWS ParallelCluster and SageMaker in order to run our training jobs.

Here, we’ll focus specifically on AWS ParallelCluster and provide an overview of how to utilize it for training purposes.

AWS ParallelCluster Setup

AWS ParallelCluster is an open-source cluster management tool that makes it easy for you to deploy and manage High Performance Computing (HPC) clusters on AWS. AWS ParallelCluster uses YAML configuration files to provision all the necessary resources. It also supports multiple instance types, job submission queues, shared file systems like [Amazon EFS](https://aws.amazon.com/efs/) (NFS) or Amazon FSx for Lustre, and job schedulers like AWS Batch and Slurm.


Workflow on Clusters

The high-level idea is to have a cluster with a head node that controls the compute nodes. The actual training job runs on the compute nodes. The overall steps to run a training job on a cluster are as follows:

  1. Set up an AWS ParallelCluster (we discuss this below).
  2. Connect to the head node, import the training code, and set up the environment.
  3. Pull the data and place it in a shared folder that the compute nodes can access (an FSx for Lustre drive).
  4. Run the training job using a job scheduler (in this case Slurm).

Set up AWS ParallelCluster

To set up AWS ParallelCluster:

  1. Deploy a network stack. This step is optional since you could use your account's default VPC and let AWS ParallelCluster create your subnets and security groups. However, we prefer to compartmentalize our desired network infrastructure and do this deployment via a CloudFormation stack.

    Since we deploy a public and a private subnet, we want to create them in an Availability Zone that contains our target instances, in this case p4d. We consult their availability in the region we use (us-east-1) through the following AWS CLI command:

    aws ec2 describe-instance-type-offerings --location-type availability-zone --filters Name=instance-type,Values=p4d.24xlarge --region us-east-1 --output table

    We see three Availability Zones containing p4d instances, and we pick one of them (us-east-1c, yours may be different) when deploying our network stack. This can be done with the AWS Console or the AWS CLI. In our case, we use the latter as follows:

    aws cloudformation create-stack --stack-name VPC-Large-Scale --capabilities CAPABILITY_IAM --template-body file://VPC-Large-Scale.yaml --parameters ParameterKey=SubnetsAZ,ParameterValue=us-east-1c

    CloudFormation will deploy our new VPC, subnets, security groups and endpoints on our behalf. Once done, you can retrieve the IDs of the public and private subnets by querying the stack outputs and the values PublicSubnet and PrivateSubnet.

    For example, using the AWS CLI for the private subnet:

    aws cloudformation describe-stacks --stack-name VPC-Large-Scale --query "Stacks[0].Outputs[?OutputKey=='PrivateSubnet'].OutputValue" --output text

  2. Create the ParallelCluster. The cluster configuration file specifies the resources for our cluster. These resources include the instance type for the head node, the compute nodes, access to S3 buckets, and the shared storage where our data will be located. We will use Amazon FSx for Lustre, which offers a fully managed shared storage service built on Lustre.

    Here is an example of a cluster configuration file. We can use the AWS ParallelCluster CLI to create the cluster. Please note that the private and public subnet IDs will need to be replaced by the ones you retrieved earlier. You will be able to control the cluster using the AWS ParallelCluster CLI to start, stop, pause, etc.

    pcluster create-cluster --cluster-name my-hpc-cluster --cluster-configuration cluster.yaml
    
  3. SSH to the head node – once the cluster is ready, we can connect to the head node using SSH, pull our training code, and place the data in the shared storage specified in the cluster configuration file.

    pcluster ssh --cluster-name cluster -i your-key_pair
    
  4. Launch the training job – now that we have the data and training code, we can launch the Slurm job for training. Here is an example of a Slurm script to launch the job using torchrun.

More details on how to set up the cluster are out of the scope of this post; however, we will have a separate post on it.

What’s next?

With this post we provided a high level overview of FSDP and how it efficiently scales distributed AI training. The flowchart included will help provide a checklist for you to review tuning options discussed such as the transformer wrapper and activation checkpointing.
In the next posts, we will continue with the T5 model and go deeper into each of the topics above, specifically with sharding strategy and other optimizations to provide more insight and details. For now, a good reference for the sharding strategy is in our video tutorial here:

If you have questions or find an issue, please reach out to the authors Less, Hamid, and Geeta, or open an issue on PyTorch GitHub.

Special thanks to:

PyTorch Distributed team, Shen Li, Rohan Varma, Yanli Zhao, Andrew Gu, Anjali Sridhar, Ana Simoes, Pierre-Yves Aquilanti, Sundar Ranganathan, and the broader AWS team for supporting us with providing infrastructure and technical support for running the large scale experiments.

Resources:

FSDP video series

Getting started with FSDP

Advanced tutorial on FSDP

API documentation


“ID + Selfie” – Improving digital identity verification using AWS

The COVID-19 global pandemic has accelerated the need to verify and onboard users online across several industries, such as financial services, insurance, and healthcare. When it comes to user experience it is crucial to provide a frictionless transaction while maintaining a high standard for identity verification.  The question is, how do you verify real people in the digital world?

Amazon Rekognition provides pre-trained facial recognition and analysis capabilities for identity verification to your online applications, such as banking, benefits, ecommerce, and much more.

In this post, we present the “ID + Selfie” identity verification design pattern and sample code you can use to create your own identity verification REST endpoint. This is a common design pattern that you can incorporate into existing or new solutions that require face-based identity verification. The user presents a form of identification like a driver’s license or passport. The user then captures a real-time selfie with the application. We then compare the face from the document with the real-time selfie taken on their device.

The Amazon Rekognition CompareFaces API

At the core of the “ID + Selfie” design pattern is the comparison of the face in the selfie to the face on the identification document. For this, we use the Amazon Rekognition CompareFaces API. The API compares a face in the source input image with a face or faces detected in the target input image. In the following example, we compare a sample driver’s license (left) with a selfie (right).


The following is an example of the API code:

response = client.compare_faces(SimilarityThreshold=80,
                              SourceImage={'Bytes': s_bytes},
                              TargetImage={'Bytes': t_bytes})

for faceMatch in response['FaceMatches']:
    position = faceMatch['Face']['BoundingBox']
    similarity = str(faceMatch['Similarity'])

Several values are returned in the CompareFaces API response. We focus on the Similarity value returned in FaceMatches to validate the selfie matches the ID provided.

Understanding key tuning parameters

SimilarityThreshold is set to 80% by default and will only return results with a similarity score greater than or equal to 80%. Adjust the value by specifying the SimilarityThreshold parameter.

QualityFilter is an input parameter to filter out detected faces that don’t meet a required quality bar. The quality bar is based on a variety of common use cases. Use QualityFilter to set the quality bar by specifying LOW, MEDIUM, or HIGH. If you don’t want to filter poor quality faces, specify NONE. The default value is NONE.
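
For example, a stricter check could raise the threshold and drop low-quality face detections. The following is a minimal sketch (the parameter values are illustrative; client, s_bytes, and t_bytes are the Rekognition client and image bytes from the earlier snippet):

response = client.compare_faces(
    SimilarityThreshold=90,   # only return matches with similarity >= 90%
    QualityFilter='HIGH',     # filter out low-quality face detections
    SourceImage={'Bytes': s_bytes},
    TargetImage={'Bytes': t_bytes},
)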

Solution overview

You can create an “ID + Selfie” API for digital identity verification by deploying the following components:

  • A REST API with a POST method that allows us to send the selfie and identification payload and returns a response, in this case the similarity score
  • A function to receive the payload, convert the images to the proper format, and call the Amazon Rekognition compare_faces API.

We implement Amazon API Gateway for the REST API functionality and AWS Lambda for the function.

The following diagram illustrates the solution architecture and workflow.

The workflow contains the following steps:

  1. The user uploads the required identification document and a selfie.
  2. The client submits the identification document and selfie to the REST endpoint.
  3. The REST endpoint returns a similarity score to the client.
  4. An evaluation is done through business logic in your application. For example, if the similarity score is below 80%, it fails the digital identity check; otherwise it passes the digital identity check.
  5. The client sends the status to the user.

Lambda code

The Lambda function converts the incoming payload from base64 to bytes for each image, then sends the source (identification) and target (selfie) to the Amazon Rekognition compare_faces API, and returns the similarity score received in the body of the API response. See the following code:

import boto3
import sys
import json
import base64


def lambda_handler(event, context):

  client = boto3.client('rekognition')

  payload_dict = json.loads(json.loads(event['body']))
  selfie = payload_dict['selfie']
  dl = payload_dict['dl']

  # encode the base64 strings as ASCII bytes
  s_base64 = dl.encode('utf-8')
  t_base64 = selfie.encode('utf-8')
  # decode base64 to raw image bytes
  s_bytes = base64.b64decode(s_base64)
  t_bytes = base64.b64decode(t_base64)
  response = client.compare_faces(SimilarityThreshold=80,
                                SourceImage={'Bytes': s_bytes},
                                TargetImage={'Bytes': t_bytes})

  for faceMatch in response['FaceMatches']:
      position = faceMatch['Face']['BoundingBox']
      similarity = str(faceMatch['Similarity'])

  return {
    'statusCode': response['ResponseMetadata']['HTTPStatusCode'],
    'body': similarity
  }

Deploy the project

This project is available to deploy through AWS Samples with the AWS Cloud Development Kit (AWS CDK). You can clone the repository and use the following AWS CDK process to deploy to your AWS account.

  1. Set up a user who has permissions to programmatically deploy the solution resources through the AWS CDK.
  2. Set up the AWS Command Line Interface (AWS CLI). For instructions, refer to Configuring the AWS CLI.
  3. If this is your first time using the AWS CDK, complete the prerequisites listed in Working with the AWS CDK in Python.
  4. Clone the GitHub repository.
  5. Create the virtual environment. The command you use depends on your OS:
    1. If using Windows, run the following command in your terminal window from the source of the cloned repository:
      .\.venv\Scripts\activate

    2. If using Mac or Linux, run the following command in your terminal window from the source of the cloned repository:
      source .venv/bin/activate

  6. After activating the virtual environment, install the app’s standard dependencies:
    python -m pip install -r requirements.txt

  7. Now that the environment is set up and the requirements are met, we can issue the AWS CDK deployment command to deploy this project to AWS:
    cdk deploy

Make API calls

We need to send the payload in base64 format to the REST endpoint. We use a Python file to make the API call, which allows us to open the source and target files, convert them to base64, and send the payload to the API Gateway. This code is available in the repository.

Note that the SOURCE and TARGET file locations will be on your local file system, and the URL is the API Gateway URL generated during the creation of the project.

import requests
from base64 import b64encode
from json import dumps

TARGET = '<Selfie>.png'
SOURCE = '<ID Image>.png'
URL = "https://<your api gateway>.execute-api.<region>.amazonaws.com/<deployment slot>/ips"
ENCODING = 'utf-8'
JSON_NAME = 'output.json'

# first: reading the binary stuff
with open(SOURCE, 'rb') as source_file:
    s_byte_content = source_file.read()
with open(TARGET, 'rb') as target_file:
    t_byte_content = target_file.read()

# second: base64 encode read data
s_base64_bytes = b64encode(s_byte_content)
t_base64_bytes = b64encode(t_byte_content)

# third: decode these bytes to text
s_base64_string = s_base64_bytes.decode(ENCODING)
t_base64_string = t_base64_bytes.decode(ENCODING)

# make raw data for json
raw_data = {
    " dl ": s_base64_string,
    " selfie ": t_base64_string
}

# now: encoding the data to json
json_data = dumps(raw_data, indent=2)

response = requests.post(url=URL, json=json_data)
response.raise_for_status()

print("Status Code", response.status_code)
print("Body ", response.json())

Clean up

We used the AWS CDK to build this project, so we can open our project locally and issue the following AWS CDK command to clean up the resources:

cdk destroy

Conclusion

There you have it, the “ID + Selfie” design pattern with a simple API that you can integrate with your application to perform digital identity verification. In the next post in our series, we expand upon this pattern further by extracting text from the identification document and searching a collection of faces to prevent duplication.

To learn more, check out the Amazon Rekognition Developer Guide on detecting and analyzing faces.


About the Authors

Mike Ames is a Principal Applied AI/ML Solutions Architect with AWS. He helps companies use machine learning and AI services to combat fraud, waste, and abuse. In his spare time, you can find him mountain biking, kickboxing, or playing guitar in a 90s metal band.

Noah Donaldson is a Solutions Architect at AWS supporting federal financial organizations. He is excited about AI/ML technology that can reduce manual processes, improve customer experiences, and help solve interesting problems. Outside of work, he enjoys spending time on the ice with his son playing hockey, hunting with his oldest daughter, and shooting hoops with his youngest daughter.


Getting started with deploying real-time models on Amazon SageMaker

Amazon SageMaker is a fully-managed service that provides every developer and data scientist with the ability to quickly build, train, and deploy machine learning (ML) models at scale. ML is realized in inference. SageMaker offers four Inference options:

  1. Real-Time Inference
  2. Serverless Inference
  3. Asynchronous Inference
  4. Batch Transform

These four options can be broadly classified into Online and Batch inference options. In Online Inference, requests are expected to be processed as they arrive, and the consuming application expects a response after each request is processed. This can either happen synchronously (real-time Inference, serverless) or asynchronously (asynchronous inference). In a synchronous pattern, the consuming application is blocked and can’t proceed until it receives a response. These workloads tend to be real-time applications, such as online credit card fraud detection, where responses are expected in the order of milliseconds to seconds and request payloads are small (a few MB). In the asynchronous pattern, the application experience isn’t blocked (for example, submitting an insurance claim via a mobile app), and usually requires larger payload sizes and/or longer processing times. In Offline inference, an aggregation (batch) of inference requests are processed together, and responses are provided only after the entire batch has been processed. Usually, these workloads aren’t latency sensitive, involve large volumes (multiple GBs) of data, and are scheduled at a regular cadence (for example, run object detection on security camera footage at the end of the day or process payroll data at the end of the month).

At the bare bones, SageMaker Real-Time Inference consists of a model(s), the framework/container with which you’re working, and the infrastructure/instances that are backing your deployed endpoint. In this post, we’ll explore how you can create and invoke a Single Model Endpoint.

Choosing model deployment option

Choosing the right inference type can be difficult, and the following simple guide can help you. It’s not a strict flow chart, so if you find that another option works better for you, then feel free to use those. In particular, Real-Time Inference is a great option for hosting your models when you have low and consistent latency (in the order of milliseconds or seconds) and throughput sensitive workloads. You can control the instance type and count behind your endpoint while also configuring AutoScaling policy to handle traffic. There are two other SageMaker Inference options that you can also use to create an endpoint. Asynchronous Inference is when you have large payload sizes and near real-time latency bandwidth. This is a good option, especially for NLP and Computer Vision models that have longer preprocessing times. Serverless Inference is a great option when you have intermittent traffic and don’t want to manage infrastructure scaling. The recipe for creating an endpoint remains the same regardless of the Inference type that you choose. In this post, we’ll focus on creating a real-time instance-based endpoint, but you can easily adapt it to either of the other Inference Options based on your use-case. Lastly, Batch inference takes place offline, so you can provide a set of data that you want to get inference from and we’ll run it. This is similarly instance-based, so you can select the optimal instance for your workload. As there is no endpoint up and running, you only pay for the duration of the job. It is good for processing gigabytes of data and the job duration can be days. There are built-in features to make working with structured data easier and optimizations to automatically distribute structured data. Some example use cases are propensity modeling, predictive maintenance, and churn prediction. All of these can take place offline in bulk because it doesn’t have to react to a specific event.

Hosting a model on SageMaker Endpoints

At the crux, SageMaker Real-Time Endpoints consists of a model and the infrastructure with which you choose to back the Endpoint. SageMaker uses containers to host models, which means that you need a container that properly sets up the environment for the framework that you use for each model that you provide. For example, if you’re working with a Sklearn model, you must pass in your model scripts/data within a container that properly sets up Sklearn. Luckily, SageMaker provides managed images for popular frameworks, such as TensorFlow, PyTorch, Sklearn, and HuggingFace. You can retrieve and utilize these images using the high-level SageMaker Python SDK and inject your model scripts and data into these containers. In the case that SageMaker doesn’t have a supported container, you can also Build Your Own Container and push your own custom image, installing the dependencies that are necessary for your model.

SageMaker supports both trained and pre-trained models. When we referred to model scripts/data in the previous paragraph, this is what we meant. You can either mount a script on your container, or if you have a pre-trained model artifact (for example, `model.joblib` for SKLearn), then you can provide this along with your image to SageMaker. To understand SageMaker Inference, there are three main entities that you’ll create in the process of Endpoint creation:

  1. SageMaker Model Entity – Here you can pass in your trained model data/model script and your image that you’re working with, whether it’s owned by AWS or built by you.
  2. Endpoint configuration creation – Here you define your infrastructure, meaning that you select the instance type, count, etc.
  3. Endpoint creation – This is the REST Endpoint that hosts your model and that you invoke to get a response.

Let’s look at how you can utilize a managed SageMaker image and your own custom-built image to deploy an endpoint.

Real-time endpoint requirements

  1. Before creating an Endpoint, you must understand what type of Model you want to host. If it’s a Framework model, such as TensorFlow, PyTorch, or MXNet, then you can utilize one of the prebuilt Framework images.
    If it’s a custom model, or you would like full flexibility in creating the container that SageMaker will run for inference, then you can build your own container.

SageMaker Endpoints are made up of a SageMaker Model and Endpoint Configuration.
If you’re using Boto3, then you would create both objects. Otherwise, if you’re utilizing the SageMaker Python SDK, then the Endpoint Configuration is created on your behalf when you use the .deploy(..) function.

SageMaker entities:

  • SageMaker Model:
    • Contains the details of the inference image, location of the model artifacts in Amazon Simple Storage Service (Amazon S3), network configuration, and AWS Identity and Access Management (IAM) role to be used by the Endpoint.
      • SageMaker requires your model artifacts to be compressed in a .tar.gz file. SageMaker automatically extracts this .tar.gz file into the /opt/ml/model/ directory in your container. If you’re utilizing one of the framework containers, such as TensorFlow, PyTorch, or MXNet, then the container expects your TAR structure to be as follows:
        • TensorFlow

          model.tar.gz/
          |--[model_version_number]/
          |--variables
          |--saved_model.pb
          code/
          |--inference.py
          |--requirements.txt

        • PyTorch

          model.tar.gz/
          |- model.pth
          |- code/
          |- inference.py
          |- requirements.txt # only for versions 1.3.1 and higher

        • MXNet

          model.tar.gz/
          |- model-symbol.json
          |- model-shapes.json
          |- model-0000.params
          |- code/
              |- inference.py
              |- requirements.txt # only for versions 1.6.0 and higher

        • Sklearn

          model.tar.gz/
          |- model.joblib
          |- code/
          |- inference.py

      • When utilizing a Framework image, we can provide a custom entry-point script, where we can implement our own pre- and post-processing. In our case, the inference script is packaged in the model.tar.gz under the /code directory (see the packaging sketch after this list).
    • Endpoint Configuration
      • Contains the infrastructure information required to deploy the SageMaker Model to the Endpoint.
      • For example, the SageMaker Model we created is specified here as well as the Instance Type and Initial Instance count.
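
As a minimal sketch of packaging the Sklearn layout shown above (the file names are illustrative and assume the artifacts already exist locally), the archive can be built with the standard library and then uploaded to Amazon S3:

import tarfile

# Package the model artifact and inference script in the layout SageMaker expects.
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("model.joblib", arcname="model.joblib")
    tar.add("inference.py", arcname="code/inference.py")
    tar.add("requirements.txt", arcname="code/requirements.txt")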

Frameworks and BYOC

    • Retrieving SageMaker images
      • This portion isn’t always necessary and is abstracted out by the SageMaker Python SDK via estimators. However, if you would like to be able to retrieve a SageMaker managed image to extend it, then you can get the images that are available via the SDK. The following is an example of retrieving a TF 2.2 image for inference.
        import sagemaker
        tf_image = sagemaker.image_uris.retrieve(framework="tensorflow", region="us-east-1",
            image_scope="inference", version="2.2", instance_type="ml.c5.xlarge")
        print(tf_image)

    • Frameworks
      • In the case that you want to deploy a Framework Model, such as TensorFlow, PyTorch, or MXNet, then all you need is the model artifacts.
      • See the documentation for deploying models directly from model artifacts for TensorFlow, PyTorch, or MXNet.
    • Choosing between 1P and BYOC
      • The SageMaker SDK also abstracts handling the image out, as you saw in the previous Frameworks section. It has ready-made estimators for Sklearn, TensorFlow, and PyTorch that automatically select the image for you based off of the version that you’ve selected. Then you can pass in a training/inference script through Script Mode into these estimators.
        from sagemaker.pytorch import PyTorch  # PyTorch Estimator within SageMaker SDK

        estimator_parameters = {
            "entry_point": "train_deploy_pytorch_without_dependencies.py",
            "source_dir": "pytorch_script",
            "instance_type": train_instance_type,
            "instance_count": 1,
            "hyperparameters": hyperparameters,
            "role": role,
            "base_job_name": "pytorch-model",
            "framework_version": "1.5",
            "py_version": "py3",
        }

        ## Model Training
        estimator = PyTorch(**estimator_parameters)
        estimator.fit(inputs)

        ## Deploy Trained model
        pytorch_predictor = estimator.deploy(
            initial_instance_count=1,
            instance_type="ml.m5.xlarge",
            endpoint_name=pytorch_endpoint_name,
        )

      • Not all packages and images are supported by SageMaker, and in this case you must bring your own container (BYOC). This means building a Dockerfile that will setup the proper environment for your model serving. An example of this is the Spacy NLP module, and there are no managed SageMaker containers for this framework. Therefore, you must provide a Dockerfile that installs Spacy. Within the container you also mount your model inference scripts. Let’s quickly discuss the components that you provide in a Bring Your Own Container format, as these stay consistent for most examples.
        • “nginx.conf” is the configuration file for the nginx front end. You won’t have to edit this file, unless you would like to tune these portions.
        • “predictor.py” is the program that actually implements the Flask web server and the model code for your application. You can have further Python files or functions in your container that you can call in this file.
        • “serve” is the program started when the container is started for hosting. It simply launches the gunicorn server, which runs multiple instances of the Flask app defined in predictor.py. Like nginx.conf, you don’t have to edit this file unless there’s further tuning that you would like to perform.
        • “train” is the program that is invoked when the container is run for training. You’ll modify this program to implement your training algorithm. If you’re bringing a pre-trained model or framework like Spacy, then you don’t need this file.
        • “wsgi.py” is a small wrapper used to invoke the Flask app. You should be able to take this file as-is, unless you’ve changed the name of your predictor.py file. In that case, make sure that it maps properly here.
    • Custom inference script
      • SageMaker Framework containers give you the flexibility to handle pre/post processing of the request and model loading using a custom entry point script/inference.py.
      • See the documentation for creating a custom inference.py script for TensorFlow, PyTorch, and MXNet; a minimal sketch for the PyTorch container is shown after this list.
    • Custom container
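
As an illustration of the custom entry-point pattern mentioned above, here is a minimal sketch of an inference.py for the PyTorch framework container; the model file name and the JSON request format are assumptions for the example:

import json
import os

import torch


def model_fn(model_dir):
    # Load the artifact SageMaker extracted from model.tar.gz into /opt/ml/model.
    model = torch.jit.load(os.path.join(model_dir, "model.pth"), map_location="cpu")
    model.eval()
    return model


def input_fn(request_body, request_content_type):
    # Deserialize the JSON request payload into a tensor.
    data = json.loads(request_body)
    return torch.tensor(data["inputs"], dtype=torch.float32)


def predict_fn(input_data, model):
    with torch.no_grad():
        return model(input_data)


def output_fn(prediction, accept):
    # Serialize the prediction back to JSON.
    return json.dumps({"outputs": prediction.tolist()})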

Different ways that you can interact with SageMaker Endpoints

There are many options for using SageMaker programmatically so that you can call your deployed models to get inference. The AWS Command Line Interface (AWS CLI), REST APIs, AWS CloudFormation, AWS Cloud Development Kit (AWS CDK), and AWS SDKs are common tools offered by AWS and widely supported by other AWS services. For SageMaker, we also have a SageMaker Python SDK. Now, let’s compare the different options to create, invoke, and manage SageMaker Endpoints.

In addition to SageMaker CLI, there are two ways programmatically that you can interact with Endpoints in SageMaker through the SDKs. Let’s look at some differences between SageMaker Python SDK and Boto3 Python SDK:

  1. High-level SageMaker Python SDK – This SDK is an open-source library that provides a higher level of abstraction specifically meant for calling SageMaker APIs programmatically using Python. The good part of this SDK is that it’s very easy to call SageMaker APIs, and a lot of the heavy lifting is already done for you, such as calling the APIs synchronously or asynchronously (which helps to avoid polling), a simpler request/response schema, and much less, much simpler code. The SageMaker Python SDK provides several high-level abstractions for working with SageMaker. The package is meant to simplify different ML processes on SageMaker.
  2. Low-level AWS SDK (Boto3 SDK) – This SDK works at the lower level by allowing the user to choose from the supported programming languages and call any AWS service programmatically. This isn’t just specific to SageMaker but can be used in general for all AWS services. The low-level AWS SDKs are available in various programming languages, such as .NET, Python, Java, Node.js, etc. One of the popular SDKs used is the Boto3 Python SDK, which is popular in the data scientist community for ML. The good part of this SDK is that it’s very lightweight and installed by default on the AWS Lambda runtime. Furthermore, you can use this SDK to interact with any AWS service outside of SageMaker.

Both of these SDKs can be utilized for the same tasks, but in some cases it’s more intuitive to use one more than the other. SageMaker Python SDK is recommended for easy testing while AWS SDK/Boto3 is recommended for production use cases for better control on performance. For example, SageMaker as a service provides pre-built and maintained images for popular frameworks, such as Sklearn, PyTorch, and TensorFlow. It can be particularly useful to use SageMaker SDK to retrieve deep learning images, train models using Estimators, and easily deploy the model using a simple API call. An example to showcase this in action can be found here.

On the other hand, sometimes you have pre-trained models or different frameworks that you may be using. This requires a greater deal of customization and the SageMaker SDK doesn’t always offer that. We have three important steps and corresponding boto3 API calls that we need to execute to deploy an endpoint: Model Creation, Endpoint Configuration Creation, and Endpoint Creation. The first two entities were abstracted out with the SageMaker SDK with our supported frameworks, but we see those details with the Boto3 SDK. An extensive example to showcase the steps involved in using a Boto3 SDK to create and manage an endpoint can be found here.
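
As a minimal sketch of those three Boto3 calls (the image URI, S3 path, role ARN, and resource names are placeholders you would substitute):

import boto3

sm = boto3.client("sagemaker")

# 1. Model: the container image plus the model artifacts in S3.
sm.create_model(
    ModelName="my-model",
    PrimaryContainer={
        "Image": "<account>.dkr.ecr.<region>.amazonaws.com/<image>:<tag>",
        "ModelDataUrl": "s3://<bucket>/model.tar.gz",
    },
    ExecutionRoleArn="<sagemaker-execution-role-arn>",
)

# 2. Endpoint configuration: the instance type and count behind the endpoint.
sm.create_endpoint_config(
    EndpointConfigName="my-endpoint-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "my-model",
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 1,
    }],
)

# 3. Endpoint: the REST endpoint that serves real-time requests.
sm.create_endpoint(
    EndpointName="my-endpoint",
    EndpointConfigName="my-endpoint-config",
)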

Considerations of SageMaker hosting

SageMaker Real-Time Inference has two main optimizations that you can consider: 1/ Performance optimization, and 2/ Cost optimization. Let’s first look at performance optimization, as when we’re dealing with latency sensitive workloads, every millisecond is crucial. There are different knobs that you can tune to optimize your latency and throughput. At the instance level, you can use Inference Recommender, our built-in load testing tool, to help you select the right instance type and count for your workload. Utilizing the proper combination of compute will help you with both performance and cost. You can also tune at the container and framework level.
Questions to ask yourself include:

  1. What framework are you using?
  2. Are there any environment variables that you can tune within your container?

An example of this is maximizing TensorFlow performance with SageMaker containers. Another example of container level optimizations is utilizing gRPC rather than REST behind your endpoint. Lastly, you can also optimize at the script level. Is your inference code taking extra time at certain blocks? Timing each and every line of your script will help you capture any bottlenecks within your code.

There are three ways to look at improving the utilization of your Real Time endpoint:

  1. Multi-model Endpoints (MME)
    • You can host thousands of models behind a single endpoint. This is perfect for use cases where you don’t need a dedicated endpoint for each one of your models. MME works best when the models are similar in size and latency and belong to the same ML framework. These can typically be used when you don’t need to call the same model at all times. You can dynamically load the respective model onto the SageMaker endpoint to serve your request (see the invocation sketch after this list). An example that showcases MME in action can be found here. If you want to learn more about the different caveats and best practices for hosting models on MME, then refer to the post here.
  2. Multi-Container Endpoints (MCE)
    • Instead of utilizing multiple endpoints to host multiple containers, you can look at hosting up to 15 containers on a single endpoint. Each one of these containers can be invoked directly. Therefore, you can look at hosting disparate models of different frameworks all on a single endpoint. This option is best when containers exhibit similar usage and performance characteristics. An example that showcases MCE can be found here. If you want to learn more about the different caveats and best practices for hosting models on MCE, then refer to the post here.
  3. Serial Inference Pipeline (SIP)
    • If you have a pipeline of steps in your inference logic, then you might utilize Serial Inference Pipeline (SIP). SIP lets you chain 2-15 containers together behind a single endpoint. SIP works well when you have preprocessing and post-processing steps. If you want to learn more about the design patterns for serial inference pipelines, then refer to the post here.
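As a quick illustration of the MME invocation pattern, each request specifies which model artifact to serve via the TargetModel parameter; the endpoint and artifact names below are hypothetical. For multi-container endpoints, the analogous knob is the TargetContainerHostname parameter.

import boto3

runtime = boto3.client("sagemaker-runtime")

# TargetModel selects which artifact under the endpoint's S3 model prefix
# serves this request; it is loaded on demand and cached on the instance.
response = runtime.invoke_endpoint(
    EndpointName="my-multi-model-endpoint",  # hypothetical endpoint name
    TargetModel="model-42.tar.gz",           # hypothetical model artifact
    ContentType="text/csv",
    Body="1.0,2.0,3.0",
)
print(response["Body"].read().decode("utf-8"))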

The second main optimization to keep in mind is cost. Real-Time Inference is one of three options for creating SageMaker Endpoints. SageMaker Endpoints are running at all times unless deleted, so you should look at improving the utilization of the endpoint, which in turn provides a cost benefit.

SageMaker also offers Savings Plans. Savings Plans can reduce your costs by up to 64%. This is a 1- or 3-year term commitment to a consistent amount of usage ($/hour). See this link for more information, and see this link for best practices to optimize inference costs on Amazon SageMaker.

Conclusion

In this post, we showed you some of the best practices to choose between different model hosting options on SageMaker. We discussed the SageMaker Endpoint requirements, contrasted Framework and BYOC requirements and functionality, and covered the different ways you can leverage Real-Time Endpoints to host your ML models in production in a cost-effective way while maintaining high performance.

See the corresponding GitHub repository and try out the examples.


About the authors

Raghu Ramesha is an ML Solutions Architect with the Amazon SageMaker Service team. He focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in Computer Science from UT Dallas. In his free time, he enjoys traveling and photography.

Ram Vegiraju is an ML Architect with the SageMaker Service team. He focuses on helping customers build and optimize their AI/ML solutions on Amazon SageMaker. In his spare time, he loves traveling and writing.

Marc Karp is an ML Architect with the SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including NLP and computer vision domains. He helps customers achieve high-performance model inference on Amazon SageMaker.

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Read More

ReAct: Synergizing Reasoning and Acting in Language Models

ReAct: Synergizing Reasoning and Acting in Language Models

Recent advances have expanded the applicability of language models (LM) to downstream tasks. On one hand, existing language models that are properly prompted, via chain-of-thought, demonstrate emergent capabilities that carry out self-conditioned reasoning traces to derive answers from questions, excelling at various arithmetic, commonsense, and symbolic reasoning tasks. However, with chain-of-thought prompting, a model is not grounded in the external world and uses its own internal representations to generate reasoning traces, limiting its ability to reactively explore and reason or update its knowledge. On the other hand, recent work uses pre-trained language models for planning and acting in various interactive environments (e.g., text games, web navigation, embodied tasks, robotics), with a focus on mapping text contexts to text actions via the language model’s internal knowledge. However, they do not reason abstractly about high-level goals or maintain a working memory to support acting over long horizons.

In “ReAct: Synergizing Reasoning and Acting in Language Models”, we propose a general paradigm that combines reasoning and acting advances to enable language models to solve various language reasoning and decision making tasks. We demonstrate that the Reason+Act (ReAct) paradigm systematically outperforms reasoning-only and acting-only paradigms, when prompting bigger language models and fine-tuning smaller language models. The tight integration of reasoning and acting also presents human-aligned task-solving trajectories that improve interpretability, diagnosability, and controllability.

Model Overview

ReAct enables language models to generate both verbal reasoning traces and text actions in an interleaved manner. While actions lead to observation feedback from an external environment (“Env” in the figure below), reasoning traces do not affect the external environment. Instead, they affect the internal state of the model by reasoning over the context and updating it with useful information to support future reasoning and acting.

Previous methods prompt language models (LM) to either generate self-conditioned reasoning traces or task-specific actions. We propose ReAct, a new paradigm that combines reasoning and acting advances in language models.

ReAct Prompting

We focus on the setup where a frozen language model, PaLM-540B, is prompted with few-shot in-context examples to generate both domain-specific actions (e.g., “search” in question answering, and “go to” in room navigation), and free-form language reasoning traces (e.g., “Now I need to find a cup, and put it on the table”) for task solving.

For tasks where reasoning is of primary importance, we alternate the generation of reasoning traces and actions so that the task-solving trajectory consists of multiple reasoning-action-observation steps. In contrast, for decision making tasks that potentially involve a large number of actions, reasoning traces only need to appear sparsely in the most relevant positions of a trajectory, so we write prompts with sparse reasoning and let the language model decide the asynchronous occurrence of reasoning traces and actions for itself.

As shown below, there are various types of useful reasoning traces, e.g., decomposing task goals to create action plans, injecting commonsense knowledge relevant to task solving, extracting important parts from observations, tracking task progress while maintaining plan execution, handling exceptions by adjusting action plans, and so on.

The synergy between reasoning and acting allows the model to perform dynamic reasoning to create, maintain, and adjust high-level plans for acting (reason to act), while also interacting with the external environments (e.g., Wikipedia) to incorporate additional information into reasoning (act to reason).
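To make this interleaving concrete, here is a minimal, purely illustrative sketch of a ReAct-style loop; the prompt format, the llm() and env.step() helpers, and the finish[] stopping convention are assumptions for illustration, not the paper's implementation.

def react_episode(question, llm, env, max_steps=8):
    # Alternate thought -> action -> observation until the agent finishes.
    # llm(prompt) is assumed to return the next "Thought: ..." and "Action: ..." lines;
    # env.step(action) is assumed to return an observation string (e.g., a Wikipedia lookup).
    trajectory = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(trajectory)                 # e.g. "Thought: ...\nAction: search[...]"
        trajectory += step + "\n"
        action = step.split("Action:")[-1].strip()
        if action.startswith("finish["):       # assumed termination convention
            return action[len("finish["):-1], trajectory
        observation = env.step(action)         # ground the next reasoning step
        trajectory += f"Observation: {observation}\n"
    return None, trajectory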

ReAct Fine-tuning

We also explore fine-tuning smaller language models using ReAct-format trajectories. To reduce the need for large-scale human annotation, we use the ReAct prompted PaLM-540B model to generate trajectories, and use trajectories with task success to fine-tune smaller language models (PaLM-8/62B).

Comparison of four prompting methods, (a) Standard, (b) Chain of thought (CoT, Reason Only), (c) Act-only, and (d) ReAct, solving a HotpotQA question. In-context examples are omitted, and only the task trajectory is shown. ReAct is able to retrieve information to support reasoning, while also using reasoning to target what to retrieve next, demonstrating a synergy of reasoning and acting.

Results

We conduct empirical evaluations of ReAct and state-of-the-art baselines across four different benchmarks: question answering (HotPotQA), fact verification (Fever), text-based game (ALFWorld), and web page navigation (WebShop). For HotPotQA and Fever, with access to a Wikipedia API with which the model can interact, ReAct outperforms vanilla action generation models while being competitive with chain of thought reasoning (CoT) performance. The approach with the best results is a combination of ReAct and CoT that uses both internal knowledge and externally obtained information during reasoning.

Prompting method            HotpotQA (exact match, 6-shot)    FEVER (accuracy, 3-shot)
Standard                    28.7                              57.1
Reason-only (CoT)           29.4                              56.3
Act-only                    25.7                              58.9
ReAct                       27.4                              60.9
Best ReAct + CoT method     35.1                              64.6
Supervised SoTA             67.5 (using ~140k samples)        89.5 (using ~90k samples)

PaLM-540B prompting results on HotpotQA and Fever.

On ALFWorld and WebShop, ReAct with two-shot and one-shot prompting, respectively, outperforms imitation and reinforcement learning methods trained with ~10^5 task instances, with absolute improvements of 34% and 10% in success rate over existing baselines.

Method                         AlfWorld (2-shot)           WebShop (1-shot)
Act-only                       45                          30.1
ReAct                          71                          40
Imitation Learning Baselines   37 (using ~100k samples)    29.1 (using ~90k samples)

PaLM-540B prompting task success rate results on AlfWorld and WebShop.
Scaling results for prompting and fine-tuning on HotpotQA with ReAct and different baselines. ReAct consistently achieves the best fine-tuning performance.

A comparison of the ReAct (top) and CoT (bottom) reasoning trajectories on an example from Fever (the observation for ReAct is omitted to reduce space). In this case ReAct provided the right answer, and its reasoning trajectory is more grounded in facts and knowledge, in contrast to CoT’s hallucination behavior.

We also explore human-in-the-loop interactions with ReAct by allowing a human inspector to edit ReAct’s reasoning traces. We demonstrate that by simply replacing a hallucinating sentence with inspector hints, ReAct can change its behavior to align with inspector edits and successfully complete a task. Solving tasks becomes significantly easier when using ReAct as it only requires the manual editing of a few thoughts, which enables new forms of human-machine collaboration.

A human-in-the-loop behavior correction example with ReAct on AlfWorld. (a) The ReAct trajectory fails due to a hallucinating reasoning trace (Act 17). (b) A human inspector edits two reasoning traces (Act 17, 23); ReAct then produces desirable reasoning traces and actions to complete the task.

Conclusion

We present ReAct, a simple yet effective method for synergizing reasoning and acting in language models. Through various experiments that focus on multi-hop question-answering, fact checking, and interactive decision-making tasks, we show that ReAct leads to superior performance with interpretable decision traces.

ReAct demonstrates the feasibility of jointly modeling thought, actions and feedback from the environment within a language model, making it a versatile agent that is capable of solving tasks that require interactions with the environment. We plan to further extend this line of research and leverage the strong potential of the language model for tackling broader embodied tasks, via approaches like massive multitask training and coupling ReAct with equally strong reward models.

Acknowledgements

We would like to thank Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran and Karthik Narasimhan for their great contribution in this work. We would also like to thank Google’s Brain team and the Princeton NLP Group for their joint support and feedback, including project scoping, advising and insightful discussions.

Read More

Predict lung cancer survival status using multimodal data on Amazon SageMaker JumpStart

Predict lung cancer survival status using multimodal data on Amazon SageMaker JumpStart

Non-small cell lung cancer (NSCLC) is the most common type of lung cancer, and is composed of tumors with significant molecular heterogeneity resulting from differences in intrinsic oncogenic signaling pathways [1]. Enabling precision medicine, anticipating patient preferences, detecting disease, and improving care quality for NSCLC patients are important topics among healthcare and life sciences (HCLS) communities.

Applying machine learning (ML) to diverse health datasets, known as multimodal machine learning (multimodal ML), is an active area of research and development. Analyzing linked patient-level data from diverse data modalities, such as genomics and medical imaging, promises to accelerate improvements in patient care. However, performing analyses of multiple modalities at scale has been challenging in on-premises or cloud environments due to the distinct infrastructure requirements of each modality. With Amazon SageMaker, you can create purpose-built pipelines and scale them to meet your needs easily, paying only for what you use.

We’re announcing a new solution on lung cancer survival prediction in Amazon SageMaker JumpStart, based on the blog posts Building Scalable Machine Learning Pipelines for Multimodal Health Data on AWS and Training Machine Learning Models on Multimodal Health Data with Amazon SageMaker. JumpStart provides pre-trained, open-source models and pre-built solution templates for a wide range of problem types to help data scientists and ML practitioners get started on training and deploying ML models quickly. This is the first HCLS solution offered by JumpStart.

The solution builds a multimodal ML model for predicting survival outcome of patients diagnosed with NSCLC. The multimodal model is trained on data derived from different modalities or domains, including medical imaging, genomic, and clinical data. Multimodal ML has been adopted in HCLS for personalized treatment, clinical decision support, and drug response prediction. In this post, we demonstrate how you can create a scalable, purpose-built ML pipeline easily with one-click deployment from JumpStart.

What’s in the dataset

Non-small cell lung cancer is the leading cause of cancer death [2]. However, no two cancer diagnoses are alike, because tumors contain significant molecular heterogeneity resulting from differences in intrinsic oncogenic signaling pathways [1]. In addition, different clinical information collected from patients may impact prognosis and treatment options. Therefore, enabling precision medicine, anticipating patient preferences, detecting disease, and improving care quality for NSCLC patients is of utmost importance in the oncology and HCLS communities.

The Non-Small Cell Lung Cancer (NSCLC) Radiogenomic dataset [3] comprises medical imaging, clinical, and genomic data collected from a cohort of early-stage NSCLC patients referred for surgical treatment. It includes Computed Tomography (CT), Positron Emission Tomography (PET)/CT images, semantic annotations of the tumors as observed on the medical images using a controlled vocabulary, segmentation maps of tumors in the CT scans, and quantitative values obtained from the PET/CT scans. The genomic data contains gene mutation and RNA sequencing data from samples of surgically excised tumor tissue. It also consists of clinical data reflective of electronic health records (EHR) such as age, gender, weight, ethnicity, smoking status, Tumor Node Metastasis (TNM) stage, histopathological grade, and survival outcome. Each data modality presents a different view of a patient.

Medical imaging data

Medical imaging biomarkers of cancer promise improvements in patient care through advances in precision medicine. Compared to genomic biomarkers, imaging biomarkers have the advantage of being non-invasive and of characterizing a heterogeneous tumor in its entirety, as opposed to the limited tissue available via biopsy [3]. In this dataset, CT and PET/CT imaging sequences were acquired for patients prior to surgical procedures. Segmentations of tumor regions were annotated by two expert thoracic radiologists. The following image is an example overlay of a tumor segmentation onto a lung CT scan (case R01-093 in the dataset).

Genomic data

Samples of surgically excised tumor tissues were analyzed with RNA sequencing technology. The dataset file that is available from the source was preprocessed using open-source tools, including STAR v.2.3 for alignment and Cufflinks v.2.0.2 for expression calls [4]. The original dataset (GSE103584_R01_NSCLC_RNAseq.txt.gz) can be found on the NCBI website. Although the original data contains more than 22,000 genes, for the purpose of demonstration, we used 21 genes from 10 highly co-expressed gene clusters (metagenes) that were identified, validated in publicly available gene-expression cohorts, and correlated with prognosis [4].

The following table shows the tabular representation of the gene expression data. Each row corresponds to a patient, and the columns represent a subset of genes selected for demonstration. The value denotes the expression level of a gene for a patient. A higher value means the corresponding gene is highly expressed in that specific tumor sample.

Case_ID LRIG1 HPGD GDF15 CDH2 POSTN ……
R01-024 26.7037 3.12635 13.0269 0 36.4332 ……
R01-153 15.2133 5.0693 0.90866 0 32.8595 ……
R01-031 5.54082 1.23083 29.8832 1.13549 34.8544 ……
R01-032 12.8391 7.21931 12.0701 0 7.77297 ……
R01-033 33.7975 3.19058 5.43418 0 9.84029 ……

Clinical data

Clinical data was collected from medical records. It included demographics, smoking history, survival, recurrence status, histology, histopathological grading, Pathological TNM staging, and survival outcome of patients. The data is stored in CSV format, as shown in the following table.

Case ID Survival Status Age at Histological Diagnosis Weight (lbs) Smoking status Pack Years Quit Smoking Year Chemotherapy Adjuvant Treatment EGFR mutation status ……
R01-005 Dead 84 145 Former 20 1951 No No Wildtype ……
R01-006 Alive 62 Not Collected Former Not Collected nan No No Wildtype ……
R01-007 Dead 68 Not Collected Former 15 1968 Yes Yes Wildtype ……
R01-008 Alive 73 102 Nonsmoker nan nan No No Wildtype ……
R01-009 Dead 59 133 Current 100 nan No …… …… ……

Solution overview

Working with multimodal healthcare data for ML purposes often requires dedicated data processing pipelines and compute resources to extract relevant biomarkers and features. Collecting, storing, and managing extracted features can be challenging. In this JumpStart solution, we show you how to process each modality in separate notebooks, use Amazon SageMaker Processing for compute-intensive 3D image construction, use Amazon SageMaker Feature Store to centrally store extracted features for ML modeling, use Amazon SageMaker Experiments to capture ML lineage, train an ML model with a SageMaker built-in algorithm without sophisticated coding, and deploy the model for real-time prediction with SageMaker hosting.

The architecture for the solution is illustrated in the following diagram.

The key services involved are as follows:

  • Amazon SageMaker JumpStart – JumpStart provides pre-trained, open-source models for a wide range of problem types to help you get started. JumpStart also provides solution templates that set up infrastructure for common use cases, and notebooks for ML with SageMaker. This service is where we launch our solution.
  • Amazon Simple Storage Service – Amazon S3 is an object storage service that offers scalability, data availability, security, and performance. Amazon S3 allows you to store data to be used by SageMaker.
  • Amazon SageMaker Studio notebooks – You can launch these collaborative notebooks quickly because you don’t need to set up compute instances and file storage beforehand. This is the environment you can build in.
  • SageMaker Processing jobs – These jobs enable you to analyze data and evaluate ML models. This managed experience helps you run your data processing workloads, such as feature engineering, data validation, and model interpretation. This service allows you to process the data that is stored in Amazon S3.
  • Amazon SageMaker Feature Store – You can create, share, and manage features for ML development. This is where you store processed features to use for training.
  • Amazon SageMaker Experiments – You can organize, track, compare, and evaluate ML experiments.
  • Built-in XGBoost algorithm (eXtreme gradient boosting) – This is a popular and efficient open-source implementation of the gradient boosting algorithm, which is used for training the model in this solution.
  • SageMaker training jobs – You create a managed training job in SageMaker using four attributes: the URL of the S3 bucket where you’ve stored the training data, the compute resources, the URL of the S3 bucket where you want to store the output of the job, and the Amazon Elastic Container Registry (Amazon ECR) path where the training code is stored.
  • SageMaker real-time endpoints – These endpoints are ideal for inference workloads where you have real-time, interactive, low-latency requirements. You deploy the model to SageMaker hosting services and get an endpoint that can be used for inference.

With this JumpStart solution, you can easily spin up the solution in Amazon SageMaker Studio, and follow the post to explore, train, and deploy a lung cancer survival status prediction model for learning purposes.

Prerequisites

To use the Lung Cancer Survival Prediction solution provided by JumpStart, you need to have a Studio domain. A SageMaker project and JumpStart must also be enabled.

If you don’t have a Studio domain, refer to Onboard to Amazon SageMaker Domain Using Quick Setup.

If you have a SageMaker domain, make sure that the project and JumpStart features have been enabled by following these steps:

  1. On the SageMaker console, choose the gear icon next to Domain.
  2. Choose Next.
  3. In the SageMaker Projects and JumpStart section, select Enable Amazon SageMaker project templates and Amazon SageMaker JumpStart for this account and Enable Amazon SageMaker project templates and Amazon SageMaker JumpStart for Studio users.

We have completed the prerequisites needed to use JumpStart.

Deploy the solution and example demo notebook

To start using the Lung Cancer Survival Prediction solution, complete the following steps:

  1. Open Studio.
  2. Choose Go to SageMaker JumpStart under Jumpstart models, algorithms, and solutions.
  3. To find the solution within JumpStart, choose Explore All Solutions.
  4. Search for and choose Lung Cancer Survival Prediction.
  5. Under Launch Solution, choose Launch.
    Note that you can specify custom execution roles to be used throughout this solution, otherwise execution roles are created for you.
    It should take a few moments for the solution to launch. You can follow along as a number of resources are launched.

    Wait until the endpoint and model statuses show as Complete. You may need to refresh the page.

    This solution generates a model, endpoint configuration, endpoint, and five notebooks.

  6. To navigate to the first notebook, under Open solution in Studio, choose Open Notebook.
  7. To navigate to other notebooks, select the folder icon and open the S3Downloads folder.
  8. Open the jumpstart-prod-lcsp_****** folder.

From here, you can access the following notebooks:

  • 0_demo.ipynb – Demonstrates how to send inference requests to a pre-deployed endpoint and receive a model response.
  • 1_preprocess_genomic_data.ipynb – Showcases how to read and process genomic data (in tabular format).
  • 2_preprocess_clinical_data.ipynb – Demonstrates how to read and process clinical data (in tabular format).
  • 3_preprocess_imaging_data.ipynb – Showcases how to read and process medical imaging data in DICOM file format and convert it to the NIfTI neuroimaging format, so that we can work with the imaging data as 3D volumes. Note that in notebooks 1, 2, and 3, we store the output of each processing job in Feature Store.
  • 4_train_test_model.ipynb – Demonstrates how to access multimodal features from Feature Store, train an XGBoost model, and predict the survival status of patients diagnosed with non-small cell lung cancer.

After the solution is deployed in Studio, you can start building a lung cancer survival status prediction using multimodal health data. To start, let’s look at the 0_demo.ipynb notebook, which demonstrates how to send inference requests to a pre-deployed endpoint, and get the model response (survived vs. dead).

In this notebook, you can see a preview of the datasets (which we cover in a later section), and the steps needed to make predictions with an endpoint that has already been deployed. A predictor is instantiated to begin making real-time predictions against a SageMaker endpoint. We can use the predictor’s predict function to invoke the endpoint with test data.

The endpoint returns the predicted survival status, which is used to assess model performance, as shown in the following screenshot.

Feature engineering

As described earlier, genomic, clinical, and medical imaging data is available in the NSCLC dataset for us to create a comprehensive view of a patient. To process and compute the features that later help us build an ML model, let’s start with the genomic data pipeline in 1_preprocess_genomic_data.ipynb, and then step through 2_preprocess_clinical_data.ipynb and 3_preprocess_imaging_data.ipynb for the clinical data pipeline and medical imaging pipeline, respectively.

Genomic

For genomic data, we read the RNA sequence data from the JumpStart solution bucket into the notebook 1_preprocess_genomic_data.ipynb:

# RNA sequencing file hosted in the JumpStart solution bucket
file_name = "GSE103584_R01_NSCLC_RNAseq.txt"
input_data_bucket = f"s3://{SOLUTION_BUCKET}-{REGION}/{SOLUTION_NAME}/data"
input_data = f"{input_data_bucket}/{file_name}"
# Copy the file into the notebook's working directory
!aws s3 cp $input_data .

We then keep 21 genes from 10 highly co-expressed gene clusters (metagenes) that were identified, validated in publicly available gene-expression cohorts, and correlated with prognosis [4]:

# Case identifier plus the 21 genes from the 10 highly co-expressed metagene clusters
selected_columns = ['Case_ID','LRIG1', 'HPGD', 'GDF15', 'CDH2', 'POSTN', 'VCAN', 'PDGFRA',
                    'VCAM1', 'CD44', 'CD48', 'CD4', 'LYL1', 'SPI1', 'CD37', 'VIM', 'LMO2',
                    'EGR2', 'BGN', 'COL4A1', 'COL5A1', 'COL5A2']
gen_data_t = gen_data_t[selected_columns]
# Replace missing expression values with 0
data_gen = gen_data_t.fillna(0)

With the genomic features ready for analysis, we ingest the features into Feature Store as a feature group. Having the features in the Feature Store allows us to repeatedly source the important genomic features in the downstream analysis with governance.

genomic_feature_group = FeatureGroup(name=genomic_feature_group_name, 
                                     sagemaker_session=feature_store_session)
# Load feature definitions to the feature group. SageMaker FeatureStore Python SDK will auto-detect the data schema based on input data.
genomic_feature_group.load_feature_definitions(data_frame=data_gen)
genomic_feature_group.create(s3_uri=f"s3://{BUCKET}/{prefix}",
                             record_identifier_name=record_identifier_feature_name,
                             event_time_feature_name=event_time_feature_name, 
                             role_arn=sagemaker.get_execution_role(),
                             enable_online_store=True)
genomic_feature_group.ingest(data_frame=data_gen, max_workers=3, wait=True)

To see the recently ingested features in the Feature Store UI, under SageMaker resources, choose Feature Store.

From here, you can find the most recently ingested features into Feature Store. The feature group name should be sagemaker-soln-lcsp-js-******-genomic-feature-group.

Clinical

For clinical data, we perform data cleaning and processing on the data hosted on the JumpStart solution bucket and ingest into the Feature Store, as shown in 2_preprocess_clinical_data.ipynb:

file_name = "NSCLCR01Radiogenomic_DATA_LABELS_2018-05-22_1500-shifted.csv"
input_data_bucket = f"s3://{SOLUTION_BUCKET}-{REGION}/{SOLUTION_NAME}/data"
input_data = f"{input_data_bucket}/{file_name}"
!aws s3 cp $input_data .

We run one-hot encoding to convert categorical attributes to numerical attributes. We then remove columns that don’t provide useful information (such as dates), and remove rows with missing values. Afterwards, we create another feature group in Feature Store to store the clinical features, similarly to how we did in the genomic notebook.
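A minimal sketch of that clinical preprocessing with pandas could look like the following; the categorical and date column names are illustrative, not the exact columns used in the notebook.

import pandas as pd

clinical = pd.read_csv("NSCLCR01Radiogenomic_DATA_LABELS_2018-05-22_1500-shifted.csv")

# One-hot encode categorical attributes (column names are illustrative)
clinical = pd.get_dummies(clinical, columns=["Smoking status", "Histology"])

# Drop columns that do not carry predictive signal, such as dates (illustrative match)
clinical = clinical.drop(columns=[c for c in clinical.columns if "Date" in c])

# Remove rows with missing values
clinical = clinical.dropna()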

Medical imaging

For medical imaging data, we create patient-level 3-dimensional radiomic features that explain the size, shape, and visual attributes of the tumors observed in the CT scans, and store them in a feature group in Feature Store. For each patient study, the following steps are performed. Details can be found in 3_preprocess_imaging_data.ipynb.

  1. Read the 2D DICOM slice files for both the CT scan and the tumor segmentation, combine them into 3D volumes, and save the volumes in NIfTI format.
  2. Align the CT volume and the tumor segmentation so we can focus the computation inside the tumor.
  3. Compute radiomic features describing the tumor region using the pyradiomics library. We extract 120 radiomic features of eight classes, such as statistical representations of the distribution and co-occurrence of intensity within the tumorous region of interest, and shape-based measurements describing the tumor morphologically (a minimal extraction sketch follows this list).
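The extraction step can be sketched with the pyradiomics API roughly as follows; the NIfTI file paths stand in for the volumes produced in steps 1 and 2, and the default extractor settings are an assumption rather than the solution's exact configuration.

from radiomics import featureextractor

# Default extractor; the solution may enable a specific subset of feature classes,
# so treat this configuration as illustrative.
extractor = featureextractor.RadiomicsFeatureExtractor()

# CT volume and aligned tumor segmentation mask in NIfTI format (placeholder paths)
features = extractor.execute("ct_volume.nii.gz", "tumor_mask.nii.gz")

# Keep the numeric radiomic features (keys prefixed with 'original_'),
# dropping the diagnostic metadata that execute() also returns.
radiomics_row = {k: float(v) for k, v in features.items() if k.startswith("original_")}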

It’s worth mentioning how the medical imaging pipeline is run at scale. Processing hundreds of large, high-resolution 3D images requires the right compute resources, and parallelizing the tasks can further reduce the total processing time and help you get the results sooner. We use SageMaker Processing, an on-demand, scalable capability for data processing in SageMaker, to run DICOM-to-3D volume construction, CT/tumor segmentation alignment, radiomic feature extraction, and Feature Store ingestion. We parallelize the processing to one job per patient, as shown in the following figure.

To launch the SageMaker Processing jobs for multiple patients in parallel efficiently, we call the utility function launch_processing_job() in a for loop; each call submits a configured SageMaker Processing job on one ml.r5.large instance.

In this case, we need to work within the service quota for the instance type. The default service quota for the number of instances across processing jobs, and for the number of ml.r5.large instances, is four. If your account has a higher limit, you may run a higher number of simultaneous processing jobs (and therefore achieve a faster completion time). To request a quota increase, refer to AWS service quotas. We implemented the function wait_for_instance_quota() to check the number of jobs currently in the InProgress state and limit the total count in this experiment to the value set in job_limit. If the total running job count is at the limit, the function waits the number of seconds specified in the wait argument and then checks the job count again. This keeps the for loop from hitting the account-level SageMaker quota and raising errors.
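A minimal sketch of such a throttling helper, built on the Boto3 list_processing_jobs call, is shown below; the function name mirrors the one used in the notebook, but the body is an assumption rather than the solution's exact code.

import time
import boto3

sm_client = boto3.client("sagemaker")

def wait_for_instance_quota(job_limit, wait=60):
    # Block until the number of InProgress processing jobs drops below job_limit.
    while True:
        response = sm_client.list_processing_jobs(StatusEquals="InProgress", MaxResults=100)
        running = len(response["ProcessingJobSummaries"])
        if running < job_limit:
            return
        time.sleep(wait)  # back off before checking the job count again

# Illustrative usage inside the per-patient loop:
# for subject in subjects:
#     wait_for_instance_quota(job_limit=4)
#     launch_processing_job(subject)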

Another challenge when we work with a large amount of processing and training jobs is keeping track of the details and the lineage of all the jobs in an experiment. We use SageMaker Experiments to achieve this. Setting up the experiment tracking is simple. When each SageMaker Processing job is submitted within launch_processing_job(), an experiment configuration is set up and is provided to run the job. See the following code:

experiment_config={'ExperimentName': experiment_name,
                   'TrialName': trial_name,
                   'TrialComponentDisplayName': f'ImageProcessing-{subject}'}
script_processor = ScriptProcessor(...)
script_processor.run(..., experiment_config=experiment_config)

We can then find the status and details of each job on the Experiments and trials menu.

Note that medical imaging processing and feature store offline store synchronization takes some time to complete for all patients, and imaging features may not be available in the offline store immediately for model training in the next section. To make sure the medical imaging features are available for training, a check of total entry count is implemented before proceeding to the next notebook 4_train_test_model.ipynb.
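One way to implement that readiness check is to count the rows in the imaging feature group's offline store with an Athena query and compare the result against the expected number of patients; the helper below is a sketch under that assumption, with placeholder names.

import time

def wait_for_offline_store(feature_group, expected_count, output_location, wait=60):
    # Poll the feature group's offline store until all expected rows have landed.
    while True:
        query = feature_group.athena_query()
        query.run(
            query_string=f'SELECT COUNT(*) AS n FROM "{query.table_name}"',
            output_location=output_location,  # e.g. an s3://<bucket>/<prefix>/query_results/ path
        )
        query.wait()
        if int(query.as_dataframe()["n"][0]) >= expected_count:
            return
        time.sleep(wait)  # the offline store sync is eventually consistent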

Modeling

As described earlier, we process the data of each modality in three separate notebooks, and create and ingest features into per-modality feature groups. At the time of training, we flexibly choose multimodal features using a SQL query against the offline store in Feature Store. We then preprocess the data and apply dimensionality reduction prior to training an XGBoost model from the SageMaker built-in algorithm for the binary classification task of predicting the survival outcome of patients. After the model is trained, we host the XGBoost model in a SageMaker real-time endpoint for testing and inference purposes.

To create a multimodal view of a patient for model training, we join the feature vectors from three modalities. This includes the following steps:

  1. Normalize the range of independent features using feature scaling.
  2. For ML training, run a query against the three feature groups to join the data stored in the offline store. For the given dataset, this integration results in 119 data samples, where each sample is a 215-dimensional vector.
    genomic_table = <GENOMIC-TABLE-NAME>
    clinical_table = <CLINICAL-TABLE-NAME>
    imaging_table = <IMAGING-TABLE-NAME>
    
    query = <FEATURE-GROUP-NAME>.athena_query()
    
    # Left-outer-join the three offline store tables on the patient identifier
    query_string = ('SELECT ' + genomic_table + '.*, ' + clinical_table + '.*, ' + imaging_table + '.* '
                    'FROM ' + genomic_table + ' LEFT OUTER JOIN ' + clinical_table +
                    ' ON ' + clinical_table + '.case_id = ' + genomic_table + '.case_id '
                    'LEFT OUTER JOIN ' + imaging_table +
                    ' ON ' + clinical_table + '.case_id = ' + imaging_table + '.subject '
                    'ORDER BY ' + clinical_table + '.case_id ASC;')
    
    query.run(query_string=query_string,
              output_location='s3://' + <BUCKET> + '/' + prefix + '/query_results/')
    
    query.wait()
    
    multimodal_features = query.as_dataframe()

  3. Perform principal component analysis (PCA) on the features to reduce the dimensionality and identify the most discriminative features that contribute to 95% variance in the data. This results in a dimensionality reduction from 215 features down to 45 principal components, which constitute features for the supervised learner.
    from sklearn.decomposition import PCA
    
    # Keep enough principal components to explain 95% of the variance;
    # passing a float in (0, 1) makes PCA choose the component count automatically
    pca_threshold = 0.95
    pca = PCA(n_components = pca_threshold)
    X_trainval_pca = pca.fit_transform(X_trainval_scaled)
    X_test_pca = pca.transform(X_test_scaled)

  4. Randomly shuffle this data and divide it into 80% for training and 20% for testing the model.
  5. Further split the training data into 80% for training and 20% for validating the model (a minimal split sketch follows this list).
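Steps 4 and 5 can be sketched with scikit-learn’s train_test_split; here X and y stand for the multimodal feature matrix and survival labels (names assumed), and the random seed is an arbitrary choice.

from sklearn.model_selection import train_test_split

# 80% train+validation / 20% test, shuffled (shuffle=True is the default)
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Further split the 80% portion into 80% train / 20% validation
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.2, random_state=42)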

To train the ML model, we construct an estimator for the XGBoost gradient boosting library through the SageMaker XGBoost container. Because our objective is to train a baseline model with multimodal data, we use default hyperparameters and don’t perform any hyperparameter tuning. The model predicts NSCLC patients’ survival status (dead or alive) as a probability. In addition to the model and prediction, we also generate reports to explain the model. The medical imaging pipeline produces 3D lung CT volumes and tumor segmentations for visualization purposes. See the following code:

import sagemaker
from sagemaker.image_uris import retrieve

# Resolve the URI of the SageMaker-managed XGBoost container image
container = retrieve("xgboost", region=region, version='1.2-1')

xgb = sagemaker.estimator.Estimator(container,
                                    role,
                                    instance_count=1,
                                    instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/training-output'.format(bucket, prefix),
                                    sagemaker_session=sm_session)

# Baseline hyperparameters; reg:logistic outputs a survival probability
xgb.set_hyperparameters(eta=0.1, objective='reg:logistic', num_round=10)

# Launch the managed training job with train and validation channels
xgb.fit({'train': train_data, 'validation': validation_data})

After we train the model, we can deploy the model to a SageMaker endpoint to give us the ability to make predictions in real time in the next step.

Inference

To deploy the endpoint, we use the deploy() method of the trained estimator. The deploy method takes several parameters: we provide an instance_count, which specifies the number of compute instances to launch initially, and an instance_type, which specifies the compute instance type. We also pass a CSVSerializer to serialize the incoming data of various formats into a CSV-formatted string for our endpoint. See the following code:

from sagemaker.serializers import CSVSerializer

predictor = xgb.deploy(initial_instance_count = 1, 
                       instance_type = 'ml.m5.xlarge', 
                       serializer = CSVSerializer(),
                       endpoint_name=endpoint_name)

After the endpoint has been deployed, we can make requests and receive predictions in real time. To make predictions, we use the predict() method to pass in data from the test_data data frame:

predictions = predictor.predict(test_data.to_numpy()[:,1:]).decode('utf-8')
# Predictions come back as a string. Below transforms the strings to floats and the probability to a binary label (1: dead, 0: alive)
y_predict = [1 if float(pred) > 0.5 else 0 for pred in predictions.split(',')]

The predictor returns a probability. If the probability is greater than 0.5, the patient is less likely to survive NSCLC. If the probability is less than 0.5, the patient is more likely to survive NSCLC.

Finally, we evaluate the prediction with the ground truth in the test_data using accuracy, F1 score, precision, recall, and a confusion matrix. We see that the model can accurately predict 19 out of 25 patients in the test_data, with a balanced precision and recall (and F1 score).
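Those metrics can be computed with scikit-learn as sketched below; y_test is assumed to hold the ground-truth survival labels from test_data, and y_predict is the binary prediction list from the previous snippet.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

print("accuracy :", accuracy_score(y_test, y_predict))
print("precision:", precision_score(y_test, y_predict))
print("recall   :", recall_score(y_test, y_predict))
print("f1 score :", f1_score(y_test, y_predict))
print(confusion_matrix(y_test, y_predict))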


Clean up

After completing this solution, we delete resources instantiated from the solution to avoid incurring further charges. To delete the endpoint and model resources, run the Clean up step in 4_train_test_model.ipynb.

To delete the full JumpStart solution, go to the launched solution you deployed by choosing the JumpStart icon. Under Your launched solutions, select Lung Cancer Survival Prediction, and scroll to the bottom of the solution page. Under Delete solution, choose Delete all resources. This will delete the launched resources including the notebooks that were created.

If you now go back to the S3Downloads folder, you will notice that the Studio notebooks have been deleted.

To delete your SageMaker domain, refer to Delete an Amazon SageMaker Domain.

Conclusion

In this post, we announced a new JumpStart solution that predicts the survival status of non-small cell lung cancer patients using multimodal data. We walked you through the diverse multimodal dataset (medical imaging, genomic, and clinical records); discussed our solution that uses SageMaker features such as SageMaker Processing, Feature Store, the built-in XGBoost algorithm, and SageMaker real-time endpoints; and showed you how to easily run the solution in JumpStart, where cloud resources are managed and deployed for you with just one click.

We encourage you to launch the solution in Studio and step through the notebooks in detail to learn how to process complex healthcare multimodal data, build a survival status prediction model, and make inference on the data using SageMaker. You will then have the knowledge to build an ML solution using SageMaker for your own healthcare and life sciences datasets and use cases.

To learn more about other solutions, pre-built models, and algorithms in JumpStart, visit SageMaker JumpStart.

To learn more about multimodal and multi-omics analysis on AWS, visit Simplifying Multi-modal & Multi-omics Analysis with AWS for Health.

Disclaimer

This solution is for demonstrative purposes only. It is not for clinical use and is not a substitute for professional medical advice, diagnosis, or treatment. The associated notebooks, including the trained model and sample data, are not intended for production. It is each customer’s responsibility to determine whether they are subject to HIPAA, and if so, how best to comply with HIPAA and its implementing regulations. Before using AWS in connection with protected health information, customers must enter into an AWS Business Associate Addendum (BAA) and follow its configuration requirements.

References

[1] Travis, William D., et al. “The 2015 World Health Organization classification of lung tumors: impact of genetic, clinical and radiologic advances since the 2004 classification.” Journal of thoracic oncology 10.9 (2015): 1243-1260.

[2] Jemal, A., et al. “Cancer statistics, 2002.” CA: A Cancer Journal for Clinicians 52.1 (2002): 23-47.

[3] Bakr, Shaimaa, et al. “A radiogenomic dataset of non-small cell lung cancer.” Scientific data 5.1 (2018): 1-9.

[4] Zhou, Mu, et al. “Non–small cell lung cancer radiogenomics map identifies relationships between molecular and imaging phenotypes with prognostic implications.” Radiology 286.1 (2018): 307.


About the Authors

Michael Hsieh is a Principal AI/ML Specialist Solutions Architect. He focuses on solving business challenges using AI/ML for customers in the healthcare and life sciences industry. As a Seattle transplant, he loves exploring the great Mother Nature the city has to offer, such as the hiking trails, scenic kayaking in the SLU, and the sunset at Shilshole Bay. As a former long-time resident of Philadelphia, he has been rooting for the Philadelphia Eagles and Philadelphia Phillies.

Olivia Choudhury, PhD, is a Senior Partner Solutions Architect at AWS. She helps partners in the healthcare and life sciences domain design, develop, and scale state-of-the-art solutions using AWS. She has a background in genomics, healthcare analytics, federated learning, and privacy-preserving machine learning. Outside of work, she plays board games, paints landscapes, and collects manga.

Curt Lockhart is an AI/ML Specialist Solutions Architect at AWS. He comes from a non-traditional background of working in the arts before his move to tech, and enjoys making machine learning approachable for each customer. Based in Seattle, you can find him venturing to local art museums, catching a concert, and wandering throughout the cities and outdoors of the Pacific Northwest.

Read More

Research Focus: Week of November 7, 2022

Research Focus: Week of November 7, 2022

Microsoft Research Focus 03: Week of November 7th, 2022

Welcome to Research Focus, a new series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Microsoft Turing Universal Language Representation model, T-ULRv6, tops both XTREME and GLUE leaderboards with a single model

Barun Patra, Saksham Singhal, Shaohan Huang, Zewen Chi, Li Dong, Furu Wei, Vishrav Chaudhary, Xia Song

The most recent addition to Microsoft’s Turing Universal Language Representation Model family (T-ULRv6) has achieved the top position on both the Google XTREME and GLUE leaderboards, the first time that a single multilingual model has demonstrated state-of-the-art capabilities in both English and multilingual understanding tasks. The result of a collaboration between the Microsoft Turing team and Microsoft Research, the T-ULRv6 XXL model is based on “XY-LENT,” which leverages X-Y (non-English Centric) bitexts and incorporates the key innovations of XLM-E, the improved training data and vocabulary, and the advanced fine-tuning technique of XTune. Furthermore, to enable scaling to XXL sized models, T-ULRv6 leverages the memory optimization benefits afforded by ZeRO. To effectively utilize X-Y bitexts, the team adopted a novel sampling strategy and reconstructed the vocabulary using VoCap, which ensures an efficient distribution of data across languages and helps mitigate sparse sampling distributions from previous works. The XXL model variant outperforms both XLM-R XXL and mT5 XXL while being ~2x and ~3x smaller, respectively.

T-ULRv6 powers the language universalization of Microsoft Bing, enabling users to search and discover information across languages and domains. T-ULRv6 will soon enhance other Microsoft products with its multilingual capabilities. 

XTREME, or Cross-lingual TRansfer Evaluation of Multilingual Encoders, is a benchmark covering 40 typologically diverse languages across 12 language families, with nine tasks that require reasoning about syntax or semantics.

GLUE – or the General Language Understanding Evaluation benchmark – is a collection of resources for training, evaluating, and analyzing natural language understanding systems.


PACT: Perception-Action Causal Transformer for autoregressive robotics pretraining

Rogerio Bonatti, Sai Vemprala, Shuang Ma, Felipe Frujeri, Shuhang Chen, Ashish Kapoor

Recent advances in machine learning architectures have induced a paradigm shift from task-specific models towards large general-purpose networks. For instance, in the past few years we have witnessed a revolution in the domains of natural language and computer vision with models such as GPT-3, BERT and DALL-E. The field of robotics is still mostly riddled with single-purpose systems architectures whose modules and connections, whether traditional or learning-based, require significant human design expertise. Inspired by these large pre-trained models, this work introduces a general-purpose robotics representation that can serve as a starting point for multiple tasks for a mobile agent, such as navigation, mapping and localization.

We present the Perception-Action Causal Transformer (PACT), a generative transformer-based architecture that aims to build representations directly from robot data in a self-supervised fashion. Through autoregressive prediction of states and actions over time, our model implicitly encodes dynamics and behaviors for a particular robot. This representation can then function as a single starting point to achieve distinct tasks through fine-tuning with minimal data.


Microsoft Research and NHS Scotland conduct world’s first clinical trials of Holoportation™-based 3D telemedicine system to increase access to healthcare

Steven Lo, Spencer Fowers, Kwame Darko, Thiago Spina, Catriona Graham, Andrea Britto, Anna Rose, David Tittsworth, Aileen McIntyre, Chris O’Dowd, Roma Maguire, Wayne Chang, David Young, Amber Hoak, Robin Young, Mark Dunlop, Levi Ankrah, Martina Messow, Opoku Ampomah, Ben Cutler, Roma Armstrong, Ruchi Lalwani, Ruairidh Davison, Sophie Bagnall, Whitney Hudson, Mike Shepperd, Jonny Johnson, 3DTM (3D Telemedicine) Collaborative research group

The Covid pandemic has increased the usage of remote health consultations and underscored the need for a better system. Current 2D telemedicine engagements fail to replicate the fluency or authenticity of in-person consultations. Real-time 3D telemedicine has previously been proposed within a research setting only, with constraints on complexity, bandwidth and technology.

This research reports on an international collaboration on the participatory development and first validated clinical use of a novel, real-time 360-degree 3D telemedicine system worldwide. NHS Greater Glasgow and Clyde have been working with Microsoft since 2019 to assess how health care could leverage Microsoft’s 3D telemedicine, focusing on plastic surgery patients and leveraging Microsoft’s Holoportation™ communication technology.

This research was designed to compare validated outcome measures of a patient-centered 3D telemedicine system with a 2D system, assess alignment with an in-person consultation, and to ensure safety, reliability and clinical concordance. In three separate studies, the 3D system improved patient metrics in comparison to 2D telemedicine, suggesting that remote consultations could get closer to the experience of face-to-face consultations.


Interactive code generation via test-driven user-intent formalization

Shuvendu Lahiri, Aaditya Naik, Georgios Sakkas, Piali Choudhury, Curtis von Veh, Madan Musuvathi, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao

Automatic code generation from natural language intent using large language models is disrupting coding. However, the correctness of the resulting code with respect to user intent expressed in natural language is difficult to establish because natural language lacks formal semantics. In this project, we investigate the problem of neural specification generation (i.e., generating partial formal specifications that match the intent expressed in natural language), and incorporating such specifications during the coding process to improve trust in human-written or AI-generated code.

We instantiate this framework starting with unit tests; tests serve as weak yet formal specifications of a module. We can leverage the abundance of human-written unit tests to train models. Further, these specifications (tests) can be checked using concrete execution without the need for more sophisticated abstract interpreters.  In prior work on TOGA, we demonstrated a neural model for synthesizing test oracles for a method and illustrated its use in finding functional bugs in code. In this work on TiCoder, we describe an interactive workflow to formalizing the informal user-intent through such model-generated tests and improving the accuracy, correctness and understanding of generated code.

The post Research Focus: Week of November 7, 2022 appeared first on Microsoft Research.

Read More

3D Illustrator Juliestrator Makes Marvelous Mushroom Magic This Week ‘In the NVIDIA Studio’

3D Illustrator Juliestrator Makes Marvelous Mushroom Magic This Week ‘In the NVIDIA Studio’

Editor’s note: This post is part of our weekly In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks, and demonstrates how NVIDIA Studio technology improves creative workflows. We’re also deep diving on new GeForce RTX 40 Series GPU features, technologies and resources, and how they dramatically accelerate content creation.

The warm, friendly animation Mushroom Spirit is featured In the NVIDIA Studio this week, modeled by talented 3D illustrator Julie Greenberg, aka Juliestrator.

In addition, NVIDIA Omniverse, an open platform for virtual collaboration and real-time photorealistic simulation, just dropped a beta release for 3D artists.

And with the approaching winter season comes the next NVIDIA Studio community challenge. Join the #WinterArtChallenge, running through the end of the year, by sharing winter-themed art on Instagram, Twitter or Facebook for a chance to be featured on the NVIDIA Studio social media channels. Be sure to tag #WinterArtChallenge to enter.

New in NVIDIA Omniverse

With new support for GeForce RTX 40 Series GPUs, NVIDIA Omniverse is faster, more accessible and more flexible than ever for collaborative 3D workflows across apps.

An example of what’s possible when talented 3D artists collaborate in Omniverse: a scene from the ‘NVIDIA Racer RTX’ demo.

NVIDIA DLSS 3, powered by the GeForce RTX 40 Series, is now available in Omniverse, enabling complete real-time ray-tracing workflows within the platform. The NVIDIA Ada Lovelace GPU architecture delivers a generational leap in performance and power that enables users to work in large-scale, virtual worlds with true interactivity — so creators can navigate viewports at full fidelity in real time.

The Omniverse Create app has new large world-authoring and animation improvements.

In Omniverse Machinima, creators gain AI superpowers with Audio2Gesture — an AI-powered tool that creates lifelike body movements based on an audio file.

PhysX 5, the technology behind Omniverse’s hyperrealistic physics simulation, features built-in audio for collisions, as well as improved cloth and deformable body simulations. Newly available as open source software, PhysX 5 enables artists and developers to modify, build and distribute custom physics engines.

The Omniverse Connect library has received updates to Omniverse Connectors, including Autodesk 3ds Max, Autodesk Maya, Autodesk Revit, Epic Games Unreal Engine, McNeel Rhino, Trimble SketchUp and Graphisoft Archicad. Connectors for Autodesk Alias and PTC Creo are also now available.

The updated Reallusion iClone 8.1.0 live-sync Connector allows for seamless character interactions between iClone and Omniverse apps. And OTOY’s OctaneRender Hydra render delegate enables Omniverse users to access OctaneRender directly in Omniverse apps.

Learn more about the Omniverse release and tune into the Twitch livestream detailing announcements on Wednesday, Nov. 9. Download Omniverse, which is free for NVIDIA RTX and GeForce RTX GPU owners.

Featuring a Fun-gi 

Juliestrator’s artistic inspiration comes from the examination of the different worlds that people create. “No matter if it’s the latest Netflix show or an artwork I see on Twitter, I love when a piece of art leaves space for my own imagination to fill in the gaps and come up with my own stories,” she said.

Mushroom Spirit was conceived as a sketch for last year’s Inktober challenge, which had the prompt of “spirit.” Rather than creating a ghost like many others, Juliestrator took a different approach. Mushroom Spirit was born as a cute nature spirit lurking in a forest, like the Kodama creatures from the Princess Mononoke film from which she drew inspiration.

Juliestrator gathered reference material using Pinterest. She then used PureRef’s overlay feature to help position reference imagery while modeling in Blender software. Though it’s rare for Juliestrator to sketch in 2D for 3D projects, she said Mushroom Spirit called for a more personal touch, so she generated a quick scribble in Procreate.

The origins of ‘Mushroom Spirit.’

Using Blender, she then entered the block-out phase — creating a rough-draft level built using simple 3D shapes, without details or polished art assets. This helped to keep base meshes clean, eliminating the need to create new meshes in the next round, which required only minor edits.

Getting the basic shapes down by blocking out ‘Mushroom Spirit’ in Blender.

At this point, many artists would typically start to model detailed scene elements, but Juliestrator prioritizes coloring. “I’ve noticed how much color influences the compositions and mood of the artwork, so I try to make this important decision as early as possible,” the artist said.

Color modifications in Adobe Substance 3D Painter.

She used Adobe Substance 3D Painter software to apply a myriad of colors and experimental textures to her models. On her NVIDIA Studio laptop, the Razer Blade 15 Studio equipped with an NVIDIA Quadro RTX 5000 GPU, Juliestrator used RTX-accelerated light and ambient occlusion to bake assets in mere seconds.

She then refined the existing models in Blender. “This is where powerful hardware helps a lot,” she said. “The NVIDIA OptiX AI-accelerated denoiser helps me preview any changes I make in Blender almost instantly, which lets me test more ideas at the same time and as a result get better finished renders.”

Tinkering and tweaking color palettes in Blender.

Though she enjoys the modeling stage, Juliestrator said that the desire to refine an endless number of details can be overwhelming. As such, she deploys an “80/20 rule,” dedicating no more than 20% of the entire project’s timeline to detailed modeling. “That’s the magic of the 80/20 rule: tackle the correct 20%, and the other 80% often falls into place,” she said.

Juliestrator finally adjusts the composition in 3D — manipulating the light objects, rotating the camera and adding animations. She completed all of this quickly with an assist from RTX-accelerated OptiX ray tracing in the Blender viewport, using Blender Cycles for the fastest frame renders.

Animations in Blender during the final stage.

Blender is Juliestrator’s preferred 3D modeling app, she said, due to its ease of use and powerful AI features, as well as its accessibility. “I truly appreciate the efforts of the Blender Foundation and all of its partners in keeping Blender free and available to people from all over the world, to enhance anyone’s creativity,” she said.

 

Juliestrator chose to use an NVIDIA Studio laptop, a “porta-bella” system for efficiency and convenience, she said. “I needed a powerful computer that would let me use both Blender and a game engine like Unity or Unreal Engine 5, while staying mobile and on the go,” the artist added.

Illustrator Julie Greenberg, aka Juliestrator.

Check out Juliestrator’s portfolio and social media links.

For more direction and inspiration for building 3D worlds, check out Juliestrator’s five-part tutorial, Modeling 3D New York Diorama, which covers the critical stages in 3D workflows: sketching composition, modeling details and more. The tutorials can be found on the NVIDIA Studio YouTube channel, which posts new videos every week.

And don’t forget to enter the NVIDIA Studio #WinterArtChallenge on Instagram, Twitter or Facebook.

The post 3D Illustrator Juliestrator Makes Marvelous Mushroom Magic This Week ‘In the NVIDIA Studio’ appeared first on NVIDIA Blog.

Read More