How Carrier predicts HVAC faults using AWS Glue and Amazon SageMaker

In their own words, “In 1902, Willis Carrier solved one of mankind’s most elusive challenges of controlling the indoor environment through modern air conditioning. Today, Carrier products create comfortable environments, safeguard the global food supply, and enable safe transport of vital medical supplies under exacting conditions.”

At Carrier, the foundation of our success is making products our customers can trust to keep them comfortable and safe year-round. High reliability and low equipment downtime are increasingly important as extreme temperatures become more common due to climate change. We have historically relied on threshold-based systems that alert us to abnormal equipment behavior, using parameters defined by our engineering team. Although such systems are effective, they are intended to identify and diagnose equipment issues rather than predict them. Predicting faults before they occur allows our HVAC dealers to proactively address issues and improve the customer experience.

In order to improve our equipment reliability, we partnered with the Amazon Machine Learning Solutions Lab to develop a custom machine learning (ML) model capable of predicting equipment issues prior to failure. Our teams developed a framework for processing over 50 TB of historical sensor data and predicting faults with 91% precision. We can now notify dealers of impending equipment failure, so that they can schedule inspections and minimize unit downtime. The solution framework is scalable as more equipment is installed and can be reused for a variety of downstream modeling tasks.

In this post, we show how the Carrier and AWS teams applied ML to predict faults across large fleets of equipment using a single model. We first highlight how we use AWS Glue for highly parallel data processing. We then discuss how Amazon SageMaker helps us with feature engineering and building a scalable supervised deep learning model.

Overview of use case, goals, and risks

The main goal of this project is to reduce downtime by predicting impending equipment failures and notifying dealers. This allows dealers to schedule maintenance proactively and provide exceptional customer service. We faced three primary challenges when working on this solution:

  • Data scalability – Data processing and feature extraction need to scale across a large and growing body of historical sensor data
  • Model scalability – The modeling approach needs to be capable of scaling across over 10,000 units
  • Model precision – Low false positive rates are needed to avoid unnecessary maintenance inspections

Scalability, both from a data and modeling perspective, is a key requirement for this solution. We have over 50 TB of historical equipment data and expect this data to grow quickly as more HVAC units are connected to the cloud. Data processing and model inference need to scale as our data grows. In order for our modeling approach to scale across over 10,000 units, we need a model that can learn from a fleet of equipment rather than relying on anomalous readings for a single unit. This will allow for generalization across units and reduce the cost of inference by hosting a single model.

The other concern for this use case is triggering false alarms. This means that a dealer or technician will go on-site to inspect the customer’s equipment and find everything to be operating appropriately. The solution requires a high precision model to ensure that when a dealer is alerted, the equipment is likely to fail. This helps earn the trust of dealers, technicians, and homeowners alike, and reduces the costs associated with unnecessary on-site inspections.

We partnered with the AI/ML experts at the Amazon ML Solutions Lab for a 14-week development effort. In the end, our solution includes two primary components. The first is a data processing module built with AWS Glue that summarizes equipment behavior and reduces the size of our training data for efficient downstream processing. The second is a model training interface managed through SageMaker, which allows us to train, tune, and evaluate our model before it is deployed to a production endpoint.

Data processing

Each HVAC unit we install generates data from 90 different sensors with readings for RPMs, temperature, and pressures throughout the system. This amounts to roughly 8 million data points generated per unit per day, with tens of thousands of units installed. As more HVAC systems are connected to the cloud, we anticipate the volume of data to grow quickly, making it critical for us to manage its size and complexity for use in downstream tasks. The length of the sensor data history also presents a modeling challenge. A unit may start displaying signs of impending failure months before a fault is actually triggered. This creates a significant lag between the predictive signal and the actual failure. A method for compressing the length of the input data becomes critical for ML modeling.
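
As a rough sanity check on that volume, assuming on the order of one reading per sensor per second (an illustrative assumption, not a published sampling rate):

sensors_per_unit = 90
seconds_per_day = 24 * 60 * 60                             # 86,400
data_points_per_unit_per_day = sensors_per_unit * seconds_per_day
print(data_points_per_unit_per_day)                        # 7,776,000 -- roughly 8 million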

To address the size and complexity of the sensor data, we compress it into cycle features as shown in Figure 1. This dramatically reduces the size of data while capturing features that characterize the equipment’s behavior.

Figure 1: Sample of HVAC sensor data

AWS Glue is a serverless data integration service for processing large quantities of data at scale. AWS Glue allowed us to easily run parallel data preprocessing and feature extraction. We used AWS Glue to detect cycles and summarize unit behavior using key features identified by our engineering team. This dramatically reduced the size of our dataset from over 8 million data points per day per unit down to roughly 1,200. Crucially, this approach preserves predictive information about unit behavior with a much smaller data footprint.
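
To make this concrete, the following is a minimal PySpark sketch of the kind of per-cycle aggregation an AWS Glue job can perform. The S3 paths, column names (unit_id, cycle_id, and the sensor columns), and the aggregates are illustrative assumptions rather than Carrier's actual feature definitions, and cycle detection (segmenting the raw stream into on/off cycles) is assumed to happen upstream:

from pyspark.context import SparkContext
from pyspark.sql import functions as F
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Hypothetical input: raw sensor readings already tagged with a cycle_id by an upstream step
raw = spark.read.parquet("s3://example-bucket/hvac/raw-sensor-data/")

# One summary row per (unit, cycle), using a handful of illustrative aggregates
cycle_summary = (
    raw.groupBy("unit_id", "cycle_id")
    .agg(
        F.min("timestamp").alias("cycle_start"),
        F.max("timestamp").alias("cycle_end"),
        F.avg("compressor_rpm").alias("avg_rpm"),
        F.max("discharge_temp").alias("max_discharge_temp"),
        F.stddev("suction_pressure").alias("std_suction_pressure"),
    )
)

cycle_summary.write.mode("overwrite").parquet("s3://example-bucket/hvac/cycle-features/")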

The output of the AWS Glue job is a summary of unit behavior for each cycle. We then use an Amazon SageMaker Processing job to calculate features across cycles and label our data. We formulate the ML problem as a binary classification task with a goal of predicting equipment faults in the next 60 days. This allows our dealer network to address potential equipment failures in a timely manner. It’s important to note that not all units fail within 60 days. A unit experiencing slow performance degradation could take more time to fail. We address this during the model evaluation step. We focused our modeling on summertime because those months are when most HVAC systems in the US are in consistent operation and under more extreme conditions.
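
The labeling logic itself is straightforward. The following pandas sketch shows one way the SageMaker Processing script could assign the 60-day binary label; the column names (unit_id, cycle_end, fault_time) are hypothetical placeholders, not our production schema:

import pandas as pd

HORIZON_DAYS = 60  # prediction horizon used for the binary label

def label_cycles(cycles: pd.DataFrame, faults: pd.DataFrame) -> pd.DataFrame:
    """Label each cycle 1 if its unit experiences a fault within HORIZON_DAYS of the cycle end."""
    # cycles: one row per (unit_id, cycle_end); faults: one row per (unit_id, fault_time)
    merged = cycles.merge(faults, on="unit_id", how="left")
    within_horizon = (
        (merged["fault_time"] > merged["cycle_end"])
        & (merged["fault_time"] <= merged["cycle_end"] + pd.Timedelta(days=HORIZON_DAYS))
    )
    merged["label"] = within_horizon.astype(int)
    # Keep one row per cycle: positive if any fault falls inside the window
    return merged.groupby(["unit_id", "cycle_end"], as_index=False)["label"].max()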

Modeling

Transformer architectures have become the state-of-the-art approach for handling temporal data. They can use long sequences of historical data at each time step without suffering from vanishing gradients. The input to our model at a given point in time is composed of the features for the previous 128 equipment cycles, which is roughly one week of unit operation. This is processed by a three-layer encoder whose output is averaged and fed into a multi-layered perceptron (MLP) classifier. The MLP classifier is composed of three linear layers with ReLU activation functions and a final layer with LogSoftMax activation. We use weighted negative log-likelihood loss with a different weight on the positive class for our loss function. This biases our model towards high precision and avoids costly false alarms. It also incorporates our business objectives directly into the model training process. Figure 2 illustrates the transformer architecture.
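
The following PyTorch sketch mirrors the architecture described above: a projection of the cycle features into the encoder dimension, a three-layer transformer encoder over the 128-cycle window, mean pooling, and an MLP head ending in LogSoftMax. The layer widths, number of attention heads, and class weights are placeholders rather than Carrier's tuned hyperparameters:

import torch
import torch.nn as nn

class TemporalFaultClassifier(nn.Module):
    """Transformer encoder over a window of cycle features, followed by an MLP classifier head."""

    def __init__(self, n_features: int, d_model: int = 128, n_heads: int = 4, n_layers: int = 3):
        super().__init__()
        self.input_proj = nn.Linear(n_features, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.head = nn.Sequential(
            nn.Linear(d_model, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 2), nn.LogSoftmax(dim=-1),
        )

    def forward(self, x):  # x: (batch, 128 cycles, n_features)
        h = self.encoder(self.input_proj(x))
        return self.head(h.mean(dim=1))  # average the encoder output over the 128 cycles

model = TemporalFaultClassifier(n_features=24)  # 24 cycle features is an arbitrary placeholder
log_probs = model(torch.randn(8, 128, 24))      # batch of 8 windows -> (8, 2) log-probabilities

# Weighted negative log-likelihood loss: a smaller weight on the positive class (index 1)
# biases the model toward precision; the weights shown here are illustrative only.
criterion = nn.NLLLoss(weight=torch.tensor([1.0, 0.3]))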

Figure 2: Temporal transformer architecture

Training

One challenge when training this temporal learning model is data imbalance. Some units have a longer operational history than others and therefore contribute more cycles to our dataset. Because they are overrepresented, these units would have an outsized influence on the model. We solve this by randomly sampling 100 cycles from each unit's history and assessing the probability of failure at each sampled point in time. This ensures that every unit is equally represented during training. In addition to removing the imbalance problem, this approach has the added benefit of replicating the batch processing approach that will be used in production. We applied this sampling approach to the training, validation, and test sets.
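
A compact pandas sketch of this per-unit sampling, assuming the labeled cycle table from the previous step and a hypothetical unit_id column, could look like the following:

import pandas as pd

CYCLES_PER_UNIT = 100

def sample_cycles(labeled_cycles: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    """Randomly sample up to 100 cycles per unit so that every unit is equally represented."""
    return (
        labeled_cycles.groupby("unit_id", group_keys=False)
        .apply(lambda unit: unit.sample(n=min(CYCLES_PER_UNIT, len(unit)), random_state=seed))
        .reset_index(drop=True)
    )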

Training was performed using a GPU-accelerated instance on SageMaker. Monitoring the loss shows that the model achieves its best results after 180 training epochs, as shown in Figure 3. Figure 4 shows that the area under the ROC curve for the resulting temporal classification model is 81%.

Figure 3: Training loss over epochs

Figure 4: ROC-AUC for 60-day lockout
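
As a rough illustration, a GPU training job like the one described above can be launched with the SageMaker PyTorch estimator. The entry point script, instance type, and hyperparameter names below are placeholders, not the exact configuration we used:

import sagemaker
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train_transformer.py",   # hypothetical training script containing the model above
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.g4dn.xlarge",       # any GPU-accelerated instance type works here
    framework_version="2.0.1",
    py_version="py310",
    hyperparameters={"epochs": 180, "context-cycles": 128},
)

estimator.fit({
    "train": "s3://example-bucket/hvac/train/",
    "validation": "s3://example-bucket/hvac/validation/",
})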

Evaluation

While our model is trained at the cycle level, evaluation needs to take place at the unit level. In this way, a unit with multiple true positive detections is still only counted as a single true positive at the unit level. To do this, we analyze the overlap between the predicted outcomes and the 60-day window preceding a fault. This is illustrated in the following figures, which show the four cases of predicted outcomes:

  • True negative – All the prediction results are negative (purple) (Figure 5.1)
  • False positive – The positive predictions are false alarms (Figure 5.2)
  • False negative – Although the predictions are all negative, the actual labels are positive (green) (Figure 5.3)
  • True positive – Some of the predictions may be negative (green), but at least one prediction is positive (yellow) (Figure 5.4)

Figure 5.1: True negative case

Figure 5.2: False positive case

Figure 5.3: False negative case

Figure 5.4: True positive case

After training, we use the evaluation set to tune the threshold for sending an alert. Setting the model confidence threshold at 0.99 yields a precision of roughly 81%. This falls short of our initial 90% criterion for success. However, we found that a good portion of units failed just outside the 60-day evaluation window. This makes sense, because a unit may actively display faulty behavior but take longer than 60 days to fail. To handle this, we defined a metric called effective precision, which is a combination of the true positive precision (81%) with the added precision of lockouts that occurred in the 30 days beyond our target 60-day window.

For an HVAC dealer, what matters most is that an on-site inspection helps prevent future HVAC issues for the customer. Using this model, we estimate that 81.2% of the time the inspection will prevent a lockout from occurring in the next 60 days. An additional 10.4% of the time, the lockout would have occurred within 90 days of the inspection. The remaining 8.4% of alerts will be false alarms. The effective precision of the trained model is 91.6%.
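
The effective precision calculation itself reduces to a simple sum over the unit-level outcomes:

# Unit-level outcomes from the evaluation set (values taken from the text above)
precision_60_day = 0.812        # lockout prevented within the 60-day target window
precision_60_to_90_day = 0.104  # lockout would have occurred 60-90 days after inspection
false_alarm_rate = 0.084        # remaining inspections that found no impending failure

effective_precision = precision_60_day + precision_60_to_90_day
print(f"Effective precision: {effective_precision:.1%}")  # 91.6%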

Conclusion

In this post, we showed how our team used AWS Glue and SageMaker to create a scalable supervised learning solution for predictive maintenance. Our model is capable of capturing trends across long-term histories of sensor data and accurately detecting hundreds of equipment failures weeks in advance. Predicting faults in advance will reduce curb-to-curb time, allowing our dealers to provide more timely technical assistance and improving the overall customer experience. The impacts of this approach will grow over time as more cloud-connected HVAC units are installed every year.

Our next step is to integrate these insights into the upcoming release of Carrier’s Connected Dealer Portal. The portal combines these predictive alerts with other insights we derive from our AWS-based data lake in order to give our dealers more clarity into equipment health across their entire client base. We will continue to improve our model by integrating data from additional sources and extracting more advanced features from our sensor data. The methods employed in this project provide a strong foundation for our team to start answering other key questions that can help us reduce warranty claims and improve equipment efficiency in the field.

If you’d like help accelerating the use of ML in your products and services, please contact the Amazon ML Solutions Lab. To learn more about the services used in this project, refer to the AWS Glue Developer Guide and the Amazon SageMaker Developer Guide.


About the Authors

Ravi Patankar is a technical leader for IoT related analytics at Carrier’s Residential HVAC Unit. He formulates analytics problems related to diagnostics and prognostics and provides direction for ML/deep learning-based analytics solutions and architecture.

Dan Volk is a Data Scientist at the AWS Generative AI Innovation Center. He has ten years of experience in machine learning, deep learning and time-series analysis and holds a Master’s in Data Science from UC Berkeley. He is passionate about transforming complex business challenges into opportunities by leveraging cutting-edge AI technologies.

Yingwei Yu is an Applied Scientist at AWS Generative AI Innovation Center. He has experience working with several organizations across industries on various proof-of-concepts in machine learning, including NLP, time-series analysis, and generative AI technologies. Yingwei received his PhD in computer science from Texas A&M University.

Yanxiang Yu is an Applied Scientist at Amazon Web Services, working on the Generative AI Innovation Center. With over 8 years of experience building AI and machine learning models for industrial applications, he specializes in generative AI, computer vision, and time series modeling. His work focuses on finding innovative ways to apply advanced generative techniques to real-world problems.

Diego Socolinsky is a Senior Applied Science Manager with the AWS Generative AI Innovation Center, where he leads the delivery team for the Eastern US and Latin America regions. He has over twenty years of experience in machine learning and computer vision, and holds a PhD degree in mathematics from The Johns Hopkins University.

Kexin Ding is a fifth-year Ph.D. candidate in computer science at UNC-Charlotte. Her research focuses on applying deep learning methods for analyzing multi-modal data, including medical image and genomics sequencing data.

Optimize deployment cost of Amazon SageMaker JumpStart foundation models with Amazon SageMaker asynchronous endpoints

The success of generative AI applications across a wide range of industries has attracted the attention and interest of companies worldwide who are looking to reproduce and surpass the achievements of competitors or solve new and exciting use cases. These customers are looking into foundation models, such as TII Falcon, Stable Diffusion XL, or OpenAI’s GPT-3.5, as the engines that power the generative AI innovation.

Foundation models are a class of generative AI models that are capable of understanding and generating human-like content, thanks to the vast amounts of unstructured data they have been trained on. These models have revolutionized various computer vision (CV) and natural language processing (NLP) tasks, including image generation, translation, and question answering. They serve as the building blocks for many AI applications and have become a crucial component in the development of advanced intelligent systems.

However, the deployment of foundation models can come with significant challenges, particularly in terms of cost and resource requirements. These models are known for their size, often ranging from hundreds of millions to billions of parameters. Their large size demands extensive computational resources, including powerful hardware and significant memory capacity. In fact, deploying foundation models usually requires at least one GPU (often more) to handle the computational load efficiently. For example, the TII Falcon-40B Instruct model requires at least an ml.g5.12xlarge instance to be loaded into memory successfully, but performs best with bigger instances. As a result, the return on investment (ROI) of deploying and maintaining these models can be too low to prove business value, especially during development cycles or for spiky workloads. This is due to the running costs of keeping GPU-powered instances up for long sessions, potentially 24/7.

Earlier this year, we announced Amazon Bedrock, a serverless API to access foundation models from Amazon and our generative AI partners. Although it’s currently in Private Preview, its serverless API allows you to use foundation models from Amazon, Anthropic, Stability AI, and AI21, without having to deploy any endpoints yourself. However, open-source models from communities such as Hugging Face have been growing a lot, and not every one of them has been made available through Amazon Bedrock.

In this post, we target these situations and address the risk of high costs by deploying large foundation models to Amazon SageMaker asynchronous endpoints from Amazon SageMaker JumpStart. This can help cut the costs of the architecture, allowing the endpoint to run only when requests are in the queue and for a short time-to-live, while scaling down to zero when no requests are waiting to be serviced. This sounds great for a lot of use cases; however, an endpoint that has scaled down to zero will introduce a cold start time before being able to serve inferences.

Solution overview

The following diagram illustrates our solution architecture.

The architecture we deploy is very straightforward:

  • The user interface is a notebook, which can be replaced by a web UI built on Streamlit or similar technology. In our case, the notebook is an Amazon SageMaker Studio notebook, running on an ml.m5.large instance with the PyTorch 2.0 Python 3.10 CPU kernel.
  • The notebook queries the endpoint in three ways: the SageMaker Python SDK, the AWS SDK for Python (Boto3), and LangChain.
  • The endpoint is running asynchronously on SageMaker, and on the endpoint, we deploy the Falcon-40B Instruct model. It’s currently the state of the art in terms of instruct models and available in SageMaker JumpStart. A single API call allows us to deploy the model on the endpoint.

What is SageMaker asynchronous inference

SageMaker asynchronous inference is one of the four deployment options in SageMaker, together with real-time endpoints, batch inference, and serverless inference. To learn more about the different deployment options, refer to Deploy models for Inference.

SageMaker asynchronous inference queues incoming requests and processes them asynchronously, making this option ideal for requests with large payload sizes up to 1 GB, long processing times, and near-real-time latency requirements. However, the main advantage that it provides when dealing with large foundation models, especially during a proof of concept (POC) or during development, is the capability to configure asynchronous inference to scale in to an instance count of zero when there are no requests to process, thereby saving costs. For more information about SageMaker asynchronous inference, refer to Asynchronous inference. The following diagram illustrates this architecture.

To deploy an asynchronous inference endpoint, you need to create an AsyncInferenceConfig object. If you create AsyncInferenceConfig without specifying its arguments, the default S3OutputPath will be s3://sagemaker-{REGION}-{ACCOUNTID}/async-endpoint-outputs/{UNIQUE-JOB-NAME} and S3FailurePath will be s3://sagemaker-{REGION}-{ACCOUNTID}/async-endpoint-failures/{UNIQUE-JOB-NAME}.
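
If you prefer to control where the results land, you can pass the locations explicitly when creating the configuration. The following sketch uses placeholder bucket names and an optional Amazon SNS notification configuration (the SNS topic ARNs are assumptions for illustration, not something required by this solution):

from sagemaker.async_inference import AsyncInferenceConfig

async_config = AsyncInferenceConfig(
    output_path="s3://example-bucket/async-endpoint-outputs/",
    failure_path="s3://example-bucket/async-endpoint-failures/",
    max_concurrent_invocations_per_instance=4,
    notification_config={
        "SuccessTopic": "arn:aws:sns:us-east-1:123456789012:async-success",
        "ErrorTopic": "arn:aws:sns:us-east-1:123456789012:async-errors",
    },
)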

What is SageMaker JumpStart

Our model comes from SageMaker JumpStart, a feature of SageMaker that accelerates the machine learning (ML) journey by offering pre-trained models, solution templates, and example notebooks. It provides access to a wide range of pre-trained models for different problem types, allowing you to start your ML tasks with a solid foundation. SageMaker JumpStart also offers solution templates for common use cases and example notebooks for learning. With SageMaker JumpStart, you can reduce the time and effort required to start your ML projects with one-click solution launches and comprehensive resources for practical ML experience.

The following screenshot shows an example of just some of the models available on the SageMaker JumpStart UI.

Deploy the model

Our first step is to deploy the model to SageMaker. To do that, we can use the UI for SageMaker JumpStart or the SageMaker Python SDK, which provides an API that we can use to deploy the model to the asynchronous endpoint:

%%time
from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.async_inference import AsyncInferenceConfig
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

model_id, model_version = "huggingface-llm-falcon-40b-instruct-bf16", "*"
my_model = JumpStartModel(model_id=model_id)
predictor = my_model.deploy(
    initial_instance_count=0,
    instance_type="ml.g5.12xlarge",
    async_inference_config=AsyncInferenceConfig()
)

This call can take approximately 10 minutes to complete. During this time, the endpoint is spun up, the container and model artifacts are downloaded to the endpoint, the model configuration is loaded from SageMaker JumpStart, and then the asynchronous endpoint is exposed via a DNS endpoint. To make sure that our endpoint can scale down to zero, we need to configure auto scaling on the asynchronous endpoint using Application Auto Scaling. You need to first register your endpoint variant with Application Auto Scaling, define a scaling policy, and then apply the scaling policy. In this configuration, we use a custom metric using CustomizedMetricSpecification, called ApproximateBacklogSizePerInstance, as shown in the following code. For a detailed list of Amazon CloudWatch metrics available with your asynchronous inference endpoint, refer to Monitoring with CloudWatch.

import boto3

client = boto3.client("application-autoscaling")
resource_id = "endpoint/" + my_model.endpoint_name + "/variant/" + "AllTraffic"

# Configure Autoscaling on asynchronous endpoint down to zero instances
response = client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0, # Minimum number of instances we want to scale down to - scale down to 0 to stop incurring costs
    MaxCapacity=1, # Maximum number of instances we want to scale up to - scaling up to 1 is enough for dev
)

response = client.put_scaling_policy(
    PolicyName="Invocations-ScalingPolicy",
    ServiceNamespace="sagemaker",  # The namespace of the AWS service that provides the resource.
    ResourceId=resource_id,  # Endpoint name
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",  # SageMaker supports only Instance Count
    PolicyType="TargetTrackingScaling",  # 'StepScaling'|'TargetTrackingScaling'
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,  # The target value for the metric. - here the metric is - SageMakerVariantInvocationsPerInstance
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": my_model.endpoint_name}],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 600,  # The amount of time, in seconds, after a scale in activity completes before another scale in activity can start.
        "ScaleOutCooldown": 300,  # ScaleOutCooldown - The amount of time, in seconds, after a scale out activity completes before another scale out activity can start.
        # 'DisableScaleIn': True|False - indicates whether scale in by the target tracking policy is disabled.
        # If the value is true, scale in is disabled and the target tracking policy won't remove capacity from the scalable resource.
    },
)

You can verify that this policy has been set successfully by navigating to the SageMaker console, choosing Endpoints under Inference in the navigation pane, and looking for the endpoint we just deployed.

Invoke the asynchronous endpoint

To invoke the endpoint, you need to place the request payload in Amazon Simple Storage Service (Amazon S3) and provide a pointer to this payload as a part of the InvokeEndpointAsync request. Upon invocation, SageMaker queues the request for processing and returns an identifier and output location as a response. Upon processing, SageMaker places the result in the Amazon S3 location. You can optionally choose to receive success or error notifications with Amazon Simple Notification Service (Amazon SNS).
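
Before invoking the endpoint with either of the methods that follow, the request payload has to exist at the S3 location referenced by bucket and prefix. The following sketch shows one way to upload it; the bucket name, key, and prompt are placeholders:

import json
import boto3

bucket = "example-async-inference-bucket"        # placeholder bucket
prefix = "falcon-40b/input/payload.json"         # placeholder key, reused by the invocations below
payload = {"inputs": "What is the meaning of life?", "parameters": {"max_new_tokens": 100}}

# Upload the JSON payload to the location that the asynchronous invocations will point at
boto3.client("s3").put_object(Bucket=bucket, Key=prefix, Body=json.dumps(payload))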

SageMaker Python SDK

After deployment is complete, it will return an AsyncPredictor object. To perform asynchronous inference, you need to upload data to Amazon S3 and use the predict_async() method with the S3 URI as the input. It will return an AsyncInferenceResponse object, and you can check the result using the get_result() method.

Alternatively, you can poll for the result periodically and return it once it has been generated. We use this second approach in the following code:

import time

# Invoking the asynchronous endpoint with the SageMaker Python SDK
def query_endpoint(payload):
    """Query endpoint and print the response"""
    response = predictor.predict_async(
        data=payload,
        input_path="s3://{}/{}".format(bucket, prefix),
    )
    while True:
        try:
            response = response.get_result()
            break
        except:
            print("Inference is not ready ...")
            time.sleep(5)
    print(f"33[1m Input:33[0m {payload['inputs']}")
    print(f"33[1m Output:33[0m {response[0]['generated_text']}")
    
query_endpoint(payload)

Boto3

Let’s now explore the invoke_endpoint_async method from Boto3’s sagemaker-runtime client. It enables developers to asynchronously invoke a SageMaker endpoint, providing a token for progress tracking and retrieval of the response later. Boto3 doesn’t offer a way to wait for the asynchronous inference to be completed like the SageMaker Python SDK’s get_result() operation. Therefore, we take advantage of the fact that Boto3 will store the inference output in Amazon S3 in the response["OutputLocation"]. We can use the following function to wait for the inference file to be written to Amazon S3:

import json
import time
import boto3
from botocore.exceptions import ClientError

s3_client = boto3.client("s3")

# Wait until the prediction is generated
def wait_inference_file(bucket, prefix):
    while True:
        try:
            response = s3_client.get_object(Bucket=bucket, Key=prefix)
            break
        except ClientError as ex:
            if ex.response['Error']['Code'] == 'NoSuchKey':
                print("Waiting for file to be generated...")
                time.sleep(5)
                continue
            else:
                raise
        except Exception as e:
            print(e.__dict__)
            raise
    return response

With this function, we can now query the endpoint:

# Invoking the asynchronous endpoint with the Boto3 SDK
import boto3

sagemaker_client = boto3.client("sagemaker-runtime")

# Query the endpoint function
def query_endpoint_boto3(payload):
    """Query endpoint and print the response"""
    response = sagemaker_client.invoke_endpoint_async(
        EndpointName=my_model.endpoint_name,
        InputLocation="s3://{}/{}".format(bucket, prefix),
        ContentType="application/json",
        Accept="application/json"
    )
    output_url = response["OutputLocation"]
    output_prefix = "/".join(output_url.split("/")[3:])
    # Read the bytes of the file from S3 in output_url with Boto3
    output = wait_inference_file(bucket, output_prefix)
    output = json.loads(output['Body'].read())[0]['generated_text']
    # Emit output
    print(f"33[1m Input:33[0m {payload['inputs']}")
    print(f"33[1m Output:33[0m {output}")

query_endpoint_boto3(payload)

LangChain

LangChain is an open-source framework launched in October 2022 by Harrison Chase. It simplifies the development of applications using large language models (LLMs) by providing integrations with various systems and data sources. LangChain allows for document analysis, summarization, chatbot creation, code analysis, and more. It has gained popularity, with contributions from hundreds of developers and significant funding from venture firms. LangChain enables the connection of LLMs with external sources, making it possible to create dynamic, data-responsive applications. It offers libraries, APIs, and documentation to streamline the development process.

LangChain provides libraries and examples for using SageMaker endpoints with its framework, making it easier to use ML models hosted on SageMaker as the “brain” of the chain. To learn more about how LangChain integrates with SageMaker, refer to the SageMaker Endpoint in the LangChain documentation.

One of the limits of the current implementation of LangChain is that it doesn’t support asynchronous endpoints natively. To use an asynchronous endpoint with LangChain, we have to define a new class, SagemakerAsyncEndpoint, that extends the SagemakerEndpoint class already available in LangChain. Additionally, we provide the following information:

  • The S3 bucket and prefix where asynchronous inference will store the inputs (and outputs)
  • A maximum number of seconds to wait before timing out
  • An updated _call() function to query the endpoint with invoke_endpoint_async() instead of invoke_endpoint()
  • A way to wake up the asynchronous endpoint if it’s in cold start (scaled down to zero)

To review the newly created SagemakerAsyncEndpoint, you can check out the sagemaker_async_endpoint.py file available on GitHub.

import json
from typing import Dict

import sagemaker
from langchain import PromptTemplate
from langchain.llms.sagemaker_endpoint import LLMContentHandler
from langchain.chains import LLMChain
from sagemaker_async_endpoint import SagemakerAsyncEndpoint

class ContentHandler(LLMContentHandler):
    content_type:str = "application/json"
    accepts:str = "application/json"
    len_prompt:int = 0

    def transform_input(self, prompt: str, model_kwargs: Dict) -> bytes:
        self.len_prompt = len(prompt)
        input_str = json.dumps({"inputs": prompt, "parameters": {"max_new_tokens": 100, "do_sample": False, "repetition_penalty": 1.1}})
        return input_str.encode('utf-8')

    def transform_output(self, output: bytes) -> str:
        response_json = output.read()
        res = json.loads(response_json)
        ans = res[0]['generated_text']
        return ans

chain = LLMChain(
    llm=SagemakerAsyncEndpoint(
        input_bucket=bucket,
        input_prefix=prefix,
        endpoint_name=my_model.endpoint_name,
        region_name=sagemaker.Session().boto_region_name,
        content_handler=ContentHandler(),
    ),
    prompt=PromptTemplate(
        input_variables=["query"],
        template="{query}",
    ),
)

print(chain.run(payload['inputs']))

Clean up

When you’re done testing the generation of inferences from the endpoint, remember to delete the endpoint to avoid incurring extra charges:

predictor.delete_endpoint()

Conclusion

When deploying large foundation models like TII Falcon, optimizing cost is crucial. These models require powerful hardware and substantial memory capacity, leading to high infrastructure costs. SageMaker asynchronous inference, a deployment option that processes requests asynchronously, reduces expenses by scaling the instance count to zero when there are no pending requests. In this post, we demonstrated how to deploy large SageMaker JumpStart foundation models to SageMaker asynchronous endpoints. We provided code examples using the SageMaker Python SDK, Boto3, and LangChain to illustrate different methods for invoking asynchronous endpoints and retrieving results. These techniques enable developers and researchers to optimize costs while using the capabilities of foundation models for advanced language understanding systems.

To learn more about asynchronous inference and SageMaker JumpStart, check out the following posts:


About the author

Davide Gallitelli is a Specialist Solutions Architect for AI/ML in the EMEA region. He is based in Brussels and works closely with customers throughout Benelux. He has been a developer since he was very young, starting to code at the age of 7. He started learning AI/ML at university, and has fallen in love with it since then.

Elevating the generative AI experience: Introducing streaming support in Amazon SageMaker hosting

We’re excited to announce the availability of response streaming through Amazon SageMaker real-time inference. Now you can continuously stream inference responses back to the client when using SageMaker real-time inference to help you build interactive experiences for generative AI applications such as chatbots, virtual assistants, and music generators. With this new feature, you can start streaming the responses immediately when they’re available instead of waiting for the entire response to be generated. This lowers the time-to-first-byte for your generative AI applications.

In this post, we’ll show how to build a streaming web application using SageMaker real-time endpoints with the new response streaming feature for an interactive chat use case. We use Streamlit for the sample demo application UI.

Solution overview

To get responses streamed back from SageMaker, you can use our new InvokeEndpointWithResponseStream API. It helps enhance customer satisfaction by delivering a faster time-to-first-response-byte. This reduction in customer-perceived latency is particularly crucial for applications built with generative AI models, where immediate processing is valued over waiting for the entire payload. Moreover, it introduces a sticky session that will enable continuity in interactions, benefiting use cases such as chatbots, to create more natural and efficient user experiences.

The implementation of response streaming in SageMaker real-time endpoints is achieved through HTTP 1.1 chunked encoding, which is a mechanism for sending multiple responses. This is an HTTP standard that supports binary content and is supported by most client/server frameworks. HTTP chunked encoding supports both text and image data streaming, which means that models hosted on SageMaker endpoints, such as Falcon, Llama 2, and Stable Diffusion, can send back streamed responses as text or images. In terms of security, both the input and output are secured using TLS with AWS SigV4 authentication. Other streaming techniques like Server-Sent Events (SSE) are also implemented using the same HTTP chunked encoding mechanism. To take advantage of the new streaming API, you need to make sure the model container returns the streamed response as chunked encoded data.

The following diagram illustrates the high-level architecture for response streaming with a SageMaker inference endpoint.

One of the use cases that will benefit from streaming response is generative AI model-powered chatbots. Traditionally, users send a query and wait for the entire response to be generated before receiving an answer. This could take precious seconds or even longer, which can potentially degrade the performance of the application. With response streaming, the chatbot can begin sending back partial inference results as they are generated. This means that users can see the initial response almost instantaneously, even as the AI continues refining its answer in the background. This creates a seamless and engaging conversation flow, where users feel like they’re chatting with an AI that understands and responds in real time.

In this post, we showcase two container options to create a SageMaker endpoint with response streaming: using an AWS Large Model Inference (LMI) and Hugging Face Text Generation Inference (TGI) container. In the following sections, we walk you through the detailed implementation steps to deploy and test the Falcon-7B-Instruct model using both LMI and TGI containers on SageMaker. We chose Falcon 7B as an example, but any model can take advantage of this new streaming feature.

Prerequisites

You need an AWS account with an AWS Identity and Access Management (IAM) role with permissions to manage resources created as part of the solution. For details, refer to Creating an AWS account. If this is your first time working with Amazon SageMaker Studio, you first need to create a SageMaker domain. Additionally, you may need to request a service quota increase for the corresponding SageMaker hosting instances. For the Falcon-7B-Instruct model, we use an ml.g5.2xlarge SageMaker hosting instance. For hosting a Falcon-40B-Instruct model, we use an ml.g5.48xlarge SageMaker hosting instance. You can request a quota increase from the Service Quotas UI. For more information, refer to Requesting a quota increase.

Option 1: Deploy a real-time streaming endpoint using an LMI container

The LMI container is one of the Deep Learning Containers for large model inference hosted by SageMaker to facilitate hosting large language models (LLMs) on AWS infrastructure for low-latency inference use cases. The LMI container uses Deep Java Library (DJL) Serving, which is an open-source, high-level, engine-agnostic Java framework for deep learning. With these containers, you can use corresponding open-source libraries such as DeepSpeed, Accelerate, Transformers-neuronx, and FasterTransformer to partition model parameters using model parallelism techniques to use the memory of multiple GPUs or accelerators for inference. For more details on the benefits using the LMI container to deploy large models on SageMaker, refer to Deploy large models at high performance using FasterTransformer on Amazon SageMaker and Deploy large models on Amazon SageMaker using DJLServing and DeepSpeed model parallel inference. You can also find more examples of hosting open-source LLMs on SageMaker using the LMI containers in this GitHub repo.

For the LMI container, we expect the following artifacts to help set up the model for inference:

  • serving.properties (required) – Defines the model server settings
  • model.py (optional) – A Python file to define the core inference logic
  • requirements.txt (optional) – Any additional pip packages that need to be installed

LMI containers can be used to host models without providing your own inference code. This is extremely useful when there is no custom preprocessing of the input data or postprocessing of the model’s predictions. We use the following configuration:

  • For this example, we host the Falcon-7B-Instruct model. We need to create a serving.properties configuration file with our desired hosting options and package it up into a tar.gz artifact. Response streaming can be enabled in DJL Serving by setting the enable_streaming option in the serving.properties file. For all the supported parameters, refer to Streaming Python configuration.
  • In this example, we use the default handlers in DJL Serving to stream responses, so we only care about sending requests and parsing the output response. You can also provide an entrypoint code with a custom handler in a model.py file to customize input and output handlers. For more details on the custom handler, refer to Custom model.py handler.
  • Because we’re hosting the Falcon-7B-Instruct model on a single GPU instance (ml.g5.2xlarge), we set option.tensor_parallel_degree to 1. If you plan to run in multiple GPUs, use this to set the number of GPUs per worker.
  • We use option.output_formatter to control the output content type. The default output content type is application/json, so if your application requires a different output, you can overwrite this value. For more information on the available options, refer to Configurations and settings and All DJL configuration options.

%%writefile serving.properties
engine=MPI 
option.model_id=tiiuae/falcon-7b-instruct
option.trust_remote_code=true
option.tensor_parallel_degree=1
option.max_rolling_batch_size=32
option.rolling_batch=auto
option.output_formatter=jsonlines
option.paged_attention=false
option.enable_streaming=true

To create the SageMaker model, retrieve the container image URI:

image_uri = image_uris.retrieve(
    framework="djl-deepspeed",
    region=sess.boto_session.region_name,
    version="0.23.0"
)

Use the SageMaker Python SDK to create the SageMaker model and deploy it to a SageMaker real-time endpoint using the deploy method:

instance_type = "ml.g5.2xlarge"
endpoint_name = sagemaker.utils.name_from_base("lmi-model-falcon-7b")

model = Model(sagemaker_session=sess, 
                image_uri=image_uri, 
                model_data=code_artifact, 
                role=role)

model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=900
)

When the endpoint is in service, you can use the InvokeEndpointWithResponseStream API call to invoke the model. This API allows the model to respond as a stream of parts of the full response payload. This enables models to respond with responses of larger size and enables a faster time-to-first-byte for models where there is a significant difference between the generation of the first and last byte of the response.

The response content type shown in x-amzn-sagemaker-content-type for the LMI container is application/jsonlines as specified in the model properties configuration. Because it’s part of the common data formats supported for inference, we can use the default deserializer provided by the SageMaker Python SDK to deserialize the JSON lines data. We create a helper LineIterator class to parse the response stream received from the inference request:

import io

class LineIterator:
    """
    A helper class for parsing the byte stream input.
    
    The output of the model will be in the following format:
    ```
    b'{"outputs": [" a"]}\n'
    b'{"outputs": [" challenging"]}\n'
    b'{"outputs": [" problem"]}\n'
    ...
    ```
    
    While usually each PayloadPart event from the event stream will contain a byte array 
    with a full json, this is not guaranteed and some of the json objects may be split across
    PayloadPart events. For example:
    ```
    {'PayloadPart': {'Bytes': b'{"outputs": '}}
    {'PayloadPart': {'Bytes': b'[" problem"]}\n'}}
    ```
    
    This class accounts for this by concatenating bytes written via the 'write' function
    and then exposing a method which will return lines (ending with a '\n' character) within
    the buffer via the 'scan_lines' function. It maintains the position of the last read 
    position to ensure that previous bytes are not exposed again. 
    """
    
    def __init__(self, stream):
        self.byte_iterator = iter(stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()
            if line and line[-1] == ord('\n'):
                self.read_pos += len(line)
                return line[:-1]
            try:
                chunk = next(self.byte_iterator)
            except StopIteration:
                if self.read_pos < self.buffer.getbuffer().nbytes:
                    continue
                raise
            if 'PayloadPart' not in chunk:
                print('Unknown event type: ' + str(chunk))
                continue
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk['PayloadPart']['Bytes'])

With the class in the preceding code, each time a response is streamed, it will return a binary string (for example, b'{"outputs": [" a"]}\n') that can be deserialized into a Python dictionary using the json package. We can use the following code to iterate through each streamed line of text and return the text response:

body = {"inputs": "what is life", "parameters": {"max_new_tokens":400}}
resp = smr.invoke_endpoint_with_response_stream(EndpointName=endpoint_name, Body=json.dumps(body), ContentType="application/json")
event_stream = resp['Body']

for line in LineIterator(event_stream):
    resp = json.loads(line)
    print(resp.get("outputs")[0], end='')

The following screenshot shows what it would look like if you invoked the model through the SageMaker notebook using an LMI container.

Option 2: Implement a chatbot using a Hugging Face TGI container

In the previous section, you saw how to deploy the Falcon-7B-Instruct model using an LMI container. In this section, we show how to do the same using a Hugging Face Text Generation Inference (TGI) container on SageMaker. TGI is an open source, purpose-built solution for deploying LLMs. It incorporates optimizations including tensor parallelism for faster multi-GPU inference, dynamic batching to boost overall throughput, and optimized transformers code using flash-attention for popular model architectures including BLOOM, T5, GPT-NeoX, StarCoder, and LLaMa.

TGI deep learning containers support token streaming using Server-Sent Events (SSE). With token streaming, the server can start answering after the first prefill pass directly, without waiting for all the generation to be done. For extremely long queries, this means clients can start to see something happening orders of magnitude before the work is done. The following diagram shows a high-level end-to-end request/response workflow for hosting LLMs on a SageMaker endpoint using the TGI container.

To deploy the Falcon-7B-Instruct model on a SageMaker endpoint, we use the HuggingFaceModel class from the SageMaker Python SDK. We start by setting our parameters as follows:

hf_model_id = "tiiuae/falcon-7b-instruct" # model id from huggingface.co/models
number_of_gpus = 1 # number of gpus to use for inference and tensor parallelism
health_check_timeout = 300 # Increase the timeout for the health check to 5 minutes for downloading the model
instance_type = "ml.g5.2xlarge" # instance type to use for deployment

Compared to deploying regular Hugging Face models, we first need to retrieve the container URI and provide it to our HuggingFaceModel model class with image_uri pointing to the image. To retrieve the new Hugging Face LLM DLC in SageMaker, we can use the get_huggingface_llm_image_uri method provided by the SageMaker SDK. This method allows us to retrieve the URI for the desired Hugging Face LLM DLC based on the specified backend, session, Region, and version. For more details on the available versions, refer to HuggingFace Text Generation Inference Containers.

llm_image = get_huggingface_llm_image_uri(
    "huggingface",
    version="0.9.3"
)

We then create the HuggingFaceModel and deploy it to SageMaker using the deploy method:

endpoint_name = sagemaker.utils.name_from_base("tgi-model-falcon-7b")

llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env={
        'HF_MODEL_ID': hf_model_id,
        # 'HF_MODEL_QUANTIZE': "bitsandbytes", # comment in to quantize
        'SM_NUM_GPUS': str(number_of_gpus),
        'MAX_INPUT_LENGTH': "1900",  # Max length of input text
        'MAX_TOTAL_TOKENS': "2048",  # Max length of the generation (including input text)
    }
)

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
    endpoint_name=endpoint_name,
)

The main difference compared to the LMI container is that you enable response streaming when you invoke the endpoint by supplying stream=true as part of the invocation request payload. The following code is an example of the payload used to invoke the TGI container with streaming:

body = {
    "inputs":"tell me one sentence",
    "parameters":{
        "max_new_tokens":400,
        "return_full_text": False
    },
    "stream": True
}

Then you can invoke the endpoint and receive a streamed response using the following command:

from sagemaker.base_deserializers import StreamDeserializer

llm.deserializer=StreamDeserializer()
resp = smr.invoke_endpoint_with_response_stream(EndpointName=llm.endpoint_name, Body=json.dumps(body), ContentType='application/json')

The response content type shown in x-amzn-sagemaker-content-type for the TGI container is text/event-stream. We use StreamDeserializer to deserialize the response into the EventStream class and parse the response body using the same LineIterator class as that used in the LMI container section.

Note that the streamed response from the TGI containers will return a binary string (for example, b'data:{"token": {"text": " sometext"}}'), which can be deserialized into a Python dictionary using the json package. We can use the following code to iterate through each streamed line of text and return a text response:

event_stream = resp['Body']
start_json = b'{'
stop_token = '<|endoftext|>'  # assumed end-of-sequence token emitted by the Falcon model
for line in LineIterator(event_stream):
    if line != b'' and start_json in line:
        data = json.loads(line[line.find(start_json):].decode('utf-8'))
        if data['token']['text'] != stop_token:
            print(data['token']['text'],end='')

The following screenshot shows what it would look like if you invoked the model through the SageMaker notebook using a TGI container.

Run the chatbot app on SageMaker Studio

In this use case, we build a dynamic chatbot on SageMaker Studio using Streamlit, which invokes the Falcon-7B-Instruct model hosted on a SageMaker real-time endpoint to provide streaming responses. First, you can test that the streaming responses work in the notebook as shown in the previous section. Then, you can set up the Streamlit application in the SageMaker Studio JupyterServer terminal and access the chatbot UI from your browser by completing the following steps:

  1. Open a system terminal in SageMaker Studio.
  2. On the top menu of the SageMaker Studio console, choose File, then New, then Terminal.
  3. Install the required Python packages that are specified in the requirements.txt file:
    $ pip install -r requirements.txt

  4. Set up the environment variable with the endpoint name deployed in your account:
    $ export endpoint_name=<Falcon-7B-instruct endpoint name deployed in your account>

  5. Launch the Streamlit app from the streamlit_chatbot_<LMI or TGI>.py file, which will automatically update the endpoint names in the script based on the environment variable that was set up earlier:
    $ streamlit run streamlit_chatbot_LMI.py --server.port 6006

  6. To access the Streamlit UI, copy your SageMaker Studio URL to another tab in your browser and replace lab? with proxy/[PORT NUMBER]/. Because we specified the server port to 6006, the URL should look as follows:
    https://<domain ID>.studio.<region>.sagemaker.aws/jupyter/default/proxy/6006/

Replace the domain ID and Region in the preceding URL with your account and Region to access the chatbot UI. You can find some suggested prompts in the left pane to get started.

The following demo shows how response streaming revolutionizes the user experience. It can make interactions feel fluid and responsive, ultimately enhancing user satisfaction and engagement. Refer to the GitHub repo for more details of the chatbot implementation.

Clean up

When you’re done testing the models, as a best practice, delete the endpoint to save costs if the endpoint is no longer required:

# - Delete the end point
sm_client.delete_endpoint(EndpointName=endpoint_name)

Conclusion

In this post, we provided an overview of building applications with generative AI, the challenges, and how SageMaker real-time response streaming helps you address these challenges. We showcased how to build a chatbot application to deploy the Falcon-7B-Instruct model to use response streaming using both SageMaker LMI and HuggingFace TGI containers using an example available on GitHub.

Start building your own cutting-edge streaming applications with LLMs and SageMaker today! Reach out to us for expert guidance and unlock the potential of large model streaming for your projects.


About the Authors

Raghu Ramesha is a Senior ML Solutions Architect with the Amazon SageMaker Service team. He focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in Computer Science from UT Dallas. In his free time, he enjoys traveling and photography.

Abhi Shivaditya is a Senior Solutions Architect at AWS, working with strategic global enterprise organizations to facilitate the adoption of AWS services in areas such as artificial intelligence, distributed computing, networking, and storage. His expertise lies in deep learning in the domains of natural language processing (NLP) and computer vision. Abhi assists customers in deploying high-performance machine learning models efficiently within the AWS ecosystem.

Alan Tan is a Senior Product Manager with SageMaker, leading efforts on large model inference. He’s passionate about applying machine learning to the area of analytics. Outside of work, he enjoys the outdoors.

Melanie Li, PhD, is a Senior AI/ML Specialist TAM at AWS based in Sydney, Australia. She helps enterprise customers build solutions using state-of-the-art AI/ML tools on AWS and provides guidance on architecting and implementing ML solutions with best practices. In her spare time, she loves to explore nature and spend time with family and friends.

Sam Edwards is a Cloud Engineer (AI/ML) at AWS Sydney, specializing in machine learning and Amazon SageMaker. He is passionate about helping customers solve issues related to machine learning workflows and creating new solutions for them. Outside of work, he enjoys playing racquet sports and traveling.

James Sanders is a Senior Software Engineer at Amazon Web Services. He works on the real-time inference platform for Amazon SageMaker.

FMOps/LLMOps: Operationalize generative AI and differences with MLOps

Nowadays, the majority of our customers are excited about large language models (LLMs) and are thinking about how generative AI could transform their business. However, bringing such solutions and models into business-as-usual operations is not an easy task. In this post, we discuss how to operationalize generative AI applications using MLOps principles, leading to foundation model operations (FMOps). Furthermore, we take a deep dive into the most common generative AI use case of text-to-text applications and LLM operations (LLMOps), a subset of FMOps. The following figure illustrates the topics we discuss.

Specifically, we briefly introduce MLOps principles and focus on the main differentiators compared to FMOps and LLMOps regarding processes, people, model selection and evaluation, data privacy, and model deployment. This applies to customers that use them out of the box, create foundation models from scratch, or fine-tune them. Our approach applies to both open-source and proprietary models equally.

ML operationalization summary

As defined in the post MLOps foundation roadmap for enterprises with Amazon SageMaker, ML operations (MLOps) is the combination of people, processes, and technology to productionize machine learning (ML) solutions efficiently. To achieve this, a combination of teams and personas need to collaborate, as illustrated in the following figure.

These teams are as follows:

  • Advanced analytics team (data lake and data mesh) – Data engineers are responsible for preparing and ingesting data from multiple sources, building ETL (extract, transform, and load) pipelines to curate and catalog the data, and prepare the necessary historical data for the ML use cases. These data owners are focused on providing access to their data to multiple business units or teams.
  • Data science team – Data scientists need to focus on creating the best model based on predefined key performance indicators (KPIs) working in notebooks. After the completion of the research phase, the data scientists need to collaborate with ML engineers to create automations for building (ML pipelines) and deploying models into production using CI/CD pipelines.
  • Business team – A product owner is responsible for defining the business case, requirements, and KPIs to be used to evaluate model performance. The ML consumers are other business stakeholders who use the inference results (predictions) to drive decisions.
  • Platform team – Architects are responsible for the overall cloud architecture of the business and how all the different services are connected together. Security SMEs review the architecture based on business security policies and needs. MLOps engineers are responsible for providing a secure environment for data scientists and ML engineers to productionize the ML use cases. Specifically, they are responsible for standardizing CI/CD pipelines, user and service roles and container creation, model consumption, testing, and deployment methodology based on business and security requirements.
  • Risk and compliance team – For more restrictive environments, auditors are responsible for assessing the data, code, and model artifacts and making sure that the business is compliant with regulations, such as data privacy.

Note that multiple personas can be covered by the same person depending on the scaling and MLOps maturity of the business.

These personas need dedicated environments to perform the different processes, as illustrated in the following figure.

The environments are as follows:

  • Platform administration – The platform administration environment is the place where the platform team has access to create AWS accounts and link the right users and data
  • Data – The data layer, often known as the data lake or data mesh, is the environment that data engineers or owners and business stakeholders use to prepare, interact with, and visualize the data
  • Experimentation – The data scientists use a sandbox or experimentation environment to test new libraries and ML techniques and prove that a proof of concept can solve business problems
  • Model build, model test, model deployment – The model build, test, and deployment environment is the layer of MLOps, where data scientists and ML engineers collaborate to automate and move the research to production
  • ML governance – The last piece of the puzzle is the ML governance environment, where all the model and code artifacts are stored, reviewed, and audited by the corresponding personas

The following diagram illustrates the reference architecture, which has already been discussed in MLOps foundation roadmap for enterprises with Amazon SageMaker.

Each business unit has its own set of development (automated model training and building), preproduction (automatic testing), and production (model deployment and serving) accounts to productionize ML use cases, which retrieve data from a centralized or decentralized data lake or data mesh, respectively. All the produced models and code automation are stored in a centralized tooling account using the capability of a model registry. The infrastructure code for all these accounts is versioned in a shared service account (advanced analytics governance account) that the platform team can abstract, templatize, maintain, and reuse for onboarding every new team to the MLOps platform.

Generative AI definitions and differences to MLOps

In classic ML, the preceding combination of people, processes, and technology can help you productionize your ML use cases. However, in generative AI, the nature of the use cases requires either an extension of those capabilities or new capabilities. One of these new notions is the foundation model (FM), so called because it can be used to create a wide range of other AI models, as illustrated in the following figure.

FMs are trained on terabytes of data and have hundreds of billions of parameters, enabling them to predict the next best answer for three main categories of generative AI use cases:

  • Text-to-text – The FMs (LLMs) have been trained based on unlabeled data (such as free text) and are able to predict the next best word or sequence of words (paragraphs or long essays). Main use cases are around human-like chatbots, summarization, or other content creation such as programming code.
  • Text-to-image – Labeled data, such as pairs of <text, image>, has been used to train FMs, which are able to predict the best combination of pixels. Example use cases are clothing design generation or imaginary personalized images.
  • Text-to-audio or video – Both labeled and unlabeled data can be used for FM training. One main generative AI use case example is music composition.

To productionize those generative AI use cases, we need to borrow and extend the MLOps domain to include the following:

  • FM operations (FMOps) – This can productionize generative AI solutions, including any use case type
  • LLM operations (LLMOps) – This is a subset of FMOps focusing on productionizing LLM-based solutions, such as text-to-text

The following figure illustrates the overlap of these use cases.

Compared to classic ML and MLOps, FMOps and LLMOps differ based on the main categories that we cover in the following sections: people and processes, selection and adaptation of FMs, evaluation and monitoring of FMs, data privacy and model deployment, and technology needs. We will cover monitoring in a separate post.

Operationalization journey per generative AI user type

To simplify the description of the processes, we need to categorize the main generative AI user types, as shown in the following figure.

The user types are as follows:

  • Providers – Users who build FMs from scratch and provide them as a product to other users (fine-tuners and consumers). They have deep end-to-end ML and natural language processing (NLP) expertise, strong data science skills, and massive data labeling and editing teams.
  • Fine-tuners – Users who retrain (fine-tune) FMs from providers to fit custom requirements. They orchestrate the deployment of the model as a service for use by consumers. These users need strong end-to-end ML and data science expertise and knowledge of model deployment and inference. Strong domain knowledge for tuning, including prompt engineering, is required as well.
  • Consumers – Users who interact with generative AI services from providers or fine-tuners by text prompting or a visual interface to complete desired actions. No ML expertise is required; consumers are mostly application developers or end-users with an understanding of the service capabilities. Only prompt engineering is necessary to get better results.

As per the definition and the required ML expertise, MLOps is required mostly for providers and fine-tuners, whereas consumers can use application productionization principles, such as DevOps and AppDev, to create generative AI applications. Furthermore, we have observed a movement among the user types, where providers might become fine-tuners to support use cases based on a specific vertical (such as the financial sector) or consumers might become fine-tuners to achieve more accurate results. Let’s now walk through the main processes per user type.

The journey of consumers

The following figure illustrates the consumer journey.

As previously mentioned, consumers are required to select, test, and use an FM, interacting with it by providing specific inputs, otherwise known as prompts. Prompts, in the context of computer programming and AI, refer to the input that is given to a model or system to generate a response. This can be in the form of text, a command, or a question, which the system uses to process and generate an output. The output generated by the FM can then be utilized by end-users, who should also be able to rate these outputs to enhance the model’s future responses.

Beyond these fundamental processes, we’ve noticed consumers expressing a desire to fine-tune a model by harnessing the functionality offered by fine-tuners. Take, for instance, a website that generates images. Here, end-users can set up private accounts, upload personal photos, and subsequently generate content related to those images (for example, generating an image depicting the end-user on a motorbike wielding a sword or located in an exotic location). In this scenario, the generative AI application, designed by the consumer, must interact with the fine-tuner backend via APIs to deliver this functionality to the end-users.

However, before we delve into that, let’s first concentrate on the journey of model selection, testing, usage, input and output interaction, and rating, as shown in the following figure.


Step 1. Understand top FM capabilities

There are many dimensions that need to be considered when selecting foundation models, depending on the use case, the data available, regulations, and so on. A good checklist, although not comprehensive, might be the following:

  • Proprietary or open-source FM – Proprietary models often come at a financial cost, but they typically offer better performance (in terms of quality of the generated text or image), often being developed and maintained by dedicated teams of model providers who ensure optimal performance and reliability. On the other hand, we also see adoption of open-source models that, other than being free, offer additional benefits of being accessible and flexible (for example, every open-source model is fine-tunable). An example of a proprietary model is Anthropic’s Claude model, and an example of a high performing open-source model is Falcon-40B, as of July 2023.
  • Commercial license – Licensing considerations are crucial when deciding on an FM. It’s important to note that some models are open-source but can’t be used for commercial purposes, due to licensing restrictions or conditions. The differences can be subtle: the newly released xgen-7b-8k-base model, for example, is open source and commercially usable (Apache-2.0 license), whereas the instruction fine-tuned version of the model, xgen-7b-8k-inst, is released for research purposes only. When selecting an FM for a commercial application, it’s essential to verify the license agreement, understand its limitations, and ensure it aligns with the intended use of the project.
  • Parameters – The number of parameters, which consist of the weights and biases in the neural network, is another key factor. More parameters generally mean a more complex and potentially powerful model, because it can capture more intricate patterns and correlations in the data. However, the trade-off is that it requires more computational resources and, therefore, costs more to run. Additionally, we do see a trend towards smaller models, especially in the open-source space (models ranging from 7 billion to 40 billion parameters), that perform well, especially when fine-tuned.
  • Speed – The speed of a model is influenced by its size. Larger models tend to process data slower (higher latency) due to the increased computational complexity. Therefore, it’s crucial to balance the need for a model with high predictive power (often larger models) with the practical requirements for speed, especially in applications, like chatbots, that demand real-time or near-real-time responses.
  • Context window size (number of tokens) – The context window, defined by the maximum number of tokens that can be input or output per prompt, is crucial in determining how much context the model can consider at a time (a token roughly translates to 0.75 words for English). Models with larger context windows can understand and generate longer sequences of text, which can be useful for tasks involving longer conversations or documents.
  • Training dataset – It’s also important to understand what kind of data the FM was trained on. Some models may be trained on diverse text datasets like internet data, coding scripts, instructions, or human feedback. Others may also be trained on multimodal datasets, like combinations of text and image data. This can influence the model’s suitability for different tasks. In addition, an organization might have copyright concerns depending on the exact sources a model has been trained on—therefore, it’s mandatory to inspect the training dataset closely.
  • Quality – The quality of an FM can vary based on its type (proprietary vs. open source), size, and what it was trained on. Quality is context-dependent, meaning what is considered high-quality for one application might not be for another. For example, a model trained on internet data might be considered high quality for generating conversational text, but less so for technical or specialized tasks.
  • Fine-tunable – The ability to fine-tune an FM by adjusting its model weights or layers can be a crucial factor. Fine-tuning allows for the model to better adapt to the specific context of the application, improving performance on the specific task at hand. However, fine-tuning requires additional computational resources and technical expertise, and not all models support this feature. Open-source models are (in general) always fine-tunable because the model artifacts are available for downloading and the users are able to extend and use them at will. Proprietary models might sometimes offer the option of fine-tuning.
  • Existing customer skills – The selection of an FM can also be influenced by the skills and familiarity of the customer or the development team. If an organization has no AI/ML experts in their team, then an API service might be better suited for them. Also, if a team has extensive experience with a specific FM, it might be more efficient to continue using it rather than investing time and resources to learn and adapt to a new one.

The following is an example of two shortlists, one for proprietary models and one for open-source models. You might compile similar tables based on your specific needs to get a quick overview of the available options. Note that the performance and parameters of those models change rapidly and might be outdated by the time of reading, while other capabilities might be important for specific customers, such as the supported language.

The following is an example of notable proprietary FMs available in AWS (July 2023).

The following is an example of notable open-source FMs available in AWS (July 2023).

After you have compiled an overview of 10–20 potential candidate models, it becomes necessary to further refine this shortlist. In this section, we propose a swift mechanism that will yield two or three viable final models as candidates for the next round.

The following diagram illustrates the initial shortlisting process.

Typically, prompt engineers, who are experts in creating high-quality prompts that allow AI models to understand and process user inputs, experiment with various methods to perform the same task (such as summarization) on a model. We suggest that these prompts are not created on the fly, but are systematically extracted from a prompt catalog. This prompt catalog is a central location for storing prompts to avoid replication, enable version control, and share prompts within the team to ensure consistency between different prompt testers in the different development stages, which we introduce in the next section. This prompt catalog is analogous to a Git repository or a feature store. The generative AI developer, who could potentially be the same person as the prompt engineer, then needs to evaluate the output to determine if it would be suitable for the generative AI application they are seeking to develop.
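
To make the prompt catalog idea more concrete, the following is a minimal sketch of what a catalog entry might look like as a versioned JSON record; the field names, values, and file layout are illustrative assumptions rather than a prescribed schema.

import json
import os

# A minimal, illustrative prompt catalog entry (all field names are assumptions).
prompt_entry = {
    "prompt_id": "summarization-v2",            # unique, versioned identifier
    "task": "summarization",                    # task the prompt is designed for
    "template": "Summarize the following text in three sentences:\n\n{input_text}",
    "models_tested": ["falcon-40b-instruct"],   # models the prompt has been tried against
    "author": "prompt-engineer@example.com",
    "version": 2,
}

# Storing entries as JSON files in a Git repository makes them easy to diff and review.
os.makedirs("prompt_catalog", exist_ok=True)
with open(os.path.join("prompt_catalog", "summarization-v2.json"), "w") as f:
    json.dump(prompt_entry, f, indent=2)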

Step 2. Test and evaluate the top FM

After the shortlist is reduced to approximately three FMs, we recommend an evaluation step to further test the FMs’ capabilities and suitability for the use case. Depending on the availability and nature of evaluation data, we suggest different methods, as illustrated in the following figure.

Which method to use depends first on whether or not you have labeled test data.

If you have labeled data, you can use it to conduct a model evaluation, as we do with the traditional ML models (input some samples and compare the output with the labels). Depending on whether the test data has discrete labels (such as positive, negative, or neutral sentiment analysis) or is unstructured text (such as summarization), we propose different methods for evaluation:

  • Accuracy metrics – In case of discrete outputs (such as sentiment analysis), we can use standard accuracy metrics such as precision, recall, and F1 score
  • Similarity metrics – If the output is unstructured (such as a summary), we suggest similarity metrics like ROUGE and cosine similarity (see the sketch after this list)
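
The following is a minimal sketch of both metric families, assuming scikit-learn and the rouge-score package are installed; the labels, predictions, and summaries are made-up examples, not real evaluation data.

# Accuracy metrics for discrete outputs (for example, sentiment labels).
from sklearn.metrics import precision_score, recall_score, f1_score

labels      = ["positive", "negative", "neutral", "positive"]   # ground truth
predictions = ["positive", "negative", "positive", "positive"]  # FM outputs

print("precision:", precision_score(labels, predictions, average="macro", zero_division=0))
print("recall:", recall_score(labels, predictions, average="macro", zero_division=0))
print("f1:", f1_score(labels, predictions, average="macro", zero_division=0))

# Similarity metrics for unstructured outputs (for example, summaries).
from rouge_score import rouge_scorer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reference = "Amazon Lex is a service for building conversational interfaces using voice and text."
candidate = "Amazon Lex lets you build conversational interfaces for applications with voice and text."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
print(scorer.score(reference, candidate))

vectors = TfidfVectorizer().fit_transform([reference, candidate])
print(cosine_similarity(vectors[0], vectors[1]))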

Some use cases don’t lend themselves to having one true answer (for example, “Create a short children’s story for my 5-year-old daughter”). In such cases, it becomes more challenging to evaluate the models because you don’t have labeled test data. We propose two approaches, depending on the importance of human review of the model versus automated evaluation:

  • Human-in-the-Loop (HIL) – In this case, a team of prompt testers will review the responses from a model. Depending on how critical the application is, the prompt testers might review 100% of the model outputs or just a sample.
  • LLM-powered evaluation – In this scenario, the prompt testers are replaced by an LLM, ideally one that is more powerful (although perhaps slower and more costly) than the ones being tested. The LLM will review all model-generated responses and score them. This method may result in lower quality, but it’s a cheaper and faster evaluation option that might provide a good initial gauge on the models’ performance.

For example, we can use the following example prompt to an LLM, which includes the input query, the LLM response, and instructions: “We would like to request your feedback on the performance of an AI assistant in response to the user question displayed above. Please rate the helpfulness, relevance, accuracy, level of details of the assistant’s response. The assistant shall receive an overall score on a scale of 1 to 10, where a higher score indicates better overall performance.”
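
As an illustration, the following sketch assembles such an evaluation prompt in code; the call_evaluator_llm function is a hypothetical placeholder for whichever evaluator LLM endpoint or API you choose, not a specific AWS or provider API.

EVAL_INSTRUCTIONS = (
    "We would like to request your feedback on the performance of an AI assistant "
    "in response to the user question displayed above. Please rate the helpfulness, "
    "relevance, accuracy, and level of detail of the assistant's response. "
    "The assistant shall receive an overall score on a scale of 1 to 10, "
    "where a higher score indicates better overall performance."
)

def build_evaluation_prompt(question: str, model_response: str) -> str:
    # Combine the original question, the candidate model's answer,
    # and the evaluation instructions into one prompt for the evaluator LLM.
    return f"Question: {question}\n\nAssistant response: {model_response}\n\n{EVAL_INSTRUCTIONS}"

def call_evaluator_llm(prompt: str) -> str:
    # Hypothetical placeholder: replace with a call to your evaluator LLM
    # (for example, a SageMaker endpoint or a third-party API).
    raise NotImplementedError

evaluation_prompt = build_evaluation_prompt(
    "What if the Suez Canal had never been constructed?",
    "If the Suez Canal had never been constructed, ships would have to travel around Africa...",
)
# review = call_evaluator_llm(evaluation_prompt)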

This prompt will be sent to another LLM (the EvalLLM) for evaluation. The following is an example where the response of the Falcon-40B model has been evaluated by an EvalLLM (in this case Claude). We use the question “What if the Suez Canal had never been constructed?” The response is as follows:

Question: What if the Suez Canal had never been constructed?

falcon-40b-instruct

If the Suez Canal had never been constructed, ships would have to travel around Africa to navigate between the Mediterranean and the Red Sea. This would add significant time and distance to the voyage, making it less efficient and more expensive. Additionally, without the Suez Canal, many countries in the Middle East and North Africa would have been much less connected to the rest of the world, hindering economic and cultural development.

Review (Score: 9/10)

The assistant’s response is highly relevant, accurate, and detailed. It provides an excellent answer to the user’s hypothetical question about the non-existence of the Suez Canal. The assistant correctly points out the implications on maritime travel and the economic and cultural impact on the Middle East and North Africa. However, it could have further elaborated on the geopolitical implications or the impact on global trade patterns for a more comprehensive response.

The following figure illustrates the end-to-end evaluation process example.

Based on this example, to perform evaluation, we need to provide the example prompts, which we store in the prompt catalog, and an evaluation labeled or unlabeled dataset based on our specific applications. For example, with a labeled evaluation dataset, we can provide prompts (input and query) such as “Give me the full name of the UK PM in 2023” and outputs and answers, such as “Rishi Sunak.” With an unlabeled dataset, we provide just the question or instruction, such as “Generate the source code for a retail website.” We call the combination of the prompt catalog and the evaluation dataset the evaluation prompt catalog. The reason we differentiate the prompt catalog and the evaluation prompt catalog is that the latter is dedicated to a specific use case, whereas the prompt catalog contains generic prompts and instructions (such as question answering).

With this evaluation prompt catalog, the next step is to feed the evaluation prompts to the top FMs. The result is an evaluation result dataset that contains the prompts, the outputs of each FM, and the labeled output together with a score (if it exists). In the case of an unlabeled evaluation prompt catalog, there is an additional step for an HIL or LLM to review the results and provide a score and feedback (as we described earlier). The final outcome will be aggregated results that combine the scores of all the outputs (for example, the average precision or human rating) and allow the users to benchmark the quality of the models.
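
A minimal sketch of aggregating such an evaluation result dataset might look like the following; the record layout, model names, and scores are illustrative assumptions.

from collections import defaultdict

# Each record pairs a prompt from the evaluation prompt catalog with one FM's
# output and the score assigned by a human reviewer or an evaluator LLM.
evaluation_results = [
    {"model": "FM1", "prompt_id": "qa-001", "score": 9},
    {"model": "FM1", "prompt_id": "qa-002", "score": 7},
    {"model": "FM2", "prompt_id": "qa-001", "score": 8},
    {"model": "FM2", "prompt_id": "qa-002", "score": 8},
]

scores_by_model = defaultdict(list)
for record in evaluation_results:
    scores_by_model[record["model"]].append(record["score"])

# The average score per model is one simple way to benchmark quality across FMs.
for model, scores in scores_by_model.items():
    print(model, sum(scores) / len(scores))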

After the evaluation results have been collected, we propose choosing a model based on several dimensions. These typically come down to factors such as precision, speed, and cost. The following figure shows an example.

Each model will possess strengths and certain trade-offs along these dimensions. Depending on the use case, we should assign varying priorities to these dimensions. In the preceding example, we elected to prioritize cost as the most important factor, followed by precision, and then speed. Even though FM2 is slower and not as precise as FM1, it remains sufficiently effective and significantly cheaper to host. Consequently, we might select FM2 as the top choice.
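
The following is a minimal sketch of that kind of weighted trade-off, assuming each dimension has already been normalized to a 0–1 scale where higher is better (so cheaper and faster models receive higher cost and speed scores); the numbers and weights are made up for illustration.

# Normalized scores per candidate FM: higher is better on every dimension.
candidates = {
    "FM1": {"precision": 0.95, "speed": 0.90, "cost": 0.40},
    "FM2": {"precision": 0.90, "speed": 0.70, "cost": 0.90},
}

# Cost weighted highest, then precision, then speed, as in the example above.
weights = {"precision": 0.35, "speed": 0.15, "cost": 0.50}

def weighted_score(scores: dict) -> float:
    return sum(weights[dimension] * value for dimension, value in scores.items())

best = max(candidates, key=lambda name: weighted_score(candidates[name]))
print(best)  # FM2 wins here because it is much cheaper, despite being slower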

Step 3. Develop the generative AI application backend and frontend

At this point, the generative AI developers have selected the right FM for the specific application together with the help of prompt engineers and testers. The next step is to start developing the generative AI application. We have separated the development of the generative AI application into two layers, a backend and front end, as shown in the following figure.

On the backend, the generative AI developers incorporate the selected FM into the solutions and work together with the prompt engineers to create the automation to transform the end-user input to appropriate FM prompts. The prompt testers create the necessary entries to the prompt catalog for automatic or manual (HIL or LLM) testing. Then, the generative AI developers create the prompt chaining and application mechanism to provide the final output. Prompt chaining, in this context, is a technique to create more dynamic and contextually-aware LLM applications. It works by breaking down a complex task into a series of smaller, more manageable sub-tasks. For example, if we ask an LLM the question “Where was the prime minister of the UK born and how far is that place from London,” the task can be broken down into individual prompts, where a prompt might be built based on the answer of a previous prompt evaluation, such as “Who is the prime minister of the UK,” “What is their birthplace,” and “How far is that place from London?” To ensure a certain input and output quality, the generative AI developers also need to create the mechanism to monitor and filter the end-user inputs and application outputs. If, for example, the LLM application is supposed to avoid toxic requests and responses, they could apply a toxicity detector for input and output and filter those out. Lastly, they need to provide a rating mechanism, which will support the augmentation of the evaluation prompt catalog with good and bad examples. A more detailed representation of those mechanisms will be presented in future posts.
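
A minimal sketch of prompt chaining for the example question follows; the call_llm function is a hypothetical placeholder for the selected FM’s API or endpoint, and the prompts themselves are simplified.

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder for the selected FM's API or endpoint.
    raise NotImplementedError

def answer_with_prompt_chaining() -> str:
    # Step 1: resolve the entity that the follow-up questions depend on.
    pm = call_llm("Who is the prime minister of the UK? Answer with the name only.")
    # Step 2: build the next prompt from the previous answer.
    birthplace = call_llm(f"What is the birthplace of {pm}? Answer with the place only.")
    # Step 3: combine intermediate answers into the final prompt.
    distance = call_llm(f"How far is {birthplace} from London?")
    return f"{pm} was born in {birthplace}, which is {distance} from London."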

To provide the functionality to the generative AI end-user, the development of a frontend website that interacts with the backend is necessary. Therefore, DevOps and AppDevs (application developers on the cloud) personas need to follow best development practices to implement the functionality of input/output and rating.

In addition to this basic functionality, the frontend and backend need to incorporate the features of creating personal user accounts, uploading data, initiating fine-tuning as a black box, and using the personalized model instead of the basic FM. The productionization of a generative AI application is similar to that of a normal application. The following figure depicts an example architecture.

In this architecture, the generative AI developers, prompt engineers, and DevOps or AppDevs create and test the application manually by deploying it via CI/CD to a development environment (generative AI App Dev in the preceding figure) using dedicated code repositories and merging with the dev branch. At this stage, the generative AI developers will use the corresponding FM by calling the API provided by the FM providers or fine-tuners. Then, to test the application extensively, they need to promote the code to the test branch, which will trigger the deployment via CI/CD to the preproduction environment (generative AI App Pre-prod). In this environment, the prompt testers need to try a large number of prompt combinations and review the results. The combination of prompts, outputs, and reviews needs to be moved to the evaluation prompt catalog to automate the testing process in the future. After this extensive test, the last step is to promote the generative AI application to production via CI/CD by merging with the main branch (generative AI App Prod). Note that all the data, including the prompt catalog, evaluation data and results, end-user data and metadata, and fine-tuned model metadata, need to be stored in the data lake or data mesh layer. The CI/CD pipelines and repositories need to be stored in a separate tooling account (similar to the one described for MLOps).

The journey of providers

FM providers need to train FMs, which are deep learning models, so the end-to-end MLOps lifecycle and infrastructure are necessary for them, with additions in historical data preparation, model evaluation, and monitoring. The following figure illustrates their journey.

In classic ML, the historical data is most often created by feeding the ground truth via ETL pipelines. For example, in a churn prediction use case, an automation updates a database table based on the new status of a customer to churn/not churn automatically. In the case of FMs, they need either billions of labeled or unlabeled data points. In text-to-image use cases, a team of data labelers needs to label <text, image> pairs manually. This is an expensive exercise requiring significant human resources. Amazon SageMaker Ground Truth Plus can provide a team of labelers to perform this activity for you. For some use cases, this process can also be partially automated, for example by using CLIP-like models. In the case of an LLM, such as text-to-text, the data is unlabeled. However, it needs to be prepared and follow the format of the existing historical unlabeled data. Therefore, data editors are needed to perform the necessary data preparation and ensure consistency.

With the historical data prepared, the next step is the training and productionization of the model. Note that the same evaluation techniques as we described for consumers can be used.

The journey of fine-tuners

Fine-tuners aim to adapt an existing FM to their specific context. For example, an FM might summarize general-purpose text accurately but not a financial report, or might be unable to generate source code for an uncommon programming language. In those cases, the fine-tuners need to label data, fine-tune a model by running a training job, deploy the model, test it based on the consumer processes, and monitor the model. The following diagram illustrates this process.

For the time being, there are two fine-tuning mechanisms:

  • Fine-tuning – By using an FM and labeled data, a training job recalculates the weights and biases of the deep learning model layers. This process can be computationally intensive and requires a representative amount of data but can generate accurate results.
  • Parameter-efficient fine-tuning (PEFT) – Instead of recalculating all the weights and biases, researchers have shown that by adding additional small layers to the deep learning models, they can achieve satisfactory results (for example, LoRA). PEFT requires lower computational power than deep fine-tuning and a training job with less input data. The drawback is potentially lower accuracy.

The following diagram illustrates these mechanisms.
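
As an illustration of the PEFT approach, the following sketch attaches LoRA adapter layers to an open-source causal language model using the Hugging Face peft library; the base model, rank, and target module are example choices and may need adjusting for your model and library versions.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Load an open-source base FM (example choice; any causal LM works similarly).
base_model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    trust_remote_code=True,   # needed for Falcon on older transformers versions
)

# LoRA adds small trainable adapter matrices instead of updating all base weights.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # adapter rank (example value)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query_key_value"],   # Falcon's attention projection; varies by model
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()   # typically a small fraction of the base model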

Now that we have defined the two main fine-tuning methods, the next step is to determine how we can deploy and use the open-source and proprietary FM.

With open-source FMs, the fine-tuners can download the model artifact and the source code from the web, for example, by using the Hugging Face Model Hub. This gives you the flexibility to deep fine-tune the model, store it in a local model registry, and deploy it to an Amazon SageMaker endpoint. This process requires an internet connection. To support more secure environments (such as for customers in the financial sector), you can download the model on premises, run all the necessary security checks, and upload the artifacts to a local bucket in an AWS account. Then, the fine-tuners use the FM from the local bucket without an internet connection. This ensures data privacy, and the data doesn’t travel over the internet. The following diagram illustrates this method.
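
A minimal sketch of that download-and-stage step follows, assuming the huggingface_hub and boto3 libraries; the repository ID, bucket name, and key prefix are hypothetical examples.

import os
import boto3
from huggingface_hub import snapshot_download

# 1. Download the open-source model artifacts where internet access is allowed.
local_dir = snapshot_download(repo_id="tiiuae/falcon-7b-instruct")

# 2. After running your security checks, upload the artifacts to a local S3 bucket
#    in your own AWS account (bucket name is a hypothetical example).
s3 = boto3.client("s3")
bucket = "my-company-local-fm-bucket"
for root, _, files in os.walk(local_dir):
    for file_name in files:
        path = os.path.join(root, file_name)
        key = os.path.relpath(path, local_dir)
        s3.upload_file(path, bucket, f"falcon-7b-instruct/{key}")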

With proprietary FMs, the deployment process is different because the fine-tuners don’t have access to the model artifact or source code. The models are stored in proprietary FM provider AWS accounts and model registries. To deploy such a model to a SageMaker endpoint, the fine-tuners can request only the model package that will be deployed directly to an endpoint. This process requires customer data to be used in the proprietary FM providers’ accounts, which raises questions regarding customer-sensitive data being used in a remote account to perform fine-tuning, and models being hosted in a model registry that is shared among multiple customers. This leads to a multi-tenancy problem that becomes more challenging if the proprietary FM providers need to serve these models. If the fine-tuners use Amazon Bedrock, these challenges are resolved—the data doesn’t travel over the internet and the FM providers don’t have access to fine-tuners’ data. The same challenges hold for the open-source models if the fine-tuners want to serve models from multiple customers, such as the example we gave earlier with the website that thousands of customers will upload personalized images to. However, these scenarios can be considered controllable because only the fine-tuner is involved. The following diagram illustrates this method.

From a technology perspective, the architecture that a fine-tuner needs to support is like the one for MLOps (see the following figure). The fine-tuning needs to be conducted in dev by creating ML pipelines, such as using Amazon SageMaker Pipelines; performing preprocessing, fine-tuning (training job), and postprocessing; and sending the fine-tuned models to a local model registry in the case of an open-source FM (otherwise, the new model will be stored in the proprietary FM provider’s environment). Then, in preproduction, we need to test the model as we describe for the consumers’ scenario. Finally, the model will be served and monitored in production. Note that the current (fine-tuned) FM requires GPU instance endpoints. If we need to deploy each fine-tuned model to a separate endpoint, this might increase the cost in the case of hundreds of models. Therefore, we need to use multi-model endpoints and resolve the multi-tenancy challenge.

The fine-tuners adapt an FM to a specific context to use it for their business purpose. That means that most of the time, the fine-tuners are also consumers and are required to support all the layers described in the previous sections, including generative AI application development, the data lake and data mesh, and MLOps.

The following figure illustrates the complete FM fine-tuning lifecycle that the fine-tuners need to provide to the generative AI end-user.

The following figure illustrates the key steps.

The key steps are the following:

  1. The end-user creates a personal account and uploads private data.
  2. The data is stored in the data lake and is preprocessed to follow the format that the FM expects.
  3. This triggers a fine-tuning ML pipeline that adds the model to the model registry.
  4. From there, the model is either deployed to production with minimal testing or promoted through extensive testing with HIL review and manual approval gates.
  5. The fine-tuned model is made available for end-users.

Because this infrastructure is complex for non-enterprise customers, AWS released Amazon Bedrock to offload the effort of creating such architectures and bringing fine-tuned FMs closer to production.

FMOps and LLMOps personas and processes differentiators

Based on the preceding user type journeys (consumer, provider, and fine-tuner), new personas with specific skills are required, as illustrated in the following figure.

The new personas are as follows:

  • Data labelers and editors – These users label data, such as <text, image> pairs, or prepare unlabeled data, such as free text, and extend the advanced analytics team and data lake environments.
  • Fine-tuners – These users have deep knowledge of FMs and know how to tune them, extending the data science team that focuses on classic ML.
  • Generative AI developers – They have deep knowledge of selecting FMs, chaining prompts and applications, and filtering inputs and outputs. They belong to a new team: the generative AI application team.
  • Prompt engineers – These users design the input and output prompts to adapt the solution to the context, test them, and create the initial version of the prompt catalog. Their team is the generative AI application team.
  • Prompt testers – They test the generative AI solution (backend and frontend) at scale and feed their results back to augment the prompt catalog and evaluation dataset. Their team is the generative AI application team.
  • AppDev and DevOps – They develop the front end (such as a website) of the generative AI application. Their team is the generative AI application team.
  • Generative AI end-users – These users consume generative AI applications as black boxes, share data, and rate the quality of the output.

The extended version of the MLOps process map to incorporate generative AI can be illustrated with the following figure.

A new application layer is the environment where generative AI developers, prompt engineers, prompt testers, and AppDevs create the backend and frontend of generative AI applications. The generative AI end-users interact with the generative AI applications’ frontend via the internet (such as a web UI). On the other side, data labelers and editors need to preprocess the data without accessing the backend of the data lake or data mesh. Therefore, a web UI (website) with an editor is necessary for interacting securely with the data. SageMaker Ground Truth provides this functionality out of the box.

Conclusion

MLOps can help us productionize ML models efficiently. However, to operationalize generative AI applications, you need additional skills, processes, and technologies, leading to FMOps and LLMOps. In this post, we defined the main concepts of FMOps and LLMOps and described the key differentiators compared to MLOps capabilities in terms of people, processes, technology, FM selection, and evaluation. Furthermore, we illustrated the thought process of a generative AI developer and the development lifecycle of a generative AI application.

In the future, we will focus on providing solutions per the domain we discussed, and will provide more details on how to integrate FM monitoring (such as toxicity, bias, and hallucination) and third-party or private data source architectural patterns, such as Retrieval Augmented Generation (RAG), into FMOps/LLMOps.

To learn more, refer to MLOps foundation roadmap for enterprises with Amazon SageMaker and try out the end-to-end solution in Implementing MLOps practices with Amazon SageMaker JumpStart pre-trained models.

If you have any comments or questions, please leave them in the comments section.


About the Authors

Dr. Sokratis Kartakis is a Senior Machine Learning and Operations Specialist Solutions Architect for Amazon Web Services. Sokratis focuses on enabling enterprise customers to industrialize their Machine Learning (ML) solutions by exploiting AWS services and shaping their operating model, i.e. MLOps foundation, and transformation roadmap leveraging best development practices. He has spent 15+ years on inventing, designing, leading, and implementing innovative end-to-end production-level ML and Internet of Things (IoT) solutions in the domains of energy, retail, health, finance/banking, motorsports etc. Sokratis likes to spend his spare time with family and friends, or riding motorbikes.

Heiko Hotz is a Senior Solutions Architect for AI & Machine Learning with a special focus on natural language processing, large language models, and generative AI. Prior to this role, he was the Head of Data Science for Amazon’s EU Customer Service. Heiko helps our customers be successful in their AI/ML journey on AWS and has worked with organizations in many industries, including insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. In his spare time, Heiko travels as much as possible.


Use Amazon SageMaker Model Card sharing to improve model governance

As Artificial Intelligence (AI) and Machine Learning (ML) technologies have become mainstream, many enterprises have been successful in building critical business applications powered by ML models at scale in production. However, because these ML models are making critical business decisions, it’s important for enterprises to add proper guardrails throughout their ML lifecycle. Guardrails ensure that the security, privacy, and quality of the code, configuration, data, and model configuration used in the model lifecycle are versioned and preserved.

Implementing these guardrails is getting harder for enterprises because the ML processes and activities within enterprises are becoming more complex, involving deeply interconnected processes that require contributions from multiple stakeholders and personas. In addition to data engineers and data scientists, operational processes have been introduced to automate and streamline the ML lifecycle. Additionally, the growing involvement of business stakeholders and, in some cases, legal and compliance reviews requires capabilities for managing access control, activity tracking, and reporting across the ML lifecycle to add transparency.

The framework that gives systematic visibility into ML model development, validation, and usage is called ML governance. During AWS re:Invent 2022, AWS introduced new ML governance tools for Amazon SageMaker that simplify access control and enhance transparency over your ML projects. One of the tools available as part of ML governance is Amazon SageMaker Model Cards, which provides the capability to create a single source of truth for model information by centralizing and standardizing documentation throughout the model lifecycle.

SageMaker model cards enable you to standardize how models are documented, thereby achieving visibility into the lifecycle of a model, from design and building through training and evaluation. Model cards are intended to be a single source of truth for business and technical metadata about the model that can reliably be used for auditing and documentation purposes. They provide a fact sheet of the model that is important for model governance.

As you scale your models, projects, and teams, as a best practice we recommend that you adopt a multi-account strategy that provides project and team isolation for ML model development and deployment. For more information about improving governance of your ML models, refer to Improve governance of your machine learning models with Amazon SageMaker.

Architecture overview

The architecture is implemented as follows:

  • Data Science Account – Data Scientists conduct their experiments in SageMaker Studio and build an MLOps setup to deploy models to staging/production environments using SageMaker Projects.
  • ML Shared Services Account – The MLOps setup from the Data Science account will trigger continuous integration and continuous delivery (CI/CD) pipelines using AWS CodeCommit and AWS CodePipeline.
  • Dev Account – The CI/CD pipelines will further trigger ML pipelines in this account covering data preprocessing, model training, and postprocessing such as model evaluation and registration. The output of these pipelines is a model deployed to SageMaker endpoints to be consumed for inference purposes. Depending on your governance requirements, the Data Science and Dev accounts can be merged into a single AWS account.
  • Data Account – The ML pipelines running in the Dev Account will pull the data from this account.
  • Test and Prod Accounts – The CI/CD pipelines will continue the deployment after the Dev Account to set up SageMaker endpoint configuration in these accounts.
  • Security and Governance – Services like AWS Identity and Access Management (IAM), AWS IAM Identity Center, AWS CloudTrail, AWS Key Management Service (AWS KMS), Amazon CloudWatch, and AWS Security Hub will be used across these accounts as part of security and governance.

The following diagram illustrates this architecture.

For more information about setting up a scalable multi-account ML architecture, refer to MLOps foundation for enterprises with Amazon SageMaker.

Our customers need the capability to share model cards across accounts to improve visibility and governance of their models through the information shared in the model card. Now, with cross-account model card sharing, customers can enjoy the benefits of a multi-account strategy while having visibility into the model cards available in their organization, so they can accelerate collaboration and ensure governance.

In this post, we show how to set up and access model cards across Model Development Lifecycle (MDLC) accounts using the new cross-account sharing feature of the model card. First, we will describe a scenario and architecture for setting up the cross-account sharing feature of the model card, and then dive deep into each component of how to set up and access shared model cards across accounts to improve visibility and model governance.

Solution overview

When building ML models, we recommend setting up a multi-account architecture to provide workload isolation, improving security, reliability, and scalability. For this post, we assume we are building and deploying a model for a customer churn use case. The architecture diagram that follows shows one of the recommended approaches – a centralized model card – for managing a model card in a multi-account Machine Learning Model Development Lifecycle (MDLC) architecture. However, you can also adopt another approach, a hub-and-spoke model card. In this post, we focus only on the centralized model card approach, but the same principles can be extended to a hub-and-spoke approach. The main difference is that each spoke account maintains its own version of the model card and has processes to aggregate and copy it to a centralized account.

The following diagram illustrates this architecture.

The architecture is implemented as follows:

  1. The Lead Data Scientist is notified to solve the customer churn use case using ML and starts the ML project by creating a model card for the Customer Churn V1 model in Draft status in the ML Shared Services account.
  2. Through automation, that model card is shared with the ML Dev account.
  3. The Data Scientist builds the model and starts to populate information into the model card via APIs based on their experimentation results, and the model card status is set to Pending Review.
  4. Through automation, that model card is shared with the ML Test account.
  5. The ML Engineer (MLE) runs integration and validation tests in the ML Test account, and the model in the central registry is marked Pending Approval.
  6. The Model Approver reviews the model results with the supporting documentation provided in the central model card and approves the model card for production deployment.
  7. Through automation, that model card is shared with the ML Prod account in read-only mode.

Prerequisites

Before you get started, make sure you have the following prerequisites:

  • Two AWS accounts.
  • In both AWS accounts, an IAM federation role with administrator access to do the following:
    • Create, edit, view, and delete model cards within Amazon SageMaker.
    • Create, edit, view, and delete resource share within AWS RAM.

For more information, refer to Example IAM policies for AWS RAM.

Setting up model card sharing

The account where the model cards are created is the model card account. Users in the model card account share them with the shared accounts where they can be updated. Users in the model card account can share their model cards through AWS Resource Access Manager (AWS RAM). AWS RAM helps you share resources across AWS accounts.

In the following section, we show how to share model cards.

First, create a model card for a Customer Churn use case as previously described. On the Amazon SageMaker console, expand the Governance section and choose Model cards.

We create the model card in Draft status with the name Customer-Churn-Model-Card. For more information, refer to Create a model card. In this demonstration, you can leave the remainder of the fields blank and create the model card.

Alternatively, you can use the following AWS CLI command to create the model card:

aws sagemaker create-model-card --model-card-name Customer-Churn-Model-Card --content '{"model_overview": {"model_owner": "model-owner", "problem_type": "Customer Churn Model"}}' --model-card-status Draft
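
If you prefer to script this step in Python, the following is a roughly equivalent call using boto3; the model card name and the minimal content shown mirror the CLI example above.

import json
import boto3

sagemaker = boto3.client("sagemaker")

# Create the model card in Draft status with a minimal content document.
sagemaker.create_model_card(
    ModelCardName="Customer-Churn-Model-Card",
    ModelCardStatus="Draft",
    Content=json.dumps({
        "model_overview": {
            "model_owner": "model-owner",
            "problem_type": "Customer Churn Model",
        }
    }),
)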

Now, create the cross-account share using AWS RAM. In the AWS RAM console, select Create a resource share.

Enter a name for the resource share, for example “Customer-Churn-Model-Card-Share”. In the Resources – optional section, select the resource type as SageMaker Model Cards. The model card we created in the previous step will appear in the listing.

Select that model card and it will appear in the Selected resources section. Select that resource again as shown in the following steps and choose Next.

On the next page, you can select the Managed permissions. You can create custom permissions or use the default option “AWSRAMPermissionSageMakerModelCards” and select Next. For more information, refer to Managing permissions in AWS RAM.

On the next page, you can select Principals. Under Select principal type, choose AWS Account and enter the ID of the account with which to share the model card. Select Add and continue to the next page.

On the last page, review the information and select “Create resource share”. Alternatively, you can use the following AWS CLI command to create a resource share:

aws ram create-resource-share --name <Name of the resource share>

aws ram associate-resource-share --resource-share-arn <ARN of the resource share created by the previous command> --resource-arns <ARN of the Model Card>

aws ram associate-resource-share --resource-share-arn <ARN of the resource share created by the previous command> --principals <ID of the AWS account to share with>

On the AWS RAM console, you see the attributes of the resource share. Make sure that Shared resources, Managed permissions, and Shared principals are in the “Associated” status.

After you use AWS RAM to create a resource share, the principals specified in the resource share can be granted access to the share’s resources.

  • If you turn on AWS RAM sharing with AWS Organizations, and your principals that you share with are in the same organization as the sharing account, those principals can receive access as soon as their account administrator grants them permissions.
  • If you don’t turn on AWS RAM sharing with Organizations, you can still share resources with individual AWS accounts that are in your organization. The administrator in the consuming account receives an invitation to join the resource share, and they must accept the invitation before the principals specified in the resource share can access the shared resources.
  • You can also share with accounts outside of your organization if the resource type supports it. The administrator in the consuming account receives an invitation to join the resource share, and they must accept the invitation before the principals specified in the resource share can access the shared resources.

For more information about AWS RAM, refer to Terms and concepts for AWS RAM.

Accessing shared model cards

Now we can log in to the shared AWS account to access the model card. Make sure that you are accessing the AWS console using IAM permissions (IAM role) which allow access to AWS RAM.

With AWS RAM, you can view the resource shares to which you have been added, the shared resources that you can access, and the AWS accounts that have shared resources with you. You can also leave a resource share when you no longer require access to its shared resources.

To view the model card in the shared AWS account:

  1. Navigate to the Shared with me: Shared resources page in the AWS RAM console.
  2. Make sure that you are operating in the same AWS region where the share was created.
  3. The model card shared from the model card account will be available in the listing. If there is a long list of resources, you can apply a filter to find specific shared resources. You can apply multiple filters to narrow your search.
  4. The following information is available:
    1. Resource ID – The ID of the resource. This is the name of the model card that we created earlier in the model card account.
    2. Resource type – The type of resource.
    3. Last share date – The date on which the resource was shared with you.
    4. Resource shares – The number of resource shares in which the resource is included. Choose the value to view the resource shares.
    5. Owner ID – The ID of the principal who owns the resource.

You can also access the model card using the AWS CLI. Make sure that the IAM credentials configured for the AWS CLI have permissions to create, edit, and delete model cards within Amazon SageMaker. For more information, refer to Configure the AWS CLI.

You can use the following AWS IAM permissions policy as a template:

{
     "Version": "2012-10-17",
     "Statement": [
        {
             "Effect": "Allow",
             "Action": [
                 "sagemaker:DescribeModelCard",
                 "sagemaker:UpdateModelCard",
                 "sagemaker:CreateModelCardExportJob",
                 "sagemaker:ListModelCardVersions",
                 "sagemaker:DescribeModelCardExportJob"
             ],
             "Resource": [
                 "arn:aws:sagemaker:AWS-Region:AWS-model-card-account-id:model-card/example-model-card-name-0",
                 "arn:aws:sagemaker:AWS-Region:AWS-model-card-account-id:model-card/example-model-card-name-1/*"
             ]
        },
        { 
             "Effect": "Allow", 
             "Action": "s3:PutObject",
             "Resource": "arn:aws:s3:::Amazon-S3-bucket-storing-the-pdf-of-the-model-card/model-card-name/*"
        }
    ]
}

You can run the following AWS CLI command to access the details of the shared model card.

aws sagemaker describe-model-card --model-card-name <ARN of the model card>

Now you can make changes to this model card from this account.

aws sagemaker update-model-card --model-card-name <ARN of the Model Card> --content '{"model_overview": {"model_owner": "model-owner", "problem_type": "Customer Churn Model"}}'

After you make changes, go back to the model card account to see the changes that we made in this shared account.

The problem type has been updated to “Customer Churn Model” which we had provided as part of the AWS CLI command input.

Clean up

You can now delete the model card you created. Make sure that you delete the AWS RAM resource share that you created to share the model card.

Conclusion

In this post, we provided an overview of a multi-account architecture for scaling and governing your ML workloads securely and reliably. We discussed the architecture patterns for setting up model card sharing and illustrated how centralized model card sharing patterns work. Finally, we set up model card sharing across multiple accounts for improving visibility and governance in your model development lifecycle. We encourage you to try out the new model card sharing feature and let us know your feedback.


About the authors

Vishal Naik is a Sr. Solutions Architect at Amazon Web Services (AWS). He is a builder who enjoys helping customers accomplish their business needs and solve complex challenges with AWS solutions and best practices. His core area of focus includes Machine Learning, DevOps, and Containers. In his spare time, Vishal loves making short films on time travel and alternate universe themes.

Ram Vittal is a Principal ML Solutions Architect at AWS. He has over 20 years of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure and scalable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he rides his motorcycle and walks with his 2-year-old sheep-a-doodle!


Deploy self-service question answering with the QnABot on AWS solution powered by Amazon Lex with Amazon Kendra and large language models

Powered by Amazon Lex, the QnABot on AWS solution is an open-source, multi-channel, multi-language conversational chatbot. QnABot allows you to quickly deploy self-service conversational AI into your contact center, websites, and social media channels, reducing costs, shortening hold times, and improving customer experience and brand sentiment. Customers now want to apply the power of large language models (LLMs) to further improve the customer experience with generative AI capabilities. This includes automatically generating accurate answers from existing company documents and knowledge bases, and making their self-service chatbots more conversational.

Our latest QnABot releases, v5.4.0+, can now use an LLM to disambiguate customer questions by taking conversational context into account, dynamically generating answers from relevant FAQs or Amazon Kendra search results and document passages. It also provides attribution and transparency by displaying links to the reference documents and context passages that were used by the LLM to construct the answers.

When you deploy QnABot, you can choose to automatically deploy a state-of-the-art open-source LLM model (Falcon-40B-instruct) on an Amazon SageMaker endpoint. The LLM landscape is constantly evolving—new models are released frequently and our customers want to experiment with different models and providers to see what works best for their use cases. This is why QnABot also integrates with any other LLM using an AWS Lambda function that you provide. To help you get started, we’ve also released a set of sample one-click deployable Lambda functions (plugins) to integrate QnABot with your choice of leading LLM providers, including our own Amazon Bedrock service and APIs from third-party providers, Anthropic and AI21.

In this post, we introduce the new Generative AI features for QnABot and walk through a tutorial to create, deploy, and customize QnABot to use these features. We also discuss some relevant use cases.

New Generative AI features

Using the LLM, QnABot now has two new important features, which we discuss in this section.

Generate answers to questions from Amazon Kendra search results or text passages

QnABot can now generate concise answers to questions from document extracts provided by an Amazon Kendra search, or text passages created or imported directly. This provides the following advantages:

  • The number of FAQs that you need to maintain and import into QnABot is reduced, because you can now synthesize concise answers on the fly from your existing documents.
  • Generated answers can be modified to create the best experience for the intended channel. For example, you can set the answers to be short, concise, and suitable for voice channel contact center bots, and website or text bots could potentially provide more detailed information.
  • Generated answers are fully compatible with QnABot’s multi-language support—users can interact in their chosen languages and receive generated answers in the same language.
  • Generated answers can include links to the reference documents and context passages used, to provide attribution and transparency on how the LLM constructed the answers.

For example, when asked “What is Amazon Lex?”, QnABot can retrieve relevant passages from an Amazon Kendra index (containing AWS documentation). QnABot then asks (prompts) the LLM to answer the question based on the context of the passages (which can also optionally be viewed in the web client). The following screenshot shows an example.

Disambiguate follow-up questions that rely on preceding conversation context

Understanding the direction and context of an ever-evolving conversation is key to building natural, human-like conversational interfaces. User queries often require a bot to interpret requests based on conversation memory and context. Now QnABot will ask the LLM to generate a disambiguated question based on the conversation history. This can then be used as a search query to retrieve the FAQs, passages, or Amazon Kendra results to answer the user’s question. The following is an example chat history:

Human: What is Amazon Lex?
AI: "Amazon Lex is an AWS service for building conversational interfaces for applications using voice and text..."
Human: Can it integrate with my CRM?

QnABot uses the LLM to rewrite the follow-up question to make “it” unambiguous, for example, “Can Amazon Lex integrate with my CRM system?” This allows users to interact like they would in a human conversation, and QnABot generates clear search queries to find the relevant FAQs or document passages that have the information to answer the user’s question.
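
Conceptually, the disambiguation step prompts the LLM with the chat history plus the new utterance and asks for a standalone question. The following Python sketch shows the idea; the prompt wording is illustrative and is not QnABot's actual default template (which you can inspect and change via the LLM_GENERATE_QUERY_PROMPT_TEMPLATE setting discussed later).

chat_history = [
    ("Human", "What is Amazon Lex?"),
    ("AI", "Amazon Lex is an AWS service for building conversational interfaces..."),
]
followup_question = "Can it integrate with my CRM?"

history_text = "\n".join(f"{role}: {text}" for role, text in chat_history)
disambiguation_prompt = (
    "Given the following conversation, rewrite the follow-up question as a "
    "standalone question that needs no prior context.\n\n"
    f"{history_text}\nFollow-up question: {followup_question}\nStandalone question:"
)

# The LLM's completion (for example, "Can Amazon Lex integrate with my CRM system?")
# is then used as the search query against FAQs, text passages, or Amazon Kendra.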

These new features make QnABot more conversational and provide the ability to dynamically generate responses based on a knowledge base. This is still an experimental feature with tremendous potential. We strongly encourage users to experiment to find the best LLM and corresponding prompts and model parameters to use. QnABot makes it straightforward to experiment!

Tutorial

Time to try it! Let’s deploy the latest QnABot (v5.4.0 or later) and enable the new Generative AI features. The high-level steps are as follows:

  1. Create and populate an Amazon Kendra index.
  2. Choose and deploy an LLM plugin (optional).
  3. Deploy QnABot.
  4. Configure QnABot for your Lambda plugin (if using a plugin).
  5. Access the QnABot web client and start experimenting.
  6. Customize behavior using QnABot settings.
  7. Add curated Q&As and text passages to the knowledge base.

Create and populate an Amazon Kendra index

Download and use the following AWS CloudFormation template to create a new Amazon Kendra index.

This template includes sample data containing AWS online documentation for Amazon Kendra, Amazon Lex, and SageMaker. Deploying the stack takes about 30 minutes, followed by about 15 minutes to synchronize the data source and ingest the data into the index.

When the Amazon Kendra index stack is successfully deployed, navigate to the stack’s Outputs tab and note the Index Id, which you will use later when deploying QnABot.

Alternatively, if you already have an Amazon Kendra index with your own content, you can use it instead with your own example questions for the tutorial.

Choose and deploy an LLM plugin (optional)

QnABot can deploy a built-in LLM (Falcon-40B-instruct on SageMaker) or use Lambda functions to call any other LLMs of your choice. In this section, we show you how to use the Lambda option with a pre-built sample Lambda function. Skip to the next step if you want to use the built-in LLM instead.

First, choose the plugin LLM you want to use. Review your options from the qnabot-on-aws-plugin-samples repository README. As of this writing, plugins are available for Amazon Bedrock (in preview), and for AI21 and Anthropic third-party APIs. We expect to add more sample plugins over time.

Deploy your chosen plugin by choosing Launch Stack in the Deploy a new Plugin stack section, which will deploy into the us-east-1 Region by default (to deploy in other Regions, see Build and Publish QnABot Plugins CloudFormation artifacts).

When the Plugin stack is successfully deployed, navigate to the stack’s Outputs tab (see the following screenshot) and inspect its contents, which you will use in the following steps to deploy and configure QnABot. Keep this tab open in your browser.

Deploy QnABot

Choose Launch Solution from the QnABot implementation guide to deploy the latest QnABot template via AWS CloudFormation. Provide the following parameters:

  • For DefaultKendraIndexId, use the Amazon Kendra Index ID (a GUID) you collected earlier
  • For EmbeddingsApi (see Semantic Search using Text Embeddings), choose one of the following:
    • SAGEMAKER (the default built-in embeddings model)
    • LAMBDA (to use the Amazon Bedrock embeddings API with the BEDROCK-EMBEDDINGS-AND-LLM Plugin)
      • For EmbeddingsLambdaArn, use the EmbeddingsLambdaArn output value from your BEDROCK-EMBEDDINGS-AND-LLM Plugin stack.
  • For LLMApi (see Query Disambiguation for Conversational Retrieval, and Generative Question Answering), choose one of the following:
    • SAGEMAKER (the default built-in LLM model)
    • LAMBDA (to use the LLM Plugin deployed earlier)
      • For LLMLambdaArn, use the LLMLambdaArn output value from your Plugin stack

For all other parameters, accept the defaults (see the implementation guide for parameter definitions), and proceed to launch the QnABot stack.

Configure QnABot for your Lambda plugin (if using a plugin)

If you deployed QnABot using a sample LLM Lambda plugin to access a different LLM, update the QnABot model parameters and prompt template settings as recommended for your chosen plugin. For more information, see Update QnABot Settings. If you used the SageMaker (built-in) LLM option, skip to the next step, because the settings are already configured for you.

Access the QnABot web client and start experimenting

On the AWS CloudFormation console, choose the Outputs tab of the QnABot CloudFormation stack and choose the ClientURL link. Alternatively, launch the client by choosing QnABot on AWS Client from the Content Designer tools menu.

Now, try to ask questions related to AWS services, for example:

  • What is Amazon Lex?
  • How does SageMaker scale up inference workloads?
  • Is Kendra a search service?

Then you can ask follow-up questions without specifying the previously mentioned services or context, for example:

  • Is it secure?
  • Does it scale?

Customize behavior using QnABot settings

You can customize many settings on the QnABot Content Designer Settings page—see README – LLM Settings for a full list of relevant settings. For example, try the following:

  • Set ENABLE_DEBUG_RESPONSES to TRUE, save the settings, and try the previous questions again. Now you will see additional debug output at the top of each response, showing you how the LLM generates the Amazon Kendra search query based on the chat history, how long the LLM inferences took to run, and more. For example:
    [User Input: "Is it fast?", LLM generated query (1207 ms): "Does Amazon Kendra provide search results quickly?", Search string: "Is it fast? / Does Amazon Kendra provide search results quickly?"["LLM: LAMBDA"], Source: KENDRA RETRIEVE API

  • Set ENABLE_DEBUG_RESPONSES back to FALSE, set LLM_QA_SHOW_CONTEXT_TEXT and LLM_QA_SHOW_SOURCE_LINKS to FALSE, and try the examples again. Now the context and sources links are not shown, and the output contains only the LLM-generated response.
  • If you feel adventurous, experiment also with the LLM prompt template settings—LLM_GENERATE_QUERY_PROMPT_TEMPLATE and LLM_QA_PROMPT_TEMPLATE. Refer to README – LLM Settings to see how you can use placeholders for runtime values like chat history, context, user input, query, and more. Note that the default prompts can most likely be improved and customized to better suit your use cases, so don’t be afraid to experiment! If you break something, you can always revert to the default settings using the RESET TO DEFAULTS option on the settings page.

Add curated Q&As and text passages to the knowledge base

QnABot can, of course, continue to answer questions based on curated Q&As. It can also use the LLM to generate answers from text passages created or imported directly into QnABot, in addition to using an Amazon Kendra index.

QnABot attempts to find a good answer to the disambiguated user question in the following sequence:

  1. QnA items
  2. Text passage items
  3. Amazon Kendra index

Let’s try some examples.

On the QnABot Content Designer tools menu, choose Import, then load the two example packages:

  • TextPassages-NurseryRhymeExamples
  • blog-samples-final

QnABot can use text embeddings to provide semantic search capability (using QnABot’s built-in OpenSearch index as a vector store), which improves accuracy and reduces question tuning compared to standard OpenSearch keyword-based matching. To illustrate this, try questions like the following:

  • “Tell me about the Alexa device with the screen”
  • “Tell me about Amazon’s video streaming device?”

These should ideally match the sample QnA items you imported, even though the words used to ask the question are poor keyword matches (but good semantic matches) with the configured QnA items: Alexa.001 (What is an Amazon Echo Show) and FireTV.001 (What is an Amazon Fire TV).

Even if you are not (yet) using Amazon Kendra (and you should!), QnABot can also answer questions based on passages created or imported into Content Designer. The following questions (and follow-up questions) are all answered from an imported text passage item that contains the nursery rhyme 0.HumptyDumpty:

  • “Where did Humpty Dumpty sit before he fell?”
  • “What happened after he fell? Was he OK?”

When using embeddings, a good answer is an answer that returns a similarity score above the threshold defined by the corresponding threshold setting. See Semantic question matching, using Large Language Model Text Embeddings for more details on how to test and tune the threshold settings.
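
As a rough illustration of how such a threshold works, the following sketch scores a match by cosine similarity between embedding vectors and keeps it only if the score clears the threshold. The embedding values and the 0.8 threshold are placeholders, not QnABot's implementation or default settings.

import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder vectors; in QnABot these come from the configured embeddings model.
question_embedding = [0.12, 0.87, 0.33]
item_embedding = [0.10, 0.91, 0.30]

score_threshold = 0.8  # illustrative value only; tune per the guidance above

score = cosine_similarity(question_embedding, item_embedding)
print(f"similarity={score:.3f}, good answer: {score >= score_threshold}")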

If there are no good answers, or if the LLM’s response matches the regular expression defined in LLM_QA_NO_HITS_REGEX, then QnABot invokes the configurable Custom Don’t Know (no_hits) behavior, which, by default, returns a message saying “You stumped me.”

Try some experiments by creating Q&As or text passage items in QnABot, as well as using an Amazon Kendra index for fallback generative answers. Experiment (using the TEST tab in the designer) to find the best values to use for the embedding threshold settings to get the behavior you want. It’s hard to get the perfect balance, but see if you can find a good enough balance that results in useful answers most of the time.

Clean up

You can, of course, leave QnABot running to experiment with it and show it to your colleagues! But it does incur some cost—see Plan your deployment – Cost for more details. To remove the resources and avoid costs, delete the following CloudFormation stacks:

  • QnABot stack
  • LLM Plugin stack (if applicable)
  • Amazon Kendra index stack

Use case examples

These new features make QnABot relevant for many customer use cases, such as self-service customer service and support bots, and automated web-based Q&A bots. We discuss two such use cases in this section.

Integrate with a contact center

QnABot’s automated question answering capabilities deliver effective self-service for inbound voice calls in contact centers, with compelling outcomes. For example, see how Kentucky Transportation Cabinet reduced call hold time and improved customer experience with self-service virtual agents using Amazon Connect and Amazon Lex. Integrating the new generative AI features strengthens this value proposition further by dynamically generating reliable answers from existing content such as documents, knowledge bases, and websites. This eliminates the need for bot designers to anticipate and manually curate responses to every possible question that a user might ask. To integrate QnABot with Amazon Connect, see Connecting QnABot on AWS to an Amazon Connect call center. To integrate with other contact centers, see how Amazon Chime SDK can be used to connect Amazon Lex voice bots with 3rd party contact centers via SIPREC and Build an AI-powered virtual agent for Genesys Cloud using QnABot and Amazon Lex.

The LLM-powered QnABot can also play a pivotal role as an automated real-time agent assistant. In this solution, QnABot passively listens to the conversation and uses the LLM to generate real-time suggestions for the human agents based on certain cues. It’s straightforward to set up and try—give it a go! This solution can be utilized with both Amazon Connect and other on-prem and cloud contact centers. For more information, see Live call analytics and agent assist for your contact center with Amazon language AI services.

Integrate with a website

Embedding QnABot in your websites and applications allows users to get automated assistance with natural dialogue. For more information, see Deploy a Web UI for your Chatbot. For curated Q&A content, use markdown syntax and UI buttons and incorporate links, images, videos, and other dynamic elements that inform and delight your users. Integrate the QnABot Amazon Lex web UI with Amazon Connect live chat to facilitate quick escalation to human agents when the automated assistant cannot fully address a user’s inquiry on its own.

The QnABot on AWS plugin samples repository

As shown in this post, QnABot v5.4.0+ not only offers built-in support for embeddings and LLM models hosted on SageMaker, but it also offers the ability to easily integrate with any other LLM by using Lambda functions. You can author your own custom Lambda functions or get started faster with one of the samples we have provided in our new qnabot-on-aws-plugin-samples repository.

This repository includes a ready-to-deploy plugin for Amazon Bedrock, which supports both embeddings and text generation requests. At the time of writing, Amazon Bedrock is available through private preview—you can request preview access. When Amazon Bedrock is generally available, we expect to integrate it directly with QnABot, but why wait? Apply for preview access and use our sample plugin to start experimenting!

Today’s LLM innovation cycle is driving a breakneck pace of new model releases, each aiming to surpass the last. This repository will expand to include additional QnABot plugin samples over time. As of this writing, we have support for two third-party model providers: Anthropic and AI21. We plan to add integrations for more LLMs, embeddings, and potentially common use case examples involving Lambda hooks and knowledge bases. These plugins are offered as-is without warranty, for your convenience—users are responsible for supporting and maintaining them once deployed.

We hope that the QnABot plugins repository will mature into a thriving open-source community project. Watch the qnabot-on-aws-plugin-samples GitHub repo to receive updates on new plugins and features, use the Issues forum to report problems or provide feedback, and contribute improvements via pull requests. Contributions are welcome!

Conclusion

In this post, we introduced the new generative AI features for QnABot and walked through a solution to create, deploy, and customize QnABot to use these features. We also discussed some relevant use cases. Automating repetitive inquiries frees up human workers and boosts productivity. Rich responses create engaging experiences. Deploying the LLM-powered QnABot can help you elevate the self-service experience for customers and employees.

Don’t miss this opportunity—get started today and revolutionize the user experience on your QnABot deployment!


About the authors

Clevester Teo is a Senior Partner Solutions Architect at AWS, focused on the Public Sector partner ecosystem. He enjoys building prototypes, staying active outdoors, and experiencing new cuisines. Clevester is passionate about experimenting with emerging technologies and helping AWS partners innovate and better serve public sector customers.

Windrich is a Solutions Architect at AWS who works with customers in industries such as finance and transport, to help accelerate their cloud adoption journey. He is especially interested in Serverless technologies and how customers can leverage them to bring values to their business. Outside of work, Windrich enjoys playing and watching sports, as well as exploring different cuisines around the world.

Bob Strahan is a Principal Solutions Architect in the AWS Language AI Services team.

Read More

Automatically generate impressions from findings in radiology reports using generative AI on AWS

Automatically generate impressions from findings in radiology reports using generative AI on AWS

Radiology reports are comprehensive, lengthy documents that describe and interpret the results of a radiological imaging examination. In a typical workflow, the radiologist supervises, reads, and interprets the images, and then concisely summarizes the key findings. The summarization (or impression) is the most important part of the report because it helps clinicians and patients focus on the critical contents of the report that contain information for clinical decision-making. Creating a clear and impactful impression involves much more effort than simply restating the findings. The entire process is therefore laborious, time consuming, and prone to error. It often takes years of training for doctors to accumulate enough expertise in writing concise and informative radiology report summarizations, further highlighting the significance of automating the process. Additionally, automatic generation of report findings summarization is critical for radiology reporting. It enables translation of reports into human readable language, thereby alleviating the patients’ burden of reading through lengthy and obscure reports.

To solve this problem, we propose the use of generative AI, a type of AI that can create new content and ideas, including conversations, stories, images, videos, and music. Generative AI is powered by machine learning (ML) models—very large models that are pre-trained on vast amounts of data and commonly referred to as foundation models (FMs). Recent advancements in ML (specifically the invention of the transformer-based neural network architecture) have led to the rise of models that contain billions of parameters or variables. The proposed solution in this post uses fine-tuning of pre-trained large language models (LLMs) to help generate summarizations based on findings in radiology reports.

This post demonstrates a strategy for fine-tuning publicly available LLMs for the task of radiology report summarization using AWS services. LLMs have demonstrated remarkable capabilities in natural language understanding and generation, serving as foundation models that can be adapted to various domains and tasks. There are significant benefits to using a pre-trained model. It reduces computation costs, reduces carbon footprints, and allows you to use state-of-the-art models without having to train one from scratch.

Our solution uses the FLAN-T5 XL FM, using Amazon SageMaker JumpStart, which is an ML hub offering algorithms, models, and ML solutions. We demonstrate how to accomplish this using a notebook in Amazon SageMaker Studio. Fine-tuning a pre-trained model involves further training on specific data to improve performance on a different but related task. This solution involves fine-tuning the FLAN-T5 XL model, which is an enhanced version of T5 (Text-to-Text Transfer Transformer) general-purpose LLMs. T5 reframes natural language processing (NLP) tasks into a unified text-to-text format, in contrast to BERT-style models that can only output either a class label or a span of the input. It is fine-tuned for a summarization task on 91,544 free-text radiology reports obtained from the MIMIC-CXR dataset.

Overview of solution

In this section, we discuss the key components of our solution: choosing the strategy for the task, fine-tuning an LLM, and evaluating the results. We also illustrate the solution architecture and the steps to implement the solution.

Identify the strategy for the task

There are various strategies to approach the task of automating clinical report summarization. For example, we could use a specialized language model pre-trained on clinical reports from scratch. Alternatively, we could directly fine-tune a publicly available general-purpose language model to perform the clinical task. Using a fine-tuned domain-agnostic model may be necessary in settings where training a language model from scratch is too costly. In this solution, we demonstrate the latter approach of using a FLAN-T5 XL model, which we fine-tune for the clinical task of summarization of radiology reports. The following diagram illustrates the model workflow.

A typical radiology report is well-organized and succinct. Such reports often have three key sections:

  • Background – Provides general information about the demographics of the patient with essential information about the patient, clinical history, and relevant medical history and details of exam procedures
  • Findings – Presents detailed exam diagnosis and results
  • Impression – Concisely summarizes the most salient findings or interpretation of the findings with an assessment of significance and potential diagnosis based on the observed abnormalities

Using the findings section in the radiology reports, the solution generates the impression section, which corresponds to the doctors’ summarization. The following figure is an example of a radiology report.

Fine-tune a general-purpose LLM for a clinical task

In this solution, we fine-tune a FLAN-T5 XL model (tuning all the parameters of the model and optimizing them for the task). We fine-tune the model using the clinical domain dataset MIMIC-CXR, which is a publicly available dataset of chest radiographs. To fine-tune this model through SageMaker Jumpstart, labeled examples must be provided in the form of {prompt, completion} pairs. In this case, we use pairs of {Findings, Impression} from the original reports in MIMIC-CXR dataset. For inferencing, we use a prompt as shown in the following example:
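
The following is an illustrative reconstruction of such a prompt (the exact wording used in the notebook may differ):

findings = (
    "Heart size is normal. Lungs are clear without focal consolidation, "
    "pleural effusion, or pneumothorax."  # illustrative findings text
)
prompt = f"Summarize the following radiology report findings into an impression: {findings}"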

The model is fine-tuned on an accelerated computing ml.p3.16xlarge instance with 64 virtual CPUs and 488 GiB memory. For validation, 5% of the dataset was randomly selected. The elapsed time of the SageMaker training job with fine-tuning was 38,468 seconds (approximately 11 hours).

Evaluate the results

When the training is complete, it’s critical to evaluate the results. For a quantitative analysis of the generated impressions, we use ROUGE (Recall-Oriented Understudy for Gisting Evaluation), the most commonly used metric for evaluating summarization. This metric compares an automatically produced summary against a reference (human-produced) summary or a set of reference summaries. ROUGE1 refers to the overlap of unigrams (single words) between the candidate (the model’s output) and reference summaries. ROUGE2 refers to the overlap of bigrams (two words) between the candidate and reference summaries. ROUGEL is a sentence-level metric and refers to the longest common subsequence (LCS) between two pieces of text, ignoring newlines. ROUGELsum is a summary-level metric, for which newlines aren’t ignored but are interpreted as sentence boundaries; the LCS is computed between each pair of reference and candidate sentences, and then the union-LCS is computed. To aggregate these scores over a given set of reference and candidate summaries, the average is computed.
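
For reference, ROUGE scores of this kind can be computed with the open-source rouge-score package; the following is a minimal sketch with made-up example strings (this package and example are not part of the original notebook).

# pip install rouge-score
from rouge_score import rouge_scorer

reference = "No acute cardiopulmonary process."       # illustrative ground truth impression
candidate = "No acute cardiopulmonary abnormality."   # illustrative model output

scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rougeL", "rougeLsum"], use_stemmer=True
)
for name, result in scorer.score(reference, candidate).items():
    print(f"{name}: precision={result.precision:.3f}, "
          f"recall={result.recall:.3f}, f1={result.fmeasure:.3f}")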

Walkthrough and architecture

The overall solution architecture as shown in the following figure primarily consists of a model development environment that uses SageMaker Studio, model deployment with a SageMaker endpoint, and a reporting dashboard using Amazon QuickSight.

In the following sections, we demonstrate fine-tuning an LLM available on SageMaker JumpStart for summarization of a domain-specific task via the SageMaker Python SDK. In particular, we discuss the following topics:

  • Steps to set up the development environment
  • An overview of the radiology report datasets on which the model is fine-tuned and evaluated
  • A demonstration of fine-tuning the FLAN-T5 XL model using SageMaker JumpStart programmatically with the SageMaker Python SDK
  • Inferencing and evaluation of the pre-trained and fine-tuned models
  • Comparison of results from pre-trained model and fine-tuned models

The solution is available in the Generating Radiology Report Impression using generative AI with Large Language Model on AWS GitHub repo.

Prerequisites

To get started, you need an AWS account in which you can use SageMaker Studio. You will need to create a user profile for SageMaker Studio if you don’t already have one.

The training instance type used in this post is ml.p3.16xlarge. Note that the p3 instance type requires a service quota limit increase.

The MIMIC CXR dataset can be accessed through a data use agreement, which requires user registration and completion of a credentialing process.

Set up the development environment

To set up your development environment, you create an S3 bucket, configure a notebook, create endpoints and deploy the models, and create a QuickSight dashboard.

Create an S3 bucket

Create an S3 bucket called llm-radiology-bucket to host the training and evaluation datasets. This will also be used to store the model artifact during model development.
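
As a sketch, the bucket can be created with a few lines of Boto3 (bucket names must be globally unique, so you may need to adjust the name and Region):

import boto3

session = boto3.session.Session()
region = session.region_name
s3 = session.client("s3")

bucket_name = "llm-radiology-bucket"  # must be globally unique; adjust if taken
if region == "us-east-1":
    s3.create_bucket(Bucket=bucket_name)
else:
    s3.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={"LocationConstraint": region},
    )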

Configure a notebook

Complete the following steps:

  1. Launch SageMaker Studio from either the SageMaker console or the AWS Command Line Interface (AWS CLI).

For more information about onboarding to a domain, see Onboard to Amazon SageMaker Domain.

  2. Create a new SageMaker Studio notebook for cleaning the report data and fine-tuning the model. We use an ml.t3.medium 2vCPU+4GiB notebook instance with a Python 3 kernel.
  3. Within the notebook, install the relevant packages such as nest-asyncio, IPyWidgets (for interactive widgets for Jupyter notebook), and the SageMaker Python SDK:
!pip install nest-asyncio==1.5.5 --quiet 
!pip install ipywidgets==8.0.4 --quiet 
!pip install sagemaker==2.148.0 --quiet

Create endpoints and deploy the models for inference

For inferencing the pre-trained and fine-tuned models, create an endpoint and deploy each model in the notebook as follows:

  1. Create a model object from the Model class that can be deployed to an HTTPS endpoint.
  2. Create an HTTPS endpoint with the model object’s pre-built deploy() method:
from sagemaker import model_uris, script_uris
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.utils import name_from_base

# Retrieve the URI of the pre-trained model
pre_trained_model_uri = model_uris.retrieve(model_id=model_id, model_version=model_version, model_scope="inference")

large_model_env = {"SAGEMAKER_MODEL_SERVER_WORKERS": "1", "TS_DEFAULT_WORKERS_PER_MODEL": "1"}

pre_trained_name = name_from_base(f"jumpstart-demo-pre-trained-{model_id}")

# Create the SageMaker model instance of the pre-trained model
if ("small" in model_id) or ("base" in model_id):
    deploy_source_uri = script_uris.retrieve(
        model_id=model_id, model_version=model_version, script_scope="inference"
    )
    pre_trained_model = Model(
        image_uri=deploy_image_uri,
        source_dir=deploy_source_uri,
        entry_point="inference.py",
        model_data=pre_trained_model_uri,
        role=aws_role,
        predictor_cls=Predictor,
        name=pre_trained_name,
    )
else:
    # For those large models, we already repack the inference script and model
    # artifacts for you, so the `source_dir` argument to Model is not required.
    pre_trained_model = Model(
        image_uri=deploy_image_uri,
        model_data=pre_trained_model_uri,
        role=aws_role,
        predictor_cls=Predictor,
        name=pre_trained_name,
        env=large_model_env,
    )

# Deploy the pre-trained model. Note that we need to pass Predictor class when we deploy model
# through Model class, for being able to run inference through the SageMaker API
pre_trained_predictor = pre_trained_model.deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    predictor_cls=Predictor,
    endpoint_name=pre_trained_name,
)

Create a QuickSight dashboard

Create a QuickSight dashboard with an Athena data source with inference results in Amazon Simple Storage Service (Amazon S3) to compare the inference results with the ground truth. The following screenshot shows our example dashboard.

Radiology report datasets

All the model parameters are fine-tuned on 91,544 reports downloaded from the MIMIC-CXR v2.0 dataset. Because we used only the radiology report text data, we downloaded just one compressed report file (mimic-cxr-reports.zip) from the MIMIC-CXR website. We then evaluate the fine-tuned model on 2,000 reports (referred to as the dev1 dataset) from a separate held-out subset of this dataset. We use another 2,000 radiology reports (referred to as dev2) from the chest X-ray collection of the Indiana University hospital network to evaluate the fine-tuned model. All the datasets are read as JSON files and uploaded to the newly created S3 bucket llm-radiology-bucket. Note that by default none of the datasets contain any Protected Health Information (PHI); all sensitive information is replaced with three consecutive underscores (___) by the providers.
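
As an illustration of this preparation step, the following sketch writes {prompt, completion} pairs as a JSON Lines file and uploads it to the bucket. The field names, file name, and S3 key are assumptions for illustration; the exact format expected by the fine-tuning script is defined in the accompanying GitHub repo.

import json
import boto3

# Illustrative records; in practice these are built from the Findings and
# Impression sections of the MIMIC-CXR and Indiana University reports.
records = [
    {"prompt": "Summarize the following radiology report findings into an impression: <findings text>",
     "completion": "<impression text>"},
]

with open("train.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

boto3.client("s3").upload_file(
    "train.jsonl", "llm-radiology-bucket", "training/train.jsonl"  # illustrative S3 key
)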

Fine-tune with the SageMaker Python SDK

For fine-tuning, the model_id is specified as huggingface-text2text-flan-t5-xl from the list of SageMaker JumpStart models. The training_instance_type is set as ml.p3.16xlarge and the inference_instance_type as ml.g5.2xlarge. The training data in JSON format is read from the S3 bucket. The next step is to use the selected model_id to extract the SageMaker JumpStart resource URIs, including image_uri (the Amazon Elastic Container Registry (Amazon ECR) URI for the Docker image), model_uri (the pre-trained model artifact Amazon S3 URI), and script_uri (the training script):

from sagemaker import image_uris, model_uris, script_uris

# Training instance will use this image
train_image_uri = image_uris.retrieve(
    region=aws_region,
    framework=None,  # automatically inferred from model_id
    model_id=model_id,
    model_version=model_version,
    image_scope="training",
    instance_type=training_instance_type,
)

# Pre-trained model
train_model_uri = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="training"
)

# Script to execute on the training instance
train_script_uri = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="training"
)

output_location = f"s3://{output_bucket}/demo-llm-rad-fine-tune-flan-t5/"

Also, an output location is set up as a folder within the S3 bucket.

Only one hyperparameter, epochs, is changed to 3; the rest are left at their default values:

from sagemaker import hyperparameters

# Retrieve the default hyper-parameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(model_id=model_id, model_version=model_version)

# We will override some default hyperparameters with custom values
hyperparameters["epochs"] = "3"
print(hyperparameters)

The training metrics such as eval_loss (for validation loss), loss (for training loss), and epoch to be tracked are defined and listed:

from sagemaker.estimator import Estimator
from sagemaker.utils import name_from_base

model_name = "-".join(model_id.split("-")[2:])  # get the most informative part of ID
training_job_name = name_from_base(f"js-demo-{model_name}-{hyperparameters['epochs']}")
print(f"{bold}job name:{unbold} {training_job_name}")

training_metric_definitions = [
    {"Name": "val_loss", "Regex": "'eval_loss': ([0-9\.]+)"},
    {"Name": "train_loss", "Regex": "'loss': ([0-9\.]+)"},
    {"Name": "epoch", "Regex": "'epoch': ([0-9\.]+)"},
]

We use the SageMaker JumpStart resource URIs (image_uri, model_uri, script_uri) identified earlier to create an estimator and fine-tune it on the training dataset by specifying the S3 path of the dataset. The Estimator class requires an entry_point parameter. In this case, JumpStart uses transfer_learning.py. The training job fails to run if this value is not set.

# Create SageMaker Estimator instance
sm_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    model_uri=train_model_uri,
    source_dir=train_script_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    volume_size=300,
    max_run=360000,
    hyperparameters=hyperparameters,
    output_path=output_location,
    metric_definitions=training_metric_definitions,
)

# Launch a SageMaker training job over data located in the given S3 path
# Training jobs can take hours, it is recommended to set wait=False,
# and monitor job status through SageMaker console
sm_estimator.fit({"training": train_data_location}, job_name=training_job_name, wait=True)

This training job can take hours to complete; therefore, it’s recommended to set the wait parameter to False and monitor the training job status on the SageMaker console. Use the TrainingJobAnalytics function to keep track of the training metrics at various timestamps:

from sagemaker import TrainingJobAnalytics

# Wait for a couple of minutes for the job to start before running this cell
# This can be called while the job is still running
df = TrainingJobAnalytics(training_job_name=training_job_name).dataframe()

Deploy inference endpoints

In order to draw comparisons, we deploy inference endpoints for both the pre-trained and fine-tuned models.

First, retrieve the inference Docker image URI using model_id, and use this URI to create a SageMaker model instance of the pre-trained model. Deploy the pre-trained model by creating an HTTPS endpoint with the model object’s pre-built deploy() method. In order to run inference through the SageMaker API, make sure to pass the Predictor class.

from sagemaker import image_uris
# Retrieve the inference docker image URI. This is the base HuggingFace container image
deploy_image_uri = image_uris.retrieve(
    region=aws_region,
    framework=None,  # automatically inferred from model_id
    model_id=model_id,
    model_version=model_version,
    image_scope="inference",
    instance_type=inference_instance_type,
)

# Retrieve the URI of the pre-trained model
pre_trained_model_uri = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="inference"
)

pre_trained_model = Model(
    image_uri=deploy_image_uri,
    model_data=pre_trained_model_uri,
    role=aws_role,
    predictor_cls=Predictor,
    name=pre_trained_name,
    env=large_model_env,
)

# Deploy the pre-trained model. Note that we need to pass Predictor class when we deploy model
# through Model class, for being able to run inference through the SageMaker API
pre_trained_predictor = pre_trained_model.deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    predictor_cls=Predictor,
    endpoint_name=pre_trained_name,
)

Repeat the preceding step to create a SageMaker model instance of the fine-tuned model and create an endpoint to deploy the model.
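
A sketch of that repetition, reusing the imports and variables from the preceding snippets and taking the artifact location from the fine-tuning Estimator, might look like the following (names are illustrative):

# Locate the fine-tuned model artifact produced by the SageMaker training job
fine_tuned_model_uri = sm_estimator.model_data  # s3://.../output/model.tar.gz

fine_tuned_name = name_from_base(f"jumpstart-demo-fine-tuned-{model_id}")

fine_tuned_model = Model(
    image_uri=deploy_image_uri,
    model_data=fine_tuned_model_uri,
    role=aws_role,
    predictor_cls=Predictor,
    name=fine_tuned_name,
    env=large_model_env,
)

fine_tuned_predictor = fine_tuned_model.deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    predictor_cls=Predictor,
    endpoint_name=fine_tuned_name,
)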

Evaluate the models

First, set the length of summarized text, number of model outputs (should be greater than 1 if multiple summaries need to be generated), and number of beams for beam search.

Construct the inference request as a JSON payload and use it to query the endpoints for the pre-trained and fine-tuned models.
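
A request of this shape might look like the following sketch. The parameter names (text_inputs, max_length, num_return_sequences, num_beams) follow common JumpStart text2text examples and should be checked against the accompanying notebook.

import json

prompt = "Summarize the following radiology report findings into an impression: <findings text>"

payload = {
    "text_inputs": prompt,
    "max_length": 50,           # length of the summarized text
    "num_return_sequences": 1,  # number of model outputs
    "num_beams": 3,             # number of beams for beam search
}

response = fine_tuned_predictor.predict(
    json.dumps(payload).encode("utf-8"),
    {"ContentType": "application/json", "Accept": "application/json"},
)
print(json.loads(response))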

Compute the aggregated ROUGE scores (ROUGE1, ROUGE2, ROUGEL, ROUGELsum) as described earlier.

Compare the results

The following table depicts the evaluation results for the dev1 and dev2 datasets. The evaluation result on dev1 (2,000 findings from the MIMIC CXR Radiology Report) shows approximately 38 percentage points improvement in the aggregated average ROUGE1 and ROUGE2 scores compared to the pre-trained model. For dev2, an improvement of 31 percentage points and 25 percentage points is observed in ROUGE1 and ROUGE2 scores. Overall, fine-tuning led to an improvement of 38.2 percentage points and 31.3 percentage points in ROUGELsum scores for the dev1 and dev2 datasets, respectively.

Evaluation dataset    Pre-trained model                           Fine-tuned model
                      ROUGE1   ROUGE2   ROUGEL   ROUGELsum        ROUGE1   ROUGE2   ROUGEL   ROUGELsum
dev1                  0.2239   0.1134   0.1891   0.1891           0.6040   0.4800   0.5705   0.5708
dev2                  0.1583   0.0599   0.1391   0.1393           0.4660   0.3125   0.4525   0.4525

The following box plots depict the distribution of ROUGE scores for the dev1 and dev2 datasets evaluated using the fine-tuned model.

(a) dev1, (b) dev2

The following table shows that ROUGE scores for the evaluation datasets have approximately the same median and mean and therefore are symmetrically distributed.

Dataset   Score       Count     Mean     Std Deviation   Minimum   25% percentile   50% percentile   75% percentile   Maximum
dev1      ROUGE1      2000.00   0.6038   0.3065          0.0000    0.3653           0.6000           0.9384           1.0000
          ROUGE2      2000.00   0.4798   0.3578          0.0000    0.1818           0.4000           0.8571           1.0000
          ROUGEL      2000.00   0.5706   0.3194          0.0000    0.3000           0.5345           0.9101           1.0000
          ROUGELsum   2000.00   0.5706   0.3194          0.0000    0.3000           0.5345           0.9101           1.0000
dev2      ROUGE1      2000.00   0.4659   0.2525          0.0000    0.2500           0.5000           0.7500           1.0000
          ROUGE2      2000.00   0.3123   0.2645          0.0000    0.0664           0.2857           0.5610           1.0000
          ROUGEL      2000.00   0.4529   0.2554          0.0000    0.2349           0.4615           0.7500           1.0000
          ROUGELsum   2000.00   0.4529   0.2554          0.0000    0.2349           0.4615           0.7500           1.0000

Clean up

To avoid incurring future charges, delete the resources you created with the following code:

# Delete resources
pre_trained_predictor.delete_model()
pre_trained_predictor.delete_endpoint()
fine_tuned_predictor.delete_model()
fine_tuned_predictor.delete_endpoint()

Conclusion

In this post, we demonstrated how to fine-tune a FLAN-T5 XL model for a clinical domain-specific summarization task using SageMaker Studio. To increase confidence in the results, we compared the predictions with ground truth and evaluated them using ROUGE metrics. We demonstrated that a model fine-tuned for a specific task returns better results than a model pre-trained on a generic NLP task. We would also like to point out that fine-tuning a general-purpose LLM avoids the cost of pre-training a model from scratch.

Although the work presented here focuses on chest X-ray reports, it has the potential to be expanded to bigger datasets with varied anatomies and modalities, such as MRI and CT, for which radiology reports might be more complex with multiple findings. In such cases, radiologists could generate impressions in order of criticality and include follow-up recommendations. Furthermore, setting up a feedback loop for this application would enable radiologists to improve the performance of the model over time.

As we showed in this post, the fine-tuned model generates impressions for radiology reports with high ROUGE scores. You can try to fine-tune LLMs on other domain-specific medical reports from different departments.


About the authors

Dr. Adewale Akinfaderin is a Senior Data Scientist in Healthcare and Life Sciences at AWS. His expertise is in reproducible and end-to-end AI/ML methods, practical implementations, and helping global healthcare customers formulate and develop scalable solutions to interdisciplinary problems. He has two graduate degrees in Physics and a Doctorate degree in Engineering.

Priya Padate is a Senior Partner Solutions Architect with extensive expertise in Healthcare and Life Sciences at AWS. Priya drives go-to-market strategies with partners and drives solution development to accelerate AI/ML-based development. She is passionate about using technology to transform the healthcare industry to drive better patient care outcomes.

Ekta Walia Bhullar, PhD, is a senior AI/ML consultant with AWS Healthcare and Life Sciences (HCLS) professional services business unit. She has extensive experience in the application of AI/ML within the healthcare domain, especially in radiology. Outside of work, when not discussing AI in radiology, she likes to run and hike.

Read More

MLOps for batch inference with model monitoring and retraining using Amazon SageMaker, HashiCorp Terraform, and GitLab CI/CD

MLOps for batch inference with model monitoring and retraining using Amazon SageMaker, HashiCorp Terraform, and GitLab CI/CD

Maintaining machine learning (ML) workflows in production is a challenging task because it requires creating continuous integration and continuous delivery (CI/CD) pipelines for ML code and models, model versioning, monitoring for data and concept drift, model retraining, and a manual approval process to ensure new versions of the model satisfy both performance and compliance requirements.

In this post, we describe how to create an MLOps workflow for batch inference that automates job scheduling, model monitoring, retraining, and registration, as well as error handling and notification by using Amazon SageMaker, Amazon EventBridge, AWS Lambda, Amazon Simple Notification Service (Amazon SNS), HashiCorp Terraform, and GitLab CI/CD. The presented MLOps workflow provides a reusable template for managing the ML lifecycle through automation, monitoring, auditability, and scalability, thereby reducing the complexities and costs of maintaining batch inference workloads in production.

Solution overview

The following figure illustrates the proposed target MLOps architecture for enterprise batch inference for organizations who use GitLab CI/CD and Terraform infrastructure as code (IaC) in conjunction with AWS tools and services. GitLab CI/CD serves as the macro-orchestrator, orchestrating model build and model deploy pipelines, which include sourcing, building, and provisioning Amazon SageMaker Pipelines and supporting resources using the SageMaker Python SDK and Terraform. SageMaker Python SDK is used to create or update SageMaker pipelines for training, training with hyperparameter optimization (HPO), and batch inference. Terraform is used to create additional resources such as EventBridge rules, Lambda functions, and SNS topics for monitoring SageMaker pipelines and sending notifications (for example, when a pipeline step fails or succeeds). SageMaker Pipelines serves as the orchestrator for ML model training and inference workflows.

This architecture design represents a multi-account strategy where ML models are built, trained, and registered in a central model registry within a data science development account (which has more controls than a typical application development account). Then, inference pipelines are deployed to staging and production accounts using automation from DevOps tools such as GitLab CI/CD. The central model registry could optionally be placed in a shared services account as well. Refer to Operating model for best practices regarding a multi-account strategy for ML.

In the following subsections, we discuss different aspects of the architecture design in detail.

Infrastructure as code

IaC offers a way to manage IT infrastructure through machine-readable files, ensuring efficient version control. In this post and the accompanying code sample, we demonstrate how to use HashiCorp Terraform with GitLab CI/CD to manage AWS resources effectively. This approach underscores the key benefit of IaC, offering a transparent and repeatable process in IT infrastructure management.

Model training and retraining

In this design, the SageMaker training pipeline runs on a schedule (via EventBridge) or based on an Amazon Simple Storage Service (Amazon S3) event trigger (for example, when a trigger file or, in the case of a single training data object, new training data is placed in Amazon S3) to regularly recalibrate the model with new data. This pipeline does not introduce structural or material changes to the model because it uses fixed hyperparameters that have been approved during the enterprise model review process.

The training pipeline registers the newly trained model version in the Amazon SageMaker Model Registry if the model exceeds a predefined model performance threshold (for example, RMSE for regression and F1 score for classification). When a new version of the model is registered in the model registry, it triggers a notification to the responsible data scientist via Amazon SNS. The data scientist then needs to review and manually approve the latest version of the model in the Amazon SageMaker Studio UI or via an API call using the AWS Command Line Interface (AWS CLI) or AWS SDK for Python (Boto3) before the new version of model can be utilized for inference.
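
As an illustration of how such a registration gate can be expressed with the SageMaker Python SDK, the following sketch wraps the model registration step in a ConditionStep that only runs when the evaluated RMSE stays below a threshold. The evaluation_step, evaluation_report, and register_step objects and the threshold value are placeholders defined elsewhere in a pipeline, not the exact code from the sample repository.

from sagemaker.workflow.conditions import ConditionLessThanOrEqualTo
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.functions import JsonGet

# Placeholders: 'evaluation_step' writes an evaluation report (PropertyFile
# 'evaluation_report'), and 'register_step' registers the model version.
rmse_condition = ConditionLessThanOrEqualTo(
    left=JsonGet(
        step_name=evaluation_step.name,
        property_file=evaluation_report,
        json_path="regression_metrics.rmse.value",
    ),
    right=10.0,  # illustrative RMSE threshold
)

condition_step = ConditionStep(
    name="CheckModelQuality",
    conditions=[rmse_condition],
    if_steps=[register_step],  # register only when the threshold is met
    else_steps=[],
)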

The SageMaker training pipeline and its supporting resources are created by the GitLab model build pipeline, either via a manual run of the GitLab pipeline or automatically when code is merged into the main branch of the model build Git repository.

Batch inference

The SageMaker batch inference pipeline runs on a schedule (via EventBridge) or based on an S3 event trigger as well. The batch inference pipeline automatically pulls the latest approved version of the model from the model registry and uses it for inference. The batch inference pipeline includes steps for checking data quality against a baseline created by the training pipeline, as well as model quality (model performance) if ground truth labels are available.
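
Retrieving the latest approved model version is a short Boto3 call, sketched as follows (the model package group name is a placeholder; in the sample code it comes from the Terraform variables):

import boto3

sm_client = boto3.client("sagemaker")

response = sm_client.list_model_packages(
    ModelPackageGroupName="my-model-package-group",  # placeholder name
    ModelApprovalStatus="Approved",
    SortBy="CreationTime",
    SortOrder="Descending",
    MaxResults=1,
)
latest_approved_arn = response["ModelPackageSummaryList"][0]["ModelPackageArn"]
print(latest_approved_arn)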

If the batch inference pipeline discovers data quality issues, it will notify the responsible data scientist via Amazon SNS. If it discovers model quality issues (for example, RMSE is greater than a pre-specified threshold), the pipeline step for the model quality check will fail, which will in turn trigger an EventBridge event to start the training with HPO pipeline.

The SageMaker batch inference pipeline and its supporting resources are created by the GitLab model deploy pipeline, either via a manual run of the GitLab pipeline or automatically when code is merged into the main branch of the model deploy Git repository.

Model tuning and retuning

The SageMaker training with HPO pipeline is triggered when the model quality check step of the batch inference pipeline fails. The model quality check is performed by comparing model predictions with the actual ground truth labels. If the model quality metric (for example, RMSE for regression and F1 score for classification) doesn’t meet a pre-specified criterion, the model quality check step is marked as failed. The SageMaker training with HPO pipeline can also be triggered manually (in the SageMaker Studio UI or via an API call using the AWS CLI or SageMaker Python SDK) by the responsible data scientist if needed. Because the model hyperparameters are changing, the responsible data scientist needs to obtain approval from the enterprise model review board before the new model version can be approved in the model registry.

The SageMaker training with HPO pipeline and its supporting resources are created by the GitLab model build pipeline, either via a manual run of the GitLab pipeline or automatically when code is merged into the main branch of the model build Git repository.

Model monitoring

Data statistics and constraints baselines are generated as part of the training and training with HPO pipelines. They are saved to Amazon S3 and also registered with the trained model in the model registry if the model passes evaluation. The proposed architecture for the batch inference pipeline uses Amazon SageMaker Model Monitor for data quality checks, while using custom Amazon SageMaker Processing steps for the model quality check. This design decouples data and model quality checks, which in turn allows you to send only a warning notification when data drift is detected, and to trigger the training with HPO pipeline when a model quality violation is detected.

Model approval

After a newly trained model is registered in the model registry, the responsible data scientist receives a notification. If the model has been trained by the training pipeline (recalibration with new training data while hyperparameters are fixed), there is no need for approval from the enterprise model review board. The data scientist can review and approve the new version of the model independently. On the other hand, if the model has been trained by the training with HPO pipeline (retuning by changing hyperparameters), the new model version needs to go through the enterprise review process before it can be used for inference in production. When the review process is complete, the data scientist can proceed and approve the new version of the model in the model registry. Changing the status of the model package to Approved will trigger a Lambda function via EventBridge, which will in turn trigger the GitLab model deploy pipeline via an API call. This will automatically update the SageMaker batch inference pipeline to utilize the latest approved version of the model for inference.

There are two main ways to approve or reject a new model version in the model registry: using the AWS SDK for Python (Boto3) or from the SageMaker Studio UI. By default, both the training pipeline and training with HPO pipeline set ModelApprovalStatus to PendingManualApproval. The responsible data scientist can update the approval status for the model by calling the update_model_package API from Boto3. Refer to Update the Approval Status of a Model for details about updating the approval status of a model via the SageMaker Studio UI.
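
For the Boto3 route, the approval can be as simple as the following sketch (the model package ARN is a placeholder):

import boto3

sm_client = boto3.client("sagemaker")

sm_client.update_model_package(
    ModelPackageArn="arn:aws:sagemaker:us-east-1:111122223333:model-package/my-model-package-group/1",  # placeholder
    ModelApprovalStatus="Approved",  # or "Rejected"
    ApprovalDescription="Reviewed and approved by the responsible data scientist",
)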

Data I/O design

SageMaker interacts directly with Amazon S3 for reading inputs and storing outputs of individual steps in the training and inference pipelines. The following diagram illustrates how different Python scripts, raw and processed training data, raw and processed inference data, inference results and ground truth labels (if available for model quality monitoring), model artifacts, training and inference evaluation metrics (model quality monitoring), as well as data quality baselines and violation reports (for data quality monitoring) can be organized within an S3 bucket. The direction of arrows in the diagram indicates which files are inputs or outputs from their respective steps in the SageMaker pipelines. Arrows have been color-coded based on pipeline step type to make them easier to read. The pipeline will automatically upload Python scripts from the GitLab repository and store output files or model artifacts from each step in the appropriate S3 path.

The data engineer is responsible for the following:

  • Uploading labeled training data to the appropriate path in Amazon S3. This includes adding new training data regularly to ensure the training pipeline and training with HPO pipeline have access to recent training data for model retraining and retuning, respectively.
  • Uploading input data for inference to the appropriate path in the S3 bucket before a planned run of the inference pipeline.
  • Uploading ground truth labels to the appropriate S3 path for model quality monitoring.

The data scientist is responsible for the following:

  • Preparing ground truth labels and providing them to the data engineering team for uploading to Amazon S3.
  • Taking the model versions trained by the training with HPO pipeline through the enterprise review process and obtaining necessary approvals.
  • Manually approving or rejecting newly trained model versions in the model registry.
  • Approving the production gate for the inference pipeline and supporting resources to be promoted to production.

Sample code

In this section, we present a sample code for batch inference operations with a single-account setup as shown in the following architecture diagram. The sample code can be found in the GitHub repository, and can serve as a starting point for batch inference with model monitoring and automatic retraining using quality gates often required for enterprises. The sample code differs from the target architecture in the following ways:

  • It uses a single AWS account for building and deploying the ML model and supporting resources. Refer to Organizing Your AWS Environment Using Multiple Accounts for guidance on multi-account setup on AWS.
  • It uses a single GitLab CI/CD pipeline for building and deploying the ML model and supporting resources.
  • When a new version of the model is trained and approved, the GitLab CI/CD pipeline is not triggered automatically and needs to be run manually by the responsible data scientist to update the SageMaker batch inference pipeline with the latest approved version of the model.
  • It only supports S3 event-based triggers for running the SageMaker training and inference pipelines.

Prerequisites

You should have the following prerequisites before deploying this solution:

  • An AWS account
  • SageMaker Studio
  • A SageMaker execution role with Amazon S3 read/write and AWS Key Management Service (AWS KMS) encrypt/decrypt permissions
  • An S3 bucket for storing data, scripts, and model artifacts
  • Terraform version 0.13.5 or greater
  • GitLab with a working Docker runner for running the pipelines
  • The AWS CLI
  • jq
  • unzip
  • Python3 (Python 3.7 or greater) and the following Python packages:
    • boto3
    • sagemaker
    • pandas
    • pyyaml

Repository structure

The GitHub repository contains the following directories and files:

  • /code/lambda_function/ – This directory contains the Python file for a Lambda function that prepares and sends notification messages (via Amazon SNS) about the SageMaker pipelines’ step state changes
  • /data/ – This directory includes the raw data files (training, inference, and ground truth data)
  • /env_files/ – This directory contains the Terraform input variables file
  • /pipeline_scripts/ – This directory contains three Python scripts for creating and updating training, inference, and training with HPO SageMaker pipelines, as well as configuration files for specifying each pipeline’s parameters
  • /scripts/ – This directory contains additional Python scripts (such as preprocessing and evaluation) that are referenced by the training, inference, and training with HPO pipelines
  • .gitlab-ci.yml – This file specifies the GitLab CI/CD pipeline configuration
  • /events.tf – This file defines EventBridge resources
  • /lambda.tf – This file defines the Lambda notification function and the associated AWS Identity and Access Management (IAM) resources
  • /main.tf – This file defines Terraform data sources and local variables
  • /sns.tf – This file defines Amazon SNS resources
  • /tags.json – This JSON file allows you to declare custom tag key-value pairs and append them to your Terraform resources using a local variable
  • /variables.tf – This file declares all the Terraform variables

Variables and configuration

The following table shows the variables that are used to parameterize this solution. Refer to the ./env_files/dev_env.tfvars file for more details.

Name Description
bucket_name S3 bucket that is used to store data, scripts, and model artifacts
bucket_prefix S3 prefix for the ML project
bucket_train_prefix S3 prefix for training data
bucket_inf_prefix S3 prefix for inference data
notification_function_name Name of the Lambda function that prepares and sends notification messages about SageMaker pipelines’ step state changes
custom_notification_config The configuration for customizing notification message for specific SageMaker pipeline steps when a specific pipeline run status is detected
email_recipient The email address list for receiving SageMaker pipelines’ step state change notifications
pipeline_inf Name of the SageMaker inference pipeline
pipeline_train Name of the SageMaker training pipeline
pipeline_trainwhpo Name of SageMaker training with HPO pipeline
recreate_pipelines If set to true, the three existing SageMaker pipelines (training, inference, training with HPO) will be deleted and new ones will be created when GitLab CI/CD is run
model_package_group_name Name of the model package group
accuracy_mse_threshold Maximum value of MSE before requiring an update to the model
role_arn IAM role ARN of the SageMaker pipeline execution role
kms_key KMS key ARN for Amazon S3 and SageMaker encryption
subnet_id Subnet ID for SageMaker networking configuration
sg_id Security group ID for SageMaker networking configuration
upload_training_data If set to true, training data will be uploaded to Amazon S3, and this upload operation will trigger the run of the training pipeline
upload_inference_data If set to true, inference data will be uploaded to Amazon S3, and this upload operation will trigger the run of the inference pipeline
user_id The employee ID of the SageMaker user that is added as a tag to SageMaker resources

Deploy the solution

Complete the following steps to deploy the solution in your AWS account:

  1. Clone the GitHub repository into your working directory.
  2. Review and modify the GitLab CI/CD pipeline configuration to suit your environment. The configuration is specified in the .gitlab-ci.yml file.
  3. Refer to the README file to update the general solution variables in the ./env_files/dev_env.tfvars file. This file contains variables for both Python scripts and Terraform automation.
    1. Check the additional SageMaker Pipelines parameters that are defined in the YAML files under ./batch_scoring_pipeline/pipeline_scripts/. Review and update the parameters if necessary.
  4. Review the SageMaker pipeline creation scripts in ./pipeline_scripts/ as well as the scripts that are referenced by them in the ./scripts/ folder. The example scripts provided in the GitHub repo are based on the Abalone dataset. If you are going to use a different dataset, ensure you update the scripts to suit your particular problem.
  5. Put your data files into the ./data/ folder using the following naming convention. If you are using the Abalone dataset with the provided example scripts, make sure the data files are headerless, the training data includes both independent and target variables (with the original column order preserved), the inference data includes only independent variables, and the ground truth file includes only the target variable.
    1. training-data.csv
    2. inference-data.csv
    3. ground-truth.csv
  6. Commit and push the code to the repository to trigger the first GitLab CI/CD pipeline run. Note that this first run will fail at the pipeline stage because there’s no approved model version yet for the inference pipeline script to use. Review the step log and verify that a new SageMaker pipeline named TrainingPipeline has been successfully created.

    1. Open the SageMaker Studio UI, then review and run the training pipeline (a programmatic alternative is sketched after this list).
    2. After the successful run of the training pipeline, approve the registered model version in the model registry, then rerun the entire GitLab CI/CD pipeline.
  7. Review the Terraform plan output in the build stage. Approve the manual apply stage in the GitLab CI/CD pipeline to resume the pipeline run and authorize Terraform to create the monitoring and notification resources in your AWS account.
  8. Finally, review the SageMaker pipelines’ run status and output in the SageMaker Studio UI and check your email for notification messages, as shown in the following screenshot. The default message body is in JSON format.
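In step 6, the training pipeline is run from the SageMaker Studio UI. If you prefer a programmatic alternative, you can start a pipeline run with the SageMaker API; the following boto3 sketch assumes the pipeline name TrainingPipeline created by the first CI/CD run:

    import boto3

    sm = boto3.client("sagemaker")

    # Start a run of the training pipeline created by the first GitLab CI/CD run.
    # The pipeline name must match the value configured for your deployment.
    response = sm.start_pipeline_execution(
        PipelineName="TrainingPipeline",
        PipelineExecutionDisplayName="manual-training-run",
    )

    # Check the run status (Executing, Succeeded, Failed, or Stopped).
    status = sm.describe_pipeline_execution(
        PipelineExecutionArn=response["PipelineExecutionArn"]
    )["PipelineExecutionStatus"]
    print(status)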

SageMaker pipelines

In this section, we describe the three SageMaker pipelines within the MLOps workflow.

Training pipeline

The training pipeline is composed of the following steps:

  • Preprocessing step, including feature transformation and encoding
  • Data quality check step for generating data statistics and constraints baseline using the training data
  • Training step
  • Training evaluation step
  • Condition step to check whether the trained model meets a pre-specified performance threshold
  • Model registration step to register the newly trained model in the model registry if the trained model meets the required performance threshold

Both the skip_check_data_quality and register_new_baseline_data_quality parameters are set to True in the training pipeline. These parameters instruct the pipeline to skip the data quality check and just create and register new data statistics or constraints baselines using the training data. The following figure depicts a successful run of the training pipeline.
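In the SageMaker Python SDK, a data quality check step like this one is built with QualityCheckStep, whose skip_check and register_new_baseline arguments play the role of the pipeline parameters described above. The following is a minimal sketch under assumed names for the role ARN, S3 URIs, and model package group; it is not the repository’s exact code:

    from sagemaker.model_monitor import DatasetFormat
    from sagemaker.workflow.check_job_config import CheckJobConfig
    from sagemaker.workflow.parameters import ParameterBoolean
    from sagemaker.workflow.quality_check_step import (
        DataQualityCheckConfig,
        QualityCheckStep,
    )

    # Pipeline parameters mirroring skip_check_data_quality and
    # register_new_baseline_data_quality described in the text.
    skip_check_data_quality = ParameterBoolean(
        name="SkipCheckDataQuality", default_value=True
    )
    register_new_baseline_data_quality = ParameterBoolean(
        name="RegisterNewBaselineDataQuality", default_value=True
    )

    check_job_config = CheckJobConfig(
        role="arn:aws:iam::111122223333:role/sagemaker-pipeline-execution",  # assumed role ARN
        instance_count=1,
        instance_type="ml.m5.xlarge",
    )

    data_quality_check_config = DataQualityCheckConfig(
        baseline_dataset="s3://my-ml-bucket/train/preprocessed.csv",  # assumed S3 URI
        dataset_format=DatasetFormat.csv(header=False),
        output_s3_uri="s3://my-ml-bucket/data-quality",  # assumed S3 URI
    )

    data_quality_check_step = QualityCheckStep(
        name="DataQualityCheck",
        check_job_config=check_job_config,
        quality_check_config=data_quality_check_config,
        skip_check=skip_check_data_quality,
        register_new_baseline=register_new_baseline_data_quality,
        model_package_group_name="my-model-package-group",  # assumed group name
    )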

Batch inference pipeline

The batch inference pipeline is composed of the following steps:

  • Creating a model from the latest approved model version in the model registry
  • Preprocessing step, including feature transformation and encoding
  • Batch inference step
  • Data quality check preprocessing step, which creates a new CSV file containing both input data and model predictions to be used for the data quality check
  • Data quality check step, which checks the input data against baseline statistics and constraints associated with the registered model
  • Condition step to check whether ground truth data is available; if it is, the model quality check step is performed
  • Model quality calculation step, which calculates model performance based on ground truth labels

Both the skip_check_data_quality and register_new_baseline_data_quality parameters are set to False in the inference pipeline. These parameters instruct the pipeline to perform a data quality check using the data statistics or constraints baseline associated with the registered model (supplied_baseline_statistics_data_quality and supplied_baseline_constraints_data_quality) and skip creating or registering new data statistics and constraints baselines during inference. The following figure illustrates a run of the batch inference pipeline where the data quality check step has failed due to poor performance of the model on the inference data. In this particular case, the training with HPO pipeline will be triggered automatically to fine-tune the model.
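The automatic hand-off from a failed check to the training with HPO pipeline is wired up by the EventBridge and Lambda resources that Terraform provisions (events.tf and lambda.tf). As a rough illustration of the mechanism only (the rule name and the fields matched on are assumptions), an EventBridge rule can watch for failed pipeline steps like this:

    import json

    import boto3

    events = boto3.client("events")

    # Match failed steps of the batch inference pipeline. The detail-type below is the
    # one SageMaker Pipelines emits for step status changes; the rule name and the
    # matched fields are illustrative assumptions.
    event_pattern = {
        "source": ["aws.sagemaker"],
        "detail-type": ["SageMaker Model Building Pipeline Execution Step Status Change"],
        "detail": {
            "currentStepStatus": ["Failed"],
        },
    }

    events.put_rule(
        Name="inference-quality-check-failed",  # assumed rule name
        EventPattern=json.dumps(event_pattern),
        State="ENABLED",
    )

A Lambda target attached to such a rule can then inspect the event and call start_pipeline_execution on the training with HPO pipeline.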

Training with HPO pipeline

The training with HPO pipeline is composed of the following steps:

  • Preprocessing step (feature transformation and encoding)
  • Data quality check step for generating data statistics and constraints baseline using the training data
  • Hyperparameter tuning step
  • Training evaluation step
  • Condition step to check whether the trained model meets a pre-specified accuracy threshold
  • Model registration step if the best trained model meets the required accuracy threshold

Both the skip_check_data_quality and register_new_baseline_data_quality parameters are set to True in the training with HPO pipeline. The following figure depicts a successful run of the training with HPO pipeline.
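In the SageMaker Python SDK, the hyperparameter tuning step is expressed as a TuningStep wrapping a HyperparameterTuner. The following minimal sketch assumes the built-in XGBoost estimator (a common choice for the Abalone dataset) with illustrative hyperparameter ranges and S3 URIs; it is not the repository’s exact tuner configuration:

    import sagemaker
    from sagemaker.estimator import Estimator
    from sagemaker.inputs import TrainingInput
    from sagemaker.tuner import ContinuousParameter, HyperparameterTuner
    from sagemaker.workflow.steps import TuningStep

    session = sagemaker.Session()
    role = "arn:aws:iam::111122223333:role/sagemaker-pipeline-execution"  # assumed role ARN

    # Assumed estimator: the built-in XGBoost image.
    xgb_estimator = Estimator(
        image_uri=sagemaker.image_uris.retrieve(
            "xgboost", session.boto_region_name, version="1.7-1"
        ),
        role=role,
        instance_count=1,
        instance_type="ml.m5.xlarge",
        output_path="s3://my-ml-bucket/model-artifacts",  # assumed S3 URI
    )
    xgb_estimator.set_hyperparameters(objective="reg:squarederror", num_round=100)

    # Illustrative search space; adjust the ranges and budget to your own problem.
    tuner = HyperparameterTuner(
        estimator=xgb_estimator,
        objective_metric_name="validation:rmse",
        objective_type="Minimize",
        hyperparameter_ranges={
            "eta": ContinuousParameter(0.01, 0.3),
            "subsample": ContinuousParameter(0.5, 1.0),
        },
        max_jobs=10,
        max_parallel_jobs=2,
    )

    tuning_step = TuningStep(
        name="HyperparameterTuning",
        tuner=tuner,
        inputs={
            "train": TrainingInput(
                s3_data="s3://my-ml-bucket/train", content_type="text/csv"
            ),
            "validation": TrainingInput(
                s3_data="s3://my-ml-bucket/validation", content_type="text/csv"
            ),
        },
    )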

Clean up

Complete the following steps to clean up your resources:

  1. Use the destroy stage in the GitLab CI/CD pipeline to remove all resources provisioned by Terraform.
  2. Use the AWS CLI (or the boto3 equivalent sketched after this list) to list and remove any remaining pipelines that are created by the Python scripts.
  3. Optionally, delete other AWS resources such as the S3 bucket or IAM role created outside the CI/CD pipeline.
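For step 2, the following boto3 sketch is an equivalent of the AWS CLI calls: it lists the existing SageMaker pipelines and deletes the ones created by the Python scripts. The pipeline names shown are assumptions (only TrainingPipeline is named in this post); match them to your pipeline_train, pipeline_inf, and pipeline_trainwhpo values.

    import boto3

    sm = boto3.client("sagemaker")

    # Assumed pipeline names; adjust to the values configured in your tfvars file.
    pipeline_names = ["TrainingPipeline", "InferencePipeline", "TrainingWHPOPipeline"]

    existing = {p["PipelineName"] for p in sm.list_pipelines()["PipelineSummaries"]}

    for name in pipeline_names:
        if name in existing:
            sm.delete_pipeline(PipelineName=name)
            print(f"Deleted pipeline: {name}")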

Conclusion

In this post, we demonstrated how enterprises can create MLOps workflows for their batch inference jobs using Amazon SageMaker, Amazon EventBridge, AWS Lambda, Amazon SNS, HashiCorp Terraform, and GitLab CI/CD. The presented workflow automates data and model monitoring, model retraining, batch job runs, code versioning, and infrastructure provisioning. This can lead to significant reductions in the complexity and cost of maintaining batch inference jobs in production. For more information about implementation details, review the GitHub repo.


About the Authors

Hasan Shojaei is a Sr. Data Scientist with AWS Professional Services, where he helps customers across different industries such as sports, insurance, and financial services solve their business challenges through the use of big data, machine learning, and cloud technologies. Prior to this role, Hasan led multiple initiatives to develop novel physics-based and data-driven modeling techniques for top energy companies. Outside of work, Hasan is passionate about books, hiking, photography, and history.

Wenxin Liu is a Sr. Cloud Infrastructure Architect. Wenxin advises enterprise companies on how to accelerate cloud adoption and supports their innovations on the cloud. He’s a pet lover and is passionate about snowboarding and traveling.

Vivek Lakshmanan is a Machine Learning Engineer at Amazon. He has a Master’s degree in Software Engineering with a specialization in Data Science and several years of experience as an ML engineer. Vivek is excited about applying cutting-edge technologies and building AI/ML solutions for customers on the cloud. He is passionate about statistics, NLP, and model explainability in AI/ML. In his spare time, he enjoys playing cricket and taking road trips.

Andy Cracchiolo is a Cloud Infrastructure Architect. With more than 15 years in IT infrastructure, Andy is an accomplished and results-driven IT professional. In addition to optimizing IT infrastructure, operations, and automation, Andy has a proven track record of analyzing IT operations, identifying inconsistencies, and implementing process enhancements that increase efficiency, reduce costs, and increase profits.
