Operationalize LLM Evaluation at Scale using Amazon SageMaker Clarify and MLOps services


In the last few years, Large Language Models (LLMs) have risen to prominence as outstanding tools capable of understanding, generating, and manipulating text with unprecedented proficiency. Their potential applications span from conversational agents to content generation and information retrieval, holding the promise of revolutionizing all industries. However, harnessing this potential while ensuring the responsible and effective use of these models hinges on the critical process of LLM evaluation. An evaluation is a task used to measure the quality and responsibility of output of an LLM or generative AI service. Evaluating LLMs is motivated not only by the desire to understand a model's performance but also by the need to implement responsible AI, to mitigate the risk of providing misinformation or biased content, and to minimize the generation of harmful, unsafe, malicious, and unethical content. Furthermore, evaluating LLMs can also help mitigate security risks, particularly in the context of prompt data tampering. For LLM-based applications, it is crucial to identify vulnerabilities and implement safeguards that protect against potential breaches and unauthorized manipulations of data.

By providing essential tools for evaluating LLMs with a straightforward configuration and one-click approach, Amazon SageMaker Clarify LLM evaluation capabilities grant customers access to most of the aforementioned benefits. With these tools in hand, the next challenge is to integrate LLM evaluation into the Machine Learning Operations (MLOps) lifecycle to achieve automation and scalability in the process. In this post, we show you how to integrate Amazon SageMaker Clarify LLM evaluation with Amazon SageMaker Pipelines to enable LLM evaluation at scale. Additionally, we provide a code example in this GitHub repository to enable users to conduct parallel multi-model evaluation at scale, using examples such as Llama2-7b-f, Falcon-7b, and fine-tuned Llama2-7b models.

Who needs to perform LLM evaluation?

Anyone who trains, fine-tunes, or simply uses a pre-trained LLM needs to accurately evaluate it to assess the behavior of the application powered by that LLM. Based on this tenet, we can classify generative AI users who need LLM evaluation capabilities into three groups as shown in the following figure: model providers, fine-tuners, and consumers.

  • Foundation model (FM) providers train models that are general purpose. These models can be used for many downstream tasks, such as feature extraction or content generation. Each trained model needs to be benchmarked against many tasks not only to assess its performance but also to compare it with other existing models, to identify areas that need improvement, and finally, to keep track of advancements in the field. Model providers also need to check for the presence of any biases to ensure the quality of the starting dataset and the correct behavior of their model. Gathering evaluation data is vital for model providers. Furthermore, this data and these metrics must be collected to comply with upcoming regulations. ISO 42001, the Biden Administration Executive Order, and the EU AI Act develop standards, tools, and tests to help ensure that AI systems are safe, secure, and trustworthy. For example, the EU AI Act requires providers to report which datasets are used for training, what compute power is required to run the model, and model results against public/industry-standard benchmarks, and to share the results of internal and external testing.
  • Model fine-tuners want to solve specific tasks (e.g. sentiment classification, summarization, question answering) as well as adapt pre-trained models to domain-specific tasks. They need the evaluation metrics generated by model providers to select the right pre-trained model as a starting point.
    They need to evaluate their fine-tuned models against their desired use-case with task-specific or domain-specific datasets. Frequently, they must curate and create their private datasets since publicly available datasets, even those designed for a specific task, may not adequately capture the nuances required for their particular use case.
    Fine-tuning is faster and cheaper than a full training run and requires faster operational iteration for deployment and testing because many candidate models are usually generated. Evaluating these models allows continuous model improvement, calibration, and debugging. Note that fine-tuners can become consumers of their own models when they develop real-world applications.
  • Model consumers or model deployers serve and monitor general-purpose or fine-tuned models in production, aiming to enhance their applications or services through the adoption of LLMs. The first challenge they have is to ensure that the chosen LLM aligns with their specific needs, cost, and performance expectations. Interpreting and understanding the model’s outputs is a persistent concern, especially when privacy and data security are involved (e.g. for auditing risk and compliance in regulated industries, such as the financial sector). Continuous model evaluation is critical to prevent the propagation of bias or harmful content. By implementing a robust monitoring and evaluation framework, model consumers can proactively identify and address regressions in LLMs, ensuring that these models maintain their effectiveness and reliability over time.

How to perform LLM evaluation

Effective model evaluation involves three fundamental components: one or more FMs or fine-tuned models to evaluate, the input datasets (prompts, conversations, or regular inputs), and the evaluation logic.

To select the models for evaluation, different factors must be considered, including data characteristics, problem complexity, available computational resources, and the desired outcome. The input datastore provides the data necessary for training, fine-tuning, and testing the selected model. It’s vital that this datastore is well structured, representative, and of high quality, as the model’s performance heavily depends on the data it learns from. Lastly, the evaluation logic defines the criteria and metrics used to assess the model’s performance.

Together, these three components form a cohesive framework that ensures the rigorous and systematic assessment of machine learning models, ultimately leading to informed decisions and improvements in model effectiveness.

Model evaluation techniques are still an active field of research. Many public benchmarks and frameworks were created by the research community in the last few years to cover a wide range of tasks and scenarios, such as GLUE, SuperGLUE, HELM, MMLU, and BIG-bench. These benchmarks have leaderboards that can be used to compare and contrast evaluated models. Benchmarks like HELM also aim to assess metrics beyond accuracy measures, such as precision or F1 score. The HELM benchmark includes metrics for fairness, bias, and toxicity, which carry equally significant weight in the overall model evaluation score.

All these benchmarks include a set of metrics that measure how the model performs on a certain task. The most common metrics are ROUGE (Recall-Oriented Understudy for Gisting Evaluation), BLEU (BiLingual Evaluation Understudy), and METEOR (Metric for Evaluation of Translation with Explicit ORdering). These metrics serve as useful tools for automated evaluation, providing quantitative measures of lexical similarity between generated and reference text. However, they do not capture the full breadth of human-like language generation, which includes semantic understanding, context, or stylistic nuances. In addition, benchmarks like HELM don’t provide evaluation details relevant to specific use cases, solutions for testing custom prompts, or easily interpretable results for non-experts, because the process can be costly, hard to scale, and limited to specific tasks.
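To make the lexical-similarity idea concrete, the following minimal sketch computes a ROUGE-L score between a generated answer and a reference answer. It assumes the open-source rouge_score package (not part of the solution in this post) is installed; the example strings are made up.

# pip install rouge-score  (assumed third-party package, used only for illustration)
from rouge_score import rouge_scorer

reference = "Paris is the capital of France."
generated = "The capital of France is Paris."

# ROUGE-L scores the longest common subsequence overlap between the two texts
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)
print(scores["rougeL"].fmeasure)  # lexical overlap only; no notion of semantics or style

A high score here only indicates word-level overlap, which is exactly why the human-in-the-loop assessments discussed next remain important.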

Furthermore, achieving human-like language generation often requires the incorporation of human-in-the-loop to bring qualitative assessments and human judgement to complement the automated accuracy metrics. Human evaluation is a valuable method for assessing LLM outputs but it can also be subjective and prone to bias because different human evaluators may have diverse opinions and interpretations of text quality. Furthermore, human evaluation can be resource-intensive and costly and it can demand significant time and effort.

Let’s dive deep into how Amazon SageMaker Clarify seamlessly connects the dots, aiding customers in conducting thorough model evaluation and selection.

LLM evaluation with Amazon SageMaker Clarify

Amazon SageMaker Clarify helps customers automate the evaluation of LLMs and LLM-based services such as Amazon Bedrock by providing a framework with built-in metrics and evaluation methods. The metrics include, but are not limited to, accuracy, robustness, toxicity, stereotyping, and factual knowledge for automated evaluation, and style, coherence, and relevance for human-based evaluation. As a fully managed service, SageMaker Clarify simplifies the use of open-source evaluation frameworks within Amazon SageMaker. Customers can select relevant evaluation datasets and metrics for their scenarios and extend them with their own prompt datasets and evaluation algorithms. SageMaker Clarify delivers evaluation results in multiple formats to support different roles in the LLM workflow. Data scientists can analyze detailed results with SageMaker Clarify visualizations in notebooks, SageMaker Model Cards, and PDF reports. Meanwhile, operations teams can use Amazon SageMaker Ground Truth to review and annotate high-risk items that SageMaker Clarify identifies, for example due to stereotyping, toxicity, escaped PII, or low accuracy.

Annotations and reinforcement learning are subsequently employed to mitigate potential risks. Human-friendly explanations of the identified risks expedite the manual review process, thereby reducing costs. Summary reports offer business stakeholders comparative benchmarks between different models and versions, facilitating informed decision-making.

The following figure shows the framework to evaluate LLMs and LLM-based services:

Amazon SageMaker Clarify LLM evaluation is built on the open-source Foundation Model Evaluation (FMEval) library developed by AWS to help customers easily evaluate LLMs. All the functionality has also been incorporated into Amazon SageMaker Studio to enable LLM evaluation for its users. In the following sections, we introduce the integration of Amazon SageMaker Clarify LLM evaluation capabilities with SageMaker Pipelines to enable LLM evaluation at scale by using MLOps principles.
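As a quick illustration of what the FMEval library offers on its own, the following sketch scores a single model response for factual knowledge without deploying any endpoint. It assumes the fmeval package is installed and that the evaluate_sample API behaves as in the library's documentation; the strings are made-up examples.

from fmeval.eval_algorithms.factual_knowledge import FactualKnowledge, FactualKnowledgeConfig

# "<OR>" separates alternative acceptable answers in the target output
eval_algo = FactualKnowledge(FactualKnowledgeConfig(target_output_delimiter="<OR>"))

# Score one prompt/response pair in isolation (assumed evaluate_sample API)
scores = eval_algo.evaluate_sample(
    target_output="Paris<OR>paris",
    model_output="The capital of France is Paris.",
)
print(scores)  # list of EvalScore objects with a binary factual knowledge score

The pipeline integration shown later in this post uses the same algorithm classes, but runs them against a whole dataset and a deployed endpoint.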

Amazon SageMaker MLOps lifecycle

As the post “MLOps foundation roadmap for enterprises with Amazon SageMaker” describes, MLOps is the combination of processes, people, and technology to productionize ML use cases efficiently.

The following figure shows the end-to-end MLOps lifecycle:

A typical journey starts with a data scientist creating a proof of concept (PoC) notebook to prove that ML can solve a business problem. Throughout PoC development, it falls to the data scientist to convert the business key performance indicators (KPIs) into machine learning model metrics, such as precision or false-positive rate, and to use a limited test dataset to evaluate these metrics. Data scientists collaborate with ML engineers to transition code from notebooks to repositories, creating ML pipelines using Amazon SageMaker Pipelines, which connect various processing steps and tasks, including pre-processing, training, evaluation, and post-processing, all while continually incorporating new production data. Deployment of Amazon SageMaker Pipelines relies on repository interactions and CI/CD pipeline activation. The ML pipeline maintains top-performing models, container images, evaluation results, and status information in a model registry. Model stakeholders assess performance and decide on progression to production based on performance results and benchmarks, followed by activation of another CI/CD pipeline for staging and production deployment. Once in production, ML consumers use the model via application-triggered inference through direct invocation or API calls, with feedback loops to model owners for ongoing performance evaluation.

Amazon SageMaker Clarify and MLOps integration

Following the MLOps lifecycle, fine-tuners or users of open-source models productionize fine-tuned models or FMs using Amazon SageMaker JumpStart and MLOps services, as described in Implementing MLOps practices with Amazon SageMaker JumpStart pre-trained models. This leads to a new domain of foundation model operations (FMOps) and LLM operations (LLMOps), as described in FMOps/LLMOps: Operationalize generative AI and differences with MLOps.

The following figure shows end-to-end LLMOps lifecycle:

In LLMOps, the main differences compared to MLOps are model selection and model evaluation, which involve different processes and metrics. In the initial experimentation phase, the data scientists (or fine-tuners) select the FM that will be used for a specific generative AI use case. This often results in the testing and fine-tuning of multiple FMs, some of which may yield comparable results. After the selection of the model(s), prompt engineers are responsible for preparing the necessary input data and expected output for evaluation (e.g. input prompts comprising input data and query) and defining metrics like similarity and toxicity. In addition to these metrics, data scientists or fine-tuners must validate the outcomes and choose the appropriate FM not only on precision metrics, but on other capabilities like latency and cost. Then, they can deploy a model to a SageMaker endpoint and test its performance on a small scale. While the experimentation phase may involve a straightforward process, transitioning to production requires customers to automate the process and enhance the robustness of the solution. Therefore, we need to dive deep into how to automate evaluation, enabling testers to perform efficient evaluation at scale, and implementing real-time monitoring of model input and output.

Automate FM evaluation

Amazon SageMaker Pipelines automates all the phases of preprocessing, FM fine-tuning (optionally), and evaluation at scale. Given the models selected during experimentation, prompt engineers need to cover a larger set of cases by preparing many prompts and storing them in a designated storage repository called a prompt catalog. For more information, refer to FMOps/LLMOps: Operationalize generative AI and differences with MLOps. Then, Amazon SageMaker Pipelines can be structured as follows:

Scenario 1 – Evaluate multiple FMs: In this scenario, the FMs can cover the business use case without fine-tuning. The Amazon SageMaker Pipeline consists of the following steps: data pre-processing, parallel evaluation of multiple FMs, model comparison and selection based on accuracy and other properties like cost or latency, and registration of the selected model’s artifacts and metadata.

The following diagram illustrates this architecture.

Scenario 2 – Fine-tune and evaluate multiple FMs: In this scenario, the Amazon SageMaker Pipeline is structured much like Scenario 1, but it runs both the fine-tuning and evaluation steps in parallel for each FM. The best fine-tuned model is registered in the Model Registry.

The following diagram illustrates this architecture.

Scenario 3 – Evaluate multiple FMs and fine-tuned FMs: This scenario is a combination of evaluating general-purpose FMs and fine-tuned FMs. In this case, customers want to check whether a fine-tuned model can perform better than a general-purpose FM.

The following figure shows the resulting SageMaker Pipeline steps.

Note that model registration follows two patterns: (a) store an open-source model and artifacts or (b) store a reference to a proprietary FM. For more information, refer to FMOps/LLMOps: Operationalize generative AI and differences with MLOps.

Solution overview

To accelerate your journey into LLM evaluation at scale, we created a solution that implements the scenarios using both Amazon SageMaker Clarify and the new Amazon SageMaker Pipelines SDK. The code example, including datasets, source notebooks and SageMaker Pipelines (steps and ML pipeline), is available on GitHub. To develop this example solution, we have used two FMs: Llama2 and Falcon-7B. In this post, our primary focus is on the key elements of the SageMaker Pipeline solution that pertain to the evaluation process.

Evaluation configuration: For the purpose of standardizing the evaluation procedure, we have created a YAML configuration file (evaluation_config.yaml) that contains the necessary details for the evaluation process, including the dataset, the model(s), and the algorithms to be run during the evaluation step of the SageMaker Pipeline. The following example illustrates the configuration file:

pipeline:
    name: "llm-evaluation-multi-models-hybrid"

dataset:
    dataset_name: "trivia_qa_sampled"
    input_data_location: "evaluation_dataset_trivia.jsonl"
    dataset_mime_type: "jsonlines"
    model_input_key: "question"
    target_output_key: "answer"

models:
  - name: "llama2-7b-f"
    model_id: "meta-textgeneration-llama-2-7b-f"
    model_version: "*"
    endpoint_name: "llm-eval-meta-textgeneration-llama-2-7b-f"
    deployment_config:
      instance_type: "ml.g5.2xlarge"
      num_instances: 1
    evaluation_config:
      output: '[0].generation.content'
      content_template: [[{"role":"user", "content": "PROMPT_PLACEHOLDER"}]]
      inference_parameters: 
        max_new_tokens: 100
        top_p: 0.9
        temperature: 0.6
      custom_attributes:
        accept_eula: True
      prompt_template: "$feature"
    cleanup_endpoint: True

  - name: "falcon-7b"
    ...

  - name: "llama2-7b-finetuned"
    ...
    finetuning:
      train_data_path: "train_dataset"
      validation_data_path: "val_dataset"
      parameters:
        instance_type: "ml.g5.12xlarge"
        num_instances: 1
        epoch: 1
        max_input_length: 100
        instruction_tuned: True
        chat_dataset: False
    ...

algorithms:
  - algorithm: "FactualKnowledge" 
    module: "fmeval.eval_algorithms.factual_knowledge"
    config: "FactualKnowledgeConfig"
    target_output_delimiter: "<OR>"

Evaluation step: The new SageMaker Pipelines SDK provides users the flexibility to define custom steps in the ML workflow using the ‘@step’ Python decorator. Therefore, users need to create a basic Python script that conducts the evaluation, as follows:

def evaluation(data_s3_path, endpoint_name, data_config, model_config, algorithm_config, output_data_path,):
    from fmeval.data_loaders.data_config import DataConfig
    from fmeval.model_runners.sm_jumpstart_model_runner import JumpStartModelRunner
    from fmeval.reporting.eval_output_cells import EvalOutputCell
    from fmeval.constants import MIME_TYPE_JSONLINES

    # Additional imports used inside this step function (assumed standard locations)
    import importlib
    import boto3
    import markdown
    from sagemaker.s3 import parse_s3_url
    from sagemaker.serializers import JSONSerializer

    s3 = boto3.client("s3")

    bucket, object_key = parse_s3_url(data_s3_path)
    s3.download_file(bucket, object_key, "dataset.jsonl")

    config = DataConfig(
        dataset_name=data_config["dataset_name"],
        dataset_uri="dataset.jsonl",
        dataset_mime_type=MIME_TYPE_JSONLINES,
        model_input_location=data_config["model_input_key"],
        target_output_location=data_config["target_output_key"],
    )

    evaluation_config = model_config["evaluation_config"]

    content_dict = {
        "inputs": evaluation_config["content_template"],
        "parameters": evaluation_config["inference_parameters"],
    }
    serializer = JSONSerializer()
    serialized_data = serializer.serialize(content_dict)

    content_template = serialized_data.replace('"PROMPT_PLACEHOLDER"', "$prompt")
    print(content_template)

    js_model_runner = JumpStartModelRunner(
        endpoint_name=endpoint_name,
        model_id=model_config["model_id"],
        model_version=model_config["model_version"],
        output=evaluation_config["output"],
        content_template=content_template,
        custom_attributes="accept_eula=true",
    )

    eval_output_all = []
    s3 = boto3.resource("s3")
    output_bucket, output_index = parse_s3_url(output_data_path)

    for algorithm in algorithm_config:
        algorithm_name = algorithm["algorithm"]
        module = importlib.import_module(algorithm["module"])
        algorithm_class = getattr(module, algorithm_name)
        algorithm_config_class = getattr(module, algorithm["config"])
        eval_algo = algorithm_class(algorithm_config_class(target_output_delimiter=algorithm["target_output_delimiter"]))
        eval_output = eval_algo.evaluate(model=js_model_runner, dataset_config=config, prompt_template=evaluation_config["prompt_template"], save=True,)
        
        print(f"eval_output: {eval_output}")
        eval_output_all.append(eval_output)
        html = markdown.markdown(str(EvalOutputCell(eval_output[0])))
        file_index = (output_index + "/" + model_config["name"] + "_" + eval_algo.eval_name + ".html")
        s3_object = s3.Object(bucket_name=output_bucket, key=file_index)
        s3_object.put(Body=html)

    eval_result = {"model_config": model_config, "eval_output": eval_output_all}
    print(f"eval_result: {eval_result}")

    return eval_result

SageMaker Pipeline: After creating the necessary steps, such as data preprocessing, model deployment, and model evaluation, the user needs to link the steps together by using the SageMaker Pipelines SDK. The new SDK automatically generates the workflow by interpreting the dependencies between the different steps when a SageMaker Pipeline creation API is invoked, as shown in the following example:

import os
import argparse
from datetime import datetime

import sagemaker
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.function_step import step
from sagemaker.workflow.step_outputs import get_step

# Import the necessary steps
from steps.preprocess import preprocess
from steps.evaluation import evaluation
from steps.cleanup import cleanup
from steps.deploy import deploy

from lib.utils import ConfigParser
from lib.utils import find_model_by_name

if __name__ == "__main__":
    os.environ["SAGEMAKER_USER_CONFIG_OVERRIDE"] = os.getcwd()

    sagemaker_session = sagemaker.session.Session()

    # Define data location either by providing it as an argument or by using the default bucket
    default_bucket = sagemaker.Session().default_bucket()
    parser = argparse.ArgumentParser()
    parser.add_argument("-input-data-path", "--input-data-path", dest="input_data_path", default=f"s3://{default_bucket}/llm-evaluation-at-scale-example", help="The S3 path of the input data",)
    parser.add_argument("-config", "--config", dest="config", default="", help="The path to .yaml config file",)
    args = parser.parse_args()

    # Initialize configuration for data, model, and algorithm
    if args.config:
        config = ConfigParser(args.config).get_config()
    else:
        config = ConfigParser("pipeline_config.yaml").get_config()

    evaluation_exec_id = datetime.now().strftime("%Y_%m_%d_%H_%M_%S")
    pipeline_name = config["pipeline"]["name"]
    dataset_config = config["dataset"]  # Get dataset configuration
    input_data_path = args.input_data_path + "/" + dataset_config["input_data_location"]
    output_data_path = (args.input_data_path + "/output_" + pipeline_name + "_" + evaluation_exec_id)

    print("Data input location:", input_data_path)
    print("Data output location:", output_data_path)

    algorithms_config = config["algorithms"]  # Get algorithms configuration

    model_config = find_model_by_name(config["models"], "llama2-7b")
    model_id = model_config["model_id"]
    model_version = model_config["model_version"]
    evaluation_config = model_config["evaluation_config"]
    endpoint_name = model_config["endpoint_name"]

    model_deploy_config = model_config["deployment_config"]
    deploy_instance_type = model_deploy_config["instance_type"]
    deploy_num_instances = model_deploy_config["num_instances"]

    # Construct the steps
    processed_data_path = step(preprocess, name="preprocess")(input_data_path, output_data_path)

    endpoint_name = step(deploy, name=f"deploy_{model_id}")(model_id, model_version, endpoint_name, deploy_instance_type, deploy_num_instances,)

    evaluation_results = step(evaluation, name=f"evaluation_{model_id}", keep_alive_period_in_seconds=1200)(processed_data_path, endpoint_name, dataset_config, model_config, algorithms_config, output_data_path,)

    last_pipeline_step = evaluation_results

    if model_config["cleanup_endpoint"]:
        # Use a distinct variable so the imported cleanup function is not shadowed
        cleanup_step = step(cleanup, name=f"cleanup_{model_id}")(model_id, endpoint_name)
        get_step(cleanup_step).add_depends_on([evaluation_results])
        last_pipeline_step = cleanup_step

    # Define the SageMaker Pipeline
    pipeline = Pipeline(
        name=pipeline_name,
        steps=[last_pipeline_step],
    )

    # Build and run the SageMaker Pipeline
    pipeline.upsert(role_arn=sagemaker.get_execution_role())
    # pipeline.upsert(role_arn="arn:aws:iam::<...>:role/service-role/AmazonSageMaker-ExecutionRole-<...>")

    pipeline.start()

The example implements the evaluation of a single FM by pre-processing the initial data set, deploying the model, and running the evaluation. The generated pipeline directed acyclic graph (DAG) is shown in the following figure.

Following a similar approach and by using and tailoring the example in Fine-tune LLaMA 2 models on SageMaker JumpStart, we created the pipeline to evaluate a fine-tuned model, as shown in the following figure.

By using the previous SageMaker Pipeline steps as “Lego” blocks, we developed the solution for Scenario 1 and Scenario 3, as shown in the following figures. Specifically, the GitHub repository enables the user to evaluate multiple FMs in parallel or to perform more complex evaluation combining evaluation of both foundation and fine-tuned models.

Additional functionalities available in the repository include the following:

  • Dynamic evaluation step generation: Our solution generates all the necessary evaluation steps dynamically based on the configuration file, enabling users to evaluate any number of models (see the sketch after this list). We have extended the solution to support easy integration of new types of models, such as Hugging Face or Amazon Bedrock.
  • Prevent endpoint redeployment: If an endpoint is already in place, we skip the deployment process (also illustrated in the sketch after this list). This allows the user to reuse endpoints with FMs for evaluation, resulting in cost savings and reduced deployment time.
  • Endpoint cleanup: After the completion of the evaluation, the SageMaker Pipeline decommissions the deployed endpoints. This functionality can be extended to keep the best model’s endpoint alive.
  • Model selection step: We have added a model selection step placeholder that requires the business logic of the final model selection, including criteria such as cost or latency.
  • Model registration step: The best model can be registered into Amazon SageMaker Model Registry as a new version of a specific model group.
  • Warm pool: SageMaker managed warm pools let you retain and reuse provisioned infrastructure after the completion of a job to reduce latency for repetitive workloads.
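The following sketch shows one way such dynamic step generation can look in the pipeline definition script: it loops over the models declared in the YAML configuration, skips deployment when an endpoint already exists, and creates one evaluation branch per model. The endpoint_exists helper and the loop structure are illustrative assumptions rather than a verbatim excerpt from the repository; the step functions and configuration keys are the same ones used earlier in this post.

import boto3
from botocore.exceptions import ClientError
from sagemaker.workflow.function_step import step

sm_client = boto3.client("sagemaker")

def endpoint_exists(endpoint_name):
    # Reuse an endpoint that is already deployed (hypothetical helper)
    try:
        sm_client.describe_endpoint(EndpointName=endpoint_name)
        return True
    except ClientError:
        return False

evaluation_steps = []
for model_config in config["models"]:
    model_id = model_config["model_id"]
    endpoint_name = model_config["endpoint_name"]

    if not endpoint_exists(endpoint_name):
        # Deploy only when needed; otherwise evaluate against the existing endpoint
        endpoint_name = step(deploy, name=f"deploy_{model_id}")(
            model_id,
            model_config["model_version"],
            endpoint_name,
            model_config["deployment_config"]["instance_type"],
            model_config["deployment_config"]["num_instances"],
        )

    evaluation_steps.append(
        step(evaluation, name=f"evaluation_{model_id}")(
            processed_data_path,
            endpoint_name,
            dataset_config,
            model_config,
            algorithms_config,
            output_data_path,
        )
    )

# A subsequent model selection step can then compare all evaluation outputs.

Each evaluation branch runs in parallel because SageMaker Pipelines infers that the branches have no dependencies on each other.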

The following figure illustrates these capabilities and a multi-model evaluation example that the users can create easily and dynamically using our solution in this GitHub repository.

We intentionally kept the data preparation out of scope, as it will be described in depth in a different post, including prompt catalog designs, prompt templates, and prompt optimization. For more information and related component definitions, refer to FMOps/LLMOps: Operationalize generative AI and differences with MLOps.

Conclusion

In this post, we focused on how to automate and operationalize LLM evaluation at scale using Amazon SageMaker Clarify LLM evaluation capabilities and Amazon SageMaker Pipelines. In addition to theoretical architecture designs, we have example code in this GitHub repository (featuring Llama2 and Falcon-7B FMs) to enable customers to develop their own scalable evaluation mechanisms.

The following illustration shows model evaluation architecture.

In this post, we focused on operationalizing LLM evaluation at scale, as shown on the left side of the illustration. In the future, we’ll focus on developing examples that fulfill the end-to-end lifecycle of FMs to production by following the guidelines described in FMOps/LLMOps: Operationalize generative AI and differences with MLOps. This includes LLM serving, monitoring, and storing of output ratings, which will eventually trigger automatic re-evaluation and fine-tuning, and, lastly, using humans-in-the-loop to work on labeled data or a prompt catalog.


About the authors

Dr. Sokratis Kartakis is a Principal Machine Learning and Operations Specialist Solutions Architect for Amazon Web Services. Sokratis focuses on enabling enterprise customers to industrialize their Machine Learning (ML) and generative AI solutions by exploiting AWS services and shaping their operating model, i.e. MLOps/FMOps/LLMOps foundations, and their transformation roadmap leveraging best development practices. He has spent 15+ years inventing, designing, leading, and implementing innovative end-to-end production-level ML and AI solutions in the domains of energy, retail, health, finance, motorsports, and more.

Jagdeep Singh Soni is a Senior Partner Solutions Architect at AWS based in Netherlands. He uses his passion for DevOps, GenAI and builder tools to help both system integrators and technology partners. Jagdeep applies his application development and architecture background to drive innovation within his team and promote new technologies.

Dr. Riccardo Gatti is a Senior Startup Solution Architect based in Italy. He is a technical advisor for customers, helping them grow their business by selecting the right tools and technologies to innovate, scale fast, and go global in minutes. He has always been passionate about machine learning and generative AI, having studied and applied these technologies across different domains throughout his working career. He is host and editor of the AWS Italian podcast “Casa Startup”, dedicated to stories of startup founders and new technological trends.

Read More

Accelerate deep learning model training up to 35% with Amazon SageMaker smart sifting


In today’s rapidly evolving landscape of artificial intelligence, deep learning models have found themselves at the forefront of innovation, with applications spanning computer vision (CV), natural language processing (NLP), and recommendation systems. However, the increasing cost associated with training and fine-tuning these models poses a challenge for enterprises. This cost is primarily driven by the sheer volume of data used in training deep learning models. Today, large models are often trained on terabytes of data and can take weeks to train, even with powerful GPU or AWS Trainium-based hardware. Typically, customers rely on techniques and optimizations that improve the efficiency of a model’s training loop, such as optimized kernels or layers, mixed precision training, or features such as the Amazon SageMaker distributed training libraries. However, there is less focus today on the efficiency of the training data itself. Not all data contributes equally to the learning process during model training: a significant proportion of the computational resources may be spent on processing simple examples that don’t contribute substantially to the model’s overall accuracy.

Customers have traditionally relied on preprocessing techniques such as upsampling or downsampling and deduplication to refine and improve the information quality of their data. These techniques can help, but are often time consuming, require specialized data science experience, and can sometimes be more art than science. Customers often also rely on curated datasets, such as RefinedWeb, to improve the performance of their models; however, these datasets aren’t always fully open source and are often more general purpose and not related to your specific use case.

How else can you overcome this inefficiency related to low-information data samples during model training?

We’re excited to announce a public preview of smart sifting, a new capability of SageMaker that can reduce the cost of training deep learning models by up to 35%. Smart sifting is a new data efficiency technique that actively analyzes your data samples during training and filters out the samples that are less informative to the model. By training on a smaller subset of data with only the samples that contribute the most to model convergence, total training time and cost decrease with minimal or no impact to accuracy. Additionally, because the feature operates online during model training, smart sifting doesn’t require changes to your upstream data or downstream training pipeline.

In this post, we discuss the following topics:

  • The new smart sifting capability in SageMaker and how it works
  • How to use smart sifting with PyTorch training workloads

You can also check out our documentation and sample notebooks for additional resources on how to get started with smart sifting.

How SageMaker smart sifting works

We begin this post with an overview of how the smart sifting capability can accelerate your model training on SageMaker.

Smart sifting’s task is to sift through your training data during the training process and only feed the more informative samples to the model. During a typical training with PyTorch, data is iteratively sent in batches to the training loop and to accelerator devices (for example, GPUs or Trainium chips) by the PyTorch DataLoader. Smart sifting is implemented at this data loading stage and therefore is independent of any upstream data preprocessing in your training pipeline.

Smart sifting uses your model and a user-specified loss function to do an evaluative forward pass of each data sample as it’s loaded. Samples that are high-loss will materially impact model training and therefore are used in training; data samples that are relatively low-loss are set aside and excluded from training.
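Conceptually, the sifting decision resembles the following sketch, which runs a no-gradient forward pass, computes a per-sample loss, and keeps only the higher-loss portion of the batch. This is an illustrative PyTorch snippet of the idea, not the SageMaker smart sifting implementation; the actual feature works probabilistically against a relative loss distribution accumulated over a configurable history window.

import torch

def sift_batch(model, batch_inputs, batch_labels, keep_fraction=0.67):
    """Keep only the higher-loss samples of a batch (conceptual illustration)."""
    loss_fn = torch.nn.CrossEntropyLoss(reduction="none")  # per-sample loss

    with torch.no_grad():  # evaluative forward pass only, no gradients
        logits = model(batch_inputs)
        per_sample_loss = loss_fn(logits, batch_labels)

    # Select the indices of the highest-loss samples in the batch
    k = max(1, int(keep_fraction * batch_inputs.size(0)))
    _, keep_idx = torch.topk(per_sample_loss, k)

    # Only these samples proceed to the real, gradient-computing training step
    return batch_inputs[keep_idx], batch_labels[keep_idx]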

A key input to smart sifting is the proportion of data to exclude: for example, by setting the proportion to 33% (beta_value=0.5), samples in approximately the bottom third of loss of each batch will be excluded from training. When enough high-loss samples have been identified to complete a batch, the data is sent through the full training loop and the model learns and trains normally. You don’t need to make any changes to your training loop when smart sifting is enabled.

The following diagram illustrates this workflow.

By including only a subset of your training data, smart sifting reduces the time and computation needed to train the model. In our tests, we achieved up to a nearly 40% reduction in total training time and cost. With smart sifting of data, there can be minimal or no impact to model accuracy because the excluded samples were relatively low-loss for the model. In the following table, we include a set of experimental results demonstrating the performance improvement possible with SageMaker smart sifting.

In the table, the % Accepted column indicates the proportion of data that is included and used in the training loop. Decreasing this tunable parameter decreases the cost (as demonstrated in the IMR Savings % column), but it can also affect the accuracy. The appropriate setting for % Accepted is a function of your dataset and model; you should experiment with and tune this parameter to achieve the best balance between reduced cost and impact to accuracy.

Solution overview

In the following sections, we walk through a practical example of enabling smart sifting with a PyTorch training job on SageMaker. If you want to get started quickly, you can jump to the PyTorch or PyTorch Lightning examples.

Prerequisites

We assume that you already know how to train a model using PyTorch or PyTorch Lightning using the SageMaker Python SDK and the Estimator class using SageMaker Deep Learning Containers for training. If not, refer to Using the SageMaker Python SDK before continuing.

Get started with SageMaker smart sifting

In a typical PyTorch training job, you initialize the PyTorch training DataLoader with your dataset and other required parameters, which provides input batches as the training progresses. To enable smart sifting of your training data, you’ll use a new DataLoader class: smart_sifting.dataloader.sift_dataloader.SiftingDataloader. This class is used as a wrapper on top of your existing PyTorch DataLoader and the training process will instead use SiftingDataloader to get input batches. The SiftingDataLoader gets the input batch from your original PyTorch DataLoader, evaluates the importance of samples in the batch, and constructs a sifted batch with high-loss samples, which are then passed to the training step. The wrapper looks like the following code:

from smart_sifting.dataloader.sift_dataloader import SiftingDataloader

train_dataloader =  SiftingDataloader(
    sift_config = sift_config,
    orig_dataloader=DataLoader(self.train, self.batch_size, shuffle=True),
    loss_impl=BertLoss(),
    model=self.model
)

The SiftingDataloader requires some additional parameters to analyze your training data, which you can specify via the sift_config parameter. First, create a smart_sifting.sift_config.sift_configs.RelativeProbabilisticSiftConfig object. This object holds the configurable and required beta_value and loss_history_length, which respectively define the proportion of samples to keep and the window of samples to include when evaluating relative loss. Note that, because smart sifting uses your model for defining the importance of the sample, there can be negative implications if we use a model with completely random weights. Instead, you can use loss_based_sift_config and a sift_delay to delay the sift process until the parameter weights in the model are updated beyond random values. (For more details, refer to Apply smart sifting to your training script.) In the following code, we define sift_config and specify beta_value and loss_history_length, as well as delay the start of sifting using loss_based_sift_config:

from smart_sifting.sift_config.sift_configs import RelativeProbabilisticSiftConfig, LossConfig, SiftingBaseConfig

sift_config = RelativeProbabilisticSiftConfig(
    beta_value=3,
    loss_history_length=500,
    loss_based_sift_config=LossConfig(
         sift_config=SiftingBaseConfig(sift_delay=10)
    )
)

Next, you must also include a loss_impl parameter in the SiftingDataloader object. Smart sifting works on an individual sample level, and it’s crucial to have access to a loss calculation method to determine the importance of each sample. You must implement a sifting loss method that returns an n x 1 tensor, which holds the loss values of n samples. Typically, you specify the same loss method used by your model during training. Finally, include a pointer to your model in the SiftingDataloader object, which is used to evaluate samples before they are included in training. See the following code:

import torch
from typing import Any
from smart_sifting.loss.abstract_sift_loss_module import Loss
from smart_sifting.sift_config.sift_configs import RelativeProbabilisticSiftConfig, LossConfig, SiftingBaseConfig

## Defining Sift loss
class SiftBertLoss(Loss):
    # You should add the following initialization function
    # to calculate loss per sample, not per batch.
    def __init__(self):
        self.celoss = torch.nn.CrossEntropyLoss(reduction='none')

    def loss(
            self,
            model: torch.nn.Module,
            transformed_batch: SiftingBatch,
            original_batch: Any = None,
    ) -> torch.Tensor:
    
        device = next(model.parameters()).device
        batch = [t.to(device) for t in original_batch]

        # compute loss
        outputs = model(batch)
        return self.celoss(outputs.logits, batch[2])

....
....

train_dataloader =  SiftingDataloader(
    sift_config = sift_config,
    orig_dataloader=DataLoader(self.train, self.batch_size, shuffle=True),
    loss_impl=SiftBertLoss(),
    model=self.model
)

The following code shows a complete example of enabling smart sifting with an existing BERT training job:

from smart_sifting.dataloader.sift_dataloader import SiftingDataloader
from smart_sifting.loss.abstract_sift_loss_module import Loss
from smart_sifting.sift_config.sift_configs import RelativeProbabilisticSiftConfig, LossConfig, SiftingBaseConfig
...
...
...

## Defining Sift loss
class SiftBertLoss(Loss):
    # You should add the following initialization function
    # to calculate loss per sample, not per batch.
    def __init__(self):
        self.celoss = torch.nn.CrossEntropyLoss(reduction='none')

    def loss(
            self,
            model: torch.nn.Module,
            transformed_batch: SiftingBatch,
            original_batch: Any = None,
    ) -> torch.Tensor:
    
        device = next(model.parameters()).device
        batch = [t.to(device) for t in original_batch]

        # compute loss
        outputs = model(batch)
        return self.celoss(outputs.logits, batch[2])
             
 ....
 ....
 ....
 
 sift_config = RelativeProbabilisticSiftConfig(
    beta_value=3,
    loss_history_length=500,
    loss_based_sift_config=LossConfig(
        sift_config=SiftingBaseConfig(sift_delay=10)
    )
)

train_dataloader =  SiftingDataloader(
    sift_config = sift_config,
    orig_dataloader=DataLoader(self.train, self.batch_size, shuffle=True),
    loss_impl=SiftBertLoss(),
    model=self.model
)

......

# use train_dataloader in the rest of the training logic.

Conclusion

In this post, we explored the public preview of smart sifting, a new capability of SageMaker that can reduce deep learning model training costs by up to 35%. This feature improves data efficiency during training by filtering out less informative data samples. By including only the most impactful data for model convergence, you can significantly reduce training time and expense, all while maintaining accuracy. What’s more, it seamlessly integrates into your existing processes without requiring alterations to your data or training pipeline.

To dive deeper into SageMaker smart sifting, explore how it works, and implement it with PyTorch training workloads, check out our documentation and sample notebooks and get started with this new capability.


About the authors

Robert Van Dusen is a Senior Product Manager with Amazon SageMaker. He leads frameworks, compilers, and optimization techniques for deep learning training.

K Lokesh Kumar Reddy is a Senior engineer in the Amazon Applied AI team. He is focused on efficient ML training techniques and building tools to improve conversational AI systems. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends.

Abhishek Dan is a senior Dev Manager in the Amazon Applied AI team and works on machine learning and conversational AI systems. He is passionate about AI technologies and works in the intersection of Science and Engineering in advancing the capabilities of AI systems to create more intuitive and seamless human-computer interactions. He is currently building applications on large language models to drive efficiency and CX improvements for Amazon.

Read More

Schedule Amazon SageMaker notebook jobs and manage multi-step notebook workflows using APIs


Amazon SageMaker Studio provides a fully managed solution for data scientists to interactively build, train, and deploy machine learning (ML) models. Amazon SageMaker notebook jobs allow data scientists to run their notebooks on demand or on a schedule with a few clicks in SageMaker Studio. With this launch, you can programmatically run notebooks as jobs using APIs provided by Amazon SageMaker Pipelines, the ML workflow orchestration feature of Amazon SageMaker. Furthermore, you can create a multi-step ML workflow with multiple dependent notebooks using these APIs.

SageMaker Pipelines is a native workflow orchestration tool for building ML pipelines that take advantage of direct SageMaker integration. Each SageMaker pipeline is composed of steps, which correspond to individual tasks such as processing, training, or data processing using Amazon EMR. SageMaker notebook jobs are now available as a built-in step type in SageMaker Pipelines. You can use this notebook job step to easily run notebooks as jobs with just a few lines of code using the Amazon SageMaker Python SDK. Additionally, you can stitch multiple dependent notebooks together to create a workflow in the form of directed acyclic graphs (DAGs). You can then run these notebook jobs or DAGs, and manage and visualize them using SageMaker Studio.

Data scientists currently use SageMaker Studio to interactively develop their Jupyter notebooks and then use SageMaker notebook jobs to run these notebooks as scheduled jobs. These jobs can be run immediately or on a recurring time schedule without the need for data workers to refactor code as Python modules. Some common use cases for doing this include:

  • Running long-running notebooks in the background
  • Regularly running model inference to generate reports
  • Scaling up from preparing small sample datasets to working with petabyte-scale big data
  • Retraining and deploying models on some cadence
  • Scheduling jobs for model quality or data drift monitoring
  • Exploring the parameter space for better models

Although this functionality makes it straightforward for data workers to automate standalone notebooks, ML workflows often comprise several notebooks, each performing a specific task with complex dependencies. For instance, a notebook that monitors for model data drift should have a pre-step that allows extract, transform, and load (ETL) and processing of new data and a post-step of model refresh and training in case a significant drift is noticed. Furthermore, data scientists might want to trigger this entire workflow on a recurring schedule to update the model based on new data. To enable you to easily automate your notebooks and create such complex workflows, SageMaker notebook jobs are now available as a step in SageMaker Pipelines. In this post, we show how you can solve the following use cases with a few lines of code:

  • Programmatically run a standalone notebook immediately or on a recurring schedule
  • Create multi-step workflows of notebooks as DAGs for continuous integration and continuous delivery (CI/CD) purposes that can be managed via the SageMaker Studio UI

Solution overview

The following diagram illustrates our solution architecture. You can use the SageMaker Python SDK to run a single notebook job or a workflow. This feature creates a SageMaker training job to run the notebook.

In the following sections, we walk through a sample ML use case and showcase the steps to create a workflow of notebook jobs, passing parameters between different notebook steps, scheduling your workflow, and monitoring it via SageMaker Studio.

For our ML problem in this example, we are building a sentiment analysis model, which is a type of text classification task. The most common applications of sentiment analysis include social media monitoring, customer support management, and analyzing customer feedback. The dataset being used in this example is the Stanford Sentiment Treebank (SST2) dataset, which consists of movie reviews along with an integer (0 or 1) that indicates the positive or negative sentiment of the review.

The following is an example of a data.csv file corresponding to the SST2 dataset, and shows values in its first two columns. Note that the file shouldn’t have any header.

Sentiment label Review text
0 hide new secretions from the parental units
0 contains no wit , only labored gags
1 that loves its characters and communicates something rather beautiful about human nature
0 remains utterly satisfied to remain the same throughout
0 on the worst revenge-of-the-nerds clichés the filmmakers could dredge up
0 that ‘s far too tragic to merit such superficial treatment
1 demonstrates that the director of such hollywood blockbusters as patriot games can still turn out a small , personal film with an emotional wallop .

In this ML example, we must perform several tasks:

  1. Perform feature engineering to prepare this dataset in a format our model can understand.
  2. Post-feature engineering, run a training step that uses Transformers.
  3. Set up batch inference with the fine-tuned model to help predict the sentiment for new reviews that come in.
  4. Set up a data monitoring step so that we can regularly monitor our new data for any drift in quality that might require us to retrain the model weights.

With this launch of a notebook job as a step in SageMaker pipelines, we can orchestrate this workflow, which consists of three distinct steps. Each step of the workflow is developed in a different notebook, which are then converted into independent notebook jobs steps and connected as a pipeline:

  • Preprocessing – Download the public SST2 dataset from Amazon Simple Storage Service (Amazon S3) and create a CSV file for the notebook in Step 2 to run. The SST2 dataset is a text classification dataset with two labels (0 and 1) and a column of text to categorize.
  • Training – Take the shaped CSV file and run fine-tuning with BERT for text classification utilizing Transformers libraries. We use a test data preparation notebook as part of this step, which is a dependency for the fine-tuning and batch inference step. When fine-tuning is complete, this notebook is run using the %run magic and prepares a test dataset for sample inference with the fine-tuned model.
  • Transform and monitor – Perform batch inference and set up data quality with model monitoring to have a baseline dataset suggestion.

Run the notebooks

The sample code for this solution is available on GitHub.

Creating a SageMaker notebook job step is similar to creating other SageMaker Pipeline steps. In this notebook example, we use the SageMaker Python SDK to orchestrate the workflow. To create a notebook step in SageMaker Pipelines, you can define the following parameters:

  • Input notebook – The name of the notebook that this notebook step will be orchestrating. Here you can pass in the local path to the input notebook. Optionally, if this notebook has other notebooks it’s running, you can pass these in the AdditionalDependencies parameter for the notebook job step.
  • Image URI – The Docker image behind the notebook job step. This can be the predefined images that SageMaker already provides or a custom image that you have defined and pushed to Amazon Elastic Container Registry (Amazon ECR). Refer to the considerations section at the end of this post for supported images.
  • Kernel name – The name of the kernel that you are using on SageMaker Studio. This kernel spec is registered in the image that you have provided.
  • Instance type (optional) – The Amazon Elastic Compute Cloud (Amazon EC2) instance type behind the notebook job that you have defined and will be running.
  • Parameters (optional) – Parameters you can pass in that will be accessible for your notebook. These can be defined in key-value pairs. Additionally, these parameters can be modified between various notebook job runs or pipeline runs.

Our example has a total of five notebooks:

  • nb-job-pipeline.ipynb – This is our main notebook where we define our pipeline and workflow.
  • preprocess.ipynb – This notebook is the first step in our workflow and contains the code that will pull the public AWS dataset and create a CSV file out of it.
  • training.ipynb – This notebook is the second step in our workflow and contains code to take the CSV from the previous step and conduct local training and fine-tuning. This step also has a dependency from the prepare-test-set.ipynb notebook to pull down a test dataset for sample inference with the fine-tuned model.
  • prepare-test-set.ipynb – This notebook creates a test dataset that our training notebook will use in the second pipeline step and use for sample inference with the fine-tuned model.
  • transform-monitor.ipynb – This notebook is the third step in our workflow and takes the base BERT model and runs a SageMaker batch transform job, while also setting up data quality with model monitoring.

Next, we walk through the main notebook nb-job-pipeline.ipynb, which combines all the sub-notebooks into a pipeline and runs the end-to-end workflow. Note that although the following example only runs the notebook one time, you can also schedule the pipeline to run the notebook repeatedly. Refer to SageMaker documentation for detailed instructions.

For our first notebook job step, we pass in a parameter with a default S3 bucket. We can use this bucket to dump any artifacts we want available for our other pipeline steps. For the first notebook (preprocess.ipynb), we pull down the AWS public SST2 train dataset and create a training CSV file out of it that we push to this S3 bucket. See the following code:

# Parameters
print(default_s3_bucket)

import pandas as pd  # needed below to build the CSV (assumed available in the job image)

!aws s3 cp s3://sagemaker-sample-files/datasets/text/SST2/sst2.train sst2.train

# will read just the first 500 lines for quicker execution
with open('sst2.train', 'r') as f:
    lines = f.readlines()[:500] 

data = []
for line in lines:
    label, text = line.strip().split(' ', 1)
    data.append((int(label), text))

df = pd.DataFrame(data, columns=['label', 'text'])
df.to_csv("train.csv", index=False) #create csv file with smaller dataset
!aws s3 cp "train.csv" {default_s3_bucket}

We can then convert this notebook in a NotebookJobStep with the following code in our main notebook:

# provide S3 Bucket to dump artifacts in
nb_job_params = {"default_s3_bucket": notebook_artifacts}

# NotebookJobStep import (assumed location in the SageMaker Python SDK)
from sagemaker.workflow.notebook_job_step import NotebookJobStep

preprocess_nb_step = NotebookJobStep(
    name=preprocess_step_name,
    description=preprocess_description,
    notebook_job_name=preprocess_job_name,
    image_uri=image_uri,
    kernel_name=kernel_name,
    display_name=display_name,
    role=role,
    input_notebook=preprocess_notebook,
    instance_type="ml.m5.4xlarge",
    parameters=nb_job_params,
)

Now that we have a sample CSV file, we can start training our model in our training notebook. Our training notebook takes in the same parameter with the S3 bucket and pulls down the training dataset from that location. Then we perform fine-tuning by using the Transformers trainer object with the following code snippet:

from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

After fine-tuning, we want to run some batch inference to see how the model is performing. This is done using a separate notebook (prepare-test-set.ipynb) in the same local path that creates a test dataset to perform inference on using our trained model. We can run the additional notebook in our training notebook with the following magic cell:

%run 'prepare-test-set.ipynb'

We define this extra notebook dependency in the AdditionalDependencies parameter in our second notebook job step:

train_nb_step = NotebookJobStep(
    name=training_step_name,
    description=training_description,
    notebook_job_name=training_job_name,
    input_notebook=training_notebook,
    additional_dependencies=[test_data_prep_notebook],
    image_uri=image_uri,
    kernel_name=kernel_name,
    display_name=display_name,
    instance_type="ml.m5.12xlarge",
    role=role,
    parameters=nb_job_params,
)

We must also specify that the training notebook job step (Step 2) depends on the Preprocess notebook job step (Step 1) by using the add_depends_on API call as follows:

train_nb_step.add_depends_on([preprocess_nb_step])

Our last step takes the BERT model and runs a SageMaker batch transform job, while also setting up data capture and data quality via SageMaker Model Monitor. Note that this is different from using the built-in Transform or Capture steps via Pipelines. Our notebook for this step runs those same APIs, but is tracked as a notebook job step. This step depends on the training notebook job step that we previously defined, so we also capture that dependency with add_depends_on:

batch_monitor_step = NotebookJobStep(
    name=batch_monitor_step_name,
    description=batch_monitor_description,
    notebook_job_name=batch_monitor_job_name,
    input_notebook=batch_monitor_notebook,
    image_uri=image_uri,
    kernel_name=kernel_name,
    display_name=display_name,
    instance_type="ml.m5.12xlarge",
    role=role,
    parameters=nb_job_params,
)
batch_monitor_step.add_depends_on([train_nb_step])

After the various steps of our workflow have been defined, we can create and run the end-to-end pipeline:

# create pipeline
pipeline = Pipeline(
    name=pipeline_name,
    steps=[preprocess_nb_step, train_nb_step, batch_monitor_step],
)

# execute pipeline
pipeline.create(session.get_execution_role())
execution = pipeline.start(parameters={})
execution.wait(delay=30, max_attempts=60)
execution_steps = execution.list_steps()
print(execution_steps)

Monitor the pipeline runs

You can track and monitor the notebook step runs via the SageMaker Pipelines DAG, as seen in the following screenshot.

You can also optionally monitor the individual notebook runs on the notebook job dashboard and toggle the output files that have been created via the SageMaker Studio UI. When using this functionality outside of SageMaker Studio, you can define the users who can track the run status on the notebook job dashboard by using tags. For more details about tags to include, see View your notebook jobs and download outputs in the Studio UI dashboard.

For this example, we output the resulting notebook jobs to a directory called outputs in your local path with your pipeline run code. As shown in the following screenshot, here you can see the output of your input notebook and also any parameters you defined for that step.

Clean up

If you followed along with our example, be sure to delete the created pipeline, the notebook jobs, and the S3 data downloaded by the sample notebooks.

Considerations

The following are some important considerations for this feature:

Conclusion

With this launch, data workers can now programmatically run their notebooks with a few lines of code using the SageMaker Python SDK. Additionally, you can create complex multi-step workflows using your notebooks, significantly reducing the time needed to move from a notebook to a CI/CD pipeline. After creating the pipeline, you can use SageMaker Studio to view and run DAGs for your pipelines and manage and compare the runs. Whether you're scheduling end-to-end ML workflows or a part of them, we encourage you to try notebook-based workflows.


About the authors

Anchit Gupta is a Senior Product Manager for Amazon SageMaker Studio. She focuses on enabling interactive data science and data engineering workflows from within the SageMaker Studio IDE. In her spare time, she enjoys cooking, playing board/card games, and reading.

Ram Vegiraju is an ML Architect with the SageMaker Service team. He focuses on helping customers build and optimize their AI/ML solutions on Amazon SageMaker. In his spare time, he loves traveling and writing.

Edward Sun is a Senior SDE working on SageMaker Studio at Amazon Web Services. He is focused on building interactive ML solutions and simplifying the customer experience of integrating SageMaker Studio with popular technologies in the data engineering and ML ecosystem. In his spare time, Edward is a big fan of camping, hiking, and fishing, and enjoys spending time with his family.


Announcing new tools and capabilities to enable responsible AI innovation

Announcing new tools and capabilities to enable responsible AI innovation

The rapid growth of generative AI brings promising new innovation, and at the same time raises new challenges. These challenges include some that were common before generative AI, such as bias and explainability, and new ones unique to foundation models (FMs), including hallucination and toxicity. At AWS, we are committed to developing generative AI responsibly, taking a people-centric approach that prioritizes education, science, and our customers, to integrate responsible AI across the end-to-end AI lifecycle.

Over the past year, we have introduced new capabilities in our generative AI applications and models such as built-in security scanning in Amazon CodeWhisperer, training to detect and block harmful content in Amazon Titan, and data privacy protections in Amazon Bedrock. Our investment in safe, transparent, and responsible generative AI includes collaboration with the global community and policymakers as we encouraged and supported both the White House Voluntary AI commitments and AI Safety Summit in the UK. And we continue to work hand-in-hand with customers to operationalize responsible AI with purpose-built tools like Amazon SageMaker Clarify, ML Governance with Amazon SageMaker, and more.

Introducing new responsible AI innovation

As generative AI scales to new industries, organizations, and use cases, this growth must be accompanied by a sustained investment in responsible FM development. Customers want their FMs to be built with safety, fairness, and security in mind, so that they can in turn deploy AI responsibly. At AWS re:Invent this year, we are excited to announce new capabilities to foster responsible generative AI innovation, including new built-in tools, customer protections, resources to enhance transparency, and tools to combat disinformation. We aim to provide customers the information they need to evaluate FMs against key responsible AI considerations, like toxicity and robustness, and introduce guardrails to apply safeguards based on customer use cases and responsible AI policies. At the same time, our customers want to be better informed on the safety, fairness, security, and other properties of AI services and FMs as they use them within their own organization. We are excited to announce more resources to help customers better understand our AWS AI services and deliver the transparency they are asking for.

Implementing safeguards: Guardrails for Amazon Bedrock

Safety is a priority when it comes to introducing generative AI at scale. Organizations want to promote safe interactions between their customers and generative AI applications that avoid harmful or offensive language and align with company policies. The easiest way to do that is to put consistent safeguards in place across the whole organization so everyone can innovate safely. Yesterday we announced the preview of Guardrails for Amazon Bedrock—a new capability that makes it easy to implement application-specific safeguards based on customer use cases and responsible AI policies.

Guardrails drive consistency in how FMs on Amazon Bedrock respond to undesirable and harmful content within applications. Customers can apply guardrails to large language models on Amazon Bedrock as well as to fine-tuned models and in combination with Agents for Amazon Bedrock. Guardrails lets you specify topics to be avoided, and the service automatically detects and prevents queries and responses that fall into restricted categories. Customers can also configure content filter thresholds across categories including hate speech, insults, sexualized language, and violence to filter out harmful content to the desired level. For example, an online banking application can be set up to avoid providing investment advice and limit inappropriate content (such as hate speech, insults, and violence). In the near future, customers will also be able to redact personally identifiable information (PII) in user inputs and FMs’ responses, set profanity filters, and provide a list of custom words to block in interactions between users and FMs, improving compliance and further protecting users. With Guardrails, you can innovate faster with generative AI while maintaining protections and safeguards consistent with company policies.

Identifying the best FM for a specific use case: Model Evaluation in Amazon Bedrock

Today, organizations have a wide range of FM options to power their generative AI applications. To strike the right balance of accuracy and performance for their use case, organizations must efficiently compare models and find the best option based on the key responsible AI and quality metrics that matter to them. To evaluate models, organizations must first spend days identifying benchmarks, setting up evaluation tools, and running assessments, all of which require deep expertise in data science. Furthermore, these tests are not useful for evaluating subjective criteria (e.g., brand voice, relevance, and style) that require judgment through tedious, time-intensive, human-review workflows. The time, expertise, and resources required for these evaluations, repeated for every new use case, make it difficult for organizations to evaluate models against responsible AI dimensions and make an informed choice around which model will provide the most accurate, safe experience for their customers.

Now available in preview, Model Evaluation on Amazon Bedrock helps customers evaluate, compare, and select the best FMs for their specific use case based on custom metrics, such as accuracy and safety, using either automatic or human evaluations. In the Amazon Bedrock console, customers choose the FMs they want to compare for a given task, such as question-answering or content summarization. For automatic evaluations, customers select predefined evaluation criteria (e.g., accuracy, robustness, and toxicity) and upload their own testing dataset or select from built-in, publicly available datasets. For subjective criteria or nuanced content requiring  judgment, customers can easily set up human-based evaluation workflows with just a few clicks. These workflows leverage a customer’s in-house workteam, or use a managed workforce provided by AWS, to evaluate model responses. During human-based evaluations, customers define use case-specific metrics (e.g., relevance, style, and brand voice). Once customers finish the setup process, Amazon Bedrock runs evaluations and generates a report, so customers can easily understand how the model performed across key safety and accuracy criteria and select the best model for their use case.

This ability to evaluate models is not limited to Amazon Bedrock; customers can also use model evaluation in Amazon SageMaker Clarify to easily evaluate, compare, and select the best FM option across key quality and responsibility metrics such as accuracy, robustness, and toxicity, across all FMs.

Combating disinformation: Watermarking in Amazon Titan

Today, we announced Amazon Titan Image Generator in preview, which empowers customers to rapidly produce and enhance high-quality images at scale. We considered responsible AI during each stage of the model development process, including training data selection, building filtering capabilities to detect and remove inappropriate user inputs and model outputs, and improving demographic diversity of our model outputs. All Amazon Titan-generated images contain an invisible watermark by default, which is designed to help reduce the spread of disinformation by providing a discreet mechanism to identify AI-generated images. AWS is among the first model providers to widely release built-in invisible watermarks that are integrated into image outputs and are designed to be resistant to alterations.

Building trust: Standing behind our models and applications with indemnification

Building customer trust is core to AWS. We have been on a journey with our customers since our inception, and with the growth of generative AI, we remain committed to building innovative technology together. To enable customers to harness the power of our generative AI, they need to know they are protected. AWS offers copyright indemnity coverage for outputs of the following Amazon generative AI services: Amazon Titan Text Express, Amazon Titan Text Lite, Amazon Titan Embeddings, Amazon Titan Multimodal Embeddings, Amazon CodeWhisperer Professional, AWS HealthScribe, Amazon Lex, and Amazon Personalize. This means that customers who use the models responsibly are protected from third-party claims alleging copyright infringement by the outputs generated by those services (see Section 50.10 of the Service Terms). In addition, our standard IP indemnity for use of the services protects customers from third-party claims alleging IP infringement by the services and the data used to train them. To put it another way, if you use an Amazon generative AI service listed above and someone sues you for IP infringement, AWS will defend that lawsuit, which includes covering any judgment against you or settlement costs.

We stand behind our generative AI services and work to continually improve them. As AWS launches new services and generative AI continues to evolve, AWS will continue to relentlessly focus on earning and maintaining customer trust.

Enhancing transparency: AWS AI Service Card for Amazon Titan Text

We introduced AWS AI Service Cards at re:Invent 2022 as a transparency resource to help customers better understand our AWS AI services. AI Service Cards are a form of responsible AI documentation that provide customers with a single place to find information on the intended use cases and limitations, responsible AI design choices, and deployment and performance optimization best practices for our AI services. They are part of a comprehensive development process we undertake to build our services in a responsible way that addresses fairness, explainability, veracity and robustness, governance, transparency, privacy and security, safety, and controllability.

At re:Invent this year we are announcing a new AI Service Card for Amazon Titan Text to increase transparency in foundation models. We are also launching four new AI Service Cards: Amazon Comprehend Detect PII, Amazon Transcribe Toxicity Detection, Amazon Rekognition Face Liveness, and AWS HealthScribe. You can explore each of these cards on the AWS website. As generative AI continues to grow and evolve, transparency on how technology is developed, tested, and used will be a vital component to earn the trust of organizations and their customers alike. At AWS, we are committed to continuing to bring transparency resources like AI Service Cards to the broader community, and to iterate and gather feedback on the best ways forward.

Investing in responsible AI across the entire generative AI lifecycle

We are excited about the new innovations announced at re:Invent this week that give our customers more tools, resources, and built-in protections to build and use generative AI safely. From model evaluation to guardrails to watermarking, customers can now bring generative AI to their organization faster, while mitigating risk. New protections for customers like IP indemnity coverage and new resources to enhance transparency like additional AI Service Cards are also key examples of our commitment to build trust across technology companies, policymakers, community groups, scientists, and more. We continue to make meaningful investments in responsible AI across the lifecycle of a foundation model, to help our customers scale AI in a safe, secure, and responsible way.


About the Authors

Peter Hallinan leads initiatives in the science and practice of Responsible AI at AWS AI, alongside a team of responsible AI experts. He has deep expertise in AI (PhD, Harvard) and entrepreneurship (Blindsight, sold to Amazon). His volunteer activities have included serving as a consulting professor at the Stanford University School of Medicine, and as the president of the American Chamber of Commerce in Madagascar. When possible, he's off in the mountains with his children: skiing, climbing, hiking, and rafting.

Vasi Philomin is currently the VP of Generative AI at AWS. He leads generative AI efforts including Amazon Bedrock, Amazon Titan, and Amazon CodeWhisperer.


Introducing the AWS Generative AI Innovation Center’s Custom Model Program for Anthropic Claude

Introducing the AWS Generative AI Innovation Center’s Custom Model Program for Anthropic Claude

Since launching in June 2023, the AWS Generative AI Innovation Center team of strategists, data scientists, machine learning (ML) engineers, and solutions architects have worked with hundreds of customers worldwide, and helped them ideate, prioritize, and build bespoke solutions that harness the power of generative AI. Customers worked closely with us to prioritize use cases, select the right foundation models (FMs), incorporate responsible AI principles, develop proofs of concept, optimize solutions, and launch them at scale. Today, we are excited to announce the AWS Generative AI Innovation Center Custom Model Program for Anthropic Claude. Starting in Q1 2024, customers can engage with researchers and ML scientists from the Generative AI Innovation Center to fine-tune Anthropic Claude models securely with their own proprietary data.

For most use cases, customers can use the high-performing FMs from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon, all available in Amazon Bedrock via a single API. Techniques such as prompt engineering, few-shot learning, and Retrieval Augmented Generation (RAG) can also help customize model responses for your business context and specific tasks without the need for further training. However, some applications will benefit from deeper customization through model fine-tuning. Fine-tuning refers to taking a general-purpose FM and adapting it to improve performance on specific tasks or domains using relatively smaller, but high-quality, labeled datasets. Fine-tuning typically results in better performance on specific tasks compared to the base FM. This additional task-specific training helps the model get better at the applications you care about. The resulting models are also unique to the fine-tuning data used, enabling enterprises to develop differentiated solutions based on their private company data sources.

Fine-tuning, aligning, and optimizing Anthropic Claude models for complex tasks and domains requires deep AI expertise. Starting in Q1 2024, customers can engage with a team of experts from the AWS Generative AI Innovation Center and fine-tune Claude models with their proprietary data sources. Our experts will help you scope requirements for model customization, define evaluation criteria, and work with your proprietary data for fine-tuning. We will collaborate with the Anthropic science team and align the fine-tuned models to meet your needs. You can privately access the fine-tuned models directly through Amazon Bedrock, enabling the same API integrations you use today without the need to manage deployments or infrastructure.

To learn more about the program, contact your AWS account team.


About the authors

Sri Elaprolu currently serves as the Head of the AWS Generative AI Innovation Center. He leads a large team of machine learning scientists, engineers, and strategists that work with global enterprises and public sector organizations to address challenging problems and opportunities using generative AI. Previously, he led science teams that helped hundreds of AWS customers, including the NFL, Cerner, NASA, and the U.S. Department of Defense, leverage AWS AI/ML to drive business and mission outcomes.


Learn how to assess the risk of AI systems

Learn how to assess the risk of AI systems

Artificial intelligence (AI) is a rapidly evolving field with the potential to improve and transform many aspects of society. In 2023, the pace of adoption of AI technologies has accelerated further with the development of powerful foundation models (FMs) and a resulting advancement in generative AI capabilities.

At Amazon, we have launched multiple generative AI services, such as Amazon Bedrock and Amazon CodeWhisperer, and have made a range of highly capable generative models available through Amazon SageMaker JumpStart. These services are designed to support our customers in unlocking the emerging capabilities of generative AI, including enhanced creativity, personalized and dynamic content creation, and innovative design. They can also enable AI practitioners to make sense of the world as never before—addressing language barriers, climate change, accelerating scientific discoveries, and more.

To realize the full potential of generative AI, however, it’s important to carefully reflect on any potential risks. First and foremost, this benefits the stakeholders of the AI system by promoting responsible and safe development and deployment, and by encouraging the adoption of proactive measures to address potential impact. Consequently, establishing mechanisms to assess and manage risk is an important process for AI practitioners to consider and has become a core component of many emerging AI industry standards (for example, ISO 42001, ISO 23894, and NIST RMF) and legislation (such as EU AI Act).

In this post, we discuss how to assess the potential risk of your AI system.

What are the different levels of risk?

While it might be easier to start looking at an individual machine learning (ML) model and the associated risks in isolation, it’s important to consider the details of the specific application of such a model and the corresponding use case as part of a complete AI system. In fact, a typical AI system is likely to be based on multiple different ML models working together, and an organization might be looking to build multiple different AI systems. Consequently, risks can be evaluated for each use case and at different levels, namely model risk, AI system risk, and enterprise risk.

Enterprise risk encompasses the broad spectrum of risks that an organization may face, including financial, operational, and strategic risks. AI system risk focuses on the impact associated with the implementation and operation of AI systems, whereas ML model risk pertains specifically to the vulnerabilities and uncertainties inherent in ML models.

In this post, we focus on AI system risk, primarily. However, it’s important to note that all different levels of risk management within an organization should be considered and aligned.

How is AI system risk defined?

Risk management in the context of an AI system can be a path to minimize the effect of uncertainty or potential negative impacts, while also providing opportunities to maximize positive impacts. Risk itself is not a potential harm but the effect of uncertainty on objectives. According to the NIST Risk Management Framework (NIST RMF), risk can be estimated as a multiplicative measure of an event's probability of occurring and the magnitude of the consequences of the corresponding event.

There are two aspects to risk: inherent risk and residual risk. Inherent risk represents the amount of risk the AI system exhibits in absence of mitigations or controls. Residual risk captures the remaining risks after factoring in mitigation strategies.

Always keep in mind that risk assessment is a human-centric activity that requires organization-wide efforts; these efforts range from ensuring all relevant stakeholders are included in the assessment process (such as product, engineering, science, sales, and security teams) to assessing how social perspectives and norms influence the perceived likelihood and consequences of certain events.

Why should your organization care about risk evaluation?

Establishing risk management frameworks for AI systems can benefit society at large by promoting the safe and responsible design, development and operation of AI systems. Risk management frameworks can also benefit organizations through the following:

  • Improved decision-making – By understanding the risks associated with AI systems, organizations can make better decisions about how to mitigate those risks and use AI systems in a safe and responsible manner
  • Increased compliance planning – A risk assessment framework can help organizations prepare for risk assessment requirements in relevant laws and regulations
  • Building trust – By demonstrating that they are taking steps to mitigate the risks of AI systems, organizations can show their customers and stakeholders that they are committed to using AI in a safe and responsible manner

How to assess risk?

As a first step, an organization should consider describing the AI use case that needs to be assessed and identify all relevant stakeholders. A use case is a specific scenario or situation that describes how users interact with an AI system to achieve a particular goal. When creating a use case description, it can be helpful to specify the business problem being solved, list the stakeholders involved, characterize the workflow, and provide details regarding key inputs and outputs of the system.

When it comes to stakeholders, it’s easy to overlook some. The following figure is a good starting point to map out AI stakeholder roles.

Source: “Information technology – Artificial intelligence – Artificial intelligence concepts and terminology”.

An important next step of the AI system risk assessment is to identify potentially harmful events associated with the use case. In considering these events, it can be helpful to reflect on different dimensions of responsible AI, such as fairness and robustness, for example. Different stakeholders might be affected to different degrees along different dimensions. For example, a low robustness risk for an end-user could be the result of an AI system exhibiting minor disruptions, whereas a low fairness risk could be caused by an AI system producing negligibly different outputs for different demographic groups.

To estimate the risk of an event, you can use a likelihood scale in combination with a severity scale to measure the probability of occurrence as well as the degree of consequences. A helpful starting point when developing these scales might be the NIST RMF, which suggests using qualitative, nonnumerical categories ranging from very low to very high risk, or semi-quantitative assessment principles, such as scales (for example, 1–10), bins, or otherwise representative numbers. After you have defined the likelihood and severity scales for all relevant dimensions, you can use a risk matrix scheme to quantify the overall risk per stakeholder along each relevant dimension. The following figure shows an example risk matrix.

Using this risk matrix, we can consider an event with low severity and rare likelihood of occurring as very low risk. Keep in mind that the initial assessment will be an estimate of inherent risk, and risk mitigation strategies can help lower the risk levels further. The process can then be repeated to generate a rating for any remaining residual risk per event. If there are multiple events identified along the same dimension, it can be helpful to pick the highest risk level among all to create a final assessment summary.
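
To make the scheme concrete, the following minimal sketch maps likelihood and severity ratings to an overall risk level; the 1–5 scales and the thresholds are illustrative choices, not values prescribed by the NIST RMF.

LIKELIHOOD = {"rare": 1, "unlikely": 2, "possible": 3, "likely": 4, "frequent": 5}
SEVERITY = {"very low": 1, "low": 2, "moderate": 3, "high": 4, "very high": 5}

def risk_level(likelihood: str, severity: str) -> str:
    # Semi-quantitative risk: probability of the event combined multiplicatively
    # with the magnitude of its consequences
    score = LIKELIHOOD[likelihood] * SEVERITY[severity]
    if score <= 4:
        return "very low"
    elif score <= 9:
        return "low"
    elif score <= 14:
        return "medium"
    elif score <= 19:
        return "high"
    return "very high"

# An event with low severity and a rare likelihood of occurring lands in the lowest band
print(risk_level("rare", "low"))  # very low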

Using the final assessment summary, organizations will have to define what risk levels are acceptable for their AI systems as well as consider relevant regulations and policies.

AWS commitment

Through engagements with the White House and UN, among others, we are committed to sharing our knowledge and expertise to advance the responsible and secure use of AI. Along these lines, Amazon’s Adam Selipsky recently represented AWS at the AI Safety Summit with heads of state and industry leaders in attendance, further demonstrating our dedication to collaborating on the responsible advancement of artificial intelligence.

Conclusion

As AI continues to advance, risk assessment is becoming increasingly important and useful for organizations looking to build and deploy AI responsibly. By establishing a risk assessment framework and risk mitigation plan, organizations can reduce the risk of potential AI-related incidents and earn trust with their customers, as well as reap benefits such as improved reliability, improved fairness for different demographics, and more.

Go ahead and get started on your journey of developing a risk assessment framework in your organization and share your thoughts in the comments.

Also check out an overview of generative AI risks published on Amazon Science: Responsible AI in the generative era, and explore the range of AWS services that can support you on your risk assessment and mitigation journey: Amazon SageMaker Clarify, Amazon SageMaker Model Monitor, AWS CloudTrail, as well as the model governance framework.


About the Authors

Mia C. Mayer is an Applied Scientist and ML educator at AWS Machine Learning University; where she researches and teaches safety, explainability and fairness of Machine Learning and AI systems. Throughout her career, Mia established several university outreach programs, acted as a guest lecturer and keynote speaker, and presented at numerous large learning conferences. She also helps internal teams and AWS customers get started on their responsible AI journey.

Denis V. Batalov is a 17-year Amazon veteran and holds a PhD in Machine Learning. Denis worked on such exciting projects as Search Inside the Book, Amazon Mobile apps, and Kindle Direct Publishing. Since 2013 he has helped AWS customers adopt AI/ML technology as a Solutions Architect. Currently, Denis is a Worldwide Tech Leader for AI/ML, responsible for the functioning of AWS ML Specialist Solutions Architects globally. Denis is a frequent public speaker; you can follow him on Twitter @dbatalov.

Dr. Sara Liu is a Senior Technical Program Manager with the AWS Responsible AI team. She works with a team of scientists, dataset leads, ML engineers, researchers, as well as other cross-functional teams to raise the responsible AI bar across AWS AI services. Her current projects involve developing AI service cards, conducting risk assessments for responsible AI, creating high-quality evaluation datasets, and implementing quality programs. She also helps internal teams and customers meet evolving AI industry standards.


Introducing three new NVIDIA GPU-based Amazon EC2 instances

Introducing three new NVIDIA GPU-based Amazon EC2 instances

The Amazon Elastic Compute Cloud (Amazon EC2) accelerated computing portfolio offers the broadest choice of accelerators to power your artificial intelligence (AI), machine learning (ML), graphics, and high performance computing (HPC) workloads. We are excited to announce the expansion of this portfolio with three new instances featuring the latest NVIDIA GPUs: Amazon EC2 P5e instances powered by NVIDIA H200 GPUs, Amazon EC2 G6 instances featuring NVIDIA L4 GPUs, and Amazon EC2 G6e instances powered by NVIDIA L40S GPUs. All three instances will be available in 2024, and we look forward to seeing what you can do with them.

AWS and NVIDIA have collaborated for over 13 years and have pioneered large-scale, highly performant, and cost-effective GPU-based solutions for developers and enterprises across the spectrum. We have combined NVIDIA's powerful GPUs with differentiated AWS technologies such as the AWS Nitro System, 3,200 Gbps of Elastic Fabric Adapter (EFA) v2 networking, hundreds of GB/s of data throughput with Amazon FSx for Lustre, and exascale computing with Amazon EC2 UltraClusters to deliver the most performant infrastructure for AI/ML, graphics, and HPC. Coupled with other managed services such as Amazon Bedrock, Amazon SageMaker, and Amazon Elastic Kubernetes Service (Amazon EKS), these instances provide developers with the industry's best platform for building and deploying generative AI, HPC, and graphics applications.

High-performance and cost-effective GPU-based instances for AI, HPC, and graphics workloads

To power the development, training, and inference of the largest large language models (LLMs), EC2 P5e instances will feature NVIDIA's latest H200 GPUs, which offer 141 GB of HBM3e GPU memory, 1.7 times larger and 1.4 times faster than that of H100 GPUs. This boost in GPU memory, along with up to 3,200 Gbps of EFA networking enabled by the AWS Nitro System, will enable you to continue to build, train, and deploy your cutting-edge models on AWS.

EC2 G6e instances, featuring NVIDIA L40S GPUs, are built to provide developers with a broadly available option for training and inference of publicly available LLMs, as well as support the increasing adoption of Small Language Models (SLM). They are also optimal for digital twin applications that use NVIDIA Omniverse for describing and simulating across 3D tools and applications, and for creating virtual worlds and advanced workflows for industrial digitalization.

EC2 G6 instances, featuring NVIDIA L4 GPUs, will deliver a lower-cost, energy-efficient solution for deploying ML models for natural language processing, language translation, video and image analysis, speech recognition, and personalization as well as graphics workloads, such as creating and rendering real-time, cinematic-quality graphics and game streaming.


About the Author

Chetan Kapoor is the Director of Product Management for the Amazon EC2 Accelerated Computing Portfolio.


Boost inference performance for LLMs with new Amazon SageMaker containers

Boost inference performance for LLMs with new Amazon SageMaker containers

Today, Amazon SageMaker launches a new version (0.25.0) of Large Model Inference (LMI) Deep Learning Containers (DLCs) and adds support for NVIDIA's TensorRT-LLM library. With these upgrades, you can effortlessly access state-of-the-art tooling to optimize large language models (LLMs) on SageMaker and achieve price-performance benefits: the Amazon SageMaker LMI TensorRT-LLM DLC reduces latency by 33% on average and improves throughput by 60% on average for the Llama2-70B, Falcon-40B, and CodeLlama-34B models, compared to the previous version.

LLMs have seen an unprecedented growth in popularity across a broad spectrum of applications. However, these models are often too large to fit on a single accelerator or GPU device, making it difficult to achieve low-latency inference and scale. SageMaker offers LMI DLCs to help you maximize the utilization of available resources and improve performance. The latest LMI DLCs offer continuous batching support for inference requests to improve throughput, efficient inference collective operations to improve latency, Paged Attention V2 (which improves the performance of workloads with longer sequence lengths), and the latest TensorRT-LLM library from NVIDIA to maximize performance on GPUs. LMI DLCs offer a low-code interface that simplifies compilation with TensorRT-LLM by just requiring the model ID and optional model parameters; all of the heavy lifting required with building a TensorRT-LLM optimized model and creating a model repo is managed by the LMI DLC. In addition, you can use the latest quantization techniques—GPTQ, AWQ, and SmoothQuant—that are available with LMI DLCs. As a result, with LMI DLCs on SageMaker, you can accelerate time-to-value for your generative AI applications and optimize LLMs for the hardware of your choice to achieve best-in-class price-performance.

In this post, we dive deep into the new features with the latest release of LMI DLCs, discuss performance benchmarks, and outline the steps required to deploy LLMs with LMI DLCs to maximize performance and reduce costs.

New features with SageMaker LMI DLCs

In this section, we discuss three new features with SageMaker LMI DLCs.

SageMaker LMI now supports TensorRT-LLM

SageMaker now offers NVIDIA’s TensorRT-LLM as part of the latest LMI DLC release (0.25.0), enabling state-of-the-art optimizations like SmoothQuant, FP8, and continuous batching for LLMs when using NVIDIA GPUs. TensorRT-LLM opens the door to ultra-low latency experiences that can greatly improve performance. The TensorRT-LLM SDK supports deployments ranging from single-GPU to multi-GPU configurations, with additional performance gains possible through techniques like tensor parallelism. To use the TensorRT-LLM library, choose the TensorRT-LLM DLC from the available LMI DLCs and set engine=MPI among other settings such as option.model_id. The following diagram illustrates the TensorRT-LLM tech stack.

Efficient inference collective operations

In a typical deployment of LLMs, model parameters are spread across multiple accelerators to accommodate the requirements of a large model that can’t fit on a single accelerator. This enhances inference speed by enabling each accelerator to carry out partial calculations in parallel. Afterwards, a collective operation is introduced to consolidate these partial results at the end of these processes, and redistribute them among the accelerators.
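
As a toy illustration of why this collective step matters (a single-process NumPy sketch, not the actual SageMaker implementation), consider a linear layer whose weights are sharded across two accelerators: each shard produces a partial product, and summing the partials, which is what the collective operation does across devices, recovers the full result.

import numpy as np

x = np.random.rand(1, 8)                 # activations
w = np.random.rand(8, 4)                 # full weight matrix of the layer

x_shards = np.split(x, 2, axis=1)        # each "device" holds part of the activations
w_shards = np.split(w, 2, axis=0)        # ...and the matching slice of the weights

partials = [xs @ ws for xs, ws in zip(x_shards, w_shards)]  # per-device partial results
output = sum(partials)                   # collective operation: sum partials across devices

assert np.allclose(output, x @ w)        # identical to computing the layer on one device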

For P4D instance types, SageMaker implements a new collective operation that speeds up communication between GPUs. As a result, you get lower latency and higher throughput with the latest LMI DLCs compared to previous versions. Furthermore, this feature works out of the box with LMI DLCs; you don't need to configure anything to use it because it's embedded in the SageMaker LMI DLCs and is exclusively available on Amazon SageMaker.

Quantization support

SageMaker LMI DLCs now support the latest quantization techniques, including pre-quantized models with GPTQ, Activation-aware Weight Quantization (AWQ), and just-in-time quantization like SmoothQuant.

GPTQ allows LMI to run popular INT3 and INT4 models from Hugging Face. It offers the smallest possible model weights that can fit on a single GPU/multi-GPU. LMI DLCs also support AWQ inference, which allows faster inference speed. Finally, LMI DLCs now support SmoothQuant, which allows INT8 quantization to reduce the memory footprint and computational cost of models with minimal loss in accuracy. Currently, we allow you to do just-in-time conversion for SmoothQuant models without any additional steps. GPTQ and AWQ need to be quantized with a dataset to be used with LMI DLCs. You can also pick up popular pre-quantized GPTQ and AWQ models to use on LMI DLCs. To use SmoothQuant, set option.quantize=smoothquant with engine=DeepSpeed in serving.properties. A sample notebook using SmoothQuant for hosting GPT-Neox on ml.g5.12xlarge is located on GitHub.
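
For illustration, the following sketch writes a serving.properties that enables just-in-time SmoothQuant quantization with the DeepSpeed engine; the model ID placeholder, tensor parallel degree, and local path are assumptions to adapt to your own model and instance.

import os

serving_properties = """\
engine=DeepSpeed
option.model_id={{model_id_or_s3url}}
option.tensor_parallel_degree=4
option.quantize=smoothquant
"""

# Package this file (plus any custom inference code) into the model artifacts
# that you upload to Amazon S3 for the endpoint
os.makedirs("smoothquant_model", exist_ok=True)
with open("smoothquant_model/serving.properties", "w") as f:
    f.write(serving_properties)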

Using SageMaker LMI DLCs

You can deploy your LLMs on SageMaker using the new LMI DLCs 0.25.0 without any changes to your code. SageMaker LMI DLCs use DJL serving to serve your model for inference. To get started, you just need to create a configuration file that specifies settings like model parallelization and inference optimization libraries to use. For instructions and tutorials on using SageMaker LMI DLCs, refer to Model parallelism and large model inference and our list of available SageMaker LMI DLCs.

The DeepSpeed container includes a library called LMI Distributed Inference Library (LMI-Dist). LMI-Dist is an inference library used to run large model inference with the best optimization used in different open-source libraries, across vLLM, Text-Generation-Inference (up to version 0.9.4), FasterTransformer, and DeepSpeed frameworks. This library incorporates open-source popular technologies like FlashAttention, PagedAttention, FusedKernel, and efficient GPU communication kernels to accelerate the model and reduce memory consumption.

TensorRT-LLM is an open-source library released by NVIDIA in October 2023. We optimized the TensorRT-LLM library for inference speedup and created a toolkit to simplify the user experience by supporting just-in-time model conversion. This toolkit enables users to provide a Hugging Face model ID and deploy the model end-to-end. It also supports continuous batching with streaming. You can expect approximately 1–2 minutes to compile the Llama-2 7B and 13B models, and around 7 minutes for the 70B model. If you want to avoid this compilation overhead during SageMaker endpoint setup and scaling of instances, we recommend using ahead-of-time (AOT) compilation with our tutorial to prepare the model. We also accept any TensorRT-LLM model built for Triton Server that can be used with LMI DLCs.

Performance benchmarking results

We compared the performance of the latest SageMaker LMI DLCs version (0.25.0) to the previous version (0.23.0). We conducted experiments on the Llama-2 70B, Falcon 40B, and CodeLlama 34B models to demonstrate the performance gain with TensorRT-LLM and efficient inference collective operations (available on SageMaker).

SageMaker LMI containers come with a default handler script to load and host models, providing a low-code option. You also have the option to bring your own script if you need to do any customizations to the model loading steps. You need to pass the required parameters in a serving.properties file. This file contains the required configurations for the Deep Java Library (DJL) model server to download and host the model. The following code is the serving.properties used for our deployment and benchmarking:

engine=MPI
option.use_custom_all_reduce=true 
option.model_id={{s3url}}
option.tensor_parallel_degree=8
option.output_formatter=json
option.max_rolling_batch_size=64
option.model_loading_timeout=3600

The engine parameter is used to define the runtime engine for the DJL model server. We can specify the Hugging Face model ID or Amazon Simple Storage Service (Amazon S3) location of the model using the model_id parameter. The task parameter is used to define the natural language processing (NLP) task. The tensor_parallel_degree parameter sets the number of devices over which the tensor parallel modules are distributed. The use_custom_all_reduce parameter is set to true for GPU instances that have NVLink enabled to speed up model inference. You can set this for P4D, P4de, P5 and other GPUs that have NVLink connected. The output_formatter parameter sets the output format. The max_rolling_batch_size parameter sets the limit for the maximum number of concurrent requests. The model_loading_timeout sets the timeout value for downloading and loading the model to serve inference. For more details on the configuration options, refer to Configurations and settings.
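
For reference, the following is a minimal sketch of deploying a model packaged with this serving.properties behind a SageMaker real-time endpoint using the SageMaker Python SDK; the container image URI, S3 locations, endpoint name, and request payload shape are placeholder assumptions to replace with your own values.

import sagemaker
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Placeholder LMI TensorRT-LLM DLC image URI for your Region
image_uri = "<account>.dkr.ecr.<region>.amazonaws.com/djl-inference:0.25.0-tensorrtllm"
endpoint_name = "lmi-trtllm-llama2-70b"

model = Model(
    image_uri=image_uri,
    model_data="s3://<bucket>/llama2-70b/model.tar.gz",  # contains serving.properties
    role=role,
    sagemaker_session=session,
)

model.deploy(
    initial_instance_count=1,
    instance_type="ml.p4d.24xlarge",
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=900,  # allow time for download and compilation
)

predictor = Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=session,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)
print(predictor.predict({"inputs": "What is Amazon SageMaker?", "parameters": {"max_new_tokens": 64}}))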

Llama-2 70B

The following are the performance comparison results of Llama-2 70B. Latency reduced by 28% and throughput increased by 44% for concurrency of 16, with the new LMI TensorRT LLM DLC.

Falcon 40B

The following figures compare Falcon 40B. Latency reduced by 36% and throughput increased by 59% for concurrency of 16, with the new LMI TensorRT LLM DLC.

CodeLlama 34B

The following figures compare CodeLlama 34B. Latency reduced by 36% and throughput increased by 77% for concurrency of 16, with the new LMI TensorRT LLM DLC.

Recommended configuration and container for hosting LLMs

With the latest release, SageMaker provides two containers: 0.25.0-deepspeed and 0.25.0-tensorrtllm. The DeepSpeed container contains DeepSpeed and the LMI Distributed Inference Library (LMI-Dist). The TensorRT-LLM container includes NVIDIA's TensorRT-LLM library to accelerate LLM inference.

We recommend the deployment configuration illustrated in the following diagram.

To get started, refer to the sample notebooks:

Conclusion

In this post, we showed how you can use SageMaker LMI DLCs to optimize LLMs for your business use case and achieve price-performance benefits. To learn more about LMI DLC capabilities, refer to Model parallelism and large model inference. We’re excited to see how you use these new capabilities from Amazon SageMaker.


About the authors

Michael Nguyen is a Senior Startup Solutions Architect at AWS, specializing in leveraging AI/ML to drive innovation and develop business solutions on AWS. Michael holds 12 AWS certifications and has a BS/MS in Electrical/Computer Engineering and an MBA from Penn State University, Binghamton University, and the University of Delaware.

Rishabh Ray Chaudhury is a Senior Product Manager with Amazon SageMaker, focusing on Machine Learning inference. He is passionate about innovating and building new experiences for Machine Learning customers on AWS to help scale their workloads. In his spare time, he enjoys traveling and cooking. You can find him on LinkedIn.

Qing Lan is a Software Development Engineer in AWS. He has been working on several challenging products in Amazon, including high performance ML inference solutions and high performance logging system. Qing’s team successfully launched the first Billion-parameter model in Amazon Advertising with very low latency required. Qing has in-depth knowledge on the infrastructure optimization and Deep Learning acceleration.

Jian Sheng is a Software Development Engineer at Amazon Web Services who has worked on several key aspects of machine learning systems. He has been a key contributor to the SageMaker Neo service, focusing on deep learning compilation and framework runtime optimization. Recently, he has directed his efforts and contributed to optimizing the machine learning system for large model inference.

Vivek Gangasani is an AI/ML Startup Solutions Architect for generative AI startups at AWS. He helps emerging GenAI startups build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.

Harish Tummalacherla is a Software Engineer with the Deep Learning Performance team at SageMaker. He works on performance engineering for serving large language models efficiently on SageMaker. In his spare time, he enjoys running, cycling, and ski mountaineering.


Simplify data prep for generative AI with Amazon SageMaker Data Wrangler

Simplify data prep for generative AI with Amazon SageMaker Data Wrangler

Generative artificial intelligence (generative AI) models have demonstrated impressive capabilities in generating high-quality text, images, and other content. However, these models require massive amounts of clean, structured training data to reach their full potential. Most real-world data exists in unstructured formats like PDFs, which requires preprocessing before it can be used effectively.

According to IDC, unstructured data accounts for over 80% of all business data today. This includes formats like emails, PDFs, scanned documents, images, audio, video, and more. While this data holds valuable insights, its unstructured nature makes it difficult for AI algorithms to interpret and learn from it. According to a 2019 survey by Deloitte, only 18% of businesses reported being able to take advantage of unstructured data.

As AI adoption continues to accelerate, developing efficient mechanisms for digesting and learning from unstructured data becomes even more critical. This could involve better preprocessing tools, semi-supervised learning techniques, and advances in natural language processing. Companies that use their unstructured data most effectively will gain significant competitive advantages from AI. Clean data is important for good model performance. Extracted texts still contain large amounts of gibberish and boilerplate text (for example, raw HTML). Scraped data from the internet often contains many duplications. Data from social media, reviews, or any user-generated content can also contain toxic and biased content, which you may need to filter out using some preprocessing steps. There can also be a lot of low-quality content or bot-generated text, which can be filtered out using accompanying metadata (for example, filtering out customer service responses that received low customer ratings).

Data preparation is important at multiple stages in Retrieval Augmented Generation (RAG) models. The knowledge source documents need preprocessing, like cleaning text and generating semantic embeddings, so they can be efficiently indexed and retrieved. The user’s natural language query also requires preprocessing, so it can be encoded into a vector and compared to document embeddings. After retrieving relevant contexts, they may need additional preprocessing, like truncation, before being concatenated to the user’s query to create the final prompt for the foundation model.

Solution overview

In this post, we work with a PDF documentation dataset: the Amazon Bedrock user guide. Further, we show how to preprocess a dataset for RAG. Specifically, we clean the data and create RAG artifacts to answer questions about the content of the dataset. Consider the following machine learning (ML) problem: a user asks a large language model (LLM) the question "How to filter and search models in Amazon Bedrock?". The LLM has not seen the documentation during the training or fine-tuning stage, so it wouldn't be able to answer the question and would most probably hallucinate. Our goal with this post is to find a relevant piece of text from the PDF (that is, RAG) and attach it to the prompt, thereby enabling the LLM to answer questions specific to this document.

Below, we show how you can do all these main preprocessing steps from Amazon SageMaker Data Wrangler:

  1. Extract text from a PDF document (powered by Amazon Textract)
  2. Remove sensitive information (powered by Amazon Comprehend)
  3. Chunk text into pieces (see the sketch after this list)
  4. Create embeddings for each piece (powered by Amazon Bedrock)
  5. Upload embeddings to a vector database (powered by Amazon OpenSearch Service)
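
As a plain-Python illustration of step 3 (the Data Wrangler Chunk text snippet does the equivalent inside the data flow), the following sketch splits extracted text into overlapping fixed-size chunks; the chunk size and overlap are assumptions you would tune for your embedding model.

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into overlapping character chunks so context isn't lost at chunk boundaries."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

sample = "Amazon Bedrock is a fully managed service for foundation models. " * 100
print(len(chunk_text(sample)))  # number of chunks produced for the sample text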

Prerequisites

For this walkthrough, you should have the following:

Note: Create an OpenSearch Service domain by following the instructions here. For simplicity, let's pick the option with a master username and password for fine-grained access control. Once the domain is created, create a vector index with the following mappings; the vector dimension of 1536 aligns with Amazon Titan embeddings:

PUT knowledge-base-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "text_content": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "text_content_v": {
        "type": "knn_vector",
        "dimension": 1536
      }
    }
  }
}
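
If you prefer to create the index programmatically rather than through the OpenSearch Dashboards Dev Tools, the following sketch issues the same request with the requests library; the domain endpoint and credentials are placeholders.

import requests

domain_endpoint = "https://<your-opensearch-domain-endpoint>"
index_body = {
    "settings": {"index.knn": True},
    "mappings": {
        "properties": {
            "text_content": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
            "text_content_v": {"type": "knn_vector", "dimension": 1536},
        }
    },
}

response = requests.put(
    f"{domain_endpoint}/knowledge-base-index",
    json=index_body,                       # requests serializes the dict to JSON
    auth=("<master_user>", "<master_pass>"),
    timeout=30,
)
print(response.status_code, response.text)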

Walkthrough

Build a data flow

In this section, we cover how we can build a data flow to extract text and metadata from PDFs, clean and process the data, generate embeddings using Amazon Bedrock, and index the data in Amazon OpenSearch.

Launch SageMaker Canvas

To launch SageMaker Canvas, complete the following steps:

  1. On the Amazon SageMaker Console, choose Domains in the navigation pane.
  2. Choose your domain.
  3. On the launch menu, choose Canvas.

Create a dataflow

Complete the following steps to create a data flow in SageMaker Canvas:

  1. On the SageMaker Canvas home page, choose Data preparation.
  2. Choose Create on the right side of the page, give the data flow a name, and select Create.
  3. This takes you to the data flow page.
  4. Choose Import data and select Tabular data.

Now let’s import the data from Amazon S3 bucket:

  1. Choose Import data and select Tabular from the drop-down list.
  2. For Data Source, select Amazon S3 from the drop-down list.
  3. Navigate to the metadata file with the PDF file locations, and choose the file.
  4. The metadata file is now loaded into the data preparation data flow, and we can proceed to add the next steps to transform the data and index it into Amazon OpenSearch. In this case, the file has the following metadata, with the location of each file in an Amazon S3 directory.

To add a new transform, complete the following steps:

  1. Choose the plus sign and choose Add Transform.
  2. Choose Add Step and choose Custom Transform.
  3. You can create a custom transform using Pandas, PySpark, Python user-defined functions, and SQL PySpark. Choose Python (PySpark) for this use-case.
  4. Enter a name for the step. From the example code snippets, browse and select extract text from pdf. Make necessary changes to code snippet and select Add.

  5. Let’s add a step to redact Personal Identifiable Information (PII) data from the extracted data by leveraging Amazon Comprehend. Choose Add Step and choose Custom Transform. And select Python (PySpark).

From the example code snippets, browse and select mask PII. Make necessary changes to code snippet and select Add.

  6. The next step is to chunk the text content. Choose Add Step, choose Custom Transform, and select Python (PySpark).

From the example code snippets, browse and select Chunk text. Make necessary changes to code snippet and select Add.

  1. Let’s convert the text content to vector embeddings using the Amazon Bedrock Titan Embeddings model. Choose Add Step and choose Custom Transform. And select Python (PySpark).

From the example code snippets, browse and select Generate text embedding with Bedrock. Make necessary changes to code snippet and select Add.

  8. Now we have vector embeddings available for the PDF file contents. Let's go ahead and index the data into Amazon OpenSearch. Choose Add Step, choose Custom Transform, and select Python (PySpark). You're free to rewrite the following code to use your preferred vector database. For simplicity, we are using the master username and password to access the OpenSearch APIs; for production workloads, select an option according to your organization's policies.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType
    import json
    import requests

    text_column = "text_redacted_chunks_embedding"
    output_column = text_column + "_response"

    headers = {"Content-Type": "application/json", "kbn-xsrf": "true", "osd-xsrf": "true", "security_tenant": "global"}
    index_name = 's3_vector_data_v1'

    # Set these to the master credentials of your OpenSearch Service domain
    master_user = '<master_user>'
    master_pass = '<master_pass>'


    def index_data(text_redacted_chunks, text_redacted_chunks_embedding):
        # Index the chunk and its embedding as a single OpenSearch document
        input_json = json.dumps({"text_content": text_redacted_chunks[-1],
                                 "text_content_v": text_redacted_chunks_embedding[-1]})
        response = requests.request(method="POST",
                                    url=f'https://search-canvas-vector-db-domain-dt3yq3b4cykwuvc6t7rnkvmnka.us-west-2.es.amazonaws.com/{index_name}/_doc',
                                    headers=headers,
                                    data=input_json,  # send the serialized JSON body as-is
                                    auth=(master_user, master_pass),
                                    timeout=30)
        return response.content


    indexing_udf = udf(index_data, StringType())
    df = df.withColumn('index_response',
                       indexing_udf(col("text_redacted_chunks"), col("text_redacted_chunks_embedding")))

Finally, the dataflow created would be as follows:

With this data flow, the data from the PDF file has been read and indexed with vector embeddings in Amazon OpenSearch. Now it's time for us to create a file with queries to query the indexed data and save it to an Amazon S3 location. We'll point our search data flow to that file and output the corresponding results to a new file in an Amazon S3 location.

Preparing a prompt

After we create a knowledge base out of our PDF, we can test it by searching the knowledge base for a few sample queries. We’ll process each query as follows:

  1. Generate an embedding for the query (powered by Amazon Bedrock)
  2. Query the vector database for the nearest neighbor context (powered by Amazon OpenSearch)
  3. Combine the query and the context into the prompt
  4. Query the LLM with the prompt (powered by Amazon Bedrock)

To build this second data flow in SageMaker Canvas, complete the following steps:

  1. On the SageMaker Canvas home page, choose Data preparation.
  2. Choose Create on the right side of the page, give the data flow a name, and select Create.

Now let’s load the user questions and then create a prompt by combining the question and the similar documents. This prompt is provided to the LLM for generating an answer to the user question.

  1. Let’s load a csv file with user questions. Choose Import Data and select Tabular from the drop-down list.
  2. Data Source, and select Amazon S3 from the drop-down list. Alternatively, you can choose to upload a file with user queries.
  3. Let’s add a custom transformation to convert the data into vector embeddings, followed by searching related embeddings from Amazon OpenSearch, before sending a prompt to Amazon Bedrock with the query and context from knowledge base. To generate embeddings for the query, you can use the same example code snippet Generate text embedding with Bedrock mentioned in Step #7 above.

Let’s invoke the Amazon OpenSearch API to search relevant documents for the generated vector embeddings. Add a custom transform with Python (PySpark).

from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType
import requests

text_column = "Queries_embedding"
output_column = text_column + "_response"

headers = {"Content-Type": "application/json", "kbn-xsrf": "true", "osd-xsrf": "true", "security_tenant": "global"}
index_name = 's3_vector_data_v1'

# Set these to the master credentials of your OpenSearch Service domain
master_user = '<master_user>'
master_pass = '<master_pass>'

def search_data(text_column_embedding):
    # k-NN search over the text_content_v vector field; returns the raw response body
    input_json = {'size': 20,
                  'query': {'knn': {'text_content_v': {'vector': text_column_embedding, 'k': 5}}},
                  'fields': ['text_content']}
    response = requests.request(method="GET",
                                url=f'https://search-canvas-vector-db-domain-dt3yq3b4cykwuvc6t7rnkvmnka.us-west-2.es.amazonaws.com/{index_name}/_search',
                                headers=headers,
                                json=input_json,
                                auth=(master_user, master_pass),
                                timeout=30)
    return response.content

search_udf = udf(search_data, StringType())
df = df.withColumn(output_column, search_udf(col(text_column)))

Let's add a custom transform to call the Amazon Bedrock API for the query response, passing the documents retrieved from the Amazon OpenSearch knowledge base. From the example code snippets, browse and select Query Bedrock with context. Make the necessary changes to the code snippet and select Add. A minimal sketch of what this transform can look like follows.
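
The following sketch assumes the Amazon Titan Text model is enabled in Amazon Bedrock and that the OpenSearch search response is the raw JSON string produced by the search transform above; the model ID, the Queries column name, and the prompt template are illustrative rather than the exact Canvas snippet.

import json

import boto3
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType


def query_bedrock_with_context(question, search_response):
    # Combine the user question with the retrieved chunks into a single prompt
    bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")
    hits = json.loads(search_response).get("hits", {}).get("hits", [])
    context = "\n".join(hit["_source"]["text_content"] for hit in hits)
    prompt = (f"Answer the question using only the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
    response = bedrock.invoke_model(
        modelId="amazon.titan-text-express-v1",  # assumed model; use the text model enabled in your account
        contentType="application/json",
        accept="application/json",
        body=json.dumps({"inputText": prompt,
                         "textGenerationConfig": {"maxTokenCount": 512, "temperature": 0.2}}),
    )
    return json.loads(response["body"].read())["results"][0]["outputText"]


answer_udf = udf(query_bedrock_with_context, StringType())
df = df.withColumn("llm_answer",
                   answer_udf(col("Queries"), col("Queries_embedding_response")))  # "Queries" is an assumed column name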

In summary, the RAG-based question answering dataflow is as follows:

ML practitioners spend a lot of time crafting feature engineering code, applying it to their initial datasets, training models on the engineered datasets, and evaluating model accuracy. Given the experimental nature of this work, even the smallest project leads to multiple iterations. The same feature engineering code is often run again and again, wasting time and compute resources on repeating the same operations. In large organizations, this can cause an even greater loss of productivity because different teams often run identical jobs or even write duplicate feature engineering code because they have no knowledge of prior work.

To avoid reprocessing the features, we'll export our data flow to an Amazon SageMaker pipeline. Select the + button to the right of the query, select Export data flow, and choose Run SageMaker Pipeline (via Jupyter notebook). The exported notebook defines the pipeline, which you can then rerun on demand, as in the sketch that follows.
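
Once the exported notebook has created the pipeline, you can rerun the whole data preparation flow without repeating the interactive steps. The following is a minimal sketch of starting an execution with boto3; the pipeline name is a placeholder for whatever name the exported notebook assigns.

import boto3

sm_client = boto3.client("sagemaker", region_name="us-west-2")

# Start an execution of the exported SageMaker pipeline
response = sm_client.start_pipeline_execution(
    PipelineName="canvas-rag-data-prep-pipeline",  # placeholder: name created by the exported notebook
    PipelineExecutionDisplayName="rag-data-prep-rerun",
)
print(response["PipelineExecutionArn"])

# Optionally poll the execution status
status = sm_client.describe_pipeline_execution(
    PipelineExecutionArn=response["PipelineExecutionArn"]
)["PipelineExecutionStatus"]
print(status)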

Cleaning up

To avoid incurring future charges, delete or shut down the resources you created while following this post. Refer to Logging out of Amazon SageMaker Canvas for more details.

Conclusion

In this post, we showed Amazon SageMaker Canvas's end-to-end capabilities by assuming the role of a data professional preparing data for an LLM. The interactive data preparation enabled us to quickly clean, transform, and analyze the data to engineer informative features. By removing coding complexities, SageMaker Canvas allowed rapid iteration to create a high-quality training dataset. This accelerated workflow led directly into building, training, and deploying a performant machine learning model for business impact. With its comprehensive data preparation and unified experience from data to insights, SageMaker Canvas empowers users to improve their ML outcomes.

We encourage you to learn more by exploring Amazon SageMaker Data Wrangler, Amazon SageMaker Canvas, Amazon Titan models, Amazon Bedrock, and Amazon OpenSearch Service to build a solution using the sample implementation provided in this post and a dataset relevant to your business. If you have questions or suggestions, then please leave a comment.


About the Authors

Ajjay Govindaram is a Senior Solutions Architect at AWS. He works with strategic customers who are using AI/ML to solve complex business problems. His experience lies in providing technical direction as well as design assistance for modest to large-scale AI/ML application deployments. His knowledge ranges from application architecture to big data, analytics, and machine learning. He enjoys listening to music while resting, experiencing the outdoors, and spending time with his loved ones.

Nikita Ivkin is a Senior Applied Scientist at Amazon SageMaker Data Wrangler with interests in machine learning and data cleaning algorithms.

Democratize ML on Salesforce Data Cloud with no-code Amazon SageMaker Canvas

This post is co-authored by Daryl Martis, Director of Product, Salesforce Einstein AI.

This is the third post in a series discussing the integration of Salesforce Data Cloud and Amazon SageMaker.

In Part 1 and Part 2, we show how the Salesforce Data Cloud and Einstein Studio integration with SageMaker allows businesses to access their Salesforce data securely using SageMaker and use its tools to build, train, and deploy models to endpoints hosted on SageMaker. SageMaker endpoints can be registered to the Salesforce Data Cloud to activate predictions in Salesforce.

In this post, we demonstrate how business analysts and citizen data scientists can create machine learning (ML) models, without any code, in Amazon SageMaker Canvas and deploy trained models for integration with Salesforce Einstein Studio to create powerful business applications. SageMaker Canvas provides a no-code experience to access data from Salesforce Data Cloud and build, test, and deploy models using just a few clicks. SageMaker Canvas also enables you to understand your predictions using feature importance and SHAP values, making it straightforward for you to explain predictions made by ML models.

SageMaker Canvas

SageMaker Canvas enables business analysts and data science teams to build and use ML and generative AI models without having to write a single line of code. SageMaker Canvas provides a visual point-and-click interface to generate accurate ML predictions for classification, regression, forecasting, natural language processing (NLP), and computer vision (CV). In addition, you can access and evaluate foundation models (FMs) from Amazon Bedrock or public FMs from Amazon SageMaker JumpStart for content generation, text extraction, and text summarization to support generative AI solutions. SageMaker Canvas allows you to bring ML models built anywhere and generate predictions directly in SageMaker Canvas.

Salesforce Data Cloud and Einstein Studio

Salesforce Data Cloud is a data platform that provides businesses with real-time updates of their customer data from any touch point.

Einstein Studio is a gateway to AI tools on Salesforce Data Cloud. With Einstein Studio, admins and data scientists can effortlessly create models with a few clicks or using code. Einstein Studio’s bring your own model (BYOM) experience provides the capability to connect custom or generative AI models from external platforms such as SageMaker to Salesforce Data Cloud.

Solution overview

To demonstrate how you can build ML models using data in Salesforce Data Cloud using SageMaker Canvas, we create a predictive model to recommend a product. This model uses the features stored in Salesforce Data Cloud such as customer demographics, marketing engagements, and purchase history. The product recommendation model is built and deployed using the SageMaker Canvas no-code user interface using data in Salesforce Data Cloud.

We use the following sample dataset stored in Amazon Simple Storage Service (Amazon S3). To use this dataset in Salesforce Data Cloud, refer to Create Amazon S3 Data Stream in Data Cloud. The following attributes are needed to create the model:

  • Club Member – If the customer is a club member
  • Campaign – The campaign the customer is a part of
  • State – The state or province the customer resides in
  • Month – The month of purchase
  • Case Count – The number of cases raised by the customer
  • Case Type Return – Whether the customer returned any product within the last year
  • Case Type Shipment Damaged – Whether the customer had any shipments damaged in the last year
  • Engagement Score – The level of engagement the customer has (response to mailing campaigns, logins to the online store, and so on)
  • Tenure – The tenure of the customer relationship with the company
  • Clicks – The average number of clicks the customer has made within a week prior to purchase
  • Pages Visited – The average number of pages the customer visited within a week prior to purchase
  • Product Purchased – The actual product purchased

The following steps give an overview of how to use the Salesforce Data Cloud connector launched in SageMaker Canvas to access your enterprise data and build a predictive model:

  1. Configure the Salesforce connected app to register the SageMaker Canvas domain.
  2. Set up OAuth for Salesforce Data Cloud in SageMaker Canvas.
  3. Connect to Salesforce Data Cloud data using the built-in SageMaker Canvas Salesforce Data Cloud connector and import the dataset.
  4. Build and train models in SageMaker Canvas.
  5. Deploy the model in SageMaker Canvas and make predictions.
  6. Deploy an Amazon API Gateway endpoint as a front-end connection to the SageMaker inference endpoint.
  7. Register the API Gateway endpoint in Einstein Studio. For instructions, refer to Bring Your Own AI Models to Data Cloud.

The following diagram illustrates the solution architecture.

Prerequisites

Before you get started, complete the following prerequisite steps to create a SageMaker domain and enable SageMaker Canvas:

  1. Create an Amazon SageMaker Studio domain. For instructions, refer to Onboard to Amazon SageMaker Domain.
  2. Note down the domain ID and execution role that are created and will be used by your user profile. You add permissions to this role in subsequent steps.

The following screenshot shows the domain we created for this post.

  1. Next, go to the user profile and choose Edit.
  2. Navigate to the Amazon SageMaker Canvas settings section and select Enable Canvas base permissions.
  3. Select Enable direct deployments of Canvas models and Enable model registry permissions for all users.

This allows SageMaker Canvas to deploy models directly to SageMaker endpoints and register them in the model registry. These settings can be configured at the domain or user profile level; user profile settings take precedence over domain settings. If you prefer to script these settings, a hedged sketch follows.
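
The following is a minimal sketch of enabling direct deployment and model registry permissions programmatically with the SageMaker UpdateDomain API; the domain ID is a placeholder, and the equivalent settings can also be applied per user profile with UpdateUserProfile.

import boto3

sm_client = boto3.client("sagemaker", region_name="us-west-2")

# Enable Canvas direct deployment and model registry permissions at the domain level
sm_client.update_domain(
    DomainId="d-xxxxxxxxxxxx",  # placeholder: your SageMaker domain ID
    DefaultUserSettings={
        "CanvasAppSettings": {
            "DirectDeploySettings": {"Status": "ENABLED"},
            "ModelRegisterSettings": {"Status": "ENABLED"},
        }
    },
)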

Create or update the Salesforce connected app

Next, we create a Salesforce connected app to enable the OAuth flow from SageMaker Canvas to Salesforce Data Cloud. Complete the following steps:

  1. Log in to Salesforce and navigate to Setup.
  2. Search for App Manager and create a new connected app.
  3. Provide the following inputs:
    1. For Connected App Name, enter a name.
    2. For API Name, leave as default (it’s automatically populated).
    3. For Contact Email, enter your contact email address.
    4. Select Enable OAuth Settings.
    5. For Callback URL, enter https://<domain-id>.studio.<region>.sagemaker.aws/canvas/default/lab, and provide the domain ID and Region from your SageMaker domain.
  4. Configure the following scopes on your connected app:
    1. Manage user data via APIs (api).
    2. Perform requests at any time (refresh_token, offline_access).
    3. Perform ANSI SQL queries on Salesforce Data Cloud data (Data Cloud_query_api).
    4. Manage Data Cloud profile data (Data Cloud_profile_api).
    5. Access the identity URL service (id, profile, email, address, phone).
    6. Access unique user identifiers (openid).
  5. Set your connected app IP Relaxation setting to Relax IP restrictions.

Configure OAuth settings for the Salesforce Data Cloud connector

SageMaker Canvas uses AWS Secrets Manager to securely store connection information from the Salesforce connected app. SageMaker Canvas allows administrators to configure OAuth settings for an individual user profile or at the domain level. Note that you can add a secret to both a domain and user profile, but SageMaker Canvas looks for secrets in the user profile first.

To configure your OAuth settings, complete the following steps:

  1. On the SageMaker console, navigate to your domain or user profile settings and choose Edit.
  2. Choose Canvas Settings in the navigation pane.
  3. Under OAuth settings, for Data Source, choose Salesforce Data Cloud.
  4. For Secret setup, you can create a new secret or use an existing secret. For this example, we create a new secret and input the client ID and client secret from the Salesforce connected app.

For more details on enabling OAuth in SageMaker Canvas, refer to Set up OAuth for Salesforce Data Cloud.
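
If you create the secret yourself rather than through the Canvas settings page, the following is a minimal sketch of storing the connected app's client ID and client secret in Secrets Manager with boto3. The secret name and key names are placeholders; the exact secret structure and tags that SageMaker Canvas expects are described in the linked documentation.

import json

import boto3

secrets_client = boto3.client("secretsmanager", region_name="us-west-2")

# Store the Salesforce connected app credentials for the Canvas OAuth connection
secrets_client.create_secret(
    Name="sagemaker-canvas-salesforce-oauth",  # placeholder name
    SecretString=json.dumps({
        "client_id": "<connected-app-client-id>",      # placeholder key names; see the Canvas OAuth docs
        "client_secret": "<connected-app-client-secret>",
    }),
)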

This completes the setup to enable data access from Salesforce Data Cloud to SageMaker Canvas to build AI and ML models.

Import data from Salesforce Data Cloud

To import your data, complete the following steps:

  1. From the user profile you created with your SageMaker domain, choose Launch and select Canvas.

The first time you access your Canvas app, it will take about 10 minutes to create.

  1. Choose Data Wrangler in the navigation pane.
  2. On the Create menu, choose Tabular to create a tabular dataset.
  3. Name the dataset and choose Create.
  4. For Data Source, choose Salesforce Data Cloud and Add Connection to import the data lake object.

If you’ve previously configured a connection to Salesforce Data Cloud, you will see an option to use that connection instead of creating a new one.

  1. Provide a name for a new Salesforce Data Cloud connection and choose Add connection.

It will take a few minutes to complete.

  1. You will be redirected to the Salesforce login page to authorize the connection.

After the login is successful, you will be redirected back to SageMaker Canvas with the data lake object listing.

  1. Select the dataset that contains the features for model training that was uploaded via Amazon S3.
  2. Drag and drop the file, then choose Edit in SQL.

Salesforce adds a "__c" suffix to all the Data Cloud object fields. As per the SageMaker Canvas naming convention, "__" is not allowed in field names.

  1. Edit the SQL to rename the columns and drop metadata that isn’t relevant for model training. Replace the table name with your object name.
    SELECT "state__c" as state, 
    "case_type_shipment_damaged__c" as case_type_shipment_damaged, 
    "campaign__c" as campaign, 
    "engagement_score__c" as engagement_score, 
    "case_count__c" as case_count, 
    "case_type_return__c" as case_type_return, 
    "club_member__c" as club_member, 
    "pages_visited__c" as pages_visited, 
    "product_purchased__c" as product_purchased, 
    "clicks__c" as clicks, 
    "tenure__c" as tenure, 
    "month__c" as month FROM product_recommendation__dlm;

  2. Choose Run SQL and then Create dataset.
  3. Select the dataset and choose Create a model.

  4. To create a model to predict a product recommendation, provide a model name, choose Predictive analysis for Problem type, and choose Create.

Build and train the model

Complete the following steps to build and train your model:

  1. After the model is launched, set the target column to product_purchased.

SageMaker Canvas displays key statistics and correlations of each column to the target column. SageMaker Canvas provides you with tools to preview your model and validate data before you begin building.

  1. Use the preview model feature to see the accuracy of your model and validate your dataset to prevent issues while building the model.
  2. After reviewing your data and making any changes to your dataset, choose your build type. The Quick build option may be faster, but it will only use a subset of your data to build a model. For the purpose of this post, we selected the Standard build option.

A standard build can take 2–4 hours to complete.

SageMaker Canvas automatically handles missing values in your dataset while it builds the model. It will also apply other data prep transformations for you to get the data ready for ML.

  1. After your model begins building, you can leave the page.

When the model shows as Ready on the My models page, it’s ready for analysis and predictions.

  1. After the model is built, navigate to My models, choose View to view the model you created, and choose the most recent version.
  2. Go to the Analyze tab to see the impact of each feature on the prediction.
  3. For additional information on the model’s predictions, navigate to the Scoring tab.
  4. Choose Predict to initiate a product prediction.

Deploy the model and make predictions

Complete the following steps to deploy your model and start making predictions:

  1. You can choose to make either batch or single predictions. For the purpose of this post, we choose Single prediction.

When you choose Single prediction, SageMaker Canvas displays the features that you can provide inputs for.

  1. You can change the values by choosing Update and view the real-time prediction.

The accuracy of the model as well as the impact of each feature for that specific prediction will be displayed.

  1. To deploy the model, provide a deployment name, select an instance type and instance count, and choose Deploy.

Model deployment will take a few minutes.

Model status is updated to In Service after the deployment is successful.

SageMaker Canvas provides an option to test the deployment.

  1. Choose View details.

The Details tab provides the model endpoint details. Instance type, count, input format, response content, and endpoint are some of the key details displayed.

  1. Choose Test deployment to test the deployed endpoint.

Similar to single prediction, the view displays the input features and provides an option to update and test the endpoint in real time.

The new prediction along with the endpoint invocation result is returned to the user.

Create an API to expose the SageMaker endpoint

To generate predictions that power business applications in Salesforce, you need to expose the SageMaker inference endpoint created by your SageMaker Canvas deployment via API Gateway and register it in Salesforce Einstein.

The request and response formats vary between Salesforce Einstein and the SageMaker inference endpoint. You can either use API Gateway to perform the transformation or use AWS Lambda to transform the request and map the response. Refer to Call an Amazon SageMaker model endpoint using Amazon API Gateway and AWS Lambda to expose a SageMaker endpoint via Lambda and API Gateway.

The following code snippet is a Lambda function to transform the request and the response:

import json
import boto3
import os

client = boto3.client("runtime.sagemaker")
endpoint = os.environ['SAGEMAKER_ENDPOINT_NAME']
prediction_label = 'product_purchased__c'

def lambda_handler(event, context):
    features = []
    # Input Sample : {"instances": [{"features": ["Washington", 1, "New Colors", 1, 1, 1, 1, 1, 1, 1, 1]}, {"features": ["California", 1, "Web", 100, 1, 1, 100, 1, 10, 1, 1]}]}
    for instance in event["instances"]:
        features.append(','.join(map(str, instance["features"])))
    # One CSV row per instance for the SageMaker endpoint
    body = '\n'.join(features)
    response = client.invoke_endpoint(EndpointName=endpoint, ContentType="text/csv", Body=body, Accept="application/json")
    response = json.loads(response['Body'].read().decode('utf-8'))
    # Map the SageMaker response to the format Einstein Studio expects
    prediction_response = {"predictions": []}
    for prediction in response.get('predictions'):
        prediction_response['predictions'].append({prediction_label: prediction['predicted_label']})
    return prediction_response

Update the endpoint and prediction_label values in the Lambda function based on your configuration.

  1. Add an environment variable SAGEMAKER_ENDPOINT_NAME to capture the SageMaker inference endpoint.
  2. Set the prediction label to match the model output JSON key that is registered in Einstein Studio.

The default timeout for a Lambda function is 3 seconds. Depending on the prediction request input size, the SageMaker real-time inference API may take more than 3 seconds to respond.

  1. Increase the Lambda function timeout but keep it below the API Gateway default integration timeout, which is 29 seconds. One way to apply both the environment variable and the timeout change is shown in the sketch that follows.
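
The following is a minimal sketch of applying these configuration changes with boto3; the Lambda function name and SageMaker endpoint name are placeholders for your own resources.

import boto3

lambda_client = boto3.client("lambda", region_name="us-west-2")

# Set the SageMaker endpoint name and raise the Lambda timeout in one call
lambda_client.update_function_configuration(
    FunctionName="einstein-product-recommendation-transform",  # placeholder function name
    Timeout=25,  # seconds; stays below the 29-second API Gateway integration timeout
    Environment={
        "Variables": {
            "SAGEMAKER_ENDPOINT_NAME": "canvas-product-recommendation-endpoint"  # placeholder endpoint name
        }
    },
)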

Register the model in Salesforce Einstein Studio

To register the API Gateway endpoint in Einstein Studio, refer to Bring Your Own AI Models to Data Cloud.
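
Before registering, it can be useful to verify that the API Gateway front end returns predictions in the expected shape. The following is a minimal sketch using the request format from the Lambda function's input sample; the API URL is a placeholder for your deployed API Gateway stage.

import requests

payload = {
    "instances": [
        {"features": ["Washington", 1, "New Colors", 1, 1, 1, 1, 1, 1, 1, 1]},
        {"features": ["California", 1, "Web", 100, 1, 1, 100, 1, 10, 1, 1]},
    ]
}

# Call the API Gateway endpoint that fronts the Lambda function
response = requests.post(
    "https://<api-id>.execute-api.us-west-2.amazonaws.com/prod/predict",  # placeholder URL
    json=payload,
    timeout=30,
)
print(response.json())  # expected shape: {"predictions": [{"product_purchased__c": "..."}, ...]}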

Conclusion

In this post, we explained how you can use SageMaker Canvas to connect to Salesforce Data Cloud and generate predictions through automated ML features without writing a single line of code. We demonstrated the SageMaker Canvas model build capability to conduct an early preview of your model performance before running the standard build that trains the model with the full dataset. We also showcased post-model creation activities like using the single predictions interface within SageMaker Canvas and understanding your predictions using feature importance. Next, we used the SageMaker endpoint created in SageMaker Canvas and made it available as an API so you can integrate it with Salesforce Einstein Studio and create powerful Salesforce applications.

In an upcoming post, we will show you how to use data from Salesforce Data Cloud in SageMaker Canvas to make data insights and preparation even more straightforward by using a visual interface and simple natural language prompts.

To get started with SageMaker Canvas, see SageMaker Canvas immersion day and refer to Getting started with Amazon SageMaker Canvas.


About the authors

Daryl Martis is the Director of Product for Einstein Studio at Salesforce Data Cloud. He has over 10 years of experience in planning, building, launching, and managing world-class solutions for enterprise customers, including AI/ML and cloud solutions. He has previously worked in the financial services industry in New York City. Follow him on Linkedin.

Rachna Chadha is a Principal Solutions Architect AI/ML in Strategic Accounts at AWS. Rachna is an optimist who believes that ethical and responsible use of AI can improve society in the future and bring economic and social prosperity. In her spare time, Rachna likes spending time with her family, hiking, and listening to music.

Ife Stewart is a Principal Solutions Architect in the Strategic ISV segment at AWS. She has been engaged with Salesforce Data Cloud over the last 2 years to help build integrated customer experiences across Salesforce and AWS. Ife has over 10 years of experience in technology. She is an advocate for diversity and inclusion in the technology field.

Ravi Bhattiprolu is a Sr. Partner Solutions Architect at AWS. Ravi works with strategic partners, Salesforce and Tableau, to deliver innovative and well-architected products and solutions that help joint customers realize their business objectives.

Miriam Lebowitz is a Solutions Architect in the Strategic ISV segment at AWS. She is engaged with teams across Salesforce, including Salesforce Data Cloud, and specializes in data analytics. Outside of work, she enjoys baking, traveling, and spending quality time with friends and family.
