Introducing document-level sync reports: Enhanced data sync visibility in Amazon Kendra

Amazon Kendra is an intelligent search service powered by machine learning (ML). Amazon Kendra helps you aggregate content from a variety of content repositories into a centralized index that lets you quickly search all your enterprise data and find the most accurate answer.

Amazon Kendra securely connects to over 40 data sources. When using your data source, you might want better visibility into the document processing lifecycle during data source sync jobs. This includes knowing the status of each document you attempted to crawl and index, and being able to troubleshoot why certain documents were not returned with the expected answers. Additionally, you might need access to metadata, timestamps, and access control lists (ACLs) for the indexed documents.

We are pleased to announce a new feature now available in Amazon Kendra that significantly improves visibility into data source sync operations. The latest release introduces a comprehensive document-level report incorporated into the sync history, providing administrators with granular indexing status, metadata, and ACL details for every document processed during a data source sync job. This enhancement to sync job observability enables administrators to quickly investigate and resolve ingestion or access issues encountered while setting up Amazon Kendra indexes. The detailed document reports are persisted in the new SYNC_RUN_HISTORY_REPORT log stream under the Amazon Kendra index log group, so critical sync job details are available on-demand when troubleshooting.

In this post, we discuss the benefits of this new feature and how it offers enhanced data sync visibility in Amazon Kendra.

Lifecycle of a document in a data source sync run job

In this section, we examine the lifecycle of a document within a data source sync in Amazon Kendra. This provides valuable insight into the sync process. The data source sync comprises three key stages: crawling, syncing, and indexing. Crawling involves the connector connecting to the data source and extracting documents meeting the defined sync scope according to the data source configuration. These documents are then synced to the Amazon Kendra index during the syncing phase. Finally, indexing makes the synced documents searchable within the Amazon Kendra environment.

The following diagram shows a flowchart of a sync run job.

Crawling stage

The first stage is the crawling stage, where the connector crawls all documents and their metadata from the data source. During this stage, the connector also compares the checksum of the document against the Amazon Kendra index to determine if a particular document needs to be added, modified, or deleted from the index. This operation corresponds to the CrawlAction field in the sync run history report.

If the document is unmodified, it’s marked as UNMODIFIED and skipped in the remaining stages. If a document fails in the crawling stage (for example, due to throttling errors, broken content, or an oversized document), it is marked in the sync run history report with a CrawlStatus of FAILED. If the document was skipped due to any validation errors, its CrawlStatus is marked as SKIPPED. These documents are not sent to the next stage. All successful documents are marked as SUCCESS and are sent forward.

We also capture the ACLs and metadata on each document in this stage to be able to add it to the sync run history report.

Syncing stage

During the syncing stage, the document is sent to Amazon Kendra ingestion service APIs like BatchPutDocument and BatchDeleteDocument. After a document is submitted to these APIs, Amazon Kendra runs validation checks on the submitted documents. If any document fails these checks, its SyncStatus is marked as FAILED. If there is an irrecoverable error for a particular document, it is marked as SKIPPED and other documents are sent forward.

Indexing stage

In this step, Amazon Kendra parses the document, processes it according to its content type, and persists it in the index. If the document fails to be persisted, its IndexStatus is marked as FAILED; otherwise, it is marked as SUCCESS.

After the statuses of all the stages have been captured, we emit these statuses as an Amazon CloudWatch event to the customer’s AWS account.

Key features and benefits of document-level reports

The following are the key features and benefits of the new document-level report in Amazon Kendra indexes:

  • Enhanced sync run history page – A new Actions column has been added to the sync run history page, providing access to the document-level report for each sync run.

  • Dedicated log stream – A new log stream named SYNC_RUN_HISTORY_REPORT has been created in the Amazon Kendra CloudWatch log group, containing the document-level report.

  • Comprehensive document information – The document-level report includes the following information for each document:
    • Document ID – This is the document ID that is inherited directly from the data source or mapped by the customer in the data source field mappings.
    • Document title – The title of the document is taken from the data source or mapped by the customer in the data source field mappings.
    • Consolidated document status (SUCCESS, FAILED, or SKIPPED) – This is the final consolidated status of the document. If the document was successfully processed in all stages, the value is SUCCESS. If the document failed or was skipped in any of the stages, the value is FAILED or SKIPPED, respectively.
    • Error message (if the document failed) – This field contains the error message with which a document failed. If a document was skipped due to throttling errors or any internal errors, this is shown in the error message field.
    • Crawl status – This field denotes whether the document was crawled successfully from the data source. This status correlates to the syncing-crawling state in the data source sync.
    • Sync status – This field denotes whether the document was sent for syncing successfully. This correlates to the syncing-indexing state in the data source sync.
    • Index status – This field denotes whether the document was successfully persisted in the index.
    • ACLs – This field contains a list of document-level permissions that were crawled from the data source. Each element in the list contains the following details:
      • Global name – This is the email or user name of the user. This field is mapped across multiple data sources. For example, if a user has three data sources (Confluence, SharePoint, and Gmail) with the local user IDs confluence_user, sharepoint_user, and gmail_user, respectively, and their email address user@email.com is the globalName in the ACL for all of them, then Amazon Kendra understands that all of these local user IDs map to the same global name.
      • Name – This is the local unique ID of the user, which is assigned by the data source.
      • Type – This field indicates the principal type. This can be either USER or GROUP.
      • Is Federated – This is a Boolean flag that indicates whether the group is of INDEX level (true) or DATASOURCE level (false).
      • Access – This field indicates whether the user is explicitly allowed or denied access. Values can be either ALLOWED or DENIED.
      • Data source ID – This is the data source ID. For federated groups (INDEX level), this field will be null.
    • Metadata – This field contains the metadata fields (other than ACL) that were pulled from the data source. This list also includes the metadata fields mapped by the customer in the data source field mappings, as well as extra metadata fields added by the connector.
    • Hashed document ID (for troubleshooting assistance) – To safeguard your data privacy, we present a secure, one-way hash of the document identifier. This value enables the Amazon Kendra team to efficiently locate and analyze the specific document within our logs, should you encounter any issue that requires further investigation and resolution.
    • Timestamp – The timestamp indicates when the document status was logged in CloudWatch.
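
The following is an illustrative example of what a single document entry in the SYNC_RUN_HISTORY_REPORT log stream might look like. The field names used here (such as DocumentId, DocumentTitle, ConnectorDocumentStatus, ErrorMsg, Acl, Metadata, and SourceUri) match the CloudWatch Logs Insights queries shown later in this post, but the values are fictional and the exact structure in your account may include additional fields.

{
    "DocumentId": "doc-12345",
    "DocumentTitle": "Machine Learning Whitepaper",
    "SourceUri": "https://example.sharepoint.com/docs/ml-whitepaper.pdf",
    "CrawlStatus": "SUCCESS",
    "SyncStatus": "SUCCESS",
    "IndexStatus": "SUCCESS",
    "ConnectorDocumentStatus": { "Status": "SUCCESS" },
    "ErrorMsg": "",
    "Acl": [
        {
            "globalName": "user@email.com",
            "name": "sharepoint_user",
            "type": "USER",
            "isFederated": false,
            "access": "ALLOWED",
            "dataSourceId": "your-data-source-id"
        }
    ],
    "Metadata": "[{\"key\":\"_last_updated_at\",\"value\":{\"dateValue\":\"2024-08-26T14:19:09Z\"}}]"
}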

In the following sections, we explore different use cases for the logging feature.

Determine the optimal boosting duration for recent documents using document-level reporting

When it comes to generating accurate answers, you may want to fine-tune the way Amazon Kendra prioritizes its content. For instance, you may prefer to boost recent documents over older ones to make sure the most up-to-date passages are used to generate an answer. To achieve this, you can use the relevance tuning feature in Amazon Kendra to boost documents based on the last update date attribute, with a specified boosting duration. However, determining the optimal boosting period can be challenging when dealing with a large number of frequently changing documents.

You can now use the per-document-level report to obtain the _last_updated_at metadata field information for your documents, which can help you determine the appropriate boosting period. For this, you use the following CloudWatch Logs Insights query to retrieve the _last_updated_at metadata attribute for machine learning documents from the SYNC_RUN_HISTORY_REPORT log stream.

filter @logStream like 'SYNC_RUN_HISTORY_REPORT/'
and Metadata like 'Machine Learning'
| parse Metadata '{"key":"_last_updated_at","value":{"dateValue":"*"}}' as @last_updated_at
| sort @last_updated_at desc, @timestamp desc
| dedup DocumentTitle

With the preceding query, you can gain insights into the last updated timestamps of your documents, enabling you to make informed decisions about the optimal boosting period. This approach makes sure your chat responses are generated using the most recent and relevant information, enhancing the overall accuracy and effectiveness of your Amazon Kendra implementation.

The following screenshot shows an example result.
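
After you have identified a suitable boosting period from the report, you can apply it with relevance tuning. The following is a minimal sketch using boto3 and the Kendra UpdateIndex API; the index ID, importance, and the 30-day duration are placeholder values to adapt to your own index and content.

import boto3

kendra = boto3.client("kendra")

# Boost documents based on the _last_updated_at date attribute.
# Duration is expressed in seconds; 2592000s is roughly 30 days (placeholder value).
kendra.update_index(
    Id="your-index-id",  # placeholder index ID
    DocumentMetadataConfigurationUpdates=[
        {
            "Name": "_last_updated_at",
            "Type": "DATE_VALUE",
            "Relevance": {
                "Freshness": True,
                "Duration": "2592000s",
                "Importance": 5,
            },
        }
    ],
)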

Common document indexing observability and troubleshooting methods

In this section, we explore some common admin tasks for observing and troubleshooting document indexing using the new document-level reporting feature.

List all successfully indexed documents from a data source

To retrieve a list of all documents that have been successfully indexed from a specific data source, you can use the following CloudWatch Logs Insights query:

fields DocumentTitle, DocumentId, @timestamp
| filter @logStream like 'SYNC_RUN_HISTORY_REPORT/your-data-source-id/'
and ConnectorDocumentStatus.Status = "SUCCESS"
| sort @timestamp desc | dedup DocumentTitle, DocumentId

The following screenshot shows an example result.
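
You can also run these queries programmatically. The following is a minimal sketch using boto3 and the CloudWatch Logs Insights API; the log group name is a placeholder, so substitute the CloudWatch log group associated with your Amazon Kendra index, and adjust the data source ID and time range as needed.

import time
import boto3

logs = boto3.client("logs")

query = """
fields DocumentTitle, DocumentId, @timestamp
| filter @logStream like 'SYNC_RUN_HISTORY_REPORT/your-data-source-id/'
and ConnectorDocumentStatus.Status = "SUCCESS"
| sort @timestamp desc | dedup DocumentTitle, DocumentId
"""

# Placeholder log group name - replace with your Amazon Kendra index log group
response = logs.start_query(
    logGroupName="your-kendra-index-log-group",
    startTime=int(time.time()) - 7 * 24 * 3600,  # last 7 days
    endTime=int(time.time()),
    queryString=query,
)

# Poll until the query finishes, then print each result row as a dictionary
while True:
    results = logs.get_query_results(queryId=response["queryId"])
    if results["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in results.get("results", []):
    print({field["field"]: field["value"] for field in row})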

List all successfully indexed documents from a data source sync job

To retrieve a list of all documents that have been successfully indexed during a specific sync job, you can use the following CloudWatch Logs Insights query:

fields DocumentTitle, DocumentId, ConnectorDocumentStatus.Status AS IndexStatus, @timestamp
| filter @logStream like 'SYNC_RUN_HISTORY_REPORT/your-data-source-id/run-id'
and ConnectorDocumentStatus.Status = "SUCCESS"
| sort DocumentTitle

The following screenshot shows an example result.

List all failed indexed documents from a data source sync job

To retrieve a list of all documents that failed to index during a specific sync job, along with the error messages, you can use the following CloudWatch Logs Insights query:

fields DocumentTitle, DocumentId, ConnectorDocumentStatus.Status AS IndexStatus, ErrorMsg, @timestamp
| filter @logStream like 'SYNC_RUN_HISTORY_REPORT/your-data-source-id/run-id'
and ConnectorDocumentStatus.Status = "FAILED"
| sort @timestamp desc

The following screenshot shows an example result.

List all documents that contain a user’s ACL permission from an Amazon Kendra index

To retrieve a list of documents that have a specific user’s ACL permission, you can use the following CloudWatch Logs Insights query:

filter @logStream like 'SYNC_RUN_HISTORY_REPORT/'
and Acl like 'aneesh@mydemoaws.onmicrosoft.com'
| display DocumentTitle, SourceUri

The following screenshot shows an example result.

List the ACL of an indexed document from a data source sync job

To retrieve the ACL information for a specific indexed document from a sync job, you can use the following CloudWatch Logs Insights query:

filter @logStream like 'SYNC_RUN_HISTORY_REPORT/data-source-id/run-id'
and DocumentTitle = "your-document-title"
| display DocumentTitle, Acl

The following screenshot shows an example result.

List metadata of an indexed document from a data source sync job

To retrieve the metadata information for a specific indexed document from a sync job, you can use the following CloudWatch Logs Insights query:

filter @logStream like 'SYNC_RUN_HISTORY_REPORT/data-source-id/run-id'
and DocumentTitle = "your-document-title"
| display DocumentTitle, Metadata

The following screenshot shows an example result.

Conclusion

The newly introduced document-level report in Amazon Kendra provides enhanced visibility and observability into the document processing lifecycle during data source sync jobs. This feature addresses a critical need expressed by customers for better troubleshooting capabilities and access to detailed information about the indexing status, metadata, and ACLs of individual documents.

The document-level report is stored in a log stream named SYNC_RUN_HISTORY_REPORT within the Amazon Kendra index CloudWatch log group. This report contains comprehensive information for each document, including the document ID, title, overall document sync status, error messages (if any), and the ACLs and metadata retrieved from the data source. The data source sync run history page now includes an Actions column, providing access to the document-level report for each sync run. This feature significantly improves the ability to troubleshoot issues related to document ingestion, access control, and metadata relevance, and provides better visibility into the documents synced with an Amazon Kendra index.

To get started with Amazon Kendra, explore the Getting started guide. To learn more about data source connectors and best practices, see Creating a data source connector.


About the Authors

Aneesh Mohan is a Senior Solutions Architect at Amazon Web Services (AWS), with over 20 years of experience in architecting and delivering high-impact solutions for mission-critical workloads. His expertise spans across the financial services industry, AI/ML, security, and data technologies. Driven by a deep passion for technology, Aneesh is dedicated to partnering with customers to design and implement well-architected, innovative solutions that address their unique business needs.

Ashwin Shukla is a Software Development Engineer II on the Amazon Q for Business and Amazon Kendra engineering team, with 6 years of experience in developing enterprise software. In this role, he works on designing and developing foundational features for Amazon Q for Business.


Fine-tune Meta Llama 3.1 models using torchtune on Amazon SageMaker

This post is co-written with Meta’s PyTorch team.

In today’s rapidly evolving AI landscape, businesses are constantly seeking ways to use advanced large language models (LLMs) for their specific needs. Although foundation models (FMs) offer impressive out-of-the-box capabilities, true competitive advantage often lies in deep model customization through fine-tuning. However, fine-tuning LLMs for complex tasks typically requires advanced AI expertise to align and optimize them effectively. Recognizing this challenge, Meta developed torchtune, a PyTorch-native library that simplifies authoring, fine-tuning, and experimenting with LLMs, making it more accessible to a broader range of users and applications.

In this post, AWS collaborates with Meta’s PyTorch team to showcase how you can use Meta’s torchtune library to fine-tune Meta Llama-like architectures while using a fully-managed environment provided by Amazon SageMaker Training. We demonstrate this through a step-by-step implementation of model fine-tuning, inference, quantization, and evaluation. We perform the steps on a Meta Llama 3.1 8B model utilizing the LoRA fine-tuning strategy on a single p4d.24xlarge worker node (providing 8 Nvidia A100 GPUs).

Before we dive into the step-by-step guide, let’s look at the performance of our technical stack, which we evaluated by fine-tuning a Meta Llama 3.1 8B model across various configurations and instance types.

As can be seen in the following chart, we found that a single p4d.24xlarge delivers 70% higher performance than two g5.48xlarge instances (each with 8 NVIDIA A10 GPUs) at almost 47% reduced price. We therefore optimized the example in this post for a p4d.24xlarge configuration. However, you could use the same code to run single-node or multi-node training on different instance configurations by changing the parameters passed to the SageMaker estimator. You could further reduce the training time shown in the following chart by using a SageMaker managed warm pool and accessing pre-downloaded models using Amazon Elastic File System (Amazon EFS).

Challenges with fine-tuning LLMs

Generative AI models offer many promising business use cases. However, to maintain factual accuracy and relevance of these LLMs to specific business domains, fine-tuning is required. Due to the growing number of model parameters and the increasing context length of modern LLMs, this process is memory intensive. To address these challenges, fine-tuning strategies like LoRA (Low-Rank Adaptation) and QLoRA (Quantized Low-Rank Adaptation) limit the number of trainable parameters by adding low-rank parallel structures to the transformer layers. This enables you to train LLMs even on systems with low memory availability like commodity GPUs. However, this leads to an increased complexity because new dependencies have to be handled and training recipes and hyperparameters need to be adapted to the new techniques.
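
To make the idea behind LoRA concrete, the following is a minimal, framework-agnostic PyTorch sketch of a LoRA-style linear layer. It is purely illustrative and is not torchtune’s implementation; the rank and alpha values simply mirror the configuration used later in this post.

import torch.nn as nn

class LoRALinear(nn.Module):
    # Freeze the pretrained weight and learn a low-rank update B @ A scaled by alpha / rank
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad = False      # frozen pretrained weight
        self.lora_a = nn.Linear(in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)          # start training from a zero update
        self.scaling = alpha / rank

    def forward(self, x):
        # Base projection plus the scaled low-rank adaptation path
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))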

What businesses need today is user-friendly training recipes for these popular fine-tuning techniques, which provide abstractions to the end-to-end tuning process, addressing the common pitfalls in the most opinionated way.

How does torchtune help?

torchtune is a PyTorch-native library that aims to democratize and streamline the fine-tuning process for LLMs. By doing so, it makes it straightforward for researchers, developers, and organizations to adapt these powerful LLMs to their specific needs and constraints. It provides training recipes for a variety of fine-tuning techniques, which can be configured through YAML files. The recipes implement common fine-tuning methods (full-weight, LoRA, QLoRA) as well as other common tasks like inference and evaluation. They automatically apply a set of important features (FSDP, activation checkpointing, gradient accumulation, mixed precision) and are specific to a given model family (such as Meta Llama 3/3.1 or Mistral) as well as compute environment (single-node vs. multi-node).

Additionally, torchtune integrates with major libraries and frameworks like Hugging Face datasets, EleutherAI’s Eval Harness, and Weights & Biases. This helps address the requirements of the generative AI fine-tuning lifecycle, from data ingestion and multi-node fine-tuning to inference and evaluation. The following diagram shows a visualization of the steps we describe in this post.

Refer to the installation instructions and PyTorch documentation to learn more about torchtune and its concepts.

Solution overview

This post demonstrates the use of SageMaker Training for running torchtune recipes through task-specific training jobs on separate compute clusters. SageMaker Training is a comprehensive, fully managed ML service that enables scalable model training. It provides flexible compute resource selection, support for custom libraries, a pay-as-you-go pricing model, and self-healing capabilities. By managing workload orchestration, health checks, and infrastructure, SageMaker helps reduce training time and total cost of ownership.

The solution architecture incorporates the following key components to enhance security and efficiency in fine-tuning workflows:

  • Security enhancement – Training jobs are run within private subnets of your virtual private cloud (VPC), significantly improving the security posture of machine learning (ML) workflows.
  • Efficient storage solution – Amazon EFS is used to accelerate model storage and access across various phases of the ML workflow.
  • Customizable environment – We use custom containers in training jobs. The support in SageMaker for custom containers allows you to package all necessary dependencies, specialized frameworks, and libraries into a single artifact, providing full control over your ML environment.

The following diagram illustrates the solution architecture. Users initiate the process by calling the SageMaker control plane through APIs or command line interface (CLI) or using the SageMaker SDK for each individual step. In response, SageMaker spins up training jobs with the requested number and type of compute instances to run specific tasks. Each step defined in the diagram accesses torchtune recipes from an Amazon Simple Storage Service (Amazon S3) bucket and uses Amazon EFS to save and access model artifacts across different stages of the workflow.

By decoupling every torchtune step, we achieve a balance between flexibility and integration, allowing for both independent execution of steps and the potential for automating this process using seamless pipeline integration.

In this use case, we fine-tune a Meta Llama 3.1 8B model with LoRA. Subsequently, we run model inference, and optionally quantize and evaluate the model using torchtune and SageMaker Training.

Recipes, configs, datasets, and prompt templates are completely configurable and allow you to align torchtune to your requirements. To demonstrate this, we use a custom prompt template in this use case and combine it with the open source dataset Samsung/samsum from the Hugging Face hub.
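
As a rough illustration of the idea, the following hypothetical sketch shows what a SummarizeTemplate for the samsum dialogue data could look like. The class name matches the CustomTemplate.SummarizeTemplate reference used in the generation config later in this post, but the exact base class and method signature depend on the torchtune version, so treat this as a sketch rather than the template shipped with the workshop code.

# CustomTemplate.py - hypothetical sketch of a dialogue summarization prompt template
class SummarizeTemplate:
    # Prompt layout applied to every samsum sample before tokenization
    template = "Summarize this dialogue:\n{dialogue}\n---\nSummary:\n"

    @classmethod
    def format(cls, sample, column_map=None):
        # Allow the dataset column to be remapped if it is not named "dialogue"
        column = (column_map or {}).get("dialogue", "dialogue")
        return cls.template.format(dialogue=sample[column])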

We fine-tune the model using torchtune’s multi device LoRA recipe (lora_finetune_distributed) and use the SageMaker customized version of Meta Llama 3.1 8B default config (llama3_1/8B_lora).

Prerequisites

You need to complete the following prerequisites before you can run the SageMaker Jupyter notebooks:

  1. Create a Hugging Face access token to get access to the gated repo meta-llama/Meta-Llama-3.1-8B on Hugging Face.
  2. Create a Weights & Biases API key to access the Weights & Biases dashboard for logging and monitoring.
  3. Request a SageMaker service quota for 1x ml.p4d.24xlarge and 1x ml.g5.2xlarge.
  4. Create an AWS Identity and Access Management (IAM) role with managed policies AmazonSageMakerFullAccess, AmazonEC2FullAccess, AmazonElasticFileSystemFullAccess, and AWSCloudFormationFullAccess to give required access to SageMaker to run the examples. (This is for demonstration purposes. You should adjust this to your specific security requirements for production.)
  5. Create an Amazon SageMaker Studio domain (see Quick setup to Amazon SageMaker) to access Jupyter notebooks with the preceding role. Refer to the instructions to set permissions for Docker build.
  6. Log in to the notebook console and clone the GitHub repo:
$ git clone https://github.com/aws-samples/sagemaker-distributed-training-workshop.git
$ cd sagemaker-distributed-training-workshop/13-torchtune
  7. Run the notebook to set up the VPC and Amazon EFS using an AWS CloudFormation stack.

Review torchtune configs

The following figure illustrates the steps in our workflow.

You can look up the torchtune configs for your use case by directly using the tune CLI. For this post, we provide modified config files aligned with the SageMaker directory path structure:

sh-4.2$ cd config/
sh-4.2$ ls -ltr
-rw-rw-r-- 1 ec2-user ec2-user 1151 Aug 26 18:34 config_l3.1_8b_gen_orig.yaml
-rw-rw-r-- 1 ec2-user ec2-user 1172 Aug 26 18:34 config_l3.1_8b_gen_trained.yaml
-rw-rw-r-- 1 ec2-user ec2-user  644 Aug 26 18:49 config_l3.1_8b_quant.yaml
-rw-rw-r-- 1 ec2-user ec2-user 2223 Aug 28 14:53 config_l3.1_8b_lora.yaml
-rw-rw-r-- 1 ec2-user ec2-user 1223 Sep  4 14:28 config_l3.1_8b_eval_trained.yaml
-rw-rw-r-- 1 ec2-user ec2-user 1213 Sep  4 14:29 config_l3.1_8b_eval_original.yaml

torchtune uses these config files to select and configure the components (think models and tokenizers) during the execution of the recipes.

Build the container

As part of our example, we create a custom container to provide custom libraries like torch nightlies and torchtune. The custom image is built from the following Dockerfile:

sh-4.2$ cat Dockerfile
# Set default values for the build arguments
ARG ACCOUNTID
ARG REGION=us-west-2
# SageMaker PyTorch image for TRAINING
FROM ${ACCOUNTID}.dkr.ecr.${REGION}.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker
# Uninstall existing PyTorch packages
RUN pip uninstall torch torchvision transformer-engine -y
# Install pinned releases of PyTorch, torchao, and torchvision
RUN pip install --force-reinstall torch==2.4.1 torchao==0.4.0 torchvision==0.19.1

Run the 1_build_container.ipynb notebook up to the following command, which builds the image and pushes it to your Amazon ECR repository:

!sm-docker build . --repository accelerate:latest

sm-docker is a CLI tool designed for building Docker images in SageMaker Studio using AWS CodeBuild. We install the library as part of the notebook.

Next, we will run the 2_torchtune-llama3_1.ipynb notebook for all fine-tuning workflow tasks.

For every task, we review three artifacts:

  • torchtune configuration file
  • SageMaker task config with compute and torchtune recipe details
  • SageMaker task output

Run the fine-tuning task

In this section, we walk through the steps to run and monitor the fine-tuning task.

Run the fine-tuning job

The following code shows a shortened torchtune recipe configuration highlighting a few key components of the file for a fine-tuning job:

  • Model component including LoRA rank configuration
  • Meta Llama 3 tokenizer to tokenize the data
  • Checkpointer to read and write checkpoints
  • Dataset component to load the dataset
sh-4.2$ cat config_l3.1_8b_lora.yaml
# Model Arguments
model:
  _component_: torchtune.models.llama3_1.lora_llama3_1_8b
  lora_attn_modules: ['q_proj', 'v_proj']
  lora_rank: 8
  lora_alpha: 16

# Tokenizer
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /opt/ml/input/data/model/hf-model/original/tokenizer.model

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_files: [
    consolidated.00.pth
  ]
  …

# Dataset and Sampler
dataset:
  _component_: torchtune.datasets.samsum_dataset
  train_on_input: True
batch_size: 13

# Training
epochs: 1
gradient_accumulation_steps: 2

... and more ...

We use Weights & Biases for logging and monitoring our training jobs, which helps us track our model’s performance:

metric_logger:
  _component_: torchtune.utils.metric_logging.WandBLogger
  …

Next, we define a SageMaker task that will be passed to our utility function in the script create_pytorch_estimator. This script creates the PyTorch estimator with all the defined parameters.

In the task, we use the lora_finetune_distributed recipe with the config config-l3.1-8b-lora.yaml on an ml.p4d.24xlarge instance. The use_downloaded_model parameter controls whether the base model is first downloaded from Hugging Face or an already downloaded copy is reused. The image_uri parameter defines the URI of the custom container.

sagemaker_tasks={
    "fine-tune":{
        "hyperparameters":{
            "tune_config_name":"config-l3.1-8b-lora.yaml",
            "tune_action":"fine-tune",
            "use_downloaded_model":"false",
            "tune_recipe":"lora_finetune_distributed"
            },
        "instance_count":1,
        "instance_type":"ml.p4d.24xlarge",        
        "image_uri":"<accountid>.dkr.ecr.<region>.amazonaws.com/accelerate:latest"
    }
    ... and more ...
}
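
The create_pytorch_estimator and execute_task utility functions come with the workshop repository. The following is a minimal sketch, assuming the SageMaker Python SDK, of how such a helper could turn a task definition into a training job; the entry point, source directory, and IAM role are placeholder assumptions, and the actual implementation also wires up the VPC, Amazon EFS, and warm pool settings.

from sagemaker.pytorch import PyTorch

def create_pytorch_estimator(hyperparameters, instance_count, instance_type, image_uri, **kwargs):
    # Hypothetical sketch: build a SageMaker PyTorch estimator from a task definition
    return PyTorch(
        entry_point="train.py",      # assumed launcher script that invokes the torchtune recipe
        source_dir="scripts",        # assumed directory containing the launcher
        role="arn:aws:iam::<accountid>:role/sagemaker-execution-role",  # placeholder role ARN
        image_uri=image_uri,
        instance_count=instance_count,
        instance_type=instance_type,
        hyperparameters=hyperparameters,
    )

def execute_task(estimator, inputs=None):
    # Start the training job; inputs would typically point to the EFS or S3 channels
    estimator.fit(inputs=inputs)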

To create and run the task, run the following code:

Task="fine-tune"
estimator=create_pytorch_estimator(**sagemaker_tasks[Task])
execute_task(estimator)

The following code shows the task output and reported status:

# Refer-Output

2024-08-16 17:45:32 Starting - Starting the training job...
...
...

1|140|Loss: 1.4883038997650146:  99%|█████████▉| 141/142 [06:26<00:02,  2.47s/it]
1|141|Loss: 1.4621509313583374:  99%|█████████▉| 141/142 [06:26<00:02,  2.47s/it]

Training completed with code: 0
2024-08-26 14:19:09,760 sagemaker-training-toolkit INFO     Reporting training SUCCESS

The final model is saved to Amazon EFS, which makes it available without download time penalties.

Monitor the fine-tuning job

You can monitor various metrics such as loss and learning rate for your training run through the Weights & Biases dashboard. The following figures show the results of the training run where we tracked GPU utilization, GPU memory utilization, and loss curve.

As shown in the following graph, to optimize memory usage, torchtune uses only rank 0 to initially load the model into CPU memory; rank 0 is therefore responsible for loading the model weights from the checkpoint.

The example is optimized to use GPU memory to its maximum capacity. Increasing the batch size further will lead to CUDA out-of-memory (OOM) errors.

The run took about 13 minutes to complete for one epoch, resulting in the loss curve shown in the following graph.

Run the model generation task

In the next step, we use the previously fine-tuned model weights to generate the answer to a sample prompt and compare it to the base model.

The following code shows the configuration of the generate recipe config_l3.1_8b_gen_trained.yaml. The following are key parameters:

  • FullModelMetaCheckpointer – We use this to load the trained model checkpoint meta_model_0.pt from Amazon EFS
  • CustomTemplate.SummarizeTemplate – We use this to format the prompt for inference
# torchtune - trained model generation config - config_l3.1_8b_gen_trained.yaml
model:
  _component_: torchtune.models.llama3_1.llama3_1_8b
  
checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: /opt/ml/input/data/model/
  checkpoint_files: [
    meta_model_0.pt
  ]
  …

# Generation arguments; defaults taken from gpt-fast
instruct_template: CustomTemplate.SummarizeTemplate

... and more ...

Next, we configure the SageMaker task to run on a single ml.g5.2xlarge instance:

prompt=r'{"dialogue":"Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure \r\nAmanda: I will bring you tomorrow :-)"}'

sagemaker_tasks={
    "generate_inference_on_trained":{
        "hyperparameters":{
            "tune_config_name":"config_l3.1_8b_gen_trained.yaml",
            "tune_action":"generate-trained",
            "use_downloaded_model":"true",
            "prompt":json.dumps(prompt)
            },
        "instance_count":1,
        "instance_type":"ml.g5.2xlarge",
        "image_uri":"<accountid>.dkr.ecr.<region>.amazonaws.com/accelerate:latest"
    }
}

In the output of the SageMaker task, we see the model summary output and some stats like tokens per second:

#Refer- Output
...
Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure \r\nAmanda: I will bring you tomorrow :-)

Summary:
Amanda baked cookies. She will bring some to Jerry tomorrow.

INFO:torchtune.utils.logging:Time for inference: 1.71 sec total, 7.61 tokens/sec
INFO:torchtune.utils.logging:Memory used: 18.32 GB

... and more ...

We can generate inference from the original model using the original model artifact consolidated.00.pth:

# torchtune - trained original generation config - config_l3.1_8b_gen_orig.yaml
…  
checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: /opt/ml/input/data/model/hf-model/original/
  checkpoint_files: [
    consolidated.00.pth
  ]
  
... and more ...

The following code shows the comparison output from the base model run with the SageMaker task (generate_inference_on_original). We can see that the fine-tuned model is performing subjectively better than the base model by also mentioning that Amanda baked the cookies.

# Refer-Output 
---
Summary:
Jerry tells Amanda he wants some cookies. Amanda says she will bring him some cookies tomorrow.

... and more ...

Run the model quantization task

To speed up the inference and decrease the model artifact size, we can apply post-training quantization. torchtune relies on torchao for post-training quantization.

We configure the recipe to use Int8DynActInt4WeightQuantizer, which refers to int8 dynamic per token activation quantization combined with int4 grouped per axis weight quantization. For more details, refer to the torchao implementation.

# torchtune model quantization config - config_l3.1_8b_quant.yaml
model:
  _component_: torchtune.models.llama3_1.llama3_1_8b

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  …

quantizer:
  _component_: torchtune.utils.quantization.Int8DynActInt4WeightQuantizer
  groupsize: 256

We again use a single ml.g5.2xlarge instance and use SageMaker warm pool configuration to speed up the spin-up time for the compute nodes:

sagemaker_tasks={
"quantize_trained_model":{
        "hyperparameters":{
            "tune_config_name":"config_l3.1_8b_quant.yaml",
            "tune_action":"run-quant",
            "use_downloaded_model":"true"
            },
        "instance_count":1,
        "instance_type":"ml.g5.2xlarge",
        "image_uri":"<accountid>.dkr.ecr.<region>.amazonaws.com/accelerate:latest"
    }
}

In the output, we see the location of the quantized model and how much memory we saved due to the process:

#Refer-Output
...

linear: layers.31.mlp.w1, in=4096, out=14336
linear: layers.31.mlp.w2, in=14336, out=4096
linear: layers.31.mlp.w3, in=4096, out=14336
linear: output, in=4096, out=128256
INFO:torchtune.utils.logging:Time for quantization: 7.40 sec
INFO:torchtune.utils.logging:Memory used: 22.97 GB
INFO:torchtune.utils.logging:Model checkpoint of size 8.79 GB saved to /opt/ml/input/data/model/quantized/meta_model_0-8da4w.pt

... and more ...

You can run model inference on the quantized model meta_model_0-8da4w.pt by updating the inference-specific configurations.

Run the model evaluation task

Finally, let’s evaluate our fine-tuned model in an objective manner by running an evaluation on the validation portion of our dataset.

torchtune integrates with EleutherAI’s evaluation harness and provides the eleuther_eval recipe.

For our evaluation, we use a custom task for the evaluation harness to evaluate the dialogue summarizations using the rouge metrics.

The recipe configuration points the evaluation harness to our custom evaluation task:

# torchtune trained model evaluation config - config_l3.1_8b_eval_trained.yaml

model:
...

include_path: "/opt/ml/input/data/config/tasks"
tasks: ["samsum"]
...

The following code is the SageMaker task that we run on a single ml.p4d.24xlarge instance:

sagemaker_tasks={
"evaluate_trained_model":{
        "hyperparameters":{
            "tune_config_name":"config_l3.1_8b_eval_trained.yaml",
            "tune_action":"run-eval",
            "use_downloaded_model":"true",
            },
        "instance_count":1,
        "instance_type":"ml.p4d.24xlarge",
    }
}

Run the model evaluation on ml.p4d.24xlarge:

Task="evaluate_trained_model"
estimator=create_pytorch_estimator(**sagemaker_tasks[Task])
execute_task(estimator)

The following tables show the task output for the fine-tuned model as well as the base model.

The following output is for the fine-tuned model.

Tasks    Version  Filter  n-shot  Metric  Direction  Value     ± Stderr
samsum   2        none    None    rouge1             45.8661   ± N/A
                  none    None    rouge2             23.6071   ± N/A
                  none    None    rougeL             37.1828   ± N/A

The following output is for the base model.

Tasks    Version  Filter  n-shot  Metric  Direction  Value     ± Stderr
samsum   2        none    None    rouge1             33.6109   ± N/A
                  none    None    rouge2             13.0929   ± N/A
                  none    None    rougeL             26.2371   ± N/A

Our fine-tuned model achieves an improvement of approximately 46% on the summarization task, which is approximately 12 points better than the baseline.

Clean up

Complete the following steps to clean up your resources:

  1. Delete any unused SageMaker Studio resources.
  2. Optionally, delete the SageMaker Studio domain.
  3. Delete the CloudFormation stack to delete the VPC and Amazon EFS resources.

Conclusion

In this post, we discussed how you can fine-tune Meta Llama-like architectures using various fine-tuning strategies on your preferred compute and libraries, using custom dataset prompt templates with torchtune and SageMaker. This architecture gives you a flexible way of running fine-tuning jobs that are optimized for GPU memory and performance. We demonstrated this through fine-tuning a Meta Llama3.1 model using P4 and G5 instances on SageMaker and used observability tools like Weights & Biases to monitor loss curve, as well as CPU and GPU utilization.

We encourage you to use SageMaker training capabilities and Meta’s torchtune library to fine-tune Meta Llama-like architectures for your specific business use cases. To stay informed about upcoming releases and new features, refer to the torchtune GitHub repo and the official Amazon SageMaker training documentation.

Special thanks to Kartikay Khandelwal (Software Engineer at Meta), Eli Uriegas (Engineering Manager at Meta), Raj Devnath (Sr. Product Manager Technical at AWS) and Arun Kumar Lokanatha (Sr. ML Solution Architect at AWS) for their support to the launch of this post.


About the Authors

Kanwaljit Khurmi is a Principal Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.

Roy Allela is a Senior AI/ML Specialist Solutions Architect at AWS. He helps AWS customers—from small startups to large enterprises—train and deploy large language models efficiently on AWS.

Matthias Reso is a Partner Engineer at PyTorch working on open source, high-performance model optimization, distributed training (FSDP), and inference. He is a co-maintainer of llama-recipes and TorchServe.

Trevor Harvey is a Principal Specialist in Generative AI at Amazon Web Services (AWS) and an AWS Certified Solutions Architect – Professional. He serves as a voting member of the PyTorch Foundation Governing Board, where he contributes to the strategic advancement of open-source deep learning frameworks. At AWS, Trevor works with customers to design and implement machine learning solutions and leads go-to-market strategies for generative AI services.


Integrate Amazon Bedrock Knowledge Bases with Microsoft SharePoint as a data source

Amazon Bedrock Knowledge Bases provides foundation models (FMs) and agents in Amazon Bedrock with contextual information from your company’s private data sources for Retrieval Augmented Generation (RAG) to deliver more relevant, accurate, and customized responses. Amazon Bedrock Knowledge Bases offers a fully managed RAG experience.

The data sources that can be connected to as knowledge bases are continuously expanding. This post showcases how to use one of these data source connectors: Microsoft SharePoint, an integrated content management and collaboration tool that many organizations use for storing, organizing, and sharing their internal data. See Data source connectors for the full list of supported data source connectors.

Solution overview

The following are some pertinent features of the SharePoint data source within Amazon Bedrock Knowledge Bases:

  • It provides access to the information stored in SharePoint. The RAG architecture queries and retrieves relevant information from the SharePoint source to provide contextual responses based on the user’s input.
  • It provides the ability to extract structured data, metadata, and other information from documents ingested from SharePoint to provide relevant search results based on the user query.
  • It provides the ability to sync incremental SharePoint content updates on an ongoing basis.
  • It provides source attribution to the response generated by the FM.

In the following sections, we walk through the steps to create a knowledge base, configure your data source, and test the solution.

Prerequisites

The following are the prerequisites necessary to implement Amazon Bedrock Knowledge Bases with SharePoint as a connector:

Create a knowledge base and connect to the data source

Complete the following steps to set up a knowledge base on Amazon Bedrock and connect to a SharePoint data source:

  1. On the Amazon Bedrock console, choose Knowledge bases in the navigation pane.
  2. Choose Create knowledge base.

kb-landing-view

  3. In the Knowledge base details section, optionally change the default name and enter a description for your knowledge base.
  4. In the IAM permissions section, select an IAM role that provides Amazon Bedrock permission to access other AWS services. You can let Amazon Bedrock create the service role or choose a custom role that you have created.
  5. In the Choose data source section, select SharePoint.
  6. Optionally, add tags to your knowledge base. For more information, see Tag resources.
  7. Choose Next.

kb-details-1

  8. In the Name and Description section, optionally change the default data source name and enter a description of the data source.
  9. In the Source section, provide the following information:
    1. For Site URLs, enter the site URLs to use for crawling and indexing the content for RAG.
    2. For Domain, enter the domain name associated with the data source. For example, if the site URL is https://deloittedasits.sharepoint.com/xyz.aspx, the domain value would be deloittedasits.
    3. Under Advanced settings, keep the default selections.

kb-details-name-desc

While converting your data into embeddings, Amazon Bedrock encrypts your data with a key that AWS owns and manages by default. To use your own AWS Key Management Service (AWS KMS) key, choose Customize encryption settings (Advanced) and choose a key. For more information, see Encryption of transient data storage during data ingestion.

You can also choose from the following options for the data deletion policy for your data source:

  • Delete – Deletes all underlying data belonging to the data source from the vector store upon deletion of a knowledge base or data source resource. Note that the vector store itself is not deleted, only the underlying data. This flag is ignored if an AWS account is deleted.
  • Retain – Retains all underlying data in your vector store upon deletion of a knowledge base or data source resource.

For more information on managing your knowledge base, see Manage a data source.

ML-17173-kb-details-advanced-settings

  10. In the Authentication section, the supported authentication method is set to OAuth 2.0.
    1. For Tenant ID, enter your tenant ID. Refer to the section Register a new application in the Microsoft Azure Portal of this post to get the tenant ID.
    2. For AWS Secrets Manager secret, enter an AWS Secrets Manager secret. Refer to the section Create a Secrets Manager secret for the SharePoint data source of this post to create the secret.

The SharePoint data source will need credentials to connect to the SharePoint Online site using the Microsoft Graph API. To facilitate this, create a new Secrets Manager secret. These credentials will not be used in any access logs for the SharePoint Online Site.

kb-details-authentication

  11. In the Metadata Settings section, optionally select any content types that you want to include or exclude.

kb-details-metadata

  12. In the Content chunking and parsing section, select Default.

kb-details-content-chunking

  13. Choose Next.
  14. In the Embeddings model section, select Titan Embeddings G1 – Text or another embeddings model as appropriate.
  15. In the Vector database section, select Quick create a new vector store to create a vector store for the embeddings.
  16. Choose Next.

kb-details-embeddings

  17. On the Review and create page, verify the selections you made and choose Create.

The knowledge base creation should be complete.

kn-created-success

The knowledge base with SharePoint as the data source is now created. However, the data source needs to be synced in order to crawl the site URLs and index the associated content.

  18. To initiate this process, on the knowledge base details page, select your data source and choose Sync.

kb-sync
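
You can also start the sync programmatically. The following is a minimal sketch using boto3 and the Amazon Bedrock StartIngestionJob API; the knowledge base ID and data source ID are placeholders that you can copy from the knowledge base details page.

import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Placeholder IDs - replace with the values from your knowledge base
response = bedrock_agent.start_ingestion_job(
    knowledgeBaseId="your-knowledge-base-id",
    dataSourceId="your-data-source-id",
)

job = response["ingestionJob"]
print(job["ingestionJobId"], job["status"])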

Register a new application in the Microsoft Azure Portal

In this section, we register a new application in the Microsoft Azure Portal. We capture the Tenant ID from this step to use when configuring the data source for Knowledge Base for Amazon Bedrock. Complete the following steps:

  1. Open the Azure Portal and log in with your Microsoft account. If you don’t have an account, you can create one or contact your organization’s administration team.
  2. Choose New registration.
  3. Provide the following information:
    1. For Name, provide the name for your application. Let’s refer to this application as TargetApp. Amazon Bedrock Knowledge Bases uses TargetApp to connect to the SharePoint site to crawl and index the data.
    2. For Who can use this application or access this API, choose Accounts in this organizational directory only (<Tenant name> only – Single tenant).
    3. Choose Register.
    4. Note down the application (client) ID and the directory (tenant) ID on the Overview page. You’ll need them later when asked for TargetApp-ClientId and TenantId.
  4. Choose API permissions in the navigation pane.
  5. Configure the permissions as follows:
    1. Choose Add a permission.
    2. Choose Microsoft Graph.
    3. Choose Delegated permissions.
    4. Choose Read.All in the User section.
    5. Choose Read.All in the GroupMember section.
    6. Choose FullControl.All in the Sites section.
    7. Choose Add permissions. This permission allows the app to read data in your organization’s directory about the signed-in user.
    8. On the options menu (three dots), choose Remove permission.
    9. Remove the original Read – Delegated permission.
    10. Choose Grant admin consent for the default directory.

SPO-register-app

  6. Choose Certificates & secrets in the navigation pane.
    1. Choose New client secret.
    2. For Description, enter a description, such as description of my client secret.
    3. Choose a value for Expires. In production, you’ll need to manually rotate your secret before it expires.
    4. Choose Add.
    5. Note down the value for your new secret. You’ll need it later when asked for your client secret (TargetApp-ClientSecret).
  7. Optionally, choose Owners to add any additional owners for the application. Owners will be able to manage permissions of the Azure AD app (TargetApp).

Create a Secrets Manager secret for the SharePoint data source

Complete the following steps to create a Secrets Manager secret to connect to the SharePoint online sites listed as site URLs within the data source:

  1. On the Secrets Manager console, choose Store a new secret.
  2. For Secret type, select Other type of secret.
  3. For Key/value pairs, enter the following:
    1. username
    2. password
    3. clientId
    4. clientSecret
  4. For Encryption key, choose aws/secretsmanager.
  5. Choose Next.
  6. In the Secret name and description section, enter the name of the secret and an optional description.
  7. Add any associated tags in the Tags section.
  8. Leave Resource permissions and Replication secret as default.
  9. Choose Next.
  10. In the Configure rotation section, leave as default or modify according to your organizational policies.
  11. Choose Next.
  12. Review the options you selected and choose Store.
  13. On the secrets detail page, note your secret ARN value to be used as the secret when creating the Knowledge Base for Amazon Bedrock.

kb-secret
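
If you prefer to create the secret programmatically, the following is a minimal sketch using boto3; the secret name and every credential value are placeholders, with the clientId and clientSecret coming from the Azure AD app (TargetApp) registered in the previous section.

import json
import boto3

secretsmanager = boto3.client("secretsmanager")

# Placeholder values - supply your SharePoint credentials and TargetApp details
response = secretsmanager.create_secret(
    Name="sharepoint-datasource-secret",
    SecretString=json.dumps({
        "username": "sharepoint-user@yourtenant.onmicrosoft.com",
        "password": "your-password",
        "clientId": "TargetApp-ClientId",
        "clientSecret": "TargetApp-ClientSecret",
    }),
)

print(response["ARN"])  # use this ARN as the secret when creating the knowledge base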

Test the solution

Complete the following steps to test the knowledge base you created:

  1. On the Amazon Bedrock console, choose Knowledge bases in the navigation pane.
  2. Select the knowledge base you created and choose Test.

kb-test-1

  3. Choose an appropriate model for testing and choose Apply.

kb-test-2

  4. Enter your question for the content housed in the SharePoint site.

kb-test-3
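
Beyond the console test pane, you can query the knowledge base programmatically. The following is a minimal sketch using boto3 and the RetrieveAndGenerate API; the question, knowledge base ID, and model ARN are placeholders to replace with your own values.

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "What is our travel reimbursement policy?"},  # sample question
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "your-knowledge-base-id",  # placeholder
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/your-model-id",  # placeholder
        },
    },
)

print(response["output"]["text"])
for citation in response.get("citations", []):
    print(citation)  # source attribution for the generated answer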

Clean up

If you created a new knowledge base to experiment using this post and don’t plan to use it further, delete the knowledge base so that your AWS account doesn’t accumulate costs. For instructions, see Manage a knowledge base.

Conclusion

In this post, we showed you how to configure Amazon Bedrock Knowledge Bases with SharePoint Online as a data source. By connecting SharePoint Online as a data source, employees can interact with the organization’s knowledge and data stored in SharePoint using natural language, making it straightforward to find relevant information, extract key points, and derive valuable insights. This can significantly improve productivity, decision-making, and knowledge sharing within the organization.

Try this feature on the Amazon Bedrock console today! See Amazon Bedrock Knowledge Bases to learn more.


About the Authors

Surendar Gajavelli is a Sr. Solutions Architect based out of Nashville, Tennessee. He is a passionate technology enthusiast who enjoys working with customers and helping them build innovative solutions.

Abhi Patlolla is a Sr. Solutions Architect based out of the New York City region, helping customers in their cloud transformation, AI/ML, and data initiatives. He is a strategic and technical leader, advising executives and engineers on cloud strategies to foster innovation and positive impact.


Revolutionize logo design creation with Amazon Bedrock: Embracing generative art, dynamic logos, and AI collaboration

In the field of technology and creative design, logo design and creation has adapted and evolved at a rapid pace. From the hieroglyphs of ancient Egypt to the sleek minimalism of today’s tech giants, the visual identities that define our favorite brands have undergone a remarkable transformation.

Today, the world of creative design is once again being transformed by the emergence of generative AI. Designers and brands now have opportunities to push the boundaries of creativity, crafting logos that are not only visually stunning but also responsive to their environments and tailored to the preferences of their target audiences.

Amazon Bedrock enables access to powerful generative AI models like Stable Diffusion through a user-friendly API. These models can be integrated into the logo design workflow, allowing designers to rapidly ideate, experiment, generate, and edit a wide range of unique visual images. Integrating it with the range of AWS serverless computing, networking, and content delivery services like AWS Lambda, Amazon API Gateway, and AWS Amplify facilitates the creation of an interactive tool to generate dynamic, responsive, and adaptive logos.

In this post, we walk through how AWS can help accelerate a brand’s creative efforts with access to a powerful image-to-image model from Stable Diffusion available on Amazon Bedrock to interactively create and edit art and logo images.

Image-to-image model

Stability AI’s image-to-image model, SDXL, is a deep learning model that generates images based on text descriptions, images, or other inputs. It first converts the text into numerical values that summarize the prompt, then uses those values to generate an image representation. Finally, it upscales the image representation into a high-resolution image. Stable Diffusion can also generate new images based on an initial image and a text prompt. For example, it can fill in a line drawing with colors, lighting, and a background that makes sense for the subject. Stable Diffusion can also be used for inpainting (modifying or filling in parts of an existing image) and outpainting (extending an image beyond its original borders).

One of its primary applications lies in advertising and marketing, where it can be used to create personalized ad campaigns and an unlimited number of marketing assets. Businesses can generate visually appealing and tailored images based on specific prompts, enabling them to stand out in a crowded marketplace and effectively communicate their brand message. In the media and entertainment sector, filmmakers, artists, and content creators can use this as a tool for developing creative assets and ideating with images.

Solution overview

The following diagram illustrates the solution architecture.

This architecture workflow involves the following steps:

  1. In the frontend UI, a user chooses from one of two options to get started:
    1. Generate an initial image.
    2. Provide an initial image link.
  2. The user provides a text prompt to edit the given image.
  3. The user chooses Call API to invoke API Gateway to begin processing on the backend.
  4. The API invokes a Lambda function, which uses the Amazon Bedrock API to invoke the Stability AI SDXL 1.0 model.
  5. The invoked model generates an image, and the output image is stored in an Amazon Simple Storage Service (Amazon S3) bucket.
  6. The backend services return the output image to the frontend UI.
  7. The user can use this generated image as a reference image and edit it, generate a new image, or provide a different initial image. They can continue this process until the model produces a satisfactory output.
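
To illustrate steps 4 and 5, the following is a minimal sketch of how a Lambda function could invoke the Stability AI SDXL 1.0 model through the Amazon Bedrock runtime API and store the result in Amazon S3. The bucket name, object key, and request parameters are simplified assumptions; the actual Lambda function in this solution (described later in this post) also handles image-to-image editing and the different action inputs.

import base64
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")
s3 = boto3.client("s3")

def generate_image(prompt, bucket="your-output-bucket", key="generated/logo.png"):
    # Invoke the Stability AI SDXL 1.0 model with a text prompt (text-to-image)
    response = bedrock_runtime.invoke_model(
        modelId="stability.stable-diffusion-xl-v1",
        body=json.dumps({
            "text_prompts": [{"text": prompt}],
            "cfg_scale": 7,
            "steps": 30,
            "style_preset": "digital-art",
        }),
    )
    payload = json.loads(response["body"].read())
    image_bytes = base64.b64decode(payload["artifacts"][0]["base64"])

    # Store the generated image in the output S3 bucket
    s3.put_object(Bucket=bucket, Key=key, Body=image_bytes, ContentType="image/png")
    return f"s3://{bucket}/{key}"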

Prerequisites

To set up this solution, complete the following prerequisites:

  1. Pick an AWS Region where you want to deploy the solution. We recommend using the us-east-1 Region.
  2. Obtain access to the Stability SDXL 1.0 model in Amazon Bedrock if you don’t have it already. For instructions, see Access Amazon Bedrock foundation models.
  3. If you prefer to use a separate S3 bucket for this solution, create a new S3 bucket.
  4. If you prefer to use localhost for testing the application instead of Amplify, make sure python3 is installed in your local machine.

Deploy the solution

To deploy the backend resources for the solution, we create a stack using an AWS CloudFormation template. You can upload the template directly, or upload it to an S3 bucket and link to it during the stack creation process. During the creation process, provide the appropriate variable names for apiGatewayName, apiGatewayStageName, s3BucketName, and lambdaFunctionName. If you created a new S3 bucket earlier, input that name in s3BucketName – this bucket is where output images are stored. When the stack creation is complete, all the backend resources are ready to be connected to the frontend UI.

The frontend resources play an integral part in creating an interactive environment for your end-users. Complete the following steps to integrate the frontend and backend:

  1. When the CloudFormation stack deployment is complete, open the created API from the API Gateway console.

Step 1

  2. Choose Stages in the navigation pane, and on the Stage actions menu, choose Generate SDK.

Step 2

  3. For Platform, choose JavaScript.

  4. Download and unzip the JavaScript SDK .zip file, which contains a folder called apiGateway-js-sdk.
  5. Download the frontend UI index.html file and place it in the unzipped folder.

This file is configured to integrate with the JavaScript SDK by simply placing it in the folder.

  6. After the index.html is placed in the folder, select the content of the folder and compress it into a .zip file (don’t compress the apiGateway-js-sdk folder itself).

  7. On the Amplify console, choose Create new app.
  8. Select Deploy without Git, then choose Next.

  9. Upload the compressed .zip file, and change the application name and branch name if preferred.
  10. Choose Save and deploy.

The deployment will take a few seconds. When deployment is complete, there will be a domain URL that you can use to access the application. The application is ready to be tested at the domain URL.

CloudFormation template overview

Before we move on to testing the solution, let’s explore the CloudFormation template. This template sets up an API Gateway API with appropriate rules and paths, a Lambda function, and necessary permissions in AWS Identity and Access Management (IAM). Let’s dive deep into the content of the CloudFormation template to understand the resources created:

  • PromptProcessingAPI – This is the main API Gateway REST API. This API will be used to invoke the Lambda function. Other API Gateway resources, methods, and schemas created in the CloudFormation template are attached to this API.
  • ActionResource, ActionInputResource, PromptResource, PromptInputResource, and ProxyResource – These are API Gateway resources that define the URL path structure for the API. The path structure is /action/{actionInput}/prompt/{promptInput}/{proxy+}. The {promptInput} value is a placeholder variable for the prompt that users input in the frontend. Similarly, {actionInput} is the choice the user selected for how they want to generate the image. These are used in the backend Lambda function to process and generate images.
  • ActionInputMethod, PromptInputMethod, and ProxyMethod – These are API Gateway methods that define the integration with the Lambda function for the POST HTTP method.
  • ActionMethodCORS, ActionInputMethodCORS, PromptMethodCORS, PromptInputMethodCORS, and ProxyMethodCORS – These are API Gateway methods that handle cross-origin resource sharing (CORS) support. These resources are crucial in integrating the frontend UI with backend resources. For more information on CORS, see What is CORS?
  • ResponseSchema and RequestSchema – These are API Gateway models that define the expected JSON schema for the response and request payloads, respectively.
  • Default4xxResponse and Default5xxResponse – These are the gateway responses that define the default response behavior for 4xx and 5xx HTTP status codes, respectively.
  • ApiDeployment – This resource deploys the API Gateway API after all of the preceding configurations have been set. After the deployment, the API is ready to use.
  • LambdaFunction – This creates a Lambda function and specifies the runtime, the service role for Lambda, and the limit for reserved concurrent executions.
  • LambdaPermission1, LambdaPermission2, and LambdaPermission3 – These are permissions that allow the API Gateway API to invoke the Lambda function.
  • LambdaExecutionRole and lambdaLogGroup – The first resource is the IAM role attached to the Lambda function, allowing it to call other AWS services such as Amazon S3 and Amazon Bedrock. The second resource configures the Lambda function log group in Amazon CloudWatch.

Lambda function explanation

Let’s dive into the details of the Python code that generates and manipulates images using the Stability AI model. There are three ways of using the Lambda function: provide a text prompt to generate an initial image, upload an image and include a text prompt to adjust it, or reupload a generated image and include a prompt to adjust it.

The code contains the following constants:

  • negative_prompts – A list of negative prompts used to guide the image generation.
  • style_preset – The style preset to use for image generation (for example, photographic, digital-art, or cinematic). We used digital-art for this post.
  • clip_guidance_preset – The Contrastive Language-Image Pretraining (CLIP) guidance preset to use (for example, FAST_BLUE, FAST_GREEN, NONE, SIMPLE, SLOW, SLOWER, SLOWEST).
  • sampler – The sampling algorithm to use for image generation (for example, DDIM, DDPM, K_DPMPP_SDE, K_DPMPP_2M, K_DPMPP_2S_ANCESTRAL, K_DPM_2, K_DPM_2_ANCESTRAL, K_EULER, K_EULER_ANCESTRAL, K_HEUN, K_LMS).
  • width – The width of the generated image.

handler(event, context) is the main entry point for the Lambda function. It processes the input event, which contains the promptInput and actionInput parameters. Based on the actionInput, it performs one of the following actions:

  • For GenerateInit, it generates a new image using the generate_image_with_bedrock function, uploads it to Amazon S3, and returns the file name and a pre-signed URL.
  • When you upload an existing image, it performs one of the following actions:
    • s3URL – It retrieves an image from a pre-signed S3 URL, generates a new image using the generate_image_with_bedrock function, uploads the new image to Amazon S3, and returns the file name and a pre-signed URL.
    • UseGenerated – It retrieves an image from a pre-signed S3 URL, generates a new image using the generate_image_with_bedrock function, uploads the new image to Amazon S3, and returns the file name and a pre-signed URL.
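The following is a minimal sketch of this dispatch logic, not the post’s actual Lambda code. The event access pattern, the load_reference_image_b64 helper, and the bucket name are assumptions; the real function depends on how the API Gateway integration maps the {actionInput} and {promptInput} path parameters into the event.

import base64
import json
import uuid

import boto3

s3 = boto3.client("s3")
BUCKET = "my-output-images-bucket"  # assumption: configured via an environment variable in practice

def handler(event, context):
    # Assumption: the integration request maps the path parameters into these keys.
    action = event["actionInput"]
    prompt = event["promptInput"]

    init_image_b64 = None
    if action != "GenerateInit":
        # For s3URL / UseGenerated, fetch the reference image from its pre-signed S3 URL
        # and base64-encode it (hypothetical helper, details omitted).
        init_image_b64 = load_reference_image_b64(event)

    # Described in the next section: calls the SDXL 1.0 model on Amazon Bedrock.
    image_b64 = generate_image_with_bedrock(prompt, init_image_b64)

    key = f"generated/{uuid.uuid4()}.png"
    s3.put_object(Bucket=BUCKET, Key=key, Body=base64.b64decode(image_b64))
    url = s3.generate_presigned_url(
        "get_object", Params={"Bucket": BUCKET, "Key": key}, ExpiresIn=3600
    )
    return {"statusCode": 200, "body": json.dumps({"fileName": key, "url": url})}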

The function generate_image_with_bedrock(prompt, init_image_b64=None) generates an image using the Amazon Bedrock runtime service, which includes the following actions:

  • If an initial image is provided (base64-encoded), it uses that as the starting point for the image generation.
  • If no initial image is provided, it generates a new image based on the provided prompt.
  • The function sets various parameters for the image generation, such as the text prompts, configuration, and sampling method.
  • It then invokes the Amazon Bedrock model, retrieves the generated image as a base64-encoded string, and returns it.

To obtain more personalized outputs, you can adjust the hyperparameter values in the function:

  • text_prompts – This is a list of dictionaries, where each dictionary contains a text prompt and an associated weight. For a positive text prompt, one that you would like to associate to the output image, weight is set as 1.0. For all of the negative text prompts, weight is set as -1.0.
  • cfg_scale – This parameter controls the potential for randomness in the image. The default is 7, and 10 seems to work well from our observations. A higher value means the image will be more influenced by the text, but a value that’s too high or too low will result in visually poor-quality outputs.
  • init_image – This parameter is a base64-encoded string representing an initial image. The model uses this image as a starting point and modifies it based on the text prompts. For generating the first image, this parameter is not used.
  • start_schedule – This parameter controls the strength of the noise added to the initial image at the start of the generation process. A value of 0.6 means that the initial noise will be relatively low.
  • steps – This parameter specifies the number of steps (iterations) the model should take during the image generation process. In this case, it’s set to 50 steps.
  • style_preset – This parameter specifies a predefined style or aesthetic to apply to the generated image. Because we’re generating logo images, we use digital-art.
  • clip_guidance_preset – This parameter specifies a predefined guidance setting for the CLIP model, which is used to guide the image generation process based on the text prompts.
  • sampler – This parameter specifies the sampling algorithm used during the image generation process to repeatedly denoise the image to produce a high-quality output.
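Putting these hyperparameters together, the following is a minimal sketch of what generate_image_with_bedrock could look like using the Amazon Bedrock Runtime invoke_model API. The negative prompts, parameter values, and model ID reflect the description above but are assumptions rather than the post’s exact code; check the request schema for the SDXL version enabled in your account.

import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

negative_prompts = ["poorly rendered", "low quality", "disfigured"]  # assumed examples

def generate_image_with_bedrock(prompt, init_image_b64=None):
    body = {
        "text_prompts": [
            {"text": prompt, "weight": 1.0},
            *[{"text": p, "weight": -1.0} for p in negative_prompts],
        ],
        "cfg_scale": 10,
        "steps": 50,
        "style_preset": "digital-art",
        "clip_guidance_preset": "FAST_GREEN",
        "sampler": "K_DPMPP_2M",
    }
    if init_image_b64:
        # Image-to-image: start from the supplied image, adding noise per start_schedule.
        body["init_image"] = init_image_b64
        body["start_schedule"] = 0.6
    else:
        # Text-to-image: generate from the prompt alone.
        body["width"] = 1024
        body["height"] = 1024

    response = bedrock_runtime.invoke_model(
        modelId="stability.stable-diffusion-xl-v1",
        body=json.dumps(body),
    )
    payload = json.loads(response["body"].read())
    return payload["artifacts"][0]["base64"]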

Test and evaluate the application

The following screenshot shows a simple UI. You can choose to either generate a new image or edit an image using text prompts.

The following screenshots show iterations of sample logos we created using the UI. The text prompts are included under each image.

Clean up

To clean up, delete the CloudFormation stack and the S3 bucket you created.

Conclusion

In this post, we explored how you can use Stability AI and Amazon Bedrock to generate and edit images. By following the instructions and using the provided CloudFormation template and the frontend code, you can generate unique and personalized images and logos for your business. Try generating and editing your own logos, and let us know what you think in the comments. To explore more AI use cases, refer to AI Use Case Explorer.


About the authors

Pyone Thant Win is a Partner Solutions Architect focused on AI/ML and computer vision. Pyone is passionate about enabling AWS Partners through technical best practices and using the latest technologies to showcase the art of possible.

Nneoma Okoroafor is a Partner Solutions Architect focused on helping partners follow best practices by conducting technical validations. She specializes in assisting AI/ML and generative AI partners, providing guidance to make sure they’re using the latest technologies and techniques to deliver innovative solutions to customers.

Read More

Reinvent personalization with generative AI on Amazon Bedrock using task decomposition for agentic workflows

Reinvent personalization with generative AI on Amazon Bedrock using task decomposition for agentic workflows

Personalization has become a cornerstone of delivering tangible benefits to businesses and their customers. Generative AI and large language models (LLMs) offer new possibilities, although some businesses might hesitate due to concerns about consistency and adherence to company guidelines. This post presents an automated personalization solution that balances the innovative capabilities of LLMs with adherence to human directives and human-curated assets for a consistent and responsible personalization experience for your customers.

Our solution uses Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. For this post, we use Anthropic’s Claude models on Amazon Bedrock.

We present our solution through a fictional consulting company, OneCompany Consulting, which uses automatically generated personalized website content to accelerate business client onboarding for its consultancy service. The personalized content is built with generative AI by following human guidance and provided sources of truth. We employ task decomposition, using domain- and task-adapted LLMs for content personalization (UX designer/personalizer), image generation (artist), and building (builder/front-end developer) for the final delivery of HTML, CSS, and JavaScript files. The approach broadly mimics a human organization pursuing the same objective. This allows us to create cost-effective, more controlled, and more accurate and responsible personalized customer experiences, using existing guidelines and assets originally designed for human-driven processes.

We provide our code base on GitHub for you to follow along, suggest possible enhancements and modifications, and help you innovate with generative AI in personalization. Generative AI on AWS can transform user experiences for customers while maintaining brand consistency and your desired customization.

Use case overview

Our fictional company, OneCompany Consulting, plans to use generative AI to automatically create personalized landing pages as their business clients sign in. Their clients have provided some basic public information during sign-up, such as state of location, industry vertical, company size, and their mission statement. In parallel, OneCompany maintains a market research repository gathered by their researchers, offers industry-specific services outlined in documents, and has compiled approved customer testimonials. UX/UI designers have established best practices and design systems applicable to all of their websites. These resources should be used as single sources of truth. Because we don’t have such expertise, we synthetically generate these assets to demonstrate the process that would otherwise be created by expert humans or other methods in real life.

The following diagram illustrates the process of generating a personalized landing page for business visitors after they sign up.

Processing showing customers signing up and solution creating personalized websites

Fig 1. The process of customers signing up and the solution creating personalized websites using human-curated assets and guidelines.

We employed other LLMs available on Amazon Bedrock to synthetically generate fictitious reference materials, to avoid potential biases that could arise from Anthropic’s Claude pre-training data. In practical scenarios, these resources would be created by humans and organizations and would contain more comprehensive and exhaustive details. Nonetheless, our solution still applies. The synthetically generated reference materials include the following:

  • Client profiles – We have three business clients in the construction, manufacturing, and mining industries, which are mid-to-enterprise companies. The process assumes the information in the company profiles is public and that the companies who signed up opted in to OneCompany Consulting to use for personalization. The following example is for the construction industry:
Profiles = {
'Construction-Example': {
'Name': 'Example Corp Construction',
'Industry': 'Construction',
'CompanySize': 1500,
'CompanyType': 'Enterprise',
'Location': 'New York City, NY',
'Mission': 'Building a sustainable future for New York' }
}
  • Offerings – Offerings are documents that consolidate all offerings provided by OneCompany Consulting. The following is an example of a synthetically generated offering for the construction industry:
OneCompany Consulting Construction Consulting Services Offerings
Introduction
OneCompany Consulting is a premier construction consulting firm dedicated to... <Redacted for printing>
Our core values are:
1. Client-Centric Approach: We put our clients at the heart of everything we do... <Redacted for printing>
Our offerings include:
1. Pre-Construction Services - Feasibility Studies - Site Selection and Evaluation... <Redacted for printing>
5. Construction Technology Solutions - Construction Data Analytics and Reporting... <Redacted for printing>
  • Industry insights – Your LLMs can use industry pain points, news, and other resources to enrich personalized content. Because the industry news and resources are wide, we use a Retrieval Augmented Generation (RAG) framework to retrieve related information. The following is an example for the manufacturing industry:
The Electric Vehicle Manufacturing Industry: Overcoming Challenges... <Redacted for printing>
Introduction
The electric vehicle (EV) industry has experienced unprecedented growth in recent years, driven by... <Redacted for printing>
This article will explore the key challenges facing the EV... <Redacted for printing>
Supply Chain Disruptions
The COVID-19 pandemic has highlighted the vulnerability of global supply chains, and the EV industry has not been... <Redacted for printing>
V. Regulatory Frameworks and Incentives
Regulatory frameworks and government incentives play a critical role in promoting EV... <Redacted for printing>

  • Testimonials – Synthetically generated customer testimonials are displayed for the visitors. In this solution, the LLM is asked to use the sentence without changes because it’s a testimonial. The following is an example:
"AnyCompany Consulting's expert consulting services have been invaluable in streamlining our operations and optimizing our construction processes, resulting in increased efficiency and cost savings"
- John Smith, CTO from Example Corp Solutions.
  • Design guidelines and systems – This part is for the instructions and rules to be followed for building the website. Our examples were manually created only for high-level guidance for simplicity.
    • Guidelines – The following are some examples from the design guidelines:
- Use a color palette that aligns with the customer's industry, ensuring sufficient color contrast for readability.
- Use visible font sizes and responsive units for better visibility and readability across screen sizes... <Redacted for printing> - ... <Redacted for printing>
  • Instructions – The following are some examples from the design instructions:
Header Design:
- Choose an attention-grabbing background color and font that aligns with the client's industry.
- Consider incorporating subtle animations or transitions
Hero Section:
- Choose an attention-grabbing background color and font that aligns with the client's industry.
- ... <Redacted for printing>

Solution overview

To create personalized websites efficiently, we employ task decomposition—breaking down the complex process into simpler, decoupled sub-tasks. This approach allows using smaller, cost-effective language models, creating targeted prompts and contexts for increased accuracy and faithfulness, isolating responses for straightforward troubleshooting, and achieving cost savings.

In our example, we decomposed the overall personalized website creation process into three steps, each handled by specialized agents: the personalizer for tailoring content, the artist for generating images, and the frontend engineer/builder for coding. For the personalizer, we used Claude Sonnet due to the relative complexity of the task compared to code generation handled by Haiku. However, Claude Haiku can also be used for the personalization task, potentially leading to further cost savings. Yet, Haiku may require more prescriptive prompts and examples to achieve similar results. We recommend that customers test both Sonnet and Haiku to determine the optimal balance between performance and cost for their specific use case. In our demonstration, we chose to use Sonnet with a relatively simple prompt to showcase its efficiency, but the flexibility of this approach allows for various LLMs to be integrated into the agentic workflow.

The following diagram illustrates our agentic workflow.

Illustration of agentic workflow in solution.

Fig 2. Workflow diagram of the agentic workflow, made up of specialized (task- and domain-adapted) LLMs.

The following diagram illustrates our solution architecture.

Solutions architecture

Fig 3. Solutions architecture

The workflow includes the following steps:

  1. The client profile is stored as key-value pairs in JSON format. However, the JSON needs to be converted into natural language to simplify the task for the downstream LLMs, so they don’t have to figure out the JSON schema and associated meaning.
  2. After the profile is converted into text that explains the profile, a RAG framework is launched using Amazon Bedrock Knowledge Bases to retrieve related industry insights (articles, pain points, and so on). Amazon Bedrock Knowledge Bases is a fully managed capability that helps you implement the RAG workflow, from ingestion to retrieval and prompt augmentation, without having to build custom integrations to data sources and manage data flows. All the information retrieved from Amazon Bedrock Knowledge Bases is provided with citations to improve transparency and increase accuracy. In our example, if the client is a manufacturing customer building electric vehicles (EVs), then the related context will be retrieved from the human-curated or human-created research documents.
  3. Now we’re ready to send our prompt to the personalizer LLM with all the relevant information. In addition to the customer profile and industry insights, we include offerings, design guidance, and testimonials, and ask the personalizer LLM to create a detailed website description and description of visuals.
  4. The response from the personalizer LLM is divided into two paths by a regex method. The first part moves to the frontend developer LLM.
  5. The second part is sent to the image generator or artist LLM.
  6. For the frontend developer LLM, we also use system design-related materials (in our case, design guidelines) so the frontend developer builds the website described by the personalizer LLM while applying the rules in the design guidelines. Here, we also prompted the LLM to use the company logo (which is the unicorn of AWS GameDay) to demonstrate incorporating existing design elements into the design. In addition, our prompt asks the frontend developer LLM to write the JavaScript to make the testimonials display and call to action dynamic.
  7. At the end of the process, we create a consolidated HTML file, which includes CSS and JavaScript, and store it in an Amazon Simple Storage Service (Amazon S3) bucket so that the assets are ready to be deployed.

Prerequisites

For this post, you need the following prerequisites:

After you complete the prerequisites, you can use the following Jupyter notebook, which has all the necessary steps to follow this post.

Test the solution

Let’s start with our example manufacturing client, who is building the next-generation EVs ('Manufacturing-Example'). First, the profile of this client in JSON format is converted into natural language as follows:

customer = build_profile(UserProfile)
print(customer)

Output:
"Your customer is Example Corp Manufacturing. Their industry is manufacturing. They have 2,500 employees." 
"They are an enterprise company, located in San Jose, CA."  
"Their mission statement is “Building the next generation Electric Vehicles."

Based on the customer profile giving the location and industry, the related background information is retrieved using RAG based on kbId = <Knowledge_base_id>:

context_painpoints = Bedrock.doc_retrieve(query,kbId,numberOfResults = 10)
contexts_painpoints = get_contexts(context_painpoints)
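For reference, the retrieval helper used above can be implemented with the Amazon Bedrock Knowledge Bases Retrieve API. The following is a minimal sketch under that assumption rather than the notebook’s exact code.

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

def doc_retrieve(query, kbId, numberOfResults=10):
    """Query the knowledge base and return the raw retrieval results."""
    response = bedrock_agent_runtime.retrieve(
        knowledgeBaseId=kbId,
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": numberOfResults}
        },
    )
    return response["retrievalResults"]

def get_contexts(retrieval_results):
    """Concatenate the retrieved text chunks into one context string for the prompt."""
    return "\n\n".join(result["content"]["text"] for result in retrieval_results)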

At this stage, we have the context and industry pain points. We now need to gather human-curated sources of truth such as testimonials, design guidelines, requirements, and offerings. These references aren’t numerous and require complete information in the system prompt. However, you can apply the RAG method used for industry insights to retrieve one or all of these references.

offerings = open("./references/offerings.txt", "r").read()
testimonials = open("./references/Testimonials.txt", "r").read()
design_steps= open("./references/RecommendedStepsToDesignNew.txt", "r").read()
design_guideline= open("./references/DesignGuidelineNew.txt", "r").read()

We’re ready to input our prompt to the personalizer LLM, which is Anthropic’s Claude 3 Sonnet (or Haiku, if you chose it) on Amazon Bedrock. We used a prompt that incorporates the parameters customer, testimonials, design_steps, and offerings, and asked the LLM to generate a response in two parts:

  • Detailed Website Description – Describing the website in text format
  • Visual Elements – Describing the visual elements used

These two responses will be used in two different workflows in the following steps. The following is an example response (response_personalized_website) for the manufacturing client:

Section 1: Detailed Website Description

Header:
The header section features a clean and modern design with a navy-blue background color (#003366) and white text. The AnyCompany Consulting logo is prominently displayed on the left side, while the main navigation menu is positioned on the right. The navigation menu includes the following links: "Home," "Services," "About," and "Contact". The header is fixed to the top of the page, ensuring easy access to the navigation throughout the user's experience.

Hero Section:
The hero section showcases a large, high-quality image (hero-example.jpg) that depicts a modern manufacturing facility with various machinery and workers in action. The image is overlaid with a semi-transparent black gradient to improve the readability of the text. The hero section's main heading reads "Transforming Manufacturing Excellence" in a large, bold font (font-size: 48px), and the subheading states "Unlock your full potential with AnyCompany Consulting's tailored solutions" in a slightly smaller font (font-size: 24px). A prominent "Get Started" call-to-action button is positioned below the subheading, encouraging the user to take the next step.

Offerings Section:
The offerings section is divided into three main categories, each with a corresponding icon and brief description:

1. Operational Excellence
- Icon: operations-icon.png
- Description: "Optimize your manufacturing processes and drive continuous improvement with our Lean and Six Sigma expertise."
2. Digital Transformation
- Icon: digital-icon.png
- Description: "Leverage the power of Industry 4.0 technologies to enhance productivity, efficiency, and data-driven decision-making."
- ... <Redacted for printing>

Dynamic Content for JavaScript:
The testimonials section features a dynamic horizontal slider that automatically cycles through the testimonials every 5 seconds. This functionality can be implemented using JavaScript, with the following elements:

1. Testimonial Data:
- An array of testimonial objects, each containing the following properties:
- name: "John Smith"
- title: "CTO, Example Corp Solutions"
- quote: "AnyCompany Consulting's expert consulting services have been invaluable in streamlining our operations and optimizing our construction processes, resulting in increased efficiency and cost savings."

2. Testimonial Slider:
- A container element to hold the testimonials
- A function to display the current testimonial
- A timer to automatically cycle through the testimonials every 5 seconds
- Smooth transition animations between testimonials

3. User Interaction:
- Ability to manually navigate through the testimonials (for example, previous and next buttons)
- Pause the automatic cycling when the user interacts with the slider

Section 2: Visual Elements

1. <VISUAL_LABEL>anycompany_logo.jpg</VISUAL_LABEL>
<VISUAL_DESCRIPTION>This is the AnyCompany Consulting logo, featuring the company name in a bold, modern font. The logo is designed in a clean, minimalist style with a navy blue color scheme.</VISUAL_DESCRIPTION>

2. <VISUAL_LABEL>hero-example.jpg</VISUAL_LABEL>
<VISUAL_DESCRIPTION>This image depicts a modern manufacturing facility with various machinery and workers in action. The scene showcases the dynamic and technologically advanced nature of the manufacturing industry, aligning with the customer's profile and the AnyCompany Consulting's expertise.</VISUAL_DESCRIPTION>

3. <VISUAL_LABEL>operations-icon.png</VISUAL_LABEL>
<VISUAL_DESCRIPTION>This icon represents the "Operational Excellence" offering. It features a gear or cog symbol, symbolizing the optimization and streamlining of manufacturing processes.</VISUAL_DESCRIPTION>

4. ... <Redacted for printing>

We use Stable Diffusion to generate visual assets based on the descriptions provided by the personalizer LLM. We extract the image descriptions enclosed within <VISUAL_LABEL> and <VISUAL_DESCRIPTION> tags using a regex pattern. These descriptions are then sent in batches to our artist LLM to create images.

pattern = r'<VISUAL_LABEL>(.+?)</VISUAL_LABEL>\s*<VISUAL_DESCRIPTION>(.+?)</VISUAL_DESCRIPTION>'
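For example, the label/description pairs can then be extracted with re.findall (a small illustration; response_personalized_website is the personalizer output from the previous step):

import re

visuals = re.findall(pattern, response_personalized_website)
for label, description in visuals:
    # Each match is a (label, description) tuple ready to send to the artist LLM.
    print(f"{label}: {description[:80]}...")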

The images are created and put into the S3 bucket to store your website assets.

Now we’re ready to create the website HTML, CSS, and JavaScript assets. We used the following prompt template, which uses response_personalized_website from the personalizer, actual testimonials, and the UI design guidelines:

prompt = f"""
You are an experienced frontend web developer specializing in creating accessible, responsive, and visually appealing websites. Your task is to generate the complete HTML, CSS, and JavaScript code that accurately implements the provided 'Website Description' while adhering to the specified guidelines.
 <website description>
Know that this is your Design Guideline (Requirements):
{design_guideline}
You use the testimonials as follows:
{testimonials}
Website Description:
{response_personalized_website} </website description> 
Carefully read the 'Website Description' line by line, and then generate the HTML, CSS, and JavaScript code required to build the described website while following the specified design guidelines and requirements.
Provide the HTML, CSS, and JavaScript code directly, starting with the <!DOCTYPE html> declaration, without any preamble or introduction.
"""

After this step, you have all the necessary assets to preview your website. You can put the HTML and created files into a folder and use a web browser to see your created website.

The following screenshots show examples of generated personalized pages for EV manufacturing, mining, and construction clients, respectively. Each image is displayed for 3 seconds in GIF format. To experience the full quality and dynamic features of these pages, we recommend you visit the examples folder, download the folders, and open the corresponding main.html files with your internet browser.

Three examples looping use-cases

Fig 4. Generated example personalized pages for three industry clients.

Highlights from the test

The solution automatically generated personalized web pages, including personalized images within the provided guardrails, prompts, and reference materials. The workflow considered appropriate color contrasts, such as dark backgrounds with white fonts for accessibility. The solution generated representative icons with consistent coloring and themes across the pages. The workflow also created industry-specific, engaging labels, descriptions, offerings, and pain points based on the source of truth references. The offering selection and pain point sections are especially noteworthy because they were tailored to the visitor. For example, the hero page showcased an EV on a production line, whereas a mining company with a “sustainability” motto received green icons and a focus on that topic. The construction company from New York had themes mentioning their specific points. The workflow is capable of creating dynamic assets as prompted, such as testimonials or call-to-action buttons. Additionally, the solution created consistent assets that can scale well and are compatible with multiple devices, as requested.

In this example, we did not fully exhaust the capabilities of personalization. However, we hope these examples can provide a simple starting point for your personalization use cases.

Clean up

To clean up, start by deleting the S3 bucket you created for your knowledge base, and then delete the knowledge base itself. Because we used Amazon Bedrock on demand, you won’t incur any cost unless you invoke the model. However, if you used SageMaker Studio to follow along with this demo, we recommend deleting the artifacts in SageMaker Studio or deleting the SageMaker Studio domain.

Suggested enhancements

You can extend this solution with some further enhancements, such as the following:

  • Use batch processing for cost-effective asset creation based on visitor profiles. You can use batch inference with Amazon Bedrock or batch transform with SageMaker.
  • Cluster similar client profiles to reduce design element variations for frugality and consistency.
  • Provide website templates and chain-of-thought descriptions to follow design patterns more prescriptively.
  • Use Haiku instead of Sonnet for further cost reduction. You may need more prescriptive and multi-shot prompts as you switch to Haiku for the personalization stage.
  • Retrieve existing company images and icons using semantic search instead of generating visuals. For example, you can build semantic image search using Amazon Titan.

Conclusions

In this post, we presented an automated solution to provide a consistent and responsible personalization experience for your customers. This approach uses smaller LLMs for website personalization tailored to businesses and industries. It decomposes the complex task into subtasks handled by task- and domain-adapted LLMs, adhering to company guidelines and human expertise. Using a fictional business consulting company scenario, we demonstrated the solution by generating personalized marketing content like text, images, HTML, CSS, and JavaScript code. The process employs techniques like RAG, prompt engineering with personas, and human-curated references to maintain output control.

By combining generative AI, curated data, and task decomposition, businesses can cost-effectively create accurate, personalized website experiences aligned with their branding and design systems.

Amazon Bedrock, which you can use to build generative AI applications, is at the center of this solution. To get started with Amazon Bedrock, we recommend following the quick start and familiarizing yourself with building generative AI applications.


About the Authors

BurakBurak Gozluklu is a Principal AI/ML Specialist Solutions Architect and lead GenAI Scientist Architect for Amazon on AWS, based in Boston, MA. He helps strategic customers adopt AWS technologies and specifically Generative AI solutions to achieve their business objectives. Burak has a PhD in Aerospace Engineering from METU, an MS in Systems Engineering, and a post-doc in system dynamics from MIT in Cambridge, MA. He maintains his connection to academia as a research affiliate at MIT. Outside of work, Burak is an enthusiast of yoga.

Chidi Prince John is a Data Scientist at Amazon. He designs, builds and deploys models for large-scale personalization in Amazon Payments. Chidi has a Master’s degree in Quantitative Management from Duke University and a Bachelor’s degree in Economics from the University of Nigeria. Outside of work, he is passionate about soccer and TV shows.

Dieter D’Haenens is a Senior Product Manager for Amazon, responsible for customer growth, delivering personalized experiences and driving the Amazon flywheel. Leveraging his expertise in retail and strategy, he is passionate about solving customer problems through scalable, innovative AI and ML solutions. Dieter holds a Bachelor of Science in Economics from Ghent University, a Master in General Management from Vlerick Business School, and a Master of Science in Business Analytics from Southern Methodist University. In his spare time, he enjoys traveling and sports.

Read More

Accelerate pre-training of Mistral’s Mathstral model with highly resilient clusters on Amazon SageMaker HyperPod

Accelerate pre-training of Mistral’s Mathstral model with highly resilient clusters on Amazon SageMaker HyperPod

In recent years, FM sizes have been increasing, and it’s important to consider the massive amount of compute often required to train these models. The compute clusters used in these scenarios are composed of thousands of AI accelerators such as GPUs or AWS Trainium and AWS Inferentia, custom machine learning (ML) chips designed by Amazon Web Services (AWS) to accelerate deep learning workloads in the cloud.

When using compute clusters of massive size, a single failure can often throw a training job off course and may require multiple hours of discovery and remediation from customers. According to a report on OPT-175B training, about 178,000 GPU hours were wasted due to various training failures, amounting to 16 percent of the total training time. Similarly, a study by Meta AI and Carnegie Mellon University found that, in the worst cases, 43 percent of compute time was wasted because of overheads due to hardware failures. This can adversely impact a customer’s ability to keep up with the pace of innovation in generative AI and can also increase the time-to-market for their models.

Amazon SageMaker HyperPod is a service that is purpose-built to accelerate FM training, removing the undifferentiated heavy-lifting involved in managing and optimizing a large training compute cluster. With SageMaker HyperPod, you can train FMs for weeks to months without disruption. To make FM training more resilient to hardware failures, SageMaker HyperPod continually monitors cluster health, repairs and replaces faulty nodes without disrupting training, and uses customer-defined checkpoints to automatically resume training from the last point of failure.

Why SageMaker HyperPod?

SageMaker HyperPod offers several benefits that make it a good choice for FM training:

  • Standby pool of nodes at no additional cost – SageMaker HyperPod provisions and manages a pool of spare nodes on the customer’s behalf. These nodes are on standby and can be automatically used to replace faulty nodes during training. This makes it so that failures don’t interrupt or delay large-scale training jobs, and these spare nodes come at no additional cost to the user. With the SageMaker HyperPod auto-resume functionality, the service can dynamically swap out unhealthy nodes for spare ones to ensure the seamless continuation of the workload.
  • Cluster placement groups for optimized training – Each instance group is launched in a cluster placement group within the same network spine, in order to get the best inter-node latency and maximize bandwidth between nodes. This is ideal for tightly coupled workloads like distributed training where low-latency communication is essential for synchronizing gradient updates and ensuring that model training scales effectively across multiple GPUs.
  • Preconfigured deep learning AMI with essential libraries – The SageMaker HyperPod agent runs a SageMaker HyperPod DLAMI, which is built on top of the AWS Deep Learning Base GPU AMI (Ubuntu 20.04). The SageMaker HyperPod DLAMI is bundled with additional packages to support open source tools such as Slurm and dependencies. Also included are SageMaker HyperPod cluster software packages, which support features such as cluster health check and auto-resume.
  • Reusable scaling scripts for rapid experimentation – HyperPod offers a set of scalable and reusable scripts that simplify the process of launching multiple training runs. These scripts streamline infrastructure setup and deployment and can be easily adapted for different training scenarios or to run many jobs in parallel, making large-scale training more manageable. By reducing repetitive tasks and providing reusable automation, these scripts empower users to quickly scale up or down, test different model variations, and iterate faster, improving productivity and reducing operational overhead.
  • Auto-resume functionality – This is one of the most valuable features of SageMaker HyperPod. When a node fails, SageMaker HyperPod automatically replaces it with a healthy node from the spare pool and resumes the job from the last saved checkpoint with minimal disruption to training. This is particularly crucial for long-running training jobs, where even minor interruptions can lead to significant delays.
  • Real-time performance dashboards with few-click setup – SageMaker HyperPod integrates seamlessly with real-time dashboards to monitor node health, GPU utilization, network traffic, and other key metrics. This can be done with just a few clicks, providing full visibility into training jobs and allowing teams to optimize performance in real-time.

In this post, we present an in-depth guide to starting a continual pre-training job using PyTorch Fully Sharded Data Parallel (FSDP) for Mistral AI’s Mathstral model with SageMaker HyperPod. We review components of the Slurm-orchestrated SageMaker HyperPod cluster setup, primarily focusing on the resiliency and feature set of SageMaker HyperPod, including automatic fault detection and integration with managed services for open source tools, such as Amazon Managed Service for Prometheus and Amazon Managed Grafana.

Overview of SageMaker HyperPod resiliency

Some of the health check metrics used by SageMaker HyperPod include:

  • Accelerator issues – Checks for GPU issues, including DCGM policies such as XID errors, GPU health through nvidia-smi, and Trainium issues by reading from Neuron sysfs
  • Networking issues – Checks Elastic Fabric Adapter (EFA) devices
  • Health checks – Runs processes on accelerators and multiple threads on CPUs to achieve 100 percent utilization. This determines the health of the CPU or accelerator. Specifically, DCGM Diagnostics Level 2 tests are run for GPUs, and CPU health is determined using the Linux stress tool.

SageMaker HyperPod continuously performs health checks on crucial components, including GPUs, AWS Trainium cores, and EFA networking devices. This proactive approach allows the SageMaker HyperPod health check agent to identify hardware failures or potential performance degradation. When hardware failures are detected, SageMaker HyperPod identifies the faulty instances and can use its auto-resume functionality to initiate a replacement process without manual intervention. This feature automatically detects hardware failures, seamlessly replaces faulty instances, and resumes jobs from the last saved checkpoint. In addition, SageMaker HyperPod lets you manually replace a node in case a node is stuck with an issue that isn’t being fixed by the auto-resume functionality. You can manually change the state of the node to fail, and SageMaker HyperPod will replace it with a healthy instance. For a deeper dive into resiliency with SageMaker HyperPod, refer to the Resiliency section of this post.

Overview of SageMaker HyperPod observability

To achieve comprehensive observability into your SageMaker HyperPod cluster resources and software components, you can integrate your cluster with Amazon Managed Service for Prometheus and Amazon Managed Grafana. The integration with Amazon Managed Service for Prometheus enables the export of metrics related to your SageMaker HyperPod cluster resources, providing insights into their performance, utilization, and health. The integration with Amazon Managed Grafana enables the visualization of these metrics through Grafana dashboards that offer an intuitive interface for monitoring and analyzing the cluster’s behavior. By using these services, you gain a centralized and unified view of your SageMaker HyperPod cluster, facilitating proactive monitoring, troubleshooting, and optimization of your distributed training workloads. The Observability section of this post goes into more detail on which metrics are exported and what the dashboards look like in Amazon Managed Grafana.

This post is primarily focused on Amazon Managed Service for Prometheus and Amazon Managed Grafana for observability. To explore more observability integrations with SageMaker HyperPod like Nvidia Nsight, refer to the validation and observability folder of the awsome-distributed-training GitHub repo.

These resiliency and observability features collectively contribute to a more reliable and efficient training environment, minimize downtime, and optimize resource usage. By directly integrating with Amazon Managed Service for Prometheus and Amazon Managed Grafana and abstracting the management of hardware failures and job resumption, SageMaker HyperPod allows data scientists and ML engineers to focus on model development rather than infrastructure management.

Mathstral model from Mistral AI

Mathstral is a model designed for math reasoning and scientific discovery. It is based on the original Mistral 7B model and features a 32k context window. The release of Mathstral aligns with Mistral AI’s broader effort to support academic and scientific research, particularly through their collaboration with Project Numina. As a 7B model, Mathstral sets a new standard for performance and latency among similar models used for math and reasoning generation, and it can achieve significantly better results with more inference-time computation.

Overview of PyTorch FSDP

In distributed data parallel (DDP) training, each process or worker owns a replica of the model and processes a batch of data, then uses all-reduce to sum the gradients across workers. In DDP, the model weights and optimizer states are replicated across all workers: DDP maintains a full copy of the model on each GPU and requires enough memory on each GPU to store the entire model. For training larger FMs, which require more than a single GPU, an approach like FSDP is recommended. FSDP is a type of data parallelism that shards model parameters, optimizer states, and gradients across DDP ranks. This reduces the memory requirements on individual GPUs and distributes the memory load across GPUs. With FSDP’s improved memory efficiency, researchers and developers can use fewer GPUs, thereby minimizing operational costs and achieving faster model convergence.

When training with FSDP, the GPU memory footprint is smaller than when training with DDP across all workers. This makes the training of some very large models feasible by allowing them to be loaded into memory with a lower memory footprint. However, this comes at the cost of increased communication volume. For more information on FSDP, refer to PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel.
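To make the idea concrete, here is a minimal FSDP sketch using a toy model, not the Mathstral training script. Launched with torchrun, each rank holds only a shard of the parameters, gradients, and optimizer state.

import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# Launch with: torchrun --nproc_per_node=<gpus_per_node> fsdp_toy.py
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).cuda()
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params, grads, and optimizer state
    device_id=torch.cuda.current_device(),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One training step: FSDP all-gathers shards for compute and reduce-scatters gradients.
batch = torch.randn(8, 4096, device="cuda")
loss = model(batch).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()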

Solution overview

The following image shows the architecture diagram for the resources deployed as part of SageMaker HyperPod for our use case of training the Mathstral model. In your account, a VPC is provisioned with a public and private subnet, and an S3 bucket is synced to your FSx for Lustre file system through a data repository association. In the service team account, your cluster of P4de instances is provisioned, along with the head node and the login node, from which you submit the training job to your cluster.

Prerequisites

In the context of this post, we use four p4de.24xlarge instances. You can find more information on the p4de.24xlarge instance type at Amazon EC2 P4 Instances. To get the best inter-node latency, we launch these instances together in a cluster and only run jobs on a single instance group. You can also use a variety of other instance types to follow along with this post.

For more information on getting access to instances in a partition group, refer to the Getting Started section in this post. Note that Mathstral 7B at full precision (FP32) is approximately 26 GB in size, so you need to make sure that your cluster configuration has sufficient GPU memory to load the model into memory along with the gradients, activations, and optimizer moments. This accounts for a total of about 107 GB, in addition to the training assets required to kick off a job successfully. For demonstration purposes, we use FSDP for this continued pre-training job.
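As a rough back-of-the-envelope check (assuming FP32 weights and an Adam-style optimizer with two moments per parameter): weights of about 26 GB, gradients of about 26 GB, and optimizer moments of about 2 x 26 GB = 52 GB add up to roughly 104 GB before activations, which is consistent with the approximately 107 GB figure above.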

The following sections describe setting up your infrastructure and environment with SageMaker HyperPod. For detailed instructions and code, we recommend that you follow along with the Amazon SageMaker HyperPod workshop. The prerequisites and cluster setup parts of this workshop go over all the required components needed in order to set up your cluster. The workshop also provides resources to troubleshoot commonly faced issues during setup.

Set up your infrastructure

Deploy HyperPod VPC stack

To set up your cluster, you first need to create some resources. The following resources can be created by deploying the SageMaker HyperPod VPC CloudFormation stack. By default, usw2-az4 is specified as the Availability Zone; change this to reflect the Availability Zone where you plan to create your cluster. This VPC stack creates the following resources:

  • Subnet – This is a private subnet in the Availability Zone ID that you choose to use
  • Security group – This allows SageMaker HyperPod to mount your Amazon FSx for Lustre file system
  • FSx for Lustre file system – This serves as the shared file system that all the nodes can access. It’s a 1.2 TB PERSISTENT_2 file system in the private subnet you create. It gets mounted at /fsx.
  • Linux environment – This provides a standardized development environment to work in
  • Amazon Simple Storage Service (Amazon S3) bucket – To push and store your lifecycle scripts
  • AWS Identity and Access Management (IAM) role – Role required for creating the SageMaker HyperPod cluster

Deploy the observability stack

In order to use the observability integration with SageMaker HyperPod, you need to deploy the SageMaker HyperPod Observability CloudFormation stack, which can then be used to monitor your cluster metrics in real time.

Set up your environment

Let’s move on to environment setup. In order to deploy this solution, you need to use a Linux-based development environment. This section briefly describes the steps required to set up your cluster. For detailed instructions and code, we recommend that you follow along with the Amazon SageMaker HyperPod workshop.

Set up your cluster

This section guides you through the process of deploying a cluster to train with. You need to set up the following:

  • Head node and compute nodes – The head node is composed of an m5.12xlarge instance, and the worker group consists of p4de.24xlarge instances. Refer to the following table for details on these instance types.
  • Shared volume – The cluster has an FSx for Lustre file system mounted at /fsx on both the head and compute nodes.
  • Placement groups enabled – A placement group launches instances close together inside one physical data center in a single Availability Zone to maximize the bandwidth and reduce the latency between instances.
  • Local storage – Each node has an 8 TB local NVMe volume attached for local storage.
  • Scheduler – Slurm is used as the job scheduler.
  • Accounting – As part of cluster configuration, a local MariaDB is deployed to keep track of job runtime information.
Instance size | GPU devices | Total GPU memory | vCPUs | CPU memory | EFA bandwidth
p4de.24xlarge | 8           | 640 GB           | 96    | 1152 GB    | 400 Gbps

Set up the AWS CLI

Before creating the cluster and its associated resources, you need to set up the AWS Command Line Interface (AWS CLI) using the latest version (or version 2.17.1 at a minimum).

To check the AWS CLI version, use the following command.

aws --version

To update the AWS CLI to the latest version, use the following command.

sudo ./aws/install --update

The AWS CLI plugin for Session Manager, a capability of AWS Systems Manager, must be installed to access your cluster. To install the Session Manager plugin on Amazon Linux 2, use the following command:

sudo yum install -y https://s3.amazonaws.com/session-manager-downloads/plugin/latest/linux_64bit/session-manager-plugin.rpm

For detailed steps on installing and setting up the AWS CLI, follow the steps provided in the Install AWS CLI section of the Amazon SageMaker HyperPod workshop.

Source environment variables

An important part of the setup is to source in all the environment variables, using the output from the VPC CloudFormation stack deployed in a previous step. Use the following command.

curl 'https://static.us-east-1.prod.workshops.aws/public/e3e1b2f1-8140-43eb-a316-e76f569119dd/static/scripts/create_config.sh' --output create_config.sh
bash create_config.sh
source env_vars

Once you have sourced them in, confirm that they were correctly set using the following command.

cat env_vars

Set up lifecycle scripts

SageMaker HyperPod uses a collection of lifecycle scripts  to bootstrap the cluster. These scripts are responsible for several actions, including setting up Slurm and mounting the FSx for Lustre file system. You need to customize these scripts in order to mount your FSx for Lustre file system. For detailed steps on setting up these lifecycle scripts, refer to the Set Up Lifecycle Scripts section of the workshop.

Make sure to complete the Enable Optional Lifecycle Scripts section after step 4 of the Set Up Lifecycle Scripts section; it installs the exporter services that the cluster needs to emit metrics to Amazon Managed Service for Prometheus.

Additionally, the observability stack requires the following two AWS managed IAM policies to be added to your AmazonSagemakerClusterExecutionRole prior to creating your cluster.

aws iam attach-role-policy --role-name $ROLENAME --policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess

aws iam attach-role-policy --role-name $ROLENAME --policy-arn arn:aws:iam::aws:policy/AWSCloudFormationReadOnlyAccess

Once you have uploaded the lifecycle scripts to Amazon S3, you can then create your cluster.

Create your cluster

To create your cluster, you need your cluster configuration. Because you use p4de.24xlarge for this example, copy the following cluster configuration.

source env_vars
cat > cluster-config.json << EOL
{
    "ClusterName": "ml-cluster",
    "InstanceGroups": [
      {
        "InstanceGroupName": "controller-machine",
        "InstanceType": "ml.m5.12xlarge",
        "InstanceStorageConfigs": [
          {
            "EbsVolumeConfig": {
              "VolumeSizeInGB": 500
            }
          }
        ],
        "InstanceCount": 1,
        "LifeCycleConfig": {
          "SourceS3Uri": "s3://${BUCKET}/src",
          "OnCreate": "on_create.sh"
        },
        "ExecutionRole": "${ROLE}",
        "ThreadsPerCore": 1
      },
      {
        "InstanceGroupName": "worker-group-1",
        "InstanceType": "ml.p4de.24xlarge",
        "InstanceCount": 4,
        "LifeCycleConfig": {
          "SourceS3Uri": "s3://${BUCKET}/src",
          "OnCreate": "on_create.sh"
        },
        "ExecutionRole": "${ROLE}",
        "ThreadsPerCore": 1
      }
    ],
    "VpcConfig": {
      "SecurityGroupIds": ["$SECURITY_GROUP"],
      "Subnets":["$SUBNET_ID"]
    }
}
EOL

If you use a different instance type for your cluster, refer to the Create Cluster section of the workshop to create your cluster-config.json file.

SageMaker HyperPod also gives you the ability to update your clusters to increase the size of an existing worker group or create a new worker group to add additional instance types to your cluster. For steps on updating the cluster to create additional worker groups that use other instance types, refer to the Heterogeneous Clusters section in the workshop.

Once you’ve created the cluster-config.json file, follow the Create Cluster steps in the workshop to create the FSx for Lustre configuration (provisioning_parameters.json) file and upload it to Amazon S3. Then, you can validate the configuration using the validate-config.py file in the awsome-distributed-training GitHub repo.

Once this validation is completed, you can create your cluster. Use the following command.

aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json \
    --region $AWS_REGION

To check the state of your cluster, run the following command.

aws sagemaker list-clusters --output table

You should then be able to observe the cluster creating.

-----------------------------------------------------------------------------------------------------------------------------------------------------
|                                                                          ListClusters                                                             |
+---------------------------------------------------------------------------------------------------------------------------------------------------+
||                                                                       ClusterSummaries                                                          ||
|+----------------------------------------------------------------+----------------------+---------------+-----------------------------------------+|
||                        ClusterArn                              |    ClusterName       | ClusterStatus |               CreationTime              ||
|+----------------------------------------------------------------+----------------------+---------------+-----------------------------------------+|
|| arn:aws:sagemaker:us-west-2:{cluster arn}                      |  ml-cluster          | Creating      | time taken to create                    ||
|+----------------------------------------------------------------+----------------------+---------------+-----------------------------------------+|

Now that you’ve created a cluster, you can monitor the status in the SageMaker console. This will show you cluster status, running instances, and node groups and allow you to modify the cluster. In the SageMaker HyperPod console, find your cluster and select it, as shown in the following screenshot.

Once the cluster status changes to InService, you can connect using Secure Shell (SSH). Make sure that you completed the step in Set up the AWS CLI to install the Session Manager plugin. You can then use the easy-ssh.sh script from the awsome-distributed-training repo, which wraps the SSM command, to connect to the controller-machine using SSH.

curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/easy-ssh.sh
chmod +x easy-ssh.sh
./easy-ssh.sh -c controller-machine ml-cluster

Use the following command to switch to the ubuntu user.

sudo su - ubuntu

Refer to the Get to know your Cluster  section in the SageMaker HyperPod workshop to familiarize yourself with the commands you need to use in the later sections.

Finally, set up SSH access to the compute nodes. To do this, add a key-value pair to the /fsx/ubuntu directory. Because all the compute nodes mount this directory, you only have to do this once for ubuntu to access all the compute nodes. For instructions, refer to the SSH Access to compute section of the workshop.

Congrats on setting up your environment! Now that you’ve completed the necessary steps, you can move on to your training job.

Run your pre-training job

Follow these steps on your cluster head node:

  1. Navigate to your shared FSx for Lustre file system. If you followed the tutorial linked previously, it will be located at /fsx.
  2. Use the following command to clone the awsome-distributed-training repo.
cd /fsx
git clone https://github.com/aws-samples/awsome-distributed-training/
cd awsome-distributed-training/3.test_cases/10.FSDP
  3. Run the 0.create_conda_env.sh script.

This script first downloads and installs Miniconda, then creates a Conda environment called pt_fsdp. The environment installs PyTorch on AWS, a package built to run PyTorch workloads on AWS: it lets you use EFA out of the box, because OFI-NCCL is pre-built into the Conda package, and it provides the latest versions of CUDA, cuDNN, and NCCL for the best performance on GPU-based instances. The dependencies required to run your FSDP training job are installed in this Conda environment, and because the environment is created on the /fsx file system, it’s shared across all your training nodes.

bash 0.create_conda_env.sh

For this training job, you use the C4 dataset, which is several hundred gigabytes. Instead of downloading the whole thing, the create_streaming_dataloaders function will stream the dataset from HuggingFace, so there’s no data prep required for running this training.

If you want to use your own dataset instead, you can format it as a HuggingFace dataset and pass its location to the --dataset_path argument.

Launch training

The script to launch the Mathstral training job can be found in 3.distributed-training-mistral-mathstral.sbatch. You can adjust the number of nodes used for training by modifying #SBATCH --nodes=4 to match the number of nodes in your cluster. Because this walkthrough uses four p4de.24xlarge instances, it is set to 4.

For the purpose of this post, you need to make sure that the FI_EFA variables for EFA are exported in the 3.distributed-training-mistral-mathstral.sbatch file. If you use instances not enabled for remote direct memory access (RDMA), such as the g5.12xlarge, comment out lines 21–22 of this file. These instances have EFA between nodes, but do not have the GPU direct RDMA access of p4d/e and p5 instances. In this walkthrough, we are using p4de instances, so we leave these lines uncommented.

## Plenty of EFA level variables
## Comment out for non-efa instances (G5, G4d, P3)
export FI_EFA_USE_DEVICE_RDMA=1 # use for p4de
export FI_LOG_LEVEL=1
export FI_PROVIDER=efa
export NCCL_DEBUG=INFO

Under User Variables, make sure to adjust GPUS_PER_NODE to match the number of GPUs on your instance type (8 for p4de).

You can also adjust the training parameters in TRAINING_ARGS. Additional parameters can be found in model/arguments.py.

We use the same directory for both --checkpoint_dir and --resume_from_checkpoint. If there are multiple checkpoints, --resume_from_checkpoint will automatically select the most recent one. This way, if the training is interrupted for any reason, it will automatically pick up the most recent checkpoint.
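To make the resume behavior concrete, the following is a minimal sketch of selecting the newest checkpoint from a checkpoint directory. The iter_<n> naming convention matches the example output shown later in this post; the helper itself is illustrative rather than code from the repo.

import os
import re

def latest_checkpoint(checkpoint_dir: str):
    """Return the path of the most recent checkpoint subdirectory, or None if none exist."""
    if not os.path.isdir(checkpoint_dir):
        return None
    # Checkpoints are assumed to be saved as numbered subdirectories, e.g. iter_0000001, iter_0000002, ...
    candidates = [d for d in os.listdir(checkpoint_dir) if re.fullmatch(r"iter_\d+", d)]
    if not candidates:
        return None
    newest = max(candidates, key=lambda d: int(d.split("_")[-1]))
    return os.path.join(checkpoint_dir, newest)

resume_path = latest_checkpoint("./checkpoints/mathstral-7B")
if resume_path is not None:
    print(f"Resuming from {resume_path}")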

Note: You may change these hyperparameters in the 3.distributed-training-mistral-mathstral.sbatch file. We are using arbitrary hyperparameters here for the sake of demonstration.

declare -a TRAINING_ARGS=(
    --train_batch_size=1 
    --val_batch_size=1 
    --max_steps=5000 
    --seed=42 
    --grad_clip=1.0 
    --weight_decay=0.2 
    --beta1=0.9 
    --beta2=0.95 
    --activation_checkpointing=1 
    --intermediate_size=14336 
    --num_key_value_heads=8 
    --logging_freq=1 
    --max_context_width=32768 
    --vocab_size=32768 
    --hidden_width=4096 
    --num_layers=32 
    --num_heads=32 
    --resid_pdrop=0.1 
    --embd_pdrop=0.1 
    --attn_pdrop=0.1 
    --summary_first_pdrop=0.1 
    --initializer_range=0.02 
    --model_type="mistral" 
    --rotary_pct=0.25 
    --rotary_emb_base=10000 
    --lr=0.0001 
    --lr_decay_style="cosine" 
    --min_lr=1e-5 
    --warmup=0.0032 
    --plateau=0.0 
    --dataset="c4" 
    --tokenizer="mistralai/mathstral-7B-v0.1" 
    --epochs=3 
    --checkpoint_dir="./checkpoints/mathstral-7B" 
    --resume_from_checkpoint="./checkpoints/mathstral-7B" 
    --checkpoint_freq=50 
    --validation_freq=500 
    --dataset_config_name="en" 
    --limit_all_gathers=1 
    --sharding_strategy="full"  # https://pytorch.org/docs/stable/fsdp.html
    --offload_activations=1
)

To launch your training, run the following command.

sbatch 3.distributed-training-mistral-mathstral.sbatch

You’ll find a new file in the FSDP directory of the form slurm-[job-number].out. This will be continuously updated with your training logs. Don’t be worried if you notice a long stream of NCCL logs (we prefer to use NCCL_DEBUG=INFO for verbose logging). After about a minute, you should observe your Mathstral model training, with an output similar to the following.

...
+ TORCHRUN=./pt_fsdp/bin/torchrun
+ export TRAIN_SCRIPT=./train.py
+ TRAIN_SCRIPT=./train.py
+ TRAINING_ARGS=(--train_batch_size=1 --val_batch_size=1 --max_steps=5000 --seed=42 --grad_clip=1.0 --weight_decay=0.2 --beta1=0.9 --beta2=0.95 --activation_checkpointing=1 --intermediate_size=14336 --num_key_value_heads=8 --logging_freq=1 --max_context_width=32768 --vocab_size=32768 --hidden_width=4096 --num_layers=32 --num_heads=32 --resid_pdrop=0.1 --embd_pdrop=0.1 --attn_pdrop=0.1 --summary_first_pdrop=0.1 --initializer_range=0.02 --model_type="mistral" --rotary_pct=0.25 --rotary_emb_base=10000 --lr=0.0001 --lr_decay_style="cosine" --min_lr=1e-5 --warmup=0.0032 --plateau=0.0 --dataset="c4" --tokenizer="mistralai/mathstral-7B-v0.1" --epochs=3 --checkpoint_dir="./checkpoints/mathstral-7B" --resume_from_checkpoint="./checkpoints/mathstral-7B" --checkpoint_freq=50 --validation_freq=500 --dataset_config_name="en" --limit_all_gathers=1 --sharding_strategy="full"  # https://pytorch.org/docs/stable/fsdp.html --offload_activations=1)
+ declare -a TRAINING_ARGS
+ AUTO_RESUME=
+ '[' -d /opt/sagemaker_cluster ']'
+ echo 'Detected Hyperpod cluster.. enabling --auto-resume=1'
Detected Hyperpod cluster.. enabling --auto-resume=1
+ AUTO_RESUME=--auto-resume=1
+ srun --auto-resume=1 -l ./pt_fsdp/bin/torchrun --nproc_per_node=8 --nnodes=4 --rdzv_id=35 --rdzv_backend=c10d --rdzv_endpoint=ip-10-2-39-253 ./train.py --train_batch_size=1 --val_batch_size=1 --max_steps=5000 --seed=42 --grad_clip=1.0 --weight_decay=0.2 --beta1=0.9 --beta2=0.95 --activation_checkpointing=1 --intermediate_size=14336 --num_key_value_heads=8 --logging_freq=1 --max_context_width=32768 --vocab_size=32768 --hidden_width=4096 --num_layers=32 --num_heads=32 --resid_pdrop=0.1 --embd_pdrop=0.1 --attn_pdrop=0.1 --summary_first_pdrop=0.1 --initializer_range=0.02 --model_type=mistral --rotary_pct=0.25 --rotary_emb_base=10000 --lr=0.0001 --lr_decay_style=cosine --min_lr=1e-5 --warmup=0.0032 --plateau=0.0 --dataset=c4 --tokenizer=mistralai/mathstral-7B-v0.1 --epochs=3 --checkpoint_dir=./checkpoints/mathstral-7B --resume_from_checkpoint=./checkpoints/mathstral-7B --checkpoint_freq=50 --validation_freq=500 --dataset_config_name=en --limit_all_gathers=1 --sharding_strategy=full ' #' https://pytorch.org/docs/stable/fsdp.html --offload_activations=1
...
3: 2024-07-19 03:31:38 I [train.py:155] Creating Model
3: 2024-07-19 03:33:08 I [train.py:171] Created model with total parameters: 7248023552 (7.25 B)
3:...
3: 2024-07-19 03:33:23 I [train.py:209] Wrapped model with FSDP
3: 2024-07-19 03:33:23 I [train.py:226] Created optimizer
3: 2024-07-19 03:33:23 I [checkpoint.py:70] No Checkpoints Found
...
3: 2024-07-19 03:33:35 I [train.py:102] Batch 0 Loss: 11.19900, Speed: 5.10 samples/sec, lr: 0.000006
3: 2024-07-19 03:33:38 I [train.py:102] Batch 1 Loss: 11.18291, Speed: 10.96 samples/sec, lr: 0.000013
3: 2024-07-19 03:33:40 I [train.py:102] Batch 2 Loss: 11.09163, Speed: 11.22 samples/sec, lr: 0.000019
3: 2024-07-19 03:33:43 I [train.py:102] Batch 3 Loss: 10.86621, Speed: 11.19 samples/sec, lr: 0.000025
3: 2024-07-19 03:33:46 I [train.py:102] Batch 4 Loss: 10.58236, Speed: 11.12 samples/sec, lr: 0.000031
3: 2024-07-19 03:33:49 I [train.py:102] Batch 5 Loss: 10.08024, Speed: 11.18 samples/sec, lr: 0.000038
3: 2024-07-19 03:33:52 I [train.py:102] Batch 6 Loss: 10.15507, Speed: 11.23 samples/sec, lr: 0.000044
3: 2024-07-19 03:33:55 I [train.py:102] Batch 7 Loss: 9.97296, Speed: 10.42 samples/sec, lr: 0.000050
3: 2024-07-19 03:33:58 I [train.py:102] Batch 8 Loss: 10.13596, Speed: 11.21 samples/sec, lr: 0.000056
3: 2024-07-19 03:34:01 I [train.py:102] Batch 9 Loss: 9.93156, Speed: 11.10 samples/sec, lr: 0.000063

Observability

SageMaker HyperPod can optionally be integrated with Amazon Managed Service for Prometheus and Amazon Managed Grafana to export metrics about your cluster and cluster nodes to an Amazon Managed Grafana dashboard.

For more details about configuring Amazon Managed Service for Prometheus and Amazon Managed Grafana, refer to the Prometheus Configuration and Amazon Managed Grafana sections in the SageMaker HyperPod workshop.

Slurm Exporter dashboard

The Amazon Managed Grafana Slurm dashboard (ID: 4323) provides visualization options for monitoring Slurm clusters. Prometheus Slurm exporter is installed on the controller node of the cluster. Some of the metrics exported include:

  • Cluster overview – Displays the total number of nodes, jobs, and their states
  • Job metrics – Visualizes job counts and states over time
  • Node metrics – Shows node states, allocation, and available resources
  • Partition metrics – Monitors partition-specific metrics such as CPU, memory, and GPU utilization
  • Job efficiency – Calculates job efficiency based on resources used

The following screenshot of the exporter dashboard shows the continued pre-training job for Mathstral being completed successfully.

Node Exporter dashboard

The Amazon Managed Grafana Node Exporter Full dashboard (ID: 1860) offers visualization options for monitoring system metrics collected by the Prometheus Node Exporter installed on the cluster nodes. Some of the key metrics you can visualize include:

  • System overview – Displays CPU load averages and memory usage
  • Memory metrics – Visualizes memory utilization including total memory, free memory, and swap space
  • Disk usage – Monitors disk space utilization and availability
  • Network traffic – Shows network bytes received and transmitted over time
  • File system metrics – Analyzes file system usage and availability
  • Disk I/O metrics – Visualizes disk read and write activity

DCGM Exporter dashboard

The Amazon Managed Grafana NVIDIA DCGM Exporter dashboard (ID: 12239) offers visualization options for monitoring NVIDIA GPU metrics collected by the DCGM Exporter. Some of the key metrics you can visualize include:

  • GPU overview – Displays GPU utilization, temperatures, power usage, and memory usage
  • Temperature metrics – Visualizes GPU temperatures over time
  • Power usage – Monitors GPU power draw and power usage trends
  • Memory utilization – Analyzes GPU memory usage, including used, free, and total memory
  • Fan speed – Shows GPU fan speeds and variations
  • ECC errors – Tracks GPU memory ECC errors and pending errors

EFA Metrics dashboard

The Amazon Managed Grafana EFA Metrics dashboard (ID: 20579) offers visualization options for monitoring EFA related metrics collected by the EFA Node Exporter. Some of the key visualizations include:

  • EFA error metrics – Visualizes errors such as allocation errors, command errors, and memory map errors
  • EFA network traffic – Monitors received and transmitted bytes, packets, and work requests
  • EFA RDMA performance – Analyzes RDMA read and write operations, including bytes transferred and error rates
  • EFA port lifespan – Displays the lifespan of EFA ports over time
  • EFA keep-alive packets – Tracks the number of keep-alive packets received

FSx Metrics dashboard

The Amazon Managed Grafana FSx for Lustre dashboard (ID: 20906) offers visualization options for monitoring Amazon FSx for Lustre file system related metrics collected by Amazon CloudWatch. Some of the key visualizations include:

  • DataReadBytes – The number of bytes for file system read operations
  • DataWriteBytes – The number of bytes for file system write operations
  • DataReadOperations – The number of read operations
  • DataWriteOperations – The number of write operations
  • MetadataOperations – The number of metadata operations
  • FreeDataStorageCapacity – The amount of available storage capacity

These metrics provide insights into various aspects of your FSx for Lustre file systems.

Resiliency

As mentioned previously, one of the value propositions of SageMaker HyperPod is that it provides a variety of cluster resiliency features such as cluster health checks, auto-resume, and the option to manually replace faulty nodes.

Based on the status of these health checks, SageMaker HyperPod detects whether nodes in the cluster are healthy or not. If a node is deemed unhealthy by any of the health checks, SageMaker HyperPod uses its auto-resume feature to automatically replace the faulty node, without any manual intervention.

Additionally, users have the option to implement checkpointing in their training procedure. Checkpointing, combined with auto-resume, means that once a faulty node is replaced, the training job can resume from the last saved checkpoint. This way, despite a hardware failure, a user’s training job can run with minimal loss in progress.

In this section, we demonstrate the resiliency and auto-resume feature of SageMaker HyperPod by simulating a hardware failure scenario and pointing you towards some logs that indicate the success of a replacement job. We use the same submitted FSDP training job, which has the following two important components enabled:

  1. Checkpointing is enabled and implemented
  2. The --auto-resume=1 flag is set. You can verify this in the Slurm .out file.

This section in the provided sbatch file sets the --auto-resume=1 flag.

AUTO_RESUME=""
if [ -d "/opt/sagemaker_cluster" ]; then
    echo "Detected Hyperpod cluster.. enabling --auto-resume=1"
    AUTO_RESUME="--auto-resume=1"
fi
srun ${AUTO_RESUME} -l ${TORCHRUN} "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}"

The sbatch file has the checkpointing flags checkpoint_freq, checkpoint_dir, resume_from_checkpoint, which tell the job how often to write checkpoints, where to write the checkpoints to, and what directory to read checkpoints from in case of failure, respectively.

Assuming that you already have your training job submitted, wait until a few checkpoints are written to the ./checkpoints directory (or the directory you specified for checkpoint_dir). You can check whether any checkpoints were written by running ls -lt checkpoints/. This should return an output that resembles the following.

total 74
-rw-rw-r--  1 ubuntu ubuntu     1 Dec  9 00:21 latest_checkpointed_iteration.txt
drwxrwxr-x 10 ubuntu ubuntu 33280 Dec  9 00:20 iter_0000002
drwxrwxr-x 10 ubuntu ubuntu 33280 Dec  9 00:11 iter_0000001

You may also check the progress of your training job by running tail -f slurm-<job-id>.out, where <job-id> can be found by running squeue. You should observe an output that resembles the following.

1:  iteration        1/  508626 | consumed samples:          288 | elapsed time per iteration (ms): 440352.6 | learning rate: 0.000E+00 | global batch size:   288 | loss scale: 4294967296.0 | number of skipped iterations:   1 | number of nan iterations:   0 |0: saving checkpoint at iteration       1 to /fsx/checkpoints
0:   successfully saved checkpoint at iteration       1 to /fsx/checkpoints
1: (min, max) time across ranks (ms):
1:     save-checkpoint ................................: (81611.24, 81611.82)

Once you’ve confirmed that your training job is running and that you have checkpoints written, you are ready to simulate a hardware failure.

As part of the output of running squeue, you have received an output that resembles the following.

JOBID PARTITION     NAME   USER ST  TIME  NODES NODELIST(REASON)
32          dev interact ubuntu  R  0:02      4  ip-10-2-9-98,...

This tells you what jobs are running and on what nodes. Locate your training job and choose any of the nodes except the first node on the list of nodes allocated to your job (this is the node that you will be injecting an error into). This is very important because PyTorch uses node 0 (that is, the first node) as the coordination node for your training job.

Once you’ve identified the node to inject the error onto, connect to it using SSH with the following command.

ssh <NODE ip>

You can inject an ECC error by running the following command.

dcgmi test --inject --gpuid 0 -f 319 -v 4

This simulates a double-bit error (DBE) on the GPU of your chosen node. Additionally, to simulate a job failure, kill the training job by taking the process ID (PID) of any of the running Python processes; these are the processes running your FSDP training job. The -9 flag is the signal number for the SIGKILL signal, which forces a process to stop without giving it a chance to clean up or perform any other actions.

ps -aux | grep python
kill -9 <PID>

Once the ECC error is injected and the Python process has stopped, you can exit out of your compute node. In the meantime, you can get the output of the slurmctld.log file using the following command.

tail -f /var/log/slurm/slurmctld.log

In there, you can observe the following lines, which show a failed job or node.

 [2024-07-19T04:13:03.313] sched: Allocate JobId=35 NodeList-ip-10-2-39-253, ip-10-2-40-102, ip-10-2-76-26, ip-10-2-108-162 #CPUs=192 Partition=dev
 [2024-07-19T04:50:31.682] _slurm_rpc_submit_batch_job: JobId=35 InitPrio=1 usec=727
 [2024-07-19T04:50:31.803] update_node: node ip-10-2-39-253 reason set to: Action: Replace 
 [2024-07-19T04:50:31.803] update_node: node ip-10-2-39-253 state set to FAILING

Pay attention to the line that says update_node: node ip-10-2-39-253 reason set to: Action: Replace, which indicates that the node has failed and requires replacement.

If you look at your <slurm-job>.out file, you should observe logs like the following.

[Auto Resume] Info: JobID: 35 StepID: 0 Initiating communication with cluster agent to diagnose health of nodes
[Auto Resume] Info: JobID: 35 StepID: 0 Response from cluster agent: JobID=35, ResumeAction=RETRYSTEP
[Auto Resume] Info: JobID: 35 StepID: 0 Job failed - replacing nodes
[Auto Resume] Info: JobID: 35 StepID: 0 Job failed - Dropping unhealthy nodes
[Auto Resume] Info: JobID: 35 StepID: 0 Succesfully shrink job to retain healthy nodes ...
srun: job 35 queued and waiting for resources

This shows that job 35 (your training job) is paused while the faulty node is replaced. You can verify this by running squeue, where you will observe a job named auto-res. This is the auto-resume job that is initiated by SageMaker HyperPod to replace your faulty node.

JOBID PARTITION  NAME      USER ST    TIME NODES NODELIST(REASON)
35    dev    auto-res    ubuntu PD    0:00     4 (Resources)
...

You can also monitor your SageMaker HyperPod cluster using the AWS console. Under Instances, you should observe one of the nodes in worker-group-1 in Pending state, as shown in the following screenshot. This shows that the node is about to get replaced.

Once your node is replaced, you can observe the slurmctld.log file. Be on the alert for the following line:

update_node: node <YOUR-NODE-IP-ADDRESS> reason set to: AWS:Replaced

You can also verify that your node was successfully replaced using the HyperPod cluster tab in the Amazon SageMaker console.

Once your node is replaced, squeue should no longer display the auto-res job and should only display your original training job. The node is successfully replaced, without any manual intervention.

Because you enabled checkpointing, you can verify that the training job resumes from the latest checkpoint. In your <slurm-job>.out file, find the following lines, which show that a checkpoint was detected in the checkpoint directory (./checkpoints) and that the latest checkpoint was loaded, respectively.

...
Loading checkpoint from checkpoints/mathstral-10steps ...
...
Checkpoint loaded from checkpoints/mathstral-10steps ...
...

If you continue to monitor your <slurm-job>.out file, you should observe that your training job has resumed from the latest checkpoint.

Clean up

  1. To delete your cluster, enter the following command.
aws sagemaker delete-cluster --cluster-name ml-cluster

Once you are done deleting the cluster, make sure it no longer appears in the SageMaker HyperPod clusters section of the SageMaker console.

  2. To use the console to delete your SageMaker HyperPod VPC and Observability CloudFormation stacks, follow the directions at Delete a stack from the CloudFormation console. Alternatively, use the AWS CLI by entering the following command. Replace my-stack with the name of your stacks.
aws cloudformation delete-stack \
    --stack-name my-stack

Conclusion

In this post, we provided a comprehensive guide on using Amazon SageMaker HyperPod for training large-scale models such as Mistral AI’s Mathstral using PyTorch Fully Sharded Data Parallel (FSDP). The process highlighted the efficiency of distributed training on SageMaker HyperPod, showcasing the critical role of resiliency and observability features in maintaining uninterrupted, scalable training environments.

Because of the integration with tools such as Amazon Managed Service for Prometheus and Amazon Managed Grafana for real-time monitoring, along with the robust cluster management capabilities of SageMaker HyperPod, ML practitioners can focus on model development rather than infrastructure management. The detailed steps for setting up the infrastructure, deploying the observability stack, and running a training job demonstrate how SageMaker HyperPod helps tackle the complexities of distributed training.

Moreover, the automatic health checks and the auto-resume feature significantly reduce downtime and minimize the impact of hardware failures so that large-scale training jobs can proceed with minimal interruptions. This level of resilience is crucial for maintaining the pace of innovation in AI research, especially when dealing with massive FMs.

By following the outlined procedures and using the powerful tools provided by AWS, data scientists and engineers can optimize their training workflows, reduce operational overhead, and accelerate the development of state-of-the-art models.

Getting Started

Interested in getting started with SageMaker HyperPod? Reach out to your AWS Account Team or email aws-frameworks-gtm@amazon.com. To begin experimenting with other examples on SageMaker HyperPod, refer to the awsome-distributed-training GitHub repo and the Amazon SageMaker HyperPod workshop.


About the Authors

Niithiyn Vijeaswaran is a Solutions Architect at AWS. His area of focus is generative AI and AWS AI Accelerators. He holds a Bachelor’s degree in Computer Science and Bioinformatics. Niithiyn works closely with the Generative AI GTM team to enable AWS customers on multiple fronts and accelerate their adoption of generative AI. He’s an avid fan of the Dallas Mavericks and enjoys collecting sneakers.

Aman Shanbhag is an Associate Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services, where he helps customers and partners with deploying ML Training and Inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in Computer Science, Mathematics, and Entrepreneurship.

Armando Diaz is a Solutions Architect at AWS. He focuses on generative AI, AI/ML, and data analytics. At AWS, Armando helps customers integrate cutting-edge generative AI capabilities into their systems, fostering innovation and competitive advantage. When he’s not at work, he enjoys spending time with his wife and family, hiking, and traveling the world.

Rohit Talluri is a Generative AI GTM Specialist (Tech BD) at Amazon Web Services (AWS). He is partnering with key GenAI foundation model providers, AWS service teams, strategic customers, founders, universities, venture ecosystems, and Amazon to develop technology strategy that enables the next generation of artificial intelligence, machine learning, and accelerated computing on AWS.

Anoop Saha is a Sr GTM Specialist at Amazon Web Services (AWS) focusing on Gen AI model training and inference. He is partnering with top foundation model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions. Before AWS, Anoop held several leadership roles at startups and large corporations, primarily focusing on silicon and system architecture of AI infrastructure.

Read More

Building an efficient MLOps platform with OSS tools on Amazon ECS with AWS Fargate

Building an efficient MLOps platform with OSS tools on Amazon ECS with AWS Fargate

This post has been co-written with Artem Sysuev, Danny Portman, Matúš Chládek, and Saurabh Gupta from Zeta Global.

Zeta Global is a leading data-driven, cloud-based marketing technology company that empowers enterprises to acquire, grow and retain customers. The company’s Zeta Marketing Platform (ZMP) is the largest omnichannel marketing platform with identity data at its core. The ZMP analyzes billions of structured and unstructured data points to predict consumer intent by using sophisticated artificial intelligence (AI) to personalize experiences at scale. For more information, see Zeta Global’s home page.

What Zeta has accomplished in AI/ML

In the fast-evolving landscape of digital marketing, Zeta Global stands out with its groundbreaking advancements in artificial intelligence. Zeta’s AI innovations over the past few years span 30 pending and issued patents, primarily related to the application of deep learning and generative AI to marketing technology. Using AI, Zeta Global has revolutionized how brands connect with their audiences, offering solutions that aren’t just innovative, but also incredibly effective. As an early adopter of large language model (LLM) technology, Zeta released Email Subject Line Generation in 2021. This tool enables marketers to craft compelling email subject lines that significantly boost open rates and engagement, tailored perfectly to the audience’s preferences and behaviors.

Further expanding the capabilities of AI in marketing, Zeta Global has developed AI Lookalikes. This technology allows companies to identify and target new customers who closely resemble their best existing customers, thereby optimizing marketing efforts and improving their return on investment (ROI). The backbone of these advancements is ZOE, Zeta’s Optimization Engine. ZOE is a multi-agent LLM application that integrates with multiple data sources to provide a unified view of the customer, simplify analytics queries, and facilitate marketing campaign creation. Together, these AI-driven tools and technologies aren’t just reshaping how brands perform marketing tasks; they’re setting new benchmarks for what’s possible in customer engagement.

In addition to its groundbreaking AI innovations, Zeta Global has harnessed Amazon Elastic Container Service (Amazon ECS) with AWS Fargate to deploy a multitude of smaller models efficiently.

Zeta’s AI innovation is powered by a proprietary machine learning operations (MLOps) system, developed in-house.

Context

In early 2023, Zeta’s machine learning (ML) teams shifted from traditional vertical teams to a more dynamic horizontal structure, introducing the concept of pods comprising diverse skill sets. This paradigm shift aimed to accelerate project delivery by fostering collaboration and synergy among teams with varied expertise. The need for a centralized MLOps platform became apparent as ML and AI applications proliferated across various teams, leading to a maze of maintenance complexities and hindering knowledge transfer and innovation.

To address these challenges, the organization developed an MLOps platform based on four key open-source tools: Airflow, Feast, dbt, and MLflow. Hosted on Amazon ECS with tasks run on Fargate, this platform streamlines the end-to-end ML workflow, from data ingestion to model deployment. This blog post delves into the details of this MLOps platform, exploring how the integration of these tools facilitates a more efficient and scalable approach to managing ML projects.

Architecture overview

Our MLOps architecture is designed to automate and monitor all stages of the ML lifecycle. At its core, it integrates:

  • Airflow for workflow orchestration
  • Feast for feature management
  • dbt for accelerated data transformation
  • MLflow for experiment tracking and model management

These components interact within the Amazon ECS environment, providing a scalable and serverless platform where ML workflows are run in containers using Fargate. This setup not only simplifies infrastructure management, but also ensures that resources are used efficiently, scaling up or down as needed.

The following figure shows the MLOps architecture.

Architectural deep dive

The following details dive deep into each of the components used in this architecture.

Airflow for workflow orchestration

Airflow schedules and manages complex workflows, defining tasks and dependencies in Python code. An example directed acyclic graph (DAG) might automate data ingestion, processing, model training, and deployment tasks, ensuring that each step runs in the correct order and at the right time.
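As a minimal sketch of that shape (the DAG and task names are hypothetical, the placeholder operators stand in for the ECS task invocations described next, and Airflow 2.4 or later is assumed):

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Placeholder tasks; in practice each stage launches an Amazon ECS task on Fargate.
with DAG(
    dag_id="example_ml_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    ingest = EmptyOperator(task_id="data_ingestion")
    process = EmptyOperator(task_id="data_processing")
    train = EmptyOperator(task_id="model_training")
    deploy = EmptyOperator(task_id="model_deployment")

    # Dependencies ensure each step runs only after the previous one succeeds.
    ingest >> process >> train >> deploy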

It’s worth mentioning, though, that Airflow doesn’t run the workloads itself at runtime, as is usual for extract, transform, and load (ETL) tasks. Instead, every Airflow task launches an Amazon ECS task with some overrides. Additionally, we’re using a custom Airflow operator called ECSTaskLogOperator that allows us to process Amazon CloudWatch logs using downstream systems.

model_training = ECSTaskLogOperator(
    task_id=<...>,
    task_definition=<...>,
    cluster=<...>,
    launch_type="FARGATE",
    aws_conn_id=<...>,
    overrides={
        "containerOverrides": [
            {
                "name": "<...>",
                "environment": [
                    {
                        "name": "MLFLOW_TRACKING_URI",
                        "value": "<...>",
                    },
                ],
                "command": ["mlflow", "run", <...>],
            }
        ],
    },
)

Feast for feature management

Feast acts as a central repository for storing and serving features, ensuring that models in both training and production environments use consistent and up-to-date data. It simplifies feature access for model training and inference, significantly reducing the time and complexity involved in managing data pipelines.

Additionally, Feast promotes feature reuse, so the time spent on data preparation is reduced greatly.

from datetime import timedelta

from feast import Entity, FeatureView, FeatureService, Field, SnowflakeSource
from feast.types import Float64

entities = [
    Entity(name="site_id", join_keys=["SITE_ID"]),
    Entity(name="user_id", join_keys=["USER_ID"]),
]

def create_feature_view(name, table, field_name, schema_name):
    return FeatureView(
        name=name,
        entities=entities,
        ttl=timedelta(days=30),
        schema=[Field(name=field_name, dtype=Float64)],
        source=SnowflakeSource(
            database="<...>",
            schema=schema_name,
            table=table,
            timestamp_field="<...>",
        ),
        tags="<...>",
    )

feature_view_1 = create_feature_view("<...>", "<...>", "<...>", "<...>")
feature_view_2 = create_feature_view("<...>", "<...>", "<...>", "<...>")

my_feature_service = FeatureService(
    name="my_feature_service",
    features=[feature_view_1, feature_view_2],
    description="This is my Feature Service",
    owner="<...>",
)
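At training or inference time, the features registered above can be retrieved through the feature store. The following is a minimal sketch assuming a local Feast repository configuration; the entity values are placeholders.

from feast import FeatureStore

# Assumes a Feast repository (feature_store.yaml) in the current directory.
store = FeatureStore(repo_path=".")

feature_service = store.get_feature_service("my_feature_service")

# Fetch the latest feature values for given entities, for example at online inference time.
online_features = store.get_online_features(
    features=feature_service,
    entity_rows=[{"site_id": "<...>", "user_id": "<...>"}],
).to_dict()

print(online_features)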

dbt for data transformation

dbt is used for transforming data within the data warehouse, allowing data teams to define complex data models in SQL. It promotes a disciplined approach to data modeling, making it easier to ensure data quality and consistency across the ML pipelines. Moreover, it provides a straightforward way to track data lineage, so we can foresee which datasets will be affected by newly introduced changes. The following figure shows a schema definition and a model that references it.

MLflow for experiment tracking and model management

MLflow tracks experiments and manages models. It provides a unified interface for logging parameters, code versions, metrics, and artifacts, making it easier to compare experiments and manage the model lifecycle.

Similar to Airflow, MLflow is also used only partially. The main parts we use are the tracking server and the model registry. In our experience, the artifact server has some limitations, such as limits on artifact size (because artifacts are sent over the REST API). As a result, we use it only in a limited way.

We don’t extensively use the deployment capabilities of MLflow, because in our current setup, we build custom inference containers.
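To make the tracking server and model registry usage concrete, the following is a minimal sketch of how a training job typically interacts with them; the experiment name, metric names, and model name are placeholders rather than values from our platform.

import mlflow

# The tracking URI is injected as MLFLOW_TRACKING_URI in the ECS task override shown earlier,
# but it can also be set explicitly.
mlflow.set_tracking_uri("<...>")
mlflow.set_experiment("my-experiment")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("val_auc", 0.91)

# Register a model version in the model registry from a completed run.
mlflow.register_model(model_uri="runs:/<run-id>/model", name="my-model")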

Hosting on Amazon ECS with Fargate

Amazon ECS offers a highly scalable and secure environment for running containerized applications. Fargate eliminates the need for managing underlying infrastructure, allowing us to focus solely on deploying and running the containers. This abstraction layer simplifies the deployment process, enabling seamless scaling based on workload demands while optimizing resource utilization and cost efficiency.

We found it optimal to run on Fargate the components of our ML workflows that don’t require GPUs or distributed processing. These include dbt pipelines, data gathering jobs, and training, evaluation, and batch inference jobs for smaller models.

Furthermore, Amazon ECS and Fargate seamlessly integrate with other AWS services, such as Amazon Elastic Container Registry (Amazon ECR) for container image management and AWS Systems Manager Parameter Store for securely storing and managing secrets and configurations. Using Parameter Store, we can centralize configuration settings, such as database connection strings, API keys, and environment variables, eliminating the need for hardcoding sensitive information within container images. This enhances security and simplifies maintenance, because secrets and configuration values can be dynamically retrieved by containers at runtime, ensuring consistency across deployments.
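For example, a container can resolve its configuration at startup with a few lines of boto3; the parameter name below is hypothetical.

import boto3

ssm = boto3.client("ssm")

def get_config(name: str) -> str:
    """Fetch a (possibly encrypted) configuration value from Parameter Store at container startup."""
    response = ssm.get_parameter(Name=name, WithDecryption=True)
    return response["Parameter"]["Value"]

# Hypothetical parameter name; nothing sensitive is hardcoded in the container image.
db_connection_string = get_config("/mlops/prod/db_connection_string")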

Moreover, integrating Amazon ECS and Fargate with CloudWatch enables comprehensive monitoring and logging capabilities for containerized tasks. This can be achieved by enabling the awslogs log driver within the logConfiguration parameters of the task definitions.
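In a task definition, this is a small logConfiguration block on each container definition. The following fragment is a sketch of what you would include when registering the task definition (for example with the boto3 register_task_definition call); the container name, image, and log group are placeholders.

container_definition = {
    "name": "ml-training",  # hypothetical container name
    "image": "<account-id>.dkr.ecr.<region>.amazonaws.com/ml-training:latest",
    "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
            "awslogs-group": "/ecs/ml-training",  # CloudWatch log group to write to
            "awslogs-region": "<region>",
            "awslogs-stream-prefix": "ecs",
        },
    },
}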

Why ECS with Fargate is the solution of choice

  1. Serverless model:
    • No infrastructure management: With Fargate, we don’t need to provision, configure, or manage servers. This simplifies operations and reduces the operational overhead, allowing teams to focus on developing and deploying applications.
    • Automatic scaling: Fargate automatically scales our applications based on demand, ensuring optimal performance without manual intervention.
  2. Cost efficiency:
    • Pay-as-you-go: Fargate charges are based on the resources (vCPU and memory) that the containers use. This model can be more cost-effective compared to maintaining idle resources.
    • No over-provisioning: Because we only pay for what we use, there’s no need to over-provision resources, which can lead to cost savings.
  3. Enhanced security:
    • Isolation: Each Fargate task runs in its own isolated environment, improving security. There’s no sharing of underlying compute resources with other tenants.
  4. Integration with the AWS ecosystem:

Configuring Amazon ECS with Fargate for ML workloads

Configuring Amazon ECS with Fargate for ML workloads involves the following steps.

  1. Docker images: ML models and applications are containerized using Docker. This includes all dependencies, libraries, and configurations needed to run the ML workload.
  2. Creating task definitions:
    • Define resources: Create an Amazon ECS task definition specifying the Docker image, required vCPU, memory, and other configurations.
    • Environment variables: Set environment variables, such as model paths, API keys, and other necessary parameters.
  3. IAM roles: Assign appropriate AWS Identity and Access Management (IAM) roles to the tasks for accessing other AWS resources securely.
  4. Logging using CloudWatch: Use CloudWatch for logging and monitoring the performance and health of ML workloads.

Future development and addressing emerging challenges

As the field of MLOps continues to evolve, it’s essential to anticipate and address upcoming challenges to ensure that the platform remains efficient, scalable, and user-friendly. Two primary areas of future development for our platform include:

  1. Enhancing bring your own model (BYOM) capabilities for external clients
  2. Reducing the learning curve for data scientists

This section outlines those challenges and proposes directions for future enhancements.

Enhancing BYOM capabilities

As machine learning becomes more democratized, there is a growing need for platforms to easily integrate models developed externally by Zeta’s clients.

Future directions:

  • Developing standardized APIs: Implement APIs that allow for easy integration of external models, regardless of the framework or language they were developed in. This would involve creating a set of standardized interfaces for model ingestion, validation, and deployment.
  • Creating a model adapter framework: Design a framework that can adapt external models to be compatible with the platform’s infrastructure, ensuring that they can be managed, tracked, and deployed just like internally developed models.
  • Enhancing documentation and support: Provide comprehensive documentation and support resources to guide external clients through the BYOM process, including best practices for model preparation, integration, and optimization.

Reducing the learning curve for data scientists

The incorporation of multiple specialized tools (Airflow, Feast, dbt, and MLflow) into the MLOps pipeline can present a steep learning curve for data scientists, potentially hindering their productivity and the overall efficiency of the ML development process.

Future directions:

We’ll do the following to help reduce the learning curve:

  • Creating unified interfaces: Develop a unified interface, including UI, API, and SDK, that abstracts away the complexities of interacting with each tool individually. This interface could provide simplified workflows, automating routine tasks and presenting a cohesive view of the entire ML lifecycle.
  • Offering comprehensive training and resources: Invest in training programs and resources tailored to data scientists at different skill levels. This could include interactive tutorials, workshops, and detailed case studies showcasing real-world applications of the platform.

Conclusion

Integrating Airflow, Feast, dbt, and MLflow into an MLOps platform hosted on Amazon ECS with AWS Fargate presents a robust solution for managing the ML lifecycle. This setup not only streamlines operations but also enhances scalability and efficiency, allowing data science teams to focus on innovation rather than infrastructure management.

Additional Resources

For those looking to dive deeper, we recommend exploring the official documentation and tutorials for each tool (Airflow, Feast, dbt, and MLflow) and for Amazon ECS. These resources are invaluable for understanding the capabilities and configurations of each component in our MLOps platform.


About the authors

Varad Ram holds the position of Senior Solutions Architect at Amazon Web Services. He possesses extensive experience encompassing application development, cloud migration strategies, and information technology team management. Recently, his primary focus has shifted towards assisting clients in navigating the process of productizing generative artificial intelligence use cases.

Artem Sysuev is a Lead Machine Learning Engineer at Zeta, passionate about creating efficient, scalable solutions. He believes that effective processes are key to success, which led him to focus on both machine learning and MLOps. Starting with machine learning, Artem developed skills in building predictive models. Over time, he saw the need for strong operational frameworks to deploy and maintain these models at scale, which drew him to MLOps. At Zeta, he drives innovation by automating workflows and improving collaboration, ensuring smooth integration of machine learning models into production systems.

Saurabh Gupta is a Principal Engineer at Zeta Global. He is passionate about machine learning engineering, distributed systems, and big-data technologies. He has built scalable platforms that empower data scientists and data engineers, focusing on low-latency, resilient systems that streamline workflows and drive innovation. He holds a B.Tech degree in Electronics and Communication Engineering from the Indian Institute of Technology (IIT), Guwahati, and has deep expertise in designing data-driven solutions that support advanced analytics and machine learning initiatives.

Matúš Chládek is a Senior Engineering Manager for ML Ops at Zeta Global. With a career that began in Data Science, Matúš has developed a strong foundation in analytics and machine learning. Over the years, Matúš transitioned into more engineering-focused roles, eventually becoming a Machine Learning Engineer before moving into Engineering Management. Matúš’s leadership focuses on building robust, scalable infrastructure that streamlines workflows and supports rapid iteration and production-ready delivery of machine learning projects. Matúš is passionate about driving innovation at the intersection of Data Science and Engineering, making advanced analytics accessible and scalable for internal users and clients alike.

Dr. Danny Portman is a recognized thought leader in AI and machine learning, with over 30 patents focused on Deep Learning and Generative AI applications in advertising and marketing technology. He holds a Ph.D. in Computational Physics, specializing in high-performance computing models for simulating complex astrophysical systems. With a strong background in quantitative research, Danny brings a wealth of experience in applying data-driven approaches to solve problems across various sectors. As VP of Data Science and Head of AI/ML at Zeta Global, Dr. Portman leads the development of AI-driven products and strategies, and spearheads the company’s cutting-edge Generative AI R&D efforts to deliver innovative solutions for marketers.

Read More

Build RAG-based generative AI applications in AWS using Amazon FSx for NetApp ONTAP with Amazon Bedrock

Build RAG-based generative AI applications in AWS using Amazon FSx for NetApp ONTAP with Amazon Bedrock

The post is co-written with Michael Shaul and Sasha Korman from NetApp.

Generative artificial intelligence (AI) applications are commonly built using a technique called Retrieval Augmented Generation (RAG) that provides foundation models (FMs) access to additional data they didn’t have during training. This data is used to enrich the generative AI prompt to deliver more context-specific and accurate responses without continuously retraining the FM, while also improving transparency and minimizing hallucinations.

In this post, we demonstrate a solution using Amazon FSx for NetApp ONTAP with Amazon Bedrock to provide a RAG experience for your generative AI applications on AWS by bringing company-specific, unstructured user file data to Amazon Bedrock in a straightforward, fast, and secure way.

Our solution uses an FSx for ONTAP file system as the source of unstructured data and continuously populates an Amazon OpenSearch Serverless vector database with the user’s existing files and folders and associated metadata. This enables a RAG scenario with Amazon Bedrock by enriching the generative AI prompt using Amazon Bedrock APIs with your company-specific data retrieved from the OpenSearch Serverless vector database.

When developing generative AI applications such as a Q&A chatbot using RAG, customers are also concerned about keeping their data secure and preventing end-users from querying information from unauthorized data sources. Our solution also uses FSx for ONTAP to allow users to extend their current data security and access mechanisms to augment model responses from Amazon Bedrock. We use FSx for ONTAP as the source of associated metadata, specifically the user’s security access control list (ACL) configurations attached to their files and folders and populate that metadata into OpenSearch Serverless. By combining access control operations with file events that notify the RAG application of new and changed data on the file system, our solution demonstrates how FSx for ONTAP enables Amazon Bedrock to only use embeddings from authorized files for the specific users that connect to our generative AI application.

AWS serverless services make it straightforward to focus on building generative AI applications by providing automatic scaling, built-in high availability, and a pay-for-use billing model. Event-driven compute with AWS Lambda is a good fit for compute-intensive, on-demand tasks such as document embedding and flexible large language model (LLM) orchestration, and Amazon API Gateway provides an API interface that allows for pluggable frontends and event-driven invocation of the LLMs. Our solution also demonstrates how to build a scalable, automated, API-driven serverless application layer on top of Amazon Bedrock and FSx for ONTAP using API Gateway and Lambda.

Solution overview

The solution provisions an FSx for ONTAP Multi-AZ file system with a storage virtual machine (SVM) joined to an AWS Managed Microsoft AD domain. An OpenSearch Serverless vector search collection provides a scalable and high-performance similarity search capability. We use an Amazon Elastic Compute Cloud (Amazon EC2) Windows server as an SMB/CIFS client to the FSx for ONTAP volume and configure data sharing and ACLs for the SMB shares in the volume. We use this data and ACLs to test permissions-based access to the embeddings in a RAG scenario with Amazon Bedrock.

The embeddings container component of our solution is deployed on an EC2 Linux server and mounted as an NFS client on the FSx for ONTAP volume. It periodically migrates existing files and folders along with their security ACL configurations to OpenSearch Serverless. It populates an index in the OpenSearch Serverless vector search collection with company-specific data (and associated metadata and ACLs) from the NFS share on the FSx for ONTAP file system.
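The embeddings container in the solution repo implements this flow; the following minimal sketch shows the core idea of embedding a chunk of text with the Amazon Titan Embeddings model and assembling the document, including its ACL metadata, that is written to the OpenSearch Serverless index. The field names and helpers are simplifications rather than the repo’s actual code.

import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def embed(text: str) -> list:
    """Create a vector embedding for a chunk of text with the Amazon Titan Embeddings model."""
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]

def build_document(chunk: str, file_path: str, acl_sids: list) -> dict:
    """Assemble the document stored in the OpenSearch Serverless vector index (simplified fields)."""
    return {
        "vector_field": embed(chunk),
        "text": chunk,
        "source": file_path,
        "acl": acl_sids,  # SIDs allowed to access this chunk, used to filter retrieval results
    }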

The solution implements a RAG Retrieval Lambda function that allows RAG with Amazon Bedrock by enriching the generative AI prompt using Amazon Bedrock APIs with your company-specific data and associated metadata (including ACLs) retrieved from the OpenSearch Serverless index that was populated by the embeddings container component. The RAG Retrieval Lambda function stores conversation history for the user interaction in an Amazon DynamoDB table.

End-users interact with the solution by submitting a natural language prompt either through a chatbot application or directly through the API Gateway interface. The chatbot application container is built using Streamlit and fronted by an AWS Application Load Balancer (ALB). When a user submits a natural language prompt to the chatbot UI using the ALB, the chatbot container interacts with the API Gateway interface that then invokes the RAG Retrieval Lambda function to fetch the response for the user. The user can also directly submit prompt requests to API Gateway and obtain a response. We demonstrate permissions-based access to the RAG documents by explicitly retrieving the SID of a user and then using that SID in the chatbot or API Gateway request, where the RAG Retrieval Lambda function then matches the SID to the Windows ACLs configured for the document. As an additional authentication step in a production environment, you may want to also authenticate the user against an identity provider and then match the user against the permissions configured for the documents.

We start by configuring data sharing and ACLs with FSx for ONTAP; these are then periodically scanned by the embeddings container. The embeddings container splits the documents into chunks and uses the Amazon Titan Embeddings model to create vector embeddings from these chunks. It then stores these vector embeddings with associated metadata in our vector database by populating an index in a vector collection in OpenSearch Serverless. The following diagram illustrates this end-to-end flow.


The following architecture diagram illustrates the various components of our solution.

Prerequisites

Complete the following prerequisite steps:

  1. Make sure you have model access in Amazon Bedrock. In this solution, we use Anthropic Claude v3 Sonnet on Amazon Bedrock.
  2. Install the AWS Command Line Interface (AWS CLI).
  3. Install Docker.
  4. Install Terraform.

Deploy the solution

The solution is available for download on this GitHub repo. Cloning the repository and using the Terraform template will provision all the components with their required configurations.

  1. Clone the repository for this solution:
    sudo yum install -y unzip
    git clone https://github.com/aws-samples/genai-bedrock-fsxontap.git
    cd genai-bedrock-fsxontap/terraform

  2. From the terraform folder, deploy the entire solution using Terraform:
    terraform init
    terraform apply -auto-approve

This process can take 15–20 minutes to complete. When finished, the output of the terraform commands should look like the following:

api-invoke-url = "https://9ng1jjn8qi.execute-api.<region>.amazonaws.com/prod"
fsx-management-ip = toset([
"198.19.255.230",])
fsx-secret-id = "arn:aws:secretsmanager:<region>:<account-id>:secret:AmazonBedrock-FSx-NetAPP-ONTAP-a2fZEdIt-0fBcS9"
fsx-svm-smb-dns-name = "BRSVM.BEDROCK-01.COM"
lb-dns-name = "chat-load-balancer-2040177936.<region>.elb.amazonaws.com"

Load data and set permissions

To test the solution, we will use the EC2 Windows server (ad_host) mounted as an SMB/CIFS client to the FSx for ONTAP volume to share sample data and set user permissions that will then be used to populate the OpenSearch Serverless index by the solution’s embedding container component. Perform the following steps to mount your FSx for ONTAP SVM data volume as a network drive, upload data to this shared network drive, and set permissions based on Windows ACLs:

  1. Obtain the ad_host instance DNS from the output of your Terraform template.
  2. Navigate to AWS Systems Manager Fleet Manager on your AWS console, locate the ad_host instance and follow instructions here to login with Remote Desktop. Use the domain admin user bedrock-01Admin and obtain the password from AWS Secrets Manager. You can find the password using the Secrets Manager fsx-secret-id secret id from the output of your Terraform template.
  3. To mount an FSx for ONTAP data volume as a network drive, under This PC, choose (right-click) Network and then choose Map Network drive.
  4. Choose the drive letter and use the FSx for ONTAP share path for the mount
    (\\<svm>.<domain>\c$\<volume-name>).
  5. Upload the Amazon Bedrock User Guide to the shared network drive and set permissions to the admin user only (make sure that you disable inheritance under Advanced).
  6. Upload the Amazon FSx for ONTAP User Guide to the shared drive and make sure permissions are set to Everyone.
  7. On the ad_host server, open the command prompt and enter the following command to obtain the SID for the admin user:
    wmic useraccount where name='Admin' get sid

Test permissions using the chatbot

To test permissions using the chatbot, obtain the lb-dns-name URL from the output of your Terraform template and access it through your web browser.


For the prompt query, ask any general question about the FSx for ONTAP user guide, which is available for access by everyone. In our scenario, we asked "How can I create an FSx for ONTAP file system," and the model replied in the chat window with detailed steps and source attribution for creating an FSx for ONTAP file system using the AWS Management Console, AWS CLI, or FSx API.


Now, let’s ask a question about the Amazon Bedrock user guide, which is available for admin access only. In our scenario, we asked "How do I use foundation models with Amazon Bedrock," and the model replied that it doesn’t have enough information to provide a detailed answer.

Use the admin SID in the user (SID) filter search in the chat UI and ask the same question in the prompt. This time, the model should reply with steps detailing how to use FMs with Amazon Bedrock and provide the source attribution used by the model for the response.

Test permissions using API Gateway

You can also query the model directly using API Gateway. Obtain the api-invoke-url parameter from the output of your Terraform template, then invoke the API Gateway with Everyone access for a query related to the FSx for ONTAP user guide by setting the value of the metadata parameter to NA:

curl -v '<api-invoke-url>/bedrock_rag_retreival' -X POST -H 'content-type: application/json' -d '{"session_id": "1","prompt": "What is an FSxN ONTAP filesystem?", "bedrock_model_id": "anthropic.claude-3-sonnet-20240229-v1:0", "model_kwargs": {"temperature": 1.0, "top_p": 1.0, "top_k": 500}, "metadata": "NA", "memory_window": 10}'

Then invoke the API Gateway for a query related to the Amazon Bedrock user guide by setting the value of the metadata parameter to the admin user’s SID that you obtained earlier:

curl -v '<api-invoke-url>/bedrock_rag_retreival' -X POST -H 'content-type: application/json' -d '{"session_id": "1","prompt": "what is bedrock?", "bedrock_model_id": "anthropic.claude-3-sonnet-20240229-v1:0", "model_kwargs": {"temperature": 1.0, "top_p": 1.0, "top_k": 500}, "metadata": "S-1-5-21-4037439088-1296877785-2872080499-1112", "memory_window": 10}'

Cleanup

To avoid recurring charges, clean up your account after trying the solution. From the terraform folder, destroy all the resources that were provisioned by the Terraform template:

terraform apply --destroy

Conclusion

In this post, we demonstrated a solution that uses FSx for ONTAP with Amazon Bedrock and uses FSx for ONTAP support for file ownership and ACLs to provide permissions-based access in a RAG scenario for generative AI applications. Our solution enables you to build generative AI applications with Amazon Bedrock where you can enrich the generative AI prompt in Amazon Bedrock with your company-specific, unstructured user file data from an FSx for ONTAP file system. This solution enables you to deliver more relevant, context-specific, and accurate responses while also making sure only authorized users have access to that data. Finally, the solution demonstrates the use of AWS serverless services with FSx for ONTAP and Amazon Bedrock that enable automatic scaling, event-driven compute, and API interfaces for your generative AI applications on AWS.

For more information about how to get started building with Amazon Bedrock and FSx for ONTAP, refer to the following resources:


About the authors

Kanishk Mahajan is Principal, Solutions Architecture at AWS. He leads cloud transformation and solution architecture for ISV customers and partners at AWS. Kanishk specializes in containers, cloud operations, migrations and modernizations, AI/ML, resilience and security and compliance. He is a Technical Field Community (TFC) member in each of those domains at AWS.

Michael Shaul is a Principal Architect at NetApp’s office of the CTO. He has over 20 years of experience building data management systems, applications, and infrastructure solutions. He has a unique in-depth perspective on cloud technologies, builder, and AI solutions.

Sasha Korman is a tech visionary leader of dynamic development and QA teams across Israel and India. With 14 years at NetApp, beginning as a programmer, his hands-on experience and leadership have been pivotal in steering complex projects to success, with a focus on innovation, scalability, and reliability.

Read More

Support for AWS DeepComposer ending soon

Support for AWS DeepComposer ending soon

AWS DeepComposer was first introduced during AWS re:Invent 2019 as a fun way for developers to compose music by using generative AI. AWS DeepComposer was the world’s first machine learning (ML)-enabled keyboard for developers to get hands-on—literally—with a musical keyboard and the latest ML techniques to compose their own music.

After careful consideration, we have made the decision to end support for AWS DeepComposer, effective September 17, 2025. With your help and feedback, our portfolio of products and services has grown to include new tools for developers to get hands-on with AI and ML. Amazon PartyRock, for example, is a generative AI playground for intuitive, code-free help in building web applications.

If you have data stored on the AWS DeepComposer console, you will be able to use AWS DeepComposer as normal until September 17, 2025, when support for the service will end. After this date, you will no longer be able to use AWS DeepComposer through the AWS Management Console, manage AWS DeepComposer devices, or access any compositions or models you have created. Until then, you can continue to work on your compositions or models and export those you would like to keep by using the step-by-step guide in the AWS DeepComposer FAQs.

If you have additional questions, please read our FAQs or contact us.


About the author

Kanchan Jagannathan is a Sr. Program Manager in the AWS AI Devices team, where he helps launch AWS devices into sales channels and also oversees the Service Availability Change process for the team. He was a Program Manager for FC automation deployment and launches before joining AWS. Outside of work, he has bravely begun camping with his 5-year-old and 1-year-old kids and enjoys the moments he gets to be with them.

Read More