Integrate Amazon Bedrock Knowledge Bases with Microsoft SharePoint as a data source

Integrate Amazon Bedrock Knowledge Bases with Microsoft SharePoint as a data source

Amazon Bedrock Knowledge Bases provides foundation models (FMs) and agents in Amazon Bedrock contextual information from your company’s private data sources for Retrieval Augmented Generation (RAG) to deliver more relevant, accurate, and customized responses. Amazon Bedrock Knowledge Bases offers a fully managed RAG experience.

The data sources that can be connected to as knowledge bases are continuously expanding. This post showcases how to use one of the data source connectors; Microsoft SharePoint, an integrated content management and collaboration tool that many organizations use for storing, organizing, and sharing their internal data. See Data source connectors for the full list of supported data source connectors.

Solution overview

The following are some pertinent features of the SharePoint data source within Amazon Bedrock Knowledge Bases:

  • It provides access to the information stored in SharePoint. The RAG architecture queries and retrieves relevant information from the SharePoint source to provide contextual responses based on the user’s input.
  • It provides the ability to extract structured data, metadata, and other information from documents ingested from SharePoint to provide relevant search results based on the user query.
  • It provides the ability to sync incremental SharePoint content updates on an ongoing basis.
  • It provides source attribution to the response generated by the FM.

In the following sections, we walk through the steps to create a knowledge base, configure your data source, and test the solution.

Prerequisites

The following are the prerequisites necessary to implement Amazon Bedrock Knowledge Bases with SharePoint as a connector:

Create a knowledge base and connect to the data source

Complete the following steps to set up a knowledge base on Amazon Bedrock and connect to a SharePoint data source:

  1. On the Amazon Bedrock console, choose Knowledge bases in the navigation pane.
  2. Choose Create knowledge base.

kb-landing-view

  1. In the Knowledge base details section, optionally change the default name and enter a description for your knowledge base.
  2. In the IAM permissions section, select an IAM role that provides Amazon Bedrock permission to access other AWS services. You can let Amazon Bedrock create the service role or choose a custom role that you have created.
  3. In the Choose data source section, select SharePoint.
  4. Optionally, add tags to your knowledge base. For more information, see Tag resources.
  5. Choose Next.

kb-details-1

  1. In the Name and Description section, optionally change the default data source name and enter a description of the data source.
  2. In the Source section, provide the following information:
    1. For Site URLs, enter the site URLs to use for crawling and indexing the content for RAG.
    2. For Domain, enter the domain name associated with the data source. For example, if the site URL is https://deloittedasits.sharepoint.com/xyz.aspx, the domain value would be deloittedasits.
    3. Under Advanced settings, keep the default selections.

kb-details-name-desc

While converting your data into embeddings, Amazon Bedrock encrypts your data with a key that AWS owns and manages by default. To use your own AWS Key Management Service (AWS KMS) key, choose Customize encryption settings (Advanced) and choose a key. For more information, see Encryption of transient data storage during data ingestion.

You can also choose from the following options for the data deletion policy for your data source:

  • Delete – Deletes all underlying data belonging to the data source from the vector store upon deletion of a knowledge base or data source resource. Note that the vector store itself is not deleted, only the underlying data. This flag is ignored if an AWS account is deleted.
  • Retain – Retains all underlying data in your vector store upon deletion of a knowledge base or data source resource.

For more information on managing your knowledge base, see Manage a data source.

ML-17173-kb-details-advanced-settings

  1. In the Authentication section, the supported authentication method is set to OAuth 2.0.
    1. For Tenant ID, enter your tenant ID. Refer to section Register a new application in the Microsoft Azure Portal of this post to get the Tenant ID.
    2. For AWS Secrets Manager secret, enter an AWS Secrets Manager Refer to the section Create a Secrets Manager secret for the SharePoint data source of this post to get the secret.

The SharePoint data source will need credentials to connect to the SharePoint Online site using the Microsoft Graph API. To facilitate this, create a new Secrets Manager secret. These credentials will not be used in any access logs for the SharePoint Online Site.

kb-details-authentication

  1. In the Metadata Settings section, optionally select any content types that you want to include or exclude.

kb-details-metadata

  1. In the Content chunking and parsing section, select Default.

kb-details-content-chunking

  1. Choose Next.
  2. In the Embeddings model section, select Titan Embeddings G1 – Text or another embeddings model as appropriate.
  3. In the Vector database section, select Quick create a new vector store to create a vector store for the embeddings.
  4. Choose Next.

kb-details-embeddings

  1. On the Review and create page, verify the selections you made and choose Create.

The knowledge base creation should be complete.

kn-created-success

The knowledge base with SharePoint as the data source is now created. However, the data source needs to be synced in order to crawl the site URLs and index the associated content.

  1. To initiate this process, on the knowledge base details page, select your data source and choose Sync.

kb-sync

Register a new application in the Microsoft Azure Portal

In this section, we register a new application in the Microsoft Azure Portal. We capture the Tenant ID from this step to use when configuring the data source for Knowledge Base for Amazon Bedrock. Complete the following steps:

  1. Open the Azure Portal and log in with your Microsoft account. If you don’t have an account, you can create one or contact your organization’s administration team.
  2. Choose New registration.
  3. Provide the following information:
    1. For Name, provide the name for your application. Let’s refer to this application as TargetApp. Amazon Bedrock Knowledge Bases uses TargetApp to connect to the SharePoint site to crawl and index the data.
    2. For Who can use this application or access this API, choose Accounts in this organizational directory only (<Tenant name> only – Single tenant).
    3. Choose Register.
    4. Note down the application (client) ID and the directory (tenant) ID on the Overview You’ll need them later when asked for TargetApp-ClientId and TenantId.
  4. Choose API permissions in the navigation pane.
  5. Configure the permissions as follows:
    1. Choose Add a permission.
    2. Choose Microsoft Graph.
    3. Choose Delegated permissions.
    4. Choose Read.All in the User section.
    5. Choose Read.All in the GroupMember section.
    6. Choose FullControl.All in the Sites section.
    7. Choose Add permissions. This permission allows the app to read data in your organization’s directory about the signed-in user.
    8. On the options menu (three dots), choose Remove permission.
    9. Remove the original Read – Delegated permission.
    10. Choose Grant admin consent for the default directory.

SPO-register-app

  1. Choose Certificates & secrets in the navigation pane.
    1. Choose New client secret.
    2. For Description, enter a description, such as description of my client secret.
    3. Choose a value for Expires. In production, you’ll need to manually rotate your secret before it expires.
    4. Choose Add.
    5. Note down the value for your new secret. You’ll need it later when asked for your client secret (TargetApp-ClientSecret).
  2. Optionally, choose Owners to add any additional owners for the application. Owners will be able to manage permissions of the Azure AD app (TargetApp).

Create a Secrets Manager secret for the SharePoint data source

Complete the following steps to create a Secrets Manager secret to connect to the SharePoint online sites listed as site URLs within the data source:

  1. On the Secrets Manager console, choose Store a new secret.
  2. For Secret type, select Other type of secret.
  3. For Key/value pairs, enter the following:
    1. username
    2. password
    3. clientId
    4. clientSecret
  4. For Encryption key, choose aws/secretsmanager.
  5. Choose Next.
  6. In the Secret name and description section, enter the name of the secret and an optional description.
  7. Add any associated tags in the Tags
  8. Leave Resource permissions and Replication secret as default.
  9. Choose Next.
  10. In the Configure rotation section, leave as default or modify according to your organizational policies.
  11. Choose Next.
  12. Review the options you selected and choose Store.
  13. On the secrets detail page, note your secret ARN value to be used as the secret when creating the Knowledge Base for Amazon Bedrock.

kb-secret

Test the solution

Complete the following steps to test the knowledge base you created:

  1. On the Amazon Bedrock console, choose Knowledge bases in the navigation pane.
  2. Select the knowledge base you created and choose Test.

kb-test-1

  1. Choose an appropriate model for testing and choose Apply.

kb-test-2

  1. Enter your question for the content housed in the SharePoint site.

kb-test-3

Clean up

If you created a new knowledge base to experiment using this post and don’t plan to use it further, delete the knowledge base so that your AWS account doesn’t accumulate costs. For instructions, see Manage a knowledge base.

Conclusion

In this post, we showed you how to configure Amazon Bedrock Knowledge Bases with SharePoint Online as a data source. By connecting SharePoint Online as a data source, employees can interact with the organization’s knowledge and data stored in SharePoint using natural language, making it straightforward to find relevant information, extract key points, and derive valuable insights. This can significantly improve productivity, decision-making, and knowledge sharing within the organization.

Try this feature on the Amazon Bedrock console today! See Amazon Bedrock Knowledge Bases to learn more.


About the Authors

SurendarSurendar Gajavelli is a Sr. Solutions Architect based out of Nashville, Tennessee. He is a passionate technology enthusiast who enjoys working with customers and helping them build innovative solutions.

AbhiAbhi Patlolla is a Sr. Solutions Architect based out of the New York City region, helping customers in their cloud transformation, AI/ML, and data initiatives. He is a strategic and technical leader, advising executives and engineers on cloud strategies to foster innovation and positive impact.

Read More

Revolutionize logo design creation with Amazon Bedrock: Embracing generative art, dynamic logos, and AI collaboration

Revolutionize logo design creation with Amazon Bedrock: Embracing generative art, dynamic logos, and AI collaboration

In the field of technology and creative design, logo design and creation has adapted and evolved at a rapid pace. From the hieroglyphs of ancient Egypt to the sleek minimalism of today’s tech giants, the visual identities that define our favorite brands have undergone a remarkable transformation.

Today, the world of creative design is once again being transformed by the emergence of generative AI. Designers and brands now have opportunities to push the boundaries of creativity, crafting logos that are not only visually stunning but also responsive to their environments and tailored to the preferences of their target audiences.

Amazon Bedrock enables access to powerful generative AI models like Stable Diffusion through a user-friendly API. These models can be integrated into the logo design workflow, allowing designers to rapidly ideate, experiment, generate, and edit a wide range of unique visual images. Integrating it with the range of AWS serverless computing, networking, and content delivery services like AWS Lambda, Amazon API Gateway, and AWS Amplify facilitates the creation of an interactive tool to generate dynamic, responsive, and adaptive logos.

In this post, we walk through how AWS can help accelerate a brand’s creative efforts with access to a powerful image-to-image model from Stable Diffusion available on Amazon Bedrock to interactively create and edit art and logo images.

Image-to-image model

The Stability AI’s image-to-image model, SDXL, is a deep learning model that generates images based on text descriptions, images, or other inputs. It first converts the text into numerical values that summarize the prompt, then uses those values to generate an image representation. Finally, it upscales the image representation into a high-resolution image. Stable Diffusion can also generate new images based on an initial image and a text prompt. For example, it can fill in a line drawing with colors, lighting, and a background that makes sense for the subject. Stable Diffusion can also be used for inpainting (adding features to an existing image) and outpainting (removing features from an existing image).

One of its primary applications lies in advertising and marketing, where it can be used to create personalized ad campaigns and an unlimited number of marketing assets. Businesses can generate visually appealing and tailored images based on specific prompts, enabling them to stand out in a crowded marketplace and effectively communicate their brand message. In the media and entertainment sector, filmmakers, artists, and content creators can use this as a tool for developing creative assets and ideating with images.

Solution overview

The following diagram illustrates the solution architecture.

This architecture workflow involves the following steps:

  1. In the frontend UI, a user chooses from one of two options to get started:
    1. Generate an initial image.
    2. Provide an initial image link.
  2. The user provides a text prompt to edit the given image.
  3. The user chooses Call API to invoke API Gateway to begin processing on the backend.
  4. The API invokes a Lambda function, which uses the Amazon Bedrock API to invoke the Stability AI SDXL 1.0 model.
  5. The invoked model generates an image, and the output image is stored in an Amazon Simple Storage Service (Amazon S3) bucket.
  6. The backend services return the output image to the frontend UI.
  7. The user can use this generated image as a reference image and edit it, generate a new image, or provide a different initial image. They can continue this process until the model produces a satisfactory output.

Prerequisites

To set up this solution, complete the following prerequisites:

  1. Pick an AWS Region where you want to deploy the solution. We recommend using the us-east-1
  2. Obtain access to the Stability SDXL 1.0 model in Amazon Bedrock if you don’t have it already. For instructions, see Access Amazon Bedrock foundation models.
  3. If you prefer to use a separate S3 bucket for this solution, create a new S3 bucket.
  4. If you prefer to use localhost for testing the application instead of Amplify, make sure python3 is installed in your local machine.

Deploy the solution

To deploy the backend resources for the solution, we create a stack using an AWS CloudFormation template. You can upload the template directly, or upload it to an S3 bucket and link to it during the stack creation process. During the creation process, provide the appropriate variable names for apiGatewayName, apiGatewayStageName, s3BucketName, and lambdaFunctionName. If you created a new S3 bucket earlier, input that name in s3BucketName – this bucket is where output images are stored. When the stack creation is complete, all the backend resources are ready to be connected to the frontend UI.

The frontend resources play an integral part in creating an interactive environment for your end-users. Complete the following steps to integrate the frontend and backend:

  1. When the CloudFormation stack deployment is complete, open the created API from the API Gateway console.

Step 1

  1. Choose Stages in the navigation pane, and on the Stage actions menu, choose Generate SDK.

Step 2

  1. For Platform, choose JavaScript.

  1. Download and unzip the JavaScript SDK .zip file, which contains a folder called apiGateway-js-sdk.
  2. Download the frontend UI index.html file and place it in the unzipped folder.

This file is configured to integrate with the JavaScript SDK by simply placing it in the folder.

  1. After the index.html is placed in the folder, select the content of the folder and compress it into a .zip file (don’t compress the apiGateway-js-sdk folder itself.)

  1. On the Amplify console, choose Create new app.
  2. Select Deploy without Git, then choose Next.

  1. Upload the compressed .zip file, and change the application name and branch name if preferred.
  2. Choose Save and deploy.

The deployment will take a few seconds. When deployment is complete, there will be a domain URL that you can use to access the application. The application is ready to be tested at the domain URL.

CloudFormation template overview

Before we move on to testing the solution, let’s explore the CloudFormation template. This template sets up an API Gateway API with appropriate rules and paths, a Lambda function, and necessary permissions in AWS Identity and Access Management (IAM). Let’s dive deep into the content of the CloudFormation template to understand the resources created:

  • PromptProcessingAPI – This is the main API Gateway REST API. This API will be used to invoke the Lambda function. Other API Gateway resources, methods, and schemas created in the CloudFormation template are attached to this API.
  • ActionResource, ActionInputResource, PromptResource, PromptInputResource, and ProxyResource – These are API Gateway resources that define the URL path structure for the API. The path structure is /action/{actionInput}/prompt/{promptInput}/{proxy+}. The {promptInput} value is a placeholder variable for the prompt that users input in the frontend. Similarly, {actionInput} is the choice the user selected for how they want to generate the image. These are used in the backend Lambda function to process and generate images.
  • ActionInputMethod, PromptInputMethod, and ProxyMethod – These are API Gateway methods that define the integration with the Lambda function for the POST HTTP method.
  • ActionMethodCORS, ActionInputMethodCORS, PromptMethodCORS, PromptInputMethodCORS, and ProxyMethodCORS – These are API Gateway methods that handle the cross-origin resource sharing (CORs) support. These resources are crucial in integrating the frontend UI with backend resources. For more information on CORS, see What is CORS?
  • ResponseSchema and RequestSchema – These are API Gateway models that define the expected JSON schema for the response and request payloads, respectively.
  • Default4xxResponse and Default5xxResponse – These are the gateway responses that define the default response behavior for 4xx and 5xx HTTP status codes, respectively.
  • ApiDeployment – This resource deploys the API Gateway API after all of the preceding configurations have been set. After the deployment, the API is ready to use.
  • LambdaFunction – This creates a Lambda function and specifies the type of runtime, the service role for Lambda, and the limit for the reserved concurrent runs.
  • LambdaPermission1, LambdaPermission2, and LambdaPermission3 – These are permissions that allow the API Gateway API to invoke the Lambda function.
  • LambdaExecutionRole and lambdaLogGroup – The first resource is the IAM role attached to the Lambda function allowing it to run on other AWS services such as Amazon S3 and Amazon Bedrock. The second resource configures the Lambda function log group in Amazon CloudWatch.

Lambda function explanation

Let’s dive into the details of the Python code that generates and manipulate images using the Stability AI model. There are three ways of using the Lambda function: provide a text prompt to generate an initial image, upload an image and include a text prompt to adjust the image, or reupload a generated image and include a prompt to adjust the image.

The code contains the following constants:

  • negative_prompts – A list of negative prompts used to guide the image generation.
  • style_preset – The style preset to use for image generation (for example, photographic, digital-art, or cinematic). We used digital-art for this post.
  • clip_guidance_preset – The Contrastive Language-Image Pretraining (CLIP) guidance preset to use (for example, FAST_BLUE, FAST_GREEN, NONE, SIMPLE, SLOW, SLOWER, SLOWEST).
  • sampler – The sampling algorithm to use for image generation (for example, DDIM, DDPM, K_DPMPP_SDE, K_DPMPP_2M, K_DPMPP_2S_ANCESTRAL, K_DPM_2, K_DPM_2_ANCESTRAL, K_EULER, K_EULER_ANCESTRAL, K_HEUN, K_LMS).
  • width – The width of the generated image.

handler(event, context) is the main entry point for the Lambda function. It processes the input event, which contains the promptInput and actionInput parameters. Based on the actionInput, it performs one of the following actions:

  • For GenerateInit, it generates a new image using the generate_image_with_bedrock function, uploads it to Amazon S3, and returns the file name and a pre-signed URL.
  • When you upload an existing image, it performs one of the following actions:
    • s3URL – It retrieves an image from a pre-signed S3 URL, generates a new image using the generate_image_with_bedrock function, uploads the new image to Amazon S3, and returns the file name and a pre-signed URL.
    • UseGenerated – It retrieves an image from a pre-signed S3 URL, generates a new image using the generate_image_with_bedrock function, uploads the new image to Amazon S3, and returns the file name and a pre-signed URL.

The function generate_image_with_bedrock(prompt, init_image_b64=None) generates an image using the Amazon Bedrock runtime service, which includes the following actions:

  • If an initial image is provided (base64-encoded), it uses that as the starting point for the image generation.
  • If no initial image is provided, it generates a new image based on the provided prompt.
  • The function sets various parameters for the image generation, such as the text prompts, configuration, and sampling method.
  • It then invokes the Amazon Bedrock model, retrieves the generated image as a base64-encoded string, and returns it.

To obtain a more personalized outputs, the hyperparameter values in the function can be adjusted:

  • text_prompts – This is a list of dictionaries, where each dictionary contains a text prompt and an associated weight. For a positive text prompt, one that you would like to associate to the output image, weight is set as 1.0. For all of the negative text prompts, weight is set as -1.0.
  • cfg_scale – This parameter controls the potential for randomness in the image. The default is 7, and 10 seems to work well from our observations. A higher value means the image will be more influenced by the text, but a value that’s too high or too low will result in visually poor-quality outputs.
  • init_image – This parameter is a base64-encoded string representing an initial image. The model uses this image as a starting point and modifies it based on the text prompts. For generating the first image, this parameter is not used.
  • start_schedule – This parameter controls the strength of the noise added to the initial image at the start of the generation process. A value of 0.6 means that the initial noise will be relatively low.
  • steps – This parameter specifies the number of steps (iterations) the model should take during the image generation process. In this case, it’s set to 50 steps.
  • style_preset – This parameter specifies a predefined style or aesthetic to apply to the generated image. Because we’re generating logo images, we use digital-art.
  • clip_guidance_preset – This parameter specifies a predefined guidance setting for the CLIP model, which is used to guide the image generation process based on the text prompts.
  • sampler – This parameter specifies the sampling algorithm used during the image generation process to repeatedly denoise the image to produce a high-quality output.

Test and evaluate the application

The following screenshot shows a simple UI. You can choose to either generate a new image or edit an image using text prompts.

The following screenshots show iterations of sample logos we created using the UI. The text prompts are included under each image.

Clean up

To clean up, delete the CloudFormation stack and the S3 bucket you created.

Conclusion

In this post, we explored how you can use Stability AI and Amazon Bedrock to generate and edit images. By following the instructions and using the provided CloudFormation template and the frontend code, you can generate unique and personalized images and logos for your business. Try generating and editing your own logos, and let us know what you think in the comments. To explore more AI use cases, refer to AI Use Case Explorer.


About the authors

Pyone Thant Win is a Partner Solutions Architect focused on AI/ML and computer vision. Pyone is passionate about enabling AWS Partners through technical best practices and using the latest technologies to showcase the art of possible.

Nneoma Okoroafor is a Partner Solutions Architect focused on helping partners follow best practices by conducting technical validations. She specializes in assisting AI/ML and generative AI partners, providing guidance to make sure they’re using the latest technologies and techniques to deliver innovative solutions to customers.

Read More

Reinvent personalization with generative AI on Amazon Bedrock using task decomposition for agentic workflows

Reinvent personalization with generative AI on Amazon Bedrock using task decomposition for agentic workflows

Personalization has become a cornerstone of delivering tangible benefits to businesses and their customers. Generative AI and large language models (LLMs) offer new possibilities, although some businesses might hesitate due to concerns about consistency and adherence to company guidelines. This post presents an automated personalization solution that balances the innovative capabilities of LLMs with adherence to human directives and human-curated assets for a consistent and responsible personalization experience for your customers.

Our solution uses Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. For this post, we use Anthropic’s Claude models on Amazon Bedrock.

We present our solution through a fictional consulting company, OneCompany Consulting, using automatically generated personalized website content for accelerating business client onboarding for their consultancy service. The personalized content is built using generative AI by following human guidance and provided sources of truth. We employ task decomposition, using domain / task adopted LLMs for content personalization (UX designer/personalizer), image generation (artist), and building (builder/front end developer) for the final delivery of HTML, CSS, and JavaScript files. The approach broadly mimics a human organization pursuing the same objective. This allows us to create cost-effective, more controlled, more accurate and responsible personalized customer experiences, utilizing existing guidelines and assets originally designed for human-driven processes.

We provide our code base on GitHub for you to follow along, suggest possible enhancements and modifications, and help you innovate with generative AI in personalization. Generative AI on AWS can transform user experiences for customers while maintaining brand consistency and your desired customization.

Use case overview

Our fictional company, OneCompany Consulting, plans to use generative AI to automatically create personalized landing pages as their business clients sign in. Their clients have provided some basic public information during sign-up, such as state of location, industry vertical, company size, and their mission statement. In parallel, OneCompany maintains a market research repository gathered by their researchers, offers industry-specific services outlined in documents, and has compiled approved customer testimonials. UX/UI designers have established best practices and design systems applicable to all of their websites. These resources should be used as single sources of truth. Because we don’t have such expertise, we synthetically generate these assets to demonstrate the process that would otherwise be created by expert humans or other methods in real life.

The following diagram illustrates the process of generating a personalized landing page for business visitors after they sign up.

Processing showing customers signing up and solution creating personalized websites

Fig 1. The process of customers signing up and the solution creating personalized websites using human-curated assets and guidelines.

We employed other LLMs available on Amazon Bedrock to synthetically generate fictitious reference materials to avoid potential biases that could arise from Amazon Claude’s pre-training data. In practical scenarios, these resources would be created by humans and organizations, containing more comprehensive and exhaustive details. Nonetheless, our solution can still be utilized.

  • Client profiles – We have three business clients in the construction, manufacturing, and mining industries, which are mid-to-enterprise companies. The process assumes the information in the company profiles is public and that the companies who signed up opted in to OneCompany Consulting to use for personalization. The following example is for the construction industry:
Profiles = {
'Construction-Example': {
'Name': 'Example Corp Construction',
'Industry': 'Construction',
'CompanySize': 1500,
'CompanyType': 'Enterprise',
'Location': 'New York City, NY',
'Mission': 'Building a sustainable future for New York' }
}
  • Offerings – Offerings are documents that consolidate all offerings provided by OneCompany Consulting. The following is an example of a synthetically generated offering for the construction industry:
OneCompany Consulting Construction Consulting Services Offerings
Introduction
OneCompany Consulting is a premier construction consulting firm dedicated to... <Redacted for printing>
Our core values are:
1. Client-Centric Approach: We put our clients at the heart of everything we do... <Redacted for printing>
Our offerings include:
1. Pre-Construction Services - Feasibility Studies - Site Selection and Evaluation... <Redacted for printing>
5. Construction Technology Solutions - Construction Data Analytics and Reporting... <Redacted for printing>
  • Industry insights – Your LLMs can use industry pain points, news, and other resources to enrich personalized content. Because the industry news and resources are wide, we use a Retrieval Augmented Generation (RAG) framework to retrieve related information. The following is an example for the manufacturing industry:
The Electric Vehicle Manufacturing Industry: Overcoming Challenges... <Redacted for printing>
Introduction
The electric vehicle (EV) industry has experienced unprecedented growth in recent years, driven by... <Redacted for printing>
This article will explore the key challenges facing the EV... <Redacted for printing>
Supply Chain Disruptions
The COVID-19 pandemic has highlighted the vulnerability of global supply chains, and the EV industry has not been... <Redacted for printing>
V. Regulatory Frameworks and Incentives
Regulatory frameworks and government incentives play a critical role in promoting EV... <Redacted for printing>

  • Testimonials – Synthetically generated customer testimonials are displayed for the visitors. In this solution, the LLM is asked to use the sentence without changes because it’s a testimonial. The following is an example:
"AnyCompany Consulting's expert consulting services have been invaluable in streamlining our operations and optimizing our construction processes, resulting in increased efficiency and cost savings"
- John Smith, CTO from Example Corp Solutions.
  • Design guidelines and systems – This part is for the instructions and rules to be followed for building the website. Our examples were manually created only for high-level guidance for simplicity.
    • Guidelines – The following are some examples from the design guidelines:
- Use a color palette that aligns with the customer's industry, ensuring sufficient color contrast for readability.
- Use visible font sizes and responsive units for better visibility and readability across screen sizes... <Redacted for printing> - ... <Redacted for printing>
  • Instructions – The following are some examples from the design instructions:
Header Design:
- Choose an attention-grabbing background color and font that aligns with the client's industry.
- Consider incorporating subtle animations or transitions
Hero Section:
- Choose an attention-grabbing background color and font that aligns with the client's industry.
- ... <Redacted for printing>

Solution overview

To create personalized websites efficiently, we employ task decomposition—breaking down the complex process into simpler, decoupled sub-tasks. This approach allows using smaller, cost-effective language models, creating targeted prompts and contexts for increased accuracy and faithfulness, isolating responses for straightforward troubleshooting, and achieving cost savings.

In our example, we decomposed the overall personalized website creation process into three steps, each handled by specialized agents: the personalizer for tailoring content, the artist for generating images, and the frontend engineer/builder for coding. For the personalizer, we used Claude Sonnet due to the relative complexity of the task compared to code generation handled by Haiku. However, Claude Haiku can also be used for the personalization task, potentially leading to further cost savings. Yet, Haiku may require more prescriptive prompts and examples to achieve similar results. We recommend that customers test both Sonnet and Haiku to determine the optimal balance between performance and cost for their specific use case. In our demonstration, we chose to use Sonnet with a relatively simple prompt to showcase its efficiency, but the flexibility of this approach allows for various LLMs to be integrated into the agentic workflow.

The following diagram illustrates our agentic workflow.

Illustration of agentic workflow in solution.

Fig 2. Workflow diagram of agentic workflow made of specialized (task / domain adopted) LLMs.

The following diagram illustrates our solution architecture.

Solutions architecture

Fig 3. Solutions architecture

The workflow includes the following steps:

  1. The client profile is stored as key-value pairs in JSON format. However, the JSON needs to be converted into natural language to simplify the task for the downstream LLMs, so they don’t have to figure out the JSON schema and associated meaning.
  2. After the profile is converted into text that explains the profile, a RAG framework is launched using Amazon Bedrock Knowledge Bases to retrieve related industry insights (articles, pain points, and so on). Amazon Bedrock Knowledge Bases is a fully managed capability that helps you implement the RAG workflow, from ingestion to retrieval and prompt augmentation, without having to build custom integrations to data sources and manage data flows. All the information retrieved from Amazon Bedrock Knowledge Bases is provided with citations to improve transparency and increase accuracy. In our example, if the client is a manufacturing customer building electronic vehicles (EVs), then the related context will be retrieved from the human-curated or human-created research documents.
  3. Now we’re ready to send our prompt to the personalizer LLM with all the relevant information. In addition to the customer profile and industry insights, we include offerings, design guidance, and testimonials, and ask the personalizer LLM to create a detailed website description and description of visuals.
  4. The response from the personalizer LLM is divided into two paths by a regex method. The first part moves to the frontend developer LLM.
  5. The second part is sent to the image generator or artist LLM.
  6. For the frontend developer LLM, we also use system design-related materials (in our case, design guidelines) so the frontend developer builds the website described by the personalizer LLM while applying the rules in the design guidelines. Here, we also prompted the LLM to use the company logo (which is the unicorn of AWS GameDay) to demonstrate incorporating existing design elements into the design. In addition, our prompt asks the frontend developer LLM to write the JavaScript to make the testimonials display and call to action dynamic.
  7. At the end of the process, we create a consolidated HTML file, which includes CSS and JavaScript, and store it in an Amazon Simple Storage Service (Amazon S3) bucket so that the assets are ready to be deployed.

Prerequisites

For this post, you need the following prerequisites:

After you complete the prerequisites, you can use the following Jupyter notebook , which have all the necessary steps to follow this post.

Test the solution

Let’s start with our example manufacturing client, who is building the next-generation EVs ('Manufacturing-Example'). First, the profile of this client in JSON format is converted into natural language as follows:

customer = build_profile(UserProfile)
print(customer)

Output:
"Your customer is Example Corp Manufacturing. Their industry is manufacturing. They have 2,500 employees." 
"They are an enterprise company, located in San Jose, CA."  
"Their mission statement is “Building the next generation Electric Vehicles."

Based on the customer profile giving the location and industry, the related background information is retrieved using RAG based on kbId = <Knowledge_base_id>:

context_painpoints = Bedrock.doc_retrieve(query,kbId,numberOfResults = 10)
contexts_painpoints = get_contexts(context_painpoints)

At this stage, we have the context and industry pain points. We now need to gather human-curated sources of truth such as testimonials, design guidelines, requirements, and offerings. These references aren’t numerous and require complete information in the system prompt. However, you can apply the RAG method used for industry insights to retrieve one or all of these references.

offerings = open("./references/offerings.txt", "r").read()
testimonials = open("./references/Testimonials.txt", "r").read()
design_steps= open("./references/RecommendedStepsToDesignNew.txt", "r").read()
design_guideline= open("./references/DesignGuidelineNew.txt", "r").read()

We’re ready to input our prompt to the personalizer LLM, which is Anthropic’s Claude 3 Sonnet (or Haiku, if you used it) on Amazon Bedrock. We used the following which incorporates the parameters customer, testimonials, design_steps, and offerings. We prompted the LLM to generate a response in two parts:

  • Detailed Website Description – Describing the website in text format
  • Visual Elements – Describing the visual elements used

These two responses will be used in two different workflows in the following steps. The following is an example response (response_personalized_website) for the manufacturing client:

Section 1: Detailed Website Description

Header:
The header section features a clean and modern design with a navy-blue background color (#003366) and white text. The AnyCompany Consulting logo is prominently displayed on the left side, while the main navigation menu is positioned on the right. The navigation menu includes the following links: "Home," "Services," "About," and "Contact". The header is fixed to the top of the page, ensuring easy access to the navigation throughout the user's experience.

Hero Section:
The hero section showcases a large, high-quality image (hero-example.jpg) that depicts a modern manufacturing facility with various machinery and workers in action. The image is overlaid with a semi-transparent black gradient to improve the readability of the text. The hero section's main heading reads "Transforming Manufacturing Excellence" in a large, bold font (font-size: 48px), and the subheading states "Unlock your full potential with AnyCompany Consulting's tailored solutions" in a slightly smaller font (font-size: 24px). A prominent "Get Started" call-to-action button is positioned below the subheading, encouraging the user to take the next step.

Offerings Section:
The offerings section is divided into three main categories, each with a corresponding icon and brief description:

1. Operational Excellence
- Icon: operations-icon.png
- Description: "Optimize your manufacturing processes and drive continuous improvement with our Lean and Six Sigma expertise."
2. Digital Transformation
- Icon: digital-icon.png
- Description: "Leverage the power of Industry 4.0 technologies to enhance productivity, efficiency, and data-driven decision-making."
- ... <Redacted for printing>

Dynamic Content for JavaScript:
The testimonials section features a dynamic horizontal slider that automatically cycles through the testimonials every 5 seconds. This functionality can be implemented using JavaScript, with the following elements:

1. Testimonial Data:
- An array of testimonial objects, each containing the following properties:
- name: "John Smith"
- title: "CTO, Example Corp Solutions"
- quote: "AnyCompany Consulting's expert consulting services have been invaluable in streamlining our operations and optimizing our construction processes, resulting in increased efficiency and cost savings."

2. Testimonial Slider:
- A container element to hold the testimonials
- A function to display the current testimonial
- A timer to automatically cycle through the testimonials every 5 seconds
- Smooth transition animations between testimonials

3. User Interaction:
- Ability to manually navigate through the testimonials (for example, previous and next buttons)
- Pause the automatic cycling when the user interacts with the slider

Section 2: Visual Elements

1. <VISUAL_LABEL>anycompany_logo.jpg</VISUAL_LABEL>
<VISUAL_DESCRIPTION>This is the AnyCompany Consulting logo, featuring the company name in a bold, modern font. The logo is designed in a clean, minimalist style with a navy blue color scheme.</VISUAL_DESCRIPTION>

2. <VISUAL_LABEL>hero-example.jpg</VISUAL_LABEL>
<VISUAL_DESCRIPTION>This image depicts a modern manufacturing facility with various machinery and workers in action. The scene showcases the dynamic and technologically advanced nature of the manufacturing industry, aligning with the customer's profile and the AnyCompany Consulting's expertise.</VISUAL_DESCRIPTION>

3. <VISUAL_LABEL>operations-icon.png</VISUAL_LABEL>
<VISUAL_DESCRIPTION>This icon represents the "Operational Excellence" offering. It features a gear or cog symbol, symbolizing the optimization and streamlining of manufacturing processes.</VISUAL_DESCRIPTION>

4. ... <Redacted for printing>

We use Stable Diffusion to generate visual assets based on the descriptions provided by the personalizer LLM. We extract the image descriptions enclosed within <VISUAL_LABEL> and <VISUAL_DESCRIPTION> tags using a regex pattern. These descriptions are then sent in batches to our artist LLM to create images.

pattern = r'&lt;VISUAL_LABEL&gt;(.+?)&lt;/VISUAL_LABEL&gt;s*&lt;VISUAL_DESCRIPTION&gt;(.+?)&lt;/VISUAL_DESCRIPTION&gt;'

The images are created and put into the S3 bucket to store your website assets.

Now we’re ready to create the website HTML, CSS, and JavaScript assets. We used the following prompt template, which uses response_personalized_website from the personalizer, actual testimonials, and the UI design guidelines:

prompt =
f"""
You are an experienced frontend web developer specializing in creating accessible, responsive, and visually appealing websites. Your task is to generate the complete HTML, CSS, and JavaScript code that accurately implements the provided 'Website Description' while adhering to the specified guidelines.
 <website description>
Know that this your Design Guideline (Requirements):
{design_guideline}
You use the testimonials as follows:
{testimonials}
Website Description:
{response_personalized_website} </website description> 
Carefully read the 'Website Description' line by line, and then generate the HTML, CSS, and JavaScript code required to build the described website while following the specified design guidelines and requirements.
Provide the HTML, CSS, and JavaScript code directly, starting with the <!DOCTYPE html> declaration, without any preamble or introduction.
"""

After this step, you have all the necessary assets to preview your website. You can put the HTML and created files into a folder and use a web browser to see your created website.

The following screenshots show examples of generated personalized pages for EV manufacturing, mining, and construction clients, respectively. Each image is displayed for 3 seconds in GIF format. To experience the full quality and dynamic features of these pages, we recommend you visit the examples folder, download the folders, and open the corresponding main.html files with your internet browser.

Three examples looping use-cases

Fig 4. Generated example personalized pages for three industry clients.

Highlights from the test

The solution automatically generated personalized web pages, including personalized images within the provided guardrails, prompts, and reference materials. The workflow considered appropriate color contrasts, such as dark backgrounds with white fonts for accessibility. The solution generated representative icons with consistent coloring and themes across the pages. The workflow also created industry-specific, engaging labels, descriptions, offerings, and pain points based on the source of truth references. The offering selection and pain point sections are especially noteworthy because they were tailored to the visitor. For example, the hero page showcased an EV on a production line, whereas a mining company with a “sustainability” motto received green icons and a focus on that topic. The construction company from New York had themes mentioning their specific points. The workflow is capable of creating dynamic assets as prompted, such as testimonials or call-to-action buttons. Additionally, the solution created consistent assets that can scale well and are compatible with multiple devices, as requested.

In this example, we did not fully exhaust the capabilities of personalization. However, we hope these examples can provide a simple starting point for your personalization use cases.

Clean up

To clean up, start by deleting the S3 bucket you created for your knowledge base. Then delete your knowledge base. Because we used Amazon Bedrock on demand, unless you invoke the endpoint, it will not incur any cost. However, we recommend deleting the artifacts in SageMaker Studio or the SageMaker Studio domain if you used SageMaker Studio to follow along with this demo.

Suggested enhancements

You can extend this solution with some further enhancements, such as the following:

  • Use batch processing for cost-effective asset creation based on visitor profiles. You can use batch inference with Amazon Bedrock or batch transform with SageMaker.
  • Cluster similar client profiles to reduce design element variations for frugality and consistency.
  • Provide website templates and chain-of-thought descriptions to follow design patterns more prescriptively.
  • Use Haiku instead of Sonnet for further cost reduction. You may need more prescriptive and multi-shot prompts as you switch to Haiku for the personalization stage.
  • Retrieve existing company images and icons using semantic search instead of generating visuals. For example, you can build semantic image search using Amazon Titan.

Conclusions

In this post, we presented an automated solution to provide a consistent and responsible personalization experience for your customers. This approach uses smaller LLMs for website personalization tailored to businesses and industries. It decomposes the complex task into subtasks handled by task / domain adopted LLMs, adhering to company guidelines and human expertise. Using a fictional business consulting company scenario, we demonstrated the solution by generating personalized marketing content like text, images, HTML, CSS, and JavaScript code. The process employs techniques like RAG, prompt engineering with personas, and human-curated references to maintain output control.

By combining generative AI, curated data, and task decomposition, businesses can cost-effectively create accurate, personalized website experiences aligned with their branding and design systems.

Amazon Bedrock, which you can use to build generative AI applications, is at the center of this solution. To get started with Amazon Bedrock, we recommend following the quick start and familiarizing yourself with building generative AI applications.


About the Authors

BurakBurak Gozluklu is a Principal AI/ML Specialist Solutions Architect and lead GenAI Scientist Architect for Amazon on AWS, based in Boston, MA. He helps strategic customers adopt AWS technologies and specifically Generative AI solutions to achieve their business objectives. Burak has a PhD in Aerospace Engineering from METU, an MS in Systems Engineering, and a post-doc in system dynamics from MIT in Cambridge, MA. He maintains his connection to academia as a research affiliate at MIT. Outside of work, Burak is an enthusiast of yoga.

Chidi Prince John is a Data Scientist at Amazon. He designs, builds and deploys models for large-scale personalization in Amazon Payments. Chidi has a Master’s degree in Quantitative Management from Duke University and a Bachelor’s degree in Economics from the University of Nigeria. Outside of work, he is passionate about soccer and TV shows.

Dieter D’Haenens is a Senior Product Manager for Amazon, responsible for customer growth, delivering personalized experiences and driving the Amazon flywheel. Leveraging his expertise in retail and strategy, he is passionate about solving customer problems through scalable, innovative AI and ML solutions. Dieter holds a Bachelor of Science in Economics from Ghent University, a Master in General Management from Vlerick Business School, and a Master of Science in Business Analytics from Southern Methodist University. In his spare time, he enjoys traveling and sports.

Read More

Accelerate pre-training of Mistral’s Mathstral model with highly resilient clusters on Amazon SageMaker HyperPod

Accelerate pre-training of Mistral’s Mathstral model with highly resilient clusters on Amazon SageMaker HyperPod

In recent years, FM sizes have been increasing. It is important to consider the massive amount of compute often required to train these models. The compute clusters used in these scenarios are composed of more than thousands of AI accelerators such as GPUs or AWS Trainium and AWS Inferentia, custom machine learning (ML) chips designed by Amazon Web Services (AWS) to accelerate deep learning workloads in the cloud.

When using compute clusters of massive size, a single failure can often throw a training job off course and may require multiple hours of discovery and remediation from customers. According to a report from OPT-175B training, about 178,000 GPU hours were wasted due to various training failures, amounting to 16 percent of the total training time. Similarly, a study by Meta AI and Carnegie Melon university found that, in the worst cases, 43 percent of compute time was wasted because of overheads due to hardware failures. This can adversely impact a customer’s ability to keep up with the pace of innovation in generative AI and can also increase the time-to-market for their models.

Amazon SageMaker HyperPod is a service that is purpose-built to accelerate FM training, removing the undifferentiated heavy-lifting involved in managing and optimizing a large training compute cluster. With SageMaker HyperPod, you can train FMs for weeks to months without disruption. To make FM training more resilient to hardware failures, SageMaker HyperPod continually monitors cluster health, repairs and replaces faulty nodes without disrupting training, and uses customer-defined checkpoints to automatically resume training from the last point of failure.

Why SageMaker HyperPod?

SageMaker HyperPod offers several benefits that make it a good choice for FM training:

  • Standby pool of nodes at no additional cost – SageMaker HyperPod provisions and manages a pool of spare nodes on the customer’s behalf. These nodes are on standby and can be automatically used to replace faulty nodes during training. This makes it so that failures don’t interrupt or delay large-scale training jobs, and these spare nodes come at no additional cost to the user. With the SageMaker HyperPod auto-resume functionality, the service can dynamically swap out unhealthy nodes for spare ones to ensure the seamless continuation of the workload.
  • Cluster placement groups for optimized training – Each instance group is launched in a cluster placement group within the same network spine, in order to get the best inter-node latency and maximize bandwidth between nodes. This is ideal for tightly coupled workloads like distributed training where low-latency communication is essential for synchronizing gradient updates and ensuring that model training scales effectively across multiple GPUs.
  • Preconfigured deep learning AMI with essential libraries – The SageMaker HyperPod agent runs a SageMaker HyperPod DLAMI, which is built on top of AWS Deep Learning Base GPU AMI (Ubuntu 20.04).The SageMaker HyperPod DLAMI is bundled with additional packages to support open source tools such as Slurm and dependencies. Also included are SageMaker HyperPod cluster software packages, which support features such as cluster health check and auto-resume.
  • Reusable scaling scripts for rapid experimentation – HyperPod offers a set of scalable and reusable scripts that simplify the process of launching multiple training runs. These scripts streamline infrastructure setup and deployment and can be easily adapted for different training scenarios or to run many jobs in parallel, making large-scale training more manageable. By reducing repetitive tasks and providing reusable automation, these scripts empower users to quickly scale up or down, test different model variations, and iterate faster, improving productivity and reducing operational overhead.
  • Auto-resume functionality – This is one of the most valuable features of SageMaker HyperPod. When a node fails, SageMaker HyperPod automatically replaces it with a healthy node from the spare pool and resumes the job from the last saved checkpoint with minimal disruption to training. This is particularly crucial for long-running training jobs, where even minor interruptions can lead to significant delays.
  • Real-time performance dashboards with few-click setup – SageMaker HyperPod integrates seamlessly with real-time dashboards to monitor node health, GPU utilization, network traffic, and other key metrics. This can be done with just a few clicks, providing full visibility into training jobs and allowing teams to optimize performance in real-time.

In this post, we present to you an in-depth guide to starting a continual pre-training job using PyTorch Fully Sharded Data Parallel (FSDP) for Mistral AI’s Mathstral model with SageMaker HyperPod. We review components of the Slurm orchestrated SageMaker HyperPod cluster setup, primarily focusing on the resiliency and feature set of SageMaker HyperPod, including automatic fault detection and integration with open source tools such as Amazon Managed Service for Prometheus and Amazon Managed Grafana.

Overview of SageMaker HyperPod resiliency

Some of the health check metrics used by SageMaker HyperPod include:

  • Accelerator issues Checks for GPU issues including DCGM policies like XID errors, GPU health through nvidia-smi, and Trainium issues by reading from Neuron sysfs
  • Networking issues Elastic Fabric Adapter (EFA)
  • Health checks – Performed to run processes on accelerators and multiple threads on CPUs to achieve 100 percent utilization. This process determines the health of the CPU or accelerator. Specifically, DCGM Diagnostics Level 2 tests are run for GPUs, and CPU health is determined using the Linux stress tool.

SageMaker HyperPod continuously performs health checks on crucial components, including GPUs, AWS Trainium cores, and EFA networking devices. This proactive approach allows for the HyperPod health check agent to identify various hardware failures or potential performance degradation. When hardware failures are detected, SageMaker HyperPod identifies faulty instances and is also able to use its auto-resume functionality to initiate a replacement process without manual intervention. This feature automatically detects hardware failures, seamlessly replaces faulty instances, and resumes jobs from the last saved checkpoint. In addition, SageMaker HyperPod offers you the ability to manually replace a node in the case that you have a node stuck with an issue but is not being fixed by the SageMaker HyperPod auto-resume functionality. You can manually change the state of the node to fail, and SageMaker HyperPod will replace it with a healthy instance. For a more in-depth dive into resiliency with SageMaker HyperPod, refer to the Resiliency section of this post.

Overview of SageMaker HyperPod observability

To achieve comprehensive observability into your SageMaker HyperPod cluster resources and software components, you can integrate your cluster with Amazon Managed Service for Prometheus and Amazon Managed Grafana. The integration with Amazon Managed Service for Prometheus enables the export of metrics related to your SageMaker HyperPod cluster resources, providing insights into their performance, utilization, and health. The integration with Amazon Managed Grafana enables the visualization of these metrics through various Grafana dashboards that offer intuitive interface for monitoring and analyzing the cluster’s behavior. By using these services, you gain a centralized and unified view of your SageMaker HyperPod cluster, facilitating proactive monitoring, troubleshooting, and optimization of your distributed training workloads. The Observability section of this post goes into more detail on which metrics are exported and what the dashboards look like in Amazon Managaed Grafana.

This post is primarily focused on Amazon Managed Service for Prometheus and Amazon Managed Grafana for observability. To explore more observability integrations with SageMaker HyperPod like Nvidia Nsight, refer to the validation and observability folder of the awsome-distributed-training GitHub repo.

These resiliency and observability features collectively contribute to a more reliable and efficient training environment, minimize downtime, and optimize resource usage. By directly integrating with Amazon Managed Service for Prometheus and Amazon Managed Grafana and abstracting the management of hardware failures and job resumption, SageMaker HyperPod allows data scientists and ML engineers to focus on model development rather than infrastructure management.

Mathstral model from Mistral AI

Mathstral is a model designed for math reasoning and scientific discovery, is based on the original Mistral 7B model, and features a 32k context window. The release of Mathstral aligns with Mistral AI’s broader effort to support academic and scientific research, particularly through their collaboration with Project Numina. As a 7B model, Mathstral sets a new standard on the performance and latency space for math and reasoning generation compared to similar models used for math and reasoning. Mathstral can achieve significantly better results with more inference-time computation.

Overview of PyTorch FSDP

In distributed data parallel (DDP) training, each process or worker owns a replica of the model and processes a batch of data. Then, it uses all-reduce to sum up gradients over different workers. In DDP, the model weights and optimizer states are replicated across all workers. DDP maintains a full copy of the model on each GPU and requires enough memory on each GPU to store the entire model. For training larger FMs, using an approach like FSDP is recommended, since these FMs require more than a single GPU. It is a type of data parallelism that shards model parameters, optimizer states, and gradients across DDP ranks. This approach reduces the memory requirements on individual GPUs and distributes the memory load across GPUs. With FSDP enhanced efficiency, researchers and developers use fewer GPUs, thereby minimizing operational costs and achieving faster model convergence.

When training with FSDP, the GPU memory footprint is smaller than when training with DDP across all workers. This makes the training of some very large models feasible by allowing them to be loaded into memory with a lower memory footprint. However, this comes at the cost of increased communication volume. For more information on FSDP, refer to PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel.

Solution overview

The following image shows the architecture diagram for the resources deployed as part of Sagemaker HyperPod for our use case of training the Mathstral model. In your account, you will have a VPC provisioned with a public and private subnet, and an S3 bucket synced to your FSxL file system via a data repository link. In the service team account, your cluster of P4de instances is provisioned, along with the head node, and the login node, for you to submit the training job to your cluster.

Prerequisites

In the context of this post, we use four p4de.24xlarge instances. You can find more information on the p4de.24xlarge instance type at Amazon EC2 P4 Instances. To get the best inter-node latency, we launch these instances together in a cluster and only run jobs on a single instance group. You can also use a variety of other instance types to follow along with this post.

For more information on getting access to instances in a partition group, refer to the Getting Started section in this post. Note that Mathstral 7B at full precision (FP32) is approximately 26 GB in size so you need to make sure that your cluster configuration has sufficient GPU memory to load the model into memory along with the gradients, activations, and moments. This should account for a total of 107 GB in addition to the training assets required to kick off a job successfully. For demonstration purposes, we use FSDP for this continued pre-training job.

The following sections describe setting up your infrastructure and environment with SageMaker HyperPod. For detailed instructions and code, we recommend that you follow along with the Amazon SageMaker HyperPod workshop. The prerequisites and cluster setup parts of this workshop go over all the required components needed in order to set up your cluster. The workshop also provides resources to troubleshoot commonly faced issues during setup.

Set up your infrastructure

Deploy HyperPod VPC stack

To set up your cluster, you first need to create some resources. The following resources can be created by deploying this SageMaker HyperPod VPC CloudFormation stack. By default usw2-az4 is specified as the Availability Zone. Change this to reflect the Availability Zone where you have your cluster. This VPC stack creates the following resources:

  • Subnet – This is a private subnet in the Availability Zone id that you choose to use
  • Security group – This allows SageMaker HyperPod to mount your Amazon FSx for Lustre file system
  • FSx for Lustre file system – This serves as the shared file system that all the nodes can access. It’s a 1.2 TB PERSISTENT_2 file system in the private subnet you create. It gets mounted at /fsx.
  • Linux environment – This provides a standardized development environment to work in
  • Amazon Simple Storage Service (Amazon S3) bucket – To push and store your lifecycle scripts
  • AWS Identity and Access Management (IAM) role – Role required for creating the SageMaker HyperPod cluster

Deploy the observability stack

In order to use the observability integration with SageMaker HyperPod, you need to deploy the SageMaker HyperPod Observability CloudFormation stack, which can then be used to monitor your cluster metrics in real time.

Set up your environment

Let’s move on to environment setup. In order to deploy this solution, you need to use a Linux-based development environment. This section briefly describes the steps required to set up your cluster. For detailed instructions and code, we recommend that you follow along with the Amazon SageMaker HyperPod workshop.

Set up your cluster

This section guides you through the process of deploying a cluster to train with. You need to set up the following:

  • Head node and compute nodes – The head node is composed of an m5.12xlarge instance, and the worker group consists of p4de.24xlarge instances. Refer to the following table for details on these instance types.
  • Shared volume – The cluster has an FSx for Lustre file system mounted at /fsx on both the head and compute nodes
  • Placement groups enabled – A placement group will launch instances close together inside one physical data center in a single Availability Zone to maximize the bandwidth and reduce the latency between instances
  • Local storage – Each node will have an 8 TB local NVME volume attached for local storage
  • Scheduler – SLURM will be used as a job scheduler
  • Accounting – As part of cluster configuration, a local MariaDB is deployed to keep track of job runtime information
A B C D E F
1 Instance size GPU devices Total GPU memory VCPUs CPU memory EFA bandwidth
2 p4de.24xlarge 8 640 gb 96 1152 gb 400 Gbps

Set up the AWS CLI

Before creating the cluster and its associated resources, you need to set up the AWS Command Line Interface (AWS CLI) using the latest version (or version 2.17.1 at a minimum).

To check the AWS CLI version, use the following command.

aws --version

To update the AWS CLI to the latest version, use the following command.

sudo ./aws/install –update

The AWS CLI plugin for Session Manager, a capability of AWS Systems Manager, must be installed to access your cluster. To use Amazon Linux 2 to install Session Manager, use the following command:

sudo yum install -y https://s3.amazonaws.com/session-manager-downloads/plugin/latest/linux_64bit/session-manager-plugin.rpm

For detailed steps on installing and setting up the AWS CLI, follow the steps provided in the Install AWS CLI section of the Amazon SageMaker HyperPod workshop.

Source environment variables

An important part of the setup is to source in all the environment variables, using the output from the VPC CloudFormation stack deployed in a previous step. Use the following command.

curl 'https://static.us-east-1.prod.workshops.aws/public/e3e1b2f1-8140-43eb-a316-e76f569119dd/static/scripts/create_config.sh' --output create_config.sh
bash create_config.sh
source env_vars

Once you have sourced them in, confirm that they were correctly set using the following command.

cat env_vars

Set up lifecycle scripts

SageMaker HyperPod uses a collection of lifecycle scripts  to bootstrap the cluster. These scripts are responsible for several actions, including setting up Slurm and mounting the FSx for Lustre file system. You need to customize these scripts in order to mount your FSx for Lustre file system. For detailed steps on setting up these lifecycle scripts, refer to the Set Up Lifecycle Scripts section of the workshop.

Make sure to complete the Enable Optional Lifecycle scripts section after step 4 of the Set Up Lifecycle scripts section because this is needed in order to enable installation of exporter services on the cluster. This is required because you need the exporter services on the cluster to emit metrics to Amazon Managed Service for Prometheus.

Additionally, the observability stack requires the following two AWS managed IAM policies to be added to your AmazonSagemakerClusterExecutionRole prior to creating your cluster.

aws iam attach-role-policy --role-name $ROLENAME --policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess

aws iam attach-role-policy --role-name $ROLENAME --policy-arn arn:aws:iam::aws:policy/AWSCloudFormationReadOnlyAccess

Once you have uploaded the lifecycle scripts to Amazon S3, you can then create your cluster.

Create your cluster

To create your cluster, you need your cluster configuration. Because you use p4de.24xlarge for this example, copy the following cluster configuration.

source env_vars
cat > cluster-config.json << EOL
{
    "ClusterName": "ml-cluster",
    "InstanceGroups": [
      {
        "InstanceGroupName": "controller-machine",
        "InstanceType": "ml.m5.12xlarge",
        "InstanceStorageConfigs": [
          {
            "EbsVolumeConfig": {
              "VolumeSizeInGB": 500
            }
          }
        ],
        "InstanceCount": 1,
        "LifeCycleConfig": {
          "SourceS3Uri": "s3://${BUCKET}/src",
          "OnCreate": "on_create.sh"
        },
        "ExecutionRole": "${ROLE}",
        "ThreadsPerCore": 1
      },
      {
        "InstanceGroupName": "worker-group-1",
        "InstanceType": "ml.p4de.24xlarge",
        "InstanceCount": 4,
        "LifeCycleConfig": {
          "SourceS3Uri": "s3://${BUCKET}/src",
          "OnCreate": "on_create.sh"
        },
        "ExecutionRole": "${ROLE}",
        "ThreadsPerCore": 1
      }
    ],
    "VpcConfig": {
      "SecurityGroupIds": ["$SECURITY_GROUP"],
      "Subnets":["$SUBNET_ID"]
    }
}
EOL

If you use a different instance type for your cluster, refer to the Create Cluster section of the workshop to create your cluster-config.json file.

SageMaker HyperPod also gives you the ability to update your clusters to increase the size of an existing worker group or create a new worker group to add additional instance-types to your cluster. For steps on updating the cluster to create additional worker groups that use other instance types, refer to the section in the workshop to create Heterogenous Clusters.

Once you’ve created the cluster-config.json file, follow the Create Cluster steps in the workshop to create the FSX for Lustre configuration (provisioning_parameters.json) file and upload it to Amazon S3. Then, you can validate the configuration using the validate-config.py file in the awsome-distributed-training GitHub repo.

Once this validation is completed, you can create your cluster. Use the following command.

aws sagemaker create-cluster 
    --cli-input-json file://cluster-config.json 
    --region $AWS_REGION

To check the state of your cluster, run the following command.

aws sagemaker list-clusters --output table

You should then be able to observe the cluster creating.

-----------------------------------------------------------------------------------------------------------------------------------------------------
|                                                                          ListClusters                                                             |
+---------------------------------------------------------------------------------------------------------------------------------------------------+
||                                                                       ClusterSummaries                                                          ||
|+----------------------------------------------------------------+----------------------+---------------+-----------------------------------------+|
||                        ClusterArn                              |    ClusterName       | ClusterStatus |               CreationTime              ||
|+----------------------------------------------------------------+----------------------+---------------+-----------------------------------------+|
|| arn:aws:sagemaker:us-west-2:{cluster arn}                      |  ml-cluster          | Creating      | time taken to create                    ||
|+----------------------------------------------------------------+----------------------+---------------+-----------------------------------------+|

Now that you’ve created a cluster, you can monitor the status in the SageMaker console. This will show you cluster status, running instances, and node groups and allow you to modify the cluster. In the SageMaker HyperPod console, find your cluster and select it, as shown in the following screenshot.

Once the Cluster status changes to InService, you can connect using Secure Shell (SSH). Make sure that you completed the step in Set up the AWS CLI to install the SSM plugin. You can then take the easy-ssh.sh file from the repo to simplify the SSM command connect to the controller-machine using SSH.

curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/easy-ssh.sh
chmod +x easy-ssh.sh
./easy-ssh.sh -c controller-machine ml-cluster

Use the following command to switch to the ubuntu user.

sudo su - ubuntu

Refer to the Get to know your Cluster  section in the SageMaker HyperPod workshop to familiarize yourself with the commands you need to use in the later sections.

Finally, set up SSH access to the compute nodes. To do this, add a key-value pair to the /fsx/ubuntu directory. Because all the compute nodes mount this directory, you only have to do this once for ubuntu to access all the compute nodes. For instructions, refer to the SSH Access to compute section of the workshop.

Congrats on setting up your environment! Now that you’ve completed the necessary steps, you can move on to your training job.

Run your pre-training job

Follow these steps on your cluster head node:

  1. Navigate to your shared FSx for Lustre file system. If you followed the tutorial linked previously, it will be location at /fsx.
  2. Use the following command to clone the awsome-distributed-training repo.
cd /fsx
git clone https://github.com/aws-samples/awsome-distributed-training/
cd awsome-distributed-training/3.test_cases/10.FSDP
  1. Run the create_conda_env.sh script.

This script will first download and install Miniconda, then create a Conda environment called pt_fsdp. The Conda environment installs PyTorch on AWS, which is a package that is built to run PyTorch workloads on AWS. Specifically, it lets you use EFA out of the box, since OFI-NCCL is pre-built in the Conda package. PyTorch on AWS also provides the latest versions of CUDA, cuDNN, and NCCL for the best performance on GPU-based instances. Dependencies required to run your FSDP training job will be installed in this Conda environment, and since this Conda environment is created on the /fsx file system, it’ll be shared across all your training nodes.

bash 0.create_conda_env.sh

For this training job, you use the C4 dataset, which is several hundred gigabytes. Instead of downloading the whole thing, the create_streaming_dataloaders function will stream the dataset from HuggingFace, so there’s no data prep required for running this training.

If want to use your own dataset instead, you can format it as a HuggingFace dataset and pass its location to the --dataset_path argument.

Launch training

The script to launch the Mathstral training job can be found in 3.distributed-training-mistral-mathstral.sbatch. Depending on the number of nodes in your cluster, you are can adjust them by modifying #SBATCH --nodes=4. Because you are using four p4de.24xlarge instances, it has been set to 4.

For the purpose of this post, you need to make sure that the FI_EFA variables for EFA are exported in the 3.distributed-training-mistral-mathstral.sbatch file. If you use instances not enabled for remote direct memory access (RDMA), such as the g5.12xlarge, comment out lines 21–22 of this file. These instances have EFA between nodes, but do not have the GPU direct RDMA access of p4d/e and p5 instances. In this walkthrough, we are using p4de instances, so we leave these lines uncommented.

## Plenty of EFA level variables
## Comment out for non-efa instances (G5, G4d, P3)
export FI_EFA_USE_DEVICE_RDMA=1 # use for p4de
export FI_LOG_LEVEL=1
export FI_PROVIDER=efa
export NCCL_DEBUG=INFO

Under User Variables, make sure to adjust GPUS_PER_NODE to match the number of GPUs on your instance type (8 for p4de).

You can also adjust the training parameters in TRAINING_ARGS. Additional parameters can be found in model/arguments.py.

We use the same directory for both --checkpoint_dir and --resume_from_checkpoint. If there are multiple checkpoints, --resume_from_checkpoint will automatically select the most recent one. This way, if the training is interrupted for any reason, it will automatically pick up the most recent checkpoint.

Note: You may change these hyperparameters in the 3.distributed-training-mistral-mathstral.sbatch file. We are using arbitrary hyperparameters here for the sake of demonstration.

declare -a TRAINING_ARGS=(
    --train_batch_size=1 
    --val_batch_size=1 
    --max_steps=5000 
    --seed=42 
    --grad_clip=1.0 
    --weight_decay=0.2 
    --beta1=0.9 
    --beta2=0.95 
    --activation_checkpointing=1 
    --intermediate_size=14336 
    --num_key_value_heads=8 
    --logging_freq=1 
    --max_context_width=32768 
    --vocab_size=32768 
    --hidden_width=4096 
    --num_layers=32 
    --num_heads=32 
    --resid_pdrop=0.1 
    --embd_pdrop=0.1 
    --attn_pdrop=0.1 
    --summary_first_pdrop=0.1 
    --initializer_range=0.02 
    --model_type="mistral" 
    --rotary_pct=0.25 
    --rotary_emb_base=10000 
    --lr=0.0001 
    --lr_decay_style="cosine" 
    --min_lr=1e-5 
    --warmup=0.0032 
    --plateau=0.0 
    --dataset="c4" 
    --tokenizer="mistralai/mathstral-7B-v0.1" 
    --epochs=3 
    --checkpoint_dir="./checkpoints/mathstral-7B" 
    --resume_from_checkpoint="./checkpoints/mathstral-7B" 
    --checkpoint_freq=50 
    --validation_freq=500 
    --dataset_config_name="en" 
    --limit_all_gathers=1 
    --sharding_strategy="full"  # https://pytorch.org/docs/stable/fsdp.html
    --offload_activations=1
)

To launch your training, run the following command.

sbatch 3.distributed-training-mistral-mathstral.sbatch

You’ll find a new file in the FSDP directory of the form slurm-[job-number].out. This will be continuously updated with your training logs. Don’t be worried if you notice a long stream of NCCL logs (we prefer to use NCCL_DEBUG=INFO for verbose logging). After about a minute, you should observe your Mathstral model training, with an output similar to the following.

...
+ TORCHRUN=./pt_fsdp/bin/torchrun
+ export TRAIN_SCRIPT=./train.py
+ TRAIN_SCRIPT=./train.py
+ TRAINING_ARGS=(--train_batch_size=1 --val_batch_size=1 --max_steps=5000 --seed=42 --grad_clip=1.0 --weight_decay=0.2 --beta1=0.9 --beta2=0.95 --activation_checkpointing=1 --intermediate_size=14336 --num_key_value_heads=8 --logging_freq=1 --max_context_width=32768 --vocab_size=32768 --hidden_width=4096 --num_layers=32 --num_heads=32 --resid_pdrop=0.1 --embd_pdrop=0.1 --attn_pdrop=0.1 --summary_first_pdrop=0.1 --initializer_range=0.02 --model_type="mistral" --rotary_pct=0.25 --rotary_emb_base=10000 --lr=0.0001 --lr_decay_style="cosine" --min_lr=1e-5 --warmup=0.0032 --plateau=0.0 --dataset="c4" --tokenizer="mistralai/mathstral-7B-v0.1" --epochs=3 --checkpoint_dir="./checkpoints/mathstral-7B" --resume_from_checkpoint="./checkpoints/mathstral-7B" --checkpoint_freq=50 --validation_freq=500 --dataset_config_name="en" --limit_all_gathers=1 --sharding_strategy="full"  # https://pytorch.org/docs/stable/fsdp.html --offload_activations=1)
+ declare -a TRAINING_ARGS
+ AUTO_RESUME=
+ '[' -d /opt/sagemaker_cluster ']'
+ echo 'Detected Hyperpod cluster.. enabling --auto-resume=1'
Detected Hyperpod cluster.. enabling --auto-resume=1
+ AUTO_RESUME=--auto-resume=1
+ srun --auto-resume=1 -l ./pt_fsdp/bin/torchrun --nproc_per_node=8 --nnodes=4 --rdzv_id=35 --rdzv_backend=c10d --rdzv_endpoint=ip-10-2-39-253 ./train.py --train_batch_size=1 --val_batch_size=1 --max_steps=5000 --seed=42 --grad_clip=1.0 --weight_decay=0.2 --beta1=0.9 --beta2=0.95 --activation_checkpointing=1 --intermediate_size=14336 --num_key_value_heads=8 --logging_freq=1 --max_context_width=32768 --vocab_size=32768 --hidden_width=4096 --num_layers=32 --num_heads=32 --resid_pdrop=0.1 --embd_pdrop=0.1 --attn_pdrop=0.1 --summary_first_pdrop=0.1 --initializer_range=0.02 --model_type=mistral --rotary_pct=0.25 --rotary_emb_base=10000 --lr=0.0001 --lr_decay_style=cosine --min_lr=1e-5 --warmup=0.0032 --plateau=0.0 --dataset=c4 --tokenizer=mistralai/mathstral-7B-v0.1 --epochs=3 --checkpoint_dir=./checkpoints/mathstral-7B --resume_from_checkpoint=./checkpoints/mathstral-7B --checkpoint_freq=50 --validation_freq=500 --dataset_config_name=en --limit_all_gathers=1 --sharding_strategy=full ' #' https://pytorch.org/docs/stable/fsdp.html --offload_activations=1
...
3: 2024-07-19 03:31:38 I [train.py:155] Creating Model
3: 2024-07-19 03:33:08 I [train.py:171] Created model with total parameters: 7248023552 (7.25 B)
3:...
3: 2024-07-19 03:33:23 I [train.py:209] Wrapped model with FSDP
3: 2024-07-19 03:33:23 I [train.py:226] Created optimizer
3: 2024-07-19 03:33:23 I [checkpoint.py:70] No Checkpoints Found
...
3: 2024-07-19 03:33:35 I [train.py:102] Batch 0 Loss: 11.19900, Speed: 5.10 samples/sec, lr: 0.000006
3: 2024-07-19 03:33:38 I [train.py:102] Batch 1 Loss: 11.18291, Speed: 10.96 samples/sec, lr: 0.000013
3: 2024-07-19 03:33:40 I [train.py:102] Batch 2 Loss: 11.09163, Speed: 11.22 samples/sec, lr: 0.000019
3: 2024-07-19 03:33:43 I [train.py:102] Batch 3 Loss: 10.86621, Speed: 11.19 samples/sec, lr: 0.000025
3: 2024-07-19 03:33:46 I [train.py:102] Batch 4 Loss: 10.58236, Speed: 11.12 samples/sec, lr: 0.000031
3: 2024-07-19 03:33:49 I [train.py:102] Batch 5 Loss: 10.08024, Speed: 11.18 samples/sec, lr: 0.000038
3: 2024-07-19 03:33:52 I [train.py:102] Batch 6 Loss: 10.15507, Speed: 11.23 samples/sec, lr: 0.000044
3: 2024-07-19 03:33:55 I [train.py:102] Batch 7 Loss: 9.97296, Speed: 10.42 samples/sec, lr: 0.000050
3: 2024-07-19 03:33:58 I [train.py:102] Batch 8 Loss: 10.13596, Speed: 11.21 samples/sec, lr: 0.000056
3: 2024-07-19 03:34:01 I [train.py:102] Batch 9 Loss: 9.93156, Speed: 11.10 samples/sec, lr: 0.000063

Observability

SageMaker HyperPod can optionally be integrated with Amazon Managed Service for Prometheus and Amazon Managed Grafana to export metrics about your cluster and cluster-nodes to an Amazon Managed Grafana dashboard.

For more details about configuring Amazon Managed Service for Prometheus and Amazon Managed Grafana, refer to the Prometheus Configuration and Amazon Managed Grafana sections in the SageMaker HyperPod workshop.

Slurm Exporter dashboard

The Amazon Managed Grafana Slurm dashboard (ID: 4323) provides visualization options for monitoring Slurm clusters. Prometheus Slurm exporter is installed on the controller node of the cluster. Some of the metrics exported include:

  • Cluster overview – Displays the total number of nodes, jobs, and their states
  • Job metrics – Visualizes job counts and states over time
  • Node metrics – Shows node states, allocation, and available resources
  • Partition metrics – Monitors partition-specific metrics such as CPU, memory, and GPU utilization
  • Job efficiency – Calculates job efficiency based on resources used

The following screenshot of the exporter dashboard shows the continued pre-training job for Mathstral being completed successfully.

Node Exporter dashboard

The Amazon Managed Grafana Node Exporter Full dashboard (ID: 1860) offers visualization options for monitoring system metrics collected by the Prometheus Node Exporter installed on the cluster nodes. Some of the key metrics you can visualize include:

  • System overview – Displays CPU load averages and memory usage
  • Memory metrics – Visualizes memory utilization including total memory, free memory, and swap space
  • Disk usage – Monitors disk space utilization and availability
  • Network traffic – Shows network bytes received and transmitted over time
  • File system metrics – Analyzes file system usage and availability
  • Disk I/O metrics – Visualizes disk read and write activity

DCGM Exporter dashboard

The Amazon Managed Grafana NVIDIA DCGM Exporter dashboard (ID: 12239) offers visualization options for monitoring NVIDIA GPU metrics collected by the DCGM Exporter. Some of the key metrics you can visualize include:

  • GPU overview – Displays GPU utilization, temperatures, power usage, and memory usage
  • Temperature metrics – Visualizes GPU temperatures over time
  • Power usage – Monitors GPU power draw and power usage trends
  • Memory utilization – Analyzes GPU memory usage, including used, free, and total memory
  • Fan speed – Shows GPU fan speeds and variations
  • ECC errors – Tracks GPU memory ECC errors and pending errors

EFA Metrics dashboard

The Amazon Managed Grafana EFA Metrics dashboard (ID: 20579) offers visualization options for monitoring EFA related metrics collected by the EFA Node Exporter. Some of the key visualizations include:

  • EFA error metrics – Visualizes errors such as allocation errors, command errors, and memory map errors
  • EFA network traffic – Monitors received and transmitted bytes, packets, and work requests
  • EFA RDMA performance – Analyzes RDMA read and write operations, including bytes transferred and error rates
  • EFA port lifespan – Displays the lifespan of EFA ports over time
  • EFA keep-alive packets – Tracks the number of keep-alive packets received

FSx Metrics dashboard

The Amazon Managed Grafana FSx for Lustre dashboard (ID: 20906) offers visualization options for monitoring Amazon FSx for Lustre file system related metrics collected by Amazon CloudWatch. Some of the key visualizations include:

  • DataReadBytes – The number of bytes for file system read operations
  • DataWriteBytes – The number of bytes for file system write operations
  • DataReadOperations – The number of read operations
  • DataWriteOperations – The number of write operations
  • MetadataOperations – The number of metadata operations
  • FreeDataStorageCapacity – The amount of available storage capacity

These metrics provide insights into various aspects of your FSx for Lustre file systems.

Resiliency

As mentioned previously, one of the value propositions of SageMaker HyperPod is that it provides a variety of cluster resiliency features such as cluster health checks, auto-resume, and the option to manually replace faulty nodes.

Based on the status of these health checks, SageMaker HyperPod detects whether nodes in the cluster are healthy or not. If a node is deemed unhealthy by any of the health checks, SageMaker HyperPod uses its auto-resume feature to automatically replace the faulty node, without any manual intervention.

Additionally, users have the option to implement checkpointing in their training procedure. Checkpointing, combined with auto-resume, means that once a faulty node is replaced, the training job can resume from the last saved checkpoint. This way, despite a hardware failure, a user’s training job can run with minimal loss in progress.

In this section, we demonstrate the resiliency and auto-resume feature of SageMaker HyperPod by simulating a hardware failure scenario and pointing you towards some logs that indicate the success of a replacement job. We use the same submitted FSDP training job, which has the following two important components enabled:

  1. Checkpointing is enabled and implemented
  2. The --auto-resume=1 flag is set. You can verify this in the SLURM .out

This section in the provided sbatch file sets the --auto-resume=1 flag.

AUTO_RESUME=""
if [ -d "/opt/sagemaker_cluster" ]; then
    echo "Detected Hyperpod cluster.. enabling --auto-resume=1"
    AUTO_RESUME="--auto-resume=1"
fi
srun ${AUTO_RESUME} -l ${TORCHRUN} "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}"

The sbatch file has the checkpointing flags checkpoint_freq, checkpoint_dir, resume_from_checkpoint, which tell the job how often to write checkpoints, where to write the checkpoints to, and what directory to read checkpoints from in case of failure, respectively.

Assuming that you already have your training job submitted, wait until a few checkpoints are written to the ./checkpoints directory (or the directory name you specified for checkpoint_freq. You can check whether any checkpoints were written by running ls -lt checkpoints/. This should return an output that resembles the following.

total 74
-rw-rw-r--  1 ubuntu ubuntu     1 Dec  9 00:21 latest_checkpointed_iteration.txt
drwxrwxr-x 10 ubuntu ubuntu 33280 Dec  9 00:20 iter_0000002
drwxrwxr-x 10 ubuntu ubuntu 33280 Dec  9 00:11 iter_0000001

You may also check the progress of your training job by running tail -f slurm-<job-id>.log, where <job-id> can be derived by running squeue. You should observe an output that resembles the following.

1:  iteration        1/  508626 | consumed samples:          288 | elapsed time per iteration (ms): 440352.6 | learning rate: 0.000E+00 | global batch size:   288 | loss scale: 4294967296.0 | number of skipped iterations:   1 | number of nan iterations:   0 |0: saving checkpoint at iteration       1 to /fsx/checkpoints
0:   successfully saved checkpoint at iteration       1 to /fsx/checkpoints
1: (min, max) time across ranks (ms):
1:     save-checkpoint ................................: (81611.24, 81611.82)

Once you’ve confirmed that your training job is running and that you have checkpoints written, you are ready to simulate a hardware failure.

As part of the output of running squeue, you have received an output that resembles the following.

JOBID PARTITION     NAME   USER ST  TIME  NODES NODELIST(REASON)
32          dev interact ubuntu  R  0:02      4  ip-10-2-9-98,...

This tells you what jobs are running and on what nodes. Locate your training job and choose any of the nodes except the first node on the list of nodes allocated to your job (this is the node that you will be injecting an error into). This is very important because PyTorch uses node 0 (that is, the first node) as the coordination node for your training job.

Once you’ve identified the node to inject the error onto, connect to it using SSH with following command.

ssh <NODE ip>

You can inject an ECC error by running the following command.

dcgmi test --inject --gpuid 0 -f 319 -v 4

This simulates a double-bit error (DBE) in the GPU of your chosen node. Additionally, to kill the training job to simulate a job failure, you can take the process id (PID) of any of the python processes running. The Python processes are the training job processes running your FSDP training job. The -9 flag here is the signal number for the SIGKILL signal, which forces a process to stop, without giving it a chance to clean up, or perform any other actions.

ps -aux | grep python
kill -9 <PID>

Once the ECC error is injected and the Python process has stopped, you can exit out of your compute node. In the meantime, you can get the output of the slurmctld.log file using the following command.

tail -f /var/log/slurm/slurmctld.log

In there, you can observe the following lines, which show a failed job or node.

 [2024-07-19T04:13:03.313] sched: Allocate JobId=35 NodeList-ip-10-2-39-253, ip-10-2-40-102, ip-10-2-76-26, ip-10-2-108-162 #CPUs=192 Partition=dev
 [2024-07-19T04:50:31.682] _slurm_rpc_submit_batch_job: JobId=35 InitPrio=1 usec=727
 [2024-07-19T04:50:31.803] update_node: node ip-10-2-39-253 reason set to: Action: Replace 
 [2024-07-19T04:50:31.803] update_node: node ip-10-2-39-253 state set to FAILING

Pay attention to the line that says update_node: node ip-10-2-39-253 reason set to: Action:Replace, which is the log that says that the node has failed and requires replacement.

If you look at your <slurm-job>.out file, you should observe logs like the following.

[Auto Resume] Info: JobID: 35 StepID: 0 Initiating communication with cluster agent to diagnose health of nodes
[Auto Resume] Info: JobID: 35 StepID: 0 Response from cluster agent: JobID=35, ResumeAction=RETRYSTEP
[Auto Resume] Info: JobID: 35 StepID: 0 Job failed - replacing nodes
[Auto Resume] Info: JobID: 35 StepID: 0 Job failed - Dropping unhealthy nodes
[Auto Resume] Info: JobID: 35 StepID: 0 Succesfully shrink job to retain healthy nodes ...
srun: job 35 queued and waiting for resources

This shows that job 35 (your training job) is paused and a new job (job 35) has initiated the replacement process. You can verify this by running squeue, where you will observe an auto-res. This is the auto-resume job that is initiated by SageMaker HyperPod to replace your faulty node.

JOBID PARTITION  NAME      USER ST    TIME NODES NODELIST(REASON)
35    dev    auto-res    ubuntu PD    0:00     4 (Resources)
...

You can also monitor your SageMaker HyperPod cluster using the AWS console. Under Instances, you should observe one of the nodes in worker-group-1 in Pending state, as shown in the following screenshot. This shows that the node is about to get replaced.

Once your node is replaced, you can observe the slurmctld.log file. Be on the alert for the following line:

update_node: node <YOUR-NODE-IP-ADDRESS> reason set to: AWS:Replaced

You can also verify that your node was successfully replaced using the HyperPod cluster tab in the Amazon SageMaker console.

Once your node is replaced, squeue should no longer display the auto-res job and should only display your original training job. The node is successfully replaced, without any manual intervention.

Because you enabled checkpointing, you can verify that the training job resumes from the latest checkpoint. In your <slurm-job>.out file, find the following lines, which show that a checkpoint was detected in the checkpoint directory (./checkpoints) and that the latest checkpoint was loaded, respectively.

...
Loading checkpoint from checkpoints/mathstral-10steps ...
...
Checkpoint loaded from checkpoints/mathstral-10steps ...
...

If you continue to monitor your <slurm-job>.out file, you should observe that your training job has resumed from the latest checkpoint.

Clean up

  1. To delete your cluster, enter the following command.
aws sagemaker delete-cluster --cluster-name ml-cluster

Once you are done deleting the cluster, make sure it is deleted in the SageMaker HyperPod clusters section under SageMaker.

  1. To use the console to delete your SageMaker HyperPod VPC and Observability CloudFormation stacks, follow the directions at Delete a stack from the CloudFormation console. Alternatively, use the AWS CLI by entering the following command. Replace my-stack with the name of your stacks.
aws cloudformation delete-stack 
    --stack-name my-stack

Conclusion

In this post, we provided a comprehensive guide on using Amazon SageMaker HyperPod for training large-scale models such as Mistral AI’s Mathstral using PyTorch Fully Sharded Data Parallel (FSDP). The process highlighted the efficiency of distributed training on SageMaker HyperPod, showcasing the critical role of resiliency and observability features in maintaining uninterrupted, scalable training environments.

Because of the integration with tools such as Amazon Managed Service for Prometheus and Amazon Managed Grafana for real-time monitoring, along with the robust cluster management capabilities of SageMaker HyperPod, ML practitioners can focus on model development rather than infrastructure management. The detailed steps for setting up the infrastructure, deploying the observability stack, and running a training job demonstrate how SageMaker HyperPod helps tackle the complexities of distributed training.

Moreover, the automatic health checks and the auto-resume feature significantly reduce downtime and minimize the impact of hardware failures so that large-scale training jobs can proceed with minimal interruptions. This level of resilience is crucial for maintaining the pace of innovation in AI research, especially when dealing with massive FMs.

By following the outlined procedures and using the powerful tools provided by AWS, data scientists and engineers can optimize their training workflows, reduce operational overhead, and accelerate the development of state-of-the-art models.

Getting Started

Interested in getting started with SageMaker HyperPod? Reach out to your AWS Account Team or email aws-frameworks-gtm@amazon.com. To begin experimenting with other examples on SageMaker HyperPod, refer to the awsome-distributed-training GitHub repo and the Amazon SageMaker HyperPod workshop.


About the Authors

Niithiyn Vijeaswaran is a Solutions Architect at AWS. His area of focus is generative AI and AWS AI Accelerators. He holds a Bachelor’s degree in Computer Science and Bioinformatics. Niithiyn works closely with the Generative AI GTM team to enable AWS customers on multiple fronts and accelerate their adoption of generative AI. He’s an avid fan of the Dallas Mavericks and enjoys collecting sneakers.

Aman Shanbhag is an Associate Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services, where he helps customers and partners with deploying ML Training and Inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in Computer Science, Mathematics, and Entrepreneurship.

Armando Diaz is a Solutions Architect at AWS. He focuses on generative AI, AI/ML, and data analytics. At AWS, Armando helps customers integrate cutting-edge generative AI capabilities into their systems, fostering innovation and competitive advantage. When he’s not at work, he enjoys spending time with his wife and family, hiking, and traveling the world.

Rohit Talluri is a Generative AI GTM Specialist (Tech BD) at Amazon Web Services (AWS). He is partnering with key GenAI foundation model providers, AWS service teams, strategic customers, founders, universities, venture ecosystems, and Amazon to develop technology strategy that enables the next generation of artificial intelligence, machine learning, and accelerated computing on AWS.

Anoop Saha is a Sr GTM Specialist at Amazon Web Services (AWS) focusing on Gen AI model training and inference. He is partnering with top foundation model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions. Before AWS, Anoop has held several leadership roles at startups and large corporations, primarily focusing on silicon and system architecture of AI infrastructure.

Read More

Building an efficient MLOps platform with OSS tools on Amazon ECS with AWS Fargate

Building an efficient MLOps platform with OSS tools on Amazon ECS with AWS Fargate

This post has been co-written with Artem Sysuev, Danny Portman, Matúš Chládek, and Saurabh Gupta from Zeta Global.

Zeta Global is a leading data-driven, cloud-based marketing technology company that empowers enterprises to acquire, grow and retain customers. The company’s Zeta Marketing Platform (ZMP) is the largest omnichannel marketing platform with identity data at its core. The ZMP analyzes billions of structured and unstructured data points to predict consumer intent by using sophisticated artificial intelligence (AI) to personalize experiences at scale. For more information, see Zeta Global’s home page.

What Zeta has accomplished in AI/ML

In the fast-evolving landscape of digital marketing, Zeta Global stands out with its groundbreaking advancements in artificial intelligence. Zeta’s AI innovations over the past few years span 30 pending and issued patents, primarily related to the application of deep learning and generative AI to marketing technology. Using AI, Zeta Global has revolutionized how brands connect with their audiences, offering solutions that aren’t just innovative, but also incredibly effective. As an early adopter of large language model (LLM) technology, Zeta released Email Subject Line Generation in 2021. This tool enables marketers to craft compelling email subject lines that significantly boost open rates and engagement, tailored perfectly to the audience’s preferences and behaviors.

Further expanding the capabilities of AI in marketing, Zeta Global has developed AI Lookalikes. This technology allows companies to identify and target new customers who closely resemble their best existing customers, thereby optimizing marketing efforts and improving their return on investment (ROI). The backbone of these advancements is ZOE, Zeta’s Optimization Engine. ZOE is a multi-agent LLM application that integrates with multiple data sources to provide a unified view of the customer, simplify analytics queries, and facilitate marketing campaign creation. Together, these AI-driven tools and technologies aren’t just reshaping how brands perform marketing tasks; they’re setting new benchmarks for what’s possible in customer engagement.

In addition to its groundbreaking AI innovations, Zeta Global has harnessed Amazon Elastic Container Service (Amazon ECS) with AWS Fargate to deploy a multitude of smaller models efficiently.

Zeta’s AI innovation is powered by a proprietary machine learning operations (MLOps) system, developed in-house.

Context

In early 2023, Zeta’s machine learning (ML) teams shifted from traditional vertical teams to a more dynamic horizontal structure, introducing the concept of pods comprising diverse skill sets. This paradigm shift aimed to accelerate project delivery by fostering collaboration and synergy among teams with varied expertise. The need for a centralized MLOps platform became apparent as ML and AI applications proliferated across various teams, leading to a maze of maintenance complexities and hindering knowledge transfer and innovation.

To address these challenges, the organization developed an MLOps platform based on four key open-source tools: Airflow, Feast, dbt, and MLflow. Hosted on Amazon ECS with tasks run on Fargate, this platform streamlines the end-to-end ML workflow, from data ingestion to model deployment. This blog post delves into the details of this MLOps platform, exploring how the integration of these tools facilitates a more efficient and scalable approach to managing ML projects.

Architecture overview

Our MLOps architecture is designed to automate and monitor all stages of the ML lifecycle. At its core, it integrates:

  • Airflow for workflow orchestration
  • Feast for feature management
  • dbt for accelerated data transformation
  • MLflow for experiment tracking and model management

These components interact within the Amazon ECS environment, providing a scalable and serverless platform where ML workflows are run in containers using Fargate. This setup not only simplifies infrastructure management, but also ensures that resources are used efficiently, scaling up or down as needed.

The following figure shows the MLOps architecture.

Architectural deep dive

The following details dive deep into each of the components used in this architecture.

Airflow for workflow orchestration

Airflow schedules and manages complex workflows, defining tasks and dependencies in Python code. An example direct acyclic graph (DAG) might automate data ingestion, processing, model training, and deployment tasks, ensuring that each step is run in the correct order and at the right time.

Though it’s worth mentioning that Airflow isn’t used at runtime as is usual for extract, transform, and load (ETL) tasks. Every Airflow task calls Amazon ECS tasks with some overrides. Additionally, we’re using a custom Airflow operator called ECSTaskLogOperator that allows us to process Amazon CloudWatch logs using downstream systems.

model_training = ECSTaskLogOperator(
task_id= <...>,
task_definition= <...>,
cluster= <...>,
launch_type="FARGATE",
aws_conn_id= <...>,
overrides={
"containerOverrides": [
{
"name": " <...> ",
"environment": [
{
"name": "MLFLOW_TRACKING_URI",
"value": "<...>"
},
],
"command": ["mlflow", "run", <...>]
}
],
}

Feast for feature management

Feast acts as a central repository for storing and serving features, ensuring that models in both training and production environments use consistent and up-to-date data. It simplifies feature access for model training and inference, significantly reducing the time and complexity involved in managing data pipelines.

Additionally, Feast promotes feature reuse, so the time spent on data preparation is reduced greatly.

from datetime import timedelta
from feast import Entity, FeatureView, FeatureService, Field, SnowflakeSource
from feast.types import Float64

entities = [
Entity(name="site_id", join_keys=["SITE_ID"]),
Entity(name="user_id", join_keys=["USER_ID"]),
]

def create_feature_view(name, table, field_name, schema_name):
return FeatureView(
name=name,
entities=entities,
ttl=timedelta(days=30),
schema=[Field(name=field_name, dtype=Float64)],
source=SnowflakeSource(
database="<...>", schema="<...>", table=table, 
timestamp_field="<...>"
),
tags="<...>",
)

feature_view_1 = create_feature_view("<...>")
feature_view_2 = create_feature_view("<...>")

my_feature_service = FeatureService(
name="my_feature_servic ",
features=[feature_view_1, feature_view_1],
description=""" 

This is my Feature Service 

""",
owner="<...>",
)

dbt for data transformation

dbt is used for transforming data within the data warehouse, allowing data teams to define complex data models in SQL. It promotes a disciplined approach to data modeling, making it easier to ensure data quality and consistency across the ML pipelines. Moreover, it provides a straightforward way to track data lineage, so we can foresee which datasets will be affected by newly introduced changes. The following figure shows schema definition and model which reference it.

MLflow for experiment tracking and model management

MLflow tracks experiments and manages models. It provides a unified interface for logging parameters, code versions, metrics, and artifacts, making it easier to compare experiments and manage the model lifecycle.

Similarly to Airflow, MLflow is also used just partially. The main parts we use are tracking the server and model registry. From our experience, artifact server has some limitations, such as limits on artifact size (because of sending it using REST API). As a result, we opted to use it only partially.

We don’t extensively use the deployment capabilities of MLflow, because in our current setup, we build custom inference containers.

Hosting on Amazon ECS with Fargate

Amazon ECS offers a highly scalable and secure environment for running containerized applications. Fargate eliminates the need for managing underlying infrastructure, allowing us to focus solely on deploying and running the containers. This abstraction layer simplifies the deployment process, enabling seamless scaling based on workload demands while optimizing resource utilization and cost efficiency.

We found it optimal to run on Fargate components of our ML workflows that don’t require GPUs or distributed processing. These include dbt pipelines, data gathering jobs, training, evaluation, and batch inference jobs for smaller models.

Furthermore, Amazon ECS and Fargate seamlessly integrate with other AWS services, such as Amazon Elastic Container Registry (Amazon ECR) for container image management and AWS Systems Manager Parameter Store for securely storing and managing secrets and configurations. Using Parameter Store, we can centralize configuration settings, such as database connection strings, API keys, and environment variables, eliminating the need for hardcoding sensitive information within container images. This enhances security and simplifies maintenance, because secrets and configuration values can be dynamically retrieved by containers at runtime, ensuring consistency across deployments.

Moreover, integrating Amazon ECS and Fargate with CloudWatch enables comprehensive monitoring and logging capabilities for containerized tasks. This can be achieved by enabling the awslogs log driver within the logConfiguration parameters of the task definitions.

Why ECS with Fargate is the solution of choice

  1. Serverless model:
    • No infrastructure management: With Fargate, we don’t need to provision, configure, or manage servers. This simplifies operations and reduces the operational overhead, allowing teams to focus on developing and deploying applications.
    • Automatic scaling: Fargate automatically scales our applications based on demand, ensuring optimal performance without manual intervention.
  1. Cost efficiency:
    • Pay-as-we-go: Fargate charges are based on the resources (vCPU and memory) that the containers use. This model can be more cost-effective compared to maintaining idle resources.
    • No over provisioning: Because we only pay for what we use, there’s no need to over-provision resources, which can lead to cost savings.
  1. Enhanced security:
    • Isolation: Each Fargate task runs in its own isolated environment, improving security. There’s no sharing of underlying compute resources with other tenants.
  1. Integration with the AWS ecosystem:

Configuring Amazon ECS with Fargate for ML workloads

Configuring Amazon ECS with Fargate for ML workloads involves the following steps.

  1. Docker images: ML models and applications are containerized using Docker. This includes all dependencies, libraries, and configurations needed to run the ML workload.
  2. Creating task definitions:
    • Define resources: Create an Amazon ECS task definition specifying the Docker image, required vCPU, memory, and other configurations.
    • Environment variables: Set environment variables, such as model paths, API keys, and other necessary parameters.
  1. IAM roles: Assign appropriate AWS Identity and Access Management (IAM) roles to the tasks for accessing other AWS resources securely.
  2. Logging using CloudWatch: Use CloudWatch for logging and monitoring the performance and health of ML workloads.

Future development and addressing emerging challenges

As the field of MLOps continues to evolve, it’s essential to anticipate and address upcoming challenges to ensure that the platform remains efficient, scalable, and user-friendly. Two primary areas of future development for our platform include:

  1. Enhancing bring your own model (BYOM) capabilities for external clients
  2. Reducing the learning curve for data scientists

This section outlines those challenges and proposes directions for future enhancements.

Enhancing BYOM capabilities

As machine learning becomes more democratized, there is a growing need for platforms to easily integrate models developed externally by Zeta’s clients.

Future directions:

  • Developing standardized APIs: Implement APIs that allow for easy integration of external models, regardless of the framework or language they were developed in. This would involve creating a set of standardized interfaces for model ingestion, validation, and deployment.
  • Creating a model adapter framework: Design a framework that can adapt external models to be compatible with the platform’s infrastructure, ensuring that they can be managed, tracked, and deployed just like internally developed models.
  • Enhancing documentation and support: Provide comprehensive documentation and support resources to guide external clients through the BYOM process, including best practices for model preparation, integration, and optimization.

Reducing the learning curve for data scientists

The incorporation of multiple specialized tools (Airflow, Feast, dbt, and MLflow) into the MLOps pipeline can present a steep learning curve for data scientists, potentially hindering their productivity and the overall efficiency of the ML development process.

Future directions:

We’ll do the following to help reduce the learning curve:

  • Creating unified interfaces: Develop a unified interface, including UI, API, and SDK, that abstracts away the complexities of interacting with each tool individually. This interface could provide simplified workflows, automating routine tasks and presenting a cohesive view of the entire ML lifecycle.
  • Offering comprehensive training and resources: Invest in training programs and resources tailored to data scientists at different skill levels. This could include interactive tutorials, workshops, and detailed case studies showcasing real-world applications of the platform.

Conclusion

Integrating Airflow, Feast, dbt, and MLflow into an MLOps platform hosted on Amazon ECS with AWS Fargate presents a robust solution for managing the ML lifecycle. This setup not only streamlines operations but also enhances scalability and efficiency, allowing data science teams to focus on innovation rather than infrastructure management.

Additional Resources

For those looking to dive deeper, we recommend exploring the official documentation and tutorials for each tool: Airflow, Feast, dbt, MLflow) and Amazon ECS. These resources are invaluable for understanding the capabilities and configurations of each component in our MLOps platform.


About the authors

Varad Ram holds the position of Senior Solutions Architect at Amazon Web Services. He possesses extensive experience encompassing application development, cloud migration strategies, and information technology team management. Recently, his primary focus has shifted towards assisting clients in navigating the process of productizing generative artificial intelligence use cases.

Artem Sysuev is a Lead Machine Learning Engineer at Zeta, passionate about creating efficient, scalable solutions. He believes that effective processes are key to success, which led him to focus on both machine learning and MLOps. Starting with machine learning, Artem developed skills in building predictive models. Over time, he saw the need for strong operational frameworks to deploy and maintain these models at scale, which drew him to MLOps. At Zeta, he drives innovation by automating workflows and improving collaboration, ensuring smooth integration of machine learning models into production systems.

Saurabh Gupta is a Principal Engineer at Zeta Global. He is passionate about machine learning engineering, distributed systems, and big-data technologies. He has built scalable platforms that empower data scientists and data engineers, focusing on low-latency, resilient systems that streamline workflows and drive innovation. He holds a B.Tech degree in Electronics and Communication Engineering from the Indian Institute of Technology (IIT), Guwahati, and has deep expertise in designing data-driven solutions that support advanced analytics and machine learning initiatives.

Matúš Chládek is a Senior Engineering Manager for ML Ops at Zeta Global. With a career that began in Data Science, Matúš has developed a strong foundation in analytics and machine learning. Over the years, Matúš transitioned into more engineering-focused roles, eventually becoming a Machine Learning Engineer before moving into Engineering Management. Matúš’s leadership focuses on building robust, scalable infrastructure that streamlines workflows and supports rapid iteration and production-ready delivery of machine learning projects. Matúš is passionate about driving innovation at the intersection of Data Science and Engineering, making advanced analytics accessible and scalable for internal users and clients alike.

Dr. Danny Portman is a recognized thought leader in AI and machine learning, with over 30 patents focused on Deep Learning and Generative AI applications in advertising and marketing technology. He holds a Ph.D. in Computational Physics, specializing in high-performance computing models for simulating complex astrophysical systems. With a strong background in quantitative research, Danny brings a wealth of experience in applying data-driven approaches to solve problems across various sectors. As VP of Data Science and Head of AI/ML at Zeta Global, Dr. Portman leads the development of AI-driven products and strategies, and spearheads the company’s cutting-edge Generative AI R&D efforts to deliver innovative solutions for marketers.

Read More

Build RAG-based generative AI applications in AWS using Amazon FSx for NetApp ONTAP with Amazon Bedrock

Build RAG-based generative AI applications in AWS using Amazon FSx for NetApp ONTAP with Amazon Bedrock

The post is co-written with Michael Shaul and Sasha Korman from NetApp.

Generative artificial intelligence (AI) applications are commonly built using a technique called Retrieval Augmented Generation (RAG) that provides foundation models (FMs) access to additional data they didn’t have during training. This data is used to enrich the generative AI prompt to deliver more context-specific and accurate responses without continuously retraining the FM, while also improving transparency and minimizing hallucinations.

In this post, we demonstrate a solution using Amazon FSx for NetApp ONTAP with Amazon Bedrock to provide a RAG experience for your generative AI applications on AWS by bringing company-specific, unstructured user file data to Amazon Bedrock in a straightforward, fast, and secure way.

Our solution uses an FSx for ONTAP file system as the source of unstructured data and continuously populates an Amazon OpenSearch Serverless vector database with the user’s existing files and folders and associated metadata. This enables a RAG scenario with Amazon Bedrock by enriching the generative AI prompt using Amazon Bedrock APIs with your company-specific data retrieved from the OpenSearch Serverless vector database.

When developing generative AI applications such as a Q&A chatbot using RAG, customers are also concerned about keeping their data secure and preventing end-users from querying information from unauthorized data sources. Our solution also uses FSx for ONTAP to allow users to extend their current data security and access mechanisms to augment model responses from Amazon Bedrock. We use FSx for ONTAP as the source of associated metadata, specifically the user’s security access control list (ACL) configurations attached to their files and folders and populate that metadata into OpenSearch Serverless. By combining access control operations with file events that notify the RAG application of new and changed data on the file system, our solution demonstrates how FSx for ONTAP enables Amazon Bedrock to only use embeddings from authorized files for the specific users that connect to our generative AI application.

AWS serverless services make it straightforward to focus on building generative AI applications by providing automatic scaling, built-in high availability, and a pay-for-use billing model. Event-driven compute with AWS Lambda is a good fit for compute-intensive, on-demand tasks such as document embedding and flexible large language model (LLM) orchestration, and Amazon API Gateway provides an API interface that allows for pluggable frontends and event-driven invocation of the LLMs. Our solution also demonstrates how to build a scalable, automated, API-driven serverless application layer on top of Amazon Bedrock and FSx for ONTAP using API Gateway and Lambda.

Solution overview

The solution provisions an FSx for ONTAP Multi-AZ file system with a storage virtual machine (SVM) joined to an AWS Managed Microsoft AD domain. An OpenSearch Serverless vector search collection provides a scalable and high-performance similarity search capability. We use an Amazon Elastic Compute Cloud (Amazon EC2) Windows server as an SMB/CIFS client to the FSx for ONTAP volume and configure data sharing and ACLs for the SMB shares in the volume. We use this data and ACLs to test permissions-based access to the embeddings in a RAG scenario with Amazon Bedrock.

The embeddings container component of our solution is deployed on an EC2 Linux server and mounted as an NFS client on the FSx for ONTAP volume. It periodically migrates existing files and folders along with their security ACL configurations to OpenSearch Serverless. It populates an index in the OpenSearch Serverless vector search collection with company-specific data (and associated metadata and ACLs) from the NFS share on the FSx for ONTAP file system.

The solution implements a RAG Retrieval Lambda function that allows RAG with Amazon Bedrock by enriching the generative AI prompt using Amazon Bedrock APIs with your company-specific data and associated metadata (including ACLs) retrieved from the OpenSearch Serverless index that was populated by the embeddings container component. The RAG Retrieval Lambda function stores conversation history for the user interaction in an Amazon DynamoDB table.

End-users interact with the solution by submitting a natural language prompt either through a chatbot application or directly through the API Gateway interface. The chatbot application container is built using Streamlit and fronted by an AWS Application Load Balancer (ALB). When a user submits a natural language prompt to the chatbot UI using the ALB, the chatbot container interacts with the API Gateway interface that then invokes the RAG Retrieval Lambda function to fetch the response for the user. The user can also directly submit prompt requests to API Gateway and obtain a response. We demonstrate permissions-based access to the RAG documents by explicitly retrieving the SID of a user and then using that SID in the chatbot or API Gateway request, where the RAG Retrieval Lambda function then matches the SID to the Windows ACLs configured for the document. As an additional authentication step in a production environment, you may want to also authenticate the user against an identity provider and then match the user against the permissions configured for the documents.

The following diagram illustrates the end-to-end flow for our solution. We start by configuring data sharing and ACLs with FSx for ONTAP, and then these are periodically scanned by the embeddings container. The embeddings container splits the documents into chunks and uses the Amazon Titan Embeddings model to create vector embeddings from these chunks. It then stores these vector embeddings with associated metadata in our vector database by populating an index in a vector collection in OpenSearch Serverless. The following diagram illustrates the end-to-end flow.

end to end embedding flow for the fsxontap and bedrock integration

The following architecture diagram illustrates the various components of our solution.overall architecture diagram describing all the components of the solution

Prerequisites

Complete the following prerequisite steps:

  1. Make sure you have model access in Amazon Bedrock. In this solution, we use Anthropic Claude v3 Sonnet on Amazon Bedrock.
  2. Install the AWS Command Line Interface (AWS CLI).
  3. Install Docker.
  4. Install Terraform.

Deploy the solution

The solution is available for download on this GitHub repo. Cloning the repository and using the Terraform template will provision all the components with their required configurations.

  1. Clone the repository for this solution:
    sudo yum install -y unzip
    git clone https://github.com/aws-samples/genai-bedrock-fsxontap.git
    cd genai-bedrock-fsxontap/terraform

  2. From the terraform folder, deploy the entire solution using Terraform:
    terraform init
    terraform apply -auto-approve

This process can take 15–20 minutes to complete. When finished, the output of the terraform commands should look like the following:

api-invoke-url = "https://9ng1jjn8qi.execute-api.<region>.amazonaws.com/prod"
fsx-management-ip = toset([
"198.19.255.230",])
fsx-secret-id = "arn:aws:secretsmanager:<region>:<account-id>:secret:AmazonBedrock-FSx-NetAPP-ONTAP-a2fZEdIt-0fBcS9"
fsx-svm-smb-dns-name = "BRSVM.BEDROCK-01.COM"
lb-dns-name = "chat-load-balancer-2040177936.<region>.elb.amazonaws.com"

Load data and set permissions

To test the solution, we will use the EC2 Windows server (ad_host) mounted as an SMB/CIFS client to the FSx for ONTAP volume to share sample data and set user permissions that will then be used to populate the OpenSearch Serverless index by the solution’s embedding container component. Perform the following steps to mount your FSx for ONTAP SVM data volume as a network drive, upload data to this shared network drive, and set permissions based on Windows ACLs:

  1. Obtain the ad_host instance DNS from the output of your Terraform template.
  2. Navigate to AWS Systems Manager Fleet Manager on your AWS console, locate the ad_host instance and follow instructions here to login with Remote Desktop. Use the domain admin user bedrock-01Admin and obtain the password from AWS Secrets Manager. You can find the password using the Secrets Manager fsx-secret-id secret id from the output of your Terraform template.
  3. To mount an FSx for ONTAP data volume as a network drive, under This PC, choose (right-click) Network and then choose Map Network drive.
  4. Choose the drive letter and use the FSx for ONTAP share path for the mount
    (\<svm>.<domain >c$<volume-name>):
    map network drive
  5. Upload the Amazon Bedrock User Guide to the shared network drive and set permissions to the admin user only (make sure that you disable inheritance under Advanced):upload the amazon bedrock user guide
  6. Upload the Amazon FSx for ONTAP User Guide to the shared drive and make sure permissions are set to Everyone:upload the amazon fsx ontap media guide
  7. On the ad_host server, open the command prompt and enter the following command to obtain the SID for the admin user:
    wmic useraccount where name='Admin' get sid

Test permissions using the chatbot

To test permissions using the chatbot, obtain the lb-dns-name URL from the output of your Terraform template and access it through your web browser:

test with chatbot and enter prompt

For the prompt query, ask any general question on the FSx for ONTAP user guide that is available for access to everyone. In our scenario, we asked “How can I create an FSx for ONTAP file system,” and the model replied back with detailed steps and source attribution in the chat window to create an FSx for ONTAP file system using the AWS Management Console, AWS CLI, or FSx API:

test with chatbot and enter prompt related to the bedrock guide

Now, let’s ask a question about the Amazon Bedrock user guide that is available for admin access only. In our scenario, we asked “How do I use foundation models with Amazon Bedrock,” and the model replied with the response that it doesn’t have enough information to provide a detailed answer to the question.:

Use the admin SID on the user (SID) filter search in the chat UI and ask the same question in the prompt. This time, the model should reply with steps detailing how to use FMs with Amazon Bedrock and provide the source attribution used by the model for the response:

Test permissions using API Gateway

You can also query the model directly using API Gateway. Obtain the api-invoke-url parameter from the output of your Terraform template.

curl -v '<api-invoke-url>/bedrock_rag_retreival' -X POST -H 'content-type: application/json' -d '{"session_id": "1","prompt": "What is an FSxN ONTAP filesystem?", "bedrock_model_id": "anthropic.claude-3-sonnet-20240229-v1:0", "model_kwargs": {"temperature": 1.0, "top_p": 1.0, "top_k": 500}, "metadata": "NA", "memory_window": 10}'

Then invoke the API gateway with Everyone access for a query related to the FSx for ONTAP user guide by setting the value of the metadata parameter to NA to indicate Everyone access:

curl -v '<api-invoke-url>/bedrock_rag_retreival' -X POST -H 'content-type: application/json' -d '{"session_id": "1","prompt": "what is bedrock?", "bedrock_model_id": "anthropic.claude-3-sonnet-20240229-v1:0", "model_kwargs": {"temperature": 1.0, "top_p": 1.0, "top_k": 500}, "metadata": "S-1-5-21-4037439088-1296877785-2872080499-1112", "memory_window": 10}'

Cleanup

To avoid recurring charges, clean up your account after trying the solution. From the terraform folder, delete the Terraform template for the solution:

terraform apply --destroy

Conclusion

In this post, we demonstrated a solution that uses FSx for ONTAP with Amazon Bedrock and uses FSx for ONTAP support for file ownership and ACLs to provide permissions-based access in a RAG scenario for generative AI applications. Our solution enables you to build generative AI applications with Amazon Bedrock where you can enrich the generative AI prompt in Amazon Bedrock with your company-specific, unstructured user file data from an FSx for ONTAP file system. This solution enables you to deliver more relevant, context-specific, and accurate responses while also making sure only authorized users have access to that data. Finally, the solution demonstrates the use of AWS serverless services with FSx for ONTAP and Amazon Bedrock that enable automatic scaling, event-driven compute, and API interfaces for your generative AI applications on AWS.

For more information about how to get started building with Amazon Bedrock and FSx for ONTAP, refer to the following resources:


About the authors

Kanishk Mahajan is Principal, Solutions Architecture at AWS. He leads cloud transformation and solution architecture for ISV customers and partner at AWS. Kanishk specializes in containers, cloud operations, migrations and modernizations, AI/ML, resilience and security and compliance. He is a Technical Field Community (TFC) member in each of those domains at AWS.

Michael Shaul is a Principal Architect at NetApp’s office of the CTO. He has over 20 years of experience building data management systems, applications, and infrastructure solutions. He has a unique in-depth perspective on cloud technologies, builder, and AI solutions.

Sasha Korman is a tech visionary leader of dynamic development and QA teams across Israel and India. With 14-years at NetApp that began as a programmer, his hands-on experience and leadership have been pivotal in steering complex projects to success, with a focus on innovation, scalability, and reliability.

Read More

Support for AWS DeepComposer ending soon

Support for AWS DeepComposer ending soon

AWS DeepComposer was first introduced during AWS re:Invent 2019 as a fun way for developers to compose music by using generative AI. AWS DeepComposer was the world’s first machine learning (ML)-enabled keyboard for developers to get hands-on—literally—with a musical keyboard and the latest ML techniques to compose their own music.

After careful consideration, we have made the decision to end support for AWS DeepComposer, effective September 17, 2025. With your help and feedback, our portfolio of products and services has grown to include new tools for developers to get hands-on with AI and ML. Amazon PartyRock, for example, is a generative AI playground for intuitive, code-free help in building web applications.

If you have data stored on the AWS DeepComposer console, you will be able to use AWS DeepComposer as normal until September 17, 2025, when support for the service will end. After this date, you will no longer be able to use AWS DeepComposer through the AWS Management Console, manage AWS DeepComposer devices, or access any compositions or models you have created. Until then, you can continue to work on your compositions or models and export those you would like to keep by using the step-by-step guide in the AWS DeepComposer FAQs.

If you have additional questions, please read our FAQs or contact us.


About the author

Kanchan Jagannathan is a Sr. Program Manager in the AWS AI Devices team where he helps launches AWS devices into sales channel and also oversees the Service Availability Change process for the team. He was a Program Manager for FC automation deployment and launches before joining AWS. Outside of work, he has begun to bravely endeavour camping with his 5-yr old and 1-yr old kids and enjoying the moments he gets to be with them.

Read More

Preserve access and explore alternatives for Amazon Lookout for Equipment

Preserve access and explore alternatives for Amazon Lookout for Equipment

Amazon Lookout for Equipment, the AWS machine learning (ML) service designed for industrial equipment predictive maintenance, will no longer be open to new customers effective October 17, 2024. Existing customers will be able to use the service (both using the AWS Management Console and API) as normal and AWS will continue to invest in security, availability, and performance improvements for Lookout for Equipment, but we do not plan to introduce new features for this service.

This post discusses how you can maintain access to Lookout for Equipment after it is closed to new customers and some alternatives to Lookout for Equipment.

Maintaining access to Lookout for Equipment

You’re considered an existing customer if you use the service, either through cloud training or cloud inferencing, any time in the 30 days prior to October 17, 2024 (September 17, 2024, through October 16, 2024). To maintain access to the service after October 17, 2024, you should complete one of the following tasks from the account for which you intend to maintain access:

  • On the Lookout for Equipment console, start a new project and successfully complete a model training
  • On the Lookout for Equipment console, open an existing project, schedule an inference for a given model, and run at least one inference
  • Use Lookout for Equipment API calls CreateInferenceScheduler and StartInferenceScheduler (and StopInferenceScheduler when done)

For any questions or support needed, contact your assigned AWS Account Manager or Solutions Architect, or create a case from the AWS console.

Alternatives to Lookout for Equipment

If you’re interested in an alternative to Lookout for Equipment, AWS has options for both buyers and builders.

For an out-of-the-box solution, the AWS Partner Network offers solutions from multiple partners. You can browse solutions on the Asset Maintenance and Reliability page in the AWS Solutions Library. This approach provides a solution that addresses your use case without requiring you to have expertise in predictive maintenance, and typically provides the fastest time to value by using the specialized expertise of the AWS Partners.

If you prefer to build your own solution, AWS offers AI/ML tools and services to help you develop an AI-based predictive maintenance solution. Amazon SageMaker provides a set of tools to enable you to build, train, infer, and deploy ML models for your use case with fully managed infrastructure, tools, and workflows.

Summary

Although new customers will no longer have access to Lookout for Equipment after October 17, 2024, AWS offers a powerful set of AI/ML services and solutions in the form of SageMaker tools to build customer models, and also offers a range of solutions from partners through the AWS Partner Network. You should explore these options to determine what works best for your specific needs.

For more details, refer to the following resources:


About the author

Stuart Gillen is a Sr. Product Manager, Lookout for Equipment, at AWS. Stuart has held a variety of roles in engineering management, business development, product management, and consulting. Most of his career has been focused on industrial applications specifically in reliability practices, maintenance systems, and manufacturing.
Stuart is the Product Manager for Lookout for Equipment at AWS where he utilizes his industrial and AI background in applications focusing on Predictive Maintenance and Condition Monitoring.

Read More

CRISPR-Cas9 guide RNA efficiency prediction with efficiently tuned models in Amazon SageMaker

CRISPR-Cas9 guide RNA efficiency prediction with efficiently tuned models in Amazon SageMaker

The clustered regularly interspaced short palindromic repeat (CRISPR) technology holds the promise to revolutionize gene editing technologies, which is transformative to the way we understand and treat diseases. This technique is based in a natural mechanism found in bacteria that allows a protein coupled to a single guide RNA (gRNA) strand to locate and make cuts in specific sites in the targeted genome. Being able to computationally predict the efficiency and specificity of gRNA is central to the success of gene editing.

Transcribed from DNA sequences, RNA is an important type of biological sequence of ribonucleotides (A, U, G, C), which folds into 3D structure. Benefiting from recent advance in large language models (LLMs), a variety of computational biology tasks can be solved by fine-tuning biological LLMs pre-trained on billions of known biological sequences. The downstream tasks on RNAs are relatively understudied.

In this post, we adopt a pre-trained genomic LLMs for gRNA efficiency prediction. The idea is to treat a computer designed gRNA as a sentence, and fine-tune the LLM to perform sentence-level regression tasks analogous to sentiment analysis. We used Parameter-Efficient Fine-Tuning methods to reduce the number of parameters and GPU usage for this task.

Solution overview

Large language models (LLMs) have gained a lot of interest for their ability to encode syntax and semantics of natural languages. The neural architecture behind LLMs are transformers, which are comprised of attention-based encoder-decoder blocks that generate an internal representation of the data they are trained from (encoder) and are able to generate sequences in the same latent space that resemble the original data (decoder). Due to their success in natural language, recent works have explored the use of LLMs for molecular biology information, which is sequential in nature.

DNABERT is a pre-trained transformer model with non-overlapping human DNA sequence data. The backbone is a BERT architecture made up of 12 encoding layers. The authors of this model report that DNABERT is able to capture a good feature representation of the human genome that enables state-of-the-art performance on downstream tasks like promoter prediction and splice/binding site identification. We decided to use this model as the foundation for our experiments.

Despite the success and popular adoption of LLMs, fine-tuning these models can be difficult because of the number of parameters and computation necessary for it. For this reason, Parameter-Efficient Fine-Tuning (PEFT) methods have been developed. In this post, we use one of these methods, called LoRA (Low-Rank Adaptation). We introduce the method in the following sections.

The following diagram is a representation of the Cas9 DNA target mechanism. The gRNA is the component that helps target the cleavage site.

The goal of this solution is to fine-tune a base DNABERT model to predict activity efficiency from different gRNA candidates. As such, our solution first takes gRNA data and processes it, as described later in this post. Then we use an Amazon SageMaker notebook and the Hugging Face PEFT library to fine-tune the DNABERT model with the processed RNA data. The label we want to predict is the efficiency score as it was calculated in experimental conditions testing with the actual RNA sequences in cell cultures. Those scores describe a balance between being able to edit the genome and not damage DNA that wasn’t targeted.

The following diagram illustrates the workflow of the proposed solution.

Prerequisites

For this solution, you need access to the following:

  • A SageMaker notebook instance (we trained the model on an ml.g4dn.8xlarge instance with a single NVIDIA T4 GPU)
  • transformers-4.34.1
  • peft-0.5.0
  • DNABERT 6

Dataset

For this post, we use the gRNA data released by researchers in a paper about gRNA prediction using deep learning. This dataset contains efficiency scores calculated for different gRNAs. In this section, we describe the process we followed to create the training and evaluation datasets for this task.

To train the model, you need a 30-mer gRNA sequence and efficiency score. A k-mer is a contiguous sequence of k nucleotide bases extracted from a longer DNA or RNA sequence. For example, if you have the DNA sequence “ATCGATCG” and you choose k = 3, then the k-mers within this sequence would be “ATC,” “TCG,” “CGA,” “GAT,” and “ATC.”

Efficiency score

Start with excel file 41467_2021_23576_MOESM4_ESM.xlsx from the CRISPRon paper in the Supplementary Data 1 section. In this file, the authors released the gRNA (20-mer) sequences and corresponding total_indel_eff scores. We specifically used the data from the sheet named spCas9_eff_D10+dox. We use the total_indel_eff column as the efficiency score.

Training and validation data

Given the 20-mers and the crispron scores (same as the total_indel_eff scores) from earlier, complete the following steps to put together the training and validation data:

  1. Convert the sequences in the sheet “TRAP12K microarray oligos” into an .fa (fasta) file.
  2. Run the script get_30mers_from_fa.py (from the CRISPRon GitHub repository) to obtain all possible 23-mers and 30-mers from the sequences obtained from Step 1.
  3. Use the CRISPRspec_CRISPRoff_pipeline.py script (from the CRISPRon GitHub repository) to obtain the binding energy for the 23-mers obtained from Step 2. For more details on how to run this script, check out the code released by the authors of the CRISPRon paper(check the script CRISPRon.sh).
  4. At this point, we have 23-mers along with the corresponding binding energy scores, and 20-mers along with the corresponding CRISPRon scores. Additionally, we have the 30-mers from Step 2.
  5. Use the script prepare_train_dev_data.py (from our released code) to create training and validation splits. Running this script will create two files: train.csv and dev.csv.

The data looks something like the following:

id,rna,crisproff_score,crispron_score
seq2875_p_129,GTCCAGCCACCGAGACCCTGTGTATGGCAC,24.74484099890205,85.96491228
seq2972_p_129,AAAGGCGAAGCAGTATGTTCTAAAAGGAGG,17.216228493196073,94.81132075
. . .
. . .

Model architecture for gRNA encoding

To encode the gRNA sequence, we used the DNABERT encoder. DNABERT was pre-trained on human genomic data, so it’s a good model to encode gRNA sequences. DNABERT tokenizes the nucleotide sequence into overlapping k-mers, and each k-mer serves as a word in the DNABERT model’s vocabulary. The gRNA sequence is broken into a sequence of k-mers, and then each k-mer is replaced by an embedding for the k-mer at the input layer. Otherwise, the architecture of DNABERT is similar to that of BERT. After we encode the gRNA, we use the representation of the [CLS] token as the final encoding of the gRNA sequence. To predict the efficiency score, we use an additional regression layer. The MSE loss will be the training objective. The following is a code snippet of the DNABertForSequenceClassification model:

class DNABertForSequenceClassification(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.config = config
        
        self.bert = BertModel(config)
        classifier_dropout = (
            config.classifier_dropout
            if config.classifier_dropout is not None
            else config.hidden_dropout_prob
        )
        self.dropout = nn.Dropout(classifier_dropout)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        labels: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple[torch.Tensor], SequenceClassifierOutput]:
        r"""
        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
        """
        return_dict = (
            return_dict if return_dict is not None else self.config.use_return_dict
        )

        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        print('bert outputs', outputs)
        pooled_output = outputs[1]
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)

        loss = None
        if labels is not None:
            if self.config.problem_type is None:
                if self.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.num_labels > 1 and (
                    labels.dtype == torch.long or labels.dtype == torch.int
                ):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"

            if self.config.problem_type == "regression":
                loss_fct = MSELoss()
                if self.num_labels == 1:
                    loss = loss_fct(logits.squeeze(), labels.squeeze())
                else:
                    loss = loss_fct(logits, labels)
            elif self.config.problem_type == "single_label_classification":
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss_fct = BCEWithLogitsLoss()
                loss = loss_fct(logits, labels)
        if not return_dict:
            output = (logits,) + outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return SequenceClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

Fine-tuning and prompting genomic LLMs

Fine-tuning all the parameters of a model is expensive because the pre-trained model becomes much larger. LoRA is an innovative technique developed to address the challenge of fine-tuning extremely large language models. LoRA offers a solution by suggesting that the pre-trained model’s weights remain fixed while introducing trainable layers (referred to as rank-decomposition matrices) within each transformer block. This approach significantly reduces the number of parameters that need to be trained and lowers the GPU memory requirements, because most model weights don’t require gradient computations.

Therefore, we adopted LoRA as a PEFT method on the DNABERT model. LoRA is implemented in the Hugging Face PEFT library. When using PEFT to train a model with LoRA, the hyperparameters of the low rank adaptation process and the way to wrap base transformers models can be defined as follows:

from peft import LoraConfig

tokenizer = AutoTokenizer.from_pretrained(
        data_training_args.model_path,
        do_lower_case=False
    )
# DNABertForSequenceClassification is a model class for sequence classification task, which is built on top of the DNABert architecture.    
model = DNABertForSequenceClassification.from_pretrained(
        data_training_args.model_path,
        config=config
    )
    
# Define LoRA Config
LORA_R = 16
LORA_ALPHA = 16
LORA_DROPOUT = 0.05
peft_config = LoraConfig(
                     r=LORA_R, # the dimension of the low-rank matrices
                     lora_alpha=LORA_ALPHA, #scaling factor for the weight matrices
                     lora_dropout=LORA_DROPOUT, #dropout probability of the LoRA layers
                     bias="none",
                     task_type = 'SEQ_CLS'
    )
model = get_peft_model(model, peft_config)

Hold-out evaluation performances

We use RMSE, MSE, and MAE as evaluation metrics, and we tested with rank 8 and 16. Furthermore, we implemented a simple fine-tuning method, which is simply adding several dense layers after the DNABERT embeddings. The following table summarizes the results.

Method RMSE MSE MAE
LoRA (rank = 8) 11.933 142.397 7.014
LoRA (rank = 16) 13.039 170.01 7.157
One dense layer 15.435 238.265 9.351
Three dense layer 15.435 238.241 9.505
CRISPRon 11.788 138.971 7.134

When rank=8, we have 296,450 trainable parameters, which is about 33% trainable of the whole. The performance metrics are “rmse”: 11.933, “mse”: 142.397, “mae”: 7.014.

When rank=16, we have 591,362 trainable parameters, which is about 66% trainable of the whole. The performance metrics are “rmse”: 13.039, “mse”: 170.010, “mae”: 7.157. There might have some overfitting issue here under this setting.

We also compare what happens when adding a few dense layers:

  • After adding one dense layer, we have “rmse”: 15.435, “mse”: 238.265, “mae”: 9.351
  • After adding three dense layers, we have “rmse”: 15.435, “mse”: 238.241, “mae”: 9.505

Lastly, we compare with the existing CRISPRon method. CRISPRon is a CNN based deep learning model. The performance metrics are “rmse”: 11.788, “mse”: 138.971, “mae”: 7.134.

As expected, LoRA is doing much better than simply adding a few dense layers. Although the performance of LoRA is a bit worse than CRISPRon, with thorough hyperparameter search, it is likely to outperform CRISPRon.

When using SageMaker notebooks, you have the flexibility to save the work and data produced during the training, turn off the instance, and turn it back on when you’re ready to continue the work, without losing any artifacts. Turning off the instance will keep you from incurring costs on compute you’re not using. We highly recommend only turning it on when you’re actively using it.

Conclusion

In this post, we showed how to use PEFT methods for fine-tuning DNA language models using SageMaker. We focused on predicting efficiency of CRISPR-Cas9 RNA sequences for their impact in current gene-editing technologies. We also provided code that can help you jumpstart your biology applications in AWS.

To learn more about the healthcare and life science space, refer to Run AlphaFold v2.0 on Amazon EC2 or fine-tuning Fine-tune and deploy the ProtBERT model for protein classification using Amazon SageMaker.


About the Authors

Siddharth Varia is an applied scientist in AWS Bedrock. He is broadly interested in natural language processing and has contributed to AWS products such as Amazon Comprehend. Outside of work, he enjoys exploring new places and reading. He got interested in this project after reading the book The Code Breaker.

Yudi Zhang is an Applied Scientist at AWS marketing. Her research interests are in the area of graph neural networks, natural language processing, and statistics.

Erika Pelaez Coyotl is a Sr Applied Scientist in Amazon Bedrock, where she’s currently helping develop the Amazon Titan large language model. Her background is in biomedical science, and she has helped several customers develop ML models in this vertical.

Zichen Wang is a Sr Applied Scientist in AWS AI Research & Education. He is interested in researching graph neural networks and applying AI to accelerate scientific discovery, specifically on molecules and simulations.

Rishita Anubhai is a Principal Applied Scientist in Amazon Bedrock. She has deep expertise in natural language processing and has contributed to AWS projects like Amazon Comprehend, Machine Learning Solutions Lab, and development of Amazon Titan models. She’s keenly interested in using machine learning research, specifically deep learning, to create tangible impact.

Read More

Improve RAG performance using Cohere Rerank

Improve RAG performance using Cohere Rerank

This post is co-written with Pradeep Prabhakaran from Cohere.

Retrieval Augmented Generation (RAG) is a powerful technique that can help enterprises develop generative artificial intelligence (AI) apps that integrate real-time data and enable rich, interactive conversations using proprietary data.

RAG allows these AI applications to tap into external, reliable sources of domain-specific knowledge, enriching the context for the language model as it answers user queries. However, the reliability and accuracy of the responses hinges on finding the right source materials. Therefore, honing the search process in RAG is crucial to boosting the trustworthiness of the generated responses.

RAG systems are important tools for building search and retrieval systems, but they often fall short of expectations due to suboptimal retrieval steps. This can be enhanced using a rerank step to improve search quality.

RAG is an approach that combines information retrieval techniques with natural language processing (NLP) to enhance the performance of text generation or language modeling tasks. This method involves retrieving relevant information from a large corpus of text data and using it to augment the generation process. The key idea is to incorporate external knowledge or context into the model to improve the accuracy, diversity, and relevance of the generated responses.

Workflow of RAG Orchestration

The RAG orchestration generally consists of two steps:

  1. Retrieval – RAG fetches relevant documents from an external data source using the generated search queries. When presented with the search queries, the RAG-based application searches the data source for relevant documents or passages.
  2. Grounded generation – Using the retrieved documents or passages, the generation model creates educated answers with inline citations using the fetched documents.

The following diagram shows the RAG workflow.

Document retrieval in RAG orchestration

One technique for retrieving documents in a RAG orchestration is dense retrieval, which is an approach to information retrieval that aims to understand the semantic meaning and intent behind user queries. Dense retrieval finds the closest documents to a user query in the embedding, as shown in the following screenshot.

The goal of dense retrieval is to map both the user queries and documents (or passages) into a dense vector space. In this space, the similarity between the query and document vectors can be computed using standard distance metrics like cosine similarity or euclidean distance. The documents that match closest to the semantic meaning of the user query based on the calculated distance metrics are then presented back to the user.

The quality of the final responses to search queries is significantly influenced by the relevance of the retrieved documents. While dense retrieval models are very efficient and can scale to large datasets, they struggle with more complex data and questions due to the simplicity of the method. Document vectors contain the meaning of text in a compressed representation—typically 786-1536 dimension vectors. This often results in loss of information because information is compressed into a single vector. When documents are retrieved during a vector search the most relevant information is not always presented at the top of the retrieval.

Boost search accuracy with Cohere Rerank

To address the challenges with accuracy, search engineers have used two-stage retrieval as a means of increasing search quality. In these two-stage systems, a first-stage model (an embedding model or retriever) retrieves a set of candidate documents from a larger dataset. Then, a second-stage model (the reranker) is used to rerank those documents retrieved by the first-stage model.

A reranking model, such as Cohere Rerank, is a type of model that will output a similarity score when given a query and document pair. This score can be used to reorder the documents that are most relevant to the search query. Among the reranking methodologies, the Cohere Rerank model stands out for its ability to significantly enhance search accuracy. The model diverges from traditional embedding models by employing deep learning to evaluate the alignment between each document and the query directly. Cohere Rerank outputs a relevance score by processing the query and document in tandem, which results in a more nuanced document selection process.

In the following example, the application was presented with a query: “When was the transformer paper coauthored by Aidan Gomez published?” The top-k with k = 6 returned the results shown in the image, in which the retrieved result set did contain the most accurate result, although it was at the bottom of the list. With k = 3, the most relevant document would not be included in the retrieved results.

Cohere Rerank aims to reassess and reorder the relevance of the retrieved documents based on additional criteria, such as semantic content, user intent, and contextual relevance, to output a similarity score. This score is then used to reorder the documents by relevance of the query. The following image shows reorder results using Rerank.

By applying Cohere Rerank after the first-stage retrieval, the RAG orchestration can gain the benefits of both approaches. While first-stage retrieval helps to capture relevant items based on proximity matches within the vector space, reranking helps optimize search according to results by guaranteeing contextually relevant results are surfaced to the top. The following diagram demonstrates this improved efficiency.

The latest version of Cohere Rerank, Rerank 3, is purpose-built to enhance enterprise search and RAG systems. Rerank 3 offers state-of-the-art capabilities for enterprise search, including:

  • 4k context length to significantly improve search quality for longer documents
  • Ability to search over multi-aspect and semi-structured data (such as emails, invoices, JSON documents, code, and tables)
  • Multilingual coverage of more than 100 languages
  • Improved latency and lower total cost of ownership (TCO)

The endpoint takes in a query and a list of documents, and it produces an ordered array with each document assigned a relevance score. This provides a powerful semantic boost to the search quality of any keyword or vector search system without requiring any overhaul or replacement.

Developers and businesses can access Rerank on Cohere’s hosted API and on Amazon SageMaker. This post offers a step-by-step walkthrough of consuming Cohere Rerank on Amazon SageMaker.

Solution overview

This solution follows these high-level steps:

  1. Subscribe to the model package
  2. Create an endpoint and perform real-time inference

Prerequisites

For this walkthrough, you must have the following prerequisites:

  1. The cohere-aws notebook.

This is a reference notebook, and it cannot run unless you make changes suggested in the notebook. It contains elements that render correctly in the Jupyter interface, so you need to open it from an Amazon SageMaker notebook instance or in Amazon SageMaker Studio.

  1. An AWS Identity and Access Management (IAM) role with the AmazonSageMakerFullAccess policy attached. To deploy this machine learning (ML) model successfully, choose one of the following options:
    1. If your AWS account does not have a subscription to Cohere Rerank 3 Model – Multilingual, your IAM role needs to have the following three permissions, and you need to have the authority to make AWS Marketplace subscriptions in the AWS account used:
      • aws-marketplace:ViewSubscriptions
      • aws-marketplace:Unsubscribe
      • aws-marketplace:Subscribe
    2. If your AWS account has a subscription to Cohere Rerank 3 Model – Multilingual, you can skip the instructions for subscribing to the model package.

Refrain from using full access in production environments. Security best practice is to opt for the principle of least privilege.

Implement Rerank 3 on Amazon SageMaker

To improve RAG performance using Cohere Rerank, use the instructions in the following sections.

Subscribe to the model package

To subscribe to the model package, follow these steps:

  1. In AWS Marketplace, open the model package listing page Cohere Rerank 3 Model – Multilingual
  2. Choose Continue to Subscribe.
  3. On the Subscribe to this software page, review the End User License Agreement (EULA), pricing, and support terms and choose Accept Offer.
  4. Choose Continue to configuration and then choose a Region. You will see a Product ARN displayed, as shown in the following screenshot. This is the model package Amazon Resource Name (ARN) that you need to specify while creating a deployable model using Boto3. Copy the ARN corresponding to your Region and enter it in the following cell.

The code snippets included in this post are sourced from the aws-cohere notebook. If you encounter any issues with this code, refer to the notebook for the most up-to-date version.

!pip install --upgrade cohere-aws
# if you upgrade the package, you need to restart the kernel

from cohere_aws import Client
import boto3

On the Configure for AWS CloudFormation page shown in the following screenshot, under Product Arn, make a note of the last part of the product ARN to use as the value in the variable cohere_package in the following code.

cohere_package = " cohere-rerank-multilingual-v3--13dba038aab73b11b3f0b17fbdb48ea0"

model_package_map = {

"us-east-1": f"arn:aws:sagemaker:us-east-1:865070037744:model-package/{cohere_package}",

"us-east-2": f"arn:aws:sagemaker:us-east-2:057799348421:model-package/{cohere_package}",

"us-west-1": f"arn:aws:sagemaker:us-west-1:382657785993:model-package/{cohere_package}",

"us-west-2": f"arn:aws:sagemaker:us-west-2:594846645681:model-package/{cohere_package}",

"ca-central-1": f"arn:aws:sagemaker:ca-central-1:470592106596:model-package/{cohere_package}",

"eu-central-1": f"arn:aws:sagemaker:eu-central-1:446921602837:model-package/{cohere_package}",

"eu-west-1": f"arn:aws:sagemaker:eu-west-1:985815980388:model-package/{cohere_package}",

"eu-west-2": f"arn:aws:sagemaker:eu-west-2:856760150666:model-package/{cohere_package}",

"eu-west-3": f"arn:aws:sagemaker:eu-west-3:843114510376:model-package/{cohere_package}",

"eu-north-1": f"arn:aws:sagemaker:eu-north-1:136758871317:model-package/{cohere_package}",

"ap-southeast-1": f"arn:aws:sagemaker:ap-southeast-1:192199979996:model-package/{cohere_package}",

"ap-southeast-2": f"arn:aws:sagemaker:ap-southeast-2:666831318237:model-package/{cohere_package}",

"ap-northeast-2": f"arn:aws:sagemaker:ap-northeast-2:745090734665:model-package/{cohere_package}",

"ap-northeast-1": f"arn:aws:sagemaker:ap-northeast-1:977537786026:model-package/{cohere_package}",

"ap-south-1": f"arn:aws:sagemaker:ap-south-1:077584701553:model-package/{cohere_package}",

"sa-east-1": f"arn:aws:sagemaker:sa-east-1:270155090741:model-package/{cohere_package}",

}

region = boto3.Session().region_name

if region not in model_package_map.keys():

raise Exception(f"Current boto3 session region {region} is not supported.")

model_package_arn = model_package_map[region]

Create an endpoint and perform real-time inference

If you want to understand how real-time inference with Amazon SageMaker works, refer to the Amazon SageMaker Developer Guide.

Create an endpoint

To create an endpoint, use the following code.

co = Client(region_name=region)

co.create_endpoint(arn=model_package_arn, endpoint_name="cohere-rerank-multilingual-v3-0", instance_type="ml.g5.2xlarge", n_instances=1)

# If the endpoint is already created, you just need to connect to it

# co.connect_to_endpoint(endpoint_name="cohere-rerank-multilingual-v3-0”)

After the endpoint is created, you can perform real-time inference.

Create the input payload

To create the input payload, use the following code.

documents = [
    {"Title":"Contraseña incorrecta","Content":"Hola, llevo una hora intentando acceder a mi cuenta y sigue diciendo que mi contraseña es incorrecta. ¿Puede ayudarme, por favor?"},
    {"Title":"Confirmation Email Missed","Content":"Hi, I recently purchased a product from your website but I never received a confirmation email. Can you please look into this for me?"},
    {"Title":"أسئلة حول سياسة الإرجاع","Content":"مرحبًا، لدي سؤال حول سياسة إرجاع هذا المنتج. لقد اشتريته قبل بضعة أسابيع وهو معيب"},
    {"Title":"Customer Support is Busy","Content":"Good morning, I have been trying to reach your customer support team for the past week but I keep getting a busy signal. Can you please help me?"},
    {"Title":"Falschen Artikel erhalten","Content":"Hallo, ich habe eine Frage zu meiner letzten Bestellung. Ich habe den falschen Artikel erhalten und muss ihn zurückschicken."},
    {"Title":"Customer Service is Unavailable","Content":"Hello, I have been trying to reach your customer support team for the past hour but I keep getting a busy signal. Can you please help me?"},
    {"Title":"Return Policy for Defective Product","Content":"Hi, I have a question about the return policy for this product. I purchased it a few weeks ago and it is defective."},
    {"Title":"收到错误物品","Content":"早上好,关于我最近的订单,我有一个问题。我收到了错误的商品,需要退货。"},
    {"Title":"Return Defective Product","Content":"Hello, I have a question about the return policy for this product. I purchased it a few weeks ago and it is defective."}
]

 

Perform real-time inference

To perform real-time inference, use the following code.

 

response = co.rerank(documents=documents, query='What emails have been about returning items?', rank_fields=["Title","Content"], top_n=5)

Visualize output

To visualize output, use the following code.

print(f'Documents: {response}')

The following screenshot shows the output response.

Cleanup

To avoid any recurring charges, use the following steps to clean up the resources created in this walkthrough.

Delete the model

Now that you have successfully performed a real-time inference, you do not need the endpoint anymore. You can terminate the endpoint to avoid being charged.

co.delete_endpoint()
co.close()

Unsubscribe to the listing (optional)

If you want to unsubscribe to the model package, follow these steps. Before you cancel the subscription, make sure that you don’t have a deployable model created from the model package or using the algorithm. You can find this information by looking at the container name associated with the model.

Steps to unsubscribe from the product from AWS Marketplace:

  1. On the Your Software subscriptions page, choose the Machine Learning tab
  2. Locate the listing that you want to cancel the subscription for, and then choose Cancel Subscription

Summary

RAG is a capable technique for developing AI applications that integrate real-time data and enable interactive conversations using proprietary information. RAG enhances AI responses by tapping into external, domain-specific knowledge sources, but its effectiveness depends on finding the right source materials. This post focuses on improving search efficiency and accuracy in RAG systems using Cohere Rerank. RAG orchestration typically involves two steps: retrieval of relevant documents and generation of answers. While dense retrieval is efficient for large datasets, it can struggle with complex data and questions due to information compression. Cohere Rerank uses deep learning to evaluate the alignment between documents and queries, outputting a relevance score that enables more nuanced document selection.

Customers can find Cohere Rerank 3 and Cohere Rerank 3 Nimble on Amazon Sagemaker Jumpstart.


About the Authors

Shashi Raina is a Senior Partner Solutions Architect at Amazon Web Services (AWS), where he specializes in supporting generative AI (GenAI) startups. With close to 6 years of experience at AWS, Shashi has developed deep expertise across a range of domains, including DevOps, analytics, and generative AI.

Pradeep Prabhakaran is a Senior Manager – Solutions Architecture at Cohere. In his current role at Cohere, Pradeep acts as a trusted technical advisor to customers and partners, providing guidance and strategies to help them realize the full potential of Cohere’s cutting-edge Generative AI platform.

Read More