Advancing AI trust with new responsible AI tools, capabilities, and resources


As generative AI continues to drive innovation across industries and our daily lives, the need for responsible AI has become increasingly important. At AWS, we believe the long-term success of AI depends on the ability to inspire trust among users, customers, and society. This belief is at the heart of our long-standing commitment to building and using AI responsibly. Responsible AI goes beyond mitigating risks and aligning to relevant standards and regulations. It’s about proactively building trust and unlocking AI’s potential to drive business value. A comprehensive approach to responsible AI empowers organizations to innovate boldly and achieve transformative business outcomes. New joint research conducted by Accenture and AWS underscores this, highlighting responsible AI as a key driver of business value — boosting product quality, operational efficiency, customer loyalty, brand perception, and more. Nearly half of the surveyed companies acknowledge responsible AI as pivotal in driving AI-related revenue growth. Why? Responsible AI builds trust, and trust accelerates adoption and innovation.

With trust as a cornerstone of AI adoption, we are excited to announce at AWS re:Invent 2024 new responsible AI tools, capabilities, and resources that enhance the safety, security, and transparency of our AI services and models and help support customers’ own responsible AI journeys.

Taking proactive steps to manage AI risks and foster trust and interoperability

AWS is the first major cloud service provider to announce ISO/IEC 42001 accredited certification for AI services, covering Amazon Bedrock, Amazon Q Business, Amazon Textract, and Amazon Transcribe. ISO/IEC 42001 is an international management system standard that outlines the requirements for organizations to manage AI systems responsibly throughout their lifecycle. Technical standards, such as ISO/IEC 42001, are significant because they provide a common framework for responsible AI development and deployment, fostering trust and interoperability in an increasingly global and AI-driven technological landscape. Achieving ISO/IEC 42001 certification means that an independent third party has validated that AWS is taking proactive steps to manage risks and opportunities associated with AI development, deployment, and operation. With this certification, we reinforce our commitments to providing AI services that help you innovate responsibly with AI.

Expanding safeguards in Amazon Bedrock Guardrails to improve transparency and safety

In April 2024, we announced the general availability of Amazon Bedrock Guardrails, which makes it easier to apply safety and responsible AI checks to your generative AI applications. Amazon Bedrock Guardrails delivers industry-leading safety protections by blocking up to 85% more harmful content on top of native protections provided by foundation models (FMs) and filtering over 75% of hallucinated responses from models using contextual grounding checks for Retrieval Augmented Generation (RAG) and summarization use cases. The ability to implement these safeguards was a big step forward in building trust in AI systems. Despite the advancements in FMs, models can still produce hallucinations, a challenge many of our customers face. For use cases where accuracy is critical, customers need mathematically sound techniques and explainable reasoning to help generate accurate FM responses.

To address this need, we are adding new safeguards to Amazon Bedrock Guardrails to help prevent factual errors due to FM hallucinations and offer verifiable proofs. With the launch of the Automated Reasoning checks in Amazon Bedrock Guardrails (preview), AWS becomes the first and only major cloud provider to integrate automated reasoning in our generative AI offerings. Automated Reasoning checks help prevent factual errors from hallucinations using sound mathematical, logic-based algorithmic verification and reasoning processes to verify the information generated by a model, so outputs align with provided facts and aren’t based on hallucinated or inconsistent data. Used alongside other techniques such as prompt engineering, RAG, and contextual grounding checks, Automated Reasoning checks add a more rigorous and verifiable approach to enhancing the accuracy of LLM-generated outputs. Encoding your domain knowledge into structured policies helps your conversational AI applications provide reliable and trustworthy information to your users.


As organizations increasingly use applications with multimodal data to drive business value, improve decision-making, and enhance customer experiences, the need for content filters extends beyond text. Amazon Bedrock Guardrails now supports multimodal toxicity detection (in preview) with support for image content, helping organizations to detect and filter undesirable and potentially harmful image content while retaining safe and relevant visuals. Multimodal toxicity detection helps remove the heavy lifting required to build your own safeguards for image data or invest time in manual evaluation that can be error-prone and tedious. Amazon Bedrock Guardrails helps you to responsibly create AI applications, helping build trust with your users.
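To make this concrete, the following is a minimal sketch of calling an existing guardrail directly through the ApplyGuardrail API with both text and image content. The guardrail ID, version, and image file are placeholders, and the exact image content-block shape for the multimodal preview may vary by AWS SDK version, so treat this as illustrative rather than definitive.

import boto3

# Minimal sketch: apply an existing guardrail to mixed text and image input.
# Guardrail ID/version and the image path are placeholders.
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-west-2")

with open("product-photo.jpg", "rb") as f:  # hypothetical local image
    image_bytes = f.read()

response = bedrock_runtime.apply_guardrail(
    guardrailIdentifier="your-guardrail-id",   # placeholder
    guardrailVersion="1",                      # placeholder
    source="INPUT",                            # evaluate user input before it reaches the model
    content=[
        {"text": {"text": "Describe this image for a product listing."}},
        # image content block shape assumed from the multimodal toxicity detection preview
        {"image": {"format": "jpeg", "source": {"bytes": image_bytes}}},
    ],
)

# 'GUARDRAIL_INTERVENED' means a policy blocked or masked the content
print(response["action"])
print(response.get("assessments", []))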

Improving generative AI application responses and quality with new Amazon Bedrock evaluation capabilities

With more general-purpose FMs to choose from, organizations now have a wide range of options to power their generative AI applications. However, selecting the optimal model for a specific use case requires efficiently comparing models based on an organization’s preferred quality and responsible AI metrics. While evaluation is an important part of building trust and transparency, it demands substantial time, expertise, and resources for every new use case, making it challenging to choose the model that delivers the most accurate and safe customer experience. Amazon Bedrock Evaluations addresses this by helping you evaluate, compare, and select the best FMs for your use case. You can now use an LLM-as-a-judge (in preview) for model evaluations to perform tests and evaluate other models with human-like quality on your dataset. You can choose from LLMs hosted on Amazon Bedrock to be the judge, with a variety of quality and responsible AI metrics such as correctness, completeness, and harmfulness. You can also bring your own prompt dataset to customize the evaluation with your data, and compare results across evaluation jobs to make decisions faster. Previously, you had a choice between human-based model evaluation and automatic evaluation with exact string matching and other traditional natural language processing (NLP) metrics. These methods, though fast, didn’t provide a strong correlation with human evaluators. Now, with LLM-as-a-judge, you can get human-like evaluation quality at a much lower cost than full human-based evaluations while saving up to weeks of time. Many organizations still want the final assessment to be from expert human annotators. For this, Amazon Bedrock still offers full human-based evaluations with an option to bring your own workforce or have AWS manage your custom evaluation.
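To illustrate the underlying pattern (not the managed Amazon Bedrock Evaluations API itself), the following sketch shows how an LLM-as-a-judge flow works: a judge model hosted on Amazon Bedrock scores a candidate answer against a reference answer using the Converse API. The model ID, rubric, and sample inputs are illustrative assumptions.

import boto3

# Conceptual LLM-as-a-judge sketch: one Bedrock model grades another model's answer.
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-west-2")

judge_prompt = """You are an impartial judge. Score the candidate answer for
correctness against the reference answer on a scale from 0 to 1, then briefly
explain your score.

Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}"""

def judge(question: str, reference: str, candidate: str) -> str:
    response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # any Bedrock judge model
        messages=[{
            "role": "user",
            "content": [{"text": judge_prompt.format(
                question=question, reference=reference, candidate=candidate)}],
        }],
        inferenceConfig={"temperature": 0.0, "maxTokens": 300},
    )
    return response["output"]["message"]["content"][0]["text"]

print(judge("What year was AWS launched?", "2006", "AWS launched in 2006."))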

To equip FMs with up-to-date and proprietary information, organizations use RAG, a technique that fetches data from company data sources and enriches the prompt to provide more relevant and accurate responses. However, evaluating and optimizing RAG applications can be challenging due to the complexity of optimizing retrieval and generation components. To address this, we’ve introduced RAG evaluation support in Amazon Bedrock Knowledge Bases (in preview). This new evaluation capability now allows you to assess and optimize RAG applications conveniently and quickly, right where your data and LLMs already reside. Powered by LLM-as-a-judge technology, RAG evaluations offer a choice of several judge models and metrics, such as context relevance, context coverage, correctness, and faithfulness (hallucination detection). This seamless integration promotes regular assessments, fostering a culture of continuous improvement and transparency in AI application development. By saving both cost and time compared to human-based evaluations, these tools empower organizations to enhance their AI applications, building trust through consistent improvement.

The model and RAG evaluation capabilities both provide natural language explanations for each score in the output file and on the AWS Management Console. The scores are normalized from 0 to 1 for ease of interpretability. Rubrics are published in full with the judge prompts in the documentation, so non-scientists can understand how scores are derived. To learn more about the model and RAG evaluation capabilities, see the AWS News Blog.

Introducing Amazon Nova, built with responsible AI at the core

Amazon Nova is a new generation of state-of-the-art FMs that deliver frontier intelligence and industry-leading price performance. Amazon Nova FMs incorporate built-in safeguards to detect and remove harmful content from data, reject inappropriate user inputs, and filter model outputs. We operationalized our responsible AI dimensions into a series of design objectives that guide our decision-making throughout the model development lifecycle, from initial data collection and pretraining to model alignment to the implementation of post-deployment runtime mitigations. Amazon Nova Canvas and Amazon Nova Reel come with controls to support safety, security, and IP needs with responsible AI. This includes watermarking, content moderation, and C2PA support (available in Amazon Nova Canvas) to add metadata by default to generated images. Amazon’s safety measures to combat the spread of misinformation, child sexual abuse material (CSAM), and chemical, biological, radiological, or nuclear (CBRN) risks also extend to Amazon Nova models. For more information on how Amazon Nova was built responsibly, read the Amazon Science blog.

Enhancing transparency with new resources to advance responsible generative AI

At re:Invent 2024, we announced the availability of new AWS AI Service Cards for Amazon Nova Reel; Amazon Nova Canvas; Amazon Nova Micro, Lite, and Pro; Amazon Titan Image Generator; and Amazon Titan Text Embeddings to increase transparency of Amazon FMs. These cards provide comprehensive information on the intended use cases, limitations, responsible AI design choices, and best practices for deployment and performance optimization. A key component of Amazon’s responsible AI documentation, AI Service Cards offer customers and the broader AI community a centralized resource to understand the development process we undertake to build our services in a responsible way that addresses fairness, explainability, privacy and security, safety, controllability, veracity and robustness, governance, and transparency. As generative AI continues to grow and evolve, transparency on how technology is developed, tested, and used will be a vital component to earn the trust of organizations and their customers alike. You can explore all 16 AI Service Cards on Responsible AI Tools and Resources.

We also updated the AWS Responsible Use of AI Guide. This document offers considerations for designing, developing, deploying, and operating AI systems responsibly, based on our extensive learnings and experience in AI. It was written with a set of diverse AI stakeholders and perspectives in mind—including, but not limited to, builders, decision-makers, and end-users. At AWS, we are committed to continuing to bring transparency resources like these to the broader community—and to iterate and gather feedback on the best ways forward.

Delivering breakthrough innovation with trust at the forefront

At AWS, we’re dedicated to fostering trust in AI, empowering organizations of all sizes to build and use AI effectively and responsibly. We are excited about the responsible AI innovations announced at re:Invent this week. From new safeguards and evaluation techniques in Amazon Bedrock to state-of-the-art Amazon Nova FMs to fostering trust and transparency with ISO/IEC 42001 certification and new AWS AI Service Cards, you have more tools, resources and built-in protections to help you innovate responsibly and unlock value with generative AI.

We encourage you to explore these new tools and resources.


About the author

Dr. Baskar Sridharan is the Vice President for AI/ML and Data Services & Infrastructure at AWS, where he oversees the strategic direction and development of key services, including Bedrock, SageMaker, and essential data platforms like EMR, Athena, and Glue.


Deploy RAG applications on Amazon SageMaker JumpStart using FAISS


Generative AI has empowered customers with their own information in unprecedented ways, reshaping interactions across various industries by enabling intuitive and personalized experiences. This transformation is significantly enhanced by Retrieval Augmented Generation (RAG), which is a generative AI pattern where the large language model (LLM) being used references a knowledge corpus outside of its training data to generate a response. RAG has become a popular choice to improve performance of generative AI applications by taking advantage of additional information in the knowledge corpus to augment an LLM. Customers often prefer RAG for optimizing generative AI output over other techniques like fine-tuning due to cost benefits and quicker iteration.

In this post, we show how to build a RAG application on Amazon SageMaker JumpStart using Facebook AI Similarity Search (FAISS).

RAG applications on AWS

RAG models have proven useful for grounding language generation in external knowledge sources. By retrieving relevant information from a knowledge base or document collection, RAG models can produce responses that are more factual, coherent, and relevant to the user’s query. This can be particularly valuable in applications like question answering, dialogue systems, and content generation, where incorporating external knowledge is crucial for providing accurate and informative outputs.

Additionally, RAG has shown promise for improving understanding of internal company documents and reports. By retrieving relevant context from a corporate knowledge base, RAG models can assist with tasks like summarization, information extraction, and question answering on complex, domain-specific documents. This can help employees quickly find important information and insights buried within large volumes of internal materials.

A RAG workflow typically has four components: the input prompt, document retrieval, contextual generation, and output. The workflow begins with a user providing an input prompt, which is searched against a large knowledge corpus, and the most relevant documents are returned. These returned documents, along with the original query, are then fed into the LLM, which uses the additional context to produce a more accurate output for users. RAG has become a popular technique for optimizing generative AI applications because it uses external data that can be updated frequently, so responses stay current without the need to retrain the model, which is both costly and compute intensive.
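As a rough illustration of these four components, the following sketch strings them together in plain Python; retrieve and generate are stand-ins for your vector store lookup and LLM call rather than real implementations.

# Minimal sketch of the four RAG components described above; `retrieve` and
# `generate` are placeholders for a vector store lookup and an LLM call.
def rag_answer(prompt: str, retrieve, generate, k: int = 3) -> str:
    # 1. Input prompt from the user
    # 2. Document retrieval: fetch the k most relevant chunks from the corpus
    context_docs = retrieve(prompt, k=k)
    context = "\n\n".join(context_docs)
    # 3. Contextual generation: the LLM conditions on the retrieved context
    augmented_prompt = (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\nQuestion: {prompt}"
    )
    # 4. Output returned to the user
    return generate(augmented_prompt)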

The next component we have chosen for this pattern is SageMaker JumpStart. It provides significant advantages for building and deploying generative AI applications, including access to a wide range of pre-trained models with prepackaged artifacts, ease of use through a user-friendly interface, and scalability with seamless integration to the broader AWS ecosystem. By using pre-trained models and optimized hardware, SageMaker JumpStart allows you to quickly deploy both LLMs and embedding models without spending too much time on configurations for scalability.

Solution overview

To implement our RAG workflow on SageMaker JumpStart, we use a popular open source Python library known as LangChain. Using LangChain, the RAG components are simplified into independent blocks that you can bring together using a chain object that will encapsulate the entire workflow. Let’s review these different components and how we bring them together:

  • LLM (inference) – We need an LLM that will do the actual inference and answer our end-user’s initial prompt. For our use case, we use Meta Llama 3 for this component. LangChain comes with a default wrapper class for SageMaker endpoints that allows you to simply pass in the endpoint name to define an LLM object in the library.
  • Embeddings model – We need an embeddings model to convert our document corpus into textual embeddings. This is necessary for when we are doing a similarity search on the input text to see what documents share similarities and possess the knowledge to help augment our response. For this example, we use the BGE Hugging Face embeddings model available through SageMaker JumpStart.
  • Vector store and retriever – To house the different embeddings we have generated, we use a vector store. In this case, we use FAISS, which allows for similarity search as well. Within our chain object, we define the vector store as the retriever. You can tune this depending on how many documents you want to retrieve. Other vector store options include Amazon OpenSearch Service as you scale your experiments.

The following architecture diagram illustrates how you can use a vector index such as FAISS as a knowledge base and embeddings store.

Architecture diagram

Standalone vector indexes like FAISS can significantly improve the search and retrieval of vector embeddings, but they lack some of the capabilities that a full vector database provides. The following is an overview of the primary benefits of using a vector index for RAG workflows:

  • Efficiency and speed – Vector indexes are highly optimized for fast, memory-efficient similarity search. Vector databases build additional features on top of vector indexes, and those features typically add latency. To build a highly efficient and low-latency RAG workflow, you can use a vector index (such as FAISS) deployed on a single machine with GPU acceleration.
  • Simplified deployment and maintenance – Because vector indexes don’t require the effort of spinning up and maintaining a database instance, they’re a great option to quickly deploy a RAG workflow if continuous updates, high concurrency, or distributed storage aren’t a requirement.
  • Control and customization – Vector indexes offer granular control over parameters, the index type, and performance trade-offs, letting you optimize for exact or approximate searches based on the RAG use case.
  • Memory efficiency – You can tune a vector index to minimize memory usage, especially when using data compression techniques such as quantization. This is advantageous in scenarios where memory is limited and high scalability is required so that more data can be stored in memory on a single machine.

In short, a vector index like FAISS is advantageous when trying to maximize speed, control, and efficiency with minimal infrastructure components and stable data.
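As a rough sketch of the control and memory trade-offs described above, the following example builds an exact FAISS index and an approximate, product-quantized index over synthetic embeddings. The dimension and index parameters are illustrative assumptions and independent of the LangChain wrapper used later in this post.

import numpy as np
import faiss

# Synthetic embeddings with dimension 1024 (the BGE-large embedding size)
d = 1024
xb = np.random.random((10_000, d)).astype("float32")   # corpus embeddings
xq = np.random.random((5, d)).astype("float32")        # query embeddings

# Exact search: simplest, highest recall, largest memory footprint
flat_index = faiss.IndexFlatL2(d)
flat_index.add(xb)

# Approximate search with product quantization: compresses vectors to save memory
# at some cost in recall (256 clusters, 64 sub-quantizers, 8 bits per code)
quantizer = faiss.IndexFlatL2(d)
ivfpq_index = faiss.IndexIVFPQ(quantizer, d, 256, 64, 8)
ivfpq_index.train(xb)    # IVF/PQ indexes must be trained before adding vectors
ivfpq_index.add(xb)
ivfpq_index.nprobe = 16  # search more clusters for better recall, slower queries

distances, ids = ivfpq_index.search(xq, 3)
print(ids)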

In the following sections, we walk through the following notebook, which implements FAISS as the vector store in the RAG solution. In this notebook, we use several years of Amazon’s Letter to Shareholders as a text corpus and perform Q&A on the letters. We use this notebook to demonstrate advanced RAG techniques with Meta Llama 3 8B on SageMaker JumpStart using the FAISS embedding store.

We explore the code using the simple LangChain vector store wrapper, RetrievalQA, and ParentDocumentRetriever. RetrievalQA is more advanced than the LangChain vector store wrapper and offers more customization. ParentDocumentRetriever helps with advanced RAG options like invocation of parent documents for response generation, which enriches the LLM’s outputs with a layered and thorough context. We will see how the responses progressively get better as we move from simple to advanced RAG techniques.

Prerequisites

To run this notebook, you need access to an ml.t3.medium instance.

To deploy the endpoints for Meta Llama 3 8B model inference, you need the following:

  • At least one ml.g5.12xlarge instance for Meta Llama 3 endpoint usage
  • At least one ml.g5.2xlarge instance for embedding endpoint usage

Additionally, you may need to request a Service Quota increase.

Set up the notebook

Complete the following steps to create a SageMaker notebook instance (you can also use Amazon SageMaker Studio with JupyterLab):

  1. On the SageMaker console, choose Notebooks in the navigation pane.
  2. Choose Create notebook instance.

Create Notebook Instance view

  3. For Notebook instance type, choose t3.medium.
  4. Under Additional configuration, for Volume size in GB, enter 50 GB.

This configuration might need to change depending on the RAG solution you are working with and the amount of data you will have on the file system itself.

SageMaker Notebook Settings

  5. For IAM role, choose Create a new role.

IAM Role Creation

  6. Create an AWS Identity and Access Management (IAM) role with SageMaker full access and any other service-related policies that are necessary for your operations.

Create IAM Role bucket access

  7. Expand the Git repositories section and for Git repository URL, enter https://github.com/aws-samples/sagemaker-genai-hosting-examples.git.

Git Repository URL

  8. Accept defaults for the rest of the configurations and choose Create notebook instance.
  9. Wait for the notebook to be InService and then choose the Open JupyterLab link to launch JupyterLab.

Jupyter Notebook Instances

  10. Open genai-recipes/RAG-recipes/llama3-rag-langchain-smjs.ipynb to work through the notebook.

Open Notebook

Deploy the model

Before you start building the end-to-end RAG workflow, it’s necessary to deploy the LLM and embeddings model of your choice. SageMaker JumpStart simplifies this process because the model artifacts, data, and container specifications are all pre-packaged for optimal inference. These are then exposed using SageMaker Python SDK high-level API calls, which let you specify the model ID for deployment to a SageMaker real-time endpoint:

from sagemaker.jumpstart.model import JumpStartModel

# Deploying Llama
# Specify the model ID for the HuggingFace Llama 3 8b Instruct LLM model
model_id = "meta-textgeneration-llama-3-8b-instruct"
accept_eula = True
model = JumpStartModel(model_id=model_id)
predictor = model.deploy(accept_eula=accept_eula)

# Deploying Embeddings Model
# Specify the model ID for the HuggingFace BGE Large EN Embedding model
model_id = "huggingface-sentencesimilarity-bge-large-en-v1-5"
text_embedding_model = JumpStartModel(model_id=model_id)
embedding_predictor = text_embedding_model.deploy()
embedding_predictor.endpoint_name

LangChain comes with built-in support for SageMaker JumpStart and endpoint-based models, so you can encapsulate the endpoints with these constructs so they can later be fit into the encompassing RAG chain:

from langchain_community.llms import SagemakerEndpoint
from langchain_community.embeddings import SagemakerEndpointEmbeddings

# llm_endpoint_name, embedding_endpoint_name, region, and the content handler
# classes (llama_content_handler, bge_content_handler) are defined earlier in the notebook

# Set up the Llama 3 8B model with a SageMaker endpoint
llm = SagemakerEndpoint(
    endpoint_name=llm_endpoint_name,
    region_name=region,
    model_kwargs={"max_new_tokens": 1024, "top_p": 0.9, "temperature": 0.7},
    content_handler=llama_content_handler,
)

# Set up the embeddings model
sagemaker_embeddings = SagemakerEndpointEmbeddings(
    endpoint_name=embedding_endpoint_name,
    region_name=region,
    model_kwargs={"mode": "embedding"},
    content_handler=bge_content_handler,
)

After you have set up the models, you can focus on the data preparation and setup of the FAISS vector store.

Data preparation and vector store setup

For this RAG use case, we take public documents of Amazon’s Letter to Shareholders as the text corpus and document source that we will be working with:

# public data to retrieve from
from urllib.request import urlretrieve
urls = [
'https://d18rn0p25nwr6d.cloudfront.net/CIK-0001018724/c7c14359-36fa-40c3-b3ca-5bf7f3fa0b96.pdf',
'https://d18rn0p25nwr6d.cloudfront.net/CIK-0001018724/d2fde7ee-05f7-419d-9ce8-186de4c96e25.pdf',
'https://d18rn0p25nwr6d.cloudfront.net/CIK-0001018724/f965e5c3-fded-45d3-bbdb-f750f156dcc9.pdf',
'https://d18rn0p25nwr6d.cloudfront.net/CIK-0001018724/336d8745-ea82-40a5-9acc-1a89df23d0f3.pdf'
]
filenames = [
'AMZN-2024-10-K-Annual-Report.pdf',
'AMZN-2023-10-K-Annual-Report.pdf',
'AMZN-2022-10-K-Annual-Report.pdf',
'AMZN-2021-10-K-Annual-Report.pdf'
]

LangChain comes with built-in processing for PDF documents, and you can use this to load the data from the text corpus. You can also tune or iterate over parameters such as chunk size depending on the documents that you’re working with for your use case.

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

documents = []

# process PDF data
# `data_root` (local folder containing the PDFs) and `metadata` (per-document
# metadata dicts) are defined earlier in the notebook
for idx, file in enumerate(filenames):
    loader = PyPDFLoader(data_root + file)
    document = loader.load()
    for document_fragment in document:
        document_fragment.metadata = metadata[idx]
    documents += document

# in our testing, character splitting works better with this PDF data set
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
)
docs = text_splitter.split_documents(documents)
print(docs[100])

You can then combine the documents and embeddings models and point towards FAISS as your vector store. LangChain has widespread support for different LLMs such as SageMaker JumpStart, and also has built-in API calls for integrating with FAISS, which we use in this case:

from langchain_community.vectorstores import FAISS
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
vectorstore_faiss = FAISS.from_documents(
    docs, # doc corpus
    sagemaker_embeddings, # embeddings endpoint
)
wrapper_store_faiss = VectorStoreIndexWrapper(vectorstore=vectorstore_faiss)

You can then make sure the vector store is performing as expected by sending a few sample queries and reviewing the output that is returned:

query = "How did AWS perform in 2021?"
# returns relevant documents
answer = wrapper_store_faiss.query(question=query, llm=llm)
print(answer)

LangChain inference

Now that you have set up the vector store and models, you can encapsulate this into a singular chain object. In this case, we use a RetrievalQA chain tailored for RAG applications provided by LangChain. With this chain, you can customize the document fetching process and control parameters such as the number of documents to retrieve. We define a prompt template and pass in our retriever as well as these additional parameters:

from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

prompt_template = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
This is a conversation between an AI assistant and a Human.
<|eot_id|><|start_header_id|>user<|end_header_id|>
Use the following pieces of context to provide a concise answer to the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
#### Context ####
{context}
#### End of Context ####
Question: {question}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore_faiss.as_retriever(
        search_type="similarity", search_kwargs={"k": 3}
    ),
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT},
)

You can then test some sample inference and trace the relevant source documents that helped answer the query:

query = "How did AWS perform in 2023?"
result = qa({"query": query})
print(result['result'])
print(f"n{result['source_documents']}")

Optionally, if you want to further augment or enhance your RAG applications for more advanced use cases with larger documents, you can also explore using options such as a parent document retriever chain. Depending on your use case, it’s crucial to identify the different RAG processes and architectures that can optimize your generative AI application.
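As a rough sketch of what that parent document retriever option can look like with the pieces already built in this post, the following example wires LangChain’s ParentDocumentRetriever to a FAISS store using the sagemaker_embeddings and documents objects defined earlier. The chunk sizes and the in-memory docstore are illustrative choices, and exact class locations can vary across LangChain versions.

import faiss
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS

# Small child chunks are embedded for precise similarity search, while the larger
# parent chunks they belong to are returned to the LLM for richer context.
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)

# Empty FAISS index for the child chunks (embedding dimension probed from the endpoint)
embedding_dim = len(sagemaker_embeddings.embed_query("dimension probe"))
child_vectorstore = FAISS(
    embedding_function=sagemaker_embeddings,
    index=faiss.IndexFlatL2(embedding_dim),
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)

parent_retriever = ParentDocumentRetriever(
    vectorstore=child_vectorstore,
    docstore=InMemoryStore(),   # holds the full parent chunks keyed by ID
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
parent_retriever.add_documents(documents)  # splits, embeds children, stores parents

# The retriever can be passed to RetrievalQA.from_chain_type in place of the FAISS retriever
parent_docs = parent_retriever.invoke("How did AWS perform in 2023?")
print(parent_docs[0].page_content[:500])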

Clean up

After you have built the RAG application with FAISS as a vector index, make sure to clean up the resources that were used. You can delete the LLM endpoint using the delete_endpoint Boto3 API call. In addition, make sure to stop your SageMaker notebook instance so you don’t incur any further charges.
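The following is a minimal cleanup sketch assuming the predictor and embedding_predictor objects from the deployment step are still in scope; the boto3 call shows the equivalent delete_endpoint API with a placeholder endpoint name.

import boto3

# Delete the JumpStart endpoints created earlier in this post
predictor.delete_model()
predictor.delete_endpoint()
embedding_predictor.delete_model()
embedding_predictor.delete_endpoint()

# Equivalent low-level call for an endpoint you know only by name (placeholder name)
sagemaker_client = boto3.client("sagemaker")
sagemaker_client.delete_endpoint(EndpointName="your-endpoint-name")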

Conclusion

RAG can revolutionize customer interactions across industries by providing personalized and intuitive experiences. RAG’s four-component workflow—input prompt, document retrieval, contextual generation, and output—allows for dynamic, up-to-date responses without the need for costly model retraining. This approach has gained popularity due to its cost-effectiveness and ability to quickly iterate.

In this post, we saw how SageMaker JumpStart has simplified the process of building and deploying generative AI applications, offering pre-trained models, user-friendly interfaces, and seamless scalability within the AWS ecosystem. We also saw how using FAISS as a vector index can enable quick retrieval from a large corpus of information, while keeping costs and operational overhead low.

To learn more about RAG on SageMaker, see Retrieval Augmented Generation, or contact your AWS account team to discuss your use cases.


About the Authors

Raghu Ramesha is an ML Solutions Architect with the Amazon SageMaker Service team. He focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in Computer Science from UT Dallas. In his free time, he enjoys traveling and photography.

Ram Vegiraju is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers build and optimize their AI/ML solutions on SageMaker. In his spare time, he loves traveling and writing.

Vivek Gangasani is a Senior GenAI Specialist Solutions Architect at AWS. He helps emerging generative AI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.

Harish Rao is a Senior Solutions Architect at AWS, specializing in large-scale distributed AI training and inference. He empowers customers to harness the power of AI to drive innovation and solve complex challenges. Outside of work, Harish embraces an active lifestyle, enjoying the tranquility of hiking, the intensity of racquetball, and the mental clarity of mindfulness practices.

Ankith Ede is a Solutions Architect at Amazon Web Services based in New York City. He specializes in helping customers build cutting-edge generative AI, machine learning, and data analytics-based solutions for AWS startups. He is passionate about helping customers build scalable and secure cloud-based solutions.

Sid Rampally is a Customer Solutions Manager at AWS, driving generative AI acceleration for life sciences customers. He writes about topics relevant to his customers, focusing on data engineering and machine learning. In his spare time, Sid enjoys walking his dog in Central Park and playing hockey.


Speed up your cluster procurement time with Amazon SageMaker HyperPod training plans


Today, organizations are constantly seeking ways to use advanced large language models (LLMs) for their specific needs. These organizations are engaging in both pre-training and fine-tuning massive LLMs, with parameter counts in the billions. This process aims to enhance model efficacy for a wide array of applications across diverse sectors, including healthcare, financial services, and marketing. However, customizing these larger models requires access to the latest and accelerated compute resources.

In this post, we demonstrate how you can address this requirement by using Amazon SageMaker HyperPod training plans, which can bring down your training cluster procurement wait time. A training plan provides simple and predictable access to accelerated compute resources (supporting P4d, P5, P5e, P5en, and trn2 as of the time of writing), allowing you to use this compute capacity to run model training on either Amazon SageMaker training jobs or SageMaker HyperPod.

We guide you through a step-by-step implementation of how you can use the AWS Command Line Interface (AWS CLI) or the AWS Management Console to find, review, and create optimal training plans for your specific compute and timeline needs. We further guide you through using the training plan to submit SageMaker training jobs or create SageMaker HyperPod clusters.

You can check out the launch of this new feature in Meet your training timelines and budget with new Amazon SageMaker HyperPod flexible training plans.

Business challenges

As organizations strive to harness the power of LLMs for competitive advantage, they face a significant hurdle: securing sufficient and reliable compute capacity for model training. The scale of these models demands cutting-edge accelerated compute hardware. However, the high cost and limited availability of such resources create a bottleneck for many businesses. This scarcity not only impacts timelines, but also stretches budgets, potentially delaying critical AI initiatives. As a result, organizations are seeking solutions that can provide consistent, scalable, and cost-effective access to high-performance computing resources, enabling them to train and fine-tune LLMs without compromising on speed or quality.

Solution overview

SageMaker HyperPod training plans, a new SageMaker capability, address this challenge by offering you a simple-to-use console UI or AWS CLI experience to search, review, create, and manage training plans.

Capacity provisioned through SageMaker training plans can be used with either SageMaker training jobs or SageMaker HyperPod. If you want to focus on model development rather than infrastructure management and prefer ease of use with a managed experience, SageMaker training jobs are an excellent choice. For organizations requiring granular control over training infrastructure and extensive customization options, SageMaker HyperPod is the ideal solution. To better understand these services and choose the one most appropriate for your use case, refer to Generative AI foundation model training on Amazon SageMaker, which provides detailed information about both options.

The following diagram provides an overview of the main steps involved in requesting capacity using SageMaker training plans for SageMaker training jobs.

Workflow for securing training plans

Figure 1: The main steps involved in procuring capacity via SageMaker HyperPod training plans. Note: This workflow arbitrarily uses SageMaker training jobs as the target; you may choose to use SageMaker HyperPod too.

At a high level, the steps to create a training plan are as follows:

  1. Search the training plans that best match your capacity requirements, such as instance type, instance count, start time, and duration. SageMaker finds the optimal plans across one or more segments.
  2. After reviewing the available training plan offerings, you can reserve the plan that meets your requirements.
  3. Schedule your SageMaker training jobs by using a training plan with a training-job target resource. Note, we are only using training-job for illustration purposes. You may also use hyperpod-cluster as your target resource.
  4. Describe and list your existing training plans. When the capacity is available, it will be allocated to the scheduled training job.

In the following sections, we shift our focus to the solution walkthrough associated with training plans.

Prerequisites

Complete the following prerequisite steps:

  1. If you’re using an AWS Identity and Access Management (IAM) user for this solution, make sure that your user has the AmazonSageMakerFullAccess policy attached to it. To learn more about how to attach a policy to an IAM user, see Adding IAM identity permissions (console).
  2. If you’re setting up the AWS CLI for the first time, follow the instructions at Getting started with the AWS CLI.
  3. If you choose to use the AWS CLI, make sure you are on the most up-to-date AWS CLI version.

Create a training plan

In this post, we discuss two ways to create a training plan: using the SageMaker console or the AWS CLI.

Create a SageMaker training plan using the SageMaker console

The SageMaker console user experience for creating a training plan is similar for both training jobs and SageMaker HyperPod. In this post, for demonstration purposes, we show how to create a training plan for a SageMaker HyperPod cluster.

  1. On the SageMaker console, choose Training plans in the navigation pane.
  2. Create a new training plan.
  3. For Target, select HyperPod cluster.
  4. Under Instance attributes, specify your instance type (ml.p5.48xlarge) and instance count (16).
  5. Under Date settings to search for an available plan, choose your preferred training date and duration (for example, 10 days).
  6. Choose Find training plan.

Figure 2: You can search for available training plan offerings via the SageMaker console! Choose your target, select your instance type and count, and specify duration.

SageMaker suggests a training plan that is split into two 5-day segments. This includes the total upfront price for the plan as well as the estimated data transfer cost based on the data location you provided.

Figure 3: SageMaker suggests a training plan based on your inputs. In this example, SageMaker suggests a training plan split across two 5-day segments. You will also see the total upfront price.

  7. Review and purchase your plan.

Figure 4: Once you’re happy with your selection, you can review and purchase your training plan!

After you create the training plan, you can see the list of training plans created. The plan initially enters a Pending state, awaiting payment. Once the payment is processed, the plan transitions to the Scheduled state. At this point, you can begin queuing jobs or creating clusters using the plan. On the plan’s start date, it becomes Active, and resources are allocated. Your training tasks can then start running (pending resource availability).

Make sure you pay for the training plan using the AWS Billing and Cost Management console for your plan to show up on your SageMaker console. You will receive an invoice to resolve before being able to proceed.

Figure 5: You can list out your training plans on the SageMaker console. You can start using your plan once it transitions to the Active state.

Create a SageMaker training plan using the AWS CLI

Complete the following steps to create a training plan using the AWS CLI:

  1. Start by calling the API, passing your capacity requirements as input parameters, to search for all matching training plan offerings.

The following example searches for training plan offerings suitable for two ml.p5.48xlarge instances for 96 hours in the us-west-2 Region. In this example, we also filter for the time frame in which we want to use the training plan, and for training plans that can be used for SageMaker HyperPod cluster workloads, using the target-resources parameter:

# Required: instance type and instance count, target resources, region
# Optional: duration hours, start time after, and end time before.

aws sagemaker search-training-plan-offerings \
  --region "us-west-2" \
  --instance-type 'ml.p5.48xlarge' \
  --instance-count 2 \
  --target-resources 'hyperpod-cluster' \
  --duration-hours 96 \
  --start-time-after "2025-01-01T00:00:00" \
  --end-time-before "2025-12-31T23:59:59"

Each TrainingPlanOffering returned in the response is identified by a unique TrainingPlanOfferingId. The first offering in the list represents the best match for your requirements. In this case, the SageMaker SearchTrainingPlanOfferings API returns a single available TrainingPlanOffering that matches the specified capacity requirements:

{
    'TrainingPlanOfferings': [
      { 
          'TrainingPlanOfferingId': 'tpo-abc123',
          'TargetResources': ['hyperpod-cluster'],
          'RequestedStartTimeAfter': 
          datetime.datetime(2024, 11, 18, 11, 40, 47, 928000, tzinfo=tzlocal()),
          'DurationHours': 96,
          'DurationMinutes': 0,
          'Upfront': 'xx.yy',
          'CurrencyCode': 'USD',
          'ReservedCapacityOfferings': [
            {
                'InstanceType': 'ml.p5.48xlarge',
                'InstanceCount': 2,
                'AvailabilityZone': 'us-west-2a',
                'DurationHours': 96,
                'DurationMinutes': 0,
                'StartTime': datetime.datetime(2024, 11, 21, 3, 30, tzinfo=tzlocal()),
                'EndTime': datetime.datetime(2024, 11, 22, 3, 30, tzinfo=tzlocal())
            }
          ]
      }
    ]
}

Make sure that your SageMaker HyperPod cluster or training job subnets are in the same Availability Zone as your training plan.

  2. After you choose the training plan that best suits your schedule and requirements, you can reserve it by calling the CreateTrainingPlan API as follows:
# Required: training-plan-offering-id, training-plan-name
# Optional: target-services (leverages training-job by default)
aws sagemaker create-training-plan \
  --training-plan-offering-id "tpo-abc123" \
  --training-plan-name "p5-training-plan" \
  --region "us-west-2"

You will see an output that looks like the following:

{
    "TrainingPlanArn":"arn:aws:sagemaker:us-west-2:123456789123:training-plan/p5-training-plan"
}

After you create the training plan, you will have to pay the upfront amount. Be on the lookout for an invoice; you can also find it on the AWS Billing and Cost Management console.

  3. You can list all the training plans that are created in your AWS account (and Region) by calling the ListTrainingPlans API:
aws sagemaker list-training-plans

This will give you a summary of the training plans in your account. After you have your training plan (the newly created p5-training-plan), you can check its details using either the console or the DescribeTrainingPlan API as follows:

export TRAINING_PLAN="p5-training-plan"
TRAINING_PLAN_DESCRIPTION=$(aws sagemaker describe-training-plan --training-plan-name "$TRAINING_PLAN")
echo $TRAINING_PLAN_DESCRIPTION

# Picking out individual parameters from the DescribeTrainingPlan API
TRAINING_PLAN_ARN=$(echo "$TRAINING_PLAN_DESCRIPTION" | jq -r '.TrainingPlanArn')
AVAILABLE_INSTANCE_COUNT=$(echo "$TRAINING_PLAN_DESCRIPTION" | jq -r '.AvailableInstanceCount')
TOTAL_INSTANCE_COUNT=$(echo "$TRAINING_PLAN_DESCRIPTION" | jq -r '.TotalInstanceCount')

# Note: You may have multiple AZs for your TrainingPlans, so adjust the jq command below accordingly!
TRAINING_PLAN_AZ=$(echo "$TRAINING_PLAN_DESCRIPTION" | jq -r '.ReservedCapacitySummaries[0].AvailabilityZone')

Use a training plan with SageMaker HyperPod

When your training plan status transitions to Scheduled, you can use it for new instance groups in either a new or existing SageMaker HyperPod cluster. You can use the CreateCluster API to create a new SageMaker HyperPod cluster with your training plan, or the UpdateCluster API to add it to an existing cluster. You can also choose to directly use the SageMaker console.

For a given SageMaker HyperPod cluster, training plans are attached at the instance group level, separately for each instance group. If desired, one SageMaker HyperPod cluster can have one or more training plans attached to multiple instance groups. You always have the option to omit a training plan for other instance groups and continue using On-Demand capacity for them as before. However, you can’t mix training plan capacity with On-Demand capacity within the same instance group. You can also choose a partial cluster launch for every instance group, which means that even if all the requested capacity isn’t available, you can still spin up a cluster with the capacity that is already available to you.

A training plan is active during the time window in which the TrainingPlanOfferings within it are scheduled to start and stop. Each time a TrainingPlanOffering starts, instance groups automatically scale up to the specified count, and the instance group TrainingPlanStatus reflects as Active. When a TrainingPlanOffering is scheduled to stop, your cluster’s instance groups automatically scale down to zero, and the instance group TrainingPlanStatus reflects as Expired.

Use a training plan with SageMaker HyperPod on the console

You can choose to either create a new cluster and create an instance group, or edit an existing cluster and edit an existing instance group. In the configuration, choose the same instance type that was chosen for a training plan and specify the desired instance count. The Instance capacity option will appear only when you choose an instance type that is supported for training plans. Choose the dropdown menu to scroll through valid training plans. The available training plan selections are listed by name and are filtered for only those that match the chosen instance type, that have at least the specified instance count, that were created with hyperpod-cluster as the target resource, and currently have a status of Scheduled or Active. Double-check these conditions if you don’t see an expected training plan name, and make sure that the expected training plan was created in the same account and in the same Region. The default selection is to use no training plan. Repeat the process for each instance group that should have a training plan.

HyperPod console training plans

Figure 6: You can create an instance group for a SageMaker HyperPod cluster with the instances in your training plan. Make sure to choose the right training plan listed under “Instance capacity”

Use a training plan with SageMaker HyperPod with the AWS CLI

Complete the following steps to use your training plan with the AWS CLI:

  1. Create a SageMaker HyperPod cluster from scratch. For instructions, refer to the Amazon SageMaker HyperPod workshop or the Amazon EKS Support in Amazon SageMaker HyperPod workshop.

The following cluster configuration file defines a SageMaker HyperPod SLURM cluster named ml-cluster. The steps for using training plans are the same regardless of whether you choose SLURM or Amazon Elastic Kubernetes Service (Amazon EKS) as the orchestrator. This cluster contains an instance group named controller-machine with 1 ml.m5.12xlarge instance as the head node of a SLURM cluster, and it will not use a training plan for the controller-machine instance group. We also define a worker instance group named worker-group-1 that specifies 2 ml.p5.48xlarge instances, which will be sourced from your training plan. Note the line "TrainingPlanArn": this is where you specify your training plan by the full Amazon Resource Name (ARN). If you followed the steps in the prior sections, this should be the value of the environment variable TRAINING_PLAN_ARN. The following cluster configuration also skips some configuration parameters, such as VPCConfig and InstanceStorageConfig. Refer to the workshop or the following script for a complete SageMaker HyperPod cluster configuration file.

source env_vars
cat > cluster-config.json << EOF
{
    "ClusterName": "ml-cluster",
    "InstanceGroups": [
      {
          "InstanceGroupName": "controller-machine",
          "InstanceType": "ml.m5.12xlarge",
          "InstanceCount": 1,
          ...
      },
      {
        "InstanceGroupName": "worker-group-1",
        "InstanceType": "ml.p5.48xlarge",
        "InstanceCount": 2,
        "TrainingPlanArn": "<ENTER TRAINING PLAN ARN HERE>",         ...
      }
    ],
    ...
}
EOF

You can then create the cluster using the following code:

aws sagemaker create-cluster \
  --cli-input-json file://cluster-config.json \
  --region $AWS_REGION

These next steps assume that you already have a SageMaker HyperPod cluster created. This section is relevant if you’d like to add an instance group that uses your training plan reserved instances to your existing cluster.

  2. To update an existing cluster, you can define another file called update-cluster-config.json as follows. If you followed the instructions in the workshop to provision the cluster, you can use the provided create_config.sh to get the values for your env_vars before sourcing them.
# Source environment variables
source env_vars

# Create additional worker group configuration
additional_worker_group=$(cat <<EOF
{
    "InstanceGroupName": "worker-group-2",
    "InstanceType": "ml.p5.48xlarge",
    "InstanceCount": 2,
   "trainingPlan": "<ENTER TRAINING PLAN ARN HERE>"      ...
}
EOF
)

# Copy cluster-config.json to a temporary file
cp cluster-config.json temp-cluster-config.json

# Add additional worker group and remove VpcConfig section
jq --argjson additional_worker_group "$additional_worker_group" '.InstanceGroups += [$additional_worker_group] | del(.VpcConfig)' temp-cluster-config.json > update-cluster-config.json

# Remove the temporary file
rm temp-cluster-config.json

In this file, we define an additional worker group named worker-group-2 consisting of 2 ml.p5.48xlarge instances. Again, notice the line “TrainingPlanArn”—this is where you specify your training plan by the full ARN.

Make sure that you also update provisioning_parameters.json, and upload the updated file to your S3 bucket for SageMaker to use while provisioning the new worker group:

  3. Because this file is uploaded to Amazon Simple Storage Service (Amazon S3) for SageMaker to use while provisioning your cluster, you need to first copy that file over from Amazon S3:

aws s3 cp s3://${BUCKET}/src/provisioning_parameters.json provisioning_parameters.json

  4. Assuming your existing cluster has a controller machine group and a worker group with an ml.g5.48xlarge, you can add the following worker group entry to your existing provisioning_parameters.json file:
{
    ... 
    "controller_group": "controller-machine",
    "worker_groups": [
      {
          "instance_group_name": "worker-group-1",
          "partition_name": "ml.g5.48xlarge"
      },
 {        "instance_group_name": "worker-group-2",        "partition_name": "ml.p5.48xlarge"      }
    ],
    ...
}

This step adds in the new worker group that you just created, which consists of your 2 ml.p5.48xlarge nodes from your training plan.

  5. Now you can re-upload the updated provisioning_parameters.json file to Amazon S3:
# copy to the S3 Bucket
aws s3 cp provisioning_parameters.json s3://${BUCKET}/src/
  6. Now, with both cluster-config.json (now update-cluster-config.json) and provisioning_parameters.json updated, you can add the training plan nodes to the cluster:
aws sagemaker update-cluster \
  --cli-input-json file://update-cluster-config.json \
  --region $AWS_REGION

Use a training plan with a SageMaker training job

SageMaker training jobs offer two primary methods for execution: an AWS CLI command and the Python SDK. The AWS CLI approach provides direct control and is ideal for scripting, allowing you to create training jobs with a single command. The Python SDK offers a more programmatic interface, enabling seamless integration with existing Python workflows and using the high-level features in SageMaker. In this section, we look at how you can use a training plan with both options.

Run a training job on a training plan using the AWS CLI

The following example demonstrates how to create a SageMaker training job and associate it with a provided training plan by specifying the plan’s ARN in the ResourceConfig of the create-training-job AWS CLI command:

# Create a training job
aws sagemaker create-training-job \
  --training-job-name training-job-name \
  ...
  --resource-config '{
      "InstanceType": "ml.p5.48xlarge",
      "InstanceCount": 8,
      "VolumeSizeInGB": 10,
      "TrainingPlanArn": "Enter training plan arn"
  }' \
  ...

After creating the training job, you can verify that it was properly assigned to the training plan by calling the DescribeTrainingJob API:

aws sagemaker describe-training-job --training-job-name training-job-name

Run a training job on a training plan using the SageMaker Python SDK

The following example demonstrates how to create a SageMaker training job using the SageMaker Python SDK’s Training estimator. It also shows how to associate the job with a provided training plan by passing the plan’s ARN to the training_plan parameter of the estimator object when using the SageMaker Python SDK.

For more information on the SageMaker estimator, see Use a SageMaker estimator to run a training job.

Make sure the SageMaker Python SDK version is updated to the latest version.

# Create Estimator
estimator = Estimator(
    entry_point='train.py',
    image_uri="123456789123.dkr.ecr.{}.amazonaws.com/image:tag",
    role=role,
    instance_count=4,
    instance_type='ml.p5.48xlarge',
    training_plan="Enter training plan arn",
    ...
)

# Run the training job
estimator.fit(inputs=trainingInput, job_name=job_name)

After creating the training job, you can verify that it was properly assigned to the training plan by calling the DescribeTrainingJob API:

# Check job details
sagemaker_session.describe_training_job(TrainingJobName=job_name)

Clean up

To clean up your resources to avoid incurring more charges, complete the following steps:

  1. Delete the SageMaker HyperPod cluster and associated resources such as storage, VPC, and IAM roles.
    1. If using SLURM, refer to Cleanup.
    2. If using Amazon EKS, refer to Cleanup.
  2. Delete any S3 buckets created.
  3. Confirm that the training plan you created has been used and has completed its fulfillment lifecycle.

Conclusion

SageMaker training plans represent a significant leap forward in addressing the compute capacity challenges faced by organizations working with LLMs. By providing quick access to high-performance GPU resources, training plans streamline the process of model training and fine-tuning. This solution not only reduces wait times for cluster provisioning, but also offers flexibility in choosing between SageMaker training jobs and SageMaker HyperPod, catering to diverse organizational needs. Ultimately, SageMaker training plans empower businesses to overcome resource constraints and accelerate their AI initiatives, leading to more efficient and effective use of advanced language models across various industries.

To get started with a SageMaker training plan and explore its capabilities for your specific LLM training needs, refer to Reserve capacity with training plans and try out the step-by-step implementation guide provided in this post.

Special thanks to Fei Ge, Oscar Hsu, Takuma Yoshitani, and Yiting Li for their support in the launch of this post.


About the Authors

Aman Shanbhag is an Associate Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services, where he helps customers and partners with deploying ML Training and Inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in Computer Science, Mathematics, and Entrepreneurship.

Kanwaljit Khurmi is an AI/ML Principal Solutions Architect at Amazon Web Services. He works with AWS product teams, engineering, and customers to provide guidance and technical assistance for improving the value of their hybrid ML solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.

Sean Smith is a Sr Specialist Solution Architect at AWS for HPC and generative AI. Prior to that, Sean worked as a Software Engineer on AWS Batch and CfnCluster, becoming the first engineer on the team that created AWS ParallelCluster.

Ty Bergstrom is a Software Engineer at Amazon Web Services. He works on the HyperPod Clusters platform for Amazon SageMaker.


Amazon Bedrock Marketplace now includes NVIDIA models: Introducing NVIDIA Nemotron-4 NIM microservices


This post is co-written with Abhishek Sawarkar, Eliuth Triana, Jiahong Liu and Kshitiz Gupta from NVIDIA. 

At AWS re:Invent 2024, we are excited to introduce Amazon Bedrock Marketplace, a new capability within Amazon Bedrock that serves as a centralized hub for discovering, testing, and implementing foundation models (FMs). It provides developers and organizations access to an extensive catalog of over 100 popular, emerging, and specialized FMs, complementing the existing selection of industry-leading models in Amazon Bedrock. Bedrock Marketplace enables model subscription and deployment through managed endpoints, all while maintaining the simplicity of the Amazon Bedrock unified APIs.

The NVIDIA Nemotron family, available as NVIDIA NIM microservices, offers a cutting-edge suite of language models now available through Amazon Bedrock Marketplace, marking a significant milestone in AI model accessibility and deployment.

In this post, we discuss the advantages and capabilities of the Bedrock Marketplace and Nemotron models, and how to get started.

About Amazon Bedrock Marketplace

Bedrock Marketplace plays a pivotal role in democratizing access to advanced AI capabilities through several key advantages:

  • Comprehensive model selection – Bedrock Marketplace offers an exceptional range of models, from proprietary to publicly available options, allowing organizations to find the perfect fit for their specific use cases.
  • Unified and secure experience – By providing a single access point for all models through the Amazon Bedrock APIs, Bedrock Marketplace significantly simplifies the integration process. Organizations can use these models securely, and for models that are compatible with the Amazon Bedrock Converse API, you can use the robust toolkit of Amazon Bedrock, including Amazon Bedrock Agents, Amazon Bedrock Knowledge Bases, Amazon Bedrock Guardrails, and Amazon Bedrock Flows.
  • Scalable infrastructure – Bedrock Marketplace offers configurable scalability through managed endpoints, allowing organizations to select their desired number of instances, choose appropriate instance types, define custom auto scaling policies that dynamically adjust to workload demands, and optimize costs while maintaining performance.

About the NVIDIA Nemotron model family

At the forefront of the NVIDIA Nemotron model family is Nemotron-4, which NVIDIA describes as a powerful multilingual large language model (LLM) trained on an impressive 8 trillion text tokens and specifically optimized for English, multilingual, and coding tasks. Key capabilities include:

  • Synthetic data generation – Able to create high-quality, domain-specific training data at scale
  • Multilingual support – Trained on extensive text corpora, supporting multiple languages and tasks
  • High-performance inference – Optimized for efficient deployment on GPU-accelerated infrastructure
  • Versatile model sizes – Includes variants like the Nemotron-4 15B with 15 billion parameters
  • Open license – Offers a uniquely permissive open model license that gives enterprises a scalable way to generate and own synthetic data that can help build powerful LLMs

The Nemotron models offer transformative potential for AI developers by addressing critical challenges in AI development:

  • Data augmentation – Solve data scarcity problems by generating synthetic, high-quality training datasets
  • Cost-efficiency – Reduce manual data annotation costs and time-consuming data collection processes
  • Model training enhancement – Improve AI model performance through high-quality synthetic data generation
  • Flexible integration – Support seamless integration with existing AWS services and workflows, enabling developers to build sophisticated AI solutions more rapidly

These capabilities make Nemotron models particularly well-suited for organizations looking to accelerate their AI initiatives while maintaining high standards of performance and security.

Getting started with Bedrock Marketplace and Nemotron

To get started with Amazon Bedrock Marketplace, open the Amazon Bedrock console. From there, you can explore the Bedrock Marketplace interface, which offers a comprehensive catalog of FMs from various providers. You can browse through the available options to discover different AI capabilities and specializations. This exploration will lead you to NVIDIA’s model offerings, including Nemotron-4.

We walk you through these steps in the following sections.

Open Amazon Bedrock Marketplace

Navigating to Amazon Bedrock Marketplace is straightforward:

  1. On the Amazon Bedrock console, choose Model catalog in the navigation pane.
  2. Under Filters, select Bedrock Marketplace.

Upon entering Bedrock Marketplace, you’ll find a well-organized interface with various categories and filters to help you find the right model for your needs. You can browse by providers and modality.

  1. Use the search function to quickly locate specific providers, and explore models cataloged in Bedrock Marketplace.

Deploy NVIDIA Nemotron models

After you’ve located NVIDIA’s model offerings in Bedrock Marketplace, you can narrow down to the Nemotron model. To subscribe to and deploy Nemotron-4, complete the following steps:

  1. Filter by Nemotron under Providers or search by model name.
  2. Choose from the available models, such as Nemotron-4 15B.

On the model details page, you can examine its specifications, capabilities, and pricing details. The Nemotron-4 model offers impressive multilingual and coding capabilities.

  1. Choose View subscription options to subscribe to the model.
  2. Review the available options and choose Subscribe.
  3. Choose Deploy and follow the prompts to configure your deployment options, including instance types and scaling policies.

The process is user-friendly, allowing you to quickly integrate these powerful AI capabilities into your projects using the Amazon Bedrock APIs.
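
As a quick illustration, the following sketch invokes a deployed Nemotron endpoint through the Amazon Bedrock Converse API using Boto3. The endpoint ARN placeholder and inference parameters are illustrative; use the deployment ARN shown on your endpoint’s details page.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Placeholder: ARN of the endpoint created by your Bedrock Marketplace deployment
endpoint_arn = "arn:aws:sagemaker:<region>:<account-id>:endpoint/<nemotron-endpoint-name>"

response = bedrock_runtime.converse(
    modelId=endpoint_arn,
    messages=[{"role": "user", "content": [{"text": "Write a short product description for a hiking backpack."}]}],
    inferenceConfig={"maxTokens": 256, "temperature": 0.2, "topP": 0.9},
)
print(response["output"]["message"]["content"][0]["text"])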

Conclusion

The launch of NVIDIA Nemotron models on Amazon Bedrock Marketplace marks a significant milestone in making advanced AI capabilities more accessible to developers and organizations. Nemotron-4 15B, with its impressive 15-billion-parameter architecture trained on 8 trillion text tokens, brings powerful multilingual and coding capabilities to Amazon Bedrock.

Through Bedrock Marketplace, organizations can use Nemotron’s advanced capabilities while benefiting from the scalable infrastructure of AWS and NVIDIA’s robust technologies. We encourage you to start exploring the capabilities of NVIDIA Nemotron models today through Amazon Bedrock Marketplace, and experience firsthand how this powerful language model can transform your AI applications.


About the authors

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.

Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Abhishek Sawarkar is a product manager in the NVIDIA AI Enterprise team working on integrating NVIDIA AI Software in Cloud MLOps platforms. He focuses on integrating the NVIDIA AI end-to-end stack within Cloud platforms & enhancing user experience on accelerated computing.

Eliuth Triana is a Developer Relations Manager at NVIDIA empowering Amazon’s AI MLOps, DevOps, Scientists and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing Generative AI Foundation models spanning from data curation, GPU training, model inference and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, tennis and poker player.

Jiahong Liu is a Solutions Architect on the Cloud Service Provider team at NVIDIA. He assists clients in adopting machine learning and AI solutions that leverage NVIDIA-accelerated computing to address their training and inference challenges. In his leisure time, he enjoys origami, DIY projects, and playing basketball.

Kshitiz Gupta is a Solutions Architect at NVIDIA. He enjoys educating cloud customers about the GPU AI technologies NVIDIA has to offer and assisting them with accelerating their machine learning and deep learning applications. Outside of work, he enjoys running, hiking, and wildlife watching.

Read More

Real value, real time: Production AI with Amazon SageMaker and Tecton

Real value, real time: Production AI with Amazon SageMaker and Tecton

This post is cowritten with Isaac Cameron and Alex Gnibus from Tecton.

Businesses are under pressure to show return on investment (ROI) from AI use cases, whether predictive machine learning (ML) or generative AI. Only 54% of ML prototypes make it to production, and only 5% of generative AI use cases make it to production.

ROI isn’t just about getting to production—it’s about model accuracy and performance. You need a scalable, reliable system with high accuracy and low latency for the real-time use cases that directly impact the bottom line every millisecond.

Fraud detection, for example, requires extremely low latency because decisions need to be made in the time it takes to swipe a credit card. With fraud on the rise, more organizations are pushing to implement successful fraud detection systems. The US nationwide fraud losses topped $10 billion in 2023, a 14% increase from 2022. Global ecommerce fraud is predicted to exceed $343 billion by 2027.

But building and managing an accurate, reliable AI application that can make a dent in that $343 billion problem is overwhelmingly complex.

ML teams often start by manually stitching together different infrastructure components. It seems straightforward at first for batch data, but the engineering gets even more complicated when you need to go from batch data to incorporating real-time and streaming data sources, and from batch inference to real-time serving.

Engineers need to build and orchestrate the data pipelines, juggle the different processing needs for each data source, manage the compute infrastructure, build reliable serving infrastructure for inference, and more. Without the capabilities of Tecton, the architecture might look like the following diagram.

Figure 1: Diagram depicting the different aspects of a typical AI Systems Architecture

Accelerate your AI development and deployment with Amazon SageMaker and Tecton

All that manual complexity gets simplified with Tecton and Amazon SageMaker. Together, Tecton and SageMaker abstract away the engineering needed for production, real-time AI applications. This enables faster time to value, and engineering teams can focus on building new features and use cases instead of struggling to manage the existing infrastructure.

Using SageMaker, you can build, train and deploy ML models. Meanwhile, Tecton makes it straightforward to compute, manage, and retrieve features to power models in SageMaker, both for offline training and online serving. This streamlines the end-to-end feature lifecycle for production-scale use cases, resulting in a simpler architecture, as shown in the following diagram.

Figure 2: Depicting high level architecture of Tecton & SageMaker showing end-to-end feature lifecycle

How does it work? With Tecton’s simple-to-use declarative framework, you define the transformations for your features in a few lines of code, and Tecton builds the pipelines needed to compute, manage, and serve the features. Tecton takes care of the full deployment into production and online serving.

It doesn’t matter if it’s batch, streaming, or real-time data or whether it’s offline or online serving. It’s one common framework for every data processing need in end-to-end feature production.

This framework creates a central hub for feature management and governance with enterprise feature store capabilities, making it straightforward to observe the data lineage for each feature pipeline, monitor data quality, and reuse features across multiple models and teams.

The following diagram shows the Tecton declarative framework.

Figure 3: Depicting Tecton’s Declarative framework to do feature engineering and serving

The next section examines a fraud detection example to show how Tecton and SageMaker accelerate both training and real-time serving for a production AI system.

Streamline feature development and model training

First, you need to develop the features and train the model. Tecton’s declarative framework makes it simple to define features and generate accurate training data for SageMaker models:

  • Experiment and iterate on features in SageMaker notebooks – You can use Tecton’s software development kit (SDK) to interact with Tecton directly through SageMaker notebook instances, enabling flexible experimentation and iteration without leaving the SageMaker environment.
  • Orchestrate with Tecton-managed EMR clusters – After features are deployed, Tecton automatically creates the scheduling, provisioning, and orchestration needed for pipelines that can run on Amazon EMR compute engines. You can view and create EMR clusters directly through the SageMaker notebook.
  • Generate accurate training data for SageMaker models – For model training, data scientists can use Tecton’s SDK within their SageMaker notebooks to retrieve historical features. The same code is used to backfill the offline store and continually update the online store, reducing training/serving skew.

Next, the features need to be served online for the final model to consume in production.

Serve features with robust, real-time online inference

Tecton’s declarative framework extends to online serving. Tecton’s real-time infrastructure is designed to help meet the demands of extensive applications and can reliably run 100,000 requests per second.

For critical ML apps, it’s hard to meet demanding service level agreements (SLAs) in a scalable and cost-efficient manner. Real-time use cases such as fraud detection typically have a p99 latency budget between 100 and 200 milliseconds. That means 99% of requests need to be faster than 200 ms for the end-to-end process, from feature retrieval to model scoring and post-processing.

Feature serving only gets a fraction of that end-to-end latency budget, which means you need your solution to be especially quick. Tecton accommodates these latency requirements by integrating with both disk-based and in-memory data stores, supporting in-memory caching, and serving features for inference through a low-latency REST API, which integrates with SageMaker endpoints.

Now we can complete our fraud detection use case. In a fraud detection system, when someone makes a transaction (such as buying something online), your app might follow these steps:

  1. It checks with other services to get more information (for example, “Is this merchant known to be risky?”) from third-party APIs
  2. It pulls important historical data about the user and their behavior (for example, “How often does this person usually spend this much?” or “Have they made purchases from this location before?”), requesting the ML features from Tecton
  3. It will likely use streaming features to compare the current transaction with recent spending activity over the last few hours or minutes
  4. It sends all this information to the model hosted on Amazon SageMaker that predicts whether the transaction looks fraudulent.

This process is shown in the following diagram.

Figure 4: Steps performed in a Fraud Detection System when someone makes a transaction
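
As a rough sketch of steps 2–4, the application code might fetch features over Tecton’s HTTP feature-serving interface and pass them to a SageMaker endpoint. The URL, authorization header, feature service name, payload and response shapes, and endpoint name below are hypothetical placeholders rather than Tecton’s actual API contract.

import json
import boto3
import requests

# Hypothetical placeholders; substitute your Tecton cluster URL, API key,
# feature service name, and SageMaker endpoint name
TECTON_URL = "https://<your-cluster>.tecton.ai/api/v1/feature-service/get-features"
TECTON_API_KEY = "<tecton-service-account-key>"
SAGEMAKER_ENDPOINT = "<fraud-model-endpoint>"

sm_runtime = boto3.client("sagemaker-runtime")

def score_transaction(user_id: str, amount: float) -> float:
    # Steps 2 and 3: fetch historical and streaming features for this user
    feature_response = requests.post(
        TECTON_URL,
        headers={"Authorization": f"Tecton-key {TECTON_API_KEY}"},
        json={
            "params": {
                "feature_service_name": "fraud_detection_feature_service",
                "join_key_map": {"user_id": user_id},
                "request_context_map": {"amount": amount},
            }
        },
        timeout=0.2,  # keep feature retrieval well within the latency budget
    )
    features = feature_response.json()["result"]["features"]  # response shape is illustrative

    # Step 4: send the assembled feature vector to the fraud model hosted on SageMaker
    prediction = sm_runtime.invoke_endpoint(
        EndpointName=SAGEMAKER_ENDPOINT,
        ContentType="application/json",
        Body=json.dumps({"features": features + [amount]}),
    )
    return json.loads(prediction["Body"].read())["fraud_probability"]  # response key is illustrative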

Expand to generative AI use cases with your existing AWS and Tecton architecture

After you’ve developed ML features using the Tecton and AWS architecture, you can extend your ML work to generative AI use cases.

For instance, in the fraud detection example, you might want to add an LLM-powered customer support chat that helps a user answer questions about their account. To generate a useful response, the chat would need to reference different data sources, including the unstructured documents in your knowledge base (such as policy documentation about what causes an account suspension) and structured data such as transaction history and real-time account activity.

If you’re using a Retrieval Augmented Generation (RAG) system to provide context to your LLM, you can use your existing ML feature pipelines as context. With Tecton, you can either enrich your prompts with contextual data or provide features as tools to your LLM—all using the same declarative framework.

To choose and customize the model that will best suit your use case, Amazon Bedrock provides a range of pre-trained foundation models (FMs) for inference, or you can use SageMaker for more extensive model building and training.
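
As a small, hypothetical sketch (not Tecton’s actual API), the chat backend could fold retrieved feature values and a policy snippet into the prompt and call a foundation model through the Amazon Bedrock Converse API; the model ID and context values are placeholders.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Placeholder context; in practice these values would come from your
# Tecton feature pipelines and knowledge base retrieval
account_features = {"recent_transaction_count_24h": 3, "account_status": "active"}
policy_snippet = "Accounts are suspended after three failed verification attempts."

prompt = (
    "You are a customer support assistant for a payments app.\n"
    f"Account context: {account_features}\n"
    f"Relevant policy: {policy_snippet}\n"
    "Customer question: Why was my last transaction declined?"
)

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model ID; choose any Bedrock FM
    messages=[{"role": "user", "content": [{"text": prompt}]}],
    inferenceConfig={"maxTokens": 300, "temperature": 0.2},
)
print(response["output"]["message"]["content"][0]["text"])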

The following graphic shows how Amazon Bedrock is incorporated to support generative AI capabilities in the fraud detection system architecture.

Figure 5: Incorporating Bedrock to support Gen-AI capabilities to the fraud detection system architecture

Build valuable AI apps faster with AWS and Tecton

In this post, we walked through how SageMaker and Tecton enable AI teams to train and deploy a high-performing, real-time AI application—without the complex data engineering work. Tecton combines production ML capabilities with the convenience of doing everything from within SageMaker, whether that’s at the development stage for training models or doing real-time inference in production.

To get started, refer to Getting Started with Amazon SageMaker & Tecton’s Feature Platform, a more detailed guide on how to use Tecton with Amazon SageMaker. And if you can’t wait to try it yourself, check out the Tecton interactive demo and observe a fraud detection use case in action.

You can also find Tecton at AWS re:Invent. Reach out to set up a meeting with experts onsite about your AI engineering needs.


About the Authors

Isaac Cameron is Lead Solutions Architect at Tecton, guiding customers in designing and deploying real-time machine learning applications. Having previously built a custom ML platform from scratch at a major U.S. airline, he brings firsthand experience of the challenges and complexities involved—making him a strong advocate for leveraging modern, managed ML/AI infrastructure.

Alex Gnibus is a technical evangelist at Tecton, making technical concepts accessible and actionable for engineering teams. Through her work educating practitioners, Alex has developed deep expertise in identifying and addressing the practical challenges teams face when productionizing AI systems.

Arnab Sinha is a Senior Solutions Architect at AWS, specializing in designing scalable solutions that drive business outcomes in AI, machine learning, big data, digital transformation, and application modernization. With expertise across industries like energy, healthcare, retail, and manufacturing, Arnab holds all AWS Certifications, including the ML Specialty, and has led technology and engineering teams before joining AWS.

Read More

Use Amazon Bedrock tooling with Amazon SageMaker JumpStart models

Use Amazon Bedrock tooling with Amazon SageMaker JumpStart models

Today, we’re excited to announce a new capability that allows you to deploy over 100 open-weight and proprietary models from Amazon SageMaker JumpStart and register them with Amazon Bedrock, so you can seamlessly access them through the powerful Amazon Bedrock APIs. You can now use Amazon Bedrock features such as Amazon Bedrock Knowledge Bases and Amazon Bedrock Guardrails with models deployed through SageMaker JumpStart.

SageMaker JumpStart helps you get started with machine learning (ML) by providing fully customizable solutions and one-click deployment and fine-tuning of more than 400 popular open-weight and proprietary generative AI models. Amazon Bedrock is a fully managed service that provides a single API to access and use various high-performing foundation models (FMs). It also offers a broad set of capabilities to build generative AI applications. The Amazon Bedrock Converse API is a runtime API that provides a consistent interface that works with different models. It allows you to use advanced features in Amazon Bedrock such as the playground, guardrails, and tool use (function calling).

SageMaker JumpStart has long been the go-to service for developers and data scientists seeking to deploy state-of-the-art generative AI models. Through this integration, you can now combine the flexibility of hosting models on SageMaker JumpStart with the fully managed experience of Amazon Bedrock, including advanced security controls, scalable infrastructure, and comprehensive monitoring capabilities.

In this post, we show you how to deploy FMs through SageMaker JumpStart, register them with Amazon Bedrock, and invoke them using Amazon Bedrock APIs.

Solution overview

The Converse API standardizes interaction with Amazon Bedrock FMs, enabling developers to write code one time and use it across various models without needing to adjust for model-specific differences. It supports multi-turn conversations through conversational history as part of the API request, and developers can perform tasks that require access to external APIs through the usage of tools (function calling). Additionally, the Converse API allows you to block inappropriate inputs or generated content by including a guardrail in your API calls. To review the complete list of supported models and model features, refer to Supported models and model features.

This new feature extends the capabilities of the Converse API into a single interface that developers can use to call FMs deployed in SageMaker JumpStart. This allows developers to use the same API to invoke models from Amazon Bedrock and SageMaker JumpStart, streamlining the process to integrate models into their generative AI applications. Now you can build on top of an even larger library of world-class open source and proprietary models through a single API. To view the full list of Bedrock Ready models available from SageMaker JumpStart, refer to the Bedrock Marketplace documentation. You can also use Amazon Bedrock Marketplace to discover and deploy these models to SageMaker endpoints.

In this post, we walk through the following steps:

  1. Deploy the Gemma 2 9B Instruct model using SageMaker JumpStart.
  2. Register the model with Amazon Bedrock.
  3. Test the model with sample prompts using the Amazon Bedrock playground.
  4. Use the Amazon Bedrock RetrieveAndGenerate API to query the Amazon Bedrock knowledge base.
  5. Set up Amazon Bedrock Guardrails to help block harmful content and personally identifiable information (PII) data.
  6. Invoke models with Converse APIs to show an end-to-end Retrieval Augmented Generation (RAG) pipeline.

Prerequisites

You can access and deploy pretrained models from SageMaker JumpStart in the Amazon SageMaker Studio UI. To access SageMaker Studio on the AWS Management Console, you need to set up an Amazon SageMaker domain. SageMaker uses domains to organize user profiles, applications, and their associated resources. To create a domain and set up a user profile, refer to Guide to getting set up with Amazon SageMaker.

You also need an AWS Identity and Access Management (IAM) role with appropriate permissions. To get started with this example, you can use the AmazonSageMakerFullAccess, AmazonBedrockFullAccess, and AmazonOpenSearchAccess managed policies to provide the required permissions to SageMaker JumpStart and Amazon Bedrock. For a more scoped-down set of permissions, refer to the following policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BedrockEndpointManagementMutatingOperations",
      "Action": [
        "sagemaker:AddTags",
        "sagemaker:CreateEndpoint",
        "sagemaker:CreateEndpointConfig",
        "sagemaker:CreateModel",
        "sagemaker:DeleteEndpoint",
        "sagemaker:UpdateEndpoint",
        "sagemaker:DeleteTags"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:sagemaker:*",
      "Condition": {
        "StringEquals": {
            "aws:ViaAWSService": "bedrock.amazonaws.com"
        }
       }
    },
    {
      "Sid": "BedrockEndpointManagementNonMutatingOperations",
      "Action": [
        "sagemaker:DescribeEndpoint",
        "sagemaker:DescribeEndpointConfig",
        "sagemaker:DescribeModel",
        "sagemaker:ListEndpoints",
        "sagemaker:ListTags"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:sagemaker:*",
      "Condition": {
        "StringEquals": {
            "aws:ViaAWSService": "bedrock.amazonaws.com"
        }
       }
    },
    {
      "Sid": "BedrockEndpointInvokingOperations",
      "Action": [
        "sagemaker:InvokeEndpoint",
        "sagemaker:InvokeEndpointWithResponseStream"      
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:sagemaker:*",
      "Condition": {
        "StringEquals": {
            "aws:ViaAWSService": "bedrock.amazonaws.com"
         }
       }
    },
    {
      "Sid": "AllowDiscoveringPublicModelDetails",
      "Action": [
        "sagemaker:DescribeHubContent"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:sagemaker:*:aws:hub-content/SageMakerPublicHub/Model/*"
    },
    {
      "Sid": "AllowListingPublicModels",
      "Action": [
        "sagemaker:ListHubContents"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:sagemaker:*:aws:hub/SageMakerPublicHub"
    },
    {
      "Sid": "RetrieveSubscribedMarketplaceLicenses",
      "Action": [
        "license-manager:ListReceivedLicenses"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:license-manager:*"
    },
    {
      "Sid" : "PassRoleToSagemaker",
      "Effect" : "Allow",
      "Action" : [
        "iam:PassRole"
      ],
      "Resource" : "arn:aws:iam::*:role/*AmazonSageMaker*",
      "Condition" : {
        "StringEquals" : {
        "iam:PassedToService" : [
            "sagemaker.amazonaws.com"
          ]
        }
      }
    },
    {
      "Sid" : "BedrockAll",
      "Effect" : "Allow",
      "Action" : [ "bedrock:*" ],
      "Resource" : "*" 
    },
    {
      "Sid" : "AmazonOpenSearchAccess",
      "Effect" : "Allow",
      "Action" : [ "aoss:*" ],
      "Resource" : "*",
      "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
      }
    }
  ]
}

After applying the relevant permissions, setting up a SageMaker domain, and creating user profiles, you are ready to deploy your SageMaker JumpStart model and register it with Amazon Bedrock.

Deploy a model with SageMaker JumpStart and register it with Amazon Bedrock

This section provides a walkthrough of deploying a model using SageMaker JumpStart and registering it with Amazon Bedrock. In this walkthrough, you will deploy and register the Gemma 2 9B Instruct model offered through Hugging Face in SageMaker JumpStart. Complete the following steps:

  1. On the SageMaker console, choose Studio in the navigation pane.
  2. Choose the relevant user profile on the dropdown menu and choose Open Studio.

  1. In SageMaker Studio, choose JumpStart in the navigation pane.

Here, you will see a list of the available SageMaker JumpStart models. Models that can be registered to Amazon Bedrock after they’ve been deployed through SageMaker JumpStart have a Bedrock ready tag.

  1. The Gemma 2 9B Instruct model for this example is provided by Hugging Face, so choose the Hugging Face model card.

  1. To filter the list of models to view which models are supported by Amazon Bedrock, select Bedrock Ready under Action.
  2. Search for Gemma 2 9B Instruct and choose the model card for Gemma 2 9B Instruct.

You can review the model card for Gemma 2 9B Instruct to learn more about the model.

  1. To deploy the model, choose Deploy.
  2. Review the End User License Agreement for Gemma 2 9B Instruct and select I accept the End User License Agreement (EULA) and read the terms and conditions.
  3. Leave the endpoint settings with their default values and choose Deploy.

The endpoint deployment process will take a few minutes.

  1. Under Deployments in the navigation pane, choose Endpoints to view your available endpoints.

After a few minutes, the model will be deployed to the endpoint and its status will change to In service, indicating that the endpoint is ready to serve traffic. You can use the Refresh icon at the bottom of the endpoint screen to get the latest information.

  1. When your endpoint is in service, choose it to go to the endpoint details page.

  1. Choose Use with Bedrock to start the registration process.

You will be redirected to the Amazon Bedrock console.

  1. On the Register endpoint page, the SageMaker endpoint Amazon Resource Name (ARN) and model ARN have already been prepopulated. Review these values and choose Register.

Your SageMaker endpoint will be registered with Amazon Bedrock in a few minutes.

After your SageMaker endpoint is registered with Amazon Bedrock, you can invoke it using the Converse API. Then you can test your endpoint in the Amazon Bedrock playground.

  1. In the navigation pane on the Amazon Bedrock console, choose Marketplace deployments under Foundation models.
  2. From the list of managed deployments, select your registered model, then choose Open in playground.

You will now be in the Amazon Bedrock playground for Chat/text. The Chat/text playground allows you to test your model with a single prompt or use the chat capability for conversational use cases. Because this example is an interactive chat session, leave Mode set to the default, Chat. The playground is now set up to chat with your Gemma 2 9B Instruct model.

Now you can test your SageMaker endpoint through Amazon Bedrock! Use the following prompt to test summarizing a meeting transcript, and review the results:

Meeting transcript:
Miguel: Hi Brant, I want to discuss the workstream for our new product launch
Brant: Sure Miguel, is there anything in particular you want to discuss?
Miguel: Yes, I want to talk about how users enter into the product.
Brant: Ok, in that case let me add in Namita.
Namita: Hey everyone
Brant: Hi Namita, Miguel wants to discuss how users enter into the product.
Miguel: its too complicated and we should remove friction.  for example, why do I need to fill out additional forms?  I also find it difficult to find where to access the product when I first land on the landing page.
Brant: I would also add that I think there are too many steps.
Namita: Ok, I can work on the landing page to make the product more discoverable but brant can you work on the additonal forms?
Brant: Yes but I would need to work with James from another team as he needs to unblock the sign up workflow.  Miguel can you document any other concerns so that I can discuss with James only once?
Miguel: Sure.

From the meeting transcript above, Create a list of action items for each person.
  1. Enter the prompt into the playground, then choose Run.

You can view the response in the chat generated by your deployed SageMaker JumpStart model through Amazon Bedrock:

Here's a breakdown of action items from the meeting transcript:

**Miguel:**

* **Document:** List out any additional concerns regarding user entry into the product. Share these with Brant for his discussion with James.

**Brant:**

* **Collaborate with James:**  Work with James from another team to simplify the additional forms involved in the user sign-up workflow.
* **Review Documentation:** Review Miguel's documented concerns about user entry to prepare for the discussion with James.

**Namita:**

* **Landing Page Redesign:**  Improve the landing page to make the product more discoverable for new users.

Let me know if you'd like me to elaborate on any of these action items!

You can also test the model with your own prompts and use cases.

Use Amazon Bedrock APIs with the deployed model

This section demonstrates using the AWS SDK for Python (Boto3) and Converse APIs to invoke the Gemma 2 9B Instruct model you deployed earlier through SageMaker and registered with Amazon Bedrock. The full source code associated with this post is available in the accompanying GitHub repo. For additional Converse API examples, refer to Converse API examples.

In this code sample, we also implement a RAG architecture in conjunction with the deployed model. RAG is the process of optimizing the output of a large language model (LLM) so it references an authoritative knowledge base outside of its training data sources before generating a response.

Use the deployed SageMaker model with the RetrieveAndGenerate API offered by Amazon Bedrock to query a knowledge base and generate responses based on the retrieved results. The response only cites sources that are relevant to the query. For information on creating a Knowledge Base, refer to Creating a Knowledge Base. For additional code samples, refer to RetrieveAndGenerate.

The following diagram illustrates the RAG workflow.

Complete the following steps:

  1. To invoke the deployed model, you need to pass the endpoint ARN of the deployed model in the modelId parameter of the Converse API.

To obtain the ARN of the deployed model, navigate to the Amazon Bedrock console. In the navigation pane, choose Marketplace deployments under Foundation models. From the list of managed deployments, choose your registered model to view more details.

You will be directed to the model summary on the Model catalog page under Foundation models. Here, you will find the details associated with your deployed model. Copy the model ARN to use in the following code sample.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Add your bedrock endpoint arn here.
endpoint_arn = "arn:aws:sagemaker:<AWS::REGION>:<AWS::AccountId>:endpoint/<Endpoint_Name>"

# Base inference parameters to use.
inference_config = {
        "maxTokens": 256,
        "temperature": 0.1,
        "topP": 0.999,
}

# Additional inference parameters to use.
additional_model_fields = {"parameters": {"repetition_penalty": 0.9, "top_k": 250, "do_sample": True}}


response = bedrock_runtime.converse(
    modelId=endpoint_arn,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "text": "What is Amazon doing in the field of generative AI?",
                },
            ]
        },
    ],
    inferenceConfig=inference_config,
    additionalModelRequestFields=additional_model_fields,
)
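
The generated text can then be read from the Converse response structure, for example:

# Print the model's reply from the Converse response
print(response["output"]["message"]["content"][0]["text"])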
  1. Invoke the SageMaker JumpStart model with the RetrieveAndGenerate API. The generation_template and orchestration_template parameters in the retrieve_and_generate API are model specific. These templates define the prompts and instructions for the language model, guiding the generation process and the integration with the knowledge retrieval component.
import boto3

bedrock_agent_runtime_client = boto3.client("bedrock-agent-runtime")

# Provide your Knowledge Base Id 
kb_id = "" 

response = bedrock_agent_runtime_client.retrieve_and_generate(
    input={
        "text": "What is Amazon doing in the field of generative AI?"
    },
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "generationConfiguration": {
                "inferenceConfig": {
                    "textInferenceConfig": {
                        "maxTokens": 512,
                        "temperature": 0.1,
                        "topP": 0.9
                    }
                },
                "promptTemplate": {
                    "textPromptTemplate": generation_template
                }
            },
            "knowledgeBaseId": kb_id,
            "orchestrationConfiguration": {
                "inferenceConfig": {
                    "textInferenceConfig": {
                        "maxTokens": 512,
                        "temperature": 0.1,
                        "topP": 0.9
                    }
                },
                "promptTemplate": {
                    "textPromptTemplate": orchestration_template
                },
            },
            "modelArn": endpoint_arn,
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults":5
                } 
            }
        }
    }
)

Now you can implement guardrails with the Converse API for your SageMaker JumpStart model. Amazon Bedrock Guardrails enables you to implement safeguards for your generative AI applications based on your use cases and responsible AI policies. For information on creating guardrails, refer to Create a Guardrail. For additional code samples to implement guardrails, refer to Include a guardrail with Converse API.

  1. In the following code sample, you include a guardrail in a Converse API request invoking a SageMaker JumpStart model:
import boto3

bedrock_agent_runtime_client = boto3.client("bedrock-agent-runtime")

# Provide your Knowledge Base Id
kb_id = ""

# Provide the ID and version of the guardrail created for this walkthrough
guardrail_identifier = ""
guardrail_version = ""

relevant_documents = bedrock_agent_runtime_client.retrieve(
    retrievalQuery= {
        "text": "What is Amazon doing in the field of generative AI?"
    },
    knowledgeBaseId=kb_id,
    retrievalConfiguration= {
        "vectorSearchConfiguration": {
            "numberOfResults": 1
        }
    }
)

def invoke_model(prompt, source, inference_config=None, additional_model_fields=None):
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "guardContent": {
                        "text": {
                            "text": source,
                            "qualifiers": ["grounding_source"],
                        }
                    }
                },
                {
                    "guardContent": {
                        "text": {
                            "text": prompt,
                            "qualifiers": ["query"],
                        }
                    }
                },
            ],
        }
    ]
    if not inference_config:
        # Base inference parameters to use.
        inference_config = {
                "maxTokens": 256,
                "temperature": 0.1,
                "topP": 0.999,
        }
    
    if not additional_model_fields:
        # Additional inference parameters to use.
        additional_model_fields = {"parameters": {"repetition_penalty": 0.9, "top_k": 250, "do_sample": True}}


    response = bedrock_runtime.converse(
        modelId=endpoint_arn,
        messages=messages,
        inferenceConfig=inference_config,
        additionalModelRequestFields=additional_model_fields,
        guardrailConfig={
            'guardrailIdentifier': guardrail_identifier,
            'guardrailVersion': guardrail_version
        },
    )
    
    return response["output"]["message"]["content"][0]["text"]

invoke_model(prompt="What is Amazon doing in the field of generative AI?", source=relevant_documents["retrievalResults"][0]["content"]["text"]) 
# Content is Blocked 
invoke_model(prompt="Should I buy bitcoin?", source=relevant_documents["retrievalResults"][0]["content"]["text"])

Clean up

To clean up your resources, use the following code:

import boto3

from knowledge_base import KnowledgeBasesForAmazonBedrock

bedrock = boto3.client("bedrock")

# Delete the knowledge base created for this walkthrough, along with its S3 bucket and IAM roles
kb = KnowledgeBasesForAmazonBedrock()
kb.delete_kb(knowledge_base_name, delete_s3_bucket=True, delete_iam_roles_and_policies=True)

# Delete the guardrail
bedrock.delete_guardrail(guardrailIdentifier=guardrail_identifier)

The SageMaker JumpStart model you deployed will incur cost if you leave it running. Delete the endpoint if you want to stop incurring charges. Deleting the endpoint will also de-register the model from Amazon Bedrock. For more details, see Delete Endpoints and Resources.
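
As a reference, a minimal cleanup sketch with the SageMaker Boto3 client looks like the following; the endpoint name is a placeholder, and the endpoint config and model names typically match the endpoint name for JumpStart deployments but may differ in your account.

import boto3

sm_client = boto3.client("sagemaker")

endpoint_name = "<your-jumpstart-endpoint-name>"  # placeholder

# Deleting the endpoint stops billing; also remove the endpoint config and model
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_name)
sm_client.delete_model(ModelName=endpoint_name)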

Conclusion

In this post, you learned how to deploy FMs through SageMaker JumpStart, register them with Amazon Bedrock, and invoke them using Amazon Bedrock APIs. With this new capability, organizations can access leading proprietary and open-weight models using a single API, reducing the complexity of building generative AI applications with a variety of models. This integration between SageMaker JumpStart and Amazon Bedrock is generally available in all AWS Regions where Amazon Bedrock is available. Try this code to use the Converse API, Amazon Bedrock Knowledge Bases, and Amazon Bedrock Guardrails with your SageMaker JumpStart models.


About the Author

Vivek Gangasani is a Senior GenAI Specialist Solutions Architect at AWS. He helps emerging GenAI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of Large Language Models. In his free time, Vivek enjoys hiking, watching movies and trying different cuisines.

Abhishek Doppalapudi is a Solutions Architect at Amazon Web Services (AWS), where he assists startups in building and scaling their products using AWS services. Currently, he is focused on helping AWS customers adopt Generative AI solutions. In his free time, Abhishek enjoys playing soccer, watching Premier League matches, and reading.

June Won is a product manager with Amazon SageMaker JumpStart. He focuses on making foundation models easily discoverable and usable to help customers build generative AI applications. His experience at Amazon also includes mobile shopping applications and last mile delivery.

Eashan Kaushik is an Associate Solutions Architect at Amazon Web Services. He is driven by creating cutting-edge generative AI solutions while prioritizing a customer-centric approach to his work. Before this role, he obtained an MS in Computer Science from NYU Tandon School of Engineering. Outside of work, he enjoys sports, lifting, and running marathons.

Giuseppe Zappia is a Principal AI/ML Specialist Solutions Architect at AWS, focused on helping large enterprises design and deploy ML solutions on AWS. He has over 20 years of experience as a full stack software engineer, and has spent the past 5 years at AWS focused on the field of machine learning.

Bhaskar Pratap is a Senior Software Engineer with the Amazon SageMaker team. He is passionate about designing and building elegant systems that bring machine learning to people’s fingertips. Additionally, he has extensive experience with building scalable cloud storage services.

Read More

A guide to Amazon Bedrock Model Distillation (preview)

A guide to Amazon Bedrock Model Distillation (preview)

When using generative AI, achieving high performance with low latency models that are cost-efficient is often a challenge, because these goals can clash with each other. With the newly launched Amazon Bedrock Model Distillation feature, you can use smaller, faster, and cost-efficient models that deliver use-case specific accuracy that is comparable to the largest and most capable models in Amazon Bedrock for those specific use cases.

Model distillation is the process of transferring knowledge from a more capable, advanced model (the teacher) to a smaller, faster, and more cost-efficient model (the student), with the goal of making the student as performant as the teacher for a specific use case. To transfer knowledge, your use-case specific prompts are first used to generate responses from the teacher model, and then the teacher responses are used to fine-tune the student model.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) along with a broad set of capabilities to build generative AI applications, simplifying development with security, privacy, and responsible AI. With Amazon Bedrock Model Distillation, you can now customize models for your use case using synthetic data generated by highly capable models. At preview, Amazon Bedrock Model Distillation offers support for three model providers: Amazon, Anthropic, and Meta. The teacher and student models should be from the same model provider.

This post introduces the workflow of Amazon Bedrock Model Distillation. We first introduce the general concept of model distillation in Amazon Bedrock, and then focus on the important steps in model distillation, including setting up permissions, selecting the models, providing input dataset, commencing the model distillation jobs, and conducting evaluation and deployment of the student models after model distillation.

Key benefits of Amazon Bedrock Model Distillation

  • Efficiency: Distilled models provide high use-case specific accuracy comparable to the most capable models while being as fast as some of the smallest models.
  • Cost optimization: Inference from distilled models is less expensive compared to larger advanced models.
  • Advanced customization: Amazon Bedrock Model Distillation removes the need to create a labeled dataset for fine-tuning. Amazon Bedrock automates the complex process of generating high-quality teacher responses to create a diverse and high-volume training dataset to use for fine-tuning the student model, by adding data synthesis (up to 15 thousand prompt-response pairs) and augmentation techniques behind the scenes that automatically adapt to your use case, optimizing the distilled model’s performance.
  • Ease of use: Amazon Bedrock Model Distillation offers a single workflow that automates the generation of teacher responses, adds data synthesis to improve teacher responses, and fine-tunes the student model with optimized hyperparameter tuning.

Use cases for Amazon Bedrock Model Distillation

By distilling knowledge from larger models into smaller, more agile ones, organizations are empowered to develop optimized AI solutions to achieve a higher return on their investments. Here are some applications where a distilled model can make a significant impact:

  • Retrieval Augmented Generation (RAG): Enable enterprise-wide search and knowledge retrieval systems that can handle thousands of concurrent queries at a fraction of the cost of larger models, making widespread deployment more feasible.
  • Document summarization: Process vast amounts of business content in real time, such as summarizing thousands of customer call transcripts daily, enabling insights at a scale previously limited by latency constraints.
  • Chatbot deployments: Power customer service chatbots that can handle thousands of concurrent real-time conversations with consistently low latency, delivering the quality of a larger model but at significantly lower operational costs.
  • Text classification: Build faster models for categorizing high volumes of concurrent support tickets, emails, or customer feedback at scale; or for efficiently routing requests to larger models when necessary. This approach can significantly reduce processing costs while maintaining classification accuracy, enabling real-time responsiveness to customer needs.

Amazon Bedrock Model Distillation workflow

Amazon Bedrock offers two options for using Amazon Bedrock Model Distillation. In the first option, you can create a distilled model by providing your production data using historical invocation logs from your previous interactions within Amazon Bedrock. In a production environment, you continue to use the existing Amazon Bedrock Inference APIs, such as the InvokeModel or Converse API, and turn on invocation logs that store model input data (prompts) and model output data (responses). You can optionally add request metadata to these inference requests to filter your invocation logs for specific use cases. By default, Amazon Bedrock reads only the prompts from the invocation logs and will generate responses from the teacher model selected in your distillation job. In this scenario, Amazon Bedrock might apply proprietary data synthesis techniques to generate diverse and high-quality responses from the teacher model to augment the fine-tuning dataset, potentially improving the performance of the distilled student model. The student model is then fine-tuned using the prompt and teacher response pairs.

Optionally, you can configure Amazon Bedrock to extract both the prompt and response from the invocation logs. In this scenario, the teacher model selected in the distillation job must match the teacher model in the invocation log. No data synthesis techniques are applied. The prompt-response pairs are taken as is from the invocation logs and the student model is fine-tuned.

In the second option, you can upload your use-case specific prompts by directly uploading a JSONL file to Amazon Simple Storage Service (Amazon S3) containing your use-case specific prompts or labelled prompt-completion pairs. Amazon Bedrock generates responses from the teacher model for the provided prompts. If you provide a human-generated labeled dataset representing the ground truth, Amazon Bedrock can use these prompt-response pairs as golden examples to generate better teacher responses. The student model is then fine-tuned using the prompt-response pairs generated by the teacher model.

Prerequisites

To use the model distillation feature, make sure that you have satisfied the following requirements:

  • An active AWS account.
  • Selected teacher and student models enabled in Amazon Bedrock. You can confirm that the models are enabled on the Model access page of the Amazon Bedrock console.
  • Confirm the AWS Regions where the models are available, along with the applicable quotas.
  • To create a model distillation job using Amazon Bedrock, you need to create an AWS Identity and Access Management (IAM) role with the following permissions:
    • A trust relationship that allows Amazon Bedrock to assume the role
    • Permissions to access input data and historical invocation logs in Amazon S3
    • Permissions to write output data to Amazon S3
    • Optionally, permissions to decrypt an AWS Key Management Service (AWS KMS) key if you have encrypted resources with a KMS key
  • An S3 bucket where your distillation job output metrics are stored.
  • If you provide an input dataset for distillation, use Amazon S3 to store your input data
    • Alternatively, if you use historical invocation logs for model distillation, make sure that invocation logging is enabled in the AWS Management Console and that the historical invocation logs are stored in an S3 location. To do so, go to the Amazon Bedrock console and choose Settings in the bottom-left corner, as shown in the screenshot:
    • On the next page, make sure that Model invocation logging is enabled and select S3 only as the logging destination. (Optionally, you can select Both S3 and CloudWatch Logs as the destination.)
  • Ensure that you have sufficient quota for running a Provisioned Throughput during inference. Go to the AWS Service Quotas console, and check the following quotas:
    • Model units no-commitment Provisioned Throughputs across custom models
    • Model units per provisioned model for [student model name]

Both of these fields need to have enough quota to support your Provisioned Throughput model unit. Request a quota increase if necessary to accommodate your expected inference workload.

Model selection

Currently, Amazon Bedrock Model Distillation supports student-teacher combinations within the same model providers (for example, Amazon, Anthropic, or Meta).

Selecting the right models for distillation is crucial. The process involves choosing a teacher model for synthetic data generation and a student model to learn from the teacher’s output. The teacher model is typically larger and more capable, while the student model is smaller, faster, and more cost-efficient.

When selecting models, consider three key dimensions: performance, latency and cost. These factors are interconnected and adjusting one can affect the others.

  • Performance: Establish clear performance targets for your use case, such as accuracy, consistency, or harmlessness. Select a teacher model that meets or exceeds your desired performance level. The expectation from distillation is to increase the student model’s performance to approach that of the teacher model.
  • Latency: Choose a student model that meets your latency requirements. The final distilled model will have the same latency profile as the student model that you select.
  • Cost: Consider the total cost of ownership (TCO) across the model’s lifecycle, including teacher model inference for synthetic data generation, student model fine-tuning, inference cost for the distilled model, and custom model storage.

Distillation input dataset

There are two main ways to prepare use-case specific input data for distillation in Amazon Bedrock:

  • Uploading a JSONL file to Amazon S3
  • Using historical invocation logs

Uploading a JSONL file to S3

If you have a dataset in the JSON Lines (JSONL) format, you can upload it to an S3 bucket. Each record in this JSONL file uses the following structure:

{
    "schemaVersion": "bedrock-conversation-2024",
    "system": [
        {
            "text": string
        }
    ],
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "text": string
                }
            ]
        },
        {
            "role": "assistant"
            "content": [
               {
                   "text": string
               }
            ]
        }
    ]
}

Specifically, each record has a mandatory field, schemaVersion, that must have the value bedrock-conversation-2024 at this launch. The record can optionally include a system prompt that indicates the role assigned to the model. In the messages field, the user role is required, containing the input prompt provided to the model, while the assistant role, containing the desired response, is optional.

At preview, Anthropic and Meta models accept only single-turn conversation prompts, meaning you can have only one user prompt per record. The Amazon Nova models support multi-turn conversations, allowing you to provide multiple user and assistant exchanges within one record.
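
For illustration, a minimal single-turn record with hypothetical content could look like the following (each record occupies a single line in the JSONL file):

{"schemaVersion": "bedrock-conversation-2024", "system": [{"text": "You are a helpful customer service assistant."}], "messages": [{"role": "user", "content": [{"text": "Summarize the customer's issue: my package arrived two weeks late and the box was damaged."}]}, {"role": "assistant", "content": [{"text": "The customer reports a two-week delivery delay and a damaged box."}]}]}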

Using historical invocation logs

Alternatively, you can use your historical invocation logs stored in Amazon S3 for model distillation. These logs capture the prompts, responses, and metadata from your previous model interactions, making them a valuable source of data. To use this method:

  1. Enable invocation logging: Make sure that you’ve enabled invocation logging for your account. If you haven’t done this yet, see the prerequisites section for instructions.
  2. Add metadata to model invocations: When invoking models using the InvokeModel or Converse API, include a requestMetadata field with key-value pairs. This allows you to categorize and filter your interactions later. An example for using the Converse API would be:
{
    "additionalModelRequestFields": JSON value,
    "additionalModelResponseFieldPaths": ["string"],
    "guardrailConfig": {
        "guardrailIdentifier": "string",
        "guardrailVersion": "string",
        "trace": "string"
    },
    "inferenceConfig": {
        "maxTokens": number,
        "stopSequences": ["string"],
        "temperature": number,
        "topP": number
    },
    "messages": [{
        "content": [{
            ...
        }],
        "role": "string"
    }],
    "system": [{
        ...
    }],
    "toolConfig": {
        "toolChoice": {
            ...
        },
        "tools": [{
            ...
        }]
    },
    "requestMetadata": {
        "string": "string", // {"key": "value"}
        "string": "string", // {"key": "value"}
        "string": "string"  // {"key": "value"}
    }
}

A specific example for the requestMetadata field for a sample use case could be:

 "requestMetadata":{ 
    "project": "CustomerService", 
    "intent": "BestPractices", 
    "priority": "Medium" }
  3. Select logs for distillation: When creating a model customization job, you can specify filters to select which invocation logs to use. The API supports various filtering options:
    • Include specific logs:
       "requestMetadataFilters": { 
          "equals": {"project": "CustomerService"} }

    • Exclude specific logs:
       "requestMetadataFilters": { 
          "notEquals": {"priority": "Low"} }

    • Combine multiple conditions:
      "requestMetadataFilters": {
          "andAll": [
          {"equals": {"project": "CustomerService"}},
          {"notEquals": {"priority": "Low"}}
          ]
      }

    • Use OR logic:
      "requestMetadataFilters": {
          "orAll": [
          {"equals": {"intent": "ComplaintResolution"}},
          {"equals": {"intent": "ProductInquiry"}}
          ]
      }

By following these steps, you can precisely control which data from your invocation logs should be used for distillation, enabling you to target specific use cases, projects, or workflows.
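
As a quick preview of how these filters come together, the following is a minimal sketch of an invocation-log training data configuration that combines an equals and a notEquals condition with andAll. The S3 location is a placeholder; the full configuration is covered in the API section later in this post.

# Sketch: invocation-log config with a combined (andAll) metadata filter
training_data_config = {
    "invocationLogsConfig": {
        "usePromptResponse": True,  # reuse logged responses instead of regenerating them
        "invocationLogSource": {
            "s3Uri": "s3://<YOUR_BUCKET>/<BUCKET_PREFIX>/AWSLogs"  # placeholder
        },
        "requestMetadataFilters": {
            "andAll": [
                {"equals": {"project": "CustomerService"}},
                {"notEquals": {"priority": "Low"}},
            ]
        },
    }
}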

Selecting the right data

When selecting data for distillation, whether through a new training JSONL file or historical invocation logs, it’s crucial to choose prompts and responses that are relevant to your use case. The quality and diversity of the data will directly impact the performance of the distilled model.

In general, aim to include prompts that cover a wide range of topics and scenarios relevant to your use case. Just as important, optimize the prompts for the teacher model so that it produces better responses, allowing distillation to perform a high-quality knowledge transfer from teacher to student. For use cases like RAG, make sure to include prompts that contain the relevant context to be used by the model. For tasks that require a specific response style or format, include examples that adhere to the desired style or format.

Be mindful when curating the data used for distillation to help ensure that the distilled model learns the most relevant and valuable knowledge from the teacher model, optimizing its performance for your specific use case.

Run the model distillation

You can start a distillation job either through the Amazon Bedrock console or programmatically using the Amazon Bedrock API. The distillation process requires training data, provided either by uploading a training dataset in JSONL format to Amazon S3 or by using historical model invocation logs, as prepared in the prior section.

Before starting a model distillation job, make sure that you’re operating within the boundaries of Amazon Bedrock distillation service quotas.

Let’s explore how to start distillation jobs using different approaches. In the following example, we use Llama 3.1 70B as the teacher model and Llama 3.1 8B as the student model.

Start a distillation job using the console

Amazon Bedrock Model Distillation provides you with an option to run a distillation job through a guided user interface in the console. To start a distillation job through the console, follow these steps:

  1. Go to the Amazon Bedrock console. Choose Foundation models in the navigation pane, then choose Custom models. In the Customization methods section, choose Create Distillation job.
  2. For Distilled model name, enter a name for the model. Select Model encryption to add a KMS key. Optionally, expand the Tags section to add tags for tracking.
  3. For Job name, enter a name for the training job. Optionally, expand the Tags section to add tags for tracking.
  4. Choose Select model to pick the teacher model of your choice.
  5. For Categories, choose the Meta model family. For Models available for distillation, select Llama 3.1 70B Instruct. Choose Apply.
  6. Open the dropdown under Select a student model. For this example, select Llama 3.1 8B Instruct.
  7. Specify the Max response length through the slider or directly in the input field. This configuration will be used as an inference parameter for the synthetic data generation by the teacher model.
  8. As discussed in the prior section, there are two approaches to provide a distillation input dataset.

    • If you plan to directly upload a JSONL file to S3, upload your training dataset to the S3 bucket you prepared in the prerequisites section. Under Distillation input dataset, specify the Amazon S3 location for your training dataset.
    • If you plan to use historical invocation logs, select Provide access to invocation logs first, then specify the S3 location for your stored invocation logs. You can add different types of metadata filters to select only the invocation logs relevant to the use case.

You can also configure Amazon Bedrock to only read your prompts or to use the prompt-response pairs. If you choose to only read the prompts, Amazon Bedrock regenerates the responses using the teacher model; if you choose to use the prompt-response pairs, Amazon Bedrock uses the available responses from the logs without regenerating them.

Make sure that the teacher model selected for distillation and the model used in the invocation logs are the same if you want Amazon Bedrock to reuse the responses from the invocation logs.

  9. Optionally, expand the VPC settings section to specify a VPC that defines the virtual networking environment for this distillation job.
  10. Under Distillation output metrics data, for S3 location, enter the S3 path for the bucket where you want the training output metrics of the distilled model to be stored.
  11. Under Service access, select a method to provide Amazon Bedrock with the required IAM permissions to perform the distillation. This happens through assignment of a service role. You can select Use an existing service role if you have already defined a role with fine-grained IAM policies. If you want a new role to be created, select Create and use a new service role and specify a Service role name. View permission details provides you with a comprehensive overview of the IAM permissions required.
  12. After you have added all the required configurations for the Amazon Bedrock Model Distillation job, choose Create Distillation job.
  13. When the distillation job starts, you can see the status of the job (such as Training, Complete, or Failed) under Jobs.
  14. Now select your distillation job. As the distillation job progresses, you can find more information about the job, including job creation time, status, job duration, teacher-student configuration, and the distillation input dataset.

Start a distillation job with S3 JSONL data using an API

To use an API to start a distillation job using training data stored in an S3 bucket, follow these steps:

  1. First, create and configure an Amazon Bedrock client:
    import boto3
    from datetime import datetime
     
    bedrock_client = boto3.client(service_name="bedrock")
     
    # Generate unique names for the job and model
    job_name = f"distillation-job-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
    model_name = f"distilled-model-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
     
    # Configure your models and IAM role
    teacher_model = "arn:aws:bedrock:us-west-2::foundation-model/meta.llama3-1-70b-instruct-v1:0"
    student_model = "arn:aws:bedrock:us-west-2::foundation-model/meta.llama3-1-8b-instruct-v1:0:128k"
    role_arn = "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_IAM_ROLE>"
    # Specify S3 locations for training data and output
    training_data = "s3://<YOUR_BUCKET>/training-data.jsonl" # Replace by your training file
    output_path = "s3://<YOUR_BUCKET>/output/" 
    # Specify MaxResponseLengthForInference parameter
    max_response_length = 1000

  2. Create the distillation job using create_model_customization_job:
    distillation_job_response = bedrock_client.create_model_customization_job(
        jobName=job_name,
        customModelName=model_name,
        roleArn=role_arn,
        baseModelIdentifier=student_model,
        customizationType="DISTILLATION",
        trainingDataConfig={
            "s3Uri": training_data
        },
        outputDataConfig={
            "s3Uri": output_path
        },
        customizationConfig={
            "distillationConfig": {
                "teacherModelConfig": {
                    "teacherModelIdentifier": teacher_model,
                    "maxResponseLengthForInference": max_response_length 
                }
            }
        }
    )

  3. You can monitor the progress of the distillation job by providing the job_arn of your model distillation job:
    response = bedrock_client.get_model_customization_job(
        jobIdentifier=job_arn # Replace by your distillation job_arn
    )
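
If you prefer to wait for completion programmatically, a minimal polling sketch follows. It assumes the response’s status field takes values such as InProgress, Completed, Failed, or Stopped; adjust the handling to your needs.

import time

job_arn = distillation_job_response["jobArn"]
while True:
    job = bedrock_client.get_model_customization_job(jobIdentifier=job_arn)
    status = job["status"]
    print(f"Distillation job status: {status}")
    if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(60)  # poll once per minute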

Start a distillation job with an invocation log using an API

To use model invocation logs as training data, make sure that you have collected enough invocation logs in your S3 bucket. First, define the log filter based on the filtering options described in the data preparation section:

# Configure the training data using invocation logs
training_data_config = {
    'invocationLogsConfig': {
        'usePromptResponse': False,
        'invocationLogSource': {
            's3Uri': 's3://<YOUR_BUCKET>/<BUCKET_PREFIX>/AWSLogs' # Replace by your S3 location
        },
        'requestMetadataFilters': {
            'equals': {
                'project': 'CustomerService'  # Filter logs based on metadata
            }
        }
    }
}

The invocationLogsConfig allows you to specify the Amazon S3 location where your invocation logs are stored, whether to use prompt-response pairs from the logs or generate new responses from the teacher model, and filters to select specific logs based on request metadata.

Then, create the distillation job using the same create_model_customization_job API (the remaining configuration parameters are defined as in the prior section):

distillation_job_response = bedrock_client.create_model_customization_job(
    jobName=job_name,
    customModelName=model_name,
    roleArn=role_arn,
    baseModelIdentifier=student_model,
    customizationType="DISTILLATION",
    trainingDataConfig=training_data_config,
    outputDataConfig={
        "s3Uri": output_path
    },
    customizationConfig={
        "distillationConfig": {
            "teacherModelConfig": {
                "teacherModelIdentifier": teacher_model,
                "maxResponseLengthForInference": max_response_length
            }
        }
    }
)

Deploy and evaluate the model distillation

After distilling the model, you can evaluate the distillation metrics recorded during the process. These metrics are stored in the S3 bucket you specified for output and include step-wise training metrics with the columns step_number, epoch_number, and training_loss.
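
As a minimal sketch, you could load these metrics with pandas to inspect how the training loss evolves. The exact object key under your output path may differ, so treat the file name as a placeholder; reading s3:// paths with pandas also requires the s3fs package.

import pandas as pd

# Placeholder key: look under your output_path for the step-wise metrics file
metrics = pd.read_csv("s3://<YOUR_BUCKET>/output/<STEP_WISE_METRICS_FILE>.csv")
print(metrics[["step_number", "epoch_number", "training_loss"]].tail())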

When you’re satisfied with the distillation metrics, you can purchase a Provisioned Throughput to deploy your distilled model, allowing you to take advantage of its improved performance and specialized capabilities in your applications. Provisioned Throughput refers to the number and rate of inputs and outputs that a model processes and returns. To use a distilled model, you must purchase a Provisioned Throughput, which is billed hourly. The pricing for a Provisioned Throughput depends on the following factors:

  • The selected student model.
  • The number of model units (MUs) specified for the Provisioned Throughput. An MU is a unit that specifies the throughput capacity for a given model; each MU defines the number of input tokens it can process and output tokens it can generate across all requests within 1 minute.
  • The commitment duration, which can be no commitment, 1 month, or 6 months. Longer commitments offer more discounted hourly rates.

After the Provisioned Throughput is set up, you can use the InvokeModel or Converse API to invoke the distilled model, similar to how the base model is invoked. This provides a seamless transition and maintains compatibility with existing applications or workflows.

It’s crucial to evaluate the performance of the distilled model to make sure that it meets the desired criteria and performs well on your specific tasks. You can conduct various evaluations, including comparing the distilled model with the teacher model to validate its performance.

Deploy the distilled model using the Amazon Bedrock console

To deploy the distilled model using the Amazon Bedrock console, complete the following steps:

  1. On the Amazon Bedrock console, choose Custom models in the navigation pane.
  2. Select the distilled model and choose Purchase provisioned throughput.
  3. For Provisioned throughput name, enter a name.
  4. Choose the model that you want to deploy.
  5. For Commitment term, select your level of commitment (for this post, we choose No commitment).
  6. Choose Purchase provisioned throughput.

After the distilled model has been deployed using a Provisioned Throughput, you can see the model status as In Service when you go to the Provisioned throughput page on the Amazon Bedrock console.

To interact with the distilled model in the Amazon Bedrock playground, select Chat/text, then select the distilled model under Custom & Managed endpoints.

Deploy the distilled model using the Amazon Bedrock API

To deploy the distilled model using the Amazon Bedrock API, complete the following steps:

  1. Retrieve the distilled model ID from the job’s output, and create a Provisioned Throughput model instance with the desired model units:
    import boto3
    bedrock_client = boto3.client(service_name="bedrock")
    job_arn = distillation_job_response['jobArn']
    custom_model_id = bedrock_client.get_model_customization_job(jobIdentifier=job_arn)['outputModelArn']
    
    provisioned_model_id = bedrock_client.create_provisioned_model_throughput(
        modelUnits=1,  # Update model units with the desired number
        provisionedModelName='distilled-model',
        modelId=custom_model_id
    )['provisionedModelArn']

  2. Check the status of your Provisioned Throughput model by running:
    bedrock_client.get_provisioned_model_throughput(provisionedModelId=provisioned_model_id)['status']

  3. When the Provisioned Throughput model is ready, you can call the model by using the InvokeModel or Converse API to generate text using the distilled model:
    bedrock_runtime = boto3.client(service_name='bedrock-runtime')
    conversation = [
        {
            "role": "user",
            "content": [{"text": "<YOUR_INPUT_TEXT_PROMPT>"}],  # replace with your prompt text
        }
    ]
    inferenceConfig = {"maxTokens": 2048, "temperature": 0.1, "topP": 0.9}
    response = bedrock_runtime.converse(
        modelId=provisioned_model_id,
        messages=conversation,
        inferenceConfig=inferenceConfig,
    )
    response_text = response["output"]["message"]["content"][0]["text"]

By following these steps, you can deploy and use your distilled model through the Amazon Bedrock API, allowing you to generate an efficient and high-performing student model tailored to your use cases. After deploying the distilled model, you can use it for inference in various Amazon Bedrock services, including Knowledge Bases, the playground, and any other service where custom models can be used for inference.

Conclusion

Amazon Bedrock Model Distillation enables you to create efficient, cost-optimized student models that closely match the performance of larger teacher models for specific use cases. By automating the complex process of knowledge transfer from advanced models to smaller models, Amazon Bedrock simplifies the deployment of faster and less expensive AI solutions without sacrificing accuracy. Customers can benefit from efficiency gains, ease of use, scientific innovation, and exclusive access to distill models across providers such as Anthropic and Amazon. With Amazon Bedrock Model Distillation, enterprises can use the power of foundation models while optimizing for latency, cost, and resource constraints to drive AI innovation across industries such as financial services, content moderation, healthcare, and customer service.

We encourage you to start your journey towards cost-effective AI innovation by visiting the Amazon Bedrock console and discovering how model distillation can transform your business.

About the authors

Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.

Ishan Singh is a Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.

Aris Tsakpinis is a Specialist Solutions Architect for AI & Machine Learning with a special focus on natural language processing (NLP), large language models (LLMs), and generative AI. In his free time, he is pursuing a PhD in ML Engineering at the University of Regensburg, focusing on applied NLP in the science domain.

Shreeya Sharma is a Senior Technical Product Manager at AWS, where she has been working on leveraging the power of Generative AI to deliver innovative and customer-centric products. Shreeya holds a master’s degree from Duke University. Outside of work, she loves traveling, dancing, and singing.

Sovik Kumar Nath is an AI/ML and Generative AI Senior Solutions Architect with AWS. He has extensive experience designing end-to-end machine learning and business analytics solutions in finance, operations, marketing, healthcare, supply chain management, and IoT. He has double master’s degrees from the University of South Florida and University of Fribourg, Switzerland, and a bachelor’s degree from the Indian Institute of Technology, Kharagpur. Outside of work, Sovik enjoys traveling, and adventures.

Read More

Build generative AI applications quickly with Amazon Bedrock IDE in Amazon SageMaker Unified Studio

Build generative AI applications quickly with Amazon Bedrock IDE in Amazon SageMaker Unified Studio

Building generative AI applications presents significant challenges for organizations: they require specialized ML expertise, complex infrastructure management, and careful orchestration of multiple services. To address these challenges, we introduce Amazon Bedrock IDE, an integrated environment for developing and customizing generative AI applications. Formerly known as Amazon Bedrock Studio, Amazon Bedrock IDE is now incorporated into the Amazon SageMaker Unified Studio (currently in preview). SageMaker Unified Studio combines various AWS services, including Amazon Bedrock, Amazon SageMaker, Amazon Redshift, AWS Glue, Amazon Athena, and Amazon Managed Workflows for Apache Airflow (MWAA), into a comprehensive data and AI development platform. In this blog post, we’ll focus on Amazon Bedrock IDE and its generative AI capabilities within the Amazon SageMaker Unified Studio environment.

Consider a global retail site operating across multiple regions and countries. Its sales analysts face a daily challenge: they need to make data-driven decisions but are overwhelmed by the volume of available information. They have structured data such as sales transactions and revenue metrics stored in databases, alongside unstructured data such as customer reviews and marketing reports collected from various channels. Without specialized structured query language (SQL) knowledge or Retrieval Augmented Generation (RAG) expertise, these analysts struggle to combine insights effectively from both sources.

In this post, we’ll show how anyone in your company can use Amazon Bedrock IDE to quickly create a generative AI chat agent application that analyzes sales performance data. Through simple conversations, business teams can use the chat agent to extract valuable insights from both structured and unstructured data sources without writing code or managing complex data pipelines. The following diagram illustrates the conceptual architecture of an AI assistant with Amazon Bedrock IDE.

SageMaker Unified Studio simple architecture diagram

Solution overview

The AI chat agent application combines structured and unstructured data analysis through Amazon Bedrock IDE:

  • For structured data: connects to sales records in Amazon Athena, translating natural language into SQL queries
  • For unstructured data: uses Amazon Titan Text Embeddings and Amazon OpenSearch to enable semantic search across customer reviews and marketing reports

The Amazon Bedrock IDE interface seamlessly combines results from both sources, delivering comprehensive insights without requiring users to understand the underlying data structures or query languages. The following figure illustrates the workflow from initial user interaction to final response. For more details on the user interaction flow, check out our associated GitHub repository.

Solution architecture

Bedrock IDE architecture diagram

The architecture in the preceding figure shows how Amazon Bedrock IDE orchestrates the data flow. When users pose questions through the natural language interface, the chat agent determines whether to query the structured data in Amazon Athena through the Amazon Bedrock IDE function, search the Amazon Bedrock knowledge base, or combine both sources for comprehensive insights. This approach enables sales, marketing, product, and supply chain teams to make data-driven decisions efficiently, regardless of their technical expertise. For example, by the end of this tutorial, you will be able to query the data with prompts such as “Can you return our five top selling products this quarter and the principal customer complaints for each?” or “Were there any supply chain issues that could have affected our North American market for clothing sales?”

In the following sections, we’ll guide you through setting up your SageMaker Unified Studio project, creating your knowledge base, building the natural language query interface, and testing the solution.

SageMaker Unified Studio setup

SageMaker Unified Studio is a browser-based web application where you can use all your data and tools for analytics and AI. SageMaker Unified Studio can authenticate you with your AWS Identity and Access Management (IAM) credentials, credentials from your identity provider through the AWS IAM Identity Center, or with your SAML credentials.

You can obtain the SageMaker Unified Studio URL for your domains by accessing the AWS Management Console for Amazon DataZone. Follow the steps in the Administrator Guide to set up your SageMaker Unified Studio.

Building a generative AI application

SageMaker Unified Studio offers tools to discover and build with generative AI. To get started, you need to build a project.

  1. Open SageMaker Unified Studio and choose Generative AI playground at the top of the page.

SageMaker Unified Studio simple landing page

  2. Here, you can explore, experiment with, and compare various foundation models (FMs) through a chat interface.

Bedrock IDE - Generative AI playground

Similarly, you can explore image and video models with the Image & video playground.

  3. To begin creating your chat agent, choose Build chat agent in the chat playground window. You will now create a new project before building your app. Choose Create project.

Build chat agent

  4. Enter a project name. Next, select Generative AI application development from the available profiles. This profile includes all the necessary elements for working with Amazon Bedrock components in your generative AI application development. Choose Continue.

Bedrock IDE - Create project view

  5. On the next screen, leave all settings at their default values. Choose Continue to move to the next screen, then choose the Create Project button to initiate the project creation process. The system will take a few minutes to set up your project.

Bedrock IDE - Create project view confirmation

After you’ve created your project, you can begin building your generative AI application.

Prerequisites

Before creating your application in Amazon Bedrock IDE, you’ll need to set up a few resources in your AWS account. This will provision the backend infrastructure and services that the sales analytics application will rely on. This includes setting up Amazon API Gateway, AWS Lambda functions, and Amazon Athena to enable querying the structured sales data.

  1. Deploy the required AWS resources:
    1. Launch the AWS CloudFormation stack in your preferred AWS Region:
    2. After the stack is deployed, note down the API Gateway URL value from the CloudFormation outputs tab: TextToSqlEngineAPIGatewayURL.
    3. Navigate to the AWS Secrets Manager console and find the secret <StackName>-api-keys. Choose Retrieve secret and copy the apiKey value from the plaintext string {"clientId":"default","allowedOperations":["query"],"apiKey":"xxxxxxxx"}.

You’ll need these values when setting up your Amazon Bedrock IDE function later.
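
If you prefer to retrieve these values programmatically rather than from the console, the following is a minimal sketch using boto3. The stack name is a placeholder you replace with the name you gave the CloudFormation stack; the output key and secret name follow the conventions described above.

import json
import boto3

STACK_NAME = "<YOUR_STACK_NAME>"  # placeholder

# Read the TextToSqlEngineAPIGatewayURL value from the stack outputs
cfn = boto3.client("cloudformation")
outputs = cfn.describe_stacks(StackName=STACK_NAME)["Stacks"][0]["Outputs"]
api_url = next(o["OutputValue"] for o in outputs if o["OutputKey"] == "TextToSqlEngineAPIGatewayURL")

# Read the apiKey value from the <StackName>-api-keys secret
secrets = boto3.client("secretsmanager")
secret_string = secrets.get_secret_value(SecretId=f"{STACK_NAME}-api-keys")["SecretString"]
api_key = json.loads(secret_string)["apiKey"]

print(api_url, api_key)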

  2. Download all three sample data files. These files contain synthetic data generated by a generative AI model, including customer reviews, customer survey responses, and world news that you’ll use to build your knowledge base:
  3. Download the API configuration: openapi_schema.json. You’ll use this file when setting up your function to query sales data.

That’s it! With these resources ready, you can create your sales analytics application. Each subsequent section will guide you through exactly when and how to use these files.

Instructions configuration for the chat agent

Go to the Amazon Bedrock IDE chat agent application. Select a model from the dropdown (this can be changed later; make sure it supports data and functions). In the chat agent instructions field, enter the following:

You are a Sales Analytics agent with access to sales data in the "sales" database, table "sales_records". Your tasks include analyzing metrics, providing sales insights, and answering data questions.
Table Schema:
- region, country: Location data
- item_type: Product category
- sales_channel: Online/Offline
- order_priority: H/M/L/C
- order_date, ship_date: Timing
- order_id: Unique identifier
- units_sold: Quantity
- unit_price, unit_cost: Price metrics
- total_revenue, total_cost, total_profit: Financial metrics.
Use Amazon Athena SQL queries to provide insights. Format responses with:
1. SQL query used
2. Business interpretation
3. Key insights/recommendations
You can also access sales-repo, which contains details on product categories, customer reviews, etc.
Error Handling:
- If the user's query cannot be translated into a valid SQL query, or the SQL is invalid or fails to execute, provide a clear and informative error message.

This instruction will guide the AI application to act as a sales analytics agent, providing structured responses based on the given sales data schema in addition to accessing the product reviews and other sales-related data.

Chat agent application building view

For this application, you will create two main components: a knowledge base to handle unstructured data, and a function that uses Amazon Athena to query the structured data. These components will work together to process and retrieve information for your generative AI application.

Creating a knowledge base

Knowledge bases enable your application to analyze unstructured data like customer reviews and news stories.

  1. Select the Data section on the current chat agent screen.
  2. Choose Create new Knowledge Base and enter a name for your new knowledge base. You also need to enter a brief description for the chat agent to understand the purpose of this Knowledge Base:

This contains product-specific reviews from users, user feedback gathered via survey, and recent industry and economic news

  3. You have two options for configuring your knowledge base data sources: you can either use local files or configure a web crawler. Web scraping automatically extracts content from public web pages that you have permission to access. By adding website URLs to the tool, it will crawl these sites and create a knowledge base from the defined URLs. This process allows you to efficiently gather and structure information from multiple web sources, building a comprehensive dataset for your needs.

For this post, you’ll upload the files containing unstructured data that we mentioned previously (product-reviews.txt, survey-response.txt, and world-news.txt).

  4. Choose Click to upload, and upload the three files. Keep the default parsing settings.
  5. For the embeddings model, select Amazon Titan Text Embeddings V2 or any other embeddings model.
  6. Select OpenSearch Serverless as your vector store.
  7. After you’ve made these selections, choose Create to create your knowledge base.
  8. After the knowledge base creation is finished, select it for your application under the Select Knowledge Base dropdown.

This unstructured data will be used to enhance the knowledge base of your generative AI application, allowing it to process and provide insights based on customer reviews, survey responses, and current news relevant to your business sector.

Create Knowledge Base

To verify that the unstructured data connection is working properly, submit this example prompt into the application. Note that generative AI systems are nondeterministic, so responses will not be the same every time.

Prompt:

What is the customer sentiment surrounding our Office Supplies products?

Expected response:

Based on the customer reviews and feedback, the sentiment surrounding our Office Supplies products is mixed. While there are many positive reviews praising the product selection, store environment, and helpful staff, there are also a significant number of negative reviews citing issues like:

Poor product quality (e.g. flimsy materials, inconsistent paper quality, adhesive issues)

Frustrating in-store experiences (e.g. long wait times, rude staff, messy/disorganized stores)

Problems with online ordering and apps (e.g. filters not working properly, payment issues, logging out repeatedly)

To improve customer sentiment, we should focus on addressing these common pain points through better quality control, more efficient inventory management and staffing, and improving the online/app user experience.

Creating a function

In this section, you will create a function that will interact with Amazon API Gateway to query the database, which then forwards requests to the Lambda function that retrieves data from Amazon Simple Storage Service (Amazon S3) and processes SQL queries using Amazon Athena. The AWS infrastructure has already been deployed as part of the CloudFormation template. The structured dataset includes order information for products spanning from 2010 to 2017. This historical data will allow the function to analyze sales trends, product performance, and other relevant metrics over this seven-year period. The application will use this function to integrate structured data analysis capabilities, enabling it to provide insights based on concrete sales data alongside the unstructured data from reviews and news that are already incorporated.

  1. In your Amazon Bedrock IDE Chat agent application, expand the Functions section on the screen. Choose Create New Function.
  2. Enter a name for the function and provide a description.
  3. For the function schema, select Import JSON/YAML. Import the API schema from the openapi_schema.json file that you downloaded earlier.
  4. Important: After importing, you need to modify the API endpoint URL in the schema. Replace it with the actual value from the CloudFormation stack output TextToSqlEngineAPIGatewayURL. This step makes sure that your function is correctly linked to the appropriate API endpoint for your application.

Create a function

  5. For the Authentication method, select API Keys (Max. 2 Keys) and enter the following details:
    • Key sent in: Header
    • Key name: x-api-key
    • Key value: enter the AWS Secrets Manager apiKey value.
  6. Choose Create and wait until the function creation is complete.
  7. After the function creation is finished, select it for your application under the Functions dropdown.

Create a function confirmation

To verify that the structured data connection is working properly, submit the following example query into the application. Note that generative AI systems are nondeterministic, so responses will not be the same every time.

Prompt:

List all the regions that we do business in.

Expected response:

Based on the query to get distinct region values from the sales_records table, the regions where we do business are:
Europe
Australia and Oceania
North America
Central America and the Caribbean
Sub-Saharan Africa
Middle East and North Africa
Asia

Sharing your application

After you’ve built your application, you can share it with other users in your organization through SageMaker Unified Studio.

  1. In the application interface, locate and choose Share in the top right corner.
  2. In the sharing dialog, search for users by their alias and choose Invite to include them in the sharing list.
  3. After adding all desired users, copy the application URL from the sharing dialog and send the URL to the added users through your preferred communication channel.

Note: if you turn link sharing on, anyone with the link will be able to subscribe and use the app. If you add their names specifically, only those users can see the app, and it will appear under the “Shared generative AI assets” section for them.

Users must have valid SageMaker Unified Studio access credentials to use the shared application. Contact your AWS administrator if users encounter access issues.

Application sharing view

Examples

The following examples demonstrate how a global retail site uses this solution to transform their sales analytics process and extract valuable insights. Let’s explore three types of queries that demonstrate the power of this approach:

  • Analyzing structured data to understand sales performance
  • Analyzing unstructured customer feedback to extract insights
  • Combining both data sources for comprehensive business intelligence

In the following examples, we’ll show how sales analysts can extract valuable insights through basic conversational queries.

Keep in mind that generative AI systems are nondeterministic, so responses will not be the same every time. Generative AI systems might also hallucinate information, and care should be taken to evaluate responses for correctness. Additionally, the structured data source is randomly generated on download, so your results will not necessarily match ours. You can access the CSV data source in Amazon S3 to evaluate the results manually. Our examples used the Anthropic Claude 3 Sonnet model, but feel free to experiment with others.

Structured data examples

The feature columns that can be queried are the following: Region, Country, Item Type, Sales Channel, Order Priority, Order Date, Order ID, Ship Date, Units Sold, Unit Price, Unit Cost, Total Revenue, Total Cost, and Total Profit. The data spans seven regions and over a hundred countries, with the following item types represented: Baby Food, Beverages, Cereal, Clothes, Cosmetics, Fruits, Household, Meat, Office Supplies, Personal Care, Snacks, and Vegetables.
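
If you want to spot-check the agent’s answers against the raw data, a minimal sketch follows. The S3 location and file name are placeholders, and the column headers are assumed to match the list above; reading s3:// paths with pandas requires the s3fs package.

import pandas as pd

df = pd.read_csv("s3://<YOUR_BUCKET>/<PATH>/sales_records.csv")  # placeholder location

# Region with the highest total revenue (compare with Prompt 1 below)
print(df.groupby("Region")["Total Revenue"].sum().idxmax())

# Top 3 item types by units sold (compare with Prompt 2 below)
print(df.groupby("Item Type")["Units Sold"].sum().nlargest(3))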

Prompt 1:

What region accounts for our highest revenue, and how much revenue is that?

Expected Response 1:

The region that accounts for our highest revenue is "Sub-Saharan Africa" with a total revenue of $356,724,250.12.

Prompt 2:

Which 3 item types account for our most units sold?

Expected Response 2:

The 3 item types that account for our most units sold are:
Beverages with 504,905 units
Personal Care with 475,724 units
Vegetables with 471,276 units

Unstructured data examples

The following examples demonstrate a few natural language prompts that you can use to query unstructured data. The goal of these queries is to query textual data to answer questions and identify common themes.

The data that can be queried includes product-specific reviews from users, online compared to offline feedback gathered through surveys, and recent industry and economic news.

Prompt 1:

Search our customer surveys to see if customers seem to prefer our online or offline shopping experience

Expected Response 1:

Based on the customer survey results from the knowledge base, there seems to be a preference for the in-store/offline shopping experience over the online experience (website and mobile app). 
The in-store reviews frequently mention positive aspects like helpful staff, clean stores, good product selection and organization. In contrast, the online reviews cite many issues with the website and mobile app like slow loading times, crashes, poor search functionality, and difficulties during checkout. 
While there are some positive online reviews as well, the negative experiences seem more prevalent for the online channels compared to the offline in-store experience.

Prompt 2:

Were there any major world events in 2016 affecting the sale of Vegetables?

Expected Response 2:

Based on the search results from the knowledge base, there were a few major events in 2016 that may have impacted vegetable sales:
In July 2016, an extended drought in California's Central Valley resulted in reduced vegetable production, affecting supply chains for major retailers across North America.
In April 2016, Morocco's innovative desert greenhouse project began operations, introducing new competition in the Mediterranean vegetable market and affecting prices in Southern Europe.
In June 2016, India's successful implementation of cold chain infrastructure reduced vegetable spoilage by 35%, stabilizing prices across South Asian markets.
So the drought in California likely reduced vegetable supply and increased prices in North America, while the new greenhouse operations in Morocco introduced more competition in Europe. However, improved cold chain logistics in India helped stabilize vegetable prices in that region.

Combined data examples

The following are examples of a few natural language prompts that you can use to query both structured and unstructured data sources and seamlessly combine the results into valuable business insights.

Prompt 1:

Do customers seem to prefer our online or offline shopping experience?
Gather data from both customer surveys and number of units sold on different sales channels (online and offline),
and try to identify a trend to see if the data corroborates these reviews.

Expected Response 1:

Based on the customer reviews from the knowledge base, there seems to be a preference for the in-store/offline shopping experience over online shopping. Common complaints for online included slow websites, technical glitches, poor search functionality, and issues during checkout. 
In contrast, the in-store experience received more positive reviews praising helpful staff, clean stores, and good product selection.
The overall sales data shows that the Offline sales channel had higher total revenue ($698.9M) and units sold (2.69M) compared to Online ($628.4M revenue, 2.36M units). 
This data corroborates the customer review sentiments, indicating customers may prefer the offline/in-store experience for our products and services.
However, it is important to note that both channels received a mix of positive and negative reviews. To improve the overall customer experience, we should analyze the specific issues raised in the reviews and look for ways to enhance both the online and offline channels based on that feedback.

Prompt 2:

Can you identify any major events that may have affected the sales revenue of Baby Food?
What about any product reviews that may reflect the trends in revenue?
Use order dates and news article publishing dates as you look for trends.

Expected Response 2:

Based on analyzing the sales revenue data for the Baby Food item type and searching relevant unstructured data sources, here are some major events that appear to have impacted sales revenue:
In November 2015, Japan introduced stricter quality control standards for imported baby foods (source 5). This appears to have disrupted supply chains in Asia and may have driven the spike in revenue that month as companies rushed to get products into Japan before the new standards took effect.
In August 2016, Mexico reported a breakthrough in avocado cultivation that increased yields by 25% (source 3). This improved supply of a key baby food ingredient and may have contributed to the high revenue in late 2016 by lowering costs for manufacturers.
In April 2014, Australia had a wheat shortage due to drought conditions, impacting costs for grain-based baby food products (source 2). This aligns with the low revenue on 4/26/2014 as manufacturers likely passed along higher costs to consumers.
The unstructured data sources provided helpful context around supply chain disruptions, ingredient shortages and surpluses, major agricultural events, and changes in trade policies - all of which appear to have impacted baby food sales based on the timing of these events correlating with fluctuations in revenue in the structured data.

Clean-up

To clean up the resources deployed in these instructions, first delete the CloudFormation stack. You can then remove resources from your Amazon Bedrock IDE project and delete domains by following the Amazon SageMaker Unified Studio documentation.

Conclusion

In this post, we demonstrated how Amazon Bedrock IDE transforms generative AI application development from a complex technical endeavor into a straightforward point-and-click experience. While traditional approaches require specialized ML expertise and significant development time, Amazon Bedrock IDE enables users from various skill levels to create production-ready AI applications in hours instead of weeks.

The key benefits are clear: anyone can now build sophisticated generative AI applications without coding expertise, achieve faster time-to-value through pre-built components, and maintain enterprise governance through centralized management. All while having secure access to their organization’s data through a unified, simple-to-use interface. This same approach can be applied beyond sales analytics to other scenarios where teams need to quickly build AI applications that combine enterprise data with large language models – making generative AI truly accessible across your organization.

Ready to transform your organization’s AI capabilities? Start building your first generative AI application today by following our step-by-step guide or visit Amazon Bedrock IDE to explore more solutions for your business needs.


About the Authors

Ameer Hakme is an AWS Solutions Architect based in Pennsylvania. He collaborates with Independent Software Vendors (ISVs) in the Northeast region, assisting them in designing and building scalable and modern platforms on the AWS Cloud. An expert in AI/ML and generative AI, Ameer helps customers unlock the potential of these cutting-edge technologies. In his leisure time, he enjoys riding his motorcycle and spending quality time with his family.

Adam Gamba is a Solutions Architect and Aspiring Analytics & AI/ML Specialist at AWS. With his background in computer science, he is very interested in using technology to build solutions to real-world problems. Originally from New Jersey, but now based in Arlington, Virginia, Adam enjoys rock climbing, playing piano, cooking, and attending local museums and concerts.

Bhaskar Ravat is a Senior Solutions Architect at AWS based in New York, with a deep interest in the transformative potential of AI. His passion lies in exploring how AI can impact both everyday life and the broader human experience. You can find him reading four books at a time when he’s not helping customers or building solutions for them.

Kosti Vasilakakis is a Principal Product Manager at AWS. He is an ex-data-scientist, turned PM, now leading Amazon Bedrock IDE to help enterprises build high-quality Gen AI applications faster. Kosti remains in awe of the rapid advancements in AI, and is excited to be working on its democratization. Outside of work, you’ll find him coding personal productivity automations, playing tennis, and spending time in the wilderness with his family.

Read More

Scale ML workflows with Amazon SageMaker Studio and Amazon SageMaker HyperPod

Scale ML workflows with Amazon SageMaker Studio and Amazon SageMaker HyperPod

Scaling machine learning (ML) workflows from initial prototypes to large-scale production deployment can be a daunting task, but the integration of Amazon SageMaker Studio and Amazon SageMaker HyperPod offers a streamlined solution to this challenge. As teams progress from proof of concept to production-ready models, they often struggle with efficiently managing growing infrastructure and storage needs. This integration addresses these hurdles by providing data scientists and ML engineers with a comprehensive environment that supports the entire ML lifecycle, from development to deployment at scale.

In this post, we walk you through the process of scaling your ML workloads using SageMaker Studio and SageMaker HyperPod.

Solution overview

Implementing the solution consists of the following high-level steps:

  1. Set up your environment and the permissions to access Amazon SageMaker HyperPod clusters in SageMaker Studio.
  2. Create a JupyterLab space and mount an Amazon FSx for Lustre file system to your space. This eliminates the need for data migration or code changes as you scale. This also mitigates potential reproducibility issues that often arise from data discrepancies across different stages of model development.
  3. You can now use SageMaker Studio to discover the SageMaker HyperPod clusters, and view cluster details and metrics. When you have access to multiple clusters, this information can help you compare the specifications of each cluster, current utilization, and queue status of the clusters to identify the one that meets the requirements of your specific ML task.
  4. We use a sample notebook to show how to connect to the cluster and run a Meta Llama 2 training job with PyTorch FSDP on your Slurm cluster.
  5. After you submit the long-running job to the cluster, you can monitor the tasks directly through the SageMaker Studio UI. This can help you get real-time insights into your distributed workflows and allow you to quickly identify bottlenecks, optimize resource utilization, and improve overall workflow efficiency.

This integrated approach not only streamlines the transition from prototype to large-scale training but also enhances overall productivity by maintaining a familiar development experience even as you scale up to production-level workloads.

Prerequisites

Complete the following prerequisite steps:

  1. Create a SageMaker HyperPod Slurm cluster. For instructions, refer to the Amazon SageMaker HyperPod workshop or Tutorial for getting started with SageMaker HyperPod.
  2. Make sure you have the latest version of the AWS Command Line Interface (AWS CLI).
  3. Create a user in the Slurm head node or login node with a UID greater than 10000. Refer to Multi-User for instructions to create a user.
  4. Tag the SageMaker HyperPod cluster with the key hyperpod-cluster-filesystem and, as the value, the ID of the FSx for Lustre file system associated with the SageMaker HyperPod cluster. This tag is needed for Amazon SageMaker Studio to mount FSx for Lustre onto JupyterLab and Code Editor spaces. Use the following code snippet to add a tag to an existing SageMaker HyperPod cluster:
    aws sagemaker add-tags --resource-arn <cluster_ARN> \
    --tags Key=hyperpod-cluster-filesystem,Value=<fsx_id>

Set up your permissions

In the following sections, we outline the steps to create an Amazon SageMaker domain, create a user, set up a SageMaker Studio space, and connect to the SageMaker HyperPod cluster. By the end of these steps, you should be able to connect to a SageMaker HyperPod Slurm cluster and run a sample training workload. To follow the setup instructions, you need to have admin privileges. Complete the following steps:

  1. Create a new AWS Identity and Access Management (IAM) execution role with AmazonSageMakerFullAccess attached to the role. Also attach the following JSON policy to the role, which enables SageMaker Studio to access the SageMaker HyperPod cluster. Make sure the trust relationship on the role allows the sagemaker.amazonaws.com service to assume this role.
{
    "Version": "2012-10-17",            
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ssm:StartSession",
                "ssm:TerminateSession"
            ],
            "Resource": "*"    
        },
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateCluster",
                "sagemaker:ListClusters"
            ],
            "Resource": "*"    
        },
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:DescribeCluster",
                "sagemaker:DescribeClusterNode",
                "sagemaker:ListClusterNodes",
                "sagemaker:UpdateCluster",
                "sagemaker:UpdateClusterSoftware"
            ],
            "Resource": "arn:aws:sagemaker:region:account-id:cluster/*"    
        }
    ]
}
  2. In order to use the role you created to access the SageMaker HyperPod cluster head or login node using AWS Systems Manager, you need to add a tag to this IAM role, where Tag Key = “SSMSessionRunAs” and Tag Value = “<posix user>” (see the sketch that follows). The POSIX user is the user that is set up on the Slurm head node. Systems Manager uses this user to exec into the head node.
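
A minimal sketch of adding this tag with boto3 follows; the role name and POSIX user are placeholders you replace with your own values.

import boto3

iam = boto3.client("iam")
iam.tag_role(
    RoleName="<YOUR_IAM_ROLE_NAME>",  # placeholder: the execution role created earlier
    Tags=[{"Key": "SSMSessionRunAs", "Value": "<posix user>"}],
)
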
  3. When you activate Run As support, it prevents Session Manager from starting sessions using the ssm-user account on a managed node. To enable Run As in Session Manager, complete the following steps:
    1. On the Session Manager console, choose Preferences, then choose Edit.
    2. Don’t specify any user name. The user name will be picked from the role tag SSMSessionRunAs that you created earlier.
    3. In the Linux shell profile section, enter /bin/bash.
    4. Choose Save.
  4. Create a new SageMaker Studio domain with the execution role created earlier along with other necessary parameters required to access the SageMaker HyperPod cluster. Use the following script to create the domain and replace the export variables accordingly. Here, VPC_ID and Subnet_ID are the same as the SageMaker HyperPod cluster’s VPC and subnet. The EXECUTION_ROLE_ARN is the role you created earlier.
export DOMAIN_NAME=<domain name>
export VPC_ID=vpc_id-for_hp_cluster
export SUBNET_ID=private_subnet_id
export EXECUTION_ROLE_ARN=execution_role_arn
export FILE_SYSTEM_ID=fsx_id
export FILE_SYSTEM_PATH=fsx_mount_path
export USER_UID=10000
export USER_GID=1001
export REGION=us-east-2

cat > user_settings.json << EOL
{
    "ExecutionRole": "$EXECUTION_ROLE_ARN",
    "CustomPosixUserConfig":
    {
        "Uid": $USER_UID,
        "Gid": $USER_GID
    },
    "CustomFileSystemConfigs":
    [
        {
            "FSxLustreFileSystemConfig":
            {
                "FileSystemId": "$FILE_SYSTEM_ID",
                "FileSystemPath": "$FILE_SYSTEM_PATH"
            }
        }
    ]
}
EOL

aws sagemaker create-domain \
--domain-name $DOMAIN_NAME \
--vpc-id $VPC_ID \
--subnet-ids $SUBNET_ID \
--auth-mode IAM \
--default-user-settings file://user_settings.json \
--region $REGION

The UID and GID in the preceding configuration are set to 10000 and 1001 by default; they can be overridden to match the user created in Slurm, and this UID/GID is used to grant permissions to the FSx for Lustre file system. Also, setting this at the domain level gives each user the same UID. To have a separate UID for each user, consider setting CustomPosixUserConfig while creating the user profile, as shown in the sketch that follows.
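
The following is a minimal sketch of setting a per-user UID/GID with CustomPosixUserConfig when creating a user profile; the domain ID, profile name, and UID/GID values are placeholders.

import boto3

sagemaker = boto3.client("sagemaker")
sagemaker.create_user_profile(
    DomainId="<DOMAIN_ID>",                 # placeholder
    UserProfileName="<USER_PROFILE_NAME>",  # placeholder
    UserSettings={
        "CustomPosixUserConfig": {
            "Uid": 10001,  # match the UID of the user created in Slurm
            "Gid": 1001,
        }
    },
)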

  5. After you create the domain, you need to attach the SecurityGroupIdForInboundNfs security group created as part of domain creation to all ENIs of the FSx for Lustre volume:
    1. Locate the Amazon Elastic File System (Amazon EFS) file system associated with the domain and the corresponding security group attached to it. You can find the EFS file system on the Amazon EFS console; it’s tagged with the domain ID, as shown in the following screenshot.
    2. Collect the corresponding security group, which will be named inbound-nfs-<domain-id> and can be found on the Network tab.
    3. On the FSx for Lustre console, locate the file system; to see all of its attached ENIs, see the Amazon EC2 console. Alternatively, you can find the ENIs using the AWS CLI or by calling the fsx:DescribeFileSystems API.
    4. For each ENI, attach the SecurityGroupIdForInboundNfs of the domain to it.

Alternatively, you can use the following script to automatically find and attach security groups to the ENIs associated with the FSx for Lustre volume. Replace the REGION, DOMAIN_ID, and FSX_ID attributes accordingly.

#!/bin/bash

export REGION=us-east-2
export DOMAIN_ID=d-xxxxx
export FSX_ID=fs-xxx

export EFS_ID=$(aws sagemaker describe-domain --domain-id $DOMAIN_ID --region $REGION --query 'HomeEfsFileSystemId' --output text)
export MOUNT_TARGET_ID=$(aws efs describe-mount-targets --file-system-id $EFS_ID --region $REGION --query 'MountTargets[0].MountTargetId' --output text)
export EFS_SG=$(aws efs describe-mount-target-security-groups --mount-target-id $MOUNT_TARGET_ID --query 'SecurityGroups[0]' --output text)
echo "security group associated with the Domain $EFS_SG"

echo "Adding security group to FSxL file system ENI's"
# Get the network interface IDs associated with the FSx file system
NETWORK_INTERFACE_IDS=$(aws fsx describe-file-systems --file-system-ids $FSX_ID --query "FileSystems[0].NetworkInterfaceIds" --output text)
# Iterate through each network interface and attach the security group
for ENI_ID in $NETWORK_INTERFACE_IDS; do
aws ec2 modify-network-interface-attribute --network-interface-id $ENI_ID --groups $EFS_SG
echo "Attached security group $EFS_SG to network interface $ENI_ID"
done

Without this step, application creation will fail with an error.

  6. After you create the domain, you can use it to create a user profile. Replace the DOMAIN_ID value with the one created in the previous step.
export DOMAIN_ID=d-xxx
export USER_PROFILE_NAME=test
export REGION=us-east-2

aws sagemaker create-user-profile \
--domain-id $DOMAIN_ID \
--user-profile-name $USER_PROFILE_NAME \
--region $REGION

Create a JupyterLab space and mount the FSx for Lustre file system

Create a space using the FSx for Lustre file system with the following code:

export SPACE_NAME=hyperpod-space
export DOMAIN_ID=d-xxx
export USER_PROFILE_NAME=test
export FILE_SYSTEM_ID=fs-xxx
export REGION=us-east-2

aws sagemaker create-space --domain-id $DOMAIN_ID \
--space-name $SPACE_NAME \
--space-settings "AppType=JupyterLab,CustomFileSystems=[{FSxLustreFileSystem={FileSystemId=$FILE_SYSTEM_ID}}]" \
--ownership-settings OwnerUserProfileName=$USER_PROFILE_NAME --space-sharing-settings SharingType=Private \
--region $REGION

Create an application using the space with the following code:

export SPACE_NAME=hyperpod-space
export DOMAIN_ID=d-xxx
export APP_NAME=test-app
export INSTANCE_TYPE=ml.t3.medium
export REGION=us-east-2
export IMAGE_ARN=arn:aws:sagemaker:us-east-2:081975978581:image/sagemaker-distribution-cpu

aws sagemaker create-app --space-name $SPACE_NAME \
--resource-spec "{\"InstanceType\":\"$INSTANCE_TYPE\",\"SageMakerImageArn\":\"$IMAGE_ARN\"}" \
--domain-id $DOMAIN_ID --app-type JupyterLab --app-name $APP_NAME --region $REGION

Discover clusters in SageMaker Studio

You should now have everything ready to access the SageMaker HyperPod cluster using SageMaker Studio. Complete the following steps:

  1. On the SageMaker console, choose Admin configurations, Domains.
  2. Locate the user profile you created and launch SageMaker Studio.
  3. Under Compute in the navigation pane, choose HyperPod clusters.

Here you can view the SageMaker HyperPod clusters available in the account.

  4. Identify the right cluster for your training workload by looking at the cluster details and the cluster hardware metrics.

You can also preview the cluster by choosing the arrow icon.

You can also go to the Settings and Details tabs to find more information about the cluster.
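
If you prefer the AWS CLI, you can discover the same information outside of the Studio UI. The following is a minimal sketch; the cluster name and Region are placeholders to replace with your own values.

# List the SageMaker HyperPod clusters in the account and Region
aws sagemaker list-clusters --region us-east-2

# Show details for a specific cluster (replace ml-cluster with your cluster name)
aws sagemaker describe-cluster --cluster-name ml-cluster --region us-east-2

# List the nodes provisioned in the cluster
aws sagemaker list-cluster-nodes --cluster-name ml-cluster --region us-east-2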

Work in SageMaker Studio and connect to the cluster

You can launch either JupyterLab or Code Editor, both of which mount the cluster FSx for Lustre volume for development and debugging.

  1. In SageMaker Studio, choose Get started in and choose JupyterLab.
  2. Choose a space that has the FSx for Lustre file system mounted to get a consistent, reproducible environment.

The Cluster Filesystem column identifies which space has the cluster file system mounted.

This should launch JupyterLab with the FSx for Lustre volume mounted. By default, you should see the getting started notebook in your home folder, which has step-by-step instructions to run a Meta Llama 2 training job with PyTorch FSDP on the Slurm cluster. This example notebook demonstrates how you can use SageMaker Studio notebooks to transition from prototyping your training script to scaling up your workloads across multiple instances in the cluster environment. Additionally, you should see the FSx for Lustre file system you mounted to your JupyterLab space under /home/sagemaker-user/custom-file-systems/fsx_lustre.
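
As a rough sketch of that transition, submitting a multi-node job to the Slurm cluster from the JupyterLab terminal generally looks like the following. The script name, config file, and resource counts are placeholders for illustration, not the contents of the getting started notebook.

#!/bin/bash
#SBATCH --job-name=llama2-fsdp      # name shown in the Slurm queue
#SBATCH --nodes=2                   # number of cluster instances to train on
#SBATCH --ntasks-per-node=1         # one torchrun launcher per node
#SBATCH --output=logs/%x_%j.out     # combined stdout/stderr per job

# torchrun starts the distributed training processes on every allocated node;
# adjust --nproc_per_node to the GPU count of your instance type
srun torchrun --nnodes=$SLURM_NNODES --nproc_per_node=8 \
    --rdzv_backend=c10d --rdzv_endpoint=$(hostname):29500 \
    train_fsdp.py --config config.yaml

Save the script as, for example, submit_llama2.sh, submit it with sbatch submit_llama2.sh, and monitor it with squeue.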

Monitor the tasks on SageMaker Studio

You can go to SageMaker Studio and choose the cluster to view a list of tasks currently in the Slurm queue.

You can choose a task to get additional task details such as the scheduling and job state, resource usage details, and job submission and limits.

You can also perform actions such as release, requeue, suspend, and hold on these Slurm tasks using the UI.
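
These actions map to standard Slurm commands that you can also run from a terminal on the cluster. A brief sketch, with 42 as a placeholder job ID:

squeue                      # list the tasks currently in the Slurm queue
scontrol show job 42        # scheduling details, job state, and resource limits
scontrol hold 42            # hold a pending job
scontrol release 42         # release a held job
scontrol suspend 42         # suspend a running job
scontrol requeue 42         # requeue a job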

Clean up

Complete the following steps to clean up your resources:

  1. Delete the space:
aws --region <REGION> sagemaker delete-space \
  --domain-id <DomainId> \
  --space-name <SpaceName>
  2. Delete the user profile:
aws --region <REGION> sagemaker delete-user-profile \
  --domain-id <DomainId> \
  --user-profile-name <UserProfileName>
  3. Delete the domain. To retain the EFS volume, specify HomeEfsFileSystem=Retain.
aws --region <REGION> sagemaker delete-domain \
  --domain-id <DomainId> \
  --retention-policy HomeEfsFileSystem=Delete
  4. Delete the SageMaker HyperPod cluster.
  5. Delete the IAM role you created. (A CLI sketch for these last two steps follows this list.)
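
The last two cleanup steps can also be scripted. The following is a minimal sketch; the cluster name, role name, and policy ARN are placeholders for whatever you created earlier, and any attached or inline policies must be removed before the role can be deleted.

# Delete the SageMaker HyperPod cluster
aws --region <REGION> sagemaker delete-cluster --cluster-name <ClusterName>

# Detach the role's policies, then delete the IAM role
aws iam detach-role-policy --role-name <RoleName> --policy-arn <PolicyArn>
aws iam delete-role --role-name <RoleName>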

Conclusion

In this post, we explored an approach to streamline your ML workflows using SageMaker Studio. We demonstrated how you can seamlessly transition from prototyping your training script within SageMaker Studio to scaling up your workload across multiple instances in a cluster environment. We also explained how to mount the cluster FSx for Lustre volume to your SageMaker Studio spaces to get a consistent reproducible environment.

This approach not only streamlines your development process but also allows you to initiate long-running jobs on the clusters and conveniently monitor their progress directly from SageMaker Studio.

We encourage you to try this out and share your feedback in the comments section.

Special thanks to Durga Sury (Sr. ML SA), Monidipa Chakraborty (Sr. SDE), and Sumedha Swamy (Sr. Manager PMT) for their support in launching this post.


About the Authors

Arun Kumar Lokanatha is a Senior ML Solutions Architect with the Amazon SageMaker team. He specializes in large language model training workloads, helping customers build LLM workloads using SageMaker HyperPod, SageMaker training jobs, and SageMaker distributed training. Outside of work, he enjoys running, hiking, and cooking.

Pooja Karadgi is a Senior Technical Product Manager at Amazon Web Services. At AWS, she is a part of the Amazon SageMaker Studio team and helps build products that cater to the needs of administrators and data scientists. She began her career as a software engineer before making the transition to product management. Outside of work, she enjoys crafting travel planners in spreadsheets, in true MBA fashion. Given the time she invests in creating these planners, it’s clear that she has a deep love for traveling, alongside a strong passion for hiking.


Introducing Amazon Kendra GenAI Index – Enhanced semantic search and retrieval capabilities

Introducing Amazon Kendra GenAI Index – Enhanced semantic search and retrieval capabilities

Amazon Kendra is an intelligent enterprise search service that helps you search across different content repositories with built-in connectors. AWS customers use Amazon Kendra with large language models (LLMs) to quickly create secure, generative AI–powered conversational experiences on top of their enterprise content.

As enterprises adopt generative AI, many are developing intelligent assistants powered by Retrieval Augmented Generation (RAG) to take advantage of information and knowledge from their enterprise data repositories. This approach combines a retriever with an LLM to generate responses. A retriever is responsible for finding relevant documents based on the user query. Customers seek to build comprehensive generative AI systems that use this approach with their choice of index, LLMs, and other components. The combination of retrievers and LLMs offers powerful capabilities, but organizations face significant challenges in building effective retrieval systems.

The core challenge lies in developing data pipelines that can handle diverse data sources, the multitude of data entities in each data source, and their metadata and access control information, while maintaining accuracy. This requires implementing information extraction models, optimizing text processing, and balancing sparse and dense retrieval methods. These diverse data sources come with their own ways of encapsulating entities of information. These entities can be documents in Amazon Simple Storage Service (Amazon S3), HTML pages in a web server, accounts in Salesforce, or incidents in ServiceNow. Each data source can have multiple ways to authenticate, such as OAuth 2.0 (for example, client credentials flow or refresh token flow), Network Level Trust Manager, basic authentication, and others.

Entities also come with access control information for each entity, such as the user email and groups that are authorized to access the entity. The data source administrators and users also add a multitude of metadata fields to each entity that contain critical information about the entity, such as created date or author. Organizations must also fine-tune technical parameters, including embedding models, dimensionality, and nearest neighbor algorithms for optimal performance. This complexity often requires significant expertise and resources, making it difficult for many organizations to implement effective retrieval systems for their generative AI solutions.

Amazon Bedrock Knowledge Bases provides managed workflows for RAG pipelines with customizable features for chunking, parsing, and embedding. However, customers seek a more streamlined experience with pre-optimized parameters and simplified data source integration. They also want the ability to reuse indexed content across their generative AI solutions.

Amazon Q Business is a fully managed, generative AI–powered assistant that you can configure to answer questions, provide summaries, generate content, and complete tasks based on your enterprise data. It allows end users to receive immediate, permissions-aware responses from enterprise data sources with citations, for use cases such as IT, human resources (HR), and benefits help desks.

Amazon Q Business also helps streamline tasks and accelerate problem solving. You can use Amazon Q Business to create and share task automation applications or perform routine actions like submitting time-off requests and sending meeting invites. However, Amazon Q Business customers who have already made investments in Amazon Kendra for their enterprise search needs are seeking ways to get RAG-based enhanced semantic search against Amazon Kendra index and save on cost and time.

Amazon Kendra GenAI Index is a new index in Amazon Kendra designed for RAG and intelligent search to help enterprises build digital assistants and intelligent search experiences more efficiently and effectively. This index offers high retrieval accuracy, using advanced semantic models and the latest information retrieval technologies. It can be integrated with Amazon Bedrock Knowledge Bases and other Amazon Bedrock tools to create RAG-powered digital assistants, or it can be used with Amazon Q Business for a fully managed digital assistant solution.

Amazon Kendra GenAI Index addresses common challenges in building retrievers for generative AI assistants, including data ingestion, model selection, and integration with various generative AI tools. Its features include a managed retriever with high semantic accuracy, a hybrid index combining vector and keyword search, pre-optimized parameters, connectors to a variety of enterprise data sources, and metadata-based user permissions filtering.

A single Amazon Kendra GenAI Index can be used across multiple Amazon Q Business applications and Amazon Bedrock Knowledge Bases, benefiting from features such as relevance tuning, document enrichment, and metadata filtering. This new offering joins our existing Amazon Kendra Developer and Enterprise editions, providing customers with more options to meet their specific search needs. This index will support most of the popular features (with some exceptions listed later in this post) such as connectors, user context filtering, metadata support, relevance tuning, and others that customers love to use in Amazon Kendra.

Benefits

Amazon Kendra GenAI Index offers a managed retriever solution that delivers high semantic accuracy for RAG while enabling organizations to use their Amazon Web Services (AWS) generative AI investments across multiple services, through built-in integration with Amazon Bedrock Knowledge Bases and Amazon Q Business, without needing to rebuild indexes for different applications. Amazon Kendra GenAI Index also supports connectors to 43 enterprise data sources such as SharePoint, OneDrive, Google Drive, Salesforce, and others, with integrated metadata-based user permissions filtering, reducing the burden of building custom connectors.

Because Amazon Kendra GenAI Index is a managed RAG option within Amazon Bedrock Knowledge Bases, customers can build generative AI assistants using Amazon Bedrock tooling such as agents and prompt flows. Organizations can select their preferred language models, customize prompts, and manage costs through pay-per-token pricing.

For those seeking a fully managed experience, Amazon Kendra GenAI Index integrates seamlessly with Amazon Q Business, removing the complexity of LLM selection and prompt engineering. Customers can also use a single Amazon Kendra GenAI Index to serve multiple Amazon Q Business applications and Amazon Bedrock Knowledge Bases. As a result, they can index one time and reuse that indexed content across use cases. Additionally, features such as relevance tuning, document enrichment, and metadata filtering enable businesses to optimize content relevance for their specific needs.

Enhanced semantic understanding

Amazon Kendra GenAI Index incorporates significant upgrades to the underlying search and retrieval technologies, along with improved semantic models. These enhancements provide higher accuracy in the retrieval API, making it especially valuable for RAG applications. It offers high accuracy out-of-the-box for search and retrieval use cases, powered by the latest information retrieval technologies, semantic embedding, and reranker models tested across a variety of datasets. The high retrieval accuracy is provided through its hybrid indexing system, which combines vector and keyword search using advanced semantic relevance models with pre-optimized parameters.

Optimized resource management

The Amazon Kendra GenAI Index introduces smaller index units, leading to improved capacity utilization. This optimization enables organizations to manage their search infrastructure more efficiently while maintaining high performance levels. The streamlined architecture reduces operational overhead and allows for more flexible scaling based on actual usage patterns.

Single index seamless integration with AWS services

Amazon Kendra GenAI Index enables organizations to use a single index across the AWS generative AI stack without having to rebuild indexes. Through deep integration with both Amazon Q Business and Amazon Bedrock Knowledge Bases, organizations can choose between a fully managed experience or a customizable approach. The Amazon Q Business integration provides a streamlined path for building generative AI assistants, and Amazon Bedrock Knowledge Bases offers greater control over prompt customization, model selection, and orchestration with pay-per-token pricing. This flexibility allows organizations to adapt their implementation as needs evolve, protecting their investment in content indexing.

How to create and use the Amazon Kendra GenAI Index

As mentioned, you can use Amazon Kendra GenAI Index as a standalone index for search use cases in Amazon Kendra. You can also use the new index as a retriever for Amazon Q Business and as part of Amazon Bedrock Knowledge Bases.

Option 1: Use Amazon Kendra GenAI Index as a standalone index in Amazon Kendra

The steps to create an Amazon Kendra GenAI index are similar to those in Creating an index in the Amazon Kendra Developer Guide.

To get started with Amazon Kendra GenAI Index:

  1. On the Amazon Kendra console, choose Create index.
  2. Select GenAI edition as your index type and choose Next, as shown in the following screenshot.
  3. Choose the defaults under Configure user access control and choose Next, as shown in the following screenshot.
  4. Choose the defaults under Review and create and choose Create, as shown in the following screenshot. (An equivalent AWS CLI sketch follows this list.)
  5. You can validate the index edition by selecting the created index from the list of indexes. On the Settings tab, you can confirm the Edition type.
  6. Your index is now ready to add data sources. In the left navigation pane, choose Data sources, then choose Add data source, as shown in the following screenshot.
  7. Choose Select sample dataset (Amazon S3 data source).
  8. Add a Data source name and keep the defaults. Choose Add data source, as shown in the following screenshot.
  9. It takes a few seconds to propagate the AWS Identity and Access Management (IAM) role. When it’s done, the data source should start syncing automatically; you can also choose Sync now to start the sync explicitly.
  10. After crawling and indexing are complete, the Status column in Sync history shows Completed. Confirm the value under Total items scanned.
  11. Check the search results against the newly created Amazon Kendra GenAI index. Select the newly created index and choose Search indexed content, which opens a user interface to search.
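
If you prefer to script index creation, the console steps above roughly correspond to the following AWS CLI sketch. The index name, IAM role ARN, and the GEN_AI_ENTERPRISE_EDITION edition value are assumptions to verify and adapt for your account.

# Create a GenAI edition index (the edition value and role ARN are placeholders to verify for your account)
aws kendra create-index \
  --name my-genai-index \
  --edition GEN_AI_ENTERPRISE_EDITION \
  --role-arn arn:aws:iam::<account-id>:role/<kendra-index-role> \
  --region us-east-2

# Confirm the edition of the new index
aws kendra describe-index --id <index-id> --region us-east-2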

The following image shows a comparison of the results for the same query against a non-GenAI index. You can observe that the semantic relevancy increased, so the result now appears as part of Amazon Kendra suggested answers. Also, the number of output tokens increased, providing more context and relevance.

You can also visit the Amazon Kendra Developer Guide to learn how to add data sources to your index by using one of the available data sources or adding a document directly to batch upload.
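
You can issue the same kind of query programmatically. The following is a minimal sketch with placeholder values; the Query API returns suggested answers and documents, while the Retrieve API returns semantically relevant passages that are well suited to RAG.

# Search the index for suggested answers and documents
aws kendra query --index-id <index-id> --query-text "<your question>" --region us-east-2

# Retrieve relevant passages for RAG
aws kendra retrieve --index-id <index-id> --query-text "<your question>" --region us-east-2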

Option 2: Use Amazon Kendra GenAI Index as a retriever with Amazon Q Business

One of the main benefits of the Amazon Kendra GenAI Index is the usability of the index across multiple AWS services. In Amazon Q Business, administrators can now use the same Amazon Kendra GenAI index created in the previous steps to attach to an application.

To create an Amazon Q Business application, refer to Creating an Amazon Q Business application environment in the Amazon Q User Guide.

  1. When the Amazon Q Business application is ready, in the left navigation pane, select Data sources, then choose Add an index, as shown in the following screenshot.
  2. Select Use an existing Amazon Kendra index. Under Select an index, notice the newly created GenAI index. (A CLI sketch for creating this retriever programmatically follows these steps.)

NOTE: After adding the Amazon Kendra index as a retriever in your Amazon Q Business application, you can manage the index and add documents and data sources through the Amazon Kendra GenAI Index console.

  3. After the index is attached, open the web experience: in the left navigation pane, select Amazon Q Business, then under Web experience settings, choose Deployed URL, as shown in the following screenshot, to interact with the Amazon Q Business AI assistant.
  4. When you’re in the Amazon Q Business web chat, pose the same question as in the previous steps. This query will use the same Amazon Kendra GenAI index created in Amazon Kendra.
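
If you automate this setup, the same attachment can be expressed with the Amazon Q Business API. The following is a minimal sketch; the application ID, index ID, display name, and role ARN are placeholders, and the role must allow Amazon Q Business to call the Kendra index.

aws qbusiness create-retriever \
  --application-id <application-id> \
  --type KENDRA_INDEX \
  --display-name kendra-genai-retriever \
  --configuration '{"kendraIndexConfiguration":{"indexId":"<kendra-index-id>"}}' \
  --role-arn <role-arn> \
  --region us-east-2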

Option 3: Use Amazon Kendra GenAI Index with Amazon Bedrock Knowledge Bases

Similar to Option 2, you can seamlessly use Amazon Kendra GenAI Index as a data source with Amazon Bedrock Knowledge Bases.

To create an Amazon Bedrock knowledge base, refer to Build a knowledge base by connecting to a data source in the Amazon Bedrock User Guide.

  1. On the Amazon Bedrock console, choose Knowledge Bases, as shown in the following screenshot.
  2. You will be presented with the knowledge base creation screen for Amazon Kendra GenAI Index. Enter the details and select the Amazon Kendra GenAI index you created from the options.
  3. After your knowledge base is created, you can validate that the RAG type is listed as Kendra GenAI Index. To manage data sources, you can choose Add. The Amazon Kendra console will open, where you can manage all data sources for the index.
  4. After the knowledge base is created, select it to test the query. (A CLI sketch for the same test follows this list.)
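
You can run the same test from the CLI against the knowledge base. The following is a minimal sketch with placeholder values, using the Amazon Bedrock Agents runtime Retrieve API.

aws bedrock-agent-runtime retrieve \
  --knowledge-base-id <knowledge-base-id> \
  --retrieval-query '{"text": "<your question>"}' \
  --region us-east-2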

Conclusion

Amazon Kendra GenAI Index represents a significant advancement in enterprise search and retrieval capabilities, offering organizations a streamlined path to implementing effective RAG solutions. Whether organizations choose to use it as a standalone search solution, integrate it with Amazon Q Business, or use it through Amazon Bedrock Knowledge Bases, Amazon Kendra GenAI Index provides the flexibility and efficiency needed to make enterprise content more accessible and actionable.

To know more about Amazon Kendra, visit Amazon Kendra Documentation.

Pricing and availability

For information about the AWS Regions in which Amazon Kendra GenAI Index is available, refer to the Amazon Kendra endpoints and quotas page. For detailed pricing information, visit the Amazon Kendra Pricing page.


About the Authors

Krishna Mudda is a Senior Manager of Generative AI Worldwide Specialist Solutions Architects in the Amazon Q Business team.

Marcel Pividal is a Senior AI Services SA in the Worldwide Specialist Organization, bringing over 22 years of expertise in transforming complex business challenges into innovative technological solutions. As a thought leader in generative AI implementation, he specializes in developing secure, compliant AI architectures for enterprise-scale deployments across multiple industries.

Nikhil Shetty is Senior Product Manager of Amazon Kendra.

Aakash Upadhyay is a Senior Software Engineer at AWS, specializing in building scalable NLP and Generative AI cloud services. Over the past six years, he has contributed to the development and enhancement of products like Amazon Translate, Kendra, and Q-Business.

Vijai Gandikota is a Principal Product Manager on the Amazon Q and Amazon Kendra team of Amazon Web Services. He is responsible for Region expansion, language support, guardrails, ingestion, security, and other aspects of Amazon Q and Amazon Kendra.

Kristy Lin is a Software Development Engineer with Amazon Bedrock Knowledge Bases, helping customers build scalable RAG applications.
