Significant new capabilities make it easier to use Amazon Bedrock to build and scale generative AI applications – and achieve impressive results

We introduced Amazon Bedrock to the world a little over a year ago, delivering an entirely new way to build generative artificial intelligence (AI) applications. With the broadest selection of first- and third-party foundation models (FMs) as well as user-friendly capabilities, Amazon Bedrock is the fastest and easiest way to build and scale secure generative AI applications. Now tens of thousands of customers are using Amazon Bedrock to build and scale impressive applications. They are innovating quickly, easily, and securely to advance their AI strategies. And we’re supporting their efforts by enhancing Amazon Bedrock with exciting new capabilities including even more model choice and features that make it easier to select the right model, customize the model for a specific use case, and safeguard and scale generative AI applications.

Customers across diverse industries, from finance to travel and hospitality to healthcare to consumer technology, are making remarkable progress. They are realizing real business value by quickly moving generative AI applications into production to improve customer experiences and increase operational efficiency. Consider the New York Stock Exchange (NYSE), the world's largest capital market, processing billions of transactions each day. NYSE is leveraging Amazon Bedrock's choice of FMs and cutting-edge generative AI capabilities across several use cases, including the processing of thousands of pages of regulations to provide answers in easy-to-understand language.

Global airline United Airlines modernized their Passenger Service System to translate legacy passenger reservation codes into plain English so that agents can provide swift and efficient customer support. LexisNexis Legal & Professional, a leading global provider of information and analytics, developed a personalized legal generative AI assistant on Lexis+ AI. LexisNexis customers receive trusted results two times faster than the nearest competing product and can save up to five hours per week on legal research and summarization. And HappyFox, an online help desk software provider, selected Amazon Bedrock for its security and performance, boosting the efficiency of its AI-powered automated ticketing system in its customer support solution by 40% and agent productivity by 30%.

And across Amazon, we are continuing to innovate with generative AI to deliver more immersive, engaging experiences for our customers. Just last week, Amazon Music announced Maestro, an AI playlist generator powered by Amazon Bedrock that gives Amazon Music subscribers an easier, more fun way to create playlists based on prompts. Maestro is now rolling out in beta to a small number of U.S. customers on all tiers of Amazon Music.

With Amazon Bedrock, we’re focused on the key areas that customers need to build production-ready, enterprise-grade generative AI applications at the right cost and speed. Today I’m excited to share new features that we’re announcing across the areas of model choice, tools for building generative AI applications, and privacy and security.

1. Amazon Bedrock expands model choice with Llama 3 models and helps you find the best model for your needs

In these early days, customers are still learning and experimenting with different models to determine which ones to use for various purposes. They want to be able to easily try the latest models, and test which capabilities and features will give them the best results and cost characteristics for their use cases. The majority of Amazon Bedrock customers use more than one model, and Amazon Bedrock provides the broadest selection of first- and third-party large language models (LLMs) and other FMs. This includes models from AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, and Stability AI, as well as our own Amazon Titan models. In fact, Joel Hron, Head of AI and Thomson Reuters Labs at Thomson Reuters, recently said this about their adoption of Amazon Bedrock, “Having the ability to use a diverse range of models as they come out was a key driver for us, especially given how quickly this space is evolving.” The cutting-edge models of the Mistral AI family, including Mistral 7B, Mixtral 8x7B, and Mistral Large, have customers excited about their high performance in text generation, summarization, Q&A, and code generation. Since we introduced the Anthropic Claude 3 model family, thousands of customers have experienced how Claude 3 Haiku, Sonnet, and Opus have established new benchmarks across cognitive tasks with unrivaled intelligence, speed, and cost-efficiency. After the initial evaluation using Claude 3 Haiku and Opus in Amazon Bedrock, BlueOcean.ai, a brand intelligence platform, saw a cost reduction of over 50% when they were able to consolidate four separate API calls into a single, more efficient call.

Masahiro Oba, General Manager, Group Federated Governance of DX Platform at Sony Group Corporation, shared,

“While there are many challenges with applying generative AI to the business, Amazon Bedrock’s diverse capabilities help us to tailor generative AI applications to Sony’s business. We are able to take advantage of not only the powerful LLM capabilities of Claude 3, but also capabilities that help us safeguard applications at the enterprise-level. I’m really proud to be working with the Bedrock team to further democratize generative AI within the Sony Group.”

I recently sat down with Aaron Linsky, CTO of Artificial Investment Associate Labs at Bridgewater Associates, a premier asset management firm, where they are using generative AI to enhance their “Artificial Investment Associate,” a major leap forward for their customers. It builds on their experience of giving rules-based expert advice for investment decision-making. With Amazon Bedrock, they can use the best available FMs, such as Claude 3, for different tasks, combining fundamental market understanding with the flexible reasoning capabilities of AI. Amazon Bedrock allows for seamless model experimentation, enabling Bridgewater to build a powerful, self-improving investment system that marries systematic advice with cutting-edge capabilities, creating an evolving, AI-first process.

To bring even more model choice to customers, today, we are making Meta Llama 3 models available in Amazon Bedrock. The Llama 3 8B and Llama 3 70B models are designed for building, experimenting, and responsibly scaling generative AI applications. They offer significant improvements over the previous generation, including scaled-up pretraining and refined instruction fine-tuning. Llama 3 8B excels in text summarization, classification, sentiment analysis, and translation, making it ideal for resource-constrained environments and edge devices. Llama 3 70B shines in content creation, conversational AI, language understanding, R&D, and enterprise use cases, including accurate summarization, nuanced classification and sentiment analysis, language modeling, dialogue systems, code generation, and instruction following. Read more about Meta Llama 3 now available in Amazon Bedrock.
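
As a quick illustration, here is a minimal sketch of invoking Llama 3 8B Instruct through the Bedrock Runtime API with the AWS SDK for Python (Boto3); the model ID, prompt template, and inference parameters are examples to verify against the Amazon Bedrock documentation for your Region:

# Minimal sketch: invoke Llama 3 8B Instruct via the Bedrock Runtime API.
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Llama 3 instruction-style prompt; confirm the exact template in the model documentation.
prompt = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n"
    "Summarize the key benefits of retrieval-augmented generation in two sentences."
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
)

response = bedrock_runtime.invoke_model(
    modelId="meta.llama3-8b-instruct-v1:0",  # assumed model ID; check the console for the exact value
    body=json.dumps({
        "prompt": prompt,
        "max_gen_len": 256,   # maximum tokens to generate
        "temperature": 0.5,
        "top_p": 0.9,
    }),
)

print(json.loads(response["body"].read())["generation"])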

We are also announcing support coming soon for Cohere’s Command R and Command R+ enterprise FMs. These models are highly scalable and optimized for long-context tasks like retrieval-augmented generation (RAG) with citations to mitigate hallucinations, multi-step tool use for automating complex business tasks, and support for 10 languages for global operations. Command R+ is Cohere’s most powerful model optimized for long-context tasks, while Command R is optimized for large-scale production workloads. With the Cohere models coming soon in Amazon Bedrock, businesses can build enterprise-grade generative AI applications that balance strong accuracy and efficiency for day-to-day AI operations beyond proof-of-concept.

Amazon Titan Image Generator now generally available and Amazon Titan Text Embeddings V2 coming soon

In addition to adding the most capable third-party models, we are making Amazon Titan Image Generator generally available today. With Amazon Titan Image Generator, customers in industries like advertising, e-commerce, media, and entertainment can efficiently generate realistic, studio-quality images in large volumes and at low cost, using natural language prompts. They can edit generated or existing images using text prompts, configure image dimensions, or specify the number of image variations to guide the model. By default, every image produced by Amazon Titan Image Generator contains an invisible watermark, which aligns with AWS’s commitment to promoting responsible and ethical AI by reducing the spread of misinformation. The Watermark Detection feature identifies images created by Image Generator, and is designed to be tamper-resistant, helping increase transparency around AI-generated content. Watermark Detection helps mitigate intellectual property risks and enables content creators, news organizations, risk analysts, fraud-detection teams, and others to better identify and mitigate dissemination of misleading AI-generated content. Read more about Watermark Detection for Titan Image Generator.

Coming soon, Amazon Titan Text Embeddings V2 efficiently delivers more relevant responses for critical enterprise use cases like search. Efficient embeddings models are crucial to performance when leveraging RAG to enrich responses with additional information. Embeddings V2 is optimized for RAG workflows and provides seamless integration with Knowledge Bases for Amazon Bedrock to deliver more informative and relevant responses efficiently. Embeddings V2 enables a deeper understanding of data relationships for complex tasks like retrieval, classification, semantic similarity search, and enhancing search relevance. Offering flexible embedding sizes of 256, 512, and 1024 dimensions, Embeddings V2 prioritizes cost reduction while retaining 97% of the accuracy for RAG use cases, outperforming other leading models. Additionally, the flexible embedding sizes cater to diverse application needs, from low-latency mobile deployments to high-accuracy asynchronous workflows.
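
Once the model is available, generating an embedding at a chosen dimension should look roughly like the following sketch; the model ID and request fields are assumptions to confirm against the Amazon Bedrock documentation:

# Minimal sketch: request a 256-dimension embedding from Titan Text Embeddings V2.
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

response = bedrock_runtime.invoke_model(
    modelId="amazon.titan-embed-text-v2:0",  # assumed model ID
    body=json.dumps({
        "inputText": "What is the refund policy for international orders?",
        "dimensions": 256,   # flexible output size: 256, 512, or 1024
        "normalize": True,   # unit-length vectors simplify cosine similarity
    }),
)

embedding = json.loads(response["body"].read())["embedding"]
print(len(embedding))  # expected: 256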

New Model Evaluation simplifies the process of accessing, comparing, and selecting LLMs and FMs

Choosing the appropriate model is a critical first step toward building any generative AI application. LLMs can vary drastically in performance based on the task, domain, data modalities, and other factors. For example, a biomedical model is likely to outperform general healthcare models in specific medical contexts, whereas a coding model may face challenges with natural language processing tasks. Using an excessively powerful model could lead to inefficient resource usage, while an underpowered model might fail to meet minimum performance standards – potentially providing incorrect results. And selecting an unsuitable FM at a project’s onset could undermine stakeholder confidence and trust.

With so many models to choose from, we want to make it easier for customers to pick the right one for their use case.

Amazon Bedrock’s Model Evaluation tool, now generally available, simplifies the selection process by enabling benchmarking and comparison against specific datasets and evaluation metrics, ensuring developers select the model that best aligns with their project goals. This guided experience allows developers to evaluate models across criteria tailored to each use case. Through Model Evaluation, developers select candidate models to assess – public options, imported custom models, or fine-tuned versions. They define relevant test tasks, datasets, and evaluation metrics, such as accuracy, latency, cost projections, and qualitative factors. Read more about Model Evaluation in Amazon Bedrock.

The ability to select from the top-performing FMs in Amazon Bedrock has been extremely beneficial for Elastic Security. James Spiteri, Director of Product Management at Elastic shared,

“With just a few clicks, we can assess a single prompt across multiple models simultaneously. This model evaluation functionality enables us to compare the outputs, metrics, and associated costs across different models, allowing us to make an informed decision on which model would be most suitable for what we are trying to accomplish. This has significantly streamlined our process, saving us a considerable amount of time in deploying our applications to production.”

2. Amazon Bedrock offers capabilities to tailor generative AI to your business needs

While models are incredibly important, it takes more than a model to build an application that is useful for an organization. That’s why Amazon Bedrock has capabilities to help you easily tailor generative AI solutions to specific use cases. Customers can use their own data to privately customize applications through fine-tuning or by using Knowledge Bases for a fully managed RAG experience to deliver more relevant, accurate, and customized responses. Agents for Amazon Bedrock allows developers to define specific tasks, workflows, or decision-making processes, enhancing control and automation while ensuring consistent alignment with an intended use case. Starting today, you can now use Agents with Anthropic Claude 3 Haiku and Sonnet models. We are also introducing an updated AWS console experience, supporting a simplified schema and return of control to make it easy for developers to get started. Read more about Agents for Amazon Bedrock, now faster and easier to use.
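
At runtime, an application typically hands the user's request to an agent and streams back the completion. The following is a minimal sketch using the Bedrock Agents runtime API; the agent ID, alias ID, and prompt are placeholder values:

# Minimal sketch: invoke an existing agent and assemble the streamed response.
import uuid
import boto3

agents_runtime = boto3.client("bedrock-agent-runtime")

response = agents_runtime.invoke_agent(
    agentId="AGENT_ID",             # placeholder
    agentAliasId="AGENT_ALIAS_ID",  # placeholder
    sessionId=str(uuid.uuid4()),    # reuse the same ID across turns to preserve context
    inputText="Create a summary of open support tickets from the last 24 hours.",
)

completion = ""
for event in response["completion"]:   # the completion is returned as an event stream
    if "chunk" in event:
        completion += event["chunk"]["bytes"].decode("utf-8")
print(completion)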

With new Custom Model Import, customers can leverage the full capabilities of Amazon Bedrock with their own models

All these features are essential to building generative AI applications, which is why we wanted to make them available to even more customers, including those who have already invested significant resources in fine-tuning LLMs with their own data on different services or in training custom models from scratch. Many customers have customized models available on Amazon SageMaker, which provides the broadest array of over 250 pre-trained FMs. These FMs include cutting-edge models such as Mistral, Llama 2, Code Llama, Jurassic-2, Jamba, pplx-7B and 70B, and the impressive Falcon 180B. Amazon SageMaker helps with getting data organized and fine-tuned, building scalable and efficient training infrastructure, and then deploying models at scale in a low-latency, cost-efficient manner. It has been a game changer for developers in preparing their data for AI, managing experiments, training models faster (e.g., Perplexity AI trains models 40% faster in Amazon SageMaker), lowering inference latency (e.g., Workday has reduced inference latency by 80% with Amazon SageMaker), and improving developer productivity (e.g., NatWest reduced its time-to-value for AI from 12–18 months to under seven months using Amazon SageMaker). However, operationalizing these customized models securely and integrating them into applications for specific business use cases still presents challenges.

That is why today we’re introducing Amazon Bedrock Custom Model Import, which enables organizations to leverage their existing AI investments along with Amazon Bedrock’s capabilities. With Custom Model Import, customers can now import and access their own custom models built on popular open model architectures, including Flan-T5, Llama, and Mistral, as a fully managed application programming interface (API) in Amazon Bedrock. Customers can take models that they customized on Amazon SageMaker, or other tools, and easily add them to Amazon Bedrock. After an automated validation, they can seamlessly access their custom model, as with any other model in Amazon Bedrock. They get all the same benefits, including seamless scalability, powerful capabilities to safeguard their applications, and adherence to responsible AI principles, as well as the ability to expand a model’s knowledge base with RAG, easily create agents to complete multi-step tasks, and carry out fine-tuning to keep teaching and refining models. All without needing to manage the underlying infrastructure.

With this new capability, we’re making it easy for organizations to choose a combination of Amazon Bedrock models and their own custom models while maintaining the same streamlined development experience. Today, Amazon Bedrock Custom Model Import is available in preview and supports three of the most popular open model architectures, with plans for more in the future. Read more about Custom Model Import for Amazon Bedrock.
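
Once an imported model passes validation, it can be invoked like any other Bedrock model by referencing its ARN. The following is a minimal sketch; the ARN is a placeholder and the request body format depends on your model's architecture:

# Minimal sketch: invoke a model brought in through Custom Model Import.
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Placeholder ARN; use the ARN shown for your imported model in the Bedrock console.
imported_model_arn = "arn:aws:bedrock:us-east-1:111122223333:imported-model/EXAMPLE_ID"

response = bedrock_runtime.invoke_model(
    modelId=imported_model_arn,
    body=json.dumps({
        "prompt": "Explain our Q3 churn trend in plain language.",
        "max_tokens": 256,  # field names vary by model architecture; verify for your model
    }),
)
print(json.loads(response["body"].read()))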

ASAPP is a generative AI company with a 10-year history of building ML models.

“Our conversational generative AI voice and chat agent leverages these models to redefine the customer service experience. To give our customers end to end automation, we need LLM agents, knowledge base, and model selection flexibility. With Custom Model Import, we will be able to use our existing custom models in Amazon Bedrock. Bedrock will allow us to onboard our customers faster, increase our pace of innovation, and accelerate time to market for new product capabilities.”

– Priya Vijayarajendran, President, Technology.

3. Amazon Bedrock provides a secure and responsible foundation to implement safeguards easily

As generative AI capabilities progress and expand, building trust and addressing ethical concerns becomes even more important. Amazon Bedrock addresses these concerns by leveraging AWS’s secure and trustworthy infrastructure with industry-leading security measures, robust data encryption, and strict access controls.

Guardrails for Amazon Bedrock, now generally available, helps customers prevent harmful content and manage sensitive information within an application.

We also offer Guardrails for Amazon Bedrock, which is now generally available. Guardrails offers industry-leading safety protection, giving customers the ability to define content policies, set application behavior boundaries, and implement safeguards against potential risks. Guardrails for Amazon Bedrock is the only solution offered by a major cloud provider that enables customers to build and customize safety and privacy protections for their generative AI applications in a single solution. It helps customers block as much as 85% more harmful content than protection natively provided by FMs on Amazon Bedrock. Guardrails provides comprehensive support for harmful content filtering and robust personally identifiable information (PII) detection capabilities. Guardrails works with all LLMs in Amazon Bedrock as well as fine-tuned models, driving consistency in how models respond to undesirable and harmful content. You can configure thresholds to filter content across six categories: hate, insults, sexual, violence, misconduct (including criminal activity), and prompt attack (jailbreak and prompt injection). You can also define a set of topics or words that need to be blocked in your generative AI application, including harmful words, profanity, competitor names, and products. For example, a banking application can configure a guardrail to detect and block topics related to investment advice. A contact center application summarizing call center transcripts can use PII redaction to remove PII from call summaries, or a conversational chatbot can use content filters to block harmful content. Read more about Guardrails for Amazon Bedrock.
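
A guardrail created in the console or through the API can then be attached to inference requests. The following is a minimal sketch with the AWS SDK for Python; the guardrail identifier, version, and model ID are placeholder values:

# Minimal sketch: attach an existing guardrail to an inference call.
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

response = bedrock_runtime.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # assumed model ID
    guardrailIdentifier="GUARDRAIL_ID",  # placeholder: ID of a guardrail you created
    guardrailVersion="1",                # placeholder: published guardrail version
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 256,
        "messages": [{"role": "user", "content": "Which stocks should I buy this week?"}],
    }),
)
# A banking guardrail configured to block investment advice would intervene here.
print(json.loads(response["body"].read()))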

Aha!, a software company that helps more than 1 million people bring their product strategy to life, uses Amazon Bedrock to power many of its generative AI capabilities.

“We have full control over our information through Amazon Bedrock’s data protection and privacy policies, and can block harmful content through Guardrails for Amazon Bedrock. We just built on it to help product managers discover insights by analyzing feedback submitted by their customers. This is just the beginning. We will continue to build on advanced AWS technology to help product development teams everywhere prioritize what to build next with confidence.”

With even more choice of leading FMs, and features that help you evaluate models, safeguard applications, and leverage your prior investments in AI along with the capabilities of Amazon Bedrock, today’s launches make it even easier and faster for customers to build and scale generative AI applications. This blog post highlights only a subset of the new features. You can learn more about everything we’ve launched in the resources accompanying this post, including asking questions and summarizing data from a single document without setting up a vector database in Knowledge Bases, and the general availability of support for multiple data sources with Knowledge Bases.

Early adopters leveraging Amazon Bedrock’s capabilities are gaining a crucial head start – driving productivity gains, fueling ground-breaking discoveries across domains, and delivering enhanced customer experiences that foster loyalty and engagement. I’m excited to see what our customers will do next with these new capabilities.

As my mentor Werner Vogels always says “Now Go Build” and I’ll add “…with Amazon Bedrock!”

Resources

Check out the following resources to learn more about this announcement:


About the author

Swami Sivasubramanian is Vice President of Data and Machine Learning at AWS. In this role, Swami oversees all AWS Database, Analytics, and AI & Machine Learning services. His team’s mission is to help organizations put their data to work with a complete, end-to-end data solution to store, access, analyze, visualize, and predict.

Read More

Building scalable, secure, and reliable RAG applications using Knowledge Bases for Amazon Bedrock

Generative artificial intelligence (AI) has gained significant momentum with organizations actively exploring its potential applications. As successful proof-of-concepts transition into production, organizations are increasingly in need of enterprise scalable solutions. However, to unlock the long-term success and viability of these AI-powered solutions, it is crucial to align them with well-established architectural principles.

The AWS Well-Architected Framework provides best practices and guidelines for designing and operating reliable, secure, efficient, and cost-effective systems in the cloud. Aligning generative AI applications with this framework is essential for several reasons, including providing scalability, maintaining security and privacy, achieving reliability, optimizing costs, and streamlining operations. Embracing these principles is critical for organizations seeking to use the power of generative AI and drive innovation.

This post explores the new enterprise-grade features for Knowledge Bases on Amazon Bedrock and how they align with the AWS Well-Architected Framework. With Knowledge Bases for Amazon Bedrock, you can quickly build applications using Retrieval Augmented Generation (RAG) for use cases like question answering, contextual chatbots, and personalized search.

Here are some of the features we will cover:

  1. AWS CloudFormation support
  2. Private network policies for Amazon OpenSearch Serverless
  3. Multiple S3 buckets as data sources
  4. Service Quotas support
  5. Hybrid search, metadata filters, custom prompts for the RetrieveAndGenerate API, and maximum number of retrievals.

AWS Well-Architected design principles

RAG-based applications built using Knowledge Bases for Amazon Bedrock can greatly benefit from following the AWS Well-Architected Framework. This framework has six pillars that help organizations make sure their applications are secure, high-performing, resilient, efficient, cost-effective, and sustainable:

  • Operational Excellence – Well-Architected principles streamline operations, automate processes, and enable continuous monitoring and improvement of generative AI app performance.
  • Security – Implementing strong access controls, encryption, and monitoring helps secure sensitive data used in your organization’s knowledge base and prevent misuse of generative AI.
  • Reliability – Well-Architected principles guide the design of resilient and fault-tolerant systems, providing consistent value delivery to users.
  • Performance Efficiency – Choosing the appropriate resources, implementing caching strategies, and proactively monitoring performance metrics ensure that applications deliver fast and accurate responses, leading to optimal performance and an enhanced user experience.
  • Cost Optimization – Well-Architected guidelines assist in optimizing resource usage, using cost-saving services, and monitoring expenses, resulting in long-term viability of generative AI projects.
  • Sustainability – Well-Architected principles promote efficient resource utilization and minimizing carbon footprints, addressing the environmental impact of growing generative AI usage.

By aligning with the Well-Architected Framework, organizations can effectively build and manage enterprise-grade RAG applications using Knowledge Bases for Amazon Bedrock. Now, let’s dive deep into the new features launched within Knowledge Bases for Amazon Bedrock.

AWS CloudFormation support

For organizations building RAG applications, it’s important to provide efficient and effective operations and consistent infrastructure across different environments. This can be achieved by implementing practices such as automating deployment processes. To accomplish this, Knowledge Bases for Amazon Bedrock now offers support for AWS CloudFormation.

With AWS CloudFormation and the AWS Cloud Development Kit (AWS CDK), you can now create, update, and delete knowledge bases and associated data sources. Adopting AWS CloudFormation and the AWS CDK for managing knowledge bases and associated data sources not only streamlines the deployment process, but also promotes adherence to the Well-Architected principles. By performing operations (applications, infrastructure) as code, you can provide consistent and reliable deployments in multiple AWS accounts and AWS Regions, and maintain versioned and auditable infrastructure configurations.

The following is a sample CloudFormation script in JSON format for creating and updating a knowledge base in Amazon Bedrock:

{
    "Type" : "AWS::Bedrock::KnowledgeBase",
    "Properties" : {
        "Name": String,
        "RoleArn": String,
        "Description": String,
        "KnowledgeBaseConfiguration": {
            "Type" : String,
            "VectorKnowledgeBaseConfiguration" : VectorKnowledgeBaseConfiguration
        },
        "StorageConfiguration": StorageConfiguration
    }
}

Type specifies a knowledge base as a resource in a top-level template. Minimally, you must specify the following properties:

  • Name – Specify a name for the knowledge base.
  • RoleArn – Specify the Amazon Resource Name (ARN) of the AWS Identity and Access Management (IAM) role with permissions to invoke API operations on the knowledge base. For more information, see Create a service role for Knowledge bases for Amazon Bedrock.
  • KnowledgeBaseConfiguration – Specify the embeddings configuration of the knowledge base. The following sub-properties are required:
    • Type – Specify the value VECTOR.
    • VectorKnowledgeBaseConfiguration – Contains details about the model used to create vector embeddings for the knowledge base.
  • StorageConfiguration – Specify information about the vector store in which the data source is stored. The following sub-properties are required:
    • Type – Specify the vector store service that you are using.
    • You also need to select one of the vector stores supported by Knowledge Bases, such as Amazon OpenSearch Serverless, Pinecone, or Amazon Aurora PostgreSQL, and provide the configuration for the selected vector store.

For details on all the fields and providing configuration of various vector stores supported by Knowledge Bases for Amazon Bedrock, refer to AWS::Bedrock::KnowledgeBase.

Redis Enterprise Cloud vector stores are not supported in AWS CloudFormation as of this writing. For the latest information, refer to the documentation linked above.

After you create a knowledge base, you need to create a data source from the Amazon Simple Storage Service (Amazon S3) bucket containing the files for your knowledge base. The AWS::Bedrock::DataSource resource calls the CreateDataSource and DeleteDataSource APIs.

The following is the sample CloudFormation script in JSON format:

{
    "Type" : "AWS::Bedrock::DataSource",
    "Properties" : {
        "KnowledgeBaseId": String,
        "Name": String,
        "RoleArn": String,
        "Description": String,
        "DataSourceConfiguration": {
            "S3Configuration" : S3DataSourceConfiguration,
            "Type" : String
        },
        "ServerSideEncryptionConfiguration": ServerSideEncryptionConfiguration,
        "VectorIngestionConfiguration": VectorIngestionConfiguration
    }
}

Type specifies a data source as a resource in a top-level template. Minimally, you must specify the following properties:

  • Name – Specify a name for the data source.
  • KnowledgeBaseId – Specify the ID of the knowledge base for the data source to belong to.
  • DataSourceConfiguration – Specify information about the S3 bucket containing the data source. The following sub-properties are required:
    • Type – Specify the value S3.
    • S3Configuration – Contains details about the configuration of the S3 object containing the data source.
  • VectorIngestionConfiguration – Contains details about how to ingest the documents in a data source. You need to provide “ChunkingConfiguration” where you can define your chunking strategy.
  • ServerSideEncryptionConfiguration – Contains the configuration for server-side encryption, where you can provide the Amazon Resource Name (ARN) of the AWS KMS key used to encrypt the resource.

For more information about setting up data sources in Amazon Bedrock, see Set up a data source for your knowledge base.

Note: You cannot change the chunking configuration after you create the data source.

The CloudFormation template allows you to define and manage your knowledge base resources using infrastructure as code (IaC). By automating the setup and management of the knowledge base, you can provide a consistent infrastructure across different environments. This approach aligns with the Operational Excellence pillar, which emphasizes performing operations as code. By treating your entire workload as code, you can automate processes, create consistent responses to events, and ultimately reduce human errors.
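
For example, a template containing the AWS::Bedrock::KnowledgeBase and AWS::Bedrock::DataSource resources shown earlier can be deployed programmatically. The following is a minimal sketch using Boto3; the file and stack names are placeholders:

# Minimal sketch: deploy a knowledge base template with the CloudFormation API.
import boto3

cloudformation = boto3.client("cloudformation")

# Placeholder file name for a template with the Bedrock knowledge base resources.
with open("knowledge-base.json") as f:
    template_body = f.read()

cloudformation.create_stack(
    StackName="bedrock-knowledge-base",     # placeholder stack name
    TemplateBody=template_body,
    Capabilities=["CAPABILITY_NAMED_IAM"],  # needed if the stack also creates the service role
)

# Block until the stack, and the knowledge base it defines, is ready.
cloudformation.get_waiter("stack_create_complete").wait(StackName="bedrock-knowledge-base")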

Private network policies for Amazon OpenSearch Serverless

For companies building RAG applications, it’s critical that the data remains secure and the network traffic does not go to public internet. To support this, Knowledge Bases for Amazon Bedrock now supports private network policies for Amazon OpenSearch Serverless.

Knowledge Bases for Amazon Bedrock provides an option for using OpenSearch Serverless as a vector store. You can now access OpenSearch Serverless collections that have a private network policy, which further enhances the security posture for your RAG application. To achieve this, you need to create an OpenSearch Serverless collection and configure it for private network access. First, create a vector index within the collection to store the embeddings. Then, while creating the collection, set Network access settings to Private and specify the VPC endpoint for access. Importantly, you can now provide private network access to OpenSearch Serverless collections specifically for Amazon Bedrock. To do this, select AWS service private access and specify bedrock.amazonaws.com as the service.

This private network configuration makes sure that your embeddings are stored securely and are only accessible by Amazon Bedrock, enhancing the overall security and privacy of your knowledge bases. It aligns closely with the Security Pillar of controlling traffic at all layers, because all network traffic is kept within the AWS backbone with these settings.
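
The same private network policy can also be created programmatically. The following is a minimal sketch using Boto3; the collection name, VPC endpoint ID, and the exact policy document shape are assumptions to verify against the OpenSearch Serverless documentation:

# Minimal sketch: create a private network policy for an OpenSearch Serverless
# collection that allows access only from a VPC endpoint and from Amazon Bedrock.
import json
import boto3

aoss = boto3.client("opensearchserverless")

policy = [{
    "Rules": [
        {"ResourceType": "collection", "Resource": ["collection/kb-embeddings"]},  # placeholder collection name
    ],
    "AllowFromPublic": False,
    "SourceVPCEs": ["vpce-0123456789abcdef0"],     # placeholder VPC endpoint
    "SourceServices": ["bedrock.amazonaws.com"],   # AWS service private access for Bedrock
}]

aoss.create_security_policy(
    name="kb-embeddings-network-policy",
    type="network",
    policy=json.dumps(policy),
)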

So far, we have explored the automation of creating, deleting, and updating knowledge base resources and the enhanced security through private network policies for OpenSearch Serverless to store vector embeddings securely. Now, let’s understand how to build more reliable, comprehensive, and cost-optimized RAG applications.

Multiple S3 buckets as data sources

Knowledge Bases for Amazon Bedrock now supports adding multiple S3 buckets as data sources within a single knowledge base, including cross-account access. This enhancement increases the knowledge base’s comprehensiveness and accuracy by allowing users to aggregate and use information from various sources seamlessly.

The following are key features:

  • Multiple S3 buckets – Knowledge Bases for Amazon Bedrock can now incorporate data from multiple S3 buckets, enabling users to combine and use information from different sources effortlessly. This feature promotes data diversity and makes sure that relevant information is readily available for RAG-based applications.
  • Cross-account data access – Knowledge Bases for Amazon Bedrock supports the configuration of S3 buckets as data sources across different accounts. You can provide the necessary credentials to access these data sources, expanding the range of information that can be incorporated into their knowledge bases.
  • Efficient data management – When a data source or knowledge base is deleted, the related or existing items in the vector stores are automatically removed. This feature makes sure that the knowledge base remains up to date and free from obsolete or irrelevant data, maintaining the integrity and accuracy of the RAG process.

By supporting multiple S3 buckets as data sources, the need for creating multiple knowledge bases or redundant data copies is eliminated, thereby optimizing cost and promoting cloud financial management. Furthermore, the cross-account access capabilities enable the development of resilient architectures, aligning with the Reliability pillar of the AWS Well-Architected Framework, providing high availability and fault tolerance.
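
As a rough sketch, attaching several buckets to one knowledge base amounts to creating one data source per bucket; the knowledge base ID and bucket ARNs below are placeholders, and field names should be verified against the CreateDataSource API reference:

# Minimal sketch: register two S3 buckets as data sources of one knowledge base.
import boto3

bedrock_agent = boto3.client("bedrock-agent")

for name, bucket_arn in [
    ("product-docs", "arn:aws:s3:::product-docs-bucket"),
    ("support-faqs", "arn:aws:s3:::support-faqs-bucket"),  # this bucket could live in another account
]:
    bedrock_agent.create_data_source(
        knowledgeBaseId="KB_ID",  # placeholder
        name=name,
        dataSourceConfiguration={
            "type": "S3",
            "s3Configuration": {"bucketArn": bucket_arn},
        },
    )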

Other recently announced features for Knowledge Bases

To further enhance the reliability of your RAG application, Knowledge Bases for Amazon Bedrock now extends support for Service Quotas. This feature provides a single pane of glass to view applied AWS quota values and usage. For example, you now have quick access to information such as the allowed number of RetrieveAndGenerate API requests per second.

This feature allows you to effectively manage resource quotas, prevent overprovisioning, and limit API request rates to safeguard services from potential abuse.
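
Applied quota values can also be read programmatically. The following is a minimal sketch using the Service Quotas API; the service code and quota name filter are assumptions to verify in your account:

# Minimal sketch: list applied Bedrock quota values and pick out RetrieveAndGenerate limits.
import boto3

service_quotas = boto3.client("service-quotas")

paginator = service_quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="bedrock"):  # assumed service code for Amazon Bedrock
    for quota in page["Quotas"]:
        if "RetrieveAndGenerate" in quota["QuotaName"]:
            print(quota["QuotaName"], quota["Value"])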

You can also enhance your application’s performance by using recently announced features like hybrid search, filtering based on metadata, custom prompts for the RetrieveAndGenerate API, and maximum number of retrievals. These features collectively improve the accuracy, relevance, and consistency of generated responses, and align with the Performance Efficiency pillar of the AWS Well-Architected Framework.
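
To give a sense of how these options come together, the following is a minimal sketch of a RetrieveAndGenerate call that combines hybrid search, a metadata filter, and a custom number of retrieved results; the IDs, model ARN, and metadata key are placeholders, and parameter names should be checked against the API reference:

# Minimal sketch: RetrieveAndGenerate with hybrid search, metadata filtering,
# and a custom number of retrieved chunks.
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

response = agent_runtime.retrieve_and_generate(
    input={"text": "What is our parental leave policy?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB_ID",  # placeholder
            "modelArn": "arn:aws:bedrock:us-west-2::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults": 10,           # maximum number of retrieved chunks
                    "overrideSearchType": "HYBRID",  # combine semantic and keyword search
                    "filter": {"equals": {"key": "department", "value": "HR"}},  # placeholder metadata key
                }
            },
        },
    },
)
print(response["output"]["text"])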

Knowledge Bases for Amazon Bedrock aligns with the Sustainability pillar of the AWS Well-Architected Framework by using managed services and optimizing resource utilization. As a fully managed service, Knowledge Bases for Amazon Bedrock removes the burden of provisioning, managing, and scaling the underlying infrastructure, thereby reducing the environmental impact associated with operating and maintaining these resources.

Additionally, by aligning with the AWS Well-Architected principles, organizations can design and operate their RAG applications in a sustainable manner. Practices such as automating deployments through AWS CloudFormation, implementing private network policies for secure data access, and using efficient services like OpenSearch Serverless contribute to minimizing the environmental impact of these workloads.

Overall, Knowledge Bases for Amazon Bedrock, combined with the AWS Well-Architected Framework, empowers organizations to build scalable, secure, and reliable RAG applications while prioritizing environmental sustainability through efficient resource utilization and the adoption of managed services.

Conclusion

The new enterprise-grade features, such as AWS CloudFormation support, private network policies, the ability to use multiple S3 buckets as data sources, and support for Service Quotas, make it straightforward to build scalable, secure, and reliable RAG applications with Knowledge Bases for Amazon Bedrock. Using AWS managed services and following Well-Architected best practices allows organizations to focus on delivering innovative generative AI solutions while providing operational excellence, robust security, and efficient resource utilization. As you build applications on AWS, aligning RAG applications with the AWS Well-Architected Framework provides a solid foundation for building enterprise-grade solutions that drive business value while adhering to industry standards.

For additional resources, refer to the following:


About the authors

Mani Khanuja is a Tech Lead – Generative AI Specialists, author of the book Applied Machine Learning and High Performance Computing on AWS, and a member of the Board of Directors for Women in Manufacturing Education Foundation Board. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.

Nitin Eusebius is a Sr. Enterprise Solutions Architect at AWS, experienced in Software Engineering, Enterprise Architecture, and AI/ML. He is deeply passionate about exploring the possibilities of generative AI. He collaborates with customers to help them build well-architected applications on the AWS platform, and is dedicated to solving technology challenges and assisting with their cloud journey.

Pallavi Nargund is a Principal Solutions Architect at AWS. In her role as a cloud technology enabler, she works with customers to understand their goals and challenges, and give prescriptive guidance to achieve their objective with AWS offerings. She is passionate about women in technology and is a core member of Women in AI/ML at Amazon. She speaks at internal and external conferences such as AWS re:Invent, AWS Summits, and webinars. Outside of work she enjoys volunteering, gardening, cycling and hiking.

Read More

Integrate HyperPod clusters with Active Directory for seamless multi-user login

Integrate HyperPod clusters with Active Directory for seamless multi-user login

Amazon SageMaker HyperPod is purpose-built to accelerate foundation model (FM) training, removing the undifferentiated heavy lifting involved in managing and optimizing a large training compute cluster. With SageMaker HyperPod, you can train FMs for weeks and months without disruption.

Typically, HyperPod clusters are used by multiple users: machine learning (ML) researchers, software engineers, data scientists, and cluster administrators. They edit their own files, run their own jobs, and want to avoid impacting each other’s work. To achieve this multi-user environment, you can take advantage of Linux’s user and group mechanism and statically create multiple users on each instance through lifecycle scripts. The drawback to this approach, however, is that user and group settings are duplicated across multiple instances in the cluster, making it difficult to configure them consistently on all instances, such as when a new team member joins.

To solve this pain point, we can use Lightweight Directory Access Protocol (LDAP) and LDAP over TLS/SSL (LDAPS) to integrate with a directory service such as AWS Directory Service for Microsoft Active Directory. With the directory service, you can centrally maintain users and groups, and their permissions.

In this post, we introduce a solution to integrate HyperPod clusters with AWS Managed Microsoft AD, and explain how to achieve a seamless multi-user login environment with a centrally maintained directory.

Solution overview

The solution uses the following AWS services and resources:

We also use AWS CloudFormation to deploy a stack to create the prerequisites for the HyperPod cluster: VPC, subnets, security group, and Amazon FSx for Lustre volume.

The following diagram illustrates the high-level solution architecture.

Architecture diagram for HyperPod and Active Directory integration

In this solution, HyperPod cluster instances use the LDAPS protocol to connect to the AWS Managed Microsoft AD via an NLB. We use TLS termination by installing a certificate to the NLB. To configure LDAPS in HyperPod cluster instances, the lifecycle script installs and configures System Security Services Daemon (SSSD)—an open source client software for LDAP/LDAPS.

Prerequisites

This post assumes you already know how to create a basic HyperPod cluster without SSSD. For more details on how to create HyperPod clusters, refer to Getting started with SageMaker HyperPod and the HyperPod workshop.

Also, in the setup steps, you will use a Linux machine to generate a self-signed certificate and obtain an obfuscated password for the AD reader user. If you don’t have a Linux machine, you can create an EC2 Linux instance or use AWS CloudShell.

Create a VPC, subnets, and a security group

Follow the instructions in the Own Account section of the HyperPod workshop. You will deploy a CloudFormation stack and create prerequisite resources such as VPC, subnets, security group, and FSx for Lustre volume. You need to create both a primary subnet and backup subnet when deploying the CloudFormation stack, because AWS Managed Microsoft AD requires at least two subnets with different Availability Zones.

In this post, for simplicity, we use the same VPC, subnets, and security group for both the HyperPod cluster and directory service. If you need to use different networks between the cluster and directory service, make sure security groups and route tables are configured so that they can communicate with each other.

Create AWS Managed Microsoft AD on Directory Service

Complete the following steps to set up your directory:

  1. On the Directory Service console, choose Directories in the navigation pane.
  2. Choose Set up directory.
  3. For Directory type, select AWS Managed Microsoft AD.
  4. Choose Next.
  5. For Edition, select Standard Edition.
  6. For Directory DNS name, enter your preferred directory DNS name (for example, hyperpod.abc123.com).
  7. For Admin password, set a password and save it for later use.
  8. Choose Next.
  9. In the Networking section, specify the VPC and two private subnets you created.
  10. Choose Next.
  11. Review the configuration and pricing, then choose Create directory.
    The directory creation starts. Wait until the status changes from Creating to Active, which can take 20–30 minutes.
  12. When the status changes to Active, open the detail page of the directory and take note of the DNS addresses for later use.

Create an NLB in front of Directory Service

To create the NLB, complete the following steps:

  1. On the Amazon EC2 console, choose Target groups in the navigation pane.
  2. Choose Create target groups.
  3. Create a target group with the following parameters:
    1. For Choose a target type, select IP addresses.
    2. For Target group name, enter LDAP.
    3. For Protocol: Port, choose TCP and enter 389.
    4. For IP address type, select IPv4.
    5. For VPC, choose SageMaker HyperPod VPC (which you created with the CloudFormation template).
    6. For Health check protocol, choose TCP.
  4. Choose Next.
  5. In the Register targets section, register the directory service’s DNS addresses as the targets.
  6. For Ports, choose Include as pending below. The addresses are added in the Review targets section with Pending status.
  7. Choose Create target group.
  8. On the Load Balancers console, choose Create load balancer.
  9. Under Network Load Balancer, choose Create.
  10. Configure an NLB with the following parameters:
    1. For Load balancer name, enter a name (for example, nlb-ds).
    2. For Scheme, select Internal.
    3. For IP address type, select IPv4.
    4. For VPC, choose SageMaker HyperPod VPC (which you created with the CloudFormation template).
    5. Under Mappings, select the two private subnets and their CIDR ranges (which you created with the CloudFormation template).
    6. For Security groups, choose CfStackName-SecurityGroup-XYZXYZ (which you created with the CloudFormation template).
  11. In the Listeners and routing section, specify the following parameters:
    1. For Protocol, choose TCP.
    2. For Port, enter 389.
    3. For Default action, choose the target group named LDAP.

    Here, we are adding a listener for LDAP. We will add LDAPS later.

  12. Choose Create load balancer. Wait until the status changes from Provisioning to Active, which can take 3–5 minutes.
  13. When the status changes to Active, open the detail page of the provisioned NLB and take note of the DNS name (xyzxyz.elb.region-name.amazonaws.com) for later use.

Create a self-signed certificate and import it to Certificate Manager

To create a self-signed certificate, complete the following steps:

  1. On your Linux-based environment (local laptop, EC2 Linux instance, or CloudShell), run the following OpenSSL commands to create a self-signed certificate and private key:
    $ openssl genrsa 2048 > ldaps.key
    
    $ openssl req -new -key ldaps.key -out ldaps_server.csr
    
    You are about to be asked to enter information that will be incorporated
    into your certificate request.
    What you are about to enter is what is called a Distinguished Name or a DN.
    There are quite a few fields but you can leave some blank
    For some fields there will be a default value,
    If you enter '.', the field will be left blank.
    -----
    Country Name (2 letter code) [AU]:US
    State or Province Name (full name) [Some-State]:Washington
    Locality Name (eg, city) []:Bellevue
    Organization Name (eg, company) [Internet Widgits Pty Ltd]:CorpName
    Organizational Unit Name (eg, section) []:OrgName
    Common Name (e.g., server FQDN or YOUR name) []:nlb-ds-abcd1234.elb.region.amazonaws.com
    Email Address []:your@email.address.com
    
    Please enter the following 'extra' attributes
    to be sent with your certificate request
    A challenge password []:
    An optional company name []:
    
    $ openssl x509 -req -sha256 -days 365 -in ldaps_server.csr -signkey ldaps.key -out ldaps.crt
    
    Certificate request self-signature ok
    subject=C = US, ST = Washington, L = Bellevue, O = CorpName, OU = OrgName, CN = nlb-ds-abcd1234.elb.region.amazonaws.com, emailAddress = your@email.address.com
    
    $ chmod 600 ldaps.key

  2. On the Certificate Manager console, choose Import.
  3. Enter the certificate body and private key, from the contents of ldaps.crt and ldaps.key respectively.
  4. Choose Next.
  5. Add any optional tags, then choose Next.
  6. Review the configuration and choose Import.

Add an LDAPS listener

We added a listener for LDAP already in the NLB. Now we add a listener for LDAPS with the imported certificate. Complete the following steps:

  1. On the Load Balancers console, navigate to the NLB details page.
  2. On the Listeners tab, choose Add listener.
  3. Configure the listener with the following parameters:
    1. For Protocol, choose TLS.
    2. For Port, enter 636.
    3. For Default action, choose LDAP.
    4. For Certificate source, select From ACM.
    5. For Certificate, enter what you imported in ACM.
  4. Choose Add. Now the NLB listens to both LDAP and LDAPS. It is recommended to delete the LDAP listener because it transmits data without encryption, unlike LDAPS.

Create an EC2 Windows instance to administer users and groups in the AD

To create and maintain users and groups in the AD, complete the following steps:

  1. On the Amazon EC2 console, choose Instances in the navigation pane.
  2. Choose Launch instances.
  3. For Name, enter a name for your instance.
  4. For Amazon Machine Image, choose Microsoft Windows Server 2022 Base.
  5. For Instance type, choose t2.micro.
  6. In the Network settings section, provide the following parameters:
    1. For VPC, choose SageMaker HyperPod VPC (which you created with the CloudFormation template).
    2. For Subnet, choose either of two subnets you created with the CloudFormation template.
    3. For Common security groups, choose CfStackName-SecurityGroup-XYZXYZ (which you created with the CloudFormation template).
  7. For Configure storage, set storage to 30 GB gp2.
  8. In the Advanced details section, for Domain join directory, choose the AD you created.
  9. For IAM instance profile, choose an AWS Identity and Access Management (IAM) role with at least the AmazonSSMManagedEC2InstanceDefaultPolicy policy.
  10. Review the summary and choose Launch instance.

Create users and groups in AD using the EC2 Windows instance

With Remote Desktop, connect to the EC2 Windows instance you created in the previous step. Using an RDP client is recommended over using a browser-based Remote Desktop so that you can exchange the contents of the clipboard with your local machine using copy-paste operations. For more details about connecting to EC2 Windows instances, refer to Connect to your Windows instance.

If you are prompted for a login credential, use hyperpod\Admin (where hyperpod is the first part of your directory DNS name) as the user name, and use the admin password you set for the directory service.

  1. When the Windows desktop screen opens, choose Server Manager from the Start menu.
  2. Choose Local Server in the navigation pane, and confirm that the domain is what you specified to the directory service.
  3. On the Manage menu, choose Add Roles and Features.
  4. Choose Next until you are at the Features page.
  5. Expand the feature Remote Server Administration Tools, expand Role Administration Tools, and select AD DS and AD LDS Tools and Active Directory Rights Management Service.
  6. Choose Next and Install. Feature installation starts.
  7. When the installation is complete, choose Close.
  8. Open Active Directory Users and Computers from the Start menu.
  9. Under hyperpod.abc123.com, expand hyperpod.
  10. Choose (right-click) hyperpod, choose New, and choose Organizational Unit.
  11. Create an organizational unit called Groups.
  12. Choose (right-click) Groups, choose New, and choose Group.
  13. Create a group called ClusterAdmin.
  14. Create a second group called ClusterDev.
  15. Choose (right-click) Users, choose New, and choose User.
  16. Create a new user.
  17. Choose (right-click) the user and choose Add to a group.
  18. Add your users to the groups ClusterAdmin or ClusterDev. Users added to the ClusterAdmin group will have sudo privilege on the cluster.

Create a ReadOnly user in AD

Create a user called ReadOnly under Users. The ReadOnly user is used by the cluster to programmatically access users and groups in AD.

Take note of the password for later use.

(For SSH public key authentication) Add SSH public keys to users

By storing an SSH public key to a user in AD, you can log in without entering a password. You can use an existing key pair, or you can create a new key pair with OpenSSH’s ssh-keygen command. For more information about generating a key pair, refer to Create a key pair for your Amazon EC2 instance.

  1. In Active Directory Users and Computers, on the View menu, enable Advanced Features.
  2. Open the Properties dialog of the user.
  3. On the Attribute Editor tab, choose altSecurityIdentities, then choose Edit.
  4. For Value to add, choose Add.
  5. For Values, add an SSH public key.
  6. Choose OK. Confirm that the SSH public key appears as an attribute.

Get an obfuscated password for the ReadOnly user

To avoid including a plain text password in the SSSD configuration file, you obfuscate the password. For this step, you need a Linux environment (local laptop, EC2 Linux instance, or CloudShell).

Install the sssd-tools package on the Linux machine to install the Python module pysss for obfuscation:

# Ubuntu
$ sudo apt install sssd-tools

# Amazon Linux
$ sudo yum install sssd-tools

Run the following one-line Python script. Input the password of the ReadOnly user. You will get the obfuscated password.

$ python3 -c "import getpass,pysss; print(pysss.password().encrypt(getpass.getpass('AD reader user password: ').strip(), pysss.password().AES_256))"
AD reader user password: (Enter ReadOnly user password) 
AAAQACK2....

Create a HyperPod cluster with an SSSD-enabled lifecycle script

Next, you create a HyperPod cluster with LDAPS/Active Directory integration.

  1. Find the configuration file config.py in your lifecycle script directory, open it with your text editor, and edit the properties in the Config class and SssdConfig class:
    1. Set True for enable_sssd to enable setting up SSSD.
    2. The SssdConfig class contains configuration parameters for SSSD.
    3. Make sure you use the obfuscated password for the ldap_default_authtok property, not a plain text password.
    # Basic configuration parameters
    class Config:
             :
        # Set true if you want to install SSSD for ActiveDirectory/LDAP integration.
        # You need to configure parameters in SssdConfig as well.
        enable_sssd = True
    # Configuration parameters for ActiveDirectory/LDAP/SSSD
    class SssdConfig:
    
        # Name of domain. Can be default if you are not sure.
        domain = "default"
    
        # Comma separated list of LDAP server URIs
        ldap_uri = "ldaps://nlb-ds-xyzxyz.elb.us-west-2.amazonaws.com"
    
        # The default base DN to use for performing LDAP user operations
        ldap_search_base = "dc=hyperpod,dc=abc123,dc=com"
    
        # The default bind DN to use for performing LDAP operations
        ldap_default_bind_dn = "CN=ReadOnly,OU=Users,OU=hyperpod,DC=hyperpod,DC=abc123,DC=com"
    
        # "password" or "obfuscated_password". Obfuscated password is recommended.
        ldap_default_authtok_type = "obfuscated_password"
    
        # You need to modify this parameter with the obfuscated password, not plain text password
        ldap_default_authtok = "placeholder"
    
        # SSH authentication method - "password" or "publickey"
        ssh_auth_method = "publickey"
    
        # Home directory. You can change it to "/home/%u" if your cluster doesn't use FSx volume.
        override_homedir = "/fsx/%u"
    
        # Group names to accept SSH login
        ssh_allow_groups = {
            "controller" : ["ClusterAdmin", "ubuntu"],
            "compute" : ["ClusterAdmin", "ClusterDev", "ubuntu"],
            "login" : ["ClusterAdmin", "ClusterDev", "ubuntu"],
        }
    
        # Group names for sudoers
        sudoers_groups = {
            "controller" : ["ClusterAdmin", "ClusterDev"],
            "compute" : ["ClusterAdmin", "ClusterDev"],
            "login" : ["ClusterAdmin", "ClusterDev"],
        }
    

  2. Copy the certificate file ldaps.crt to the same directory (where config.py exists).
  3. Upload the modified lifecycle script files to your Amazon Simple Storage Service (Amazon S3) bucket, and create a HyperPod cluster with it.
  4. Wait until the cluster status changes to InService (a status-polling sketch follows this list).
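
If you prefer to poll the cluster status programmatically instead of watching the console, a minimal sketch along the following lines can be used with the AWS SDK; the cluster name and Region are placeholders for your own values:

import time

import boto3

sagemaker = boto3.client("sagemaker", region_name="us-west-2")

# Poll until the HyperPod cluster reaches InService.
# "my-hyperpod-cluster" is a placeholder; use the name you gave your cluster.
while True:
    status = sagemaker.describe_cluster(ClusterName="my-hyperpod-cluster")["ClusterStatus"]
    print(f"Cluster status: {status}")
    if status == "InService":
        break
    time.sleep(60)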

Verification

Let’s verify the solution by logging in to the cluster with SSH. Because the cluster was created in a private subnet, you can’t directly SSH into the cluster from your local environment. You can choose from two options to connect to the cluster.

Option 1: SSH login through AWS Systems Manager

You can use AWS Systems Manager as a proxy for the SSH connection. Add a host entry to the SSH configuration file ~/.ssh/config using the following example. For the HostName field, specify the Systems Manager target name in the format sagemaker-cluster:[cluster-id]_[instance-group-name]-[instance-id]. For the IdentityFile field, specify the file path to the user's SSH private key. This field is not required if you chose password authentication.

Host MyCluster-LoginNode
    HostName sagemaker-cluster:abcd1234_LoginGroup-i-01234567890abcdef
    User user1
    IdentityFile ~/keys/my-cluster-ssh-key.pem
    ProxyCommand aws --profile default --region us-west-2 ssm start-session --target %h --document-name AWS-StartSSHSession --parameters portNumber=%p

Run the ssh command using the host name you specified. Confirm you can log in to the instance with the specified user.

$ ssh MyCluster-LoginNode
   :
   :
   (SageMaker HyperPod ASCII art banner)
You're on the controller
Instance Type: ml.m5.xlarge
user1@ip-10-1-111-222:~$

At this point, users can still use the Systems Manager default shell session to log in to the cluster as ssm-user with administrative privileges. To block the default Systems Manager shell access and enforce SSH access, you can configure your IAM policy by referring to the following example:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ssm:StartSession",
                "ssm:TerminateSession"
            ],
            "Resource": [
                "arn:aws:sagemaker:us-west-2:123456789012:cluster/abcd1234efgh",
                "arn:aws:ssm:us-west-2:123456789012:document/AWS-StartSSHSession"
            ],
            "Condition": {
                "BoolIfExists": {
                    "ssm:SessionDocumentAccessCheck": "true"
                }
            }
        }
    ]
}

For more details on how to enforce SSH access, refer to Start a session with a document by specifying the session documents in IAM policies.

Option 2: SSH login through bastion host

Another option to access the cluster is to use a bastion host as a proxy. You can use this option when the user doesn’t have permission to use Systems Manager sessions, or to troubleshoot when Systems Manager is not working.

  1. Create a bastion security group that allows inbound SSH access (TCP port 22) from your local environment.
  2. Update the security group for the cluster to allow inbound SSH access from the bastion security group.
  3. Create an EC2 Linux instance.
  4. For Amazon Machine Image, choose Ubuntu Server 20.04 LTS.
  5. For Instance type, choose t3.small.
  6. In the Network settings section, provide the following parameters:
    1. For VPC, choose SageMaker HyperPod VPC (which you created with the CloudFormation template).
    2. For Subnet, choose the public subnet you created with the CloudFormation template.
    3. For Common security groups, choose the bastion security group you created.
  7. For Configure storage, set storage to 8 GB.
  8. Identify the public IP address of the bastion host and the private IP address of the target instance (for example, the login node of the cluster), and add two host entries in the SSH config, by referring to the following example:
    Host Bastion
        HostName 11.22.33.44
        User ubuntu
        IdentityFile ~/keys/my-bastion-ssh-key.pem
    
    Host MyCluster-LoginNode-with-Proxy
        HostName 10.1.111.222
        User user1
        IdentityFile ~/keys/my-cluster-ssh-key.pem
        ProxyCommand ssh -q -W %h:%p Bastion

  9. Run the ssh command using the target host name you specified earlier, and confirm you can log in to the instance with the specified user:
    $ ssh MyCluster-LoginNode-with-Proxy
       :
       :
    (SageMaker HyperPod ASCII art banner)
    You're on the controller
    Instance Type: ml.m5.xlarge
    user1@ip-10-1-111-222:~$

Clean up

Clean up the resources in the following order:

  1. Delete the HyperPod cluster.
  2. Delete the Network Load Balancer.
  3. Delete the load balancing target group.
  4. Delete the certificate imported to Certificate Manager.
  5. Delete the EC2 Windows instance.
  6. Delete the EC2 Linux instance for the bastion host.
  7. Delete the AWS Managed Microsoft AD.
  8. Delete the CloudFormation stack for the VPC, subnets, security group, and FSx for Lustre volume.

Conclusion

This post provided steps to create a HyperPod cluster integrated with Active Directory. This solution removes the hassle of user maintenance on large-scale clusters and allows you to manage users and groups centrally in one place.

For more information about HyperPod, check out the HyperPod workshop and the SageMaker HyperPod Developer Guide. Leave your feedback on this solution in the comments section.


About the Authors

Tomonori Shimomura is a Senior Solutions Architect on the Amazon SageMaker team, where he provides in-depth technical consultation to SageMaker customers and suggests product improvements to the product team. Before joining Amazon, he worked on the design and development of embedded software for video game consoles, and now he leverages his in-depth skills in Cloud side technology. In his free time, he enjoys playing video games, reading books, and writing software.

Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering experience and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.

Monidipa Chakraborty currently serves as a Senior Software Development Engineer at Amazon Web Services (AWS), specifically within the SageMaker HyperPod team. She is committed to assisting customers by designing and implementing robust and scalable systems that demonstrate operational excellence. Bringing nearly a decade of software development experience, Monidipa has contributed to various sectors within Amazon, including Video, Retail, Amazon Go, and AWS SageMaker.

Satish Pasumarthi is a Software Developer at Amazon Web Services. With several years of software engineering experience and an ML background, he loves to bridge the gap between ML and systems and is passionate about building systems that make large-scale model training possible. He has worked on projects in a variety of domains, including machine learning frameworks, model benchmarking, and building the HyperPod beta, involving a broad set of AWS services. In his free time, Satish enjoys playing badminton.

Read More

The executive’s guide to generative AI for sustainability

Organizations face ever-increasing requirements around sustainability goals and environmental, social, and governance (ESG) practices. A Gartner, Inc. survey revealed that 87 percent of business leaders expect to increase their organization’s investment in sustainability in the coming years. This post serves as a starting point for any executive seeking to navigate the intersection of generative artificial intelligence (generative AI) and sustainability. It provides examples of use cases and best practices for harnessing generative AI’s potential to accelerate sustainability and ESG initiatives, as well as insights into the main operational challenges of generative AI for sustainability. This guide can be used as a roadmap for integrating generative AI effectively within sustainability strategies while ensuring alignment with organizational objectives.

A roadmap to generative AI for sustainability

In the sections that follow, we provide a roadmap for integrating generative AI into sustainability initiatives.

1. Understand the potential of generative AI for sustainability

Generative AI has the power to transform every part of a business with its wide range of capabilities. These include the ability to analyze massive amounts of data, identify patterns, summarize documents, perform translations, correct errors, or answer questions. These capabilities can be used to add value throughout the entire value chain of your organization. Figure 1 illustrates selected examples of use cases of generative AI for sustainability across the value chain.

Figure 1: Examples of generative AI for sustainability use cases across the value chain

According to KPMG’s 2024 ESG Organization Survey, investment in ESG capabilities is another top priority for executives as organizations face increasing regulatory pressure to disclose information about ESG impacts, risks, and opportunities. Within this context, you can use generative AI to advance your organization’s ESG goals.

The typical ESG workflow consists of multiple phases, each presenting unique pain points. Generative AI offers solutions that can address these pain points throughout the process and contribute to sustainability efforts. Figure 2 provides examples illustrating how generative AI can support each phase of the ESG workflow within your organization. These examples include speeding up market trend analysis, ensuring accurate risk management and compliance, and facilitating data collection or report generation. Note that ESG workflows may vary across different verticals, organizational maturities, and legislative frameworks. Factors such as industry-specific regulations, company size, and regional policies can influence the ESG workflow steps. Therefore, prioritizing use cases according to your specific needs and context and defining a clear plan to measure success is essential for optimal effectiveness.

Figure 2: Mapping generative AI benefits across the ESG workflow

2. Recognize the operational challenges of generative AI for sustainability

Understanding and appropriately addressing the challenges of implementing generative AI is crucial for organizations aiming to use its potential to advance their sustainability goals and ESG initiatives. These challenges include collecting and managing high-quality data, integrating generative AI into existing IT systems, navigating ethical concerns, filling skills gaps, and setting the organization up for success by bringing in key stakeholders, such as the chief information security officer (CISO) or chief financial officer (CFO), early so that you build responsibly. Legal challenges are a major blocker for transitioning from proof of concept (POC) to production, so it’s essential to involve legal teams early in the process to build with compliance in mind. Figure 3 provides an overview of the main operational challenges of generative AI for sustainability.

Figure 3: Operational challenges of generative AI for sustainability

3. Set the right data foundations

As a CEO aiming to use generative AI to achieve sustainability goals, remember that data is your differentiator. Companies that lack ready access to high-quality data will not be able to customize generative AI models with their own data, thus missing out on realizing the full scaling potential of generative AI and creating a competitive advantage. Invest in acquiring diverse and high-quality datasets to enrich and accelerate your ESG initiatives. You can use resources such as the Amazon Sustainability Data Initiative or the AWS Data Exchange to simplify and expedite the acquisition and analysis of comprehensive datasets. Alongside external data acquisition, prioritize internal data management to maximize the potential of generative AI and use its capabilities in analyzing your organizational data and uncovering new insights.

From an operational standpoint, you can embrace foundation model ops (FMOps) and large language model ops (LLMOps) to make sure your sustainability efforts are data-driven and scalable. This involves documenting data lineage, data versioning, automating data processing, and monitoring data management costs.

4. Identify high-impact opportunities

You can use Amazon’s working backwards principle to pinpoint opportunities within your sustainability strategy where generative AI can make a significant impact. Prioritize projects that promise immediate enhancements in key areas within your organization. While ESG remains a key aspect of sustainability, tapping into industry-specific expertise across sectors such as energy, supply chain, manufacturing, transportation, or agriculture can uncover diverse generative AI for sustainability use cases tailored to your business’s applications. Moreover, exploring alternative avenues, such as using generative AI to improve research and development, enable customer self-service, optimize energy usage in buildings, or slow down deforestation, can also provide impactful opportunities for sustainable innovation.

5. Use the right tools

Failing to use the appropriate tools can add complexity, compromise security, and reduce effectiveness in using generative AI for sustainability. The right tool should offer you choice and flexibility and enable you to customize your solutions to specific needs and requirements.

Figure 4 illustrates the AWS generative AI stack as of 2023, which offers a set of capabilities that encompass choice, breadth, and depth across all layers. Moreover, it is built on a data-first approach, ensuring that every aspect of its offerings is designed with security and privacy in mind.

Examples of tools you can use to advance sustainability initiatives are:

Amazon Bedrock – A fully managed service that provides access to high-performing FMs from leading AI companies through a single API, enabling you to choose the right model for your sustainability use cases.

AWS Trainium2 – Purpose-built for high-performance training of FMs and LLMs, Trainium2 provides up to 2x better energy efficiency (performance/watt) compared to first-generation Trainium chips.

Inferentia2-based Amazon EC2 Inf2 instances – These instances offer up to 50 percent better performance/watt over comparable Amazon Elastic Compute Cloud (Amazon EC2) instances. Purpose-built to handle deep learning models at scale, Inf2 instances are indispensable for deploying ultra-large models while meeting sustainability goals through improved energy efficiency.

Figure 4: AWS generative AI stack

6. Use the right approach

Generative AI isn’t a one-size-fits-all solution. Tailoring your approach by choosing the right modality and optimization strategy is crucial for maximizing its impact on sustainability initiatives. Figure 5 offers an overview of generative AI modalities.

Figure 5: Generative AI modalities

In addition, Figure 6 outlines the main generative AI optimization strategies, including prompt engineering, Retrieval Augmented Generation, and fine-tuning or continued pre-training.

Figure 6: Generative AI optimization strategies

7. Simplify the development of your applications by using generative AI agents

Generative AI agents offer a unique opportunity to drive sustainability initiatives forward with their ability to automate a wide range of routine and repetitive tasks, such as data entry, customer support inquiries, and content generation. Moreover, they can orchestrate complex, multistep workflows by breaking down tasks into smaller, manageable steps, coordinating various actions, and ensuring the efficient execution of processes within your organization. For example, you can use Agents for Amazon Bedrock to configure an agent that monitors and analyzes energy usage patterns across your operations and identifies opportunities for energy savings. Alternatively, you can create a specialized agent that monitors compliance with sustainability regulations in real time.
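
To make the energy-monitoring idea concrete, the following is a minimal sketch of creating such an agent with the AWS SDK. The agent name, instruction text, IAM role ARN, and model ID are placeholders, and a complete agent would also need action groups or knowledge bases connected to your energy data.

import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Placeholder values: replace the role ARN and model ID with ones valid in your account.
response = bedrock_agent.create_agent(
    agentName="energy-usage-analyzer",
    agentResourceRoleArn="arn:aws:iam::123456789012:role/BedrockAgentRole",
    foundationModel="anthropic.claude-3-sonnet-20240229-v1:0",
    instruction=(
        "Analyze energy usage reports across our facilities, highlight anomalies, "
        "and identify opportunities for energy savings."
    ),
)
print(response["agent"]["agentId"])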

8. Build robust feedback mechanisms for evaluation

Take advantage of feedback insights for strategic improvements, whether adjusting generative AI models or redefining objectives to ensure agility and alignment with sustainability challenges. Consider the following guidelines:

Implement real-time monitoring – Set up monitoring systems to track generative AI performance against sustainability benchmarks, focusing on efficiency and environmental impact. Establish a metrics pipeline to provide insights into the sustainability contributions of your generative AI initiatives.

Engage stakeholders for human-in-the-loop evaluation – Rely on human-in-the-loop auditing and regularly collect feedback from internal teams, customers, and partners to gauge the impact of generative AI–driven processes on the organization’s sustainability benchmarks. This enhances transparency and promotes trust in your commitment to sustainability.

Use automated testing for continuous improvement – With tools such as RAGAS and LangSmith, you can use LLM-based evaluation to identify and correct inaccuracies or hallucinations, facilitating rapid optimization of generative AI models in line with sustainability goals.

9. Measure impact and maximize ROI from generative AI for sustainability

Establish clear key performance indicators (KPIs) that capture the environmental impact, such as carbon footprint reduction, alongside economic benefits, such as cost savings or enhanced business agility. This dual focus ensures that your investments not only contribute to programs focused on environmental sustainability but also reinforce the business case for sustainability, while empowering you to drive innovation and competitive advantage in sustainable practices. Share success stories internally and externally to inspire others and demonstrate your organization’s commitment to sustainability leadership.

10. Minimize resource usage throughout the generative AI lifecycle

In some cases, generative AI itself can have a high energy cost. To achieve maximum impact, consider the trade-off between the benefits of using generative AI for sustainability initiatives and the energy efficiency of the technology itself. Make sure to gain a deep understanding of the iterative generative AI lifecycle and optimize each phase for environmental sustainability. Typically, the journey into generative AI begins with identifying specific application requirements. From there, you have the option to either train a model from scratch or use an existing one; in most cases, customizing an existing model is preferred. After this step, evaluating your system thoroughly before deployment is essential. Lastly, continuous monitoring enables ongoing refinement and adjustments. Throughout this lifecycle, implementing AWS Well-Architected Framework best practices is recommended. Refer to Figure 7 for an overview of the generative AI lifecycle.

Figure 7: The generative AI lifecycle

11. Manage risks and implement responsibly

While generative AI holds significant promise for advancing your organization’s sustainability goals, it also poses challenges such as toxicity and hallucinations. Striking the right balance between innovation and the responsible use of generative AI is fundamental for mitigating risks and enabling responsible AI innovation. This balance must account for the assessment of risk in terms of several factors, such as quality, disclosures, or reporting. To achieve this, you need to adopt specific tools and capabilities and work with your security experts to follow security best practices. Scaling generative AI in a safe and secure manner requires putting in place guardrails that are customized to your use cases and aligned with responsible AI policies.

12. Invest in educating and training your teams

Continuously upskill your team and empower them with the right skills to innovate and actively contribute to achieving your organization’s sustainability goals. Identify relevant resources for sustainability and generative AI to ensure your teams stay updated with the essential skills required in both areas.

Conclusion

In this post, we provided a guide for executives to integrate generative AI into their sustainability strategies, focusing on both sustainability and ESG goals. The adoption of generative AI in sustainability efforts is not just about technological innovation. It is about fostering a culture of responsibility, innovation, and continuous improvement. By prioritizing high-quality data, identifying impactful opportunities, and fostering stakeholders’ engagement, companies can harness the transformative power of generative AI to not only achieve but surpass their sustainability goals.

How can AWS help?

Explore the AWS Solutions Library to discover ways to build sustainability solutions on AWS.

The AWS Generative AI Innovation Center can assist you in the process with expert guidance on ideation, strategic use case identification, execution, and scaling to production.

To learn more about how Amazon is using AI to reach our climate pledge commitment of net-zero carbon by 2040, explore the 7 ways AI is helping Amazon build a more sustainable future and business.


About the Authors

Dr. Wafae Bakkali is a Data Scientist at AWS. As a generative AI expert, Wafae is driven by the mission to empower customers in solving their business challenges through the utilization of generative AI techniques, ensuring they do so with maximum efficiency and sustainability.

Dr. Mehdi Noori is a Senior Scientist at AWS Generative AI Innovation Center. With a passion for bridging technology and innovation in the sustainability field, he assists AWS customers in unlocking the potential of Generative AI, turning potential challenges into opportunities for rapid experimentation and innovation. By focusing on scalable, measurable, and impactful uses of advanced AI technologies and streamlining the path to production, he helps customers achieve their sustainability goals.

Rahul Sareen is the GM for Sustainability Solutions and GTM at AWS. Rahul leads a team of high-performing individuals, consisting of sustainability strategists, GTM specialists, and technology architects, that creates great business outcomes for customers’ sustainability goals (everything from carbon emission tracking, sustainable packaging and operations, and circular economy to renewable energy). Rahul’s team provides technical expertise (ML, generative AI, IoT) to solve sustainability use cases.

Read More

Introducing automatic training for solutions in Amazon Personalize

Amazon Personalize is excited to announce automatic training for solutions. Solution training is fundamental to maintain the effectiveness of a model and make sure recommendations align with users’ evolving behaviors and preferences. As data patterns and trends change over time, retraining the solution with the latest relevant data enables the model to learn and adapt, enhancing its predictive accuracy. Automatic training generates a new solution version, mitigating model drift and keeping recommendations relevant and tailored to end-users’ current behaviors while including the newest items. Ultimately, automatic training provides a more personalized and engaging experience that adapts to changing preferences.

Amazon Personalize accelerates your digital transformation with machine learning (ML), making it effortless to integrate personalized recommendations into existing websites, applications, email marketing systems, and more. Amazon Personalize enables developers to quickly implement a customized personalization engine, without requiring ML expertise. Amazon Personalize provisions the necessary infrastructure and manages the entire ML pipeline, including processing the data, identifying features, using the appropriate algorithms, and training, optimizing, and hosting the customized models based on your data. All your data is encrypted to be private and secure.

In this post, we guide you through the process of configuring automatic training, so your solutions and recommendations maintain their accuracy and relevance.

Solution overview

A solution refers to the combination of an Amazon Personalize recipe, customized parameters, and one or more solution versions (trained models). When you create a custom solution, you specify a recipe matching your use case and configure training parameters. For this post, you configure automatic training in the training parameters.

Prerequisites

To enable automatic training for your solutions, you first need to set up Amazon Personalize resources. Start by creating a dataset group, schemas, and datasets representing your items, interactions, and user data. For instructions, refer to Getting Started (console) or Getting Started (AWS CLI).

After you finish importing your data, you are ready to create a solution.

Create a solution

To set up automatic training, complete the following steps:

  1. On the Amazon Personalize console, create a new solution.
  2. Specify a name for your solution, choose the type of solution you want to create, and choose your recipe.
  3. Optionally, add any tags. For more information about tagging Amazon Personalize resources, see Tagging Amazon Personalize resources.
  4. To use automatic training, in the Automatic training section, select Turn on and specify your training frequency.

Automatic training is enabled by default and trains a new solution version one time every 7 days. You can configure the training cadence to suit your business needs, ranging from one time every 1 day to one time every 30 days.

  1. If your recipe generates item recommendations or user segments, optionally use the Columns for training section to choose the columns Amazon Personalize considers when training solution versions.
  2. In the Hyperparameter configuration section, optionally configure any hyperparameter options based on your recipe and business needs.
  3. Provide any additional configurations, then choose Next.
  4. Review the solution details and confirm that your automatic training is configured as expected.
  5. Choose Create solution.

Amazon Personalize will automatically create your first solution version. A solution version refers to a trained ML model. When a solution version is created for the solution, Amazon Personalize trains the model backing the solution version based on the recipe and training configuration. It can take up to 1 hour for the solution version creation to start.

The following is sample code for creating a solution with automatic training using the AWS SDK:

import boto3

personalize = boto3.client('personalize')

solution_config = {
    "autoTrainingConfig": {
        "schedulingExpression": "rate(3 days)"
    }
}

recipe_arn = "arn:aws:personalize:::recipe/aws-similar-items"
# Replace with the ARN of your dataset group
dataset_group_arn = "arn:aws:personalize:<region>:<accountId>:dataset-group/<dataset_group_name>"
name = "test_automatic_training"

response = personalize.create_solution(
    name=name,
    recipeArn=recipe_arn,
    datasetGroupArn=dataset_group_arn,
    performAutoTraining=True,
    solutionConfig=solution_config,
)

print(response['solutionArn'])
solution_arn = response['solutionArn']

After a solution is created, you can confirm whether automatic training is enabled on the solution details page.

You can also use the following sample code to confirm via the AWS SDK that automatic training is enabled:

response = personalize.describe_solution(solutionArn=solution_arn)
print(response)

Your response will contain the fields performAutoTraining and autoTrainingConfig, displaying the values you set in the CreateSolution call.
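
Continuing from the describe_solution call above, you could pull out just those fields. The following minimal sketch assumes the settings are returned under the solution object, mirroring the shape of the CreateSolution request:

# Print only the automatic training settings (assumed response shape)
solution = response["solution"]
print(solution.get("performAutoTraining"))
print(solution.get("solutionConfig", {}).get("autoTrainingConfig"))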

On the solution details page, you will also see the solution versions that are created automatically. The Training type column specifies whether the solution version was created manually or automatically.

You can also use the following sample code to return a list of solution versions for the given solution:

response = personalize.list_solution_versions(solutionArn=solution_arn)['solutionVersions']
print("List Solution Version responsen")
for val in response:
    print(f"SolutionVersion: {val}")
    print("n")

Your response will contain the field trainingType, which specifies whether the solution version was created manually or automatically.

When your solution version is ready, you can create a campaign for your solution version.

Create a campaign

A campaign deploys a solution version (trained model) to generate real-time recommendations. With Amazon Personalize, you can streamline your workflow and automate the deployment of the latest solution version to campaigns via automatic syncing. To set up auto sync, complete the following steps:

  1. On the Amazon Personalize console, create a new campaign.
  2. Specify a name for your campaign.
  3. Choose the solution you just created.
  4. Select Automatically use the latest solution version.
  5. Set the minimum provisioned transactions per second.
  6. Create your campaign.

The campaign is ready when its status is ACTIVE.

The following is sample code for creating a campaign with syncWithLatestSolutionVersion set to true using the AWS SDK. You must also append the suffix $LATEST to the solutionArn in solutionVersionArn when you set syncWithLatestSolutionVersion to true.

campaign_config = {
    "syncWithLatestSolutionVersion": True
}
resource_name = "test_campaign_sync"
solution_version_arn = "arn:aws:personalize:<region>:<accountId>:solution/<solution_name>/$LATEST"
response = personalize.create_campaign(name=resource_name, solutionVersionArn=solution_version_arn, campaignConfig=campaign_config)
campaign_arn = response['campaignArn']
print(campaign_arn)

On the campaign details page, you can see whether the campaign selected has auto sync enabled. When enabled, your campaign will automatically update to use the most recent solution version, whether it was automatically or manually created.

Use the following sample code to confirm via the AWS SDK that syncWithLatestSolutionVersion is enabled:

response = personalize.describe_campaign(campaignArn=campaign_arn)
print(response)

Your response will contain the field syncWithLatestSolutionVersion under campaignConfig, displaying the value you set in the CreateCampaign call.

You can enable or disable the option to automatically use the latest solution version on the Amazon Personalize console after a campaign is created by updating your campaign. Similarly, you can enable or disable syncWithLatestSolutionVersion with UpdateCampaign using the AWS SDK.
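
For example, the following minimal sketch disables auto sync on an existing campaign with the AWS SDK (pass True instead to re-enable it; depending on your setup, you may also need to provide an explicit solutionVersionArn when turning sync off):

# Turn off automatic syncing to the latest solution version
response = personalize.update_campaign(
    campaignArn=campaign_arn,
    campaignConfig={"syncWithLatestSolutionVersion": False},
)
print(response["campaignArn"])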

Conclusion

With automatic training, you can mitigate model drift and maintain recommendation relevance by streamlining your workflow and automating the deployment of the latest solution version in Amazon Personalize.

For more information about optimizing your user experience with Amazon Personalize, see the Amazon Personalize Developer Guide.


About the authors

Ba’Carri Johnson is a Sr. Technical Product Manager working with AWS AI/ML on the Amazon Personalize team. With a background in computer science and strategy, she is passionate about product innovation. In her spare time, she enjoys traveling and exploring the great outdoors.

Ajay Venkatakrishnan is a Software Development Engineer on the Amazon Personalize team. In his spare time, he enjoys writing and playing soccer.

Pranesh Anubhav is a Senior Software Engineer for Amazon Personalize. He is passionate about designing machine learning systems to serve customers at scale. Outside of his work, he loves playing soccer and is an avid follower of Real Madrid.

Read More

Use Kubernetes Operators for new inference capabilities in Amazon SageMaker that reduce LLM deployment costs by 50% on average

We are excited to announce a new version of the Amazon SageMaker Operators for Kubernetes using the AWS Controllers for Kubernetes (ACK). ACK is a framework for building Kubernetes custom controllers, where each controller communicates with an AWS service API. These controllers allow Kubernetes users to provision AWS resources like buckets, databases, or message queues simply by using the Kubernetes API.

Release v1.2.9 of the SageMaker ACK Operators adds support for inference components, which until now were only available through the SageMaker API and the AWS Software Development Kits (SDKs). Inference components can help you optimize deployment costs and reduce latency. With the new inference component capabilities, you can deploy one or more foundation models (FMs) on the same Amazon SageMaker endpoint and control how many accelerators and how much memory is reserved for each FM. This helps improve resource utilization, reduces model deployment costs on average by 50%, and lets you scale endpoints together with your use cases. For more details, see Amazon SageMaker adds new inference capabilities to help reduce foundation model deployment costs and latency.

The availability of inference components through the SageMaker controller enables customers who use Kubernetes as their control plane to take advantage of inference components while deploying their models on SageMaker.

In this post, we show how to use SageMaker ACK Operators to deploy SageMaker inference components.

How ACK works

To demonstrate how ACK works, let’s look at an example using Amazon Simple Storage Service (Amazon S3). In the following diagram, Alice is our Kubernetes user. Her application depends on the existence of an S3 bucket named my-bucket.

How ACK Works

The workflow consists of the following steps:

  1. Alice issues a call to kubectl apply, passing in a file that describes a Kubernetes custom resource describing her S3 bucket. kubectl apply passes this file, called a manifest, to the Kubernetes API server running in the Kubernetes controller node.
  2. The Kubernetes API server receives the manifest describing the S3 bucket and determines if Alice has permissions to create a custom resource of kind s3.services.k8s.aws/Bucket, and that the custom resource is properly formatted.
  3. If Alice is authorized and the custom resource is valid, the Kubernetes API server writes the custom resource to its etcd data store.
  4. It then responds to Alice that the custom resource has been created.
  5. At this point, the ACK service controller for Amazon S3, which is running on a Kubernetes worker node within the context of a normal Kubernetes Pod, is notified that a new custom resource of kind s3.services.k8s.aws/Bucket has been created.
  6. The ACK service controller for Amazon S3 then communicates with the Amazon S3 API, calling the S3 CreateBucket API to create the bucket in AWS.
  7. After communicating with the Amazon S3 API, the ACK service controller calls the Kubernetes API server to update the custom resource’s status with information it received from Amazon S3.

Key components

The new inference capabilities build upon SageMaker’s real-time inference endpoints. As before, you create the SageMaker endpoint with an endpoint configuration that defines the instance type and initial instance count for the endpoint. The model is configured in a new construct, an inference component. Here, you specify the number of accelerators and amount of memory you want to allocate to each copy of a model, together with the model artifacts, container image, and number of model copies to deploy.

You can use the new inference capabilities from Amazon SageMaker Studio, the SageMaker Python SDK, AWS SDKs, and AWS Command Line Interface (AWS CLI). They are also supported by AWS CloudFormation. Now you also can use them with SageMaker Operators for Kubernetes.

Solution overview

For this demo, we use the SageMaker controller to deploy a copy of the Dolly v2 7B model and a copy of the FLAN-T5 XXL model from the Hugging Face Model Hub on a SageMaker real-time endpoint using the new inference capabilities.

Prerequisites

To follow along, you should have a Kubernetes cluster with the SageMaker ACK controller v1.2.9 or above installed. For instructions on how to provision an Amazon Elastic Kubernetes Service (Amazon EKS) cluster with Amazon Elastic Compute Cloud (Amazon EC2) Linux managed nodes using eksctl, see Getting started with Amazon EKS – eksctl. For instructions on installing the SageMaker controller, refer to Machine Learning with the ACK SageMaker Controller.

You need access to accelerated instances (GPUs) for hosting the LLMs. This solution uses one instance of ml.g5.12xlarge; you can check the availability of these instances in your AWS account and request these instances as needed via a Service Quotas increase request, as shown in the following screenshot.

Service Quotas Increase Request

Create an inference component

To create your inference component, define the EndpointConfig, Endpoint, Model, and InferenceComponent YAML files, similar to the ones shown in this section. Use kubectl apply -f <yaml file> to create the Kubernetes resources.

You can list the status of the resource via kubectl describe <resource-type>; for example, kubectl describe inferencecomponent.

You can also create the inference component without a model resource. Refer to the guidance provided in the API documentation for more details.

EndpointConfig YAML

The following is the code for the EndpointConfig file:

apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: EndpointConfig
metadata:
  name: inference-component-endpoint-config
spec:
  endpointConfigName: inference-component-endpoint-config
  executionRoleARN: <EXECUTION_ROLE_ARN>
  productionVariants:
  - variantName: AllTraffic
    instanceType: ml.g5.12xlarge
    initialInstanceCount: 1
    routingConfig:
      routingStrategy: LEAST_OUTSTANDING_REQUESTS

Endpoint YAML

The following is the code for the Endpoint file:

apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: Endpoint
metadata:
  name: inference-component-endpoint
spec:
  endpointName: inference-component-endpoint
  endpointConfigName: inference-component-endpoint-config

Model YAML

The following is the code for the Model file:

apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: Model
metadata:
  name: dolly-v2-7b
spec:
  modelName: dolly-v2-7b
  executionRoleARN: <EXECUTION_ROLE_ARN>
  containers:
  - image: 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi0.9.3-gpu-py39-cu118-ubuntu20.04
    environment:
      HF_MODEL_ID: databricks/dolly-v2-7b
      HF_TASK: text-generation
---
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: Model
metadata:
  name: flan-t5-xxl
spec:
  modelName: flan-t5-xxl
  executionRoleARN: <EXECUTION_ROLE_ARN>
  containers:
  - image: 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi0.9.3-gpu-py39-cu118-ubuntu20.04
    environment:
      HF_MODEL_ID: google/flan-t5-xxl
      HF_TASK: text-generation

InferenceComponent YAMLs

In the following YAML files, given that the ml.g5.12xlarge instance comes with 4 GPUs, we are allocating 2 GPUs, 2 CPUs and 1,024 MB of memory to each model:

apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: InferenceComponent
metadata:
  name: inference-component-dolly
spec:
  inferenceComponentName: inference-component-dolly
  endpointName: inference-component-endpoint
  variantName: AllTraffic
  specification:
    modelName: dolly-v2-7b
    computeResourceRequirements:
      numberOfAcceleratorDevicesRequired: 2
      numberOfCPUCoresRequired: 2
      minMemoryRequiredInMb: 1024
  runtimeConfig:
    copyCount: 1
---
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: InferenceComponent
metadata:
  name: inference-component-flan
spec:
  inferenceComponentName: inference-component-flan
  endpointName: inference-component-endpoint
  variantName: AllTraffic
  specification:
    modelName: flan-t5-xxl
    computeResourceRequirements:
      numberOfAcceleratorDevicesRequired: 2
      numberOfCPUCoresRequired: 2
      minMemoryRequiredInMb: 1024
  runtimeConfig:
    copyCount: 1

Invoke models

You can now invoke the models using the following code:

import boto3
import json

sm_runtime_client = boto3.client(service_name="sagemaker-runtime")
payload = {"inputs": "Why is California a great place to live?"}

response_dolly = sm_runtime_client.invoke_endpoint(
    EndpointName="inference-component-endpoint",
    InferenceComponentName="inference-component-dolly",
    ContentType="application/json",
    Accept="application/json",
    Body=json.dumps(payload),
)
result_dolly = json.loads(response_dolly['Body'].read().decode())
print(result_dolly)

response_flan = sm_runtime_client.invoke_endpoint(
    EndpointName="inference-component-endpoint",
    InferenceComponentName="inference-component-flan",
    ContentType="application/json",
    Accept="application/json",
    Body=json.dumps(payload),
)
result_flan = json.loads(response_flan['Body'].read().decode())
print(result_flan)

Update an inference component

To update an existing inference component, you can update the YAML files and then use kubectl apply -f <yaml file>. The following is an example of an updated file:

apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: InferenceComponent
metadata:
  name: inference-component-dolly
spec:
  inferenceComponentName: inference-component-dolly
  endpointName: inference-component-endpoint
  variantName: AllTraffic
  specification:
    modelName: dolly-v2-7b
    computeResourceRequirements:
      numberOfAcceleratorDevicesRequired: 2
      numberOfCPUCoresRequired: 4 # Update the numberOfCPUCoresRequired.
      minMemoryRequiredInMb: 1024
  runtimeConfig:
    copyCount: 1

Delete an inference component

To delete an existing inference component, use the command kubectl delete -f <yaml file>.

Availability and pricing

The new SageMaker inference capabilities are available today in AWS Regions US East (Ohio, N. Virginia), US West (Oregon), Asia Pacific (Jakarta, Mumbai, Seoul, Singapore, Sydney, Tokyo), Canada (Central), Europe (Frankfurt, Ireland, London, Stockholm), Middle East (UAE), and South America (São Paulo). For pricing details, visit Amazon SageMaker Pricing.

Conclusion

In this post, we showed how to use SageMaker ACK Operators to deploy SageMaker inference components. Fire up your Kubernetes cluster and deploy your FMs using the new SageMaker inference capabilities today!


About the Authors

Rajesh Ramchander is a Principal ML Engineer in Professional Services at AWS. He helps customers at various stages in their AI/ML and GenAI journey, from those that are just getting started all the way to those that are leading their business with an AI-first strategy.

Amit Arora is an AI and ML Specialist Architect at Amazon Web Services, helping enterprise customers use cloud-based machine learning services to rapidly scale their innovations. He is also an adjunct lecturer in the MS data science and analytics program at Georgetown University in Washington D.C.

Suryansh Singh is a Software Development Engineer at AWS SageMaker and works on developing ML-distributed infrastructure solutions for AWS customers at scale.

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Johna Liu is a Software Development Engineer on the Amazon SageMaker team. Her current work focuses on helping developers efficiently host machine learning models and improve inference performance. She is passionate about spatial data analysis and using AI to solve societal problems.

Read More

Talk to your slide deck using multimodal foundation models hosted on Amazon Bedrock and Amazon SageMaker – Part 2

In Part 1 of this series, we presented a solution that used the Amazon Titan Multimodal Embeddings model to convert individual slides from a slide deck into embeddings. We stored the embeddings in a vector database and then used the Large Language-and-Vision Assistant (LLaVA 1.5-7b) model to generate text responses to user questions based on the most similar slide retrieved from the vector database. We used AWS services including Amazon Bedrock, Amazon SageMaker, and Amazon OpenSearch Serverless in this solution.

In this post, we demonstrate a different approach. We use the Anthropic Claude 3 Sonnet model to generate text descriptions for each slide in the slide deck. These descriptions are then converted into text embeddings using the Amazon Titan Text Embeddings model and stored in a vector database. Then we use the Claude 3 Sonnet model to generate answers to user questions based on the most relevant text description retrieved from the vector database.

You can test both approaches for your dataset and evaluate the results to see which approach gives you the best results. In Part 3 of this series, we evaluate the results of both methods.

Solution overview

The solution provides an implementation for answering questions using information contained in text and visual elements of a slide deck. The design relies on the concept of Retrieval Augmented Generation (RAG). Traditionally, RAG has been associated with textual data that can be processed by large language models (LLMs). In this series, we extend RAG to include images as well. This provides a powerful search capability to extract contextually relevant content from visual elements like tables and graphs along with text.

This solution includes the following components:

  • Amazon Titan Text Embeddings is a text embeddings model that converts natural language text, including single words, phrases, or even large documents, into numerical representations that can be used to power use cases such as search, personalization, and clustering based on semantic similarity.
  • Claude 3 Sonnet is the next generation of state-of-the-art models from Anthropic. Sonnet is a versatile tool that can handle a wide range of tasks, from complex reasoning and analysis to rapid outputs, as well as efficient search and retrieval across vast amounts of information.
  • OpenSearch Serverless is an on-demand serverless configuration for Amazon OpenSearch Service. We use OpenSearch Serverless as a vector database for storing embeddings generated by the Amazon Titan Text Embeddings model. An index created in the OpenSearch Serverless collection serves as the vector store for our RAG solution.
  • Amazon OpenSearch Ingestion (OSI) is a fully managed, serverless data collector that delivers data to OpenSearch Service domains and OpenSearch Serverless collections. In this post, we use an OSI pipeline API to deliver data to the OpenSearch Serverless vector store.

The solution design consists of two parts: ingestion and user interaction. During ingestion, we process the input slide deck by converting each slide into an image, generating descriptions and text embeddings for each image. We then populate the vector data store with the embeddings and text description for each slide. These steps are completed prior to the user interaction steps.

In the user interaction phase, a question from the user is converted into text embeddings. A similarity search is run on the vector database to find a text description corresponding to a slide that could potentially contain answers to the user question. We then provide the slide description and the user question to the Claude 3 Sonnet model to generate an answer to the query. All the code for this post is available in the GitHub repo.

The following diagram illustrates the ingestion architecture.

The workflow consists of the following steps:

  1. Slides are converted to image files (one per slide) in JPG format and passed to the Claude 3 Sonnet model to generate text description.
  2. The data is sent to the Amazon Titan Text Embeddings model to generate embeddings. In this series, we use the slide deck Train and deploy Stable Diffusion using AWS Trainium & AWS Inferentia from the AWS Summit in Toronto, June 2023 to demonstrate the solution. The sample deck has 31 slides, therefore we generate 31 sets of vector embeddings, each with 1536 dimensions. We add additional metadata fields to perform rich search queries using OpenSearch’s powerful search capabilities.
  3. The embeddings are ingested into an OSI pipeline using an API call.
  4. The OSI pipeline ingests the data as documents into an OpenSearch Serverless index. The index is configured as the sink for this pipeline and is created as part of the OpenSearch Serverless collection.

The following diagram illustrates the user interaction architecture.

The workflow consists of the following steps:

  1. A user submits a question related to the slide deck that has been ingested.
  2. The user input is converted into embeddings using the Amazon Titan Text Embeddings model accessed using Amazon Bedrock. An OpenSearch Service vector search is performed using these embeddings. We perform a k-nearest neighbor (k-NN) search to retrieve the most relevant embeddings matching the user query (a sketch of such a query follows this list).
  3. The metadata of the response from OpenSearch Serverless contains a path to the image and description corresponding to the most relevant slide.
  4. A prompt is created by combining the user question and the image description. The prompt is provided to Claude 3 Sonnet hosted on Amazon Bedrock.
  5. The result of this inference is returned to the user.
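
As an illustration of step 2, a k-NN query against the index might look like the following minimal sketch; os_client, index_name, and user_query_embedding are assumed to come from the notebook context, and the field names follow the index mapping shown later in this post:

# Retrieve the single most similar slide description for the user's question
knn_query = {
    "size": 1,
    "query": {
        "knn": {
            "vector_embedding": {
                "vector": user_query_embedding,  # embedding of the user question
                "k": 1,
            }
        }
    },
    "_source": ["image_path", "slide_text", "metadata"],
}
search_response = os_client.search(body=knn_query, index=index_name)
top_hit = search_response["hits"]["hits"][0]["_source"]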

We discuss the steps for both stages in the following sections, and include details about the output.

Prerequisites

To implement the solution provided in this post, you should have an AWS account and familiarity with FMs, Amazon Bedrock, SageMaker, and OpenSearch Service.

This solution uses the Claude 3 Sonnet and Amazon Titan Text Embeddings models hosted on Amazon Bedrock. Make sure that these models are enabled for use by navigating to the Model access page on the Amazon Bedrock console.

If models are enabled, the Access status will state Access granted.

If the models are not available, enable access by choosing Manage model access, selecting the models, and choosing Request model access. The models are enabled for use immediately.

Use AWS CloudFormation to create the solution stack

You can use AWS CloudFormation to create the solution stack. If you have created the solution for Part 1 in the same AWS account, be sure to delete that before creating this stack.

AWS Region Link
us-east-1
us-west-2

After the stack is created successfully, navigate to the stack’s Outputs tab on the AWS CloudFormation console and note the values for MultimodalCollectionEndpoint and OpenSearchPipelineEndpoint. You use these in the subsequent steps.

The CloudFormation template creates the following resources:

  • IAM roles – The following AWS Identity and Access Management (IAM) roles are created. Update these roles to apply least-privilege permissions, as discussed in Security best practices.
    • SMExecutionRole with Amazon Simple Storage Service (Amazon S3), SageMaker, OpenSearch Service, and Amazon Bedrock full access.
    • OSPipelineExecutionRole with access to the S3 bucket and OSI actions.
  • SageMaker notebook – All code for this post is run using this notebook.
  • OpenSearch Serverless collection – This is the vector database for storing and retrieving embeddings.
  • OSI pipeline – This is the pipeline for ingesting data into OpenSearch Serverless.
  • S3 bucket – All data for this post is stored in this bucket.

The CloudFormation template sets up the pipeline configuration required to configure the OSI pipeline with HTTP as source and the OpenSearch Serverless index as sink. The SageMaker notebook 2_data_ingestion.ipynb displays how to ingest data into the pipeline using the Requests HTTP library.

The CloudFormation template also creates network, encryption and data access policies required for your OpenSearch Serverless collection. Update these policies to apply least-privilege permissions.

The CloudFormation template name and OpenSearch Service index name are referenced in the SageMaker notebook 3_rag_inference.ipynb. If you change the default names, make sure you update them in the notebook.

Test the solution

After you have created the CloudFormation stack, you can test the solution. Complete the following steps:

  1. On the SageMaker console, choose Notebooks in the navigation pane.
  2. Select MultimodalNotebookInstance and choose Open JupyterLab.
  3. In File Browser, traverse to the notebooks folder to see notebooks and supporting files.

The notebooks are numbered in the sequence in which they run. Instructions and comments in each notebook describe the actions performed by that notebook. We run these notebooks one by one.

  1. Choose 1_data_prep.ipynb to open it in JupyterLab.
  2. On the Run menu, choose Run All Cells to run the code in this notebook.

This notebook will download a publicly available slide deck, convert each slide into the JPG file format, and upload these to the S3 bucket.
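
The notebook’s exact implementation may differ, but a minimal sketch of this step, assuming the deck is available as a PDF and that pdf2image (with the poppler utilities) is installed, could look like the following; the file path, bucket, and prefix are placeholders:

import boto3
from pdf2image import convert_from_path  # requires the poppler utilities

# Placeholders: adjust the path, bucket, and prefix to match your environment
pdf_path = "slide-deck.pdf"
bucket = "my-multimodal-bucket"
prefix = "slides"

s3 = boto3.client("s3")
images = convert_from_path(pdf_path)  # one PIL image per slide

for i, image in enumerate(images, start=1):
    local_file = f"slide_{i}.jpg"
    image.save(local_file, "JPEG")
    s3.upload_file(local_file, bucket, f"{prefix}/{local_file}")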

  1. Choose 2_data_ingestion.ipynb to open it in JupyterLab.
  2. On the Run menu, choose Run All Cells to run the code in this notebook.

In this notebook, you create an index in the OpenSearch Serverless collection. This index stores the embeddings data for the slide deck. See the following code:

session = boto3.Session()
credentials = session.get_credentials()
auth = AWSV4SignerAuth(credentials, g.AWS_REGION, g.OS_SERVICE)

os_client = OpenSearch(
  hosts = [{'host': host, 'port': 443}],
  http_auth = auth,
  use_ssl = True,
  verify_certs = True,
  connection_class = RequestsHttpConnection,
  pool_maxsize = 20
)

index_body = """
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "vector_embedding": {
        "type": "knn_vector",
        "dimension": 1536,
        "method": {
          "name": "hnsw",
          "engine": "nmslib",
          "parameters": {}
        }
      },
      "image_path": {
        "type": "text"
      },
      "slide_text": {
        "type": "text"
      },
      "slide_number": {
        "type": "text"
      },
      "metadata": { 
        "properties" :
          {
            "filename" : {
              "type" : "text"
            },
            "desc":{
              "type": "text"
            }
          }
      }
    }
  }
}
"""
index_body = json.loads(index_body)
try:
  response = os_client.indices.create(index_name, body=index_body)
  logger.info(f"response received for the create index -> {response}")
except Exception as e:
  logger.error(f"error in creating index={index_name}, exception={e}")

You use the Claude 3 Sonnet and Amazon Titan Text Embeddings models to convert the JPG images created in the previous notebook into vector embeddings. The following code snippet shows how Claude 3 Sonnet generates image descriptions:

import base64
import json

def get_img_desc(image_file_path: str, prompt: str):
    # read the file and base64-encode it; the maximum image size supported is 2048 x 2048 pixels
    with open(image_file_path, "rb") as image_file:
        input_image_b64 = base64.b64encode(image_file.read()).decode('utf-8')

    body = json.dumps(
        {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1000,
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "image",
                            "source": {
                                "type": "base64",
                                "media_type": "image/jpeg",
                                "data": input_image_b64
                            },
                        },
                        {"type": "text", "text": prompt},
                    ],
                }
            ],
        }
    )

    # bedrock is the Amazon Bedrock runtime client created earlier in the notebook
    response = bedrock.invoke_model(
        modelId=g.CLAUDE_MODEL_ID,
        body=body
    )

    resp_body = json.loads(response['body'].read().decode("utf-8"))
    resp_text = resp_body['content'][0]['text'].replace('"', "'")

    return resp_text
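A call to this helper might look like the following; the file name and prompt text are hypothetical examples, not the notebook's exact prompt:

# hypothetical example call; slide_1.jpg is one of the JPG files created by notebook 1
desc_prompt = "Describe this slide in detail, including any charts, tables, and numbers it contains."
slide_description = get_img_desc("slide_1.jpg", desc_prompt)
print(slide_description)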

The image descriptions are passed to the Amazon Titan Text Embeddings model to generate vector embeddings. These embeddings, along with additional metadata (such as the S3 path and description of the image file), are stored in the index. The following code snippet shows the call to the Amazon Titan Text Embeddings model:

def get_text_embedding(bedrock: botocore.client, prompt_data: str) -> list:
    body = json.dumps({
        "inputText": prompt_data,
    })    
    try:
        response = bedrock.invoke_model(
            body=body, modelId=g.TITAN_MODEL_ID, accept=g.ACCEPT_ENCODING, contentType=g.CONTENT_ENCODING
        )
        response_body = json.loads(response['body'].read())
        embedding = response_body.get('embedding')
    except Exception as e:
        logger.error(f"exception={e}")
        embedding = None

    return embedding
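Chaining the two helpers for a single slide then looks roughly like this (reusing the bedrock client and the variables from the previous snippet):

# hypothetical end-to-end call for one slide image
description = get_img_desc("slide_1.jpg", desc_prompt)
embedding = get_text_embedding(bedrock, description)
print(f"embedding length = {len(embedding)}")  # expected to be 1536, matching the index mapping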

The data is ingested into the OpenSearch Serverless index by making an API call to the OSI pipeline. The following code snippet shows the call made using the Requests HTTP library:

data = json.dumps([{
    "image_path": input_image_s3, 
    "slide_text": resp_text, 
    "slide_number": slide_number, 
    "metadata": {
        "filename": obj_name, 
        "desc": "" 
    }, 
    "vector_embedding": embedding
}])

r = requests.request(
    method='POST', 
    url=osi_endpoint, 
    data=data,
    auth=AWSSigV4('osis'))
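
The OSI pipeline returns an HTTP status code; a quick check of the response (not part of the original notebook) helps confirm that the document was accepted for ingestion:

# a 200 response indicates the OSI pipeline accepted the document
if r.status_code == 200:
    print("document accepted by the OSI pipeline")
else:
    print(f"ingestion failed: status={r.status_code}, body={r.text}")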

  1. Choose 3_rag_inference.ipynb to open it in JupyterLab.
  2. On the Run menu, choose Run All Cells to run the code in this notebook.

This notebook implements the RAG solution: you convert the user question into embeddings, find a similar image description from the vector database, and provide the retrieved description to Claude 3 Sonnet to generate an answer to the user question. You use the following prompt template:

  llm_prompt: str = """

  Human: Use the summary to provide a concise answer to the question to the best of your abilities. If you cannot answer the question from the context then say I do not know, do not make up an answer.
  <question>
  {question}
  </question>

  <summary>
  {summary}
  </summary>

  Assistant:"""

The following code snippet provides the RAG workflow:

def get_llm_response(bedrock: botocore.client, question: str, summary: str) -> str:
    prompt = llm_prompt.format(question=question, summary=summary)
    
    body = json.dumps(
    {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    })
        
    try:
        response = bedrock.invoke_model(
        modelId=g.CLAUDE_MODEL_ID,
        body=body)

        response_body = json.loads(response['body'].read().decode("utf-8"))
        llm_response = response_body['content'][0]['text'].replace('"', "'")
        
    except Exception as e:
        logger.error(f"exception while slide_text={summary[:10]}, exception={e}")
        llm_response = None

    return llm_response


# create prompt and convert to embeddings
question: str = "How does Inf2 compare in performance to comparable EC2 instances? I need numbers."
text_embedding = get_text_embedding(bedrock, question)

# vector db search
vector_db_response: Dict = find_similar_data(text_embedding)

# download image for local notebook display
s3_img_path = vector_db_response.get('hits', {}).get('hits')[0].get('_source').get('image_path')
logger.info(f"going to answer the question=\"{question}\" using the image \"{s3_img_path}\"")
!aws s3 cp {s3_img_path} .
local_img_path = os.path.basename(s3_img_path)
display(Image(filename=local_img_path))   # Image and display come from IPython.display

# Ask Claude 3 Sonnet
slide_text = vector_db_response.get('hits', {}).get('hits')[0].get('_source').get('slide_text')
llm_response = get_llm_response(bedrock, question, slide_text)
print(llm_response)
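
The find_similar_data function is not shown in the post. A minimal sketch of what such a k-NN query could look like follows, reusing the os_client and index_name from notebook 2; treat the function body, the k value, and the field names as assumptions rather than the notebook's actual implementation:

from typing import Dict

def find_similar_data(text_embedding, k: int = 1) -> Dict:
    # k-NN search against the vector_embedding field defined in the index mapping
    query = {
        "size": k,
        "query": {
            "knn": {
                "vector_embedding": {
                    "vector": text_embedding,
                    "k": k
                }
            }
        }
    }
    return os_client.search(index=index_name, body=query)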

Results

The following table contains some user questions and responses generated by our implementation. The Question column captures the user question, and the Answer column is the textual response generated by Claude 3 Sonnet. The Image column shows the k-NN slide match returned by the OpenSearch Serverless vector search.

Multimodal RAG results

Question: How does Inf2 compare in performance to comparable EC2 instances? I need numbers.
Answer: Based on the information provided in the summary, the key performance comparison numbers given for the Amazon EC2 Inf2 instances compared to other instances are: 1. Up to 4x higher throughput. 2. Up to 10x lower latency. The summary does not provide specific numbers comparing the Inf2 instances to other EC2 instance types. However, it highlights the throughput and latency improvements as the main performance advantages of the Inf2 instances over comparable instances for running generative AI models.
Image: Matching slide returned by the k-NN search.

Question: As per the AI/ML flywheel, what do the AWS AI/ML services provide?
Answer: Based on the summary, the AWS AI/ML services provide a range of capabilities that fuel an AI/ML flywheel. This flywheel consists of innovation in developing new and better algorithms, enabling more use-cases and applications, driving wider adoption and demand, which in turn leads to further investment in research and optimization to drive more innovation. The AWS AI/ML services seem to offer the tools, resources, and infrastructure to support this continuous cycle of innovation, application development, adoption, and reinvestment in the field of artificial intelligence and machine learning.
Image: Matching slide returned by the k-NN search.

Question: Compared to GPT-2, how many more parameters does GPT-3 have? What is the numerical difference between the parameter size of GPT-2 and GPT-3?
Answer: According to the information provided in the summary, GPT-3 from 2020 had 175B (175 billion) parameters, while GPT-2 from 2019 had 1.5B (1.5 billion) parameters. Therefore, the numerical difference between the parameter size of GPT-2 and GPT-3 is 173.5 billion parameters. GPT-3 has approximately 115 times more parameters than GPT-2.
Image: Matching slide returned by the k-NN search.

Question: What are quarks in particle physics?
Answer: Unfortunately, the given summary does not contain any information about quarks in particle physics. The summary describes an image related to the progression of natural language processing and generative AI technologies, but it does not mention anything about particle physics or the concept of quarks.
Image: Matching slide returned by the k-NN search.

Query your index

You can use OpenSearch Dashboards to interact with the OpenSearch API to run quick tests on your index and ingested data.
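
If you prefer to stay in the notebook, a couple of quick sanity checks against the index might look like the following, reusing the os_client and index_name from notebook 2:

# number of documents ingested into the index
print(os_client.count(index=index_name))

# fetch one document and inspect its extracted slide text
sample = os_client.search(index=index_name, body={"size": 1, "query": {"match_all": {}}})
print(sample['hits']['hits'][0]['_source'].get('slide_text'))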

Cleanup

To avoid incurring future charges, delete the resources. You can do this by deleting the stack using the AWS CloudFormation console.

Conclusion

Enterprises generate new content all the time, and slide decks are a common way to share and disseminate information internally within the organization and externally with customers or at conferences. Over time, rich information can remain buried and hidden in non-text modalities like graphs and tables in these slide decks.

You can use this solution and the power of multimodal FMs such as Amazon Titan Text Embeddings and Anthropic Claude 3 Sonnet to discover new information or uncover new perspectives on content in slide decks. You can try different Claude models available on Amazon Bedrock by updating the CLAUDE_MODEL_ID in the globals.py file.
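The contents of globals.py aren't reproduced in this post; a hypothetical excerpt of the relevant entries could look like the following (the model IDs shown are the standard Amazon Bedrock identifiers for these models):

# globals.py (hypothetical excerpt)
CLAUDE_MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"
# to experiment with a different Claude model, swap in its Amazon Bedrock model ID, for example:
# CLAUDE_MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"
TITAN_MODEL_ID = "amazon.titan-embed-text-v1"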

This is Part 2 of a three-part series. We used the Amazon Titan Multimodal Embeddings and the LLaVA model in Part 1. In Part 3, we will compare the approaches from Part 1 and Part 2.

Portions of this code are released under the Apache 2.0 License.


About the authors

Amit Arora is an AI and ML Specialist Architect at Amazon Web Services, helping enterprise customers use cloud-based machine learning services to rapidly scale their innovations. He is also an adjunct lecturer in the MS data science and analytics program at Georgetown University in Washington D.C.

Manju Prasad is a Senior Solutions Architect at Amazon Web Services. She focuses on providing technical guidance in a variety of technical domains, including AI/ML. Prior to joining AWS, she designed and built solutions for companies in the financial services sector and also for a startup. She is passionate about sharing knowledge and fostering interest in emerging talent.

Archana Inapudi is a Senior Solutions Architect at AWS, supporting a strategic customer. She has over a decade of cross-industry expertise leading strategic technical initiatives. Archana is an aspiring member of the AI/ML technical field community at AWS. Prior to joining AWS, Archana led a migration from traditional siloed data sources to Hadoop at a healthcare company. She is passionate about using technology to accelerate growth, provide value to customers, and achieve business outcomes.

Antara Raisa is an AI and ML Solutions Architect at Amazon Web Services, supporting strategic customers based out of Dallas, Texas. She also has previous experience working with large enterprise partners at AWS, where she worked as a Partner Success Solutions Architect for digital-centered customers.

Read More

Scale AI training and inference for drug discovery through Amazon EKS and Karpenter

Scale AI training and inference for drug discovery through Amazon EKS and Karpenter

This is a guest post co-written with the leadership team of Iambic Therapeutics.

Iambic Therapeutics is a drug discovery startup with a mission to create innovative AI-driven technologies to bring better medicines to cancer patients, faster.

Our advanced generative and predictive artificial intelligence (AI) tools enable us to search the vast space of possible drug molecules faster and more effectively. Our technologies are versatile and applicable across therapeutic areas, protein classes, and mechanisms of action. Beyond creating differentiated AI tools, we have established an integrated platform that merges AI software, cloud-based data, scalable computation infrastructure, and high-throughput chemistry and biology capabilities. The platform both enables our AI—by supplying data to refine our models—and is enabled by it, capitalizing on opportunities for automated decision-making and data processing.

We measure success by our ability to produce superior clinical candidates to address urgent patient need, at unprecedented speed: we advanced from program launch to clinical candidates in just 24 months, significantly faster than our competitors.

In this post, we focus on how we used Karpenter on Amazon Elastic Kubernetes Service (Amazon EKS) to scale AI training and inference, which are core elements of the Iambic discovery platform.

The need for scalable AI training and inference

Every week, Iambic performs AI inference across dozens of models and millions of molecules, serving two primary use cases:

  • Medicinal chemists and other scientists use our web application, Insight, to explore chemical space, access and interpret experimental data, and predict properties of newly designed molecules. All of this work is done interactively in real time, creating a need for inference with low latency and medium throughput.
  • At the same time, our generative AI models automatically design molecules targeting improvement across numerous properties, searching millions of candidates, and requiring enormous throughput and medium latency.

Guided by AI technologies and expert drug hunters, our experimental platform generates thousands of unique molecules each week, and each is subjected to multiple biological assays. The generated data points are automatically processed and used to fine-tune our AI models every week. Initially, our model fine-tuning took hours of CPU time, so a framework for scaling model fine-tuning on GPUs was imperative.

Our deep learning models have non-trivial requirements: they are gigabytes in size, are numerous and heterogeneous, and require GPUs for fast inference and fine-tuning. Looking to cloud infrastructure, we needed a system that allows us to access GPUs, scale up and down quickly to handle spiky, heterogeneous workloads, and run large Docker images.

We wanted to build a scalable system to support AI training and inference. We use Amazon EKS and were looking for the best solution to auto scale our worker nodes. We chose Karpenter for Kubernetes node auto scaling for a number of reasons:

  • Ease of integration with Kubernetes, using Kubernetes semantics to define node requirements and pod specs for scaling
  • Low-latency scale-out of nodes
  • Ease of integration with our infrastructure as code tooling (Terraform)

The node provisioners support effortless integration with Amazon EKS and other AWS resources such as Amazon Elastic Compute Cloud (Amazon EC2) instances and Amazon Elastic Block Store volumes. The Kubernetes semantics used by the provisioners support directed scheduling using Kubernetes constructs such as taints or tolerations and affinity or anti-affinity specifications; they also facilitate control over the number and types of GPU instances that may be scheduled by Karpenter.

Solution overview

In this section, we present a generic architecture that is similar to the one we use for our own workloads, which allows elastic deployment of models using efficient auto scaling based on custom metrics.

The following diagram illustrates the solution architecture.

The architecture deploys a simple service in a Kubernetes pod within an EKS cluster. This could be a model inference, data simulation, or any other containerized service, accessible by HTTP request. The service is exposed behind a reverse-proxy using Traefik. The reverse proxy collects metrics about calls to the service and exposes them via a standard metrics API to Prometheus. The Kubernetes Event Driven Autoscaler (KEDA) is configured to automatically scale the number of service pods, based on the custom metrics available in Prometheus. Here we use the number of requests per second as a custom metric. The same architectural approach applies if you choose a different metric for your workload.

Karpenter monitors for any pending pods that can’t run due to lack of sufficient resources in the cluster. If such pods are detected, Karpenter adds more nodes to the cluster to provide the necessary resources. Conversely, if there are more nodes in the cluster than what is needed by the scheduled pods, Karpenter removes some of the worker nodes and the pods get rescheduled, consolidating them on fewer instances. The number of HTTP requests per second and number of nodes can be visualized using a Grafana dashboard. To demonstrate auto scaling, we run one or more simple load-generating pods, which send HTTP requests to the service using curl.

Solution deployment

In the step-by-step walkthrough, we use AWS Cloud9 as an environment to deploy the architecture. This enables all steps to be completed from a web browser. You can also deploy the solution from a local computer or EC2 instance.

To simplify deployment and improve reproducibility, we follow the principles of the do-framework and the structure of the depend-on-docker template. We clone the aws-do-eks project and, using Docker, we build a container image that is equipped with the necessary tooling and scripts. Within the container, we run through all the steps of the end-to-end walkthrough, from creating an EKS cluster with Karpenter to scaling EC2 instances.

For the example in this post, we use the following EKS cluster manifest:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: do-eks-yaml-karpenter
  version: '1.28'
  region: us-west-2
  tags:
    karpenter.sh/discovery: do-eks-yaml-karpenter
iam:
  withOIDC: true
addons:
  - name: aws-ebs-csi-driver
    version: v1.26.0-eksbuild.1
    wellKnownPolicies:
      ebsCSIController: true
managedNodeGroups:
  - name: c5-xl-do-eks-karpenter-ng
    instanceType: c5.xlarge
    instancePrefix: c5-xl
    privateNetworking: true
    minSize: 0
    desiredCapacity: 2
    maxSize: 10
    volumeSize: 300
    iam:
      withAddonPolicies:
        cloudWatch: true
        ebs: true

This manifest defines a cluster named do-eks-yaml-karpenter with the EBS CSI driver installed as an add-on. A managed node group with two c5.xlarge nodes is included to run system pods that are needed by the cluster. The worker nodes are hosted in private subnets, and the cluster API endpoint is public by default.

You could also use an existing EKS cluster instead of creating one. We deploy Karpenter by following the instructions in the Karpenter documentation, or by running a script that automates those deployment instructions.

The following code shows the Karpenter configuration we use in this example:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    metadata:
      labels:
        cluster-name: do-eks-yaml-karpenter
      annotations:
        purpose: karpenter-example
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - spot
            - on-demand
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values:
            - c
            - m
            - r
            - g
            - p
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values:
            - '2'
  disruption:
    consolidationPolicy: WhenUnderutilized
    #consolidationPolicy: WhenEmpty
    #consolidateAfter: 30s
    expireAfter: 720h
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "do-eks-yaml-karpenter"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "do-eks-yaml-karpenter"
  role: "KarpenterNodeRole-do-eks-yaml-karpenter"
  tags:
    app: autoscaling-test
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 80Gi
        volumeType: gp3
        iops: 10000
        deleteOnTermination: true
        throughput: 125
  detailedMonitoring: true

We define a default Karpenter NodePool with the following requirements:

  • Karpenter can launch instances from both spot and on-demand capacity pools
  • Instances must be from the “c” (compute optimized), “m” (general purpose), “r” (memory optimized), or “g” and “p” (GPU accelerated) computing families
  • Instance generation must be greater than 2; for example, g3 is acceptable, but g2 is not

The default NodePool also defines disruption policies. Underutilized nodes will be removed so pods can be consolidated to run on fewer or smaller nodes. Alternatively, we can configure empty nodes to be removed after the specified time period. The expireAfter setting specifies the maximum lifetime of any node, before it is stopped and replaced if necessary. This helps reduce security vulnerabilities as well as avoid issues that are typical for nodes with long uptimes, such as file fragmentation or memory leaks.

By default, Karpenter provisions nodes with a small root volume, which can be insufficient for running AI or machine learning (ML) workloads. Some of the deep learning container images can be tens of GB in size, and we need to make sure there is enough storage space on the nodes to run pods using these images. To do that, we define EC2NodeClass with blockDeviceMappings, as shown in the preceding code.

Karpenter is responsible for auto scaling at the cluster level. To configure auto scaling at the pod level, we use KEDA to define a custom resource called ScaledObject, as shown in the following code:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: keda-prometheus-hpa
  namespace: hpa-example
spec:
  scaleTargetRef:
    name: php-apache
  minReplicaCount: 1
  cooldownPeriod: 30
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.prometheus.svc.cluster.local:80
        metricName: http_requests_total
        threshold: '1'
        query: rate(traefik_service_requests_total{service="hpa-example-php-apache-80@kubernetes",code="200"}[2m])

The preceding manifest defines a ScaledObject named keda-prometheus-hpa, which is responsible for scaling the php-apache deployment and always keeps at least one replica running. It scales the pods of this deployment based on the http_requests_total metric, obtained from Prometheus with the specified query, targeting no more than one request per second per pod. It scales down the replicas after the request load has stayed below the threshold for longer than 30 seconds.

The deployment spec for our example service contains the following resource requests and limits:

resources:
  limits:
    cpu: 500m
    nvidia.com/gpu: 1
  requests:
    cpu: 200m
    nvidia.com/gpu: 1

With this configuration, each of the service pods will use exactly one NVIDIA GPU. When new pods are created, they will be in Pending state until a GPU is available. Karpenter adds GPU nodes to the cluster as needed to accommodate the pending pods.

A load-generating pod sends HTTP requests to the service with a pre-set frequency. We increase the number of requests by increasing the number of replicas in the load-generator deployment.
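The post doesn't show the load generator itself (the real example sends requests with curl); as an illustration only, a container running a short Python loop like the following would produce an equivalent load. The service URL and request rate are hypothetical:

import time
import urllib.request

SERVICE_URL = "http://hpa-example-php-apache"  # hypothetical in-cluster service URL
REQUESTS_PER_SECOND = 5                        # hypothetical pre-set frequency

# send HTTP requests to the service at a fixed rate
while True:
    try:
        with urllib.request.urlopen(SERVICE_URL, timeout=5) as resp:
            resp.read()
    except Exception as exc:
        print(f"request failed: {exc}")
    time.sleep(1.0 / REQUESTS_PER_SECOND)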

A full scaling cycle with utilization-based node consolidation is visualized in a Grafana dashboard. The following dashboard shows the number of nodes in the cluster by instance type (top), the number of requests per second (bottom left), and the number of pods (bottom right).

Scaling Dashboard 1

We start with just the two c5.xlarge CPU instances that the cluster was created with. Then we deploy one service instance, which requires a single GPU. Karpenter adds a g4dn.xlarge instance to accommodate this need. We then deploy the load generator, which causes KEDA to add more service pods and Karpenter adds more GPU instances. After optimization, the state settles on one p3.8xlarge instance with 8 GPUs and one g5.12xlarge instance with 4 GPUs.

When we scale the load-generating deployment to 40 replicas, KEDA creates additional service pods to maintain the required request load per pod. Karpenter adds g4dn.metal and g4dn.12xlarge nodes to the cluster to provide the needed GPUs for the additional pods. In the scaled state, the cluster contains 16 GPU nodes and serves about 300 requests per second. When we scale down the load generator to 1 replica, the reverse process takes place. After the cooldown period, KEDA reduces the number of service pods. Then as fewer pods run, Karpenter removes the underutilized nodes from the cluster and the service pods get consolidated to run on fewer nodes. When the load generator pod is removed, a single service pod on a single g4dn.xlarge instance with 1 GPU remains running. When we remove the service pod as well, the cluster is left in the initial state with only two CPU nodes.

We can observe this behavior when the NodePool has the setting consolidationPolicy: WhenUnderutilized.

With this setting, Karpenter dynamically configures the cluster with as few nodes as possible, while providing sufficient resources for all pods to run and also minimizing cost.

The scaling behavior shown in the following dashboard is observed when the NodePool consolidation policy is set to WhenEmpty, along with consolidateAfter: 30s.

Scaling Dashboard 2

In this scenario, nodes are stopped only when there are no pods running on them after the cool-off period. The scaling curve appears smooth, compared to the utilization-based consolidation policy; however, it can be seen that more nodes are used in the scaled state (22 vs. 16).

Overall, combining pod and cluster auto scaling makes sure that the cluster scales dynamically with the workload, allocating resources when needed and removing them when not in use, thereby maximizing utilization and minimizing cost.

Outcomes

Iambic used this architecture to make efficient use of GPUs on AWS and migrate workloads from CPU to GPU. With GPU-powered EC2 instances, Amazon EKS, and Karpenter, we enabled faster inference for our physics-based models and faster experiment iteration for applied scientists who rely on training as a service.

The following table summarizes some of the time metrics of this migration.

Task: Inference using diffusion models for physics-based ML models
CPUs: 3,600 seconds
GPUs: 100 seconds (due to inherent batching of GPUs)

Task: ML model training as a service
CPUs: 180 minutes
GPUs: 4 minutes

The following table summarizes some of our time and cost metrics.

Task: ML model training
CPUs: 240 minutes (average $0.70 per training task)
GPUs: 20 minutes (average $0.38 per training task)

Summary

In this post, we showcased how Iambic used Karpenter and KEDA to scale our Amazon EKS infrastructure to meet the latency requirements of our AI inference and training workloads. Karpenter and KEDA are powerful open source tools that help auto scale EKS clusters and workloads running on them. This helps optimize compute costs while meeting performance requirements. You can check out the code and deploy the same architecture in your own environment by following the complete walkthrough in this GitHub repo.


About the Authors

Matthew Welborn is the director of Machine Learning at Iambic Therapeutics. He and his team leverage AI to accelerate the identification and development of novel therapeutics, bringing life-saving medicines to patients faster.

Paul Whittemore is a Principal Engineer at Iambic Therapeutics. He supports delivery of the infrastructure for the Iambic AI-driven drug discovery platform.

Alex Iankoulski is a Principal Solutions Architect, ML/AI Frameworks, who focuses on helping customers orchestrate their AI workloads using containers and accelerated computing infrastructure on AWS.

Read More

Generate customized, compliant application IaC scripts for AWS Landing Zone using Amazon Bedrock

Generate customized, compliant application IaC scripts for AWS Landing Zone using Amazon Bedrock

Migrating to the cloud is an essential step for modern organizations aiming to capitalize on the flexibility and scale of cloud resources. Tools like Terraform and AWS CloudFormation are pivotal for such transitions, offering infrastructure as code (IaC) capabilities that define and manage complex cloud environments with precision. However, despite its benefits, IaC’s learning curve, and the complexity of adhering to your organization’s and industry-specific compliance and security standards, could slow down your cloud adoption journey. Organizations typically counter these hurdles by investing in extensive training programs or hiring specialized personnel, which often leads to increased costs and delayed migration timelines.

Generative artificial intelligence (AI) with Amazon Bedrock directly addresses these challenges. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon with a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Amazon Bedrock empowers teams to generate Terraform and CloudFormation scripts that are custom fitted to organizational needs while seamlessly integrating compliance and security best practices. Traditionally, cloud engineers learning IaC would manually sift through documentation and best practices to write compliant IaC scripts. With Amazon Bedrock, teams can input high-level architectural descriptions and use generative AI to generate a baseline configuration of Terraform scripts. These generated scripts are tailored to meet your organization’s unique requirements while conforming to industry standards for security and compliance. These scripts serve as a foundational starting point, requiring further refinement and validation to make sure they meet production-level standards.

This solution not only accelerates the migration process but also provides a standardized and secure cloud infrastructure. Additionally, it offers beginner cloud engineers initial script drafts as standard templates to build upon, facilitating their IaC learning journey.

As you navigate the complexities of cloud migration, the need for a structured, secure, and compliant environment is paramount. AWS Landing Zone addresses this need by offering a standardized approach to deploying AWS resources. This makes sure your cloud foundation is built according to AWS best practices from the start. With AWS Landing Zone, you eliminate the guesswork in security configurations, resource provisioning, and account management. It’s particularly beneficial for organizations looking to scale without compromising on governance or control, providing a clear path to a robust and efficient cloud setup.

In this post, we show you how to generate customized, compliant IaC scripts for AWS Landing Zone using Amazon Bedrock.

AWS Landing Zone architecture in the context of cloud migration

AWS Landing Zone can help you set up a secure, multi-account AWS environment based on AWS best practices. It provides a baseline environment to get started with a multi-account architecture, automate the setup of new accounts, and centralize compliance, security, and identity management. The following is an example of a customized Terraform-based AWS Landing Zone solution, in which each application resides in its own AWS account.

The high-level workflow includes the following components:

  • Module provisioning – Different platform teams across various domains, such as databases, containers, data management, networking, and security, develop and publish certified or custom modules. These are delivered through pipelines to a Terraform private module registry, which is maintained by the organization for consistency and standardization.
  • Account vending machine layer – The account vending machine (AVM) layer uses either AWS Control Tower, AWS Account Factory for Terraform (AFT), or a custom landing zone solution to vend accounts. In this post, we refer to these solutions collectively as the AVM layer. When application owners submit a request to the AVM layer, it processes the input parameters from the request to provision a target AWS account. This account is then provisioned with tailored infrastructure components through AVM customizations, which include AWS Control Tower customizations or AFT customizations.
  • Application infrastructure layer – In this layer, application teams deploy their infrastructure components into the provisioned AWS accounts. This is achieved by writing Terraform code within an application-specific repository. The Terraform code calls upon the modules previously published to the Terraform private registry by the platform teams.

Overcoming on-premises IaC migration challenges with generative AI

Teams maintaining on-premises applications often encounter a learning curve with Terraform, a key tool for IaC in AWS environments. This skill gap can be a significant hurdle in cloud migration efforts. Amazon Bedrock, with its generative AI capabilities, plays an essential role in mitigating this challenge. It facilitates the automation of Terraform code creation for the application infrastructure layer, empowering teams with limited Terraform experience to make an efficient transition to AWS.

Amazon Bedrock generates Terraform code from architectural descriptions. The generated code is custom and standardized based on organizational best practices, security, and regulatory guidelines. This standardization is made possible by using advanced prompts in conjunction with Knowledge Bases for Amazon Bedrock, which stores information on organization-specific Terraform modules. This solution uses Retrieval Augmented Generation (RAG) to enrich the input prompt to Amazon Bedrock with details from the knowledge base, making sure the output Terraform configuration and README contents are compliant with your organization’s Terraform best practices and guidelines.

The following diagram illustrates this architecture.

The workflow consists of the following steps:

  1. The process begins with account vending, where application owners submit a request for a new AWS account. This invokes the AVM, which processes the request parameters to provision the target AWS account.
  2. An architecture description for an application slated for migration is passed as one of the inputs to the AVM layer.
  3. After the account is provisioned, AVM customizations are applied. This can include AWS Control Tower customizations or AFT customizations that set up the account with the necessary infrastructure components and configurations in line with organizational policies.
  4. In parallel, the AVM layer invokes a Lambda function to generate Terraform code. This function enriches the architecture description with a customized prompt, and uses RAG to further enhance the prompt with organization-specific coding guidelines from the Amazon Bedrock knowledge base. This knowledge base includes tailored best practices, security guardrails, and guidelines specific to the organization, such as organization-specific Terraform module specifications and guidelines uploaded to it.
  5. Before deployment, the initial draft of the Terraform code is thoroughly reviewed by cloud engineers or an automated code review system to confirm that it meets all technical and compliance standards.
  6. The reviewed and updated Terraform scripts are then used to deploy infrastructure components into the newly provisioned AWS account, setting up compute, storage, and networking resources required for the application.

Solution overview

The AWS Landing Zone deployment uses a Lambda function for generating Terraform scripts from architectural inputs. This function, which is central to the operation, translates these inputs into compliant code, using Amazon Bedrock and Knowledge Bases for Amazon Bedrock. The output is then stored in a GitHub repository, corresponding to the specific application in migration. The following sections detail the prerequisites and specific steps needed to implement this solution.

Prerequisites

You should have the following:

Configure the Lambda function to generate custom code

This Lambda function is a key component in automating the creation of customized, compliant Terraform configurations for AWS services. It commits the generated configurations directly to a designated GitHub repository, aligning with organizational best practices. For the function code, refer to the following GitHub repo. To create the Lambda function, follow the instructions in the AWS Lambda documentation.

The following diagram illustrates the workflow of the function.

The workflow includes the following steps:

  1. The function is invoked by an event from the AVM layer, containing the architecture description.
  2. The function retrieves and uses Terraform module definitions from the knowledge base.
  3. The function invokes the Amazon Bedrock model twice, following recommended prompt engineering guidelines. The function applies RAG to enrich the input prompt with the Terraform module information, making sure the output code meets organizational best practices. A minimal sketch of these two invocations appears after this list.
    • First, generate Terraform configurations following organizational coding guidelines and include Terraform module details from the knowledge base. For example, the prompt could be: “Generate Terraform configurations for AWS services. Follow security best practices by using IAM roles and least privilege permissions. Include all necessary parameters, with default values. Add comments explaining the overall architecture and the purpose of each resource.”
    • Second, create a detailed README file. For example: “Generate a detailed README for the Terraform configuration based on AWS services. Include sections on security improvements, cost optimization tips following the AWS Well-Architected Framework. Also, include detailed Cost Breakdown for each AWS service used with hourly rates and total daily and monthly costs.”
  4. It commits the generated Terraform configuration and the README to the GitHub repository, providing traceability and transparency.
  5. Lastly, it responds with success, including URLs to the committed GitHub files, or returns detailed error information for troubleshooting.
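The two model invocations described in step 3 could be sketched roughly as follows. This is a minimal illustration, not the function's actual code: the knowledge base ID, model ID, prompt text, example architecture description, and helper names are assumptions, and the real function also handles the GitHub commit and error reporting.

import json
import boto3

bedrock_agent = boto3.client("bedrock-agent-runtime")
bedrock_runtime = boto3.client("bedrock-runtime")

KB_ID = "XXXXXXXXXX"                                   # hypothetical knowledge base ID
MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"   # hypothetical model choice

def retrieve_module_guidelines(architecture_description: str) -> str:
    # pull organization-specific Terraform module details from the knowledge base (RAG retrieval step)
    result = bedrock_agent.retrieve(
        knowledgeBaseId=KB_ID,
        retrievalQuery={"text": architecture_description},
    )
    return "\n".join(r["content"]["text"] for r in result["retrievalResults"])

def generate(prompt: str) -> str:
    # single Amazon Bedrock invocation using the Anthropic messages format
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 4000,
        "messages": [{"role": "user", "content": [{"type": "text", "text": prompt}]}],
    })
    response = bedrock_runtime.invoke_model(modelId=MODEL_ID, body=body)
    return json.loads(response["body"].read())["content"][0]["text"]

architecture_description = "Three-tier web application with an ALB, ECS service, and RDS database"  # example input

guidelines = retrieve_module_guidelines(architecture_description)

terraform_code = generate(
    "Generate Terraform configurations for AWS services. Follow security best practices. "
    f"Use these organizational module guidelines:\n{guidelines}\n\nArchitecture:\n{architecture_description}"
)
readme_text = generate(
    f"Generate a detailed README for the following Terraform configuration:\n{terraform_code}"
)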

Configure Knowledge Bases for Amazon Bedrock

Follow these steps to set up your knowledge base in Amazon Bedrock:

  1. On the Amazon Bedrock console, choose Knowledge base in the navigation pane.
  2. Choose Create knowledge base.
  3. Enter a clear and descriptive name that reflects the purpose of your knowledge base, such as AWS Account Setup Knowledge Base For Amazon Bedrock.
  4. Assign a pre-configured IAM role with the necessary permissions. It’s typically best to let Amazon Bedrock create this role for you to make sure it has the correct permissions.
  5. Upload a JSON file to an S3 bucket with encryption enabled for security. This file should contain a structured list of AWS services and Terraform modules. For the JSON structure, use the following example from the GitHub repository. (A sketch of this upload appears after these steps.)
  6. Choose the default embeddings model.
  7. Allow Amazon Bedrock to create and manage the vector store for you in Amazon OpenSearch Service.
  8. Review the information for accuracy. Pay special attention to the S3 bucket URI and IAM role details.
  9. Create your knowledge base.
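
The upload in step 5 can be done on the S3 console or with a short boto3 call like the following sketch; the bucket name, key, and JSON content here are placeholders, and the actual JSON structure should follow the example in the GitHub repository:

import json
import boto3

s3 = boto3.client("s3")

# placeholder content; use the JSON structure from the GitHub repository example
modules = {"aws_services": [{"service": "AWS Lambda", "terraform_module": "corp-terraform-lambda"}]}

s3.put_object(
    Bucket="my-landing-zone-kb-bucket",        # hypothetical bucket name
    Key="terraform-modules.json",              # hypothetical object key
    Body=json.dumps(modules),
    ServerSideEncryption="AES256",             # encryption enabled, as required in step 5
)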

After you deploy and configure these components, when your AWS Landing Zone solution invokes the Lambda function, the following files are generated:

  • A Terraform configuration file – This file specifies the infrastructure setup.
  • A comprehensive README file – This file documents the security standards embedded within the code, confirming that they align with the security practices outlined in the initial sections. Additionally, this README includes an architectural summary, cost optimization tips, and a detailed cost breakdown for the resources described in the Terraform configuration.

The following screenshot shows an example of the Terraform configuration file.

The following screenshot shows an example of the README file.

Clean up

Complete the following steps to clean up your resources:

  1. Delete the Lambda function if it’s no longer required.
  2. Empty and delete the S3 bucket used for Terraform state storage.
  3. Remove the generated Terraform scripts and README file from the GitHub repo.
  4. Delete the knowledge base if it’s no longer needed.

Conclusion

The generative AI capabilities of Amazon Bedrock not only streamline the creation of compliant Terraform scripts for AWS deployments, but also act as a pivotal learning aid for beginner cloud engineers transitioning on-premises applications to AWS. This approach accelerates the cloud migration process and helps you adhere to best practices. You can also use the solution to provide value after the migration, enhancing daily operations such as ongoing infrastructure and cost optimization. Although we primarily focused on Terraform in this post, these principles can also enhance your AWS CloudFormation deployments, providing a versatile solution for your infrastructure needs.

Ready to simplify your cloud migration process with generative AI in Amazon Bedrock? Begin by exploring the Amazon Bedrock User Guide to understand how it can streamline your organization’s cloud journey. For further assistance and expertise, consider using AWS Professional Services to help you streamline your cloud migration journey and maximize the benefits of Amazon Bedrock.

Unlock the potential for rapid, secure, and efficient cloud adoption with Amazon Bedrock. Take the first step today and discover how it can enhance your organization’s cloud transformation endeavors.


About the Author

Ebbey Thomas specializes in strategizing and developing custom AWS Landing Zone resources with a focus on using generative AI to enhance cloud infrastructure automation. In his role at AWS Professional Services, Ebbey’s expertise is central to architecting solutions that streamline cloud adoption, providing a secure and efficient operational framework for AWS users. He is known for his innovative approach to cloud challenges and his commitment to driving forward the capabilities of cloud services.

Read More

Live Meeting Assistant with Amazon Transcribe, Amazon Bedrock, and Knowledge Bases for Amazon Bedrock

Live Meeting Assistant with Amazon Transcribe, Amazon Bedrock, and Knowledge Bases for Amazon Bedrock

See CHANGELOG for latest features and fixes.

You’ve likely experienced the challenge of taking notes during a meeting while trying to pay attention to the conversation. You’ve probably also experienced the need to quickly fact-check something that’s been said, or look up information to answer a question that’s just been asked in the call. Or maybe you have a team member that always joins meetings late, and expects you to send them a quick summary over chat to catch them up.

Then there are the times that others are talking in a language that’s not your first language, and you’d love to have a live translation of what people are saying to make sure you understand correctly.

And after the call is over, you usually want to capture a summary for your records, or to send to the participants, with a list of all the action items, owners, and due dates.

All of this, and more, is now possible with our newest sample solution, Live Meeting Assistant (LMA).

Check out the following demo to see how it works.

In this post, we show you how to use LMA with Amazon Transcribe, Amazon Bedrock, and Knowledge Bases for Amazon Bedrock.

Solution overview

The LMA sample solution captures speaker audio and metadata from your browser-based meeting app (as of this writing, Zoom and Chime are supported), or audio only from any other browser-based meeting app, softphone, or audio source. It uses Amazon Transcribe for speech to text, Knowledge Bases for Amazon Bedrock for contextual queries against your company’s documents and knowledge sources, and Amazon Bedrock models for customizable transcription insights and summaries.

Everything you need is provided as open source in our GitHub repo. It’s straightforward to deploy in your AWS account. When you’re done, you’ll wonder how you ever managed without it!

The following are some of the things LMA can do:

  • Live transcription with speaker attribution – LMA is powered by Amazon Transcribe ASR models for low-latency, high-accuracy speech to text. You can teach it brand names and domain-specific terminology if needed, using custom vocabulary and custom language model features in Amazon Transcribe.
  • Live translation – It uses Amazon Translate to optionally show each segment of the conversation translated into your language of choice, from a selection of 75 languages.
  • Context-aware meeting assistant – It uses Knowledge Bases for Amazon Bedrock to provide answers from your trusted sources, using the live transcript as context for fact-checking and follow-up questions. To activate the assistant, just say “Okay, Assistant,” choose the ASK ASSISTANT! button, or enter your own question in the UI.
  • On-demand summaries of the meeting – With the click of a button on the UI, you can generate a summary, which is useful when someone joins late and needs to get caught up. The summaries are generated from the transcript by Amazon Bedrock. LMA also provides options for identifying the current meeting topic, and for generating a list of action items with owners and due dates. You can also create your own custom prompts and corresponding options.
  • Automated summary and insights – When the meeting has ended, LMA automatically runs a set of large language model (LLM) prompts on Amazon Bedrock to summarize the meeting transcript and extract insights. You can customize these prompts as well.
  • Meeting recording – The audio is (optionally) stored for you, so you can replay important sections on the meeting later.
  • Inventory list of meetings – LMA keeps track of all your meetings in a searchable list.
  • Browser extension captures audio and meeting metadata from popular meeting apps – The browser extension captures meeting metadata—the meeting title and names of active speakers—and audio from you (your microphone) and others (from the meeting browser tab). As of this writing, LMA supports Chrome for the browser extension, and Zoom and Chime for meeting apps (with Teams and WebEx coming soon). Standalone meeting apps don't work with LMA; instead, launch your meetings in the browser.

You are responsible for complying with legal, corporate, and ethical restrictions that apply to recording meetings and calls. Do not use this solution to stream, record, or transcribe calls if otherwise prohibited.

Prerequisites

You need to have an AWS account and an AWS Identity and Access Management (IAM) role and user with permissions to create and manage the necessary resources and components for this application. If you don’t have an AWS account, see How do I create and activate a new Amazon Web Services account?

You also need an existing knowledge base in Amazon Bedrock. If you haven’t set one up yet, see Create a knowledge base. Populate your knowledge base with content to power LMA’s context-aware meeting assistant.

Finally, LMA uses Amazon Bedrock LLMs for its meeting summarization features. Before proceeding, if you have not previously done so, you must request access to the following Amazon Bedrock models:

  • Titan Embeddings G1 – Text
  • Anthropic: All Claude models

Deploy the solution using AWS CloudFormation

We’ve provided pre-built AWS CloudFormation templates that deploy everything you need in your AWS account.

If you’re a developer and you want to build, deploy, or publish the solution from code, refer to the Developer README.

Complete the following steps to launch the CloudFormation stack:

  1. Log in to the AWS Management Console.
  2. Choose Launch Stack for your desired AWS Region to open the AWS CloudFormation console and create a new stack.
Launch Stack links are available for the US East (N. Virginia) and US West (Oregon) Regions.
  1. For Stack name, use the default value, LMA.
  2. For Admin Email Address, use a valid email address—your temporary password is emailed to this address during the deployment.
  3. For Authorized Account Email Domain, use the domain name part of your corporate email address to allow users with email addresses in the same domain to create their own new UI accounts, or leave blank to prevent users from directly creating their own accounts. You can enter multiple domains as a comma-separated list.
  4. For MeetingAssistService, choose BEDROCK_KNOWLEDGE_BASE (the only available option as of this writing).
  5. For Meeting Assist Bedrock Knowledge Base Id (existing), enter your existing knowledge base ID (for example, JSXXXXX3D8). You can copy it from the Amazon Bedrock console.
  6. For all other parameters, use the default values.

If you want to customize the settings later, for example to add your own AWS Lambda functions, use custom vocabularies and language models to improve accuracy, enable personally identifiable information (PII) redaction, and more, you can update the stack for these parameters.

  1. Select the acknowledgement check boxes, then choose Create stack.

The main CloudFormation stack uses nested stacks to create the required resources in your AWS account.

The stacks take about 35–40 minutes to deploy. The main stack status shows CREATE_COMPLETE when everything is deployed.

Set your password

After you deploy the stack, open the LMA web user interface and set your password by completing the following steps:

  1. Open the email you received, at the email address you provided, with the subject “Welcome to Live Meeting Assistant!”
  2. Open your web browser to the URL shown in the email. You’re directed to the login page.
  3. The email contains a generated temporary password that you use to log in and create your own password. Your user name is your email address.
  4. Set a new password.

Your new password must have a length of at least eight characters, and contain uppercase and lowercase characters, plus numbers and special characters.

  1. Follow the directions to verify your email address, or choose Skip to do it later.

You’re now logged in to LMA.

You also received a similar email with the subject “QnABot Signup Verification Code.” This email contains a generated temporary password that you use to log in and create your own password in the QnABot designer. You use QnABot designer only if you want to customize LMA options and prompts. Your username for QnABot is Admin. You can set your permanent QnABot Admin password now, or keep this email safe in case you want to customize things later.

Download and install the Chrome browser extension

For the best meeting streaming experience, install the LMA browser plugin (currently available for Chrome):

  1. Choose Download Chrome Extension to download the browser extension .zip file (lma-chrome-extension.zip).
  2. Choose (right-click) and expand the .zip file (lma-chrome-extension.zip) to create a local folder named lma-chrome-extension.
  3. Open Chrome and enter the link chrome://extensions into the address bar.
  4. Enable Developer mode.
  5. Choose Load unpacked, navigate to the lma-chrome-extension folder (which you unzipped from the download), and choose Select. This loads your extension.
  6. Pin the new LMA extension to the browser tool bar for easy access—you will use it often to stream your meetings!

Start using LMA

LMA provides two streaming options:

  • Chrome browser extension – Use this to stream audio and speaker metadata from your meeting browser app. It currently works with Zoom and Chime, but we hope to add more meeting apps.
  • LMA Stream Audio tab – Use this to stream audio from your microphone and any Chrome browser-based meeting app, softphone, or audio application.

We show you how to use both options in the following sections.

Use the Chrome browser extension to stream a Zoom call

Complete the following steps to use the browser extension:

  1. Open the LMA extension and log in with your LMA credentials.
  2. Join or start a Zoom meeting in your web browser (do not use the separate Zoom client).

If you already have the Zoom meeting page loaded, reload it.

The LMA extension automatically detects that Zoom is running in the browser tab, and populates your name and the meeting name.

  1. Tell others on the call that you are about to start recording the call using LMA and obtain their permission. Do not proceed if participants object.
  2. Choose Start Listening.
  3. Read and accept the disclaimer, and choose Allow to share the browser tab.

The LMA extension automatically detects and displays the active speaker on the call. If you are alone in the meeting, invite some friends to join, and observe that the names they used to join the call are displayed in the extension when they speak, and are attributed to their words in the LMA transcript.

  1. Choose Open in LMA to see your live transcript in a new tab.
  2. Choose your preferred transcript language, and interact with the meeting assistant using the wake phrase “OK Assistant!” or the Meeting Assist Bot pane.

The ASK ASSISTANT button asks the meeting assistant service (Amazon Bedrock knowledge base) to suggest a good response based on the transcript of the recent interactions in the meeting. Your mileage may vary, so experiment!

  1. When you are done, choose Stop Streaming to end the meeting in LMA.

Within a few seconds, the automated end-of-meeting summaries appear, and the audio recording becomes available. You can continue to use the bot after the call has ended.

Use the LMA UI Stream Audio tab to stream from your microphone and any browser-based audio application

The browser extension is the most convenient way to stream metadata and audio from supported meeting web apps. However, you can also use LMA to stream just the audio from any browser-based softphone, meeting app, or other audio source playing in your Chrome browser, using the convenient Stream Audio tab that is built into the LMA UI.

  1. Open any audio source in a browser tab.

For example, this could be a softphone (such as Google Voice), another meeting app, or for demo purposes, you can simply play a local audio recording or a YouTube video in your browser to emulate another meeting participant. If you just want to try it, open the following YouTube video in a new tab.

  1. In the LMA App UI, choose Stream Audio (no extension) to open the Stream Audio tab.
  2. For Meeting ID, enter a meeting ID.
  3. For Name, enter a name for yourself (applied to audio from your microphone).
  4. For Participant Name(s), enter the names of the participants (applied to the incoming audio source).
  5. Choose Start Streaming.
  6. Choose the browser tab you opened earlier, and choose Allow to share.
  7. Choose the LMA UI tab again to view your new meeting ID listed, showing the meeting as In Progress.
  8. Choose the meeting ID to open the details page, and watch the transcript of the incoming audio, attributed to the participant names that you entered. If you speak, you’ll see the transcription of your own voice.

Use the Stream Audio feature to stream from any softphone app, meeting app, or any other streaming audio playing in the browser, along with your own audio captured from your selected microphone. Always obtain permission from others before recording them using LMA, or any other recording application.

Processing flow overview

How did LMA transcribe and analyze your meeting? Let’s look at how it works. The following diagram shows the main architectural components and how they fit together at a high level.

The LMA user joins a meeting in their browser, enables the LMA browser extension, and authenticates using their LMA credentials. If the meeting app (for example, Zoom.us) is supported by the LMA extension, the user’s name, meeting name, and active speaker names are automatically detected by the extension. If the meeting app is not supported by the extension, then the LMA user can manually enter their name and the meeting topic—active speakers’ names will not be detected.

After getting permission from other participants, the LMA user chooses Start Listening on the LMA extension pane. A secure WebSocket connection is established to the preconfigured LMA stack WebSocket URL, and the user’s authentication token is validated. The LMA browser extension sends a START message to the WebSocket containing the meeting metadata (name, topic, and so on), and starts streaming two-channel audio from the user’s microphone and the incoming audio channel containing the voices of the other meeting participants. The extension monitors the meeting app to detect active speaker changes during the call, and sends that metadata to the WebSocket, enabling LMA to label speech segments with the speaker’s name.

The WebSocket server running in Fargate consumes the real-time two-channel audio fragments from the incoming WebSocket stream. The audio is streamed to Amazon Transcribe, and the transcription results are written in real time to Kinesis Data Streams.

Each meeting processing session runs until the user chooses Stop Listening in the LMA extension pane, or ends the meeting and closes the tab. At the end of the call, the function creates a stereo recording file in Amazon S3 (if recording was enabled when the stack was deployed).

A Lambda function called the Call Event Processor, fed by Kinesis Data Streams, processes and optionally enriches meeting metadata and transcription segments. The Call Event Processor integrates with the meeting assist services, which are powered by Amazon Lex, Knowledge Bases for Amazon Bedrock, and Amazon Bedrock LLMs through the open source QnABot on AWS solution; QnABot answers questions based on FAQs and acts as an orchestrator, routing requests to the appropriate AI service. When the call ends, the Call Event Processor also invokes the Transcript Summarization Lambda function to generate a summary of the call from the full transcript.
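
The following is a hedged sketch of what a Kinesis-fed Lambda handler of this kind might look like: it decodes each record, processes transcript segments, and asynchronously invokes a summarization function when the call ends. The payload fields and the summarization function name are illustrative assumptions, not the actual LMA implementation.

```python
# Hedged sketch of a Kinesis-fed Lambda handler; payload fields and the
# summarization function name are illustrative assumptions.
import base64
import json
import boto3

lambda_client = boto3.client("lambda")

def handler(event, context):
    for record in event["Records"]:
        # Kinesis record data arrives base64-encoded
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        event_type = payload.get("eventType")
        if event_type == "ADD_TRANSCRIPT_SEGMENT":
            # Enrich the segment and persist it via AppSync mutations (not shown)
            process_transcript_segment(payload)
        elif event_type == "END":
            # At call end, invoke the (hypothetically named) summarization function
            lambda_client.invoke(
                FunctionName="TranscriptSummarizationFunction",
                InvocationType="Event",  # asynchronous invocation
                Payload=json.dumps({"callId": payload["callId"]}).encode("utf-8"),
            )
    return {"statusCode": 200}

def process_transcript_segment(payload):
    # Placeholder for enrichment and meeting-assist integration
    pass
```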

The Call Event Processor function interfaces with AWS AppSync to persist changes (mutations) in Amazon DynamoDB and send real-time updates to the LMA user’s logged-in web clients (conveniently opened by choosing the Open in LMA option in the browser extension).

The LMA web UI assets are hosted on Amazon S3 and served via CloudFront. Authentication is provided by Amazon Cognito.

When the user is authenticated, the web application establishes a secure GraphQL connection to the AWS AppSync API, and subscribes to receive real-time events such as new calls and call status changes for the meetings list page, and new or updated transcription segments and computed analytics for the meeting details page. When translation is enabled, the web application also interacts securely with Amazon Translate to translate the meeting transcription into the selected language.
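
In LMA, translation happens in the browser-based web application; purely for illustration, the following Python sketch shows the equivalent Amazon Translate API call for a single transcript segment.

```python
# Simple sketch of on-demand transcript translation with Amazon Translate.
import boto3

translate = boto3.client("translate")

def translate_segment(text, target_language="es"):
    """Translate a transcript segment into the user's selected language."""
    response = translate.translate_text(
        Text=text,
        SourceLanguageCode="auto",      # let Amazon Translate detect the source language
        TargetLanguageCode=target_language,
    )
    return response["TranslatedText"]
```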

The entire processing flow, from ingested speech to live webpage updates, is event driven, and the end-to-end latency is short—typically just a few seconds.

Monitoring and troubleshooting

AWS CloudFormation reports deployment failures and causes on the relevant stack’s Events tab. See Troubleshooting CloudFormation for help with common deployment problems. Look out for deployment failures caused by limit exceeded errors; the LMA stacks create resources that are subject to default account and Region service quotas, such as Elastic IP addresses and NAT gateways. When troubleshooting CloudFormation stack failures, always navigate into any failed nested stacks to find the first nested resource failure reported—this is almost always the root cause.
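
You can also scan stack events programmatically. The following sketch lists failed resource events for a stack, oldest first, which usually surfaces the root cause quickly; run it against the parent stack and any failed nested stacks.

```python
# Sketch: print failed resource events for a CloudFormation stack, oldest first.
import boto3

cfn = boto3.client("cloudformation")

def first_failures(stack_name):
    """List failed events for the given stack; repeat for any failed nested stacks."""
    events = cfn.describe_stack_events(StackName=stack_name)["StackEvents"]
    for ev in sorted(events, key=lambda e: e["Timestamp"]):
        if ev["ResourceStatus"].endswith("FAILED"):
            print(ev["Timestamp"], ev["LogicalResourceId"],
                  ev.get("ResourceStatusReason", ""))

# Example usage: first_failures("LMA")
```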

Amazon Transcribe has a default limit of 25 concurrent transcription streams, which limits LMA to 25 concurrent meetings in a given AWS account or Region. Request an increase for the number of concurrent HTTP/2 streams for streaming transcription if you have many users and need to handle a larger number of concurrent meetings in your account.
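
If you prefer to script this, the following sketch uses the Service Quotas API to locate the Transcribe streaming-related quotas so you can request an increase. The name matching is a heuristic, and the QuotaCode in the commented example is a placeholder.

```python
# Hedged sketch: list Amazon Transcribe quotas whose names mention streaming so you
# can pick the right QuotaCode to request an increase for.
import boto3

quotas = boto3.client("service-quotas")

def find_transcribe_streaming_quotas():
    response = quotas.list_service_quotas(ServiceCode="transcribe", MaxResults=100)
    for quota in response["Quotas"]:
        if "stream" in quota["QuotaName"].lower():
            print(quota["QuotaCode"], quota["QuotaName"], "current value:", quota["Value"])

# Then request the increase with the QuotaCode shown above (L-XXXXXXXX is a placeholder):
# quotas.request_service_quota_increase(
#     ServiceCode="transcribe", QuotaCode="L-XXXXXXXX", DesiredValue=50.0)
```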

LMA provides runtime monitoring and logs for each component using CloudWatch (a scripted alternative for searching these logs follows the list):

  • WebSocket processing and transcribing Fargate task – On the Amazon Elastic Container Service (Amazon ECS) console, navigate to the Clusters page and open the LMA-WEBSOCKETSTACK-xxxx-TranscribingCluster function. Choose the Tasks tab and open the task page. Choose Logs and View in CloudWatch to inspect the WebSocket transcriber task logs.
  • Call Event Processor Lambda function – On the Lambda console, open the LMA-AISTACK-CallEventProcessor function. Choose the Monitor tab to see function metrics. Choose View logs in CloudWatch to inspect function logs.
  • AWS AppSync API – On the AWS AppSync console, open the CallAnalytics-LMA API. Choose Monitoring in the navigation pane to see API metrics. Choose View logs in CloudWatch to inspect AWS AppSync API logs.
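
If you prefer to search these logs from code rather than the console, the following sketch uses the CloudWatch Logs API to scan recent ERROR events. The log group prefix shown is an assumption based on the function names above, so adjust it to match the log groups in your account.

```python
# Hedged sketch: search recent ERROR events in LMA log groups from code.
# The log group prefix is an assumption -- check the actual names in your account.
import time
import boto3

logs = boto3.client("logs")

def recent_errors(log_group_prefix="/aws/lambda/LMA-AISTACK-CallEventProcessor", minutes=60):
    groups = logs.describe_log_groups(logGroupNamePrefix=log_group_prefix)["logGroups"]
    start = int((time.time() - minutes * 60) * 1000)  # CloudWatch Logs expects epoch milliseconds
    for group in groups:
        events = logs.filter_log_events(
            logGroupName=group["logGroupName"],
            startTime=start,
            filterPattern="ERROR",
        )["events"]
        for event in events:
            print(group["logGroupName"], event["message"].rstrip())
```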

For additional information about QnABot on AWS for meeting assist, refer to the Meeting Assist README and the QnABot solution implementation guide.

Cost assessment

LMA provides a WebSocket server using Fargate (2 vCPU) and VPC networking resources costing about $0.10/hour (approximately $72/month). For more details, see AWS Fargate Pricing.

Meeting assist features are enabled using QnABot and Knowledge Bases for Amazon Bedrock. You create your own knowledge base, which you can use for LMA and potentially other use cases. For more details, see Amazon Bedrock Pricing. Additional AWS services used by the QnABot solution cost about $0.77/hour. For more details, refer to the list of QnABot on AWS solution costs.

The remaining solution costs are based on usage.

The usage costs add up to about $0.17 for a 5-minute call, although this can vary based on the options selected (such as translation), the number of LLM summarizations, and total usage, because usage affects Free Tier eligibility and volume-tiered pricing for many services. For more information about the services that incur usage costs, refer to the pricing pages for the individual AWS services used by the solution.
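
As an illustrative example only: with the stack deployed continuously for a 30-day month, the fixed costs above come to roughly $72 for Fargate and VPC resources plus about $562 for the QnABot services ($0.77/hour × 730 hours); 200 five-minute meetings would then add approximately $34 in usage charges, for a total in the neighborhood of $670. Actual costs will vary with your Region, configuration, and usage.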

To explore LMA costs for yourself, use AWS Cost Explorer or choose Bill Details on the AWS Billing Dashboard to see your month-to-date spend by service.

Customize your deployment

Use the following CloudFormation template parameters when creating or updating your stack to customize your LMA deployment (a scripted example of applying parameter changes follows the list):

  • To use your own S3 bucket for meeting recordings, use Call Audio Recordings Bucket Name and Audio File Prefix.
  • To redact PII from the transcriptions, set Enable Content Redaction for Transcripts to true, and adjust Transcription PII Redaction Entity Types as needed. For more information, see Redacting or identifying PII in a real-time stream.
  • To improve transcription accuracy for technical and domain-specific acronyms and jargon, set Transcription Custom Vocabulary Name to the name of a custom vocabulary that you already created in Amazon Transcribe or set Transcription Custom Language Model Name to the name of a previously created custom language model. For more information, see Improving Transcription Accuracy.
  • To transcribe meetings in a supported language other than US English, choose the desired value for Language for Transcription.
  • To customize transcript processing, optionally set Lambda Hook Function ARN for Custom Transcript Segment Processing to the ARN of your own Lambda function. For more information, see Using a Lambda function to optionally provide custom logic for transcript processing.
  • To customize the meeting assist capabilities based on the QnABot on AWS solution, Amazon Lex, Amazon Bedrock, and Knowledge Bases for Amazon Bedrock integration, see the Meeting Assist README.
  • To customize transcript summarization by configuring LMA to call your own Lambda function, see Transcript Summarization LAMBDA option.
  • To customize transcript summarization by modifying the default prompts or adding new ones, see Transcript Summarization.
  • To change the retention period, set Record Expiration In Days to the desired value. All call data is permanently deleted from the LMA DynamoDB storage after this period. Changes to this setting apply only to new calls received after the update.
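
For example, the following sketch applies parameter changes to an existing stack with boto3, reusing the current template and the previous values for everything you don't override. The parameter key names are guesses based on the console labels above, so confirm the actual names in the deployed template before using it.

```python
# Hedged sketch: update selected parameters on an existing LMA stack with boto3.
# Parameter key names below are illustrative guesses, not the template's actual names.
import boto3

cfn = boto3.client("cloudformation")

def update_lma_parameters(stack_name, changed,
                          capabilities=("CAPABILITY_NAMED_IAM", "CAPABILITY_AUTO_EXPAND")):
    """Re-deploy the existing template, overriding only the parameters in `changed`."""
    stack = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]
    parameters = []
    for param in stack["Parameters"]:
        key = param["ParameterKey"]
        if key in changed:
            parameters.append({"ParameterKey": key, "ParameterValue": str(changed[key])})
        else:
            parameters.append({"ParameterKey": key, "UsePreviousValue": True})
    cfn.update_stack(
        StackName=stack_name,
        UsePreviousTemplate=True,
        Parameters=parameters,
        Capabilities=list(capabilities),
    )

# Example (parameter key name is an assumption):
# update_lma_parameters("LMA", {"RecordExpirationInDays": "180"})
```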

LMA is an open source project. You can fork the LMA GitHub repository, enhance the code, and send us pull requests so we can incorporate and share your improvements!

Update an existing LMA stack

You can update your existing LMA stack to the latest release. For more details, see Update an existing stack.

Clean up

Congratulations! You have completed all the steps for setting up the Live Meeting Assistant (LMA) sample solution using AWS services.

When you’re finished experimenting with this sample solution, clean up your resources by using the AWS CloudFormation console to delete the LMA stacks that you deployed. This deletes resources that were created by deploying the solution. The recording S3 buckets, DynamoDB table, and CloudWatch log groups are retained after the stack is deleted to avoid deleting your data.

Live Call Analytics: Companion solution

Our companion solution, Live Call Analytics and Agent Assist (LCA), offers real-time transcription and analytics for contact centers (phone calls) rather than meetings. There are many similarities—in fact, LMA was built using an architecture and many components derived from LCA.

Conclusion

The Live Meeting Assistant sample solution offers a flexible, feature-rich, and customizable approach to provide live meeting assistance to improve your productivity during and after meetings. It uses Amazon AI/ML services like Amazon Transcribe, Amazon Lex, Knowledge Bases for Amazon Bedrock, and Amazon Bedrock LLMs to transcribe and extract real-time insights from your meeting audio.

The sample LMA application is provided as open source—use it as a starting point for your own solution, and help us make it better by contributing back fixes and features via GitHub pull requests. Browse to the LMA GitHub repository to explore the code, choose Watch to be notified of new releases, and check the README for the latest documentation updates.

For expert assistance, AWS Professional Services and other AWS Partners are here to help.

We’d love to hear from you. Let us know what you think in the comments section, or use the issues forum in the LMA GitHub repository.


About the authors

Bob Strahan is a Principal Solutions Architect in the AWS Language AI Services team.

Chris Lott is a Principal Solutions Architect in the AWS AI Language Services team. He has 20 years of enterprise software development experience. Chris lives in Sacramento, California and enjoys gardening, aerospace, and traveling the world.

Babu Srinivasan is a Sr. Specialist SA – Language AI services in the World Wide Specialist organization at AWS, with over 24 years of experience in IT and the last 6 years focused on the AWS Cloud. He is passionate about AI/ML. Outside of work, he enjoys woodworking and entertains friends and family (sometimes strangers) with sleight of hand card magic.

Kishore Dhamodaran is a Senior Solutions Architect at AWS.

Gillian Armstrong is a Builder Solutions Architect. She is excited about how the Cloud is opening up opportunities for more people to use technology to solve problems, and especially excited about how cognitive technologies, like conversational AI, are allowing us to interact with computers in more human ways.
