How RTX AI PCs Unlock AI Agents That Solve Complex Problems Autonomously With Generative AI

Editor’s note: This post is part of the AI Decoded series, which demystifies AI by making the technology more accessible, and showcases new hardware, software, tools and accelerations for GeForce RTX PC and NVIDIA RTX workstation users.

Generative AI has transformed the way people bring ideas to life. Agentic AI takes this one step further — using sophisticated, autonomous reasoning and iterative planning to help solve complex, multi-step problems.

AnythingLLM is a customizable open-source desktop application that lets users seamlessly integrate large language model (LLM) capabilities into various applications locally on their PCs. It enables users to harness AI for tasks such as content generation, summarization and more, tailoring tools to meet specific needs.

Accelerated on NVIDIA RTX AI PCs, AnythingLLM has launched a new Community Hub where users can share prompts, slash commands and AI agent skills while experimenting with building and running AI agents locally.

Autonomously Solve Complex, Multi-Step Problems With Agentic AI

AI agents can take chatbot capabilities further. They typically understand the context of the tasks and can analyze challenges and develop strategies — and some can even fully execute assigned tasks.

For example, while a chatbot could answer a prompt asking for a restaurant recommendation, an AI agent could even surface the restaurant’s phone number for a reservation and add reminders to the user’s calendar.

Agents help achieve big-picture goals and don’t get bogged down at the task level. There are many agentic apps in development to tackle to-do lists, manage schedules, help organize tasks, automate email replies, recommend personalized workout plans or plan trips.

Once prompted, an AI agent can gather and process data from various sources, including databases. It can use an LLM for reasoning — for example, to understand the task — then generate solutions and specific functions. If integrated with external tools and software, an AI agent can next execute the task.
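The loop below is a minimal, illustrative Python sketch of that cycle, not the implementation of any particular product: plan_step stands in for a call to a local LLM that decides the next action, and tools is a placeholder dictionary of integrations (calendar, web search and so on) the agent is allowed to execute.

def run_agent(goal, plan_step, tools, max_steps=5):
    """Repeatedly ask the model for the next action until it reports it is done."""
    context = [f"Goal: {goal}"]
    for _ in range(max_steps):
        # The model reasons over the goal plus everything gathered so far.
        action, argument = plan_step("\n".join(context))
        if action == "finish":
            return argument
        # Execute the chosen tool and feed its result back into the next planning step.
        result = tools[action](argument)
        context.append(f"{action}({argument}) -> {result}")
    return "Stopped after reaching the step limit."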

Some sophisticated agents can even be improved through a feedback loop. When the data it generates is fed back into the system, the AI agent becomes smarter and faster.

A step-by-step look at the process behind agentic AI systems. AI agents process user input, retrieve information from databases and other sources, and refine tasks in real time to deliver actionable results.

Accelerated on NVIDIA RTX AI PCs, these agents can run inference and execute tasks faster than on any other PC. Users can operate the agent locally to help ensure data privacy, even without an internet connection.

AnythingLLM: A Community Effort, Accelerated by RTX

The AI community is already diving into the possibilities of agentic AI, experimenting with ways to create smarter, more capable systems.

Applications like AnythingLLM let developers easily build, customize and unlock agentic AI with their favorite models — like Llama and Mistral — as well as with other tools, such as Ollama and LM Studio. AnythingLLM is accelerated on RTX-powered AI PCs and workstations with high-performance Tensor Cores, dedicated hardware that provides the compute performance needed to run the latest and most demanding AI models.

AnythingLLM is designed to make working with AI seamless, productive and accessible to everyone. It allows users to chat with their documents using intuitive interfaces, use AI agents to handle complex and custom tasks, and run cutting-edge LLMs locally on RTX-powered PCs and workstations. This means unlocked access to local resources, tools and applications that typically can’t be integrated with cloud- or browser-based applications, or those that require extensive setup and knowledge to build. By tapping into the power of NVIDIA RTX GPUs, AnythingLLM delivers faster, smarter and more responsive AI for a variety of workflows — all within a single desktop application.

AnythingLLM’s Community Hub lets AI enthusiasts easily access system prompts that can help steer LLM behavior, discover productivity-boosting slash commands, build specialized AI agent skills for unique workflows and custom tools, and access on-device resources.

Example of a user invoking the agent to complete a Web Search query.

Some example agent skills available in the Community Hub include Microsoft Outlook email assistants, calendar agents, web searches and home assistant controllers, as well as agents for populating and integrating custom application programming interface (API) endpoints and services for specific use cases.

By enabling AI enthusiasts to download, customize and use agentic AI workflows on their own systems with full privacy, AnythingLLM is fueling innovation and making it easier to experiment with the latest technologies — whether building a spreadsheet assistant or tackling more advanced workflows.

Experience AnythingLLM now.

Powered by People, Driven by Innovation

AnythingLLM showcases how AI can go beyond answering questions to actively enhancing productivity and creativity. Such applications illustrate AI’s move toward becoming an essential collaborator across workflows.

Agentic AI’s potential applications are vast and require creativity, expertise and computing capabilities. NVIDIA RTX AI PCs deliver peak performance for running agents locally, whether accomplishing simple tasks like generating and distributing content, or managing more complex use cases such as orchestrating enterprise software.

Learn more and get started with agentic AI.

Generative AI is transforming gaming, videoconferencing and interactive experiences of all kinds. Make sense of what’s new and what’s next by subscribing to the AI Decoded newsletter.

Advances in run-time strategies for next-generation foundation models

A visual illustration of Medprompt performance on the MedQA benchmark, showing how its components contribute additively to accuracy: zero-shot (81.7%), random few-shot (83.9%), random few-shot with chain-of-thought (87.3%), kNN few-shot with chain-of-thought (88.4%), and ensemble with choice shuffle (90.2%).

Groundbreaking advancements in frontier language models are progressing rapidly, paving the way for boosts in accuracy and reliability of generalist models, making them highly effective in specialized domains. As part of our ongoing exploration of foundation model capabilities, we developed Medprompt last year—a novel approach to maximize model performance on specialized domains and tasks without fine-tuning. By leveraging multiphase prompting, Medprompt optimizes inference by identifying the most effective chain-of-thought (CoT) examples at run time and drawing on multiple calls to refine output. When deployed with GPT-4, Medprompt achieved an impressive 90.2% accuracy on the MedQA benchmark (USMLE-style), outperforming all other methods.
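As a rough illustration of the mechanics described above, the following Python sketch combines the two main Medprompt ingredients: kNN selection of few-shot chain-of-thought exemplars and a choice-shuffle ensemble with majority voting. It is not the released Medprompt code; embed and call_model are placeholders for an embedding model and an LLM API, and the exemplar format is assumed.

import random
from collections import Counter

def knn_exemplars(question, exemplars, embed, k=5):
    # Pick the k training exemplars closest to the question in embedding space (dot product).
    q = embed(question)
    scored = sorted(exemplars, key=lambda ex: -sum(a * b for a, b in zip(q, embed(ex["question"]))))
    return scored[:k]

def medprompt_answer(question, choices, exemplars, embed, call_model, n_runs=5):
    votes = Counter()
    shots = knn_exemplars(question, exemplars, embed)
    few_shot = "\n\n".join(
        f"Q: {ex['question']}\nReasoning: {ex['cot']}\nAnswer: {ex['answer']}" for ex in shots
    )
    for _ in range(n_runs):
        shuffled = random.sample(choices, len(choices))  # choice shuffling per run
        prompt = (
            f"{few_shot}\n\nQ: {question}\nOptions: {', '.join(shuffled)}\n"
            "Think step by step, then state the final option text."
        )
        votes[call_model(prompt)] += 1  # one reasoning path per call
    return votes.most_common(1)[0][0]  # majority vote across runs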

Figure 1. Comparative analyses of the performance of multiple models on MedQA over time. The OpenAI o1-preview model achieves the highest result at 96.0% accuracy, followed by Med-Gemini at 91.1%, GPT-4 (Medprompt) at 90.2%, Med-PaLM 2 at 86.5%, GPT-4 base at 86.1%, Med-PaLM at 67.2%, GPT-3.5 base at 60.2%, BioMedLM at 50.3%, DRAGON at 47.5%, BioLinkBERT at 45.1%, and PubMedBERT at 38.1%.

Less than a year later, our tests show the OpenAI o1-preview demonstrated superior performance over Medprompt, reaching 96% on the same benchmark (Figure 1)—without using sophisticated prompt guidance and control. This advancement is driven by the model’s integration of run-time strategies at its core, enabling state-of-the-art results on medical licensing exams in the United States and Japan, medical subsets of the Massive Multitask Language Understanding (MMLU) benchmark, and nursing exams (NCLEX) as shown in Figure 2. 

Figure 2. Comparisons on a wide range of medical challenge benchmarks. o1-preview (zero-shot ensemble) achieves state-of-the-art results on MedQA US (4-option), JMLE-2024, MedMCQA Dev, NCLEX, and the MMLU Anatomy, Medical Genetics, Professional Medicine, College Biology, and College Medicine subsets; GPT-4 (Medprompt) performs better on MMLU Clinical Knowledge.

These results are notable, prompting us to publish our recent study, findings, and analyses, From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond. But the numbers are only part of the story. In this blog, we discuss prompting strategies to make the most of o1-preview models and other factors to consider, as well as directions forward for run-time strategies.

Is o1-preview “just” fancy prompting? 

The introduction of the OpenAI o1 model series marks a significant shift from prior GPT models. Unlike GPT, o1 models are trained using reinforcement learning (RL) techniques that enable them to “think” before generating outputs. While Medprompt relies on a cascade of operations with GPT-4 at run time guided by a multistage prompt, the o1 series incorporates this run-time reasoning directly into its RL-based design. The built-in functionality enables the o1 models to significantly outperform even the best results using GPT-4 and Medprompt. The performance gains come with a notable tradeoff: its per-token cost was approximately six times that of GPT-4o at the time of our evaluation. While the results for GPT-4o with Medprompt fall short of o1-preview model performance, the combination offers a more cost-effective alternative. The cost-benefit tradeoffs are highlighted in the following figure, with the x-axis presented on a logarithmic scale.

Figure 3. Pareto frontier showing accuracy versus total API cost (log scale) on the MedQA benchmark (1,273 questions total). o1-preview (Sep 2024) is compared with GPT-4o (Aug 2024) and GPT-4 Turbo (Nov 2023); the o1-preview ensembles sit near a total cost of 1,000, its single-prompt variants near 100, GPT-4o with Medprompt just below 100, and GPT-4o zero-shot near 1.

Can we prompt engineer o1-preview?

The o1-preview model exhibits distinct run-time behaviors compared to the GPT series. While some of our more dynamic prompting strategies performed better than expected with o1-preview models, our most tried-and-true strategy was anything but consistent throughout our evaluation. Figure 4 captures specific performance results for Tailored Prompt, Ensembling, and Few-Shot Prompting on o1-preview. Here’s a summary of our findings: 

  1. Tailored Prompt: While minimal prompting—like a brief, one-sentence description followed by a question—offered a strong baseline performance, detailed task descriptions were best for eliciting accurate responses.
  2. Ensembling: Generating multiple answers per question and using majority voting across different reasoning paths boosted reliability, while shuffling answers in runs produced richer reasoning chains and improved outcomes. Ensembling continues to yield consistent performance improvements.
  3. Few-Shot Prompting: Guiding the model with a few examples produced inconsistent results and, on average, decreased performance compared with GPT models.
Figure 4. Tests of different prompting strategies across benchmark datasets. Against an average baseline accuracy of 94.2% across the medical benchmarks, the tailored prompt improves accuracy to 94.7%, the 15x ensemble improves it to 95.5%, and 5-shot kNN decreases it to 93.7%.

Do the results hold in another language? 

Figure 5. JMLE-2024: Accuracy on short versus long questions from the national medical licensing exam held in Japan (February 2024), comparing o1-preview (zero-shot and zero-shot ensemble) with GPT-4o (zero-shot and Medprompt).

We expanded our research to include a new multilingual benchmark based on the Japanese national medical licensing exam. The JMLE (Japanese Medical Licensing Examination) is written in Japanese and was administered in February 2024, after the o1-preview model’s knowledge cutoff. Even without translation to English, the o1-preview model achieved a remarkable score of 98.2% accuracy (Figure 5), well above the exam’s minimum passing score of approximately 80%.

Do reasoning tokens improve performance? 

For fun, we conducted tests to determine whether increasing the number of reasoning tokens could improve performance. Our findings showed that by adjusting the prompt, we were able to consistently increase the number of reasoning tokens used by o1-preview, and the increase was directly correlated with improved performance as demonstrated in Figure 6.
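The exact prompts used in the study are documented in the paper; the pair below is only a hedged illustration of the idea, showing how a prompt prefix can either discourage or encourage longer reasoning before the final answer.

QUICK_RESPONSE_PREFIX = (
    "Answer the following question directly with the final option only. "
    "Keep any deliberation as brief as possible.\n\n"
)

EXTENDED_REASONING_PREFIX = (
    "Work through the following question very carefully. Consider each option in turn, "
    "check your reasoning for errors, and only then state the final option.\n\n"
)

def build_prompt(question, extended=False):
    # Switching the prefix is the only change between the two conditions.
    prefix = EXTENDED_REASONING_PREFIX if extended else QUICK_RESPONSE_PREFIX
    return prefix + question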

Figure 6. The effect of two prompting strategies that elicit variable-length reasoning chains across benchmark datasets. Accuracy with the Quick Response Prompt versus the Extended Reasoning Prompt: JMLE 95.3% vs. 96.7%, MMLU 94.9% vs. 94.7%, MedQA 94.3% vs. 95.1%, USMLE Sample Exam 92.6% vs. 93.1%, and USMLE Self Assessment 91.8% vs. 92.2%.

What’s the takeaway? 

Bottom line: There’s a little something for everyone when it comes to run-time strategies. We’re excited by the performance gains from GPT models to o1-preview models. While these improvements are significant, so is the cost. For those needing proven accuracy on a budget, Medprompt leveraging calls to GPT-4 is a viable option for medicine and beyond. We summarize the relative performance of prompting strategies in Figure 7 to help determine the best option; for a detailed breakdown of every dataset, experimental configuration, and prompt template, check out the paper.

Strategy | JMLE | MMLU | MedMCQA | MedQA | USMLE Sample Exam | USMLE Self Assessment
Baseline (zero-shot) | 95.6% | 94.6% | 81.4% | 94.9% | 94.0% | 91.8%
5-shot Random | +1.2% | -1.1% | 0.0% | -1.4% | -0.4% | -1.0%
5-shot KNN | +0.6% | -0.1% | +1.2% | -2.2% | -0.3% | -0.6%
Bootstrap Ensemble (5x) | +1.5% | +0.1% | +1.3% | +0.7% | +1.3% | +1.0%
Bootstrap Ensemble (10x) | +1.4% | +0.6% | +1.5% | +0.7% | +1.3% | +1.1%
Ensemble (15x) | +1.5% | +0.6% | +2.0% | +1.1% | +2.0% | +1.3%
Tailored Prompt | +1.6% | +0.4% | +0.9% | +0.2% | +0.0% | +0.4%
Tailored Bootstrap Ensemble (5x) | +2.2% | +0.7% | +1.8% | +0.8% | +0.9% | +1.1%
Tailored Bootstrap Ensemble (10x) | +2.3% | +0.7% | +2.1% | +0.9% | +0.9% | +1.2%
Tailored Ensemble (15x) | +2.5% | +0.4% | +2.6% | +1.1% | +0.9% | +1.4%
Figure 7. Absolute accuracy of the baseline zero-shot prompt and, for each prompting strategy, the difference in accuracy relative to that baseline across all benchmark datasets.

Anything more to consider?

We highlighted several considerations in the paper that are worth checking out. Here are three opportunities that are top of mind:

  • Research on run-time strategies. The research community has largely relied on boosting model capabilities with data, compute, and model size, predictably achieving gains by way of scaling laws. A promising new direction is inference-time scaling—the value of investing in additional computation and machinery for guiding inference at run time. We highlight in the paper opportunities to guide run-time allocations to boost efficiency, accuracy, and intellectual capabilities, including meta reasoning and reflection in real time and learning during the “idle” time between problem solving. We see a great deal of opportunity for new research and development on real-time and “offline” reasoning, learning, and reflection.
  • Benchmark saturation. With the rapid advancement of state-of-the-art models, many existing medical benchmarks are reaching “saturation,” where models perform extremely well on long-standing medical competency challenges that were considered extremely difficult just a few years ago. Current benchmarks, such as USMLE and JMLE, were designed to assess the performance of medical students and clinicians and are increasingly inadequate for evaluating cutting-edge AI models. To deepen our understanding of models and guide research, we need to design more challenging medical benchmarks.
  • From benchmarks to clinical applications. We note that, while benchmarks offer valuable insights into performance and accuracy, they often fail to capture the complexities and nuances of real-world clinical decision-making and healthcare delivery more broadly. Conducting clinical trials to rigorously evaluate the impact of AI applications on patient care poses far greater difficulties than benchmarking models against challenge problems drawn from medical competency exams. Yet, studies of AI deployments in realistic clinical settings are essential for understanding model capabilities and for guiding the effective integration of AI into healthcare.

Unleash your Salesforce data using the Amazon Q Salesforce Online connector

Thousands of companies worldwide use Salesforce to manage their sales, marketing, customer service, and other business operations. The Salesforce cloud-based platform centralizes customer information and interactions across the organization, providing sales reps, marketers, and support agents with a unified 360-degree view of each customer. With Salesforce at the heart of their business, companies accumulate vast amounts of customer data within the platform over time. This data is incredibly valuable for gaining insights into customers, improving operations, and guiding strategic decisions. However, accessing and analyzing the blend of structured data and unstructured data can be challenging. With the Amazon Q Salesforce Online connector, companies can unleash the value of their Salesforce data.

Amazon Q Business is a generative AI-powered assistant that can answer questions, provide summaries, generate content, and securely take actions based on data and information in your enterprise systems. It empowers employees to be more data-driven, efficient, prepared, and productive.

Amazon Q Business offers pre-built connectors for over 40 data sources, including Amazon Simple Storage Service (Amazon S3), Microsoft SharePoint, Salesforce, Google Drive, Atlassian Confluence, Atlassian Jira, and many more. For a full list of data source connectors, see Amazon Q Business connectors.

In this post, we walk you through configuring and setting up the Amazon Q Salesforce Online connector.

Overview of the Amazon Q Salesforce Online connector

Amazon Q Business supports its own index where you can add and sync documents. Amazon Q connectors make it straightforward to synchronize data from multiple content repositories with your Amazon Q index. You can set up connectors to automatically sync your index with your data source based on a schedule, so you’re always securely searching through up-to-date content.

The Amazon Q Salesforce Online connector provides a simple, seamless integration between Salesforce and Amazon Q. With a few clicks, you can securely connect your Salesforce instance to Amazon Q and unlock a robust self-service conversational AI assistant for your Salesforce data.

The following diagram illustrates this architecture.

Amazon Q Business architecture diagram

Types of documents

When you connect Amazon Q Business to a data source like Salesforce, what Amazon Q considers and crawls as a document varies by connector type.

The Amazon Q Salesforce Online connector crawls and indexes the following content types:

  • Account
  • Campaign
  • Case
  • Chatter
  • Contact
  • Contract
  • Custom object
  • Document
  • Group
  • Idea
  • Knowledge articles
  • Lead
  • Opportunity
  • Partner
  • Pricebook
  • Product
  • Profile
  • Solution
  • Task
  • User

The Amazon Q Salesforce Online connector also supports field mappings to enrich index data with additional field data. Field mappings allow you to map Salesforce field names to Amazon Q index field names. This includes both default field mappings created automatically by Amazon Q and custom field mappings that you can create and edit.

Authentication

The Amazon Q Salesforce Online connector supports OAuth 2.0 with the Resource Owner Password Flow.
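If you want to verify your connected app credentials outside of Amazon Q, the standard Salesforce token request for this flow looks roughly like the following Python sketch. The login URL, username, password, and security token are placeholders; Amazon Q performs this exchange for you once the secret is configured, so this is only a troubleshooting aid.

import requests

# Resource Owner Password Flow: exchange user credentials plus the connected app's
# consumer key and secret for an access token. All values below are placeholders.
resp = requests.post(
    "https://login.salesforce.com/services/oauth2/token",
    data={
        "grant_type": "password",
        "client_id": "<consumer-key>",
        "client_secret": "<consumer-secret>",
        "username": "user@example.com",
        # Salesforce expects the password concatenated with the user's security token.
        "password": "<password><security-token>",
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["access_token"])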

ACL crawling

To securely index documents, the Amazon Q Salesforce Online connector supports crawling access control lists (ACLs) with role hierarchy by default. With ACL crawling, the information can be used to filter chat responses to your end-user’s document access level. You can apply ACL-based chat filtering using Salesforce standard objects and chatter feeds. ACL-based chat filtering isn’t available for Salesforce knowledge articles.

If you index documents without ACLs, all documents are considered public. If you want to index documents without ACLs, make sure the documents are marked as public in your data source.

Solution overview

In this post, we guide you through connecting an existing Amazon Q application to Salesforce Online. You configure authentication, map fields, sync data between Salesforce and Amazon Q, and then deploy your AI assistant using the Amazon Q web experience.

We also demonstrate how to use Amazon Q to have a conversation about Salesforce accounts, opportunities, tasks, and other supported data types.

Prerequisites

You need the following prerequisites:

Set up Salesforce authentication

To set up authentication and allow external programs to access Salesforce, complete the following steps to configure your connected application settings:

  1. In Salesforce, in the Quick Find box, search for and choose App Manager.
  2. Choose New Connected App.
  3. For Connected App Name, enter a name.
  4. For API name, enter an API name to use when referring to the connected application.
  5. Enter your contact email address and phone number.
  6. If you are using OAuth, select the appropriate OAuth scope.
  7. Choose Save and wait for the connected application to be created.
  8. On the Connected Apps page, select the application, and on the drop-down menu, choose View.
  9. On the details page, next to Consumer Key and Secret, choose Manage Consumer Details.
  10. Copy the client ID and client secret for later use.

Set up the Amazon Q Salesforce Online connector

Complete the following steps to set up the Amazon Q Salesforce Online connector:

  1. On the Amazon Q Business console, choose Applications in the navigation pane.
  2. Select your application and on the Actions menu, choose Edit.
  3. On the Update application page, leave the settings as default and choose Update.
  4. On the Update retriever page, leave the settings as default and choose Update.
  5. On the Connect data sources page, on the All tab, search for Salesforce.
  6. Choose the plus sign for the Salesforce Online connector.
  7. In the Name and description section, enter a name and description.
  8. In the Source section, for Salesforce URL, enter your Salesforce server URL in the form https://yourcompany.my.salesforce.com/.
  9. In the Authentication section, choose Create and add new secret.
  10. Enter the Salesforce connected application authentication information and choose Save.
  11. In the IAM role section, choose Create a new service role (recommended).

create new service role

  12. In the Sync scope section, select All standard objects.

If you choose to sync only specific objects, then select each object type accordingly.

Define sync Scope

  13. In the Sync mode section, select New, modified, or deleted content sync.

sync mode

  14. Under Sync run schedule, choose the desired frequency. For testing purposes, we choose Run on demand.

sync run schedule

  15. Choose Add data source and wait for the connector to be created.
  16. After the Salesforce connector is created, you’re redirected back to the Connect data sources page, where you can add additional data sources if needed.
  17. Choose Next.
  18. On the Update groups and users page, assign users or groups from IAM Identity Center set up by your administrator. Optionally, if you have permissions to add new users, you can select Add new users.
  19. Choose Next.

assign new users

  20. Choose a user or group from the list to give them access to the Amazon Q web experience.
  21. Choose Done.
  22. Choose Update application to complete setting up the Salesforce data connector for Amazon Q Business.

Additional Salesforce field mappings

When you connect Amazon Q to a data source, Amazon Q automatically maps specific data source document attributes to fields within an Amazon Q index. If a document attribute in your data source doesn’t have an attribute mapping already available, or if you want to map additional document attributes to index fields, use the custom field mappings to specify how a data source attribute maps to an Amazon Q index field. You create field mappings by editing your data source after your application and retriever are created.

To update the field mapping, complete the following steps:

  1. On the Amazon Q console, navigate to your Amazon Q application.
  2. Under Data sources, select your data source and on the Actions menu, choose Edit.

Amazon Q data sources

  3. In the Field mappings section, find the item that you want to add fields to and choose Add field. (For this post, we add the postalCode field to Lead.)
  4. Add any other fields that you want to be included in the Amazon Q index and then choose Update.

Amazon Q connector fields mapping

The setup process is complete.

  5. In the application details, choose Sync now to start the Amazon Q crawling and indexing process.

The initial sync may take a few minutes to get started.

When the sync process is complete, you can see a summary of ingested data on the connector’s Sync history tab. Check Total items scanned and Added to confirm that the right number of documents are included in the index.
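You can also check the same information programmatically. The following sketch uses the boto3 qbusiness client to list recent sync jobs; the application, index, and data source IDs are placeholders, and the exact shape of the metrics in the response may vary, so inspect the returned entries rather than relying on specific field names.

import boto3

qbusiness = boto3.client("qbusiness")

# Placeholder identifiers: use the IDs of your own application, index, and data source.
response = qbusiness.list_data_source_sync_jobs(
    applicationId="<application-id>",
    indexId="<index-id>",
    dataSourceId="<data-source-id>",
)

for job in response.get("history", []):
    # Each entry describes one sync run, including its status and document counts.
    print(job.get("status"), job.get("metrics"))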

Amazon Q Data source details

Mapping custom fields

Salesforce allows you to store your unique business data by creating and using custom fields. When you need to fetch a custom field to generate answers, additional steps are needed for mapping and crawling the field. For example, knowledge articles in Salesforce use custom fields to store content of articles.

Make sure the initial sync process for the connector is complete. On the initial sync, the connector gets a list of all fields and objects in Salesforce, which is needed for custom fields mapping.

Complete the following steps to index contents of knowledge articles:

  1. Navigate to Salesforce Setup and search for and open Object Manager.
  2. In Object Manager, choose the Knowledge object.

Salesforce object manager

  3. In the Fields & Relationships section, find the field name (for this example, we’re looking for Article Body and the field name is Article_Body__c) and record this field name.

Salesforce Object manager fields and relationships

  4. On the Amazon Q Business console, navigate back to your application and choose Data sources in the navigation pane.
  5. Select the Salesforce data source and on the Actions menu, choose Edit.

Amazon Q Edit Data sources

  6. In the Field mappings section, under Knowledge Articles, choose Add field.
  7. For Salesforce field name, enter Article_Body__c and map it to _document_body for Index field name.
  8. Select your object type.
  9. Choose Update to save the changes.

  10. Return to the Data sources page of the application and choose Sync now.

When the sync process is complete, you can chat with the Salesforce data source about the default fields and also about the Salesforce custom field that you added.

Sync Data Sources - Sync Now

Talk with your Salesforce data using the Amazon Q web experience

When the synchronization process is complete, you can start using the Amazon Q web experience. To access the Amazon Q application UI, select your application and choose Customize web experience, which opens a preview of the UI and options to customize it.

Amazon Q applications list

You can customize the values for Title, Subtitle, and Welcome message in the UI. After you make changes, choose Save and then choose View web experience.

Amazon Q Business web UI

After signing in, you can start chatting with your generative AI assistant. To verify answers, check the citation links included in the answers. If you need to improve answers, add more details and context to the questions.

Amazon Q Business chat

The results aren’t limited to cases and activities. You can also include other objects like knowledge bases. If a field isn’t included in the default mapped fields, you can still add it in the retriever settings and update the content index.

Let’s look at opportunities in Salesforce for a specific company and ask Amazon Q about these opportunities.

Amazon Q Business - getting summary of opportunities

Next, let’s check a sample knowledge article from Salesforce.

Salesforce - example knowledgebase article

When you chat with Amazon Q, you can see that the exact article is referenced as the primary source.

Amazon Q chat about cost optimization

As you can see, each answer has a thumbs up/thumbs down button to provide feedback. Amazon Q uses this feedback to improve responses for all your organization users.

Metadata fields

In Salesforce, document metadata refers to the information that describes the properties and characteristics of documents stored in Salesforce. The Amazon Q data source connector crawls relevant metadata or attributes associated with a document. To use metadata search, go to the Amazon Q application page and choose Metadata controls in the navigation pane. Select the metadata fields that are needed, for instance sf_subject and sf_status. This allows you to ask metadata lookup queries such as “Summarize case titled as supply chain vendors cost optimization” or “Give me status of case with subject as cloud modernization project.” Here, the sf_status and sf_subject metadata fields will be used to query and generate the relevant answer.

Amazon Q metadata search

Frequently asked questions

In this section, we discuss some frequently asked questions.

Amazon Q Business is unable to answer your questions

If you get the response “Sorry, I could not find relevant information to complete your request,” this may be due to a few reasons:

  • No permissions – ACLs applied to your account don’t allow you to query certain data sources. If this is the case, reach out to your application administrator to make sure your ACLs are configured to access the data sources.
  • Data connector sync failed – Your data connector may have failed to sync information from the source to the Amazon Q Business application. Verify the data connector’s sync run schedule and sync history to confirm the sync is successful.
  • No subscriptions – Make sure that logged-in users have a subscription for Amazon Q.

If none of these reasons apply to your use case, open a support case and work with your technical account manager to get this resolved.

Custom fields aren’t showing up in fields mappings

A custom fields list is retrieved after the initial full synchronization. After a successful synchronization, you can add field mappings for custom fields.

Clean up

To prevent incurring additional costs, it’s essential to clean up and remove any resources created during the implementation of this solution. Specifically, you should delete the Amazon Q application, which will also remove the associated index and data connectors. However, any AWS Identity and Access Management (IAM) roles and secrets created during the Amazon Q application setup process need to be removed separately. Failing to clean up these resources may result in ongoing charges.

Complete the following steps to delete the Amazon Q application, secret, and IAM role:

  1. On the Amazon Q Business console, select the application that you created.
  2. On the Actions menu, choose Delete and confirm the deletion.
  3. On the Secrets Manager console, select the secret that was created for the connector.
  4. On the Actions menu, choose Delete.
  5. Set the waiting period as 7 days and choose Schedule deletion.

delete secret

  6. On the IAM console, select the role that was created during the Amazon Q application creation.
  7. Choose Delete and confirm the deletion.

Conclusion

In this post, we provided an overview of the Amazon Q Salesforce Online connector and how you can use it for a safe and seamless integration of generative AI assistance with Salesforce. By using a single interface for the variety of data sources in the organization, you can enable employees to be more data-driven, efficient, prepared, and productive.

To learn more about the Amazon Q Salesforce Online connector, refer to Connecting Salesforce Online to Amazon Q Business.


About the Author

author mehdy haghy Mehdy Haghy is a Senior Solutions Architect at the AWS WWCS team, specializing in AI and ML on AWS. He works with enterprise customers, helping them migrate, modernize, and optimize their workloads for the AWS Cloud. In his spare time, he enjoys cooking Persian food and tinkering with circuit boards.

Reducing hallucinations in large language models with custom intervention using Amazon Bedrock Agents

Hallucinations in large language models (LLMs) refer to the phenomenon where the LLM generates an output that is plausible but factually incorrect or made-up. This can occur when the model’s training data lacks the necessary information or when the model attempts to generate coherent responses by making logical inferences beyond its actual knowledge. Hallucinations arise because of the inherent limitations of the language modeling approach, which aims to produce fluent and contextually appropriate text without necessarily ensuring factual accuracy.

Remediating hallucinations is crucial for production applications that use LLMs, particularly in domains where incorrect information can have serious consequences, such as healthcare, finance, or legal applications. Unchecked hallucinations can undermine the reliability and trustworthiness of the system, leading to potential harm or legal liabilities. Strategies to mitigate hallucinations can include rigorous fact-checking mechanisms, integrating external knowledge sources using Retrieval Augmented Generation (RAG), applying confidence thresholds, and implementing human oversight or verification processes for critical outputs.

RAG is an approach that aims to reduce hallucinations in language models by incorporating the capability to retrieve external knowledge and making it part of the prompt that’s used as input to the model. The retriever module is responsible for retrieving relevant passages or documents from a large corpus of textual data based on the input query or context. The retrieved information is then provided to the LLM, which uses this external knowledge in conjunction with prompts to generate the final output. By grounding the generation process in factual information from reliable sources, RAG can reduce the likelihood of hallucinating incorrect or made-up content, thereby enhancing the factual accuracy and reliability of the generated responses.
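As a concrete illustration of this pattern, the following Python sketch retrieves passages from an Amazon Bedrock knowledge base and grounds the model’s answer in them. The knowledge base ID and model ID are placeholders, and the post’s own implementation uses Amazon Bedrock Agents rather than this hand-rolled loop.

import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")
bedrock_runtime = boto3.client("bedrock-runtime")

def answer_with_rag(question, knowledge_base_id="<kb-id>", model_id="anthropic.claude-3-sonnet-20240229-v1:0"):
    # 1. Retriever step: fetch relevant chunks from the knowledge base.
    retrieval = agent_runtime.retrieve(
        knowledgeBaseId=knowledge_base_id,
        retrievalQuery={"text": question},
    )
    passages = [r["content"]["text"] for r in retrieval["retrievalResults"]]

    # 2. Generation step: ground the answer in the retrieved passages.
    prompt = (
        "Answer the question using only the context below. "
        "If the context is not sufficient, say you do not know.\n\n"
        "Context:\n" + "\n---\n".join(passages) + f"\n\nQuestion: {question}"
    )
    response = bedrock_runtime.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]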

Amazon Bedrock Guardrails offer hallucination detection with contextual grounding checks, which can be seamlessly applied using Amazon Bedrock APIs (such as Converse or InvokeModel) or embedded into workflows. After an LLM generates a response, these workflows perform a check to see if hallucinations occurred. This setup can be achieved through Amazon Bedrock Prompt Flows or with custom logic using AWS Lambda functions. Customers can also do batch evaluation with human reviewers using Amazon Bedrock model evaluation’s human-based evaluation feature. However, these are static workflows: updating the hallucination detection logic requires modifying the entire workflow, limiting adaptability.

To address this need for flexibility, Amazon Bedrock Agents enables dynamic workflow orchestration. With Amazon Bedrock Agents, organizations can implement scalable, customizable hallucination detection that adjusts based on specific needs, reducing the effort needed to incorporate new detection techniques and additional API calls without restructuring the entire workflow, while letting the LLM decide the plan of action for orchestrating it.

In this post, we will set up our own custom agentic AI workflow using Amazon Bedrock Agents to intervene when LLM hallucinations are detected and route the user query to customer service agents through a human-in-the-loop process. Imagine this to be a simpler implementation of calling a customer service agent when the chatbot is unable to answer the customer query. The chatbot is based on a RAG approach, which reduces hallucinations to a large extent, and the agentic workflow provides a customizable mechanism in how to measure, detect, and mitigate hallucinations that might occur.

Agentic workflows offer a fresh perspective on building dynamic, complex business workflows with LLMs as the reasoning engine, or brain. These agentic workflows decompose natural language query-based tasks into multiple actionable steps with iterative feedback loops and self-reflection to produce the final result using tools and APIs.

Amazon Bedrock Agents helps accelerate generative AI application development by orchestrating multistep tasks. Amazon Bedrock Agents uses the reasoning capability of LLMs to break down user-requested tasks into multiple steps. They use the given instruction to create an orchestration plan and then carry out the plan by invoking company APIs or accessing knowledge bases using RAG to provide a final response to the user. This offers tremendous use case flexibility, enables dynamic workflows, and reduces development cost. Amazon Bedrock Agents is instrumental in customizing applications to help meet specific project requirements while protecting private data and helping to secure applications. These agents work with AWS managed infrastructure capabilities such as Lambda and Amazon Bedrock, reducing infrastructure management overhead. Additionally, agents streamline workflows and automate repetitive tasks. With the power of AI automation, you can boost productivity and reduce costs.
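For reference, invoking such an agent from code is a single streaming call. The sketch below uses the boto3 bedrock-agent-runtime client; the agent ID and alias ID are placeholders for values from your own Amazon Bedrock Agents setup.

import uuid
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

# Placeholder identifiers for your own agent and its alias.
response = agent_runtime.invoke_agent(
    agentId="<agent-id>",
    agentAliasId="<agent-alias-id>",
    sessionId=str(uuid.uuid4()),
    inputText="What models are supported by bedrock agents?",
)

# The response is an event stream; the completed text arrives in chunk events.
answer = ""
for event in response["completion"]:
    if "chunk" in event:
        answer += event["chunk"]["bytes"].decode("utf-8")
print(answer)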

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

Use case overview

In this post, we add our own custom intervention to a RAG-powered chatbot in the event of hallucinations being detected. We will be using Retrieval Augmented Generation Automatic Score (RAGAS) metrics such as answer correctness and answer relevancy to develop a custom hallucination score for measuring hallucinations. If the hallucination score for a particular LLM response is less than a custom threshold, it indicates that the generated model response is not well-aligned with the ground truth. In this situation, we notify a pool of human agents through an Amazon Simple Notification Service (Amazon SNS) notification to assist with the query instead of providing the customer with the hallucinated LLM response.
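A hedged sketch of that scoring step is shown below. It assumes the RAGAS library (whose API can differ between versions) and simply averages answer correctness and answer relevancy into one score; the notebook defines its own scoring inside the Lambda function, so treat this only as an outline of the idea.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness, answer_relevancy

def hallucination_score(question, answer, contexts, ground_truth):
    # One-row evaluation dataset in the format RAGAS expects.
    data = Dataset.from_dict({
        "question": [question],
        "answer": [answer],
        "contexts": [contexts],       # list of retrieved passages
        "ground_truth": [ground_truth],
    })
    result = evaluate(data, metrics=[answer_correctness, answer_relevancy])
    # Average the two metrics into a single score between 0 and 1;
    # a score below the chosen threshold indicates a likely hallucination.
    return (result["answer_correctness"] + result["answer_relevancy"]) / 2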

The RAG-based chatbot we use ingests the Amazon Bedrock User Guide to assist customers on queries related to Amazon Bedrock.

Dataset

The dataset used in the notebook is the latest Amazon Bedrock User guide PDF file, which is publicly available to download. Alternatively, you can use other PDFs of your choice to create the knowledge base from scratch and use it in this notebook.

If you use a custom PDF, you will need to curate a supervised dataset of ground truth answers to multiple questions to test this approach. The custom hallucination detector uses RAGAS metrics, which are generated using a CSV file containing question-answer pairs. For custom PDFs, it is necessary to replace this CSV file and re-run the notebook for a different dataset.

In addition to the dataset in the notebook, we ask the agent multiple questions, a few of them from the PDF and a few not part of the PDF. The ground truth answers are manually curated based on the PDF contents if relevant.

Four sample questions posed to the agent, with manually curated ground truth answers.

Prerequisites

To run this solution in your AWS account, complete the following prerequisites:

  1. Clone the GitHub repository and follow the steps explained in the README.
  2. Set up an Amazon SageMaker notebook on an ml.t3.medium Amazon Elastic Compute Cloud (Amazon EC2) instance.
  3. Acquire access to models hosted on Amazon Bedrock. Choose Manage model access in the navigation pane of the Amazon Bedrock console and choose from the list of available options. We use Anthropic’s Claude v3 (Sonnet) on Amazon Bedrock and Amazon Titan Embeddings Text v2 on Amazon Bedrock for this post.

Implement the solution

The following diagram illustrates the solution architecture:

Architecture diagram for custom hallucination detection and mitigation: the user’s question is fed to a search engine (with an optional LLM-based step to pre-process it into a good search query). The documents or snippets returned by the search engine, together with the user’s question, are inserted into a prompt template, and an LLM generates a final answer based on the retrieved documents. The final answer is evaluated against the reference answer from the dataset to produce a custom hallucination score. Based on a pre-defined empirical threshold, a customer service agent is asked to join the conversation through an SNS notification.

The overall workflow involves the following steps:

  1. Data ingestion: raw PDFs stored in an Amazon Simple Storage Service (Amazon S3) bucket are synced as a data source with Amazon Bedrock Knowledge Bases.
  2. The user asks questions relevant to the Amazon Bedrock User Guide, which are handled by an Amazon Bedrock agent set up for this purpose.

User query: What models are supported by bedrock agents?

  3. The agent creates a plan and identifies the need to use a knowledge base. It then sends a request to the knowledge base, which retrieves relevant data from the underlying vector database. The agent retrieves an answer through RAG using the following steps:
    • The search query is directed to the vector database (Amazon OpenSearch Serverless).
    • Relevant answer chunks are retrieved.
    • The knowledge base response is generated from the retrieved answer chunks and sent back to the agent.

Generated Answer: Amazon Bedrock supports foundation models from various providers including Anthropic (Claude models), AI21 Labs (Jamba models), Cohere (Command models), Meta (Llama models), Mistral AI

  4. The user query and knowledge base response are used together to invoke the correct action group.
  5. The user question and knowledge base response are passed as inputs to a Lambda function that calculates a hallucination score.

The generated answer has some correct and some incorrect information, because it picks up general Amazon Bedrock model support rather than Amazon Bedrock Agents-specific model support. Therefore, a hallucination is detected with a score of 0.4.

  6. An SNS notification is sent if the answer score is lower than the custom threshold.

Because the answer score of 0.4 is below the hallucination threshold of 0.9, the SNS notification is triggered.

  7. If the answer score is higher than the custom threshold, the hallucination detector set up in Lambda responds with the final knowledge base response. Otherwise, it returns a pre-defined response asking the user to wait until a customer service agent joins the conversation.

The customer service agent queue is notified, and the next available agent joins the conversation or emails back if it is an offline response mechanism.

  8. The final agent response is shown in the chatbot UI (user interface).

In the GitHub repository notebook, we cover the following learning objectives:

  1. Measure and detect hallucinations with an agentic AI workflow that can notify humans in the loop to remediate hallucinations when they are detected.
  2. Build a custom hallucination detector with pre-defined thresholds based on select RAGAS evaluation metrics.
  3. Remediate by sending an SNS notification to the customer service queue and waiting for a human to help with the question.

Step 1: Setting up Amazon Bedrock Knowledge Bases with Amazon Bedrock Agents

In this section, we will integrate Amazon Bedrock Knowledge Bases with Amazon Bedrock Agents to create a RAG workflow. RAG systems use external knowledge sources to augment the LLM’s output, improving factual accuracy and reducing hallucinations. We create the agent with the following high-level instruction encouraging it to take a question-answering role.

agent_instruction = """

You are a question answering agent that helps customers answer questions from the Amazon Bedrock User Guide inside the associated knowledge base.

Next you will always use the knowledge base search result to detect and measure any hallucination using the functions provided.

"""

Step 2: Invoke Amazon Bedrock Agents with user questions about Amazon Bedrock documentation

We are using a supervised dataset with predefined questions and ground truth answers to invoke Amazon Bedrock Agents which triggers the custom hallucination detector based on the agent response from the knowledge base. In the notebook, we demonstrate how the answer score based on RAGAS metrics can notify a human customer service representative if it does not meet a pre-defined custom threshold score.

We use RAGAS metrics such as answer correctness and answer relevancy to determine the custom threshold score. Depending on the use case and dataset, the list of applicable RAGAS metrics can be customized accordingly.

To change the threshold score, you can modify the measure_hallucination() method inside the Lambda function lambda_hallucination_detection().
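The following is a minimal, hypothetical outline of how such a Lambda handler can route low-scoring answers to humans. The topic ARN environment variable is a placeholder, measure_hallucination() is assumed to be the scoring routine (for example, RAGAS-based as sketched earlier), and the notebook’s actual Lambda differs in its details.

import os
import json
import boto3

sns = boto3.client("sns")
TOPIC_ARN = os.environ["CUSTOMER_SERVICE_TOPIC_ARN"]  # placeholder environment variable
THRESHOLD = 0.9

def measure_hallucination(question, answer):
    """Placeholder: plug in the RAGAS-based scoring shown earlier."""
    raise NotImplementedError

def lambda_handler(event, context):
    question = event["question"]
    kb_answer = event["knowledge_base_response"]
    score = measure_hallucination(question, kb_answer)

    if score >= THRESHOLD:
        return {"answer": kb_answer, "hallucination_score": score}

    # Below the threshold: notify the customer service queue and hold the response.
    sns.publish(
        TopicArn=TOPIC_ARN,
        Subject="Possible hallucination detected",
        Message=json.dumps({"question": question, "score": score}),
    )
    return {
        "answer": "Please wait while a customer service agent joins the conversation.",
        "hallucination_score": score,
    }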

The agent is prompted with the following template. The user_question in the template is iterated from the supervised dataset CSV file that contains the question and ground truth answers.

USER_PROMPT_TEMPLATE = """Question: {user_question}

Given an input question, you will search the Knowledge Base on Amazon Bedrock User Guide to answer the user question. 
If the knowledge base search results do not return any answer, you can try answering it to the best of your ability, but do not answer anything you do not know. Do not hallucinate.
Using this knowledge base search result you will ALWAYS execute the appropriate action group API to measure and detect the hallucination on that knowledge base search result.

Remove any XML tags from the knowledge base search results and final user response.


Some samples for `user_question` parameter:

What models are supported by bedrock agents?
Which models can I use with Amazon Bedrock Agents?
Which are the dates for reinvent 2024?
What is Amazon Bedrock?

"""

Step 3: Trigger human-in-the-loop in case of hallucination

If the custom hallucination score threshold is not met by the agent response, a human in the loop is notified using SNS notifications. These notifications can be sent to the customer service representative queue or to Amazon Simple Queue Service (Amazon SQS) queues for email and text notifications. These representatives can respond to the email (offline) or the ongoing chat (online) based on their training, knowledge of the system, and additional resources. This would depend on the specific product workflow design.

To view the actual SNS messages sent out, we can view the latest Lambda AWS CloudWatch logs following the instructions as given in viewing CloudWatch logs for Lambda functions. You can search for the string Received SNS message :: inside the CloudWatch logs for the Lambda function LambdaAgentsHallucinationDetection().

Cost considerations

The following are important cost considerations:

  • This current implementation has no separate charges for building resources using Amazon Bedrock Knowledge Bases or Amazon Bedrock Agents.
  • You will incur charges for the embedding model and text model invocation on Amazon Bedrock. For more details, see Amazon Bedrock pricing.
  • You will incur charges for Amazon S3 and vector database usage. For more details, see Amazon S3 pricing and Amazon OpenSearch Service pricing, respectively.

Clean up

To avoid incurring unnecessary costs, the implementation has the option to clean up resources after an entire run of the notebook. You can check the instructions in the cleanup_infrastructure() method for how to avoid the automatic cleanup and experiment with different prompts and datasets.

The order of resource cleanup is as follows:

  1. Disable the action group.
  2. Delete the action group.
  3. Delete the alias.
  4. Delete the agent.
  5. Delete the Lambda function.
  6. Empty the S3 bucket.
  7. Delete the S3 bucket.
  8. Delete AWS Identity and Access Management (IAM) roles and policies.
  9. Delete the vector DB collection policies.
  10. Delete the knowledge bases.

Key considerations

Amazon Bedrock Agents can increase overall latency compared to using just Amazon Bedrock Guardrails and Amazon Bedrock Prompt Flows. It is a trade-off between LLM-generated workflows and static or deterministic workflows. With agents, the LLM generates the workflow orchestration in real time using the available knowledge bases, tools, and APIs, whereas with prompt flows and guardrails, the workflow has to be orchestrated and designed offline.

For evaluation, we chose the LLM-based evaluation framework RAGAS, but it is possible to swap out the elements in the hallucination detection Lambda function for another framework.

Conclusion

This post demonstrated how to use Amazon Bedrock Agents, Amazon Bedrock Knowledge Bases, and RAGAS evaluation metrics to build a custom hallucination detector and remediate detected hallucinations using a human-in-the-loop process. The agentic workflow can be extended to custom use cases through different hallucination remediation techniques and offers the flexibility to detect and mitigate hallucinations using custom actions.

For more information on creating agents to orchestrate workflows, see Amazon Bedrock Agents. To learn about multiple RAGAS metrics for LLM evaluations, see RAGAS: Getting Started.


About the Authors

Shayan Ray is an Applied Scientist at Amazon Web Services. His area of research is all things natural language (like NLP, NLU, and NLG). His work has been focused on conversational AI, task-oriented dialogue systems, and LLM-based agents. His research publications are on natural language processing, personalization, and reinforcement learning.

Bharathi Srinivasan is a Generative AI Data Scientist at AWS WWSO where she works building solutions for Responsible AI challenges. She is passionate about driving business value from machine learning applications by addressing broad concerns of Responsible AI. Outside of building new AI experiences for customers, Bharathi loves to write science fiction and challenge herself with endurance sports.

Deploy Meta Llama 3.1-8B on AWS Inferentia using Amazon EKS and vLLM

With the rise of large language models (LLMs) like Meta Llama 3.1, there is an increasing need for scalable, reliable, and cost-effective solutions to deploy and serve these models. AWS Trainium and AWS Inferentia based instances, combined with Amazon Elastic Kubernetes Service (Amazon EKS), provide a performant and low cost framework to run LLMs efficiently in a containerized environment.

In this post, we walk through the steps to deploy the Meta Llama 3.1-8B model on Inferentia 2 instances using Amazon EKS.

Solution overview

The steps to implement the solution are as follows:

  1. Create the EKS cluster.
  2. Set up the Inferentia 2 node group.
  3. Install the Neuron device plugin and scheduling extension.
  4. Prepare the Docker image.
  5. Deploy the Meta Llama 3.1-8B model.

We also demonstrate how to test the solution and monitor performance, and discuss options for scaling and multi-tenancy.

Prerequisites

Before you begin, make sure you have the following utilities installed on your local machine or development environment. If you don’t have them installed, follow the instructions provided for each tool.

In this post, the examples use an inf2.48xlarge instance; make sure you have a sufficient service quota to use this instance. For more information on how to view and increase your quotas, refer to Amazon EC2 service quotas.
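If you prefer to check the quota from code rather than the console, the hedged Python sketch below pages through the EC2 service quotas and prints any whose names mention Inf instances. The exact quota name can vary, so match loosely and confirm the value on the Service Quotas console.

import boto3

quotas = boto3.client("service-quotas", region_name="us-east-1")

# Page through EC2 quotas and print those related to Inf (Inferentia) instances.
paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="ec2"):
    for quota in page["Quotas"]:
        if "Inf" in quota["QuotaName"]:
            print(quota["QuotaName"], quota["Value"])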

Create the EKS cluster

If you don’t have an existing EKS cluster, you can create one using eksctl. Adjust the following configuration to suit your needs, such as the Amazon EKS version, cluster name, and AWS Region. Before running the following commands, make sure you are authenticated with AWS:

export AWS_REGION=us-east-1
export CLUSTER_NAME=my-cluster
export EKS_VERSION=1.30
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

Then complete the following steps:

  1. Create a new file named eks_cluster.yaml with the following command:
cat > eks_cluster.yaml <<EOF
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: $CLUSTER_NAME
  region: $AWS_REGION
  version: "$EKS_VERSION"

addons:
- name: vpc-cni
  version: latest

cloudWatch:
  clusterLogging:
    enableTypes: ["*"]
    
iam:
  withOIDC: true
EOF

This configuration file contains the following parameters:

  • metadata.name – Specifies the name of your EKS cluster, which is set to my-cluster in this example. You can change it to a name of your choice.
  • metadata.region – Specifies the Region where you want to create the cluster. In this example, it’s set to us-east-1. Change this to your desired Region. Because we’re using Inf2 instances, choose a Region where those instances are available.
  • metadata.version – Specifies the Kubernetes version to use for the cluster. In this example, it’s set to 1.30. You can change this to a different version if needed, but make sure to use a version that is supported by Amazon EKS. For a list of supported versions, see Review release notes for Kubernetes versions on standard support.
  • addons.vpc-cni – Specifies the version of the Amazon VPC CNI (Container Network Interface) add-on to use. Setting it to latest will install the latest available version.
  • cloudWatch.clusterLogging – Enables cluster logging, which sends logs from the control plane to Amazon CloudWatch Logs.
  • iam.withOIDC – Enables the OpenID Connect (OIDC) provider for the cluster, which is required for certain AWS services to interact with the cluster.
  2. After you create the eks_cluster.yaml file, you can create the EKS cluster by running the following command:
eksctl create cluster --config-file eks_cluster.yaml

This command will create the EKS cluster based on the configuration specified in the eks_cluster.yaml file. The process will take approximately 15–20 minutes to complete.

During the cluster creation process, eksctl will also create a default node group with a recommended instance type and configuration. However, in the next section, we create a separate node group with Inf2 instances, specifically for running the Meta Llama 3.1-8B model.

  3. To complete the setup of kubectl, run the following code:
aws eks update-kubeconfig --region $AWS_REGION --name $CLUSTER_NAME

Set up the Inferentia 2 node group

To run the Meta Llama 3.1-8B model, you’ll need to create an Inferentia 2 node group. Complete the following steps:

  1. First, retrieve the latest Amazon EKS optimized accelerated AMI ID:
export ACCELERATED_AMI=$(aws ssm get-parameter \
--name /aws/service/eks/optimized-ami/$EKS_VERSION/amazon-linux-2-gpu/recommended/image_id \
--region $AWS_REGION \
--query "Parameter.Value" \
--output text)
  2. Create the Inferentia 2 node group using eksctl:
cat > eks_nodegroup.yaml <<EOF
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: $CLUSTER_NAME
  region: $AWS_REGION
  version: "$EKS_VERSION"
    
managedNodeGroups:
  - name: neuron-group
    instanceType: inf2.48xlarge
    desiredCapacity: 1
    volumeSize: 512
    ami: "$ACCELERATED_AMI"
    amiFamily: AmazonLinux2
    iam:
      attachPolicyARNs:
      - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
      - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
      - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
      - arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess

    overrideBootstrapCommand: |
      #!/bin/bash

      /etc/eks/bootstrap.sh $CLUSTER_NAME
EOF

  3. Run eksctl create nodegroup --config-file eks_nodegroup.yaml to create the node group.

This will take approximately 5 minutes.

Install the Neuron device plugin and scheduling extension

To set up your EKS cluster for running workloads on Inferentia chips, you need to install two key components: the Neuron device plugin and the Neuron scheduling extension.

The Neuron device plugin is essential for exposing Neuron cores and devices as resources in Kubernetes. The Neuron scheduling extension facilitates the optimal scheduling of pods requiring multiple Neuron cores or devices.

For detailed instructions on installing and verifying these components, refer to Kubernetes environment setup for Neuron. Following these instructions will help you make sure your EKS cluster is properly configured to schedule and run workloads that require Neuron devices, such as the Meta Llama 3.1-8B model.
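If you want to confirm that the device plugin is exposing Neuron resources before deploying the model, a quick check with the Kubernetes Python client might look like the following sketch. The aws.amazon.com/neuroncore resource name is the one used later in this post; exact resource names can vary with the plugin version.

from kubernetes import client, config

config.load_kube_config()  # uses the kubeconfig created by eksctl
v1 = client.CoreV1Api()
for node in v1.list_node().items:
    alloc = node.status.allocatable or {}
    print(node.metadata.name,
          "neuroncores:", alloc.get("aws.amazon.com/neuroncore", "0"),
          "neurondevices:", alloc.get("aws.amazon.com/neurondevice", "0"))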

Prepare the Docker image

To run the model, you’ll need to prepare a Docker image with the required dependencies. We use the following code to create an Amazon Elastic Container Registry (Amazon ECR) repository and then build a custom Docker image based on the AWS Deep Learning Container (DLC).

  1. Set up environment variables:
export ECR_REPO_NAME=vllm-neuron
  2. Create an ECR repository:
aws ecr create-repository --repository-name $ECR_REPO_NAME --region $AWS_REGION

Although the base Docker image already includes TorchServe, to keep things simple, this implementation uses the server provided by the vLLM repository, which is based on FastAPI. In your production scenario, you can connect TorchServe to vLLM with your own custom handler.

  3. Create the Dockerfile:
cat > Dockerfile <<EOF
FROM public.ecr.aws/neuron/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.20.0-ubuntu20.04
# Clone the vllm repository
RUN git clone https://github.com/vllm-project/vllm.git
# Set the working directory
WORKDIR /vllm
RUN git checkout v0.6.0
# Set the environment variable
ENV VLLM_TARGET_DEVICE=neuron
# Install the dependencies
RUN python3 -m pip install -U -r requirements-neuron.txt
RUN python3 -m pip install .
# Modify the arg_utils.py file to support larger block_size option
RUN sed -i "/parser.add_argument('--block-size',/ {N;N;N;N;N;s/\[8, 16, 32\]/[8, 16, 32, 128, 256, 512, 1024, 2048, 4096, 8192]/}" vllm/engine/arg_utils.py
# Install ray
RUN python3 -m pip install ray
RUN pip install -U "triton>=3.0.0"
# Set the entry point
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
EOF

  4. Use the following commands to authenticate Docker to Amazon ECR, build your Docker image, and push it to the repository you created earlier. The account ID and Region are set dynamically using AWS CLI commands, making the process more flexible and avoiding hard-coded values.
# Authenticate Docker to your ECR registry
aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com
# Build the Docker image
docker build -t ${ECR_REPO_NAME}:latest .

# Tag the image
docker tag ${ECR_REPO_NAME}:latest $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/${ECR_REPO_NAME}:latest
# Push the image to ECR
docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/${ECR_REPO_NAME}:latest

Deploy the Meta Llama 3.1-8B model

With the setup complete, you can now deploy the model using a Kubernetes deployment. The following is an example deployment specification that requests specific resources and sets up multiple replicas:

cat > neuronx-vllm-deployment.yaml <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: neuronx-vllm-deployment
  labels:
    app: neuronx-vllm
spec:
  replicas: 3
  selector:
    matchLabels:
      app: neuronx-vllm
  template:
    metadata:
      labels:
        app: neuronx-vllm
    spec:
      schedulerName: my-scheduler
      containers:
      - name: neuronx-vllm
        image: <replace with the url to the docker image you pushed to the ECR>
        resources:
          limits:
            cpu: 32
            memory: "64G"
            aws.amazon.com/neuroncore: "8"
          requests:
            cpu: 32
            memory: "64G"
            aws.amazon.com/neuroncore: "8"
        ports:
        - containerPort: 8000
        env:
        - name: HF_TOKEN
          value: <your huggingface token>
        - name: FI_EFA_FORK_SAFE
          value: "1"
        args:
        - "--model"
        - "meta-llama/Meta-Llama-3.1-8B"
        - "--tensor-parallel-size"
        - "8"
        - "--max-num-seqs"
        - "64"
        - "--max-model-len"
        - "8192"
        - "--block-size"
        - "8192"
EOF

Apply the deployment specification with kubectl apply -f neuronx-vllm-deployment.yaml.

This deployment configuration sets up multiple replicas of the Meta Llama 3.1-8B model using tensor parallelism (TP) of 8. In the current setup, we’re hosting three copies of the model across the available Neuron cores. This configuration allows for the efficient utilization of the hardware resources while enabling multiple concurrent inference requests.

The use of TP=8 helps in distributing the model across multiple Neuron cores, which improves inference performance and throughput. The specific number of replicas and cores used may vary depending on your particular hardware setup and performance requirements.

To modify the setup, update the neuronx-vllm-deployment.yaml file, adjusting the replicas field in the deployment specification and the aws.amazon.com/neuroncore resource requests and limits in the container specification. Always verify that the total number of cores used (replicas multiplied by cores per replica) doesn’t exceed your available hardware resources and that the number of attention heads is evenly divisible by the TP degree for optimal performance.
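As a quick sanity check before applying changes, you can verify the replica and core math with a few lines of Python. The constants below reflect this post’s setup (one inf2.48xlarge node with 24 Neuron cores, TP=8, and 32 attention heads for Meta Llama 3.1-8B) and are assumptions to adjust for your own hardware and model.

# Assumed values for this post's setup; adjust for your instance type and model.
NODES = 1
NEURON_CORES_PER_NODE = 24   # inf2.48xlarge: 12 Inferentia2 devices x 2 cores each
REPLICAS = 3
CORES_PER_REPLICA = 8        # matches aws.amazon.com/neuroncore: "8" and --tensor-parallel-size 8
ATTENTION_HEADS = 32         # Meta Llama 3.1-8B

total_requested = REPLICAS * CORES_PER_REPLICA
total_available = NODES * NEURON_CORES_PER_NODE

assert total_requested <= total_available, "Requested more Neuron cores than the node group provides"
assert ATTENTION_HEADS % CORES_PER_REPLICA == 0, "TP degree must evenly divide the attention heads"
print(f"OK: {total_requested}/{total_available} Neuron cores used across {REPLICAS} replicas")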

The deployment also includes environment variables for the Hugging Face token and EFA fork safety. The args section (see the preceding code) configures the model and its parameters, including an increased max model length and block size of 8192.

Test the deployment

After you deploy the model, it’s important to monitor its progress and verify its readiness. Complete the following steps:

  1. Check the deployment status:
kubectl get deployments

This will show you the desired, current, and up-to-date number of replicas.

  2. Monitor the pods:
kubectl get pods -l app=neuronx-vllm -w

The -w flag will watch for changes. You’ll see the pods transitioning from "Pending" to "ContainerCreating" to "Running".

  3. Check the logs of a specific pod:
kubectl logs <pod-name>

The initial startup process takes around 15 minutes. During this time, the model is being compiled for the Neuron cores. You’ll see the compilation progress in the logs.

To support proper management of your vLLM pods, you should configure Kubernetes probes in your deployment. These probes help Kubernetes determine when a pod is ready to serve traffic, when it’s alive, and when it has successfully started.

  4. Add the following probe configurations to your container spec in the deployment YAML:
spec:
  containers:
  - name: neuronx-vllm
    # ... other container configurations ...
    readinessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 1800
      periodSeconds: 10
    livenessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 1800
      periodSeconds: 15
    startupProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 1800
      failureThreshold: 30
      periodSeconds: 10

The configuration comprises three probes:

  • Readiness probe – Checks whether the pod is ready to serve traffic. With the values above, it starts checking after the 1,800-second initial delay and repeats every 10 seconds.
  • Liveness probe – Verifies that the pod is still running correctly. It begins after the same 1,800-second delay and checks every 15 seconds.
  • Startup probe – Gives the application time to start up. With a 1,800-second initial delay and up to 30 failed checks at 10-second intervals, it allows roughly 35 minutes before the pod is considered failed.

These probes assume that your vLLM application exposes a /health endpoint. If it doesn’t, you’ll need to implement one or adjust the probe configurations accordingly.

With these probes in place, Kubernetes will do the following:

  • Only send traffic to pods that are ready
  • Restart pods that are no longer alive
  • Allow sufficient time for initial startup and compilation

This configuration helps facilitate high availability and proper functioning of your vLLM deployment.

Now you’re ready to access the pods.

  5. Identify the pod that is running your inference server. You can use the following command to list the pods with the neuronx-vllm label:
kubectl get pods -l app=neuronx-vllm

This command will output a list of pods, and you’ll need the name of the pod you want to forward.

  6. Use kubectl port-forward to forward the port from the Kubernetes pod to your local machine. Use the name of your pod from the previous step:
kubectl port-forward <pod-name> 8000:8000

This command forwards port 8000 on the pod to port 8000 on your local machine. You can now access the inference server at http://localhost:8000.

Because we’re forwarding a port directly from a single pod, requests will only be sent to that specific pod. As a result, traffic won’t be balanced across all replicas of your deployment. This is suitable for testing and development purposes, but it doesn’t utilize the deployment efficiently in a production scenario where load balancing across multiple replicas is crucial to handle higher traffic and provide fault tolerance.

In a production environment, a proper solution like a Kubernetes service with a LoadBalancer or Ingress should be used to distribute traffic across available pods. This facilitates the efficient utilization of resources, a balanced load, and improved reliability of the inference service.
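For reference, a minimal Service of type LoadBalancer created with the Kubernetes Python client could look like the following sketch. The Service name and port are placeholders; the selector matches the labels used in the deployment above.

from kubernetes import client, config

config.load_kube_config()
svc = client.V1Service(
    metadata=client.V1ObjectMeta(name="neuronx-vllm-svc"),  # placeholder name
    spec=client.V1ServiceSpec(
        type="LoadBalancer",
        selector={"app": "neuronx-vllm"},                    # matches the deployment labels
        ports=[client.V1ServicePort(port=80, target_port=8000)],
    ),
)
client.CoreV1Api().create_namespaced_service(namespace="default", body=svc)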

  7. You can test the inference server by making a request from your local machine. The following code is an example of how to make an inference call using curl:
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "meta-llama/Meta-Llama-3.1-8B",
  "prompt": "Explain the theory of relativity.",
  "max_tokens": 100
}'

This setup allows you to test and interact with your inference server locally without needing to expose your service publicly or set up complex networking configurations. For production use, make sure that load balancing and scalability considerations are addressed appropriately.

For more information about routing, see Route application and HTTP traffic with Application Load Balancers.

Monitor performance

AWS offers powerful tools to monitor and optimize your vLLM deployment on Inferentia chips. The AWS Neuron Monitor container, used with Prometheus and Grafana, provides advanced visualization of your ML application performance. Additionally, CloudWatch Container Insights for Neuron offers deep, Neuron-specific analytics.

These tools allow you to track Inferentia chip utilization, model performance, and overall cluster health. By analyzing this data, you can make informed decisions about resource allocation and scaling to meet your workload requirements.

Remember that the initial 15-minute startup time for model compilation is a one-time process per deployment, with subsequent restarts being faster due to caching.

To learn more about setting up and using these monitoring capabilities, see Scale and simplify ML workload monitoring on Amazon EKS with AWS Neuron Monitor container.

Scaling and multi-tenancy

As your application’s demand grows, you may need to scale your deployment to handle more requests. Scaling your Meta Llama 3.1-8B deployment on Amazon EKS with Neuron cores involves two coordinated steps:

  • Increasing the number of nodes in your EKS node group to provide additional Neuron cores
  • Increasing the number of replicas in your deployment to utilize these new resources

You can scale your deployment manually. Use the AWS Management Console or AWS CLI to increase the size of your EKS node group. When new nodes are available, scale your deployment with the following code:

kubectl scale deployment neuronx-vllm-deployment --replicas=<new-number>

Alternatively, you can set up auto scaling:

  • Configure auto scaling for your EKS node group to automatically add nodes based on resource demands
  • Use Horizontal Pod Autoscaling (HPA) to automatically adjust the number of replicas in your deployment

You can configure the node group’s auto scaling to respond to increased CPU, memory, or custom metric demands, automatically provisioning new nodes with Neuron cores as needed. This makes sure that as the number of incoming requests grows, both your infrastructure and your deployment can scale accordingly.
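As one possible illustration, an HPA for the deployment can be created with the Kubernetes Python client as sketched below. The replica bounds and CPU target are placeholders, CPU-based scaling assumes the Kubernetes Metrics Server is installed, and the maximum should keep replicas multiplied by cores per replica within your node group’s capacity.

from kubernetes import client, config

config.load_kube_config()
hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="neuronx-vllm-hpa"),  # placeholder name
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="neuronx-vllm-deployment"
        ),
        min_replicas=1,
        max_replicas=6,                        # placeholder; keep replicas * 8 cores within capacity
        target_cpu_utilization_percentage=70,  # CPU is a simple proxy; custom metrics are an option
    ),
)
client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa
)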

Example scaling solutions include:

  • Cluster Autoscaler with Karpenter – Though not currently installed in this setup, Karpenter offers more flexible and efficient auto scaling for future consideration. It can dynamically provision the right number of nodes needed for your Neuron workloads based on pending pods and custom scheduling constraints. For more details, see Scale cluster compute with Karpenter and Cluster Autoscaler.
  • Multi-cluster federation – For even larger scale, you could set up multiple EKS clusters, each with its own Neuron-equipped nodes, and use a multi-cluster federation tool to distribute traffic among them.

You should consider the following when scaling:

  • Alignment of resources – Make sure that your scaling strategy for both nodes and pods aligns with the Neuron core requirements (multiples of 8 for optimal performance). This requirement is model dependent and specific to the Meta Llama 3.1 model.
  • Compilation time – Remember the 15-minute compilation time for new pods when planning your scaling strategy. Consider pre-warming pods during off-peak hours.
  • Cost management – Monitor costs closely as you scale, because Neuron-equipped instances can be expensive.
  • Performance testing – Conduct thorough performance testing as you scale to verify that increased capacity translates to improved throughput and reduced latency.

By coordinating the scaling of both your node group and your deployment, you can effectively handle increased request volumes while maintaining optimal performance. The auto scaling capabilities of both your node group and deployment can work together to automatically adjust your cluster’s capacity based on incoming request volumes, providing a more responsive and efficient scaling solution.

Clean up

Use the following code to delete the cluster created in this solution:

eksctl delete cluster --name $CLUSTER_NAME --region $AWS_REGION

Conclusion

Deploying LLMs like Meta Llama 3.1-8B at scale poses significant computational challenges. Using Inferentia 2 instances and Amazon EKS can help overcome these challenges by enabling efficient model deployment in a containerized, scalable, and multi-tenant environment.

This solution combines the exceptional performance and cost-effectiveness of Inferentia 2 chips with the robust and flexible landscape of Amazon EKS. Inferentia 2 chips deliver high throughput and low latency inference, ideal for LLMs. Amazon EKS provides dynamic scaling, efficient resource utilization, and multi-tenancy capabilities.

The process involves setting up an EKS cluster, configuring an Inferentia 2 node group, installing Neuron components, and deploying the model as a Kubernetes pod. This approach facilitates high availability, resilience, and efficient resource sharing for language model services, while allowing for automatic scaling, load balancing, and self-healing capabilities.

For the complete code and detailed implementation steps, visit the GitHub repository.


About the Authors

Dmitri Laptev is a Senior GenAI Solutions Architect at AWS, based in Munich. With 17 years of experience in the IT industry, his interest in AI and ML dates back to his university years, fostering a long-standing passion for these technologies. Dmitri is enthusiastic about cloud computing and the ever-evolving landscape of technology.

Maurits de Groot is a Solutions Architect at Amazon Web Services, based out of Amsterdam. He specializes in machine learning-related topics and has a predilection for startups. In his spare time, he enjoys skiing and bouldering.

Ziwen Ning is a Senior Software Development Engineer at AWS. He currently focuses on enhancing the AI/ML experience through the integration of AWS Neuron with containerized environments and Kubernetes. In his free time, he enjoys challenging himself with kickboxing, badminton, and other various sports, and immersing himself in music.

Jianying Lang is a Principal Solutions Architect at the AWS Worldwide Specialist Organization (WWSO). She has over 15 years of working experience in the HPC and AI fields. At AWS, she focuses on helping customers deploy, optimize, and scale their AI/ML workloads on accelerated computing instances. She is passionate about combining the techniques in HPC and AI fields. Jianying holds a PhD in Computational Physics from the University of Colorado at Boulder.

Read More

Serving LLMs using vLLM and Amazon EC2 instances with AWS AI chips

Serving LLMs using vLLM and Amazon EC2 instances with AWS AI chips

The use of large language models (LLMs) and generative AI has exploded over the last year. With the release of powerful publicly available foundation models, tools for training, fine-tuning, and hosting your own LLM have also become democratized. Using vLLM on AWS Trainium and Inferentia makes it possible to host LLMs for high-performance inference and scalability.

In this post, we will walk you through how you can quickly deploy Meta’s latest Llama models, using vLLM on an Amazon Elastic Compute Cloud (Amazon EC2) Inf2 instance. For this example, we will use the 1B version, but other sizes can be deployed using these steps, along with other popular LLMs.

Deploy vLLM on AWS Trainium and Inferentia EC2 instances

In these sections, you will be guided through using vLLM on an AWS Inferentia EC2 instance to deploy Meta’s newest Llama 3.2 model. You will learn how to request access to the model, create a Docker container to use vLLM to deploy the model and how to run online and offline inference on the model. We will also talk about performance tuning the inference graph.

Prerequisite: Hugging Face account and model access

To use the meta-llama/Llama-3.2-1B model, you’ll need a Hugging Face account and access to the model. Please go to the model card, sign up, and agree to the model license. You will then need a Hugging Face token, which you can get by following these steps. When you get to the Save your Access Token screen, as shown in the following figure, make sure you copy the token because it will not be shown again.

Create an EC2 instance

You can create an EC2 Instance by following the guide. A few things to note:

  1. If this is your first time using inf/trn instances, you will need to request a quota increase.
  2. You will use inf2.xlarge as your instance type. inf2.xlarge instances are only available in these AWS Regions.
  3. Increase the gp3 volume to 100 GiB.
  4. You will use Deep Learning AMI Neuron (Ubuntu 22.04) as your AMI, as shown in the following figure.

After the instance is launched, you can connect to it to access the command line. In the next step, you’ll use Docker (preinstalled on this AMI) to run a vLLM container image for neuron.

Start vLLM server

You will use Docker to create a container with all the tools needed to run vLLM. Create a Dockerfile using the following command:

cat > Dockerfile <<'EOF'
# default base image
ARG BASE_IMAGE="public.ecr.aws/neuron/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.20.0-ubuntu20.04"
FROM $BASE_IMAGE
RUN echo "Base image is $BASE_IMAGE"
# Install some basic utilities
RUN apt-get update && \
    apt-get install -y \
        git \
        python3 \
        python3-pip \
        ffmpeg libsm6 libxext6 libgl1
### Mount Point ###
# When launching the container, mount the code directory to /app
ARG APP_MOUNT=/app
VOLUME [ ${APP_MOUNT} ]
WORKDIR ${APP_MOUNT}/vllm
RUN python3 -m pip install --upgrade pip
RUN python3 -m pip install --no-cache-dir fastapi ninja tokenizers pandas
RUN python3 -m pip install sentencepiece transformers==4.36.2 -U
RUN python3 -m pip install transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com -U
RUN python3 -m pip install --pre neuronx-cc==2.15.* --extra-index-url=https://pip.repos.neuron.amazonaws.com -U
ENV VLLM_TARGET_DEVICE neuron
RUN git clone https://github.com/vllm-project/vllm.git && \
    cd vllm && \
    git checkout v0.6.2 && \
    python3 -m pip install -U \
        "cmake>=3.26" ninja packaging "setuptools-scm>=8" wheel jinja2 \
        -r requirements-neuron.txt && \
    pip install --no-build-isolation -v -e . && \
    pip install --upgrade triton==3.0.0
CMD ["/bin/bash"]
EOF

Then run:

docker build . -t vllm-neuron

Building the image will take about 10 minutes. After it’s done, use the new Docker image (replace YOUR_TOKEN_HERE with the token from Hugging Face):

export HF_TOKEN="YOUR_TOKEN_HERE"
docker run \
        -it \
        -p 8000:8000 \
        --device /dev/neuron0 \
        -e HF_TOKEN=$HF_TOKEN \
        -e NEURON_CC_FLAGS=-O1 \
        vllm-neuron

You can now start the vLLM server with the following command:

vllm serve meta-llama/Llama-3.2-1B --device neuron --tensor-parallel-size 2 --block-size 8 --max-model-len 4096 --max-num-seqs 32

This command runs vLLM with the following parameters:

  • serve meta-llama/Llama-3.2-1B: The Hugging Face modelID of the model that is being deployed for inference.
  • --device neuron: Configures vLLM to run on the neuron device.
  • --tensor-parallel-size 2: Sets the number of partitions for tensor parallelism. inf2.xlarge has 1 neuron device and each neuron device has 2 neuron cores.
  • --max-model-len 4096: This is set to the maximum sequence length (input tokens plus output tokens) for which to compile the model.
  • --block-size 8: For neuron devices, this is internally set to the max-model-len.
  • --max-num-seqs 32: This is set to the hardware batch size or a desired level of concurrency that the model server needs to handle.

The first time you load a model, if there isn’t a previously compiled model, it will need to be compiled. This compiled model can optionally be saved so the compilation step is not necessary if the container is recreated. After everything is done and the model server is running, you should see the following logs:

Avg prompt throughput: 0.0 tokens/s ...

This means that the model server is running, but it isn’t yet processing requests because none have been received. You can now detach from the container by pressing ctrl + p and ctrl + q.

Inference

When you started the Docker container, you ran it with the command -p 8000:8000. This told Docker to forward port 8000 from the container to port 8000 on your local machine. When you run the following command, you should see that the model server with meta-llama/Llama-3.2-1B is running.

curl localhost:8000/v1/models

This should return something like:

{"object":"list","data":[{"id":"meta-llama/Llama-3.2-1B","object":"model","created":1732552038,"owned_by":"vllm","root":"meta-llama/Llama-3.2-1B","parent":null,"max_model_len":4096,"permission":[{"id":"modelperm-6d44a6f6e52447eb9074b13ae1e9e285","object":"model_permission","created":1732552038,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}ubuntu@ip-172-31-12-216:~$ 

Now, send it a prompt:

curl localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.2-1B", "prompt": "What is Gen AI?", "temperature":0, "max_tokens": 128}' | jq '.choices[0].text'

You should get back a response similar to the following from vLLM:

ubuntu@ip-172-31-13-178:~$ curl localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.2-1B", "prompt": "What is Gen AI?", "temperature":0, "max_tokens": 128}' | jq '.choices[0].text'
  % Total    % Received % Xferd  Average Speed   Time    Time    Time  Current
                                 Dload  Upload   Total   Spent  Left  Speed
100  1067  100   966  100   101    108     11  0:00:09  0:00:08 0:00:01   258
" How does it work?\nGen AI is a new type of artificial intelligence that is designed to learn and adapt to new situations and environments. It is based on the idea that the human brain is a complex system that can learn and adapt to new situations and environments. Gen AI is designed to be able to learn and adapt to new situations and environments in a way that is similar to how the human brain does.\nGen AI is a new type of artificial intelligence that is designed to learn and adapt to new situations and environments. It is based on the idea that the human brain is a complex system that can learn and adapt to new situations and environments."

Offline inference with vLLM

Another way to use vLLM on Inferentia is by sending a few requests all at the same time in a script. This is useful for automation or when you have a batch of prompts that you want to send all at the same time.

You can reattach to your Docker container and stop the online inference server with the following:

docker attach $(docker ps --format "{{.ID}}")

At this point, you should see a blank cursor. Press ctrl + c to stop the server, and you should be back at the bash prompt in the container. Create a file for using the offline inference engine:

cat > offline_inference.py <<EOF
from vllm.entrypoints.llm import LLM
from vllm.sampling_params import SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="meta-llama/Llama-3.2-1B",
        max_num_seqs=32,
        max_model_len=4096,
        block_size=8,
        device="neuron",
        tensor_parallel_size=2)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

EOF

Now, run the script with python offline_inference.py, and you should get back responses for the four prompts. This may take a minute because the model needs to be started again.

Processed prompts: 100%|
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00,  2.53it/s, est. speed input: 16.46 toks/s, output: 40.51 toks/s]
Prompt: 'Hello, my name is', Generated text: ' Anna and I am the 4th year student of the Bachelor of Engineering at'
Prompt: 'The president of the United States is', Generated text: ' the head of state and head of government of the United States of America. A'
Prompt: 'The capital of France is', Generated text: ' also the most expensive city to live in. The average cost of living in Paris'
Prompt: 'The future of AI is', Generated text: ' now\nThe 10 most influential AI professionals to watch in 2019\n'

You can now type exit and press return and then press ctrl + c to shut down the Docker container and go back to your inf2 instance.

Clean up

Now that you’re done testing the Llama 3.2 1B LLM, you should terminate your EC2 instance to avoid additional charges.

Performance tuning for variable sequence lengths

You will probably have to process variable length sequences during LLM inference. The Neuron SDK generates buckets and a computation graph that works with the shape and size of the buckets. To fine tune the performance based on the length of input and output tokens in the inference requests, you can set two kinds of buckets corresponding to the two phases of LLM inference through the following environment variables as a list of integers:

  • NEURON_CONTEXT_LENGTH_BUCKETS corresponds to the context encoding phase. Set this to the estimated length of prompts during inference.
  • NEURON_TOKEN_GEN_BUCKETS corresponds to the token generation phase. Set this to a range of powers of two within your generation length.

You can use the docker run command to set the environment variables while starting the vLLM server (remember to replace YOUR_TOKEN_HERE with your Hugging Face token):

export HF_TOKEN="YOUR_TOKEN_HERE"
docker run \
        -it \
        -p 8000:8000 \
        --device /dev/neuron0 \
        -e HF_TOKEN=$HF_TOKEN \
        -e NEURON_CC_FLAGS=-O1 \
        -e NEURON_CONTEXT_LENGTH_BUCKETS="1024,1280,1536,1792,2048" \
        -e NEURON_TOKEN_GEN_BUCKETS="256,512,1024" \
        vllm-neuron

You can then start the server using the same command:

vllm serve meta-llama/Llama-3.2-1B --device neuron --tensor-parallel-size 2 --block-size 8 --max-model-len 4096 --max-num-seqs 32

As the model graph has changed, the model will need to be recompiled. If the container was terminated, the model will be downloaded again. You can then detach from the container by pressing ctrl + p and ctrl + q and send a request using the same command:

curl localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.2-1B", "prompt": "What is Gen AI?", "temperature":0, "max_tokens": 128}' | jq '.choices[0].text'

For more information about how to configure the buckets, see the developer guide on bucketing. Note that NEURON_CONTEXT_LENGTH_BUCKETS corresponds to context_length_estimate and NEURON_TOKEN_GEN_BUCKETS corresponds to n_positions in the documentation.

Conclusion

You’ve just seen how to deploy meta-llama/Llama-3.2-1B using vLLM on an Amazon EC2 Inf2 instance. If you’re interested in deploying other popular LLMs from Hugging Face, you can replace the modelID in the vLLM serve command. More details on the integration between the Neuron SDK and vLLM can be found in the Neuron user guide for continuous batching and the vLLM guide for Neuron.

After you’ve identified a model that you want to use in production, you will want to deploy it with autoscaling, observability, and fault tolerance. You can also refer to this blog post to understand how to deploy vLLM on Inferentia through Amazon Elastic Kubernetes Service (Amazon EKS). In the next post of this series, we’ll go into using Amazon EKS with Ray Serve to deploy vLLM into production with autoscaling and observability.


About the authors

Omri Shiv is an Open Source Machine Learning Engineer focusing on helping customers through their AI/ML journey. In his free time, he likes cooking, tinkering with open source and open hardware, and listening to and playing music.

Pinak Panigrahi works with customers to build ML-driven solutions to solve strategic business problems on AWS. In his current role, he works on optimizing training and inference of generative AI models on AWS AI chips.

Read More

Using LLMs to fortify cyber defenses: Sophos’s insight on strategies for using LLMs with Amazon Bedrock and Amazon SageMaker

Using LLMs to fortify cyber defenses: Sophos’s insight on strategies for using LLMs with Amazon Bedrock and Amazon SageMaker

This post is co-written with Adarsh Kyadige and Salma Taoufiq from Sophos. 

As a leader in cutting-edge cybersecurity, Sophos is dedicated to safeguarding over 500,000 organizations and millions of customers across more than 150 countries. By harnessing the power of threat intelligence, machine learning (ML), and artificial intelligence (AI), Sophos delivers a comprehensive range of advanced products and services. These solutions are designed to protect and defend users, networks, and endpoints against a wide array of cyber threats including phishing, ransomware, and malware. The Sophos Artificial Intelligence (AI) group (SophosAI) oversees the development and maintenance of Sophos’s major ML security technology.

Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation across diverse domains, as showcased in numerous leaderboards (for example, HELM and the Hugging Face Open LLM Leaderboard) that evaluate them on a myriad of generic tasks. However, their effectiveness in specialized fields like cybersecurity relies heavily on domain-specific knowledge. In this context, fine-tuning emerges as a crucial technique to adapt these general-purpose models to the intricacies of cybersecurity. For example, we could use instruction fine-tuning to increase the model’s performance on incident classification or summarization. However, before fine-tuning, it’s important to determine an out-of-the-box model’s potential by testing its abilities on a set of tasks based on the domain. We have defined three specialized tasks that are covered later in this post. These same tasks can also be used to measure the gains in performance obtained through fine-tuning, Retrieval-Augmented Generation (RAG), or knowledge distillation.

In this post, SophosAI shares insights in using and evaluating an out-of-the-box LLM for the enhancement of a security operations center’s (SOC) productivity using Amazon Bedrock and Amazon SageMaker. We use Anthropic’s Claude 3 Sonnet on Amazon Bedrock to illustrate the use cases.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

Tasks

We will showcase three example tasks to delve into using LLMs in the context of an SOC. An SOC is an organizational unit responsible for monitoring, detecting, analyzing, and responding to cybersecurity threats and incidents. It employs a combination of technology, processes, and skilled personnel to maintain the confidentiality, integrity, and availability of information systems and data. SOC analysts continuously monitor security events, investigate potential threats, and take appropriate action to mitigate risks. Known challenges faced by SOCs are the high volume of alerts generated by detection tools and the subsequent alert fatigue among analysts. These challenges are often coupled with staffing shortages. To address these challenges and enhance operational efficiency and scalability, many SOCs are increasingly turning to automation technologies to streamline repetitive tasks, prioritize alerts, and accelerate incident response. Considering the nature of tasks analysts need to perform, LLMs are good tools to enhance the level of automation in SOCs and empower security teams.

For this work, we focus on three essential SOC use cases where LLMs have the potential of greatly assisting analysts, namely:

  1. SQL Query generation from natural language to simplify data extraction
  2. Incident severity prediction to prioritize which incidents analysts should focus on
  3. Incident summarization based on its constituent alert data to increase analyst productivity

Based on the token consumption of these tasks, particularly the summarization component, we need a model with a context window of at least 4000 tokens. While the tasks have been tested in English, Anthropic’s Claude 3 Sonnet model can perform in other languages. However, we recommend evaluating the performance in your specific language of interest.

Let’s dive into the details of each task.

Task 1: Query generation from natural language

This task’s objective is to assess a model’s capacity to translate natural language questions into SQL queries, using contextual knowledge of the underlying data schema. This skill simplifies the data extraction process, allowing security analysts to conduct investigations more efficiently without requiring deep technical knowledge. We used prompt engineering guidelines to tailor our prompts to generate better responses from the LLM.

A three-shot prompting strategy is used for this task. Given a database schema, the model is provided with three examples pairing a natural-language question with its corresponding SQL query. Following these examples, the model is then prompted to generate the SQL query for a question of interest.

The prompt below is a three-shot prompt example for query generation from natural language. Empirically, we have obtained better results with few-shot prompting as opposed to one-shot (where the model is provided with only one example question and corresponding query before the actual question of interest) or zero-shot (where the model is directly prompted to generate a desired query without any examples).

Translate the following request into SQL
Schema for alert_table table
   <Table schema>
Schema for process_table table
   <Table schema>
Schema for network_table table
   <Table schema>

Here are some examples
<examples>
Request:tell me a list of processes that were executed between 2021/10/19 and 2021/11/30
   SQL:select * from process_table where timestamp between '2021-10-19' and '2021-11-30';

Request:show me any low severity security alerts for the 23 days ago
   SQL:select * from alert_table where severity='low' and timestamp>=DATEADD('day', -23, CURRENT_TIMESTAMP());

Request:show me the count of msword.exe processes that ran between Dec/01 and Dec/11
   SQL:select count(*) from process_table where process='msword.exe' and timestamp>='2022-12-01' and timestamp<='2022-12-11';
</examples>

Request:"Any Ubuntu processes that was run by the user ""admin"" from host ""db-server"""
SQL:

To evaluate a model’s performance on this task, we rely on a proprietary dataset of about 100 target queries based on a test database schema. To determine the accuracy of the queries generated by the model, we follow a multi-step evaluation. First, we verify whether the model’s output is an exact match to the expected SQL statement; exact matches are recorded as successful outcomes. If there is a mismatch, we run both the model’s query and the expected query against our mock database and compare their results. However, this method can be prone to false positives and false negatives, so we further perform a query equivalence assessment using a different, stronger LLM. This method is known as LLM-as-a-judge.
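A simplified sketch of this multi-step check, using an SQLite database as a stand-in for the mock database, could look like the following. The function name and the fallback to an LLM judge are illustrative, not the exact harness used in our experiments.

import sqlite3

def evaluate_sql(generated_sql, expected_sql, mock_db_path):
    # Step 1: exact string match
    if generated_sql.strip().rstrip(";").lower() == expected_sql.strip().rstrip(";").lower():
        return "exact_match"
    # Step 2: run both queries against the mock database and compare result sets
    conn = sqlite3.connect(mock_db_path)
    try:
        generated_rows = set(conn.execute(generated_sql).fetchall())
        expected_rows = set(conn.execute(expected_sql).fetchall())
    except sqlite3.Error:
        return "execution_error"
    finally:
        conn.close()
    if generated_rows == expected_rows:
        return "result_match"
    # Step 3: defer to a stronger LLM for an equivalence judgment (LLM-as-a-judge), not shown here
    return "needs_llm_judge"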

Anthropic’s Claude 3 Sonnet model achieved a good accuracy rate of 88 percent on the chosen dataset, suggesting that this natural-language-to-SQL task is relatively simple for LLMs. With basic few-shot prompting, an LLM can therefore be used out of the box, without fine-tuning, by security analysts to assist them in retrieving key information while investigating threats. Note that this performance is based on our dataset and our experiment; you can run your own test using the strategy explained above.

Task 2: Incident severity prediction

For the second task, we assess a model’s ability to recognize the severity of observed events as indicators of an incident. Specifically, we try to determine whether an LLM can review a security incident and accurately gauge its importance. Armed with such a capability, a model can assist analysts in determining which incidents are most pressing, so they can work more efficiently by organizing their work queue based on severity levels, cut through the noise, and save time and energy.

The input data in this use case is semi-structured alert data, typical of what is produced by various detection systems during an incident. We clearly define severity categories—critical, high, medium, low, and informational—across which the model is to classify the severity of the incident. This is therefore a classification problem that tests an LLM’s intrinsic cybersecurity knowledge.

Each security incident within the Sophos Managed Detection and Response (MDR) platform is made up of multiple detections that highlight suspicious activities occurring in a user’s environment. A detection might involve identifying potentially harmful patterns, such as unusual command executions, abnormal file access, anomalous network traffic, or suspicious script use. An example of the input data is shown below.

The “detection” section provides detailed information about each specific suspicious activity that was identified. It includes the type of security incident, such as “Execution,” along with a description that explains the nature of the threat, like the use of suspicious PowerShell commands. The detection is tied to a unique identifier for tracking and reference purposes. Additionally, it contains details from the MITRE ATT&CK framework which categorizes the tactics and techniques involved in the threat. This section might also reference related Sigma rules, which are community-driven signatures for detecting threats across different systems. By including these elements, the detection section serves as a comprehensive outline of the potential threat, helping analysts understand not just what was detected but also why it matters.

The “machine_data” section holds crucial information about the machine on which the detection occurred. It can provide further metadata on the machine, helping to pinpoint where exactly in the environment the suspicious activity was observed.

{
    ...
  "detection": {
    "attack": "Execution",
    "description": "Identifies the use of suspicious PowerShell IEX patterns. IEX is the shortened version of the Invoke-Expression PowerShell cmdlet. The cmdlet runs the specified string as a command.",
    "id": <Detection ID>,
    "mitre_attack": [
      {
        "tactic": {
          "id": "TA0002",
          "name": "Execution",
          "techniques": [
            {
              "id": "T1059.001",
              "name": "PowerShell"
            }
          ]
        }
      },
      {
        "tactic": {
          "id": "TA0005",
          "name": "Defense Evasion",
          "techniques": [
            {
              "id": "T1027",
              "name": "Obfuscated Files or Information"
            }
          ]
        }
      }
    ],
    "sigma": {
      "id": <Detection ID>,
      "references": [
        "https://github.com/SigmaHQ/sigma/blob/master/rules/windows/process_creation/proc_creation_win_susp_powershell_download_iex.yml",
        "https://github.com/VirtualAlllocEx/Payload-Download-Cradles/blob/main/Download-Cradles.cmd"
      ]
    },
    "type": "process",
  },
  "machine_data": {
    ...
    "username": <Username>
    },
    "customer_id": <Customer ID>,
    "decorations": {
        <Customer data>
    },
    "original_file_name": "powershell.exe",
    "os_platform": "windows",
    "parent_process_name": "cmd.exe",
    "parent_process_path": "C:\Windows\System32\cmd.exe",
    "powershell_code": "iex ([system.text.encoding]::ASCII.GetString([Convert]::FromBase64String('aWYoR2V0LUNvbW1hbmQgR2V0LVdpbmRvd3NGZWF0dXJlIC1lYSBTaWxlbnRseUNvbnRpbnVlKQp7CihHZXQtV2luZG93c0ZlYXR1cmUgfCBXaGVyZS1PYmplY3QgeyRfLm5hbWUgLWVxICdSRFMtUkQtU2VydmVyJ30gfCBTZWxlY3QgSW5zdGFsbFN0YXRlKS5JbnN0YWxsU3RhdGUKfQo=')))",
    "process_name": "powershell.exe",
    "process_path": "C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe",
  },
  ...
} 

To facilitate evaluation, the prompt used for this task requires the model to communicate its severity assessments in a uniform way, providing the response in a standardized format, for example as a JSON dictionary with severity_pred as the key and the chosen severity level as the value. The prompt below is an example for incident severity classification; a small parser for this response format is sketched after the prompt. Model performance is then evaluated against a test set of over 3,800 security incidents with target severity levels.

You are a helpful cybersecurity incident investigation expert that classifies incidents according to their severity level given a set of detections per incident.
Respond strictly with this JSON format: {"severity_pred": "xxx"} where xxx should only be either:
    - Critical,
    <Criteria for a critical incident>
    - High,
    <Criteria for a high severity incident>
    - Medium,
    <Criteria for a medium severity incident>
    - Low,
    <Criteria for a low severity incident>
    - Informational
    <Criteria for an informational incident>
    No other value is allowed.

Detections:
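Because the model is instructed to answer in a fixed JSON format, the response can be parsed and validated with a few lines of Python. The sketch below assumes the raw model text has already been extracted from the API response.

import json

ALLOWED_SEVERITIES = {"Critical", "High", "Medium", "Low", "Informational"}

def parse_severity(model_output):
    # Parse the model's JSON response and validate the predicted label.
    try:
        prediction = json.loads(model_output).get("severity_pred", "")
    except json.JSONDecodeError:
        return "invalid"
    return prediction if prediction in ALLOWED_SEVERITIES else "invalid"

print(parse_severity('{"severity_pred": "High"}'))  # High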

Various experimental setups are used for this task, including zero-shot prompting, three-shot prompting using random or nearest-neighbor incidents examples, and simple classifiers.

This task turned out to be quite challenging because of the noise in the target labels and the inherent difficulty of assessing the criticality of an incident without further investigation, especially for models that weren’t trained specifically for this use case.

Even under various setups, such as few-shot prompting with nearest neighbor incidents, the model’s performance couldn’t reliably outperform random chance. For reference, the baseline accuracy on the test set is approximately 71 percent and the baseline balanced accuracy is 20 percent.

Figure 1 presents the confusion matrix of the model’s responses, which shows the model’s classification performance at a glance. Only 12 percent (0.12) of the actual Critical incidents were correctly classified; 50 percent of Critical incidents were predicted as High, 25 percent as Medium, and 12 percent as Informational. We see similarly low accuracy on the remaining labels, with the lowest being the Low label, where only 2 percent of incidents were correctly predicted. There is also a notable tendency to overpredict the High and Medium categories across the board.

Figure 1: Confusion matrix for the five-severity-level classification using Anthropic Claude 3 Sonnet
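For readers who want to reproduce this kind of analysis, the row-normalized confusion matrix and the balanced accuracy can be computed with scikit-learn as in the sketch below; the y_true and y_pred lists are placeholder data, not the test set used here.

from sklearn.metrics import balanced_accuracy_score, confusion_matrix

labels = ["Critical", "High", "Medium", "Low", "Informational"]
# Placeholder predictions; replace with the target and predicted severities from your test set.
y_true = ["Critical", "High", "Medium", "Low", "Informational", "High"]
y_pred = ["High", "High", "High", "Medium", "Informational", "Medium"]

cm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")  # row-normalized, as in Figure 1
print(cm)
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))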

The performance observed in this benchmark task indicates this is a particularly hard problem for an unmodified, all-purpose LLM, and the problem requires a more specialized model, specifically trained or fine-tuned on cybersecurity data.

Task 3: Incident summarization

The third task is concerned with the summarization of incoming incidents. It evaluates the potential of a model to assist threat analysts in the triage and investigation of security incidents as they come in by providing a succinct and concise summary of the activity that triggered the incident.

Security incidents typically consist of a series of events occurring on a user endpoint or network, associated with detected suspicious activity. The analysts investigating the incident are presented with a series of events that occurred on the endpoint at the time the suspicious activity was detected. However, analyzing this event sequence can be challenging and time-consuming, resulting in difficulty in identifying noteworthy events. This is where LLMs can be beneficial by helping organize and categorize event data following a specific template, thereby aiding comprehension, and helping analysts quickly determine the appropriate next actions.

We use real incident data from Sophos’s MDR for incident summarization. The input for this task encompasses a set of JSON events, each having distinct schemas and attributes based on the capturing sensor. Along with instructions and a predefined template, this data is provided to the model to generate a summary. The prompt below is an example template prompt for generating incident summaries from SOC data.

As a cybersecurity assistant, your task is to:
    1. Analyze the provided cybersecurity detections data.
    2. Create a report of the events using the information from the '### Detections' section, which may include security artifacts such as command lines and file paths.
    3. [Any other additional general requirements for formatting, etc.]
The report outline should look like this:
Summary:
    <Few sentence description of the activity. [Any additional requirements for the summary: what to  include, etc.]>
Observed MITRE Techniques:
    <List only the registered MITRE Technique or Tactic ID and name pairs if available. The ID should start with 'T'.>
Impacted Hosts:
    <List of all hostname observed in the detections, provide corresponding IPs if available>
Active Users:
    <List of all usernames observed in the detections. There could be multiple, list all of them>
Events:
    <One sentence description for top three detection events. Start the list with \n1. >
IPs/URLs:
    <List available IPs and URLs.>
    <Enumerate only up to ten artifacts under each report category, and summarize any remaining events beyond that.>
Files: 
    <List the files found in the incident as follows:>
    <TEMPLATE FOR FILES WITH DETAILS>
Command Lines: 
    <List the command lines found in the detections as follows:>
    <TEMPLATE FOR COMMAND LINES WITH DETAILS>

### Detections:

Evaluating these generated incident summaries is tricky because several factors must be considered. For example, it’s crucial that the extracted information is not only correct but also relevant. To gain a general understanding of the quality of a model’s incident summarization, we use a set of five distinct metrics and rely on a dataset comprising N incidents. We compare the generated descriptions with corresponding gold-standard descriptions crafted based on Sophos analysts’ feedback.

We compute two classes of metrics. The first class assesses factual accuracy; it is used to evaluate how many artifacts, such as command lines, file paths, and usernames, were correctly identified and summarized by the model. The computation here is straightforward: we compute the average distance between the artifacts extracted from the generated description and those in the target, using two distance metrics, Levenshtein distance and longest common subsequence (LCS).
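The sketch below shows one way to compute these two similarity scores for a single extracted artifact using plain Python; the normalization to a 0–1 score conveys the idea but is not necessarily the exact formula used in our evaluation.

def levenshtein(a, b):
    # Edit distance between two strings via dynamic programming (rolling rows).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def lcs_length(a, b):
    # Length of the longest common subsequence.
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

target = "C:\\Windows\\System32\\cmd.exe"
extracted = "C:\\Windows\\System32\\cmd.exe /c whoami"
print(1 - levenshtein(extracted, target) / max(len(extracted), len(target)))  # Levenshtein similarity
print(lcs_length(extracted, target) / max(len(extracted), len(target)))       # LCS-based similarity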

The second class of metrics is used to provide a more semantic evaluation of the generated description, using three different metrics:

  • BERTScore metric: This metric is used to evaluate the generated summaries using a pre-trained BERT model’s contextual embeddings. It determines the similarity between the generated summary and the reference summary using cosine similarity.
  • ADA2 embeddings cosine similarity: This metric assesses the cosine similarity of ADA2 embeddings of tokens in the generated summary with those of the reference summary.
  • METEOR score: METEOR is an evaluation metric based on the harmonic mean of unigram precision and recall.

More advanced evaluation methods can be used such as training a reward model on human preferences and using it as an evaluator, but for the sake of simplicity and cost-effectiveness, we limited the scope to these metrics.

Below is a summary of our results on this task:

  • Model: Anthropic’s Claude 3 Sonnet
  • Levenshtein-based factual accuracy: 0.810
  • LCS-based factual accuracy: 0.721
  • BERTScore: 0.886
  • Cosine similarity of ADA2 embeddings: 0.951
  • METEOR score: 0.4165

Based on these findings, we gain a broad understanding of the performance of the model when it comes to generating incident summaries, focusing especially on factual accuracy and retrieval rate. Anthropic’s Claude 3 Sonnet model can capture the activity that’s occurring in the incident and summarize it well. However, it ignores certain instructions such as defanging all IPs and URLs. The returned reports are also not fully aligned with the target responses on a token level as signaled by the METEOR score. Anthropic’s Claude 3 Sonnet model skims over some details and explanations in the reports.

Experimental setup using Amazon Bedrock and Amazon SageMaker

This section outlines the experimental setup for evaluating various large language models (LLMs) using Amazon Bedrock and Amazon SageMaker. These services allowed us to efficiently interact with and deploy multiple LLMs for quick and cost-effective experimentation.

Amazon Bedrock

Amazon Bedrock is a managed service that lets you experiment with various LLMs quickly and on demand, so you can interact with LLMs without self-hosting them and pay only for the tokens consumed. We used the InvokeModel API to interact with the model with minimal latency and wrote the following function to call different models by passing the necessary inference parameters to the API. For more details on the inference parameters per provider, we recommend reading the Inference request parameters and response fields for foundation models section in the Amazon Bedrock documentation. The example below uses the function with Anthropic’s Claude 3 Sonnet model. Notice that we gave the model a role via the system prompt and prefilled its response.

import json

import boto3

system_prompt = "You are a helpful cybersecurity incident investigation expert that classifies incidents according to their severity level given a set of detections per incident"

messages = [
    {
        "role": "user",
        "content": """Respond strictly with this JSON format: {"severity_pred": "xxx"} where xxx should only be either:
- Critical,
<Criteria for a critical incident>
- High,
<Criteria for a high severity incident>
- Medium,
<Criteria for a medium severity incident>
- Low,
<Criteria for a low severity incident>
- Informational
<Criteria for an informational incident>
No other value is allowed.""",
    },
    # Prefill the assistant's response so the model continues from "Detections:"
    {"role": "assistant", "content": "Detections:"},
]

def generate_message(bedrock_runtime, model_id, system_prompt, messages, max_tokens):
    # Build the Anthropic Messages API request body expected by Amazon Bedrock
    body = json.dumps(
        {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "system": system_prompt,
            "messages": messages,
        }
    )
    response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
    response_body = json.loads(response.get("body").read())
    return response_body

The above example is based on our use case. The model_id parameter specifies the identifier of the model you wish to invoke with the Bedrock runtime; we used anthropic.claude-3-sonnet-20240229-v1:0. For other model IDs and further details about this API, refer to the Amazon Bedrock documentation, and adapt the function to your own use case and requirements.
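For illustration, the function could be called as follows; the Region and max_tokens value are arbitrary choices for this sketch.

import boto3

# Create the Bedrock runtime client in a Region where the model is available
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response_body = generate_message(
    bedrock_runtime,
    model_id="anthropic.claude-3-sonnet-20240229-v1:0",
    system_prompt=system_prompt,
    messages=messages,
    max_tokens=256,
)

# Anthropic models on Bedrock return the generated text under content[0]["text"]
print(response_body["content"][0]["text"])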

Our analysis in this blog post has focused on Anthropic’s Claude 3 Sonnet model and three specific use cases, but these insights can be adapted to other SOCs’ requirements and preferred models. Amazon Bedrock provides access to other models such as Meta’s Llama models, Mistral models, and Amazon Titan models. For additional models, we used Amazon SageMaker JumpStart.

Amazon SageMaker

Amazon SageMaker is a fully managed machine learning (ML) service. With SageMaker, data scientists and developers can quickly and confidently build, train, and deploy ML models into a production-ready hosted environment. Amazon SageMaker JumpStart is a feature within SageMaker that offers a comprehensive hub of publicly available and proprietary foundation models (FMs), including a wide range of LLMs that you can tune and deploy in a low-code manner. To experiment with these out-of-the-box models cost-effectively, we deployed the LLMs from SageMaker JumpStart using asynchronous inference endpoints.

Inference endpoints let us download these models directly from their respective Hugging Face repositories and deploy them with a few lines of code using pre-made Text Generation Inference (TGI) containers (see the example notebook on GitHub). We used asynchronous inference endpoints with autoscaling, which helped manage costs by automatically scaling the endpoints down to zero when they weren’t in use. Given the number of endpoints we were creating, this kept each endpoint ready whenever it was needed and scaled it back down afterward, with no additional management on our end once the scaling policy was defined.
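A minimal sketch of this deployment pattern with the SageMaker Python SDK is shown below. The model ID, instance type, and S3 output path are placeholders to adapt to your own setup, and some models additionally require accepting an EULA at deployment time.

from sagemaker.async_inference import AsyncInferenceConfig
from sagemaker.jumpstart.model import JumpStartModel

# Placeholder JumpStart model ID for the LLM you want to evaluate
model = JumpStartModel(model_id="huggingface-llm-mistral-7b-instruct")

# Deploy to an asynchronous inference endpoint; requests and responses are staged in S3
predictor = model.deploy(
    instance_type="ml.g5.2xlarge",
    async_inference_config=AsyncInferenceConfig(output_path="s3://<your-bucket>/async-output/"),
)

# An autoscaling policy, defined separately with Application Auto Scaling,
# can then scale the endpoint's instance count down to zero when the queue is empty.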

Next steps

In this blog post we applied the tasks to a single model to showcase the approach; in practice, you would select a few candidate LLMs based on your requirements and put them through the experiments described here. From there, if the out-of-the-box models aren’t sufficient for a task, you would pick the best-suited LLM and fine-tune it on that specific task.

For example, based on the outcomes of our three experimental tasks, we found that the results of the incident information summarization task didn’t meet our expectations, so we will fine-tune the out-of-the-box model that best suits our needs. This fine-tuning can be accomplished using Amazon Bedrock custom models or SageMaker fine-tuning, and the fine-tuned model can then be served either by importing it into Amazon Bedrock or by deploying it to a SageMaker endpoint.

In this blog we covered the experimentation phase. Once you identify an LLM that meets your performance requirements, it’s important to start considering how to productionize it, paying attention to guardrails and scalability. Implementing guardrails helps minimize the risk of misuse and security breaches; Amazon Bedrock Guardrails lets you implement safeguards for your generative AI applications based on your use cases and responsible AI policies. A separate blog post covers how to build guardrails into your generative AI applications. When moving an LLM into production, you also want to validate its scalability against your expected request traffic. In Amazon Bedrock, consider increasing your model quotas, using batch inference, queuing requests, or even distributing requests across different Regions that offer the same model. Select the technique that suits your use case and traffic.
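As a small example of the guardrails point, once you have created a guardrail in Amazon Bedrock, you can attach it to the InvokeModel call from the earlier snippet by passing its identifier and version (both placeholders below).

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# body is the same JSON request body built in generate_message above
response = bedrock_runtime.invoke_model(
    body=body,
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    guardrailIdentifier="<your-guardrail-id>",  # placeholder: ID of a guardrail you created
    guardrailVersion="1",                       # placeholder: guardrail version to apply
)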

Conclusion

In this post, SophosAI shared insights on how to use and evaluate out-of-the-box LLMs on a set of specialized tasks aimed at enhancing a security operations center’s (SOC) productivity, using Amazon Bedrock and Amazon SageMaker. We used Anthropic’s Claude 3 Sonnet model on Amazon Bedrock to illustrate three use cases.

Amazon Bedrock and SageMaker were key to enabling us to run these experiments. Amazon Bedrock’s convenient access to high-performing foundation models (FMs) from leading AI companies through a single API let us test various LLMs without needing to deploy them ourselves, and the on-demand pricing model meant we paid only for the tokens we consumed.

To access additional models with flexible control, SageMaker is a great alternative that offers a wide range of LLMs ready for deployment. While you would deploy these models yourself, you can still achieve great cost optimization by using asynchronous endpoints with a scaling policy that scales the instance down to zero when not in use.

General takeaways as to the applicability of an LLM such as Anthropic’s Claude 3 Sonnet model in cybersecurity can be summarized as follows:

  • An out-of-the-box LLM can be an effective assistant in threat hunting and incident investigation. However, it still requires some guardrails and guidance. We believe that this potential application can be implemented using an existing powerful model, such as Anthropic’s Claude 3 Sonnet model, with careful prompt engineering.
  • When it comes to summarizing incident information from raw data, Anthropic’s Claude 3 Sonnet model performs adequately, but there’s room for improvement through fine-tuning.
  • Evaluating individual artifacts or groups of artifacts remains a challenging task for a pre-trained LLM. To tackle this problem, a specialized LLM trained specifically on cybersecurity data might be required.

It is also worth noting that while we used the InvokeModel API from Amazon Bedrock, a simpler way to access Amazon Bedrock models is the Converse API. The Converse API provides a consistent interface that works with all Amazon Bedrock models that support messages, so you can write code once and use it with different models. If a model has unique inference parameters, the Converse API also lets you pass them in a model-specific structure.
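For comparison, here is a minimal sketch of the same kind of request expressed with the Converse API, reusing the system_prompt defined earlier; the token limit is an arbitrary choice.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    system=[{"text": system_prompt}],
    messages=[{"role": "user", "content": [{"text": "<detections for the incident>"}]}],
    inferenceConfig={"maxTokens": 256},
)

# The generated text is returned under output.message.content
print(response["output"]["message"]["content"][0]["text"])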


About the Authors

Benoit de Patoul is a GenAI/AI/ML Specialist Solutions Architect at AWS. He helps customers by providing guidance and technical assistance to build solutions related to GenAI/AI/ML using Amazon Web Services. In his free time, he likes to play piano and spend time with friends.

Naresh Nagpal is a Solutions Architect at AWS with extensive experience in application development, integration, and technology architecture. At AWS, he works with ISV customers in the UK to help them build and modernize their SaaS applications on AWS. He is also helping customers to integrate GenAI capabilities in their SaaS applications.

Adarsh Kyadige oversees the Research wing of the Sophos AI team, where he has been working since 2018 at the intersection of Machine Learning and Security. He earned a Masters degree in Computer Science, with a specialization in Artificial Intelligence and Machine Learning, from UC San Diego. His interests and responsibilities involve applying Deep Learning to Cybersecurity, as well as orchestrating pipelines for large scale data processing. In his leisure time, Adarsh can be found at the archery range, tennis courts, or in nature. His latest research can be found on Google Scholar.

Salma Taoufiq was a Senior Data Scientist at Sophos focusing on the intersection of machine learning and cybersecurity. With an undergraduate background in computer science, she graduated from the Central European University with an MSc in Mathematics and Its Applications. When not developing a malware detector, Salma is an avid hiker, traveler, and consumer of thrillers.

Read More

Enhanced observability for AWS Trainium and AWS Inferentia with Datadog

Enhanced observability for AWS Trainium and AWS Inferentia with Datadog

This post is co-written with Curtis Maher and Anjali Thatte from Datadog. 

This post walks you through Datadog’s new integration with AWS Neuron, which helps you monitor your AWS Trainium and AWS Inferentia instances by providing deep observability into resource utilization, model execution performance, latency, and real-time infrastructure health, enabling you to optimize machine learning (ML) workloads and achieve high performance at scale.

Neuron is the SDK used to run deep learning workloads on Trainium and Inferentia based instances. AWS AI chips, Trainium and Inferentia, enable you to build and deploy generative AI models at higher performance and lower cost. With the increasing use of large models, requiring a large number of accelerated compute instances, observability plays a critical role in ML operations, empowering you to improve performance, diagnose and fix failures, and optimize resource utilization.

Datadog, an observability and security platform, provides real-time monitoring for cloud infrastructure and ML operations. Datadog is excited to launch its Neuron integration, which pulls metrics collected by the Neuron SDK’s Neuron Monitor tool into Datadog, enabling you to track the performance of your Trainium and Inferentia based instances. By providing real-time visibility into model performance and hardware usage, Datadog helps you achieve efficient training and inference, optimized resource utilization, and the prevention of service slowdowns.

Comprehensive monitoring for Trainium and Inferentia

Datadog’s integration with the Neuron SDK automatically collects metrics and logs from Trainium and Inferentia instances and sends them to the Datadog platform. Upon enabling the integration, users will find an out-of-the-box dashboard in Datadog, making it straightforward to start monitoring quickly. You can also modify preexisting dashboards and monitors, and add new ones tailored to your specific monitoring requirements.

The Datadog dashboard offers a detailed view of your AWS AI chip (Trainium or Inferentia) performance, such as the number of instances, availability, and AWS Region. Real-time metrics give an immediate snapshot of infrastructure health, with preconfigured monitors alerting teams to critical issues like latency, resource utilization, and execution errors. The following screenshot shows an example dashboard.

For instance, when latency spikes on a specific instance, a monitor in the monitor summary section of the dashboard will turn red and trigger alerts through Datadog or other paging mechanisms (like Slack or email). High latency may indicate high user demand or inefficient data pipelines, which can slow down response times. By identifying these signals early, teams can quickly respond in real time to maintain high-quality user experiences.

Datadog’s Neuron integration enables tracking of key performance aspects, providing crucial insights for troubleshooting and optimization:

  • NeuronCore counters – Monitoring NeuronCore utilization shows whether cores are being used efficiently and helps you identify when you need to rebalance workloads or otherwise optimize performance.
  • Execution status – You can monitor the progress of training jobs, including completed tasks and failed runs. This data helps confirm that models are training smoothly and reliably. If failures increase, it may signal issues with data quality, model configurations, or resource limitations that need to be addressed.
  • Memory used – You can gain a granular view of memory usage across both the host and Neuron device, including memory allocated for tensors and model execution. This helps you understand how effectively resources are being used, and when it might be time to rebalance workloads or scale resources to prevent bottlenecks from causing disruptions during training.
  • Neuron runtime vCPU usage – You can keep an eye on vCPU utilization to make sure your models aren’t overburdening the infrastructure. When vCPU usage crosses a certain threshold, you will be alerted to decide whether to redistribute workloads or upgrade instance types to avoid training slowdowns.

By consolidating these metrics into one view, Datadog provides a powerful tool for maintaining efficient, high-performance Neuron workloads, helping teams identify issues in real time and optimize infrastructure as needed. Using the Neuron integration combined with Datadog’s LLM Observability capabilities, you can gain comprehensive visibility into your large language model (LLM) applications.

Get started with Datadog, Inferentia, and Trainium

Datadog’s integration with Neuron provides real-time visibility into Trainium and Inferentia, helping you optimize resource utilization, troubleshoot issues, and achieve seamless performance at scale. To get started, see AWS Inferentia and AWS Trainium Monitoring.

To learn more about how Datadog integrates with Amazon ML services and Datadog LLM Observability, see Monitor Amazon Bedrock with Datadog and Monitoring Amazon SageMaker with Datadog.

If you don’t already have a Datadog account, you can sign up for a free 14-day trial today.


About the Authors

Curtis Maher is a Product Marketing Manager at Datadog, focused on the platform’s cloud and AI/ML integrations. Curtis works closely with Datadog’s product, marketing, and sales teams to coordinate product launches and help customers observe and secure their cloud infrastructure.

Anjali Thatte is a Product Manager at Datadog. She currently focuses on building technology to monitor AI infrastructure and ML tooling and helping customers gain visibility across their AI application tech stacks.

Jason Mimick is a Senior Partner Solutions Architect at AWS working closely with product, engineering, marketing, and sales teams daily.

Anuj Sharma is a Principal Solution Architect at Amazon Web Services. He specializes in application modernization with hands-on technologies such as serverless, containers, generative AI, and observability. With over 18 years of experience in application development, he currently leads co-building with containers and observability focused AWS Software Partners.

Read More

Create a virtual stock technical analyst using Amazon Bedrock Agents

Create a virtual stock technical analyst using Amazon Bedrock Agents

Stock technical analysis questions can be as unique as the individual stock analyst themselves. Queries often have multiple technical indicators like Simple Moving Average (SMA), Exponential Moving Average (EMA), Relative Strength Index (RSI), and others. Answering these varied questions would mean writing complex business logic to unpack the query into parts and fetching the necessary data. With the number of indicators available, the possibility of having one or many of them in any combination, and each of those indicators over different time periods, it can get quite complex to build such a business logic into code.

As AI technology continues to evolve, the capabilities of generative AI agents continue to expand, offering even more opportunities for you to gain a competitive edge. At the forefront of this evolution sits Amazon Bedrock, a fully managed service that makes high-performing foundation models (FMs) from Amazon and other leading AI companies available through a single API. With Amazon Bedrock, you can build and scale generative AI applications with security, privacy, and responsible AI. Amazon Bedrock Agents plans and runs multistep tasks using company systems and data sources—from answering customer questions about your product availability to taking their orders. With Amazon Bedrock, you can create an agent in just a few quick steps by first selecting an FM and providing it access to your enterprise systems, knowledge bases, and actions to securely execute your APIs. These actions can be implemented in the cloud using AWS Lambda, or you can use local business logic with return of control. An agent analyzes the user request and automatically calls the necessary APIs and data sources to fulfill the request. Amazon Bedrock Agents offers enhanced security and privacy—no need for you to engineer prompts, manage session context, or manually orchestrate tasks.

In this post, we create a virtual analyst that can answer natural language queries of stocks matching certain technical indicator criteria using Amazon Bedrock Agents. As part of the agent, we configure action groups consisting of Lambda functions, which gives the agent the ability to perform various actions. The Amazon Bedrock agent will transform the user natural language query into relevant Lambda calls, passing the technical indicators and their needed duration. Lambda will access the open source stock data pre-fetched into an Amazon Simple Storage Service (Amazon S3) bucket, calculate the technical indicator in real time, and pass it back to the agent. The agent will take further actions like other Lambda calls or filtering and ordering based on the task.

Solution overview

The technical analysis assistant solution will use Amazon Bedrock Agents to answer natural language technical analysis queries, from simple ones like “Can you give me a list of stocks in the NASDAQ 100 index” to complex ones like “Which stocks in the NASDAQ 100 index have both grown over 10% in the last 6 months and also closed above their 20-day SMA?” Agents orchestrate and analyze the task and break it down into the correct logical sequence using the FM’s reasoning abilities. Agents automatically call the necessary Lambda functions to fetch the relevant stock technical analysis data, determining along the way if they can proceed or if they need to gather more information.

The following diagram illustrates the solution architecture.

The workflow consists of the following steps:

  1. The solution spins up a Python-based Lambda function that fetches the daily stock data for the last one year using the yfinance package. The Lambda function is triggered to run every day using an Amazon EventBridge rule. The function puts the last 1 year of stock data into an S3 bucket.
  2. A user asks a natural language query, like “Can you give me a list of stocks in NASDAQ 100 index” or “Which stocks have closed over both 20 SMA and 50 EMA in the FTSE 100 index?”
  3. This query is passed to the Amazon Bedrock agent powered by Anthropic’s Claude 3 Sonnet. The agent deconstructs the user query, creates an action plan, and executes it step-by-step to fetch various data needed to answer the question. To fetch the needed data, the agent has three action groups, each powered by a Lambda function that can use the raw data stored in the S3 bucket to calculate technical indicators and other stock-related information. Based on the response from the action group and the agent’s plan of action, the agent will continue to make calls or take other actions like filtering or summarizing until it arrives at the answer to the question. The action groups are as follows:
    1. get-index – Get the stock symbol of constituents for a given index. The example currently has constituents configured for Nasdaq 100, FTSE 100, and Nifty 50 indexes.
    2. get-stock-change – For a given stock or list of stocks, calculate the change percentage over a given period based on the pre-fetched raw data in Amazon S3. This solution currently is configured to have data of the past 1 year.
    3. get-technical-analysis – For a given stock or list of stocks, calculate the given technical indicator for a given time period. It also fetches the last closing price of the stock based on the pre-fetched raw data in Amazon S3. This solution currently is configured to handle SMA, EMA, and RSI technical indicators for up to 1 year.

Prerequisites

To set up this solution, you need a basic knowledge of AWS and the relevant AWS services. Additionally, request model access on Amazon Bedrock for Anthropic’s Claude 3 Sonnet.

Deploy the solution

Complete the following steps to deploy the solution using AWS CloudFormation:

  1. Launch the CloudFormation stack in the us-east-1 AWS Region:

Launch Stack

  1. For Stack name, enter a stack name of your choice.
  2. Leave the rest as defaults.
  3. Choose Next.
  4. Choose Next
  5. Select the acknowledgement check box and choose Submit.

  1. Wait for the stack creation to complete.
  2. Verify all resources are created on the stack details page.

The CloudFormation stack creates the solution described in the solution overview and the following key resources:

  • StockDataS3Bucket – The S3 bucket to store the 1-year stock data.
  • YfinDailyLambda – A Python Lambda function to fetch the last 1 year of stock data for the Nasdaq 100, FTSE 100, and Nifty 50 indexes from Yahoo Finance using the yfinance package:
# Example call to Yahoo Finance to get stock history
import yfinance as yf
stock_ticker = yf.Ticker("AAPL")  # stock symbol
stock_history = stock_ticker.history(period="1y")  # time duration to fetch, e.g. 1 year
  • YfinDailyLambdaScheduleRule – An EventBridge rule to trigger the YfinDailyLambda function daily to get the latest stock data.
  • InvokeYfinDailyLambda and InvokeLambdaFunction – A custom CloudFormation resource and its Lambda function to invoke the YfinDailyLambda function as part of the stack creation to fetch the initial data.
  • GetIndexLambda – This function takes in an index name as input and returns the list of stocks in the given index.
  • GetStockChangeLambda – This function takes a list of stocks and number of days as input, fetches the stock data from the S3 bucket, calculates the percentage change over the period for the stocks, and returns the data.
  • GetStockTechAnalysisLambda – This function takes a list of stocks, number of days, and a technical indicator as input and returns the last close and the technical indicator over the number of days for the given list of stocks. For example:
# Sample code to calculate Simple Moving Average, a technical indicator
from ta.trend import SMAIndicator
# close_prices: pandas Series of closing prices; window: SMA period in days
indicator_ta = SMAIndicator(close=close_prices, window=20)
stock_SMA = indicator_ta.sma_indicator()
  • StockBotAgent – The Amazon Bedrock agent created with Anthropic’s Claude 3 Sonnet model with three action groups, each mapped to a Lambda function. We give the agent instructions in natural language. In our solution, part of our instruction is that “You can fetch the list of stocks in a given index” so the agent knows it can fetch stocks in an index. We configure the action groups as part of the agent and use the OpenAPI 3 schema standard to describe the Lambda functionality, so the agent understands when and how to invoke the Lambda functions. The following is a snippet of the get-index action group OpenAPI schema where we describe its functionality, its input and output parameters, and format:
{
  "openapi": "3.0.0",
  "info": {
    "title": "Get Index list of stocks api",
    "version": "1.0.0",
    "description": "API to fetch the list of stocks in a given index"
  },
  "paths": {
    "/get-index": {
      "get": {
        "summary": "Get list of stock symbols in index",
        "description": "Based on provided index, return list of stock symbols in the index",
        "operationId": "getIndex",
        "parameters": [
          {
            "name": "indexName",
            "in": "path",
            "description": "Index Name",
            "required": true,
            "schema": {
              "type": "string"
            }
          }
        ],
        "responses": {
          "200": {
            "description": "Get Index stock list",
            "content": {
              "application/json": {
                "schema": {
                  "type": "array",

Test the solution

To test the solution, complete the following steps:

  1. On the Amazon Bedrock console, choose Agents in the navigation pane.
  2. Choose the agent created by the CloudFormation stack.

  1. Under Test, choose the alias with the name Version 1 and expand the Test pane.

Now you can enter questions to interact with the agent.

  1. Let’s start with the query “Can you give me a list of stocks in Nasdaq.”

You can see the answer in the following screenshot. Expand the Trace Step section in the right pane to see the agent’s rationale and the call to the Lambda function.

  1. Now let’s ask a question that is likely to use all three action groups and their Lambda functions: “Can you give list of stocks that has both grown over 10% in last 6 months and also closed above their 20-day SMA. Use stocks from the Nasdaq index.”

You will get the response shown in the following screenshot, and in the trace steps, you will see the various Lambda functions being invoked at different steps as the agent reasons through the steps to get to the answer to the question.

You can further test with additional prompts, such as:

  • Can you give the top three gainers in terms of percentage in the last 6 months in the Nifty 50 index?
  • Which stocks have closed over both 20 SMA and 50 EMA in the FTSE 100 index?
  • Can you give list of stocks that has grown over 10% in last 6 months and closed above 20-day SMA and 50-day EMA. Use stocks from the FTSE 100 index. Follow up question: Of these stocks, are there any that have grown over 25% in the last months? If so, can you give me the stocks and their growth percent over 6 months?

Programmatically invoke the agent

When you’re satisfied with the performance of your agent, you can build an application to programmatically invoke the agent using the InvokeAgent API. The agent ID and agent alias ID needed to invoke the agent alias programmatically can be found on the Outputs tab of the CloudFormation stack, titled AgentId and AgentAliasId, respectively. To learn more about programmatic invocation, refer to the Python example and JavaScript example.
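A minimal sketch of such a programmatic invocation with boto3 is shown below; the agent ID and alias ID placeholders come from the stack outputs, and the session ID is your own choice.

import uuid

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

response = bedrock_agent_runtime.invoke_agent(
    agentId="<AgentId from the stack outputs>",
    agentAliasId="<AgentAliasId from the stack outputs>",
    sessionId=str(uuid.uuid4()),  # reuse the same ID to keep conversational context
    inputText="Can you give me a list of stocks in the Nifty 50 index?",
)

# The response is an event stream; concatenate the returned chunks
completion = ""
for event in response["completion"]:
    chunk = event.get("chunk")
    if chunk:
        completion += chunk["bytes"].decode("utf-8")
print(completion)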

Clean up

To avoid charges in your AWS account, clean up the solution’s provisioned resources:

  1. On the Amazon S3 console, empty the S3 bucket created as part of the CloudFormation stack. The bucket name should start with the stack name you entered when creating the CloudFormation stack. You can also check the name of the bucket on the CloudFormation stack’s Resources tab.
  2. On the AWS CloudFormation console, select the stack you created for this solution and choose Delete.

Conclusion

In this post, we showed how you can use Amazon Bedrock Agents to carry out complex tasks that need multiple step orchestration through just natural language instructions. With agents, you can automate tasks for your customers and answer questions for them. We encourage you to explore the Amazon Bedrock Agents User Guide to understand its capabilities further and use it for your use cases.


About the authors

Bharath Sridharan is a Senior Technical Account Manager at AWS and works with strategic AWS customers to proactively optimize their workloads on AWS. Bharath additionally specializes in AWS machine learning services, with a focus on generative AI.

Read More

Apply Amazon SageMaker Studio lifecycle configurations using AWS CDK

Apply Amazon SageMaker Studio lifecycle configurations using AWS CDK

This post serves as a step-by-step guide on how to set up lifecycle configurations for your Amazon SageMaker Studio domains. With lifecycle configurations, system administrators can apply automated controls to their SageMaker Studio domains and their users. We cover core concepts of SageMaker Studio and provide code examples of how to apply lifecycle configuration to your SageMaker Studio domain to automate behaviors such as preinstallation of libraries and automated shutdown of idle kernels.

Amazon SageMaker Studio is the first integrated development environment (IDE) purposefully designed to accelerate end-to-end machine learning (ML) development. Amazon SageMaker Studio provides a single web-based visual interface where data scientists create dedicated workspaces to perform all ML development steps required to prepare data and build, train, and deploy models. You can create multiple Amazon SageMaker domains, which define environments with dedicated data storage, security policies, and networking configurations. With your domains in place, you can then create domain user profiles, which serve as an access point for data scientists to enter the workspace with user-defined least-privilege permissions. Data scientists use their domain user profiles to launch private or shared Amazon SageMaker Studio spaces to manage the storage and resource needs of the IDEs they use to tackle different ML projects.

To effectively manage and govern both user profiles and domains with SageMaker Studio, you can use Amazon SageMaker Studio lifecycle configurations. This feature allows you, for instance, to install custom packages, configure notebook extensions, preload datasets, set up code repositories, or automatically shut down idle notebook kernels. Amazon SageMaker Studio now also supports configuring idle kernel shutdown directly in the user interface for JupyterLab and Code Editor applications that use Amazon SageMaker Distribution image version 2.0 or newer.

These automations can greatly decrease overhead related to ML project setup, facilitate technical consistency, and save costs related to running idle instances. SageMaker Studio lifecycle configurations can be deployed on two different levels: on the domain level (all users in a domain are affected) or on the user level (only specific users are affected).

In this post, we demonstrate how you can configure custom lifecycle configurations for SageMaker Studio to manage your own ML environments efficiently at scale.

Solution overview

The solution constitutes a best-practice Amazon SageMaker domain setup with a configurable list of domain user profiles and a shared SageMaker Studio space using the AWS Cloud Development Kit (AWS CDK). The AWS CDK is a framework for defining cloud infrastructure as code.

In addition, we demonstrate how to implement two different use cases of SageMaker Studio lifecycle configurations: 1) automatic installation of Python packages and 2) automatic shutdown of idle kernels. Both are deployed and managed with AWS CDK custom resources. These are powerful, low-level, and highly customizable AWS CDK constructs that can be used to manage the behavior of resources at creation, update, and deletion events. We use Python as the main language for our AWS CDK application, but the code can be easily translated to other AWS CDK supported languages. For more information, refer to Work with the AWS CDK library.

The following architecture diagram captures the main infrastructure that is deployed by the AWS CDK, typically carried out by a DevOps engineer. The domain administrator defines the configuration of the Studio domain environment, which also includes the selection of Studio lifecycle configurations to include in the deployment of the infrastructure. After the infrastructure is provisioned, data scientists can access the SageMaker Studio IDE through their domain user profiles in the SageMaker console.

After the data scientists access the IDE, they can select from a variety of available applications, including JupyterLab and Code Editor, and run the provisioned space. In this solution, a JupyterLab space has been included in the infrastructure stack. Upon opening JupyterLab, data scientists can immediately start tackling their development work, which includes retrieving or dumping data on Amazon Simple Storage Service (Amazon S3), developing ML models, and pushing changes to their code repository. If multiple data scientists are working on the same project, they can access the shared Studio spaces using their domain user profiles to foster collaboration. The main Python libraries are already installed by the Studio lifecycle configuration, saving time for value-generating tasks. As the data scientists complete their daily work, their spaces will be automatically shut down by the Studio lifecycle configuration.

Architecture Diagram

High-level architecture diagram of this solution, which includes a SageMaker Studio domain, user profiles, and two Studio lifecycle configurations.

Prerequisites

To get started, make sure you fulfill the following prerequisites:

Clone the GitHub repository

First, clone this GitHub repository.

Upon cloning the repository, you can observe a classic AWS CDK project setup, where app.py is the CDK entry point script that deploys two stacks in sequence. The first stack (NetworkingStack) deploys the networking infrastructure and the second stack (SageMakerStudioStack) deploys the domain, user profiles and spaces. The application logic is covered by AWS Lambda functions which are found under the source directory.

The AWS CDK stacks

In the following subsections we elaborate on the provisioned resources for each of the two CDK stacks.

Virtual private cloud (VPC) setup for the NetworkingStack

The NetworkingStack deploys all the necessary networking resources and builds the foundation of the application. This includes a VPC with a public and a private subnet and a NAT gateway to enable connection from instances in the private subnet to AWS services (for example, to Amazon SageMaker). The SageMaker Studio domain is deployed into the VPC’s private subnet, shielded from direct internet access for enhanced security. The stack also includes security groups to control traffic within the VPC and a custom resource to delete security groups when destroying the infrastructure via CDK. We elaborate on custom resources in the subsection CustomResource class.
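As a rough sketch of the networking setup described above (construct names, availability zone count, and CIDR masks are illustrative assumptions rather than the repository’s exact values):

from aws_cdk import aws_ec2 as ec2
from constructs import Construct

def build_vpc(scope: Construct) -> ec2.Vpc:
    # One public subnet (hosting the NAT gateway) and one private subnet with egress
    return ec2.Vpc(
        scope,
        "StudioVpc",
        max_azs=1,
        nat_gateways=1,
        subnet_configuration=[
            ec2.SubnetConfiguration(name="public", subnet_type=ec2.SubnetType.PUBLIC, cidr_mask=24),
            ec2.SubnetConfiguration(name="private", subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS, cidr_mask=24),
        ],
    )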

SageMaker Studio domain and user profiles for SageMakerStudioStack

The SageMakerStudioStack is deployed on top of the NetworkingStack and captures project-specific resources. This includes the domain user profiles and the name of the workspace. By default, it creates one workspace called “project1” with three users called “user1”, “user2”, and “user3”. The SageMakerStudioStack is instantiated, as shown in the following code example.

SagemakerStudioStack(
    env=env,
    scope=app,
    construct_id="SageMakerStudioStack",
    domain_name="sagemaker-domain",
    vpc_id=networking_stack.vpc_id,
    subnet_ids=networking_stack.subnet_ids,
    security_group_id=networking_stack.security_group_id,
    workspace_id="project1",
    user_ids=[
        "user1",
        "user2",
        "user3",
    ],
)

You can adjust the names according to your own requirements and even deploy multiple SageMaker Studio domains by instantiating multiple objects of the SageMakerStudioStack class in your CDK app.py script.

Apply the lifecycle configurations

The application of lifecycle configurations in this solution relies on CDK custom resources which are powerful constructs that allow you to deploy and manage highly bespoke infrastructure components to fit your specific needs. To facilitate the usage of these components, this solution comprises a general CustomResource class that is inherited by the following five CustomResource subclasses:

  • InstallPackagesCustomresource subclass: Installs the required packages automatically when launching JupyterLab within SageMaker Studio using lifecycle configurations.
  • ShutdownIdleKernelsCustomResource subclass: Shuts down idle kernels after the user-specified time window (default 1 hour) using lifecycle configurations.
  • EFSCustomResource subclass: Deletes the Elastic File System (EFS) of SageMaker Studio when destroying the infrastructure.
  • StudioAppCustomResource subclass: Deletes the JupyterLab application for each user profile when destroying the infrastructure.
  • VPCCustomResource subclass: Deletes the security groups when destroying the infrastructure.

Note that only the first two subclasses in the list implement lifecycle configurations; the other subclasses serve unrelated cleanup purposes. This code structure allows you to easily define your own custom resources following the same pattern. In the following subsections we dive deeper into how custom resources work and walk through a specific example.

The CustomResource class

The CDK CustomResource class is composed of three key elements: a Lambda function that contains the logic for the Create, Update, and Delete cycles; a Provider that manages the creation of the Lambda function and its IAM role; and the custom resource itself, which references the Provider and carries the properties passed to the Lambda function. The class definition is illustrated below and can be found in the repository under stacks/sagemaker/constructs/custom_resources/CustomResource.py.

from aws_cdk import (
  aws_iam as iam,
  aws_lambda as lambda_,
  aws_logs as logs,
)
import aws_cdk as cdk
from aws_cdk.custom_resources import Provider
from constructs import Construct
import os
from typing import Dict

class CustomResource(Construct):
    def __init__(
        self,
        scope: Construct,
        construct_id: str,
        properties: Dict,
        lambda_file_name: str,
        iam_policy: iam.PolicyStatement,
        **kwargs,
    ) -> None:
        super().__init__(scope, construct_id, **kwargs)

        on_event_lambda_fn = lambda_.Function(
            self,
            "EventLambda",
            runtime=lambda_.Runtime.PYTHON_3_12,
            handler="index.on_event_handler",
            code=lambda_.Code.from_asset(
                os.path.join(os.getcwd(), "src", "lambda", lambda_file_name)
            ),
            initial_policy=[iam_policy],
            timeout=cdk.Duration.minutes(3),
        )
        is_complete_lambda_fn = lambda_.Function(
            self,
            "CompleteLambda",
            runtime=lambda_.Runtime.PYTHON_3_12,
            handler="index.is_complete_handler",
            code=lambda_.Code.from_asset(
                os.path.join(os.getcwd(), "src", "lambda", lambda_file_name)
            ),
            initial_policy=[iam_policy],
            timeout=cdk.Duration.minutes(3),
        )

        provider = Provider(
            self,
            "Provider",
            on_event_handler=on_event_lambda_fn,
            is_complete_handler=is_complete_lambda_fn,
            total_timeout=cdk.Duration.minutes(10),
            log_retention=logs.RetentionDays.ONE_DAY,
        )

        cdk.CustomResource(
            self,
            "CustomResource",
            service_token=provider.service_token,
            properties={
                **properties,
                "on_event_lambda_version": on_event_lambda_fn.current_version.version,
                "is_complete_lambda_version": is_complete_lambda_fn.current_version.version,
            },
        )

The InstallPackagesCustomResource subclass

This subclass inherits from the CustomResource to deploy the lifecycle configurations for SageMaker Studio to automatically install Python packages within JupyterLab environments. The lifecycle configuration is defined on the domain level to cover all users at once. The subclass definition is illustrated below and can be found in the repository under stacks/sagemaker/constructs/custom_resources/InstallPackagesCustomResource.py.

from aws_cdk import (
    aws_iam as iam,
)
from constructs import Construct
from stacks.sagemaker.constructs.custom_resources import CustomResource


class InstallPackagesCustomResource(CustomResource):
    def __init__(
        self,
        scope: Construct,
        construct_id: str,
        domain_id: str,
    ) -> None:
        super().__init__(
            scope,
            construct_id,
            properties={
                "domain_id": domain_id,
                "package_lifecycle_config": f"{domain_id}-package-lifecycle-config",
            },
            lambda_file_name="lcc_install_packages_lambda",
            iam_policy=iam.PolicyStatement(
                effect=iam.Effect.ALLOW,
                actions=[
                    "sagemaker:CreateStudioLifecycleConfig",
                    "sagemaker:DeleteStudioLifecycleConfig",
                    "sagemaker:Describe*",
                    "sagemaker:List*",
                    "sagemaker:UpdateDomain",
                ],
                resources=["*"],
            ),
        )

The code for the AWS Lambda function used for the custom resources is stored in the repository under src/lambda/lcc_install_packages_lambda/index.py. During the Create event, the Lambda function uses the Boto3 client method create_studio_lifecycle_config to create the lifecycle configuration. In a consecutive step, it uses the update_domain method to update the configuration of the domain to attach the created lifecycle configuration. During the Update event, the lifecycle configuration is deleted and recreated, because lifecycle configurations can’t be modified in place after they’re provisioned. During the Delete event, the delete_studio_lifecycle_config method is called to remove the lifecycle configuration. The lifecycle configuration itself is a shell script that is executed once deployed into the domain. As an example, the content of the install packages script is displayed below.

#!/bin/bash
set -eux

# Packages to install
pip install --upgrade darts pip-install-test

In this example, two packages are automatically installed for every new kernel instance provisioned by a Studio user: darts and pip-install-test. You can modify and extend this list of packages to fit your own requirements.

The source code for the idle kernel shutdown lifecycle configuration follows the same design principle and is stored in the repository under src/lambda/lcc_shutdown_idle_kernels_lambda/index.py. The main difference between the two Studio lifecycle configurations is the content of the bash scripts, which in this case was referenced from sagemaker-studio-lifecycle-config-examples.
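To make the Create-event flow described above more concrete, the following is a simplified sketch of the boto3 calls involved; the function name and the exact update_domain payload are assumptions, and the repository’s Lambda additionally handles the Update and Delete events.

import base64

import boto3

sagemaker = boto3.client("sagemaker")

def create_and_attach_lcc(domain_id: str, config_name: str, script: str) -> None:
    # Lifecycle configuration content must be the base64-encoded shell script text
    response = sagemaker.create_studio_lifecycle_config(
        StudioLifecycleConfigName=config_name,
        StudioLifecycleConfigContent=base64.b64encode(script.encode("utf-8")).decode("utf-8"),
        StudioLifecycleConfigAppType="JupyterLab",
    )
    # Attach the new lifecycle configuration to the domain's JupyterLab settings
    sagemaker.update_domain(
        DomainId=domain_id,
        DefaultUserSettings={
            "JupyterLabAppSettings": {
                "LifecycleConfigArns": [response["StudioLifecycleConfigArn"]]
            }
        },
    )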

Deploy the AWS CDK stacks

To deploy the AWS CDK stacks, run the following commands in the location where you cloned the repository. Depending on your path configurations, the command may be python instead of python3.

  1. Create a virtual environment:
    1. For macOS/Linux, use python3 -m venv .cdk-venv
    2. For Windows, use python3 -m venv .cdk-venv
  2. Activate the virtual environment:
    1. For macOS/Linux, use source .cdk-venv/bin/activate
    2. For Windows, use .cdk-venv/Scripts/activate.bat
    3. For PowerShell, use .cdk-venv/Scripts/activate.ps1
  3. Install the required dependencies:
    1. pip install -r requirements.txt
    2. pip install -r requirements-dev.txt
  4. (Optional) Synthesize the AWS CloudFormation template for this application: cdk synth
  5. Deploy the solution with the following commands:
    1. aws configure
    2. cdk bootstrap
    3. cdk deploy --all (alternatively, you can deploy the two stacks individually using cdk deploy <StackName>)

When the stacks are successfully deployed, you’ll be able to view the deployed stacks in the AWS CloudFormation console, as shown below.

You’ll also be able to view the Studio domain and the Studio lifecycle configurations on the SageMaker console, as shown in the following screenshots.

Choose one of the lifecycle configurations to view the shell code and its configuration details, as follows.

To make sure your lifecycle configuration is included in your space, launch SageMaker Studio from your user profile, navigate to JupyterLab, and choose the provisioned space. You can then select a lifecycle configuration that is associated with your domain or user profile and activate it, as shown below.

After you run the space and open JupyterLab, you can validate the functionality. In the example shown in the following screenshot, the preinstalled package can be imported directly.

Optional: How to attach Studio lifecycle configurations manually

If you want to manually attach a lifecycle configuration to an already existing domain, perform the following steps:

  1. On the SageMaker console, choose Domains in the navigation pane.
  2. Choose the domain name you’re using and the current user profile, then choose Edit.
  3. Select the lifecycle configuration you want to use and choose Attach, as shown in the following screenshot.

From here, you can also set it as default.

Clean up

Complete the steps in this section to remove all your provisioned resources from your environment.

Delete the AWS CDK stacks

When you’re done with the resources you created, you can destroy your AWS CDK stack by running the following command in the location where you cloned the repository:

cdk destroy --all

When asked to confirm the deletion of the stack, enter yes.

You can also delete the stack on the AWS CloudFormation console with the following steps:

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Choose the stack that you want to delete.
  3. In the stack details pane, choose Delete.
  4. Choose Delete stack when prompted.

User profile applications can sometimes take several minutes to delete, which can interfere with the deletion of the stack. If you run into any errors during stack deletion, you may have to manually delete the user profile apps and retry.

Conclusion

In this post, we described how customers can deploy a SageMaker Studio domain with automated lifecycle configurations to control their SageMaker resources. Lifecycle configurations are based on custom shell scripts to perform automated tasks and can be deployed with AWS CDK Custom Resources.

Whether you are already using SageMaker domains in your organization or starting out on your SageMaker adoption, effectively managing lifecycles on your SageMaker Studio domains will greatly improve the productivity of your data science team and alleviate administrative work. By implementing the described steps, you can streamline your workflows, reduce operational overhead, and empower your team to focus on driving insights and innovation.


About the Authors

Gabriel Rodriguez Garcia is a Machine Learning Engineer at AWS Professional Services in Zurich. In his current role, he has helped customers achieve their business goals on a variety of ML use cases, ranging from setting up MLOps inference pipelines to developing generative AI applications.

Gabriel Zylka is a Machine Learning Engineer within AWS Professional Services. He works closely with customers to accelerate their cloud adoption journey. Specializing in the MLOps domain, he focuses on productionizing ML workloads by automating end-to-end ML lifecycles and helping to achieve desired business outcomes.

Krithi Balasubramaniyan is a Principal Consultant at AWS. He enables global enterprise customers in their digital transformation journeys and helps architect cloud native solutions.

Cory Hairston is a Software Engineer with AWS Bedrock. He currently works on providing reusable software solutions.

Gouri Pandeshwar is an Engineering Manager with AWS Bedrock. He and his team of engineers are working to build reusable solutions and frameworks that help accelerate adoption of AWS AI/ML services for customers’ business use cases.

Read More