December 2024 – Page 8

How Twitch used agentic workflow with RAG on Amazon Bedrock to supercharge ad sales

Twitch, the world’s leading live-streaming platform, has over 105 million average monthly visitors. Because Twitch is part of Amazon, its advertising is handled by the ad sales organization at Amazon. New ad products across diverse markets involve a complex web of announcements, training, and documentation, making it difficult for sales teams to find precise information quickly. In early 2024, Amazon launched a major push to harness the power of Twitch for advertisers globally, which required ramping up Twitch knowledge across all of Amazon ad sales. The task was especially challenging for internal sales support teams. With a ratio of over 30 sellers per specialist, questions posed in public channels took an average of 2 hours to receive an initial reply, and 20% of questions were never answered. All in all, the entire process from an advertiser’s request to the first campaign launch could stretch up to 7 days.

In this post, we demonstrate how we innovated to build a Retrieval Augmented Generation (RAG) application with agentic workflow and a knowledge base on Amazon Bedrock. We implemented the RAG pipeline in a Slack chat-based assistant to empower the Amazon Twitch ads sales team to move quickly on new sales opportunities. We discuss the solution components to build a multimodal knowledge base, drive agentic workflow, use metadata to address hallucinations, and also share the lessons learned through the solution development using multiple large language models (LLMs) and Amazon Bedrock Knowledge Bases.

Solution overview

A RAG application combines an LLM with a specialized knowledge base to help answer domain-specific questions. We developed an agentic workflow with RAG solution that revolves around a centralized knowledge base that aggregates Twitch internal marketing documentation. This content is then transformed into a vector database optimized for efficient information retrieval. In the RAG pipeline, the retriever taps into this vector database to surface relevant information, and the LLM generates tailored responses to Twitch user queries submitted through a Slack assistant. The solution architecture is presented in the following diagram.

The key architectural components driving this solution include:

Data sources – A centralized repository containing marketing data aggregated from various sources such as wikis and slide decks, using web crawlers and periodic refreshes
Vector database – The marketing contents are first embedded into vector representations using Amazon Titan Multimodal Embeddings G1 on Amazon Bedrock, which can handle both text and image data. These embeddings are then stored in an Amazon Bedrock knowledge base.
Agentic workflow – The agent acts as an intelligent dispatcher. It evaluates each user query to determine the appropriate course of action, whether refusing to answer off-topic queries, tapping into the LLM, or invoking APIs and data sources such as the vector database. The agent uses chain-of-thought (CoT) reasoning, which breaks down complex tasks into a series of smaller steps then dynamically generates prompts for each subtask, combines the results, and synthesizes a final coherent response.
Slack integration – A message processor was implemented to interface with users through a Slack assistant using an AWS Lambda function, providing a seamless conversational experience.
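
The retrieval step behind the vector database component can be illustrated with a small, self-contained sketch. It is not the Amazon Bedrock Knowledge Bases implementation: the three-dimensional vectors, the document names, and the retrieve helper are hypothetical stand-ins for real embeddings and the managed retriever, used only to show how a query vector is matched against stored vectors by cosine similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vector, index, top_k=2):
    """Return the ids of the top_k documents most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), doc_id)
        for doc_id, vector in index.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:top_k]]


# Hypothetical toy embeddings; real embeddings have many more dimensions.
index = {
    "premium_video_wiki": [0.9, 0.1, 0.0],
    "audience_deck": [0.1, 0.9, 0.2],
    "onboarding_guide": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.1]
top_docs = retrieve(query, index)
```

In a production pipeline the managed knowledge base performs both the embedding and the similarity search, so application code would not normally contain this logic.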

Lessons learned and best practices

The process of designing, implementing, and iterating a RAG application with agentic workflow and a knowledge base on Amazon Bedrock produced several valuable lessons.

Processing multimodal source documents in the knowledge base

An early problem we faced was that Twitch documentation is scattered across the Amazon internal network. Not only is there no centralized data store, but there is also no consistency in the data format. Internal wikis contain a mixture of images and text, and training materials for sales agents are often in the form of PowerPoint presentations. To make our chat assistant as effective as possible, we needed to consolidate all of this information into a single repository the LLM could understand.

The first step was building a wiki crawler that uploaded all the relevant Twitch wikis and PowerPoint slide decks to Amazon Simple Storage Service (Amazon S3). We used that as the source to create a knowledge base on Amazon Bedrock. To handle the combination of images and text in our data source, we used the Amazon Titan Multimodal Embeddings G1 model. For documents containing specific information such as demographic context, we summarized multiple slides to make sure this information is included in the final context passed to the LLM.

In total, our knowledge base contains over 200 documents. Amazon Bedrock knowledge bases are easy to amend, and we routinely add and delete documents based on changing wikis or slide decks. Our knowledge base is queried intermittently every day, and metrics, dashboards, and alarms are inherently supported in Amazon Web Services (AWS) through Amazon CloudWatch. These tools provide complete transparency into the health of the system and allow fully hands-off operation.

Agentic workflow for a wide range of user queries

As we observed our users interact with our chat assistant, we noticed that there were some questions the standard RAG application couldn’t answer. Some of these questions were overly complex, with multiple questions combined, some asked for deep insights into Twitch audience demographics, and some had nothing to do with Twitch at all.

Because the standard RAG solution could only answer simple questions and couldn’t handle all these scenarios gracefully, we invested in an agentic workflow with RAG solution. In this solution, an agent breaks down the process of answering questions into multiple steps, and uses different tools to answer different types of questions. We implemented an XML agent in LangChain, choosing XML because the Anthropic Claude models available in Amazon Bedrock are extensively trained on XML data. In addition, we engineered our prompts to instruct the agent to adopt a specialized persona with domain expertise in advertising and the Twitch business realm. The agent breaks down queries, gathers relevant information, analyzes context, and weighs potential solutions.

The flow for our chat agent is shown in the following diagram. In this flow, when the agent reads a user question, the first step is to decide whether the question is related to Twitch; if it isn’t, the agent politely refuses to answer. If the question is related to Twitch, the agent ‘thinks’ about which tool is best suited to answer the question. For instance, if the question is related to audience forecasting, the agent invokes the Amazon internal Audience Forecasting API. If the question is related to Twitch advertisement products, the agent invokes its advertisement knowledge base. Once the agent fetches the results from the appropriate tool, it considers the results and decides whether it now has enough information to answer the question. If it doesn’t, the agent invokes its toolkit again (up to a maximum of 3 attempts) to gain more context. Once it’s finished gathering information, the agent generates a final response and sends it to the user.
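
The control flow described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the LangChain XML agent itself: every helper below (is_twitch_related, choose_tool, run_tool, has_enough_context) is a hypothetical stand-in for a decision the LLM makes, the tool outputs are placeholders, and the limit of 3 is interpreted here as 3 tool calls in total.

```python
MAX_TOOL_ATTEMPTS = 3  # assumption: the stated maximum counts all tool calls


def is_twitch_related(question):
    # Stand-in for the LLM's relevance judgment.
    return "twitch" in question.lower()


def choose_tool(question):
    # Stand-in for the LLM's choice between the two tools.
    if "forecast" in question.lower():
        return "forecasting_api"
    return "ad_knowledge_base"


def run_tool(tool_name, question):
    # Stub tools that return placeholder context instead of real data.
    placeholder_results = {
        "forecasting_api": "forecast: (placeholder audience numbers)",
        "ad_knowledge_base": "doc: (placeholder product overview)",
    }
    return placeholder_results[tool_name]


def has_enough_context(context):
    # Stand-in for the LLM deciding that it can answer.
    return len(context) >= 1


def answer(question):
    """Refuse off-topic questions, otherwise gather context and respond."""
    if not is_twitch_related(question):
        return "Sorry, I can only help with Twitch-related questions."
    context = []
    for _ in range(MAX_TOOL_ATTEMPTS):
        tool = choose_tool(question)
        context.append(run_tool(tool, question))
        if has_enough_context(context):
            break
    return f"Answer based on {len(context)} tool result(s): " + "; ".join(context)
```

Bounding the loop keeps latency and token cost predictable even when the agent never judges its context to be sufficient.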

One of the chief benefits of agentic AI is the ability to integrate with multiple data sources. In our case, we use an internal forecasting API to fetch data related to the available Amazon and Twitch audience supply. We also use Amazon Bedrock Knowledge Bases to help with questions about static data, such as features of Twitch ad products. This greatly increased the scope of questions our chatbot could answer, which the initial RAG couldn’t support. The agent is intelligent enough to know which tool to use based on the query. You only need to provide high-level instructions about the tool purpose, and it will invoke the LLM to make a decision. For example,

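from langchain.agents import Tool

# This list is built inside the agent class, so self.product_search and
# self.forecasting_api_search are methods that wrap each data source.
tools = [
  Tool(
    name="twitch_ad_product_tool",
    func=self.product_search,
    description="Use when you need to find information about Twitch ad products.",
  ),
  Tool(
    name="twitch_audience_forecasting_tool",
    func=self.forecasting_api_search,
    description="Use when you need to find forecasting information about the Amazon and Twitch audiences.",
  ),
]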

Even better, LangChain logs the agent’s thought process in CloudWatch. This is what a log statement looks like when the agent decides which tool to use:

Thought: I need to use the twitch_ad_product_tool to find information about Twitch Premium Video. 

3 documents returned from the retrievers: [Overview: Twitch Premium Video ....]

Thought: The documents provide relevant information about the ad product Twitch Premium Video. I have enough context to provide a final answer. 

<final_answer> Twitch Premium Video is a premier Twitch ad product in which .... </final_answer>

The agent helps keep our RAG flexible. Looking towards the future, we plan to onboard additional APIs, build new vector stores, and integrate with chat assistants in other Amazon organizations. This is critical to helping us expand our product, maximizing its scope and impact.

Contextual compression for LLM invocation

During the document retrieval, we found that our internal wikis varied greatly in size. Often a wiki would contain hundreds or even thousands of lines of text, but only a small paragraph was relevant to answering the question. To reduce the size of the context and the number of input tokens sent to the LLM, we used another LLM to perform contextual compression, extracting the relevant portions of the returned documents. Initially, we used Anthropic Claude Haiku because of its superior speed. However, we found that Anthropic Claude Sonnet boosted the result accuracy while being only about 25% slower than Haiku (10 seconds compared with 8 seconds). As a result, we chose Sonnet for our use case because providing the best quality answers to our users is the most important factor. We’re willing to accept an additional 2 seconds of latency, compared with the 2-day turnaround time of the traditional manual process.
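
The idea of contextual compression can be shown with a deliberately simplified sketch. Note the substitution: the solution described here uses an LLM to pick out the relevant passages, whereas the code below swaps in a crude word-overlap filter so that it runs without any model. The wiki text, the compress helper, and the min_overlap threshold are all hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def compress(document, question, min_overlap=2):
    """Keep only paragraphs sharing at least min_overlap words with the question."""
    question_words = tokenize(question)
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    return [
        p for p in paragraphs
        if len(tokenize(p) & question_words) >= min_overlap
    ]


# Hypothetical wiki page with one relevant paragraph among unrelated ones.
wiki_page = (
    "Team history and org chart for the ads group.\n\n"
    "Premium Video ad lengths are listed per placement.\n\n"
    "Office locations and holiday schedule."
)
relevant = compress(wiki_page, "What ad lengths does Premium Video support?")
```

Only the surviving paragraphs would be passed to the answering model, which is what reduces the input token count.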

Address hallucinations by document metadata

As with any RAG solution, our chat assistant occasionally hallucinated incorrect answers. While this is a well-recognized problem with LLMs, it was particularly pronounced in our system because of the complexity of the Twitch advertising domain. Because our users relied on the chatbot’s responses when interacting with their clients, they were reluctant to trust its answers, even though most of them were correct.

We increased the users’ trust by showing them where the LLM was getting its information from for each statement made. This way, if a user is skeptical of a statement, they can check the references the LLM used and read through the authoritative documentation themselves. We achieved this by adding the source URL of the retrieved documents as metadata in our knowledge base, which Amazon Bedrock directly supports. We then instructed the LLM to read the metadata and append the source URLs as clickable links in its responses.

Here’s an example question and answer with citations:

What are the supported ad lengths for the Twitch Premium Video ad product?

Twitch offers the following ad lengths for Premium Video ads:

- Pre-roll (before stream): Up to 30 seconds, full-screen, non-skippable [1]
- Mid-roll (during stream):
  - Up to 30 seconds when purchased through Amazon Demand-Side-Platform (DSP) [1]
  - Up to 60 seconds when purchased directly [2]

Sources:
[1] US - Twitch + OLV Core Narrative (slide 8) - https://ads.amazon.com/cms/contents/9f24a95e
[2] Twitch Premium Video - https://w.amazon.com/TwitchAds/Products/PremiumVideo

Note that the LLM responds with two sources. The first is from a sales training PowerPoint slide deck, and the second is from an internal wiki. For the slide deck, the LLM can provide the exact slide number it pulled the information from. This is especially useful because some decks contain over 100 slides.
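
A minimal sketch of how source metadata can be turned into a numbered citation list follows. It differs from the approach described above in one important way: here the list is assembled deterministically in code, whereas in the solution the LLM is instructed to append the links itself. The document structure, the title and source_url field names, and the example.com URLs are assumptions for illustration only.

```python
def format_with_citations(answer_text, retrieved_docs):
    """Append a numbered source list built from each document's metadata."""
    lines = [answer_text, "", "Sources:"]
    for number, doc in enumerate(retrieved_docs, start=1):
        metadata = doc["metadata"]
        lines.append(f"[{number}] {metadata['title']} - {metadata['source_url']}")
    return "\n".join(lines)


# Hypothetical retrieved documents carrying their source URL as metadata.
docs = [
    {
        "text": "(slide text)",
        "metadata": {"title": "Example deck", "source_url": "https://example.com/deck"},
    },
    {
        "text": "(wiki text)",
        "metadata": {"title": "Example wiki", "source_url": "https://example.com/wiki"},
    },
]
response = format_with_citations("Example answer [1][2]", docs)
```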

After adding citations, our user feedback score noticeably increased. Our favorable feedback rate increased by 40% and overall assistant usage increased by 20%, indicating that users gained more trust in the assistant’s responses due to the ability to verify the answers.

Human-in-the-loop feedback collection

When we launched our chat assistant in Slack, we had a feedback form that users could fill out. This included several questions to rate aspects of the chat assistant on a 1–5 scale. While the data was very rich, hardly anyone used it. After we switched to much simpler thumbs-up and thumbs-down buttons appended to each chatbot answer, which a user could select effortlessly, our feedback rate increased eightfold.

Conclusion

Moving fast is important in the AI landscape, especially because the technology changes so rapidly. Often engineers will have an idea about a new technique in AI and want to test it out quickly. Using AWS services helped us learn fast about what technologies are effective and what aren’t. We used Amazon Bedrock to test multiple foundation models (FMs), including Anthropic Claude Haiku and Sonnet, Meta Llama 3, Cohere embedding models, and Amazon Titan Multimodal Embeddings. Amazon Bedrock Knowledge Bases helped us implement RAG with agentic workflow efficiently without building custom integrations to our various multimodal data sources and data flows. Using dynamic chunking and metadata filtering let us retrieve the needed contents more accurately. All these together allowed us to spin up a working prototype in a few days instead of months. After we deployed the changes to our customers, we continued to adopt Amazon Bedrock and other AWS services in the application.

Since the Twitch Sales Bot launch in February 2024, we have answered over 11,000 questions about the Twitch sales process. In addition, Amazon sellers who used our generative AI solution delivered 25% more Twitch revenue year-to-date when compared with sellers who didn’t, and delivered 120% more revenue when compared to self-service accounts. We will continue expanding our chat assistant’s agentic capabilities, using Amazon Bedrock along with other AWS services, to solve new problems for our users and increase Twitch’s bottom line. We plan to incorporate distinct knowledge bases across Amazon’s portfolio of 1P Publishers like Prime Video, Alexa, and IMDb as a fast, accurate, and comprehensive generative AI solution to supercharge ad sales.

For your own project, you can follow our architecture and adopt a similar solution to build an AI assistant to address your own business challenge.

About the Authors

Bin Xu is a Senior Software Engineer at Amazon Twitch Advertising and holds a Master’s degree in Data Science from Columbia University. As the visionary creator behind TwitchBot, Bin successfully introduced the proof of concept in 2023. Bin is currently leading a team in Twitch Ads Monetization, focusing on optimizing video ad delivery, improving sales workflows, and enhancing campaign performance. He is also leading efforts to integrate AI-driven solutions to further improve the efficiency and impact of Twitch ad products. Outside of his professional endeavors, Bin enjoys playing video games and tennis.

Nick Mariconda is a Software Engineer at Amazon Advertising, focused on enhancing the advertising experience on Twitch. He holds a Master’s degree in Computer Science from Johns Hopkins University. When not staying up to date with the latest in AI advancements, he enjoys getting outdoors for hiking and connecting with nature.

Frank Zhu is a Senior Product Manager at Amazon Advertising, located in New York City. With a background in programmatic ad-tech, Frank helps connect the business needs of advertisers and Amazon publishers through innovative advertising products. Frank has a BS in finance and marketing from New York University and outside of work enjoys electronic music, poker theory, and video games.

Yunfei Bai is a Principal Solutions Architect at AWS. With a background in AI/ML, data science, and analytics, Yunfei helps customers adopt AWS services to deliver business results. He designs AI/ML and data analytics solutions that overcome complex technical challenges and drive strategic objectives. Yunfei has a PhD in Electronic and Electrical Engineering. Outside of work, Yunfei enjoys reading and music.

Cathy Willcock is a Principal Technical Business Development Manager located in Seattle, WA. Cathy leads the AWS technical account team supporting Amazon Ads adoption of AWS cloud technologies. Her team works across Amazon Ads enabling discovery, testing, design, analysis, and deployments of AWS services at scale, with a particular focus on innovation to shape the landscape across the AdTech and MarTech industry. Cathy has led engineering, product, and marketing teams and is an inventor of ground-to-air calling (1-800-RINGSKY).

Acknowledgments

We would also like to acknowledge and express our gratitude to our leadership team: Abhoy Bhaktwatsalam (VP, Amazon Publisher Monetization), Carl Petersen (Director, Twitch, Audio & Podcast Monetization), Cindy Barker (Senior Principal Engineer, Amazon Publisher Insights & Analytics), and Timothy Fagan (Principal Engineer, Twitch Monetization), for their invaluable insights and support. Their expertise and backing were instrumental for the successful development and implementation of this innovative solution.

Abstracts: NeurIPS 2024 with Jindong Wang and Steven Euijong Whang

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Jindong Wang, a senior researcher at Microsoft Research, and Steven Euijong Whang, a tenured associate professor at Korea Advanced Institute of Science and Technology (KAIST), join host Gretchen Huizinga to discuss the paper “ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models,” a spotlight session at this year’s Conference on Neural Information Processing Systems (NeurIPS). ERBench leverages the integrity constraints of relational databases to create LLM benchmarks that can verify model rationale via keywords as well as check for answer correctness.

Transcript

[MUSIC]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

Today I’m talking to Jindong Wang, a senior researcher at Microsoft Research, and Steven Whang, a tenured associate professor at the Korea Advanced Institute of Science and Technology. Jindong and Steven are coauthors of a paper called “ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models,” and this paper is a spotlight at this year’s conference on Neural Information Processing Systems, or NeurIPS, in Vancouver, BC, this week. Jindong and Steven, thanks for joining us on Abstracts!

JINDONG WANG: Thank you. Nice to be here.

STEVEN EUIJONG WHANG: It’s great to be here.

HUIZINGA: So, Jindong, I’ll start with you. In just a few sentences, tell us what problem your research addresses and why people should care about it.

JINDONG WANG: OK, everybody knows that with the widespread usage of large language models, hallucination has become a crucial factor of concern. Hallucination occurs when models generate false or nonexistent information. In particular, factual hallucination greatly undermines the reliability of the large language models. To correctly evaluate the hallucination, evaluating the model’s rationale is also important. Up to date, when the paper, you know, was submitted, there were no works dealing with automatic rationale evaluation systematically because, you know, most of them focused on manual evaluation or just using GPT-judge. ERBench is the first one to generate a large language model evaluation benchmark utilizing relational databases. Relational databases are based on the relational data model assuming a fixed schema. The fixed schema enables relational databases to have data integrity that is based on database design theories, so that integrity constraints in relational databases allow better evaluation of the large language models. Functional dependencies allow automatic rationale evaluation using the functional dependency inferred keywords, and foreign key constraints also allow for easy generation of the multi-hop questions, which are usually very complicated to generate with other techniques. So that’s basically what we want to do. So in one sentence, we try to build an automatic evaluation benchmark for evaluation of the hallucination.

HUIZINGA: Steven, give us a quick overview of your research methodology and findings. How did you conduct your research, and what were your major takeaways?

STEVEN EUIJONG WHANG: Sure. So this was a collaboration between our group at KAIST, and Dr. Xing Xie’s group at MSRA (Microsoft Research Asia). KAIST is Korea Advanced Institute of Science and Technology. So we had the privilege to closely work with our LLM expert, Dr. Jindong Wang, here. We also acknowledge the Microsoft Accelerating Foundation Models Research, or AFMR, program for using Azure quota for our experiments. So we had some biweekly meetings for maybe over a year, and at some point, we figured that relational databases could be really important for LLM evaluation. I personally have a background in databases, which I studied at Stanford University as a PhD student. So relational databases have integrity constraints that can be used to better construct complex, in-depth questions and verify answers. So the first ingredient is functional dependencies. So these are constraints where, given a few attributes, you can determine another attribute. So I’ll just give an example because I think that helps the understanding. So suppose that you have, like, a movie table, and in a movie, you have the title of the movie, the year of production, and the director of the movie, and the length of the movie, and so on and so forth. So if you know the title and year of the movie, that pretty much identifies the movie, and you can actually determine the director of the movie, as well. So, for example, if you know that there’s a movie called Star Wars, which is a very popular movie produced in 1977, that determines the director. We know it’s George Lucas, right. So, basically, it’s like a function. It receives the Star Wars 1977 and determines, gives the output, George Lucas. So that’s the first ingredient. Now, the reason this is important is that we can use these functional dependencies to pinpoint critical keywords that an LLM must know to properly answer a given question containing certain attribute values. 
For example, we may ask the LLM, is there a director of a movie called Star Wars produced in 1977? And the LLM can say yes. And it is the right answer, but we’d like to know if the LLM is knowing what it’s saying, right. And so we look at the rationale. That’s why looking at the rationale is important. We just can’t say it’s doing the correct thing. So if the LLM mentions George Lucas, bingo, that’s a great answer. However, if the LLM mentions some other director, like Steven Spielberg, that’s not a correct rationale. So that’s exactly what we’re trying to evaluate. Functional dependency is key to being able to do that kind of verification.
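
The functional-dependency check described in this answer can be sketched in a few lines of Python. This is an illustrative reading of the idea, not the ERBench code: the MOVIES dictionary and the rationale_is_supported helper are hypothetical, and the check is a simple keyword match on the director that the dependency (title, year) -> director determines.

```python
# Hypothetical movie table keyed by (title, year); the value is the director
# that the functional dependency (title, year) -> director determines.
MOVIES = {
    ("Star Wars", 1977): "George Lucas",
}


def rationale_is_supported(title, year, rationale):
    """True if the rationale mentions the director implied by the dependency."""
    expected_director = MOVIES[(title, year)]
    return expected_director.lower() in rationale.lower()


good = rationale_is_supported(
    "Star Wars", 1977, "Yes, the movie was directed by George Lucas."
)
bad = rationale_is_supported(
    "Star Wars", 1977, "Yes, the movie was directed by Steven Spielberg."
)
```

Both rationales accompany the same correct "yes" answer, yet only the first passes, which is the distinction between answer accuracy and rationale accuracy drawn in the conversation.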

The second ingredient is foreign key constraints. So foreign key constraint is where one of the attributes in one table can intuitively link to another attribute of another table. So in our movie table, we had the director attribute. Now we may also have a separate table called the director table, and maybe we might have some more information about the director in that table, like the director name, the director’s age, all sorts of information about the director. So foreign key constraint basically requires that if there is some director mentioned in the movie table, it has to be one of the directors in the director table. So this basically links a table to another table. It’s very useful. So using this, what we can do is we can join the two tables, right. So now we can join the movie and director table and generate a bigger table. The reason this is useful is that we can also chain together functional dependencies that I just mentioned into longer functional dependencies. So what this enables is us to construct more complex questions, arbitrarily, that are multi-hop. So using these integrity constraints, we can basically convert any relational database into an LLM benchmark, and this supports continuous evaluation as the database changes. We can also support multimodal questions and also support various prompt engineering techniques.
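
The join enabled by a foreign key constraint, and the multi-hop question it makes possible, might look like the following sketch. The table contents, field names, and question template are assumptions made for illustration; they are not taken from the ERBench benchmark.

```python
# Hypothetical tables: movies.director is a foreign key into directors.name.
MOVIES = [{"title": "Star Wars", "year": 1977, "director": "George Lucas"}]
DIRECTORS = [{"name": "George Lucas", "birth_year": 1944}]


def join_movies_directors(movies, directors):
    """Join the two tables on the director name."""
    directors_by_name = {d["name"]: d for d in directors}
    joined = []
    for movie in movies:
        # The foreign key constraint guarantees that this lookup succeeds.
        director = directors_by_name[movie["director"]]
        joined.append({**movie, "director_birth_year": director["birth_year"]})
    return joined


def multi_hop_question(row):
    """Build a question that requires hopping from movie to director."""
    return (
        f"In which year was the director of the movie {row['title']} "
        f"({row['year']}) born?"
    )


rows = join_movies_directors(MOVIES, DIRECTORS)
question = multi_hop_question(rows[0])
expected_answer = rows[0]["director_birth_year"]
```

Answering the generated question requires first resolving the director and then the birth year, and the intermediate value (the director name) doubles as the keyword to look for in the rationale.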

HUIZINGA: Well, I would ask you to, kind of, drill in on what you found in how ERBench compares to other benchmark tests.

STEVEN EUIJONG WHANG: So we evaluated our benchmark on five domains and performed comprehensive analyses in terms of answer and rationale accuracies and hallucination rates using single, multi-hop, and multimodal questions and also performed prompt engineering and fine-tuning. And what we found is that some LLMs, like GPT-4, are relatively aggressive and good at answering lots of questions. Other LLMs, like Gemini, tend to be a bit more conservative and do not answer as many questions but instead hallucinate less as a result. So the key conclusion is that no LLM, like, totally subsumes the other in all aspects, which is the reason why we use multiple measures. And the key message we want to make is that overall, ERBench is effective in evaluating any LLM’s thought process by pinpointing critical keywords within the rationale.

HUIZINGA: Well, Jindong, back to you. Research settings are one thing, but tell us how your work is significant in real-world settings, and who does this impact most and how?

JINDONG WANG: Relational databases, you know, they are everywhere across various domains. Anyone can easily get access from Google or from Kaggle or even create them targeting the domain or subject that one wants to test the model on. So taking into account that ERBench is the first work to utilize the relational database for generating large language model hallucination benchmarks … so this work will lead a new research direction of integrating database design theories and techniques, a long-studied field—you know, database is very traditional, old, and classic, but, you know, they’re still operating right now—into the large language model field, a recently emerging area.

HUIZINGA: Right. Well, Steven, as we close, I assume there are still a few unanswered questions or unsolved problems in the field. What do you propose to do about those, and what’s next on your research agenda?

STEVEN EUIJONG WHANG: Sure, so the big picture is that we basically proposed the first work to properly evaluate the rationale of LLMs, right. This is very important because LLMs are being used in our everyday lives, and everyone has the question, is the LLM suitable for my task? Can I benefit from the LLM? So it’s very important to verify if the LLM knows what it’s saying. So I just mentioned that we use functional dependencies to pinpoint critical keywords in the rationale. And we believe that’s just the first step. It’s very effective, by the way. So you may have the question, is it enough to just look at, like, the George Lucas within the long rationale? And it turns out 95% of the cases, it is actually effective, so we did human studies and also used GPT-judge to verify that. But these are factual questions and there could be various other questions that require long answers, right. Long rationales. And so the important question is, can we also verify all the rest of the rationales, the complicated rationales, as well? And so in order to properly do that, we need a lot of technology. So first we need to understand the rationales using NLP techniques, and we need to know if it’s properly answering the question, and so on and so forth. And so we believe that there’s a lot of opportunity to expand from that. So we basically, you know, proposed an initial work towards this direction, but we believe that there are many more interesting challenges that remain.

HUIZINGA: Well, Jindong Wang and Steven Whang, thanks for joining us today, and to our listeners, thanks for tuning in. If you’re interested in learning more about this paper, you can find a link at aka.ms/abstracts.

[MUSIC]

You can also find it on arXiv and on the NeurIPS website. And if you’re at the NeurIPS conference this week, go to the poster session and talk to the authors! See you next time on Abstracts!

[MUSIC FADES]

The post Abstracts: NeurIPS 2024 with Jindong Wang and Steven Euijong Whang appeared first on Microsoft Research.

Accelerate your ML lifecycle using the new and improved Amazon SageMaker Python SDK – Part 2: ModelBuilder

In Part 1 of this series, we introduced the newly launched ModelTrainer class on the Amazon SageMaker Python SDK and its benefits, and showed you how to fine-tune a Meta Llama 3.1 8B model on a custom dataset. In this post, we look at the enhancements to the ModelBuilder class, which lets you seamlessly deploy a model from ModelTrainer to a SageMaker endpoint, and provides a single interface for multiple deployment configurations.

In November 2023, we launched the ModelBuilder class (see Package and deploy models faster with new tools and guided workflows in Amazon SageMaker and Package and deploy classical ML and LLMs easily with Amazon SageMaker, part 1: PySDK Improvements). It reduced the complexity of the initial setup for a SageMaker endpoint, such as creating an endpoint configuration, choosing the container, and handling serialization and deserialization, and helps you create a deployable model in a single step. The recent update enhances the usability of the ModelBuilder class for a wide range of use cases, particularly in the rapidly evolving field of generative AI. In this post, we deep dive into the enhancements made to the ModelBuilder class, and show you how to seamlessly deploy the fine-tuned model from Part 1 to a SageMaker endpoint.

Improvements to the ModelBuilder class

We’ve made the following usability improvements to the ModelBuilder class:

Seamless transition from training to inference – ModelBuilder now integrates directly with SageMaker training interfaces to make sure that the correct file path to the latest trained model artifact is automatically computed, simplifying the workflow from model training to deployment.
Unified inference interface – Previously, the SageMaker SDK offered separate interfaces and workflows for different types of inference, such as real-time, batch, serverless, and asynchronous inference. To simplify the model deployment process and provide a consistent experience, we have enhanced ModelBuilder to serve as a unified interface that supports multiple inference types.
Ease of development, testing, and production handoff – We are adding support for local mode testing with ModelBuilder so that you can effortlessly debug and test your processing and inference scripts with faster local testing without including a container. We are also adding a new function that outputs the latest container image for a given framework, so you don’t have to update the code each time a new Large Model Inference (LMI) release comes out.
Customizable inference preprocessing and postprocessing – ModelBuilder now allows you to customize preprocessing and postprocessing steps for inference. By enabling scripts to filter content and remove personally identifiable information (PII), this integration streamlines the deployment process, encapsulating the necessary steps within the model configuration for better management and deployment of models with specific inference requirements.
Benchmarking support – The new benchmarking support in ModelBuilder empowers you to evaluate deployment options—like endpoints and containers—based on key performance metrics such as latency and cost. With the introduction of a Benchmarking API, you can test scenarios and make informed decisions, optimizing your models for peak performance before production. This enhances efficiency and provides cost-effective deployments.

In the following sections, we discuss these improvements in more detail and demonstrate how to customize, test, and deploy your model.

Seamless deployment from ModelTrainer class

ModelBuilder integrates seamlessly with the ModelTrainer class; you can simply pass the ModelTrainer object that was used for training the model directly to ModelBuilder in the model parameter. In addition to the ModelTrainer, ModelBuilder also supports the Estimator class and the result of the SageMaker Core TrainingJob.create() function, and automatically parses the model artifacts to create a SageMaker Model object. With resource chaining, you can build and deploy the model as shown in the following example. If you followed Part 1 of this series to fine-tune a Meta Llama 3.1 8B model, you can pass the model_trainer object as follows:

# set container URI
image_uri = "763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.3.0-tgi2.2.0-gpu-py310-cu121-ubuntu22.04-v2.0"

model_builder = ModelBuilder(
    model=model_trainer,  # ModelTrainer object passed onto ModelBuilder directly
    role_arn=role,
    image_uri=image_uri,
    inference_spec=inf_spec,
    instance_type="ml.g5.2xlarge"
)
# deploy the model
model_builder.build().deploy()

Customize the model using InferenceSpec

The InferenceSpec class allows you to customize the model by providing custom logic to load and invoke the model, and specify any preprocessing logic or postprocessing logic as needed. For SageMaker endpoints, preprocessing and postprocessing scripts are often used as part of the inference pipeline to handle tasks that are required before and after the data is sent to the model for predictions, especially in the case of complex workflows or non-standard models. The following example shows how you can specify the custom logic using InferenceSpec:

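import json

from sagemaker.serve.spec.inference_spec import InferenceSpec

# HF_TEI_MODEL is assumed to be defined earlier as the Hugging Face model ID to load

class CustomerInferenceSpec(InferenceSpec):
    def load(self, model_dir):
        # load the model once when the server starts
        from transformers import AutoModel
        return AutoModel.from_pretrained(HF_TEI_MODEL, trust_remote_code=True)

    def invoke(self, x, model):
        # run inference on the preprocessed input
        return model.encode(x)

    def preprocess(self, input_data):
        # extract the "inputs" field from the JSON request body
        return json.loads(input_data)["inputs"]

    def postprocess(self, predictions):
        # fail fast if the model returned nothing
        assert predictions is not None
        return predictions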
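
To make the division of labor between the four hooks concrete, the following is a minimal plain-Python sketch that does not use the SageMaker SDK. The ToySpec class, the run_request helper, the email-matching pattern, and the call order (preprocess, then invoke, then postprocess) are all assumptions made for illustration; the way ModelBuilder actually wires these hooks together may differ.

```python
import json
import re

# Hypothetical pattern used only to illustrate PII redaction in postprocess
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


class ToySpec:
    """Stand-in that mirrors the InferenceSpec method names without the SDK."""

    def load(self, model_dir):
        # A real spec would load weights from model_dir; this toy "model" upper-cases text
        return str.upper

    def preprocess(self, input_data):
        # Pull the "inputs" field out of the JSON request body
        return json.loads(input_data)["inputs"]

    def invoke(self, x, model):
        # Run the loaded model on the preprocessed input
        return model(x)

    def postprocess(self, predictions):
        # Redact anything that looks like an email address before returning
        return EMAIL_PATTERN.sub("[REDACTED]", predictions)


def run_request(spec, payload, model_dir="."):
    # Assumed call order for this sketch: load -> preprocess -> invoke -> postprocess
    model = spec.load(model_dir)
    inputs = spec.preprocess(payload)
    return spec.postprocess(spec.invoke(inputs, model))


result = run_request(ToySpec(), json.dumps({"inputs": "contact jane@example.com today"}))
print(result)
```

Keeping each hook small like this makes it easier to test the request parsing and the output filtering separately from the model itself.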

Test using local and in process mode

Deploying a trained model to a SageMaker endpoint involves creating a SageMaker model and configuring the endpoint. This includes the inference script, any serialization or deserialization required, the model artifact location in Amazon Simple Storage Service (Amazon S3), the container image URI, the right instance type and count, and more. Machine learning (ML) practitioners need to iterate over these settings before finally deploying the endpoint to SageMaker for inference. ModelBuilder offers two modes for quick prototyping:

In process mode – In this case, the inferences are made directly within the same inference process. This is highly useful in quickly testing the inference logic provided through InferenceSpec and provides immediate feedback during experimentation.
Local mode – The model is deployed and run as a local container. This is achieved by setting the mode to LOCAL_CONTAINER when you build the model. This is helpful to mimic the same environment as the SageMaker endpoint. Refer to the following notebook for an example.

The following code is an example of running inference in process mode, with a custom InferenceSpec:

from sagemaker.serve.spec.inference_spec import InferenceSpec
from transformers import pipeline
from sagemaker.serve import Mode
from sagemaker.serve.builder.schema_builder import SchemaBuilder
from sagemaker.serve.builder.model_builder import ModelBuilder

value: str = "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:"
schema = SchemaBuilder(value,
            {"generated_text": "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron: Hi, Daniel. I was just thinking about how magnificent giraffes are and how they should be worshiped by all.\nDaniel: You and I think alike, Girafatron. I think all animals should be worshipped! But I guess that could be a bit impractical...\nGirafatron: That's true. But the giraffe is just such an amazing creature and should always be respected!\nDaniel: Yes! And the way you go on about giraffes, I could tell you really love them.\nGirafatron: I'm obsessed with them, and I'm glad to hear you noticed!\nDaniel: I'"})

# custom inference spec with hugging face pipeline
class MyInferenceSpec(InferenceSpec):
    def load(self, model_dir: str):
        ...
    def invoke(self, input, model):
        ...
    def preprocess(self, input_data):
        ...
    def postprocess(self, predictions):
        ...
        
inf_spec = MyInferenceSpec()

# Build ModelBuilder object in IN_PROCESS mode
builder = ModelBuilder(inference_spec=inf_spec,
                       mode=Mode.IN_PROCESS,
                       schema_builder=schema
                      )
                      
# Build and deploy the model
model = builder.build()
predictor=model.deploy()

# make predictions
predictor.predict("How are you today?")

As a next step, you can test it in local container mode as shown in the following code, by adding the image_uri. You also need to include the model_server argument when you include the image_uri.

from sagemaker.serve.utils.types import ModelServer

image_uri = '763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-inference:2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04'

builder = ModelBuilder(inference_spec=inf_spec,
                       mode=Mode.LOCAL_CONTAINER,  # you can change it to Mode.SAGEMAKER_ENDPOINT for endpoint deployment
                       schema_builder=schema,
                       image_uri=image_uri,  # pass the URI defined above
                       model_server=ModelServer.TORCHSERVE
                      )

model = builder.build()                      
predictor = model.deploy()

predictor.predict("How are you today?")

Deploy the model

When testing is complete, you can deploy the model to a real-time endpoint for predictions by updating the mode to Mode.SAGEMAKER_ENDPOINT and providing an instance type and size:

sm_predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    mode=Mode.SAGEMAKER_ENDPOINT,
    role=execution_role,
)

sm_predictor.predict("How is the weather?")

In addition to real-time inference, SageMaker supports serverless inference, asynchronous inference, and batch inference modes for deployment. You can also use InferenceComponents to abstract your models and assign CPU, GPU, accelerators, and scaling policies per model. To learn more, see Reduce model deployment costs by 50% on average using the latest features of Amazon SageMaker.

After you have the ModelBuilder object, you can deploy to any of these options simply by adding the corresponding inference configurations when deploying the model. By default, if the mode is not provided, the model is deployed to a real-time endpoint. The following are examples of other configurations:

Deploy the model to a serverless endpoint:

from sagemaker.serverless.serverless_inference_config import ServerlessInferenceConfig
predictor = model_builder.deploy(
    endpoint_name="serverless-endpoint",
    inference_config=ServerlessInferenceConfig(memory_size_in_mb=2048))

Deploy the model to an asynchronous endpoint:

from sagemaker.async_inference.async_inference_config import AsyncInferenceConfig
from sagemaker.s3_utils import s3_path_join

predictor = model_builder.deploy(
    endpoint_name="async-endpoint",
    inference_config=AsyncInferenceConfig(
        output_path=s3_path_join("s3://", bucket, "async_inference/output")))

Run a batch transform job for offline inference on a dataset:

from sagemaker.batch_inference.batch_transform_inference_config import BatchTransformInferenceConfig

transformer = model_builder.deploy(
    endpoint_name="batch-transform-job",
    inference_config=BatchTransformInferenceConfig(
        instance_count=1,
        instance_type='ml.m5.large',
        output_path=s3_path_join("s3://", bucket, "batch_inference/output"),
        test_data_s3_path = s3_test_path
    ))
print(transformer)

Deploy a multi-model endpoint using InferenceComponent:

from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements

predictor = model_builder.deploy(
    endpoint_name="multi-model-endpoint",
    inference_config=ResourceRequirements(
        requests={
            "num_cpus": 0.5,
            "memory": 512,
            "copies": 2,
        },
        limits={},
))

Clean up

If you created any endpoints when following this post, you will incur charges while they are up and running. As a best practice, delete any endpoints that are no longer required, either using the AWS Management Console or using the following code:

predictor.delete_model() 
predictor.delete_endpoint()

Conclusion

In this two-part series, we introduced the ModelTrainer and the ModelBuilder enhancements in the SageMaker Python SDK. Both classes aim to reduce the complexity and cognitive overhead for data scientists, providing you with a straightforward and intuitive interface to train and deploy models, both locally on your SageMaker notebooks and to remote SageMaker endpoints.

We encourage you to try out the SageMaker SDK enhancements (SageMaker Core, ModelTrainer, and ModelBuilder) by referring to the SDK documentation and sample notebooks on the GitHub repo, and let us know your feedback in the comments!

About the Authors

Durga Sury is a Senior Solutions Architect on the Amazon SageMaker team. Over the past 5 years, she has worked with multiple enterprise customers to set up a secure, scalable AI/ML platform built on SageMaker.

Shweta Singh is a Senior Product Manager in the Amazon SageMaker Machine Learning (ML) platform team at AWS, leading SageMaker Python SDK. She has worked in several product roles in Amazon for over 5 years. She has a Bachelor of Science degree in Computer Engineering and a Masters of Science in Financial Engineering, both from New York University.

Accelerate your ML lifecycle using the new and improved Amazon SageMaker Python SDK – Part 1: ModelTrainer

Amazon SageMaker has redesigned its Python SDK to provide a unified object-oriented interface that makes it straightforward to interact with SageMaker services. The new SDK is designed with a tiered user experience in mind, where the new lower-level SDK (SageMaker Core) provides access to full breadth of SageMaker features and configurations, allowing for greater flexibility and control for ML engineers. The higher-level abstracted layer is designed for data scientists with limited AWS expertise, offering a simplified interface that hides complex infrastructure details.

In this two-part series, we introduce the abstracted layer of the SageMaker Python SDK that allows you to train and deploy machine learning (ML) models by using the new ModelTrainer and the improved ModelBuilder classes.

In this post, we focus on the ModelTrainer class for simplifying the training experience. The ModelTrainer class provides significant improvements over the current Estimator class, which are discussed in detail in this post. We show you how to use the ModelTrainer class to train your ML models, which includes executing distributed training using a custom script or container. In Part 2, we show you how to build a model and deploy to a SageMaker endpoint using the improved ModelBuilder class.

Benefits of the ModelTrainer class

The new ModelTrainer class has been designed to address usability challenges associated with Estimator class. Moving forward, ModelTrainer will be the preferred approach for model training, bringing significant enhancements that greatly improve the user experience. This evolution marks a step towards achieving a best-in-class developer experience for model training. The following are the key benefits:

Improved intuitiveness – The ModelTrainer class reduces complexity by consolidating configurations into just a few core parameters. This streamlining minimizes cognitive overload, allowing users to focus on model training rather than configuration intricacies. Additionally, it employs intuitive config classes for straightforward platform interactions.
Simplified script mode and BYOC – Transitioning from local development to cloud training is now seamless. The ModelTrainer automatically maps source code, data paths, and parameter specifications to the remote execution environment, eliminating the need for special handshakes or complex setup processes.
Simplified distributed training – The ModelTrainer class provides enhanced flexibility for users to specify custom commands and distributed training strategies, allowing you to directly provide the exact command you want to run in your container through the command parameter in the SourceCode class. This approach decouples distributed training strategies from the training toolkit and framework-specific estimators.
Improved hyperparameter contracts – The ModelTrainer class passes the training job’s hyperparameters as a single environment variable, allowing you to load the hyperparameters using a single SM_HPS variable.

To further explain each of these benefits, we demonstrate with examples in the following sections, and finally show you how to set up and run distributed training for the Meta Llama 3.1 8B model using the new ModelTrainer class.

Launch a training job using the ModelTrainer class

The ModelTrainer class simplifies the experience by letting you customize the training job, including providing a custom script, directly providing a command to run the training job, supporting local mode, and much more. However, you can spin up a SageMaker training job in script mode by providing minimal parameters—the SourceCode and the training image URI.

The following example illustrates how you can launch a training job with your own custom script by providing just the script and the training image URI (in this case, PyTorch), and an optional requirements file. Additional parameters such as the instance type and instance size are automatically set by the SDK to preset defaults, and parameters such as the AWS Identity and Access Management (IAM) role and SageMaker session are automatically detected from the current session and user’s credentials. Admins and users can also overwrite the defaults using the SDK defaults configuration file. For the detailed list of pre-set values, refer to the SDK documentation.

from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import SourceCode, InputData

# image URI for the training job
pytorch_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.0-cpu-py310"
# you can find all available images here
# https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/sagemaker-algo-docker-registry-paths.html

# define the script to be run
source_code = SourceCode(
    source_dir="basic-script-mode",
    requirements="requirements.txt",
    entry_script="custom_script.py",
)

# define the ModelTrainer
model_trainer = ModelTrainer(
    training_image=pytorch_image,
    source_code=source_code,
    base_job_name="script-mode",
)

# pass the input data
input_data = InputData(
    channel_name="train",
    data_source=training_input_path,  #s3 path where training data is stored
)

# start the training job
model_trainer.train(input_data_config=[input_data], wait=False)

With purpose-built configurations, you can now reuse these objects to create multiple training jobs with different hyperparameters, for example, without having to re-define all the parameters.
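
As a rough illustration of that reuse, the following plain-Python sketch expands a small search space into one hyperparameter dictionary per job. The search_space values, the expand_grid helper, and the job-name pattern are hypothetical and not part of the SDK; each resulting dictionary could be passed as the hyperparameters argument of a new ModelTrainer that reuses the same source_code and training image.

```python
from itertools import product

# Hypothetical search space; the values are placeholders for illustration only
search_space = {"learning_rate": [1e-5, 3e-5], "epochs": [1, 2]}


def expand_grid(space):
    """Return one hyperparameter dict per combination of values in the space."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


configs = expand_grid(search_space)

for index, hyperparams in enumerate(configs):
    # In a real script, each dict would go to ModelTrainer(hyperparameters=hyperparams, ...)
    # together with the source_code and training image defined once above
    print(f"script-mode-{index}", hyperparams)
```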

Run the job locally for experimentation

To run the preceding training job locally, you can simply set the training_mode parameter as shown in the following code:

from sagemaker.modules.train.model_trainer import Mode

...
model_trainer = ModelTrainer(
    training_image=pytorch_image,
    source_code=source_code,
    base_job_name="script-mode-local",
    training_mode=Mode.LOCAL_CONTAINER,
)
model_trainer.train()

The training job runs locally because training_mode is set to Mode.LOCAL_CONTAINER. If not explicitly set, the ModelTrainer runs a remote SageMaker training job by default. This behavior can also be enforced by changing the value to Mode.SAGEMAKER_TRAINING_JOB. For a full list of the available configs, including compute and networking, refer to the SDK documentation.

Read hyperparameters in your custom script

The ModelTrainer supports multiple ways to read the hyperparameters that are passed to a training job. In addition to the existing support to read the hyperparameters as command line arguments in your custom script, ModelTrainer also supports reading the hyperparameters as individual environment variables, prefixed with SM_HP_<hyperparameter-key>, or as a single environment variable dictionary, SM_HPS.

Suppose the following hyperparameters are passed to the training job:

hyperparams = {
    "learning_rate": 1e-5,
    "epochs": 2,
}

model_trainer = ModelTrainer(
    ...
    hyperparameters=hyperparams,
    ...
)

You have the following options:

Option 1 – Load the hyperparameters into a single JSON dictionary using the SM_HPS environment variable in your custom script:

import json
import os

def main():
    # SM_HPS holds all hyperparameters as one JSON-encoded dictionary
    hyperparams = json.loads(os.environ["SM_HPS"])
    learning_rate = hyperparams.get("learning_rate")
    epochs = hyperparams.get("epochs", 1)
    ...

Option 2 – Read the hyperparameters as individual environment variables, prefixed by SM_HP_ as shown in the following code (you need to explicitly specify the correct input type for these variables):

import os

def main():
    # environment variables are strings, so cast each one explicitly
    learning_rate = float(os.environ.get("SM_HP_LEARNING_RATE", 3e-5))
    epochs = int(os.environ.get("SM_HP_EPOCHS", 1))
    ...

Option 3 – Read the hyperparameters as command line arguments using argparse:

import argparse

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning_rate", type=float, default=3e-5)
    parser.add_argument("--epochs", type=int, default=1)
    
    args = parser.parse_args()
    
    learning_rate = args.learning_rate
    epochs = args.epochs
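
The environment-variable options can also be combined in one helper, which keeps a script runnable both inside a training job and on a laptop. The following is a minimal sketch; the read_hyperparameters function, its precedence order (defaults, then SM_HP_<KEY>, then SM_HPS), and the fake_env dictionary are assumptions for illustration, not SDK behavior.

```python
import json
import os


def read_hyperparameters(defaults, environ=None):
    """Merge hyperparameters from the environment over a dict of defaults."""
    environ = os.environ if environ is None else environ
    params = dict(defaults)

    # Individual variables arrive as strings, so cast each to the default's type
    for key, default in defaults.items():
        raw = environ.get(f"SM_HP_{key.upper()}")
        if raw is not None:
            params[key] = type(default)(raw)

    # The single JSON dictionary preserves types and, in this sketch, wins on conflict
    params.update(json.loads(environ.get("SM_HPS", "{}")))
    return params


defaults = {"learning_rate": 3e-5, "epochs": 1}

# A stand-in for os.environ so the example runs anywhere
fake_env = {"SM_HPS": json.dumps({"learning_rate": 1e-5, "epochs": 2})}
from_json = read_hyperparameters(defaults, fake_env)
from_single = read_hyperparameters(defaults, {"SM_HP_EPOCHS": "5"})
print(from_json, from_single)
```

Passing the environment in as an argument, rather than reading os.environ directly, makes the helper straightforward to unit test.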

Run distributed training jobs

SageMaker supports distributed training for deep learning tasks such as natural language processing and computer vision, so you can run secure and scalable data parallel and model parallel jobs. This is usually achieved by providing the right set of parameters when using an Estimator. For example, to use torchrun, you would define the distribution parameter in the PyTorch Estimator and set it to "torch_distributed": {"enabled": True}.

The ModelTrainer class provides enhanced flexibility for users to specify custom commands directly through the command parameter in the SourceCode class, and supports torchrun, torchrun smp, and the MPI strategies. This capability is particularly useful when you need to launch a job with a custom launcher command that is not supported by the training toolkit.

In the following example, we show how to fine-tune the latest Meta Llama 3.1 8B model with Torchrun and its default launcher script, on a custom dataset that’s preprocessed and saved in an Amazon Simple Storage Service (Amazon S3) location:

from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.distributed import Torchrun
from sagemaker.modules.configs import Compute, SourceCode, InputData

# provide the image URI - update the URI if you're in a different region
pytorch_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.2.0-gpu-py310"

# Define the source code configuration for the distributed training job
source_code = SourceCode(
    source_dir="distributed-training-scripts",    
    requirements="requirements.txt",  
    entry_script="fine_tune.py",
)

torchrun = Torchrun()

hyperparameters = {
    ...
}

# Compute configuration for the training job
compute = Compute(
    instance_count=1,
    instance_type="ml.g5.12xlarge",
    volume_size_in_gb=96,
    keep_alive_period_in_seconds=3600,
)


# Initialize the ModelTrainer with the specified configurations
model_trainer = ModelTrainer(
    training_image=pytorch_image,  
    source_code=source_code,
    compute=compute,
    distributed_runner=torchrun,
    hyperparameters=hyperparameters,
)

# pass the input data
input_data = InputData(
    channel_name="dataset",
    data_source="s3://your-bucket/your-prefix",  # this is the s3 path where processed data is stored
)

# Start the training job
model_trainer.train(input_data_config=[input_data], wait=False)

If you want to customize your torchrun launcher script, you can also directly provide the commands using the command parameter:

# Define the source code configuration for the distributed training job
source_code = SourceCode(
    source_dir="distributed-training-scripts",    
    requirements="requirements.txt",    
    # Custom command for distributed training launcher script
    command=(
        "torchrun --nnodes 1 "
        "--nproc_per_node 4 "
        "--master_addr algo-1 "
        "--master_port 7777 "
        "fine_tune_llama.py"
    ),
)


# Initialize the ModelTrainer with the specified configurations
model_trainer = ModelTrainer(
    training_image=pytorch_image,  
    source_code=source_code,
    compute=compute,
)

# Start the training job
model_trainer.train(..)
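
When the launcher options vary between experiments, it can help to assemble the command string from parameters instead of editing it by hand. The following sketch uses only the Python standard library; the build_torchrun_command helper and its defaults are hypothetical, and the flags simply mirror the ones in the preceding command.

```python
import shlex


def build_torchrun_command(script, nproc_per_node, nnodes=1,
                           master_addr="algo-1", master_port=7777, script_args=()):
    """Assemble a torchrun command string suitable for SourceCode(command=...)."""
    parts = [
        "torchrun",
        "--nnodes", str(nnodes),
        "--nproc_per_node", str(nproc_per_node),
        "--master_addr", master_addr,
        "--master_port", str(master_port),
        script,
        *script_args,
    ]
    # shlex.join quotes any argument that contains spaces or shell metacharacters
    return shlex.join(parts)


command = build_torchrun_command("fine_tune_llama.py", nproc_per_node=4)
print(command)
```

The resulting string could then be passed as the command argument of SourceCode, with nproc_per_node set to the number of GPUs on the chosen instance type.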

For more examples and end-to-end ML workflows using the SageMaker ModelTrainer, refer to the GitHub repo.

Conclusion

The newly launched SageMaker ModelTrainer class simplifies the user experience by reducing the number of parameters, introducing intuitive configurations, and supporting complex setups like bringing your own container and running distributed training. Data scientists can also seamlessly transition from local training to remote training and training on multiple nodes using the ModelTrainer.

We encourage you to try out the ModelTrainer class by referring to the SDK documentation and sample notebooks on the GitHub repo. The ModelTrainer class is available from the SageMaker SDK v2.x onwards, at no additional charge. In Part 2 of this series, we show you how to build a model and deploy to a SageMaker endpoint using the improved ModelBuilder class.


Amazon Q Apps supports customization and governance of generative AI-powered apps

We are excited to announce new features that allow creation of more powerful apps, while giving more governance control using Amazon Q Apps, a capability within Amazon Q Business that allows you to create generative AI-powered apps based on your organization’s data. These features enhance app customization options that let business users tailor solutions to their specific individual or organizational requirements. We have introduced new governance features for administrators to endorse user-created apps with app verification, and to organize app libraries with customizable label categories that reflect their organizations. App creators can now share apps privately and build data collection apps that can collate inputs across multiple users. These additions are designed to improve how companies use generative AI in their daily operations by focusing on admin controls and capabilities that unlock new use cases.

In this post, we examine how these features enhance the capabilities of Amazon Q Apps. We explore the new customization options, detailing how these advancements make Amazon Q Apps more accessible and applicable to a wider range of enterprise customers. We focus on key features such as custom labels, verified apps, private sharing, and data collection apps (preview).

Endorse quality apps and customize labels in the app library

To help with discoverability of published Amazon Q Apps and address questions about quality of user-created apps, we have launched verified apps. Verified apps are endorsed by admins, indicating they have undergone approval based on your company’s standards. Admins can endorse published Amazon Q Apps by updating their status from Default to Verified directly on the Amazon Q Business console. Admins can work closely with their business stakeholders to determine the criteria for verifying apps, based on their organization’s specific needs and policies. This admin-led labeling capability is a reactive approach to endorsing published apps, without gating the publishing process for app creators.

When users access the library, they will see a distinct blue checkmark icon on any apps that have been marked as Verified by admins (as shown in the following screenshot). Additionally, verified apps are automatically surfaced to the top of the app list within each category, making them easily discoverable. To learn more about verifying apps, refer to Understanding and managing Verified Amazon Q Apps.

The next feature we discuss is custom labels. Admins can create custom category labels for app users to organize and classify apps in the library to reflect their team functions or organizational structure. This feature enables admins to create and manage these labels on the Amazon Q Business console, and end-users can use them at app creation and to discover relevant apps in the library. Admins can update the category labels at any time to tailor towards specific business needs depending on their use cases. For example, admins that manage Amazon Q Business app environments for marketing organizations might add labels like Product Marketing, PR, Ads, or Sales solely for the users on the marketing team to use (see the following screenshot).

Users on the marketing team who create apps can use the custom labels to slot their app in the right category, which will help other users discover apps in the library based on their focus area (as shown in the following screenshot). To learn more about custom labels, see Custom labels for Amazon Q Apps.

Share your apps with select users

App creators can now use advanced sharing options to create more granular controls over apps and facilitate collaboration within their organizations. With private sharing, you have the option to share an app with select individuals or with all app users (which was previously possible). Sharing of any extent will still display the app in the library, but with private sharing, it will only be visible to app users with whom it has been shared. This means the library continues to be the place where users discover apps that they have access to. This feature unlocks the ability to enable apps only to the intended audience and helps reduce “noise” in the library from apps that aren’t necessarily relevant for all users. App creators have the ability to test updates before they are ready to publish changes, helping make sure app iterations and refinements aren’t shared before they are ready to widely publish the revised version.

To share an app with specific users, creators can add each user using their full email address (see the following screenshot). Users are only added after the email address match is found, making sure creators don’t unknowingly give access to someone who doesn’t have access to that Amazon Q Business app environment. To learn more about private sharing, see Sharing Amazon Q Apps.

Unlock new use cases with data collection

The last feature we share in this post is data collection apps (preview), a new capability that allows you to record inputs provided by other app users, resulting in a new genre of Amazon Q Apps such as team surveys and project retrospectives. This enhancement enables you to collate data across multiple users within your organization, further enhancing the collaborative quality of Amazon Q Apps for various business needs. These apps can further use generative AI to analyze the collected data, identify common themes, summarize ideas, and provide actionable insights.

After publishing a data collection app to the library, creators can share the unique link to invite their colleagues to participate. You must share the unique link to get submissions for your specific data collection. When app users open the data collection app from the library, it triggers a fresh data collection with its own unique shareable link, for which they are the designated owner. As the owner of a data collection, you can start new rounds and manage controls to start and stop accepting new data submissions, as well as reveal or hide the collected data. To learn more about data collection apps, see Data collection in Amazon Q Apps.

Conclusion

In this post, we discussed how these new features for Amazon Q Apps in Amazon Q Business make generative AI more customizable and governable for enterprise users. From custom labels and verified apps to private sharing and data collection capabilities, these innovations enable organizations to create, manage, and share AI-powered apps that align with their specific business needs while maintaining appropriate controls.

For more information, see Creating purpose-built Amazon Q Apps.

About the Author

Tiffany Myers is a Product Manager at AWS, where she leads bringing in new capabilities while maintaining the simplicity of Amazon Q Business and Amazon Q Apps, drawing inspiration from the adaptive intelligence of amphibians in nature to help customers transform and evolve their businesses through generative AI.

Answer questions from tables embedded in documents with Amazon Q Business

Amazon Q Business is a generative AI-powered assistant that can answer questions, provide summaries, generate content, and securely complete tasks based on data and information in your enterprise systems. A large portion of that information is found in text narratives stored in various document formats such as PDFs, Word files, and HTML pages. Some information is also stored in tables (such as price or product specification tables) embedded in those same document types, CSVs, or spreadsheets. Although Amazon Q Business can provide accurate answers from narrative text, getting answers from these tables requires special handling of more structured information.

On November 21, 2024, Amazon Q Business launched support for tabular search, which you can use to extract answers from tables embedded in documents ingested in Amazon Q Business. Tabular search is a built-in feature in Amazon Q Business that works seamlessly across many domains, with no setup required from admin or end users.

In this post, we ingest different types of documents that have tables and show you how Amazon Q Business responds to questions related to the data in the tables.

Prerequisites

To follow along with this walkthrough, you need to have the following prerequisites in place:

An AWS account where you can follow the instructions in this post.
At least one Amazon Q Business user. For information, refer to Amazon Q Business pricing.
Cross-Region inference enabled on the Amazon Q application.
Amazon Q Business applications created on or after November 21, 2024, automatically benefit from the new capability. If your application was created before this date, you must reingest your content to update its index.

Overview of tabular search

Tabular search extends Amazon Q Business capabilities to find answers beyond text paragraphs, analyzing tables embedded in enterprise documents so you can get answers to a wide range of queries, including factual lookup from tables.

With tabular search in Amazon Q Business, you can ask questions such as, “what’s the credit card with the lowest APR and no annual fees?” or “which credit cards offer travel insurance?” where the answers may be found in a product-comparison table, inside a marketing PDF stored in an internal repository, or on a website.

This feature supports a wide range of file formats, including PDF, Word documents, CSV files, Excel spreadsheets, HTML, and SmartSheet (via SmartSheet connector). Notably, tabular search can also extract data from tables represented as images within PDFs and retrieve information from single or multiple cells. Additionally, it can perform aggregations on numerical data, providing users with valuable insights.

Ingest documents in Amazon Q Business

To create an Amazon Q Business application, retriever, and index to pull data in real time during a conversation, follow the steps under the Create and configure your Amazon Q application section in the AWS Machine Learning Blog post, Discover insights from Amazon S3 with Amazon Q S3 connector.

For this post, we use The World’s Billionaires, which lists the world’s top 10 billionaires from 1987 through 2024 in a tabular format. You can download this data as a PDF from Wikipedia using the Tools menu. Upload the PDF to an Amazon Simple Storage Service (Amazon S3) bucket and use it as a data source in your Amazon Q Business application.

Run queries with Amazon Q

You can start asking questions to Amazon Q using the Web experience URL, which can be found on the Applications page, as shown in the following screenshot.

Suppose we want to know the ratio of men to women who appeared on the Forbes 2024 list of the world’s billionaires. As you can tell from the following screenshot of The World’s Billionaires PDF, there were 383 women and 2398 men.

To use Amazon Q Business to elicit that information from the PDF, enter the following in the web experience chatbot:

“In 2024, what is the ratio of men to women who appeared in the Forbes 2024 billionaire’s list?”

Amazon Q Business supplies the answer, as shown in the following screenshot.
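As a quick cross-check on an answer like this, the ratio can be computed directly from the two counts quoted above (383 women and 2,398 men). The following is a minimal sketch; the function name is ours and is not part of Amazon Q Business.

```python
from fractions import Fraction

def men_to_women_ratio(men, women):
    """Return the exact men-to-women ratio and its decimal value rounded to 2 places."""
    exact = Fraction(men, women)  # Fraction reduces to lowest terms automatically
    return exact, round(men / women, 2)

# Counts quoted from The World's Billionaires table for 2024.
exact, approx = men_to_women_ratio(men=2398, women=383)
print(f"{exact.numerator}:{exact.denominator}")  # 2398:383 (already in lowest terms)
print(approx)  # 6.26
```

In other words, the list contained roughly 6.26 men for every woman.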

The following screenshot shows the top 10 billionaires from 2009.

We enter “How many of the top 10 billionaires in 2009 were from countries outside the United States?”

Amazon Q Business provides an answer, as shown in the following screenshot.

Next, to demonstrate how Amazon Q Business can pull data from a CSV file, we used the example of crime statistics found here.

We enter the question, “How many incidents of crime were reported in Hollywood?”

Amazon Q Business provides the answer, as shown in the following screenshot.
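If you want to sanity-check a count like this against the source file, a few lines of Python are enough. The sketch below uses a handful of made-up rows and hypothetical column names (`incident_id`, `area_name`, `crime_type`); the real crime statistics CSV may use different headers, so adjust `area_column` accordingly.

```python
import csv
import io
from collections import Counter

# Made-up sample rows for illustration only; not taken from the real dataset.
SAMPLE_CSV = """incident_id,area_name,crime_type
1,Hollywood,Burglary
2,Central,Theft
3,Hollywood,Vandalism
4,Hollywood,Theft
"""

def count_by_area(csv_text, area_column="area_name"):
    """Count how many rows (incidents) fall in each area."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[area_column] for row in reader)

counts = count_by_area(SAMPLE_CSV)
print(counts["Hollywood"])  # 3
```

For a real file, read the text with `open(path, newline="")` and pass its contents to the same function.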

Metadata boosting

To improve the accuracy of responses from an Amazon Q Business application with CSV files, you can add metadata to documents in an S3 bucket by using a metadata file. Metadata is additional information that describes a document and improves retrieval accuracy for context-poor document formats, for example, a CSV with cryptic column names. Additional fields such as the document's title and the date and time it was created can also be useful if you want to search titles or retrieve documents from a certain time period.

You can do this by following Enable document attributes for search in Amazon Q Business.

Additional details about metadata boosting can be found at Configuring document attributes for boosting in Amazon Q Business in the Amazon Q User Guide.
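As an illustration, the following sketch generates the contents of a metadata file for a CSV document. The field names (`DocumentId`, `Title`, `Attributes`, `_category`, `_created_at`) reflect our understanding of the S3 connector's metadata format, and the file names and values are hypothetical; confirm the exact schema and file-naming convention in the Amazon Q Business User Guide before relying on it.

```python
import json
from datetime import datetime, timezone

def build_metadata(document_id, title, category, created_at):
    """Build a metadata dict for one S3 document.

    Field names are an assumption based on the documented S3 metadata
    format; verify them against the Amazon Q Business User Guide.
    """
    return {
        "DocumentId": document_id,
        "Title": title,
        "Attributes": {
            "_category": category,
            # ISO 8601 timestamp in UTC
            "_created_at": created_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        },
    }

# Hypothetical document name and values, for illustration only.
metadata = build_metadata(
    document_id="crime_data.csv",
    title="Reported crime incidents by area",
    category="crime statistics",
    created_at=datetime(2024, 11, 21, tzinfo=timezone.utc),
)
metadata_json = json.dumps(metadata, indent=2)
print(metadata_json)
```

You would save the printed JSON next to the document (commonly as `<document name>.metadata.json`) and reingest the data source so the attributes are picked up.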

Clean up

To avoid incurring future charges and to clean out unused roles and policies, delete the resources you created: the Amazon Q application, data sources, and corresponding IAM roles.

To delete the Amazon Q application, follow these steps:

On the Amazon Q console, choose Applications and then select your application.
On the Actions drop-down menu, choose Delete.
To confirm deletion, enter delete in the field and choose Delete. Wait until you get the confirmation message; the process can take up to 15 minutes.

To delete the S3 bucket created in Prepare your S3 bucket as a data source, follow these steps:

Follow the instructions in Emptying a bucket
Follow the steps in Deleting a bucket

To delete the IAM Identity center instance you created as part of the prerequisites, follow the steps at Delete your IAM Identity Center instance.

Conclusion

By following this post, you can ingest different types of documents that contain tables. You can then ask Amazon Q questions about the information in those tables and have Amazon Q answer in natural language.

To learn about metadata search, refer to Configuring metadata controls in Amazon Q Business.

For S3 data source setup, refer to Set up Amazon Q Business application with S3 data source.

About the authors

Jiten Dedhia is a Sr. AIML Solutions Architect with over 20 years of experience in the software industry. He has helped Fortune 500 companies with their AIML/Generative AI needs.

Sapna Maheshwari is a Sr. Solutions Architect at AWS, with a passion for designing impactful tech solutions. She is an engaging speaker who enjoys sharing her insights at conferences.

24 of our favorite AI tips from 2024

Here are Google’s best AI tips and tricks from 2024.

Human-AI Collaboration in Physical Tasks

TL;DR: At SmashLab, we’re creating an intelligent assistant that uses the sensors in a smartwatch to support physical tasks such as cooking and DIY. This blog post explores how we use scene understanding that is less intrusive than cameras to enable helpful, context-aware interactions that support users as they carry out tasks in their daily lives.

Thinking about AI assistants for tasks beyond just the digital world? Every day, we perform many tasks, including cooking, crafting, and medical self-care (like the COVID-19 self-test kit), which involve a series of discrete steps. Accurately executing all the steps can be difficult; when we try a new recipe, for example, we might have questions at any step and might make mistakes by skipping important steps or doing them in the wrong order.

This project, Procedural Interaction from Sensing Module (PrISM), aims to support users in executing these kinds of tasks through dialogue-based interactions. By using sensors such as a camera, wearable devices like a smartwatch, and privacy-preserving ambient sensors like a Doppler Radar, an assistant can infer the user’s context (what they are doing within the task) and provide contextually situated help.

Overview of the *PrISM* framework: multimodal sensing, user state tracking, context-aware interactions, and co-adaptation to achieve the shared goal.

To achieve human-like assistance, we must consider many things: how does the agent understand the user’s context? How should it respond to users’ spontaneous questions? When should it decide to intervene proactively? And most importantly, how do both human users and AI assistants evolve together through everyday interactions?

While different sensing platforms (e.g., cameras, LiDAR, Doppler Radars, etc.) can be used in our framework, we focus on a smartwatch-based assistant in the following. The smartwatch is chosen for its ubiquity, minimal privacy concerns compared to camera-based systems, and capability for monitoring a user across various daily activities.

Tracking User Actions with Multimodal Sensing

*PrISM-Tracker* uses a transition graph to improve frame-level multimodal Human Activity Recognition within procedural tasks.

Human Activity Recognition (HAR) is a technique to identify user activity contexts from sensors. For example, a smartwatch has motion and audio sensors to detect different daily activities such as hand washing and chopping vegetables [1]. However, out of the box, state-of-the-art HAR struggles with noisy data and the less-expressive actions that are often part of daily life tasks.

PrISM-Tracker (IMWUT’22) [2] improves tracking by adding state transition information, that is, how users transition from one step to another and how long they usually spend at each step. The tracker uses an extended version of the Viterbi algorithm [3] to stabilize the frame-by-frame HAR prediction.

The latte-making task consists of 19 steps. *PrISM-Tracker* (right) improves the raw classifier’s tracking accuracy (left) with an extended version of the Viterbi algorithm.

As shown in the above figure, PrISM-Tracker improves the accuracy of frame-by-frame tracking. Still, the overall accuracy is around 50-60%, highlighting the challenge of using just a smartwatch to precisely track the procedure state at the frame level. Nevertheless, we can develop helpful interactions out of this imperfect sensing.
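To make the smoothing idea concrete, here is a minimal sketch of the standard Viterbi algorithm applied to frame-level classifier outputs. It is not the extended version used in PrISM-Tracker (which also models how long users spend at each step); the three-step task, the transition probabilities, and the per-frame scores are all made up for illustration, and the classifier outputs are simply treated as emission scores.

```python
import math

def _log(p):
    """Safe log that maps zero probability to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi_smooth(frame_probs, transition, initial):
    """Return the most likely step sequence given per-frame classifier scores.

    frame_probs[t][s]: classifier score for step s at frame t
    transition[i][j]:  probability of moving from step i to step j between frames
    initial[s]:        probability that the task starts at step s
    """
    n_steps = len(initial)
    score = [_log(initial[s]) + _log(frame_probs[0][s]) for s in range(n_steps)]
    backpointers = []
    for probs in frame_probs[1:]:
        new_score, pointers = [], []
        for j in range(n_steps):
            # Best predecessor step for landing in step j at this frame.
            best_i = max(range(n_steps), key=lambda i: score[i] + _log(transition[i][j]))
            new_score.append(score[best_i] + _log(transition[best_i][j]) + _log(probs[j]))
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final step.
    state = max(range(n_steps), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Hypothetical 3-step task where users either stay on a step or advance to the next.
transition = [[0.8, 0.2, 0.0],
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0]]
initial = [1.0, 0.0, 0.0]
frame_probs = [[0.8, 0.1, 0.1],
               [0.3, 0.2, 0.5],   # noisy frame: the raw classifier jumps to step 2
               [0.7, 0.2, 0.1],
               [0.1, 0.8, 0.1],
               [0.1, 0.8, 0.1],
               [0.1, 0.1, 0.8]]

raw = [max(range(3), key=lambda s: p[s]) for p in frame_probs]
smoothed = viterbi_smooth(frame_probs, transition, initial)
print(raw)       # [0, 2, 0, 1, 1, 2]
print(smoothed)  # [0, 0, 0, 1, 1, 2]
```

Because the transition matrix rules out jumping from step 0 straight to step 2 and back, the noisy second frame is corrected to step 0.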

Responding to User Ambiguous Queries

Demo of PrISM-Q&A in a latte-making scenario (1:06-)

Voice assistants (like Siri and Amazon Alexa), capable of answering user queries during various physical tasks, have shown promise in guiding users through complex procedures. However, users often find it challenging to articulate their queries precisely, especially when unfamiliar with the specific vocabulary. Our PrISM-Q&A (IMWUT’24) [4] can resolve such issues with context derived from PrISM-Tracker.

Overview of how *PrISM-Q&A* processes user queries in real-time

When a question is posed, sensed contextual information is supplied to Large Language Models (LLMs) as part of the prompt context used to generate a response, even in the case of inherently vague questions like “What should I do next with this?” and “Did I miss any step?” Our studies demonstrated improved accuracy in question answering and preferred user experience compared to existing voice assistants in multiple tasks: cooking, latte-making, and skin care.

Because PrISM-Tracker can make mistakes, the output of PrISM-Q&A may also be incorrect. Thus, if the assistant uses the context information, the assistant first characterizes its current understanding of the context in the response to avoid confusing the user, for instance, “If you are washing your hands, then the next step is cutting vegetables.” This way, it tries to help users identify the error and quickly correct it interactively to get the desired answer.
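A rough sketch of how sensed context might be folded into a prompt is shown below. The prompt wording, function name, and example steps are our own illustration, not the prompt used in PrISM-Q&A; the point is only that the tracker's best guess is passed along with an explicit note that it may be wrong, and that the model is asked to state its assumption first.

```python
def build_context_prompt(question, task_steps, step_probs):
    """Compose an LLM prompt that includes the tracker's best guess of the current step."""
    if len(task_steps) != len(step_probs):
        raise ValueError("task_steps and step_probs must have the same length")
    best = max(range(len(task_steps)), key=lambda i: step_probs[i])
    numbered = "\n".join(f"{i + 1}. {name}" for i, name in enumerate(task_steps))
    return (
        "You are a voice assistant helping a user through a procedure.\n"
        f"Steps:\n{numbered}\n"
        f"Sensed current step (may be wrong): {best + 1}. {task_steps[best]} "
        f"(confidence {step_probs[best]:.0%})\n"
        "Begin your answer by stating the step you assume the user is on, "
        "for example 'If you are ..., then ...', so the user can correct you.\n"
        f"User question: {question}"
    )

# Hypothetical task and tracker output, for illustration only.
prompt = build_context_prompt(
    question="What should I do next with this?",
    task_steps=["wash hands", "cut vegetables", "boil water"],
    step_probs=[0.7, 0.2, 0.1],
)
print(prompt)
```

The resulting string would then be sent to whichever LLM the assistant uses.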

Intervening with Users Proactively to Prevent Errors

Demo of PrISM-Observer in a cooking scenario (3:38-)

Next, we extended the assistant’s capability by incorporating proactive intervention to prevent errors. Technical challenges include noise in sensing data and uncertainties in user behavior, especially since users are allowed flexibility in the order of steps to complete tasks. To address these challenges, PrISM-Observer (UIST’24) [5] employs a stochastic model to account for these uncertainties and determine the optimal timing for delivering reminders in real time.

*PrISM-Observer* continuously models the remaining time to the target step, which involves two uncertainties: the current step and the user’s future transition behavior.

Crucially, the assistant does not impose a rigid, predefined step-by-step sequence; instead, it monitors user behavior and intervenes proactively when necessary. This approach balances user autonomy and proactive guidance, enabling individuals to perform essential tasks safely and accurately.
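One simple way to picture the two uncertainties (which step the user is on now, and how they will move through the remaining steps) is a Monte Carlo simulation. The sketch below is our own illustration and not the model used in PrISM-Observer: the step belief, transition probabilities, and mean step durations are made up, and step durations are assumed to be exponentially distributed so that time already spent in the current step can be ignored.

```python
import random

def sample_remaining_time(step_belief, transition, mean_duration, target, rng, max_hops=100):
    """Sample one possible time (in seconds) until the user reaches the target step."""
    steps = list(range(len(step_belief)))
    # Uncertainty 1: which step is the user on right now?
    current = rng.choices(steps, weights=step_belief)[0]
    elapsed, hops = 0.0, 0
    while current != target and hops < max_hops:
        # Uncertainty 2: how long the step lasts and where the user goes next.
        elapsed += rng.expovariate(1.0 / mean_duration[current])
        current = rng.choices(steps, weights=transition[current])[0]
        hops += 1
    return elapsed if current == target else None

def remaining_time_samples(step_belief, transition, mean_duration, target, n=1000, seed=0):
    """Draw n samples of the remaining time, dropping runs that never reach the target."""
    rng = random.Random(seed)
    samples = (sample_remaining_time(step_belief, transition, mean_duration, target, rng)
               for _ in range(n))
    return sorted(s for s in samples if s is not None)

# Hypothetical 4-step task; step 3 is the one we want to remind the user about.
transition = [[0.0, 0.7, 0.3, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0, 1.0]]
mean_duration = [20.0, 30.0, 15.0, 10.0]   # seconds per step
step_belief = [0.6, 0.4, 0.0, 0.0]          # tracker's belief over the current step

samples = remaining_time_samples(step_belief, transition, mean_duration, target=3)
early = samples[len(samples) // 10]          # roughly the 10th percentile
print(f"mean {sum(samples) / len(samples):.1f}s, early estimate {early:.1f}s")
```

A reminder could then be scheduled around an early percentile of the samples rather than the mean, trading a slightly premature reminder for a lower chance of missing the step.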

Future Directions

Our assistant system has just been rolled out, and plenty of future work is still on the horizon.

Minimizing the data collection effort

To train the underlying human activity recognition model on the smartwatch and build a transition graph, we currently conduct 10 to 20 sessions of the task, each annotated with step labels. Employing a zero-shot multimodal activity recognition model and refining step granularity are essential for scaling the assistant to handle various daily tasks.

Co-adaptation of the user and AI assistant

In the health application, our assistants and users learn from each other over time through daily interactions to achieve a shared goal.

As future work, we’re excited to deploy our assistants in healthcare settings to support everyday care for post-operative skin cancer patients and individuals with dementia.

Mackay [6] introduced the idea of a human-computer partnership, where humans and intelligent agents collaborate to outperform either working alone. Relatedly, reciprocal co-adaptation [7] describes a setting in which the user and the system each adapt to and affect the other’s behavior to achieve certain goals. Inspired by these ideas, we’re actively exploring ways to fine-tune our assistant through interactions after deployment. This helps the assistant improve context understanding and find a comfortable control balance by exploring the mixed-initiative interaction design [8].

Conclusion

There are many open questions when it comes to perfecting assistants for physical tasks. Understanding user context accurately during these tasks is particularly challenging due to factors like sensor noise. Through our PrISM project, we aim to overcome these challenges by designing interventions and developing human-AI collaboration strategies. Our goal is to create helpful and reliable interactions, even in the face of imperfect sensing.

Our code and datasets are available on GitHub. We are actively working in this exciting research field. If you are interested, please contact Riku Arakawa (HCII Ph.D. student).

Acknowledgments

The author thanks every collaborator in the project. The development of the PrISM assistant for health applications is in collaboration with University Hospitals of Cleveland Department of Dermatology and Fraunhofer Portugal AICOS.

References

[1] Mollyn, V., Ahuja, K., Verma, D., Harrison, C., & Goel, M. (2022). SAMoSA: Sensing activities with motion and subsampled audio. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 6(3), 1-19.

[2] Arakawa, R., Yakura, H., Mollyn, V., Nie, S., Russell, E., DeMeo, D. P., … & Goel, M. (2023). Prism-tracker: A framework for multimodal procedure tracking using wearable sensors and state transition information with user-driven handling of errors and uncertainty. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 6(4), 1-27.

[3] Forney, G. D. (1973). The viterbi algorithm. Proceedings of the IEEE, 61(3), 268-278.

[4] Arakawa, R., Lehman, J. F., & Goel, M. (2024). Prism-q&a: Step-aware voice assistant on a smartwatch enabled by multimodal procedure tracking and large language models. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 8(4), 1-26.

[5] Arakawa, R., Yakura, H., & Goel, M. (2024, October). PrISM-Observer: Intervention agent to help users perform everyday procedures sensed using a smartwatch. In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology (pp. 1-16).

[6] Mackay, W. E. (2023, November). Creating human-computer partnerships. In International Conference on Computer-Human Interaction Research and Applications (pp. 3-17). Cham: Springer Nature Switzerland.

[7] Beaudouin-Lafon, M., Bødker, S., & Mackay, W. E. (2021). Generative theories of interaction. ACM Transactions on Computer-Human Interaction (TOCHI), 28(6), 1-54.

[8] Allen, J. E., Guinn, C. I., & Horvitz, E. (1999). Mixed-initiative interaction. IEEE Intelligent Systems and their Applications, 14(5), 14-23.

Ready Player Fun: GFN Thursday Brings Six New Adventures to the Cloud

From heart-pounding action games to remastered classics, there’s something for everyone this GFN Thursday.

Six new titles join the cloud this week, starting with The Thing: Remastered. Face the horrors of the Antarctic as the game oozes onto GeForce NOW. Nightdive Studios’ revival of the cult-classic 2002 survival-horror game came to the cloud as a surprise at the PC Gaming Show last week. Since then, GeForce NOW members have been able to experience all the bone-chilling action in the sequel to the title based on Universal Pictures’ genre-defining 1982 film.

And don’t miss out on the limited-time GeForce NOW holiday sale, which offers 50% off the first month of a new Ultimate or Performance membership. The 25% off Day Pass sale ends today — take advantage of the offer to experience 24 hours of cloud gaming with all the benefits of Ultimate or Performance membership.

It’s Alive!

The Thing Remastered on GeForce NOW — *Freeze enemies, not frame rates.*

The Thing: Remastered brings the 2002 third-person shooter into the modern era with stunning visual upgrades, including improved character models, textures and animations, all meticulously crafted to enhance the game’s already-tense atmosphere.

Playing as Captain J.F. Blake, leader of a U.S. governmental rescue team, navigate the blood-curdling aftermath of the events depicted in the original film. Trust is a precious commodity as members command their squad through 11 terrifying levels, never knowing who might harbor the alien within. The remaster introduces enhanced lighting and atmospheric effects that make the desolate research facility more immersive and frightening than ever.

With an Ultimate or Performance membership, stream this blood-curdling experience in all its remastered glory without the need for high-end hardware. GeForce NOW streams from powerful GeForce RTX-powered servers in the cloud, rendering every shadow, every flicker of doubt in teammates’ eyes and every grotesque transformation with crystal-clear fidelity.

The Performance tier now offers up to 1440p resolution, allowing members to immerse themselves in the game’s oppressive atmosphere with even greater clarity. Ultimate members can experience the paranoia-inducing gameplay at up to 4K resolution and 120 frames per second, making every heart-pounding moment feel more real than ever.

Feast on This

Dive into the depths of a gothic vampire saga, slide through feudal Japan and flip burgers at breakneck speed with GeForce NOW and the power of the cloud. Grab a controller and rally the gaming squad to stream these mouth-watering additions.

Legacy of Kain Soul Reaver 1&2 Remastered on GeForce NOW — *Time to rise again.*

The highly anticipated Legacy of Kain Soul Reaver 1&2 Remastered from Aspyr and Crystal Dynamics breathes new life into the classic vampire saga genre. These beloved titles have been meticulously overhauled to offer stunning visuals and improved controls. Join the epic conflict of Kain and Raziel in the gothic world of Nosgoth and traverse between the Spectral and Material Realms to solve puzzles, reveal new paths and defeat foes.

The Spirit of the Samurai on GeForce NOW — *Defend the forbidden village.*

The Spirit of the Samurai from Digital Mind Games and Kwalee brings a blend of Souls and Metroidvania elements to feudal Japan. This stop-motion inspired 2D action-adventure game offers three playable characters and intense combat with legendary Japanese weapons, all set against a backdrop of mythological landscapes.

Fast Food Simulator on GeForce NOW — *The ice cream machine actually works.*

Or take on the chaotic world of fast-food management with Fast Food Simulator, a multiplayer simulation game from No Ceiling Games. Take orders, make burgers and increase earnings by dealing with customers. Play solo or co-op with up to four players and take on unexpected and bizarre events that can occur at any moment.

Shift between realms in Legacy of Kain at up to 4K 120 fps with an Ultimate membership, slice through The Spirit of the Samurai’s mythical landscapes in stunning 1440p with RTX ON with a Performance membership or manage a fast-food empire with silky-smooth gameplay. With extended sessions and priority access, members will have plenty of time to master these diverse worlds.

Play On

Diablo Immortal on GeForce NOW — *Evil never sleeps.*

Diablo Immortal — the action-packed role-playing game from Blizzard Entertainment, set in the dark fantasy world of Sanctuary — bridges the stories of Diablo II and Diablo III. Choose from a variety of classes, each offering unique playstyles and devastating abilities, to battle through diverse zones and randomly generated rifts, and uncover the mystery of the shattered Worldstone while facing off against hordes of demonic enemies.

Since its launch, the game has offered frequent updates, including two new character classes, new zones, gear, competitive events and more demonic stories to experience. With its immersive storytelling, intricate character customization and endless replayability, Diablo Immortal provides members with a rich, hellish adventure to stream from the cloud across devices.

Look for the following games available to stream in the cloud this week:

Indiana Jones and the Great Circle (New release on Steam and Xbox, available on the Microsoft Store and PC Game Pass, Dec. 8)
Fast Food Simulator (New release on Steam, Dec. 10)
Legacy of Kain Soul Reaver 1&2 Remastered (New release on Steam, Dec. 10)
The Spirit of the Samurai (New release on Steam, Dec. 12)
Diablo Immortal (Battle.net)
The Lord of the Rings: Return to Moria (Steam)

What are you planning to play this weekend? Let us know on X or in the comments below.


3 artists reimagine AI imagery through speculative photography

We commissioned artists Farah Al Qasimi, Charlie Engman and Max Pinckers to envision possible futures with AI for a new exhibition titled “Alternative Images of AI.” Let…

Copyright © 2019-2025 Vedere AI. All Rights Reserved.