NeurIPS 2024: The co-evolution of AI and systems with Lidong Zhou

The Microsoft Research Podcast offers its audience a unique view into the technical advances being pursued at Microsoft through the insights and personal experiences of the people committed to those pursuits.

Just after his keynote at the 38th annual Conference on Neural Information Processing Systems (NeurIPS), Microsoft Corporate Vice President Lidong Zhou joins guest host Eliza Strickland of IEEE Spectrum at the conference to further explore the topic of his talk: the co-evolution of systems and AI. Zhou, who is also chief scientist of the Microsoft Asia-Pacific Research and Development Group and managing director of Microsoft Research Asia, discusses how rapidly advancing AI impacts the systems supporting it; AI as a tool for improving systems engineering itself; and how budding computer scientists can prepare for innovating in a world where AI and systems grow together.

Learn more: 

Verus: A Practical Foundation for Systems Verification
Publication, November 2024

SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation
Publication, July 2024

BitNet: Scaling 1-bit Transformers for Large Language Models
Publication, October 2023

Transcript

[MUSIC]

ELIZA STRICKLAND: Welcome to the Microsoft Research Podcast, where Microsoft’s leading researchers bring you to the cutting edge. This series of conversations showcases the technical advances being pursued at Microsoft through the insights and experiences of the people driving them.

I’m Eliza Strickland, a senior editor at IEEE Spectrum and your guest host for a special edition of the podcast.

[MUSIC FADES]

Joining me today in the Microsoft Booth at the 38th annual Conference on Neural Information Processing Systems, or NeurIPS, is Lidong Zhou. Lidong is a Microsoft corporate vice president, chief scientist of the Microsoft Asia-Pacific Research and Development Group, and managing director of Microsoft Research Asia. Earlier today, Lidong gave a keynote here at NeurIPS on the co-evolution of AI and systems engineering.

Lidong, welcome to the podcast.


LIDONG ZHOU: Thank you, Eliza. It’s such a pleasure to be here.

STRICKLAND: You said in your keynote that progress in AI is now outpacing progress in the systems supporting AI. Can you give me some concrete examples of where the current infrastructure is struggling to keep up?

ZHOU: Yeah. So actually, we have been working on supporting AI from the infrastructure perspective, and I can say, you know, there are at least three dimensions where it’s actually posing a lot of challenges. One dimension is the scale of the AI systems that we have to support. You know, you heard about the scaling law in AI and, you know, demanding even higher scale every so often. And when we scale, as I mentioned in the talk this morning, every time you scale the system, you actually have to rethink how to design a system, develop a new methodology, revisit all the assumptions. And it becomes very challenging for the community to keep up. And the other dimension is if you look at AI systems, it’s actually a whole-stack kind of design. You have to understand not only the AI workloads, the model architecture, but also the software and also the underlying hardware. And you have to make sure they are all aligned to deliver the best performance. And the third dimension is the temporal dimension, where you really see accelerated growth and the pace of innovation in AI and not actually only in AI but also in the underlying hardware. And that puts a lot of pressure on how fast we innovate on the systems side because we really have to keep up in that dimension, as well. So all those three dimensions add up. It’s becoming a pretty challenging task for the whole systems community.

STRICKLAND: I like how in your talk you proposed a marriage between systems engineering and AI. What does this look like in practice, and how might it change the way we approach both fields?

ZHOU: Yeah, so I’m actually a big fan of the systems community and the AI community working together to tackle some of the most challenging problems. Of course, you know, we have been working on systems that support AI. But now increasingly, we’re seeing opportunities where AI can actually help developers to become more productive and develop systems that are better in many dimensions in terms of efficiency, in terms of reliability, in terms of trustworthiness. So I really want to see the two communities work together even more closely going forward. You know, I talk about, sort of, the three pillars, right—the efficiency; there’s trust; there’s also the infusion of the two (AI and systems engineering)—that are three ambitions that we are actually working on. And we see very encouraging early results that make us believe that there’s much more to be achieved going forward with the two communities working together.

STRICKLAND: You mentioned the challenge of scaling. I think everyone at NeurIPS is talking about scaling. And you’ve highlighted efficiency as a key opportunity for improvement in AI. What kind of breakthroughs in systems engineering or new ideas in systems engineering could help AI achieve greater efficiencies?

ZHOU: Yeah, that’s another great question. I think there are a couple of aspects to efficiency. So this morning, I talked about some of the innovations in model architecture. So our researchers have been looking into BitNet, which essentially tries to use one bit or, actually, a ternary representation for the weights in all those AI models rather than using FP16 and so on. And that potentially creates a lot of opportunities for efficiency and energy gains. But that cannot be done without rethinking the software and even the hardware stack so that, you know, those innovations that you have in the model architecture can actually have the end-to-end benefits. And that’s, you know, one of the dimensions where we see the co-innovation of AI and the underlying system to deliver some efficiency gains for AI models, for example. But there’s another dimension, which I think is also very important. With all the AI infrastructure that we build to support AI, there’s actually a huge room for improvement, as well. And this is where AI can actually be utilized to solve some of the very challenging systems problems, for optimization, for reliability, for trustworthiness. And I use some of the examples in my talk, but this is a very early stage. I think the potential is much larger going forward.

STRICKLAND: Yeah. It’s interesting to think about how GPUs and large language models are so intertwined at this point. You can’t really have one without the other. And you said in your talk you sort of see the need to decouple the architectures and the hardware. Is that right?

ZHOU: Yes. Yeah, so this is always, you know, like a very systems type of thinking where, you know, you really want to decouple some of the elements so that they can evolve and innovate independently. And this gives more opportunities, you know, larger design space, for each field. And what we are observing now, which is actually very typical in relatively mature fields, is where we have GPUs that are dominating in the hardware land and all the model architecture has to be designed and, you know, proven very efficient on GPUs. And that limits the design space for model architecture. And similarly, you know, if you look at hardware, it’s very hard for hardware innovations to happen because now you have to show that that hardware is actually great for all the models that have been actually optimized for GPUs. So I think, you know, from a systems perspective, it’s actually possible if you design the right abstraction between the AI and the hardware, it’s possible for these two domains to actually evolve separately and have a much larger design space, actually, to find the best solution for both.

STRICKLAND: And when you think about systems engineering, are there ways that AI can be used to optimize your own work?

ZHOU: Yes, I think there are. Two examples that I gave this morning, one is, you know, in systems there’s this what we call a holy grail of system research because we want to build trustworthy systems that people can depend on. And one of the approaches is called verified systems. And this has been a very active research area in systems because there are a lot of advancements in formal methods and in how we can infuse formal methods into building real systems. But it’s still very hard for the general system community because, you know, you really have to understand how formal methods work and so on. And so it’s still not within reach. You know, like when we build mission-critical systems, we want them to be completely verified so, you know, you don’t have to do a lot of testing to show that there are no bugs. You’ll never be able to show there are no bugs with testing. But if you …

STRICKLAND: Sorry, can I pause you for one moment? Could you define formal verification for our listeners, just in case they don’t know?

ZHOU: Yeah, that’s a good point. I think the easy way to think about this is formal verification, it uses mathematical logic to describe, say, a program and, you know, it can represent some properties in math, essentially, in logic. And then you can use a proof to show that the program has certain properties that you desire, and a simple form, like, a very preliminary form of formal (specification for) verification is, you know, just assertions in the program, right, where it, say, asserts A is not equal to zero. And that’s a very simple form of logic that must hold (or be proven to hold), and then, you know, the proof system is also much more complicated to talk about more advanced properties of programs, their correctness, and so on.

STRICKLAND: Mm-hm.

ZHOU: So I think that the opportunity that we’re seeing is that with the help of AI, I think we are on the verge of providing the capability of building verified systems, at least for some of the mission-critical pieces of systems. And that would be a very exciting area for systems and AI to tackle together. And I think we’re going to see a paradigm shift in systems where some pieces of system components will actually be implemented using AI. [What] is interesting is, you know, system is generally deterministic because, so, you know, when you look at the traditional computer system, you want to know that it’s actually acting as you expected, but AI, you know, it can be stochastic, right. And it might not always give you the same answer. But how you combine these two is another area where I see a lot of opportunities for breakthroughs.

STRICKLAND: Yeah, yeah. I wanted to back up in your career a little bit and talk about the concept of gray failures because you were really instrumental in defining this concept, which for people who don’t know, gray failures are subtle and partial failures in cloud-scale systems. They can be very difficult to detect and can lead to major problems. I wanted to see if you’re still thinking about gray failures in the context of your thinking about AI and systems. Are gray failures having an impact on AI today?

ZHOU: Yes, definitely. So when we were looking at cloud systems, we realized the … so in systems, we developed a lot of mechanisms for reliability. And when we look at the cloud systems, when they reach a certain scale, a lot of methodology we develop in systems for reliability actually no longer applies. One of the reasons is we have those gray failures. And then we moved to looking at AI infrastructure. The problem is actually even worse because what we realize is there’s a lot of built-in redundancy at every level, like in GPUs, memory, or all the communication channels. And because of those built-in redundancies, sometimes the system experiences failures, but they’re being masked because of the redundancies. And that makes it very hard for us to actually maintain the system, debug the system, or troubleshoot it. And for AI infrastructure, what we have developed is a very different approach using proactive validation rather than reactive repair. And this is actually a paper that we published recently at USENIX ATC that talks about how we approach reliability in AI infrastructure, where the same concept happens to apply with a new meaning.

STRICKLAND: Mm. I like that. Yeah. So tell me a little bit about your vision for where AI goes from here. You talked a little bit in your keynote about AI-infused systems. And what would that look like?

ZHOU: Yeah, so I think AI is going to transform almost everything, and that includes systems. That’s why I’m so happy to be here to learn more from the AI community. But I also believe that for every domain that AI is going to transform, you really need the domain expertise and, sort of, the combination of AI and that particular domain. And the same for systems. So when we look at what we call AI-infused systems, we really see the opportunity where a lot of hard system challenges can be addressed by AI. But we need to define the right interface between the system and the AI so that we can leverage the advantage of both, right. Like, AI is creative. It comes up with solutions that, you know, people might not think of, but it’s also a little bit random sometimes. It could, you know, give you wrong answers. But systems are very grounded and very deterministic. So we need to figure out what is the design paradigm that we need to develop so that we can get the best of both worlds.

STRICKLAND: Makes sense. In your talk you gave an example of OptiFlow. Could you tell our listeners a bit about that?

ZHOU: Yeah. This is a pretty interesting project that is actually done in Microsoft Research Asia jointly with the Azure team where we look at collective communication, which is a major part of AI infrastructure. And it turns out, you know, there’s a lot of room for optimization. It was initially done manually. So an expert had to take a look at the system and look at the different configurations and do all kinds of experiments, and, you know, it takes about two weeks to come up with a solution. This is why I say, you know, the productivity is becoming a bottleneck for our AI infrastructure because people are in the loop who have to develop solutions. And it turns out that this is a perfect problem for AI, where AI can actually come up with various solutions. It can actually develop good system insights based on the observations from the system. And so OptiFlow, what it does is it comes up with the, sort of, the algorithm or the schedule of communications for different collective communication primitives. And it turns out to be able to discover algorithms that are much better than the default ones for, you know, different settings. And it’s giving us the benefits of productivity and also efficiency.

STRICKLAND: And you said that this is in production today, right?

ZHOU: Yes. It is in production.

STRICKLAND: That’s exciting. So thinking still to the future, how might the co-evolution of AI and systems change the skills needed for future computer scientists?

ZHOU: Yeah, that’s a very deep question. As I mentioned, I think being fluent in AI is very important. But I also believe that domain expertise is probably undervalued in many ways. And I see a lot of needs for this interdisciplinary kind of education where someone who not only understands AI and what AI technology can do but also understands a particular domain very well. And those are the people who will be able to figure out the future for that particular domain with the power of AI. And I think for students, certainly it’s no longer sufficient for you to be an expert in a very narrow domain. I think we see a lot of fields sort of merging together, and so you have to be an expert in multiple domains to see new opportunities for innovations.

STRICKLAND: So what advice would you give to a high school student who’s just starting out and thinks, ah, I want to get into AI?

ZHOU: Yeah, I mean certainly there’s a lot of excitement over AI, and it would be great for high school students to actually have some firsthand experience. And I think it’s their world in the future. Because they probably can imagine a lot of things from scratch. I think they probably have the opportunity to disrupt a lot of the things that we take for granted today. So I think just use their imagination. And I don’t think we have really good advice for the young generation. It’s going to be their creativity and their imagination. And AI is definitely going to empower them to do something that’s going to be amazing.

STRICKLAND: Something that we probably can’t even imagine.

ZHOU: Right.

STRICKLAND: Yeah.

ZHOU: I think so.

STRICKLAND: I like that. So as we close, I’m hoping you can look ahead and talk about what excites you most about the potential of AI and systems working together, but also if you have any concerns, what concerns you most?

ZHOU: Yeah, I think in terms of AI systems, I’m certainly pretty excited about what we can do together, you know, with a combination of AI and systems. There are a lot of low-hanging fruit, and there are also a lot of potential grand challenges that we can actually take on. I mentioned a couple in this morning’s talk. And certainly, you know, we also want to look at the risks that could happen, especially when systems and AI start to evolve together. And this is also an area where having some sort of trust foundation is very important so we can have some assurance of the kind of system or AI system that we are going to build. And this is actually fundamental in how we think about trust in systems. And I think that concept can be very useful for us to guard against unintended consequences or unintended issues.

[MUSIC]

STRICKLAND: Well, Lidong Zhou, thank you so much for joining us on the podcast. I really enjoyed the conversation.

ZHOU: It’s such a pleasure, Eliza.

STRICKLAND: And to our listeners, thanks for tuning in. If you want to learn more about research at Microsoft, you can check out the Microsoft Research website at Microsoft.com/research. Until next time.

[MUSIC FADES]


Simplify multimodal generative AI with Amazon Bedrock Data Automation

Developers face significant challenges when using foundation models (FMs) to extract data from unstructured assets. This data extraction process requires carefully identifying models that meet the developer’s specific accuracy, cost, and feature requirements. Additionally, developers must invest considerable time optimizing price performance through fine-tuning and extensive prompt engineering. Managing multiple models, implementing safety guardrails, and adapting outputs to align with downstream system requirements can be difficult and time consuming.

Amazon Bedrock Data Automation in public preview helps address these and other challenges. This new capability from Amazon Bedrock offers a unified experience for developers of all skillsets to easily automate the extraction, transformation, and generation of relevant insights from documents, images, audio, and videos to build generative AI–powered applications. With Amazon Bedrock Data Automation, customers can fully utilize their data by extracting insights from their unstructured multimodal content in a format compatible with their applications. Amazon Bedrock Data Automation’s managed experience, ease of use, and customization capabilities help customers deliver business value faster, eliminating the need to spend time and effort orchestrating multiple models, engineering prompts, or stitching together outputs.

In this post, we demonstrate how to use Amazon Bedrock Data Automation in the AWS Management Console and the AWS SDK for Python (Boto3) for media analysis and intelligent document processing (IDP) workflows.

Amazon Bedrock Data Automation overview

You can use Amazon Bedrock Data Automation to generate standard outputs and custom outputs. Standard outputs are modality-specific default insights, such as video summaries that capture key moments, visual and audible toxic content, explanations of document charts, graph figure data, and more. Custom outputs use customer-defined blueprints that specify output requirements using natural language or a schema editor. The blueprint includes a list of fields to extract, data format for each field, and other instructions, such as data transformations and normalizations. This gives customers full control of the output, making it easy to integrate Amazon Bedrock Data Automation into existing applications.

Using Amazon Bedrock Data Automation, you can build powerful generative AI applications and automate use cases such as media analysis and IDP. Amazon Bedrock Data Automation is also integrated with Amazon Bedrock Knowledge Bases, making it easier for developers to generate meaningful information from their unstructured multimodal content to provide more relevant responses for Retrieval Augmented Generation (RAG).

Customers can get started with standard outputs for all four modalities (documents, images, videos, and audio) and custom outputs for documents and images. Custom outputs for video and audio will be supported when the capability is generally available.

Amazon Bedrock Data Automation for images, audio, and video

To take a media analysis example, suppose that customers in the media and entertainment industry are looking to monetize long-form content, such as TV shows and movies, through contextual ad placement. To deliver the right ads at the right video moments, you need to derive meaningful insights from both the ads and the video content. Amazon Bedrock Data Automation enables your contextual ad placement application by generating these insights. For instance, you can extract valuable information such as video summaries, scene-level summaries, content moderation concepts, and scene classifications based on the Interactive Advertising Bureau (IAB) taxonomy.

To get started with deriving insights with Amazon Bedrock Data Automation, you can create a project where you can specify your output configuration using the AWS Management Console, AWS Command Line Interface (AWS CLI), or API.

To create a project on the Amazon Bedrock console, follow these steps:

  1. Expand the Data Automation dropdown menu in the navigation pane and select Projects, as shown in the following screenshot.
  2. From the Projects console, create a new project and provide a project name, as shown in the following screenshot.
  3. From within the project, choose Edit, as shown in the following screenshot, to specify or modify an output configuration. Standard output is the default way of interacting with Amazon Bedrock Data Automation, and it can be used with audio, documents, images and videos, where you can have one standard output configuration per data type for each project.
  4. For customers who want to analyze images and videos for media analysis, standard output can be used to generate insights such as image summary, video scene summary, and scene classifications with IAB taxonomy. You can select the image summarization, video scene summarization, and IAB taxonomy checkboxes from the Standard output tab and then choose Save changes to finish configuring your project, as shown in the following screenshot.
  5. To test the standard output configuration using your media assets, choose Test, as shown in the following screenshot.

The next example uses the project to generate insights for a travel ad.

  1. Upload an image, then choose Generate results, as shown in the following screenshot, for Amazon Bedrock Data Automation to invoke an inference request.
  2. Amazon Bedrock Data Automation will process the uploaded file based on the project’s configuration, automatically detecting that the file is an image and then generating a summary and IAB categories for the travel ad.
  3. After you have generated insights for the ad image, you can generate video insights to determine the best video scene for effective ad placement. In the same project, upload a video file and choose Generate results, as shown in the following screenshot.

Amazon Bedrock Data Automation will detect that the file is a video and will generate insights for the video based on the standard output configuration specified in the project, as shown in the following screenshot.

These insights from Amazon Bedrock Data Automation can help you effectively place relevant ads in your video content, which can help improve content monetization.

Intelligent document processing with Amazon Bedrock Data Automation

You can use Amazon Bedrock Data Automation to automate IDP workflows at scale, without needing to orchestrate complex document processing tasks such as classification, extraction, normalization, or validation.

To take a mortgage example, a lender wants to automate the processing of a mortgage lending packet to streamline their IDP pipeline and improve the accuracy of loan processing. Amazon Bedrock Data Automation simplifies the automation of complex IDP tasks such as document splitting, classification, data extraction, output format normalization, and data validation. Amazon Bedrock Data Automation also incorporates confidence scores and visual grounding of the output data to mitigate hallucinations and help improve result reliability.

For example, you can generate custom output by defining blueprints, which specify output requirements using natural language or a schema editor, to process multiple file types in a single, streamlined API. Blueprints can be created using the console or the API, and you can use a catalog blueprint or create a custom blueprint for documents and images.

For all modalities, this workflow consists of three main steps: creating a project, invoking the analysis, and retrieving the results.

The following solution walks you through a simplified mortgage lending process with Amazon Bedrock Data Automation using the AWS SDK for Python (Boto3), which is straightforward to integrate into an existing IDP workflow.

Prerequisites

Before you invoke the Amazon Bedrock API, make sure you have the following:

Create custom blueprint

In this example, you have the lending packet, as shown in the following image, which contains three documents: a pay stub, a W-2 form, and a driver’s license.

Amazon Bedrock Data Automation has sample blueprints for these three documents that define commonly extracted fields. However, you can also customize Amazon Bedrock Data Automation to extract specific fields from each document. For example, you can extract only the gross pay and net pay from the pay stub by creating a custom blueprint.

To create a custom blueprint using the API, you can use the CreateBlueprint operation using the Amazon Bedrock Data Automation Client. The following example shows the gross pay and net pay being defined as properties passed to CreateBlueprint, to be extracted from the lending packet:

import json
import boto3

# Client for the Amazon Bedrock Data Automation control plane (blueprints and projects)
bedrock_data_automation_client = boto3.client('bedrock-data-automation')

bda_create_blueprint_response = bedrock_data_automation_client.create_blueprint(
    blueprintName='CUSTOM_PAYSLIP_BLUEPRINT',
    type='DOCUMENT',
    blueprintStage='LIVE',
    schema=json.dumps({
        '$schema': 'http://json-schema.org/draft-07/schema#',
        'description': 'default',
        'documentClass': 'default',
        'type': 'object',
        'properties': {
            'gross_pay_this_period': {
                'type': 'number',
                'inferenceType': 'extractive',
                'description': 'The gross pay for this pay period from the Earnings table'
            },
            'net_pay': {
                'type': 'number',
                'inferenceType': 'extractive',
                'description': 'The net pay for this pay period from the bottom of the document'
            }
        }
    }),
)

The CreateBlueprint response returns the blueprintARN for the pay stub’s custom blueprint:

'blueprintArn: arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:blueprint/<BLUEPRINT_ID>'

Configure Amazon Bedrock Data Automation project

To begin processing files using blueprints with Amazon Bedrock Data Automation, you first need to create a data automation project. To process a multiple-page document containing different file types, you can configure a project with different blueprints for each file type.

You can use Amazon Bedrock Data Automation to apply multiple document blueprints within one project, so different types of documents can be processed in the same project, each with its own custom extraction logic.

When using the API to create a project, you invoke the CreateDataAutomationProject operation. The following is an example of how you can configure custom output using the custom blueprint for the pay stub and the sample blueprints for the W-2 and driver’s license:

bda_stage = 'LIVE'  # project stage; matches the blueprintStage used above

bda_bedrock_automation_create_project_response = bedrock_data_automation_client.create_data_automation_project(
    projectName='TEST_PROJECT',
    projectDescription='test BDA project',
    projectStage=bda_stage,
    standardOutputConfiguration={
        'document': {
            'outputFormat': {
                'textFormat': {
                    'types': ['PLAIN_TEXT']
                },
                'additionalFileFormat': {
                    'state': 'ENABLED',
                }
            }
        },
    },
    customOutputConfiguration={
        'blueprints': [
          {
              'blueprintArn': 'arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:blueprint/<BLUEPRINT_ID>'
          },
          {
              'blueprintArn': 'arn:aws:bedrock:us-west-2:aws:blueprint/bedrock-data-automation-public-w2-form'
          },
          {
              'blueprintArn': 'arn:aws:bedrock:us-west-2:aws:blueprint/bedrock-data-automation-public-us-driver-license'
          },
        ],
    },
    overrideConfiguration={
        'document': {
            'splitter': {
                'state': 'ENABLED'
            }
        }
    },
)

The CreateProject response returns the projectARN for the project:

'arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:data-automation-project/<PROJECT_ID>'

To process different types of documents using multiple document blueprints in a single project, Amazon Bedrock Data Automation uses a splitter configuration, which must be enabled through the API. The following is the override configuration for the splitter, and you can refer to the Boto3 documentation for more information:

overrideConfiguration={
    'document': {
        'splitter': {
            'state': 'ENABLED' | 'DISABLED'
        }
    }
},

Upon creation, the API validates the input configuration and creates a new project, returning the projectARN, as shown in the following example.

'arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:data-automation-project/<PROJECT_ID>'

Test the solution

Now that the blueprint and project setup is complete, the InvokeDataAutomationAsync operation from the Amazon Bedrock Data Automation runtime can be used to start processing files. This API call initiates the asynchronous processing of files in an S3 bucket, in this case the lending packet, using the configuration defined in the project by passing the project’s ARN:

# Client for the Bedrock Data Automation runtime, used to run inference
bedrock_data_automation_runtime_client = boto3.client('bedrock-data-automation-runtime')

bda_invoke_data_automation_async_response = bedrock_data_automation_runtime_client.invoke_data_automation_async(
    inputConfiguration={'s3Uri': '<S3_URI>'},
    outputConfiguration={'s3Uri': '<S3_URI>'},
    dataAutomationConfiguration={
        'dataAutomationArn': 'arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:data-automation-project/<PROJECT_ID>',
        'stage': 'LIVE'
    }
)

InvokeDataAutomationAsync returns the invocationARN:

'arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:data-automation-invocation/<INVOCATION_ID>'

GetDataAutomationStatus can be used to view the status of the invocation, using the InvocationARN from the previous response:

bda_get_status_response = bedrock_data_automation_runtime_client.get_data_automation_status(
    invocationArn='arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:data-automation-invocation/<INVOCATION_ID>'
)
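
If you prefer to poll for completion in code rather than checking repeatedly by hand, you can wrap GetDataAutomationStatus in a simple loop. The following is a minimal sketch; the 'status' field name and the 'Created' and 'InProgress' values are assumptions based on typical asynchronous Bedrock APIs, so confirm them against the Boto3 documentation for your SDK version.

import time

# Minimal polling sketch; field names and status values are assumptions,
# so check the GetDataAutomationStatus response shape in the Boto3 docs.
invocation_arn = 'arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:data-automation-invocation/<INVOCATION_ID>'

while True:
    status_response = bedrock_data_automation_runtime_client.get_data_automation_status(
        invocationArn=invocation_arn
    )
    status = status_response.get('status')
    if status not in ('Created', 'InProgress'):
        break  # terminal state reached (for example, Success or an error state)
    time.sleep(10)  # wait before checking again

print(f'Invocation finished with status: {status}')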

When the job is complete, view the results in the S3 bucket used in the outputConfiguration by navigating to the ~/JOB_ID/0/custom_output/ folder.
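
You can also fetch those results programmatically. The following is a minimal sketch that lists and loads the JSON files under the custom output prefix; the bucket name and the exact key layout are placeholders, so adjust them to match your outputConfiguration and job ID.

import json
import boto3

s3_client = boto3.client('s3')

# Placeholders: use the bucket from your outputConfiguration and your job ID
output_bucket = '<OUTPUT_BUCKET>'
output_prefix = '<JOB_ID>/0/custom_output/'

response = s3_client.list_objects_v2(Bucket=output_bucket, Prefix=output_prefix)
for obj in response.get('Contents', []):
    body = s3_client.get_object(Bucket=output_bucket, Key=obj['Key'])['Body'].read()
    result = json.loads(body)  # one result document per processed segment
    print(obj['Key'])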

From the following sample output, Amazon Bedrock Data Automation associated the pay stub file with the custom pay stub blueprint with a high level of confidence:

'matched_blueprint': {
    'arn': '<BLUEPRINT_ARN>', 'name': 'CUSTOM_PAYSLIP_BLUEPRINT', 'confidence': 0.99959725
}

Using the matched blueprint, Amazon Bedrock Data Automation was able to accurately extract each field defined in the blueprint:

'inference_result': {
    'net_pay': 291.9, 'gross_pay_this_period': 452.43
}

Additionally, Amazon Bedrock Data Automation returns confidence scores and bounding box information for each field:

'explainability_info': [{
    'net_pay': {'success': true, 'confidence': 0.96484375, 'geometry': [{'boundingBox': ...

This example demonstrates how customers can use Amazon Bedrock Data Automation to streamline and automate an IDP workflow. Amazon Bedrock Data Automation automates complex document processing tasks such as data extraction, normalization, and validation. It helps reduce operational complexity and improve processing efficiency, so lenders can handle higher loan processing volumes, minimize errors, and drive operational excellence.

Cleanup

When you’re finished evaluating this feature, delete the S3 bucket and any objects to avoid any further charges.
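
If the bucket was created only for this walkthrough, a short cleanup sketch like the following removes the objects and then the bucket itself; the bucket name is a placeholder.

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('<OUTPUT_BUCKET>')  # placeholder bucket name

bucket.objects.all().delete()  # delete every object first
bucket.delete()                # then delete the now-empty bucket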

Summary

Customers can get started with Amazon Bedrock Data Automation, which is available in public preview in the US West (Oregon) AWS Region. Learn more about Amazon Bedrock Data Automation and how to automate the generation of accurate information from unstructured content for building generative AI–based applications.


About the authors

Ian Lodge is a Solutions Architect at AWS, helping ISV customers solve their architectural, operational, and cost optimization challenges. Outside of work, he enjoys spending time with his family, playing ice hockey, and woodworking.

Alex Pieri is a Solutions Architect at AWS who works with retail customers to plan, build, and optimize their AWS cloud environments. He specializes in helping customers build enterprise-ready generative AI solutions on AWS.

Raj Pathak is a Principal Solutions Architect and technical advisor to Fortune 50 and mid-sized FSI (banking, insurance, capital markets) customers across Canada and the United States. Raj specializes in machine learning with applications in generative AI, natural language processing, intelligent document processing, and MLOps.

PromptWizard: The future of prompt optimization through feedback-driven self-evolving prompts

The challenge of effective prompting

AI is reshaping industries—from education to healthcare—thanks to advancements in large language models (LLMs). These models rely on prompts, carefully crafted inputs that guide them to produce relevant and meaningful outputs. While the impact of prompts is profound, creating prompts that can help with complex tasks is a time-intensive and expertise-heavy process, often involving months of trial and error. 

This challenge grows as new tasks arise and models evolve rapidly, making manual methods for prompt engineering increasingly unsustainable. The question then becomes: How can we make prompt optimization faster, more accessible, and more adaptable across diverse tasks? 

To address this challenge, we developed PromptWizard (PW), a research framework that automates and streamlines the process of prompt optimization. We are open sourcing the PromptWizard codebase to foster collaboration and innovation within the research and development community.

Introducing PromptWizard

PromptWizard (PW) is designed to automate and simplify prompt optimization. It combines iterative feedback from LLMs with efficient exploration and refinement techniques to create highly effective prompts within minutes.

PromptWizard optimizes both the instruction and the in-context learning examples. Central to PW is its self-evolving and self-adaptive mechanism, where the LLM iteratively generates, critiques, and refines prompts and examples in tandem. This process ensures continuous improvement through feedback and synthesis, achieving a holistic optimization tailored to the specific task at hand. By evolving both instructions and examples simultaneously, PW ensures significant gains in task performance. 

Three key insights behind PromptWizard:

  • Feedback-driven refinement: At its core, PW leverages an iterative feedback loop where the LLM generates, critiques, and refines its own prompts and examples. This continuous improvement mechanism ensures that each iteration is better than the last, leading to highly effective prompts and examples. 
  • Joint optimization and synthesis of diverse examples: PW generates synthetic examples that are not only robust and diverse but also task-aware. By optimizing prompts and examples together, it ensures they work in tandem to address specific task requirements effectively. 
  • Self-generated chain-of-thought (CoT) steps: Incorporating CoT reasoning improves the problem-solving capabilities of the model. By using selected few-shot examples, PW generates a detailed reasoning chain for each example, facilitating nuanced and step-by-step problem-solving approaches.
Figure 1. Overview of PromptWizard

How PromptWizard works

PromptWizard begins with a user input: a problem description, an initial prompt instruction, and a few training examples that serve as a foundation for the task at hand.

Its output is a refined, optimized set of prompt instructions paired with carefully curated in-context few-shot examples. These outputs are enriched with detailed reasoning chains, task intent, and an expert profile that bridges human-like reasoning with the AI’s responses. 

Stage 1: Refinement of prompt instruction

The first stage focuses on refining the task instructions of a prompt. PromptWizard generates multiple candidate instructions, evaluates them using feedback from the LLM, and iteratively synthesizes improved versions. This process balances exploration—trying diverse ideas—and exploitation—refining the most promising ones.

For example, if an initial instruction yields suboptimal results, PW incorporates feedback to identify its shortcomings and generates an improved version. Over three to five iterations, this iterative cycle ensures that the instruction converges to an optimal state. 

Figure 2. Refinement of prompt instruction
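
The sketch below illustrates the explore-critique-synthesize loop described above in schematic Python. It is not the PromptWizard API: llm stands for any chat-completion callable and score for any task-specific evaluator over a handful of training examples.

def refine_instruction(llm, score, task, instruction, examples, iterations=5, n_variants=4):
    """Schematic explore-critique-synthesize loop; not the actual PromptWizard API."""
    best, best_score = instruction, score(instruction, examples)
    for _ in range(iterations):
        # Explore: generate several rewrites of the current best instruction.
        variants = [llm(f"Task: {task}\nRewrite this instruction differently:\n{best}")
                    for _ in range(n_variants)]
        # Critique: pick the strongest variant and ask where it still falls short.
        top = max(variants, key=lambda v: score(v, examples))
        critique = llm(f"Task: {task}\nInstruction:\n{top}\nList its weaknesses.")
        # Synthesize (exploit): fold the critique back into an improved instruction.
        improved = llm(f"Improve the instruction below using this feedback.\n"
                       f"Feedback: {critique}\nInstruction:\n{top}")
        # Keep whichever candidate scores best on the training examples.
        for candidate in (top, improved):
            candidate_score = score(candidate, examples)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
    return best

In Stage 2, the same critique-and-synthesis idea is applied jointly to the instruction and the in-context examples, as described next.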

Stage 2: Joint optimization of instructions and examples

The refined prompt obtained from Stage 1 is combined with carefully selected examples, and both are optimized together. Through the critique-and-synthesis mechanism, PromptWizard ensures alignment between the prompt and examples, simultaneously synthesizing new examples to enhance task performance.

This structured approach makes PromptWizard highly versatile, adapting to tasks as varied as solving math problems or generating creative content. 

Figure 3. Joint optimization of instructions and examples

Results

PromptWizard stands out for its feedback-driven refinement and systematic exploration, delivering exceptional results across a wide variety of tasks while maintaining computational efficiency. 

Comprehensive evaluation across tasks

PromptWizard was rigorously evaluated on over 45 tasks, spanning both general and domain-specific challenges. Benchmarked against state-of-the-art techniques—including Instinct, InstructZero, APE, PromptBreeder, EvoPrompt, DSPy, APO, and PromptAgent—PW consistently outperformed competitors in accuracy, efficiency, and adaptability. Please see detailed results in our paper.

  • Accuracy: PW consistently outperformed other methods, maintaining performance close to the best across all tasks. Figure 4 shows the performance profile curve that highlights PW’s reliability, demonstrating how frequently it achieves near-best accuracy compared to other approaches on the BigBench Instruction Induction (BBII) dataset.
  • Efficiency: Beyond accuracy, PW demonstrates its computational efficiency. Unlike many baseline methods that require extensive API calls and computational resources, PW achieves superior results with minimal overhead by striking an effective balance between exploration and exploitation. Table 1 demonstrates PW’s cost-effectiveness, with significantly reduced token usage for input and output while optimizing prompts effectively.
Figure 4. Performance Profile curve on BBII dataset
Methods          API calls    Total tokens
Instinct         1730         115k
PromptBreeder    18600        1488k
EvoPrompt        5000         400k
PW               69           24k
Table 1. Cost analysis on BBII dataset

We have also conducted numerous experiments to highlight PromptWizard’s efficacy with limited training data and smaller LLMs.

Resilience with limited data

Real-world scenarios often lack abundant training data. PW excels in such conditions, requiring as few as five examples to produce effective prompts. Across five diverse datasets, PW demonstrated an average accuracy drop of only 5% when using five examples compared to 25 examples—highlighting its adaptability and efficiency (see Table 2). 

Datasets    5 Examples    25 Examples
MMLU        80.4          89.5
GSM8k       94            95.4
Ethos       86.4          89.4
PubMedQA    68            78.2
MedQA       80.4          82.9
Average     81.9          87
Table 2. PW’s performance with varying number of examples

Leveraging smaller models for optimization

PromptWizard also reduces computational costs by using smaller LLMs for prompt generation, reserving more powerful models for inference. For example, using Llama-70B for prompt generation resulted in negligible performance differences compared to GPT-4, while significantly lowering resource usage (see Table 3).

Dataset    Prompt Gen: Llama-70B    Prompt Gen: GPT4
GSM8k      94.6                     95.4
Ethos      89.2                     89.4
Average    91.9                     92.4
Table 3. Performance with smaller LLMs for prompt generation 

PromptWizard shows that effective prompts combine optimized instructions refined through iterative feedback, thoughtfully chosen in-context examples, and a modular design that incorporates expert knowledge and task-specific intent. This approach enables the framework to handle a broad range of tasks, from simple to highly complex, with exceptional efficiency and flexibility.

 Whether you are a researcher addressing cutting-edge challenges or an organization looking to streamline workflows, PromptWizard provides a practical, scalable, and impactful solution for enhancing model performance.

How TUI uses Amazon Bedrock to scale content creation and enhance hotel descriptions in under 10 seconds

TUI Group is one of the world’s leading global tourism services, providing 21 million customers with an unmatched holiday experience in 180 regions. TUI Group covers the end-to-end tourism chain with over 400 owned hotels, 16 cruise ships, 1,200 travel agencies, and 5 airlines covering all major holiday destinations around the globe. At TUI, crafting high-quality content is a crucial component of its promotional strategy.

The TUI content teams are tasked with producing high-quality content for its websites, including product details, hotel information, and travel guides, often using descriptions written by hotel and third-party partners. This content needs to adhere to TUI’s tone of voice, which is essential to communicating the brand’s distinct personality. But as its portfolio expands with more hotels and offerings, scaling content creation has proven challenging. This presents an opportunity to augment and automate the existing content creation process using generative AI.

In this post, we discuss how we used Amazon SageMaker and Amazon Bedrock to build a content generator that rewrites marketing content following specific brand and style guidelines. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. Amazon SageMaker helps data scientists and machine learning (ML) engineers build FMs from scratch, evaluate and customize FMs with advanced techniques, and deploy FMs with fine-grain controls for generative AI use cases that have stringent requirements on accuracy, latency, and cost.

Through experimentation, we found that following a two-phase approach worked best to make sure that the output aligned to TUI’s tone of voice requirements. The first phase was to fine-tune a smaller large language model (LLM) on a large corpus of data. The second phase used a different LLM for post-processing. Through fine-tuning on static data, we generate content that mimics the TUI brand voice in a way that could not be captured through prompt engineering. Employing a second model with few-shot examples helped verify the output adhered to specific formatting and grammatical rules. The latter uses a more dynamic dataset, which we can use to adjust the output quickly in the future for different brand requirements. Overall, this approach resulted in higher-quality content and allowed TUI to improve content quality at a higher velocity.
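
As a rough illustration of this two-phase flow, the sketch below calls a fine-tuned Llama model behind a hypothetical SageMaker endpoint and then passes the draft to Claude 2 on Amazon Bedrock for reformatting. The endpoint name, payload fields, and prompt wording are assumptions rather than the production implementation, and the response schema of the first call depends on how the model was deployed.

import json
import boto3

sagemaker_runtime = boto3.client('sagemaker-runtime')
bedrock_runtime = boto3.client('bedrock-runtime')

def rewrite_hotel_description(source_text: str) -> str:
    # Phase 1: the fine-tuned Llama model (hosted on a SageMaker endpoint)
    # rewrites the partner-supplied copy in TUI's tone of voice.
    # '<LLAMA_ENDPOINT_NAME>' and the payload/response fields are placeholders.
    llama_response = sagemaker_runtime.invoke_endpoint(
        EndpointName='<LLAMA_ENDPOINT_NAME>',
        ContentType='application/json',
        Body=json.dumps({'inputs': source_text, 'parameters': {'max_new_tokens': 400}}),
    )
    draft = json.loads(llama_response['Body'].read())[0]['generated_text']  # adjust to your container's schema

    # Phase 2: Claude 2 on Amazon Bedrock applies formatting and grammar rules
    # (illustrative instructions only, not TUI's actual post-processing prompt).
    prompt = (
        "\n\nHuman: Rewrite the hotel description below using British English, "
        "writing numbers one to nine in words, and avoiding abbreviations. "
        f"Keep the meaning unchanged.\n\n{draft}\n\nAssistant:"
    )
    claude_response = bedrock_runtime.invoke_model(
        modelId='anthropic.claude-v2',
        body=json.dumps({'prompt': prompt, 'max_tokens_to_sample': 500}),
    )
    return json.loads(claude_response['body'].read())['completion']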

Solution overview

The architecture consists of a few key components:

  • LLM models – We evaluated different approaches and found that a two-model solution performed the best. This consists of a fine-tuned Meta Llama model to generate a description for the given hotel and Anthropic’s Claude model to reformat its output. Fine-tuning and hosting the Meta Llama 2 model was done on Amazon SageMaker, and Anthropic’s Claude 2 was consumed from Amazon Bedrock through API calls.
  • Orchestration – We created a state machine using AWS Step Functions to make calls in a batch format to the two LLMs and fetch the search engine optimization (SEO) score for the generated content from a third-party API. If the SEO content score is above a defined threshold (80%), the generated content is stored in an Amazon DynamoDB table and can later be reviewed by the content team directly in the front-end UI. Through this process, we maintain and monitor content quality at scale.
  • Human in the loop feedback – We developed a custom React front-end application to gather feedback from the content team to facilitate continuous improvement and future model fine-tuning. You can use the feedback to fine-tune a base model on SageMaker using reinforcement learning from human feedback (RLHF) to improve performance.

The following diagram is the high-level architecture of the solution.

Architecture Diagram
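
As a minimal sketch of the quality gate in the orchestration step above, the function below stores a generated description in DynamoDB only when its SEO score clears the 80% threshold. The table name, key schema, and status value are placeholders.

from decimal import Decimal

import boto3

dynamodb = boto3.resource('dynamodb')
content_table = dynamodb.Table('<GENERATED_CONTENT_TABLE>')  # placeholder table name
SEO_THRESHOLD = 80  # threshold from the orchestration step above

def store_if_seo_passes(hotel_id: str, description: str, seo_score: float) -> bool:
    """Persist content for human review only when the SEO score clears the threshold."""
    if seo_score < SEO_THRESHOLD:
        return False
    content_table.put_item(Item={
        'hotel_id': hotel_id,                  # assumed partition key
        'description': description,
        'seo_score': Decimal(str(seo_score)),  # DynamoDB numbers must be Decimal, not float
        'status': 'PENDING_REVIEW',            # placeholder review status
    })
    return True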

Prompt engineering

We started by crafting prompts that include the most common issues identified by the TUI content team, including tone of voice, accuracy, length, and grammar. We provided the LLM with a few examples of curated content within the context window. Although the generated output followed the guidance, the writing style didn’t meet TUI’s tone of voice requirements.

 Example prompt:

You are an experienced British copywriter for TUI. TUI is a world-leading travel company. You are an expert in generating hotel descriptions, based on TUI’s tone of voice. TUI's tone of voice can be described as upbeat, enthusiastic, and promotional. Avoid all the words in the following list: {banned words}
Write at most 100 words.
Your hotel descriptions must follow TUI's tone of voice and apply SEO guidelines.
These are some good examples. You should mimic below.
{examples}.
Human: {input}
Assistant:

Here is an example of the results we received.

Original text:
Boutique interiors and a great location combine at this small but perfectly formed hotel. It may be small but this 15-room hotel packs a punch with its interior design. The immaculate pool is set in an open atrium area. And there’s funky cocktail bar and lounge, too. The location is every bit as impressive as the styling. It’s about a 10-minute walk to the beach and less than half that to the centre of Thassos Town.

Generated text:
This petite hotel brings boutique style to Thassos Town. Its immaculate pool and funky cocktail bar create an oasis in the heart of the action. The beach is just a 10-minute walk away, so you can easily split your time between sunbathing and sightseeing. With only 15 rooms, the experience feels intimate yet worldly. The interiors marry modern design with traditional touches for a unique look. And the central location means popular attractions like the Archaeological Museum and shopping street are right on your doorstep.

To evaluate these results, we generated a dataset of 150 hotel descriptions using two different prompt versions and asked our business teams to review it. However, they flagged 98% of the output with significant issues. Firstly, Anthropic’s Claude 2 struggled to adhere to TUI’s tone of voice and branding guidelines, frequently employed abbreviations, and favored American English over British English. It also failed to follow the SEO guidelines provided and sometimes generated false information regarding hotel facilities and locations. The following image shows a list of these challenges and how the LLM handled them. Of the six challenges, the LLM met only one.

Main Challenges

Fine-tuning Llama 2 using PEFT on Amazon SageMaker JumpStart

These issues and poor feedback led us to conclude that prompt engineering alone would not adequately address the problem. As a result, we decided to pursue an alternative approach: fine-tuning a smaller large language model to rewrite the text in accordance with TUI’s tone of voice. We used a curated set of hotel descriptions written by TUI copywriters so that the model would have better alignment with our guidelines.

We selected the Meta Llama 2 model using Amazon SageMaker JumpStart, one of the top open source LLMs available at the time, and chose the 13B parameter version to apply parameter-efficient fine-tuning (PEFT), specifically using quantized low-rank adaptation (QLoRA). This technique quantizes the pre-trained model to 4 bits and adds small low-rank adapters for fine-tuning. We fine-tuned the model on a single ml.g5.4xlarge instance in about 20 hours using a relatively small dataset of around 4,500 hotels. We also tested out the Llama 2 7B and 70B models. We found that the 7B model didn’t perform well enough, and the 70B model had much higher costs without seeing significant improvement.
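
For readers who want to see what a QLoRA setup looks like in code, here is a generic sketch using the Hugging Face transformers and peft libraries. The hyperparameters are illustrative only, and the fine-tuning in this post was launched through SageMaker JumpStart rather than written by hand like this.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the base weights to 4 bits
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",            # gated model; requires access approval
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=8,                                    # rank of the low-rank adapters (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()          # only the adapter weights are trainable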

We used common natural language processing (NLP) evaluation metrics, such as perplexity for evaluation and monitoring during training, and established daily feedback loops with the content team to refine the test set. The output from the fine-tuned Meta Llama 2 13B model effectively mimicked the TUI tone of voice and adhered to most SEO guidelines, even those not specified in the prompt.
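
As a quick reminder of the metric, perplexity is the exponential of the average per-token negative log-likelihood on held-out text (lower is better). A minimal illustration:

import math

def perplexity(token_log_probs):
    """token_log_probs: natural-log probabilities of each reference token under the model."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

print(perplexity([-0.2, -0.1, -0.3, -0.25]))  # roughly 1.24 for fairly confident predictions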

For more information, refer to Fine-tune Llama 2 for text generation on Amazon SageMaker Jumpstart.

Original text:
The NH München City Süd hotel, formerly known as NH München am Ring, is located southeast of Munich. The location is ideal for reaching the highway towards Salzburg and provides good access to the airport and Exhibition Center. You can also get to the city center in 20 minutes by hopping on the S-Bahn at the Karl-Preis-Platz station. The peaceful residential location offers an excellent opportunity to relax in one of the 163 rooms. These are decorated in white with red accents. Rooms feature free Wi-Fi and electronic safes that are large enough for your laptop. If you need to work, there are spacious desks in every room and apartments with seperate living rooms are available for longer term guests. The suites also come with separate living rooms and foldaway beds are available upon request. The breakfast will get you ready for a busy day with a large spread of cold cuts, cheeses, fresh pastries, and fruit. These come accompanied with freshly brewed coffee, tea, and fresh fruit juices. For lunch and dinner, you will find many restaurant options in close proximity to the hotel. There is a sauna for relaxation and you can park your car in the garage.

Generated text:
Make yourself at home in Munich at the München City Süd. This former NH München am Ring sits just southeast of the Bavarian capital, putting both the highway to Salzburg and the airport within easy reach. The S-Bahn station Karl-Preis-Platz is a quick walk away too, so you can be in the city center in just 20 minutes.

The neighborhood here is peaceful, giving you the chance to properly relax during your stay. The hotel’s 163 rooms are decorated in calming white and red hues. Free Wi-Fi and laptop-sized safes come as standard, while spacious desks make working easy. For longer visits, apartments with separate living areas are available. The suites take things up a notch with their own living spaces and foldaway beds on request.

Fuel up each morning with a vast breakfast spread of cold cuts, cheeses, pastries and fruit. All washed down with coffee, tea and fresh juices. You’ll find many dining options close by for other meals. After a busy day, unwind in the sauna or park up in the garage.

The following image shows a list of the challenges and how the LLM handled them. Of the six challenges, the LLM met four.

Improvements

Integrating Anthropic’s Claude 2 for further improvement

To further improve the output, we introduced Anthropic’s Claude 2 using Amazon Bedrock as a final refinement step. This included converting American spelling to British spelling, writing numbers one through nine in words and larger numbers in digits, correcting typos and capitalization errors, minimizing banned words, incorporating essential TUI branding words, and adding missing hotel information. We also implemented a feedback mechanism in the UI to use data for ongoing fine-tuning in production. By using Anthropic Claude 2, we make sure that the final output applies the remaining formatting rules.

Original text Generated text

As its name suggests, Vienna Sporthotel has loads lined up for sports fans.

There are 10 tennis courts and four golf simulators here, plus, you can book a session with a coach to brush up on your skills. On top of that, there’s an outdoor pool, a well-stocked fitness room and a big sauna. To fuel up for all that exercise, you can grab a plate at the breakfast buffet.

Location-wise, this place is set up for sightseeing. Hop on a bus and you can reach the likes of The Hofburg, Belvedere Palace and Karlskirche in less than half an hour. You’re only a 10-minute stroll away from the nearest restaurants and cafes, too.

Generated text:

With a name like Vienna Sporthotel, you can expect this place to serve up plenty for fans of sport and activity. Brush up your skills on one of 10 tennis courts or in the four golf simulators, then recharge in the outdoor pool or well-equipped gym. There’s even a coach on hand to help perfect your technique.

When you’re ready to refuel, tuck into the breakfast buffet. Then get set for sightseeing – a bus ride whisks you to top Vienna attractions like The Hofburg, Belvedere Palace and Karlskirche in under 30 minutes. You’re also just a short stroll from local eateries and coffee shops.

The following image shows a list of the challenges and how the LLM handled them. The LLM met all six challenges.

Success in Target Outcomes

Key outcomes

The final architecture consists of a fine-tuned Meta Llama 2 13B model and Anthropic's Claude 2, using the strengths of each model. In a blind test, these dynamically generated hotel descriptions were rated higher than those written by humans for 75% of a sample of 50 hotels. We also integrated a third-party API to calculate SEO scores for the generated content and observed an uplift of up to 4% compared to human-written descriptions. Most significantly, the content generation process is now five times faster, enhancing our team's productivity without compromising quality or consistency. We can generate a vast number of hotel descriptions in just a few hours, a task that previously took months.

Takeaways

Moving forward, we plan to explore how this technology can address current inefficiencies and quality gaps, especially for hotels that our team hasn’t had the capacity to curate. We also plan to expand this solution to more brands and regions within the TUI portfolio, including producing content in various languages and tailoring it to meet the specific needs of different audiences.

Throughout this project, we learned a few valuable lessons:

  • Few-shot prompting is cost-effective and sufficient when you have limited examples and specific guidelines for responses (see the sketch after this list). Fine-tuning can significantly improve model performance when you need to tailor content to match a brand’s tone of voice, but it is resource-intensive and relies on static data sources that can become outdated.
  • Fine-tuning the Llama 2 70B model was much more expensive than fine-tuning the 13B model and did not result in significant improvement.
  • Incorporating human feedback and maintaining a human-in-the-loop approach is essential for protecting brand integrity and continuously improving the solution. The collaboration between TUI engineering, content, and SEO teams was crucial to the success of this project.
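
To illustrate the first point, here is a minimal sketch of few-shot prompt assembly; the guidelines and worked example are hypothetical placeholders, not TUI's actual prompt or content:

# Hypothetical sketch of few-shot prompt assembly; guidelines and the worked
# example are placeholders, not TUI's actual prompt or content.
guidelines = "Write in British English, in a warm and conversational brand tone."

few_shot_examples = [
    {
        "facts": "Hotel A: 120 rooms, rooftop pool, five minutes from the old town.",
        "description": "Make yourself at home at Hotel A, just five minutes from the old town...",
    },
]

def build_prompt(hotel_facts: str) -> str:
    """Combine guidelines, worked examples, and the new input into one prompt."""
    shots = "\n\n".join(
        f"Facts: {ex['facts']}\nDescription: {ex['description']}"
        for ex in few_shot_examples
    )
    return f"{guidelines}\n\n{shots}\n\nFacts: {hotel_facts}\nDescription:"

print(build_prompt("Hotel B: 80 rooms, lakeside location, spa and sauna."))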

Although Meta Llama 2 and Anthropic’s Claude 2 were the latest state-of-the-art models available at the time of our experiment, we have since seen the launch of Meta Llama 3 and Anthropic’s Claude 3.5, which we expect could significantly improve the quality of our outputs. Amazon Bedrock also now supports fine-tuning for Meta Llama 2, Cohere Command Light, and Amazon Titan models, making it simpler and faster to test models without managing infrastructure.


About the Authors

Nikolaos Zavitsanos is a Data Scientist at TUI, specializing in developing customer-facing Generative AI applications using AWS services. With a strong background in Computer Science and Artificial Intelligence, he leverages advanced technologies to enhance user experiences and drive innovation. Outside of work, Nikolaos plays water polo and competes at a national level. Connect with Nikolaos on LinkedIn.

Hin Yee Liu is a Senior Prototyping Engagement Manager at Amazon Web Services. She helps AWS customers to bring their big ideas to life and accelerate the adoption of emerging technologies. Hin Yee works closely with customer stakeholders to identify, shape and deliver impactful use cases leveraging Generative AI, AI/ML, Big Data, and Serverless technologies using agile methodologies. In her free time, she enjoys knitting, travelling and strength training. Connect with Hin Yee on LinkedIn.


AI in Your Own Words: NVIDIA Debuts NeMo Retriever Microservices for Multilingual Generative AI Fueled by Data

In enterprise AI, understanding and working across multiple languages is no longer optional — it’s essential for meeting the needs of employees, customers and users worldwide.

Multilingual information retrieval — the ability to search, process and retrieve knowledge across languages — plays a key role in enabling AI to deliver more accurate and globally relevant outputs.

Enterprises can expand their generative AI efforts into accurate, multilingual systems using NVIDIA NeMo Retriever embedding and reranking NVIDIA NIM microservices, which are now available on the NVIDIA API catalog. These models can understand information across a wide range of languages and formats, such as documents, to deliver accurate, context-aware results at massive scale.

With NeMo Retriever, businesses can now:

  • Extract knowledge from large and diverse datasets for additional context to deliver more accurate responses.
  • Seamlessly connect generative AI to enterprise data in most major global languages to expand user audiences.
  • Deliver actionable intelligence at greater scale with 35x improved data storage efficiency through new techniques such as long context support and dynamic embedding sizing.

New NeMo Retriever microservices reduce storage volume needs by 35x, enabling enterprises to process more information at once and fit large knowledge bases on a single server. This makes AI solutions more accessible, cost-effective and easier to scale across organizations.
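
As a rough sketch of how a developer might call one of the hosted embedding microservices, the snippet below uses the OpenAI-compatible endpoint exposed by the NVIDIA API catalog; the base URL, model name, and input_type field are assumptions drawn from NVIDIA's published examples and may differ for your release:

from openai import OpenAI

# Hedged sketch: base URL, model name, and the "input_type" field are assumptions
# based on NVIDIA's published examples; check the API catalog for current values.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="YOUR_NVIDIA_API_KEY",  # placeholder
)

response = client.embeddings.create(
    model="nvidia/llama-3.2-nv-embedqa-1b-v2",  # multilingual embedding NIM (assumed name)
    input=["Wo ist das nächste Hotel?", "Where is the nearest hotel?"],
    extra_body={"input_type": "query"},  # use "passage" when embedding documents at index time
)
print(len(response.data[0].embedding))  # embedding dimensionality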

Leading NVIDIA partners like DataStax, Cohesity, Cloudera, Nutanix, SAP, VAST Data and WEKA are already adopting these microservices to help organizations across industries securely connect custom models to diverse and large data sources. By using retrieval-augmented generation (RAG) techniques, NeMo Retriever enables AI systems to access richer, more relevant information and effectively bridge linguistic and contextual divides.

Wikidata Speeds Data Processing From 30 Days to Under Three Days 

In partnership with DataStax, Wikimedia has implemented NeMo Retriever to vector-embed the content of Wikipedia, serving billions of users. Vector embedding — or “vectorizing” — is a process that transforms data into a format that AI can process and understand to extract insights and drive intelligent decision-making.

Wikimedia used the NeMo Retriever embedding and reranking NIM microservices to vectorize over 10 million Wikidata entries into AI-ready formats in under three days, a process that used to take 30 days. That 10x speedup enables scalable, multilingual access to one of the world’s largest open-source knowledge graphs.

This groundbreaking project ensures real-time updates for hundreds of thousands of entries that are being edited daily by thousands of contributors, enhancing global accessibility for developers and users alike. With Astra DB’s serverless model and NVIDIA AI technologies, the DataStax offering delivers near-zero latency and exceptional scalability to support the dynamic demands of the Wikimedia community.

DataStax is using NVIDIA AI Blueprints and integrating the NVIDIA NeMo Customizer, Curator, Evaluator and Guardrails microservices into the LangFlow AI code builder to enable the developer ecosystem to optimize AI models and pipelines for their unique use cases and help enterprises scale their AI applications.

Language-Inclusive AI Drives Global Business Impact

NeMo Retriever helps global enterprises overcome linguistic and contextual barriers and unlock the potential of their data. By deploying robust AI solutions, businesses can achieve accurate, scalable and high-impact results.

NVIDIA’s platform and consulting partners play a critical role in ensuring enterprises can efficiently adopt and integrate generative AI capabilities, such as the new multilingual NeMo Retriever microservices. These partners help align AI solutions to an organization’s unique needs and resources, making generative AI more accessible and effective. They include:

  • Cloudera plans to expand the integration of NVIDIA AI in the Cloudera AI Inference Service. Currently embedded with NVIDIA NIM, Cloudera AI Inference will include NVIDIA NeMo Retriever to improve the speed and quality of insights for multilingual use cases.
  • Cohesity introduced the industry’s first generative AI-powered conversational search assistant that uses backup data to deliver insightful responses. It uses the NVIDIA NeMo Retriever reranking microservice to improve retrieval accuracy and significantly enhance the speed and quality of insights for various applications.
  • SAP is using the grounding capabilities of NeMo Retriever to add context to its Joule copilot Q&A feature and information retrieved from custom documents.
  • VAST Data is deploying NeMo Retriever microservices on the VAST Data InsightEngine with NVIDIA to make new data instantly available for analysis. This accelerates the identification of business insights by capturing and organizing real-time information for AI-powered decisions.
  • WEKA is integrating its WEKA AI RAG Reference Platform (WARRP) architecture with NVIDIA NIM and NeMo Retriever into its low-latency data platform to deliver scalable, multimodal AI solutions, processing hundreds of thousands of tokens per second.

Breaking Language Barriers With Multilingual Information Retrieval

Multilingual information retrieval is vital for enterprise AI to meet real-world demands. NeMo Retriever supports efficient and accurate text retrieval across multiple languages and cross-lingual datasets. It’s designed for enterprise use cases such as search, question-answering, summarization and recommendation systems.

Additionally, it addresses a significant challenge in enterprise AI — handling large volumes of large documents. With long-context support, the new microservices can process lengthy contracts or detailed medical records while maintaining accuracy and consistency over extended interactions.
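
To make the retrieval flow concrete, the following is an illustrative sketch of the retrieve-then-rerank pattern that embedding and reranking microservices slot into; the vectors and scoring stub are placeholders, not NeMo Retriever calls:

import numpy as np

# Illustrative retrieve-then-rerank sketch; vectors are random placeholders and
# rerank_score() is a stub standing in for a call to a reranking microservice.
rng = np.random.default_rng(0)

query_vec = rng.random(1024)              # embedding of the user query
passage_vecs = rng.random((1000, 1024))   # pre-embedded document passages

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stage 1: cheap embedding similarity shortlists candidate passages.
shortlist = sorted(range(len(passage_vecs)),
                   key=lambda i: cosine(query_vec, passage_vecs[i]),
                   reverse=True)[:20]

# Stage 2: a reranking model rescores the shortlist with full query-passage
# attention (stubbed here) to produce the final ordering.
def rerank_score(query_text: str, passage_text: str) -> float:
    return 0.0  # placeholder for the reranking microservice's relevance score

ranked = sorted(shortlist, key=lambda i: rerank_score("user query", f"passage {i}"), reverse=True)
print(ranked[:5])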

These capabilities help enterprises use their data more effectively, providing precise, reliable results for employees, customers and users while optimizing resources for scalability. Advanced multilingual retrieval tools like NeMo Retriever can make AI systems more adaptable, accessible and impactful in a globalized world.

Availability

Developers can access the multilingual NeMo Retriever microservices, and other NIM microservices for information retrieval, through the NVIDIA API catalog or with a no-cost, 90-day NVIDIA AI Enterprise developer license.

Learn more about the new NeMo Retriever microservices and how to use them to build efficient information retrieval systems.


NVIDIA Unveils Its Most Affordable Generative AI Supercomputer

NVIDIA is taking the wraps off a new compact generative AI supercomputer, offering increased performance at a lower price with a software upgrade.

The new NVIDIA Jetson Orin Nano Super Developer Kit, which fits in the palm of a hand, gives everyone from commercial AI developers to hobbyists and students gains in generative AI capabilities and performance. And the price is now $249, down from $499.

Available today, it delivers as much as a 1.7x leap in generative AI inference performance, a 70% increase in performance to 67 INT8 TOPS, and a 50% increase in memory bandwidth to 102GB/s compared with its predecessor.

Whether creating LLM chatbots based on retrieval-augmented generation, building a visual AI agent, or deploying AI-based robots, the Jetson Orin Nano Super is an ideal solution to fetch.

The Gift That Keeps on Giving

The software updates available to the new Jetson Orin Nano Super will also boost generative AI performance for those who already own the Jetson Orin Nano Developer Kit.

Jetson Orin Nano Super is suited for those interested in developing skills in generative AI, robotics or computer vision. As the AI world moves from task-specific models to foundation models, it also provides an accessible platform to turn ideas into reality.

Powerful Performance With Super for Generative AI

The enhanced performance of the Jetson Orin Nano Super delivers gains for all popular generative AI models and transformer-based computer vision.

The developer kit consists of a Jetson Orin Nano 8GB system-on-module (SoM) and a reference carrier board, providing an ideal platform for prototyping edge AI applications.

The SoM features an NVIDIA Ampere architecture GPU with tensor cores and a 6-core Arm CPU, facilitating multiple concurrent AI application pipelines and high-performance inference. It can support up to four cameras, offering higher resolution and frame rates than previous versions.

Extensive Generative AI Software Ecosystem and Community

Generative AI is evolving quickly. The NVIDIA Jetson AI lab offers immediate support for cutting-edge models from the open-source community and provides easy-to-use tutorials. Developers can also get extensive support from the broader Jetson community and inspiration from projects created by other developers.

Jetson runs NVIDIA AI software including NVIDIA Isaac for robotics, NVIDIA Metropolis for vision AI and NVIDIA Holoscan for sensor processing. Development time can be reduced with NVIDIA Omniverse Replicator for synthetic data generation and NVIDIA TAO Toolkit for fine-tuning pretrained AI models from the NGC catalog.

Jetson ecosystem partners offer additional AI and system software, developer tools and custom software development. They can also help with cameras and other sensors, as well as carrier boards and design services for product solutions.

Boosting Jetson Orin Performance for All With Super Mode

The software updates that boost generative AI performance by up to 1.7x will also be available for the Jetson Orin NX and Orin Nano series of system-on-modules.

Existing Jetson Orin Nano Developer Kit owners can upgrade the JetPack SDK to unlock boosted performance today.

Learn more about Jetson Orin Nano Super Developer Kit.

See notice regarding software product information.


Llama 3.3 70B now available in Amazon SageMaker JumpStart

Today, we are excited to announce that Llama 3.3 70B from Meta is available in Amazon SageMaker JumpStart. Llama 3.3 70B marks an exciting advancement in large language model (LLM) development, offering comparable performance to larger Llama versions with fewer computational resources.

In this post, we explore how to deploy this model efficiently on Amazon SageMaker AI, using advanced SageMaker AI features for optimal performance and cost management.

Overview of the Llama 3.3 70B model

Llama 3.3 70B represents a significant breakthrough in model efficiency and performance optimization. This new model delivers output quality comparable to Llama 3.1 405B while requiring only a fraction of the computational resources. According to Meta, this efficiency gain translates to nearly five times more cost-effective inference operations, making it an attractive option for production deployments.

The model’s sophisticated architecture builds upon Meta’s optimized version of the transformer design, featuring an enhanced attention mechanism that can help substantially reduce inference costs. During its development, Meta’s engineering team trained the model on an extensive dataset comprising approximately 15 trillion tokens, incorporating both web-sourced content and over 25 million synthetic examples specifically created for LLM development. This comprehensive training approach results in the model’s robust understanding and generation capabilities across diverse tasks.

What sets Llama 3.3 70B apart is its refined training methodology. The model underwent an extensive supervised fine-tuning process, complemented by Reinforcement Learning from Human Feedback (RLHF). This dual-approach training strategy helps align the model’s outputs more closely with human preferences while maintaining high performance standards. In benchmark evaluations against its larger counterpart, Llama 3.3 70B demonstrated remarkable consistency, trailing Llama 3.1 405B by less than 2% in 6 out of 10 standard AI benchmarks and actually outperforming it in three categories. This performance profile makes it an ideal candidate for organizations seeking to balance model capabilities with operational efficiency.

The following figure summarizes the benchmark results (source).

Getting started with SageMaker JumpStart

SageMaker JumpStart is a machine learning (ML) hub that can help accelerate your ML journey. With SageMaker JumpStart, you can evaluate, compare, and select pre-trained foundation models (FMs), including Llama 3 models. These models are fully customizable for your use case with your data, and you can deploy them into production using either the UI or SDK.

Deploying Llama 3.3 70B through SageMaker JumpStart offers two convenient approaches: using the intuitive SageMaker JumpStart UI or implementing programmatically through the SageMaker Python SDK. Let’s explore both methods to help you choose the approach that best suits your needs.

Deploy Llama 3.3 70B through the SageMaker JumpStart UI

You can access the SageMaker JumpStart UI through either Amazon SageMaker Unified Studio or Amazon SageMaker Studio. To deploy Llama 3.3 70B using the SageMaker JumpStart UI, complete the following steps:

  1. In SageMaker Unified Studio, on the Build menu, choose JumpStart models. Alternatively, on the SageMaker Studio console, choose JumpStart in the navigation pane.
  2. Search for Meta Llama 3.3 70B.
  3. Choose the Meta Llama 3.3 70B model.
  4. Choose Deploy.
  5. Accept the end-user license agreement (EULA).
  6. For Instance type, choose an instance (ml.g5.48xlarge or ml.p4d.24xlarge).
  7. Choose Deploy.

Wait until the endpoint status shows as InService. You can now run inference using the model.
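
As a minimal sketch of invoking the deployed endpoint from code, the snippet below calls the SageMaker runtime; the endpoint name is a placeholder, and the payload shape mirrors the sample input in the SDK example that follows:

import json
import boto3

# Minimal sketch of calling the UI-deployed endpoint; the endpoint name is a
# placeholder and the payload mirrors the SDK example's sample_input.
runtime = boto3.client("sagemaker-runtime")

payload = {
    "inputs": "Hello, I'm a language model,",
    "parameters": {"max_new_tokens": 128, "top_p": 0.9, "temperature": 0.6},
}

response = runtime.invoke_endpoint(
    EndpointName="meta-llama-3-3-70b-endpoint",  # placeholder; use your endpoint's name
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))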

Deploy Llama 3.3 70B using the SageMaker Python SDK

For teams looking to automate deployment or integrate with existing MLOps pipelines, you can use the following code to deploy the model using the SageMaker Python SDK:

from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder
from sagemaker.jumpstart.model import ModelAccessConfig
from sagemaker.session import Session
import logging

sagemaker_session = Session()

# Default S3 bucket and execution role for the current SageMaker session
artifacts_bucket_name = sagemaker_session.default_bucket()
execution_role_arn = sagemaker_session.get_caller_identity_arn()

# SageMaker JumpStart model ID for Llama 3.3 70B Instruct
js_model_id = "meta-textgeneration-llama-3-3-70b-instruct"

gpu_instance_type = "ml.p4d.24xlarge"

# Example request and response used to infer the endpoint's input/output schema
response = "Hello, I'm a language model, and I'm here to help you with your English."

sample_input = {
    "inputs": "Hello, I'm a language model,",
    "parameters": {"max_new_tokens": 128, "top_p": 0.9, "temperature": 0.6},
}

sample_output = [{"generated_text": response}]

schema_builder = SchemaBuilder(sample_input, sample_output)

# Build a deployable model object from the JumpStart model ID
model_builder = ModelBuilder(
    model=js_model_id,
    schema_builder=schema_builder,
    sagemaker_session=sagemaker_session,
    role_arn=execution_role_arn,
    log_level=logging.ERROR
)

model = model_builder.build()

# Deploy the model, accepting the Llama 3.3 EULA, then run a test prediction
predictor = model.deploy(model_access_configs={js_model_id: ModelAccessConfig(accept_eula=True)}, accept_eula=True)
predictor.predict(sample_input)

Set up auto scaling and scale down to zero

You can optionally set up auto scaling to scale down to zero after deployment. For more information, refer to Unlock cost savings with the new scale down to zero feature in SageMaker Inference.
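
As a rough sketch, scale to zero builds on Application Auto Scaling for SageMaker inference components; the snippet below registers a scalable target with a minimum of zero copies. The inference component name is a placeholder, and the scaling policies and alarms described in the linked post are omitted:

import boto3

# Hedged sketch: registers a SageMaker inference component with Application
# Auto Scaling and allows it to scale down to zero copies. The component name
# is a placeholder; scaling policies/alarms from the linked post are omitted.
autoscaling = boto3.client("application-autoscaling")

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId="inference-component/llama-3-3-70b-ic",  # placeholder name
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    MinCapacity=0,   # zero copies when idle
    MaxCapacity=2,
)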

Optimize deployment with SageMaker AI

SageMaker AI simplifies the deployment of sophisticated models like Llama 3.3 70B, offering a range of features designed to optimize both performance and cost efficiency. With these advanced capabilities, organizations can deploy and manage LLMs in production environments, taking full advantage of Llama 3.3 70B’s efficiency while benefiting from a streamlined deployment process and optimization tools. Default deployment through SageMaker JumpStart uses accelerated deployment, which relies on speculative decoding to improve throughput. For more information on how speculative decoding works with SageMaker AI, see Amazon SageMaker launches the updated inference optimization toolkit for generative AI.

Firstly, the Fast Model Loader revolutionizes the model initialization process by implementing an innovative weight streaming mechanism. This feature fundamentally changes how model weights are loaded onto accelerators, dramatically reducing the time required to get the model ready for inference. Instead of the traditional approach of loading the entire model into memory before beginning operations, Fast Model Loader streams weights directly from Amazon Simple Storage Service (Amazon S3) to the accelerator, enabling faster startup and scaling times.

One SageMaker inference capability is Container Caching, which transforms how model containers are managed during scaling operations. This feature eliminates one of the major bottlenecks in deployment scaling by pre-caching container images, removing the need for time-consuming downloads when adding new instances. For large models like Llama 3.3 70B, where container images can be substantial in size, this optimization significantly reduces scaling latency and improves overall system responsiveness.

Another key capability is Scale to Zero. It introduces intelligent resource management that automatically adjusts compute capacity based on actual usage patterns. This feature represents a paradigm shift in cost optimization for model deployments, allowing endpoints to scale down completely during periods of inactivity while maintaining the ability to scale up quickly when demand returns. This capability is particularly valuable for organizations running multiple models or dealing with variable workload patterns.

Together, these features create a powerful deployment environment that maximizes the benefits of Llama 3.3 70B’s efficient architecture while providing robust tools for managing operational costs and performance.

Conclusion

The combination of Llama 3.3 70B with the advanced inference features of SageMaker AI provides an optimal solution for production deployments. By using Fast Model Loader, Container Caching, and Scale to Zero capabilities, organizations can achieve both high performance and cost-efficiency in their LLM deployments.

We encourage you to try this implementation and share your experiences.


About the authors

Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

Adriana Simmons is a Senior Product Marketing Manager at AWS.

Lokeshwaran Ravi is a Senior Deep Learning Compiler Engineer at AWS, specializing in ML optimization, model acceleration, and AI security. He focuses on enhancing efficiency, reducing costs, and building secure ecosystems to democratize AI technologies, making cutting-edge ML accessible and impactful across industries.

Yotam Moss is a Software development Manager for Inference at AWS AI.
