Integrate generative AI capabilities into Microsoft Office using Amazon Bedrock

Integrate generative AI capabilities into Microsoft Office using Amazon Bedrock

Generative AI is rapidly transforming the modern workplace, offering unprecedented capabilities that augment how we interact with text and data. At Amazon Web Services (AWS), we recognize that many of our customers rely on the familiar Microsoft Office suite of applications, including Word, Excel, and Outlook, as the backbone of their daily workflows. In this blog post, we showcase a powerful solution that seamlessly integrates AWS generative AI capabilities in the form of large language models (LLMs) based on Amazon Bedrock into the Office experience. By harnessing the latest advancements in generative AI, we empower employees to unlock new levels of efficiency and creativity within the tools they already use every day. Whether it’s drafting compelling text, analyzing complex datasets, or gaining more in-depth insights from information, integrating generative AI with Office suite transforms the way teams approach their essential work. Join us as we explore how your organization can leverage this transformative technology to drive innovation and boost employee productivity.

Solution overview


Figure 1: Solution architecture overview

The solution architecture in Figure 1 shows how Office applications interact with a serverless backend hosted on the AWS Cloud through an Add-In. This architecture allows users to leverage Amazon Bedrock’s generative AI capabilities directly from within the Office suite, enabling enhanced productivity and insights within their existing workflows.

Components deep-dive

Office Add-ins

Office Add-ins allow extending Office products with custom extensions built on standard web technologies. Using AWS, organizations can host and serve Office Add-ins for users worldwide with minimal infrastructure overhead.

An Office Add-in is composed of two elements:

The code snippet below demonstrates part of a function that could run whenever a user invokes the plugin, performing the following actions:

  1. Initiate a request to the generative AI backend, providing the user prompt and available context in the request body
  2. Integrate the results from the backend response into the Word document using Microsoft’s Office JavaScript APIs. Note that these APIs use objects as namespaces, alleviating the need for explicit imports. Instead, we use the globally available namespaces, such as Word, to directly access relevant APIs, as shown in following example snippet.
// Initiate backend request (optional context)
const response = await sendPrompt({ user_message: prompt, context: selectedContext });

// Modify Word content with responses from the Backend
await Word.run(async (context) => {
  let documentBody;

  // Target for the document modifications
  if (response.location === 'Replace') {
    documentBody = context.document.getSelection(); // active text selection
  } else {
    documentBody = context.document.body; // entire document body
  }

  // Markdown support for preserving original content layout
  // Dependencies used: React markdown
  const content = renderToString(<Markdown>{ response.content } < /Markdown>);
  const operation = documentBody.insertHtml(content, response.location);

  // set properties for the output content (font, size, color, etc.)
  operation.font.set({ name: 'Arial' });

  // flush changes to the Word document
  await context.sync();
});

Generative AI backend infrastructure

The AWS Cloud backend consists of three components:

  1. Amazon API Gateway acts as an entry point, receiving requests from the Office applications’ Add-in. API Gateway supports multiple mechanisms for controlling and managing access to an API.
  2. AWS Lambda handles the REST API integration, processing the requests and invoking the appropriate AWS services.
  3. Amazon Bedrock is a fully managed service that makes foundation models (FMs) from leading AI startups and Amazon available via an API, so you can choose from a wide range of FMs to find the model that is best suited for your use case. With Bedrock’s serverless experience, you can get started quickly, privately customize FMs with your own data, and quickly integrate and deploy them into your applications using the AWS tools without having to manage infrastructure.

LLM prompting

Amazon Bedrock allows you to choose from a wide selection of foundation models for prompting. Here, we use Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock for completions. The system prompt we used in this example is as follows:

You are an office assistant helping humans to write text for their documents.

[When preparing the answer, take into account the following text: <text>{context}</text>]
Before answering the question, think through it step-by-step within the <thinking></thinking> tags.
Then, detect the user's language from their question and store it in the form of an ISO 639-1 code within the <user_language></user_language> tags.
Then, develop your answer in the user’s language within the <response></response> tags.

In the prompt, we first give the LLM a persona, indicating that it is an office assistant helping humans. The second, optional line contains text that has been selected by the user in the document and is provided as context to the LLM. We specifically instruct the LLM to first mimic a step-by-step thought process for arriving at the answer (chain-of-thought reasoning), an effective measure of prompt-engineering to improve the output quality. Next, we instruct it to detect the user’s language from their question so we can later refer to it. Finally, we instruct the LLM to develop its answer using the previously detected user language within response tags, which are used as the final response. While here, we use the default configuration for inference parameters such as temperature, that can quickly be configured with every LLM prompt. The user input is then added as a user message to the prompt and sent via the Amazon Bedrock Messages API to the LLM.

Implementation details and demo setup in an AWS account

As a prerequisite, we need to make sure that we are working in an AWS Region with Amazon Bedrock support for the foundation model (here, we use Anthropic’s Claude 3.5 Sonnet). Also, access to the required relevant Amazon Bedrock foundation models needs to be added. For this demo setup, we describe the manual steps taken in the AWS console. If required, this setup can also be defined in Infrastructure as Code.

To set up the integration, follow these steps:

  1. Create an AWS Lambda function with Python runtime and below code to be the backend for the API. Make sure that we have Powertools for AWS Lambda (Python) available in our runtime, for example, by attaching aLambda layer to our function. Make sure that the Lambda function’s IAM role provides access to the required FM, for example:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "bedrock:InvokeModel",
                "Resource": [
                    "arn:aws:bedrock:*::foundation-model/anthropic.claude-3-5-sonnet-20240620-v1:0"
                ]
            }
        ]
    }
    

    The following code block shows a sample implementation for the REST API Lambda integration based on a Powertools for AWS Lambda (Python) REST API event handler:

    import json
    import re
    from typing import Optional
    
    import boto3
    from aws_lambda_powertools import Logger
    from aws_lambda_powertools.event_handler import APIGatewayRestResolver, CORSConfig
    from aws_lambda_powertools.logging import correlation_paths
    from aws_lambda_powertools.utilities.typing import LambdaContext
    from pydantic import BaseModel
    
    logger = Logger()
    app = APIGatewayRestResolver(
        enable_validation=True,
        cors=CORSConfig(allow_origin="http://localhost:3000"),  # for testing purposes
    )
    
    bedrock_runtime_client = boto3.client("bedrock-runtime")
    
    
    SYSTEM_PROMPT = """
    You are an office assistant helping humans to write text for their documents.
    
    {context}
    Before answering the question, think through it step-by-step within the <thinking></thinking> tags.
    Then, detect the user's language from their question and store it in the form of an ISO 639-1 code within the <user_language></user_language> tags.
    Then, develop your answer in the user's language in markdown format within the <response></response> tags.
    """
    
    class Query(BaseModel):
        user_message: str  # required
        context: Optional[str] = None  # optional
        max_tokens: int = 1000  # default value
        model_id: str = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # default value
    
    def wrap_context(context: Optional[str]) -> str:
        if context is None:
            return ""
        else:
            return f"When preparing the answer take into account the following text: <text>{context}</text>"
    
    def parse_completion(completion: str) -> dict:
        response = {"completion": completion}
        try:
            tags = ["thinking", "user_language", "response"]
            tag_matches = re.finditer(
                f"<(?P<tag>{'|'.join(tags)})>(?P<content>.*?)</(?P=tag)>",
                completion,
                re.MULTILINE | re.DOTALL,
            )
            for match in tag_matches:
                response[match.group("tag")] = match.group("content").strip()
        except Exception:
            logger.exception("Unable to parse LLM response")
            response["response"] = completion
    
        return response
    
    
    @app.post("/query")
    def query(query: Query):
        bedrock_response = bedrock_runtime_client.invoke_model(
            modelId=query.model_id,
            body=json.dumps(
                {
                    "anthropic_version": "bedrock-2023-05-31",
                    "max_tokens": query.max_tokens,
                    "system": SYSTEM_PROMPT.format(context=wrap_context(query.context)),
                    "messages": [{"role": "user", "content": query.user_message}],
                }
            ),
        )
        response_body = json.loads(bedrock_response.get("body").read())
        logger.info("Received LLM response", response_body=response_body)
        response_text = response_body.get("content", [{}])[0].get(
            "text", "LLM did not respond with text"
        )
        return parse_completion(response_text)
    
    @logger.inject_lambda_context(correlation_id_path=correlation_paths.API_GATEWAY_REST)
    def lambda_handler(event: dict, context: LambdaContext) -> dict:
        return app.resolve(event, context)
    

  2. Create an API Gateway REST API with a Lambda proxy integration to expose the Lambda function via a REST API. You can follow this tutorial for creating a REST API for the Lambda function by using the API Gateway console. By creating a Lambda proxy integration with a proxy resource, we can route requests to the resources to the Lambda function. Follow the tutorial to deploy the API and take note of the API’s invoke URL. Make sure to configure adequate access control for the REST API.

We can now invoke and test our function via the API’s invoke URL. The following example uses curl to send a request (make sure to replace all placeholders in curly braces as required), and the response generated by the LLM:

$ curl --header "Authorization: {token}" 
     --header "Content-Type: application/json" 
     --request POST 
     --data '{"user_message": "Write a 2 sentence summary about AWS."}' 
     https://{restapi_id}.execute-api.{region}.amazonaws.com/{stage_name}/query | jq .
{
 "completion": "<thinking>nTo summarize AWS in 2 sentences:n1. AWS (Amazon Web Services) is a comprehensive cloud computing platform offering a wide range of services like computing power, database storage, content delivery, and more.n2. It allows organizations and individuals to access these services over the internet on a pay-as-you-go basis without needing to invest in on-premises infrastructure.n</thinking>nn<user_language>en</user_language>nn<response>nnAWS (Amazon Web Services) is a cloud computing platform that offers a broad set of global services including computing, storage, databases, analytics, machine learning, and more. It enables companies of all sizes to access these services over the internet on a pay-as-you-go pricing model, eliminating the need for upfront capital expenditure or on-premises infrastructure management.nn</response>",
 "thinking": "To summarize AWS in 2 sentences:n1. AWS (Amazon Web Services) is a comprehensive cloud computing platform offering a wide range of services like computing power, database storage, content delivery, and more.n2. It allows organizations and individuals to access these services over the internet on a pay-as-you-go basis without needing to invest in on-premises infrastructure.",
 "user_language": "en",
 "response": "AWS (Amazon Web Services) is a cloud computing platform that offers a broad set of global services including computing, storage, databases, analytics, machine learning, and more. It enables companies of all sizes to access these services over the internet on a pay-as-you-go pricing model, eliminating the need for upfront capital expenditure or on-premises infrastructure management."
} 

If required, the created resources can be cleaned up by 1) deleting the API Gateway REST API, and 2) deleting the REST API Lambda function and associated IAM role.

Example use cases

To create an interactive experience, the Office Add-in integrates with the cloud back-end that implements conversational capabilities with support for additional context retrieved from the Office JavaScript API.

Next, we demonstrate two different use cases supported by the proposed solution, text generation and text refinement.

Text generation


Figure 2: Text generation use-case demo

In the demo in Figure 2, we show how the plug-in is prompting the LLM to produce a text from scratch. The user enters their query with some context into the Add-In text input area. Upon sending, the backend will prompt the LLM to generate respective text, and return it back to the frontend. From the Add-in, it is inserted into the Word document at the cursor position using the Office JavaScript API.

Text refinement


Figure 3: Text refinement use-case demo

In Figure 3, the user highlighted a text segment in the work area and entered a prompt into the Add-In text input area to rephrase the text segment. Again, the user input and highlighted text are processed by the backend and returned to the Add-In, thereby replacing the previously highlighted text.

Conclusion

This blog post showcases how the transformative power of generative AI can be incorporated into Office processes. We described an end-to-end sample of integrating Office products with an Add-in for text generation and manipulation with the power of LLMs. In our example, we used managed LLMs on Amazon Bedrock for text generation. The backend is hosted as a fully serverless application on the AWS cloud.

Text generation with LLMs in Office supports employees by streamlining their writing process and boosting productivity. Employees can leverage the power of generative AI to generate and edit high-quality content quickly, freeing up time for other tasks. Additionally, the integration with a familiar tool like Word provides a seamless user experience, minimizing disruptions to existing workflows.

To learn more about boosting productivity, building differentiated experiences, and innovating faster with AWS visit the Generative AI on AWS page.


About the Authors

Martin Maritsch is a Generative AI Architect at AWS ProServe focusing on Generative AI and MLOps. He helps enterprise customers to achieve business outcomes by unlocking the full potential of AI/ML services on the AWS Cloud.

Miguel Pestana is a Cloud Application Architect in the AWS Professional Services team with over 4 years of experience in the automotive industry delivering cloud native solutions. Outside of work Miguel enjoys spending its days at the beach or with a padel racket in one hand and a glass of sangria on the other.

Carlos Antonio Perea Gomez is a Builder with AWS Professional Services. He enables customers to become AWSome during their journey to the cloud. When not up in the cloud he enjoys scuba diving deep in the waters.

Read More

From innovation to impact: How AWS and NVIDIA enable real-world generative AI success

From innovation to impact: How AWS and NVIDIA enable real-world generative AI success

As we gather for NVIDIA GTC, organizations of all sizes are at a pivotal moment in their AI journey. The question is no longer whether to adopt generative AI, but how to move from promising pilots to production-ready systems that deliver real business value. The organizations that figure this out first will have a significant competitive advantage—and we’re already seeing compelling examples of what’s possible.

Consider Hippocratic AI’s work to develop AI-powered clinical assistants to support healthcare teams as doctors, nurses, and other clinicians face unprecedented levels of burnout. During a recent hurricane in Florida, their system called 100,000 patients in a day to check on medications and provide preventative healthcare guidance–the kind of coordinated outreach that would be nearly impossible to achieve manually. They aren’t just building another chatbot; they are reimagining healthcare delivery at scale.

Production-ready AI like this requires more than just cutting-edge models or powerful GPUs. In my decade working with customers’ data journeys, I’ve seen that an organization’s most valuable asset is its domain-specific data and expertise. And now leading our data and AI go-to-market, I hear customers consistently emphasize what they need to transform their domain advantage into AI success: infrastructure and services they can trust—with performance, cost-efficiency, security, and flexibility—all delivered at scale. When the stakes are high, success requires not just cutting-edge technology, but the ability to operationalize it at scale—a challenge that AWS has consistently solved for customers. As the world’s most comprehensive and broadly adopted cloud, our partnership with NVIDIA’s pioneering accelerated computing platform for generative AI amplifies this capability. It’s inspiring to see how, together, we’re enabling customers across industries to confidently move AI into production.

In this post, I will share some of these customers’ remarkable journeys, offering practical insights for any organization looking to harness the power of generative AI.

Transforming content creation with generative AI

Content creation represents one of the most visible and immediate applications of generative AI today. Adobe, a pioneer that has shaped creative workflows for over four decades, has moved with remarkable speed to integrate generative AI across its flagship products, helping millions of creators work in entirely new ways.

Adobe’s approach to generative AI infrastructure exemplifies what their VP of Generative AI, Alexandru Costin, calls an “AI superhighway”—a sophisticated technical foundation that enables rapid iteration of AI models and seamless integration into their creative applications. The success of their Firefly family of generative AI models, integrated across flagship products like Photoshop, demonstrates the power of this approach. For their AI training and inference workloads, Adobe uses NVIDIA GPU-accelerated Amazon Elastic Compute Cloud (Amazon EC2) P5en (NVIDIA H200 GPUs), P5 (NVIDIA H100 GPUs), P4de (NVIDIA A100 GPUs), and G5 (NVIDIA A10G GPUs) instances. They also use NVIDIA software such as NVIDIA TensorRT and NVIDIA Triton Inference Server for faster, scalable inference. Adobe needed maximum flexibility to build their AI infrastructure, and AWS provided the complete stack of services needed—from Amazon FSx for Lustre for high-performance storage, to Amazon Elastic Kubernetes Service (Amazon EKS) for container orchestration, to Elastic Fabric Adapter (EFA) for high-throughput networking—to create a production environment that could reliably serve millions of creative professionals.

Key takeaway

If you’re building and managing your own AI pipelines, Adobe’s success highlights a critical insight: although GPU-accelerated compute often gets the spotlight in AI infrastructure discussions, what’s equally important is the NVIDIA software stack along with the foundation of orchestration, storage, and networking services that enable production-ready AI. Their results speak for themselves—Adobe achieved a 20-fold scale-up in model training while maintaining the enterprise-grade performance and reliability their customers expect.

Pioneering new AI applications from the ground up

Throughout my career, I’ve been particularly energized by startups that take on audacious challenges—those that aren’t just building incremental improvements but are fundamentally reimagining how things work. Perplexity exemplifies this spirit. They’ve taken on a technology most of us now take for granted: search. It’s the kind of ambitious mission that excites me, not just because of its bold vision, but because of the incredible technical challenges it presents. When you’re processing 340 million queries monthly and serving over 1,500 organizations, transforming search isn’t just about having great ideas—it’s about building robust, scalable systems that can deliver consistent performance in production.

Perplexity’s innovative approach earned them membership in both AWS Activate and NVIDIA Inception—flagship programs designed to accelerate startup innovation and success. These programs provided them with the resources, technical guidance, and support needed to build at scale. They were one of the early adopters of Amazon SageMaker HyperPod, and continue to use its distributed training capabilities to accelerate model training time by up to 40%. They use a highly optimized inference stack built with NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server to serve both their search application and pplx-api, their public API service that gives developers access to their proprietary models. The results speak for themselves—their inference stack achieves up to 3.1 times lower latency compared to other platforms. Both their training and inference workloads run on NVIDIA GPU-accelerated EC2 P5 instances, delivering the performance and reliability needed to operate at scale. To give their users even more flexibility, Perplexity complements their own models with services such as Amazon Bedrock, and provides access to additional state-of-the-art models in their API. Amazon Bedrock offers ease of use and reliability, which are crucial for their team—as they note, it allows them to effectively maintain the reliability and latency their product demands.

What I find particularly compelling about Perplexity’s journey is their commitment to technical excellence, exemplified by their work optimizing GPU memory transfer with EFA networking. The team achieved 97.1% of the theoretical maximum bandwidth of 3200 Gbps and open sourced their innovations, enabling other organizations to benefit from their learnings.

For those interested in the technical details, I encourage you to read their fascinating post Journey to 3200 Gbps: High-Performance GPU Memory Transfer on AWS Sagemaker Hyperpod.

Key takeaway

For organizations with complex AI workloads and specific performance requirements, Perplexity’s approach offers a valuable lesson. Sometimes, the path to production-ready AI isn’t about choosing between self-hosted infrastructure and managed services—it’s about strategically combining both. This hybrid strategy can deliver both exceptional performance (evidenced by Perplexity’s 3.1 times lower latency) and the flexibility to evolve.

Transforming enterprise workflows with AI

Enterprise workflows represent the backbone of business operations—and they’re a crucial proving ground for AI’s ability to deliver immediate business value. ServiceNow, which terms itself the AI platform for business transformation, is rapidly integrating AI to reimagine core business processes at scale.

ServiceNow’s innovative AI solutions showcase their vision for enterprise-specific AI optimization. As Srinivas Sunkara, ServiceNow’s Vice President, explains, their approach focuses on deep AI integration with technology workflows, core business processes, and CRM systems—areas where traditional large language models (LLMs) often lack domain-specific knowledge. To train generative AI models at enterprise scale, ServiceNow uses NVIDIA DGX Cloud on AWS. Their architecture combines high-performance FSx for Lustre storage with NVIDIA GPU clusters for training, and NVIDIA Triton Inference Server handles production deployment. This robust technology platform allows ServiceNow to focus on domain-specific AI development and customer value rather than infrastructure management.

Key takeaway

ServiceNow offers an important lesson about enterprise AI adoption: while foundation models (FMs) provide powerful general capabilities, the greatest business value often comes from optimizing models for specific enterprise use cases and workflows. In many cases, it’s precisely this deliberate specialization that transforms AI from an interesting technology into a true business accelerator.

Scaling AI across enterprise applications

Cisco’s Webex team’s journey with generative AI exemplifies how large organizations can methodically transform their applications while maintaining enterprise standards for reliability and efficiency. With a comprehensive suite of telecommunications applications serving customers globally, they needed an approach that would allow them to incorporate LLMs across their portfolio—from AI assistants to speech recognition—without compromising performance or increasing operational complexity.

The Webex team’s key insight was to separate their models from their applications. Previously, they had embedded AI models into the container images for applications running on Amazon EKS, but as their models grew in sophistication and size, this approach became increasingly inefficient. By migrating their LLMs to Amazon SageMaker AI and using NVIDIA Triton Inference Server, they created a clean architectural break between their relatively lean applications and the underlying models, which require more substantial compute resources. This separation allows applications and models to scale independently, significantly reducing development cycle time and increasing resource utilization. The team deployed dozens of models on SageMaker AI endpoints, using Triton Inference Server’s model concurrency capabilities to scale globally across AWS data centers.

The results validate Cisco’s methodical approach to AI transformation. By separating applications from models, their development teams can now fix bugs, perform tests, and add features to applications much faster, without having to manage large models in their workstation memory. The architecture also enables significant cost optimization—applications remain available during off-peak hours for reliability, and model endpoints can scale down when not needed, all without impacting application performance. Looking ahead, the team is evaluating Amazon Bedrock to further improve their price-performance, demonstrating how thoughtful architecture decisions create a foundation for continuous optimization.

Key takeaway

For enterprises with large application portfolios looking to integrate AI at scale, Cisco’s methodical approach offers an important lesson: separating LLMs from applications creates a cleaner architectural boundary that improves both development velocity and cost optimization. By treating models and applications as independent components, Cisco significantly improved development cycle time while reducing costs through more efficient resource utilization.

Building mission-critical AI for healthcare

Earlier, we highlighted how Hippocratic AI reached 100,000 patients during a crisis. Behind this achievement lies a story of rigorous engineering for safety and reliability—essential in healthcare where stakes are extraordinarily high.

Hippocratic AI’s approach to this challenge is both innovative and rigorous. They’ve developed what they call a “constellation architecture”—a sophisticated system of over 20 specialized models working in concert, each focused on specific safety aspects like prescription adherence, lab analysis, and over-the-counter medication guidance. This distributed approach to safety means they have to train multiple models, requiring management of significant computational resources. That’s why they use SageMaker HyperPod for their training infrastructure, using Amazon FSx and Amazon Simple Storage Service (Amazon S3) for high-speed storage access to NVIDIA GPUs, while Grafana and Prometheus provide the comprehensive monitoring needed to provide optimal GPU utilization. They build upon NVIDIA’s low-latency inference stack, and are enhancing conversational AI capabilities using NVIDIA Riva models for speech recognition and text-to-speech translation, and are also using NVIDIA NIM microservices to deploy these models. Given the sensitive nature of healthcare data and HIPAA compliance requirements, they’ve implemented a sophisticated multi-account, multi-cluster strategy on AWS—running production inference workloads with patient data on completely separate accounts and clusters from their development and training environments. This careful attention to both security and performance allows them to handle thousands of patient interactions while maintaining precise control over clinical safety and accuracy.

The impact of Hippocratic AI’s work extends far beyond technical achievements. Their AI-powered clinical assistants address critical healthcare workforce burnout by handling burdensome administrative tasks—from pre-operative preparation to post-discharge follow-ups. For example, during weather emergencies, their system can rapidly assess heat risks and coordinate transport for vulnerable patients—the kind of comprehensive care that would be too burdensome and resource-intensive to coordinate manually at scale.

Key takeaway

For organizations building AI solutions for complex, regulated, and high-stakes environments, Hippocratic AI’s constellation architecture reinforces what we’ve consistently emphasized: there’s rarely a one-size-fits-all model for every use case. Just as Amazon Bedrock offers a choice of models to meet diverse needs, Hippocratic AI’s approach of combining over 20 specialized models—each focused on specific safety aspects—demonstrates how a thoughtfully designed ensemble can achieve both precision and scale.

Conclusion

As the technology partners enabling these and countless other customer innovations, AWS and NVIDIA’s long-standing collaboration continues to evolve to meet the demands of the generative AI era. Our partnership, which began over 14 years ago with the world’s first GPU cloud instance, has grown to offer the industry’s widest range of NVIDIA accelerated computing solutions and software services for optimizing AI deployments. Through initiatives like Project Ceiba—one of the world’s fastest AI supercomputers hosted exclusively on AWS using NVIDIA DGX Cloud for NVIDIA’s own research and development use—we continue to push the boundaries of what’s possible.

As all the examples we’ve covered illustrate, it isn’t just about the technology we build together—it’s how organizations of all sizes are using these capabilities to transform their industries and create new possibilities. These stories ultimately reveal something more fundamental: when we make powerful AI capabilities accessible and reliable, people find remarkable ways to use them to solve meaningful problems. That’s the true promise of our partnership with NVIDIA—enabling innovators to create positive change at scale. I’m excited to continue inventing and partnering with NVIDIA and can’t wait to see what our mutual customers are going to do next.

Resources

Check out the following resources to learn more about our partnership with NVIDIA and generative AI on AWS:


About the Author

Rahul Pathak is Vice President, Data and AI GTM at AWS, where he leads the global go-to-market and specialist teams helping customers create differentiated value with AWS AI capabilities such as Amazon Bedrock, Amazon Q, Amazon SageMaker, and Amazon EC2, and data services such as Amazon S3, AWS Glue, and Amazon Redshift. Rahul believes that generative AI will transform virtually every customer experience and that data is a key differentiator for customers as they build AI applications. Prior to his current role, he was Vice President, Relational Database Engines, where he led Amazon Aurora, Redshift, and DSQL. During his 13+ years at AWS, Rahul has focused on launching, building, and growing managed database and analytics services, all aimed at making it easy for customers to get value from their data. Rahul has over twenty years of experience in technology and has co-founded two companies, one focused on analytics and the other on IP geolocation. He holds a degree in Computer Science from MIT and an Executive MBA from the University of Washington.

Read More

Claimify: Extracting high-quality claims from language model outputs



While large language models (LLMs) are capable of synthesizing vast amounts of information, they sometimes produce inaccurate or unsubstantiated content. To mitigate this risk, tools like Azure AI’s Groundedness Detection can be used to verify LLM outputs.

A common strategy for fact-checking LLM-generated texts – especially complex, highly detailed outputs – is claim extraction: instead of evaluating the entire text at once, it’s broken down into simple factual statements that can be verified independently. However, the effectiveness of this approach depends on the quality of the extracted claims: if they are inaccurate or incomplete, the fact-checking results may be compromised. 

Our paper “Towards Effective Extraction and Evaluation of Factual Claims” proposes a framework for evaluating claim extraction and introduces Claimify, a novel LLM-based method that outperforms prior solutions. In this blog post, we discuss common claim extraction issues and explore Claimify’s approach.

Key issues in claim extraction

As a case study, we prompted OpenAI’s GPT-4o model to “provide an overview of challenges in emerging markets” based on a collection of news articles. Here is its output:

# Emerging market challenges
Emerging markets face a myriad of challenges that span economic, social, and environmental dimensions. These challenges are often interconnected, exacerbating the difficulties faced by these regions.

## Economic instability and inflation
Several emerging markets are grappling with severe economic instability. For instance, Argentina’s rampant inflation, with monthly rates reaching as high as 25.5%, has made many goods unobtainable and plunged the value of the currency, causing severe economic hardship. Some experts estimate that the annual inflation rate could potentially double to 300%, while others predict even higher rates.

## Food insecurity and agricultural challenges
Food insecurity is another critical issue in emerging markets. Nigeria, for example, is striving to become self-sufficient in wheat production but is hindered by climate change and violence, exacerbated by high grain prices due to the suspension of the Black Sea Grain Initiative. Climate change has played a pivotal role in creating food insecurity and economic instability in farming-dependent economies, such as Zambia and Mozambique.

## Natural disasters and public health emergencies
Public health crises, as demonstrated by the experience of Derna, Libya, remain a significant concern in emerging markets. Catastrophic flooding caused by the collapse of two dams has led to significant destruction, complicating access to healthcare. The U.N. found that the resulting contaminated water caused many residents to fall ill, highlighting the need for improved water management. Countries like Afghanistan and Sudan have experienced similar challenges to those of Libya.

## Conclusion
Addressing emerging market challenges will require comprehensive strategies that consider economic stability, food security, and public health. International cooperation and targeted interventions are crucial in fostering sustainable development in emerging markets.

We then applied a simple prompt[1] that extracted the following claims from the answer:

1. Emerging markets face economic challenges.
2. Emerging markets face social challenges.
3. Emerging markets face environmental challenges.
4. Argentina experiences severe economic instability.
5. Argentina has a monthly inflation rate as high as 25.5%.
6. Argentina’s inflation has made many goods unobtainable.
7. Argentina’s currency value has plunged.
8. Some experts estimate Argentina’s annual inflation rate could reach 300%.
9. Food insecurity is a critical issue in emerging markets.
10. Nigeria is striving to become self-sufficient in wheat production.
11. Nigeria’s wheat production is hindered by climate change.
12. Nigeria’s wheat production is hindered by violence.
13. High grain prices affect Nigeria due to the suspension of the Black Sea Grain Initiative.
14. Climate change creates food insecurity in Zambia.
15. Climate change creates food insecurity in Mozambique.
16. Climate change causes economic instability in farming-dependent economies.
17. Public health crises are a concern in emerging markets.
18. Derna, Libya, experienced catastrophic flooding.
19. The collapse of two dams caused flooding in Derna, Libya.
20. Flooding in Derna, Libya, complicated access to healthcare.
21. The U.N. found contaminated water in Derna, Libya.
22. Contaminated water in Derna, Libya, caused many residents to fall ill.
23. Afghanistan has experienced challenges similar to those in Libya.
24. Sudan has experienced challenges similar to those in Libya.
25. Addressing emerging market challenges requires comprehensive strategies.
26. Comprehensive strategies should consider economic stability.
27. Comprehensive strategies should consider food security.
28. Comprehensive strategies should consider public health.
29. International cooperation is crucial for sustainable development in emerging markets.
30. Targeted interventions are crucial for sustainable development in emerging markets.
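The baseline used above boils down to a single prompt call. Here is a minimal sketch of that pattern; the prompt wording is illustrative (not the exact footnoted prompt), and `call_llm` is a placeholder for whichever chat-completion client you use:

```python
def extract_claims(text, call_llm):
    """Baseline one-shot claim extraction: a single LLM call that
    decomposes `text` into simple factual statements, one per line.
    `call_llm` is a placeholder for any chat-completion client."""
    prompt = (
        "Break the following text into simple, self-contained factual "
        "claims, one per line:\n\n" + text
    )
    response = call_llm(prompt)
    claims = []
    for line in response.splitlines():
        # Drop blank lines and strip list markers such as "1." from the output.
        line = line.strip().lstrip("0123456789. ").strip()
        if line:
            claims.append(line)
    return claims

# Stubbed model response, for illustration; a real client would call an LLM here.
fake_llm = lambda prompt: "1. Argentina has rampant inflation.\n2. The currency plunged."
print(extract_claims("Argentina's rampant inflation...", fake_llm))
```

As the case study shows, this one-shot approach leaves the model to handle verifiability, entailment, and decontextualization all at once, which is where the errors below creep in.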

These claims appear to reflect the answer, but a closer look reveals four critical issues:

Issue 1: Some claims are not factual statements that can be verified as true or false.

Excerpt: Addressing emerging market challenges will require comprehensive strategies that consider economic stability, food security, and public health.

Claims:

  • Addressing emerging market challenges requires comprehensive strategies.
  • Comprehensive strategies should consider economic stability.
  • Comprehensive strategies should consider food security.
  • Comprehensive strategies should consider public health.

Explanation: These claims are not verifiable because they are opinions.

Issue 2: Some claims are missing or incomplete.

Excerpt: Argentina’s rampant inflation, with monthly rates reaching as high as 25.5%, has made many goods unobtainable and plunged the value of the currency, causing severe economic hardship. Some experts estimate that the annual inflation rate could potentially double to 300%, while others predict even higher rates.

Claims:

  • Argentina has a monthly inflation rate as high as 25.5%.
  • Argentina’s inflation has made many goods unobtainable.
  • Argentina’s currency value has plunged.
  • Some experts estimate Argentina’s annual inflation rate could reach 300%.

Explanation: The phrases “causing severe economic hardship” and “others predict even higher rates” are not reflected in any of the claims. The third claim also omits the fact that inflation caused the currency depreciation.

Issue 3: Some claims are inaccurate.

Excerpt: The U.N. found that the resulting contaminated water caused many residents to fall ill, highlighting the need for improved water management.

Claims:

  • The U.N. found contaminated water in Derna, Libya.
  • Contaminated water in Derna, Libya, caused many residents to fall ill.

Explanation: The first claim is inaccurate because the U.N. found the link between contaminated water and illness, not the contaminated water itself. The second claim also misrepresents the sentence since it shifts the meaning from a viewpoint of a specific entity (the U.N.) to a general assertion about the effects of contaminated water in Derna, Libya.

Issue 4: Some claims cannot be understood without additional context.

Excerpt: Countries like Afghanistan and Sudan have experienced similar challenges to those of Libya.

Claims:

  • Afghanistan has experienced challenges similar to those in Libya.
  • Sudan has experienced challenges similar to those in Libya.

Explanation: These claims cannot be understood on their own because “those” is not defined.

Introducing Claimify

The case study highlights that claim extraction is surprisingly error-prone. Our paper demonstrates that the issues identified above are common across LLM-based claim extraction methods. To minimize these errors, we created a system called Claimify[2].

Core principles

Claimify is an LLM-based claim extraction system built on the following principles:

Principle 1: The claims should capture all verifiable content in the source text and exclude unverifiable content.

Example: In the sentence “The partnership between John and Jane illustrates the importance of collaboration,” the only verifiable content is the existence of a partnership between John and Jane. The rest is subjective interpretation.

Principle 2: Each claim should be entailed (i.e., fully supported) by the source text.

Example: Consider the sentence “Governments are curtailing emissions from cars and trucks, which are the largest source of greenhouse gases from transportation.” The following claims are incorrect:

  • Cars are the largest source of greenhouse gases from transportation.
  • Trucks are the largest source of greenhouse gases from transportation.

The sentence attributes the highest emissions to cars and trucks collectively, not individually.

Principle 3: Each claim should be understandable on its own, without additional context.

Example: The claim “They will update the policy next year” is not understandable on its own because it’s unclear what “They,” “the policy,” and “next year” refer to.

Principle 4: Each claim should minimize the risk of excluding critical context.

Example: Suppose the claim “The World Trade Organization has supported trade barriers” was extracted from the sentence “An exception to the World Trade Organization’s open-market philosophy is its history of supporting trade barriers when member countries have failed to comply with their obligations.” A fact-checking system would likely classify the claim as false, since there is extensive evidence that the WTO aims to reduce trade barriers. However, if the claim had specified that the WTO has supported trade barriers “when member countries have failed to comply with their obligations,” it would likely have been classified as true. This example demonstrates that missing context can distort the fact-checking verdict.
Principle 5: The system should flag cases where ambiguity cannot be resolved.

Example: The sentence “AI has advanced renewable energy and sustainable agriculture at Company A and Company B” has two mutually exclusive interpretations:

  • AI has advanced renewable energy and sustainable agriculture at both Company A and Company B.
  • AI has advanced renewable energy at Company A and sustainable agriculture at Company B.

If the context does not clearly indicate that one of these interpretations is correct, the system should flag the ambiguity instead of picking one interpretation arbitrarily.

Implementation

Claimify accepts a question-answer pair as input and performs claim extraction in four stages, illustrated in Figure 1:

  1. Sentence splitting and context creation: The answer is split into sentences, with “context” – a configurable combination of surrounding sentences and metadata (e.g., the header hierarchy in a Markdown-style answer) – created for each sentence.
  2. Selection: An LLM identifies sentences that do not contain verifiable content. These sentences are labeled “No verifiable claims” and excluded from subsequent stages. When a sentence contains both verifiable and unverifiable components, the LLM rewrites it, retaining only the verifiable components.
  3. Disambiguation: For sentences that passed the Selection stage, an LLM detects ambiguity and determines whether it can be resolved using the context. If all ambiguity is resolvable, the LLM returns a disambiguated version of the sentence. Otherwise, the sentence is labeled “Cannot be disambiguated” and excluded from the Decomposition stage.
  4. Decomposition: For sentences that are unambiguous or were disambiguated, an LLM creates standalone claims that preserve critical context. If no claims are extracted, the sentence is labeled “No verifiable claims.”
Figure 1: Overview of Claimify’s stages
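The control flow of the four stages can be sketched as a simple pipeline. The stage functions below are placeholder callables standing in for the LLM calls described above (toy stand-ins, not the actual Claimify prompts); returning `None` signals exclusion:

```python
def claimify(answer, select, disambiguate, decompose):
    """Sketch of Claimify's four-stage control flow. `select`,
    `disambiguate`, and `decompose` stand in for the LLM calls
    described above; they return None to signal exclusion.
    (Simplified: the real system also takes the question as input.)"""
    # Stage 1: naive sentence splitting plus a window of surrounding context.
    sentences = [s.strip() + "." for s in answer.split(".") if s.strip()]
    results = []
    for i, sentence in enumerate(sentences):
        context = sentences[max(0, i - 1):i + 2]  # neighboring sentences
        # Stage 2: Selection - keep only verifiable content.
        verifiable = select(sentence, context)
        if verifiable is None:
            results.append((sentence, "No verifiable claims"))
            continue
        # Stage 3: Disambiguation - resolve ambiguity or flag it.
        resolved = disambiguate(verifiable, context)
        if resolved is None:
            results.append((sentence, "Cannot be disambiguated"))
            continue
        # Stage 4: Decomposition - standalone claims preserving critical context.
        claims = decompose(resolved, context)
        results.append((sentence, claims if claims else "No verifiable claims"))
    return results

# Toy stand-ins for the LLM stages, for illustration only.
select = lambda s, ctx: s
disambiguate = lambda s, ctx: None if "they" in s.lower() else s
decompose = lambda s, ctx: [s]
print(claimify("Rain fell. They will act.", select, disambiguate, decompose))
```

The key design point is that a sentence can exit the pipeline early at either the Selection or Disambiguation stage, so only well-understood content ever reaches Decomposition.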

Results

In our paper, we demonstrate that Claimify outperforms existing LLM-based methods[3]. Specifically, we show that: (1) 99% of claims extracted by Claimify are entailed by their source sentence, (2) Claimify strikes the best balance between including verifiable content and excluding unverifiable content, and (3) Claimify is least likely to omit context critical to the fact-checking verdict.

For the above case study on challenges in emerging markets, here are Claimify’s outputs, with source sentences preceded by a letter and claims numbered[4]:

A. Several emerging markets are grappling with severe economic instability.
1. Several emerging markets are grappling with severe economic instability.

B. For instance, Argentina’s rampant inflation, with monthly rates reaching as high as 25.5%, has made many goods unobtainable and plunged the value of the currency, causing severe economic hardship.
1. Argentina has rampant inflation.
2. The monthly inflation rates in Argentina have reached as high as 25.5%.
3. Inflation has made many goods unobtainable in Argentina.
4. Inflation has plunged the value of the currency in Argentina.
5. Inflation has caused severe economic hardship in Argentina.

C. Some experts estimate that the annual inflation rate could potentially double to 300%, while others predict even higher rates.
1. Some experts estimate that Argentina’s annual inflation rate could double to 300% in the future.
2. Some experts predict that Argentina’s annual inflation rate could be higher than 300% in the future.

D. Nigeria, for example, is striving to become self-sufficient in wheat production but is hindered by climate change and violence, exacerbated by high grain prices due to the suspension of the Black Sea Grain Initiative.
1. Nigeria is striving to become self-sufficient in wheat production.
2. Nigeria is hindered by climate change in becoming self-sufficient in wheat production.
3. Nigeria is hindered by violence in becoming self-sufficient in wheat production.
4. High grain prices exacerbate the hindrance to Nigeria’s efforts to become self-sufficient in wheat production.
5. The suspension of the Black Sea Grain Initiative is a reason for high grain prices.

E. Climate change has played a pivotal role in creating food insecurity and economic instability in farming-dependent economies, such as Zambia and Mozambique.
1. Climate change has played a role in creating food insecurity in farming-dependent economies.
2. Zambia is a farming-dependent economy where climate change has played a role in creating food insecurity.
3. Mozambique is a farming-dependent economy where climate change has played a role in creating food insecurity.
4. Climate change has played a role in creating economic instability in farming-dependent economies.
5. Zambia is a farming-dependent economy where climate change has played a role in creating economic instability.
6. Mozambique is a farming-dependent economy where climate change has played a role in creating economic instability.

F. Public health crises, as demonstrated by the experience of Derna, Libya, remain a significant concern in emerging markets.
1. Public health crises are a concern in emerging markets.
2. Derna, Libya, is an example of a public health crisis in emerging markets.

G. Catastrophic flooding caused by the collapse of two dams has led to significant destruction, complicating access to healthcare.
1. There was catastrophic flooding in Derna, Libya.
2. The flooding in Derna, Libya, was caused by the collapse of two dams.
3. The flooding in Derna, Libya, has led to significant destruction.
4. The flooding in Derna, Libya, has complicated access to healthcare.

H. Countries like Afghanistan and Sudan have experienced similar challenges to those of Libya.
1. Afghanistan has experienced challenges related to public health crises.
2. Afghanistan has experienced challenges related to catastrophic flooding.
3. Afghanistan has experienced challenges related to contaminated water.
4. Sudan has experienced challenges related to public health crises.
5. Sudan has experienced challenges related to catastrophic flooding.
6. Sudan has experienced challenges related to contaminated water.

Note that the baseline prompt extracted several claims from the sentence “The U.N. found that the resulting contaminated water caused many residents to fall ill, highlighting the need for improved water management,” but it ignored the phrase “highlighting the need for improved water management.” It also failed to capture that the contaminated water resulted from flooding, as implied by “resulting” in the original sentence.

Claimify took a different approach. First, it found two instances of ambiguity – “resulting contaminated water” and “many residents” – that it determined could be resolved using the context. Here’s an excerpt from its reasoning: “…the context specifies that the contaminated water is a result of the catastrophic flooding in Derna, Libya, and the residents are those of Derna, Libya.”

However, it also found an instance of ambiguity – “highlighting the need for improved water management” – where it concluded that the context does not definitively support a single interpretation: “The sentence could be interpreted as: (1) The U.N. found that the contaminated water caused illness and also highlighted the need for improved water management, (2) The U.N. only found that the contaminated water caused illness, while the need for improved water management is an implication or conclusion drawn by the writer. Readers … would likely fail to reach consensus about the correct interpretation of this ambiguity.” As a result, Claimify labeled the sentence “Cannot be disambiguated” at the Disambiguation stage and did not proceed to the Decomposition stage. 

To the best of our knowledge, Claimify is the first claim extraction system that identifies when the source text has multiple possible interpretations and extracts claims only when there is high confidence in the correct interpretation.

Next steps

We’re currently working on new methods for evaluating LLM-generated texts. We anticipate that the high-quality claims extracted by Claimify will help not only in verifying the veracity of LLM outputs, but also in assessing their overall quality – especially when gold-standard references are difficult to create (e.g., long-form texts where people may disagree on what defines “good” content). For example, we recently used Claimify to evaluate the comprehensiveness and diversity of answers generated by GraphRAG, showing that GraphRAG outperforms traditional Retrieval Augmented Generation (RAG) in these areas.

For an in-depth discussion of Claimify and our evaluation framework, please see our paper “Towards Effective Extraction and Evaluation of Factual Claims.”


[1] We used the “proposition chunking” prompt from NirDiamant’s RAG Techniques repository. We generated multiple responses using GPT-4o, then picked the response that was most representative of the samples.

[2] Claimify is currently used for research purposes only and is not available commercially.

[3] We benchmarked Claimify against VeriScore, DnD, SAFE, AFaCTA, and Factcheck-GPT.

[4] The outputs were generated using GPT-4o. Sentences not shown were either labeled “No verifiable claims” or “Cannot be disambiguated.”

The post Claimify: Extracting high-quality claims from language model outputs appeared first on Microsoft Research.

Read More

Metasurface: Unlocking the future of wireless sensing and communication



As the demand for faster, more reliable wireless communication continues to grow, traditional systems face limitations in efficiency and adaptability. To keep up with evolving needs, researchers are investigating new ways to manipulate electromagnetic waves to improve wireless performance.

One solution involves metasurfaces—engineered materials that can control wave propagation in unprecedented ways. By dynamically shaping and directing electromagnetic waves, they can overcome the constraints of conventional wireless systems.

Building on these capabilities, we are developing metasurfaces for a wide range of wireless application scenarios. Notably, we have developed metasurfaces that enhance low Earth orbit satellite communication, optimize acoustic sensing, and realize acoustic and mmWave imaging using commodity devices. More recently, we have designed metasurfaces that enable indoor Global Navigation Satellite System (GNSS) positioning, provide good mmWave coverage over a target environment, optimize heat distribution inside a microwave oven, and deliver directional sound to a user without headphones.

All these works, published at top networking conferences, including MobiCom 2023 & 2024, MobiSys 2024 & 2025, and NSDI 2023, demonstrate the transformative potential of metasurfaces in advancing wireless communication and sensing technologies. This blog post explores some of these technologies in more detail.



Metasurfaces optimize GNSS for accurate indoor positioning

While GNSS is widely used for outdoor positioning and navigation, its indoor performance is often hindered by signal blockage, reflection, and attenuation caused by physical obstacles. Additional technologies like Wi-Fi and Bluetooth Low Energy (BLE) are often employed to address these issues. However, these solutions require extra infrastructure, are costly, and are complicated to deploy. Accurate positioning also typically depends on specialized hardware and software on mobile devices. 

Despite these challenges, GNSS signals hold promise for accurate indoor positioning. By leveraging the vast number of available satellites, GNSS-based solutions eliminate the need for base station deployment and maintenance required by Wi-Fi and BLE systems. This approach also allows seamless integration between indoor and outdoor environments, supporting continuous positioning in scenarios like guiding smart vehicles through indoor and outdoor industrial environments. 

To explore this potential, we conducted indoor measurements and found that GNSS satellite signals can penetrate windows at different angles and reflect or diffract from surfaces like floors and ceilings, resulting in uneven signals. Metasurfaces can control structured arrays of electromagnetic signals, allowing them to capture and redirect more GNSS signals. This allows signals to enter buildings in a path parallel to the ground, achieving broader coverage. Using this capability, we developed a GNSS positioning metasurface system (GPMS) based on passive metasurface technology.

One limitation of passive metasurfaces is their lack of programmability. To overcome this and enable them to effectively guide signals from different angles and scatter them in parallel, we designed a two-layer metasurface system. As shown in Figure 1, this design ensures that electromagnetic waves from different angles follow similar emission trajectories.  

Figure 1: The GPMS two-layer metasurface structure. The metasurfaces are optimized so that their radiation output is close to the target radiation for GNSS signal input at all incidence angles.

To improve positioning accuracy, we developed new algorithms that allow signals to pass through metasurfaces, using them as anchor points. Traditional GPS positioning requires signals from at least four satellites to decode location information. In the GPMS system, illustrated in Figure 2, each deployed metasurface functions as a virtual satellite. By deploying at least three metasurfaces indoors, we achieved high-precision positioning through a triangulation algorithm.

Figure 2. Diagram of the GPMS system. Passive metasurfaces guide GNSS signals indoors, while enhanced positioning algorithms provide precise indoor positioning on mobile devices. 
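The triangulation step can be sketched as classic 2-D trilateration, under the simplifying assumption that the range to each metasurface anchor is known (an illustration only; real GNSS positioning also solves for receiver clock bias, which is why four satellites are normally needed):

```python
import math

def trilaterate(anchors, ranges):
    """2-D position from three anchor points and measured ranges.
    Subtracting the first circle equation from the other two yields
    a 2x2 linear system, solved here by Cramer's rule."""
    (x0, y0), (x1, y1), (x2, y2) = anchors
    r0, r1, r2 = ranges
    # 2*(xi - x0)*x + 2*(yi - y0)*y = r0^2 - ri^2 + xi^2 - x0^2 + yi^2 - y0^2
    a11, a12 = 2 * (x1 - x0), 2 * (y1 - y0)
    a21, a22 = 2 * (x2 - x0), 2 * (y2 - y0)
    b1 = r0**2 - r1**2 + x1**2 - x0**2 + y1**2 - y0**2
    b2 = r0**2 - r2**2 + x2**2 - x0**2 + y2**2 - y0**2
    det = a11 * a22 - a12 * a21  # nonzero when anchors are not collinear
    return ((b1 * a22 - b2 * a12) / det, (a11 * b2 - a21 * b1) / det)

# Three metasurfaces acting as virtual satellites; user truly at (4, 3).
anchors = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
ranges = [math.dist((4.0, 3.0), a) for a in anchors]
print(trilaterate(anchors, ranges))  # ~(4.0, 3.0)
```

With three non-collinear anchors the system is exactly determined, which is why the GPMS design calls for at least three metasurfaces indoors.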

To evaluate the system, we deployed the GPMS with six metasurfaces on a 10×50-meter office floor and a 15×20-meter conference hall. The results show significant improvements in signal quality and availability. C/N₀, a measure of signal-to-noise ratio, increased from 9.1 dB-Hz to 32.2 dB-Hz. The number of visible satellites increased from 3.6 to 21.5. Finally, the absolute positioning error decreased from 30.6 meters to 3.2 meters in the office and from 11.2 meters to 2.7 meters in the conference hall. These findings are promising and highlight the feasibility and advantages of GNSS-based metasurfaces for indoor positioning. 

Metasurfaces extend millimeter-wave coverage

Millimeter waves enable the high-speed, low-latency performance needed for 5G and 6G communication systems. While commercial products like 60 GHz Wi-Fi routers and mobile devices are becoming popular, their limited coverage and susceptibility to signal obstruction restrict their widespread application. 

Traditional solutions include deploying multiple millimeter-wave access points, such as routers or base stations, or placing reflective metal panels in room corners to reflect electromagnetic waves. However, these approaches are both costly and offer limited performance. Metasurfaces offer a promising alternative for improving millimeter-wave applications. Previous research has shown that programmable metasurfaces can enhance signal coverage in blind spots and significantly improve signal quality and efficiency.  

To maximize the benefits of metasurfaces, we developed the AutoMS automation service framework, shown in Figure 3. This proposed framework can optimize millimeter-wave coverage using low-cost passive metasurface design and strategic placement. 

The three main components of AutoMS can address the limitations of traditional solutions: 

  1. Automated joint optimization: AutoMS determines the optimal network deployment configuration by analyzing phase settings, metasurface placement, and access point positioning. It also refines beam-forming configurations to enhance signal coverage. By iteratively identifying and optimizing the number, size, and placement of metasurfaces, AutoMS adjusts the metasurface phase settings and the access point’s configurations to achieve optimal signal coverage. 
Figure 3. The AutoMS framework generates optimized deployment plans for passive metasurface and access points based on environment scanning results. 
  2. Fast 3D ray tracing simulator: Using hardware and software acceleration, our simulator efficiently calculates channel matrices resulting from metasurfaces with tens of thousands of elements. This simulator, capable of tracing 1.3 billion rays in just three minutes on an A100 GPU, significantly accelerates calculations for complex environments.
  3. Low-cost passive metasurface design: We designed a high-reflectivity passive metasurface with near-2π phase control and broadband compatibility for the millimeter-wave frequency band. This metasurface is compatible with low-precision, cost-effective thermoforming, enabling users to create metasurfaces at minimal cost and significantly reducing deployment expenses.

    Shown in Figure 4, users can capture the environment using existing 3D scanning apps on mobile devices, generate a 3D layout model, and upload it to the cloud. AutoMS then generates metasurface settings and placement guidelines.  

    Users can print metasurface patterns using hot stamping and customize them without affecting functionality, as millimeter waves penetrate paint and paper. 

Figure 4: The low-cost passive metasurface creation process: (1) print patterns on paper with a laser printer; (2) hot stamp aluminum foil onto the paper with a laminator; (3) tear off the aluminum foil to obtain the metallic patterns; (4) paste the patterns on the plastic sheet and aluminum board.

Evaluation using publicly available 3D layout datasets and real-world tests shows that AutoMS significantly improves millimeter-wave coverage across various scenarios. Compared to a single router setup, AutoMS increased signal strength by 12.1 dB. Onsite tests further confirmed gains of 11 dB in target areas and over 20 dB in blind spots, with signal throughput increasing from 77 Mbps to 373 Mbps. AutoMS adapts to diverse environments, ensuring reliable and flexible deployment in real-world applications. 
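For intuition, the decibel gains reported above convert to linear ratios as follows (back-of-envelope arithmetic for illustration, not figures from the paper):

```python
# Decibels to linear power ratio: ratio = 10^(dB / 10).
def db_to_ratio(db):
    return 10 ** (db / 10)

print(round(db_to_ratio(12.1), 1))  # ~16x signal power vs. a single router
print(round(db_to_ratio(20), 1))    # 100x or more in former blind spots
print(round(373 / 77, 1))           # ~4.8x throughput improvement
```

In other words, a 12.1 dB gain is roughly a sixteenfold increase in received power, and the blind-spot gains of over 20 dB exceed a hundredfold.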

Metasurfaces support uniform heating in microwave ovens 

Microwave ovens often heat unevenly, creating cold spots in food. These can allow harmful bacteria and other pathogens to survive, increasing the risk of foodborne illnesses. Uneven heating can cause eggs to burst or create “hot spots” that can scald.

Uneven heating is due to the appliance’s heating mechanism. Microwave ovens generate high-power radio frequency (RF) electromagnetic waves through dielectric heating. These waves create nodes with zero amplitude, which prevents heating. They also create antinodes, where heating occurs more rapidly.  
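
The node spacing follows directly from the oven's operating frequency. The sketch below is a 1-D idealization (a real oven cavity is a 3-D multimode resonator):

```python
import math

FREQ_HZ = 2.45e9          # typical household microwave operating frequency
C = 3e8                   # speed of light, m/s
wavelength = C / FREQ_HZ  # ~0.122 m

def standing_wave_amplitude(x_m):
    """Idealized 1-D standing-wave amplitude |sin(2*pi*x/wavelength)|."""
    return abs(math.sin(2 * math.pi * x_m / wavelength))

# Nodes (no heating) repeat every half wavelength, ~6.1 cm apart.
print(round(100 * wavelength / 2, 1))  # -> 6.1

print(round(standing_wave_amplitude(wavelength / 2), 3))  # node: 0.0
print(round(standing_wave_amplitude(wavelength / 4), 3))  # antinode: 1.0
```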

To address this issue, we developed MicroSurf, a low-cost solution that improves heating by using passive metasurfaces to control electromagnetic energy inside the microwave oven. It uses the resonance effect between the metasurface and electromagnetic waves to modify the standing-wave distribution and achieve more uniform heating. This is shown in Figure 5. 

A diagram illustrating the working principle of MicroSurf in four parts. A shows an uneven electric field distribution inside a microwave oven leading to uneven heating, with images of a microwave and thermal images of food. B depicts accurate modeling of the microwave oven, including geometry refinement, dielectric factor tuning, and frequency tuning. C involves designing and optimizing a metasurface that can function in a high-power environment to change the standing wave distribution, with an image of a high-power phase-tuning metasurface. D demonstrates achieving uniform heating of different foods and selectively heating specific parts of food, with thermal images showing uniform heating results.
Figure 5: MicroSurf’s working principle: A. Uneven electric field distribution inside the microwave oven leads to uneven heating. B. Modeling the microwave oven. C. Designing and optimizing a metasurface that can function in a high-power environment to change the standing wave distribution. D. Achieving uniform heating of different foods and selectively heating specific parts.

Tests across four different microwave oven brands demonstrate that MicroSurf effectively optimizes heating for various liquids and solids, uniformly heating water, milk, bread, and meat. It concentrates heat on specific areas and adapts to differently shaped foods. MicroSurf offers a promising solution for even heating in microwave ovens, demonstrating the potential of metasurface technology in everyday applications. This innovation paves the way for smarter, more efficient home appliances.  

Advancing wireless innovation

Wireless sensing and communication technologies are evolving rapidly, driving innovation across a wide range of applications. We are continuing to push the boundaries of these technologies—particularly in metasurface development—while working to create practical solutions for a variety of use cases. 

The post Metasurface: Unlocking the future of wireless sensing and communication appeared first on Microsoft Research.

Read More

NVIDIA Honors Americas Partners Advancing Agentic and Physical AI


NVIDIA this week recognized 14 partners leading the way across the Americas for their work advancing agentic and physical AI across industries.

The 2025 Americas NVIDIA Partner Network awards — announced at the GTC 2025 global AI conference — represent key efforts by industry leaders to help customers become experts in using AI to solve many of today’s greatest challenges. The awards honor the diverse contributions of NPN members fostering AI-driven innovation and growth.

This year, NPN introduced three new award categories that reflect how AI is driving economic growth and opportunities, including:

  • Trailblazer, which honors a visionary partner spearheading AI adoption and setting new industry standards.
  • Rising Star, which celebrates an emerging talent helping industries harness AI to drive transformation.
  • Innovation, which recognizes a partner that’s demonstrated exceptional creativity and forward thinking.

This year’s NPN ecosystem winners have helped companies across industries use AI to adapt to new challenges and prioritize energy-efficient accelerated computing. NPN partners help customers implement a broad range of AI technologies, including NVIDIA-accelerated AI factories, as well as large language models and generative AI chatbots, to transform business operations.

The 2025 NPN award winners for the Americas are:

  • Global Consulting Partner of the Year — Accenture is recognized for its impact and depth of engineering with its AI Refinery platform for industries, simulation and robotics, marketing and sovereignty, which helps organizations enhance innovation and growth with custom-built approaches to AI-driven enterprise reinvention.
  • Trailblazer Partner of the Year — Advizex is recognized for its commitment to driving innovation in AI and high-performance computing, helping industries like healthcare, manufacturing, retail and government seamlessly integrate advanced AI technologies into existing business frameworks. This enables organizations to achieve significant operational efficiencies, enhanced decision-making and accelerated digital transformation.
  • Rising Star Partner of the Year — AHEAD is recognized for its leadership, technical expertise and deployment of NVIDIA software, NVIDIA DGX systems, NVIDIA HGX and networking technologies to advance AI, benefitting customers across healthcare, financial services, life sciences and higher education.
  • Networking Partner of the Year — Computacenter is recognized for advancing high-performance computing and data centers with NVIDIA networking technologies. The company achieved this by using the NVIDIA AI Enterprise software platform, DGX platforms and NVIDIA networking to drive innovation and growth throughout industries with efficient, accelerated data centers.
  • Solution Integration Partner of the Year — EXXACT is recognized for its efforts in helping research institutions and businesses tap into generative AI, large language models and high-performance computing. The company harnesses NVIDIA GPUs and networking technologies to deliver powerful computing platforms that accelerate innovation and tackle complex computational challenges across various industries.
  • Enterprise Partner of the Year — World Wide Technology (WWT) is recognized for its leadership in advancing AI adoption for customers across industry verticals worldwide. The company expanded its end-to-end AI capabilities by integrating NVIDIA Blueprints into its AI Proving Ground and has made a $500 million commitment to AI development over three years to help speed enterprise generative AI deployments.
  • Software Partner of the Year — Mark III is recognized for the work of its cross-functional team spanning data scientists, developers, 3D artists, systems engineers, and HPC and AI architects, as well as its close collaborations with enterprises and institutions, to deploy NVIDIA software, including NVIDIA AI Enterprise and NVIDIA Omniverse, across industries. These efforts have helped many customers build software-powered pipelines and data flywheels with machine learning, generative AI, high-performance computing and digital twins.
  • Higher Education Research Partner of the Year — Mark III is recognized for its close engagement with universities, academic institutions and research organizations to cultivate the next generation of leaders across AI, machine learning, generative AI, high-performance computing and digital twins.
  • Healthcare Partner of the Year — Lambda is recognized for empowering healthcare and biotech organizations with AI training, fine-tuning and inferencing solutions to speed innovation and drive breakthroughs in AI-driven drug discovery. The company provides AI training, fine-tuning and inferencing solutions at every scale — from individual workstations to comprehensive AI factories — that help healthcare providers seamlessly integrate NVIDIA accelerated computing and software into their infrastructure.
  • Financial Services Partner of the Year — WWT is recognized for driving the digital transformation of the world’s largest banks and financial institutions. The company harnesses NVIDIA AI technologies to optimize data management, enhance cybersecurity and deliver transformative generative AI solutions, helping financial services clients navigate rapid technological changes and evolving customer expectations.
  • Innovation Partner of the Year — Cambridge Computer is recognized for supporting customers deploying transformative technologies, including NVIDIA Grace Hopper, NVIDIA Blackwell and the NVIDIA Omniverse platform for physical AI.
  • Service Delivery Partner of the Year — SoftServe is recognized for its impact in driving enterprise adoption of NVIDIA AI and Omniverse with custom NVIDIA Blueprints that tap into NVIDIA NIM microservices and NVIDIA NeMo and Riva software. SoftServe helps customers create generative AI services for industries spanning manufacturing, retail, financial services, auto, healthcare and life sciences.
  • Distribution Partner of the Year — TD SYNNEX has been recognized for the second consecutive year for supporting customers in accelerating AI growth through rapid delivery of NVIDIA accelerated computing and software, as part of its Destination AI initiative.
  • Rising Star Consulting Partner of the Year — Tata Consultancy Services (TCS) is recognized for its growth and commitment to providing industry-specific solutions that help customers adopt AI faster and at scale. Through its recently launched business unit and center of excellence built on NVIDIA AI Enterprise and Omniverse, TCS is poised to accelerate adoption of agentic AI and physical AI solutions to speed innovation for customers worldwide.
  • Canadian Partner of the Year — Hypertec is recognized for its advancement of high-performance computing and generative AI across Canada. The company has employed the full-stack NVIDIA platform to accelerate AI for financial services, higher education and research.
  • Public Sector Partner of the Year — Government Acquisitions (GAI) is recognized for its rapid AI deployment and robust customer relationships, helping serve the unique needs of the federal government by adding AI to operations to improve public safety and efficiency.

Learn more about the NPN program.

Read More

NVIDIA Blackwell Powers Real-Time AI for Entertainment Workflows


AI has been shaping the media and entertainment industry for decades, from early recommendation engines to AI-driven editing and visual effects automation. Real-time AI — which lets companies actively drive content creation, personalize viewing experiences and rapidly deliver data insights — marks the next wave of that transformation.

With the NVIDIA RTX PRO Blackwell GPU series, announced yesterday at the NVIDIA GTC global AI conference, media companies can now harness real-time AI for media workflows with unprecedented speed, efficiency and creative potential.

NVIDIA Blackwell serves as the foundation of NVIDIA Media2, an initiative that enables real-time AI by bringing together NVIDIA technologies — including NVIDIA NIM microservices, NVIDIA AI Blueprints, accelerated computing platforms and generative AI software — to transform all aspects of production workflows and experiences, starting with content creation, streaming and live media.

Powering Intelligent Content Creation

Accelerated computing enables AI-driven workflows to process massive datasets in real time, unlocking faster rendering, simulation and content generation.

The NVIDIA RTX PRO Blackwell GPU series includes new features that enable unprecedented graphics and AI performance. The NVIDIA Streaming Multiprocessor offers up to 1.5x faster throughput over the NVIDIA Ada generation, and new neural shaders integrate AI inside programmable shaders for advanced content creation.

Fourth-generation RT Cores deliver up to 2x the performance of the previous generation, enabling the creation of massive photoreal and physically accurate animated scenes. Fifth-generation Tensor Cores deliver up to 4,000 trillion AI operations per second and add support for FP4 precision. And up to 96GB of GDDR7 memory boosts GPU bandwidth and capacity, allowing applications to run faster and work with larger, more complex datasets for massive 3D and AI projects, large-scale virtual-reality environments and more.
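
As a rough illustration of what FP4 precision means for that 96GB capacity, here is a simple weight-footprint estimate. The 70B-parameter model is a hypothetical example, and real workloads also need memory for activations, KV caches and buffers:

```python
def model_weights_gb(params_billion, bits_per_param):
    """Approximate weight footprint in gigabytes."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

# A hypothetical 70B-parameter model: FP16 weights exceed 96 GB,
# while FP4 weights fit with ample room to spare.
print(model_weights_gb(70, 16))  # -> 140.0
print(model_weights_gb(70, 4))   # -> 35.0
```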

Elio © Disney/Pixar

“One of the most exciting aspects of new technology is how it empowers our artists with tools to enhance their creative workflows,” said Steve May, chief technology officer of Pixar Animation Studios. “With Pixar’s next-generation renderer, RenderMan XPU — optimized for the NVIDIA Blackwell platform — 99% of Pixar shots can now fit within the 96GB of memory on the NVIDIA RTX PRO 6000 Blackwell GPUs. This breakthrough will fundamentally improve the way we make movies.”

© Lucasfilm Ltd.

“Our artists were frequently maxing out our 48GB cards with ILM StageCraft environments and having to battle performance issues on set for 6K and 8K real-time renders,” said Stephen Hill, principal rendering engineer at Lucasfilm. “The new NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition GPU lifts these limitations — we’re seeing upwards of a 2.5x performance increase over our current production GPUs, and with 96GB of VRAM we now have twice as much memory to play with.”

In addition, neural rendering with NVIDIA RTX Kit brings cinematic-quality ray tracing and AI-enhanced graphics to real-time engines, elevating visual fidelity in film, TV and interactive media. Including neural texture compression, neural shaders, RTX Global Illumination and Mega Geometry, RTX Kit is a suite of neural rendering technologies that enhance graphics for games, animation, virtual production scenes and immersive experiences.

Fueling the Future of Streaming and Data Analytics

Data analytics is transforming raw audience insights into actionable intelligence faster than ever. NVIDIA accelerated computing and AI-powered frameworks enable studios to analyze viewer behavior, predict engagement patterns and optimize content in real time, driving hyper-personalized experiences and smarter creative decisions.

With the new GPUs, users can achieve real-time ingestion and data transformation with GPU-accelerated data loading and cleansing at scale.

The NVIDIA technologies accelerating streaming and data analytics include a suite of NVIDIA CUDA-X data processing libraries that enable immediate insights from continuous data streams and reduce latency, such as:

  • NVIDIA cuML: Enables GPU-accelerated training and inference for recommendation models using scikit-learn algorithms, providing real-time personalization capabilities and up-to-date relevant content recommendations that boost viewer engagement while reducing churn.
  • NVIDIA cuDF: Offers pandas DataFrame operations on GPUs, enabling faster and more efficient NVIDIA-accelerated extract, transform and load operations and analytics. cuDF helps optimize content delivery by analyzing user data to predict demand and adjust content distribution in real time, improving overall user experiences.

Along with cuML and cuDF, accelerated data science libraries provide seamless integration with the open-source Dask library for multi-GPU or multi-node clusters. NVIDIA RTX PRO Blackwell GPUs’ large GPU memory can further assist with handling massive datasets and spikes in usage without sacrificing performance.
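
Because cuDF mirrors the pandas API, code like the following sketch runs unchanged on GPUs when pandas is swapped for cuDF (for example via the cudf.pandas accelerator). The DataFrame and column names here are hypothetical:

```python
import pandas as pd  # on a GPU machine: import cudf as pd

# Hypothetical viewing events; column names are illustrative only.
events = pd.DataFrame({
    "user_id":    [1, 1, 2, 3, 3, 3],
    "content_id": ["a", "b", "a", "a", "c", "c"],
    "watch_min":  [12, 5, 30, 8, 22, 17],
})

# Rank content by total watch time to estimate demand.
demand = (events.groupby("content_id")["watch_min"]
                .sum()
                .sort_values(ascending=False))
print(demand)  # a: 50, c: 39, b: 5
```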

Additionally, the video search and summarization blueprint integrates vision language models and large language models, providing cloud-native building blocks to build video analytics, search and summarization applications.

Breathing Life Into Live Media 

With NVIDIA RTX PRO Blackwell GPUs, broadcasters can achieve higher performance than ever in high-resolution video processing, real-time augmented reality and AI-driven content production and video analytics.

New features include:

  • Ninth-Generation NVIDIA NVENC: Adds support for 4:2:2 encoding, accelerating video encoding speed and improving quality for broadcast and live media applications while reducing costs of storing uncompressed video.
  • Sixth-Generation NVIDIA NVDEC: Provides up to double H.264 decoding throughput and offers support for 4:2:2 H.264 and HEVC decode. Professionals can benefit from high-quality video playback, accelerate video data ingestion and use advanced AI-powered video editing features.
  • Fifth-Generation PCIe: Provides double the bandwidth over the previous generation, improving data transfer speeds from CPU memory and unlocking faster performance for data-intensive tasks.
  • DisplayPort 2.1: Drives high-resolution displays at up to 8K at 240Hz and 16K at 60Hz. Increased bandwidth enables seamless multi-monitor setups, while high dynamic range and higher color depth support deliver more precise color accuracy for tasks like video editing and live broadcasting.
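
For a sense of scale on those display modes, the raw, uncompressed data rate is simply width × height × refresh rate × bits per pixel. The sketch below assumes 10-bit RGB (30 bpp):

```python
def raw_video_gbps(width, height, refresh_hz, bits_per_pixel=30):
    """Uncompressed video data rate in Gbit/s (30 bpp = 10-bit RGB)."""
    return width * height * refresh_hz * bits_per_pixel / 1e9

print(round(raw_video_gbps(7680, 4320, 240), 1))   # 8K @ 240 Hz -> 238.9
print(round(raw_video_gbps(15360, 8640, 60), 1))   # 16K @ 60 Hz -> 238.9
```

Both raw rates far exceed DisplayPort 2.1's maximum link bandwidth of 80 Gbps (UHBR20), which is why Display Stream Compression is part of driving such modes in practice.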

“The NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition GPU is a transformative force in Cosm’s mission to redefine immersive entertainment,” said Devin Poolman, chief product and technology officer at Cosm, a global immersive technology, media and entertainment company. “With its unparalleled performance, we can push the boundaries of real-time rendering, unlocking the ultra-high resolution and fluid frame rates needed to make our live, immersive experiences feel nearly indistinguishable from reality.”

As a key component of Cosm’s CX System 12K LED dome displays, RTX PRO 6000 Max-Q enables seamless merging of the physical and digital worlds to deliver shared reality experiences, enabling audiences to engage with sports, live events and cinematic content in entirely new ways.

Cosm’s shared reality experience features its 87-foot-diameter LED dome display in stunning 12K resolution, with millions of pixels shining 10x brighter than the brightest cinematic display. Image courtesy of Cosm.

To learn more about NVIDIA Media2, watch the GTC keynote and register to attend sessions from NVIDIA and industry leaders at the show, which runs through Friday, March 21. 

Try NVIDIA NIM microservices and AI Blueprints on build.nvidia.com.

Read More

Amazon Q Business now available in Europe (Ireland) AWS Region


Today, we are excited to announce that Amazon Q Business—a fully managed generative AI-powered assistant that you can configure to answer questions, provide summaries, and generate content based on your enterprise data—is now generally available in the Europe (Ireland) AWS Region.

Since its launch, Amazon Q Business has been helping customers find information, gain insight, and take action at work. The general availability of Amazon Q Business in the Europe (Ireland) Region will support customers across Ireland and the EU to transform how their employees work and access information, while maintaining data security and privacy requirements.

AWS customers and partners innovate using Amazon Q Business in Europe

Organizations across the EU are using Amazon Q Business for a wide variety of use cases, including answering questions about company data, summarizing documents, and providing business insights.

Katya Dunets, the AWS Lead Sales Engineer for Adastra noted,

Adastra stands at the forefront of technological innovation, specializing in artificial intelligence, data, cloud, digital, and governance services. Our team was facing the daunting challenge of sifting through hundreds of documents on SharePoint, searching for content and information critical for market research and RFP generation. This process was not only time-consuming but also impeded our agility and responsiveness. Recognizing the need for a transformative solution, we turned to Amazon Q Business for its prowess in answering queries, summarizing documents, generating content, and executing tasks, coupled with its direct SharePoint integration. Amazon Q Business became the catalyst for unprecedented efficiency within Adastra, dramatically streamlining document retrieval, enhancing cross-team collaboration through shared insights from past projects, and accelerating our RFP development process by 70%. Amazon Q Business has not only facilitated a smoother exchange of knowledge within our teams but has also empowered us to maintain our competitive edge by focusing on innovation rather than manual tasks. Adastra’s journey with Amazon Q exemplifies our commitment to harnessing cutting-edge technology to better serve both our clients and their customers.

AllCloud is a cloud solutions provider specializing in cloud stack, infrastructure, platform, and Software-as-a-Service. Their CTO, Peter Nebel stated,

“AllCloud faces the common challenge of information sprawl. Critical knowledge for sales and delivery teams is scattered across various tools—Salesforce for customer and marketing data, Google Drive for documents, Bamboo for HR and internal information, and Confluence for internal wikis. This fragmented approach wastes valuable time as employees hunt and peck for the information they need, hindering productivity and potentially impacting client satisfaction. Amazon Q Business provides AllCloud a solution to increase productivity by streamlining information access. By leveraging Amazon Q’s natural language search capabilities, AllCloud can empower its personnel with a central hub to find answers to their questions across all their existing information sources. This drives efficiency and accuracy by eliminating the need for time-consuming searches across multiple platforms and ensures all teams have access to the most up-to-date information. Amazon Q will significantly accelerate productivity, across all lines of business, allowing AllCloud’s teams to focus on delivering exceptional service to their clients.”

Lars Ritter, Senior Manager at Woodmark Consulting noted,

“Amazon Bedrock and Amazon Q Business have been game-changers for Woodmark. Employees struggled with time-consuming searches across various siloed systems, leading to reduced productivity and slower operations. To solve for the inefficient retrieval of corporate knowledge from unstructured data sources we turned to Amazon Bedrock and Amazon Q Business for help. With this innovative solution, Woodmark has been able to revolutionize data accessibility, empowering our teams to effortlessly retrieve insights using simple natural language queries, and to make informed decisions without relying on specialized data teams, which was not feasible before. These solutions have dramatically increased efficiency, fostered a data-driven culture, and positioned us for scalable growth, driving our organization toward unparalleled success.”

Scott Kumono, Product Manager for Kinectus at Siemens Healthineers adds,

“Amazon Q Business has enhanced the delivery of service and clinical support for our ultrasound customers. Previously, finding specific information meant sifting through a 1,000-page manual or waiting for customer support to respond. Now, customers have instant access to answers and specifications right at their fingertips, using Kinectus Remote Service. With Amazon Q Business we were able to significantly reduce manual work and wait times to find the right information, allowing our customers to focus on what really matters – patient care.”

Till Gloger, Head of Digital Production Platform Region Americas at Volkswagen Group of America states,

“Volkswagen innovates not only on its products, but also on how to boost employee productivity and increase production throughput. Volkswagen is testing the use of Amazon Q to streamline employee workflows by potentially integrating it with existing processes. This integration has the possibility to help employees save time during the assembly process, reducing some processes from minutes to seconds, ultimately leading to more throughput.”

Pricing

With Amazon Q Business, enterprise customers pay for user subscriptions and index capacity. For more details, see Amazon Q Business pricing.

Get started with Amazon Q Business today

To get started with Amazon Q Business, users first need to configure an application environment and create a knowledge base using over 40 data source connectors that index documents (e.g., text, PDF, images, tables). Organizations then set up user authentication through AWS IAM Identity Center or other SAML-based identity providers like Okta, Ping Identity, and Microsoft Entra ID. After configuring access permissions, application users can navigate to their organization’s Amazon Q Business web interface using their credentials to begin interacting with Q Business and the data they have access to. Q Business enables natural language interactions where users can ask questions and receive answers based on their indexed documents, uploaded content, and world knowledge – this may include getting details, generating content or insights. Users can access Amazon Q Business through multiple channels including web applications, Slack, Microsoft Teams, Microsoft 365 for Word and Outlook, or through browser extensions for generative AI assistance directly where they work. Additionally, customers can securely share their data with verified independent software vendors (ISVs) like Asana, Miro, PagerDuty, and Zoom using the data accessors feature, which maintains security and compliance while respecting user-level permissions.

Learn more about how to get started with Amazon Q Business here. Read about other Amazon Q Business customers’ success stories here. Certain Amazon Q Business features already available in US East (N. Virginia) and US West (Oregon), including Q Apps, Q Actions, and audio/video file support, will become available in Europe (Ireland) soon.


About the Authors

Jose Navarro is an AI/ML Specialist Solutions Architect at AWS, based in Spain. Jose helps AWS customers—from small startups to large enterprises—architect and take their end-to-end machine learning use cases to production.

Morgan Dutton is a Senior Technical Program Manager at AWS, Amazon Q Business, based in Seattle.

Eva Pagneux is a Principal Product Manager at AWS, Amazon Q Business, based in San Francisco.

Wesleigh Roeca is a Senior Worldwide Gen AI/ML Specialist at AWS, Amazon Q Business, based in Santa Monica.

Read More

PyTorch Day China 2025 Call for Proposals Open


We’re excited to announce the first-ever PyTorch Day China! This new event, hosted by the PyTorch Foundation, will take place on June 7 in Beijing, China, bringing together AI practitioners, researchers, and industry professionals to explore the latest advancements in open source AI and machine learning. Co-located with the BAAI Conference, PyTorch Day China is a chance to connect with the community, share knowledge, and help shape the future of deep learning.


Why Submit a Proposal?

PyTorch Day China offers a platform for AI practitioners and researchers to showcase their work, exchange ideas, and connect with others in the community. If you’re working on innovative applications, tools, or research in the PyTorch ecosystem, we encourage you to share your expertise.

Topics for Submission:

  • AI Applications and Use Cases
  • Core PyTorch Framework
  • DL Compilers and Kernel Authoring
  • Edge AI and On-Device
  • Ethical AI, Governance, and Regulation
  • Generative AI and Large Language Models (LLMs) with PyTorch
  • Open Source Collaboration, Education, and Community Building
  • Optimization for Training and Inference
  • PyTorch on Accelerator Hardware
  • PyTorch Ecosystem and Tools
  • PyTorch in Research and Academia
  • Performance Measurement and Benchmarking
  • Scaling Training and Inference

The submission deadline is April 13. Submit and learn more here: https://www.lfasiallc.com/pytorch-day-china/call-for-proposals-cfp/

Why Attend?

PyTorch Day China will feature technical talks, discussions, and poster sessions that highlight real-world applications and developments in AI and machine learning. Attendees will have the opportunity to learn from experts, contribute to the open source community, and engage with fellow PyTorch users. Registration information will be available in April.

Event Details

  • Date: June 7, 2025
  • Location: Zhongguancun Exhibition Center, Beijing, China
  • Address: Suojiafen, Haidian District, Beijing, China, 100080
  • Co-located with: BAAI Conference

Travel Information

The venue, Zhongguancun Exhibition Center, is approximately 39 km from Beijing Capital International Airport. More details on travel and accommodation will be available on the BAAI Conference website and updated here as they become available.

Have Questions?

For inquiries, please contact pytorchevents@linuxfoundation.org.

Submit your proposal by April 13 and join the conversation shaping the future of PyTorch.

Read More

SGLang Joins PyTorch Ecosystem: Efficient LLM Serving Engine



We’re thrilled to announce that the SGLang project has been integrated into the PyTorch ecosystem! This integration ensures that SGLang aligns with PyTorch’s standards and practices, providing developers with a reliable and community-supported framework for fast and flexible serving of LLMs.

To view the PyTorch Ecosystem, see the PyTorch Landscape and learn more about how projects can join the PyTorch Ecosystem.

About SGLang

SGLang is a fast serving engine for large language models and vision language models. It makes interaction with models faster and more controllable by co-designing the backend runtime and frontend language.

The core features include:

  • Fast Backend Runtime: Provides efficient serving with RadixAttention for prefix caching, zero-overhead CPU scheduler, continuous batching, token attention (paged attention), speculative decoding, tensor parallelism, chunked prefill, structured outputs, and quantization (FP8/INT4/AWQ/GPTQ).
  • Flexible Frontend Language: Offers an intuitive interface for programming LLM applications, including chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions.
  • Extensive Model Support: Supports a wide range of generative models (Llama, Gemma, Mistral, Qwen, DeepSeek, LLaVA, etc.), embedding models (e5-mistral, gte, mcdse) and reward models (Skywork), with easy extensibility for integrating new models.
  • Active Community: SGLang is open source and backed by an active community with industry adoption.
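
To illustrate the idea behind RadixAttention's prefix caching, here is a toy token-level prefix index. This is a conceptual sketch only, not SGLang's actual data structure:

```python
# Toy prefix index illustrating the reuse idea behind RadixAttention:
# requests sharing a prompt prefix can reuse the cached computation
# (KV cache) for the shared leading tokens.

class PrefixNode:
    def __init__(self):
        self.children = {}   # token id -> PrefixNode
        self.cached = False  # stands in for a stored KV-cache block

class PrefixCache:
    def __init__(self):
        self.root = PrefixNode()

    def insert(self, tokens):
        """Record that KV cache exists for every prefix of `tokens`."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, PrefixNode())
            node.cached = True

    def longest_cached_prefix(self, tokens):
        """Number of leading tokens whose KV cache can be reused."""
        node, matched = self.root, 0
        for t in tokens:
            nxt = node.children.get(t)
            if nxt is None or not nxt.cached:
                break
            node, matched = nxt, matched + 1
        return matched

cache = PrefixCache()
cache.insert([101, 7, 8, 9])  # first request's prompt tokens
# A second request sharing the first three tokens reuses their KV cache,
# so only the new suffix needs fresh prefill.
reused = cache.longest_cached_prefix([101, 7, 8, 42])
print(reused)  # -> 3
```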

SGLang is known for its speed, often significantly outperforming other state-of-the-art frameworks in serving throughput and latency. You can learn more about the underlying techniques from the past release blog posts: v0.2 blog, v0.3 blog, v0.4 blog.

SGLang has been widely adopted by leading industry companies and frontier research labs. For example, xAI uses SGLang to serve its flagship model, Grok 3, which is currently the best model according to the Chatbot Arena leaderboard. Microsoft Azure uses SGLang to serve DeepSeek R1 on AMD GPUs, which is currently the best open source model.

Serving DeepSeek Models

You can easily launch a Docker container to serve a DeepSeek model with the following command:

# Pull the latest image
docker pull lmsysorg/sglang:latest

# Launch a server
docker run --gpus all --shm-size 32g -p 30000:30000 -v ~/.cache/huggingface:/root/.cache/huggingface --ipc=host --network=host --privileged lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --port 30000

Then you can query the server with the OpenAI-compatible API:

import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(response.choices[0].message.content)

The server launch command above works for 8xH200. You can find detailed instructions for other hardware (MI300X, H100, A100, H20, L40S) at https://docs.sglang.ai/references/deepseek.html.
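Because the API is OpenAI-compatible, the client call above is just an HTTP POST to the server's /v1/chat/completions endpoint. The equivalent raw JSON body can be sketched without the openai package (no server needs to be running to inspect the payload):

```python
import json

# The chat completions request issued by the client above, expressed as the
# raw JSON body POSTed to the OpenAI-compatible endpoint.
payload = {
    "model": "deepseek-ai/DeepSeek-V3",
    "messages": [
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    "temperature": 0,
    "max_tokens": 64,
}
body = json.dumps(payload)

# To send it with only the standard library (assumes the server is running):
#   import urllib.request
#   req = urllib.request.Request(
#       "http://127.0.0.1:30000/v1/chat/completions",
#       data=body.encode(), headers={"Content-Type": "application/json"})
#   print(urllib.request.urlopen(req).read().decode())
print(body)
```

This compatibility is what lets existing OpenAI-based tooling point at an SGLang server by changing only the base URL.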

SGLang integrates DeepSeek-specific optimizations, such as MLA throughput optimizations, MLA-optimized kernels, data-parallel attention, multi-token prediction, and DeepGemm, making it the top choice for serving DeepSeek models by dozens of companies, including AMD, NVIDIA, and many cloud providers. The team is actively working on integrating more optimizations following the 2025 H1 roadmap below.

Serving Llama Models

Similarly, you can launch the server for a Llama 3.1 text model with:

python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct

Or a Llama 3.2 multimodal model with:

python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-11B-Vision-Instruct  --chat-template=llama_3_vision

Roadmap

This year, the SGLang team will continue to push the boundaries of system efficiency. You can find the 2025H1 roadmap here. The focus areas are:

  • Throughput-oriented large-scale deployment similar to the DeepSeek inference system
  • Long context optimizations
  • Low latency speculative decoding
  • Reinforcement learning training framework integration
  • Kernel optimizations

Community

SGLang has been deployed to large-scale production, generating trillions of tokens every day. It has an active community with over three hundred contributors on GitHub. It is supported by the following institutions: AMD, Atlas Cloud, Baseten, Cursor, DataCrunch, Etched, Hyperbolic, iFlytek, Jam & Tea Studios, LinkedIn, LMSYS, Meituan, Nebius, Novita AI, NVIDIA, RunPod, Stanford, UC Berkeley, UCLA, xAI, and 01.AI.


Conclusion

We’re excited to welcome SGLang to the PyTorch ecosystem. SGLang accelerates the serving of large language and vision language models. It’s widely adopted by industry, powering the large-scale online serving of frontier models like Grok and DeepSeek.

We invite you to explore the SGLang GitHub repo, join the community on Slack, and reach out to contact@sglang.ai for inquiries or collaboration opportunities. Together, we can make powerful AI models accessible to everyone.

Read More

Running NVIDIA NeMo 2.0 Framework on Amazon SageMaker HyperPod


This post is cowritten with Abdullahi Olaoye, Akshit Arora and Eliuth Triana Isaza at NVIDIA.

As enterprises continue to push the boundaries of generative AI, scalable and efficient model training frameworks are essential. The NVIDIA NeMo Framework provides a robust, end-to-end solution for developing, customizing, and deploying large-scale AI models, while Amazon SageMaker HyperPod delivers the distributed infrastructure needed to handle multi-GPU, multi-node workloads seamlessly.

In this blog post, we explore how to integrate NeMo 2.0 with SageMaker HyperPod to enable efficient training of large language models (LLMs). We cover the setup process and provide a step-by-step guide to running a NeMo job on a SageMaker HyperPod cluster.

NVIDIA NeMo Framework Overview

The NVIDIA NeMo Framework is an end-to-end solution for developing cutting-edge generative AI models such as LLMs, vision language models (VLMs), video and speech models, and others.

At its core, NeMo Framework provides model builders with:

  • Comprehensive development tools: A complete ecosystem of tools, scripts, and proven recipes that guide users through every phase of the LLM lifecycle, from initial data preparation to final deployment.
  • Advanced customization: Flexible customization options that teams can use to tailor models to their specific use cases while maintaining peak performance.
  • Optimized infrastructure: Sophisticated multi-GPU and multi-node configurations that maximize computational efficiency for both language and image applications.
  • Enterprise-grade features with built-in capabilities including:
    • Advanced parallelism techniques
    • Memory optimization strategies
    • Distributed checkpointing
    • Streamlined deployment pipelines

By consolidating these powerful features into a unified framework, NeMo significantly reduces the complexity and cost associated with generative AI development. NeMo Framework 2.0 is a flexible, IDE-independent Python-based framework that integrates smoothly into each developer’s workflow. The framework provides capabilities such as code completion, type checking, programmatic extensions, and configuration customization. The NeMo Framework includes NeMo-Run, a library designed to streamline the configuration, execution, and management of machine learning experiments across various computing environments.

The end-to-end NeMo Framework includes the following key features that streamline and accelerate AI development:

  • Data curation: NeMo Curator is a Python library that includes a suite of modules for data-mining and synthetic data generation. They are scalable and optimized for GPUs, making them ideal for curating natural language data to train or fine-tune LLMs. With NeMo Curator, you can efficiently extract high-quality text from extensive raw web data sources.
  • Training and customization: NeMo Framework provides tools for efficient training and customization of LLMs and multimodal models. It includes default configurations for compute cluster setup, data downloading, and model hyperparameter autotuning, which can be adjusted to train on new datasets and models. In addition to pre-training, NeMo supports both supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) techniques such as LoRA, P-tuning, and more.
  • Alignment: NeMo Aligner is a scalable toolkit for efficient model alignment. The toolkit supports state-of-the-art model alignment algorithms such as SteerLM, DPO, reinforcement learning from human feedback (RLHF), and much more. By using these algorithms, you can align language models to be safer, more harmless, and more helpful.

Solution overview

In this post, we show you how to efficiently train large-scale generative AI models with NVIDIA NeMo Framework 2.0 using SageMaker HyperPod, a managed distributed training service designed for high-performance workloads. This solution integrates NeMo Framework 2.0 with the scalable infrastructure of SageMaker HyperPod, creating seamless orchestration of multi-node, multi-GPU clusters.

The key steps to deploying this solution include:

  • Setting up SageMaker HyperPod prerequisites: Configuring networking, storage, and permissions management (AWS Identity and Access Management (IAM) roles).
  • Launching the SageMaker HyperPod cluster: Using lifecycle scripts and a predefined cluster configuration to deploy compute resources.
  • Configuring the environment: Setting up NeMo Framework and installing the required dependencies.
  • Building a custom container: Creating a Docker image that packages NeMo Framework and installs the required AWS networking dependencies.
  • Running NeMo model training: Using NeMo-Run with a Slurm-based execution setup to train an example LLaMA (180M) model efficiently.

Architecture diagram

The architecture, shown in the preceding diagram, consists of an Amazon SageMaker HyperPod cluster.

Prerequisites

First, you deploy a SageMaker HyperPod cluster before running the job. But to deploy the cluster, you need to create some prerequisite resources.

Note that there is a cost associated with running a SageMaker HyperPod cluster; see Amazon SageMaker AI Pricing (HyperPod pricing under On-demand pricing) for more information.

The following prerequisite steps are adapted from the Amazon SageMaker HyperPod workshop, which you can visit for additional information.

Use the following steps to deploy the prerequisite resources.

  1. Sign in to the AWS Management Console using the AWS account you want to deploy the SageMaker HyperPod cluster in. You will create a VPC, subnets, an FSx Lustre volume, an Amazon Simple Storage Service (Amazon S3) bucket, and an IAM role as prerequisites, so make sure that your IAM role or user for console access has permissions to create these resources.
  2. Use the CloudFormation template to go to your AWS CloudFormation console and launch the solution template.
  3. Template parameters:
    • Change the Availability Zone to match the AWS Region where you’re deploying the template. See Availability Zone IDs for the AZ ID for your Region.
    • All other parameters can be left as default or changed as needed for your use case.
  4. Select the acknowledgement box in the Capabilities section and create the stack.

It takes about 10 minutes for the CloudFormation stack creation to complete. The following figure shows the deployment timeline of the CloudFormation stack deployment for the prerequisite infrastructure components.

Launch the training job

With the prerequisite infrastructure deployed in your AWS account, you next deploy the SageMaker HyperPod cluster that you’ll use for the model training example. For the model training job, you will use the NeMo Framework to launch training jobs efficiently.

Step 1: Set up a SageMaker HyperPod cluster

After the prerequisite resources are successfully deployed, create a SageMaker HyperPod cluster.

The deployment steps are adapted from the SageMaker HyperPod workshop, which you can review for additional information.

  1. Install and configure the AWS Command Line Interface (AWS CLI). If you already have it installed, verify that the version is at least 2.17.1 by running the following command:
$ aws --version
  2. Configure the environment variables using outputs from the CloudFormation stack deployed earlier.
$ curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/create_config.sh
# Change the region below to the region you wish to use
$ AWS_REGION=us-east-1 bash create_config.sh
$ source env_vars
# Confirm environment variables
$ cat env_vars
  3. Download the lifecycle scripts and upload them to the S3 bucket created in the prerequisites. SageMaker HyperPod uses lifecycle scripts to bootstrap a cluster. Examples of actions the lifecycle scripts manage include setting up Slurm and mounting the FSx Lustre filesystem.
$ git clone --depth=1 https://github.com/aws-samples/awsome-distributed-training/
$ cd awsome-distributed-training && git checkout e67fc352de83e13ad6a34b4a33da11c8a71b4d9c
# upload script
$ aws s3 cp --recursive 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/ s3://${BUCKET}/src
  4. Create a cluster config file for setting up the cluster. The following is an example of creating a cluster config from a template. The example cluster config is for g5.48xlarge compute nodes accelerated by 8 x NVIDIA A10G GPUs. See Create Cluster for cluster config examples of additional Amazon Elastic Compute Cloud (Amazon EC2) instance types. A cluster config file contains the following information:
    1. Cluster name
    2. Three instance groups:
      1. Login-group: Acts as the entry point for users and administrators. Typically used for managing jobs, monitoring, and debugging.
      2. Controller-machine: The head node for the HyperPod Slurm cluster. It manages the overall orchestration of the distributed training process and handles job scheduling and communication between nodes.
      3. Worker-group: The group of nodes that executes the actual model training workload.
    3. VPC configuration
$ cd 3.test_cases/22.nemo-run/slurm
$ curl -O https://awsome-distributed-training.s3.amazonaws.com/blog-assets/nemo2.0-hyperpod/cluster-config-template.json 
$ cp cluster-config-template.json cluster-config.json
# Replace the placeholders in the cluster config
$ source env_vars
$ sed -i "s/\$BUCKET/${BUCKET}/g" cluster-config.json
$ sed -i "s/\$ROLE/${ROLE}/g" cluster-config.json
$ sed -i "s/\$SECURITY_GROUP/${SECURITY_GROUP}/g" cluster-config.json
$ sed -i "s/\$SUBNET_ID/${SUBNET_ID}/g" cluster-config.json
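A missed placeholder in cluster-config.json surfaces later as a failed cluster creation, so it can be worth sanity-checking the rendered file before moving on. The following is a small sketch; the placeholder names mirror the sed commands above, and the sample string is hypothetical:

```python
import json
import re

def check_rendered_config(text):
    """Return any unreplaced $PLACEHOLDER tokens left in a rendered config."""
    leftovers = re.findall(r"\$(?:BUCKET|ROLE|SECURITY_GROUP|SUBNET_ID)\b", text)
    json.loads(text)  # also confirm the file is still valid JSON
    return leftovers

# Hypothetical rendered config with one placeholder still unreplaced.
sample = '{"ExecutionRole": "arn:aws:iam::123456789012:role/MyRole", "x": "$BUCKET"}'
print(check_rendered_config(sample))
```

An empty list means every placeholder was substituted and the file parses cleanly; in practice you would read cluster-config.json from disk instead of the sample string.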
  5. Create a config file based on the following example with the cluster provisioning parameters and upload it to the S3 bucket.
$ instance_type=$(jq '.InstanceGroups[] | select(.InstanceGroupName == "worker-group-1").InstanceType' cluster-config.json)
$ cat > provisioning_parameters.json << EOL
{
  "version": "1.0.0",
  "workload_manager": "slurm",
  "controller_group": "controller-machine",
  "login_group": "login-group",
  "worker_groups": [
    {
      "instance_group_name": "worker-group-1",
      "partition_name": ${instance_type}
    }
  ],
  "fsx_dns_name": "${FSX_ID}.fsx.${AWS_REGION}.amazonaws.com",
  "fsx_mountname": "${FSX_MOUNTNAME}"
}
EOL
# copy to the S3 Bucket
$ aws s3 cp provisioning_parameters.json s3://${BUCKET}/src/
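A malformed provisioning_parameters.json is a common cause of failed cluster provisioning, so a quick validation pass before uploading can save a redeploy. The following sketch checks the keys the heredoc above writes; the example values (partition name, filesystem DNS name, mount name) are hypothetical:

```python
import json

# Keys written by the provisioning_parameters.json heredoc above.
REQUIRED_KEYS = {
    "version", "workload_manager", "controller_group",
    "login_group", "worker_groups", "fsx_dns_name", "fsx_mountname",
}

def validate_provisioning_params(text):
    """Parse a rendered provisioning_parameters.json and check required keys."""
    params = json.loads(text)
    missing = REQUIRED_KEYS - params.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if params["workload_manager"] != "slurm":
        raise ValueError("this walkthrough assumes a Slurm workload manager")
    return params

# Hypothetical rendered file, matching the structure of the heredoc above.
example = json.dumps({
    "version": "1.0.0",
    "workload_manager": "slurm",
    "controller_group": "controller-machine",
    "login_group": "login-group",
    "worker_groups": [{"instance_group_name": "worker-group-1",
                       "partition_name": "ml.g5.48xlarge"}],
    "fsx_dns_name": "fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com",
    "fsx_mountname": "abcdefgh",
})
print(sorted(validate_provisioning_params(example)))
```

In practice you would run this against the file on disk before the `aws s3 cp` upload.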
  6. Create the SageMaker HyperPod cluster.
$ aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json --region $AWS_REGION
  7. Use the following code or the console to check the status of the cluster. The status should be Creating. Wait for the cluster status to be InService before proceeding.
$ aws sagemaker list-clusters --output table

The following screenshot shows the results of the --output table command showing the cluster status as Creating.

The following screenshot shows the Cluster Management page and status of the cluster in the Amazon SageMaker AI console.

The following screenshot shows the results of the --output table command showing the cluster status as InService.

Step 2: SSH into the cluster

After the cluster is ready (that is, has a status of InService), you can connect to it using AWS Systems Manager Session Manager and an SSH helper script. See SSH into Cluster for more information.

  1. Install the AWS SSM Session Manager Plugin.
  2. Create a local key pair that can be added to the cluster by the helper script for easier SSH access and run the following SSH helper script.
$ ssh-keygen -t rsa -q -f "$HOME/.ssh/id_rsa" -N ""
$ curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/easy-ssh.sh
$ chmod +x easy-ssh.sh
$ ./easy-ssh.sh -c controller-machine ml-cluster

Step 3: Interact with the cluster and clone the repository

After connecting to the cluster, you can validate that the cluster is properly configured by running several commands. See Get to know your Cluster for more information.

  1. View the existing partitions and nodes per partition.
$ sinfo
  2. List the jobs that are in the queue or running.
$ squeue
  3. SSH into the compute nodes.
# First ssh into the cluster head node as ubuntu user
$ ssh ml-cluster

#SSH into one of the compute nodes
$ salloc -N 1
$ ssh $(srun hostname)

#Exit to the head node
$ exit

#Exit again to cancel the srun job above
$ exit
  4. Clone the code sample GitHub repository onto the cluster controller node (head node).
$ cd /fsx/ubuntu
$ git clone https://github.com/aws-samples/awsome-distributed-training/
$ cd awsome-distributed-training && git checkout e67fc352de83e13ad6a34b4a33da11c8a71b4d9c
$ cd 3.test_cases/22.nemo-run/slurm

Now, you’re ready to run your NeMo Framework Jobs on the SageMaker HyperPod cluster.

Step 4: Build the job container

The next step is to build the job container. By using a container, you can create a consistent, portable, and reproducible environment, helping to ensure that all dependencies, configurations, and optimizations remain intact. This is particularly important for high-performance computing (HPC) and AI workloads, where variations in the software stack can impact performance and compatibility.

To have a fully functioning and optimized environment, you need to add AWS-specific networking dependencies (EFA, OFI plugin, update NCCL, and NCCL tests) to the NeMo Framework container from NVIDIA GPU Cloud (NGC) Catalog. After building the Docker image, you will use Enroot to create a squash file from it. A squash file is a compressed, read-only file system that encapsulates the container image in a lightweight format. It helps reduce storage space, speeds up loading times, and improves efficiency when deploying the container across multiple nodes in a cluster. By converting the Docker image into a squash file, you can achieve a more optimized and performant execution environment, especially in distributed training scenarios.

Make sure that you have a registered account with NVIDIA and can access NGC. Retrieve the NGC API key following the instructions from NVIDIA. Use the following command to configure NGC. When prompted, use $oauthtoken for the login username and the API key from NGC for the password.

$ docker login nvcr.io

You can use the following commands to build the Docker image and create a SquashFS file from it.

$ docker build --progress=plain -t nemo_hyperpod:24.12 -f Dockerfile .
$ sudo enroot import -o /fsx/ubuntu/nemo-hyperpod-24-12-02102025.sqsh dockerd://nemo_hyperpod:24.12

Step 5: Set up NeMo-Run and other dependencies on the head node

Before continuing:

  1. NeMo-Run requires Python 3.10. Verify that this is installed on the head node before proceeding.
  2. You can use the following steps to set up the NeMo-Run dependencies in a virtual environment. The steps create and activate a virtual environment, then execute the venv.sh script to install the dependencies, which include the NeMo toolkit, NeMo-Run, PyTorch, Megatron-LM, and others.
$ python3.10 -m venv temp-env
$ source temp-env/bin/activate
$ bash venv.sh
  3. To prepare for the pre-training of the LLaMA model in an offline mode and to help ensure consistent tokenization, use the widely adopted GPT-2 vocabulary and merges files. This approach helps avoid potential issues related to downloading tokenizer files during training:
$ mkdir -p /fsx/ubuntu/temp/megatron
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -O /fsx/ubuntu/temp/megatron/megatron-gpt-345m_vocab
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -O /fsx/ubuntu/temp/megatron/megatron-gpt-345m_merges
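The vocab and merges files drive GPT-2-style byte-pair encoding: the merges file lists symbol pairs in priority order, and tokenization repeatedly applies the highest-priority merge present in a word. The following toy sketch illustrates that loop with hypothetical merge rules; the real Megatron tokenizer additionally performs byte-level pre-tokenization:

```python
# Toy BPE: apply merge rules (as listed in a GPT-2 style merges file) to a word.
# Illustrative only; real tokenizers also handle byte-level pre-tokenization.

def bpe(word, merges):
    """word: string of symbols; merges: list of (a, b) pairs in priority order."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair with the best (lowest) merge rank, if any applies.
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no remaining pair has a merge rule
        merged = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge rules, in the order they might appear in a merges file.
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(bpe("lower", merges))
```

With those rules, "lower" tokenizes as ["low", "er"]: merges apply in priority order until no listed pair remains, which is why shipping fixed vocab and merges files yields consistent tokenization across runs.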

Step 6: Launch the pretraining job using NeMo-Run

Run the training script to start the LLaMA pretraining job. The training script run.py defines the configuration for a LLaMA 180M parameter model, defines a Slurm executor, defines the experiment, and launches the experiment.

The following function defines the model configuration.

def small_llama_cfg() -> llm.GPTConfig:
    return run.Config(
        llm.Llama3Config8B,
        rotary_base=500_000,
        seq_length=1024,
        num_layers=12,
        hidden_size=768,
        ffn_hidden_size=2688,
        num_attention_heads=16,
        init_method_std=0.023,
    )

The following function defines the Slurm executor.

def slurm_executor(
    account: str,
    partition: str,
    nodes: int,
    user: str = "local",
    host: str = "local",
    remote_job_dir: str = "/fsx/ubuntu/nemo2-sm-hyperpod/tmp/",
    time: str = "01:00:00",
    custom_mounts: Optional[list[str]] = None,
    custom_env_vars: Optional[dict[str, str]] = None,
    container_image: str = "/fsx/ubuntu/nemo-hyperpod-24-12-02102025.sqsh",
    retries: int = 0,
) -> run.SlurmExecutor:

The following function runs the experiment.

with run.Experiment(exp_name, log_level="INFO") as exp:
    exp.add(pretrain_recipe, executor=executor, tail_logs=True, name="training")
    # Run the experiment
    exp.run(detach=True)

Use the following command to run the training job.

$ python run.py --nodes 2 --max_steps 1000

The --nodes argument specifies the number of nodes to use during the pretraining job, while the --max_steps argument specifies the maximum number of training iterations. This is useful for controlling the duration of training.
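The command-line handling in run.py can be sketched roughly as follows. This is an illustrative reconstruction, not the actual script: the argument names match the command above, but the defaults are assumptions.

```python
import argparse

def parse_args(argv=None):
    """Sketch of run.py's CLI; the real script wires these values into the
    pretraining recipe and the Slurm executor."""
    parser = argparse.ArgumentParser(
        description="Launch LLaMA pretraining with NeMo-Run")
    parser.add_argument("--nodes", type=int, default=1,
                        help="number of cluster nodes to use for the job")
    parser.add_argument("--max_steps", type=int, default=1000,
                        help="maximum number of training iterations")
    return parser.parse_args(argv)

args = parse_args(["--nodes", "2", "--max_steps", "1000"])
print(args.nodes, args.max_steps)
```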

The following figure shows the logs of a running training job.

You can download the training logs from the cluster to your local machine and use machine learning visualization tools like TensorBoard to visualize your experimentation. See Install TensorFlow 2 for information about installing TensorBoard. The following is an example of downloading logs from the cluster and visualizing the logs on TensorBoard.

  1. After installing TensorBoard, download the log files from the cluster to your workstation where TensorBoard is installed.
$ rsync -aP ml-cluster:/path/to/logs/checkpoints/tb_logs/events.out.tfevents.1741213162.ip-10-1-7-21.55692.0 .

  2. After the logs are downloaded, you can launch TensorBoard with the log files in the current directory.
$ tensorboard --logdir .

The following is a TensorBoard screenshot for a training job, where the reduced_train_loss chart shows a decreasing loss curve over the training steps.

Troubleshooting

  • If some of the nodes appear as “down” or “down*”, as shown in the following figure, they are not reachable by the Slurm controller.

Solution: Log in to the affected nodes and run sudo systemctl restart slurmd. As shown in the following figure, the nodes then return to an idle state.

Clean up

Use the following steps to clean up the infrastructure created for this post and avoid incurring ongoing costs. You can also find cleanup instructions in Cleanup.

  1. Delete the SageMaker HyperPod cluster.
    $ aws sagemaker delete-cluster --cluster-name ml-cluster

  2. Delete the CloudFormation stack created in the prerequisites.
    $ aws cloudformation delete-stack --stack-name sagemaker-hyperpod
    $ aws cloudformation wait stack-delete-complete --stack-name sagemaker-hyperpod

Conclusion

Using the NVIDIA NeMo 2.0 framework on SageMaker HyperPod offers a scalable, cost-efficient, and streamlined approach to training large-scale generative AI models. By following the step-by-step deployment process, you can use the power of distributed computing with minimal setup complexity.


About the authors

Abdullahi Olaoye is a Senior AI Solutions Architect at NVIDIA, specializing in integrating NVIDIA AI libraries, frameworks, and products with cloud AI services and open-source tools to optimize AI model deployment, inference, and generative AI workflows. He collaborates with AWS to enhance AI workload performance and drive adoption of NVIDIA-powered AI and generative AI solutions.

Greeshma Nallapareddy is a Sr. Business Development Manager at AWS working with NVIDIA on go-to-market strategy to accelerate AI solutions for customers at scale. Her experience includes leading solutions architecture teams focused on working with startups.

Akshit Arora is a senior data scientist at NVIDIA, where he works on deploying conversational AI models on GPUs at scale. He’s a graduate of University of Colorado at Boulder, where he applied deep learning to improve knowledge tracking on a K-12 online tutoring service. His work spans multilingual text-to-speech, time series classification, ed-tech, and practical applications of deep learning.

Ankur Srivastava is a Sr. Solutions Architect in the ML Frameworks Team. He focuses on helping customers with self-managed distributed training and inference at scale on AWS. His experience includes industrial predictive maintenance, digital twins, probabilistic design optimization and has completed his doctoral studies from Mechanical Engineering at Rice University and post-doctoral research from Massachusetts Institute of Technology.

Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA empowering Amazon AI MLOps, DevOps, Scientists, and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing generative AI foundation models spanning from data curation, GPU training, model inference, and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, tennis and poker player.
