NeurIPS 2024: The co-evolution of AI and systems with Lidong Zhou

The Microsoft Research Podcast offers its audience a unique view into the technical advances being pursued at Microsoft through the insights and personal experiences of the people committed to those pursuits.

Just after his keynote at the 38th annual Conference on Neural Information Processing Systems (NeurIPS), Microsoft Corporate Vice President Lidong Zhou joins guest host Eliza Strickland of IEEE Spectrum at the conference to further explore the topic of his talk: the co-evolution of systems and AI. Zhou, who is also chief scientist of the Microsoft Asia-Pacific Research and Development Group and managing director of Microsoft Research Asia, discusses how rapidly advancing AI impacts the systems supporting it; AI as a tool for improving systems engineering itself; and how budding computer scientists can prepare for innovating in a world where AI and systems grow together.

Learn more: 

Verus: A Practical Foundation for Systems Verification
Publication, November 2024

SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation
Publication, July 2024

BitNet: Scaling 1-bit Transformers for Large Language Models
Publication, October 2023

Transcript

[MUSIC]

ELIZA STRICKLAND: Welcome to the Microsoft Research Podcast, where Microsoft’s leading researchers bring you to the cutting edge. This series of conversations showcases the technical advances being pursued at Microsoft through the insights and experiences of the people driving them.

I’m Eliza Strickland, a senior editor at IEEE Spectrum and your guest host for a special edition of the podcast.

[MUSIC FADES]

Joining me today in the Microsoft Booth at the 38th annual Conference on Neural Information Processing Systems, or NeurIPS, is Lidong Zhou. Lidong is a Microsoft corporate vice president, chief scientist of the Microsoft Asia-Pacific Research and Development Group, and managing director of Microsoft Research Asia. Earlier today, Lidong gave a keynote here at NeurIPS on the co-evolution of AI and systems engineering.

Lidong, welcome to the podcast.


LIDONG ZHOU: Thank you, Eliza. It’s such a pleasure to be here.

STRICKLAND: You said in your keynote that progress in AI is now outpacing progress in the systems supporting AI. Can you give me some concrete examples of where the current infrastructure is struggling to keep up?

ZHOU: Yeah. So actually, we have been working on supporting AI from the infrastructure perspective, and I can say, you know, there are at least three dimensions where it’s actually posing a lot of challenges. One dimension is the scale of the AI systems that we have to support. You know, you heard about the scaling law in AI and, you know, demanding even higher scale every so often. And when we scale, as I mentioned in the talk this morning, every time you scale the system, you actually have to rethink how to design a system, develop a new methodology, revisit all the assumptions. And it becomes very challenging for the community to keep up. And the other dimension is if you look at AI systems, it’s actually a whole-stack kind of design. You have to understand not only the AI workloads, the model architecture, but also the software and also the underlying hardware. And you have to make sure they are all aligned to deliver the best performance. And the third dimension is the temporal dimension, where you really see accelerated growth and the pace of innovation in AI and not actually only in AI but also in the underlying hardware. And that puts a lot of pressure on how fast we innovate on the systems side because we really have to keep up in that dimension, as well. So all those three dimensions add up. It’s becoming a pretty challenging task for the whole systems community.

STRICKLAND: I like how in your talk you proposed a marriage between systems engineering and AI. What does this look like in practice, and how might it change the way we approach both fields?

ZHOU: Yeah, so I’m actually a big fan of the systems community and the AI community working together to tackle some of the most challenging problems. Of course, you know, we have been working on systems that support AI. But now increasingly, we’re seeing opportunities where AI can actually help developers to become more productive and develop systems that are better in many dimensions in terms of efficiency, in terms of reliability, in terms of trustworthiness. So I really want to see the two communities work together even more closely going forward. You know, I talk about, sort of, the three pillars, right—the efficiency; there’s trust; there’s also the infusion of the two (AI and systems engineering)—that are three ambitions that we are actually working on. And we see very encouraging early results that make us believe that there’s much more to be achieved going forward with the two communities working together.

STRICKLAND: You mentioned the challenge of scaling. I think everyone at NeurIPS is talking about scaling. And you’ve highlighted efficiency as a key opportunity for improvement in AI. What kind of breakthroughs in systems engineering or new ideas in systems engineering could help AI achieve greater efficiencies?

ZHOU: Yeah, that’s another great question. I think there are a couple of aspects to efficiency. So this morning, I talked about some of the innovations in model architecture. So our researchers have been looking into BitNet, which essentially tries to use one bit or, actually, a ternary representation for the weights in all those AI models rather than using FP16 and so on. And that potentially creates a lot of opportunities for efficiency and energy gains. But that cannot be done without rethinking the software and even the hardware stack so that, you know, those innovations that you have in the model architecture can actually have the end-to-end benefits. And that’s, you know, one of the dimensions where we see the co-innovation of AI and the underlying system to deliver some efficiency gains for AI models, for example. But there’s another dimension, which I think is also very important. With all the AI infrastructure that we build to support AI, there’s actually a huge room for improvement, as well. And this is where AI can actually be utilized to solve some of the very challenging systems problems, for optimization, for reliability, for trustworthiness. And I use some of the examples in my talk, but this is a very early stage. I think the potential is much larger going forward.

STRICKLAND: Yeah. It’s interesting to think about how GPUs and large language models are so intertwined at this point. You can’t really have one without the other. And you said in your talk you sort of see the need to decouple the architectures and the hardware. Is that right?

ZHOU: Yes. Yeah, so this is always, you know, like a very systems type of thinking where, you know, you really want to decouple some of the elements so that they can evolve and innovate independently. And this gives more opportunities, you know, larger design space, for each field. And what we are observing now, which is actually very typical in relatively mature fields, is where we have GPUs that are dominating in the hardware land and all the model architecture has to be designed and, you know, proven very efficient on GPUs. And that limits the design space for model architecture. And similarly, you know, if you look at hardware, it’s very hard for hardware innovations to happen because now you have to show that that hardware is actually great for all the models that have been actually optimized for GPUs. So I think, you know, from a systems perspective, it’s actually possible if you design the right abstraction between the AI and the hardware, it’s possible for these two domains to actually evolve separately and have a much larger design space, actually, to find the best solution for both.

STRICKLAND: And when you think about systems engineering, are there ways that AI can be used to optimize your own work?

ZHOU: Yes, I think there are. Two examples that I gave this morning, one is, you know, in systems there’s this what we call a holy grail of system research because we want to build trustworthy systems that people can depend on. And one of the approaches is called verified systems. And this has been a very active research area in systems because there are a lot of advancements in formal methods and in how we can infuse formal methods into building real systems. But it’s still very hard for the general system community because, you know, you really have to understand how formal methods work and so on. And so it’s still not within reach. You know, like when we build mission-critical systems, we want them to be completely verified so, you know, you don’t have to do a lot of testing to show that there are no bugs. You’ll never be able to show there are no bugs with testing. But if you …

STRICKLAND: Sorry, can I pause you for one moment? Could you define formal verification for our listeners, just in case they don’t know?

ZHOU: Yeah, that’s a good point. I think the easy way to think about this is formal verification, it uses mathematical logic to describe, say, a program and, you know, it can represent some properties in math, essentially, in logic. And then you can use a proof to show that the program has certain properties that you desire, and a simple form, like, a very preliminary form of formal (specification for) verification is, you know, just assertions in the program, right, where it, say, asserts A is not equal to zero. And that’s a very simple form of logic that must hold (or be proven to hold), and then, you know, the proof system is also much more complicated to talk about more advanced properties of programs, their correctness, and so on.

STRICKLAND: Mm-hm.

ZHOU: So I think that the opportunity that we’re seeing is that with the help of AI, I think we are on the verge of providing the capability of building verified systems, at least for some of the mission-critical pieces of systems. And that would be a very exciting area for systems and AI to tackle together. And I think we’re going to see a paradigm shift in systems where some pieces of system components will actually be implemented using AI. [What] is interesting is, you know, system is generally deterministic because, so, you know, when you look at the traditional computer system, you want to know that it’s actually acting as you expected, but AI, you know, it can be stochastic, right. And it might not always give you the same answer. But how you combine these two is another area where I see a lot of opportunities for breakthroughs.

STRICKLAND: Yeah, yeah. I wanted to back up in your career a little bit and talk about the concept of gray failures because you were really instrumental in defining this concept, which for people who don’t know, gray failures are subtle and partial failures in cloud-scale systems. They can be very difficult to detect and can lead to major problems. I wanted to see if you’re still thinking about gray failures in the context of your thinking about AI and systems. Are gray failures having an impact on AI today?

ZHOU: Yes, definitely. So when we were looking at cloud systems, we realized the … so in systems, we developed a lot of mechanisms for reliability. And when we look at the cloud systems, when they reach a certain scale, a lot of methodology we develop in systems for reliability actually no longer applies. One of the reasons is we have those gray failures. And then we moved to looking at AI infrastructure. The problem is actually even worse because what we realize is there’s a lot of built-in redundancy at every level, like in GPUs, memory, or all the communication channels. And because of those built-in redundancies, sometimes the system experiences failures, but they’re being masked because of the redundancies. And that makes it very hard for us to actually maintain the system, debug the system, or troubleshoot it. And for AI infrastructure, what we have developed is a very different approach using proactive validation rather than reactive repair. And this is actually a paper that we published recently at USENIX ATC that talks about how we approach reliability in AI infrastructure, where the same concept happens to apply with a new meaning.

STRICKLAND: Mm. I like that. Yeah. So tell me a little bit about your vision for where AI goes from here. You talked a little bit in your keynote about AI-infused systems. And what would that look like?

ZHOU: Yeah, so I think AI is going to transform almost everything, and that includes systems. That’s why I’m so happy to be here to learn more from the AI community. But I also believe that for every domain that AI is going to transform, you really need the domain expertise and, sort of, the combination of AI and that particular domain. And the same for systems. So when we look at what we call AI-infused systems, we really see the opportunity where a lot of hard system challenges can be addressed by AI. But we need to define the right interface between the system and the AI so that we can leverage the advantage of both, right. Like, AI is creative. It comes up with solutions that, you know, people might not think of, but it’s also a little bit random sometimes. It could, you know, give you wrong answers. But systems are very grounded and very deterministic. So we need to figure out what is the design paradigm that we need to develop so that we can get the best of both worlds.

STRICKLAND: Makes sense. In your talk you gave an example of OptiFlow. Could you tell our listeners a bit about that?

ZHOU: Yeah. This is a pretty interesting project that is actually done in Microsoft Research Asia jointly with the Azure team where we look at collective communication, which is a major part of AI infrastructure. And it turns out, you know, there’s a lot of room for optimization. It was initially done manually. So an expert had to take a look at the system and look at the different configurations and do all kinds of experiments, and, you know, it takes about two weeks to come up with a solution. This is why I say, you know, the productivity is becoming a bottleneck for our AI infrastructure because people are in the loop who have to develop solutions. And it turns out that this is a perfect problem for AI, where AI can actually come up with various solutions. It can actually develop good system insights based on the observations from the system. And so OptiFlow, what it does is it comes up with the, sort of, the algorithm or the schedule of communications for different collective communication primitives. And it turns out to be able to discover algorithms that are much better than the default ones for, you know, different settings. And it’s giving us the benefits of productivity and also efficiency.

STRICKLAND: And you said that this is in production today, right?

ZHOU: Yes. It is in production.

STRICKLAND: That’s exciting. So thinking still to the future, how might the co-evolution of AI and systems change the skills needed for future computer scientists?

ZHOU: Yeah, that’s a very deep question. As I mentioned, I think being fluent in AI is very important. But I also believe that domain expertise is probably undervalued in many ways. And I see a lot of needs for this interdisciplinary kind of education where someone who not only understands AI and what AI technology can do but also understands a particular domain very well. And those are the people who will be able to figure out the future for that particular domain with the power of AI. And I think for students, certainly it’s no longer sufficient for you to be an expert in a very narrow domain. I think we see a lot of fields sort of merging together, and so you have to be an expert in multiple domains to see new opportunities for innovations.

STRICKLAND: So what advice would you give to a high school student who’s just starting out and thinks, ah, I want to get into AI?

ZHOU: Yeah, I mean certainly there’s a lot of excitement over AI, and it would be great for high school students to actually have some firsthand experience. And I think it’s their world in the future. Because they probably can imagine a lot of things from scratch. I think they probably have the opportunity to disrupt a lot of the things that we take for granted today. So I think just use their imagination. And I don’t think we have really good advice for the young generation. It’s going to be their creativity and their imagination. And AI is definitely going to empower them to do something that’s going to be amazing.

STRICKLAND: Something that we probably can’t even imagine.

ZHOU: Right.

STRICKLAND: Yeah.

ZHOU: I think so.

STRICKLAND: I like that. So as we close, I’m hoping you can look ahead and talk about what excites you most about the potential of AI and systems working together, but also if you have any concerns, what concerns you most?

ZHOU: Yeah, I think in terms of AI systems, I’m certainly pretty excited about what we can do together, you know, with a combination of AI and systems. There are a lot of low-hanging fruit, and there are also a lot of potential grand challenges that we can actually take on. I mentioned a couple in this morning’s talk. And certainly, you know, we also want to look at the risks that could happen, especially when systems and AI start to evolve together. And this is also an area where having some sort of trust foundation is very important so we can have some assurance of the kind of system or AI system that we are going to build. And this is actually fundamental in how we think about trust in systems. And I think that concept can be very useful for us to guard against unintended consequences or unintended issues.

[MUSIC]

STRICKLAND: Well, Lidong Zhou, thank you so much for joining us on the podcast. I really enjoyed the conversation.

ZHOU: It’s such a pleasure, Eliza.

STRICKLAND: And to our listeners, thanks for tuning in. If you want to learn more about research at Microsoft, you can check out the Microsoft Research website at Microsoft.com/research. Until next time.

[MUSIC FADES]


Simplify multimodal generative AI with Amazon Bedrock Data Automation

Developers face significant challenges when using foundation models (FMs) to extract data from unstructured assets. This data extraction process requires carefully identifying models that meet the developer’s specific accuracy, cost, and feature requirements. Additionally, developers must invest considerable time optimizing price performance through fine-tuning and extensive prompt engineering. Managing multiple models, implementing safety guardrails, and adapting outputs to align with downstream system requirements can be difficult and time consuming.

Amazon Bedrock Data Automation in public preview helps address these and other challenges. This new capability from Amazon Bedrock offers a unified experience for developers of all skillsets to easily automate the extraction, transformation, and generation of relevant insights from documents, images, audio, and videos to build generative AI–powered applications. With Amazon Bedrock Data Automation, customers can fully utilize their data by extracting insights from their unstructured multimodal content in a format compatible with their applications. Amazon Bedrock Data Automation’s managed experience, ease of use, and customization capabilities help customers deliver business value faster, eliminating the need to spend time and effort orchestrating multiple models, engineering prompts, or stitching together outputs.

In this post, we demonstrate how to use Amazon Bedrock Data Automation in the AWS Management Console and the AWS SDK for Python (Boto3) for media analysis and intelligent document processing (IDP) workflows.

Amazon Bedrock Data Automation overview

You can use Amazon Bedrock Data Automation to generate standard outputs and custom outputs. Standard outputs are modality-specific default insights, such as video summaries that capture key moments, visual and audible toxic content, explanations of document charts, graph figure data, and more. Custom outputs use customer-defined blueprints that specify output requirements using natural language or a schema editor. The blueprint includes a list of fields to extract, data format for each field, and other instructions, such as data transformations and normalizations. This gives customers full control of the output, making it easy to integrate Amazon Bedrock Data Automation into existing applications.

Using Amazon Bedrock Data Automation, you can build powerful generative AI applications and automate use cases such as media analysis and IDP. Amazon Bedrock Data Automation is also integrated with Amazon Bedrock Knowledge Bases, making it easier for developers to generate meaningful information from their unstructured multimodal content to provide more relevant responses for Retrieval Augmented Generation (RAG).

Customers can get started with standard outputs for all four modalities (documents, images, videos, and audio) and custom outputs for documents and images. Custom outputs for video and audio will be supported when the capability is generally available.

Amazon Bedrock Data Automation for images, audio, and video

To take a media analysis example, suppose that customers in the media and entertainment industry are looking to monetize long-form content, such as TV shows and movies, through contextual ad placement. To deliver the right ads at the right video moments, you need to derive meaningful insights from both the ads and the video content. Amazon Bedrock Data Automation enables your contextual ad placement application by generating these insights. For instance, you can extract valuable information such as video summaries, scene-level summaries, content moderation concepts, and scene classifications based on the Interactive Advertising Bureau (IAB) taxonomy.

To get started with deriving insights with Amazon Bedrock Data Automation, you can create a project where you can specify your output configuration using the AWS Management Console, AWS Command Line Interface (AWS CLI), or API.

To create a project on the Amazon Bedrock console, follow these steps:

  1. Expand the Data Automation dropdown menu in the navigation pane and select Projects, as shown in the following screenshot.
  2. From the Projects console, create a new project and provide a project name, as shown in the following screenshot.
  3. From within the project, choose Edit, as shown in the following screenshot, to specify or modify an output configuration. Standard output is the default way of interacting with Amazon Bedrock Data Automation, and it can be used with audio, documents, images and videos, where you can have one standard output configuration per data type for each project.
  4. For customers who want to analyze images and videos for media analysis, standard output can be used to generate insights such as image summary, video scene summary, and scene classifications with IAB taxonomy. You can select the image summarization, video scene summarization, and IAB taxonomy checkboxes from the Standard output tab and then choose Save changes to finish configuring your project, as shown in the following screenshot.
  5. To test the standard output configuration using your media assets, choose Test, as shown in the following screenshot.

The next example uses the project to generate insights for a travel ad.

  1. Upload an image, then choose Generate results, as shown in the following screenshot, for Amazon Bedrock Data Automation to invoke an inference request.
  2. Amazon Bedrock Data Automation will process the uploaded file based on the project’s configuration, automatically detecting that the file is an image and then generating a summary and IAB categories for the travel ad.
  3. After you have generated insights for the ad image, you can generate video insights to determine the best video scene for effective ad placement. In the same project, upload a video file and choose Generate results, as shown in the following screenshot.

Amazon Bedrock Data Automation will detect that the file is a video and will generate insights for the video based on the standard output configuration specified in the project, as shown in the following screenshot.

These insights from Amazon Bedrock Data Automation can help you effectively place relevant ads in your video content, which can help improve content monetization.

Intelligent document processing with Amazon Bedrock Data Automation

You can use Amazon Bedrock Data Automation to automate IDP workflows at scale, without needing to orchestrate complex document processing tasks such as classification, extraction, normalization, or validation.

To take a mortgage example, a lender wants to automate the processing of a mortgage lending packet to streamline their IDP pipeline and improve the accuracy of loan processing. Amazon Bedrock Data Automation simplifies the automation of complex IDP tasks such as document splitting, classification, data extraction, output format normalization, and data validation. Amazon Bedrock Data Automation also incorporates confidence scores and visual grounding of the output data to mitigate hallucinations and help improve result reliability.

For example, you can generate custom output by defining blueprints, which specify output requirements using natural language or a schema editor, to process multiple file types in a single, streamlined API. Blueprints can be created using the console or the API, and you can use a catalog blueprint or create a custom blueprint for documents and images.

For all modalities, this workflow consists of three main steps: creating a project, invoking the analysis, and retrieving the results.

The following solution walks you through a simplified mortgage lending process with Amazon Bedrock Data Automation using the AWS SDK for Python (Boto3), which is straightforward to integrate into an existing IDP workflow.

Prerequisites

Before you invoke the Amazon Bedrock API, make sure you have the following:

Create custom blueprint

In this example, you have the lending packet, as shown in the following image, which contains three documents: a pay stub, a W-2 form, and a driver’s license.

Amazon Bedrock Data Automation has sample blueprints for these three documents that define commonly extracted fields. However, you can also customize Amazon Bedrock Data Automation to extract specific fields from each document. For example, you can extract only the gross pay and net pay from the pay stub by creating a custom blueprint.

To create a custom blueprint using the API, you can use the CreateBlueprint operation using the Amazon Bedrock Data Automation Client. The following example shows the gross pay and net pay being defined as properties passed to CreateBlueprint, to be extracted from the lending packet:

import json
import boto3

# Client for the Amazon Bedrock Data Automation control plane (blueprints and projects)
bedrock_data_automation_client = boto3.client('bedrock-data-automation')

bda_create_blueprint_response = bedrock_data_automation_client.create_blueprint(
    blueprintName='CUSTOM_PAYSLIP_BLUEPRINT',
    type='DOCUMENT',
    blueprintStage='LIVE',
    schema=json.dumps({
        '$schema': 'http://json-schema.org/draft-07/schema#',
        'description': 'default',
        'documentClass': 'default',
        'type': 'object',
        'properties': {
            'gross_pay_this_period': {
                'type': 'number',
                'inferenceType': 'extractive',
                'description': 'The gross pay for this pay period from the Earnings table'
            },
            'net_pay': {
                'type': 'number',
                'inferenceType': 'extractive',
                'description': 'The net pay for this pay period from the bottom of the document'
            }
        }
    }),
)

The CreateBlueprint response returns the blueprintARN for the pay stub’s custom blueprint:

'blueprintArn: arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:blueprint/<BLUEPRINT_ID>'

Configure Amazon Bedrock Data Automation project

To begin processing files using blueprints with Amazon Bedrock Data Automation, you first need to create a data automation project. To process a multiple-page document containing different file types, you can configure a project with different blueprints for each file type.

You can use Amazon Bedrock Data Automation to apply multiple document blueprints within one project, so different types of documents can be processed in the same project, each with its own custom extraction logic.

When using the API to create a project, you invoke the CreateDataAutomationProject operation. The following is an example of how you can configure custom output using the custom blueprint for the pay stub and the sample blueprints for the W-2 and driver’s license:

bda_stage = 'LIVE'  # project stage; matches the blueprintStage used above

bda_bedrock_automation_create_project_response = bedrock_data_automation_client.create_data_automation_project(
    projectName='TEST_PROJECT',
    projectDescription='test BDA project',
    projectStage=bda_stage,
    standardOutputConfiguration={
        'document': {
            'outputFormat': {
                'textFormat': {
                    'types': ['PLAIN_TEXT']
                },
                'additionalFileFormat': {
                    'state': 'ENABLED',
                }
            }
        },
    },
    customOutputConfiguration={
        'blueprints': [
          {
              'blueprintArn': 'arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:blueprint/<BLUEPRINT_ID>'
          },
          {
              'blueprintArn': 'arn:aws:bedrock:us-west-2:aws:blueprint/bedrock-data-automation-public-w2-form'
          },
          {
              'blueprintArn': 'arn:aws:bedrock:us-west-2:aws:blueprint/bedrock-data-automation-public-us-driver-license'
          },
        ],
    },
    overrideConfiguration={
        'document': {
            'splitter': {
                'state': 'ENABLED'
            }
        }
    },
)

The CreateProject response returns the projectARN for the project:

'arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:data-automation-project/<PROJECT_ID>'

To process different types of documents using multiple document blueprints in a single project, Amazon Bedrock Data Automation uses a splitter configuration, which must be enabled through the API. The following is the override configuration for the splitter, and you can refer to the Boto3 documentation for more information:

overrideConfiguration={
    'document': {
        'splitter': {
            'state': 'ENABLED' | 'DISABLED'
        }
    }
},

Upon creation, the API validates the input configuration and creates a new project, returning the projectARN, as shown in the following example.

'arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:data-automation-project/<PROJECT_ID>'

Test the solution

Now that the blueprint and project setup is complete, the InvokeDataAutomationAsync operation from the Amazon Bedrock Data Automation runtime can be used to start processing files. This API call initiates the asynchronous processing of files in an S3 bucket, in this case the lending packet, using the configuration defined in the project by passing the project’s ARN:

# Client for the Bedrock Data Automation runtime, used to run inference
bedrock_data_automation_runtime_client = boto3.client('bedrock-data-automation-runtime')

bda_invoke_data_automation_async_response = bedrock_data_automation_runtime_client.invoke_data_automation_async(
    inputConfiguration={'s3Uri': '<S3_URI>'},
    outputConfiguration={'s3Uri': '<S3_URI>'},
    dataAutomationConfiguration={
        'dataAutomationArn': 'arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:data-automation-project/<PROJECT_ID>',
        'stage': 'LIVE'
    }
)

InvokeDataAutomationAsync returns the invocationARN:

'arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:data-automation-invocation/<INVOCATION_ID>'

GetDataAutomationStatus can be used to view the status of the invocation, using the InvocationARN from the previous response:

bda_get_status_response = bedrock_data_automation_runtime_client.get_data_automation_status(
    invocationArn='arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:data-automation-invocation/<INVOCATION_ID>'
)
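
If you prefer to poll for completion in code rather than checking repeatedly by hand, you can wrap GetDataAutomationStatus in a simple loop. The following is a minimal sketch; the 'status' field name and the 'Created' and 'InProgress' values are assumptions based on typical asynchronous Bedrock APIs, so confirm them against the Boto3 documentation for your SDK version.

import time

# Minimal polling sketch; field names and status values are assumptions,
# so check the GetDataAutomationStatus response shape in the Boto3 docs.
invocation_arn = 'arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:data-automation-invocation/<INVOCATION_ID>'

while True:
    status_response = bedrock_data_automation_runtime_client.get_data_automation_status(
        invocationArn=invocation_arn
    )
    status = status_response.get('status')
    if status not in ('Created', 'InProgress'):
        break  # terminal state reached (for example, Success or an error state)
    time.sleep(10)  # wait before checking again

print(f'Invocation finished with status: {status}')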

When the job is complete, view the results in the S3 bucket used in the outputConfiguration by navigating to the ~/JOB_ID/0/custom_output/ folder.
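
You can also fetch those results programmatically. The following is a minimal sketch that lists and loads the JSON files under the custom output prefix; the bucket name and the exact key layout are placeholders, so adjust them to match your outputConfiguration and job ID.

import json
import boto3

s3_client = boto3.client('s3')

# Placeholders: use the bucket from your outputConfiguration and your job ID
output_bucket = '<OUTPUT_BUCKET>'
output_prefix = '<JOB_ID>/0/custom_output/'

response = s3_client.list_objects_v2(Bucket=output_bucket, Prefix=output_prefix)
for obj in response.get('Contents', []):
    body = s3_client.get_object(Bucket=output_bucket, Key=obj['Key'])['Body'].read()
    result = json.loads(body)  # one result document per processed segment
    print(obj['Key'])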

From the following sample output, Amazon Bedrock Data Automation associated the pay stub file with the custom pay stub blueprint with a high level of confidence:

'matched_blueprint': {
    'arn': '<BLUEPRINT_ARN>', 'name': 'CUSTOM_PAYSLIP_BLUEPRINT', 'confidence': 0.99959725
}

Using the matched blueprint, Amazon Bedrock Data Automation was able to accurately extract each field defined in the blueprint:

'inference_result': {
    'net_pay': 291.9, 'gross_pay_this_period': 452.43
}

Additionally, Amazon Bedrock Data Automation returns confidence scores and bounding box information for each field:

'explainability_info': [{
    'net_pay': {'success': true, 'confidence': 0.96484375, 'geometry': [{'boundingBox': ...

This example demonstrates how customers can use Amazon Bedrock Data Automation to streamline and automate an IDP workflow. Amazon Bedrock Data Automation automates complex document processing tasks such as data extraction, normalization, and validation. It helps reduce operational complexity and improve processing efficiency, so lenders can handle higher loan processing volumes, minimize errors, and drive operational excellence.

Cleanup

When you’re finished evaluating this feature, delete the S3 bucket and any objects to avoid any further charges.
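
If the bucket was created only for this walkthrough, a short cleanup sketch like the following removes the objects and then the bucket itself; the bucket name is a placeholder.

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('<OUTPUT_BUCKET>')  # placeholder bucket name

bucket.objects.all().delete()  # delete every object first
bucket.delete()                # then delete the now-empty bucket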

Summary

Customers can get started with Amazon Bedrock Data Automation, which is available in public preview in the US West (Oregon) AWS Region. Learn more about Amazon Bedrock Data Automation and how to automate the generation of accurate information from unstructured content for building generative AI–based applications.


About the authors

Ian Lodge is a Solutions Architect at AWS, helping ISV customers solve their architectural, operational, and cost optimization challenges. Outside of work, he enjoys spending time with his family, playing ice hockey, and woodworking.

Alex Pieri is a Solutions Architect at AWS who works with retail customers to plan, build, and optimize their AWS cloud environments. He specializes in helping customers build enterprise-ready generative AI solutions on AWS.

Raj Pathak is a Principal Solutions Architect and technical advisor to Fortune 50 and mid-sized FSI (banking, insurance, capital markets) customers across Canada and the United States. Raj specializes in machine learning with applications in generative AI, natural language processing, intelligent document processing, and MLOps.

PromptWizard: The future of prompt optimization through feedback-driven self-evolving prompts

The challenge of effective prompting

AI is reshaping industries—from education to healthcare—thanks to advancements in large language models (LLMs). These models rely on prompts, carefully crafted inputs that guide them to produce relevant and meaningful outputs. While the impact of prompts is profound, creating prompts that can help with complex tasks is a time-intensive and expertise-heavy process, often involving months of trial and error. 

This challenge grows as new tasks arise and models evolve rapidly, making manual methods for prompt engineering increasingly unsustainable. The question then becomes: How can we make prompt optimization faster, more accessible, and more adaptable across diverse tasks? 

To address this challenge, we developed PromptWizard (PW), a research framework that automates and streamlines the process of prompt optimization. We are open sourcing the PromptWizard codebase to foster collaboration and innovation within the research and development community.

Introducing PromptWizard

PromptWizard (PW) is designed to automate and simplify prompt optimization. It combines iterative feedback from LLMs with efficient exploration and refinement techniques to create highly effective prompts within minutes.

PromptWizard optimizes both the instruction and the in-context learning examples. Central to PW is its self-evolving and self-adaptive mechanism, where the LLM iteratively generates, critiques, and refines prompts and examples in tandem. This process ensures continuous improvement through feedback and synthesis, achieving a holistic optimization tailored to the specific task at hand. By evolving both instructions and examples simultaneously, PW ensures significant gains in task performance. 

Three key insights behind PromptWizard:

  • Feedback-driven refinement: At its core, PW leverages an iterative feedback loop where the LLM generates, critiques, and refines its own prompts and examples. This continuous improvement mechanism ensures that each iteration is better than the last, leading to highly effective prompts and examples. 
  • Joint optimization and synthesis of diverse examples: PW generates synthetic examples that are not only robust and diverse but also task-aware. By optimizing prompts and examples together, it ensures they work in tandem to address specific task requirements effectively. 
  • Self-generated chain-of-thought (CoT) steps: Incorporating CoT reasoning improves the problem-solving capabilities of the model. By using selected few-shot examples, PW generates a detailed reasoning chain for each example, facilitating nuanced and step-by-step problem-solving approaches.
Figure 1. Overview of PromptWizard

How PromptWizard works

PromptWizard begins with a user input: a problem description, an initial prompt instruction, and a few training examples that serve as a foundation for the task at hand.

Its output is a refined, optimized set of prompt instructions paired with carefully curated in-context few-shot examples. These outputs are enriched with detailed reasoning chains, task intent, and an expert profile that bridges human-like reasoning with the AI’s responses. 

Stage 1: Refinement of prompt instruction

The first stage focuses on refining the task instructions of a prompt. PromptWizard generates multiple candidate instructions, evaluates them using feedback from the LLM, and iteratively synthesizes improved versions. This process balances exploration—trying diverse ideas—and exploitation—refining the most promising ones.

For example, if an initial instruction yields suboptimal results, PW incorporates feedback to identify its shortcomings and generates an improved version. Over three to five iterations, this iterative cycle ensures that the instruction converges to an optimal state. 

Figure 2. Refinement of prompt instruction
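
The sketch below illustrates the explore-critique-synthesize loop described above in schematic Python. It is not the PromptWizard API: llm stands for any chat-completion callable and score for any task-specific evaluator over a handful of training examples.

def refine_instruction(llm, score, task, instruction, examples, iterations=5, n_variants=4):
    """Schematic explore-critique-synthesize loop; not the actual PromptWizard API."""
    best, best_score = instruction, score(instruction, examples)
    for _ in range(iterations):
        # Explore: generate several rewrites of the current best instruction.
        variants = [llm(f"Task: {task}\nRewrite this instruction differently:\n{best}")
                    for _ in range(n_variants)]
        # Critique: pick the strongest variant and ask where it still falls short.
        top = max(variants, key=lambda v: score(v, examples))
        critique = llm(f"Task: {task}\nInstruction:\n{top}\nList its weaknesses.")
        # Synthesize (exploit): fold the critique back into an improved instruction.
        improved = llm(f"Improve the instruction below using this feedback.\n"
                       f"Feedback: {critique}\nInstruction:\n{top}")
        # Keep whichever candidate scores best on the training examples.
        for candidate in (top, improved):
            candidate_score = score(candidate, examples)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
    return best

In Stage 2, the same critique-and-synthesis idea is applied jointly to the instruction and the in-context examples, as described next.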

Stage 2: Joint optimization of instructions and examples

The refined prompt obtained from Stage 1 is combined with carefully selected examples, and both are optimized together. Through the critique-and-synthesis mechanism, PromptWizard ensures alignment between the prompt and examples, simultaneously synthesizing new examples to enhance task performance.

This structured approach makes PromptWizard highly versatile, adapting to tasks as varied as solving math problems or generating creative content. 

Figure 3. Joint optimization of instructions and examples

Results

PromptWizard stands out for its feedback-driven refinement and systematic exploration, delivering exceptional results across a wide variety of tasks while maintaining computational efficiency. 

Comprehensive evaluation across tasks

PromptWizard was rigorously evaluated on over 45 tasks, spanning both general and domain-specific challenges. Benchmarked against state-of-the-art techniques—including Instinct, InstructZero, APE, PromptBreeder, EvoPrompt, DSPy, APO, and PromptAgent—PW consistently outperformed competitors in accuracy, efficiency, and adaptability. Please see detailed results in our paper.

  • Accuracy: PW consistently outperformed other methods, maintaining performance close to the best across all tasks. Figure 4 shows the performance profile curve that highlights PW’s reliability, demonstrating how frequently it achieves near-best accuracy compared to other approaches on the BigBench Instruction Induction (BBII) dataset.
  • Efficiency: Beyond accuracy, PW demonstrates its computational efficiency. Unlike many baseline methods that require extensive API calls and computational resources, PW achieves superior results with minimal overhead by striking an effective balance between exploration and exploitation. Table 1 demonstrates PW’s cost-effectiveness, with significantly reduced token usage for input and output while optimizing prompts effectively.
Figure 4. Performance Profile curve on BBII dataset
Methods          API calls    Total tokens
Instinct         1730         115k
PromptBreeder    18600        1488k
EvoPrompt        5000         400k
PW               69           24k
Table 1. Cost analysis on BBII dataset

We have also conducted numerous experiments to highlight PromptWizard’s efficacy with limited training data and smaller LLMs.

Resilience with limited data

Real-world scenarios often lack abundant training data. PW excels in such conditions, requiring as few as five examples to produce effective prompts. Across five diverse datasets, PW demonstrated an average accuracy drop of only 5% when using five examples compared to 25 examples—highlighting its adaptability and efficiency (see Table 2). 

Datasets    5 Examples    25 Examples
MMLU        80.4          89.5
GSM8k       94            95.4
Ethos       86.4          89.4
PubMedQA    68            78.2
MedQA       80.4          82.9
Average     81.9          87
Table 2. PW’s performance with varying number of examples

Leveraging smaller models for optimization

PromptWizard also reduces computational costs by using smaller LLMs for prompt generation, reserving more powerful models for inference. For example, using Llama-70B for prompt generation resulted in negligible performance differences compared to GPT-4, while significantly lowering resource usage (see Table 3).

Dataset    Prompt Gen: Llama-70B    Prompt Gen: GPT4
GSM8k      94.6                     95.4
Ethos      89.2                     89.4
Average    91.9                     92.4
Table 3. Performance with smaller LLMs for prompt generation 

PromptWizard shows that effective prompts combine optimized instructions refined through iterative feedback, thoughtfully chosen in-context examples, and a modular design that incorporates expert knowledge and task-specific intent. This approach enables the framework to handle a broad range of tasks, from simple to highly complex, with exceptional efficiency and flexibility.

 Whether you are a researcher addressing cutting-edge challenges or an organization looking to streamline workflows, PromptWizard provides a practical, scalable, and impactful solution for enhancing model performance.

How TUI uses Amazon Bedrock to scale content creation and enhance hotel descriptions in under 10 seconds

TUI Group is one of the world’s leading global tourism services, providing 21 million customers with an unmatched holiday experience in 180 regions. TUI Group covers the end-to-end tourism chain with over 400 owned hotels, 16 cruise ships, 1,200 travel agencies, and 5 airlines covering all major holiday destinations around the globe. At TUI, crafting high-quality content is a crucial component of its promotional strategy.

The TUI content teams are tasked with producing high-quality content for its websites, including product details, hotel information, and travel guides, often using descriptions written by hotel and third-party partners. This content needs to adhere to TUI’s tone of voice, which is essential to communicating the brand’s distinct personality. But as its portfolio expands with more hotels and offerings, scaling content creation has proven challenging. This presents an opportunity to augment and automate the existing content creation process using generative AI.

In this post, we discuss how we used Amazon SageMaker and Amazon Bedrock to build a content generator that rewrites marketing content following specific brand and style guidelines. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. Amazon SageMaker helps data scientists and machine learning (ML) engineers build FMs from scratch, evaluate and customize FMs with advanced techniques, and deploy FMs with fine-grain controls for generative AI use cases that have stringent requirements on accuracy, latency, and cost.

Through experimentation, we found that following a two-phase approach worked best to make sure that the output aligned to TUI’s tone of voice requirements. The first phase was to fine-tune a smaller large language model (LLM) on a large corpus of data. The second phase used a different LLM for post-processing. Through fine-tuning on static data, we generate content that mimics the TUI brand voice in a way that could not be captured through prompt engineering. Employing a second model with few-shot examples helped verify the output adhered to specific formatting and grammatical rules. The latter uses a more dynamic dataset, which we can use to adjust the output quickly in the future for different brand requirements. Overall, this approach resulted in higher-quality content and allowed TUI to improve content quality at a higher velocity.
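
As a rough illustration of this two-phase flow, the sketch below calls a fine-tuned Llama model behind a hypothetical SageMaker endpoint and then passes the draft to Claude 2 on Amazon Bedrock for reformatting. The endpoint name, payload fields, and prompt wording are assumptions rather than the production implementation, and the response schema of the first call depends on how the model was deployed.

import json
import boto3

sagemaker_runtime = boto3.client('sagemaker-runtime')
bedrock_runtime = boto3.client('bedrock-runtime')

def rewrite_hotel_description(source_text: str) -> str:
    # Phase 1: the fine-tuned Llama model (hosted on a SageMaker endpoint)
    # rewrites the partner-supplied copy in TUI's tone of voice.
    # '<LLAMA_ENDPOINT_NAME>' and the payload/response fields are placeholders.
    llama_response = sagemaker_runtime.invoke_endpoint(
        EndpointName='<LLAMA_ENDPOINT_NAME>',
        ContentType='application/json',
        Body=json.dumps({'inputs': source_text, 'parameters': {'max_new_tokens': 400}}),
    )
    draft = json.loads(llama_response['Body'].read())[0]['generated_text']  # adjust to your container's schema

    # Phase 2: Claude 2 on Amazon Bedrock applies formatting and grammar rules
    # (illustrative instructions only, not TUI's actual post-processing prompt).
    prompt = (
        "\n\nHuman: Rewrite the hotel description below using British English, "
        "writing numbers one to nine in words, and avoiding abbreviations. "
        f"Keep the meaning unchanged.\n\n{draft}\n\nAssistant:"
    )
    claude_response = bedrock_runtime.invoke_model(
        modelId='anthropic.claude-v2',
        body=json.dumps({'prompt': prompt, 'max_tokens_to_sample': 500}),
    )
    return json.loads(claude_response['body'].read())['completion']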

Solution overview

The architecture consists of a few key components:

  • LLM models – We evaluated different approaches and found that a two-model solution performed the best. This consists of a fine-tuned Meta Llama model to generate a description for the given hotel and Anthropic’s Claude model to reformat its output. Fine-tuning and hosting the Meta Llama 2 model was done on Amazon SageMaker, and Anthropic’s Claude 2 was consumed from Amazon Bedrock through API calls.
  • Orchestration – We created a state machine using AWS Step Functions to make calls in a batch format to the two LLMs and fetch the search engine optimization (SEO) score for the generated content from a third-party API. If the SEO content score is above a defined threshold (80%), the generated content is stored in an Amazon DynamoDB table and can later be reviewed by the content team directly in the front-end UI. Through this process, we maintain and monitor content quality at scale.
  • Human in the loop feedback – We developed a custom React front-end application to gather feedback from the content team to facilitate continuous improvement and future model fine-tuning. You can use the feedback to fine-tune a base model on SageMaker using reinforcement learning from human feedback (RLHF) to improve performance.

The following diagram is the high-level architecture of the solution.

Architecture Diagram
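
As a minimal sketch of the quality gate in the orchestration step above, the function below stores a generated description in DynamoDB only when its SEO score clears the 80% threshold. The table name, key schema, and status value are placeholders.

from decimal import Decimal

import boto3

dynamodb = boto3.resource('dynamodb')
content_table = dynamodb.Table('<GENERATED_CONTENT_TABLE>')  # placeholder table name
SEO_THRESHOLD = 80  # threshold from the orchestration step above

def store_if_seo_passes(hotel_id: str, description: str, seo_score: float) -> bool:
    """Persist content for human review only when the SEO score clears the threshold."""
    if seo_score < SEO_THRESHOLD:
        return False
    content_table.put_item(Item={
        'hotel_id': hotel_id,                  # assumed partition key
        'description': description,
        'seo_score': Decimal(str(seo_score)),  # DynamoDB numbers must be Decimal, not float
        'status': 'PENDING_REVIEW',            # placeholder review status
    })
    return True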

Prompt engineering

We started by crafting prompts that include the most common issues identified by the TUI content team, including tone of voice, accuracy, length, and grammar. We provided the LLM with a few examples of curated content within the context window. Although the generated output followed the guidance, the writing style didn’t meet TUI’s tone of voice requirements.

 Example prompt:

You are an experienced British copywriter for TUI. TUI is a world-leading travel company. You are an expert in generating hotel descriptions, based on TUI’s tone of voice. TUI's tone of voice can be described as upbeat, enthusiastic, and promotional. Avoid all the words in the following list: {banned words}
Write at most 100 words.
Your hotel descriptions must follow TUI's tone of voice and apply SEO guidelines.
These are some good examples. You should mimic below.
{examples}.
Human: {input}
Assistant:

Here is an example of the results we received.

Original text:
Boutique interiors and a great location combine at this small but perfectly formed hotel. It may be small but this 15-room hotel packs a punch with its interior design. The immaculate pool is set in an open atrium area. And there’s funky cocktail bar and lounge, too. The location is every bit as impressive as the styling. It’s about a 10-minute walk to the beach and less than half that to the centre of Thassos Town.

Generated text:
This petite hotel brings boutique style to Thassos Town. Its immaculate pool and funky cocktail bar create an oasis in the heart of the action. The beach is just a 10-minute walk away, so you can easily split your time between sunbathing and sightseeing. With only 15 rooms, the experience feels intimate yet worldly. The interiors marry modern design with traditional touches for a unique look. And the central location means popular attractions like the Archaeological Museum and shopping street are right on your doorstep.

To evaluate these results, we generated a dataset of 150 hotel descriptions using two different prompt versions and asked our business teams to review it. However, they flagged 98% of the output with significant issues. Firstly, Anthropic’s Claude 2 struggled to adhere to TUI’s tone of voice and branding guidelines, frequently employed abbreviations, and favored American English over British English. It also failed to follow the SEO guidelines provided and sometimes generated false information regarding hotel facilities and locations. The following image shows a list of these challenges and how the LLM handled them. Of the six challenges, the LLM met only one.

Main Challenges

Fine-tuning Llama 2 using PEFT on Amazon SageMaker JumpStart

These issues and poor feedback led us to conclude that prompt engineering alone would not adequately address the problem. As a result, we decided to pursue an alternative approach: fine-tuning a smaller large language model to rewrite the text in accordance with TUI’s tone of voice. We used a curated set of hotel descriptions written by TUI copywriters so that the model would have better alignment with our guidelines.

We selected the Meta Llama 2 model using Amazon SageMaker JumpStart, one of the top open source LLMs available at the time, and chose the 13B parameter version to apply parameter-efficient fine-tuning (PEFT), specifically using quantized low-rank adaptation (QLoRA). This technique quantizes the pre-trained model to 4 bits and adds small low-rank adapters for fine-tuning. We fine-tuned the model on a single ml.g5.4xlarge instance in about 20 hours using a relatively small dataset of around 4,500 hotels. We also tested out the Llama 2 7B and 70B models. We found that the 7B model didn’t perform well enough, and the 70B model had much higher costs without seeing significant improvement.
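
For readers who want to see what a QLoRA setup looks like in code, here is a generic sketch using the Hugging Face transformers and peft libraries. The hyperparameters are illustrative only, and the fine-tuning in this post was launched through SageMaker JumpStart rather than written by hand like this.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the base weights to 4 bits
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",            # gated model; requires access approval
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=8,                                    # rank of the low-rank adapters (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()          # only the adapter weights are trainable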

We used common natural language processing (NLP) evaluation metrics, such as perplexity for evaluation and monitoring during training, and established daily feedback loops with the content team to refine the test set. The output from the fine-tuned Meta Llama 2 13B model effectively mimicked the TUI tone of voice and adhered to most SEO guidelines, even those not specified in the prompt.
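
As a quick reminder of the metric, perplexity is the exponential of the average per-token negative log-likelihood on held-out text (lower is better). A minimal illustration:

import math

def perplexity(token_log_probs):
    """token_log_probs: natural-log probabilities of each reference token under the model."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

print(perplexity([-0.2, -0.1, -0.3, -0.25]))  # roughly 1.24 for fairly confident predictions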

For more information, refer to Fine-tune Llama 2 for text generation on Amazon SageMaker Jumpstart.

Original text:
The NH München City Süd hotel, formerly known as NH München am Ring, is located southeast of Munich. The location is ideal for reaching the highway towards Salzburg and provides good access to the airport and Exhibition Center. You can also get to the city center in 20 minutes by hopping on the S-Bahn at the Karl-Preis-Platz station. The peaceful residential location offers an excellent opportunity to relax in one of the 163 rooms. These are decorated in white with red accents. Rooms feature free Wi-Fi and electronic safes that are large enough for your laptop. If you need to work, there are spacious desks in every room and apartments with seperate living rooms are available for longer term guests. The suites also come with separate living rooms and foldaway beds are available upon request. The breakfast will get you ready for a busy day with a large spread of cold cuts, cheeses, fresh pastries, and fruit. These come accompanied with freshly brewed coffee, tea, and fresh fruit juices. For lunch and dinner, you will find many restaurant options in close proximity to the hotel. There is a sauna for relaxation and you can park your car in the garage.

Generated text:
Make yourself at home in Munich at the München City Süd. This former NH München am Ring sits just southeast of the Bavarian capital, putting both the highway to Salzburg and the airport within easy reach. The S-Bahn station Karl-Preis-Platz is a quick walk away too, so you can be in the city center in just 20 minutes.

The neighborhood here is peaceful, giving you the chance to properly relax during your stay. The hotel’s 163 rooms are decorated in calming white and red hues. Free Wi-Fi and laptop-sized safes come as standard, while spacious desks make working easy. For longer visits, apartments with separate living areas are available. The suites take things up a notch with their own living spaces and foldaway beds on request.

Fuel up each morning with a vast breakfast spread of cold cuts, cheeses, pastries and fruit. All washed down with coffee, tea and fresh juices. You’ll find many dining options close by for other meals. After a busy day, unwind in the sauna or park up in the garage.

The following image shows a list of the challenges and how the LLM handled them. Of the six challenges, the LLM met four.

Improvements

Integrating Anthropic’s Claude 2 for further improvement

To further improve the output, we introduced Anthropic’s Claude 2 using Amazon Bedrock as a final refinement step. This included converting American spelling to British spelling, writing numbers one through nine in words and larger numbers in digits, correcting typos and capitalization errors, minimizing banned words, incorporating essential TUI branding words, and adding missing hotel information. We also implemented a feedback mechanism in the UI to use data for ongoing fine-tuning in production. By using Anthropic Claude 2, we make sure that the final output applies the remaining formatting rules.

Original text Generated text

As its name suggests, Vienna Sporthotel has loads lined up for sports fans.

There are 10 tennis courts and four golf simulators here, plus, you can book a session with a coach to brush up on your skills. On top of that, there’s an outdoor pool, a well-stocked fitness room and a big sauna. To fuel up for all that exercise, you can grab a plate at the breakfast buffet.

Location-wise, this place is set up for sightseeing. Hop on a bus and you can reach the likes of The Hofburg, Belvedere Palace and Karlskirche in less than half an hour. You’re only a 10-minute stroll away from the nearest restaurants and cafes, too.

Generated text:

With a name like Vienna Sporthotel, you can expect this place to serve up plenty for fans of sport and activity. Brush up your skills on one of 10 tennis courts or in the four golf simulators, then recharge in the outdoor pool or well-equipped gym. There’s even a coach on hand to help perfect your technique.

When you’re ready to refuel, tuck into the breakfast buffet. Then get set for sightseeing – a bus ride whisks you to top Vienna attractions like The Hofburg, Belvedere Palace and Karlskirche in under 30 minutes. You’re also just a short stroll from local eateries and coffee shops.

The following image shows a list of the challenges and how the LLM handled them. The LLM met all six challenges.

Success in Target Outcomes

Key outcomes

The final architecture consists of a fine-tuned Meta Llama 2 13B model and Anthropic's Claude 2, using the strengths of each model. In a blind test, these dynamically generated hotel descriptions were rated higher than those written by humans for 75% of a sample of 50 hotels. We also integrated a third-party API to calculate SEO scores for the generated content and observed an uplift of up to 4% compared to human-written descriptions. Most significantly, the content generation process is now five times faster, enhancing our team's productivity without compromising quality or consistency. We can generate a vast number of hotel descriptions in just a few hours, a task that previously took months.

Takeaways

Moving forward, we plan to explore how this technology can address current inefficiencies and quality gaps, especially for hotels that our team hasn’t had the capacity to curate. We also plan to expand this solution to more brands and regions within the TUI portfolio, including producing content in various languages and tailoring it to meet the specific needs of different audiences.

Throughout this project, we learned a few valuable lessons:

  • Few-shot prompting is cost-effective and sufficient when you have limited examples and specific guidelines for responses (see the sketch after this list). Fine-tuning can significantly improve model performance when you need to tailor content to match a brand’s tone of voice, but it is resource-intensive and relies on static data sources that can become outdated.
  • Fine-tuning the Llama 2 70B model was much more expensive than fine-tuning the 13B model and did not result in significant improvement.
  • Incorporating human feedback and maintaining a human-in-the-loop approach is essential for protecting brand integrity and continuously improving the solution. The collaboration between TUI engineering, content, and SEO teams was crucial to the success of this project.
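
To illustrate the first point, here is a minimal sketch of few-shot prompt assembly; the guidelines and worked example are hypothetical placeholders, not TUI's actual prompt or content:

# Hypothetical sketch of few-shot prompt assembly; guidelines and the worked
# example are placeholders, not TUI's actual prompt or content.
guidelines = "Write in British English, in a warm and conversational brand tone."

few_shot_examples = [
    {
        "facts": "Hotel A: 120 rooms, rooftop pool, five minutes from the old town.",
        "description": "Make yourself at home at Hotel A, just five minutes from the old town...",
    },
]

def build_prompt(hotel_facts: str) -> str:
    """Combine guidelines, worked examples, and the new input into one prompt."""
    shots = "\n\n".join(
        f"Facts: {ex['facts']}\nDescription: {ex['description']}"
        for ex in few_shot_examples
    )
    return f"{guidelines}\n\n{shots}\n\nFacts: {hotel_facts}\nDescription:"

print(build_prompt("Hotel B: 80 rooms, lakeside location, spa and sauna."))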

Although Meta Llama 2 and Anthropic’s Claude 2 were the latest state-of-the-art models available at the time of our experiment, we have since seen the launch of Meta Llama 3 and Anthropic’s Claude 3.5, which we expect could significantly improve the quality of our outputs. Amazon Bedrock also now supports fine-tuning for Meta Llama 2, Cohere Command Light, and Amazon Titan models, making it simpler and faster to test models without managing infrastructure.


About the Authors

Nikolaos Zavitsanos is a Data Scientist at TUI, specializing in developing customer-facing Generative AI applications using AWS services. With a strong background in Computer Science and Artificial Intelligence, he leverages advanced technologies to enhance user experiences and drive innovation. Outside of work, Nikolaos plays water polo and competes at a national level. Connect with Nikolaos on LinkedIn.

Hin Yee Liu is a Senior Prototyping Engagement Manager at Amazon Web Services. She helps AWS customers to bring their big ideas to life and accelerate the adoption of emerging technologies. Hin Yee works closely with customer stakeholders to identify, shape and deliver impactful use cases leveraging Generative AI, AI/ML, Big Data, and Serverless technologies using agile methodologies. In her free time, she enjoys knitting, travelling and strength training. Connect with Hin Yee on LinkedIn.


AI in Your Own Words: NVIDIA Debuts NeMo Retriever Microservices for Multilingual Generative AI Fueled by Data

In enterprise AI, understanding and working across multiple languages is no longer optional — it’s essential for meeting the needs of employees, customers and users worldwide.

Multilingual information retrieval — the ability to search, process and retrieve knowledge across languages — plays a key role in enabling AI to deliver more accurate and globally relevant outputs.

Enterprises can expand their generative AI efforts into accurate, multilingual systems using NVIDIA NeMo Retriever embedding and reranking NVIDIA NIM microservices, which are now available on the NVIDIA API catalog. These models can understand information across a wide range of languages and formats, such as documents, to deliver accurate, context-aware results at massive scale.

With NeMo Retriever, businesses can now:

  • Extract knowledge from large and diverse datasets for additional context to deliver more accurate responses.
  • Seamlessly connect generative AI to enterprise data in most major global languages to expand user audiences.
  • Deliver actionable intelligence at greater scale with 35x improved data storage efficiency through new techniques such as long context support and dynamic embedding sizing.

New NeMo Retriever microservices reduce storage volume needs by 35x, enabling enterprises to process more information at once and fit large knowledge bases on a single server. This makes AI solutions more accessible, cost-effective and easier to scale across organizations.
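
As a rough sketch of how a developer might call one of the hosted embedding microservices, the snippet below uses the OpenAI-compatible endpoint exposed by the NVIDIA API catalog; the base URL, model name, and input_type field are assumptions drawn from NVIDIA's published examples and may differ for your release:

from openai import OpenAI

# Hedged sketch: base URL, model name, and the "input_type" field are assumptions
# based on NVIDIA's published examples; check the API catalog for current values.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="YOUR_NVIDIA_API_KEY",  # placeholder
)

response = client.embeddings.create(
    model="nvidia/llama-3.2-nv-embedqa-1b-v2",  # multilingual embedding NIM (assumed name)
    input=["Wo ist das nächste Hotel?", "Where is the nearest hotel?"],
    extra_body={"input_type": "query"},  # use "passage" when embedding documents at index time
)
print(len(response.data[0].embedding))  # embedding dimensionality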

Leading NVIDIA partners like DataStax, Cohesity, Cloudera, Nutanix, SAP, VAST Data and WEKA are already adopting these microservices to help organizations across industries securely connect custom models to diverse and large data sources. By using retrieval-augmented generation (RAG) techniques, NeMo Retriever enables AI systems to access richer, more relevant information and effectively bridge linguistic and contextual divides.

Wikidata Speeds Data Processing From 30 Days to Under Three Days 

In partnership with DataStax, Wikimedia has implemented NeMo Retriever to vector-embed the content of Wikipedia, serving billions of users. Vector embedding — or “vectorizing” — is a process that transforms data into a format that AI can process and understand to extract insights and drive intelligent decision-making.

Wikimedia used the NeMo Retriever embedding and reranking NIM microservices to vectorize over 10 million Wikidata entries into AI-ready formats in under three days, a process that used to take 30 days. That 10x speedup enables scalable, multilingual access to one of the world’s largest open-source knowledge graphs.

This groundbreaking project ensures real-time updates for hundreds of thousands of entries that are being edited daily by thousands of contributors, enhancing global accessibility for developers and users alike. With Astra DB’s serverless model and NVIDIA AI technologies, the DataStax offering delivers near-zero latency and exceptional scalability to support the dynamic demands of the Wikimedia community.

DataStax is using NVIDIA AI Blueprints and integrating the NVIDIA NeMo Customizer, Curator, Evaluator and Guardrails microservices into the LangFlow AI code builder to enable the developer ecosystem to optimize AI models and pipelines for their unique use cases and help enterprises scale their AI applications.

Language-Inclusive AI Drives Global Business Impact

NeMo Retriever helps global enterprises overcome linguistic and contextual barriers and unlock the potential of their data. By deploying robust AI solutions, businesses can achieve accurate, scalable and high-impact results.

NVIDIA’s platform and consulting partners play a critical role in ensuring enterprises can efficiently adopt and integrate generative AI capabilities, such as the new multilingual NeMo Retriever microservices. These partners help align AI solutions to an organization’s unique needs and resources, making generative AI more accessible and effective. They include:

  • Cloudera plans to expand the integration of NVIDIA AI in the Cloudera AI Inference Service. Currently embedded with NVIDIA NIM, Cloudera AI Inference will include NVIDIA NeMo Retriever to improve the speed and quality of insights for multilingual use cases.
  • Cohesity introduced the industry’s first generative AI-powered conversational search assistant that uses backup data to deliver insightful responses. It uses the NVIDIA NeMo Retriever reranking microservice to improve retrieval accuracy and significantly enhance the speed and quality of insights for various applications.
  • SAP is using the grounding capabilities of NeMo Retriever to add context to its Joule copilot Q&A feature and information retrieved from custom documents.
  • VAST Data is deploying NeMo Retriever microservices on the VAST Data InsightEngine with NVIDIA to make new data instantly available for analysis. This accelerates the identification of business insights by capturing and organizing real-time information for AI-powered decisions.
  • WEKA is integrating its WEKA AI RAG Reference Platform (WARRP) architecture with NVIDIA NIM and NeMo Retriever into its low-latency data platform to deliver scalable, multimodal AI solutions, processing hundreds of thousands of tokens per second.

Breaking Language Barriers With Multilingual Information Retrieval

Multilingual information retrieval is vital for enterprise AI to meet real-world demands. NeMo Retriever supports efficient and accurate text retrieval across multiple languages and cross-lingual datasets. It’s designed for enterprise use cases such as search, question-answering, summarization and recommendation systems.

Additionally, it addresses a significant challenge in enterprise AI — handling large volumes of large documents. With long-context support, the new microservices can process lengthy contracts or detailed medical records while maintaining accuracy and consistency over extended interactions.
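
To make the retrieval flow concrete, the following is an illustrative sketch of the retrieve-then-rerank pattern that embedding and reranking microservices slot into; the vectors and scoring stub are placeholders, not NeMo Retriever calls:

import numpy as np

# Illustrative retrieve-then-rerank sketch; vectors are random placeholders and
# rerank_score() is a stub standing in for a call to a reranking microservice.
rng = np.random.default_rng(0)

query_vec = rng.random(1024)              # embedding of the user query
passage_vecs = rng.random((1000, 1024))   # pre-embedded document passages

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stage 1: cheap embedding similarity shortlists candidate passages.
shortlist = sorted(range(len(passage_vecs)),
                   key=lambda i: cosine(query_vec, passage_vecs[i]),
                   reverse=True)[:20]

# Stage 2: a reranking model rescores the shortlist with full query-passage
# attention (stubbed here) to produce the final ordering.
def rerank_score(query_text: str, passage_text: str) -> float:
    return 0.0  # placeholder for the reranking microservice's relevance score

ranked = sorted(shortlist, key=lambda i: rerank_score("user query", f"passage {i}"), reverse=True)
print(ranked[:5])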

These capabilities help enterprises use their data more effectively, providing precise, reliable results for employees, customers and users while optimizing resources for scalability. Advanced multilingual retrieval tools like NeMo Retriever can make AI systems more adaptable, accessible and impactful in a globalized world.

Availability

Developers can access the multilingual NeMo Retriever microservices, and other NIM microservices for information retrieval, through the NVIDIA API catalog or with a no-cost, 90-day NVIDIA AI Enterprise developer license.

Learn more about the new NeMo Retriever microservices and how to use them to build efficient information retrieval systems.


NVIDIA Unveils Its Most Affordable Generative AI Supercomputer

NVIDIA is taking the wraps off a new compact generative AI supercomputer, offering increased performance at a lower price with a software upgrade.

The new NVIDIA Jetson Orin Nano Super Developer Kit, which fits in the palm of a hand, gives everyone from commercial AI developers to hobbyists and students gains in generative AI capabilities and performance. And the price is now $249, down from $499.

Available today, it delivers as much as a 1.7x leap in generative AI inference performance, a 70% increase in performance to 67 INT8 TOPS, and a 50% increase in memory bandwidth to 102GB/s compared with its predecessor.

Whether creating LLM chatbots based on retrieval-augmented generation, building a visual AI agent, or deploying AI-based robots, the Jetson Orin Nano Super is an ideal solution to fetch.

The Gift That Keeps on Giving

The software updates available to the new Jetson Orin Nano Super will also boost generative AI performance for those who already own the Jetson Orin Nano Developer Kit.

Jetson Orin Nano Super is suited for those interested in developing skills in generative AI, robotics or computer vision. As the AI world moves from task-specific models to foundation models, it also provides an accessible platform to turn ideas into reality.

Powerful Performance With Super for Generative AI

The enhanced performance of the Jetson Orin Nano Super delivers gains for all popular generative AI models and transformer-based computer vision.

The developer kit consists of a Jetson Orin Nano 8GB system-on-module (SoM) and a reference carrier board, providing an ideal platform for prototyping edge AI applications.

The SoM features an NVIDIA Ampere architecture GPU with tensor cores and a 6-core Arm CPU, facilitating multiple concurrent AI application pipelines and high-performance inference. It can support up to four cameras, offering higher resolution and frame rates than previous versions.

Extensive Generative AI Software Ecosystem and Community

Generative AI is evolving quickly. The NVIDIA Jetson AI lab offers immediate support for cutting-edge models from the open-source community and provides easy-to-use tutorials. Developers can also get extensive support from the broader Jetson community and inspiration from projects created by other developers.

Jetson runs NVIDIA AI software including NVIDIA Isaac for robotics, NVIDIA Metropolis for vision AI and NVIDIA Holoscan for sensor processing. Development time can be reduced with NVIDIA Omniverse Replicator for synthetic data generation and NVIDIA TAO Toolkit for fine-tuning pretrained AI models from the NGC catalog.

Jetson ecosystem partners offer additional AI and system software, developer tools and custom software development. They can also help with cameras and other sensors, as well as carrier boards and design services for product solutions.

Boosting Jetson Orin Performance for All With Super Mode

The software updates that boost generative AI performance by up to 1.7x will also be available for the Jetson Orin NX and Orin Nano series of system-on-modules.

Existing Jetson Orin Nano Developer Kit owners can upgrade the JetPack SDK to unlock boosted performance today.

Learn more about Jetson Orin Nano Super Developer Kit.

See notice regarding software product information.


Llama 3.3 70B now available in Amazon SageMaker JumpStart

Today, we are excited to announce that Llama 3.3 70B from Meta is available in Amazon SageMaker JumpStart. Llama 3.3 70B marks an exciting advancement in large language model (LLM) development, offering comparable performance to larger Llama versions with fewer computational resources.

In this post, we explore how to deploy this model efficiently on Amazon SageMaker AI, using advanced SageMaker AI features for optimal performance and cost management.

Overview of the Llama 3.3 70B model

Llama 3.3 70B represents a significant breakthrough in model efficiency and performance optimization. This new model delivers output quality comparable to Llama 3.1 405B while requiring only a fraction of the computational resources. According to Meta, this efficiency gain translates to nearly five times more cost-effective inference operations, making it an attractive option for production deployments.

The model’s sophisticated architecture builds upon Meta’s optimized version of the transformer design, featuring an enhanced attention mechanism that can help substantially reduce inference costs. During its development, Meta’s engineering team trained the model on an extensive dataset comprising approximately 15 trillion tokens, incorporating both web-sourced content and over 25 million synthetic examples specifically created for LLM development. This comprehensive training approach results in the model’s robust understanding and generation capabilities across diverse tasks.

What sets Llama 3.3 70B apart is its refined training methodology. The model underwent an extensive supervised fine-tuning process, complemented by Reinforcement Learning from Human Feedback (RLHF). This dual-approach training strategy helps align the model’s outputs more closely with human preferences while maintaining high performance standards. In benchmark evaluations against its larger counterpart, Llama 3.3 70B demonstrated remarkable consistency, trailing Llama 3.1 405B by less than 2% in 6 out of 10 standard AI benchmarks and actually outperforming it in three categories. This performance profile makes it an ideal candidate for organizations seeking to balance model capabilities with operational efficiency.

The following figure summarizes the benchmark results (source).

Getting started with SageMaker JumpStart

SageMaker JumpStart is a machine learning (ML) hub that can help accelerate your ML journey. With SageMaker JumpStart, you can evaluate, compare, and select pre-trained foundation models (FMs), including Llama 3 models. These models are fully customizable for your use case with your data, and you can deploy them into production using either the UI or SDK.

Deploying Llama 3.3 70B through SageMaker JumpStart offers two convenient approaches: using the intuitive SageMaker JumpStart UI or implementing programmatically through the SageMaker Python SDK. Let’s explore both methods to help you choose the approach that best suits your needs.

Deploy Llama 3.3 70B through the SageMaker JumpStart UI

You can access the SageMaker JumpStart UI through either Amazon SageMaker Unified Studio or Amazon SageMaker Studio. To deploy Llama 3.3 70B using the SageMaker JumpStart UI, complete the following steps:

  1. In SageMaker Unified Studio, on the Build menu, choose JumpStart models. Alternatively, on the SageMaker Studio console, choose JumpStart in the navigation pane.
  2. Search for Meta Llama 3.3 70B.
  3. Choose the Meta Llama 3.3 70B model.
  4. Choose Deploy.
  5. Accept the end-user license agreement (EULA).
  6. For Instance type, choose an instance (ml.g5.48xlarge or ml.p4d.24xlarge).
  7. Choose Deploy.

Wait until the endpoint status shows as InService. You can now run inference using the model.
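
As a minimal sketch of invoking the deployed endpoint from code, the snippet below calls the SageMaker runtime; the endpoint name is a placeholder, and the payload shape mirrors the sample input in the SDK example that follows:

import json
import boto3

# Minimal sketch of calling the UI-deployed endpoint; the endpoint name is a
# placeholder and the payload mirrors the SDK example's sample_input.
runtime = boto3.client("sagemaker-runtime")

payload = {
    "inputs": "Hello, I'm a language model,",
    "parameters": {"max_new_tokens": 128, "top_p": 0.9, "temperature": 0.6},
}

response = runtime.invoke_endpoint(
    EndpointName="meta-llama-3-3-70b-endpoint",  # placeholder; use your endpoint's name
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))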

Deploy Llama 3.3 70B using the SageMaker Python SDK

For teams looking to automate deployment or integrate with existing MLOps pipelines, you can use the following code to deploy the model using the SageMaker Python SDK:

from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder
from sagemaker.jumpstart.model import ModelAccessConfig
from sagemaker.session import Session
import logging

sagemaker_session = Session()

# Default S3 bucket and execution role for the current SageMaker session
artifacts_bucket_name = sagemaker_session.default_bucket()
execution_role_arn = sagemaker_session.get_caller_identity_arn()

# SageMaker JumpStart model ID for Llama 3.3 70B Instruct
js_model_id = "meta-textgeneration-llama-3-3-70b-instruct"

gpu_instance_type = "ml.p4d.24xlarge"

# Example request and response used to infer the endpoint's input/output schema
response = "Hello, I'm a language model, and I'm here to help you with your English."

sample_input = {
    "inputs": "Hello, I'm a language model,",
    "parameters": {"max_new_tokens": 128, "top_p": 0.9, "temperature": 0.6},
}

sample_output = [{"generated_text": response}]

schema_builder = SchemaBuilder(sample_input, sample_output)

# Build a deployable model object from the JumpStart model ID
model_builder = ModelBuilder(
    model=js_model_id,
    schema_builder=schema_builder,
    sagemaker_session=sagemaker_session,
    role_arn=execution_role_arn,
    log_level=logging.ERROR
)

model = model_builder.build()

# Deploy the model, accepting the Llama 3.3 EULA, then run a test prediction
predictor = model.deploy(model_access_configs={js_model_id: ModelAccessConfig(accept_eula=True)}, accept_eula=True)
predictor.predict(sample_input)

Set up auto scaling and scale down to zero

You can optionally set up auto scaling to scale down to zero after deployment. For more information, refer to Unlock cost savings with the new scale down to zero feature in SageMaker Inference.
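
As a rough sketch, scale to zero builds on Application Auto Scaling for SageMaker inference components; the snippet below registers a scalable target with a minimum of zero copies. The inference component name is a placeholder, and the scaling policies and alarms described in the linked post are omitted:

import boto3

# Hedged sketch: registers a SageMaker inference component with Application
# Auto Scaling and allows it to scale down to zero copies. The component name
# is a placeholder; scaling policies/alarms from the linked post are omitted.
autoscaling = boto3.client("application-autoscaling")

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId="inference-component/llama-3-3-70b-ic",  # placeholder name
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    MinCapacity=0,   # zero copies when idle
    MaxCapacity=2,
)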

Optimize deployment with SageMaker AI

SageMaker AI simplifies the deployment of sophisticated models like Llama 3.3 70B, offering a range of features designed to optimize both performance and cost efficiency. With these advanced capabilities, organizations can deploy and manage LLMs in production environments, taking full advantage of Llama 3.3 70B’s efficiency while benefiting from a streamlined deployment process and optimization tools. Default deployment through SageMaker JumpStart uses accelerated deployment, which relies on speculative decoding to improve throughput. For more information on how speculative decoding works with SageMaker AI, see Amazon SageMaker launches the updated inference optimization toolkit for generative AI.

Firstly, the Fast Model Loader revolutionizes the model initialization process by implementing an innovative weight streaming mechanism. This feature fundamentally changes how model weights are loaded onto accelerators, dramatically reducing the time required to get the model ready for inference. Instead of the traditional approach of loading the entire model into memory before beginning operations, Fast Model Loader streams weights directly from Amazon Simple Storage Service (Amazon S3) to the accelerator, enabling faster startup and scaling times.

One SageMaker inference capability is Container Caching, which transforms how model containers are managed during scaling operations. This feature eliminates one of the major bottlenecks in deployment scaling by pre-caching container images, removing the need for time-consuming downloads when adding new instances. For large models like Llama 3.3 70B, where container images can be substantial in size, this optimization significantly reduces scaling latency and improves overall system responsiveness.

Another key capability is Scale to Zero. It introduces intelligent resource management that automatically adjusts compute capacity based on actual usage patterns. This feature represents a paradigm shift in cost optimization for model deployments, allowing endpoints to scale down completely during periods of inactivity while maintaining the ability to scale up quickly when demand returns. This capability is particularly valuable for organizations running multiple models or dealing with variable workload patterns.

Together, these features create a powerful deployment environment that maximizes the benefits of Llama 3.3 70B’s efficient architecture while providing robust tools for managing operational costs and performance.

Conclusion

The combination of Llama 3.3 70B with the advanced inference features of SageMaker AI provides an optimal solution for production deployments. By using Fast Model Loader, Container Caching, and Scale to Zero capabilities, organizations can achieve both high performance and cost-efficiency in their LLM deployments.

We encourage you to try this implementation and share your experiences.


About the authors

Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

Adriana Simmons is a Senior Product Marketing Manager at AWS.

Lokeshwaran Ravi is a Senior Deep Learning Compiler Engineer at AWS, specializing in ML optimization, model acceleration, and AI security. He focuses on enhancing efficiency, reducing costs, and building secure ecosystems to democratize AI technologies, making cutting-edge ML accessible and impactful across industries.

Yotam Moss is a Software development Manager for Inference at AWS AI.
