Unlocking Japanese LLMs with AWS Trainium: Innovators Showcase from the AWS LLM Development Support Program

Amazon Web Services (AWS) is committed to supporting the development of cutting-edge generative artificial intelligence (AI) technologies by companies and organizations across the globe. As part of this commitment, AWS Japan announced the AWS LLM Development Support Program (LLM Program), through which we’ve had the privilege of working alongside some of Japan’s most innovative teams. From startups to global enterprises, these trailblazers are harnessing the power of large language models (LLMs) and foundation models (FMs) to boost productivity, create differentiated customer experiences, and drive meaningful progress across a variety of industries by taking advantage of purpose-built generative AI infrastructure on AWS. Notably, 12 of the 15 organizations that successfully participated in the program used the powerful compute capabilities of AWS Trainium to train their models and are now exploring AWS Inferentia for inference. Earlier this year, at the conclusion of the program, the LLM Program held a media briefing, where several pioneering companies presented their results and stories. In this blog post, we share a recap of those results and cover how the participating organizations used the LLM Program to accelerate their generative AI initiatives.

AWS LLM Development Support Program in Japan

Since its launch, the LLM Program has welcomed 15 diverse companies and organizations, each with a unique vision for how to use LLMs to drive progress in their respective industries. The program provides comprehensive support through guidance on securing high-performance compute infrastructure, technical assistance and troubleshooting for distributed training, cloud credits, and support for go-to-market. The program also facilitated collaborative knowledge-sharing sessions, where the leading LLM engineers came together to discuss the technical complexities and commercial considerations of their work. This holistic approach enabled participating organizations to rapidly advance their generative AI capabilities and bring transformative solutions to market.

Let’s dive in and explore how these organizations are transforming what’s possible with generative AI on AWS.

Ricoh innovates with curriculum learning to train a bilingual LLM

Ricoh recognized that the development of Japanese LLMs was lagging behind that of English and multilingual LLMs. To address this, the company’s Digital Technology Development Center developed a Japanese-English bilingual LLM through a carefully crafted curriculum learning strategy.

Takeshi Suzuki, Deputy Director, Digital Technology Development Center, Digital Strategy Division, Ricoh

Takeshi Suzuki, Deputy Director of the Digital Technology Development Center, explains Ricoh’s approach:

“Although new model architectures for FMs and LLMs are rapidly emerging, we focused on refining our training methodologies to create a competitive advantage, rather than solely pursuing architectural novelty.”

This led them to adopt a curriculum learning approach that gradually introduced increasingly complex data to their model.

“If a large amount of difficult Japanese data is introduced from the start into the initial English-trained weights of Llama 2 13B Chat, it can lead to a forgetting effect, hindering learning,” Suzuki says. “Therefore, we started with a substantial amount of English data, then gradually incorporated lower-quality English and Japanese data, before finally fine-tuning on high-quality Japanese content.”

To bring this innovative curriculum learning methodology to life, Ricoh used Amazon Elastic Compute Cloud (Amazon EC2) Trn1 instances, powered by Trainium. Using an on-demand cluster of 64 trn1.32xlarge instances (1,024 Trainium chips) with support from the LLM Program, Ricoh performed large-scale distributed training of its 13-billion-parameter bilingual LLM (Llama 2 based). In benchmarks using the Japanese llm-jp-eval benchmark, the model demonstrated strong performance in logical reasoning, which is important in industrial applications.

Stockmark mitigates hallucination by pre-training a Japanese LLM

Stockmark wanted to build highly reliable LLMs for industrial applications and decided to pretrain a Japanese LLM to tackle the challenge of hallucination (factually inaccurate output)—a critical concern in many real-world use cases.

Kosuke Arima, CTO and Co-founder (left) and Dr. Takahiro Omi, VP of Research (right), Stockmark

“In the industrial world, there is a demand for LLMs where hallucination is suppressed even more than it is in ChatGPT.”

– Kosuke Arima, CTO and co-founder of Stockmark.

Hallucination mitigation depends heavily on the amount of knowledge in LLMs. In the multilingual LLMs that are often used globally, only about 0.1 percent of the training data is Japanese. Stockmark determined that retrieval augmented generation alone was insufficient to meet the needs of enterprise search or application search, because the underlying LLMs weren’t proficient in Japanese. So, they decided to develop Japanese LLMs in-house.

“To support practical business use cases, we pre-trained a 13-billion-parameter LLM from scratch using a total of 220 billion tokens of Japanese text data, including not only public data but also original web corpus and patent data for business domains.”

– Dr. Takahiro Omi, VP of Research of Stockmark.

Stockmark developed the Stockmark-13b LLM in about 30 days using 16 Trn1 instances powered by Trainium chips. Furthermore, to deploy Stockmark-13b into their own services, they conducted a technical validation of inference using the AWS Inferentia2 chip and published the results in a notebook.

NTT builds lightweight, high-performance LLMs for sustainable AI

The NTT group, together with Intel and Sony, has established the Innovative Optical and Wireless Network (IOWN) initiative as a new industry forum whose mission is to meet the social and technological needs of society through innovative and sustainable technology. As part of this effort, NTT Human Informatics Laboratories is developing the lightweight, high-performance LLM tsuzumi (named after a traditional Japanese percussion instrument). Instead of increasing the parameter size, tsuzumi enhances the quality and quantity of Japanese training data, enabling high Japanese processing ability with a lightweight model. As described in their press release, tsuzumi demonstrates high Japanese language proficiency, as evaluated by the Rakuda benchmark, and its multi-modal capabilities are currently in development.

Kyosuke Nishida, Senior Distinguished Researcher, NTT Human Informatics Laboratories

“Tsuzumi’s high Japanese language proficiency and multi-modal capabilities can benefit a variety of industry-specific and customer support use cases. In the healthcare and life sciences domain, tsuzumi can help parse electronic medical records, contributing to personalized medical care and accelerating drug discovery,” he explains. “For contact centers, tsuzumi’s multi-modal capabilities, such as visual understanding of manuals and charts, are expected to enhance both customer experience and employee experience.”

– Dr. Kyosuke Nishida, Senior Distinguished Researcher at NTT Human Informatics Laboratories.

By participating in the LLM Program, NTT was able to quickly launch a cluster of 96 NVIDIA H100 GPUs (12 EC2 P5 instances using AWS ParallelCluster). This enabled highly efficient, distributed training through the Elastic Fabric Adapter’s high-speed 3,200 Gbps inter-node communication. The AWS team also provided technical expertise to help NTT seamlessly migrate and validate its environment on AWS.

Customer innovations in domain-specific, multilingual, and multimodal generative AI

From intelligent chatbots that engage in witty banter to multimodal frameworks for autonomous vehicle systems, the LLM Program participants demonstrated the transformative potential of generative AI by using Trainium.

Domain-specific models: Trainium enabled creation of LLMs tailored to specific domains and tasks, unlocking new frontiers of efficiency and specialization. KARAKURI built an LLM (karakuri-ai/karakuri-lm-70b-chat-v0.1) to create customer support chatbots that not only have Japanese proficiency but also respond with a helpful demeanor. Meanwhile, Watashiha injected a dose of humor into the AI realm, developing OGIRI—a humor-focused foundation model that delivers delightfully funny responses to user queries. Poetics created an LLM adept at deciphering the nuances of online business meetings for their meeting analysis tool Jamroll. The Matsuo Institute pre-trained an LLM based on elyza/ELYZA-japanese-Llama-2-7b to develop an LLM-powered recommendation system that can intelligently curate personalized experiences for retail and travel customers. Aiming to build an LLM that specializes in specific tasks, Lightblue developed a small, lightweight LLM that will also reduce inference costs. To address the scalability challenges posed by a shrinking workforce, Recruit built an LLM through continued pre-training (with C4-ja, Wikipedia-ja, Pile, and in-house corpora) and instruction tuning (with databricks-dolly-15k-ja, ichikara-instruction, and in-house instruction data) on elyza/ELYZA-japanese-Llama-2-7b-fast and meta-llama/Llama-2-13b-hf models.

Multi-modal models: Several participants, such as Sparticle, have ventured into the realm of multimodal AI, weaving together language and visual modalities. Turing, with its innovative multi-modal Heron framework, is enhancing LLMs with the ability to interpret and navigate the visual landscape. Preferred Networks (PFN) has crafted a general-purpose vision FM that can seamlessly integrate and process both textual and visual information. As part of their future work, PFN will continue to develop multi-modal FMs based on PLaMo LLM, using the development method established in the LLM Program.

Linguistically diverse models: The program participants also experimented with the training data, changing the ratio of English to Japanese or using training corpora in other languages. CyberAgent used Trainium to evaluate LLM performance when changing the ratio of Japanese to English in the training data, expanded to grouped query attention (GQA), and verified architectures such as RetNet and Sparse Mixture of Experts (MoE) for their use cases. Using Trainium, Rinna built Nekomata 14B, based on the Qwen model trained on Chinese and English, through continued pre-training with 66 billion tokens of Japanese data, in just 6.5 days. Ubitus developed and released Taiwan LLM 13B (Taiwan-LLM-13B-v2.0-base) through joint research with National Taiwan University.

Fueling generative AI innovation in Japan

From startups to enterprises, organizations of all sizes have successfully trained their generative AI foundation models and large language models in the LLM Program. The program’s success was further underscored by the involvement and support of Japan’s Ministry of Economy, Trade and Industry (METI). Several of the LLM Program participants will continue to develop their FMs and LLMs as part of the Generative AI Accelerator Challenge (GENIAC), where AWS will provide compute resources, as announced by METI and described in the AWS Japan blog.

AWS will continue to support companies and organizations in their efforts to deploy these transformative models and bring generative AI innovation into real-world applications. We see the immense potential of FMs and LLMs to bolster Japan’s national strengths if implemented widely across various sectors. From a global perspective, AWS is committed to facilitating the development and adoption of these technologies worldwide, driving innovation and progress that will shape the future.

Visit AWS Trainium to learn how you can harness the power of purpose-built AI chips to build the next generation of innovative foundation models while lowering costs.

This post is contributed by AWS LLM Development Support Program Executive Committee Yoshitaka Haribara, Akihiro Tsukada, Daishi Okada, Shoko Utsunomiya, and Technical Core Team Hiroshi Tokoyo, Keita Watanabe, and Masaru Isaka with the Executive Sponsorship represented by Yukiko Sato


About the Authors

Yoshitaka Haribara is a Senior Startup ML Solutions Architect at AWS Japan. In this role, Yoshitaka helps startup customers build generative AI foundation models and large language models on AWS, and came up with the idea of the LLM Program. In his spare time, Yoshitaka enjoys playing the drums.

Shruti Koparkar is a Senior Product Marketing Manager at AWS. She helps customers explore, evaluate, and adopt Amazon EC2 accelerated computing infrastructure for their machine learning needs.

Read More

Use the ApplyGuardrail API with long-context inputs and streaming outputs in Amazon Bedrock

As generative artificial intelligence (AI) applications become more prevalent, maintaining responsible AI principles becomes essential. Without proper safeguards, large language models (LLMs) can potentially generate harmful, biased, or inappropriate content, posing risks to individuals and organizations. Applying guardrails helps mitigate these risks by enforcing policies and guidelines that align with ethical principles and legal requirements. Guardrails for Amazon Bedrock evaluates user inputs and model responses based on use case-specific policies, and provides an additional layer of safeguards regardless of the underlying foundation model (FM). Guardrails can be applied across all LLMs on Amazon Bedrock, including fine-tuned models and even generative AI applications outside of Amazon Bedrock. You can create multiple guardrails, each configured with a different combination of controls, and use these guardrails across different applications and use cases. You can configure guardrails in multiple ways, including to deny topics, filter harmful content, remove sensitive information, and perform contextual grounding checks.

The new ApplyGuardrail API enables you to assess any text using your preconfigured guardrails in Amazon Bedrock, without invoking the FMs. In this post, we demonstrate how to use the ApplyGuardrail API with long-context inputs and streaming outputs.

ApplyGuardrail API overview

The ApplyGuardrail API offers several key features:

  • Ease of use – You can integrate the API anywhere in your application flow to validate data before processing or serving results to users. For example, in a Retrieval Augmented Generation (RAG) application, you can now evaluate the user input prior to performing the retrieval instead of waiting until the final response generation.
  • Decoupled from FMs – The API is decoupled from FMs, allowing you to use guardrails without invoking FMs from Amazon Bedrock. For example, you can now use the API with models hosted on Amazon SageMaker. Alternatively, you could use it with self-hosted models or with models from third-party providers. All that is needed is to take the input or output and request an assessment using the API.

You can use the assessment results from the ApplyGuardrail API to design the experience on your generative AI application, making sure it adheres to your defined policies and guidelines.

The ApplyGuardrail API request allows you to pass all the content that should be guarded using your defined guardrails. The source field should be set to INPUT when the content to be evaluated is from a user, typically the LLM prompt. The source should be set to OUTPUT when guardrails should be enforced on the model output, typically an LLM response. An example request looks like the following code:

{
    "source": "INPUT" | "OUTPUT",
    "content": [{
        "text": {
            "text": "This is a sample text snippet...",
        }
    }]
}

For more information about the API structure, refer to Guardrails for Amazon Bedrock.
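
The following is a minimal sketch of invoking the API with this request shape through boto3, assuming you have already created a guardrail; the guardrail ID and version shown here are hypothetical placeholders:

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Hypothetical guardrail ID and version; use the values from your own guardrail.
response = bedrock_runtime.apply_guardrail(
    guardrailIdentifier="<your_guardrail_id>",
    guardrailVersion="DRAFT",
    source="INPUT",  # INPUT for user prompts, OUTPUT for model responses
    content=[{"text": {"text": "This is a sample text snippet..."}}],
)
print(response["action"])  # NONE or GUARDRAIL_INTERVENED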

Streaming output

LLMs can generate text in a streaming manner, where the output is produced token by token or word by word, rather than generating the entire output at once. This streaming output capability is particularly useful in scenarios where real-time interaction or continuous generation is required, such as conversational AI assistants or live captioning. Incrementally displaying the output allows for a more natural and responsive user experience. Although it’s advantageous in terms of responsiveness, streaming output introduces challenges when it comes to applying guardrails in real time as the output is generated. Unlike the input scenario, where the entire text is available upfront, the output is generated incrementally, making it difficult to assess the complete context and potential violations.

One of the main challenges is the need to evaluate the output as it’s being generated, without waiting for the entire output to be complete. This requires a mechanism to continuously monitor the streaming output and apply guardrails in real time, while also considering the context and coherence of the generated text. Furthermore, the decision to halt or continue the generation process based on the guardrail assessment needs to be made in real time, which can impact the responsiveness and user experience of the application.

Solution overview: Use guardrails on streaming output

To address the challenges of applying guardrails on streaming output from LLMs, a strategy that combines batching and real-time assessment is required. This strategy involves collecting the streaming output into smaller batches or chunks, evaluating each batch using the ApplyGuardrail API, and then taking appropriate actions based on the assessment results.

The first step in this strategy is to batch the streaming output chunks into smaller batches that are close to a text unit, which is approximately 1,000 characters. If a batch is smaller, such as 600 characters, you’re still charged for an entire text unit (1,000 characters). For cost-effective usage of the API, it’s recommended that the batches of chunks be on the order of whole text units, such as 1,000 characters, 2,000 characters, and so on. This way, you minimize the risk of incurring unnecessary costs.

By batching the output into smaller batches, you can invoke the ApplyGuardrail API more frequently, allowing for real-time assessment and decision-making. The batching process should be designed to maintain the context and coherence of the generated text. This can be achieved by making sure the batches don’t split words or sentences, and by carrying over any necessary context from the previous batch. Though chunking varies between use cases, this post showcases simple character-level chunking for the sake of simplicity; it’s recommended to explore options such as semantic chunking or hierarchical chunking while still adhering to the guidelines mentioned in this post.
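
The following is a minimal sketch of such a batching helper, using a simple sentence-boundary heuristic; flush_batch is a hypothetical callback standing in for the ApplyGuardrail call shown later in this post:

TEXT_UNIT = 1000  # characters per text unit

def batch_stream(chunks, flush_batch, text_unit=TEXT_UNIT):
    """Accumulate streamed text chunks into batches of roughly one text unit,
    flushing only at sentence boundaries so sentences aren't split across batches."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        if len(buffer) >= text_unit:
            # Flush up to the last sentence boundary; carry the remainder over to the next batch.
            cut = max(buffer.rfind("."), buffer.rfind("!"), buffer.rfind("?"))
            if cut == -1:
                cut = len(buffer) - 1
            flush_batch(buffer[:cut + 1])
            buffer = buffer[cut + 1:]
    if buffer:
        flush_batch(buffer)  # evaluate the final partial batch as well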

After the streaming output has been batched into smaller chunks, each chunk can be passed to the API for evaluation. The API will assess the content of each chunk against the defined policies and guidelines, identifying any potential violations or sensitive information.

The assessment results from the API can then be used to determine the appropriate action for the current batch. If a severe violation is detected, the generation process should be halted and a preset message or response displayed to the user instead. In other cases, no severe violation is detected but the guardrail is configured to let the request pass through with modifications, for example when sensitiveInformationPolicyConfig is set to anonymize detected entities instead of blocking. If such an intervention occurs, the output is masked or modified accordingly before being displayed to the user. For latency-sensitive applications, you can also consider creating multiple buffers and multiple guardrails, each with different policies, and then processing them with the ApplyGuardrail API in parallel. Rather than waiting on one guardrail assessment at a time, you can collect assessments from multiple guardrails and multiple batches concurrently, though this technique isn’t implemented in this example.
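
As an illustration of that parallel pattern, the following is a minimal sketch that fans one batch of text out to several guardrails at the same time; the guardrail IDs are hypothetical placeholders, not resources created in this post:

import concurrent.futures
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Hypothetical guardrail IDs; each guardrail would be configured with a different policy set.
guardrails = [
    {"id": "<guardrail_id_1>", "version": "DRAFT"},
    {"id": "<guardrail_id_2>", "version": "DRAFT"},
]

def assess(text, guardrail):
    """Run a single ApplyGuardrail assessment and report whether the guardrail intervened."""
    response = bedrock_runtime.apply_guardrail(
        guardrailIdentifier=guardrail["id"],
        guardrailVersion=guardrail["version"],
        source="OUTPUT",
        content=[{"text": {"text": text}}],
    )
    return response["action"] == "GUARDRAIL_INTERVENED", response

def assess_in_parallel(text, guardrails):
    """Send the same batch to every guardrail concurrently and aggregate the results."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(guardrails)) as pool:
        results = list(pool.map(lambda g: assess(text, g), guardrails))
    # Treat the batch as intervened if any of the guardrails intervened.
    return any(intervened for intervened, _ in results), results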

Example use case: Apply guardrails to streaming output

In this section, we provide an example of how such a strategy could be implemented. Let’s begin with creating a guardrail. You can use the following code sample to create a guardrail in Amazon Bedrock:

import boto3
REGION_NAME = "us-east-1"

bedrock_client = boto3.client("bedrock", region_name=REGION_NAME)
bedrock_runtime = boto3.client("bedrock-runtime", region_name=REGION_NAME)

response = bedrock_client.create_guardrail(
    name="<name>",
    description="<description>",
    ...
)
# alternatively provide the id and version for your own guardrail
guardrail_id = response['guardrailId'] 
guardrail_version = response['version']
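
The policy configuration itself depends on your use case. As one hypothetical example, reusing the bedrock_client from the preceding snippet, a guardrail could anonymize person names and filter harmful content. These values are illustrative and consistent with the masked-names output shown later in this post, but they aren’t necessarily the exact configuration used here:

# Hypothetical configuration for illustration only; adjust policies and messages to your use case.
response = bedrock_client.create_guardrail(
    name="demo-guardrail",
    description="Anonymizes person names and filters harmful content",
    sensitiveInformationPolicyConfig={
        "piiEntitiesConfig": [
            {"type": "NAME", "action": "ANONYMIZE"},
        ]
    },
    contentPolicyConfig={
        "filtersConfig": [
            {"type": "HATE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
        ]
    },
    blockedInputMessaging="Sorry, I can't help with that request.",
    blockedOutputsMessaging="Sorry, I can't share that response.",
)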

A proper assessment of the policies must be conducted to verify whether the input should be sent to an LLM, or whether the output generated by the LLM should be displayed to the user. In the following code, we examine the assessments, which are part of the response from the ApplyGuardrail API, for potential severe violations that lead to a BLOCKED intervention by the guardrail:

from typing import List, Dict
def check_severe_violations(violations: List[Dict]) -> int:
    """
    When the guardrail intervenes, the action on each violation is either BLOCKED or NONE.
    This method counts the number of violations that lead to blocking the request.

    Args:
        violations (List[Dict]): A list of violation dictionaries, where each dictionary has an 'action' key.

    Returns:
        int: The number of severe violations (where the 'action' is 'BLOCKED').
    """
    severe_violations = [violation['action']=='BLOCKED' for violation in violations]
    return sum(severe_violations)

def is_policy_assessement_blocked(assessments: List[Dict]) -> bool:
    """
    While creating the guardrail, you could specify multiple types of policies.
    At the time of assessment, all the policies should be checked for potential violations.
    If there is even one violation that blocks the request, the entire request is blocked.
    This method checks if the policy assessment is blocked based on the given assessments.

    Args:
        assessments (list[dict]): A list of assessment dictionaries, where each dictionary may contain 'topicPolicy', 'wordPolicy', 'sensitiveInformationPolicy', and 'contentPolicy' keys.

    Returns:
        bool: True if the policy assessment is blocked, False otherwise.
    """
    blocked = []
    for assessment in assessments:
        if 'topicPolicy' in assessment:
            blocked.append(check_severe_violations(assessment['topicPolicy']['topics']))
        if 'wordPolicy' in assessment:
            if 'customWords' in assessment['wordPolicy']:
                blocked.append(check_severe_violations(assessment['wordPolicy']['customWords']))
            if 'managedWordLists' in assessment['wordPolicy']:
                blocked.append(check_severe_violations(assessment['wordPolicy']['managedWordLists']))
        if 'sensitiveInformationPolicy' in assessment:
            if 'piiEntities' in assessment['sensitiveInformationPolicy']:
                blocked.append(check_severe_violations(assessment['sensitiveInformationPolicy']['piiEntities']))
            if 'regexes' in assessment['sensitiveInformationPolicy']:
                blocked.append(check_severe_violations(assessment['sensitiveInformationPolicy']['regexes']))
        if 'contentPolicy' in assessment:
            blocked.append(check_severe_violations(assessment['contentPolicy']['filters']))
    severe_violation_count = sum(blocked)
    print(f'::Guardrail:: {severe_violation_count} severe violations detected')
    return severe_violation_count>0

We can then define how to apply the guardrail. If the response from the API leads to action == 'GUARDRAIL_INTERVENED', it means that the guardrail has detected a potential violation. We now need to check whether the violation was severe enough to block the request, or whether the request can be passed through with either the original text or an alternate text modified according to the defined policies:

def apply_guardrail(text, source, guardrail_id, guardrail_version):
    response = bedrock_runtime.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion=guardrail_version, 
        source=source,
        content=[{"text": {"text": text}}]
    )
    if response['action'] == 'GUARDRAIL_INTERVENED':
        is_blocked = is_policy_assessement_blocked(response['assessments'])
        alternate_text = ' '.join([output['text'] for output in response['output']])
        return is_blocked, alternate_text, response
    else:
        # Return the default response in case of no guardrail intervention
        return False, text, response

Let’s now apply our strategy to streaming output from an LLM. We can maintain a buffer_text that accumulates the chunks received from the stream into a batch. As soon as len(buffer_text + new_text) > TEXT_UNIT, meaning the batch is close to a text unit (1,000 characters), it’s ready to be sent to the ApplyGuardrail API. With this mechanism, we make sure we don’t incur the unnecessary cost of invoking the API on smaller chunks, and that enough context is available inside each batch for the guardrail to make meaningful assessments. Additionally, when the LLM completes its generation, the final batch must also be tested for potential violations. If at any point the API detects severe violations, further consumption of the stream is halted and the user is shown the preset message defined at the time the guardrail was created.

In the following example, we ask the LLM to generate three names and then tell us what a bank is. This generation will lead to a GUARDRAIL_INTERVENED action, but the guardrail won’t block the generation; instead, it will anonymize the text (masking the names) and continue generating.

input_message = "List 3 names of prominent CEOs and later tell me what is a bank and what are the benefits of opening a savings account?"

model_id = "anthropic.claude-3-haiku-20240307-v1:0"
text_unit= 1000 # characters

response = bedrock_runtime.converse_stream(
    modelId=model_id,
    messages=[{
        "role": "user",
        "content": [{"text": input_message}]
    }],
    system=[{"text" : "You are an assistant that helps with tasks from users. Be as elaborate as possible"}],
)

stream = response.get('stream')
buffer_text = ""
if stream:
    for event in stream:
        if 'contentBlockDelta' in event:
            new_text = event['contentBlockDelta']['delta']['text']
            if len(buffer_text + new_text) > text_unit:
                is_blocked, alt_text, guardrail_response = apply_guardrail(buffer_text, "OUTPUT", guardrail_id, guardrail_version)
                # print(alt_text, end="")
                if is_blocked:
                    break
                buffer_text = new_text
            else: 
                buffer_text += new_text

        if 'messageStop' in event:
            # print(f"\nStop reason: {event['messageStop']['stopReason']}")
            is_blocked, alt_text, guardrail_response = apply_guardrail(buffer_text, "OUTPUT", guardrail_id, guardrail_version)
            # print(alt_text)

After running the preceding code, we receive an example output with masked names:

Certainly! Here are three names of prominent CEOs:

1. {NAME} - CEO of Apple Inc.
2. {NAME} - CEO of Microsoft Corporation
3. {NAME} - CEO of Amazon

Now, let's discuss what a bank is and the benefits of opening a savings account.

A bank is a financial institution that accepts deposits, provides loans, and offers various other financial services to its customers. Banks play a crucial role in the economy by facilitating the flow of money and enabling financial transactions.

Long-context inputs

RAG is a technique that enhances LLMs by incorporating external knowledge sources. It allows LLMs to reference authoritative knowledge bases before generating responses, producing output tailored to specific contexts while providing relevance, accuracy, and efficiency. The input to the LLM in a RAG scenario can be quite long, because it includes the user’s query concatenated with the retrieved information from the knowledge base. This long-context input poses challenges when applying guardrails, because the input may exceed the character limits imposed by the ApplyGuardrail API. To learn more about the quotas applied to Guardrails for Amazon Bedrock, refer to Guardrails quotas.

In the previous section, we evaluated a strategy to mitigate the risk posed by the model response. In the case of inputs, the risk can arise from the query alone or from the query together with the retrieved context for that query.

The retrieved information from the knowledge base may contain sensitive or potentially harmful content, which needs to be identified and handled appropriately, for example masking sensitive information, before being passed to the LLM for generation. Therefore, guardrails must be applied to the entire input to make sure it adheres to the defined policies and constraints.

Solution overview: Use guardrails on long-context inputs

The ApplyGuardrail API has a default limit of 25 text units (approximately 25,000 characters) per second. If the input exceeds this limit, it needs to be chunked and processed sequentially to avoid throttling. Therefore, the strategy is relatively straightforward: if the length of the input text is less than 25 text units (25,000 characters), it can be evaluated in a single request; otherwise, it needs to be broken down into smaller pieces. The chunk size can vary depending on application behavior and the type of context in the application; you can start with 12 text units and iterate to find the most suitable chunk size. This way, we maximize the allowed default limit while keeping most of the context intact in a single request. Even if the guardrail action is GUARDRAIL_INTERVENED, it doesn’t mean the input is BLOCKED. It could also be the case that the input is processed and sensitive information is masked; in that case, the input text must be recompiled with the processed response from the applied guardrail.

from textwrap import wrap  # used below to split long text into fixed-size chunks

text_unit = 1000 # characters
limit_text_unit = 25
max_text_units_in_chunk = 12
def apply_guardrail_with_chunking(text, guardrail_id, guardrail_version="DRAFT"):
    text_length = len(text)
    filtered_text = ''
    if text_length <= limit_text_unit * text_unit:
        return apply_guardrail(text, "INPUT", guardrail_id, guardrail_version)
    else:
        # If the text length is greater than the default text unit limits then it's better to chunk the text to avoid throttling.
        for i, chunk in enumerate(wrap(text, max_text_units_in_chunk * text_unit)):
            print(f'::Guardrail::Applying guardrails at chunk {i+1}')
            is_blocked, alternate_text, response = apply_guardrail(chunk, "INPUT", guardrail_id, guardrail_version)
            if is_blocked:
                filtered_text = alternate_text
                break
            # It could be the case that guardrails intervened and anonymized PII in the input text,
            # we can then take the output from guardrails to create filtered text response.
            filtered_text += alternate_text
        return is_blocked, filtered_text, response
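
A hypothetical usage of this helper in a RAG flow might look like the following, assuming user_query and retrieved_documents come from your own retrieval pipeline and that guardrail_id and guardrail_version were created earlier:

# Hypothetical inputs; in practice these come from your RAG pipeline.
user_query = "What is our refund policy?"
retrieved_documents = ["<chunk 1 from the knowledge base>", "<chunk 2 from the knowledge base>"]

long_input = user_query + "\n\n" + "\n\n".join(retrieved_documents)
is_blocked, filtered_input, guardrail_response = apply_guardrail_with_chunking(
    long_input, guardrail_id, guardrail_version
)
if not is_blocked:
    # Pass filtered_input (with any sensitive information already masked) to the LLM.
    print(filtered_input)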

Run the full notebook to test this strategy with long-context input.

Best practices and considerations

When applying guardrails, it’s essential to follow best practices to maintain efficient and effective content moderation:

  • Optimize chunking strategy – Carefully consider the chunking strategy. The chunk size should balance the trade-off between minimizing the number of API calls and making sure the context isn’t lost due to overly small chunks. Similarly, the chunking strategy should take into account the context split; a critical piece of text could span two (or more) chunks if not carefully divided.
  • Asynchronous processing – Implement asynchronous processing for RAG contexts. This can help decouple the guardrail application process from the main application flow, improving responsiveness and overall performance. For frequently retrieved context from vector databases, ApplyGuardrail could be applied one time and the results stored in metadata, avoiding redundant API calls for the same content. This can significantly improve performance and reduce costs (a minimal caching sketch follows this list).
  • Develop comprehensive test suites – Create a comprehensive test suite that covers a wide range of scenarios, including edge cases and corner cases, to validate the effectiveness of your guardrail implementation.
  • Implement fallback mechanisms – There could be scenarios where the guardrail you created doesn’t cover all the possible vulnerabilities and is unable to catch edge cases. For such scenarios, it’s wise to have a fallback mechanism. One such option could be to bring a human into the loop, or to use an LLM as a judge to evaluate both the input and the output.
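
The following is a minimal caching sketch for the asynchronous-processing consideration above, assuming the apply_guardrail helper defined earlier in this post; in practice the cached result could be stored in the vector store’s metadata rather than an in-memory dictionary:

import hashlib

# Simple in-memory cache keyed by guardrail, source, and content hash.
_guardrail_cache = {}

def apply_guardrail_cached(text, source, guardrail_id, guardrail_version):
    """Reuse a previous assessment for identical content instead of calling the API again."""
    key = hashlib.sha256(
        f"{guardrail_id}:{guardrail_version}:{source}:{text}".encode("utf-8")
    ).hexdigest()
    if key not in _guardrail_cache:
        _guardrail_cache[key] = apply_guardrail(text, source, guardrail_id, guardrail_version)
    return _guardrail_cache[key]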

In addition to the aforementioned considerations, it’s also good practice to regularly audit your guardrail implementation, continuously refine and adapt your guardrail implementation, and implement logging and monitoring mechanisms to capture and analyze the performance and effectiveness of your guardrails.

Clean up

The only resource we created in this example is a guardrail. To delete the guardrail, complete the following steps:

  1. On the Amazon Bedrock console, under Safeguards in the navigation pane, choose Guardrails.
  2. Select the guardrail you created and choose Delete.

Alternatively, you can use the SDK:

bedrock_client.delete_guardrail(guardrailIdentifier = "<your_guardrail_id>")

Key takeaways

Applying guardrails is crucial for maintaining responsible and safe content generation. With the ApplyGuardrail API from Amazon Bedrock, you can effectively moderate both inputs and outputs, protecting your generative AI application against violations and maintaining compliance with your content policies.

Key takeaways from this post include:

  • Understand the importance of applying guardrails in generative AI applications to mitigate risks and maintain content moderation standards
  • Use the ApplyGuardrail API from Amazon Bedrock to validate inputs and outputs against defined policies and rules
  • Implement chunking strategies for long-context inputs and batching techniques for streaming outputs to efficiently utilize the ApplyGuardrail API
  • Follow best practices, optimize performance, and continuously monitor and refine your guardrail implementation to maintain effectiveness and alignment with evolving content moderation needs

Benefits

By incorporating the ApplyGuardrail API into your generative AI application, you can unlock several benefits:

  • Content moderation at scale – The API allows you to moderate content at scale, so your application remains compliant with content policies and guidelines, even when dealing with large volumes of data
  • Customizable policies – You can define and customize content moderation policies tailored to your specific use case and requirements, making sure your application adheres to your organization’s standards and values
  • Real-time moderation – The API enables real-time content moderation, allowing you to detect and mitigate potential violations as they occur, providing a safe and responsible user experience
  • Integration with any LLM – ApplyGuardrail is an independent API, so it can be integrated with any LLM of your choice, letting you take advantage of the power of generative AI while maintaining control over the content being generated
  • Cost-effective solution – With its pay-per-use pricing model and efficient text unit-based billing, the API provides a cost-effective solution for content moderation, especially when dealing with large volumes of data

Conclusion

By using the ApplyGuardrail API from Amazon Bedrock and following the best practices outlined in this post, you can make sure your generative AI application remains safe, responsible, and compliant with content moderation standards, even with long-context inputs and streaming outputs.

To further explore the capabilities of the ApplyGuardrail API and its integration with your generative AI application, consider experimenting with the API using the following resources:

  • Refer to Guardrails for Amazon Bedrock for detailed information on the ApplyGuardrail API, its usage, and integration examples
  • Check out the AWS samples GitHub repository for sample code and reference architectures demonstrating the integration of the ApplyGuardrail API with various generative AI applications
  • Participate in AWS-hosted workshops and tutorials focused on responsible AI and content moderation, where you can learn from experts and gain hands-on experience with the ApplyGuardrail API


About the Author

Talha Chattha is a Generative AI Specialist Solutions Architect at Amazon Web Services, based in Stockholm. Talha helps establish practices to ease the path to production for generative AI workloads. Talha is an expert in Amazon Bedrock and supports customers across EMEA. He is passionate about meta-agents, scalable on-demand inference, advanced RAG solutions, and cost-optimized prompt engineering with LLMs. When not shaping the future of AI, he explores the scenic European landscapes and delicious cuisines. Connect with Talha on LinkedIn at /in/talha-chattha/.

Read More

NVIDIA and Zoox Pave the Way for Autonomous Ride-Hailing

In celebration of Zoox’s 10th anniversary, NVIDIA founder and CEO Jensen Huang recently joined the robotaxi company’s CEO, Aicha Evans, and its cofounder and CTO, Jesse Levinson, to discuss the latest in autonomous vehicle (AV) innovation and experience a ride in the Zoox robotaxi.

In a fireside chat at Zoox’s headquarters in Foster City, Calif., the trio reflected on the two companies’ decade of collaboration. Evans and Levinson highlighted how Zoox pioneered the concept of a robotaxi purpose-built for ride-hailing and created groundbreaking innovations along the way, using NVIDIA technology.

“The world has never seen a robotics company like this before,” said Huang. “Zoox started out solely as a sustainable robotics company that delivers robots into the world as a fleet.”

Since 2014, Zoox has been on a mission to create fully autonomous, bidirectional vehicles purpose-built for ride-hailing services. This sets it apart in an industry largely focused on retrofitting existing cars with self-driving technology.

A decade later, the company is operating its robotaxi, powered by NVIDIA GPUs, on public roads.

Computing at the Core

Zoox robotaxis are, at their core, supercomputers on wheels. They’re built on multiple NVIDIA GPUs dedicated to processing the enormous amounts of data generated in real time by their sensors.

The sensor array includes cameras, lidar, radar, long-wave infrared sensors and microphones. The onboard computing system rapidly processes the raw sensor data collected and fuses it to provide a coherent understanding of the vehicle’s surroundings.

The processed data then flows through a perception engine and prediction module to planning and control systems, enabling the vehicle to navigate complex urban environments safely.

NVIDIA GPUs deliver the immense computing power required for the Zoox robotaxis’ autonomous capabilities and continuous learning from new experiences.

Using Simulation as a Virtual Proving Ground

Key to Zoox’s AV development process is its extensive use of simulation. The company uses NVIDIA GPUs and software tools to run a wide array of simulations, testing its autonomous systems in virtual environments before real-world deployment.

These simulations range from synthetic scenarios to replays of real-world scenarios created using data collected from test vehicles. Zoox uses retrofitted Toyota Highlanders equipped with the same sensor and compute packages as its robotaxis to gather driving data and validate its autonomous technology.

This data is then fed back into simulation environments, where it can be used to create countless variations and replays of scenarios and agent interactions.

Zoox also uses what it calls “adversarial simulations,” carefully crafted scenarios designed to test the limits of the autonomous systems and uncover potential edge cases.

The company’s comprehensive approach to simulation allows it to rapidly iterate and improve its autonomous driving software, bolstering AV safety and performance.

“We’ve been using NVIDIA hardware since the very start,” said Levinson. “It’s a huge part of our simulator, and we rely on NVIDIA GPUs in the vehicle to process everything around us in real time.”

A Neat Way to Seat

Zoox’s robotaxi, with its unique bidirectional design and carriage-style seating, is optimized for autonomous operation and passenger comfort, eliminating traditional concepts of a car’s “front” and “back” and providing equal comfort and safety for all occupants.

“I came to visit you when you were zero years old, and the vision was compelling,” Huang said, reflecting on Zoox’s evolution over the years. “The challenge was incredible. The technology, the talent — it is all world-class.”

Using NVIDIA GPUs and tools, Zoox is poised to redefine urban mobility, pioneering a future of safe, efficient and sustainable autonomous transportation for all.

From Testing Miles to Market Projections

As the AV industry gains momentum, recent projections highlight the potential for explosive growth in the robotaxi market. Guidehouse Insights forecasts over 5 million robotaxi deployments by 2030, with numbers expected to surge to almost 34 million by 2035.

The regulatory landscape reflects this progress, with 38 companies currently holding valid permits to test AVs with safety drivers in California. Zoox is currently one of only six companies permitted to test AVs without safety drivers in the state.

As the industry advances, Zoox has created a next-generation robotaxi by combining cutting-edge onboard computing with extensive simulation and development.

In the image at top, NVIDIA founder and CEO Jensen Huang stands with Zoox CEO Aicha Evans and Zoox cofounder and CTO Jesse Levinson in front of a Zoox robotaxi.

Read More

NVIDIA Researchers Harness Real-Time Gen AI to Build Immersive Desert World

NVIDIA researchers used NVIDIA Edify, a multimodal architecture for visual generative AI, to build a detailed 3D desert landscape within a few minutes in a live demo at SIGGRAPH’s Real-Time Live event on Tuesday.

During the event — one of the prestigious graphics conference’s top sessions — NVIDIA researchers showed how, with the support of an AI agent, they could build and edit a desert landscape from scratch within five minutes. The live demo highlighted how generative AI can act as an assistant to artists by accelerating ideation and generating custom secondary assets that would otherwise have been sourced from a repository.

By drastically decreasing ideation time, these AI technologies will empower 3D artists to be more productive and creative — giving them the tools to explore concepts faster and expedite parts of their workflows. They could, for example, generate the background assets or 360 HDRi environments that the scene needs in minutes, instead of spending hours finding or creating them.

From Idea to 3D Scene in Three Minutes

Creating a full 3D scene is a complex, time-consuming task. Artists must support their hero asset with plenty of background objects to create a rich scene, then find an appropriate background and an environment map to light it. Due to time constraints, they’ve often had to make a trade-off between rapid results and creative exploration.

With the support of AI agents, creative teams can achieve both goals: quickly bring concepts to life and continue iterating to achieve the right look.

In the Real-Time Live demo, the researchers used an AI agent to instruct an NVIDIA Edify-powered model to generate dozens of 3D assets, including cacti, rocks and the skull of a bull — with previews produced in just seconds.

They next directed the agent to harness other models to create potential backgrounds and a layout of how the objects would be placed in the scene — and showcased how the agent could adapt to last-minute changes in creative direction by quickly swapping the rocks for gold nuggets.

With a design plan in place, they prompted the agent to create full-quality assets and render the scene as a photorealistic image in NVIDIA Omniverse USD Composer, an app for virtual world-building.

NVIDIA Edify Accelerates Environment Generation 

NVIDIA Edify models can help creators focus on hero assets while accelerating the creation of background environments and objects using AI-powered scene generation tools. The Real-Time Live demo showcased two Edify models: 

  • Edify 3D generates ready-to-edit 3D meshes from text or image prompts. Within seconds, the model can generate previews, including rotating animations of each object, to help creators rapidly prototype before committing to a specific design.
  • Edify 360 HDRi uses text or image prompts to generate up to 16K high-dynamic range images (HDRi) of nature landscapes, which can be used as backgrounds and to light scenes.

During the demo, the researchers also showcased an AI agent powered by a large language model, and USD Layout, an AI model that generates scene layouts using OpenUSD, a platform for 3D workflows.

At SIGGRAPH, NVIDIA also announced that two leading creative content companies are giving designers and artists new ways to boost productivity with generative AI using tools powered by NVIDIA Edify.

Shutterstock has launched in commercial beta its Generative 3D service, which lets creators quickly prototype and generate 3D assets using text or image prompts. Its 360 HDRi generator based on Edify also entered early access.

Getty Images updated its Generative AI by Getty Images service with the latest version of NVIDIA Edify. Users can now create images twice as fast, with improved output quality and prompt adherence, and advanced controls and fine-tuning.

Harnessing Universal Scene Description in NVIDIA Omniverse

The 3D objects, environment maps and layouts generated using Edify models are structured with USD, a standard format for describing and composing 3D worlds. This compatibility allows artists to immediately import Edify-powered creations into Omniverse USD Composer.

Within Composer, they can use popular digital content creation tools to further modify the scene by, for example, changing the position of objects, modifying their appearance or adjusting lighting.

Real-Time Live is one of the most anticipated events at SIGGRAPH, featuring about a dozen real-time applications including generative AI, virtual reality and live performance capture technology.

Read More

Oracle Cloud Infrastructure Expands NVIDIA GPU-Accelerated Instances for AI, Digital Twins and More

Enterprises are rapidly adopting generative AI, large language models (LLMs), advanced graphics and digital twins to increase operational efficiencies, reduce costs and drive innovation.

However, to adopt these technologies effectively, enterprises need access to state-of-the-art, full-stack accelerated computing platforms. To meet this demand, Oracle Cloud Infrastructure (OCI) today announced NVIDIA L40S GPU bare-metal instances available to order and the upcoming availability of a new virtual machine accelerated by a single NVIDIA H100 Tensor Core GPU. This new VM expands OCI’s existing H100 portfolio, which includes an NVIDIA HGX H100 8-GPU bare-metal instance.

Paired with NVIDIA networking and running the NVIDIA software stack, these platforms deliver powerful performance and efficiency, enabling enterprises to advance generative AI.

NVIDIA L40S Now Available to Order on OCI

The NVIDIA L40S is a universal data center GPU designed to deliver breakthrough multi-workload acceleration for generative AI, graphics and video applications. Equipped with fourth-generation Tensor Cores and support for the FP8 data format, the L40S GPU excels in training and fine-tuning small- to mid-size LLMs and in inference across a wide range of generative AI use cases.

For example, a single L40S GPU (FP8) can generate up to 1.4x more tokens per second than a single NVIDIA A100 Tensor Core GPU (FP16) for Llama 3 8B with NVIDIA TensorRT-LLM at an input and output sequence length of 128.

The L40S GPU also has best-in-class graphics and media acceleration. Its third-generation NVIDIA Ray Tracing Cores (RT Cores) and multiple encode/decode engines make it ideal for advanced visualization and digital twin applications.

The L40S GPU delivers up to 3.8x the real-time ray-tracing performance of its predecessor, and supports NVIDIA DLSS 3 for faster rendering and smoother frame rates. This makes the GPU ideal for developing applications on the NVIDIA Omniverse platform, enabling real-time, photorealistic 3D simulations and AI-enabled digital twins. With Omniverse on the L40S GPU, enterprises can develop advanced 3D applications and workflows for industrial digitalization that will allow them to design, simulate and optimize products, processes and facilities in real time before going into production.

OCI will offer the L40S GPU in its BM.GPU.L40S.4 bare-metal compute shape, featuring four NVIDIA L40S GPUs, each with 48GB of GDDR6 memory. This shape includes local NVMe drives with 7.38TB capacity, 4th Generation Intel Xeon CPUs with 112 cores and 1TB of system memory.

With OCI’s bare-metal compute architecture, these shapes eliminate the overhead of virtualization for high-throughput and latency-sensitive AI or machine learning workloads. The accelerated compute shape features the NVIDIA BlueField-3 DPU for improved server efficiency, offloading data center tasks from CPUs to accelerate networking, storage and security workloads. The use of BlueField-3 DPUs furthers OCI’s strategy of off-box virtualization across its entire fleet.

OCI Supercluster with NVIDIA L40S enables ultra-high performance with 800Gbps of internode bandwidth and low latency for up to 3,840 GPUs. OCI’s cluster network uses NVIDIA ConnectX-7 NICs over RoCE v2 to support high-throughput and latency-sensitive workloads, including AI training.

“We chose OCI AI infrastructure with bare-metal instances and NVIDIA L40S GPUs for 30% more efficient video encoding,” said Sharon Carmel, CEO of Beamr Cloud. “Videos processed with Beamr Cloud on OCI will have up to 50% reduced storage and network bandwidth consumption, speeding up file transfers by 2x and increasing productivity for end users. Beamr will provide OCI customers video AI workflows, preparing them for the future of video.”

Single-GPU H100 VMs Coming Soon on OCI 

The VM.GPU.H100.1 compute virtual machine shape, accelerated by a single NVIDIA H100 Tensor Core GPU, is coming soon to OCI. This will provide cost-effective, on-demand access for enterprises looking to use the power of NVIDIA H100 GPUs for their generative AI and HPC workloads.

A single H100 provides a good platform for smaller workloads and LLM inference. For example, one H100 GPU can generate more than 27,000 tokens per second for Llama 3 8B (up to 4x more throughput than a single A100 GPU at FP16 precision) with NVIDIA TensorRT-LLM at an input and output sequence length of 128 and FP8 precision.

The VM.GPU.H100.1 shape includes 2×3.4TB of NVMe drive capacity, 13 cores of 4th Gen Intel Xeon processors and 246GB of system memory, making it well-suited for a range of AI tasks.

“Oracle Cloud’s bare-metal compute with NVIDIA H100 and A100 GPUs, low-latency Supercluster and high-performance storage delivers up to 20% better price-performance for Altair’s computational fluid dynamics and structural mechanics solvers,” said Yeshwant Mummaneni, chief engineer of data management analytics at Altair. “We look forward to leveraging these GPUs with virtual machines for the Altair Unlimited virtual appliance.”

GH200 Bare-Metal Instances Available for Validation

OCI has also made available the BM.GPU.GH200 compute shape for customer testing. It features the NVIDIA Grace Hopper Superchip and NVLink-C2C, a high-bandwidth, cache-coherent 900GB/s connection between the NVIDIA Grace CPU and NVIDIA Hopper GPU. This provides over 600GB of accessible memory, enabling up to 10x higher performance for applications running terabytes of data compared to the NVIDIA A100 GPU.

Optimized Software for Enterprise AI 

Enterprises have a wide variety of NVIDIA GPUs to accelerate their AI, HPC and data analytics workloads on OCI. However, maximizing the full potential of these GPU-accelerated compute instances requires an optimized software layer.

NVIDIA NIM, part of the NVIDIA AI Enterprise software platform available on the OCI Marketplace, is a set of easy-to-use microservices designed for the secure, reliable deployment of high-performance AI model inference, helping enterprises build world-class generative AI applications.

Optimized for NVIDIA GPUs, NIM pre-built containers offer developers improved cost of ownership, faster time to market and security. NIM microservices for popular community models, found on the NVIDIA API Catalog, can be deployed easily on OCI.

Performance will continue to improve over time with upcoming GPU-accelerated instances, including NVIDIA H200 Tensor Core GPUs and NVIDIA Blackwell GPUs.

Order the L40S GPU and test the GH200 Superchip by reaching out to OCI. To learn more, join Oracle and NVIDIA at SIGGRAPH, the world’s premier graphics conference, running through Aug. 1.

See notice regarding software product information.

Read More

Research Focus: Week of July 29, 2024

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Scalable Differentiable Causal Discovery in the Presence of Latent Confounders with Skeleton Posterior

Differentiable causal discovery has made significant advancements in the learning of directed acyclic graphs. However, its application to real-world datasets remains restricted due to the ubiquity of latent confounders and the requirement to learn maximal ancestral graphs (MAGs). Previous differentiable MAG learning algorithms have been limited to small datasets and failed to scale to larger ones (e.g., with more than 50 variables).

In a recent paper: Scalable Differentiable Causal Discovery in the Presence of Latent Confounders with Skeleton Posterior, researchers from Microsoft and external colleagues explore the potential for causal skeleton, which is the undirected version of the causal graph, to improve accuracy and reduce the search space of the optimization procedure, thereby enhancing the performance of differentiable causal discovery. They propose SPOT (Skeleton Posterior-guided OpTimization), a two-phase framework that harnesses skeleton posterior for differentiable causal discovery in the presence of latent confounders.

Extensive experiments on various datasets show that SPOT substantially outperforms state-of-the-art methods for MAG learning. SPOT also demonstrates the accuracy of its skeleton posterior estimation in comparison with non-parametric bootstrap-based and, more recently, variational inference-based methods. The adoption of skeleton posterior exhibits strong promise in various causal discovery tasks.


Evaluating the Feasibility of Visual Imagery for an EEG-Based Brain–Computer Interface

Brain signals recorded via non-invasive electroencephalography (EEG) could help patients with severe neuromuscular disorders communicate with and control the world around them. Brain-computer interface (BCI) technology could use visual imagery, or the mental simulation of visual information from memory, as an effective control paradigm, directly conveying the user’s intention.

Initial investigations have been unable to fully evaluate the capabilities of true spontaneous visual mental imagery. One major limitation is that the target image is typically displayed immediately preceding the imagery period. This paradigm does not capture spontaneous mental imagery, as would be necessary in an actual BCI application, but something more akin to short-term retention in visual working memory.

In a recent paper: Evaluating the Feasibility of Visual Imagery for an EEG-Based Brain–Computer Interface, researchers from Microsoft and external colleagues show that short-term visual imagery following the presentation of a specific target image provides a stronger, more easily classifiable neural signature in EEG than spontaneous visual imagery from long-term memory following an auditory cue for the image. This research, published in IEEE Transactions on Neural Systems and Rehabilitation Engineering, provides the first direct comparison of short-term and long-term visual imagery tasks and provides greater insight into the feasibility of using visual imagery as a BCI control strategy.

Evolving Roles and Workflows of Creative Practitioners in the Age of Generative AI

Many creative practitioners – designers, software developers, and architects, for example – are using generative AI models to produce text, images, and other assets. While human-computer interaction (HCI) research explores specific generative AI models and creativity support tools, little is known about practitioners’ evolving roles and workflows with models across a project’s stages. This knowledge could help guide the development of the next generation of creativity support tools.

In a recent paper: Evolving Roles and Workflows of Creative Practitioners in the Age of Generative AI, researchers from Microsoft and the University of California-San Diego contribute to this knowledge by employing a triangulated method to capture information from interviews, videos, and survey responses of creative practitioners reflecting on projects they completed with generative AI. Their observations help uncover a set of factors that capture practitioners’ perceived roles, challenges, benefits, and interaction patterns when creating with generative AI. From these factors, the researchers offer insights and propose design opportunities and priorities that serve to encourage reflection from the wider community of creativity support tools and generative AI stakeholders, such as systems creators, researchers, and educators, on how to develop systems that meet the needs of creatives in human-centered ways.


“It’s like a rubber duck that talks back”: Understanding Generative AI-Assisted Data Analysis Workflows through a Participatory Prompting Study

End-user tools based on generative AI can help people complete many tasks. One such task is data analysis, which is notoriously challenging for non-experts, but also holds much potential for AI. To understand how data analysis workflows can be assisted or impaired by generative AI, researchers from Microsoft conducted a study using Bing Chat via participatory prompting, a newer methodology in which users and researchers reflect together on tasks through co-engagement with generative AI. The recent paper: “It’s like a rubber duck that talks back”: Understanding Generative AI-Assisted Data Analysis Workflows through a Participatory Prompting Study, demonstrates the value of the participatory prompting method. The researchers found that generative AI benefits the information foraging and sensemaking loops of data analysis in specific ways, but also introduces its own barriers and challenges, arising from the difficulties of query formulation, specifying context, and verifying results. Based on these findings, the paper presents several implications for future AI research and the design of new generative AI interactions.

Taking AI to Warp Speed: Decoding How NVIDIA’s Latest RTX-Powered Tools and Apps Help Developers Accelerate AI on PCs and Workstations

Editor’s note: This post is part of the AI Decoded series, which demystifies AI by making the technology more accessible, and showcases new hardware, software, tools and accelerations for RTX PC users.

NVIDIA is spotlighting the latest NVIDIA RTX-powered tools and apps at SIGGRAPH, an annual trade show at the intersection of graphics and AI.

These AI technologies provide advanced ray-tracing and rendering techniques, enabling highly realistic graphics and immersive experiences in gaming, virtual reality, animation and cinematic special effects. RTX AI PCs and workstations are helping drive the future of interactive digital media, content creation, productivity and development.

ACE’s AI Magic

During a SIGGRAPH fireside chat, NVIDIA founder and CEO Jensen Huang introduced “James” — an interactive digital human built on NVIDIA NIM microservices — that showcases the potential of AI-driven customer interactions.

Using NVIDIA ACE technology and based on a customer-service workflow, James is a virtual assistant that can connect with people using emotions, humor and contextually accurate responses. Soon, users will be able to interact with James in real time at ai.nvidia.com.

James is a virtual assistant in NVIDIA ACE.

NVIDIA also introduced the latest advancements in the NVIDIA Maxine AI platform for telepresence, as well as companies adopting NVIDIA ACE, a suite of technologies for bringing digital humans to life with generative AI. These technologies enable digital human development with AI models for speech and translation, vision, intelligence, realistic animation and behavior, and lifelike appearance.

Maxine features two AI technologies that enhance the digital human experience in telepresence scenarios: Maxine 3D and Audio2Face-2D.

Developers can harness Maxine and ACE technologies to drive more engaging and natural interactions for people using digital interfaces across customer service, gaming and other interactive experiences.

Tapping advanced AI, NVIDIA ACE technologies allow developers to design avatars that can respond to users in real time with lifelike animations, speech and emotions. RTX GPUs provide the necessary computational power and graphical fidelity to render ACE avatars with stunning detail and fluidity.

With ongoing advancements and increasing adoption, ACE is setting new benchmarks for building virtual worlds and sparking innovation across industries. Developers tapping into the power of ACE with RTX GPUs can build more immersive applications and advanced, AI-based, interactive digital media experiences.

RTX Updates Unleash AI-rtistry for Creators

NVIDIA GeForce RTX PCs and NVIDIA RTX workstations are getting an upgrade with GPU accelerations that provide users with enhanced AI content-creation experiences.

For video editors, RTX Video HDR is now available through Wondershare Filmora and DaVinci Resolve. With this technology, users can transform any content into high dynamic range video with richer colors and greater detail in light and dark scenes — making it ideal for gaming videos, travel vlogs or event filmmaking. Combining RTX Video HDR with RTX Video Super Resolution further improves visual quality by removing encoding artifacts and enhancing details.

RTX Video HDR requires an RTX GPU connected to an HDR10-compatible monitor or TV. Users with an RTX GPU-powered PC can send files to the Filmora desktop app and continue to edit with local RTX acceleration, doubling the speed of the export process with dual encoders on GeForce RTX 4070 Ti or above GPUs. Popular media player VLC in June added support for RTX Video Super Resolution and RTX Video HDR, adding AI-enhanced video playback.

Read this blog on RTX-powered video editing and the RTX Video FAQ for more information. Learn more about Wondershare Filmora’s AI-powered features.

In addition, 3D artists are gaining more AI applications and tools that simplify and enhance workflows, with updates from Replikant, Adobe, Topaz and Getty Images.

Replikant, an AI-assisted 3D animation platform, is integrating NVIDIA Audio2Face, an ACE technology, to enable improved lip sync and facial animation. By taking advantage of NVIDIA-accelerated generative models, users can enjoy real-time visuals enhanced by RTX and NVIDIA DLSS technology. Replikant is now available on Steam.

Adobe Substance 3D Modeler has added Search Asset Library by Shape, an AI-powered feature designed to streamline the replacement and enhancement of complex shapes using existing 3D models. This new capability significantly accelerates prototyping and enhances design workflows.

New AI features in Adobe Substance 3D integrate advanced generative AI capabilities, enhancing its texturing and material-creation tools. Adobe has launched the first integration of its Firefly generative AI capabilities into Substance 3D Sampler and Stager, making 3D workflows more seamless and productive for industrial designers, game developers and visual effects professionals.

For tasks like text-to-texture generation and prompt descriptions, Substance 3D users can generate photorealistic or stylized textures. These textures can then be applied directly to 3D models. The new Text to Texture and Generative Background features significantly accelerate traditionally time-consuming and intricate 3D texturing and staging tasks.

Powered by NVIDIA RTX Tensor Cores, Substance 3D can significantly accelerate computations and allows for more intuitive and creative design processes. This development builds on Adobe’s innovation with Firefly-powered Creative Cloud upgrades in Substance 3D workflows.

Topaz AI has added NVIDIA TensorRT acceleration for multi-GPU workflows, enabling parallelization across multiple GPUs for supercharged rendering speeds — up to 2x faster with two GPUs over a single GPU system, and scaling further with additional GPUs.

Getty Images has updated its Generative AI by iStock service with new features to enhance image generation and quality. Powered by NVIDIA Edify models, the latest enhancement delivers generation speeds set to reach around six seconds for four images, doubling the performance of the previous model and keeping its speeds at the forefront of the industry. The improved Text-2-Image and Image-2-Image functionalities provide higher-quality results and greater adherence to user prompts.

Generative AI by iStock users can now also designate camera settings such as focal length (narrow, standard or wide) and depth of field (near or far). Improvements to generative AI super-resolution enhance image quality by using AI to create new pixels, significantly improving resolution without over-sharpening the image.

LLM-azing AI

ChatRTX — a tech demo that connects a large language model (LLM), like Meta’s Llama, to a user’s data for quickly querying notes, documents or images — is getting a user interface (UI) makeover, offering a cleaner, more polished experience.

ChatRTX also serves as an open-source reference project that shows developers how to build powerful, local, retrieval-augmented generation (RAG) applications accelerated by RTX.

ChatRTX is getting a user interface (UI) makeover.

The latest version of ChatRTX, released today, uses the Electron + Material UI framework, which lets developers more easily add their own UI elements or extend the technology’s functionality. The update also includes a new architecture that simplifies the integration of different UIs and streamlines the building of new chat and RAG applications on top of the ChatRTX backend application programming interface.

End users can download the latest version of ChatRTX from the ChatRTX web page. Developers can find the source code for the new release on the ChatRTX GitHub repository.

Meta Llama 3.1-8B models are now optimized for inference on NVIDIA GeForce RTX PCs and NVIDIA RTX workstations. These models are natively supported with NVIDIA TensorRT-LLM, open-source software that accelerates LLM inference performance.

Dell’s AI Chatbots: Harnessing RTX Rocket Fuel

Dell is presenting how enterprises can boost AI development with an optimized RAG chatbot using NVIDIA AI Workbench and an NVIDIA NIM microservice for Llama 3. Using the NVIDIA AI Workbench Hybrid RAG Project, Dell is demonstrating how the chatbot can be used to converse with enterprise data that’s embedded in a local vector database, with inference running in one of three ways:

  • Locally on a Hugging Face TGI server
  • In the cloud using NVIDIA inference endpoints
  • On self-hosted NVIDIA NIM microservices

Learn more about the AI Workbench Hybrid RAG Project. SIGGRAPH attendees can experience this technology firsthand at Dell Technologies’ booth 301.

HP AI Studio: Innovate Faster With CUDA-X and Galileo

At SIGGRAPH, HP is presenting the Z by HP AI Studio, a centralized data science platform. Announced in October 2023, AI Studio has now been enhanced with the latest NVIDIA CUDA-X libraries as well as HP’s recent partnership with Galileo, a generative AI trust-layer company. Key benefits include:

  • Deploy projects faster: Configure, connect and share local and remote projects quickly.
  • Collaborate with ease: Access and share data, templates and experiments effortlessly.
  • Work your way: Choose where to work on your data, easily switching between online and offline modes.

Designed to enhance productivity and streamline AI development, AI Studio allows data science teams to focus on innovation. Visit HP’s booth 501 to see how AI Studio with RAPIDS cuDF can boost data preprocessing to accelerate AI pipelines. Apply for early access to AI Studio.

An RTX Speed Surge for Stable Diffusion

Stable Diffusion 3.0, the latest model from Stability AI, has been optimized with TensorRT to provide a 60% speedup.

A NIM microservice for Stable Diffusion 3 with optimized performance is available for preview on ai.nvidia.com.

There’s still time to join NVIDIA at SIGGRAPH to see how RTX AI is transforming the future of content creation and visual media experiences. The conference runs through Aug. 1.

Generative AI is transforming graphics and interactive experiences of all kinds. Make sense of what’s new and what’s next by subscribing to the AI Decoded newsletter.

Configure Amazon Q Business with AWS IAM Identity Center trusted identity propagation

Amazon Q Business is a fully managed, permission-aware generative artificial intelligence (AI)-powered assistant built with enterprise-grade security and privacy features. Amazon Q Business can be configured to answer questions, provide summaries, generate content, and securely complete tasks based on your enterprise data. The native data source connectors provided by Amazon Q Business can seamlessly integrate and index content from multiple repositories into a unified index. Amazon Q Business uses AWS IAM Identity Center to record the workforce users you assign access to and their attributes, such as group associations. IAM Identity Center is used by many AWS managed applications, such as Amazon Q. You connect your existing source of identities to Identity Center once and can then assign users to any of these AWS services. Because Identity Center serves as the common reference for your users and groups, these AWS applications can give your users a consistent experience as they navigate AWS. For example, it enables user subscription management across Amazon Q offerings and consolidates Amazon Q billing from across multiple AWS accounts. Additionally, Q Business conversation APIs employ a layer of privacy protection by using trusted identity propagation enabled by IAM Identity Center.

Amazon Q Business comes with rich API support to perform administrative tasks or to build an AI assistant with a customized user experience for your enterprise. With administrative APIs, you can automate creating Q Business applications, set up data source connectors, build custom document enrichment, and configure guardrails. With conversation APIs, you can chat and manage conversations with the Q Business AI assistant. Trusted identity propagation provides authorization based on user context, which enhances the privacy controls of Amazon Q Business.

In this blog post, you will learn what trusted identity propagation is and why to use it, how to automate configuration of a trusted token issuer in AWS IAM Identity Center with the provided AWS CloudFormation templates, and which APIs to invoke from your application to call Amazon Q Business identity-aware conversation APIs.

Why use trusted identity propagation?

Trusted identity propagation provides a mechanism that enables applications that authenticate outside of AWS to make requests on behalf of their users with the use of a trusted token issuer. Consider a client-server application that uses an external identity provider (IdP) to authenticate a user to provide access to an AWS resource that’s private to the user. For example, your web application might use Okta as an external IdP to authenticate a user to view their private conversations from Q Business. In this scenario, Q Business is unable to use the identity token generated by the third party provider to provide direct access to the user’s private data since there is no mechanism to trust the identity token issued by the third party.

To solve this, you can use IAM Identity Center to bring the user identity from your external IdP into an AWS Identity and Access Management (IAM) role session, which allows you to authorize requests based on the human, their attributes, and their group memberships, rather than setting up fine-grained permissions in an IAM policy. You exchange the token issued by the external IdP for a token generated by Identity Center, which refers to the corresponding Identity Center user. The web application can then use the new token to initiate a request to Q Business for the private chat conversation. Because that token refers to the corresponding user in Identity Center, Q Business can authorize the requested access to the private conversation based on the user or their group membership as represented in Identity Center.

Some of the benefits of using trusted identity propagation are:

  • Prevents user impersonation and protects against unauthorized access to user private data by spoofing user identity.
  • Facilitates auditability and fosters responsible use of resources, because Q Business automatically logs API invocations to AWS CloudTrail along with the user identifier.
  • Promotes software design principles rooted in user privacy.

Overview of trusted identity propagation deployment

The following figure is a model of a client-server architecture for trusted identity propagation.

To understand how your application can be integrated with IAM Identity Center for trusted identity propagation, consider the model client-server web application shown in the preceding figure. In this model architecture, the web browser represents the user interface to your application. This could be a web page rendered in a web browser, Slack, Microsoft Teams, or another application. The application server might be a web server running on Amazon Elastic Container Service (Amazon ECS), or a Slack or Microsoft Teams gateway implemented with AWS Lambda. Identity Center itself might be deployed in a delegated admin account (the Identity Account in the preceding figure), or in the same AWS account where the application server is deployed along with Amazon Q Business (the Application Account in the preceding figure). Finally, you have an OAuth 2.0 OpenID Connect (OIDC) external IdP, such as Okta, Ping One, Microsoft Entra ID, or Amazon Cognito, for authentication and authorization.

Deployment of trusted identity propagation involves five steps. As a best practice, we recommend that the security owner manages IAM Identity Center updates and the application owner manages application updates, providing clear separation of duties. The security owner is responsible for administering the Identity Center of an organization or account. The application owner is responsible for creating an application on AWS.

  1. The security owner adds the external OIDC IdP’s issuer URL to the IAM Identity Center instance’s trusted token issuer. It’s important that the issuer URL matches the iss claim attribute present in the JSON Web Token (JWT) identity token generated by the IdP after user authentication. This is configured once for a given issuer URL.
  2. The security owner creates a customer managed identity provider application in IAM Identity Center and explicitly configures the specific audience that the given trusted token issuer is authorized to exchange tokens for through Identity Center. Because the external IdP could be authenticating users for more than one application (or audience), explicitly specifying an audience helps prevent unauthorized applications from using the token exchange process. It’s important that the audience ID matches the aud claim attribute present in the JWT identity token generated by the IdP after user authentication.
  3. The security owner edits the application policy for the customer managed identity provider application created in the previous step to add or update the IAM execution role used by the application server or AWS Lambda. This helps prevent any unapproved users or applications from invoking the CreateTokenWithIAM API in Identity Center to initiate the token exchange.
  4. The application owner creates and adds an IAM policy to the application execution role to allow the application to invoke the CreateTokenWithIAM API on Identity Center to perform a token exchange and to create temporary credentials using AWS Security Token Service (AWS STS).
  5. The application owner creates an IAM role with a policy allowing access to the Q Business Conversation API for use with STS to create a temporary credential to invoke Q Business APIs.

You can use AWS CloudFormation templates, discussed later in this blog, to automate the preceding deployment steps. See the IAM Identity Center documentation for detailed step-by-step instructions on setting up trusted identity propagation. You can also use the AWS Command Line Interface (AWS CLI) setup process in Making authenticated Amazon Q Business API calls using IAM Identity Center.
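
While the CloudFormation route described later is recommended, step 1 can also be scripted with the AWS SDK. The following is a minimal sketch, assuming the boto3 sso-admin client and placeholder values for the Identity Center instance ARN, issuer name, and issuer URL; treat the parameter names as assumptions to verify against the current SSO Admin API reference rather than as the definitive setup.

# Illustrative sketch of step 1: registering an external OIDC IdP as a trusted
# token issuer. INSTANCE_ARN and ISSUER_URL are placeholders for your environment.
import boto3

INSTANCE_ARN = "arn:aws:sso:::instance/ssoins-EXAMPLE"
ISSUER_URL = "https://example.okta.com/oauth2/default"  # must match the iss claim

sso_admin = boto3.client("sso-admin")

response = sso_admin.create_trusted_token_issuer(
    InstanceArn=INSTANCE_ARN,
    Name="external-oidc-idp",
    TrustedTokenIssuerType="OIDC_JWT",
    TrustedTokenIssuerConfiguration={
        "OidcJwtConfiguration": {
            "IssuerUrl": ISSUER_URL,
            "ClaimAttributePath": "email",                 # token claim used to match the user
            "IdentityStoreAttributePath": "emails.value",  # Identity Center attribute it maps to
            "JwksRetrievalOption": "OPEN_ID_DISCOVERY",
        }
    },
)
print(response["TrustedTokenIssuerArn"])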

Important: Choosing to add a trusted token issuer is a security decision that requires careful consideration. Only choose trusted token issuers that you trust to perform the following tasks:

  • Authenticate the user who is specified in the token. Control the audience claim, a claim you configure as the user identifier.
  • Generate a token that IAM Identity Center can exchange for an Identity Center-created token. Control the Identity Center customer managed application policy to add only IAM users, roles, and execution roles that can perform the exchange.

Authorization flow

For a typical web application, the trusted identity propagation process will involve five steps as shown in the following flow diagram.

  1. Sign-in and obtain an authorization code from the IdP.
  2. Use the authorization code and client secret to retrieve the ID token from the IdP.
  3. Exchange the IdP generated JWT ID token with the IAM Identity Center token that includes the AWS STS context identity.
  4. Use the STS context identity to obtain temporary access credentials from AWS STS.
  5. Use temporary access credentials to access Q Business APIs.

An end-to-end implementation of the identity propagation is available for reference in <project_home>/webapp/main.py of AWS Samples – main.py.
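
To make the flow concrete, here is a minimal sketch of steps 3 through 5 using boto3. It assumes you have already obtained an OIDC ID token from your external IdP (steps 1 and 2); the Identity Center application ARN, assume-role ARN, and Q Business application ID are placeholders, and error handling is omitted. Refer to the sample repository above for a complete implementation.

# Illustrative sketch of steps 3-5: token exchange, identity-enhanced AssumeRole,
# and an identity-aware Q Business chat call. All ARNs and IDs are placeholders.
import base64
import json

import boto3

IDC_APP_ARN = "arn:aws:sso::111122223333:application/ssoins-EXAMPLE/apl-EXAMPLE"  # IDCApiAppArn
ASSUME_ROLE_ARN = "arn:aws:iam::111122223333:role/QBusinessSTSAssumeRole"         # QBusinessSTSAssumeRoleArn
QB_APP_ID = "your-q-business-application-id"


def _jwt_payload(token: str) -> dict:
    """Decode the (unverified) payload segment of a JWT."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def chat_as_user(oidc_id_token: str, question: str) -> str:
    # Step 3: exchange the IdP-issued ID token for an Identity Center token.
    idc_token = boto3.client("sso-oidc").create_token_with_iam(
        clientId=IDC_APP_ARN,
        grantType="urn:ietf:params:oauth:grant-type:jwt-bearer",
        assertion=oidc_id_token,
    )["idToken"]

    # Step 4: use the sts:identity_context claim to create an identity-enhanced role session.
    identity_context = _jwt_payload(idc_token)["sts:identity_context"]
    credentials = boto3.client("sts").assume_role(
        RoleArn=ASSUME_ROLE_ARN,
        RoleSessionName="qbusiness-identity-aware-session",
        ProvidedContexts=[{
            "ProviderArn": "arn:aws:iam::aws:contextProvider/IdentityCenter",
            "ContextAssertion": identity_context,
        }],
    )["Credentials"]

    # Step 5: call the Q Business conversation API with the temporary credentials.
    qbusiness = boto3.client(
        "qbusiness",
        aws_access_key_id=credentials["AccessKeyId"],
        aws_secret_access_key=credentials["SecretAccessKey"],
        aws_session_token=credentials["SessionToken"],
    )
    return qbusiness.chat_sync(applicationId=QB_APP_ID, userMessage=question)["systemMessage"]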

Sample JWT tokens

In the preceding authorization flow, one of the key steps is step 3, where the JWT ID token from the OAuth IdP is exchanged with IAM Identity Center for an AWS identity-aware JWT token. Key attributes of the respective JWT tokens are explored in the next section. An understanding of the tokens will help with troubleshooting authorization flow errors.

OpenID Connect JWT ID token

A sample JWT ID token generated by an OIDC OAuth IdP is shown in the following code sample. OIDC’s ID tokens take the form of a JWT, which is a JSON payload that’s signed with the private key of the issuer and can be parsed and verified by the application. In contrast to access tokens, ID tokens are intended to be understood by the OAuth client and include a handful of defined property names that provide information to the application. Important properties include aud, email, iss, and jti, which are used by IAM Identity Center to validate the token issuer, match the user directory, and issue a new Identity Center token. The following code sample shows a JWT identity token issued by an OIDC external IdP (such as Okta).

{
    'amr': ['pwd'],
    'at_hash': '3fMsKeFGoem************',
    'aud': '0oae4epmqqa************',
    'auth_time': 1715792363,
    'email': 'john_doe@******.com',
    'exp': 1715795964,
    'iat': 1715792364,
    'idp': '00oe36vc7kj7************',
    'iss': 'https://*******.okta.com/oauth2/default',
    'jti': 'ID.7l6jFX3KO9M7***********************',
    'name': 'John Doe',
    'nonce': 'SampleNonce',
    'preferred_username': 'john_doe@******.com',
    'sub': '00ue36ou4gCv************',
    'ver': 1
}

IAM Identity Center JWT token with identity context

A sample JWT token generated by CreateTokenWithIAM is shown in the following code sample. This token includes a property called sts:identity_context, which allows you to create an identity-enhanced IAM role session using the AWS STS AssumeRole API. The enhanced STS session allows the receiving AWS service to authorize the IAM Identity Center user to perform an action and log the user identity to CloudTrail for auditing.

{
    'act':{
        'sub': 'arn:aws:sso::*********:trustedTokenIssuer/ssoins-*********/74******-7***-7***-d***-fd9*********'
    },
    'aud': 'BTHY************-c9Ed3V************',
    'auth_time': '2024-05-15T16:59:27Z',
    'aws:application_arn': 'arn:aws:sso::************:application/ssoins-************/apl-************',
    'aws:credential_id': 'AAAAAGZE9_8Y******_Zj******',
    'aws:identity_store_arn': 'arn:aws:identitystore::************:identitystore/d-**********',
    'aws:identity_store_id': 'd-**********',
    'aws:instance_account': '************',
    'aws:instance_arn': 'arn:aws:sso:::instance/ssoins-************',
    'exp': 1715795967,
    'iat': 1715792367,
    'iss': 'https://identitycenter.amazonaws.com/ssoins-************',
    'sts:audit_context': 'AQoJb3Jp*********************************Bg==',
    'sts:identity_context': 'AQoJb3Jp********************************************gY=',
    'sub': '34******-d***-7***-b***-e2*********'
}

Automate configuration of a trusted token issuer using AWS CloudFormation

A broad range of possibilities exists to integrate your application with Amazon Q Business using IAM Identity Center and your enterprise IdP. For all integration projects, Identity Center needs to be configured to use a trusted token issuer. The sample CloudFormation templates discussed in this post focus on helping you automate the core trusted token issuer setup. If you’re new to Amazon Q Business and don’t have all the inputs required to deploy the CloudFormation templates, the prerequisites section includes links to resources that can help you get started. You can also follow a tutorial on Configuring sample web application with Okta included in the accompanying AWS Samples repository.

Note: The full source code of the solution using AWS CloudFormation templates and sample web application is available in AWS Samples Repository.

Prerequisites and considerations

Template for configuring AWS IAM Identity Center by a security owner

A security owner can use this CloudFormation template to automate configuration of the trusted token issuer in your IAM Identity Center. Deploy this stack in the AWS account where your Identity Center instance is located. This could be the same AWS account where your application is deployed (using a standalone or account instance of Identity Center), or a delegated admin account managed as part of AWS Organizations.

  1. To launch the stack, choose:
    Launch Stack

You can download the latest version of the CloudFormation template from AWS Samples – TTI CFN.

The following figure shows the stack input for the template.

  2. The stack creation requires four parameters (see the deployment sketch after this list):
  • AuthorizedAudiences: The authorized audience is an auto-generated UUID from the third-party IdP service, or a pseudo-ID configured by the administrator of the third-party IdP, that uniquely identifies the client (your application) for which the ID token is generated. The value must match the aud attribute value included in the JWT ID token generated by the third-party identity provider.
  • ClientAppExecutionArn: The Amazon Resource Name (ARN) of the IAM user, group, or execution role that’s used to run your application and that will invoke Identity Center for token exchange and the AWS STS service to generate temporary credentials. For example, this could be the execution role ARN of the Lambda function where your code runs.
  • IDCInstanceArn: The instance ARN of the IAM Identity Center instance used by your application.
  • TokenIssuerUrl: The URL of the trusted token issuer. The trusted token issuer is a third-party identity provider that will authenticate a user and generate an ID token for authorization purposes. The token URL must match the iss attribute value included in the JWT ID token generated by the third-party identity provider.
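
If you prefer to deploy the downloaded template programmatically instead of through the console, the following sketch shows one way to pass these four parameters with boto3. The stack name, template file name, and parameter values are illustrative placeholders, not part of the published solution.

# Illustrative deployment of the security-owner template with boto3.
# All names and values below are placeholders for your own environment.
import boto3

cloudformation = boto3.client("cloudformation")

with open("idc-trusted-token-issuer.yaml") as template_file:  # template downloaded from AWS Samples
    template_body = template_file.read()

cloudformation.create_stack(
    StackName="qbusiness-tti-setup",
    TemplateBody=template_body,
    Parameters=[
        {"ParameterKey": "AuthorizedAudiences", "ParameterValue": "0oae4epmqqaEXAMPLE"},
        {"ParameterKey": "ClientAppExecutionArn", "ParameterValue": "arn:aws:iam::111122223333:role/my-app-execution-role"},
        {"ParameterKey": "IDCInstanceArn", "ParameterValue": "arn:aws:sso:::instance/ssoins-EXAMPLE"},
        {"ParameterKey": "TokenIssuerUrl", "ParameterValue": "https://example.okta.com/oauth2/default"},
    ],
    Capabilities=["CAPABILITY_NAMED_IAM"],  # included in case the stack creates IAM resources
)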

The following figure shows the output of the CloudFormation stack to configure a trusted token issuer with IAM Identity Center.

The stack creation produces the following output:

  • IDCApiAppArn: The ARN for the IAM Identity Center custom application auth provider. You will use this application to call the Identity Center CreateTokenWithIAM API to exchange the third-party JWT ID token with the Identity Center token.

Validate the configuration

  1. From the AWS Management Console for the account where your IAM Identity Center instance is located, go to the AWS IAM Identity Center console to verify that the trusted token issuer is configured properly.
  2. From the left navigation pane, choose Applications and then choose the Customer Managed tab to see a list of applications, as shown in the following figure. The newly created customer managed IdP application will have the same name as the CloudFormation stack. Choose the application name to open the application configuration page.
  3. On the application configuration page, as shown in the following figure, verify the following:
    1. User and group assignments are set to Do not require assignments.
    2. Trusted applications for identity propagation lists Amazon Q and includes the application scope qbusiness:conversations:access.
    3. Authentication with the trusted token issuer is set to Configured.
  4. Next, to verify the trusted token issuer configuration, choose Actions at the top right of the page and select Edit configurations from the drop-down menu.
  5. At the bottom of the page, expand Authentication with trusted token issuer and verify that your issuer URL is selected by default and that the audience ID (Aud claim) is configured properly for the issuer URL, as shown in the following figure.
  6. Expand Application credentials to verify that your application execution IAM role is listed.

Depending on your IAM Identity Center instance type, you might not be able to access the customer managed applications page in the console. In such cases, you can use the AWS CLI or SDK to view the configuration. Here is a list of useful AWS CLI commands: list-applications, list-application-access-scopes, get-application-assignment-configuration, describe-trusted-token-issuer, and list-application-grants.
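
If you are working from the SDK instead, a minimal sketch of the equivalent checks with boto3 is shown below. The instance, application, and trusted token issuer ARNs are placeholders taken from your Identity Center setup and stack outputs, and the calls mirror the CLI commands listed above.

# Illustrative programmatic checks mirroring the CLI commands above.
# All ARNs are placeholders; print the full responses and inspect them as needed.
import boto3

sso_admin = boto3.client("sso-admin")

INSTANCE_ARN = "arn:aws:sso:::instance/ssoins-EXAMPLE"
APP_ARN = "arn:aws:sso::111122223333:application/ssoins-EXAMPLE/apl-EXAMPLE"  # IDCApiAppArn output
TTI_ARN = "arn:aws:sso::111122223333:trustedTokenIssuer/ssoins-EXAMPLE/tti-EXAMPLE"

print(sso_admin.list_applications(InstanceArn=INSTANCE_ARN))
print(sso_admin.list_application_access_scopes(ApplicationArn=APP_ARN))
print(sso_admin.get_application_assignment_configuration(ApplicationArn=APP_ARN))
print(sso_admin.describe_trusted_token_issuer(TrustedTokenIssuerArn=TTI_ARN))
print(sso_admin.list_application_grants(ApplicationArn=APP_ARN))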

Template for configuring your application by the application owner

To propagate user identities, your application will need to:

  • Invoke the IAM Identity Center instance to exchange a third-party JWT ID token and obtain an Identity Center ID token
  • Invoke AWS STS to generate a temporary credential with an IAM assumed role.

The application owner can use a CloudFormation template to generate the required IAM policy, which can be attached to your application execution role, as well as the assumed role with the Q Business chat API privileges required for use with AWS STS to generate temporary credentials.

Remember to attach the generated add-on policy to your application’s IAM execution role to allow the application to invoke the Identity Center and AWS STS APIs.

  1. To launch the stack, choose:
    Launch Stack

You can download the latest version of the CloudFormation template from AWS Samples – App Roles CFN.

The following figure shows the CloudFormation stack configuration to install IAM roles and policies required for the application to propagate identities.

  2. The stack creation takes four parameters, as shown in the preceding figure:
  • ClientAppExecutionArn: The ARN of the IAM user, group, or execution role that is used to run your application and will invoke IAM Identity Center for token exchange and AWS STS for generating temporary credentials. For example, this could be the execution role ARN of the Lambda function where your code runs.
  • IDCApiAppArn: ARN for the IAM Identity Center custom application auth provider. This will be created as part of the trusted token issuer configuration.
  • KMSKeyId: [Optional] The AWS Key Management Service (AWS KMS) key ID, if the Q Business application is encrypted with a customer managed encryption key.
  • QBApplicationID: Q Business application ID, which your application will use to invoke chat APIs. The STS assume role will be restricted to this application ID.

The following figure shows the output of the CloudFormation stack to install IAM roles and policies required for the application to propagate identities.

The stack creation produces the following outputs:

  • ClientAppExecutionAddOnPolicyArn: This is a customer managed IAM policy created with the required permissions for your application to invoke the IAM Identity Center CreateTokenWithIAM API and call the STS AssumeRole API to generate temporary credentials to call Q Business chat APIs. You can include this policy in your application IAM execution role to allow access for the APIs.
  • QBusinessSTSAssumeRoleArn: This IAM role will include the necessary permissions to call Q Business chat APIs, for use with the STS AssumeRole API call.

Validate the configuration

  1. From the AWS account where your application is deployed, open the AWS IAM console and verify that the IAM role for STS AssumeRole and the customer managed IAM policy for the application execution role were created.
    • To verify the IAM role for STS AssumeRole, obtain the role name from the QBusinessSTSAssumeRoleArn stack output value, choose the Roles link in the left panel of the IAM console, and enter the role name in the search bar, as shown in the following figure.
  2. Choose the role name to open the role and expand the inline policy to review the permissions, as shown in the following figure.
  3. To verify that the IAM policy for the application execution role add-on was created, obtain the IAM policy name from the ClientAppExecutionAddOnPolicyArn stack output value, go to Policies in the IAM console, and search for the policy, as shown in the following figure.
  4. Choose the link to the policy name to open the policy and review the permissions, as shown in the following figure.

Update the application for invoking the Q Business API with identity propagation

Most web applications using OAuth 2.0 with an IdP will have implemented a sign-in mechanism and will invoke the IdP’s token endpoint to retrieve a JWT ID token. However, before invoking Amazon Q Business APIs that require identity propagation, your application needs to be updated to include calls to CreateTokenWithIAM and AssumeRole to facilitate trusted identity propagation.

The CreateTokenWithIAM API exchanges the JWT ID token received from the OIDC IdP for an IAM Identity Center-generated JWT token. The newly generated token is then passed to the AssumeRole API to create identity-aware temporary security credentials that you can use to access AWS resources.

Note: Remember to add permissions to your IAM role and user policy to allow invoking these APIs. Alternatively, you can attach the sample policy referenced by ClientAppExecutionAddOnPolicyArn that was created by the CloudFormation template for configuring your application.

Sample access helper methods, including get_oidc_id_token, get_idc_sts_id_context, and get_sts_credential, are available in <project_home>/src/qbapi_tools/access_helpers.py (AWS Samples – access_helpers.py). An end-to-end sample implementation of the complete sequence of steps, as depicted in the end-to-end authentication sequence, is provided in <project_home>/webapp/main.py (AWS Samples – main.py).

Restrictions and limitations

The following are some common limitations and restrictions that you may encounter while configuring trusted identity propagation, along with recommendations on how to mitigate them.

Group membership propagation

Enterprises typically manage group membership in their external IdP. However, when using trusted token propagation, the web identity token generated by the external IdP is exchanged with an ID token generated by IAM Identity Center. Thus, when invoking the Q Business API from an STS session enhanced with Identity Center identity context, only the group membership information available for the user in Identity Center is passed to the Q Business API, not the group membership from the external IdP. To mitigate this issue, it’s recommended that all relevant users and groups are synchronized to Identity Center from the external IdP using System for Cross-domain Identity Management (SCIM). For more information, see automatic provisioning (synchronization) of users and groups.

Caching credentials to prevent invalid grant types

You can use a web identity token only once with the CreateTokenWithIAM API. This prevents token replay attacks, in which an attacker intercepts a JWT and reuses it multiple times to bypass authentication and authorization controls. Because it isn’t practical to generate a new ID token for every Q Business API call, it’s recommended that the temporary credentials generated for a Q Business API session using AWS STS AssumeRole are cached and reused for subsequent API calls, as sketched below.
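
A simple way to apply this recommendation is to cache the temporary credentials per user and refresh them only when they are close to expiry. The sketch below is illustrative; the cache structure and the refresh_credentials callable are assumptions, with the callable standing in for the full ID-token exchange and AssumeRole flow shown earlier.

# Illustrative per-user cache for the temporary credentials returned by AssumeRole.
# refresh_credentials is a placeholder for your own exchange-and-assume-role logic.
from datetime import datetime, timedelta, timezone

_credential_cache: dict = {}


def get_cached_credentials(user_id: str, refresh_credentials):
    """Return cached STS credentials for a user, refreshing them only when
    missing or within five minutes of expiry."""
    credentials = _credential_cache.get(user_id)
    if credentials is None or credentials["Expiration"] - datetime.now(timezone.utc) < timedelta(minutes=5):
        # refresh_credentials performs the full ID-token exchange and AssumeRole call,
        # which consumes a fresh web identity token.
        credentials = refresh_credentials(user_id)
        _credential_cache[user_id] = credentials
    return credentials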

Clean up

To avoid incurring additional charges, make sure you delete any resources created in this post.

  1. Follow the instructions in Deleting a stack on the AWS CloudFormation console to delete any CloudFormation stacks created using templates provided in this post.
  2. If you enabled an IAM Identity Center instance, follow the instructions to delete your IAM Identity Center instance.
  3. Ensure you unregister or delete any IdP services such as Okta, Entra ID, Ping Identity, or Amazon Cognito that you have created for this post.
  4. Finally, delete any sample code repositories you have cloned or downloaded, and any associated resources deployed as part of setting up the environment for running the samples in the code repository.

Conclusion

Trusted identity propagation is an important mechanism for securely integrating Amazon Q Business APIs into enterprise applications that use external IdPs. By implementing trusted identity propagation with AWS IAM Identity Center, organizations can confidently build AI-powered applications and tools using Amazon Q Business APIs, knowing that user identities are properly verified and protected throughout the process. This approach allows enterprises to harness the full potential of generative AI while maintaining the highest standards of security and privacy. To get started with Amazon Q Business, explore the Getting started guide. To learn more about how trusted token propagation works, see How to develop a user-facing data application with IAM Identity Center and S3 Access Grants.


About the Author

Rajesh Kumar Ravi is a Senior Solutions Architect at Amazon Web Services specializing in building generative AI solutions with Amazon Q Business, Amazon Bedrock, and Amazon Kendra. He is an accomplished technology leader with experience in building innovative AI products, nurturing the builder community, and contributing to the development of new ideas. He enjoys walking and loves to go on short hiking trips outside of work.
