Use the ApplyGuardrail API with long-context inputs and streaming outputs in Amazon Bedrock

As generative artificial intelligence (AI) applications become more prevalent, maintaining responsible AI principles becomes essential. Without proper safeguards, large language models (LLMs) can potentially generate harmful, biased, or inappropriate content, posing risks to individuals and organizations. Applying guardrails helps mitigate these risks by enforcing policies and guidelines that align with ethical principles and legal requirements. Guardrails for Amazon Bedrock evaluates user inputs and model responses based on use case-specific policies, and provides an additional layer of safeguards regardless of the underlying foundation model (FM). Guardrails can be applied across all LLMs on Amazon Bedrock, including fine-tuned models and even generative AI applications outside of Amazon Bedrock. You can create multiple guardrails, each configured with a different combination of controls, and use these guardrails across different applications and use cases. You can configure guardrails in multiple ways, including to deny topics, filter harmful content, remove sensitive information, and detect contextual grounding.

The new ApplyGuardrail API enables you to assess any text using your preconfigured guardrails in Amazon Bedrock, without invoking the FMs. In this post, we demonstrate how to use the ApplyGuardrail API with long-context inputs and streaming outputs.

ApplyGuardrail API overview

The ApplyGuardrail API offers several key features:

  • Ease of use – You can integrate the API anywhere in your application flow to validate data before processing or serving results to users. For example, in a Retrieval Augmented Generation (RAG) application, you can now evaluate the user input prior to performing the retrieval instead of waiting until the final response generation.
  • Decoupled from FMs – The API is decoupled from FMs, allowing you to use guardrails without invoking FMs from Amazon Bedrock. For example, you can now use the API with models hosted on Amazon SageMaker. Alternatively, you could use it with self-hosted models or with models from third-party model providers. All you need to do is take the input or output and request an assessment using the API.

You can use the assessment results from the ApplyGuardrail API to design the experience on your generative AI application, making sure it adheres to your defined policies and guidelines.

The ApplyGuardrail API request lets you pass all the content that should be guarded using your defined guardrails. Set the source field to INPUT when the content to be evaluated is from a user, typically the LLM prompt. Set the source to OUTPUT when the model output guardrails should be enforced, typically on an LLM response. An example request looks like the following code:

{
    "source": "INPUT" | "OUTPUT",
    "content": [{
        "text": {
            "text": "This is a sample text snippet..."
        }
    }]
}

For more information about the API structure, refer to Guardrails for Amazon Bedrock.
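As a minimal sketch of the request shape described above, the following helper assembles the source and content fields as a plain Python dictionary. The function name build_request_body and the VALID_SOURCES constant are hypothetical, introduced only for illustration; they are not part of the AWS SDK.

```python
from typing import Dict, List

# The two source values described in this post
VALID_SOURCES = {"INPUT", "OUTPUT"}

def build_request_body(source: str, texts: List[str]) -> Dict:
    """Assemble the source and content fields for an ApplyGuardrail request.

    Hypothetical helper: it only builds a dictionary and does not call AWS.
    """
    if source not in VALID_SOURCES:
        raise ValueError(f"source must be one of {sorted(VALID_SOURCES)}, got {source!r}")
    # Each text item is wrapped as {"text": {"text": ...}}, matching the example request
    return {
        "source": source,
        "content": [{"text": {"text": text}} for text in texts],
    }

body = build_request_body("INPUT", ["This is a sample text snippet..."])
```

Validating the source value before sending the request surfaces a typo such as "input" locally rather than as an API error.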

Streaming output

LLMs can generate text in a streaming manner, where the output is produced token by token or word by word, rather than generating the entire output at once. This streaming output capability is particularly useful in scenarios where real-time interaction or continuous generation is required, such as conversational AI assistants or live captioning. Incrementally displaying the output allows for a more natural and responsive user experience. Although it’s advantageous in terms of responsiveness, streaming output introduces challenges when it comes to applying guardrails in real time as the output is generated. Unlike the input scenario, where the entire text is available upfront, the output is generated incrementally, making it difficult to assess the complete context and potential violations.

One of the main challenges is the need to evaluate the output as it’s being generated, without waiting for the entire output to be complete. This requires a mechanism to continuously monitor the streaming output and apply guardrails in real time, while also considering the context and coherence of the generated text. Furthermore, the decision to halt or continue the generation process based on the guardrail assessment needs to be made in real time, which can impact the responsiveness and user experience of the application.

Solution overview: Use guardrails on streaming output

To address the challenges of applying guardrails on streaming output from LLMs, a strategy that combines batching and real-time assessment is required. This strategy involves collecting the streaming output into smaller batches or chunks, evaluating each batch using the ApplyGuardrail API, and then taking appropriate actions based on the assessment results.

The first step in this strategy is to group the streaming output chunks into batches that are close to a text unit, which is approximately 1,000 characters. If a batch is smaller, such as 600 characters, you’re still charged for an entire text unit (1,000 characters). For cost-effective use of the API, it’s recommended that batch sizes be multiples of a text unit, such as 1,000 characters, 2,000 characters, and so on. This way, you minimize the risk of incurring unnecessary costs.
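The cost reasoning above can be sketched as simple arithmetic. This sketch assumes that each request is rounded up to a whole text unit, which matches the 600-character example in this post; check the current pricing documentation before relying on it. The function names are hypothetical.

```python
import math

TEXT_UNIT = 1000  # characters per text unit, as stated in this post

def billed_text_units(num_chars: int, text_unit: int = TEXT_UNIT) -> int:
    """Estimate billed text units for one request, assuming round-up to a whole unit."""
    if num_chars <= 0:
        return 0
    return math.ceil(num_chars / text_unit)

def unused_characters(num_chars: int, text_unit: int = TEXT_UNIT) -> int:
    """Characters paid for but not used in a request of num_chars characters."""
    return billed_text_units(num_chars, text_unit) * text_unit - num_chars

print(billed_text_units(600), unused_characters(600))    # 1 400
print(billed_text_units(2000), unused_characters(2000))  # 2 0
```

Under this assumption, a 600-character batch leaves 400 paid characters unused, while a 2,000-character batch leaves none.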

By batching the output into smaller batches, you can invoke the ApplyGuardrail API more frequently, allowing for real-time assessment and decision-making. The batching process should be designed to maintain the context and coherence of the generated text. This can be achieved by making sure the batches don’t split words or sentences, and by carrying over any necessary context from the previous batch. Though the chunking varies between use cases, for the sake of simplicity, this post showcases simple character-level chunking, but it’s recommended to explore options such as semantic chunking or hierarchical chunking while still adhering to the guidelines mentioned in this post.
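One way to keep batches from splitting words or sentences is sketched below. It is not the character-level chunking used in this post: it is an illustrative alternative that cuts each batch at the last sentence end inside the size limit, falls back to the last space, and only hard-cuts when neither exists. The function name batch_on_boundaries and the regular expression used to detect sentence ends are assumptions.

```python
import re
from typing import Iterable, Iterator

TEXT_UNIT = 1000  # characters per text unit, as stated in this post

def batch_on_boundaries(chunks: Iterable[str], max_chars: int = TEXT_UNIT) -> Iterator[str]:
    """Group streamed chunks into batches of at most max_chars characters."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while len(buffer) > max_chars:
            window = buffer[:max_chars]
            # Prefer the last sentence end (punctuation followed by whitespace)
            sentence_ends = [m.end() for m in re.finditer(r"[.!?]\s", window)]
            if sentence_ends:
                cut = sentence_ends[-1]
            else:
                # Fall back to the last space, then to a hard cut
                space = window.rfind(" ")
                cut = space + 1 if space > 0 else max_chars
            yield buffer[:cut]
            buffer = buffer[cut:]
    if buffer:
        # Flush whatever remains when the stream ends
        yield buffer

stream = ["Hello world. ", "This is a test. ", "Another sentence here."]
batches = list(batch_on_boundaries(stream, max_chars=20))
```

Because every character of the stream ends up in exactly one batch, concatenating the batches reproduces the original text.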

After the streaming output has been batched into smaller chunks, each chunk can be passed to the API for evaluation. The API will assess the content of each chunk against the defined policies and guidelines, identifying any potential violations or sensitive information.

The assessment results from the API can then be used to determine the appropriate action for the current batch. If a severe violation is detected, the generation process should be halted and a preset message or response displayed to the user instead. In other cases, no severe violation is detected but the guardrail still intervenes without blocking, for example when sensitiveInformationPolicyConfig is set to anonymize detected entities instead of blocking them. If such an intervention occurs, the output is masked or modified accordingly before being displayed to the user. For latency-sensitive applications, you can also consider creating multiple buffers and multiple guardrails, each with different policies, and then processing them with the ApplyGuardrail API in parallel. This way, assessments from multiple guardrails and multiple batches arrive concurrently instead of one guardrail at a time, though this technique hasn’t been implemented in this example.
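The parallel approach mentioned above is not implemented in this post. The following is only a rough sketch of how it might look with a thread pool. The assess callable is a hypothetical stand-in for a function that calls the ApplyGuardrail API and returns True when the text is blocked; the guardrail IDs are made up, and fake_assess exists purely so the sketch runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

def assess_with_guardrails(
    text: str,
    guardrails: List[Tuple[str, str]],
    assess: Callable[[str, str, str], bool],
) -> Dict[str, bool]:
    """Run one assessment per (guardrail_id, version) pair concurrently.

    Returns a mapping of guardrail_id to True if that guardrail blocked the text.
    """
    if not guardrails:
        return {}
    with ThreadPoolExecutor(max_workers=len(guardrails)) as pool:
        futures = {
            gid: pool.submit(assess, text, gid, version)
            for gid, version in guardrails
        }
        # result() waits for each call and re-raises any exception from the worker
        return {gid: future.result() for gid, future in futures.items()}

def fake_assess(text: str, guardrail_id: str, version: str) -> bool:
    """Offline stand-in: only the made-up topic guardrail blocks text about stocks."""
    return guardrail_id == "gr-topics" and "stocks" in text

results = assess_with_guardrails(
    "buy stocks now", [("gr-topics", "1"), ("gr-pii", "1")], fake_assess
)
is_blocked = any(results.values())
```

Threads suit this case because each call spends most of its time waiting on the network. Keep the text unit quota in mind when fanning out requests.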

Example use case: Apply guardrails to streaming output

In this section, we provide an example of how such a strategy could be implemented. Let’s begin with creating a guardrail. You can use the following code sample to create a guardrail in Amazon Bedrock:

import boto3
REGION_NAME = "us-east-1"

bedrock_client = boto3.client("bedrock", region_name=REGION_NAME)
bedrock_runtime = boto3.client("bedrock-runtime", region_name=REGION_NAME)

response = bedrock_client.create_guardrail(
    name="<name>",
    description="<description>",
    ...
)
# alternatively provide the id and version for your own guardrail
guardrail_id = response['guardrailId'] 
guardrail_version = response['version']

Proper assessment of the policies must be conducted to verify if the input should be later sent to an LLM or whether the output generated by the LLM should be displayed to the user. In the following code, we examine the assessments, which are part of the response from the ApplyGuardrail API, for potential severe violation leading to BLOCKED intervention by the guardrail:

from typing import List, Dict
def check_severe_violations(violations: List[Dict]) -> int:
    """
    When guardrail intervenes either the action on the request is BLOCKED or NONE.
    This method checks the number of the violations leading to blocking the request.

    Args:
        violations (List[Dict]): A list of violation dictionaries, where each dictionary has an 'action' key.

    Returns:
        int: The number of severe violations (where the 'action' is 'BLOCKED').
    """
    severe_violations = [violation['action']=='BLOCKED' for violation in violations]
    return sum(severe_violations)

def is_policy_assessement_blocked(assessments: List[Dict]) -> bool:
    """
    While creating the guardrail you could specify multiple types of policies.
    At the time of assessment all the policies should be checked for potential violations
    If there is even 1 violation that blocks the request, the entire request is blocked
    This method checks if the policy assessment is blocked based on the given assessments.

    Args:
        assessments (list[dict]): A list of assessment dictionaries, where each dictionary may contain 'topicPolicy', 'wordPolicy', 'sensitiveInformationPolicy', and 'contentPolicy' keys.

    Returns:
        bool: True if the policy assessment is blocked, False otherwise.
    """
    blocked = []
    for assessment in assessments:
        if 'topicPolicy' in assessment:
            blocked.append(check_severe_violations(assessment['topicPolicy']['topics']))
        if 'wordPolicy' in assessment:
            if 'customWords' in assessment['wordPolicy']:
                blocked.append(check_severe_violations(assessment['wordPolicy']['customWords']))
            if 'managedWordLists' in assessment['wordPolicy']:
                blocked.append(check_severe_violations(assessment['wordPolicy']['managedWordLists']))
        if 'sensitiveInformationPolicy' in assessment:
            if 'piiEntities' in assessment['sensitiveInformationPolicy']:
                blocked.append(check_severe_violations(assessment['sensitiveInformationPolicy']['piiEntities']))
            if 'regexes' in assessment['sensitiveInformationPolicy']:
                blocked.append(check_severe_violations(assessment['sensitiveInformationPolicy']['regexes']))
        if 'contentPolicy' in assessment:
            blocked.append(check_severe_violations(assessment['contentPolicy']['filters']))
    severe_violation_count = sum(blocked)
    print(f'::Guardrail:: {severe_violation_count} severe violations detected')
    return severe_violation_count>0
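To illustrate what the check above looks for, the following self-contained sketch walks a hand-written sample assessment and counts entries whose action is BLOCKED. The sample data is invented for illustration and is not real API output. The recursive count_blocked helper is a hypothetical, more generic alternative to the explicit per-policy checks, not a replacement for them.

```python
from typing import Any

# Hand-written sample in the shape of the assessments list; not real API output
sample_assessments = [{
    "topicPolicy": {
        "topics": [{"name": "Investment advice", "type": "DENY", "action": "BLOCKED"}]
    },
    "sensitiveInformationPolicy": {
        "piiEntities": [{"type": "NAME", "match": "Jane Doe", "action": "ANONYMIZED"}]
    },
}]

def count_blocked(node: Any) -> int:
    """Recursively count dictionaries whose 'action' value is 'BLOCKED'."""
    if isinstance(node, dict):
        own = 1 if node.get("action") == "BLOCKED" else 0
        return own + sum(count_blocked(value) for value in node.values())
    if isinstance(node, list):
        return sum(count_blocked(item) for item in node)
    return 0

blocked_count = count_blocked(sample_assessments)
print(blocked_count > 0)  # True: the denied topic blocks, the anonymized name does not
```

The anonymized PII entry does not count as a severe violation, which is why an intervention on its own does not imply a blocked request.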

We can then define how to apply the guardrail. If the response from the API has action == 'GUARDRAIL_INTERVENED', the guardrail has detected a potential violation. We then need to check whether the violation was severe enough to block the request, or whether the request passes through with either the same text as the input or an alternate text in which modifications are made according to the defined policies:

def apply_guardrail(text, source, guardrail_id, guardrail_version):
    response = bedrock_runtime.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion=guardrail_version, 
        source=source,
        content=[{"text": {"text": text}}]
    )
    if response['action'] == 'GUARDRAIL_INTERVENED':
        is_blocked = is_policy_assessement_blocked(response['assessments'])
        # The ApplyGuardrail response returns the (possibly modified) text items under the 'outputs' key
        alternate_text = ' '.join([output['text'] for output in response['outputs']])
        return is_blocked, alternate_text, response
    else:
        # Return the default response in case of no guardrail intervention
        return False, text, response

Let’s now apply our strategy for streaming output from an LLM. We maintain a buffer_text, which accumulates a batch of chunks received from the stream. As soon as len(buffer_text + new_text) > text_unit, meaning the batch is close to a text unit (1,000 characters), it’s ready to be sent to the ApplyGuardrail API. With this mechanism, we avoid the unnecessary cost of invoking the API on smaller chunks and make sure that enough context is available inside each batch for the guardrail to make meaningful assessments. Additionally, when the LLM completes the generation, the final batch must also be tested for potential violations. If at any point the API detects severe violations, further consumption of the stream is halted and the user is shown the preset message that was configured when the guardrail was created.

In the following example, we ask the LLM to generate three names and tell us what a bank is. This generation leads to GUARDRAIL_INTERVENED but doesn’t block the generation; instead, the guardrail anonymizes the text (masking the names) and the generation continues.

input_message = "List 3 names of prominent CEOs and later tell me what is a bank and what are the benefits of opening a savings account?"

model_id = "anthropic.claude-3-haiku-20240307-v1:0"
text_unit = 1000 # characters

response = bedrock_runtime.converse_stream(
    modelId=model_id,
    messages=[{
        "role": "user",
        "content": [{"text": input_message}]
    }],
    system=[{"text" : "You are an assistant that helps with tasks from users. Be as elaborate as possible"}],
)

stream = response.get('stream')
buffer_text = ""
if stream:
    for event in stream:
        if 'contentBlockDelta' in event:
            new_text = event['contentBlockDelta']['delta']['text']
            if len(buffer_text + new_text) > text_unit:
                is_blocked, alt_text, guardrail_response = apply_guardrail(buffer_text, "OUTPUT", guardrail_id, guardrail_version)
                # print(alt_text, end="")
                if is_blocked:
                    break
                buffer_text = new_text
            else: 
                buffer_text += new_text

        if 'messageStop' in event:
            # print(f"\nStop reason: {event['messageStop']['stopReason']}")
            is_blocked, alt_text, guardrail_response = apply_guardrail(buffer_text, "OUTPUT", guardrail_id, guardrail_version)
            # print(alt_text)

After running the preceding code, we receive an example output with masked names:

Certainly! Here are three names of prominent CEOs:

1. {NAME} - CEO of Apple Inc.
2. {NAME} - CEO of Microsoft Corporation
3. {NAME} - CEO of Amazon

Now, let's discuss what a bank is and the benefits of opening a savings account.

A bank is a financial institution that accepts deposits, provides loans, and offers various other financial services to its customers. Banks play a crucial role in the economy by facilitating the flow of money and enabling financial transactions.
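The buffer-and-halt flow from the streaming example can also be exercised offline, without calling Amazon Bedrock. The following is a minimal sketch under stated assumptions: guarded_stream and fake_check are hypothetical names, fake_check stands in for a call to the ApplyGuardrail API by masking one name and blocking one keyword, and the blocked message is a placeholder for the message configured on a real guardrail.

```python
from typing import Callable, Iterable, Iterator, Tuple

BLOCKED_MESSAGE = "Sorry, I can't help with that."  # placeholder text

def guarded_stream(
    chunks: Iterable[str],
    check: Callable[[str], Tuple[bool, str]],
    text_unit: int = 1000,
) -> Iterator[str]:
    """Buffer streamed chunks, check each full batch, and stop on a block."""
    buffer = ""
    for new_text in chunks:
        if buffer and len(buffer) + len(new_text) > text_unit:
            is_blocked, alt_text = check(buffer)
            if is_blocked:
                yield BLOCKED_MESSAGE
                return  # stop consuming the stream
            yield alt_text  # original or masked text
            buffer = new_text
        else:
            buffer += new_text
    if buffer:
        # The final partial batch must be checked too
        is_blocked, alt_text = check(buffer)
        yield BLOCKED_MESSAGE if is_blocked else alt_text

def fake_check(text: str) -> Tuple[bool, str]:
    """Offline stand-in: masks 'Alice' and blocks text containing 'forbidden'."""
    return "forbidden" in text, text.replace("Alice", "{NAME}")

stream = ["Alice is ", "a CEO. ", "forbidden topic", " more text"]
shown = list(guarded_stream(stream, fake_check, text_unit=16))
```

In this run, the first batch is shown with the name masked, and the second batch triggers the block, so the remaining chunk is never displayed.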

Long-context inputs

RAG is a technique that enhances LLMs by incorporating external knowledge sources. It allows LLMs to reference authoritative knowledge bases before generating responses, producing output tailored to specific contexts while providing relevance, accuracy, and efficiency. The input to the LLM in a RAG scenario can be quite long, because it includes the user’s query concatenated with the retrieved information from the knowledge base. This long-context input poses challenges when applying guardrails, because the input may exceed the character limits imposed by the ApplyGuardrail API. To learn more about the quotas applied to Guardrails for Amazon Bedrock, refer to Guardrails quotas.

The previous section covered a strategy for mitigating risk in model responses. For inputs, the risk can lie in the query alone or in the query combined with the context retrieved for it.

The retrieved information from the knowledge base may contain sensitive or potentially harmful content, which needs to be identified and handled appropriately, for example masking sensitive information, before being passed to the LLM for generation. Therefore, guardrails must be applied to the entire input to make sure it adheres to the defined policies and constraints.

Solution overview: Use guardrails on long-context inputs

The ApplyGuardrail API has a default limit of 25 text units (approximately 25,000 characters) per second. If the input exceeds this limit, it needs to be chunked and processed sequentially to avoid throttling. Therefore, the strategy becomes relatively straightforward: if the length of the input text is at most 25 text units (25,000 characters), it can be evaluated in a single request; otherwise, it needs to be broken down into smaller pieces. The chunk size can vary depending on application behavior and the type of context in the application; you can start with 12 text units and iterate to find the most suitable chunk size. This way, we make good use of the default limit while keeping most of the context intact in a single request. Even if the guardrail action is GUARDRAIL_INTERVENED, it doesn’t mean the input is BLOCKED. The input might instead be processed with sensitive information masked; in this case, the input text must be recompiled from the processed responses returned by the guardrail.

from textwrap import wrap

text_unit = 1000 # characters
limit_text_unit = 25
max_text_units_in_chunk = 12
def apply_guardrail_with_chunking(text, guardrail_id, guardrail_version="DRAFT"):
    text_length = len(text)
    if text_length <= limit_text_unit * text_unit:
        return apply_guardrail(text, "INPUT", guardrail_id, guardrail_version)
    else:
        # If the text length is greater than the default text unit limits then it's better to chunk the text to avoid throttling.
        # textwrap.wrap breaks on whitespace and drops it at chunk edges, so chunks are rejoined with a space below.
        filtered_chunks = []
        for i, chunk in enumerate(wrap(text, max_text_units_in_chunk * text_unit)):
            print(f'::Guardrail::Applying guardrails at chunk {i+1}')
            is_blocked, alternate_text, response = apply_guardrail(chunk, "INPUT", guardrail_id, guardrail_version)
            if is_blocked:
                # A blocked chunk blocks the whole input; return only the guardrail's message
                return is_blocked, alternate_text, response
            # It could be the case that guardrails intervened and anonymized PII in the input text,
            # we can then take the output from guardrails to create filtered text response.
            filtered_chunks.append(alternate_text)
        return is_blocked, ' '.join(filtered_chunks), response

Run the full notebook to test this strategy with long-context input.
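The chunking step can be tried on its own, without any AWS calls. Note that textwrap.wrap normalizes whitespace by default: newlines become spaces, and whitespace at chunk edges is dropped. If the retrieved context must be recompiled exactly, a splitter that keeps every character may be preferable. The following sketch, with the hypothetical name split_preserving_text, is one such alternative and is not the method used in the notebook.

```python
from typing import List

def split_preserving_text(text: str, max_chars: int) -> List[str]:
    """Split text into chunks of at most max_chars, keeping every character.

    Cuts after the last space inside each window so words stay whole, and
    falls back to a hard cut when a window contains no space.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Look for the last space strictly inside the window
            space = text.rfind(" ", start + 1, end)
            if space != -1:
                end = space + 1
        chunks.append(text[start:end])
        start = end
    return chunks

text = "word " * 10  # 50 characters
chunks = split_preserving_text(text, max_chars=12)
print(len(chunks), "".join(chunks) == text)  # 5 True
```

Because the chunks concatenate back to the original text, the filtered pieces returned for each chunk can be joined without inserting separators.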

Best practices and considerations

When applying guardrails, it’s essential to follow best practices to maintain efficient and effective content moderation:

  • Optimize chunking strategy – Carefully consider the chunking strategy. The chunk size should balance the trade-off between minimizing the number of API calls and making sure the context isn’t lost due to overly small chunks. Similarly, the chunking strategy should take into account the context split; a critical piece of text could span two (or more) chunks if not carefully divided.
  • Asynchronous processing – Implement asynchronous processing for RAG contexts. This can help decouple the guardrail application process from the main application flow, improving responsiveness and overall performance. For frequently retrieved context from vector databases, ApplyGuardrail could be applied one time and the results stored in metadata. This avoids redundant API calls for the same content, which can significantly improve performance and reduce costs.
  • Develop comprehensive test suites – Create a comprehensive test suite that covers a wide range of scenarios, including edge cases and corner cases, to validate the effectiveness of your guardrail implementation.
  • Implement fallback mechanisms – There could be scenarios where the guardrail you created doesn’t cover all possible vulnerabilities and is unable to catch edge cases. For such scenarios, it’s wise to have a fallback mechanism. One option is to bring a human into the loop, or to use an LLM as a judge to evaluate both the input and output.

In addition to the aforementioned considerations, it’s also good practice to regularly audit, refine, and adapt your guardrail implementation, and to implement logging and monitoring mechanisms to capture and analyze the performance and effectiveness of your guardrails.
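The idea of assessing frequently retrieved context one time and storing the result can be sketched as a small in-memory cache keyed by a hash of the content. Everything here is an assumption for illustration: the class name GuardrailResultCache is hypothetical, the assess callable stands in for a function that calls the ApplyGuardrail API, and a production system would more likely store results in the vector database metadata, as suggested above.

```python
import hashlib
from typing import Callable, Dict, Tuple

class GuardrailResultCache:
    """Cache (is_blocked, filtered_text) results per guardrail and content hash."""

    def __init__(self, assess: Callable[[str], Tuple[bool, str]], guardrail_key: str):
        self._assess = assess
        # Include the guardrail ID and version so a changed policy is not served stale results
        self._guardrail_key = guardrail_key
        self._store: Dict[str, Tuple[bool, str]] = {}
        self.api_calls = 0

    def _key(self, text: str) -> str:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return f"{self._guardrail_key}:{digest}"

    def check(self, text: str) -> Tuple[bool, str]:
        key = self._key(text)
        if key not in self._store:
            # Only unseen content triggers an assessment
            self._store[key] = self._assess(text)
            self.api_calls += 1
        return self._store[key]

# Offline stand-in for an ApplyGuardrail call: never blocks, masks one name
cache = GuardrailResultCache(lambda t: (False, t.replace("Alice", "{NAME}")), "my-guardrail:1")
first = cache.check("Alice opened an account.")
second = cache.check("Alice opened an account.")
print(cache.api_calls)  # 1
```

Keying on the guardrail version as well as the content hash means that publishing a new guardrail version naturally invalidates earlier results.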

Clean up

The only resource we created in this example is a guardrail. To delete the guardrail, complete the following steps:

  1. On the Amazon Bedrock console, under Safeguards in the navigation pane, choose Guardrails.
  2. Select the guardrail you created and choose Delete.

Alternatively, you can use the SDK:

bedrock_client.delete_guardrail(guardrailIdentifier = "<your_guardrail_id>")

Key takeaways

Applying guardrails is crucial for maintaining responsible and safe content generation. With the ApplyGuardrail API from Amazon Bedrock, you can effectively moderate both inputs and outputs, protecting your generative AI application against violations and maintaining compliance with your content policies.

Key takeaways from this post include:

  • Understand the importance of applying guardrails in generative AI applications to mitigate risks and maintain content moderation standards
  • Use the ApplyGuardrail API from Amazon Bedrock to validate inputs and outputs against defined policies and rules
  • Implement chunking strategies for long-context inputs and batching techniques for streaming outputs to efficiently utilize the ApplyGuardrail API
  • Follow best practices, optimize performance, and continuously monitor and refine your guardrail implementation to maintain effectiveness and alignment with evolving content moderation needs

Benefits

By incorporating the ApplyGuardrail API into your generative AI application, you can unlock several benefits:

  • Content moderation at scale – The API allows you to moderate content at scale, so your application remains compliant with content policies and guidelines, even when dealing with large volumes of data
  • Customizable policies – You can define and customize content moderation policies tailored to your specific use case and requirements, making sure your application adheres to your organization’s standards and values
  • Real-time moderation – The API enables real-time content moderation, allowing you to detect and mitigate potential violations as they occur, providing a safe and responsible user experience
  • Integration with any LLM – ApplyGuardrail is an independent API, so it can be integrated with any of your LLMs of choice, letting you take advantage of the power of generative AI while maintaining control over the content being generated
  • Cost-effective solution – With its pay-per-use pricing model and efficient text unit-based billing, the API provides a cost-effective solution for content moderation, especially when dealing with large volumes of data

Conclusion

By using the ApplyGuardrail API from Amazon Bedrock and following the best practices outlined in this post, you can make sure your generative AI application remains safe, responsible, and compliant with content moderation standards, even with long-context inputs and streaming outputs.

To further explore the capabilities of the ApplyGuardrail API and its integration with your generative AI application, consider experimenting with the API using the following resources:

  • Refer to Guardrails for Amazon Bedrock for detailed information on the ApplyGuardrail API, its usage, and integration examples
  • Check out the AWS samples GitHub repository for sample code and reference architectures demonstrating the integration of the ApplyGuardrail API with various generative AI applications
  • Participate in AWS-hosted workshops and tutorials focused on responsible AI and content moderation, where you can learn from experts and gain hands-on experience with the ApplyGuardrail API

Resources

The following resources explain both practical and ethical aspects of applying Guardrails for Amazon Bedrock:


About the Author

Talha Chattha is a Generative AI Specialist Solutions Architect at Amazon Web Services, based in Stockholm. Talha helps establish practices to ease the path to production for Gen AI workloads. Talha is an expert in Amazon Bedrock and supports customers across the entire EMEA region. He is passionate about meta-agents, scalable on-demand inference, advanced RAG solutions, and cost-optimized prompt engineering with LLMs. When not shaping the future of AI, he explores the scenic European landscapes and delicious cuisines. Connect with Talha at LinkedIn using /in/talha-chattha/.
