Import a question answering fine-tuned model into Amazon Bedrock as a custom model

Import a question answering fine-tuned model into Amazon Bedrock as a custom model

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

Common generative AI use cases, including but not limited to chatbots, virtual assistants, conversational search, and agent assistants, use FMs to provide responses. Retrieval Augment Generation (RAG) is a technique to optimize the output of FMs by providing context around the questions for these use cases. Fine-tuning the FM is recommended to further optimize the output to follow the brand and industry voice or vocabulary.

Custom Model Import for Amazon Bedrock, in preview now, allows you to import customized FMs created in other environments, such as Amazon SageMaker, Amazon Elastic Compute Cloud (Amazon EC2) instances, and on premises, into Amazon Bedrock. This post is part of a series that demonstrates various architecture patterns for importing fine-tuned FMs into Amazon Bedrock.

In this post, we provide a step-by-step approach of fine-tuning a Mistral model using SageMaker and import it into Amazon Bedrock using the Custom Import Model feature. We use the OpenOrca dataset to fine-tune the Mistral model and use the SageMaker FMEval library to evaluate the fine-tuned model imported into Amazon Bedrock.

Key Features

Some of the key features of Custom Model Import for Amazon Bedrock are:

  1. This feature allows you to bring your fine-tuned models and leverage the fully managed serverless capabilities of Amazon Bedrock
  2. Currently we are supporting Llama 2, Llama 3, Flan, Mistral Model architectures using this feature with a precisions of FP32, FP16 and BF16 with further quantizations coming soon.
  3. To leverage this feature you can run the import process (covered later in the blog) with your model weights being in Amazon Simple Storage Service (Amazon S3).
  4. You can even leverage your models created using Amazon SageMaker by referencing the Amazon SageMaker model Amazon Resource Names (ARN) which provides for a seamless integration with SageMaker.
  5. Amazon Bedrock will automatically scale your model as your traffic pattern increases and when not in use, scale your model down to 0 thus reducing your costs.

Let us dive into a use-case and see how easy it is to use this feature.

Solution overview

At the time of writing, the Custom Model Import feature in Amazon Bedrock supports models following the architectures and patterns in the following figure.

In this post, we walk through the following high-level steps:

  1. Fine-tune the model using SageMaker.
  2. Import the fine-tuned model into Amazon Bedrock.
  3. Test the imported model.
  4. Evaluate the imported model using the FMEval library.

The following diagram illustrates the solution architecture.

The process includes the following steps:

  1. We use a SageMaker training job to fine-tune the model using a SageMaker JupyterLab notebook. This training job reads the dataset from Amazon Simple Storage Service (Amazon S3) and writes the model back into Amazon S3. This model will then be imported into Amazon Bedrock.
  2. To import the fine-tuned model, you can use the Amazon Bedrock console, the Boto3 library, or APIs.
  3. An import job orchestrates the process to import the model and make the model available from the customer account.
    1. The import job copies all the model artifacts from the user’s account into an AWS managed S3 bucket.
  4. When the import job is complete, the fine-tuned model is made available for invocation from your AWS account.
  5. We use the SageMaker FMEval library in a SageMaker notebook to evaluate the imported model.

The copied model artifacts will remain in the Amazon Bedrock account until the custom imported model is deleted from Amazon Bedrock. Deleting model artifacts in your AWS account S3 bucket doesn’t delete the model or the related artifacts in the Amazon Bedrock managed account. You can delete an imported model from Amazon Bedrock along with all the copied artifacts using either the Amazon Bedrock console, Boto3 library, or APIs.

Additionally, all data (including the model) remains within the selected AWS Region. The model artifacts are imported into the AWS operated deployment account using a virtual private cloud (VPC) endpoint, and you can encrypt your model data using an AWS Key Management Service (AWS KMS) customer managed key.

In the following sections, we dive deep into each of these steps to deploy, test, and evaluate the model.

Prerequisites

We use Mistral-7B-v0.3 in this post because it uses an extended vocabulary compared to its prior version produced by Mistral AI. This model is straightforward to fine-tune, and Mistral AI has provided example fine-tuned models. We use Mistral for this use case because this model supports a 32,000-token context capacity and is fluent in English, French, Italian, German, Spanish, and coding languages. With the Mixture of Experts (MoE) feature, it can achieve higher accuracy for customer support use cases.

Mistral-7B-v0.3 is a gated model on the Hugging Face model repository. You need to review the terms and conditions and request access to the model by submitting your details.

We use Amazon SageMaker Studio to preprocess the data and fine-tune the Mistral model using a SageMaker training job. To set up SageMaker Studio, refer to Launch Amazon SageMaker Studio. Refer to the SageMaker JupyterLab documentation to set up and launch a JupyterLab notebook. You will submit a SageMaker training job to fine-tune the Mistral model from the SageMaker JupyterLab notebook, which can found on the GitHub repo.

Fine-tune the model using QLoRA

To fine-tune the Mistral model, we apply QLoRA and Parameter-Efficient Fine-Tuning (PEFT) optimization techniques. In the provided notebook, you use the Fully Sharded Data Parallel (FSDP) PyTorch API to perform distributed model tuning. You use supervised fine-tuning (SFT) to fine-tune the Mistral model.

Prepare the dataset

The first step in the fine-tuning process is to prepare and format the dataset. After you transform the dataset into the Mistral Default Instruct format, you upload it as a JSONL file into the S3 bucket used by the SageMaker session, as shown in the following code:

# Load dataset from the hub
dataset = load_dataset("Open-Orca/OpenOrca")
flan_dataset = dataset.filter(lambda example, indice: "flan" in example["id"], with_indices=True)
flan_dataset = flan_dataset["train"].train_test_split(test_size=0.01, train_size=0.035)

columns_to_remove = list(dataset["train"].features)
flan_dataset = flan_dataset.map(create_conversation, remove_columns=columns_to_remove, batched=False)

# save datasets to s3
flan_dataset["train"].to_json(f"{training_input_path}/train_dataset.json", orient="records", force_ascii=False)
flan_dataset["test"].to_json(f"{training_input_path}/test_dataset.json", orient="records", force_ascii=False)

You transform the dataset into Mistral Default Instruct format within the SageMaker training job as instructed in the training script (run_fsdp_qlora.py):

    ################
    # Dataset
    ################
    
    train_dataset = load_dataset(
        "json",
        data_files=os.path.join(script_args.dataset_path, "train_dataset.json"),
        split="train",
    )
    test_dataset = load_dataset(
        "json",
        data_files=os.path.join(script_args.dataset_path, "test_dataset.json"),
        split="train",
    )

    ################
    # Model & Tokenizer
    ################

    # Tokenizer        
    tokenizer = AutoTokenizer.from_pretrained(script_args.model_id, use_fast=True)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.chat_template = MISTRAL_CHAT_TEMPLATE
    
    # template dataset
    def template_dataset(examples):
        return{"text":  tokenizer.apply_chat_template(examples["messages"], tokenize=False)}
    
    train_dataset = train_dataset.map(template_dataset, remove_columns=["messages"])
    test_dataset = test_dataset.map(template_dataset, remove_columns=["messages"])

Optimize fine-tuning using QLoRA

You optimize your fine-tuning using QLoRA and with the precision provided as input into the training script as SageMaker training job parameters. QLoRA is an efficient fine-tuning approach that reduces memory usage to fine-tune a 65-billion-parameter model on a single 48 GB GPU, preserving the full 16-bit fine-tuning task performance. In this notebook, you use the bitsandbytes library to set up quantization configurations, as shown in the following code:

    # Model    
    torch_dtype = torch.bfloat16 if training_args.bf16 else torch.float32
    quant_storage_dtype = torch.bfloat16

    if script_args.use_qlora:
        print(f"Using QLoRA - {torch_dtype}")
        quantization_config = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_use_double_quant=True,
                bnb_4bit_quant_type="nf4",
                bnb_4bit_compute_dtype=torch_dtype,
                bnb_4bit_quant_storage=quant_storage_dtype,
            )
    else:
        quantization_config = None

You use the LoRA config based on the QLoRA paper and Sebastian Raschka experiment, as shown in the following code. Two key points to consider from the Raschka experiment are that QLoRA offers 33% memory savings at the cost of an 39% increase in runtime, and to make sure LoRA is applied to all layers to maximize model performance.

################
# PEFT
################
# LoRA config based on QLoRA paper & Sebastian Raschka experiment
peft_config = LoraConfig(
    lora_alpha=8,
    lora_dropout=0.05,
    r=16,
    bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
    )

You use SFTTrainer to fine-tune the Mistral model:

    ################
    # Training
    ################
    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        dataset_text_field="text",
        eval_dataset=test_dataset,
        peft_config=peft_config,
        max_seq_length=script_args.max_seq_length,
        tokenizer=tokenizer,
        packing=True,
        dataset_kwargs={
            "add_special_tokens": False,  # We template with special tokens
            "append_concat_token": False,  # No need to add additional separator token
        },
    )

At the time of writing, only merged adapters are supported using the Custom Model Import feature for Amazon Bedrock. Let’s look at how to merge the adapter with the base model next.

Merge the adapters

Adapters are new modules added between layers of a pre-trained network. Creation of these new modules is possible by back-propagating gradients through a frozen, 4-bit quantized pre-trained language model into low-rank adapters in the fine-tuning process. To import the Mistral model into Amazon Bedrock, the adapters need to be merged with the base model and saved in Safetensors format. Use the following code to merge the model adapters and save them in Safetensors format:

        # load PEFT model in fp16
        model = AutoPeftModelForCausalLM.from_pretrained(
            training_args.output_dir,
            low_cpu_mem_usage=True,
            torch_dtype=torch.float16
        )
        # Merge LoRA and base model and save
        model = model.merge_and_unload()
        model.save_pretrained(
            sagemaker_save_dir, safe_serialization=True, max_shard_size="2GB"
        )

To import the Mistral model into Amazon Bedrock, the model needs to be in an uncompressed directory within an S3 bucket accessible by the Amazon Bedrock service role used in the import job.

Import the fine-tuned model into Amazon Bedrock

Now that you have fine-tuned the model, you can import the model into Amazon Bedrock. In this section, we demonstrate how to import the model using the Amazon Bedrock console or the SDK.

Import the model using the Amazon Bedrock console

To import the model using the Amazon Bedrock console, see Import a model with Custom Model Import. You use the Import model page as shown in the following screenshot to import the model from the S3 bucket.

After you successfully import the fine-tuned model, you can see the model listed on the Amazon Bedrock console.

Import the model using the SDK

The AWS Boto3 library supports importing custom models into Amazon Bedrock. You can use the following code to import a fine-tuned model from within the notebook into Amazon Bedrock. This is an asynchronous method.

import boto3
import datetime
br_client = boto3.client('bedrock', region_name='<aws-region-name>')
pt_model_nm = "<bedrock-custom-model-name>"
pt_imp_jb_nm = f"{pt_model_nm}-{datetime.datetime.now().strftime('%Y%m%d%M%H%S')}"
role_arn = "<<bedrock_role_with_custom_model_import_policy>>"
pt_model_src = {"s3DataSource": {"s3Uri": f"{pt_pubmed_model_s3_path}"}}
resp = br_client.create_model_import_job(jobName=pt_imp_jb_nm,
                                  importedModelName=pt_model_nm,
                                  roleArn=role_arn,
                                  modelDataSource=pt_model_src)

Test the imported model

Now that you have imported the fine-tuned model into Amazon Bedrock, you can test the model. In this section, we demonstrate how to test the model using the Amazon Bedrock console or the SDK.

Test the model on the Amazon Bedrock console

You can test the imported model using an Amazon Bedrock playground, as illustrated in the following screenshot.

Test the model using the SDK

You can also use the Amazon Bedrock Invoke Model API to run the fine-tuned imported model, as shown in the following code:

client = boto3.client("bedrock-runtime", region_name="us-west-2")
model_id = "<<replace with the imported bedrock model arn>>"


def call_invoke_model_and_print(native_request):
    request = json.dumps(native_request)

    try:
        # Invoke the model with the request.
        response = client.invoke_model(modelId=model_id, body=request)
        model_response = json.loads(response["body"].read())

        response_text = model_response["outputs"][0]["text"]
        print(response_text)
    except (ClientError, Exception) as e:
        print(f"ERROR: Can't invoke '{model_id}'. Reason: {e}")
        exit(1)

prompt = "will there be a season 5 of shadowhunters"
formatted_prompt = f"[INST] {prompt} [/INST]</s>"
native_request = {
"prompt": formatted_prompt,
"max_tokens": 64,
"top_p": 0.9,
"temperature": 0.91
}
call_invoke_model_and_print(native_request)

The custom Mistral model that you imported using Amazon Bedrock supports temperature, top_p, and max_gen_len parameters when invoking the model for inferencing. The inference parameters top_k, max_seq_len, max_batch_size, and max_new_tokens are not supported for a custom Mistral fine-tuned model.

Evaluate the imported model

Now that you have imported and tested the model, let’s evaluate the imported model using the SageMaker FMEval library. For more details, refer to Evaluate Bedrock Imported Models. To evaluate the question answering task, we use the metrics F1 Score, Exact Match Score, Quasi Exact Match Score, Precision Over Words, and Recall Over Words. The key metrics for the question answering tasks are Exact Match, Quasi-Exact Match, and F1 over words evaluated by comparing the model predicted answers against the ground truth answers. The FMEval library supports out-of-the-box evaluation algorithms for metrics such as accuracy, QA Accuracy, and others detailed in the FMEval documentation. Because you fine-tuned the Mistral model for question answering, you can use the QA Accuracy algorithm, as shown in the following code. The FMEval library supports these metrics for the QA Accuracy algorithm.

config = DataConfig(
    dataset_name="trex_sample",
    dataset_uri="data/test_dataset.json",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="question",
    target_output_location="answer"
)
bedrock_model_runner = BedrockModelRunner(
    model_id=model_id,
    output='outputs[0].text',
    content_template='{"prompt": $prompt, "max_tokens": 500}',
)

eval_algo = QAAccuracy()
eval_output = eval_algo.evaluate(model=bedrock_model_runner, dataset_config=config, 
                                    prompt_template="[INST]$model_input[/INST]", save=True)

You can get the consolidated metrics for the imported model as follows:

for op in eval_output:
    print(f"Eval Name: {op.eval_name}")
    for score in op.dataset_scores:
        print(f"{score.name} : {score.value}")

Clean up

To delete the imported model from Amazon Bedrock, navigate to the model on the Amazon Bedrock console. On the options menu (three dots), choose Delete.

To delete the SageMaker domain along with the SageMaker JupyterLab space, refer to Delete an Amazon SageMaker domain. You may also want to delete the S3 buckets where the data and model are stored. For instructions, see Deleting a bucket.

Conclusion

In this post, we explained the different aspects of fine-tuning a Mistral model using SageMaker, importing the model into Amazon Bedrock, invoking the model using both an Amazon Bedrock playground and Boto3, and then evaluating the imported model using the FMEval library. You can use this feature to import base FMs or FMs fine-tuned either on premises, on SageMaker, or on Amazon EC2 into Amazon Bedrock and use the models without any heavy lifting in your generative AI applications. Explore the Custom Model Import feature for Amazon Bedrock to deploy FMs fine-tuned for code generation tasks in a secure and scalable manner. Visit our GitHub repository to explore samples prepared for fine-tuning and importing models from various families.


About the Authors

Jay Pillai is a Principal Solutions Architect at Amazon Web Services. In this role, he functions as the Lead Architect, helping partners ideate, build, and launch Partner Solutions. As an Information Technology Leader, Jay specializes in artificial intelligence, generative AI, data integration, business intelligence, and user interface domains. He holds 23 years of extensive experience working with several clients across supply chain, legal technologies, real estate, financial services, insurance, payments, and market research business domains.

Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect with AWS. He currently focuses on serving of models and MLOps on Amazon SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.

Evandro Franco is a Sr. AI/ML Specialist Solutions Architect at Amazon Web Services. He helps AWS customers overcome business challenges related to AI/ML on top of AWS. He has more than 18 years of experience working with technology, from software development, infrastructure, serverless, to machine learning.

Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.

Sandeep Singh is a Senior Generative AI Data Scientist at Amazon Web Services, helping businesses innovate with generative AI. He specializes in generative AI, artificial intelligence, machine learning, and system design. He is passionate about developing state-of-the-art AI/ML-powered solutions to solve complex business problems for diverse industries, optimizing efficiency and scalability.

Ragha Prasad is a Principal Engineer and a founding member of Amazon Bedrock, where he has had the privilege to listen to customer needs first-hand and understands what it takes to build and launch scalable and secure Gen AI products. Prior to Bedrock, he worked on numerous products in Amazon, ranging from devices to Ads to Robotics.

Paras Mehra is a Senior Product Manager at AWS. He is focused on helping build Amazon SageMaker Training and Processing. In his spare time, Paras enjoys spending time with his family and road biking around the Bay Area.

Read More

Using task-specific models from AI21 Labs on AWS

Using task-specific models from AI21 Labs on AWS

In this blog post, we will show you how to leverage AI21 Labs’ Task-Specific Models (TSMs) on AWS to enhance your business operations. You will learn the steps to subscribe to AI21 Labs in the AWS Marketplace, set up a domain in Amazon SageMaker, and utilize AI21 TSMs via SageMaker JumpStart.

AI21 Labs is a foundation model (FM) provider focusing on building state-of-the-art language models. AI21 Task Specific Models (TSMs) are built for answering questions, summarization, condensing lengthy texts, and so on. AI21 TSMs are available in Amazon SageMaker Jumpstart.

Here are the AI21 TSMs that can be accessed and customized in SageMaker JumpStart: AI21 Contextual Answers, AI21 Summarize, AI21 Paraphrase, and AI21 Grammatical Error Correction.

AI21 FMs (Jamba-Instruct, AI21 Jurassic-2 Ultra, AI21 Jurassic-2 Mid) are available in Amazon Bedrock and can be used for large language model (LLM) use cases. We used AI21 TSMs available in SageMaker Jumpstart for this post. SageMaker Jumpstart enables you to select, compare, and evaluate available AI21 TSMs.

AI21’s TSMs

Foundation models can solve many tasks, but not every task is unique. Some commercial tasks are common across many applications. AI21 Labs’ TSMs are specialized models built to solve a particular problem. They’re built to deliver out-of-box value, cost effectiveness, and higher accuracy for the common tasks behind many commercial use-cases. In this post, we will explore three of AI21 Labs’ TSMs and their unique capabilities.

Foundation models are built and trained on massive datasets to perform a variety of tasks. Unlike FMs, TSMs are trained to perform unique tasks.

When your use case is supported by a TSM, you quickly realize benefits such as improved refusal rates when you don’t want the model to provide answers unless they’re grounded in actual document content.

  • Paraphrase: This model is used to enhance content creation and communication by generating varied versions of text while maintaining a consistent tone and style. This model is ideal for creating multiple product descriptions, marketing materials, and customer support responses, improving clarity and engagement. It also simplifies complex documents, making information more accessible.
  • Summarize: This model is used to condense lengthy texts into concise summaries while preserving the original meaning. This model is particularly useful for processing large documents, such as financial reports, legal documents, and technical papers, making critical information more accessible and comprehensible.
  • Contextual answers: This model is used to significantly enhance information retrieval and customer support processes. This model excels at providing accurate and relevant answers based on specific document contexts, making it particularly useful in customer service, legal, finance, and educational sectors. It streamlines workflows by quickly accessing relevant information from extensive databases, reducing response times and improving customer satisfaction.

Prerequisites

To follow the steps in this post, you must have the following prerequisites in place:

AWS account setup

Completing the labs in this post requires an AWS account and SageMaker environments set up. If you don’t have an AWS account, see Complete your AWS registration for the steps to create one.

AWS Marketplace opt-in

AI21 TSMs can also be accessed through Amazon Marketplace for subscription. Using AWS Marketplace, you can subscribe to AI21 TSMs and deploy SageMaker endpoints.

To do these exercises you must subscribe to the following offerings in the AWS Marketplace

Service quota limits

To use some of the GPU’s required to run AI21’s task specific models, you must have the required service quota limits. You can request a service quota limit increase in the AWS Management Console. Limits are account and resource specific.

To create a service request, search for service quotas in the console search bar. Select the service to land go to the dashboard and enter the name of the GPU (for example, ml.g5.48xlarge). Ensure the quota is for endpoint usage

Estimated cost

The following is the estimated cost to walk through the solution in this post.

Contextual answers:

  • We used an ml.g5.48xlarge
    • By default, AWS accounts don’t have access to this GPU. You must request a service quota limit increase (see the previous section: Service Quota Limits).
  • The notebook runtime was approximately 15 minutes.
  • The cost was $20.41 (billed on an hourly basis).

Summarize notebook

  • We used an ml.g4dn.12xlarge GPU.
    • You must request a service quota limit increase (see the previous section: Service Quota Limits).
  • The notebook runtime was approximately 10 minutes.
  • The cost was $4.94 (billed on an hourly basis).

Paraphrase notebook

  • We used the ml.g4dn.12xlarge GPU.
    • You must request a service quota limit increase (see the previous section: Service Quota Limits).
  • The notebook runtime approximately 10 minutes.
  • The cost was $4.94 (billed on an hourly basis).

Total cost: $30.29 (1 hour charge for each deployed endpoint)

Using AI21 models on AWS

Getting started

In this section, you will access AI21 TSMs in SageMaker Jumpstart.  These interactive notebooks contain code to deploy TSM endpoints and will also provide example code blocks to run inference.  These first few steps are pre-requisites to deploying the same notebooks.  If you already have a SageMaker domain and username set up, you may skip to Step 7.

  1. Use the search bar in the AWS Management Console to navigate to Amazon SageMaker , as shown in the following figure.


If you don’t already have one set up, you must create a SageMaker domain. A domain consists of an associated Amazon Elastic File System (Amazon EFS) volume; a list of authorized users, and a variety of security, application, policy, and Amazon Virtual Private Cloud (Amazon VPC) configurations.

Users within a domain can share notebook files and other artifacts with each other. For more information, see Learn about Amazon SageMaker domain entities and statuses. For today’s exercises, you will use Quick Set-Up to deploy an environment.

  1. Choose Create a SageMaker domain as shown in the following figure.
  2. Select Quick setup. After you choose Set up the domain will begin creation
  3. After a moment, your domain will be created.
  4. Choose Add user.
  5. You can keep the default user profile values.
  6. Launch Studio by choosing Launch button and then selecting Studio.
  7. Choose JumpStart in the navigation pane as shown in the following figure.

Here you can see the model providers for our JumpStart notebooks.

You will see the model providers for JumpStart notebooks.

  1. Select AI21 Labs to see their available models.

Each of AI21’s models has an associated model card. A model card provides key information about the model such as its intended use cases, training, and evaluation details. For this example, you will use the Summarize, Paraphrase, and Contextual Answers TSMs.

  1. Start with Contextual Answers. Select the AI21 Contextual Answers model card.

A sample notebook is included as part of the model. Jupyter Notebooks are a popular way to interact with code and LLMs.

  1. Choose Notebooks to explore the notebook.
  2. To run the notebook’s code blocks, choose Open in JupyterLab.
  3. If you do not already have an existing space, choose Create new space and enter an appropriate name. When ready, choose Create space and open notebook.

It can take up to 5 minutes to open your notebook.
SageMaker Spaces are used to manage the storage and resource needs of some SageMaker Studio applications. Each space has a 1:1 relationship with an instance of an application.

  1. After the notebook opens, you will be prompted to select a kernal. Ensure Python 3 is selected and choose Select.

Navigating the notebook exercises

Repeat the preceding process to import the remaining notebooks.

Each AI21 notebook demonstrates required code imports, version checks, model selection, endpoint creation, and inferences showcasing the TSM’s unique strengths through code blocks and example prompts

Each notebook will have a clean up step at the end to delete your deployed endpoints. It’s important to terminate any running endpoints to avoid additional costs.

Contextual Answers JumpStart Notebook

AWS customers and partners can use AI21 Labs’s Contextual Answers model to significantly enhance their information retrieval and customer support processes. This model excels at providing accurate and relevant answers based on specific context, making it useful in customer service, legal, finance, and educational sectors.

The following are code snippets from AI21’s Contextual Answers TSM through JumpStart. Notice that there is no prompt engineering required. The only input is the question and the context provided.

Input:

financial_context = """In 2020 and 2021, enormous QE — approximately $4.4 trillion, or 18%, of 2021 gross domestic product (GDP) — and enormous fiscal stimulus (which has been and always will be inflationary) — approximately $5 trillion, or 21%, of 2021 GDP — stabilized markets and allowed companies to raise enormous amounts of capital. In addition, this infusion of capital saved many small businesses and put more than $2.5 trillion in the hands of consumers and almost $1 trillion into state and local coffers. These actions led to a rapid decline in unemployment, dropping from 15% to under 4% in 20 months — the magnitude and speed of which were both unprecedented. Additionally, the economy grew 7% in 2021 despite the arrival of the Delta and Omicron variants and the global supply chain shortages, which were largely fueled by the dramatic upswing in consumer spending and the shift in that spend from services to goods. Fortunately, during these two years, vaccines for COVID-19 were also rapidly developed and distributed.
In today's economy, the consumer is in excellent financial shape (on average), with leverage among the lowest on record, excellent mortgage underwriting (even though we've had home price appreciation), plentiful jobs with wage increases and more than $2 trillion in excess savings, mostly due to government stimulus. Most consumers and companies (and states) are still flush with the money generated in 2020 and 2021, with consumer spending over the last several months 12% above pre-COVID-19 levels. (But we must recognize that the account balances in lower-income households, smaller to begin with, are going down faster and that income for those households is not keeping pace with rising inflation.)
Today's economic landscape is completely different from the 2008 financial crisis when the consumer was extraordinarily overleveraged, as was the financial system as a whole — from banks and investment banks to shadow banks, hedge funds, private equity, Fannie Mae and many other entities. In addition, home price appreciation, fed by bad underwriting and leverage in the mortgage system, led to excessive speculation, which was missed by virtually everyone — eventually leading to nearly $1 trillion in actual losses.
"""
question = "Did the economy shrink after the Omicron variant arrived?"
response = client.answer.create(
    context=financial_context,
    question=question,
)

print(response.answer)

Output:

No, the economy did not shrink after the Omicron variant arrived. In fact, the economy grew 7% in 2021, despite the arrival of the Delta and Omicron variants and the global supply chain shortages, which were largely fueled by the dramatic upswing in consumer spending and the shift in that spend from services to goods.

As mentioned in our introduction, AI21’s Contextual Answers model does not provide answers to questions outside of the context provided. If the prompt includes a question unrelated to 2020/2021 economy, you will get a response as shown in the following example.

Input:

irrelevant_question = "How did COVID-19 affect the financial crisis of 2008?"

response = client.answer.create(
context=financial_context,
question=irrelevant_question,
)

print(response.answer)

Output:

None

When finished, you can delete your deployed endpoint by running the final two cells of the notebook.

model.sagemaker_session.delete_endpoint(endpoint_name)
model.sagemaker_session.delete_endpoint_config(endpoint_name)
model.delete_model()

You can import the other notebooks by navigating to SageMaker JumpStart and repeating the same process you used to import this first notebook.

Summarize JumpStart Notebook

AWS customers and partners can uses AI21 Labs’ Summarize model to condense lengthy texts into concise summaries while preserving the original meaning. This model is particularly useful for processing large documents, such as financial reports, legal documents, and technical papers, making critical information more accessible and comprehensible.

The following are highlight code snippets from AI21’s Summarize TSM using JumpStart. Notice that the  input must include the full text that the user wants to summarize.

Input:

text = """The error affected a number of international flights leaving the terminal on Wednesday, with some airlines urging passengers to travel only with hand luggage.
Virgin Atlantic said all airlines flying out of the terminal had been affected.
Passengers have been warned it may be days before they are reunited with luggage.
An airport spokesperson apologised and said the fault had now been fixed.
Virgin Atlantic said it would ensure all bags were sent out as soon as possible.
It added customers should retain receipts for anything they had bought and make a claim to be reimbursed.
Passengers, who were informed by e-mail of the problem, took to social media to vent their frustrations.
One branded the situation "ludicrous" and said he was only told 12 hours before his flight.
The airport said it could not confirm what the problem was, what had caused it or how many people had been affected."""

response = client.summarize.create(
    source=text,
    source_type=DocumentType.TEXT,
)

print("Original text:")
print(text)
print("================")
print("Summary:")
print(response.summary)

Output:
Original text:
The error affected a number of international flights leaving the terminal on Wednesday, with some airlines urging passengers to travel only with hand luggage.
Virgin Atlantic said all airlines flying out of the terminal had been affected.
Passengers have been warned it may be days before they are reunited with luggage.
An airport spokesperson apologised and said the fault had now been fixed.
Virgin Atlantic said it would ensure all bags were sent out as soon as possible.
It added customers should retain receipts for anything they had bought and make a claim to be reimbursed.
Passengers, who were informed by e-mail of the problem, took to social media to vent their frustrations.
One branded the situation "ludicrous" and said he was only told 12 hours before his flight.
The airport said it could not confirm what the problem was, what had caused it or how many people had been affected.
================
Summary:
A number of international flights leaving the terminal were affected by the error on Wednesday, with some airlines urging passengers to travel only with hand luggage. Passengers were warned it may be days before they are reunited with their luggage.

Paraphrase JumpStart Notebook

AWS customers and partners can use AI21 Labs’s Paraphrase TSM through JumpStart to enhance content creation and communication by generating varied versions of text.

The following are highlight code snippets from AI21’s Paraphrase TSM using JumpStart. Notice that there is no extensive prompt engineering required. The only input required is the full text that the user wants to paraphrase and a chosen style, for example casual, formal, and so on.

Input:

text = "Throughout this page, we will explore the advantages and features of the Paraphrase model."

response = client.paraphrase.create(
text=text,
style="formal"

)

print(response.suggestions) Output: 
[Suggestion(text='We will examine the advantages and features of the Paraphrase model throughout this page.'), Suggestion(text='The purpose of this page is to examine the advantages and features of the Paraphrase model.'), Suggestion(text='On this page, we will discuss the advantages and features of the Paraphrase model.'), Suggestion(text='This page will provide an overview of the advantages and features of the Paraphrase model.'), Suggestion(text='In this article, we will examine the advantages and features of the Paraphrase model.'), Suggestion(text='Here we will explore the advantages and features of the Paraphrase model.'), Suggestion(text='The purpose of this page is to describe the advantages and features of the Paraphrase model.'), Suggestion(text='In this page, we will examine the advantages and features of the Paraphrase model.'), Suggestion(text='The Paraphrase model will be reviewed on this page with an emphasis on its advantages and features.'), Suggestion(text='Our goal on this page will be to explore the benefits and features of the Paraphrase model.')]

Input:

print("Original sentence:")
print(text)
print("============================")
print("Suggestions:")
print("n".join(["- " + x.text for x in response.suggestions]))

Output:

Original sentence:
Throughout this page, we will explore the advantages and features of the Paraphrase model.
============================
Suggestions:
- We will examine the advantages and features of the Paraphrase model throughout this page.
- The purpose of this page is to examine the advantages and features of the Paraphrase model.
- On this page, we will discuss the advantages and features of the Paraphrase model.
- This page will provide an overview of the advantages and features of the Paraphrase model.
- In this article, we will examine the advantages and features of the Paraphrase model.
- Here we will explore the advantages and features of the Paraphrase model.
- The purpose of this page is to describe the advantages and features of the Paraphrase model.
- In this page, we will examine the advantages and features of the Paraphrase model.
- The Paraphrase model will be reviewed on this page with an emphasis on its advantages and features.
- Our goal on this page will be to explore the benefits and features of the Paraphrase model.

Less prompt engineering

A key advantage of AI21’s task-specific models is the reduced need for complex prompt engineering compared to foundation models. Let’s consider how you might approach a summarization task using a foundation model compared to using AI21’s specialized Summarize TSM.

For a foundation model, you might need to craft an elaborate prompt template with detailed instructions:

python prompt_template = "You are a highly capable summarization assistant. Concisely summarize the given text while preserving key details and overall meaning. Use clear language tailored for human readers.nnText: 

[INPUT_TEXT]nnSummary:" ``` To summarize text with this foundation model, you'd populate the template and pass the full prompt: ```python input_text = "Insert text to summarize here..." prompt = prompt_template.replace("[INPUT_TEXT]", input_text) summary = model(prompt)

 In contrast, using AI21's Summarize TSM is more straightforward:
python input_text = "Insert text to summarize here..." summary = summarize_model(input_text)

That’s it! With the Summarize TSM, you pass the input text directly to the model; there’s no need for an intricate prompt template.

Lower cost and higher accuracy

By using TSMs, you can achieve lower costs and higher accuracy. As demonstrated previously in the Contextual Notebook, TSMs have a higher refusal rate than most mainstream models, which can lead to higher accuracy. This characteristic of TSMs is beneficial in use cases where wrong answers are less acceptable.

Conclusion

In this post, we explored AI21 Labs’s approach to generative AI using task-specific models (TSMs). Through guided exercises, you walked through the process of setting up a SageMaker domain and importing sample JumpStart Notebooks to experiment with AI21’s TSMs, including Contextual Answers, Paraphrase, and Summarize.

Throughout the exercises, you saw the potential benefits of task-specific models compared to foundation models. When asking questions outside the context of the intended use case, the AI21 TSMs refused to answer, making them less prone to hallucinating or generating nonsensical outputs beyond their intended domain—a critical factor for applications that require precision and safety. Lastly, we highlighted how task-specific models are designed from the outset to excel at specific tasks, streamlining development and reducing the need for extensive prompt engineering and fine-tuning, which can them a more cost-effective solution.

Whether you’re a data scientist, machine learning practitioner, or someone curious about AI advancements, we hope this post has provided valuable insights into the advantages of AI21 Labs’s task-specific approach. As the field of generative AI continues to evolve rapidly, we encourage you to stay curious, experiment with various approaches, and ultimately choose the one that best aligns with your project’s unique requirements and goals. Visit AWS GitHub for other example use cases and codes to experiment in your own environment.

Additional resources


About the Authors

Joe Wilson is a Solutions Architect at Amazon Web Services supporting nonprofit organizations. He has core competencies in data analytics, AI/ML and GenAI. Joe background is in data science and international development. He is passionate about leveraging data and technology for social good.

Pat Wilson is a Solutions Architect at Amazon Web Services with a focus on AI/ML workloads and security. He currently supports Federal Partners. Outside of work Pat enjoys learning, working out, and spending time with family/friends.

Josh Famestad is a Solutions Architect at Amazon Web Services. Josh works with public sector customers to build and execute cloud based approaches to deliver on business priorities.

Read More

Stress-testing biomedical vision models with RadEdit: A synthetic data approach for robust model deployment

Stress-testing biomedical vision models with RadEdit: A synthetic data approach for robust model deployment

This paper has been accepted at the 18th European Conference on Computer Vision (ECCV 2024) (opens in new tab), the premier gathering on computer vision and machine learning.

 On the left is a simple drawing of the lungs. The drawing shows the borders of the left and right lung as well as the trachea and the left and right main stem bronchi. The text under the drawing reads: Original image. To the right of the drawing are the 3 additional inputs of RadEdit. They are arranged vertically. On top there is an example editing prompt. It reads

Biomedical vision models are computational tools that analyze medical images, like X-rays, MRIs, and CT scans, and are used to predict medical conditions and outcomes. These models assist medical practitioners in disease diagnosis, treatment planning, disease monitoring, and risk assessment. However, datasets used to train these models can be small and not representative of real-world conditions, which often leads to these models performing worse in actual medical settings. To avoid misdiagnoses and other errors, these models must be rigorously tested and adjusted to perform reliably across different conditions.

To mitigate the dataset challenge of not having enough diverse data and to improve the testing of biomedical vision models, we developed “RadEdit: Stress-testing biomedical vision models via diffusion image editing,” presented at ECCV 2024. Aligned with the Microsoft Responsible AI principles of reliability and safety, RadEdit helps researchers identify when and how models might fail before they are deployed in a medical setting. RadEdit uses generative image editing to simulate different dataset shifts (e.g., a shift in the patients’ demographics), helping researchers to identify weaknesses in the model. By employing text-to-image diffusion models trained on a wide array of chest X-ray datasets, RadEdit can generate synthetic yet realistic X-rays.

RadEdit’s approach involves using multiple image masks (binary images representing designated regions of a reference image), as illustrated in Figure 1, to limit changes to specific areas of the image, therefore preserving their integrity. It generates synthetic datasets free from spurious correlations and artifacts, addressing shortcomings in existing editing techniques. Traditional editing techniques often overlook biases within the generative model, leading to synthetic data that perpetuate these biases. Alternatively, these other editing techniques restrict edits to the point of unrealistic outputs.

Spotlight: On-demand video

AI Explainer: Foundation models ​and the next era of AI

Explore how the transformer architecture, larger models and more data, and in-context learning have helped advance AI from perception to creation.


How RadEdit works

RadEdit improves biomedical image editing using three key inputs, as illustrated in Figure 1:

  • Text prompt: Defines the desired modifications. For example, a disease can be added with a description like “Consolidation”
  • Edit mask: A binary mask indicating the main area to be modified, such as the “right lung”
  • Keep mask: A binary mask outlining parts of the original image to be preserved, like the “left lung”
 On the left is a simple drawing of the lungs. The drawing shows the borders of the left and right lung as well as the trachea and the left and right main stem bronchi. The text under the drawing reads: Original image. To the right of the drawing are the 3 additional inputs of RadEdit. They are arranged vertically. On top there is an example editing prompt. It reads
Figure 1: RadEdit’s inputs and outputs. By using separate “edit” and “keep” masks, RadEdit can make the desired modifications to an image with precise spatial control and realistic output.

RadEdit depends on a diffusion model for image editing, where the image is first converted to a latent noise representation by inverting the diffusion generative process. The noise representation is then iteratively denoised over multiple time steps. During each step, RadEdit:

  1. Uses the text prompt to conditionally generate pixels within the edit mask with classifier-free guidance.
  2. Generates the remaining pixels based on the original image and edited area.
  3. Replicates the content of the original image within the “keep” mask, ensuring that this area remains unaltered.

Finally, a quality check ensures that the edited image is faithful to the editing prompt. RadEdit uses Microsoft’s BioViL-T to compute an image-text alignment score that we can then use to filter out low-quality and unfaithful edits.

Simulating dataset shifts

A key feature of RadEdit is its ability to simulate dataset shifts with precise spatial control for comprehensive model performance evaluation. This includes differences in image acquisition, the appearance of underlying pathologies, and population characteristics.

Particularly notable is RadEdit’s ability to simulate image variations from different sources (e.g., different hospitals), helping researchers identify potential biases in models trained solely on data from one source. For example, in a COVID-19 study, if all positive cases in a dataset come from a single hospital and all negative cases come from a different hospital, a model trained on detecting COVID-19 might over-rely on hospital-specific indicators from the X-ray images. Among others, we considered the laterality markers in the corners of an X-ray (e.g., a highly visible letter “L” on the left side of the X-ray) as well as the amount of black space on the image edges to be hospital-specific indicators. To test if a model relies too much on differences in image acquisition, we created synthetic data using RadEdit, where we removed COVID-19 features while retaining hospital-specific indicators. After creating the synthetic dataset with the COVID-19 features no longer present, we can test if the COVID-10 detection model still predicts COVID-19. This would indicate that the model is biased with respect to hospital-specific indicators.

RadEdit can also remove specific diseases, like pneumothorax (collapsed lung), from an image while keeping treatment features like chest drains. This helps researchers understand how models detect and understand “visual shortcuts.” Because RadEdit maintains the size and location of the main anatomical structures (like lungs, ribs, and heart), it can also be used to stress-test segmentation models. For example, RadEdit can add rare abnormalities or medical devices to lung images to test how well segmentation models handle new variations, ensuring they generalize accurately across different populations. Figure 2 illustrates these three examples of stress-testing scenarios.

All drawings of lungs are the same as in Figure 1. The drawing shows the borders of the left and right lung as well as the trachea and the left and right main stem bronchi. In the first row on the left there are two drawings of a lung. The first drawing of a lung labelled
Figure 2: Stress-testing models by simulating dataset shifts via image editing.

Stress-testing multimodal models

We have used RadEdit to stress-test image classification and segmentation models, and we see potential for future applications in complex multimodal tasks like generating radiology reports. RadEdit can help identify limitations in multimodal large language models (MLLMs) like Microsoft’s MAIRA-1 and MAIRA-2, especially when dealing with rare conditions or unusual combinations of findings not well-represented in the training data. These MLLMs take one or more radiological images and relevant clinical information as input to produce detailed text reports.

RadEdit can generate synthetic image-report pairs for challenging scenarios. For example, manually editing a report to describe a rare combination of findings and then using RadEdit to edit the corresponding image, creates a valuable test case for the MLLM. This approach allows us to stress-test MLLM with diverse synthetic data, identifying weaknesses or biases and ensuring the model is more robust in real-world scenarios. This is a crucial step for using these models safely and effectively in clinical settings.

Implications and looking forward

RadEdit offers significant advantages for the biomedical research community. It helps identify biases and blind spots before deployment, helping to ensure that biomedical vision models perform reliably in clinical settings. By simulating dataset shifts, RadEdit reduces the need to collect additional evaluation data, saving time and resources.

RadEdit is applicable to a wide range of settings and can be used to stress-test state-of-the-art foundation models like Microsoft’s Rad-DINO (opens in new tab) and BiomedParse (opens in new tab). By integrating RadEdit into their research workflow, researchers can validate that their biomedical vision models are not only state-of-the-art but also more prepared for the complexities of real-world deployment. In the future, we envision RadEdit being applied to more complex multimodal tasks, such as generating radiology reports.

The code for RadEdit as well as the weights of the diffusion model we used can be found under https://huggingface.co/microsoft/radedit (opens in new tab).

Acknowledgments

We would like to thank our paper coauthors: Fernando Pérez-García, Sam Bond-Taylor, Pedro P. Sanchez, Boris van Breugel, Harshita Sharma, Valentina Salvatelli, Maria T. A. Wetscherek, Hannah Richardson, Matthew P. Lungren, Aditya Nori, and Ozan Oktay, as well as all our collaborators across Microsoft Cloud for Healthcare and Microsoft Health Futures.

RadEdit is intended for research purposes only and not for any commercial or clinical use.

The post Stress-testing biomedical vision models with RadEdit: A synthetic data approach for robust model deployment appeared first on Microsoft Research.

Read More

Generalizable Error Modeling for Human Data Annotation: Evidence from an Industry-Scale Search Data Annotation Program

Machine learning (ML) and artificial intelligence (AI) systems rely heavily on human-annotated data for training and evaluation. A major challenge in this context is the occurrence of annotation errors, as their effects can degrade model performance. This paper presents a predictive error model trained to detect potential errors in search relevance annotation tasks for three industry-scale ML applications (music streaming, video streaming, and mobile apps). Drawing on real-world data from an extensive search relevance annotation program, we demonstrate that errors can be predicted with…Apple Machine Learning Research

Misty: UI Prototyping Through Interactive Conceptual Blending

UI prototyping often involves iterating and blending elements from examples such as screenshots and sketches, but current tools offer limited support for incorporating these examples. Inspired by the cognitive process of conceptual blending, we introduce a novel UI workflow that allows developers to rapidly incorporate diverse aspects from design examples into work-in-progress UIs. We prototyped this workflow as Misty. Through an exploratory first-use study with 14 frontend developers, we assessed Misty’s effectiveness and gathered feedback on this workflow. Our findings suggest that Misty’s…Apple Machine Learning Research

Architecture to AWS CloudFormation code using Anthropic’s Claude 3 on Amazon Bedrock

Architecture to AWS CloudFormation code using Anthropic’s Claude 3 on Amazon Bedrock

The Anthropic’s Claude 3 family of models, available on Amazon Bedrock, offers multimodal capabilities that enable the processing of images and text. This capability opens up innovative avenues for image understanding, wherein Anthropic’s Claude 3 models can analyze visual information in conjunction with textual data, facilitating more comprehensive and contextual interpretations. By taking advantage of its multimodal prowess, we can ask the model questions like “What objects are in the image, and how are they relatively positioned to each other?” We can also gain an understanding of data presented in charts and graphs by asking questions related to business intelligence (BI) tasks, such as “What is the sales trend for 2023 for company A in the enterprise market?” These are just some examples of the additional richness Anthropic’s Claude 3 brings to generative artificial intelligence (AI) interactions.

Architecting specific AWS Cloud solutions involves creating diagrams that show relationships and interactions between different services. Instead of building the code manually, you can use Anthropic’s Claude 3’s image analysis capabilities to generate AWS CloudFormation templates by passing an architecture diagram as input.

In this post, we explore some ways you can use Anthropic’s Claude 3 Sonnet’s vision capabilities to accelerate the process of moving from architecture to the prototype stage of a solution.

Use cases for architecture to code

The following are relevant use cases for this solution:

  • Converting whiteboarding sessions to AWS infrastructure To quickly prototype your designs, you can take the architecture diagrams created during whiteboarding sessions and generate the first draft of a CloudFormation template. You can also iterate over the CloudFormation template to develop a well-architected solution that meets all your requirements.
  • Fast deployment of architecture diagrams – You can generate boilerplate CloudFormation templates by using architecture diagrams you find on the web. This allows you to experiment quickly with new designs.
  • Streamlined AWS infrastructure design through collaborative diagramming – You might draw architecture diagrams on a diagramming tool during an all-hands meeting. These raw diagrams can generate boilerplate CloudFormation templates, quickly leading to actionable steps while speeding up collaboration and increasing meeting value.

Solution overview

To demonstrate the solution, we use Streamlit to provide an interface for diagrams and prompts. Amazon Bedrock invokes the Anthropic’s Claude 3 Sonnet model, which provides multimodal capabilities. AWS Fargate is the compute engine for web application. The following diagram illustrates the step-by-step process.

Architecture Overview

The workflow consists of the following steps:

  1. The user uploads an architecture image (JPEG or PNG) on the Streamlit application, invoking the Amazon Bedrock API to generate a step-by-step explanation of the architecture using the Anthropic’s Claude 3 Sonnet model.
  2. The Anthropic’s Claude 3 Sonnet model is invoked using a step-by-step explanation and few-shot learning examples to generate the initial CloudFormation code. The few-shot learning example consists of three CloudFormation templates; this helps the model understand writing practices associated with CloudFormation code.
  3. The user manually provides instructions using the chat interface to update the initial CloudFormation code.

*Steps 1 and 2 are executed once when architecture diagram is uploaded. To trigger changes to the AWS CloudFormation code (step 3) provide update instructions from the Streamlit app

The CloudFormation templates generated by the web application are intended for inspiration purposes and not for production-level applications. It is the responsibility of a developer to test and verify the CloudFormation template according to security guidelines.

Few-shot Prompting

To help Anthropic’s Claude 3 Sonnet understand the practices of writing CloudFormation code, we use few-shot prompting by providing three CloudFormation templates as reference examples in the prompt. Exposing Anthropic’s Claude 3 Sonnet to multiple CloudFormation templates will allow it to analyze and learn from the structure, resource definitions, parameter configurations, and other essential elements consistently implemented across your organization’s templates. This enables Anthropic’s Claude 3 Sonnet to grasp your team’s coding conventions, naming conventions, and organizational patterns when generating CloudFormation templates. The following examples used for few-shot learning can be found in the GitHub repo.

Few-shot prompting example 1

Few-shot prompting example 1

Few-shot prompting example 2

Few-shot prompting example 2

Few-shot prompting example 3

Few-shot prompting example 3

Furthermore, Anthropic’s Claude 3 Sonnet can observe how different resources and services are configured and integrated within the CloudFormation templates through few-shot prompting. It will gain insights into how to automate the deployment and management of various AWS resources, such as Amazon Simple Storage Service (Amazon S3), AWS Lambda, Amazon DynamoDB, and AWS Step Functions.

Inference parameters are preset, but they can be changed from the web application if desired. We recommend experimenting with various combinations of these parameters. By default, we set the temperature to zero to reduce the variability of outputs and create focused, syntactically correct code.

Prerequisites

To access the Anthropic’s Claude 3 Sonnet foundation model (FM), you must request access through the Amazon Bedrock console. For instructions, see Manage access to Amazon Bedrock foundation models. After requesting access to Anthropic’s Claude 3 Sonnet, you can deploy the following development.yaml CloudFormation template to provision the infrastructure for the demo. For instructions on how to deploy this sample, refer to the GitHub repo. Use the following table to launch the CloudFormation template to quickly deploy the sample in either us-east-1 or us-west-2.

Region Stack
us-east-1 development.yaml
us-west-2 development.yaml

When deploying the template, you have the option to specify the Amazon Bedrock model ID you want to use for inference. This flexibility allows you to choose the model that best suits your needs. By default, the template uses the Anthropic’s Claude 3 Sonnet model, renowned for its exceptional performance. However, if you prefer to use a different model, you can seamlessly pass its Amazon Bedrock model ID as a parameter during deployment. Verify that you have requested access to the desired model beforehand and that the model possesses the necessary vision capabilities required for your specific use case.

After you launch the CloudFormation stack, navigate to the stack’s Outputs tab on the AWS CloudFormation console and collect the Amazon CloudFront URL. Enter the URL in your browser to view the web application.

Web application screenshot

In this post, we discuss CloudFormation template generation for three different samples. You can find the sample architecture diagrams in the GitHub repo. These samples are similar to the few-shot learning examples, which is intentional. As an enhancement to this architecture, you can employ a Retrieval Augmented Generation (RAG)-based approach to retrieve relevant CloudFormation templates from a knowledge base to dynamically augment the prompt.

Due to the non-deterministic behavior of the large language model (LLM), you might not get the same response as shown in this post.

Let’s generate CloudFormation templates for the following sample architecture diagram.

Sample Architecture for CloudFormation generation

Uploading the preceding architecture diagram to the web application generates a step-by-step explanation of the diagram using Anthropic’s Claude 3 Sonnet’s vision capabilities.

Web application output screenshot

Let’s analyze the step-by-step explanation. The generated response is divided into three parts:

  1. The context explains what the architecture diagram depicts.
  2. The architecture diagram’s flow gives the order in which AWS services are invoked and their relationship with each other.
  3. We get a summary of the entire generated response.

In the following step-by-step explanation, we see a few highlighted errors.

Step-by-step explanation errors

The step-by-step explanation is augmented with few-shot learning examples to develop an initial CloudFormation template. Let’s analyze the initial CloudFormation template:

AWSTemplateFormatVersion: '2010-09-09'
Description: >
  This CloudFormation stack sets up a serverless data processing pipeline triggered by file uploads to an S3 bucket.
  It uses AWS Lambda to process the uploaded files, and Amazon SNS to send notifications upon successful processing.
  This template is not production ready and should only be used for inspiration
Parameters:
  S3BucketName:
    Type: String
    Description: Name of the S3 bucket for file uploads
    AllowedPattern: ^[a-z0-9][a-z0-9-]*[a-z0-9]$
    MinLength: 1
    MaxLength: 63

  EmailAddress:
    Type: String
    Description: Email address to receive notifications
    AllowedPattern: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+.[a-zA-Z]{2,}$

Resources:

  # S3 Bucket
  S3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Ref S3BucketName
      NotificationConfiguration:
        LambdaConfigurations:
          - Event: 's3:ObjectCreated:*'
            Function: !GetAtt ProcessingLambda.Arn

  # SNS Topic
  SNSTopic:
    Type: AWS::SNS::Topic
    Properties:
      Subscription:
        - Endpoint: !Ref EmailAddress
          Protocol: email

  # Lambda Function
  ProcessingLambda:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: ProcessingLambda
      Runtime: python3.9
      Handler: index.lambda_handler
      Role: !GetAtt LambdaRole.Arn
      Code:
        ZipFile: |
          import boto3

          def lambda_handler(event, context):
              s3 = boto3.client('s3')
              sns = boto3.client('sns')

              # Process the uploaded file
              for record in event['Records']:
                  bucket_name = record['s3']['bucket']['name']
                  object_key = record['s3']['object']['key']

                  # Process the file data
                  # ...

              # Send notification upon successful processing
              sns.publish(
                  TopicArn=!Ref SNSTopic,
                  Message='File processing completed successfully',
                  Subject='Data Processing Notification'
              )

              return {
                  'statusCode': 200,
                  'body': 'File processing completed successfully'
              }

  # Lambda Role
  LambdaRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: 'sts:AssumeRole'
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
      Policies:
        - PolicyName: S3Access
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - 's3:GetObject'
                Resource: !Join ['', ['arn:aws:s3:::', !Ref S3BucketName, '/*']]
        - PolicyName: SNSPublish
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - 'sns:Publish'
                Resource: !Ref SNSTopic

  # Lambda Permissions
  LambdaPermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !GetAtt ProcessingLambda.Arn
      Action: 'lambda:InvokeFunction'
      Principal: s3.amazonaws.com
      SourceAccount: !Ref AWS::AccountId
      SourceArn: !Join ['', ['arn:aws:s3:::', !Ref S3BucketName]]

Outputs:

  S3BucketName:
    Description: Name of the S3 bucket for file uploads
    Value: !Ref S3Bucket
    Export:
      Name: !Sub '${AWS::StackName}-S3BucketName'

  LambdaFunctionArn:
    Description: ARN of the Lambda function
    Value: !GetAtt ProcessingLambda.Arn
    Export:
      Name: !Sub '${AWS::StackName}-LambdaFunctionArn'

  SNSTopicArn:
    Description: ARN of the SNS topic for notifications
    Value: !Ref SNSTopic
    Export:
      Name: !Sub '${AWS::StackName}-SNSTopicArn'

After analyzing the CloudFormation template, we see that the Lambda code refers to an Amazon Simple Notification Service (Amazon SNS) topic using !Ref SNSTopic, which is not valid. We also want to add additional functionality to the template. First, we want to filter the Amazon S3 notification configuration to invoke Lambda only when *.csv files are uploaded. Second, we want to add metadata to the CloudFormation template. To do this, we use the chat interface to give the following update instructions to the web application:

Make the following updates:

Use environment variables for AWS Lambda to access SNS Topic ARN.

Add filter to S3 notification configuration to only invoke AWS lambda when *.csv files are uploaded

Add metadata to CloudFormation template

Chat interface web application screenshot

The updated CloudFormation template is as follows:

AWSTemplateFormatVersion: '2010-09-09'
Description: >
  This CloudFormation stack sets up a serverless data processing pipeline triggered by file uploads to an S3 bucket.
  It uses AWS Lambda to process the uploaded files, and Amazon SNS to send notifications upon successful processing.
  This template is not production ready and should only be used for inspiration.
Metadata:
  'AWS::CloudFormation::Interface':
    ParameterGroups:
      - Label:
          default: 'S3 Bucket Configuration'
        Parameters:
          - S3BucketName
      - Label:
          default: 'Notification Configuration'
        Parameters:
          - EmailAddress

Parameters:
  S3BucketName:
    Type: String
    Description: Name of the S3 bucket for file uploads
    AllowedPattern: ^[a-z0-9][a-z0-9-]*[a-z0-9]$
    MinLength: 1
    MaxLength: 63

  EmailAddress:
    Type: String
    Description: Email address to receive notifications
    AllowedPattern: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+.[a-zA-Z]{2,}$

Resources:

  # S3 Bucket
  S3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Ref S3BucketName
      NotificationConfiguration:
        LambdaConfigurations:
          - Event: 's3:ObjectCreated:*'
            Function: !GetAtt ProcessingLambda.Arn
            Filter:
              S3Key:
                Rules:
                  - Name: suffix
                    Value: .csv

  # SNS Topic
  SNSTopic:
    Type: AWS::SNS::Topic
    Properties:
      Subscription:
        - Endpoint: !Ref EmailAddress
          Protocol: email

  # Lambda Function
  ProcessingLambda:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: ProcessingLambda
      Runtime: python3.9
      Handler: index.lambda_handler
      Role: !GetAtt LambdaRole.Arn
      Environment:
        Variables:
          SNS_TOPIC_ARN: !Ref SNSTopic
      Code:
        ZipFile: |
          import boto3
          import os

          def lambda_handler(event, context):
              s3 = boto3.client('s3')
              sns = boto3.client('sns')
              sns_topic_arn = os.environ['SNS_TOPIC_ARN']

              # Process the uploaded file
              for record in event['Records']:
                  bucket_name = record['s3']['bucket']['name']
                  object_key = record['s3']['object']['key']

                  # Process the file data
                  # ...

              # Send notification upon successful processing
              sns.publish(
                  TopicArn=sns_topic_arn,
                  Message='File processing completed successfully',
                  Subject='Data Processing Notification'
              )

              return {
                  'statusCode': 200,
                  'body': 'File processing completed successfully'
              }

  # Lambda Role
  LambdaRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: 'sts:AssumeRole'
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
      Policies:
        - PolicyName: S3Access
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - 's3:GetObject'
                Resource: !Join ['', ['arn:aws:s3:::', !Ref S3BucketName, '/*']]
        - PolicyName: SNSPublish
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - 'sns:Publish'
                Resource: !Ref SNSTopic

  # Lambda Permissions
  LambdaPermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !GetAtt ProcessingLambda.Arn
      Action: 'lambda:InvokeFunction'
      Principal: s3.amazonaws.com
      SourceAccount: !Ref AWS::AccountId
      SourceArn: !Join ['', ['arn:aws:s3:::', !Ref S3BucketName]]

Outputs:

  S3BucketName:
    Description: Name of the S3 bucket for file uploads
    Value: !Ref S3Bucket
    Export:
      Name: !Sub '${AWS::StackName}-S3BucketName'

  LambdaFunctionArn:
    Description: ARN of the Lambda function
    Value: !GetAtt ProcessingLambda.Arn
    Export:
      Name: !Sub '${AWS::StackName}-LambdaFunctionArn'

  SNSTopicArn:
    Description: ARN of the SNS topic for notifications
    Value: !Ref SNSTopic
    Export:
      Name: !Sub '${AWS::StackName}-SNSTopicArn'

Additional examples

We have provided two more sample diagrams, their associated CloudFormation code generated by Anthropic’s Claude 3 Sonnet, and the prompts used to create them. You can see how diagrams in various forms, from digital to hand-drawn, or some combination, can be used. The end-to-end analysis of these samples can be found at sample 2 and sample 3 on the GitHub repo.

Best practices for architecture to code

In the demonstrated use case, you can observe how well the Anthropic’s Claude 3 Sonnet model could pull details and relationships between services from an architecture image. The following are some ways you can improve the performance of Anthropic’s Claude in this use case:

  • Implement a multimodal RAG approach to enhance the application’s ability to handle a wider variety of complex architecture diagrams, because the current implementation is limited to diagrams similar to the provided static examples.
  • Enhance the architecture diagrams by incorporating visual cues and features, such as labeling services, indicating orchestration hierarchy levels, grouping related services at the same level, enclosing services within clear boxes, and labeling arrows to represent the flow between services. These additions will aid in better understanding and interpreting the diagrams.
  • If the application generates an invalid CloudFormation template, provide the error as update instructions. This will help the model understand the mistake and make a correction.
  • Use Anthropic’s Claude 3 Opus or Anthropic’s Claude 3.5 Sonnet for greater performance on long contexts in order to support near-perfect recall
  • With careful design and management, orchestrate agentic workflows by using Agents for Amazon Bedrock. This enables you to incorporate self-reflection, tool use, and planning within your workflow to generate more relevant CloudFormation templates.
  • Use Amazon Bedrock Prompt Flows to accelerate the creation, testing, and deployment of workflows through an intuitive visual interface. This can reduce development effort and accelerate workflow testing.

Clean up

To clean up the resources used in this demo, complete the following steps:

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Select the deployed yaml development.yaml stack and choose Delete.

Conclusion

With the pattern demonstrated with Anthropic’s Claude 3 Sonnet, developers can effortlessly translate their architectural visions into reality by simply sketching their desired cloud solutions. Anthropic’s Claude 3 Sonnet’s advanced image understanding capabilities will analyze these diagrams and generate boilerplate CloudFormation code, minimizing the need for initial complex coding tasks. This visually driven approach empowers developers from a variety of skill levels, fostering collaboration, rapid prototyping, and accelerated innovation.

You can investigate other patterns, such as including RAG and agentic workflows, to improve the accuracy of code generation. You can also explore customizing the LLM by fine-tuning it to write CloudFormation code with greater flexibility.

Now that you have seen Anthropic’s Claude 3 Sonnet in action, try designing your own architecture diagrams using some of the best practices to take your prototyping to the next level.

For additional resources, refer to the :


About the Authors

Author 1 Eashan KaushikEashan Kaushik is an Associate Solutions Architect at Amazon Web Services. He is driven by creating cutting-edge generative AI solutions while prioritizing a customer-centric approach to his work. Before this role, he obtained an MS in Computer Science from NYU Tandon School of Engineering. Outside of work, he enjoys sports, lifting, and running marathons.

Author 2 Chris PecoraChris Pecora is a Generative AI Data Scientist at Amazon Web Services. He is passionate about building innovative products and solutions while also focusing on customer-obsessed science. When not running experiments and keeping up with the latest developments in generative AI, he loves spending time with his kids.

Read More

How Northpower used computer vision with AWS to automate safety inspection risk assessments

How Northpower used computer vision with AWS to automate safety inspection risk assessments

This post is co-written with Andreas Astrom from Northpower.

Northpower provides reliable and affordable electricity and fiber internet services to customers in the Northland region of New Zealand. As an electricity distributor, Northpower aims to improve access, opportunity, and prosperity for its communities by investing in infrastructure, developing new products and services, and giving back to shareholders. Additionally, Northpower is one of New Zealand’s largest infrastructure contractors, serving clients in transmission, distribution, generation, and telecommunications. With over 1,400 staff working across 14 locations, Northpower plays a crucial role in maintaining essential services for customers driven by a purpose of connecting communities and building futures for Northland.

The energy industry is at a critical turning point. There is a strong push from policymakers and the public to decarbonize the industry, while at the same time balancing energy resilience with health, safety, and environmental risk. Recent events including Tropical Cyclone Gabrielle have highlighted the susceptibility of the grid to extreme weather and emphasized the need for climate adaptation with resilient infrastructure. Electricity Distribution Businesses (EDBs) are also facing new demands with the integration of decentralized energy resources like rooftop solar as well as larger-scale renewable energy projects like solar and wind farms. These changes call for innovative solutions to ensure operational efficiency and continued resilience.

In this post, we share how Northpower has worked with their technology partner Sculpt to reduce the effort and carbon required to identify and remediate public safety risks. Specifically, we cover the computer vision and artificial intelligence (AI) techniques used to combine datasets into a list of prioritized tasks for field teams to investigate and mitigate. The resulting dashboard highlighted that 141 power pole assets required action, out of a network of 57,230 poles.

Northpower challenge

Utility poles have stay wires that anchor the pole to the ground for extra stability. These stay wires are meant to have an inline insulator to avoid the situation of the stay wire becoming live, which would create a safety risk for person or animal in the area.

Northpower faced a significant challenge in determining how many of their 57,230 power poles have stay wires without insulators. Without reliable historical data, manual inspections of such a vast and predominantly rural network is labor-intensive and costly. Alternatives like helicopter surveys or field technicians require access to private properties for safety inspections, and are expensive. Moreover, the travel requirement for technicians to physically visit each pole across such a large network posed a considerable logistical challenge, emphasizing the need for a more efficient solution.

Thankfully, some asset datasets were available in digital format, and historical paper-based inspection reports, dating back 20 years, were available in scanned format. This archive, along with 765,933 varied-quality inspection photographs, some over 15 years old, presented a significant data processing challenge. Processing these images and scanned documents is not a cost- or time-efficient task for humans, and requires highly performant infrastructure that can reduce the time to value.

Solution overview

Amazon SageMaker is a fully managed service that helps developers and data scientists build, train, and deploy machine learning (ML) models. In this solution, the team used Amazon SageMaker Studio to launch an object detection model available in Amazon SageMaker JumpStart using the PyTorch framework.

The following diagram illustrates the high-level workflow.

Northpower chose SageMaker for a number of reasons:

  • SageMaker Studio is a managed service with ready-to-go development environments, saving time otherwise used for setting up environments manually
  • SageMaker JumpStart took care of the setup and deployed the required ML jobs involved in the project with minimal configuration, further saving development time
  • The integrated labeling solution with Amazon SageMaker Ground Truth was suitable for large-scale image annotations and simplified the collaboration with a Northpower labeling workforce

In the following sections, we discuss the key components of the solution as illustrated in the preceding diagram.

Data preparation

SageMaker Ground Truth employs a human workforce made up of Northpower volunteers to annotate a set of 10,000 images. The workforce created a bounding box around stay wires and insulators and the output was subsequently used to train an ML model.

Model training, validation, and storage

This component uses the following services:

  • SageMaker Studio is used to access and deploy a pre-trained object detection model and develop code on managed Jupyter notebooks. The model was then fine-tuned with training data from the data preparation stage. For a step-by-step guide to set up SageMaker Studio, refer to Amazon SageMaker simplifies the Amazon SageMaker Studio setup for individual users.
  • SageMaker Studio runs custom Python code to augment the training data and transform the metadata output from SageMaker Ground Truth into a format supported by the computer vision model training job. The model is then trained using a fully managed infrastructure, validated, and published to the Amazon SageMaker Model Registry.
  • Amazon Simple Storage Service (Amazon S3) stores the model artifacts and creates a data lake to host the inference output, document analysis output, and other datasets in CSV format.

Model deployment and inference

In this step, SageMaker hosts the ML model on an endpoint used to run inferences.

A SageMaker Studio notebook was used again post-inference to run custom Python code to simplify the datasets and render bounding boxes on objects based on criteria. This step also applied a custom scoring system that was also rendered onto the final image, and this allowed for an additional human QA step for low confidence images.

Data analytics and visualization

This component includes the following services:

  • An AWS Glue crawler is used to understand the dataset structures stored in the data lake so that it can be queried by Amazon Athena
  • Athena allows the use of SQL to combine the inference output and asset datasets to find highest risk items
  • Amazon QuickSight was used as the tool for both the human QA process and for determining which assets needed a field technician to be sent for physical inspection

Document understanding

In the final step, Amazon Textract digitizes historical paper-based asset assessments and stores the output in CSV format.

Results

The trained PyTorch object detection model enabled the detection of stay wires and insulators on utility poles, and a SageMaker postprocessing job calculated a risk score using an m5.24xlarge Amazon Elastic Compute Cloud (EC2) instance with 200 concurrent Python threads. This instance was also responsible for rendering the score information along with an object bounding box onto an output image, as shown in the following example.

Writing the confidence scores into the S3 data lake alongside the historical inspection results allowed Northpower to run analytics using Athena to understand each classification of image. The sunburst graph below is a visualization of this classification.

Northpower categorized 1,853 poles as high priority risks, 3,922 as medium priority, 36,260 as low priority, and 15,195 as the lowest priority. These were viewable in the QuickSight dashboard and used as an input for humans to review the highest risk assets first.

At the conclusion of the analysis, Northpower found that 31 poles needed stay wire insulators installed and a further 110 poles needed investigation in the field. This significantly reduced the cost and carbon usage involved in manually checking every asset.

Conclusion

Remote asset inspecting remains a challenge for regional EDBs, but using computer vision and AI to uncover new value from data that was previously unused was key to Northpower’s success in this project. SageMaker JumpStart provided deployable models that could be trained for object detection use cases with minimal data science knowledge and overhead.

Discover the publicly available foundation models offered by SageMaker JumpStart and fast-track your own ML project with the following step-by-step tutorial.


About the authors

Scott Patterson is a Senior Solutions Architect at AWS.

Andreas Astrom is the Head of Technology and Innovation at Northpower

Read More

Bon Voyage: NIO Unveils ONVO L60 Smart Electric SUV, Built on NVIDIA DRIVE Orin

Bon Voyage: NIO Unveils ONVO L60 Smart Electric SUV, Built on NVIDIA DRIVE Orin

NIO’s smart EV brand, ONVO, has unveiled the L60 flagship mid-size family SUV, built on the NVIDIA DRIVE Orin system-on-a-chip.

Earlier this year, the automaker introduced the ONVO brand — which stands for On Voyage — to reinforce its commitment to bringing safer, smarter, and more enjoyable and affordable mobility solutions for the mainstream family market.

NVIDIA DRIVE Orin serves as the AI brain of ONVO’s smart-driving system — known as OSD — and delivers up to 254 trillion operations per second of high-performance compute. This allows for diverse and redundant processing of sensor data from the L60’s vision-based sensor suite of high-definition cameras (with maximum forward detection of 687 meters) and 4D radar (with maximum detection range of 370 meters).

The automotive-grade NVIDIA DRIVE Orin runs NVIDIA DriveOS, an operating system for safe, AI-defined vehicles, and is widely used by leading global automakers, including in NIO’s ET7 sedan and its ET5 and ES7 models.

NVIDIA DRIVE Orin enables highly automated driver assistance and autonomous driving systems, along with other features that can be updated via over-the-air software updates.

Bold Design Built to Go the Distance

The ONVO L60 embodies NIO’s commitment to innovative design and user-centric features, blending sleek style with cutting-edge technology.

The L60 was developed to elevate driving experiences by incorporating six critical features: comprehensive safety, spacious comfort, smart cabins with immersive digital experiences, impressive mileage range with convenient recharging, superior ride and handling, and advanced assisted driving capabilities.

The L60 base model (which costs approximately ¥149, 900 yuan, equivalent to $21,000) comes without a battery, allowing users to opt for NIO’s battery-as-a-service option, which is used by 70% of NIO customers. ONVO vehicles are compatible with NIO’s battery swap network, which includes more than 2,500 Power Swap Stations and is expected to continue expanding throughout China.

The launch of the ONVO L60 marks the latest in NIO and NVIDIA’s decade of collaboration. With the integration of NVIDIA DRIVE, the ONVO L60 is poised to deliver an advanced driving experience at an affordable price.

Read More

No Free Lunch in LLM Watermarking: Trade-offs in Watermarking Design Choices

No Free Lunch in LLM Watermarking: Trade-offs in Watermarking Design Choices

Advances in generative models have made it possible for AI-generated text, code, and images to mirror human-generated content in many applications. Watermarking, a technique that embeds information in the output of a model to verify its source, aims to mitigate the misuse of such AI-generated content. Current state-of-the-art watermarking schemes embed watermarks by slightly perturbing probabilities of the LLM’s output tokens, which can be detected via statistical testing during verification.

Unfortunately, our work shows that common design choices in LLM watermarking schemes make the resulting systems surprisingly susceptible to watermark removal or spoofing attacks—leading to fundamental trade-offs in robustness, utility, and usability. To navigate these trade-offs, we rigorously study a set of simple yet effective attacks on common watermarking systems and propose guidelines and defenses for LLM watermarking in practice.

Prompt Alan Turing was born in …
Unwatermarked
Z-Score: 0.16 
PPL: 3.19
Alan Turing was born in 1912 and died in 1954. He was an English mathematician, logician, cryptanalyst, and computer scientist. In 1938, Turing joined the Government Code and Cypher School (GC&CS), where he contributed to the design of the bombe, a machine that was used to decipher the Enigma-enciphered messages
Watermarked
Z-Score: 5.98 
PPL: 4.46
Alan Turing was born in 1912 and died in 1954, at the age of 41. He was the brilliant British scientist and mathematician who is largely credited with being the father of modern computer science. He is known for his contributions to mathematical biology and chemistry. He was also one of the pioneers of computer science
(a) Piggyback spoofing attack
Exploiting robustness
Z-Score: 5.98 
Alan Turing was born in 1950 and died in 1994, at the age of 43. He was the brilliant American scientist and mathematician who is largely credited with being the father of modern computer science. He is known for his contributions to mathematical biology and musicology. He was also one of the pioneers of computer science
(b) Watermark-removal attack
Exploiting multiple keys
Z-Score: 2.40 
PPL: 4.05
Alan Turing was born in 1912 and died in 1954He was a mathematician, logician, cryptologist and theoretical computer scientistHe is famous for his work on code-breaking and artificial intelligence, and his contribution to the Allied victory in World War II. Turing was born in London. He showed an interest in mathematics
(c) Watermark-removal attack
Exploiting public detection API
Z-Score: 1.47 
PPL: 4.57
Alan Turing was born in 1912 and died in 1954. He was an English mathematician, computer scientistcryptanalyst and philosopher. Turing was a leading mathematician and cryptanalyst. He was one of the key players in cracking the German Enigma Code during World War II. He also came up with the Turing Machine
Table 1. Examples generated using LLAMA-2-7B with/without the KGW watermark under various attacks. The watermark will split the vocabulary into green and red lists and give preference for words in the green list. Z-score reflects the detection confidence of the watermark, and perplexity (PPL) measures text quality. (a) In the piggyback spoofing attack, we exploit watermark robustness by generating incorrect content that appears as watermarked (matching the z-score of the watermarked baseline), potentially damaging the reputation of the LLM. Incorrect tokens modified by the attacker are marked in orange and watermarked tokens in blue(b-c) In watermark-removal attacks, attackers can effectively lower the z-score below the detection threshold while preserving a high sentence quality (low PPL) by exploiting either the (b) use of multiple keys or (c) publicly available watermark detection APIs.

What is LLM Watermarking?

Similar to image watermarks, LLM watermarking embeds invisible secret patterns into the text. Here, we briefly introduce LLMs and LLM watermarks. We use (x) to denote a sequence of tokens, (x_i in mathcal{V}) represents the (i)-th token in the sequence, and (mathcal{V}) is the vocabulary. (M_{text{orig}}) denotes the original model without a watermark, (M_{text{wm}}) is the watermarked model, and (sk in mathcal{S}) is the watermark secret key sampled from (mathcal{S}).

Language Model. State-of-the-art (SOTA) LLMs are auto-regressive models, which predict the next token based on the prior tokens. We define language models more formally below:

Definition 1 (Language Model). A language model (LM) without a watermark is a mapping:
$$ M_{text{orig}}: mathcal{V}^* rightarrow mathcal{V}, $$
where the input is a sequence of length (t) tokens (x). (M_{text{orig}}(textbf{x})) first returns the probability distribution for the next token, then (x_{t+1}) is sampled from this distribution.

Figure 1. LLM predicts the next tokens auto-regressively.

Watermarks for LLMs. We focus on SOTA decoding-based robust watermarking schemes including KGWUnigram, and Exp. In each of these methods, the watermarks are embedded by perturbing the output distribution of the original LLM. The perturbation is determined by secret watermark keys held by the LLM owner. Formally, we define the watermarking scheme:

Definition 2 (Watermarked LLMs). The watermarked LLM takes token sequence (x in mathcal{V}^* ) and secret key (sk in mathcal{S}) as input, and outputs a perturbed probability distribution for the next token. The perturbation is determined by (sk):

$$M_{text{wm}} : mathcal{V}^* times mathcal{S} rightarrow mathcal{V}$$

The watermark detection outputs the statistical testing score for the null hypothesis that the input token sequence is independent of the watermark secret key: $$f_{text{detection}} : mathcal{V}^* times mathcal{S} rightarrow mathbb{R}$$ The output score reflects the confidence of the watermark’s existence in the input.

Figure 2. Watermark is embedded by perturbing the probability distribution of the next token. The perturbation is determined by the secret key (sk).

Common Design Choices of LLM Watermarks

There are a number of common design choices in existing LLM watermarking schemes, including robustness, the use of multiple keys, and public detection APIs that have clear benefits to enhance watermark security and usability. We describe these key design choices briefly below, and then explain how an attacker can easily take advantage of these design choices to launch watermark removal or spoofing attacks.

Robustness. The goal of developing a watermark that is robust to output perturbations is to defend against watermark removal, which may be used to circumvent detection schemes for applications such as phishing or fake news generation. Robust watermark designs have been the topic of many recent works. A more robust watermark can better defend against watermark-removal attacks. However, our work shows that robustness can also enable piggyback spoofing attacks.


Multiple Keys. Many works have explored the possibility of launching watermark stealing attacks to infer the secret pattern of the watermark, which can then boost the performance of spoofing and removal attacks. A natural and effective defense against watermark stealing is using multiple watermark keys during embedding, which can improve the unbiasedness property of the watermark (it is called distortion-free in the Exp watermark). Rotating multiple secret keys is a common practice in cryptography and is also suggested by prior watermarks. More keys being used during embedding indicates better watermark unbiasedness and thus it becomes more difficult for attackers to infer the watermark pattern. However, we show that using multiple keys can also introduce new watermark-removal vulnerabilities.


Public Detection API. It is still an open question whether watermark detection APIs should be made publicly available to users. Although this makes it easier to detect watermarked text, it is commonly acknowledged that it will make the system vulnerable to attacks. We study this statement more precisely by examining the specific risk trade-offs that exist, as well as introducing a novel defense that may make the public detection API more feasible in practice.

Attacks, Defenses, and Guidelines

Although the above design choices are beneficial for enhancing the security and usability of watermarking systems, they also introduce new vulnerabilities. Our work studies a set of simple yet effective attacks on common watermarking systems and propose guidelines and defenses for LLM watermarking in practice.

In particular, we study two types of attacks—watermark-removal attacks and (piggyback or general) spoofing attacks. In the watermark-removal attack, the attacker aims to generate a high-quality response from the LLM without an embedded watermark. For the spoofing attacks, the goal is to generate a harmful or incorrect output that has the victim organization’s watermark embedded.

Piggyback Spoofing Attack Exploiting Robustness

More robust watermarks can better defend against editing attacks, but this seemingly desirable property can also be easily misused by malicious users to launch simple piggyback spoofing attacks. In piggyback spoofing attacks, a small portion of toxic or incorrect content is inserted into the watermarked material, making it seem like it was generated by a specific watermarked LLM. The toxic content will still be detected as watermarked, potentially damaging the reputation of the LLM service provider.

Attack Procedure. (i) The attacker queries the target watermarked LLM to receive a high-entropy watermarked sentence (x_{wm}), (ii) The attacker edits (x_{wm}) and forms a new piece of text (x’) and claims that (x’) is generated by the target LLM. The editing method can be defined by the attacker. Simple strategies could include inserting toxic tokens into the watermarked sentence (x_{wm}), or editing specific tokens to make the output inaccurate. As we show, editing can be done at scale by querying another LLM like GPT4 to generate fluent output.

Results. We show that piggyback spoofing can generate fluent, watermarked, but inaccurate results at scale. Specifically, we edit the watermarked sentence by querying GPT4 using the prompt "Modify less than 3 words in the following sentence and make it inaccurate or have opposite meanings."

Figure 3. Piggyback spoofing on KGW and LLAMA-2-7B. Lower perplexity (PPL) indicates higher sentence quality. Higher z-score reflects higher confidence in watermarking.

Figure 3 shows that we can successfully generate fluent results, with a slightly higher PPL. 94.17% of the piggyback results have a z-score higher than the detection threshold 4. We randomly sample 100 piggyback results and manually check that most of them (92%) are fluent and have inaccurate or opposite content from the original watermarked content. We present a spoofing sentence below. For more results, please check our manuscript.

Watermarked content, z-score: 4.93, PPL: 4.61
Earth has a history of 4.5 billion years and humans have been around for 200,000 years. Yet humans have been using computers for just over 70 years and even then the term was first used in 1945. In the age of technology, we are still just getting started. The first computer, ENIAC (Electronic Numerical Integrator And Calculator), was built at the University of Pennsylvania between 1943 and 1946. The ENIAC took up 1800 sq ft and had 18,000 vacuum tube and mechanical parts. The ENIAC was used for mathematical calculations, ballistics, and code breaking. The ENIAC was 1000 times faster than any other calculator of the time. The first computer to run a program was the Z3, built by Konrad Zuse at his house.

Piggyback attack, z-score: 4.36, PPL: 5.68
Earth has a history of 4.5 billion years and humans have been around for 200,000 years. Yet humans have been using computers for just over 700 years and even then the term was first used in 1445. In the age of technology, we are still just getting started. The first computer, ENIAC (Electronic Numerical Integrator And Calculator), was built at the University of Pennsylvania between 1943 and 1946. The ENIAC took up 1800 sq ft and had 18,000 vacuum tube and mechanical parts. The ENIAC was used for mathematical calculations, ballistics, and code breaking. The ENIAC was 1000 times slower than any other calculator of the time. The first computer to run a program was the Z3, built by Konrad Zuse at his house.

Discussion. Piggyback spoofing attacks are easy to execute in practice. Robust LLM watermarks typically do not consider such attacks during design and deployment, and existing robust watermarks are inherently vulnerable to such attacks. We consider this attack to be challenging to defend against, especially considering the examples presented above, where by only editing a single token, the entire content becomes incorrect. It is hard, if not impossible, to detect whether a particular token is from the attacker by using robust watermark detection algorithms. Recently, researchers proposed publicly detectable watermarks that plant a cryptographic signature into the generated sentence [Fairoze et al. 2024]. They mitigate such piggyback spoofing attacks at the cost of sacrificing robustness. Thus, practitioners should weigh the risks of removal vs. piggyback spoofing attacks for the model at hand.

Guideline #1. Robust watermarks are vulnerable to spoofing attacks and are not suitable as proof of content authenticity alone. To mitigate spoofing while preserving robustness, it may be necessary to combine additional measures such as signature-based fragile watermarks.


Watermark-Removal Attack Exploiting Multiple Keys

SOTA watermarking schemes aim to ensure the watermarked text retains its high quality and the private watermark patterns are not easily distinguished by maintaining an unbiasedness property: $$|mathbb{E}_{sk in mathcal{S}}[M_{text{wm}}(textbf{x}, sk)] – M_text{orig}(textbf{x})| = O(epsilon),$$ i.e., the expected distribution of watermarked output over the watermark key space (sk in S) is close to the output distribution without a watermark, differing by a distance of (epsilon). Exp is rigorously unbiased, and KGW and Unigram slightly shift the watermarked distributions.

We consider multiple keys to be used during watermark embedding to defend against watermark stealing attacks. The insight of our proposed watermark-removal attack is that, given the “unbiased” nature of watermarks, malicious users can estimate the output distribution without any watermark by repeatedly querying the watermarked LLM using the same prompt. As this attack estimates the original, unwatermarked distribution, the quality of the generated content is preserved.

Attack Procedure. (i) An attacker queries a watermarked model with an input (x) multiple times, observing (n) subsequent tokens (x_{t+1})​. (ii) The attacker then creates a frequency histogram of these tokens and samples according to the frequency. This sampled token matches the result of sampling on an unwatermarked output distribution with a nontrivial probability. Consequently, the attacker can progressively eliminate watermarks while maintaining a high quality of the synthesized content.

Results. We study the trade-off between resistance against watermark stealing and watermark-removal attacks by evaluating a recent watermark stealing attack [Jovanović et al. 2024], where we query the watermarked LLM to obtain 2.2 million tokens to “steal” the watermark pattern and then launch spoofing attacks using the estimated watermark pattern. In our watermark removal attack, we consider that the attacker has observations with different keys.

Figure 4a. Z-Score and attack success rate (ASR) of watermark stealing on KGW watermark and LLAMA-2-7B model with different watermark keys (n).
Figure 4b. Z-Score and attack success rate (ASR) of watermark-removal on KGW watermark and LLAMA-2-7B model with different watermark keys (n).
Figure 4c. Perplexity (PPL) of watermark-removal on KGW watermark and LLAMA-2-7B model with different watermark keys (n).

As shown in Figure 4a, using multiple keys can effectively defend against watermark stealing attacks. With a single key, the ASR is 91%. We observe that using three keys can effectively reduce the ASR to 13%, and using more than 7 keys, the ASR of the watermark stealing is close to zero. However, using more keys also makes the system vulnerable to our watermark-removal attacks as shown in Figure 4b. When we use more than 7 keys, the detection scores of the content produced by our watermark removal attacks closely resemble those of unwatermarked content and are much lower than the detection thresholds, with ASRs higher than 97%. Figure 4c suggests that using more keys improves the quality of the output content. This is because, with a greater number of keys, there is a higher probability for an attacker to accurately estimate the unwatermarked distribution.

Discussion. Many prior works have suggested using multiple keys to defend against watermark stealing attacks. However, we reveal that a conflict exists between improving resistance to watermark stealing and the feasibility of removing watermarks. Our results show that finding a “sweet spot” in terms of the number of keys to use to mitigate both the watermark stealing and the watermark-removal attacks is not trivial. Given this tradeoff, we suggest that LLM service providers consider “defense-in-depth” techniques such as anomaly detection, query rate limiting, and user identification verification.

Guideline #2. Using a larger number of watermarking keys can defend against watermark stealing attacks, but increases vulnerability to watermark-removal attacks. Limiting users’ query rates can help to mitigate both attacks.


Attacks Exploiting Public Detection APIs

Fially, we show that publicly available detection APIs can enable both spoofing and removal attacks. The insight is that by querying the detection API, the attacker can gain knowledge about whether a specific token is carrying the watermark or not. Thus, the attacker can select the tokens based on the detection result to launch spoofing and removal attacks.

Attack Procedure. (i) An attacker feeds a prompt into the watermarked LLM (removal attack) or into a local LLM (spoofing attack), which generates the response in an auto-regressive manner. For the token (x_i)​ the attacker will generate a list of possible replacements for (x_i)​. (ii) The attacker will query the detection using these replacements and sample a token based on their probabilities and detection scores to remove or spoof the watermark while preserving a high output quality. This replacement list can be generated by querying the watermarked LLM, querying a local model, or simply returned by the watermarked LLM (e.g., enabled by OpenAI’s API top_logprobs=5).

Results. We evaluate the detection scores for both the watermark-removal and the spoofing attacks. Furthermore, for the watermark-removal attack, where the attackers care more about the output quality, we report the output PPL.

Figure 5a. Z-Score/P-Value of watermark removal using detection APIs on LLAMA-2-7B and different watermarks.
Figure 5b. The perplexity of watermark removal using detection APIs on LLAMA-2-7B and different watermarks.
Figure 5c. Z-Score/P-Value of watermark spoofing using detection APIs on LLAMA-2-7B and different watermarks.

As shown in Figure 5a and Figure 5b, watermark-removal attacks exploiting the detection API significantly reduce detection confidence while maintaining high output quality. For instance, for the KGW watermark on LLAMA-2-7B model, we achieve a median z-score of 1.43, which is much lower than the threshold 4. The PPL is also close to the watermarked outputs (6.17 vs. 6.28). We observe that the Exp watermark has higher PPL than the other two watermarks. A possible reason is that Exp watermark is deterministic, while other watermarks enable random sampling during inference. Our attack also employs sampling based on the token probabilities and detection scores, thus we can improve the output quality for the Exp watermark. The spoofing attacks also significantly boost the detection confidence even though the content is not from the watermarked LLM, as depicted in Figure 5c.

Defending Detection with Differential Privacy. In light of the issues above, we propose an effective defense using ideas from differential privacy (DP) to counteract the detection API based spoofing attacks. DP adds random noise to function results evaluated on private datasets such that the results from neighboring datasets are indistinguishable. Similarly, we consider adding Gaussian noise to the distance score in the watermark detection, making the detection ((epsilon, delta))-DP, and ensuring that attackers cannot tell the difference between two queries by replacing a single token in the content, thus increasing the hardness of launching the attacks.

Figure 6a. Spoofing attack success rate (ASR) and detection accuracy (ACC) without and with DP watermark detection under different noise parameters.
Figure 6b. Z-scores of original text without attack, spoofing attack without DP defense, and spoofing attacks with DP defense. We use the best (sigma=4) from Figure 6a.

As shown in Figure 6, with a noise scale of (sigma=4), the DP detection’s accuracy drops from the original 98.2% to 97.2% on KGW and LLAMA-2-7B, while the spoofing ASR becomes 0%. The results are consistent for other watermarks and models.

Discussion. The detection API, available to the public, aids users in differentiating between AI and human-created materials. However, it can be exploited by attackers to gradually remove watermarks or launch spoofing attacks. We propose a defense utilizing the ideas in differential privacy. Even though the attacker can still obtain useful information by increasing the detection sensitivity, our defense significantly increases the difficulty of spoofing attacks. However, this method is less effective against watermark-removal attacks that exploit the detection API because attackers’ actions will be close to random sampling, which, even though with lower success rates, remains an effective way of removing watermarks. Therefore, we leave developing a more powerful defense mechanism against watermark-removal attacks exploiting detection API as future work. We recommend companies providing detection services should detect and curb malicious behavior by limiting query rates from potential attackers, and also verify the identity of the users to protect against Sybil attacks.

Guideline #3. Public detection APIs can enable both spoofing and removal attacks. To defend against these attacks, we propose a DP-inspired defense, which combined with techniques such as anomaly detection, query rate limiting, and user identification verification can help to make public detection more feasible in practice.

Conclusion

Although LLM watermarking is a promising tool for auditing the usage of LLM-generated text, fundamental trade-offs exist in the robustness, usability, and utility of existing approaches. In particular, our work shows that it is easy to take advantage of common design choices in LLM watermarks to launch attacks that can easily remove the watermark or generate falsely watermarked text. Our study finds that these vulnerabilities are common to existing LLM watermarks and provides caution for the field in deploying current solutions in practice without carefully considering the impact and trade-offs of watermarking design choices. To establish more reliable future LLM watermarking systems, we also suggest guidelines for designing and deploying LLM watermarks along with possible defenses motivated by the theoretical and empirical analyses of our attacks. For more results and discussions, please see our manuscript.

Read More