Improve Amazon Nova migration performance with data-aware prompt optimization

In the era of generative AI, new large language models (LLMs) are continually emerging, each with unique capabilities, architectures, and optimizations. Among these, Amazon Nova foundation models (FMs) deliver frontier intelligence and industry-leading cost-performance, available exclusively on Amazon Bedrock. Since its launch in 2024, generative AI practitioners, including the teams in Amazon, have started transitioning their workloads from existing FMs and adopting Amazon Nova models.

However, when transitioning between different foundation models, the prompts created for your original model might not be as performant for Amazon Nova models without prompt engineering and optimization. Amazon Bedrock prompt optimization offers a tool to automatically optimize prompts for your specified target models (in this case, Amazon Nova models). It can convert your original prompts to Amazon Nova-style prompts. Additionally, during the migration to Amazon Nova, a key challenge is making sure that performance after migration is at least as good as or better than prior to the migration. To mitigate this challenge, thorough model evaluation, benchmarking, and data-aware optimization are essential, to compare the Amazon Nova model’s performance against the model used before the migration, and optimize the prompts on Amazon Nova to align performance with that of the previous workload or improve upon them.

In this post, we present an LLM migration paradigm and architecture, including a continuous process of model evaluation, prompt generation using Amazon Bedrock, and data-aware optimization. The solution evaluates the model performance before migration and iteratively optimizes the Amazon Nova model prompts using user-provided dataset and objective metrics. We demonstrate successful migration to Amazon Nova for three LLM tasks: text summarization, multi-class text classification, and question-answering implemented by Retrieval Augmented Generation (RAG). We also discuss the lessons learned and best practices for you to implement the solution for your real-world use cases.

Migrating your generative AI workloads to Amazon Nova

Migrating the model from your generative AI workload to Amazon Nova requires a structured approach to achieve performance consistency and improvement. It includes evaluating and benchmarking the old and new models, optimizing prompts on the new model, and testing and deploying the new models in your production. In this section, we present a four-step workflow and a solution architecture, as shown in the following architecture diagram.

model migration process

The workflow includes the following steps:

  1. Evaluate the source model and collect key performance metrics based on your business use case, such as response accuracy, response format correctness, latency, and cost, to set a performance baseline as the model migration target.
  2. Automatically update the structure, instruction, and language of your prompts to adapt to the Amazon Nova model for accurate, relevant, and faithful outputs. We will discuss this more in the next section.
  3. Evaluate the optimized prompts on the migrated Amazon Nova model to meet the performance target defined in Step 1. You can conduct the optimization in Step 2 as an iterative process until the optimized prompts meet your business criteria.
  4. Conduct A/B testing to validate the Amazon Nova model performance in your testing and production environment. When you’re satisfied, you can deploy the Amazon Nova model, settings, and prompts in production. 

This four-step workflow needs to run continuously, to adapt to variations in both the model and the data, driven by the changes in business use cases. The continuous adaptation provides ongoing optimization and helps maximize overall model performance.

Data-aware prompt optimization on Amazon Nova

In this section, we present a comprehensive optimization methodology, taking two steps. The first step is to use Amazon Bedrock prompt optimization to refine your prompt structure, and then use an innovative data-aware prompt optimization approach to further optimize the prompt to improve the Amazon Nova model performance.

Amazon Bedrock prompt optimization

Amazon Bedrock provides a prompt optimization feature that rewrites prompts to improve performance for your use cases. Prompt optimization streamlines the way that AWS developers interact with FMs on Amazon Bedrock, automatically adapts the prompts to the selected models, and generates for better performance.

As the first step, you can use prompt optimization to adapt your prompt to Amazon Nova. By analyzing the prompt you provide, the feature interprets the task, system prompt, and instruction within the prompt, and automatically crafts the prompt with Amazon Nova specific format and appropriate words, phrases, and sentences. The following example shows how prompt optimization converts a typical prompt for a summarization task on Anthropic’s Claude Haiku into a well-structured prompt for an Amazon Nova model, with sections that begin with special markdown tags such as ## Task, ### Summarization Instructions, and ### Document to Summarize.

Model Prompt
Anthropic’s Claude 3 Haiku Human:
Act like you are an intelligent AI assistant. You are required to provide a summarization based on given document. Please use below instructions when generating the response.
The document is provided in <doc></doc> XML tags.
Please be brief and concise in your answer. Do not add any information that is not mentioned in the document.
Do not provide any preamble and directly start with the summarization.
Do not make up the answer, If you don’t know the answer, just say that I don’t know.<doc>{document}</doc>Assistant:
Amazon Nova Lite with Amazon
Bedrock prompt optimization
### Task
Your task is to summarize the given document enclosed in <doc></doc> tags in a brief and concise manner. Follow these guidelines:### Summarization Instructions
– Read the document carefully to understand its main points and key information.
– Identify the core ideas, arguments, and supporting details presented in the document.
– Synthesize the essential information into a clear and succinct summary.
– Use your own words to paraphrase the key points – do not copy verbatim from the original text.
– Omit any extraneous or redundant information not central to the main ideas.
– Do not introduce new information or make up content not present in the original document.
– If you cannot summarize the document due to lack of understanding, simply respond “I don’t know.”### Document to Summarize
<doc>{document}</doc>Please provide your concise summary immediately without any preamble.

We applied the preceding prompts to the Anthropic Claude 3 Haiku and Amazon Nova Lite models, respectively, using the public xsum dataset. To evaluate the model performance, because the summarization task doesn’t have a predefined ground truth, we designed an LLM judge as shown in the following prompt to validate the summarization quality:

You are an AI assistant, your task is to compare the following LLM-generated summary with the original document, rate how well it captures the key points and conveys the most critical information, on a scale of 1-5.
    
    The score should be based on the following performance criteria:
    - Consistency: characterizes the summary’s factual and logical correctness. It should stay true to the original text, not introduce additional information, and use the same terminology.
    - Relevance: captures whether the summary is limited to the most pertinent information in the original text. A relevant summary focuses on the essential facts and key messages, omitting unnecessary details or trivial information.
    - Fluency: describes the readability of the summary. A fluent summary is well-written and uses proper syntax, vocabulary, and grammar.
    - Coherence: measures the logical flow and connectivity of ideas. A coherent summary presents the information in a structured, logical, and easily understandable manner.
    
    Score 5 means the LLM-generated summary is the best summary fully aligned with the original document,
    Score 1 means the LLM-generated summary is the worst summary completely irrelevant to the original document.  

    Please also provide an explanation on why you provide the score. Keep the explanation as concise as possible.

    The LLM-generated summary is provided within the <summary> XML tag,
    The original document is provided within the <document> XML tag,

    In your response, present the score within the <score> XML tag, and the explanation within the <thinking> XML tag.

    DO NOT nest <score> and <thinking> element.
    DO NOT put any extra attribute in the <score> and <thinking> tag.
    
    <document>
    {document}
    </document>

    LLM generated summary:
    <summary>
    {summary}
    </summary>

The experiment, using 80 data samples, shows that the accuracy is improved on the Amazon Nova Lite model from 77.75% to 83.25% using prompt optimization.

Data-aware optimization

Although Amazon Bedrock prompt optimization supports the basic needs of prompt engineering, other prompt optimization techniques are available to maximize LLM performance, such as Multi-Aspect Critique, Self-Reflection, Gradient Descent and Beam Search, and Meta Prompting. Specifically, we observed requirements from users that they need to fine-tune their prompts against their optimization objective metrics they define, such as ROUGE, BERT-F1, or an LLM judge score, by using a dataset they provide. To meet these needs, we designed a data-aware optimization architecture as shown in the following diagram.

data-aware-optimization

The data-aware optimization takes inputs. The first input is the user-defined optimization objective metrics; for the summarization task discussed in the previous section, you can use the BERT-F1 score or create your own LLM judge. The second input is a training dataset (DevSet) provided by the user to validate the response quality, for example, a summarization data sample with the following format.

Source Document Summarization
Officers searched properties in the Waterfront Park and Colonsay View areas of the city on Wednesday. Detectives said three firearms, ammunition and a five-figure sum of money were recovered. A 26-year-old man who was arrested and charged appeared at Edinburgh Sheriff Court on Thursday. A man has appeared in court after firearms, ammunition and cash were seized by police in Edinburgh.
<another document ...> <another summarization ...>

The data-aware optimization uses these two inputs to improve the prompt for better Amazon Nova response quality. In this work, we use the DSPy (Declarative Self-improving Python) optimizer for the data-aware optimization. DSPy is a widely used framework for programming language models. It offers algorithms for optimizing the prompts for multiple LLM tasks, from simple classifiers and summarizers to sophisticated RAG pipelines. The dspy.MIPROv2 optimizer intelligently explores better natural language instructions for every prompt using the DevSet, to maximize the metrics you define.

We applied the MIPROv2 optimizer on top of the results optimized by Amazon Bedrock in the previous section for better Amazon Nova performance. In the optimizer, we specify the number of the instruction candidates in the generation space, use Bayesian optimization to effectively search over the space, and run it iteratively to generate instructions and few-shot examples for the prompt in each step:

# Initialize optimizer
teleprompter = MIPROv2(
    metric=metric,
    num_candidates=5,
    auto="light", 
    verbose=False,
)

With the setting of num_candidates=5, the optimizer generates five candidate instructions:

0: Given the fields `question`, produce the fields `answer`.

1: Given a complex question that requires a detailed reasoning process, produce a structured response that includes a step-by-step reasoning and a final answer. Ensure the reasoning clearly outlines each logical step taken to arrive at the answer, maintaining clarity and neutrality throughout.

2: Given the fields `question` and `document`, produce the fields `answer`. Read the document carefully to understand its main points and key information. Identify the core ideas, arguments, and supporting details presented in the document. Synthesize the essential information into a clear and succinct summary. Use your own words to paraphrase the key points without copying verbatim from the original text. Omit any extraneous or redundant information not central to the main ideas. Do not introduce new information or make up content not present in the original document. If you cannot summarize the document due to lack of understanding, simply respond "I don't know.

3: In a high-stakes scenario where you must summarize critical documents for an international legal case, use the Chain of Thought approach to process the question. Carefully read and understand the document enclosed in <doc></doc> tags, identify the core ideas and key information, and synthesize this into a clear and concise summary. Ensure that the summary is neutral, precise, and omits any extraneous details. If the document is too complex or unclear, respond with "I don't know.

4: Given the fields `question` and `document`, produce the fields `answer`. The `document` field contains the text to be summarized. The `answer` field should include a concise summary of the document, following the guidelines provided. Ensure the summary is clear, accurate, and captures the core ideas without introducing new information.

We set other parameters for the optimization iteration, including the number of trials, the number of few-shot examples, and the batch size for the optimization process:

# Optimize program
optimized_program = teleprompter.compile(
        program.deepcopy(),
        trainset=trainset,
        num_trials=7,
        minibatch_size=20,
        minibatch_full_eval_steps=7,
        max_bootstrapped_demos=2,
        max_labeled_demos=2,
        requires_permission_to_run=False,
)

When the optimization starts, MIPROv2 uses each instruction candidate along with the mini-batch of the testing dataset we provided to infer the LLM and calculate the metrics we defined. After the loop is complete, the optimizer evaluates the best instruction by using the full testing dataset and calculates the full evaluation score. Based on the iterations, the optimizer provides the improved instruction for the prompt:

Given the fields `question` and `document`, produce the fields `answer`.
The `document` field contains the text to be summarized.
The `answer` field should include a concise summary of the document, following the guidelines provided.
Ensure the summary is clear, accurate, and captures the core ideas without introducing new information.

Applying the optimized prompt, the summarization accuracy generated by the LLM judge on Amazon Nova Lite model is further improved from 83.25% to 87.75%.

We also applied the optimization process on other LLM tasks, including a multi-class text classification task, and a question-answering task using RAG. In all the tasks, our approach optimized the migrated Amazon Nova model to out-perform the Anthropic Claude Haiku and Meta Llama models before migration. The following table and chart illustrate the optimization results.

Task DevSet Evaluation Before Migration After Migration (Amazon Bedrock Prompt Optimization) After Migration (DSPy with Amazon Bedrock Prompt Optimization)
Summarization (Anthropic Claude 3 Haiku to Amazon Nova Lite) 80 samples LLM Judge 77.75 83.25 87.75
Classification (Meta Llama 3.2 3B to Amazon Nova Micro) 80 samples Accuracy 81.25 81.25 87.5
QA-RAG (Anthropic Claude 3 Haiku to Amazon Nova Lite) 50 samples Semantic Similarity 52.71 51.6 57.15

Migration results

For the text classification use case, we optimized the Amazon Nova Micro model using 80 samples, using the accuracy metrics to evaluate the optimization performance in each step. After seven iterations, the optimized prompt provides 87.5% accuracy, improved from the accuracy of 81.25% running on the Meta Llama 3.2 3B model.

For the question-answering use case, we used 50 samples to optimize the prompt for an Amazon Nova Lite model in the RAG pipeline, and evaluated the performance using a semantic similarity score. The score compares the cosine distance between the model’s answer and the ground truth answer. Comparing to the testing data running on Anthropic’s Claude 3 Haiku, the optimizer improved the score from 52.71 to 57.15 after migrating to the Amazon Nova Lite model and prompt optimization.

You can find more details of these examples in the GitHub repository.

Lessons learned and best practices

Through the solution design, we have identified best practices that can help you properly configure your prompt optimization to maximize the metrics you specify for your use case:

  • Your dataset for optimizer should be of high quality and relevancy, and well-balanced to cover the data patterns and edge cases of your use case, and nuances to minimize biases.
  • The metrics you defined as the target of optimization should be use case specific. For example, if your dataset has ground truth, then you can use statistical and programmatical machine learning (ML) metrics such as accuracy and semantic similarity If your dataset doesn’t include ground truth, a well-designed and human-aligned LLM judge can provide a reliable evaluation score for the optimizer.
  • The optimizer runs with a number of prompt candidates (parameter dspy.num_candidates) and uses the evaluation metric you defined to select the optimal prompt as the output. Avoid setting too few candidates that might miss opportunity for improvement. In the previous summarization example, we set five prompt candidates for optimizing through 80 training samples, and received good optimization performance.
  • The prompt candidates include a combination of prompt instructions and few-shot examples. You can specify the number of examples (parameter dspy.max_labeled_demos for examples from labeled samples, and parameter dspy.max_bootstrapped_demos for examples from unlabeled samples); we recommend the example number be no less than 2.
  • The optimization runs in iteration (parameter dspy.num_trials); you should set enough iterations that allow you to refine prompts based on different scenarios and performance metrics, and gradually enhance clarity, relevance, and adaptability. If you optimize both the instructions and the few-shot examples in the prompt, we recommend you set the iteration number to no less than 2, preferably between 5–10.

In your use case, if your prompt structure is complex with chain-of-thoughts or tree-of-thoughts, long instructions in the system prompt, and multiple inputs in the user prompt, you can use a task-specific class to abstract the DSPy optimizer. The class helps encapsulate the optimization logic, standardize the prompt structure and optimization parameters, and allow straightforward implementation of different optimization strategies. The following is an example of the class created for text classification task:

class Classification(dspy.Signature):

""" You are a product search expert evaluating the quality of specific search results and deciding will that lead to a buying decision or not. You will be given a search query and the resulting product information and will classify the result against a provided classification class. Follow the given instructions to classify the search query using the classification scheme

   Class Categories:

   Class Label:

   Category Label: Positive Search

   The class is chosen when the search query and the product are a full match and hence the customer experience is positive

   Category Label: Negative Search

   The class is chosen when the search query and the product are fully misaligned, meaning you searched for something but the output is completely different

   Category Label: Moderate Search

   The class is chosen when the search query and the product may not be fully same, but still are complementing each other and maybe of similar category

"""

   search_query = dspy.InputField(desc="Search Query consisting of keywords")

   result_product_title = dspy.InputField(desc="This is part of Product Description and indicates the Title of the product")

   result_product_description = dspy.InputField(desc="This is part of Product Description and indicates the description of the product")

   …

   thinking = dspy.OutputField(desc="justification in the scratchpad, explaining the reasoning behind the classification choice and highlighting key factors that led to the decision")

   answer = dspy.OutputField(desc="final classification label for the product result: positive_search/negative_search/moderate_search. ")

""" Instructions:

Begin by creating a scratchpad where you can jot down your initial thoughts, observations, and any pertinent information related to the search query and product. This section is for your personal use and doesn't require a formal structure.
Proceed to examine and dissect the search query. Pinpoint essential terms, brand names, model numbers, and specifications. Assess the user's probable objective based on the query.
Subsequently, juxtapose the query with the product. Seek out precise correspondences in brand, model, and specifications. Recognize commonalities in functionality, purpose, or features. Reflect on how the product connects to or augments the item being queried.
Afterwards, employ a methodical classification approach, contemplating each step carefully
Conclude by verifying the classification. Scrutinize the selected category in relation to its description to confirm its precision. Take into account any exceptional circumstances or possible uncertainties.

"""

Conclusion

In this post, we introduced the workflow and architecture for you to migrate your current generative AI workload into Amazon Nova models, and presented a comprehensive prompt optimization approach using Amazon Bedrock prompt optimization and a data-aware prompt optimization methodology with DSPy. The results on three LLM tasks demonstrated the optimized performance of Amazon Nova in its intelligence classes and the model performance improved by Amazon Bedrock prompt optimization post-model migration, which is further enhanced by the data-aware prompt optimization methodology presented in this post.

The Python library and code examples are publicly available on GitHub. You can use this LLM migration method and the prompt optimization solution to migrate your workloads into Amazon Nova, or in other model migration processes.


About the Authors

YunfeiYunfei Bai is a Principal Solutions Architect at AWS. With a background in AI/ML, data science, and analytics, Yunfei helps customers adopt AWS services to deliver business results. He designs AI/ML and data analytics solutions that overcome complex technical challenges and drive strategic objectives. Yunfei has a PhD in Electronic and Electrical Engineering. Outside of work, Yunfei enjoys reading and music.

Anupam Dewan is a Senior Solutions Architect with a passion for generative AI and its applications in real life. He and his team enable Amazon Builders who build customer facing application using generative AI. He lives in Seattle area, and outside of work loves to go on hiking and enjoy nature.

Shuai Wang is a Senior Applied Scientist and Manager at Amazon Bedrock, specializing in natural language processing, machine learning, large language modeling, and other related AI areas. Outside work, he enjoys sports, particularly basketball, and family activities.

Kashif Imran is a seasoned engineering and product leader with deep expertise in AI/ML, cloud architecture, and large-scale distributed systems. Currently a Senior Manager at AWS, Kashif leads teams driving innovation in generative AI and Cloud, partnering with strategic cloud customers to transform their businesses. Kashif holds dual master’s degrees in Computer Science and Telecommunications, and specializes in translating complex technical capabilities into measurable business value for enterprises.

Read More