Ground truth curation and metric interpretation best practices for evaluating generative AI question answering using FMEval

Generative artificial intelligence (AI) applications powered by large language models (LLMs) are rapidly gaining traction for question answering use cases. From internal knowledge bases for customer support to external conversational AI assistants, these applications use LLMs to provide human-like responses to natural language queries. However, building and deploying such assistants with responsible AI best practices requires a robust ground truth and evaluation framework to make sure they meet quality standards and user experience expectations, as well as clear evaluation interpretation guidelines to make the quality and responsibility of these systems intelligible to business decision-makers.

This post focuses on evaluating and interpreting metrics using FMEval for question answering in a generative AI application. FMEval is a comprehensive evaluation suite from Amazon SageMaker Clarify, providing standardized implementations of metrics to assess quality and responsibility. To learn more about FMEval, refer to Evaluate large language models for quality and responsibility.

In this post, we discuss best practices for working with FMEval in ground truth curation and metric interpretation for evaluating question answering applications for factual knowledge and quality. Ground truth data in AI refers to data that is known to be true, representing the expected outcome for the system being modeled. By providing a true expected outcome to measure against, ground truth data unlocks the ability to deterministically evaluate system quality. Ground truth curation and metric interpretation are tightly coupled, and the implementation of the evaluation metric must inform ground truth curation to achieve best results. By following these guidelines, data scientists can quantify the user experience delivered by their generative AI pipelines and communicate meaning to business stakeholders, facilitating ready comparisons across different architectures, such as Retrieval Augmented Generation (RAG) pipelines, off-the-shelf or fine-tuned LLMs, or agentic solutions.

Solution overview

We use an example ground truth dataset (referred to as the golden dataset, shown in the following table) of 10 question-answer-fact triplets. Each triplet describes a fact, and an encapsulation of the fact as a question-answer pair to emulate an ideal response, derived from a knowledge source document. We created the triplets from Amazon’s Q2 2023 10Q report, taken from the SEC’s public EDGAR dataset. The 10Q report contains details on company financials and operations over the Q2 2023 business quarter. The golden dataset applies the ground truth curation best practices discussed in this post for most questions, but not all, to demonstrate the downstream impact of ground truth curation on metric results.

Question	Answer	Fact
Who is Andrew R. Jassy?	Andrew R. Jassy is the President and Chief Executive Officer of Amazon.com, Inc.	Chief Executive Officer of Amazon<OR>CEO of Amazon<OR>President of Amazon
What were Amazon’s total net sales for the second quarter of 2023?	Amazon’s total net sales for the second quarter of 2023 were $134.4 billion.	134.4 billion<OR>134,383 million<OR>134383 million<OR>134.383 billion
Where is Amazon’s principal office located?	Amazon’s principal office is located at 410 Terry Avenue North, Seattle, Washington 98109-5210.	410 Terry Avenue North
What was Amazon’s operating income for the six months ended June 30, 2023?	Amazon’s operating income for the six months ended June 30, 2023 was $12.5 billion.	12.5 billion<OR>12,455 million<OR>12.455 billion
When did Amazon acquire One Medical?	Amazon acquired One Medical on February 22, 2023 for cash consideration of approximately $3.5 billion, net of cash acquired.	Feb 22 2023<OR>February 22nd 2023<OR>2023-02-22<OR>February 22, 2023
What was a key challenge faced by Amazon’s business in the second quarter of 2023?	Changes in foreign exchange rates reduced Amazon’s International segment net sales by $180 million for Q2 2023.	foreign exchange rates
What was Amazon’s total cash, cash equivalents and restricted cash as of June 30, 2023?	Amazon’s total cash, cash equivalents, and restricted cash as of June 30, 2023 was $50.1 billion.	50.1 billion<OR>50,067 million<OR>50.067 billion
What were Amazon’s AWS sales for the second quarter of 2023?	Amazon’s AWS sales for the second quarter of 2023 were $22.1 billion.	22.1 billion<OR>22,140 million<OR>22.140 billion<OR>22140 million
As of June 30, 2023, how many shares of Rivian’s Class A common stock did Amazon hold?	As of June 30, 2023, Amazon held 158 million shares of Rivian’s Class A common stock.	158 million
How many shares of common stock were outstanding as of July 21, 2023?	There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.	10317750796<OR>10,317,750,796

We generated responses from three generative AI RAG pipelines (anonymized as Pipeline1, Pipeline2, Pipeline3, as shown in the following figure) and calculated factual knowledge and QA accuracy metrics, evaluating them against the golden dataset. The fact key of the triplet is used for the Factual Knowledge metric ground truth, and the answer key is used for the QA Accuracy metric ground truth. With this, factual knowledge is measured against the fact key, and the ideal user experience in terms of style and conciseness is measured against the question-answer pairs.

Evaluation for question answering in a generative AI application

A generative AI pipeline can have many subcomponents, such as a RAG pipeline. RAG is a methodology to improve the accuracy of LLM responses answering a user query by retrieving and inserting relevant domain knowledge into the language model prompt. RAG quality depends on the configurations of the retriever (chunking, indexing) and generator (LLM selection and hyperparameters, prompt), as illustrated in the following figure. Tuning chunking and indexing in the retriever makes sure the correct content is available in the LLM prompt for generation. The chunk size and chunk splitting method, as well as the means of embedding and ranking relevant document chunks as vectors in the knowledge store, impact whether the actual answer to the query is ultimately inserted in the prompt. In the generator, selecting an appropriate LLM to run the prompt, and tuning its hyperparameters and prompt template, all control how the retrieved information is interpreted for the response. With this, when a final response from a RAG pipeline is evaluated, the preceding components may be adjusted to improve response quality.

Alternatively, question answering can be powered by a fine-tuned LLM, or through an agentic approach. Although we demonstrate the evaluation of final responses from RAG pipelines, the final responses from a generative AI pipeline for question answering can be similarly evaluated because the prerequisites are a golden dataset and the generative answers. With this approach, changes in the generative output due to different generative AI pipeline architectures can be evaluated to inform the best design choices (comparing RAG and knowledge retrieval agents, comparing LLMs used for generation, retrievers, chunking, prompts, and so on).

Although evaluating each sub-component of a generative AI pipeline is important in development and troubleshooting, business decisions rely on having an end-to-end, side-by-side data view, quantifying how a given generative AI pipeline will perform in terms of user experience. With this, business stakeholders can understand expected quality changes in terms of end-user experience by switching LLMs, and adhere to legal and compliance requirements, such as ISO42001 AI Ethics. There are further financial benefits to realize; for example, quantifying expected quality changes on internal datasets when switching a development LLM to a cheaper, lightweight LLM in production. The overall evaluation process for the benefit of decision-makers is outlined in the following figure. In this post, we focus our discussion on ground truth curation, evaluation, and interpreting evaluation scores for entire question answering generative AI pipelines using FMEval to enable data-driven decision-making on quality.

A useful mental model for ground truth curation and improvement of a golden dataset is a flywheel, as shown in the following figure. The ground truth experimentation process involves querying your generative AI pipeline with the initial golden dataset questions and evaluating the responses against initial golden answers using FMEval. Then, the quality of the golden dataset must be reviewed by a judge. The judge review of the golden dataset quality accelerates the flywheel towards an ever-improving golden dataset. The judge role in the workflow can be assumed by another LLM to enable scaling against established, domain-specific criteria for high-quality ground truth. Maintaining a human-in-the-loop component to the judge function remains essential to sample and verify results, as well as to increase the quality bar with increasing task complexity. Improvement to the golden dataset fosters improvement to the quality of the evaluation metrics, until sufficient measurement accuracy in the flywheel is met by the judge, using the established criteria for quality. To learn more about AWS offerings on human review of generations and data labeling, such as Amazon Augmented AI (Amazon A2I) and Amazon SageMaker Ground Truth Plus, refer to Using Amazon Augmented AI for Human Review and High-quality human feedback for your generative AI applications from Amazon SageMaker Ground Truth Plus. When using LLMs as a judge, make sure to apply prompt safety best practices.

However, to conduct reviews of golden dataset quality as part of the ground truth experiment flywheel, human reviewers must understand the evaluation metric implementation and its coupling to ground truth curation.

FMEval metrics for question answering in a generative AI application

The Factual Knowledge and QA Accuracy metrics from FMEval provide a way to evaluate custom question answering datasets against ground truth. For a full list of metrics implemented with FMEval, refer to Using prompt datasets and available evaluation dimensions in model evaluation jobs.

Factual Knowledge

The Factual Knowledge metric evaluates whether the generated response contains factual information present in the ground truth answer. It is a binary (0 or 1) score based on a string match. Factual knowledge also reports a quasi-exact string match which performs matching after normalization. For simplicity, we focus on the exact match Factual Knowledge score in this post.

For each golden question:

0 indicates the lowercased factual ground truth is not present in the model response
1 indicates the lowercased factual ground truth is present in the response

QA Accuracy

The QA Accuracy metric measures a model’s question answering accuracy by comparing its generated answers against ground truth answers. The metrics are computed by counting true positive, false positive, and false negative word matches between the QA ground truth answers and the generated answers.

It includes several sub-metrics:

Recall Over Words – Scores from 0 (worst) to 1 (best), measuring the fraction of QA ground truth words that are contained in the model output
Precision Over Words – Scores from 0 (worst) to 1 (best), measuring the fraction of words in the model output that match the QA ground truth
F1 Over Words – The harmonic mean of precision and recall, providing a balanced score from 0 to 1
Exact Match – Binary 0 or 1, indicating if the model output exactly matches the QA ground truth
Quasi Exact Match – Similar to Exact Match, but with normalization (lowercasing and removing articles)

Because QA Accuracy metrics are calculated on an exact match basis (for more details, see Accuracy), they may be less reliable for questions where the answer can be rephrased without modifying its meaning. To mitigate this, we propose applying Factual Knowledge as the assessment of factual correctness, motivating the use of a dedicated factual ground truth with minimal word expression, together with QA Accuracy as a measure of idealized user experience in terms of response verbosity and style. We elaborate on these concepts later in this post. The BERTScore is also computed as part of QA Accuracy, which provides a measure of semantic match quality against the ground truth.

Proposed ground truth curation best practices for question answering with FMEval

In this section, we share best practices for curating your ground truth for question answering with FMEval.

Understanding the Factual Knowledge metric calculation

A factual knowledge score is a binary measure of whether a real-world fact was correctly retrieved by the generative AI pipeline. 0 indicates the lower-cased expected answer is not part of the model response, whereas 1 indicates it is. Where there is more than one acceptable answer, and either answer is considered correct, apply a logical operator for OR. A configuration for a logical AND can also be applied for cases where the factual material encompasses multiple distinct entities. In the present examples, we demonstrate a logical OR, using the <OR> delimiter. See Use SageMaker Clarify to evaluate large language models for information about logical operators. An example curation of a golden question and golden fact is shown in the following table.

Golden Question	“How many shares of common stock were outstanding as of July 21, 2023?”
Golden Fact	10,317,750,796<OR>10317750796

Fact detection is useful for assessing hallucination in a generative AI pipeline. The two sample responses in the following table illustrate fact detection. The first example correctly states the fact in the example response, and receives a 1.0 score. The second example hallucinates a number instead of stating the fact, and receives a 0 score.

Metric	Example Response	Score	Calculation Approach
Factual Knowledge	“Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023.”	1.0	String match to golden fact
Factual Knowledge	“Based on the documents provided, Amazon had 22,003,237,746 shares of common stock outstanding as of July 21, 2023.”	0.0	String match to golden fact
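The string match in the preceding table can be sketched in a few lines of Python. This is an illustration of the idea only, not FMEval's implementation: the function name `factual_knowledge_score` is hypothetical, and we assume a plain lowercased substring check with `<OR>` separating acceptable variants.

```python
def factual_knowledge_score(model_output: str, golden_fact: str, delimiter: str = "<OR>") -> int:
    """Return 1 if any delimiter-separated fact variant appears in the response, else 0.

    Hypothetical helper: a lowercased substring check that mimics the
    behavior described in this post, not the FMEval source code.
    """
    response = model_output.lower()
    variants = [variant.strip().lower() for variant in golden_fact.split(delimiter)]
    return int(any(variant and variant in response for variant in variants))


golden_fact = "10,317,750,796<OR>10317750796"
correct = (
    "Based on the documents provided, Amazon had 10,317,750,796 shares "
    "of common stock outstanding as of July 21, 2023."
)
hallucinated = (
    "Based on the documents provided, Amazon had 22,003,237,746 shares "
    "of common stock outstanding as of July 21, 2023."
)

print(factual_knowledge_score(correct, golden_fact))       # 1
print(factual_knowledge_score(hallucinated, golden_fact))  # 0
```

Because the check is a substring match, a response that wraps the fact in extra wording still scores 1, whereas any deviation inside the fact itself scores 0.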

In the following example, we highlight the importance of units in ground truth for Factual Knowledge string matching. The golden question and golden fact represent Amazon’s total net sales for the second quarter of 2023.

Golden Question	“What were Amazon’s total net sales for the second quarter of 2023?”
Golden Fact	134.4 billion<OR>134,383 million

The first response hallucinates the fact, using units of billions, and correctly receives a score of 0.0. The second response correctly represents the fact, in units of millions. Both units should be represented in the golden fact. The third response was unable to answer the question, flagging a potential issue with the information retrieval step.

Metric	Example Response	Score	Calculation Approach
Factual Knowledge	Amazon’s total net sales for the second quarter of 2023 were $170.0 billion.	0.0	String match to golden fact
	The total consolidated net sales for Q2 2023 were $134,383 million according to this report.	1.0
	Sorry, the provided context does not include any information about Amazon’s total net sales for the second quarter of 2023. Would you like to ask another question?	0.0

Interpreting Factual Knowledge scores

Factual knowledge scores are a useful flag for challenges in the generative AI pipeline such as hallucination or information retrieval problems. Factual knowledge scores can be curated in the form of a Factual Knowledge Report for human review, as shown in the following table, to visualize pipeline quality in terms of fact detection side by side.

User Question	QA Ground Truth	Factual Ground Truth	Pipeline 1	Pipeline 2	Pipeline 3
As of June 30, 2023, how many shares of Rivian’s Class A common stock did Amazon hold?	As of June 30, 2023, Amazon held 158 million shares of Rivian’s Class A common stock.	158 million	1	1	1
How many shares of common stock were outstanding as of July 21, 2023?	There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.	10317750796<OR>10,317,750,796	1	1	1
What was Amazon’s operating income for the six months ended June 30, 2023?	Amazon’s operating income for the six months ended June 30, 2023 was $12.5 billion.	12.5 billion<OR>12,455 million<OR>12.455 billion	1	1	1
What was Amazon’s total cash, cash equivalents and restricted cash as of June 30, 2023?	Amazon’s total cash, cash equivalents, and restricted cash as of June 30, 2023 was $50.1 billion.	50.1 billion<OR>50,067 million<OR>50.067 billion	1	0	0
What was a key challenge faced by Amazon’s business in the second quarter of 2023?	Changes in foreign exchange rates reduced Amazon’s International segment net sales by $180 million for Q2 2023.	foreign exchange rates	0	0	0
What were Amazon’s AWS sales for the second quarter of 2023?	Amazon’s AWS sales for the second quarter of 2023 were $22.1 billion.	22.1 billion<OR>22,140 million<OR>22.140 billion<OR>22140 million	1	0	0
What were Amazon’s total net sales for the second quarter of 2023?	Amazon’s total net sales for the second quarter of 2023 were $134.4 billion.	134.4 billion<OR>134,383 million<OR>134383 million<OR>134.383 billion	1	0	0
When did Amazon acquire One Medical?	Amazon acquired One Medical on February 22, 2023 for cash consideration of approximately $3.5 billion, net of cash acquired.	Feb 22 2023<OR>February 22nd 2023<OR>2023-02-22<OR>February 22, 2023	1	0	1
Where is Amazon’s principal office located?	Amazon’s principal office is located at 410 Terry Avenue North, Seattle, Washington 98109-5210.	410 Terry Avenue North	0	0	0
Who is Andrew R. Jassy?	Andrew R. Jassy is the President and Chief Executive Officer of Amazon.com, Inc.	Chief Executive Officer of Amazon<OR>CEO of Amazon<OR>President of Amazon	1	1	1
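To compare pipelines at a glance, the report can be reduced to a fact detection rate per pipeline, together with a list of questions that no pipeline answered. The following sketch transcribes the scores from the preceding table in row order. The variable and function names are hypothetical, and the aggregation is our own illustration, not an FMEval feature.

```python
# Golden questions in the same row order as the report table.
questions = [
    "How many shares of Rivian's Class A common stock did Amazon hold?",
    "How many shares of common stock were outstanding as of July 21, 2023?",
    "What was Amazon's operating income for the six months ended June 30, 2023?",
    "What was Amazon's total cash, cash equivalents and restricted cash?",
    "What was a key challenge faced by Amazon's business in Q2 2023?",
    "What were Amazon's AWS sales for the second quarter of 2023?",
    "What were Amazon's total net sales for the second quarter of 2023?",
    "When did Amazon acquire One Medical?",
    "Where is Amazon's principal office located?",
    "Who is Andrew R. Jassy?",
]

# Factual Knowledge scores per pipeline, transcribed from the table.
report = {
    "Pipeline 1": [1, 1, 1, 1, 0, 1, 1, 1, 0, 1],
    "Pipeline 2": [1, 1, 1, 0, 0, 0, 0, 0, 0, 1],
    "Pipeline 3": [1, 1, 1, 0, 0, 0, 0, 1, 0, 1],
}

# Fraction of golden facts detected by each pipeline.
rates = {name: sum(scores) / len(scores) for name, scores in report.items()}

# Row indices where every pipeline scored 0: candidates for ground truth review.
all_zero_rows = [i for i, row in enumerate(zip(*report.values())) if not any(row)]

for name, rate in rates.items():
    print(f"{name}: {rate:.0%} of facts detected")
for i in all_zero_rows:
    print(f"Review ground truth for: {questions[i]}")
```

On this data, the questions flagged for review are the key challenge and principal office questions, which every pipeline missed.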

Curating Factual Knowledge ground truth

Consider the impact of string matching between your ground truth and LLM responses when curating ground truth for Factual Knowledge. Best practices for curation in consideration of string matching are the following:

Use a minimal version of the QA Accuracy ground truth for a factual ground truth containing the most important facts – Because the Factual Knowledge metric uses exact string matching, curating minimal ground truth facts distinct from the QA Accuracy ground truth is imperative. Using the full QA Accuracy ground truth as the fact will not yield a string match unless the response contains that entire answer verbatim. Apply logical operators as best suited to represent your facts.
Zero factual knowledge scores across the benchmark can indicate a poorly formed golden question-answer-fact triplet – If a golden question doesn’t contain an obvious singular answer, or can be equivalently interpreted multiple ways, reframe the golden question or answer to be specific. In the Factual Knowledge table, a question such as “What was a key challenge faced by Amazon’s business in the second quarter of 2023?” can be subjective, and interpreted with multiple possible acceptable answers. Factual Knowledge scores were 0.0 for all entries because each LLM settled on a different answer. A better question would be: “How much did foreign exchange rates reduce Amazon’s International segment net sales?” Similarly, “Where is Amazon’s principal office located?” renders multiple acceptable answers, such as “Seattle,” “Seattle, Washington,” or the street address. The question could be reframed as “What is the street address of Amazon’s principal office?” if this is the desired response.
Generate many variations of fact representation in terms of units and punctuation – Different LLMs will use different language to present facts (date formats, engineering units, financial units, and so on). The factual ground truth should accommodate such expected units for the LLMs being evaluated as part of the pipeline. Using LLMs to automate fact generation from the QA ground truth can help.
Avoid false positive matches – Avoid curating ground truth facts that are overly simple. Short, unpunctuated number sequences, for example, can be matched with years, dates, or phone numbers and can generate false positives.
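For numeric facts, unit and punctuation variants can also be generated programmatically. The following is a minimal sketch, assuming financial amounts reported in millions, as in the golden dataset. The helper `financial_fact_variants` is hypothetical, and the variant formats it emits are our choice rather than a prescribed list.

```python
def financial_fact_variants(amount_in_millions: int, delimiter: str = "<OR>") -> str:
    """Build a delimiter-joined golden fact covering common unit and punctuation variants.

    Hypothetical helper for illustration; extend the formats to match the
    styles your LLMs actually produce.
    """
    billions = amount_in_millions / 1000
    variants = [
        f"{billions:.1f} billion",          # rounded, for example 134.4 billion
        f"{amount_in_millions:,} million",  # thousands separator, for example 134,383 million
        f"{amount_in_millions} million",    # unpunctuated, for example 134383 million
        f"{billions:.3f} billion",          # full precision, for example 134.383 billion
    ]
    # Drop duplicates while preserving order (small amounts format identically).
    return delimiter.join(dict.fromkeys(variants))


print(financial_fact_variants(134383))
# 134.4 billion<OR>134,383 million<OR>134383 million<OR>134.383 billion
```

Every variant keeps its unit word, so a bare number such as a year or a document ID is less likely to trigger a false positive match.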

Understanding QA Accuracy metric calculation

We use the following question answer pair to demonstrate how FMEval metrics are calculated, and how this informs best practices in QA ground truth curation.

Golden Question	“How many shares of common stock were outstanding as of July 21, 2023?”
Golden Answer	“There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.”

In calculating QA Accuracy metrics, the responses and ground truth are first normalized (lowercase, remove punctuation, remove articles, remove excess whitespace). Then, true positive, false positive, and false negative matches are computed between the LLM response and the ground truth. QA Accuracy metrics returned by FMEval include recall, precision, and F1. By assessing exact matching, the Exact Match and Quasi-Exact Match metrics are also returned. A detailed walkthrough of the calculation and scores is shown in the following tables.

The first table illustrates the accuracy metric calculation mechanism.

Metric	Definition	Example	Score
True Positive (TP)	The number of words in the model output that are also contained in the ground truth.	Golden Answer: “There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.” Example Response: “Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023.”	11
False Positive (FP)	The number of words in the model output that are not contained in the ground truth.	Golden Answer: “There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.” Example Response: “Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023.”	7
False Negative (FN)	The number of words that are missing from the model output, but are included in the ground truth.	Golden Answer: “There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.” Example Response: “Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023.”	3

The following table lists the accuracy scores.

Metric	Score	Calculation Approach
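Recall Over Words	0.786	TP / (TP + FN) = 11 / (11 + 3)
Precision Over Words	0.611	TP / (TP + FP) = 11 / (11 + 7)
F1	0.688	2 × Precision × Recall / (Precision + Recall)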
Exact Match	0.0	(Non-normalized) Binary score that indicates whether the model output is an exact match for the ground truth answer.
Quasi-Exact Match	0.0	(Normalized) Binary score that indicates whether the model output is an exact match for the ground truth answer.
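The word-level counts in the preceding tables can be reproduced with a short Python sketch. This is our own illustration, not the FMEval source: the names `normalize` and `word_overlap_scores` and the `unique_words` flag are hypothetical. To match the walkthrough counts, it lowercases, strips ASCII punctuation, keeps articles, and counts every word occurrence. FMEval's normalization and duplicate handling may differ in detail.

```python
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    lowered = text.lower()
    no_punct = "".join(ch for ch in lowered if ch not in string.punctuation)
    return no_punct.split()


def word_overlap_scores(model_output: str, ground_truth: str, unique_words: bool = False) -> dict:
    """Count word matches and derive precision, recall, and F1 (illustrative only)."""
    out_words, ref_words = normalize(model_output), normalize(ground_truth)
    if unique_words:
        # Count each distinct word once instead of every occurrence.
        out_words, ref_words = set(out_words), set(ref_words)
    out_counts, ref_counts = Counter(out_words), Counter(ref_words)
    tp = sum((out_counts & ref_counts).values())  # words shared by both texts
    fp = sum(out_counts.values()) - tp            # response words not in ground truth
    fn = sum(ref_counts.values()) - tp            # ground truth words missing from response
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall, "f1": f1}


golden = "There were 10,317,750,796 shares of Amazon's common stock outstanding as of July 21, 2023."
response = (
    "Based on the documents provided, Amazon had 10,317,750,796 shares "
    "of common stock outstanding as of July 21, 2023."
)
scores = word_overlap_scores(response, golden)
print(scores["tp"], scores["fp"], scores["fn"])  # 11 7 3
```

With `unique_words=True`, the more concise response “As of July 21, 2023, there were 10,317,750,796 shares of common stock outstanding.” matches 12 of the 13 distinct ground truth words, for a recall of about 0.923 and a precision of 1.0.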

Interpreting QA Accuracy scores

The following are best practices for interpreting QA accuracy scores:

Interpret recall as closeness to ground truth – The recall metric in FMEval measures the fraction of ground truth words that are in the model response. With this, we can interpret recall as closeness to ground truth.
- The higher the recall score, the more ground truth is included in the model response. If the entire ground truth is included in the model response, recall will be perfect (1.0), and if no ground truth is included in the model response, recall will be zero (0.0).
- Low recall in response to a golden question can indicate a problem with information retrieval, as shown in the example in the following table. A high recall score, however, doesn’t unilaterally indicate a correct response. Hallucinations of facts can present as a single deviated word between model response and ground truth, while still yielding a high true positive rate in word matching. For such cases, you can complement QA Accuracy scores with Factual Knowledge assessments of golden questions in FMEval (we provide examples later in this post).

Interpretation	Question	Curated Ground Truth	High Closeness to Ground Truth		Low Closeness to Ground Truth
Interpreting Closeness to Ground Truth Scores	“How many shares of common stock were outstanding as of July 21, 2023?”	“There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.”	“As of July 21, 2023, there were 10,317,750,796 shares of common stock outstanding.”	0.923	“Sorry, I do not have access to documents containing common stock information about Amazon.”	0.111

Interpret precision as conciseness to ground truth – The higher the score, the closer the LLM response is to the ground truth in terms of conveying ground truth information in the fewest number of words. By this definition, we recommend interpreting precision scores as a measure of conciseness to the ground truth. The following table demonstrates LLM responses that show high conciseness to the ground truth and low conciseness. Both answers are factually correct, but the reduction in precision is derived from the higher verbosity of the LLM response relative to the ground truth.

Interpretation	Question	Curated Ground Truth	High Conciseness to Ground Truth		Low Conciseness to Ground Truth
Interpreting Conciseness to Ground Truth	“How many shares of common stock were outstanding as of July 21, 2023?”	“There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.”	“As of July 21, 2023, there were 10,317,750,796 shares of common stock outstanding.”	1.0	“Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023. Specifically, in the first excerpt from the quarterly report for the quarter ending June 30, 2023, it states: ‘10,317,750,796 shares of common stock, par value $0.01 per share, outstanding as of July 21, 2023’ Therefore, the number of shares of Amazon common stock outstanding as of July 21, 2023 was 10,317,750,796 according to this statement.”	0.238

Interpret F1 score as combined closeness and conciseness to ground truth – F1 score is the harmonic mean of precision and recall, and so represents a joint measure that equally weights closeness and conciseness for a holistic score. The highest-scoring responses will contain all the words of the curated ground truth and be similarly concise. The lowest-scoring responses will differ in verbosity relative to the ground truth and contain a large number of words that are not present in the ground truth. Because F1 blends these qualities into a single number, its interpretation is subjective. Reviewing recall and precision independently will clearly indicate the qualities of the generative responses in terms of closeness and conciseness. Some examples of high and low F1 scores are provided in the following table.

Interpretation	Question	Curated Ground Truth	High Combined Closeness x Conciseness		Low Combined Closeness x Conciseness
Interpreting Closeness and Conciseness to Ground Truth	“How many shares of common stock were outstanding as of July 21, 2023?”	“There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.”	“As of July 21, 2023, there were 10,317,750,796 shares of common stock outstanding.”	0.96	“Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023. Specifically, in the first excerpt from the quarterly report for the quarter ending June 30, 2023, it states: ‘10,317,750,796 shares of common stock, par value $0.01 per share, outstanding as of July 21, 2023’ Therefore, the number of shares of Amazon common stock outstanding as of July 21, 2023 was 10,317,750,796 according to this statement.”	0.364

Combine factual knowledge with recall for detection of hallucinated facts and false fact matches – Factual Knowledge scores can be interpreted in combination with recall metrics to distinguish likely hallucinations and false positive facts. For example, the following cases can be caught, with examples in the following table:
- High recall with zero factual knowledge suggests a hallucinated fact.
- Zero recall with positive factual knowledge suggests an accidental match between the factual ground truth and an unrelated entity such as a document ID, phone number, or date.
- Low recall and zero factual knowledge may also suggest a correct answer that has been expressed with alternative language to the QA ground truth. Improved ground truth curation (increased question specificity, more ground truth fact variants) can remediate this problem. The BERTScore can also provide semantic context on match quality.

Interpretation	QA Ground Truth	Factual Ground Truth	Factual Knowledge	Recall Score	LLM response
Hallucination detection	Amazon’s total net sales for the second quarter of 2023 were $134.4 billion.	134.4 billion<OR>134,383 million	0	0.92	Amazon’s total net sales for the second quarter of 2023 were $170.0 billion.
Detect false positive facts	There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.	10317750796<OR>10,317,750,796	1.0	0.0	Document ID: 10317750796
Correct answer, expressed in different words to ground truth question-answer-fact	Amazon’s principal office is located at 410 Terry Avenue North, Seattle, Washington 98109-5210.	410 Terry Avenue North	0	0.54	Amazon’s principal office is located in Seattle, Washington.
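The three cases above can be turned into a simple triage rule for human review. The following sketch is illustrative only: the function name `triage` is hypothetical, and the `high_recall` and `low_recall` cutoffs are assumptions that you should tune on your own golden dataset.

```python
def triage(factual_knowledge: int, recall: float,
           high_recall: float = 0.8, low_recall: float = 0.2) -> str:
    """Suggest a review label from a Factual Knowledge score and a recall score.

    The cutoffs are assumed values for illustration, not recommended thresholds.
    """
    if factual_knowledge == 0 and recall >= high_recall:
        return "possible hallucinated fact"
    if factual_knowledge == 1 and recall <= low_recall:
        return "possible false positive fact match"
    if factual_knowledge == 0:
        return "review: retrieval problem or alternative phrasing"
    return "consistent"


# (Factual Knowledge, recall) pairs from the three rows of the table above.
for fk, recall in [(0, 0.92), (1, 0.0), (0, 0.54)]:
    print(fk, recall, "->", triage(fk, recall))
```

The labels are prompts for a reviewer rather than verdicts. The third row, for example, still needs a human or an LLM judge to decide whether the response is a correct answer in different words.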

Curating QA Accuracy ground truth

Consider the impact of true positive, false positive, and false negative matches between your golden answer and LLM responses when curating your ground truth for QA Accuracy. Best practices for curation in consideration of string matching are as follows:

Use LLMs to generate initial golden questions and answers – This is beneficial in terms of speed and level of effort; however, outputs must be reviewed and further curated if necessary before acceptance (see Step 3 of the ground truth experimentation flywheel earlier in this post). Furthermore, applying an LLM to generate your ground truth may bias correct answers towards that LLM, for example, due to string matching of filler words that the LLM commonly uses in its language expression that other LLMs may not. Keeping ground truth expressed in an LLM-agnostic manner is a gold standard.
Have humans review golden answers for proximity to desired output – Your golden answers should reflect your standard for the user-facing assistant in terms of factual content and verbiage. Consider the desired level of verbosity and choice of words you expect as outputs based on your production RAG prompt template. Overly verbose ground truths, and ground truths that adopt language unlikely to be in the model output, will increase false negative counts unnecessarily. Human curation of generated golden answers should reflect the desired verbosity and word choice in addition to accuracy of information, before accepting LLM-generated golden answers, to make sure evaluation metrics are computed relative to a true golden standard. Apply guardrails on the verbosity of ground truth, such as controlling word count, as part of the generation process.
Compare LLM accuracy using recall – Closeness to ground truth is the best indicator of word agreement between the model response and the ground truth. When golden answers are curated properly, a low recall suggests strong deviation between the ground truth and the model response, whereas a high recall suggests strong agreement.
Compare verbosity using precision – When golden answers are curated properly, verbose LLM responses decrease precision scores due to false positives present, and concise LLM responses are rewarded by high precision scores. If the golden answer is highly verbose, however, concise model responses will incur false negatives.
Experiment to determine recall acceptability thresholds for generative AI pipelines – A recall threshold for the golden dataset can be set to determine cutoffs for pipeline quality acceptability.
Interpret QA Accuracy metrics in conjunction with other metrics to judge accuracy – Metrics such as Factual Knowledge can be combined with QA Accuracy scores to judge factual knowledge in addition to ground truth word matching.
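
To make the recall and precision interpretations above concrete, the following is a minimal sketch of word-overlap scoring. It is not FMEval's implementation: it assumes lowercasing, punctuation stripping, whitespace tokenization, and set-based overlap, and the function names, example strings, and the 0.5 threshold are hypothetical placeholders. Note that stripping punctuation this way would also alter numeric tokens such as decimals, so treat it as an illustration only.

```python
import string


def tokenize(text: str) -> set[str]:
    """Lowercase, strip punctuation, and split on whitespace into unique words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())


def word_overlap_scores(model_response: str, golden_answer: str) -> dict[str, float]:
    """Compute set-based precision, recall, and F1 between a response and a golden answer."""
    response_words = tokenize(model_response)
    golden_words = tokenize(golden_answer)
    # Words shared by the response and the golden answer are the true positives.
    true_positives = len(response_words & golden_words)
    # Precision penalizes extra response words (false positives).
    precision = true_positives / len(response_words) if response_words else 0.0
    # Recall penalizes golden words missing from the response (false negatives).
    recall = true_positives / len(golden_words) if golden_words else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def share_meeting_threshold(recalls: list[float], threshold: float) -> float:
    """Fraction of golden questions whose recall meets a chosen acceptability threshold."""
    if not recalls:
        return 0.0
    return sum(r >= threshold for r in recalls) / len(recalls)


# Hypothetical golden answer and two hypothetical model responses.
golden = "The capital of France is Paris"
concise = word_overlap_scores("Paris", golden)
verbose = word_overlap_scores("The capital city of France is indeed Paris", golden)

print("concise:", concise)  # perfect precision, low recall against a wordy golden answer
print("verbose:", verbose)  # perfect recall, precision reduced by the extra words
print("share above 0.5 recall:",
      share_meeting_threshold([concise["recall"], verbose["recall"]], 0.5))
```

In this sketch, the one-word response is penalized on recall only because the golden answer is wordier than the desired output, which is the reason to curate golden answers at the verbosity you expect in production. The threshold itself should be chosen by experimentation on your own golden dataset.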

Key takeaways

Curating appropriate ground truth and interpreting evaluation metrics in a feedback loop are crucial for effective business decision-making when deploying generative AI pipelines for question answering.

There were several key takeaways from this experiment:

Ground truth curation and metric interpretation are a cyclical process – Understanding how the metrics are calculated should inform the ground truth curation approach to achieve the desired comparison.
Low-scoring evaluations can indicate problems with ground truth curation in addition to generative AI pipeline quality – Using golden datasets that don’t reflect true answer quality (for example, misleading questions, incorrect answers, or ground truth answers that don’t reflect the expected response style) can be the root cause of poor evaluation results for a successful pipeline. When the golden dataset is curated properly, low-scoring evaluations will correctly flag pipeline problems.
Balance recall, precision, and F1 scores – Find the balance between acceptable recall (closeness to ground truth), precision (conciseness to ground truth), and F1 scores (combined) through iterative experimentation and data curation. Pay close attention to what scores quantify your ideal closeness to ground truth and conciseness to the ground truth based on your data and business objectives.
Design ground truth verbosity to the level desired in your user experience – For QA Accuracy evaluation, curate ground truth answers that reflect the desired level of conciseness and word choice expected from the production assistant. Overly verbose or unnaturally worded ground truths can unnecessarily decrease precision scores.
Use recall and factual knowledge for setting accuracy thresholds – Interpret recall in conjunction with factual knowledge to assess overall accuracy, and establish thresholds by experimentation on your own datasets. Factual knowledge scores can complement recall to detect hallucinations (high recall, false factual knowledge) and accidental fact matches (zero recall, true factual knowledge).
Curate distinct QA and factual ground truths – For a Factual Knowledge evaluation, curate minimal ground truth facts distinct from the QA Accuracy ground truth. Generate comprehensive variations of fact representations in terms of units, punctuation, and formats.
Golden questions should be unambiguous – Zero factual knowledge scores across the benchmark can indicate poorly formed golden question-answer-fact triplets. Reframe subjective or ambiguous questions to have a specific, singular acceptable answer.
Automate, but verify, with LLMs – Use LLMs to generate initial ground truth answers and facts, followed by human review and curation to align them with the desired assistant output standards. Recognize that applying an LLM to generate your ground truth may bias correct answers towards that LLM during evaluation due to matching filler words, and strive to keep ground truth language LLM-agnostic.
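
The takeaways on combining recall with factual knowledge, and on curating multiple representations of each fact, can be sketched as follows. This is an illustration under stated assumptions rather than FMEval's own logic: it assumes a fact counts as found when any curated variant appears as a case-insensitive substring of the response, and the function names, labels, example data, and the 0.5 recall threshold are hypothetical.

```python
def contains_fact(model_response: str, fact_variants: list[str]) -> bool:
    """Return True if any curated representation of the fact appears in the response."""
    response = model_response.lower()
    return any(variant.lower() in response for variant in fact_variants)


def triage(recall: float, fact_found: bool, recall_threshold: float = 0.5) -> str:
    """Combine a recall score with a factual knowledge result into a review label.

    The threshold is a placeholder; establish yours by experimentation.
    """
    high_recall = recall >= recall_threshold
    if high_recall and fact_found:
        return "likely_correct"
    if high_recall and not fact_found:
        # Wording matches the golden answer, but the key fact is missing or wrong.
        return "possible_hallucination"
    if fact_found:
        # The fact appears even though the wording barely matches the golden answer.
        return "possible_accidental_fact_match"
    return "likely_incorrect"


# Made-up fact variants covering punctuation and format differences.
variants = ["1,000", "1000", "one thousand"]

print(contains_fact("The warehouse holds 1,000 units.", variants))
print(triage(recall=0.9, fact_found=False))
print(triage(recall=0.0, fact_found=True))
```

Responses that land in the two middle labels are good candidates for human review, because they can point either to a pipeline problem or to a golden triplet that needs recuration.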

Conclusion

In this post, we outlined best practices for ground truth curation and metric interpretation when evaluating generative AI question answering using FMEval. We demonstrated how to curate ground truth question-answer-fact triplets in consideration of the Factual Knowledge and QA Accuracy metrics calculated by FMEval. To validate our approach, we curated a golden dataset of 10 question-answer-fact triplets from Amazon’s Q2 2023 10Q report. We generated responses from three anonymized generative AI pipelines and calculated QA Accuracy and Factual Knowledge metrics.

Our primary findings emphasize that ground truth curation and metric interpretation are tightly coupled. Ground truth should be curated with the measurement approach in mind, and metric results can in turn prompt updates to the ground truth during golden dataset development. We further recommend curating separate ground truths for QA Accuracy and Factual Knowledge, with particular emphasis on setting a desired level of verbosity according to user experience goals and on writing golden questions with unambiguous interpretations. Closeness and conciseness to ground truth are valid interpretations of FMEval recall and precision metrics, and factual knowledge scores can be used to detect hallucinations. Ultimately, the quantification of the expected user experience in the form of a golden dataset for pipeline evaluation with FMEval supports business decision-making, such as choosing between pipeline options, projecting quality changes from development to production, and adhering to legal and compliance requirements.

Whether you are building an internal application, a customer-facing virtual assistant, or exploring the potential of generative AI for your business, this post can help you use FMEval to make sure your projects meet the highest standards of quality and responsibility. We encourage you to adopt these best practices and start evaluating your generative AI question answering pipelines with the FMEval toolkit today.

About the Authors

Samantha Stuart is a Data Scientist with AWS Professional Services, and has delivered for customers across generative AI, MLOps, and ETL engagements. Samantha has a research master’s degree in engineering from the University of Toronto, where she authored several publications on data-centric AI for drug delivery system design. Outside of work, she is most likely spotted playing music, spending time with friends and family, at the yoga studio, or exploring Toronto.

Rahul Jani is a Data Architect with AWS Professional Services. He collaborates closely with enterprise customers building modern data platforms, generative AI applications, and MLOps. He specializes in the design and implementation of big data and analytical applications on the AWS platform. Beyond work, he values quality time with family and embraces opportunities for travel.

Ivan Cui is a Data Science Lead with AWS Professional Services, where he helps customers build and deploy solutions using ML and generative AI on AWS. He has worked with customers across diverse industries, including software, finance, pharmaceutical, healthcare, IoT, and entertainment and media. In his free time, he enjoys reading, spending time with his family, and traveling.

Andrei Ivanovic is a Data Scientist with AWS Professional Services, with experience delivering internal and external solutions in generative AI, AI/ML, time series forecasting, and geospatial data science. Andrei has a Master’s in CS from the University of Toronto, where he was a researcher at the intersection of deep learning, robotics, and autonomous driving. Outside of work, he enjoys literature, film, strength training, and spending time with loved ones.
