How Travelers Insurance classified emails with Amazon Bedrock and prompt engineering

This is a guest blog post co-written with Jordan Knight, Sara Reynolds, and George Lee from Travelers.

Foundation models (FMs) are used in many ways and perform well on tasks including text generation, text summarization, and question answering. Increasingly, FMs are completing tasks that were previously solved by supervised learning, a subset of machine learning (ML) that involves training algorithms using a labeled dataset. In some cases, smaller supervised models have shown the ability to perform in production environments while meeting latency requirements. However, building an FM-based classifier using an API service such as Amazon Bedrock offers several benefits: faster system development, the ability to switch between models, rapid experimentation for prompt engineering iterations, and extensibility into other related classification tasks. An FM-driven solution can also provide rationale for its outputs, a capability that a traditional classifier lacks. In addition, modern FMs are powerful enough to meet the accuracy and latency requirements needed to replace supervised learning models.

In this post, we walk through how the Generative AI Innovation Center (GenAIIC) collaborated with leading property and casualty insurance carrier Travelers to develop an FM-based classifier through prompt engineering. Travelers receives millions of emails a year with agent or customer requests to service policies. The system GenAIIC and Travelers built uses the predictive capabilities of FMs to classify complex, and sometimes ambiguous, service request emails into several categories. This FM classifier powers the automation system that can save tens of thousands of hours of manual processing and redirect that time toward more complex tasks. With Anthropic’s Claude models on Amazon Bedrock, we formulated the problem as a classification task, and through prompt engineering and partnership with the business subject matter experts, we achieved 91% classification accuracy.

Problem formulation

The main task was classifying emails received by Travelers into a service request category. Requests involved areas such as address changes, coverage adjustments, payroll updates, or exposure changes. Although we used a pre-trained FM, the problem was still formulated as a text classification task. However, instead of using supervised learning, which normally involves training resources, we used prompt engineering with few-shot prompting to predict the class of an email. This allowed us to use a pre-trained FM without incurring training costs. The workflow started with an email; given the email’s text and any PDF attachments, the model assigned the email to one of the defined categories.

Fine-tuning an FM is another approach that could have improved the performance of the classifier, at additional cost. By curating a longer list of examples and expected outputs, an FM can be trained to perform better on a specific task. In this case, because accuracy was already high using prompt engineering alone, any gain from fine-tuning would have to justify the cost. Although Anthropic’s Claude models weren’t available for fine-tuning on Amazon Bedrock at the time of the engagement, fine-tuning for Anthropic’s Claude Haiku is now in beta through Amazon Bedrock.
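As a rough sketch, a fine-tuning job could be launched through the Amazon Bedrock model customization API along the following lines. The job name, model identifier, S3 locations, IAM role, and hyperparameters here are placeholders, not settings from this engagement; check the current Amazon Bedrock documentation for the base models and hyperparameters available to your account.

```python
# Illustrative sketch only: launch a Bedrock fine-tuning (model customization)
# job. All names, ARNs, S3 URIs, and the base model identifier are placeholders.
import boto3

bedrock = boto3.client("bedrock")

response = bedrock.create_model_customization_job(
    jobName="email-classifier-finetune",                  # hypothetical name
    customModelName="claude-haiku-email-classifier",      # hypothetical name
    roleArn="arn:aws:iam::123456789012:role/BedrockFineTuneRole",
    baseModelIdentifier="anthropic.claude-3-haiku-20240307-v1:0",  # placeholder
    customizationType="FINE_TUNING",
    # JSONL file of curated examples and expected outputs
    trainingDataConfig={"s3Uri": "s3://my-bucket/train.jsonl"},
    outputDataConfig={"s3Uri": "s3://my-bucket/output/"},
    # Hyperparameter names vary by base model; "epochCount" is an assumption.
    hyperParameters={"epochCount": "2"},
)
print(response["jobArn"])
```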

Overview of solution

The following diagram illustrates the solution pipeline to classify an email.

The workflow consists of the following steps:

  1. The raw email is ingested into the pipeline. The body text is extracted from the email text files.
  2. If the email has a PDF attachment, the PDF is parsed.
  3. The PDF is split into individual pages. Each page is saved as an image.
  4. The PDF page images are processed by Amazon Textract to extract text, specific entities, and table data using optical character recognition (OCR).
  5. Text from the email is parsed.
  6. The text is then cleaned of HTML tags, if necessary.
  7. The text from the email body and PDF attachment are combined into a single prompt for the large language model (LLM).
  8. Anthropic’s Claude classifies this content into one of 13 defined categories and returns the class. The predictions for each email are then used for performance analysis.
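The following Python sketch shows how these steps might be orchestrated. It is illustrative only: parse_email, pdf_to_page_images, textract_page, and build_prompt are hypothetical helpers (parse_email and textract_page are sketched in the sections that follow), and the model ID and inference parameters are assumptions rather than the production configuration.

```python
# Hypothetical orchestration of steps 1-8; not the production implementation.
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def classify_email(email_path: str) -> str:
    # Steps 1-2 and 5-6: extract the cleaned body text and any PDF attachments.
    body_text, pdf_attachments = parse_email(email_path)

    # Steps 3-4: render each PDF page to an image (for example, with the
    # pdf2image package) and run it through Amazon Textract.
    attachment_text = ""
    for pdf_bytes in pdf_attachments:
        for page_image in pdf_to_page_images(pdf_bytes):
            attachment_text += textract_page(page_image) + "\n"

    # Step 7: combine the email body and attachment text into a single prompt.
    prompt = build_prompt(body_text, attachment_text)

    # Step 8: ask Anthropic's Claude (here, Claude v2) for the class label.
    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-v2",
        body=json.dumps({
            "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
            "max_tokens_to_sample": 50,
            "temperature": 0,
        }),
    )
    return json.loads(response["body"].read())["completion"].strip()
```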

Amazon Textract served multiple purposes, such as extracting the raw text of the forms included as attachments in emails. Additional entity extraction and table data detection were used to identify names, policy numbers, dates, and more. The Amazon Textract output was then combined with the email text and given to the model to decide the appropriate class.
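A per-page Amazon Textract call might look like the following sketch. The feature types and example queries shown are illustrative assumptions, not the exact entities queried in the production system.

```python
# Sketch of an Amazon Textract call for a single PDF page image.
import boto3

textract = boto3.client("textract")

def textract_page(image_bytes: bytes) -> str:
    response = textract.analyze_document(
        Document={"Bytes": image_bytes},
        FeatureTypes=["TABLES", "FORMS", "QUERIES"],
        # Example entity queries; the production queries may differ.
        QueriesConfig={"Queries": [
            {"Text": "What is the policy number?"},
            {"Text": "What is the effective date?"},
        ]},
    )
    # Concatenate the detected lines of text; table and query results are
    # available in other block types (TABLE, CELL, QUERY_RESULT).
    lines = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]
    return "\n".join(lines)
```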

This solution is serverless, which has many benefits for the organization. With a serverless architecture, AWS manages the underlying infrastructure, facilitating a lower cost of ownership and reduced maintenance complexity.

Data

The ground truth dataset contained over 4,000 labeled email examples. The raw emails were in Outlook .msg format and raw .eml format. Approximately 25% of the emails had PDF attachments, of which most were ACORD insurance forms. The PDF forms included additional details that provided a signal for the classifier. Only PDF attachments were processed to limit the scope; other attachments were ignored. For most examples, the body text contained the majority of the predictive signal that aligned with one of the 13 classes.
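For the raw .eml files, the body text and PDF attachments can be pulled out with Python’s standard email package, along the lines of the following sketch (.msg files would require a separate parser such as the extract-msg package). This is an illustrative assumption, not the production parsing code, and it matches the parse_email helper used in the pipeline sketch earlier.

```python
# Sketch of parsing a raw .eml file to get the body text and PDF attachments.
from email import policy
from email.parser import BytesParser

def parse_email(eml_path: str):
    with open(eml_path, "rb") as f:
        msg = BytesParser(policy=policy.default).parse(f)

    # Prefer the plain-text body; HTML cleanup (step 6 of the pipeline)
    # would follow if only an HTML part is present.
    body = msg.get_body(preferencelist=("plain", "html"))
    body_text = body.get_content() if body else ""

    # Keep only PDF attachments; other attachment types are ignored.
    pdf_attachments = [
        part.get_payload(decode=True)
        for part in msg.iter_attachments()
        if part.get_content_type() == "application/pdf"
    ]
    return body_text, pdf_attachments
```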

Prompt engineering

To build a strong prompt, we needed to fully understand the differences between categories so we could provide sufficient explanations for the FM. Through manual analysis of email texts and consultation with business experts, we built a prompt that included a list of explicit instructions on how to classify an email. Additional instructions showed Anthropic’s Claude how to identify key phrases that help distinguish an email’s class from the others. The prompt also included few-shot examples that demonstrated how to perform the classification, and output examples that showed how the FM should format its response. By providing the FM with examples and other prompting techniques, we were able to significantly reduce the variance in the structure and content of the FM output, leading to explainable, predictable, and repeatable results.

The structure of the prompt was as follows:

  • Persona definition
  • Overall instruction
  • Few-shot examples
  • Detailed definitions for each class
  • Email data input
  • Final output instruction
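As a skeletal illustration of this structure, a template along the following lines could assemble the prompt. The persona, instruction, example, and class definitions shown are invented placeholders rather than the production prompt, and build_prompt matches the hypothetical helper used in the pipeline sketch earlier.

```python
# Skeletal prompt following the structure above; all content is an invented
# placeholder, not the production prompt used by Travelers.
FEW_SHOT_EXAMPLES = (
    "Email: Please update the mailing address on policy 12345 to ...\n"
    "Category: Address Change"
)

CLASS_DEFINITIONS = (
    "Address Change: requests to update a policyholder's mailing address.\n"
    "Coverage Adjustment: requests to add, remove, or modify coverage.\n"
    "..."  # one detailed definition per class, 13 in total
)

PROMPT_TEMPLATE = """You are an assistant for an insurance service desk.

Classify the email below into exactly one of the 13 categories defined here.

<examples>
{few_shot_examples}
</examples>

<category_definitions>
{class_definitions}
</category_definitions>

<email>
{email_text}
</email>

Respond with only the category name."""


def build_prompt(email_text: str, attachment_text: str) -> str:
    # Step 7 of the pipeline: combine email body and attachment text.
    return PROMPT_TEMPLATE.format(
        few_shot_examples=FEW_SHOT_EXAMPLES,
        class_definitions=CLASS_DEFINITIONS,
        email_text=f"{email_text}\n\n{attachment_text}".strip(),
    )
```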

To learn more about prompt engineering for Anthropic’s Claude, refer to Prompt engineering in the Anthropic documentation.

“Claude’s ability to understand complex insurance terminology and nuanced policy language makes it particularly adept at tasks like email classification. Its capacity to interpret context and intent, even in ambiguous communications, aligns perfectly with the challenges faced in insurance operations. We’re excited to see how Travelers and AWS have harnessed these capabilities to create such an efficient solution, demonstrating the potential for AI to transform insurance processes.”

– Jonathan Pelosi, Anthropic

Results

For an FM-based classifier to be used in production, it must demonstrate a high level of accuracy. Initial testing without prompt engineering yielded 68% accuracy. After applying a variety of techniques with Anthropic’s Claude v2, such as prompt engineering, condensing categories, refining the document processing pipeline, and improving instructions, accuracy increased to 91%. Anthropic’s Claude Instant on Amazon Bedrock also performed well, with 90% accuracy, and additional areas of improvement were identified.
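Accuracy here is simply the fraction of emails whose predicted class matches the ground truth label. A minimal sketch of that evaluation, assuming the classify_email helper from earlier and a list of (email_path, true_label) pairs:

```python
# Minimal accuracy computation over the labeled dataset; classify_email is
# the hypothetical pipeline sketch shown earlier.
def accuracy(labeled_examples):
    correct = sum(
        classify_email(path) == label for path, label in labeled_examples
    )
    return correct / len(labeled_examples)
```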

Conclusion

In this post, we discussed how FMs can reliably automate the classification of insurance service emails through prompt engineering. When formulating the problem as a classification task, an FM can perform well enough for production environments, while maintaining extensibility into other tasks and getting up and running quickly. All experiments were conducted using Anthropic’s Claude models on Amazon Bedrock.


About the Authors

Jordan Knight is a Senior Data Scientist working for Travelers in the Business Insurance Analytics & Research Department. His passion is for solving challenging real-world computer vision problems and exploring new state-of-the-art methods to do so. He has a particular interest in the social impact of ML models and how we can continue to improve modeling processes to develop ML solutions that are equitable for all. In his free time you can find him either rock climbing, hiking, or continuing to develop his somewhat rudimentary cooking skills.

Sara Reynolds is a Product Owner at Travelers. As a member of the Enterprise AI team, she has advanced efforts to transform processing within Operations using AI and cloud-based technologies. She recently earned her MBA and PhD in Learning Technologies and is serving as an Adjunct Professor at the University of North Texas.

George Lee is AVP, Data Science & Generative AI Lead for International at Travelers Insurance. He specializes in developing enterprise AI solutions, with expertise in Generative AI and Large Language Models. George has led several successful AI initiatives and holds two patents in AI-powered risk assessment. He received his Master’s in Computer Science from the University of Illinois at Urbana-Champaign.

Francisco Calderon is a Data Scientist at the Generative AI Innovation Center (GAIIC). As a member of the GAIIC, he helps discover the art of the possible with AWS customers using generative AI technologies. In his spare time, Francisco likes playing music and guitar, playing soccer with his daughters, and enjoying time with his family.

Isaac Privitera is a Principal Data Scientist with the AWS Generative AI Innovation Center, where he develops bespoke generative AI-based solutions to address customers’ business problems. His primary focus lies in building responsible AI systems, using techniques such as RAG, multi-agent systems, and model fine-tuning. When not immersed in the world of AI, Isaac can be found on the golf course, enjoying a football game, or hiking trails with his loyal canine companion, Barry.
