SK Telecom improves telco-specific Q&A by fine-tuning Anthropic’s Claude models in Amazon Bedrock

SK Telecom (SKT), South Korea’s leading telecommunications company serving 30 million customers, is at the forefront of AI innovation. In line with its AI Pyramid Strategy, which aims to unlock AI’s potential for anyone, anywhere, anytime, SKT has collaborated with the AWS Generative AI Innovation Center (GenAIIC) Custom Model Program to explore domain-trained models using Amazon Bedrock for telco-specific use cases.

This collaboration aligns with SKT’s vision of using AI expertise and strategic partnerships to develop innovative AI-based products and services. One such initiative focused on developing a custom solution for grounded question answering (Q&A) based on reference documents.

Retrieval Augmented Generation (RAG) is a popular technique for Q&A tasks, offering improved factual accuracy and knowledge grounding. However, RAG faces challenges with generating a response not matching preferred tone, style, and manners for telco use cases, as well as retrieving irrelevant documents, potentially leading to inaccurate responses. To address this, SKT and AWS GenAIIC aimed to use model customization to improve Anthropic Claude models on Amazon Bedrock in three key areas:

  • Providing concise and informative answers
  • Correctly referencing links from retrieved documents
  • Answering in a tone and style consistent with SKT and similar to ground truth answers

Additionally, the team explored boosting smaller model performance using synthetic data generated by bigger large language models (LLMs) for knowledge distillation and scenarios with limited labeled training data.

Amazon Bedrock is a fully managed service that offers a variety of LLMs and foundation models (FMs) along with capabilities such as Amazon Bedrock Knowledge Bases, Amazon Bedrock Agents, and Amazon Bedrock Guardrails that can expedite many generative AI use cases. Amazon Bedrock is the only fully managed service that provides you with the ability to fine-tune Claude models. Amazon Bedrock offers an intuitive and secure way of fine-tuning Anthropic’s Claude models and more. The fine-tuned Claude model can be deployed using Amazon Bedrock and can use the capabilities of Amazon Bedrock seamlessly, for example, Amazon Bedrock Knowledge Bases for the telco domain-specific RAG or Amazon Bedrock Agents for the agentic usage.

In this post, we share how SKT customizes Anthropic Claude models for telco-specific Q&A regarding technical telecommunication documents of SKT using Amazon Bedrock.

Solution overview

The team explored combinations of prompt optimization, customization (fine-tuning), and data augmentation with synthetic data. This multifaceted approach aimed to maximize the benefits of each technique for the grounded Q&A generation task.

In the following sections, we explore these methods in more detail.

Anthropic’s Claude customization with prompt optimization

Fine-tuning, which is available through Amazon Bedrock for various FMs, including Anthropic’s Claude, allows adaptation of pre-trained language models for specific use cases. It’s particularly effective for tailoring response style and format adherence.

The team first optimized the system prompt, implementing standardized guidelines for answer formatting and document citation based on Anthropic model prompting best practices. Key focus areas included:

  • Clear presentation of system commands
  • Consistent use of code block formatting
  • Context-based tailored responses

This prompt engineering, combined with fine-tuning, yielded substantial improvements:

  • Over 50% increase in ROUGE-3 score
  • Over 25% improvement in ROUGE-L score
  • Over 4% increase in embedding similarity score
  • Significant progress in accurate reference citation

The iterative enhancement process demonstrated cumulative benefits, with prompt updates alone showing 35–40 percent improvements in key metrics, and the final customized model achieving 50–60 percent gains in some metrics.

This progression clearly illustrates the cumulative benefits of model customization through RAG, prompt engineering, and fine-tuning, resulting in a model that significantly outperformed both the baseline and the prompt-updated versions in terms of ROUGE scores and citation accuracy. ROUGE score measures the similarity between ground truths and generated results by computing N-gram word overlaps. The following table summarizes these improvements.

LLM Prompt update Fine-tuning Relative improvement over baseline
ROUGE-3 ROUGE-L Citation accuracy
Anthropic’s Claude 3 Sonnet baseline baseline baseline
Anthropic’s Claude 3 Sonnet ✅ +38.30% +13.4% +52.94%
Anthropic’s Claude 3 Sonnet ✅ ✅ +58.1% +26.8% +70.59%

Synthetic data for fine-tuning

To address the challenge of limited high-quality labeled training data, the team explored synthetic data generation techniques. This approach also facilitates knowledge distillation from larger LLMs to smaller, more targeted models, offering benefits such as lower latency and cost.

The team conducted controlled experiments using:

  • A baseline set of 500 ground truth samples
  • An augmented set with 500 original over 1,500 synthetic samples
  • A larger original set of 2,000 samples

Synthetic data was generated using Anthropic’s Claude Sonnet 3, creating new question-answer pairs over the same retrieved documents used in ground truth examples.

The results were evaluated using both LLM-based comparison and human preference evaluation. Human evaluators blindly ranked model outputs, with scores assigned based on preference (Best: 4, Second: 3, Third: 2, Worst: 1). The following table shows the results of the human preference evaluation scores.

Rank Model Cumulative score
(best possible: 160)
1 Fine-tuned with 2,000 original samples 114
2 Fine-tuned with 500 original and 1,500 synthetic samples 112
3 Fine-tuned with 500 original samples 85
4 No fine-tuning (baseline) 84

Some key findings include:

  • Small training sets (500 samples) showed minimal improvement over baseline
  • Larger training sets (2,000 samples) scored considerably higher
  • Synthetically augmented data performed similarly to equivalent-sized original data

Although having a large volume of domain-specific training data is always ideal, many businesses have limited available datasets. In such scenarios, synthetic data can play a crucial role in place of original data. This demonstrates the potential of synthetic data for model customization.

Conclusion

SK Telecom’s collaboration with AWS GenAIIC showcases the company’s commitment to developing innovative AI solutions for telco challenges. By using Amazon Bedrock to customize Anthropic’s Claude models, SKT has achieved significant performance improvements for telco-specific, Korean language use cases without the need to build models from scratch. The proof of concept demonstrated significant improvements:

  • ~58% increase in ROUGE-3 score
  • ~27% increase in ROUGE-L score
  • Substantial improvement in returning correct reference links

This approach, combined with synthetic data generation techniques, aligns with SKT’s AI Pyramid Strategy, enabling faster testing and development of new approaches. As SKT continues to focus on key areas such as personal AI assistants, AI healthcare, and AI data centers, this collaboration with AWS represents a significant step in their AI evolution and long-term competitiveness in the global AI landscape.

For those interested in working with AWS on similar projects, visit Generative AI Innovation Center.


About the Authors

Sungmin Hong is a Senior Applied Scientist at AWS Generative AI Innovation Center where he helps expedite the variety of use cases of AWS customers. Before joining Amazon, Sungmin was a postdoctoral research fellow at Harvard Medical School. He holds Ph.D. in Computer Science from New York University. Outside of work, Sungmin enjoys hiking, reading and cooking.

Sujeong Cha is a Deep Learning Architect at the AWS Generative AI Innovation Center, where she specializes in model customization and optimization. She has extensive hands-on experience in solving customers’ business use cases by utilizing generative AI as well as traditional AI/ML solutions. Sujeong holds a M.S. degree in Data Science from New York University.

Arijit Ghosh Chowdhury is a Scientist with the AWS Generative AI Innovation Center, where he works on model customization and optimization. In his role, he works on applied research in fine-tuning and model evaluations to enable GenAI for various industries. He has a Master’s degree in Computer Science from the University of Illinois at Urbana Champaign, where his research focused on question answering, search and domain adaptation.

Yiyue Qian is an Applied Scientist II at the AWS Generative AI Innovation Center, where she supports providing generative AI solutions to AWS customers. In this role, she collaborates with a team of experts to develop innovative AI-driven models for AWS customers across various industries. Yiyue holds a Ph.D. in Computer Science from the University of Notre Dame, where her research focused on advanced machine learning and deep learning techniques.

Wei-Chih Chen is a Machine Learning Engineer at the AWS Generative AI Innovation Center, where he works on model customization and optimization for LLMs. He also builds tools to help his team tackle various aspects of the LLM development life cycle—including fine-tuning, benchmarking, and load-testing—that accelerating the adoption of diverse use cases for AWS customers. He holds an M.S. degree in Computer Science from UC Davis.

Hannah Marlowe is a Senior Manager of Model Customization at the AWS Generative AI Innovation Center. Her team specializes in helping customers develop differentiating Generative AI solutions using their unique and proprietary data to achieve key business outcomes. She holds a Ph.D in Physics from the University of Iowa, with a focus on astronomical X-ray analysis and instrumentation development. Outside of work, she can be found hiking, mountain biking, and skiing around the mountains in Colorado.

Seunghyeon Jeong (Steve) is a team leader of the Platform Application team at SKT. He is responsible for commercializing the Global Intelligence Platform (GIP), which provides AI models and tools. For most of his career, he has been a PM developing various mobile services such as mobile wallet, fashion streaming, and unified login services for SK. His team is expanding the delivery of models and features to make it easier for internal teams to apply AI, contributing to SKT’s AI Transformation. Before entering the AI space, he was a Product Manager, developing and operating various mobile services such as mobile wallet, fashion streaming, and unified login services for the US and Korea.

Sunwoo Lee (Lois) is the team leader of the Data Construction and Evaluation Team within SK Telecom’s Global AI Tech division. She oversees the design and construction of training data for language models, the model performance evaluation process, and its application to services. Her career has focused on NLP within IT, which is a great fit with her background in Linguistics and Korean language education. Alongside her world-class team, she continues to explore and solve fascinating problems such as how to optimize the design of data for language model training, which tasks and methods to implement for validating language model performance, and the best design of AI-human conversations.

Eric Davis is the vice president of the AI Tech Collaboration Group at SKT. Eric oversees tech collaborations with worldwide tech partners to customize large language models (LLMs) for the telecommunications domain. His teams are responsible for designing and building the datasets to tune LLMs, as well as benchmarking LLMs in general and for the telecommunications domain. Eric holds a Master of Science degree in Computer Science from Carnegie Mellon from the Language Technologies Institute and a Bachelor of Arts in Linguistics and Psychology from the University of California, Los Angeles.

Read More