Frugality meets Accuracy: Cost-efficient training of GPT NeoX and Pythia models with AWS Trainium

Large language models (LLMs) have become a topic of daily conversation. Their rapid adoption is evident in the time required to reach 100 million users, which has dropped from 4.5 years for Facebook to an all-time low of just 2 months for ChatGPT. A generative pre-trained transformer (GPT) uses causal autoregressive updates to make predictions. These model architectures have demonstrated stupendous performance on a variety of tasks such as speech recognition, text generation, and question answering. Several recent models such as NeoX, Falcon, and Llama use the GPT architecture as a backbone. Training LLMs requires a colossal amount of compute time, which costs millions of dollars. In this post, we summarize the training procedure of GPT NeoX on AWS Trainium, a purpose-built machine learning (ML) accelerator optimized for deep learning training. We outline how we cost-effectively (3.2 M tokens/$) trained such models with AWS Trainium without losing any model quality.

Solution overview

GPT NeoX and Pythia models

GPT NeoX and Pythia are open-source causal language models from EleutherAI, with approximately 20 billion parameters in NeoX and 6.9 billion in Pythia. Both are decoder models following a similar architectural design to GPT-3. However, they also include several additions that are widely adopted in recent models such as Llama. In particular, they use rotary positional embedding (RoPE) with partial rotation across the head dimensions. The original models (NeoX and Pythia 6.9B) were trained on the openly available Pile dataset with deduplication, using the Megatron and DeepSpeed backends.

We demonstrate the pre-training and fine-tuning of these models on AWS Trainium-based Trn1 instances using the Neuron NeMo library. To establish proof of concept and enable quick reproduction, we use a smaller Wikipedia dataset subset tokenized with the GPT2 byte-pair encoding (BPE) tokenizer.

Walkthrough

Download the pre-tokenized Wikipedia dataset as shown:

export DATA_DIR=~/examples_datasets/gpt2

mkdir -p ${DATA_DIR} && cd ${DATA_DIR}

wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
aws s3 cp s3://neuron-s3/training_datasets/gpt/wikipedia/my-gpt2_text_document.bin . --no-sign-request
aws s3 cp s3://neuron-s3/training_datasets/gpt/wikipedia/my-gpt2_text_document.idx . --no-sign-request
aws s3 cp s3://neuron-s3/training_datasets/gpt/wikipedia/license.txt . --no-sign-request

Both NeoX 20B and Pythia 6.9B use RoPE with partial rotation, for example, rotating 25% of the head dimensions and keeping the rest unrotated. To efficiently implement the partial rotation on the AWS Trainium accelerator, instead of concatenating the rotating and non-rotating dimensions, we append zero frequencies for the non-rotating dimensions and then rotate the complete set of head dimensions. This simple trick helped us improve throughput (sequences processed per second) on AWS Trainium.
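The following is a minimal, standalone PyTorch sketch of this zero-frequency trick; it is not the Neuron NeMo implementation, and the interleaved pairing convention, shapes, and names are illustrative.

import torch

def partial_rope(x, rotary_pct=0.25, base=10000.0):
    # x: [seq_len, num_heads, head_dim]
    seq_len, _, head_dim = x.shape
    rot_dim = int(head_dim * rotary_pct)  # dimensions that actually rotate

    # frequencies for the rotating pairs, zero frequency for the rest
    inv_freq = 1.0 / (base ** (torch.arange(0, rot_dim, 2).float() / rot_dim))
    inv_freq = torch.cat([inv_freq, torch.zeros((head_dim - rot_dim) // 2)])

    pos = torch.arange(seq_len).float()
    angles = torch.einsum("s,d->sd", pos, inv_freq)  # [seq_len, head_dim/2]
    cos = torch.cos(angles).repeat_interleave(2, dim=-1)[:, None, :]
    sin = torch.sin(angles).repeat_interleave(2, dim=-1)[:, None, :]

    # one rotation over the full head dimension; zero-frequency pairs get
    # cos=1 and sin=0, so they pass through unrotated
    x1, x2 = x[..., 0::2], x[..., 1::2]
    x_rot = torch.stack((-x2, x1), dim=-1).flatten(-2)
    return x * cos + x_rot * sin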

Training steps

To run the training, we use a SLURM-managed multi-node Amazon Elastic Compute Cloud (Amazon EC2) Trn1 cluster, with each node being a trn1.32xl instance. Each trn1.32xl has 16 accelerators with two workers per accelerator. After downloading the latest Neuron NeMo package, use the provided neox and pythia pre-training and fine-tuning scripts with optimized hyper-parameters and run the following steps for a four-node training.

  1. Compile: Pre-compile the model with three train iterations to generate and save the graphs:
    sbatch --nodes 4 compile.slurm ./neoX_20B_slurm.sh

  2. Run: Execute the training by loading the cached graphs from the first step:
    sbatch --nodes 4 run.slurm ./neoX_20B_slurm.sh

  3. Monitor results
    tensorboard --logdir=nemo_experiments/megatron_neox

Follow the same steps to run the Pythia 6.9B model, replacing neox_20B_slurm.sh with pythia_6.9B_slurm.sh.

Pre-training and fine-tuning experiments

We demonstrate the pre-training of the GPT NeoX and Pythia models on AWS Trainium using the Neuron NeMo library for 10k iterations, and also show fine-tuning of these models for 1k steps. For pre-training, we use the GPT2 BPE tokenizer inside NeMo and follow the same config as used in the original model. Fine-tuning on AWS Trainium requires changing a few parameters (such as the vocab size division factor), which are provided in the fine-tuning scripts to accommodate Megatron versus NeMo differences and GPU versus AWS Trainium changes. The multi-node distributed training throughput with a varying number of nodes is shown in Table 1.

| Model | Tensor Parallel | Pipeline Parallel | Number of instances | Cost ($/hour) | Sequence length | Global batch size | Throughput (seq/sec) | Cost-throughput ratio (tokens/$) |
| Pythia 6.9B | 8 | 1 | 1 | 7.59 | 2048 | 256 | 10.4 | 10,102,387 |
| Pythia 6.9B | 8 | 1 | 4 | 30.36 | 2048 | 256 | 35.8 | 8,693,881 |
| NeoX 20B | 8 | 4 | 4 | 30.36 | 2048 | 16384 | 13.60 | 3,302,704 |
| NeoX 20B | 8 | 4 | 8 | 60.72 | 2048 | 16384 | 26.80 | 3,254,134 |
| NeoX 20B | 8 | 4 | 16 | 121.44 | 2048 | 16384 | 54.30 | 3,296,632 |
| NeoX 20B | 8 | 4 | 32 | 242.88 | 2048 | 16384 | 107.50 | 3,263,241 |
| NeoX 20B | 8 | 4 | 64 | 485.76 | 2048 | 16384 | 212.00 | 3,217,708 |

Table 1. Comparing the mean throughput of GPT NeoX and Pythia models for training up to 500 steps with a changing number of nodes. The pricing of trn1.32xl is based on the 3-year reserved effective per-hour rate.

Next, we evaluate the loss trajectory of the model training on AWS Trainium and compare it with the corresponding run on a P4d (Nvidia A100 GPU) cluster. Along with the training loss, we also compare a useful indicator, the gradient norm, which is the 2-norm of the model gradients computed at each training iteration to monitor training progress. The training results are shown in Figures 1 and 2, and the fine-tuning of NeoX 20B in Figure 3.

Figure-1. Training loss averaged across all workers (left) and gradient norm (right) at each training step. NeoX 20B is trained on 4 nodes with the small wiki dataset on GPU and Trainium with the same training hyper-parameters (global batch size=256). GPU uses BF16 and default mixed precision while AWS Trainium uses full BF16 with stochastic rounding. The loss and gradient norm trajectories match for GPU and AWS Trainium.

Figure-2. Training loss averaged across all workers (left) and gradient norm (right) at each training step. Similar to GPT NeoX in Figure-1, Pythia 6.9B is trained on 4 nodes with the small wiki dataset on GPU and Trainium with the same training hyper-parameters (global batch size=256). The loss and gradient norm trajectories match for GPU and Trainium.

Figure-3. Fine-tuning GPT NeoX 20B model on GPU and AWS Trainium with training loss averaged across all workers (left) and gradient norm (right). A small wiki dataset is used for fine-tuning demonstration. The loss and gradient norm trajectories match for GPU and AWS Trainium.

In this post, we showed cost-efficient training of LLMs on AWS deep learning hardware. We trained the GPT NeoX 20B and Pythia 6.9B models on AWS Trn1 with the Neuron NeMo library. The cost-normalized throughput for the 20-billion-parameter model with AWS Trainium is approximately 3.2M tokens per dollar spent. Along with cost-efficient training on AWS Trainium, we obtain similar model accuracy, which is evident from the training loss and gradient norm trajectories. We also fine-tuned the available checkpoints for the NeoX 20B model on AWS Trainium. For additional information on distributed training with NeMo Megatron on AWS Trainium, see AWS Neuron Reference for NeMo Megatron. A good resource to start fine-tuning the Llama model is Llama2 fine-tuning. To get started with managed AWS Trainium on Amazon SageMaker, see Train your ML Models with AWS Trainium and Amazon SageMaker.


About the Authors

Gaurav Gupta is currently an Applied Scientist at Amazon Web Services (AWS) AI Labs. Dr. Gupta completed his PhD at USC Viterbi. His research interests span sequential data modeling, learning partial differential equations, information theory for machine learning, fractional dynamical models, and complex networks. He is currently working on applied and mathematical problems in LLM training behavior, vision models with PDEs, and information-theoretic multi-modality models. Dr. Gupta has publications in top journals and conferences such as NeurIPS, ICLR, ICML, Nature, IEEE Control Society, and ACM Cyber-Physical Society.

Ben Snyder is an applied scientist with AWS Deep Learning. His research interests include foundational models, reinforcement learning, and asynchronous optimization. Outside of work, he enjoys cycling and backcountry camping.

Amith (R) Mamidala is a senior machine learning application engineer at AWS Annapurna Labs. Dr. Mamidala completed his PhD at the Ohio State University in high performance computing and communication. During his tenure at IBM Research, Dr. Mamidala contributed to the BlueGene class of computers, which often led the Top500 ranking of the most powerful and power-efficient supercomputers. The project was awarded the 2009 National Medal of Technology and Innovation. After a brief stint as an AI engineer at a financial hedge fund, Dr. Mamidala joined Annapurna Labs, focusing on large language model training.

Jun (Luke) Huan is a principal scientist at AWS AI Labs. Dr. Huan works on AI and Data Science. He has published more than 180 peer-reviewed papers in leading conferences and journals. He was a recipient of the NSF Faculty Early Career Development Award in 2009. Before joining AWS, he worked at Baidu research as a distinguished scientist and the head of Baidu Big Data Laboratory. He founded StylingAI Inc., an AI start-up, and worked as the CEO and Chief Scientist in 2019-2021. Before joining industry, he was the Charles E. and Mary Jane Spahr Professor in the EECS Department at the University of Kansas.

Shruti Koparkar is a Senior Product Marketing Manager at AWS. She helps customers explore, evaluate, and adopt Amazon EC2 accelerated computing infrastructure for their machine learning needs.

Vodafone advances its machine learning skills with AWS DeepRacer and Accenture

Vodafone is transitioning from a telecommunications company (telco) to a technology company (TechCo) by 2025, with objectives of innovating faster, reducing costs, improving security, and simplifying operations. Thousands of engineers are being onboarded to contribute to this transition. By 2025, Vodafone plans to have 50% of its global workforce actively involved in software development, with an objective to deliver 60% of digital services in-house. This new workforce requires rapid reskilling and understanding of disruptive services such as artificial intelligence (AI) and machine learning (ML) to drive meaningful outcomes.

To help achieve this ambitious transition, Vodafone has partnered with Accenture and AWS to build a cloud platform that helps its engineers work in flexible, creative, and agile ways by providing them with a curated set of managed, security- and DevOps-oriented AWS services and application workloads. To learn more, check out Redefining Vodafone’s customer experience with AWS and the following talk at AWS re:Invent 2022.

Vodafone Digital Engineering (VDE) invited Accenture and AWS to co-host an exclusive event at its annual DigiFest, a week-long event celebrating the scale of its global VDE teams, championing reusable apps and collaborative idea generation. As one of the main events of the DigiFest, AWS and Accenture conceptualized a company-wide AWS DeepRacer challenge where engineers could build and train their models to become better versed in using ML with AWS.

In this post, we share how Vodafone is advancing its ML skills using AWS DeepRacer and Accenture.

Why is machine learning important to Vodafone?

Machine learning is one of the fastest growing domains in technology and telecommunications, owing to the benefits of improved productivity and forecasting across key domains in telecommunications such as channels, CRM, billing, order management, service assurance, network management, and more.

Vodafone has already adopted ML in the proactive detection and correction of network anomalies to improve customer satisfaction. Their AI and ML capabilities in digital self-care, via a chatbot, have been helping their customer care team focus on cases that need deeper attention. Because they use AWS for providing digital services packaged as telco as a service, incorporating AI and ML components is crucial to maintain a competitive edge in delivering state-of-the-art services to customers.

Why AWS DeepRacer?

AWS DeepRacer is an interesting and fun way to get started with reinforcement learning (RL). RL is an advanced ML technique that takes a very different approach to training models than other ML methods. Its superpower is that it learns very complex behaviors without requiring any labeled training data, and can make short-term decisions while optimizing for a longer-term goal. The AWS DeepRacer Challenge provided an opportunity for Vodafone’s engineers to engage in a friendly competition, develop an ML mindset, and share insights on how to succeed in a private virtual racing event.

Racing with AWS DeepRacer

The event played out in three stages, starting with a workshop on AWS DeepRacer to cover the basics of reinforcement learning, which was attended by over 225 Vodafone engineers. They learned how to fine-tune an AWS DeepRacer model by creating a reward function, exploring the action space, systematically tuning hyperparameters, examining the training job progress, evaluating the model, and testing the model on a virtual AWS DeepRacer vehicle and virtual track.

In the next stage, a league race was organized where 130 racers were able to view the race videos of the best model submission of every participant on a live leaderboard. This helped them understand how a high-performing model behaves after it’s trained. They quickly understood that overtraining occurs when a model is trained for too long, leading to overfitting and underperformance in a new environment. They also experimented with different styles of reward functions, such as follow the center line, excessive steering penalty, slowness penalty, and progress rewards.
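For reference, AWS DeepRacer reward functions are plain Python functions that receive a params dictionary describing the car’s state. The following is a minimal follow-the-center-line example in that standard format; the marker thresholds are illustrative.

def reward_function(params):
    # read the car's state from the params dictionary
    track_width = params['track_width']
    distance_from_center = params['distance_from_center']

    # markers at increasing distance from the center line
    marker_1 = 0.1 * track_width
    marker_2 = 0.25 * track_width
    marker_3 = 0.5 * track_width

    if distance_from_center <= marker_1:
        reward = 1.0      # hugging the center line
    elif distance_from_center <= marker_2:
        reward = 0.5
    elif distance_from_center <= marker_3:
        reward = 0.1
    else:
        reward = 1e-3     # likely close to going off track

    return float(reward)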

The event culminated with a grand finale, a showdown of 11 racers who tuned their models one final time to compete in a live race with commentary. All 11 racers completed a full lap with their models. Eight racers had a lap time of less than 15 seconds, with the winner coming in with an incredible lap time of 11.194 seconds on the tricky Toronto Turnpike virtual race track.

Summary

The goal of the AWS DeepRacer Challenge was to build awareness and excitement of ML on AWS for a global cloud engineering audience with varying technology skills and competencies. The tournament exceeded 585 total registrations across the globe, with over 400 models submitted and over 600 hours of training and evaluation.

Vodafone was able to help a broad range of builders get hands-on with ML through the AWS DeepRacer challenge. With over 47% of participants being AWS and ML beginners, the event reaffirms how effective AWS DeepRacer can be in introducing ML with AWS in a safe and engaging environment for beginners.

“Having the Digital Engineering team attend events like DigiFest and participate in challenges like AWS DeepRacer is a huge part of our vision of building a world-class software engineering team in Vodafone. As we take on the complex challenge of transforming a telecommunications company into a technology company, growing our skillset becomes a top priority and our partnership with Accenture and AWS has provided the team with not just this, but multiple opportunities to learn and develop. I am excited for more of this to come!”

Ben Connolly, Vodafone Global Director of Cloud Engineering


About the Author

Ramakrishna Natarajan is a Senior Partner Solutions Architect at Amazon Web Services. He is based out of London and helps AWS Partners find optimal solutions on AWS for their customers. He specialises in Telecommunications OSS/BSS and has a keen interest in evolving domains such as AI/ML, Data Analytics, Security and Modernisation. He enjoys playing squash, going on long hikes and learning new languages.

Getir end-to-end workforce management: Amazon Forecast and AWS Step Functions

This is a guest post co-authored by Nafi Ahmet Turgut, Mehmet İkbal Özmen, Hasan Burak Yel, Fatma Nur Dumlupınar Keşir, Mutlu Polatcan and Emre Uzel from Getir.

Getir is the pioneer of ultrafast grocery delivery. The technology company has revolutionized last-mile delivery with its grocery in-minutes delivery proposition. Getir was founded in 2015 and operates in Turkey, the UK, the Netherlands, Germany, and the United States. Today, Getir is a conglomerate incorporating nine verticals under the same brand.

In this post, we describe the end-to-end workforce management system that begins with location-specific demand forecast, followed by courier workforce planning and shift assignment using Amazon Forecast and AWS Step Functions.

In the past, operational teams engaged in manual workforce management practices, which resulted in a significant waste of time and effort. However, with the implementation of our comprehensive end-to-end workforce management project, they are now able to efficiently generate the necessary courier plans for warehouses through a simplified, one-click process accessible via a web interface. Before the initiation of this project, business teams relied on more intuitive methods for demand forecasting, which required improvement in terms of precision.

Amazon Forecast is a fully managed service that uses machine learning (ML) algorithms to deliver highly accurate time series forecasts. In this post, we describe how we reduced modeling time by 70% by doing the feature engineering and modeling with Amazon Forecast. We achieved a 90% reduction in elapsed time when running scheduling algorithms for all warehouses using AWS Step Functions, a fully managed service that makes it easier to coordinate the components of distributed applications and microservices using visual workflows. The solution also achieved approximately 90% prediction accuracy across Turkey and several European countries.

Solution overview

The End-to-End Workforce Management Project (E2E Project) is a large-scale project that can be described in three parts:

1. Calculating courier requirements

The first step is to estimate hourly demand for each warehouse, as explained in the Algorithm selection section. These predictions, produced with Amazon Forecast, help determine when and how many couriers each warehouse needs.

Based on the throughput ratio of the couriers in warehouses, the number of couriers required for each warehouse is calculated in hourly intervals. These calculations assist in determining the feasible courier counts considering legal working hours, which involves mathematical modeling.

2. Solving the shift assignment problem

Once we have the courier requirements and know the other constraints of the couriers and warehouses, we can solve the shift assignment problem. The problem is modeled with decision variables determining the couriers to be assigned and creating shift schedules, minimizing the surplus and shortage that may cause missed orders. This is typically a mixed-integer programming (MIP) problem.
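The following is a toy sketch of such a shift assignment MIP using the open-source PuLP library; it is not Getir’s production model, and the couriers, shifts, requirements, and objective weights are made up for illustration.

import pulp

couriers = ["c1", "c2", "c3"]
shifts = ["morning", "evening"]
required = {"morning": 2, "evening": 1}   # couriers needed per shift

prob = pulp.LpProblem("shift_assignment", pulp.LpMinimize)

# x[c][s] = 1 if courier c is assigned to shift s
x = pulp.LpVariable.dicts("assign", (couriers, shifts), cat="Binary")
# slack variables capturing shortage and surplus per shift
short = pulp.LpVariable.dicts("shortage", shifts, lowBound=0)
over = pulp.LpVariable.dicts("surplus", shifts, lowBound=0)

# minimize weighted shortage (missed orders) and surplus (idle couriers)
prob += pulp.lpSum(10 * short[s] + over[s] for s in shifts)

for s in shifts:
    # staffing balance per shift
    prob += pulp.lpSum(x[c][s] for c in couriers) + short[s] - over[s] == required[s]

for c in couriers:
    # stand-in for legal working-hour limits: at most one shift per courier per day
    prob += pulp.lpSum(x[c][s] for s in shifts) <= 1

prob.solve()
print({(c, s): x[c][s].value() for c in couriers for s in shifts})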

3. Utilizing AWS Step Functions

We use AWS Step Functions to coordinate and manage workflows, taking advantage of its capability to run jobs in parallel. Each warehouse’s shift assignment process is defined as a separate workflow. AWS Step Functions automatically initiates and monitors these workflows and simplifies error handling.

Because this process requires extensive data and complex computations, services like AWS Step Functions offer a significant advantage in organizing and optimizing tasks. They allow for better control and efficient resource management.
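As a minimal sketch of this pattern, the following snippet starts one Step Functions execution per warehouse with boto3; the state machine ARN, warehouse IDs, and input payload are placeholders.

import json
import boto3

sfn = boto3.client("stepfunctions")
STATE_MACHINE_ARN = "arn:aws:states:eu-west-1:123456789012:stateMachine:shift-assignment"  # placeholder

warehouses = ["warehouse-001", "warehouse-002", "warehouse-003"]

for warehouse in warehouses:
    # one parallel execution per warehouse; execution names must be unique
    sfn.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        name=f"shift-assignment-{warehouse}",
        input=json.dumps({"warehouse_id": warehouse}),
    )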

In the solution architecture, we also take advantage of other AWS services by integrating them into the AWS Step Functions workflows.

The following diagrams show the AWS Step Functions workflows and the architecture of the shifting tool:

Figure 1 AWS Step Functions workflows

Figure 2 Shifting tool architecture

Algorithm selection

Forecasting locational demand constitutes the initial phase in the E2E project. The overarching goal of E2E is to determine the number of couriers to allocate to a specific warehouse, commencing with a forecast of the demand for that warehouse.

This forecasting component is pivotal within the E2E framework, as subsequent phases rely on these forecasting outcomes. Thus, any prediction inaccuracies can detrimentally impact the entire project’s efficacy.

The objective of the locational demand forecast phase is to generate predictions on a country-specific basis for every warehouse, segmented hourly over the forthcoming two weeks. Initially, daily forecasts for each country are formulated through ML models. These daily predictions are subsequently broken down into hourly segments. Historic transactional demand data, location-based weather information, holiday dates, promotions, and marketing campaign data are the features used in the model, as shown in the following figure.

Figure 3 The architecture of location-specific forecasting

The team initially explored traditional forecasting techniques such as open-source SARIMA (Seasonal Auto-Regressive Integrated Moving Average), ARIMAX (Auto-Regressive Integrated Moving Average using exogenous variables), and Exponential Smoothing.

ARIMA (Auto-Regressive Integrated Moving Average) is a time series forecasting method that combines autoregressive (AR) and moving average (MA) components along with differencing to make the time series stationary.

SARIMA extends ARIMA by incorporating additional parameters to account for seasonality in the time series. It includes seasonal auto-regressive and seasonal moving average terms to capture repeating patterns over specific intervals, making it suitable for time series with a seasonal component.

ARIMAX builds upon ARIMA by introducing exogenous variables, which are external factors that can influence the time series. These additional variables are considered in the model to improve forecasting accuracy by accounting for external influences beyond the historical values of the time series.

Exponential Smoothing is another time series forecasting method that, unlike ARIMA, is based on weighted averages of past observations. It is particularly effective for capturing trends and seasonality in data. The method assigns exponentially decreasing weights to past observations, with more recent observations receiving higher weights.
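For illustration, the following is a minimal SARIMAX baseline of the kind described above, built with the open-source statsmodels library; the file path, column names, and model orders are placeholders, not the configuration Getir used.

import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# hourly demand with a temperature column as an exogenous (ARIMAX-style) feature
df = pd.read_csv("hourly_demand.csv", parse_dates=["timestamp"], index_col="timestamp")

model = SARIMAX(
    df["order_count"],
    exog=df[["temperature"]],
    order=(1, 1, 1),               # non-seasonal (p, d, q)
    seasonal_order=(1, 1, 1, 24),  # seasonal (P, D, Q, s) with a daily cycle
)
result = model.fit(disp=False)

# two-week hourly forecast; real code would pass forecasted weather as exog
future_exog = df[["temperature"]].tail(24 * 14)
forecast = result.forecast(steps=24 * 14, exog=future_exog)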

The Amazon Forecast models were eventually selected for the algorithmic modeling segment. The vast array of models and the sophisticated feature engineering capabilities offered by Amazon Forecast proved more advantageous and optimized our resource utilization.

Six algorithms available in Forecast were tested: Convolutional Neural Network – Quantile Regression (CNN-QR), DeepAR+, Prophet, Non-Parametric Time Series (NPTS), Autoregressive Integrated Moving Average (ARIMA), and Exponential Smoothing (ETS). Upon analysis of the forecast results, we determined that CNN-QR surpassed the others in efficacy. CNN-QR is a proprietary ML algorithm developed by Amazon for forecasting scalar (one-dimensional) time series using causal Convolutional Neural Networks (CNNs). Given the availability of diverse data sources at this juncture, employing the CNN-QR algorithm facilitated the integration of various features, operating within a supervised learning framework. This distinction separated it from univariate time-series forecasting models and markedly enhanced performance.

Utilizing Forecast proved effective due to the simplicity of providing the requisite data and specifying the forecast duration. Subsequently, Forecast employs the CNN-QR algorithm to generate predictions. This tool significantly expedited the process for our team, particularly in algorithmic modeling. Furthermore, utilizing Amazon Simple Storage Service (Amazon S3) buckets for input data repositories and Amazon Redshift for storing outcomes has facilitated centralized management of the entire procedure.
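As a rough sketch of this workflow, the following shows how a CNN-QR predictor and a forecast could be created through the Amazon Forecast API with boto3; the dataset group ARN, resource names, and horizon are placeholders, and each create_* call is asynchronous, so production code would poll the corresponding describe_* calls for status.

import boto3

forecast_client = boto3.client("forecast")

predictor = forecast_client.create_predictor(
    PredictorName="warehouse-demand-cnnqr",
    AlgorithmArn="arn:aws:forecast:::algorithm/CNN-QR",
    ForecastHorizon=24 * 14,   # two weeks of hourly points
    PerformAutoML=False,
    InputDataConfig={"DatasetGroupArn": "arn:aws:forecast:eu-west-1:123456789012:dataset-group/demand"},
    FeaturizationConfig={"ForecastFrequency": "H"},
)

# training is asynchronous; poll describe_predictor until it is ACTIVE, then:
forecast_client.create_forecast(
    ForecastName="warehouse-demand-forecast",
    PredictorArn=predictor["PredictorArn"],
)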

Conclusion

In this post, we showed how Getir’s E2E project combined Amazon Forecast and AWS Step Functions to streamline complex processes effectively. We achieved an impressive prediction accuracy of around 90% across countries in Europe and Turkey, and using Forecast reduced modeling time by 70% due to its efficient handling of feature engineering and modeling.

Using AWS Step Functions has led to practical advantages, notably reducing scheduling time by 90% for all warehouses. Also, by considering field requirements, we improved compliance rates by 3%, helping allocate the workforce more efficiently. This, in turn, highlights the project’s success in optimizing operations and service delivery.

To access further details on commencing your journey with Forecast, please refer to the available Amazon Forecast resources. Additionally, for insights on constructing automated workflows and crafting machine learning pipelines, you can explore AWS Step Functions for comprehensive guidance.


About the Authors

Nafi Ahmet Turgut finished his master’s degree in Electrical & Electronics Engineering and worked as a graduate research scientist. His focus was building machine learning algorithms to simulate nervous network anomalies. He joined Getir in 2019 and currently works as a Senior Data Science & Analytics Manager. His team is responsible for designing, implementing, and maintaining end-to-end machine learning algorithms and data-driven solutions for Getir.

Mehmet İkbal Özmen received his Master’s Degree in Economics and worked as Graduate Research Assistant. His research area was mainly economic time series models, Markov simulations, and recession forecasting. He then joined Getir in 2019 and currently works as Data Science & Analytics Manager. His team is responsible for optimization and forecast algorithms to solve the complex problems experienced by the operation and supply chain businesses.

Hasan Burak Yel received his Bachelor’s Degree in Electrical & Electronics Engineering at Boğaziçi University. He worked at Turkcell, mainly focused on time series forecasting, data visualization, and network automation. He joined Getir in 2021 and currently works as a Data Science & Analytics Manager with the responsibility of Search, Recommendation, and Growth domains.

Fatma Nur Dumlupınar Keşir received her Bachelor’s Degree from Industrial Engineering Department at Boğaziçi University. She worked as a researcher at TUBITAK, focusing on time series forecasting & visualization. She then joined Getir in 2022 as a data scientist and has worked on Recommendation Engine projects, Mathematical Programming for Workforce Planning.

Emre Uzel received his Master’s Degree in Data Science from Koç University. He worked as a data science consultant at Eczacıbaşı Bilişim where he mainly focused on recommendation engine algorithms. He joined Getir in 2022 as a Data Scientist and started working on time-series forecasting and mathematical optimization projects.

Mutlu Polatcan is a Staff Data Engineer at Getir, specializing in designing and building cloud-native data platforms. He loves combining open-source projects with cloud services.

Esra Kayabalı is a Senior Solutions Architect at AWS, specializing in the analytics domain including data warehousing, data lakes, big data analytics, batch and real-time data streaming and data integration. She has 12 years of software development and architecture experience. She is passionate about learning and teaching cloud technologies.

Mitigate hallucinations through Retrieval Augmented Generation using Pinecone vector database & Llama-2 from Amazon SageMaker JumpStart

Despite the seemingly unstoppable adoption of LLMs across industries, they are one component of a broader technology ecosystem that is powering the new AI wave. Many conversational AI use cases require LLMs like Llama 2, Flan T5, and Bloom to respond to user queries. These models rely on parametric knowledge to answer questions. The model learns this knowledge during training and encodes it into the model parameters. In order to update this knowledge, we must retrain the LLM, which takes a lot of time and money.

Fortunately, we can also use source knowledge to inform our LLMs. Source knowledge is information fed into the LLM through an input prompt. One popular approach to providing source knowledge is Retrieval Augmented Generation (RAG). Using RAG, we retrieve relevant information from an external data source and feed that information into the LLM.

In this blog post, we’ll explore how to deploy LLMs such as Llama-2 using Amazon SageMaker JumpStart and keep our LLMs up to date with relevant information through Retrieval Augmented Generation (RAG) using the Pinecone vector database in order to prevent hallucination.

Retrieval Augmented Generation (RAG) in Amazon SageMaker

Pinecone will handle the retrieval component of RAG, but you need two more critical components: somewhere to run the LLM inference and somewhere to run the embedding model.

Amazon SageMaker Studio is an integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools to perform all machine learning (ML) development steps. It provides SageMaker JumpStart, a model hub where users can locate, preview, and launch a particular model in their own SageMaker account. SageMaker JumpStart provides pretrained, publicly available and proprietary models for a wide range of problem types, including foundation models.

Amazon SageMaker Studio provides the ideal environment for developing RAG-enabled LLM pipelines. First, using the AWS console, go to Amazon SageMaker, create a SageMaker Studio domain, and open a Studio notebook.

Prerequisites

Complete the following prerequisite steps:

  1. Set up Amazon SageMaker Studio.
  2. Onboard to an Amazon SageMaker Domain.
  3. Sign up for a free-tier Pinecone Vector Database.
  4. Prerequisite libraries: SageMaker Python SDK, Pinecone Client

Solution Walkthrough

Using a SageMaker Studio notebook, we first need to install the prerequisite libraries:

!pip install -qU sagemaker pinecone-client==2.2.1 ipywidgets==7.0.0 

Deploying an LLM

In this post, we discuss two approaches to deploying an LLM. The first is through the HuggingFaceModel object. You can use this when deploying LLMs (and embedding models) directly from the Hugging Face model hub.

For example, you can create a deployable config for the google/flan-t5-xl model as shown in the following code:

import sagemaker
from sagemaker.huggingface import (
    HuggingFaceModel,
    get_huggingface_llm_image_uri
)

role = sagemaker.get_execution_role()

hub_config = {
    'HF_MODEL_ID': 'google/flan-t5-xl',  # model_id from hf.co/models
    'HF_TASK': 'text-generation'         # NLP task you want to use for predictions
}

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri("huggingface", version="0.8.2")

huggingface_model = HuggingFaceModel(
    env=hub_config,
    role=role,  # iam role with permissions to create an Endpoint
    image_uri=llm_image
)

When deploying models directly from Hugging Face, initialize the model configuration with the following:

  • An env config tells us which model we want to use and for what task.
  • Our SageMaker execution role gives us permissions to deploy our model.
  • An image_uri is an image config specifically for deploying LLMs from Hugging Face.

Alternatively, SageMaker has a set of models directly compatible with a simpler JumpStartModel object. Many popular LLMs like Llama 2 are supported through JumpStart and can be initialized as shown in the following code:

import sagemaker 
from sagemaker.jumpstart.model import JumpStartModel 

role = sagemaker.get_execution_role() 

my_model = JumpStartModel(model_id = "meta-textgeneration-llama-2-7b-f")

For either model object (huggingface_model or my_model), deploy it as shown in the following code:

predictor = my_model.deploy(
    initial_instance_count=1, instance_type="ml.g5.4xlarge", endpoint_name="llama-2-generator")

Querying the pre-trained LLM

With our initialized LLM endpoint, you can begin querying. The format of our queries may vary (particularly between conversational and non-conversational LLMs), but the process is generally the same. For the Hugging Face model, do the following:

# https://aws.amazon.com/blogs/machine-learning/llama-2-foundation-models-from-meta-are-now-available-in-amazon-sagemaker-jumpstart/

prompt = """Answer the following QUESTION based on the CONTEXT
given. If you do not know the answer and the CONTEXT doesn't
contain the answer truthfully say "I don't know

ANSWER:

"""

payload = {
    "inputs":  
      [
        [
         {"role": "system", "content": prompt},
         {"role": "user", "content": question},
        ]   
      ],
   "parameters":{"max_new_tokens": 64, "top_p": 0.9, "temperature": 0.6, "return_full_text": False}
}

out = predictor.predict(payload, custom_attributes='accept_eula=true')
out[0]['generation']['content']

You can find the solution in the GitHub repository.

The generated answer we’re receiving here doesn’t make much sense — it is a hallucination.

Providing Additional Context to LLM

Llama 2 attempts to answer our question based solely on internal parametric knowledge. Clearly, the model parameters do not store knowledge of which instances we can use with Managed Spot Training in SageMaker.

To answer this question correctly, we must use source knowledge. That is, we give additional information to the LLM via the prompt. Let’s add that information directly as additional context for the model.

context = """Managed Spot Training can be used with all instances
supported in Amazon SageMaker. Managed Spot Training is supported
in all AWS Regions where Amazon SageMaker is currently available."""

prompt_template = """Answer the following QUESTION based on the CONTEXT
given. If you do not know the answer and the CONTEXT doesn't
contain the answer truthfully say "I don't know".

CONTEXT:
{context}

ANSWER:
"""

text_input = prompt_template.replace("{context}", context).replace("{question}", question)

payload = {
    "inputs":  
      [
        [
         {"role": "system", "content": text_input},
         {"role": "user", "content": question},
        ]   
      ],
   "parameters":{"max_new_tokens": 64, "top_p": 0.9, "temperature": 0.6, "return_full_text": False}
}

out = predictor.predict(payload, custom_attributes='accept_eula=true')
generated_text = out[0]['generation']['content']
print(f"[Input]: {question}n[Output]: {generated_text}")

[Input]: Which instances can I use with Managed Spot Training in SageMaker?

[Output]:  Based on the given context, you can use Managed Spot Training with all instances supported in Amazon SageMaker. Therefore, the answer is:

All instances supported in Amazon SageMaker.

We now see the correct answer to the question; that was easy! However, a user is unlikely to insert contexts into their prompts manually; if they already knew the relevant context, they would already know the answer to their question.

Rather than manually inserting a single context, automatically identify relevant information from a more extensive database of information. For that, you will need Retrieval Augmented Generation.

Retrieval Augmented Generation

With Retrieval Augmented Generation, you can encode a database of information into a vector space where the proximity between vectors represents their relevance/semantic similarity. With this vector space as a knowledge base, you can convert a new user query, encode it into the same vector space, and retrieve the most relevant records previously indexed.

After retrieving these relevant records, select a few of them and include them in the LLM prompt as additional context, providing the LLM with highly relevant source knowledge. This is a two-step process where:

  • Indexing populates the vector index with information from a dataset.
  • Retrieval happens during a query and is where we retrieve relevant information from the vector index.

Both steps require an embedding model to translate our human-readable plain text into semantic vector space. Use the highly efficient MiniLM sentence transformer from Hugging Face as shown in the following code. This model is not an LLM and therefore is not initialized in the same way as our Llama 2 model.

hub_config = {
    "HF_MODEL_ID": "sentence-transformers/all-MiniLM-L6-v2",  # model_id from hf.co/models
    "HF_TASK": "feature-extraction",
}

huggingface_model = HuggingFaceModel(
    env=hub_config,
    role=role,
    transformers_version="4.6",  # transformers version used
    pytorch_version="1.7",  # pytorch version used
    py_version="py36",  # python version of the DLC
)

In the hub_config, specify the model ID as shown in the preceding code, but for the task, use feature-extraction because we are generating vector embeddings, not text like our LLM. Following this, initialize the model config with HuggingFaceModel as before, but this time without the LLM image and with some version parameters.

encoder = huggingface_model.deploy(
    initial_instance_count=1, instance_type="ml.t2.large", endpoint_name="minilm-embedding"
)

You can deploy the model again with deploy, using the smaller (CPU only) instance of ml.t2.large. The MiniLM model is tiny, so it does not require a lot of memory and doesn’t need a GPU because it can quickly create embeddings even on a CPU. If preferred, you can run the model faster on GPU.

To create embeddings, use the predict method and pass a list of contexts to encode via the inputs key as shown:

out = encoder.predict({"inputs": ["some text here", "some more text goes here too"]})

Two input contexts are passed, returning two context vector embeddings as shown:

len(out)

2

The embedding dimensionality of the MiniLM model is 384, which means each vector embedding MiniLM outputs should have a dimensionality of 384. However, looking at the length of our embeddings, you will see the following:

len(out[0]), len(out[1])

(8, 8)

The two lists contain eight items each. MiniLM first processes text in a tokenization step. This tokenization transforms our human-readable plain text into a list of model-readable token IDs. In the output features of the model, you can see the token-level embeddings. One of these embeddings shows the expected dimensionality of 384, as shown:

len(out[0][0])

384

Transform these token-level embeddings into document-level embeddings by using the mean values across each vector dimension, as shown in the following illustration.

Mean pooling operation to get a single 384-dimensional vector.

import numpy as np

embeddings = np.mean(np.array(out), axis=1)
embeddings.shape

(2, 384)

We now have two 384-dimensional vector embeddings, one for each input text. To make our lives easier, wrap the encoding process into a single function as shown in the following code:

from typing import List

def embed_docs(docs: List[str]) -> List[List[float]]:
    out = encoder.predict({"inputs": docs})
    embeddings = np.mean(np.array(out), axis=1)
    return embeddings.tolist()

Downloading the Dataset

Download the Amazon SageMaker FAQs as the knowledge base to get the data which contains both question and answer columns.

When performing the search, look for answers only, so you can drop the Question column. See the notebook for details.

Our dataset and the embedding pipeline are ready. Now all we need is somewhere to store those embeddings.

Indexing

The Pinecone vector database stores vector embeddings and searches them efficiently at scale. To create a database, you will need a free API key from Pinecone.

import pinecone
import os

# add Pinecone API key from app.pinecone.io
api_key = os.environ.get("PINECONE_API_KEY") or "YOUR_API_KEY"
# set Pinecone environment - find next to API key in console
env = os.environ.get("PINECONE_ENVIRONMENT") or "YOUR_ENV"

pinecone.init(api_key=api_key, environment=env)

After you have connected to the Pinecone vector database, create a single vector index (similar to a table in traditional DBs). Name the index retrieval-augmentation-aws and align the index dimension and metric parameters with those required by the embedding model (MiniLM in this case).

import time

index_name = "retrieval-augmentation-aws"

if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)

pinecone.create_index(name=index_name, dimension=embeddings.shape[1], metric="cosine")
# wait for index to finish initialization
while not pinecone.describe_index(index_name).status["ready"]:
    time.sleep(1)

To begin inserting data, run the following:

from tqdm.auto import tqdm

batch_size = 2  # can increase but needs larger instance size otherwise instance runs out of memory
vector_limit = 1000

answers = df_knowledge[:vector_limit]
index = pinecone.Index(index_name)

for i in tqdm(range(0, len(answers), batch_size)):
    # find end of batch
    i_end = min(i + batch_size, len(answers))
    # create IDs batch
    ids = [str(x) for x in range(i, i_end)]
    # create metadata batch
    metadatas = [{"text": text} for text in answers["Answer"][i:i_end]]
    # create embeddings
    texts = answers["Answer"][i:i_end].tolist()
    embeddings = embed_docs(texts)
    # create records list for upsert
    records = zip(ids, embeddings, metadatas)
    # upsert to Pinecone
    index.upsert(vectors=records)

You can begin querying the index with the question from earlier in this post.

# extract embeddings for the questions
query_vec = embed_docs(question)[0]

# query pinecone
res = index.query(query_vec, top_k=1, include_metadata=True)

# show the results
res
{'matches': [{'id': '90',
              'metadata': {'text': 'Managed Spot Training can be used with all '
                                   'instances supported in Amazon '
                                   'SageMaker.\r\n'},
              'score': 0.881181657,
              'values': []}],
 'namespace': ''}

The above output shows that we’re returning relevant contexts to help us answer our question. Because we set top_k = 1, index.query returned the top result alongside the metadata, which reads Managed Spot Training can be used with all instances supported in Amazon SageMaker.

Augmenting the Prompt

Use the retrieved contexts to augment the prompt, and decide on a maximum amount of context to feed into the LLM. Use a 1,000-character limit to iteratively add each returned context to the prompt until you exceed the context length.
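The construct_context and create_payload helpers used in the following snippets are defined in the accompanying notebook. As a rough sketch of what they might look like (the notebook’s implementations may differ), the following reuses the prompt_template and payload format from earlier and enforces the 1,000-character budget:

def construct_context(contexts: List[str], max_context_length: int = 1000) -> str:
    # add contexts in order of relevance until the character budget is reached
    chosen, total = [], 0
    for text in contexts:
        if total + len(text) > max_context_length:
            break
        chosen.append(text)
        total += len(text)
    return "\n".join(chosen)

def create_payload(question: str, context_str: str) -> dict:
    # reuse the prompt_template and chat payload format from earlier
    text_input = prompt_template.replace("{context}", context_str)
    return {
        "inputs": [[
            {"role": "system", "content": text_input},
            {"role": "user", "content": question},
        ]],
        "parameters": {"max_new_tokens": 64, "top_p": 0.9,
                       "temperature": 0.6, "return_full_text": False},
    }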

Feed the context_str into the LLM prompt as shown in the following code:

payload = create_payload(question, context_str)
out = predictor.predict(payload, custom_attributes='accept_eula=true')
generated_text = out[0]['generation']['content']
print(f"[Input]: {question}n[Output]: {generated_text}")
[Input]: Which instances can I use with Managed Spot Training in SageMaker?

[Output]:  Based on the context provided, you can use Managed Spot Training with all instances supported in Amazon SageMaker. Therefore, the answer is:


All instances supported in Amazon SageMaker.

The logic works, so wrap it up into a single function to keep things clean.

def rag_query(question: str) -> str:
    # create query vec
    query_vec = embed_docs(question)[0]
    # query pinecone
    res = index.query(query_vec, top_k=5, include_metadata=True)
    # get contexts
    contexts = [match.metadata["text"] for match in res.matches]
    # build the multiple contexts string
    context_str = construct_context(contexts=contexts)
    # create our retrieval augmented prompt
    payload = create_payload(question, context_str)
    # make prediction
    out = predictor.predict(payload, custom_attributes='accept_eula=true')
    return out[0]["generation"]["content"]

You can now ask questions like those shown in the following:

rag_query("Does SageMaker support spot instances?")

' Yes, Amazon SageMaker supports spot instances for managed spot training. According to the provided context, Managed Spot Training can be used with all instances supported in Amazon SageMaker, and Managed Spot Training is supported in all AWS Regions where Amazon SageMaker is currently available.\n\nTherefore, the answer to your question is:\n\nYes, SageMaker supports spot instances in all regions where Amazon SageMaker is available.'

Clean up

To stop incurring any unwanted charges, delete the models and endpoints.

encoder.delete_model()
encoder.delete_endpoint()

# also delete the LLM endpoint deployed earlier
predictor.delete_endpoint()

Conclusion

In this post, we introduced you to RAG with open-access LLMs on SageMaker. We also showed how to deploy Amazon SageMaker JumpStart models with Llama 2, Hugging Face LLMs with Flan T5, and embedding models with MiniLM.

We implemented a complete end-to-end RAG pipeline using our open-access models and a Pinecone vector index. Using this, we showed how to minimize hallucinations, keep LLM knowledge up to date, and ultimately enhance the user experience and trust in our systems.

To run this example on your own, clone this GitHub repository and walk through the previous steps using the Question Answering notebook on GitHub.


About the authors

Vedant Jain is a Sr. AI/ML Specialist, working on strategic Generative AI initiatives. Prior to joining AWS, Vedant held ML/Data Science Specialty positions at various companies such as Databricks, Hortonworks (now Cloudera) & JP Morgan Chase. Outside of his work, Vedant is passionate about making music, rock climbing, using science to lead a meaningful life & exploring cuisines from around the world.

James Briggs is a Staff Developer Advocate at Pinecone, specializing in vector search and AI/ML. He guides developers and businesses in developing their own GenAI solutions through online education. Prior to Pinecone James worked on AI for small tech startups to established finance corporations. Outside of work, James has a passion for traveling and embracing new adventures, ranging from surfing and scuba to Muay Thai and BJJ.

Xin Huang is a Senior Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A.

Techniques for automatic summarization of documents using language models

Summarization is the technique of condensing sizable information into a compact and meaningful form, and stands as a cornerstone of efficient communication in our information-rich age. In a world full of data, summarizing long texts into brief summaries saves time and helps make informed decisions. Summarization condenses content and improves clarity by presenting information concisely and coherently, making it invaluable for decision-making and for managing large volumes of content.

Summarization methods have a broad range of applications serving various purposes, such as:

  • News aggregation – News aggregation involves summarizing news articles into a newsletter for the media industry
  • Legal document summarization – Legal document summarization helps legal professionals extract key legal information from lengthy documents like terms, conditions, and contracts
  • Academic research – Summarization annotates, indexes, condenses, and simplifies important information from academic papers
  • Content curation for blogs and websites – You can create engaging and original content summaries for readers, especially in marketing
  • Financial reports and market analysis – You can extract financial insights from reports and create executive summaries for investor presentations in the finance industry

With the advancements in natural language processing (NLP), language models, and generative AI, summarizing texts of varying lengths has become more accessible. Tools like LangChain, combined with a large language model (LLM) powered by Amazon Bedrock or Amazon SageMaker JumpStart, simplify the implementation process.

This post delves into the following summarization techniques:

  • Extractive summarization using the BERT extractive summarizer
  • Abstractive summarization using specialized summarization models and LLMs
  • Two multi-level summarization techniques:
    • Extractive-abstractive summarization using the extractive-abstractive content summarization strategy (EACSS)
    • Abstractive-abstractive summarization using Map Reduce and Map ReRank

Text Summarization Techniques

The complete code sample is found in the GitHub repo. You can launch this solution in Amazon SageMaker Studio.

Types of summarizations

There are several techniques to summarize text, which are broadly categorized into two main approaches: extractive and abstractive summarization. Furthermore, multi-level summarization methodologies incorporate a series of steps, combining both extractive and abstractive techniques. These multi-level approaches are advantageous when dealing with text with tokens longer than the limit of an LLM, enabling an understanding of complex narratives.

Extractive summarization

Extractive summarization is a technique used in NLP and text analysis to create a summary by extracting key sentences. Instead of generating new sentences or content as in abstractive summarization, extractive summarization relies on identifying and pulling out the most relevant and informative portions of the original text to create a condensed version.

Extractive summarization, although advantageous in preserving the original content and ensuring high readability by directly pulling important sentences from the source text, has limitations. It lacks creativity, is unable to generate novel sentences, and may overlook nuanced details, potentially missing important information. Moreover, it may produce lengthy summaries, sometimes overwhelming readers with excessive and unwanted information. There are many extractive summarization techniques, such as TextRank and LexRank. In this post, we focus on the BERT extractive summarizer.

BERT extractive summarizer

The BERT extractive summarizer is a type of extractive summarization model that uses the BERT language model to extract the most important sentences from a text. BERT is a pre-trained language model that can be fine-tuned for a variety of tasks, including text summarization. It works by first embedding the sentences in the text using BERT. This produces a vector representation for each sentence that captures its meaning and context. The model then uses a clustering algorithm to group the sentences into clusters. The sentences that are closest to the center of each cluster are selected to form the summary.

Compared with LLMs, the advantage of the BERT extractive summarizer is it’s relatively straightforward to train and deploy the model and it’s more explainable. The disadvantage is the summarization isn’t creative and doesn’t generate sentences. It only selects sentences from the original text. This limits its ability to summarize complex or nuanced texts.
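As a minimal illustration, the following uses the open-source bert-extractive-summarizer package (pip install bert-extractive-summarizer) to keep the most central sentences of a short passage; the sample text and parameters are illustrative.

from summarizer import Summarizer

document = """Amazon SageMaker is a fully managed service to build, train,
and deploy machine learning models. It removes the heavy lifting from each
step of the machine learning process. Data scientists can use built-in
algorithms or bring their own frameworks and libraries."""

model = Summarizer()                        # BERT sentence embeddings + clustering
summary = model(document, num_sentences=2)  # keep the two most central sentences
print(summary)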

Abstractive summarization

Abstractive summarization is a technique used in NLP and text analysis to create a summary that goes beyond mere extraction of sentences or phrases from the source text. Instead of selecting and reorganizing existing content, abstractive summarization generates new sentences or phrases that capture the core meaning and main ideas of the original text in a more condensed and coherent form. This approach requires the model to understand the content of the text and express it in a way that is not necessarily present in the source material.

Specialized summarization models

These pre-trained natural language models, such as BART and PEGASUS, are specifically tailored for text summarization tasks. They employ encoder-decoder architectures and are smaller in parameters compared to their counterparts. This reduced size allows for ease of fine-tuning and deployment on smaller instances. However, it’s important to note that these summarization models also come with smaller input and output token sizes. Unlike their more general-purpose counterparts, these models are exclusively designed for summarization tasks. As a result, the input required for these models is solely the text that needs to be summarized.
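As a brief illustration of this class of models, the following uses the Hugging Face pipeline with BART; the model choice and length limits are illustrative.

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = "..."  # text to summarize, within the model's input token limit
summary = summarizer(article, max_length=130, min_length=30, do_sample=False)
print(summary[0]["summary_text"])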

Large language models

A large language model refers to any model that undergoes training on extensive and diverse datasets, typically through self-supervised learning at a large scale, and is capable of being fine-tuned to suit a wide array of specific downstream tasks. These models are larger in parameter size and perform better in tasks. Notably, they feature substantially larger input token sizes, some going up to 100,000, such as Anthropic’s Claude. To use one of these models, AWS offers the fully managed service Amazon Bedrock. If you need more control of the model development lifecycle, you can deploy LLMs through SageMaker.

Given their versatile nature, these models require specific task instructions provided through input text, a practice referred to as prompt engineering. This creative process yields varying outcomes based on the model type and input text. The effectiveness of both the model’s performance and the prompt’s quality significantly influence the final quality of the model’s outputs. The following are some tips when engineering prompts for summarization:

  • Include the text to summarize – Input the text that needs to be summarized. This serves as the source material for the summary.
  • Define the task – Clearly state that the objective is text summarization. For example, “Summarize the following text: [input text].”
  • Provide context – Offer a brief introduction or context for the given text that needs to be summarized. This helps the model understand the content and context. For example, “You are given the following article about Artificial Intelligence and its role in Healthcare: [input text].”
  • Prompt for the summary – Prompt the model to generate a summary of the provided text. Be clear about the desired length or format of the summary. For example, “Please generate a concise summary of the given article on Artificial Intelligence and its role in Healthcare: [input text].”
  • Set constraints or length guidelines – Optionally, guide the length of the summary by specifying a desired word count, sentence count, or character limit. For example, “Please generate a summary that is no longer than 50 words: [input text].”

Effective prompt engineering is critical for ensuring that the generated summaries are accurate, relevant, and aligned with the intended summarization task. Refine the prompt for optimal summarization result with experiments and iterations. After you have established the effectiveness of the prompts, you can reuse them with the use of prompt templates.
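As an illustration of such a prompt in practice, the following sketch sends a summarization request to Anthropic Claude on Amazon Bedrock with boto3; the model ID, prompt wording, and parameters are assumptions rather than a prescribed configuration.

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

text = "..."  # document to summarize
prompt = (
    "\n\nHuman: Please generate a concise summary of the following text "
    f"in no more than 50 words:\n\n{text}\n\nAssistant:"
)

response = bedrock.invoke_model(
    modelId="anthropic.claude-v2",
    body=json.dumps({"prompt": prompt, "max_tokens_to_sample": 300, "temperature": 0.2}),
)
print(json.loads(response["body"].read())["completion"])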

Multi-level summarization

Extractive and abstractive summarizations are useful for shorter texts. However, when the input text exceeds the model’s maximum token limit, multi-level summarization becomes necessary. Multi-level summarization involves a combination of various summarization techniques, such as extractive and abstractive methods, to effectively condense longer texts by applying multiple layers of summarization processes. In this section, we discuss two multi-level summarization techniques: extractive-abstractive summarization and abstractive-abstractive summarization.

Extractive-abstractive summarization

Extractive-abstractive summarization works by first generating an extractive summary of the text. Then it uses an abstractive summarization system to refine the extractive summary, making it more concise and informative. This enhances accuracy by providing more informative summaries compared to extractive methods alone.

Extractive-abstractive content summarization strategy

The extractive-abstractive content summarization strategy (EACSS) combines the strengths of two powerful techniques: the BERT extractive summarizer for the extractive phase and LLMs for the abstractive phase, as illustrated in the following diagram.

Diagram: extractive-abstractive text summarization

EACSS offers several advantages, including the preservation of crucial information, enhanced readability, and adaptability. However, implementing EACSS is computationally expensive and complex. There’s a risk of potential information loss, and the quality of the summarization heavily depends on the performance of the underlying models, making careful model selection and tuning essential for achieving optimal results. Implementation includes the following steps:

  1. The first step is to break down the large document, such as a book, into smaller sections, or chunks. These chunks can be sentences, paragraphs, or even chapters, depending on the desired granularity of the summary.
  2. For the extractive phase, we employ the BERT extractive summarizer. This component works by embedding the sentences within each chunk and then employing a clustering algorithm to identify sentences that are closest to the cluster’s centroids. This extractive step helps in preserving the most important and relevant content from each chunk.
  3. Having generated extractive summaries for each chunk, we move on to the abstractive summarization phase. Here, we utilize LLMs known for their ability to generate coherent and contextually relevant summaries. These models take the extracted summaries as input and produce abstractive summaries that capture the essence of the original document while ensuring readability and coherence.

By combining extractive and abstractive summarization techniques, this approach offers an efficient and comprehensive way to summarize lengthy documents such as books. It ensures that important information is extracted while allowing for the generation of concise and human-readable summaries, making it a valuable tool for various applications in the domain of document summarization.
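
The following is a hedged sketch of this flow, using the open-source bert-extractive-summarizer package for the extractive phase. The chunking, the extraction ratio, and the use of a BART pipeline as a stand-in for the abstractive LLM are all illustrative assumptions.

from summarizer import Summarizer        # pip install bert-extractive-summarizer
from transformers import pipeline

# Extractive phase: BERT-based summarizer that embeds sentences, clusters them,
# and keeps the sentences closest to the cluster centroids
extractive_model = Summarizer()

# Abstractive phase: any LLM works here; a BART pipeline stands in for brevity
abstractive_model = pipeline("summarization", model="facebook/bart-large-cnn")

# Step 1: split the document into chunks (chapters here, purely illustrative)
chunks = ["Chapter 1 text ...", "Chapter 2 text ..."]

# Step 2: keep roughly 30% of the sentences from each chunk
extractive_summaries = [extractive_model(chunk, ratio=0.3) for chunk in chunks]

# Step 3: condense the combined extractive output into a final abstractive summary
combined = " ".join(extractive_summaries)
final = abstractive_model(combined, max_length=200, min_length=50)[0]["summary_text"]
print(final)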

Abstractive-abstractive summarization

Abstractive-abstractive summarization is an approach where abstractive methods are used for both the intermediate and final summarization passes. It offers notable advantages, including enhanced readability, coherence, and the flexibility to adjust summary length and detail. It excels at language generation, allowing for paraphrasing and avoiding redundancy. However, there are drawbacks: it’s computationally expensive and resource intensive, and its quality depends heavily on the effectiveness of the underlying models, which, if not well trained or versatile, may degrade the generated summaries. Careful model selection is crucial to mitigate these challenges and ensure high-quality abstractive summaries. For abstractive-abstractive summarization, we discuss two strategies: Map Reduce and Map ReRank.

Map Reduce using LangChain

This two-step process comprises a Map step and a Reduce step, as illustrated in the following diagram. This technique enables you to summarize an input that is longer than the model’s input token limit.

Diagram: abstractive-abstractive text summarization with Map Reduce

The process consists of three main steps:

  1. The corpus is split into smaller chunks that fit within the LLM’s token limit.
  2. We use a Map step to individually apply an LLM chain that extracts all the important information from each passage, and its output is used as a new passage. Depending on the size and structure of the corpus, this could be in the form of overarching themes or short summaries.
  3. The Reduce step combines the output passages from the Map step (or from a previous Reduce step) so that they fit within the token limit and feeds them into the LLM. This process is repeated until the final output is a single passage.

The advantage of this technique is that it’s highly scalable and parallelizable. Processing each chunk in the Map step is independent of the others, so you can take advantage of distributed systems or serverless services to lower compute time.
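
The following is a minimal sketch of this chain using LangChain’s load_summarize_chain. The chunk sizes and the Bedrock-hosted Claude model used as the LLM are assumptions; any LangChain-compatible LLM could be substituted.

from langchain.chains.summarize import load_summarize_chain
from langchain.llms import Bedrock
from langchain.text_splitter import RecursiveCharacterTextSplitter

# LLM used for both the Map and Reduce steps (model ID is an assumption)
llm = Bedrock(model_id="anthropic.claude-v2")

long_text = "Very long document that exceeds the model's token limit ..."

# Step 1: split the corpus into chunks that fit within the token limit
splitter = RecursiveCharacterTextSplitter(chunk_size=4000, chunk_overlap=200)
docs = splitter.create_documents([long_text])

# Steps 2 and 3: the map_reduce chain summarizes each chunk, then combines
# and re-summarizes the intermediate outputs into a single passage
chain = load_summarize_chain(llm, chain_type="map_reduce")
summary = chain.run(docs)
print(summary)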

Map ReRank using LangChain

This chain runs an initial prompt on each document that not only tries to complete a task but also gives a score for how certain it is in its answer. The highest scoring response is returned.

This technique is very similar to Map Reduce but with the advantage of requiring fewer overall calls, streamlining the summarization process. However, its limitation lies in its inability to merge information across multiple documents. This restriction makes it most effective in scenarios where a single, straightforward answer is expected from a singular document, making it less suitable for more complex or multifaceted information retrieval tasks that involve multiple sources. Careful consideration of the context and the nature of the data is essential to determine the appropriateness of this method for specific summarization needs.
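
LangChain exposes Map ReRank through its question answering chain, so the following hedged sketch frames the task as a question over the chunked documents. The model ID, chunk contents, and question wording are assumptions.

from langchain.chains.question_answering import load_qa_chain
from langchain.docstore.document import Document
from langchain.llms import Bedrock

llm = Bedrock(model_id="anthropic.claude-v2")  # assumed model ID

# Chunked documents (in practice, produced by a text splitter as shown earlier)
docs = [
    Document(page_content="First chunk of the document ..."),
    Document(page_content="Second chunk of the document ..."),
]

# Each document is prompted and scored; the highest-scoring answer is returned
chain = load_qa_chain(llm, chain_type="map_rerank")
answer = chain.run(input_documents=docs, question="What is the key takeaway of this document?")
print(answer)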

Cohere ReRank uses a semantic-based reranking system that contextualizes the meaning of a user’s query beyond keyword relevance. It’s used with vector store systems as well as keyword-based search engines, giving it flexibility.
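
The following is a minimal sketch of reranking candidate passages with the Cohere Python SDK. The API key placeholder, model name, and query are assumptions, and the exact shape of the response object varies between SDK versions, so treat this as illustrative.

import cohere

co = cohere.Client("<COHERE_API_KEY>")  # replace with your API key

response = co.rerank(
    model="rerank-english-v2.0",   # assumed model name
    query="What is the main conclusion of the article?",
    documents=[
        "Passage about the study design ...",
        "Passage about the experimental results ...",
        "Passage about the conclusions ...",
    ],
    top_n=2,
)
print(response)  # inspect the ranked passages and their relevance scores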

Comparing summarization techniques

Each summarization technique has its own unique advantages and disadvantages:

  • Extractive summarization preserves the original content and ensures high readability but lacks creativity and may produce lengthy summaries.
  • Abstractive summarization, while offering creativity and generating concise, fluent summaries, comes with the risk of unintentional content modification, challenges in language accuracy, and resource-intensive development.
  • Extractive-abstractive multi-level summarization effectively summarizes large documents and provides better flexibility in fine-tuning the extractive part of the models. However, it’s expensive, time consuming, and lacks parallelization, making parameter tuning challenging.
  • Abstractive-abstractive multi-level summarization also effectively summarizes large documents and excels in enhanced readability and coherence. However, it’s computationally expensive and resource intensive, relying heavily on the effectiveness of underlying models.

Careful model selection is crucial to mitigate challenges and ensure high-quality abstractive summaries in this approach. The following table summarizes the capabilities for each type of summarization.

Aspect | Extractive Summarization | Abstractive Summarization | Multi-level Summarization
Generate creative and engaging summaries | No | Yes | Yes
Preserve original content | Yes | No | No
Balance information preservation and creativity | No | Yes | Yes
Suitable for short, objective text (input text length smaller than maximum tokens of the model) | Yes | Yes | No
Effective for longer, complex documents such as books (input text length greater than maximum tokens of the model) | No | No | Yes
Combines extraction and content generation | No | No | Yes

Multi-level summarization techniques are suitable for long and complex documents where the input text length exceeds the token limit of the model. The following table compares these techniques.

Technique | Advantages | Disadvantages
EACSS (extractive-abstractive) | Preserves crucial information, provides the ability to fine-tune the extractive part of the models. | Computationally expensive, potential information loss, and lacks parallelization.
Map Reduce (abstractive-abstractive) | Scalable and parallelizable, with less compute time. The best technique to generate creative and concise summaries. | Memory-intensive process.
Map ReRank (abstractive-abstractive) | Streamlined summarization with semantic-based ranking. | Limited information merging.

Tips when summarizing text

Consider the following best practices when summarizing text:

  • Be aware of the total token size – Be prepared to split the text if it exceeds the model’s token limits, or employ multiple levels of summarization when using LLMs (see the sketch that follows this list).
  • Be aware of the types and number of data sources – Combining information from multiple sources may require transformations, clear organization, and integration strategies. LangChain integrates with a wide variety of data sources and document types, and its Stuff documents chain simplifies combining text from different documents and data sources into a single prompt.
  • Be aware of model specialization – Some models may excel at certain types of content but struggle with others. There may be fine-tuned models that are better suited for your domain of text.
  • Use multi-level summarization for large bodies of text – For texts that exceed the token limits, consider a multi-level summarization approach. Start with a high-level summary to capture the main ideas and then progressively summarize subsections or chapters for more detailed insights.
  • Summarize text by topics – This approach helps maintain a logical flow and reduce information loss, and it prioritizes the retention of crucial information. If you’re using LLMs, craft clear and specific prompts that guide the model to summarize a particular topic instead of the whole body of text.
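
The following is a minimal sketch of checking the token count before summarizing and splitting the text when it exceeds a budget. The tokenizer, chunk sizes, and token budget are assumptions that depend on the model you use.

import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = "Very long body of text to summarize ..."

# Count tokens with an OpenAI-style tokenizer (the encoding is an assumption;
# use the tokenizer that matches your model)
encoding = tiktoken.get_encoding("cl100k_base")
num_tokens = len(encoding.encode(text))

MAX_TOKENS = 4000  # assumed budget for the summarization prompt

if num_tokens > MAX_TOKENS:
    # Split into overlapping chunks and fall back to multi-level summarization
    splitter = RecursiveCharacterTextSplitter(chunk_size=8000, chunk_overlap=400)
    chunks = splitter.split_text(text)
else:
    chunks = [text]

print(f"{num_tokens} tokens, {len(chunks)} chunk(s)")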

Conclusion

Summarization stands as a vital tool in our information-rich era, enabling the efficient distillation of extensive information into concise and meaningful forms. It plays a pivotal role in various domains, offering numerous advantages. Summarization saves time by swiftly conveying essential content from lengthy documents, aids decision-making by extracting critical information, and enhances comprehension in education and content curation.

This post provided a comprehensive overview of various summarization techniques, including extractive, abstractive, and multi-level approaches. With tools like LangChain and language models, you can harness the power of summarization to streamline communication, improve decision-making, and unlock the full potential of vast information repositories. The comparison tables in this post can help you identify the most suitable summarization techniques for your projects. Additionally, the tips shared in this post serve as valuable guidelines for avoiding common pitfalls when experimenting with LLMs for text summarization. This practical advice empowers you to apply the knowledge gained, ensuring successful and efficient summarization in your projects.

About the authors

Nick Biso is a Machine Learning Engineer at AWS Professional Services. He solves complex organizational and technical challenges using data science and engineering. In addition, he builds and deploys AI/ML models on the AWS Cloud. His passion extends to his proclivity for travel and diverse cultural experiences.

Suhas chowdary Jonnalagadda is a Data Scientist at AWS Global Services. He is passionate about helping enterprise customers solve their most complex problems with the power of AI/ML. He has helped customers in transforming their business solutions across diverse industries, including finance, healthcare, banking, ecommerce, media, advertising, and marketing.

Tabby Ward is a Principal Cloud Architect/Strategic Technical Advisor with extensive experience migrating customers and modernizing their application workload and services to AWS. With over 25 years of experience developing and architecting software, she is recognized for her deep-dive ability as well as skillfully earning the trust of customers and partners to design architectures and solutions across multiple tech stacks and cloud providers.

Shyam Desai is a Cloud Engineer for big data and machine learning services at AWS. He supports enterprise-level big data applications and customers using a combination of software engineering expertise with data science. He has extensive knowledge in computer vision and imaging applications for artificial intelligence, as well as biomedical and bioinformatic applications.
