Fine-tune and deploy a summarizer model using the Hugging Face Amazon SageMaker containers bringing your own script

There have been many recent advancements in the NLP domain. Pre-trained models and fully managed NLP services have democratised access and adoption of NLP. Amazon Comprehend is a fully managed service that can perform NLP tasks like custom entity recognition, topic modelling, sentiment analysis and more to extract insights from data without the need of any prior ML experience.

Last year, AWS announced a partnership with Hugging Face to help bring natural language processing (NLP) models to production faster. Hugging Face is an open-source AI community, focused on NLP. Their Python-based library (Transformers) provides tools to easily use popular state-of-the-art Transformer architectures like BERT, RoBERTa, and GPT. You can apply these models to a variety of NLP tasks, such as text classification, information extraction, and question answering, among others.

Amazon SageMaker is a fully managed service that provides developers and data scientists the ability to build, train, and deploy machine learning (ML) models quickly. SageMaker removes the heavy lifting from each step of the ML process, making it easier to develop high-quality models. The SageMaker Python SDK provides open-source APIs and containers to train and deploy models on SageMaker, using several different ML and deep learning frameworks.

The Hugging Face integration with SageMaker allows you to build Hugging Face models at scale on your own domain-specific use cases.

In this post, we walk you through an example of how to build and deploy a custom Hugging Face text summarizer on SageMaker. We use Pegasus [1] for this purpose, the first Transformer-based model specifically pre-trained on an objective tailored for abstractive text summarization. BERT is pre-trained on masking random words in a sentence; in contrast, during Pegasus’s pre-training, sentences are masked from an input document. The model then generates the missing sentences as a single output sequence using all the unmasked sentences as context, creating an executive summary of the document as a result.

Thanks to the flexibility of the HuggingFace library, you can easily adapt the code shown in this post for other types of transformer models, such as t5, BART, and more.

Load your own dataset to fine-tune a Hugging Face model

To load a custom dataset from a CSV file, we use the load_dataset method from the Transformers package. We can apply tokenization to the loaded dataset using the datasets.Dataset.map function. The map function iterates over the loaded dataset and applies the tokenize function to each example. The tokenized dataset can then be passed to the trainer for fine-tuning the model. See the following code:

# Python
def tokenize(batch):
    tokenized_input = tokenizer(batch[args.input_column], padding='max_length', truncation=True, max_length=args.max_source)
    tokenized_target = tokenizer(batch[args.target_column], padding='max_length', truncation=True, max_length=args.max_target)
    tokenized_input['target'] = tokenized_target['input_ids']

    return tokenized_input
    

def load_and_tokenize_dataset(data_dir):
    for file in os.listdir(data_dir):
        dataset = load_dataset("csv", data_files=os.path.join(data_dir, file), split='train')
    tokenized_dataset = dataset.map(lambda batch: tokenize(batch), batched=True, batch_size=512)
    tokenized_dataset.set_format('numpy', columns=['input_ids', 'attention_mask', 'labels'])
    
    return tokenized_dataset

Build your training script for the Hugging Face SageMaker estimator

As explained in the post AWS and Hugging Face collaborate to simplify and accelerate adoption of Natural Language Processing models, training a Hugging Face model on SageMaker has never been easier. We can do so by using the Hugging Face estimator from the SageMaker SDK.

The following code snippet fine-tunes Pegasus on our dataset. You can also find many sample notebooks that guide you through fine-tuning different types of models, available directly in the transformers GitHub repository. To enable distributed training, we can use the Data Parallelism Library in SageMaker, which has been built into the HuggingFace Trainer API. To enable data parallelism, we need to define the distribution parameter in our Hugging Face estimator.

# Python
from sagemaker.huggingface import HuggingFace
# configuration for running training on smdistributed Data Parallel
distribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}
huggingface_estimator = HuggingFace(entry_point='train.py',
                                    source_dir='code',
                                    base_job_name='huggingface-pegasus',
                                    instance_type= 'ml.g4dn.16xlarge',
                                    instance_count=1,
                                    transformers_version='4.6',
                                    pytorch_version='1.7',
                                    py_version='py36',
                                    output_path=output_path,
                                    role=role,
                                    hyperparameters = {
                                        'model_name': 'google/pegasus-xsum',
                                        'epoch': 10,
                                        'per_device_train_batch_size': 2
                                    },
                                    distribution=distribution)
huggingface_estimator.fit({'train': training_input_path, 'validation': validation_input_path, 'test': test_input_path})

The maximum training batch size you can configure depends on the model size and the GPU memory of the instance used. If SageMaker distributed training is enabled, the total batch size is the sum of every batch that is distributed across each device/GPU. If we use an ml.g4dn.16xlarge with distributed training instead of an ml.g4dn.xlarge instance, we have eight times (8 GPUs) as much memory as a ml.g4dn.xlarge instance (1 GPU). The batch size per device remains the same, but eight devices are training in parallel.

As usual with SageMaker, we create a train.py script to use with Script Mode and pass hyperparameters for training. The following code snippet for Pegasus loads the model and trains it using the Transformers Trainer class:

# Python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainer,
    Seq2seqTrainingArguments
)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
    
training_args = Seq2seqTrainingArguments(
    output_dir=args.model_dir,
    num_train_epochs=args.epoch,
    per_device_train_batch_size=args.train_batch_size,
    per_device_eval_batch_size=args.eval_batch_size,
    warmup_steps=args.warmup_steps,
    weight_decay=args.weight_decay,
    logging_dir=f"{args.output_data_dir}/logs",
    logging_strategy='epoch',
    evaluation_strategy='epoch',
    saving_strategy='epoch',
    adafactor=True,
    do_train=True,
    do_eval=True,
    do_predict=True,
    save_total_limit = 3,
    load_best_model_at_end=True,
    metric_for_best_model='eval_loss'
    # With the goal to deploy the best checkpoint to production
    # it is important to set load_best_model_at_end=True,
    # this makes sure that the last model is saved at the root
    # of the model_dir” directory
)
    
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['validation']
)

trainer.train()
trainer.save_model()

# Get rid of unused checkpoints inside the container to limit the model.tar.gz size
os.system(f"rm -rf {args.model_dir}/checkpoint-*/")

The full code is available on GitHub.

Deploy the trained Hugging Face model to SageMaker

Our friends at Hugging Face have made inference on SageMaker for Transformers models simpler than ever thanks to the SageMaker Hugging Face Inference Toolkit. You can directly deploy the previously trained model by simply setting up the environment variable "HF_TASK":"summarization" (for instructions, see Pegasus Models), choosing Deploy, and then choosing Amazon SageMaker, without needing to write an inference script.

However, if you need some specific way to generate or postprocess predictions, for example generating several summary suggestions based on a list of different text generation parameters, writing your own inference script can be useful and relatively straightforward:

# Python
# inference.py script

import os
import json
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def model_fn(model_dir):
    # Create the model and tokenizer and load weights
    # from the previous training Job, passed here through "model_dir"
    # that is reflected in HuggingFaceModel "model_data"
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir).to(device).eval()
    
    model_dict = {'model':model, 'tokenizer':tokenizer}
    
    return model_dict
        

def predict_fn(input_data, model_dict):
    # Return predictions/generated summaries
    # using the loaded model and tokenizer on input_data
    text = input_data.pop('inputs')
    parameters_list = input_data.pop('parameters_list', None)
    
    tokenizer = model_dict['tokenizer']
    model = model_dict['model']

    # Parameters may or may not be passed    
    input_ids = tokenizer(text, truncation=True, padding='longest', return_tensors="pt").input_ids.to(device)
    
    if parameters_list:
        predictions = []
        for parameters in parameters_list:
            output = model.generate(input_ids, **parameters)
            predictions.append(tokenizer.batch_decode(output, skip_special_tokens=True))
    else:
        output = model.generate(input_ids)
        predictions = tokenizer.batch_decode(output, skip_special_tokens=True)
    
    return predictions
    
    
def input_fn(request_body, request_content_type):
    # Transform the input request to a dictionary
    request = json.loads(request_body)
    return request

As shown in the preceding code, such an inference script for HuggingFace on SageMaker only needs the following template functions:

model_fn() – Reads the content of what was saved at the end of the training job inside SM_MODEL_DIR, or from an existing model weights directory saved as a tar.gz file in Amazon Simple Storage Service (Amazon S3). It’s used to load the trained model and associated tokenizer.
input_fn() – Formats the data received from a request made to the endpoint.
predict_fn() – Calls the output of model_fn() (the model and tokenizer) to run inference on the output of input_fn() (the formatted data).

Optionally, you can create an output_fn() function for inference formatting, using the output of predict_fn(), which we didn’t demonstrate in this post.

We can then deploy the trained Hugging Face model with its associated inference script to SageMaker using the Hugging Face SageMaker Model class:

# Python
from sagemaker.huggingface import HuggingFaceModel

model = HuggingFaceModel(model_data=huggingface_estimator.model_data,
                     role=role,
                     framework_version='1.7',
                     py_version='py36',
                     entry_point='inference.py',
                     source_dir='code')
                     
predictor = model.deploy(initial_instance_count=1,
                         instance_type='ml.g4dn.xlarge'
                         )

Test the deployed model

For this demo, we trained the model on the Women’s E-Commerce Clothing Reviews dataset, which contains reviews of clothing articles (which we consider as the input text) and their associated titles (which we consider as summaries). After we remove articles with missing titles, the dataset contains 19,675 reviews. Fine-tuning the Pegasus model on a training set containing 70% of those articles for five epochs took approximately 3.5 hours on an ml.p3.16xlarge instance.

We can then deploy the model and test it with some example data from the test set. The following is an example review describing a sweater:

# Python
Review Text
"I ordered this sweater in green in petite large. The color and knit is beautiful and the shoulders and body fit comfortably; however, the sleeves were very long for a petite. I roll them, and it looks okay but would have rather had a normal petite length sleeve."

Original Title
"Long sleeves"

Rating
3

Thanks to our custom inference script hosted in a SageMaker endpoint, we can generate several summaries for this review with different text generation parameters. For example, we can ask the endpoint to generate a range of very short to moderately long summaries specifying different length penalties (the smaller the length penalty, the shorter the generated summary). The following are some parameter input examples, and the subsequent machine-generated summaries:

# Python
inputs = {
    "inputs":[
"I ordered this sweater in green in petite large. The color and knit is   beautiful and the shoulders and body fit comfortably; however, the sleeves were very long for a petite. I roll them, and it looks okay but would have rather had a normal petite length sleeve."
    ],

    "parameters_list":[
        {
            "length_penalty":2
        },
	{
            "length_penalty":1
        },
	{
            "length_penalty":0.6
        },
        {
            "length_penalty":0.4
        }
    ]

result = predictor.predict(inputs)
print(result)

[
    ["Beautiful color and knit but sleeves are very long for a petite"],
    ["Beautiful sweater, but sleeves are too long for a petite"],
    ["Cute, but sleeves are long"],
    ["Very long sleeves"]
]

Which summary do you prefer? The first generated title captures all the important facts about the review, with a quarter the number of words. In contrast, the last one only uses three words (less than 1/10th the length of the original review) to focus on the most important feature of the sweater.

Conclusion

You can fine-tune a text summarizer on your custom dataset and deploy it to production on SageMaker with this simple example available on GitHub. Additional sample notebooks to train and deploy Hugging Face models on SageMaker are also available.

As always, AWS welcomes feedback. Please submit any comments or questions.

References

[1] PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization

About the authors

Viktor Malesevic is a Machine Learning Engineer with AWS Professional Services, passionate about Natural Language Processing and MLOps. He works with customers to develop and put challenging deep learning models to production on AWS. In his spare time, he enjoys sharing a glass of red wine and some cheese with friends.

Aamna Najmi is a Data Scientist with AWS Professional Services. She is passionate about helping customers innovate with Big Data and Artificial Intelligence technologies to tap business value and insights from data. In her spare time, she enjoys gardening and traveling to new places.

Vedere AI

Fine-tune and deploy a summarizer model using the Hugging Face Amazon SageMaker containers bringing your own script

Load your own dataset to fine-tune a Hugging Face model

Build your training script for the Hugging Face SageMaker estimator

Deploy the trained Hugging Face model to SageMaker

Test the deployed model

Conclusion

References

About the authors

Navigation

GenAI Vision Endless Possibilities

"I'm interested in things that change the world or that affect the future and wondrous, new technology where you see it, and you're like, 'Wow, how did that even happen? How is that possible?'" -- Elon Musk

Copyright © 2019-2025 Vedere AI. All Rights Reserved.