Pre-training genomic language models using AWS HealthOmics and Amazon SageMaker

Genomic language models are a new and exciting application of large language models to challenges in genomics. In this blog post and open source project, we show you how to pre-train a genomic language model, HyenaDNA, using your genomic data in the AWS Cloud. Here, we use AWS HealthOmics storage as a convenient and cost-effective omic data store and Amazon SageMaker as a fully managed machine learning (ML) service to train and deploy the model.

Genomic language models

Genomic language models represent a new approach in the field of genomics, offering a way to understand the language of DNA. These models use the transformer architecture, a neural network design originally developed for natural language processing (NLP), to interpret the vast amount of genomic information available, allowing researchers and scientists to extract meaningful insights more accurately than with existing in silico approaches and more cost-effectively than with existing in situ techniques.

By bridging the gap between raw genetic data and actionable knowledge, genomic language models hold immense promise for various industries and research areas, including whole-genome analysis, delivered care, pharmaceuticals, and agriculture. They facilitate the discovery of novel gene functions, the identification of disease-causing mutations, and the development of personalized treatment strategies, ultimately driving innovation and advancement in genomics-driven fields. The ability to effectively analyze and interpret genomic data at scale is the key to precision medicine, agricultural optimization, and biotechnological breakthroughs, making genomic language models a possible new foundational technology in these industries.

Some of the pioneering genomic language models include:

  • DNABERT, which was one of the first attempts to use the transformer architecture to learn the language of DNA. DNABERT used a Bidirectional Encoder Representations from Transformers (BERT, encoder-only) architecture pre-trained on a human reference genome and showed promising results on downstream supervised tasks.
  • Nucleotide Transformer, which has a similar architecture to DNABERT and showed that pre-training on more data and increasing the context window size improves the model’s accuracy on downstream tasks.
  • HyenaDNA uses the transformer architecture, like other genomic models, except that it replaces each self-attention layer with a Hyena operator. This widens the context window to allow processing of up to 1 million tokens, substantially more than prior models, allowing it to learn longer-range interactions in DNA.

In our exploration of cutting-edge models that push the boundaries of genetic sequence analysis, we focused on HyenaDNA. Pretrained HyenaDNA models are readily accessible on Hugging Face, which makes it easy to integrate them into existing projects or use them as a starting point for new explorations in genetic sequence analysis.

AWS HealthOmics and sequence stores

AWS HealthOmics is a purpose-built service that helps healthcare and life science organizations and their software partners store, query, and analyze genomic, transcriptomic, and other omics data and then generate insights from that data to improve health and drive deeper biological understanding. It supports large-scale analysis and collaborative research through HealthOmics storage, analytics, and workflow capabilities.

HealthOmics storage is a managed, omics-focused data store that follows the findable, accessible, interoperable, and reusable (FAIR) principles. With it, users can store, organize, share, and access petabytes of bioinformatics data at a low cost per gigabase. HealthOmics sequence stores deliver cost savings through automatic tiering and compression of files based on usage, enable sharing and findability through biologically focused metadata and provenance tracking, and provide instant access to frequently used data through low-latency Amazon Simple Storage Service (Amazon S3) compatible APIs or HealthOmics native APIs. HealthOmics handles all of this, removing the burden of managing compression, tiering, metadata, and file organization from customers.

Amazon SageMaker

Amazon SageMaker is a fully managed ML service offered by AWS, designed to reduce the time and cost associated with training and tuning ML models at scale.

With SageMaker Training, a managed batch ML compute service, users can efficiently train models without having to manage the underlying infrastructure. SageMaker notably supports popular deep learning frameworks, including PyTorch, which is integral to the solutions provided here.

SageMaker also provides a broad selection of ML infrastructure and model deployment options to help meet all your ML inference needs.

Solution overview

In this blog post we address pre-training a genomic language model on an assembled genome. This genomic data could be either public (for example, GenBank) or could be your own proprietary data. The following diagram illustrates the workflow:

Figure: Architecture diagram for training a HyenaDNA model using data stored in an AWS HealthOmics sequence store. The diagram shows eight numbered steps, from reading the data from an external genomic data source such as GenBank through to inference on a SageMaker real-time endpoint; each step is described in the following list.

  1. We start with genomic data. For the purposes of this blog post, we’re using a public non-reference mouse genome from GenBank. The dataset is part of The Mouse Genomes Project and represents a consensus genome sequence of inbred mouse strains. This type of genomic data could readily be interchanged with proprietary datasets that you might be working with in your research.
  2. We use a SageMaker notebook to process the genomic files and to import these into a HealthOmics sequence store.
  3. A second SageMaker notebook is used to start the training job on SageMaker.
  4. Inside the managed training job in the SageMaker environment, the training job first downloads the mouse genome using the S3 URI supplied by HealthOmics.
  5. Then the training job retrieves the checkpoint weights of the HyenaDNA model from Hugging Face. These weights are pretrained on the human reference genome. This pretraining allows the model to understand and predict genomic sequences, providing a comprehensive baseline for further specialized training on a variety of genomic tasks.
  6. Using these resources, the training job continues pre-training the HyenaDNA model on the mouse genome to refine its parameters. After pre-training is complete and validation results are satisfactory, the trained model is saved to Amazon S3.
  7. Then we deploy that model as a SageMaker real-time inference endpoint.
  8. Lastly, the model is tested against a set of known genome sequences using some inference API calls.

Data preparation and loading into sequence store

The initial step in our machine learning workflow focuses on preparing the data. We start by uploading the genomic sequences into a HealthOmics sequence store. Although FASTA files are the standard format for storing reference sequences, we convert these to FASTQ format. This conversion is carried out to better reflect the format expected to store the assembled data of a sequenced sample.

In the sample Jupyter notebook, we show how to download FASTA files from GenBank, convert them into FASTQ files, and then load them into a HealthOmics sequence store. You can skip this step if you already have your own genomic data in a sequence store.
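The FASTA-to-FASTQ conversion can be sketched in plain Python. FASTA records carry no base qualities, so this sketch assigns a constant placeholder quality character to every base (`I`, which is Q40 in the common Phred+33 encoding). That choice, and the function names `read_fasta` and `fasta_to_fastq`, are assumptions made for illustration and are not necessarily what the sample notebook does.

```python
from typing import Iterable, Iterator, Tuple

# Assumption: every base gets the same placeholder quality (Phred+33 'I' = Q40).
PLACEHOLDER_QUALITY = "I"


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_fastq(lines: Iterable[str]) -> str:
    """Convert FASTA text to FASTQ text with constant placeholder qualities."""
    records = []
    for header, seq in read_fasta(lines):
        quality = PLACEHOLDER_QUALITY * len(seq)
        records.append(f"@{header}\n{seq}\n+\n{quality}\n")
    return "".join(records)
```

For whole chromosomes you would likely stream records to a file instead of building one large string; the sketch keeps everything in memory for brevity.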

Training on SageMaker

We use PyTorch and Amazon SageMaker script mode to train this model. Script mode’s compatibility with PyTorch was crucial, allowing us to use our existing scripts with minimal modifications. For the training, we extract the training data from the sequence store through the sequence store’s provided S3 URIs. You can, for example, use the boto3 library to obtain this S3 URI.

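import boto3

# HealthOmics client; uses your default AWS credentials and Region
omics = boto3.client("omics")

seq_store_id = "4308389581"

# Look up the S3 access details and encryption key of the sequence store
seq_store_info = omics.get_sequence_store(id=seq_store_id)
s3_uri = seq_store_info["s3Access"]["s3Uri"]
s3_arn = seq_store_info["s3Access"]["s3AccessPointArn"]
key_arn = seq_store_info["sseConfig"]["keyArn"]
s3_uri, s3_arn, key_arn

# Point the training input at the read sets under the readSet/ prefix
S3_DATA_URI = f"{s3_uri}readSet/"
S3_DATA_URI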

When you provide this to the SageMaker estimator, the training job takes care of downloading the data from the sequence store through its S3 URI. Following Nguyen et al., we train on chromosomes 2, 4, 6, 8, X, and 14–19; cross-validate on chromosomes 1, 3, 12, and 13; and test on chromosomes 5, 7, and 9–11.
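The chromosome split described here can be written as a small lookup. This is a minimal sketch: the set names and the `split_for` helper are hypothetical, and the actual training script may organize the split differently.

```python
from typing import Optional

# Chromosome assignments as stated in the text.
TRAIN = {"2", "4", "6", "8", "X"} | {str(c) for c in range(14, 20)}
VALID = {"1", "3", "12", "13"}
TEST = {"5", "7"} | {str(c) for c in range(9, 12)}


def split_for(chromosome: str) -> Optional[str]:
    """Return 'train', 'valid', or 'test' for a chromosome name, else None."""
    name = chromosome.strip()
    if name.lower().startswith("chr"):
        name = name[3:]  # accept both 'chr2' and '2'
    name = name.upper()  # normalizes 'x' to 'X'; digits are unaffected
    if name in TRAIN:
        return "train"
    if name in VALID:
        return "valid"
    if name in TEST:
        return "test"
    return None  # for example, chromosome Y is not assigned to any split
```

Together, the three sets cover the 19 mouse autosomes and chromosome X without overlap.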

To maximize the training efficiency of our HyenaDNA model, we use distributed data parallel (DDP). DDP is a technique that facilitates the parallel processing of our training tasks across multiple GPUs. To efficiently implement DDP, we used the Hugging Face Accelerate library. Accelerate simplifies running distributed training by abstracting away the complexity typically associated with setting up DDP.
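To make the data-parallel idea concrete, here is a framework-free sketch of how a dataset's indices might be divided among workers so that each GPU processes a different slice of the data. It mimics the strided partitioning commonly used by distributed samplers. The function name `shard_indices` is hypothetical, and in practice PyTorch and Accelerate handle this partitioning for you.

```python
from typing import List


def shard_indices(num_samples: int, world_size: int, rank: int) -> List[int]:
    """Return the sample indices assigned to one worker (rank).

    The index list is padded by wrapping around to the start so that every
    rank receives the same number of samples, then split with a stride of
    world_size.
    """
    if world_size < 1 or not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    if num_samples == 0:
        return []
    per_rank = -(-num_samples // world_size)  # ceiling division
    padded = [i % num_samples for i in range(per_rank * world_size)]
    return padded[rank::world_size]
```

With the four GPUs and per-GPU batch size of four used later in this post, each training step would process 16 sequences in total, assuming no gradient accumulation.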

After you have defined your training script, you can configure and submit a SageMaker training job.

First, let’s define the hyperparameters, starting with model_checkpoint. This parameter refers to a Hugging Face model ID for a specific pre-trained model. Notably, the HyenaDNA model lineup includes checkpoints that can handle up to 1 million tokens. However, for demonstration purposes, we are using the hyenadna-small-32k-seqlen-hf model, which has a context window of 32,000 tokens, indicated by the max_length setting. You can select other model IDs and corresponding max_length settings to use models with smaller or larger context windows, depending on your computational needs and objectives.

The species parameter is set to mouse, specifying the type of organism the genomic training data represents.

hyperparameters = {
    "species": "mouse",
    "epochs": 150,
    "model_checkpoint": MODEL_ID,
    "max_length": 32_000,
    "batch_size": 4,
    "learning_rate": 6e-4,
    "weight_decay": 0.1,
    "log_level": "INFO",
    "log_interval": 100,
}
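In script mode, SageMaker passes each hyperparameter to the entry-point script as a command-line argument (for example, `--epochs 150`). The following is a minimal sketch of how a training script could read the values defined here with `argparse`. It is an illustration only, not the actual contents of train_hf_accelerate.py, and the default values are assumptions.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse hyperparameters passed as --name value pairs."""
    parser = argparse.ArgumentParser(description="Hypothetical training entry point")
    parser.add_argument("--species", type=str, default="mouse")
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--model_checkpoint", type=str, required=True)
    parser.add_argument("--max_length", type=int, default=32_000)
    parser.add_argument("--batch_size", type=int, default=4)
    parser.add_argument("--learning_rate", type=float, default=6e-4)
    parser.add_argument("--weight_decay", type=float, default=0.1)
    parser.add_argument("--log_level", type=str, default="INFO")
    parser.add_argument("--log_interval", type=int, default=100)
    # argv=None makes argparse read sys.argv, which is what a real job would use
    return parser.parse_args(argv)
```

Passing an explicit list, such as `parse_args(["--model_checkpoint", "my-model", "--epochs", "150"])`, is convenient for testing the parser locally before submitting a job.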

Next, define what metrics, especially the training and validation perplexity, to capture from the training logs:

metric_definitions = [
    {"Name": "epoch", "Regex": "Epoch: ([0-9.]*)"},
    {"Name": "step", "Regex": "Step: ([0-9.]*)"},
    {"Name": "train_loss", "Regex": "Train Loss: ([0-9.e-]*)"},
    {"Name": "train_perplexity", "Regex": "Train Perplexity: ([0-9.e-]*)"},
    {"Name": "eval_loss", "Regex": "Eval Average Loss: ([0-9.e-]*)"},
    {"Name": "eval_perplexity", "Regex": "Eval Perplexity: ([0-9.e-]*)"}
]
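SageMaker applies each regular expression to the training job's log output and records the captured group as the metric value. Before submitting a job, it can help to check the patterns against a sample log line. The sketch below does that locally; the `extract_metrics` helper and the sample log lines are made up for illustration and are not copied from the actual training script.

```python
import re
from typing import Dict, List

metric_definitions = [
    {"Name": "epoch", "Regex": "Epoch: ([0-9.]*)"},
    {"Name": "step", "Regex": "Step: ([0-9.]*)"},
    {"Name": "train_loss", "Regex": "Train Loss: ([0-9.e-]*)"},
    {"Name": "train_perplexity", "Regex": "Train Perplexity: ([0-9.e-]*)"},
    {"Name": "eval_loss", "Regex": "Eval Average Loss: ([0-9.e-]*)"},
    {"Name": "eval_perplexity", "Regex": "Eval Perplexity: ([0-9.e-]*)"},
]


def extract_metrics(log_line: str, definitions: List[Dict[str, str]]) -> Dict[str, float]:
    """Return the metrics whose regex matches the given log line."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], log_line)
        if not match:
            continue
        try:
            found[definition["Name"]] = float(match.group(1))
        except ValueError:
            pass  # captured text was empty or not a valid number
    return found


# Hypothetical log line, shaped to fit the patterns above
sample = "Epoch: 3 Step: 1200 Train Loss: 1.2045 Train Perplexity: 3.3351"
print(extract_metrics(sample, metric_definitions))
```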

Finally, define a PyTorch estimator and submit a training job that refers to the data location obtained from the HealthOmics sequence store.

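from sagemaker.experiments.run import Run
from sagemaker.inputs import TrainingInput
from sagemaker.pytorch import PyTorch

hyenaDNA_estimator = PyTorch(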
    base_job_name=TRAINING_JOB_NAME,
    entry_point="train_hf_accelerate.py",
    source_dir="scripts/",
    instance_type="ml.g5.12xlarge",
    instance_count=1,
    image_uri=pytorch_image_uri,
    role=SAGEMAKER_EXECUTION_ROLE,
    hyperparameters=hyperparameters,
    metric_definitions=metric_definitions,
    sagemaker_session=sagemaker_session,
    distribution={"torch_distributed": {"enabled": True}},
    tags=[{"Key": "project", "Value": "genomics-model-pretraining"}],
    keep_alive_period_in_seconds=1800,
    tensorboard_output_config=tensorboard_output_config,
)

with Run(
    experiment_name=EXPERIMENT_NAME,
    sagemaker_session=sagemaker_session,
) as run:
    hyenaDNA_estimator.fit(
        {
            "data": TrainingInput(
                s3_data=S3_DATA_URI, input_mode="File"
            ),
        },
        wait=True,
    )

Results

In our training cycle, we processed a dataset consisting of one mouse genome with 10,000 entries. The computational resources consisted of one ml.g5.12xlarge instance, which houses four NVIDIA A10G GPUs, each with 24 GB of GPU memory. The 32k sequence length model was trained using a batch size of four per GPU. With this setup, we completed 150 epochs to produce the results below.

Evaluation metrics: The evaluation perplexity and loss graphs show a downward trend at the outset, which then plateaus. The initial steep decrease indicates that the model rapidly learned from the training data, improving its predictive performance. As training progressed, the rate of improvement slowed, as evidenced by the plateau, which is typical in the later stages of training as the model converges.

The image plots the evaluation loss of a HyenaDNA model training over a series of epochs. The overall trend suggests that the model's loss decreased significantly early in the training and reached a plateau, indicating potential convergence of the model training process.

The image plots the evaluation perplexity values of HyenaDNA model during its training over a sequence of epochs. This decreasing trend followed by stabilization indicates that the model's ability to predict or understand the data improved quickly initially and then reached a level of consistency as training progressed.

Training metrics: Similarly, the training perplexity and loss graphs indicate an initial sharp improvement followed by a gradual plateau. This shows that the model effectively learned from the data. The slight fluctuations in the training loss suggest that the model continued to fine-tune its parameters in response to the inherent complexities in the training dataset.

The image plots the model's training perplexity over training steps. The training perplexity decreases significantly early on, followed by a gradual decline and stabilization around 3.2. This behavior suggests that as training progresses, the model becomes increasingly efficient at predicting the training data. The stabilization at a lower perplexity level indicates that training has likely converged.
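Perplexity and loss are two views of the same quantity: perplexity is the exponential of the mean per-token cross-entropy loss. The sketch below shows the conversion in both directions, assuming the loss is measured in nats (natural logarithm), as is typical for PyTorch cross-entropy. The function names are hypothetical.

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity is exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)


def loss_from_perplexity(ppl: float) -> float:
    """Inverse conversion: the mean cross-entropy for a given perplexity."""
    return math.log(ppl)


# A perplexity of about 3.2 corresponds to a loss of about 1.16 nats per token.
print(round(loss_from_perplexity(3.2), 2))
```

As a reference point, a model that guesses uniformly among four nucleotides would have a perplexity of exactly 4, assuming each token is a single nucleotide.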

Deployment

After training completed, we deployed the model to a SageMaker real-time endpoint. SageMaker real-time endpoints provide an on-demand, scalable way to generate embeddings for genomic sequences.

In our SageMaker real-time endpoint setup, we need to adjust the default configurations to handle large payload sizes, specifically 32k context windows for both requests and responses. Because the default payload size of 6.5 MB isn’t sufficient, we’re increasing it to 60,000,000 bytes (roughly 57 MiB):

from sagemaker.pytorch import PyTorchModel

hyenaDNAModel = PyTorchModel(
    model_data=model_data,
    role=SAGEMAKER_EXECUTION_ROLE,
    image_uri=pytorch_deployment_uri,
    entry_point="inference.py",
    source_dir="scripts/",
    sagemaker_session=sagemaker_session,
    name=endpoint_name,
    # Raise the request and response size limits of the model server (in bytes)
    env={
        "TS_MAX_RESPONSE_SIZE": "60000000",
        "TS_MAX_REQUEST_SIZE": "60000000",
    },
)

# Deploy the endpoint. The size limits are already set through the env
# argument of PyTorchModel above, so they are not repeated here.
realtime_predictor = hyenaDNAModel.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.8xlarge",
    endpoint_name=endpoint_name,
)
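To judge whether a given request or response would fit within these limits, you can measure its serialized size before sending it. The following is a rough sketch: the helper names are hypothetical, the default limit is an approximation of the 6.5 MB figure mentioned earlier, and the payload shape is an assumption.

```python
import json
from typing import Any

DEFAULT_LIMIT_BYTES = 6_500_000      # assumption: approximates the 6.5 MB default
CONFIGURED_LIMIT_BYTES = 60_000_000  # value set through TS_MAX_REQUEST_SIZE / TS_MAX_RESPONSE_SIZE


def json_size_bytes(payload: Any) -> int:
    """Size in bytes of the payload once serialized as UTF-8 JSON."""
    return len(json.dumps(payload).encode("utf-8"))


def fits(payload: Any, limit_bytes: int) -> bool:
    """True if the serialized payload is within the given size limit."""
    return json_size_bytes(payload) <= limit_bytes


# A single 32,000-nucleotide sequence is small as a request...
request = ["ACGT" * 8_000]
print(json_size_bytes(request), fits(request, DEFAULT_LIMIT_BYTES))

# ...but a response carrying many floating-point values can grow quickly.
response = [0.123456789] * 1_000_000
print(json_size_bytes(response), fits(response, DEFAULT_LIMIT_BYTES))
```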

By submitting a sequence to the endpoint, users can quickly receive the corresponding embeddings generated by HyenaDNA. These embeddings encapsulate the complex patterns and relationships learned during training, representing the genetic sequences in a form that is conducive to further analysis and predictive modeling. Here is an example of how to invoke the model.

import json
from sagemaker.deserializers import JSONDeserializer
from sagemaker.serializers import JSONSerializer

sample_genome_data = []
with open("./sample_mouse_data.json") as file:
    for line in file:
        sample_genome_data.append(json.loads(line))
len(sample_genome_data)

data = [sample_genome_data[0]]
realtime_predictor.serializer = JSONSerializer()
realtime_predictor.deserializer = JSONDeserializer()
realtime_predictor.predict(data=data)

When you submit a sample genomic sequence to the model, it returns the embeddings of that sequence:

{'embeddings': [[-0.50390625, 0.447265625,-1.03125, 0.546875, 0.50390625, -0.53125, 0.59375, 0.71875, 0.349609375, -0.404296875, -4.8125, 0.84375, 0.359375, 1.2265625,………]]}
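One simple way to use such embeddings downstream is to compare sequences by the cosine similarity of their vectors. The sketch below assumes the response contains one embedding vector per input sequence, as the output shown here suggests. The `cosine_similarity` helper is hypothetical and uses only the standard library.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

For example, given two responses `r1` and `r2` from the endpoint, `cosine_similarity(r1["embeddings"][0], r2["embeddings"][0])` would return a value between -1 and 1, with higher values indicating more similar embeddings.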

Conclusion

We’ve shown how to pre-train a HyenaDNA model with a 32k context window and to produce embeddings that can be used for downstream predictive tasks. Using the techniques shown here you can also pre-train a HyenaDNA model with context windows of other sizes (for example, 1 million tokens) and on other genomic data (for example, proprietary genomic sequence data).

Pre-training genomic models on large, diverse datasets is a foundational step in preparing them for downstream tasks, such as identifying genetic variants linked to diseases or predicting gene expression levels. In this blog post, you’ve learned how AWS facilitates this pre-training process by providing a scalable and cost-efficient infrastructure through HealthOmics and SageMaker. Looking forward, researchers can use these pre-trained models to fast-track their projects, fine-tuning them with specific datasets to gain deeper insights into genetic research.

To explore further details and try your hand at using these resources, we invite you to visit our GitHub repository. Additionally, we encourage you to learn more by visiting the Amazon SageMaker documentation and the AWS HealthOmics documentation.


About the authors

Shamika Ariyawansa, serving as a Senior AI/ML Solutions Architect in the Global Healthcare and Life Sciences division at Amazon Web Services (AWS), specializes in Generative AI. He assists customers in integrating Generative AI into their projects, emphasizing the adoption of Large Language Models (LLMs) for healthcare and life sciences domains with a focus on distributed training. Beyond his professional commitments, Shamika passionately pursues skiing and off-roading adventures.

Simon Handley, PhD, is a Senior AI/ML Solutions Architect in the Global Healthcare and Life Sciences team at Amazon Web Services. He has more than 25 years of experience in biotechnology and machine learning and is passionate about helping customers solve their machine learning and genomic challenges. In his spare time, he enjoys horseback riding and playing ice hockey.
