An introduction to preparing your own dataset for LLM training

Large language models (LLMs) have demonstrated remarkable capabilities in a wide range of linguistic tasks. However, the performance of these models is heavily influenced by the data used during the training process.

In this blog post, we provide an introduction to preparing your own dataset for LLM training. Whether your goal is to fine-tune a pre-trained model for a specific task or to continue pre-training for domain-specific applications, having a well-curated dataset is crucial for achieving optimal performance.

Data preprocessing

Text data can come from diverse sources and exist in a wide variety of formats such as PDF, HTML, JSON, and Microsoft Office documents such as Word, Excel, and PowerPoint. It’s rare to already have access to text data that can be readily processed and fed into an LLM for training. Thus, the first step in an LLM data preparation pipeline is to extract and collate data from these various sources and formats. During this step, you read data from multiple sources, extract the text using tools such as optical character recognition (OCR) for scanned PDFs, HTML parsers for web documents, and bespoke libraries for proprietary formats such as Microsoft Office files. Non-textual elements such as HTML tags and non-UTF-8 characters are typically removed or normalized.

The next step is to filter out low-quality or undesirable documents; a minimal example of such a filter follows this list. Common patterns for filtering data include:

  • Filtering on metadata such as the document name or URL.
  • Content-based filtering such as excluding any toxic or harmful content or personally identifiable information (PII).
  • Regex filters to identify specific character patterns present in the text.
  • Filtering documents with excessive repetitive sentences or n-grams.
  • Filters for specific languages such as English.
  • Other quality filters such as the number of words in the document, average word length, ratio of words comprised of alphabetic characters versus non-alphabetic characters, and others.
  • Model-based quality filtering using lightweight text classifiers to identify low-quality documents. For example, the FineWeb-Edu classifier is used to classify the educational value of web pages.
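The following is a minimal sketch of a rule-based quality filter combining a few of these patterns; the thresholds are assumptions you would tune against your own corpus.

import re

def passes_quality_filters(text, min_words=50, max_mean_word_length=10, min_alpha_ratio=0.8):
    words = text.split()
    if len(words) < min_words:
        return False                                   # too short
    if sum(len(w) for w in words) / len(words) > max_mean_word_length:
        return False                                   # suspiciously long average word length
    alpha_words = [w for w in words if re.fullmatch(r"[A-Za-z]+", w)]
    if len(alpha_words) / len(words) < min_alpha_ratio:
        return False                                   # too many non-alphabetic tokens
    # Filter documents with excessive repeated lines
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if lines and len(set(lines)) / len(lines) < 0.5:
        return False
    return True

documents = ["..."]                                    # extracted documents from the previous step
filtered_documents = [doc for doc in documents if passes_quality_filters(doc)]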

Extracting text from various file formats can be a non-trivial task. Fortunately, many high-level libraries exist that can significantly simplify this process. We will use a few examples to demonstrate text extraction and review how to scale this to large collections of documents later in this post.

HTML preprocessing

When processing HTML documents, remove non-text data such as the document mark-up tags, inline CSS styles, and inline JavaScript. Furthermore, translate structured objects such as lists, tables, and sample code blocks into markdown format. The trafilatura library provides a command-line interface (CLI) and Python SDK for translating HTML documents in this fashion. The following code snippet demonstrates the library’s usage by extracting and preprocessing the HTML data from the Fine-tune Meta Llama 3.1 models using torchtune on Amazon SageMaker blog post.

from trafilatura import fetch_url, extract, html2txt

url = "https://aws.amazon.com/blogs/machine-learning/fine-tune-meta-llama-3-1-models-using-torchtune-on-amazon-sagemaker/"

downloaded = fetch_url(url)
print("RAW HTML\n", downloaded[:250])

all_text = html2txt(downloaded)
print("\nALL TEXT\n", all_text[:250])

main_text = extract(downloaded)
print("\nMAIN TEXT\n", main_text[:250])

trafilatura provides numerous functions for dealing with HTML. In the preceding example, fetch_url fetches the raw HTML, html2txt extracts all of the text content, including navigation links, related content links, and other page text. Finally, the extract method extracts the content of the main body, which is the blog post itself. The output of the preceding code should look like the following:

RAW HTML
<!doctype html> <html lang="en-US" class="no-js aws-lng-en_US" xmlns="http://www.w3.org/1999/xhtml" data-aws-assets="https://a0.awsstatic.com" data-js-version="1.0.681" data-css-version="1.0.538" data-static-assets="https://a0.awsstatic.com" prefix="

ALL TEXT
Skip to Main Content Click here to return to Amazon Web Services homepage About AWS Contact Us Support English My Account Sign In Create an AWS Account Products Solutions Pricing Documentation Learn Partner Network AWS Marketplace Customer Enablement

MAIN TEXT
AWS Machine Learning Blog Fine-tune Meta Llama 3.1 models using torchtune on Amazon SageMaker This post is co-written with Meta’s PyTorch team. In today’s rapidly evolving AI landscape, businesses are constantly seeking ways to use advanced large lan

PDF processing

PDF is a common format for storing and distributing documents within organizations. Extracting clean text from PDFs can be challenging for several reasons. PDFs may use complex layouts that include text columns, images, tables, and figures. They can also contain embedded fonts and graphics that cannot be parsed by standard libraries. Unlike HTML, there is no structural information to work with such as headings, paragraphs, lists, and others, which makes parsing PDF documents significantly more difficult. If possible, PDF parsing should be avoided if an alternative format for the document exists, such as HTML, markdown, or even a DOCX file. In cases where an alternative format is not available, you can use libraries such as pdfplumber, pypdf, and pdfminer to help with the extraction of text and tabular data from the PDF. The following is an example of using pdfplumber to parse the first page of the 2023 Amazon annual report in PDF format.

import pdfplumber

pdf_file = "Amazon-com-Inc-2023-Annual-Report.pdf"

with pdfplumber.open(pdf_file) as pdf:
    # Pages are zero-indexed, so pages[0] is the first page
    page = pdf.pages[0]
    print(page.extract_text(x_tolerance=1)[:300])

pdfplumber provides bounding box information, which can be used to remove superfluous text such as page headers and footers. However, the library only works with PDFs that have text present, such as digitally authored PDFs. For PDF documents that require OCR, such as scanned documents, you can use services such as Amazon Textract.
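For example, if headers and footers occupy a fixed band at the top and bottom of each page, you can crop them out before extracting text. The following is a minimal sketch; the 50-point margins are assumptions that need to be tuned to your documents' layout.

import pdfplumber

pdf_file = "Amazon-com-Inc-2023-Annual-Report.pdf"
header_height = 50   # assumed header band height in points
footer_height = 50   # assumed footer band height in points

with pdfplumber.open(pdf_file) as pdf:
    body_text = []
    for page in pdf.pages:
        # Keep only the region between the assumed header and footer bands
        body = page.crop((0, header_height, page.width, page.height - footer_height))
        body_text.append(body.extract_text(x_tolerance=1) or "")

print("\n".join(body_text)[:300])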

Office document processing

Documents authored with Microsoft Office or other compatible productivity software are another common format within an organization. Such documents can include DOCX, PPTX, and XLSX files, and there are libraries available to work with these formats. The following code snippet uses the python-docx library to extract text from a Word document. The code iterates through the document paragraphs and concatenates them into a single string.

from docx import Document
doc_file = "SampleDoc.docx"

doc = Document(doc_file)

full_text = []
for paragraph in doc.paragraphs:
    full_text.append(paragraph.text)

document_text = '\n'.join(full_text)

Deduplication

After the preprocessing step, it is important to process the data further to remove duplicates (deduplication) and filter out low-quality content.

Deduplication is a critical aspect for preparing high-quality pretraining datasets. According to CCNet, duplicated training examples are pervasive in common natural language processing (NLP) datasets. This issue is not only a frequent source of bias in datasets originating from public domains such as the internet, but it can also be a potential problem when curating your own training dataset. When organizations attempt to create their own training dataset, they often use various data sources such as internal emails, memos, internal employee chat logs, support tickets, conversations, and internal wiki pages. The same chunk of text might appear across multiple sources or can repeat excessively in a single data source such as an email thread. Duplicated data extends the training time and potentially biases the model towards more frequently repeated examples.

A commonly used processing pipeline is the CCNet pipeline. The following sections describe the deduplication and filtering steps employed in the CCNet pipeline.

Break documents into shards. In the CCNet paper, the authors divided 30 TB of data into 1,600 shards. In that example, the shards are groups of documents: each shard contains 5 GB of data and 1.6 million documents. Organizations can determine the number of shards and the size of each shard based on their data size and compute environment. The main purpose of creating shards is to parallelize the deduplication process across a cluster of compute nodes.

Compute a hash code for each paragraph of the document. Each shard contains many documents, and each document contains multiple paragraphs. For each paragraph, we compute a hash code and save it into a binary file. The authors of the CCNet paper use the first 64 bits of the SHA-1 digest of the normalized paragraphs as the key. Deduplication is done by comparing these keys: if the same key appears multiple times, the paragraphs that these keys point to are considered duplicates. You can compare the keys within one shard, in which case there might still be duplicated paragraphs across different shards. If you compare the keys across all shards, you can verify that no duplicated paragraph exists in your whole dataset. However, this can be computationally expensive.
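The following is a minimal sketch of computing such keys and dropping repeated paragraphs, assuming a simple lowercase-and-collapse-whitespace normalization for illustration (the CCNet paper applies its own normalization).

import hashlib
import re

def dedup_key(paragraph):
    # Illustrative normalization: lowercase and collapse whitespace
    normalized = re.sub(r"\s+", " ", paragraph.lower()).strip()
    # First 64 bits (8 bytes) of the SHA-1 digest serve as the deduplication key
    return hashlib.sha1(normalized.encode("utf-8")).digest()[:8]

paragraphs = [
    "The Federal Reserve raised interest rates by 0.25%.",
    "The Federal  Reserve raised interest rates by 0.25%.",   # duplicate after normalization
    "Compound interest accelerates the growth of savings.",
]

seen, unique_paragraphs = set(), []
for paragraph in paragraphs:
    key = dedup_key(paragraph)
    if key not in seen:
        seen.add(key)
        unique_paragraphs.append(paragraph)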

MinHash is another popular method for estimating the similarities between two paragraphs. This technique is particularly useful for large datasets because it provides an efficient approximation of the Jaccard similarity. Paragraphs are broken down into shingles, which are overlapping sequences of words or characters of a fixed length. Multiple hashing functions are applied to each shingle. For each hash function, we find the minimum hash value across all the shingles and use that as the signature of the paragraph, called the MinHash signature. Using the MinHash signatures, we can calculate the similarity of the paragraphs. The MinHash technique can also be applied to words, sentences, or entire documents. This flexibility makes MinHash a powerful tool for a wide range of text similarity tasks. The following example shows a minimal Python implementation of this technique:

import hashlib

def create_shingles(text, shingle_length):
    # Break text into overlapping word shingles of a fixed length
    words = text.split()
    return {" ".join(words[i:i + shingle_length])
            for i in range(max(1, len(words) - shingle_length + 1))}

def minhash_similarity(text1, text2, shingle_length=3, num_hash_functions=128):
    # Preprocess texts into shingle sets
    shingles1 = create_shingles(text1, shingle_length)
    shingles2 = create_shingles(text2, shingle_length)

    # Count how often the minimum hash values agree across hash functions
    common_minhashes = 0
    for i in range(num_hash_functions):
        # Simulate independent hash functions by salting SHA-1 with the index
        def hash_function(shingle):
            return int(hashlib.sha1(f"{i}:{shingle}".encode("utf-8")).hexdigest(), 16)
        minhash1 = min(hash_function(s) for s in shingles1)
        minhash2 = min(hash_function(s) for s in shingles2)
        if minhash1 == minhash2:
            common_minhashes += 1

    # The fraction of matching MinHash values estimates the Jaccard similarity
    return common_minhashes / num_hash_functions

The complete steps of using MinHash for deduplication are:

  1. Break down documents into paragraphs.
  2. Apply the MinHash algorithm as shown in the preceding example and calculate the similarity scores between paragraphs.
  3. Use the similarity between paragraphs to identify duplicate pairs.
  4. Combine duplicate pairs into clusters. From each cluster, select one representative paragraph to minimize duplicates (see the sketch after this list).
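Step 4 can be implemented with a simple union-find structure, as sketched below; the duplicate_pairs list is an assumed output of the pairwise similarity step.

def find(parent, x):
    # Find the cluster root with path compression
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def cluster_duplicates(num_paragraphs, duplicate_pairs):
    # Union-find: merge duplicate pairs into clusters of near-identical paragraphs
    parent = list(range(num_paragraphs))
    for a, b in duplicate_pairs:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_b] = root_a
    clusters = {}
    for i in range(num_paragraphs):
        clusters.setdefault(find(parent, i), []).append(i)
    return list(clusters.values())

duplicate_pairs = [(0, 2), (2, 5)]                      # assumed output of the similarity step
clusters = cluster_duplicates(6, duplicate_pairs)       # [[0, 2, 5], [1], [3], [4]]
representatives = [cluster[0] for cluster in clusters]  # keep one paragraph per cluster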

To enhance the efficiency of similarity searches, especially when dealing with large datasets, MinHash is often used in conjunction with additional techniques such as Locality Sensitive Hashing (LSH). LSH complements MinHash by providing a way to quickly identify potential matches through bucketing and hashing techniques without having to compare every pair of items in the dataset. This combination allows for efficient similarity searches even in massive collections of documents or data points, significantly reducing the computational overhead typically associated with such operations.
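The following is a minimal sketch of MinHash-plus-LSH bucketing, reusing the create_shingles helper from the earlier example; the band count and shingle length are assumptions to tune against your target similarity threshold.

from collections import defaultdict
import hashlib

def minhash_signature(text, shingle_length=3, num_hash_functions=128):
    # Full MinHash signature: one minimum hash value per salted hash function
    shingles = create_shingles(text, shingle_length)  # helper from the preceding example
    return [min(int(hashlib.sha1(f"{i}:{s}".encode("utf-8")).hexdigest(), 16) for s in shingles)
            for i in range(num_hash_functions)]

def lsh_candidate_pairs(texts, num_bands=32, num_hash_functions=128):
    # Split each signature into bands; documents sharing any band become candidate pairs
    rows_per_band = num_hash_functions // num_bands
    buckets = defaultdict(list)
    for doc_id, text in enumerate(texts):
        signature = minhash_signature(text, num_hash_functions=num_hash_functions)
        for band in range(num_bands):
            band_key = (band, tuple(signature[band * rows_per_band:(band + 1) * rows_per_band]))
            buckets[band_key].append(doc_id)
    candidates = set()
    for doc_ids in buckets.values():
        candidates.update((a, b) for a in doc_ids for b in doc_ids if a < b)
    return candidates

texts = [
    "The Federal Reserve raised interest rates by 0.25% to combat inflation this quarter.",
    "The Federal Reserve raised interest rates by 0.25% to combat inflation last quarter.",
    "Compound interest allows investments to grow faster than simple interest.",
]
print(lsh_candidate_pairs(texts))  # the near-duplicate pair (0, 1) is very likely to share a bucket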

It’s important to note that paragraph-level deduplication is not the only choice of granularity. As shown in Meta’s Llama 3 paper, you can also use sentence-level deduplication. The authors also applied document-level deduplication to remove near duplicate documents. The computation cost for sentence-level deduplication is even higher compared to paragraph-level deduplication. However, this approach offers more fine-grained control over duplicate content. At the same time, removing duplicated sentences might result in an incomplete paragraph, potentially affecting the coherence and context of the remaining text. Thus, the trade-off between granularity and context preservation needs to be carefully considered based on the nature of the dataset.

Creating a dataset for model fine-tuning

Fine-tuning a pre-trained LLM involves adapting it to a specific task or domain by training it on an annotated dataset in a supervised manner or through reinforcement learning techniques. The dataset considerations for fine-tuning are crucial because they directly impact the model’s performance, accuracy, and generalization capabilities. Top considerations include:

  1. Relevance and domain-specificity: The dataset should closely match the task or domain the model is being fine-tuned for. Make sure that the dataset includes diverse examples and edge cases that the model is likely to encounter. This helps improve the robustness and generalizability of the model across a range of real-world scenarios. For example, when fine-tuning a model for financial sentiment analysis, the dataset should contain financial news articles, analyst reports, stock market commentary, and corporate earnings announcements.
  2. Annotation quality: The dataset must be free of noise, errors, and irrelevant information. Annotated datasets must maintain consistency in labeling. The dataset should accurately reflect the correct answers, human preferences, or other target outcomes that the fine-tuning process aims to achieve.
  3. Dataset size and distribution: Although fine-tuning generally requires fewer tokens than pretraining (thousands compared to millions), the dataset should still be large enough to cover the breadth of the task requirements. The dataset should include a diverse set of examples that reflect the variations in language, context, and style that the model is expected to handle.
  4. Ethical considerations: Analyze and mitigate biases present in the dataset, such as gender, racial, or cultural biases. These biases can be amplified during fine-tuning, leading to unfair or discriminatory model outputs. Make sure that the dataset aligns with ethical standards and represents diverse groups and perspectives fairly.
  5. Sensible data cutoffs: While preparing the dataset, one of the considerations is choosing a cutoff date for the data. Generally, depending on how quickly the information changes, you can choose an earlier or more recent cutoff. For example, for fine-tuning an LLM for brand adherence, you can use a distant cutoff date because the brand language remains consistent for many years, whereas preparing a dataset for generating audit and compliance letters needs a more recent cutoff date because new compliance regulations are created and updated quite often.
  6. Modalities: In the case of multi-modal models, the dataset must include various supported data types. Each data type must follow the other considerations mentioned here around annotation quality, ethical considerations, relevance, domain specificity, and so on.
  7. Synthetic data augmentation: Consider generating synthetic data to supplement real-world data, especially to help fill gaps in the dataset to make sure that it’s realistic and representative. Employing these techniques can help overcome the challenges of limited data availability, enhance model robustness, and provide better generalization across diverse tasks.

Dataset format for fine-tuning

Instruction tuning is a process of further training a pretrained model on a diverse set of tasks framed as natural language instructions. This approach aims to enhance the model’s ability to understand and follow explicit instructions, improving its performance on a wide range of downstream tasks without the need for task-specific fine-tuning. A technique called preference tuning is used to align large models with human preferences and values. This approach aims to improve the quality, safety, and desirability of AI-generated outputs by incorporating human feedback into the training process. The data requirements for both of these techniques are very different, so we’ll go into further detail in the following sections.

Instruction fine-tuning dataset format

The fields in the examples that follow represent the key components of the instruction-tuning paradigm: Type categorizes the task or instruction type. Instruction contains the specific directive for the model. Input provides the context or data to work with. Output shows the expected or desired response. This structure is crucial for training models to understand and execute a wide variety of tasks based on natural language instructions.

Type: antonym_relation
Instruction: What is the relationship between the given pairs?
Input: Night : Day :: Right : Left
Output: The relationship between the given pairs is that they are opposites.

Type: email_subject_generation
Instruction: Generate an appropriate subject title for the following email:
Input: Hi [person name],\n\n I’m writing to ask you if you are willing to be a panelist of our workshop on multi-modality at CVPR. The workshop will be held on June 20, 2023. \n\nBest,\n[my name]
Output: Invitation to be a panelist for CVPR 2023 workshop on multi-modality

Type: analogy_completion
Instruction: To make the pairs have the same analogy, write the fourth word.
Input: Instability : Turmoil :: Change :
Output: Instability : Turmoil :: Change : Revolution

Type: question_answering_generation_from_facts
Instruction: Using the facts given, write a question-answer pair.
Input: Name: Bill Gates\nBorn: October 28, 1955 (age 66)\nEducation: Harvard University (dropped out)\nOccupation: Software developer, investor, entrepreneur\nSource: Wikipedia\nTime: August 2022
Output: Question: What is Bill Gates’s occupation? \nAnswer: He is a software developer, investor, and entrepreneur.

 

Preference-tuning dataset format

For a preference-tuning dataset, the dataset typically consists of entries, each containing an input or prompt, a chosen (preferred) response, and a rejected (less preferred) response. The following is an example of a Direct Preference Optimization (DPO) dataset.

Prompt:
Artists Gustav Klimt and Carl Moll were born in which European country?
What is the answer?

Chosen:
[{'content': 'You are a helpful assistant who always provide explanation. Think like you are answering a five year old.', 'role': 'system'},
{'content': 'Artists Gustav Klimt and Carl Moll were born in which European country?\nWhat is the answer?', 'role': 'user'},
{'content': 'Gustav Klimt and Carl Moll were both born in a European country called Austria. Austria is a beautiful place with mountains, forests, and many people who speak German. These two artists created lots of pretty paintings while they lived in Austria.', 'role': 'assistant'}]

Rejected:
[{'content': 'You are a helpful assistant, who always provides an explanation. Think like you are answering a five year old.', 'role': 'system'},
{'content': 'Artists Gustav Klimt and Carl Moll were born in which European country?\nWhat is the answer?', 'role': 'user'},
{'content': "Ooh ooh! *giggle* You wanna know where Gustav Klimt and Carl Moll were born? *bounce bounce* Well, they were born in a special place called Austria! *twirl* That's right, Austria! It's a beautiful country in Europe, and lots of talented artists like Gustav and Carl were born there! *smile* Does that make you wanna dance? *spin spin* Hehehe!", 'role': 'assistant'}]

The following is an example using the Ultrachat-feedback dataset format, which includes the following elements: prompt, chosen, rejected, messages, score_chosen, and score_rejected. This type of dataset is often used in DPO or reinforcement learning from human feedback (RLHF) to improve AI model outputs. By providing examples of your preferred and non-preferred responses along with their respective scores, the dataset can be used to train models to generate more desirable outputs.

prompt: Let’s play a game. I say a sentence, then you make a sentence that follows up my sentence, then I give a continuation to yours and so on. You ready?
chosen: [{'content': "Let's play a game. I say a sentence, then you make a sentence that follows up my sentence, then I give a continuation to yours and so on. you ready?", 'role': 'user'},
{'content': "I'm ready! Let's begin. Please provide your first sentence.", 'role': 'assistant'}]
rejected: [{'content': "Let's play a game. I say a sentence, then you make a sentence that follows up my sentence, then I give a continuation to yours and so on. you ready?", 'role': 'user'},
{'content': 'Sure, I would love to play.', 'role': 'assistant'}]
messages: [{'content': "Let's play a game. I say a sentence, then you make a sentence that follows up my sentence, then I give a continuation to yours and so on. you ready?", 'role': 'user'},
{'content': "I'm ready! Let's begin. Please provide your first sentence.", 'role': 'assistant'}]
score_chosen: 7
score_rejected: 6

In the case of Meta Llama 3, instruction-tuned models go through an iterative process of DPO preference alignment, and the dataset typically consists of triplets—a user prompt and two model responses, with one response preferred over the other. In advanced implementations, this format can be extended to include a third, edited response that’s considered superior to both original responses. The preference between responses is quantified using a multi-level rating system, ranging from marginally better to significantly better. This granular approach to preference annotation allows for a more nuanced training of the model, enabling it to distinguish between slight improvements and significant enhancements in response quality.

prompt: Let’s play a game. I say a sentence, then you make a sentence that follows up my sentence, then I give a continuation to yours and so on. You ready?
chosen: [{'content': "Let's play a game. I say a sentence, then you make a sentence that follows up my sentence, then I give a continuation to yours and so on. You ready?", 'role': 'user'},
{'content': "I'm ready! Let's begin. Please provide your first sentence.", 'role': 'assistant'}]
rejected: [{'content': "Let's play a game. I say a sentence, then you make a sentence that follows up my sentence, then I give a continuation to yours and so on. You ready?", 'role': 'user'},
{'content': 'Sure, I would love to play.', 'role': 'assistant'}]
edited: [{'content': "Let's play a game. I say a sentence, then you make a sentence that follows up my sentence, then I give a continuation to yours and so on. You ready?", 'role': 'user'},
{'content': "I'm ready! Let's begin. Please provide your first sentence.", 'role': 'assistant'}]
alignment rating: significantly better

 

Synthetic data creation approach for the instruction-tuning dataset format using the Self-Instruct technique

Synthetic data creation using the Self-Instruct technique is one of the most well-known approaches for generating instruction fine-tuning datasets. This method uses the capabilities of LLMs to bootstrap a diverse and extensive collection of instruction-tuning examples, significantly reducing the need for manual annotation. The following figure shows the process of the Self-Instruct technique, which is described in the following sections.

 

Seed data and tasks

The process begins with a small set of human-written instruction-output pairs that serve as seed data. The seed dataset is the foundation for building a robust collection of tasks across various domains, with a focus on promoting task diversity. In some cases, the input field provides context to support the instruction, especially in classification tasks where the output labels are limited. For non-classification tasks, the instruction alone might be self-contained without needing additional input. This dataset encourages task variety through different data formats and solutions, making it a critical step in defining the final task pool, which supports the development of diverse AI applications.

The following is an example of a seed task that identifies financial entities (companies, government institutions, or assets) and assigns a part of speech tag or entity classification based on the given sentence.

{
    "id": "finance_task_001",
    "name": "financial_entity_classification",
    "instruction": "Identify the type of financial entity in the given sentence.",
    "instances": [
      {
        "input": "Entity: Federal Reserve\nSentence: The Federal Reserve raised interest rates by 0.25% to combat inflation.",
        "output": "Government Institution, ORG"
      }
    ],
    "is_classification": true
}

The following example requests an explanation of a financial concept, and because it isn’t a classification task, the output is more open-ended.

{
    "id": "finance_task_002",
    "name": "explain_financial_concept",
    "instruction": "Explain the concept of compound interest in two sentences.",
    "instances": [
      {
        "input": "",
        "output": "Compound interest is the interest on a loan or deposit calculated based on both the initial principal and the accumulated interest from previous periods. It allows investments to grow at a faster rate compared to simple interest, where interest is only calculated on the principal."
      }
    ],
    "is_classification": false
  }

Instruction generation

Using the seed data as a foundation, an LLM is prompted to generate new instructions. The process uses existing human-written instructions as examples to help a model (such as Anthropic’s Claude 3.5 or Meta Llama 405B) to generate new instructions, which are then checked and filtered for quality before being added to the final output list.

Come up with a series of tasks:
1. Suggest a diversified investment portfolio for someone with a moderate risk tolerance.
2. What is the relation between the following financial ratios and company performance?
3. Generate a one-sentence description for each of the following economic terms.
4. Describe a situation in which market volatility can negatively impact retirement planning.

Instance generation

For each generated instruction, the model creates corresponding input-output pairs. This step produces concrete examples of how to follow the instructions. The Input-First Approach for non-classification tasks asks the model to first generate the input values, which will then be used to generate the corresponding output. This approach is especially useful for tasks such as financial calculations, where the output directly depends on specific inputs.

input_first_template = '''Come up with examples for the following tasks.
Try to generate multiple examples when possible.
If the task doesn't require additional input, you can generate the output directly.
Task: Calculate the compound interest for the given principal, rate, and time period.
Example 1
Principal: $10,000, Rate: 5%, Time: 2 years
Output: $1,025 (Compound interest using annual compounding)
Example 2
Principal: $5,000, Rate: 3%, Time: 5 years
Output: $796.37 (Compound interest using annual compounding)
...
Task: {instruction}'''

The Output-First Approach for classification tasks is designed to first define the output (class label), and then condition the input generation based on the output. This approach verifies that inputs are created in such a way that they correspond to the pre-defined class labels.

output_first_template = '''Given the classification task definition and the class labels,
generate an input that corresponds to each of the class labels.
If the task doesn't require input, just generate possible class labels.
Task: Identify whether the following financial transaction is categorized as "Income" or "Expense."
Class Label: Income
Transaction: Payment received from client for consulting services - $5,000.
Class Label: Expense
Transaction: Payment made for office rent - $1,200.
...
Task: {instruction}'''

Post-processing filters

The filtering and quality control step verifies the dataset quality by applying various mechanisms to remove low-quality or redundant examples. After generating tasks, instances are extracted and formatted, followed by filtering based on rules such as removing instances where the input and output are identical, the output is empty, or the instance is already in the task pool. Additional heuristic checks, such as incomplete generations or formatting issues, are also applied to maintain the integrity of the final dataset.
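The following is a minimal sketch of such rule-based filtering, assuming each generated instance is a dictionary with instruction, input, and output fields; the specific heuristics are illustrative and should be adapted to your own task pool.

generated_instances = [                               # assumed output of the instance generation step
    {"instruction": "Explain the concept of compound interest in two sentences.",
     "input": "",
     "output": "Compound interest is interest calculated on both the principal and previously accumulated interest. It makes investments grow faster than simple interest."},
    {"instruction": "Explain the concept of compound interest in two sentences.",
     "input": "",
     "output": ""},                                   # empty output, will be dropped
]

def keep_instance(instance, task_pool):
    instruction = instance["instruction"].strip()
    input_text = instance["input"].strip()
    output_text = instance["output"].strip()
    if not instruction or not output_text:
        return False                                  # drop empty generations
    if input_text == output_text:
        return False                                  # drop instances where input and output are identical
    if (instruction, input_text) in task_pool:
        return False                                  # drop instances already in the task pool
    if output_text.endswith(("...", ":")):
        return False                                  # heuristic: likely an incomplete generation
    return True

task_pool, filtered_instances = set(), []
for instance in generated_instances:
    if keep_instance(instance, task_pool):
        task_pool.add((instance["instruction"].strip(), instance["input"].strip()))
        filtered_instances.append(instance)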

For more details on self-instruct synthetic data creation, see Alpaca: A Strong, Replicable Instruction-Following Model for information about the data creation approach and instruction fine-tuning with the dataset. You can follow a similar approach for various fine-tuning tasks including instruction fine-tuning and direct preference optimization.

Data labeling for different downstream tasks (such as coding, summarization, and so on)

When it comes to preparing the data for training an LLM, data labeling plays a crucial role because it directly controls and impacts the quality of responses a model produces. Generally, for training an LLM, there are a variety of labeling approaches you can take, and the right one depends on the task at hand because we expect the LLM to work on a variety of use cases. The reason base foundation models excel at a variety of instructions and tasks is that during the pre-training process we provided such instructions and examples to the model so it could understand the instructions and perform the tasks, for example, asking the model to generate code or perform named entity extraction. Training the LLM for each type of task requires task-specific labeled datasets. Let’s explore some of the common data-labeling approaches:

  • Human labelers: The most common method for data labeling is to use human labelers. In this approach, a team of human labelers annotates data for various tasks, such as general question-answering, sentiment analysis, summarization, comparing various texts for similarity and differences, and so on. For each category of task, you prepare a dataset and ask the human labelers to provide the answers. To mitigate individual bias, you can collect multiple responses for the same question by sourcing answers from multiple human labelers and then consolidate responses into an aggregate label. Human labeling is regarded as the gold standard for collecting high-quality data at scale. However, the process of labeling by hand tends to be tedious, time-consuming, and expensive for labeling tasks that involve millions of data points, which has motivated the study of AI-assisted data annotation tools, such as Snapper, that interactively reduce the burden of manual annotation.
  • LLM-assisted labeling: Another common approach to labeling is to use another LLM to label the data to speed up the labeling process. In this approach, you use another LLM to generate the responses for the various tasks such as sentiment analysis, summarization, coding, and so on. This can be achieved in different ways. In some cases, we can use N-shot learning approaches to improve the quality of the label. To mitigate bias, we use the human-in-the-loop (HITL) approach to review certain responses to verify that the labels are high quality. The benefit of this approach is that it’s faster than human labeling because you can scale the LLM endpoint and serve multiple requests in parallel. However, the downside is that you have to keep iterating on and adjusting the confidence threshold for accepting the model’s responses. For example, if you’re preparing the dataset for financial crime, you have to lower the tolerance for false negatives and accept slightly higher false positives.
  • Cohort-based labeling: Cohort-based labeling is an emerging approach where two or more LLMs are asked to generate the label for the same data. The models are then asked whether they agree with the other model’s response. The label is accepted if both models agree with each other’s response. There is another variation of this approach where, instead of asking the models to agree with each other’s responses, you use a third LLM to rate the quality of the output of the other two models. It produces high-quality outputs, but the cost of labeling rises substantially because you need to make at least three LLM invocation calls for each data point to produce the final label. This approach is under active research, and we expect more orchestration tools for this in the near future.
  • RLHF-based data labeling: This approach is inspired by the RLHF fine-tuning process. Based on the task at hand, you first take a sample of unlabeled data points and have them labeled by a human labeler. You then use the labeled dataset to fine-tune an LLM. The next step is to use the fine-tuned LLM to produce multiple outputs for another subset of unlabeled data points. A human labeler ranks the outputs from best to worst, and you use this data to train a reward model. You then send the rest of the unlabeled data points through a proximal policy optimization (PPO) policy initialized from the supervised policy. The policy generates the label, and then you ask the reward model to calculate a reward for the label. The reward is further used to update the PPO policy. For further reading on this topic, see Improving your LLMs with RLHF on Amazon SageMaker.

Data processing architecture

The entire data processing pipeline can be achieved using a series of jobs as illustrated in the following architecture diagram. Amazon SageMaker is used as a job facility to filter, deduplicate, and tokenize the data. The intermediate outputs of each job can be stored on Amazon Simple Storage Service (Amazon S3). Depending on the size of the final datasets, either Amazon S3 or FSx for Lustre can be used for storing the final dataset. For larger datasets, FSx can provide significant improvements in the training throughput by eliminating the need to copy or stream data directly from S3. An example pipeline using the Hugging Face DataTrove library is provided in this repo.
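As an illustration of the job facility pattern, the following sketch launches one filtering step as a SageMaker Processing job that reads raw data from Amazon S3 and writes the filtered output back to Amazon S3; the container image URI, IAM role, bucket paths, and filter.py script are placeholders you would supply.

from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

processor = ScriptProcessor(
    image_uri="<processing_container_image_uri>",   # placeholder container with your dependencies
    command=["python3"],
    role="<sagemaker_execution_role_arn>",          # placeholder IAM role
    instance_count=4,                               # parallelize the filtering across instances
    instance_type="ml.m5.4xlarge",
)

processor.run(
    code="filter.py",                               # placeholder script implementing the quality filters
    inputs=[ProcessingInput(source="s3://<bucket>/raw/", destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output", destination="s3://<bucket>/filtered/")],
)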

Pipeline for fine-tuning

As previously discussed, fine-tuning data typically comprises an input instruction and the desired outputs. This data can be sourced using manual human annotation, synthetic generation, or a combination of the two. The following architecture diagram outlines an example pipeline where fine-tuning data is generated from an existing corpus of domain-specific documents. An example of a fine-tuning dataset would take a source document as input or context and generate task-specific responses such as a summary of the document, key information extracted from the document, or answers to questions about the document.

Models provided by Amazon Bedrock can be used to generate the synthetic data, which can then be validated and modified by a human reviewer using Amazon SageMaker Ground Truth. SageMaker Ground Truth can also be used to create human-labeled fine-tuning data from scratch. For synthetic data generation, be sure to review the model provider’s acceptable usage terms to verify compliance.
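As an illustration, the following sketch uses the Amazon Bedrock Converse API to generate a question-answer pair from a source document; the model ID, prompt wording, and inference settings are assumptions to adapt to your use case.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

source_document = ("Compound interest is the interest calculated on both the initial principal "
                   "and the accumulated interest from previous periods.")

# Ask the model to produce a question-answer pair grounded in the source document
response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",   # assumed model ID; use any supported model
    messages=[{
        "role": "user",
        "content": [{"text": f"Using only the following document, write one question and its answer.\n\nDocument:\n{source_document}"}],
    }],
    inferenceConfig={"maxTokens": 512, "temperature": 0.7},
)

print(response["output"]["message"]["content"][0]["text"])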

Pipeline for DPO

After a model is fine-tuned, it can be deployed on model hosting services such as Amazon SageMaker. The hosted model can then be used to generate candidate responses to various prompts. Through SageMaker Ground Truth, users can then provide feedback on which responses they prefer, resulting in a preference dataset. This flow is outlined in the following architecture diagram and can be repeated multiple times as the model is tuned using the latest preference data.

Conclusion

Preparing high-quality datasets for LLM training is a critical yet complex process that requires careful consideration of various factors. From extracting and cleaning data from diverse sources to deduplicating content and maintaining ethical standards, each step plays a crucial role in shaping the model’s performance. By following the guidelines outlined in this post, organizations can curate well-rounded datasets that capture the nuances of their domain, leading to more accurate and reliable LLMs.


About the Authors

Simon Zamarin is an AI/ML Solutions Architect whose main focus is helping customers extract value from their data assets. In his spare time, Simon enjoys spending time with family, reading sci-fi, and working on various DIY house projects.

Vikram Elango is an AI/ML Specialist Solutions Architect at Amazon Web Services, based in Virginia USA. Vikram helps financial and insurance industry customers with design, thought leadership to build and deploy machine learning applications at scale. He is currently focused on natural language processing, responsible AI, inference optimization and scaling ML across the enterprise. In his spare time, he enjoys traveling, hiking, cooking and camping with his family.

Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.

Vinayak Arannil is a Sr. Applied Scientist from the AWS Bedrock team. With several years of experience, he has worked on various domains of AI like computer vision, natural language processing etc. Vinayak led the data processing for the Amazon Titan model training. Currently, Vinayak helps build new features on the Bedrock platform enabling customers to build cutting-edge AI applications with ease and efficiency.

Vikesh Pandey is a Principal GenAI/ML Specialist Solutions Architect at AWS, helping customers from financial industries design, build and scale their GenAI/ML workloads on AWS. He carries an experience of more than a decade and a half working on entire ML and software engineering stack. Outside of work, Vikesh enjoys trying out different cuisines and playing outdoor sports.

David Ping is a Sr. Manager of AI/ML Solutions Architecture at Amazon Web Services. He helps enterprise customers build and operate machine learning solutions on AWS. David enjoys hiking and following the latest machine learning advancement.

Graham Horwood is Sr. Manager of Data Science from the AWS Bedrock team.


Design multi-agent orchestration with reasoning using Amazon Bedrock and open source frameworks

As generative AI capabilities evolve, successful business adoptions hinge on the development of robust problem-solving capabilities. At the forefront of this transformation are agentic systems, which harness the power of foundation models (FMs) to tackle complex, real-world challenges. By seamlessly integrating multiple agents, these innovative solutions enable autonomous collaboration, decision-making, and efficient problem-solving in diverse environments. Empirical research conducted by Amazon Web Services (AWS) scientists in conjunction with academic researchers has demonstrated the significant strides made in enhancing the reasoning capabilities through agent collaboration on competitive tasks.

This post provides step-by-step instructions for creating a collaborative multi-agent framework with reasoning capabilities to decouple business applications from FMs. It demonstrates how to combine Amazon Bedrock Agents with open source multi-agent frameworks, enabling collaborations and reasoning among agents to dynamically execute various tasks. The exercise will guide you through the process of building a reasoning orchestration system using Amazon Bedrock, Amazon Bedrock Knowledge Bases, Amazon Bedrock Agents, and FMs. We also explore the integration of Amazon Bedrock Agents with open source orchestration frameworks LangGraph and CrewAI for dispatching and reasoning.

AWS has introduced a multi-agent collaboration capability for Amazon Bedrock, enabling developers to build, deploy, and manage multiple AI agents working together on complex tasks. This feature allows for the creation of specialized agents that handle different aspects of a process, coordinated by a supervisor agent that breaks down requests, delegates tasks, and consolidates outputs. This approach improves task success rates, accuracy, and productivity, especially for complex, multi-step tasks.

For the example code and demonstration discussed in this post, refer to the agentic-orchestration GitHub repository and this AWS Workshop. You can also refer to the GitHub repo for Amazon Bedrock multi-agent collaboration code samples.

Key characteristics of an agentic service

In the context of generative AI, “agent” refers to an autonomous function that can interact with its environment, gather data, and make decisions to execute complex tasks to achieve predefined goals. Generative AI agents are autonomous, goal-oriented systems that use FMs, such as large language models (LLMs), to interact with and adapt to their environments. These agents excel in planning, problem-solving, and decision-making, using techniques such as chain-of-thought prompting to break down complex tasks. They can self-reflect, improve their processes, and expand their capabilities through tool use and collaborations with other AI models. These agents can operate independently or collaboratively, executing tasks across various domains while continuously adapting to new information and changing circumstances. Agents can lead to increased creativity and produce content at scale, automating repetitive tasks so humans can focus on strategic work, leading to cost savings. The following diagram shows the high-level architecture of the solution.

To implement an agent on AWS, you can use the Amazon Bedrock Agents Boto3 client as demonstrated in the following code example. After the required AWS Identity and Access Management (IAM) role is created for the agent, use the create_agent API. This API requires an agent name, an FM identifier, and an instruction string. Optionally, you can also provide an agent description. The created agent is not yet prepared for use. We focus on preparing the agent and then using it to invoke actions and interact with other APIs. Use the following code example to obtain your agent ID; it will be crucial for performing operations with the agent.

# Use the Python boto3 SDK to interact with Amazon Bedrock Agent service

bedrock_agent_client = boto3.client('bedrock-agent')

# Create a new Bedrock Agent
response = bedrock_agent_client.create_agent(
    agentName=<agent_name>, #customized text string
    agentResourceRoleArn=<agent_role['Role']['Arn']>, #IAM role assigned to the agent
    description=<agent_description>, #customized text string
    idleSessionTTLInSeconds=1800, 
    foundationModel=<agent_foundation_model>, #e.g. "anthropic.claude-3-sonnet-20240229-v1:0"
    instruction=<agent_instruction>, #agent instruction text string
)
agent_id = response['agent']['agentId']
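As a follow-on sketch, the agent can then be prepared and invoked through the runtime client; the alias ID and session ID below are placeholders, and in practice you would create (or use the test) agent alias before invoking.

# Prepare the agent so that the latest configuration can be invoked
bedrock_agent_client.prepare_agent(agentId=agent_id)

# Invoke the prepared agent through the Bedrock Agents runtime client
bedrock_agent_runtime = boto3.client('bedrock-agent-runtime')
response = bedrock_agent_runtime.invoke_agent(
    agentId=agent_id,
    agentAliasId='<agent_alias_id>',   # placeholder alias ID
    sessionId='<session_id>',          # placeholder session identifier
    inputText='Summarize the latest quarterly report.'
)

# The response is streamed back as a sequence of completion chunks
completion = ""
for event in response['completion']:
    chunk = event.get('chunk')
    if chunk:
        completion += chunk['bytes'].decode('utf-8')
print(completion)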

Multi-agent pipelines for inter-agent collaboration

Multi-agent pipelines are orchestrated processes within AI systems that involve multiple specialized agents working together to accomplish complex tasks. Within pipelines, agents are organized in a sequential structure, with different agents handling specific subtasks or roles within the overall workflow. Agents interact with each other, often through a shared “scratchpad” or messaging system, allowing them to exchange information and build upon each other’s work. Each agent maintains its own state, which can be updated with new information as the flow progresses. Complex projects are broken down into manageable subtasks, which are then distributed among the specialized agents. The workflow includes clearly defined processes for how tasks should be orchestrated, facilitating efficient task distribution and alignment with objectives. These processes can govern both inter-agent interactions and intra-agent operations (such as how an agent interacts with tools or processes outputs). Agents can be assigned specific roles (for example, retriever or injector) to tackle different aspects of a problem.

As a practical example, consider a multi-agent pipeline for blog writing, implemented with the multi-agent framework CrewAI. To create a multi-agent pipeline with CrewAI, first define the individual agents that will participate in the pipeline. The agents in the following example are a Planner Agent, a Writer Agent, and an Editor Agent. Next, arrange these agents into a pipeline, specifying the order of task execution and how the data flows between them. CrewAI provides mechanisms for agents to pass information to each other and coordinate their actions. The modular and scalable design of CrewAI makes it well-suited for developing both simple and sophisticated multi-agent AI applications. The following diagram shows this multi-agent pipeline.

from crewai import Agent, Task, Crew, Process

# Create a blog writing multi-agent pipeline, which is comprised of a planner, a writer, and an editor agent
# This code snippet shows only the planner agent, which calls web search tools 
# and Amazon Bedrock for the LLM 
class blogAgents():
   def __init__(self, topic, model_id):
       self.topic = topic
       self.model_id = model_id
    
   def planner(self, topic, model_id):
       return Agent(
           role="Content Planner",
           goal=f"""Plan engaging and factually accurate content on {topic}.""", 
           backstory=f"""You're working on planning a blog article about the topic: {topic}.\n
                      You collect information by searching the web for the latest developments that directly relate to the {topic}.\n
                      You help the audience learn something to make informed decisions regarding {topic}.\n
                      Your work is the basis for the Content Writer to write an article on this {topic}.""",
           allow_delegation=False,
           tools=<tools_to_use>,
           llm=<Bedrock_foundation_model>,
           verbose=True
       )
......

# Create the associated blog agent tasks which are comprised of a planner, writer, and editor tasks.
# This code snippet shows only the planner task.
class blogTasks():
   def __init__(self, topic, model_id):
       self.topic = topic
       self.model_id = model_id

   def plan(self, planner, topic, model_id):  
       return Task(
           description=(
                  f"""1. Prioritize the latest trends, key players, and noteworthy news on {topic}.\n
                  2. Identify the target audience, considering their interests and pain points.\n
                  3. Develop a detailed content outline including an introduction, key points, and a call to action.\n
                  4. Include SEO keywords and relevant data or sources."""
            ),
            expected_output=f"""Convey the latest developments on the {topic} with sufficient depth as a domain expert.\n
                Create a comprehensive content plan document with an outline, audience analysis, SEO keywords, and resources.""",
           agent=planner
       )
......

# Define planner agent and planning tasks
planner_agent = agents.planner(self.topic, self.model_id)
plan_task = tasks.plan(planner_agent, self.topic, self.model_id)
......
 
# Define an agentic pipeline to chain the agent and associated tasks
# with service components, embedding engine, and execution process
crew = Crew(
        agents=[planner_agent, writer_agent, editor_agent],
        tasks=[plan_task, write_task, edit_task],
        verbose=True,
        memory=True,
        embedder={
            "provider": "huggingface",
            "config": {"model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"},
        },
        cache=True,
        process=Process.sequential # Sequential process will have tasks executed one after the other
       )
result = crew.kickoff()

As demonstrated in this code example, multi-agent pipelines are generally simple linear structures that may be easy to set up and understand. They have a clear sequential flow of tasks from one agent to the next and can work well for straightforward workflows with a defined order of operations. Meanwhile, the pipeline structure can be less flexible for complex, nonlinear agent interactions, which makes it less able to handle branching logic or cycles. This might be less efficient for problems that require back-and-forth between agents. The next section addresses a graph framework for multi-agent systems, which lends itself better to more complex scenarios.

Multi-agent graph framework for asynchronous orchestration and reasoning

A multi-agent framework offers significant potential for intelligent, dynamic problem-solving, enabling collaborative, specialized task execution. While these systems can enhance inference accuracy and response efficiency by dynamically activating and coordinating agents, they also present critical challenges including potential bias, limited reasoning capabilities, and the need for robust oversight. Effective multi-agent frameworks require careful design considerations such as clear leadership, dynamic team construction, effective information sharing, planning mechanisms like chain-of-thought prompting, memory systems for contextual learning, and strategic orchestration of specialized language models. As the technology evolves, balancing agent autonomy with human oversight and ethical safeguards will be crucial to unlocking the full potential of these intelligent systems while mitigating potential risks.

A multi-agent graph framework is a system that models the interactions and relationships between multiple autonomous agents using a graph-based representation. In this type of framework, agents are represented as nodes in the graph, with each agent having its own set of capabilities, goals, and decision-making processes. The edges in the graph represent the interactions, communications, or dependencies between the agents. These can include things like information sharing, task delegation, negotiation, or coordination. The graph structure allows for the modeling of complex, dynamic relationships between agents, including cycles, feedback loops, and hierarchies. The following diagram shows this architecture.

The graph-based approach provides a flexible and scalable way to represent the structure of multi-agent systems, making it easier to analyze, simulate, and reason about the emergent behaviors that arise from agent interactions. The following code snippet illustrates the process of building a graph framework designed for multi-agent orchestration using LangGraph. This framework is essential for managing and coordinating the interactions between multiple agents within a system, promoting efficient and effective communication and collaboration. Notably, it emphasizes the plug-and-play feature, which allows for dynamic changes and the flexibility to accommodate third-party agents. Frameworks with this capability can seamlessly adapt to new requirements and integrate with external systems, enhancing their overall versatility and usability.

from langgraph.graph import StateGraph, END
......
# Create a graph to orchestrate multiple agents (i.e. nodes) 
orch = StateGraph(MultiAgentState)
orch.add_node("rewrite_agent", rewrite_node)
orch.add_node('booking_assistant', bedrock_agent_node)
orch.add_node('blog_writer', blog_writer_node)
orch.add_node("router_agent", router_node)
orch.add_node('search_expert', search_expert_node)
....

# Create edges to connect agents to form a graph
orch.set_entry_point("rewrite_agent")
orch.add_edge('rewrite_agent', 'router_agent')
orch.add_conditional_edges(
    "RAG_agent",
    decide_to_search,
    {
        "to_human": "human",
        "do_search": "search_expert",
    },
)
orch.add_edge('blog_writer', 'text2image_generation')
......

# Compile the graph for agentic orchestration
graph = orch.compile(checkpointer=memory, interrupt_before = ['human'])
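A brief usage sketch follows; because the graph uses a checkpointer and a human interrupt, the invocation passes a thread ID, and the 'question' state key shown here is a hypothetical field of MultiAgentState.

# Stream the compiled graph; execution pauses before the 'human' node due to interrupt_before
config = {"configurable": {"thread_id": "session-1"}}
for event in graph.stream({"question": "Plan a blog post about agentic AI"}, config=config):
    print(event)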

The multi-agent graph approach is particularly useful for domains where complex, dynamic interactions between autonomous entities need to be modeled and analyzed, such as in robotics, logistics, social networks, and more. There are multiple advantages and disadvantages to the multi-agent graph-based approach over the linear multi-agent pipelines approach, which are captured below.

Advantages and limitations

The emergence of agentic services represents a transformative approach to system design. Unlike conventional AI models that adhere to fixed, predetermined workflows, agentic systems are characterized by their capacity to collaborate, adapt, and make decisions in real time. This transition from passive to active AI opens up exciting opportunities and presents unique design challenges for developers and architects. Central to agentic services is the notion of agentic reasoning, which embodies a flexible, iterative problem-solving methodology that reflects human cognitive processes. By integrating design patterns such as reflection, self-improvement, and tool utilization, we can develop AI agents that are capable of ongoing enhancement and broader functionality across various domains.

Agentic services, although promising, face several limitations that must be addressed for their successful production implementation. The complexity of managing multiple autonomous agents, especially as their numbers and scope increase, poses a significant challenge in maintaining system coherence and stability. Additionally, the emergent behaviors of these systems can be difficult to predict and understand, hindering transparency and interpretability, which are crucial for building trust and accountability. Safety and robustness are paramount concerns because unintended behaviors or failures could have far-reaching consequences, necessitating robust safeguards and error-handling mechanisms. As agentic services scale up, maintaining efficient performance becomes increasingly challenging, requiring optimized resource utilization and load balancing. Finally, the lack of widely adopted standards and protocols for agent-based systems creates interoperability issues, making it difficult to integrate these services with existing infrastructure. Addressing these limitations is essential for the widespread adoption and success of agentic services in various domains.

Advantages:

  • More flexible representation of agent interactions using a graph structure
  • Better suited for complex workflows with nonlinear agent communication
  • Can more easily represent cycles and branching logic between agents
  • Potentially more scalable for large multi-agent systems
  • Clearer visualization of overall agent system structure

Disadvantages:

  • More complex initial setup compared to linear pipelines
  • Can require more upfront planning to design the graph structure
  • Can require extra resource usage and longer response times

Next steps

In the next phase of multi-agent orchestration, our focus will be on enhancing the reasoning, reflection, and self-correction capabilities of our agents. This involves developing advanced algorithms (such as tree-of-thoughts (ToT) prompting, Monte Carlo tree search (MCTS), and others) that allow agents to learn from their peer interactions, adapt to new situations, and correct their behaviors based on feedback. Additionally, we’re working on creating a production-ready framework that can accommodate a variety of agentic services. This framework will be designed to be flexible and scalable, enabling seamless integration of different types of agents and services. These efforts are currently underway, and we’ll provide a detailed update on our progress in the next blog post. Stay tuned for more insights into our innovative approach to multi-agent orchestration.

Conclusion

Multi-agent orchestration and reasoning represent a significant leap forward in generative AI production adoption, offering unprecedented potential for complex problem-solving and decision-making, decoupling your applications from individual FMs. It’s also crucial to acknowledge and address the limitations, including scalability challenges, long latency and likely incompatibility among different agents. As we look to the future, enhancing self and intra-agent reasoning, reflection, and self-correction capabilities of our agents will be paramount. This will involve developing more sophisticated algorithms for metacognition, improving inter-agent communication protocols, and implementing robust error detection and correction mechanisms.

For the example code and demonstration discussed in this post, refer to the agentic-orchestration GitHub repository and this AWS Workshop. You can also refer to the GitHub repo for Amazon Bedrock multi-agent collaboration code samples.

The authors wish to express their gratitude to Mark Roy, Maria Laderia Tanke, and Max Iguer for their insightful contributions, as well as to Nausheen Sayed for her tireless coordination.


About the authors

Alfred Shen is a Senior GenAI Specialist at AWS. He has been working in Silicon Valley, holding technical and managerial positions in diverse sectors including healthcare, finance, and high-tech. He is a dedicated applied AI/ML researcher, concentrating on agentic solutions and multimodality.

Anya Derbakova is a Senior Startup Solutions Architect at AWS, specializing in Healthcare and Life Science technologies. A University of North Carolina graduate, she previously worked as a Principal Developer at Blue Cross Blue Shield Association. Anya is recognized for her contributions to AWS professional development, having been featured on the AWS Developer Podcast and participating in multiple educational series. She co-hosted a six-part mini-series on AWS Certification Exam Prep, focusing on cost-optimized cloud architecture strategies. Additionally, she was instrumental in the “Get Schooled on…Architecting” podcast, which provided comprehensive preparation for the AWS Solutions Architect Exam.

Read More

NieR Perfect: GeForce NOW Loops Square Enix’s ‘NieR:Automata’ and ‘NieR Replicant ver.1.22474487139…’ Into the Cloud


Stuck in a gaming rut? Get out of the loop this GFN Thursday with four new games joining the GeForce NOW library of over 2,000 supported games.

Dive into Square Enix’s mind-bending action role-playing games (RPGs) NieR:Automata and NieR Replicant ver.1.22474487139…, now streaming in the cloud. Plus, explore HoYoverse’s Zenless Zone Zero for an adrenaline-packed adventure, just in time for its 1.4 update.

Check out GeForce Greats, which offers a look back at the biggest and best moments of PC gaming, from the launch of the GeForce 256 graphics card to the modern era. Follow the GeForce, GeForce NOW, NVIDIA Studio and NVIDIA AI PC channels on X, as well as #GeForceGreats, to join in on the nostalgic journey. Plus, participate in the GeForce LAN Missions from the cloud with GeForce NOW starting on Saturday, Jan. 4, for a chance to win in-game rewards, first come, first served.

GeForce NOW members will also be able to launch a virtual stadium for a front-row seat to the CES opening keynote, to be delivered by NVIDIA founder and CEO Jensen Huang on Monday, Jan. 6. Stay tuned to GFN Thursday for more details.

A Tale of Two NieRs

NieR:Automata and NieR Replicant ver.1.22474487139… — two captivating action RPGs from Square Enix — delve into profound existential themes and are set in a distant, postapocalyptic future.

Existence is futile, except in the cloud.

Control androids 2B, 9S and A2 as they battle machine life-forms in a proxy war for human survival in NieR:Automata. The game explores complex philosophical concepts through its multiple endings and perspective shifts, blurring the lines between man and machine. It seamlessly mixes stylish and exhilarating combat with open-world exploration for a diverse gameplay experience.

The hero’s journey leads to the cloud.

NieR Replicant ver.1.22474487139…, an updated version of the original NieR game, follows a young man’s quest to save his sister from a mysterious illness called the Black Scrawl. Uncover dark secrets about their world while encountering a cast of unforgettable characters and making heart-wrenching decisions.

Unravel the layers of the emotionally charged world of NieR with each playthrough on GeForce NOW. Experience rich storytelling and intense combat without high-end hardware. Carefully explore every possible loop with extended gaming sessions for Performance and Ultimate members.

Find Zen in the Cloud

Dive into the Hollows.

Zenless Zone Zero, the free-to-play action role-playing game from HoYoverse, is set in the post-apocalyptic metropolis of New Eridu. Take on the role of a “Proxy” and guide others through dangerous alternate dimensions to confront an interdimensional threat. The game features a fast-paced, combo-oriented combat system and offers a mix of intense action, character-driven storytelling and exploration of a unique futuristic world.

The title comes to the cloud in time for the version 1.4 update, A Storm of Falling Stars, bringing additions to the game for new and experienced players alike. Joining the roster of playable characters are Frost Anomaly agent Hoshimi Miyabi and Electric Attack agent Asaba Harumasa. Plus, the revamped Decibel system allows individual characters to collect and use Decibels instead of sharing across the squad, offering a new layer of strategy. Explore two new areas, Port Elpis and Reverb Arena, and try out the new “Hollow Zero-Lost Void” mode.

Experience the adventure on GeForce NOW and dive deeper into New Eridu across devices with a Performance or Ultimate membership. Snag some in-game loot by following the GeForce NOW social channels (X, Facebook, Instagram, Threads) and be on the lookout for a limited-quantity redemption code for a free reward package — including 20,000 Dennies, three Official Investigator Logs and three W-Engine Power Supplies.

Fresh Arrivals

Look for the following games available to stream in the cloud this week:

  • NieR:Automata (Steam)
  • NieR Replicant ver.1.22474487139… (Steam)
  • Replikant Chat (Steam)
  • Zenless Zone Zero v1.4 (HoYoverse)

What are you planning to play this weekend? Let us know on X or in the comments below.

Read More

AI’s in Style: Ulta Beauty Helps Shoppers Virtually Try New Hairstyles


Shoppers pondering a new hairstyle can now try styles before committing to curls or a new color. An AI app by Ulta Beauty, the largest specialty beauty retailer in the U.S., uses selfies to show near-instant, highly realistic previews of desired hairstyles.

GLAMlab Hair Try On is a digital experience that lets users take a photo, upload a headshot or use a model’s picture to experiment with different hair colors and styles. Used by thousands of web and mobile app users daily, the experience is powered by the NVIDIA StyleGAN2 generative AI model.

Hair color try-ons feature links to Ulta Beauty products so shoppers can achieve the look in real life. The company, which has more than 1,400 stores across the U.S., has found that people who use the virtual tool are more likely to purchase a product than those who don’t.

“Shoppers need to try out hair and makeup styles before they purchase,” said Juan Cardelino, director of the computer vision and digital innovation department at Ulta Beauty. “As one of the first cosmetics companies to integrate makeup testers in stores, offering try-ons is part of Ulta Beauty’s DNA — whether in physical or digital retail environments.”

Adding Ulta Beauty’s Flair to StyleGAN2

GLAMlab is Ulta Beauty’s first generative AI application, developed by its digital innovation team.

To build its AI pipeline, the team turned to StyleGAN2, a style-based neural network architecture for generative adversarial networks, aka GANs. StyleGAN2, developed by NVIDIA Research, uses transfer learning to generate infinite images in a variety of styles.

“StyleGAN2 is one of the most well-regarded models in the tech community, and, since the source code was available for experimentation, it was the right choice for our application,” Cardelino said. “For our hairstyle try-on use case, we had to license the model for commercial use, retrain it and put guardrails around it to ensure the AI was only modifying pixels related to hair — not distorting any feature of the user’s face.”

Available on the Ulta Beauty website and mobile app, the hair style and color try-ons rely on NVIDIA Tensor Core GPUs in the cloud to run AI inference, which takes around 5 seconds to compute the first style and about a second each for subsequent styles.

The company next plans to incorporate virtual trials for additional hair categories like wigs and is exploring how the virtual hairstyle try-ons could be connected to in-store styling services.

“Stylists could use the tool to show our guests how certain hairstyles will look on them, giving them more confidence to try new looks,” Cardelino said.

Beyond giving customers a new way to interact with Ulta Beauty’s products, these AI-powered virtual try-ons give users a chance to be creative and explore new possibilities for their personal styles.

“Hair and makeup are playful categories,” Cardelino said. “Virtual try-ons are a way to explore options that may be out of a customer’s comfort zone without needing to commit to a physical change.”

See the latest work from NVIDIA Research, which has hundreds of scientists and engineers worldwide, with teams focused on topics including AI, computer graphics, computer vision, self-driving cars and robotics.

Read More

Research Focus: Week of December 16, 2024


Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.


NeoMem: Hardware/Software Co-Design for CXL-Native Memory Tiering

The Compute Express Link (CXL) open standard interconnect enables integration of diverse types of memory into servers via its byte-addressable SerDes links. To fully utilize CXL-based heterogeneous memory systems (which combine different types of memory with varying access speeds), it’s necessary to implement efficient memory tiering—a strategy to manage data placement across memory tiers for optimal performance. Efficiently managing these memory systems is crucial, but has been challenging due to the lack of precise and efficient tools for understanding how memory is accessed.

In a recent paper: NeoMem: Hardware/Software Co-Design for CXL-Native Memory Tiering, researchers from Microsoft propose a novel hardware/software co-design to address this problem. NeoMem offloads memory profiling functions to CXL device-side controllers, integrating a dedicated hardware unit called NeoProf, which monitors memory accesses and provides the operating system (OS) with crucial page hotness statistics and other system state information. On the OS kernel side, the researchers designed a revamped memory-tiering strategy, enabling accurate and timely hot page promotion based on NeoProf statistics. Implemented on a real FPGA-based CXL memory platform and Linux kernel v6.3, NeoMem demonstrated 32% to 67% geomean speedup over several existing memory tiering solutions.


Chimera: Accurate retrosynthesis prediction by ensembling models with diverse inductive biases

Planning and conducting chemical syntheses is a significant challenge in the discovery of functional small molecules, which limits the potential of generative AI for molecular inverse design. Although early machine learning-based retrosynthesis models have shown the ability to predict reasonable routes, they are less accurate for infrequent, yet important reactions.

In a recent paper: Chimera: Accurate retrosynthesis prediction by ensembling models with diverse inductive biases, researchers from Microsoft and external colleagues address this limitation, with a new framework for building highly accurate reaction models. Chimera incorporates two newly developed models, each achieving state-of-the-art performance in their respective categories. Evaluations by PhD-level organic chemists show that Chimera’s predictions are preferred for their higher quality compared to baseline models.

The researchers further validate Chimera’s robustness by applying its largest-scale model to an internal dataset from a major pharmaceutical company, demonstrating its ability to generalize effectively under distribution shifts. This new framework shows the potential to substantially accelerate the development of even more accurate and versatile reaction prediction models.


The GA4GH Task Execution API: Enabling Easy Multicloud Task Execution

In bioinformatics and computational biology, data analysis often involves chaining command-line programs developed by specialized teams at different institutions. These tools, which vary widely in age, software stacks, and dependencies, lack a common programming interface, which makes integration, workflow management and reproducibility challenging.

A recent article emphasizes the development, adoption and implementation of the Global Alliance for Genomics and Health (GA4GH) Task Execution Service (TES) API, created in collaboration with researchers at Microsoft and other institutions. The TES API offers a unified schema and interface for submitting and managing tasks, seamlessly bridging gaps between on-premises high-performance and high-throughput computing systems, cloud platforms, and hybrid infrastructures. Its flexibility and extensibility have already made it a critical asset for applications ranging from federated data analysis to load balancing across multi-cloud systems.

Adopted by numerous service providers and integrated into several workflow engines, TES empowers researchers to execute complex computational tasks through a single, abstracted interface. This eliminates compatibility hurdles, accelerates research timelines, reduces costs and enables “compute to data” solutions—essential for tackling the challenges of distributed data analysis.


RedCode: Risky Code Execution and Generation Benchmark for Code Agents

Increasing use of code agents for AI-assisted coding and software development has brought safety and security concerns, such as generating or executing malicious code, which have become significant barriers to real-world deployment of these agents.

In a recent paper: RedCode: Risky Code Execution and Generation Benchmark for Code Agents, published at NeurIPS 2024, researchers from Microsoft and external colleagues propose comprehensive and practical evaluations on the safety of code agents. RedCode is an evaluation platform with benchmarks grounded in four key principles: real interaction with systems, holistic evaluation of unsafe code generation and execution, diverse input formats, and high-quality safety scenarios and tests.

This research evaluated three agents based on various large language models (LLMs), providing insights into code agents’ vulnerabilities. For instance, results showed that agents are more likely to reject executing unsafe operations on the operating system. Unsafe operations described in natural text lead to a lower rejection rate than those in code format. Additional evaluations revealed that more capable base models and agents with stronger overall coding abilities, such as GPT-4, tend to produce more sophisticated harmful software.

These findings highlight the need for stringent safety evaluations for diverse code agents. The underlying dataset and related code are publicly available at https://github.com/AI-secure/RedCode.


Towards industrial foundation models: Integrating large language models with industrial data intelligence

Although large language models (LLMs) excel at language-focused tasks like news writing, document summarization, customer service, and supporting virtual assistants, they can face challenges when it comes to learning and inference on numeric and structured industry data, such as tabular and time series data. To address these issues, researchers from Microsoft propose a new approach to building industrial foundation models (IFMs). As outlined in a recent blog post, they have successfully demonstrated the feasibility of cross-domain universal in-context learning on tabular data and the significant potential it could achieve.

The researchers designed Generative Tabular Learning (GTL), a new framework that integrates multi-industry zero-shot and few-shot learning capabilities into LLMs. This approach allows the models to adapt and generalize to new fields, new data, and new tasks more effectively, flexibly responding to diverse data science tasks. This technical paradigm has been open-sourced to promote broader use.

Microsoft Research in the news


Microsoft’s smaller AI model beats the big guys: Meet Phi-4, the efficiency king 

December 12, 2024

Microsoft launched a new artificial intelligence model today that achieves remarkable mathematical reasoning capabilities while using far fewer computational resources than its larger competitors.


Microsoft researcher Ece Kamar discusses the future of AI agents in 2025 

Tech Brew | December 12, 2024

With AI agents widely expected to take off in 2025, the director of Microsoft’s AI Frontiers lab weighs in on the future of this technology, the safeguards needed, and the year ahead in AI research.


A new frontier awaits — computing with light 

December 12, 2024

In the guts of a new type of computer, a bunch of tiny LEDs emit a green glow. Those lights have a job to do. They’re performing calculations. Right now, this math is telling the computer how to identify handwritten images of numbers. The computer is part of a research program at Microsoft.

The post Research Focus: Week of December 16, 2024 appeared first on Microsoft Research.

Read More

Imbue’s Kanjun Qiu Shares Insights on How to Build Smarter AI Agents


Imagine a future in which everyone is empowered to build and use their own AI agents. That future may not be far off, as new software is infused with intelligence through collaborative AI systems that work alongside users rather than merely automating tasks.

In this episode of the NVIDIA AI Podcast, Kanjun Qiu, CEO of Imbue, discusses the rise of AI agents, drawing parallels between the personal computer revolution of the late 1970s and 80s and today’s AI agent transformation. She details Imbue’s approach to building reasoning capabilities into its products, the challenges of verifying the correctness of AI outputs and how Imbue is focusing on post-training and fine-tuning to improve verification capabilities.

Learn more about Imbue, and read more about AI agents, including how virtual assistants can enhance customer service experiences.

And hear more about the future of AI and graphics by tuning in to the CES keynote, delivered by NVIDIA founder and CEO Jensen Huang live in Las Vegas on Monday, Jan. 6, at 6:30 p.m. PT.

Time Stamps

1:21 – What are AI agents? And Imbue’s approach to them.

9:00 – Where are AI agents being used the most today?

17:05 – Why building a good user experience around agents requires invention.

26:28 – How reasoning and verification capabilities factor into Imbue’s products.

You Might Also Like… 

Zoom CTO Xuedong “XD” Huang on How AI Revolutionizes Productivity 

Zoom is now transforming into an AI-first platform. CTO Xuedong Huang discusses Zoom’s AI Companion 2.0 and the company’s “federated AI” strategy, which aims to integrate multiple large language models to enhance productivity and collaboration.

How Roblox Uses Generative AI to Enhance User Experiences

Roblox is enhancing its colorful online platform with generative AI to improve user safety and inclusivity through features like automated chat filters and real-time text translation. Anupam Singh, VP of AI and growth engineering at Roblox, explores how AI coding assistants are helping creators focus more on creative expression.

Rendered.ai CEO Nathan Kundtz on Using AI to Build Better AI

Data is crucial for training AI and machine learning systems, and synthetic data offers a solution to the challenges of compiling real-world data. Nathan Kundtz, founder and CEO of Rendered.ai, discusses how his company’s platform generates synthetic data to enhance AI models.

Subscribe to the AI Podcast

Get the AI Podcast through Apple Podcasts, Google Podcasts, Google Play, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, Soundcloud, Spotify, Stitcher and TuneIn.

Read More

AI at Your Service: Digital Avatars With Speech Capabilities Offer Interactive Customer Experiences


Editor’s note: This post is part of the AI On blog series, which explores the latest techniques and real-world applications of agentic AI, chatbots and copilots. The series will also highlight the NVIDIA software and hardware powering advanced AI agents, which form the foundation of AI query engines that gather insights and perform tasks to transform everyday experiences and reshape industries.

To enhance productivity and upskill workers, organizations worldwide are seeking ways to provide consistent, around-the-clock customer service with greater speed, accuracy and scale.

Intelligent AI agents offer one such solution. They deliver advanced problem-solving capabilities and integrate vast and disparate sources of data to understand and respond to natural language.

Powered by generative AI and agentic AI, digital avatars are boosting efficiency across industries like healthcare, telecom, manufacturing, retail and more. According to Gartner, by 2028, 45% of organizations with more than 500 employees will use employee AI avatars to expand the capacity of human capital.1

From educating prospects on policies to giving customers personalized solutions, AI is helping organizations optimize revenue streams and elevate employee knowledge and productivity.

Where Context-Aware AI Avatars Are Most Impactful

Staying ahead in a competitive, evolving market requires continuous learning and analysis. AI avatars — also referred to as digital humans — are addressing key concerns and enhancing operations across industries.

One key benefit of agentic digital human technology is the ability to offer consistent, multilingual support and personalized guidance for a variety of use cases.

For instance, a medical-based AI agent can provide 24/7 virtual intake and support telehealth services. Or, a virtual financial advisor can help enhance client security and financial literacy by alerting bank customers of potential fraud, or offering personalized offers and investment tips based on their unique portfolio.

These digital humans boost efficiency, cut costs and enhance customer loyalty. Some key ways digital humans can be applied include:

  • Personalized, On-Brand Customer Assistance: A digital human interface can provide a personal touch when educating new customers on a company’s products and service portfolios. They can provide ongoing customer support, offering immediate responses and solving problems without the need for a live operator.
  • Enhanced Employee Onboarding: Intelligent AI assistants can offer streamlined, adaptable, personalized employee onboarding, whether in hospitals or offices, by providing consistent access to updated institutional knowledge at scale. With pluggable, customizable retrieval-augmented generation (RAG), these assistants can deliver real-time answers to queries while maintaining a deep understanding of company-specific data.
  • Seamless Communication Across Languages: In global enterprises, communication barriers can slow down operations. AI-powered avatars with natural language processing capabilities can communicate effortlessly across languages. This is especially useful in customer service or employee training environments where multilingual support is crucial.

Learn more by listening to the NVIDIA AI Podcast episode with Kanjun Qiu, CEO of Imbue, who shares insights on how to build smarter AI agents.

Interactive AI Agents With Text-to-Speech and Speech-to-Text

With text-to-speech and speech-to-text capabilities, AI agents can offer enhanced interactivity and engagement in customer service interactions.

SoftServe, an IT consulting and digital services provider, has built several digital humans for a variety of use cases, highlighting the technology’s potential to enhance user experiences.

SoftServe’s Digital Concierge is accelerated by NVIDIA AI Blueprints and NVIDIA ACE technologies to rapidly deploy scalable, customizable digital humans across diverse infrastructures.

GEN, SoftServe’s virtual customer service assistant and digital concierge, makes customer service more engaging by providing lifelike interactions, continuous availability, personalized responses and simultaneous access to all necessary knowledge bases.

SoftServe also developed FINNA, an AI-powered virtual financial advisor that can provide financial guidance tailored to a client’s profile and simplify complex financial terminology. It helps streamline onboarding and due diligence, supporting goal-oriented financial planning and risk assessment.

AISHA is another AI-powered digital human developed by SoftServe with NVIDIA technology. Created for the UAE Ministry of Justice, the digital human significantly improves judicial processes by reducing case review times, enhancing the accuracy of rulings and providing rapid access to legal databases. It demonstrates how generative AI can bridge the gap between technology and meaningful user interaction to enhance customer service and operational efficiency in the judicial sector.

How to Design AI Agents With Avatar and Speech Features

Designing AI agents with avatar and speech features involves several key steps (see the simplified sketch after this list):

  1. Determine the use case: Choose between 2D or 3D avatars based on the required level of immersion and interaction.
  2. Avatar development:
    • For 3D avatars, use specialized software and technical expertise to create lifelike movements and photorealism.
    • For 2D avatars, opt for quicker development suitable for web-embedded solutions.
  3. Integrate speech technologies: Use NVIDIA Riva for world-class automatic speech recognition, along with text-to-speech to enable verbal interactions.
  4. Rendering options: Use NVIDIA Omniverse RTX Renderer technology or Unreal Engine tools for 3D avatars to achieve high-quality output and compute efficiency.
  5. Deployment: Tap cloud-native deployment for real-time output and scalability, particularly for interactive web or mobile applications.
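To show how these steps fit together at the application level, the following is a minimal Python sketch of the interaction loop for a speech-enabled avatar agent. The transcribe_speech, query_agent, synthesize_speech, and animate_avatar functions are hypothetical placeholders for the ASR, RAG-backed agent, TTS, and rendering components named above (for example, NVIDIA Riva for speech); they are not actual library calls.

def transcribe_speech(audio_chunk: bytes) -> str:
    # Hypothetical ASR wrapper (for example, backed by NVIDIA Riva speech recognition).
    return "placeholder transcript of the user's question"

def query_agent(user_text: str) -> str:
    # Hypothetical call into the RAG-backed agent that produces the avatar's reply.
    return f"placeholder reply to: {user_text}"

def synthesize_speech(reply_text: str) -> bytes:
    # Hypothetical TTS wrapper returning audio for the avatar to speak.
    return reply_text.encode("utf-8")

def animate_avatar(audio: bytes, reply_text: str) -> None:
    # Hypothetical hook that drives the 2D or 3D avatar (lip sync, gestures).
    print(f"[avatar speaks] {reply_text}")

def interaction_loop(get_audio_chunk, max_turns: int = 3) -> None:
    # Core loop: listen -> transcribe -> reason -> speak -> animate.
    for _ in range(max_turns):
        audio_chunk = get_audio_chunk()
        if audio_chunk is None:  # user left or the session ended
            break
        user_text = transcribe_speech(audio_chunk)
        reply_text = query_agent(user_text)
        reply_audio = synthesize_speech(reply_text)
        animate_avatar(reply_audio, reply_text)

interaction_loop(lambda: b"fake-microphone-audio")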

For an overview on how to design interactive customer service tools, read the technical blogs on how to “Build a Digital Human Interface for AI Apps With an NVIDIA AI Blueprint” and “Expanding AI Agent Interface Options With 2D and 3D Digital Human Avatars.”

NVIDIA AI Blueprint for Digital Humans

The latest release of the NVIDIA AI Blueprint for digital humans introduces several updates that enhance the interactivity and responsiveness of digital avatars, including dynamic switching between RAG models. Users can experience this directly in preview.

The integration of the Audio2Face-2D microservice in the blueprint means developers can create 2D digital humans, which require significantly less processing power compared with 3D models, for web- and mobile-based applications.

2D avatars are better suited for simpler interactions and platforms where photorealism isn’t necessary. This makes them ideal for scenarios like telemedicine, where quick loading times with lower bandwidth requirements are crucial.

Another significant update is the introduction of user attention detection through vision AI. This feature enables digital humans to detect when a user is present — even if they are idle or on mute — and initiate interaction, such as greeting the user. This capability is particularly beneficial in kiosk scenarios, where engaging users proactively can enhance the service experience.

Getting Started

NVIDIA AI Blueprints make it easy to start building and setting up virtual assistants by offering ready-made workflows and tools to accelerate deployment. Whether for a simple AI-powered chatbot or a fully animated digital human interface, the blueprints offer resources to create AI assistants that are scalable, aligned with an organization’s brand and deliver a responsive, efficient customer support experience.

 

1. Gartner®, Hype Cycle™ for the Future of Work, 2024, Tori Paulman, Emily Rose, etc., July 2024

GARTNER is a registered trademark and service mark and Hype Cycle is a trademark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.

Read More

How Fastweb fine-tuned the Mistral model using Amazon SageMaker HyperPod as a first step to build an Italian large language model


This post is co-written with Marta Cavalleri and Giovanni Germani from Fastweb, and Claudia Sacco and Andrea Policarpi from BIP xTech.

AI’s transformative impact extends throughout the modern business landscape, with telecommunications emerging as a key area of innovation. Fastweb, one of Italy’s leading telecommunications operators, recognized the immense potential of AI technologies early on and began investing in this area in 2019. With a vision to build a large language model (LLM) trained on Italian data, Fastweb embarked on a journey to make this powerful AI capability available to third parties.

Training an LLM is a compute-intensive and complex process, which is why Fastweb, as a first step in their AI journey, used AWS generative AI and machine learning (ML) services such as Amazon SageMaker HyperPod.

SageMaker HyperPod can provision and maintain large-scale compute resilient clusters powered by thousands of accelerators such as AWS Trainium and NVIDIA H200 and H100 Graphical Processing Units (GPUs), but its flexibility allowed Fastweb to deploy a small, agile and on-demand cluster enabling efficient resource utilization and cost management, aligning well with the project’s requirements.

In this post, we explore how Fastweb used cutting-edge AI and ML services to embark on their LLM journey, overcoming challenges and unlocking new opportunities along the way.

Fine-tuning Mistral 7B on AWS

Fastweb recognized the importance of developing language models tailored to the Italian language and culture. To achieve this, the team built an extensive Italian language dataset by combining public sources and acquiring licensed data from publishers and media companies. Using this data, Fastweb, in their first experiment with LLM training, fine-tuned the Mistral 7B model, a state-of-the-art LLM, successfully adapting it to tasks such as summarization, question answering, and creative writing in Italian. The fine-tuned model applies a nuanced understanding of Italian culture to its responses, producing contextually appropriate and culturally sensitive output.

The team opted for fine-tuning on AWS. This strategic decision was driven by several factors:

  • Efficient data preparation – Building a high-quality pre-training dataset is a complex task, involving assembling and preprocessing text data from various sources, including web sources and partner companies. Because the final, comprehensive pre-training dataset was still under construction, it was essential to begin with an approach that could adapt existing models to Italian.
  • Early results and insights – Fine-tuning allowed the team to achieve early results in training models on the Italian language, providing valuable insights and preliminary Italian language models. This enabled the engineers to iteratively improve the approach based on initial outcomes.
  • Computational efficiency – Fine-tuning requires significantly less computational power and less time to complete compared to a complete model pre-training. This approach streamlined the development process and allowed for a higher volume of experiments within a shorter time frame on AWS.

To facilitate the process, the team created a comprehensive dataset encompassing a wide range of tasks, constructed by translating existing English datasets and generating synthetic elements. The dataset was stored in an Amazon Simple Storage Service (Amazon S3) bucket, which served as a centralized data repository. During the training process, our SageMaker HyperPod cluster was connected to this S3 bucket, enabling effortless retrieval of the dataset elements as needed.

The integration of Amazon S3 and the SageMaker HyperPod cluster exemplifies the power of the AWS ecosystem, where various services work together seamlessly to support complex workflows.
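As a simple illustration of this retrieval pattern (not code from the post), a cluster node could pull dataset elements from the bucket with boto3. The bucket name, prefix, and local path below are hypothetical.

import os
import boto3

# Hypothetical names; the post does not disclose Fastweb's actual bucket layout.
BUCKET = "example-italian-finetuning-data"
PREFIX = "fine-tuning/v1/"
LOCAL_DIR = "/fsx/datasets/italian-v1"  # for example, an FSx for Lustre mount on the node

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

os.makedirs(LOCAL_DIR, exist_ok=True)
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        local_path = os.path.join(LOCAL_DIR, os.path.basename(key))
        s3.download_file(BUCKET, key, local_path)  # fetch each dataset element as needed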

Overcoming data scarcity with translation and synthetic data generation

When fine-tuning a custom version of the Mistral 7B LLM for the Italian language, Fastweb faced a major obstacle: high-quality Italian datasets were extremely limited or unavailable. To tackle this data scarcity challenge, Fastweb had to build a comprehensive training dataset from scratch to enable effective model fine-tuning.

While establishing strategic agreements to acquire licensed data from publishers and media companies, Fastweb employed two main strategies to create a diverse and well-rounded dataset: translating open source English training data into Italian and generating synthetic Italian data using AI models.

To use the wealth of information available in English, Fastweb translated open source English training datasets into Italian. This approach made valuable data accessible and relevant for Italian language training. Both LLMs and open source translation tools were used for this process.

The open source Argos Translate tool was used for bulk translation of datasets with simpler content. Although LLMs offer superior translation quality, Argos Translate is free, extremely fast, and well-suited for efficiently handling large volumes of straightforward data. For complex datasets where accuracy was critical, LLMs were employed to provide high-quality translations.
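For example, a bulk English-to-Italian pass with the open source argostranslate Python package could be sketched as follows; the sample inputs are illustrative, and the post does not describe Fastweb's exact translation pipeline.

import argostranslate.package
import argostranslate.translate

# One-time setup: download and install the English -> Italian translation model.
argostranslate.package.update_package_index()
available_packages = argostranslate.package.get_available_packages()
en_it = next(p for p in available_packages if p.from_code == "en" and p.to_code == "it")
argostranslate.package.install_from_path(en_it.download())

# Bulk-translate simple dataset elements.
english_samples = [
    "What is the capital of Italy?",
    "Summarize the following article in three sentences.",
]
italian_samples = [
    argostranslate.translate.translate(text, "en", "it") for text in english_samples
]
print(italian_samples)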

To further enrich the dataset, Fastweb generated synthetic Italian data using LLMs. This involved creating a variety of text samples covering a wide range of topics and tasks relevant to the Italian language. High-quality Italian web articles, books, and other texts served as the basis for training the LLMs to generate authentic-sounding synthetic content that captured the nuances of the language.

The resulting sub-datasets spanned diverse subjects, including medical information, question-answer pairs, conversations, web articles, science topics, and more. The tasks covered were also highly varied, encompassing question answering, summarization, creative writing, and others.

Each subset generated through translation or synthetic data creation underwent meticulous filtering to maintain quality and diversity. A similarity check was performed to deduplicate the data; if two elements were found to be too similar, one was removed. This step was crucial in maintaining variability and preventing bias from repetitive or overly similar content.

The deduplication process involved embedding dataset elements using a text embedder, then computing cosine similarity between the embeddings to identify similar elements. Meta’s FAISS library, renowned for its efficiency in similarity search and clustering of dense vectors, was used as the underlying vector database due to its ability to handle large-scale datasets effectively.
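A simplified version of this deduplication step might look like the following sketch; the embedding model and the 0.9 similarity threshold are assumptions, because the post does not specify which embedder or cut-off was used.

import faiss
from sentence_transformers import SentenceTransformer

texts = [
    "Roma è la capitale d'Italia.",
    "La capitale d'Italia è Roma.",
    "Il Po è il fiume più lungo d'Italia.",
]

# Embed and L2-normalize so that inner product equals cosine similarity.
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed model
embeddings = embedder.encode(texts, normalize_embeddings=True).astype("float32")

index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

# For each element, look up its closest neighbor and drop later near-duplicates.
THRESHOLD = 0.9  # assumed similarity cut-off
scores, neighbors = index.search(embeddings, k=2)  # k=2: the element itself plus its nearest other element
keep, dropped = [], set()
for i, (score_row, neighbor_row) in enumerate(zip(scores, neighbors)):
    if i in dropped:
        continue
    keep.append(texts[i])
    for score, j in zip(score_row, neighbor_row):
        if j != i and score >= THRESHOLD:
            dropped.add(int(j))

print(keep)  # the near-duplicate of the first sentence is expected to be removed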

After filtering and deduplication, the remaining subsets were postprocessed and combined to form the final fine-tuning dataset, comprising 300,000 training elements. This comprehensive dataset enabled Fastweb to effectively fine-tune their custom version of the Mistral 7B model, achieving high performance and diversity across a wide range of tasks and topics.

All data generation and processing steps were run in parallel directly on the SageMaker HyperPod cluster nodes, using a single working environment and highlighting the cluster’s versatility for tasks beyond model training.

The following diagram illustrates two distinct data pipelines for creating the final dataset: the upper pipeline uses translations of existing English datasets into Italian, and the lower pipeline employs custom generated synthetic data.

Dataset creation pipelines

The computational cost of training an LLM

The computational cost of training LLMs scales approximately with the number of parameters and the amount of training data. As a general rule, for each model parameter being trained, approximately 24 bytes of memory are required. This means that to fully fine-tune a 7 billion parameter model like Mistral 7B, at least 156 GB of hardware memory is necessary, not including the additional overhead of loading training data.
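As a quick sanity check on that rule of thumb, the estimate for a 7B-parameter model can be computed directly; the snippet below simply multiplies out the numbers and prints the result in both decimal gigabytes and binary gibibytes.

params = 7e9              # Mistral 7B
bytes_per_param = 24      # rule of thumb covering weights, gradients, and optimizer state
total_bytes = params * bytes_per_param
print(f"{total_bytes / 1e9:.0f} GB (decimal), {total_bytes / 2**30:.0f} GiB (binary)")
# Prints: 168 GB (decimal), 156 GiB (binary) -- the ~156 figure quoted above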

The following table provides additional examples.

LLM Model Size vs. Training Memory
Number of Parameters    Memory Requirement
500 million             12 GB
1 billion               23 GB
2 billion               45 GB
3 billion               67 GB
5 billion               112 GB
7 billion               156 GB
10 billion              224 GB

Parameter-efficient fine-tuning (PEFT) methods minimize the number of trainable parameters, whereas quantization reduces the number of bits per parameter, often with minimal negative impact on the final training results.
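The post does not state which, if any, of these techniques Fastweb applied. Purely as an illustration, a LoRA-style PEFT setup with 4-bit quantization using the Hugging Face transformers and peft libraries might be sketched as follows; the model ID, adapter rank, and target modules are assumptions.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model with 4-bit quantization to reduce per-parameter memory.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",            # assumed model ID
    quantization_config=quant_config,
    device_map="auto",
)

# Wrap the model with LoRA adapters so only a small set of parameters is trained.
lora_config = LoraConfig(
    r=16,                                    # assumed adapter rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],     # assumed attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()           # trainable parameters are a small fraction of 7B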

Despite these memory-saving techniques, fine-tuning large models still demands substantial GPU memory and extended training times. This makes distributed training essential, allowing the workload to be shared across multiple GPUs, thereby enabling the efficient handling of such large-scale computational tasks.

The following table and figure illustrate the allocation of GPU memory during each phase of LLM training.

Training requirements

Solution overview

Training LLMs often requires significant computational resources that can exceed the capabilities of a single GPU. Distributed training is a powerful technique that addresses this challenge by distributing the workload across multiple GPUs and nodes, enabling parallel processing and reducing training time. SageMaker HyperPod simplifies the process of setting up and running distributed training jobs, providing preconfigured environments and libraries specifically designed for this purpose.

There are two main techniques for distributed training: data parallelization and model parallelization. Data parallelization involves distributing the training data across multiple GPUs, whereas model parallelization splits the model itself across different GPUs.

To take advantage of distributed training, a cluster of interconnected GPUs, often spread across multiple physical nodes, is required. SageMaker HyperPod allows for both data and model parallelization techniques to be employed simultaneously, maximizing the available computational resources. Also, SageMaker HyperPod provides resilience through features like automatic fault detection and recovery, which are crucial for long-running training jobs. SageMaker HyperPod allows for the creation of personalized Conda environments, enabling the installation of necessary libraries and tools for distributed training.

One popular library for implementing distributed training is DeepSpeed, a Python optimization library that handles distributed training and makes it memory-efficient and fast by enabling both data and model parallelization. The choice to use DeepSpeed was driven by the availability of an extensive, already-developed code base, ready to be employed for training experiments. The high flexibility and environment customization capabilities of SageMaker HyperPod made it possible to create a personalized Conda environment with all the necessary libraries installed, including DeepSpeed.
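As a simplified illustration (not Fastweb's actual configuration), a training script can hand DeepSpeed its settings as a Python dictionary and let the engine manage data parallelism and ZeRO sharding; the batch sizes, optimizer settings, and ZeRO stage below are assumptions.

import torch
import deepspeed

# Stand-in model for illustration; in practice this is the prepared fine-tuning model.
model = torch.nn.Linear(4096, 4096)

ds_config = {
    "train_micro_batch_size_per_gpu": 2,                      # assumed value
    "gradient_accumulation_steps": 8,                         # assumed value
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 2e-4}},   # assumed optimizer settings
    "zero_optimization": {"stage": 2},                        # shards optimizer state and gradients across GPUs
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=[p for p in model.parameters() if p.requires_grad],
    config=ds_config,
)

# A typical training step then becomes:
#   loss = compute_loss(model_engine(batch))
#   model_engine.backward(loss)
#   model_engine.step()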

The following diagram illustrates the two key parallelization strategies offered by DeepSpeed: data parallelism and model parallelism. Data parallelism involves replicating the entire model across multiple devices, with each device processing a distinct batch of training data. In contrast, model parallelism distributes different parts of a single model across multiple devices, enabling the training of large models that exceed the memory capacity of a single device.

Data parallelization and model parallelization

To help meet the demanding computational requirements of training LLMs, we used the power and flexibility of SageMaker HyperPod clusters, orchestrated with Slurm. While HyperPod also supports orchestration with Amazon EKS, our research team had prior expertise with Slurm. The cluster configuration was tailored to our specific training needs, providing optimal resource utilization and cost-effectiveness.

The SageMaker HyperPod cluster architecture consisted of a controller machine to orchestrate the training job’s coordination and resource allocation. The training tasks were run by two compute nodes, which were g5.12xlarge instances equipped with high-performance GPUs. These compute nodes handled the bulk of the computational workload, using their GPUs to accelerate the training process.

The AWS managed high-performance Lustre file system (Amazon FSx for Lustre) mounted on the nodes provided high-speed data access and transfer rates, which are essential for efficient training operations.

SageMaker HyperPod is typically used to launch large clusters for pre-training LLMs with thousands of GPUs, but one of its key advantages is its flexibility: it also allows for the creation of small, agile, on-demand clusters. This versatility made it possible to use resources only when needed, avoiding unnecessary costs.

For the DeepSpeed configuration, we followed the standard recommended setup, enabling data and model parallelism across the two g5.12xlarge nodes of the cluster, for a total of 8 GPUs.

Although more advanced techniques were available, such as offloading some computation to the CPU during training, our cluster was sized with a sufficiently high GPU memory margin. With 192 GiB (206 GB) of overall GPU memory available, even accounting for the additional GPU memory needed to keep dataset batches in memory during training, we had ample resources to train a 7B parameter model without these advanced techniques. The following figure describes the infrastructure setup of our training solution.

Architecture diagram

Training results and output examples

After completing the training process, Fastweb’s fine-tuned language model demonstrated a significant performance improvement on Italian language tasks compared to the base model. Evaluated on an internal benchmark dataset, the fine-tuned model achieved an average accuracy increase of 20% across a range of tasks designed to assess its general understanding of the Italian language.

The benchmark tasks focused on three key areas: question answering, common sense reasoning, and next word prediction. Question answering tasks tested the model’s ability to comprehend and provide accurate responses to queries in Italian. Common sense reasoning evaluated the model’s grasp of common sense knowledge and its capacity to make logical inferences based on real-world scenarios. Next word prediction assessed the model’s understanding of language patterns and its ability to predict the most likely word to follow in a given context.

To evaluate the fine-tuned model’s performance, we initiated our interaction by inquiring about its capabilities. The model responded by enumerating its primary functions, emphasizing its ability to address Fastweb-specific topics. The response was formulated in correct Italian with a very natural syntax, as illustrated in the following example.

Dialog 1 - How can you help me?

Afterwards, we asked the model to generate five titles for a presentation on the topic of AI.

Generate titles for a slide deck about AI

Just for fun, we asked what the most famous sandwich is. The model responded with a combination of typical Italian ingredients and added that there is a wide variety of choices.

What is the most famous panini in Italy?

Lastly, we asked the model to provide us with a useful link to understand the recent EU AI Act. The model provided a working link, along with a helpful description.

Tell me something about EU AI Act

Conclusion

Using SageMaker HyperPod, Fastweb successfully fine-tuned the Mistral 7B model as a first step in their generative AI journey, significantly improving its performance on tasks involving the Italian language.

Looking ahead, Fastweb plans to deploy their next models also on Amazon Bedrock using the Custom Model Import feature. This strategic move will enable Fastweb to quickly build and scale new generative AI solutions for their customers, using the broad set of capabilities available on Amazon Bedrock.

By harnessing Amazon Bedrock, Fastweb can further enhance their offerings and drive digital transformation for their customers. This initiative aligns with Fastweb’s commitment to staying at the forefront of AI technology and fostering innovation across various industries.

With their fine-tuned language model running on Amazon Bedrock, Fastweb will be well-positioned to deliver cutting-edge generative AI solutions tailored to the unique needs of their customers. This will empower businesses to unlock new opportunities, streamline processes, and gain valuable insights, ultimately driving growth and competitiveness in the digital age.

Fastweb’s decision to use the Custom Model Import feature in Amazon Bedrock underscores the company’s forward-thinking approach and their dedication to providing their customers with the latest and most advanced AI technologies. This collaboration with AWS further solidifies Fastweb’s position as a leader in digital transformation and a driving force behind the adoption of innovative AI solutions across industries.

To learn more about SageMaker HyperPod, refer to Amazon SageMaker HyperPod and the Amazon SageMaker HyperPod workshop.


About the authors

Marta Cavalleri is the Manager of the Artificial Intelligence Center of Excellence (CoE) at Fastweb, where she leads teams of data scientists and engineers in implementing enterprise AI solutions. She specializes in AI operations, data governance, and cloud architecture on AWS.

Giovanni Germani is the Manager of Architecture & Artificial Intelligence CoE at Fastweb, where he leverages his extensive experience in Enterprise Architecture and digital transformation. With over 12 years in Management Consulting, Giovanni specializes in technology-driven projects across telecommunications, media, and insurance industries. He brings deep expertise in IT strategy, cybersecurity, and artificial intelligence to drive complex transformation programs.

Claudia Sacco is an AWS Professional Solutions Architect at BIP xTech, collaborating with Fastweb’s AI CoE and specialized in architecting advanced cloud and data platforms that drive innovation and operational excellence. With a sharp focus on delivering scalable, secure, and future-ready solutions, she collaborates with organizations to unlock the full potential of cloud technologies. Beyond her professional expertise, Claudia finds inspiration in the outdoors, embracing challenges through climbing and trekking adventures with her family.

Andrea Policarpi is a Data Scientist at BIP xTech, collaborating with Fastweb’s AI CoE. With a strong foundation in computer vision and natural language processing, he is currently exploring the world of Generative AI and leveraging its powerful tools to craft innovative solutions for emerging challenges. In his free time, Andrea is an avid reader and enjoys playing the piano to relax.

Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.

Adolfo Pica has a strong background in cloud computing, with over 20 years of experience in designing, implementing, and optimizing complex IT systems and architectures and with a keen interest and hands-on experience in the rapidly evolving field of generative AI and foundation models. He has expertise in AWS cloud services, DevOps practices, security, data analytics and generative AI. In his free time, Adolfo enjoys following his two sons in their sporting adventures in taekwondo and football.

Maurizio Pinto is a Senior Solutions Architect at AWS, specialized in cloud solutions for telecommunications. With extensive experience in software architecture and AWS services, he helps organizations navigate their cloud journey while pursuing his passion for AI’s transformative impact on technology and society.

Read More