Fine-tune and Deploy Mistral 7B with Amazon SageMaker JumpStart

Today, we are excited to announce the capability to fine-tune the Mistral 7B model using Amazon SageMaker JumpStart. You can now fine-tune and deploy Mistral text generation models on SageMaker JumpStart using the Amazon SageMaker Studio UI with a few clicks or using the SageMaker Python SDK.

Foundation models perform very well on generative tasks, from crafting text and summaries and answering questions to producing images and videos. Despite their strong generalization capabilities, these models may not provide good results for use cases that rely on very specific domain data (such as healthcare or financial services). In these cases, you need to further fine-tune the generative AI model on use case-specific and domain-specific data.

In this post, we demonstrate how to fine-tune the Mistral 7B model using SageMaker JumpStart.

What is Mistral 7B

Mistral 7B is a foundation model developed by Mistral AI that supports English text and code generation. It supports a variety of use cases, such as text summarization, classification, text completion, and code completion. To demonstrate the customizability of the model, Mistral AI has also released a Mistral 7B-Instruct model for chat use cases, fine-tuned using a variety of publicly available conversation datasets.

Mistral 7B is a transformer model and uses grouped query attention and sliding window attention to achieve faster inference (low latency) and handle longer sequences. Grouped query attention is an architecture that combines multi-query and multi-head attention to achieve output quality close to multi-head attention and comparable speed to multi-query attention. Sliding window attention restricts each token to attending to a fixed window of preceding tokens; because transformer layers are stacked, information from earlier tokens still propagates forward from layer to layer, which helps the model understand a longer stretch of context. Mistral 7B has an 8,000-token context length, demonstrates low latency and high throughput, and has strong performance when compared to larger model alternatives, providing low memory requirements at a 7B model size. The model is made available under the permissive Apache 2.0 license, for use without restrictions.
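To make the sliding window idea concrete, the following is a minimal sketch in plain Python, not Mistral's implementation. It builds a causal attention mask with a fixed window and then checks how far back information can travel when several such layers are stacked. The function names, the tiny sequence length, and the window size are illustrative assumptions.

```python
def sliding_window_mask(seq_len, window):
    """Return mask[i][j] == True when query position i may attend to key position j.

    Each position sees itself and at most window - 1 earlier positions (causal).
    """
    return [
        [j <= i and i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


def earliest_reachable(seq_len, window, num_layers):
    """Earliest input position that can influence the last position after num_layers layers."""
    mask = sliding_window_mask(seq_len, window)
    reachable = {seq_len - 1}
    for _ in range(num_layers):
        # One layer lets every reachable position pull in the positions it attends to.
        reachable = {j for i in reachable for j in range(seq_len) if mask[i][j]}
    return min(reachable)


# Toy sizes chosen for readability; real models use far larger values.
for layers in (1, 2, 3):
    print(layers, earliest_reachable(seq_len=16, window=4, num_layers=layers))
```

With a window of 4, a single layer reaches 3 positions back from the last token, and each additional layer extends that reach by another 3 positions, which is how a stack of layers covers more context than one window.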

You can fine-tune the models using either the SageMaker Studio UI or SageMaker Python SDK. We discuss both methods in this post.

Fine-tune via the SageMaker Studio UI

In SageMaker Studio, you can access the Mistral model via SageMaker JumpStart under Models, notebooks, and solutions, as shown in the following screenshot.

If you don’t see Mistral models, update your SageMaker Studio version by shutting down and restarting. For more information about version updates, refer to Shut down and Update Studio Apps.

On the model page, you can point to the Amazon Simple Storage Service (Amazon S3) bucket containing the training and validation datasets for fine-tuning. In addition, you can set the deployment configuration, hyperparameters, and security settings for fine-tuning. You can then choose Train to start the training job on a SageMaker ML instance.

Deploy the model

After the model is fine-tuned, you can deploy it using the model page on SageMaker JumpStart. The option to deploy the fine-tuned model will appear when fine-tuning is complete, as shown in the following screenshot.

Fine-tune via the SageMaker Python SDK

You can also fine-tune Mistral models using the SageMaker Python SDK. The complete notebook is available on GitHub. In this section, we provide examples of two types of fine-tuning.

Instruction fine-tuning

Instruction tuning is a technique that involves fine-tuning a language model on a collection of natural language processing (NLP) tasks using instructions. In this technique, the model is trained to perform tasks by following textual instructions instead of specific datasets for each task. The model is fine-tuned with a set of input and output examples for each task, allowing the model to generalize to new tasks that it hasn’t been explicitly trained on as long as prompts are provided for the tasks. Instruction tuning helps improve the accuracy and effectiveness of models and is helpful in situations where large datasets aren’t available for specific tasks.

Let’s walk through the fine-tuning code provided in the example notebook with the SageMaker Python SDK.

We use a subset of the Dolly dataset in an instruction tuning format, and specify the template.json file describing the input and the output formats. The training data must be formatted in JSON lines (.jsonl) format, where each line is a dictionary representing a single data sample. In this case, we name it train.jsonl.

The following snippet is an example of train.jsonl. The keys instruction, context, and response in each sample should have corresponding entries {instruction}, {context}, {response} in the template.json.

{
    "instruction": "What is a dispersive prism?", 
    "context": "In optics, a dispersive prism is an optical prism that is used to disperse light, that is, to separate light into its spectral components (the colors of the rainbow). Different wavelengths (colors) of light will be deflected by the prism at different angles. This is a result of the prism material's index of refraction varying with wavelength (dispersion). Generally, longer wavelengths (red) undergo a smaller deviation than shorter wavelengths (blue). The dispersion of white light into colors by a prism led Sir Isaac Newton to conclude that white light consisted of a mixture of different colors.", 
    "response": "A dispersive prism is an optical prism that disperses the light's different wavelengths at different angles. When white light is shined through a dispersive prism it will separate into the different colors of the rainbow."
}

The following is a sample of template.json:

{
    "prompt": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n",
    "completion": " {response}"
}
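The relationship between the two files can be checked locally before uploading anything. The following sketch writes a train.jsonl file, verifies that every sample has the keys the template expects, and previews the rendered prompt and completion. The helper names and the shortened sample are hypothetical, and rendering with str.format is an assumption about how the placeholders are filled, not a description of the JumpStart training script.

```python
import json
import string
import tempfile
from pathlib import Path

# Template in the shape described above (prompt and completion with placeholders).
template = {
    "prompt": (
        "Below is an instruction that describes a task, paired with an input "
        "that provides further context. Write a response that appropriately "
        "completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n"
    ),
    "completion": " {response}",
}

# Hypothetical, shortened sample; real samples come from your dataset.
samples = [
    {
        "instruction": "What is a dispersive prism?",
        "context": "In optics, a dispersive prism is used to disperse light.",
        "response": "A dispersive prism separates light into its spectral components.",
    }
]


def placeholders(tmpl):
    """Collect the {name} fields used anywhere in the template."""
    parser = string.Formatter()
    return {
        field
        for text in tmpl.values()
        for _, field, _, _ in parser.parse(text)
        if field
    }


def write_jsonl(rows, path, tmpl):
    """Write one JSON object per line, rejecting rows that lack a template field."""
    required = placeholders(tmpl)
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            missing = required - set(row)
            if missing:
                raise ValueError(f"sample is missing keys: {sorted(missing)}")
            f.write(json.dumps(row) + "\n")


def render(tmpl, row):
    """Fill the template with one sample to preview the prompt/completion pair."""
    return tmpl["prompt"].format(**row), tmpl["completion"].format(**row)


out_dir = Path(tempfile.mkdtemp())
write_jsonl(samples, out_dir / "train.jsonl", template)
(out_dir / "template.json").write_text(json.dumps(template, indent=4), encoding="utf-8")

prompt, completion = render(template, samples[0])
print(prompt + completion)
```

Catching a missing key at this stage is cheaper than discovering it after a training job has started.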

After you upload the prompt template and the training data to an S3 bucket, you can set the hyperparameters.

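from sagemaker import hyperparameters

# Assumption: these are the JumpStart model ID and version for Mistral 7B;
# confirm them against the example notebook before running
model_id, model_version = "huggingface-llm-mistral-7b", "*"
my_hyperparameters = hyperparameters.retrieve_default(model_id=model_id, model_version=model_version)
my_hyperparameters["epoch"] = "1"
my_hyperparameters["per_device_train_batch_size"] = "2"
my_hyperparameters["gradient_accumulation_steps"] = "2"
my_hyperparameters["instruction_tuned"] = "True"
print(my_hyperparameters)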
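The batch-related hyperparameters interact: the number of samples that contribute to each optimizer step is the per-device batch size multiplied by the gradient accumulation steps and the number of devices. The following worked example is a sketch under the assumption that all four GPUs of an ml.g5.12xlarge instance act as data-parallel workers; the helper and dictionary names are hypothetical.

```python
# Values mirror the settings above; JumpStart hyperparameters are passed as strings.
example_hyperparameters = {
    "epoch": "1",
    "per_device_train_batch_size": "2",
    "gradient_accumulation_steps": "2",
    "instruction_tuned": "True",
}


def effective_batch_size(hp, num_devices):
    """Samples contributing to one optimizer step under plain data parallelism."""
    return (
        int(hp["per_device_train_batch_size"])
        * int(hp["gradient_accumulation_steps"])
        * num_devices
    )


# Assumption: 4 GPUs, all used as data-parallel workers.
print(effective_batch_size(example_hyperparameters, num_devices=4))  # 2 * 2 * 4 = 16
```

If GPU memory is the constraint, lowering the per-device batch size while raising the accumulation steps keeps this product unchanged.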

You can then start the fine-tuning process and deploy the model to an inference endpoint. In the following code, we use an ml.g5.12xlarge instance:

from sagemaker.jumpstart.estimator import JumpStartEstimator

instruction_tuned_estimator = JumpStartEstimator(
    model_id=model_id,
    hyperparameters=my_hyperparameters,
    instance_type="ml.g5.12xlarge",
)
instruction_tuned_estimator.fit({"train": train_data_location}, logs=True)

instruction_tuned_predictor = instruction_tuned_estimator.deploy()

Domain adaptation fine-tuning

Domain adaptation fine-tuning is a process that refines a pre-trained LLM to better suit a specific domain or task. By using a smaller, domain-specific dataset, the LLM can be fine-tuned to understand and generate content that is more accurate, relevant, and insightful for that specific domain, while still retaining the vast knowledge it gained during its original training.

The Mistral model can be fine-tuned on any domain-specific dataset. After it’s fine-tuned, it’s expected to generate domain-specific text and solve various NLP tasks in that specific domain. For the training dataset, provide a train directory and an optional validation directory, each containing a single CSV, JSON, or TXT file. For CSV and JSON formats, the data is read from the column named text, or from the first column if no text column is present. Ensure only one file exists under each directory. For instance, input data may be SEC filings of Amazon as a text file:

This report includes estimates, projections, statements relating to our
business plans, objectives, and expected operating results that are “forward-
looking statements” within the meaning of the Private Securities Litigation
Reform Act of 1995, Section 27A of the Securities Act of 1933, and Section 21E
of the Securities Exchange Act of 1934. Forward-looking statements may appear
throughout this report, including the following sections: “Business” (Part I,
Item 1 of this Form 10-K), “Risk Factors” (Part I, Item 1A of this Form 10-K),
and “Management’s Discussion and Analysis of Financial Condition and Results
of Operations” (Part II, Item 7 of this Form 10-K). These forward-looking
statements generally are identified by the words “believe,” “project,”
“expect,” “anticipate,” “estimate,” “intend,” “strategy,” “future,”
“opportunity,” “plan,” “may,” “should,” “will,” “would,” “will be,” “will
continue,” “will likely result,” and similar expressions.
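The directory rules described above can be verified locally before uploading the data to Amazon S3. The following is a minimal sketch that mirrors those rules (exactly one CSV, JSON, or TXT file per directory, and the text column or the first column for CSV). It is not the JumpStart training script, and the helper names and sample file are hypothetical.

```python
import csv
import tempfile
from pathlib import Path

ALLOWED_SUFFIXES = {".csv", ".json", ".txt"}


def find_data_file(directory):
    """Return the single data file in a channel directory, or raise ValueError."""
    files = [p for p in Path(directory).iterdir() if p.is_file()]
    if len(files) != 1:
        raise ValueError(f"expected exactly one file in {directory}, found {len(files)}")
    if files[0].suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported file type: {files[0].name}")
    return files[0]


def read_csv_texts(path):
    """Read the 'text' column of a CSV file, or the first column if 'text' is absent."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        columns = reader.fieldnames or []
        if not columns:
            raise ValueError("CSV file has no header row")
        column = "text" if "text" in columns else columns[0]
        return [row[column] for row in reader]


# Hypothetical local layout: train/ holds exactly one file.
train_dir = Path(tempfile.mkdtemp()) / "train"
train_dir.mkdir()
(train_dir / "filings.csv").write_text(
    "year,text\n2021,First passage.\n2022,Second passage.\n", encoding="utf-8"
)

data_file = find_data_file(train_dir)
print(data_file.name, read_csv_texts(data_file))
```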

You can start domain adaptation fine-tuning by setting the hyperparameter "instruction_tuned" to "False". The rest of the steps are similar to the instruction fine-tuning steps.

We set max_input_length to 2048 on ml.g5.12xlarge. You can use a higher input length on a larger instance type. For details, see the example notebook.

Inference example with and without fine-tuning

The following table contains the results of the Mistral 7B model fine-tuned with SEC filing documents of Amazon from 2021–2022. We compare the output before and after fine-tuning.

Parameters: "temperature": 0.2, "max_new_tokens": 150

Input prompt: This Form 10-K report shows that

Output before fine-tuning:

the company has been in business for 10 years and has a total of 10 employees. The company has a total of 10 employees and is in the business of providing services to the public. The company has a total of 10 employees and is in the business of providing services to the public.

The company has a total of 10 employees and is in the business of providing services to the public. The company has a total of 10 employees and is in the business of providing services to the public. The company has a total of 10 employees and is in the business of providing services to the public. The company has a total of 10 employees and is in the business of providing

Output after fine-tuning:

we had net sales of $1.2 billion, $1.5 billion, and $1.8 billion in 2004, 2003, and 2002. These amounts include sales from our international websites. Net sales from our international websites were $323 million, $242 million, and $150 million in 2004, 2003, and 2002. Net sales from our international websites represented 27%, 16%, and 8% of net sales in 2004, 2003, and 2002.

Input prompt: We serve consumers through

Output before fine-tuning:

our retail network of 1,700 stores, which includes 1,200 stores in the United States, 400 stores in Canada, and 100 stores in Mexico. We also serve consumers through our e-commerce websites.

## What is the biggest home improvement store?

The Home Depot is the largest home improvement retailer in the United States.

## What is the largest home improvement store in the world?

The largest home improvement store in the world is Home Depot.

## What is the biggest home improvement store in the

Output after fine-tuning:

our online and physical stores and focus on selection, price, and convenience. We design our stores to enable hundreds of millions of unique products to be sold by us and by third parties across dozens of product categories. Customers access our offerings through our websites, mobile apps, Alexa, devices, streaming, and physically visiting our stores. We also manufacture and sell electronic devices, including Kindle, Fire tablet, Fire TV, Echo, and Ring, and we develop and produce media content. We seek to offer our customers low prices, fast and free delivery, easy-to-use functionality, and timely customer service. In addition, we offer subscription services such as Amazon Prime, a

Input prompt: Our vision is

Output before fine-tuning:

to be the leading provider of quality, affordable, and accessible health care services in the communities we serve.

Our mission is to provide quality, affordable, and accessible health care services to the communities we serve.

Our values are:

– Integrity: We act with honesty and integrity in all our dealings.
– Respect: We treat everyone with respect and dignity.
– Excellence: We strive for excellence in all we do.
– Compassion: We care for our patients and each other with compassion.
– Teamwork: We work together as a team to achieve our goals.

Our commitment to our patients is to provide quality, affordable, and accessible health care services.

Output after fine-tuning:

to be Earth’s most customer-centric company. We are guided by four principles: customer obsession rather than competitor focus, passion for invention, commitment to operational excellence, and long-term thinking. In each of our segments, we serve our primary customer sets, consisting of consumers, sellers, developers, enterprises, and content creators. In addition, we provide services, such as advertising. We have organized our operations into three segments: North America, International, and AWS. These segments reflect the way the Company evaluates its business performance and manages its operations. Information on our net sales is contained in Item 8 of Part II, “Financial Statements

As you can see, the fine-tuned model provides more specific information related to Amazon compared to the generic pre-trained one. This is because fine-tuning adapts the model to understand the nuances, patterns, and specifics of the provided dataset. By using a pre-trained model and tailoring it with fine-tuning, we ensure that you get the best of both worlds: the broad knowledge of the pre-trained model and the specialized accuracy for your unique dataset. One size may not fit all in the world of machine learning, and fine-tuning is the tailor-made solution you need!

Conclusion

In this post, we discussed fine-tuning the Mistral 7B model using SageMaker JumpStart. We showed how you can use the SageMaker JumpStart console in SageMaker Studio or the SageMaker Python SDK to fine-tune and deploy these models. As a next step, you can try fine-tuning these models on your own dataset using the code provided in the GitHub repository to test and benchmark the results for your use cases.


About the Authors

Xin Huang is a Senior Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A.

Vivek Gangasani is an AI/ML Startup Solutions Architect for Generative AI startups at AWS. He helps emerging GenAI startups build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of Large Language Models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.

Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker built-in algorithms and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.

Harness large language models in fake news detection

Fake news, defined as news that conveys or incorporates false, fabricated, or deliberately misleading information, has been around since the emergence of the printing press. The rapid spread of fake news and disinformation online not only deceives the public, but can also have a profound impact on society, politics, economy, and culture. Examples include:

  • Cultivating distrust in the media
  • Undermining the democratic process
  • Spreading false or discredited science (for example, the anti-vax movement)

Advances in artificial intelligence (AI) and machine learning (ML) have made developing tools for creating and sharing fake news even easier. Early examples include advanced social bots and automated accounts that supercharge the initial stage of spreading fake news. In general, it is not trivial for the public to determine whether such accounts are people or bots. In addition, social bots are not illegal tools, and many companies legally purchase them as part of their marketing strategy. Therefore, it’s not easy to curb the use of social bots systematically.

Recent advances in the field of generative AI make it possible to produce textual content at an unprecedented pace with the help of large language models (LLMs). LLMs are generative AI text models with over 1 billion parameters, and they facilitate the synthesis of high-quality text.

In this post, we explore how you can use LLMs to tackle the prevalent issue of detecting fake news. We suggest that LLMs are sufficiently advanced for this task, especially if improved prompt techniques such as Chain-of-Thought and ReAct are used in conjunction with tools for information retrieval.

We illustrate this by creating a LangChain application that, given a piece of news, informs the user whether the article is true or fake using natural language. The solution also uses Amazon Bedrock, a fully managed service that makes foundation models (FMs) from Amazon and third-party model providers accessible through the AWS Management Console and APIs.

LLMs and fake news

The fake news phenomenon started evolving rapidly with the advent of the internet and, more specifically, social media (Nielsen et al., 2017). On social media, fake news can be shared quickly in a user’s network, leading the public to form the wrong collective opinion. In addition, people often propagate fake news impulsively, ignoring the factuality of the content if the news resonates with their personal norms (Tsipursky et al., 2018). Research in social science has suggested that cognitive bias (confirmation bias, bandwagon effect, and choice-supportive bias) is one of the most pivotal factors in making irrational decisions in terms of both the creation and consumption of fake news (Kim et al., 2021). This also implies that news consumers share and consume information only in the direction of strengthening their beliefs.

The power of generative AI to produce textual and rich content at an unprecedented pace aggravates the fake news problem. An example worth mentioning is deepfake technology—combining various images on an original video and generating a different video. Besides the disinformation intent that human actors bring to the mix, LLMs add a whole new set of challenges:

  • Factual errors – LLM output has an increased risk of containing factual errors due to the nature of LLM training and the models’ ability to be creative while generating the next words in a sentence. LLM training is based on repeatedly presenting a model with incomplete input, then using ML training techniques until it correctly fills in the gaps, thereby learning language structure and a language-based world model. Consequently, although LLMs are great pattern matchers and re-combiners (“stochastic parrots”), they fail at a number of simple tasks that require logical reasoning or mathematical deduction, and can hallucinate answers. In addition, temperature is one of the LLM input parameters that controls the behavior of the model when generating the next word in a sentence. With a higher temperature, the model is more likely to choose lower-probability words, producing a more random response.
  • Lengthy – Generated texts tend to be lengthy and lack a clearly defined granularity for individual facts.
  • Lack of fact-checking – There is no standardized tooling available for fact-checking during the process of text generation.
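The effect of temperature described in the first bullet can be illustrated with a small worked example. The following sketch applies a temperature-scaled softmax to a handful of made-up scores; the logits are hypothetical, and real models apply the same scaling over a vocabulary of many thousands of tokens.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; higher temperature flattens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate next tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature nearly all of the probability mass sits on the highest-scoring token, whereas at a higher temperature the lower-scoring tokens become noticeably more likely to be sampled.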

Overall, the combination of human psychology and limitations of AI systems has created a perfect storm for the proliferation of fake news and misinformation online.

Solution overview

LLMs are demonstrating outstanding capabilities in language generation, understanding, and few-shot learning. They are trained on a vast corpus of text from the internet, where quality and accuracy of extracted natural language may not be assured.

In this post, we provide a solution to detect fake news based on both the Chain-of-Thought and ReAct (Reasoning and Acting) prompt approaches. First, we discuss those two prompt engineering techniques, then we show their implementation using LangChain and Amazon Bedrock.

The following architecture diagram outlines the solution for our fake news detector.

Architecture diagram for fake news detection.

We use a subset of the FEVER dataset containing a statement and the ground truth about the statement indicating false, true, or unverifiable claims (Thorne J. et al., 2018).

The workflow can be broken down into the following steps:

  1. The user selects one of the statements to check whether it is fake or true.
  2. The statement and the fake news detection task are incorporated into the prompt.
  3. The prompt is passed to LangChain, which invokes the FM in Amazon Bedrock.
  4. Amazon Bedrock returns a response to the user request with the statement True or False.

In this post, we use the Claude v2 model from Anthropic (anthropic.claude-v2). Claude is a generative LLM based on Anthropic’s research into creating reliable, interpretable, and steerable AI systems. Created using techniques like constitutional AI and harmlessness training, Claude excels at thoughtful dialogue, content creation, complex reasoning, creativity, and coding. However, by using Amazon Bedrock and our solution architecture, we also have the flexibility to choose among other FMs provided by Amazon, AI21 Labs, Cohere, and Stability AI.

You can find the implementation details in the following sections. The source code is available in the GitHub repository.

Prerequisites

For this tutorial, you need an AWS account and a bash terminal with Python 3.9 or higher installed on Linux, macOS, or Windows Subsystem for Linux.

We also recommend using either an Amazon SageMaker Studio notebook, an AWS Cloud9 instance, or an Amazon Elastic Compute Cloud (Amazon EC2) instance.

Deploy fake news detection using the Amazon Bedrock API

The solution uses the Amazon Bedrock API, which can be accessed using the AWS Command Line Interface (AWS CLI), the AWS SDK for Python (Boto3), or an Amazon SageMaker notebook. Refer to the Amazon Bedrock User Guide for more information. For this post, we use the Amazon Bedrock API via the AWS SDK for Python.

Set up Amazon Bedrock API environment

To set up your Amazon Bedrock API environment, complete the following steps:

  1. Download the latest Boto3 or upgrade it:
    pip install --upgrade boto3

  2. Make sure you configure the AWS credentials using the aws configure command or pass them to the Boto3 client.
  3. Install the latest version of LangChain:
    pip install "langchain>=0.0.317" --quiet

You can now test your setup using the following Python script. The script instantiates the Amazon Bedrock client using Boto3. Next, we call the list_foundation_models API to get the list of foundation models available for use.

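import json
import boto3

YOUR_REGION = "us-east-1"  # placeholder: replace with an AWS Region where Amazon Bedrock is available
bedrock = boto3.client("bedrock", region_name=YOUR_REGION)
print(json.dumps(bedrock.list_foundation_models(), indent=4))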

After successfully running the preceding command, you should get the list of FMs from Amazon Bedrock.

LangChain as a prompt chaining solution

To detect fake news for a given sentence, we follow the zero-shot Chain-of-Thought reasoning process (Wei J. et al., 2022), which is composed of the following steps:

  1. Initially, the model attempts to create a statement about the news prompted.
  2. The model creates a bullet point list of assertions.
  3. For each assertion, the model determines if the assertion is true or false. Note that using this methodology, the model relies exclusively on its internal knowledge (weights computed in the pre-training phase) to reach a verdict. The information is not verified against any external data at this point.
  4. Given the facts, the model answers TRUE or FALSE for the given statement in the prompt.

To achieve these steps, we use LangChain, a framework for developing applications powered by language models. This framework allows us to augment the FMs by chaining together various components to create advanced use cases. In this solution, we use the built-in SimpleSequentialChain in LangChain to create a simple sequential chain. This is very useful, because we can take the output from one chain and use it as the input to another.
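The chaining idea itself is independent of LangChain. The following plain-Python sketch passes the output of each step to the next one, which is the behavior described above. The three step functions are stubs that stand in for LLM calls and do no real fact-checking; all names are hypothetical, and this is not how LangChain implements SimpleSequentialChain internally.

```python
def run_sequential(steps, first_input):
    """Feed the output of each step into the next one and return the final output."""
    value = first_input
    for step in steps:
        value = step(value)
    return value


# Stub steps standing in for the three LLM calls; they only reshape text.
def list_assumptions(statement):
    return f"- {statement}"


def check_assertions(assertions):
    # A real step would ask the model to judge each assertion.
    return "\n".join(f"{line}: TRUE" for line in assertions.splitlines())


def final_answer(facts):
    any_false = any(line.endswith(": FALSE") for line in facts.splitlines())
    return "FALSE" if any_false else "TRUE"


result = run_sequential(
    [list_assumptions, check_assertions, final_answer],
    "Water boils at 100 degrees Celsius at sea level.",
)
print(result)
```

Because each step only sees the previous step's output, anything the final step needs, such as the original statement, has to be carried along or embedded in its prompt.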

Amazon Bedrock is integrated with LangChain, so you only need to pass the model_id when instantiating the Amazon Bedrock object. If needed, the model inference parameters can be provided through the model_kwargs argument. Parameter names are specific to each model provider; examples include:

  • maxTokenCount – The maximum number of tokens in the generated response
  • stopSequences – The stop sequence used by the model
  • temperature – A value that ranges between 0–1, with 0 being the most deterministic and 1 being the most creative
  • topP – A value that ranges between 0–1, and is used to control tokens’ choices based on the probability of the potential choices

If this is the first time you are using an Amazon Bedrock foundation model, make sure you request access to the model by selecting it from the list of models on the Model access page on the Amazon Bedrock console, which in our case is claude-v2 from Anthropic.

import boto3
from langchain.llms.bedrock import Bedrock

# The bedrock-runtime client is used for model invocation;
# YOUR_REGION is the AWS Region placeholder used earlier
bedrock_runtime = boto3.client(
    service_name="bedrock-runtime",
    region_name=YOUR_REGION,
)
model_kwargs = {"max_tokens_to_sample": 8192}
llm = Bedrock(model_id="anthropic.claude-v2", client=bedrock_runtime, model_kwargs=model_kwargs)

The following function defines the Chain-of-Thought prompt chain we mentioned earlier for detecting fake news. The function takes the Amazon Bedrock object (llm) and the user prompt (q) as arguments. LangChain’s PromptTemplate functionality is used here to predefine a recipe for generating prompts.

from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.chains import SimpleSequentialChain

def generate_and_print(llm, q):
    total_prompt = """"""

    # Step 1: the model is asked to list the assumptions behind the statement
    template = """Here is a statement:
    {statement}
    Make a bullet point list of the assumptions you made when given the above statement.\n\n"""
    prompt_template = PromptTemplate(input_variables=["statement"], template=template)
    assumptions_chain = LLMChain(llm=llm, prompt=prompt_template)
    total_prompt = total_prompt + template

    # Step 2: for each assertion, the model is asked to determine whether it is
    # true or false, based on internal knowledge alone
    template = """Here is a bullet point list of assertions:
    {assertions}
    For each assertion, determine whether it is true or false. If it is false, explain why.\n\n"""
    prompt_template = PromptTemplate(input_variables=["assertions"], template=template)
    fact_checker_chain = LLMChain(llm=llm, prompt=prompt_template)
    total_prompt = total_prompt + template

    # Step 3: given the facts, the model answers TRUE or FALSE for the original statement
    template = """ Based on the above assertions, the final response is FALSE if one of the assertions is FALSE. Otherwise, the final response is TRUE. You should only respond with TRUE or FALSE.'{}'""".format(q)
    template = """{facts}\n""" + template
    prompt_template = PromptTemplate(input_variables=["facts"], template=template)
    answer_chain = LLMChain(llm=llm, prompt=prompt_template)
    total_prompt = total_prompt + template

    # SimpleSequentialChain allows us to take the output from one chain and use it as the input to another
    overall_chain = SimpleSequentialChain(chains=[assumptions_chain, fact_checker_chain, answer_chain], verbose=True)
    answer = overall_chain.run(q)

    return answer

The following code calls the function we defined earlier and displays the answer, which is either TRUE or FALSE. TRUE means that the statement provided contains correct facts, and FALSE means that the statement contains at least one incorrect fact.

from IPython.display import display, Markdown

q="The first woman to receive a Ph.D. in computer science was Dr. Barbara Liskov, who earned her degree from Stanford University in 1968."
print(f'The statement is: {q}')
display(Markdown(generate_and_print(llm, q)))

An example of a statement and model response is provided in the following output:

The statement is: The first woman to receive a Ph.D. in computer science was Dr. Barbara Liskov, who earned her degree from Stanford University in 1968.

> Entering new SimpleSequentialChain chain...
 Here is a bullet point list of assumptions I made about the statement:

- Dr. Barbara Liskov was the first woman to earn a Ph.D. in computer science. 

- Dr. Liskov earned her Ph.D. from Stanford University.

- She earned her Ph.D. in 1968.

- No other woman earned a Ph.D. in computer science prior to 1968.

- Stanford University had a computer science Ph.D. program in 1968. 

- The statement refers to Ph.D. degrees earned in the United States.
 Here are my assessments of each assertion:

- Dr. Barbara Liskov was the first woman to earn a Ph.D. in computer science.
  - True. Dr. Liskov was the first American woman to earn a Ph.D. in computer science, which she received from Stanford University in 1968.

- Dr. Liskov earned her Ph.D. from Stanford University.
  - True. Multiple sources confirm she received her Ph.D. from Stanford in 1968.

- She earned her Ph.D. in 1968.
  - True. This is consistent across sources.

- No other woman earned a Ph.D. in computer science prior to 1968.
  - False. While she was the first American woman, Mary Kenneth Keller earned a Ph.D. in computer science from the University of Wisconsin in 1965. However, Keller earned her degree in the US as well.

- Stanford University had a computer science Ph.D. program in 1968.
  - True. Stanford established its computer science department and Ph.D. program in 1965.

- The statement refers to Ph.D. degrees earned in the United States.
  - False. The original statement does not specify the country. My assumptions that it refers to the United States is incorrect. Keller earned her Ph.D. in the US before Liskov.
 False
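The final line of the chain output is free text (here, " False"), so downstream code typically needs to normalize it. The following is a minimal sketch of such a helper; the function name and the choice to return None for ambiguous text are assumptions, not part of the original solution.

```python
import re


def parse_verdict(text):
    """Map the chain's final text to True or False, or None when it is ambiguous."""
    found = {
        match.lower()
        for match in re.findall(r"\b(true|false)\b", text, flags=re.IGNORECASE)
    }
    if found == {"true"}:
        return True
    if found == {"false"}:
        return False
    return None  # neither word, or both words, appeared


print(parse_verdict(" False"))
```

Returning None instead of guessing makes it possible to flag responses where the model did not follow the instruction to answer only TRUE or FALSE.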

ReAct and tools

In the preceding example, the model correctly identified that the statement is false. However, submitting the query again demonstrates the model’s inability to reliably distinguish the correctness of facts. The model doesn’t have the tools to verify the truthfulness of statements beyond its own training memory, so subsequent runs of the same prompt can lead it to mislabel fake statements as true. The following output shows a different run of the same example:

The statement is: The first woman to receive a Ph.D. in computer science was Dr. Barbara Liskov, who earned her degree from Stanford University in 1968.

> Entering new SimpleSequentialChain chain...
 Here is a bullet point list of assumptions I made about the statement:

- Dr. Barbara Liskov was the first woman to earn a Ph.D. in computer science
- Dr. Liskov earned her Ph.D. degree in 1968 
- Dr. Liskov earned her Ph.D. from Stanford University
- Stanford University awarded Ph.D. degrees in computer science in 1968
- Dr. Liskov was a woman
- Ph.D. degrees existed in 1968
- Computer science existed as a field of study in 1968
 Here are my assessments of each assertion:

- Dr. Barbara Liskov was the first woman to earn a Ph.D. in computer science
    - True. Dr. Liskov was the first woman to earn a Ph.D. in computer science in 1968 from Stanford University.

- Dr. Liskov earned her Ph.D. degree in 1968
    - True. Multiple sources confirm she received her Ph.D. in computer science from Stanford in 1968.

- Dr. Liskov earned her Ph.D. from Stanford University 
    - True. Dr. Liskov earned her Ph.D. in computer science from Stanford University in 1968.

- Stanford University awarded Ph.D. degrees in computer science in 1968
    - True. Stanford awarded Liskov a Ph.D. in computer science in 1968, so they offered the degree at that time.

- Dr. Liskov was a woman
    - True. All biographical information indicates Dr. Liskov is female.

- Ph.D. degrees existed in 1968
    - True. Ph.D. degrees have existed since the late 19th century.

- Computer science existed as a field of study in 1968
    - True. While computer science was a relatively new field in the 1960s, Stanford and other universities offered it as a field of study and research by 1968.
 True

One technique for improving truthfulness is ReAct. ReAct (Yao S. et al., 2023) is a prompting technique that augments the foundation model with an agent’s action space. In this post, as in the ReAct paper, the action space implements information retrieval using search, lookup, and finish actions from a simple Wikipedia web API.

We use ReAct rather than Chain-of-Thought alone because external knowledge retrieval gives the foundation model evidence for deciding whether a given piece of news is fake or true.
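To make the Thought/Action/Observation loop concrete, the following is a minimal sketch of the control flow only. It is not LangChain's implementation: run_react, scripted_llm, and lookup are hypothetical names, the model is replaced by a scripted function, and the tool returns a canned string so the example runs without a model or network access.

```python
import re

def run_react(llm, tools, question, max_steps=5):
    """Alternate model calls and tool calls until the model emits a final answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = llm(transcript)  # model proposes a thought plus an action or an answer
        transcript += reply + "\n"
        final = re.search(r"Final Answer:\s*(.+)", reply)
        if final:
            return final.group(1).strip()
        action = re.search(r"Action:\s*(.+)", reply)
        action_input = re.search(r"Action Input:\s*(.+)", reply)
        if action and action_input:
            tool = tools.get(action.group(1).strip())
            observation = tool(action_input.group(1).strip()) if tool else "Unknown tool"
            # Feed the tool result back so the next model call can reason over it
            transcript += f"Observation: {observation}\n"
    return None  # no final answer within the step budget


# Hypothetical stand-ins for the model and the search tool.
def scripted_llm(transcript):
    if "Observation:" not in transcript:
        return ("Thought: I should look this up.\n"
                "Action: lookup\n"
                "Action Input: first woman PhD in computer science")
    return ("Thought: The observation contradicts the statement.\n"
            "Final Answer: FALSE")

def lookup(query):
    return "Canned search result that names an earlier degree holder."

answer = run_react(scripted_llm, {"lookup": lookup},
                   "Is the statement about Dr. Liskov correct?")
print(answer)
```

The point of the sketch is that each observation is appended to the transcript before the next model call, which is what lets retrieved text, rather than training memory alone, drive the final label.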

In this post, we use LangChain’s implementation of ReAct through the agent ZERO_SHOT_REACT_DESCRIPTION. We modify the previous function to implement ReAct and use Wikipedia by using the load_tools function from langchain.agents.

We also need to install the wikipedia package:

!pip install wikipedia

Below is the new code:

from langchain.agents import load_tools, initialize_agent, AgentType

def generate_and_print(llm, q):
    print(f'Inside generate_and_print: q = {q}')
    tools = load_tools(["wikipedia"], llm=llm)
    agent = initialize_agent(tools, llm, 
                             agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, 
                             verbose=True,
                             handle_parsing_errors=True,
                             agent_kwargs={})

    # Prompt template; named `prompt` to avoid shadowing the built-in input()
    prompt = """Here is a statement:
    {statement}
    Is this statement correct? You can use tools to find information if needed.
    The final response is FALSE if the statement is FALSE. Otherwise, TRUE."""

    answer = agent.run(prompt.format(statement=q))
    
    return answer

The following is the output of the preceding function given the same statement used before:

> Entering new AgentExecutor chain...
 Here are my thoughts and actions to determine if the statement is true or false:

Thought: To verify if this statement about the first woman to receive a PhD in computer science is true, I should consult a reliable information source like Wikipedia.

Action: Wikipedia
Action Input: first woman to receive phd in computer science
Observation: Page: Fu Foundation School of Engineering and Applied Science
Summary: The Fu Foundation School of Engineering and Applied Science (popularly known as SEAS or Columbia Engineering; previously known as Columbia School of Mines) is the engineering and applied science school of Columbia University. It was founded as the School of Mines in 1863 and then the School of Mines, Engineering and Chemistry before becoming the School of Engineering and Applied Science. On October 1, 1997, the school was renamed in honor of Chinese businessman Z.Y. Fu, who had donated $26 million to the school.
The Fu Foundation School of Engineering and Applied Science maintains a close research tie with other institutions including NASA, IBM, MIT, and The Earth Institute. Patents owned by the school generate over $100 million annually for the university. SEAS faculty and alumni are responsible for technological achievements including the developments of FM radio and the maser.
The School's applied mathematics, biomedical engineering, computer science and the financial engineering program in operations research are very famous and highly ranked. The current SEAS faculty include 27 members of the National Academy of Engineering and one Nobel laureate. In all, the faculty and alumni of Columbia Engineering have won 10 Nobel Prizes in physics, chemistry, medicine, and economics.
The school consists of approximately 300 undergraduates in each graduating class and maintains close links with its undergraduate liberal arts sister school Columbia College which shares housing with SEAS students. The School's current dean is Shih-Fu Chang, who was appointed in 2022.

Page: Doctor of Science
Summary: A Doctor of Science (Latin: Scientiae Doctor; most commonly abbreviated DSc or ScD) is an academic research doctorate awarded in a number of countries throughout the world. In some countries, a Doctor of Science is the degree used for the standard doctorate in the sciences; elsewhere a Doctor of Science is a "higher doctorate" awarded in recognition of a substantial and sustained contribution to scientific knowledge beyond that required for a Doctor of Philosophy (PhD).

Page: Timeline of women in science
Summary: This is a timeline of women in science, spanning from ancient history up to the 21st century. While the timeline primarily focuses on women involved with natural sciences such as astronomy, biology, chemistry and physics, it also includes women from the social sciences (e.g. sociology, psychology) and the formal sciences (e.g. mathematics, computer science), as well as notable science educators and medical scientists. The chronological events listed in the timeline relate to both scientific achievements and gender equality within the sciences.
Thought: Based on the Wikipedia pages, the statement appears to be false. The Wikipedia Timeline of Women in Science page indicates that Adele Goldstine was the first woman to earn a PhD in computer science in 1964 from the University of Michigan, not Barbara Liskov from Stanford in 1968. Therefore, my final answer is:

Final Answer: FALSE

Clean up

To save costs, delete all the resources you deployed as part of the tutorial. If you launched AWS Cloud9 or an EC2 instance, you can delete it via the console or using the AWS CLI. Similarly, you can delete the SageMaker notebook you may have created via the SageMaker console.

Limitations and related work

The field of fake news detection is actively researched in the scientific community. In this post, we used the Chain-of-Thought and ReAct techniques, and in evaluating them we focused only on classification accuracy (whether a given statement is labeled true or false). We therefore haven’t considered other important aspects such as response speed, nor extended the solution to knowledge base sources other than Wikipedia.

Although this post focused on two techniques, Chain-of-Thought and ReAct, an extensive body of work has explored how LLMs can detect, eliminate, or mitigate fake news. Lee et al. have proposed an encoder-decoder model that uses NER (named entity recognition) to mask named entities, in order to ensure that the masked token actually uses the knowledge encoded in the language model. Chern et al. developed FacTool, which uses Chain-of-Thought principles to extract claims from the prompt and then collect relevant evidence for each claim. The LLM then judges the factuality of the claim given the retrieved evidence. Du E. et al. present a complementary approach where multiple LLMs propose and debate their individual responses and reasoning processes over multiple rounds in order to arrive at a common final answer.

Based on the literature, we see that the effectiveness of LLMs in detecting fake news increases when the LLMs are augmented with external knowledge and multi-agent conversation capability. However, these approaches are more computationally complex because they require multiple model calls and interactions, longer prompts, and lengthy network layer calls. Ultimately, this complexity translates into an increased overall cost. We recommend assessing the cost-to-performance ratio before deploying similar solutions in production.
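As a rough way to reason about that cost-to-performance trade-off, the following sketch multiplies the number of model calls by a per-call token cost. The function name estimate_cost is hypothetical, and every number in the example (call counts, token counts, and prices) is an illustrative placeholder rather than a measured or published value.

```python
def estimate_cost(calls, prompt_tokens, completion_tokens,
                  price_in_per_1k, price_out_per_1k):
    """Rough cost of verifying one statement, in the same currency as the prices."""
    per_call = ((prompt_tokens / 1000) * price_in_per_1k
                + (completion_tokens / 1000) * price_out_per_1k)
    return calls * per_call

# All numbers below are illustrative placeholders, not measurements.
cot = estimate_cost(calls=3, prompt_tokens=400, completion_tokens=300,
                    price_in_per_1k=0.01, price_out_per_1k=0.03)
react = estimate_cost(calls=4, prompt_tokens=2000, completion_tokens=200,
                      price_in_per_1k=0.01, price_out_per_1k=0.03)
print(f"Chain-of-Thought: {cot:.4f}  ReAct: {react:.4f}  ratio: {react / cot:.1f}x")
```

Substituting your own measured token counts and your provider's current prices gives a per-statement cost you can weigh against the accuracy gain from retrieval.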

Conclusion

In this post, we delved into how to use LLMs to tackle the prevalent issue of fake news, which is one of the major challenges of our society nowadays. We started by outlining the challenges presented by fake news, with an emphasis on its potential to sway public sentiment and cause societal disruptions.

We then introduced the concept of LLMs as advanced AI models that are trained on a substantial quantity of data. Due to this extensive training, these models boast an impressive understanding of language, enabling them to produce human-like text. With this capacity, we demonstrated how LLMs can be harnessed in the battle against fake news by using two different prompt techniques, Chain-of-Thought and ReAct.

We underlined how LLMs can facilitate fact-checking services on an unparalleled scale, given their capability to process and analyze vast amounts of text swiftly. This potential for real-time analysis can lead to early detection and containment of fake news. We illustrated this by creating a Python script that, given a statement, tells the user in natural language whether the statement is true or fake.

We concluded by underlining the limitations of the current approach and ended on a hopeful note, stressing that, with the correct safeguards and continuous enhancements, LLMs could become indispensable tools in the fight against fake news.

We’d love to hear from you. Let us know what you think in the comments section, or use the issues forum in the GitHub repository.

Disclaimer: The code provided in this post is meant for educational and experimentation purposes only. It should not be relied upon to detect fake news or misinformation in real-world production systems. No guarantees are made about the accuracy or completeness of fake news detection using this code. Users should exercise caution and perform due diligence before utilizing these techniques in sensitive applications.

To get started with Amazon Bedrock, visit the Amazon Bedrock console.


About the authors

Anamaria Todor is a Principal Solutions Architect based in Copenhagen, Denmark. She saw her first computer when she was 4 years old and never let go of computer science, video games, and engineering since. She has worked in various technical roles, from freelancer, full-stack developer, to data engineer, technical lead, and CTO, at various companies in Denmark, focusing on the gaming and advertising industries. She has been at AWS for over 3 years, working as a Principal Solutions Architect, focusing mainly on life sciences and AI/ML. Anamaria has a bachelor’s in Applied Engineering and Computer Science, a master’s degree in Computer Science, and over 10 years of AWS experience. When she’s not working or playing video games, she’s coaching girls and female professionals in understanding and finding their path through technology.

Marcel Castro is a Senior Solutions Architect based in Oslo, Norway. In his role, Marcel helps customers with architecture, design, and development of cloud-optimized infrastructure. He is a member of the AWS Generative AI Ambassador team with the goal to drive and support EMEA customers on their generative AI journey. He holds a PhD in Computer Science from Sweden and a master’s and bachelor’s degree in Electrical Engineering and Telecommunications from Brazil.

Model management for LoRA fine-tuned models using Llama2 and Amazon SageMaker

In the era of big data and AI, companies are continually seeking ways to use these technologies to gain a competitive edge. One of the hottest areas in AI right now is generative AI, and for good reason. Generative AI offers powerful solutions that push the boundaries of what’s possible in terms of creativity and innovation. At the core of these cutting-edge solutions lies a foundation model (FM), a highly advanced machine learning model that is pre-trained on vast amounts of data. Many of these foundation models have shown remarkable capability in understanding and generating human-like text, making them a valuable tool for a variety of applications, from content creation to customer support automation.

However, these models are not without their challenges. They are exceptionally large and require large amounts of data and computational resources to train. Additionally, optimizing the training process and calibrating the parameters can be a complex and iterative process, requiring expertise and careful experimentation. These can be barriers for many organizations looking to build their own foundation models. To overcome this challenge, many customers are considering fine-tuning existing foundation models. This is a popular technique to adjust a small portion of model parameters for specific applications while still preserving the knowledge already encoded in the model. It allows organizations to use the power of these models while reducing the resources required to customize them for a specific domain or task.

There are two primary approaches to fine-tuning foundation models: traditional fine-tuning and parameter-efficient fine-tuning. Traditional fine-tuning involves updating all the parameters of the pre-trained model for a specific downstream task. Parameter-efficient fine-tuning, on the other hand, includes a variety of techniques that customize a model without updating all the original model parameters. One such technique is Low-Rank Adaptation (LoRA). It involves adding small, task-specific modules to the pre-trained model and training them while keeping the rest of the parameters fixed, as shown in the following image.

Source: Generative AI on AWS (O’Reilly, 2023)
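To give a sense of why these added modules are small, the sketch below counts trainable parameters for a single weight matrix. It assumes the standard LoRA formulation, in which a frozen d x k matrix W is adapted as W + BA with B of shape d x r and A of shape r x k. The function name and the dimensions (4096 x 4096, rank 8) are illustrative choices, not values taken from this post's notebooks.

```python
def lora_param_counts(d, k, r):
    """Return (full, lora) trainable-parameter counts for one d x k weight matrix."""
    full = d * k            # updating W directly
    lora = d * r + r * k    # training B (d x r) and A (r x k) with W frozen
    return full, lora

full, lora = lora_param_counts(d=4096, k=4096, r=8)
print(f"full: {full:,}  LoRA: {lora:,}  fraction: {lora / full:.4%}")
```

Because the adapter size scales with the rank r rather than with d x k, the trainable fraction stays small even for large layers.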

LoRA has gained popularity recently for several reasons. It offers faster training, reduced memory requirements, and the ability to reuse pre-trained models for multiple downstream tasks. More importantly, the base model and adapter can be stored separately and combined at any time, making it easier to store, distribute, and share fine-tuned versions. However, this introduces a new challenge: how to properly manage these new types of fine-tuned models. Should you combine the base model and adapter or keep them separate? In this post, we walk through best practices for managing LoRA fine-tuned models on Amazon SageMaker to address this emerging question.

Working with FMs on SageMaker Model Registry

In this post, we walk through an end-to-end example of fine-tuning the Llama2 large language model (LLM) using the QLoRA method. QLoRA combines the benefits of parameter-efficient fine-tuning with 4-bit/8-bit quantization to further reduce the resources required to fine-tune an FM for a specific task or use case. For this, we use the pre-trained 7-billion-parameter Llama2 model and fine-tune it on the databricks-dolly-15k dataset. LLMs like Llama2 have billions of parameters and are pre-trained on massive text datasets. Fine-tuning adapts an LLM to a downstream task using a smaller dataset. However, fine-tuning large models is computationally expensive, which is why we use the QLoRA method to quantize the weights during fine-tuning and reduce this computation cost.
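As a back-of-the-envelope illustration of what quantization saves, the following sketch estimates the memory needed for the model weights alone at different precisions. It assumes a nominal 7 billion parameters and ignores quantization constants, activations, optimizer state, and the adapter weights, so treat the results as rough lower bounds. The function name is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for model weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

params = 7e9  # nominal "7B"; the exact Llama2 parameter count differs slightly
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(params, bits):.1f} GB")
```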

In our examples, you will find two notebooks (llm-finetune-combined-with-registry.ipynb and llm-finetune-separate-with-registry.ipynb). Each works through a different way to handle LoRA fine-tuned models as illustrated in the following diagram:

  1. First, we download the pre-trained Llama2 model with 7 billion parameters using SageMaker Studio Notebooks. LLMs, like Llama2, have shown state-of-the-art performance on natural language processing (NLP) tasks when fine-tuned on domain-specific data.
  2. Next, we fine-tune Llama2 on the databricks-dolly-15k dataset using the QLoRA method. QLoRA reduces the computational cost of fine-tuning by quantizing model weights.
  3. During fine-tuning, we integrate SageMaker Experiments Plus with the Transformers API to automatically log metrics like gradient, loss, etc.
  4. We then version the fine-tuned Llama2 model in SageMaker Model Registry using two approaches:
    1. Storing the full model
    2. Storing the adapter and base model separately.
  5. Finally, we host the fine-tuned Llama2 models using Deep Java Library (DJL) Serving on a SageMaker Real-time endpoint.

In the following sections, we will dive deeper into each of these steps, to demonstrate the flexibility of SageMaker for different LLM workflows and how these features can help improve the operations of your models.

Prerequisites

Complete the following prerequisites to start experimenting with the code.

  • Create a SageMaker Studio Domain: Amazon SageMaker Studio, specifically Studio Notebooks, is used to kick off the Llama2 fine-tuning task then register and view models within SageMaker Model Registry. SageMaker Experiments is also used to view and compare Llama2 fine-tuning job logs (training loss/test loss/etc.).
  • Create an Amazon Simple Storage Service (S3) bucket: Access to an S3 bucket to store training artifacts and model weights is required. For instructions, refer to Creating a bucket. The sample code used for this post will use the SageMaker default S3 bucket but you can customize it to use any relevant S3 bucket.
  • Set up Model Collections (IAM permissions): Update your SageMaker Execution Role with permissions to resource-groups as listed under Model Registry Collections Developer Guide to implement Model Registry grouping using Model Collections.
  • Accept the Terms & Conditions for Llama2: You will need to accept the end-user license agreement and acceptable use policy for using the Llama2 foundation model.

The examples are available in the GitHub repository. The notebook files are tested using Studio notebooks running on PyTorch 2.0.0 Python 3.10 GPU Optimized kernel and ml.g4dn.xlarge instance type.

Experiments plus callback integration

Amazon SageMaker Experiments lets you organize, track, compare and evaluate machine learning (ML) experiments and model versions from any integrated development environment (IDE), including local Jupyter Notebooks, using the SageMaker Python SDK or boto3. It provides the flexibility to log your model metrics, parameters, files, artifacts, plot charts from the different metrics, capture various metadata, search through them and support model reproducibility. Data scientists can quickly compare the performance and hyperparameters for model evaluation through visual charts and tables. They can also use SageMaker Experiments to download the created charts and share the model evaluation with their stakeholders.

Training LLMs can be a slow, expensive, and iterative process. Tracking LLM experiments at scale is important for keeping the model tuning experience consistent. The Hugging Face Transformers API lets users track metrics during training tasks through callbacks. Callbacks are “read only” pieces of code that can inspect the state of the training loop in the PyTorch Trainer and act on it, for example for progress reporting or for logging to TensorBoard or, via custom logic (which is included as a part of this codebase), to SageMaker Experiments Plus.

You can import the SageMaker Experiments callback code included in this post’s code repository as shown in the following code block:

# imports a custom implementation of Experiments Callback
from smexperiments_callback import SageMakerExperimentsCallback
...
...
# Create Trainer instance with SageMaker experiments callback
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    data_collator=default_data_collator,
    callbacks=[SageMakerExperimentsCallback] # Add our Experiments Plus Callback function
)

This callback will automatically log the following information into SageMaker Experiments as a part of the training run:

  • Training Parameters and Hyper-Parameters
  • Model Training and Validation loss at Step, Epoch and Final
  • Model Input and Output artifacts (training dataset, validation dataset, model output location, training debugger and more)
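The callback mechanism itself can be sketched as follows. This is not the SageMakerExperimentsCallback from the repository: MetricRecorderCallback is a hypothetical class that only collects logged metrics in memory, and the two logging events at the end are simulated by hand rather than produced by a training job. The sketch subclasses the Transformers TrainerCallback when the library is installed and falls back to a plain class otherwise so that it runs either way.

```python
from types import SimpleNamespace

try:
    from transformers import TrainerCallback  # real base class when available
except Exception:
    TrainerCallback = object  # fallback so the sketch runs without transformers


class MetricRecorderCallback(TrainerCallback):
    """Collects logged metrics in memory; a real callback would forward them
    to a tracking backend instead."""

    def __init__(self):
        self.history = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        # The Trainer calls this hook whenever it logs; the hook only reads state.
        if logs:
            self.history.append({"step": state.global_step, **logs})


# Simulate two logging events without running a training job.
recorder = MetricRecorderCallback()
recorder.on_log(None, SimpleNamespace(global_step=10), None, logs={"loss": 2.1})
recorder.on_log(None, SimpleNamespace(global_step=20), None, logs={"eval_loss": 1.9})
print(recorder.history)
```

In a real run, an instance of such a class would be passed in the Trainer's callbacks list, and the body of on_log is where metrics would be sent to the experiment tracker.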

The following graph shows examples of the charts you can display by using that information.

This allows you to compare multiple runs easily using the Analyze feature of SageMaker Experiments. You can select the experiment runs you want to compare, and they will automatically populate comparison graphs.

Register fine-tuned models to Model Registry Collections

Model Registry Collections is a feature of SageMaker Model Registry that allows you to group registered models that are related to each other and organize them in hierarchies to improve model discoverability at scale. We will use Model Registry Collections to keep track of the base model and fine-tuned variants.

Full Model Copy method

The first method combines the base model and LoRA adapter and saves the full fine-tuned model. The following code illustrates the model merging process and saves the combined model using model.save_pretrained().

if args.merge_weights:
        
    trainer.model.save_pretrained(temp_dir, safe_serialization=False)
    # clear memory
    del model
    del trainer
    torch.cuda.empty_cache()
    
    from peft import AutoPeftModelForCausalLM

    # load PEFT model in fp16
    model = AutoPeftModelForCausalLM.from_pretrained(
        temp_dir,
        low_cpu_mem_usage=True,
        torch_dtype=torch.float16,
    )  
    # Merge LoRA and base model and save
    model = model.merge_and_unload()        
    model.save_pretrained(
        args.sm_model_dir, safe_serialization=True, max_shard_size="2GB"
    )

Combining the LoRA adapter and base model into a single model artifact after fine-tuning has advantages and disadvantages. The combined model is self-contained and can be independently managed and deployed without needing the original base model. The model can be tracked as its own entity with a version name reflecting the base model and fine-tuning data. We can adopt a nomenclature using the base_model_name + fine-tuned dataset_name to organize the model groups. Optionally, model collections could associate the original and fine-tuned models, but this may not be necessary since the combined model is independent.  The following code snippet shows you how to register the fine-tuned model.

# Model Package Group Vars
ft_package_group_name = f"{model_id.replace('/', '--')}-{dataset_name}"
# f-string so that {dataset_name} is interpolated into the description
ft_package_group_desc = f"QLoRA for model Mikael110/llama-2-7b-{dataset_name}-fp16"
...
...
...
model_package_group_input_dict = {
    "ModelPackageGroupName" : ft_package_group_name,
    "ModelPackageGroupDescription" : ft_package_group_desc,
    "Tags": ft_tags
}
create_model_pacakge_group_response = sm_client.create_model_package_group(
**model_package_group_input_dict
)

You can use the training estimator to register the model into Model Registry.

inference_image_uri = sagemaker.image_uris.retrieve(
    "djl-deepspeed", region=region, version="0.23.0"
)
print(f"Image going to be used is ---- > {inference_image_uri}")

model_package = huggingface_estimator.register(
    content_types=["application/json"],
    response_types=["application/json"],
    inference_instances=[
        "ml.p2.16xlarge", 
...
...
...
    ],
    image_uri = inference_image_uri,
    customer_metadata_properties = {"training-image-uri": huggingface_estimator.training_image_uri()},  #Store the training image url
    model_package_group_name=ft_model_pkg_group_name,
    approval_status="Approved"
)

model_package_arn = model_package.model_package_arn
print("Model Package ARN : ", model_package_arn)

From Model Registry, you can retrieve the model package and deploy that model directly.

endpoint_name = f"{name_from_base(model_group_for_base)}-endpoint"

model_package.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    endpoint_name=endpoint_name
)

However, there are drawbacks to this approach. Combining the models leads to storage inefficiency and redundancy because the base model is duplicated in each fine-tuned version. Storage needs therefore grow in proportion to both the model size and the number of fine-tuned models. Taking the Llama2 7B model as an example, the base model is approximately 13 GB and the fine-tuned model is 13.6 GB, so about 96% of the model is duplicated after each fine-tuning job. Additionally, distributing and sharing very large model files becomes more difficult and presents operational challenges, because file transfer and management costs increase with model size and the number of fine-tuning jobs.

Separate adapter and base method

The second method focuses on separation of base weights and adapter weights by saving them as separate model components and loading them sequentially at runtime.

    ..
    ..
    ..
    else:   
        # save finetuned LoRA model and then the tokenizer for inference
        trainer.model.save_pretrained(
            args.sm_model_dir, 
            safe_serialization=True
        )
    tokenizer.save_pretrained(
        args.sm_model_dir
    )

Saving base and adapter weights separately has advantages and disadvantages, similar to the Full Model Copy method. One advantage is that it can save storage space. The base weights, which are the largest component of a fine-tuned model, are only saved once and can be reused with other adapter weights that are tuned for different tasks. For example, the base weights of Llama2-7B are about 13 GB, but each fine-tuning task only needs to store about 0.6 GB of adapter weights, which is roughly a 95% space savings. Another advantage is that base weights can be managed separately from adapter weights using a base-weights-only model registry. This can be useful for SageMaker domains that are running in VPC-only mode without an internet gateway, because the base weights can be accessed without having to go through the internet.
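The storage trade-off between the two methods can be sketched with simple arithmetic, using the approximate sizes quoted in this post (13 GB base, 13.6 GB merged model, 0.6 GB adapter). The function names and the variant counts are illustrative assumptions.

```python
BASE_GB = 13.0      # base weights (approximate size quoted in this post)
MERGED_GB = 13.6    # one merged fine-tuned model
ADAPTER_GB = 0.6    # one LoRA adapter

def merged_storage_gb(n_variants):
    """Every fine-tuned variant carries its own copy of the base weights."""
    return n_variants * MERGED_GB

def separate_storage_gb(n_variants):
    """Base weights are stored once; each variant adds only its adapter."""
    return BASE_GB + n_variants * ADAPTER_GB

for n in (1, 10, 50):
    print(f"{n:>2} variants: merged ~{merged_storage_gb(n):.1f} GB, "
          f"separate ~{separate_storage_gb(n):.1f} GB")
```

With a single variant the two approaches need about the same space. The gap widens linearly as more fine-tuned variants share one set of base weights.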

Create Model Package Group for base weights

### Create Model Package Group
base_package_group_name = model_id.replace('/', '--')
base_package_group_desc = "Source: https://huggingface.co/Mikael110/llama-2-7b-guanaco-fp16"
...
...
...
model_package_group_input_dict = {
    "ModelPackageGroupName" : base_package_group_name,
    "ModelPackageGroupDescription" : base_package_group_desc,
    "Tags": base_tags
}
create_model_pacakge_group_response = sm_client.create_model_package_group(
**model_package_group_input_dict
)

>>>
Created ModelPackageGroup Arn : arn:aws:sagemaker:us-west-2:376678947624:model-package-group/Mikael110--llama-2-7b-guanaco-fp16
...
...
...

### Register Base Model Weights
from sagemaker.huggingface import HuggingFaceModel

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.28',
    pytorch_version='2.0',
    py_version='py310',
    model_data=model_data_uri, # this is an S3 path to your base weights as *.tar.gz
    role=role,
)

_response = huggingface_model.register(
    content_types=["application/json"],
    response_types=["application/json"],
    inference_instances=[
    "ml.p2.16xlarge",
    ...
    ],
    transform_instances=[
    "ml.p2.16xlarge",
    ...
    ],
    model_package_group_name=base_model_pkg_group_name,
    approval_status="Approved"
 )

Create Model Package Group for QLoRA weights

The following code shows how to tag QLoRA weights with the dataset/task type and register fine-tuned delta weights into a separate model registry and track the delta weights separately.

### Create Model Package Group for delta weights
ft_package_group_name = f"{model_id.replace('/', '--')}-finetuned-sql"
ft_package_group_desc = "QLoRA for model Mikael110/llama-2-7b-guanaco-fp16"
ft_tags = [
    {
    "Key": "modelType",
    "Value": "QLoRAModel"
    },
    {
    "Key": "fineTuned",
    "Value": "True"
    },
    {
    "Key": "sourceDataset",
    "Value": f"{dataset_name}"
    }
]
model_package_group_input_dict = {
    "ModelPackageGroupName" : ft_package_group_name,
    "ModelPackageGroupDescription" : ft_package_group_desc,
    "Tags": ft_tags
}
create_model_pacakge_group_response = sm_client.create_model_package_group(
**model_package_group_input_dict
)
print(f'Created ModelPackageGroup Arn : {create_model_pacakge_group_response["ModelPackageGroupArn"]}')
ft_model_pkg_group_name = create_model_pacakge_group_response["ModelPackageGroupArn"]

>>> 
Created ModelPackageGroup Arn : arn:aws:sagemaker:us-east-1:811828458885:model-package-group/mikael110--llama-2-7b-guanaco-fp16-finetuned-sql

...
...
...

### Register Delta Weights QLoRA Model Weights
huggingface_model = HuggingFaceModel(
    transformers_version='4.28',
    pytorch_version='2.0',  
    py_version='py310',
    model_data="s3://sagemaker-us-east-1-811828458885/huggingface-qlora-2308180454/output/model.tar.gz",  # or huggingface_estimator.model_data
    role=role,
)

_response = huggingface_model.register(
    content_types=["application/json"],
    response_types=["application/json"],
    inference_instances=[
    "ml.p2.16xlarge",
    ...
    ],
    transform_instances=[
    "ml.p2.16xlarge",
    ...
    ],
    model_package_group_name=ft_model_pkg_group_name,
    approval_status="Approved"
)

>>>
Model collection creation status: {'added_groups': ['arn:aws:sagemaker:us-east-1:811828458885:model-package-group/mikael110--llama-2-7b-guanaco-fp16-finetuned-sql'], 'failure': []}

The following snippet shows a view from the Model Registry where the models are split into base and fine-tuned weights.

Managing models, datasets, and tasks for hyper-personalized LLMs can quickly become overwhelming. SageMaker Model Registry Collections can help you group related models together and organize them in a hierarchy to improve model discoverability. This makes it easier to track the relationships between base weights, adapter weights, and fine-tuning task datasets. You can also create complex relationships and linkages between models.

Create a new Collection and add your base model weights to this Collection

# create model collection
base_collection = model_collector.create(
    collection_name=model_group_for_base # ex: "Website_Customer_QnA_Bot_Model"
)

# Add the base weights at first level of model collections as all future models 
# are going to be tuned from the base weights
_response = model_collector.add_model_groups(
    collection_name=base_collection["Arn"],
    model_groups=[base_model_pkg_group_name]
)
print(f"Model collection creation status: {_response}")

>>>
Model collection creation status: {'added_groups': ['arn:aws:sagemaker:us-west-2:376678947624:model-package-group/Mikael110--llama-2-7b-guanaco-fp16'], 'failure': []}

Link all your Fine-Tuned LoRA Adapter Delta Weights to this collection by task and/or dataset

# create model collection for finetuned and link it back to the base
finetuned_collection = model_collector.create(
    collection_name=model_group_for_finetune,
    parent_collection_name=model_group_for_base
)

# add finetuned model package group to the new finetuned collection
_response = model_collector.add_model_groups(
    collection_name=model_group_for_finetune,
    model_groups=[ft_model_pkg_group_name]
)
print(f"Model collection creation status: {_response}")

>>>
Model collection creation status: {'added_groups': ['arn:aws:sagemaker:us-east-1:811828458885:model-package-group/mikael110--llama-2-7b-guanaco-fp16-finetuned-sql'], 'failure': []}

This results in a collection hierarchy that is linked by model/task type and by the dataset used to fine-tune the base model.

This method of separating the base and adapter models has some drawbacks. One drawback is complexity in deploying the model. Because there are two separate model artifacts, you need additional steps to repackage the model instead of deploying directly from Model Registry. In the following code example, you first download and unpack the latest version of the base model.

!aws s3 cp {base_model_package.model_data} .

!tar -xvf {model_tar_filename} -C ./deepspeed/

!mv ./deepspeed/{model_id} ./deepspeed/base

!rm -rf ./deepspeed/{model_id}

Then download and repack the latest fine-tuned LoRA adapter weights.

!aws s3 cp {LoRA_package.model_data} .

!mkdir -p ./deepspeed/lora/

!tar -xzf model.tar.gz -C ./deepspeed/lora/

Because you will be using DJL Serving with DeepSpeed to host the model, your inference directory should look like the following.

deepspeed
    |-serving.properties
    |-requirements.txt
    |-model.py
    |-base/
        |-...
    |-lora/
        |-...
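
Before packaging, you may want to confirm that the directory matches this layout. The following is a minimal sketch that uses only the Python standard library. `missing_entries` is a hypothetical helper, and the demo runs on a throwaway directory that deliberately omits `lora/`.

```python
import os
import tempfile

REQUIRED_FILES = ["serving.properties", "requirements.txt", "model.py"]
REQUIRED_DIRS = ["base", "lora"]


def missing_entries(root):
    """List required files and directories that are absent under root."""
    missing = [f for f in REQUIRED_FILES if not os.path.isfile(os.path.join(root, f))]
    missing += [d + "/" for d in REQUIRED_DIRS if not os.path.isdir(os.path.join(root, d))]
    return missing


# Demo on a temporary directory that has everything except lora/.
with tempfile.TemporaryDirectory() as tmp:
    for name in REQUIRED_FILES:
        open(os.path.join(tmp, name), "w").close()
    os.mkdir(os.path.join(tmp, "base"))
    demo_missing = missing_entries(tmp)
print(demo_missing)
```

An empty list means the layout is complete. In your notebook you would call `missing_entries("./deepspeed")`.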

Finally, package the custom inference code, base model, and LoRA adapter in a single .tar.gz file for deployment.

!rm -f model.tar.gz
!tar czvf model.tar.gz -C deepspeed .
s3_code_artifact_deepspeed = sagemaker_session.upload_data("model.tar.gz", default_bucket, f"{s3_key_prefix}/inference")
print(f"S3 Code or Model tar for deepspeed uploaded to --- > {s3_code_artifact_deepspeed}")
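
If you prefer to stay in Python rather than shell out to `tar`, the packaging step can be sketched with the standard `tarfile` module. This is an illustrative alternative, not the notebook's code. `package_dir` is a hypothetical helper, and it writes member names without the `./` prefix that `tar -C deepspeed .` produces, although both archives extract to the same layout.

```python
import os
import tarfile
import tempfile


def package_dir(src_dir, out_path):
    """Write a gzip tarball whose members sit at the archive root, then list them."""
    with tarfile.open(out_path, "w:gz") as tar:
        for entry in sorted(os.listdir(src_dir)):
            # arcname drops the src_dir prefix so files land at the archive root
            tar.add(os.path.join(src_dir, entry), arcname=entry)
    with tarfile.open(out_path, "r:gz") as tar:
        return tar.getnames()


# Demo on a temporary stand-in for the deepspeed directory.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "deepspeed")
    os.makedirs(os.path.join(src, "base"))
    os.makedirs(os.path.join(src, "lora"))
    for name in ("serving.properties", "model.py"):
        open(os.path.join(src, name), "w").close()
    members = package_dir(src, os.path.join(tmp, "model.tar.gz"))
print(members)
```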

Clean up

Clean up your resources by following the instructions in the cleanup section of the notebook. Refer to Amazon SageMaker Pricing for details on the cost of the inference instances.

Conclusion

This post walked you through best practices for managing LoRA fine-tuned models on Amazon SageMaker. We covered two main methods: combining the base and adapter weights into one self-contained model, and separating the base and adapter weights. Both approaches have tradeoffs, but separating weights helps optimize storage and enables advanced model management techniques like SageMaker Model Registry Collections. This allows you to build hierarchies and relationships between models to improve organization and discoverability. We encourage you to try the sample code in the GitHub repository to experiment with these methods yourself. As generative AI progresses rapidly, following model management best practices will help you track experiments, find the right model for your task, and manage specialized LLMs efficiently at scale.

About the authors

James Wu is a Senior AI/ML Specialist Solution Architect at AWS, helping customers design and build AI/ML solutions. James’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. Prior to joining AWS, James was an architect, developer, and technology leader for over 10 years, including 6 years in engineering and 4 years in the marketing and advertising industries.

Pranav Murthy is an AI/ML Specialist Solutions Architect at AWS. He focuses on helping customers build, train, deploy and migrate machine learning (ML) workloads to SageMaker. He previously worked in the semiconductor industry developing large computer vision (CV) and natural language processing (NLP) models to improve semiconductor processes. In his free time, he enjoys playing chess and traveling.

Mecit Gungor is an AI/ML Specialist Solution Architect at AWS helping customers design and build AI/ML solutions at scale. He covers a wide range of AI/ML use cases for Telecommunication customers and currently focuses on Generative AI, LLMs, and training and inference optimization. He can often be found hiking in the wilderness or playing board games with his friends in his free time.

Shelbee Eigenbrode is a Principal AI and Machine Learning Specialist Solutions Architect at Amazon Web Services (AWS). She has been in technology for 24 years spanning multiple industries, technologies, and roles. She is currently focusing on combining her DevOps and ML background into the domain of MLOps to help customers deliver and manage ML workloads at scale. With over 35 patents granted across various technology domains, she has a passion for continuous innovation and using data to drive business outcomes. Shelbee is a co-creator and instructor of the Practical Data Science specialization on Coursera. She is also the Co-Director of Women In Big Data (WiBD), Denver chapter. In her spare time, she likes to spend time with her family, friends, and overactive dogs.

Challenge Accepted: Animator Wade Neistadt Leads Robotic Revolution in Record Time This Week ‘In the NVIDIA Studio’

Editor’s note: This post is part of our weekly In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks, and demonstrates how NVIDIA Studio technology improves creative workflows. We’re also deep diving on new GeForce RTX 40 Series GPU features, technologies and resources, and how they dramatically accelerate content creation.

Character animator Sir Wade Neistadt works to make animation and 3D education more accessible for aspiring and professional artists alike through video tutorials and industry training.

The YouTube creator, who goes by Sir Wade, also likes a challenge. When electronics company Razer recently asked him to create something unique and creative using the new Razer Blade 18 laptop with GeForce RTX 4090 graphics, Sir Wade obliged.

“I said yes because I thought it’d be a great opportunity to try something creatively risky and make something I didn’t yet know how to achieve,” the artist said.

I, Robot

One of the hardest parts of getting started on a project is needing to be creative on demand, said Sir Wade. For the Razer piece, the animator started by asking himself two questions: “What am I inspired by?” and “What do I have to work with?”

Sir Wade finds inspiration in games, technology, movies, people-watching and conversations. Fond of tech — and having eyed characters from the ProRigs library for some time — he decided his short animation should feature robots.

When creating a concept for the animation, Sir Wade took an unorthodox approach, skipping the popular step of 2D sketching. Instead, he captured video references by acting out the animations himself.

This gave Sir Wade the opportunity to quickly try a bunch of movements and preview body mechanics for the animation phase. Since ProRigs characters are rigs based on Autodesk Maya, he naturally began his animation work using this 3D software.

“YOU SHALL NOT (RENDER) PASS.”

His initial approach was straightforward: mimicking the main robot character’s movements with the edited reference footage. This worked fairly well, as NVIDIA RTX-accelerated ray tracing and AI denoising with the default Autodesk Arnold renderer resulted in smooth viewport movement and photorealistic visuals.

Then, Sir Wade continued tinkering with the piece, focusing on how the robot’s arm plates crashed into each other and how its feet moved. This was a great challenge, but he kept moving on the project. The featured artist would advise, “Don’t wait for everything to be perfect.”

The video reference footage captured earlier paid off later in Sir Wade’s creative workflow.

Next, Sir Wade exported files into Blender software with the Universal Scene Description (OpenUSD) framework, unlocking an open and extensible ecosystem, including the ability to make edits in NVIDIA Omniverse, a development platform for building and connecting 3D tools and applications. The edits could then be captured in the original native files, eliminating the need for tedious uploading, downloading and file reformatting.

AI-powered RTX-accelerated OptiX ray tracing in the viewport allowed Sir Wade to manipulate the scene with ease.

Sir Wade browsed the Kitbash3D digital platform with the new asset browser Cargo to compile kits, models and materials, and drag them into Blender with ease. It’s important at this stage to get base-level models in the scene, he said, so the environment can be further refined.

Dubbed the “ultimate desktop replacement,” the Razer Blade 18 offers NVIDIA GeForce RTX 4090 graphics.

Sir Wade raved about the Razer Blade 18’s quad-high-definition (QHD+) 18″ screen and 16:10 aspect ratio, which gives him more room to create, as well as its color-calibrated display, which ensures uploads to social media are as accurate as possible and require minimal color correction.

The preinstalled NVIDIA Studio Drivers, free to RTX GPU owners, are extensively tested with the most popular creative software to deliver maximum stability and performance.

“This is by far the best laptop I’ve ever used for this type of work.” — Sir Wade Neistadt

Returning to the action, Sir Wade used an emission shader to form the projectiles aimed at the robot. He also tweaked various textures, such as surface imperfections, to make the robot feel more weathered and battle-worn, before moving on to visual effects (VFX).

The artist used basic primitives as particle emitters in Blender to achieve the look of bursting particles over a limited number of frames. This, combined with the robot and floor surfaces containing surface nodes, creates sparks when the robot moves or gets hit by objects.

Sir Wade’s GeForce RTX 4090 Laptop GPU with Blender Cycles RTX-accelerated OptiX ray tracing in the viewport provides interactive, photorealistic rendering for modeling and animation.

Particle and collision effects in Blender enable compelling VFX.

To further experiment with VFX, Sir Wade imported the project into the EmberGen simulation tool to test out various preset and physics effects.

VFX in EmberGen.

He added dust and debris VFX, and exported the scene as an OpenVDB file back to Blender to perfect the lighting.

Final lighting elements in Blender.

“I chose an NVIDIA RTX GPU-powered system for its reliable speed, performance and stability, as I had a very limited window to complete this project.” — Sir Wade Neistadt

Finally, Sir Wade completed sound-design effects in Blackmagic Design’s DaVinci Resolve software.

Sir Wade’s video tutorials resonate with diverse audiences because of their fresh approach to solving problems and individualistic flair.

“Creativity for me doesn’t come naturally like for other artists,” Sir Wade explained. “I reverse engineer the process by seeing a tool or a concept, evaluating what’s interesting, then either figuring out a way to use it uniquely or explaining the discovery in a relatable way.”

Sir Wade Neistadt.

Check out Sir Wade’s animation workshops on his website.

Less than two days remain in Sir Wade’s Fall 2023 Animation Challenge. Download the challenge template and Maya character rig files, and submit a custom 3D scene to win an NVIDIA RTX GPU or other prizes by end of day on Wednesday, Nov. 15.

Follow NVIDIA Studio on Instagram, Twitter and Facebook. Access tutorials on the Studio YouTube channel and get updates directly in your inbox by subscribing to the Studio newsletter. 

Ghostbuster: Detecting Text Ghostwritten by Large Language Models



The structure of Ghostbuster, our new state-of-the-art method for detecting AI-generated text.

Large language models like ChatGPT write impressively well—so well, in fact, that they’ve become a problem. Students have begun using these models to ghostwrite assignments, leading some schools to ban ChatGPT. In addition, these models are also prone to producing text with factual errors, so wary readers may want to know if generative AI tools have been used to ghostwrite news articles or other sources before trusting them.

What can teachers and consumers do? Existing tools to detect AI-generated text sometimes do poorly on data that differs from what they were trained on. In addition, if these models falsely classify real human writing as AI-generated, they can jeopardize students whose genuine work is called into question.

Our recent paper introduces Ghostbuster, a state-of-the-art method for detecting AI-generated text. Ghostbuster works by finding the probability of generating each token in a document under several weaker language models, then combining functions based on these probabilities as input to a final classifier. Ghostbuster doesn’t need to know what model was used to generate a document, nor the probability of generating the document under that specific model. This property makes Ghostbuster particularly useful for detecting text potentially generated by an unknown model or a black-box model, such as the popular commercial models ChatGPT and Claude, for which probabilities aren’t available. We’re particularly interested in ensuring that Ghostbuster generalizes well, so we evaluated across a range of ways that text could be generated, including different domains (using newly collected datasets of essays, news, and stories), language models, or prompts.
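
As a rough illustration of the first two steps described above (per-token probabilities under weaker models, then functions that combine them), here is a toy sketch. It is not the Ghostbuster implementation. Add-one-smoothed unigram and bigram models stand in for the weaker language models, the four combined features are hand-picked assumptions, and the final classifier is omitted. All function names are hypothetical.

```python
import math
from collections import Counter


def train_weak_models(corpus_tokens):
    """Collect unigram and bigram counts from a token list."""
    vocab = set(corpus_tokens) | {"<unk>"}
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    return vocab, unigrams, bigrams


def token_probs(tokens, vocab, unigrams, bigrams):
    """Per-token probabilities under both models (tokens must be non-empty)."""
    v = len(vocab)
    total = sum(unigrams.values())
    toks = [t if t in vocab else "<unk>" for t in tokens]
    uni = [(unigrams[t] + 1) / (total + v) for t in toks]
    # The first token has no left context, so fall back to its unigram probability.
    bi = [uni[0]] + [
        (bigrams[(prev, cur)] + 1) / (unigrams[prev] + v)
        for prev, cur in zip(toks, toks[1:])
    ]
    return uni, bi


def combine(uni, bi):
    """Hand-picked scalar features built from the two probability sequences."""
    log_u = [math.log(p) for p in uni]
    log_b = [math.log(p) for p in bi]
    diff = [b - u for u, b in zip(log_u, log_b)]
    n = len(uni)
    return [sum(log_u) / n, sum(log_b) / n, max(diff), min(log_b)]


corpus = "the cat sat on the mat the dog sat on the rug".split()
doc = "the cat sat on the sofa".split()
uni, bi = token_probs(doc, *train_weak_models(corpus))
feats = combine(uni, bi)
print(feats)
```

A vector like `feats`, computed for each document, is the kind of input a final classifier would be trained on.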

Asymmetric Certified Robustness via Feature-Convex Neural Networks

TLDR: We propose the asymmetric certified robustness problem, which requires certified robustness for only one class and reflects real-world adversarial scenarios. This focused setting allows us to introduce feature-convex classifiers, which produce closed-form and deterministic certified radii on the order of milliseconds.

diagram illustrating the FCNN architecture


Figure 1. Illustration of feature-convex classifiers and their certification for sensitive-class inputs. This architecture composes a Lipschitz-continuous feature map $\varphi$ with a learned convex function $g$. Since $g$ is convex, it is globally underapproximated by its tangent plane at $\varphi(x)$, yielding certified norm balls in the feature space. Lipschitzness of $\varphi$ then yields appropriately scaled certificates in the original input space.
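
The tangent-plane argument in the caption can be made concrete with a small numeric sketch. It assumes the sensitive class is predicted when $g(\varphi(x)) > 0$ and treats only the $\ell_2$ norm, which is its own dual. The toy $g$, the identity feature map, and all function names are illustrative assumptions, not the paper's architecture.

```python
import math


def g(z):
    """Toy convex function: positive outside the unit ball."""
    return sum(v * v for v in z) - 1.0


def grad_g(z):
    return [2.0 * v for v in z]


def certified_radius_l2(x, phi, lipschitz, g, grad_g):
    """l2 radius within which the tangent-plane lower bound on g stays positive."""
    z = phi(x)
    margin = g(z)
    if margin <= 0:
        return 0.0  # not classified as the sensitive class; nothing to certify
    grad_norm = math.sqrt(sum(v * v for v in grad_g(z)))
    if grad_norm == 0.0:
        return math.inf  # a positive global minimum of g can never be crossed
    # Feature-space radius is margin / grad_norm; dividing by the Lipschitz
    # constant of phi rescales it to the input space.
    return margin / (lipschitz * grad_norm)


identity = lambda x: list(x)  # feature map with Lipschitz constant 1
x = [3.0, 4.0]
radius = certified_radius_l2(x, identity, 1.0, g, grad_g)
exact = math.sqrt(sum(v * v for v in x)) - 1.0  # true distance to the unit ball
print(radius, exact)
```

For this toy $g$ the certified radius is smaller than the exact distance to the decision boundary, as expected from a lower bound, and it comes from a single closed-form evaluation.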

Despite their widespread usage, deep learning classifiers are acutely vulnerable to adversarial examples: small, human-imperceptible image perturbations that fool machine learning models into misclassifying the modified input. This weakness severely undermines the reliability of safety-critical processes that incorporate machine learning. Many empirical defenses against adversarial perturbations have been proposed, often only to be later defeated by stronger attack strategies. We therefore focus on certifiably robust classifiers, which provide a mathematical guarantee that their prediction will remain constant for an $\ell_p$-norm ball around an input.

Conventional certified robustness methods incur a range of drawbacks, including nondeterminism, slow execution, poor scaling, and certification against only one attack norm. We argue that these issues can be addressed by refining the certified robustness problem to be more aligned with practical adversarial settings.
