NVIDIA and Hexagon Deliver Suite of Solutions for Accelerating Industrial Digitalization

For industrial businesses to reach the next level of digitalization, they need to create accurate, virtual representations of their physical systems.

NVIDIA is working with Hexagon, the Stockholm-based global leader in digital reality solutions combining sensor, software and autonomous technologies, to equip enterprises with the tools and solutions they need to build physically accurate, perfectly synchronized, AI-enabled digital twins that can be used to transform their organizations.

Hexagon is building integrations from their HxDR reality-capture and Nexus manufacturing platforms to NVIDIA Omniverse, an open platform for developing and operating industrial metaverse applications via Universal Scene Description (“OpenUSD”) plug-ins. The connected platforms, powered by NVIDIA AI technologies, will provide benefits across Hexagon’s major ecosystems, including agriculture, autonomous mobility, buildings, cities, defense, infrastructure, manufacturing and mining.

Together, these solutions deliver seamless, collaborative planning through a unified view, so industrial customers can better optimize workflows and improve efficiencies at scale. Professionals and developers will be able to use advanced capabilities in reality capture, digital twins, AI, simulation and visualization to enhance the most complex graphics workflows — from virtual prototyping to digital factories.

Fusing Physical and Digital Worlds Into One Reality

The $46 trillion manufacturing industry encompasses millions of factories worldwide designing and developing new products. Digitalization allows manufacturers to tackle the most complex engineering problems in more efficient, productive ways. It also brings industrial businesses one step closer to automating their workflows and becoming software-defined, which means improving operational efficiency and transforming their services with software.

At the HxGN LIVE Global event, Hexagon and NVIDIA showcased how their integrated offering can help teams accelerate their digitalization journeys. Watch the demo below to see how designers, engineers and others can use the Omniverse platform to quickly aggregate and simulate ultra-complex data from Hexagon’s HxDR and Nexus platforms.

Hexagon is also developing an AI-enabled web application, based on Omniverse, which will allow teams to see real-time comparisons of digital twins and their physical counterparts, so they can accelerate decision-making while optimizing planning and operations. This solution will help enterprises unlock more collaborative workflows and achieve rapid iteration across their teams, wherever they’re located.

With this announcement, the Omniverse ecosystem will benefit from Hexagon’s digital reality expertise, including geospatial reality capture, sensors, software and autonomous technologies. Enterprises will be able to build, simulate, operate and optimize virtual worlds faster, more accurately and more easily than ever before.

Learn more about NVIDIA Omniverse. Read Hexagon’s latest announcement, and see the latest demos and exhibits at HxGN LIVE Global 2023. 

Build custom chatbot applications using OpenChatKit models on Amazon SageMaker

Open-source large language models (LLMs) have become popular, allowing researchers, developers, and organizations to access these models to foster innovation and experimentation. This encourages collaboration from the open-source community to contribute to developments and improvement of LLMs. Open-source LLMs provide transparency to the model architecture, training process, and training data, which allows researchers to understand how the model works and identify potential biases and address ethical concerns. These open-source LLMs are democratizing generative AI by making advanced natural language processing (NLP) technology available to a wide range of users to build mission-critical business applications. GPT-NeoX, LLaMA, Alpaca, GPT4All, Vicuna, Dolly, and OpenAssistant are some of the popular open-source LLMs.

OpenChatKit is an open-source LLM used to build general-purpose and specialized chatbot applications, released by Together Computer in March 2023 under the Apache-2.0 license. This model allows developers to have more control over the chatbot’s behavior and tailor it to their specific applications. OpenChatKit provides a set of tools, base bot, and building blocks to build fully customized, powerful chatbots. The key components are as follows:

  • An instruction-tuned LLM, fine-tuned for chat from EleutherAI’s GPT-NeoX-20B with over 43 million instructions on 100% carbon negative compute. The GPT-NeoXT-Chat-Base-20B model is based on EleutherAI’s GPT-NeoX model, and is fine-tuned with data focusing on dialog-style interactions.
  • Customization recipes to fine-tune the model to achieve high accuracy on your tasks.
  • An extensible retrieval system enabling you to augment bot responses with information from a document repository, API, or other live-updating information source at inference time.
  • A moderation model, fine-tuned from GPT-JT-6B, designed to filter which questions the bot responds to.

The increasing scale and size of deep learning models make it difficult to deploy them successfully in generative AI applications. To meet the demands for low latency and high throughput, it becomes essential to employ sophisticated methods like model parallelism and quantization. Without expertise in these methods, many users struggle to get started with hosting large models for generative AI use cases.

In this post, we show how to deploy OpenChatKit models (GPT-NeoXT-Chat-Base-20B and GPT-JT-Moderation-6B) on Amazon SageMaker using DJL Serving and open-source model parallel libraries like DeepSpeed and Hugging Face Accelerate. We use DJL Serving, which is a high-performance universal model serving solution powered by the Deep Java Library (DJL) that is programming language agnostic. We demonstrate how the Hugging Face Accelerate library simplifies deployment of large models across multiple GPUs, thereby reducing the burden of running LLMs in a distributed fashion. Let’s get started!

Extensible retrieval system

An extensible retrieval system is one of the key components of OpenChatKit. It enables you to customize the bot response based on a closed domain knowledge base. Although LLMs are able to retain factual knowledge in their model parameters and can achieve remarkable performance on downstream NLP tasks when fine-tuned, their capacity to access and predict closed domain knowledge accurately remains restricted. Therefore, when they’re presented with knowledge-intensive tasks, their performance lags behind that of task-specific architectures. You can use the OpenChatKit retrieval system to augment their responses with knowledge from external sources such as Wikipedia, document repositories, APIs, and other information sources.

The retrieval system enables the chatbot to access current information by obtaining pertinent details in response to a specific query, thereby supplying the necessary context for the model to generate answers. To illustrate the functionality of this retrieval system, we provide support for an index of Wikipedia articles and offer example code demonstrating how to invoke a web search API for information retrieval. By following the provided documentation, you can integrate the retrieval system with any dataset or API during the inference process, allowing the chatbot to incorporate dynamically updated data into its responses.

Moderation model

Moderation models are important in chatbot applications for content filtering, quality control, user safety, and legal and compliance reasons. Moderation is a difficult and subjective task, and depends a lot on the domain of the chatbot application. OpenChatKit provides tools to moderate the chatbot application and monitor input text prompts for any inappropriate content. The moderation model provides a good baseline that can be adapted and customized to various needs.

OpenChatKit has a 6-billion-parameter moderation model, GPT-JT-Moderation-6B, which can moderate the chatbot to limit the inputs to the moderated subjects. Although the chat model itself has some moderation built in, Together Computer trained a GPT-JT-Moderation-6B model with Ontocord.ai’s OIG-moderation dataset. This model runs alongside the main chatbot to check that neither the user input nor the answer from the bot contains inappropriate content. You can also use it to detect out-of-domain questions to the chatbot and override the response when the question is not part of the chatbot’s domain.

The following diagram illustrates the OpenChatKit workflow.

Extensible retrieval system use cases

Although we can apply this technique in various industries to build generative AI applications, for this post we discuss use cases in the financial industry. Retrieval augmented generation can be employed in financial research to automatically generate research reports on specific companies, industries, or financial products. By retrieving relevant information from internal knowledge bases, financial archives, news articles, and research papers, you can generate comprehensive reports that summarize key insights, financial metrics, market trends, and investment recommendations. You can use this solution to monitor and analyze financial news, market sentiment, and trends.

Solution overview

The following steps are involved to build a chatbot using OpenChatKit models and deploy them on SageMaker:

  1. Download the chat base GPT-NeoXT-Chat-Base-20B model and package the model artifacts to be uploaded to Amazon Simple Storage Service (Amazon S3).
  2. Use a SageMaker large model inference (LMI) container, configure the properties, and set up custom inference code to deploy this model.
  3. Configure model parallel techniques and use inference optimization libraries in DJL serving properties. We will use Hugging Face Accelerate as the engine for DJL serving. Additionally, we define tensor parallel configurations to partition the model.
  4. Create a SageMaker model and endpoint configuration, and deploy the SageMaker endpoint.

You can follow along by running the notebook in the GitHub repo.

Download the OpenChatKit model

First, we download the OpenChatKit base model. We use huggingface_hub and use snapshot_download to download the model, which downloads an entire repository at a given revision. Downloads are made concurrently to speed up the process. See the following code:

import os
from pathlib import Path

from huggingface_hub import snapshot_download

# Download the model into ./openchatkit, relative to wherever the Jupyter notebook is running
local_model_path = Path("./openchatkit")
local_model_path.mkdir(exist_ok=True)
model_name = "togethercomputer/GPT-NeoXT-Chat-Base-20B"
# Only download config, tokenizer, and PyTorch checkpoint files
allow_patterns = ["*.json", "*.pt", "*.bin", "*.txt", "*.model"]
# Use snapshot_download to download the model because it is stored in the repository using LFS
chat_model_download_path = snapshot_download(
    repo_id=model_name,  # a user or an organization name and a repo name
    cache_dir=local_model_path,  # path to the folder where cached files are stored
    allow_patterns=allow_patterns,  # only files matching at least one pattern are downloaded
)

DJL Serving properties

You can use SageMaker LMI containers to host large generative AI models without providing your own inference code. This is extremely useful when there is no custom preprocessing of the input data or postprocessing of the model’s predictions. You can also deploy a model using custom inference code. In this post, we demonstrate how to deploy OpenChatKit models with custom inference code.

SageMaker expects the model artifacts in tar format. We create each OpenChatKit model with the following files: serving.properties and model.py.
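
The packaging step can be sketched with Python's standard tarfile module. This is a minimal illustration, not the notebook's own code: the helper name package_model_dir, the top-level openchatkit folder inside the archive, and the placeholder file contents are all assumptions made for the example.

```python
import tarfile
import tempfile
from pathlib import Path


def package_model_dir(model_dir: Path, archive_path: Path) -> list:
    """Bundle every file under model_dir into a gzip-compressed tarball.

    Files are stored under a top-level folder named after model_dir
    (an assumed layout). The archived member names are returned.
    """
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in sorted(model_dir.rglob("*")):
            if path.is_file():
                arcname = Path(model_dir.name) / path.relative_to(model_dir)
                tar.add(path, arcname=arcname.as_posix())
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


# Demo with placeholder contents; the real files are created in the notebook.
work_dir = Path(tempfile.mkdtemp())
model_dir = work_dir / "openchatkit"
model_dir.mkdir()
(model_dir / "serving.properties").write_text("engine = Python\n")
(model_dir / "model.py").write_text("# custom inference handler goes here\n")

members = package_model_dir(model_dir, work_dir / "openchatkit.tar.gz")
print(members)
```

The resulting archive is what would then be uploaded to Amazon S3 and referenced when creating the SageMaker model.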

The serving.properties configuration file indicates to DJL Serving which model parallelization and inference optimization libraries you would like to use. The following is a list of settings we use in this configuration file:

openchatkit/serving.properties
engine = Python
option.tensor_parallel_degree = 4
option.s3url = {{s3url}}

The following parameters are relevant to this configuration file (not all of them are set in the example above):

  • engine – The engine for DJL to use.
  • option.entryPoint – The entry point Python file or module. This should align with the engine that is being used.
  • option.s3url – Set this to the URI of the S3 bucket that contains the model.
  • option.modelid – If you want to download the model from huggingface.co, you can set option.modelid to the model ID of a pretrained model hosted inside a model repository on huggingface.co (https://huggingface.co/models). The container uses this model ID to download the corresponding model repository on huggingface.co.
  • option.tensor_parallel_degree – Set this to the number of GPU devices over which DeepSpeed needs to partition the model. This parameter also controls the number of workers per model that will be started up when DJL Serving runs. For example, if we have an 8 GPU machine and we are creating eight partitions, then we will have one worker per model to serve the requests. It’s necessary to tune the parallelism degree and identify the optimal value for a given model architecture and hardware platform. We call this ability inference-adapted parallelism.

Refer to Configurations and settings for an exhaustive list of options.
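
The relationship between GPUs, partitions, and workers described for option.tensor_parallel_degree can be made concrete with a little arithmetic. The sketch below assumes that the number of workers per model is the GPU count divided by the tensor parallel degree, which matches the 8 GPU example above; the helper name workers_per_model is hypothetical and is not part of DJL Serving.

```python
def workers_per_model(num_gpus: int, tensor_parallel_degree: int) -> int:
    """Estimate how many workers are started per model.

    Assumption: each worker owns tensor_parallel_degree GPUs, so the
    worker count is the integer quotient of the two values.
    """
    if tensor_parallel_degree < 1 or tensor_parallel_degree > num_gpus:
        raise ValueError("tensor_parallel_degree must be between 1 and num_gpus")
    return num_gpus // tensor_parallel_degree


# 8 GPUs split into eight partitions: a single worker, as in the example above.
print(workers_per_model(num_gpus=8, tensor_parallel_degree=8))
# 4 GPUs (for example, an ml.g5.12xlarge) with tensor_parallel_degree = 4.
print(workers_per_model(num_gpus=4, tensor_parallel_degree=4))
# 8 GPUs with two partitions per model copy.
print(workers_per_model(num_gpus=8, tensor_parallel_degree=2))
```

A lower degree leaves room for more workers and therefore more concurrent requests, but each worker has less aggregate GPU memory available for the model, which is why the value needs tuning per model and instance type.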

OpenChatKit models

The OpenChatKit base model implementation has the following four files:

  • model.py – This file implements the handling logic for the main OpenChatKit GPT-NeoX model. It receives the inference input request, loads the model, loads the Wikipedia index, and serves the response. Refer to model.py (created as part of the notebook) for additional details. model.py uses the following key classes:
    • OpenChatKitService – This handles passing the data between the GPT-NeoX model, Faiss search, and conversation object. WikipediaIndex and Conversation objects are initialized and input chat conversations are sent to the index to search for relevant content from Wikipedia. This also generates a unique ID for each invocation if one is not supplied for the purpose of storing the prompts in Amazon DynamoDB.
    • ChatModel – This class loads the model and tokenizer and generates the response. It handles partitioning the model across multiple GPUs using tensor_parallel_degree, and configures the dtypes and device_map. The prompts are passed to the model to generate responses. A stopping criteria StopWordsCriteria is configured for the generation to only produce the bot response on inference.
    • ModerationModel – We use two moderation models in the ModerationModel class: an input model, which indicates to the chat model that the input is inappropriate so that the inference result is overridden, and an output model, which overrides the inference result when the generated response is inappropriate. We classify the input prompt and output response with the following possible labels:
      • casual
      • needs caution
      • needs intervention (this is flagged to be moderated by the model)
      • possibly needs caution
      • probably needs caution
  • wikipedia_prepare.py – This file handles downloading and preparing the Wikipedia index. In this post, we use a Wikipedia index provided on Hugging Face datasets. To search the Wikipedia documents for relevant text, the index needs to be downloaded from Hugging Face because it’s not packaged elsewhere. The wikipedia_prepare.py file is responsible for handling the download when imported. Only one of the multiple processes running for inference can clone the repository. The rest wait until the files are present in the local file system.
  • wikipedia.py – This file is used for searching the Wikipedia index for contextually relevant documents. The input query is tokenized and embeddings are created using mean_pooling. We compute cosine similarity distance metrics between the query embedding and the Wikipedia index to retrieve contextually relevant Wikipedia sentences. Refer to wikipedia.py for implementation details.
import numpy as np

# Create a sentence embedding by averaging the token embeddings (PyTorch tensors),
# ignoring padded positions where mask == 0
def mean_pooling(token_embeddings, mask):
    token_embeddings = token_embeddings.masked_fill(~mask[..., None].bool(), 0.0)
    sentence_embeddings = token_embeddings.sum(dim=1) / mask.sum(dim=1)[..., None]
    return sentence_embeddings

# Compute pairwise cosine similarity between the rows of two 2-D NumPy arrays
def cos_sim_2d(x, y):
    norm_x = x / np.linalg.norm(x, axis=1, keepdims=True)
    norm_y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return np.matmul(norm_x, norm_y.T)
  • conversation.py – This file is used for storing and retrieving the conversation thread in DynamoDB for passing to the model and user. conversation.py is adapted from the open-source OpenChatKit repository. This file is responsible for defining the object that stores the conversation turns between the human and the model. With this, the model is able to retain a session for the conversation, allowing a user to refer to previous messages. Because SageMaker endpoint invocations are stateless, this conversation needs to be stored in a location external to the endpoint instances. On startup, the instance creates a DynamoDB table if it doesn’t exist. All updates to the conversation are then stored in DynamoDB based on the session_id key, which is generated by the endpoint. Any invocation with a session ID will retrieve the associated conversation string and update it as required.
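
To illustrate how the cosine similarity computed in wikipedia.py leads to a ranked list of passages, here is a small brute-force sketch in NumPy. It is not the Faiss-based search used by the endpoint; the function top_k_passages and the toy three-dimensional embeddings are invented for the example.

```python
import numpy as np


def cos_sim_2d(x, y):
    """Pairwise cosine similarity between the rows of x and the rows of y."""
    norm_x = x / np.linalg.norm(x, axis=1, keepdims=True)
    norm_y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return np.matmul(norm_x, norm_y.T)


def top_k_passages(query_embedding, passage_embeddings, passages, k=2):
    """Return the k passages most similar to the query, best match first."""
    scores = cos_sim_2d(query_embedding[None, :], passage_embeddings)[0]
    best = np.argsort(-scores)[:k]
    return [(passages[i], float(scores[i])) for i in best]


# Toy data standing in for real sentence embeddings.
passages = ["passage about data engineering", "passage about cooking", "passage about databases"]
passage_embeddings = np.array(
    [
        [1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.8, 0.0, 0.6],
    ]
)
query_embedding = np.array([0.9, 0.1, 0.0])

results = top_k_passages(query_embedding, passage_embeddings, passages, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

A library such as Faiss serves the same purpose but avoids comparing the query against every passage one by one, which matters for an index the size of Wikipedia.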

Build an LMI inference container with custom dependencies

The index search uses Facebook’s Faiss library for performing the similarity search. Because this isn’t included in the base LMI image, the container needs to be adapted to install this library. The following code defines a Dockerfile that installs Faiss from the source alongside other libraries needed by the bot endpoint. We use the sm-docker utility to build and push the image to Amazon Elastic Container Registry (Amazon ECR) from Amazon SageMaker Studio. Refer to Using the Amazon SageMaker Studio Image Build CLI to build container images from your Studio notebooks for more details.

The DJL container doesn’t have Conda installed, so Faiss needs to be cloned and compiled from the source. To install Faiss, the dependencies for using the BLAS APIs and Python support need to be installed. After these packages are installed, Faiss is configured to use AVX2 and CUDA before being compiled with the Python extensions installed.

pandas, fastparquet, boto3, and git-lfs are installed afterwards because these are required for downloading and reading the index files.

FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.21.0-deepspeed0.8.0-cu117
ARG FAISS_URL=https://github.com/facebookresearch/faiss.git
RUN apt-get update && apt-get install -y git-lfs wget cmake pkg-config build-essential apt-utils
RUN apt search openblas && apt-get install -y libopenblas-dev swig
RUN git clone $FAISS_URL && \
    cd faiss && \
    cmake -B build . -DFAISS_OPT_LEVEL=avx2 -DCMAKE_CUDA_ARCHITECTURES="86" && \
    make -C build -j faiss && \
    make -C build -j swigfaiss && \
    make -C build -j swigfaiss_avx2 && \
    (cd build/faiss/python && python -m pip install .)

RUN pip install pandas fastparquet boto3 && \
    git lfs install --skip-repo && \
    apt-get clean all

Create the model

Now that we have the Docker image in Amazon ECR, we can proceed with creating the SageMaker model object for the OpenChatKit models. We deploy the GPT-NeoXT-Chat-Base-20B chat model along with input and output moderation models that use GPT-JT-Moderation-6B. Refer to create_model for more details.

from sagemaker.utils import name_from_base

chat_model_name = name_from_base("gpt-neoxt-chatbase-ds")
print(chat_model_name)

create_model_response = sm_client.create_model(
    ModelName=chat_model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": chat_inference_image_uri,
        "ModelDataUrl": s3_code_artifact,
    },
)
chat_model_arn = create_model_response["ModelArn"]

print(f"Created Model: {chat_model_arn}")

Configure the endpoint

Next, we define the endpoint configurations for the OpenChatKit models. We deploy the models using the ml.g5.12xlarge instance type. Refer to create_endpoint_config for more details.

chat_endpoint_config_name = f"{chat_model_name}-config"
chat_endpoint_name = f"{chat_model_name}-endpoint"

chat_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=chat_endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": chat_model_name,
            "InstanceType": "ml.g5.12xlarge",
            "InitialInstanceCount": 1,
            "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
        },
    ],
)

Deploy the endpoint

Finally, we create an endpoint using the model and endpoint configuration we defined in the previous steps:

chat_create_endpoint_response = sm_client.create_endpoint(
    EndpointName=chat_endpoint_name,
    EndpointConfigName=chat_endpoint_config_name,
)
print(f"Created Endpoint: {chat_create_endpoint_response['EndpointArn']}")
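
Endpoint creation is asynchronous, so it can help to wait until the endpoint is in service before sending requests. The following is a minimal polling sketch built on the describe_endpoint call; the helper name wait_for_endpoint and its polling defaults are assumptions for illustration and are not taken from the notebook.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30, max_polls=120):
    """Poll describe_endpoint until the endpoint is InService.

    sm_client is expected to behave like a boto3 SageMaker client.
    Raises RuntimeError if creation fails and TimeoutError if the
    endpoint is still not ready after max_polls attempts.
    """
    for _ in range(max_polls):
        description = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = description["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            reason = description.get("FailureReason", "unknown reason")
            raise RuntimeError(f"Endpoint {endpoint_name} failed: {reason}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Endpoint {endpoint_name} was not ready after {max_polls} polls")
```

With the names used in this post, the call would look like wait_for_endpoint(sm_client, chat_endpoint_name). boto3 also provides a built-in endpoint_in_service waiter that covers the same need.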

Run inference from OpenChatKit models

Now it’s time to send inference requests to the model and get the responses. We pass the input text prompt and model parameters such as temperature, top_k, and max_new_tokens. The quality of the chatbot responses depends on the parameters specified, so it’s recommended to benchmark model performance against these parameters to find the optimal setting for your use case. The input prompt is first sent to the input moderation model, and the output is sent to ChatModel to generate the responses. During this step, the model uses the Wikipedia index to retrieve contextually relevant sections, which are added to the prompt so that the model can produce domain-specific responses. Finally, the model response is sent to the output moderation model to check for classification, and then the responses are returned. See the following code:

import json

# smr_client and chat_endpoint_name are defined earlier in the notebook
def chat(prompt, session_id=None, **kwargs):
    # Generation parameters sent with every request
    parameters = {
        "temperature": 0.6,
        "top_k": 40,
        "max_new_tokens": 512,
    }
    if session_id:
        # Continue an existing conversation; retrieval is turned off for follow-up turns
        parameters["session_id"] = session_id
        parameters["no_retrieval"] = True
    chat_response_model = smr_client.invoke_endpoint(
        EndpointName=chat_endpoint_name,
        Body=json.dumps({"inputs": prompt, "parameters": parameters}),
        ContentType="application/json",
    )
    response = chat_response_model["Body"].read().decode("utf8")
    return response


prompts = "What does a data engineer do?"
chat(prompts)
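
The order of operations inside the endpoint (input moderation, generation, output moderation) can be summarized in a few lines. This is a simplified sketch of the flow described above, not the code in model.py: the function moderated_chat, the refusal message, and the toy classifier and generator are hypothetical stand-ins, and only the "needs intervention" label comes from the label list earlier in this post.

```python
FLAGGED_LABEL = "needs intervention"
REFUSAL = "Sorry, I can't help with that request."


def moderated_chat(prompt, classify_input, generate, classify_output):
    """Run a prompt through input moderation, generation, and output moderation."""
    # Step 1: block the request if the input moderation model flags it.
    if classify_input(prompt) == FLAGGED_LABEL:
        return REFUSAL
    # Step 2: generate a response with the chat model.
    answer = generate(prompt)
    # Step 3: override the answer if the output moderation model flags it.
    if classify_output(answer) == FLAGGED_LABEL:
        return REFUSAL
    return answer


# Toy stand-ins for the real models, used only to make the sketch runnable.
def toy_classifier(text):
    return FLAGGED_LABEL if "forbidden" in text.lower() else "casual"


def toy_generator(prompt):
    return f"Echo: {prompt}"


print(moderated_chat("What does a data engineer do?", toy_classifier, toy_generator, toy_classifier))
print(moderated_chat("Tell me something forbidden", toy_classifier, toy_generator, toy_classifier))
```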

Refer to sample chat interactions below.

Clean up

Follow the instructions in the cleanup section of the notebook to delete the resources provisioned as part of this post and avoid unnecessary charges. Refer to Amazon SageMaker Pricing for details about the cost of the inference instances.
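
As a rough idea of what such a cleanup involves, the sketch below deletes the three resources created in this post using the corresponding SageMaker API calls. The helper name delete_endpoint_resources is an assumption, and the notebook's own cleanup section remains the authoritative reference for what to delete.

```python
def delete_endpoint_resources(sm_client, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint, its configuration, and the model, in that order.

    sm_client is expected to behave like a boto3 SageMaker client.
    """
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sm_client.delete_model(ModelName=model_name)
```

With the names used in this post, the call would be delete_endpoint_resources(sm_client, chat_endpoint_name, chat_endpoint_config_name, chat_model_name).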

Conclusion

In this post, we discussed the importance of open-source LLMs and how to deploy an OpenChatKit model on SageMaker to build next-generation chatbot applications. We discussed various components of OpenChatKit models, moderation models, and how to use an external knowledge source like Wikipedia for retrieval augmented generation (RAG) workflows. You can find step-by-step instructions in the GitHub notebook. Let us know about the amazing chatbot applications you’re building. Cheers!


About the Authors

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including the NLP and computer vision domains. He helps customers achieve high-performance model inference on SageMaker.

Vikram Elango is a Sr. AIML Specialist Solutions Architect at AWS, based in Virginia, US. He is currently focused on generative AI, LLMs, prompt engineering, large model inference optimization, and scaling ML across enterprises. Vikram helps financial and insurance industry customers with design and thought leadership to build and deploy machine learning applications at scale. In his spare time, he enjoys traveling, hiking, cooking, and camping with his family.

Andrew Smith is a Cloud Support Engineer in the SageMaker, Vision & Other team at AWS, based in Sydney, Australia. He supports customers using many AI/ML services on AWS with expertise in working with Amazon SageMaker. Outside of work, he enjoys spending time with friends and family as well as learning about different technologies.

Fine-tune GPT-J using an Amazon SageMaker Hugging Face estimator and the model parallel library

GPT-J is an open-source 6-billion-parameter model released by EleutherAI. The model is trained on the Pile and can perform various tasks in language processing. It can support a wide variety of use cases, including text classification, token classification, text generation, question answering, entity extraction, summarization, sentiment analysis, and many more. GPT-J is a transformer model trained using Ben Wang’s Mesh Transformer JAX.

In this post, we present a guide and best practices on training large language models (LLMs) using the Amazon SageMaker distributed model parallel library to reduce training time and cost. You will learn how to train a 6-billion-parameter GPT-J model on SageMaker with ease. Finally, we share the main features of SageMaker distributed model parallelism that help with speeding up training time.

Transformer neural networks

A transformer neural network is a popular deep learning architecture for solving sequence-to-sequence tasks. It uses attention as the learning mechanism to achieve close to human-level performance. Other useful properties of the architecture compared to previous generations of natural language processing (NLP) models include the ability to distribute, scale, and pre-train. Transformer-based models can be applied across different use cases when dealing with text data, such as search, chatbots, and many more. Transformers use the concept of pre-training to gain intelligence from large datasets. Pre-trained transformers can be used as is or fine-tuned on your datasets, which can be much smaller and specific to your business.

Hugging Face on SageMaker

Hugging Face is a company developing some of the most popular open-source libraries providing state-of-the-art NLP technology based on transformer architectures. The Hugging Face transformers, tokenizers, and datasets libraries provide APIs and tools to download and predict using pre-trained models in multiple languages. SageMaker enables you to train, fine-tune, and run inference using Hugging Face models directly from the Hugging Face Model Hub using the Hugging Face estimator in the SageMaker SDK. The integration makes it easier to customize Hugging Face models on domain-specific use cases. Behind the scenes, the SageMaker SDK uses AWS Deep Learning Containers (DLCs), which are a set of prebuilt Docker images for training and serving models offered by SageMaker. The DLCs are developed through a collaboration between AWS and Hugging Face. The integration also connects the Hugging Face transformers SDK with the SageMaker distributed training libraries, enabling you to scale your training jobs on a cluster of GPUs.

Overview of the SageMaker distributed model parallel library

Model parallelism is a distributed training strategy that partitions the deep learning model over numerous devices, within or across instances. Deep learning (DL) models with more layers and parameters perform better in complex tasks like computer vision and NLP. However, the maximum model size that can be stored in the memory of a single GPU is limited. GPU memory constraints can be bottlenecks while training DL models in the following ways:

  • They limit the size of the model that can be trained because a model’s memory footprint scales proportionately to the number of parameters
  • They reduce GPU utilization and training efficiency by limiting the per-GPU batch size during training

SageMaker includes the distributed model parallel library to help distribute and train DL models effectively across many compute nodes, overcoming the restrictions associated with training a model on a single GPU. Furthermore, the library lets you get the most out of distributed training on EFA-supported devices, which improve inter-node communication performance with low latency, high throughput, and OS bypass.

Because large models such as GPT-J, with billions of parameters, have a GPU memory footprint that exceeds a single chip, it becomes essential to partition them across multiple GPUs. The SageMaker model parallel (SMP) library enables automatic partitioning of models across multiple GPUs. With SageMaker model parallelism, SageMaker runs an initial profiling job on your behalf to analyze the compute and memory requirements of the model. This information is then used to decide how the model is partitioned across GPUs in order to optimize for an objective, such as maximizing speed or minimizing memory footprint.

It also supports optional pipeline run scheduling in order to maximize the overall utilization of available GPUs. The propagation of activations during forward pass and gradients during backward pass requires sequential computation, which limits the amount of GPU utilization. SageMaker overcomes the sequential computation constraint utilizing the pipeline run schedule by splitting mini-batches into micro-batches to be processed in parallel on different GPUs. SageMaker model parallelism supports two modes of pipeline runs:

  • Simple pipeline – This mode finishes the forward pass for each micro-batch before starting the backward pass.
  • Interleaved pipeline – In this mode, the backward run of the micro-batches is prioritized whenever possible. This allows for quicker release of the memory used for activations, thereby using memory more efficiently.
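
The first step of any pipeline run schedule is splitting a mini-batch into micro-batches. The sketch below shows only that splitting step with NumPy; it does not reproduce how the SMP library schedules forward and backward passes, and the function name and the toy batch shape are chosen for illustration.

```python
import numpy as np


def split_into_microbatches(mini_batch, num_microbatches):
    """Split a mini-batch along its first (sample) dimension."""
    return np.array_split(mini_batch, num_microbatches)


# A toy mini-batch of 8 samples with 4 features each.
mini_batch = np.arange(8 * 4).reshape(8, 4)
micro_batches = split_into_microbatches(mini_batch, num_microbatches=4)

for index, micro_batch in enumerate(micro_batches):
    print(f"micro-batch {index}: shape {micro_batch.shape}")
```

Because each micro-batch is independent in the forward pass, one GPU can start on the next micro-batch while another GPU is still processing the previous one, which is what lets the pipeline keep more devices busy.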

Tensor parallelism

Individual layers, or nn.Modules, are divided across devices using tensor parallelism so they can run concurrently. The simplest example of how the library divides a model with four layers to achieve two-way tensor parallelism ("tensor_parallel_degree": 2) is shown in the following figure. Each model replica’s layers are bisected (divided in half) and distributed between two GPUs. The degree of data parallelism is eight in this example because the model parallel configuration additionally includes "pipeline_parallel_degree": 1 and "ddp": True. The library manages communication among the replicas of the tensor-distributed model.

Tensor parallelism

The benefit of this feature is that you may choose which layers or which subset of layers you want to apply tensor parallelism to. To dive deep into tensor parallelism and other memory-saving features for PyTorch, and to learn how to set up a combination of pipeline and tensor parallelism, see Extended Features of the SageMaker Model Parallel Library for PyTorch.

SageMaker sharded data parallelism

Sharded data parallelism is a memory-saving distributed training technique that splits the training state of a model (model parameters, gradients, and optimizer states) across GPUs in a data parallel group.

When scaling up your training job to a large GPU cluster, you can reduce the per-GPU memory footprint of the model by sharding the training state over multiple GPUs. This brings two benefits: you can fit larger models, which would otherwise run out of memory with standard data parallelism, or you can increase the batch size using the freed-up GPU memory.

The standard data parallelism technique replicates the training states across the GPUs in the data parallel group and performs gradient aggregation based on the AllReduce operation. In effect, sharded data parallelism introduces a trade-off between the communication overhead and GPU memory efficiency. Using sharded data parallelism increases the communication cost, but the memory footprint per GPU (excluding the memory usage due to activations) is divided by the sharded data parallelism degree, therefore larger models can fit in a GPU cluster.
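The memory side of this trade-off can be sketched with simple arithmetic. The example below assumes a model with 6 billion parameters and roughly 16 bytes of training state per parameter, a common rule-of-thumb for mixed-precision training with an Adam-style optimizer; that figure is an assumption, not something stated in this post. The function name is hypothetical, and activations are excluded, as in the paragraph above.

```python
def per_gpu_training_state_gb(num_params, bytes_per_param, sharding_degree):
    """Per-GPU memory (in GB) for parameters, gradients, and optimizer states."""
    total_bytes = num_params * bytes_per_param
    return total_bytes / sharding_degree / 1e9


NUM_PARAMS = 6_000_000_000   # assumed model size
BYTES_PER_PARAM = 16         # assumed: weights + gradients + optimizer states

# Standard data parallelism: every GPU holds the full training state.
replicated_gb = per_gpu_training_state_gb(NUM_PARAMS, BYTES_PER_PARAM, sharding_degree=1)

# Sharded data parallelism with degree 8: each GPU holds one eighth of it.
sharded_gb = per_gpu_training_state_gb(NUM_PARAMS, BYTES_PER_PARAM, sharding_degree=8)
```

Under these assumptions, the training state drops from about 96 GB per GPU to about 12 GB per GPU, at the price of extra communication to gather the shards when they are needed.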

SageMaker implements sharded data parallelism through the MiCS implementation. For more information, see Near-linear scaling of gigantic-model training on AWS.

Refer to Sharded Data Parallelism for further details on how to apply sharded data parallelism to your training jobs.

Use the SageMaker model parallel library

The SageMaker model parallel library comes with the SageMaker Python SDK. You need to install the SageMaker Python SDK to use the library, and it’s already installed on SageMaker notebook kernels. To make your PyTorch training script utilize the capabilities of the SMP library, you need to make the following changes:

  1. Start by importing and initializing the smp library using the smp.init() call.
  2. Once it’s initialized, you can wrap your model with the smp.DistributedModel wrapper and use the returned DistributedModel object instead of the user model.
  3. For your optimizer state, use the smp.DistributedOptimizer wrapper around your model optimizer, enabling smp to save and load the optimizer state. Abstract the forward and backward pass logic into a separate function and add the smp.step decorator to it. Essentially, the forward pass and back-propagation need to run inside the function decorated with smp.step. This allows smp to split the tensor input to the function into the number of microbatches specified when launching the training job.
  4. Next, move the input tensors to the GPU used by the current process using the torch.cuda.set_device API followed by the .to() API call.
  5. Finally, for back-propagation, replace torch.Tensor.backward and torch.autograd.backward with the backward method of the DistributedModel object (model.backward in the code that follows).

See the following code:

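import torch.nn.functional as F
import smdistributed.modelparallel.torch as smp
from transformers import AutoModelForCausalLM

smp.init()  # initialize the library before any other smp call

# The forward and backward passes run inside the smp.step-decorated function
@smp.step
def train_step(model, data, target):
    output = model(data)
    loss = F.nll_loss(output, target, reduction="mean")
    model.backward(loss)  # used instead of loss.backward()
    return output, loss

# Modules created inside this context are marked for tensor parallelism
with smp.tensor_parallelism():
    model = AutoModelForCausalLM.from_config(model_config)

# model_config and optimizer are assumed to be defined elsewhere in the training script
model = smp.DistributedModel(model)
optimizer = smp.DistributedOptimizer(optimizer)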

The SageMaker model parallel library’s tensor parallelism offers out-of-the-box support for the following Hugging Face Transformer models:

  • GPT-2, BERT, and RoBERTa (available in the SMP library v1.7.0 and later)
  • GPT-J (available in the SMP library v1.8.0 and later)
  • GPT-Neo (available in the SMP library v1.10.0 and later)

Best practices for performance tuning with the SMP library

When training large models, consider the following steps so that your model fits in GPU memory with a reasonable batch size:

  • It’s recommended to use instances with higher GPU memory and high bandwidth interconnect for performance, such as p4d and p4de instances.
  • Optimizer state sharding can be enabled in most cases, and will be helpful when you have more than one copy of the model (data parallelism enabled). You can turn on optimizer state sharding by setting "shard_optimizer_state": True in the modelparallel configuration.
  • Use activation checkpointing, a technique to reduce memory usage by clearing activations of certain layers and recomputing them during a backward pass of selected modules in the model.
  • Use activation offloading, an additional feature that can further reduce memory usage. To use activation offloading, set "offload_activations": True in the modelparallel configuration. Use it when activation checkpointing and pipeline parallelism are turned on and the number of microbatches is greater than one.
  • Enable tensor parallelism and increase parallelism degrees where the degree is a power of 2. Typically for performance reasons, tensor parallelism is restricted to within a node.
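The configuration-related tips above can be sketched as a small checker for a modelparallel parameters dictionary. This is an illustrative sketch, not part of the SMP library: the helper functions are hypothetical, the "microbatches" key is assumed to be the name of the microbatch-count setting, and the data parallel degree is assumed to be the number of GPUs divided by the product of the tensor and pipeline parallel degrees (consistent with the earlier example, where degree 2 and degree 1 give a data parallel degree of eight if 16 GPUs are used). Activation checkpointing is not checked here because the list above does not name a configuration key for it.

```python
def is_power_of_two(n):
    return n > 0 and (n & (n - 1)) == 0


def data_parallel_degree(world_size, tensor_parallel_degree, pipeline_parallel_degree):
    """Number of model replicas, assuming world_size = dp * tp * pp."""
    model_parallel_size = tensor_parallel_degree * pipeline_parallel_degree
    if world_size % model_parallel_size != 0:
        raise ValueError("world_size must be divisible by tp * pp")
    return world_size // model_parallel_size


def check_parameters(params, world_size):
    """Return warnings for settings that go against the tips above (hypothetical helper)."""
    warnings = []
    tp = params.get("tensor_parallel_degree", 1)
    pp = params.get("pipeline_parallel_degree", 1)
    if not is_power_of_two(tp):
        warnings.append("tensor_parallel_degree should be a power of 2")
    if params.get("shard_optimizer_state") and data_parallel_degree(world_size, tp, pp) < 2:
        warnings.append("optimizer state sharding helps when there is more than one model copy")
    if params.get("offload_activations") and (pp < 2 or params.get("microbatches", 1) < 2):
        warnings.append("offload_activations expects pipeline parallelism and more than one microbatch")
    return warnings


# Illustrative values for 8 GPUs: 2-way tensor x 2-way pipeline x 2 model replicas.
smp_parameters = {
    "tensor_parallel_degree": 2,
    "pipeline_parallel_degree": 2,
    "ddp": True,
    "microbatches": 4,  # assumed key name
    "shard_optimizer_state": True,
    "offload_activations": True,
}
issues = check_parameters(smp_parameters, world_size=8)
```

With these values the checker returns no warnings. The degrees themselves are placeholders; suitable values depend on the model and instance type.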

We have run many experiments to optimize training and tuning GPT-J on SageMaker with the SMP library. We have managed to reduce GPT-J training time for an epoch on SageMaker from 58 minutes to less than 10 minutes, about six times faster per epoch. Initialization plus model and dataset download from Amazon Simple Storage Service (Amazon S3) took less than a minute, tracing and auto-partitioning with GPU as the tracing device took less than 1 minute, and training an epoch took 8 minutes using tensor parallelism on one ml.p4d.24xlarge instance, FP16 precision, and a SageMaker Hugging Face estimator.

To reduce training time when training GPT-J on SageMaker, we recommend the following best practices:

  • Store your pretrained model on Amazon S3
  • Use FP16 precision
  • Use GPU as a tracing device
  • Use auto-partitioning, activation checkpointing, and optimizer state sharding:
    • auto_partition: True
    • shard_optimizer_state: True
  • Use tensor parallelism
  • Use a SageMaker training instance with multiple GPUs such as ml.p3.16xlarge, ml.p3dn.24xlarge, ml.g5.48xlarge, ml.p4d.24xlarge, or ml.p4de.24xlarge.
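As a hedged sketch of how these recommendations might be passed to a SageMaker estimator, the following builds a distribution dictionary in the shape the SageMaker Python SDK expects for the model parallel library, to the best of our understanding. The helper function, the GPU-count table, and the parallelism degrees are illustrative assumptions; only auto_partition and shard_optimizer_state are taken directly from the list above.

```python
# GPUs per instance for the instance types listed above (8 on each).
GPUS_PER_INSTANCE = {
    "ml.p3.16xlarge": 8,
    "ml.p3dn.24xlarge": 8,
    "ml.g5.48xlarge": 8,
    "ml.p4d.24xlarge": 8,
    "ml.p4de.24xlarge": 8,
}


def build_distribution(instance_type, parameters):
    """Assemble a distribution dict for a model parallel job (hypothetical helper)."""
    return {
        "smdistributed": {
            "modelparallel": {"enabled": True, "parameters": parameters},
        },
        # One process per GPU on each host.
        "mpi": {"enabled": True, "processes_per_host": GPUS_PER_INSTANCE[instance_type]},
    }


distribution = build_distribution(
    "ml.p4d.24xlarge",
    {
        "auto_partition": True,
        "shard_optimizer_state": True,
        "tensor_parallel_degree": 4,    # illustrative value
        "pipeline_parallel_degree": 1,  # illustrative value
        "ddp": True,
    },
)
```

The resulting dictionary would be passed as the distribution argument of the estimator, together with the instance type and count.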

GPT-J model training and tuning on SageMaker with the SMP library

A working step-by-step code sample is available on the Amazon SageMaker Examples public repository. Navigate to the training/distributed_training/pytorch/model_parallel/gpt-j folder and open the train_gptj_smp_tensor_parallel_notebook.ipynb Jupyter notebook for the tensor parallelism example or train_gptj_smp_notebook.ipynb for the pipeline parallelism example. You can find a code walkthrough in our Generative AI on Amazon SageMaker workshop.

This notebook walks you through how to use the tensor parallelism features provided by the SageMaker model parallelism library. You’ll learn how to run FP16 training of the GPT-J model with tensor parallelism and pipeline parallelism on the GLUE sst2 dataset.

Summary

The SageMaker model parallel library offers several features that help you reduce cost and speed up LLM training on SageMaker. You can also learn from and run sample code for BERT, GPT-2, and GPT-J on the Amazon SageMaker Examples public repository. To learn more about AWS best practices for training LLMs using the SMP library, refer to the following resources:

To learn how one of our customers achieved low-latency GPT-J inference on SageMaker, refer to How Mantium achieves low-latency GPT-J inference with DeepSpeed on Amazon SageMaker.

If you’re looking to accelerate time-to-market of your LLMs and reduce your costs, SageMaker can help. Let us know what you build!


About the Authors

Zmnako Awrahman, PhD, is a Practice Manager, ML SME, and Machine Learning Technical Field Community (TFC) member at Global Competency Center, Amazon Web Services. He helps customers leverage the power of the cloud to extract value from their data with data analytics and machine learning.

Roop Bains is a Senior Machine Learning Solutions Architect at AWS. He is passionate about helping customers innovate and achieve their business objectives using artificial intelligence and machine learning. He helps customers train, optimize, and deploy deep learning models.

Anastasia Pachni Tsitiridou is a Solutions Architect at AWS. Anastasia lives in Amsterdam and supports software businesses across the Benelux region in their cloud journey. Prior to joining AWS, she studied electrical and computer engineering with a specialization in computer vision. What she enjoys most nowadays is working with very large language models.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including NLP and computer vision domains. He helps customers achieve high-performance model inference on SageMaker.

Wioletta Stobieniecka is a Data Scientist at AWS Professional Services. Throughout her professional career, she has delivered multiple analytics-driven projects for different industries such as banking, insurance, telco, and the public sector. Her knowledge of advanced statistical methods and machine learning is combined with business acumen. She brings recent AI advancements to create value for customers.

Rahul Huilgol is a Senior Software Development Engineer in Distributed Deep Learning at Amazon Web Services.


Meet the Maker: Software Engineer Ramps Up NVIDIA Jetson to Build Self-Driving Skate Park

Kirk Kaiser

Kirk Kaiser grew up a fan of the video game Paperboy, where players act as cyclists delivering newspapers while encountering various obstacles, like ramps that appear in the middle of the street.

This was the inspiration behind the software developer’s latest project using the NVIDIA Jetson platform for edge AI and robotics — a self-driving skate ramp.

“I wanted the absurdity and fun of Paperboy to be a part of my life,” said Kaiser, an avid skateboarder based in Naples, Fla. “I was boarding one day with my dog Benji running beside me and I was like, ‘What if I had ramps that came with me?’”

He’s now building just that — technology that could lead to a portable, autonomous skate park.

So far, he’s developed an electric platform that can elevate a ramp and make it level with the ground. It’s steerable using a PS4 controller linked via Bluetooth to an NVIDIA Jetson Nano Developer Kit.

Now, he’s collecting data to train AI models that’ll enable the platform to recognize streets and obstacles — and eventually become fully autonomous — with the help of the new NVIDIA Jetson Orin Nano Developer Kit.

It’s a project for when he isn’t engrossed in his work as the head of developer relations at Gitpod, a startup that provides cloud development environments for software makers.

About the Maker

Kaiser learned software engineering at a young age and received a scholarship to a prestigious high school specializing in tech. There, he honed his programming skills before taking time in his early adulthood to see and experience the world in completely different ways.

At 18 years old, he packed a bag and lived for a year in a wildlife refuge in Costa Rica, where he worked on a permaculture farm, growing food and collecting rainwater to drink. Relocating to Vermont, Kaiser then spent a year farming with a Zen Buddhist before hiking 1,000 miles of Appalachian Trail, passing through four states.

Upon leaving the trail, Kirk launched a travel website, got his first software job at a cosmetics company, and worked in R&D for a lighting company before rekindling his passion for software engineering as a way to provide for his family — including his now four-year-old son.

His Inspiration

Before all of this, skateboarding was Kaiser’s greatest love. “I just wanted to skateboard as a kid,” he said. “I wanted to maximize the amount of time I could spend skateboarding.”

He built his own skate parks growing up, which made him familiar with the mechanics of building a wooden ramp — knowledge that came in handy when building the foundation of his latest Jetson-powered project.

And to inspire others to embark on inventive projects with technology, Kaiser authored Make Art With Python, a step-by-step introduction to programming for creative people.

He was spurred to write the book while talking to high school students at a biohacker bootcamp in New York.

“What the high schoolers said blew my mind — they basically thought that software engineering was for overachievers,” he said. “So I wanted to write a book that would convince younger people that programming is fundamentally a platform for creating worlds, and it can be for anyone, which is a really exciting thing.”

His Favorite Jetson Projects

Kaiser kicked off his self-driving skate park project 18 months ago, intending to start with a ramp about the size of a golf cart. The electrical components needed to steer it were prohibitively expensive, however, and getting such a large platform to break along two axes of rotation was incredibly challenging, he said.

Rescaling the project to the size of a skateboard itself, Kaiser bought a welder and a metal brake, learned how to use both tools for the first time, and built a platform that can raise and lower, as well as accept any kind of ramp.

It’s fully steerable along both axes thanks to the edge capabilities of NVIDIA Jetson. And the developer’s now training the platform’s self-driving features using Robot Operating System repositories available through the NVIDIA Isaac platform for accelerated, AI-powered robotics.

“In the machine learning space, NVIDIA is really the only show in town,” he said. “The Jetson platform is the industry standard for edge AI, and its compatibility with other development platforms and the onboard GPU are huge pluses.”

Kaiser dives deeper into the technical aspects of his skate ramp project on his blog.

The developer’s other favorite projects using the NVIDIA Jetson platform include training an AI model for turning lights off and on using a dab and T-pose, as well as creating an AI-powered camera for bird-watching.

“The acceleration of smaller-scale robotics is becoming more accessible to everyone,” Kaiser said, “which is really exciting because I think robotics is so damn cool.”

Go along for the ride by keeping up with Kaiser’s work, and learn more about the NVIDIA Jetson platform.


Matching Latent Encoding for Audio-Text based Keyword Spotting

Using audio and text embeddings jointly for Keyword Spotting (KWS) has shown high-quality results, but the key challenge of how to semantically align two embeddings for multi-word keywords of different sequence lengths remains largely unsolved. In this paper, we propose an audio-text-based end-to-end model architecture for flexible keyword spotting (KWS), which builds upon learned audio and text embeddings. Our architecture uses a novel dynamic programming-based algorithm, Dynamic Sequence Partitioning (DSP), to optimally partition the audio sequence into the same length as the…Apple Machine Learning Research

Semi-Supervised and Long-Tailed Object Detection with CascadeMatch

This paper focuses on long-tailed object detection in the semi-supervised learning setting, which poses realistic challenges, but has rarely been studied in the literature. We propose a novel pseudo-labeling-based detector called CascadeMatch. Our detector features a cascade network architecture, which has multi-stage detection heads with progressive confidence thresholds. To avoid manually tuning the thresholds, we design a new adaptive pseudo-label mining mechanism to automatically identify suitable values from data. To mitigate confirmation bias, where a model is negatively reinforced by…Apple Machine Learning Research

Near-Optimal Algorithms for Private Online Optimization in the Realizable Regime

*=Equal Contributors
We consider online learning problems in the realizable setting, where there is a zero-loss solution, and propose new Differentially Private (DP) algorithms that obtain near-optimal regret bounds. For the problem of online prediction from experts, we design new algorithms that obtain near-optimal regret where is the number of experts. This significantly improves over the best existing regret bounds for the DP non-realizable setting which are . We also develop an adaptive algorithm for the small-loss setting with regret where is the total loss of the best expert…Apple Machine Learning Research

Approximate Nearest Neighbor Phrase Mining for Contextual Speech Recognition

This paper presents an extension to train end-to-end Context-Aware Transformer Transducer (CATT) models by using a simple, yet efficient method of mining hard negative phrases from the latent space of the context encoder. During training, given a reference query, we mine a number of similar phrases using approximate nearest neighbour search. These sampled phrases are then used as negative examples in the context list alongside random and ground truth contextual information. By including approximate nearest neighbour phrases (ANN-P) in the context list, we encourage the learned representation…Apple Machine Learning Research