Decoding How NVIDIA RTX AI PCs and Workstations Tap the Cloud to Supercharge Generative AI

Editor’s note: This post is part of the AI Decoded series, which demystifies AI by making the technology more accessible, and showcases new hardware, software, tools and accelerations for GeForce RTX PC and RTX workstation users.

Generative AI is enabling new capabilities for Windows applications and games. It’s powering unscripted, dynamic NPCs, it’s enabling creators to generate novel works of art, and it’s helping gamers boost frame rates by up to 4x. But this is just the beginning.

As the capabilities and use cases for generative AI continue to grow, so does the demand for compute to support it.

Hybrid AI combines the onboard AI acceleration of NVIDIA RTX with scalable, cloud-based GPUs to effectively and efficiently meet the demands of AI workloads.

Hybrid AI, a Love Story

With growing AI adoption, app developers are looking for deployment options: AI running locally on RTX GPUs delivers high performance and low latency, and is always available — even when not connected to the internet. On the other hand, AI running in the cloud can run larger models and scale across many GPUs, serving multiple clients simultaneously. In many cases, a single application will use both.

Hybrid AI is a kind of matchmaker that harmonizes local PC and workstation compute with cloud scalability. It provides the flexibility to optimize AI workloads based on specific use cases, cost and performance. It helps developers ensure that AI tasks run where it makes the most sense for their specific applications.

Whether the AI is running locally or in the cloud, it gets accelerated by NVIDIA GPUs and NVIDIA’s AI stack, including TensorRT and TensorRT-LLM. That means less time staring at pinwheels of death and more opportunity to deliver cutting-edge, AI-powered features to users.

A range of NVIDIA tools and technologies support hybrid AI workflows for creators, gamers, and developers.

Dream in the Cloud, Bring to Life on RTX

Generative AI has demonstrated its ability to help artists ideate, prototype and brainstorm new creations. One such solution, the cloud-based Generative AI by iStock — powered by NVIDIA Edify — is a generative photography service that was built for and with artists, training only on licensed content and with compensation for artist contributors.

Generative AI by iStock goes beyond image generation, providing artists with extensive tools to explore styles, variations, modify parts of an image or expand the canvas. With all these tools, artists can ideate numerous times and still bring ideas to life quickly.

Once the creative concept is ready, artists can bring it back to their local systems. RTX-powered PCs and workstations offer artists AI acceleration in more than 125 of the top creative apps to realize the full vision, whether it’s creating an amazing piece of artwork in Photoshop with local AI tools, animating the image with a parallax effect in DaVinci Resolve, or building a 3D scene with the reference image in Blender with ray tracing acceleration and AI denoising in OptiX.

Hybrid ACE Brings NPCs to Life

Hybrid AI is also enabling a new realm of interactive PC gaming with NVIDIA ACE, allowing game developers and digital creators to integrate state-of-the-art generative AI models into digital avatars on RTX AI PCs.

Powered by AI neural networks, NVIDIA ACE lets developers and designers create non-playable characters (NPCs) that can understand and respond to human player text and speech. It leverages AI models, including speech-to-text models to handle natural language spoken aloud, to generate NPCs’ responses in real time.

A Hybrid Developer Tool That Runs Anywhere

Hybrid AI also helps developers build and tune new AI models. NVIDIA AI Workbench helps developers quickly create, test and customize pretrained generative AI models and LLMs on RTX GPUs. It offers streamlined access to popular repositories like Hugging Face, GitHub and NVIDIA NGC, along with a simplified user interface that enables data scientists and developers to easily reproduce, collaborate on and migrate projects.

Projects can be easily scaled up when additional performance is needed — whether to the data center, a public cloud or NVIDIA DGX Cloud — and then brought back to local RTX systems on a PC or workstation for inference and light customization. Data scientists and developers can leverage pre-built Workbench projects to chat with documents using retrieval-augmented generation (RAG), customize LLMs using fine-tuning, accelerate data science workloads with seamless CPU-to-GPU transitions and more.

The Hybrid RAG Workbench project provides a customizable RAG application that developers can run and adapt themselves. They can embed their documents locally and run inference either on a local RTX system, a cloud endpoint hosted on NVIDIA’s API catalog or using NVIDIA NIM microservices. The project can be adapted to use various models, endpoints and containers, and provides the ability for developers to quantize models to run on their GPU of choice.

NVIDIA GPUs power remarkable AI solutions locally on NVIDIA GeForce RTX PCs and RTX workstations and in the cloud. Creators, gamers and developers can get the best of both worlds with growing hybrid AI workflows.

Generative AI is transforming gaming, videoconferencing and interactive experiences of all kinds. Make sense of what’s new and what’s next by subscribing to the AI Decoded newsletter.

Tidy Tech: How Two Stanford Students Are Building Robots for Handling Household Chores

Imagine having a robot that could help you clean up after a party or fold heaps of laundry. Chengshu Eric Li and Josiah David Wong, two Stanford University Ph.D. students advised by renowned American computer scientist Professor Fei-Fei Li, are making that a dream come true. In this episode of the AI Podcast, host Noah Kravitz spoke with the two about their project, BEHAVIOR-1K, which aims to enable robots to perform 1,000 household chores, including picking up fallen objects or cooking. To train the robots, they’re using the NVIDIA Omniverse platform, as well as reinforcement and imitation learning techniques. Listen to hear more about the breakthroughs and challenges Li and Wong experienced along the way.

Stay tuned for more AI Podcast episodes recorded live from GTC.

Time Stamps

3:33: Background on the BEHAVIOR-1K project

5:00: Why use a simulated environment to train robots? 

6:48: Why build a new simulation engine instead of using an existing one? 

10:48: The process of training the robots to perform household chores

14:04: Some of the most complex tasks taught to the robots

19:07: How are large language models and large vision models affecting the progress of robotics?

24:09: What’s next for the project?  

You Might Also Like…

NVIDIA’s Annamalai Chockalingam on the Rise of LLMs – Ep. 206

Generative AI and large language models (LLMs) are stirring change across industries — but according to NVIDIA Senior Product Manager of Developer Marketing Annamalai Chockalingam, “we’re still in the early innings.” In the latest episode of NVIDIA’s AI Podcast, host Noah Kravitz spoke with Chockalingam about LLMs: what they are, their current state and their future potential.

How GlüxKind Created Ella, the AI-Powered Smart Stroller – Ep. 193

Imagine a stroller that can drive itself, help users up hills, brake on slopes and provide alerts of potential hazards. That’s what GlüxKind has done with Ella, an award-winning smart stroller that uses the NVIDIA Jetson edge AI and robotics platform to power its AI features.

GANTheftAuto: Harrison Kinsley on AI-Generated Gaming Environments – Ep. 151

Machines have long played games – think of Deep Blue or AlphaGo. Now they’re building them. GANTheftAuto creator Harrison Kinsley talks about his creation on the latest episode of the AI Podcast.

NVIDIA’s Liila Torabi Talks the New Era of Robotics Through Isaac Sim – Ep. 147

Robots are not just limited to the assembly line. At NVIDIA, Liila Torabi works on making the next generation of robotics possible. Torabi is the senior product manager for Isaac Sim, a robotics and AI simulation platform powered by NVIDIA Omniverse. Torabi spoke with NVIDIA AI Podcast host Noah Kravitz about the new era of robotics, one driven by making robots smarter through AI.

Subscribe to the AI Podcast

Get the AI Podcast through iTunes, Google Play, Amazon Music, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, Soundcloud, Spotify, Stitcher and TuneIn.

Make the AI Podcast better: Have a few minutes to spare? Fill out this listener survey.

Affine-based Deformable Attention and Selective Fusion for Semi-dense Matching

This paper was accepted at the Image Matching: Local Features & Beyond workshop at CVPR 2024.
Identifying robust and accurate correspondences across images is a fundamental problem in computer vision that enables various downstream tasks. Recent semi-dense matching methods emphasize the effectiveness of fusing relevant cross-view information through Transformer. In this paper, we propose several improvements upon this paradigm. Firstly, we introduce affine-based local attention to model cross-view deformations. Secondly, we present selective fusion to merge local and global messages from… (Apple Machine Learning Research)

NVIDIA Scoops Up Wins at COMPUTEX Best Choice Awards

Building on more than a dozen years of stacking wins at the COMPUTEX trade show’s annual Best Choice Awards, NVIDIA was today honored with BCAs for its latest technologies.

The NVIDIA GH200 Grace Hopper Superchip won the Computer and System Category Award; the NVIDIA Spectrum-X AI Ethernet networking platform won the Networking and Communication Category Award; and the NVIDIA AI Enterprise software platform won a Golden Award.

The awards — judged on the functionality, innovation and market potential of products exhibited at the leading computer and technology expo — were announced ahead of the show, which runs from June 4-7, in Taipei.

NVIDIA founder and CEO Jensen Huang will deliver a COMPUTEX keynote address on Sunday, June 2, at 7 p.m. Taiwan time, at the NTU Sports Center and online.

NVIDIA AI Enterprise Takes Gold

NVIDIA AI Enterprise — a cloud-native software platform that streamlines the development and deployment of copilots and other generative AI applications — won a Golden Award.

The platform lifts the burden of maintaining and securing complex AI software, so businesses can focus on building and harnessing the technology’s game-changing insights.

Microservices that come with NVIDIA AI Enterprise — including NVIDIA NIM and NVIDIA CUDA-X — optimize model performance and run anywhere with enterprise-grade security, support and stability, offering users a smooth transition from prototype to production.

Plus, the platform’s ability to improve AI performance results in better overall utilization of computing resources. This means companies using NVIDIA AI Enterprise need fewer servers to support the same workloads, greatly reducing their energy costs and data center footprint.

More BCA Wins for NVIDIA Technologies

NVIDIA GH200 and Spectrum-X were named best in their respective categories.

The NVIDIA GH200 Grace Hopper Superchip is the world’s first truly heterogeneous accelerated platform for AI and high-performance computing workloads. It combines the power-efficient NVIDIA Grace CPU with an NVIDIA Hopper architecture-based GPU over a high-bandwidth 900GB/s coherent NVIDIA NVLink chip-to-chip interconnect.

The superchip — shipping worldwide and powering more than 40 AI supercomputers across global research centers, system makers and cloud providers — supercharges scientific innovation with accelerated computing and scale-out solutions for AI inference, large language models, recommenders, vector databases, HPC applications and more.

The Spectrum-X platform, featuring NVIDIA Spectrum SN5600 switches and NVIDIA BlueField-3 SuperNICs, is the world’s first Ethernet fabric built for AI, accelerating generative AI network performance 1.6x over traditional Ethernet fabrics.

It can serve as the backend AI fabric for any AI cloud or large enterprise deployment, and is available from major server manufacturers as part of the full NVIDIA AI stack.

NVIDIA Partners Recognized

Other BCA winners include NVIDIA partners Acer, ASUS, MSI and YUAN, which were given Golden Awards for their respective laptops, gaming motherboards and smart-city applications — all powered by NVIDIA technologies, such as NVIDIA GeForce RTX 4090 GPUs, the NVIDIA Studio platform for creative workflows and the NVIDIA Jetson platform for edge AI and robotics.

ASUS also won a Computer and System Category Award, while MSI won a Gaming and Entertainment Category Award.

Learn more about the latest generative AI, HPC and networking technologies by joining NVIDIA at COMPUTEX.

Efficient Diffusion Models without Attention

Transformers have demonstrated impressive performance on class-conditional ImageNet benchmarks, achieving state-of-the-art FID scores. However, their computational complexity increases with transformer depth/width or the number of input tokens and requires patchy approximation to operate on even latent input sequences. In this paper, we address these issues by presenting a novel approach to enhance the efficiency and scalability of image generation models, incorporating state space models (SSMs) as the core component and deviating from the widely adopted transformer-based and U-Net… (Apple Machine Learning Research)

ODGEN: Domain-specific Object Detection Data Generation with Diffusion Models

Modern diffusion-based image generative models have made significant progress and become promising to enrich training data for the object detection task. However, the generation quality and the controllability for complex scenes containing multi-class objects and dense objects with occlusions remain limited. This paper presents ODGEN, a novel method to generate high-quality images conditioned on bounding boxes, thereby facilitating data synthesis for object detection. Given a domain-specific object detection dataset, we first fine-tune a pre-trained diffusion model on both cropped foreground… (Apple Machine Learning Research)

Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods, and Applications

We consider the task of animating 3D facial geometry from speech signal. Existing works are primarily deterministic, focusing on learning a one-to-one mapping from speech signal to 3D face meshes on small datasets with limited speakers. While these models can achieve high-quality lip articulation for speakers in the training set, they are unable to capture the full and diverse distribution of 3D facial motions that accompany speech in the real world. Importantly, the relationship between speech and facial motion is one-to-many, containing both inter-speaker and intra-speaker variations and… (Apple Machine Learning Research)

KPConvX: Modernizing Kernel Point Convolution with Kernel Attention

In the field of deep point cloud understanding, KPConv is a unique architecture that uses kernel points to locate convolutional weights in space, instead of relying on Multi-Layer Perceptron (MLP) encodings. While it initially achieved success, it has since been surpassed by recent MLP networks that employ updated designs and training strategies. Building upon the kernel point principle, we present two novel designs: KPConvD (depthwise KPConv), a lighter design that enables the use of deeper architectures, and KPConvX, an innovative design that scales the depthwise convolutional weights of… (Apple Machine Learning Research)

Swallowing the Bitter Pill: Simplified Scalable Conformer Generation

We present a novel way to predict molecular conformers through a simple formulation that sidesteps many of the heuristics of prior works and achieves state of the art results by using the advantages of scale. By training a diffusion generative model directly on 3D atomic positions without making assumptions about the explicit structure of molecules (e.g. modeling torsional angles) we are able to radically simplify structure learning, and make it trivial to scale up the model sizes. This model, called Molecular Conformer Fields (MCF), works by parameterizing conformer structures as functions… (Apple Machine Learning Research)

Accelerate Mixtral 8x7B pre-training with expert parallelism on Amazon SageMaker

Mixture of Experts (MoE) architectures for large language models (LLMs) have recently gained popularity due to their ability to increase model capacity and computational efficiency compared to fully dense models. By utilizing sparse expert subnetworks that process different subsets of tokens, MoE models can effectively increase the number of parameters while requiring less computation per token during training and inference. This enables more cost-effective training of larger models within fixed compute budgets compared to dense architectures.

Despite their computational benefits, training and fine-tuning large MoE models efficiently presents some challenges. MoE models can struggle with load balancing if the tokens aren’t evenly distributed across experts during training, and some experts may become overloaded while others are under-utilized. MoE models have high memory requirements, because all expert parameters need to be loaded into memory even though only a subset is used for each input.
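To make the load-balancing problem concrete, the following is a minimal sketch that counts how many tokens each expert receives under top-2 routing. It is not SMP or Mixtral code: the router scores are random numbers, and the function name top_k_expert_load and its skew parameter (a bias added to expert 0 to mimic an unbalanced router) are hypothetical.

```python
import random


def top_k_expert_load(num_tokens, num_experts=8, top_k=2, skew=0.0, seed=0):
    """Count tokens per expert under top-k routing with random router scores.

    `skew` is added to the score of expert 0 for every token, which mimics
    a router that has learned to favor one expert.
    """
    rng = random.Random(seed)
    load = [0] * num_experts
    for _ in range(num_tokens):
        scores = [rng.gauss(0.0, 1.0) for _ in range(num_experts)]
        scores[0] += skew
        # Pick the top_k experts with the highest scores for this token.
        ranked = sorted(range(num_experts), key=lambda e: scores[e], reverse=True)
        for expert in ranked[:top_k]:
            load[expert] += 1
    return load


balanced = top_k_expert_load(1000)
skewed = top_k_expert_load(1000, skew=2.0)
print("balanced router:", balanced)
print("skewed router:  ", skewed)
```

In both runs, every token is assigned to two experts, so the counts always sum to twice the number of tokens. With the skewed router, expert 0 takes a much larger share of that total, which is the kind of imbalance that load-balancing strategies in the router aim to prevent.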

In this post, we highlight new features of the Amazon SageMaker model parallelism library (SMP) that enable efficient training of MoE models using expert parallelism. Expert parallelism is a type of parallelism that splits the experts of an MoE model across separate workers or devices, similar to how tensor parallelism can partition dense model layers. We demonstrate how to use these new SMP features by pre-training the 47 billion parameter Mixtral 8x7B MoE model using expert parallelism. To learn more, refer to our GitHub repo and Expert parallelism.

Expert parallelism

The Mixtral 8x7B model has a sparse MoE architecture, containing eight expert subnetworks with around 7 billion parameters each. A trainable gate network called a router determines which input tokens are sent to which expert. With this architecture, the experts specialize in processing different aspects of the input data. The complete Mixtral 8x7B model has a total of 47 billion parameters, but only around 12.9 billion (two experts, for this model architecture) are activated for any given input token; this results in improved computational efficiency relative to a dense model of the same total size. To learn more about the MoE architecture in general, refer to Applying Mixture of Experts in LLM Architectures.
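The total and active parameter counts can be checked with some back-of-the-envelope arithmetic. The following sketch uses the layer sizes that appear later in this post's model configuration (hidden size 4096, intermediate size 14336, 32 layers, 8 key-value heads, vocabulary size 32000). It assumes the standard Mixtral layer layout without bias terms and with untied input and output embeddings; the function name mixtral_param_counts is hypothetical.

```python
def mixtral_param_counts(
    vocab_size=32000,
    hidden_size=4096,
    intermediate_size=14336,
    num_layers=32,
    num_heads=32,
    num_kv_heads=8,
    num_experts=8,
    experts_per_token=2,
):
    """Return (total, active-per-token) parameter counts for a Mixtral-style model."""
    head_dim = hidden_size // num_heads
    kv_dim = num_kv_heads * head_dim
    # Query and output projections are hidden x hidden; key and value
    # projections are smaller because of grouped-query attention.
    attention = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim
    # Each expert is a gated feed-forward block with three weight matrices.
    expert = 3 * hidden_size * intermediate_size
    router = hidden_size * num_experts
    norms = 2 * hidden_size  # two RMSNorm weight vectors per layer
    # Parameters every token passes through, regardless of routing.
    shared = (
        num_layers * (attention + router + norms)
        + 2 * vocab_size * hidden_size  # input embeddings and output head
        + hidden_size  # final norm
    )
    total = shared + num_layers * num_experts * expert
    active = shared + num_layers * experts_per_token * expert
    return total, active


total, active = mixtral_param_counts()
print(f"total:  {total / 1e9:.1f}B")   # total:  46.7B
print(f"active: {active / 1e9:.1f}B")  # active: 12.9B
```

The attention layers and embeddings are shared by all experts, which is why the total comes to roughly 47 billion parameters rather than 8 x 7 = 56 billion.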

SMP adds support for expert parallelism

SMP now supports expert parallelism, which is essential to performant MoE model training. With expert parallelism, different expert subnetworks that comprise the MoE layers are placed on separate devices. During training, different data is routed to the different devices, with each device handling the computation for the experts it contains. By distributing experts across workers, expert parallelism addresses the high memory requirements of loading all experts on a single device and enables MoE training on a larger cluster. The following figure offers a simplified look at how expert parallelism works on a multi-GPU cluster.

The SMP library uses NVIDIA Megatron to implement expert parallelism and support training MoE models, and runs on top of PyTorch Fully Sharded Data Parallel (FSDP) APIs. You can keep using your PyTorch FSDP training code as is and activate SMP expert parallelism for training MoE models. SMP offers a simplified workflow where you need to specify the expert_parallel_degree parameter, which will evenly divide experts across the number of GPUs in your cluster. For example, to shard your model while using an instance with 8 GPUs, you can set the expert_parallel_degree to 2, 4, or 8. We recommend that you start with a small number and gradually increase it until the model fits in the GPU memory.
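As an illustration of what the degree means, the following sketch maps experts to expert-parallel ranks. The contiguous assignment shown here is an assumption made for illustration; SMP's actual placement is internal to the library, and the function name experts_per_rank is hypothetical.

```python
def experts_per_rank(num_experts, expert_parallel_degree):
    """Map each expert-parallel rank to the expert indices it would host."""
    if num_experts % expert_parallel_degree != 0:
        raise ValueError("expert_parallel_degree must divide the number of experts")
    per_rank = num_experts // expert_parallel_degree
    return {
        rank: list(range(rank * per_rank, (rank + 1) * per_rank))
        for rank in range(expert_parallel_degree)
    }


for degree in (2, 4, 8):
    print(degree, experts_per_rank(8, degree))
```

A higher degree places fewer experts on each device, which lowers the per-device memory needed for expert weights at the cost of more communication between devices.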

SMP’s expert parallelism is compatible with sharded data parallelism

SMP’s expert parallel implementation is compatible with sharded data parallelism, enabling more memory-efficient and faster training. To understand how this works, consider an MoE model in the following example with eight experts (N=8) training on a simple cluster with one node containing 4 GPUs.

SMP’s expert parallelism splits the MoE experts across GPUs. You control how many experts are instantiated on each device by using the expert_parallel_degree parameter. For example, if you set the degree to 2, SMP will assign half of the eight experts to each data parallel group. The degree value must be a factor of the number of GPUs in your cluster and the number of experts in your model. Data is dynamically routed to and from the GPU or GPUs hosting the selected expert using all-to-all GPU communication.
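The routing step can be pictured as grouping each (token, expert) pair by the rank that hosts the expert. The following sketch shows only that bookkeeping in plain Python; it performs no GPU communication, it assumes experts are placed contiguously across ranks, and the function name dispatch and the sample routing table are hypothetical.

```python
from collections import defaultdict


def dispatch(token_to_experts, experts_per_ep_rank):
    """Group (token, expert) pairs by the expert-parallel rank hosting the expert."""
    buckets = defaultdict(list)
    for token_id, experts in token_to_experts.items():
        for expert in experts:
            # With contiguous placement, integer division gives the host rank.
            buckets[expert // experts_per_ep_rank].append((token_id, expert))
    return dict(buckets)


# Three tokens, each routed to two of eight experts; four experts per rank.
routing = {0: [1, 6], 1: [0, 3], 2: [5, 7]}
send_buffers = dispatch(routing, experts_per_ep_rank=4)
print(send_buffers)
```

In a real all-to-all exchange, every rank builds buffers like these for its own tokens, sends each buffer to the matching rank, and later receives the expert outputs back in the reverse direction.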

Next, sharded data parallelism partitions and distributes the experts as well as the non-MoE layers of the model, like attention or routers, across your cluster to reduce the memory footprint of the model. The hybrid_shard_degree parameter controls this. For example, a hybrid_shard_degree of 2 will shard the model states (including experts and non-MoE layers) across half of the GPUs in our cluster. The product of expert_parallel_degree and hybrid_shard_degree should not exceed the world size of the cluster. In the following example, hybrid_shard_degree * expert_parallel_degree = 4 is a valid configuration.
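The constraints described in this section can be written down as a small pre-flight check. The following sketch encodes only the rules stated above (the expert parallel degree must be a factor of both the GPU count and the expert count, and the product of the two degrees must not exceed the world size). The function name check_parallel_config is hypothetical, and SMP may enforce additional rules that are not covered here.

```python
def check_parallel_config(num_gpus, num_experts, expert_parallel_degree, hybrid_shard_degree):
    """Return a list of problems with the requested degrees (empty if none)."""
    problems = []
    if num_gpus % expert_parallel_degree != 0:
        problems.append("expert_parallel_degree must be a factor of the number of GPUs")
    if num_experts % expert_parallel_degree != 0:
        problems.append("expert_parallel_degree must be a factor of the number of experts")
    if expert_parallel_degree * hybrid_shard_degree > num_gpus:
        problems.append("expert_parallel_degree * hybrid_shard_degree exceeds the world size")
    return problems


# The example from this section: eight experts on one node with 4 GPUs.
print(check_parallel_config(4, 8, expert_parallel_degree=2, hybrid_shard_degree=2))  # []
print(check_parallel_config(4, 8, expert_parallel_degree=3, hybrid_shard_degree=2))
```

Running a check like this before launching a training job is cheaper than discovering an invalid combination after the cluster has been provisioned.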

Solution overview

With the background out of the way, let’s dig into the components of our distributed training architecture. The following diagram illustrates the solution architecture.

In this example, we use SageMaker training jobs. With SageMaker training jobs, you can launch and manage clusters of high-performance instances with simple API calls. For example, you can use the SageMaker Estimator to specify the type and quantity of instances to use in your distributed systems with just a few lines of code. Later in this post, we use a cluster of two ml.p4d.24xlarge instances to train our model by specifying these parameters in our Estimator. To learn about SageMaker training jobs, see Train a Model with Amazon SageMaker.
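As a rough sketch of what specifying these parameters can look like, the following code assembles the instance settings and a distribution configuration as a plain dictionary. The instance_type and instance_count names match the SageMaker Estimator parameters; the nested layout of the distribution dictionary is an assumption based on the SMP v2 documentation and should be checked against it. The helper build_estimator_kwargs and the degree values are hypothetical and chosen only for illustration.

```python
def build_estimator_kwargs(instance_type, instance_count,
                           expert_parallel_degree, hybrid_shard_degree):
    """Assemble keyword arguments intended for a SageMaker PyTorch estimator."""
    return {
        "instance_type": instance_type,
        "instance_count": instance_count,
        # Assumed layout for enabling SMP v2; verify against the SMP documentation.
        "distribution": {
            "torch_distributed": {"enabled": True},
            "smdistributed": {
                "modelparallel": {
                    "enabled": True,
                    "parameters": {
                        "expert_parallel_degree": expert_parallel_degree,
                        "hybrid_shard_degree": hybrid_shard_degree,
                    },
                }
            },
        },
    }


# Two ml.p4d.24xlarge instances provide 16 GPUs in total (8 per instance).
estimator_kwargs = build_estimator_kwargs(
    "ml.p4d.24xlarge", 2, expert_parallel_degree=2, hybrid_shard_degree=8
)
print(estimator_kwargs["distribution"]["smdistributed"]["modelparallel"]["parameters"])
```

With the SageMaker Python SDK installed, these keyword arguments would be passed to the PyTorch estimator together with settings that are not shown here, such as the training script, the IAM role and the framework version.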

In this post, we use the SMP library to efficiently distribute the workload across the cluster using hybrid sharded data parallelism and expert parallelism. In addition to these implementations, SMP offers many other performance-improving and memory-saving techniques, such as:

  • Mixed precision training and fp8 support for dense Llama models (which accelerates distributed training and takes advantage of the performance improvements on P5 instances)
  • Tensor parallelism composable with sharded data parallelism
  • Delayed parameter initialization
  • Activation checkpointing (a technique to reduce memory usage by clearing activations of certain layers and recomputing them during the backward pass)

For the latest updates, refer to SageMaker model parallelism library v2.

Along with SMP, this example also uses the SageMaker distributed data parallel library (SMDDP). As you scale your workload and add instances to your cluster, the overhead of communication between instances also increases, which can lead to a drop in overall computational performance and training efficiency. This is where SMDDP helps. SMDDP includes optimized communication collectives such as AllGather that are designed for AWS network infrastructure. Because of this, SMDDP can outperform other more general communications libraries such as NCCL when training on SageMaker.

Together, the SMP and SMDDP libraries can accelerate large distributed training workloads by up to 20%. Additionally, these libraries are compatible with standard PyTorch APIs and capabilities, which makes it convenient to adapt any existing PyTorch FSDP training script to the SageMaker training platform and take advantage of the performance improvements that SMP and SMDDP provide. To learn more, see SageMaker model parallelism library v2 and Run distributed training with the SageMaker distributed data parallelism library.

In the following sections, we showcase how you can accelerate distributed training of the Hugging Face Transformers Mixtral 8x7B model on P4 instances using SMP and SMDDP.

Prerequisites

You need to complete some prerequisites before you can run the Mixtral notebook.

First, make sure you have created a Hugging Face access token so you can download the Hugging Face tokenizer to be used later. After you have the access token, you need to make a few quota increase requests for SageMaker. You need to request a minimum of 2 P4d instances ranging to a maximum of 8 P4d instances (depending on time-to-train and cost-to-train trade-offs for your use case).

On the Service Quotas console, request the following SageMaker quotas:

  • P4 instances (ml.p4d.24xlarge) for training job usage: 2–8

It may take up to 24 hours for the quota increase to get approved.

Now that you’re ready to begin the process to pre-train the Mixtral model, we start with dataset preparation in the next step.

Prepare the dataset

We begin our tutorial by preparing the dataset. This covers loading the GLUE/SST2 dataset, tokenizing and chunking the dataset, and configuring the data channels for SageMaker training on Amazon Simple Storage Service (Amazon S3). Complete the following steps:

  1. You first need to load the GLUE/SST2 dataset and split it into training and validation datasets:
    from datasets import load_dataset

    hyperparameters = {
        "cache_dir": "tmp",
        "dataset_config_name": "sst2",
        "dataset_name": "glue",
        "do_train": True,
        "do_eval": True,
    }
    
    # Load the GLUE/SST2 dataset from the Hugging Face Hub.
    raw_datasets = load_dataset(
        hyperparameters["dataset_name"],
        hyperparameters["dataset_config_name"],
    )
    
    # Drop the validation split that ships with the dataset;
    # it is rebuilt from the training split below.
    del raw_datasets["validation"]
    
    if "validation" not in raw_datasets.keys():
        validation_percentage = "10%"
    
        # The first 10% of the training split becomes the validation set.
        raw_datasets["validation"] = load_dataset(
            hyperparameters["dataset_name"],
            hyperparameters["dataset_config_name"],
            split=f"train[:{validation_percentage}]",
            cache_dir=hyperparameters["cache_dir"],
        )
    
        # The remaining 90% is used for training.
        raw_datasets["train"] = load_dataset(
            hyperparameters["dataset_name"],
            hyperparameters["dataset_config_name"],
            split=f"train[{validation_percentage}:]",
            cache_dir=hyperparameters["cache_dir"],
        )

  2. Load the Mixtral-8x7B tokenizer from the Hugging Face Transformers library:
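    from transformers import AutoTokenizer  # tokenizer_kwargs is assumed to be defined earlier in the notebook
    tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1", **tokenizer_kwargs)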

Next, you define two utility functions: tokenize_function() and group_texts(). The tokenize_function() runs the tokenizer on the text data. The group_texts() function concatenates all texts from the dataset and generates chunks of a block size that corresponds to the model’s input length (2048) for this example. By chunking the text data into smaller pieces, you make sure the model can process the entire dataset during training, even if some text examples are longer than the input length (2048).
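To see what the chunking does, here is a toy version of the idea that uses short lists of integers in place of token IDs and a block size of 4 instead of 2048. It is a simplified stand-in and not the notebook's group_texts() function: it handles a single list of token sequences and always drops the trailing remainder. The name group_into_blocks is hypothetical.

```python
def group_into_blocks(token_lists, block_size):
    """Concatenate token sequences and cut them into fixed-size blocks."""
    concatenated = [token for tokens in token_lists for token in tokens]
    # Keep only as many tokens as fit into whole blocks.
    usable = (len(concatenated) // block_size) * block_size
    return [concatenated[i : i + block_size] for i in range(0, usable, block_size)]


# Three "sentences" of different lengths, 10 tokens in total.
blocks = group_into_blocks([[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]], block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Sentence boundaries are ignored, so every block has exactly the same length. The last two tokens are dropped because they don't fill a complete block.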

  1. Define the functions with the following code:
    def tokenize_function(examples):
        ...
        
        output = tokenizer(examples[text_column_name])
        return output


    def group_texts(examples):
        # block_size is assumed to be defined earlier (2048 in this example).
        # Concatenate all texts.
        concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
        total_length = len(concatenated_examples[list(examples.keys())[0]])
        
        # Drop the trailing remainder so the total length is a multiple of block_size.
        if total_length >= block_size:
            total_length = (total_length // block_size) * block_size
        # Split by chunks of block_size. Building result outside the if block
        # keeps it defined even when total_length is smaller than block_size.
        result = {
            k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
            for k, t in concatenated_examples.items()
        }
        result["labels"] = result["input_ids"].copy()
        return result

  2. Call the preceding utility functions on your dataset to tokenize and generate chunks suitable for the model:
    tokenized_datasets = raw_datasets.map(tokenize_function, batched=True, num_proc=1, remove_columns=column_names)
    lm_datasets = tokenized_datasets.map(group_texts, batched=True)

  3. Prepare the training and validation datasets for SageMaker training by saving them as JSON files and constructing the S3 paths where these files will be uploaded:
    # Save the training split locally and build the S3 prefix it will be uploaded to.
    train_dataset = lm_datasets["train"]
    train_dataset.to_json("./training.json")
    training_dataset_location = f"s3://{default_bucket}/dataset/train/"
    
    # Do the same for the validation split.
    eval_dataset = lm_datasets["validation"]
    eval_dataset.to_json("./validation.json")
    validation_dataset_location = f"s3://{default_bucket}/dataset/validation/"

  4. Finally, set up the data channels for SageMaker training by creating TrainingInput objects from the provided S3 bucket paths for the training and test/validation datasets:
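    import sagemaker

    # s3_train_bucket and s3_test_bucket are assumed to hold the S3 URIs of the
    # uploaded training and validation data from the previous step.
    train = sagemaker.inputs.TrainingInput(
        s3_train_bucket,
        distribution="FullyReplicated",
        s3_data_type="S3Prefix",
    )
    data_channels = {"train": train}
    
    test = sagemaker.inputs.TrainingInput(
        s3_test_bucket,
        distribution="FullyReplicated",
        s3_data_type="S3Prefix",
    )
    data_channels["test"] = test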

You’re now ready to run pre-training or fine-tuning on the dataset.

Pre-train Mixtral 8x7B with expert parallelism on SMP

To pre-train the Mixtral 8x7B model, complete the following steps:

  1. Initialize the script with torch.sagemaker.init() to activate the SMP library:
    import torch.sagemaker as tsm
    tsm.init()

  2. Import the MoEConfig class from the torch.sagemaker.moe.moe_config module. We use the MoEConfig class to enable the model to use the SMP implementation of MoE:
    from torch.sagemaker.moe.moe_config import MoEConfig

  3. Create a model configuration for the Mixtral 8x7B model. This will be passed to AutoModelForCausalLM.from_config(model_config, attn_implementation="flash_attention_2") from the Hugging Face Transformers library to initialize the model with random weights. If you want to fine-tune, you can provide the path to the pre-trained weights instead of the model configuration.
    from transformers import AutoModelForCausalLM, MixtralConfig

    model_config = MixtralConfig(
                vocab_size=args.vocab_size, # 32000,
                hidden_size=args.hidden_width, # 4096,
                intermediate_size=args.intermediate_size, # 14336,
                num_hidden_layers=args.num_layers, # 32,
                num_attention_heads=args.num_heads, # 32,
                num_key_value_heads=args.num_key_value_heads, # 8,
                hidden_act="silu",
                max_position_embeddings=args.max_context_width, # 4096 * 32,
                initializer_range=args.initializer_range, # 0.02,
                rms_norm_eps=1e-5,
                use_cache=False,
                pad_token_id=None,
                bos_token_id=1,
                eos_token_id=2,
                tie_word_embeddings=False,
                rope_theta=1e6,
                sliding_window=args.sliding_window, # None,
                attention_dropout=0.0,
                num_experts_per_tok=args.num_experts_per_tok, # 2,
                num_local_experts=args.num_local_experts, # 8,
                output_router_logits=False,
                router_aux_loss_coef=0.001,
            )
           
    model = AutoModelForCausalLM.from_config(model_config, dtype=dtype, attn_implementation="flash_attention_2")

In the example Jupyter Notebook, you use a create_model() function that invokes the AutoModelForCausalLM.from_config() function.
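
To see how the configuration values above add up to the model size, the following is a minimal sketch that counts parameters from the commented defaults. It assumes the standard Mixtral layout (grouped-query attention, three weight matrices per expert, one router per layer, no bias terms, untied embeddings); the function name `mixtral_param_count` is hypothetical and is not part of SMP or Transformers.

```python
def mixtral_param_count(
    vocab_size=32000,
    hidden_size=4096,
    intermediate_size=14336,
    num_hidden_layers=32,
    num_attention_heads=32,
    num_key_value_heads=8,
    num_local_experts=8,
    experts_used=None,
):
    """Count weights for a Mixtral-style model (assumes no bias terms).

    experts_used: number of experts counted per layer. None counts all
    experts (total size); pass num_experts_per_tok for the active size.
    """
    if experts_used is None:
        experts_used = num_local_experts
    head_dim = hidden_size // num_attention_heads
    kv_dim = head_dim * num_key_value_heads
    # q_proj and o_proj are hidden x hidden; k_proj and v_proj are hidden x kv_dim.
    attention = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim
    # Each expert holds three hidden x intermediate matrices.
    expert = 3 * hidden_size * intermediate_size
    router = hidden_size * num_local_experts
    norms = 2 * hidden_size  # two RMSNorm weight vectors per layer
    per_layer = attention + experts_used * expert + router + norms
    embeddings = 2 * vocab_size * hidden_size  # input embedding + untied LM head
    final_norm = hidden_size
    return num_hidden_layers * per_layer + embeddings + final_norm


total_params = mixtral_param_count()
active_params = mixtral_param_count(experts_used=2)
print(f"total:  {total_params / 1e9:.1f}B")
print(f"active: {active_params / 1e9:.1f}B per token")
```

Under these assumptions the total comes to roughly 46.7 billion parameters, while only about 12.9 billion are exercised per token because each token is routed to 2 of the 8 experts.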

  4. Create the SMP MoE configuration class. The argument values in the following code come from the parameters you specify in the training estimator in the subsequent steps. To learn more about the SMP MoEConfig class, see torch.sagemaker.moe.moe_config.MoEConfig.
    moe_config = MoEConfig(
                        smp_moe=args.use_smp_implementation > 0, #Whether to use the SMP-implementation of MoE. The default value is True.
                        random_seed=args.seed, # A seed number for the random operations in expert-parallel distributed modules. This seed will be added to the expert parallel rank to set the actual seed for each rank. It is unique for each expert parallel rank. The default value is 12345.
                        moe_load_balancing=args.moe_load_balancing, #Specify the load balancing type of the MoE router. Valid options are aux_loss, sinkhorn, balanced, and none. The default value is sinkhorn.
                        global_token_shuffle=args.global_token_shuffle > 0,  #Whether to shuffle tokens across EP ranks within the same expert parallel group. The default value is False
                        moe_all_to_all_dispatcher=args.moe_all_to_all_dispatcher > 0, #Whether to use all-to-all dispatcher for the communications in MoE. The default value is True.
                    )

  5. With the model and MoE configuration ready, you wrap the model with the SMP transform API and pass the MoE configuration. Here, the tsm.transform method adapts the model from Hugging Face format to SMP format. For more information, refer to torch.sagemaker.transform.
    model = tsm.transform(
            model, 
            config=moe_config,
        )

  6. Define the training hyperparameters, including the MoE configuration and other settings specific to the model and training setup:
    hyperparameters = {
        # MoE config
        "moe": 1,
        "moe_load_balancing": "sinkhorn",
        "moe_all_to_all_dispatcher": 1,
        "seed": 12345,
        #rest of hyperparameters
        ...
        "model_type": "mixtral",
        "sharding_strategy": "hybrid_shard",
        "delayed_param": 1, 
        "epochs": 100,
        "activation_checkpointing": 1,
        "beta1": 0.9,
        "bf16": 1,
        "fp8": 0,
        "checkpoint_dir": "/opt/ml/checkpoints",
        ...
        ...
        
    }

We enable delayed parameter initialization in SMP, which allows initializing large models on a meta device without attaching data. This can resolve limited GPU memory issues when you first load the model. This approach is particularly useful for training LLMs with tens of billions of parameters, where even CPU memory might not be sufficient for initialization.
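
A quick back-of-the-envelope calculation shows why host memory can become the bottleneck without delayed initialization. This is a sketch under stated assumptions: fp32 weights (4 bytes each), one training process per GPU, and every process materializing a full copy of the model before sharding. The 47 billion figure is the parameter count quoted for Mixtral 8x7B in this post; the function name `init_memory_gb` is hypothetical.

```python
def init_memory_gb(num_params, bytes_per_param=4, processes_per_node=8):
    """Host memory (decimal GB) needed if every process builds a full model copy."""
    per_copy = num_params * bytes_per_param / 1e9
    return per_copy, per_copy * processes_per_node


per_copy_gb, per_node_gb = init_memory_gb(47_000_000_000)
print(f"one full copy:        {per_copy_gb:.0f} GB")
print(f"eight copies on node: {per_node_gb:.0f} GB")
```

With a meta device, parameters carry only shape and dtype metadata until they are sharded, so none of these full copies needs to exist in memory.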

SMP supports various routing strategies, including sinkhorn, balanced, and aux_loss. Each provides distinct load balancing approaches to achieve equitable token assignment among experts, thereby maintaining balanced workload distribution.
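
To give a feel for what a balancing router does, the following is a minimal pure-Python sketch of the Sinkhorn idea: alternately rescale the expert columns and token rows of the score matrix until every expert receives an equal share of the total score. It is an illustration only and is not SMP's implementation; the function names `sinkhorn` and `top1` and the toy logits are assumptions made for this example.

```python
import math


def sinkhorn(logits, n_iters=50):
    """Rescale exp(logits) so rows sum to 1 and columns sum to n_tokens / n_experts."""
    scores = [[math.exp(x) for x in row] for row in logits]
    n_tokens, n_experts = len(scores), len(scores[0])
    target = n_tokens / n_experts  # equal share of tokens per expert
    for _ in range(n_iters):
        # Column step: give every expert the same total score.
        for e in range(n_experts):
            col_sum = sum(scores[t][e] for t in range(n_tokens))
            for t in range(n_tokens):
                scores[t][e] *= target / col_sum
        # Row step: each token's scores sum to 1 again.
        for t in range(n_tokens):
            row_sum = sum(scores[t])
            scores[t] = [v / row_sum for v in scores[t]]
    return scores


def top1(scores):
    """Index of the highest-scoring expert for each token."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Four tokens, two experts; every token's raw logit favors expert 0.
logits = [[2.0, 0.0], [1.5, 0.0], [1.0, 0.0], [0.5, 0.0]]
greedy = top1(logits)
balanced = top1(sinkhorn(logits))
print("greedy:  ", greedy)
print("balanced:", balanced)
```

In this toy case, greedy routing sends all four tokens to expert 0, while routing on the balanced scores sends two tokens to each expert, keeping the tokens with the strongest preference for expert 0 on that expert.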

  7. Specify the parameters for expert_parallel_degree and hybrid_shard_degree:
    expert_parallel_degree = 2  # An integer in [1, world_size]
    hybrid_shard_degree = (
        8  # An integer in [0, world_size // expert_parallel_degree] and its default value is 0.
    )

Hybrid sharding is a memory saving technique between `FULL_SHARD` and `NO_SHARD`, with `FULL_SHARD` saving the most memory and `NO_SHARD` not saving any. This technique shards parameters within the hybrid shard degree (HSD) group and replicates parameters across groups. The HSD controls sharding across GPUs and can be set to an integer from 0 to `world_size`.

An HSD of 8 applies `FULL_SHARD` within a node and then replicates parameters across nodes because there are 8 GPUs in the nodes we are using. This results in reduced communication volume because expensive all-gathers and reduce-scatters are only done within a node, which can be more performant for medium-sized models. Generally, you want to use the smallest HSD that doesn’t cause out of memory (OOM) errors. If you’re experiencing OOM, try increasing the hybrid shard degree to reduce memory usage on each node.
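
The grouping described above can be sketched with a few lines of arithmetic. The example below assumes ranks are assigned contiguously to nodes and that the HSD divides the world size evenly; it ignores the interaction with expert parallelism and does not model the special value 0. The function name `hybrid_shard_layout` is hypothetical and is not an SMP API.

```python
def hybrid_shard_layout(world_size, hybrid_shard_degree, gpus_per_node=8):
    """Describe which ranks shard parameters together for a given HSD."""
    hsd = hybrid_shard_degree
    if not 1 <= hsd <= world_size:
        raise ValueError("this sketch expects 1 <= hybrid_shard_degree <= world_size")
    if world_size % hsd != 0:
        raise ValueError("this sketch expects world_size to be divisible by the HSD")
    # Contiguous ranks form one shard group (an assumption of this sketch).
    groups = [list(range(start, start + hsd)) for start in range(0, world_size, hsd)]
    return {
        "shard_groups": groups,
        "replicas": len(groups),  # full parameter copies across the cluster
        "param_fraction_per_gpu": 1 / hsd,
        # True when no shard group spans more than one node.
        "intra_node_only": all(
            g[0] // gpus_per_node == g[-1] // gpus_per_node for g in groups
        ),
    }


# Two ml.p4d.24xlarge instances: 16 GPUs in total, HSD of 8.
layout = hybrid_shard_layout(world_size=16, hybrid_shard_degree=8)
print(layout["replicas"], layout["param_fraction_per_gpu"], layout["intra_node_only"])
```

For the two-instance setup in this post, an HSD of 8 yields two shard groups, one per node, with each GPU holding one eighth of the parameters and all sharding traffic staying inside a node. Raising the HSD to 16 would halve the per-GPU share at the cost of cross-node all-gathers and reduce-scatters.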

  8. With all the necessary configurations in place, you now create the PyTorch estimator to encapsulate the training setup and launch the training job. We run the pre-training on two ml.p4d.24xlarge instances, each of which contains 8 NVIDIA A100 GPUs:
    smp_estimator = PyTorch(
        entry_point="train.py",
        hyperparameters=hyperparameters,
        role=role,
        checkpoint_s3_uri=checkpoint_s3_uri,
        checkpoint_local_path=hyperparameters["checkpoint_dir"],
        instance_type="ml.p4d.24xlarge",
        volume_size=400,
        instance_count=2,
        sagemaker_session=sagemaker_session,
        ...
        distribution={
            "torch_distributed": {
                "enabled": True,
            },
            "smdistributed": {
                "modelparallel": {
                    "enabled": True,
                    "parameters": {
                        "activation_loading_horizon": activation_loading_horizon,
                        "hybrid_shard_degree": hybrid_shard_degree,
                        "sm_activation_offloading": offload_activations,
                        "expert_parallel_degree": expert_parallel_degree,
                    },
                }
            },
        },
        py_version="py310",
        framework_version="2.2.0",
        output_path=s3_output_bucket,
    )

  9. Finally, launch the pre-training workload:
    smp_estimator.fit(inputs=data_channels)

Clean up

As part of cleanup, you can delete the SageMaker default bucket created to host the GLUE/SST2 dataset.

Conclusion

Training large MoE language models like the 47-billion-parameter Mixtral 8x7B can be challenging due to high computational and memory requirements. By using expert parallelism and sharded data parallelism from the SageMaker model parallelism library, you can effectively scale these MoE architectures across multiple GPUs and workers.

SMP’s expert parallelism implementation seamlessly integrates with PyTorch and the Hugging Face Transformers library, allowing you to enable MoE training using simple configuration flags without changing your existing model code. Additionally, SMP provides performance optimizations like hybrid sharding, delayed parameter initialization, and activation offloading and recomputation to further improve training efficiency.

For the complete sample to pre-train and fine-tune Mixtral 8x7B, see the GitHub repo.

Special thanks

Special thanks to Rahul Huilgol, Gautam Kumar, and Luis Quintela for their guidance and engineering leadership in developing this new capability.


About the Authors

Roy Allela is a Senior AI/ML Specialist Solutions Architect at AWS based in Munich, Germany. Roy helps AWS customers—from small startups to large enterprises—train and deploy large language models efficiently on AWS. Roy is passionate about computational optimization problems and improving the performance of AI workloads.

Kanwaljit Khurmi is a Principal Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.

Robert Van Dusen is a Senior Product Manager with Amazon SageMaker. He leads frameworks, compilers, and optimization techniques for deep learning training.

Teng Xu is a Software Development Engineer in the Distributed Training group in AWS AI. He enjoys reading.

Suhit Kodgule is a Software Development Engineer with the AWS Artificial Intelligence group working on deep learning frameworks. In his spare time, he enjoys hiking, traveling, and cooking.
