3D Artist Ignites Flights at Exceptional Heights This Week ‘In the NVIDIA Studio’

Editor’s note: This post is part of our weekly In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks, and demonstrates how NVIDIA Studio technology improves creative workflows. 

An adrenaline-fueled virtual ride in the sky is sure to satisfy all thrill seekers — courtesy of 3D artist Kosei Wano’s sensational animation, Moon Hawk. Wano outlines his creative workflow this week In the NVIDIA Studio.

Plus, join the #GameArtChallenge — running through Sunday, April 30 — by using the hashtag to share video game fan art, character creations and more for a chance to be featured across NVIDIA social media channels.

Original game content can be made with NVIDIA Omniverse — a platform for creating and operating metaverse applications — using the Omniverse Machinima app. This enables users to collaborate in real time when animating characters and environments in virtual worlds.

Who Dares, Wins

Wano often finds inspiration exploring the diversity of flora and fauna. He has a penchant for examining birds — and even knows the difference in wing shapes between hawks and martins, he said. This interest in flying entities extends to his fascination with aircraft. For Moon Hawk, Wano took on the challenge of visually evolving a traditional, fuel-based fighter jet into an electric one.

With reference material in hand, Wano opened the 3D app Blender to scale the fighter jet to accurate, real-life sizing, then roughly sketched within the 3D design space, his preferred method to formulate models.

“Moon Hawk” in its traditional form.

The artist then deployed several tips and tricks to model more efficiently: adding Blender’s automatic detailing modifier, applying neuro-reflex modeling to change the aircraft’s proportions, then dividing the model’s major 3D shapes into sections to edit individually — a step Wano calls “dividing each difficulty.”

Neuro-reflex modeling enables Wano to change proportions while maintaining model integrity.

Blender Cycles RTX-accelerated OptiX ray tracing, unlocked by the artist’s GeForce RTX 3080 Ti GPU, enabled interactive, photorealistic modeling in the viewport. “OptiX’s AI-powered denoiser renders lightly, allowing for comfortable trial and error,” said Wano, who then applied sculpting and other details. Next, Wano used geo nodes to add organic style and customization to his Blender scenes and animate his fighter jet.

Applying geo nodes.

Blender geo nodes make modeling an almost completely procedural process — allowing for non-linear, non-destructive workflows and the instancing of objects — to create incredibly detailed scenes using small amounts of data.

The “Moon Hawk” model is nearly complete.

For Moon Hawk, Wano applied geo nodes to mix materials not found in nature, creating unique textures for the fighter jet. Being able to make real-time base mesh edits without the concern of destructive workflows gave Wano the freedom to alter his model on the fly with an assist from his GPU. “With the GeForce RTX 3080 Ti, there’s no problem, even with a model as complicated as this,” he said.

Animations accelerated at the speed of light with Wano’s GeForce RTX GPU.

Wano kicked off the animation phase by selecting the speed of the fighter jet and roughly designing its flight pattern.

Mapping the flight path in advance.

The artist referenced popular fighter jet scenes in cinema and video games and studied basic rules of physics, such as inertia, to ensure the flight patterns in his animation were realistic. Then, Wano returned to using geo nodes to add 3D lighting effects without the need to simulate or bake. Such lighting modifications helped make rendering the project simpler in its final stage.

Wano edited parameters with ease, applied particle simulations and manually shook the camera to add more layers of immersion to the scenes.

Final color edits in Blender.

With the animation complete, Wano added a short motion blur. Accelerated motion blur rendering enabled by his RTX GPU and the NanoVDB toolset for easy rendering of volumes let him apply this effect quickly. And RTX-accelerated OptiX ray tracing in Blender Cycles delivered the fastest final frame renders.

Wano imported final files into Blackmagic Design’s DaVinci Resolve application, where GPU-accelerated color grading, video editing and color scopes helped the artist complete the animation in record time.

3D artist Kosei Wano.

Choosing GeForce RTX was a simple choice for Wano, who said, “NVIDIA products have been trusted by many people for a long time.”

For a deep dive into Wano’s workflow, visit the NVIDIA Studio YouTube channel to browse the playlist Designing and Modeling a Sci-Fi Ship in Blender With Wanoco4D and view each stage: Modeling, Materials, Geometry Nodes and Lightning Effect, Setting Animation and Lights and Rendering.

View more of Wano’s impressive portfolio on ArtStation.

Who Dares With Photogrammetry, Wins Again

Wano, like most artists, is always growing his craft, refining essential skills and learning new techniques, including photogrammetry — the art and science of extracting 3D information from photographs.

In the NVIDIA Studio artist Anna Natter recently highlighted her passion for photogrammetry, noting that virtually anything can be preserved in 3D and showcasing features that have the potential to save 3D artists countless hours. Wano saw this same potential when experimenting with the technology in Adobe Substance 3D Sampler.

“Photogrammetry can accurately reproduce the complex real world,” said Wano, who would encourage other artists to think big in terms of both individual objects and environments. “You can design an entire realistic space by placing it in a 3D virtual world.”

Try out photogrammetry and post your creations with the #StudioShare hashtag for a chance to be featured across NVIDIA Studio’s social media channels.

Follow NVIDIA Studio on Instagram, Twitter and Facebook. Access tutorials on the Studio YouTube channel and get updates directly in your inbox by subscribing to the Studio newsletter.

Read More

AI Before You Buy: Israeli Startup Renders 3D Product Models for Top Retailers

Preparing a retailer’s online catalog once required expensive physical photoshoots to capture products from every angle. A Tel Aviv startup is saving brands time and money by transforming these camera clicks into mouse clicks.

Hexa uses GPU-accelerated computing to help companies turn their online inventory into 3D renders that shoppers can view in 360 degrees, animate or even try on virtually to help their buying decisions. The company, which recently announced a $20.5 million funding round, is working with brands in fashion, furniture, consumer electronics and more.

“The world is going 3D,” said Yehiel Atias, CEO of Hexa. “Just a few years ago, the digital infrastructure to do this was still so expensive that it was more affordable to arrange a photographer, models and lighting. But with the advancements of AI and NVIDIA GPUs, it’s now feasible for retailers to use synthetic data to replace physical photoshoots.”

Hexa’s 3D renders are used on major retail websites such as Amazon, Crate & Barrel and Macy’s. The company creates thousands of renders each month, reducing the need for physical photoshoots of every product in a retailer’s catalog. Hexa estimates that it can save customers up to 300 pounds of carbon emissions for each product imaged digitally instead of physically.

From Physical Photoshoots to AI-Accelerated Renders

Hexa can reconstruct a single 2D image, or a set of low-quality 2D images, into a high-fidelity 3D asset. The company uses differing levels of automation for its renders depending on the complexity of the shape, the amount of visual data that needs to be reconstructed, and the similarity of the object to Hexa’s existing dataset.

To automate elements of its workflow, the team uses dozens of AI algorithms that were developed using the PyTorch deep learning framework and run on NVIDIA Tensor Core GPUs in the cloud. If one of Hexa’s artists is reconstructing a 3D toaster, for example, one algorithm can identify similar geometries the team has created in the past to give the creator a head start.

Another neural network can scan a retailer’s website to identify how many of its products Hexa can support with 3D renders. The company’s entire rendering pipeline, too, runs on NVIDIA GPUs available through Amazon Web Services.

“Accessing compute resources through AWS gives us the option to use thousands of NVIDIA GPUs at a moment’s notice,” said Segev Nahari, lead technical artist at Hexa. “If I need 10,000 frames to be ready by a certain time, I can request the hardware I need to meet the deadline.”

Nahari estimates that rendering on NVIDIA GPUs is up to 3x faster than relying on CPUs.

Broadening Beyond Retail, Venturing Into Omniverse

Hexa developers are continually experimenting with new methods for 3D rendering — looking for workflow improvements in preprocessing, object reconstruction and post-processing. The team recently began working with NVIDIA GET3D, a generative AI model by NVIDIA Research that generates high-fidelity, three-dimensional shapes based on a training dataset of 2D images.

A sneaker generated by GET3D. By training GET3D on Hexa’s dataset of shoes, the team was able to generate 3D models of novel shoes not part of the training data.

In addition to its work in ecommerce, Hexa’s research and development team is investigating new applications for the company’s AI software.

“It doesn’t stop at retail,” Atias said. “Industries from gaming to fashion and healthcare are finding out that synthetic data and 3D technology is a more efficient way to do things like digitize inventory, create digital twins and train robots.”

The team credits its membership in NVIDIA Inception, a global program that supports cutting-edge startups, as a “huge advantage” in leveling up the technology Hexa uses.

“Being part of Inception opens doors that outsiders don’t have,” Atias said. “For a small company trying to navigate the massive range of NVIDIA hardware and software offerings, it’s a door-opener to all the cool tools we wanted to experiment with and understand the potential they could bring to Hexa.”

Hexa is testing the NVIDIA Omniverse Enterprise platform — an end-to-end platform for building and operating metaverse applications — as a tool to unify its annotating and rendering workflows, which are used by dozens of 3D artists around the globe. Omniverse Enterprise enables geographically dispersed teams of creators to customize their rendering pipelines and collaborate to build 3D assets.

“Each of our 3D artists has a different software workflow that they’re used to — so it can be tough to get a unified output while still being flexible about the tools each artist uses,” said Jonathan Clark, Hexa’s CTO. “Omniverse is an ideal candidate in that respect, with huge potential for Hexa. The platform will allow our artists to use the rendering software they’re comfortable with, while also allowing our team to visualize the final product in one place.”

To learn more about NVIDIA Omniverse and next-generation content creation, register free for NVIDIA GTC, a global conference for the era of AI and the metaverse, taking place online March 20-23.

Images and videos courtesy of Hexa

Read More

Robust Hybrid Learning With Expert Augmentation

Hybrid modelling reduces the misspecification of expert models by combining them with machine learning (ML) components learned from data. Similarly to many ML algorithms, hybrid model performance guarantees are limited to the training distribution. Leveraging the insight that the expert model is usually valid even outside the training domain, we overcome this limitation by introducing a hybrid data augmentation strategy termed “expert augmentation.” Based on a probabilistic formalization of hybrid modelling, we demonstrate that expert augmentation, which can be incorporated into…Apple Machine Learning Research

MAST: Masked Augmentation Subspace Training for Generalizable Self-Supervised Priors

Recent Self-Supervised Learning (SSL) methods are able to learn feature representations that are invariant to different data augmentations, which can then be transferred to downstream tasks of interest. However, different downstream tasks require different invariances for their best performance, so the optimal choice of augmentations for SSL depends on the target task. In this paper, we aim to learn self-supervised features that generalize well across a variety of downstream tasks (e.g., object classification, detection and instance segmentation) without knowing any task information…Apple Machine Learning Research

Training large language models on Amazon SageMaker: Best practices

Language models are statistical methods predicting the succession of tokens in sequences, using natural text. Large language models (LLMs) are neural network-based language models with hundreds of millions (BERT) to over a trillion parameters (MiCS), and whose size makes single-GPU training impractical. LLMs’ generative abilities make them popular for text synthesis, summarization, machine translation, and more.

The size of an LLM and its training data is a double-edged sword: it brings modeling quality, but entails infrastructure challenges. The model itself is often too big to fit in memory of a single GPU device or on the multiple devices of a multi-GPU instance. These factors require training an LLM over large clusters of accelerated machine learning (ML) instances. In the past few years, numerous customers have been using the AWS Cloud for LLM training.

In this post, we dive into tips and best practices for successful LLM training on Amazon SageMaker Training. SageMaker Training is a managed batch ML compute service that reduces the time and cost to train and tune models at scale without the need to manage infrastructure. Within one launch command, Amazon SageMaker launches a fully functional, ephemeral compute cluster running the task of your choice, and with enhanced ML features such as metastore, managed I/O, and distribution. The post covers all the phases of an LLM training workload and describes associated infrastructure features and best practices. Some of the best practices in this post refer specifically to ml.p4d.24xlarge instances, but most are applicable to any instance type. These best practices allow you to train LLMs on SageMaker at the scale of dozens to hundreds of billions of parameters.
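As an illustration, the following is a minimal sketch of launching such an ephemeral training cluster with the SageMaker Python SDK; the script name, IAM role, and S3 URI are placeholders, and the framework version should match whichever AWS Deep Learning Container you target.

from sagemaker.pytorch import PyTorch

# Hypothetical values: replace the role ARN, script, dataset URI, and instance count with your own.
estimator = PyTorch(
    entry_point="train.py",            # your training script
    role="arn:aws:iam::111122223333:role/SageMakerTrainingRole",
    instance_type="ml.p4d.24xlarge",
    instance_count=16,
    framework_version="1.13",          # pick a version matching the DLC you target
    py_version="py39",
)

# One call provisions the ephemeral cluster, runs train.py on every node, then releases the instances.
estimator.fit({"train": "s3://my-bucket/tokenized-dataset/"})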

Regarding the scope of this post, note the following:

  • We don’t cover neural network scientific design and associated optimizations. Amazon.Science features numerous scientific publications, including but not limited to LLMs.
  • Although this post focuses on LLMs, most of its best practices are relevant for any kind of large-model training, including computer vision and multi-modal models, such as Stable Diffusion.

Best practices

We discuss the following best practices in this post:

  • Compute – SageMaker Training is a great API to launch CPU dataset preparation jobs and thousand-scale GPU jobs.
  • Storage – We see data loading and checkpointing done in two ways, depending on skills and preferences: with an Amazon FSx for Lustre file system, or Amazon Simple Storage Service (Amazon S3) only.
  • Parallelism – Your choice of distributed training library is crucial for appropriate use of the GPUs. We recommend using a cloud-optimized library, such as SageMaker sharded data parallelism, but self-managed and open-source libraries can also work.
  • Networking – Make sure EFA and NVIDIA GPUDirectRDMA are enabled, for fast inter-machine communication.
  • Resiliency – At scale, hardware failures can happen. We recommend checkpointing regularly. Every few hours is common.

Region selection

Instance type and desired capacity are determining factors for Region selection. For the Regions supported by SageMaker and the Amazon Elastic Compute Cloud (Amazon EC2) instance types that are available in each Region, see Amazon SageMaker Pricing. In this post, we assume the training instance type to be a SageMaker-managed ml.p4d.24xlarge.

We recommend working with your AWS account team or contacting AWS Sales to determine the appropriate Region for your LLM workload.

Data preparation

LLM developers train their models on large datasets of naturally occurring text. Popular examples of such data sources include Common Crawl and The Pile. Naturally occurring text may contain biases, inaccuracies, grammatical errors, and syntax variations. An LLM’s eventual quality significantly depends on the selection and curation of the training data. LLM training data preparation is an active area of research and innovation in the LLM industry. The preparation of a natural language processing (NLP) dataset abounds with share-nothing parallelism opportunities. In other words, there are steps that can be applied to units of work—source files, paragraphs, sentences, words—without requiring inter-worker synchronization.

The SageMaker jobs APIs, namely SageMaker Training and SageMaker Processing, excel at this type of task. They enable developers to run an arbitrary Docker container over a fleet of multiple machines. In the case of the SageMaker Training API, the computing fleet can be heterogeneous. Numerous distributed computing frameworks have been used on SageMaker, including Dask, Ray, and also PySpark, which has a dedicated AWS-managed container and SDK in SageMaker Processing.

When you launch a job with multiple machines, SageMaker Training and Processing run your code one time per machine. You don’t need to use a particular distributed computing framework to write a distributed application: you can write the code of your choice, which will run one time per machine, to realize share-nothing parallelism. You can also write or install the inter-node communication logic of your choice.
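For example, here is a hedged sketch of share-nothing parallelism in a data preparation script, sharding input files across hosts with the SM_HOSTS and SM_CURRENT_HOST environment variables that SageMaker framework containers expose; the channel name and the process_file helper are hypothetical.

import json
import os

def process_file(path):
    """Hypothetical per-file preprocessing (tokenization, filtering, and so on)."""
    ...

hosts = json.loads(os.environ["SM_HOSTS"])          # e.g. ["algo-1", "algo-2", ...]
rank = hosts.index(os.environ["SM_CURRENT_HOST"])   # this machine's position in the fleet

input_dir = "/opt/ml/input/data/train"              # hypothetical channel name
files = sorted(os.listdir(input_dir))

# Each host takes every len(hosts)-th file: no inter-worker synchronization required.
for name in files[rank::len(hosts)]:
    process_file(os.path.join(input_dir, name))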

Data loading

There are multiple ways to store the training data and move it from its storage to the accelerated compute nodes. In this section, we discuss the options and best practices for data loading.

SageMaker storage and loading options

A typical LLM dataset size is in the hundreds of billions of text tokens, representing a few hundred gigabytes. SageMaker-managed clusters of ml.p4d.24xlarge instances offer several options for dataset storage and loading:

  • On-node NVMe SSD – ml.p4d.24xlarge instances are equipped with 8 TB of NVMe storage, available under /opt/ml/input/data/<channel> if you use SageMaker File mode, and at /tmp. If you’re seeking the simplicity and performance of a local read, you can copy your data to the NVMe SSD. The copy can either be done by SageMaker File mode, or by your own code, for example using multi-processed Boto3 or S5cmd.
  • FSx for Lustre – On-node NVMe SSDs are limited in size, and require ingestion from Amazon S3 at each job or warm cluster creation. If you’re looking to scale to larger datasets while maintaining low-latency random access, you can use FSx for Lustre, a managed service built on Lustre, an open-source parallel file system popular in high-performance computing (HPC). FSx for Lustre uses distributed file storage (striping) and physically separates file metadata from file content to achieve high-performance reads/writes.
  • SageMaker FastFile Mode – FastFile Mode (FFM) is a SageMaker-only feature that presents remote S3 objects in SageMaker-managed compute instances under a POSIX-compliant interface, and streams them only upon reading, using FUSE. FFM reads result in S3 calls that stream remote files block by block. As a best practice to avoid errors related to Amazon S3 traffic, FFM developers should aim to keep the underlying number of S3 calls reasonable, for example by reading files sequentially and with a controlled amount of parallelism.
  • Self-managed data loading – Of course, you may also decide to implement your own, fully custom data loading logic, using proprietary or open-source code. Some reasons to use self-managed data loading are to facilitate a migration by reusing already-developed code, to implement custom error handling logic, or to have more control on underlying performance and sharding. Examples of libraries you may use for self-managed data loading include torchdata.datapipes (previously AWS PyTorch S3 Plugin) and Webdataset. The AWS Python SDK Boto3 may also be combined with Torch Dataset classes to create custom data loading code. Custom data loading classes also enable the creative use of SageMaker Training heterogeneous clusters, to finely adapt the CPU and GPU balance to a given workload.

For more information about those options and how to choose them, refer to Choose the best data source for your Amazon SageMaker training job.
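As a hedged illustration of the first and third options, the sketch below configures one channel that is copied to the node before training (File mode) and one that is streamed on read (FastFile mode); the S3 prefixes are placeholders.

from sagemaker.inputs import TrainingInput

# Copied to local storage before the script starts; appears under /opt/ml/input/data/labels.
labels_channel = TrainingInput(
    s3_data="s3://my-bucket/labels/",             # hypothetical prefix
    input_mode="File",
)

# Streamed lazily from Amazon S3 as the training script reads the files.
corpus_channel = TrainingInput(
    s3_data="s3://my-bucket/tokenized-corpus/",   # hypothetical prefix
    input_mode="FastFile",
)

# Pass both channels to the estimator: estimator.fit({"labels": labels_channel, "train": corpus_channel})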

Best practices for large-scale interaction with Amazon S3

Amazon S3 is capable of handling LLM workloads, both for data reading and checkpointing. It supports a request rate of 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in a bucket. However, this rate is not necessarily available by default. Instead, as the request rate for a prefix grows, Amazon S3 automatically scales to handle the increased rate. For more information, refer to Why am I getting 503 Slow Down errors from Amazon S3 when the requests are within the supported request rate per prefix.

If you expect high-frequency Amazon S3 interaction, we recommend the following best practices:

  • Try to read and write from multiple S3 buckets and prefixes. For example, you can partition training data and checkpoints across different prefixes.
  • Check Amazon S3 metrics in Amazon CloudWatch to track request rates.
  • Try to minimize the amount of simultaneous PUT/GET:
    • Have fewer processes using Amazon S3 at the same time. For example, if eight processes per node need to checkpoint to Amazon S3, you can reduce PUT traffic by a factor of 8 by checkpointing hierarchically: first within the node, then from the node to Amazon S3 (see the sketch after this list).
    • Read multiple training records from a single file or S3 GET, instead of using an S3 GET for every training record.
    • If you use Amazon S3 via SageMaker FFM, SageMaker FFM makes S3 calls to fetch files chunk by chunk. To limit the Amazon S3 traffic generated by FFM, we encourage you to read files sequentially and limit the number of files opened in parallel.
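The following is a minimal sketch of the hierarchical checkpointing idea mentioned above, assuming a PyTorch job with torch.distributed already initialized and eight processes per node; the bucket, prefix, and file layout are hypothetical.

import os
import boto3
import torch
import torch.distributed as dist

BUCKET = "my-checkpoint-bucket"        # hypothetical bucket
PREFIX = "run-42/checkpoints"          # hypothetical prefix; partition runs across prefixes
LOCAL_DIR = "/opt/ml/checkpoints"

def save_checkpoint(model, step):
    # 1) Every process writes its shard to fast local storage first.
    local_path = os.path.join(LOCAL_DIR, f"step{step}-rank{dist.get_rank()}.pt")
    torch.save(model.state_dict(), local_path)
    dist.barrier()                     # wait until all ranks have finished writing

    # 2) Only one process per node uploads to Amazon S3, cutting PUT traffic by a factor of 8.
    # LOCAL_RANK is assumed to be set by the process launcher (for example, torchrun).
    if int(os.environ.get("LOCAL_RANK", "0")) == 0:
        s3 = boto3.client("s3")
        for name in os.listdir(LOCAL_DIR):
            if name.startswith(f"step{step}-"):
                s3.upload_file(os.path.join(LOCAL_DIR, name), BUCKET, f"{PREFIX}/{name}")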

If you have a Developer, Business, or Enterprise Support plan, you can open a technical support case about S3 503 Slow Down errors. But first make sure you have followed the best practices, and get the request IDs for the failed requests.

Training parallelism

LLMs commonly have dozens to hundreds of billions of parameters, making them too big to fit within a single NVIDIA GPU card. LLM practitioners have developed several open-source libraries facilitating the distributed computation of LLM training, including FSDP, DeepSpeed and Megatron. You can run those libraries in SageMaker Training, but you can also use SageMaker distributed training libraries, which have been optimized for the AWS Cloud and provide a simpler developer experience. Developers have two choices for distributed training of their LLM on SageMaker: distributed libraries or self-managed.

SageMaker distributed libraries

To provide you with improved distributed training performance and usability, SageMaker Training offers several proprietary extensions to scale TensorFlow and PyTorch training code. LLM training is often conducted in a 3D-parallelism fashion:

  • Data parallelism splits and feeds the training mini-batches to multiple identical replicas of the model, to increase processing speed
  • Pipeline parallelism attributes various layers of the model to different GPUs or even instances, in order to scale model size beyond a single GPU and a single server
  • Tensor parallelism splits a single layer into multiple GPUs, usually within the same server, to scale individual layers to sizes exceeding a single GPU

In the following example, a 6-layer model is trained on a cluster of k*3 servers with 8*k*3 GPUs (8 GPUs per server). Data parallelism degree is k, pipeline parallelism 6, and tensor parallelism 4. Each GPU in the cluster contains one-fourth of a model layer, and a full model is partitioned over three servers (24 GPUs in total).

diagram of a 3D-parallel neural network training

The following are specifically relevant for LLMs:

  • SageMaker distributed model parallel – This library uses graph partitioning to produce intelligent model partitioning optimized for speed or memory. SageMaker distributed model parallel exposes the latest and greatest large-model training optimizations, including data parallelism, pipeline parallelism, tensor parallelism, optimizer state sharding, activation checkpointing, and offloading. With the SageMaker distributed model parallel library, we documented a 175-billion parameter model training over 920 NVIDIA A100 GPUs. For more information, refer to Train 175+ billion parameter NLP models with model parallel additions and Hugging Face on Amazon SageMaker.
  • SageMaker sharded data parallel – In MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud, Zhang et al. introduce a low-communication model parallel strategy that partitions models over a data parallel group only, instead of the whole cluster. With MiCS, AWS scientists were able to achieve 176 teraflops per GPU (56.4% of the theoretical peak) for training a 210-layer 1.06-trillion-parameter model on EC2 P4de instances. MiCS is now available for SageMaker Training customers as SageMaker sharded data parallel.

SageMaker distributed training libraries provide high performance and a simpler developer experience. In particular, developers don’t need to write and maintain a custom parallel process launcher or use a framework-specific launch tool, because the parallel launcher is built into the job launch SDK.
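For illustration, here is a hedged sketch of enabling sharded data parallelism through the SageMaker Python SDK’s distribution parameter; the exact parameter names depend on the library version you use, and the degree shown is arbitrary.

from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled": True,
    "parameters": {
        "ddp": True,
        "sharded_data_parallel_degree": 16,   # shard model and optimizer state over 16 GPUs
    },
}

estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::111122223333:role/SageMakerTrainingRole",   # hypothetical role
    instance_type="ml.p4d.24xlarge",
    instance_count=8,
    framework_version="1.13",
    py_version="py39",
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": {"enabled": True, "processes_per_host": 8},
    },
)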

Self-managed

With SageMaker Training, you have the freedom to use the framework and scientific paradigm of your choice. In particular, if you want to manage distributed training yourself, you have two options to write your custom code:

  • Use an AWS Deep Learning Container (DLC) – AWS develops and maintains DLCs, providing AWS-optimized Docker-based environments for top open-source ML frameworks. SageMaker Training has a unique integration allowing you to pull and run AWS DLCs with an external, user-defined entry point. For LLM training in particular, AWS DLCs for TensorFlow, PyTorch, Hugging Face, and MXNet are particularly relevant. Using a framework DLC allows you to use framework-native parallelism, such as PyTorch Distributed, without having to develop and manage your own Docker images. Additionally, our DLCs feature an MPI integration, which allows you to launch parallel code easily.
  • Write a custom SageMaker-compatible Docker image – You can bring your own (BYO) image (see Use Your Own Training Algorithms and Amazon SageMaker Custom Training containers), either starting from scratch or extending an existing DLC image. When using a custom image for LLM training on SageMaker, it’s particularly important to verify the following:
    • Your image contains EFA with appropriate settings (discussed more later in this post)
    • Your image contains an NVIDIA NCCL communication library, enabled with GPUDirectRDMA

Customers have been able to use a number of self-managed distributed training libraries, including DeepSpeed.

Communications

Given the distributed nature of an LLM training job, inter-machine communication is critical to the feasibility, performance, and costs of the workload. In this section, we present key features for inter-machine communication and conclude with tips for installation and tuning.

Elastic Fabric Adapter

To accelerate ML applications and improve performance while benefiting from the flexibility, scalability, and elasticity of the cloud, you can take advantage of Elastic Fabric Adapter (EFA) with SageMaker. In our experience, using EFA is a requirement for satisfactory multi-node LLM training performance.

An EFA device is a network interface attached to the EC2 instances that SageMaker manages for the duration of a training job. EFA is available on specific instance families, including the P4d. EFA networks are capable of achieving several hundred Gbps of throughput.

Alongside EFA, AWS has introduced the Scalable Reliable Datagram (SRD), an Ethernet-based transport inspired by the InfiniBand Reliable Datagram, evolved with a relaxed packet-ordering constraint. For more information about EFA and SRD, refer to In the search for performance, there’s more than one way to build a network, the video How EFA works and why we don’t use infiniband in the cloud, and the research paper A Cloud-Optimized Transport Protocol for Elastic and Scalable HPC from Shalev et al.

You can add EFA integration on compatible instances to SageMaker’s existing Docker containers, or to custom containers used for training ML models with SageMaker jobs. For more information, refer to Run Training with EFA. EFA is exposed via the open-source Libfabric communication package. However, LLM developers rarely program it directly through Libfabric, and usually rely on the NVIDIA Collective Communications Library (NCCL) instead.

AWS-OFI-NCCL plugin

In distributed ML, EFA is most often used with the NVIDIA Collective Communications Library (NCCL). NCCL is an NVIDIA-developed open-source library implementing inter-GPU communication algorithms. Inter-GPU communication is a cornerstone of LLM training that catalyzes scalability and performance. It is so critical to DL training that the NCCL is often directly integrated as a communication backend in deep learning training libraries, so that LLM developers use it—sometimes without noticing—from their preferred Python DL development framework. To use the NCCL on EFA, LLM developers use the AWS-developed AWS OFI NCCL plugin, which maps NCCL calls to the Libfabric interface used by EFA. We recommend using the latest version of AWS OFI NCCL to benefit from recent improvements.

To verify that the NCCL uses EFA, you should set the environment variable NCCL_DEBUG to INFO, and check in the logs that EFA is loaded by the NCCL:

...
NCCL INFO NET/OFI Selected Provider is efa
NCCL INFO Using network AWS Libfabric
...

For more information about the NCCL and EFA configuration, refer to Test your EFA and NCCL configuration. You can further customize the NCCL with several environment variables. Note that effective in NCCL 2.12 and above, AWS contributed an automated communication algorithm selection logic for EFA networks (NCCL_ALGO can be left unset).

NVIDIA GPUDirect RDMA over EFA

With the P4d instance type, we introduced GPUDirect RDMA (GDR) over EFA fabric. It enables network interface cards (NICs) to directly access GPU memory, making remote GPU-to-GPU communication across NVIDIA GPU-based EC2 instances faster, reducing orchestration overhead on CPUs and user applications. GDR is used under the hood by the NCCL, when feasible.

GDR usage appears in inter-GPU communication when the log level is set to INFO, as in the following code:


NCCL INFO Channel 00 : 9[101d0] -> 0[101c0] [receive] via NET/AWS Libfabric/1/GDRDMA
NCCL INFO Channel 00 : 1[101d0] -> 8[101c0] [send] via NET/AWS Libfabric/1/GDRDMA

Using EFA in AWS Deep Learning Containers

AWS maintains Deep Learning Containers (DLCs), many of which come with AWS-managed Dockerfiles and are built with EFA, AWS OFI NCCL, and NCCL included. The following GitHub repos offer examples with PyTorch and TensorFlow. You don’t need to install those libraries yourself.

Using EFA in your own SageMaker Training container

If you create your own SageMaker Training container and want to use the NCCL over EFA for accelerated inter-node communication, you need to install EFA, NCCL, and AWS OFI NCCL. For more information, refer to Run Training with EFA. Additionally, you should set the following environment variables in your container or in your entry point code (a sketch of passing them through the SageMaker SDK follows the list):

  • FI_PROVIDER="efa" specifies the fabric interface provider
  • NCCL_PROTO=simple instructs the NCCL to use a simple protocol for communication (currently, the EFA provider doesn’t support LL protocols; enabling them could lead to data corruption)
  • FI_EFA_USE_DEVICE_RDMA=1 uses the device’s RDMA functionality for one-sided and two-sided transfer
  • NCCL_LAUNCH_MODE="PARALLEL"
  • NCCL_NET_SHARED_COMMS="0"
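As a sketch of passing these settings without rebuilding the image, the SageMaker Python SDK lets you attach environment variables to the job; the image URI and role below are placeholders.

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="111122223333.dkr.ecr.us-east-1.amazonaws.com/my-llm-training:latest",   # hypothetical BYO image
    role="arn:aws:iam::111122223333:role/SageMakerTrainingRole",                       # hypothetical role
    instance_type="ml.p4d.24xlarge",
    instance_count=16,
    environment={
        "FI_PROVIDER": "efa",
        "NCCL_PROTO": "simple",
        "FI_EFA_USE_DEVICE_RDMA": "1",
        "NCCL_LAUNCH_MODE": "PARALLEL",
        "NCCL_NET_SHARED_COMMS": "0",
        "NCCL_DEBUG": "INFO",          # check the logs for "Selected Provider is efa"
    },
)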

Orchestration

Managing the lifecycle and workload of dozens to hundreds of compute instances requires orchestration software. In this section, we offer best practices for LLM orchestration.

Within-job orchestration

Developers must write both server-side training code and client-side launcher code in most distributed frameworks. Training code runs on training machines, whereas client-side launcher code launches the distributed workload from a client machine. There is little standardization today, for example:

  • In PyTorch, developers can launch multi-machine tasks using torchrun, torchx, torch.distributed.launch (deprecation path), or torch.multiprocessing.spawn
  • DeepSpeed proposes its own deepspeed CLI launcher and also supports MPI launch
  • MPI is a popular parallel computing framework that has the benefit of being ML-agnostic and reasonably tenured, and therefore stable and documented, and is increasingly seen in distributed ML workloads

In a SageMaker Training cluster, the training container is launched one time on each machine. Consequently, you have three options:

  • Native launcher – You can use as an entry point the native launcher of a particular DL framework, for example a torchrun call, which will itself spawn multiple local processes and establish communication across instances (see the sketch after this list).
  • SageMaker MPI integration – You can use SageMaker MPI integration, available in our AWS DLC, or self-installable via sagemaker-training-toolkit, to directly run your entry point code N times per machine. This has the benefit of avoiding the use of intermediary, framework-specific launcher scripts in your own code.
  • SageMaker distributed libraries – If you use the SageMaker distributed libraries, you can focus on the training code and don’t have to write launcher code at all! SageMaker distributed launcher code is built into the SageMaker SDK.
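As a hedged sketch of the first option, recent SageMaker PyTorch containers can invoke torchrun for you when the torch_distributed distribution is enabled; availability depends on the framework version, so treat the configuration below as illustrative.

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",            # written as a standard torchrun-compatible script
    role="arn:aws:iam::111122223333:role/SageMakerTrainingRole",   # hypothetical role
    instance_type="ml.p4d.24xlarge",
    instance_count=4,
    framework_version="1.13",
    py_version="py39",
    # SageMaker starts one torchrun-managed process per GPU on every node.
    distribution={"torch_distributed": {"enabled": True}},
)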

Inter-job orchestration

LLM projects often consist of multiple jobs: parameter search, scaling experiments, recovery from errors, and more. In order to start, stop, and parallelize training tasks, it’s important to use a job orchestrator. SageMaker Training is a serverless ML job orchestrator that provisions transient compute instances immediately upon request. You pay only for what you use, and clusters get decommissioned as soon as your code ends. With SageMaker Training Warm Pools, you have the option to define a time-to-live on training clusters, in order to reuse the same infrastructure across jobs. This reduces iteration time and inter-job placement variability. SageMaker jobs can be launched from a variety of programming languages, including Python, as well as from the AWS CLI.

A SageMaker-specific Python SDK, the SageMaker Python SDK (implemented in the sagemaker Python library), is available, but its use is optional.

Increasing quotas for large, long-running training clusters

SageMaker has default quotas on resources, designed to prevent unintentional usage and costs. To train an LLM using a big cluster of high-end instances running for a long time, you’ll likely need to increase the quotas in the following table.

Quota name | Default value
Longest run time for a training job | 432,000 seconds
Number of instances across all training jobs | 4
Maximum number of instances per training job | 20
ml.p4d.24xlarge for training job usage | 0
ml.p4d.24xlarge for training warm pool usage | 0

See AWS service quotas for how to view your quota values and request a quota increase. On-Demand, Spot Instance, and training warm pool quotas are tracked and modified separately.

If you decide to keep the SageMaker Profiler activated, be aware that every training job launches a SageMaker Processing job, each consuming one ml.m5.2xlarge instance. Confirm that your SageMaker Processing quotas are high enough to accommodate the expected training job concurrency. For example, if you want to launch 50 Profiler-enabled training jobs running concurrently, you’ll need to raise the ml.m5.2xlarge for processing job usage limit to 50.

Additionally, to launch a long-running job, you’ll need to explicitly set the Estimator max_run parameter to the desired maximum duration for the training job in seconds, up to the quota value of the longest runtime for a training job.
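A hedged sketch of setting both the maximum runtime and a warm pool time-to-live through the SageMaker Python SDK follows; the values are illustrative and must stay within your account quotas.

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::111122223333:role/SageMakerTrainingRole",   # hypothetical role
    instance_type="ml.p4d.24xlarge",
    instance_count=32,
    framework_version="1.13",
    py_version="py39",
    max_run=14 * 24 * 3600,               # allow up to 14 days, within the max-runtime quota
    keep_alive_period_in_seconds=3600,    # keep the warm pool alive for 1 hour between jobs
)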

Monitoring and resiliency

Hardware failure is extremely rare at the scale of a single instance and becomes more and more frequent as the number of instances used simultaneously increases. At typical LLM scale—hundreds to thousands of GPUs used 24/7 for weeks to months—hardware failures are near-certain to happen. Therefore, an LLM workload must implement appropriate monitoring and resiliency mechanisms. Firstly, it’s important to closely monitor LLM infrastructure, to limit the impact of failures and optimize the use of compute resources. SageMaker Training proposes several features for this purpose:

  • Logs are automatically sent to CloudWatch Logs. Logs include your training script stdout and stderr. In MPI-based distributed training, all MPI workers send their logs to the leader process.
  • System resource utilization metrics, such as memory, CPU, and GPU usage, are automatically sent to CloudWatch.
  • You can define custom training metrics that will be sent to CloudWatch. The metrics are captured from logs based on regular expressions you set (see the sketch after this list). Third-party experiment packages like the AWS Partner offering Weights & Biases can be used with SageMaker Training (for an example, see Optimizing CIFAR-10 Hyperparameters with W&B and SageMaker).
  • SageMaker Profiler allows you to inspect infrastructure usage and get optimization recommendations.
  • Amazon EventBridge and AWS Lambda allow you to create automated client logic reacting to events such as job failures, successes, S3 file uploads, and more.
  • SageMaker SSH Helper is a community-maintained open-source library allowing you to connect to training job hosts through SSH. It can be helpful for inspecting and troubleshooting code runs on specific nodes.
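For example, here is a hedged sketch of regex-based metric definitions attached to a job; the metric names and log format are hypothetical and must match whatever your training script prints.

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::111122223333:role/SageMakerTrainingRole",   # hypothetical role
    instance_type="ml.p4d.24xlarge",
    instance_count=8,
    framework_version="1.13",
    py_version="py39",
    # Values are captured from stdout/stderr and published to CloudWatch as training metrics.
    metric_definitions=[
        {"Name": "train:loss", "Regex": "loss=([0-9\\.]+)"},
        {"Name": "train:tokens_per_second", "Regex": "tokens_per_sec=([0-9\\.]+)"},
    ],
)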

In addition to monitoring, SageMaker also provides features for job resiliency:

  • Cluster health checks – Before your job starts, SageMaker runs GPU health checks and verifies NCCL communication on GPU instances, replacing any faulty instances if necessary in order to ensure your training script starts running on a healthy cluster of instances. Health checks are currently enabled for P and G GPU-based instance types.
  • Built-in retries and cluster update – You can configure SageMaker to automatically retry training jobs that fail with a SageMaker internal server error (ISE). As part of retrying a job, SageMaker will replace any instances that encountered unrecoverable GPU errors with fresh instances, reboot all healthy instances, and start the job again. This results in faster restarts and workload completion. Cluster update is currently enabled for P and G GPU-based instance types. You can add your own applicative retry mechanism around the client code that submits the job, to handle other types of launch errors, such as exceeding your account quota.
  • Automated checkpoint to Amazon S3 – This helps you checkpoint your progress and reload a past state on new jobs.

To benefit from node-level replacement, your code must actually raise an error. Collectives may hang, instead of erroring, when a node fails. Therefore, for prompt remediation, set a timeout on your collectives and have the code throw an error when the timeout is reached.
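A minimal sketch of that pattern in PyTorch, assuming the NCCL backend; the timeout value is arbitrary, and the name of the async-error-handling variable has changed across PyTorch releases, so check the version you run.

import os
from datetime import timedelta
import torch.distributed as dist

# Surface NCCL failures as Python exceptions instead of silent hangs
# (newer PyTorch releases use TORCH_NCCL_ASYNC_ERROR_HANDLING instead).
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")

# Any collective that waits longer than the timeout raises an error,
# which lets SageMaker's retry-and-replace mechanism kick in.
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=30))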

Some customers set up a monitoring client that watches CloudWatch logs and metrics for abnormal patterns, such as no logs being written or 0% GPU usage, that hint at a hang or stalled convergence, and then automatically stops or retries the job.

Deep dive on checkpointing

The SageMaker checkpoint feature copies everything you write to /opt/ml/checkpoints back to the Amazon S3 URI specified in the checkpoint_s3_uri SDK parameter. When a job starts or restarts, everything written at that URI is sent back to all the machines, at /opt/ml/checkpoints. This is convenient if you want all nodes to have access to all checkpoints, but at scale, when you have many machines or many historical checkpoints, it can lead to long download times and excessive traffic on Amazon S3. Additionally, in tensor and pipeline parallelism, the workers need only a fraction of the checkpointed model, not all of it. If you face those limitations, we recommend the following options:

  • Checkpointing to FSx for Lustre – Thanks to high-performance random I/O, you can define the sharding and file attribution scheme of your choice
  • Self-managed Amazon S3 checkpointing – For examples of Python functions that can be used to save and read checkpoints in a non-blocking fashion, refer to Saving Checkpoints

We strongly suggest checkpointing your model every few hours, for example 1–3 hours, depending on associated overhead and costs.
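If the built-in Amazon S3 checkpoint copy fits your scale, wiring it up through the SageMaker Python SDK looks roughly like the following sketch; the S3 URI is a placeholder.

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::111122223333:role/SageMakerTrainingRole",   # hypothetical role
    instance_type="ml.p4d.24xlarge",
    instance_count=8,
    framework_version="1.13",
    py_version="py39",
    # Everything the script writes to /opt/ml/checkpoints is synced to this URI,
    # and restored to the same local path when the job starts or restarts.
    checkpoint_s3_uri="s3://my-bucket/run-42/checkpoints/",
    checkpoint_local_path="/opt/ml/checkpoints",
)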

Front end and user management

User management is a key usability strength of SageMaker compared to legacy shared HPC infrastructure. SageMaker Training permissions are governed by several AWS Identity and Access Management (IAM) abstractions:

  • Principals—users and systems—are given permission to launch resources
  • Training jobs carry roles themselves, which allow them to have permissions of their own, for example regarding data access and service invocation

Additionally, in 2022 we added SageMaker Role Manager to facilitate the creation of persona-driven permissions.

Conclusion

With SageMaker Training, you can reduce costs and increase iteration speed on your large-model training workload. We have documented success stories in numerous posts and case studies.

If you’re looking to improve your LLM time-to-market while reducing your costs, take a look at the SageMaker Training API and let us know what you build!

Special thanks to Amr Ragab, Rashika Kheria, Zmnako Awrahman, Arun Nagarajan, Gal Oshri for their helpful reviews and teachings.


About the Authors

Anastasia Tzeveleka is a Machine Learning and AI Specialist Solutions Architect at AWS. She works with customers in EMEA and helps them architect machine learning solutions at scale using AWS services. She has worked on projects in different domains including Natural Language Processing (NLP), MLOps and Low Code No Code tools.

Gili Nachum is a senior AI/ML Specialist Solutions Architect who works as part of the EMEA Amazon Machine Learning team. Gili is passionate about the challenges of training deep learning models, and how machine learning is changing the world as we know it. In his spare time, Gili enjoys playing table tennis.

Olivier Cruchant is a Principal Machine Learning Specialist Solutions Architect at AWS, based in France. Olivier helps AWS customers – from small startups to large enterprises – develop and deploy production-grade machine learning applications. In his spare time, he enjoys reading research papers and exploring the wilderness with friends and family.

Bruno Pistone is an AI/ML Specialist Solutions Architect for AWS based in Milan. He works with customers of any size, helping them deeply understand their technical needs and design AI and machine learning solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. His fields of expertise are end-to-end machine learning, machine learning industrialization, and MLOps. He enjoys spending time with his friends and exploring new places, as well as traveling to new destinations.

Read More

Diffusion Probabilistic Fields

Diffusion probabilistic models have quickly become a major approach for generative modeling of images, 3D geometry, video and other domains. However, to adapt diffusion generative modeling to these domains the denoising network needs to be carefully designed for each domain independently, oftentimes under the assumption that data lives in an Euclidean grid. In this paper we introduce Diffusion Probabilistic Fields (DPF), a diffusion model that can learn distributions over continuous functions defined over metric spaces, commonly known as fields. We extend the formulation of diffusion…Apple Machine Learning Research

Universal Speech Model (USM): State-of-the-art speech AI for 100+ languages

Last November, we announced the 1,000 Languages Initiative, an ambitious commitment to build a machine learning (ML) model that would support the world’s one thousand most-spoken languages, bringing greater inclusion to billions of people around the globe. However, some of these languages are spoken by fewer than twenty million people, so a core challenge is how to support languages for which there are relatively few speakers or limited available data.

Today, we are excited to share more about the Universal Speech Model (USM), a critical first step towards supporting 1,000 languages. USM is a family of state-of-the-art speech models with 2B parameters trained on 12 million hours of speech and 28 billion sentences of text, spanning 300+ languages. USM, which is for use in YouTube (e.g., for closed captions), can perform automatic speech recognition (ASR) not only on widely-spoken languages like English and Mandarin, but also on under-resourced languages like Amharic, Cebuano, Assamese, and Azerbaijani to name a few. In “Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages”, we demonstrate that utilizing a large unlabeled multilingual dataset to pre-train the encoder of the model and fine-tuning on a smaller set of labeled data enables us to recognize under-represented languages. Moreover, our model training process is effective at adapting to new languages and data.

A sample of the languages that USM supports.

Challenges in current ASR

To accomplish this ambitious goal, we need to address two significant challenges in ASR.

First, there is a lack of scalability with conventional supervised learning approaches. A fundamental challenge of scaling speech technologies to many languages is obtaining enough data to train high-quality models. With conventional approaches, audio data needs to be either manually labeled, which is time-consuming and costly, or collected from sources with pre-existing transcriptions, which are harder to find for languages that lack wide representation. In contrast, self-supervised learning can leverage audio-only data, which is available in much larger quantities across languages. This makes self-supervision a better approach to accomplish our goal of scaling across hundreds of languages.

Another challenge is that models must improve in a computationally efficient manner while we expand the language coverage and quality. This requires the learning algorithm to be flexible, efficient, and generalizable. More specifically, such an algorithm should be able to use large amounts of data from a variety of sources, enable model updates without requiring complete retraining, and generalize to new languages and use cases.

Our approach: Self-supervised learning with fine-tuning

USM uses the standard encoder-decoder architecture, where the decoder can be CTC, RNN-T, or LAS. For the encoder, USM uses the Conformer, or convolution-augmented transformer. The key component of the Conformer is the Conformer block, which consists of attention, feed-forward, and convolutional modules. It takes as input the log-mel spectrogram of the speech signal and performs a convolutional sub-sampling, after which a series of Conformer blocks and a projection layer are applied to obtain the final embeddings.
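To make the block structure concrete, here is a hedged, simplified PyTorch sketch of a single Conformer block (two half-step feed-forward modules wrapped around self-attention and a depthwise-convolution module); the dimensions are illustrative, and details such as relative positional attention and dropout are omitted.

import torch.nn as nn

class ConvModule(nn.Module):
    """Pointwise conv + GLU, depthwise conv, batch norm, activation, pointwise conv."""
    def __init__(self, dim, kernel_size=31):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pointwise1 = nn.Conv1d(dim, 2 * dim, 1)
        self.glu = nn.GLU(dim=1)
        self.depthwise = nn.Conv1d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.bn = nn.BatchNorm1d(dim)
        self.act = nn.SiLU()
        self.pointwise2 = nn.Conv1d(dim, dim, 1)

    def forward(self, x):                        # x: (batch, time, dim)
        y = self.norm(x).transpose(1, 2)         # Conv1d expects (batch, dim, time)
        y = self.glu(self.pointwise1(y))
        y = self.act(self.bn(self.depthwise(y)))
        y = self.pointwise2(y).transpose(1, 2)
        return x + y                             # residual connection

class ConformerBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        make_ff = lambda: nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                        nn.SiLU(), nn.Linear(4 * dim, dim))
        self.ff1, self.ff2 = make_ff(), make_ff()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = ConvModule(dim)
        self.final_norm = nn.LayerNorm(dim)

    def forward(self, x):                        # x: (batch, time, dim) frame embeddings
        x = x + 0.5 * self.ff1(x)                # first half-step feed-forward
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        x = self.conv(x)
        x = x + 0.5 * self.ff2(x)                # second half-step feed-forward
        return self.final_norm(x)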

Our training pipeline starts with the first step of self-supervised learning on speech audio covering hundreds of languages. In the second optional step, the model’s quality and language coverage can be improved through an additional pre-training step with text data. The decision to incorporate the second step depends on whether text data is available. USM performs best with this second optional step. The last step of the training pipeline is to fine-tune on downstream tasks (e.g., ASR or automatic speech translation) with a small amount of supervised data.

For the first step, we use BEST-RQ, which has already demonstrated state-of-the-art results on multilingual tasks and has proven to be efficient when using very large amounts of unsupervised audio data.

In the second (optional) step, we used multi-objective supervised pre-training to incorporate knowledge from additional text data. The model introduces an additional encoder module to take text as input and additional layers to combine the output of the speech encoder and the text encoder, and trains the model jointly on unlabeled speech, labeled speech, and text data.

In the last stage, USM is fine-tuned on the downstream tasks. The overall training pipeline is illustrated below. With the knowledge acquired during pre-training, USM models achieve good quality with only a small amount of supervised data from the downstream tasks.

USM’s overall training pipeline.

Key results

Performance across multiple languages on YouTube Captions

Our encoder incorporates 300+ languages through pre-training. We demonstrate the effectiveness of the pre-trained encoder through fine-tuning on YouTube Caption’s multilingual speech data. The supervised YouTube data includes 73 languages and has on average less than three thousand hours of data per language. Despite limited supervised data, the model achieves less than 30% word error rate (WER; lower is better) on average across the 73 languages, a milestone we have never achieved before. For en-US, USM has a 6% relative lower WER compared to the current internal state-of-the-art model. Lastly, we compare with the recently released large model, Whisper (large-v2), which was trained with more than 400k hours of labeled data. For the comparison, we only use the 18 languages that Whisper can successfully decode with lower than 40% WER. Our model has, on average, a 32.7% relative lower WER compared to Whisper for these 18 languages.

USM supports all 73 languages in the YouTube Captions’ Test Set and outperforms Whisper on the languages it can support with lower than 40% WER. Lower WER is better.

Generalization to downstream ASR tasks

On publicly available datasets, our model shows lower WER compared to Whisper on CORAAL (African American Vernacular English), SpeechStew (en-US), and FLEURS (102 languages). Our model achieves lower WER with and without training on in-domain data. The comparison on FLEURS reports the subset of languages (62) that overlaps with the languages supported by the Whisper model. For FLEURS, USM without in-domain data has a 65.8% relative lower WER compared to Whisper and has a 67.8% relative lower WER with in-domain data.

Comparison of USM (with or without in-domain data) and Whisper results on ASR benchmarks. Lower WER is better.

Performance on automatic speech translation (AST)

For speech translation, we fine-tune USM on the CoVoST dataset. Our model, which includes text via the second stage of our pipeline, achieves state-of-the-art quality with limited supervised data. To assess the breadth of the model’s performance, we segment the languages from the CoVoST dataset into high, medium, and low based on resource availability and calculate the BLEU score (higher is better) for each segment. As shown below, USM outperforms Whisper for all segments.

CoVoST BLEU score. Higher BLEU is better.

Toward 1,000 languages

The development of USM is a critical effort towards realizing Google’s mission to organize the world’s information and make it universally accessible. We believe USM’s base model architecture and training pipeline comprise a foundation on which we can build to expand speech modeling to the next 1,000 languages.

Learn More

Check out our paper here. Researchers can request access to the USM API here.

Acknowledgements

We thank all the co-authors for contributing to the project and paper, including Andrew Rosenberg, Ankur Bapna, Bhuvana Ramabhadran, Bo Li, Chung-Cheng Chiu, Daniel Park, Françoise Beaufays, Hagen Soltau, Gary Wang, Ginger Perng, James Qin, Jason Riesa, Johan Schalkwyk, Ke Hu, Nanxin Chen, Parisa Haghani, Pedro Moreno Mengibar, Rohit Prabhavalkar, Tara Sainath, Trevor Strohman, Vera Axelrod, Wei Han, Yonghui Wu, Yongqiang Wang, Yu Zhang, Zhehuai Chen, and Zhong Meng.

We also thank Alexis Conneau, Min Ma, Shikhar Bharadwaj, Sid Dalmia, Jiahui Yu, Jian Cheng, Paul Rubenstein, Ye Jia, Justin Snyder, Vincent Tsang, Yuanzhong Xu, Tao Wang for useful discussions.

We appreciate valuable feedback and support from Eli Collins, Jeff Dean, Sissie Hsiao, and Zoubin Ghahramani. Special thanks to Austin Tarango, Lara Tumeh, Amna Latif, and Jason Porta for their guidance around Responsible AI practices. We thank Elizabeth Adkison and James Cokerille for help with naming the model, Tom Small for the animated graphic, Abhishek Bapna for editorial support, and Erica Moreira for resource management. We thank Anusha Ramesh for feedback, guidance, and assistance with the publication strategy, and Calum Barnes and Salem Haykal for their valuable partnership.

Read More

What Is NVLink?

Accelerated computing —  a capability once confined to high-performance computers in government research labs — has gone mainstream.

Banks, car makers, factories, hospitals, retailers and others are adopting AI supercomputers to tackle the growing mountains of data they need to process and understand.

These powerful, efficient systems are superhighways of computing. They carry data and calculations over parallel paths on a lightning journey to actionable results.

GPU and CPU processors are the resources along the way, and their onramps are fast interconnects. The gold standard in interconnects for accelerated computing is NVLink.

So, What Is NVLink?

NVLink is a high-speed connection for GPUs and CPUs formed by a robust software protocol, typically riding on multiple pairs of wires printed on a computer board. It lets processors send and receive data from shared pools of memory at lightning speed.

A diagram showing two NVLink uses

Now in its fourth generation, NVLink connects host and accelerated processors at rates up to 900 gigabytes per second (GB/s).

That’s more than 7x the bandwidth of PCIe Gen 5, the interconnect used in conventional x86 servers. And NVLink sports 5x the energy efficiency of PCIe Gen 5, thanks to data transfers that consume just 1.3 picojoules per bit.

The History of NVLink

First introduced as a GPU interconnect with the NVIDIA P100 GPU, NVLink has advanced in lockstep with each new NVIDIA GPU architecture.

A chart of the basic specifications for NVLink

In 2018, NVLink hit the spotlight in high performance computing when it debuted connecting GPUs and CPUs in two of the world’s most powerful supercomputers, Summit and Sierra.

The systems, installed at Oak Ridge and Lawrence Livermore National Laboratories, are pushing the boundaries of science in fields such as drug discovery, natural disaster prediction and more.

Bandwidth Doubles, Then Grows Again

In 2020, the third-generation NVLink doubled its max bandwidth per GPU to 600GB/s, packing a dozen interconnects in every NVIDIA A100 Tensor Core GPU.

The A100 powers AI supercomputers in enterprise data centers, cloud computing services and HPC labs across the globe.

Today, 18 fourth-generation NVLink interconnects are embedded in a single NVIDIA H100 Tensor Core GPU. And the technology has taken on a new, strategic role that will enable the most advanced CPUs and accelerators on the planet.

A Chip-to-Chip Link

NVIDIA NVLink-C2C is a version of the board-level interconnect to join two processors inside a single package, creating a superchip. For example, it connects two CPU chips to deliver 144 Arm Neoverse V2 cores in the NVIDIA Grace CPU Superchip, a processor built to deliver energy-efficient performance for cloud, enterprise and HPC users.

NVIDIA NVLink-C2C also joins a Grace CPU and a Hopper GPU to create the Grace Hopper Superchip. It packs accelerated computing for the world’s toughest HPC and AI jobs into a single chip.

Alps, an AI supercomputer planned for the Swiss National Computing Center, will be among the first to use Grace Hopper. When it comes online later this year, the high-performance system will work on big science problems in fields from astrophysics to quantum chemistry.

The Grace CPU uses NVLink-C2C
The Grace CPU packs 144 Arm Neoverse V2 cores across two dies connected by NVLink-C2C.

Grace and Grace Hopper are also great for bringing energy efficiency to demanding cloud computing workloads.

For example, Grace Hopper is an ideal processor for recommender systems. These economic engines of the internet need fast, efficient access to lots of data to serve trillions of results to billions of users daily.

A chart showing how Grace Hopper uses NVLink to deliver leading performance on recommendation systems
Recommenders get up to 4x more performance and greater efficiency using Grace Hopper than using Hopper with traditional CPUs.

In addition, NVLink is used in a powerful system-on-chip for automakers that includes NVIDIA Hopper, Grace and Ada Lovelace processors. NVIDIA DRIVE Thor is a car computer that unifies intelligent functions such as digital instrument cluster, infotainment, automated driving, parking and more into a single architecture.

LEGO Links of Computing

NVLink also acts like the socket stamped into a LEGO piece. It’s the basis for building supersystems to tackle the biggest HPC and AI jobs.

For example, NVLinks on all eight GPUs in an NVIDIA DGX system share fast, direct connections via NVSwitch chips. Together, they enable an NVLink network where every GPU in the server is part of a single system.

To get even more performance, DGX systems can themselves be stacked into modular units of 32 servers, creating a powerful, efficient computing cluster.

A picture of the DGX family of server products that use NVLink
NVLink is one of the key technologies that let users easily scale modular NVIDIA DGX systems to a SuperPOD with up to an exaflop of AI performance.

Users can connect a modular block of 32 DGX systems into a single AI supercomputer using a combination of an NVLink network inside each DGX and an NVIDIA Quantum-2 switched InfiniBand fabric between them. For example, an NVIDIA DGX H100 SuperPOD packs 256 H100 GPUs to deliver up to an exaflop of peak AI performance.

To get even more performance, users can tap into AI supercomputers in the cloud, such as the one Microsoft Azure is building with tens of thousands of A100 and H100 GPUs. It’s a service used by groups like OpenAI to train some of the world’s largest generative AI models.

And it’s one more example of the power of accelerated computing.

Read More

Performer-MPC: Navigation via real-time, on-robot transformers

Performer-MPC: Navigation via real-time, on-robot transformers

Despite decades of research, we don’t see many mobile robots roaming our homes, offices, and streets. Real-world robot navigation in human-centric environments remains an unsolved problem. These challenging situations require safe and efficient navigation through tight spaces, such as squeezing between coffee tables and couches, maneuvering in tight corners, doorways, untidy rooms, and more. An equally critical requirement is to navigate in a manner that complies with unwritten social norms around people, for example, yielding at blind corners or staying at a comfortable distance. Google Research is committed to examining how advances in ML may enable us to overcome these obstacles.

In particular, Transformer models have achieved stunning advances across various data modalities in real-world machine learning (ML) problems. For example, multimodal architectures have enabled robots to leverage Transformer-based language models for high-level planning. Recent work that uses Transformers to encode robotic policies opens an exciting opportunity to use those architectures for real-world navigation. However, the on-robot deployment of massive Transformer-based controllers can be challenging due to the strict latency constraints for safety-critical mobile robots. The quadratic space and time complexity of the attention mechanism with respect to the input length is often prohibitively expensive, forcing researchers to trim Transformer stacks at the cost of expressiveness.

As part of our ongoing exploration of ML advances for robotic products, we partnered across Robotics at Google and Everyday Robots to present “Learning Model Predictive Controllers with Real-Time Attention for Real-World Navigation” at the Conference on Robot Learning (CoRL 2022). Here, we introduce Performer-MPC, an end-to-end learnable robotic system that combines (1) a JAX-based differentiable model predictive controller (MPC) that backpropagates gradients to its cost function parameters, (2) Transformer-based encodings of the context (e.g., occupancy grids for navigation tasks) that represent the MPC cost function and adapt the MPC to complex social scenarios without hand-coded rules, and (3) Performer architectures: scalable, low-rank, implicit-attention Transformers with linear space and time complexity attention modules for efficient on-robot deployment (providing 8ms on-robot latency). We demonstrate that Performer-MPC can generalize across different environments to help robots navigate tight spaces while demonstrating socially acceptable behaviors.

Performer-MPC

Performer-MPC aims to blend classic MPCs with ML via their learnable cost functions. Thus, Performer-MPC can be thought of as an instance of inverse reinforcement learning, where the cost function is inferred by learning from expert demonstrations. Critically, the learnable component of the cost function is parameterized by latent embeddings produced by the Performer-Transformer. The linear-time inference provided by Performers is the gateway to real-time, on-robot deployment.

In practice, the occupancy grid obtained by fusing the robot’s sensors serves as input to the Vision Performer model. This model never explicitly materializes the attention matrix, instead leveraging its low-rank decomposition for efficient linear computation of the attention module, resulting in scalable attention. The embedding of a fixed input-patch token from the model’s last layer then parameterizes the quadratic, learnable part of the MPC cost function, which is added to the regular hand-engineered cost (distance from obstacles, penalty terms for sudden velocity changes, etc.). The system is trained end-to-end via imitation learning to mimic expert demonstrations.
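
To make that structure concrete, here is a heavily simplified, illustrative JAX sketch: a learnable quadratic cost parameterized by a (here random) embedding standing in for the Performer output, a toy gradient-descent “MPC solve,” and an imitation loss differentiated through the solve into the cost parameters. Every name, shape, and the solver itself are assumptions for illustration, not the paper’s implementation.

```python
import jax
import jax.numpy as jnp

def learned_quadratic_cost(traj, embedding, W):
    # Map the (stand-in) Performer embedding to a PSD matrix Q weighting trajectory states.
    d = traj.shape[-1]
    A = (W @ embedding).reshape(d, d)
    Q = A @ A.T  # positive semi-definite by construction
    return jnp.sum(jnp.einsum("ti,ij,tj->t", traj, Q, traj))

def hand_engineered_cost(traj, goal):
    # Distance to goal plus a smoothness penalty on sudden changes.
    return jnp.sum((traj[-1] - goal) ** 2) + 0.1 * jnp.sum(jnp.diff(traj, axis=0) ** 2)

def total_cost(params, traj, embedding, goal):
    return hand_engineered_cost(traj, goal) + learned_quadratic_cost(traj, embedding, params["W"])

def plan(params, embedding, goal, init_traj, steps=30, lr=0.05):
    # Toy "MPC solve": a few gradient steps on the total cost, written in pure JAX
    # so the imitation loss below can backpropagate through it into W.
    cost = lambda t: total_cost(params, t, embedding, goal)
    traj = init_traj
    for _ in range(steps):
        traj = traj - lr * jax.grad(cost)(traj)
    return traj

def imitation_loss(params, embedding, goal, init_traj, expert_traj):
    return jnp.mean((plan(params, embedding, goal, init_traj) - expert_traj) ** 2)

key = jax.random.PRNGKey(0)
horizon, state_dim, emb_dim = 20, 2, 16
params = {"W": 0.1 * jax.random.normal(key, (state_dim * state_dim, emb_dim))}
embedding = jax.random.normal(key, (emb_dim,))  # placeholder for the Performer patch embedding
expert = jnp.linspace(jnp.zeros(state_dim), jnp.ones(state_dim), horizon)
grads = jax.grad(imitation_loss)(params, embedding, jnp.ones(state_dim),
                                 jnp.zeros((horizon, state_dim)), expert)
```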

Performer-MPC overview. The final latent embedding of the patch highlighted in red is used to construct context dependent learnable cost. The backpropagation (red arrows) is through the parameters of the Transformer. Performer provides scalable attention module computation via low-rank approximate decomposition of the regular attention matrix (matrices Query’ and Key’) and by changing the order of matrix multiplications (indicated by the black brackets).

Real-world robot navigation

Although, in principle, Performer-MPC can be applied in various robotic settings, we evaluate its performance on navigation in confined spaces with the potential presence of people. We deployed Performer-MPC on a differential wheeled robot that has a 3D LiDAR camera in the front and depth sensors mounted on its head. Our robot-deployable 8ms-latency Performer-MPC has 8.3M Performer parameters. The actual time of a single Performer run is about 1ms and we use the fastest Performer-ReLU variant.
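
The Performer-ReLU variant replaces softmax attention with non-negative feature maps so that attention can be computed in time linear in the sequence length. A minimal, illustrative sketch of the idea (not the deployed model) looks like this:

```python
import jax
import jax.numpy as jnp

def relu_linear_attention(q, k, v, eps=1e-6):
    # q, k: (L, d); v: (L, d_v). Softmax attention would materialize an (L, L) matrix;
    # ReLU feature maps let us reorder the multiplications so the cost is O(L * d * d_v)
    # and the full attention matrix is never formed.
    qp, kp = jax.nn.relu(q), jax.nn.relu(k)
    kv = kp.T @ v                            # (d, d_v), aggregated once over the sequence
    normalizer = qp @ kp.sum(axis=0) + eps   # (L,) row-wise normalization
    return (qp @ kv) / normalizer[:, None]

# Example with a 512-token sequence and 64-dimensional heads.
key = jax.random.PRNGKey(0)
q, k, v = (jax.random.normal(k_, (512, 64)) for k_ in jax.random.split(key, 3))
out = relu_linear_attention(q, k, v)         # shape (512, 64)
```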

We compare Performer-MPC with two baselines: a regular MPC policy (RMPC) without the learned cost components, and an Explicit Policy (EP) that predicts a reference and goal state using the same Performer architecture but without being coupled to the MPC structure. We evaluate Performer-MPC in simulation and in three real-world scenarios. For each scenario, the learned policies (EP and Performer-MPC) are trained with scenario-specific demonstrations.

Experiment scenarios: (a) learning to avoid local minima during doorway traversal, (b) maneuvering through highly constrained spaces, (c) enabling socially compliant behaviors at blind corners, and (d) handling pedestrian obstructions.

Our policies are trained through behavior cloning with a few hours of human-controlled robot navigation data in the real world. For more data collection details, see the paper. We visualize the planning results of Performer-MPC (green) and RMPC (red) along with expert demonstrations (gray) in the top half and the train and test curves in the bottom half of the following two figures. To measure the distance between the robot trajectory and the expert trajectory, we use Hausdorff distance.
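
For reference, a symmetric Hausdorff distance between two planar trajectories can be computed with SciPy roughly as follows (a generic sketch, not the paper’s evaluation code):

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff(traj_a, traj_b):
    # Symmetric Hausdorff distance between two (N, 2) arrays of x-y waypoints.
    return max(directed_hausdorff(traj_a, traj_b)[0],
               directed_hausdorff(traj_b, traj_a)[0])

robot = np.array([[0.0, 0.0], [0.5, 0.1], [1.0, 0.3]])
expert = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.2]])
print(hausdorff(robot, expert))
```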

Top: Visualization of test examples in the doorway traversal (left) and highly constrained obstacle course (right). Performer-MPC trajectories aiming at the goal are always closer to the expert demonstrations compared to the RMPC trajectories. Bottom: Train and test curves, where the vertical axis represents Hausdorff distance and horizontal axis represents training steps.
Top: Visualization of test examples in the blind corner (left) and pedestrian obstruction (right) scenarios. Performer-MPC trajectories aiming at the goal are always closer to the expert demonstrations compared to the RMPC trajectories. Bottom: Train and test curves, where the vertical axis represents Hausdorff distance and horizontal axis represents training steps.

Learning to avoid local minima

We evaluate Performer-MPC in a simulated doorway traversal scenario in which 100 start and goal pairs are randomly sampled from opposing sides of the wall. A planner, guided by a greedy cost function, often leads the robot to a local minimum (i.e., getting stuck at the closest point to the goal on the other side of the wall). Performer-MPC learns a cost function that steers the robot to pass the doorway, even if it must veer away from the goal and travel further. Performer-MPC shows a success rate of 86% compared to RMPC’s 24%.

Comparison of the Performer-MPC with Regular MPC on the doorway passing task.

Learning highly constrained maneuvers

Next, we test Performer-MPC in a challenging real-world scenario, where the robot must perform sharp, near-collision maneuvers in a cluttered home or office setting. A global planner provides coarse waypoints (a skeleton navigation path) that the robot follows. Each policy is run ten times, and we report the success rate (SR) and the average completion percentage (CP), with variance (VAR), of navigating the obstacle course, i.e., how far the robot traverses without failure (collisions or getting stuck). Performer-MPC outperforms both RMPC and EP in SR and CP.

An obstacle course with policy trajectories and failure locations (indicated by crosses) for RMPC, EP, and Performer-MPC.
An Everyday Robots helper robot maneuvering through highly constrained spaces using Regular MPC, Explicit Policy, and Performer-MPC.

Learning to navigate in spaces with people

Going beyond static obstacles, we apply Performer-MPC to social robot navigation, where robots must navigate in a socially-acceptable manner for which cost functions are difficult to design. We consider two scenarios: (1) blind corners, where robots should avoid the inner side of a hallway corner in case a person suddenly appears, and (2) pedestrian obstruction, where a person unexpectedly impedes the robot’s prescribed path.

Performer-MPC deployed on an Everyday Robots helper robot. Left: Regular MPC efficiently cuts blind corners, forcing the person to move back. Right: Performer-MPC avoids cutting blind corners, enabling safe and socially acceptable navigation around people.
Comparison with an Everyday Robots helper robot using Regular MPC, Explicit Policy, and Performer-MPC in unseen blind corners.
Comparison with an Everyday Robots helper robot using Regular MPC, Explicit Policy, and Performer-MPC in unseen pedestrian obstruction scenarios.

Conclusion

We introduce Performer-MPC, an end-to-end learnable robotic system that combines several mechanisms to enable real-world, robust, and adaptive robot navigation with real-time, on-robot transformers. This work shows that scalable Transformer-architectures play a critical role in designing expressive attention-based robotic controllers. We demonstrate that real-time millisecond-latency inference is feasible for policies leveraging Transformers with a few million parameters. Furthermore, we show that such policies enable robots to learn efficient and socially acceptable behaviors that can generalize well. We believe this opens an exciting new chapter on applying Transformers to real-world robotics and look forward to continuing our research with Everyday Robots helper robots.

Acknowledgements

Special thanks to Xuesu Xiao for co-leading this effort at Everyday Robots as a Visiting Researcher. This research was done by Xuesu Xiao, Tingnan Zhang, Krzysztof Choromanski, Edward Lee, Anthony Francis, Jake Varley, Stephen Tu, Sumeet Singh, Peng Xu, Fei Xia, Sven Mikael Persson, Dmitry Kalashnikov, Leila Takayama, Roy Frostig, Jie Tan, Carolina Parada and Vikas Sindhwani. Special thanks to Vincent Vanhoucke for his feedback on the manuscript.

Read More

Index your Microsoft Exchange content using the Exchange connector for Amazon Kendra

Index your Microsoft Exchange content using the Exchange connector for Amazon Kendra

Amazon Kendra is a highly accurate and simple-to-use intelligent search service powered by machine learning (ML). Amazon Kendra offers a suite of data source connectors to simplify the process of ingesting and indexing your content, wherever it resides.

Valuable data in organizations is stored in both structured and unstructured repositories. An enterprise search solution should be able to pull together data across several structured and unstructured repositories to index and search on.

One such unstructured data repository is Microsoft Exchange. Email conversations contain important messages exchanged between various parties over time. Users often attach documents containing valuable information in the context of that email. In addition to emails, an Exchange account gives access to other valuable sources of information like calendar entries, OneNote notebooks, and contacts.

We’re excited to announce that you can now use the Amazon Kendra connector for Microsoft Exchange to search information stored in your Exchange account. In this post, we show how to index information stored in Exchange and use the Amazon Kendra intelligent search function. In addition, ML-powered intelligent search can accurately find information in unstructured documents with natural-language narrative content, for which keyword search is not very effective.

Solution overview

With Amazon Kendra, you can configure multiple data sources to provide a central place to search across your document repository. For our solution, we demonstrate how to index an Exchange repository or folder using the Amazon Kendra connector for Exchange. The solution consists of the following steps:

  1. Configure an app on Exchange and get the connection details.
  2. Store the details in AWS Secrets Manager.
  3. Create an Exchange data source via the Amazon Kendra console.
  4. Index the data in the Exchange repository.
  5. Run a sample query to test the solution.

Prerequisites

To try out the Amazon Kendra connector for Exchange, you need the following:

Configure an Exchange app and gather connection details

Before we set up the Exchange data source, we need a few details about your Exchange repository. Let’s gather those in advance.

  1. Log in to the Azure portal using your global admin user account and choose Next.
  2. Enter your password and choose Sign in.
  3. On the Azure welcome page, choose App registrations.
  4. Choose New registration.
  5. Enter a name for the app (for example, my-exchange-app) and choose Register.
  6. Note down the tenant ID (you need it when setting up the data source for Amazon Kendra).
  7. Under Client credentials, choose Add a certificate or secret.
  8. Choose New client secret.
  9. Enter a description (for example, my exchange secret).
  10. Choose an expiration period (for this post, 6 months).
  11. Choose Add.
  12. Note the secret ID and value to use later when setting up the data source.
  13. In the navigation pane, choose API permissions.

This is where you can add or remove admin permissions.

  14. For this post, leave the defaults as is.

Store Exchange credentials in Secrets Manager

To store your Exchange credentials in Secrets Manager, complete the following steps (a programmatic alternative using boto3 is sketched after the list):

  1. On the Secrets Manager console, choose Store a new secret.
  2. Select Other type of secret.
  3. Create two key-value pairs for clientid and clientsecret and enter the values saved from Exchange.
  4. Choose Next.
  5. For Secret name, enter a name (for example, AmazonKendra-my-exchange-secret).
  6. Enter an optional description.
  7. Choose Next.
  8. In the Configure rotation section, keep all settings at their defaults and choose Next.
  9. On the Review page, choose Store.
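
If you prefer to script this step, a minimal boto3 sketch is shown below; the secret name, description, and placeholder values are examples to adapt to your environment.

```python
import json
import boto3

secrets = boto3.client("secretsmanager")

# Store the Exchange app's client ID and secret collected earlier.
secrets.create_secret(
    Name="AmazonKendra-my-exchange-secret",            # example name
    Description="Client credentials for my-exchange-app",
    SecretString=json.dumps({
        "clientid": "<your-client-id>",
        "clientsecret": "<your-client-secret>",
    }),
)
```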

Configure the Amazon Kendra connector for Exchange

To configure the Amazon Kendra connector, complete the following steps:

  1. On the Amazon Kendra console, choose Create an Index.
  2. For Index name, enter a name for the index (for example, my-exchange-index).
  3. Enter an optional description.
  4. For Role name, enter an IAM role name.
  5. Configure optional encryption settings and tags.
  6. Choose Next.
  7. For Specify provisioning, select Developer edition and choose Next.
  8. In the Configure user access control section, leave the settings at their defaults and choose Next.
  9. On the review page, choose Create.

This creates and propagates the IAM role and then creates the Amazon Kendra index, which can take up to 30 minutes.
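
The index can also be created programmatically. The boto3 sketch below is a rough equivalent of the console steps; the role ARN is a placeholder, and the role must grant Amazon Kendra the permissions described in the console flow.

```python
import boto3

kendra = boto3.client("kendra")

# Create a Developer Edition index, mirroring the console walkthrough above.
response = kendra.create_index(
    Name="my-exchange-index",
    Edition="DEVELOPER_EDITION",
    RoleArn="arn:aws:iam::123456789012:role/my-kendra-index-role",  # placeholder ARN
)
print(response["Id"])  # index ID, used when adding data sources and running queries
```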

Create an Exchange data source

Complete the following steps to create your data source:

  1. On the Amazon Kendra console, choose Data sources in the navigation pane.
  2. Under Microsoft Exchange, choose Add connector.
  3. For Data source name, enter a name (for example, my-exchange-data-source).
  4. Enter an optional description.
  5. Choose Next.
  6. For Tenant ID, choose the tenant ID you collected earlier.
  7. For AWS Secrets Manager secret, choose the secret you created earlier.
  8. For IAM role, choose Create a new role.
  9. For Role name, enter a name (for example, AmazonKendra-myexchange-datasource-role).
  10. Choose Next.
  11. For User email ID, you can enter a list of email IDs. To capture content from all users, leave the field blank.

We have kept the default selections, but you can fine-tune your selection of content as needed.

  12. For Sync mode, select Full sync (this is the first time and we need to import all content).
  13. For Frequency, choose Run on demand.
  14. Choose Next.
  15. Set any optional field mappings and choose Next.
  16. Choose Review and Create, then choose Add data source.
  17. Choose Sync now.
  18. Wait for the sync to complete. (A programmatic way to start and monitor the sync is sketched after this list.)
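
Once the data source exists, you can also start and monitor sync jobs from code. The boto3 sketch below uses placeholder IDs; substitute the index and data source IDs shown on the Amazon Kendra console.

```python
import boto3

kendra = boto3.client("kendra")
index_id = "<your-index-id>"
data_source_id = "<your-data-source-id>"

# Kick off an on-demand sync, equivalent to choosing Sync now in the console.
kendra.start_data_source_sync_job(Id=data_source_id, IndexId=index_id)

# List recent sync jobs to check progress; each entry reports a status
# such as SYNCING or SUCCEEDED.
jobs = kendra.list_data_source_sync_jobs(Id=data_source_id, IndexId=index_id)
for job in jobs.get("History", []):
    print(job["ExecutionId"], job["Status"])
```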

Test the solution

Now that you have ingested the content from your Exchange account into your Amazon Kendra index, you can test some queries.

  1. Go to your index and choose Search indexed content.
  2. Enter a sample search query and test out your search results (your query will vary based on the contents of your account).

The Exchange connector also crawls local identity information from Exchange. You can use this feature to narrow down your query by user.

  3. To use this feature, go back to the search results page.
  4. Expand Test query with user name or groups and choose Apply user name or groups.

For Microsoft Exchange, we don’t import groups; we only import user names, which in this case are email IDs.

  5. Enter the user ID (email) of your user and choose Apply.
  6. Rerun your search query.

This brings you a filtered set of results based on your criteria.

  7. Go back to the search page and enter the name of a user who doesn’t have access to this content, then choose Apply.
  8. Run the same query again.

When fronting Amazon Kendra with an application, such as one built using Amazon Kendra Experience Builder, you can pass the user identity (in the form of the email ID) to Amazon Kendra so that each user only sees content specific to their user ID. Alternatively, you can use AWS IAM Identity Center (successor to AWS Single Sign-On) to control the user context passed to Amazon Kendra and limit queries by user.
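
Programmatically, the same per-user filtering can be applied by passing a user context with the query. The boto3 sketch below uses placeholder values; the query text and email address should match your own content and users.

```python
import boto3

kendra = boto3.client("kendra")

response = kendra.query(
    IndexId="<your-index-id>",
    QueryText="project timeline update",           # example query
    UserContext={"UserId": "user@example.com"},    # restricts results to this user's access
)
for item in response["ResultItems"]:
    print(item["Type"], item.get("DocumentTitle", {}).get("Text"))
```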

Congratulations! You have successfully used Amazon Kendra to surface answers and insights based on the content indexed from your Exchange account.

Limitations

This solution has the following limitations:

  • Multiple domain emails are not supported.
  • Sticky notes are not supported.
  • Incremental updates are valid only for a specific period (7 days) before the client application needs to run a full synchronization again.
  • Exchange Online has rate limits that govern the speed of ingestion. For more information, refer to Exchange Online limits.

Clean up

To avoid incurring future costs, clean up the resources you created as part of this solution. If you created a new Amazon Kendra index while testing this solution, delete it. If you only added a new data source using the Amazon Kendra connector for Exchange, delete that data source.

Conclusion

With the Microsoft Exchange connector for Amazon Kendra, organizations can securely tap into the information stored in their Exchange accounts using intelligent search powered by Amazon Kendra.

To learn about these possibilities and more, refer to the Amazon Kendra Developer Guide. For more information on how you can create, modify, or delete metadata and content when ingesting your data from Exchange, refer to Enriching your documents during ingestion and Enrich your content and metadata to enhance your search experience with custom document enrichment in Amazon Kendra.


About the author

Ashish Lagwankar is a Senior Enterprise Solutions Architect at AWS. His core interests include AI/ML, serverless, and container technologies. Ashish is based in the Boston, MA, area and enjoys reading, the outdoors, and spending time with his family.

Read More