Microsoft has demonstrated the underlying physics required to create a new kind of qubit

Quantum computing promises to help us solve some of humanity’s greatest challenges. Yet as an industry, we are still in the early days of discovering what’s possible. Today’s quantum computers are enabling researchers to do interesting work, but those researchers often find themselves limited by the inadequate scale of these systems and are eager to do more. Existing machines are based on a variety of qubit types, but none so far have been able to scale to enough qubits to fully realize the promise of quantum computing.

Microsoft is taking a more challenging but ultimately more promising approach to scaled quantum computing with topological qubits, which are theorized to be inherently more stable than qubits produced with existing methods without sacrificing size or speed. We have discovered that we can produce the topological superconducting phase and its concomitant Majorana zero modes, clearing a significant hurdle toward building a scaled quantum machine. The explanation of our work and methods below shows that the underlying physics behind a topological qubit is sound: the observation of a 30 μeV topological gap reported here is a first, and one that lays the groundwork for the potential future of topological quantum computing. While engineering challenges remain, this discovery proves out a fundamental building block for our approach to a scaled quantum computer and puts Microsoft on the path to deliver a quantum machine in Azure that will help solve some of the world’s toughest problems.

Dr. Chetan Nayak and Dr. Sankar Das Sarma recently sat down to discuss these results and why they matter in the video below. Learn more about our journey and visit Azure Quantum to get started with quantum computing today.

Dr. Sankar Das Sarma, a Distinguished University Professor of Physics at the University of Maryland, joins Dr. Chetan Nayak, Distinguished Engineer of Quantum at Microsoft, to discuss Microsoft’s unique approach to building a fully scalable quantum machine.

Microsoft Quantum team reports observation of a 30 μeV topological gap in indium arsenide-aluminum heterostructures

Topological quantum computation is a route to hardware-level fault tolerance, potentially enabling a quantum computing system with high fidelity qubits, fast gate operations, and a single module architecture. The fidelity, speed, and size of a topological qubit is controlled by a characteristic energy called the topological gap. This path is only open if one can reliably produce a topological phase of matter and experimentally verify that the sub-components of a qubit are in a topological phase (and ready for quantum information processing). Doing so is not trivial because topological phases are characterized by the long-ranged entanglement of their ground states, which is not readily accessible to conventional experimental probes.

This difficulty was addressed by the “topological gap protocol” (TGP), which our team set forth a year ago as a criterion for identifying the topological phase with quantum transport measurements. Topological superconducting wires have Majorana zero modes at their ends. There is a real fermionic operator localized at each end of the wire, analogous to the real fermionic wave equation constructed by Ettore Majorana in 1937.

Consequently, there are two quantum states of opposite fermion parity that can only be measured through a phase-coherent probe coupled to both ends. In electrical measurements, the Majorana zero modes (see Figure 1) cause zero-bias peaks (ZBPs) in the local conductance. However, local Andreev bound states and disorder can also cause zero-bias peaks. Thus, the TGP focuses on ZBPs that are highly stable and, crucially, uses the non-local conductance to detect a bulk phase transition. Such a transition must be present at the boundary between the trivial superconducting phase and the topological phase because these are two distinct phases of matter, as different as water and ice.

Figure 1: The local density of states of a topological superconducting nanowire as a function of energy and position.

We have simulated our devices using models that incorporate the details of the materials stack, geometry, and imperfections. Our simulations have demonstrated that the TGP is a very stringent criterion, rendering it a reliable method for detecting the topological phase in a device. Crucially, the conditions for passing the protocol—the presence of stable ZBPs at both ends of the device over a gapped region with gapless boundary, as established via the non-local conductance—were established before any devices had been measured. Given the subtleties involved in identifying a topological phase, which stem from the absence of a local order parameter, one of the design principles of the TGP was to avoid confirmation bias. In particular, the device is scanned over its entire operating range instead of ‘hunting’ for a specific desired feature, such as a ZBP.

Microsoft’s Station Q, in Santa Barbara, CA, is the birthplace of Microsoft’s quantum program. For the last 16 years, it has been the host of a biannual conference on topological phases and quantum computing. After a two-year hiatus of in-person meetings due to the pandemic, the Station Q meetings resumed in early March. At this meeting with leaders in quantum computing from across industry and academia, we reported that we have multiple devices that have passed the TGP.

Our team has measured topological gaps exceeding 30 μeV. This is more than triple the noise level in the experiment and larger than the temperature by a similar factor, showing that the gap is a robust feature. This is both a landmark scientific advance and a crucial step on the journey to topological quantum computation, which relies on the fusion and braiding of anyons (the two primitive operations on topological quasiparticles). The topological gap controls the fault tolerance that the underlying state of matter affords to these operations. More complex devices enabling these operations require multiple topological wire segments and rely on the TGP as part of their initialization procedure. Our success was predicated on very close collaboration between our simulation, growth, fabrication, measurement, and data analysis teams. Every device design was simulated in order to optimize it over 23 different parameters prior to fabrication, which enabled us to determine the device tuning procedure during design.

Our results are backed by exhaustive measurements and rigorous data validation procedures. We obtained the large-scale phase diagram of multiple devices, derived from a combination of local and non-local conductances. Our analysis procedure was validated on simulated data in which we attempted to fool the TGP. This enabled us to rule out various null hypotheses with high confidence. Moreover, data analysis was led by a different team than the one that took the data, as part of our checks and balances between different groups within the team. Additionally, an expert council of independent consultants is vetting our results, and the response to date has been overwhelmingly positive.

With the underlying physics demonstrated, the next step is a topological qubit. We hypothesize that the topological qubit will have a favorable combination of speed, size, and stability compared to other qubits. We believe ultimately it will power a fully scalable quantum machine in the future, which will in turn enable us to realize the full promise of quantum to solve the most complex and pressing challenges our society faces.


Read More

PeopleLens: Using AI to support social interaction between children who are blind and their peers

A young boy wearing the PeopleLens sits on the floor of a playroom holding a blind tennis ball in his hands. His attention is directed toward a woman sitting on the floor in front of him holding her hands out. The PeopleLens looks like small goggles that sit on the forehead. The image is marked with visual annotations to indicate what the PeopleLens is seeing and what sounds are being heard.
The PeopleLens is a new research technology designed to help people who are blind or have low vision better understand their immediate social environments by locating and identifying people in the space. Coupled with a scheme of work based on research and practices from psychology and speech and language therapy, the system can help children and young people who are blind more easily forge social connections with their peers.

For children born blind, social interaction can be particularly challenging. A child may have difficulty aiming their voice at the person they’re talking to and put their head on their desk instead. Linguistically advanced young people may struggle with maintaining a topic of conversation, talking only about something of interest to them. Most noticeably, many children and young people who are blind struggle with engaging and befriending those in their age group despite a strong desire to do so. This is often deeply frustrating for the child or young person and can be equally so for their support network of family members and teachers who want to help them forge these important connections.

The PeopleLens is a new research technology that we’ve created to help young people who are blind (referred to as learners in our work) and their peers interact more easily. A head-worn device, the PeopleLens reads aloud in spatialized audio the names of known individuals when the learner looks at them. That means the sound comes from the direction of the person, assisting the learner in understanding both the relative position and distance of their peers. The PeopleLens helps learners build a People Map, a mental map of those around them needed to effectively signal communicative intent. The technology, in turn, indicates to the learner’s peers when the peers have been “seen” and can interact—a replacement for the eye contact that usually initiates interaction between people.

For children and young people who are blind, the PeopleLens is a way to find their friends; however, for teachers and parents, it’s a way for these children and young people to develop competence and confidence in social interaction. An accompanying scheme of work aims to guide the development of spatial attention skills believed to underpin social interaction through a series of games that learners using the PeopleLens can play with peers. It also sets up situations in which learners can experience agency in social interaction. A child’s realization that they can choose to initiate a conversation because they spot someone first or that they can stop a talkative brother from speaking by looking away is a powerful moment, motivating them to delve deeper into directing their own and others’ attention.

The PeopleLens is an advanced research prototype that works on Nreal Light augmented reality glasses tethered to a phone. While it’s not available for purchase, we are recruiting learners in the United Kingdom aged 5 to 11 who have the support of a teacher to explore the technology as part of a multistage research study. For the study, led by the University of Bristol, learners will be asked to use the PeopleLens for a three-month period beginning in September 2022. For more information, visit the research study information page. 

Research foundation 

The scheme of work, coauthored by collaborators Professor Linda Pring and Dr. Vasiliki Kladouchou, draws on research and practice from psychology and speech and language therapy in providing activities to do with the technology. The PeopleLens builds on the hypothesis that many social interaction difficulties for children who are blind stem from differences in the ways children with and without vision acquire fundamental attentional processes as babies and young children. For example, growing up, children with vision learn to internalize a joint visual dialogue of attention. A young child points at something in the sky, and the parent says, “Bird.” Through these dialogues, young children learn how to direct the attention of others. However, there isn’t enough research to understand how joint attention manifests in children who are blind. A review of the literature suggests that most research doesn’t account for a missing sense and that research specific to visual impairment doesn’t provide a framework for joint attention beyond the age of 3. We’re carrying out research to better understand how the development of joint attention can be improved in early education and augmented with technology.

How does the PeopleLens work? 

The PeopleLens is a sophisticated AI prototype system that is intended to provide people who are blind or have low vision with a better understanding of their immediate social environment. It uses a head-mounted augmented reality device in combination with four state-of-the-art computer vision algorithms to continuously locate, identify, track, and capture the gaze directions of people in the vicinity. It then presents this information to the wearer through spatialized audio—sound that comes from the direction of the person. The real-time nature of the system gives a sense of immersion in the People Map.

The PeopleLens helps the child wearing it build a mental map of those in their immediate social environment. Because the PeopleLens reads aloud the names of identified people in spatialized audio, the child is able to get a sense of the respective positions and distances of their peers. The system receives images and processes them with computer vision algorithms, as shown by the overlays on the top images in this screenshot of the PeopleLens development environment. The system then stitches together a world map that’s used to drive the experiences, as shown at the bottom right.

The PeopleLens is a ground-breaking technology that has also been designed to protect privacy. Among the algorithms underpinning the system is facial recognition of people who’ve been registered in the system. A person registers by taking several photographs of themselves with the phone attached to the PeopleLens. Photographs aren’t stored; instead, they’re converted into a vector of numbers that represents a face. These vectors differ from any used in other systems, so recognition by the PeopleLens doesn’t lead to recognition by any other system. No video or identifying information is captured by the system, ensuring that the images can’t be maliciously used.

The system employs a series of sounds to assist the wearer in placing people in the surrounding space: A percussive bump indicates when their gaze has crossed a person up to 10 meters away. The bump is followed by the person’s name if the person is registered in the system, is within 4 meters of the wearer, and both the person’s ears can be detected. The sound of woodblocks guides the wearer in finding and centering the face of a person the system has seen for 1 second but hasn’t identified, changing in pitch to help the wearer adjust their gaze accordingly. (Those people who are unregistered are acknowledged with a click sound.) Gaze notification can alert the wearer to when they’re being looked at. 
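To make the cue scheme above concrete, here is a small, hypothetical sketch of the decision logic in Python. The thresholds mirror the description above, but the function and fields are our own illustration rather than the actual PeopleLens code.

from dataclasses import dataclass
from typing import Optional

@dataclass
class DetectedPerson:
    distance_m: float            # estimated distance from the wearer, in meters
    gaze_crossed: bool           # the wearer's gaze has just swept across this person
    name: Optional[str]          # set only if the person is registered
    both_ears_detected: bool     # proxy for a well-centered, identifiable face
    tracked_for_s: float         # how long the system has tracked this face

def choose_cue(p: DetectedPerson) -> Optional[str]:
    """Return which spatialized sound to play for one detected person (illustrative only)."""
    if p.gaze_crossed and p.distance_m <= 10:
        # Percussive bump when the gaze crosses a person within 10 meters,
        # followed by the name if registered, close enough, and facing the wearer.
        if p.name and p.distance_m <= 4 and p.both_ears_detected:
            return f"bump + name: {p.name}"
        return "bump"
    if p.tracked_for_s >= 1.0 and p.name is None:
        # Woodblock sounds (changing in pitch) guide the wearer to center an
        # unidentified face; unregistered people are acknowledged with a click.
        return "woodblock (pitch-guided)"
    return None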

A graphic overview of the PeopleLens system describes its functionality and experience features with accompanying icons.
The functionality of the PeopleLens system includes experience features such as recognizing a person in front of the wearer; attention notifications from the direction of those who look at the wearer; the ability to follow someone; and an orientation guide to help wearers find people and faces.

Community collaboration

The success of the PeopleLens, as well as systems like it, is dependent on a prototyping process that includes close collaboration with the people it is intended to serve. Our work with children who are blind and their support systems has put us on a path toward building a tool that can have practical value and empower those using it. We encourage those interested in the PeopleLens to reach out about participating in our study and help us further evolve the technology. 

To learn more about the PeopleLens and its development, check out the Innovation Stories blog about the technology.


Read More

Understanding Deep Learning Algorithms that Leverage Unlabeled Data, Part 2: Contrastive Learning

Labels for large-scale datasets are expensive to curate, so pre-training machine learning models on abundant unlabeled data before fine-tuning them on smaller labeled datasets is an important and promising direction. One popular and successful approach for developing pre-trained models is contrastive learning (He et al., 2019; Chen et al., 2020). Contrastive learning is a powerful class of self-supervised visual representation learning methods that learn feature extractors by (1) minimizing the distance between the representations of positive pairs, or samples that are similar in some sense, and (2) maximizing the distance between representations of negative pairs, or samples that are different in some sense. Contrastive learning can be applied to unlabeled images by having positive pairs contain augmentations of the same image and negative pairs contain augmentations of different images.
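As a concrete illustration of how positive pairs are formed from unlabeled images, the sketch below applies two random augmentations to each image in a batch; the two views of the same image form a positive pair, while views of different images serve as negatives. This is our own minimal example (the exact augmentation pipeline varies by method), not code from the papers cited above.

import torch
import torchvision.transforms as T

# A typical augmentation pipeline; two random draws give two "views" of an image.
augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),
    T.GaussianBlur(kernel_size=23),
    T.ToTensor(),
])

def make_views(pil_images):
    """Return two augmented views per image: (view1[i], view2[i]) is a positive pair,
    and (view1[i], view2[j]) for i != j acts as a negative pair."""
    view1 = torch.stack([augment(img) for img in pil_images])
    view2 = torch.stack([augment(img) for img in pil_images])
    return view1, view2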

In this blog post, we will propose a theoretical framework for understanding the success of this contrastive learning approach. Our theory motivates a novel contrastive loss with theoretical guarantees for downstream linear-probe performance. Our experiments suggest that representations learned by minimizing this objective achieve comparable performance to state-of-the-art methods.

Augmentation graph for self-supervised learning

The key object in our work is the population augmentation graph, which also appeared in our previous blog post where we analyzed self-training. As a reminder, this graph is built such that the nodes represent all possible augmentations of all data points in the population distribution, and the edges connect nodes that are derived from the same natural image. Further, the edges are weighted by the probability that the two augmented images are augmentations of the same underlying image, given the set of augmentation functions being used. Some augmentation methods, like cropping, produce images that could only come from the same underlying image. However, others, such as Gaussian blurring, technically connect all images to each other, albeit mostly with very small probabilities. Because there is a potentially infinite number of augmentations, this graph is a theoretical construct we use to describe our approach rather than an actual graph that we build. The figure below gives a visualization of the graph, where augmented images of French bulldogs are connected in the graph.

Figure 1

We have two simple intuitions about the graph that suggest it contains information generally useful for a pre-trained computer vision model.
First, very few high-probability edges exist between any two images, especially if they have different semantic content. For instance, consider two pictures of the same dog in different poses. Even though the semantic content is the same, there is almost zero chance that one could produce one image from the other using augmentation methods like Gaussian blur. This probability is reduced further when considering two images that don’t even share the same objects, such as one image of a dog outside and another image of a cruise ship in the ocean. Rather, the only high-probability connections are between augmented images with similar objects in similar orientations or poses.
Second, images with similar content (e.g., dog images of the same breed) can be connected to each other via a path of interpolating images. The figure above visualizes this intuition, where \(x_1\) and \(x_2\) are two augmented images of French bulldogs that aren’t obtained from the same natural image (hence there is no high-probability edge between them). However, since the augmentation graph is a theoretical construct defined on the population data, which contains all possible dog images, there must exist a path of interpolating French bulldog images (as shown in Figure 1) where every two consecutive images are directly connected by a reasonably high-probability edge. As a result, this sequence forms a path connecting \(x_1\) and \(x_2\).

Graph partitioning via spectral decomposition

Consider an ideal world where we can partition the augmentation graph into multiple disconnected subgraphs. From the intuition above, each subgraph contains images that can be easily interpolated into one another, and so likely depicts the same underlying concept or objects. This motivates us to design self-supervised algorithms that map nodes within the same subgraph to similar representations.
Assume we have access to the population data distribution and hence the whole augmentation graph. A successful algorithm for graph partitioning is spectral clustering (Shi & Malik 2000; Ng et al. 2002), which uses tools from spectral graph theory to discover the connected components in a graph. We’ll describe spectral clustering in more detail here, and then interpret contrastive learning as an effective, parametric way to implement spectral clustering on the large augmentation graph.
Let \(X\) denote the vertex set of the graph, and for ease of exposition assume \(|X| = N\) (\(N\) can also be infinite or exponential). Let \(A \in \mathbb{R}^{N\times N}\) be the adjacency matrix, which contains the edge weights \(w_{xx'}\) as its entries. For every node \(x\), let \(w_x = \sum_{x'\in X} w_{xx'}\) be the sum of weights of the edges connected to \(x\) (which can be thought of as the degree of the vertex \(x\)). We call the matrix \(L = I - \operatorname{diag}(w_x^{-1/2}) \cdot A \cdot \operatorname{diag}(w_x^{-1/2})\) the Laplacian matrix of the graph.
Spectral clustering begins with an eigendecomposition of the Laplacian matrix. Let \(u_1, u_2, \cdots, u_k\) be the \(N\)-dimensional eigenvectors that correspond to the \(k\) smallest eigenvalues. If we write these vectors as the columns of a matrix \(F \in \mathbb{R}^{N\times k}\), every row (denoted \(v_1, v_2, \cdots, v_N \in \mathbb{R}^{k}\)) corresponds to a single node in the graph. We can then obtain a \(k\)-way partition of the graph by running \(k\)-means on these \(N\) vectors.
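For intuition, here is a minimal sketch of this procedure on a graph whose adjacency matrix is explicitly available (which, as discussed next, is exactly what we cannot assume at population scale). The helper name is ours; the steps follow the description above.

import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(A, k):
    """Partition a graph with (symmetric, nonnegative) adjacency matrix A into k clusters."""
    w = A.sum(axis=1)                                   # vertex degrees w_x
    d_inv_sqrt = np.diag(1.0 / np.sqrt(w))
    L = np.eye(len(A)) - d_inv_sqrt @ A @ d_inv_sqrt    # normalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)                # eigenvalues in ascending order
    F = eigvecs[:, :k]                                  # k smallest eigenvectors as columns
    # Each row of F is the spectral embedding of one node; cluster the rows with k-means.
    return KMeans(n_clusters=k, n_init=10).fit_predict(F)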

It’s worth noting that we cannot directly run spectral clustering on the population augmentation graph, since its eigendecomposition step requires knowing the entire graph (i.e., all the data in the population), whereas in reality we only have a sampled dataset. However, the intuition behind spectral clustering is still valuable: the smallest eigenvectors of the Laplacian matrix should provide pretty good representations of the data.

Contrastive learning as spectral clustering

We can use these intuitions about spectral clustering to design a contrastive learning algorithm. Specifically, because we don’t have access to the true population augmentation graph, we instead define \(f_\theta\), which is a neural network that takes in an example and outputs the eigenvector representation of the example. Put another way, our goal is to compute the matrix \(F\) that contains the eigenvectors as columns, and use its rows as the representations. We aim to learn \(f_\theta\) such that \(f_\theta(x)\) is the row of matrix \(F\) corresponding to example \(x\). Given the high expressivity of neural nets, we assume that such a \(\theta\) exists.

It turns out that this feature can be learned by minimizing the following “matrix factorization loss”:

\[ \min_{F_\theta} L(F_\theta) \triangleq \left\| (I-L) - F_\theta F_\theta^\top \right\|_F^2 = \sum_{i, j} \left(\frac{w_{x_i x_j}}{\sqrt{w_{x_i}}\sqrt{w_{x_j}}} - f_\theta(x_i)^\top f_\theta(x_j)\right)^2 \]

where \(F_\theta \in \mathbb{R}^{N\times k}\) is the matrix containing \(f_\theta(x_i)\) as its \(i\)-th row. According to the Eckart–Young–Mirsky theorem, any minimizer of this loss contains the largest eigenvectors of \(I-L\) (hence the smallest eigenvectors of \(L\)) as its columns, up to scaling. As a result, at the minimizer, \(f_\theta\) recovers the smallest eigenvectors.
We expand the above loss and arrive at a formula that (somewhat surprisingly) resembles a contrastive loss:

\[ \begin{aligned} \min_{\theta} L(f_\theta) &= \text{const} - 2\sum_{i, j}\frac{w_{x_i x_j}}{\sqrt{w_{x_i}}\sqrt{w_{x_j}}} f_\theta(x_i)^\top f_\theta(x_j) + \sum_{i, j}\left(f_\theta(x_i)^\top f_\theta(x_j)\right)^2 \\ &= \text{const} - 2\,\mathbb{E}_{x, x^+}\left[\frac{f_\theta(x)^\top}{\sqrt{w_x}} \frac{f_\theta(x^+)}{\sqrt{w_{x^+}}}\right] + \mathbb{E}_{x, x'}\left[\left(\frac{f_\theta(x)^\top}{\sqrt{w_x}} \frac{f_\theta(x')}{\sqrt{w_{x'}}}\right)^2\right] \end{aligned} \]

where \((x, x^+)\) is a random positive pair and \((x, x')\) is a random negative pair. In the second line, we use the fact that \(w_{x_i x_j}\) and \(w_{x_i} w_{x_j}\) are the probability densities of \((x_i, x_j)\) being a positive and a negative pair, respectively, to replace the sums with expectations.
Ignoring the constant term and the scalings \(\sqrt{w_x}\) (which do not influence the linear separability of the learned representations), we obtain the following contrastive loss objective

\[ \min_{\theta} L_{\text{scl}}(f_\theta) = -2\,\mathbb{E}_{x, x^+}\left[f_\theta(x)^\top f_\theta(x^+)\right] + \mathbb{E}_{x, x'}\left[\left(f_\theta(x)^\top f_\theta(x')\right)^2\right] \]

which we call the spectral contrastive loss. The minimizer of this objective corresponds to the smallest eigenvectors of the Laplacian matrix with some data-wise positive scaling.
In summary, the above analysis shows that, when minimizing a special variant of the contrastive loss (i.e., the spectral contrastive loss), the learned features correspond to the eigenvectors of the Laplacian matrix of the population augmentation graph.
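For concreteness, a minimal PyTorch sketch of the spectral contrastive loss is given below, using the two augmented views of a batch as positive pairs and all cross pairs within the batch as negatives. This is our own illustration of the objective above (with the \(\sqrt{w_x}\) scalings dropped, as discussed), not the reference implementation from the paper.

import torch

def spectral_contrastive_loss(z1, z2):
    """z1, z2: [batch, k] features of two augmented views of the same batch of images.
    (z1[i], z2[i]) are positive pairs; (z1[i], z2[j]) with i != j serve as negative pairs."""
    # -2 * E[ f(x)^T f(x+) ] over positive pairs
    pos = -2.0 * (z1 * z2).sum(dim=1).mean()
    # E[ (f(x)^T f(x'))^2 ] over negative pairs (off-diagonal entries of the similarity matrix)
    sim = z1 @ z2.T
    off_diag = ~torch.eye(len(z1), dtype=torch.bool, device=z1.device)
    neg = (sim[off_diag] ** 2).mean()
    return pos + neg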

Empirically, features learned by minimizing the spectral contrastive loss match strong baselines such as SimCLR and SimSiam in linear probe accuracy after 100-epoch pre-training on ImageNet. More discussion of the empirical performance of this loss can be found in the experiments section of our paper.

Why does this method produce good representations?

We now turn to the question we began with: why are the representations learned by contrastive loss useful for downstream computer vision tasks? We study the downstream accuracy of the representation with the “linear probe” protocol (Chen et al. 2020), where an additional linear model is trained on the representation to predict the labels for a downstream classification task.
As discussed above, the representations learned by the spectral contrastive loss are the rows (with data-wise positive scaling) of a matrix where the columns are the smallest eigenvectors of the Laplacian matrix. Since the scaling doesn’t change the linear prediction, it suffices to consider the rows of the eigenvector matrix as representations.
The usefulness of this representation in classification tasks can be demonstrated by the following didactic example: consider an augmentation graph \(G\) with three completely disconnected components that correspond to three classes, where the downstream task is to classify one component (e.g., the set \(\{x_1, x_2, x_3\}\)) versus the rest.

The figure above shows the smallest eigenvectors of the Laplacian, where the blank entries are 0. It’s easy to see that the rows of the eigenvector matrix here exactly correspond to indicators of the different components in the graph, and hence the representations of nodes from different connected subgraphs are clearly linearly separable. For instance, if we use the sign of \(f(x)^\top b\) as the predictor, where the vector \(b \in \mathbb{R}^{k}\) is set to \(e_1\), we can perfectly classify whether a node belongs to the set \(\{x_1, x_2, x_3\}\) or not.
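This can also be checked numerically on a toy graph. The snippet below (our own example) builds a block-diagonal adjacency matrix with three disconnected components and inspects the three smallest eigenvectors of its normalized Laplacian; their rows are identical within each component and differ across components, so a linear classifier separates the components.

import numpy as np

# Three disconnected components: {0, 1, 2}, {3, 4}, {5, 6}.
A = np.zeros((7, 7))
for group in [(0, 1, 2), (3, 4), (5, 6)]:
    for i in group:
        for j in group:
            if i != j:
                A[i, j] = 1.0

w = A.sum(axis=1)
d_inv_sqrt = np.diag(1.0 / np.sqrt(w))
L = np.eye(7) - d_inv_sqrt @ A @ d_inv_sqrt
eigvals, eigvecs = np.linalg.eigh(L)

# The three smallest eigenvalues are (numerically) zero, and the rows of the
# corresponding eigenvectors are constant within each component, giving
# linearly separable representations for the three components.
print(np.round(eigvals[:3], 6))
print(np.round(eigvecs[:, :3], 3))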
The same intuition holds in a broader setting where the graph isn’t regular, the components aren’t necessarily disconnected, and the downstream task has more than two classes. In summary, the contrastive learned representation can linearly predict any set of nearly disconnected components with high accuracy, which is captured by the following theorem:
Theorem (informal): Assume the population augmentation graph contains \(k\) approximately disconnected components, where each component corresponds to a downstream class. Let the feature map \(f: X \rightarrow \mathbb{R}^k\) be the minimizer of the population spectral contrastive loss. Then, there exists a linear head on top of \(f\) that achieves small downstream classification error.
The formal version of this theorem can be found in our paper.

Conclusion

Our theory suggests that self-supervised learning can learn quite powerful and diverse features that are suitable for a large set of downstream tasks. To see this, consider a situation where there are a large number of disconnected subgraphs in the augmentation graph; the downstream task can be an arbitrary way of grouping these subgraphs into a small number of classes (each class can correspond to many subgraphs, e.g., the “dog” class may contain many subgraphs corresponding to different breeds of dogs).
Due to the abundance of unlabeled data in practice, generalization in the traditional sense (i.e., studying population loss vs. empirical loss) is no longer the main challenge to understanding self-supervised learning. Instead, a good understanding of the population pretraining loss and its connection with the downstream tasks becomes essential. As a result, proper modeling of the pretraining data becomes key to the theoretical analysis. We hope that our theoretical framework, which characterizes properties of the data via the augmentation graph, can facilitate a better understanding of unsupervised algorithms in deep learning and can inspire new practical loss functions and algorithms.
This blog post was based on our NeurIPS 2021 paper Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss.

Read More

Introducing PyTorch Fully Sharded Data Parallel (FSDP) API

Recent studies have shown that training large models is beneficial for improving model quality. During the last three years, model size grew 10,000 times, from BERT with 110M parameters to Megatron-2 with one trillion parameters. However, training large AI models is not easy: aside from the need for large amounts of computing resources, software engineering complexity is also challenging. PyTorch has been working on building tools and infrastructure to make it easier.

PyTorch Distributed data parallelism is a staple of scalable deep learning because of its robustness and simplicity. However, it requires the model to fit on a single GPU. Recent approaches like DeepSpeed ZeRO and FairScale’s Fully Sharded Data Parallel break this barrier by sharding a model’s parameters, gradients, and optimizer states across data-parallel workers while still maintaining the simplicity of data parallelism.

With PyTorch 1.11 we’re adding native support for Fully Sharded Data Parallel (FSDP), currently available as a prototype feature. Its implementation heavily borrows from FairScale’s version while bringing more streamlined APIs and additional performance improvements.

Scaling tests of PyTorch FSDP on AWS show that it can scale up to train dense models with 1T parameters. Realized performance in our experiments reached 84 TFLOPS per A100 GPU for the GPT 1T model and 159 TFLOPS per A100 GPU for the GPT 175B model on an AWS cluster. The native FSDP implementation also dramatically improved model initialization time compared to FairScale’s original version when CPU offloading was enabled.

In future PyTorch versions, we’re going to enable users to seamlessly switch between DDP, ZeRO-1, ZeRO-2 and FSDP flavors of data parallelism, so that users can train different scales of models with simple configurations in the unified API.

How FSDP Works

FSDP is a type of data-parallel training, but unlike traditional data-parallel, which maintains a per-GPU copy of a model’s parameters, gradients and optimizer states, it shards all of these states across data-parallel workers and can optionally offload the sharded model parameters to CPUs.

The figure below shows how FSDP works for 2 data-parallel processes:

Figure 1. FSDP workflow

Usually, model layers are wrapped with FSDP in a nested way, so that only the layers in a single FSDP instance need to gather the full parameters to a single device during forward or backward computation. The gathered full parameters are freed immediately after computation, and the freed memory can be used for the next layer’s computation. In this way, peak GPU memory usage is reduced, allowing training to scale to a larger model size or batch size. To further maximize memory efficiency, FSDP can offload the parameters, gradients, and optimizer states to CPUs when the instance is not active in the computation.

Using FSDP in PyTorch

There are two ways to wrap a model with PyTorch FSDP. Auto wrapping is a drop-in replacement for DDP; manual wrapping requires minimal changes to the model definition code while allowing the exploration of complex sharding strategies.

Auto Wrapping

Model layers should be wrapped in FSDP in a nested way to save peak memory and enable communication and computation overlapping. The simplest way to do it is auto wrapping, which can serve as a drop-in replacement for DDP without changing the rest of the code.

The fsdp_auto_wrap_policy argument allows specifying a callable function to recursively wrap layers with FSDP. The default_auto_wrap_policy function provided by PyTorch FSDP recursively wraps layers whose number of parameters is larger than 100M. You can supply your own wrapping policy as needed; an example of writing a customized wrapping policy is shown in the FSDP API doc, and a brief sketch follows the example below.

In addition, cpu_offload could be configured optionally to offload wrapped parameters to CPUs when these parameters are not used in computation. This can further improve memory efficiency at the cost of data transfer overhead between host and device.

The example below shows how FSDP is wrapped using auto wrapping.

import torch.nn as nn
from torch.distributed.fsdp import (
    FullyShardedDataParallel,
    CPUOffload,
)
from torch.distributed.fsdp.wrap import (
    default_auto_wrap_policy,
)

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(8, 4)
        self.layer2 = nn.Linear(4, 16)
        self.layer3 = nn.Linear(16, 4)

    def forward(self, x):
        return self.layer3(self.layer2(self.layer1(x)))

# Auto wrapping: pass the unwrapped model and a policy; FSDP recursively
# wraps submodules that match the policy and shards their parameters.
fsdp_model = FullyShardedDataParallel(
    Model(),
    fsdp_auto_wrap_policy=default_auto_wrap_policy,
    cpu_offload=CPUOffload(offload_params=True),
)
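As a follow-up to the note above about customized wrapping policies, one plausible way to adjust the size threshold under this prototype API is to partially apply the default policy, as sketched below. The exact keyword arguments may differ between PyTorch versions, so treat this as an illustrative sketch rather than the canonical API.

import functools

from torch.distributed.fsdp import FullyShardedDataParallel
from torch.distributed.fsdp.wrap import default_auto_wrap_policy

# Hypothetical customization: lower the wrapping threshold from ~100M parameters
# to 20M, so that smaller submodules also become their own FSDP units.
my_auto_wrap_policy = functools.partial(
    default_auto_wrap_policy,
    min_num_params=int(2e7),
)

# Used exactly like default_auto_wrap_policy in the example above:
# fsdp_model = FullyShardedDataParallel(
#     Model(),
#     fsdp_auto_wrap_policy=my_auto_wrap_policy,
#     cpu_offload=CPUOffload(offload_params=True),
# )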

Manual Wrapping

Manual wrapping can be useful to explore complex sharding strategies by applying wrap selectively to some parts of the model. Overall settings can be passed to the enable_wrap() context manager.

import torch.nn as nn
from torch.distributed.fsdp import (
    FullyShardedDataParallel,
    CPUOffload,
)
from torch.distributed.fsdp.wrap import (
    enable_wrap,
    wrap,
)

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        # wrap() marks individual submodules as their own FSDP units;
        # unwrapped layers stay in the outer FSDP instance.
        self.layer1 = wrap(nn.Linear(8, 4))
        self.layer2 = nn.Linear(4, 16)
        self.layer3 = wrap(nn.Linear(16, 4))

    def forward(self, x):
        return self.layer3(self.layer2(self.layer1(x)))

# Settings passed to enable_wrap() apply to every wrap() call inside the context.
wrapper_kwargs = dict(cpu_offload=CPUOffload(offload_params=True))
with enable_wrap(wrapper_cls=FullyShardedDataParallel, **wrapper_kwargs):
    fsdp_model = wrap(Model())

After wrapping the model with FSDP using one of the two approaches above, the model can be trained in a similar way to local training, like this:

optim = torch.optim.Adam(fsdp_model.parameters(), lr=0.0001)
for sample, label in next_batch():   # next_batch() and criterion are assumed to be defined elsewhere
    optim.zero_grad()
    out = fsdp_model(sample)
    loss = criterion(out, label)
    loss.backward()
    optim.step()
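Note that FSDP, like DDP, expects a distributed process group to be initialized before wrapping and training. A minimal single-node setup might look like the sketch below; the train() placeholder stands for the wrapping and training-loop code above, and the script is assumed to be launched with torchrun.

import os

import torch
import torch.distributed as dist

def setup():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment,
    # which init_process_group reads via the default env:// method.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

def cleanup():
    dist.destroy_process_group()

# Typical usage (train() is a placeholder for the FSDP wrapping and loop above):
#   setup()
#   train()
#   cleanup()
# Launched with, e.g.:  torchrun --nproc_per_node=8 train_script.py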

Benchmark Results

We ran extensive scaling tests for 175B and 1T GPT models on AWS clusters using PyTorch FSDP. Each cluster node is an instance with 8 NVIDIA A100-SXM4-40GB GPUs, and nodes are interconnected via AWS Elastic Fabric Adapter (EFA) with 400 Gbps network bandwidth.

GPT models are implemented using minGPT. A randomly generated input dataset is used for benchmarking purposes. All experiments ran with 50K vocabulary size, fp16 precision and SGD optimizer.

Model      Number of layers   Hidden size   Attention heads   Model size (billions of parameters)
GPT 175B   96                 12288         96                175
GPT 1T     128                25600         160               1008

In addition to using FSDP with parameters CPU offloading in the experiments, the activation checkpointing feature in PyTorch is also applied in the tests.

The maximum per-GPU throughput of 159 teraFLOP/s (51% of the NVIDIA A100 peak theoretical performance of 312 teraFLOP/s per GPU) is achieved with batch size 20 and sequence length 512 on 128 GPUs for the GPT 175B model; further increasing the number of GPUs leads to per-GPU throughput degradation because of growing communication between the nodes.

For the GPT 1T model, the maximum per-GPU throughput of 84 teraFLOP/s (27% of the peak teraFLOP/s) is achieved with batch size 4 and sequence length 2048 on 128 GPUs. However, further increase of the number of GPUs doesn’t affect the per-GPU throughput too much because we observed that the largest bottleneck in the 1T model training is not from communication but from the slow CUDA cache allocator when peak GPU memory is reaching the limit. The use of A100 80G GPUs with larger memory capacity will mostly resolve this issue and also help scale the batch size to achieve much larger throughput.

Future Work

In the next beta release, we are planning to add efficient distributed model/states checkpointing APIs, meta device support for large model materialization, and mixed-precision support inside FSDP computation and communication. We’re also going to make it easier to switch between DDP, ZeRO-1, ZeRO-2, and FSDP flavors of data parallelism in the new API. To further improve FSDP performance, memory fragmentation reduction and communication efficiency improvements are also planned.

A Bit of History of 2 Versions of FSDP

FairScale FSDP was released in early 2021 as part of the FairScale library. We then started the effort to upstream FairScale FSDP to PyTorch in PyTorch 1.11, making it production-ready. We have selectively upstreamed and refactored key features from FairScale FSDP, redesigned user interfaces, and made performance improvements.

In the near future, FairScale FSDP will stay in the FairScale repository for research projects, while generic and widely adopted features will be upstreamed to PyTorch incrementally and hardened accordingly.

Meanwhile, PyTorch FSDP will focus more on production readiness and long-term support. This includes better integration with ecosystems and improvements on performance, usability, reliability, debuggability and composability.

Acknowledgments

We would like to thank the authors of FairScale FSDP: Myle Ott, Sam Shleifer, Min Xu, Priya Goyal, Quentin Duval, Vittorio Caggiano, Tingting Markstrum, Anjali Sridhar. Thanks to the Microsoft DeepSpeed ZeRO team for developing and popularizing sharded data parallel techniques. Thanks to Pavel Belevich, Jessica Choi, Sisil Mehta for running experiments using PyTorch FSDP on different clusters. Thanks to Geeta Chauhan, Mahesh Yadav, Pritam Damania, Dmytro Dzhulgakov for supporting this effort and insightful discussions.

Read More

Synthetic Defect Generation for Display Front-of-Screen Quality Inspection: A Survey

Display front-of-screen (FOS) quality inspection is essential for the mass production of displays in the manufacturing process. However, the severely imbalanced data, especially the limited number of defective samples, has been a long-standing problem that hinders the successful application of deep learning algorithms. Synthetic defect data generation can help address this issue. This paper reviews the state-of-the-art synthetic data generation methods and the evaluation metrics that can potentially be applied to display FOS quality inspection tasks. (Apple Machine Learning Research)

Computational modeling guides development of new materials

Metal-organic frameworks, a class of materials with porous molecular structures, have a variety of possible applications, such as capturing harmful gases and catalyzing chemical reactions. Made of metal atoms linked by organic molecules, they can be configured in hundreds of thousands of different ways.

To help researchers sift through all of the possible metal-organic framework (MOF) structures and help identify the ones that would be most practical for a particular application, a team of MIT computational chemists has developed a model that can analyze the features of a MOF structure and predict if it will be stable enough to be useful.

The researchers hope that these computational predictions will help cut the development time of new MOFs.

“This will allow researchers to test the promise of specific materials before they go through the trouble of synthesizing them,” says Heather Kulik, an associate professor of chemical engineering at MIT.

The MIT team is now working to develop MOFs that could be used to capture methane gas and convert it to useful compounds such as fuels.

The researchers described their new model in two papers, one in the Journal of the American Chemical Society and one in Scientific Data. Graduate students Aditya Nandy and Gianmarco Terrones are the lead authors of the Scientific Data paper, and Nandy is also the lead author of the JACS paper. Kulik is the senior author of both papers.

Modeling structure

MOFs consist of metal atoms joined by organic molecules called linkers to create a rigid, cage-like structure. The materials also have many pores, which makes them useful for catalyzing reactions involving gases but can also make them less structurally stable.

“The limitation in seeing MOFs realized at industrial scale is that although we can control their properties by controlling where each atom is in the structure, they’re not necessarily that stable, as far as materials go,” Kulik says. “They’re very porous and they can degrade under realistic conditions that we need for catalysis.”

Scientists have been working on designing MOFs for more than 20 years, and thousands of possible structures have been published. A centralized repository contains about 10,000 of these structures but is not linked to any of the published findings on the properties of those structures.

Kulik, who specializes in using computational modeling to discover structure-property relationships of materials, wanted to take a more systematic approach to analyzing and classifying the properties of MOFs.

“When people make these now, it’s mostly trial and error. The MOF dataset is really promising because there are so many people excited about MOFs, so there’s so much to learn from what everyone’s been working on, but at the same time, it’s very noisy and it’s not systematic the way it’s reported,” she says.

Kulik and her colleagues set out to analyze published reports of MOF structures and properties using a natural-language-processing algorithm. Using this algorithm, they scoured nearly 4,000 published papers, extracting information on the temperature at which a given MOF would break down. They also pulled out data on whether particular MOFs can withstand the conditions needed to remove solvents used to synthesize them and make sure they become porous.

Once the researchers had this information, they used it to train two neural networks to predict MOFs’ thermal stability and stability during solvent removal, based on the molecules’ structure.

“Before you start working with a material and thinking about scaling it up for different applications, you want to know will it hold up, or is it going to degrade in the conditions I would want to use it in?” Kulik says. “Our goal was to get better at predicting what makes a stable MOF.”

Better stability

Using the model, the researchers were able to identify certain features that influence stability. In general, simpler linkers with fewer chemical groups attached to them are more stable. Pore size is also important: Before the researchers did their analysis, it had been thought that MOFs with larger pores might be too unstable. However, the MIT team found that large-pore MOFs can be stable if other aspects of their structure counteract the large pore size.

“Since MOFs have so many things that can vary at the same time, such as the metal, the linkers, the connectivity, and the pore size, it is difficult to nail down what governs stability across different families of MOFs,” Nandy says. “Our models enable researchers to make predictions on existing or new materials, many of which have yet to be made.”

The researchers have made their data and models available online. Scientists interested in using the models can get recommendations for strategies to make an existing MOF more stable, and they can also add their own data and feedback on the predictions of the models.

The MIT team is now using the model to try to identify MOFs that could be used to catalyze the conversion of methane gas to methanol, which could be used as fuel. Kulik also plans to use the model to create a new dataset of hypothetical MOFs that haven’t been built before but are predicted to have high stability. Researchers could then screen this dataset for a variety of properties.

“People are interested in MOFs for things like quantum sensing and quantum computing, all sorts of different applications where you need metals distributed in this atomically precise way,” Kulik says.

The research was funded by DARPA, the U.S. Office of Naval Research, the U.S. Department of Energy, a National Science Foundation Graduate Research Fellowship, a Career Award at the Scientific Interface from the Burroughs Wellcome Fund, and an AAAS Marion Milligan Mason Award.

Read More

At the Movies: For 14th Year Running, NVIDIA Technologies Power All VFX Oscar Nominees

For the 14th consecutive year, each Academy Award nominee for Best Visual Effects used NVIDIA technologies.

The 94th annual Academy Awards ceremony, taking place Sunday, March 27, has five nominees in the running:

  • Dune
  • Free Guy
  • No Time to Die
  • Shang-Chi and the Legend of the Ten Rings
  • Spider-Man: No Way Home

NVIDIA has been pushing the boundaries of computer graphics for decades, enabling innovative studios and their incredibly talented artists around the world to create award-winning visual effects in films.

The latest technologies, powered by NVIDIA RTX, help artists push their creativity and craft to new levels of realism, bringing stunning visuals to the big screen.

Dystopian Worlds Come to Life in Epic Story

From flying spacecraft to giant sandworms, the gorgeous, futuristic visuals in Dune were produced by Academy Award-winning studio DNEG.

For the sci-fi film, DNEG contributed to 28 sequences and nearly 1,200 visual effects (VFX) shots, with each elaborate element designed to bring director Denis Villeneuve’s vision to life. To do so, the studio worked across multiple locations with remote artists using virtual workstations powered by NVIDIA RTX.

DNEG was the first studio to implement NVIDIA virtual GPU software at scale. Now, DNEG continues its pioneering efforts with NVIDIA Omniverse Enterprise as the studio looks to the future of connected, collaborative workflows.

“Every show we get is different from the last, and NVIDIA RTX with vGPU lets us scale the memory and performance characteristics up or down to meet the needs of our artists,” said Daire Byrne, global head of systems at DNEG.

All images courtesy of DNEG.

Exhilarating Visuals Go Beyond for Bond

DNEG secured a second Academy Award nomination with the latest James Bond installment, No Time to Die. This is the first Bond film to receive a VFX Oscar nomination since 1979’s Moonraker.

To create the film’s action-packed scenes, explosions and massive environments, DNEG delivered over 500 VFX shots. Its team used detailed references and scans to help design everything — from a glider and a sinking ship, to the island headquarters and thrilling car chases.

Breaking Out of Virtual and Parallel Universes

Three-time Oscar-winning studio Digital Domain received two VFX nominations this year for Free Guy and Spider-Man: No Way Home.

Its team created almost 350 VFX shots for Free Guy, for assets ranging from digital doubles to entire computer graphics-based cities. Free Guy is also one of the first feature films that used Digital Domain’s face-swapping tool, Charlatan.

Charlatan replaces the face of a character with a hand-created facial model, and then combines it with the original performance using AI. This provides much more realistic results when working with digital doubles. Digital Domain also helped build the visuals for “Free City,” the game environment where the film takes place.

For Spider-Man: No Way Home, Digital Domain digitally recreated 2.5 square miles of New York City, one of the biggest computer graphics environments the studio has ever built.

A dramatic fight sequence — which shows the web-slinging hero battling the villainous Doc Ock and his mechanical arms — takes place on a bridge, with visuals and computer graphics elements that include various people, vehicles and weather conditions.

Digital Domain also used a digital double for Doc Ock, a character that actor Alfred Molina first played 18 years ago in the film Spider-Man 2.

Marvelous Characters Clash on Big Screen

Another one of Disney’s most recent Marvel films, Shang-Chi and the Legend of the Ten Rings, showcased magical visuals from creative studio Weta Digital.

Weta worked on 300 shots and VFX sequences that included an epic dragon battle, realistic water simulations and the magical rings themselves. One of Weta’s challenges was to create the breathtaking sequences with the Great Protector, a wingless dragon, and the Dweller-in-Darkness, a creature made up of 128 million polygons.

Weta created custom rigs to deliver performances that emphasized the massive scale and weight of each creature. The team also developed a new AI system to assist with facial performances on the digital double of Shang-Chi’s father and main antagonist, Xu Wenwu.

Explore More VFX and Virtual Productions at GTC

While only one team will accept the award for Best Visual Effects, millions of artists create stunning visuals and cinematics with NVIDIA RTX technology. Learn more at NVIDIA GTC, which takes place March 21-24.

Registration for GTC is free. Sign up to experience how award-winning studios in the industry, including DNEG, HBO and ILMxLAB, are using NVIDIA RTX to unleash new possibilities in storytelling.


Featured image courtesy of DNEG.


Read More