KT’s journey to reduce training time for a vision transformers model using Amazon SageMaker

KT’s journey to reduce training time for a vision transformers model using Amazon SageMaker

KT Corporation is one of the largest telecommunications providers in South Korea, offering a wide range of services including fixed-line telephone, mobile communication, and internet, and AI services. KT’s AI Food Tag is an AI-based dietary management solution that identifies the type and nutritional content of food in photos using a computer vision model. This vision model developed by KT relies on a model pre-trained with a large amount of unlabeled image data to analyze the nutritional content and calorie information of various foods. The AI Food Tag can help patients with chronic diseases such as diabetes manage their diets. KT used AWS and Amazon SageMaker to train this AI Food Tag model 29 times faster than before and optimize it for production deployment with a model distillation technique. In this post, we describe KT’s model development journey and success using SageMaker.

Introducing the KT project and defining the problem

The AI Food Tag model pre-trained by KT is based on the vision transformers (ViT) architecture and has more model parameters than their previous vision model to improve accuracy. To shrink the model size for production, KT is using a knowledge distillation (KD) technique to reduce the number of model parameters without significant impact to accuracy. With knowledge distillation, the pre-trained model is called a teacher model, and a lightweight output model is trained as a student model, as illustrated in the following figure. The lightweight student model has fewer model parameters than the teacher, which reduces memory requirements and allows for deployment on smaller, less expensive instances. The student maintains acceptable accuracy even though it’s smaller by learning from the outputs of the teacher model.

The general training process for knowledge distillation

The teacher model remains unchanged during KD, but the student model is trained using the output logits of the teacher model as labels to calculate loss. With this KD paradigm, both the teacher and the student need to be on a single GPU memory for training. KT initially used two GPUs (A100 80 GB) in their internal, on-premises environment to train the student model, but the process took about 40 days to cover 300 epochs. To accelerate training and generate a student model in less time, KT partnered with AWS. Together, the teams significantly reduced model training time. This post describes how the team used Amazon SageMaker Training, the SageMaker Data Parallelism Library, Amazon SageMaker Debugger, and Amazon SageMaker Profiler to successfully develop a lightweight AI Food Tag model.

Building a distributed training environment with SageMaker

SageMaker Training is a managed machine learning (ML) training environment on AWS that provides a suite of features and tools to simplify the training experience and can be useful in distributed computing, as illustrated in the following diagram.

The model distributed training environment with SageMaker Training

SageMaker customers can also access built-in Docker images with various pre-installed deep learning frameworks and the necessary Linux, NCCL, and Python packages for model training. Data scientists or ML engineers who want to run model training can do so without the burden of configuring training infrastructure or managing Docker and the compatibility of different libraries.

During a 1-day workshop, we were able to set up a distributed training configuration based on SageMaker within KT’s AWS account, accelerate KT’s training scripts using the SageMaker Distributed Data Parallel (DDP) library, and even test a training job using two ml.p4d.24xlarge instances. In this section, we describe KT’s experience working with the AWS team and using SageMaker to develop their model.

In the proof of concept, we wanted to speed up a training job by using the SageMaker DDP library, which is optimized for AWS infrastructure during distributed training. To change from PyTorch DDP to SageMaker DDP, you simply need to declare the torch_smddp package and change the backend to smddp, as shown in the following code:

import smdistributed.dataparallel.torch.torch_smddp

dist.init_process_group(backend='smddp',

rank=args.rank,

world_size=args.world_size)

To learn more about the SageMaker DDP library, refer to SageMaker’s Data Parallelism Library.

Analyzing the causes of slow training speed with the SageMaker Debugger and Profiler

The first step in optimizing and accelerating a training workload involves understanding and diagnosing where bottlenecks occur. For KT’s training job, we measured the training time per iteration of the data loader, forward pass, and backward pass:

1 iter time – dataloader : 0.00053 sec, forward : 7.77474 sec, backward: 1.58002 sec
2 iter time – dataloader : 0.00063 sec, forward : 0.67429 sec, backward: 24.74539 sec
3 iter time – dataloader : 0.00061 sec, forward : 0.90976 sec, backward: 8.31253 sec
4 iter time – dataloader : 0.00060 sec, forward : 0.60958 sec, backward: 30.93830 sec
5 iter time – dataloader : 0.00080 sec, forward : 0.83237 sec, backward: 8.41030 sec
6 iter time – dataloader : 0.00067 sec, forward : 0.75715 sec, backward: 29.88415 sec

Looking at the time in the standard output for each iteration, we saw that the backward pass’s run time fluctuated significantly from iteration to iteration. This variation is unusual and can impact total training time. To find the cause of this inconsistent training speed, we first tried to identify resource bottlenecks by utilizing the System Monitor (SageMaker Debugger UI), which allows you to debug training jobs on SageMaker Training and view the status of resources such as the managed training platform’s CPU, GPU, network, and I/O within a set number of seconds.

The SageMaker Debugger UI provides detailed and essential data that can help identifying and diagnose bottlenecks in a training job. Specifically, the CPU utilization line chart and CPU/GPU utilization heat map per instance tables caught our eye.

In the CPU utilization line chart, we noticed that some CPUs were being used 100%.

The CPU utilization line chart with a CPU bottlenect

In the heat map (where darker colors indicate higher utilization), we noted that a few CPU cores had high utilization throughout the training, whereas GPU utilization wasn’t consistently high over time.

The CPU utilization heat-map with a CPU bottlenect

From here, we began to suspect that one of the reasons for the slow training speed was a CPU bottleneck. We reviewed the training script code to see if anything was causing the CPU bottleneck. The most suspicious part was the large value of num_workers in the data loader, so we changed this value to 0 or 1 to reduce CPU utilization. We then ran the training job again and checked the results.

The following screenshots show the CPU utilization line chart, GPU utilization, and heat map after mitigating the CPU bottleneck.

The CPU utilization line chart after mitigating a CPU bottleneck

The CPU utilization GPU utilization after mitigating a CPU bottleneckThe CPU utilization heat-map after mitigating a CPU bottleneck

By simply changing num_workers, we saw a significant decrease in CPU utilization and an overall increase in GPU utilization. This was an important change that improved training speed significantly. Still, we wanted to see where we could optimize GPU utilization. For this, we used SageMaker Profiler.

SageMaker Profiler helps identify optimization clues by providing visibility into utilization by operations, including tracking GPU and CPU utilization metrics and kernel consumption of GPU/CPU within training scripts. It helps users understand which operations are consuming resources. First, to use SageMaker Profiler, you need to add ProfilerConfig to the function that invokes the training job using the SageMaker SDK, as shown in the following code:

from sagemaker import ProfilerConfig, Profiler

from sagemaker.debugger import (ProfilerRule, rule_configs)

rules=[ProfilerRule.sagemaker(rule_configs.ProfilerReport())]

profiler_config = ProfilerConfig(profile_params = Profiler(cpu_profiling_duration=3600))

from sagemaker.pytorch import PyTorch

region_name = 'us-west-2'

image_uri=f'763104351884.dkr.ecr.{region_name}.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-sagemaker'

estimator = PyTorch(

entry_point='train.py',

source_dir='src',

role=role,

image_uri=image_uri,

instance_count=4,

instance_type='ml.p4d.24xlarge',

distribution={'smdistributed': {'dataparallel': {'enabled': True}}},

profiler_config=profiler_config,

hyperparameters=hyperparameters,

sagemaker_session=sagemaker_session,

)

In the SageMaker Python SDK, you have the flexibility to add the annotate functions for SageMaker Profiler to select code or steps in the training script that needs profiling. The following is an example of the code that you should declare for SageMaker Profiler in the training scripts:

import smppy

SMProf = smppy.SMProfiler.instance()

config = smppy.Config()

config.profiler = {

"EnableCuda": "1",

}

SMProf.configure(config)

SMProf.start_profiling()

…

with smppy.annotate("Forward"):

student_out = student_model(inp)

with smppy.annotate("Backward"):

loss.backward()

…

SMProf.stop_profiling()

After adding the preceding code, if you run a training job using the training scripts, you can get information about the operations consumed by the GPU kernel (as shown in the following figure) after the training runs for a period of time. In the case of KT’s training scripts, we ran it for one epoch and got the following results.

Time Spent By All GPU Kernels(1)

When we checked the top five operation consumption times of the GPU kernel among the results of SageMaker Profiler, we found that for the KT training script, the most time is consumed by the matrix product operation, which is a general matrix multiplication (GEMM) operation on GPUs. With this important insight from the SageMaker Profiler, we began investigating ways to accelerate these operations and improve GPU utilization.

Speeding up training time

We reviewed various ways to reduce computation time of matrix multiplication and applied two PyTorch functions.

Shard optimizer states with ZeroRedundancyOptimizer

If you look at the Zero Redundancy Optimizer (ZeRO), the DeepSpeed/ZeRO technique enables the training of a large model efficiently with better training speed by eliminating the redundancies in memory used by the model. ZeroRedundancyOptimizer in PyTorch uses the technique of sharding the optimizer state to reduce memory usage per a process in Distributed Data Parallel (DDP). DDP uses synchronized gradients in the backward pass so that all optimizer replicas iterate over the same parameters and gradient values, but instead of having all the model parameters, each optimizer state is maintained by sharding only for different DDP processes to reduce memory usage.

To use it, you can leave your existing Optimizer in optimizer_class and declare a ZeroRedundancyOptimizer with the rest of the model parameters and the learning rate as parameters.

student_optimizer = ZeroRedundancyOptimizer(

student_model.parameters(),

optimizer_class=torch.optim.AdamW,

lr=initial_lr

)

Automatic mixed precision

Automatic mixed precision (AMP) uses the torch.float32 data type for some operations and torch.bfloat16 or torch.float16 for others, for the convenience of fast computation and reduced memory usage. In particular, because deep learning models are typically more sensitive to exponent bits than fraction bits in their computations, torch.bfloat16 is equivalent to the exponent bits of torch.float32, allowing them to learn quickly with minimal loss. torch.bfloat16 only runs on instances with A100 NVIDIA architecture (Ampere) or higher, such as ml.p4d.24xlarge, ml.p4de.24xlarge, and ml.p5.48xlarge.

To apply AMP, you can declare torch.cuda.amp.autocast in the training scripts as shown in the code above and declare dtype as torch.bfloat16.

with torch.cuda.amp.autocast(dtype="torch.bfloat16"):

teacher = teacher_model(input_data)

student = student_model(input_data)

loss = loss(teacher, student, target)

loss.requires_grad_(True)

loss.backward()

student_optimizer.step()

student_optimizer.zero_grad(set_to_none=True)

Results in SageMaker Profiler

After applying the two functions to the training scripts and running a train job for one epoch again, we checked the top five operations consumption times for the GPU kernel in SageMaker Profiler. The following figure shows our results.

Time Spent By All GPU Kernels(2)

We can see that the GEMM operation, which was at the top of the list before applying the two Torch functions, has disappeared from the top five operations, replaced by the ReduceScatter operation, which typically occurs in distributed training.

Training speed results of the KT distilled model

We increased the training batch size by 128 more to account for the memory savings from applying the two Torch functions, resulting in a final batch size of 1152 instead of 1024. The training of the final student model was able to run 210 epochs per 1 day; the training time and speedup between KT’s internal training environment and SageMaker are summarized in the following table.

Training Environment Training GPU spec. Number of GPU Training Time (hours) Epoch Hours per Epoch Reduction Ratio
KT’s internal training environment A100 (80GB) 2 960 300 3.20 29
Amazon SageMaker A100 (40GB) 32 24 210 0.11 1

The scalability of AWS allowed us to complete the training job 29 times faster than before using 32 GPUs instead of 2 on premises. As a result, using more GPUs on SageMaker would have significantly reduced training time with no difference in overall training costs.

Conclusion

Park Sang-min (Vision AI Serving Technology Team Leader) from the AI2XL Lab in KT’s Convergence Technology Center commented on the collaboration with AWS to develop the AI Food Tag model:

“Recently, as there are more transformer-based models in the vision field, the model parameters and required GPU memory are increasing. We are using lightweight technology to solve this issue, and it takes a lot of time, about a month to learn once. Through this PoC with AWS, we were able to identify the resource bottlenecks with help of SageMaker Profiler and Debugger, resolve them, and then use SageMaker’s data parallelism library to complete the training in about one day with optimized model code on four ml.p4d.24xlarge instances.”

SageMaker helped save Sang-min’s team weeks of time in model training and development.

Based on this collaboration on the vision model, AWS and the SageMaker team will continue to collaborate with KT on various AI/ML research projects to improve model development and service productivity through applying SageMaker capabilities.

To learn more about related features in SageMaker, check out the following:


About the authors

Youngjoon Choi, AI/ML Expert SA, has experienced enterprise IT in various industries such as manufacturing, high-tech, and finance as a developer, architect, and data scientist. He conducted research on machine learning and deep learning, specifically on topics like hyperparameter optimization and domain adaptation, presenting algorithms and papers. At AWS, he specializes in AI/ML across industries, providing technical validation using AWS services for distributed training/large scale models and building MLOps. He proposes and reviews architectures, aiming to contribute to the expansion of the AI/ML ecosystem.

Jung Hoon Kim is an account SA of AWS Korea. Based on experiences in applications architecture design, development and systems modeling in various industries such as hi-tech, manufacturing, finance and public sector, he is working on AWS Cloud journey and workloads optimization on AWS for enterprise customers.

Rock Sakong is a researcher at KT R&D. He has conducted research and development for the vision AI in various fields and mainly conducted facial attributes (gender/glasses, hats, etc.)/face recognition technology related to the face. Currently, he is working on lightweight technology for the vision models.

Manoj Ravi is a Senior Product Manager for Amazon SageMaker. He is passionate about building next-gen AI products and works on software and tools to make large-scale machine learning easier for customers. He holds an MBA from Haas School of Business and a Masters in Information Systems Management from Carnegie Mellon University. In his spare time, Manoj enjoys playing tennis and pursuing landscape photography.

Robert Van Dusen is a Senior Product Manager with Amazon SageMaker. He leads frameworks, compilers, and optimization techniques for deep learning training.

Read More

Lifelong model editing in large language models: Balancing low-cost targeted edits and catastrophic forgetting

Lifelong model editing in large language models: Balancing low-cost targeted edits and catastrophic forgetting

Illustrated figure of lifelong model editing with GRACE. On the left is a question and the model’s existing answer to it (which is incorrect). Editing method needs to update it the correct answer. In the middle the architecture is shown where the language model is frozen and embeddings are extracted to retrieve appropriate values (new embeddings) from the codebook. On the right the codebook is shown which includes a set of trainable embeddings.

Large language models (LLMs) are profoundly useful for a vast array of difficult tasks. But they sometimes make unpredictable mistakes or perpetuate biased language. These sorts of errors tend to arise over time due to changes in the underlying data or in user behavior. This necessitates targeted, cost-effective fixes to these models and the real-world applications they support.

Repeated pretraining or finetuning might be used to achieve these fixes. However, these solutions are often too computationally expensive. For example (opens in new tab), LLAMA 1 was trained for 21 days on 2,048 A100 GPUs, costing over $2.4 million. Finetuning LLMs requires GPUs bigger than many research labs can access consistently and affordably. Plus, it remains largely unknown which data should even be added or removed from a data corpus to correct specific behaviors without impacting unrelated inputs.

To keep LLMs up to date without expensive training, model editing has recently been proposed as a paradigm for making targeted updates to big models. Most model editors update a model once, injecting a batch of corrections. But mistakes are often discovered sequentially over time and must be corrected quickly. In other words, lifelong model editing where a stream of mistakes are encountered and must be addressed immediately is essential when the models are deployed. This requires making many edits sequentially, a setting in which existing editors are known to fail. Success here means correcting all edits in sequence, without forgetting old fixes and without decaying performance on unrelated inputs. But what exactly is an edit? In Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors, three types of edits are considered:

  1. Updating factual knowledge. Let’s say we have a pre-trained question-answering model: We pass questions in, and the model returns answers. But as the world changes, these answers become outdated. For example, the answer to “Who is the president of the U.S.?” should change after an election. Therefore, an edit is a tuple – or an ordered sequence of values – containing a question (e.g., “Who is the president of the U.S.?”) and the correct answer (e.g., “Biden”) for the question.
  2. Keeping up with flipping labels. Ground truth in classification tasks can change over time. For example, when U.S. courts use new language to describe existing topics, a document’s correct label can change. In such a case, a model trained on the old labels must be corrected. Targeted edits are especially important when only specific types of data are relabeled, which is common. In this case, an edit is a paired input (e.g., court document) and a new label (e.g., topic).
  3. Mitigating fabrication and incoherence in LLMs. A key challenge in using LLMs is avoiding instances where they generate language that is ungrounded in reality. But this might happen more in some models than others. Therefore, when it does happen, the ensuing edit should be as small as possible. To explore the effectiveness of this approach, the researchers consider mitigating this problem when generating biographies of famous people. Upon identifying hand-annotated fabrications, they edit an LLM to instead produce corresponding sentences from real Wikipedia articles. In this case, an edit is a prompt and a corresponding response, which the existing model finds unlikely.
This figure shows an overview of the proposed approach. On the left it shows a question (what was the latest pandemic?) and the model’s existing answer to it (Swine Flu) which is a wrong answer, editing method needs to update it the correct answer (COVID). In the middle the architecture is shown where the language model is frozen and embeddings are extracted to retrieve appropriate values (new embeddings) from the codebook. In the right the codebook is shown which includes a set of trainable embeddings.
Figure 1. Overview of lifelong model editing with GRACE. Models make important errors that must be corrected. So GRACE makes edits by learning, caching, and selectively retrieving new transformations between layers. Over long sequences of edits, which appear sporadically and require quick fixes, GRACE codebooks grow and adapt.

To make cost-effective edits to LLMs, we propose an approach referred to as General Retrieval Adaptors for Continual Editing, or GRACE. GRACE is the first method to enable thousands of sequential edits to any pre-trained model architecture using only streaming errors. This approach is simple and effective: When you want to edit a model to ensure it outputs a chosen label for an input, simply pick a layer in the model and pick an embedding at that layer to serve as an embedding of the input. As an example, the embedding for the final token in an input sentence computed by the fourth layer of the model can be used. Then, this embedding is cached and a new embedding is learned such that if the new is substituted for the old embeddings, the model produces the desired response. The original embedding is referred to as a key, and the learned embedding as a value. Learning the value is straightforward via gradient descent. The key and value are then stored in a codebook, which acts as a dictionary. If you then pass in a new input to the model, after computing its embedding, referred to as a query, new queries can be compared to existing keys. If a query matches a key, one can look up the value and apply the edit. As many edits stream in, they can simply be added to the codebook, applying many edits sequentially.

A table with four main columns labeled
Table 1. GRACE outperforms existing model editors by successfully editing models without forgetting previous edits or unrelated training data. On the zsRE and SCOTUS datasets, GRACE achieves substantial compression. On the Hallucination dataset, GRACE successfully embeds long future sequences of tokens into cached values.

But isn’t this just memorization? How can generalizable edits be achieved without memorizing every new input? Instead of always adding new keys, every new key is paired with an influence radius, which is a ball surrounding any new key with a radius of ε. Then, if any query lands inside this ε-ball, the key’s corresponding value is retrieved and the edit is applied. Thus, inputs that are similar to any cached edits will also be updated. Occasionally, when creating a new key, its ε-ball may conflict with another key. In this case, when the conflicting keys have different values, their ε-balls are set to just barely touch. If they have the same values, the existing key’s ε are increased to include the new input. Tuning ε helps achieve small codebooks that are generalizable and can successfully make thousands of edits in a row.

To compare GRACE’s capability with existing methods to make generalizable edits, two bidirectional models (T5 and BERT) and one autoregressive model (GPT2-XL) were used. For question-answering (QA), T5 was used along with a QA dataset (opens in new tab) that includes questions targeted for relation extraction. Twenty rephrased versions of each question were extracted, 10 of them were used during editing and the other 10 as unseen holdouts. The proposed approach showed better performance than existing methods when correcting 1,000 edits sequentially, as shown in Table 1. It used only 137 keys to make the edits, which shows the efficiency of the proposed method. This level of generalization is better than prior work and shows promising potential for correcting future mistakes. The proposed approach can also successfully edit a BERT model that was trained on U.S. Supreme Court documents (opens in new tab) from before 1992 and tested on documents after 1992 for which the label distribution shifted. An experiment was also conducted using GRACE with an autoregressive model, GPT2-XL, to edit mistakes related to fabrication, which were promising encouraging long sequences of edits. For example, when asked to generate a biography of Brian Hughes, GRACE successfully encouraged GPT2-XL to respond: “Brian Hughes (born 1955) is a Canadian guitarist whose work draws from both the smooth jazz and world music genres,” which exactly matches the requested biography using only one cached value. Another interesting observation was that GRACE edits were robust to the choice of edited layer, though later layers were harder to edit. Further, a clear balance was observed between memorization and generalization when choosing ε, as shown in Figure 2. Finally, a key feature of GRACE is that the codebook is detached from the pre-trained model, leaving its weights untouched. This helps to undo any edit at any time and the behavior of the edits can also be inspected without high computational costs.

A figure containing eight subfigures displayed as two rows and four columns. Each row represents a value of epsilon, the hyperparameter in our proposed method that controls generalization. The first row shows epsilon of 0.1, the second row shows and epsilon of 0.2. Each column shows a line graph for a different metric. Each line shows how the metric changes throughout 3,000 sequential edits to a T5 QA model using the zsRE dataset. Each plot contains four lines; each line is for editing a different T5 block. We compare edits made to blocks 0, 2, 4, and 6. Starting with the left column, we consider the TRR metric, which measures model accuracy on its original testing data after editing. For epsilon of 0.1, the TRR metric remains at 0.72 the entire time, with no difference per block. For epsilon of 3.0, the TRR metric remains at 0.72 only for Block 6 and is lowest for Block 0, dropping to below 0.7 by the end of editing. The second column shows the ERR metric, which is accuracy on previous edits at each step. Here we see that for epsilon of 0.1, Blocks 2, 4, and 6 remain high at nearly 1.0. For epsilon of 3.0, Block 6 remains high, while the other blocks drop to around 0.9. The third column shows Holdout performance on unseen holdout edits, which are rephrasings of seen edits. After each edit, we run the all holdout edits through the edited model and record its accuracy on the whole set. Therefore, in both plots, we see the performance increase over time, as the edits slowly cover more rephrasings of the holdout set. This way, we measure GRACE’s generalization. We see that for epsilon of 0.1, Block 6 generalizes slightly better than other blocks. But for epsilon of 3.0, Block 6 underperforms other methods significantly. Block 0 is slightly better and Blocks 2 and 4 are much better. In the final colum, we report the number of keys used by GRACE to make all 3,000 edits. Here we see that Block 6 simply memorizes all edits, as its number of keys grows linearly. After 3,000 edits, there are 3,000 keys. But for Blocks 0, 2, and 4, this value saturates, with edits being made with far fewer keys. When epsilon is 0.1, these blocks use about 2,000 keys. When epsilon is 3.0, Block 0 uses about 1,000 keys while Blocks 2 and 4 use around 800 keys. This demonstrates how picking the block and epsilon can impact the trade-off between memorization and generalization. Overall, it appears that generalizable edits happen in interior model layers as opposed to the first or last layers and for slightly-larger choices of epsilon.
Figure 2. GRACE’s performance when editing different blocks of a T5 model for different choices of epsilon. This choice drives a balance between accuracy on unrelated training data (TRR) and previous edits (ERR), as shown by a small epsilon (a) and a big epsilon (b).

Summary

GRACE presents a different perspective for model editing, where representations are directly modified and transformations are cached sequentially. Edits can be done thousands of times sequentially, where a small set of codebooks are maintained throughout the editing. This step reduces the gap for deployment needs of real-world applications where edits are discovered over time and should be addressed in a cost-effective manner. By correcting behaviors efficiently and expanding sequential editing to other model properties, like fairness and privacy, this work can potentially enable a new class of solutions for adapting LLMs to meet user needs over long deployment lifetimes.

The post Lifelong model editing in large language models: Balancing low-cost targeted edits and catastrophic forgetting appeared first on Microsoft Research.

Read More

Abstracts: November 20, 2023

Abstracts: November 20, 2023

Microsoft Research Podcast: Abstracts, November 23, 2023

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Shrey Jain (opens in new tab), a Technical Project Manager at Microsoft Research, and Dr. Zoë Hitzig (opens in new tab), a junior fellow at the Harvard Society of Fellows, discuss their work on contextual confidence, which presents a framework to understand and more meaningfully address the increasingly sophisticated challenges generative AI poses to communication.

Transcript

[MUSIC PLAYS] 

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.  

[MUSIC FADES] 

Today I’m talking to Shrey Jain, an applied scientist at Microsoft Research, and Dr. Zoë Hitzig, a junior fellow at the Harvard Society of Fellows. Shrey and Zoë are coauthors of a paper called Contextual Confidence and Generative AI, and you can read a preprint of this paper now on arXiv. Shrey Jain, Zoë Hitzig. Thanks for joining us on Abstracts

SHREY JAIN: Thank you.


ZOË HITZIG: Great to be here. 

HUIZINGA: Shrey, let’s start out with you. What problem does this research address, what made you care about it, and why should we care about it, too? 

JAIN: Yeah, so right now, there’s a lot of discussion as towards what the impacts of generative AI is on communication, and there’s been a lot of different terms being thrown around amongst AI policy researchers or news organizations, such as disinformation, misinformation, copyright, fair use, social engineering, deception, persuasion, and it makes it really hard to understand the precise new problem that this new technology, generative AI, brings towards our understanding of how we communicate with one another. And so what we wanted to do in this research is try to present a framework to sort of guide both policymakers, AI labs, and other people working in this field to have a better understanding of the challenges that generative AI presents and accordingly be able to meet those challenges with a set of strategies that are precise to the way we understand it and also try to uncover new strategies that might remain hidden in other frameworks that are traditionally being used to address these challenges. 

HUIZINGA: So expand on that a little bit in terms of, you know, what made you care about it? What was the prompt—no pun intended—for generative AI that got you concerned about this? And what kinds of things ought we to be thinking about in terms of why we should care about it, too? 

JAIN: Yeah, there’s a lot of different areas under which generative AI presents new challenges to our ability to communicate, one of which was literally the ability to communicate with close family members. I think we’ve seen a lot of these deception attacks kind of happening on the elderly, who have been susceptible to these attacks pre-generative AI in the past, and only thought that that might become more concerning. I no longer live in a city where my family lives, and so the only way to communicate with them is through a digital form now, and if we don’t have confidence in that interaction, I’m scared of the repercussions that has more broadly. And, you know, being at Microsoft Research, having worked on initiatives related to election integrity, was also starting to think through the impacts that this could have at a much wider scale. And so that’s kind of what prompted us to start thinking through how we can meet that challenge and try to make a contribution to mitigate that risk. 

HUIZINGA: Zoë, almost all research builds on existing foundations, so what body of work does your research draw from, and how does this paper add to the literature?

HITZIG: I’d say this research paper draws on a few different strands of literature. First, there has been a lot of social theorizing and philosophizing about what exactly constitutes privacy, for example, in the digital age. And in particular, there’s a theory of privacy that we find very compelling and we draw a lot from in the paper, which is a theory called contextual integrity, which was put forward by Helen Nissenbaum, a researcher at Cornell Tech. And what contextual integrity says is that rather than viewing privacy as a problem that’s fundamentally about control over one’s personal information or a problem about secrecy, contextual integrity says that an information flow is private when it respects the norms that have been laid down by the sender and the receiver. And so there’s a violation of privacy, according to Nissenbaum’s theory, when there’s a violation of contextual integrity. So we really take this idea from Nissenbaum and extend it to think about situations that, first of all, didn’t come up before because they’re unusual and generative AI poses new kinds of challenges. But second of all, we extend Nissenbaum’s theory into thinking not just about privacy but also authenticity. So what is authenticity? Well, in some sense, we say it’s a violation of a norm of truthfulness. What we really add to this theorizing on privacy is that we offer a perspective that shows that privacy questions and questions about authenticity or authentication can’t really be separated. And so on the theory side, we are extending the work of media scholars and internet scholars like Helen Nissenbaum but also like danah boyd and Nancy Baym, who are Microsoft Researchers, as well, to say, look, privacy and authenticity online can no longer be separated. We have to see them as two sides of the same coin. They’re both fundamentally about contextual confidence, the confidence we have in our ability to identify the context of a communication and to protect the context of that communication. So that’s sort of the theory side. And then, of course, our other big contribution is all the practical stuff that takes up the bulk of the paper. 

HUIZINGA: Right. Shrey, let’s talk about methodology for a minute. And this is a unique paper in terms of methodology. How would you describe your research approach for this work, and where does it fit on the spectrum of methodology for research? 

JAIN: Yeah, this paper is definitely a bit different from the conventional empirical research that might be done in the space. But it’s more of a policy or, I guess, framework paper where we try to provide both, as Zoë just commented on, the theory for contextual confidence but then also try to illustrate how we might apply contextual confidence as a framework to the existing challenges that generative AI presents. And so in order to make this framework and the theory that we present useful, we wanted to try to understand both what are the set of challenges that fall into these categories of identifying context and protecting context. So, specifically, how does generative AI threaten our ability to identify and protect? And trying to take a bird’s eye view in understanding those challenges. And then also kind of doing what might look similar to like a literature review but different in a way that we collect all of the different strategies that are typically talked about in the conversation but then in using contextual confidence as a framework realizing that new strategies that aren’t as well discussed in the conversation might be useful to meet these different challenges. And so from a methodology perspective, it’s almost like we’re applying the theory to uncover new … both new strategies that might be useful in this moment and then finding ways to give concrete examples of us applying that framework to existing technological questions that both people in the industry, as well as in policy, are thinking through when it comes to these questions about generative AI.

HUIZINGA: Zoë, for me, the most interesting part of research papers is that little part that comes after the phrase “and what we found was …” So, um, how would you describe what your takeaways were here, and how did you present them in the paper? 

HITZIG: That’s a great question. That’s also my favorite question to ask myself when I’ve completed a project. I think the biggest thing that I learned through writing this paper and collaborating with Shrey was really, for the first time, I forced myself to interrogate the foundations of effective communication and to understand what it is that we rely on when, you know, we pass a stranger on the street and look at them in a certain way and somehow know what it means. Or what we rely on to understand, you know, how our partner is feeling when they speak to us over coffee in the morning. I was really forced to step back and think about the foundations of effective communication. And in doing so, what we realized was that an ability to both identify and protect context is what allows us to communicate effectively. And in some sense, this very basic fact made me see how sort of shockingly robust our communication systems have been in the past and yet at the same time how fragile they could be in the face of this alarming new technology that has the power to fundamentally upset these two foundational processes of identifying and protecting context in communication. I would also say, on the question of what we found, you know, my first answer was about these sort of fundamental insights that had never occurred to me before about what makes communication effective and how it’s threatened. But also, I was able to understand and sort of make sense of so many of the strategies and tools that are in discussion today. And, for example, I was able to see, in a totally new light, the importance of, for example, something as simple as having some form of digital identification or the simplicity of, you know, what makes a good password and what can we do to strengthen passwords in the future. So there was this strong theoretical insight, but also that theoretical insight was enormously powerful in helping us organize the very concrete discussions around particular tools and technologies. 

HUIZINGA: Hmm. It’s a beautiful segue into the question I have for Shrey, which is talking about the real-world impact of this work. You know, coming down to the practical side from the theoretical, who does this work help and how? 

JAIN: Yeah, I want to also add a disclaimer in that, in this podcast, we kind of present generative AI almost as this like villain to communication. [LAUGHTER] I think that there’s also a possibility that generative AI improves communication, and I want to make sure that we acknowledge the optimism that we do see here. I think part of the real-world impact is that we want to mitigate the cost that generative AI brings to communications without hurting the utility at the same time. When applying contextual confidence in contrast to, say, views of traditional privacy, which may view privacy in terms of secrecy or information integrity, we hopefully will find a way in ensuring that the utility of these models is not significantly lost. And so in terms of the real-world impact, I think when it comes to both policies that are being set right now, norms around how we interact with these models, or any startup founder or person who’s deploying these tools, when they think about the reviews that they’re doing from a privacy point of view or a compliance point of view, we hope that contextual confidence can guide, as a framework, a way that protects users of these tools along with not hindering model capabilities in that form. 

HUIZINGA: Zoë, if there was one takeaway that you want our listeners to get from this work on contextual confidence, what would it be?

HITZIG: What I hope that readers will take away is, on the one hand, the key conceptual insight of the paper, which is that in today’s digital communication and in the face of generative AI, privacy questions and authenticity questions cannot be separated. And in addition, I hope that we’ve communicated the full force of that insight and shown how this framework can be useful in evaluating the deployment of new tools and new technologies. 

HUIZINGA: Finally, Shrey, what outstanding questions or challenges remain here, and how do you hope to help answer them? 

JAIN: In the paper, we have presented a theoretical understanding of contextual confidence and present various different strategies that might be able to help meet the challenges that generative AI presents to our ability to both identify and protect context, but we don’t know how those strategies themselves may or may not undermine the goals that we’re presenting because we haven’t done empirical research to know how a given strategy might work across different types of people. In fact, the strategies could undermine the initial goals that we intend. A verification stamp for some might enhance credibility, but for those who may not trust the institution verifying, it may actually reduce credibility. And I think there’s a lot of empirical research both on the tool development, usability, and then back to guiding the theoretical framework that we present that we want to continue to refine and work on as this framework hopefully becomes more widely used. 

HUIZINGA: Well, Shrey Jain, Zoë Hitzig, thank you for joining us today, and to our listeners, thanks for tuning in.  

[MUSIC PLAYS] 

If you’re interested in learning more about contextual confidence and generative AI, you can find a link to the preprint of this paper at aka.ms/abstracts, or you can read it on arXiv. See you next time on Abstracts

[MUSIC FADES]

The post Abstracts: November 20, 2023 appeared first on Microsoft Research.

Read More

What Is a SuperNIC?

What Is a SuperNIC?

Generative AI is the latest turn in the fast-changing digital landscape. One of the groundbreaking innovations making it possible is a relatively new term: SuperNIC. 

What Is a SuperNIC?

SuperNIC is a new class of network accelerators designed to supercharge hyperscale AI workloads in Ethernet-based clouds. It provides lightning-fast network connectivity for GPU-to-GPU communication, achieving speeds reaching 400Gb/s using remote direct memory access (RDMA) over converged Ethernet (RoCE) technology.  

SuperNICs combine the following unique attributes: 

  • High-speed packet reordering to ensure that data packets are received and processed in the same order they were originally transmitted. This maintains the sequential integrity of the data flow. 
  • Advanced congestion control using real-time telemetry data and network-aware algorithms to manage and prevent congestion in AI networks. 
  • Programmable compute on the input/output (I/O) path to enable customization and extensibility of network infrastructure in AI cloud data centers. 
  • Power-efficient, low-profile design to efficiently accommodate AI workloads within constrained power budgets. 
  • Full-stack AI optimization, including compute, networking, storage, system software, communication libraries and application frameworks. 

NVIDIA recently unveiled the world’s first SuperNIC tailored for AI computing, based on the BlueField-3 networking platform. It’s a part of the NVIDIA Spectrum-X platform, where it integrates seamlessly with the Spectrum-4 Ethernet switch system.  

Together, the NVIDIA BlueField-3 SuperNIC and Spectrum-4 switch system form the foundation of an accelerated computing fabric specifically designed to optimize AI workloads. Spectrum-X consistently delivers high network efficiency levels, outperforming traditional Ethernet environments. 

“In a world where AI is driving the next wave of technological innovation, the BlueField-3 SuperNIC is a vital cog in the machinery,” said Yael Shenhav, vice president of DPU and NIC products at NVIDIA. “SuperNICs ensure that your AI workloads are executed with efficiency and speed, making them foundational components for enabling the future of AI computing.” 

The Evolving Landscape of AI and Networking 

The AI field is undergoing a seismic shift, thanks to the advent of generative AI and large language models. These powerful technologies have unlocked new possibilities, enabling computers to handle new tasks.  

AI success relies heavily on GPU-accelerated computing to process mountains of data, train large AI models, and enable real-time inference. This new compute power has opened new possibilities, but it has also challenged Ethernet cloud networks. 

Traditional Ethernet, the technology that underpins internet infrastructure, was conceived to offer broad compatibility and connect loosely coupled applications. It wasn’t designed to handle the demanding computational needs of modern AI workloads, which involve tightly coupled parallel processing, rapid data transfers and unique communication patterns — all of which demand optimized network connectivity.  

Foundational network interface cards (NICs) were designed for general-purpose computing, universal data transmission and interoperability. They were never designed to cope with the unique challenges posed by the computational intensity of AI workloads.  

Standard NICs lack the requisite features and capabilities for efficient data transfer, low latency and the deterministic performance crucial for AI tasks. SuperNICs, on the other hand, are purpose-built for modern AI workloads. 

SuperNIC Advantages in AI Computing Environments 

Data processing units (DPUs) deliver a wealth of advanced features, offering high throughput, low-latency network connectivity and more. Since their introduction in 2020, DPUs have gained popularity in the realm of cloud computing, primarily due to their capacity to offload, accelerate and isolate data center infrastructure processing. 

Although DPUs and SuperNICs share a range of features and capabilities, SuperNICs are uniquely optimized for accelerating networks for AI. The chart below shows how they compare: 

NVIDIA BlueField SuperNIC and DPU comparison chart

Distributed AI training and inference communication flows depend heavily on network bandwidth availability for success. SuperNICs, distinguished by their sleek design, scale more effectively than DPUs, delivering an impressive 400Gb/s of network bandwidth per GPU.  

The 1:1 ratio between GPUs and SuperNICs within a system can significantly enhance AI workload efficiency, leading to greater productivity and superior outcomes for enterprises.  

The sole purpose of SuperNICs is to accelerate networking for AI cloud computing. Consequently, it achieves this goal using less computing power than a DPU, which requires substantial computational resources to offload applications from a host CPU.  

The reduced computing requirements also translate to lower power consumption, which is especially crucial in systems containing up to eight SuperNICs. 

Additional distinguishing features of the SuperNIC include its dedicated AI networking capabilities. When tightly integrated with an AI-optimized NVIDIA Spectrum-4 switch, it offers adaptive routing, out-of-order packet handling and optimized congestion control. These advanced features are instrumental in accelerating Ethernet AI cloud environments. 

Revolutionizing AI Cloud Computing

The NVIDIA BlueField-3 SuperNIC offers several benefits that make it key for AI-ready infrastructure: 

  • Peak AI workload efficiency: The BlueField-3 SuperNIC is purpose-built for network-intensive, massively parallel computing, making it ideal for AI workloads. It ensures that AI tasks run efficiently without bottlenecks. 
  • Consistent and predictable performance: In multi-tenant data centers where numerous tasks are processed simultaneously, the BlueField-3 SuperNIC ensures that each job and tenant’s performance is isolated, predictable and unaffected by other network activities. 
  • Secure multi-tenant cloud infrastructure: Security is a top priority, especially in data centers handling sensitive information. The BlueField-3 SuperNIC maintains high security levels, enabling multiple tenants to coexist while keeping data and processing isolated. 
  • Extensible network infrastructure: The BlueField-3 SuperNIC isn’t limited in scope it’s highly flexible and adaptable to a myriad of other network infrastructure needs. 
  • Broad server manufacturer support: The BlueField-3 SuperNIC fits seamlessly into most enterprise-class servers without excessive power consumption in data centers.

Learn more about NVIDIA BlueField-3 SuperNICs, including how they integrate across NVIDIA’s data center platforms, in the whitepaper: Next-Generation Networking for the Next Wave of AI. 

Read More

Emerging practices for Society-Centered AI

Emerging practices for Society-Centered AI

The first of Google’s AI Principles is to “Be socially beneficial.” As AI practitioners, we’re inspired by the transformative potential of AI technologies to benefit society and our shared environment at a scale and swiftness that wasn’t possible before. From helping address the climate crisis to helping transform healthcare, to making the digital world more accessible, our goal is to apply AI responsibly to be helpful to more people around the globe. Achieving global scale requires researchers and communities to think ahead — and act — collectively across the AI ecosystem.

We call this approach Society-Centered AI. It is both an extension and an expansion of Human-Centered AI, focusing on the aggregate needs of society that are still informed by the needs of individual users, specifically within the context of the larger, shared human experience. Recent AI advances offer unprecedented, societal-level capabilities, and we can now methodically address those needs — if we apply collective, multi-disciplinary AI research to society-level, shared challenges, from forecasting hunger to predicting diseases to improving productivity.

The opportunity for AI to benefit society increases each day. We took a look at our work in these areas and at the research projects we have supported. Recently, Google announced that 70 professors were selected for the 2023 Award for Inclusion Research Program, which supports academic research that addresses the needs of historically marginalized groups globally. Through evaluation of this work, we identified a few emerging practices for Society-Centered AI:

  • Understand society’s needs
    Listening to communities and partners is crucial to understanding major issues deeply and identifying priority challenges to address. As an emerging general purpose technology, AI has the potential to address major global societal issues that can significantly impact people’s lives (e.g., educating workers, improving healthcare, and improving productivity). We have found the key to impact is to be centered on society’s needs. For this, we focus our efforts on goals society has agreed should be prioritized, such as the United Nations’ 17 Sustainable Development Goals, a set of interconnected goals jointly developed by more than 190 countries to address global challenges.
  • Collective efforts to address those needs
    Collective efforts bring stakeholders (e.g., local and academic communities, NGOs, private-public collaborations) into a joint process of design, development, implementation, and evaluation of AI technologies as they are being developed and deployed to address societal needs.
  • Measuring success by how well the effort addresses society’s needs
    It is important and challenging to measure how well AI solutions address society’s needs. In each of our cases, we identified primary and secondary indicators of impact that we optimized through our collaborations with stakeholders.

Why is Society-Centered AI important?

The case examples described below show how the Society-Centered AI approach has led to impact across topics, such as accessibility, health, and climate.

Understanding the needs of individuals with non-standard speech

There are millions of people with non-standard speech (e.g., impaired articulation, dysarthria, dysphonia) in the United States alone. In 2019, Google Research launched Project Euphonia, a methodology that allows individual users with non-standard speech to train personalized speech recognition models. Our success began with the impact we had on each individual who is now able to use voice dictation on their mobile device.

Euphonia started with a Society-Centered AI approach, including collective efforts with the non-profit organizations ALS Therapy Development Institute and ALS Residence Initiative to understand the needs of individuals with amyotrophic lateral sclerosis (ALS) and their ability to use automatic speech recognition systems. Later, we developed the world’s largest corpus of non-standard speech recordings, which enabled us to train a Universal Speech Model to better recognize disordered speech by 37% on real conversation word error rate (WER) measurement. This also led to the 2022 collaboration between the University of Illinois Urbana-Champaign, Alphabet, Apple, Meta, Microsoft, and Amazon to begin the Speech Accessibility Project, an ongoing initiative to create a publicly available dataset of disordered speech samples to improve products and make speech recognition more inclusive of diverse speech patterns. Other technologies that use AI to help remove barriers of modality and languages, include live transcribe, live caption and read aloud.

Focusing on society’s health needs

Access to timely maternal health information can save lives globally: every two minutes a woman dies during pregnancy or childbirth and 1 in 26 children die before reaching age five. In rural India, the education of expectant and new mothers around key health issues pertaining to pregnancy and infancy required scalable, low-cost technology solutions. Together with ARMMAN, Google Research supported a program that uses mobile messaging and machine learning (ML) algorithms to predict when women might benefit from receiving interventions (i.e., targeted preventative care information) and encourages them to engage with the mMitra free voice call program. Within a year, the mMitra program has shown a 17% increase in infants with tripled birth weight and a 36% increase in women understanding the importance of taking iron tablets during pregnancy. Over 175K mothers and growing have been reached through this automated solution, which public health workers use to improve the quality of information delivery.

These efforts have been successful in improving health due to the close collective partnership among the community and those building the AI technology. We have adopted this same approach via collaborations with caregivers to address a variety of medical needs. Some examples include: the use of the Automated Retinal Disease Assessment (ARDA) to help screen for diabetic retinopathy in 250,000 patients in clinics around the world; our partnership with iCAD to bring our mammography AI models to clinical settings to aid in breast cancer detection; and the development of Med-PaLM 2, a medical large language model that is now being tested with Cloud partners to help doctors provide better patient care.

Compounding impact from sustained efforts for crisis response

Google Research’s flood prediction efforts began in 2018 with flood forecasting in India and expanded to Bangladesh to help combat the catastrophic damage from yearly floods. The initial efforts began with partnerships with India’s Central Water Commission, local governments and communities. The implementation of these efforts used SOS Alerts on Search and Maps, and, more recently, broadly expanded access via Flood Hub. Continued collaborations and advancing an AI-based global flood forecasting model allowed us to expand this capability to over 80 countries across Africa, the Asia-Pacific region, Europe, and South, Central, and North America. We also partnered with networks of community volunteers to further amplify flood alerts. By working with governments and communities to measure the impact of these efforts on society, we refined our approach and algorithms each year.

We were able to leverage those methodologies and some of the underlying technology, such as SOS Alerts, from flood forecasting to similar societal needs, such as wildfire forecasting and heat alerts. Our continued engagements with organizations led to the support of additional efforts, such as the World Meteorological Organization’s (WMO) Early Warnings For All Initiative. The continued engagement with communities has allowed us to learn about our users’ needs on a societal level over time, expand our efforts, and compound the societal reach and impact of our efforts.

Further supporting Society-Centered AI research

We recently funded 18 university research proposals exemplifying a Society-Centered AI approach, a new track within the Google Award for Inclusion Research Program. These researchers are taking the Society-Centered AI methodology and helping create beneficial applications across the world. Examples of some of the projects funded include:

  • AI-Driven Monitoring of Attitude Polarization in Conflict-Affected Countries for Inclusive Peace Process and Women’s Empowerment: This project’s goal is to create LLM-powered tools that can be used to monitor peace in online conversations in developing nations. The initial target communities are where peace is in flux and the effort will put a particular emphasis on mitigating polarization that impacts women and promoting harmony.
  • AI-Assisted Distributed Collaborative Indoor Pollution Meters: A Case Study, Requirement Analysis, and Low-Cost Healthy Home Solution for Indian Communities: This project is looking at the usage of low-cost pollution monitors combined with AI-assisted methodology for identifying recommendations for communities to improve air quality and at home health. The initial target communities are highly impacted by pollution, and the joint work with them includes the goal of developing how to measure improvement in outcomes in the local community.
  • Collaborative Development of AI Solutions for Scaling Up Adolescent Access to Sexual and Reproductive Health Education and Services in Uganda: This project’s goal is to create LLM-powered tools to provide personalized coaching and learning for users’ needs on topics of sexual and reproductive health education in low-income settings in Sub-Saharan Africa. The local societal need is significant, with an estimated 25% rate of teenage pregnancy, and the project aims to address the needs with a collective development process for the AI solution.

Future direction

Focusing on society’s needs, working via multidisciplinary collective research, and measuring the impact on society helps lead to AI solutions that are relevant, long-lasting, empowering, and beneficial. See the AI for the Global Goals to learn more about potential Society-Centered AI research problems. Our efforts with non-profits in these areas is complementary to the research that we are doing and encouraging. We believe that further initiatives using Society-Centered AI will help the collective research community solve problems and positively impact society at large.

Acknowledgements

Many thanks to the many individuals who have worked on these projects at Google including Shruti Sheth, Reena Jana, Amy Chung-Yu Chou, Elizabeth Adkison, Sophie Allweis, Dan Altman, Eve Andersson, Ayelet Benjamini, Julie Cattiau, Yuval Carny, Richard Cave, Katherine Chou, Greg Corrado, Carlos De Segovia, Remi Denton, Dotan Emanuel, Ashley Gardner, Oren Gilon, Taylor Goddu, Brigitte Hoyer Gosselink, Jordan Green, Alon Harris, Avinatan Hassidim, Rus Heywood, Sunny Jansen, Pan-Pan Jiang, Anton Kast, Marilyn Ladewig, Ronit Levavi Morad, Bob MacDonald, Alicia Martin, Shakir Mohamed, Philip Nelson, Moriah Royz, Katie Seaver, Joel Shor, Milind Tambe, Aparna Taneja, Divy Thakkar, Jimmy Tobin, Katrin Tomanek, Blake Walsh, Gal Weiss, Kasumi Widner, Lihong Xi, and teams.

Read More

What's new in TensorFlow 2.15

What’s new in TensorFlow 2.15

Posted by the TensorFlow team

TensorFlow 2.15 has been released! Highlights of this release (and 2.14) include a much simpler installation method for NVIDIA CUDA libraries for Linux, oneDNN CPU performance optimizations for Windows x64 and x86, full availability of tf.function types, an upgrade to Clang 17.0.1, and much more! For the full release note, please check here.

Note: Release updates on the new multi-backend Keras will be published on keras.io starting with Keras 3.0. For more information, please check here.

TensorFlow Core

NVIDIA CUDA libraries for Linux

The tensorflow pip package has a new, optional installation method for Linux that installs necessary NVIDIA CUDA libraries through pip. As long as the NVIDIA driver is already installed on the system, you may now run pip install tensorflow[and-cuda] to install TensorFlow’s NVIDIA CUDA library dependencies in the Python environment. Aside from the NVIDIA driver, no other pre-existing NVIDIA CUDA packages are necessary. In TensorFlow 2.15, CUDA has been upgraded to version 12.2.

oneDNN CPU performance optimizations

For Windows x64 & x86 packages, oneDNN optimizations are now enabled by default on X86 CPUs. These optimizations can be enabled or disabled by setting the environment variable TF_ENABLE_ONEDNN_OPTS to 1 (enable) or 0 (disable) before running TensorFlow. To fall back to default settings, simply unset the environment variable.

tf.function

tf.function types are now fully available.

  • tf.types.experimental.TraceType now allows custom tf.function inputs to declare Tensor decomposition and type casting support. 
  • Introducing tf.types.experimental.FunctionType as the comprehensive representation of the signature of tf.function callables. It can be accessed through the function_type property of tf.function’s and ConcreteFunctions. See the tf.types.experimental.FunctionType documentation for more details. 
  • Introducing tf.types.experimental.AtomicFunction as the fastest way to perform TF computations in Python. This capability can be accessed through the inference_fn property of ConcreteFunctions. (Does not support gradients.) See the tf.types.experimental.AtomicFunction documentation for how to call and use it.

Upgrade to Clang 17.0.1 and CUDA 12.2

TensorFlow PIP packages are now being built with Clang 17 and CUDA 12.2 to improve performance for NVIDIA Hopper-based GPUs. Moving forward, Clang 17 will be the default C++ compiler for TensorFlow. We recommend upgrading your compiler to Clang 17 when building TensorFlow from source.

Read More

From Guangzhou to Los Angeles, Automakers Dazzle With AI-Powered Vehicles

From Guangzhou to Los Angeles, Automakers Dazzle With AI-Powered Vehicles

Good news for car lovers: Two acclaimed auto shows, taking place now through next week, are delighting attendees with displays of next-generation automotive designs powered by AI.

Hundreds of thousands of auto enthusiasts worldwide are expected to visit Guangzhou, China — known as the city of flowers — to attend its auto show, running through Sunday, Nov. 26. The event will feature new developments in electric vehicles (EVs) and automated driving, with 1,100 vehicles on display.

And across the world, in the city of angels, the Los Angeles Auto Show is expected to reach its highest-ever attendee numbers. Also running through Nov. 26, the show will include classic and exotic cars from private collections, as well as a public test track where attendees can get behind the wheel of the latest EVs.

Auto Guangzhou

Human Horizons, NIO, ZEEKR

One of the most anticipated reveals is from Lotus, which is showcasing its new fully electric Emeya Hyper-GT that launched in September. This stunning luxury vehicle with sports-car agility achieves an impressive suite of intelligent features, powered by dual NVIDIA DRIVE Orin processors. The high-performance processing power enables drivers to enjoy the car’s safe and secure driving capabilities and supports future features through over-the-air (OTA) updates.

With an eye on safety, Emeya carries 34 state-of-the-art surround sensors for diverse and redundant sensor data processing in real time — giving drivers added confidence when behind the wheel. With DRIVE Orin embedded in the back of the vehicle, Emeya delivers advanced driver assistance system (ADAS) capabilities and also offers the built-in headroom to support an autonomous future.

Emeya Hyper-GT is built on Lotus’ innovative Electric Premium Architecture, which underpins the Eletre Hyper-SUV as well, also powered by NVIDIA DRIVE Orin.

In addition, Lotus is showcasing its full range of Lotus electric vehicles, including the Evija hypercar, Eletre Hyper-SUV and Type 136, its recently launched electric bike. Emira, Lotus’ final internal combustion engine vehicle, is also on display.

Several other NVIDIA DRIVE ecosystem members are featuring their next-gen EVs at Auto Guangzhou:

  • DENZA, a joint venture between BYD and Mercedes-Benz, is highlighting the intelligent-driving features of its N7 model lineup. All N7 models can be equipped with the NVIDIA DRIVE Orin system-on-a-chip. In addition, DENZA is showcasing its next generation of car configurators built on NVIDIA Omniverse Cloud, enabling consumers to customize various aspects, such as colors, materials and more.
  • Human Horizons is showcasing the HiPhi Y SUV, which includes a second-generation intelligent door system design, a range of more than 497 miles on a single charge and an autonomous-driving system powered by NVIDIA DRIVE Orin.
  • JI YUE, a Geely Holding and Baidu joint venture, is displaying the ROBOCAR JiYue 01, the first model to offer the premium intelligent driving feature ROBO Drive Max, powered by NVIDIA DRIVE Orin.
  • NIO is exhibiting eight models equipped with Banyan, NIO’s Smart Digital System, and Adam, an NVIDIA-powered supercomputer. Adam uses NVIDIA DRIVE Orin to enable advanced automated-driving features and allow continuous OTA upgrades.
  • XPENG is unveiling and commencing pre-sales for its next-generation, super-intelligent, seven-seater MPV, the XPENG X9. The company is also showcasing the XPENG G6, the XPENG P7i and the XPENG G9, all featuring NVIDIA DRIVE Orin.
  • ZEEKR introduces the ZEEKR 007. The electric sedan is ZEEKR’s first sedan and fourth model, powered by NVIDIA DRIVE Orin. The ZEEKR 007 has a sensor suite consisting of one lidar, 12 high-definition cameras and five millimeter-wave radars for advanced smart assist driving functions with multiple redundancy.

In addition, NVIDIA DRIVE ecosystem partner DeepRoute.ai, a smart driving solutions provider, is demonstrating its DeepRoute Driver 3.0, designed to offer a non-geofenced solution for automated vehicles.

Los Angeles Auto Show

At the LA Auto Show, Lucid Motors is showcasing its highly anticipated Gravity SUV, which will begin production in late 2024. Powered by NVIDIA DRIVE, the luxury SUV features supercar levels of performance and an impressive battery range to mitigate range anxiety.

In addition, Lucid is displaying its Lucid Air sedan, including the Air Pure and Air Touring models. All of these vehicles feature the future-ready DreamDrive Pro driver-assistance system, powered by the NVIDIA DRIVE platform.

Learn more about the industry-leading designs and technologies NVIDIA is developing with its automotive partners.

Read More

AI to See ‘Major Second Wave,’ NVIDIA CEO Says in Fireside Chat With iliad Group Exec

AI to See ‘Major Second Wave,’ NVIDIA CEO Says in Fireside Chat With iliad Group Exec

European startups will get a massive boost from a new generation of AI infrastructure, NVIDIA founder and CEO Jensen Huang said Friday in a fireside chat with iliad Group Deputy CEO Aude Durand — and it’s coming just in time.

“We’re now seeing a major second wave,” Huang said of the state of AI during a virtual appearance at Scaleway’s ai-PULSE conference in Paris for an audience of more than 1,000 in-person attendees.

Two elements are propelling this force, Huang explained in a conversation livestreamed from Station F, the world’s largest startup campus, which Huang joined via video conference from NVIDIA’s headquarters in Silicon Valley.

First, “a recognition that every region and every country needs to build their sovereign AI,” Huang said. Second, the “adoption of AI in different industries,” as generative AI spreads throughout the world, Huang explained.

“So the types of breakthroughs that we’re seeing in language I fully expect to see in digital biology and manufacturing and robotics,” Huang said, noting this could create big opportunities for Europe with its rich digital biology and healthcare industries. “And of course, Europe is also home of some of the largest industrial manufacturing companies.”

Praise for France’s AI Leadership

Durand kicked off the conversation by asking Huang about his views on the European AI ecosystem, especially in France, where the government has invested millions of euros in AI research and development.

“Europe has always been rich in AI expertise,” Huang said, noting that NVIDIA works with 4,000 startups in Europe, more than 400 of them in France alone, pointing to Mistral, Qubit Pharmaceuticals and Poolside AI.

“At the same time, you have to really get the computing infrastructure going,” Huang said. “And this is the reason why Scaleway is so important to the advancement of AI in France” and throughout Europe, Huang said.

Highlighting the critical role of data in AI’s regional growth, Huang noted companies’ increasing awareness of the value of training AI with region-specific data. AI systems need to reflect the unique cultural and industrial nuances of each region, an approach gaining traction across Europe and beyond.

NVIDIA and Scaleway: Powering Europe’s AI Revolution

Scaleway, a subsidiary of iliad Group, a major European telecoms player, is doing its part to kick-start that second wave in Europe, offering cloud credits for access to its AI supercomputer cluster, which packs 1,016 NVIDIA H100 Tensor Core GPUs.

As a regional cloud service provider, Scaleway also provides sovereign infrastructure that ensures access and compliance with EU data protection laws, which is critical for businesses with a European footprint.

Regional members of the NVIDIA Inception program, which provides development assistance to startups, will also be able to access NVIDIA AI Enterprise software on Scaleway Marketplace.

The software includes the NVIDIA NeMo framework and pretrained models for building LLMs, NVIDIA RAPIDS for accelerated data science and NVIDIA Triton Inference Server and NVIDIA TensorRT-LLM for boosting inference.

Revolutionizing AI With Supercomputing Prowess

Recapping a month packed with announcements, Huang explained how NVIDIA is rapidly advancing high performance computing and AI worldwide to provide the infrastructure needed to power this second wave.

These systems are, in effect,  “supercomputers,” Huang said, with AI systems now among the world’s most powerful.

They include Scaleway’s newly available Nabuchodonosor supercomputer, or “Nabu,” an NVIDIA DGX SuperPOD with 127 NVIDIA DGX H100 systems, which will help startups in France and across Europe scale up AI workloads.

“As you know the Scaleway system that we brought online, Nabu, is not your normal computer,” Huang said. “In every single way, it’s a supercomputer.”

Such systems are underpinning powerful new services.

Earlier this week, NVIDIA announced an AI Foundry service on Microsoft Azure, aimed at accelerating the development of customized generative AI applications.

Huang highlighted NVIDIA AI foundry’s appeal to a diverse user base, including established enterprises such as Amdocs, Getty Images, SAP and ServiceNow.

Huang noted that JUPITER, to be hosted at the Jülich facility, in Germany, and poised to be Europe’s premier exascale AI supercomputer, will run on 24,000 NVIDIA GH200 Grace Hopper Superchips, offering unparalleled computational capacity for diverse AI tasks and simulations.

Huang touched on NVIDIA’s just-announced HGX H200 AI computing platform, built on NVIDIA’s Hopper architecture and featuring the H200 Tensor Core GPU. Set for release in Q2 of 2024, it promises to redefine industry standards.

He also detailed NVIDIA’s strategy to develop ‘AI factories,’ advanced data centers that power diverse applications across industries, including electric vehicles, robotics, and generative AI services.

Open Source

Finally, Durand asked Huang about the role of open source and open science in AI.

Huang said he’s a “huge fan” of open source. “Let’s acknowledge that without open source, how would AI have made the tremendous progress it has over the last decade,” Huang said.

“And so the ability for open source to energize the vibrancy and pull in the research and pull in the engagement of every startup, every researcher, every industry is really quite vital,” Huang said. “And you’re seeing it play out just presently, now going forward.”

Friday’s fireside conversation was part of Scaleway’s ai-PULSE conference, showcasing the latest AI trends and innovations. To learn more, visit https://www.ai-pulse.eu/.

Read More

Join us at the third Women in ML Symposium!

Join us at the third Women in ML Symposium!

Posted by Sharbani Roy – Senior Director, Product Management, Google

We’re back with the third annual Women in Machine Learning Symposium on December 7, 2023! Join us virtually from 9:30 am to 1:00 pm PT for an immersive and insightful set of deep dives for every level of Machine Learning experience.

The Women in ML Symposium is an inclusive event for anyone passionate about the transformative fields of Machine Learning (ML) and Artificial Intelligence (AI). Dive into the latest advancements in generative AI, explore the intricacies of privacy-preserving AI, dig into the underlying accelerators and ML frameworks that power models, and uncover practical applications of ML across multiple industries.

Our event offers sessions for all expertise levels, from beginners to advanced practitioners. Hear about what’s new in ML and building with Google AI from our keynote speakers, gain insights from seasoned industry leaders across Google Health, Nvidia, Adobe, and more – and discover a wealth of knowledge on topics ranging from foundational AI concepts to open source tools, techniques, and beyond.

RSVP today to secure your spot and explore our exciting agenda. We can’t wait to see you there!

Read More