Combining low-rank approximation, a residual binary autoencoder, and a new loss function enables a fivefold increase in compression ratio.
Simpleperf case study: Fast initialization of TFLite’s Memory Arena
Posted by Alan Kelly, Software Engineer
One of our previous articles, Optimizing TensorFlow Lite Runtime Memory, discusses how TFLite’s memory arena minimizes memory usage by sharing buffers between tensors. This means we can run models on even smaller edge devices. In today’s article, I will describe the performance optimization of the memory arena initialization so that our users get the benefit of low memory usage with little additional overhead.
ML is normally deployed on-device as part of a larger pipeline. TFLite is used because it’s fast and lightweight, but the rest of the pipeline must also be fast. Profiling on the target device with representative data lets us identify the slowest parts of the pipeline so that we can optimize the most important part of the code.
In this article, I will describe the profiling and optimization of TFLite’s memory arena with instructions on how to use Simpleperf and visualize the results. Sample commands are given. It is assumed that the Android NDK is installed and that you have a development device that you can connect to using adb.
Simpleperf
Simpleperf comes with some scripts to make it easier to use. run_simpleperf_on_device.py pushes simpleperf to the device and runs your binary with the given arguments.
/usr/lib/android-ndk/simpleperf/run_simpleperf_on_device.py record --call-graph fp /data/local/tmp/my_binary arg0 arg1 …
This will generate the output file perf.data which you must then copy back to your computer.
adb pull /data/local/tmp/perf.data
You then generate the binary cache which contains all the information needed later to generate a useful profile.
/usr/lib/android-ndk/simpleperf/binary_cache_builder.py -lib /your/binaries/folder -i perf.data
And generate the proto buffer used for visualization:
/usr/lib/android-ndk/simpleperf/pprof_proto_generator.py --ndk_path=/path/to/android-ndk -i perf.data -o profile.proto
You can then display the results using pprof:
pprof -http :8888 profile.proto
And open localhost:8888 in your browser to view the profile. I find flame graphs to be the most useful.
Optimizing TFLite’s Memory Arena
ArenaPlanner::ExecuteAllocations accounts for 54.3% of the runtime of this model. I was expecting ML operators such as fully connected layers or convolutions to be the bottleneck of this model, not runtime overhead. This is a particularly bad case: the memory arena overhead isn't this severe for every model, but improvements here will benefit all models. This model has variable input sizes and many dynamic tensors, whose output size isn't known until operator evaluation, and these trigger frequent tensor re-allocations. This really is as bad as it gets. Let's zoom in on the profile.
InterpreterInfo::num_tensors() accounts for 10.4% of the runtime. This function is expensive because it is a virtual function that calls another function, and it is called within a loop. I would never have suspected this.
for (int i = 0; i < static_cast<int>(graph_info_->num_tensors()); ++i) {
The arena planner does not create or destroy tensors, so the number of tensors is constant. Let's cache it.
const int num_tensors = static_cast<int>(graph_info_->num_tensors());
Our next piece of low-hanging fruit is InterpreterInfo::tensor(unsigned long), another virtual function that does bounds checking and then returns a pointer to a tensor. Tensors are stored in an array, so let's add a function to get a pointer to this array. The commits are here and here.
After these simple changes, the runtime of this model is reduced by 25% and the overhead of the memory allocator is cut in half. Simpleperf made identifying these inefficiencies easy! Time to profile again to measure the impact of these changes.
ArenaPlanner::CalculateAllocations is now the most expensive function at 12.7%. This calls two functions: SimpleMemoryArena::Allocate and ArenaPlanner::CreateTensorAllocationVector.
Although ArenaPlanner::CreateTensorAllocationVector is the cheaper of the two, its code is far simpler, so it might be easier to optimize. This function identifies which tensors are allocated between two nodes in the graph and then sorts them by size, because a greedy-by-size allocation algorithm is used in which the largest tensors are allocated first. The structure of the graph is constant, so we can store a map of the tensors allocated at each node. Instead of checking each tensor in the model to see if it is allocated between the two nodes, we can identify the tensors to be allocated by iterating through the map. The cost of ArenaPlanner::CreateTensorAllocationVector has gone from 4.8% to 0.8% of the runtime. The code can be seen here. The sort does not appear in the profile, so we ignore it.
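To illustrate the idea (the actual implementation is C++ and linked above; the names below are hypothetical), a minimal Python sketch builds the node-to-tensors map once and reuses it instead of scanning every tensor for every node pair:

from collections import defaultdict

def build_allocation_map(tensor_first_use):
    # tensor_first_use: tensor id -> node at which the tensor is first needed.
    alloc_at_node = defaultdict(list)
    for tensor_id, node in tensor_first_use.items():
        alloc_at_node[node].append(tensor_id)
    return alloc_at_node

def tensors_to_allocate(alloc_at_node, first_node, last_node, tensor_sizes):
    # Collect only the tensors allocated between the two nodes, then sort
    # them largest-first for the greedy-by-size allocator.
    ids = []
    for node in range(first_node, last_node + 1):
        ids.extend(alloc_at_node.get(node, []))
    return sorted(ids, key=lambda t: tensor_sizes[t], reverse=True)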
The next function to look at is ArenaPlanner::ResolveTensorAllocation which is 10.9% of the runtime after the previous optimizations. This function resets each tensor’s data pointer after allocation. However, these pointers don’t always change. How about keeping track of and only updating the ones which change? After this change, ArenaPlanner::ResolveTensorAllocation doesn’t appear in the profile anymore.
Let's now take a look at allocation and deallocation. SimpleMemoryArena::Allocate accounts for 7% and SimpleMemoryArena::Deallocate for 6.8% of the runtime. Records of all allocations in the arena are stored in a vector ordered by their offsets within the memory arena. Entries in this sorted data structure are inserted, removed and searched, and these operations are all O(N) in a vector. Could an std::multimap with the offset as the key be better? A multimap is needed because the records are ordered by their offsets and there may be multiple tensors with the same offset. Removal and insertion would be O(log N), but search would still be O(N), as we are searching for the tensor ID and not the offset. The best way to find out is to test and profile.
Replacing the vector with a multimap actually slows down the arena code: it is almost three times slower than using a vector! While this goes against intuition, it is commonly found when optimizing code. Operations on a set or a map have linear or logarithmic complexities; however, there is also a constant factor in the complexity, and this factor is higher than the constant factor for a vector. We also iterate through the records, which is much cheaper for a vector than for a list or multimap. A list was also tested, coming in at twice as slow as a vector.
Deallocation can still be improved though. SimpleMemoryArena::Deallocate iterates through the records and, when it finds the record to deallocate, removes it from the vector. This has O(N²) complexity. The memcpy seen in the profile above comes from the frequent calls to std::vector::erase. It is much more efficient to mark records to be erased and then erase them in one pass using std::remove_if. The second optimization here is to look at how tensors are typically deallocated: ArenaPlanner::ResetAllocationsAfter deallocates all tensors from a node until the end of the graph. To address this, SimpleMemoryArena::DeallocateAfter(int32_t node) was added, which iterates once through all the records, marking those which are allocated after the node. SimpleMemoryArena::ResolveDeallocations erases these in one pass, making deallocation O(N). After these changes, ResetAllocationsAfter no longer appears in the profile! Commits are here and here.
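A minimal Python sketch of the mark-then-erase idea follows (the real code is C++ and uses std::remove_if; the record fields here are hypothetical):

def deallocate_after(records, node):
    # Single pass: only mark records whose tensor is allocated after `node`.
    for rec in records:
        if rec["first_node"] > node:
            rec["erase"] = True

def resolve_deallocations(records):
    # A second single pass removes all marked records at once, the Python
    # analogue of std::remove_if followed by vector::erase.
    records[:] = [rec for rec in records if not rec.get("erase")]

Both passes are O(N), so resetting all allocations after a node no longer degrades to O(N²).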
The profile now looks very different, with the overhead of tensor allocation gone from 49.9% of the runtime to 11%. This profile already looks much more reasonable. SimpleMemoryArena::Allocate is the last function left to optimize. For each tensor, this function iterates through the vector of records trying to find space for the current allocation, which makes the whole pass O(N²). This is a fundamental limitation of the greedy-by-size algorithm: efficient use of memory comes at the cost of increased overhead. Although the complexity can't be reduced, N can. We process nodes in the order in which they are executed. Allocation information for tensors which have been deallocated on already executed nodes is not needed anymore; it only slows things down. Records are purged periodically so that only active records are considered. On a large model, this significantly reduces N. ArenaPlanner::ExecuteAllocations is no longer the most expensive function! It has gone from 11% to 6%, and a fully connected operator is now the most expensive function in the profile, which is what we expect when profiling neural network inference.
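Before moving on, here is a simplified Python sketch of a greedy first-fit scan over only the active records; the field names and purge interval are hypothetical, and the real arena also handles alignment and buffer reuse:

class ArenaSketch:
    def __init__(self, purge_every=64):
        self.records = []          # active allocations: offset, size, lifetime
        self.allocs = 0
        self.purge_every = purge_every

    def allocate(self, size, first_node, last_node):
        self.allocs += 1
        if self.allocs % self.purge_every == 0:
            # Drop records whose tensors were deallocated on nodes that have
            # already executed, so the scan below sees a smaller N.
            self.records = [r for r in self.records
                            if r["last_node"] >= first_node]
        # First-fit search through the records ordered by offset.
        offset = 0
        for rec in sorted(self.records, key=lambda r: r["offset"]):
            if rec["offset"] - offset >= size:
                break              # the gap before this record is large enough
            offset = max(offset, rec["offset"] + rec["size"])
        self.records.append({"offset": offset, "size": size,
                             "first_node": first_node, "last_node": last_node})
        return offset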
This is what a neural network profile should look like. Time should be spent running your model’s operators, not in the inference runtime.
The optimized memory arena is now publicly available as part of TensorFlow 2.13.
Next Steps
Today's post walked you through an example of Simpleperf helping to find easy-to-fix inefficiencies in TFLite's memory arena that would never have been found by just looking at the code. Pprof can display annotated source code, disassembly and graphs, making it easy to find the bottlenecks in your on-device pipelines.
Generate creative advertising using generative AI deployed on Amazon SageMaker
Creative advertising has the potential to be revolutionized by generative AI (GenAI). You can now create a wide variety of novel images, such as product shots, by retraining a GenAI model and providing a few inputs into the model, such as textual prompts (sentences describing the scene and objects to be produced by the model). This technique has shown promising results starting in 2022 with the explosion of a new class of foundation models (FMs) called latent diffusion models such as Stable Diffusion, Midjourney, and Dall-E-2. However, to use these models in production, the generation process requires constant refining to generate consistent outputs. This often means creating a large number of sample images of the product and clever prompt engineering, which makes the task difficult at scale.
In this post, we explore how this transformative technology can be harnessed to generate captivating and innovative advertisements at scale, especially when dealing with large catalogs of images. By using the power of GenAI, specifically through the technique of inpainting, we can seamlessly create image backgrounds, resulting in visually stunning and engaging content and reducing unwanted image artifacts (termed model hallucinations). We also delve into the practical implementation of this technique by utilizing Amazon SageMaker endpoints, which enable efficient deployment of the GenAI models driving this creative process.
We use inpainting as the key technique within GenAI-based image generation because it offers a powerful solution for replacing missing elements in images. However, this presents certain challenges. For instance, precise control over the positioning of objects within the image can be limited, leading to potential issues such as image artifacts, floating objects, or unblended boundaries, as shown in the following example images.
To overcome this, we propose in this post to strike a balance between creative freedom and efficient production by generating a multitude of realistic images using minimal supervision. To scale the proposed solution for production and streamline the deployment of AI models in the AWS environment, we demonstrate it using SageMaker endpoints.
In particular, we propose to split the inpainting process into a set of layers, each one potentially with a different set of prompts. The process can be summarized as the following steps:
- First, we prompt for a general scene (for example, “park with trees in the back”) and randomly place the object on that background.
- Next, we add a layer in the lower mid-section of the object by prompting where the object lies (for example, “picnic on grass, or wooden table”).
- Finally, we add a layer similar to the background layer on the upper mid-section of the object using the same prompt as the background.
The benefit of this process is improved realism: the object is perceived with better scaling and positioning relative to the background environment, which matches human expectations. The following figure shows the steps of the proposed solution.
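In addition to the figure, the following is a minimal sketch of that three-step layering, assuming a hypothetical inpaint(image, mask, prompt) helper that wraps the Stable Diffusion Inpainting endpoint and fills the nonzero region of the mask; it is not the notebook code shown later:

import numpy as np

def layered_inpainting(object_img, object_mask, inpaint,
                       background_prompt, surface_prompt):
    # object_img: HxWx3 image with the product already placed on a canvas.
    # object_mask: HxW array, 1 where the product is, 0 elsewhere.
    h, _ = object_mask.shape
    # Step 1: fill everything except the product with the general scene.
    scene = inpaint(object_img, 1 - object_mask, background_prompt)
    # Step 2: re-prompt the lower mid-section so the product sits on a surface.
    lower = np.zeros_like(object_mask)
    lower[h // 2:, :] = 1 - object_mask[h // 2:, :]
    scene = inpaint(scene, lower, surface_prompt)
    # Step 3: blend the upper mid-section back toward the background prompt.
    upper = np.zeros_like(object_mask)
    upper[:h // 2, :] = 1 - object_mask[:h // 2, :]
    return inpaint(scene, upper, background_prompt)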
Solution overview
To accomplish these tasks, the following data flow is used:
- Segment Anything Model (SAM) and Stable Diffusion Inpainting models are hosted in SageMaker endpoints.
- A background prompt is used to create a generated background image using the Stable Diffusion model.
- A base product image is passed through SAM to generate a mask. The inverse of the mask is called the anti-mask.
- The generated background image and the mask, along with foreground prompts and negative prompts, are used as input to the Stable Diffusion Inpainting model to produce the generated intermediate background image.
- Similarly, the generated background image and the anti-mask, along with foreground prompts and negative prompts, are used as input to the Stable Diffusion Inpainting model to produce the generated intermediate foreground image.
- The final generated product image is obtained by combining the generated intermediate foreground image and the generated intermediate background image, as sketched below.
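One simple way to combine the two intermediate images is alpha compositing with the product mask. This is only a sketch, not the notebooks' postprocessing code, and it assumes the mask is a grayscale PIL image that is white over the product:

import numpy as np
from PIL import Image

def composite(intermediate_foreground, intermediate_background, mask):
    # Keep the generated foreground where the product mask is set and the
    # generated background everywhere else (the anti-mask region).
    fg = np.asarray(intermediate_foreground, dtype=np.float32)
    bg = np.asarray(intermediate_background, dtype=np.float32)
    alpha = np.asarray(mask.convert("L"), dtype=np.float32)[..., None] / 255.0
    out = alpha * fg + (1.0 - alpha) * bg
    return Image.fromarray(out.astype(np.uint8))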
Prerequisites
We have developed an AWS CloudFormation template that will create the SageMaker notebooks used to deploy the endpoints and run inference.
You will need an AWS account with AWS Identity and Access Management (IAM) roles that provide access to the following:
- AWS CloudFormation
- SageMaker
- Although SageMaker endpoints provide instances to run ML models, heavy workloads like generative AI models require GPU-enabled SageMaker endpoints. Refer to Amazon SageMaker Pricing for more information about pricing.
- We use the NVIDIA A10G-enabled instance ml.g5.2xlarge to host the models.
- Amazon Simple Storage Service (Amazon S3)
For more details, check out the GitHub repository and the CloudFormation template.
Mask the area of interest of the product
In general, we need to provide an image of the object that we want to place and a mask delineating the contour of the object. This can be done using tools such as Amazon SageMaker Ground Truth. Alternatively, we can automatically segment the object using an AI tool such as the Segment Anything Model (SAM), assuming that the object is in the center of the image.
Use SAM to generate a mask
With SAM, an advanced image segmentation model, we can effortlessly generate high-quality masks for various objects within images. SAM uses deep learning models trained on extensive datasets to accurately identify and segment objects of interest, providing precise boundaries and pixel-level masks. This breakthrough technology revolutionizes image processing workflows by automating the time-consuming and labor-intensive task of manually creating masks. With SAM, businesses and individuals can now rapidly generate masks for object recognition, image editing, computer vision tasks, and more, unlocking a world of possibilities for visual analysis and manipulation.
Host the SAM model on a SageMaker endpoint
We use the notebook 1_HostGenAIModels.ipynb to create SageMaker endpoints and host the SAM model. We use the inference code in inference_sam.py and package it into a code.tar.gz file, which we use to create the SageMaker endpoint. The code downloads the SAM model, hosts it on an endpoint, and provides an entry point to run inference and generate output:
# `bucket`, `role`, `sess`, and INSTANCE_TYPE are defined earlier in the notebook.
from datetime import datetime
from sagemaker import s3
from sagemaker.pytorch import PyTorchModel
from sagemaker.deserializers import JSONDeserializer

SAM_ENDPOINT_NAME = 'sam-pytorch-' + str(datetime.utcnow().strftime('%Y-%m-%d-%H-%M-%S-%f'))
prefix_sam = "SAM/demo-custom-endpoint"

# Upload the packaged inference code and create the PyTorch model object.
model_data_sam = s3.S3Uploader.upload("code.tar.gz", f's3://{bucket}/{prefix_sam}')
model_sam = PyTorchModel(entry_point='inference_sam.py',
                         model_data=model_data_sam,
                         framework_version='1.12',
                         py_version='py38',
                         role=role,
                         env={'TS_MAX_RESPONSE_SIZE': '2000000000', 'SAGEMAKER_MODEL_SERVER_TIMEOUT': '300'},
                         sagemaker_session=sess,
                         name='model-' + SAM_ENDPOINT_NAME)

# Deploy the model to a GPU-backed endpoint.
predictor_sam = model_sam.deploy(initial_instance_count=1,
                                 instance_type=INSTANCE_TYPE,
                                 deserializer=JSONDeserializer(),
                                 endpoint_name=SAM_ENDPOINT_NAME)
Invoke the SAM model and generate a mask
The following code is part of the 2_GenerateInPaintingImages.ipynb notebook, which is used to run the endpoints and generate results:
raw_image = Image.open("images/speaker.png").convert("RGB")
predictor_sam = PyTorchPredictor(endpoint_name=SAM_ENDPOINT_NAME,
deserializer=JSONDeserializer())
output_array = predictor_sam.predict(raw_image, initial_args={'Accept': 'application/json'})
mask_image = Image.fromarray(np.array(output_array).astype(np.uint8))
# save the mask image using PIL Image
mask_image.save('images/speaker_mask.png')
The following figure shows the resulting mask obtained from the product image.
Use inpainting to create a generated image
By combining the power of inpainting with the mask generated by SAM and the user’s prompt, we can create remarkable generated images. Inpainting utilizes advanced generative AI techniques to intelligently fill in the missing or masked regions of an image, seamlessly blending them with the surrounding content. With the SAM-generated mask as guidance and the user’s prompt as a creative input, inpainting algorithms can generate visually coherent and contextually appropriate content, resulting in stunning and personalized images. This fusion of technologies opens up endless creative possibilities, allowing users to transform their visions into vivid, captivating visual narratives.
Host a Stable Diffusion Inpainting model on a SageMaker endpoint
As with the SAM model, we use the notebook 1_HostGenAIModels.ipynb to create SageMaker endpoints and host the Stable Diffusion Inpainting model. We use the inference code in inference_inpainting.py and package it into a code.tar.gz file, which we use to create the SageMaker endpoint. The code downloads the Stable Diffusion Inpainting model, hosts it on an endpoint, and provides an entry point to run inference and generate output:
INPAINTING_ENDPOINT_NAME = 'inpainting-pytorch-' + str(datetime.utcnow().strftime('%Y-%m-%d-%H-%M-%S-%f'))
prefix_inpainting = "InPainting/demo-custom-endpoint"
model_data_inpainting = s3.S3Uploader.upload("code.tar.gz", f"s3://{bucket}/{prefix_inpainting}")
model_inpainting = PyTorchModel(entry_point='inference_inpainting.py',
model_data=model_data_inpainting,
framework_version='1.12',
py_version='py38',
role=role,
env={'TS_MAX_RESPONSE_SIZE':'2000000000', 'SAGEMAKER_MODEL_SERVER_TIMEOUT' : '300'},
sagemaker_session=sess,
name='model-'+INPAINTING_ENDPOINT_NAME)
predictor_inpainting = model_inpainting.deploy(initial_instance_count=1,
instance_type=INSTANCE_TYPE,
serializer=JSONSerializer(),
deserializer=JSONDeserializer(),
endpoint_name=INPAINTING_ENDPOINT_NAME,
volume_size=128)
Invoke the Stable Diffusion Inpainting model and generate a new image
Similarly to the step to invoke the SAM model, the notebook 2_GenerateInPaintingImages.ipynb is used to run the inference on the endpoints and generate results:
raw_image = Image.open("images/speaker.png").convert("RGB")
mask_image = Image.open('images/speaker_mask.png').convert('RGB')
prompt_fr = "table and chair with books"
prompt_bg = "window and couch, table"
negative_prompt = "longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, letters"
inputs = {}
inputs["image"] = np.array(raw_image)
inputs["mask"] = np.array(mask_image)
inputs["prompt_fr"] = prompt_fr
inputs["prompt_bg"] = prompt_bg
inputs["negative_prompt"] = negative_prompt
predictor_inpainting = PyTorchPredictor(endpoint_name=INPAINTING_ENDPOINT_NAME,
serializer=JSONSerializer(),
deserializer=JSONDeserializer())
output_array = predictor_inpainting.predict(inputs, initial_args={'Accept': 'application/json'})
gai_image = Image.fromarray(np.array(output_array[0]).astype(np.uint8))
gai_background = Image.fromarray(np.array(output_array[1]).astype(np.uint8))
gai_mask = Image.fromarray(np.array(output_array[2]).astype(np.uint8))
post_image = Image.fromarray(np.array(output_array[3]).astype(np.uint8))
# save the generated image using PIL Image
post_image.save('images/speaker_generated.png')
The following figure shows the refined mask, generated background, generated product image, and postprocessed image.
The generated product image uses the following prompts:
- Background generation – “chair, couch, window, indoor”
- Inpainting – “besides books”
Clean up
In this post, we use two GPU-enabled SageMaker endpoints, which account for the majority of the cost. These endpoints should be turned off to avoid extra cost when they are not being used. We have provided a notebook, 3_CleanUp.ipynb, which can assist in cleaning up the endpoints. We also use a SageMaker notebook to host the models and run inference. Therefore, it's good practice to stop the notebook instance if it's not being used.
Conclusion
Generative AI models are generally large-scale ML models that require specific resources to run efficiently. In this post, we demonstrated, using an advertising use case, how SageMaker endpoints offer a scalable and managed environment for hosting generative AI models such as the text-to-image foundation model Stable Diffusion. We demonstrated how two models can be hosted and run as needed, and multiple models can also be hosted from a single endpoint. This eliminates the complexities associated with infrastructure provisioning, scalability, and monitoring, enabling organizations to focus solely on deploying their models and serving predictions to solve their business challenges. With SageMaker endpoints, organizations can efficiently deploy and manage multiple models within a unified infrastructure, achieving optimal resource utilization and reducing operational overhead.
The detailed code is available on GitHub. The code demonstrates the use of AWS CloudFormation and the AWS Cloud Development Kit (AWS CDK) to automate the process of creating SageMaker notebooks and other required resources.
About the authors
Fabian Benitez-Quiroz is an IoT Edge Data Scientist in AWS Professional Services. He holds a PhD in Computer Vision and Pattern Recognition from The Ohio State University. Fabian is involved in helping customers run their machine learning models with low latency on IoT devices and in the cloud across various industries.
Romil Shah is a Sr. Data Scientist at AWS Professional Services. Romil has more than 6 years of industry experience in computer vision, machine learning, and IoT edge devices. He is involved in helping customers optimize and deploy their machine learning models for edge devices and on the cloud. He works with customers to create strategies for optimizing and deploying foundation models.
Han Man is a Senior Data Science & Machine Learning Manager with AWS Professional Services based in San Diego, CA. He has a PhD in Engineering from Northwestern University and has several years of experience as a management consultant advising clients in manufacturing, financial services, and energy. Today, he is passionately working with key customers from a variety of industry verticals to develop and implement ML and GenAI solutions on AWS.
AdaTape: Foundation model with adaptive computation and dynamic read-and-write
Adaptive computation refers to the ability of a machine learning system to adjust its behavior in response to changes in the environment. While conventional neural networks have a fixed function and computation capacity, i.e., they spend the same number of FLOPs for processing different inputs, a model with adaptive and dynamic computation modulates the computational budget it dedicates to processing each input, depending on the complexity of the input.
Adaptive computation in neural networks is appealing for two key reasons. First, the mechanism that introduces adaptivity provides an inductive bias that can play a key role in solving some challenging tasks. For instance, enabling different numbers of computational steps for different inputs can be crucial in solving arithmetic problems that require modeling hierarchies of different depths. Second, it gives practitioners the ability to tune the cost of inference through greater flexibility offered by dynamic computation, as these models can be adjusted to spend more FLOPs processing a new input.
Neural networks can be made adaptive by using different functions or computation budgets for various inputs. A deep neural network can be thought of as a function that outputs a result based on both the input and its parameters. To implement adaptive function types, a subset of parameters are selectively activated based on the input, a process referred to as conditional computation. Adaptivity based on the function type has been explored in studies on mixture-of-experts, where the sparsely activated parameters for each input sample are determined through routing.
Another area of research in adaptive computation involves dynamic computation budgets. Unlike in standard neural networks, such as T5, GPT-3, PaLM, and ViT, whose computation budget is fixed for different samples, recent research has demonstrated that adaptive computation budgets can improve performance on tasks where transformers fall short. Many of these works achieve adaptivity by using dynamic depth to allocate the computation budget. For example, the Adaptive Computation Time (ACT) algorithm was proposed to provide an adaptive computational budget for recurrent neural networks. The Universal Transformer extends the ACT algorithm to transformers by making the computation budget dependent on the number of transformer layers used for each input example or token. Recent studies, like PonderNet, follow a similar approach while improving the dynamic halting mechanisms.
In the paper "Adaptive Computation with Elastic Input Sequence", we introduce a new model that utilizes adaptive computation, called AdaTape. This model is a Transformer-based architecture that uses a dynamic set of tokens to create elastic input sequences, providing a unique perspective on adaptivity in comparison to previous works. AdaTape uses an adaptive tape reading mechanism to determine a varying number of tape tokens that are added to each input based on the input's complexity. AdaTape is simple to implement, provides an effective knob to increase accuracy when needed, and is much more efficient than other adaptive baselines because it directly injects adaptivity into the input sequence instead of the model depth. Finally, AdaTape offers better performance on standard tasks, like image classification, as well as algorithmic tasks, while maintaining a favorable quality and cost tradeoff.
Adaptive computation transformer with elastic input sequence
AdaTape uses both the adaptive function types and a dynamic computation budget. Specifically, for a batch of input sequences after tokenization (e.g., a linear projection of non-overlapping patches from an image in the vision transformer), AdaTape uses a vector representing each input to dynamically select a variable-sized sequence of tape tokens.
AdaTape uses a bank of tokens, called a “tape bank”, to store all the candidate tape tokens that interact with the model through the adaptive tape reading mechanism. We explore two different methods for creating the tape bank: an input-driven bank and a learnable bank.
The general idea of the input-driven bank is to extract a bank of tokens from the input while employing a different approach than the original model tokenizer for mapping the raw input to a sequence of input tokens. This enables dynamic, on-demand access to information from the input that is obtained using a different point of view, e.g., a different image resolution or a different level of abstraction.
In some cases, tokenization at a different level of abstraction is not possible, so an input-driven tape bank is not feasible, for example when it's difficult to further split each node in a graph transformer. To address this issue, AdaTape offers a more general approach for generating the tape bank by using a set of trainable vectors as tape tokens. This approach is referred to as the learnable bank and can be viewed as an embedding layer from which the model can dynamically retrieve tokens based on the complexity of the input example. The learnable bank enables AdaTape to generate a more flexible tape bank, providing it with the ability to dynamically adjust its computation budget based on the complexity of each input example; for example, more complex examples retrieve more tokens from the bank, which lets the model not only use the knowledge stored in the bank, but also spend more FLOPs processing it, since the input is now larger.
Finally, the selected tape tokens are appended to the original input and fed to the following transformer layers. For each transformer layer, the same multi-head attention is used across all input and tape tokens. However, two different feed-forward networks (FFN) are used: one for all tokens from the original input and the other for all tape tokens. We observed slightly better quality by using separate feed-forward networks for input and tape tokens.
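To make these two ideas concrete (variable-length tape selection and separate feed-forward networks), here is a rough NumPy sketch; it is only illustrative and not the paper's implementation, which uses an adaptive tape reading mechanism with a learned halting criterion:

import numpy as np

def select_tape_tokens(x, tape_bank, k):
    # x: (n, d) input tokens for one example; tape_bank: (B, d) candidate tokens.
    # Score every bank token against a summary of the input and append the
    # top-k; k can differ from example to example.
    query = x.mean(axis=0)
    scores = tape_bank @ query
    picked = np.argsort(scores)[-k:]
    return np.concatenate([x, tape_bank[picked]], axis=0), x.shape[0]

def split_ffn(tokens, n_input, ffn_input, ffn_tape):
    # Shared attention (not shown) runs over all tokens; the feed-forward step
    # uses one network for the input tokens and another for the tape tokens.
    out = np.empty_like(tokens)
    out[:n_input] = ffn_input(tokens[:n_input])
    out[n_input:] = ffn_tape(tokens[n_input:])
    return out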
AdaTape provides helpful inductive bias
We evaluate AdaTape on parity, a very challenging task for the standard Transformer, to study the effect of inductive biases in AdaTape. In the parity task, given a sequence of 1s, 0s, and -1s, the model has to predict the evenness or oddness of the number of 1s in the sequence. Parity is the simplest non-counter-free or periodic regular language, but, perhaps surprisingly, the task is unsolvable by the standard Transformer.
Evaluation on the parity task. The standard Transformer and Universal Transformer were unable to perform this task, both showing performance at the level of a random guessing baseline.
Despite being evaluated on short, simple sequences, both the standard Transformer and the Universal Transformer were unable to perform the parity task as they are unable to maintain a counter within the model. However, AdaTape outperforms all baselines, as it incorporates a lightweight recurrence within its input selection mechanism, providing an inductive bias that enables the implicit maintenance of a counter, which is not possible in standard Transformers.
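For reference, parity data is trivial to generate and to solve with a running counter, which is exactly the counter a standard Transformer cannot maintain internally. The sketch below is illustrative, not the paper's experimental setup:

import numpy as np

def parity_example(length, rng=np.random.default_rng(0)):
    # A sequence of 1s, 0s and -1s, labeled by the parity of the count of 1s.
    seq = rng.choice([-1, 0, 1], size=length)
    return seq, int((seq == 1).sum() % 2)

def parity_with_counter(seq):
    # The trivial counter solution; AdaTape's lightweight recurrence can
    # maintain such a counter implicitly.
    count = 0
    for token in seq:
        count += int(token == 1)
    return count % 2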
Evaluation on image classification
We also evaluate AdaTape on the image classification task. To do so, we trained AdaTape on ImageNet-1K from scratch. The figure below shows the accuracy of AdaTape and the baseline methods, including A-ViT and the Universal Transformer ViT (UViT and U2T), versus their speed (measured as the number of images processed per second). In terms of quality and cost tradeoff, AdaTape performs much better than the alternative adaptive transformer baselines. In terms of efficiency, larger AdaTape models (in terms of parameter count) are faster than smaller baselines. Such results are consistent with findings from previous work showing that adaptive model-depth architectures are not well suited to many accelerators, like the TPU.
We evaluate AdaTape by training on ImageNet from scratch. For A-ViT, we not only report their results from the paper but also re-implement A-ViT by training from scratch, i.e., A-ViT(Ours).
A study of AdaTape’s behavior
In addition to its performance on the parity task and ImageNet-1K, we also evaluated the token selection behavior of AdaTape with an input-driven bank on the JFT-300M validation set. To better understand the model’s behavior, we visualized the token selection results on the input-driven bank as heatmaps, where lighter colors mean that position is more frequently selected. The heatmaps reveal that AdaTape more frequently picks the central patches. This aligns with our prior knowledge, as central patches are typically more informative — especially in the context of datasets with natural images, where the main object is in the middle of the image. This result highlights the intelligence of AdaTape, as it can effectively identify and prioritize more informative patches to improve its performance.
We visualize the tape token selection heatmap of AdaTape-B/32 (left) and AdaTape-B/16 (right). The hotter / lighter color means the patch at this position is more frequently selected.
Conclusion
AdaTape is characterized by elastic sequence lengths generated by the adaptive tape reading mechanism. This also introduces a new inductive bias that enables AdaTape to have the potential to solve tasks that are challenging for both standard transformers and existing adaptive transformers. By conducting comprehensive experiments on image recognition benchmarks, we demonstrate that AdaTape outperforms standard transformers and adaptive architecture transformers when computation is held constant.
Acknowledgments
One of the authors of this post, Mostafa Dehghani, is now at Google DeepMind.
How AI is helping airlines mitigate the climate impact of contrails
We used AI to help airlines choose routes that are less likely to cause contrails, minimizing the environmental impact of flights.
Recent honors and awards for Amazon scientists
Researchers honored for their contributions to the scientific community.
SIGGRAPH Special Address: NVIDIA CEO Brings Generative AI to LA Show
As generative AI continues to sweep an increasingly digital, hyperconnected world, NVIDIA founder and CEO Jensen Huang made a thunderous return to SIGGRAPH, the world’s premier computer graphics conference.
“The generative AI era is upon us, the iPhone moment if you will,” Huang told an audience of thousands Tuesday during an in-person special address in Los Angeles.
News highlights include the next-generation GH200 Grace Hopper Superchip platform, NVIDIA AI Workbench — a new unified toolkit that introduces simplified model tuning and deployment on NVIDIA AI platforms — and a major upgrade to NVIDIA Omniverse with generative AI and OpenUSD.
The announcements are about bringing all of the past decade’s innovations — AI, virtual worlds, acceleration, simulation, collaboration and more — together.
“Graphics and artificial intelligence are inseparable, graphics needs AI, and AI needs graphics,” Huang said, explaining that AI will learn skills in virtual worlds, and that AI will help create virtual worlds.
Fundamental to AI, Real-Time Graphics
Five years ago at SIGGRAPH, NVIDIA reinvented graphics by bringing AI and real-time ray tracing to GPUs. But “while we were reinventing computer graphics with artificial intelligence, we were reinventing the GPU altogether for artificial intelligence,” Huang said.
The result: increasingly powerful systems such as the NVIDIA HGX H100, which harnesses eight GPUs — and a total of 1 trillion transistors — that offer dramatic acceleration over CPU-based systems.
“This is the reason why the world’s data centers are rapidly transitioning to accelerated computing,” Huang told the audience. “The more you buy, the more you save.”
To continue AI’s momentum, NVIDIA created the Grace Hopper Superchip, the NVIDIA GH200, which combines a 72-core Grace CPU with a Hopper GPU, and which went into full production in May.
Huang announced that NVIDIA GH200, which is already in production, will be complemented with an additional version with cutting-edge HBM3e memory.
He followed up on that by announcing the next-generation GH200 Grace Hopper superchip platform with the ability to connect multiple GPUs for exceptional performance and easily scalable server design.
Built to handle the world’s most complex generative workloads, spanning large language models, recommender systems and vector databases, the new platform will be available in a wide range of configurations.
The dual configuration — which delivers up to 3.5x more memory capacity and 3x more bandwidth than the current generation offering — comprises a single server with 144 Arm Neoverse cores, eight petaflops of AI performance, and 282GB of the latest HBM3e memory technology.
Leading system manufacturers are expected to deliver systems based on the platform in the second quarter of 2024.
NVIDIA AI Workbench Speeds Adoption of Custom Generative AI
To speed custom adoption of generative AI for the world’s enterprises, Huang announced NVIDIA AI Workbench. It provides developers with a unified, easy-to-use toolkit to quickly create, test and fine-tune generative AI models on a PC or workstation — then scale them to virtually any data center, public cloud or NVIDIA DGX Cloud.
AI Workbench removes the complexity of getting started with an enterprise AI project. Accessed through a simplified interface running on a local system, it allows developers to fine-tune models from popular repositories such as Hugging Face, GitHub and NGC using custom data. The models can then be shared easily across multiple platforms.
While hundreds of thousands of pretrained models are now available, customizing them with the many open-source tools available can be challenging and time consuming.
“In order to democratize this ability, we have to make it possible to run pretty much everywhere,” Huang said.
With AI Workbench, developers can customize and run generative AI in just a few clicks. It allows them to pull together all necessary enterprise-grade models, frameworks, software development kits and libraries into a unified developer workspace.
“Everybody can do this,” Huang said.
Leading AI infrastructure providers — including Dell Technologies, Hewlett Packard Enterprise, HP Inc., Lambda, Lenovo and Supermicro — are embracing AI Workbench for its ability to bring enterprise generative AI capability to wherever developers want to work — including a local device.
Huang also announced a partnership between NVIDIA and startup Hugging Face, which has 2 million users, that will put generative AI supercomputing at the fingertips of millions of developers building large language models and other advanced AI applications.
Developers will be able to access NVIDIA DGX Cloud AI supercomputing within the Hugging Face platform to train and tune advanced AI models.
“This is going to be a brand new service to connect the world’s largest AI community to the world’s best training and infrastructure,” Huang said.
In a video, Huang showed how AI Workbench and ChatUSD bring it all together: allowing a user to start a project on a GeForce RTX 4090 laptop and scale it seamlessly to a workstation, or to the data center, as it grows more complex.
Using Jupyter Notebook, a user can prompt the model to generate a picture of Toy Jensen in space. When the model provides a result that doesn’t work, because it’s never seen Toy Jensen, the user can fine-tune the model with eight images of Toy Jensen and then prompt it again to get a correct result.
Then with AI Workbench, the new model can be deployed to an enterprise application.
New NVIDIA AI Enterprise 4.0 Software Advances AI Deployment
In a further step to accelerate the adoption of generative AI, NVIDIA announced the latest version of its enterprise software suite, NVIDIA AI Enterprise 4.0.
NVIDIA AI Enterprise gives businesses access to the tools needed to adopt generative AI, while also offering the security and API stability required for large-scale enterprise deployments.
Major Omniverse Release Converges Generative AI, OpenUSD for Industrial Digitalization
Offering new foundation applications and services for developers and industrial enterprises to optimize and enhance their 3D pipelines with the OpenUSD framework and generative AI, Huang announced a major release of NVIDIA Omniverse, an OpenUSD-native development platform for building, simulating, and collaborating across tools and virtual worlds.
He also announced NVIDIA’s contributions to OpenUSD, the framework and universal interchange for describing, simulating and collaborating across 3D tools.
Updates to the Omniverse platform include advancements to Omniverse Kit — the engine for developing native OpenUSD applications and extensions — as well as to the NVIDIA Omniverse Audio2Face foundation app and spatial-computing capabilities.
Cesium, Convai, Move AI, SideFX Houdini and Wonder Dynamics are now connected to Omniverse via OpenUSD.
And expanding their collaboration across Adobe Substance 3D, generative AI and OpenUSD initiatives, Adobe and NVIDIA announced plans to make Adobe Firefly — Adobe’s family of creative generative AI models — available as APIs in Omniverse.
Omniverse users can now build content, experiences and applications that are compatible with other OpenUSD-based spatial computing platforms such as ARKit and RealityKit.
Huang announced a broad range of frameworks, resources and services for developers and companies to accelerate the adoption of Universal Scene Description, known as OpenUSD, including contributions such as geospatial data models, metrics assembly and simulation-ready, or SimReady, specifications for OpenUSD.
Huang also announced four new Omniverse Cloud APIs built by NVIDIA for developers to more seamlessly implement and deploy OpenUSD pipelines and applications.
- ChatUSD — Assisting developers and artists working with OpenUSD data and scenes, ChatUSD is a large language model (LLM) agent for generating Python-USD code scripts from text and answering USD knowledge questions.
- RunUSD — a cloud API that translates OpenUSD files into fully path-traced rendered images by checking compatibility of the uploaded files against versions of OpenUSD releases, and generating renders with Omniverse Cloud.
- DeepSearch — an LLM agent enabling fast semantic search through massive databases of untagged assets.
- USD-GDN Publisher — a one-click service that enables enterprises and software makers to publish high-fidelity, OpenUSD-based experiences to the Omniverse Cloud Graphics Delivery Network (GDN) from an Omniverse-based application such as USD Composer, as well as stream in real time to web browsers and mobile devices.
These contributions are an evolution of last week’s announcement of NVIDIA’s co-founding of the Alliance for OpenUSD along with Pixar, Adobe, Apple and Autodesk.
Powerful New Desktop Systems, Servers
Providing more computing power for all of this, Huang said NVIDIA and global workstation manufacturers are announcing powerful new RTX workstations for development and content creation in the age of generative AI and digitization.
The systems, including those from BOXX, Dell Technologies, HP and Lenovo, are based on NVIDIA RTX 6000 Ada Generation GPUs and incorporate NVIDIA AI Enterprise and NVIDIA Omniverse Enterprise software.
Separately, NVIDIA released three new desktop workstation Ada Generation GPUs — the NVIDIA RTX 5000, RTX 4500 and RTX 4000 — to deliver the latest AI, graphics and real-time rendering technology to professionals worldwide.
Huang also detailed how, together with global data center system manufacturers, NVIDIA is continuing to supercharge generative AI and industrial digitization with new NVIDIA OVX featuring the new NVIDIA L40S GPU, a powerful, universal data center processor design.
The powerful new systems will accelerate the most compute-intensive, complex applications, including AI training and inference, 3D design and visualization, video processing and industrial digitalization with the NVIDIA Omniverse platform.
NVIDIA Research Bringing New Capabilities
More innovations are coming, thanks to NVIDIA Research.
At the show’s Real Time Live Event, NVIDIA researchers will demonstrate a generative AI workflow that helps artists rapidly create and iterate on materials for 3D scenes, using text or image prompts to generate custom textured materials faster and with finer creative control.
And NVIDIA Research also demo’d how AI can take video conferencing to the next level with new 3D features. NVIDIA Research recently published a paper demonstrating how AI could power a 3D video-conferencing system with minimal capture equipment.
The production version of Maxine, now available in NVIDIA AI Enterprise, allows professionals, teams, creators and others to tap into the power of AI to create high-quality audio and video effects, even using standard microphones and webcams.
Watch Huang's full special address at NVIDIA's SIGGRAPH event site, where there are also details of labs, presentations and more happening throughout the show.
Startup Pens Generative AI Success Story With NVIDIA NeMo
Machine learning helped Waseem Alshikh plow through textbooks in college. Now he’s putting generative AI to work, creating content for hundreds of companies.
Born and raised in Syria, Alshikh spoke no English, but he was fluent in software, a talent that served him well when he arrived at college in Lebanon.
“The first day they gave me a stack of textbooks, each one a thousand pages thick, and all of it in English,” he recalled.
So, he wrote a program — a crude but effective statistical classifier that summarized the books — then he studied the summaries.
From Concept to Company
In 2014, he shared his story with May Habib, an entrepreneur he met while working in Dubai. They agreed to create a startup that could help marketing departments — which are always pressured to do more with less — use machine learning to quickly create copy for their web pages, blogs, ads and more.
“Initially, the tech was not there, until transformer models were announced — that was something we could build on,” said Alshikh, the startup’s CTO.
“We found a few engineers and spent almost six months building our first model, a neural network that barely worked and had about 128 million parameters,” an often-used measure of an AI model’s capability.
Along the way, the young company won some business, changed its name to Writer and connected with NVIDIA.
A Startup Accelerated
“Once we got introduced to NVIDIA NeMo, we were able to build industrial-strength models with three, then 20 and now 40 billion parameters, and we’re still scaling,” he said.
NeMo is an application framework that helps companies curate their training datasets, build and customize large language models (LLMs), and run them in production at scale. Organizations everywhere from Korea to Sweden are using it to customize LLMs for their local languages and industries.
“Before NeMo, it took us four and a half months to build a new billion-parameter model. Now we can do it in 16 days — this is mind blowing,” Alshikh said.
Models Make Opportunities
In the first six months of this year, the startup’s team of fewer than 20 AI engineers used NeMo to develop 10 models, each with 30 billion parameters or more.
That translates into big opportunities. Hundreds of businesses now use Writer’s models that NeMo customized for finance, healthcare, retail and other vertical markets.
The startup’s customer list includes household names like Deloitte, L’Oreal, Intuit, Uber and many Fortune 500 companies.
Writer’s success with NeMo is just the start of the story. Dozens of other companies have already downloaded NeMo.
The software will be available soon for anyone to use. It’s part of NVIDIA AI Enterprise, full-stack software optimized to accelerate generative AI workloads and backed by enterprise-grade support, security and application programming interface stability.
A Trillion API Calls a Month
Some customers run Writer’s models on their own systems or cloud services. Others ask Writer to host the models, or they use Writer’s API.
“Our cloud infrastructure, managed basically by two people, hosts a trillion API calls a month — we’re generating 90,000 words a second,” Alshikh said. “We’re delivering high-quality models that compete with products from companies with larger teams and bigger budgets.”
Writer uses the Triton Inference Server that’s packaged with NeMo to run models in production for its customers. Alshikh reports that Triton, used by many companies running LLMs, enables lower latency and greater throughput than alternative programs.
“This means you can run a service for $20,000, instead of $100,000, so we can invest more in building meaningful features,” he said.
A Wide Horizon
Writer is also a member of NVIDIA Inception, a program that nurtures cutting-edge startups. “Thanks to Inception, we got early access to NeMo and some amazing people who guided us through the process of finding and using the tools we need,” he said.
Now that Writer’s text products are getting traction, Alshikh, who splits his time between homes in Florida and California, is searching the horizon for what’s next. In today’s broad frontier of generative AI, he sees opportunities in images, audio, video, 3D — maybe all of the above.
“We see multimodality as the future,” he said.
Check out this page to get started with NeMo. And learn about the early access program for multimodal NeMo here.
And if you enjoyed this story, let folks on social networks know using the following, a summary suggested by Writer:
“Learn how startup Writer uses NVIDIA NeMo software to generate content for hundreds of companies and rack up impressive revenues with a small staff and budget.”
NVIDIA Makes Extended-Reality Streaming More Scalable, Customizable for Enterprises and Developers
Organizations across industries are using extended reality (XR) to redesign workflows and boost productivity, whether for immersive training or collaborative design reviews.
With the growing use of all-in-one (AIO) headsets, more teams have adopted and integrated XR. While easing XR use, AIO headsets have modest compute and rendering power that can limit the graphics quality of streaming experiences.
NVIDIA is enabling more enterprises and developers to adopt high-quality XR with its CloudXR Suite. Built to greatly simplify streaming, CloudXR enables anyone with an AIO headset or mobile XR device to experience high-fidelity, immersive environments from any location.
CloudXR Suite combines the power of NVIDIA RTX GPUs and NVIDIA RTX Virtual Workstation (vWS) software to stream high-fidelity XR applications to Android and iOS devices. By dynamically adjusting to network conditions, CloudXR maximizes image quality and frame rates to power next-level, wireless augmented-reality and virtual-reality experiences.
With CloudXR, enterprises can gain the flexibility to effectively orchestrate and scale XR workloads, and developers can use the advanced platform to create custom XR products for their users. The suite offers high-quality streaming across both public and private networks.
Ericsson and VMware are among the first companies to use CloudXR.
Taking XR Workflows to the Next Level
CloudXR Suite offers performance that’s comparable to tethered VR experiences.
It comprises three components, including several updates:
- CloudXR Essentials, the suite’s underlying streaming layer, brings new improvements such as 5G L4S optimizations, QoS algorithms and enhanced logging tools. Essentials also includes the SteamVR plug-in, along with sample clients and a new server-side application programming interface.
- CloudXR Server Extensions improves server-side interfaces with a source-code addition to the Monado OpenXR runtime. The new CloudXR Server API contained in CloudXR Essentials and the OpenXR API represent the gateway to scaling XR distribution for orchestration partners.
- CloudXR Client Extensions include, as a first offering, a CloudXR plug-in built for the Unity Editor. This lets developers build custom CloudXR client applications using already-familiar Unity development tools. Plus, Unity app developers can more easily build applications with branded custom interfaces and lobbies before connecting to their CloudXR streaming server using the plug-in.
Teams can tap into the power of NVIDIA RTX GPUs to achieve ultimate graphics performance on mobile devices. Enterprises can scale to data center and edge networks, and stream to concurrent users with NVIDIA RTX vWS software.
In addition, users can stream stunning XR content from any OpenVR or OpenXR application at the edge using high-bandwidth, low-latency 5G signals.
Partners Experience Enterprise-Grade XR Streaming
Organizations across industries use XR streaming to advance their workflows.
To provide optimal streaming performance, NVIDIA is working with leading companies like Ericsson to implement low-latency, low-loss scalable throughput (L4S) in NVIDIA CloudXR. L4S helps reduce lag in interactive, cloud-based video streaming, so CloudXR users will be able to experience photorealistic XR environments on high-bandwidth, low-latency networks.
“At Ericsson, we believe innovations like L4S are fundamental building blocks to enable latency-critical applications,” said Sibel Tombaz, head of product line for 5G Radio Access Network at Ericsson. “As a key part of Ericsson’s Time-Critical Communication capabilities, L4S will significantly improve user experience for use-cases like cloud gaming, and it’s great news that NVIDIA is making L4S a production element of CloudXR. We’re excited to be working with NVIDIA to further enhance the XR experience for enterprises, developers and consumers.”
More professionals can elevate XR streaming from the cloud with VMware Workspace ONE XR Hub, which includes an integration of CloudXR.
Workspace ONE XR Hub enhances user experiences with VR headsets through advanced authentication and customization options. Combined with the streaming capabilities of CloudXR, Workspace ONE XR Hub allows teams across industries to quickly, securely access complex immersive environments using AIO headsets.
“With this new integration, access to high-fidelity immersive experiences is even easier because streaming lets users tap into the power of RTX GPUs from anywhere,” said Matt Coppinger, director of product management for end-user computing at VMware. “Workspace ONE XR Hub and CloudXR will allow our customers to stream rich XR content, and more teams can boost productivity and integrate realistic, virtual experiences into their workflows.”
Availability
CloudXR Suite will be available to download soon, so users can stream a wide range of XR applications over the network without worrying about demanding graphics requirements.
For example, independent software vendors (ISVs) can create a single, high-quality version of their application that’s built to take advantage of powerful GPUs. And with CloudXR streaming, ISVs can target users with mobile XR devices.
Mobile-device manufacturers can also offer their ISV partners and end users access to high-performance GPU acceleration for unparalleled graphics experiences.
In addition, cloud service providers, orchestrators and system integrators can extend their GPU services with interactive graphics to support next-generation XR applications.
Learn more about NVIDIA CloudXR Suite.
Extended Cut: NVIDIA Expands Maxine for Video Editing, Showcases 3D Virtual Conferencing Research
Professionals, teams, creators and others can tap into the power of AI to create high-quality audio and video effects — even using standard microphones and webcams — with the help of NVIDIA Maxine.
The suite of GPU-accelerated software development kits and cloud-native microservices lets users deploy AI features that enhance audio, video and augmented-reality effects for real-time communications services and platforms. Maxine will also expand features for video editing, enabling teams to reach new heights in video communication.
Plus, an NVIDIA Research demo at this week’s SIGGRAPH conference displays how AI can take video conferencing to the next level with 3D features.
NVIDIA Maxine Features Expand to Video Editing
Wireless connectivity has enabled people to join virtual meetings from more locations than ever. Typically, audio and video quality are heavily impacted when a caller is on the move or in a location with poor connectivity.
Advanced, real-time Maxine features — such as Background Noise Removal, Super Resolution and Eye Contact — allow remote users to enhance interpersonal communication experiences.
In addition, Maxine can now be used for video editing. NVIDIA partners are transforming this professional workflow with the same Maxine features that elevate video conferencing. The goal when editing a video, whether a sales pitch or a webinar, is to engage the broadest audience possible. Using Maxine, professionals can tap into AI features that enhance audio and video signals.
With Maxine, a spokesperson can look away from the screen to reference notes or a script while their gaze remains as if looking directly into the camera. Users can also film videos in low resolution and enhance the quality later. Plus, Maxine lets people record videos in several different languages and export the video in English.
Maxine features to be released in early access this year include:
- Interpreter: Translates from simplified Chinese, Russian, French, German and Spanish to English while animating the user’s image to show them speaking English.
- Voice Font: Enables users to apply characteristics of a speaker’s voice and map it to the audio output.
- Audio Super Resolution: Improves audio quality by increasing the temporal resolution of the audio signal and extending bandwidth. It currently supports upsampling from 8,000Hz to 16,000Hz as well as from 16,000Hz to 48,000Hz. This feature is also updated with more than 50% reduction in latency and up to 2x better throughput.
- Maxine Client: Brings the AI capabilities of Maxine’s microservices to video-conferencing sessions on PCs. The application is optimized for low-latency streaming and will use the cloud for all of its GPU compute requirements. Thin Client will be available on Windows this fall, with additional OS support to follow.
Maxine can be deployed in the cloud, on premises or at the edge, meaning quality communication can be accessible from nearly anywhere.
Taking Video Conferencing to New Heights
Many partners and customers are experiencing high-quality video conferencing and editing with Maxine. Two features of Maxine — Eye Contact and Live Portrait — are now available in production releases on the NVIDIA AI Enterprise software platform. Eye Contact simulates direct eye contact with the camera by estimating and aligning the user’s gaze with the camera. And Live Portrait animates a person’s portrait photo through their live video feed.
Software company Descript aims to make video a staple of every communicator’s toolkit, alongside docs and slides. With NVIDIA Maxine, professionals and beginners who use Descript can access AI features that improve their video-content workflows.
“With the NVIDIA Maxine Eye Contact feature, users no longer have to worry about memorizing scripts or doing tedious video retakes,” said Jay LeBoeuf, head of business and corporate development at Descript. “They can maintain a perfect on-screen presence while nailing their script every time.”
Reincubate’s Camo app aims to broaden access to great video by taking advantage of the hardware and devices people already own. It does this by giving users greater control over their image and by implementing a powerful, efficient processing pipeline for video effects and transformation. Using technologies enabled by NVIDIA Maxine, Camo can offer users an easier way to achieve incredible video creation.
“Integrating NVIDIA Maxine into Camo couldn’t have been easier, and it’s enabled us to get high performance from users’ RTX GPUs right out of the box,” said Aidan Fitzpatrick, founder and CEO of Reincubate. “With Maxine, the team’s been able to move faster and with more confidence.”
Quicklink’s Cre8 is a powerful video production platform for creating professional, on-brand productions, virtual and hybrid live events. The user-friendly interface combines an intuitive design with all the tools needed to build, edit and customize a professional-looking production. Cre8 incorporates NVIDIA Maxine technology to maximize productivity and the quality of video productions, offering complete control to the operator.
“Quicklink Cre8 now offers the most advanced video production platform on the planet,” said Richard Rees, CEO of Quicklink. “With NVIDIA Maxine, we were able to add advanced features, including Auto Framing, Video Noise Removal, Noise and Echo Cancellation, and Eye Contact Simulation.”
Los Angeles-based company gemelo.ai provides a platform for creating AI twins that can scale a user’s voice, content and interactions. Using Maxine’s Live Portrait feature, the gemelo.ai team can unlock new opportunities for scaled, personalized content and one-on-one interactions.
“The realism of Live Portrait has been a game-changer, unlocking new realms of potential for our AI twins,” said Paul Jaski, CEO of gemelo.ai. “Our customers can now design and deploy incredibly realistic digital twins with the superpowers of unlimited scalability in content production and interaction across apps, websites and mixed-reality experiences.”
NVIDIA Research Shows How 3D Video Enhances Immersive Communication
In addition to powering the advanced features of Maxine, NVIDIA AI enhances video communication with 3D. NVIDIA Research recently published a paper demonstrating how AI could power a 3D video-conferencing system with minimal capture equipment.
3D telepresence systems are typically expensive, require a large space or production studio, and use high-bandwidth, volumetric video streaming — all of which limits the technology’s accessibility. NVIDIA Research shared a new method, which runs on a novel VisionTransformer-based encoder, that takes 2D video input from a standard webcam and turns it into a 3D video representation. Instead of requiring 3D data to be passed back and forth between the participants in a conference, AI enables bandwidth requirements for the call to stay the same as for a 2D conference.
The technology takes a user’s 2D video and automatically creates a 3D representation called a neural radiance field, or NeRF, using volumetric rendering. As a result, participants can stream 2D videos, like they would for traditional video conferencing, while decoding high-quality 3D representations that can be rendered in real time. And with Maxine’s Live Portrait, users can bring their portraits to life in 3D.
AI-mediated 3D video conferencing could significantly reduce the cost for 3D capture, provide a high-fidelity 3D representation, accommodate photorealistic or stylized avatars, and enable mutual eye contact in video conferencing. Related research projects show how AI can help elevate communications and virtual interactions, as well as inform future NVIDIA technologies for video conferencing.
See the system in action below. SIGGRAPH attendees can visit the Emerging Technologies booth, where groups will be able to simultaneously view the live demo on a 3D display designed by New York-based company Looking Glass.
Availability
Learn more about NVIDIA Maxine, which is now available on NVIDIA AI Enterprise.
And see more of the research behind the 3D video conference project.
Featured image courtesy of NVIDIA Research.