Efficient Video-Text Learning with Iterative Co-tokenization

Video is a ubiquitous source of media content that touches on many aspects of people’s day-to-day lives. Increasingly, real-world video applications, such as video captioning, video content analysis, and video question-answering (VideoQA), rely on models that can connect video content with text or natural language. VideoQA is particularly challenging, however, because it requires grasping both semantic information, such as objects in a scene, and temporal information, e.g., how things move and interact, both of which must be interpreted in the context of a natural-language question that carries a specific intent. In addition, because videos have many frames, processing all of them to learn spatio-temporal information can be computationally expensive. Nonetheless, understanding all this information enables models to answer complex questions — for example, in the video below, a question about the second ingredient poured into the bowl requires identifying objects (the ingredients), actions (pouring), and temporal ordering (second).

An example input question for the VideoQA task “What is the second ingredient poured into the bowl?” which requires deeper understanding of both the visual and text inputs. The video is an example from the 50 Salads dataset, used under the Creative Commons license.

To address this, in “Video Question Answering with Iterative Video-Text Co-Tokenization”, we introduce a new approach to video-text learning called iterative co-tokenization, which efficiently fuses spatial, temporal and language information for VideoQA. The approach is multi-stream: it processes videos at different scales with an independent backbone model for each stream, producing video representations that capture different features, e.g., high spatial resolution or long temporal duration. The model then applies a co-tokenization module to learn efficient representations by fusing the video streams with the text. This model is highly efficient, using only 67 giga-FLOPs (GFLOPs), at least 50% fewer than previous approaches, while giving better performance than alternative state-of-the-art models.

Video-Text Iterative Co-tokenization
The main goal of the model is to produce features from both the video and the text (i.e., the user question) that allow their corresponding inputs to interact. A second goal is to do so efficiently, which is especially important for videos, since they contain tens to hundreds of frames as input.

The model learns to tokenize the joint video-language inputs into a smaller set of tokens that jointly and efficiently represent both modalities. When tokenizing, we use both modalities to produce a joint compact representation, which is fed to a transformer layer to produce the next-level representation. A challenge here, which is also typical in cross-modal learning, is that the video frames often do not correspond directly to the associated text. We address this by adding two learnable linear layers that unify the visual and text feature dimensions before tokenization. This way, both the video and the text condition how the video tokens are learned.

Moreover, a single tokenization step does not allow for further interaction between the two modalities. For that, we use this new feature representation to interact with the video input features and produce another set of tokenized features, which are then fed into the next transformer layer. This iterative process creates new features, or tokens, that represent a continual refinement of the joint representation from both modalities. In the last step, the features are fed to a decoder that generates the text output.
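To make the mechanism concrete, below is a minimal PyTorch sketch of the iterative co-tokenization idea. The dimensions, the number of tokens and iterations, and the use of learnable queries with cross-attention are illustrative assumptions for exposition, not the released implementation.

```python
import torch
import torch.nn as nn

class IterativeCoTokenizer(nn.Module):
    """Sketch of iterative video-text co-tokenization.

    Video and text features are first projected to a shared width, then a
    small set of learnable queries cross-attends to the fused features to
    produce compact joint tokens. The tokens are refined by a transformer
    layer and reused to condition the next iteration.
    """

    def __init__(self, video_dim=1024, text_dim=768, dim=512,
                 num_tokens=64, num_iters=3, num_heads=8):
        super().__init__()
        # Learnable linear layers that unify visual and text feature widths.
        self.video_proj = nn.Linear(video_dim, dim)
        self.text_proj = nn.Linear(text_dim, dim)
        # A small set of learnable token queries (the compact joint tokens).
        self.token_queries = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        # Cross-attention fuses and selects information from video + text.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Transformer layer that refines the tokens at every iteration.
        self.refine = nn.TransformerEncoderLayer(dim, num_heads,
                                                 dim_feedforward=4 * dim,
                                                 batch_first=True)
        self.num_iters = num_iters

    def forward(self, video_feats, text_feats):
        # video_feats: (B, Nv, video_dim) features from the video streams
        # text_feats:  (B, Nt, text_dim)  features of the question tokens
        context = torch.cat([self.video_proj(video_feats),
                             self.text_proj(text_feats)], dim=1)
        tokens = self.token_queries.unsqueeze(0).expand(video_feats.size(0), -1, -1)
        for _ in range(self.num_iters):
            # Current tokens query the fused video-text context ...
            tokens, _ = self.cross_attn(tokens, context, context)
            # ... and are refined before conditioning the next iteration.
            tokens = self.refine(tokens)
        return tokens  # compact joint tokens fed to the text decoder


if __name__ == "__main__":
    model = IterativeCoTokenizer()
    video = torch.randn(2, 256, 1024)   # dummy flattened spatio-temporal features
    text = torch.randn(2, 20, 768)      # dummy question token embeddings
    print(model(video, text).shape)     # torch.Size([2, 64, 512])
```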

As is customary for VideoQA, we pre-train the model before fine-tuning it on the individual VideoQA datasets. Instead of pre-training on a large VideoQA dataset, in this work we use videos from the HowTo100M dataset that are automatically annotated with text via speech recognition. This weaker pre-training data still enables our model to learn video-text features.

Visualization of the video-text iterative co-tokenization approach. Multi-stream video inputs, which are versions of the same video input (e.g., a high resolution, low frame-rate video and a low resolution, high frame-rate video), are efficiently fused together with the text input to produce a text-based answer by the decoder. Instead of processing the inputs directly, the video-text iterative co-tokenization model learns a reduced number of useful tokens from the fused video-language inputs. This process is done iteratively, allowing the current feature tokenization to affect the selection of tokens at the next iteration, thus refining the selection.

Efficient Video Question-Answering
We apply the video-language iterative co-tokenization algorithm to three main VideoQA benchmarks, MSRVTT-QA, MSVD-QA and IVQA, and demonstrate that this approach achieves better results than other state-of-the-art models while having a modest size. Furthermore, iterative co-tokenization learning yields significant compute savings for video-text learning tasks. The method uses only 67 giga-FLOPs (GFLOPs), one sixth of the 360 GFLOPs needed when using the popular 3D-ResNet video model jointly with text, and is more than twice as efficient as the X3D model. All the while, it produces highly accurate results, outperforming state-of-the-art methods.

Comparison of our iterative co-tokenization approach to previous methods such as MERLOT and VQA-T, as well as baselines using a single ResNet-3D or X3D-XL.

Multi-stream Video Inputs
For VideoQA, or any of a number of other tasks that involve video inputs, we find that multi-stream input is important to more accurately answer questions about both spatial and temporal relationships. Our approach utilizes three video streams at different resolutions and frame rates: a low-resolution, high frame-rate input video stream (with 32 frames-per-second and spatial resolution 64×64, which we denote as 32x64x64); a high-resolution, low frame-rate video (8x224x224); and one in between (16x112x112). Despite the apparently more voluminous information to process with three streams, we obtain very efficient models thanks to the iterative co-tokenization approach. At the same time, these additional streams allow extraction of the most pertinent information. For example, as shown in the figure below, questions related to a specific activity in time produce higher activations in the lower-resolution but higher frame-rate video input, whereas questions related to the general activity can be answered from the high-resolution input with very few frames. Another benefit of this algorithm is that the tokenization changes depending on the question asked.
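As a rough illustration, the sketch below builds the three stream shapes described above from a single clip. The trilinear resampling and the dummy clip size are assumptions for exposition, not the exact preprocessing used by the model.

```python
import torch
import torch.nn.functional as F

def make_streams(video):
    """Build three multi-stream inputs from one video clip.

    video: (B, C, T, H, W) tensor, e.g., a densely sampled RGB clip.
    Returns streams shaped 32x64x64, 16x112x112 and 8x224x224 (T' x H' x W'),
    matching the configurations described in the post.
    """
    specs = [(32, 64, 64), (16, 112, 112), (8, 224, 224)]
    return [F.interpolate(video, size=spec, mode="trilinear", align_corners=False)
            for spec in specs]

if __name__ == "__main__":
    clip = torch.randn(1, 3, 64, 224, 224)  # dummy clip: 64 frames at 224x224
    for stream in make_streams(clip):
        print(tuple(stream.shape))  # (1, 3, 32, 64, 64), (1, 3, 16, 112, 112), (1, 3, 8, 224, 224)
```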

Visualization of the attention maps learned per layer during the video-text co-tokenization. The attention maps differ depending on the question asked for the same video. For example, if the question is related to the general activity (e.g., surfing in the figure above), then the attention maps of the higher-resolution, low frame-rate inputs are more active and seem to consider more global information. If the question is more specific, e.g., asking about what happens after an event, the feature maps are more localized and tend to be active in the high frame-rate video input. Furthermore, we see that the low-resolution, high frame-rate video inputs provide more information related to activities in the video.

Conclusion
We present a new approach to video-language learning that focuses on joint learning across the video and text modalities. We address the important and challenging task of video question-answering. Our approach is both highly efficient and accurate, outperforming current state-of-the-art models while using less compute. It results in modest model sizes and could gain further improvements with larger models and data. We hope this work spurs more research in vision-language learning to enable more seamless interaction with vision-based media.

Acknowledgements
This work was conducted by AJ Piergiovanni, Kairo Morton, Weicheng Kuo, Michael Ryoo and Anelia Angelova. We thank our collaborators in this research, Soravit Changpinyo for valuable comments and suggestions, and Claire Cui for suggestions and support. We also thank Tom Small for the visualizations.

Future of Creativity on Display ‘In the NVIDIA Studio’ During SIGGRAPH Special Address

Editor’s note: This post is part of our weekly In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks, and demonstrates how NVIDIA Studio technology accelerates creative workflows. 

A glimpse into the future of AI-infused virtual worlds was on display at SIGGRAPH — the world’s largest gathering of computer graphics experts — as NVIDIA founder and CEO Jensen Huang put the finishing touches on the company’s special address.

Announcements included a host of updates to a pillar of the NVIDIA Studio software suite: NVIDIA Omniverse, a platform for 3D design collaboration and world simulation. New features and improvements to apps including Create, Machinima, Audio2Face and Nucleus will help 3D artists build virtual worlds, digital twins and avatars for the metaverse.

Each month, NVIDIA Studio Driver releases provide artists, creators and 3D developers with the best performance and reliability when working with creative applications. Available now, the August NVIDIA Studio Driver gives creators peak reliability for using Omniverse and their favorite creative apps.

Plus, this week’s featured In the NVIDIA Studio artist, Simon Lavit, exhibits his mastery of Omniverse as the winner of the #MadeInMachinima contest. The 3D artist showcases the creative workflow for his victorious short film, Painting the Astronaut.

Omniverse Expands

NVIDIA Omniverse — an open platform based on Universal Scene Description (USD) for building and connecting virtual worlds — just received a significant upgrade.

Omniverse Apps — including Create 2022.2 — received a major PhysX update with soft-body simulation, particle-cloth simulation and soft-contact models, delivering more realism to physically accurate virtual worlds. New OmniLive workflows give artists more freedom through a new collaboration interface for non-destructive USD workflows.

Omniverse users can now add animations and emotions with the Audio2Face app.

Audio2Face 2022.1 is now available in beta, including major updates that enable AI-powered emotion control and full facial animation, delivering more realism than ever. Users can now direct emotion over time, as well as mix key emotions like joy, amazement, anger and sadness. The AI can also direct eye, teeth and tongue motion, in addition to the avatar’s skin, providing an even more complete facial-animation solution.

Learn additional details on these updates and more.

Winning the #MadeInMachinima Contest

Since he first held a pen, Simon Lavit has been an artist. Now, Lavit adds Omniverse Machinima to the list of creative tools he’s mastered, as the winner of the #MadeInMachinima contest.

His entry, Painting the Astronaut, was selected by an esteemed panel of judges that included numerous creative experts.

Powered by a GeForce RTX 3090 GPU, Lavit’s creative workflow showcases the breadth and interoperability of Omniverse, its Apps and Connectors. He used lighting and scene setting to establish the short film’s changing mood, helping audiences understand the story’s progression. Its introduction, for example, is bright and clear. The film then gets darker, conveying the idea of the unknown as the character starts his journey.

The lighting for “Painting the Astronaut” helps guide the story, with 3D assets from the Omniverse library.

Lavit storyboarded on paper before starting his digital process with the Machinima and Omniverse Create apps. He quickly turned to NVIDIA’s built-in 3D asset library, filled with free content from Mount & Blade II: Bannerlord, Mechwarrior 5: Mercenaries, Squad and more, to populate the scene.

The 3D model for the spaceship was created in Autodesk Maya within Omniverse.

Then, Lavit used Autodesk Maya to create 3D models for some of his hero assets — like the protagonist Sol’s spaceship. The Maya Omniverse Connector allowed him to visualize scenes within Omniverse Create. He also benefited from RTX-accelerated ray tracing and AI denoising in Maya, resulting in highly interactive and photorealistic renders.

Next, Lavit textured the models in Adobe Substance 3D, which also has an Omniverse Connector. Substance 3D uses NVIDIA Iray rendering, including for textures and substances. It also features RTX-accelerated light- and ambient-occlusion baking, which optimizes assets in seconds.

Lavit then returned to Machinima for final layout, animation and render. The result was composited in Adobe After Effects, with an extra layer of effects and music. What ultimately became the contest-winning piece of art came from “a pretty simple workflow to keep the complexity to a minimum,” Lavit said.

“Painting the Astronaut” netted Lavit a GeForce RTX 3080 Ti-powered ASUS ProArt StudioBook 16.

To power his future creativity from anywhere, Lavit won an ASUS ProArt StudioBook 16. This NVIDIA Studio laptop packs top-of-the-line technology into a device that enables users to work on the go with world-class power from a GeForce RTX 3080 Ti Laptop GPU and beautiful 4K display.

3D Artist and Omniverse #MadeInMachinima contest winner Simon Lavit.

Lavit, born in France and now based in the U.S., sees every project as an adventure. Living in a different country from where he was born changed his vision of art, he said. Lavit regularly finds inspiration from the French graphic novel series, The Incal, which is written by Alejandro Jodorowsky and illustrated by renowned cartoonist Jean Giraud, aka Mœbius.

Made the Grade

The next generation of creative professionals is heading back to campus. Choosing the right NVIDIA Studio laptop can be tricky, but students can use this guide to find the perfect tool to power their creativity — like the Lenovo Slim 7i Pro X, an NVIDIA Studio laptop now available with a GeForce RTX 3050 Laptop GPU.

While the #MadeInMachinima contest has wrapped, creators can graduate to an NVIDIA RTX A6000 GPU in the #ExtendOmniverse contest, running through Friday, Aug. 19. Perform something akin to magic by making your own NVIDIA Omniverse Extension for a chance to win an RTX A6000 or GeForce RTX 3090 Ti GPU. Winners will be announced in September at GTC.

Follow NVIDIA Omniverse on Instagram, Medium, Twitter and YouTube for additional resources and inspiration. Check out the Omniverse forums, and join our Discord server and Twitch channel to chat with the community.

Follow NVIDIA Studio on Instagram, Twitter and Facebook. Access tutorials on the Studio YouTube channel and get updates directly in your inbox by subscribing to the NVIDIA Studio newsletter.

At SIGGRAPH, NVIDIA CEO Jensen Huang Illuminates Three Forces Sparking Graphics Revolution

In a swift, eye-popping special address at SIGGRAPH, NVIDIA execs described the forces driving the next era in graphics, and the company’s expanding range of tools to accelerate them.

“The combination of AI and computer graphics will power the metaverse, the next evolution of the internet,” said Jensen Huang, founder and CEO of NVIDIA, kicking off the 45-minute talk.

It will be home to connected virtual worlds and digital twins, a place for real work as well as play. And, Huang said, it will be vibrant with what will become one of the most popular forms of robots: digital human avatars.

With 45 demos and slides, five NVIDIA speakers announced:

  • A new platform for creating avatars, NVIDIA Omniverse Avatar Cloud Engine (ACE).
  • Plans to build out Universal Scene Description (USD), the language of the metaverse.
  • Major extensions to NVIDIA Omniverse, the computing platform for creating virtual worlds and digital twins.
  • Tools to supercharge graphics workflows with machine learning.

“The announcements we made today further advance the metaverse, a new computing platform with new programming models, new architectures and new standards,” he said.

Metaverse applications are already here.

Huang pointed to consumers trying out virtual 3D products with augmented reality, telcos creating digital twins of their radio networks to optimize and deploy radio towers, and companies creating digital twins of warehouses and factories to optimize their layout and logistics.

Enter the Avatars

The metaverse will come alive with virtual assistants, avatars we interact with as naturally as talking to another person. They’ll work in digital factories, play in online games and provide customer service for e-tailers.

“There will be billions of avatars,” said Huang, calling them “one of the most widely used kinds of robots” that will be designed, trained and operated in Omniverse.

Digital humans and avatars require natural language processing, computer vision, complex facial and body animations and more. To move and speak in realistic ways, this suite of complex technologies must be synced to the millisecond.

It’s hard work that NVIDIA aims to simplify and accelerate with Omniverse Avatar Cloud Engine. ACE is a collection of AI models and services that build on NVIDIA’s work spanning everything from conversational AI to animation tools like Audio2Face and Audio2Emotion.

MetaHuman in Unreal Engine image courtesy of Epic Games.

“With Omniverse ACE, developers can build, configure and deploy their avatar application across any engine in any public or private cloud,” said Simon Yuen, a senior director of graphics and AI at NVIDIA. “We want to democratize building interactive avatars for every platform.”

ACE will be available early next year, running on embedded systems and all major cloud services.

Yuen also demonstrated the latest version of Omniverse Audio2Face, an AI model that can create facial animation directly from voices.

“We just added more features to analyze and automatically transfer your emotions to your avatar,” he said.

Future versions of Audio2Face will create avatars from a single photo, applying textures automatically and generating animation-ready 3D meshes. They’ll sport high-fidelity simulations of muscle movements that an AI can learn from watching a video — even lifelike hair that responds as expected to virtual grooming.

USD, a Foundation for the 3D Internet

Many superpowers of the metaverse will be grounded in USD, a foundation for the 3D internet.

The metaverse “needs a standard way of describing all things within 3D worlds,” said Rev Lebaredian, vice president of Omniverse and simulation technology at NVIDIA.

“We believe Universal Scene Description, invented and open sourced by Pixar, is the standard scene description for the next era of the internet,” he added, comparing USD to HTML in the 2D web.

Lebaredian described NVIDIA’s vision for USD as a key to opening even more opportunities than those in the physical world.

“Our next milestones aim to make USD performant for real-time, large-scale virtual worlds and industrial digital twins,” he said, noting NVIDIA’s plans to help build out support in USD for international character sets, geospatial coordinates and real-time streaming of IoT data.

Examples of NVIDIA’s planned investments in USD.

To further accelerate USD adoption, NVIDIA will release a compatibility testing and certification suite for USD. It lets developers verify that their custom USD components produce expected results.

In addition, NVIDIA announced a set of simulation-ready USD assets, designed for use in industrial digital twins and AI training workflows. They join a wealth of USD resources available online for free including USD-ready scenes, on-demand tutorials, documentation and instructor-led courses.

“We want everyone to help build and advance USD,” said Lebaredian.

Omniverse Expands Its Palette

One of the biggest announcements of the special address was a major new release of NVIDIA Omniverse, a platform that’s been downloaded nearly 200,000 times.

Huang called Omniverse “a USD platform, a toolkit for building metaverse applications, and a compute engine to run virtual worlds.”

The latest version packs several upgraded core technologies and more connections to popular tools.

The links, called Omniverse Connectors, are now in development for Unity, Blender, Autodesk Alias, Siemens JT, SimScale, the Open Geospatial Consortium and more. Connectors are now available in beta for PTC Creo, Visual Components and SideFX Houdini. These new developments join Siemens Xcelerator, now part of the Omniverse network, welcoming more industrial customers into the era of digital twins.

Like the internet itself, Omniverse is “a network of networks,” connecting users across industries and disciplines, said Steve Parker, NVIDIA’s vice president of professional graphics.

Examples of new features in NVIDIA Omniverse.

Nearly a dozen leading companies will showcase Omniverse capabilities at SIGGRAPH, including hardware, software and cloud-service vendors ranging from AWS and Adobe to Dell, Epic and Microsoft. A half dozen companies will conduct NVIDIA-powered sessions on topics such as AI and virtual worlds.

Speeding Physics, Animating Animals

Parker detailed several technology upgrades in Omniverse. They span enhancements for simulating physically accurate materials with the Material Definition Language (MDL), real-time physics with PhysX and the hybrid rendering and AI system, RTX.

“These core technology pillars are powered by NVIDIA high performance computing from the edge to the cloud,” Parker said.

For example, PhysX now supports soft-body and particle-cloth simulation, bringing more physical accuracy to virtual worlds in real time. And NVIDIA is fully open sourcing MDL so it can readily support graphics API standards like OpenGL or Vulkan, making the materials standard more broadly available to developers.

Omniverse also will include neural graphics capabilities developed by NVIDIA Research that combine RTX graphics and AI. For example:

  • Animal Modelers let artists iterate on an animal’s form with point clouds, then automatically generate a 3D mesh.
  • GauGAN360, the next evolution of NVIDIA GauGAN, generates 8K, 360-degree panoramas that can easily be loaded into an Omniverse scene.
  • Instant NeRF creates 3D objects and scenes from 2D images.

An Omniverse Extension for NVIDIA Modulus, a machine learning framework, will let developers use AI to speed simulations of real-world physics up to 100,000x, so the metaverse looks and feels like the physical world.

In addition, Omniverse Machinima — subject of a lively contest at SIGGRAPH — now sports content from Post Scriptum, Beyond the Wire and Shadow Warrior 3 as well as new AI animation tools like Audio2Gesture.

A demo from Industrial Light & Magic showed another new feature. Omniverse DeepSearch uses AI to help teams intuitively search through massive databases of untagged assets, bringing up accurate results for terms even when they’re not specifically listed in metadata.

Graphics Get Smart

One of the essential pillars of the emerging metaverse is neural graphics. It’s a hybrid discipline that harnesses neural network models to accelerate and enhance computer graphics.

“Neural graphics intertwines AI and graphics, paving the way for a future graphics pipeline that is amenable to learning from data,” said Sanja Fidler, vice president of AI at NVIDIA. “Neural graphics will redefine how virtual worlds are created, simulated and experienced by users,” she added.

AI will help artists spawn the massive amount of 3D content needed to create the metaverse. For example, they can use neural graphics to capture objects and behaviors in the physical world quickly.

Fidler described NVIDIA software that does just that: Instant NeRF, a tool that creates a 3D object or scene from 2D images. It’s the subject of one of NVIDIA’s two best paper awards at SIGGRAPH.

In the other best paper award, neural graphics powers a model that can predict and reduce reaction latencies in esports and AR/VR applications. The two best papers are among 16 total that NVIDIA researchers are presenting this week at SIGGRAPH.

Neural graphics blends AI into the graphics pipeline.

Designers and researchers can apply neural graphics and other techniques to create their own award-winning work using new software development kits NVIDIA unveiled at the event.

Fidler described one of them, Kaolin Wisp, a suite of tools to create neural fields — AI models that represent a 3D scene or object — with just a few lines of code.
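As a rough illustration of what a neural field is, the toy PyTorch snippet below fits a small MLP to the signed distance function of a sphere. It shows only the core coordinate-network idea and does not use the actual Kaolin Wisp API.

```python
import torch
import torch.nn as nn

# A neural field: an MLP mapping 3D coordinates to a value (here, the signed
# distance to a sphere). Kaolin Wisp wraps this pattern with feature grids,
# training loops and visualization; this snippet shows only the core idea.
field = nn.Sequential(nn.Linear(3, 128), nn.ReLU(),
                      nn.Linear(128, 128), nn.ReLU(),
                      nn.Linear(128, 1))
opt = torch.optim.Adam(field.parameters(), lr=1e-3)

for step in range(2000):
    pts = torch.rand(1024, 3) * 2 - 1                 # sample points in [-1, 1]^3
    target = pts.norm(dim=-1, keepdim=True) - 0.5     # ground-truth SDF of a sphere
    loss = nn.functional.mse_loss(field(pts), target)
    opt.zero_grad()
    loss.backward()
    opt.step()

# The trained MLP now answers distance queries at arbitrary resolution.
print(field(torch.tensor([[0.0, 0.0, 0.0]])))  # should be close to -0.5 (inside the sphere)
```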

Separately, NVIDIA announced NeuralVDB, the next evolution of the open-sourced standard OpenVDB that industries from visual effects to scientific computing use to simulate and render water, fire, smoke and clouds.

NeuralVDB uses neural models and GPU optimization to dramatically reduce memory requirements so users can interact with extremely large and complex datasets in real time and share them more efficiently.

“AI, the most powerful technology force of our time, will revolutionize every field of computer science, including computer graphics, and NVIDIA RTX is the engine of neural graphics,” Huang said.

Watch the full special address at NVIDIA’s SIGGRAPH event site. That’s where you’ll also find details of labs, presentations and the debut of a behind-the-scenes documentary on how we created our latest GTC keynote.

NVIDIA AI Makes Performance Capture Possible With Any Camera

NVIDIA AI tools are enabling deep learning-powered performance capture for creators at every level: visual effects and animation studios, creative professionals — even any enthusiast with a camera.

With NVIDIA Vid2Vid Cameo, creators can harness AI to capture their facial movements and expressions from any standard 2D video taken with a professional camera or smartphone. The performance can be applied in real time to animate an avatar, character or painting.

And with 3D body-pose estimation software, creators can capture full-body movements like walking, dancing and performing martial arts — bringing virtual characters to life with AI.

For individuals without 3D experience, these tools make it easy to animate creative projects, even using smartphone footage. Professionals can take it a step further, combining the pose estimation and Vid2Vid Cameo software to transfer their own movements to virtual characters for live streams or animation projects.

And creative studios can harness AI-powered performance capture for concept design or previsualization — to quickly convey an idea of how certain movements look on a digital character.

NVIDIA Demonstrates Performance Capture With Vid2Vid Cameo

NVIDIA Vid2Vid Cameo, available through a demo on the NVIDIA AI Playground, needs just two elements to generate a talking-head video: a still image of the avatar or painting to be animated, plus footage of the original performer speaking, singing or moving their head.

Based on generative adversarial networks, or GANs, the model maps facial movements to capture real-time motion, transferring that motion to the virtual character. Trained on 180,000 videos, the network learned to identify 20 key points to model facial motion — encoding the location of the eyes, mouth, nose, eyebrows and more.

These points are extracted from the video stream of the performer and applied to the avatar or digital character. See how it works in the demo below, which transfers a performance of Edgar Allan Poe’s “Sonnet — to Science” to a portrait of the writer by artist Gary Kelley.
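The exact Vid2Vid Cameo architecture is not reproduced here; the sketch below is a hypothetical, stripped-down illustration of keypoint-driven motion transfer, with small stand-in networks and random tensors in place of the real model, its training data and a live camera stream.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins; only the overall data flow is shown.
class KeypointDetector(nn.Module):
    """Predicts K 2D facial keypoints (K=20, as described in the post)."""
    def __init__(self, k=20):
        super().__init__()
        self.k = k
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, k * 2))

    def forward(self, img):                       # img: (B, 3, H, W)
        return self.net(img).view(-1, self.k, 2)  # (B, K, 2) keypoint coordinates

class Animator(nn.Module):
    """Re-renders the still portrait so it follows the driving keypoints."""
    def __init__(self, k=20, size=64):
        super().__init__()
        self.size = size
        self.net = nn.Sequential(
            nn.Linear(3 * size * size + 4 * k, 256), nn.ReLU(),
            nn.Linear(256, 3 * size * size), nn.Tanh())

    def forward(self, portrait, kp_source, kp_driving):
        x = torch.cat([portrait.flatten(1),
                       kp_source.flatten(1),
                       kp_driving.flatten(1)], dim=1)
        return self.net(x).view(-1, 3, self.size, self.size)

detector, animator = KeypointDetector(), Animator()
portrait = torch.randn(1, 3, 64, 64)          # the still image to animate
kp_source = detector(portrait)                # keypoints of the portrait itself
for frame in torch.randn(10, 1, 3, 64, 64):   # stand-in webcam/video frames
    kp_driving = detector(frame)              # keypoints of the performer
    animated = animator(portrait, kp_source, kp_driving)
    print(animated.shape)                     # (1, 3, 64, 64) animated frame
```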

Visual Platforms Integrate Vid2Vid Cameo, Pose Estimation by NVIDIA

While Vid2Vid Cameo captures detailed facial expressions, pose estimation AI tracks movement of the whole body — a key capability for creators working with virtual characters that perform complex motions or move around a digital scene.

Pose Tracker is a convolutional neural network model available as an Extension in the NVIDIA Omniverse 3D design collaboration and world simulation platform. It allows users to upload footage or stream live video as a motion source to animate a character in real time. Creators can download NVIDIA Omniverse for free and get started with step-by-step tutorials.

Companies that have integrated NVIDIA AI for performance capture into their products include:

  • Derivative, maker of TouchDesigner, a node-based real-time visual development platform, has implemented Vid2Vid Cameo as a way to provide easy-to-use facial tracking.
  • Notch, a company offering a real-time graphics tool for 3D, visual effects and live-events visuals, uses body-pose estimation AI from NVIDIA to help artists simplify stage setups. Instead of relying on custom hardware-tracking systems, Notch users can work with standard camera equipment to control 3D character animation in real time.
  • Pixotope, a leading virtual production company, uses NVIDIA AI-powered real-time talent tracking to drive interactive elements for live productions. The Norway-based company shared its work enabling interaction between real and virtual elements on screen at the most recent NVIDIA GTC.

Learn more about NVIDIA’s latest advances in AI, digital humans and virtual worlds at SIGGRAPH, the world’s largest gathering of computer graphics experts, running through Thursday, Aug. 11.

As Far as the AI Can See: ILM Uses Omniverse DeepSearch to Create the Perfect Sky

For cutting-edge visual effects and virtual production, creative teams and studios benefit from digital sets and environments that can be updated in real time.

A crucial element in any virtual production environment is a sky dome, often used to provide realistic lighting for virtual environments and in-camera visual effects. Legendary studio Industrial Light & Magic (ILM) is tapping into the power of AI to take its skies to new heights with NVIDIA AI-enabled DeepSearch and Omniverse Enterprise.

Capturing photorealistic details of a sky can be tricky. At SIGGRAPH today, ILM showcased how its team, with the NVIDIA DeepSearch tool, used natural language to rapidly search through a massive asset library and create a captivating sky dome.

The video shows how Omniverse Enterprise can provide filmmakers with the ultimate flexibility to develop the ideal look and lighting to further their stories. This helps artists save time, enhance productivity and accelerate creativity for virtual production.

After narrowing down their search results, the ILM team auditions the remaining sky domes in virtual reality to assess whether the asset will be a perfect match for the shot. By using VR, ILM can approximate what the skies will look like on a virtual production set.

The Sky’s the Limit With AI

An extensive library with thousands of references and 3D assets offers advantages, but it also presents some challenges without an efficient way to search through all the data.

Typically, users set up folders or tag items with keywords, which can be incredibly time consuming. This is especially true for a studio like ILM, which has over 40 years’ worth of material in its reference library, including photography, matte paintings, backdrops and other materials that have been captured over the decades.

With hundreds of thousands of untagged pieces of content, it’s impractical for the ILM team to manually search through them on a production schedule.

Omniverse DeepSearch, however, lets ILM search intuitively through untagged assets using text or a 2D image. DeepSearch uses AI to categorize and find images automatically — this results in massive time savings for the creative team, removing the need to manually tag each asset.
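DeepSearch itself is a product feature rather than published code; the snippet below is only a generic sketch of how embedding-based natural-language retrieval over untagged assets typically works, with random vectors standing in for a real joint text-image encoder and asset index.

```python
import torch
import torch.nn.functional as F

# Toy index: each asset is represented by an embedding vector. A real system
# would compute these with a joint text-image model; here they are random
# stand-ins just to show the retrieval step.
asset_names = ["stormy_sunset_dome", "clear_noon_dome", "overcast_morning_dome"]
asset_embeddings = F.normalize(torch.randn(len(asset_names), 512), dim=-1)

def embed_query(text: str) -> torch.Tensor:
    """Stand-in text encoder; a real one maps the query into the same
    embedding space as the assets."""
    torch.manual_seed(abs(hash(text)) % (2**31))
    return F.normalize(torch.randn(512), dim=-1)

def search(query: str, top_k: int = 2):
    q = embed_query(query)
    scores = asset_embeddings @ q          # cosine similarity (vectors are normalized)
    best = torch.topk(scores, k=top_k)
    return [(asset_names[int(i)], float(s)) for s, i in zip(best.values, best.indices)]

print(search("dramatic sky just after sunset"))
```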

All images courtesy of Industrial Light & Magic.

“With Omniverse DeepSearch, we have the ability to search through data in real time, which is key for production,” said Landis Fields, real time principal creative at ILM. “And being able to search through assets with natural language allows for our creative teams to easily find what they’re looking for, helping them achieve the final look and feel of a scene much more efficiently than before.”

DeepSearch also works on USD files, so the ILM team can review search results and bring images into the 3D space in Omniverse Enterprise. The artists can then interact with the 3D environment using a VR headset.

With NVIDIA DeepSearch and Omniverse Enterprise, ILM has the potential to accelerate creative pipelines, lower costs and enhance production workflows to create captivating content for virtual productions.

Join NVIDIA at SIGGRAPH to learn more about the latest Omniverse announcements, watch the company’s special address on demand and see the global premiere of NVIDIA’s documentary, The Art of Collaboration: NVIDIA, Omniverse, and GTC, on Wednesday, Aug. 10, at 10 a.m. PT.

New NVIDIA Neural Graphics SDKs Make Metaverse Content Creation Available to All

The creation of 3D objects for building scenes for games, virtual worlds including the metaverse, product design or visual effects is traditionally a meticulous process, where skilled artists balance detail and photorealism against deadlines and budget pressures.

It takes a long time to make something that looks and acts as it would in the physical world. And the problem gets harder when multiple objects and characters need to interact in a virtual world. Simulating physics becomes just as important as simulating light. A robot in a virtual factory, for example, needs to have not only the same look, but also the same weight capacity and braking capability as its physical counterpart.

It’s hard. But the opportunities are huge, affecting trillion-dollar industries as varied as transportation, healthcare, telecommunications and entertainment, in addition to product design. Ultimately, more content will be created in the virtual world than in the physical one.

To simplify and shorten this process, NVIDIA today released new research and a broad suite of tools that apply the power of neural graphics to the creation and animation of 3D objects and worlds.

These SDKs — including NeuralVDB, a ground-breaking update to the industry-standard OpenVDB, and Kaolin Wisp, a PyTorch library establishing a framework for neural fields research — ease the creative process for designers while making it easy for millions of users who aren’t design professionals to create 3D content.

Neural graphics is a new field intertwining AI and graphics to create an accelerated graphics pipeline that learns from data. Integrating AI enhances results, helps automate design choices and provides new, yet to be imagined opportunities for artists and creators. Neural graphics will redefine how virtual worlds are created, simulated and experienced by users.

These SDKs and research contribute to each stage of the content creation pipeline, including:

3D Content Creation

  • Kaolin Wisp – an addition to Kaolin, a PyTorch library enabling faster 3D deep learning research by reducing the time needed to test and implement new techniques from weeks to days. Kaolin Wisp is a research-oriented library for neural fields, establishing a common suite of tools and a framework to accelerate new research in neural fields.
  • Instant Neural Graphics Primitives – a new approach to capturing the shape of real-world objects, and the inspiration behind NVIDIA Instant NeRF, an inverse rendering model that turns a collection of still images into a digital 3D scene. This technique and associated GitHub code accelerate the process by up to 1,000x.
  • 3D MoMa – a new inverse rendering pipeline that allows users to quickly import a 2D object into a graphics engine to create a 3D object that can be modified with realistic materials, lighting and physics.
  • GauGAN360 – the next evolution of NVIDIA GauGAN, an AI model that turns rough doodles into photorealistic masterpieces. GauGAN360 generates 8K, 360-degree panoramas that can be ported into Omniverse scenes.
  • Omniverse Avatar Cloud Engine (ACE) – a new collection of cloud APIs, microservices and tools to create, customize and deploy digital human applications. ACE is built on NVIDIA’s Unified Compute Framework, allowing developers to seamlessly integrate core NVIDIA AI technologies into their avatar applications.

Physics and Animation

  • NeuralVDB – a groundbreaking improvement on OpenVDB, the current industry standard for volumetric data storage. Using machine learning, NeuralVDB introduces compact neural representations, dramatically reducing memory footprint to allow for higher-resolution 3D data.
  • Omniverse Audio2Face – an AI technology that generates expressive facial animation from a single audio source. It’s useful for interactive real-time applications and as a traditional facial animation authoring tool.
  • ASE: Animation Skills Embedding – an approach enabling physically simulated characters to act in a more responsive and life-like manner in unfamiliar situations. It uses deep learning to teach characters how to respond to new tasks and actions.
  • TAO Toolkit – a framework to enable users to create an accurate, high-performance pose estimation model, which can evaluate what a person might be doing in a scene using computer vision much more quickly than current methods.

Experience

  • Image Features Eye Tracking – a research model linking the quality of pixel rendering to a user’s reaction time. By predicting the best combination of rendering quality, display properties and viewing conditions for the least latency, it will allow for better performance in fast-paced, interactive computer graphics applications such as competitive gaming.
  • Holographic Glasses for Virtual Reality – a collaboration with Stanford University on a new VR glasses design that delivers full-color 3D holographic images in a groundbreaking 2.5-mm-thick optical stack.

Join NVIDIA at SIGGRAPH to see more of the latest research and technology breakthroughs in graphics, AI and virtual worlds. Check out the latest innovations from NVIDIA Research, and access the full suite of NVIDIA’s SDKs, tools and libraries.

Upping the Standard: NVIDIA Introduces NeuralVDB, Bringing AI and GPU Optimization to Award-Winning OpenVDB

NVIDIA today announced NeuralVDB, which brings the power of AI to OpenVDB, the industry-standard library for simulating and rendering sparse volumetric data, such as water, fire, smoke and clouds.

Building on the past decade’s development of OpenVDB, the introduction at SIGGRAPH of NeuralVDB is a game-changer for professionals working in areas like scientific computing and visualization, medical imaging, rocket science and visual effects. By reducing memory footprint by up to 100x, it allows creators, developers and researchers to interact with extremely large and complex datasets in real time.

Over the past decade, OpenVDB has earned Academy Awards as a core technology used throughout the visual-effects industry. It has since grown beyond entertainment to industrial and scientific use cases where sparse volumetric data is prevalent, such as industrial design and robotics.

Last year, NVIDIA introduced NanoVDB, which added GPU support to OpenVDB. This delivered an order-of-magnitude speedup, enabling faster performance and easier development — and opening the door to real-time simulation and rendering.

NeuralVDB builds on the GPU acceleration of NanoVDB by adding machine learning to introduce compact neural representations that dramatically reduce its memory footprint. This allows 3D data to be represented at even higher resolution and at a much larger scale than OpenVDB. The result is that users can easily handle massive volumetric datasets on devices like individual workstations and even laptops.

NeuralVDB offers a significant efficiency improvement over OpenVDB by compressing a volume’s memory footprint up to 100x compared to NanoVDB. This allows users to transmit and share large, complex volumetric datasets much more efficiently.

To accelerate training up to 2x, NeuralVDB allows the weights of a frame to be used for the subsequent one. NeuralVDB also enables users to achieve temporal coherency, or smooth encoding, by using the network results from the previous frame.
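This is not the NeuralVDB API, but the toy sketch below illustrates the warm-starting idea: each animation frame is encoded by a small coordinate network initialized from the previous frame's weights, so later frames converge with fewer optimization steps and encode more smoothly.

```python
import copy
import torch
import torch.nn as nn

def make_net():
    # Small coordinate network: (x, y, z) -> density for one animation frame.
    return nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                         nn.Linear(64, 64), nn.ReLU(),
                         nn.Linear(64, 1))

def frame_density(pts, t):
    # Stand-in "ground truth" volume: a sphere whose radius changes over time.
    return (pts.norm(dim=-1, keepdim=True) < 0.3 + 0.05 * t).float()

def fit(net, t, steps):
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(steps):
        pts = torch.rand(2048, 3) * 2 - 1
        loss = nn.functional.mse_loss(net(pts), frame_density(pts, t))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return net

nets, prev = [], None
for t in range(5):
    # Warm start: reuse the previous frame's weights instead of training from
    # scratch, so later frames need fewer steps and encode more smoothly.
    net = copy.deepcopy(prev) if prev is not None else make_net()
    prev = fit(net, t, steps=200 if prev is not None else 1000)
    nets.append(prev)
```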

Hitting this trifecta of dramatically reducing memory requirements, accelerating training and enabling temporal coherency allows NeuralVDB to unlock new possibilities for scientific and industrial use cases, including massive, complex volume datasets for AI-enabled medical imaging, large-scale digital twin simulations and more.

Learn more about NeuralVDB.

Watch the NVIDIA special address at SIGGRAPH on demand, and join NVIDIA at the conference through Thursday, Aug. 11, to see more of the latest technology breakthroughs in graphics, AI and virtual worlds.

Solving a longstanding conundrum in heat transfer

It is a problem that has beguiled scientists for a century. But, buoyed by a $625,000 Distinguished Early Career Award from the U.S. Department of Energy (DoE), Matteo Bucci, an associate professor in the Department of Nuclear Science and Engineering (NSE), hopes to be close to an answer.

Tackling the boiling crisis

Whether you’re heating a pot of water for pasta or are designing nuclear reactors, one phenomenon — boiling — is vital for efficient execution of both processes.

“Boiling is a very effective heat transfer mechanism; it’s the way to remove large amounts of heat from the surface, which is why it is used in many high-power density applications,” Bucci says. An example use case: nuclear reactors.

To the layperson, boiling appears simple — bubbles form and burst, removing heat. But what if so many bubbles form and coalesce that they create a band of vapor that prevents further heat transfer? This phenomenon is known as the boiling crisis. It would lead to runaway heating and failure of the fuel rods in nuclear reactors. So “understanding and determining under which conditions the boiling crisis is likely to happen is critical to designing more efficient and cost-competitive nuclear reactors,” Bucci says.

Early work on the boiling crisis dates back nearly a century, to 1926. And while much work has been done, “it is clear that we haven’t found an answer,” Bucci says. The boiling crisis remains a challenge because while models abound, measuring the related phenomena to prove or disprove these models has been difficult. “[Boiling] is a process that happens on a very, very small length scale and over very, very short times,” Bucci says. “We are not able to observe it at the level of detail necessary to understand what really happens and validate hypotheses.”

But, over the past few years, Bucci and his team have been developing diagnostics that can measure the phenomena related to boiling and thereby provide much-needed answers to a classic problem. Diagnostics are anchored in infrared thermometry and a technique using visible light. “By combining these two techniques I think we’re going to be ready to answer standing questions related to heat transfer, we can make our way out of the rabbit hole,” Bucci says. The grant award from the U.S. DoE for Nuclear Energy Projects will aid in this and Bucci’s other research efforts.

An idyllic Italian childhood

Tackling difficult problems is not new territory for Bucci, who grew up in the small town of Città di Castello near Florence, Italy. Bucci’s mother was an elementary school teacher. His father used to have a machine shop, which helped develop Bucci’s scientific bent. “I liked LEGOs a lot when I was a kid. It was a passion,” he adds.

Despite Italy going through a severe pullback from nuclear engineering during his formative years, the subject fascinated Bucci. Job opportunities in the field were uncertain but Bucci decided to dig in. “If I have to do something for the rest of my life, it might as well be something I like,” he jokes. Bucci attended the University of Pisa for undergraduate and graduate studies in nuclear engineering.

His interest in heat transfer mechanisms took root during his doctoral studies, a research subject he pursued in Paris at the French Alternative Energies and Atomic Energy Commission (CEA). It was there that a colleague suggested work on the boiling water crisis. This time Bucci set his sights on NSE at MIT and reached out to Professor Jacopo Buongiorno to inquire about research at the institution. Bucci had to fundraise at CEA to conduct research at MIT. He arrived just a couple of days before the Boston Marathon bombing in 2013 with a round-trip ticket. But Bucci has stayed ever since, moving on to become a research scientist and then associate professor at NSE.

Bucci admits he struggled to adapt to the environment when he first arrived at MIT, but work and friendships with colleagues — he counts NSE’s Guanyu Su and Reza Azizian as among his best friends — helped conquer early worries.

The integration of artificial intelligence

In addition to diagnostics for boiling, Bucci and his team are working on ways of integrating artificial intelligence and experimental research. He is convinced that “the integration of advanced diagnostics, machine learning, and advanced modeling tools will blossom in a decade.”

Bucci’s team is developing an autonomous laboratory for boiling heat transfer experiments. Running on machine learning, the setup decides which experiments to run based on a learning objective the team assigns. “We formulate a question and the machine will answer by optimizing the kinds of experiments that are necessary to answer those questions,” Bucci says. “I honestly think this is the next frontier for boiling,” he adds.
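Bucci's actual system is not described in code here; the following is a generic sketch of the kind of uncertainty-driven experiment selection such an autonomous loop could use, with a synthetic stand-in for the apparatus and a small model ensemble whose disagreement decides which experiment to run next.

```python
import torch
import torch.nn as nn

# Toy stand-in for the real apparatus: "running an experiment" at setting x
# returns a noisy measurement (a synthetic curve with a sharp transition,
# loosely mimicking a boiling curve near a critical point).
def run_experiment(x):
    return torch.tanh(8 * (x - 0.6)) + 0.05 * torch.randn(())

def make_model():
    return nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))

def fit(model, xs, ys, steps=200):
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(steps):
        loss = nn.functional.mse_loss(model(xs), ys)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Start from a few random experiments, then let the loop choose the rest.
xs = torch.rand(3, 1)
ys = torch.stack([run_experiment(x) for x in xs]).view(-1, 1)
candidates = torch.linspace(0, 1, 101).view(-1, 1)

for it in range(10):
    ensemble = [make_model() for _ in range(5)]
    for m in ensemble:
        fit(m, xs, ys)
    with torch.no_grad():
        preds = torch.stack([m(candidates) for m in ensemble])
    # Learning objective: reduce model uncertainty, so run the experiment
    # where the ensemble disagrees the most.
    next_x = candidates[preds.std(dim=0).argmax()].view(1, 1)
    xs = torch.cat([xs, next_x])
    ys = torch.cat([ys, run_experiment(next_x[0]).view(1, 1)])
```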

“It’s when you climb a tree and you reach the top, that you realize that the horizon is much more vast and also more beautiful,” Bucci says of his zeal to pursue more research in the field.

Even as he seeks new heights, Bucci has not forgotten his origins. Commemorating Italy’s hosting of the World Cup in 1990, a series of posters showcasing a soccer field fitted into the Roman Colosseum occupies pride of place in his home and office. Created by Alberto Burri, the posters are of sentimental value: The (now deceased) Italian artist also hailed from Bucci’s hometown — Città di Castello.

GAUDI: A Neural Architect for Immersive 3D Scene Generation

We introduce GAUDI, a generative model capable of capturing the distribution of complex and realistic 3D scenes that can be rendered immersively from a moving camera. We tackle this challenging problem with a scalable yet powerful approach, where we first optimize a latent representation that disentangles radiance fields and camera poses. This latent representation is then used to learn a generative model that enables both unconditional and conditional generation of 3D scenes. Our model generalizes previous works that focus on single objects by removing the assumption that the camera pose…

Apple Machine Learning Research