June 2023 – Page 10

Speed is all you need: On-device acceleration of large diffusion models via GPU-aware optimizations

Posted by Juhyun Lee and Raman Sarokin, Software Engineers, Core Systems & Experiences

The proliferation of large diffusion models for image generation has led to a significant increase in model size and inference workloads. On-device ML inference in mobile environments requires meticulous performance optimization and consideration of trade-offs due to resource constraints. Running inference of large diffusion models (LDMs) on-device, driven by the need for cost efficiency and user privacy, presents even greater challenges due to the substantial memory requirements and computational demands of these models.

We address this challenge in our work titled “Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations” (to be presented at the CVPR 2023 workshop for Efficient Deep Learning for Computer Vision), focusing on the optimized execution of a foundational LDM on a mobile GPU. In this blog post, we summarize the core techniques we employed to run large diffusion models like Stable Diffusion at full resolution (512×512 pixels) and 20 iterations on modern smartphones, achieving an inference time of under 12 seconds for the original model without distillation. As discussed in our previous blog post, GPU-accelerated ML inference is often limited by memory performance, and execution of LDMs is no exception. Therefore, the central theme of our optimization is efficient memory input/output (I/O), even if it means choosing memory-efficient algorithms over those that prioritize arithmetic logic unit efficiency. Ultimately, our primary objective is to reduce the overall latency of the ML inference.

A sample output of an LDM on Mobile GPU with the prompt text: “a photo realistic and high resolution image of a cute puppy with surrounding flowers”.

Enhanced attention module for memory efficiency

An ML inference engine typically provides a variety of optimized ML operations. Despite this, achieving optimal performance can still be challenging as there is a certain amount of overhead for executing individual neural net operators on a GPU. To mitigate this overhead, ML inference engines incorporate extensive operator fusion rules that consolidate multiple operators into a single operator, thereby reducing the number of iterations across tensor elements while maximizing compute per iteration. For instance, TensorFlow Lite utilizes operator fusion to combine computationally expensive operations, like convolutions, with subsequent activation functions, like rectified linear units, into one.

A clear opportunity for optimization is the heavily used attention block adopted in the denoiser model in the LDM. The attention blocks allow the model to focus on specific parts of the input by assigning higher weights to important regions. There are multiple ways one can optimize the attention modules, and we selectively employ one of the two optimizations explained below depending on which optimization performs better.

The first optimization, which we call partially fused softmax, removes the need for extensive memory writes and reads between the softmax and the matrix multiplication in the attention module. Let the attention block be just a simple matrix multiplication of the form Y = softmax(X) * W where X and W are 2D matrices of shape a×b and b×c, respectively (shown below in the top half).

For numerical stability, T = softmax(X) is typically calculated in three passes:

Determine the maximum value in the list, i.e., for each row in matrix X
Sum up the exponentials of the difference between each list item and the maximum value (from pass 1)
Divide the exponential of the items minus the maximum value by the sum from pass 2

Carrying out these passes naïvely would result in a huge memory write for the temporary intermediate tensor T holding the output of the entire softmax function. We bypass this large memory write if we only store the results of passes 1 and 2, labeled m and s, respectively, which are small vectors, with a elements each, compared to T which has a·b elements. With this technique, we are able to reduce tens or even hundreds of megabytes of memory consumption by multiple orders of magnitude (shown below in the bottom half).

Attention modules. Top: A naïve attention block, composed of a SOFTMAX (with all three passes) and a MATMUL, requires a large memory write for the big intermediate tensor T. Bottom: Our memory-efficient attention block with partially fused softmax in MATMUL only needs to store two small intermediate tensors for m and s.
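The partially fused softmax can be sketched in plain Python. This is only an illustration of the idea under simplifying assumptions (nested lists instead of GPU buffers, and hypothetical function names `softmax_stats`, `fused_softmax_matmul`, and `naive_attention`); it is not the shader code used in the work. Passes 1 and 2 produce the small vectors m and s, and pass 3 is recomputed element by element inside the matrix multiplication, so the full a×b tensor T is never stored.

```python
import math

def softmax_stats(X):
    """Passes 1 and 2: per-row maximum m and per-row sum s of exp(x - m)."""
    m = [max(row) for row in X]
    s = [sum(math.exp(x - mi) for x in row) for row, mi in zip(X, m)]
    return m, s

def fused_softmax_matmul(X, W, m, s):
    """Pass 3 folded into the matmul: each softmax element is recomputed on the fly."""
    c = len(W[0])
    Y = []
    for row, mi, si in zip(X, m, s):
        out = [0.0] * c
        for k, x in enumerate(row):
            t = math.exp(x - mi) / si  # one element of softmax(X), never stored
            for j in range(c):
                out[j] += t * W[k][j]
        Y.append(out)
    return Y

def naive_attention(X, W):
    """Reference version: materialize the full a-by-b tensor T, then multiply by W."""
    m, s = softmax_stats(X)
    T = [[math.exp(x - mi) / si for x in row] for row, mi, si in zip(X, m, s)]
    return [[sum(T[i][k] * W[k][j] for k in range(len(W))) for j in range(len(W[0]))]
            for i in range(len(T))]

# Small example: X is 2x3, W is 3x2.
X = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
m, s = softmax_stats(X)
Y_fused = fused_softmax_matmul(X, W, m, s)
Y_naive = naive_attention(X, W)
```

Both versions compute the same Y; the fused one keeps only m and s (a elements each) between the two steps instead of the a·b elements of T.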

The other optimization involves employing FlashAttention, which is an I/O-aware, exact attention algorithm. This algorithm reduces the number of GPU high-bandwidth memory accesses, making it a good fit for our memory bandwidth–limited use case. However, we found this technique to only work for SRAM with certain sizes and to require a large number of registers. Therefore, we only leverage this technique for attention matrices with a certain size on a select set of GPUs.

Winograd fast convolution for 3×3 convolution layers

The backbone of common LDMs heavily relies on 3×3 convolution layers (convolutions with filter size 3×3), comprising over 90% of the layers in the decoder. Despite increased memory consumption and numerical errors, we found Winograd fast convolution to be effective at speeding up the convolutions. Distinct from the filter size 3×3 used in convolutions, tile size refers to the size of a subregion of the input tensor that is processed at a time. Increasing the tile size enhances the efficiency of the convolution in terms of arithmetic logic unit (ALU) usage. However, this improvement comes at the expense of increased memory consumption. Our tests indicate that a tile size of 4×4 achieves the optimal trade-off between computational efficiency and memory utilization.

Tile size	FLOPS savings	Memory usage (intermediate tensors)	Memory usage (weights)
2×2	2.25×	4.00×	1.77×
4×4	4.00×	2.25×	4.00×
6×6	5.06×	1.80×	7.12×
8×8	5.76×	1.56×	11.1×

Impact of Winograd with varying tile sizes for 3×3 convolutions.
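As a rough cross-check of the table, the ratios can be estimated from the textbook multiplication count of Winograd F(m×m, 3×3), where each m×m output tile is computed from an (m+2)×(m+2) input patch. The formulas and the function name `winograd_ratios` below are assumptions made for illustration, not something stated in the post; the values they produce match the table or land within about 0.02 of it.

```python
def winograd_ratios(tile, kernel=3):
    """Estimated cost ratios of Winograd F(tile x tile, kernel x kernel) vs. direct convolution."""
    patch = tile + kernel - 1  # side of the input patch needed per output tile
    # Direct convolution: tile^2 outputs times kernel^2 multiplies; Winograd: patch^2 multiplies.
    flops_savings = (tile * kernel) ** 2 / patch ** 2
    # Transformed-domain elements per output element.
    intermediate = patch ** 2 / tile ** 2
    # Transformed filter size relative to the original kernel x kernel filter.
    weights = patch ** 2 / kernel ** 2
    return flops_savings, intermediate, weights

for tile in (2, 4, 6, 8):
    f, i, w = winograd_ratios(tile)
    print(f"{tile}x{tile}: FLOPS savings {f:.2f}x, intermediates {i:.2f}x, weights {w:.2f}x")
```

Under this model the 4×4 tile is the point where the FLOPS savings and the weight growth are both 4×, which is consistent with the trade-off described above.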

Specialized operator fusion for memory efficiency

We discovered that running LDM inference efficiently on a mobile GPU requires significantly larger fusion windows for commonly employed layers and units in LDMs than current off-the-shelf on-device GPU-accelerated ML inference engines provide. Consequently, we developed specialized implementations that could execute a larger range of neural operators than typical fusion rules would permit. Specifically, we focused on two specializations: the Gaussian Error Linear Unit (GELU) and the group normalization layer.

An approximation of GELU with the hyperbolic tangent function requires writing to and reading from seven auxiliary intermediate tensors (shown as light orange rounded rectangles in the figure below), reading from the input tensor x three times, and writing to the output tensor y once, across eight GPU programs that each implement one of the labeled operations (light blue rectangles). A custom GELU implementation that performs the eight operations in a single shader (shown at the bottom of the figure) can bypass all the memory I/O for the intermediate tensors.

GELU implementations. Top: A naïve implementation with built-in operations would require 8 memory writes and 10 reads. Bottom: Our custom GELU only requires 1 memory read (for x) and 1 write (for y).
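The difference between the two implementations can be sketched in plain Python, using the common tanh approximation GELU(x) ≈ 0.5·x·(1 + tanh(√(2/π)·(x + 0.044715·x³))). The split into eight stages below is an assumption chosen to match the counts in the caption (seven intermediates, three reads of x); the exact operator order in the figure may differ, and the function names are hypothetical.

```python
import math

SQRT_2_OVER_PI = math.sqrt(2.0 / math.pi)

def gelu_unfused(xs):
    """Eight elementwise stages; each list stands in for a full-size tensor in memory."""
    t1 = [x ** 3 for x in xs]              # reads x (1st time)
    t2 = [0.044715 * v for v in t1]
    t3 = [x + v for x, v in zip(xs, t2)]   # reads x (2nd time)
    t4 = [SQRT_2_OVER_PI * v for v in t3]
    t5 = [math.tanh(v) for v in t4]
    t6 = [1.0 + v for v in t5]
    t7 = [x * v for x, v in zip(xs, t6)]   # reads x (3rd time)
    return [0.5 * v for v in t7]           # writes the output y

def gelu_fused(xs):
    """Single pass: one read of x and one write of y per element, no intermediates."""
    return [0.5 * x * (1.0 + math.tanh(SQRT_2_OVER_PI * (x + 0.044715 * x ** 3)))
            for x in xs]

xs = [-2.0, -0.5, 0.0, 0.5, 1.0, 3.0]
y_unfused = gelu_unfused(xs)
y_fused = gelu_fused(xs)
```

The arithmetic is identical in both versions; only the amount of data written to and read from memory between the operations changes.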

Results

After applying all of these optimizations, we conducted tests of Stable Diffusion 1.5 (image resolution 512×512, 20 iterations) on high-end mobile devices. Running Stable Diffusion with our GPU-accelerated ML inference model uses 2,093 MB for the weights and 84 MB for the intermediate tensors. On the latest high-end smartphones, Stable Diffusion can be run in under 12 seconds.

Stable Diffusion runs on modern smartphones in under 12 seconds. Note that running the decoder after each iteration for displaying the intermediate output in this animated GIF results in a ~2× slowdown.

Conclusion

Performing on-device ML inference of large models has proven to be a substantial challenge, encompassing limitations in model file size, extensive runtime memory requirements, and protracted inference latency. By recognizing memory bandwidth usage as the primary bottleneck, we directed our efforts towards optimizing memory bandwidth utilization and striking a delicate balance between ALU efficiency and memory efficiency. As a result, we achieved state-of-the-art inference latency for large diffusion models. You can learn more about this work in the paper.

Acknowledgments

We’d like to thank Yu-Hui Chen, Jiuqiang Tang, Frank Barchard, Yang Zhao, Joe Zou, Khanh LeViet, Chuo-Ling Chang, Andrei Kulik, Lu Wang, and Matthias Grundmann.

NVIDIA Research Wins Autonomous Driving Challenge, Innovation Award at CVPR

NVIDIA will be showcased next week as the winner of the fiercely contested 3D Occupancy Prediction Challenge for autonomous driving development at the Computer Vision and Pattern Recognition Conference (CVPR), in Vancouver, Canada.

The competition had more than 400 submissions from nearly 150 teams across 10 regions.

3D occupancy prediction is the process of forecasting the status of each voxel in a scene, that is, each data point on a 3D bird’s-eye-view grid. Voxels can be identified as free, occupied or unknown.

Critical to the development of safe and robust self-driving systems, 3D occupancy grid prediction provides information to autonomous vehicle (AV) planning and control stacks using state-of-the-art convolutional neural networks and transformer models, which are enabled by the NVIDIA DRIVE platform.

“NVIDIA’s winning solution features two important AV advancements,” said Zhiding Yu, senior research scientist for learning and perception at NVIDIA. “It demonstrates a state-of-the-art model design that yields excellent bird’s-eye-view perception. It also shows the effectiveness of visual foundation models with up to 1 billion parameters and large-scale pretraining in 3D occupancy prediction.”

Perception for autonomous driving has evolved over the past years from handling 2D tasks, such as detecting objects or free spaces in images, to reasoning about the world in 3D with multiple input images.

This now provides a flexible and precise fine-grained representation of objects in complex traffic scenes, which is “critical for achieving the safety perception requirements for autonomous driving,” according to Jose Alvarez, director of AV applied research and distinguished scientist at NVIDIA.

Yu will present the NVIDIA Research team’s award-winning work at CVPR’s End-to-End Autonomous Driving Workshop on Sunday, June 18, at 10:20 a.m. PT, as well as at the Vision-Centric Autonomous Driving Workshop on Monday, June 19, at 4:00 p.m. PT.

In addition to winning first place in the challenge, NVIDIA will receive at the event an Innovation Award, recognizing its “fresh insights into the development of view transformation modules,” with “substantially improved performance” compared to previous approaches, according to the CVPR workshop committee.

Read NVIDIA’s technical report on the submission.

Safer Vehicles With 3D Occupancy Prediction

While traditional 3D object detection — detecting and representing objects in a scene, often using 3D bounding boxes — is a core task in AV perception, it has its limitations. For example, it lacks expressiveness, meaning the bounding boxes might not represent enough real-world information. It also requires defining taxonomies and ground truths for all possible objects, even ones rarely seen in the real world, such as road hazards that may have fallen off a truck.

In contrast, 3D occupancy prediction provides rich information about the world to a self-driving vehicle’s planning stack, which is necessary for end-to-end autonomous driving.

Software-defined vehicles can be continuously upgraded with new developments that are proven and validated over time. State-of-the-art software updates that evolve from research initiatives, such as the ones recognized at CVPR, are enabling new features and safer driving capabilities.

The NVIDIA DRIVE platform offers a path to production for automakers, providing full-stack hardware and software for safe and secure AV development, from the car to the data center.

More on the CVPR Challenge

The 3D Occupancy Prediction Challenge at CVPR required participants to develop algorithms that solely used camera input during inference. Participants could use open-source datasets and models, facilitating the exploration of data-driven algorithms and large-scale models. The organizers provided a baseline sandbox for the latest state-of-the-art 3D occupancy prediction algorithms in real-world scenarios.

NVIDIA at CVPR

NVIDIA is presenting nearly 30 papers and presentations at CVPR. Experts who’ll discuss autonomous driving include:

Jose Alvarez on emerging challenges for 3D perception in AVs during the End-to-End Autonomous Driving Workshop: Emerging Tasks and Challenges Workshop; and on optimizing large deep models for real-time inference at the Embedded Vision Workshop.
Nikolai Smolyanskiy, director of deep learning at NVIDIA, on real-time traffic prediction for AVs during the End-to-End Autonomous Driving Workshop: Perception, Prediction, Planning and Simulation.
Robin Jenkin, distinguished engineer at NVIDIA, on image quality in fisheye cameras at the OmniCV Workshop, held in conjunction with CVPR.
Xinshuo Weng, research scientist for AV research at NVIDIA, on vision solutions for autonomous driving during the Vision-Centric Autonomous Driving Workshop.

View other talks on the agenda and learn more about NVIDIA at CVPR, which runs June 18-22.

Featured image courtesy of OccNet and Occ3D.

Build a multilingual automatic translation pipeline with Amazon Translate Active Custom Translation

Dive into Deep Learning (D2L.ai) is an open-source textbook that makes deep learning accessible to everyone. It features interactive Jupyter notebooks with self-contained code in PyTorch, JAX, TensorFlow, and MXNet, as well as real-world examples, exposition figures, and math. So far, D2L has been adopted by more than 400 universities around the world, such as the University of Cambridge, Stanford University, the Massachusetts Institute of Technology, Carnegie Mellon University, and Tsinghua University. This work is also made available in Chinese, Japanese, Korean, Portuguese, Turkish, and Vietnamese, with plans to launch Spanish and other languages.

It is a challenging endeavor to have an online book that is continuously kept up to date, written by multiple authors, and available in multiple languages. In this post, we present a solution that D2L.ai used to address this challenge by using the Active Custom Translation (ACT) feature of Amazon Translate and building a multilingual automatic translation pipeline.

We demonstrate how to use the AWS Management Console and Amazon Translate public API to deliver automatic machine batch translation, and analyze the translations between two language pairs: English and Chinese, and English and Spanish. We also recommend best practices when using Amazon Translate in this automatic translation pipeline to ensure translation quality and efficiency.

Solution overview

We built automatic translation pipelines for multiple languages using the ACT feature in Amazon Translate. ACT allows you to customize translation output on the fly by providing tailored translation examples in the form of parallel data. Parallel data consists of a collection of textual examples in a source language and the desired translations in one or more target languages. During translation, ACT automatically selects the most relevant segments from the parallel data and updates the translation model on the fly based on those segment pairs. This results in translations that better match the style and content of the parallel data.

The architecture contains multiple sub-pipelines; each sub-pipeline handles one language translation, such as English to Chinese, English to Spanish, and so on. Multiple translation sub-pipelines can be processed in parallel. In each sub-pipeline, we first build the parallel data in Amazon Translate using the high-quality dataset of tailored translation examples from the human-translated D2L books. Then we generate the customized machine translation output on the fly at run time, which achieves better quality and accuracy.

In the following sections, we demonstrate how to build each translation pipeline using Amazon Translate with ACT, along with Amazon SageMaker and Amazon Simple Storage Service (Amazon S3).

First, we put the source documents, reference documents, and parallel data training set in an S3 bucket. Then we build Jupyter notebooks in SageMaker to run the translation process using Amazon Translate public APIs.

Prerequisites

To follow the steps in this post, make sure you have an AWS account with the following:

Access to AWS Identity and Access Management (IAM) for role and policy configuration
Access to Amazon Translate, SageMaker, and Amazon S3
An S3 bucket to store the source documents, reference documents, parallel data dataset, and output of translation

Create an IAM role and policies for Amazon Translate with ACT

Our IAM role needs to contain a custom trust policy for Amazon Translate:

{
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "Statement1",
        "Effect": "Allow",
        "Principal": {
            "Service": "translate.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
    }]
}

This role must also have a permissions policy that grants Amazon Translate read access to the input folder and subfolders in Amazon S3 that contain the source documents, and read/write access to the output S3 bucket and folder that contains the translated documents:

{
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "s3:ListBucket",
            "s3:GetObject",
            "s3:PutObject",
            "s3:DeleteObject"
        ],
        "Resource": [
            "arn:aws:s3:::YOUR-S3_BUCKET-NAME",
            "arn:aws:s3:::YOUR-S3_BUCKET-NAME/*"
        ]
    }]
}

To run Jupyter notebooks in SageMaker for the translation jobs, we need to attach an inline permissions policy to the SageMaker execution role. This policy allows the execution role to pass the Amazon Translate service role to Amazon Translate, so that translation jobs started from the SageMaker notebooks can access the source and translated documents in the designated S3 buckets:

{
    "Version": "2012-10-17",
    "Statement": [{
        "Action": ["iam:PassRole"],
        "Effect": "Allow",
        "Resource": [
            "arn:aws:iam::YOUR-AWS-ACCOUNT-ID:role/batch-translate-api-role"
        ]
    }]
}

Prepare parallel data training samples

The parallel data in ACT is created from an input file consisting of a list of textual example pairs, for instance, pairs of source language (English) and target language (Chinese) text. The input file can be in TMX, CSV, or TSV format. The following screenshot shows an example of a CSV input file. The first column is the source language data (in English), and the second column is the target language data (in Chinese). The following example is extracted from the D2L-en and D2L-zh books.
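A CSV input file of this shape could be produced with Python's standard `csv` module, as in the minimal sketch below. It assumes, based on our reading of the Amazon Translate parallel data documentation, that the first row holds the language codes of the columns; the helper name `build_parallel_data_csv` and the two example pairs are illustrative and are not taken from the D2L dataset.

```python
import csv
import io

def build_parallel_data_csv(pairs, src_lang="en", tgt_lang="zh"):
    """Return CSV text with a language-code header row followed by one row per example pair."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow([src_lang, tgt_lang])  # header: source and target language codes
    writer.writerows(pairs)                # source text in column 1, translation in column 2
    return buf.getvalue()

# Illustrative pairs only (hypothetical, not from the D2L books).
pairs = [
    ("Deep learning", "深度学习"),
    ("Linear regression", "线性回归"),
]
csv_text = build_parallel_data_csv(pairs)

# To save the file for upload to the S3 bucket:
# with open("parallel_data.csv", "w", encoding="utf-8", newline="") as f:
#     f.write(csv_text)
```

Quoting every field keeps commas and line breaks inside a sentence from being read as column or row separators.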

Perform custom parallel data training in Amazon Translate

First, we set up the S3 bucket and folders as shown in the following screenshot. The source_data folder contains the source documents before the translation; the generated documents after the batch translation are put in the output folder. The ParallelData folder holds the parallel data input file prepared in the previous step.

After uploading the input files to the source_data folder, we can use the CreateParallelData API to run a parallel data creation job in Amazon Translate:

import boto3

translate_client = boto3.client("translate")   # uses the notebook's default AWS credentials and Region

S3_BUCKET = "YOUR-S3_BUCKET-NAME"
pd_name = "pd-d2l-short_test_sentence_enzh_all"        # parallel data name
pd_description = "Parallel Data for English to Chinese" # parallel data description
pd_fn = "d2l_short_test_sentence_enzh_all.csv"         # parallel data input file
response_t = translate_client.create_parallel_data(
    Name=pd_name,
    Description=pd_description,
    ParallelDataConfig={
        # The folder name must match the S3 folder exactly (S3 keys are case-sensitive)
        'S3Uri': 's3://' + S3_BUCKET + '/ParallelData/' + pd_fn,
        'Format': 'CSV'
    },
)
print(pd_name, ": ", response_t['Status'], " created.")

To update existing parallel data with new training datasets, we can use the UpdateParallelData API:

# translate_client is a boto3 Amazon Translate client, for example boto3.client("translate")
S3_BUCKET = "YOUR-S3_BUCKET-NAME"
pd_name = "pd-d2l-short_test_sentence_enzh_all"        # parallel data name
pd_description = "Parallel Data for English to Chinese" # parallel data description
pd_fn = "d2l_short_test_sentence_enzh_all.csv"         # parallel data input file
response_t = translate_client.update_parallel_data(
    Name=pd_name,
    Description=pd_description,
    ParallelDataConfig={
        # The folder name must match the S3 folder exactly (S3 keys are case-sensitive)
        'S3Uri': 's3://' + S3_BUCKET + '/ParallelData/' + pd_fn,
        'Format': 'CSV'
    },
)
print(pd_name, ": ", response_t['Status'], " updated.")

We can check the training job progress on the Amazon Translate console. When the job is complete, the parallel data status shows as Active and is ready to use.

Run asynchronous batch translation using parallel data

Batch translation is a process in which multiple source documents are automatically translated into documents in the target languages. The process involves uploading the source documents to the input folder of the S3 bucket, then calling the StartTextTranslationJob API of Amazon Translate to initiate an asynchronous translation job:

# translate_client is a boto3 Amazon Translate client, for example boto3.client("translate")
S3_BUCKET = "YOUR-S3_BUCKET-NAME"
ROLE_ARN = "THE_ROLE_DEFINED_IN_STEP_1"
src_fdr = "source_data"
output_fdr = "output"
src_lang = "en"
tgt_lang = "zh"
pd_name = "pd-d2l-short_test_sentence_enzh_all"
response = translate_client.start_text_translation_job(
    JobName='D2L_job',
    InputDataConfig={
        # src_fdr is the folder in the S3 bucket containing the source files
        'S3Uri': 's3://' + S3_BUCKET + '/' + src_fdr + '/',
        'ContentType': 'text/html'
    },
    OutputDataConfig={
        # output_fdr is the folder in the S3 bucket that receives the translated files
        'S3Uri': 's3://' + S3_BUCKET + '/' + output_fdr + '/',
    },
    DataAccessRoleArn=ROLE_ARN,         # ROLE_ARN is the role defined in the previous step
    SourceLanguageCode=src_lang,        # src_lang is the source language, such as 'en'
    TargetLanguageCodes=[tgt_lang],     # tgt_lang is the target language, such as 'zh'
    ParallelDataNames=[pd_name]         # list of parallel data names; pd_name was defined in the previous step
)

We selected five source documents in English from the D2L book (D2L-en) for the bulk translation. On the Amazon Translate console, we can monitor the translation job progress. When the job status changes to Completed, we can find the translated documents in Chinese (D2L-zh) in the S3 bucket output folder.
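As an alternative to watching the console, the job status can be polled from the notebook. The sketch below is one possible way to do that with the DescribeTextTranslationJob API; the helper name `wait_for_translation_job`, the polling interval, and the choice of terminal states are assumptions for illustration. The client is passed in as a parameter so the function works with any boto3 Amazon Translate client.

```python
import time

# Job states after which the status no longer changes (assumed set for this sketch).
TERMINAL_STATES = {"COMPLETED", "COMPLETED_WITH_ERROR", "FAILED", "STOPPED"}

def wait_for_translation_job(client, job_id, poll_seconds=60, sleep=time.sleep):
    """Poll a batch translation job until it reaches a terminal state and return that state."""
    while True:
        props = client.describe_text_translation_job(JobId=job_id)["TextTranslationJobProperties"]
        status = props["JobStatus"]
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)  # batch jobs are asynchronous, so wait between checks
```

With the job started earlier, a call could look like `wait_for_translation_job(translate_client, response["JobId"])`, where `response` is the return value of `start_text_translation_job`.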

Evaluate the translation quality

To demonstrate the effectiveness of the ACT feature in Amazon Translate, we also applied the traditional method of Amazon Translate real-time translation without parallel data to process the same documents, and compared the output with the batch translation output with ACT. We used the BLEU (BiLingual Evaluation Understudy) score to benchmark the translation quality of the two methods. The only way to accurately measure the quality of machine translation output is to have an expert review and grade it. However, BLEU provides an estimate of the relative quality improvement between two outputs. A BLEU score is typically a number between 0 and 1; it measures the similarity of the machine translation to a reference human translation. A higher score indicates a translation that is closer to the reference.
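To make the metric concrete, here is a simplified single-reference BLEU in plain Python: the geometric mean of clipped 1-gram to 4-gram precisions, multiplied by a brevity penalty. It is a sketch for intuition only, not the scoring tool used for the benchmarks in this post. It assumes whitespace tokenization (Chinese text would first need character or word segmentation) and applies no smoothing, so any missing n-gram order yields a score of 0.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU on whitespace tokens, in the range [0, 1]."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matched / total))
    # The brevity penalty discourages candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "the cat sat on the mat"
score_exact = bleu("the cat sat on the mat", reference)
score_close = bleu("the cat sat on a mat", reference)
```

An exact match scores 1, and changing a single word lowers the score because it breaks every n-gram that overlaps the changed position.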

We have tested a set of documents in four pipelines: English into Chinese (en to zh), Chinese into English (zh to en), English into Spanish (en to es), and Spanish into English (es to en). The following figure shows that the translation with ACT produced a higher average BLEU score in all the translation pipelines.

We also observed that the more granular the parallel data pairs are, the better the translation performance. For example, we use the following parallel data input file with pairs of paragraphs, which contains 10 entries.

For the same content, we use the following parallel data input file with pairs of sentences and 16 entries.

We used both parallel data input files to construct two parallel data entities in Amazon Translate, then created two batch translation jobs with the same source document. The following figure compares the output translations. It shows that the output using parallel data with pairs of sentences outperformed the one using parallel data with pairs of paragraphs, for both English to Chinese translation and Chinese to English translation.

If you are interested in learning more about these benchmark analyses, refer to Auto Machine Translation and Synchronization for “Dive into Deep Learning”.

Clean up

To avoid recurring costs in the future, we recommend you clean up the resources you created:

On the Amazon Translate console, select the parallel data you created and choose Delete. Alternatively, you can use the DeleteParallelData API or the AWS Command Line Interface (AWS CLI) delete-parallel-data command to delete the parallel data.
Delete the S3 bucket used to host the source and reference documents, translated documents, and parallel data input files.
Delete the IAM role and policy. For instructions, refer to Deleting roles or instance profiles and Deleting IAM policies.

Conclusion

With this solution, we aim to reduce the workload of human translators by 80%, while maintaining the translation quality and supporting multiple languages. You can use this solution to improve your translation quality and efficiency. We are working on further improving the solution architecture and translation quality for other languages.

Your feedback is always welcome; please leave your thoughts and questions in the comments section.

About the authors

Yunfei Bai is a Senior Solutions Architect at AWS. With a background in AI/ML, data science, and analytics, Yunfei helps customers adopt AWS services to deliver business results. He designs AI/ML and data analytics solutions that overcome complex technical challenges and drive strategic objectives. Yunfei has a PhD in Electronic and Electrical Engineering. Outside of work, Yunfei enjoys reading and music.

Rachel Hu is an applied scientist at AWS Machine Learning University (MLU). She has been leading a few course designs, including ML Operations (MLOps) and Accelerator Computer Vision. Rachel is an AWS senior speaker and has spoken at top conferences including AWS re:Invent, NVIDIA GTC, KDD, and MLOps Summit. Before joining AWS, Rachel worked as a machine learning engineer building natural language processing models. Outside of work, she enjoys yoga, ultimate frisbee, reading, and traveling.

Watson Srivathsan is the Principal Product Manager for Amazon Translate, AWS’s natural language processing service. On weekends, you will find him exploring the outdoors in the Pacific Northwest.

Do Pass Go, Do Collect More Games: Xbox Game Pass Coming to GeForce NOW

Xbox Game Pass support is coming to GeForce NOW.

Members will soon be able to play supported PC games from the Xbox Game Pass catalog through NVIDIA’s cloud gaming servers. Learn more about how support for Game Pass and Microsoft Store will roll out in the coming months.

Plus, Age of Empires IV: Anniversary Edition is the first from the world’s most popular real-time strategy franchise to arrive on GeForce NOW.

A Game Pass-tic Partnership

Announced over the weekend, Game Pass members will soon be able to play supported PC games from the Game Pass catalog with GeForce NOW.

Thrilled to share that in the coming months you’ll be able to play your @XboxGamePassPC games through NVIDIA GeForce NOW. Can’t wait for you to jump in! https://t.co/jZXkjHZUrf

— BondSarahBond (@BondSarah_Bond) June 12, 2023

We’re working closely with Microsoft to enable members to play select PC titles from Microsoft Store, just as they can today on GeForce NOW with their Steam, Epic Games Store, Ubisoft Connect and GOG.com accounts. Members who are subscribed to PC Game Pass or Xbox Game Pass Ultimate will be able to stream these select PC titles from the Game Pass library — without downloads or additional purchases for instant gaming from the cloud.

With hundreds of PC titles available in the Game Pass catalog, Xbox and PC gamers together can look forward to future GFN Thursdays to see what’s next. PC games from Xbox Game Studios and Bethesda on Steam and Epic Games Store will continue to be released, giving members more ways to play their favorite Xbox titles.

And with the ability for GeForce NOW members to stream at high performance across devices, including PCs, Macs, mobile devices, smart TVs, gaming handheld devices and more, gamers everywhere will be able to take their Xbox PC games wherever they go, along with the over 1,600 titles in the GeForce NOW library.

For an upgraded experience, sign up for an Ultimate or Priority membership to skip the waiting lines ahead of free members and get into gaming even faster.

Build Your Empire — and Library

Age of Empires IV on GeForce NOW — *Siege the moment!*

Conquer the lands in Microsoft’s award-winning Age of Empires franchise this week.

Age of Empires IV: Anniversary Edition takes the world’s most popular real-time strategy game to the next level with familiar and new ways for players to expand their empire. The Anniversary Edition brings all the latest updates, including new civilizations — the Ottomans and Malians — maps, languages, challenges and more. Choose the path to greatness and become a part of history through Campaign Story Mode with a tutorial designed for first-time players, or challenge the world in competitive or cooperative online matches that include ranked seasons.

Ultimate members can rule the kingdom in stunning 4K or ultrawide resolutions, and settle in with up to eight-hour streaming sessions.

What to Play This Week

Dordogne on GeForce NOW — *Hand-painted nostalgia in the cloud this summer.*

Take a look at the two new games available to stream this week:

Dordogne (New release on Steam)
Age of Empires IV: Anniversary Edition (Steam)

Before the weekend arrives, check out our question of the week. Let us know your answer on Twitter or in the comments below.

You’ve been chosen to build the greatest empire in history.

What time period are you choosing to build it in?

— NVIDIA GeForce NOW (@NVIDIAGFN) June 14, 2023

Bring SageMaker Autopilot into your MLOps processes using a custom SageMaker Project

Every organization has its own set of standards and practices that provide security and governance for their AWS environment. Amazon SageMaker is a fully managed service to prepare data and build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows. SageMaker provides a set of templates for organizations that want to quickly get started with ML workflows and DevOps continuous integration and continuous delivery (CI/CD) pipelines.

The majority of enterprise customers already have a well-established MLOps practice with a standardized environment in place—for example, a standardized repository, infrastructure, and security guardrails—and want to extend their MLOps process to no-code and low-code AutoML tools as well. They also have a lot of processes that need to be adhered to before promoting a model to production. They’re looking for a quick and easy way to graduate from the initial phase to a repeatable, reliable, and eventually scalable operating phase, as outlined in the following diagram. For more information, refer to MLOps foundation roadmap for enterprises with Amazon SageMaker.

Although these companies have robust data science and MLOps teams to help them build reliable and scalable pipelines, they want to have their low-code AutoML tool users produce code and model artifacts in a manner that can be integrated with their standardized practices, adhering to their code repo structure and with appropriate validations, tests, steps, and approvals.

They are looking for a mechanism for the low-code tools to generate all the source code for each step of the AutoML tasks (preprocessing, training, and postprocessing) in a standardized repository structure that can provide their expert data scientists with the capability to view, validate, and modify the workflow per their needs and then generate a custom pipeline template that can be integrated into a standardized environment (where they have defined their code repository, code build tools, and processes).

This post showcases how to set up a repeatable process with low-code tools like Amazon SageMaker Autopilot so that it integrates seamlessly into your environment and you don’t have to orchestrate this end-to-end workflow on your own. We demonstrate how to apply CI/CD to the code generated by low-code/no-code tools so that it fits into your MLOps environment while adhering to MLOps best practices.

Solution overview

To demonstrate the orchestrated workflow, we use the publicly available UCI Adult 1994 Census Income dataset to predict whether a person has an annual income greater than $50,000. This is a binary classification problem; the income target variable is either over $50,000 or not.

The following table summarizes the key components of the dataset.

Characteristic	Value
Data set characteristics	Multivariate
Attribute characteristics	Categorical, Integer
Associated tasks	Classification
Number of instances	48842
Number of attributes	14
Missing values	Yes
Area	Social
Date donated	1996-05-01
Number of web hits	2749715

The following table summarizes the attribute information.

Column Name	Description
Age	Continuous
Workclass	Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
fnlwgt	continuous
education	Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num	continuous
marital-status	Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation	Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces
relationship	Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race	White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
sex	Female, Male
capital-gain	Continuous
capital-loss	Continuous
hours-per-week	Continuous
native-country	United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
class	Income class, either <=50K or >50K
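
As a small illustration of the binary target, the class column can be mapped to 0 and 1 before modeling. This is a minimal sketch, not part of the solution's seed code: it assumes the labels are the strings `<=50K` and `>50K`, and the function name `encode_income` is hypothetical. Some copies of the dataset append a trailing period to the label, so the sketch strips it.

```python
def encode_income(label: str) -> int:
    """Map a raw income label to a binary target (1 = income above $50K)."""
    # Remove surrounding whitespace and an optional trailing period.
    cleaned = label.strip().rstrip(".")
    if cleaned == ">50K":
        return 1
    if cleaned == "<=50K":
        return 0
    raise ValueError(f"unexpected income label: {label!r}")


rows = [" <=50K", ">50K", ">50K."]
targets = [encode_income(r) for r in rows]
print(targets)
```

In the workflow described in this post, Autopilot handles the target column itself; the sketch only makes the two classes explicit.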

In this post, we showcase how to use Amazon SageMaker Projects, a tool that helps organizations set up and standardize environments for MLOps with low-code AutoML tools like Autopilot and Amazon SageMaker Data Wrangler.

Autopilot eliminates the heavy lifting of building ML models. You simply provide a tabular dataset and select the target column to predict, and Autopilot will automatically explore different solutions to find the best model. You then can directly deploy the model to production with just one click or iterate on the recommended solutions to further improve the model quality.

Data Wrangler provides an end-to-end solution to import, prepare, transform, featurize, and analyze data. You can integrate a Data Wrangler data preparation flow into your ML workflows to simplify and streamline data preprocessing and feature engineering using little to no coding. You can also add your own Python scripts and transformations to customize workflows. We use Data Wrangler to perform preprocessing on the dataset before submitting the data to Autopilot.

SageMaker Projects helps organizations set up and standardize environments for automating different steps involved in an ML lifecycle. Although notebooks are helpful for model building and experimentation, a team of data scientists and ML engineers sharing code need a more scalable way to maintain code consistency and strict version control.

To help you get started with common model building and deployment paradigms, SageMaker Projects offers a set of first-party templates (1P templates). The 1P templates generally focus on creating resources for model building and model training. The templates include projects that use AWS-native services for CI/CD, such as AWS CodeBuild and AWS CodePipeline. SageMaker Projects can support custom template offerings, where organizations use an AWS CloudFormation template to run a Terraform stack and create the resources needed for an ML workflow.

Organizations may want to extend the 1P templates to support use cases beyond simply training and deploying models. Custom project templates are a way for you to create a standard workflow for ML projects. You can create several templates and use AWS Identity and Access Management (IAM) policies to manage access to those templates on Amazon SageMaker Studio, ensuring that each of your users accesses the projects dedicated to their use cases.

To learn more about SageMaker Projects and creating custom project templates aligned with best practices, refer to Build Custom SageMaker Project Templates – Best Practices.

These custom templates are created as AWS Service Catalog products and provisioned as organization templates on the Studio UI. This is where data scientists can choose a template and have their ML workflow bootstrapped and preconfigured. Projects are provisioned using AWS Service Catalog products. Project templates are used by organizations to provision projects for each of their teams.

In this post, we showcase how to build a custom project template to have an end-to-end MLOps workflow using SageMaker projects, AWS Service Catalog, and Amazon SageMaker Pipelines integrating Data Wrangler and Autopilot with humans in the loop in order to facilitate the steps of model training and deployment. The humans in the loop are the different personas involved in an MLOps practice working collaboratively for a successful ML build and deploy workflow.

The following diagram illustrates the end-to-end low-code/no-code automation workflow.

The workflow includes the following steps:

The Ops team or the Platform team launches the CloudFormation template to set up the prerequisites required to provision the custom SageMaker template.
When the template is available in SageMaker, the Data Science Lead uses the template to create a SageMaker project.
Creating the SageMaker project launches an AWS Service Catalog product that adds seed code to two AWS CodeCommit repositories:
- The seed code for the model building pipeline includes a pipeline that preprocesses the UCI Machine Learning Adult dataset using Data Wrangler, automatically creates an ML model with full visibility using Autopilot, evaluates the performance of a model using a processing step, and registers the model into a model registry based on the model performance.
- The seed code for model deployment includes a CodeBuild step to find the latest model that has been approved in the model registry and create configuration files to deploy the CloudFormation templates as part of the CI/CD pipelines using CodePipeline. The CloudFormation template deploys the model to staging and production environments.
The first seed code commit starts a CI/CD pipeline using CodePipeline that triggers a SageMaker pipeline, which is a series of interconnected steps encoded using a directed acyclic graph (DAG). In this case, the steps involved are data processing using a Data Wrangler flow, training the model using Autopilot, creating the model, evaluating the model, and if the evaluation is passed, registering the model.

For more details on creating SageMaker pipelines using Autopilot, refer to Launch Amazon SageMaker Autopilot experiments directly from within Amazon SageMaker Pipelines to easily automate MLOps workflows.

After the model is registered, the model approver can either approve or reject the model in Studio.
When the model is approved, a CodePipeline deployment pipeline integrated with the second seed code is triggered.
This pipeline creates a SageMaker serverless scalable endpoint for the staging environment.
The deployment pipeline includes an automated test step that runs against the staging endpoint.
The test results are stored in Amazon Simple Storage Service (Amazon S3). The pipeline will stop for a production deployment approver, who can review all the artifacts before approving.
Once approved, the model is deployed to production as a scalable serverless endpoint. Production applications can now consume the endpoint for inference.
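
The model build pipeline in the workflow above is a directed acyclic graph of steps. The following minimal sketch shows that idea with Python's standard library rather than the SageMaker Pipelines SDK; the step names and the `steps_to_run` helper are illustrative assumptions, not identifiers from the seed code.

```python
from graphlib import TopologicalSorter

# Each key is a step; its value is the set of steps it depends on.
# Step names are illustrative, not the names used in the seed code.
pipeline = {
    "DataProcessing": set(),
    "AutopilotTraining": {"DataProcessing"},
    "CreateModel": {"AutopilotTraining"},
    "EvaluateModel": {"CreateModel"},
    "RegisterModel": {"EvaluateModel"},
}


def run_order(dag):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())


def steps_to_run(dag, evaluation_passed):
    """Drop the registration step when the evaluation does not pass."""
    order = run_order(dag)
    if not evaluation_passed:
        order.remove("RegisterModel")
    return order


print(steps_to_run(pipeline, evaluation_passed=True))
```

In the actual pipeline, the conditional registration is expressed as a pipeline step rather than a Python branch; the sketch only shows the ordering and the gate.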

The deployment steps consist of the following:

Create the custom SageMaker project template for Autopilot and other resources using AWS CloudFormation. This is a one-time setup task.
Create the SageMaker project using the custom template.

In the following sections, we proceed with each of these steps in more detail and explore the project details page.

Prerequisites

This walkthrough includes the following prerequisites:

An AWS account.
A Studio domain managed policy attached to the IAM execution role. For instructions on assigning permissions to the role, refer to Amazon SageMaker API Permissions: Actions, Permissions, and Resources Reference. For more information, refer to Amazon SageMaker Identity-Based Policy Examples.
For this post, you use a CloudFormation template. Follow the instructions in AWS CloudFormation Getting Started for more information.

Create solution resources with AWS CloudFormation

You can download and launch the CloudFormation template via the AWS CloudFormation console, the AWS Command Line Interface (AWS CLI), the SDK, or by simply choosing Launch Stack:

The CloudFormation template is also available in the AWS Samples GitHub Code repository. The repository contains the following:

A CloudFormation template to set up the custom SageMaker project template for Autopilot
Seed code with the ML code to set up SageMaker pipelines to automate the data processing and training steps
A project folder for the CloudFormation template used by AWS Service Catalog mapped to the custom SageMaker project template that will be created

The CloudFormation template takes several parameters as input.

The following are the AWS Service Catalog product information parameters:

Product Name – The name of the AWS Service Catalog product that the SageMaker project custom MLOps template will be associated with
Product Description – The description for the AWS Service Catalog product
Product Owner – The owner of the Service Catalog product
Product Distributor – The distributor of the Service Catalog product

The following are the AWS Service Catalog product support information parameters:

Product Support Description – A support description for this product
Product Support Email – An email address of the team supporting the AWS Service Catalog product
Product Support URL – A support URL for the AWS Service Catalog product

The following are the source code repository configuration parameters:

URL to the zipped version of your GitHub repository – Use the defaults if you’re not forking the AWS Samples repository.
Name and branch of your GitHub repository – These should match the root folder of the zip. Use the defaults if you’re not forking the AWS Samples repository.
StudioUserExecutionRole – Provide the ARN of the Studio user execution IAM role.

After you launch the CloudFormation stack from this template, you can monitor its status on the AWS CloudFormation console.

When the stack is complete, copy the value of the CodeStagingBucketName key on the Outputs tab of the CloudFormation stack and save it in a text editor to use later.
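
If you prefer to read the output value programmatically instead of copying it from the console, a minimal sketch with the AWS SDK for Python might look like the following. The helper name `stack_output` and the stack name in the usage comment are assumptions; `describe_stacks` is the standard CloudFormation API call.

```python
def stack_output(cfn_client, stack_name, key):
    """Return one output value from a CloudFormation stack."""
    response = cfn_client.describe_stacks(StackName=stack_name)
    # Outputs may be absent until the stack has finished creating.
    outputs = response["Stacks"][0].get("Outputs", [])
    for item in outputs:
        if item["OutputKey"] == key:
            return item["OutputValue"]
    raise KeyError(f"{key} not found in outputs of {stack_name}")


# Usage (requires boto3 and AWS credentials; the stack name is hypothetical):
#   import boto3
#   bucket = stack_output(boto3.client("cloudformation"),
#                         "my-autopilot-template-stack",
#                         "CodeStagingBucketName")
```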

Create the SageMaker project using the new custom template

To create your SageMaker project, complete the following steps:

Sign in to Studio. For more information, see Onboard to Amazon SageMaker Domain.
In the Studio sidebar, choose the home icon.
Choose Deployments from the menu, then choose Projects.
Choose Create project.
Choose Organization templates to view the new custom MLOps template.
Choose Select project template.

For Project details, enter a name and description for your project.
For MLOpsS3Bucket, enter the name of the S3 bucket you saved earlier.

Choose Create project.

A message appears indicating that SageMaker is provisioning and configuring the resources.

When the project is complete, you receive a success message, and your project is now listed on the Projects list.

Explore the project details

On the project details page, you can view various tabs associated with the project. Let’s look at each of these tabs in detail.

Repositories

This tab lists the code repositories associated with this project. You can choose clone repo under Local path to clone the two seed code repositories created in CodeCommit by the SageMaker project. This option provides you with Git access to the code repositories from the SageMaker project itself.

When the clone of the repository is complete, the local path appears in the Local path column. You can choose the path to open the local folder that contains the repository code in Studio.

The folder will be accessible in the navigation pane. You can use the file browser icon to hide or show the folder list. You can make the code changes here or choose the Git icon to stage, commit, and push the change.

Pipelines

This tab lists the SageMaker ML pipelines that define steps to prepare data, train models, and deploy models. For information about SageMaker ML pipelines, see Create and Manage SageMaker Pipelines.

You can choose the pipeline that is currently running to see its latest status. In the following example, the DataProcessing step is performed by using a Data Wrangler data flow.

You can access the data flow from the local path of the code repository that we cloned earlier. Choose the file browser icon to show the path, which is listed in the pipelines folder of the model build repository.

In the pipelines folder, open the autopilot folder.

In the autopilot folder, open the preprocess.flow file.

It will take a moment to open the Data Wrangler flow.

In this example, three data transformations are performed between the source and destination. You can choose each transformation to see more details.

For instructions on how to include or remove transformations in Data Wrangler, refer to Transform Data.

For more information, refer to Unified data preparation and model training with Amazon SageMaker Data Wrangler and Amazon SageMaker Autopilot – Part 1.

When you’re done reviewing, choose the power icon and stop the Data Wrangler resources under Running Apps and Kernel Sessions.

Experiments

This tab lists the Autopilot experiments associated with the project. For more information about Autopilot, see Automate model development with Amazon SageMaker Autopilot.

Model groups

This tab lists groups of model versions that were created by pipeline runs in the project. When the pipeline run is complete, the model created from the last step of the pipeline will be accessible here.

You can choose the model group to access the latest version of the model.

The status of the model version in the following example is Pending. You can choose the model version and choose Update status to update the status.

Choose Approved and choose Update status to approve the model.

After the model status is approved, the model deploy CI/CD pipeline within CodePipeline will start.
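
The same approval can be scripted. The sketch below is one possible approach, assuming the model package group name is known; the helper name `approve_latest_pending` is hypothetical, while `list_model_packages` and `update_model_package` are SageMaker API calls.

```python
def approve_latest_pending(sm_client, group_name, note="Approved after review"):
    """Approve the newest model version that is waiting for manual approval."""
    response = sm_client.list_model_packages(
        ModelPackageGroupName=group_name,
        ModelApprovalStatus="PendingManualApproval",
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    packages = response["ModelPackageSummaryList"]
    if not packages:
        return None  # nothing is waiting for approval
    arn = packages[0]["ModelPackageArn"]
    sm_client.update_model_package(
        ModelPackageArn=arn,
        ModelApprovalStatus="Approved",
        ApprovalDescription=note,
    )
    return arn


# Usage (requires boto3 and AWS credentials; the group name is hypothetical):
#   import boto3
#   approve_latest_pending(boto3.client("sagemaker"), "my-model-group")
```

Keeping a person in this step is the point of the workflow, so a script like this is best suited to test environments.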

You can open the deployed pipeline to see the different stages in the repo.

As shown in the preceding screenshot, this pipeline has four stages:

Source – In this stage, CodePipeline checks the CodeCommit repo code into the S3 bucket.
Build – In this stage, CloudFormation templates are prepared for the deployment of the model code.
DeployStaging – This stage consists of three sub-stages:
- DeployResourcesStaging – In the first sub-stage, the CloudFormation stack is deployed to create a serverless SageMaker endpoint in the staging environment.
- TestStaging – In the second sub-stage, automated testing is performed using CodeBuild on the endpoint to check that inference works as expected. The test results are available in the S3 bucket named sagemaker-project-<project ID of the SageMaker project>.

You can get the SageMaker project ID on the Settings tab of the SageMaker project. Within the S3 bucket, choose the project name folder (for example, sagemaker-MLOp-AutoP) and within that, open the TestArtifa/ folder. Choose the object file in this folder to see the test results.

You can access the testing script from the local path of the code repository that we cloned earlier. Choose the file browser icon to view the path. Note that this is the deploy repository. In that repo, open the test folder and choose the test.py Python code file.

You can make changes to this testing code as per your use case.

- ApproveDeployment – In the third sub-stage, there is an additional approval process before the last stage of deploying to production. You can choose Review and approve it to proceed.

DeployProd – In this stage, the CloudFormation stack is deployed to create a serverless SageMaker endpoint for the production environment.
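
To give a sense of what a staging test like the one in the TestStaging sub-stage might check, here is a minimal sketch. It is not the test.py from the deploy repository: the helper name `check_endpoint`, the CSV payload, and the pass criterion are assumptions. `invoke_endpoint` is the SageMaker Runtime API call.

```python
def check_endpoint(runtime_client, endpoint_name, csv_row):
    """Send one CSV record to an endpoint and report whether it answered."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=csv_row,
    )
    status = response["ResponseMetadata"]["HTTPStatusCode"]
    prediction = response["Body"].read().decode("utf-8").strip()
    # Assumed pass criterion: HTTP 200 and a non-empty prediction.
    return {"ok": status == 200 and prediction != "", "prediction": prediction}


# Usage (requires boto3 and AWS credentials; the endpoint name is hypothetical):
#   import boto3
#   check_endpoint(boto3.client("sagemaker-runtime"), "my-staging-endpoint",
#                  "39,State-gov,77516,Bachelors")
```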

Endpoints

This tab lists the SageMaker endpoints that host deployed models for inference. When all the stages in the model deployment pipeline are complete, models are deployed to SageMaker endpoints and are accessible within the SageMaker project.

Settings

This is the last tab on the project page and lists settings for the project. This includes the name and description of the project, information about the project template and SourceModelPackageGroupName, and metadata about the project.

Clean up

To avoid additional infrastructure costs associated with the example in this post, be sure to delete the CloudFormation stacks. Also, ensure that you delete the SageMaker endpoints, any running notebooks, and the S3 buckets that were created during the setup.
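
Endpoint cleanup can also be scripted. The following is a minimal sketch that assumes the endpoints created by the project share a recognizable name fragment; the helper name `delete_project_endpoints` and the fragment in the usage comment are hypothetical. It reads only the first page of results, so accounts with many endpoints would need pagination.

```python
def delete_project_endpoints(sm_client, name_contains):
    """Delete every endpoint whose name contains the given fragment."""
    deleted = []
    response = sm_client.list_endpoints(NameContains=name_contains)
    for endpoint in response["Endpoints"]:
        sm_client.delete_endpoint(EndpointName=endpoint["EndpointName"])
        deleted.append(endpoint["EndpointName"])
    return deleted


# Usage (requires boto3 and AWS credentials; the fragment is hypothetical):
#   import boto3
#   delete_project_endpoints(boto3.client("sagemaker"), "my-project")
```

Review the list of matching endpoints before deleting anything, because the name filter may match endpoints from other projects.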

Conclusion

This post described an easy-to-use ML pipeline approach to automate and standardize the training and deployment of ML models using SageMaker Projects, Data Wrangler, Autopilot, Pipelines, and Studio. This solution can help you perform AutoML tasks (preprocessing, training, and postprocessing) in a standardized repository structure, giving your expert data scientists the ability to view, validate, and modify the workflow as needed and then generate a custom pipeline template that can be integrated into a SageMaker project.

You can modify the pipelines with your preprocessing and pipeline steps for your use case and deploy our end-to-end workflow. Let us know in the comments how the custom template worked for your respective use case.

About the authors

Vishal Naik is a Sr. Solutions Architect at Amazon Web Services (AWS). He is a builder who enjoys helping customers accomplish their business needs and solve complex challenges with AWS solutions and best practices. His core area of focus includes Machine Learning, DevOps, and Containers. In his spare time, Vishal loves making short films on time travel and alternate universe themes.

Shikhar Kwatra is an AI/ML specialist solutions architect at Amazon Web Services, working with a leading Global System Integrator. He has earned the title of one of the Youngest Indian Master Inventors with over 500 patents in the AI/ML and IoT domains. Shikhar aids in architecting, building, and maintaining cost-efficient, scalable cloud environments for the organization, and supports the GSI partner in building strategic industry solutions on AWS. Shikhar enjoys playing guitar, composing music, and practicing mindfulness in his spare time.

Janisha Anand is a Senior Product Manager in the SageMaker Low/No Code ML team, which includes SageMaker Canvas and SageMaker Autopilot. She enjoys coffee, staying active, and spending time with her family.

Reconstructing indoor spaces with NeRF

Marcos Seefelder, Software Engineer, and Daniel Duckworth, Research Software Engineer, Google Research

When choosing a venue, we often find ourselves with questions like the following: Does this restaurant have the right vibe for a date? Is there good outdoor seating? Are there enough screens to watch the game? While photos and videos may partially answer questions like these, they are no substitute for feeling like you’re there, even when visiting in person isn’t an option.

Immersive experiences that are interactive, photorealistic, and multi-dimensional stand to bridge this gap and recreate the feel and vibe of a space, empowering users to naturally and intuitively find the information they need. To help with this, Google Maps launched Immersive View, which uses advances in machine learning (ML) and computer vision to fuse billions of Street View and aerial images to create a rich, digital model of the world. Beyond that, it layers helpful information on top, like the weather, traffic, and how busy a place is. Immersive View provides indoor views of restaurants, cafes, and other venues to give users a virtual up-close look that can help them confidently decide where to go.

Today we describe the work put into delivering these indoor views in Immersive View. We build on neural radiance fields (NeRF), a state-of-the-art approach for fusing photos to produce a realistic, multi-dimensional reconstruction within a neural network. We describe our pipeline for creation of NeRFs, which includes custom photo capture of the space using DSLR cameras, image processing and scene reproduction. We take advantage of Alphabet’s recent advances in the field to design a method matching or outperforming the prior state-of-the-art in visual fidelity. These models are then embedded as interactive 360° videos following curated flight paths, enabling them to be available on smartphones.

The reconstruction of The Seafood Bar in Amsterdam in Immersive View.

From photos to NeRFs

At the core of our work is NeRF, a recently-developed method for 3D reconstruction and novel view synthesis. Given a collection of photos describing a scene, NeRF distills these photos into a neural field, which can then be used to render photos from viewpoints not present in the original collection.
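
To make the rendering step concrete, the following sketch composites samples along a single camera ray using the standard NeRF volume rendering quadrature. It is a simplified illustration, not the production renderer: the function name `composite_ray` is hypothetical, and in a real model the densities and colors come from the trained neural field rather than from hand-written lists.

```python
import math


def composite_ray(densities, colors, deltas):
    """Blend samples along one ray, ordered from nearest to farthest.

    densities: volume density at each sample
    colors:    RGB color at each sample
    deltas:    distance between consecutive samples
    """
    transmittance = 1.0  # fraction of light that has not been absorbed yet
    pixel = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
    return pixel


# Empty space followed by an opaque red surface renders as red.
print(composite_ray([0.0, 1e9], [[0, 1, 0], [1, 0, 0]], [0.5, 0.5]))
```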

While NeRF largely solves the challenge of reconstruction, a user-facing product based on real-world data brings a wide variety of challenges to the table. For example, reconstruction quality and user experience should remain consistent across venues, from dimly-lit bars to sidewalk cafes to hotel restaurants. At the same time, privacy should be respected and any potentially personally identifiable information should be removed. Importantly, scenes should be captured consistently and efficiently, reliably resulting in high-quality reconstructions while minimizing the effort needed to capture the necessary photographs. Finally, the same natural experience should be available to all mobile users, regardless of the device on hand.

The Immersive View indoor reconstruction pipeline.

Capture & preprocessing

The first step to producing a high-quality NeRF is the careful capture of a scene: a dense collection of photos from which 3D geometry and color can be derived. To obtain the best possible reconstruction quality, every surface should be observed from multiple different directions. The more information a model has about an object’s surface, the better it will be at discovering the object’s shape and the way it interacts with light.

In addition, NeRF models place further assumptions on the camera and the scene itself. For example, most of the camera’s properties, such as white balance and aperture, are assumed to be fixed throughout the capture. Likewise, the scene itself is assumed to be frozen in time: lighting changes and movement should be avoided. This must be balanced with practical concerns, including the time needed for the capture, available lighting, equipment weight, and privacy. In partnership with professional photographers, we developed a strategy for quickly and reliably capturing venue photos using DSLR cameras within a one-hour timeframe. This approach has been used for all of our NeRF reconstructions to date.

Once the capture is uploaded to our system, processing begins. As photos may inadvertently contain sensitive information, we automatically scan and blur personally identifiable content. We then apply a structure-from-motion pipeline to solve for each photo’s camera parameters: its position and orientation relative to other photos, along with lens properties like focal length. These parameters associate each pixel with a point and a direction in 3D space and constitute a key signal in the NeRF reconstruction process.
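
As an illustration of how camera parameters associate a pixel with a point and a direction, the sketch below builds a ray for an ideal pinhole camera. It is a simplified assumption rather than the pipeline's actual camera model: it ignores lens distortion, assumes camera axes of x right, y down, and z forward, and the function name `pixel_to_ray` is hypothetical.

```python
import math


def pixel_to_ray(px, py, width, height, focal, rotation, position):
    """Return (origin, unit direction) of the ray through pixel (px, py).

    focal is in pixels, rotation is a 3x3 camera-to-world matrix given as
    nested lists, and position is the camera center in world coordinates.
    """
    # Direction through the pixel center in camera coordinates.
    d_cam = [
        (px + 0.5 - width / 2) / focal,
        (py + 0.5 - height / 2) / focal,
        1.0,
    ]
    # Rotate the direction into world coordinates.
    d_world = [sum(rotation[r][c] * d_cam[c] for c in range(3)) for r in range(3)]
    norm = math.sqrt(sum(v * v for v in d_world))
    return list(position), [v / norm for v in d_world]


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
origin, direction = pixel_to_ray(0, 0, 1, 1, 1.0, identity, [0.0, 0.0, 0.0])
print(origin, direction)
```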

NeRF reconstruction

Unlike many ML models, a new NeRF model is trained from scratch on each captured location. To obtain the best possible reconstruction quality within a target compute budget, we incorporate features from a variety of published works on NeRF developed at Alphabet. Some of these include:

We build on mip-NeRF 360, one of the best-performing NeRF models to date. While it is more computationally intensive than Nvidia’s widely-used Instant NGP, we find that mip-NeRF 360 consistently produces fewer artifacts and higher reconstruction quality.
We incorporate the low-dimensional generative latent optimization (GLO) vectors introduced in NeRF in the Wild as an auxiliary input to the model’s radiance network. These are learned real-valued latent vectors that embed appearance information for each image. By assigning each image its own latent vector, the model can capture phenomena such as lighting changes without resorting to cloudy geometry, a common artifact in casual NeRF captures.
We also incorporate exposure conditioning as introduced in Block-NeRF. Unlike GLO vectors, which are uninterpretable model parameters, exposure is directly derived from a photo’s metadata and fed as an additional input to the model’s radiance network. This offers two major benefits: it opens up the possibility of varying ISO and provides a method for controlling an image’s brightness at inference time. We find both properties invaluable for capturing and reconstructing dimly-lit venues.

We train each NeRF model on TPU or GPU accelerators, which provide different trade-off points. As with all Google products, we continue to search for new ways to improve, from reducing compute requirements to improving reconstruction quality.

A side-by-side comparison of our method and a mip-NeRF 360 baseline.

A scalable user experience

Once a NeRF is trained, we have the ability to produce new photos of a scene from any viewpoint and camera lens we choose. Our goal is to deliver a meaningful and helpful user experience: not only the reconstructions themselves, but guided, interactive tours that give users the freedom to naturally explore spaces from the comfort of their smartphones.

To this end, we designed a controllable 360° video player that emulates flying through an indoor space along a predefined path, allowing the user to freely look around and travel forward or backwards. As the first Google product exploring this new technology, 360° videos were chosen as the format to deliver the generated content for a few reasons.

On the technical side, real-time inference and baked representations are still resource intensive on a per-client basis (either on device or cloud computed), and relying on them would limit the number of users able to access this experience. By using videos, we are able to scale the storage and delivery of videos to all users by taking advantage of the same video management and serving infrastructure used by YouTube. On the operations side, videos give us clearer editorial control over the exploration experience and are easier to inspect for quality in large volumes.

While we had considered capturing the space with a 360° camera directly, using a NeRF to reconstruct and render the space has several advantages. A virtual camera can fly anywhere in space, including over obstacles and through windows, and can use any desired camera lens. The camera path can also be edited post-hoc for smoothness and speed, unlike a live recording. A NeRF capture also does not require the use of specialized camera hardware.

Our 360° videos are rendered by ray casting through each pixel of a virtual, spherical camera and compositing the visible elements of the scene. Each video follows a smooth path defined by a sequence of keyframe photos taken by the photographer during capture. The position of the camera for each picture is computed during structure-from-motion, and the sequence of pictures is smoothly interpolated into a flight path.
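
The interpolation of keyframe positions into a flight path can be sketched as follows. The post does not state which interpolation scheme is used, so the uniform Catmull-Rom spline here is an assumption chosen because it passes through every keyframe; the function names are hypothetical, and camera orientation is left out for brevity.

```python
def catmull_rom(p0, p1, p2, p3, t):
    """Interpolate between p1 and p2 (t in [0, 1]) on a uniform Catmull-Rom spline."""
    t2, t3 = t * t, t * t * t
    return [
        0.5 * (
            2 * b
            + (-a + c) * t
            + (2 * a - 5 * b + 4 * c - d) * t2
            + (-a + 3 * b - 3 * c + d) * t3
        )
        for a, b, c, d in zip(p0, p1, p2, p3)
    ]


def flight_path(keyframes, samples_per_segment=10):
    """Return a smooth list of camera positions passing through each keyframe."""
    # Duplicate the endpoints so the first and last segments have neighbors.
    pts = [keyframes[0]] + list(keyframes) + [keyframes[-1]]
    path = []
    for i in range(1, len(pts) - 2):
        for s in range(samples_per_segment):
            t = s / samples_per_segment
            path.append(catmull_rom(pts[i - 1], pts[i], pts[i + 1], pts[i + 2], t))
    path.append(list(keyframes[-1]))
    return path


keys = [[0.0, 0.0, 0.0], [1.0, 0.5, 0.0], [2.0, 0.0, 0.0]]
print(len(flight_path(keys, samples_per_segment=4)))
```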

To keep speed consistent across different venues, we calibrate distances in each venue by capturing pairs of images taken 3 meters apart. Knowing these real-world measurements, we scale the generated model and render all videos at a natural velocity.
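
The calibration arithmetic can be sketched as follows: each pair of calibration photos has a known real-world separation, so dividing that distance by the separation of their reconstructed camera positions gives a scale factor from model units to meters. Averaging over pairs and the function name `metric_scale` are assumptions made for illustration.

```python
import math


def metric_scale(pairs, true_distance_m=3.0):
    """Estimate meters per model unit from pairs of reconstructed camera positions.

    pairs: list of (position_a, position_b) in model units, where each pair
           was captured true_distance_m apart in the real world.
    """
    ratios = [true_distance_m / math.dist(a, b) for a, b in pairs]
    return sum(ratios) / len(ratios)


# Two calibration pairs reconstructed 1.5 model units apart give a scale of 2.0.
pairs = [((0, 0, 0), (1.5, 0, 0)), ((2, 1, 0), (2, 2.5, 0))]
print(metric_scale(pairs))
```

With the scale known, a flight speed chosen in meters per second can be converted into model units per video frame, which keeps the apparent velocity the same across venues.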

The final experience is surfaced to the user within Immersive View: the user can seamlessly fly into restaurants and other indoor venues and discover the space by flying through the photorealistic 360° videos.

Open research questions

We believe that this feature is the first step of many in a journey towards universally accessible, AI-powered, immersive experiences. From a NeRF research perspective, many questions remain open. Some of these include:

Enhancing reconstructions with scene segmentation, adding semantic information to the scenes that could make scenes, for example, searchable and easier to navigate.
Adapting NeRF to outdoor photo collections, in addition to indoor. In doing so, we’d bring similar experiences to every corner of the world and change how users could experience the outdoor world.
Enabling real-time, interactive 3D exploration through neural-rendering on-device.

Reconstruction of an outdoor scene with a NeRF model trained on Street View panoramas.

As we continue to grow, we look forward to engaging with and contributing to the community to build the next generation of immersive experiences.

Acknowledgments

This work is a collaboration across multiple teams at Google. Contributors to the project include Jon Barron, Julius Beres, Daniel Duckworth, Roman Dudko, Magdalena Filak, Mike Harm, Peter Hedman, Claudio Martella, Ben Mildenhall, Cardin Moffett, Etienne Pot, Konstantinos Rematas, Yves Sallat, Marcos Seefelder, Lilyana Sirakovat, Sven Tresp and Peter Zhizhin.

Also, we’d like to extend our thanks to Luke Barrington, Daniel Filip, Tom Funkhouser, Charles Goran, Pramod Gupta, Mario Lučić, Isalo Montacute and Dan Thomasset for valuable feedback and suggestions.

8 ways Google Lens can help make your life easier

Lens makes it easy to search what you see and explore the world around you, including the new ability to search for skin conditions.

Forged in Flames: Startup Fuses Generative AI, Computer Vision to Fight Wildfires

When California skies turned orange in the wake of devastating wildfires, a startup fused computer vision and generative AI to fight back.

“With the 2020 wildfires, it became very personal, so we asked fire officials how we could help,” said Emrah Gultekin, the Turkish-born CEO of Chooch, a Silicon Valley-based leader in computer vision.

California utilities and fire services, they learned, were swamped with as many as 2,000 false positives a week from an existing wildfire detection system. The wrong predictions came from fog, rain and smudges on the lenses of a network of cameras they used.

So, in a pilot project, Chooch linked its fire detection software to the camera network. It analyzed snapshots every 15 minutes, seeking signs of smoke or fire.

Generative AI Sharpens Computer Vision

Then, the team led by Hakan Gultekin — Emrah’s brother, a software wiz and Chooch’s CTO — had an idea.

They built a generative AI tool that automatically created descriptions of each image, helping reviewers discern when smoke is present. False positives dropped from 2,000 a week to eight.

Chooch detects smoke and fire despite bad weather or dirty camera lenses.

“Fire chiefs were excited about launching the technology in their monitoring centers and what it could achieve,” said Michael Liou, the president of Chooch, who detailed the project in a recent webinar.

Chooch’s generative AI tool gives firefighters in California’s Kern County a dashboard on their smartphones and PCs, populated in real time with alerts, so they can detect wildfires fast.

In 2020, California experienced 9,900 wildfires that burned 4.3 million acres of forest and caused $19 billion in losses. Stopping one fire from spreading out of control would pay for the wildfire detection system for 50 years, the company estimates.

A Vision for Gen AI

Chooch’s CEO says it’s also the shape of things to come.

Emrah Gultekin, CEO of Chooch

“The fusion of large language models and computer vision will bring about even more powerful and accurate products that are easier to deploy,” said Gultekin.

For example, utilities can connect the software to drones and fixed cameras to detect corrosion on capacitors or vegetation encroaching on power lines.

The technology could see further validation as Chooch enters an $11 million Xprize challenge on detecting and fighting wildfires. Sponsors include PG&E and Lockheed Martin, which is building an AI lab to predict and respond to wildfires in a separate collaboration with NVIDIA.

Dashboards for PCs and smartphones can update firefighters with real-time alerts from Chooch’s software.

Chooch applies its technology to a host of challenges in manufacturing, retail and security.

For example, one manufacturer uses Chooch’s models to detect defects before products ship. Eliminating just 20% of the faults will pay for the system several times over.

Inception of a Partnership

Back in 2019, a potential customer in the U.S. government asked for support with edge deployments it planned on NVIDIA GPUs. Chooch joined NVIDIA Inception, a free program that nurtures cutting-edge startups.

Using NGC, NVIDIA’s hub for accelerated software, Hakan was able to port Chooch’s code to NVIDIA GPUs over a weekend. Now its products run on NVIDIA Jetson modules and “have been tested in the wild with full-motion video and multispectral data,” Emrah said.

Since then, the company has rolled out support for GPUs in data centers and beyond. For example, the wildfire use case runs on NVIDIA A100 Tensor Core GPUs in the cloud.

Along the way, Chooch embraced software like Triton Inference Server and the NVIDIA DeepStream software development kit.

“The combination of DeepStream and Triton increased our capacity 8x to run more video streams on more AI models — that’s a huge win,” Emrah said.

A Wide Horizon

Now Chooch is expanding its horizons.

The company is a member of the partner ecosystems for NVIDIA Metropolis for intelligent video analytics and NVIDIA Clara Guardian, edge AI software for smart hospitals. Chooch also works with NVIDIA’s retail and telco teams.

The software is opening new doors and expanding the use cases it can address.

“It’s hard work because there’s so much uncharted territory, but that’s also what makes it exciting,” Emrah said.

Learn more about generative AI for enterprises, and explore NVIDIA’s solutions for power grid modernization.

6 Gmail AI features to help save you time

These AI-powered Gmail features can make your email experience even faster, easier and more organized.

Google Cloud: Driving digital transformation

Google Cloud empowers organizations to digitally transform themselves into smarter businesses. It offers cloud computing, data analytics, and the latest artificial intelligence (AI) and machine learning tools.



Copyright © 2019-2025 Vedere AI. All Rights Reserved.