Reconstructing indoor spaces with NeRF

Reconstructing indoor spaces with NeRF

When choosing a venue, we often find ourselves with questions like the following: Does this restaurant have the right vibe for a date? Is there good outdoor seating? Are there enough screens to watch the game? While photos and videos may partially answer questions like these, they are no substitute for feeling like you’re there, even when visiting in person isn’t an option.

Immersive experiences that are interactive, photorealistic, and multi-dimensional stand to bridge this gap and recreate the feel and vibe of a space, empowering users to naturally and intuitively find the information they need. To help with this, Google Maps launched Immersive View, which uses advances in machine learning (ML) and computer vision to fuse billions of Street View and aerial images to create a rich, digital model of the world. Beyond that, it layers helpful information on top, like the weather, traffic, and how busy a place is. Immersive View provides indoor views of restaurants, cafes, and other venues to give users a virtual up-close look that can help them confidently decide where to go.

Today we describe the work put into delivering these indoor views in Immersive View. We build on neural radiance fields (NeRF), a state-of-the-art approach for fusing photos to produce a realistic, multi-dimensional reconstruction within a neural network. We describe our pipeline for creation of NeRFs, which includes custom photo capture of the space using DSLR cameras, image processing and scene reproduction. We take advantage of Alphabet’s recent advances in the field to design a method matching or outperforming the prior state-of-the-art in visual fidelity. These models are then embedded as interactive 360° videos following curated flight paths, enabling them to be available on smartphones.

The reconstruction of The Seafood Bar in Amsterdam in Immersive View.

From photos to NeRFs

At the core of our work is NeRF, a recently-developed method for 3D reconstruction and novel view synthesis. Given a collection of photos describing a scene, NeRF distills these photos into a neural field, which can then be used to render photos from viewpoints not present in the original collection.

While NeRF largely solves the challenge of reconstruction, a user-facing product based on real-world data brings a wide variety of challenges to the table. For example, reconstruction quality and user experience should remain consistent across venues, from dimly-lit bars to sidewalk cafes to hotel restaurants. At the same time, privacy should be respected and any potentially personally identifiable information should be removed. Importantly, scenes should be captured consistently and efficiently, reliably resulting in high-quality reconstructions while minimizing the effort needed to capture the necessary photographs. Finally, the same natural experience should be available to all mobile users, regardless of the device on hand.

The Immersive View indoor reconstruction pipeline.

Capture & preprocessing

The first step to producing a high-quality NeRF is the careful capture of a scene: a dense collection of photos from which 3D geometry and color can be derived. To obtain the best possible reconstruction quality, every surface should be observed from multiple different directions. The more information a model has about an object’s surface, the better it will be in discovering the object’s shape and the way it interacts with lights.

In addition, NeRF models place further assumptions on the camera and the scene itself. For example, most of the camera’s properties, such as white balance and aperture, are assumed to be fixed throughout the capture. Likewise, the scene itself is assumed to be frozen in time: lighting changes and movement should be avoided. This must be balanced with practical concerns, including the time needed for the capture, available lighting, equipment weight, and privacy. In partnership with professional photographers, we developed a strategy for quickly and reliably capturing venue photos using DSLR cameras within only an hour timeframe. This approach has been used for all of our NeRF reconstructions to date.

Once the capture is uploaded to our system, processing begins. As photos may inadvertently contain sensitive information, we automatically scan and blur personally identifiable content. We then apply a structure-from-motion pipeline to solve for each photo’s camera parameters: its position and orientation relative to other photos, along with lens properties like focal length. These parameters associate each pixel with a point and a direction in 3D space and constitute a key signal in the NeRF reconstruction process.

NeRF reconstruction

Unlike many ML models, a new NeRF model is trained from scratch on each captured location. To obtain the best possible reconstruction quality within a target compute budget, we incorporate features from a variety of published works on NeRF developed at Alphabet. Some of these include:

  • We build on mip-NeRF 360, one of the best-performing NeRF models to date. While more computationally intensive than Nvidia’s widely-used Instant NGP, we find the mip-NeRF 360 consistently produces fewer artifacts and higher reconstruction quality.
  • We incorporate the low-dimensional generative latent optimization (GLO) vectors introduced in NeRF in the Wild as an auxiliary input to the model’s radiance network. These are learned real-valued latent vectors that embed appearance information for each image. By assigning each image in its own latent vector, the model can capture phenomena such as lighting changes without resorting to cloudy geometry, a common artifact in casual NeRF captures.
  • We also incorporate exposure conditioning as introduced in Block-NeRF. Unlike GLO vectors, which are uninterpretable model parameters, exposure is directly derived from a photo’s metadata and fed as an additional input to the model’s radiance network. This offers two major benefits: it opens up the possibility of varying ISO and provides a method for controlling an image’s brightness at inference time. We find both properties invaluable for capturing and reconstructing dimly-lit venues.

We train each NeRF model on TPU or GPU accelerators, which provide different trade-off points. As with all Google products, we continue to search for new ways to improve, from reducing compute requirements to improving reconstruction quality.

A side-by-side comparison of our method and a mip-NeRF 360 baseline.

A scalable user experience

Once a NeRF is trained, we have the ability to produce new photos of a scene from any viewpoint and camera lens we choose. Our goal is to deliver a meaningful and helpful user experience: not only the reconstructions themselves, but guided, interactive tours that give users the freedom to naturally explore spaces from the comfort of their smartphones.

To this end, we designed a controllable 360° video player that emulates flying through an indoor space along a predefined path, allowing the user to freely look around and travel forward or backwards. As the first Google product exploring this new technology, 360° videos were chosen as the format to deliver the generated content for a few reasons.

On the technical side, real-time inference and baked representations are still resource intensive on a per-client basis (either on device or cloud computed), and relying on them would limit the number of users able to access this experience. By using videos, we are able to scale the storage and delivery of videos to all users by taking advantage of the same video management and serving infrastructure used by YouTube. On the operations side, videos give us clearer editorial control over the exploration experience and are easier to inspect for quality in large volumes.

While we had considered capturing the space with a 360° camera directly, using a NeRF to reconstruct and render the space has several advantages. A virtual camera can fly anywhere in space, including over obstacles and through windows, and can use any desired camera lens. The camera path can also be edited post-hoc for smoothness and speed, unlike a live recording. A NeRF capture also does not require the use of specialized camera hardware.

Our 360° videos are rendered by ray casting through each pixel of a virtual, spherical camera and compositing the visible elements of the scene. Each video follows a smooth path defined by a sequence of keyframe photos taken by the photographer during capture. The position of the camera for each picture is computed during structure-from-motion, and the sequence of pictures is smoothly interpolated into a flight path.

To keep speed consistent across different venues, we calibrate the distances for each by capturing pairs of images, each of which is 3 meters apart. By knowing measurements in the space, we scale the generated model, and render all videos at a natural velocity.

The final experience is surfaced to the user within Immersive View: the user can seamlessly fly into restaurants and other indoor venues and discover the space by flying through the photorealistic 360° videos.

Open research questions

We believe that this feature is the first step of many in a journey towards universally accessible, AI-powered, immersive experiences. From a NeRF research perspective, more questions remain open. Some of these include:

  1. Enhancing reconstructions with scene segmentation, adding semantic information to the scenes that could make scenes, for example, searchable and easier to navigate.
  2. Adapting NeRF to outdoor photo collections, in addition to indoor. In doing so, we’d unlock similar experiences to every corner of the world and change how users could experience the outdoor world.
  3. Enabling real-time, interactive 3D exploration through neural-rendering on-device.

Reconstruction of an outdoor scene with a NeRF model trained on Street View panoramas.

As we continue to grow, we look forward to engaging with and contributing to the community to build the next generation of immersive experiences.

Acknowledgments

This work is a collaboration across multiple teams at Google. Contributors to the project include Jon Barron, Julius Beres, Daniel Duckworth, Roman Dudko, Magdalena Filak, Mike Harm, Peter Hedman, Claudio Martella, Ben Mildenhall, Cardin Moffett, Etienne Pot, Konstantinos Rematas, Yves Sallat, Marcos Seefelder, Lilyana Sirakovat, Sven Tresp and Peter Zhizhin.

Also, we’d like to extend our thanks to Luke Barrington, Daniel Filip, Tom Funkhouser, Charles Goran, Pramod Gupta, Mario Lučić, Isalo Montacute and Dan Thomasset for valuable feedback and suggestions.

Read More

Enabling delightful user experiences via predictive models of human attention

Enabling delightful user experiences via predictive models of human attention

People have the remarkable ability to take in a tremendous amount of information (estimated to be ~1010 bits/s entering the retina) and selectively attend to a few task-relevant and interesting regions for further processing (e.g., memory, comprehension, action). Modeling human attention (the result of which is often called a saliency model) has therefore been of interest across the fields of neuroscience, psychology, human-computer interaction (HCI) and computer vision. The ability to predict which regions are likely to attract attention has numerous important applications in areas like graphics, photography, image compression and processing, and the measurement of visual quality.

We’ve previously discussed the possibility of accelerating eye movement research using machine learning and smartphone-based gaze estimation, which earlier required specialized hardware costing up to $30,000 per unit. Related research includes “Look to Speak”, which helps users with accessibility needs (e.g., people with ALS) to communicate with their eyes, and the recently published “Differentially private heatmaps” technique to compute heatmaps, like those for attention, while protecting users’ privacy.

In this blog, we present two papers (one from CVPR 2022, and one just accepted to CVPR 2023) that highlight our recent research in the area of human attention modeling: “Deep Saliency Prior for Reducing Visual Distraction” and “Learning from Unique Perspectives: User-aware Saliency Modeling”, together with recent research on saliency driven progressive loading for image compression (1, 2). We showcase how predictive models of human attention can enable delightful user experiences such as image editing to minimize visual clutter, distraction or artifacts, image compression for faster loading of webpages or apps, and guiding ML models towards more intuitive human-like interpretation and model performance. We focus on image editing and image compression, and discuss recent advances in modeling in the context of these applications.

Attention-guided image editing

Human attention models usually take an image as input (e.g., a natural image or a screenshot of a webpage), and predict a heatmap as output. The predicted heatmap on the image is evaluated against ground-truth attention data, which are typically collected by an eye tracker or approximated via mouse hovering/clicking. Previous models leveraged handcrafted features for visual clues, like color/brightness contrast, edges, and shape, while more recent approaches automatically learn discriminative features based on deep neural networks, from convolutional and recurrent neural networks to more recent vision transformer networks.

In “Deep Saliency Prior for Reducing Visual Distraction” (more information on this project site), we leverage deep saliency models for dramatic yet visually realistic edits, which can significantly change an observer’s attention to different image regions. For example, removing distracting objects in the background can reduce clutter in photos, leading to increased user satisfaction. Similarly, in video conferencing, reducing clutter in the background may increase focus on the main speaker (example demo here).

To explore what types of editing effects can be achieved and how these affect viewers’ attention, we developed an optimization framework for guiding visual attention in images using a differentiable, predictive saliency model. Our method employs a state-of-the-art deep saliency model. Given an input image and a binary mask representing the distractor regions, pixels within the mask will be edited under the guidance of the predictive saliency model such that the saliency within the masked region is reduced. To make sure the edited image is natural and realistic, we carefully choose four image editing operators: two standard image editing operations, namely recolorization and image warping (shift); and two learned operators (we do not define the editing operation explicitly), namely a multi-layer convolution filter, and a generative model (GAN).

With those operators, our framework can produce a variety of powerful effects, with examples in the figure below, including recoloring, inpainting, camouflage, object editing or insertion, and facial attribute editing. Importantly, all these effects are driven solely by the single, pre-trained saliency model, without any additional supervision or training. Note that our goal is not to compete with dedicated methods for producing each effect, but rather to demonstrate how multiple editing operations can be guided by the knowledge embedded within deep saliency models.

Examples of reducing visual distractions, guided by the saliency model with several operators. The distractor region is marked on top of the saliency map (red border) in each example.

Enriching experiences with user-aware saliency modeling

Prior research assumes a single saliency model for the whole population. However, human attention varies between individuals — while the detection of salient clues is fairly consistent, their order, interpretation, and gaze distributions can differ substantially. This offers opportunities to create personalized user experiences for individuals or groups. In “Learning from Unique Perspectives: User-aware Saliency Modeling”, we introduce a user-aware saliency model, the first that can predict attention for one user, a group of users, and the general population, with a single model.

As shown in the figure below, core to the model is the combination of each participant’s visual preferences with a per-user attention map and adaptive user masks. This requires per-user attention annotations to be available in the training data, e.g., the OSIE mobile gaze dataset for natural images; FiWI and WebSaliency datasets for web pages. Instead of predicting a single saliency map representing attention of all users, this model predicts per-user attention maps to encode individuals’ attention patterns. Further, the model adopts a user mask (a binary vector with the size equal to the number of participants) to indicate the presence of participants in the current sample, which makes it possible to select a group of participants and combine their preferences into a single heatmap.

An overview of the user aware saliency model framework. The example image is from OSIE image set.

During inference, the user mask allows making predictions for any combination of participants. In the following figure, the first two rows are attention predictions for two different groups of participants (with three people in each group) on an image. A conventional attention prediction model will predict identical attention heatmaps. Our model can distinguish the two groups (e.g., the second group pays less attention to the face and more attention to the food than the first). Similarly, the last two rows are predictions on a webpage for two distinctive participants, with our model showing different preferences (e.g., the second participant pays more attention to the left region than the first).

Predicted attention vs. ground truth (GT). EML-Net: predictions from a state-of-the-art model, which will have the same predictions for the two participants/groups. Ours: predictions from our proposed user aware saliency model, which can predict the unique preference of each participant/group correctly. The first image is from OSIE image set, and the second is from FiWI.

Progressive image decoding centered on salient features

Besides image editing, human attention models can also improve users’ browsing experience. One of the most frustrating and annoying user experiences while browsing is waiting for web pages with images to load, especially in conditions with low network connectivity. One way to improve the user experience in such cases is with progressive decoding of images, which decodes and displays increasingly higher-resolution image sections as data are downloaded, until the full-resolution image is ready. Progressive decoding usually proceeds in a sequential order (e.g., left to right, top to bottom). With a predictive attention model (1, 2), we can instead decode images based on saliency, making it possible to send the data necessary to display details of the most salient regions first. For example, in a portrait, bytes for the face can be prioritized over those for the out-of-focus background. Consequently, users perceive better image quality earlier and experience significantly reduced wait times. More details can be found in our open source blog posts (post 1, post 2). Thus, predictive attention models can help with image compression and faster loading of web pages with images, improve rendering for large images and streaming/VR applications.

Conclusion

We’ve shown how predictive models of human attention can enable delightful user experiences via applications such as image editing that can reduce clutter, distractions or artifacts in images or photos for users, and progressive image decoding that can greatly reduce the perceived waiting time for users while images are fully rendered. Our user-aware saliency model can further personalize the above applications for individual users or groups, enabling richer and more unique experiences.

Another interesting direction for predictive attention models is whether they can help improve robustness of computer vision models in tasks such as object classification or detection. For example, in “Teacher-generated spatial-attention labels boost robustness and accuracy of contrastive models”, we show that a predictive human attention model can guide contrastive learning models to achieve better representation and improve the accuracy/robustness of classification tasks (on the ImageNet and ImageNet-C datasets). Further research in this direction could enable applications such as using radiologist’s attention on medical images to improve health screening or diagnosis, or using human attention in complex driving scenarios to guide autonomous driving systems.

Acknowledgements

This work involved collaborative efforts from a multidisciplinary team of software engineers, researchers, and cross-functional contributors. We’d like to thank all the co-authors of the papers/research, including Kfir Aberman, Gamaleldin F. Elsayed, Moritz Firsching, Shi Chen, Nachiappan Valliappan, Yushi Yao, Chang Ye, Yossi Gandelsman, Inbar Mosseri, David E. Jacobes, Yael Pritch, Shaolei Shen, and Xinyu Ye. We also want to thank team members Oscar Ramirez, Venky Ramachandran and Tim Fujita for their help. Finally, we thank Vidhya Navalpakkam for her technical leadership in initiating and overseeing this body of work.

Read More

Imagen Editor and EditBench: Advancing and evaluating text-guided image inpainting

Imagen Editor and EditBench: Advancing and evaluating text-guided image inpainting

In the last few years, text-to-image generation research has seen an explosion of breakthroughs (notably, Imagen, Parti, DALL-E 2, etc.) that have naturally permeated into related topics. In particular, text-guided image editing (TGIE) is a practical task that involves editing generated and photographed visuals rather than completely redoing them. Quick, automated, and controllable editing is a convenient solution when recreating visuals would be time-consuming or infeasible (e.g., tweaking objects in vacation photos or perfecting fine-grained details on a cute pup generated from scratch). Further, TGIE represents a substantial opportunity to improve training of foundational models themselves. Multimodal models require diverse data to train properly, and TGIE editing can enable the generation and recombination of high-quality and scalable synthetic data that, perhaps most importantly, can provide methods to optimize the distribution of training data along any given axis.

In “Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting”, to be presented at CVPR 2023, we introduce Imagen Editor, a state-of-the-art solution for the task of masked inpainting — i.e., when a user provides text instructions alongside an overlay or “mask” (usually generated within a drawing-type interface) indicating the area of the image they would like to modify. We also introduce EditBench, a method that gauges the quality of image editing models. EditBench goes beyond the commonly used coarse-grained “does this image match this text” methods, and drills down to various types of attributes, objects, and scenes for a more fine-grained understanding of model performance. In particular, it puts strong emphasis on the faithfulness of image-text alignment without losing sight of image quality.

Given an image, a user-defined mask, and a text prompt, Imagen Editor makes localized edits to the designated areas. The model meaningfully incorporates the user’s intent and performs photorealistic edits.

Imagen Editor

Imagen Editor is a diffusion-based model fine-tuned on Imagen for editing. It targets improved representations of linguistic inputs, fine-grained control and high-fidelity outputs. Imagen Editor takes three inputs from the user: 1) the image to be edited, 2) a binary mask to specify the edit region, and 3) a text prompt — all three inputs guide the output samples.

Imagen Editor depends on three core techniques for high-quality text-guided image inpainting. First, unlike prior inpainting models (e.g., Palette, Context Attention, Gated Convolution) that apply random box and stroke masks, Imagen Editor employs an object detector masking policy with an object detector module that produces object masks during training. Object masks are based on detected objects rather than random patches and allow for more principled alignment between edit text prompts and masked regions. Empirically, the method helps the model stave off the prevalent issue of the text prompt being ignored when masked regions are small or only partially cover an object (e.g., CogView2).

Random masks (left) frequently capture background or intersect object boundaries, defining regions that can be plausibly inpainted just from image context alone. Object masks (right) are harder to inpaint from image context alone, encouraging models to rely more on text inputs during training.

Next, during training and inference, Imagen Editor enhances high resolution editing by conditioning on full resolution (1024×1024 in this work), channel-wise concatenation of the input image and the mask (similar to SR3, Palette, and GLIDE). For the base diffusion 64×64 model and the 64×64→256×256 super-resolution models, we apply a parameterized downsampling convolution (e.g., convolution with a stride), which we empirically find to be critical for high fidelity.

Imagen is fine-tuned for image editing. All of the diffusion models, i.e., the base model and super-resolution (SR) models, are conditioned on high-resolution 1024×1024 image and mask inputs. To this end, new convolutional image encoders are introduced.

Finally, at inference we apply classifier-free guidance (CFG) to bias samples to a particular conditioning, in this case, text prompts. CFG interpolates between the text-conditioned and unconditioned model predictions to ensure strong alignment between the generated image and the input text prompt for text-guided image inpainting. We follow Imagen Video and use high guidance weights with guidance oscillation (a guidance schedule that oscillates within a value range of guidance weights). In the base model (the stage-1 64x diffusion), where ensuring strong alignment with text is most critical, we use a guidance weight schedule that oscillates between 1 and 30. We observe that high guidance weights combined with oscillating guidance result in the best trade-off between sample fidelity and text-image alignment.

EditBench

The EditBench dataset for text-guided image inpainting evaluation contains 240 images, with 120 generated and 120 natural images. Generated images are synthesized by Parti and natural images are drawn from the Visual Genome and Open Images datasets. EditBench captures a wide variety of language, image types, and levels of text prompt specificity (i.e., simple, rich, and full captions). Each example consists of (1) a masked input image, (2) an input text prompt, and (3) a high-quality output image used as reference for automatic metrics. To provide insight into the relative strengths and weaknesses of different models, EditBench prompts are designed to test fine-grained details along three categories: (1) attributes (e.g., material, color, shape, size, count); (2) object types (e.g., common, rare, text rendering); and (3) scenes (e.g., indoor, outdoor, realistic, or paintings). To understand how different specifications of prompts affect model performance, we provide three text prompt types: a single-attribute (Mask Simple) or a multi-attribute description of the masked object (Mask Rich) – or an entire image description (Full Image). Mask Rich, especially, probes the models’ ability to handle complex attribute binding and inclusion.

The full image is used as a reference for successful inpainting. The mask covers the target object with a free-form, non-hinting shape. We evaluate Mask Simple, Mask Rich and Full Image prompts, consistent with conventional text-to-image models.

Due to the intrinsic weaknesses in existing automatic evaluation metrics (CLIPScore and CLIP-R-Precision) for TGIE, we hold human evaluation as the gold standard for EditBench. In the section below, we demonstrate how EditBench is applied to model evaluation.

Evaluation

We evaluate the Imagen Editor model — with object masking (IM) and with random masking (IM-RM) — against comparable models, Stable Diffusion (SD) and DALL-E 2 (DL2). Imagen Editor outperforms these models by substantial margins across all EditBench evaluation categories.

For Full Image prompts, single-image human evaluation provides binary answers to confirm if the image matches the caption. For Mask Simple prompts, single-image human evaluation confirms if the object and attribute are properly rendered, and bound correctly (e.g., for a red cat, a white cat on a red table would be an incorrect binding). Side-by-side human evaluation uses Mask Rich prompts only for side-by-side comparisons between IM and each of the other three models (IM-RM, DL2, and SD), and indicates which image matches with the caption better for text-image alignment, and which image is most realistic.

Human evaluation. Full Image prompts elicit annotators’ overall impression of text-image alignment; Mask Simple and Mask Rich check for the correct inclusion of particular attributes, objects and attribute binding.

For single-image human evaluation, IM receives the highest ratings across-the-board (10–13% higher than the 2nd-highest performing model). For the rest, the performance order is IM-RM > DL2 > SD (with 3–6% difference) except for with Mask Simple, where IM-RM falls 4-8% behind. As relatively more semantic content is involved in Full and Mask Rich, we conjecture IM-RM and IM are benefited by the higher performing T5 XXL text encoder.

Single-image human evaluations of text-guided image inpainting on EditBench by prompt type. For Mask Simple and Mask Rich prompts, text-image alignment is correct if the edited image accurately includes every attribute and object specified in the prompt, including the correct attribute binding. Note that due to different evaluation designs, Full vs. Mask-only prompts, results are less directly comparable.

EditBench focuses on fine-grained annotation, so we evaluate models for object and attribute types. For object types, IM leads in all categories, performing 10–11% better than the 2nd-highest performing model in common, rare, and text-rendering.

Single-image human evaluations on EditBench Mask Simple by object type. As a cohort, models are better at object rendering than text-rendering.

For attribute types, IM is rated much higher (13–16%) than the 2nd highest performing model, except for in count, where DL2 is merely 1% behind.

Single-image human evaluations on EditBench Mask Simple by attribute type. Object masking improves adherence to prompt attributes across-the-board (IM vs. IM-RM).

Side-by-side compared with other models one-vs-one, IM leads in text alignment with a substantial margin, being preferred by annotators compared to SD, DL2, and IM-RM.

Side-by-side human evaluation of image realism & text-image alignment on EditBench Mask Rich prompts. For text-image alignment, Imagen Editor is preferred in all comparisons.

Finally, we illustrate a representative side-by-side comparative for all the models. See the paper for more examples.

Example model outputs for Mask Simple vs. Mask Rich prompts. Object masking improves Imagen Editor’s fine-grained adherence to the prompt compared to the same model trained with random masking.

Conclusion

We presented Imagen Editor and EditBench, making significant advancements in text-guided image inpainting and the evaluation thereof. Imagen Editor is a text-guided image inpainting fine-tuned from Imagen. EditBench is a comprehensive systematic benchmark for text-guided image inpainting, evaluating performance across multiple dimensions: attributes, objects, and scenes. Note that due to concerns in relation to responsible AI, we are not releasing Imagen Editor to the public. EditBench on the other hand is released in full for the benefit of the research community.

Acknowledgments

Thanks to Gunjan Baid, Nicole Brichtova, Sara Mahdavi, Kathy Meier-Hellstern, Zarana Parekh, Anusha Ramesh, Tris Warkentin, Austin Waters, and Vijay Vasudevan for their generous support. We give thanks to Igor Karpov, Isabel Kraus-Liang, Raghava Ram Pamidigantam, Mahesh Maddinala, and all the anonymous human annotators for their coordination to complete the human evaluation tasks. We are grateful to Huiwen Chang, Austin Tarango, and Douglas Eck for providing paper feedback. Thanks to Erica Moreira and Victor Gomes for help with resource coordination. Finally, thanks to the authors of DALL-E 2 for giving us permission to use their model outputs for research purposes.

Read More

Evaluating speech synthesis in many languages with SQuId

Evaluating speech synthesis in many languages with SQuId

Previously, we presented the 1,000 languages initiative and the Universal Speech Model with the goal of making speech and language technologies available to billions of users around the world. Part of this commitment involves developing high-quality speech synthesis technologies, which build upon projects such as VDTTS and AudioLM, for users that speak many different languages.

After developing a new model, one must evaluate whether the speech it generates is accurate and natural: the content must be relevant to the task, the pronunciation correct, the tone appropriate, and there should be no acoustic artifacts such as cracks or signal-correlated noise. Such evaluation is a major bottleneck in the development of multilingual speech systems.

The most popular method to evaluate the quality of speech synthesis models is human evaluation: a text-to-speech (TTS) engineer produces a few thousand utterances from the latest model, sends them for human evaluation, and receives results a few days later. This evaluation phase typically involves listening tests, during which dozens of annotators listen to the utterances one after the other to determine how natural they sound. While humans are still unbeaten at detecting whether a piece of text sounds natural, this process can be impractical — especially in the early stages of research projects, when engineers need rapid feedback to test and restrategize their approach. Human evaluation is expensive, time consuming, and may be limited by the availability of raters for the languages of interest.

Another barrier to progress is that different projects and institutions typically use various ratings, platforms and protocols, which makes apples-to-apples comparisons impossible. In this regard, speech synthesis technologies lag behind text generation, where researchers have long complemented human evaluation with automatic metrics such as BLEU or, more recently, BLEURT.

In “SQuId: Measuring Speech Naturalness in Many Languages“, to be presented at ICASSP 2023, we introduce SQuId (Speech Quality Identification), a 600M parameter regression model that describes to what extent a piece of speech sounds natural. SQuId is based on mSLAM (a pre-trained speech-text model developed by Google), fine-tuned on over a million quality ratings across 42 languages and tested in 65. We demonstrate how SQuId can be used to complement human ratings for evaluation of many languages. This is the largest published effort of this type to date.

Evaluating TTS with SQuId

The main hypothesis behind SQuId is that training a regression model on previously collected ratings can provide us with a low-cost method for assessing the quality of a TTS model. The model can therefore be a valuable addition to a TTS researcher’s evaluation toolbox, providing a near-instant, albeit less accurate alternative to human evaluation.

SQuId takes an utterance as input and an optional locale tag (i.e., a localized variant of a language, such as “Brazilian Portuguese” or “British English”). It returns a score between 1 and 5 that indicates how natural the waveform sounds, with a higher value indicating a more natural waveform.

Internally, the model includes three components: (1) an encoder, (2) a pooling / regression layer, and (3) a fully connected layer. First, the encoder takes a spectrogram as input and embeds it into a smaller 2D matrix that contains 3,200 vectors of size 1,024, where each vector encodes a time step. The pooling / regression layer aggregates the vectors, appends the locale tag, and feeds the result into a fully connected layer that returns a score. Finally, we apply application-specific post-processing that rescales or normalizes the score so it is within the [1, 5] range, which is common for naturalness human ratings. We train the whole model end-to-end with a regression loss.

The encoder is by far the largest and most important piece of the model. We used mSLAM, a pre-existing 600M-parameter Conformer pre-trained on both speech (51 languages) and text (101 languages).

The SQuId model.

To train and evaluate the model, we created the SQuId corpus: a collection of 1.9 million rated utterances across 66 languages, collected for over 2,000 research and product TTS projects. The SQuId corpus covers a diverse array of systems, including concatenative and neural models, for a broad range of use cases, such as driving directions and virtual assistants. Manual inspection reveals that SQuId is exposed to a vast range of of TTS errors, such as acoustic artifacts (e.g., cracks and pops), incorrect prosody (e.g., questions without rising intonations in English), text normalization errors (e.g., verbalizing “7/7” as “seven divided by seven” rather than “July seventh”), or pronunciation mistakes (e.g., verbalizing “tough” as “toe”).

A common issue that arises when training multilingual systems is that the training data may not be uniformly available for all the languages of interest. SQuId was no exception. The following figure illustrates the size of the corpus for each locale. We see that the distribution is largely dominated by US English.

Locale distribution in the SQuId dataset.

How can we provide good performance for all languages when there are such variations? Inspired by previous work on machine translation, as well as past work from the speech literature, we decided to train one model for all languages, rather than using separate models for each language. The hypothesis is that if the model is large enough, then cross-locale transfer can occur: the model’s accuracy on each locale improves as a result of jointly training on the others. As our experiments show, cross-locale proves to be a powerful driver of performance.

Experimental results

To understand SQuId’s overall performance, we compare it to a custom Big-SSL-MOS model (described in the paper), a competitive baseline inspired by MOS-SSL, a state-of-the-art TTS evaluation system. Big-SSL-MOS is based on w2v-BERT and was trained on the VoiceMOS’22 Challenge dataset, the most popular dataset at the time of evaluation. We experimented with several variants of the model, and found that SQuId is up to 50.0% more accurate.

SQuId versus state-of-the-art baselines. We measure agreement with human ratings using the Kendall Tau, where a higher value represents better accuracy.

To understand the impact of cross-locale transfer, we run a series of ablation studies. We vary the amount of locales introduced in the training set and measure the effect on SQuId’s accuracy. In English, which is already over-represented in the dataset, the effect of adding locales is negligible.

SQuId’s performance on US English, using 1, 8, and 42 locales during fine-tuning.

However, cross-locale transfer is much more effective for most other locales:

SQuId’s performance on four selected locales (Korean, French, Thai, and Tamil), using 1, 8, and 42 locales during fine-tuning. For each locale, we also provide the training set size.

To push transfer to its limit, we held 24 locales out during training and used them for testing exclusively. Thus, we measure to what extent SQuId can deal with languages that it has never seen before. The plot below shows that although the effect is not uniform, cross-locale transfer works.

SQuId’s performance on four “zero-shot” locales; using 1, 8, and 42 locales during fine-tuning.

When does cross-locale operate, and how? We present many more ablations in the paper, and show that while language similarity plays a role (e.g., training on Brazilian Portuguese helps European Portuguese) it is surprisingly far from being the only factor that matters.

Conclusion and future work

We introduce SQuId, a 600M parameter regression model that leverages the SQuId dataset and cross-locale learning to evaluate speech quality and describe how natural it sounds. We demonstrate that SQuId can complement human raters in the evaluation of many languages. Future work includes accuracy improvements, expanding the range of languages covered, and tackling new error types.

Acknowledgements

The author of this post is now part of Google DeepMind. Many thanks to all authors of the paper: Ankur Bapna, Joshua Camp, Diana Mackinnon, Ankur P. Parikh, and Jason Riesa.

Read More