Posted by Ruofei Du, Yinda Zhang, Ahmed Sabie, Jason Mayes, Google.
A depth map is essentially an image (or image channel) that stores, for every pixel, the distance from a given viewpoint (in this case, the camera itself) to the surface of the object visible at that pixel. Depth maps are a fundamental building block for a variety of computer graphics and computer vision applications, such as augmented reality, portrait mode, and 3D reconstruction. Despite recent advances in depth sensing capabilities with the ARCore Depth API, the majority of photographs on the web still lack associated depth maps. At the same time, the web community has expressed growing interest in depth capabilities within JavaScript to enhance existing web apps, for example to bring images to life, apply real-time AR effects to a human face and body, or even reconstruct items for use in VR environments. Together, these helped shape the path for what you see today.
Today we are introducing the Depth API, the first depth estimation API from TensorFlow.js. With this new API, we are also introducing the first depth model for portraits, ARPortraitDepth, which estimates a depth map from a single portrait image. To demonstrate one of many possible uses of depth information, we also present a computational photography application, 3D photo, which uses the predicted depth to enable a 3D parallax effect on the given portrait image. Try the live demo below: anyone can easily make their social media profile photo 3D, as shown in the examples below.
Try out the 3D portrait demo for yourself!
Examples generated from the 3D photo application.
ARPortraitDepth: Single Image Depth Estimation
At the core of the Portrait Depth API is a deep learning model, named ARPortraitDepth, that takes a single color portrait image as input and produces a depth map. For the sake of computational efficiency, we adopt a lightweight U-Net architecture. As shown below, the encoder halves the resolution of the image or feature map at each stage, and the decoder progressively increases the feature resolution until it matches the input. Features from the encoder are concatenated to the decoder layers with the same spatial resolution, which brings high-resolution signals into the depth estimation. During training, we force the decoder to produce depth predictions at increasing resolutions at each layer, and add a loss against the ground truth for each of them. Empirically, this helps the decoder predict accurate depth by gradually adding details.
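The multi-resolution supervision described above can be illustrated with a minimal sketch in plain JavaScript. It assumes an L1 (mean absolute error) loss and 2x2 average pooling to build the ground-truth pyramid; the post does not say which loss or downsampling was actually used, and the function names are hypothetical.

```javascript
// Downsample a 2D depth map by 2x with 2x2 average pooling (an assumption).
function downsample2x(depth) {
  const h = Math.floor(depth.length / 2);
  const w = Math.floor(depth[0].length / 2);
  const out = [];
  for (let y = 0; y < h; y++) {
    const row = [];
    for (let x = 0; x < w; x++) {
      const sum = depth[2 * y][2 * x] + depth[2 * y][2 * x + 1] +
                  depth[2 * y + 1][2 * x] + depth[2 * y + 1][2 * x + 1];
      row.push(sum / 4);
    }
    out.push(row);
  }
  return out;
}

// Mean absolute error between two 2D maps of the same size.
function l1Loss(pred, target) {
  let sum = 0;
  let count = 0;
  for (let y = 0; y < target.length; y++) {
    for (let x = 0; x < target[0].length; x++) {
      sum += Math.abs(pred[y][x] - target[y][x]);
      count++;
    }
  }
  return sum / count;
}

// predictions: decoder outputs ordered from coarsest to finest, where each
// level has twice the resolution of the previous one and the finest level
// matches groundTruth. Returns the sum of the per-level losses.
function multiScaleLoss(predictions, groundTruth) {
  const targets = [groundTruth];
  while (targets.length < predictions.length) {
    targets.push(downsample2x(targets[targets.length - 1]));
  }
  targets.reverse(); // now coarsest first, matching predictions
  let total = 0;
  predictions.forEach((pred, i) => {
    total += l1Loss(pred, targets[i]);
  });
  return total;
}

// Usage: a 1x1 coarse prediction and a 2x2 fine prediction.
const groundTruth = [[0, 1], [1, 0]];
const predictions = [[[1]], [[0, 0], [0, 0]]];
console.log(multiScaleLoss(predictions, groundTruth)); // 1 (0.5 coarse + 0.5 fine)
```

Summing the per-level losses with equal weight is also a simplification; a real training setup might weight the levels differently.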
Abundant and diverse training data is critical for a machine learning model to achieve good overall performance, in particular accuracy and robustness. We synthetically render pairs of color and depth images under various camera configurations (such as focal length and camera pose) from 3D digital humans captured by a high quality performance capture system. We then run relighting augmentation with High Dynamic Range environment illumination maps to increase the realism and diversity of the color images, for example by adding shadows on the face. We also collect real data using mobile phones equipped with a front-facing depth sensor, such as the Google Pixel 4. The depth from these devices, used as training ground truth, is not as accurate or complete as that in our synthetic data, but the color images are effective in improving the performance of our model on images in the wild.
Single image depth estimation pipeline.
In practice, to improve robustness against background variation, we run an off-the-shelf body segmentation model from MediaPipe and TensorFlow.js before sending the image to the depth estimation network.
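As a rough illustration of this pre-processing step, the sketch below blanks out background pixels given a per-pixel foreground mask. The function name, the mask layout (one value per pixel, 1 for person and 0 for background), and the choice of transparent black for the background are all assumptions made for illustration; the post does not describe how the actual pipeline applies the mask.

```javascript
// Keep foreground pixels and clear background pixels in RGBA data.
// rgba: Uint8ClampedArray of length width * height * 4 (like ImageData.data).
// mask: array of length width * height; truthy = person, falsy = background.
function applyForegroundMask(rgba, mask) {
  if (rgba.length !== mask.length * 4) {
    throw new Error('mask must have one entry per RGBA pixel');
  }
  // A new typed array starts as all zeros, i.e. transparent black.
  const out = new Uint8ClampedArray(rgba.length);
  for (let i = 0; i < mask.length; i++) {
    if (mask[i]) {
      out[4 * i] = rgba[4 * i];         // red
      out[4 * i + 1] = rgba[4 * i + 1]; // green
      out[4 * i + 2] = rgba[4 * i + 2]; // blue
      out[4 * i + 3] = rgba[4 * i + 3]; // alpha
    }
  }
  return out;
}

// Usage: a 2x1 image where only the first pixel belongs to the person.
const pixels = new Uint8ClampedArray([10, 20, 30, 255, 40, 50, 60, 255]);
const masked = applyForegroundMask(pixels, [1, 0]);
console.log(Array.from(masked)); // [10, 20, 30, 255, 0, 0, 0, 0]
```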
The portrait depth model could enable a whole host of creative applications oriented around the human body that could drive next-generation web apps. We refer readers to ARCore Depth Lab for more inspiration.
For the 3D photo application, we created a high-performance rendering pipeline. It first generates a segmentation mask using the existing TensorFlow.js body segmentation API. Next, we pass the masked portrait into the Portrait Depth API and obtain a depth map on the GPU. Then, we generate a depth mesh in three.js, with vertices arranged in a regular grid and displaced by re-projecting the corresponding depth values (see the figure below for generating the depth mesh). Finally, we apply texture projection to the depth mesh and rotate the camera around the z axis in a circle. Users can download the animations in GIF or WebM format.
Generating the depth mesh from the depth map for the 3D photo application.
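The depth mesh step can be sketched in plain JavaScript as follows. This is a simplified illustration, not the demo's actual code: it places one vertex per depth sample on a unit grid and pushes each vertex along the z axis in proportion to its depth, rather than performing a true camera re-projection. The function name, the depthScale parameter, and the convention that larger depth values lie farther from the viewer (toward negative z) are assumptions.

```javascript
// Build grid mesh buffers from a 2D depth array (rows x cols values).
function buildDepthMesh(depth, depthScale = 1) {
  const rows = depth.length;
  const cols = rows > 0 ? depth[0].length : 0;
  if (rows < 2 || cols < 2) {
    throw new Error('depth map must be at least 2x2');
  }
  const positions = new Float32Array(rows * cols * 3);
  const uvs = new Float32Array(rows * cols * 2);
  for (let r = 0; r < rows; r++) {
    for (let c = 0; c < cols; c++) {
      const i = r * cols + c;
      const u = c / (cols - 1); // 0 at the left edge, 1 at the right edge
      const v = r / (rows - 1); // 0 at the top edge, 1 at the bottom edge
      positions[3 * i] = u - 0.5;                      // x in [-0.5, 0.5]
      positions[3 * i + 1] = 0.5 - v;                  // y points up
      positions[3 * i + 2] = -depth[r][c] * depthScale; // assumed: far = -z
      uvs[2 * i] = u;
      uvs[2 * i + 1] = 1 - v; // flip v so the texture is not upside down
    }
  }
  // Two triangles per grid cell, counter-clockwise when viewed from +z.
  const indices = [];
  for (let r = 0; r < rows - 1; r++) {
    for (let c = 0; c < cols - 1; c++) {
      const topLeft = r * cols + c;
      const topRight = topLeft + 1;
      const bottomLeft = topLeft + cols;
      const bottomRight = bottomLeft + 1;
      indices.push(topLeft, bottomLeft, topRight);
      indices.push(topRight, bottomLeft, bottomRight);
    }
  }
  return { positions, uvs, indices };
}

// Usage: a 2x2 depth map produces 4 vertices and 2 triangles.
const mesh = buildDepthMesh([[0, 0.5], [1, 0.25]], 2);
console.log(mesh.positions.length / 3, mesh.indices.length / 3); // 4 2
```

In three.js, buffers like these could be attached to a `THREE.BufferGeometry` with `setAttribute('position', new THREE.BufferAttribute(positions, 3))`, `setAttribute('uv', new THREE.BufferAttribute(uvs, 2))`, and `setIndex(indices)`, with the portrait image used as the texture.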
Portrait Depth API Installation
The portrait depth API is currently offered as one variant of the new depth API.
To install the API and runtime library, you can either use the <script> tag in your HTML file or use NPM.
Through script tag:
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs-core"></script>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs-backend-webgl"></script>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs-converter"></script>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/body-segmentation"></script>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/depth-estimation"></script>
Through NPM:
yarn add @tensorflow/tfjs-core @tensorflow/tfjs-backend-webgl
yarn add @tensorflow/tfjs-converter
yarn add @tensorflow-models/body-segmentation
yarn add @tensorflow-models/depth-estimation
How you reference the API in your JS code depends on how you installed the library.
If installed through script tag, you can reference the library through the global namespace depthEstimation.
If installed through NPM, you need to import the libraries first:
import '@tensorflow/tfjs-core';
import '@tensorflow/tfjs-backend-webgl';
import '@tensorflow/tfjs-converter';
import '@tensorflow-models/body-segmentation';
import * as depthEstimation from '@tensorflow-models/depth-estimation';
Try it yourself!
First, you need to create an estimator:
const model = depthEstimation.SupportedModels.ARPortraitDepth;
const estimatorConfig = {
  minDepth: 0, // The minimum depth value outputted by the estimator.
  maxDepth: 1, // The maximum depth value outputted by the estimator.
};
const estimator = await depthEstimation.createEstimator(model, estimatorConfig);
Once you have an estimator, you can pass in a video stream, static image, or TensorFlow.js tensors to estimate depth:
const video = document.getElementById('video');
const depthMap = await estimator.estimateDepth(video);
How to use the output?
The depthMap result above contains depth values for each pixel in the image.
The depthMap is an object which stores the underlying depth values. You can use the provided asynchronous conversion functions, such as toCanvasImageSource, toArray, and toTensor, depending on the output type you want.
Note that different models have different internal representations of data, so converting from one form to another may be expensive. For efficiency, you can call getUnderlyingType to determine what form the depth map is already in, and keep it in that form for faster results.
The semantics of the depthMap are as follows: the depth map is the same size as the input image. For array and tensor representations, there is one depth value per pixel. For CanvasImageSource, the green and blue channels are always set to 0, whereas the red channel stores the depth value.
See the output snippet below for an example:
{
  toCanvasImageSource(): ...
  toArray(): ...
  toTensor(): ...
  getUnderlyingType(): ...
}
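To make the channel semantics concrete, here is a minimal sketch that recovers a per-pixel depth array from RGBA pixel data in which the red channel stores depth, such as data read back from a canvas with getImageData. The function name is hypothetical, and the linear mapping from 8-bit red values to the [minDepth, maxDepth] range is an assumption; the post only states that the red channel holds the depth value.

```javascript
// Convert RGBA pixel data (red = depth) into a 2D array of depth values.
// rgba: array-like of length width * height * 4.
// Assumes red 0 maps to minDepth and red 255 maps to maxDepth, linearly.
function depthFromRedChannel(rgba, width, height, minDepth = 0, maxDepth = 1) {
  if (rgba.length !== width * height * 4) {
    throw new Error('rgba length does not match width * height * 4');
  }
  const depth = [];
  for (let y = 0; y < height; y++) {
    const row = [];
    for (let x = 0; x < width; x++) {
      const red = rgba[4 * (y * width + x)];
      row.push(minDepth + (red / 255) * (maxDepth - minDepth));
    }
    depth.push(row);
  }
  return depth;
}

// Usage: a 2x1 image with red values 0 and 255.
const rgba = new Uint8ClampedArray([0, 0, 0, 255, 255, 0, 0, 255]);
console.log(depthFromRedChannel(rgba, 2, 1)); // [ [ 0, 1 ] ]
```

If per-pixel values are all you need, awaiting depthMap.toArray() avoids the canvas round trip; the sketch is mainly useful when the depth map has already been drawn to a canvas.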
Browser Performance
Portrait Depth model
Runtime | MacBook M1 Pro 2021 (FPS) | iPhone 13 Pro (FPS) | Pixel 6 Pro (FPS) | Desktop PC, Intel i9-10900K, Nvidia GTX 1070 GPU (FPS)
TFJS Runtime with WebGL backend | 51 | 22 | 5 | 47
Acknowledgements
We would like to acknowledge our colleagues who participated in or sponsored creating Portrait Depth API in TensorFlow.js: Na Li, Xiuxiu Yuan, Rohit Pandey, Abhishek Kar, Sergio Orts Escolano, Christoph Rhemann, Idris Aleem, Sean Fanello, Adarsh Kowdle, Ping Yu, Alex Olwal. We would also like to acknowledge the body segmentation model provided by MediaPipe, and The Relightables for high quality synthetic data.