Using TensorFlow for Deep Learning on Video Data

Posted by Shilpa Kancharla

Video data contains a rich amount of information and has a larger, more complex structure than image data. Being able to classify videos in a memory-efficient way using deep learning can help us better understand the contents within the data. On tensorflow.org, we have published a series of tutorials on how to load, preprocess, and classify video data. Here are quick links to each of these tutorials:

  1. Load video data
  2. Video classification with a 3D convolutional neural network
  3. MoViNet for streaming action recognition
  4. Transfer learning for video classification with MoViNet
In this blog post, we go more in depth on certain parts of these tutorials and discuss how you can incorporate them to build your own models that process video or three-dimensional data (such as MRI scans) in a memory-efficient manner using TensorFlow, for example by leveraging Python generators and by resizing, or downsampling, the data.
Example of the shape of video data, with the following dimensions:
number of frames (time) x height x width x channels.

FrameGenerator to load video data

From the Load video data tutorial, let’s take the opportunity to talk about the main workhorse of the majority of these tutorials: the FrameGenerator class. Through this class, we are able to yield the tensor representation of the video and the label, or class, of the video.

import random

class FrameGenerator:
  def __init__(self, path, n_frames, training = False):
    """ Returns a set of frames with their associated label.

      Args:
        path: Video file paths.
        n_frames: Number of frames.
        training: Boolean to determine if training dataset is being created.
    """
    self.path = path
    self.n_frames = n_frames
    self.training = training
    self.class_names = sorted(set(p.name for p in self.path.iterdir() if p.is_dir()))
    self.class_ids_for_name = dict((name, idx) for idx, name in enumerate(self.class_names))

  def get_files_and_class_names(self):
    video_paths = list(self.path.glob('*/*.avi'))
    classes = [p.parent.name for p in video_paths]
    return video_paths, classes

  def __call__(self):
    video_paths, classes = self.get_files_and_class_names()

    pairs = list(zip(video_paths, classes))

    if self.training:
      random.shuffle(pairs)

    for path, name in pairs:
      # frames_from_video_file() is a helper defined earlier in the Load video data tutorial.
      video_frames = frames_from_video_file(path, self.n_frames)
      label = self.class_ids_for_name[name] # Encode labels
      yield video_frames, label

After creating the generator class, we use the from_generator() function to feed the data into our deep learning models. Specifically, the from_generator() API creates a dataset whose contents are generated by a generator. Using Python generators can be more memory-efficient than storing an entire sequence of data in memory. Consider creating a generator class similar to FrameGenerator and using the from_generator() API to load data into your TensorFlow and Keras models.

output_signature = (tf.TensorSpec(shape = (None, None, None, 3),
                                  dtype = tf.float32),
                    tf.TensorSpec(shape = (),
                                  dtype = tf.int16))

train_ds = tf.data.Dataset.from_generator(FrameGenerator(subset_paths['train'],
                                                          10,
                                                          training=True),
                                          output_signature = output_signature)
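
A typical next step, sketched below as a minimal example rather than the tutorial's exact code, is to batch and prefetch the generator-backed dataset before passing it to a Keras model:

AUTOTUNE = tf.data.AUTOTUNE

train_ds = train_ds.batch(2)
train_ds = train_ds.prefetch(buffer_size=AUTOTUNE)

# The batched dataset can then be passed directly to a Keras model,
# for example: model.fit(train_ds, epochs=10)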

einops library for resizing video data

For the second tutorial on Video classification with a 3D convolutional neural network, let’s discuss the use of the einops library and how it can be incorporated into a Keras model backed by TensorFlow. This library is useful for performing flexible tensor operations and can be used with not only TensorFlow but also JAX. Specifically in this tutorial, we use it to help resize the data as it goes through the (2+1)D convolutional neural network we create. In the context of this second tutorial, we wanted to downsample the video data. Downsampling is particularly useful because it allows our model to examine specific parts of frames to detect patterns that may be specific to a certain feature in that video. Through downsampling, non-essential information can be discarded, reducing dimensionality and therefore allowing for faster processing.

We use the functions parse_shape() and rearrange() from the einops library. The parse_shape() function used here maps the names of the axes to their corresponding lengths. It will return a dictionary containing this information, called old_shape. Next, we use the rearrange() function that allows you to reorder the axes for multidimensional tensors. Pass in the tensor, alongside the names of the axes you are trying to rearrange.

The notation b t h w c -> (b t) h w c here means we want to squeeze together the batch size (denoted by b) and time (denoted by t) dimensions to pass this data into the Keras Resizing layer object. When we instantiate the ResizeVideo class, we pass in the height and width values that we want to resize the frame to. Once this resizing is complete, we use the rearrange() function again to unsqueeze (using the notation (b t) h w c -> b t h w c) the batch size and time dimensions.

import einops
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class ResizeVideo(keras.layers.Layer):
  def __init__(self, height, width):
    super().__init__()
    self.height = height
    self.width = width
    self.resizing_layer = layers.Resizing(self.height, self.width)

  def call(self, video):
    """
      Use the einops library to resize the tensor.

      Args:
        video: Tensor representation of the video, in the form of a set of frames.

      Return:
        The video downsampled to the new height and width.
    """
    # b stands for batch size, t stands for time, h stands for height,
    # w stands for width, and c stands for the number of channels.
    old_shape = einops.parse_shape(video, 'b t h w c')
    # Squeeze the batch and time dimensions together so the frames can be
    # passed through the Keras Resizing layer.
    images = einops.rearrange(video, 'b t h w c -> (b t) h w c')
    images = self.resizing_layer(images)
    # Unsqueeze the batch and time dimensions back out.
    videos = einops.rearrange(
        images, '(b t) h w c -> b t h w c',
        t = old_shape['t'])
    return videos
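
As a quick sanity check, here is a hypothetical usage sketch; the 112x112 target size and the dummy tensor shape are our own illustrative choices, not values from the tutorial:

# Downsample a batch of two 10-frame RGB videos from 224x224 to 112x112.
resize_layer = ResizeVideo(height=112, width=112)
dummy_videos = tf.random.uniform((2, 10, 224, 224, 3))  # (batch, time, height, width, channels)
smaller_videos = resize_layer(dummy_videos)
print(smaller_videos.shape)  # expected: (2, 10, 112, 112, 3)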

What’s next?

These are just a few ways you can leverage TensorFlow to work with video data in a memory-efficient manner, but such techniques aren’t just limited to video data. Medical data such as MRI scans or 3D image data also require efficient data loading and potential resizing of the shape of data. These techniques could prove useful when you are working with limited computational resources. We hope you find these tutorials helpful, and thank you for reading!

Read More

The Ultimate Upgrade: GeForce RTX 4080 SuperPOD Rollout Begins Today

The Ultimate upgrade begins today: GeForce NOW RTX 4080 SuperPODs are now rolling out, bringing a new level of high-performance gaming to the cloud.

Ultimate members will start to see RTX 4080 performance in their region soon, and experience titles like Warhammer 40,000: Darktide, Cyberpunk 2077, The Witcher 3: Wild Hunt and more at ultimate quality. New features are also available now for Ultimate members streaming from RTX 3080 servers, and members will be able to check GFN Thursday each week for availability updates in their regions.

Plus, get ready for 10 more supported games in the GeForce NOW library.

This Cloud Has an ‘Ultimate’ Lining

The GeForce NOW Ultimate membership brings new features and NVIDIA RTX technologies to the cloud for the first time, made possible by the NVIDIA Ada Lovelace GPU architecture.

Ultimate members receive three major streaming upgrades. The new RTX 4080 SuperPODs are capable of rendering and streaming at up to 240 frames per second. When paired with NVIDIA Reflex, it makes every moment of the action feel as if it’s on a desktop PC. And 4K gaming goes beyond fast with an upgrade to 120 fps, with support for DLSS 3 and RTX ON. Plus, for the first time, ultrawide resolutions are supported, giving members a wider point of view, at up to 3,840 x 1,600 resolution and 120 fps.

Coming to a zone near you.

Ultimate members in and around San Jose, Los Angeles, Dallas and Frankfurt, Germany, will be the first to experience the power of these RTX 4080 SuperPODs, starting today. Each week, GFN Thursday will spotlight the newest cities with upgraded servers, so make sure to check back each week to see which cities light up next on the map.

Even better: Starting today, Ultimate members streaming on RTX 3080 servers can take advantage of ultrawide resolutions and high dynamic range on the GeForce NOW PC and macOS apps. Learn more about supported resolutions and frame rates. Make sure you have the app v2.0.47.125 or later, and restart the app to see the new Ultimate features.

Don’t let this cloud pass you by — check it out and sign up. Get the Ultimate upgrade without paying the ultimate price — this highest-performance membership tier is only $19.99 per month or $99.99 for six months.

Game Related

The new year in Teyvat approaches in ‘The Exquisite Night Chimes’ update.

Celebrate the new year in Genshin Impact version 3.4, available this week. Players can explore Sumeru’s new sandstorm-ravaged desert with their favorite characters — and GeForce NOW members can play on the go with mobile touch controls.

Plus, 10 new games will be supported in the cloud this week:

Make this the ultimate weekend by playing these titles or any of the other 1,500 games in the GeForce NOW library. What game will you stream first on your Ultimate membership? Let us know in the comments or on Twitter.

Read More

Google Research, 2022 & Beyond: Language, Vision and Generative Models

Today we kick off a series of blog posts about exciting new developments from Google Research. Please keep your eye on this space and look for the title “Google Research, 2022 & Beyond” for more articles in the series.

I’ve always been interested in computers because of their ability to help people better understand the world around them. Over the last decade, much of the research done at Google has been in pursuit of a similar vision — to help people better understand the world around them and get things done. We want to build more capable machines that partner with people to accomplish a huge variety of tasks. All kinds of tasks. Complex, information-seeking tasks. Creative tasks, like creating music, drawing new pictures, or creating videos. Analysis and synthesis tasks, like crafting new documents or emails from a few sentences of guidance, or partnering with people to jointly write software together. We want to solve complex mathematical or scientific problems. Transform modalities, or translate the world’s information into any language. Diagnose complex diseases, or understand the physical world. Accomplish complex, multi-step actions in both the virtual software world and the physical world of robotics.

We’ve demonstrated early versions of some of these capabilities in research artifacts, and we’ve partnered with many teams across Google to ship some of these capabilities in Google products that touch the lives of billions of users. But the most exciting aspects of this journey still lie ahead!

With this post, I am kicking off a series in which researchers across Google will highlight some exciting progress we’ve made in 2022 and present our vision for 2023 and beyond. I will begin with a discussion of language, computer vision, multi-modal models, and generative machine learning models. Over the next several weeks, we will discuss novel developments in research topics ranging from responsible AI to algorithms and computer systems to science, health and robotics. Let’s get started!

  • Language Models
  • Computer Vision
  • Multimodal Models
  • Generative Models
  • Responsible AI
  • Algorithms
  • ML & Computer Systems
  • Robotics
  • Health
  • General Science & Quantum
  • Community Engagement

Language Models

The progress on larger and more powerful language models has been one of the most exciting areas of machine learning (ML) research over the last decade. Important advances along the way have included new approaches like sequence-to-sequence learning and our development of the Transformer model, which underlies most of the advances in this space in the last few years. Although language models are trained on surprisingly simple objectives, like predicting the next token in a sequence of text given the preceding tokens, when large models are trained on sufficiently large and diverse corpora of text, the models can generate coherent, contextual, natural-sounding responses, and can be used for a wide range of tasks, such as generating creative content, translating between languages, helping with coding tasks, and answering questions in a helpful and informative way. Our ongoing work on LaMDA explores how these models can be used for safe, grounded, and high-quality dialog to enable contextual multi-turn conversations.

Natural conversations are clearly an important and emergent way for people to interact with computers. Rather than contorting ourselves to interact in ways that best accommodate the limitations of computers, we can instead have natural conversations to accomplish a wide variety of tasks. I’m excited about the progress we’ve made in making LaMDA useful and factual.

In April, we described our work on PaLM, a large, 540 billion parameter language model built using our Pathways software infrastructure and trained on multiple TPU v4 Pods. The PaLM work demonstrated that, despite being trained solely on the objective of predicting the next token, large-scale language models trained on large amounts of multi-lingual data and source code are capable of improving the state-of-the-art across a wide variety of natural language, translation, and coding tasks, despite never having been trained to specifically perform those tasks. This work provided additional evidence that increasing the scale of the model and training data can significantly improve capabilities.

Performance comparison between the PaLM 540B parameter model and the prior state-of-the-art (SOTA) on 58 tasks from the Big-bench suite. (See paper for details.)

We have also seen significant success in using large language models (LLMs) trained on source code (instead of natural language text data) that can assist our internal developers, as described in ML-Enhanced Code Completion Improves Developer Productivity. Using a variety of code completion suggestions from a 500 million parameter language model for a cohort of 10,000 Google software developers using this model in their IDE, we’ve seen that 2.6% of all code comes from suggestions generated by the model, reducing coding iteration time for these developers by 6%. We are working on enhanced versions of this and hope to roll it out to even more developers.

One of the broad key challenges in artificial intelligence is to build systems that can perform multi-step reasoning, learning to break down complex problems into smaller tasks and combining solutions to those to address the larger problem. Our recent work on Chain of Thought prompting, whereby the model is encouraged to “show its work” in solving new problems (similar to how your fourth-grade math teacher encouraged you to show the steps involved in solving a problem, rather than just writing down the answer you came up with), helps language models follow a logical chain of thought and generate more structured, organized and accurate responses. Like the fourth-grade math student that shows their work, not only does this make the problem-solving approach much more interpretable, it is also more likely that the correct answer will be found for complex problems that require multiple steps of reasoning.

Models that use standard prompting directly provide the answer to a multi-step reasoning problem. In contrast, chain of thought prompting teaches the model to deconstruct the problem into intermediate reasoning steps, better enabling it to reach the correct final answer.
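
To make the difference concrete, here is a small illustration in the spirit of the canonical arithmetic example from the Chain of Thought paper (the exact prompt wording below is ours):

# Standard few-shot prompt: the exemplar shows only the final answer.
standard_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: The answer is 11.\n"
    "Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. "
    "How many apples do they have?\n"
    "A:"
)

# Chain-of-thought prompt: the exemplar spells out the intermediate reasoning
# steps, encouraging the model to do the same before giving its final answer.
cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n"
    "Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. "
    "How many apples do they have?\n"
    "A:"
)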

One of the areas where multi-step reasoning is most clearly beneficial and measurable is in the ability of models to solve complex mathematical reasoning and scientific problems. A key research question is whether ML models can learn to solve complex problems using multi-step reasoning. By taking the general-purpose PaLM language model and fine-tuning it on a large corpus of mathematical documents and scientific research papers from arXiv, and then using Chain of Thought prompting and majority voting, the Minerva effort was able to demonstrate substantial improvements over the state-of-the-art for mathematical reasoning and scientific problems across a wide variety of scientific and mathematical benchmark suites.

                              MATH     MMLU-STEM   OCWCourses   GSM8k
Minerva                       50.3%    75%         30.8%        78.5%
Published state-of-the-art    6.9%     55%         -            74.4%
Minerva 540B significantly improves state-of-the-art performance on STEM evaluation datasets.

Chain of Thought prompting is one way of better expressing natural language prompts and examples to a model to improve its ability to tackle new tasks. A similar approach, learned prompt tuning, in which a large language model is fine-tuned on a corpus of problem-domain–specific text, has shown great promise. In “Large Language Models Encode Clinical Knowledge”, we demonstrated that learned prompt tuning can adapt a general-purpose language model to the medical domain with relatively few examples and that the resulting model can achieve 67.6% accuracy on US Medical License Exam questions (MedQA), surpassing the prior ML state-of-the-art by over 17%. While still falling short of the abilities of clinicians, comprehension, recall of knowledge and medical reasoning all improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Continued work can help to create safe, helpful language models for clinical application.

Large language models trained on multiple languages can also help with translation from one language to another, even when they have never been taught to explicitly translate text. Traditional machine translation systems usually rely on parallel (translated) text to learn to translate from one language to another. However, since parallel text exists for a relatively small number of languages, many languages are often not supported in machine translation systems. In “Unlocking Zero-Resource Machine Translation to Support New Languages in Google Translate” and the accompanying papers “Building Machine Translation Systems for the Next Thousand Languages” and “Towards the Next 1000 Languages in Multilingual Machine Translation: Exploring the Synergy Between Supervised and Self-Supervised Learning”, we describe a set of techniques that use massively multilingual language models trained on monolingual (non-parallel) datasets to add 24 new languages spoken by 300 million people to Google Translate.

The amount of monolingual data per language versus the amount of parallel (translated) data per language. A small number of languages have large amounts of parallel data, but there is a long tail of languages with only monolingual data.

Another approach is represented with learned soft prompts, where instead of constructing new input tokens to represent a prompt, we add a small number of tunable parameters per task that can be learned from a few task examples. This approach generally yields high performance on tasks for which we have learned soft prompts, while allowing the large pre-trained language model to be shared across thousands of different tasks. This is a specific example of the more general technique of task adaptors, which allow a large portion of the parameters to be shared across tasks while still allowing task-specific adaptation and tuning.

As scale increases, prompt tuning, which conditions frozen models using tunable soft prompts, matches the performance of model tuning, despite using over 25,000 times fewer parameters.
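
To illustrate the mechanics only (this is not the actual prompt tuning implementation), the sketch below prepends a small matrix of trainable prompt vectors to the token embeddings fed into a frozen model; the prompt length and embedding size are placeholder assumptions:

import tensorflow as tf

class SoftPrompt(tf.keras.layers.Layer):
  """Prepends trainable prompt vectors to a sequence of token embeddings."""

  def __init__(self, prompt_length=20, embed_dim=128):
    super().__init__()
    # The only parameters updated during prompt tuning.
    self.prompt = self.add_weight(
        name="soft_prompt",
        shape=(prompt_length, embed_dim),
        initializer="random_normal",
        trainable=True)

  def call(self, token_embeddings):
    batch_size = tf.shape(token_embeddings)[0]
    prompt = tf.tile(self.prompt[None, ...], [batch_size, 1, 1])
    return tf.concat([prompt, token_embeddings], axis=1)

# Usage sketch: `frozen_lm` stands in for a large pre-trained model whose weights
# stay frozen; only the soft prompt is trained on the downstream task.
# outputs = frozen_lm(SoftPrompt()(token_embeddings))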

Interestingly, the utility of language models can grow significantly as their sizes increase due to the emergence of new capabilities. “Characterizing Emergent Phenomena in Large Language Models” examines the sometimes surprising characteristic that these models are not able to perform particular complex tasks very effectively until reaching a certain scale. But then, once a critical amount of learning has happened (which varies by task), they suddenly show large jumps in the ability to perform a complex task accurately (as shown below). This raises the question of what new tasks will become feasible when these models are trained further.

The ability to perform multi-step arithmetic (left), succeed on college-level exams (middle), and identify the intended meaning of a word in context (right) all emerge only for models of sufficiently large scale. The models shown include LaMDA, GPT-3, Gopher, Chinchilla, and PaLM.

Additionally, language models of sufficient scale have the ability to learn and adapt to new information and tasks, which makes them even more versatile and powerful. As these models continue to improve and become more sophisticated, they will likely play an increasingly important role in many aspects of our lives.

Computer Vision

Computer vision continues to evolve and make rapid progress. One trend that started with our work on Vision Transformers in 2020 is to use the Transformer architecture in computer vision models rather than convolutional neural networks. Although the localized feature-building abstraction of convolutions is a strong approach for many computer vision problems, it is not as flexible as the general attention mechanism in transformers, which can utilize both local and non-local information about the image throughout the model. However, the full attention mechanism is challenging to apply to higher resolution images, since it scales quadratically with image size.

In “MaxViT: Multi-Axis Vision Transformer”, we explore an approach that combines both local and non-local information at each stage of a vision model, but scales more efficiently than the full attention mechanism present in the original Vision Transformer work. This approach outperforms other state-of-the-art models on the ImageNet-1k classification task and various object detection tasks, but with significantly lower computational costs.

In MaxViT, a multi-axis attention mechanism conducts blocked local and dilated global attention sequentially, followed by an FFN, with only linear complexity. The pixels in the same colors are attended together.

In “Pix2Seq: A Language Modeling Framework for Object Detection”, we explore a simple and generic method that tackles object detection from a completely different perspective. Unlike existing approaches that are task-specific, we cast object detection as a language modeling task conditioned on the observed pixel inputs with the model trained to “read out” the locations and other attributes about the objects of interest in the image. Pix2Seq achieves competitive results on the large-scale object detection COCO dataset compared to existing highly-specialized and well-optimized detection algorithms, and its performance can be further improved by pre-training the model on a larger object detection dataset.

The Pix2Seq framework for object detection. The neural network perceives an image, and generates a sequence of tokens for each object, which correspond to bounding boxes and class labels.
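
To give a flavor of the idea, here is a hypothetical serialization (not the exact scheme from the Pix2Seq paper) showing how a bounding box and class label could be quantized into a short sequence of discrete tokens:

def box_to_tokens(box, label_id, num_bins=1000):
  """Quantize normalized [ymin, xmin, ymax, xmax] coordinates into integer tokens."""
  coord_tokens = [int(round(c * (num_bins - 1))) for c in box]
  # Class tokens are offset past the coordinate vocabulary.
  return coord_tokens + [num_bins + label_id]

# One object becomes a short token sequence the model learns to "read out":
print(box_to_tokens([0.1, 0.2, 0.5, 0.8], label_id=17))  # e.g. [100, 200, 500, 799, 1017]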

Another long-standing challenge in computer vision is to better understand the 3-D structure of real-world objects from one or a few 2-D images. We have been trying multiple approaches to make progress in this area. In “Large Motion Frame Interpolation”, we demonstrated that short slow-motion videos can be created by interpolating between two pictures that were taken many seconds apart, even when there might have been significant movement in some parts of the scene. In “View Synthesis with Transformers”, we show how to combine two new techniques, light field neural rendering (LFNR) and generalizable patch-based neural rendering (GPNR), to synthesize novel views of a scene, a long-standing challenge in computer vision. LFNR is a technique that can accurately reproduce view-dependent effects by using transformers that learn to combine reference pixel colors. While LFNR works well on single scenes, its ability to generalize to novel scenes is limited. GPNR overcomes this by using a sequence of transformers with canonicalized positional encodings that can be trained on a set of scenes to synthesize views of new scenes. Together, these techniques enable high-quality view synthesis of novel scenes from just a couple of images of the scene, as shown below:

By combining LFNR and GPNR, models are able to produce new views of a scene given only a few images of it. These models are particularly effective when handling view-dependent effects like the refractions and translucency on the test tubes. Source: Still images from the NeX/Shiny dataset.

Going even further, in “LOLNerf: Learn from One Look”, we explore the ability to learn a high quality representation from just a single 2-D image. By training on many different examples of particular categories of objects (e.g., lots of single images of different cats), we can learn enough about the expected 3-D structure of objects to create a 3-D model from just a single image of a novel category (e.g., just a single image of your cat, as shown in the LOLCats clips below).

Top: Example cat images from AFHQ. Bottom: A synthesis of novel 3-D views created by LOLNeRF.

A general thrust of this work is to develop techniques that help computers have a better understanding of the 3-D world — a longstanding dream of computer vision!

Multimodal Models

Most past ML work has focused on models that deal with a single modality of data (e.g., language models, image classification models, or speech recognition models). While there has been plenty of amazing progress in these areas, the future is even more exciting as we look forward to multi-modal models that can flexibly handle many different modalities simultaneously, both as model inputs and as model outputs. We have pushed in this direction in many ways over the past year.

Rather than relying on individual models tailored to specific tasks or domains, the next generation of multi-modal models can handle different modalities simultaneously by activating only the model pathways necessary for a given problem.

There are two key questions when building a multi-modal model that must be addressed to best enable cross-modality features and learning:

  1. How much modality-specific processing should be done before allowing the learned representations to be merged?
  2. What is the most effective way to mix the representations?

In our work on “Multi-modal Bottleneck Transformers” and the accompanying “Attention Bottlenecks for Multimodal Fusion” paper, we explore these tradeoffs and find that bringing together modalities after a few layers of modality-specific processing and then mixing the features from different modalities through a bottleneck layer is more effective than other techniques (as illustrated by the Bottleneck Mid Fusion in the figure below). This approach substantially improves accuracy on a variety of video classification tasks by learning to use multiple modalities of data to make classification decisions.

Sample attention configurations for multi-modal transformer encoders. Red and blue rows of dots represent encoder layers. Typical approaches to fusion of multi-modal transformer encoder features (“full fusion”) use pairwise self attention across hidden units in a layer (left). Bottleneck fusion (middle) restricts attention flow within a layer through tight latent units called attention bottlenecks. Bottleneck mid fusion (right) applies bottleneck fusion only to later layers in the model for optimal performance.

Combining modalities can often improve accuracy on even single-modality tasks. This is an area we have been exploring for many years, including our work on DeViSE, which combines image representations and word-embedding representations to improve image classification accuracy, even on unseen object categories. A modern variant of this general idea is found in Locked-image Tuning (LiT), a method that adds language understanding to an existing pre-trained image model. This approach contrastively trains a text encoder to match image representations from a powerful pre-trained image encoder. This simple method is data and compute efficient, and substantially improves zero-shot image classification performance compared to existing contrastive learning approaches.

LiT-tuning contrastively trains a text encoder to match a pre-trained image encoder. The text encoder learns to compute representations that align to those from the image encoder.
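
A minimal sketch of the contrastive objective behind this style of tuning is shown below, assuming image_features come from the frozen image encoder and text_features from the trainable text encoder; the temperature and normalization details here are illustrative assumptions rather than the exact LiT recipe:

import tensorflow as tf

def contrastive_loss(image_features, text_features, temperature=0.07):
  """Symmetric InfoNCE-style loss over a batch of matching image/text pairs."""
  image_features = tf.math.l2_normalize(image_features, axis=-1)
  text_features = tf.math.l2_normalize(text_features, axis=-1)
  # Cosine similarity between every text and every image in the batch.
  logits = tf.matmul(text_features, image_features, transpose_b=True) / temperature
  # Matching pairs sit on the diagonal of the similarity matrix.
  labels = tf.range(tf.shape(logits)[0])
  loss_t2i = tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
  loss_i2t = tf.keras.losses.sparse_categorical_crossentropy(labels, tf.transpose(logits), from_logits=True)
  return tf.reduce_mean(loss_t2i + loss_i2t) / 2.0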

Another example of the uni-modal utility of multi-modal models is observed when co-training on related modalities, like images and videos. In this case, one can often improve accuracy on video action classification tasks compared to training on video data alone (especially when training data in one modality is limited).

Combining language with other modalities is a natural step for improving how users interact with computers. We have explored this direction in quite a number of ways this year. One of the most exciting is in combining language and vision inputs, either still images or videos. In “PaLI: Scaling Language-Image Learning”, we introduced a unified language-image model trained to perform many tasks in over 100 languages. These tasks span vision, language, and multimodal image and language applications, such as visual question answering, image captioning, object detection, image classification, optical character recognition, text reasoning, and others. By combining a vision transformer (ViT) with a text-based transformer encoder, and then a transformer-based decoder to generate textual answers, and training the whole system end-to-end on many different tasks simultaneously, the system achieves state-of-the-art results across many different benchmarks.

For example, PaLI achieves state-of-the-art results on the CrossModal-3600 benchmark, a diverse test of multilingual, multi-modal capabilities with an average CIDEr score of 53.4 across 35 languages (improving on the previous best score of 28.9). As the figure below shows, having a single model that can simultaneously understand multiple modalities and many languages and handle many tasks, such as captioning and question answering, will lead to computer systems where you can have a natural conversation about other kinds of sensory inputs, asking questions and getting answers to your needs in a wide variety of languages (“In Thai, can you say what is above the table in this image?”, “How many parakeets do you see sitting on the branches?”, “Describe this image in Swahili”, “What Hindi text is in this image?”).

The PaLI model addresses a wide range of tasks in the language-image, language-only and image-only domain using the same API (e.g., visual-question answering, image captioning, scene-text understanding, etc.). The model is trained to support over 100 languages and tuned to perform multilingually for multiple language-image tasks.

In a similar vein, our work on FindIt enables natural language questions about visual images to be answered through a unified, general-purpose and multitask visual grounding model that can flexibly answer different types of grounding and detection queries.

FindIt is a unified model for referring expression comprehension (first column), text-based localization (second), and the object detection task (third). FindIt can respond accurately when tested on object types and classes not known during training, e.g., “Find the desk” (fourth). We show the MattNet results for comparison.

The area of video question answering (e.g., given a baking video, being able to answer a question like “What is the second ingredient poured into the bowl?”) requires the ability to comprehend both textual inputs (the question) and video inputs (the relevant video) to produce a textual answer. In “Efficient Video-Text Learning with Iterative Co-tokenization”, multi-stream video inputs, which are versions of the same video input (e.g., a high resolution, low frame-rate video and a low resolution, high frame-rate video), are efficiently fused together with the text input to produce a text-based answer by the decoder. Instead of processing the inputs directly, the video-text iterative co-tokenization model learns a reduced number of useful tokens from the fused video-language inputs. This process is done iteratively, allowing the current feature tokenization to affect the selection of tokens at the next iteration, thus refining the selection.

An example input question for the video question answering task “What is the second ingredient poured into the bowl?” which requires deeper understanding of both the visual and text inputs. The video is an example from the 50 Salads dataset, used under the Creative Commons license.

The process of creating high-quality video content often includes several stages, from video capturing to video and audio editing. In some cases, dialogue is re-recorded in a studio (referred to as dialog replacement, post-sync or dubbing) to achieve high quality and replace original audio that might have been recorded in noisy or other suboptimal conditions. However, the dialog replacement process can be difficult and tedious because the newly recorded audio needs to be well synced with the video, often requiring several edits to match the exact timing of mouth movements. In “VDTTS: Visually-Driven Text-To-Speech”, we explore a multi-modal model for accomplishing this task more easily. Given desired text and the original video frames of a speaker, the model can generate speech output of the text that matches the video while also recovering aspects of prosody, such as timing or emotion. The system shows substantial improvements on a variety of metrics related to video-sync, speech quality, and speech pitch. Interestingly, the model can produce video-synchronized speech without any explicit constraints or losses in the model training to promote this.

[Video and audio samples: Original, VDTTS, VDTTS video-only, and TTS]

Original displays the original video clip. VDTTS displays the audio predicted using both the video frames and the text as input. VDTTS video-only displays audio predictions using video frames only. TTS displays audio predictions using text only. Transcript: “absolutely love dancing I have no dance experience whatsoever but as that”.

In “Look and Talk: Natural Conversations with Google Assistant”, we show how an on-device multi-modal model can use both video and audio input to make interacting with Google Assistant much more natural. The model learns to use a number of visual and auditory cues, such as gaze direction, proximity, face matching, voice matching and intent classification, to more accurately determine if a nearby person is actually trying to talk to the Google Assistant device, or merely happens to be talking near the device without the intent of causing the device to take any action. With just the audio or visual features alone, this determination would be much more difficult.

Multi-modal models don’t have to be limited to just combining human-oriented modalities like natural language or imagery, and they are increasingly important for real-world autonomous vehicle and robotics applications. In this context, such models can take the raw output of sensors that are unlike any human senses, such as 3-D point cloud data from Lidar units on autonomous vehicles, and can combine this with data from other sensors, like vehicle cameras, to better understand the environment around them and to make better decisions. In “4D-Net for Learning Multi-Modal Alignment for 3D and Image Inputs in Time”, the 3-D point cloud data from Lidar is fused with the RGB data from the camera in real-time, with a self-attention mechanism controlling how the features are mixed together and weighted at different layers. The combination of the different modalities and the use of time-oriented features gives substantially improved accuracy in 3-D object recognition over using either modality on its own. More recent work on Lidar-camera fusion introduced learnable alignment and better geometric processing through inverse augmentation to further improve the accuracy of 3-D object recognition.

4D-Net effectively combines 3D LiDAR point clouds in time with RGB images, also streamed in time as video, learning the connections between different sensors and their feature representations.

Having single models that understand many different modalities fluidly and contextually and that can generate many different kinds of outputs (e.g., language, images or speech) in that context, is a much more useful, general purpose framing of ML. We’re excited about where this will take us because it will enable new exciting applications in many Google products and also advance the fields of health, science, creativity, robotics and more!

Generative Models

The quality and capabilities of generative models for imagery, video, and audio have shown truly stunning and extraordinary advances in 2022. There are a wide variety of approaches for generative models, which must learn to model complex data sets (e.g., natural images). Generative adversarial networks, developed in 2014, set up two models working against each other. One is a generator, which tries to generate a realistic looking image (perhaps conditioned on an input to the model, like the category of image to generate), and the other is a discriminator, which is given the generated image and a real image and tries to determine which of the two is generated and which is real, hence the adversarial aspect. Each model tries to get better and better at winning the competition against the other, resulting in both models getting better and better at their task, and in the end, the generative model can be used in isolation to generate images.

Advances in generative image model capabilities over the past decade.
Left: From I. Goodfellow, et al. 2014. Middle: From M. Lucic, et al. 2019. Right: From Imagen.

Diffusion models, introduced in “Deep Unsupervised Learning using Nonequilibrium Thermodynamics” in 2015, systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process. They then learn a reverse diffusion process that can restore the structure in the data that has been lost, even given high levels of noise. The forward process can be used to generate noisy starting points for the reverse diffusion process conditioned on various useful, controllable inputs to the model, so that the reverse diffusion (generative) process becomes controllable. This means that it is possible to ask the model to “generate an image of a grapefruit”, a much more useful capability than just “generate an image” if what you are after is indeed a sampling of images of grapefruits.
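
In code, the forward (noising) process can be sampled in closed form; the sketch below assumes a simple linear noise schedule, which is only one of many choices used in practice:

import tensorflow as tf

T = 1000                            # number of diffusion steps
betas = tf.linspace(1e-4, 0.02, T)  # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = tf.math.cumprod(alphas)

def q_sample(x0, t, noise):
  """Sample x_t ~ q(x_t | x_0): progressively destroy structure by mixing in noise."""
  a_bar = tf.reshape(tf.gather(alpha_bars, t), [-1, 1, 1, 1])  # broadcast over image dims
  return tf.sqrt(a_bar) * x0 + tf.sqrt(1.0 - a_bar) * noise

# The reverse (generative) process trains a network to undo this corruption,
# optionally conditioned on inputs such as text embeddings, which is what makes
# generation controllable ("generate an image of a grapefruit").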

Various forms of autoregressive models have also been applied to the task of image generation. In 2016, “Pixel Recurrent Neural Networks” introduced PixelRNN, a recurrent architecture, and PixelCNN, a similar but more efficient convolutional architecture that was also investigated in “Conditional Image Generation with PixelCNN Decoders”. These two architectures helped lay the foundation for pixel-level generation using deep neural networks. They were followed in 2017 by VQ-VAE, proposed in “Neural Discrete Representation Learning”, a vector-quantized variational autoencoder. Combining this with PixelCNN yielded high-quality images. Then, in 2018 Image Transformer used the autoregressive Transformer model to generate images.

Until relatively recently, all of these image generation techniques were capable of generating images that are relatively low quality compared to real world images. However, several recent advances have opened the door for much better image generation performance. One is Contrastive Language-Image Pre-training (CLIP), a pre-training approach for jointly training an image encoder and a text encoder to predict [image, text] pairs. This pre-training task of predicting which caption goes with which image proved to be an efficient and scalable way to learn image representations and yielded good zero-shot performance on datasets like ImageNet.

In addition to CLIP, the toolkit of generative image models has recently grown. Large language model encoders have been shown to effectively condition image generation on long natural language descriptions rather than just a limited number of pre-set categories of images. Significantly larger training datasets of images and accompanying captions (which can be reversed to serve as text→image exemplars) have improved overall performance. All of these factors together have given rise to a range of models able to generate high-resolution images with strong adherence even to very detailed and fantastic prompts.

We focus here on two recent advances from teams in Google Research, Imagen and Parti.

Imagen is based on the Diffusion work discussed above. In their 2022 paper “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding”, the authors show that a generic large language model (e.g., T5), pre-trained on text-only corpora, is surprisingly effective at encoding text for image synthesis. Somewhat surprisingly, increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. The work offers several advances to Diffusion-based image generation, including a new memory-efficient architecture called Efficient U-Net and Classifier-Free Diffusion Guidance, which improves performance by occasionally “dropping out” conditioning information during training. Classifier-free guidance forces the model to learn to generate from the input data alone, thus helping it avoid problems that arise from over-relying on the conditioning information. “Guidance: a cheat code for diffusion models” provides a nice explanation.

Parti uses an autoregressive Transformer architecture to generate image pixels based on a text input. In “Vector-quantized Image Modeling with Improved VQGAN”, released in 2021, an encoder based on Vision Transformer is shown to significantly improve the output of a vector-quantized GAN model, VQGAN. This is extended in “Scaling Autoregressive Models for Content-Rich Text-to-Image Generation”, released in 2022, where much better results are obtained by scaling the Transformer encoder-decoder to 20B parameters. Parti also uses classifier-free guidance, described above, to sharpen the generated images. Perhaps not surprising given that it is a language model, Parti is particularly good at picking up on subtle cues in the prompt.

Left: Imagen generated image from the complex prompt, “A wall in a royal castle. There are two paintings on the wall. The one on the left is a detailed oil painting of the royal raccoon king. The one on the right a detailed oil painting of the royal raccoon queen.” Right: Parti generated image from the prompt, “A teddy bear wearing a motorcycle helmet and cape car surfing on a taxi cab in New York City. dslr photo.”

User Control

The advances described above make it possible to generate realistic still images based on text descriptions. However, sometimes text alone is not sufficient to enable you to create what you want — e.g., consider “A dog being chased by a unicorn on the beach” vs. “My dog being chased by a unicorn on the beach”. So, we have done subsequent research in providing new ways for users to control the generation process. In “DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation”, users are able to fine-tune a trained model like Imagen or Parti to generate new images based on a combination of text and user-furnished images. This allows users to place images of themselves (or e.g., their pets) into generated images, thus allowing for much more user control. This is exemplified in “Prompt-to-Prompt Image Editing with Cross Attention Control”, where users are able to edit images using text prompts like “make the car into a bicycle” and in Imagen Editor, which allows users to iteratively edit images by filling in masked areas using text prompts.

Generative Video

One of the next research challenges we are tackling is to create generative models for video that can produce high resolution, high quality, temporally consistent videos with a high level of controllability. This is a very challenging area because unlike images, where the challenge was to match the desired properties of the image with the generated pixels, with video there is the added dimension of time. Not only must all the pixels in each frame match what should be happening in the video at the moment, they must also be consistent with other frames, both at a very fine-grained level (a few frames away, so that motion looks smooth and natural), but also at a coarse-grained level (if we asked for a two minute video of a plane taking off, circling, and landing, we must make thousands of frames that are consistent with this high-level video objective). This year we’ve made quite a lot of exciting progress on this lofty goal through two efforts, Imagen Video and Phenaki, each using somewhat different approaches.

Imagen Video generates high resolution videos with Cascaded Diffusion Models (described in more detail in “Imagen Video: High Definition Video Generation from Diffusion Models”). The first step is to take an input text prompt (“A happy elephant wearing a birthday hat walking under the sea”) and encode it into textual embeddings with a T5 text encoder. A base video diffusion model then generates a very rough sketch 16 frame video at 40×24 resolution and 3 frames per second. This is then followed by multiple temporal super-resolution (TSR) and spatial super-resolution (SSR) models to upsample and generate a final 128 frame video at 1280×768 resolution and 24 frames per second — resulting in 5.3s of high definition video. The resulting videos are high resolution, and are spatially and temporally consistent, but still quite short at ~5 seconds long.

“Phenaki: Variable Length Video Generation From Open Domain Textual Description”, released in 2022, introduces a new Transformer-based model for learning video representations, which compresses the video to a small representation of discrete tokens. Text conditioning is achieved by training a bi-directional Transformer model to generate video tokens based on a text description. These generated video tokens are then decoded to create the actual video. Because the model is causal in time, it can be used to generate variable-length videos. This opens the door to multi-prompt storytelling as illustrated in the video below.

Phenaki video generated from the complex prompt, “A photorealistic teddy bear is swimming in the ocean at San Francisco. The teddy bear goes under water. The teddy bear keeps swimming under the water with colorful fishes. A panda bear is swimming under water.”

It is possible to combine the Imagen Video and Phenaki models to benefit from both the high-resolution individual frames from Imagen and the long-form videos from Phenaki. The most straightforward way to do this is to use Imagen Video to handle super-resolution of short video segments, while relying on the auto-regressive Phenaki model to generate the long-timescale video information.

Generative Audio

In addition to visual-oriented generative models, we have made significant progress on generative models for audio. In “AudioLM, a Language Modeling Approach to Audio Generation” (and the accompanying paper), we describe how to leverage advances in language modeling to generate audio without being trained on annotated data. Using a language-modeling approach for raw audio data instead of textual data introduces a number of challenges that need to be addressed.

First, the data rate for audio is significantly higher, leading to much longer sequences — while a written sentence can be represented by a few dozen characters, its audio waveform typically contains hundreds of thousands of values. Second, there is a one-to-many relationship between text and audio. This means that the same sentence can be uttered differently by different speakers with different speaking styles, emotional content and other audio background conditions.

To deal with this, we separate the audio generation process into two steps. The first involves a sequence of coarse, semantic tokens that capture both local dependencies (e.g., phonetics in speech, local melody in piano music) and global long-term structure (e.g., language syntax and semantic content in speech, harmony and rhythm in piano music), while heavily downsampling the audio signal to allow for modeling long sequences. One part of the model generates a sequence of coarse semantic tokens conditioned on the past sequence of such tokens. We then rely on a portion of the model that can use a sequence of coarse tokens to generate fine-grained audio tokens that are close to the final generated waveform.

When trained on speech, and without any transcript or annotation, AudioLM generates syntactically and semantically plausible speech continuations while also maintaining speaker identity and prosody for unseen speakers. AudioLM can also be used to generate coherent piano music continuations, despite being trained without any symbolic representation of music. You can listen to more samples here.

Concluding Thoughts on Generative Models

2022 has brought exciting advances in media generation. Computers can now interact with natural language and better understand your creative process and what you might want to create. This unlocks exciting new ways for computers to help users create images, video, and audio — in ways that surpass the limits of traditional tools!

This has inspired more research interest in how users can control the generative process. Advances in text-to-image and text-to-video have unlocked language as a powerful way to control generation, while work like DreamBooth has made it possible for users to kickstart the generative process with their own images. 2023 and beyond will surely be marked by advances in the quality and speed of media generation itself. Alongside these advances, we will also see new user experiences, allowing for more creative expression.

It is also worth noting that although these creative tools have tremendous possibilities for helping humans with creative tasks, they introduce a number of concerns — they could potentially generate harmful content of various kinds, or generate fake imagery or audio content that is difficult to distinguish from reality.  These are all issues we consider carefully when deciding when and how to deploy these models responsibly. 

Responsible AI

AI must be pursued responsibly. Powerful language models can help people with many tasks, but without care they can also generate misinformation or toxic text. Generative models can be used for amazing creative purposes, enabling people to manifest their imagination in new and amazing ways, but they can also be used to create harmful imagery or realistic-looking images of events that never occurred.

These are complex topics to grapple with. Leaders in ML and AI must lead not only in state-of-the-art technologies, but also in state-of-the-art approaches to responsibility and implementation. In 2018, we were one of the first companies to articulate AI Principles that put beneficial use, users, safety, and avoidance of harms above all, and we have pioneered many best practices, like the use of model and data cards. More than words on paper, we apply our AI Principles in practice. You can see our latest AI Principles progress update here, including case studies on text-to-image generation models, techniques for avoiding gender bias in translations, and more inclusive and equitable evaluation of skin tones. Similar updates were published in 2021, 2020, and 2019. As we pursue AI both boldly and responsibly, we continue to learn from users, other researchers, affected communities, and our experiences.

Our responsible AI approach includes the following:

  • Focus on AI that is useful and benefits users and society.
  • Intentionally apply our AI Principles (which are grounded in beneficial uses and avoidance of harm), processes, and governance to guide our work in AI, from research priorities to productization and uses.
  • Apply the scientific method to AI R&D with research rigor, peer review, readiness reviews, and responsible approaches to access and externalization.
  • Collaborate with multidisciplinary experts, including social scientists, ethicists, and other teams with socio-technical expertise.
  • Listen, learn and improve based on feedback from developers, users, governments, and representatives of affected communities.
  • Conduct regular reviews of our AI research and application development, including use cases. Provide transparency on what we’ve learned.
  • Stay on top of current and evolving areas of concern and risk (e.g., safety, bias and toxicity) and address, research and innovate to respond to challenges and risks as they emerge.
  • Lead on and help shape responsible governance, accountability, and regulation that encourages innovation and maximizes the benefits of AI while mitigating risks.
  • Help users and society understand what AI is (and is not) and how to benefit from its potential.

In a subsequent blog post, leaders from our Responsible AI team will discuss work from 2022 in more detail and their vision for the field in the next few years.

Concluding Thoughts

We’re excited by the transformational advances discussed above, many of which we’re applying to make Google products more helpful to billions of users — including Search, Assistant, Ads, Cloud, Gmail, Maps, YouTube, Workspace, Android, Pixel, Nest, and Translate. These latest advances are making their way into real user experiences that will dramatically change how we interact with computers.

In the domain of language models, thanks to our invention of the Transformer model and advances like sequence-to-sequence learning, people can have a natural conversation (with a computer!) — and get surprisingly good responses (from a computer!). Thanks to new approaches in computer vision, computers can help people create and interact in 3D, rather than 2D. And thanks to new advances in generative models, computers can help people create images, videos, and audio — in ways they weren’t able to before with traditional tools (e.g., a keyboard and mouse). Combined with advances like natural language understanding, computers can understand what you’re trying to create — and help you realize surprisingly good results!

Another transformation changing how people interact with computers is the increasing capabilities of multi-modal models. We are working towards being able to create a single model that can understand many different modalities fluidly — understanding what each modality represents in context — and then actually generate different modes in that context. We’re excited by progress towards this goal! For example, we introduced a unified language model that can perform vision, language, question answering and object detection tasks in over 100 languages with state-of-the-art results across various benchmarks. In future applications, people can engage more senses to get computers to do what they want — e.g., “Describe this image in Swahili.” We’ve shown that on-device multi-modal models can make interacting with Google Assistant more natural. And we’ve demonstrated models that can, in various combinations, generate images, video, and audio controlled by natural language, images, and audio. More exciting things to come in this space!

As we innovate, we have a responsibility to users and society to thoughtfully pursue and develop these new technologies in accordance with our AI Principles. It’s not enough for us to develop state-of-the-art technologies, but we must also ensure that they are safe before broadly releasing them into the world, and we take this responsibility very seriously.

New advances in AI present an exciting horizon of new ways computers can help people get things done. For Google, many will enhance or transform our longstanding mission to organize the world’s information and make it universally accessible and useful. Over 20 years later, we believe this mission is as bold as ever. Today, what excites us is how we’re applying many of these advances in AI to enhance and transform user experiences — helping more people better understand the world around them and get more things done. My own longstanding vision of computers!

Acknowledgements

Thank you to the entire Research Community at Google for their contributions to this work! In addition, I would especially like to thank the many Googlers who provided helpful feedback in the writing of this post and who will be contributing to the other posts in this series, including Martin Abadi, Ryan Babbush, Vivek Bandyopadhyay, Kendra Byrne, Esmeralda Cardenas, Alison Carroll, Zhifeng Chen, Charina Chou, Lucy Colwell, Greg Corrado, Corinna Cortes, Marian Croak, Tulsee Doshi, Toju Duke, Doug Eck, Sepi Hejazi Moghadam, Pritish Kamath, Julian Kelly, Sanjiv Kumar, Ronit Levavi Morad, Pasin Manurangsi, Yossi Matias, Kathy Meier-Hellstern, Vahab Mirrokni, Hartmut Neven, Adam Paszke, David Patterson, Mangpo Phothilimthana, John Platt, Ben Poole, Tom Small, Vadim Smelyanskiy, Vincent Vanhoucke, and Leslie Yeh.

Read More

End-to-End Pipeline for Segmentation with TFX, Google Cloud, and Hugging Face

End-to-End Pipeline for Segmentation with TFX, Google Cloud, and Hugging Face

Posted by Chansung Park, Sayak Paul (ML and Cloud GDEs)

TensorFlow Extended (TFX) is a flexible framework allowing Machine Learning (ML) practitioners to iterate on production-grade ML workflows faster with reliability and resiliency. TFX’s power lies in its flexibility to run ML pipelines across different compatible orchestrators such as Kubeflow, Apache Airflow, Vertex AI Pipelines, etc., both locally and on the cloud.

In this blog post, we discuss the crucial details of building an end-to-end ML pipeline for Semantic Segmentation tasks with TFX and various Google Cloud services such as Dataflow, Vertex Pipelines, Vertex Training, and Vertex Endpoint. The pipeline also uses a custom TFX component, HFPusher, that is integrated with the Hugging Face 🤗 Hub. Finally, you will see how we added CI/CD to the mix by leveraging GitHub Actions.

Although we won’t go over all the bits of the pipeline, you can still find the code of the underlying project in this GitHub repository.

Architectural Overview

The system architecture of the project is divided into three main parts. The first part is all about the core TFX pipeline handling all the steps from data ingestion to model deployment. The second part concerns the integration between the pipeline and the external Hugging Face 🤗 Hub service. The last one is about automation and implementing CI/CD using GitHub Actions.

Flowchart showing overall system architecture from parametrized GitHub action to continuous deployment to within GCP Environment to external

Figure 1. Overall system architecture (original)

It is common to open Pull Requests when proposing new features or code refactorings in separate branches. When it comes to ML projects, these changes usually affect the model and/or the data. Besides running basic validation on the proposed changes (code quality, tests, etc.), we should also ensure, before merging, that the changes produce a model good enough to replace the currently deployed one (if the changes pertain to modeling). In this project, we developed a GitHub Action that is manually triggered on the merging branch with configurable parameters. This way, project stakeholders can validate performance-related changes and reliably ship them to production. In reality, there might be more critical measurements to track, but we hope this GitHub Action proves to be a good starting point.

At the heart of any MLOps project, there is an ML pipeline. We built a simple yet complete ML pipeline with support for automatic data ingestion, data preprocessing, model training, model evaluation, and model deployment in TFX. The TFX pipeline could be run on a local environment, but we also ran it on the Vertex AI platform to replicate real-world production-grade environments.

Finally, the trained and qualified model from the ML pipeline is deployed to the Vertex AI Endpoint. The “blessed” model is also pushed to the Hugging Face Hub alongside an interactive demo via a custom HFPusher TFX component. Hugging Face Hub is a very popular place to store models and publish a fully working ML-powered interactive application for free. It is useful to showcase an application with the latest model and audit whether it works as expected before going to a full production deployment.

Below, we discuss each of these components in a little more detail, discussing our design considerations and non-trivial technical aspects.

TFX Pipeline

The ML pipeline is written entirely in TFX, from data ingestion to model deployment. Specifically, we used standard TFX components such as ExampleGen, ImportSchemaGen, Transform, Trainer, Evaluator, and Pusher, along with the custom HFPusher component. Let’s briefly look at the roles of each component in the context of our project.

Flowchart showing overview of the TFX ML pipeline. Pipeline could be run on Local and Cloud(Vertex Pipeline) environment

Figure 2. Overview of the ML pipeline (original)

ExampleGen

In this project, we prepared the Pets dataset in TFRecord format with these scripts and stored it in Google Cloud Storage (GCS). ExampleGen reads the data files from GCS, splits them into training and evaluation datasets according to glob patterns, and stores them as TFRecords in GCS. Note that ExampleGen can ingest different data formats such as CSV, TFRecord, or Parquet, and always emits datasets in the uniform TFRecord format, which lets us handle the data consistently throughout the TFX pipeline. Also, since the Pets dataset is available from TensorFlow Datasets, you could use a custom TFDS ExampleGen for this task instead.

ExampleGen can be integrated with Dataflow out of the box. All you need to do to benefit from Dataflow is to call the with_beam_pipeline_args method with appropriate parameters such as machine type, disk size, the number of workers, and so on. For context, Dataflow is a managed service provided by Google Cloud that allows us to run Apache Beam pipelines efficiently in a fully distributed manner.
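
As a rough sketch, enabling Dataflow for ExampleGen could look like the snippet below. The bucket path and worker settings are illustrative values, not taken from the project.

from tfx.components import ImportExampleGen

example_gen = ImportExampleGen(input_base="gs://YOUR_BUCKET/pets-tfrecords")

# Delegate the Beam-based processing of this component to Dataflow.
example_gen.with_beam_pipeline_args([
    "--runner=DataflowRunner",
    "--project=YOUR_GCP_PROJECT",
    "--region=us-central1",
    "--temp_location=gs://YOUR_BUCKET/tmp",
    "--machine_type=e2-standard-4",
    "--disk_size_gb=100",
    "--num_workers=4",
])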

ImportSchemaGen

ImportSchemaGen imports a schema file in Protocol Buffers text format that was previously inferred automatically by SchemaGen. The file can also be hand-tuned to define the structure of the output data from ExampleGen.

In our case, the prepared Pets dataset has two features – image and segmentation map (label), and the size of each feature is 128×128. Therefore, we could define a schema like the one below.

feature {
  name: "image"
  type: FLOAT

  float_domain {
    min: 0
    max: 255
  }

  shape {
    dim { size: 128 }
    dim { size: 128 }
    dim { size: 3 }
  }
}

feature {
  name: "label"
  type: FLOAT

  float_domain {
    min: 0
    max: 2
  }

  shape {
    dim { size: 128 }
    dim { size: 128 }
  }
}

Also note that in the float_domain section, we can set the value restrictions. In this project, the input data is standard RGB images, so each pixel value should be between 0 and 255. On the other hand, the pixel value of the label should be 0, 1, or 2, meaning outer, inner, and border of an object in an image, respectively.

Transform

With the help of ImportSchemaGen, the data arriving at Transform is already correctly shaped and validated. Without it, we would have to write code to parse the TFRecords and shape each feature manually inside Transform. Therefore, since the model in this project is built on top of MobileNetV2, the single line of code below is sufficient for data preprocessing.

# IMAGE_KEY is "image", which matches the feature name defined in ImportSchemaGen
image_features = mobilenet_v2.preprocess_input(inputs[IMAGE_KEY])
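
For context, a fuller preprocessing_fn might look roughly like the sketch below; the feature names follow the schema shown earlier, and the exact implementation in the project's repository may differ.

from tensorflow.keras.applications import mobilenet_v2

IMAGE_KEY = "image"
LABEL_KEY = "label"

def preprocessing_fn(inputs):
    """Transform raw features into model inputs."""
    outputs = {}
    # Scale raw pixel values (0-255) into the range MobileNetV2 expects.
    outputs[IMAGE_KEY] = mobilenet_v2.preprocess_input(inputs[IMAGE_KEY])
    # The segmentation map is already in {0, 1, 2}, so pass it through unchanged.
    outputs[LABEL_KEY] = inputs[LABEL_KEY]
    return outputs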

Since data preprocessing is a CPU- and memory-intensive job, Transform can also be integrated with Dataflow. Just like with ExampleGen, the job can be seamlessly delegated to Dataflow by calling the with_beam_pipeline_args method.

Trainer

(Vertex) Trainer simply trains a model. We used a UNet architecture built on top of MobileNetV2 from the TensorFlow official tutorial. Since the model architecture is nothing new, let’s take a look at how it is modularized and some of the key pieces of code.

pipeline/

├─ …
├─ models/
    ├─ common.py
    ├─ hyperparams.py
    ├─ signatures.py
    ├─ train.py
    ├─ unet.py

You place your modeling code in a separate file, which is supplied as a parameter to the Trainer. In this case, that file is named train.py. When the Trainer component is run, it looks for a starting point function named run_fn, which is defined in train.py. The run_fn() function pulls in the training and evaluation datasets from the output of Transform, trains the UNet model (defined in unet.py), then saves the trained model with appropriate signatures. The training process follows the standard Keras workflow: model.compile(), then model.fit().
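
A heavily simplified skeleton of run_fn is sketched below. The helpers input_fn and build_unet are illustrative placeholders standing in for the dataset-loading and model-building code in the repository, not the exact functions it uses.

from tfx.components.trainer.fn_args_utils import FnArgs

def run_fn(fn_args: FnArgs):
    # Read the transformed training and evaluation datasets produced by Transform.
    train_ds = input_fn(fn_args.train_files, fn_args.data_accessor)
    eval_ds = input_fn(fn_args.eval_files, fn_args.data_accessor)

    # Build the UNet model on top of a MobileNetV2 backbone (see unet.py).
    model = build_unet()
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    model.fit(
        train_ds,
        validation_data=eval_ds,
        steps_per_epoch=fn_args.train_steps,
        validation_steps=fn_args.eval_steps,
    )

    # Save the trained model with its serving and evaluation signatures
    # (see the model.save() call discussed below).
    model.save(fn_args.serving_model_dir, save_format="tf")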

The Trainer component can be integrated with Vertex AI Training out of the box, which is a managed service to train models on distributed infrastructure. Once you specify how the training server cluster should be configured in the Trainer's custom_config parameter, the training job is handled by Vertex AI Training automatically.
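
As a hedged sketch based on the public TFX Vertex AI integration (the project's actual configuration may differ), delegating training to Vertex AI Training looks roughly like this. Names such as GOOGLE_CLOUD_PROJECT, GOOGLE_CLOUD_REGION, PIPELINE_IMAGE, and module_file are assumed pipeline-level variables.

from tfx import v1 as tfx

# Worker pool specification for the Vertex AI custom training job (values illustrative).
vertex_job_spec = {
    "project": GOOGLE_CLOUD_PROJECT,
    "worker_pool_specs": [{
        "machine_spec": {"machine_type": "n1-standard-4"},
        "replica_count": 1,
        "container_spec": {"image_uri": PIPELINE_IMAGE},
    }],
}

trainer = tfx.extensions.google_cloud_ai_platform.Trainer(
    module_file=module_file,
    examples=transform.outputs["transformed_examples"],
    transform_graph=transform.outputs["transform_graph"],
    train_args=tfx.proto.TrainArgs(num_steps=1000),
    eval_args=tfx.proto.EvalArgs(num_steps=100),
    custom_config={
        tfx.extensions.google_cloud_ai_platform.ENABLE_VERTEX_KEY: True,
        tfx.extensions.google_cloud_ai_platform.VERTEX_REGION_KEY: GOOGLE_CLOUD_REGION,
        tfx.extensions.google_cloud_ai_platform.TRAINING_ARGS_KEY: vertex_job_spec,
    },
)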

It is also important to notice which signatures the model exports in TensorFlow. Consider the following code snippet that saves a trained model (of the tf.keras.Model instance) into a SavedModel resource.

model.save(
    fn_args.serving_model_dir,
    save_format="tf",
    signatures={
        "serving_default": model_exporter(model),
        "transform_features": transform_features_signature(
            model, tf_transform_output
        ),
        "from_examples": tf_examples_serving_signature(
            model, tf_transform_output
        ),
    },
)

Signatures are functions that define how given input data should be handled. We defined three different signatures: serving_default is used at serving time, while the other two are used during model evaluation.

  • serving_default transforms a single data point or a batch of data points from user requests, which are usually marshaled as JSON (base64 encoded) for HTTP or as serialized Protocol Buffer messages for gRPC, then runs the model prediction on the data.
  • transform_features applies the transformation graph obtained from the Transform component to the data produced by ExampleGen. This function is used in the Evaluator component, so that the raw evaluation inputs from ExampleGen are transformed into a format the model can understand.
  • from_examples performs data transformation and model prediction sequentially. Its data transformation step is identical to what the transform_features function does.

Note that the transform_features and from_examples signatures are used internally by the Evaluator component. In the next section, we explain how they are connected.
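
For reference, a transform_features-style signature following the standard TFX Keras pattern looks roughly like the sketch below; the project's actual implementation lives in signatures.py and may differ in detail.

import tensorflow as tf

def transform_features_signature(model, tf_transform_output):
    # Attach the Transform graph to the model so it is exported with the SavedModel.
    model.tft_layer = tf_transform_output.transform_features_layer()

    @tf.function(input_signature=[
        tf.TensorSpec(shape=[None], dtype=tf.string, name="examples")
    ])
    def serve_fn(serialized_tf_examples):
        # Parse serialized tf.Examples produced by ExampleGen...
        raw_feature_spec = tf_transform_output.raw_feature_spec()
        raw_features = tf.io.parse_example(serialized_tf_examples, raw_feature_spec)
        # ...and apply the same preprocessing that Transform applied during training.
        return model.tft_layer(raw_features)

    return serve_fn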

Evaluator

The performance of the trained model should be evaluated against certain criteria or metrics. Evaluator lets us define such metrics; it not only evaluates the trained model itself but also compares it to the last best model retrieved by Resolver. In other words, the trained model will be deployed only if it achieves performance above the baseline threshold and is better than the previously deployed model. The full configurations for this project can be found here.

EVAL_CONFIGS = tfma.EvalConfig(
    model_specs=[
        tfma.ModelSpec(
            signature_name="from_examples",
            preprocessing_function_names=["transform_features"],
        )
    ],
    …
)

The reason we have both transform_features and from_examples signatures doing the same data preprocessing is that they are used in different situations. Evaluator runs the evaluate() method on the existing (baseline) model, whereas on the currently trained model it runs the function (signature) specified in signature_name. Therefore, we need not only a function that transforms a given sample but also one that runs the model prediction at the same time.

Pusher

When the trained model passes evaluation and is deemed ready to deploy, (Vertex) Pusher pushes it to the Model Registry in Vertex AI. It can also optionally create an Endpoint and deploy the model to that endpoint out of the box. You can pass a number of deployment-specific configurations to Pusher: machine type, GPU type, the number of GPUs, traffic splits, and so on.

Integration with Hugging Face 🤗 Hub

Hugging Face Hub offers ML practitioners a powerful way to store and share models, datasets, and ML applications. Since it offers seamless support for storing model artifacts with automatic version control, we developed a custom TFX component named HFPusher that:

  • takes a model artifact (in the SavedModel format) and pushes it to the Hub in a separate branch for better segregation. The branch name is derived from time.time().
  • creates and pushes a model card that includes attributes of the model, enabling discovery of the model on the Hugging Face Hub platform.
  • hosts an application built around the model on Hugging Face Spaces, given an application template, referencing the branch to which the model artifact was pushed.

You can use this component anywhere after the Trainer component, but it’s recommended to use it at the end of a TFX pipeline. The HFPusher component only requires a handful of arguments consisting of two TFX artifacts and four Hugging Face specific configurations:

  • Hugging Face user name
  • Hugging Face access token for creating and modifying repositories on the Hugging Face Hub, which is automatically injected via a GitHub Action (see the next section)
  • Name of the repository to which the model artifacts will be pushed
  • Model artifact as an output of a previous component such as Trainer
  • Hugging Face Space-specific configurations (optional)
    • Application template to host a Space application
    • Name of the repository to which the Space application will be pushed. By default, it is the same as the model repository name.
    • Space SDK. The default value is gradio, but it can be set to streamlit.
  • Model blessing artifact as an output of a previous component such as Evaluator (optional)

The Hugging Face Hub is primarily based on Git and Git-LFS, and the Hugging Face team provides the easy-to-use huggingface_hub API toolkit to interact with it. That is how HFPusher provides seamless support for version control, large file storage, and Hub interaction.
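
As a rough illustration of what HFPusher does with this toolkit (variable names such as HF_ACCESS_TOKEN, HF_USER_NAME, MODEL_REPO_NAME, and SAVED_MODEL_DIR are placeholders, and the component's real implementation may differ):

import time
from huggingface_hub import HfApi

api = HfApi(token=HF_ACCESS_TOKEN)          # token injected from a GitHub Action secret
repo_id = f"{HF_USER_NAME}/{MODEL_REPO_NAME}"
branch = str(int(time.time()))              # one branch per pushed model version

api.create_repo(repo_id=repo_id, exist_ok=True)
api.create_branch(repo_id=repo_id, branch=branch)
api.upload_folder(
    repo_id=repo_id,
    folder_path=SAVED_MODEL_DIR,            # the SavedModel artifact from Trainer
    revision=branch,
)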

In Figures 3 and 4, we show how the model repository and the application repository (which were automatically created from a TFX pipeline) look on the Hugging Face Hub.

Screenshot showing model versioning in Hugging Face Model Hub
Figure 3. Model versioning in Hugging Face Model Hub (original)
Screenshot of a simple demo for semantic segmentation model trained on the PETS dataset
Figure 4. Automatically published application in Hugging Face Space Hub (original)

HFPusher has been contributed to the official tfx-addons package and will be available in version 0.4.0 and later.

Automation with GitHub Actions

In the DevOps world, we usually run a number of tests on the changes introduced to ensure they’re valid enough to hit production. If the tests pass, the changes are merged and a new deployment is shipped automatically.

For an ML codebase, the changes are usually related to either the data or the model at a broad level. Validating these changes is quite application-dependent, but there is still common ground:

  • Do the changes introduced on the modeling side lead to better performance metrics?
  • Do the changes lead to faster training throughput?
  • Do the data-related changes reflect some distribution better?

We focused on the first point in this project and designed a GitHub Action workflow that does the following:

1. Google Cloud authentication and setup are done with the google-github-actions/auth and google-github-actions/setup-gcloud GitHub Actions when a credential (JSON) is provided. In order to use the appropriate credentials for the specified Google Cloud project ID, the workflow looks the credentials up in GitHub Action secrets. Each credential is stored under a name identical to its Google Cloud project ID.

2. Some of the sensitive information is substituted with the envsubst command. In this project, a Hugging Face 🤗 access token must be provided to the HFPusher component to create and update repositories on the Hugging Face 🤗 Hub. The access token is stored as a GitHub Action secret.

3. An environment variable, enable_dataflow, is set to "true" or "false" based on the specified parameter. By looking up this environment variable, the TFX pipeline conditionally defines dedicated parameters for Dataflow and passes them to the ExampleGen and Transform components via the with_beam_pipeline_args method (see the sketch after the CLI commands below).

4. The last part of the workflow compiles and runs the TFX pipeline on Vertex AI with the TFX CLI commands shown below. The tfx pipeline create command creates the pipeline and registers it on the local system; it can also build and push a Docker image to Google Container Registry (GCR) based on a custom Dockerfile in the pipeline. The tfx run create command then runs the pipeline on Vertex AI with the specified Google Cloud project ID and region.

tfx pipeline create \
  --pipeline-path kubeflow_runner.py \
  --engine vertex --build-image

tfx run create \
  --engine vertex \
  --pipeline-name PIPELINE_NAME \
  --project GCP_PROJECT_ID --region GCP_REGION
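
For the enable_dataflow flag mentioned in step 3, the pipeline-side logic can be sketched roughly as follows; the variable names for the project, region, and bucket are illustrative, not taken from the repository.

import os

# Dataflow parameters are only defined when the GitHub Action sets enable_dataflow=true.
beam_pipeline_args = []
if os.environ.get("enable_dataflow", "false").lower() == "true":
    beam_pipeline_args = [
        "--runner=DataflowRunner",
        "--project=" + GOOGLE_CLOUD_PROJECT,
        "--region=" + GOOGLE_CLOUD_REGION,
        "--temp_location=gs://" + GCS_BUCKET_NAME + "/tmp",
    ]

example_gen.with_beam_pipeline_args(beam_pipeline_args)
transform.with_beam_pipeline_args(beam_pipeline_args)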

In this setup, we need to verify for each PR that the suggested modification works at both build and run time. Also, collaborators sometimes want to run the ML pipeline with their own Google Cloud accounts. Furthermore, it is better if we can conditionally delegate some heavy jobs in the ML pipeline to more dedicated Google Cloud services.

Figure 5. GitHub Action for CI/CD of ML pipeline (original)

As you may notice from Figure 5, the GitHub Action runs a workflow based on five different parameters: the branch, the Google Cloud project ID, the cloud region, the name of the TFX pipeline, and whether to enable the Dataflow integration.

Conclusion

In this post, we discussed how to build an end-to-end ML pipeline for semantic segmentation tasks. We leveraged TensorFlow, TFX, and Google Cloud services such as Dataflow and Vertex AI, GitHub Actions, and Hugging Face 🤗 Hub to develop a production-grade ML pipeline with external services along with semi-automatic CI/CD pipelines. We hope that you found this setup useful and reliable and that you will use this in your own ML pipeline projects.

As a future work, we will demonstrate a common MLOps scenario by extending this project. First, we’ll add more complexities to the data to simulate model performance degradation. Second, we’ll evaluate the currently deployed model to see if the model performance degradation actually happened. Last, we’ll verify the model performance is recovered after replacing the current model architecture with better ones such as DeepLabV3+ or SegFormer.

Acknowledgements

We are grateful to the ML Developer Programs team that provided Google Cloud credits to support our experiments. We thank Robert Crowe for providing us with helpful feedback and guidance. We also thank Merve Noyan who worked on integrating the model card utilities into the HFPusher component.

Read More

Sequoia Capital’s Pat Grady and Sonya Huang on Generative AI

Sequoia Capital’s Pat Grady and Sonya Huang on Generative AI

For insights into the future of generative AI, check out the latest episode of the NVIDIA AI Podcast. Host Noah Kravitz is joined by Pat Grady and Sonya Huang, partners at Sequoia Capital, to discuss their recent essay, “Generative AI: A Creative New World.”

The authors delve into the potential of generative AI to enable new forms of creativity and expression, as well as the challenges and ethical considerations of this technology.

Grady and Huang emphasize the potential of generative AI to revolutionize industries such as art, design and media by allowing for the creation of unique, personalized content on a scale that would be impossible for humans to achieve alone.

They also address the importance of considering the ethical implications of the technology, including the potential for biased or harmful outputs and the need for responsible use and regulation.

Listen to the full episode to hear more about the possibilities of generative AI and the considerations to be made as this technology moves forward.

You Might Also Like

Art(ificial) Intelligence: Pindar Van Arman Builds Robots That Paint

Pindar Van Arman, an American artist and roboticist, designs painting robots that explore the differences between human and computational creativity. Since his first system in 2005, he has built multiple artificially creative robots. The most famous, Cloud Painter, was awarded first place at Robotart 2018.

Real or Not Real? Attorney Steven Frank Uses Deep Learning to Authenticate Art

Steven Frank is a partner at the law firm Morgan Lewis, specializing in intellectual property and commercial technology law. He’s also half of the husband-wife team that used convolutional neural networks to authenticate artistic masterpieces, including da Vinci’s Salvator Mundi, with AI’s help.

GANTheftAuto: Harrison Kinsley on AI-Generated Gaming Environments

Humans playing games against machines is nothing new, but now computers can develop games for people to play. Programming enthusiast and social media influencer Harrison Kinsley created GANTheftAuto, an AI-based neural network that generates a playable chunk of the classic video game Grand Theft Auto V.

Subscribe to the AI Podcast: Now Available on Amazon Music

You can now listen to the AI Podcast through Amazon Music.

Also get the AI Podcast through Apple Music, Google Podcasts, Google Play, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, Soundcloud, Spotify, Stitcher and TuneIn.

Read More

Roll Model: Smart Stroller Pushes Its Way to the Top at CES 2023

Roll Model: Smart Stroller Pushes Its Way to the Top at CES 2023

As any new mom or dad can tell you, parenting can be a challenge — packed with big worries and small hassles. But it may be about to get a little bit easier thanks to Glüxkind Technologies and their smart stroller, Ella.

The company has just been named a CES 2023 Innovation Awards Honoree for their AI-powered stroller, which was designed to make life easier for new parents and caregivers.

“People love the rock-a-baby feature and the push and brake assist,” said Glüxkind co-founder and CEO Kevin Huang of a product that’s become an instant sensation at the annual technology industry confab. “When you’re able to hold your child and have the stroller take care of itself, that’s a pretty magical moment.”

The story behind the product that’s made headlines around the world began three years ago when Huang and his co-founder, Anne Hunger, had a baby daughter and went stroller shopping.

And, like all parents, they learned about the challenges of wrangling a stroller packed with baby gear and the safety concerns that have new parents shopping for the safest vehicles they can afford.

“I realized, ‘man, this stuff hasn’t changed in the last 30 years,’” Huang said.

Modern cars, for example, are equipped with systems that ensure they don’t roll backward when you’re stopped on a hill, Huang explained.

“So I thought maybe we can add some of the things already there for cars into this platform that actually carries our children, so we can have a safer and more convenient experience.”

The response from parents at CES was overwhelmingly positive. No surprise, given the in-depth research Huang and his team conducted with new parents.

But it’s also wowed tech enthusiasts worldwide, earning honors from the awards program produced by the Consumer Technology Association, the trade group behind the annual Las Vegas conference.

“We came to CES with the idea of announcing the product and getting maybe three to five writeups about what we were doing,” Huang said. “We didn’t expect the overwhelming amount of exposure we received.”

This year’s CES Innovation Awards program — overseen by an elite panel of judges, including media members, designers and engineers — received a record-high number of over 2,100 submissions, making it no small feat for Ella to come out on top.

AI-Powered Stroller Makes Parenting a Walk in the Park

Huang reports that NVIDIA’s Jetson edge AI platform powers the startup’s entire AI stack.

Glüxkind, based in Vancouver, Canada, is a member of NVIDIA Inception, a free program designed to help startups evolve faster through access to cutting-edge technology and NVIDIA experts, opportunities to connect with venture capitalists, and co-marketing support to heighten the company’s visibility.

With Jetson, Huang explains, the stroller is able to use computer vision to map its surroundings, relying on Jetson’s GPU and CPU for processing and pathfinding.

As a result, when the child isn’t in the stroller, parents can activate Ella’s intelligent hands-free strolling mode.

This advanced parent-assist technology helps parents focus on their kids rather than wrangling an empty stroller packed with diapers, snacks, and other supplies.

“It stays out of the way when you don’t need it, but it’s there when you do need it,” Huang said.

But while the stroller is intelligent — able to follow a caregiver as they hold a baby or help ensure the stroller doesn’t roll away on its own — it’s not designed to work independently.

Quite the opposite. With Ella’s adaptive push and brake assistance, caregivers can enjoy effortless walks no matter the terrain — uphill, downhill or even when fully loaded with groceries and toys.

Ella also has features that make parenting easier, such as Rock-My-Baby mode to help little ones get the sleep they need and built-in white noise playback.

“We’re trying to make it so the technology we’re building is augmentative to the parents’ experience to make parenting easier and safer,” Huang said.

The result: while parenting will never be a walk in the park, actually taking that newborn for a walk in the park will soon be a lot less of a hassle.

Image Credit: Glüxkind Technologies

Read More

Artist Zhelong Xu Brings Chinese Zodiac to Life for Lunar New Year This Week ‘In the NVIDIA Studio’

Artist Zhelong Xu Brings Chinese Zodiac to Life for Lunar New Year This Week ‘In the NVIDIA Studio’

Editor’s note: This post is part of our weekly In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks, and demonstrates how NVIDIA Studio technology improves creative workflows. We’re also deep diving on new GeForce RTX 40 Series GPU features, technologies and resources, and how they dramatically accelerate content creation.

To celebrate the upcoming Lunar New Year holiday, NVIDIA artist Zhelong Xu, aka Uncle Light, brought Chinese zodiac signs to life this week In the NVIDIA Studio — modernizing the ancient mythology in his signature style.

Chinese Tradition Brought to the 21st Century

NVIDIA principal artist Zhelong Xu is also co-founder of the Shanghai Magicstone Images studio and an art consultant for the Tencent TiMi Studio Group.

Xu is deeply passionate about modeling Chinese zodiac signs in 3D. His first serious attempt, Carefree Sheep, was chosen by Adobe Substance Painter, previously Allegorithmic, as the artwork on its first software launch screen.

‘Carefree Sheep’ by 3D artist Zhelong Xu.

Xu creates at least one piece for his zodiac series each year. Harboring a Cute Tiger is his most popular work, which reached over 16 million people on the Chinese social media app Weibo.

‘Harboring a Cute Tiger’ by Zhelong Xu.

“I had the idea to turn this series into ceramic works, so I continued working with my friends in Jingdezhen to turn this series into physical objects,” he said.

Zodiac piece for the Year of the Rabbit.

“I wanted to do something different in the Year of the Rabbit, so I chose to color the rabbit in NVIDIA green to match the classical Chinese atmosphere and to bring out the Chinese New Year energy,” said Xu, who joined NVIDIA last year.

The two emerald rabbits, one with its ears up and the other with them down, are designed to look like they’re teeming with anticipation for the arrival of Lunar New Year.

Xu deployed ZBrush for initial modeling with its custom sculpting tools. He then UV mapped the 3D model in preparation for applying a special emerald texture made in Adobe Substance 3D Painter. NVIDIA RTX-accelerated light- and ambient-occlusion features baked and optimized the scene assets in mere seconds, letting Xu experiment with textures quickly and easily with his GeForce RTX 4090 GPU.

Lighting adjustments in Blender.

The artist quickly exported files to Blender to set up the environment and tinker with lighting. He added many Eastern-style architectural and furniture options from the PBRMAX.com asset library.

High-quality 3D assets gathered from the PBRMAX.com asset library.

Movement within the viewport was seamless with Blender Cycles RTX-accelerated OptiX ray tracing for interactive, photorealistic modeling.

Xu then deployed his secret weapon: NVIDIA Omniverse, a platform for creating and operating metaverse applications. He saved files in Universal Scene Description (USD) format using the Omniverse export plug-in to import them into the NVIDIA Omniverse Create app for final modeling. Here, Xu made adjustments to the translucent emerald material to make it as realistic as possible.

USD format enables import into Omniverse Create.

Omniverse Create was incredibly useful for scene modifications, Xu said, as it enabled him to test lighting with his scene rendering in real time. This provided him with the most accurate iteration of final renders, allowing for more meaningful real-time edits.

“Thanks to the power of the GeForce RTX 4090 GPU and RTX optimization in Omniverse, I got the desired effect very quickly and tested a variety of lighting effects,” he said.

Final environmental edits in Omniverse Create.

Omniverse gives 3D artists their choice of renderer within the viewport, with support for Pixar HD Storm, Chaos V-Ray, Maxon’s Redshift, OTOY Octane, Blender Cycles and more. Xu deployed the unbiased NVIDIA Iray renderer to complete the project.

3D artist Zhelong Xu.

View more of Xu’s work on ArtStation.

#NewYearNewArt Challenge 

With a new year will come new art, and we’d love to see yours! Use the hashtag #NewYearNewArt and tag @NVIDIAStudio to show off recent creations for a chance to be featured on our channels.

The challenge is off to a great start:

Excellent artists like @rabbit.hole_renders have helped kick off the challenge with creativity that’s taking people to new worlds.  

Plus, get a dose of potassium with @graffitpl’s banana-based animation that comes with a side of mushrooms.

Keep your eyes peeled for more amazing submissions on the NVIDIA Studio Instagram stories.

Access tutorials on the Studio YouTube channel and get updates directly in your inbox by subscribing to the Studio newsletter.

Read More

Set up Amazon SageMaker Studio with Jupyter Lab 3 using the AWS CDK

Set up Amazon SageMaker Studio with Jupyter Lab 3 using the AWS CDK

Amazon SageMaker Studio is a fully integrated development environment (IDE) for machine learning (ML) partly based on JupyterLab 3. Studio provides a web-based interface to interactively perform ML development tasks required to prepare data and build, train, and deploy ML models. In Studio, you can load data, adjust ML models, move in between steps to adjust experiments, compare results, and deploy ML models for inference.

The AWS Cloud Development Kit (AWS CDK) is an open-source software development framework to create AWS CloudFormation stacks through automatic CloudFormation template generation. A stack is a collection of AWS resources that can be programmatically updated, moved, or deleted. AWS CDK constructs are the building blocks of AWS CDK applications, representing the blueprint to define cloud architectures.

Setting up Studio with AWS CDK has become a streamlined process. The AWS CDK allows you to use native constructs to define and deploy Studio using infrastructure as code (IaC), including AWS Identity and Access Management (AWS IAM) permissions and desired cloud resource configurations, all in one place. This development approach can be used in combination with other common software engineering best practices such as automated code deployments, tests, and CI/CD pipelines. The AWS CDK reduces the time required to perform typical infrastructure deployment tasks while shrinking the surface area for human error through automation.

This post guides you through the steps to get started with setting up and deploying Studio to standardize ML model development and collaboration with fellow ML engineers and ML scientists. All examples in the post are written in the Python programming language. However, the AWS CDK offers built-in support for multiple other programming languages like JavaScript, Java and C#.

Prerequisites

To get started, the following prerequisites apply:

Clone the GitHub repository

First, let’s clone the GitHub repository.

When the repository is successfully pulled, you may inspect the cdk directory containing the following resources:

  • cdk – Contains the main cdk resources
  • app.py – Where the AWS CDK stack is defined
  • cdk.json – Contains metadata and feature flags

AWS CDK scripts

The two main files we want to look at in the cdk subdirectory are sagemaker_studio_construct.py and sagemaker_studio_stack.py. Let’s look at each file in more detail.

Studio construct file

The Studio construct is defined in the sagemaker_studio_construct.py file.

The Studio construct takes in the virtual private cloud (VPC), listed users, AWS Region, and underlying default instance type as parameters. This AWS CDK construct serves the following functions:

  • Creates the Studio domain (SageMakerStudioDomain)
  • Sets the IAM role sagemaker_studio_execution_role with AmazonSageMakerFullAccess permissions required to create resources. Permissions need to be scoped down further to follow the least privilege principle for improved security.
  • Sets Jupyter server app settings – takes in JUPYTER_SERVER_APP_IMAGE_NAME, defining the jupyter-server-3 container image to be used.
  • Sets kernel gateway app settings – takes in KERNEL_GATEWAY_APP_IMAGE_NAME, defining the datascience-2.0 container image to be used.
  • Creates a user profile for each listed user

The following code snippet shows the relevant Studio domain AWS CloudFormation resources defined in AWS CDK:

sagemaker_studio_domain = sagemaker.CfnDomain(
    self,
    "SageMakerStudioDomain",
    auth_mode="IAM",
    default_user_settings=sagemaker.CfnDomain.UserSettingsProperty(
        execution_role=self.sagemaker_studio_execution_role.role_arn,
        jupyter_server_app_settings=sagemaker.CfnDomain.JupyterServerAppSettingsProperty(
            default_resource_spec=sagemaker.CfnDomain.ResourceSpecProperty(
                instance_type="system",
                sage_maker_image_arn=get_sagemaker_image_arn(
                    JUPYTER_SERVER_APP_IMAGE_NAME, aws_region
                ),
            )
        ),
        kernel_gateway_app_settings=sagemaker.CfnDomain.KernelGatewayAppSettingsProperty(
            default_resource_spec=sagemaker.CfnDomain.ResourceSpecProperty(
                instance_type=default_instance_type,
                sage_maker_image_arn=get_sagemaker_image_arn(
                    KERNEL_GATEWAY_APP_IMAGE_NAME, aws_region
                ),
            ),
        ),
        security_groups=[vpc.vpc_default_security_group],
        sharing_settings=sagemaker.CfnDomain.SharingSettingsProperty(
            notebook_output_option="Disabled"
        ),
    ),
    domain_name="SageMakerStudioDomain",
    subnet_ids=private_subnets,
    vpc_id=vpc.vpc_id,
    app_network_access_type="VpcOnly",
)

The following code snippet shows the user profiles created from AWS CloudFormation resources:

for user_name in user_names:
    sagemaker.CfnUserProfile(
        self,
        "SageMakerStudioUserProfile_" + user_name,
        domain_id=sagemaker_studio_domain.attr_domain_id,
        user_profile_name=user_name,
    )

Studio stack file

After the construct has been defined, you can add it by creating an instance of the class and passing the required arguments inside of the stack. The stack creates the AWS CloudFormation resources as part of one coherent deployment. This means that if at least one cloud resource fails to be created, the CloudFormation stack rolls back any changes performed. The following code snippet shows how the Studio construct is instantiated inside of the Studio stack:

class SagemakerStudioStack(Stack):
    def __init__(
        self,
        scope: Construct,
        construct_id: str,
        **kwargs,
    ) -> None:
        super().__init__(scope, construct_id, **kwargs)
        vpc = ec2.Vpc(self, "SageMakerStudioVpc")
        SageMakerStudio(self, "SageMakerStudio", vpc=vpc, aws_region=self.region)

Deploy the AWS CDK stack

To deploy your AWS CDK stack, run the following commands from the project’s root directory within your terminal window:

aws configure
pip3 install -r requirements.txt
cdk bootstrap --app "python3 -m cdk.app"
cdk deploy --app "python3 -m cdk.app"

Review the resources the AWS CDK creates in your AWS account and select yes when prompted to deploy the stack.  Wait for your stack deployment to finish.  This typically takes less than 5 minutes; however, adding more resources will prolong deployment time. You can also check the deployment status on the AWS CloudFormation console.

Stack creation in CloudFormation

When the stack has been successfully deployed, check its information by going to the Studio Control Panel.  You should see the SageMaker Studio user profile you created.

Default user profile listed

If you redeploy the stack, it checks for changes and performs only the cloud resource updates that are necessary. For example, you can use this to add users or change their permissions without having to recreate all of the defined cloud resources.
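
For instance, adding a user could be as simple as editing the construct instantiation in the stack and redeploying. The user_names parameter below is an assumption based on the construct description earlier; check the repository for the exact parameter name.

# Hypothetical sketch: user_names is assumed to be the construct parameter
# that holds the list of Studio user profiles to create.
SageMakerStudio(
    self,
    "SageMakerStudio",
    vpc=vpc,
    aws_region=self.region,
    user_names=["data-scientist-1", "data-scientist-2"],  # add or remove users here
)

Running cdk deploy --app "python3 -m cdk.app" again would then create only the new CfnUserProfile resources.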

Cleanup

To delete a stack, complete the following steps:

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Open the stack you want to delete.
  3. In the stack details pane, choose Delete.
  4. Choose Delete stack when prompted.

AWS CloudFormation will delete the resources created when the stack was deployed. This may take some time depending on the number of resources created.

If you encounter any issues going through these cleanup steps, you may need to manually delete the Studio domain first before repeating the steps in this section.

Conclusion

In this post, we showed how to use AWS cloud-native IaC resources to build an easily reusable template for Studio deployments. SageMaker Studio is a fully integrated web-based IDE that provides a visual interface for ML development tasks, based on JupyterLab 3. With AWS CDK stacks, we were able to define constructs for building out cloud components that can be easily modified, edited, or deleted by making changes to the underlying CloudFormation stack.

For more information, see Amazon SageMaker Studio.


About the Authors

Cory Hairston is a Software Engineer at the Amazon ML Solutions Lab. He is ardent about learning new technologies and leveraging that information to build reusable software solutions. He is an avid power-lifter and spends his free time making digital art.

Marcelo Aberle is an ML Engineer in the AWS AI organization. He is leading MLOps efforts at the Amazon ML Solutions Lab, helping customers design and implement scalable ML systems. His mission is to guide customers on their enterprise ML journey and accelerate their ML path to production.

Yash Shah is a Science Manager in the Amazon ML Solutions Lab. He and his team of applied scientists and machine learning engineers work on a range of machine learning use cases from healthcare, sports, automotive and manufacturing.

Read More

Churn prediction using multimodality of text and tabular features with Amazon SageMaker Jumpstart

Churn prediction using multimodality of text and tabular features with Amazon SageMaker Jumpstart

Amazon SageMaker JumpStart is the Machine Learning (ML) hub of SageMaker providing pre-trained, publicly available models for a wide range of problem types to help you get started with machine learning.

Understanding customer behavior is top of mind for every business today. Gaining insights into why and how customers buy can help grow revenue. Customer churn is a problem faced by a wide range of companies, from telecommunications to banking, where customers are typically lost to competitors. It’s in a company’s best interest to retain existing customers instead of acquiring new customers, because it usually costs significantly more to attract new customers. When trying to retain customers, companies often focus their efforts on customers who are more likely to leave. User behavior and customer support chat logs can contain valuable indicators on the likelihood of a customer ending the service. In this solution, we train and deploy a churn prediction model that uses a state-of-the-art natural language processing (NLP) model to find useful signals in text. In addition to textual inputs, this model uses traditional structured data inputs such as numerical and categorical fields.

Multimodality is a multi-disciplinary research field that addresses some of the original goals of artificial intelligence by integrating and modeling multiple modalities. This post aims to build a model that can process and relate information from multiple modalities such as tabular and textual features.

We show you how to train, deploy and use a churn prediction model that has processed numerical, categorical, and textual features to make its prediction. Although we dive deep into a churn prediction use case in this post, you can use this solution as a template to generalize fine-tuning pre-trained models with your own dataset, and subsequently run hyperparameter optimization (HPO) to improve accuracy. You can even replace the example dataset with your own and run it end to end to solve your own use cases. The solution outlined in the post is available on GitHub.

JumpStart solution templates

Amazon SageMaker JumpStart provides one-click, end-to-end solutions for many common ML use cases. Explore the following use cases for more information on available solution templates:

The JumpStart solution templates cover a variety of use cases, under each of which several different solution templates are offered (for example, the Document Understanding solution is under the "Extract and analyze data from documents" use case).

Choose the solution template that best fits your use case from the JumpStart landing page. For more information on specific solutions under each use case and how to launch a JumpStart solution, see Solution Templates.

Solution overview

The following figure demonstrates how you can use this solution with Amazon SageMaker components. The SageMaker training jobs are used to train the various NLP models, and SageMaker endpoints are used to deploy the models in each stage. We use Amazon Simple Storage Service (Amazon S3) alongside SageMaker to store the training data and model artifacts, and Amazon CloudWatch to log training and endpoint outputs.

We approach solving the churn prediction problem with the following steps:

  1. Data exploration to prepare the data to be ML ready.
  2. Train a multimodal model with a Hugging Face sentence transformer and Scikit-learn random forest classifier.
  3. Further improve the model performance with HPO using SageMaker automatic model tuning.
  4. Train two AutoGluon multimodal models: an AutoGluon multimodal weighted/stacked ensemble model, and an AutoGluon multimodal fusion model.
  5. Evaluate and compare the model performances on the holdout test data.

Prerequisites

To try out the solution in your own account, make sure that you have the following in place:

  • An AWS account. If you don’t have an account, you can sign up for one.
  • The solution outlined in the post is part of SageMaker JumpStart. To run this JumpStart solution and have the infrastructure deploy to your AWS account, you must create an active Amazon SageMaker Studio instance (see Onboard to Amazon SageMaker Studio). When your Studio instance is ready, use the instructions in JumpStart to launch the solution.
  • When running this notebook on Studio, you should make sure the Python 3 (PyTorch 1.10 Python 3.8 CPU Optimized) image/kernel is used.

You can install the required packages as outlined in the solution to run this notebook.

Open the churn prediction use case

On the Studio console, choose Solutions, models, example notebooks under Quick start solutions in the navigation pane. Navigate to the Churn Prediction with Text solution in JumpStart.

Now we can take a closer look at some of the assets that are included in this solution.

Data exploration

First, let’s download the test, validation, and train datasets from the source S3 bucket and upload them to our S3 bucket. The following screenshot shows 10 observations of the training data.

Let’s begin exploring the train and validation dataset.

As you can see, we have different features such as CustServ Calls, Day Charge, and Day Calls that we use to predict the target column y (whether the customer left the service).

y is known as the target attribute: the attribute that we want the ML model to predict. Because the target attribute is binary, our model performs binary prediction, also known as binary classification.

There are 21 features, including the target variable. The number of examples for training and validation data are 43,000 and 5,000, respectively.

The following screenshot shows the summary statistics of the training dataset.

We have explored the dataset and split it into training, validation, and test sets. The training and validation set is used for training and HPO. The test set is used as the holdout set for model performance evaluation. We now carry out feature engineering steps and then fit the model.

Fit a multimodal model with a Hugging Face sentence transformer and Scikit-learn random forest classifier

The model training consists of two components: a feature engineering step that processes numerical, categorical, and text features, and a model fitting step that fits the transformed features into a Scikit-learn random forest classifier.

For the feature engineering, we complete the following steps:

  1. Fill in the missing values for numerical features.
  2. Encode categorical features into one-hot values, where the missing values are counted as one of the categories for each feature.
  3. Use a Hugging Face sentence transformer to encode the text feature into an X-dimensional dense vector, where the value of X depends on the particular sentence transformer (a sketch of these steps is shown after this list).
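
A minimal sketch of these three steps, using the example dataset's column names; the solution's actual implementation inside the training container may differ.

import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

def engineer_features(df: pd.DataFrame) -> np.ndarray:
    numerical_cols = ["CustServ Calls", "Account Length"]
    categorical_cols = ["plan", "limit"]

    # 1. Fill missing numerical values (here: with the column median).
    numerical = df[numerical_cols].fillna(df[numerical_cols].median())

    # 2. One-hot encode categoricals; missing values become their own category.
    categorical = pd.get_dummies(df[categorical_cols].fillna("missing"))

    # 3. Encode the text feature into a dense vector with a sentence transformer.
    encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    text_embeddings = encoder.encode(df["text"].fillna("").tolist())

    return np.hstack([numerical.to_numpy(), categorical.to_numpy(), text_embeddings])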

We choose the top three most downloaded sentence transformer models and use them in the following model fitting and HPO. Specifically, we use all-MiniLM-L6-v2, multi-qa-mpnet-base-dot-v1, and paraphrase-MiniLM-L6-v2. For hyperparameters of the random forest classifier, refer to the GitHub repo.

The following figure depicts the model architecture diagram.

There are many hyperparameters you can tune, such as n-estimators, max-depth, and bootstrap. For more details, refer to the GitHub repo.

For demonstration purposes, we only use the numerical features CustServ Calls and Account Length, the categorical features plan and limit, and the text feature text to fit the model. Multiple feature names should be separated by a comma (,).

hyperparameters = {
    "n-estimators": 50,
    "min-impurity-decrease": 0.0,
    "ccp-alpha": 0.0,   
    "sentence-transformer": "sentence-transformers/all-MiniLM-L6-v2",
    "criterion": "gini",
    "max-depth": 6,
    "boostrap": "True",
    "min-samples-split": 4,
    "min-samples-leaf": 1,
    "balanced-data": True,
    "numerical-feature-names": "CustServ Calls,Account Length",
    "categorical-feature-names": "plan,limit",
    "textual-feature-names": "text",
    "label-name": "y"
}
current_folder = utils.get_current_folder(globals())
estimator = PyTorch(
    framework_version='1.5.0',
    py_version='py3',
    entry_point='entry_point.py',
    source_dir=str(Path(current_folder, '../containers/huggingface_transformer_randomforest').resolve()),
    hyperparameters=hyperparameters,
    role=config.IAM_ROLE,
    instance_count=1,
    instance_type=config.TRAINING_INSTANCE_TYPE,
    output_path='s3://' + str(Path(config.S3_BUCKET, config.OUTPUTS_S3_PREFIX_RF)),
    code_location='s3://' + str(Path(config.S3_BUCKET, config.OUTPUTS_S3_PREFIX_RF)),
    base_job_name=config.SOLUTION_PREFIX,
    tags=[{'Key': config.TAG_KEY, 'Value': config.SOLUTION_PREFIX}],
    sagemaker_session=sagemaker_session,
    volume_size=30
)
estimator.fit({
    'train': 's3://' + str(Path(config.S3_BUCKET, config.DATASETS_S3_PREFIX, 'train.jsonl')),
    'validation': 's3://' + str(Path(config.S3_BUCKET, config.DATASETS_S3_PREFIX, 'validation.jsonl'))
})

We deploy the model after training is complete:

from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer
predictor = estimator.deploy(
    endpoint_name=endpoint_name,
    instance_type=config.HOSTING_INSTANCE_TYPE,
    initial_instance_count=1,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer()
)

When calling our new endpoint from the notebook, we use a SageMaker SDK Predictor. A Predictor is used to send data to an endpoint (as part of a request) and interpret the response. JSON is used as the format for both input data and output response because it’s a standard endpoint format and the endpoint response can contain nested data structures.

With our model successfully deployed and our predictor configured, we can try out the churn prediction model on an example input:

data = {
    "CustServ Calls": -20.0,
    "Account Length": 133.12,
    "plan": "D",
    "limit": "unlimited",
    "text": "Well, I've been dealing with TelCom for three months now, and I feel like they're very helpful and responsive to my issues, but for a month now, I've only had one technical support call and that was very long and involved. My phone number was wrong on both contracts, and they gave me a chance to work with TelCom customer service and it was extremely helpful, so I've decided to stick with it. But I would like to have more help in terms of technical support, I haven't had the kind of help with my phone line and I don't have the type of tech support I want. So I would like to negotiate a phone contract, maybe an upgrade from a Sprint plan, or maybe from a Verizon plan.\nTelCom Agent: Very good."
}
response = predictor.predict(data=[data])

The following code shows the response (probability of churn) from querying the endpoint:

20.09% probability of churn

Note that the probability returned by this model has not been calibrated. When the model gives a probability of churn of 20%, for example, this doesn’t necessarily mean that 20% of the customers assigned that probability actually churned. Calibration is a useful property in certain circumstances, but isn’t required in cases where discrimination between cases of churn and non-churn is sufficient. CalibratedClassifierCV from Scikit-learn can be used to calibrate a model.
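
As a hedged sketch of how that calibration could be done, assuming direct access to the fitted random forest (clf here) and transformed validation and test features, which are not exposed by the deployed endpoint:

from sklearn.calibration import CalibratedClassifierCV

# Calibrate the already-fitted classifier on held-out validation data.
calibrated_clf = CalibratedClassifierCV(clf, method="isotonic", cv="prefit")
calibrated_clf.fit(X_val_transformed, y_val)

# Calibrated churn probabilities for the test set.
calibrated_probs = calibrated_clf.predict_proba(X_test_transformed)[:, 1]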

Now we query the endpoint using the hold-out test data, which consists of 1,939 examples. The following table summarizes the evaluation results for our multimodal model with a Hugging Face sentence transformer and Scikit-learn random forest classifier.

Metric     BERT + Random Forest
Accuracy   0.77463
ROC AUC    0.75905

Model performance is dependent on hyperparameter configurations. Training a model with one set of hyperparameter configurations will not guarantee an optimal model. As a result, we run the HPO process in the following section to further improve model performance.

Fit a multimodal model with HPO

In this section, we further improve the model performance by adding HPO tuning with SageMaker automatic model tuning. SageMaker automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose. The best model and its corresponding hyperparameters are selected on the validation data. Next, the best model is evaluated on the hold-out test data, which is the same test data we created in the previous section. Finally, we show that the performance of the model trained with HPO is significantly better than the one trained without HPO.

The following are the static hyperparameters we don’t tune and the dynamic hyperparameters we want to tune, along with their search ranges:

from sagemaker.tuner import ContinuousParameter, IntegerParameter, CategoricalParameter, HyperparameterTuner
hyperparameters = {
    "min_impurity_decrease": 0.0,
    "ccp_alpha": 0.0,
    "numerical-feature-names": "CustServ Calls,Account Length",
    "categorical-feature-names": "plan,limit",
    "textual-feature-names": "text",
    "label-name": "y"
}
hyperparameter_ranges = {
    "sentence-transformer": CategoricalParameter([
    "sentence-transformers/all-MiniLM-L6-v2", "sentence-transformers/multi-qa-mpnet-base-dot-v1", "sentence-transformers/paraphrase-MiniLM-L6-v2"]
    ),
    "criterion": CategoricalParameter(["gini", "entropy"]),
    "max-depth": CategoricalParameter([10, 20, 30, 40, 50, 60, 70, 80, 90, 100, -1]),
    "boostrap": CategoricalParameter(["True", "False"]),
    "min-samples-split": IntegerParameter(2, 10),
    "min-samples-leaf": IntegerParameter(1, 5),
    "n-estimators": CategoricalParameter([100, 200, 400, 800, 1000]),
}
tuning_job_name = f"{config.SOLUTION_PREFIX}-hpo"
current_folder = utils.get_current_folder(globals())
estimator = PyTorch(
    framework_version='1.5.0',
    py_version='py3',
    entry_point='entry_point.py',
    source_dir=str(Path(current_folder, '../containers/huggingface_transformer_randomforest').resolve()),
    hyperparameters=hyperparameters,
    role=config.IAM_ROLE,
    instance_count=1,
    instance_type=config.TRAINING_INSTANCE_TYPE,
    output_path='s3://' + str(Path(config.S3_BUCKET, config.OUTPUTS_S3_PREFIX_RF)),
    code_location='s3://' + str(Path(config.S3_BUCKET, config.OUTPUTS_S3_PREFIX_RF)),
    tags=[{'Key': config.TAG_KEY, 'Value': config.SOLUTION_PREFIX}],
    sagemaker_session=sagemaker_session,
    volume_size=30
)

We define the objective metric name, metric definition (with regex pattern), and objective type for the tuning job.

First, we set the objective metric as the ROC AUC score on the validation data and define metrics for the tuning job by specifying the objective metric name and a regular expression (regex). The regular expression is used to match the algorithm’s log output and capture the numeric values of metrics.

objective_metric_name = "roc auc"
metric_definitions = [{"Name": "roc auc", "Regex": "roc auc score on validation data: ([0-9\.]+)"}]
objective_type = "Maximize"

Next, we specify hyperparameter ranges to select the best hyperparameter values from. We set the total number of tuning jobs to 18 and run up to three of them in parallel on separate Amazon Elastic Compute Cloud (Amazon EC2) instances.

Finally, we pass those values to instantiate a SageMaker Estimator object, similar to what we did in the previous training step. Instead of calling the fit function of the Estimator object, we pass the Estimator object in as a parameter to the HyperparameterTuner constructor and call the fit function of it to launch tuning jobs:

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs=18, # increasing the maximum number of jobs will likely yield better performance
    max_parallel_jobs=3,
    objective_type=objective_type,
    base_tuning_job_name=tuning_job_name,
)
# Launch a SageMaker Tuning job to search for the best hyperparameters
tuner.fit(
    {
        'train': 's3://' + str(Path(config.S3_BUCKET, config.DATASETS_S3_PREFIX, 'train.jsonl')),
        'validation': 's3://' + str(Path(config.S3_BUCKET, config.DATASETS_S3_PREFIX, 'validation.jsonl'))
    },
    logs=True
)

When the tuning job is complete, we can generate a summary table of all the training jobs it launched.
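
For example, a minimal sketch of pulling that summary into a pandas DataFrame with the SageMaker SDK’s tuner analytics (the exact column names depend on the SDK version):

# Summarize all training jobs launched by the tuning job
tuner_analytics = tuner.analytics()
df = tuner_analytics.dataframe()

# Sort by the objective metric to inspect the best-performing configurations first
df = df.sort_values("FinalObjectiveValue", ascending=False)
print(df.head())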

After the tuning job is complete, we deploy the model that gives the best evaluation metric score on the validation dataset, perform inference on the same hold-out test dataset as in the previous section, and compute evaluation metrics.
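
A minimal sketch of deploying the best model found by the tuner is shown below; the endpoint name is an illustrative choice, and the hosting instance type mirrors the solution’s configuration:

# Deploy the model from the best training job of the tuning job
best_predictor = tuner.deploy(
    initial_instance_count=1,
    instance_type=config.HOSTING_INSTANCE_TYPE,
    endpoint_name=f"{config.SOLUTION_PREFIX}-hpo-endpoint",
)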

Metric      BERT + Random Forest    BERT + Random Forest with HPO
Accuracy    0.77463                 0.9278
ROC AUC     0.75905                 0.79861

We can see running HPO with SageMaker automatic model tuning significantly improves the model performance.

In addition to HPO, model performance also depends on the choice of algorithm. It’s important to train multiple state-of-the-art algorithms, compare their performance on the same hold-out test data, and pick the optimal one. Therefore, we train two more AutoGluon multimodal models in the following sections.

Fit an AutoGluon multimodal weighted/stacked ensemble model

There are two types of AutoGluon multimodality:

  • Train multiple tabular models as well as the TextPredictor model (utilizing the TextPredictor model inside of TabularPredictor), and then combine them via either a weighted ensemble or stacked ensemble, as explained in AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data
  • Fuse multiple neural network models directly to handle raw text; these fusion models can also handle additional numerical and categorical columns

We train a multimodal weighted or stacked ensemble model first in this section, and train a fusion neural network model in the next section.
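
To make the distinction concrete, the following is a minimal local AutoGluon sketch of the two approaches (outside SageMaker), assuming a hypothetical pandas DataFrame train_df with the same feature and label columns and AutoGluon 0.5.x APIs:

from autogluon.tabular import TabularPredictor
from autogluon.multimodal import MultiModalPredictor

# Approach 1: weighted/stacked ensemble of tabular models plus a text model,
# orchestrated by TabularPredictor with the 'multimodal' hyperparameter preset
ensemble_predictor = TabularPredictor(label="y", eval_metric="roc_auc").fit(
    train_data=train_df,
    hyperparameters="multimodal",
)

# Approach 2: a single fusion neural network that jointly encodes the text,
# categorical, and numerical columns
fusion_predictor = MultiModalPredictor(label="y", eval_metric="roc_auc").fit(
    train_data=train_df,
)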

First, we retrieve the AutoGluon training image:

import boto3
from sagemaker import image_uris
from sagemaker.estimator import Estimator
train_image_uri = image_uris.retrieve(
    "autogluon",
    region=boto3.Session().region_name,
    version='0.5.2',
    py_version='py38',
    image_scope="training",
    instance_type=config.TRAINING_INSTANCE_TYPE,
)

Next, we pass in hyperparameters. Unlike existing AutoML frameworks that primarily focus on model or hyperparameter selection, AutoGluon-Tabular succeeds by ensembling multiple models and stacking them in multiple layers. Therefore, HPO is usually not required for AutoGluon ensemble models.

hyperparameters = {
    "numerical-feature-names": "CustServ Calls,Account Length",
    "categorical-feature-names": "plan,limit",
    "textual-feature-names": "text",
    "label-name": "y",
    "problem_type": "classification", # either classification or regression. For classification, we will identify binary or multiclass classification in the training script
    "eval_metric": "roc_auc",
    "presets": "medium_quality",
    "auto_stack": "False",
    "num_bag_folds": 0,
    "num_bag_sets": 1,
    "num_stack_levels": 0,
    "refit_full": "False",
    "set_best_to_refit_full": "False",
    "save_space": "True",
    "verbosity": 2,
    "pretrained-transformer": "google/electra-small-discriminator"
}

Finally, we create a SageMaker Estimator and call estimator.fit() to start a training job:

# Create SageMaker Estimator instance
training_job_name_ag = f"{config.SOLUTION_PREFIX}-ag"
tabular_estimator_ag = Estimator(
    role=config.IAM_ROLE,
    image_uri=train_image_uri,
    entry_point='train.py',
    source_dir=str(Path(current_folder, '../containers/autogluon_multimodal_ensemble').resolve()),
    instance_count=1,
    instance_type=config.TRAINING_INSTANCE_TYPE,
    max_run=360000,
    hyperparameters=hyperparameters,
    base_job_name=training_job_name_ag,
    output_path='s3://' + str(Path(config.S3_BUCKET, config.OUTPUTS_S3_PREFIX_AG_ENSEMBLE)),
    code_location='s3://' + str(Path(config.S3_BUCKET, config.OUTPUTS_S3_PREFIX_AG_ENSEMBLE)),
    tags=[{'Key': config.TAG_KEY, 'Value': config.SOLUTION_PREFIX}],
)
tabular_estimator_ag.fit(
    {
        'train': 's3://' + str(Path(config.S3_BUCKET, config.DATASETS_S3_PREFIX, 'train.jsonl')),
        'validation': 's3://' + str(Path(config.S3_BUCKET, config.DATASETS_S3_PREFIX, 'validation.jsonl'))
    }, logs=False
)

After training is complete, we retrieve the AutoGluon inference image and deploy the model:

# Retrieve the inference docker container uri
inference_image_uri = image_uris.retrieve(
    "autogluon",
    region=boto3.Session().region_name,
    version='0.5.2',
    py_version='py38',
    image_scope="inference",
    instance_type=config.HOSTING_INSTANCE_TYPE,
)
endpoint_name_ag = f"{config.SOLUTION_PREFIX}-ag-endpoint"
predictor_ag = tabular_estimator_ag.deploy(
    initial_instance_count=1,
    instance_type=config.HOSTING_INSTANCE_TYPE,
    entry_point="inference.py",
    image_uri=inference_image_uri,
    source_dir=str(Path(current_folder, '../containers/autogluon_multimodal_ensemble').resolve()),
    endpoint_name=endpoint_name_ag,
)
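
Once the endpoint is in service, we can invoke it from the notebook. The following is a minimal sketch; the exact payload schema and response format depend on the solution’s inference.py script, so the record layout and serializers here are assumptions for illustration:

from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer
from sklearn.metrics import accuracy_score, roc_auc_score

predictor_ag.serializer = JSONSerializer()
predictor_ag.deserializer = JSONDeserializer()

# Hypothetical single test record with the same feature columns used in training
sample = {
    "CustServ Calls": 3,
    "Account Length": 120,
    "plan": "B",
    "limit": "unlimited",
    "text": "Customer called twice about roaming charges and asked to cancel."
}
response = predictor_ag.predict(sample)

# With predictions and labels collected over the whole test set:
# accuracy = accuracy_score(y_true, y_pred)
# roc_auc = roc_auc_score(y_true, y_score)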

After we deploy the endpoint, we query it with the same test set and compute evaluation metrics. In the following table, we can see that the AutoGluon multimodal ensemble improves ROC AUC by about 3% compared with the BERT sentence transformer and random forest with HPO.

Metric      BERT + Random Forest    BERT + Random Forest with HPO    AutoGluon Multimodal Ensemble
Accuracy    0.77463                 0.9278                           0.92625
ROC AUC     0.75905                 0.79861                          0.82918

Fit an AutoGluon multimodal fusion model

The following diagram illustrates the architecture of the model. For details, see AutoMM for Text + Tabular – Quick Start.

Internally, we use different networks to encode the text columns, categorical columns, and numerical columns. The features generated by the individual networks are aggregated by a late-fusion aggregator. The aggregator can output either logits or score predictions.

Here, we use a pretrained NLP backbone to extract the text features and then use two other towers to extract features from the categorical and numerical columns.
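
Conceptually, the late-fusion aggregation looks like the following simplified PyTorch sketch (made-up layer sizes, not the actual AutoGluon implementation):

import torch
import torch.nn as nn

class LateFusionModel(nn.Module):
    """Each modality gets its own encoder tower; the encoded features are
    concatenated and passed to an MLP aggregator that outputs logits."""

    def __init__(self, text_dim=256, num_categories=10, cat_emb_dim=16,
                 num_numeric=2, num_classes=2):
        super().__init__()
        # Stand-ins for the per-modality towers; AutoGluon uses a pretrained
        # transformer backbone for text and learned embeddings/MLPs for the rest
        self.text_encoder = nn.Linear(text_dim, 128)
        self.cat_encoder = nn.Embedding(num_categories, cat_emb_dim)
        self.num_encoder = nn.Linear(num_numeric, 16)
        self.aggregator = nn.Sequential(
            nn.Linear(128 + cat_emb_dim + 16, 64),
            nn.ReLU(),
            nn.Linear(64, num_classes),  # logits; apply softmax for score predictions
        )

    def forward(self, text_feats, cat_ids, numeric_feats):
        # Concatenate the per-modality features and aggregate them
        fused = torch.cat([
            self.text_encoder(text_feats),
            self.cat_encoder(cat_ids),
            self.num_encoder(numeric_feats),
        ], dim=-1)
        return self.aggregator(fused)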

In addition, to deal with multiple text fields, we separate these fields with the [SEP] token and alternate 0s and 1s as the segment IDs, as shown in the following diagram.
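
For example, two hypothetical text fields from one record could be packed into a single sequence like this (tokens simplified; real tokenization is subword-based):

# Two hypothetical text fields from one customer record
field_a = ["customer", "called", "support", "twice"]
field_b = ["asked", "about", "roaming", "charges"]

# Separate the fields with [SEP] and alternate 0s and 1s as the segment IDs
tokens = ["[CLS]"] + field_a + ["[SEP]"] + field_b + ["[SEP]"]
segment_ids = [0] * (len(field_a) + 2) + [1] * (len(field_b) + 1)

print(list(zip(tokens, segment_ids)))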

Similarly, we follow the same steps as in the previous section to train and deploy the AutoGluon multimodal fusion model:

# Create SageMaker Estimator instance
training_job_name_ag_fusion = f"{config.SOLUTION_PREFIX}-ag-fusion"
hyperparameters = {
    "numerical-feature-names": "CustServ Calls,Account Length",
    "categorical-feature-names": "plan,limit",
    "textual-feature-names": "text",
    "label-name": "y",
    "problem_type": "classification", # either classification or regression. For classification, we will identify binary or multiclass classification in the training script
    "eval_metric": "roc_auc",
    "verbosity": 2,
    "pretrained-transformer": "google/electra-small-discriminator",
}
tabular_estimator_ag_fusion = Estimator(
    role=config.IAM_ROLE,
    image_uri=train_image_uri,
    entry_point='train.py',
    source_dir=str(Path(current_folder, '../containers/autogluon_multimodal_fusion').resolve()),
    instance_count=1,
    instance_type=config.TRAINING_INSTANCE_TYPE,
    max_run=360000,
    hyperparameters=hyperparameters,
    base_job_name=training_job_name_ag_fusion,
    output_path='s3://' + str(Path(config.S3_BUCKET, config.OUTPUTS_S3_PREFIX_AG_FUSION)),
    code_location='s3://' + str(Path(config.S3_BUCKET, config.OUTPUTS_S3_PREFIX_AG_FUSION)),
    tags=[{'Key': config.TAG_KEY, 'Value': config.SOLUTION_PREFIX}],
)
tabular_estimator_ag_fusion.fit(
    {
        'train': 's3://' + str(Path(config.S3_BUCKET, config.DATASETS_S3_PREFIX, 'train.jsonl')),
        'validation': 's3://' + str(Path(config.S3_BUCKET, config.DATASETS_S3_PREFIX, 'validation.jsonl'))
    }, logs=False
)

The following table summarizes the evaluation results for the AutoGluon multimodal fusion model, along with those of the three models evaluated in the previous sections. We can see that the AutoGluon multimodal ensemble and multimodal fusion models achieve the best performance.

Metric      BERT + Random Forest    BERT + Random Forest with HPO    AutoGluon Multimodal Ensemble    AutoGluon Multimodal Fusion
Accuracy    0.77463                 0.9278                           0.92625                          0.9247
ROC AUC     0.75905                 0.79861                          0.82918                          0.81115

Note that the results and the relative performance of these models depend on the dataset you use for training. These results are representative: although certain algorithms tend to perform better on data with similar characteristics, the relative ranking might change with a different data distribution. You can replace the example dataset with your own data to determine which model works best for you.

Demo notebook

You can use the demo notebook to send example data to already-deployed model endpoints, which allows you to quickly get hands-on experience by querying the example data. After you launch the Churn Prediction with Text solution, open the demo notebook by choosing Use Endpoint in Notebook.

Clean up

When you’ve finished with this solution, make sure that you delete all unwanted AWS resources by choosing Delete all resources.

Note that you need to manually delete any additional resources that you may have created in this notebook.

Conclusion

In this post, we showed how you can use SageMaker JumpStart to predict churn using a multimodal combination of text and tabular features.

If you’re interested in learning more about customer churn models, check out the following posts:


About the Authors

Dr. Xin Huang is an Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A journal.

Rajakumar Sampathkumar is a Principal Technical Account Manager at AWS, providing customers guidance on business-technology alignment and supporting the reinvention of their cloud operation models and processes. He is passionate about cloud and machine learning. Raj is also a machine learning specialist and works with AWS customers to design, deploy, and manage their AWS workloads and architectures.
