In our paper published today in Nature, we introduce AlphaDev, an artificial intelligence (AI) system that uses reinforcement learning to discover enhanced computer science algorithms – surpassing those honed by scientists and engineers over decades.
Augmenting recommendation systems with LLMs
Posted by Wei Wei, Developer Advocate
Large language models (LLMs) are taking the world by storm, thanks to their powerful ability to generate text, translate languages, and answer questions in a coherent and informative way. At Google I/O 2023, we released the PaLM API as a public preview so that many developers can start building apps with it. While the PaLM API already has excellent documentation covering its usage and best practices, in this blog we take a more focused approach and explore how to leverage LLMs to augment your ML systems in a practical application: recommendation systems.
As a refresher, modern recommendation systems often follow a retrieval-ranking architecture, which enables them to effectively and efficiently filter and rank relevant items to maximize the utility in production. You can go through this codelab to learn about building a fullstack movie recommendation system using TensorFlow and Flutter.
We will discuss how LLMs can be incorporated into this retrieval-ranking pipeline.
Conversational recommendations
If you already have access to Bard, you can ask it to create recommendations for you interactively in a dialogue. Here is an example of asking Bard for movie recommendations:
As a developer, you can build a similar functionality in your own applications, using the PaLM API Chat service with minimal effort:
prompt = """You are a movie recommender and your job is to recommend new movies based on user input. |
The PaLM API also allows you to help your user continue the exploration and interactively refine the recommendations (e.g., asking to swap The Florida Project for another one) in a dialogue, which is what the Chat service is designed for. This kind of conversational recommendation interface (think of a knowledgeable chatbot that guides a customer along the way in your shopping app) provides a fluid and personalized experience for the users, and can sometimes be a very appealing addition to your existing recommendation surfaces.
Sequential recommendations
Recommendations are much more useful if your system knows what your users may like. One way to find out your users' interests is to look at their historical activities and then extrapolate. This is often called 'sequential recommendation' because the recommender looks at the sequence of items that have been interacted with and infers what to recommend. Usually you need an ML library (e.g., TensorFlow Recommenders) to achieve this. But now, with the power of LLMs, you can also do this with the PaLM API Text service:
prompt = """You are a movie recommender and your job is to recommend new movies based on the sequence of movies that a user has watched. You pay special attention to the order of movies because it matters.
|
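The watched-movie list in the prompt is truncated above; a hedged, self-contained version of the Text service call might look like the following (the movie titles, model name, and temperature are illustrative assumptions):

import google.generativeai as palm

palm.configure(api_key='YOUR_PALM_API_KEY')

prompt = """You are a movie recommender and your job is to recommend new movies based on the sequence of movies that a user has watched. You pay special attention to the order of movies because it matters.

Movies watched, in order: The Matrix, Inception, Interstellar, Arrival

Recommend 3 new movies for this user:"""

completion = palm.generate_text(
    model='models/text-bison-001',
    prompt=prompt,
    temperature=0.2,
)
print(completion.result)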
This example prompts the Text service with 4 movies that have been watched and asks the PaLM API to generate new recommendations based on the sequence of past movies.
Rating predictions
In the ranking phase of modern recommendation engines, a list of candidates needs to be sorted based on certain criteria. This is usually done by using a learning-to-rank library (such as TensorFlow Ranking) to predict the ordering. Now you can do this with the PaLM API. Here is an example of predicting movie ratings:
prompt = """You are a movie recommender and your job is to predict a user's rating (ranging from 1 to 5, with 5 being the highest) on a movie, based on that user's previous ratings.
|
The PaLM API predicted a high score for The Matrix. You can ask the PaLM API to predict a rating for a list of candidate movies one by one and then sort them in order before making final recommendations; this process is called 'pointwise ranking'. You can even leverage the PaLM API to do pairwise ranking or listwise ranking, if you adjust the prompt accordingly.
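To make pointwise ranking concrete, here is a minimal sketch (the candidate movies, example rating history, and the naive parsing of the model's reply are illustrative assumptions; production code would validate the output before converting it to a number):

import google.generativeai as palm

palm.configure(api_key='YOUR_PALM_API_KEY')

user_history = "Inception: 5, Interstellar: 5, The Notebook: 2"
candidates = ["The Matrix", "Titanic", "Toy Story"]

predicted_ratings = {}
for movie in candidates:
    prompt = (
        "You are a movie recommender and your job is to predict a user's rating "
        "(ranging from 1 to 5, with 5 being the highest) on a movie, based on that "
        f"user's previous ratings.\nPrevious ratings: {user_history}\n"
        f"Predicted rating for {movie} (reply with a single number):"
    )
    completion = palm.generate_text(model='models/text-bison-001', prompt=prompt)
    predicted_ratings[movie] = float(completion.result.strip())

# Pointwise ranking: sort candidates by their individually predicted ratings.
ranked_candidates = sorted(predicted_ratings, key=predicted_ratings.get, reverse=True)
print(ranked_candidates)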
For a more comprehensive study on rating prediction with LLMs, you can refer to this paper from Google.
Text embedding-based recommendations
At this point you may be asking: all the use cases so far involve well-known movies that the LLM is already aware of, so do candidate items need to be known to the LLM in advance (that is, included in its training data)? What if I have private items that the LLM has never seen? How could I use the PaLM API then?
Not to worry. The PaLM API for Embeddings can help you out in this case. The basic idea is to embed text associated with your items (for example, a product description or movie plot) into vectors and use nearest neighbor search techniques (e.g., the tf.math.top_k op from TensorFlow for brute-force search, or Google ScaNN/Chroma for approximate search) to identify items similar to a user query and recommend them. Let's walk through a simple example.
Suppose you are building a news app and at the bottom of each news article you want to recommend similar news to your users. First you can embed all news articles by calling the PaLM API Embedding service like below:
embedding = palm.generate_embeddings(model='embedding-gecko-001', text='example news article text')['embedding']
For simplicity, let's assume you store all the news texts and their embeddings in a simple Pandas DataFrame with 2 columns: news_text and embedding. Then you can recommend interesting news to your users using the following:
def recommend_news(query_text, df, topk=5):
The recommend_news function computes the query embedding's dot product similarity with all news articles using the pre-computed embeddings, and then identifies the 5 news articles most similar to what your user is reading.
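The body of recommend_news is not shown above; a minimal sketch of what it could look like, assuming the DataFrame layout described earlier and brute-force NumPy search (swap in ScaNN or tf.math.top_k for larger corpora):

import numpy as np
import google.generativeai as palm

def recommend_news(query_text, df, topk=5):
    # Embed the query with the same PaLM Embedding model used for the articles.
    query_embedding = palm.generate_embeddings(
        model='embedding-gecko-001', text=query_text)['embedding']

    # Dot-product similarity between the query and all pre-computed embeddings.
    article_embeddings = np.stack(df['embedding'].to_numpy())
    scores = article_embeddings @ np.asarray(query_embedding)

    # Brute-force top-k; approximate nearest neighbor search scales better.
    top_indices = np.argsort(scores)[::-1][:topk]
    return df.iloc[top_indices]['news_text'].tolist()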
This approach is often a quick and effective way to generate candidates and create recommendations based on item similarities. It may be sufficient for many use cases and can be particularly useful in the item cold start situation.
In practice, the candidate generation phase of modern large-scale recommenders often consists of multiple sources. For example, you can use a mix of text embedding-based retrieval, collaborative filtering, users' subscriptions (e.g., new uploads from followed accounts on YouTube), real-time trending items (e.g., breaking news), and so on. Thus, leveraging the PaLM API Embedding service can be a helpful augmentation to the retrieval stage of your existing recommendation system.
Text embeddings as side features
In addition, you could also use the text embeddings as side features in a recommendation model. The text embeddings capture the semantic information of the candidate items via the description text and can potentially help improve the model accuracy. For example, in this TensorFlow Recommenders feature preprocessing tutorial, if you have pre-computed text embeddings for movie plots using LLMs, it's fairly easy to inject them into the model as side features when concatenating all the embeddings:
class MovieModel(tf.keras.Model):
The default PaLM Embedding service returns a vector of 768 floating-point numbers for any text, which may be more dimensions than you need. You can reduce the dimensionality by initializing a tf.keras.layers.Embedding layer with the movie plot embedding matrix and then stacking a fully connected layer on top of it to project it down to fewer dimensions.
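A hedged sketch of that projection in Keras (the vocabulary size, reduced dimension, and the random placeholder matrix are illustrative; in practice the matrix would hold your pre-computed PaLM plot embeddings, one row per movie ID):

import numpy as np
import tensorflow as tf

num_movies, palm_dim, reduced_dim = 10_000, 768, 32
# Placeholder for the pre-computed 768-d plot embeddings, one row per movie ID.
plot_embedding_matrix = np.random.rand(num_movies, palm_dim).astype('float32')

plot_side_feature = tf.keras.Sequential([
    # Frozen lookup initialized with the pre-computed plot embeddings.
    tf.keras.layers.Embedding(
        input_dim=num_movies,
        output_dim=palm_dim,
        embeddings_initializer=tf.keras.initializers.Constant(plot_embedding_matrix),
        trainable=False),
    # Trainable projection down to a smaller dimension before concatenation.
    tf.keras.layers.Dense(reduced_dim, activation='relu'),
])

reduced = plot_side_feature(tf.constant([0, 42, 7]))  # shape: (3, 32)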
Conclusion
We have shared several ideas on leveraging LLMs to augment recommenders. Obviously, this just scratches the surface, and there are many more possibilities not covered here. Also note that there may still be a long way to go before these techniques make it into production (e.g., due to latency and cost issues). But we hope this blog inspires you to start thinking about how you can improve your own recommendation systems with LLMs.
Lastly, we are holding an online Developer Summit on Recommendation Systems on June 9, 2023. If you want to learn more about Google products related to building recommendation systems, feel free to sign up here to attend.
Build high-performance ML models using PyTorch 2.0 on AWS – Part 1
PyTorch is a machine learning (ML) framework that is widely used by AWS customers for a variety of applications, such as computer vision, natural language processing, content creation, and more. With the recent PyTorch 2.0 release, AWS customers can now do the same things they could with PyTorch 1.x, but faster and at scale, with improved training speeds, lower memory usage, and enhanced distributed capabilities. Several new technologies, including torch.compile, TorchDynamo, AOTAutograd, PrimTorch, and TorchInductor, have been included in the PyTorch 2.0 release. Refer to PyTorch 2.0: Our next generation release that is faster, more Pythonic and Dynamic as ever for details.
This post demonstrates the performance and ease of running large-scale, high-performance distributed ML model training and deployment using PyTorch 2.0 on AWS. It walks through a step-by-step implementation of fine-tuning a RoBERTa (Robustly Optimized BERT Pretraining Approach) model for sentiment analysis using AWS Deep Learning AMIs (AWS DLAMI) and AWS Deep Learning Containers (DLCs) on Amazon Elastic Compute Cloud (Amazon EC2 p4d.24xlarge), with an observed 42% speedup when used with PyTorch 2.0 torch.compile + bf16 + fused AdamW. The fine-tuned model is then deployed on an AWS Graviton-based C7g EC2 instance on Amazon SageMaker, with an observed 10% speedup compared to PyTorch 1.13.
The following figure shows a performance benchmark of fine-tuning a RoBERTa model on Amazon EC2 p4d.24xlarge with AWS PyTorch 2.0 DLAMI + DLC.
Refer to Optimized PyTorch 2.0 inference with AWS Graviton processors for details on AWS Graviton-based instance inference performance benchmarks for PyTorch 2.0.
Support for PyTorch 2.0 on AWS
PyTorch 2.0 support is not limited to the services and compute shown in the example use case in this post; it extends to many others on AWS, which we discuss in this section.
Business requirement
Many AWS customers, across a diverse set of industries, are transforming their businesses by using artificial intelligence (AI), specifically in the area of generative AI and large language models (LLMs) that are designed to generate human-like text. These are essentially large deep learning models trained with hundreds of billions of parameters. The growth in model sizes is increasing training time from days to weeks, and even months in some cases. This is driving an exponential increase in training and inference costs, which requires, more than ever, a framework such as PyTorch 2.0 with built-in support for accelerated model training and the optimized infrastructure of AWS tailored to the specific workloads and performance needs.
Choice of compute
AWS provides PyTorch 2.0 support on the broadest choice of powerful compute, high-speed networking, and scalable high-performance storage options that you can use for any ML project or application and customize to fit your performance and budget requirements. This is manifested in the diagram in the next section; in the bottom tier, we provide a broad selection of compute instances powered by AWS Graviton, Nvidia, AMD, and Intel processors.
For model deployments, you can use ARM-based processors such as the recently announced AWS Graviton-based instance that provides inference performance for PyTorch 2.0 with up to 3.5 times the speed for Resnet50 compared to the previous PyTorch release, and up to 1.4 times the speed for BERT, making AWS Graviton-based instances the fastest compute-optimized instances on AWS for CPU-based model inference solutions.
Choice of ML services
To use AWS compute, you can select from a broad set of global cloud-based services for ML development, compute, and workflow orchestration. This choice allows you to align with your business and cloud strategies and run PyTorch 2.0 jobs on the platform of your choice. For instance, if you have on-premises restrictions or existing investments in open-source products, you can use Amazon EC2, AWS ParallelCluster, or AWS UltraCluster to run distributed training workloads based on a self-managed approach. You could also use a fully managed service like SageMaker for a cost-optimized, fully managed, and production-scale training infrastructure. SageMaker also integrates with various MLOps tools, which allows you to scale your model deployment, reduce inference costs, manage models more effectively in production, and reduce operational burden.
Similarly, if you have existing Kubernetes investments, you can also use Amazon Elastic Kubernetes Service (Amazon EKS) and Kubeflow on AWS to implement an ML pipeline for distributed training or use an AWS-native container orchestration service like Amazon Elastic Container Service (Amazon ECS) for model training and deployments. Options to build your ML platform are not limited to these services; you can pick and choose depending on your organizational requirements for your PyTorch 2.0 jobs.
Enabling PyTorch 2.0 with AWS DLAMI and AWS DLC
To use the aforementioned stack of AWS services and powerful compute, you have to install an optimized, compiled version of the PyTorch 2.0 framework and its required dependencies, many of which are independent projects, and test them end to end. You may also need CPU-specific libraries for accelerated math routines, GPU-specific libraries for accelerated math and inter-GPU communication routines, and GPU drivers that need to be aligned with the GPU compiler used to compile the GPU libraries. If your jobs require large-scale multi-node training, you need an optimized network that can provide the lowest latency and highest throughput. After you build your stack, you need to regularly scan and patch it for security vulnerabilities and rebuild and retest the stack after every framework version upgrade.
AWS helps reduce this heavy lifting by offering a curated and secure set of frameworks, dependencies, and tools to accelerate deep learning in the cloud through AWS DLAMIs and AWS DLCs. These pre-built and tested machine images and containers are optimized for deep learning on EC2 Accelerated Computing Instance types, allowing you to scale out to multiple nodes for distributed workloads more efficiently and easily. They include a pre-built Elastic Fabric Adapter (EFA), the Nvidia GPU stack, and many deep learning frameworks (TensorFlow, MXNet, and PyTorch with the latest 2.0 release) for high-performance distributed deep learning training. You don't need to spend time installing and troubleshooting deep learning software and drivers or building ML infrastructure, nor do you have to incur the recurring cost of patching these images for security vulnerabilities or recreating the images after every new framework version upgrade. Instead, you can focus on the higher value-added effort of training jobs at scale in a shorter amount of time and iterating on your ML models faster.
Solution overview
Considering that training on GPU and inference on CPU is a popular use case for AWS customers, we have included as part of this post a step-by-step implementation of a hybrid architecture (as shown in the following diagram). We explore the art of the possible and use a P4 EC2 instance with BF16 support, initialized with the Base GPU DLAMI (including NVIDIA drivers, CUDA, NCCL, and the EFA stack) and the PyTorch 2.0 DLC, to fine-tune a RoBERTa sentiment analysis model, which gives you the control and flexibility to use any open-source or proprietary libraries. Then we use SageMaker for a fully managed model hosting infrastructure to host our model on AWS Graviton3-based C7g instances. We picked C7g on SageMaker because it's proven to reduce inference costs by up to 50% relative to comparable EC2 instances for real-time inference on SageMaker. The following diagram illustrates this architecture.
The model training and hosting in this use case consists of the following steps:
- Launch a GPU DLAMI-based EC2 Ubuntu instance in your VPC and connect to your instance using SSH.
- After you log in to your EC2 instance, download the AWS PyTorch 2.0 DLC.
- Run your DLC container with a model training script to fine-tune the RoBERTa model.
- After model training is complete, package the saved model, inference scripts, and a few metadata files into a tar file that SageMaker inference can use and upload the model package to an Amazon Simple Storage Service (Amazon S3) bucket.
- Deploy the model using SageMaker and create an HTTPS inference endpoint. The SageMaker inference endpoint holds a load balancer and one or more instances of your inference container in different Availability Zones. You can deploy either multiple versions of the same model or entirely different models behind this single endpoint. In this example, we host a single model.
- Invoke your model endpoint by sending it test data and verify the inference output.
In the following sections, we showcase fine-tuning a RoBERTa model for sentiment analysis. RoBERTa was developed by Facebook AI and improves on the popular BERT model by modifying key hyperparameters and pre-training on a larger corpus. This leads to improved performance compared to vanilla BERT.
We use the transformers library by Hugging Face to get the RoBERTa model pre-trained on approximately 124 million tweets, and we fine-tune it on the Twitter dataset for sentiment analysis.
Prerequisites
Make sure you meet the following prerequisites:
- You have an AWS account.
- Make sure you're in the us-west-2 Region to run this example. (This example is tested in us-west-2; however, you can run it in any other Region.)
- Create a role with the name sagemakerrole. Add the managed policies AmazonSageMakerFullAccess and AmazonS3FullAccess to give SageMaker access to S3 buckets.
- Create an EC2 role with the name ec2_role. Use the following permission policy:
1. Launch your development instance
We create a p4d.24xlarge instance that offers 8 NVIDIA A100 Tensor Core GPUs in us-west-2:
When selecting the AMI, follow the release notes to run this command using the AWS Command Line Interface (AWS CLI) to find the AMI ID to use in us-west-2:
Make sure the size of the gp3 root volume is 200 GiB.
EBS volume encryption is not enabled by default. Consider changing this when moving this solution to production.
2. Download a Deep Learning Container
AWS DLCs are available as Docker images in Amazon Elastic Container Registry Public, a managed AWS container image registry service that is secure, scalable, and reliable. Each Docker image is built for training or inference on a specific deep learning framework version, Python version, with CPU or GPU support. Select the PyTorch 2.0 framework from the list of available Deep Learning Containers images.
Complete the following steps to download your DLC:
a. SSH to the instance. By default, the security group used with Amazon EC2 opens up the SSH port to all; consider changing this if you are moving this solution to production:
b. Set the environment variables required to run the remaining steps of this implementation:
Amazon ECR supports public image repositories with resource-based permissions using AWS Identity and Access Management (IAM) so that specific users or services can access images.
c. Log in to the DLC registry:
d. Pull the latest PyTorch 2.0 container with GPU support in us-west-2
If you get the error “no space left on device”, make sure you increase the EC2 EBS volume to 200 GiB and then extend the Linux file system.
3. Clone the latest scripts adapted to PyTorch 2.0
Clone the scripts with the following code:
Because we're using the Hugging Face transformers API with the latest version 4.28.1, PyTorch 2.0 support is already enabled. We added the following arguments to the trainer API in train_sentiment.py to enable the new PyTorch 2.0 features (see the example after this list):
- Torch compile – Experience an average 43% speedup on Nvidia A100 GPUs with a single line of change.
- BF16 datatype – New data type support (Brain Floating Point) for Ampere or newer GPUs.
- Fused AdamW optimizer – Fused AdamW implementation to further speed up training. This stochastic optimization method modifies the typical implementation of weight decay in Adam by decoupling weight decay from the gradient update.
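The exact change in train_sentiment.py is not reproduced here; as a hedged sketch, these three features map onto Hugging Face TrainingArguments roughly as follows (the other values shown are placeholders, not the repository's settings):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",           # placeholder
    num_train_epochs=3,               # placeholder
    per_device_train_batch_size=32,   # placeholder
    torch_compile=True,               # wrap the model with torch.compile
    bf16=True,                        # Brain Floating Point on Ampere or newer GPUs
    optim="adamw_torch_fused",        # fused AdamW optimizer
)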
4. Build a new Docker image with dependencies
We extend the pre-built PyTorch 2.0 DLC image to install the Hugging Face transformer and other libraries that we need to fine-tune our model. This allows you to use the included tested and optimized deep learning libraries and settings without having to create an image from scratch. See the following code:
5. Start training using the container
Run the following Docker command to begin fine-tuning the model on the tweet_eval sentiment dataset. We're using the Docker container arguments (shared memory size, max locked memory, and stack size) recommended by Nvidia for deep learning workloads.
You should expect the following output. The script first downloads the TweetEval dataset, which consists of seven heterogeneous tasks in Twitter, all framed as multi-class tweet classification. The tasks include irony, hate, offensive, stance, emoji, emotion, and sentiment.
The script then downloads the base model and starts the fine-tuning process. Training and evaluation metrics are reported at the end of each epoch.
Performance statistics
With PyTorch 2.0 and the latest Hugging Face transformers library 4.28.1, we observed a 42% speedup on a single p4d.24xlarge instance with 8 A100 40GB GPUs. The performance improvements come from a combination of torch.compile, the BF16 data type, and the fused AdamW optimizer. The following code is the final result of two training runs with and without the new features:
6. Test the trained model locally before preparing for SageMaker inference
You can find the following files under $ml_working_dir/saved_model/ after training:
Let's make sure we can run inference locally before preparing for SageMaker inference. We can load the saved model and run inference locally using the test_trained_model.py script:
You should expect the following output with the input “Covid cases are increasing fast!”:
7. Prepare the model tarball for SageMaker inference
Under the directory where the model is located, make a new directory called code:
In the new directory, create the file inference.py and add the following to it:
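The contents of inference.py are not included above; the following is a minimal sketch of the handler functions the SageMaker PyTorch serving stack expects (the JSON payload format and label handling are assumptions, not the repository's actual script):

import json

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def model_fn(model_dir):
    # Load the fine-tuned RoBERTa model and tokenizer saved during training.
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.eval()
    return model, tokenizer

def input_fn(request_body, request_content_type):
    # Expect a JSON payload such as {"inputs": "Covid cases are increasing fast!"}
    if request_content_type == "application/json":
        return json.loads(request_body)["inputs"]
    raise ValueError(f"Unsupported content type: {request_content_type}")

def predict_fn(input_data, model_and_tokenizer):
    model, tokenizer = model_and_tokenizer
    encoded = tokenizer(input_data, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**encoded).logits
    return int(torch.argmax(logits, dim=-1))

def output_fn(prediction, accept):
    # Return the predicted class index as JSON.
    return json.dumps({"label_id": prediction})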
In the end, you should have the following folder structure:
The model is ready to be packaged and uploaded to Amazon S3 for use with SageMaker inference:
8. Deploy the model on a SageMaker AWS Graviton instance
New generations of CPUs offer a significant performance improvement in ML inference due to specialized built-in instructions. In this use case, we use the SageMaker fully managed hosting infrastructure with AWS Graviton3-based C7g instances. AWS has also measured up to a 50% cost savings for PyTorch inference with AWS Graviton3-based EC2 C7g instances across Torch Hub ResNet50 and multiple Hugging Face models, relative to comparable EC2 instances.
To deploy the models to AWS Graviton instances, we use AWS DLCs that provide support for PyTorch 2.0 and TorchServe 0.8.0, or you can bring your own containers that are compatible with the ARMv8.2 architecture.
We use the model we trained earlier: s3://<your-s3-bucket>/twitter-roberta-base-sentiment-latest.tar.gz. If you haven't used SageMaker before, review Get Started with Amazon SageMaker.
To start, make sure the SageMaker package is up to date:
Because this is an example, create a file called start_endpoint.py and add the following code. This will be the Python script to start a SageMaker inference endpoint with the model:
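A hedged sketch of what start_endpoint.py could contain (the role ARN, endpoint name, and framework version string are placeholders; the Graviton image is retrieved via the inference_graviton image scope described next):

import sagemaker
from sagemaker.pytorch import PyTorchModel

session = sagemaker.Session()
role = "arn:aws:iam::<account-id>:role/sagemakerrole"  # placeholder role ARN
model_data = "s3://<your-s3-bucket>/twitter-roberta-base-sentiment-latest.tar.gz"

# Retrieve the PyTorch 2.0 DLC built for AWS Graviton inference.
image_uri = sagemaker.image_uris.retrieve(
    framework="pytorch",
    region=session.boto_region_name,
    version="2.0",
    image_scope="inference_graviton",
    instance_type="ml.c7g.4xlarge",
)

pytorch_model = PyTorchModel(
    model_data=model_data,
    role=role,
    image_uri=image_uri,
    entry_point="inference.py",
)

predictor = pytorch_model.deploy(
    initial_instance_count=1,
    instance_type="ml.c7g.4xlarge",
    endpoint_name="roberta-sentiment-graviton",  # assumed endpoint name
)
print(predictor.endpoint_name)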
We're using ml.c7g.4xlarge for the instance and are retrieving PT 2.0 with an image scope of inference_graviton. This is our AWS Graviton3 instance.
Next, we create the file that runs the prediction. We do these as separate scripts so we can run the predictions as many times as we want. Create predict.py with the following code:
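A hedged sketch of predict.py (it assumes the endpoint name and JSON payload format used in the sketches above):

from sagemaker.deserializers import JSONDeserializer
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer

# Attach to the endpoint created by start_endpoint.py.
predictor = Predictor(
    endpoint_name="roberta-sentiment-graviton",  # assumed endpoint name
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

result = predictor.predict({"inputs": "Covid cases are increasing fast!"})
print(result)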
With the scripts generated, we can now start an endpoint, do predictions against the endpoint, and clean up when we’re done:
9. Clean up
Lastly, we want to clean up from this example. Create cleanup.py and add the following code:
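A minimal sketch of cleanup.py, assuming the same endpoint name as the earlier scripts:

from sagemaker.predictor import Predictor

predictor = Predictor(endpoint_name="roberta-sentiment-graviton")  # assumed name
predictor.delete_model()     # remove the SageMaker model resource
predictor.delete_endpoint()  # remove the endpoint and its configuration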
Conclusion
AWS DLAMIs and DLCs have become the go-to standard for running deep learning workloads on a broad selection of compute and ML services on AWS. Along with using framework-specific DLCs on AWS ML services, you can also use a single framework on Amazon EC2, which removes the heavy lifting necessary for developers to build and maintain deep learning applications. Refer to Release Notes for DLAMI and Available Deep Learning Containers Images to get started.
This post showed one of many possibilities to train and serve your next model on AWS and discussed several formats that you can adopt to meet your business objectives. Give this example a try or use our other AWS ML services to expand the data productivity for your business. We have included a simple sentiment analysis problem so that customers new to ML can understand how simple it is to get started with PyTorch 2.0 on AWS. We will be covering more advanced use cases, models, and AWS technologies in upcoming blog posts.
About the authors
Kanwaljit Khurmi is a Principal Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.
Mike Schneider is a Systems Developer, based in Phoenix AZ. He is a member of the Deep Learning Containers team, supporting various framework container images, including Graviton inference. He is dedicated to infrastructure efficiency and stability.
Lai Wei is a Senior Software Engineer at Amazon Web Services. He focuses on building easy-to-use, high-performance, and scalable deep learning frameworks for accelerating distributed model training. Outside of work, he enjoys spending time with his family, hiking, and skiing.
Visualizing and interpreting decision trees
Posted by Terence Parr, Google
Decision trees are the fundamental building block of Gradient Boosted Trees and Random Forests, the two most popular machine learning models for tabular data. To learn how decision trees work and how to interpret your models, visualization is essential.
TensorFlow recently published a new tutorial that shows how to use dtreeviz, a state-of-the-art visualization library, to visualize and interpret TensorFlow Decision Forest Trees.
The dtreeviz library, first released in 2018, is now the most popular visualization library for decision trees. The library is constantly being updated and improved, and there is a large community of users who can provide support and answer questions. There is a helpful YouTube video and article on the design of dtreeviz.
Let’s demonstrate how to use dtreeviz to interpret decision tree predictions.
At a basic level, a decision tree is a machine learning model that learns the relationship between observations and target values by examining and condensing training data into a binary tree. Each leaf in the decision tree is responsible for making a specific prediction. For regression trees, the prediction is a value, such as price. For classifier trees, the prediction is a target category, such as cancer or not-cancer.
Any path from the root of the decision tree to a specific leaf predictor passes through a series of (internal) decision nodes. Each decision node compares a single feature’s value with a specific split point value learned during training. Making a prediction means walking from the root down the tree, comparing feature values, until we reach a leaf. Consider the following simple decision tree that tries to classify animals based upon two features, the number of legs and the number of eyes.
Let's say that our test animal has four legs and two eyes. To classify the test animal, we start at the root of the tree and compare our test animal's number of legs to four. Since the number of legs is equal to four, we move to the left. Next, we compare the number of eyes to three. Since our test animal only has two eyes, we move to the right and arrive at a leaf node, which gives us a prediction of dog. To learn more, check out this class on decision trees.
To interpret decision tree predictions, we use dtreeviz to visualize how each decision node in the tree splits up a specific feature's domain, and to show the distribution of training instances in each leaf. For example, here are the first few levels of a classification tree from a Random Forest trained on the Penguin data set:
To make a prediction for a test penguin, this decision tree first tests the flipper_length_mm feature and if it’s less than 206, it descends to the left and then tests the island feature; otherwise, if the flipper length were >= 206, it would descend to the right and test the bill_length_mm feature. (Check out the tutorial for a description of the visualization elements.)
The code used to generate that tree is short. Given a classifier model called cmodel, we collect and wrap up all of the information about the data and model then ask dtreeviz to visualize the tree:
penguin_features = [f.name for f in cmodel.make_inspector().features()]
penguin_label = "species" # Name of the classification target label
viz_cmodel = dtreeviz.model(cmodel,
tree_index=3, # pick tree from forest
X_train=train_ds_pd[penguin_features],
y_train=train_ds_pd[penguin_label],
feature_names=penguin_features,
target_name=penguin_label,
class_names=classes)
viz_cmodel.view()
And here are the first few layers of a regressor tree from a Random Forest trained on the Abalone data set:
Another useful tool for interpretation is to visualize how a specific test instance (feature vector) weaves its way down the tree from the root to a specific leaf. By looking at the path taken by a decision tree when making a prediction, we learn why a test instance was classified in a particular way. We know which features are tested and against what range of values. Imagine being rejected for a bank loan. Looking at the decision tree could tell us exactly why we were rejected (e.g., credit score too low or debt to income ratio too high). Here’s an example showing the decisions made by the decision tree for a specific Penguin instance, with the path highlighted in orange boxes and the test instance features shown at the bottom left:
You can also look at information about the leaf contents by calling viz_cmodel.ctree_leaf_distributions(). For example, here's a plot showing the leaf ID versus samples-per-class for the Penguin dataset:
For regressors, the leaf plot shows the distribution of the target (predicted) variable for the instances in each leaf, such as in this plot from an Abalone decision tree:
Each “row” in this plot represents a specific leaf and the blue dots indicate the distribution of the rings prediction values for instances associated with that leaf by the training process.
The library can do lots more; this is just a taste. Your next step is to check out the tutorial! Then, try dtreeviz on your own tree models. To dig deeper into how decision trees are built and how they carve up feature space to make predictions, you can watch the YouTube video or the article on the design of dtreeviz. Enjoy!
Visual captions: Using large language models to augment video conferences with dynamic visuals
Recent advances in video conferencing have significantly improved remote video communication through features like live captioning and noise cancellation. However, there are various situations where dynamic visual augmentation would be useful to better convey complex and nuanced information. For example, when discussing what to order at a Japanese restaurant, your friends could share visuals that would help you feel more confident about ordering the “Sukiyaki”. Or when talking about your recent family trip to San Francisco, you may want to show a photo from your personal album.
In “Visual Captions: Augmenting Verbal Communication With On-the-fly Visuals”, presented at ACM CHI 2023, we introduce a system that uses verbal cues to augment synchronous video communication with real-time visuals. We fine-tuned a large language model to proactively suggest relevant visuals in open-vocabulary conversations using a dataset we curated for this purpose. We open sourced Visual Captions as part of the ARChat project, which is designed for rapid prototyping of augmented communication with real-time transcription.
Design space for augmenting verbal communication with dynamic visuals
We invited 10 internal participants, each with various technical and non-technical backgrounds, including software engineers, researchers, UX designers, visual artists, students, etc., to discuss their particular needs and desires for a potential real-time visual augmentation service. In two sessions, we introduced low-fidelity prototypes of the envisioned system, followed by video demos of the existing text-to-image systems. These discussions informed a design space with eight dimensions for visual augmentation of real-time conversations, labeled below as D1 to D8.
Visual augmentations could be synchronous or asynchronous with the conversation (D1: Temporal), could be used for both expressing and understanding speech content (D2: Subject), and could be applied using a wide range of different visual content, visual types, and visual sources (D3: Visual). Such visual augmentation might vary depending on the scale of the meetings (D4: Scale) and whether a meeting is in co-located or remote settings (D5: Space). These factors also influence whether the visuals should be displayed privately, shared between participants, or public to everyone (D6: Privacy). Participants also identified different ways in which they would like to interact with the system while having conversations (D7: Initiation). For example, people proposed different levels of “proactivity”, which indicates the degree to which users would like the model to take the initiative. Finally, participants envisioned different methods of interaction, for example, using speech or gestures for input (D8: Interaction).
Design space for augmenting verbal communication with dynamic visuals.
Informed by this initial feedback, we designed Visual Captions to focus on generating synchronous visuals of semantically relevant visual content, type, and source. While participants in these initial exploratory sessions were participating in one-to-one remote conversations, deployment of Visual Captions in the wild will often be in one-to-many (e.g., an individual giving a presentation to an audience) and many-to-many scenarios (e.g., a discussion among multiple people in a meeting).
Because the visual that best complements a conversation depends strongly on the context of the discussion, we needed a training set specific to this purpose. So, we collected a dataset of 1595 quadruples of language (1), visual content (2), type (3), and source (4) across a variety of contexts, including daily conversations, lectures, and travel guides. For example, “I would love to see it!” corresponds to visual content of “face smiling”, a visual type of “emoji”, and visual source of “public search”. “Did she tell you about our trip to Mexico?” corresponds to visual content of “a photo from the trip to Mexico”, a visual type of “photo”, and visual source of “personal album”. We publicly released this VC1.5K dataset for the research community.
Visual intent prediction model
To predict what visuals could supplement a conversation, we trained a visual intent prediction model based on a large language model using the VC1.5K dataset. For training, we parsed each visual intent into the format of "<Visual Type> of <Visual Content> from <Visual Source>".
{"prompt": "<Previous Two Sentences> →", "completion": "<Visual Type 1> of <Visual Content 1> from <Visual Source 1>; <Visual Type 2> of <Visual Content 2> from <Visual Source 2>; ...; <Visual Type n> of <Visual Content n> from <Visual Source n>"}
Using this format, this system can handle open-vocabulary conversations and contextually predict visual content, visual source, and visual type. Anecdotally, we found that it outperforms keyword-based approaches, which fail to handle open-vocabulary examples like “Your aunt Amy will be visiting this Saturday,” and cannot suggest relevant visual types or visual sources.
Examples of visual intent predictions by our model.
We used 1276 (80%) examples from the VC1.5K dataset for fine-tuning the large language model and the remaining 319 (20%) examples as test data. We measured the performance of the fine-tuned model with the token accuracy metric, i.e., the percentage of tokens in a batch that were correctly predicted by the model. During training, our model reached a training token accuracy of 97% and a validation token accuracy of 87%.
Performance
To evaluate the utility of the trained Visual Captions model, we invited 89 participants to perform 846 tasks. They were asked to provide feedback on a scale of “1 — Strongly Disagree” to “7 — Strongly Agree” for six qualitative statements. Most participants preferred to have the visual during a conversation (Q1, 83% ≥ 5–Somewhat Agree). Moreover, they considered the displayed visuals to be useful and informative (Q2, 82% ≥ 5–Somewhat Agree), high-quality (Q3, 82% ≥ 5–Somewhat Agree), and relevant to the original speech (Q4, 84% ≥ 5–Somewhat Agree). Participants also found the predicted visual type (Q5, 87% ≥ 5–Somewhat Agree) and visual source (Q6, 86% ≥ 5–Somewhat Agree) to be accurate given the context of the corresponding conversation.
Technical evaluation results of the visual prediction model rated by study participants.
With this fine-tuned visual intent prediction model, we developed Visual Captions on the ARChat platform, which can add new interactive widgets directly on the camera streams of video conferencing platforms, such as Google Meet. As shown in the system workflow below, Visual Captions automatically captures the user’s speech, retrieves the last sentences, feeds them into the visual intent prediction model every 100 ms, retrieves relevant visuals, and then suggests visuals in real time.
System workflow of Visual Captions.
Visual Captions provides three levels of proactivity when suggesting visuals:
- Auto-display (high-proactivity): The system autonomously searches and displays visuals publicly to all meeting participants. No user interaction required.
- Auto-suggest (medium-proactivity): The suggested visuals are shown in a private scrolling view. A user then clicks a visual to display it publicly. In this mode, the system is proactively recommending visuals, but the user decides when and what to display.
- On-demand-suggest (low-proactivity): The system will only suggest visuals if a user presses the spacebar.
Quantitative and qualitative evaluation: User studies
We evaluated Visual Captions in both a controlled lab study (n = 26) and in-the-wild deployment studies (n = 10). Participants found that real-time visuals facilitated live conversations by helping explain unfamiliar concepts, resolve language ambiguities, and make conversations more engaging. Participants also reported different preferences for interacting with the system in-situ, and that varying levels of proactivity were preferred in different social scenarios.
Participants' Task Load Index and Likert scale ratings (from 1 – Strongly Disagree to 7 – Strongly Agree) of four conversations without Visual Captions ("No VC") and the three Visual Captions modes: auto-display, auto-suggest, and on-demand suggest.
Conclusions and future directions
This work proposes a system for real-time visual augmentation of verbal communication, called Visual Captions, that was trained using a dataset of 1595 visual intents collected from 246 participants, covering 15 topic categories. We publicly release the training dataset, VC1.5K, to the research community to support further research in this space. We have also deployed Visual Captions in ARChat, which facilitates video conferences in Google Meet by transcribing meetings and augmenting the camera video streams.
Visual Captions represents a significant step towards enhancing verbal communication with on-the-fly visuals. By understanding the importance of visual cues in everyday conversations, we can create more effective communication tools and improve how people connect.
Acknowledgements
This work is a collaboration across multiple teams at Google. Key contributors to the project include Xingyu “Bruce” Liu, Vladimir Kirilyuk, Xiuxiu Yuan, Peggy Chi, Alex Olwal, and Ruofei Du.
We would like to extend our thanks to those on the ARChat team who provided assistance, including Jason Mayes, Max Spear, Na Li, Jun Zhang, Jing Jin, Yuan Ren, Adarsh Kowdle, Ping Yu, Darcy Philippon, and Ezgi Oztelcan. We would also like to thank the many people with whom we’ve had insightful discussions and those who provided feedback on the manuscript, including Eric Turner, Yinda Zhang, Feitong Tan, Danhang Tang, and Shahram Izadi. We would also like to thank our CHI reviewers for their insightful feedback.
Arrange your transcripts into paragraphs with Amazon Transcribe
Amazon Transcribe is a speech recognition service that generates transcripts from video and audio files in multiple supported languages and accents. It comes with a rich set of features, including automatic language identification, multi-channel and multi-speaker support, custom vocabularies, and transcript redaction.
Amazon Transcribe supports two modes of operation: batch and streaming. In batch mode, a transcription job is created to process files residing in an Amazon Simple Storage Service (Amazon S3) bucket; in streaming mode, the audio source is integrated in real time with Amazon Transcribe through HTTP/2 calls or Web Sockets.
In this post, we explore how to automatically arrange the generated transcript into paragraphs while in batch mode, increasing the readability of the generated transcript.
Transcription output
Amazon Transcribe uses JSON representation for its output. It provides the transcription result in two different formats: text format and itemized format.
Text format provides the transcript altogether, as a block of text, whereas itemized format provides the transcript in the form of timely ordered transcribed items, along with additional metadata per item. Both formats exist in parallel in the output file.
Depending on the features selected during transcription job creation, Amazon Transcribe creates additional and enriched views of the transcription result. See the following example code:
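The example code from the original post is not reproduced here; the following Python literal is a hedged sketch of the general shape of a batch transcription output (field values are illustrative, and enriched views such as speaker labels appear only when the corresponding features are enabled):

transcribe_output = {
    "jobName": "my-transcription-job",
    "results": {
        # Text format: the whole transcript as a single block of text.
        "transcripts": [{"transcript": "Hello and welcome to the show."}],
        # Itemized format: timely ordered items with per-item metadata.
        "items": [
            {
                "start_time": "0.0",
                "end_time": "0.38",
                "alternatives": [{"confidence": "0.99", "content": "Hello"}],
                "type": "pronunciation",
            },
            {
                "alternatives": [{"confidence": "0.0", "content": "."}],
                "type": "punctuation",
            },
        ],
    },
    "status": "COMPLETED",
}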
Build machine learning-ready datasets from the Amazon SageMaker offline Feature Store using the Amazon SageMaker Python SDK
Amazon SageMaker Feature Store is a purpose-built service to store and retrieve feature data for use by machine learning (ML) models. Feature Store provides an online store capable of low-latency, high-throughput reads and writes, and an offline store that provides bulk access to all historical record data. Feature Store handles the synchronization of data between the online and offline stores.
Because model development is an iterative process, customers will frequently query the offline store and build various datasets for model training. Currently, there are several ways to access features in the offline store, including running SQL queries with Amazon Athena or using Spark SQL in Apache Spark. However, these patterns require writing ad hoc (and sometimes complex) SQL statements, which isn’t always suitable for the data scientist persona.
Feature Store recently extended the SageMaker Python SDK to make it easier to create datasets from the offline store. With this release, you can use a new set of methods in the SDK to create datasets without writing SQL queries. These new methods support common operations such as time travel, filtering duplicate records, and joining multiple feature groups while ensuring point-in-time accuracy.
In this post, we demonstrate how to use the SageMaker Python SDK to build ML-ready datasets without writing any SQL statements.
Solution overview
To demonstrate the new functionality, we work with two datasets: leads and web marketing metrics. These datasets can be used to build a model that predicts if a lead will convert into a sale given marketing activities and metrics captured for that lead.
The leads data contains information on prospective customers who are identified using Lead_ProspectID. The features for a lead (for example, LeadSource) can be updated over time, which results in a new record for that lead. The Lead_EventTime represents the time in which each record is created. The following screenshot shows an example of this data.
The web marketing metrics data tracks the engagement metrics for a lead, where each lead is identified using the Web_ProspectID. The Web_EventTime represents the time in which the record was created. Unlike the leads feature group, there is only one record per lead in this feature group. The following screenshot shows an example of this data.
We walk through the key parts of the sagemaker-feature-store-offline-sdk.ipynb notebook, which demonstrates the following steps:
- Create a dataset from a feature group.
- Join multiple feature groups.
- Create a point-in-time join between a feature group and a dataset based on a set of events at specific timestamps.
- Retrieve feature history within a specific time range.
- Retrieve features as of a specific timestamp.
Prerequisites
You need the following prerequisites:
- An AWS account.
- A SageMaker Jupyter notebook instance. Access the code from the GitHub repository and upload it to your notebook instance.
- You can also run the notebook in an Amazon SageMaker Studio environment, which is an IDE for ML development. You can clone the GitHub repo via a terminal inside the Studio environment using the following command:
We assume a feature group for the leads data has been created using the existing FeatureGroup.create method, and can be referenced using the variable base_fg. For more information on feature groups, refer to Create Feature Groups.
Create a dataset from a feature group
To create a dataset using the SageMaker SDK, we use the new FeatureStore class, which contains the create_dataset method. This method accepts a base feature group that may be joined with other feature groups or DataFrames. We start by providing the leads feature group as the base and an Amazon Simple Storage Service (Amazon S3) path to store the dataset:
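The corresponding SDK calls are not shown above; a hedged sketch, assuming base_fg is the leads feature group and using a placeholder S3 output path:

from sagemaker.session import Session
from sagemaker.feature_store.feature_store import FeatureStore

feature_store = FeatureStore(sagemaker_session=Session())

dataset_builder = feature_store.create_dataset(
    base=base_fg,  # the leads feature group created earlier
    output_path="s3://<your-s3-bucket>/feature-store-datasets",
)

# Run the Athena query behind the scenes and save the result as a CSV in S3
# (the to_csv_file call is described in the next paragraph).
csv_path, query_string = dataset_builder.to_csv_file()
print(csv_path)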
The create_dataset method returns a DatasetBuilder object, which can be used to generate a dataset from one or multiple feature groups (which we demonstrate in the next section). To create a simple dataset consisting of only the leads features, we invoke the to_csv_file method. This runs a query in Athena to retrieve the features from the offline store, and saves the results to the specified S3 path.
Join multiple feature groups
With the SageMaker SDK, you can easily join multiple feature groups to build a dataset. You can also perform join operations between an existing Pandas DataFrame to a single or multiple feature groups. The base feature group is an important concept for joins. The base feature group is the feature group that has other feature groups or the Pandas DataFrame joined to it.
While creating the dataset using the create_dataset function, we use the with_feature_group method, which performs an inner join between the base feature group and another feature group using the record identifier and the target feature name in the base feature group. In our example, the base feature group is the leads feature group, and the target feature group is the web marketing feature group. The with_feature_group method accepts the following arguments:
- feature_group – This is the feature group we are joining with. In our code sample, the target feature group is created by using the web marketing dataset.
- target_feature_name_in_base – The name of the feature in the base feature group that we're using as a key in the join. We use Lead_ProspectID as the record identifier for the base feature group.
- included_feature_names – This is the list of the feature names of the base feature group. We use this field to specify the features that we want to include in the dataset.
The following code shows an example of creating a dataset by joining the base feature group with the target feature group:
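A hedged sketch of that join (web_marketing_fg is an assumed variable name for the web marketing feature group; the builder and output path follow the earlier sketch):

dataset_builder = (
    feature_store.create_dataset(
        base=base_fg,
        output_path="s3://<your-s3-bucket>/feature-store-datasets",
    )
    .with_feature_group(
        feature_group=web_marketing_fg,                # assumed variable name
        target_feature_name_in_base="Lead_ProspectID",
    )
)

# Materialize the joined dataset as a Pandas DataFrame.
joined_df, query_string = dataset_builder.to_dataframe()
print(joined_df.head())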
You can extend the join operations to include multiple feature groups by adding the with_feature_group method at the end of the preceding code example and defining the required arguments for the new feature group. You can also perform join operations with an existing DataFrame by defining the base to be your existing Pandas DataFrame and joining it with the feature groups of interest. The following code sample shows how to create a dataset with an existing Pandas DataFrame and an existing feature group:
For more examples on these various configurations, refer to Create a Dataset from your Feature Groups.
Create a point-in-time join
One of the most powerful capabilities of this enhancement is to perform point-in-time joins simply and without the need to write complex SQL code. When building ML models, data scientists need to avoid data leakage or target leakage, which is accidentally using data during model training that wouldn’t be available at the time of prediction. For instance, if we’re trying to predict credit card fraud, we should exclude transactions that arrive after the fraudulent charge we’re trying to predict, otherwise the trained model could use this post-fraud information to alter the model, making it generalize less well.
Retrieval of point-in-time accurate feature data requires you to supply an entity DataFrame that provides a set of record IDs (or primary key) and corresponding event times that serve as the cutoff time for the event. This retrieval mechanism is sometimes referred to as row-level time travel, because it allows a different time constraint to be applied for each row key. To perform point-in-time joins with the SageMaker SDK, we use the Dataset Builder class and provide the entity DataFrame as the base argument to the constructor.
In the following code, we create a simple entity DataFrame with two records. We set the event times, used to indicate the cutoff time, near the middle of the time series data (mid-January 2023):
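A minimal sketch of such an entity DataFrame (the record IDs are placeholders; the cutoff times fall in mid-January 2023 as described):

import pandas as pd

entity_df = pd.DataFrame(
    {
        "Lead_ProspectID": ["prospect-0001", "prospect-0002"],  # placeholder IDs
        "Lead_EventTime": ["2023-01-15T00:00:00Z", "2023-01-15T00:00:00Z"],
    }
)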
When we use the point_in_time_accurate_join functionality with the create_dataset call, the internal query excludes all records with timestamps later than the cutoff times supplied, returning the latest feature values that would have been available at the time of the event:
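A hedged sketch of the point-in-time join (column names follow the entity DataFrame above; the single-record limit discussed below is included here as well):

pit_builder = (
    feature_store.create_dataset(
        base=entity_df,
        record_identifier_feature_name="Lead_ProspectID",
        event_time_identifier_feature_name="Lead_EventTime",
        output_path="s3://<your-s3-bucket>/feature-store-datasets",
    )
    .with_feature_group(
        feature_group=base_fg,
        target_feature_name_in_base="Lead_ProspectID",
    )
    .point_in_time_accurate_join()
    .with_number_of_recent_records_by_record_identifier(1)
)

pit_df, query_string = pit_builder.to_dataframe()
print(pit_df)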
Notice that there are only two records in the DataFrame returned by the point-in-time join. This is because we only submitted two record IDs in the entity DataFrame, one for each Lead_ProspectID we want to retrieve. The point-in-time criteria specifies that a record's event time (stored in the Lead_Eventtime field) must contain a value that is less than the cutoff time.
Additionally, we instruct the query to retrieve only the latest record that meets this criteria because we have applied the with_number_of_recent_records_by_record_identifier method. When used in conjunction with the point_in_time_accurate_join method, this allows the caller to specify how many records to return from those that meet the point-in-time join criteria.
Compare point-in-time join results with Athena query results
To verify the output returned by the SageMaker SDK point_in_time_accurate_join function, we compare it to the result of an Athena query. First, we create a standard Athena query using a SELECT statement tied to the specific table created by the Feature Store runtime. This table name can be found by referencing the table_name field after instantiating the athena_query from the FeatureGroup API:
The Athena query doesn't contain any point-in-time join semantics, so it returns all records that match the specified record_id (Lead_ProspectID).
Next, we use the Pandas library to sort the Athena results by event times for easy comparison. The records with timestamps later than the event times specified in the entity DataFrame (for example, 2023-01-15T00:00:00Z) submitted to the point_in_time_accurate_join don't show up in the point-in-time results. Because we additionally specified that we only want a single record from the preceding create_dataset code, we only get the latest record prior to the cutoff time. By comparing the SageMaker SDK results with the Athena query results, we see that the point-in-time join function returned the proper records.
Therefore, we have confidence that we can use the SageMaker SDK to perform row-level time travel and avoid target leakage. Furthermore, this capability works across multiple feature groups that may be refreshed on completely different timelines.
Retrieve feature history within a specific time range
We also want to demonstrate the use of specifying a time range window when joining the feature groups to form a dataset. The time window is defined using with_event_time_range, which accepts two inputs, starting_timestamp and ending_timestamp, and returns a dataset builder object. In our code sample, we set the retrieval time window for 1 full day from 2022-07-01 00:00:00 until 2022-07-02 00:00:00.
The following code shows how to create a dataset with the specified event time window while joining the base feature group with the target feature group:
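A hedged sketch of that call, reusing the feature store object and feature group handles from the earlier sketches:

from datetime import datetime

events_builder = (
    feature_store.create_dataset(
        base=base_fg,
        output_path="s3://<your-s3-bucket>/feature-store-datasets",
    )
    .with_feature_group(
        feature_group=web_marketing_fg,                # assumed variable name
        target_feature_name_in_base="Lead_ProspectID",
    )
    .with_event_time_range(
        starting_timestamp=datetime(2022, 7, 1, 0, 0, 0),
        ending_timestamp=datetime(2022, 7, 2, 0, 0, 0),
    )
)

events_df, query_string = events_builder.to_dataframe()
print(len(events_df))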
We also confirm the difference between the sizes of the dataset created using with_event_time_range by exporting to a Pandas DataFrame with the to_dataframe() method and displaying the data. Notice how the result set has only a fraction of the original 10,020 records, because it only retrieves records whose event_time is within the 1-day time period.
Retrieve features as of a specific timestamp
The DatasetBuilder as_of method retrieves features from a dataset that meet a timestamp-based constraint, which the caller provides as an argument to the function. This mechanism is useful for scenarios such as rerunning experiments on previously collected data, backtesting time series models, or building a dataset from a previous state of the offline store for data auditing purposes. This functionality is sometimes referred to as time travel because it essentially rolls back the data store to an earlier date and time. This time constraint is also referred to as the cutoff timestamp.
In our sample code, we first create the cutoff timestamp by reading the write_time value for the last record written to the Feature Store, the one written with put_record. Then we provide this cutoff timestamp to the DatasetBuilder as an argument to the as_of method:
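A hedged sketch of the as_of call (a literal cutoff timestamp is used here for illustration; the notebook derives it from the write_time of the last put_record call):

from datetime import datetime

cutoff_timestamp = datetime(2023, 1, 20, 0, 0, 0)  # placeholder cutoff

asof_builder = (
    feature_store.create_dataset(
        base=base_fg,
        output_path="s3://<your-s3-bucket>/feature-store-datasets",
    )
    .as_of(cutoff_timestamp)
)

asof_df, query_string = asof_builder.to_dataframe()
print(len(asof_df))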
It's important to note that the as_of method applies the time constraint to the internal write_time field, which is automatically generated by Feature Store. The write_time field represents the actual timestamp when the record is written to the data store. This is different from other methods like point_in_time_accurate_join and with_event_time_range that use the client-provided event_time field as a comparator.
Clean up
Be sure to delete all the resources created as part of this example to avoid incurring ongoing charges. This includes the feature groups and the S3 bucket containing the offline store data.
SageMaker Python SDK experience vs. writing SQL
The new methods in the SageMaker Python SDK allow you to quickly create datasets and move on to the training step of the ML lifecycle. To show the time and effort that can be saved, let's examine a use case where we need to join two feature groups while retrieving the features within a specified time frame. The following figure compares a Python query on the offline Feature Store with the SQL that would otherwise be required to create the same dataset.
As you can see, the same operation of joining two feature groups requires you to create a long, complex SQL query, whereas it can be accomplished using just the with_feature_group and with_event_time_range methods in the SageMaker Python SDK.
Conclusion
The new offline store methods in the SageMaker Python SDK allow you to query your offline features without having to write complex SQL statements. This provides a seamless experience for customers who are accustomed to writing Python code during model development. For more information about feature groups, refer to Create a Dataset From Your Feature Groups and Feature Store APIs: Feature Group.
The full example in this post can be found in the GitHub repository. Give it a try and let us know your feedback in the comments.
About the Authors
Paul Hargis has focused his efforts on machine learning at several companies, including AWS, Amazon, and Hortonworks. He enjoys building technology solutions and teaching people how to leverage them. Paul likes to help customers expand their machine learning initiatives to solve real-world problems. Prior to his role at AWS, he was lead architect for Amazon Exports and Expansions, helping amazon.com improve the experience for international shoppers.
Mecit Gungor is an AI/ML Specialist Solution Architect at AWS helping customers design and build AI/ML solutions at scale. He covers a wide range of AI/ML use cases for Telecommunication customers and currently focuses on Generative AI, LLMs, and training and inference optimization. He can often be found hiking in the wilderness or playing board games with his friends in his free time.
Tony Chen is a Machine Learning Solutions Architect at AWS, helping customers design scalable and robust machine learning capabilities in the cloud. As a former data scientist and data engineer, he leverages his experience to help tackle some of the most challenging problems organizations face with operationalizing machine learning.
Sovik Kumar Nath is an AI/ML solution architect with AWS. He has extensive experience in end-to-end designs and solutions for machine learning; business analytics within financial, operational, and marketing analytics; healthcare; supply chain; and IoT. Outside work, Sovik enjoys traveling and watching movies.
More-efficient approximate nearest-neighbor search
New approach speeds graph-based search by 20% to 60%, regardless of graph construction method. Read More
Fish-Farming Startup Casts AI to Make Aquaculture More Efficient, Sustainable
As a marine biology student, Josef Melchner always dreamed of spending his days cruising the oceans to find dolphins, whales and fish — but also “wanted to do something practical, something that would benefit the world,” he said. When it came time to choose a career, he dove head first into aquaculture.
He’s now CEO of GoSmart, an Israel-based company using AI and machine learning to make fish farming more efficient and sustainable.
A member of the NVIDIA Metropolis vision AI partner ecosystem and the NVIDIA Inception program for cutting-edge startups, GoSmart offers fully autonomous, energy-efficient systems — about the size of a soda bottle — that can be attached to aquaculture cages, ponds or tanks.
Powered by the NVIDIA Jetson platform for edge AI, these systems analyze the average weight and population distribution of the fish within the environment, as well as its temperature and oxygen levels.
This information is then provided to users through GoSmart’s software-as-a-service, which helps fish farmers more accurately and efficiently determine how much — and when best — to feed their fish and harvest them, all in real time.
“The parameters that GoSmart systems analyze are crucial for the fish feed regime,” Melchner said. “Managing the right levels of fish feed saves a lot of money for the farmers and reduces organic matter from excessive debris in the aqua environment.”
GoSmart systems have been deployed by Skretting, one of the world’s largest fish feed producers, as part of its initiative to sustainably expand production pipelines across eight countries in southern Europe and provide farmers with personalized, digitalized information.
Precision Farming for Sustainability
Founded in 2020, GoSmart is focused on fish farming because it’s focused on helping the environment.
“The world faces a lack of protein, and yet marine protein is often acquired the way it’s always been, with boats going out with fishing nets and long lines,” Melchner said. “While many alternative sources of protein — like cattle, pigs and chicken — are almost always farmed, about half of marine production still comes from wildlife.”
Overfishing in this manner negatively impacts the planet.
“It’s a critical issue that could affect us all eventually,” Melchner said. “Algae is one of the largest carbon sinks in the world. It consumes carbon from the atmosphere and releases oxygen, and overfishing impacts levels of algae in the ocean.”
Understanding this is what led Melchner to devote his life’s work to aquaculture, he said.
The GoSmart system uses lithium-ion batteries charged by solar panels, and is equipped with its own power-management software that enables it to autonomously enter sleep mode, shut down, wake up and conduct its work as appropriate.
Farming Efficiency Boosted With AI
GoSmart systems are built with sensors, cameras and NVIDIA Jetson modules, which enable AI at the edge to analyze factors of an environment that impact fish feeding, growth, health and welfare, as well as environmental pollution due to excessive organic matter dispersed in the water because of inefficient or inaccurate operations.
“We wanted to use the best processor for AI with high performance in a system that’s compact, submersible underwater and affordable for fish farmers, which is why we chose the Jetson series,” Melchner said.
GoSmart is now training its systems to analyze fish behavior and disease indicators — adding to current capabilities of determining fish weight, population distribution, temperature and oxygen levels. Since Jetson enables multiple AI algorithms to run in parallel, all of these characteristics can be analyzed simultaneously and in real time.
The company is also evaluating the powerful new Jetson Orin lineup of modules to take these capabilities to the next level.
To train its AI algorithms, the GoSmart team measured thousands of fish manually before deploying cameras to analyze millions more. “There was a lot of diving and many underwater experiments,” Melchner said.
For high-performance deep learning inference, GoSmart is looking to use the NVIDIA TensorRT software development kit and open-source NVIDIA Triton Inference Server software.
And as a member of the NVIDIA Metropolis and Inception programs, GoSmart works closely with NVIDIA engineers and is exploring latest-generation technologies. “This will help make our algorithms quicker and more efficient,” Melchner said.
GoSmart could help farmers reduce fish feed by up to 15%, according to Melchner. For some customers, GoSmart technology has shortened fish growth time and subsequent time to market by an entire month.
A Tidal Wave of Possibilities
Melchner predicts that in a few years, aquaculture will look completely different from how it is today.
“Our goal is to have our systems in every cage, every pond, every tank in the world — we want to cover the entire aquaculture industry,” he said.
In addition to integrating AI models that analyze fish behavior and disease, GoSmart is looking to expand its systems and eventually integrate its solution with an autonomous feeding barge that can give fish the exact amount of food they need, exactly when they need it.
Learn more about the NVIDIA Metropolis application framework, developer tools and partner ecosystem.
Technical Artist Builds Great Woolly Mammoth With NVIDIA Omniverse USD Composer This Week ‘In the NVIDIA Studio’
Editor’s note: This post is part of our weekly In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks, and demonstrates how NVIDIA Studio technology improves creative workflows.
Keerthan Sathya, a senior technical artist specializing in 3D, emerged trium-elephant In the NVIDIA Studio this week with the incredibly detailed, expertly constructed, jaw-droppingly beautiful animation Tiny Mammoth.
Sathya used a collection of popular 3D apps for the project — including Adobe Substance 3D Modeler, Painter and Autodesk 3ds Max — and completed staging, environment preparation, lighting and rendering in NVIDIA Omniverse with the USD Composer app.
Plus, Marvelous Designer software for making, editing and reusing 3D clothes just launched an NVIDIA Omniverse Connector.
Serving as a bridge, the Universal Scene Description (OpenUSD) framework enables users to import files directly from the Omniverse Nucleus server, and merge and update versions for use in different 3D apps. This saves users time and eliminates difficulties with imports and exports.
Learn how to use the Marvelous Designer Omniverse Connector by watching this tutorial, and find out how OpenUSD can improve artists’ creative workflows. Don’t miss Marvelous Designer’s upcoming community livestream demonstrating a workflow with the new Omniverse Connector on Wednesday, June 14.
One for the (Stone) Ages
A 14-year veteran in the creative community, Bangalore-based Sathya has long been enamored by animals and the concept of extinction. “Animals become extinct for various reasons,” he said. This fact inspired Sathya to create an environment design and tell a unique story using materials, lighting and animation.
“Traditional polygon modeling isn’t my cup of tea,” Sathya admitted. Instead, he used Adobe Substance 3D Modeler to seamlessly sculpt his models in 3D. His NVIDIA Studio HP ZBook laptop with NVIDIA RTX 5000 graphics unlocked GPU-accelerated filters to speed up material creation.
“Sculpting in virtual reality is so much fun, and so fast,” said the artist. “I could finalize models in just a few hours, all while eliminating all those pesky anatomy details!”
He also deployed Substance 3D Modeler’s automatic UV wrapping feature to generate UV islands once models were imported, making it easier to apply paints, textures and materials.
Sathya then moved the project to Autodesk 3ds Max, using its retopology tools to automatically optimize the geometry of his high-resolution models into a clean, quad-based mesh. This is useful for removing artifacts and other mesh issues before animation and rigging.
GPU-enabled, RTX-accelerated AI denoising with the default Autodesk Arnold renderer in the viewport allowed for smoother, highly interactive modeling.
“My NVIDIA GPU is an integral part of the artwork I create. Modeling, texturing, staging, painting and lighting is all accelerated by RTX.” — Keerthan Sathya
Adobe Substance 3D Painter was a “game changer” that allowed super-fast asset painting, Sathya said. RTX-accelerated light and ambient occlusion baking optimizes assets in mere seconds.
“You don’t necessarily need to create everything from scratch,” Sathya said. “Substance 3D Painter offers a wide range of materials, smart materials and smart masks which I used in my project, along with a whole collection of materials that helped me save a lot of time.”
“You can even paint multiple channels on multiple UDIMs in real time,” said Sathya. This means textures on various models can have different resolutions. For example, 4K-resolution textures can be used for priority details, while 2K resolution can be tapped for less important touches.
Sathya imported Tiny Mammoth into NVIDIA Omniverse, a platform for developing and building industrial metaverse applications, via the USD Composer app. This is where the artist accomplished staging, environment preparation, lighting and rendering.
“NVIDIA Omniverse is a great platform for artists to connect their desired 3D apps and collaborate. Plus, I really like AI-driven apps and features.” — Keerthan Sathya
Sathya marveled at the ease of setting up the scene in USD Composer. “Just using a few assets — instancing, arranging and kitbashing them to make a huge environment — is so satisfying and efficient,” he said.
Sathya said OpenUSD is “so much more than just a file format.”
“The OpenUSD workflow is great to work with,” he said. “I used OpenUSD pretty much for the whole project: environment assets and textures, foliage, lighting, still camera and adjustment of layers for each shot if necessary.”
With each OpenUSD layer and file stacked and authored accordingly, Sathya had the option to plug and play assets and creative workflow stages. Such versatility in 3D creative workflows is enabled by Omniverse and OpenUSD.
The artist heightened realism by painting moss trees with USD Composer’s paint tool. “It was easy to add those tiny details to make my artwork look better,” he said. Sathya then rotated the camera and adjusted the lighting until he met his desired result.
Due to the sheer size of the scene, Sathya used real-time rendering to add animations, a bit of fog and corrections for limited renders. “I like the idea of render passes, where you have a complete control of the scene while compositing, but it wasn’t necessary here,” he said.
Sathya exported the scene into Adobe After Effects for post-processing and additional visual effects, using more than 30 GPU-accelerated features and effects to add even more detail.
The artist reviewed video feedback in Adobe Rush. “It’s more convenient when I’m traveling or on my couch to arrange the shots and do some quick edits on my phone,” he said. Sathya completed advanced edits and final renders in Adobe Premiere Pro.
“Life is all about contrast,” Sathya said. “I’ve experienced failures and successes, complex and simple, good and bad, many more contrasts, all of which drip into my artwork to make it unique!”
Check out additional details on Tiny Mammoth and view Sathya’s complete portfolio on Behance.
Follow NVIDIA Studio on Instagram, Twitter and Facebook. Access tutorials on the Studio YouTube channel and get updates directly in your inbox by subscribing to the Studio newsletter.
Get started with NVIDIA Omniverse by downloading the standard license free, or learn how Omniverse Enterprise can connect your team. Developers can get started with Omniverse resources. Stay up to date on the platform by subscribing to the newsletter, and follow NVIDIA Omniverse on Instagram, Medium and Twitter.
For more, join the Omniverse community and check out the Omniverse forums, Discord server, Twitch and YouTube channels.