Meet the Omnivore: Animator Entertains and Explains With NVIDIA Omniverse

Editor’s note: This post is a part of our Meet the Omnivore series, which features individual creators and developers who use NVIDIA Omniverse to accelerate their 3D workflows and create virtual worlds.

Australian animator Marko Matosevic is taking dad jokes and breathing them into animated life with NVIDIA Omniverse, a virtual world simulation and collaboration platform for 3D workflows.

Matosevic’s work is multifaceted: he’s a VR developer by day and a YouTuber by night, a lover of filmmaking and animation, and has a soft spot for sci-fi and dad jokes.

The animated shorts on Matosevic’s YouTube channels Markom3D and Deadset Digital are the culmination of those varied interests and pursuits.

To bring the above film to life, Matosevic harnessed Reallusion iClone and Character Creator for character creation, Perception Neuron V3 for body motion capture, and NVIDIA Omniverse and Epic Games Unreal Engine 5 for rendering.

After noting a lack of YouTube tutorials on how to use animation software like Blender, Matosevic also set himself up as an instructor. He says his goal is to help users at all skill levels — from beginners to advanced users — learn new skills and techniques in a concise manner. The following video is a tutorial for the film above:

Matosevic says his ultimate goal in creating these animated shorts is to make his viewers “have a laugh.”

“Through rough times that we are going through at the moment, it is nice to just let yourself go for a few moments and enjoy a short animation,” he said. “Sure, the jokes may be terrible, and a lot of people groan, but I am a dad, and that is part of my responsibility.”

Optimized Rendering and Seamless Workflow

Initially, Matosevic relied primarily on Blender for 3D modeling and Unreal Engine 4 for his work. It wasn’t until he upgraded to an NVIDIA RTX 3080 Ti GPU that he saw the possibility of integrating the NVIDIA Omniverse platform into his toolbox.

“What really got me interested was that NVIDIA [Omniverse] had its own render engine, which I assumed that my 3080 would be optimized for,” he said. “I was able to create an amazing scene with little effort.”

With NVIDIA Omniverse, Matosevic can export whole scenes from Unreal Engine into Omniverse without having to deal with complex shaders, as he would’ve had to do if he were solely working in Blender.

Along with the iClone and Blender connectors, Matosevic uses NVIDIA Omniverse Machinima, an application that allows content creators to collaborate in real time to animate and manipulate characters along with their environments inside of virtual worlds.

“I like it because with a few simple clicks, I can start the rendering process and know that I am going to have something amazing when it is finished,” he said.

With Universal Scene Description (USD), an open-source 3D scene description for creating virtual worlds, these applications and connectors work seamlessly together to deliver elevated results.

“I have created animated short films using Blender and Unreal Engine 4, but Omniverse has just raised the quality to a new level,” he said.

Join In on the Creation

Creators across the world can download NVIDIA Omniverse for free, and enterprise teams can use the platform for their 3D projects.

Check out works from other artists using NVIDIA Omniverse and submit your own work with #MadeInOmniverse to be featured in the gallery.

Join us at SIGGRAPH 2022 to learn how Omniverse and other design and visualization solutions are driving breakthroughs in graphics workflows and GPU-accelerated software.

Connect your workflows to NVIDIA Omniverse with software from Adobe, Autodesk, Epic Games, Maxon, Reallusion and more.

Follow NVIDIA Omniverse on Instagram, Twitter, YouTube and Medium for additional resources and inspiration. Check out the Omniverse forums and join our Discord Server to chat with the community.



Case Study: PathAI Uses PyTorch to Improve Patient Outcomes with AI-powered Pathology

PathAI is the leading provider of AI-powered technology tools and services for pathology (the study of disease). Our platform was built to enable substantial improvements to the accuracy of diagnosis and the measurement of therapeutic efficacy for complex diseases, leveraging modern approaches in machine learning like image segmentation, graph neural networks, and multiple instance learning.

Traditional manual pathology is prone to subjectivity and observer variability that can negatively affect diagnoses and drug development trials. Before we dive into how we use PyTorch to improve our diagnosis workflow, let us first lay out the traditional analog pathology workflow without machine learning.

How Traditional Biopharma Works

There are many avenues that biopharma companies take to discover novel therapeutics or diagnostics. One of those avenues relies heavily on the analysis of pathology slides to answer a variety of questions: how does a particular cellular communication pathway work? Can a specific disease state be linked to the presence or lack of a particular protein? Why did a particular drug in a clinical trial work for some patients but not others? Might there be an association between patient outcomes and a novel biomarker?

To help answer these questions, biopharma companies rely on expert pathologists to analyze slides and help evaluate the questions they might have. 

As you might imagine, it takes an expert, board-certified pathologist to make accurate interpretations and diagnoses. In one study, a single biopsy result was given to 36 different pathologists, and the outcome was 18 different diagnoses varying in severity from no treatment to aggressive treatment necessary. Pathologists also often solicit feedback from colleagues in difficult edge cases. Given the complexity of the problem, even with expert training and collaboration, pathologists can still have a hard time making a correct diagnosis. This potential variance can be the difference between a drug being approved and it failing a clinical trial.

How PathAI utilizes machine learning to power drug development

PathAI develops machine learning models which provide insights for drug development R&D, for powering clinical trials, and for making diagnoses. To this end, PathAI leverages PyTorch for slide level inference using a variety of methods including graph neural networks (GNN) as well as multiple instance learning. In this context, “slides” refers to full size scanned images of glass slides, which are pieces of glass with a thin slice of tissue between them, stained to show various cell formations. PyTorch enables our teams using these different methodologies to share a common framework which is robust enough to work in all the conditions we need. PyTorch’s high level, imperative, and pythonic syntax allows us to prototype models quickly and then take those models to scale once we have the results we want. 

Multi-instance learning on gigabyte images

One of the uniquely challenging aspects of applying ML to pathology is the immense size of the images. These digital slides can often be 100,000 x 100,000 pixels or more in resolution and gigabytes in size. Loading the full image in GPU memory and applying traditional computer vision algorithms on them is an almost impossible task. It also takes both a considerable amount of time and resources to have a full slide image (100k x 100k) annotated, especially when annotators need to be domain experts (board-certified pathologists). We often build models to predict image-level labels, like the presence of cancer, on a patient slide which covers a few thousand pixels in the whole image. The cancerous area is sometimes a tiny fraction of the entire slide, which makes the ML problem similar to finding a needle in a haystack. On the other hand, some problems like the prediction of certain histological biomarkers require an aggregation of information from the whole slide which is again hard due to the size of the images. All these factors add significant algorithmic, computational, and logistical complexity when applying ML techniques to pathology problems.

Breaking down the image into smaller patches, learning patch representations, and then pooling those representations to predict an image-level label is one way to solve this problem as is depicted in the image below. One popular method for doing this is called Multiple Instance Learning (MIL). Each patch is considered an ‘instance’ and a set of patches forms a ‘bag’. The individual patch representations are pooled together to predict a final bag-level label. Algorithmically, the individual patch instances in the bag do not require labels and hence allow us to learn bag-level labels in a weakly-supervised way. They also use permutation invariant pooling functions which make the prediction independent of the order of patches and allows for an efficient aggregation of information. Typically, attention based pooling functions are used which not only allow for efficient aggregation but also provide attention values for each patch in the bag. These values indicate the importance of the corresponding patch in the prediction and can be visualized to better understand the model predictions. This element of interpretability can be very important to drive adoption of these models in the real world and we use variations like Additive MIL models to enable such spatial explainability. Computationally, MIL models circumvent the problem of applying neural networks to large image sizes since patch representations are obtained independently of the size of the image.
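To make the pooling step concrete, the following is a minimal sketch of an attention-based, permutation-invariant MIL pooling module in PyTorch. It is an illustrative example rather than PathAI’s production code; the class name and dimensions are assumptions chosen for the sketch.

import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """Attention-based pooling over a bag of patch embeddings (illustrative example)."""

    def __init__(self, embed_dim: int = 1024, attn_dim: int = 256):
        super().__init__()
        # Small MLP that produces one attention score per patch
        self.attention = nn.Sequential(
            nn.Linear(embed_dim, attn_dim),
            nn.Tanh(),
            nn.Linear(attn_dim, 1),
        )

    def forward(self, patch_embeddings: torch.Tensor):
        # patch_embeddings: (num_patches, embed_dim) for a single bag
        scores = self.attention(patch_embeddings)                # (num_patches, 1)
        weights = torch.softmax(scores, dim=0)                   # attention per patch
        bag_embedding = (weights * patch_embeddings).sum(dim=0)  # order-independent aggregation
        return bag_embedding, weights                            # weights can be visualized per patch

# Usage: pool 12 patch embeddings into a single bag-level representation
bag = torch.randn(12, 1024)
bag_repr, attention_weights = AttentionMILPooling()(bag)

Because the weighted sum is independent of patch order, the same module works for bags of any size, and the returned per-patch attention weights are what make the spatial visualizations described above possible.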

At PathAI, we use custom MIL models based on deep nets to predict image-level labels. The overview of this process is as follows:

  1. Select patches from a slide using different sampling approaches.
  2. Construct a bag of patches based on random sampling or heuristic rules.
  3. Generate patch representations for each instance based on pre-trained models or large-scale representation learning models.
  4. Apply permutation invariant pooling functions to get the final slide-level score.

Now that we have walked through some of the high-level details around MIL in PyTorch, let’s look at some code to see how simple it is to go from ideation to code in production with PyTorch. We begin by defining a sampler, transformations, and our MIL dataset:

# Create a bag sampler which randomly samples patches from a slide
bag_sampler = RandomBagSampler(bag_size=12)

# Setup the transformations
crop_transform = FlipRotateCenterCrop(use_flips=True)

# Create the dataset which loads patches for each bag
train_dataset = MILDataset(
  bag_sampler=bag_sampler,
  samples_loader=sample_loader,
  transform=crop_transform,
)

After we have defined our sampler and dataset, we need to define the model we will actually train with said dataset. PyTorch’s familiar model definition syntax makes this easy to do while also allowing us to create bespoke models at the same time.

classifier = DefaultPooledClassifier(hidden_dims=[256, 256], input_dims=1024, output_dims=1)

pooling = DefaultAttentionModule(
  input_dims=1024,
  hidden_dims=[256, 256],
  output_activation=StableSoftmax()
)

# Define the model which is a composition of the featurizer, pooling module and a classifier
model = DefaultMILGraph(featurizer=ShuffleNetV2(), classifier=classifier, pooling=pooling)

Since these models are trained end-to-end, they offer a powerful way to go directly from a gigapixel whole slide image to a single label. Due to their wide applicability to different biological problems, two aspects of their implementation and deployment are important:

  1. Configurable control over each part of the pipeline including the data loaders, the modular parts of the model, and their interaction with each other.
  2. Ability to rapidly iterate through the ideate-implement-experiment-productionize loop.

PyTorch has various advantages when it comes to MIL modeling. It offers an intuitive way to create dynamic computational graphs with flexible control flow which is great for rapid research experimentation. The map-style datasets, configurable sampler and batch-samplers allow us to customize how we construct bags of patches, enabling faster experimentation. Since MIL models are IO heavy, data parallelism and pythonic data loaders make the task very efficient and user friendly. Lastly, the object-oriented nature of PyTorch enables building of reusable modules which aid in the rapid experimentation, maintainable implementation and ease of building compositional components of the pipeline.
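As an example of the sampler flexibility mentioned above, bag construction can be expressed with a custom batch sampler. The sketch below is a simplified, hypothetical version; the class and parameter names are ours, not PathAI’s.

import random
from torch.utils.data import Sampler

class RandomBagBatchSampler(Sampler):
    """Yields one 'bag' per iteration: a list of randomly sampled patch indices (illustrative)."""

    def __init__(self, num_patches: int, bag_size: int, num_bags: int):
        self.num_patches = num_patches
        self.bag_size = bag_size
        self.num_bags = num_bags

    def __iter__(self):
        for _ in range(self.num_bags):
            yield random.sample(range(self.num_patches), self.bag_size)

    def __len__(self):
        return self.num_bags

# A map-style patch dataset (indexable by patch id) can then be wrapped with a DataLoader:
# loader = torch.utils.data.DataLoader(patch_dataset, batch_sampler=RandomBagBatchSampler(100_000, 12, 500))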

Exploring spatial tissue organization with GNNs in PyTorch

In both healthy and diseased tissue, the spatial arrangement and structure of cells can oftentimes be as important as the cells themselves. For example, when assessing lung cancers, pathologists try to look at the overall grouping and structure of tumor cells (do they form solid sheets? Or do they occur in smaller, localized clusters?) to determine if the cancer belongs to specific subtypes which can have vastly different prognosis. Such spatial relationships between cells and other tissue structures can be modeled using graphs to capture tissue topology and cellular composition at the same time. Graph Neural Networks (GNNs) allow learning spatial patterns within these graphs that relate to other clinical variables, for example overexpression of genes in certain cancers.

In late 2020, when PathAI started using GNNs on tissue samples, PyTorch had the best and most mature support for GNN functionality via the PyG package. This made PyTorch the natural choice for our team, given that we knew GNN models would be an important ML concept for us to explore.

One of the main value-adds of GNNs in the context of tissue samples is that the graph itself can uncover spatial relationships that would otherwise be very difficult to find by visual inspection alone. In our recent AACR publication, we showed that by using GNNs, we can better understand how the presence of immune cell aggregates (specifically tertiary lymphoid structures, or TLS) in the tumor microenvironment can influence patient prognosis. In this case, the GNN approach was used to predict expression of genes associated with the presence of TLS, and to identify histological features beyond the TLS region itself that are relevant to TLS. Such insights into gene expression are difficult to identify from tissue sample images when unassisted by ML models.

One of the most promising GNN variations we have had success with is self attention graph pooling. Let’s take a look at how we define our Self Attention Graph Pooling (SAGPool) model using PyTorch and PyG:

class SAGPool(torch.nn.Module):
  def __init__(self, ...):
    super().__init__()
    self.conv1 = GraphConv(in_features, hidden_features, aggr='mean')
    self.convs = torch.nn.ModuleList()
    self.pools = torch.nn.ModuleList()
    self.convs.extend([GraphConv(hidden_features, hidden_features, aggr='mean') for i in range(num_layers - 1)])
    self.pools.extend([SAGPooling(hidden_features, ratio, GNN=GraphConv, min_score=min_score) for i in range((num_layers) // 2)])
    self.jump = JumpingKnowledge(mode='cat')
    self.lin1 = Linear(num_layers * hidden_features, hidden_features)
    self.lin2 = Linear(hidden_features, out_features)
    self.out_activation = out_activation
    self.dropout = dropout

In the above code, we begin by defining a single convolutional graph layer and then add two module list layers which allow us to pass in a variable number of layers. We then take our empty module list and append a variable number of GraphConv layers followed by a variable number of SAGPooling layers. We finish up our SAGPool definition by adding a JumpingKnowledge Layer, two linear layers, our activation function, and our dropout value. PyTorch’s intuitive syntax allows us to abstract away the complexity of working with state of the art methods like SAG Poolings while also maintaining the common approach to model development we are familiar with.
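The snippet above omits the forward pass. A minimal sketch of what one might look like, assuming the standard PyG SAGPooling and JumpingKnowledge APIs and pooling after every other convolution, is shown below; it is meant to illustrate the flow, not to reproduce PathAI’s exact implementation.

import torch.nn.functional as F
from torch_geometric.nn import global_mean_pool

# Hypothetical forward method for the SAGPool module defined above
def forward(self, data):
    x, edge_index, batch = data.x, data.edge_index, data.batch
    x = F.relu(self.conv1(x, edge_index))
    xs = [global_mean_pool(x, batch)]                    # graph-level readout after the first conv
    for i, conv in enumerate(self.convs):
        x = F.relu(conv(x, edge_index))
        xs.append(global_mean_pool(x, batch))
        if i % 2 == 0 and i // 2 < len(self.pools):      # pool the graph after every other conv
            x, edge_index, _, batch, _, _ = self.pools[i // 2](x, edge_index, batch=batch)
    x = self.jump(xs)                                    # concatenate the per-layer readouts
    x = F.relu(self.lin1(x))
    x = F.dropout(x, p=self.dropout, training=self.training)
    return self.out_activation(self.lin2(x))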

Models like the SAGPool model described above are just one example of how GNNs with PyTorch allow us to explore new ideas. We also recently explored multimodal CNN-GNN hybrid models, which ended up being 20% more accurate than traditional pathologist consensus scores. These innovations and the interplay between traditional CNNs and GNNs are again enabled by the short research-to-production model development loop.

Improving Patient Outcomes

In order to achieve our mission of improving patient outcomes with AI-powered pathology, PathAI needs to rely on an ML development framework that (1) facilitates quick iteration and easy extension (i.e., model configuration as code) during the initial phases of development and exploration, (2) scales model training and inference to massive images, and (3) easily and robustly serves models for production uses of our products (in clinical trials and beyond). As we’ve demonstrated, PyTorch offers us all of these capabilities and more. We are incredibly excited about the future of PyTorch and cannot wait to see what other impactful challenges we can solve using the framework.


DeepMind’s latest research at ICML 2022

Starting this weekend, the thirty-ninth International Conference on Machine Learning (ICML 2022) is meeting from 17-23 July, 2022 at the Baltimore Convention Center in Maryland, USA, and will be running as a hybrid event. Researchers working across artificial intelligence, data science, machine vision, computational biology, speech recognition, and more are presenting and publishing their cutting-edge work in machine learning.

Achieve enterprise-grade monitoring for your Amazon SageMaker models using Fiddler

This is a guest blog post by Danny Brock, Rajeev Govindan and Krishnaram Kenthapadi at Fiddler AI.

Your Amazon SageMaker models are live. They’re handling millions of inferences each day and driving better business outcomes for your company. They’re performing exactly as well as the day they were launched.

Er, wait. Are they? Maybe. Maybe not.

Without enterprise-class model monitoring, your models may be decaying in silence. Your machine learning (ML) teams may never know that these models have actually morphed from miracles of revenue generation to liabilities making incorrect decisions that cost your company time and money.

Don’t fret. The solution is closer than you think.

Fiddler, an enterprise-class Model Performance Management solution available on the AWS Marketplace, offers model monitoring and explainable AI to help ML teams inspect and address a comprehensive range of model issues. Through model monitoring, model explainability, analytics, and bias detection, Fiddler provides your company with an easy-to-use single pane of glass to ensure your models are behaving as they should. And if they’re not, Fiddler also provides features that allow you to inspect your models to find the underlying root causes of performance decay.

This post shows how your MLOps team can improve data scientist productivity and reduce time to detect issues for your models deployed in SageMaker by integrating with the Fiddler Model Performance Management Platform in a few simple steps.

Solution overview

The following reference architecture highlights the primary points of integration. Fiddler exists as a “sidecar” to your existing SageMaker ML workflow.

The remainder of this post walks you through the steps to integrate your SageMaker model with Fiddler’s Model Performance Management Platform:

  1. Ensure your model has data capture enabled.
  2. Create a Fiddler trial environment.
  3. Register information about your model in your Fiddler environment.
  4. Create an AWS Lambda function to publish SageMaker inferences to Fiddler.
  5. Explore Fiddler’s monitoring capabilities in your Fiddler trial environment.

Prerequisites

This post assumes that you have set up SageMaker and deployed a model endpoint. To learn how to configure SageMaker for model serving, refer to Deploy Models for Inference. Some examples are also available on the GitHub repo.

Ensure your model has data capture enabled

On the SageMaker console, navigate to your model’s serving endpoint and ensure you have enabled data capture into an Amazon Simple Storage Service (Amazon S3) bucket. This stores the inferences (requests and responses) your model makes each day as JSON lines files (.jsonl) in Amazon S3.
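If you deployed your endpoint without data capture, you can enable it when deploying with the SageMaker Python SDK. The following is a brief sketch; the bucket path, instance type, and sampling percentage are placeholders, and model is assumed to be an existing SageMaker Model object.

from sagemaker.model_monitor import DataCaptureConfig

data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,                               # capture every request and response
    destination_s3_uri="s3://<your-bucket>/datacapture",   # placeholder S3 location
)

# 'model' is assumed to be an existing sagemaker.model.Model object
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    data_capture_config=data_capture_config,
)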

Create a Fiddler trial environment

From the fiddler.ai website, you can request a free trial. After filling out a quick form, Fiddler will contact you to understand the specifics of your model performance management needs and will have a trial environment ready for you in a few hours. You can expect a dedicated environment like https://yourcompany.try.fiddler.ai.

Register information about your model in your Fiddler environment

Before you can begin publishing events from your SageMaker hosted model into Fiddler, you need to create a project within your Fiddler trial environment and provide Fiddler details about your model through a step called model registration. If you want to use a preconfigured notebook from within Amazon SageMaker Studio rather than copy and paste the following code snippets, you can reference the Fiddler quickstart notebook on GitHub. Studio provides a single web-based visual interface where you can perform all ML development steps.

First, you must install the Fiddler Python client in your SageMaker notebook and instantiate the Fiddler client. You can get the AUTH_TOKEN from the Settings page in your Fiddler trial environment.

# Install the fiddler client
!pip install fiddler-client

# Connect to the Fiddler Trial Environment
import fiddler as fdl
import pandas as pd

fdl.__version__

URL = 'https://yourcompany.try.fiddler.ai'
ORG_ID = 'yourcompany'
AUTH_TOKEN = 'UUID-Token-Here-Found-In-Your-Fiddler-Env-Settings-Page'

client = fdl.FiddlerApi(URL, ORG_ID, AUTH_TOKEN)

Next, create a project within your Fiddler trial environment:

# Create Project
PROJECT_ID = 'credit_default'  # update this with your project name
DATASET_ID = f'{PROJECT_ID}_dataset'
MODEL_ID = f'{PROJECT_ID}_model'

client.create_project(PROJECT_ID)

Now upload your training dataset. The notebook also provides a sample dataset to run Fiddler’s explainability algorithms and as a baseline for monitoring metrics. The dataset is also used to generate the schema for this model in Fiddler.

# Upload Baseline Dataset
df_baseline = pd.read_csv('<your-training-file.csv>')

dataset_info = fdl.DatasetInfo.from_dataframe(df_baseline, max_inferred_cardinality=1000)

upload_result = client.upload_dataset(PROJECT_ID,
                                      dataset={'baseline': df_baseline},
                                      dataset_id=DATASET_ID,
                                      info=dataset_info)

Lastly, before you can start publishing inferences to Fiddler for monitoring, root cause analysis, and explanations, you need to register your model. Let’s first create a model_info object that contains the metadata about your model:

# Update task from the list below if your model task is not binary classification
model_task = 'binary' 

if model_task == 'regression':
    model_task_fdl = fdl.ModelTask.REGRESSION
    
elif model_task == 'binary':
    model_task_fdl = fdl.ModelTask.BINARY_CLASSIFICATION

elif model_task == 'multiclass':
    model_task_fdl = fdl.ModelTask.MULTICLASS_CLASSIFICATION

elif model_task == 'ranking':
    model_task_fdl = fdl.ModelTask.RANKING

    
# Specify column types
target = 'TARGET'  # change this to your target variable
outputs = ['prediction']
features = ['<add your feature list here>']
     
# Generate ModelInfo
model_info = fdl.ModelInfo.from_dataset_info(
    dataset_info=dataset_info,
    dataset_id=DATASET_ID,
    model_task=model_task_fdl,
    target=target,
    outputs=outputs,
    features=features,
    binary_classification_threshold=.125,  # update this if your task is not a binary classification
    description='<model-description>',
    display_name='<model-display-name>'
)
model_info

Then you can register the model using your new model_info object:

# Register Info about your model with Fiddler
client.register_model(
    project_id=PROJECT_ID,
    dataset_id=DATASET_ID,
    model_id=MODEL_ID,
    model_info=model_info
)

Great! Now you can publish some events to Fiddler in order to observe the model’s performance.

Create a Lambda function to publish SageMaker inferences to Fiddler

With the simple-to-deploy serverless architecture of Lambda, you can quickly build the mechanism required to move your inferences from the S3 bucket you set up earlier into your newly provisioned Fiddler trial environment. This Lambda function is responsible for opening any new JSONL event log files in your model’s S3 bucket, parsing and formatting the JSONL content into a dataframe, and then publishing that dataframe of events to your Fiddler trial environment. The following screenshot shows the code details of our function.
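The exact function code is available from Fiddler’s documentation, as noted below; the following is only a simplified sketch of its general shape, assuming S3-notification-style events and JSON-encoded capture payloads. The field names and the Fiddler publishing call are illustrative assumptions, so defer to Fiddler’s published code for the precise API.

import json
import os

import boto3
import pandas as pd
import fiddler as fdl

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Locate the newly created JSONL capture file (S3 notification event shape;
    # EventBridge events carry the bucket and key under event["detail"] instead)
    record = event["Records"][0]["s3"]
    obj = s3.get_object(Bucket=record["bucket"]["name"], Key=record["object"]["key"])
    lines = obj["Body"].read().decode("utf-8").splitlines()

    # Flatten each capture record into a row of inputs plus the model output
    # (assumes JSON-encoded payloads; CSV-encoded captures need different parsing)
    rows = []
    for line in lines:
        capture = json.loads(line)["captureData"]
        inputs = json.loads(capture["endpointInput"]["data"])
        output = json.loads(capture["endpointOutput"]["data"])
        rows.append({**inputs, "prediction": output})
    events_df = pd.DataFrame(rows)

    # Publish the batch of events to the Fiddler trial environment
    # (method name and parameters are illustrative; see Fiddler's docs for the exact call)
    client = fdl.FiddlerApi(os.environ["FIDDLER_URL"],
                            os.environ["ORG_ID"],
                            os.environ["AUTH_TOKEN"])
    client.publish_events_batch(project_id=os.environ["PROJECT_ID"],
                                model_id=os.environ["MODEL_ID"],
                                batch_source=events_df)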

The Lambda function needs to be configured to trigger off of newly created files in your S3 bucket. The following tutorial guides you through creating an Amazon EventBridge trigger that invokes the Lambda function whenever a file is uploaded to Amazon S3. The following screenshot shows our function’s trigger configuration. This makes it simple to ensure that any time your model makes new inferences, those events stored in Amazon S3 are loaded into Fiddler to drive the model observability your company needs.

To simplify this further, the code for this Lambda function is publicly available from Fiddler’s documentation site. This code example currently works for binary classification models with structured inputs. If you have model types with different features or tasks, please contact Fiddler for assistance with minor changes to the code.

The Lambda function needs to make reference to the Fiddler Python client. Fiddler has created a publicly available Lambda layer that you can reference to ensure that the import fiddler as fdl step works seamlessly. You can reference this layer via an ARN in the us-west-2 Region: arn:aws:lambda:us-west-2:079310353266:layer:fiddler-client-0814:1, as shown in the following screenshot.

You also need to specify Lambda environment variables so the Lambda function knows how to connect to your Fiddler trial environment, and what the inputs and outputs are within the .jsonl files being captured by your model. The following screenshot shows a list of the required environment variables, which are also on Fiddler’s documentation site. Update the values for the environment variables to match your model and dataset.

Explore Fiddler’s monitoring capabilities in your Fiddler trial environment

You’ve done it! With your baseline data, model, and traffic connected, you can now explain data drift, outliers, model bias, data issues, and performance blips, and share dashboards with others. Complete your journey by watching a demo about the model performance management capabilities you have introduced to your company.

The example screenshots below provide a glimpse of model insights like drift, outlier detection, local point explanations, and model analytics that you will find in your Fiddler trial environment.

Conclusion

This post highlighted the need for enterprise-class model monitoring and showed how you can integrate your models deployed in SageMaker with the Fiddler Model Performance Management Platform in just a few steps. Fiddler offers functionality for model monitoring, explainable AI, bias detection, and root cause analysis, and is available on the AWS Marketplace. By providing your MLOps team with an easy-to-use single pane of glass to ensure your models are behaving as expected and to identify the underlying root causes of performance degradation, Fiddler can help improve data scientist productivity and reduce time to detect and resolve issues.

If you would like to learn more about Fiddler, visit fiddler.ai, or if you would prefer to set up a personalized demo and technical discussion, email sales@fiddler.ai.


About the Authors

Danny Brock is a Sr Solutions Engineer at Fiddler AI. Danny is long tenured in the analytics and ML space, running presales and post-sales teams for startups like Endeca and Incorta. He founded his own big data analytics consulting company, Branchbird, in 2012.

Rajeev Govindan is a Sr Solutions Engineer at Fiddler AI. Rajeev has extensive experience in sales engineering and software development at several enterprise companies, including AppDynamics.

Krishnaram Kenthapadi is the Chief Scientist of Fiddler AI. Previously, he was a Principal Scientist at Amazon AWS AI, where he led the fairness, explainability, privacy, and model understanding initiatives in the Amazon AI platform, and prior to that, he held roles at LinkedIn AI and Microsoft Research. Krishnaram received his PhD in Computer Science from Stanford University in 2006.


Towards Reliability in Deep Learning Systems

Deep learning models have made impressive progress in vision, language, and other modalities, particularly with the rise of large-scale pre-training. Such models are most accurate when applied to test data drawn from the same distribution as their training set. However, in practice, the data confronting models in real-world settings rarely match the training distribution. In addition, the models may not be well-suited for applications where predictive performance is only part of the equation. For models to be reliable in deployment, they must be able to accommodate shifts in data distribution and make useful decisions in a broad array of scenarios.

In “Plex: Towards Reliability Using Pre-trained Large Model Extensions”, we present a framework for reliable deep learning as a new perspective about a model’s abilities; this includes a number of concrete tasks and datasets for stress-testing model reliability. We also introduce Plex, a set of pre-trained large model extensions that can be applied to many different architectures. We illustrate the efficacy of Plex in the vision and language domains by applying these extensions to the current state-of-the-art Vision Transformer and T5 models, which results in significant improvement in their reliability. We are also open-sourcing the code to encourage further research into this approach.

  • Uncertainty — Dog vs. Cat classifier: Plex can say “I don’t know” for inputs that are neither cat nor dog.
  • Robust Generalization — A naïve model is sensitive to spurious correlations (“destination”), whereas Plex is robust.
  • Adaptation — Plex can actively choose the data from which it learns to improve performance more quickly.

Framework for Reliability
First, we explore how to understand the reliability of a model in novel scenarios. We posit three general categories of requirements for reliable machine learning (ML) systems: (1) they should accurately report uncertainty about their predictions (“know what they don’t know”); (2) they should generalize robustly to new scenarios (distribution shift); and (3) they should be able to efficiently adapt to new data (adaptation). Importantly, a reliable model should aim to do well in all of these areas simultaneously out-of-the-box, without requiring any customization for individual tasks.

  • Uncertainty reflects the imperfect or unknown information that makes it difficult for a model to make accurate predictions. Predictive uncertainty quantification allows a model to compute optimal decisions and helps practitioners recognize when to trust the model’s predictions, thereby enabling graceful failures when the model is likely to be wrong.
  • Robust Generalization involves an estimate or forecast about an unseen event. We investigate four types of out-of-distribution data: covariate shift (when the input distribution changes between training and application and the output distribution is unchanged), semantic (or class) shift, label uncertainty, and subpopulation shift.

    Types of distribution shift using an illustration of ImageNet dogs.
  • Adaptation refers to probing the model’s abilities over the course of its learning process. Benchmarks typically evaluate on static datasets with pre-defined train-test splits. However, in many applications, we are interested in models that can quickly adapt to new datasets and efficiently learn with as few labeled examples as possible.
Reliability framework. We propose to simultaneously stress-test the “out-of-the-box” model performance (i.e., the predictive distribution) across uncertainty, robust generalization, and adaptation benchmarks, without any customization for individual tasks.

We apply 10 types of tasks to capture the three reliability areas — uncertainty, robust generalization, and adaptation — and to ensure that the tasks measure a diverse set of desirable properties in each area. Together the tasks comprise 40 downstream datasets across vision and natural language modalities: 14 datasets for fine-tuning (including few-shot and active learning–based adaptation) and 26 datasets for out-of-distribution evaluation.

Plex: Pre-trained Large Model Extensions for Vision and Language
To improve reliability, we develop ViT-Plex and T5-Plex, building on large pre-trained models for vision (ViT) and language (T5), respectively. A key feature of Plex is more efficient ensembling based on submodels that each make a prediction that is then aggregated. In addition, Plex swaps each architecture’s linear last layer with a Gaussian process or heteroscedastic layer to better represent predictive uncertainty. These ideas were found to work very well for models trained from scratch at the ImageNet scale. We train the models with varying sizes up to 325 million parameters for vision (ViT-Plex L) and 1 billion parameters for language (T5-Plex L) and pre-training dataset sizes up to 4 billion examples.

The following figure illustrates Plex’s performance on a select set of tasks compared to the existing state-of-the-art. The top-performing model for each task is usually a specialized model that is highly optimized for that problem. Plex achieves new state-of-the-art on many of the 40 datasets. Importantly, Plex achieves strong performance across all tasks using the out-of-the-box model output without requiring any custom designing or tuning for each task.

The largest T5-Plex (top) and ViT-Plex (bottom) models evaluated on a highlighted set of reliability tasks compared to specialized state-of-the-art models. The spokes display different tasks, quantifying metric performance on various datasets.


Plex in Action for Different Reliability Tasks
We highlight Plex’s reliability on select tasks below.

Open Set Recognition
We show Plex’s output in the case where the model must defer prediction because the input is one that the model does not support. This task is known as open set recognition. Here, predictive performance is part of a larger decision-making scenario where the model may abstain from making certain predictions. In the following figure, we show structured open set recognition: Plex returns multiple outputs and signals the specific part of the output about which the model is uncertain and is likely out-of-distribution.

Structured open set recognition enables the model to provide nuanced clarifications. Here, T5-Plex L can recognize fine-grained out-of-distribution cases where the request’s vertical (i.e., coarse-level domain of service, such as banking, media, productivity, etc.) and domain are supported but the intent is not.

Label Uncertainty
In real-world datasets, there is often inherent ambiguity behind the ground truth label for each input. For example, this may arise due to human rater ambiguity for a given image. In this case, we’d like the model to capture the full distribution of human perceptual uncertainty. We showcase Plex below on examples from an ImageNet variant we constructed that provides a ground truth label distribution.

Plex for label uncertainty. Using a dataset we construct called ImageNet ReaL-H, ViT-Plex L demonstrates the ability to capture the inherent ambiguity (probability distribution) of image labels.

Active Learning
We examine a large model’s ability to not only learn over a fixed set of data points, but also participate in knowing which data points to learn from in the first place. One such task is known as active learning, where at each training step, the model selects promising inputs among a pool of unlabeled data points on which to train. This procedure assesses an ML model’s label efficiency, where label annotations may be scarce, and so we would like to maximize performance while minimizing the number of labeled data points used. Plex achieves a significant performance improvement over the same model architecture without pre-training. The state-of-the-art approach in the literature, BASE, reaches around 63% accuracy at 100K examples, a lower accuracy that also requires more labeled examples.

Active learning on ImageNet1K. ViT-Plex L is highly label efficient compared to a baseline that doesn’t leverage pre-training. We also find that active learning’s data acquisition strategy is more effective than uniformly selecting data points at random.

Learn more
Check out our paper here and an upcoming contributed talk about the work at the ICML 2022 pre-training workshop on July 23, 2022. To encourage further research in this direction, we are open-sourcing all code for training and evaluation as part of Uncertainty Baselines. We also provide a demo that shows how to use a ViT-Plex model checkpoint. Layer and method implementations use Edward2.

Acknowledgements
We thank all the co-authors for contributing to the project and paper, including Andreas Kirsch, Clara Huiyi Hu, Du Phan, D. Sculley, Honglin Yuan, Jasper Snoek, Jeremiah Liu, Jie Ren, Joost van Amersfoort, Karan Singhal, Kehang Han, Kelly Buchanan, Kevin Murphy, Mark Collier​​, Mike Dusenberry, Neil Band, Nithum Thain, Rodolphe Jenatton, Tim G. J. Rudner, Yarin Gal, Zachary Nado, Zelda Mariet, Zi Wang, and Zoubin Ghahramani. We also thank Anusha Ramesh, Ben Adlam, Dilip Krishnan, Ed Chi, Rif A. Saurous, and Sharat Chikkerur for their helpful feedback, and Tom Small and Ajay Nainani for helping with visualizations.


Track your ML experiments end to end with Data Version Control and Amazon SageMaker Experiments

Data scientists often work towards understanding the effects of various data preprocessing and feature engineering strategies in combination with different model architectures and hyperparameters. Doing so requires you to cover large parameter spaces iteratively, and it can be overwhelming to keep track of previously run configurations and results while keeping experiments reproducible.

This post walks you through an example of how to track your experiments across code, data, artifacts, and metrics by using Amazon SageMaker Experiments in conjunction with Data Version Control (DVC). We show how you can use DVC side by side with Amazon SageMaker processing and training jobs. We train different CatBoost models on the California housing dataset from the StatLib repository, and change holdout strategies while keeping track of the data version with DVC. In each individual experiment, we track input and output artifacts, code, and metrics using SageMaker Experiments.

SageMaker Experiments

SageMaker Experiments is an AWS service for tracking machine learning (ML) experiments. The SageMaker Experiments Python SDK is a high-level interface to this service that helps you track experiment information using Python.

The goal of SageMaker Experiments is to make it as simple as possible to create experiments, populate them with trials, add tracking and lineage information, and run analytics across trials and experiments.

When discussing SageMaker Experiments, we refer to the following concepts:

  • Experiment – A collection of related trials. You add trials to an experiment that you want to compare together.
  • Trial – A description of a multi-step ML workflow. Each step in the workflow is described by a trial component.
  • Trial component – A description of a single step in an ML workflow, such as data cleaning, feature extraction, model training, or model evaluation.
  • Tracker – A Python context manager for logging information about a single trial component (for example, parameters, metrics, or artifacts).

Data Version Control

Data Version Control (DVC) is a new type of data versioning, workflow, and experiment management software that builds upon Git (although it can work standalone). DVC reduces the gap between established engineering toolsets and data science needs, allowing you to take advantage of new features while reusing existing skills and intuition.

Data science experiment sharing and collaboration can be done through a regular Git flow (commits, branching, tagging, pull requests) the same way it works for software engineers. With Git and DVC, data science and ML teams can version experiments, manage large datasets, and make projects reproducible.

DVC has the following features:

  • DVC is a free, open-source command line tool.
  • DVC works on top of Git repositories and has a similar command line interface and flow as Git. DVC can also work standalone, but without versioning capabilities.
  • Data versioning is enabled by replacing large files, dataset directories, ML models, and so on with small metafiles (easy to handle with Git). These placeholders point to the original data, which is decoupled from source code management.
  • You can use on-premises or cloud storage to store the project’s data separate from its code base. This is how data scientists can transfer large datasets or share a GPU-trained model with others.
  • DVC makes data science projects reproducible by creating lightweight pipelines using implicit dependency graphs, and by codifying the data and artifacts involved.
  • DVC is platform agnostic. It runs on all major operating systems (Linux, macOS, and Windows), and works independently of the programming languages (Python, R, Julia, shell scripts, and so on) or ML libraries (Keras, TensorFlow, PyTorch, Scipy, and more) used in the project.
  • DVC is quick to install and doesn’t require special infrastructure, nor does it depend on APIs or external services. It’s a standalone CLI tool.
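
Because each data version is tied to a Git revision, any script can read a specific version of a dataset programmatically through the DVC Python API. The following is a brief sketch; the repository URL, file path, and branch name are placeholders.

import dvc.api
import pandas as pd

# Resolve the storage location (for example, an S3 URL) of a DVC-tracked file
# at a specific Git revision (branch, tag, or commit)
url = dvc.api.get_url(
    "dataset/train/california_train.csv",                   # placeholder DVC-tracked path
    repo="https://git.example.com/sagemaker-dvc-sample",    # placeholder repository URL
    rev="dvc-trial-single-file",                            # branch holding this data version
)

# Or stream the file contents directly, without knowing where DVC stored it
with dvc.api.open("dataset/train/california_train.csv",
                  repo="https://git.example.com/sagemaker-dvc-sample",
                  rev="dvc-trial-single-file") as f:
    df = pd.read_csv(f)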

SageMaker Experiments and DVC sample

The following GitHub sample shows how to use DVC within the SageMaker environment. In particular, we look at how to build a custom image with DVC libraries installed by default to provide a consistent development environment to your data scientists in Amazon SageMaker Studio, and how to run DVC alongside SageMaker managed infrastructure for processing and training. Furthermore, we show how to enrich SageMaker tracking information with data versioning information from DVC, and visualize them within the Studio console.

The following diagram illustrates the solution architecture and workflow.

Build a custom Studio image with DVC already installed

In this GitHub repository, we explain how to create a custom image for Studio that has DVC already installed. The advantage of creating an image and making it available to all Studio users is that it creates a consistent environment for the Studio users, which they could also run locally. Although the sample is based on AWS Cloud9, you can also build the container on your local machine as long as you have Docker installed and running. This sample is based on the following Dockerfile and environment.yml. The resulting Docker image is stored in Amazon Elastic Container Registry (Amazon ECR) in your AWS account. See the following code:

# Login to ECR
aws --region ${REGION} ecr get-login-password | docker login --username AWS --password-stdin ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/smstudio-custom

# Create the ECR repository
aws --region ${REGION} ecr create-repository --repository-name smstudio-custom

# Build the image - it might take a few minutes to complete this step
docker build . -t ${IMAGE_NAME} -t ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/smstudio-custom:${IMAGE_NAME}

# Push the image to ECR
docker push ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/smstudio-custom:${IMAGE_NAME}

You can now create a new Studio domain or update an existing Studio domain that has access to the newly created Docker image.

We use AWS Cloud Development Kit (AWS CDK) to create the following resources via AWS CloudFormation:

  • A SageMaker execution role with the right permissions to your new or existing Studio domain
  • A SageMaker image and SageMaker image version from the Docker image conda-env-dvc-kernel that we created earlier
  • An AppImageConfig that specifies how the kernel gateway should be configured
  • A Studio user (data-scientist-dvc) with the correct SageMaker execution role and the custom Studio image available to it

For detailed instructions, refer to Associate a custom image to SageMaker Studio.

Run the lab

To run the lab, complete the following steps:

  1. In the Studio domain, launch Studio for the data-scientist-dvc user.
  2. Choose the Git icon, then choose Clone a Repository.
  3. Enter the URL of the repository (https://github.com/aws-samples/amazon-sagemaker-experiments-dvc-demo) and choose Clone.
  4. In the file browser, choose the amazon-sagemaker-experiments-dvc-demo repository.
  5. Open the dvc_sagemaker_script_mode.ipynb notebook.
  6. For Custom Image, choose the image conda-env-dvc-kernel.
  7. Choose Select.

Configure DVC for data versioning

We create a subdirectory where we prepare the data: sagemaker-dvc-sample. Within this subdirectory, we initialize a new Git repository and set the remote to a repository we create in AWS CodeCommit. The goal is to have DVC configurations and files for data tracking versioned in this repository. However, Git offers native capabilities to manage subprojects via, for example, git submodules and git subtrees, and you can extend this sample to use any of the aforementioned tools that best fit your workflow.

The main advantage of using CodeCommit with SageMaker in our case is its integration with AWS Identity and Access Management (IAM) for authentication and authorization, meaning we can use IAM roles to push and pull data without the need to fetch credentials (or SSH keys). Setting the appropriate permissions on the SageMaker execution role also allows the Studio notebook and the SageMaker training and processing job to interact securely with CodeCommit.

Although you can replace CodeCommit with any other source control service, such as GitHub, GitLab, or Bitbucket, you need to consider how to handle the credentials for your system. One possibility is to store these credentials in AWS Secrets Manager and fetch them at run time from the Studio notebook as well as from the SageMaker processing and training jobs.

Init DVC

Process and train with DVC and SageMaker

In this section, we explore two different approaches to tackle our problem and how we can keep track of the two tests using SageMaker Experiments according to the high-level conceptual architecture we showed you earlier.

Set up a SageMaker experiment

To track this test in SageMaker, we need to create an experiment. We need to also define the trial within the experiment. For the sake of simplicity, we just consider one trial for the experiment, but you can have any number of trials within an experiment, for example, if you want to test different algorithms.

We create an experiment named DEMO-sagemaker-experiments-dvc with two trials, dvc-trial-single-file and dvc-trial-multi-files, each representing a different version of the dataset.

Let’s create the DEMO-sagemaker-experiments-dvc experiment:

from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

experiment_name = 'DEMO-sagemaker-experiments-dvc'

# create the experiment if it doesn't exist
try:
    my_experiment = Experiment.load(experiment_name=experiment_name)
    print("existing experiment loaded")
except Exception as ex:
    if "ResourceNotFound" in str(ex):
        my_experiment = Experiment.create(
            experiment_name = experiment_name,
            description = "How to integrate DVC"
        )
        print("new experiment created")
    else:
        print(f"Unexpected {ex}=, {type(ex)}")
        print("Dont go forward!")
        raise

Test 1: Generate single files for training and validation

In this section, we create a processing script that fetches the raw data directly from Amazon Simple Storage Service (Amazon S3) as input; processes it to create the train, validation, and test datasets; and stores the results back to Amazon S3 using DVC. Furthermore, we show how you can track output artifacts generated by DVC with SageMaker when running processing and training jobs and via SageMaker Experiments.

First, we create the dvc-trial-single-file trial and add it to the DEMO-sagemaker-experiments-dvc experiment. By doing so, we keep all trial components related to this test organized in a meaningful way.

first_trial_name = "dvc-trial-single-file"

try:
    my_first_trial = Trial.load(trial_name=first_trial_name)
    print("existing trial loaded")
except Exception as ex:
    if "ResourceNotFound" in str(ex):
        my_first_trial = Trial.create(
            experiment_name=experiment_name,
            trial_name=first_trial_name,
        )
        print("new trial created")
    else:
        print(f"Unexpected {ex}=, {type(ex)}")
        print("Dont go forward!")
        raise

Use DVC in a SageMaker processing job to create the single file version

In this section, we create a processing script that gets the raw data directly from Amazon S3 as input using the managed data loading capability of SageMaker; processes it to create the train, validation, and test datasets; and stores the results back to Amazon S3 using DVC. It’s very important to understand that when using DVC to store data to Amazon S3 (or pull data from Amazon S3), we’re losing SageMaker managed data loading capabilities, which can potentially have an impact on performance and costs of our processing and training jobs, especially when working with very large datasets. For more information on the different SageMaker native input mode capabilities, refer to Access Training Data.

Finally, we unify DVC tracking capabilities with SageMaker tracking capabilities when running processing jobs via SageMaker Experiments.

The processing script expects the address of the Git repository and the branch we want to create to store the DVC metadata, passed via environment variables. The datasets themselves are stored in Amazon S3 by DVC. Although environment variables are automatically tracked in SageMaker Experiments and visible in the trial component parameters, we might want to enrich the trial components with further information, which then becomes available for visualization in the Studio UI using a tracker object. In our case, the trial component parameters include the following:

  • DVC_REPO_URL
  • DVC_BRANCH
  • USER
  • data_commit_hash
  • train_test_split_ratio

The preprocessing script clones the Git repository; generates the train, validation, and test datasets; and syncs them using DVC. As mentioned earlier, when using DVC, we can’t take advantage of native SageMaker data loading capabilities. Aside from the performance penalties we might suffer on large datasets, we also lose the automatic tracking capabilities for the output artifacts. However, thanks to the tracker and the DVC Python API, we can compensate for these shortcomings, retrieve such information at run time, and store it in the trial component with little effort. The added value of doing so is having a single view of the input and output artifacts that belong to this specific processing job.

The full preprocessing Python script is available in the GitHub repo.

with Tracker.load() as tracker:
    tracker.log_parameters({"data_commit_hash": commit_hash})
    for file_type in file_types:
        path = dvc.api.get_url(
            f"{data_path}/{file_type}/california_{file_type}.csv",
            repo=dvc_repo_url,
            rev=dvc_branch
        )
        tracker.log_output(name=f"california_{file_type}",value=path)

SageMaker gives us the possibility to run our processing script on container images managed by AWS that are optimized to run on the AWS infrastructure. If our script requires additional dependencies, we can supply a requirements.txt file. When we start the processing job, SageMaker uses pip to install all the libraries we need (for example, DVC-related libraries). If you need tighter control of all the libraries installed on the containers, you can bring your own container in SageMaker, for example for processing and training.

We now have all the ingredients to run our SageMaker processing job:

  • A processing script that can process several arguments (--train-test-split-ratio) and two environment variables (DVC_REPO_URL and DVC_BRANCH)
  • A requirements.txt file
  • A Git repository (in CodeCommit)
  • A SageMaker experiment and trial
from sagemaker.processing import FrameworkProcessor, ProcessingInput
from sagemaker.sklearn.estimator import SKLearn

dvc_repo_url = "codecommit::{}://sagemaker-dvc-sample".format(region)
dvc_branch = my_first_trial.trial_name

script_processor = FrameworkProcessor(
    estimator_cls=SKLearn,
    framework_version='0.23-1',
    instance_count=1,
    instance_type='ml.m5.xlarge',
    env={
        "DVC_REPO_URL": dvc_repo_url,
        "DVC_BRANCH": dvc_branch,
        "USER": "sagemaker"
    },
    role=role
)

experiment_config={
    "ExperimentName": my_experiment.experiment_name,
    "TrialName": my_first_trial.trial_name
}

We then run the processing job with the preprocessing-experiment.py script, experiment_config, dvc_repo_url, and dvc_branch we defined earlier.

%%time

script_processor.run(
    code='./source_dir/preprocessing-experiment.py',
    dependencies=['./source_dir/requirements.txt'],
    inputs=[ProcessingInput(source=s3_data_path, destination="/opt/ml/processing/input")],
    experiment_config=experiment_config,
    arguments=["--train-test-split-ratio", "0.2"]
)

The processing job takes approximately 5 minutes to complete. Now you can view the trial details for the single file dataset.

The following screenshot shows where you can find the stored information within Studio. Note the values for dvc-trial-single-file in DVC_BRANCH, DVC_REPO_URL, and data_commit_hash on the Parameters tab.

SageMaker Experiments parameters tab

Also note the input and output details on the Artifacts tab.

SageMaker Experiments artifacts tab

Create an estimator and fit the model with single file data version

To use DVC integration inside a SageMaker training job, we pass a dvc_repo_url and dvc_branch as environment variables when we create the Estimator object.

We train on the dvc-trial-single-file branch first.

When pulling data with DVC, we use the following dataset structure:

dataset
    |-- train
    |   |-- california_train.csv
    |-- test
    |   |-- california_test.csv
    |-- validation
    |   |-- california_validation.csv

Now we create a Scikit-learn Estimator using the SageMaker Python SDK. This allows us to specify the following:

  • The path to the Python source file, which should be run as the entry point to training.
  • The IAM role that controls permissions for accessing Amazon S3 and CodeCommit data and running SageMaker functions.
  • A list of dictionaries that define the metrics used to evaluate the training jobs.
  • The number and type of training instances. We use one ml.m5.large instance.
  • Hyperparameters that are used for training.
  • Environment variables to use during the training job. We use DVC_REPO_URL, DVC_BRANCH, and USER.
metric_definitions = [{'Name': 'median-AE', 'Regex': "AE-at-50th-percentile: ([0-9.]+).*$"}]

hyperparameters = {
    "learning_rate": 1,
    "depth": 6
}

estimator = SKLearn(
    entry_point='train.py',
    source_dir='source_dir',
    role=role,
    metric_definitions=metric_definitions,
    hyperparameters=hyperparameters,
    instance_count=1,
    instance_type='ml.m5.large',
    framework_version='0.23-1',
    base_job_name='training-with-dvc-data',
    environment={
        "DVC_REPO_URL": dvc_repo_url,
        "DVC_BRANCH": dvc_branch,
        "USER": "sagemaker"
    }
)

experiment_config={
    "ExperimentName": my_experiment.experiment_name,
    "TrialName": my_first_trial.trial_name
}

We call the fit method of the Estimator with the experiment_config we defined earlier to start the training.

%%time
estimator.fit(experiment_config=experiment_config)

The training job takes approximately 5 minutes to complete. The logs show the following lines, indicating the files pulled by DVC:

Running dvc pull command
A       train/california_train.csv
A       test/california_test.csv
A       validation/california_validation.csv
3 files added and 3 files fetched
Starting the training.
Found train files: ['/opt/ml/input/data/dataset/train/california_train.csv']
Found validation files: ['/opt/ml/input/data/dataset/validation/california_validation.csv']

Test 2: Generate multiple files for training and validation

We create a new dvc-trial-multi-files trial and add it to the current DEMO-sagemaker-experiments-dvc experiment.

second_trial_name = "dvc-trial-multi-files"
try:
    my_second_trial = Trial.load(trial_name=second_trial_name)
    print("existing trial loaded")
except Exception as ex:
    if "ResourceNotFound" in str(ex):
        my_second_trial = Trial.create(
            experiment_name=experiment_name,
            trial_name=second_trial_name,
        )
        print("new trial created")
    else:
        print(f"Unexpected {ex}=, {type(ex)}")
        print("Dont go forward!")
        raise

Unlike the first processing script, we now split the original dataset into multiple files for training and validation and store the DVC metadata on a different branch.

You can explore the second preprocessing Python script on GitHub.
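
Purely as an illustration of the idea (not the real script), splitting the training data into several CSV files and versioning them on a dedicated branch could look like the following sketch, where the local paths, the number of splits, and the Git/DVC commands in the comments are assumptions:

import numpy as np
import pandas as pd

# Load the single training file produced earlier (path assumed from the dataset layout)
train_df = pd.read_csv("dataset/train/california_train.csv", header=None)

# Split it into 5 smaller CSV files (the number of splits is illustrative)
for i, chunk in enumerate(np.array_split(train_df, 5), start=1):
    chunk.to_csv(f"dataset/train/california_train_{i}.csv", header=False, index=False)

# The DVC metadata would then be committed on its own branch, roughly:
#   git checkout -b dvc-trial-multi-files
#   dvc add dataset
#   git add dataset.dvc .gitignore
#   git commit -m "multi-file dataset version"
#   dvc push
#   git push --set-upstream origin dvc-trial-multi-files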

%%time

script_processor.run(
    code='./source_dir/preprocessing-experiment-multifiles.py',
    dependencies=['./source_dir/requirements.txt'],
    inputs=[ProcessingInput(source=s3_data_path, destination="/opt/ml/processing/input")],
    experiment_config=experiment_config,
    arguments=["--train-test-split-ratio", "0.1"]
)

The processing job takes approximately 5 minutes to complete. Now you can view the trial details for the multi-file dataset.

The following screenshots show where you can find the stored information within SageMaker Experiments in the Trial components section within the Studio UI. Note the values for dvc-trial-multi-files in DVC_BRANCH, DVC_REPO_URL, and data_commit_hash on the Parameters tab.

SageMaker multi files experiments parameters tab

You can also review the input and output details on the Artifacts tab.

SageMaker multi files experiments artifacts tab

We now train on the dvc-trial-multi-files branch. When pulling data with DVC, we use the following dataset structure:

dataset
    |-- train
    |   |-- california_train_1.csv
    |   |-- california_train_2.csv
    |   |-- california_train_3.csv
    |   |-- california_train_4.csv
    |   |-- california_train_5.csv
    |-- test
    |   |-- california_test.csv
    |-- validation
    |   |-- california_validation_1.csv
    |   |-- california_validation_2.csv
    |   |-- california_validation_3.csv

As before, we create a new Scikit-learn Estimator, this time pointing at the dvc-trial-multi-files branch and the second trial, and start the training job.
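
The following is a minimal sketch of that step, assuming the same settings as the first Estimator with only the DVC branch and the trial name switched to the second trial:

# Point the training job at the second data version and the second trial
dvc_branch = my_second_trial.trial_name

estimator = SKLearn(
    entry_point='train.py',
    source_dir='source_dir',
    role=role,
    metric_definitions=metric_definitions,
    hyperparameters=hyperparameters,
    instance_count=1,
    instance_type='ml.m5.large',
    framework_version='0.23-1',
    base_job_name='training-with-dvc-data',
    environment={
        "DVC_REPO_URL": dvc_repo_url,
        "DVC_BRANCH": dvc_branch,  # now dvc-trial-multi-files
        "USER": "sagemaker"
    }
)

experiment_config={
    "ExperimentName": my_experiment.experiment_name,
    "TrialName": my_second_trial.trial_name
}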

%%time

estimator.fit(experiment_config=experiment_config)

The training job takes approximately 5 minutes to complete. In the training job logs output to the notebook, you can see the following lines, indicating the files pulled by DVC:

Running dvc pull command
A       validation/california_validation_2.csv
A       validation/california_validation_1.csv
A       validation/california_validation_3.csv
A       train/california_train_4.csv
A       train/california_train_5.csv
A       train/california_train_2.csv
A       train/california_train_3.csv
A       train/california_train_1.csv
A       test/california_test.csv
9 files added and 9 files fetched
Starting the training.
Found train files: ['/opt/ml/input/data/dataset/train/california_train_2.csv', '/opt/ml/input/data/dataset/train/california_train_5.csv', '/opt/ml/input/data/dataset/train/california_train_4.csv', '/opt/ml/input/data/dataset/train/california_train_1.csv', '/opt/ml/input/data/dataset/train/california_train_3.csv']
Found validation files: ['/opt/ml/input/data/dataset/validation/california_validation_2.csv', '/opt/ml/input/data/dataset/validation/california_validation_1.csv', '/opt/ml/input/data/dataset/validation/california_validation_3.csv']

Host your model in SageMaker

After you train your ML model, you can deploy it using SageMaker. To deploy a persistent, real-time endpoint that makes one prediction at a time, we use SageMaker real-time hosting services.

from sagemaker.serializers import CSVSerializer

predictor = estimator.deploy(1, "ml.t2.medium", serializer=CSVSerializer())

First, we download the latest test dataset to the development notebook in Studio. For this purpose, we can use dvc.api.read() to load the raw data that was stored in Amazon S3 by the SageMaker processing job.

import io
import dvc.api

raw = dvc.api.read(
    "dataset/test/california_test.csv",
    repo=dvc_repo_url,
    rev=dvc_branch
)

Then we prepare the data with pandas and call predictor.predict to invoke the SageMaker endpoint we created earlier with the test data and get predictions.

import pandas as pd

test = pd.read_csv(io.StringIO(raw), sep=",", header=None)
X_test = test.iloc[:, 1:].values
y_test = test.iloc[:, 0:1].values

predicted = predictor.predict(X_test)
for i in range(len(predicted)-1):
    print(f"predicted: {predicted[i]}, actual: {y_test[i][0]}")

Delete the endpoint

You should delete endpoints when they’re no longer in use, because you’re billed for the time they’re deployed (for more information, see Amazon SageMaker Pricing). Make sure to delete the endpoint to avoid unexpected costs.

predictor.delete_endpoint()

Clean up

Before you remove all the resources you created, make sure that all apps are deleted for the data-scientist-dvc user, including all KernelGateway apps and the default JupyterServer app.
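
You can delete these apps in the Studio UI, or list and delete them with the AWS SDK. The following is a rough sketch using boto3, where the domain ID is a placeholder to fill in:

import boto3

sm_client = boto3.client("sagemaker")
domain_id = "<your-sagemaker-studio-domain-id>"  # placeholder
user_profile_name = "data-scientist-dvc"

# List all apps for the user profile (KernelGateway and JupyterServer)
apps = sm_client.list_apps(
    DomainIdEquals=domain_id,
    UserProfileNameEquals=user_profile_name
)["Apps"]

# Delete every app that is not already deleted or being deleted
for app in apps:
    if app["Status"] not in ("Deleted", "Deleting"):
        sm_client.delete_app(
            DomainId=domain_id,
            UserProfileName=user_profile_name,
            AppType=app["AppType"],
            AppName=app["AppName"],
        )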

Then you can destroy the AWS CDK stack by running the following command:

cdk destroy

If you used an existing domain, also run the following commands:

# inject your DOMAIN_ID into the configuration file
sed -i 's/<your-sagemaker-studio-domain-id>/'"$DOMAIN_ID"'/' ../update-domain-no-custom-images.json
# update the sagemaker studio domain
aws --region ${REGION} sagemaker update-domain --cli-input-json file://../update-domain-no-custom-images.json

Conclusion

In this post, you walked through an example of how to track your experiments across code, data, artifacts, and metrics by using SageMaker Experiments and SageMaker processing and training jobs in conjunction with DVC. We created a Docker image containing DVC, which was required to use Studio as the development notebook, and showed how you can use processing and training jobs with DVC. We prepared two versions of the data and used DVC to manage them with Git. Then you used SageMaker Experiments to track the processing and training with the two versions of the data in order to have a unified view of parameters, artifacts, and metrics in a single pane of glass. Finally, you deployed the model to a SageMaker endpoint and used a testing dataset from the second dataset version to invoke the SageMaker endpoint and get predictions.

As a next step, you can extend the existing notebook, introduce your own feature engineering strategy, and use DVC and SageMaker to run your experiments. Let’s go build!


About the Authors

Paolo Di Francesco is a solutions architect at AWS. He has experience in telecommunications and software engineering. He is passionate about machine learning and is currently focusing on using his experience to help customers reach their goals on AWS, in particular in discussions around MLOps. Outside of work, he enjoys playing football and reading.

Eitan Sela is a Machine Learning Specialist Solutions Architect with Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them build and operate machine learning solutions on AWS. In his spare time, Eitan enjoys jogging and reading the latest machine learning articles.

Read More

DALL·E 2: Extending Creativity

As part of our DALL·E 2 research preview, more than 3,000 artists from more than 118 countries have incorporated DALL·E into their creative workflows. The artists in our early access group have helped us discover new uses for DALL·E and have served as key voices as we’ve made decisions about DALL·E’s features.

Creative professionals using DALL·E today range from illustrators, AR designers, and authors to chefs, landscape architects, tattoo artists, and clothing designers, to directors, sound designers, dancers, and many more. The list expands every day.

Below are just a few examples of how artists are making use of this new technology:

The Orrigos

James and his wife Kristin Orrigo created the Big Dreams Virtual Tour which focuses on creating special memories and a positive distraction for pediatric cancer patients around the world. The Orrigos have worked in top children’s hospitals around the country and now virtually meet up with families, bringing children’s ideas to life through personalized cartoons, music videos, and mobility-friendly video games. Orrigo says children and teens light up when they see their DALL·E-generated creations, and they are ready to be the star of a story brought to life from their imaginations.

Most recently, Orrigo and his team have been working with a young cancer survivor named Gianna to create a music video featuring herself as Wonder Woman fighting her enemy — the cancer cells.

“We didn’t know what an osteosarcoma villain would look like so we turned to DALL·E as our creative outlet. DALL·E gave us a huge amount of inspiration,” Orrigo said. “Unfortunately, Gianna knows this battle all too well. But we are celebrating her victory by bringing her cartoon music video to real life to spread awareness about pediatric cancer and to give Gianna an unforgettable memory.”

Stefan Kutzenberger

In a project conceived by Austrian artist Stefan Kutzenberger and Clara Blume, Head of the Open Austria Art + Tech Lab in San Francisco, DALL·E was used to bring the poetry of revolutionary painter Egon Schiele into the visual world. Schiele died at 28, but Kutzenberger — a curator at the Leopold Museum in Vienna, which houses the world’s largest collection of Schiele’s works — believes that DALL·E gives the world a glimpse of what Schiele’s later work might have been like if he had had a chance to keep painting. The DALL·E works will be exhibited alongside Schiele’s collection in the Leopold Museum in the coming months.

DALL·E 2: Extending Creativity
“A painting of tall trees walking along a road, with chirping and trembling birds in front of a white sky in them in the style of Austrian expressionist Egon Schiele”

DALL·E 2: Extending Creativity
“Lakeshore Without Sun, 1913 in the expressionist style of Egon Schiele”

Karen X Cheng

Karen X Cheng, a director known for sharing her creative experiments on Instagram, created the latest cover of Cosmopolitan Magazine using DALL·E. In her post unveiling the process, Karen compared working with DALL·E to a musician playing an instrument.

“Like any musical instrument, you get better with practice…and knowing what words to use to communicate? That’s a community effort — it’s come from the past few months of me talking to other DALL·E artists on Twitter / Discord / DM. I learned from other artists that you could ask for specific camera angles. Lens types. Lighting conditions. We’re all figuring it out together, how to play this beautiful new instrument.”

Tom Aviv

Israeli chef and MasterChef winner Tom Aviv is debuting his first U.S. restaurant in Miami in a few months and has used DALL·E for menu, decor, and ambiance inspiration — and his team has also used it in designing the way they plate dishes.

It was Tom’s sister and business partner Kim’s idea to run a family recipe for chocolate mousse through DALL·E.

“It’s called Picasso chocolate mousse, and it’s a tribute to my parents,” she explained. “DALL·E elevates it to another level — it is just phenomenal. It changed the dish from your usual chocolate mousse to something that does service to the name and to our parents. It blew our minds.”

Branja is expected to open in October.

Don Allen Stevenson III

XR creator Don Allen Stevenson III has used DALL·E to paint physical paintings, design wearable sneakers, and create characters to transform into 3D renders for AR filters. “It feels like having a genie in a bottle that I can collaborate with,” he said.

Stevenson’s real passion is education — specifically making technology accessible to more people. He hosts a weekly Instagram Live teaching people about DALL·E and other tools for creative innovation.

“Digital tools freed me up to have a life that I am proud of and love,” Stevenson says. “I want to help other people to see creative technology like DALL·E the way that I see it — so they can become free as well.”

Danielle Baskin

Danielle Baskin, a multimedia artist, says she plans to incorporate DALL·E generations across a number of different art forms: product design, illustration, theater, and alternative realities.

“It’s a mood board, vibe generator, illustrator, art curator, and museum docent,” Baskin says. “It’s an infinite museum where I can choose which private collections I want to visit. Sometimes I need to repair the private collections (tweak my prompt writing). Sometimes the collection isn’t quite there. But sometimes the docent (DALL·E 2) shows me a surprising new collection I didn’t know existed.”

August Kamp

August Kamp, a multimedia artist and musician, says she views DALL·E as a sort of imagination interpreter.

“Conceptualizing one’s ideas is one of the most gatekept processes in the modern world,” Kamp says. “Everyone has ideas — not everyone has access to training or encouragement enough to confidently render them. I feel empowered by the ability to creatively iterate on a feeling or idea, and I deeply believe that all people deserve that sense of empowerment.”

Chad Nelson

Chad Nelson has been using DALL·E to create highly detailed creatures — and he’s made more than 100 of them.

“I had a vision for a cast of charming woodland critters, each oozing with personality and emotional nuance,” Nelson said. His characters range from “a red furry monster looks in wonder at a burning candle” to “a striped hairy monster shakes its hips dancing underneath a disco ball” — each crafted to capture the most human thing of all — feelings.

“DALL·E is the most advanced paint brush I’ve ever used,” Nelson says. “As mind-blowing and amazing as DALL·E is, like the paint brush, it too must be guided by the artist. It still needs that creative spark, that lightbulb in the mind to innovate — to create that something from nothing.”


OpenAI