Revving Up the Future of Transportation: NVIDIA DRIVE Hyperion Takes the Wheel at Auto Shanghai

Shanghai is once again showing why it’s called the “Magic City” as more than 1,000 exhibitors from 20 countries dazzle the automotive world this week at the highly anticipated International Automobile Industry Exhibition.

With nearly 1,500 vehicles on display, the 20th edition of Auto Shanghai is showcasing the newest AI-powered cars and mobility solutions using the NVIDIA DRIVE Hyperion compute platform built on the DRIVE Orin system-on-chip (SoC).

NVIDIA-Powered Vehicles in China and Beyond

SAIC Motor’s Rising Auto brand unveiled the recently launched mid-to-large luxury pure electric sedan F7 and mid-to-large luxury pure electric SUV R7 at the show, both of which feature an advanced intelligent-driving system built on NVIDIA DRIVE Orin. Equipped with a swappable battery pack, the F7 is built to go the distance with up to a 413-mile range.

SAIC Motor’s Rising Auto electric F7 sedan. Image courtesy of Rising Auto.

New energy vehicle (NEV) maker GAC AION showcased its flagship model Hyper GT, which is available for pre-sale. Equipped with the high-performance NVIDIA DRIVE Orin SoC, the car is designed to support advanced level 2+ driving capabilities in high-speed environments. With an array of aerodynamic features that minimize drag, the Hyper GT has a drag coefficient (Cd) of just 0.19, the lowest of any production car in the world, GAC AION claims.

GAC AION flagship model Hyper GT. Image courtesy of GAC AION.

At the show, EV maker XPENG showcased its full range of models including the all-new P7i ultra-smart coupe and XPENG G9, a super-fast-charging, intelligent SUV, built on the high-performance DRIVE Orin centralized compute architecture to deliver AI capabilities that are continuously upgradable through over-the-air updates.

XPENG also debuted the first model under its SEPA 2.0 Soaring architecture — the XPENG G6 — also powered by NVIDIA DRIVE Orin. As an intelligent driving coupe SUV, the XPENG G6 is based on a high-voltage 800V silicon carbide platform that XPENG launched globally, and is also equipped with its proprietary XNGP intelligent assisted driving system.

XPENG G6 coupe SUV. Image courtesy of XPENG.

Elsewhere on the show floor, IM Motors, a joint venture among China’s SAIC Motor, Alibaba and Shanghai’s Zhangjiang Group, exhibited its flagship LS7 SUV and the L7 sedan powered by NVIDIA DRIVE Orin. IM Motors reports that it has launched its capability for highway navigation on autopilot — users will be able to experience it on the L7 and LS7 soon.

NEV maker Human Horizons officially took the wraps off its HiPhi Y SUV, the latest addition to its lineup of intelligent vehicles. The vehicle’s marquee features include a wing-door design, a China light-duty vehicle test cycle (CLTC) range of more than 497 miles on a single charge, and an autonomous-driving system powered by DRIVE Orin.

This is the second model stemming from HiPhi’s cooperation with NVIDIA. The NEV company last summer launched HiPhi Z, the digital grand tourer equipped with the HiPhi Pilot intelligent driver-assistance system. The system features NVIDIA DRIVE Orin and a 30+ sensor suite to support functions such as assisted driving.

As China NEV makers look to expand their global footprint, HiPhi also announced it will bring its vehicles to select countries in Western Europe and Scandinavia. This includes HiPhi X, HiPhi Z and HiPhi Y.

HiPhi Y SUV. Image courtesy of Human Horizons.

Premium smart electric vehicle company NIO revealed the ES6 SUV during Auto Shanghai. NIO’s family of smart vehicles is being showcased at the NIO booth, including an updated version of its ET7, along with the ES8, EC7, ES7, ES6 and ET5. All these vehicles run on its proprietary Adam supercomputer, which is powered by four NVIDIA DRIVE Orin SoCs. This quad configuration delivers 1,016 TOPS of performance to enable advanced driver-assistance systems and a point-to-point autonomous driving experience.

NIO also recently announced it’s equipping its third-generation powerswap station with two laser radars and two NVIDIA DRIVE Orin SoCs, for a total computing power of 508 TOPS. This hardware supports the Automatic Summon and Swap feature, in which the station communicates with the vehicle and automatically navigates it into position for a battery swap.

NIO ES6 SUV. Image courtesy of NIO.

Li Auto displayed three of its flagship models, including the L9, L8 Max and L7 Max. These models feature dual NVIDIA DRIVE Orin SoCs to power its intelligent-driving system, the Ideal AD Max, delivering 508 TOPS of computing power to help the vehicle efficiently process data from high-definition cameras, lidars, millimeter-wave radars and ultrasonic sensors in real time.

Li Auto also released an 800V supercharged pure electric solution, which can travel 248 miles after charging for 10 minutes. The automaker demonstrated AD Max 3.0, which is also powered by NVIDIA DRIVE Orin. The company’s urban NOA navigation assisted-driving system will be released in the second quarter of this year. And by the end of the year, this all-scenario navigation-assisted driving system will cover 100 cities in China.

Swedish premium automaker Volvo Cars debuted its all-electric EX90 in China. Revealed in November last year, the state-of-the-art, software-defined SUV features a new powertrain and cutting-edge technology to deliver the ultimate in safety and intelligence with the AI compute of NVIDIA DRIVE Orin and Xavier platforms.

During its Volvo Cars Tech Day held earlier this week, the automaker also unveiled its EX90 Excellence, the top-of-the-line and limited edition of the EX90. The four-seater SUV features a two-tone exterior, outstanding comfort inside and an intelligent technology base powered by NVIDIA DRIVE.

Volvo Cars EX90 Excellence SUV. Image courtesy of Volvo Cars.

Lotus brought its three champion products to Auto Shanghai: the Lotus Eletre, the first electric Hyper-SUV; the Evija, the first pure electric hypercar; and the Emira, the last internal combustion engine sports car from Lotus. The Lotus Eletre has a cockpit-forward design inspired by the Evija and embodies the essence of aesthetics. It features an immersive digital cockpit, a battery range of up to 403 miles in the Eletre S+ version and autonomous-driving capabilities powered by NVIDIA DRIVE Orin.

Tier-1 Manufacturers, Emerging Mobility Companies Also Spotlighted

Tier-1 suppliers for the auto industry and emerging self-driving companies also presented their latest offerings at Auto Shanghai.

Desay SV is pushing the boundaries of autonomous-driving performance with its latest solutions for smart cockpit, intelligent driving and connected services. The mobility company demonstrated its DRIVE Orin-based smart cockpit solution, which is part of the ICPAurora Intelligent Centralized Computing Platform, for powering new forms of in-vehicle infotainment, including 3D gaming and Android-based systems.

Desay SV’s integration of cabin and autonomous-driving functions shows how all intelligent vehicle functions can be centralized on a single NVIDIA DRIVE computer. Announced last September, the NVIDIA DRIVE Thor SoC is the successor to DRIVE Orin, delivering 2,000 teraflops of performance and designed to centralize automated driving and AI cockpit functions on a single platform. DRIVE Thor is targeting automakers’ 2025 models.

Baidu Apollo showcased its level-2 driver-assisted product Apollo City Driving Max intelligent driving computing unit, featuring a Baidu-developed intelligent-driving domain controller and powered by two NVIDIA DRIVE Orin SoCs to process camera and lidar sensor data for enhanced safety behind the wheel. The company reports the Apollo City Driving Max will be available in volume production to automakers globally in 2023.

Momenta launched Mpilot Pro, its new advanced driver-assistance solution. The solution adopts the energy-efficient DRIVE Orin to meet the computing performance requirements of mainstream mid-range models. In addition, DRIVE Orin’s compatible architecture scales from level 2+ ADAS to level 5 autonomous driving.

The NVIDIA DRIVE ecosystem can be found throughout this year’s Auto Shanghai — showcasing how NVIDIA is leading the charge toward a future of intelligent vehicles that deliver higher levels of safety, convenience and enjoyment on the road.

Training a recommendation model with dynamic embeddings

Posted by Thushan Ganegedara (GDE), Haidong Rong (Nvidia), Wei Wei (Google)

Modern recommenders rely heavily on embeddings to create vector representations of each user and candidate item. These embeddings can then be used to calculate the similarity between users and items, so that users are recommended candidate items that are more interesting and relevant. But when working with data at scale, particularly in an online machine learning setting, embedding tables can grow dramatically, accumulating millions (and sometimes billions) of items. At this scale, it becomes impossible to store these embedding tables in memory. Furthermore, a large portion of the items might be rarely seen, so it does not make sense to keep dedicated embeddings for them. A better solution is to represent those rare items with one common embedding. This can dramatically reduce the size of the embedding table at only a small cost in model performance. This is the main motivation behind dynamic embedding tables.

TensorFlow’s built-in tf.keras.layers.Embedding layer has a fixed size at creation time, so we need another approach. Fortunately, there is a TensorFlow SIG project exactly for this purpose: TensorFlow Recommenders Addons (TFRA). You can learn more from its repository, but at a high level TFRA leverages dynamic embedding technology to dynamically change embedding size and achieve better recommendation results than static embeddings. TFRA is fully TF2.0-compatible and works smoothly with the familiar Keras API interfaces, so it can be easily integrated with other TensorFlow products, such as TensorFlow Recommenders (TFRS).

In this tutorial we will build a movie recommender model by leveraging both TFRS and TFRA. We will use the MovieLens dataset, which contains anonymized data showing ratings given to movies by users. Our primary focus is to show how the dynamic embeddings provided in the TensorFlow Recommenders Addons library can be used to dynamically grow and shrink the size of the embedding tables in the recommendation setting. You can find the full implementation here and a walkthrough here.

Processing the data

Let’s first build a baseline model with TensorFlow Recommenders. We will follow the pattern of this TFRS retrieval tutorial to build a two-tower retrieval model. The user tower will take the user ID as the input, but the item tower will use the tokenized movie title as the input.

To handle the movie titles, we define a helper function that converts a movie title to lowercase, removes any punctuation, and splits on spaces to generate a list of tokens. Finally, we keep at most max_token_length tokens, counted from the start of the title. If a movie title has fewer tokens, all of them are kept. The value of max_token_length is chosen based on some analysis and represents the 90th percentile of title lengths in the dataset.

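import tensorflow as tf

max_token_length = 6
pad_token = "[PAD]"
# Raw string so the regex escapes reach TensorFlow's RE2 engine unchanged.
punctuation_regex = r'[!"#$%&()*+,\-./:;<=>?@\[\]\\^_`{|}~\t\n]'

# First we’ll define a helper function that will process the movie titles for us.

def process_text(x: tf.Tensor, max_token_length: int, punctuation_regex: str) -> tf.Tensor:
    # Lowercase the title, strip punctuation, split on whitespace and
    # keep at most max_token_length tokens from the start.
    return tf.strings.split(
        tf.strings.regex_replace(
            tf.strings.lower(x["movie_title"]), punctuation_regex, ""
        )
    )[:max_token_length]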

We also pad the tokenized movie titles to a fixed length and split the dataset using the same random seed so that we get consistent validation results across training epochs. You can find detailed code in the ‘Processing datasets’ section of the notebook.
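
As a rough illustration of the preprocessing described above, the following plain-Python sketch mirrors the lowercase, strip-punctuation, split, truncate and pad steps without TensorFlow. The helper name tokenize_and_pad is hypothetical, and it relies on Python's re module rather than tf.strings, so edge cases may differ from the notebook.

```python
import re

MAX_TOKEN_LENGTH = 6
PAD_TOKEN = "[PAD]"
# Same character set as punctuation_regex above, written for Python's re module.
PUNCTUATION_REGEX = r"""[!"#$%&()*+,\-./:;<=>?@\[\]\\^_`{|}~\t\n]"""


def tokenize_and_pad(title: str, max_len: int = MAX_TOKEN_LENGTH) -> list[str]:
    """Lowercase, drop punctuation, split on whitespace, truncate, then pad."""
    cleaned = re.sub(PUNCTUATION_REGEX, "", title.lower())
    tokens = cleaned.split()[:max_len]
    # Short titles are right-padded so every example has the same length.
    return tokens + [PAD_TOKEN] * (max_len - len(tokens))


print(tokenize_and_pad("Toy Story (1995)"))
# ['toy', 'story', '1995', '[PAD]', '[PAD]', '[PAD]']
print(tokenize_and_pad("One Flew Over the Cuckoo's Nest (1975)"))
# ['one', 'flew', 'over', 'the', "cuckoo's", 'nest']
```

The second title has seven tokens after cleaning, so the trailing year is dropped by the truncation step.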

Building the two tower model

Our user tower is pretty much the same as in the TFRS retrieval tutorial (except it’s deeper), but for the movie tower there is a GlobalAveragePooling1D layer after the embedding lookup, which averages the embedding of movie title tokens to a single embedding.

def get_movie_title_lookup_layer(dataset: tf.data.Dataset) -> tf.keras.layers.Layer:
    # Build the token vocabulary from the training data; the pad token is masked.
    movie_title_lookup_layer = tf.keras.layers.StringLookup(mask_token=pad_token)
    movie_title_lookup_layer.adapt(dataset.map(lambda x: x["movie_title"]))
    return movie_title_lookup_layer

def build_item_model(movie_title_lookup_layer: tf.keras.layers.StringLookup):
    vocab_size = movie_title_lookup_layer.vocabulary_size()
    return tf.keras.models.Sequential([
        # Note the trailing comma: input_shape must be a tuple.
        tf.keras.layers.InputLayer(input_shape=(max_token_length,), dtype=tf.string),
        movie_title_lookup_layer,
        tf.keras.layers.Embedding(vocab_size, 64),
        # Average the token embeddings into a single title embedding.
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(64, activation="gelu"),
        tf.keras.layers.Dense(32),
        tf.keras.layers.Lambda(lambda x: tf.math.l2_normalize(x, axis=1))
    ])

Next we are going to train the model.

Training the model

Training the model is simply calling fit() on the model with the required arguments. We will be using our validation dataset validation_ds to measure the performance of our model.

history = model.fit(datasets.training_datasets.train_ds, epochs=3, validation_data=datasets.training_datasets.validation_ds)

At the end, the output looks like below:

Epoch 3/3
220/220 [==============================] - 146s 633ms/step
......
val_factorized_top_k/top_10_categorical_accuracy: 0.0179 - val_factorized_top_k/top_50_categorical_accuracy: 0.0766 - val_factorized_top_k/top_100_categorical_accuracy: 0.1338 - val_loss: 12359.0557 - val_regularization_loss: 0.0000e+00 - val_total_loss: 12359.0557

We have achieved a top 100 categorical accuracy of 13.38% on the validation dataset.
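
To make the reported metric concrete, here is a minimal sketch of how a top-k categorical accuracy can be computed: it is the fraction of queries whose true item appears among the k highest-scoring candidates. The function name and the toy scores are made up for illustration, and this is not the TFRS implementation.

```python
def top_k_accuracy(scores: list[list[float]], true_items: list[int], k: int) -> float:
    """Fraction of queries whose true item is among the k best-scored candidates."""
    hits = 0
    for row, target in zip(scores, true_items):
        # Indices of the k candidates with the highest scores for this query.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += target in top_k
    return hits / len(true_items)


# Three queries scored against four candidate items (toy numbers).
scores = [
    [0.9, 0.1, 0.3, 0.2],
    [0.2, 0.8, 0.5, 0.1],
    [0.4, 0.3, 0.2, 0.6],
]
true_items = [0, 2, 1]

print(top_k_accuracy(scores, true_items, k=1))  # one hit out of three
print(top_k_accuracy(scores, true_items, k=2))  # two hits out of three
```

Accuracy can only stay the same or increase as k grows, which is why the top-100 figure above is the largest of the three reported values.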

Building the model with dynamic embeddings

Overview

We will now learn how to use the dynamic embeddings in the TensorFlow Recommenders Addons (TFRA) library rather than a static embedding table. As the name suggests, instead of creating embeddings for all the items in the vocabulary up front, a dynamic embedding grows the embedding table only on demand. This behavior really shines when dealing with millions or billions of items and users, as some companies do. For these companies, it’s not surprising to find static embedding tables that would not fit in memory. Static embedding tables can grow to hundreds of gigabytes or even terabytes, overwhelming even the highest-memory instances available in cloud environments.

When an embedding table has large cardinality, accesses to its weights are quite sparse. Therefore, a hash-table-based data structure is used to hold the weights, and only the weights required for each iteration are retrieved from the underlying table. Here, to focus on the core functionality of the library, we consider a non-distributed setting. In this case, TFRA uses a cuckoo hash table by default, but other backends such as Redis and nvhash are also available.

A chart showing the various embedding solutions across distributed and non-distributed settings in the TFRA library

When using the dynamic embedding, we initialize the table with some initial capacity and the table will grow in size on demand as it sees more IDs during model training. For more information about motivation and inner mechanics, please refer to the RFC.
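
The grow-on-demand behavior can be pictured with a small dictionary-backed table. This is only a conceptual sketch: the class ToyDynamicEmbedding and its methods are hypothetical and are not how TFRA implements its hash tables.

```python
import random


class ToyDynamicEmbedding:
    """Hash-table-backed embedding that creates a row the first time an ID is seen."""

    def __init__(self, embedding_size: int, seed: int = 0):
        self.embedding_size = embedding_size
        self._rng = random.Random(seed)
        self._table: dict[int, list[float]] = {}

    def lookup(self, ids: list[int]) -> list[list[float]]:
        rows = []
        for key in ids:
            if key not in self._table:
                # Unseen ID: allocate and randomly initialize a new row on demand.
                self._table[key] = [
                    self._rng.uniform(-0.05, 0.05) for _ in range(self.embedding_size)
                ]
            rows.append(self._table[key])
        return rows

    def size(self) -> int:
        return len(self._table)


table = ToyDynamicEmbedding(embedding_size=4)
table.lookup([10, 42, 10])
print(table.size())  # 2
table.lookup([42, 7])
print(table.size())  # 3
```

Only the three IDs that were actually looked up occupy memory, regardless of how large the full ID space is.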

Types of embedding

Currently in the TFRA dynamic_embedding module, there are three types of embedding available:

  • Embedding – The most basic form of embeddings. This expects a 1D ([batch_size]) or 2D ([batch_size, time_steps]) tensor of IDs and outputs a [batch_size, embedding_dim] or [batch_size, time_steps, embedding_dim] sized tensor respectively.
  • SquashedEmbedding – This layer squashes the time step dimension based on some reduction operation (e.g. mean/sum) to transform a [batch_size, time_steps] sized tensor of IDs to a [batch_size, embedding_dim] tensor.
  • FieldwiseEmbedding – This type can handle multiple features (i.e. fields) at once. The layer takes n_slots as an argument and IDs are mapped to a slot within the layer. The layer would return a tensor of size [batch_size, n_slots, embedding_dim].
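
The shape difference between the first two layer types can be sketched with nested lists. The lookup table and helper names below are invented for illustration; the sketch only mimics the shapes described above, with a mean reduction standing in for the squashing step.

```python
EMBEDDING_DIM = 2
# Toy lookup table mapping an ID to its embedding vector.
table = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}


def embed(ids: list[list[int]]) -> list[list[list[float]]]:
    """[batch_size, time_steps] IDs -> [batch_size, time_steps, embedding_dim]."""
    return [[table[i] for i in row] for row in ids]


def squash_mean(embedded: list[list[list[float]]]) -> list[list[float]]:
    """Average over the time-step axis -> [batch_size, embedding_dim]."""
    return [
        [sum(vec[d] for vec in row) / len(row) for d in range(EMBEDDING_DIM)]
        for row in embedded
    ]


ids = [[1, 2], [3, 3]]          # batch_size=2, time_steps=2
embedded = embed(ids)           # shape [2, 2, 2]
squashed = squash_mean(embedded)  # shape [2, 2]
print(squashed)  # [[0.5, 0.5], [1.0, 1.0]]
```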

Defining the embedding layers

We will be using the Embedding to represent the user IDs and SquashedEmbedding to represent token IDs. Remember that each movie title has multiple tokens, therefore, we need a way to reduce the resulting token embeddings to a single representative embedding.

Note: The behavior of Embedding has changed from version 0.5 to 0.6. Please make sure to use version 0.6 for this tutorial.

With that, we can define the two towers as we did in the standard model. However, this time we’ll be using the dynamic embedding layers instead of static embedding layers.

# `de` refers to the TFRA dynamic_embedding module.
from tensorflow_recommenders_addons import dynamic_embedding as de

def build_de_user_model(user_id_lookup_layer: tf.keras.layers.StringLookup) -> tf.keras.layers.Layer:
    vocab_size = user_id_lookup_layer.vocabulary_size()
    return tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=(), dtype=tf.string),
        user_id_lookup_layer,
        de.keras.layers.Embedding(
            embedding_size=64,
            initializer=tf.random_uniform_initializer(),
            init_capacity=int(vocab_size*0.8),
            restrict_policy=de.FrequencyRestrictPolicy,
            name="UserDynamicEmbeddingLayer"
        ),
        tf.keras.layers.Dense(64, activation="gelu"),
        tf.keras.layers.Dense(32),
        tf.keras.layers.Lambda(lambda x: tf.math.l2_normalize(x, axis=1))
    ], name='user_model')

def build_de_item_model(movie_title_lookup_layer: tf.keras.layers.StringLookup) -> tf.keras.layers.Layer:
    vocab_size = movie_title_lookup_layer.vocabulary_size()
    return tf.keras.models.Sequential([
        # Note the trailing comma: input_shape must be a tuple.
        tf.keras.layers.InputLayer(input_shape=(max_token_length,), dtype=tf.string),
        movie_title_lookup_layer,
        # SquashedEmbedding averages the token embeddings into one title embedding.
        de.keras.layers.SquashedEmbedding(
            embedding_size=64,
            initializer=tf.random_uniform_initializer(),
            init_capacity=int(vocab_size*0.8),
            restrict_policy=de.FrequencyRestrictPolicy,
            combiner="mean",
            name="ItemDynamicEmbeddingLayer"
        ),
        tf.keras.layers.Dense(64, activation="gelu"),
        tf.keras.layers.Dense(32),
        tf.keras.layers.Lambda(lambda x: tf.math.l2_normalize(x, axis=1))
    ])

With the user tower and movie tower models defined, we can define the retrieval model as usual.

Creating and compiling the final model

As a final step in model building, we’ll create the model and compile it.

import tensorflow_recommenders as tfrs

# get_user_id_lookup_layer, DynamicEmbeddingTwoTowerModel and create_datasets
# are not shown here; see the full implementation.
def create_de_two_tower_model(dataset: tf.data.Dataset, candidate_dataset: tf.data.Dataset) -> tf.keras.Model:

    user_id_lookup_layer = get_user_id_lookup_layer(dataset)
    movie_title_lookup_layer = get_movie_title_lookup_layer(dataset)
    user_model = build_de_user_model(user_id_lookup_layer)
    item_model = build_de_item_model(movie_title_lookup_layer)
    task = tfrs.tasks.Retrieval(
        metrics=tfrs.metrics.FactorizedTopK(
            candidate_dataset.map(item_model)
        ),
    )

    model = DynamicEmbeddingTwoTowerModel(user_model, item_model, task)
    # The standard optimizer must be wrapped to train hash-table-backed weights.
    optimizer = de.DynamicEmbeddingOptimizer(tf.keras.optimizers.Adam())
    model.compile(optimizer=optimizer)

    return model

datasets = create_datasets()
de_model = create_de_two_tower_model(datasets.training_datasets.train_ds, datasets.candidate_dataset)

Note the usage of the DynamicEmbeddingOptimizer wrapper around the standard TensorFlow optimizer. It is mandatory to wrap the standard optimizer in a DynamicEmbeddingOptimizer, as it provides the specialized functionality needed to train the weights stored in a hash table. We can now train our model.

Training the model

Training the model is quite straightforward, but will involve a bit more extra effort as we’d like to log some extra information. We will perform the logging through a tf.keras.callbacks.Callback object. We’ll name this DynamicEmbeddingCallback.

epochs = 3
history_de = {}
history_de_size = {}
de_callback = DynamicEmbeddingCallback(de_model, steps_per_logging=20)

for epoch in range(epochs):
    # Re-create the datasets so that each epoch sees a different shuffle.
    datasets = create_datasets()
    train_steps = len(datasets.training_datasets.train_ds)

    hist = de_model.fit(
        datasets.training_datasets.train_ds,
        epochs=1,
        validation_data=datasets.training_datasets.validation_ds,
        callbacks=[de_callback],
    )

    # Offset the logged step numbers so they keep increasing across epochs.
    for k, v in de_model.dynamic_embedding_history.items():
        if k == "step":
            v = [vv + (epoch * train_steps) for vv in v]
        history_de_size.setdefault(k, []).extend(v)

    for k, v in hist.history.items():
        history_de.setdefault(k, []).extend(v)

We have taken the loop that goes through the epochs out of the fit() function. Then in every epoch we re-create the dataset, as that will provide a different shuffling of the training dataset. We will train the model for a single epoch within the loop. Finally we accumulate the logged embedding sizes in history_de_size (this is provided by our custom callback) and performance metrics in history_de.

The callback is implemented as follows.

class DynamicEmbeddingCallback(tf.keras.callbacks.Callback):

    def __init__(self, model, steps_per_logging, steps_per_restrict=None, restrict=False):
        super().__init__()
        self.model = model
        self.steps_per_logging = steps_per_logging
        self.steps_per_restrict = steps_per_restrict
        self.restrict = restrict

    def on_train_begin(self, logs=None):
        self.model.dynamic_embedding_history = {}

    def on_train_batch_end(self, batch, logs=None):
        # Periodically shrink every embedding table to 80% of its vocabulary size.
        if self.restrict and self.steps_per_restrict and (batch + 1) % self.steps_per_restrict == 0:
            for k in self.model.embedding_layers.keys():
                self.model.embedding_layers[k].params.restrict(
                    num_reserved=int(self.model.lookup_vocab_sizes[k] * 0.8),
                    trigger=self.model.lookup_vocab_sizes[k] - 2,  # UNK & PAD tokens
                )

        # Periodically record the current size of every embedding table.
        if (batch + 1) % self.steps_per_logging == 0:
            embedding_size_dict = {
                k: self.model.embedding_layers[k].params.size().numpy()
                for k in self.model.embedding_layers.keys()
            }

            for k, v in embedding_size_dict.items():
                self.model.dynamic_embedding_history.setdefault(f"embedding_size_{k}", []).append(v)
            self.model.dynamic_embedding_history.setdefault("step", []).append(batch + 1)

The callback does two things:

  • Logs the sizes of the embedding layers every steps_per_logging iterations
  • Reduces the size of the embedding table to 80% of the total vocabulary size if restrict=True (this is set to False by default)

Let’s understand what reducing the size means and why it is important.

Reducing the size of the embedding table

An important topic we still haven’t discussed is how to reduce the size of the embedding table should it grow past some predefined threshold. This is a powerful capability: it lets us set a size limit beyond which the embedding table should not grow, so we can work with large vocabularies while staying within whatever memory limits we have. We achieve this by calling restrict() on the underlying variables of the embedding layer, as shown in the DynamicEmbeddingCallback. restrict() takes two arguments: num_reserved (the size after the reduction) and trigger (the size at which the reduction should be triggered). The policy that governs how the reduction is performed is defined using the restrict_policy argument in the layer constructor. You can see that we are using the FrequencyRestrictPolicy, which means the least frequent items are removed from the embedding table. The callback lets a user control how often the reduction is attempted by setting the steps_per_restrict and restrict arguments of the DynamicEmbeddingCallback.

Reducing the size of the embedding table makes more sense when you have streaming data. Think about an online learning setting, where you are training the model every day (or even every hour) on some incoming data. You can think of the outer for loop (i.e. epochs) representing days. Each day you receive a dataset (containing user interactions from the previous day for example) and you train the model from the previous checkpoint. In this case, you can use the DynamicEmbeddingCallback to trigger a restrict if the embedding table grows over the size defined in the trigger argument.
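
The roles of num_reserved and trigger under a frequency-based policy can be sketched in a few lines of plain Python. The function below is a hypothetical stand-in operating on a dictionary, and it assumes the reduction fires only when the table size exceeds trigger; TFRA's actual restrict() works on its own hash-table variables and may differ in detail.

```python
def restrict(table: dict, counts: dict, num_reserved: int, trigger: int) -> dict:
    """Keep only the num_reserved most frequent keys once len(table) exceeds trigger."""
    if len(table) <= trigger:
        return table  # Below the threshold: nothing to do.
    keep = sorted(table, key=lambda key: counts.get(key, 0), reverse=True)[:num_reserved]
    return {key: table[key] for key in keep}


# Toy table with five IDs and how often each one was looked up.
table = {"a": [0.1], "b": [0.2], "c": [0.3], "d": [0.4], "e": [0.5]}
counts = {"a": 5, "b": 1, "c": 3, "d": 2, "e": 4}

unchanged = restrict(table, counts, num_reserved=3, trigger=5)
shrunk = restrict(table, counts, num_reserved=3, trigger=4)
print(sorted(unchanged))  # ['a', 'b', 'c', 'd', 'e']
print(sorted(shrunk))     # ['a', 'c', 'e']
```

In the second call the table holds more entries than trigger allows, so the two least frequently seen IDs ("b" and "d") are evicted.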

Analyzing performance

Here we analyze the performance of three variants.

  • The standard retrieval model (which uses a static embedding table)
  • Retrieval model using dynamic embedding but no restrict performed
  • Retrieval model using dynamic embedding with restrict performed

A graph showing model accuracy with and without dynamic embeddings

You can see that the model using dynamic embeddings (solid green line) has validation performance comparable to the baseline (solid red line). You can see a similar trend in the training accuracy as well. In practice, dynamic embeddings can often be seen to improve accuracy in a large-scale online learning setup.

Finally, we can see that restrict has a somewhat detrimental effect on the validation accuracy, which is understandable. Since we’re working with a relatively small dataset with a small number of items, the reduction could be getting rid of embeddings that are best kept in the table. To mitigate this, you can increase the num_reserved argument of the restrict function (e.g. set it to int(self.model.lookup_vocab_sizes[k]*0.95)), which should bring performance closer to that of the model trained without restrict.

Next we look at how dynamic the embedding tables really are over time.

A graph showing changes in the embedding size over time

We can see that when restrict is not used, the embedding table grows to the full size of the vocabulary (dashed line) and stays there. However when restrict is triggered (dotted line), the size drops and grows in size again as it encounters new IDs.

It is also important to note that constructing a proper validation set is not a trivial task. There are considerations such as out-of-sample validation, out-of-time validation, stratification, etc. that need to be taken into account carefully. However, for this exercise we have not focused on such factors and created a validation set by sampling randomly from the existing dataset.

Conclusion

Using dynamic embedding tables is a powerful way to perform representation learning when working with large sets of items containing millions or billions of entities. In this tutorial, we learnt how to use the dynamic_embedding module provided in the TensorFlow Recommenders Addons library to achieve this. We first explored the data and constructed tf.data.Dataset objects by extracting the features we’ll be using for our model training and evaluation. Next we defined a model that uses static embedding tables to use as an evaluation baseline. We then created a model that uses dynamic embedding and trained it on the data. We saw that with dynamic embeddings, the embedding tables grow only on demand and still achieve performance comparable to the baseline. We also discussed how the restrict functionality can be used to shrink the embedding table if it grows past a predefined threshold.

We hope this tutorial gives you a good conceptual introduction to TFRA and dynamic embeddings, and helps you think about how you can leverage it to enhance your own recommenders. If you would like to have a more in-depth discussion, please visit the TFRA repository.

Unifying learning from preferences and demonstration via a ranking game for imitation learning

Rank Game diagram

For many people, opening door handles or moving a pen between their fingers is a movement that happens multiple times a day, often without much thought. For a robot, however, these movements aren’t always so easy.

In reinforcement learning, robots learn to perform tasks by exploring their environments, receiving signals along the way that indicate how good their behavior is compared to the desired outcome, or state. For the described movements, for example, we can specify a reward function that is +1 when the door is successfully opened or the pen is at the desired orientation and 0 otherwise. But this makes the learning task complicated for the robot since it has to try out various motions before stumbling on the successful outcome, or a reward of +1.

The imitation learning (IL) paradigm was introduced to mitigate the amount of trial and error. In IL, the robot is provided with demonstrations of a given task performed by an expert from which it can try to learn the task and possibly gain information about the expert’s reward function, or the expert’s intent, similar to how people pick up various skills. Yet, learning remains difficult in instances where we only have access to the change enacted by the expert in the world, known as the expert observation, and not the precise actions the expert took to achieve the change. Another difficulty the robot faces is that even if it sees infinite expert demonstrations, it can’t fully reason about the intent of the expert—that is, compare whether one of its own learned behaviors is closer to the expert’s than another behavior—as it only knows the best behavior and has no notion of ordering over other behaviors.

In our paper “A Ranking Game for Imitation Learning,” being presented at Transactions on Machine Learning Research 2023 (TMLR), we propose a simple and intuitive framework, \(\texttt{rank-game}\), that unifies learning from expert demonstrations and preferences by generalizing a key approach to imitation learning. Giving robots the ability to learn from preferences, obtained by having an expert rank which behavior aligns better with their objectives, allows the learning of more informative reward functions. Our approach, which enabled us to propose a new objective for training over behavior preferences, makes the learning process easier for a robot and achieves state-of-the-art results in imitation learning. It also enabled the training of a robot that can solve the tasks of opening a door and moving a pen between its fingers in simulation, a first in imitation learning with expert observations alone. The incorporation of preferences has also seen success in language modeling, where chatbots such as ChatGPT are improving themselves by learning a reward function inferred via preferences over several samples of model responses in addition to learning from desired human conversational data.

Robotics has found a place in controlled environments where the tasks at hand are well-defined and repeatable, such as on a factory floor. Our framework has the potential to help enable robot learning of tasks in more dynamic environments, such as helping people with daily chores around the home.

With \(\texttt{rank-game}\), which combines learning from preferences and demonstrations via a two-player ranking-based game, robots in simulation were trained to manipulate a pen with a dexterous hand (left) and open a door with a parallel jaw gripper (right). The successful completion of these tasks marked a first in imitation learning with expert observations alone.

A ranking game for imitation learning

Inverse reinforcement learning (IRL) is a popular and effective method for imitation learning. IRL learns by inferring the reward function, also referred to as the intent of the expert, and a policy, which specifies what actions the agent—or, in our case, the robot—should take in a given state to successfully mimic the expert.

Notation: We use \(\pi\) and \(\pi^E\) to denote the policy of the agent and the expert, respectively, and \(R_{gt}\) to be the reward function of the expert, which is unknown to the agent/robot. \(\rho^\pi\) denotes the state-action/state visitation distribution of policy \(\pi\) in the environment, that is, the probabilistic collection of states the policy visits in the environment. We use \(J(R;\pi)\) to denote the \(\textit{cumulative reward}\), or the performance of policy \(\pi\) under a reward function \(R\). We assume policy \(\pi\) belongs to function class \(\Pi\) and reward function \(R\) belongs to function class \(\mathcal{R}\).

The goal of imitation learning is to make the agent have the same performance as the expert under the expert’s unknown reward function \(R_{gt}\). The classical IRL formulation tackles this by minimizing the imitation gap under a reward function that makes the performance gap the largest. We denote this framework by \(\texttt{imit-game}\) and write it below formally:

\(\texttt{imit-game}(\pi,\pi^E): \text{argmin}_{\pi\in\Pi}\text{max}_{R\in\mathcal{R}} [\mathbb{E}_{\rho^E(s,a)}[R(s,a)]-\mathbb{E}_{\rho^\pi(s,a)}[R(s,a)]]\)

Simply stated, the \(\texttt{imit-game}\) tries to find a policy that has the lowest worst-case performance difference with the expert policy. This classical IRL formulation learns from expert demonstrations but provides no mechanism to incorporate learning from preferences. In our work, we ask, does IRL really need to consider the worst-case performance difference? We find that relaxing this requirement allows us to incorporate preferences.
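As a rough illustration of the worst-case gap, the sketch below assumes a linear reward class, \(R(s,a) = w \cdot \phi(s,a)\) with \(\|w\|_2 \le 1\), which is an assumption made here for illustration and not a restriction stated in this post. Under that assumption the inner maximization has a closed form: the gap equals the Euclidean distance between the expected feature vectors of the expert and the policy. The feature map, the candidate policies, and all function names are hypothetical.

```python
import numpy as np

def worst_case_gap(mu_expert, mu_policy):
    """Inner max of imit-game for rewards R = w . phi with ||w||_2 <= 1 (assumed class).

    mu_expert and mu_policy are expected feature vectors E_rho[phi(s, a)].
    The maximizing w points along their difference, so the gap is its norm.
    """
    diff = np.asarray(mu_expert, dtype=float) - np.asarray(mu_policy, dtype=float)
    gap = float(np.linalg.norm(diff))
    w_star = diff / gap if gap > 0 else np.zeros_like(diff)
    return gap, w_star

def imit_game(mu_expert, candidate_policies):
    """Outer argmin over a finite, hypothetical set of candidates (name -> mu)."""
    gaps = {name: worst_case_gap(mu_expert, mu)[0]
            for name, mu in candidate_policies.items()}
    best = min(gaps, key=gaps.get)
    return best, gaps

# Toy expected feature vectors (made-up numbers).
mu_E = [1.0, 0.0]
candidates = {"a": [0.0, 0.0], "b": [1.0, 0.5], "c": [0.4, 0.0]}
best, gaps = imit_game(mu_E, candidates)
```

Here the policy with the smallest worst-case gap is selected, which is the min-max structure of the objective above restricted to a finite candidate set.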

Our proposed method treats imitation as a two-player ranking-based game between a policy and a reward. In this game, the reward agent learns to map more preferred behaviors to a higher total reward for each of the pairwise preferences, while the policy agent learns to maximize the performance on this reward function by interacting with the environment. Contrary to the classical IRL framework, the reward function now has to get only the rankings correct and not optimize for the worst case (see Figure 1).

A flow chart with, clockwise from top left, a green box labeled “policy agent,” a blue box labeled “reward agent,” and an orange box labeled “Dataset D,” which contains pairwise behavior rankings obtained from three sources. An arrow points from the policy agent to the dataset, indicating the policy’s contribution of rankings. An arrow pointing from the policy agent to the reward agent is labeled with the optimization strategy. An arrow pointing from the reward agent to the dataset is labeled with the ranking loss function.
Figure 1: The proposed \(\texttt{rank-game}\) method treats imitation learning as a two-player ranking-based game between a policy and a reward. The policy agent maximizes the reward function by interacting with the environment. The reward agent satisfies a set of behavior rankings obtained from various sources: generated by the policy agent, automatically generated via data augmentation, or expert-annotated rankings obtained from a human or offline dataset.

To incorporate preferences, we need to quantify the behaviors in order to compare them. In this work, we choose the behaviors \(\rho\) to be the state-action or state-only visitation distribution of the agent. A ranking between behaviors is used to specify that the expert would prefer one behavior over the other. A reward function that satisfies the behavior rankings ensures that the average return under a lower-ranked behavior is no larger than that under the higher-ranked behavior. More formally, the ranking game is defined as a game where the policy agent \(\pi\) maximizes the expected return \(J(R;\pi)\) of the policy under reward function \(R\) when deployed in the environment. The reward player takes the dataset of pairwise rankings \(D^p\) (rankings are denoted as \(\rho^i\preceq\rho^j\)) as an input and attempts to learn a reward function that satisfies those rankings using a ranking loss (denoted by \(L(D^p;R)\)).

\(\underbrace{\text{argmax}_{\pi\in\Pi}J(R;\pi)}_{\text{Policy Agent}}~~~~~~~~~~~~~~~\underbrace{\text{argmin}_{R\in\mathcal{R}}L(D^p;R)}_{\text{Reward Agent}}\)

The ranking loss induces a reward function \(R\) that attempts to satisfy each pairwise preference in the dataset as follows:

\(\mathbb{E}_{\rho^i}[R(s,a)]\le\mathbb{E}_{\rho^j}[R(s,a)]~~,~~\forall \rho^i\preceq\rho^j \in D^p\)
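The condition above can be checked empirically once each behavior is represented by samples from its visitation distribution. The following is a minimal sketch under that assumption; the sample-based estimate, the toy reward, and the function names are illustrative and not taken from the paper's code.

```python
import numpy as np

def expected_reward(reward_fn, samples):
    """Monte Carlo estimate of E_rho[R(s, a)] from (state, action) samples."""
    return float(np.mean([reward_fn(s, a) for s, a in samples]))

def satisfied_rankings(reward_fn, behaviors, rankings):
    """For each pair (i, j), meaning rho^i is ranked no higher than rho^j,
    report whether the estimated E_i[R] <= E_j[R] holds."""
    returns = {name: expected_reward(reward_fn, smp)
               for name, smp in behaviors.items()}
    return {(i, j): returns[i] <= returns[j] for i, j in rankings}

def toy_reward(state, action):
    """Hypothetical reward: larger when the scalar state is closer to 1."""
    return -abs(state - 1.0)

# Two toy behaviors given as lists of (state, action) samples.
behaviors = {
    "policy": [(0.0, 0), (0.2, 1)],
    "expert": [(1.0, 0), (0.9, 1)],
}
result = satisfied_rankings(toy_reward, behaviors, [("policy", "expert")])
```

In this toy case the reward ranks the expert's samples above the policy's, so the single ranking in the dataset is satisfied.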

Generalizing prior imitation learning approaches with \(\texttt{rank-game}\)

The \(\texttt{rank-game}\) framework neatly encapsulates prior work in IRL and prior work in learning from preferences. First, let’s see how classical IRL is a part of this framework. Recall that the classical IRL/\(\texttt{imit-game}\) optimization can be written as:

\(\text{argmin}_{\pi\in\Pi}\text{max}_{R\in\mathcal{R}} [\mathbb{E}_{\rho^E(s,a)}[R(s,a)]-\mathbb{E}_{\rho^\pi(s,a)}[R(s,a)]]\)

The inner optimization learns a reward function that ensures that the return gap under the reward function is maximized between the current policy’s behavior and the expert behavior. Thus, \(\texttt{imit-game}\) can be seen to be a special case of \(\texttt{rank-game}\) with: (1) a ranking dataset that prefers expert behavior more than the current agent behavior and (2) a form of ranking loss that maximizes the performance gap (termed the \(\textit{supremum loss}\)). A number of prior methods in the imitation learning domain can be understood as special cases of \(\texttt{rank-game}\) under various ranking losses, classes of reward functions, and abilities to incorporate preferences (see Figure 2).
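To make the special case concrete, the inner maximization can be rewritten as the minimization of a ranking loss over the single ranking \(\rho^\pi\preceq\rho^E\). The symbol \(L_{\sup}\) is a label chosen here for the supremum loss; it is not notation taken from the text above.

```latex
D^p = \{\rho^{\pi} \preceq \rho^{E}\}, \qquad
L_{\sup}(D^p;R) = \mathbb{E}_{\rho^{\pi}(s,a)}[R(s,a)] - \mathbb{E}_{\rho^{E}(s,a)}[R(s,a)]

\text{argmin}_{R\in\mathcal{R}} L_{\sup}(D^p;R)
  = \text{argmax}_{R\in\mathcal{R}}
    \big[\mathbb{E}_{\rho^{E}(s,a)}[R(s,a)] - \mathbb{E}_{\rho^{\pi}(s,a)}[R(s,a)]\big]
```

With this choice, the reward agent's problem \(\text{argmin}_{R\in\mathcal{R}}L(D^p;R)\) coincides with the inner maximization of \(\texttt{imit-game}\).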

A table with a summary of imitation learning (IL) methods demonstrating the data modalities they can handle (expert data and/or preferences), their ranking-loss functions, the assumptions they make on reward function, and whether they require availability of an external agent to provide preferences during training.
The IL methods MaxEntIRL, AdRIL, GAN-GCL, GAIL, f-MAX, and AIRL don’t use offline preferences or active human query, enable Learning from Demonstration (LfD) when incorporating expert data, and use the supremum ranking loss function and a non-linear reward function.
BCO, GAIfO, DACfO, OPOLO, and f-IRL don’t use offline preferences or active human query, enable Learning from Observation (LfO), and use the supremum ranking loss function and a non-linear reward function.
TREX and DREX use offline preferences, the Bradley-Terry ranking loss function, and a non-linear reward function; they don’t use active human query or enable LfO or LfD.
BREX uses offline preferences, the Bradley-Terry ranking loss function, and a linear reward function; it doesn’t use active human query or enable LfO or LfD.
DemPref uses offline preferences, the Bradley-Terry ranking loss function, a linear reward function, and active human query; it enables LfO and LfD.
Ibarz et al. (2018) uses offline preferences, the Bradley-Terry ranking loss function, a non-linear reward function, and active human query; it enables LfD.
Rank-game uses offline preferences, a new principled ranking loss that can naturally incorporate rankings provided by diverse sources, and a non-linear reward function; it enables LfO and LfD and doesn’t use active human query.
Figure 2: Previous methods that learn from expert demonstrations or preferences form a special case of \(\texttt{rank-game}\) under a specific choice of ranking loss and a reward function class. Also noted in the table is whether a method enables learning from demonstration (LfD), that is, learning from both expert states and actions, or learning from observations (LfO), where an agent learns from expert states alone.

Setting up the ranking game

To develop a framework that successfully combines learning from demonstrations and learning from preferences, we addressed several questions:

  1. What is the ranking loss function that allows for the reward to satisfy the preferences in the dataset?
  2. Where do we get the dataset of pairwise preferences?
  3. How can we effectively optimize this two-player game?

Step 1: A new ranking loss function for reward learning

Our proposed framework requires learning a reward function such that the rankings in the dataset are satisfied. While several loss functions exist in prior literature to enable this, such as Luce-Shepard, Lovász-Bregman divergences, and the earlier discussed supremum loss, we introduce a new loss function:

\(L_k(D^p;R) = \mathbb{E}_{(\rho^{\pi^i},\rho^{\pi^j})\sim D^p} \Big[\mathbb{E}_{s,a\sim\rho^{\pi^i}}{[(R(s,a)-0)^2]} + \mathbb{E}_{s,a\sim\rho^{\pi^j}}{[(R(s,a)-k)^2]}\Big]\)

The loss function is simple and intuitive: For all the preference pairs in the dataset, the less preferred behavior is regressed to a return of 0 and more preferred behavior is regressed to a return of user-defined parameter \(k\). This loss function allows us to learn a reward function with user-defined scale \(k\), which plays an important role in enabling better policy optimization; it’s principled and facilitates near-optimal imitation learning; and by design, it allows us to incorporate preferences.
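An empirical version of this loss is short to write down. The sketch below assumes each behavior in a ranked pair is available as an array of sampled state-action feature rows and that the reward function is vectorized over rows; those representation choices and the function names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def ranking_loss_k(reward_fn, ranked_pairs, k=1.0):
    """Empirical L_k: regress rewards on less preferred samples to 0
    and rewards on more preferred samples to k, averaged over pairs.

    ranked_pairs: list of (less_preferred, more_preferred), each an array
    of (s, a) feature rows sampled from that behavior's visitation distribution.
    """
    per_pair = []
    for low, high in ranked_pairs:
        low_term = np.mean((reward_fn(np.asarray(low)) - 0.0) ** 2)
        high_term = np.mean((reward_fn(np.asarray(high)) - k) ** 2)
        per_pair.append(low_term + high_term)
    return float(np.mean(per_pair))

def first_feature_reward(x):
    """Hypothetical reward that reads off the first feature of each row."""
    return x[:, 0]

low = [[0.0], [0.0]]    # samples from the less preferred behavior
high = [[2.0], [2.0]]   # samples from the more preferred behavior
loss_matched = ranking_loss_k(first_feature_reward, [(low, high)], k=2.0)
loss_mismatched = ranking_loss_k(first_feature_reward, [(low, high)], k=1.0)
```

The toy reward already maps the two behaviors to 0 and 2, so the loss vanishes for \(k=2\) and is positive for \(k=1\), which shows how \(k\) fixes the scale of the learned reward.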

Step 2: Getting the ranking dataset

Besides giving more information about the expert’s intent and being easy to obtain, preferences can help learn a more informative, or shaped, reward function. This form of reward shaping can provide better guidance for policy optimization, reducing the burden of exploring the environment to find the optimal policy and increasing sample efficiency for IRL. Our initial ranking dataset is generated by the policy agent from its interactions with the environment; in these rankings, the expert’s behavior is always ranked at least as high as the current policy’s behavior. To further leverage the benefits of preferences, we consider two methods for augmenting this ranking dataset:

  • Expert-annotated rankings: In situations where we have access to additional rankings, provided by humans or obtained from reward-annotated datasets, we can simply add them to our ranking dataset.
  • Automatically generated rankings: It turns out we can improve learning efficiency for imitation by using the rankings already present in the dataset of pairwise preferences to generate more preferences in a procedure similar to Mixup regularization in trajectory space.
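One way to picture the automatically generated rankings is as mixtures of two already ranked behaviors. The sketch below is an assumption about how such a Mixup-style step could look, not the paper's procedure: it builds a mixture by drawing each sample from the more preferred behavior with probability `lam`, and orders mixtures by `lam`. This ordering is justified in expectation because the expected reward of a mixture is linear in `lam`. All names are hypothetical.

```python
import numpy as np

def mix_behaviors(low, high, lam, rng):
    """Sample a mixture behavior: each row comes from `high` with
    probability lam and from `low` otherwise (rows are 2-D feature arrays)."""
    low, high = np.asarray(low), np.asarray(high)
    n = min(len(low), len(high))
    take_high = rng.random(n) < lam
    return np.where(take_high[:, None], high[:n], low[:n])

def augment_rankings(low, high, lambdas, rng):
    """Return the chain of pairs low <= mix(l1) <= ... <= high
    for the given mixing weights, sorted in increasing order."""
    chain = [np.asarray(low)]
    chain += [mix_behaviors(low, high, lam, rng) for lam in sorted(lambdas)]
    chain.append(np.asarray(high))
    return list(zip(chain[:-1], chain[1:]))

rng = np.random.default_rng(0)
low = np.zeros((100, 1))    # less preferred behavior (toy samples)
high = np.ones((100, 1))    # more preferred behavior (toy samples)
pairs = augment_rankings(low, high, [0.75, 0.25], rng)
```

Each element of `pairs` is a (less preferred, more preferred) tuple that could be appended to the ranking dataset alongside the original pair.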

Step 3: Improving optimization stability with a Stackelberg game

Prior work has found the Stackelberg game framework to be a strong candidate for optimizing two-player games in various applications. A Stackelberg game is a bi-level optimization problem:

\(\text{max}_x~f(x,y_x),~~~~\text{s.t.}~~y_x\in \text{argmin}_y~g(x,y)\)

In this optimization, we have two players, Leader \(x\) and Follower \(y\), that are trying to maximize and minimize their own payoffs \(f\) and \(g\), respectively. We cast \(\texttt{rank-game}\) as a Stackelberg game and propose two algorithms depending on which player is set to be the leader:

  • Policy as Leader (PAL): \(\text{max}_\pi J(R;\pi)~~~~~\text{s.t.}~~ R=\text{argmin}_R~L(D^p;R)\)
  • Reward as Leader (RAL): \(\text{min}_R L(D^p;R)~~~\text{s.t.}~~\pi = \text{argmax}_\pi~J(R;\pi)\)

Aside from improving training stability, both methods have complementary benefits in the non-stationary imitation learning setting. PAL can adjust more quickly when the intent of the expert changes, while RAL can handle environmental changes better.
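The leader-follower structure can be illustrated on a scalar toy problem. In the sketch below, the payoffs \(f(x,y) = -(x-1)^2 - y^2\) and \(g(x,y) = (y-x)^2\) are made up for illustration and have nothing to do with policies or rewards; the follower's best response is available in closed form, and the leader ascends its payoff while accounting for that response. In PAL the policy would play the role of the leader and the reward the follower; in RAL the roles are swapped.

```python
def follower_best_response(x):
    """argmin_y g(x, y) for the toy g(x, y) = (y - x)^2, i.e. y_x = x."""
    return x

def leader_payoff(x, y):
    """Toy leader payoff f(x, y) = -(x - 1)^2 - y^2."""
    return -(x - 1.0) ** 2 - y ** 2

def solve_stackelberg(x0=0.0, lr=0.1, steps=200, eps=1e-6):
    """Gradient ascent on x -> f(x, y_x), where y_x is the follower's best response.

    The gradient is taken by central finite differences through the best
    response, so the leader anticipates how the follower will react.
    """
    x = x0
    for _ in range(steps):
        up = leader_payoff(x + eps, follower_best_response(x + eps))
        down = leader_payoff(x - eps, follower_best_response(x - eps))
        x += lr * (up - down) / (2 * eps)
    return x, follower_best_response(x)

x_star, y_star = solve_stackelberg()
```

Because the follower always answers with \(y_x = x\), the leader effectively maximizes \(-(x-1)^2 - x^2\), whose maximizer is \(x = 0.5\). A leader that ignored the follower's reaction would instead move toward \(x = 1\).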

How well does \(\texttt{rank-game}\) perform in practice?

In testing the capabilities of \(\texttt{rank-game}\), one of the scenarios we consider is the learning from observations alone (LfO) setting, in which only expert observations are provided with no expert actions. This more challenging setting better reflects the learning conditions robots will operate under if we want them to be more widely deployed in both controlled and dynamic environments. People can more naturally provide demonstrations by performing tasks themselves (observations only) versus performing the task indirectly by operating a robot (observations and precise actions). We investigate the LfO performance of \(\texttt{rank-game}\) on simulated locomotion tasks like hopping, walking, and running and benchmark it with respect to representative baselines. \(\texttt{Rank-game}\) approaches require fewer environment interactions to succeed and outperform recent methods in final performance and training stability.

Additionally, our experiments reveal that none of the prior LfO methods can solve complex manipulation tasks such as door opening with a parallel jaw gripper and pen manipulation with a dexterous hand. This failure is potentially a result of the exploration requirements of LfO, which are high because of the unavailability of expert actions coupled with the fact that in these tasks observing successes is rare.

In this setting, we show that using only a handful of expert-annotated preferences in the \(\texttt{rank-game}\) framework can allow us to solve these tasks. We cannot solve these tasks using only expert data; adding preferences is key.

Next steps

Equipping agents to learn from different sources of information present in the world is a promising direction toward more capable agents that can better assist people in the dynamic environments in which they live and work. The \(\texttt{rank-game}\) framework has the potential to be extended directly to the setting where humans present their preferences interactively as the robot is learning. There are some promising future directions and open questions for researchers interested in this work. First, preferences obtained in the real world are usually noisy, and one limitation of \(\texttt{rank-game}\) is that it does not suggest a way to handle noisy preferences. Second, \(\texttt{rank-game}\) proposes modifications to learn a reward function amenable to policy optimization, but these hyperparameters are set manually. Future work can explore methods to automate such learning of reward functions. Third, despite learning effective policies, we observed that \(\texttt{rank-game}\) did not learn reusable robust reward functions.

For additional details, including experiments in the learning from demonstration (LfD) setting, non-stationary imitation setting, and further framework analysis, check out the paper, project page, code, and video presentation.

Acknowledgments

This research was supported in part by the National Science Foundation, Air Force Office of Scientific Research, and Army Research Office.

The post Unifying learning from preferences and demonstration via a ranking game for imitation learning appeared first on Microsoft Research.

NVIDIA Studio Creators Take Collaboration to Bone-Chilling New Heights


Editor’s note: This post is part of our weekly In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks, and demonstrates how NVIDIA Studio technology improves creative workflows. We’re also deep diving on new GeForce RTX 40 Series GPU features, technologies and resources, and how they dramatically accelerate content creation.

NVIDIA Omniverse, a pillar of the NVIDIA Studio suite of tools and apps built for creators, helps interconnect 3D artists’ workflows by replacing linear pipelines with live-sync creation.

This week’s In the NVIDIA Studio artists specializing in 3D, Gianluca Squillace and Pasquale Scionti, benefitted from just that — in their individual work and in collaborating to construct the final scene for their project, Cold Inside Diorama.

 

The artists set out to build a Viking scene that conveyed a cold, glacial mood — uniting Squillace’s character with Scionti’s environment while maintaining their individual styles. Such a workflow, which used to require endless exports and imports, was made simple with Omniverse and Universal Scene Description (USD), the open and extensible ecosystem for describing, composing, simulating and collaborating within 3D worlds.

It Takes Character

Powered by a GeForce RTX 4080 GPU, Squillace started with rich reference research to create the character. He took several realistic and stylized images, concepts and photos of characters and weapons to define all the details.

After defining the character, Squillace built a quick blockout in the ZBrush tool, before sculpting all the details to reach the definitive high-poly model. The character’s hairstyle and other aspects underwent several tests in this step of the process before arriving at the final version.

Squillace then moved on to the retopology and UV phase in Autodesk Maya, optimizing models for movements and deformations, which he said allows excellent control of the polygons with quad draw and simple management of the UVs. It also enables mirroring different UV shells and saving space by improving the final quality.

High-poly models take time to develop.

The artist brought the completed high- and low-poly models into Adobe Substance 3D Painter for the bake and texturing phase. Squillace had already defined the main colors with polypaint in ZBrush, so he used the materials present in Painter to stylize different metals, leathers and woods and achieve the final textures.

 

After completing a quick rig and an idle animation in Maya (the animation that plays in video games when the main character remains still), the artist took advantage of the USD file format to instantly pull the animation into NVIDIA Omniverse USD Composer (formerly known as Create) to see the results in real time.

“With the connectors plug-in installed in Substance 3D Painter and Maya, I could edit textures and animation in real time and immediately see the results in Omniverse USD Composer, without exporting extra files or having to close and reopen the software,” Squillace said. “It was really a great improvement in terms of workflow and speed.”

He next felt the benefit of working with USD when seamlessly importing files into Marmoset Toolbag for character-only renders.

 

Squillace took full advantage of his RTX GPU in nearly every step of production. “I was really impressed by the fast calculation of ray-traced lighting in the Marmoset Toolbag scene, which lets me make a lot of real-time renders and videos with an outstanding final result,” he said.

A Hostile Environment Built in a Friendly One

Another artist worked in parallel. Scionti built the cold, harsh environment using Omniverse’s collaboration capabilities and USD files. This combination enabled him to integrate his files with Squillace’s — and see edits in real time.

Scionti said he always starts his work by establishing its mood. With the cold, glacial tone set for this piece, he modeled some of the scene elements using Autodesk 3ds Max before importing them into Adobe Substance 3D Painter, where he created unique materials.

For some additional materials, he used Quixel software and completed the composition and design using Quixel Megascans. As with all his work, in this piece Scionti intentionally left room for interpretation, letting the audience imagine their own story.

The scene was then finalized with composition mood and lighting. Accelerated by his GeForce RTX 3090 GPU, Unreal Engine 5.1 and Lumen with hardware ray tracing helped Scionti achieve a higher level of realism with intricate details. Nanite meshes with improved virtualized geometry were useful to generate high-polygon models for close-up details. In the lighting phase, the artist used sun and sky with volumetric fog and high dynamic range.

 

“My GPU gives me so many realistic details in real time,” Scionti said.

The duo then brought Squillace’s work into Scionti’s Unreal Engine scene to integrate the character into the snowy environment. With their scene complete, the artists enjoyed the final render and reflected on its creation.

A stunning, emotion-evoking scene, built in Omniverse USD Composer.

“NVIDIA Omniverse was the center of experimentation for this project — it allowed me to work with different software at the same time, increasing the production speed of the final character,” Squillace said. “I think the system provides enormous potential, especially if combined with the standard production workflow of a 3D asset.”

 

Scionti added that Omniverse is “a great software to collaborate with other people around the globe and interact in real time on the same project.”

3D artists Gianluca Squillace and Pasquale Scionti.

Follow NVIDIA Studio on Instagram, Twitter and Facebook. Access tutorials on the Studio YouTube channel and get updates directly in your inbox by subscribing to the Studio newsletter. Get started with Omniverse and learn more on Instagram, Medium, Twitter and YouTube for additional resources and inspiration. Check out the Omniverse forums, and join our Discord server and Twitch channel to chat with the community.

Read More

Accelerating Large Language Models with Accelerated Transformers


TL;DR. We show how to use Accelerated PyTorch 2.0 Transformers and the newly introduced torch.compile() method to accelerate Large Language Models, using nanoGPT, a compact open-source implementation of the GPT model from Andrej Karpathy, as the example. Using the new scaled dot product attention operator introduced with Accelerated PT2 Transformers, we select the flash_attention custom kernel and achieve faster training time per batch (measured with NVIDIA A100 GPUs), going from a ~143 ms/batch baseline to ~113 ms/batch. In addition, the enhanced implementation using the SDPA operator offers better numerical stability. Finally, further optimizations are achieved using padded inputs, which when combined with flash attention lead to ~87 ms/batch.

Recent times have seen exponential adoption of large language models (LLMs) and Generative AI in everyday life. Tightly coupled with these ever-growing models is the ever-growing training cost – in terms of both time and hardware utilization. The PyTorch team has tackled these challenges head on with Accelerated PyTorch 2 Transformers (previously known as “Better Transformer”) and JIT Compilation in PyTorch 2.0.

In this blog post, we explore training optimizations gained by utilizing custom kernel implementations of SDPA – also known as scaled dot product attention – a critical layer in transformer models. The custom kernel for SDPA replaces several discrete sequential operations with one globally optimized kernel which avoids allocating a large amount of intermediate CUDA memory. This approach offers a number of advantages, including but not limited to: higher performance computation of SDPA by reducing memory bandwidth bottleneck, reduced memory footprint to support larger batch sizes, and finally added numerical stability by prescaling input tensors. These optimizations are demonstrated on nanoGPT, an open-source implementation of GPT from Andrej Karpathy.

Background

Scaled dot product attention is the fundamental building block of multihead attention, as introduced in “Attention is All You Need”, and has a wide range of applications in LLM and Generative AI models.


Figure 1: The Transformer model architecture based on “Attention is All You Need”. With the new PyTorch SDPA operator, Multi-Head Attention is efficiently implemented by a linear layer for the in-projection, the SDPA operator, and a linear layer for the out-projection.

With the new scaled_dot_product_attention operator, multihead attention can be implemented in just 3 steps: in projection with a linear layer, SDPA, and out projection with a linear layer.

# In Projection
# variable descriptions:
# q, k, v = Query, Key, Value tensors
# bsz = batch size
# num_heads = Number of heads for Multihead Attention
# tgt_len = Target length
# src_len = Source length
# head_dim = Head dimension
q, k, v = _in_projection(query, key, value, q_proj_weight, k_proj_weight, v_proj_weight, b_q, b_k, b_v)
q = q.view(bsz, num_heads, tgt_len, head_dim)
k = k.view(bsz, num_heads, src_len, head_dim)
v = v.view(bsz, num_heads, src_len, head_dim)

# Scaled Dot Product Attention
attn_output = scaled_dot_product_attention(q, k, v, attn_mask, dropout_p, is_causal)

# Out Projection
attn_output = attn_output.permute(2, 0, 1, 3).contiguous().view(bsz * tgt_len, embed_dim)
attn_output = linear(attn_output, out_proj_weight, out_proj_bias)
attn_output = attn_output.view(tgt_len, bsz, attn_output.size(1))

PyTorch 2.0 supports multiple kernels, each optimized for specific use cases and each with specific requirements. A kernel picker picks the best kernel for a particular combination of input parameters. If no optimized “custom kernel” for a particular combination of input parameters can be identified, the kernel picker selects a general kernel that can handle all input combinations.

While future releases may extend this set of operators, PyTorch 2.0 launches with 3 implementations for the SDPA operator:

  1. A generic kernel which implements the mathematical equation of SDPA in the function sdpa_math()
  2. An optimized kernel based on the paper “Flash Attention”, which supports evaluation of SDPA with 16 bit floating point data types on compute architecture SM80 (A100).
  3. An optimized kernel based on the paper “Self-Attention Does Not Need O(n^2) Memory” and implemented in xFormers, which supports both 32 and 16 bit floating point data types on a wider range of architectures (SM40 and later). This blog post refers to this kernel as the mem_efficient kernel.

Note that both optimized kernels (two and three in the list above) support a key padding mask and limit the supported attention mask to causal attention. Accelerated PyTorch 2.0 Transformers today only support the causal mask when it is specified using the is_causal boolean. When a mask is specified explicitly, the general-purpose kernel will be selected because it is too expensive to analyze the contents of a provided mask to determine if it is the causal mask. Additional explanations on the constraints for each kernel can be found in the Accelerated PT2 Transformer blog.

Enabling Accelerated Transformers with nanoGPT

The SDPA operator being a critical component of the GPT model, we identified the open source nanoGPT model as an excellent candidate for both demonstrating the ease of implementation and benefits of PyTorch 2.0’s Accelerated Transformers. The following demonstrates the exact process by which Accelerated Transformers was enabled on nanoGPT.

This process largely revolves around replacing the existing SDPA implementation with the newly added F.scaled_dot_product_attention operator from functional.py. This process can be easily adapted to enable the operator in many other LLMs. Alternatively, users can instead choose to call F.multi_head_attention_forward() or utilize the nn.MultiHeadAttention module directly where applicable. The following code snippets are adapted from Karpathy’s nanoGPT repository.

Step 1: Identify the existing SDPA implementation

In the case of nanoGPT, SDPA is implemented in the model’s CausalSelfAttention class. The original implementation at time of writing is adapted below for this post.


Step 2: Replace with Torch’s scaled_dot_product_attention

At this point we can note the following:

  • Lines 36 – 42 define the mathematical implementation of SDPA which we are replacing
  • The mask applied on line 39 is no longer relevant since we are using scaled_dot_product_attention’s is_causal flag.
  • The dropout layer used in line 41 is also now unnecessary.

Swapping out the SDPA implementation for torch’s scaled_dot_product_attention and removing the now redundant code yields the following implementation.


Alternatively, the original mask can be passed into the attn_mask field; however, due to the kernel constraints mentioned above, that would limit the implementation to the generic sdpa_math kernel.
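The swap described in Step 2 can be sketched as a single module that keeps both code paths side by side. This is a minimal illustration rather than nanoGPT's actual source: the use_sdpa flag and the default sizes are hypothetical, chosen only so the two paths can be compared, and the sketch assumes PyTorch 2.0 or later.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    """Causal self-attention with a manual path and an SDPA path.

    `use_sdpa` and the default sizes are hypothetical, for illustration only.
    """

    def __init__(self, n_embd=64, n_head=4, block_size=16, dropout=0.0, use_sdpa=True):
        super().__init__()
        assert n_embd % n_head == 0
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)  # joint q, k, v projection
        self.c_proj = nn.Linear(n_embd, n_embd)      # output projection
        self.attn_dropout = nn.Dropout(dropout)
        self.n_head = n_head
        self.n_embd = n_embd
        self.dropout = dropout
        self.use_sdpa = use_sdpa
        # Lower-triangular mask, only needed by the manual path.
        mask = torch.tril(torch.ones(block_size, block_size))
        self.register_buffer("bias", mask.view(1, 1, block_size, block_size))

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        # Reshape to (B, n_head, T, head_dim).
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)

        if self.use_sdpa:
            # The mask and the dropout layer are folded into the operator.
            y = F.scaled_dot_product_attention(
                q, k, v,
                attn_mask=None,
                dropout_p=self.dropout if self.training else 0.0,
                is_causal=True,
            )
        else:
            # Manual SDPA: scale, mask, softmax, dropout, weighted sum.
            att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
            att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float("-inf"))
            att = F.softmax(att, dim=-1)
            att = self.attn_dropout(att)
            y = att @ v

        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)
```

With dropout at 0 and the module in eval mode, both paths should agree to within floating point tolerance on the same input, which makes for a quick sanity check before deleting the manual path.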

Step 3 (Bonus): Faster matmuls with padding

On top of the performance improvements from SDPA, our analysis yielded a nice ancillary win. In Andrej’s words “The most dramatic optimization to nanoGPT so far (~25% speedup) is to simply increase the vocab size from 50257 to 50304 (nearest multiple of 64).”

Tweet by Andrej Karpathy

The vocab size determines the dimensions of matmuls in the output layer of GPT, and these are so large that they were taking a majority of the time for the entire training loop! We discovered that they were achieving performance significantly below the peak throughput achievable on the A100 GPU, and guessed from NVIDIA’s matmul documentation that 64-element alignment would yield better results. Indeed, padding these matmuls achieves nearly a 3x speedup! The underlying cause is that unaligned memory accesses significantly reduce efficiency. A deeper analysis can be found in this Twitter thread.

With this optimization we were able to further reduce training time from ~113 ms (using flash attention) to ~87 ms per batch.
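The padding rule from Step 3 is simple arithmetic: round the vocabulary size up to the next multiple of 64. A small sketch follows; the helper name pad_to_multiple is hypothetical and not part of nanoGPT or PyTorch.

```python
def pad_to_multiple(n: int, multiple: int = 64) -> int:
    """Round n up to the nearest multiple of `multiple` (hypothetical helper)."""
    return ((n + multiple - 1) // multiple) * multiple


gpt2_vocab_size = 50257
padded_vocab_size = pad_to_multiple(gpt2_vocab_size)  # 50304, as quoted above
extra_rows = padded_vocab_size - gpt2_vocab_size      # unused embedding rows
```

The extra rows in the embedding and output layers are never indexed by real tokens; they exist only so the matmul dimensions are aligned.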

Results

The figure below demonstrates the performance gained using PyTorch custom kernels. Here are the exact figures:

  • baseline (nanoGPT implementation): ~143ms
  • sdpa_math (generic): ~134ms (6.71% faster)
  • mem_efficient kernel: ~119ms (20.16% faster)
  • flash_attention kernel: ~113ms (26.54% faster)
  • flash_attention + padded vocab: ~87ms (64.37% faster)

All code was run on an 8 x NVIDIA Corporation A100 server with 80 GB HBM [A100 SXM4 80GB], and for the purpose of this experiment dropout was set to 0.
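The percentages in the list appear to be throughput speedups, that is, baseline time divided by new time, minus one. The sketch below recomputes them from the rounded millisecond figures under that assumption; the results differ from the reported values in the second decimal place, presumably because the published timings are rounded.

```python
# Per-batch timings in milliseconds, as listed above (rounded values).
baseline_ms = 143.0
timings_ms = {
    "sdpa_math": 134.0,
    "mem_efficient": 119.0,
    "flash_attention": 113.0,
    "flash_attention + padded vocab": 87.0,
}
# Percentages as reported in the post.
reported_pct = {
    "sdpa_math": 6.71,
    "mem_efficient": 20.16,
    "flash_attention": 26.54,
    "flash_attention + padded vocab": 64.37,
}


def speedup_pct(baseline: float, new: float) -> float:
    """Throughput speedup in percent: how much more work per unit time."""
    return (baseline / new - 1.0) * 100.0


computed_pct = {name: speedup_pct(baseline_ms, ms) for name, ms in timings_ms.items()}
for name, pct in computed_pct.items():
    print(f"{name}: {pct:.2f}% faster (reported {reported_pct[name]:.2f}%)")
```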


Figure 2: Using scaled dot product attention with custom kernels and torch.compile delivers significant speedups for training large language models, such as for nanoGPT shown here.

Enhancing Numerical Model Stability

In addition to being faster, PyTorch’s implementation offers increased numerical stability by avoiding loss of precision in many execution scenarios. There is a great explanation here, but essentially the PyTorch implementation scales the Query and Key matrices before multiplication, which is said to be more stable and avoid loss of precision. Because of the merged custom kernel architecture of SDPA, this scaling does not introduce additional overhead in the computation of the attention result. In comparison, an implementation from the individual computational components would require separate pre-scaling at additional cost. For an additional explanation, see Appendix A.

Improved Memory Consumption

Yet another large advantage of using the torch SDPA kernels is the reduced memory footprint, which allows for the utilization of larger batch sizes. The following chart compares the best validation loss after one hour of training for both flash attention and the baseline implementations of causal attention. As can be seen, the maximum batch size achieved with the baseline causal attention implementation (on 8 x NVIDIA Corporation A100 server with 80 GB HBM) was 24, significantly less than the maximum achieved with flash attention, which was 39.


Figure 3: Using Flash Attention enables the usage of larger batch sizes, allowing users to achieve lower validation loss after one hour of training (smaller is better).

Conclusion

Accelerated PyTorch 2 Transformers were designed to make the training and production deployment of state-of-the-art transformer models affordable and integrated with PyTorch 2.0 model JIT compilation. The newly introduced PyTorch SDPA operator provides improved performance for training Transformer models and is particularly valuable for the expensive Large Language Model training. In this post we demonstrate a number of optimizations on the exemplary nanoGPT model including:

  • Over 26% training speedup, when compared against the baseline with constant batch size
  • An additional speedup achieved with padded vocabulary, bringing the total optimization to approximately 64% compared to the baseline
  • Additional numerical stability

Appendix A: Analyzing Attention Numeric Stability

In this section we provide a more in depth explanation of the previously mentioned enhanced numerical stability which is gained by prescaling SDPA’s input vectors. The following is a simplified version of nanoGPT’s mathematical implementation of SDPA. The important thing to note here is that the query undergoes matrix multiplication without being scaled.

# nanoGPT implementation of SDPA
# notice q (our query vector) is not scaled !
att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
att = F.softmax(att, dim=-1)

# Dropout is set to 0, so we can safely ignore this line in the implementation
# att = self.attn_dropout(att)

y_nanogpt = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)

The following is the equivalent mathematical implementation in torch’s scaled_dot_product_attention.

# PyTorch implementation of SDPA
embed_size = q.size(-1)
scaling_factor = math.sqrt(math.sqrt(embed_size))
q = q / scaling_factor 	# notice q _is_ scaled here !

# same as above, but with scaling factor
att = q @ (k.transpose(-2, -1) / scaling_factor)
att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
att = F.softmax(att, dim=-1)

# Dropout is set to 0, so we can safely ignore this line in the implementation
# att = self.attn_dropout(att)

y_scale_before = att @ v

Mathematically both approaches should be equivalent; however, our experimentation shows that in practice the two approaches produce different results.

Using the approach above, we verified y_scale_before matches the expected output from using the scaled_dot_product_attention method while y_nanogpt does not.

The torch.allclose method was used to test equivalence. Specifically, we showed that:

# Public PyTorch 2.0 API; returns the attention output tensor directly
y_sdpa = torch.nn.functional.scaled_dot_product_attention(
	q,
	k,
	v,
	attn_mask=self.bias[:,:,:T,:T] != 0,
	dropout_p=0.0,
	is_causal=False,
)

torch.allclose(y_sdpa, y_nanogpt) # False, indicating fp issues
torch.allclose(y_sdpa, y_scale_before) # True, as expected
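One plausible intuition for why scaling before the multiplication helps is the limited range of 16-bit floats. The sketch below is a simplified pure-Python model, not a reproduction of the experiment above: it assumes the intermediate dot product would be stored in float16 (real kernels may accumulate in higher precision), and the input values are made up to trigger the effect.

```python
import math

FP16_MAX = 65504.0  # largest finite float16 value


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


head_dim = 64
q = [40.0] * head_dim  # made-up, deliberately large activations
k = [40.0] * head_dim

# Scale after the matmul: the raw product must be representable first.
raw_logit = dot(q, k)
overflows_when_scaled_after = raw_logit > FP16_MAX

# Scale before the matmul: divide q and k by head_dim ** 0.25 each,
# so the product is already divided by sqrt(head_dim).
scaling_factor = math.sqrt(math.sqrt(head_dim))
q_scaled = [x / scaling_factor for x in q]
k_scaled = [x / scaling_factor for x in k]
prescaled_logit = dot(q_scaled, k_scaled)
overflows_when_scaled_before = prescaled_logit > FP16_MAX
```

Both orderings target the same logit, raw_logit / sqrt(head_dim), but only the prescaled version keeps every intermediate value inside the float16 range in this model.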

Appendix B: Reproducing Experiment Results

Researchers seeking to reproduce these results should start with the following commit from Andrej’s nanoGPT repository – b3c17c6c6a363357623f223aaa4a8b1e89d0a465. This commit was used as the baseline when measuring the per batch speed improvements. For results which include padded vocabulary optimizations (which yielded the most significant improvements to batch speed), use the following commit – 77e7e04c2657846ddf30c1ca2dd9f7cbb93ddeab. From either checkout, selecting kernels for experimentation is made trivial with the use of the torch.backends API.

The desired kernel can be selected via a context manager:

with torch.backends.cuda.sdp_kernel(
    enable_math = False,
    enable_flash = False,
    enable_mem_efficient = True
):
    train(model)

Read More

Differentially private heatmaps


Recently, differential privacy (DP) has emerged as a mathematically robust notion of user privacy for data aggregation and machine learning (ML), with practical deployments including the 2022 US Census and in industry. Over the last few years, we have open-sourced libraries for privacy-preserving analytics and ML and have been constantly enhancing their capabilities. Meanwhile, new algorithms have been developed by the research community for several analytic tasks involving private aggregation of data.

One such important data aggregation method is the heatmap. Heatmaps are popular for visualizing aggregated data in two or more dimensions. They are widely used in many fields including computer vision, image processing, spatial data analysis, bioinformatics, and more. Protecting the privacy of user data is critical for many applications of heatmaps. For example, heatmaps for gene microdata are based on private data from individuals. Similarly, a heatmap of popular locations in a geographic area is based on user location check-ins that need to be kept private.

Motivated by such applications, in “Differentially Private Heatmaps” (presented at AAAI 2023), we describe an efficient DP algorithm for computing heatmaps with provable guarantees and evaluate it empirically. At the core of our DP algorithm for heatmaps is a solution to the basic problem of how to privately aggregate sparse input vectors (i.e., input vectors with a small number of non-zero coordinates) with a small error as measured by the Earth Mover’s Distance (EMD). Using a hierarchical partitioning procedure, our algorithm views each input vector, as well as the output heatmap, as a probability distribution over a number of items equal to the dimension of the data. For the problem of sparse aggregation under EMD, we give an efficient algorithm with error asymptotically close to the best possible.

Algorithm description

Our algorithm works by privatizing the aggregated distribution (obtained by averaging over all user inputs), which is sufficient for computing a final heatmap that is private due to the post-processing property of DP. This property ensures that any transformation of the output of a DP algorithm remains differentially private. Our main contribution is a new privatization algorithm for the aggregated distribution, which we will describe next.

The EMD measure, which is a distance-like measure of dissimilarity between two probability distributions originally proposed for computer vision tasks, is well-suited for heatmaps since it takes the underlying metric space into account and considers “neighboring” bins. EMD is used in a variety of applications including deep learning, spatial analysis, human mobility, image retrieval, face recognition, visual tracking, shape matching, and more.
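To see what "taking the underlying metric space into account" means, consider a one-dimensional simplification (the heatmaps in this post are two-dimensional). For histograms over equally spaced bins on a line, EMD reduces to the sum of absolute differences of the cumulative distributions. The function name emd_1d is hypothetical.

```python
def emd_1d(p, q):
    """Earth Mover's Distance between two 1D histograms of equal total mass.

    Assumes unit spacing between adjacent bins (illustrative simplification).
    """
    total = 0.0
    carry = 0.0  # mass that still has to be moved past the current bin
    for a, b in zip(p, q):
        carry += a - b
        total += abs(carry)
    return total


def l1(p, q):
    """Bin-wise L1 distance, which ignores where the bins are located."""
    return sum(abs(a - b) for a, b in zip(p, q))


truth = [1.0, 0.0, 0.0, 0.0]
near = [0.0, 1.0, 0.0, 0.0]  # mass shifted to the neighboring bin
far = [0.0, 0.0, 0.0, 1.0]   # mass shifted to the far end

emd_near, emd_far = emd_1d(truth, near), emd_1d(truth, far)
l1_near, l1_far = l1(truth, near), l1(truth, far)
```

Here the L1 distance cannot tell the two errors apart, while EMD charges three times as much for moving the mass three bins away as for moving it one bin.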

To achieve DP, we need to add noise to the aggregated distribution. We would also like to preserve statistics at different scales of the grid to minimize the EMD error. So, we create a hierarchical partitioning of the grid, add noise at each level, and then recombine into the final DP aggregated distribution. In particular, the algorithm has the following steps:

  1. Quadtree construction: Our hierarchical partitioning procedure first divides the grid into four cells, then divides each cell into four subcells; it recursively continues this process until each cell is a single pixel. This procedure creates a quadtree over the subcells where the root represents the entire grid and each leaf represents a pixel. The algorithm then calculates the total probability mass for each tree node (obtained by adding up the aggregated distribution’s probabilities of all leaves in the subtree rooted at this node). This step is illustrated below.
    In the first step, we take the (non-private) aggregated distribution (top left) and repeatedly divide it to create a quadtree. Then, we compute the total probability mass in each cell (bottom).
  2. Noise addition: To each tree node’s mass we then add Laplace noise calibrated to the use case.
  3. Truncation: To help reduce the final amount of noise in our DP aggregated distribution, the algorithm traverses the tree starting from the root and, at each level, it discards all but the top w nodes with highest (noisy) masses together with their descendants.
  4. Reconstruction: Finally, the algorithm solves a linear program to recover the aggregated distribution. This linear program is inspired by the sparse recovery literature where the noisy masses are viewed as (noisy) measurements of the data.
In step 2, noise is added to each cell’s probability mass. Then in step 3, only top-w cells are kept (green) whereas the remaining cells are truncated (red). Finally, in the last step, we write a linear program on these top cells to reconstruct the aggregated distribution, which is now differentially private.
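Steps 1 to 3 above can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the grid is assumed to be square with a power-of-two side, the noise scale is an arbitrary placeholder rather than a value calibrated to a privacy budget, and the linear program of step 4 is omitted.

```python
import random


def quadtree_masses(grid):
    """Step 1: per-level total masses, from the root (1x1) down to the pixels."""
    levels = [grid]
    while len(levels[-1]) > 1:
        fine = levels[-1]
        n = len(fine) // 2
        coarse = [
            [fine[2 * i][2 * j] + fine[2 * i][2 * j + 1]
             + fine[2 * i + 1][2 * j] + fine[2 * i + 1][2 * j + 1]
             for j in range(n)]
            for i in range(n)
        ]
        levels.append(coarse)
    return levels[::-1]


def laplace(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize_tree(levels, scale, w, rng):
    """Steps 2 and 3: noise each surviving node, keep the top w per level.

    Noise is sampled lazily for surviving nodes only; truncated subtrees
    never influence the output, so their noise does not need to be drawn.
    """
    survivors = [(0, 0)]  # start at the root
    kept_per_level = []
    for level in levels:
        noisy = {(i, j): level[i][j] + laplace(scale, rng) for (i, j) in survivors}
        top = sorted(noisy, key=noisy.get, reverse=True)[:w]
        kept_per_level.append({cell: noisy[cell] for cell in top})
        # Only children of kept nodes survive to the next level.
        survivors = [(2 * i + di, 2 * j + dj)
                     for (i, j) in top for di in (0, 1) for dj in (0, 1)]
    return kept_per_level


grid = [
    [0.40, 0.10, 0.00, 0.00],
    [0.20, 0.10, 0.00, 0.00],
    [0.00, 0.00, 0.05, 0.05],
    [0.00, 0.00, 0.05, 0.05],
]
levels = quadtree_masses(grid)
kept = privatize_tree(levels, scale=0.01, w=2, rng=random.Random(0))
```

The kept noisy masses are what the reconstruction step would treat as noisy measurements of the aggregated distribution.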

Experimental results

We evaluate the performance of our algorithm in two different domains: real-world location check-in data and image saliency data. We consider as a baseline the ubiquitous Laplace mechanism, where we add Laplace noise to each cell, zero out any negative cells, and produce the heatmap from this noisy aggregate. We also consider a “thresholding” variant of this baseline that is more suited to sparse data: only keep top t% of the cell values (based on the probability mass in each cell) after noising while zeroing out the rest. To evaluate the quality of an output heatmap compared to the true heatmap, we use Pearson coefficient, KL-divergence, and EMD. Note that when the heatmaps are more similar, the first metric increases but the latter two decrease.
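The baseline and its thresholding variant described above can be sketched as follows. The function name, the flat list representation of cells and the noise scale are assumptions made for illustration; the scale is a placeholder and is not calibrated to any particular ε.

```python
import random


def laplace(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_baseline(cells, scale, rng, top_fraction=None):
    """Noise every cell, zero out negatives, optionally keep only the top cells.

    `top_fraction` (e.g. 0.1 for the top 10%) selects the thresholding variant.
    """
    noisy = [max(0.0, c + laplace(scale, rng)) for c in cells]
    if top_fraction is not None:
        keep = max(1, int(len(noisy) * top_fraction))
        cutoff = sorted(noisy, reverse=True)[keep - 1]
        noisy = [x if x >= cutoff else 0.0 for x in noisy]
    return noisy


# A sparse toy aggregate: 10 occupied cells out of 100.
cells = [0.0] * 90 + [0.1] * 10
plain = laplace_baseline(cells, scale=0.01, rng=random.Random(1))
thresholded = laplace_baseline(cells, scale=0.01, rng=random.Random(1), top_fraction=0.1)
```

On sparse data the plain baseline leaves small positive noise in many empty cells, which is the motivation for zeroing out everything below the cutoff in the thresholding variant.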

The locations dataset is obtained by combining two datasets, Gowalla and Brightkite, both of which contain check-ins by users of location-based social networks. We pre-processed this dataset to consider only check-ins in the continental US resulting in a final dataset consisting of ~500,000 check-ins by ~20,000 users. Considering the top cells (from an initial partitioning of the entire space into a 300 x 300 grid) that have check-ins from at least 200 unique users, we partition each such cell into subgrids with a resolution of ∆ × ∆ and assign each check-in to one of these subgrids.

In the first set of experiments, we fix ∆ = 256. We test the performance of our algorithm for different values of ε (the privacy parameter, where smaller ε means stronger DP guarantees), ranging from 0.1 to 10, by running our algorithms together with the baseline and its variants on all cells, randomly sampling a set of 200 users in each trial, and then computing the distance metrics between the true heatmap and the DP heatmap. The average of these metrics is presented below. Our algorithm (the red line) performs better than all versions of the baseline across all metrics, with improvements that are especially significant when ε is not too large or small (i.e., 0.2 ≤ ε ≤ 5).

Metrics averaged over 60 runs when varying ε for the location dataset. Shaded areas indicate 95% confidence interval.

Next, we study the effect of varying the number n of users. By fixing a single cell (with > 500 users) and ε, we vary n from 50 to 500 users. As predicted by theory, our algorithms and the baseline perform better as n increases. However, the behavior of the thresholding variants of the baseline is less predictable.

We also run another experiment where we fix a single cell and ε, and vary the resolution ∆ from 64 to 256. In agreement with theory, our algorithm’s performance remains nearly constant for the entire range of ∆. However, the baseline suffers across all metrics as ∆ increases while the thresholding variants occasionally improve as ∆ increases.

Effect of the number of users and grid resolution on EMD.

We also experiment on the Salicon image saliency dataset (SALICON). This dataset is a collection of saliency annotations on the Microsoft Common Objects in Context image database. We downsized the images to a fixed resolution of 320 × 240 and each [user, image] pair consists of a sequence of coordinates in the image where the user looked. We repeat the experiments described previously on 38 randomly sampled images (with ≥ 50 users each) from SALICON. As we can see from the examples below, the heatmap obtained by our algorithm is very close to the ground truth.

Example visualization of different algorithms for two different natural images from SALICON for ε = 10 and n = 50 users. The algorithms from left to right are: original heatmap (no privacy), baseline, and ours.

Additional experimental results, including those on other datasets, metrics, privacy parameters and DP models, can be found in the paper.

Conclusion

We presented a privatization algorithm for sparse distribution aggregation under the EMD metric, which in turn yields an algorithm for producing privacy-preserving heatmaps. Our algorithm extends naturally to distributed models that can implement the Laplace mechanism, including the secure aggregation model and the shuffle model. This does not apply to the more stringent local DP model, and it remains an interesting open question to devise practical local DP heatmap/EMD aggregation algorithms for a “moderate” number of users and privacy parameters.

Acknowledgments

This work was done jointly with Junfeng He, Kai Kohlhoff, Ravi Kumar, Pasin Manurangsi, and Vidhya Navalpakkam.

Read More

Financial text generation using a domain-adapted fine-tuned large language model in Amazon SageMaker JumpStart


Large language models (LLMs) with billions of parameters are currently at the forefront of natural language processing (NLP). These models are shaking up the field with their incredible abilities to generate text, analyze sentiment, translate languages, and much more. With access to massive amounts of data, LLMs have the potential to revolutionize the way we interact with language. Although LLMs are capable of performing various NLP tasks, they are considered generalists and not specialists. In order to train an LLM to become an expert in a particular domain, fine-tuning is usually required.

One of the major challenges in training and deploying LLMs with billions of parameters is their size, which can make it difficult to fit them into single GPUs, the hardware commonly used for deep learning. The sheer scale of these models requires high-performance computing resources, such as specialized GPUs with large amounts of memory. Additionally, the size of these models can make them computationally expensive, which can significantly increase training and inference times.

In this post, we demonstrate how we can use Amazon SageMaker JumpStart to easily fine-tune a large language text generation model on a domain-specific dataset in the same way you would train and deploy any model on Amazon SageMaker. In particular, we show how you can fine-tune the GPT-J 6B language model for financial text generation using both the JumpStart SDK and Amazon SageMaker Studio UI on a publicly available dataset of SEC filings.

JumpStart helps you quickly and easily get started with machine learning (ML) and provides a set of solutions for the most common use cases that can be trained and deployed readily with just a few steps. All the steps in this demo are available in the accompanying notebook Fine-tuning text generation GPT-J 6B model on a domain specific dataset.

Solution overview

In the following sections, we provide a step-by-step demonstration for fine-tuning an LLM for text generation tasks via both the JumpStart Studio UI and Python SDK. In particular, we discuss the following topics:

  • An overview of the SEC filing data in the financial domain that the model is fine-tuned on
  • An overview of the LLM GPT-J 6B model we have chosen to fine-tune
  • A demonstration of two different ways we can fine-tune the LLM using JumpStart:
    • Use JumpStart programmatically with the SageMaker Python SDK
    • Access JumpStart using the Studio UI
  • An evaluation of the fine-tuned model by comparing it with the pre-trained model without fine-tuning

Fine-tuning refers to the process of taking a pre-trained language model and training it for a different but related task using specific data. This approach is also known as transfer learning, which involves transferring the knowledge learned from one task to another. LLMs like GPT-J 6B are trained on massive amounts of unlabeled data and can be fine-tuned on smaller datasets, making the model perform better in a specific domain.

As an example of how performance improves when the model is fine-tuned, consider asking it the following question:

“What drives sales growth at Amazon?”

Without fine-tuning, the response would be:

“Amazon is the world’s largest online retailer. It is also the world’s largest online marketplace. It is also the world”

With fine-tuning, the response is:

“Sales growth at Amazon is driven primarily by increased customer usage, including increased selection, lower prices, and increased convenience, and increased sales by other sellers on our websites.”

The improvement from fine-tuning is evident.

We use financial text from SEC filings to fine-tune a GPT-J 6B LLM for financial applications. In the next sections, we introduce the data and the LLM that will be fine-tuned.

SEC filing dataset

SEC filings are critical for regulation and disclosure in finance. Filings notify the investor community about companies’ business conditions and the future outlook of the companies. The text in SEC filings covers the entire gamut of a company’s operations and business conditions. Because of their potential predictive value, these filings are good sources of information for investors. Although these SEC filings are publicly available to anyone, downloading parsed filings and constructing a clean dataset with added features is a time-consuming exercise. We make this possible in a few API calls in the JumpStart Industry SDK.

Using the SageMaker API, we downloaded annual reports (10-K filings; see How to Read a 10-K for more information) for a large number of companies. We select Amazon’s SEC filing reports for years 2021–2022 as the training data to fine-tune the GPT-J 6B model. In particular, we concatenate the company’s SEC filing reports from different years into a single text file, except for the “Management Discussion and Analysis” section, which contains forward-looking statements by the company’s management and is used as the validation data.
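The split described above can be sketched as follows. This is a minimal illustration, assuming each filing has already been parsed into a mapping from section title to text; the filings structure, the section titles, and the placeholder text are hypothetical, and the output of the JumpStart Industry SDK may be organized differently.

```python
# Hypothetical title of the section held out as validation data.
MDA_TITLE = "Management Discussion and Analysis"


def split_filings(filings):
    """Split {year: {section_title: text}} into (train_text, validation_text).

    All sections except the MD&A section are concatenated across years into
    the training text; the MD&A sections form the validation text.
    """
    train_parts, validation_parts = [], []
    for year in sorted(filings):
        for title, text in filings[year].items():
            target = validation_parts if title == MDA_TITLE else train_parts
            target.append(text.strip())
    return "\n".join(train_parts), "\n".join(validation_parts)


# Toy placeholder data standing in for parsed 10-K sections.
filings = {
    2021: {"Business": "We serve consumers ...", MDA_TITLE: "Net sales increased ..."},
    2022: {"Business": "We seek to be ...", MDA_TITLE: "Operating income decreased ..."},
}
train_text, validation_text = split_filings(filings)
print(train_text)
print(validation_text)
```

Keeping the held-out section out of the training text is what makes the validation perplexity reported later a measure on unseen text.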

The expectation is that after fine-tuning the GPT-J 6B text generation model on the financial SEC documents, the model is able to generate insightful finance-related text, and can therefore be used to solve multiple domain-specific NLP tasks.

GPT-J 6B large language model

GPT-J 6B is an open-source, 6-billion-parameter model released by EleutherAI. GPT-J 6B has been trained on a large corpus of text data and is capable of performing various NLP tasks such as text generation, text classification, and text summarization. Although this model performs impressively on a number of NLP tasks without any fine-tuning, in many cases you will need to fine-tune it on a dataset specific to the NLP task you are trying to solve. Use cases include custom chatbots, idea generation, entity extraction, classification, and sentiment analysis.

Access LLMs on SageMaker

Now that we have identified the dataset and the model we are going to fine-tune, JumpStart provides two avenues to get started with text generation fine-tuning: the SageMaker SDK and Studio.

Use JumpStart programmatically with the SageMaker SDK

We now go over an example of how you can use the SageMaker JumpStart SDK to access an LLM (GPT-J 6B) and fine-tune it on the SEC filing dataset. Upon completion of fine-tuning, we deploy the fine-tuned model and run inference against it. All the steps in this post are available in the accompanying notebook: Fine-tuning text generation GPT-J 6B model on domain specific dataset.

In this example, JumpStart uses the SageMaker Hugging Face Deep Learning Container (DLC) and DeepSpeed library to fine-tune the model. The DeepSpeed library is designed to reduce computing power and memory use and to train large distributed models with better parallelism on existing computer hardware. It supports single node distributed training, utilizing gradient checkpointing and model parallelism to train large models on a single SageMaker training instance with multiple GPUs. With JumpStart, we integrate the DeepSpeed library with the SageMaker Hugging Face DLC for you and take care of everything under the hood. You can easily fine-tune the model on your domain-specific dataset without manually setting it up.

Fine-tune the pre-trained model on domain-specific data

To fine-tune a selected model, we need to get that model’s URI, as well as the training script and the container image used for training. To make things easy, these three inputs depend solely on the model name, version (for a list of the available models, see Built-in Algorithms with pre-trained Model Table), and the type of instance you want to train on. This is demonstrated in the following code snippet:

from sagemaker import image_uris, model_uris, script_uris, hyperparameters

model_id, model_version = "huggingface-textgeneration1-gpt-j-6b", "*"
training_instance_type = "ml.g5.12xlarge"

# Retrieve the docker image
train_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    model_id=model_id,
    model_version=model_version,
    image_scope="training",
    instance_type=training_instance_type,
)

# Retrieve the training script
train_source_uri = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="training"
)

# Retrieve the pre-trained model tarball to further fine-tune
train_model_uri = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="training"
)

We retrieve the model_id corresponding to the same model we want to use. In this case, we fine-tune huggingface-textgeneration1-gpt-j-6b.

Defining hyperparameters involves setting the values for various parameters used during the training process of an ML model. These parameters can affect the model’s performance and accuracy. In the following step, we establish the hyperparameters by utilizing the default settings and specifying custom values for parameters such as epochs and learning_rate:

from sagemaker import hyperparameters

# Retrieve the default hyper-parameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(model_id=model_id, model_version=model_version)

# [Optional] Override default hyperparameters with custom values
hyperparameters["epochs"] = "6"

hyperparameters["learning_rate"] = "2e-04"
print(hyperparameters)

JumpStart provides an extensive list of hyperparameters available to tune. The following list provides an overview of some of the key hyperparameters used in fine-tuning the model. For a full list of hyperparameters, see the notebook Fine-tuning text generation GPT-J 6B model on domain specific dataset.

  • epochs – Specifies the maximum number of passes (epochs) over the training dataset.
  • learning_rate – Controls the step size the optimization algorithm takes when updating the model weights during training.
  • eval_steps – Specifies how many steps to run before evaluating the model on the validation set during training. The validation set is a subset of the data that is not used for training, but instead is used to check the performance of the model on unseen data.
  • weight_decay – Controls the regularization strength during model training. Regularization is a technique that helps prevent the model from overfitting the training data, which can result in better performance on unseen data.
  • fp16 – Controls whether to use 16-bit (mixed) precision training instead of 32-bit training.
  • evaluation_strategy – The evaluation strategy used during training.
  • gradient_accumulation_steps – The number of steps over which gradients are accumulated before performing a weight update.

For further details regarding hyperparameters, refer to the official Hugging Face Trainer documentation.
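As a worked example of what gradient_accumulation_steps does: the number of training examples that contribute to one weight update is the per-device batch size, times the number of data-parallel workers, times the accumulation steps. The sketch below is plain arithmetic; the argument names and the values are illustrative assumptions, not JumpStart defaults, and the number of data-parallel workers can be smaller than the GPU count if the model is sharded across devices.

```python
def effective_batch_size(per_device_batch_size, data_parallel_workers,
                         gradient_accumulation_steps):
    """Number of examples that contribute to a single weight update."""
    return per_device_batch_size * data_parallel_workers * gradient_accumulation_steps


# Illustrative values only: 4 examples per device, 4 workers, 8 accumulation steps.
print(effective_batch_size(4, 4, 8))  # 128
```

Raising gradient_accumulation_steps is a common way to reach a larger effective batch size when a larger per-device batch does not fit in GPU memory, at the cost of fewer weight updates per epoch.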

You can now fine-tune this JumpStart model on your own custom dataset using the SageMaker SDK. We use the SEC filing data we described earlier. The train and validation data is hosted under train_dataset_s3_path and validation_dataset_s3_path. The supported format of the data includes CSV, JSON, and TXT. For the CSV and JSON data, the text data is used from the column called text or the first column if no column called text is found. Because this is for text generation fine-tuning, no ground truth labels are required. The following code is an SDK example of how to fine-tune the model:

from sagemaker.estimator import Estimator
from sagemaker.utils import name_from_base

train_dataset_s3_path = "s3://jumpstart-cache-prod-us-west-2/training-datasets/tc/data.csv"
validation_dataset_s3_path = "s3://jumpstart-cache-prod-us-west-2/training-datasets/tc/data.csv"

training_job_name = name_from_base(f"jumpstart-example-{model_id}")

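# Regexes that SageMaker applies to the training logs to extract metrics
# (the decimal point is escaped so that it only matches a literal ".")
metric_definitions = [
    {'Name': 'train:loss', 'Regex': r"'loss': ([0-9]+\.[0-9]+)"},
    {'Name': 'eval:loss', 'Regex': r"'eval_loss': ([0-9]+\.[0-9]+)"},
    {'Name': 'eval:runtime', 'Regex': r"'eval_runtime': ([0-9]+\.[0-9]+)"},
    {'Name': 'eval:samples_per_second', 'Regex': r"'eval_samples_per_second': ([0-9]+\.[0-9]+)"},
    {'Name': 'eval:eval_steps_per_second', 'Regex': r"'eval_steps_per_second': ([0-9]+\.[0-9]+)"},
]

# Create SageMaker Estimator instance
# (aws_role and s3_output_location are assumed to be defined earlier in the notebook)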
tg_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    hyperparameters=hyperparameters,
    output_path=s3_output_location,
    base_job_name=training_job_name,
    enable_network_isolation=True,
    metric_definitions=metric_definitions
)

# Launch a SageMaker Training job by passing s3 path of the training data
tg_estimator.fit({"train": train_dataset_s3_path, "validation": validation_dataset_s3_path}, logs=True)

After we have set up the SageMaker Estimator with the required hyperparameters, we call its .fit method to start fine-tuning our model, passing it the Amazon Simple Storage Service (Amazon S3) URIs for our training and validation data. As you can see, the entry_point script provided is named transfer_learning.py (the same for other tasks and models), and the input data channels passed to .fit must be named train and validation.
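The column-selection rule described earlier for CSV input (use the column called text, otherwise fall back to the first column) can be sketched in a few lines. This only illustrates the rule as stated in this post; it assumes the CSV has a header row and is not the code the JumpStart training script runs.

```python
import csv
import io


def select_text_column(csv_text):
    """Return the values of the 'text' column, or of the first column if absent."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    index = header.index("text") if "text" in header else 0
    return [row[index] for row in reader if row]


# Toy CSV contents: one with a 'text' column, one without.
with_text = "id,text\n1,Net sales increased.\n2,Operating income decreased.\n"
without_text = "passage,source\nWe serve consumers.,10-K\n"

print(select_text_column(with_text))
print(select_text_column(without_text))
```

Naming the column text explicitly avoids depending on column order when a dataset carries extra metadata columns.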

JumpStart also supports hyperparameter optimization with SageMaker automatic model tuning. For details, see the example notebook.
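The metric_definitions passed to the estimator are regular expressions that SageMaker applies to the training job's log output; the first capture group becomes the metric value. The following sketch mimics that extraction locally with Python's re module. The log lines are made-up examples written in the style of the Hugging Face Trainer output, and extract_metrics is a hypothetical helper, not part of the SageMaker SDK.

```python
import re

# Two of the metric definitions, with the decimal point escaped.
metric_definitions = [
    {"Name": "train:loss", "Regex": r"'loss': ([0-9]+\.[0-9]+)"},
    {"Name": "eval:loss", "Regex": r"'eval_loss': ([0-9]+\.[0-9]+)"},
]

# Illustrative log lines (not taken from a real training job).
log_lines = [
    "{'loss': 2.0975, 'learning_rate': 0.0002, 'epoch': 1.0}",
    "{'eval_loss': 0.3626, 'eval_runtime': 12.5, 'epoch': 1.0}",
]


def extract_metrics(lines, definitions):
    """Return {metric_name: last value captured by its regex}."""
    found = {}
    for line in lines:
        for definition in definitions:
            match = re.search(definition["Regex"], line)
            if match:
                found[definition["Name"]] = float(match.group(1))
    return found


metrics = extract_metrics(log_lines, metric_definitions)
print(metrics)
```

Because the train:loss pattern starts with a quote character, it does not accidentally match the 'eval_loss' entry in the second line.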

Deploy the fine-tuned model

When training is complete, you can deploy your fine-tuned model. To do so, all we need to obtain is the inference script URI (the code that determines how the model is used for inference once deployed) and the inference container image URI, which includes an appropriate model server to host the model we chose. See the following code:

import boto3
import sagemaker
from sagemaker import image_uris
from sagemaker.utils import name_from_base

sagemaker_session = sagemaker.Session(boto_session=boto3.Session(region_name="us-west-2"))

# Instance type used to host the endpoint
inference_instance_type = "ml.g5.12xlarge"

# Retrieve the inference docker container uri
deploy_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    image_scope="inference",
    model_id=model_id,
    model_version=model_version,
    instance_type=inference_instance_type,
)

endpoint_name = name_from_base(f"jumpstart-example-{model_id}")

# Use the estimator from the previous step to deploy to a SageMaker endpoint
finetuned_predictor = tg_estimator.deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    image_uri=deploy_image_uri,
    endpoint_name=endpoint_name,
)

After a few minutes, our model is deployed and we can get predictions from it in real time!

Access JumpStart through the Studio UI

Another way to fine-tune and deploy JumpStart models is through the Studio UI. This UI provides a low-code/no-code solution to fine-tuning LLMs.

On the Studio console, choose Models, notebooks, solutions under SageMaker JumpStart in the navigation pane.

In the search bar, search for the model you want to fine-tune and deploy.

In our case, we chose the GPT-J 6B model card. Here we can directly fine-tune or deploy the LLM.

Model evaluation

When evaluating an LLM, we can use perplexity (PPL). PPL is a common measure of how well a language model is able to predict the next word in a sequence. In simpler terms, it’s a way to measure how well the model can understand and generate human-like language.

A lower perplexity score means that the model is better at predicting the next word. In practical terms, we can use perplexity to compare different language models and determine which one performs better on a given task. We can also use it to track the performance of a single model over time. For more details, refer to Perplexity of fixed-length models.
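Concretely, perplexity is commonly defined as the exponential of the average negative log-likelihood per token, PPL = exp(-(1/N) Σ log p(x_i | x_<i)), which is the same as the exponential of the mean cross-entropy loss (in nats). The sketch below shows that standard definition on toy numbers; this post does not state how the JumpStart training script computes the value it logs, so treat this as background rather than a description of that script.

```python
import math


def perplexity(token_log_probs):
    """exp of the average negative log-likelihood (natural log) per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)


def perplexity_from_loss(mean_cross_entropy_loss):
    """Perplexity from a mean cross-entropy loss measured in nats."""
    return math.exp(mean_cross_entropy_loss)


# Toy log-probabilities that a model assigned to each observed token.
confident = [math.log(0.9), math.log(0.8), math.log(0.95)]
uncertain = [math.log(0.2), math.log(0.1), math.log(0.3)]

print(perplexity(confident) < perplexity(uncertain))  # True
```

A model that assigns probability 0.5 to every observed token has a perplexity of 2, which is one way to read the metric: the model is, on average, as uncertain as if it were choosing uniformly among that many options.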

We evaluate the model’s performance through a comparison of its pre- and post-fine-tuning performance. PPL is emitted in the training job’s Amazon CloudWatch logs. In addition, we look at the output generated by the model in response to specific test prompts.

Evaluation metric on the validation data | Before fine-tuning | After fine-tuning
Perplexity (PPL) | 8.147 | 1.437

The model’s enhanced performance is demonstrated by two experiments:

  • Text generation – We show that the fine-tuned GPT-J generates text that resembles the SEC dataset, indicating that it has learned to emulate the language patterns and conventions of the financial regulatory domain
  • Zero-shot learning – We demonstrate that the fine-tuned GPT-J shows significantly better performance than the non-fine-tuned GPT-J on zero-shot question answering

For the text generation experiment, we use the following inference parameters to achieve our results: max_length of 400 (number of tokens returned), num_return_sequences of 1 (number of response text examples), top_k of 250 (sample only from the 250 most likely next tokens), top_p of 0.8 (sample only from the smallest set of most likely tokens whose cumulative probability reaches 0.8), do_sample set to True (sample rather than always pick the most likely token), and temperature of 1 (a higher temperature gives more randomness in responses). For more details about the parameters, refer to GenerationConfig. The following table compares the fine-tuned GPT-J and non-fine-tuned one on a text completion task.

Inference example 1: This Form 10-K report shows that

Output from GPT-J 6B before fine-tuning:

This Form 10-K report shows that:

The Companys net income attributable to the Company for the year ended December 31, 2016 was $3,923,000, or $0.21 per diluted share, compared to $3,818,000, or $0.21 per diluted share, for the year ended December 31, 2015.

The Companys net income attributable to the Company for the year ended December 31, 2016 was $4,816,000, or $0.28 per diluted share, compared to $4,726,000, or $0.27 per diluted share, for the year ended December 31, 2015.

The Companys net income attributable to the Company for the year ended December 31, 2016 was $5,828,000, or $0.34 per diluted share, compared to $5,716,000, or $0.33 per diluted share, for the year ended December 31, 2015.

On August 21, 2009, the Company filed a Form 10-Q for the quarter ended December 31, 2008.

On August 21, 2009, the Company filed a Form 10-Q for the quarter ended September 30, 2008.

On August 21, 2009, the Company filed a Form 10-Q for the quarter ended March 31, 2009.

Output from GPT-J 6B after fine-tuning:

This Form 10-K report shows that:(1) The registrant’s name, address, and telephone number, are as follows:Amazon.com, Inc. c/o Ernst & Young LLPSeattle, Washington Attention: Corporate Secretary (2) For the purpose of Article III of the Securities Exchange Act of 1934, the registrant’s name and address are as follows:(3) The registrant’s Exchange Act reportable time period is from and includingJanuary 1, 2020 to the present.(4) The registrant’s subjective assessment is that the financial statements, and other financial information included in this report, fairlypresent in all material respects the financial condition, results of operations and cashflows of the registrant as of, and for, the periods presented in this report.(5) The registrant’s management’s assessment is that the financial statements, and other financial information included in this report, fairlypresent in all material respects the financial condition, results of operations and cashflows of the registrant as of, and for, the periods presented in this report./s/ Ernst & Young LLPSeattle, WashingtonJanuary 29, 2020EX-31.1 5 amzn-20201231xex311.htm EX-31.1 DocumentExhibit 31.1CERTIFICATIONSI, Andrew R. Jassy, certify that:1. I have reviewed this Form 10-K of Amazon.com, Inc.;2. Based on my knowledge, this report does not contain any untrue statementof a material fact or omit to state a material fact necessary to make thestatements made, in light of the circumstances under which such statementswere made, not misleading with respect to the period covered by this report;

Inference example 2: We serve consumers through

Output from GPT-J 6B before fine-tuning:

We serve consumers through:

Our Blog

Are you a business owner looking to grow your business? Do you want to create a more effective and efficient business? Do you want to learn how to grow your business? If you answered “yes” to any of these questions, then you need to read this article.

The first step to growing your business is to understand your business. You must be able to answer the following questions about your business:

Who are your customers?

What is your product or service?

What is your unique selling proposition?

What is your marketing strategy?

What is your sales forecast?

What is your sales forecast?

What is your sales forecast?

What is your sales forecast?

What is your sales forecast?

Output from GPT-J 6B after fine-tuning:

We serve consumers through: Amazon.com, a retailer of physical and digital products; Amazon Web Services, which provides technology infrastructure services that enable virtually any type of business; Amazon Enterprise Solutions, which provides companies with custom technology and services to enable better business processes and improve efficiency and productivity; and Audible, a digital audiobook publisher. We serve developers and enterprises of all sizes through AWS, which offers a broad set of global compute, storage, database, and other service offerings. We also manufacture and sell electronic devices. We have organized our operations into two principal segments: North America and International. See Item 8 of Part II, “”Financial Statements and Supplementary Data-Note 12-Segment Information.”” See Item 7 of Part II, “”Management’s Discussion and Analysis of Financial Condition and Results ofOperations-Results of Operations-Supplemental Information”” for supplemental informationabout our net sales.ConsumersWe serve consumers through our online and physical stores and focus on selection,price, and convenience. We design our stores to enable hundreds of millions ofunique products to be sold by us and by third parties across dozens of productcategories. Customers access our offerings through our websites, mobile apps,Alexa, and physically visiting our stores. We also manufacture and sell Kindledevices. We strive to offer our customers the lowest prices possible throughlow everyday product pricing and shipping offers, and to improve ouroperating efficiencies so that we can continue to lower prices for ourcustomers. We also provide easy-to-use functionality, fast and reliablefulfillment, and timely customer service.In addition, we offer Amazon Prime, a membership program that includes unlimitedfree shipping on over 100 million items, access to unlimited streaming of tens ofthousands of movies and TV episodes, and other benefits.

Inference example 3: This year we particularly focus on

Output from GPT-J 6B before fine-tuning:

This year we particularly focus on:

The role of the health care system in the prevention of obesity and diabetes

The role of the health care system in the prevention of obesity and diabetes

The role of the health care system in the prevention of obesity and diabetes

The role of the health care system in the prevention of obesity and diabetes

The role of the health care system in the prevention of obesity and diabetes

The role of the health care system in the prevention of obesity and diabetes

The role of the health care system in the prevention of obesity and diabetes

Output from GPT-J 6B after fine-tuning:

This year we particularly focus on: -Continued international expansion, including the development of newbusinesses in existing international market segments, as well as thedevelopment of new international market segments; -Expanding our customer base, including increasing sales to existingcustomers and attracting new customers; -Developing new business opportunities and entering into commercialagreements, including through licensing and other arrangements, with othercompanies, or through the acquisition of other companies; -Developing technology that provides for enhanced customer service andefficiently delivers our products and services; -Developing new product and service offerings and incorporating these intoexisting and new product and service offerings; -Developing efficient and effective systems for supporting our business;and -Other business and strategic initiatives. We believe that offering low prices to our customers is fundamental to ourfuture success. One way we offer lower prices is through free-shipping offersthat result in a net cost to us in delivering products, and through membershipin Amazon Prime, which provides free-shipping on millions of items andaccess to movies and other content. We also offer other promotions that enableus to turn a lower net cost of sales.We have organized our operations into two principal segments: North Americaand International. See Item 8 of Part II, “”Financial Statements andSupplementary Data-Note 12-Segment Information.”” See Item 7 of Part II,””Management’s Discussion and Analysis of Financial Condition and Results ofOperations-Results of Operations-Supplemental Information”” for supplementalinformation about our net sales.ConsumersWe serve consumers through our retail websites and physical stores and focuson selection, price, and convenience.
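To make the top_k and top_p settings used in the text generation experiment more concrete, the following sketch filters a toy next-token distribution: it keeps the top_k most likely tokens, then keeps the smallest prefix of those whose cumulative probability reaches top_p, and renormalizes. It is a simplified illustration with made-up tokens and probabilities, not the Hugging Face implementation (which operates on logits and differs in details).

```python
def filter_top_k_top_p(probs, top_k, top_p):
    """Return renormalized {token: prob} after top-k, then top-p (nucleus) filtering."""
    # Keep only the top_k most likely tokens, most likely first.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Toy next-token distribution (hypothetical values).
probs = {"sales": 0.5, "growth": 0.25, "costs": 0.15, "weather": 0.07, "banana": 0.03}
candidates = filter_top_k_top_p(probs, top_k=4, top_p=0.8)
print(candidates)
```

With these numbers, top_k=4 drops "banana", and top_p=0.8 then drops "weather" because the first three tokens already cover 90% of the probability mass. Sampling draws the next token from the three remaining candidates.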

For the second experiment, a zero-shot question answering task, we set max_length to a different value for each question (40, 200, and 40, respectively) based on the length of the output we expect. The top_k, top_p, do_sample, and temperature parameters are set to 250, 0.8, False, and 0, respectively. We disable sampling and set temperature to 0 to remove randomness from the output, whereas in the text generation task we set temperature to 1 to make the model more creative. The following table compares the results. For each question, the output from fine-tuned GPT-J is more accurate than the non-fine-tuned one.

Input 1: What is Item 7 of the 10-K SEC filing about?:

Output from GPT-J 6B before fine-tuning:

What is Item 7 of the 10-K SEC filing about?:

Item 7 of the 10-K SEC filing is the following:

“The Company has not received any material adverse

Output from GPT-J 6B after fine-tuning:

Item 7. Management’s Discussion and Analysis of Financial Condition and Results ofOperations Forward-Looking Statements

Input 2: How did the COVID-19 pandemic impact Amazon’s business?:

Output from GPT-J 6B before fine-tuning:

How did the COVID-19 pandemic impact Amazon’s business?:

Amazon’s business is being impacted by the COVID-19 pandemic in a number of ways.

The company is seeing a decline in the number of Prime memberships, which is a big part of its business.

The company is also seeing a decline in the number of Prime memberships.

Amazon is also seeing a decline in the number of orders placed on its website.

The company is also seeing a decline in the number of people who are shopping on its website.

Amazon is also seeing a decline in the number of people who are shopping on its website.

Amazon is also seeing a decline in the number of people who are shopping on its website.

Amazon is also seeing a decline in the number of people who are shopping on its website.

Amazon is also seeing a decline in the number of people who are shopping

Output from GPT-J 6B after fine-tuning:

How did the COVID-19 pandemic impact Amazon’s business?:The COVID-19 pandemic has impacted our business in several ways. First, it has disrupted our supply chain, and it has affected customer demand, which in turn has affected our sales and operating results. Second, it has increased our spending on advertising and marketing, which may not be effective in the long run. Third, it has increased our spending on technology infrastructure and computing capacity, which may not be effective in the long run. Fourth, it has increased our spending on fulfillment and customer service, which may not be effective in the long run. Finally, it has increased our spending on content, which may not be effective in the long run. See Item 8 of Part II, “Financial Statements and Supplementary Data — Note 1 — Description of Business and Accounting Policies.

Input 3: What drives sales growth at Amazon?:

Output from GPT-J 6B before fine-tuning:

Amazon is the world’s largest online retailer. It is also the world’s largest online marketplace. It is also the world’

Output from GPT-J 6B after fine-tuning:

Sales growth at Amazon is driven primarily by increased customer usage, including increased selection, lower prices, and increased convenience, and increased sales by other sellers on our websites.

Clean up

To avoid ongoing charges, delete the SageMaker inference endpoints. You can delete the endpoints via the SageMaker console or from the notebook using the following commands:

# Delete the SageMaker endpoint and the attached resources
finetuned_predictor.delete_model()
finetuned_predictor.delete_endpoint()

Conclusion

JumpStart is a capability in SageMaker that allows you to quickly get started with ML. JumpStart uses open-source, pre-trained models to solve common ML problems like image classification, object detection, text classification, sentence pair classification, and question answering.

In this post, we showed you how to fine-tune and deploy a pre-trained LLM (GPT-J 6B) for text generation based on the SEC filing dataset. We demonstrated how the model transformed into a finance domain expert by undergoing the fine-tuning process on just two annual reports of the company. This fine-tuning enabled the model to generate content with an understanding of financial topics and greater precision. Try out the solution on your own and let us know how it goes in the comments.

Important: This post is for demonstrative purposes only. It is not financial advice and should not be relied on as financial or investment advice. The post used models pre-trained on data obtained from the SEC EDGAR database. You are responsible for complying with EDGAR’s access terms and conditions if you use SEC data.


About the Authors

Dr. Xin Huang is a Senior Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A.

Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Dr. Sanjiv Das is an Amazon Scholar and the Terry Professor of Finance and Data Science at Santa Clara University. He holds post-graduate degrees in Finance (M.Phil and PhD from New York University) and Computer Science (MS from UC Berkeley), and an MBA from the Indian Institute of Management, Ahmedabad. Prior to being an academic, he worked in the derivatives business in the Asia-Pacific region as a Vice President at Citibank. He works on multimodal machine learning in the area of financial applications.

Arun Kumar Lokanatha is a Senior ML Solutions Architect with the Amazon SageMaker Service team. He focuses on helping customers build, train, and migrate ML production workloads to SageMaker at scale. He specializes in deep learning, especially in the area of NLP and CV. Outside of work, he enjoys running and hiking.


Announcing the updated Microsoft OneDrive connector (V2) for Amazon Kendra

Amazon Kendra is an intelligent search service powered by machine learning (ML), enabling organizations to provide relevant information to customers and employees, when they need it.

Amazon Kendra uses ML algorithms to enable users to use natural language queries to search for information scattered across multiple data sources in an enterprise, including commonly used document storage systems like Microsoft OneDrive.

OneDrive is an online cloud storage service that allows you to host your content and have it automatically sync across multiple devices. Amazon Kendra can index document formats like Microsoft OneNote, HTML, PDF, Microsoft Word, Microsoft PowerPoint, Microsoft Excel, Rich Text, JSON, XML, CSV, XSLT, and plain text.

We’re excited to announce that we have updated the OneDrive connector for Amazon Kendra to add even more capabilities. For example, we have added support to search OneNote documents. Additionally, you can now choose to use identity or ACL information to make your searches more granular.

The connector indexes documents together with their access control information, so that search results are limited to only those documents the user is allowed to access. To show the search results based on user access rights and using only the user information, the connector provides an identity crawler to load principal information, such as user and group mappings, into a principal store.

In this post, we demonstrate how to configure multiple data sources in Amazon Kendra to provide a central place to search across your document repository.

Solution overview

For our solution, we demonstrate how to index a OneDrive repository or folder using the Amazon Kendra connector for OneDrive. The solution consists of the following steps:

  1. Create and configure an app on Microsoft Azure Portal and get the authentication credentials.
  2. Create a OneDrive data source via the Amazon Kendra console.
  3. Index the data in the OneDrive repository.
  4. Run a sample query to get the information.
  5. Filter the query by users or groups.

Prerequisites

To try out the Amazon Kendra connector for OneDrive, you need the following:

Configure an Azure application and assign connection permissions

Before we set up the OneDrive data source, we need a few details about the OneDrive repository. Complete the following steps:

  1. Log in to Azure.
  2. After logging in with your account credentials, choose App registrations, then choose New registration.
  3. Give an appropriate name to your application and register the application.
  4. Collect the information about the client ID, tenant ID, and other details of the application.
  5. To get a client secret, choose Add a certificate or secret under Client credentials.
  6. Choose New client secret and provide the proper description and expiry.
  7. Note the client-id, tenant-id, and secret-id values. We use these for authenticating the OAuth2 application.
  8. Navigate to App, choose API permissions in the navigation pane, and choose Add a permission.
  9. Choose Microsoft Graph.
  10. Under Application permissions, enter File in the search bar and under Files, select Files.Read.All.
  11. Choose Add permissions.
  12. Similarly, add the following permissions on the Microsoft Graph option for the application you created:
    1. Group.Read.All
    2. Notes.Read.All

On completion, the API permissions will look like the following screenshot.
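
Before handing the credentials to Amazon Kendra, it can be useful to confirm that the client ID, tenant ID, and client secret actually work. The following is a minimal sketch, not part of the connector setup, that requests an app-only token from the Microsoft identity platform using the OAuth 2.0 client credentials grant. The function names are illustrative, and the placeholder values stand in for the details you noted earlier.

```python
import json
import urllib.parse
import urllib.request

# Scope that requests the application permissions already granted to the app.
GRAPH_SCOPE = "https://graph.microsoft.com/.default"


def build_token_request(tenant_id, client_id, client_secret):
    """Build (but do not send) the client credentials token request."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": GRAPH_SCOPE,
    }).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def fetch_access_token(tenant_id, client_id, client_secret):
    """Send the request and return the access token string (network call)."""
    request = build_token_request(tenant_id, client_id, client_secret)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["access_token"]
```

If fetch_access_token returns a token, the app registration and secret are valid. A failure at this stage usually points to a wrong tenant ID, an expired secret, or a secret description copied in place of the secret value.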

Configure the Amazon Kendra connector for OneDrive

To configure the Amazon Kendra connector, complete the following steps:

  1. On the Amazon Kendra console, choose Create an Index.
  2. For Index name, enter a name for the index (for example, my-onedrive-index).
  3. Enter an optional description.
  4. Choose Create a new role.
  5. For Role name, enter an IAM role name.
  6. Configure optional encryption settings and tags.
  7. Choose Next.
  8. In the Configure user access control section, select Yes under Access control settings.
  9. For Token type, choose JSON on the drop-down menu.
  10. Leave the remaining values as their default values.
  11. Choose Next.
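
For readers who prefer to script the index, the console choices above, together with the Developer edition selected later in this walkthrough, map roughly onto the parameters of the Amazon Kendra CreateIndex API. The following is a sketch under assumptions: the helper names are hypothetical, the role ARN is a placeholder, and the token attribute field names (username and groups) are assumed values that you should match to the fields in your own JSON token.

```python
def build_create_index_args(name, role_arn, description=""):
    """Assemble CreateIndex parameters with JSON-token access control."""
    args = {
        "Name": name,
        "Edition": "DEVELOPER_EDITION",
        "RoleArn": role_arn,
        # Filter results using the user token supplied at query time.
        "UserContextPolicy": "USER_TOKEN",
        "UserTokenConfigurations": [{
            "JsonTokenTypeConfiguration": {
                # Assumed field names inside the JSON token.
                "UserNameAttributeField": "username",
                "GroupAttributeField": "groups",
            }
        }],
    }
    if description:
        args["Description"] = description
    return args


def create_index(name, role_arn, description=""):
    """Create the index with boto3 (requires AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    kendra = boto3.client("kendra")
    response = kendra.create_index(
        **build_create_index_args(name, role_arn, description)
    )
    return response["Id"]
```

Keeping the parameter builder separate from the API call makes it easy to review the settings before anything is created.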

Before we move to the next configuration step, we need to provide Amazon Kendra with a role that has the permissions necessary for connecting to the site. These include permission to get and decrypt the AWS Secrets Manager secret that contains the application ID and secret key necessary to connect to the OneDrive site.

  1. Open another tab for the AWS account, and on the IAM console, navigate to the role that you created earlier (for example, AmazonKendra-us-west-2-onedrive).
  2. Choose Add permissions and Create inline policy.
  3. For Service, choose Kendra.
  4. For Actions, choose Write and specify BatchPutDocument.
  5. For Resources, choose All resources.
  6. Choose Review policy.
  7. For Name, enter a name (for example, BatchPutPolicy).
  8. Choose Create policy.
  9. Add this policy to the role you created.
  10. Additionally, attach the SecretsManagerReadWrite AWS managed policy to the role.
  11. Return to the Amazon Kendra tab.
  12. Select Developer edition and choose Create.

This creates and propagates the IAM role and then creates the Amazon Kendra index, which can take up to 30 minutes.
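
The IAM steps above can also be expressed as a policy document. The sketch below mirrors the console choices (the BatchPutDocument action on all resources, plus the SecretsManagerReadWrite managed policy); the function names are hypothetical, and the role name you pass in is whichever role you created for the index.

```python
import json

SECRETS_MANAGER_POLICY_ARN = "arn:aws:iam::aws:policy/SecretsManagerReadWrite"


def build_batch_put_policy():
    """Inline policy equivalent to the console selections above."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "kendra:BatchPutDocument",
            "Resource": "*",
        }],
    }


def attach_policies(role_name, policy_name="BatchPutPolicy"):
    """Attach both policies to the role with boto3 (requires AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    iam = boto3.client("iam")
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(build_batch_put_policy()),
    )
    iam.attach_role_policy(
        RoleName=role_name,
        PolicyArn=SECRETS_MANAGER_POLICY_ARN,
    )
```

Both the All resources choice and the SecretsManagerReadWrite policy are broad. For anything beyond a demo, consider scoping the resource to the index ARN and granting read access only to the one secret the connector uses.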

  1. Return to the Amazon Kendra console, choose Data sources in the navigation pane, and choose Add data source.
  2. Under OneDrive connector V2.0, choose Add connector.
  3. For Data source name, enter a name (for example, my-onedrive).
  4. Enter an optional description.
  5. Choose Next.
  6. For OneDrive Tenant ID, enter the tenant ID you gathered earlier.
  7. For Configure VPC and security group, leave the default (No VPC).
  8. Keep Identity crawler is on selected. This imports identity information into the index.
  9. For IAM role, choose Create a new role.
  10. Enter a role name, such as AmazonKendra-us-west-2-onedrive, then choose Next.
  11. In the Authentication section, choose Create and add a secret.
  12. Create a secret with clientId and clientSecret as keys.
  13. Add their respective values with the information you collected earlier.
  14. Choose Next.
  15. In the Configure sync settings section, add the OneDrive users whose documents you want to index.
  16. Select the sync mode for the index. For this post, we select New, modified or deleted content sync.
  17. Choose the frequency of indexing as Run on demand, then choose Next.
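
The secret created in the Authentication step is a small JSON document with the two keys named above. As a rough sketch, the same secret could be created outside the console as follows; the helper names and the secret name are placeholders chosen for illustration.

```python
import json


def build_secret_string(client_id, client_secret):
    """Serialize the credentials using the keys the connector expects."""
    return json.dumps({"clientId": client_id, "clientSecret": client_secret})


def create_onedrive_secret(secret_name, client_id, client_secret):
    """Store the credentials in AWS Secrets Manager (requires AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    secrets = boto3.client("secretsmanager")
    response = secrets.create_secret(
        Name=secret_name,
        SecretString=build_secret_string(client_id, client_secret),
    )
    return response["ARN"]
```

The key names are case-sensitive, so clientId and clientSecret should be written exactly as shown.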

Field mappings allow you to set the searchability and relevance of fields. For example, the lastUpdatedAt field can be used to sort documents or boost their ranking based on how recently they were updated.

  1. Keep all the defaults in the Set field mappings section and choose Next.
  2. On the review page, choose Add data source.
  3. Choose Sync now.

The sync can take up to 30 minutes to complete.
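
Because the sync runs on demand, it can also be started and monitored from a script. The following sketch takes a boto3 Kendra client as an argument, starts a sync job, and polls its status until it leaves the in-progress states. The function name, the polling interval, and the "TIMED_OUT" return value are illustrative choices, not part of the Amazon Kendra API.

```python
import time

# Sync job states that mean the job is still running.
IN_PROGRESS = {"SYNCING", "SYNCING_INDEXING", "STOPPING"}


def run_sync(kendra, index_id, data_source_id, poll_seconds=60, max_polls=45):
    """Start a sync job and poll until it finishes or the poll budget runs out."""
    execution_id = kendra.start_data_source_sync_job(
        Id=data_source_id, IndexId=index_id
    )["ExecutionId"]

    for _ in range(max_polls):
        history = kendra.list_data_source_sync_jobs(
            Id=data_source_id, IndexId=index_id
        ).get("History", [])
        # Find the job we started; it may not be listed immediately.
        status = next(
            (job["Status"] for job in history
             if job.get("ExecutionId") == execution_id),
            None,
        )
        if status is not None and status not in IN_PROGRESS:
            return status  # for example SUCCEEDED, FAILED, or INCOMPLETE
        time.sleep(poll_seconds)

    return "TIMED_OUT"  # local sentinel, not a Kendra status
```

A typical call would be run_sync(boto3.client("kendra"), index_id, data_source_id), where both IDs come from the index and data source created earlier.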

Test the solution

Now that you have indexed the content from OneDrive, you can test it by querying the index.

  1. Go to your index on the Amazon Kendra console and choose Search indexed content in the navigation pane.
  2. Enter a search term and press Enter.

Notice that without a token, the ACLs prevent a search result from being returned.

  1. Expand Test query with an access token and choose Apply token.
  2. Enter the appropriate token with a user who has permissions to read the file and choose Apply.
  3. Search for information present in OneDrive again.

You can verify that Amazon Kendra presents the ranked results as expected.
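
The same test can be run through the Amazon Kendra Query API, which accepts the user's identity in the UserContext parameter. The sketch below builds the query arguments with an optional token, user ID, or group list, and returns the title and URI of each result. The helper names are hypothetical, and the token is whatever JSON token your identity setup issues for the user.

```python
def build_query_args(index_id, query_text, token=None, user_id=None, groups=None):
    """Assemble Query parameters, adding UserContext only when identity is given."""
    args = {"IndexId": index_id, "QueryText": query_text}
    user_context = {}
    if token:
        user_context["Token"] = token
    if user_id:
        user_context["UserId"] = user_id
    if groups:
        user_context["Groups"] = list(groups)
    if user_context:
        args["UserContext"] = user_context
    return args


def search(kendra, index_id, query_text, **identity):
    """Run the query with a boto3 Kendra client and return (title, URI) pairs."""
    response = kendra.query(**build_query_args(index_id, query_text, **identity))
    return [
        (item.get("DocumentTitle", {}).get("Text", ""), item.get("DocumentURI", ""))
        for item in response.get("ResultItems", [])
    ]
```

Calling search without any identity arguments corresponds to the first console test, where the ACLs filter out the protected documents. Passing token, or user_id and groups, corresponds to the second.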

Congratulations, you have configured Amazon Kendra to index and search documents in OneDrive and to control access to them using ACLs.

Conclusion

With the Microsoft OneDrive V2 connector for Amazon Kendra, organizations can tap into commonly used enterprise document stores, securely using intelligent search powered by Amazon Kendra. You can enhance the search experience by integrating the data source with the Custom Document Enrichment (CDE) capability in Amazon Kendra to perform additional attribute mapping logic and even custom content transformation during ingestion.


About the authors

Pravinchandra Varma is a Senior Customer Delivery Architect with the AWS Professional Services team and is passionate about applications of machine learning and artificial intelligence services.

Supratim Barat is a Software Developer Engineer with the AWS Kendra Yellowbadge Team and is a blockchain and cyber security enthusiast.
