In the last decade, we’ve seen learning-based systems provide transformative solutions for a wide range of perception and reasoning problems, from recognizing objects in images to recognizing and translating human speech. Recent progress in deep reinforcement learning (i.e. integrating deep neural networks into reinforcement learning systems) suggests that the same kind of success could be realized in automated decision making domains. If fruitful, this line of work could allow learning-based systems to tackle active control tasks, such as robotics and autonomous driving, alongside the passive perception tasks to which they have already been successfully applied.
While deep reinforcement learning methods – like Soft Actor-Critic – can learn impressive motor skills, they are difficult to train on large, diverse data that does not come from the target environment. In contrast, the success of deep networks in fields like computer vision was arguably predicated just as much on large datasets, such as ImageNet, as it was on large neural network architectures. This suggests that applying data-driven methods to robotics will require not just the development of strong reinforcement learning methods, but also access to large and diverse datasets for robotics. Not only can large datasets enable models that generalize effectively, but they can also be used to pre-train models that can then be adapted to more specialized tasks using much more modest datasets. Indeed, “ImageNet pre-training” has become a default approach for tackling diverse tasks with small or medium datasets – like 3D building reconstruction. Can the same kind of approach be adopted to enable broad generalization and transfer in active control domains, such as robotics?
Unfortunately, the design and adoption of large datasets in reinforcement learning and robotics has proven challenging. Since every robotics lab has its own hardware and experimental set-up, it is not apparent how to move towards an “ImageNet-scale” dataset for robotics that is useful for the entire research community. Hence, we propose to collect data across multiple different settings, including from varying camera viewpoints, varying environments, and even varying robot platforms. Motivated by the success of large-scale data-driven learning, we created RoboNet, an extensible and diverse dataset of robot interaction collected across four different research labs. The collaborative nature of this work allows us to easily capture diverse data in various lab settings across a wide variety of objects, robotic hardware, and camera viewpoints. Finally, we find that pre-training on RoboNet offers substantial performance gains compared to training from scratch in entirely new environments.
Collecting RoboNet
RoboNet consists of 15 million video frames, collected by different robots interacting with different objects in a table-top setting. Every frame includes the image recorded by the robot’s camera, arm pose, force sensor readings, and gripper state. The collection environment, including the camera view, the appearance of the table or bin, and the objects in front of the robot, is varied between trials. Since collection is entirely autonomous, large amounts of data can be cheaply collected across multiple institutions. A sample of RoboNet along with data statistics is shown below:
How can we use RoboNet?
After collecting a diverse dataset, we experimentally investigate how it can be used to enable general skill learning that transfers to new environments. First, we pre-train visual dynamics models on a subset of data from RoboNet, and then fine-tune them to work in an unseen test environment using a small amount of new data. The constructed test environments (one of which is visualized below) all include different lab settings, new cameras and viewpoints, held-out robots, and novel objects purchased after data collection concluded.
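Concretely, the fine-tuning step can be as simple as continuing to train the pre-trained video prediction model on the handful of trajectories gathered in the new environment. The sketch below is a minimal illustration of that idea, assuming a PyTorch-style video prediction model and a small list of (video, actions) batches from the test environment; the actual architectures and losses follow the visual foresight line of work.

```python
import torch

def finetune(model, new_env_batches, steps=2000, lr=1e-4):
    """Continue training a pre-trained video prediction model on a small
    amount of data from the new environment (illustrative sketch only)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for step in range(steps):
        video, actions = new_env_batches[step % len(new_env_batches)]
        # Condition on the first two frames and predict the remaining ones.
        predicted = model(video[:, :2], actions)
        loss = torch.mean((predicted - video[:, 2:]) ** 2)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model
```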
After tuning, we deploy the learned dynamics models in the test environment to perform control tasks – like picking and placing objects – using the visual foresight model based reinforcement learning algorithm. Below are example control tasks executed in various test environments.
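Under the hood, visual foresight plans by sampling candidate action sequences, predicting their visual outcomes with the learned dynamics model, and executing the first action of the best-scoring sequence. Here is a rough, hypothetical sketch of that planning loop, where `video_model` and the simple pixel-distance cost are stand-ins for the actual prediction model and the designed cost used in the method:

```python
import numpy as np

def plan_action(video_model, context_frames, goal_image,
                horizon=10, num_samples=200, action_dim=4):
    """Sampling-based planning with a learned visual dynamics model (sketch).

    video_model(context_frames, candidate_actions) is assumed to return the
    predicted future frames for each candidate action sequence."""
    # Sample candidate action sequences from a broad Gaussian.
    candidates = np.random.randn(num_samples, horizon, action_dim)

    # Predict the visual outcome of every candidate sequence.
    predicted = video_model(context_frames, candidates)  # (num_samples, horizon, H, W, 3)

    # Score candidates by how close their final predicted frame is to the goal image.
    costs = np.mean(
        (predicted[:, -1].astype(float) - goal_image.astype(float)) ** 2,
        axis=(1, 2, 3))

    # Execute only the first action of the best sequence, then re-plan (MPC).
    return candidates[np.argmin(costs)][0]
```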
We can now numerically evaluate whether our pre-trained controllers can pick up skills in new environments faster than randomly initialized ones. In each environment, we use a standard set of benchmark tasks to compare the performance of our pre-trained controller against that of a model trained only on data from the new environment. The results show that the fine-tuned model is ~4x more likely to complete the benchmark task than the one trained without RoboNet. Impressively, the pre-trained models can even slightly outperform models trained from scratch on significantly (5-20x) more data from the test environment. This suggests that transfer from RoboNet does indeed offer large performance gains compared to training from scratch!
We compare the performance of fine-tuned models against their counterparts trained from scratch in two different test environments (with different robot platforms).
Clearly fine-tuning is better than training from scratch, but is training on all of RoboNet always the best way to go? To test this, we compare pre-training on various subsets of RoboNet versus training from scratch. As seen before, the model pre-trained on all of RoboNet (excluding the Baxter platform) performs substantially better than the randomly initialized model. However, the “RoboNet pre-trained” model is outperformed by a model trained on a subset of RoboNet data collected only on the Sawyer robot – the single-arm variant of Baxter.
Models pre-trained on various subsets of RoboNet are compared to one trained from scratch in an unseen (during pre-training) Baxter control environment
The similarities between the Baxter and Sawyer likely partly explain our results, but why does simply adding data to the training set hurt performance after fine-tuning? We theorize that this effect occurs due to model under-fitting. In other words, RoboNet is an extremely challenging dataset for a visual dynamics model, and imperfections in the model’s predictions result in poor control performance. However, models with more parameters tend to be more powerful, and thus make better predictions on RoboNet (visualized below). Note that increasing the number of parameters greatly improves prediction quality, but even large models with 500M parameters (middle column in the videos below) still produce quite blurry predictions. This suggests ample room for improvement, and we hope that the development of newer, more powerful models will translate to better control performance in the future.
We compare video prediction models of various size trained on RoboNet. A 75M parameter model (right-most column) generates significantly blurrier predictions than a large model with 500M parameters (center column).
Final Thoughts
This work takes the first step towards creating learned robotic agents that can operate in a wide range of environments and across different hardware. While our experiments primarily explore model-based reinforcement learning, we hope that RoboNet will inspire the broader robotics and reinforcement learning communities to investigate how to scale model-based or model-free RL algorithms to meet the complexity and diversity of the real world.
Since the dataset is extensible, we encourage other researchers to contribute the data generated from their experiments back into RoboNet. After all, any data containing robot telemetry and video could be useful to someone else, so long as it contains the right documentation. In the long term, we believe this process will iteratively strengthen the dataset, and thus allow our algorithms that use it to achieve greater levels of generalization across tasks, environments, robots, and experimental set-ups.
Finally, I would like to thank Sergey Levine, Chelsea Finn, and Frederik Ebert for their helpful feedback on this post, as well as the editors of the BAIR, SAIL, and CMU MLD blogs.
This blog post was based on the following paper: RoboNet: Large-Scale Multi-Robot Learning. S. Dasari, F. Ebert, S. Tian, S. Nair, B. Bucher, K. Schmeckpeper, S. Singh, S. Levine, C. Finn. In Conference on Robot Learning, 2019. (pdf)
For the nearly one million American adults living with physical disabilities, taking a bite of food or pouring a glass of water presents a significant challenge. Assistive robots—such as wheelchair-mounted robotic arms—promise to solve this problem. Users control these robots by interacting with a joystick, guiding the robot arm to complete everyday tasks without relying on a human caregiver. Unfortunately, the very dexterity that makes these arms useful also renders them difficult for users to control. Our insight is that we can make assistive robots easier for humans to control by learning an intuitive and meaningful control mapping that translates simple joystick motions into complex robot behavior. In this blog post, we describe our self-supervised algorithm for learning the latent space, and summarize the results of user studies that test our approach on cooking and eating tasks. You can find a more in-depth description in this paper and the accompanying video.
Motivation. Almost 10% of all American adults living with physical disabilities need assistance when eating1. This percentage increases for going to the bathroom (14%), getting around the home (16%), or putting on clothes (23%). Wheelchair-mounted robotic arms can help users complete some of these everyday tasks.
Unfortunately, because robotic arms are hard for humans to control, even simple tasks remain challenging to complete. Consider the task shown in the video below:
The user is trying to control their assistive robot to grab some food. In the process, they must precisely position the robot’s gripper next to the container, and then carefully guide this container up and out of the shelf. The human’s input is—by necessity—low-dimensional. But the robot arm is high-dimensional: it has many degrees-of-freedom (or DoFs), and the user needs to coordinate all of these interconnected DoFs to complete the task.
In practice, controlling assistive robots can be quite difficult due to the unintuitive mapping from low-dimensional human inputs to high-dimensional robot actions. Look again at the joystick interface in the above video—do you notice how the person keeps tapping the side? They are doing this to toggle between control modes. Only after the person finds the right control mode are they able to make the robot take the action that they intended. And, as shown, often the person has to switch control modes multiple times to complete a simple task. A recent study2 found that able-bodied users spent 20% of their time changing the robot’s control mode! The goal of our research is to address this problem and enable seamless control of assistive robots.
Our Vision. We envision a setting where the assistive robot has access to task-related demonstrations. These demonstrations could be provided by a caregiver, the user, or even be collected on another robot. What’s important is that the demonstrations show the robot which high-dimensional actions it should take in relevant situations. For example, here we provide kinesthetic demonstrations of high-dimensional reaching and pouring motions:
Once the robot has access to these demonstrations, it will learn a low-dimensional embedding that interpolates between different demonstrated behaviors and enables the user to guide the arm along task-relevant motions. The end-user then leverages the learned embedding to make the robot perform their desired tasks without switching modes. Returning to our example, here the robot learns that one joystick DoF controls the arm’s reaching motion, and the other moves the arm along a pouring motion:
Typically, completing these motions would require multiple mode switches (e.g., intermittently changing the robot’s position and orientation). But now—since the robot has learned a task-related embedding—the user can complete reaching and pouring with just a single joystick (and no mode switching)! In practice, this embedding captures a continuous set of behaviors, and allows the person to control and interpolate between these robot motions by moving the joystick.
Insight and Contributions. Inspired by the difficulties that today’s users face when controlling assistive robotic arms, we propose an approach that learns teleoperation strategies directly from data. Our insight is that:
High-dimensional robot actions can often be embedded into intuitive, human-controllable, and low-dimensional latent spaces
You can think of a latent space as a manifold that captures the most important aspects of your data (e.g., if your data is a matrix, then the latent space could be spanned by the first few eigenvectors of that matrix). In what follows, we first formalize a list of properties that intuitive and human-controllable latent spaces must satisfy, and evaluate how different autoencoder models capture these properties. Next, we perform two user studies where we compare our learning method to other state-of-the-art approaches, including shared autonomy and mode switching.
Learning User-Friendly Latent Spaces
Here we formalize the properties that a user-friendly latent space should have, and then describe models that can capture these properties.
Notation. Let $s$ be the robot’s current state. In our experiments, $s$ contained the configuration of the robot’s arm and the position of objects in the workspace; but the state can also consist of other types of observations, such as camera images. The robot takes high-dimensional actions $a$, and these actions cause the robot to change states according to the transition function $s' = f(s, a)$. In practice, $a$ often corresponds to the joint velocities of the robot arm.
We assume that the robot has access to a dataset of task-related demonstrations. Formally, this dataset contains a set of state-action pairs: $\mathcal{D} = \{(s, a)\}$. Using the dataset, the robot attempts to learn a latent action space $\mathcal{Z}$ that is of lower dimension than the original action space. In our experiments, $\mathcal{Z}$ was the same dimension as the joystick interface so that users could input latent actions $z \in \mathcal{Z}$. The robot also learns a decoder $\phi$ that takes as input the latent action $z$ and the robot’s current state $s$, and outputs the high-dimensional robot action $a = \phi(z, s)$.
User-Friendly Properties. Using this notation, we can formulate the properties that the learned latent space should have. We focus on three properties: controllability, consistency, and scaling.
Controllability. Let $s_i$ and $s_j$ be a pair of states from the dataset $\mathcal{D}$, and let the robot start in state $s_i$. We say that a latent space is controllable if, for every such pair of states, there exists a sequence of latent actions $z^1, z^2, \ldots, z^T$ whose decoded actions drive the robot from $s_i$ to $s_j$. In other words, a latent space is controllable if it can move the robot between any start and goal states within the dataset.
Consistency. Let $d(\cdot, \cdot)$ be a task-dependent metric that captures similarity. For instance, in a pouring task, $d$ could measure the orientation of the robot’s gripper. We say that a latent space is consistent if, for two states $s_1$ and $s_2$ that are nearby, the change caused by the same latent action $z$ is similar: $d\big(f(s_1, \phi(z, s_1)),\, f(s_2, \phi(z, s_2))\big)$ is small. Put another way, a latent space is consistent if the same latent action causes the robot to behave similarly in nearby states.
Scaling. Let $s' = f(s, \phi(z, s))$ be the next state that the robot visits after taking latent action $z$ in the current state $s$. We say that a latent space scales if the distance between $s'$ and $s$ increases to infinity as the magnitude $\|z\|$ increases to infinity. Intuitively, this means that larger latent actions should cause bigger changes in the robot’s state.
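These properties can also be probed empirically once a decoder has been trained. As a toy illustration, the snippet below checks the scaling property by decoding progressively larger latent actions and measuring how far the resulting actions move the robot; `decoder` and `dynamics` are placeholder callables for the learned decoder $\phi$ and the (simulated) transition function $f$, and analogous probes can be written for consistency and controllability.

```python
import numpy as np

def check_scaling(decoder, dynamics, state, magnitudes=(0.1, 1.0, 10.0)):
    """Probe the scaling property: larger latent actions should move the
    robot farther from its current state (illustrative sketch)."""
    direction = np.array([1.0])          # a fixed 1-D latent direction
    distances = []
    for m in magnitudes:
        action = decoder(m * direction, state)   # decode the latent action
        next_state = dynamics(state, action)     # simulate one step
        distances.append(np.linalg.norm(next_state - state))
    # For a latent space that scales, these distances should grow with m.
    return distances
```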
Models. Now that we have introduced the properties that a user-friendly latent space should have, we can explore how different embeddings capture these properties. It may be helpful for readers to think about principal component analysis as a simple way to find linear embeddings. Building on this idea, we utilize a more general class of autoencoders, which learn nonlinear low-dimensional embeddings in a self-supervised manner.3 Consider the model shown below:
The robot learns the latent space using this model structure. Here $s$ and $a$ are state-action pairs sampled from the demonstration dataset $\mathcal{D}$, and the model encodes each state-action pair into a latent action $z$. Then, using $z$ and the current state $s$, the robot decodes the latent action to reconstruct a high-dimensional action $\hat{a}$. Ideally, $\hat{a}$ will perfectly match $a$, so that the robot correctly reconstructs the original action.
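As a concrete (and simplified) picture of this training procedure, the sketch below implements the encode-then-decode structure with small fully-connected networks in PyTorch; the dimensions and architecture are illustrative, not the ones used in our experiments.

```python
import torch
import torch.nn as nn

class LatentActionAutoencoder(nn.Module):
    """Encode a (state, action) pair into a latent action z, then decode
    (z, state) back into a high-dimensional action (illustrative sketch)."""

    def __init__(self, state_dim=12, action_dim=6, latent_dim=2, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim))

    def forward(self, state, action):
        z = self.encoder(torch.cat([state, action], dim=-1))
        action_hat = self.decoder(torch.cat([z, state], dim=-1))
        return action_hat, z

# One reconstruction step on a (stand-in) batch of demonstrated state-action pairs.
model = LatentActionAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
states, actions = torch.randn(32, 12), torch.randn(32, 6)
action_hat, _ = model(states, actions)
loss = torch.mean((action_hat - actions) ** 2)   # reconstruction error
optimizer.zero_grad()
loss.backward()
optimizer.step()
```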
Of course, when the end-user controls their assistive robot, the robot no longer knows exactly what action it should perform. Instead, the robot uses the latent space that it has learned to predict the human’s intention:
Here $z$ is the person’s input on the joystick, and $s$ is the state that the robot currently sees (e.g., its current configuration and the position of objects within the workspace). Using this information, the robot reconstructs a high-dimensional action $\hat{a} = \phi(z, s)$. The robot then uses this reconstructed action to move the assistive arm.
State Conditioning. We want to draw attention to one particularly important part of these models. Imagine that you are using a joystick to control your assistive robot, and the assistive robot is holding a glass of water. Within this context, you might expect for one joystick DoF to pour the water. But now imagine a different context: the robot is holding a fork to help you eat. Here it no longer makes sense for the joystick to pour—instead, the robot should use the fork to pick up morsels of food.
Hence, the meaning of the user’s joystick input (pouring, picking up) often depends on the current context (holding glass, using fork). So that the robot can associate meanings with latent actions, we condition the interpretation of the latent action on the robot’s current state. Look again at the models shown above: during both training and control, the robot reconstructs the high-dimensional action based on both $z$ and $s$.
Because recognizing the current context is crucial for correctly interpreting the human’s input, we train models that reconstruct the robot action based on both the latent input and the robot state. More specifically, we hypothesize that conditional variational autoencoders (cVAEs) will capture the meaning of the user’s input while also learning a consistent and scalable latent space. Conditional variational autoencoders are like typical autoencoders, but with two additional tricks: (1) the latent space is normalized into a consistent range, and (2) the decoder depends on both $z$ and $s$. The model we looked at above is actually an example of a cVAE! Putting controllability, consistency, and scaling together—while recognizing that meaning depends on context—we argue that conditional variational autoencoders are well suited to learn user-friendly latent spaces.
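Relative to the plain autoencoder sketch above, the cVAE changes two things: the encoder outputs a distribution over latent actions, and a KL term keeps that distribution close to a unit Gaussian so the latent space stays in a normalized range. A minimal sketch of the resulting training loss (hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

class ConditionalVAE(nn.Module):
    """cVAE sketch: the encoder outputs a mean and log-variance over the latent
    action, and the decoder is conditioned on both z and the state."""

    def __init__(self, state_dim=12, action_dim=6, latent_dim=2, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim))        # mean and log-variance
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim))

    def loss(self, state, action, beta=0.01):
        mu, log_var = self.encoder(torch.cat([state, action], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)       # reparameterization trick
        action_hat = self.decoder(torch.cat([z, state], dim=-1))
        recon = torch.mean((action_hat - action) ** 2)                  # reconstruct the action
        kl = -0.5 * torch.mean(1 + log_var - mu ** 2 - log_var.exp())   # keep z normalized
        return recon + beta * kl
```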
Algorithm. Our approach to learning and leveraging these embeddings is summarized below:
Using a dataset of state-action pairs that were collected offline, the robot trains an autoencoder (e.g., a cVAE) to best reconstruct the actions from that dataset. Next, the robot aligns its learned latent space with the joystick DoFs (e.g., setting up/down on the joystick to correspond to pouring/straightening the glass). In our experiments, we manually performed this alignment, but it is also possible for the robot to learn this alignment by querying the user. With these steps completed, the robot is ready for online control! At each timestep that the person interacts with the robot, their joystick inputs are treated as latent actions $z$, and the robot uses the learned decoder to reconstruct high-dimensional actions.
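At control time, the online loop is short: read the joystick, treat it as a latent action, condition the decoder on the current state, and command the decoded action. A minimal sketch, where `decoder`, `get_joystick`, `get_state`, and `send_action` are assumed interfaces (for instance, `decoder` could be the trained decoder from the sketches above):

```python
import time
import numpy as np
import torch

def teleoperation_loop(decoder, get_joystick, get_state, send_action,
                       rate_hz=10, duration_s=60):
    """Online control sketch: joystick input -> latent action z -> decoded
    high-dimensional robot action (e.g., joint velocities)."""
    for _ in range(int(duration_s * rate_hz)):
        z = np.asarray(get_joystick())        # e.g., a 2-DoF input in [-1, 1]
        s = np.asarray(get_state())           # robot configuration + object positions
        with torch.no_grad():
            inp = torch.as_tensor(np.concatenate([z, s]), dtype=torch.float32)
            action = decoder(inp).numpy()     # reconstructed high-dimensional action
        send_action(action)
        time.sleep(1.0 / rate_hz)
```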
Simulated Example. To demonstrate that the conditional variational autoencoder (cVAE) model we described does capture our desired properties, let’s look at a simulated example. In this example, a planar robotic arm with five joints is trying to move its end-effector along a sine wave. Although the robot’s action is 5-DoF, we embed it into a 1-DoF latent space. Ideally, pressing left on the joystick should cause the robot to move left along the sine wave, and pressing right on the joystick should cause the robot to move right along the sine wave. We train the latent space with a total of 1000 state-action pairs, where each state-action pair noisily moved the robot along the sine wave.
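To make the setup concrete, here is a hypothetical recreation of how such demonstrations could be generated: a Jacobian pseudo-inverse controller nudges the end-effector rightward along $y = \sin(x)$ while Gaussian noise is added to the joint-velocity actions. Link lengths, step sizes, and noise scales are invented for illustration and are not the values used in our simulation.

```python
import numpy as np

LINK, DOFS = 0.5, 5

def fk(theta):
    """End-effector (x, y) position of a planar arm with equal-length links."""
    angles = np.cumsum(theta)
    return LINK * np.array([np.cos(angles).sum(), np.sin(angles).sum()])

def jacobian(theta):
    angles = np.cumsum(theta)
    J = np.zeros((2, DOFS))
    for i in range(DOFS):
        J[0, i] = -LINK * np.sin(angles[i:]).sum()
        J[1, i] = LINK * np.cos(angles[i:]).sum()
    return J

def collect_sine_demos(num_pairs=1000, step=0.03, noise=0.02, reset_every=50):
    """Noisy state-action pairs that move the end-effector along y = sin(x)."""
    data, theta = [], np.full(DOFS, 0.3)
    for t in range(num_pairs):
        if t % reset_every == 0:              # reset before leaving the workspace
            theta = np.full(DOFS, 0.3)
        x, _ = fk(theta)
        target = np.array([x + step, np.sin(x + step)])   # next point on the wave
        v = target - fk(theta)
        v *= min(1.0, step / np.linalg.norm(v))           # cap the end-effector step
        action = np.linalg.pinv(jacobian(theta)) @ v      # joint velocities
        action += noise * np.random.randn(DOFS)           # demonstration noise
        data.append((theta.copy(), action.copy()))
        theta = theta + action                            # integrate the action
    return data
```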
In the figure above, you can see how latent actions control the robot at three different states along the sine wave. At each state we apply five different latent actions, spanning negative to positive values. What’s interesting is that the learned latent space is consistent: at each of the three states, negative latent actions cause the robot to move left along the sine wave, and positive latent actions cause the robot to move right along the sine wave. These actions also scale: larger inputs cause greater movement.
So the conditional variational autoencoder learns a consistent and scalable mapping—but is it also controllable? And do we actually need state conditioning to complete this simple task? Below we compare the cVAE (shown in orange) to a variational autoencoder (VAE, shown in gray). The only difference between these two models is that the variational autoencoder does not consider the robot’s current state when decoding the user’s latent input.
Both robots start on the left in a state slightly off of the sine wave, and at each timestep we apply the same constant latent action. As you can see, only the state-conditioned model (the cVAE) correctly follows the sine wave! We similarly observe that the state-conditioned model is more controllable when looking at 1000 other example simulations. In each, we randomly selected the start and goal states from the dataset $\mathcal{D}$. Across all of these simulations, the state-conditioned cVAE has an average error of 0.1 units between its final state and the goal state. By contrast, the VAE ends up 0.95 units away from the goal—even worse than the principal component analysis baseline (which is 0.9 units from the goal).
Viewed together, these simulated results suggest that the model structure we described above (a conditional variational autoencoder) produces a controllable, consistent, and scalable latent space. These properties are desirable in user-friendly latent spaces, since they enable the human to perform tasks easily and intuitively.
Leveraging Learned Latent Spaces
We conducted two user studies where participants teleoperated a robotic arm using a joystick. In the first study, we compared our proposed approach to shared autonomy when the robot has a discrete set of possible goals. In the second study, we compared our approach to mode switching when the robot has a continuous set of possible goals. We also asked participants for their subjective feedback about the learned latent space—was it actually user-friendly?
Discrete Goals: Latent Actions vs. Shared Autonomy
Imagine that you’re working with the assistive robot to grab food from your plate. Here we placed three marshmallows on a table in front of the user, and the person needs to make the robot grab one of these marshmallows using their joystick.
Importantly, the robot does not know which marshmallow the human wants! Ideally, the robot will make this task easier by learning a simple mapping between the person’s inputs and their desired marshmallow.
Shared Autonomy. As a baseline, we compared our method to shared autonomy. Within shared autonomy the robot maintains a belief (i.e., a probability distribution) over the possible goals, and updates this belief based on the user’s inputs4. As the robot becomes more confident about which discrete goal the human wants to reach, it provides increased assistance to move towards that goal; however, the robot does not directly learn an embedding between its actions and the human’s inputs.
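For intuition, the core of such a shared autonomy baseline can be sketched as a Bayesian belief update followed by a blended action. The snippet below uses a simple Boltzmann-style observation model (the user's input is more likely when it points toward their goal), which is a common simplification of the hindsight optimization formulation cited above; the goal directions, rationality constant, and blending weight are all illustrative.

```python
import numpy as np

def update_belief(belief, user_input, goal_directions, rationality=5.0):
    """Bayesian update over discrete goals: inputs aligned with a goal's
    direction make that goal more probable (illustrative sketch)."""
    alignment = np.array([user_input @ g for g in goal_directions])
    likelihood = np.exp(rationality * alignment)
    posterior = belief * likelihood
    return posterior / posterior.sum()

def assisted_action(user_input, belief, goal_directions, assist_weight=2.0):
    """Blend the user's input with an autonomous action toward the most
    likely goal; a larger assist_weight gives the robot more control."""
    robot_action = goal_directions[np.argmax(belief)]
    return (user_input + assist_weight * robot_action) / (1.0 + assist_weight)
```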
Experimental Overview. We compared five different ways of controlling the robot. The first four come from the HARMONIC dataset developed by the HARP Lab at Carnegie Mellon University:
No assist. The user directly controls the end-effector position and orientation by switching modes.
Low Assist / High Assist / Full Assist. The robot interpolates between the human’s input and its shared autonomy action. Within the HARMONIC dataset the High Assist was most effective: here the shared autonomy action is weighted twice as heavily as the human’s input.
cVAE. Our approach, where the robot learns a latent space that the human can control. We trained our model on demonstrations from the HARMONIC dataset.
Our participant pool consisted of ten Stanford University affiliates who provided informed consent. Participants followed the same protocol as used when collecting the HARMONIC dataset: they were given up to five minutes to practice, and then performed five recorded trials (e.g., picking up a marshmallow). Before each trial they indicated which marshmallow they wanted to pick up.
Results. We found that participants who controlled the robot using our learned embedding were able to successfully pick up their desired marshmallow almost 90% of the time:
When breaking these results down, we also found that our approach led to completing the task (a) in less time, (b) with fewer inputs, and (c) with more direct robot motions. See the box-and-whisker plots below, where an asterisk denotes statistical significance:
Why did learning a latent space outperform the shared autonomy benchmarks? We think this improvement occurred because our approach constrained the robot’s motion into a useful region. More specifically, the robot learned to always move its end-effector into a planar manifold above the plate. The human could then control the robot’s state within this embedded manifold to easily position the fork above their desired marshmallow:
These plots show trajectories from the High Assist condition in the HARMONIC dataset (on left) and trajectories from participants leveraging our cVAE method (on right). Comparing the two plots, it is clear that learning a latent space reduced the movement variance, and guided the participants towards the goal region. Overall, our first user study suggests that learned latent spaces are effective in shared autonomy settings because they encode implicit, user-friendly constraints.
Continuous Goals: Latent Actions vs. Switching Modes
Once the robot knows that you are reaching for a goal, it can provide structured assistance. But what about open-ended scenarios where there could be an infinite number of goals? Imagine that you are trying to cook an apple pie with the help of your assistive robotic arm. You might need to get ingredients from the shelves, pour them into the bowl, recycle empty containers (or return filled containers to the shelves), and stir the mixture. Here shared autonomy does not really make sense—there isn’t a discrete set of goals we might be reaching for! Instead, the robot must assist you through a variety of continuous subtasks. Put another way, we need methods that enable the user to guide and control the robot towards continuous goals. Our approach offers one promising solution: equipped with latent actions, the user can control the robot through a continuous manifold.
End-Effector. As a baseline, we asked participants to complete these cooking tasks while using the mode switching strategy that is currently employed by assistive robotic arms. We refer to this strategy as End-Effector. To get a better idea of how it works, look at the gamepads shown below:
Within End-Effector, participants used two joysticks to control either the position or rotation of the robot’s gripper. To change between linear and angular control they needed to switch between modes. By contrast, our Latent Actions approach only used a single 2-DoF joystick. Here there was no mode switching; instead, the robot leveraged its current state to interpret the meaning behind the human’s input, and then reconstructed the intended action.
Experimental Overview. We designed a cooking task where eleven participants made a simplified apple pie. Each participant completed the experiment twice: once with the End-Effector control mode and once with our proposed Latent Action approach. We alternated the order in which participants used each control mode.
Training and Data Efficiency. In total, we trained the latent action approach with less than 7 minutes of kinesthetic demonstrations. These demonstrations were task-related, and consisted of things like moving between shelves, picking up ingredients, pouring into a bowl, and stirring the bowl. The robot then learned the latent space using its onboard computer in less than 2 minutes. We are particularly excited about this data efficiency, which we attribute in part to the simplicity of our models.
Results. We show some video examples from our user studies below. In each, the End-Effector condition is displayed on the left, and the Latent Action approach is provided on the right. At the top, we label the part of the task that the participant is currently completing. Notice that each of the videos is sped up (3x or 4x speed): this can cause the robot’s motion to seem “jerky,” when actually the user is just making incremental inputs.
Task 1: Adding Eggs. The user controls the robot to pick up a container of eggs, pour the eggs into the bowls, and then dispose of the container:
Task 2: Adding Flour. The user teleoperates the robot to pick up some flour, pour the flour into the bowls, and then return the flour to the shelf:
Task 3: Add Apple & Stir. The user guides the robot to pick up the apple, place it into the bowl, and then stir the mixture. You’ll notice that in the End-Effector condition this person got stuck at the limits of the robot’s workspace, and had to find a different orientation for grasping the apple.
Task 4: Making an Apple Pie. After the participant completed the first three tasks, we changed the setup. We moved the recycling container, the bowl, and the shelf, and then instructed participants to redo all three subtasks without any reset. This was more challenging than the previous tasks, since the robot had to understand a wider variety of human intentions.
Across each of these tasks, participants were able to cook more quickly using the Latent Action approach. We also found that our approach reduced the amount of joystick input; hence, using an embedding reduced both user time and effort.
Subjective Responses. After participants completed all of the tasks shown above, we asked them for their opinions about the robot’s teleoperation strategy. Could you predict what action the robot would take? Was it hard to adapt to the robot’s decoding of your inputs? Could you control the robot to reach your desired state? For each of these questions, participants provided their assessment on a 7-point scale, where a 7 means agreement (the strategy was predictable, adaptable, controllable, etc.) and a 1 means disagreement.
Summarizing these results, participants thought that our approach required less effort (ease), made it easier to complete the task (easier), and produced more natural robot motion (natural). For the other questions, any differences were not statistically significant.
Key Takeaways
We explored how we can leverage latent representations to make it easier for users to control assistive robotic arms. Our main insight is that we can embed the robot’s high-dimensional actions into a low-dimensional latent space. This latent action space can be learned directly from task-related data:
In order to be useful for human operators, the learned latent space should be controllable, consistent and scalable.
Based on our simulations and experiments, state conditioned autoencoders appear to satisfy these properties.
We can leverage these learned embeddings during tasks with either discrete or continuous goals (such as cooking and eating).
These models are data efficient: in our cooking experiments, the robot used its onboard computer to train on data from less than 7 minutes of kinesthetic demonstrations.
Overall, this work is a step towards assistive robots that can seamlessly collaborate with and understand their human users.
If you have any questions, please contact Dylan Losey at: dlosey@stanford.edu
Our team of collaborators is shown below!
This blog post is based on the 2019 paper Controlling Assistive Robots with Learned Latent Actions by Dylan P. Losey, Krishnan Srinivasan, Ajay Mandlekar, Animesh Garg, and Dorsa Sadigh.
For further details on this work, check out the paper on arXiv.
D. M. Taylor, Americans With Disabilities: 2014. US Census Bureau, 2018. ↩
L. V. Herlant, R. M. Holladay, and S. S. Srinivasa, “Assistive teleoperation of robot arms via automatic time-optimal mode switching,” in ACM/IEEE International Conference on Human Robot Interaction (HRI), 2016, pp. 35–42. ↩
C. Doersch, “Tutorial on variational autoencoders,” arXiv preprint arXiv:1606.05908, 2016. ↩
S. Javdani, H. Admoni, S. Pellegrinelli, S. S. Srinivasa, and J. A. Bagnell, “Shared autonomy via hindsight optimization for teleoperation and teaming,” The International Journal of Robotics Research, vol. 37, no. 7, pp. 717–742, 2018. ↩