Yell At Your Robot 🗣️

Improving On-the-Fly from Language Corrections

Stanford University; University of California, Berkeley


Abstract

Hierarchical policies that combine language and low-level control have been shown to perform impressive long-horizon robotic tasks, by leveraging either zero-shot high-level planners like pretrained language and vision-language models (LLMs/VLMs) or models trained on annotated robotic demonstrations. However, for complex and dexterous skills, attaining high success rates on long-horizon tasks still represents a major challenge -- the longer the task is, the more likely it is that some stage will fail. Can humans help the robot to continuously improve its long-horizon task performance through intuitive and natural feedback? In this work, we make the following observation: high-level policies that index into sufficiently rich and expressive low-level language-conditioned skills can be readily supervised with human feedback in the form of language corrections. We show that even fine-grained corrections, such as small movements (“move a bit to the left”), can be effectively incorporated into high-level policies, and that such corrections can be readily obtained from humans observing the robot and making occasional suggestions. This framework enables robots not only to rapidly adapt to real-time language feedback, but also to incorporate this feedback into an iterative training scheme that improves the high-level policy's ability to correct errors in both low-level execution and high-level decision-making purely from verbal feedback. Our evaluation on real hardware shows that this leads to significant performance improvements in long-horizon, dexterous manipulation tasks without the need for any additional teleoperation.

Method

We operate in a hierarchical setup where a high-level policy generates language instructions for a low-level policy that executes the corresponding skills. During deployment, humans can intervene through corrective language commands, temporarily overriding the high-level policy and directly influencing the low-level policy for on-the-fly adaptation. These interventions are then used to finetune the high-level policy, improving its future performance.

Method
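To make the control flow concrete, here is a minimal Python sketch of this deployment loop. The names (high_level_policy, low_level_policy, get_verbal_correction, and so on) are illustrative stand-ins rather than the released YAY Robot code, and the replanning interval and logging format are assumptions of the sketch.

# Minimal sketch of the hierarchical deployment loop with verbal overrides.
# All names here are illustrative stand-ins, not the released YAY Robot code.

from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class CorrectionLog:
    """(observation, instruction, was_human_correction) tuples saved for later fine-tuning."""
    records: List[Tuple[object, str, bool]] = field(default_factory=list)


def run_episode(
    get_observation: Callable[[], object],
    high_level_policy: Callable[[object], str],          # image -> language instruction
    low_level_policy: Callable[[object, str], object],    # (obs, instruction) -> joint targets
    get_verbal_correction: Callable[[], Optional[str]],   # speech-to-text; None if the human is silent
    send_joint_targets: Callable[[object], None],
    max_steps: int = 1000,
    replan_every: int = 30,                                # assumed replanning interval, in control steps
) -> CorrectionLog:
    log = CorrectionLog()
    instruction = None
    for t in range(max_steps):
        obs = get_observation()

        correction = get_verbal_correction()
        if correction is not None:
            # A spoken correction temporarily overrides the high-level policy.
            instruction = correction
            log.records.append((obs, instruction, True))
        elif instruction is None or t % replan_every == 0:
            # Otherwise the high-level policy picks the next language instruction.
            instruction = high_level_policy(obs)
            log.records.append((obs, instruction, False))

        # The low-level policy executes the current language-conditioned skill.
        send_joint_targets(low_level_policy(obs, instruction))
    return log

The point of the sketch is that a spoken correction simply replaces the instruction the low-level policy is conditioned on, so the robot adapts immediately without any retraining; the accumulated log is what later drives fine-tuning of the high-level policy.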

Architecture

Our system processes RGB images and the robot's current joint positions as inputs, outputting target joint positions for motor actions. The high-level policy uses a Vision Transformer to encode visual inputs and predict language embeddings. The low-level policy uses ACT, a Transformer-based model, to generate precise motor actions for the robot, guided by language instructions. This architecture enables the robot to interpret commands like “Pick up the bag” and translate them into targeted joint movements.

System Architecture
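The following simplified PyTorch sketch shows how the two levels fit together: a ViT-style encoder maps RGB images to a language embedding, and an ACT-style decoder, conditioned on that embedding and the current joint positions, predicts a chunk of future target joint positions. Dimensions, patch size, and chunk length are illustrative, and the real models differ in detail (for example, ACT also encodes images and is trained as a conditional VAE).

# Simplified sketch of the two-level architecture; sizes are illustrative.

import torch
import torch.nn as nn


class HighLevelPolicy(nn.Module):
    """ViT-style encoder over RGB images that predicts a language embedding.

    The predicted embedding can be matched (e.g. by cosine similarity) against
    embeddings of known skill instructions such as "pick up the bag"."""

    def __init__(self, img_size=224, patch=16, dim=256, lang_dim=512):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.to_lang = nn.Linear(dim, lang_dim)

    def forward(self, rgb):                        # rgb: (B, 3, H, W)
        x = self.patchify(rgb).flatten(2).transpose(1, 2) + self.pos
        x = self.encoder(x).mean(dim=1)            # pooled visual feature
        return self.to_lang(x)                     # (B, lang_dim) language embedding


class LowLevelPolicy(nn.Module):
    """ACT-style decoder: conditioned on the instruction embedding and current
    joints, it predicts a chunk of future target joint positions."""

    def __init__(self, lang_dim=512, joint_dim=14, dim=256, chunk=20):
        super().__init__()
        self.proj_lang = nn.Linear(lang_dim, dim)
        self.proj_joints = nn.Linear(joint_dim, dim)
        self.queries = nn.Parameter(torch.zeros(1, chunk, dim))
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True), num_layers=4)
        self.to_action = nn.Linear(dim, joint_dim)

    def forward(self, lang_emb, joints):
        memory = torch.stack([self.proj_lang(lang_emb), self.proj_joints(joints)], dim=1)
        queries = self.queries.expand(joints.shape[0], -1, -1)
        return self.to_action(self.decoder(queries, memory))   # (B, chunk, joint_dim)


if __name__ == "__main__":
    lang_emb = HighLevelPolicy()(torch.randn(1, 3, 224, 224))
    actions = LowLevelPolicy()(lang_emb, torch.randn(1, 14))
    print(actions.shape)  # torch.Size([1, 20, 14])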

Quantitative Results

Language corrections not only improve task success in real time, but also enhance the autonomous policy's performance at each stage of the tasks by 20% on average through fine-tuning.

Key Results
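As a concrete illustration of this fine-tuning step, the sketch below turns logged deployment data into weighted supervision for the high-level policy: whenever a human issued a verbal correction, that correction becomes the target instruction for the corresponding observation. The cosine-embedding loss, the frozen embed_text encoder, and the upweighting of corrected steps are assumptions of the sketch, not the paper's exact recipe.

# Hedged sketch of one fine-tuning step on logged deployment data.

import torch
import torch.nn.functional as F
from typing import Callable, List, Tuple


def finetune_step(
    high_level_policy: torch.nn.Module,           # e.g. the HighLevelPolicy sketch above
    embed_text: Callable[[str], torch.Tensor],    # assumed frozen text encoder, returns (lang_dim,)
    optimizer: torch.optim.Optimizer,
    batch: List[Tuple[torch.Tensor, str, bool]],  # (rgb image, instruction, was_human_correction)
    correction_weight: float = 2.0,               # assumed upweighting of corrected steps
) -> float:
    images = torch.stack([rgb for rgb, _, _ in batch])
    targets = torch.stack([embed_text(instr) for _, instr, _ in batch])
    weights = torch.tensor(
        [correction_weight if corrected else 1.0 for _, _, corrected in batch])

    pred = high_level_policy(images)              # (B, lang_dim)
    # Cosine-distance loss between predicted and target instruction embeddings,
    # weighted toward the steps where the human verbally corrected the robot.
    loss = (weights * (1.0 - F.cosine_similarity(pred, targets, dim=-1))).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()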

Iterative Improvement: YAY Robot's success rates for packing different numbers of items show significant improvement with each iteration of user verbal feedback collection and fine-tuning, approaching the oracle's performance (dashed lines) at each stage of the task.

Iterative Improvement

Bag Packing

Trail Mix Preparation

Plate Cleaning

Ablations

Our results show that 1) replacing a learned high-level policy with a scripted one leads to worse performance, 2) an off-the-shelf VLM performs poorly on complex long-horizon tasks, and 3) replacing language with one-hot encodings hurts model performance.

Ablations

Comparison to Flat BC Policy: Overall, our hierarchical approach achieves higher success rates than the non-hierarchical imitation learning method on long-horizon tasks.

Comparison to Flat Policy

Qualitative Results: Through heatmaps, we visualize the cleaning efficacy across the plate surface, where brighter areas denote higher frequencies of effective wiping. YAY Robot demonstrates wider cleaning coverage after fine-tuning the high-level policy with human verbal feedback.

Plate Heatmap
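Such heatmaps can be produced with a simple 2D histogram over logged sponge-plate contact points, as in the illustrative sketch below; the coordinate frame, plate radius, and binning are assumptions, not the paper's actual visualization pipeline.

# Illustrative wiping heatmap from logged (x, y) contact points on the plate.

import numpy as np
import matplotlib.pyplot as plt


def plot_wipe_heatmap(contact_xy: np.ndarray, plate_radius: float = 0.10, bins: int = 50):
    """contact_xy: (N, 2) sponge contact points in plate-centered coordinates (meters)."""
    lim = plate_radius
    heat, _, _ = np.histogram2d(
        contact_xy[:, 0], contact_xy[:, 1],
        bins=bins, range=[[-lim, lim], [-lim, lim]])
    plt.imshow(heat.T, origin="lower", extent=[-lim, lim, -lim, lim], cmap="viridis")
    plt.colorbar(label="wipe count")
    plt.xlabel("x (m)")
    plt.ylabel("y (m)")
    plt.title("Cleaning coverage (brighter = more wipes)")
    plt.show()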

Failure Cases

Welcome to the real world! It sucks. You're gonna love it. -- Friends

BibTeX

        
@article{shi2024yell,
  title   = {Yell At Your Robot: Improving On-the-Fly from Language Corrections},
  author  = {Lucy Xiaoyang Shi and Zheyuan Hu and Tony Z. Zhao and Archit Sharma and Karl Pertsch and Jianlan Luo and Sergey Levine and Chelsea Finn},
  journal = {arXiv preprint arXiv:2403.12910},
  year    = {2024}
}