Utilizing Vision-Language Models (VLMs) for robotic manipulation represents a novel paradigm, aiming to enhance the model's ability to generalize to new objects and instructions. However, due to variations in camera specifications and mounting positions, existing methods exhibit significant performance disparities across different robotic platforms. To address this challenge, we propose RoboUniView in this paper, an innovative approach that decouples visual feature extraction from action learning. We first learn a unified view representation from multi-perspective views by pre-training on readily accessible data, and then derive actions from this unified view representation to control robotic manipulation. This unified view representation more accurately mirrors the physical world and is not constrained by the robotic platform's camera parameters. Thanks to this methodology, we achieve state-of-the-art performance on the demanding CALVIN benchmark, enhancing the success rate in the D->D setting from 88.7% to 96.2%, and in the ABC->D setting from 82.4% to 94.2%. Moreover, our model exhibits outstanding adaptability and flexibility: it maintains high performance under unseen camera parameters, can utilize multiple datasets with varying camera parameters, and is capable of joint cross-task learning across datasets.
The entire RoboUniView framework is illustrated in the figure. During the forward process, multi-perspective images pass through the Vision Encoder to extract wrist image features and the unified view representation. These are then combined with language tokens in the Feature Fusion Decoder to extract integrated vision-language features. Finally, these features pass through the policy head to produce the robotic manipulation actions. Training consists of two phases: in the pre-training phase, the Vision Encoder is trained on a large dataset of easily accessible RGB-D images to learn a robust unified view representation; in the fine-tuning phase, the model learns to predict robotic actions from the unified view representation using paired image and action data.
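To make the forward process concrete, below is a minimal PyTorch sketch of the data flow described above. The module internals (a patchify backbone, learnable queries that cross-attend to multi-view tokens, a small transformer decoder for fusion, and an MLP policy head), as well as all names and dimensions, are illustrative assumptions rather than the released implementation; in particular, the sketch omits the RGB-D pre-training phase described above.

```python
# Minimal sketch of the RoboUniView forward pass described above.
# All module internals, names, and dimensions are assumptions for illustration;
# they are not the authors' released implementation.
import torch
import torch.nn as nn


class VisionEncoder(nn.Module):
    """Stand-in: maps multi-perspective RGB images to a unified view
    representation plus wrist-camera features."""
    def __init__(self, dim=512, n_queries=64):
        super().__init__()
        self.backbone = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # patchify
        self.unified_queries = nn.Parameter(torch.randn(n_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, static_img, wrist_img):
        # Tokenize both views with a shared backbone.
        static_tok = self.backbone(static_img).flatten(2).transpose(1, 2)  # (B, N, D)
        wrist_tok = self.backbone(wrist_img).flatten(2).transpose(1, 2)
        # Learnable queries cross-attend to all view tokens, producing a
        # view representation shared across camera setups.
        q = self.unified_queries.expand(static_img.size(0), -1, -1)
        kv = torch.cat([static_tok, wrist_tok], dim=1)
        unified, _ = self.attn(q, kv, kv)
        return unified, wrist_tok


class FeatureFusionDecoder(nn.Module):
    """Stand-in: fuses vision tokens with language tokens."""
    def __init__(self, dim=512):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)

    def forward(self, lang_tok, vision_tok):
        return self.decoder(tgt=lang_tok, memory=vision_tok)


class PolicyHead(nn.Module):
    """Stand-in: predicts a 7-DoF action (6-DoF pose delta + gripper)."""
    def __init__(self, dim=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 7))

    def forward(self, fused):
        return self.mlp(fused.mean(dim=1))  # pool over tokens


class RoboUniViewSketch(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.vision_encoder = VisionEncoder(dim)
        self.fusion_decoder = FeatureFusionDecoder(dim)
        self.policy_head = PolicyHead(dim)

    def forward(self, static_img, wrist_img, lang_tok):
        unified, wrist_feat = self.vision_encoder(static_img, wrist_img)
        fused = self.fusion_decoder(lang_tok, torch.cat([unified, wrist_feat], dim=1))
        return self.policy_head(fused)


if __name__ == "__main__":
    model = RoboUniViewSketch()
    action = model(torch.randn(2, 3, 224, 224),   # static camera RGB
                   torch.randn(2, 3, 224, 224),   # wrist camera RGB
                   torch.randn(2, 16, 512))       # language tokens
    print(action.shape)  # torch.Size([2, 7])
```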
We fine-tune RoboUniView on demonstrations from the Split D training set and evaluate its imitation performance on episodes sampled from Split D (D->D). RoboUniView significantly outperforms all methods across all metrics. The success rate of task 1 improves from 0.887 to 0.962. Even more impressively, on sequences of consecutive tasks, RoboUniView increases the success rate of task 5 from 0.349 to 0.563 and raises the average successful sequence length from 2.968 to 3.855. This result is particularly notable because difficulty grows as a sequence progresses: the initial state of each subsequent task depends heavily on the completion state of the previous one, leading to increasingly diverse starting conditions.
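For reference, these numbers follow the standard CALVIN long-horizon protocol: each evaluation rollout chains five consecutive language-instructed tasks, a rollout counts toward "task i" only if the first i tasks are completed in order, and the average successful sequence length is the mean number of consecutively completed tasks. The short sketch below computes both metrics from per-rollout results; the function name and data layout are illustrative assumptions.

```python
# Sketch of how the CALVIN long-horizon metrics quoted above are computed.
# `results[j]` holds the number of consecutive tasks completed in rollout j (0..5).
from typing import List, Tuple


def long_horizon_metrics(results: List[int], chain_len: int = 5) -> Tuple[List[float], float]:
    n = len(results)
    # Success rate of "task i" = fraction of rollouts that completed
    # at least the first i tasks in a row.
    success_rates = [sum(r >= i for r in results) / n for i in range(1, chain_len + 1)]
    # Average successful sequence length = mean number of consecutive
    # completed tasks per rollout.
    avg_len = sum(results) / n
    return success_rates, avg_len


if __name__ == "__main__":
    # Toy example with 4 rollouts completing 5, 3, 0, and 4 tasks in a row.
    rates, avg_len = long_horizon_metrics([5, 3, 0, 4])
    print(rates)    # [0.75, 0.75, 0.75, 0.5, 0.25]
    print(avg_len)  # 3.0
```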
We also fine-tune RoboUniView on the ABC split and test on the D split (ABC->D), where the D split presents a completely different visual environment from ABC. As shown in the table, RoboUniView improves the success rate of task 1 from 0.824 to 0.942 and the average successful sequence length from 2.47 to 3.647 compared to the best previous method, demonstrating RoboUniView's strong zero-shot generalization capability.
To further validate the effectiveness of our method, we conduct three additional experiments using RoboFlamingo as the baseline: (1) training on the D split and testing on the D split with altered camera parameters; (2) training on the D split with two different sets of camera parameters and testing on the D split; (3) training on the D split with two different sets of camera parameters, each paired with a different subset of tasks, and testing all tasks on the D split.
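A hypothetical configuration for these three setups might look like the sketch below. The camera parameter values, field names, and task subsets are assumptions made purely for illustration; they are not the actual CALVIN or RoboUniView settings.

```python
# Hypothetical configuration sketch for the three generalization experiments
# listed above. All values and field names are illustrative assumptions.
CAM_A = {"position": [2.0, 0.0, 1.5], "look_at": [0.0, 0.0, 0.5], "fov": 60}
CAM_B = {"position": [1.6, 0.4, 1.8], "look_at": [0.0, 0.0, 0.5], "fov": 75}

EXPERIMENTS = {
    # (1) Train with one camera setup, evaluate with an unseen one.
    "unseen_camera": {
        "train": [{"split": "D", "camera": CAM_A, "tasks": "all"}],
        "test":  [{"split": "D", "camera": CAM_B, "tasks": "all"}],
    },
    # (2) Train jointly on two camera setups, evaluate on split D.
    "multi_camera": {
        "train": [{"split": "D", "camera": CAM_A, "tasks": "all"},
                  {"split": "D", "camera": CAM_B, "tasks": "all"}],
        "test":  [{"split": "D", "camera": CAM_A, "tasks": "all"}],  # eval camera assumed
    },
    # (3) Each camera setup covers a different task subset; evaluate all tasks.
    "cross_task": {
        "train": [{"split": "D", "camera": CAM_A, "tasks": "subset_1"},
                  {"split": "D", "camera": CAM_B, "tasks": "subset_2"}],
        "test":  [{"split": "D", "camera": CAM_A, "tasks": "all"}],  # eval camera assumed
    },
}
```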
@article{li2024RoboUniView,
  title={RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation},
  author={Liu, Fanfan and Yan, Feng and Zheng, Liming and Huang, Yiyang and Feng, Chengjian and Ma, Lin},
  journal={arXiv preprint arXiv:2406.18977},
  year={2024}
}