VoxAct-B: Voxel-Based Acting and Stabilizing Policy for Bimanual Manipulation

Anonymous Author(s)
Affiliation

Abstract

Bimanual manipulation is critical to many robotics applications. In contrast to single-arm manipulation, bimanual manipulation tasks are challenging due to higher-dimensional action spaces. Prior works leverage large amounts of data and primitive actions to address this problem, but may suffer from sample inefficiency and limited generalization across various tasks. To this end, we propose VoxAct-B, a language-conditioned, voxel-based method that leverages Vision Language Models (VLMs) to prioritize key regions within the scene and reconstruct a voxel grid. We provide this voxel grid to our bimanual manipulation policy to learn acting and stabilizing actions. This approach enables more efficient policy learning from voxels and is generalizable to different tasks. In simulation, we show that VoxAct-B outperforms strong baselines on fine-grained bimanual manipulation tasks. Furthermore, we demonstrate VoxAct-B on real-world Open Drawer and Open Jar tasks using two UR5s.

Videos


Overview


VoxAct-B schematic

VoxAct-B uses voxel representations and language to perform bimanual manipulation with 6-DoF control of both arms.


Bimanual manipulation is essential for many robotics tasks, such as when an object is too large to be controlled by one gripper, or when one arm stabilizes an object of interest to make it easier for the other arm to manipulate. In this work, we focus on asymmetric bimanual manipulation. Here, "asymmetry" refers to the functions of the two arms: one is a stabilizing arm, while the other is the acting arm. Asymmetric tasks are common in household and industrial settings, such as cutting food, opening bottles, and packaging boxes. They typically require two-hand coordination and high-precision, fine-grained manipulation, which are challenging for current robotic manipulation systems. To tackle bimanual manipulation, some methods train policies on large datasets, while others exploit primitive actions. However, these approaches are generally sample inefficient, and reliance on primitives can hinder generalization to different tasks.


To this end, we propose VoxAct-B, a novel voxel-based, language-conditioned method for bimanual manipulation. Voxel representations, when coupled with discretized action spaces, can increase sample efficiency and generalization by introducing spatial equivariance into a learned system, where transformations of the input lead to corresponding transformations of the output. However, processing voxels is computationally demanding. To address this, we propose using VLMs to focus on the most pertinent regions within the scene by cropping out less relevant regions. This substantially reduces the physical extent of the region used to construct a voxel grid, enabling an increase in effective voxel resolution without additional computational cost. To our knowledge, this is the first study to apply voxel representations to bimanual manipulation.
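The cropping idea above can be sketched with a toy example. Here, the object center would come from an upstream VLM/detector (not shown); the box half-extent, resolution, and function names are illustrative assumptions, not the paper's actual implementation. The point is that a fixed-size grid over a smaller cropped region yields much finer voxels for the same memory and compute.

```python
import numpy as np

def crop_point_cloud(points, center, half_extent):
    """Keep only points inside an axis-aligned box around `center`.

    `center` is assumed to come from a VLM locating the task-relevant
    object (hypothetical upstream step, not implemented here).
    """
    lo, hi = center - half_extent, center + half_extent
    mask = np.all((points >= lo) & (points <= hi), axis=1)
    return points[mask], lo, hi

def voxelize(points, lo, hi, resolution=100):
    """Discretize points into a binary occupancy grid of shape (R, R, R)."""
    grid = np.zeros((resolution,) * 3, dtype=bool)
    idx = ((points - lo) / (hi - lo) * (resolution - 1)).astype(int)
    idx = np.clip(idx, 0, resolution - 1)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

# Toy scene: a 2 m cube of points, with the "object" assumed near (0.5, 0.5, 0.5).
rng = np.random.default_rng(0)
scene = rng.uniform(0.0, 2.0, size=(10_000, 3))
obj_center = np.array([0.5, 0.5, 0.5])
cropped, lo, hi = crop_point_cloud(scene, obj_center, half_extent=0.25)
grid = voxelize(cropped, lo, hi, resolution=100)
# The same 100^3 grid now spans a 0.5 m cube instead of the full 2 m scene,
# shrinking voxel edge length from 2 cm to 0.5 cm at no extra grid cost.
print(grid.shape)
```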


We also employ language instructions and VLMs to determine the role of each arm: acting or stabilizing. For instance, in a drawer-opening task, the orientation of the drawer and the position of its handle determine which arm is better suited for opening the drawer (acting) and which for holding it steady (stabilizing). We use VLMs to estimate the pose of the object of interest relative to the front camera and to assign the role of each arm. We then provide the corresponding language instructions to the bimanual manipulation policy to control the acting and stabilizing arms.
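The role-assignment step above can be illustrated with a minimal sketch. The threshold heuristic, instruction templates, and function name here are assumptions for illustration; the actual system derives the decision from a VLM-estimated object pose rather than a single coordinate.

```python
def assign_roles(object_pos_x, task="open the drawer"):
    """Toy heuristic: the arm on the object's side acts; the other stabilizes.

    `object_pos_x` stands in for a VLM-estimated object position in the
    front camera frame (negative = left of center). The returned strings
    are the language instructions fed to each arm's policy.
    """
    acting = "left" if object_pos_x < 0.0 else "right"
    stabilizing = "right" if acting == "left" else "left"
    return (
        f"use the {acting} arm to {task}",
        f"use the {stabilizing} arm to stabilize the object",
    )

# Handle detected slightly left of center: left arm acts, right arm stabilizes.
act_instr, stab_instr = assign_roles(-0.12)
print(act_instr)   # use the left arm to open the drawer
print(stab_instr)  # use the right arm to stabilize the object
```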


Results

Performance Comparisons

With 10 training demonstrations, VoxAct-B outperforms all baselines by a large margin. When all methods are trained with more demonstrations (100), VoxAct-B still outperforms the baselines on most tasks, except against ACT on Put Item in Drawer. Qualitatively, baseline methods, especially VoxPoser, typically struggle to precisely grasp objects such as drawer handles and jars. The baselines also struggle to correctly assign the roles of the two arms; for instance, a policy intended to execute acting actions may unpredictably produce stabilizing actions. Furthermore, they can generate kinematically infeasible actions or actions that lead to erratic movements, as seen with ACT and Diffusion Policy, which may be caused by insufficient training data. In contrast, we observe fewer of these errors with VoxAct-B.


Performance of different methods on our bimanual manipulation tasks in simulation, based on 10 or 100 (task-specific) training demonstrations.

Video Rollouts

All videos, except those in the failure sections, are categorized as successes.

Open Drawer

VoxAct-B
Diffusion Policy
ACT
VoxPoser
Bimanual PerActs

Open Jar

VoxAct-B
Diffusion Policy
ACT
VoxPoser
Bimanual PerActs

Put Item in Drawer

VoxAct-B
Diffusion Policy
ACT
VoxPoser
Bimanual PerActs

Hand Over Item

VoxAct-B

Jars and Drawers of Different Appearances

VoxAct-B

Real-World Results

VoxAct-B

Need for a Second Arm to Open Jar and Drawer

VoxAct-B

Failures

VoxAct-B