VCoT-Grasp: Grasp Foundation Models with Visual Chain-of-Thought Reasoning for Language-driven Grasp Generation

Haoran Zhang1*  Shuanghao Bai1*†  Wanqi Zhou1  Yuedi Zhang1  Qi Zhang1 Pengxiang Ding2,3  Cheng Chi4  Donglin Wang3  Badong Chen1✉ 
1Xi'an Jiaotong University 2Zhejiang University 3Westlake University 4BAAI
*Equal Contribution, †Project Leader, ✉Corresponding Author
Teaser Image

Diverging from prior language-driven grasp detection/generation approaches, including (a) end-to-end multimodal feature fusion methods, (b) LLM/VLM-guided modular pipelines, and (c) end-to-end foundation models with language reasoning, our method (d) advocates visual chain-of-thought reasoning, encouraging the model to "think with images." It emphasizes visual grounding by localizing regions that contain critical visual cues and dynamically zooming in to capture context at the appropriate granularity. This mechanism leads to superior generalization to unseen objects, backgrounds, and distractors.

Abstract

Robotic grasping is one of the most fundamental tasks in robotic manipulation, and grasp detection/generation has long been the subject of extensive research. Recently, language-driven grasp generation has emerged as a promising direction due to its practical interaction capabilities. However, most existing approaches either lack sufficient reasoning and generalization capabilities or depend on complex modular pipelines. Moreover, current grasp foundation models tend to overemphasize dialog and object semantics, resulting in inferior performance and restricting them to single-object grasping. To maintain strong reasoning ability and generalization in cluttered environments, we propose VCoT-Grasp, an end-to-end grasp foundation model that incorporates visual chain-of-thought reasoning to enhance visual understanding for grasp generation. VCoT-Grasp adopts a multi-turn processing paradigm that dynamically focuses on visual inputs while providing interpretable reasoning traces. For training, we refine and introduce a large-scale dataset, VCoT-GraspSet, comprising 167K synthetic images with over 1.36M grasps, as well as 400+ real-world images with more than 1.2K grasps, annotated with intermediate bounding boxes. Extensive experiments on both VCoT-GraspSet and a real robot demonstrate that our method significantly improves grasp success rates and generalizes effectively to unseen objects, backgrounds, and distractors. Our code and dataset will be made publicly available.

Method

Overall framework of VCoT-Grasp.
Our VCoT-Grasp model introduces visual chain-of-thought reasoning, enabling two key capabilities: visual understanding and grasp generation.
1. First, we equip the model with the ability to localize the target object for grasping by predicting its bounding box as an intermediate reasoning step:
\[ b = \pi(O, l_d) \]
where \( O \) denotes the input image and \( l_d \) the detection instruction for the target object. In this step, the model distinguishes the target from irrelevant objects and provides a coarse-grained location.
2. Next, the bounding box is used to crop and resize a square region of the image, yielding the bounding box image \( O_b \) that effectively zooms in on the region of interest. The same vision encoder and projector are applied to extract visual tokens, and the VLM integrates tokens from both the original and localized images to generate a refined grasp pose:
\[ g = \pi(O, O_b, l_g) \]
where \( l_g \) denotes the grasp instruction (see the inference sketch after this list).
3. Additionally, we construct a high-quality dataset, VCoT-GraspSet, to facilitate the training of our model.
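Below is a minimal sketch of the two-turn inference procedure from steps 1 and 2. The model interface (predict_bbox, predict_grasp), the bounding-box format, and the crop resolution are illustrative assumptions, not the released VCoT-Grasp API.

```python
# Minimal sketch of the visual chain-of-thought inference loop described above.
# `model.predict_bbox` / `model.predict_grasp` and the 336x336 crop size are
# hypothetical stand-ins, not the released VCoT-Grasp interface.
from PIL import Image


def square_crop(image: Image.Image, bbox, out_size: int = 336) -> Image.Image:
    """Crop a square region around an (x1, y1, x2, y2) box and resize it."""
    x1, y1, x2, y2 = bbox
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half = max(x2 - x1, y2 - y1) / 2.0          # square side = longer box edge
    left, upper = max(cx - half, 0), max(cy - half, 0)
    right, lower = min(cx + half, image.width), min(cy + half, image.height)
    return image.crop((left, upper, right, lower)).resize((out_size, out_size))


def vcot_grasp_inference(model, image: Image.Image, detect_instr: str, grasp_instr: str):
    # Turn 1: coarse localization, b = pi(O, l_d)
    bbox = model.predict_bbox(image, detect_instr)

    # Zoom in: build the bounding-box image O_b at the appropriate granularity
    bbox_image = square_crop(image, bbox)

    # Turn 2: refined grasp generation, g = pi(O, O_b, l_g)
    grasp = model.predict_grasp(image, bbox_image, grasp_instr)
    return bbox, grasp
```

Re-encoding the zoomed crop with the same vision encoder and projector keeps the second turn lightweight while giving the VLM fine-grained tokens for the target region.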

Experiments

1. Results on our dataset using IoU metrics (see the evaluation sketch after this list).
2. Real-world results on seen and unseen objects.
3. Zero-shot performance.
4. Robustness to background changes and extra distractors.
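For reference, the sketch below implements the rectangle-IoU criterion commonly used in grasp detection benchmarks: a predicted grasp counts as correct if its rotated rectangle overlaps a ground-truth rectangle with IoU above 0.25 and their orientations differ by less than 30°. The exact thresholds used for VCoT-GraspSet are an assumption here; see the paper for the precise protocol.

```python
# Hedged sketch of a standard grasp-rectangle IoU check; the 0.25 IoU and 30-degree
# thresholds follow common practice and are assumed, not confirmed, for VCoT-GraspSet.
import math

from shapely.geometry import Polygon


def grasp_polygon(cx: float, cy: float, w: float, h: float, theta: float) -> Polygon:
    """Rotated grasp rectangle centered at (cx, cy) with angle theta in radians."""
    corners = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return Polygon([(cx + x * cos_t - y * sin_t, cy + x * sin_t + y * cos_t)
                    for x, y in corners])


def grasp_correct(pred, gt, iou_thresh=0.25, angle_thresh=math.radians(30)) -> bool:
    """pred and gt are (cx, cy, w, h, theta) grasp rectangles."""
    diff = abs(pred[4] - gt[4]) % math.pi
    if min(diff, math.pi - diff) > angle_thresh:    # grasp angles are pi-periodic
        return False
    p, g = grasp_polygon(*pred), grasp_polygon(*gt)
    union = p.union(g).area
    return union > 0 and p.intersection(g).area / union > iou_thresh
```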

Demonstration Videos

BibTeX

@misc{zhang2025vcotgraspgraspfoundationmodels,
      title={VCoT-Grasp: Grasp Foundation Models with Visual Chain-of-Thought Reasoning for Language-driven Grasp Generation}, 
      author={Haoran Zhang and Shuanghao Bai and Wanqi Zhou and Yuedi Zhang and Qi Zhang and Pengxiang Ding and Cheng Chi and Donglin Wang and Badong Chen},
      year={2025},
      eprint={2510.05827},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2510.05827}, 
}