Xiuwei Xu

Xiuwei Xu | 许修为

I am a fifth-year Ph.D. student in the Department of Automation at Tsinghua University, advised by Prof. Jiwen Lu. In 2021, I obtained my B.Eng. from the Department of Automation, Tsinghua University.

I work on computer vision and robotics. My current research focuses on:

Scalable manipulation that studies how policy and evaluator (world models) pretraining can consume broad data sources for scalable robot learning.

My previous research focused on:

Mobile manipulation that studies general navigation, fine-grained navigation, and generalizable 3D data synthesis for embodied agents.

3D scene perception that accurately and efficiently understands the dynamic 3D scenes captured by robotic agents.

Email / CV / Google Scholar / Github / LinkedIn

News

2026-04: R2RGen is accepted to RSS 2026.

2025-08: Two papers are accepted to CoRL 2025.

2025-06: IGL-Nav is accepted to ICCV 2025.

2025-02: Four papers are accepted to CVPR 2025.

2025-02: EmbodiedSAM is selected as an oral presentation at ICLR 2025.

2025-01: EmbodiedSAM is accepted to ICLR 2025.

2024-09: Two papers on VLM quantization and Zero-shot ObjectNav are accepted to NeurIPS 2024.

*Equal contribution, ^†Project leader.

Selected Preprints

iMaC: Translating Actions into Motion and Contact Images for Embodied World Models
Zhenyu Wu*, Xiuwei Xu*, Yukun Zhou, Yifan Li, Qiuping Deng, Xiaofeng Wang, Zheng Zhu, Bingyao Yu, Ziwei Wang, Jiwen Lu, Haibin Yan
arXiv, 2026
[arXiv] [Code] [Project Page]

We propose iMaC, an image-as-action control paradigm for embodied world models. iMaC converts robot actions into dense motion and contact images through URDF/FK rendering and RGB-D geometry, exposing spatial motion intention and robot-scene contact relations. These image controls enable contact-sensitive future prediction and closed-loop policy evaluation.

F2F-AP: Flow-to-Future Asynchronous Policy for Real-time Dynamic Manipulation
Haoyu Wei, Xiuwei Xu^†, Ziyang Cheng, Hang Yin, Angyuan Ma, Bingyao Yu, Jie Zhou, Jiwen Lu
arXiv, 2026
[arXiv] [Code] [Project Page]

We propose F2F-AP, a flow-to-future asynchronous policy for real-time dynamic manipulation. It predicts object flow to synthesize future observations and aligns visual features with future states, allowing policies to compensate for latency and interact with moving objects.

R2RDreamer: 3D-aware Data Augmentation for Spatially-generalized 2D Manipulation Policies
Xiuwei Xu*, Haowen Sun*, Angyuan Ma*, Yiwei Zhang, Zhenyu Wu, Xiaofeng Wang, Bingyao Yu, Zheng Zhu, Jie Zhou, Jiwen Lu
arXiv, 2026
[arXiv] [Project Page]

We propose R2RDreamer, a real-to-real demonstration augmentation framework for spatially generalized 2D manipulation policies. R2RDreamer edits incomplete object point clouds and end-effector trajectories in 3D, projects them into occlusion-aware image-space controls, and uses dense-control video completion to synthesize temporally coherent RGB-action demonstrations from limited real data.

Embodied Task Planning with Large Language Models
Zhenyu Wu, Ziwei Wang, Xiuwei Xu, Jiwen Lu, Haibin Yan
arXiv, 2023
[arXiv] [Code] [Project Page] [Demo]

We propose a TAsk Planning Agent (TaPA) for grounded planning in embodied tasks under physical scene constraints, where the agent generates executable plans according to the objects existing in the scene by aligning LLMs with visual perception models.

Selected Publications

R2RGen: Real-to-Real 3D Data Generation for Spatially Generalized Manipulation
Xiuwei Xu*, Angyuan Ma*, Hankun Li, Bingyao Yu, Zheng Zhu, Jie Zhou, Jiwen Lu
Robotics: Science and Systems (RSS), 2026
[arXiv] [Project Page] [Colab]

We propose a real-to-real 3D data generation framework for robotic manipulation. R2RGen generates spatially diverse manipulation demonstrations for training real-world policies, requiring only one human demonstration without simulator setup.

AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation
Wenxuan Guo, Xiuwei Xu^†, Yichen Liu, Xiangyu Li, Hang Yin, Huangxing Chen, Wenzhao Zheng, Jianjiang Feng, Jie Zhou, Jiwen Lu
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026
[arXiv] [Project Page]

We propose AwareVLN, a self-aware reasoning framework for vision-language navigation. It triggers structured reasoning at key navigation nodes to understand scene context, task progress, and next-step plans, improving instruction following in simulation and real-world navigation.

MoTo: A Zero-shot Plug-in Interaction-aware Navigation for General Mobile Manipulation
Zhenyu Wu*, Angyuan Ma*, Xiuwei Xu^†, Hang Yin, Yinan Liang, Ziwei Wang, Jiwen Lu, Haibin Yan
Conference on Robot Learning (CoRL), 2025
[arXiv] [Project Page]

We propose a general framework for mobile manipulation, which is divided into docking point selection and fixed-base manipulation. We model the docking point selection stage as an optimization process that lets the agent move and touch the target keypoint under several constraints.

IGL-Nav: Incremental 3D Gaussian Localization for Image-goal Navigation
Wenxuan Guo*, Xiuwei Xu*, Hang Yin, Ziwei Wang, Jianjiang Feng, Jie Zhou, Jiwen Lu
International Conference on Computer Vision (ICCV), 2025
[arXiv] [Code] [Project Page]

We propose IGL-Nav, an incremental 3D Gaussian localization framework for image-goal navigation. It supports challenging scenarios where the camera for goal capturing and the agent's camera have very different intrinsics and poses, e.g., a cellphone and an RGB-D camera.

Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding
Wenxuan Guo*, Xiuwei Xu*, Ziwei Wang, Jianjiang Feng, Jie Zhou, Jiwen Lu
Computer Vision and Pattern Recognition (CVPR), 2025 (Highlight, All Strong Accept)
[arXiv] [Code] [中文解读]

We propose TSP3D, an efficient multi-level convolution architecture for 3D visual grounding. TSP3D achieves superior performance compared to previous approaches in both accuracy and inference speed.

UniGoal: Towards Universal Zero-shot Goal-oriented Navigation
Hang Yin*, Xiuwei Xu*^†, Linqing Zhao, Ziwei Wang, Jie Zhou, Jiwen Lu
Computer Vision and Pattern Recognition (CVPR), 2025
[arXiv] [Code] [Project Page] [中文解读]

We propose UniGoal, a unified graph representation for zero-shot goal-oriented navigation. Based on online 3D scene graph prompting for LLMs, our method can be directly applied to different kinds of scenes and goals without training.

EmbodiedSAM: Online Segment Any 3D Thing in Real Time
Xiuwei Xu, Huangxing Chen, Linqing Zhao, Ziwei Wang, Jie Zhou, Jiwen Lu
International Conference on Learning Representations (ICLR), 2025 (Oral, Top 1.8% Submission)
[arXiv] [Code] [Project Page] [中文解读]

We present ESAM, an efficient framework that leverages vision foundation models for online, real-time, fine-grained, generalized and open-vocabulary 3D instance segmentation.

SG-Nav: Online 3D Scene Graph Prompting for LLM-based Zero-shot Object Navigation
Hang Yin*, Xiuwei Xu*^†, Zhenyu Wu, Jie Zhou, Jiwen Lu
Neural Information Processing Systems (NeurIPS), 2024
[arXiv] [Code] [Project Page] [中文解读]

We propose a training-free object-goal navigation framework that leverages LLMs and VFMs. We construct an online hierarchical 3D scene graph and prompt the LLM to exploit the structural information contained in subgraphs for zero-shot decision making.

3D Small Object Detection with Dynamic Spatial Pruning
Xiuwei Xu*, Zhihao Sun*, Ziwei Wang, Hongmin Liu, Jie Zhou, Jiwen Lu
European Conference on Computer Vision (ECCV), 2024
[arXiv] [Code] [Project Page] [中文解读]

We propose an effective and efficient 3D detector named DSPDet3D for detecting small objects. By scaling up the spatial resolution of feature maps and pruning uninformative scene representations, DSPDet3D is able to capture detailed local geometric information while keeping a low memory footprint and latency.

Memory-based Adapters for Online 3D Scene Perception
Xiuwei Xu*, Chong Xia*, Ziwei Wang, Linqing Zhao, Yueqi Duan, Jie Zhou, Jiwen Lu
Computer Vision and Pattern Recognition (CVPR), 2024
[arXiv] [Code] [Project Page] [中文解读]

We propose a model- and task-agnostic plug-and-play module, which converts offline 3D scene perception models (which receive reconstructed point clouds) into online perception models (which receive streaming RGB-D videos).

Back to Reality: Weakly-supervised 3D Object Detection with Shape-guided Label Enhancement
Xiuwei Xu, Yifan Wang, Yu Zheng, Yongming Rao, Jie Zhou, Jiwen Lu
Computer Vision and Pattern Recognition (CVPR), 2022
IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI, IF: 23.6), 2023
[arXiv] [Code] [Poster] [PDF (Journal)] [Supp (Journal)]

We propose a weakly-supervised approach for 3D object detection, which makes it possible to train a strong 3D detector with only annotations of object centers. We convert the weak annotations into virtual scenes with synthetic 3D shapes and apply domain adaptation to train a size-aware detector for real scenes.

Full publication list

Selected Projects

MoTo

R2RGen

Data-efficient Mobile Manipulation

In this project, we study mobile manipulation, where a robot must combine long-horizon navigation with precise local manipulation. Whole-body demonstrations are expensive to collect, and standard navigation usually stops at a coarse region that is still far from the accuracy required by manipulation policies. We therefore focus on two complementary problems: accurate docking that turns navigation into a suitable fixed-base manipulation setup, and scalable data generation that improves policy generalization across viewpoints, object locations, and object geometry for both 2D and 3D policies. Our works are summarized as:

MoManipVLA --> MoTo. MoManipVLA transfers VLA waypoint prediction to mobile manipulation, using out-of-range end-effector waypoints to guide base motion. MoTo further abstracts docking as a "move and touch" optimization problem, selecting robot poses that satisfy manipulation-oriented geometric constraints.

R2RGen --> ShapeGen --> R2RDreamer. R2RGen edits real pointcloud-trajectory pairs to generate spatially diverse demonstrations for generalized 3D policies. ShapeGen extends data generation to category-level manipulation by producing function-aware shape variations with minimal annotation. R2RDreamer further converts 3D edits into more scalable 2D videos by occlusion-aware projection and video completion.

IGL-Nav

UniGoal

GC-VLN

3D Representation for Visual Navigation

In this project, we study how to design a proper representation and how to exploit it for general visual navigation. Previous methods mainly rely on BEV maps or topological graphs, which lack the 3D information needed to reason about fine-grained spatial relationships and detailed color / texture. Therefore, we leverage 3D representations to better model the observed 3D environment. We propose: (1) the 3D scene graph as a structural representation for explicit LLM reasoning and unification of different kinds of tasks and (2) 3D Gaussians as a renderable representation for accurate image-goal navigation. Our works are summarized as:

IGL-Nav, which proposes incremental 3D Gaussian localization for free-view image-goal navigation. We support challenging application scenarios where the camera for goal capturing and the agent's camera have very different intrinsics and poses, e.g., a cellphone and an RGB-D camera.

SG-Nav --> UniGoal --> GC-VLN. SG-Nav builds an online 3D scene graph to prompt an LLM, which enables training-free object-goal navigation with a high success rate. UniGoal extends SG-Nav to general goal-oriented navigation. We unify all goals into a uniform goal graph and leverage an LLM to reason about how to explore based on graph matching between the goal and scene graphs. GC-VLN further brings the vision-and-language navigation task into our framework by treating the language instruction as a DAG and solving the resulting graph constraints.

DSPDet3D

Online3D

EmbodiedSAM

Efficient and Online 3D Scene Perception

In this project, we study how to make 3D scene perception methods applicable to embodied scenarios such as robotic planning and interaction. Although much research has been conducted on 3D scene perception, it is still very challenging to (1) process large-scale 3D scenes with both fine granularity and fast speed and (2) perceive 3D scenes in an online and real-time manner that directly consumes streaming RGB-D video as input. We address these problems in the following works:

DSPDet3D --> TSP3D. DSPDet3D is able to detect almost everything (small and large) given a building-level 3D scene, within 2s on a single GPU. TSP3D extends DSPDet3D to 3D visual grounding with text-guided pruning and completion-based addition, achieving state-of-the-art accuracy and speed even compared with two-stage methods.

Online3D --> EmbodiedSAM. Online3D converts offline 3D scene perception models (which receive reconstructed point clouds) into online perception models (which receive streaming RGB-D videos) in a model- and task-agnostic plug-and-play manner. EmbodiedSAM segments any 3D thing online in real time.

Grants and Awards

NSFC Youth Student Research Project (PhD) / 国家自然科学基金青年学生基础研究项目（博士研究生）, 2025-2026

National Scholarship, 2024

Outstanding Graduates (Beijing & Dept. of Automation, Tsinghua University), 2021

Innovation Award of Science and Technology, Tsinghua University, 2019-2020

Teaching

Teaching Assistant, Computer Vision, 2024 Spring Semester

Teaching Assistant, Pattern Recognition and Machine Learning, 2023 Fall Semester

Teaching Assistant, Pattern Recognition and Machine Learning, 2022 Fall Semester

Teaching Assistant, Numerical Analysis, 2021 Fall Semester

Academic Services

Conference Reviewer: ICML 2025, CVPR 2025-2026, ICLR 2025-2026, NeurIPS 2024-2025, ECCV 2024, ICCV 2023-2025, CoRL 2025, IROS 2025, ICASSP 2022-2023

Journal Reviewer: IJCV, T-IP, T-ITS, T-MM, T-CSVT
