Xiuwei Xu | 许修为
I am a fourth-year Ph.D. student in the Department of Automation at Tsinghua University, advised by Prof. Jiwen Lu. In 2021, I obtained my B.Eng. from the Department of Automation, Tsinghua University.
I work on computer vision and robotics. My current research focuses on:
Embodied AI that grounds robotic planning in physical scenes, especially robotic navigation and mobile manipulation.
3D scene perception that accurately and efficiently understands dynamic 3D scenes captured by robotic agents.
3D reconstruction that builds 3D scenes from raw sensor inputs online and in real time, especially SLAM and 3D Gaussians.
Email / CV / Google Scholar / Github
*Equal contribution, †Project leader.
Preprint
Embodied Instruction Following in Unknown Environments
Zhenyu Wu, Ziwei Wang, Xiuwei Xu, Jiwen Lu, Haibin Yan
arXiv, 2024
[arXiv] [Code] [Project Page]
Our embodied agent efficiently explores unknown environments and generates feasible plans with existing objects to accomplish abstract instructions, completing complex human instructions such as making breakfast, tidying bedrooms and cleaning bathrooms in house-level scenes.
Embodied Task Planning with Large Language Models
Zhenyu Wu, Ziwei Wang, Xiuwei Xu, Jiwen Lu, Haibin Yan
arXiv, 2023
[arXiv] [Code] [Project Page] [Demo]
We propose a TAsk Planning Agent (TaPA) for embodied tasks that performs grounded planning under physical scene constraints, where the agent generates executable plans according to the objects existing in the scene by aligning LLMs with visual perception models.
Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding
Wenxuan Guo*, Xiuwei Xu*, Ziwei Wang, Jianjiang Feng, Jie Zhou, Jiwen Lu
Computer Vision and Pattern Recognition (CVPR), 2025 (Highest Rating)
[arXiv] [Code] [Blog (in Chinese)]
We propose TSP3D, an efficient multi-level convolution architecture for 3D visual grounding. TSP3D achieves superior performance compared to previous approaches in both accuracy and inference speed.
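As a rough illustration of the pruning idea (not TSP3D's actual architecture), the sketch below keeps only the voxels whose features best match a text embedding; the feature shapes, cosine-similarity scoring and keep ratio are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def prune_voxels(voxel_feats, voxel_coords, text_feat, keep_ratio=0.25):
    """Keep the voxels most relevant to the text query.
    voxel_feats: (N, C), voxel_coords: (N, 3), text_feat: (C,)."""
    scores = F.normalize(voxel_feats, dim=-1) @ F.normalize(text_feat, dim=-1)
    k = max(1, int(keep_ratio * voxel_feats.shape[0]))
    keep = torch.topk(scores, k).indices          # most text-relevant voxels
    return voxel_feats[keep], voxel_coords[keep]

feats, coords = torch.randn(1000, 128), torch.randint(0, 64, (1000, 3))
pruned_feats, pruned_coords = prune_voxels(feats, coords, torch.randn(128))
print(pruned_feats.shape)  # torch.Size([250, 128])
```

Later convolution levels then operate on far fewer voxels, which is where the inference-speed gain comes from.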
UniGoal: Towards Universal Zero-shot Goal-oriented Navigation
Hang Yin*, Xiuwei Xu*†, Linqing Zhao, Ziwei Wang, Jie Zhou, Jiwen Lu
Computer Vision and Pattern Recognition (CVPR), 2025
[arXiv] [Code] [Project Page] [Blog (in Chinese)]
We propose UniGoal, a unified graph representation for zero-shot goal-oriented navigation. Built on online 3D scene graph prompting for LLMs, our method can be directly applied to different kinds of scenes and goals without training.
MoManipVLA: Transferring Vision-language-action Models for General Mobile Manipulation
Zhenyu Wu, Yuheng Zhou, Xiuwei Xu, Ziwei Wang, Jiwen Lu, Haibin Yan
Computer Vision and Pattern Recognition (CVPR), 2025
[arXiv] [Project Page]
We propose an efficient policy adaptation framework named MoManipVLA to transfer pre-trained VLA models for fixed-base manipulation to mobile manipulation. We utilize pre-trained VLA models to generate end-effector waypoints with high generalization ability, and devise motion planning objectives to maximize the physical feasibility of the generated trajectories.
EmbodiedSAM: Online Segment Any 3D Thing in Real Time
Xiuwei Xu, Huangxing Chen, Linqing Zhao, Ziwei Wang, Jie Zhou, Jiwen Lu
International Conference on Learning Representations (ICLR), 2025 (Oral, Top 1.8% Submission)
[arXiv] [Code] [Project Page] [Blog (in Chinese)]
We present ESAM, an efficient framework that leverages vision foundation models for online, real-time, fine-grained, generalized and open-vocabulary 3D instance segmentation.
SG-Nav: Online 3D Scene Graph Prompting for LLM-based Zero-shot Object Navigation
Hang Yin*, Xiuwei Xu*†, Zhenyu Wu, Jie Zhou, Jiwen Lu
Neural Information Processing Systems (NeurIPS), 2024
[arXiv] [Code] [Project Page] [Blog (in Chinese)]
We propose a training-free object-goal navigation framework that leverages LLMs and VFMs. We construct an online hierarchical 3D scene graph and prompt the LLM to exploit the structural information contained in subgraphs for zero-shot decision making.
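To make the prompting step concrete, here is a minimal sketch of serializing a small hierarchical scene graph into an LLM prompt; the graph schema, the graph_to_prompt helper and the prompt wording are hypothetical, not the paper's exact format.

```python
# Toy hierarchical scene graph: rooms contain object groups (illustrative schema).
scene_graph = {
    "room_0": {"type": "bedroom",
               "groups": {"group_0": ["bed", "nightstand", "lamp"]}},
    "room_1": {"type": "kitchen",
               "groups": {"group_1": ["fridge", "counter", "sink"]}},
}

def graph_to_prompt(graph, goal):
    lines = [f"Task: find a {goal}. Observed scene structure:"]
    for room, info in graph.items():
        lines.append(f"- {room} ({info['type']}):")
        for group, objects in info["groups"].items():
            lines.append(f"  - {group}: {', '.join(objects)}")
    lines.append("Which room should the agent explore next? Answer with a room id.")
    return "\n".join(lines)

print(graph_to_prompt(scene_graph, "pillow"))  # this string is sent to the LLM
```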
Q-VLM: Post-training Quantization for Large Vision-Language Models
Changyuan Wang, Ziwei Wang, Xiuwei Xu, Yansong Tang, Jie Zhou, Jiwen Lu
Neural Information Processing Systems (NeurIPS), 2024
[arXiv] [Code]
We propose a post-training quantization framework for large vision-language models (LVLMs). Our method compresses memory by 2.78x and increases generation speed by 1.44x on the 13B LLaVA model without performance degradation on diverse multi-modal reasoning tasks.
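For context, the snippet below sketches the basic uniform weight quantization that any PTQ framework builds on; the per-tensor scale and 4-bit setting are simplifying assumptions, and the paper's actual calibration scheme is not shown.

```python
import torch

def quantize_weights(w, n_bits=4):
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max() / qmax              # per-tensor scale from calibration
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale                          # dequantized ("fake-quant") weights

w = torch.randn(256, 256)
print((w - quantize_weights(w)).abs().mean())  # average rounding error
```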
3D Small Object Detection with Dynamic Spatial Pruning
Xiuwei Xu*, Zhihao Sun*, Ziwei Wang, Hongmin Liu, Jie Zhou, Jiwen Lu
European Conference on Computer Vision (ECCV), 2024
[arXiv] [Code] [Project Page] [Blog (in Chinese)]
We propose an effective and efficient 3D detector named DSPDet3D for detecting small objects. By scaling up the spatial resolution of feature maps and pruning uninformative scene representations, DSPDet3D captures detailed local geometric information while keeping a low memory footprint and latency.
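The sketch below shows the pruning mechanism in isolation: a lightweight score head marks which scene tokens are worth keeping at the next (higher-resolution) level, and the rest are discarded; the linear head and fixed threshold are illustrative assumptions, not DSPDet3D's design.

```python
import torch
import torch.nn as nn

score_head = nn.Linear(64, 1)                  # tiny per-voxel "informative?" classifier

def dynamic_prune(feats, coords, threshold=0.5):
    probs = torch.sigmoid(score_head(feats)).squeeze(-1)
    keep = probs > threshold                   # drop uninformative regions
    return feats[keep], coords[keep]

feats, coords = torch.randn(5000, 64), torch.randint(0, 128, (5000, 3))
kept_feats, kept_coords = dynamic_prune(feats, coords)
print(f"kept {kept_feats.shape[0]} / 5000 voxels")
```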
Memory-based Adapters for Online 3D Scene Perception
Xiuwei Xu*, Chong Xia*, Ziwei Wang, Linqing Zhao, Yueqi Duan, Jie Zhou, Jiwen Lu
Computer Vision and Pattern Recognition (CVPR), 2024
[arXiv] [Code] [Project Page] [Blog (in Chinese)]
We propose a model- and task-agnostic plug-and-play module, which converts offline 3D scene perception models (which take reconstructed point clouds as input) into online perception models (which take streaming RGB-D videos as input).
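As a sketch of what such an adapter might look like, the toy module below fuses each incoming frame's features with a persistent memory through a learned gate, so a single-frame model gains temporal context; the gated-update rule is an assumption for illustration, not the paper's adapter.

```python
import torch
import torch.nn as nn

class MemoryAdapter(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
        self.memory = None                       # persists across frames

    def forward(self, frame_feats):              # (N, dim) features of the current frame
        if self.memory is None:
            self.memory = frame_feats.detach()
        g = torch.sigmoid(self.gate(torch.cat([frame_feats, self.memory], dim=-1)))
        fused = g * frame_feats + (1 - g) * self.memory
        self.memory = fused.detach()             # write back for the next frame
        return fused                             # fed to the frozen offline model's head

adapter = MemoryAdapter()
for _ in range(3):                               # streaming frames
    out = adapter(torch.randn(100, 128))
print(out.shape)
```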
Towards Accurate Data-free Quantization for Diffusion Models
Changyuan Wang, Ziwei Wang, Xiuwei Xu, Yansong Tang, Jie Zhou, Jiwen Lu
Computer Vision and Pattern Recognition (CVPR), 2024 (Highlight, Top 2.8% Submission)
[arXiv] [Code]
We propose a post-training quantization framework to compress diffusion models, which performs group-wise quantization to minimize rounding errors across time steps and selects content generated at the optimal time steps for calibration.
Back to Reality: Learning Data-Efficient 3D Object Detector with Shape Guidance
Xiuwei Xu, Ziwei Wang, Jie Zhou, Jiwen Lu
IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI, IF: 23.6), 2023
[PDF] [Supp] [Code]
We extend BR to BR++ by introducing differentiable label enhancement and label-assisted self-training. Our approach surpasses current weakly-supervised and semi-supervised methods by a large margin, and achieves detection performance comparable to some fully-supervised methods with less than 5% of the labeling labor.
MCUFormer: Deploying Vision Transformers on Microcontrollers with Limited Memory
Yinan Liang, Ziwei Wang, Xiuwei Xu, Yansong Tang, Jie Zhou, Jiwen Lu
Neural Information Processing Systems (NeurIPS), 2023
[arXiv] [Code] [Blog (in Chinese)]
We propose a hardware-algorithm co-optimization method called MCUFormer to deploy vision transformers on microcontrollers with extremely limited memory, where we jointly design the transformer architecture and construct the inference operator library to fit the memory resource constraint.
Binarizing Sparse Convolutional Networks for Efficient Point Cloud Analysis
Xiuwei Xu, Ziwei Wang, Jie Zhou, Jiwen Lu
Computer Vision and Pattern Recognition (CVPR), 2023
[arXiv] [Poster]
We propose a binary sparse convolutional network called BSC-Net for efficient point cloud analysis. With the presented shifted sparse convolution operation and an efficient search method, we reduce the quantization error of sparse convolution without additional computation overhead.
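For intuition, the snippet below shows the standard per-channel weight binarization that binary convolutions start from; the dense 3D kernel is a stand-in for a sparse one, and the shifted sparse convolution itself is not reproduced here.

```python
import torch

def binarize(w):
    # w: (out_ch, in_ch, k, k, k); per-output-channel scaling factor
    alpha = w.abs().mean(dim=(1, 2, 3, 4), keepdim=True)
    return alpha * torch.sign(w)               # weights restricted to {-alpha, +alpha}

w = torch.randn(32, 16, 3, 3, 3)
print(torch.unique(binarize(w)[0]).numel())    # 2 distinct values per output channel
```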
Quantformer: Learning Extremely Low-precision Vision Transformers
Ziwei Wang, Changyuan Wang, Xiuwei Xu, Jie Zhou, Jiwen Lu
IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI, IF: 23.6), 2022
[PDF] [Supp] [Code]
We propose extremely low-precision vision transformers in 2-4 bits, where a self-attention rank consistency loss and a group-wise quantization strategy are presented to minimize quantization error.
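The group-wise strategy can be sketched as follows: channels are split into fixed-size groups, each with its own scale, which reduces quantization error when magnitudes vary widely across channels; the group size and bit-width here are arbitrary choices for illustration.

```python
import torch

def groupwise_quantize(x, n_bits=3, group_size=32):
    qmax = 2 ** (n_bits - 1) - 1
    groups = x.reshape(-1, group_size)          # assumes numel divisible by group_size
    scale = groups.abs().max(dim=1, keepdim=True).values / qmax  # one scale per group
    q = torch.clamp(torch.round(groups / scale), -qmax - 1, qmax)
    return (q * scale).reshape(x.shape)

x = torch.randn(64, 128)
print((x - groupwise_quantize(x)).abs().mean())  # lower error than one global scale
```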
Back to Reality: Weakly-supervised 3D Object Detection with Shape-guided Label Enhancement
Xiuwei Xu, Yifan Wang, Yu Zheng, Yongming Rao, Jie Zhou, Jiwen Lu
Computer Vision and Pattern Recognition (CVPR), 2022
[arXiv] [Code] [Poster]
We propose a weakly-supervised approach for 3D object detection, which makes it possible to train a strong 3D detector with only annotations of object centers. We convert the weak annotations into virtual scenes with synthetic 3D shapes and apply domain adaptation to train a size-aware detector for real scenes.