Xiuwei Xu | 许修为

I am a fifth-year Ph.D. student in the Department of Automation at Tsinghua University, advised by Prof. Jiwen Lu. In 2021, I obtained my B.Eng. from the Department of Automation, Tsinghua University.

I work on computer vision and robotics. My current research focuses on:

  • General manipulation: learning generalizable manipulation skills with data-efficient imitation learning and 3D policies.
  • Visual navigation: enabling robots to explore environments according to multimodal instructions.
  • 3D scene perception: accurately and efficiently understanding the dynamic 3D scenes captured by a robotic agent.

Email  /  CV  /  Google Scholar  /  Github  /  LinkedIn

    Research Projects

    Data-efficient Mobile Manipulation

    In this project, we study the challenging mobile manipulation task, which requires an accurate and proper combination of navigation and manipulation. Since whole-body mobile manipulation data is limited, we formulate this task as a stage subsequent to navigation. However, existing navigation methods stop at a very coarse location, usually 3-5m away from the manipulation area, which is far from the <5mm accuracy required by manipulation policies. Therefore, we propose: (1) an optimization-based framework for docking point selection, serving as an intermediate stage between navigation and fixed-base manipulation, and (2) 3D policies and 3D data generation to train generalizable policies with minimal data, enabling robust manipulation under varying viewpoints, object locations and appearances. Our works are summarized below:

  • MoManipVLA --> MoTo. MoManipVLA utilizes pre-trained VLA models to generate end-effector waypoints; out-of-range waypoints are used to control the robot's base through motion-planning objectives. MoTo is a more general framework that models docking point selection as a "move and touch" problem between the end-effector and the object part.
  • R2RGen. R2RGen generates diverse manipulation demonstrations via real-world point cloud-trajectory editing, and uses them to train a spatially generalizable 3D policy that is robust to varying object locations and robot viewpoints.
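
    To make the docking-point idea above concrete, here is a minimal, hypothetical Python sketch that scores candidate base poses near the target object by reachability and facing direction and keeps the best collision-free one. The function name and cost terms are illustrative placeholders of my own, not the actual MoTo objective.

```python
import numpy as np

# Hypothetical illustration of optimization-based docking point selection:
# score candidate base poses around the target object and keep the best one.
# The cost terms below are placeholders, not the actual MoTo objective.

def select_docking_point(object_xy, candidate_poses, occupancy_is_free,
                         preferred_reach=0.6):
    """candidate_poses: (N, 3) array of (x, y, yaw) base poses."""
    best_pose, best_cost = None, np.inf
    for x, y, yaw in candidate_poses:
        if not occupancy_is_free(x, y):           # discard colliding poses
            continue
        dist = np.hypot(object_xy[0] - x, object_xy[1] - y)
        reach_cost = abs(dist - preferred_reach)  # stay inside the arm workspace
        heading = np.arctan2(object_xy[1] - y, object_xy[0] - x)
        face_cost = abs(np.arctan2(np.sin(heading - yaw),
                                   np.cos(heading - yaw)))  # face the object
        cost = reach_cost + 0.5 * face_cost
        if cost < best_cost:
            best_pose, best_cost = (x, y, yaw), cost
    return best_pose
```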

    3D Representation for Visual Navigation

    In this project, we study how to design a proper representation and how to exploit it for general visual navigation. Previous methods mainly rely on BEV maps or topological graphs, which lack the 3D information needed to reason about fine-grained spatial relationships and detailed color / texture. Therefore, we leverage 3D representations to better model the observed 3D environment. We propose: (1) a 3D scene graph as a structural representation for explicit LLM reasoning and for unifying different kinds of tasks, and (2) 3D Gaussians as a renderable representation for accurate image-goal navigation. Our works are summarized below:

  • IGL-Nav, which proposes incremental 3D Gaussian localization for free-view image-goal navigation. We support a challenging application scenario where the goal-capturing camera and the agent's camera have very different intrinsics and poses, e.g., a cellphone and an RGB-D camera.
  • SG-Nav --> UniGoal --> GC-VLN. SG-Nav builds an online 3D scene graph to prompt an LLM, enabling training-free object-goal navigation with a high success rate. UniGoal extends SG-Nav to general goal-oriented navigation: we unify all goals into a uniform goal graph and leverage the LLM to reason about how to explore based on graph matching between the goal graph and the scene graph. GC-VLN further brings vision-and-language navigation into our framework by treating the language instruction as a DAG and solving the resulting graph constraints.
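
    As a toy illustration of the scene-graph idea above, the sketch below serializes a small 3D scene graph into a text prompt that asks an LLM where to explore next. The graph schema and prompt wording are assumptions made for illustration and do not reproduce the actual SG-Nav / UniGoal pipeline.

```python
from dataclasses import dataclass, field

# Toy example of prompting an LLM with an online 3D scene graph for
# goal-oriented exploration. The graph schema and prompt wording are
# illustrative assumptions, not the actual SG-Nav / UniGoal implementation.

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)   # id -> (label, 3D position)
    edges: list = field(default_factory=list)   # (id_a, relation, id_b)

    def add_object(self, obj_id, label, position):
        self.nodes[obj_id] = (label, position)

    def add_relation(self, a, relation, b):
        self.edges.append((a, relation, b))

    def to_prompt(self, goal):
        facts = [f"{self.nodes[a][0]} {rel} {self.nodes[b][0]}"
                 for a, rel, b in self.edges]
        return ("Observed scene graph:\n" + "\n".join(facts) +
                f"\nWhich observed object should the agent approach to find "
                f"a {goal}? Answer with one object.")

graph = SceneGraph()
graph.add_object(0, "sofa", (1.0, 0.2, 0.0))
graph.add_object(1, "tv", (1.5, 2.0, 0.8))
graph.add_relation(1, "in front of", 0)
print(graph.to_prompt("remote control"))  # this text would be fed to an LLM
```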

    Efficient and Online 3D Scene Perception

    In this project, we study how to make 3D scene perception methods applicable to embodied scenarios such as robotic planning and interaction. Although extensive research has been conducted on 3D scene perception, it remains very challenging to (1) process large-scale 3D scenes with both fine granularity and fast speed and (2) perceive 3D scenes in an online, real-time manner that directly consumes streaming RGB-D video as input. We address these problems in the works below:

  • DSPDet3D --> TSP3D. DSPDet3D is able to detect almost everything (small and large) given a building-level 3D scene, within 2s on a single GPU. TSP3D extends DSPDet3D to 3D visual grounding with text-guided pruning and completion-based addition, achieving state-of-the-art accuracy and speed even compared with two-stage methods.
  • Online3D --> EmbodiedSAM. Online3D converts offline 3D scene perception models (which take reconstructed point clouds) into online perception models (which take streaming RGB-D videos) in a model- and task-agnostic, plug-and-play manner. EmbodiedSAM segments anything in 3D online and in real time.
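
    The sketch below illustrates only the streaming input setting: each incoming RGB-D frame is back-projected with the camera intrinsics and pose, then accumulated into a growing world-frame point cloud that a per-frame model could query. It is a simplified assumption about the input format, not the Online3D or EmbodiedSAM architecture.

```python
import numpy as np

# Minimal sketch of the online setting: each incoming RGB-D frame is
# back-projected with the camera intrinsics K and camera-to-world pose T,
# then accumulated into a world-frame point cloud. This illustrates the
# streaming input format only, not the actual Online3D / EmbodiedSAM models.

def backproject(depth, K, T):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    valid = z > 0
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)[valid]
    return (T @ pts_cam.T).T[:, :3]              # points in world frame

class OnlineMap:
    def __init__(self):
        self.points = np.empty((0, 3))

    def update(self, depth, K, T):
        # accumulate the new frame; a perception model would query this map
        self.points = np.concatenate([self.points, backproject(depth, K, T)])
        return self.points
```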

    Foundation Model Compression and Deployment

    Deploying powerful deep neural networks, especially large foundation models (GPT-4/Gemini), on robots is usually prohibitive due to strict limits on computational resources. To address this, we propose: (1) fundamental network compression techniques that reduce model complexity without performance degradation; (2) an automatic model compression framework that selects the optimal compression policy within hardware resource constraints; and (3) a hardware-friendly compilation engine to achieve actual speedup and memory savings on robot-oriented computation platforms. Our system is summarized below:

  • MCUFormer, which makes it possible to deploy large vision transformers on an STM32F4 microcontroller (256K memory, 5 USD) for a wide variety of tasks including object detection and instance segmentation.
  • Q-VLM, which quantizes large vision-language models to 4-bit and enables on-device deployment. Our method compresses memory by 2.78x and increases generation speed by 1.44x on the 13B LLaVA model without performance degradation on diverse multi-modal reasoning tasks.
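
    For context on the 4-bit setting, the sketch below shows a generic asymmetric, per-output-channel 4-bit weight quantizer and its dequantizer in PyTorch. This is a textbook-style illustration under my own assumptions; it is not the quantization strategy actually used in Q-VLM.

```python
import torch

# Generic asymmetric 4-bit per-output-channel weight quantization, shown only
# to illustrate the 4-bit setting; Q-VLM's actual quantization method differs.

def quantize_weight_4bit(w: torch.Tensor):
    """w: (out_features, in_features) float weights -> int4 codes + params."""
    qmin, qmax = 0, 15                                  # 4-bit integer range
    w_min = w.min(dim=1, keepdim=True).values
    w_max = w.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / (qmax - qmin)
    zero_point = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(w / scale + zero_point), qmin, qmax)
    return q.to(torch.uint8), scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.float() - zero_point) * scale             # approximate weights

w = torch.randn(8, 16)
q, s, z = quantize_weight_4bit(w)
print((w - dequantize(q, s, z)).abs().max())            # quantization error
```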

  • Website Template


    © Xiuwei Xu | Last updated: March 12, 2025