I am a fourth-year Ph.D. student in the Department of Automation at Tsinghua University, advised by Prof. Jiwen Lu. In 2021, I obtained my B.Eng. from the Department of Automation, Tsinghua University.
I work on computer vision and robotics. My current research focuses on:
Embodied AI that grounds robotic planning in physical scenes, especially robotic navigation and mobile manipulation.
3D scene perception that accurately and efficiently understands the dynamic 3D scenes captured by robotic agents.
3D reconstruction that builds 3D scenes from raw sensor inputs in an online and real-time manner, especially SLAM and 3D Gaussians.
We propose a TAsk Planning Agent (TaPA) for embodied tasks that performs grounded planning under physical scene constraints, where the agent generates executable plans according to the objects that exist in the scene by aligning LLMs with visual perception models.
We propose TSP3D, an efficient multi-level convolution architecture for 3D visual grounding. TSP3D outperforms previous approaches in both accuracy and inference speed.
We propose UniGoal, a unified graph representation for zero-shot goal-oriented navigation. By prompting an LLM with an online 3D scene graph, our method can be directly applied to different kinds of scenes and goals without training.
We propose ESAM, an efficient framework that leverages vision foundation models for online, real-time, fine-grained, generalized and open-vocabulary 3D instance segmentation.
We propose a training-free object-goal navigation framework that leverages an LLM and vision foundation models (VFMs). We construct an online hierarchical 3D scene graph and prompt the LLM to exploit the structural information contained in subgraphs for zero-shot decision making.
We propose an effective and efficient 3D detector named DSPDet3D for detecting small objects. By scaling up the spatial resolution of feature maps and pruning uninformative scene representations, DSPDet3D captures detailed local geometric information while keeping the memory footprint and latency low.
We propose a model- and task-agnostic plug-and-play module that converts offline 3D scene perception models (which take reconstructed point clouds as input) into online perception models (which take streaming RGB-D videos).
We propose a weakly-supervised approach for 3D object detection, which makes it possible to train a strong 3D detector with only annotations of object centers. We convert the weak annotations into virtual scenes with synthetic 3D shapes and apply domain adaptation to train a size-aware detector for real scenes.
In this project, we study how to design a proper representation and how to exploit it for general visual navigation. Previous methods mainly rely on BEV maps or topological graphs, which lack the 3D information needed to reason about fine-grained spatial relationships and detailed color/texture. We therefore leverage 3D representations to better model the observed 3D environment. We propose: (1) a 3D scene graph as a structural representation for explicit LLM reasoning and for unifying different kinds of tasks, and (2) 3D Gaussians as a renderable representation for accurate image-goal navigation. Our works are summarized below:
IGL-Nav proposes incremental 3D Gaussian localization for free-view image-goal navigation. It supports a challenging application scenario where the camera that captures the goal image and the agent's camera have very different intrinsics and poses, e.g., a cellphone and an RGB-D camera.
SG-Nav --> UniGoal. SG-Nav builds an online 3D scene graph to prompt an LLM, which enables training-free object-goal navigation with a high success rate. UniGoal further extends SG-Nav to general goal-oriented navigation: we unify all goals into a uniform goal graph and leverage the LLM to reason about how to explore based on graph matching between the goal graph and the scene graph (a minimal prompting sketch follows this list).
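For readers curious about the scene-graph prompting idea, here is a minimal, hypothetical Python sketch, not the SG-Nav/UniGoal implementation: it maintains a toy 3D scene graph of observed objects, serializes it into text, and asks a generic text-in/text-out LLM which object to move toward next. The object names, relations, and the query_llm callback are illustrative placeholders.

```python
# Minimal sketch (not the SG-Nav/UniGoal implementation): a toy scene graph
# serialized into an LLM prompt for zero-shot subgoal selection.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str        # object category, e.g. "sofa"
    position: tuple  # (x, y, z) in the world frame

@dataclass
class SceneGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (i, j, relation) triples

    def add_object(self, name, position):
        self.nodes.append(Node(name, position))
        return len(self.nodes) - 1

    def add_relation(self, i, j, relation):
        self.edges.append((i, j, relation))

    def to_prompt(self, goal):
        # Flatten the graph into text so a frozen LLM can reason over it.
        obj_lines = [f"- object {i}: {n.name} at {n.position}" for i, n in enumerate(self.nodes)]
        rel_lines = [f"- object {i} is {r} object {j}" for i, j, r in self.edges]
        return (
            "Observed 3D scene graph:\n" + "\n".join(obj_lines + rel_lines)
            + f"\n\nGoal: find a {goal}.\n"
            "Which observed object should the agent move toward next? "
            "Answer with a single object index."
        )

def choose_next_subgoal(graph, goal, query_llm):
    """query_llm: any text-in/text-out LLM interface (placeholder)."""
    return int(query_llm(graph.to_prompt(goal)).strip())

# Usage with a dummy LLM that always answers "0".
g = SceneGraph()
table = g.add_object("table", (1.0, 0.5, 0.0))
chair = g.add_object("chair", (1.2, 0.4, 0.0))
g.add_relation(chair, table, "next to")
print(choose_next_subgoal(g, "cup", lambda prompt: "0"))
```

In the actual systems the graph is built online from RGB-D observations and organized hierarchically, but the graph-to-text step above is the basic interface between the structural representation and the frozen LLM.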
Efficient and Online 3D Scene Perception
In this project, we study how to make 3D scene perception methods applicable to embodied scenarios such as robotic planning and interaction. Although extensive research has been conducted on 3D scene perception, it is still very challenging to (1) process large-scale 3D scenes with both fine granularity and fast speed and (2) perceive 3D scenes in an online, real-time manner that directly consumes streaming RGB-D video as input. We address these problems in the works below:
DSPDet3D --> TSP3D. DSPDet3D is able to detect almost everything (small and large) in a building-level 3D scene within 2 s on a single GPU. TSP3D extends DSPDet3D to 3D visual grounding with text-guided pruning and completion-based addition, achieving state-of-the-art accuracy and speed even compared with two-stage methods.
Online3D --> EmbodiedSAM. Online3D converts offline 3D scene perception models (which take reconstructed point clouds) into online perception models (which take streaming RGB-D videos) in a model- and task-agnostic plug-and-play manner. EmbodiedSAM segments any 3D thing online and in real time (a minimal sketch of the offline-to-online idea follows).
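Here is a minimal, hypothetical Python sketch of the offline-to-online idea, not the Online3D/EmbodiedSAM implementation: it back-projects each streaming RGB-D frame into a growing point cloud and re-runs an ordinary per-scene perception model after every frame. The intrinsics, poses, voxel size, and the offline_model callback are illustrative placeholders.

```python
# Minimal sketch (not the Online3D/EmbodiedSAM implementation): accumulate a
# streaming RGB-D sequence into a point cloud and call an ordinary
# "offline-style" perception model on the current reconstruction each step.
import numpy as np

def backproject(depth, K, pose):
    """Lift a depth map (H, W) into world-frame points using intrinsics K and a 4x4 camera-to-world pose."""
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    z = depth.reshape(-1)
    valid = z > 0
    x = (u.reshape(-1) - K[0, 2]) / K[0, 0] * z
    y = (v.reshape(-1) - K[1, 2]) / K[1, 1] * z
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)[valid]
    return (pose @ pts_cam.T).T[:, :3]

def online_perception(stream, K, offline_model, voxel=0.05):
    """Naive adapter: grow the cloud frame by frame and re-run the per-scene model each step."""
    cloud = np.empty((0, 3))
    for depth, pose in stream:
        cloud = np.vstack([cloud, backproject(depth, K, pose)])
        # Voxel-grid downsample to keep the accumulated cloud bounded.
        _, idx = np.unique(np.floor(cloud / voxel).astype(int), axis=0, return_index=True)
        cloud = cloud[idx]
        yield offline_model(cloud)  # e.g. instance masks predicted on the current reconstruction

# Usage with a dummy two-frame stream and a dummy model that just counts points.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
stream = [(np.full((480, 640), 2.0), np.eye(4)) for _ in range(2)]
for out in online_perception(stream, K, offline_model=lambda pts: pts.shape[0]):
    print(out)
```

This naive version recomputes everything per frame; the point of the actual works is to make the online setting efficient rather than simply re-running an offline model on the growing scene.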
Grants and Awards
NSFC Youth Student Research Project (PhD) / 国家自然科学基金青年学生基础研究项目(博士研究生), 2025-2026
National Scholarship, 2024
Outstanding Graduates (Beijing & Dept. of Automation, Tsinghua University), 2021
Innovation Award of Science and Technology, Tsinghua University, 2019-2020
Teaching
Teaching Assistant, Computer vision, 2024 Spring Semester
Teaching Assistant, Pattern recognition and machine learning, 2023 Fall Semester
Teaching Assistant, Pattern recognition and machine learning, 2022 Fall Semester
Teaching Assistant, Numerical analysis, 2021 Fall Semester