About Me

I am a fourth-year Ph.D. student at the Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University, advised by Prof. Li Yi (弋力).

I also work as an AI Researcher at Memories.ai, where I collaborate with Prof. Shawn Shen (University of Bristol).

My research interests lie in 4D Human-Object Interaction (HOI), Embodied AI, and Multimodal Large Language Models (MLLMs). I am passionate about equipping robots with the ability to understand and interact with the dynamic 4D world.

News

  • Nov 2025 Two papers accepted to WACV 2026.
  • Nov 2025 My total citations surpassed 600.
  • Aug 2025 One paper accepted to ACM MM 2025.
  • Jul 2025 One paper accepted to ICCV 2025.
  • Jul 2025 One paper accepted to EMNLP 2025.
  • Jun 2025 The HOI4D dataset surpassed 200 citations.
  • May 2025 One paper accepted to UAI 2025.
  • Feb 2025 Two papers accepted to CVPR 2025.
  • Nov 2024 One paper accepted to 3DV 2025.
  • Jul 2024 One paper accepted to ACM MM 2024 (Oral).
  • Feb 2024 One paper accepted to CVPR 2024.
  • Jan 2024 One paper accepted to ICRA 2024.
  • Jul 2023 One paper accepted to ICCV 2023.
  • Feb 2023 One paper accepted to CVPR 2023.
  • Jul 2022 One paper accepted to ECCV 2022.
  • Jul 2021 One paper accepted to ICCV 2021.

Publications

* denotes equal contribution, ^ denotes corresponding author.

ST-Think
ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos
Peiran Wu*, Yunze Liu*, Miao Liu, Junxiao Shen
IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026
Investigates how multimodal large language models understand and reason about dynamic 4D environments from egocentric video input.
PointNet4D
PointNet4D: A Lightweight 4D Point Cloud Video Backbone for Online and Offline Perception in Robotic Applications
Yunze Liu, Zifan Wang, Peiran Wu, Jiayang Ao
IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026
Proposes PointNet4D, a lightweight hybrid Mamba-Transformer 4D backbone with 4DMAP pretraining for efficient online and offline point cloud video perception in robotics. It achieves strong results across nine tasks and enables 4D Diffusion Policy and 4D Imitation Learning on the RoboTwin and HandoverSim benchmarks.
CULTURE3D
CULTURE3D: A Large-Scale and Diverse Dataset of Cultural Landmarks and Terrains for Gaussian-Based Scene Rendering
Xinyi Zheng*, Steve Zhang*, Weizhe Lin, Aaron Zhang, Walterio W. Mayol-Cuevas, Yunze Liu^, Junxiao Shen^
IEEE/CVF International Conference on Computer Vision (ICCV), 2025
Introduces a 10-billion-point ultra-high-resolution drone dataset of 20 culturally significant landmarks and terrains, enabling large-scale Gaussian-based 3D scene reconstruction and benchmarking.
X-LeBench
X-LeBench: A Benchmark for Extremely Long Egocentric Video Understanding
Wenqi Zhou*, Kai Cao*, Hao Zheng, Yunze Liu, Xinyi Zheng, Miao Liu, Per Ola Kristensson, Walterio W. Mayol-Cuevas, Fan Zhang, Weizhe Lin, Junxiao Shen
Conference on Empirical Methods in Natural Language Processing (EMNLP Findings), 2025
Builds a life-logging simulation pipeline and benchmark with 432 ultra-long egocentric video logs (23 minutes–16.4 hours) to stress-test multimodal LLMs on temporal reasoning, memory, and context aggregation.
OnlineHOI
OnlineHOI: Towards Online Human-Object Interaction Generation and Perception
Yihong Ji, Yunze Liu, Yiyao Zhuo, Weijiang Yu, Fei Ma, Joshua Huang, Fei Yu
ACM Multimedia (ACM MM), 2025
Presents an online framework that jointly predicts future human motions and perceives object states for streaming HOI scenarios.
MutualNeRF
MutualNeRF: Improve the Performance of NeRF under Limited Samples with Mutual Information Theory
Zifan Wang, Jingwei Li, Yitang Li, Yunze Liu^
Uncertainty in Artificial Intelligence (UAI), 2025
Proposes MutualNeRF, a mutual-information-driven framework that improves NeRF under limited views via MI-guided sparse view sampling and few-shot regularization in both semantic and pixel spaces.
MAP
MAP: Unleashing Hybrid Mamba-Transformer Vision Backbone's Potential with Masked Autoregressive Pretraining
Yunze Liu, Li Yi
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
Introduces Masked Autoregressive Pretraining that jointly optimizes Mamba and Transformer blocks via local MAE-style reconstruction and global autoregressive generation, yielding state-of-the-art hybrid vision backbones.
MobileH2R
MobileH2R: Learning Generalizable Human to Mobile Robot Handover Exclusively from Scalable and Diverse Synthetic Data
Zifan Wang*, Ziqing Chen*, Junyu Chen*, Jilong Wang, Yuxin Yang, Yunze Liu, Xueyi Liu, He Wang, Li Yi
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
Generates 100K+ diverse synthetic handover scenes and distills them via 4D imitation learning into safe, generalizable mobile robot handover policies that outperform baselines by at least 15% in success rate in both simulation and real-world experiments.
3DV 2025
Interactive Humanoid: Online Full-Body Motion Reaction Synthesis with Social Affordance Canonicalization and Forecasting
Yunze Liu, Changxi Chen, Li Yi
International Conference on 3D Vision (3DV), 2025
Proposes a framework for synthesizing physically plausible humanoid reactions using forward dynamics guidance and social affordance forecasting.
ACM MM 2024
PhysReaction: Physically Plausible Real-Time Humanoid Reaction Synthesis via Forward Dynamics Guided 4D Imitation
Yunze Liu, Changxi Chen, Chen Ding, Li Yi
ACM Multimedia (ACM MM), 2024 (Oral Presentation)
Leverages forward dynamics to guide 4D imitation learning, producing physically plausible humanoid reactions in real time.
ICRA 2024
CrossVideo: Self-supervised Cross-modal Contrastive Learning for Point Cloud Video Understanding
Yunze Liu, Changxi Chen, Zifan Wang, Li Yi
IEEE International Conference on Robotics and Automation (ICRA), 2024
Introduces CrossVideo, a self-supervised cross-modal contrastive learning framework for point cloud video understanding.
CVPR 2024
Physics-aware Hand-object Interaction Denoising
Haoping Luo, Yunze Liu, Li Yi
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
Proposes a physics-aware denoising framework that uses physical constraints to improve the accuracy of estimated hand-object interactions.
ICCV 2023
LeaF: Learning Frames for 4D Point Cloud Sequence Understanding
Yunze Liu, Junyu Chen, Zekun Zhang, Jingwei Huang, Li Yi
IEEE/CVF International Conference on Computer Vision (ICCV), 2023
Introduces learnable frame representations to effectively capture temporal dynamics in 4D point cloud sequences.
CVPR 2023
Complete-to-Partial 4D Distillation for Self-Supervised Point Cloud Sequence Representation Learning
Zhuoyang Zhang, Yuhao Dong, Yunze Liu, Li Yi
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
Introduces a complete-to-partial 4D distillation framework that transfers knowledge from complete point cloud sequences to partial observations for self-supervised representation learning.
ECCV 2022
Point Primitive Transformer for Long-Term 4D Point Cloud Video Understanding
Hao Wen*, Yunze Liu*, Jingwei Huang, Bo Duan, Li Yi
European Conference on Computer Vision (ECCV), 2022
Introduces a point primitive transformer for long-term 4D point cloud video understanding.
CVPR 2022
HOI4D: A 4D Egocentric Dataset for Category-Level Human-Object Interaction
Yunze Liu, Yun Liu, Che Jiang, Kangbo Lyu, Weikang Wan, Hao Shen, Boqiang Liang, Zhoujie Fu, He Wang, Li Yi
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
Releases HOI4D, a large-scale egocentric 4D HOI dataset with category-level annotations that enables learning robust human-object interaction perception and manipulation policies.
ICCV 2021
Contrastive Multimodal Fusion with TupleInfoNCE
Yunze Liu, Qingnan Fan, Shanghang Zhang, Hao Dong, Thomas Funkhouser, Li Yi
IEEE/CVF International Conference on Computer Vision (ICCV), 2021
Introduces TupleInfoNCE, a tuple-based contrastive learning objective that improves multimodal fusion.

Academic Services