MMLab@NTU

Multimedia Laboratory @ Nanyang Technological University

About MMLab@NTU

MMLab@NTU is a research lab at Nanyang Technological University focused on advancing computer vision, multimodal AI, generative AI, and 3D perception and reconstruction. Founded in 2018, the lab has grown into a vibrant research community spanning faculty, research staff, and PhD students.

Our members develop both foundational methods and practical systems, with recent research covering areas such as large multimodal models, generative intelligence, 3D content creation, scene understanding, and efficient vision models for real-world deployment. The lab is committed to open research, impactful publications, and close engagement with the broader AI community. We welcome motivated PhD students, postdoctoral researchers, and research staff who are excited to push the frontier of visual intelligence.

News and Highlights

Singapore President's Young Scientist Award

10/2025: Congratulations to Ziwei Liu on receiving the President’s Science and Technology Award 2025 – Young Scientist Award, one of Singapore’s highest honours for research excellence.


Google PhD Fellowship

10/2025: Yuekun and Ziang have been awarded the highly competitive and prestigious Google PhD Fellowship 2025 in the area of “Machine Perception”. Congrats!


CCDS Outstanding PhD Thesis Award

10/2025: Congratulations to Shangchen Zhou on being named a joint recipient of CCDS’s Outstanding PhD Thesis Award 2025 for his thesis on visual content restoration and enhancement, and to Yuming Jiang on receiving an Honourable Mention for his work on controllable image and video synthesis.



Introducing Xperience-10M

The Largest Human Xperience Dataset for Physical AI

A large-scale egocentric multimodal dataset with 10 million experiences, 10,000 hours of synchronized first-person data, and rich annotations spanning video, audio, depth, pose, motion capture, IMU, and language. Built for embodied AI, robotics, world models, and spatial intelligence. See our project page for more information.
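
To make the modality list concrete, below is a minimal sketch of what one synchronized record in such a dataset could look like. The field names, array shapes, and frame rate are illustrative assumptions, not the released Xperience-10M schema.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class EgoExperienceRecord:
    """One synchronized first-person clip with its aligned modalities."""
    video: np.ndarray   # (T, H, W, 3) RGB frames
    audio: np.ndarray   # (S,) mono waveform samples
    depth: np.ndarray   # (T, H, W) per-frame depth maps
    pose: np.ndarray    # (T, 7) camera pose: xyz translation + quaternion
    mocap: np.ndarray   # (T, J, 3) body joint positions from motion capture
    imu: np.ndarray     # (T, 6) accelerometer + gyroscope readings
    language: str = ""  # free-form narration or annotation


def duration_seconds(record: EgoExperienceRecord, fps: float = 30.0) -> float:
    """Clip length implied by the frame count at an assumed frame rate."""
    return record.video.shape[0] / fps
```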

By Ropedia

Ropedia, our MMLab@NTU startup, provides solutions for physical 4D intelligence: capturing, structuring, and modeling human experience at scale.

Featured Projects

StoryMem: Multi-shot Long Video Storytelling with Memory
K. Zhang, L. Jiang, A. Wang, J. Z. Fang, T. Zhi, Q. Yan, H. Kang, X. Lu, X. Pan
Technical report, arXiv:2512.19539, 2026
[arXiv] [Project Page]

StoryMem is a long-video storytelling framework that turns pre-trained single-shot video diffusion models into multi-shot storytellers. It maintains an explicit memory bank of keyframes from previously generated shots, enabling stronger cross-shot consistency, persistent character and scene coherence, and cinematic quality across minute-long stories. It also introduces ST-Bench, a benchmark for evaluating multi-shot visual storytelling.
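
As a rough illustration of the memory-bank idea, here is a minimal sketch in Python. The generator call and the keyframe-selection rule are hypothetical placeholders, not StoryMem's actual interface; the point is the loop structure: generate a shot, bank its keyframes, and condition the next shot on the bank.

```python
from collections import deque

import numpy as np


def generate_shot(prompt: str, memory: list) -> np.ndarray:
    """Placeholder for a single-shot video diffusion model: (T, H, W, 3)."""
    return np.zeros((16, 64, 64, 3), dtype=np.float32)  # dummy frames


class KeyframeMemory:
    """Fixed-capacity bank of keyframes drawn from earlier shots."""

    def __init__(self, capacity: int = 8):
        self.bank = deque(maxlen=capacity)

    def update(self, shot: np.ndarray) -> None:
        # A real system would select salient keyframes; we take first/last.
        self.bank.append(shot[0])
        self.bank.append(shot[-1])


def tell_story(shot_prompts: list) -> list:
    memory = KeyframeMemory()
    shots = []
    for prompt in shot_prompts:
        shot = generate_shot(prompt, list(memory.bank))  # condition on memory
        memory.update(shot)  # bank keyframes for cross-shot consistency
        shots.append(shot)
    return shots


shots = tell_story(["a knight rides at dawn", "the knight enters a castle"])
```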

4RC: 4D Reconstruction via Conditional Querying Anytime and Anywhere
Y. Luo, S. Zhou, Y. Lan, X. Pan, C. C. Loy
Technical report, arXiv:2602.10094, 2026
[arXiv] [Project Page]

4RC is a unified feed-forward framework for 4D reconstruction from monocular videos. It learns a compact spatio-temporal latent representation that jointly models scene geometry and motion, enabling an encode-once, query-anytime paradigm to recover dense 3D structure and motion between arbitrary frames and timestamps efficiently.
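
A minimal sketch of the encode-once, query-anytime control flow, with toy stand-ins for the encoder and query head (neither is 4RC's actual API): the video is encoded a single time, after which geometry and motion between arbitrary timestamp pairs can be queried cheaply from the latent.

```python
import numpy as np


def encode_video(frames: np.ndarray) -> np.ndarray:
    """One-time pass producing a compact spatio-temporal latent (toy)."""
    return frames.mean(axis=(1, 2))  # (T, C) stand-in latent


def query_geometry(latent: np.ndarray, t_src: float, t_tgt: float) -> dict:
    """Recover structure and motion between two timestamps (dummy outputs)."""
    return {"pointmap": np.zeros((64, 64, 3)), "motion": np.zeros((64, 64, 3))}


frames = np.random.rand(30, 64, 64, 3)  # a monocular video clip
latent = encode_video(frames)           # encode once ...
for t_src, t_tgt in [(0.0, 0.5), (0.1, 0.9)]:
    out = query_geometry(latent, t_src, t_tgt)  # ... query anytime
```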

Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation
K. Liao, S. Wu, Z. Wu, L. Jin, C. Wang, Y. Wang, F. Wang, W. Li, C. C. Loy
International Conference on Learning Representations, 2026 (ICLR)
[PDF] [arXiv] [Project Page] [Demo]

Puffin is a camera-centric multimodal model that unifies camera understanding and controllable image generation. By treating camera parameters as language tokens, it aligns geometric reasoning with vision–language models, enabling spatially consistent cross-view generation, camera reasoning, and scene exploration. The model is trained on Puffin-4M, a large dataset of vision–language–camera triplets.
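
As an illustration of what treating camera parameters as language tokens can mean in practice, here is a hedged sketch that uniformly bins roll, pitch, and field of view into discrete vocabulary tokens. The token format, parameter set, ranges, and bin count are assumptions, not Puffin's released tokenizer.

```python
import numpy as np


def camera_to_tokens(roll: float, pitch: float, fov: float,
                     n_bins: int = 256) -> list:
    """Quantize each camera parameter into a discrete vocabulary token."""
    tokens = []
    for name, value, lo, hi in [("roll", roll, -90.0, 90.0),
                                ("pitch", pitch, -90.0, 90.0),
                                ("fov", fov, 20.0, 120.0)]:
        b = int(np.clip((value - lo) / (hi - lo) * (n_bins - 1),
                        0, n_bins - 1))
        tokens.append(f"<{name}_{b}>")
    return tokens


# ['<roll_127>', '<pitch_148>', '<fov_140>'] -- appended to the text prompt
# so the language model can read or emit camera pose like ordinary words.
print(camera_to_tokens(roll=0.0, pitch=15.0, fov=75.0))
```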

STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer
Y. Lan, Y. Luo, F. Hong, S. Zhou, H. Chen, Z. Lyu, S. Yang, B. Dai, C. C. Loy, X. Pan
International Conference on Learning Representations, 2026 (ICLR)
[arXiv] [Project Page]

STream3R reformulates multi-view 3D reconstruction as a streaming Transformer problem. Using causal attention and feature caching across frames, it incrementally reconstructs dense scene geometry from image streams, enabling scalable and efficient online 3D perception. The model learns geometric priors from large-scale 3D data and achieves strong performance on both static and dynamic scenes.
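
A minimal sketch of the streaming pattern described above, with toy stand-ins for the encoder and decoder (not STream3R's real modules): each incoming frame attends only to cached features from past frames, so per-frame cost stays bounded as the stream grows.

```python
import numpy as np


def encode_frame(frame: np.ndarray) -> np.ndarray:
    """Toy per-frame tokenizer: (H, W, 3) -> (1, 3) token."""
    return frame.reshape(-1, 3).mean(axis=0, keepdims=True)


def causal_decode(tokens: np.ndarray, cache: list) -> np.ndarray:
    """Stand-in for causal attention over current + cached past tokens."""
    return np.zeros((64, 64, 3))  # dummy pointmap for the current frame


cache = []
for frame in np.random.rand(100, 64, 64, 3):  # incoming image stream
    tokens = encode_frame(frame)
    pointmap = causal_decode(tokens, cache)  # attends only to past context
    cache.append(tokens)                     # cache features, not raw frames
```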

Light-X: Generative 4D Video Rendering with Camera and Illumination Control
T. Liu, Z. Chen, Z. Huang, S. Xu, S. Zhang, C. Ye, B. Li, Z. Cao, W. Li, H. Zhao, Z. Liu
International Conference on Learning Representations, 2026 (ICLR)
[arXiv] [Project Page]

Light-X is a controllable 4D video generation framework that renders videos from monocular input with joint control over camera trajectory and illumination, supporting effects such as bullet time, dolly zoom, and text-guided relighting. By disentangling geometry and lighting cues, it produces more temporally consistent and realistic results than prior video relighting and novel-view methods.
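
Purely as an interface sketch (none of these names are Light-X's released API), the design can be read as two independent conditioning signals passed to one renderer, which is what makes camera effects and relighting composable:

```python
import numpy as np


def render_4d(video: np.ndarray, camera_traj: np.ndarray,
              lighting: str) -> np.ndarray:
    """Placeholder generative renderer: (T, H, W, 3) in and out."""
    return np.zeros_like(video)


clip = np.random.rand(48, 64, 64, 3)  # monocular input video
orbit = np.stack([np.eye(4)] * 48)    # per-frame 4x4 camera poses

# Move the camera, keep the light: bullet-time style effect.
out_a = render_4d(clip, orbit, "keep original lighting")

# Freeze the camera, change the light: text-guided relighting.
static = orbit[:1].repeat(48, axis=0)
out_b = render_4d(clip, static, "warm sunset light from the left")
```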

PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image
Z. Cao, F. Hong, Z. Chen, L. Pan, Z. Liu
in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2026 (CVPR)
[arXiv] [Project Page]

PhysX-Anything generates simulation-ready physical 3D assets from a single image, producing not just shape but also explicit geometry, articulation, and physical attributes, so the assets can be used directly in simulation and embodied AI. It also introduces a more efficient geometry representation and the PhysX-Mobility dataset with 2K+ real-world objects to support this task.
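
To illustrate what "simulation-ready" might look like as a data structure, here is a hedged sketch bundling geometry, articulation, and physical attributes into one asset. The fields are illustrative assumptions, not PhysX-Anything's actual output format.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class Joint:
    """One articulation joint, e.g. a cabinet door hinge."""
    name: str
    joint_type: str  # e.g. "revolute" or "prismatic"
    axis: tuple      # (x, y, z) motion axis
    limits: tuple    # (lower, upper) range in rad or m


@dataclass
class PhysicalAsset:
    """Geometry plus the extras a simulator needs to load the object."""
    vertices: np.ndarray  # (V, 3) mesh vertices
    faces: np.ndarray     # (F, 3) triangle indices
    joints: list = field(default_factory=list)  # articulation structure
    mass_kg: float = 1.0  # physical attributes
    friction: float = 0.5
```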

Scaling Video Matting via a Learned Quality Evaluator
P. Yang, S. Zhou, K. Hao, Q. Tao
in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2026 (CVPR)
[arXiv] [Project Page] [Demo]

MatAnyone 2 is a practical video matting framework that improves robustness in real-world scenes while preserving fine boundary details such as hair and motion blur. It introduces a learned Matting Quality Evaluator (MQE) to estimate matte quality without ground truth, enabling both better training supervision and large-scale data curation. Using this idea, we build VMReal, a large real-world video matting dataset with 28K clips and 2.4M frames, and further improve temporal robustness with a reference-frame training strategy for long videos.
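
The data-curation pattern the MQE enables can be sketched as follows; `score_matte` stands in for the trained evaluator, and the threshold is an assumed hyperparameter, not a value from the paper.

```python
import numpy as np


def predict_matte(clip: np.ndarray) -> np.ndarray:
    """Stand-in matting model: (T, H, W, 3) video -> (T, H, W) alpha."""
    return np.zeros(clip.shape[:3])


def score_matte(clip: np.ndarray, matte: np.ndarray) -> float:
    """Stand-in for the learned evaluator: quality in [0, 1], no GT needed."""
    return float(np.random.rand())


def curate(clips: list, threshold: float = 0.8) -> list:
    """Keep only clips whose predicted mattes the evaluator trusts."""
    kept = []
    for clip in clips:
        matte = predict_matte(clip)
        if score_matte(clip, matte) >= threshold:
            kept.append((clip, matte))  # becomes pseudo-labeled training data
    return kept


dataset = curate([np.random.rand(8, 64, 64, 3) for _ in range(5)])
```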
