
Zhiwen(Aaron) Fan
@zhiwen_fan_ • 1,804 subscribers
Assistant Prof @ Texas A&M ECE @TAMU | Spatial Foundation Models

InstantSplat++ is now open source. It is a lightweight library that connects foundation models (VGGT, MASt3R, MAP-Anything, etc.) with the Gaussian splatting family. Given uncalibrated images, it optimizes a 3D scene in a few seconds. Try the demo and code here:
Zhiwen(Aaron) Fan • 31,748 views • 3 months ago

Speed up your view synthesis (<40s) with #InstantSplat! Our large-scale, pose-free method trains in just 37 seconds from sparse views: no #COLMAP, no intrinsics needed. It achieves nearly 30 dB test PSNR with just 12 images, setting a new standard in #NVS and in training efficiency. Project page 👉 Paper 📷:
Zhiwen(Aaron) Fan • 108,763 views • 2 years ago

🚀 Our NeurIPS '24 work, Large Spatial Model (LSM), is here! LSM performs feed-forward semantic 3D reconstruction from unposed data in just 0.1s. 👉 It leverages large-scale 3D datasets with minimal annotations to define a 3D latent space. We are continuing to explore how this explicit 3D representation can further enhance reasoning and robotic learning. 🔗 Try our online Gradio demo with your own data at #NeurIPS2024 #3DReconstruction
Zhiwen(Aaron) Fan • 43,651 views • 1 year ago

What happens when VLMs meet 3D foundation models? See VLM-3R (CVPR 2026). VLM-3R links a vision-language model (e.g., Qwen) with 3D geometric foundation models (e.g., CUT3R) at metric scale. Given an uncalibrated video, it moves beyond pixels to perceive and reason in 3D space. Code (open source):
Zhiwen(Aaron) Fan • 10,595 views • 3 months ago

We present VLM-3R: a Vision-Language Model capable of 3D spatial reasoning from monocular video, grounding visual cues, geometry, and camera motion. ✅ No depth sensor ✅ No pre-built 3D maps ✅ End-to-end spatial + temporal reasoning 🔗 Code & benchmark: #VLM #3DVision #LLMs
Zhiwen(Aaron) Fan • 14,895 views • 1 year ago