We present VLM-3R: a Vision-Language Model capable of 3D spatial reasoning from monocular video, grounding visual cues, geometry, and camera motion. ✅ No depth sensor ✅ No pre-built 3D maps ✅ End-to-end spatial + temporal reasoning 🔗 Code & benchmark: #VLM #3DVision #LLMs
14,895 views • 1 year ago • via X (Twitter)
4 comments

Lennie Budgell ❇️ 1 year ago
This is some really great stuff. I have been looking forward to seeing it come into existence with this kind of accuracy and range of uses. Finally. Thanks, y'all, excited to start playing around with the code.

PowerBeatsVR 3 years ago
VR fitness app PowerBeatsVR is NOW LIVE on the official Meta Quest store! Get fit in VR without any expensive subscription:

Wenbo Hu 1 year ago
Great work! I have a general question: why is CUT3R preferred over VGGT as the spatial encoder?

Zhiwen(Aaron) Fan 1 year ago
Great question. We’re aiming to equip VLMs with metric-scale geometric sensing.
