Can video generative models exhibit visuospatial intelligence? 🤔 Introducing Video4Spatial, a video-only framework that tackles spatial tasks. With just video context, our model can:
🔍 Ground objects by planning geometry-consistent paths
📸 Follow camera-pose instructions for scene navigation
🌐 Generalize to long contexts & unseen outdoor scenes
A...

Xingang Pan

3,265 subscribers

15,902 views • 6 months ago • via X (Twitter)

Similar videos

New research with Tsinghua University: Spatial-TTT. A framework for streaming visual-based spatial intelligence with test-time training (TTT). Spatial-TTT adapts fast weights to capture and organize spatial evidence from long video streams, enabling models to build structured 3D spatial memory over time. Highlights:
🔹 Efficient streaming memory. Fast weights act as compact spatial memory with sublinear memory growth over 7000+ frames and more than 40% lower compute.
🔹 Spatial-predictive mechanism. TTT layers with 3D spatiotemporal convolution capture geometric correspondence and temporal continuity.
🔹 SOTA results on long-horizon video spatial understanding (VSI-Bench).
The paper ranked #1 on Hugging Face Daily Papers on March 13.

Tencent Hy

20,792 views • 2 months ago

Robot AI brains, aka Vision-Language-Action models, cannot adapt to new tasks as easily as LLMs like Gemini, ChatGPT, or Grok. LLMs can adapt quickly with their in-context learning (ICL) capabilities. But can we inject ICL abilities into a pre-trained VLA like pi0? Yes! Introducing RICL (Retraining for In-Context Learning), our Conference on Robot Learning (CoRL) 2025 paper. Our RICL-pi0 model can adapt to unseen objects, novel motions, and new scenes with just ICL and RAG (retrieval-augmented generation). RICL-pi0 also boosts performance on the long tail of tasks. A quick 1-minute video summary:

Kaustubh Sridhar

52,158 views • 9 months ago

Spatial reasoning is a major challenge for today's foundation models, even in simple tasks like arranging objects in 3D space. #CVPR2025 Introducing LayoutVLM, a differentiable optimization framework that uses a VLM to spatially reason about diverse scene layouts from unlabeled assets and open-ended language instructions. 1/n

Fan-Yun Sun

92,514 views • 1 year ago

📢 Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation. Got only one or a few images and wondering if recovering the 3D environment is a reconstruction or generation problem? Why not do it with a generative reconstruction model! We show that a camera-conditioned video diffusion model can be transformed into a generative reconstruction model that directly outputs a high-quality 3D Gaussian Splatting representation through self-distillation, without requiring real-world training data. Check out our results in the video (wait for dynamic scenes in the second half!).

Sherwin Bahmani

66,327 views • 8 months ago

Ray2 Keyframes, Extend & Loop is now here. Create your visual story with precise frame-by-frame control, seamless transitions, spatial exploration and long-form video durations using #Ray2 Text-to-Video and Image-to-Video models in Dream Machine.

Luma

7,362,289 views • 1 year ago

SceneScript treats 3D reconstruction as a language problem rather than a geometry one. The model watches a video of a room and just learns to write a script for it. It autoregressively spits out text commands like make_wall(...) or make_bbox(...) that define the scene. Stanford's new "Scene Language" paper goes a step further, adding CLIP embeddings to capture visual appearance too. The fact that language models already understand spatial relationships well enough to write out scene graphs is pretty wild.

Bilawal Sidhu

107,011 views • 11 months ago

*Why panorama?* Standard video models struggle with object permanence—if a camera pans away and comes back, objects may disappear. With panoramas, the model is forced to generate everything in the scene. This serves as a "working memory" for consistent world generation. (3/N)

Ziyi Wu

21,992 views • 4 months ago

1/N Most Vision-Language-Action models need tons of data for finetuning, and still fail for new objects and instructions. Introducing OTTER, a lightweight, easy-to-train model that uses text-aware visual features to nail unseen tasks out of the box! Here's how it works 👇

Fangchen Liu

68,288 views • 1 year ago

Introducing Cambrian-S: it's a position, a dataset, a benchmark, and a model, but above all, it represents our first steps toward exploring spatial supersensing in video. 🧶

Saining Xie

257,992 views • 7 months ago

We present VLM-3R: a Vision-Language Model capable of 3D spatial reasoning from monocular video, grounding visual cues, geometry, and camera motion.
✅ No depth sensor
✅ No pre-built 3D maps
✅ End-to-end spatial + temporal reasoning
#VLM #3DVision #LLMs

Zhiwen(Aaron) Fan

14,895 views • 1 year ago

Introducing MegaSaM! 🎥 Accurate, fast, & robust structure + camera estimation from casual monocular videos of dynamic scenes! MegaSaM outputs camera parameters and consistent video depth, scaling to long videos with unconstrained camera paths and complex scene dynamics!

Zhengqi Li

56,923 views • 1 year ago

Can VLMs build Spatial Mental Models like humans? Reasoning from limited views? Reasoning from partial observations? Reasoning about unseen objects behind furniture / beyond current view? Check out MindCube! 🌐 📰 🤗 👩‍💻

Manling Li

40,959 views • 11 months ago

Introducing Runway Aleph, a new way to edit, transform and generate video. Aleph is a state-of-the-art in-context video model, setting a new frontier for multi-task visual generation, with the ability to perform a wide range of edits on an input video such as adding, removing and transforming objects, getting new angles of a scene and modifying style and lighting, among many other tasks.

Runway

646,259 views • 10 months ago

Large-scale Gaussian splats have reached a new level of realism. This is a well-known temple in Bangkok, reconstructed as a high-fidelity 3D environment from 360 captures. At this level, the boundary between video and 3D starts to disappear. But what you’re looking at is not a video. It’s a dense spatial representation of a real place, where geometry, texture, and structure are preserved and made machine-readable. This kind of 3D data can power Visual AI, Robotics navigation, VPS localization, XR experiences, world models, and next-generation spatial computing systems. Built with Over the Reality.

Over the Reality 🌐

346,640 views • 21 days ago

Spatial reconstruction is a long-context problem: real scenes come with hundreds of images. But O(N²) transformer-based models don’t scale efficiently. Introducing: 🤐ZipMap (CVPR ’26): Linear-Time, Stateful 3D Reconstruction via Test-Time Training (TTT). ZipMap “zips” a large image collection into an implicit TTT scene state in a single linear-time operation. The state is then decoded into spatial outputs and can be queried efficiently for novel-view geometry and appearance (~100 FPS). ZipMap is not only much faster (>20× faster than VGGT), but also matches or surpasses the accuracy of all SOTA models.

Haian Jin@CVPR

77,386 views • 3 months ago

Excited to share MonST3R, a simple way to estimate geometry from unposed videos of dynamic scenes! We achieve competitive results on several downstream tasks (video depth, camera pose) and believe this is a promising step toward feed-forward 4D reconstruction.

Junyi Zhang @CVPR

131,523 views • 1 year ago

(1/n) Time to unify your favorite visual generative models, VLMs, and simulators for controllable visual generation—Introducing a Product of Experts (PoE) framework for inference-time knowledge composition from heterogeneous models.

Yunzhi Zhang

48,871 views • 1 year ago

Introducing ChatGPT Images 2.0: a state-of-the-art image model that can take on complex visual tasks and produce precise, immediately usable visuals, with sharper editing, richer layouts, and thinking-level intelligence. Video made with ChatGPT Images.

OpenAI

12,856,167 views • 1 month ago

AnchorWeave: World-Consistent Video Generation with Retrieved Local Spatial Memories

AK

23,312 views • 3 months ago