📢 3D world models from video diffusion suffer from inconsistent frames -> blurry output. Our fix: instead of naïve 3D reconstruction, we non-rigidly align each frame into a globally consistent 3DGS representation -> sharp visuals on top of any VDM!

39,707 views • 2 months ago • via X (Twitter)

0 ๆก่ฏ„่ฎบ

ๆš‚ๆ— ่ฏ„่ฎบ

ๅŽŸๅง‹ๅธ–ๅญ็š„่ฏ„่ฎบๅฐ†ๆ˜พ็คบๅœจ่ฟ™้‡Œ

Related videos

📢📢 PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing 📢📢

PercHead reconstructs realistic 3D heads from a single image and enables disentangled 3D editing via geometric controls and style inputs from images or text. At its core is a generalized 3D head decoder trained with perceptual supervision from DINOv2 and SAM 2.1. We find that our new perceptual loss formulation improves reconstruction fidelity compared to commonly used methods such as LPIPS.

Our trained reconstruction model generates 3D-consistent heads from a single input image. Even with challenging side-view inputs, the model robustly infers missing regions for a coherent, high-fidelity output.

In addition, our architecture adapts to downstream tasks: by swapping the encoder, we can turn the model into a disentangled 3D editing pipeline. In this setting, we control geometry through segmentation maps, which can be hand-drawn, and condition style via an image or a text prompt. We also provide an interactive GUI for exploring the editing pipeline.

Great work by Antonio Oroz and Tobias Kirschstein

Matthias Niessner

18,808 views • 7 months ago

🚀 Announcing Echo, our new frontier model for 3D world generation. Echo turns a simple text prompt or image into a fully explorable, 3D-consistent world. Instead of disconnected views, the result is a single, coherent spatial representation you can move through freely.

This is part of a bigger shift in AI: from generating pixels and tokens to generating spaces. Echo predicts a geometry-grounded 3D scene at metric scale, meaning every novel view, depth map, and interaction comes from the same underlying world, not independent hallucinations.

Once generated, the world is interactive in real time. You control the camera, explore from any angle, and render instantly, even on low-end hardware, directly in the browser. High-quality 3D world exploration is no longer gated by expensive equipment.

Under the hood, Echo infers a physically grounded 3D representation and converts it into a renderable format. For our web demo, we use 3D Gaussian Splatting (3DGS) for fast, GPU-friendly rendering, but the representation itself is flexible and can be easily adapted.

Why this matters: consistent 3D worlds unlock real workflows such as digital twins, 3D design, game environments, robotics simulation, and more. From a single photo or a line of text, Echo builds worlds that are reliable, editable, and spatially faithful.

Echo also enables scene editing and restyling. Change materials, remove or add objects, explore design variations, all while preserving global 3D consistency. Editing no longer breaks the world.

This is only the beginning. Echo is the foundation for future world models with dynamics, physical reasoning, and richer interaction: environments that don't just look right, but behave right.

Explore the generated worlds on our website and sign up for the closed beta. The era of spatial intelligence starts here.

#Echo #WorldModels #SpatialAI #3DFoundationModels

Check it out:

SpAItial AI

175,302 views • 5 months ago