Google presents Genie: Generative Interactive Environments. We introduce Genie, the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos. The model can be prompted to generate an endless variety of action-controllable virtual worlds described through text, synthetic images, photographs, and even sketches. At 11B parameters, Genie... can be considered a foundation world model. It comprises a spatiotemporal video tokenizer, an autoregressive dynamics model, and a simple and scalable latent action model. Genie enables users to act in the generated environments on a frame-by-frame basis despite training without any ground-truth action labels or other domain-specific requirements typically found in the world model literature. Further, the resulting learned latent action space facilitates training agents to imitate behaviors from unseen videos, opening the path for training generalist agents of the future.

AK

512,876 subscribers

684,372 views • 2 years ago • via X (Twitter)

Science & Technology
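The description above names three components: a video tokenizer, a dynamics model, and a latent action model that lets a user act frame by frame. The following is a toy sketch of that interaction loop only, not the Genie implementation. Every function name, the bin-based tokenizer, the cyclic-shift "dynamics", and the number of latent actions are hypothetical stand-ins chosen so the example runs without any trained model.

```python
from typing import List

Tokens = List[int]  # one frame, flattened into discrete tokens

NUM_LATENT_ACTIONS = 8  # assumption for this sketch: size of the discrete action set


def encode_frame(pixels: List[int]) -> Tokens:
    """Toy tokenizer: quantize each pixel value (0-255) into one of 16 bins."""
    return [p // 16 for p in pixels]


def decode_frame(tokens: Tokens) -> List[int]:
    """Toy detokenizer: map each token back to the centre of its bin."""
    return [t * 16 + 8 for t in tokens]


def dynamics_step(history: List[Tokens], action: int) -> Tokens:
    """Toy dynamics: cyclically shift the latest frame's tokens by `action`.

    A learned dynamics model would instead predict the next frame's tokens
    from all past frames and latent actions.
    """
    if not 0 <= action < NUM_LATENT_ACTIONS:
        raise ValueError("unknown latent action")
    last = history[-1]
    k = action % len(last)
    return last[-k:] + last[:-k] if k else list(last)


def play(prompt_pixels: List[int], actions: List[int]) -> List[List[int]]:
    """Tokenize the prompt frame, then generate one new frame per chosen action."""
    history = [encode_frame(prompt_pixels)]
    for action in actions:
        history.append(dynamics_step(history, action))
    return [decode_frame(tokens) for tokens in history]
```

Calling `play([0, 16, 32, 48], [1, 0])` returns three frames: the decoded prompt, a frame shifted by one position, and an unchanged copy of it. The point of the sketch is the control flow: the user supplies one discrete action per step and the model extends the token history by one frame each time.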

9 Comments

AK • 2 years ago

paper page:

AK • 2 years ago

project page:

XR Multiverse • 2 years ago

Google's warehouse of unused tech

crispyshh • 2 years ago

@apples_jimmy Text to Mario achieved externally

meowbooks --🩸/acc • 2 years ago

that's so cool

Matt Griswold • 2 years ago

Can it make Wolfenstein? If not, why not.

Smoke-away • 2 years ago

🔥🔥🔥

Mat • 2 years ago

all these papers, no public releases 🫠

Ollin Boer Bohan • 2 years ago

Demo page with more videos.

Similar Videos

Introducing Genie 2: our AI model that can create an endless variety of playable 3D worlds - all from a single image. 🖼️ These types of large-scale foundation world models could enable future agents to be trained and evaluated in an endless number of virtual environments. →

Google DeepMind

1,441,309 views • 1 year ago

Today we are announcing Genie 3, a general purpose world model by Google DeepMind that can generate dynamic, interactive environments with a single text prompt. World models are AI that understand facets of the world (like Veo's knowledge of intuitive physics or Genie's mastery of new environments), and serve as a key stepping stone on the path to AGI. Genie 3 is our first world model to allow interaction in real-time, while also improving consistency and realism compared to Genie 2. Learn more ➡️

Google AI

102,615 views • 11 months ago

Today, we're joined by Jack Parker-Holder and Shlomi Fruchter, researchers at Google DeepMind, to discuss the recent release of Genie 3, a model capable of generating “playable” virtual worlds. We dig into the evolution of the Genie project and review the current model’s scaled-up capabilities, including creating real-time, interactive, and high-resolution environments. Jack and Shlomi share their perspectives on what defines a world model, the model's architecture, and key technical challenges and breakthroughs, including Genie 3’s visual memory and ability to handle “promptable world events.” Jack, Shlomi, and Sam share their favorite Genie 3 demos, and discuss its potential as a dynamic training environment for embodied AI agents. Finally, we will explore future directions for Genie research. 🗒️ For the full list of resources for this episode, visit the show notes page: 📖 CHAPTERS =============================== 00:00 - Introduction 7:11 - What is a world model? 14:49 - Milestones of Genie research 24:32 - Genie 3 27:46 - Challenges 30:07 - Genie 3 examples 33:48 - Model capabilities 35:49 - Key aspects of the model 39:40 - Consistency as an emergent property 42:11 - Promptable world events 47:24 - SIMA agent 50:56 - Limitations 56:08 - Future directions

The TWIML AI Podcast

12,043 views • 11 months ago

An interactive world model developed by NVIDIA in collaboration with academic partners.
- DreamDojo turns egocentric human video data into physical intelligence.
- Human data is more scalable than robotics data but lacks action labels.
- To solve this, a dedicated action model extracts latent actions by identifying physics and motion deltas between frames.
Training
- A massive 44k hours of video data are used for pre-training.
- Post-training on small-scale robot datasets maps human physics to specific robot embodiments.
- An additional distillation stage converts the model into an autoregressive, few-step diffusion model, enabling real-time, action-controllable simulation.
Primary Use Cases
- Live Teleoperation: Controlling a robot inside a world simulation in real-time.
- Model-based Planning: Previewing and curating the best actions for improved success.
- Policy Evaluation: Testing robot policies in realistic, out-of-distribution scenarios.
Everything that's open-sourced: weights, code, post-training dataset, eval set, and details to reproduce.

The Humanoid Hub

11,575 views • 5 months ago

The World Model as NEO's Cognitive Core. 1X has revealed a major AI development where the NEO humanoid can translate any natural language prompt into robotic action. It demonstrates this capability even for novel tasks, objects, and environments not found in its robot dataset.
- The 1X World Model is trained on internet-scale human interaction videos and fine-tuned with robot data to ground its understanding in physics and in NEO's embodiment.
- From a simple voice or text prompt, the world model generates a visualization of future actions.
- A built-in inverse dynamics model then translates these into precise motor movements for NEO.

The Humanoid Hub

68,453 views • 6 months ago

Tencent presents GameGen-O: Open-world Video Game Generation. We introduce GameGen-O, the first diffusion transformer model tailored for the generation of open-world video games. This model facilitates high-quality, open-domain generation by simulating a wide array of game engine features, such as innovative characters, dynamic environments, complex actions, and diverse events. Additionally, it provides interactive controllability, thus allowing for gameplay simulation. The development of GameGen-O involves a comprehensive data collection and processing effort from scratch. We collect and build the first Open-World Video Game Dataset (OGameData), amassing extensive data from over a hundred next-generation open-world games and employing a proprietary data pipeline for efficient sorting, scoring, filtering, and decoupled captioning. This robust and extensive OGameData forms the foundation of our model's training process. GameGen-O undergoes a two-stage training process, consisting of foundation model pretraining and instruction tuning. In the first phase, the model is pre-trained on OGameData via text-to-video and video continuation, endowing GameGen-O with the capability for open-domain video game generation. In the second phase, the pre-trained model is frozen and fine-tuned using a trainable InstructNet, which enables the production of subsequent frames based on multimodal structural instructions. This whole training process imparts the model with the ability to generate and interactively control content. In summary, GameGen-O represents a notable initial step forward in the realm of open-world video game generation via generative models. It underscores the potential of generative models to serve as an alternative to rendering techniques, which can efficiently combine creative generation with interactive capabilities.

AK

367,088 views • 1 year ago

GameFactory: Creating New Games with Generative Interactive Videos. We present GameFactory, a generalizable world model that learns from a small-scale dataset of Minecraft game videos. By leveraging the prior knowledge of a pretrained video diffusion model, it can create new games in an open domain.

AK

70,029 views • 1 year ago

🎮 AI from Google creates entire worlds in seconds. Google DeepMind has unveiled a new AI model called Genie 3, which generates three-dimensional interactive environments based on a simple text prompt. The model creates fully functional virtual worlds, like in a video game, where you can freely move around. The main difference from previous versions: these worlds last not just seconds, but several minutes. That’s enough time to explore, test, or even train another AI.

NEXTA

42,068 views • 11 months ago

Genie 3 from Google could compress robotics R&D: generate synthetic trajectories at scale, then fine-tune once on hardware. This is truly a turning point for world models 🌍: we can now build real-time, fully interactive simulations that run for several minutes and bring any world we can imagine to life. This might be the missing puzzle piece for embodied AGI. ⚡ Real-time engine under the hood: During generation the network predicts a fresh frame every 42 milliseconds while looking back over the entire action history. That history grows each step, so the compute graph must cache and retrieve past features fast or the frame rate tanks. The team built a custom attention window that skims for “landmarks” in the latent space, letting the model fetch relevant context quickly enough to hit real-time speed. With Genie 3, the environment becomes a latent variable, and agents overfit less because the benchmark turns into a distribution. Some truly wild examples 👇

Rohan Paul

16,124 views • 11 months ago

A humanoid robot policy trained solely on synthetic data generated by a world model. Research Scientist Joel Jang presents NVIDIA's DreamGen pipeline: ⦿ Post-train the world model Cosmos-Predict2 with a small set of real teleoperation demos. ⦿ Prompt the world model to generate synthetic video data with verbs and scenarios not used in the world model’s post-training. ⦿ Auto-label synthetic video data with action sequences. ⦿ Train robot policies using only synthetic data. That's it. Deploy zero-shot to a real humanoid robot.

The Humanoid Hub

20,968 views • 1 year ago

1/ NitroGen: NVIDIA's new image-to-action model! NitroGen, a vision-action foundation model for generalist gaming agents that is trained on 40,000 hours of gameplay videos across more than 1,000 games. Gaming is a significant factor in AI training. Google DeepMind trained AI early on Starcraft 2, and OpenAI on Dota 2. This new product from NVIDIA is therefore extremely important. Why it matters and how it works:

Chubby♨️

55,280 views • 7 months ago

AI agents are about to redefine the internet. The mistake we made with Large Language Models? We let a handful of corporations capture all the value. Action Model is building a different path. By training through our extension, users gain fractional ownership in the Large Action Model, giving them a real stake in the future of AI. When LLMs emerged, the upside flowed to Big Tech. This time, it doesn’t have to. They’re building AI on our data, and keeping the upside for themselves. Community-owned Large Action Model is how we take it back.

Action Model

76,939 views • 5 months ago

Diffusion has shown great promise for generating robot actions; can it act as a world model to generate the future conditioned on actions? In our work led by Han Qi and Haocheng Yin, in collaboration with Yilun Du, we show that a controllable action-conditioned video diffusion model can produce photorealistic and (near) physics-accurate future predictions. This ability strengthens the policy via:
- ranking different action proposals and selecting the best, or
- visual trajectory optimization by optimizing the action proposals using gradient ascent.
Learn more about Generative Predictive Control (GPC) at:

Heng Yang

38,428 views • 1 year ago
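The first option in the post above, ranking action proposals with a world model and keeping the best, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the GPC implementation. The "world model" here is a hypothetical additive toy over 2D positions rather than a video diffusion model, and the scoring function (negative squared distance to a goal) and all names are invented for the example.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, float]   # toy 2D position; stands in for a predicted future frame
Action = Tuple[float, float]  # toy 2D displacement

WorldModel = Callable[[State, Action], State]


def toy_world_model(state: State, action: Action) -> State:
    """Hypothetical stand-in for a learned predictor: next state = state + action."""
    return (state[0] + action[0], state[1] + action[1])


def rollout(model: WorldModel, start: State, actions: Sequence[Action]) -> State:
    """Apply the model step by step and return the predicted final state."""
    state = start
    for action in actions:
        state = model(state, action)
    return state


def score(state: State, goal: State) -> float:
    """Higher is better: negative squared distance from the predicted state to the goal."""
    return -((state[0] - goal[0]) ** 2 + (state[1] - goal[1]) ** 2)


def rank_proposals(
    model: WorldModel,
    start: State,
    proposals: Sequence[Sequence[Action]],
    goal: State,
) -> List[int]:
    """Return proposal indices ordered from best to worst predicted outcome."""
    scores = [score(rollout(model, start, p), goal) for p in proposals]
    return sorted(range(len(proposals)), key=lambda i: scores[i], reverse=True)
```

With a start of `(0.0, 0.0)`, a goal of `(2.0, 2.0)`, and the proposals `[[(1.0, 0.0), (1.0, 0.0)], [(0.0, 1.0)], [(1.0, 1.0), (1.0, 1.0)]]`, `rank_proposals` returns `[2, 0, 1]`, so the policy would execute the third proposal. The second option in the post, gradient-based optimization of the proposals, would need a differentiable model and is not shown here.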

🔥 Really excited to see the release of the PAN world model, a project I have been working on over the past years. PAN is a general world model capable of simulating physical, agentic, and nested worlds, synthesizing infinite interactive experiences for training AI agents. Building on top of pretrained LLMs and video diffusion models, PAN connects language, perception, action, and latent thoughts, for long-horizon simulation and reasoning. PAN shows overwhelming performance gains over JEPA-2, Cosmos-2, and other prior models. More in the thread👇 ... 1/

Zhiting Hu

31,195 views • 8 months ago

New open-source 3D world-generation model. I'm rendering a couple of worlds in the video, so check it out. You'll find the GitHub and the Hugging Face links to the model below. This is a multi-modal world model that you can use for a bunch of things: • To generate new worlds • To reconstruct worlds • To simulate 3D interactive worlds from a prompt, images, or a video You can edit the 3D outputs in Unity and Unreal Engine (they export as meshes, 3DGS files, and point clouds). You can also generate 3D characters in the world and walk around. Pretty fun stuff!

Santiago

65,446 views • 3 months ago

VLA-JEPA just dropped in LeRobot 🤖 What makes this model special is that it does not just learn what action to take from a given observation, it also leverages a JEPA world model to learn action-relevant dynamics. During training, the VLA leverages V-JEPA2 by conditioning its predictor. This clever trick adds a world modeling objective to the training, which also allows pretraining on human videos. At inference, the world model is dropped entirely, keeping only a standard VLA architecture: Qwen backbone and action head. The demo here was only fine-tuned on 13 examples, showing great pretraining capability and running in real time on NVIDIA Robotics DGX Spark! VLA-JEPA is the first world model to be ported to LeRobot, and I feel like it won't be the last 🚀 Thomas Wolf clem 🤗

LeRobot

318,321 views • 1 month ago

Turn any world model from World Labs into a multi-user space for exploration and collaboration with FRAME. Voice chat, embodied AI agents, and tons more you can layer on top of the model foundation. This took all of 10 seconds to get going. The metaverse is so back. 😄

Gabriel Baker

21,608 views • 10 months ago

Today we're announcing #GAIA1: a 9B parameter world model, trained on 4,700 hours of driving data, able to simulate complex and diverse driving scenes from video, text and action inputs. This model is 480x larger than the preview we shared earlier this year and the results are incredible. These videos are entirely synthetically generated by Wayve's generative AI, GAIA-1. But there is more here than just generating videos, GAIA is an entire world model. A world model allows us to simulate the future, conditioned on video, text and action inputs, which can be leveraged for making informed decisions when driving. Why is this game-changing for autonomous driving? 1. Safety. One limitation with AI systems like today's Large Language Models is that they are autoregressive, next-word prediction algorithms, but aren't necessarily aware of the implications of their decisions. A world model allows us to give our AI the capability to be aware of its decisions, by simulating the future, which is important for self-driving safety. 2. Synthetic training data. I believe synthetic training data is the future for AI, because it is safer, cheaper, and infinitely scalable. GAIA-1 unlocks unprecedented realism and diversity of synthetic data for self-driving. 3. Long-tail robustness. One of the biggest challenges for self-driving is long-tail robustness: dealing with the enormous magnitude of edge cases we see on the road. An advantage of generative AI is its incredible ability to recombine experiences in new ways. This is exciting for self-driving as it means we can learn from two edge case scenarios, and combine them to become a corner case. For example, we can experience driving in fog, and experience of jay-walking pedestrians, and GAIA can learn from these experiences to understand how to generate a fog+jay walking scenario. 
Check out many more videos in our blog or further technical details in our paper: Or come chat with our team who are at the International Conference on Computer Vision (#ICCV2023) this week in Paris in Booth 32 Jamie Shotton

Alex Kendall

631,856 views • 2 years ago

📢 Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation Got only one or a few images and wondering if recovering the 3D environment is a reconstruction or generation problem? Why not do it with a generative reconstruction model! We show that a camera-conditioned video diffusion model can be transformed into a generative reconstruction model that directly outputs a high-quality 3D Gaussian Splatting representation through self-distillation, without requiring real-world training data. Check out our results in the video (wait for dynamic scenes in the second half!) : Project Page: Code and Models: Paper:

Sherwin Bahmani

66,590 views • 10 months ago

Yann LeCun just revealed why he left Meta, why LLMs are an extremely narrow field, and how a "world model" is the way to build a really meaningful agentic future. 🎯 Beautiful and simple in 1.5 minutes. "I can’t imagine we can build agentic systems without those systems having the ability to predict, in advance, what the consequences of their actions are going to be. The way we act in the world is that we can predict the consequences of our actions, and that’s what allows us to plan. So what is a world model? Given the state of an environment, a system you want to control at time t, and given an action or intervention you imagine taking, can you predict the state of the world (or the system) at time t + 1? If you can, that’s a world model. You don’t do this at a pixel level, if it’s video. You do this in an abstract representation space, and that’s a crucial key insight." --- From the "AI House Davos" YT channel (full link in comment)

Rohan Paul

383,162 views • 6 months ago
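The definition in the quote above can be written compactly. The notation is introduced here for illustration and is not taken from the talk: $x_t$ is the raw observation (for example a video frame), $\mathrm{Enc}$ an encoder into an abstract representation $z_t$, $a_t$ the imagined action, $f_\theta$ the learned predictor, and $C$ a hypothetical cost used for planning.

```latex
z_t = \mathrm{Enc}(x_t), \qquad
\hat{z}_{t+1} = f_\theta(z_t, a_t), \qquad
a_t^{\star} = \arg\min_{a} \, C\big(f_\theta(z_t, a)\big)
```

The first two expressions state the world model itself: prediction happens on $z_t$ rather than on pixels. The third is one simple way to read "that's what allows us to plan": pick the action whose predicted consequence has the lowest cost.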