World modeling and imitation learning have largely been treated as two disparate worlds. In our recent work, Unified World Models, just accepted to #RSS2025, Chuning Zhu provides a dead-simple unifying solution: train a joint diffusion model over actions and future states, but with decoupled diffusion timesteps across these modalities. Manipulating these decoupled timesteps then allows for marginalization or conditioning on actions or states; a single model can serve as a policy, forward dynamics model, video prediction model, or inverse dynamics model simply by setting the diffusion timesteps carefully. The resulting model can leverage video datasets along with robot training data much more effectively, and shows improved robustness, generalization, and flexibility. This is exciting because it is frustratingly simple, scalable, and shows strong improvement on real-world robotics problems. Please refer to Chuning Zhu's excellent thread for more details! More details and code can be found on our website and in the paper.

Abhishek Gupta

10,202 subscribers

11,388 views • 1 year ago • via X (Twitter)
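
The timestep trick described in the post can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the names `T_MAX`, `TimestepPlan`, `PLANS`, and `timesteps_for` are hypothetical, and the convention used here (pin a modality at the maximum timestep to marginalize it out as pure noise, pin it at zero to condition on a clean value, sweep it to sample it) is one reading of "marginalization or conditioning" in the post.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

T_MAX = 1000  # hypothetical number of diffusion steps; not taken from the paper


@dataclass(frozen=True)
class TimestepPlan:
    """Per-modality timestep setting.

    None means the modality is swept from T_MAX down to 0,
    i.e. it is the quantity being sampled.
    """
    action_t: Optional[int]
    state_t: Optional[int]


# Assumed convention: pinning a modality at T_MAX feeds pure noise
# (marginalizes it out); pinning it at 0 feeds a clean observed value
# (conditions on it).
PLANS = {
    "policy": TimestepPlan(action_t=None, state_t=T_MAX),
    "video_prediction": TimestepPlan(action_t=T_MAX, state_t=None),
    "forward_dynamics": TimestepPlan(action_t=0, state_t=None),
    "inverse_dynamics": TimestepPlan(action_t=None, state_t=0),
}


def timesteps_for(mode: str, step: int) -> Tuple[int, int]:
    """Return (action_t, state_t) to pass to the denoiser at sampling step `step`."""
    plan = PLANS[mode]
    action_t = step if plan.action_t is None else plan.action_t
    state_t = step if plan.state_t is None else plan.state_t
    return action_t, state_t


for mode in PLANS:
    print(mode, timesteps_for(mode, 500))
```

In this sketch the same denoiser would be called in every mode; only the pair of timesteps changes, which is what lets one model play all four roles.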

Related Videos

Today, we are releasing Stable Video Diffusion, our first foundation model for generative AI video based on the image model, Stable Diffusion. As part of this research preview, the code, weights, and research paper are now available. Additionally, today you can sign up for our waitlist to access a new upcoming web experience featuring a Text-To-Video interface. To access the model & sign up for our waitlist, visit our website here:

Stability AI

1,024,415 views • 2 years ago

Evaluating policies on a real robot can be painful. Can we use a world model to get a rough estimate of how good a policy is? Check out "Evaluating Robot Policies in a World Model". Paper: Demo: Code:

Sherry Yang

36,568 views • 1 year ago

A viral paper, "Language Models Represent Space and Time", recently claimed that LLMs learn "world models". As much as I like Max Tegmark's work, I disagree with their definition of a world model. The world model is a core concept in AI agents and decision making. It is our mental simulation of how the world works given interventions (or lack thereof). A world model captures causality and intuitive physics, telling the agent what is likely and what is impossible. It can and should be used for counterfactual reasoning, i.e. "what ifs": what would happen if I knock over a cup of water? Where would I have been if I had not taken that bus? Yann LeCun says it well in his position paper. I quote: "Using such world models, animals can learn new skills with very few trials. They can predict the consequences of their actions, they can reason, plan, explore, and imagine new solutions to problems. Importantly, they can also avoid making dangerous mistakes when facing an unknown situation." The first use of the term World Model in deep policy learning is attributed to hardmaru & Jürgen Schmidhuber. In their seminal paper, an agent masters shooting skills in the popular game Doom (demo below) by learning in imagination, using an internal world model as a "physics simulator". To put it in a simple Python-style formula, a world model learns a function F(s[0:t-1], a) -> s[t:], which takes as input the observed past and the current action, and outputs plausible future states. Now, the definition of World Model in Tegmark's paper seems to be about predicting GPS coordinates and time eras. I see this as just a classification task with no causal learning or simulation going on. You cannot make meaningful interventions against that model, nor can you optimize any decision making in a closed feedback loop.
As for the "space & time neurons", I think they are most similar to the "sentiment neuron" that OpenAI published in 2017. Predicting GPS is conceptually no different from predicting sentiment, in my opinion. I don't think their experimental results are wrong, just that their conclusion is on shaky ground. I welcome any debate! Paper link:

Jim Fan

593,943 views • 2 years ago
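
The signature F(s[0:t-1], a) -> s[t:] from the post above can be made concrete with a toy example. The sketch below is purely illustrative: `toy_world_model` and `best_action` are hypothetical names, states and actions are plain floats, and the hand-written dynamics (each step adds the action to the last state) stand in for what would be a learned model. It shows the two things the post says a world model must support: rolling out plausible futures for a given action, and comparing "what if" interventions to choose between them.

```python
from typing import Callable, List, Sequence

State = float
Action = float
# F(past_states, action, horizon) -> predicted future states
WorldModel = Callable[[Sequence[State], Action, int], List[State]]


def toy_world_model(past: Sequence[State], action: Action, horizon: int) -> List[State]:
    """Toy stand-in for F: each future step adds the action to the last state."""
    state = past[-1]
    future = []
    for _ in range(horizon):
        state = state + action
        future.append(state)
    return future


def best_action(model: WorldModel, past: Sequence[State],
                candidates: Sequence[Action], goal: State, horizon: int) -> Action:
    """Counterfactual planning: imagine each candidate action and keep the
    one whose final imagined state lands closest to the goal."""
    return min(candidates, key=lambda a: abs(model(past, a, horizon)[-1] - goal))


past = [0.0, 1.0]
print(toy_world_model(past, 2.0, 3))
print(best_action(toy_world_model, past, [-1.0, 0.0, 2.0], goal=7.0, horizon=3))
```

A probe that only reads out coordinates from a representation has no `action` argument to vary, which is the distinction the post draws.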

Depth Any Video with Scalable Synthetic Data. AI physicists and chemists continue to make strides in depth estimation from video. Check out this new paper featuring some impressive examples. See the thread for more details (unfortunately no code yet). Abstract: Video depth estimation has long been hindered by the scarcity of consistent and scalable ground truth data, leading to inconsistent and unreliable results. In this paper, we introduce Depth Any Video, a model that tackles the challenge through two key innovations. First, we develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse game environments, yielding 40,000 video clips of 5-second duration, each with precise depth annotations. Second, we leverage the powerful priors of generative video diffusion models to handle real-world videos effectively, integrating advanced techniques such as rotary position encoding and flow matching to further enhance flexibility and efficiency. Unlike previous models, which are limited to fixed-length video sequences, our approach introduces a novel mixed-duration training strategy that handles videos of varying lengths and performs robustly across different frame rates, even on single frames. At inference, we propose a depth interpolation method that enables our model to infer high-resolution video depth across sequences of up to 150 frames. Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency.

MrNeRF

27,428 views • 1 year ago

JUST IN: Meta AI introduces Voicebox, an all-in-one generative speech model. Voicebox is an impressive breakthrough! It could do for speech what other models like GPT-3 and Stable Diffusion have done for text and images. Some key details:
- Voicebox can synthesize speech across 6 languages
- It's a general-purpose model that can perform tasks it wasn't trained on. It can perform noise removal, content editing, style conversion, and more
- Supports in-context text-to-speech synthesis and cross-lingual style transfer
- It's 20x faster than current models and outperforms single-purpose models through in-context learning
paper: blog:

elvis

88,506 views • 3 years ago

1/ Happy to share UniDisc - Unified Multimodal Discrete Diffusion – We train a 1.5 billion parameter transformer model from scratch on 250 million image/caption pairs using a discrete diffusion objective. Our model has all the benefits of diffusion models, but now in multimodal space: a flexible compute-quality tradeoff, zero-shot inpainting and editing, better control via classifier-free guidance, and lower latency! We open-source everything: our code, weights, and the training dataset.

Mihir Prabhudesai

104,862 views • 1 year ago

How can a visuomotor policy learn from internet videos? We introduce Dreamitate, where a robot uses a fine-tuned video diffusion model to dream the future (top) and imitate the dream to accomplish a task (bottom). website: paper:

Ruoshi Liu

50,797 views • 2 years ago

Real-time generation isn’t about predicting the next pixel; it requires understanding the structure and context of a 3D world. That’s the shift. Read more about our real-time game-native diffusion model on our blog. Moonlake's Reverie v0.1.0

Moonlake

14,439 views • 6 months ago

1/ World models are getting popular in robotics 🤖✨ But there’s a big problem: most are slow and break physical consistency over long horizons.
2/ Today we’re releasing Interactive World Simulator: an action-conditioned world model that supports stable long-horizon interaction.
3/ Key result: ✅ 10+ minutes of interactive prediction ✅ 15 FPS ✅ on a single RTX 4090🔥
4/ Why this matters: it unlocks two critical robotics applications: 🚀 Scalable data generation for policy training 🧪 Faithful policy evaluation
5/ You can play with our world model NOW. NO git clone, NO pip install, NO Python. Just click and play!
NOTE ⚠️ ALL videos here are generated purely by our model in pixel space! They are NOT from a real camera.
More details coming 👇 (1/9)
#Robotics #AI #MachineLearning #WorldModels #RobotLearning #ImitationLearning

Yixuan Wang

125,579 views • 3 months ago

Video diffusion models have strong implicit representations of 3D shape, material, and lighting, but controlling them with language is cumbersome, and control is critical for artists and animators. GenLit connects these implicit representations with a continuous 5D control signal describing the direction and intensity of a point light source. This enables single-image near-field relighting of an image using a video diffusion model. We use a ControlNet-like approach and show that, with a small amount of synthetic data, GenLit generalizes to complex real-world images. Given a single image and the 5D lighting signal, GenLit creates a video of a moving light source that is inside the scene. It moves around and behind scene objects, producing effects such as shading, cast shadows, specularities, and interreflections with a realism that is hard to obtain with traditional inverse rendering methods. GenLit shows that it is possible to get continuous control over implicit physical processes within a video model. I think this is just the beginning and promises to make such models much more practical for creators. Shrisha Bharadwaj will present today at SIGGRAPH Asia Room: S423/S424, Level 4 @ 13:50 on 15 of Dec.

Michael Black

22,092 views • 6 months ago

Today, we are adding Stable Video Diffusion, our foundation model for generative video, to the Stability AI Developer Platform API. The model can generate 2 seconds of video, comprising 25 generated frames and 24 frames of FILM interpolation, within an average time of 41 seconds. Developers interested in utilizing Stable Video Diffusion through an API can access it now on the Stability AI Developer Platform. Learn more here:

Stability AI

175,571 views • 2 years ago
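
The frame counts in the post above fit together if one assumes that FILM inserts exactly one interpolated frame between each consecutive pair of generated frames. The post does not say this, so the short Python check below is a sketch under that assumption, and the function name `interpolated_clip` is hypothetical.

```python
def interpolated_clip(generated_frames: int, clip_seconds: float):
    """Frame arithmetic, assuming one interpolated frame per consecutive pair."""
    in_between = generated_frames - 1          # gaps between generated frames
    total = generated_frames + in_between      # frames in the final clip
    return in_between, total, total / clip_seconds


in_between, total, fps = interpolated_clip(25, 2.0)
print(in_between, total, fps)  # prints: 24 49 24.5
```

Under that assumption, 25 generated frames leave 24 gaps, which matches the 24 interpolated frames quoted, and the resulting 49 frames over 2 seconds work out to roughly 24.5 frames per second.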

Video generation is powerful but too slow for real-world robotic tasks. How can we enable both video and action generation while ensuring real-time policy inference? Check out our work on the Unified Video Action Model (UVA) to find out! (1/7)

Shuang Li

67,404 views • 1 year ago

Diffusion models are excellent at creating fantastic images and videos 🔎 We cooked up a diffusion model to synthesize structured data #ICLR2025 🔥 Introducing TabDiff, a mixed-type diffusion model for generating synthetic tabular data, imputing missing values, and beyond! 🧵 1/n

Minkai Xu

50,434 views • 1 year ago

LLaDA (the first Large Language Diffusion Model) is just out 💥 and I've built a demo, try it out now 👨‍💻 It's mesmerizing to watch the diffusion process 🌀, and it being a diffusion model gives you superpowers like "the 4th word has to be pineapple" 🦸 Demo and weights 👇

apolinario (poli)

82,599 views • 1 year ago

Diffusion for virtual worlds: we've trained a new model to create 3D objects from text. And it's 50x faster than any alternative. We made each of these cute characters in just one minute with Genmo's new text-to-3D model. Build your own world at

Genmo

18,528 views • 3 years ago

🔥🔥🔥We’ve been listening to your feedback! Our latest world model HY-World 1.5 just got a major upgrade to make world generation more accessible than ever: 🛠️ Open Training Code: Fully customizable code for building and training your own models. ⚡ Accelerated Inference: Turbocharged speed and optimized VRAM for real-time interaction. 📉 Lite 5B Model: A new lightweight model that fits into small-VRAM GPUs. 🙌 Zero Waitlist: Our online app is now fully open to everyone—no application required. This is just the beginning. HY-World is building the future of spatial intelligence—open, accessible, and community-driven. 🕹️ Play now: ⭐ GitHub:

Tencent Hy

20,581 views • 5 months ago

World Models are the path for some AI models in the future. But how can we efficiently train these models not only to see the world the way humans do, but to see the world in a new and unique way? By visualizing what are normally sequenced audio patterns, we can derive many more insights. Here we see Paganini in a visual form that can then be described and transcribed into a World Model. We can observe connections in a manner that may not have been clear prior to the digitalization of music and sound in this way. The company with the most valuable potential in building a World Model is Tesla. Not that this type of visualization is being used, but the mechanisms and the technology are in place for the company to thrive in this new form of AI.

Brian Roemmele

57,424 views • 7 months ago

✨ Every time the video models get better, the try-on model on Photo AI also becomes a lot more useful, as a large % of my customers are now e-commerce stores. And showing clothes in a video is nice for sales! With AI, this means stores don't need to do expensive shoots, flying a model and an entire camera and light crew around the world. They can just upload a few photos of their models, then upload the clothes, and describe the setting (like a beach in Thailand), and in less than 10 seconds it's generated; a video takes less than a minute! Below is the input: a dress laid flat, and the output: a full video shoot.

@levelsio

331,475 views • 1 year ago

Placing objects sounds simple… until robots have to do it. This method makes it simple, fast & reliable. [Github ⬇️] Robotic object placement is tough, especially with stacking, hanging, or insertion. AnyPlace is a new two-stage method that uses only synthetic data and a vision-language model to teach robots where and how to place objects, even in the real world. Why this works: ✅ Finds the right spot with help from vision-language models ✅ Handles stacking, insertion, and hanging with no real-world training ✅ Trained on synthetic data using Blender and IsaacSim ✅ Works in the real world without fine-tuning. It shows that smart use of simulation and language models can make robotic placement tasks easier, faster, and more reliable. Github: Paper: Thank you for sharing, Animesh Garg!

Ilir Aliu - eu/acc

22,843 views • 1 year ago

NVIDIA just released a very impressive text-to-video paper. Video Latent Diffusion Models (Video LDMs) use a diffusion model in a compressed latent space to generate high-resolution videos. Here's a brief overview of how it works:
1. Pre-train image LDM on a dataset of images.
2. Turn the image LDM into a Video LDM by adding temporal layers to model video frames.
3. Fine-tune the Video LDM on encoded video sequences to create a video generator.
4. Temporally align diffusion model upsamplers to generate high-resolution videos.
5. Validate Video LDM on real driving videos of 512x1024 resolution, achieving state-of-the-art performance.
6. Apply the approach in creative content creation with text-to-video modeling.
Paper: Project:

Lior Alexander

158,539 views • 3 years ago