Diffusion models make great images. But can they drive robots? Usually that gets complicated really fast. We figured out how to get a Stable Diffusion model (based on Instruct pix2pix) to drive robotic instruction following. Simple recipe, works on a wide range of tasks. Thread👇

Sergey Levine

129,693 subscribers

126,523 views • 2 years ago • via X (Twitter)

Science & Technology • Education • News & Politics

10 comments

Sergey Levine • 2 years ago

The idea is very simple: Stable Diffusion is finetuned for image editing (Instruct pix2pix), and we finetune it more on robot data to predict intermediate subgoals for performing instructions. Then a goal-conditioned policy controls the robot to match the generated subgoal.
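The two-level loop described above can be sketched in a few lines. This is a minimal illustration, not the released code: the names `run_episode`, `propose_subgoal`, `policy`, and `step_env` are hypothetical, the subgoal counts are arbitrary, and observations are plain floats instead of camera images so the example runs without any model.

```python
from typing import Callable, List

# Hypothetical stand-ins: in the thread, observations and subgoals are images
# and the proposer is a finetuned image-editing diffusion model.
Observation = float
Subgoal = float
Action = float


def run_episode(
    obs: Observation,
    instruction: str,
    propose_subgoal: Callable[[Observation, str], Subgoal],
    policy: Callable[[Observation, Subgoal], Action],
    step_env: Callable[[Observation, Action], Observation],
    num_subgoals: int = 4,
    steps_per_subgoal: int = 5,
) -> List[Observation]:
    """Alternate between proposing a subgoal and chasing it with the policy."""
    trajectory = [obs]
    for _ in range(num_subgoals):
        # High level: "edit" the current observation into a nearby subgoal.
        subgoal = propose_subgoal(obs, instruction)
        for _ in range(steps_per_subgoal):
            # Low level: the policy only has to match the current subgoal.
            action = policy(obs, subgoal)
            obs = step_env(obs, action)
            trajectory.append(obs)
    return trajectory


# Toy 1-D instantiation (every function below is an assumption for illustration).
def toy_propose_subgoal(obs: float, instruction: str) -> float:
    target = 10.0 if instruction == "move right" else -10.0
    # Bounded step toward the target, mimicking a short-horizon subgoal.
    return obs + max(-2.0, min(2.0, target - obs))


def toy_policy(obs: float, subgoal: float) -> float:
    # Move toward the subgoal with a capped step size.
    return max(-0.5, min(0.5, subgoal - obs))


def toy_step_env(obs: float, action: float) -> float:
    return obs + action


trajectory = run_episode(0.0, "move right", toy_propose_subgoal, toy_policy, toy_step_env)
print(trajectory[-1])
```

The point of the split is visible even in the toy: the proposer carries the meaning of the instruction, while the policy never sees the instruction at all and only closes the gap to the next subgoal.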

Sergey Levine • 2 years ago

Why does it work? The diffusion model transfers web-scale knowledge, and generalizes well to novel objects and scenes (that the robot never saw). The robot policy has a much easier problem to solve: it only needs to match short-term goals, often just matching arm position.

Sergey Levine • 2 years ago

Our method, SuSIE, can follow a broad range of instructions, far beyond what would be possible with only the robot data. At the same time it is efficient and easy to use -- the Instruct pix2pix model is used without any architectural changes (only finetuned) and the low-level policy is simple GCBC (goal-conditioned behavioral cloning).
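For readers who have not met GCBC: it is commonly set up by taking a logged trajectory, relabeling a later state from the same trajectory as the goal, and regressing the logged action. The sketch below shows that generic recipe on a toy 1-D trajectory. The thread does not say which loss or goal-sampling scheme SuSIE uses, so the squared-error loss, the `max_horizon` parameter, and all function names here are assumptions.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, float, float]  # (state, goal, action)


def relabel_with_future_goals(
    states: Sequence[float],
    actions: Sequence[float],
    max_horizon: int,
    rng: random.Random,
) -> List[Example]:
    """Build (state, goal, action) tuples, using a later state as the goal."""
    assert len(states) == len(actions) + 1
    examples = []
    for t, action in enumerate(actions):
        # Goal lies 1..max_horizon steps ahead, clipped to the trajectory end.
        last = min(t + max_horizon, len(states) - 1)
        goal_index = rng.randint(t + 1, last)
        examples.append((states[t], states[goal_index], action))
    return examples


def bc_loss(policy: Callable[[float, float], float], examples: Sequence[Example]) -> float:
    """Mean squared error between predicted and logged actions (assumed loss)."""
    errors = [(policy(s, g) - a) ** 2 for s, g, a in examples]
    return sum(errors) / len(errors)


# Toy demonstration: the logged behavior always moves +1 per step.
states = [0.0, 1.0, 2.0, 3.0, 4.0]
actions = [1.0, 1.0, 1.0, 1.0]
examples = relabel_with_future_goals(states, actions, max_horizon=2, rng=random.Random(0))


def good_policy(state: float, goal: float) -> float:
    # Moves toward the goal, matching the logged behavior.
    return 1.0 if goal > state else -1.0


def idle_policy(state: float, goal: float) -> float:
    # Ignores the goal and never moves.
    return 0.0


good = bc_loss(good_policy, examples)
idle = bc_loss(idle_policy, examples)
print(good, idle)
```

Because the goals come from the data itself, no reward labels or language annotations are needed at this level, which is part of why the low-level problem is the easy one.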

Sergey Levine • 2 years ago

In experiments, SuSIE actually ends up beating the much much larger RT-2-X model, trained on the giant RT-X robot dataset, despite using more than an order of magnitude less robot data (about 60k). Though RT-2-X puts up a really good fight🙂

Sergey Levine • 2 years ago

We've released the code here: We hope everyone will use it! I personally think this offers probably the easiest way to boost robot capabilities with web-scale data. Use web-scale data for what it's good at: visuo-semantic association. Simple, effective.

Sergey Levine • 2 years ago

For more, see the project website: Paper here: A fun collaboration w/ @kvablack, @mitsuhiko_nm, Pranav Atreya, @HomerWalke, @chelseabfinn, @aviral_kumar2

Oier Mees • 2 years ago

I wanted to highlight that, despite the authors being humble about this, they have significantly outperformed SOTA on the challenging CALVIN zero-shot benchmark 🙌 @kvablack @mitsuhiko_nm @HomerWalke Pranav Atreya @aviral_kumar2 @chelseabfinn @svlevine

Igor Gilitschenski • 2 years ago

I love this! This type of data-generation via data-driven simulation can also alleviate the need for collecting data from some complicated or dangerous edge cases. Instead, image editing techniques can now be used to obtain the desired training data.

rogue node • 2 years ago

sudo make me a sandwich

generatorman • 2 years ago

This is very interesting! It seems like you're essentially using IP2P as a low-FPS video generation model? Also, do you provide any test-time conditioning for how big a step IP2P should take? Or is that baked in by your finetuning data?

Related videos

Excited to share my final PhD project😀 We show how simple, yet elegant changes enable diffusion transformers to learn SOTA robotic policies on real robots. Our method improves performance by 20% across a wide range of highly dexterous tasks - like cutting sushi! 1/n

Sudeep Dasari

20,536 views • 1 year ago

Speed and quality can finally coexist in diffusion-based language generation. Introducing DiDi-Instruct, a Discrete Diffusion Divergence Instruct method that distills a pre-trained discrete diffusion language model (dLLM) into a few-step student for ultra-fast generation. Built on integral KL-divergence minimization, DiDi-Instruct achieves up to 64× faster decoding, surpasses both its teacher and GPT-2, and cuts training time by 20×. Ultra-Fast Language Generation via Discrete Diffusion Divergence Instruct Paper: Code: Project: Our report: 📬 #PapersAccepted by Jiqizhixin

机器之心 JIQIZHIXIN

18,126 views • 7 months ago

Diffusion models generate high-quality images but require hundreds of forward passes. MIT CSAIL and Adobe Research introduce Distribution Matching Distillation (DMD), a distillation approach that converts costly multi-step diffusion models into fast one-step generators. A thread 🧵

MIT CSAIL

34,383 views • 2 years ago

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models paper page: github: Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

AK

83,657 views • 2 years ago

Video diffusion models generate high-quality videos but are too slow for interactive applications. We (MIT CSAIL, Adobe Research) introduce CausVid, a fast autoregressive video diffusion model that starts playing the moment you hit "Generate"! A thread 🧵

Tianwei Yin

83,714 views • 1 year ago

In Prompt Engineering for Vision Models, taught by Abby Jacques Verre and Caleb Kaiser of Comet , you’ll learn how to prompt and fine-tune vision models for personalized image generation, image editing, object detection and segmentation. The prompts you'll use for vision models could be text, point coordinates, or bounding boxes, depending on the model. You'll also learn to tune hyperparameters to shape the output. Models you'll use include Segment-Anything Model (SAM), OWL-ViT, and Stable Diffusion. You'll also learn to fine-tune Stable Diffusion to generate personalized images (say, an image of a specific person), using a handful of images for training. As an example of a multi-step workflow, you'll use OWL-ViT to detect an object based on a text prompt, then pass the bounding box to SAM to create a segmentation mask, and input that mask into Stable Diffusion to replace the original object with a new one based on a text prompt. Controlling vision models can be tricky; this course will teach prompting and fine-tuning techniques to get precise control over their output. Get started here:

Andrew Ng

151,198 views • 2 years ago

Announcing Diffusion Forcing Transformer (DFoT), our new video diffusion algorithm that generates ultra-long videos of 800+ frames. DFoT enables History Guidance, a simple add-on to any existing video diffusion models for a quality boost. Website: (1/7)

Boyuan Chen

175,996 views • 1 year ago

Trying out real-time Stable Diffusion and it's FAST! (link below) I'm struggling to type faster than the AI can generate the images😂

Chase Lean

140,341 views • 2 years ago

We created a series of simplified notebooks that cover essential aspects of Stable Diffusion, using the vanilla Stable Diffusion 2.1 base to utilise it as a face-editing model for building your own face app 🧵(1/3) Github :

OutofAi

45,864 views • 2 years ago

Stable Diffusion 3.5 from Stability AI is now LIVE on Civitai! Stable Diffusion 3.5 Large is a Multimodal Diffusion Transformer text-to-image model that features improved performance in image quality, typography, complex prompt understanding, and resource-efficiency. Generate with SD 3.5 Live on Civitai here👇🏽 All images used to create this video were generated with SD 3.5 by our community 🫶🏽

Civitai

31,235 views • 1 year ago

Today, we are releasing Stable Video Diffusion, our first foundation model for generative AI video based on the image model, Stable Diffusion. As part of this research preview, the code, weights, and research paper are now available. Additionally, today you can sign up for our waitlist to access a new upcoming web experience featuring a Text-To-Video interface. To access the model & sign up for our waitlist, visit our website here:

Stability AI

1,024,438 views • 2 years ago

Stable Diffusion generates beautiful images, but can it be used for open-world recognition? Try Demo! Our #CVPR2023 paper shows that the pre-trained diffusion model indeed is a good image parser, allows for open-vocabulary segmentation and detection.

Xiaolong Wang

241,225 views • 3 years ago

MinerU-Diffusion A 2.5B diffusion-based OCR model that replaces slow autoregressive decoding with parallel block-wise diffusion, achieving up to 3.2x faster inference while improving robustness on complex documents with tables, formulas, and layouts.

DailyPapers

15,304 views • 3 months ago

History-Guided Video Diffusion TL;DR: Diffuse long videos by performing guidance over different histories, enabled by Diffusion Forcing Transformer, a simple finetunable add-on to any existing sequence diffusion models.

AK

20,914 views • 1 year ago

Guidance on top of diffusion models can now be used to drag and manipulate images, create pose-conditioned images, and so much more! Check out Readout Guidance: Work w/ trevordarrell, Oliver Wang, Dan Goldman, Aleksander Holynski. More in thread 🧵.

Grace Luo

42,706 views • 2 years ago

We’re excited to release ACE-Step / ACE-Step-v1-3.5B, a fast, versatile DiT-based foundation model for music generation that runs on consumer-grade GPUs. With its simple architecture and low hardware requirements, it’s easy to fine-tune for various music tasks, empowering, not replacing, artists and creators. Think of it as a step toward music’s Stable Diffusion moment. ※ Trained on authorized, purchased data. Demo Page: Hugging Face: Git repo:

ACE Studio

112,342 views • 1 year ago

Every home is different. That means that to build a useful home robot, we must be able to perform zero-shot generalization on a wide range of tasks. Humanoid company 1X has a solution: world models. 1X Director of Evaluations Daniel Ho joins us on RoboPapers to talk about:
- why world models are the future for scaling robot learning
- how to use world models for robot control
- what world models unlock for evaluating robot model performance
- how we can hill-climb from here to general purpose robots
Watch Episode #61 of RoboPapers, with Michael Cho - Rbt/Acc and Chris Paxton, now!

RoboPapers

27,567 views • 4 months ago

MVDream: Multi-view Diffusion for 3D Generation paper page: propose MVDream, a multi-view diffusion model that is able to generate geometrically consistent multi-view images from a given text prompt. By leveraging image diffusion models pre-trained on large-scale web datasets and a multi-view dataset rendered from 3D assets, the resulting multi-view diffusion model can achieve both the generalizability of 2D diffusion and the consistency of 3D data. Such a model can thus be applied as a multi-view prior for 3D generation via Score Distillation Sampling, where it greatly improves the stability of existing 2D-lifting methods by solving the 3D consistency problem. Finally, we show that the multi-view diffusion model can also be fine-tuned under a few shot setting for personalized 3D generation, i.e. DreamBooth3D application, where the consistency can be maintained after learning the subject identity.

AK

294,442 views • 2 years ago

Dreamix: Video Diffusion Models are General Video Editors abs: project page: present diffusion-based method that is able to perform text-based motion and appearance editing of general videos

AK

398,160 views • 3 years ago

A reward model that works, zero-shot, across robots, tasks, and scenes? Introducing Robometer: Scaling general-purpose robotic reward models with 1M+ trajectories. Enables zero-shot: online/offline/model-based RL, data retrieval + IL, automatic failure detection, and more! 🧵 (1/12)

Jesse Zhang

100,274 views • 3 months ago