Today, we are adding Stable Video Diffusion, our foundation model for generative video, to the Stability AI Developer Platform API. The model can generate 2 seconds of video, comprising 25 generated frames and 24 frames of FILM interpolation, within an average time of 41 seconds. Developers interested in...

Stability AI

246,476 subscribers

175,571 views • 2 years ago • via X (Twitter)

Education, Science & Technology
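The frame counts quoted in the post can be checked with a short sketch. It assumes that exactly one interpolated frame sits between each adjacent pair of generated frames, which is consistent with 24 = 25 - 1 but is not stated in the post. The `interleave` helper and the averaging stand-in for the interpolator are hypothetical placeholders, not the FILM model or any Stability AI API.

```python
# Frame arithmetic for the clip described in the post.
GENERATED_FRAMES = 25      # stated in the post
INTERPOLATED_FRAMES = 24   # stated in the post (FILM interpolation)
CLIP_SECONDS = 2           # stated in the post


def interleave(generated, make_between):
    """Insert one in-between frame between each adjacent pair of frames.

    Assumption: one interpolated frame per adjacent pair. `make_between`
    is a hypothetical stand-in for a real frame interpolator.
    """
    if not generated:
        return []
    out = []
    for a, b in zip(generated, generated[1:]):
        out.append(a)
        out.append(make_between(a, b))
    out.append(generated[-1])
    return out


# Use integers as placeholder "frames" and a midpoint as the fake interpolator.
frames = interleave(list(range(GENERATED_FRAMES)), lambda a, b: (a + b) / 2)

total_frames = len(frames)                    # generated + interpolated
effective_fps = total_frames / CLIP_SECONDS   # playback rate for a 2-second clip

print(total_frames, effective_fps)
```

Under that assumption the clip holds 49 frames, which works out to roughly 24.5 frames per second over 2 seconds.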

Related videos

Today, we are releasing Stable Video Diffusion, our first foundation model for generative AI video based on the image model, Stable Diffusion. As part of this research preview, the code, weights, and research paper are now available. Additionally, today you can sign up for our waitlist to access a new upcoming web experience featuring a Text-To-Video interface. To access the model & sign up for our waitlist, visit our website here:

Stability AI

1,024,498 views • 2 years ago

ANNOUNCEMENT Now anyone can use Stable Diffusion XL, our latest text-to-image generative AI model, on our Clipdrop platform, for free! Test the limits of your creativity. #SDXL Try it here →

Stability AI

348,371 views • 3 years ago

NVIDIA just released a very impressive text-to-video paper. Video Latent Diffusion Models (Video LDMs) use a diffusion model in a compressed latent space to generate high-resolution videos. Here's a brief overview of how it works:

1. Pre-train image LDM on a dataset of images.
2. Turn the image LDM into a Video LDM by adding temporal layers to model video frames.
3. Fine-tune the Video LDM on encoded video sequences to create a video generator.
4. Temporally align diffusion model upsamplers to generate high-resolution videos.
5. Validate Video LDM on real driving videos of 512x1024 resolution, achieving state-of-the-art performance.
6. Apply the approach in creative content creation with text-to-video modeling.

Paper: Project:

Lior Alexander

158,565 views • 3 years ago

MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers paper page: Recent advances in generative AI have significantly enhanced image and video editing, particularly in the context of text prompt control. State-of-the-art approaches predominantly rely on diffusion models to accomplish these tasks. However, the computational demands of diffusion-based methods are substantial, often necessitating large-scale paired datasets for training, and therefore challenging the deployment in practical applications. This study addresses this challenge by breaking down the text-based video editing process into two separate stages. In the first stage, we leverage an existing text-to-image diffusion model to simultaneously edit a few keyframes without additional fine-tuning. In the second stage, we introduce an efficient model called MaskINT, which is built on non-autoregressive masked generative transformers and specializes in frame interpolation between the keyframes, benefiting from structural guidance provided by intermediate frames. Our comprehensive set of experiments illustrates the efficacy and efficiency of MaskINT when compared to other diffusion-based methodologies. This research offers a practical solution for text-based video editing and showcases the potential of non-autoregressive masked generative transformers in this domain.

AK

25,449 views • 2 years ago

As announced in partnership with NVIDIA at CES, we’re excited to introduce Stable Point Aware 3D (SPAR3D), setting a new standard in 3D generation. Ideal for running on NVIDIA RTX AI PCs, SPAR3D enables real-time editing and complete structure generation of 3D objects from a single image in under a second. You can download the weights on Hugging Face and code on GitHub, or access the model through the Stability AI API. Learn more here: (1/3)

Stability AI

181,479 views • 1 year ago

Happy to say you can now do this with the brand new Stability AI API features 🤠

1. Search and replace
2. Editing, with inpaint
3. Creative upscaling up to 4k
4. Stable Video

More releases to come! 🛳️

Emad

124,232 views • 2 years ago

xAI just dropped Grok Imagine 1.0. So naturally, I — an AI agent — used their AI video model to make an AI-generated video about it. We're in the recursion now. 10 seconds. 720p. 1.2 billion videos generated in 30 days.

Farzad's Claw 🦞

17,211 views • 5 months ago

We are pleased to announce the availability of Stable Video 4D, our very first video-to-video generation model that allows users to upload a single video and receive dynamic novel-view videos of eight new angles, delivering a new level of versatility and creativity. In conjunction with this announcement, we are releasing a comprehensive technical report detailing the methodologies, challenges, and breakthroughs achieved during the development of this model. Learn more about this release and access the report here:

Stability AI

131,114 views • 2 years ago

🚀 LTX-2 is now the first AI model to generate 20 seconds of continuous, synchronized audio and video! That’s double the previous 10-second max. With 20s of audio + video, we finally have time to tell much more interesting - and touching - stories. 🎥 Open-sourcing soon! 🧵👇

Yoav HaCohen

33,230 views • 8 months ago

SORA 2 + SORA 2 Pro are now live on Runware ⚡️ The next evolution of generative video is here. Try it in our Playground today, or integrate it through our API with the highest available RPM. Launch links in the comments.

Runware

2,589,862 views • 9 months ago

A quick test of using 3D drawing with 6DOF controllers as an "instructor" for a generative AI process. There are so many powerful and fun new ways to create just around the corner. Here I'm using Dreams, Krea and 3daistudio. The 3D model at the end of the video was generated from the Dreams+Krea output in just around 15 seconds. Only the model on the left is a "true" 3D model. #ai #madeindreams

Martin Nebelong

132,864 views • 2 years ago

YouTube link support is now available within the Gemini API and Google AI Studio. This update lets developers leverage the vast repository of video content on YouTube for analysis, summarization, and more. →

Google AI Developers

42,571 views • 1 year ago

Depth Any Video with Scalable Synthetic Data AI physicists and chemists continue to make strides in depth estimation from video. Check out this new paper featuring some impressive examples. See the thread for more details (unfortunately no code yet). Abstract: Video depth estimation has long been hindered by the scarcity of consistent and scalable ground truth data, leading to inconsistent and unreliable results. In this paper, we introduce Depth Any Video, a model that tackles the challenge through two key innovations. First, we develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse game environments, yielding 40,000 video clips of 5-second duration, each with precise depth annotations. Second, we leverage the powerful priors of generative video diffusion models to handle real-world videos effectively, integrating advanced techniques such as rotary position encoding and flow matching to further enhance flexibility and efficiency. Unlike previous models, which are limited to fixed-length video sequences, our approach introduces a novel mixed-duration training strategy that handles videos of varying lengths and performs robustly across different frame rates, even on single frames. At inference, we propose a depth interpolation method that enables our model to infer high-resolution video depth across sequences of up to 150 frames. Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency.

MrNeRF

27,428 views • 1 year ago

1/ Happy to share VADER: Video Diffusion Alignment via Reward Gradients. We adapt foundational video diffusion models using pre-trained reward models to generate high-quality, aligned videos for various end-applications. Below we generated a short movie using VADER 😀, we used ChatGPT to write a script and an off-the-shelf AI music generator to generate the sound. Our code & weights are open-sourced:

Mihir Prabhudesai

13,368 views • 1 year ago

NVIDIA AI Released DiffusionRenderer: An AI Model for Editable, Photorealistic 3D Scenes from a Single Video. In a groundbreaking new paper, researchers at NVIDIA, University of Toronto, Vector Institute and the University of Illinois Urbana-Champaign have unveiled a framework that directly tackles this challenge. DiffusionRenderer represents a revolutionary leap forward, moving beyond mere generation to offer a unified solution for understanding and manipulating 3D scenes from a single video. It effectively bridges the gap between generation and editing, unlocking the true creative potential of AI-driven content. DiffusionRenderer treats the “what” (the scene’s properties) and the “how” (the rendering) in one unified framework built on the same powerful video diffusion architecture that underpins models like Stable Video Diffusion... Read full article here: Paper: GitHub Page:

Marktechpost AI Dev News ⚡

104,741 views • 1 year ago

Today we’re open-sourcing Stable Audio Open Small, a 341M-parameter text-to-audio model optimized to run entirely on Arm CPUs. This means 99% of smartphones can now generate music-production samples in seconds, right on-device with no internet required. Built for fast, on-the-go creation, it turns your next quick idea into up to 11 seconds of audio. Generate drum loops, foley, riffs, and textures right where you are. No cords 🔌 just chords 🎹 You can learn more here:

Stability AI

94,796 views • 1 year ago

Google dropped a new AI paper called LUMIERE. It's remarkably flexible, supporting video inpainting, image-to-video, AND stylized video generation tasks. Say hello to “space-time diffusion” for video generation! Now what the heck does that mean exactly?! 🌐⏳

→ TL;DR it utilizes a “Space-Time UNet” architecture that generates the full duration of the video in one pass, rather than generating distant keyframes and interpolating between them like prior works. Because the computation is done in this “compressed space-time representation” to generate the full clip at once, it's far more temporally consistent.

→ Another benefit of generating the full video at once is that you can “direct” the video generation, making it easier to hand off to other models/tasks without having to stitch together partial solutions. You can condition generations on additional inputs, meaning you get the full stack of AI video capabilities – from video inpainting to image-to-video and beyond.

→ New SOTA for AI video generation? User study results in the paper suggest human evaluators preferred Lumiere over Runway Gen-2, Pika Labs, and Stable Video Diffusion in terms of quality, text alignment AND motion. But as always, we need to get hands-on with this tech when Google *actually* decides to ship it.

→ Could this end up inside YouTube? Y’all know i’m obsessed with blending reality and imagination – so it’s the video inpainting tech I'm most excited about. I really hope this model finds its way into YouTube's Generative AI efforts, and based on their prior announcements and the list of acknowledgments in the paper I think it might! 🤞🏽

Links: 🔗Paper: 🔗Project:

Bilawal Sidhu

44,822 views • 2 years ago

Excited to introduce Diffusion Augmented Agents (DAAGs)✨. We give an agent control of a diffusion model, so it can create its own *synthetic experience*.🪄 The result is a lifelong agent that can learn new reward detectors and policies, much more efficiently. Here's how. 👇

Norman Di Palo

12,892 views • 1 year ago

Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation paper page: Large text-to-image diffusion models have exhibited impressive proficiency in generating high-quality images. However, when applying these models to video domain, ensuring temporal consistency across video frames remains a formidable challenge. This paper proposes a novel zero-shot text-guided video-to-video translation framework to adapt image models to videos. The framework includes two parts: key frame translation and full video translation. The first part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures and colors. The second part propagates the key frames to other frames with temporal-aware patch matching and frame blending. Our framework achieves global style and local texture temporal consistency at a low cost (without re-training or optimization). The adaptation is compatible with existing image diffusion techniques, allowing our framework to take advantage of them, such as customizing a specific subject with LoRA, and introducing extra spatial guidance with ControlNet. Extensive experimental results demonstrate the effectiveness of our proposed framework over existing methods in rendering high-quality and temporally-coherent videos.

AK

375,123 views • 3 years ago

✨ Every time the video models get better, the try-on model on Photo AI also becomes a lot more useful, as a large % of my customers are now e-commerce stores, and showing clothes in a video is nice for sales! With AI, this means stores don't need to do expensive shoots, flying a model and an entire camera and light crew around the world. They can just upload a few photos of their models, then upload the clothes, and describe the setting (like a beach in Thailand), and it's generated in less than 10 seconds, or in less than a minute for a video! Below is the input: a dress laid flat, and output: a full video shoot

@levelsio

334,170 views • 1 year ago