Meta presents Video Editing via Factorized Diffusion Distillation. We introduce Emu Video Edit (EVE), a model that establishes a new state of the art in video editing without relying on any supervised video editing data. To develop EVE, we separately train an image editing

AK

506,347 subscribers

115,598 views • 2 years ago • via X (Twitter)

7 Comments

AK • 2 years ago

adapter and a video generation adapter, and attach both to the same text-to-image model. Then, to align the adapters towards video editing, we introduce a new unsupervised distillation procedure, Factorized Diffusion Distillation. This procedure distills knowledge from one or

AK • 2 years ago

more teachers simultaneously, without any supervised data. We utilize this procedure to teach EVE to edit videos by jointly distilling knowledge to (i) precisely edit each individual frame from the image editing adapter, and (ii) ensure temporal consistency among the

AK • 2 years ago

edited frames using the video generation adapter. Finally, to demonstrate the potential of our approach in unlocking other capabilities, we align additional combinations of adapters

AK • 2 years ago

paper page:

Uri Gil • 2 years ago

that is not what the term "video editing" usually refers to. It should be called video manipulation or something

Jing Gu • 2 years ago

Using two adapters to function for editing and video part. Good idea 👍

Simulacra Latens • 2 years ago

What is the edit? All I see is image swapping/IPAdapter style transfer which we already have?

Related Videos

Today we’re sharing two new advances in our generative AI research: Emu Video & Emu Edit. Details ➡️ These new models deliver exciting results in high quality, diffusion-based text-to-video generation & controlled image editing w/ text instructions. 🧵

AI at Meta

798,246 views • 2 years ago

🕹️We are excited to introduce "ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation" ChronoEdit reframes image editing as a video generation task to encourage temporal consistency. It leverages a temporal reasoning stage that denoises with “video reasoning tokens” to "reason" on physically plausible edits. See the attached video for results. Project Page: Arxiv: Code and model are coming.

Huan Ling

36,895 views • 9 months ago

TurboEdit: Instant text-based image editing discuss: We address the challenges of precise image inversion and disentangled image editing in the context of few-step diffusion models. We introduce an encoder-based iterative inversion technique. The inversion network is conditioned on the input image and the reconstructed image from the previous step, allowing for correction of the next reconstruction towards the input image. We demonstrate that disentangled controls can be easily achieved in the few-step diffusion model by conditioning on an (automatically generated) detailed text prompt. To manipulate the inverted image, we freeze the noise maps and modify one attribute in the text prompt (either manually or via instruction-based editing driven by an LLM), resulting in the generation of a new image similar to the input image with only one attribute changed. It can further control the editing strength and accept instructive text prompts. Our approach facilitates realistic text-guided image edits in real time, requiring only 8 function evaluations (NFEs) in inversion (one-time cost) and 4 NFEs per edit. Our method is not only fast, but also significantly outperforms state-of-the-art multi-step diffusion editing techniques.

AK

16,062 views • 1 year ago

Exciting milestones in our generative AI research: Emu Video, which lets you create high quality videos from a text prompt, and Emu Edit, which enables detailed image editing based on your instructions. These new models are built on Emu, our foundation model for image generation and technology from them will underpin new creative features across our apps next year. Try it out: Emu Video: Emu Edit:

Boz

110,730 views • 2 years ago

Grok Imagine API just released A world-class video generation + video editing model Text-to-Video: Turn simple prompts into rich video clips with audio Image Generation + Editing: Bring ideas to life with visuals from scratch Video Editing Tools: Restyle scenes, add/remove props, control motion Best-in-Class Quality + Low Latency: Designed to deliver fast, cost-efficient results API pricing: Image input: $0.002 Video input : $0.01 Video output : $0.05

X Freeze

15,078 views • 5 months ago

InstantDrag Improving Interactivity in Drag-based Image Editing discuss: Drag-based image editing has recently gained popularity for its interactivity and precision. However, despite the ability of text-to-image models to generate samples within a second, drag editing still lags behind due to the challenge of accurately reflecting user interaction while maintaining image content. Some existing approaches rely on computationally intensive per-image optimization or intricate guidance-based methods, requiring additional inputs such as masks for movable regions and text prompts, thereby compromising the interactivity of the editing process. We introduce InstantDrag, an optimization-free pipeline that enhances interactivity and speed, requiring only an image and a drag instruction as input. InstantDrag consists of two carefully designed networks: a drag-conditioned optical flow generator (FlowGen) and an optical flow-conditioned diffusion model (FlowDiffusion). InstantDrag learns motion dynamics for drag-based image editing in real-world video datasets by decomposing the task into motion generation and motion-conditioned image generation. We demonstrate InstantDrag's capability to perform fast, photo-realistic edits without masks or text prompts through experiments on facial video datasets and general scenes. These results highlight the efficiency of our approach in handling drag-based image editing, making it a promising solution for interactive, real-time applications.

AK

71,232 views • 1 year ago

Google announces Dreamix: a model that generates videos when given: - video + prompt (Video editing) - input images + prompt (Subject Driven Generation) - input image + prompt (Image-to-Video)

bleedingedge.ai

1,323,833 views • 3 years ago

Introducing Higgsfield Canvas: a state-of-the-art image editing model. Paint products directly onto your image with pixel-perfect control. Say hi to your new go-to for product placement, editing, and layout! 👋🏻 Comment Canvas to get the full guide in the DM.

Higgsfield AI 🧩

2,627,346 views • 1 year ago

Ok finally dug into Meta's new Movie Gen paper. Text-to-video is cool and all but, to me, the precise editing feature is the game changer. I mean just look at these results 🤯 It can handle complex VFX tasks like replacing environments, doing set extensions, swapping characters, removing items, adding particle effects with realistic lighting interaction. The coolest bit to me is how they trained this model, because paired before/after VFX editing datasets are super scarce. TL;DR They taught it video editing through a clever three-stage process: 1. Started with image editing data, treating it like single-frame video edits. 2. Created synthetic video editing tasks by animating still image edits and using AI models (like SAM and DINO) for object segmentation. 3. The model generated edited videos, and then learned to reconstruct the originals from the edited version. Meta calls this "video editing via backtranslation" and the results speak for themselves.

Bilawal Sidhu

50,775 views • 1 year ago

Bytedance drops an open-source Gemini Omni!!! Bernini is a new AI video generation + editing framework. > Edit videos with text prompts > Image/video references > Code available

⚡AI Search⚡

43,804 views • 1 month ago

Introducing ChatGPT Images 2.0 A state-of-the-art image model that can take on complex visual tasks and produce precise, immediately usable visuals, with sharper editing, richer layouts, and thinking-level intelligence. Video made with ChatGPT Images

OpenAI

12,892,837 views • 3 months ago

A video editing tool, made by a successful YouTuber. We love to see it. 👏

Product Hunt 😸

12,115 views • 2 years ago

⚒️Editing Preview⚒️ While a video can be enough for any edit, in a lot of cases adding specific greenscreens or memes can HEAVILY enhance the editing script ! Even when an editor has to rewatch the footage x100 times, it tends to ALWAYS be worth the trouble !

QbQube 👾🧊

35,560 views • 5 months ago

🚨PUSH THE BUTTON🚨 Live video editing is a form of video editing where the footage is not pre-recorded. I create this on my own in real time- live. My mind, body & tech integrate to allow me to enter 🌸Flow State🌸 Here's an example of that #KickStreamers #TwitchStreamers

BBJESS

119,016 views • 2 years ago

First, we are excited to share a number of new updates to our frontier video generation model, Gen-4.5. Soon you will be able to both generate and edit native audio with Gen-4.5 and edit video at arbitrary lengths with multi-shot editing.

Runway

33,369 views • 7 months ago

1/ Excited to announce MoCA - a simple yet effective video editing method that accomplishes diverse spatial (style, background, object) and motion video edits using a motion conditioned image animation model.

Wilson Yan

10,636 views • 2 years ago

How do we generate videos on the scale of minutes, without drifting or forgetting about the historical context? We introduce Mixture of Contexts. Every minute-long video below is the direct output of our model in a single pass, with no post-processing, stitching, or editing. 1/4

Gordon Wetzstein

158,208 views • 10 months ago

Meta announces Movie Gen: A Cast of Media Foundation Models. We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio. We also show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user’s image. Our models set a new state-of-the-art on multiple tasks: text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation. Our largest video generation model is a 30B parameter transformer trained with a maximum context length of 73K video tokens, corresponding to a generated video of 16 seconds at 16 frames-per-second. We show multiple technical innovations and simplifications on the architecture, latent spaces, training objectives and recipes, data curation, evaluation protocols, parallelization techniques, and inference optimizations that allow us to reap the benefits of scaling pre-training data, model size, and training compute for training large scale media generation models. We hope this paper helps the research community to accelerate progress and innovation in media generation models.

AK

62,719 views • 1 year ago