Accepted by #CVPR2023! X-Decoder is the FIRST generalist decoder that supports all segmentation tasks (ins/sem/pano/ref) in OPEN VOCABULARY, both inter- AND intra-image VL tasks, and even helps instruct image inpainting/editing! New demo below and more at

Jianwei Yang

3,979 subscribers

51,930 views • 3 years ago • via X (Twitter)

Education, Science & Technology, News & Politics #CVPR2023

6 comments

Jianwei Yang • 3 years ago

This project was led by our two wonderful interns @xueyanzou1, @ZiYiDou! With joint mentorship from @zhegan4, @LINJIEFUN, @ChunyuanLi, Xiyang Dai, @HarkiratBehl, Jianfeng Wang, and senior advisory from Violet Peng, Lu Yuan, Lijuan Wang, @yong_jae_lee and @JianfengGao0217!

Akarsh G • 3 years ago

Used your instruct demo. Still not perfect.

Naoto Usuyama • 3 years ago

Congrats!

Dan Benyamin (Æ) • 3 years ago

Cc @levelsio

Akarsh G • 3 years ago

How is it different from pix2pix?

Jianwei Yang • 3 years ago

We used our X-Decoder as a plug-in for the original pix2pix to make the edits more grounded.

Related videos

Stable Diffusion generates beautiful images, but can it be used for open-world recognition? Try Demo! Our #CVPR2023 paper shows that the pre-trained diffusion model indeed is a good image parser, allows for open-vocabulary segmentation and detection.

Xiaolong Wang

241,243 views • 3 years ago

Tracking Anything with Decoupled Video Segmentation paper page: Training data for video segmentation are expensive to annotate. This impedes extensions of end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary settings. To 'track anything' without training on video data for every individual task, we develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. Due to this design, we only need an image-level model for the target task (which is cheaper to train) and a universal temporal propagation model which is trained once and generalizes across tasks. To effectively combine these two modules, we use bi-directional propagation for (semi-)online fusion of segmentation hypotheses from different frames to generate a coherent segmentation. We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks including large-vocabulary video panoptic segmentation, open-world video segmentation, referring video segmentation, and unsupervised video object segmentation.

AK

305,633 views • 2 years ago

Florence-2, the new vision foundation model by Microsoft, can now run 100% locally in your browser on WebGPU, thanks to Transformers.js! 🤗🤯 It supports tasks like image captioning, optical character recognition, object detection, and many more! 😍 WOW! Demo (+ source code) 👇

Xenova

88,747 views • 2 years ago

📢📢 𝐏𝐞𝐫𝐜𝐇𝐞𝐚𝐝: 𝐏𝐞𝐫𝐜𝐞𝐩𝐭𝐮𝐚𝐥 𝐇𝐞𝐚𝐝 𝐌𝐨𝐝𝐞𝐥 𝐟𝐨𝐫 𝐒𝐢𝐧𝐠𝐥𝐞-𝐈𝐦𝐚𝐠𝐞 𝟑𝐃 𝐇𝐞𝐚𝐝 𝐑𝐞𝐜𝐨𝐧𝐬𝐭𝐫𝐮𝐜𝐭𝐢𝐨𝐧 & 𝐄𝐝𝐢𝐭𝐢𝐧𝐠📢📢 PercHead reconstructs realistic 3D heads from a single image and enables disentangled 3D editing via geometric controls and style inputs from images or text. At its core is a generalized 3D head decoder trained with perceptual supervision from DINOv2 and SAM 2.1. We find that our new perceptual loss formulation improves reconstruction fidelity compared to commonly-used methods such as LPIPS. Our trained reconstruction model is able to generate 3D-consistent heads from a single input image. Even with challenging side-view inputs, the model robustly infers missing regions for a coherent, high-fidelity output. In addition, our architecture seamlessly adapts to downstream tasks: by swapping the encoder, we can transform the model into a disentangled 3D editing pipeline. In this scenario, we can control geometry through - potentially hand-drawn - segmentation maps, and condition style via image or text prompt. We also provide an interactive GUI to enable the exploration of our editing pipeline. 🌍 📽️ Great work by Antonio Oroz and Tobias Kirschstein

Matthias Niessner

18,855 views • 8 months ago

Big News! Meta just released Segment Anything, a new AI model that can "cut out" any object, in any image/video, with a single click. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks.

Lior Alexander

290,190 views • 3 years ago

[1/2] We’ve released the code for #pix2pixturbo and #CycleGANTurbo. These conditional GANs are able to adapt a text-to-image model such as SD-Turbo for both paired and unpaired image translation with a single step (0.11 sec on A100 and 0.29 sec on A6000). Try our code and the Gradio demo. Paper: Code: Demo: This is a joint work with Gaurav Parmar (the leading author), Taesung Park, and Srinivasa Narasimhan. This work shows that a pre-trained one-step model can be easily adapted to conditional GANs frameworks for downstream image editing and synthesis tasks. #Edges2Cats

Jun-Yan Zhu

36,488 views • 2 years ago

Ok finally dug into Meta's new Movie Gen paper. Text-to-video is cool and all but, to me the precise editing feature is the game changer. I mean just look at these results 🤯 It can handle complex VFX tasks like replacing environments, doing set extensions, swapping characters, removing items, adding particle effects with realistic lighting interaction. The coolest bit to me is how they trained this model, because paired before/after vfx editing datasets are super scarce. TL;DR They taught it video editing through a clever three-stage process: 1. Started with image editing data, treating it like single-frame video edits. 2. Created synthetic video editing tasks by animating still image edits and using AI models (like SAM and DINO) for object segmentation. 3. The model generated edited videos, and then learned to reconstruct the originals from the edited version Meta calls this "video editing via backtranslation" and the results speak for themselves.

Bilawal Sidhu

50,775 views • 1 year ago

SuperGSeg: Open-Vocabulary 3D Segmentation with Structured Super-Gaussians Contributions: • We propose SuperGSeg: a 3D segmentation method with neural Gaussians, designed to learn hierarchical instance segmentation features from 2D foundation models. • We introduce the concept of Super-Gaussian, a novel representation that integrates hierarchical instance segmentation features, enabling the embedding of high-dimensional language features. This approach addresses previously unfeasible challenges in representing complex scenes with rich semantic details. • Extensive experiments on the LERF-OVS and ScanNet datasets demonstrate the effectiveness of the proposed method, achieving significant improvements in open-vocabulary 3D object-level and scene-level semantic segmentation. It shows particular strength in capturing fine-grained scene details and dense pixel semantic segmentation tasks for the first time.

MrNeRF

13,594 views • 1 year ago

(1/10) 🔥Thrilled to introduce OneDiffusion—our latest work in unified diffusion modeling! 🚀 This model bridges the gap between image synthesis and understanding, excelling in a wide range of tasks: T2I, conditional generation, image understanding, identity preservation, multiview generation, and even camera pose estimation. Learn more at: Project: arXiv: Code (on the way):

Jiasen Lu

33,426 views • 1 year ago

Three new open-source models just landed in ComfyUI natively: → Gemma 4 (Google DeepMind) - multimodal LLM handling text, image, audio, and video input with built-in step-by-step reasoning mode → VOID (Netflix) - video object removal that also erases shadows, reflections, and physical interactions caused by the removed subject → BiRefNet - high-res background & object segmentation, one of the most-used segmentation models in the ecosystem Workflows and blog linked below 👇

ComfyUI

218,749 views • 2 months ago

The new GOV.UK Wallet & App will transform how we access important documents like digital driver’s licences and complete essential tasks, all from your phone. Peter Kyle shared an exclusive first look at the new tools - watch the full demo on YouTube at the link in the reply below.

Department for Science, Innovation and Technology

48,140 views • 1 year ago

The average American reads below a 6th grade level… which means public schools are failing at even the most basic and fundamental tasks, & when you look at how the system is structured it’s no surprise that more and more children are falling behind.

Brett Pike

11,469 views • 7 months ago

High fidelity image editing is now in the OpenAI API and ChatGPT! Here's a quick demo I made that focuses on editing faces. It adds mustaches and facial expressions—but preserves all the other details of the face. Notice how my hair, glasses, nose, etc are perfectly maintained.

edwin

41,819 views • 1 year ago

TurboEdit Instant text-based image editing discuss: We address the challenges of precise image inversion and disentangled image editing in the context of few-step diffusion models. We introduce an encoder based iterative inversion technique. The inversion network is conditioned on the input image and the reconstructed image from the previous step, allowing for correction of the next reconstruction towards the input image. We demonstrate that disentangled controls can be easily achieved in the few-step diffusion model by conditioning on an (automatically generated) detailed text prompt. To manipulate the inverted image, we freeze the noise maps and modify one attribute in the text prompt (either manually or via instruction based editing driven by an LLM), resulting in the generation of a new image similar to the input image with only one attribute changed. It can further control the editing strength and accept instructive text prompt. Our approach facilitates realistic text-guided image edits in real-time, requiring only 8 number of functional evaluations (NFEs) in inversion (one-time cost) and 4 NFEs per edit. Our method is not only fast, but also significantly outperforms state-of-the-art multi-step diffusion editing techniques.

AK

16,062 views • 1 year ago

🚀 A brand-new feature is coming to the SAMGeo Python package: interactive remote-sensing image segmentation with text prompts powered by SAM3. This update makes geospatial segmentation even more intuitive: just describe what you want to detect, and let GeoAI do the rest. ✨ GitHub PR: #GeoAI #SAM3 #OpenSource #Python

Qiusheng Wu

32,888 views • 7 months ago

Introducing Agent Mode: Agentic AI is now measured in the Arena. Agent Mode can do deep research, create reports, generate images, build websites, debug code, and more. It completes more complex tasks by using tools like web search, bash in a sandbox environment, image generation, file writing, and asking follow-up questions. Frontier models are waiting for you in Agent Mode to take on real-world tasks. GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro, and top open models. Test them yourself.

Arena.ai

344,089 views • 1 month ago

We benchmarked leading multimodal foundation models (GPT-4o, Claude 3.5 Sonnet, Gemini, Llama, etc.) on standard computer vision tasks—from segmentation to surface normal estimation—using standard datasets like COCO and ImageNet. These models have made remarkable progress; however, it is unclear exactly where they stand in terms of understanding vision in detail. Especially when it comes to tasks beyond question-answering. How well do they understand an object's segments or geometry? Our analyses yield an assessment that is quantitatively and qualitatively detailed and is compatible with evaluations developed in the field of computer vision over the past decades. Observed trends: 🔹 The foundation models consistently underperform task-specific SOTA models across all tasks. However, they are respectable generalists, which is remarkable as they are presumably trained primarily on image-text-based tasks. 🔹 They perform semantic tasks notably better than geometric ones. 🔹 GPT-4o performs the best among non-reasoning models, getting the top position in 4 out of 6 tasks. 🔹 Reasoning models, e.g., o3, show improvements in geometric tasks. 🔹 The 'image generation' models, e.g., GPT-4o Image Generation, which have been natively trained multimodally, exhibit quirks. E.g., hallucinated objects, misalignment between the input and output, etc. 🔹 While the prompting techniques affect performance, better models exhibit less sensitivity to variations in prompts. We control for the variance introduced by the prompting methods in our experiments. 🌐 Detailed analyses, visualizations: ⌨️ code: 🧵 1/n

Amir Zamir

73,074 views • 1 year ago

Motion planning in complex tasks is hard and still done via slow, explicit, traditional planners. We present a generalist Neural Motion Planner -- a single neural network that plans complex dynamic motions quickly and accurately at test time. Building upon our lab's sim2real efforts, the key idea is to create many complex scenes in simulation and then distill classical motion planner trajectories into a single reactive neural network policy. More details in the thread below! 👇 Open-sourced:

Deepak Pathak

28,642 views • 1 year ago

Researchers from RAI Institute present Diffuse-CLoC, a new control policy that fuses kinematic motion diffusion models with physics-based control to produce motions that are both physically realistic and precisely controllable. This breakthrough moves us closer to developing generalist policies that enable humanoid robots to perform diverse tasks, including dynamic locomotion and contact-rich manipulation, in a natural-looking and robust way. Learn more at

RAI Institute

13,426 views • 11 months ago