Break-A-Scene: Extracting Multiple Concepts from a Single Image introduce the task of textual scene decomposition: given a single image of a scene that may contain several concepts, we aim to extract a distinct text token for each concept, enabling fine-grained control over the generated scenes. To this end, we propose augmenting the input image with masks that indicate the presence of target concepts. These masks can be provided by the user or generated automatically by a pre-trained segmentation model. We then present a novel two-phase customization process that optimizes a set of dedicated textual embeddings (handles), as well as the model weights, striking a delicate balance between accurately capturing the concepts and avoiding overfitting. We employ a masked diffusion loss to enable handles to generate their assigned concepts, complemented by a novel loss on cross-attention maps to prevent entanglement. We also introduce union-sampling, a training strategy aimed to improve the ability of combining multiple concepts in generated images. We use several automatic metrics to quantitatively compare our method against several baselines, and further affirm the results using a user study. Finally, we showcase several applications of our method. Paper page:

AK
154,507 views • 3 years ago
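
The masked diffusion loss mentioned in the Break-A-Scene post can be illustrated with a toy sketch. This is not the authors' implementation: the function name `masked_mse`, the nested-list layout, and the normalization by the number of masked pixels are all assumptions made here for illustration. It only shows the general idea of penalizing the noise-prediction error inside a concept's mask.

```python
def masked_mse(pred_noise, true_noise, mask):
    """Mean squared error restricted to pixels where mask is 1.

    All three arguments are equally shaped 2D lists (hypothetical layout).
    Dividing by the number of masked pixels is an assumption of this sketch.
    """
    total = 0.0
    count = 0
    for p_row, t_row, m_row in zip(pred_noise, true_noise, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m:
                total += (p - t) ** 2
                count += 1
    # An empty mask contributes no loss.
    return total / count if count else 0.0


# Toy 2x2 example: only the top-left and bottom-right pixels are masked in.
pred = [[1.0, 5.0], [0.0, 2.0]]
true = [[0.0, 0.0], [0.0, 0.0]]
mask = [[1, 0], [0, 1]]
loss = masked_mse(pred, true, mask)
```

The large error at the unmasked top-right pixel does not affect `loss`, which is the point of masking.
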
The Hidden Language of Diffusion Models paper page: tackle the challenge of understanding concept representations in text-to-image models by decomposing an input text prompt into a small set of interpretable elements. This is achieved by learning a pseudo-token that is a sparse weighted combination of tokens from the model's vocabulary, with the objective of reconstructing the images generated for the given concept. Applied over the state-of-the-art Stable Diffusion model, this decomposition reveals non-trivial and surprising structures in the representations of concepts. For example, we find that some concepts such as "a president" or "a composer" are dominated by specific instances (e.g., "Obama", "Biden") and their interpolations. Other concepts, such as "happiness" combine associated terms that can be concrete ("family", "laughter") or abstract ("friendship", "emotion"). In addition to peering into the inner workings of Stable Diffusion, our method also enables applications such as single-image decomposition to tokens, bias detection and mitigation, and semantic image manipulation.

AK
41,746 views • 3 years ago
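
The "sparse weighted combination of tokens" described in the post above can be sketched in a few lines. This is only an illustration under stated assumptions: in the paper the weights are learned, whereas here they are given by hand, and the top-k truncation, the function name `pseudo_token`, and the two-dimensional toy embeddings are all hypothetical choices made for this example.

```python
def pseudo_token(weights, vocab_embeddings, top_k):
    """Build a pseudo-token as a weighted sum of a few vocabulary embeddings.

    weights: dict token -> coefficient (hand-set here; assumed learned in practice)
    vocab_embeddings: dict token -> list of floats (toy embeddings)
    top_k: how many tokens to keep (sparsity via top-k is an assumption)
    """
    # Keep only the top_k tokens with the largest coefficients.
    kept = sorted(weights, key=weights.get, reverse=True)[:top_k]
    dim = len(next(iter(vocab_embeddings.values())))
    combined = [0.0] * dim
    for token in kept:
        for i, value in enumerate(vocab_embeddings[token]):
            combined[i] += weights[token] * value
    return kept, combined


# Toy vocabulary echoing the "happiness" example from the post.
vocab = {"family": [1.0, 0.0], "laughter": [0.0, 1.0], "car": [1.0, 1.0]}
coeffs = {"family": 0.6, "laughter": 0.3, "car": 0.01}
kept_tokens, embedding = pseudo_token(coeffs, vocab, top_k=2)
```

Reading off `kept_tokens` is what makes the decomposition interpretable: the concept is explained by the few vocabulary tokens that survive.
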
📢We introduce “RefFusion”, a novel inpainting method for scenes reconstructed using 3D Gaussian Splatting. 🔗 TLDR: we personalize an image diffusion model to a given reference image and distill its knowledge to 3D through score distillation sampling.

Ashkan Mirzaei
34,683 views • 2 years ago
The latent space of earlier generative models like GANs can linearly encode concepts of the data. What if the data was model weights? We present weights2weights, a subspace in diffusion weights that behaves as an interpretable latent space over customized diffusion models.

Amil Dravid
94,226 views • 2 years ago
Excited to share MeshUp ⬆️, a framework for deforming a mesh into a blend of various target concepts! MeshUp offers control over the influence of each concept by changing the associated weights. Project page: 🧵1/

Rana Hanocka
23,267 views • 1 year ago
We are excited to introduce our latest project for Dom Perignon, a visual and audio experience that highlights the brand's ideal aesthetics and foundational concepts!

Numbered
12,105 views • 1 year ago
Photorealistic Object Insertion with Diffusion-Guided Inverse Rendering discuss: The correct insertion of virtual objects in images of real-world scenes requires a deep understanding of the scene's lighting, geometry and materials, as well as the image formation process. While recent large-scale diffusion models have shown strong generative and inpainting capabilities, we find that current models do not sufficiently "understand" the scene shown in a single picture to generate consistent lighting effects (shadows, bright reflections, etc.) while preserving the identity and details of the composited object. We propose using a personalized large diffusion model as guidance to a physically based inverse rendering process. Our method recovers scene lighting and tone-mapping parameters, allowing the photorealistic composition of arbitrary virtual objects in single frames or videos of indoor or outdoor scenes. Our physically based pipeline further enables automatic materials and tone-mapping refinement.

AK
19,101 views • 1 year ago
Text-to-image diffusion transformer models learn to align text and image representations as a byproduct of their conditional denoising task. By taking the dot product between the text and image representations of a DiT model (like Flux 2), you can create rich saliency maps.

Alec Helbling
94,065 views • 5 months ago
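
The dot-product idea in the post above can be sketched with plain Python lists. This is a toy illustration, not code from any DiT model: the vectors are made up, the function name `saliency_map` is hypothetical, and the softmax normalization over patches is an assumption added here so the map sums to 1.

```python
import math


def saliency_map(text_vec, patch_vecs, grid_h, grid_w):
    """Score each image patch by its dot product with one text vector.

    patch_vecs is a flat, row-major list of patch vectors (assumed layout).
    Scores are softmax-normalized (an assumption) and reshaped into a grid.
    """
    if len(patch_vecs) != grid_h * grid_w:
        raise ValueError("patch count must equal grid_h * grid_w")
    scores = [sum(t * p for t, p in zip(text_vec, patch)) for patch in patch_vecs]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return [probs[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


# Toy 2x2 patch grid; the second patch is most aligned with the text vector.
text_vec = [1.0, 0.0]
patch_vecs = [[0.1, 0.9], [2.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
heat = saliency_map(text_vec, patch_vecs, 2, 2)
```

In this example the highest value lands at `heat[0][1]`, the patch whose vector points in the same direction as the text vector.
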
DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion TL;DR: Create 3/4DGS from Video Diffusion Note: Some first inference code released (not all yet). Contributions (cited): • We present DimensionX, a novel framework for generating photorealistic 3D and 4D scenes from only a single image using controllable video diffusion. • We propose ST-Director, which decouples the spatial and temporal priors in video diffusion models by learning (spatial and temporal) dimension-aware modules with our curated datasets. We further enhance the hybrid-dimension control with a training-free composition approach according to the essence of video diffusion denoising process. • To bridge the gap between video diffusion and real-world scenes, we design a trajectory-aware mechanism for 3D generation and an identity-preserving denoising approach for 4D generation, enabling more realistic and controllable scene synthesis. • Extensive experiments manifest that our DimensionX delivers superior performance in video, 3D, and 4D generation compared with baseline methods.

MrNeRF
17,017 views • 1 year ago
Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models Contributions: • We introduce Diffuman4D, a novel diffusion model that generates spatio-temporally consistent and high-resolution (1024p) human videos from sparse-view video inputs. • We propose a sliding iterative denoising mechanism that enhances both the spatial and temporal consistency of generated long-term videos while maintaining efficient inference. • We design a human pose conditioning scheme to enhance the appearance quality and motion accuracy of generated human videos. • We plan to release our processed version of the DNA-Rendering dataset, which we believe will benefit future research in this area.

MrNeRF
24,580 views • 10 months ago
Create a short film like this in just 1 minute with GPT Image 2.0 + Seedance 2.0. GPT Image 2.0 can naturally combine multiple photos into one single image, while Seedance 2.0 can use that image as a reference to automatically separate the scenes, generate a coherent video sequence, and add suitable background music. This workflow greatly improves the overall creative efficiency. When using this method, simply provide the merged image as a reference for Seedance 2.0 and briefly describe each scene with a simple prompt. This can significantly increase the success rate of the final video. All of the above was created on GPT Image Prompt: Seedance Prompt:

Midjourney Sref and prompt Library
40,572 views • 1 month ago
Self-Calibrating Gaussian Splatting for Large Field of View Reconstruction Note: Check below for full video. Abstract (cited): "In this paper, we present a self-calibrating framework that jointly optimizes camera parameters, lens distortion, and 3D Gaussian representations, enabling accurate and efficient scene reconstruction. Our technique is particularly effective for high-quality scene reconstruction from large field-of-view (FOV) imagery taken with wide-angle lenses, allowing the scene to be modeled from a smaller number of images. We introduce a novel method for modeling complex lens distortions using a hybrid network that combines invertible residual networks with explicit grids. This design effectively regularizes the optimization process, achieving greater accuracy than conventional camera models. Additionally, we propose a cubemap-based resampling strategy to support large FOV images without sacrificing resolution or introducing distortion artifacts. Our method is compatible with the fast rasterization of Gaussian Splatting, adaptable to a wide variety of camera lens distortions, and demonstrates state-of-the-art performance on both synthetic and real-world datasets."

MrNeRF
17,206 views • 1 year ago
Create a 3D model from a single image, set of images or a text prompt in < 1 minute 😮💨 This new AI paper called CAT3D shows us that it’ll keep getting easier to produce 3D models from 2D images, whether it’s a sparser real world 3D scan (a few photos instead of hundreds) or your favorite 2D image generator like Midjourney (just an image). How does this magic work? “This architecture is similar to video diffusion models, but with camera pose embeddings for each image instead of time embeddings. The generated views are passed into a robust 3D reconstruction pipeline to create the 3D representation (Zip-NeRF or 3DGS)”

Bilawal Sidhu
92,760 views • 2 years ago
Snap presents MoA Mixture-of-Attention for Subject-Context Disentanglement in Personalized Image Generation We introduce a new architecture for personalization of text-to-image diffusion models, coined Mixture-of-Attention (MoA). Inspired by the Mixture-of-Experts

AK
47,384 views • 2 years ago
MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers paper page: Recent advances in generative AI have significantly enhanced image and video editing, particularly in the context of text prompt control. State-of-the-art approaches predominantly rely on diffusion models to accomplish these tasks. However, the computational demands of diffusion-based methods are substantial, often necessitating large-scale paired datasets for training, and therefore challenging the deployment in practical applications. This study addresses this challenge by breaking down the text-based video editing process into two separate stages. In the first stage, we leverage an existing text-to-image diffusion model to simultaneously edit a few keyframes without additional fine-tuning. In the second stage, we introduce an efficient model called MaskINT, which is built on non-autoregressive masked generative transformers and specializes in frame interpolation between the keyframes, benefiting from structural guidance provided by intermediate frames. Our comprehensive set of experiments illustrates the efficacy and efficiency of MaskINT when compared to other diffusion-based methodologies. This research offers a practical solution for text-based video editing and showcases the potential of non-autoregressive masked generative transformers in this domain.

AK
25,368 views • 2 years ago
1/ Happy to share VADER: Video Diffusion Alignment via Reward Gradients. We adapt foundational video diffusion models using pre-trained reward models to generate high-quality, aligned videos for various end-applications. Below we generated a short movie using VADER 😀, we used ChatGPT to write a script and an off-the-shelf AI music generator to generate the sound. Our code & weights are open-sourced:

Mihir Prabhudesai
13,330 views • 1 year ago
📢GeomHair: Reconstruction of Hair Strands from Colorless 3D Scans📢 We present a novel method to reconstruct hair strands from colorless 3D scans by extracting orientation cues directly from the mesh surface geometry by finding local characteristic lines and from shaded renderings using a neural 2D line detector. We enhance the reconstruction with a diffusion prior trained on synthetic hair data and adapted to each scan using a tailored text prompt, allowing us to recover both simple and complex hairstyles without relying on color input. To support further research, we also introduce Strands400, the largest publicly available dataset of 3D hair strand reconstructions from real-world scans of 400 different people, featuring complicated hairstyles, such as ponytails and buns. 🌍 📷 Great work by Rachmadio Noval L. Artem Sevastopolsky Egor Zakharov @ness_pris

Matthias Niessner
12,465 views • 1 year ago
Video diffusion models have strong implicit representations of 3D shape, material, and lighting, but controlling them with language is cumbersome, and control is critical for artists and animators. GenLit connects these implicit representations with a continuous 5D control signal describing the direction and intensity of a point light source. This enables single-image near-field relighting of an image using a video diffusion model. We use a ControlNet-like approach and show that, with a small amount of synthetic data, GenLit generalizes to complex real-world images. Given a single image and the 5D lighting signal, GenLit creates a video of a moving light source that is inside the scene. It moves around and behind scene objects, producing effects such as shading, cast shadows, specularities, and interreflections with a realism that is hard to obtain with traditional inverse rendering methods. GenLit shows that it is possible to get continuous control over implicit physical processes within a video model. I think this is just the beginning and promises to make such models much more practical for creators. Shrisha Bharadwaj will present today at SIGGRAPH Asia Room: S423/S424, Level 4 @ 13:50 on 15 of Dec.

Michael Black
22,004 views • 5 months ago
Meta announces SceneScript Reconstructing Scenes With An Autoregressive Structured Language Model We introduce SceneScript, a method that directly produces full scene models as a sequence of structured language commands using an autoregressive, token-based approach. Our

AK
46,090 views • 2 years ago
Today, we are releasing Stable Video Diffusion, our first foundation model for generative AI video based on the image model, Stable Diffusion. As part of this research preview, the code, weights, and research paper are now available. Additionally, today you can sign up for our waitlist to access a new upcoming web experience featuring a Text-To-Video interface. To access the model & sign up for our waitlist, visit our website here:

Stability AI
1,024,335 views • 2 years ago
Depth Any Video with Scalable Synthetic Data AI physicists and chemists continue to make strides in depth estimation from video. Check out this new paper featuring some impressive examples. See the thread for more details (unfortunately no code yet). Abstract: Video depth estimation has long been hindered by the scarcity of consistent and scalable ground truth data, leading to inconsistent and unreliable results. In this paper, we introduce Depth Any Video, a model that tackles the challenge through two key innovations. First, we develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse game environments, yielding 40,000 video clips of 5-second duration, each with precise depth annotations. Second, we leverage the powerful priors of generative video diffusion models to handle real-world videos effectively, integrating advanced techniques such as rotary position encoding and flow matching to further enhance flexibility and efficiency. Unlike previous models, which are limited to fixed-length video sequences, our approach introduces a novel mixed-duration training strategy that handles videos of varying lengths and performs robustly across different frame rates, even on single frames. At inference, we propose a depth interpolation method that enables our model to infer high-resolution video depth across sequences of up to 150 frames. Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency.

MrNeRF
27,428 views • 1 year ago