Break-A-Scene: Extracting Multiple Concepts from a Single Image introduce... the task of textual scene decomposition: given a single image of a scene that may contain several concepts, we aim to extract a distinct text token for each concept, enabling fine-grained control over the generated scenes. To this end, we propose augmenting the input image with masks that indicate the presence of target concepts. These masks can be provided by the user or generated automatically by a pre-trained segmentation model. We then present a novel two-phase customization process that optimizes a set of dedicated textual embeddings (handles), as well as the model weights, striking a delicate balance between accurately capturing the concepts and avoiding overfitting. We employ a masked diffusion loss to enable handles to generate their assigned concepts, complemented by a novel loss on cross-attention maps to prevent entanglement. We also introduce union-sampling, a training strategy aimed to improve the ability of combining multiple concepts in generated images. We use several automatic metrics to quantitatively compare our method against several baselines, and further affirm the results using a user study. Finally, we showcase several applications of our method paper page:show more

154,511 次观看 • 3 年前

The Hidden Language of Diffusion Models paper page: tackle... the challenge of understanding concept representations in text-to-image models by decomposing an input text prompt into a small set of interpretable elements. This is achieved by learning a pseudo-token that is a sparse weighted combination of tokens from the model's vocabulary, with the objective of reconstructing the images generated for the given concept. Applied over the state-of-the-art Stable Diffusion model, this decomposition reveals non-trivial and surprising structures in the representations of concepts. For example, we find that some concepts such as "a president" or "a composer" are dominated by specific instances (e.g., "Obama", "Biden") and their interpolations. Other concepts, such as "happiness" combine associated terms that can be concrete ("family", "laughter") or abstract ("friendship", "emotion"). In addition to peering into the inner workings of Stable Diffusion, our method also enables applications such as single-image decomposition to tokens, bias detection and mitigation, and semantic image manipulationshow more

41,746 次观看 • 3 年前

Photorealistic Object Insertion with Diffusion-Guided Inverse Rendering discuss: The... correct insertion of virtual objects in images of real-world scenes requires a deep understanding of the scene's lighting, geometry and materials, as well as the image formation process. While recent large-scale diffusion models have shown strong generative and inpainting capabilities, we find that current models do not sufficiently "understand" the scene shown in a single picture to generate consistent lighting effects (shadows, bright reflections, etc.) while preserving the identity and details of the composited object. We propose using a personalized large diffusion model as guidance to a physically based inverse rendering process. Our method recovers scene lighting and tone-mapping parameters, allowing the photorealistic composition of arbitrary virtual objects in single frames or videos of indoor or outdoor scenes. Our physically based pipeline further enables automatic materials and tone-mapping refinement.show more

19,101 次观看 • 1 年前

DimensionX: Create Any 3D and 4D Scenes from a... Single Image with Controllable Video Diffusion TL;DR: Create 3/4DGS from Video Diffusion Note: Some first inference code released (not all yet). Contributions (cited): • We present DimensionX, a novel framework for generating photorealistic 3D and 4D scenes from only a single image using controllable video diffusion. • We propose ST-Director, which decouples the spatial and temporal priors in video diffusion models by learning (spatial and temporal) dimension-aware modules with our curated datasets. We further enhance the hybriddimension control with a training-free composition approach according to the essence of video diffusion denoising process. • To bridge the gap between video diffusion and real-world scenes, we design a trajectory-aware mechanism for 3D generation and an identity-preserving denoising approach for 4D generation, enabling more realistic and controllable scene synthesis. • Extensive experiments manifest that our DimensionX delivers superior performance in video, 3D, and 4D generation compared with baseline methods.show more

MrNeRF

17,047 次观看 • 1 年前

Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos... with Spatio-Temporal Diffusion Models Contributions: • We introduce Diffuman4D, a novel diffusion model that generates spatio-temporally consistent and high-resolution (1024p) human videos from sparse-view video inputs. • We propose a sliding iterative denoising mechanism that enhances both the spatial and temporal consistency of generated long-term videos while maintaining efficient inference. • We design a human pose conditioning scheme to enhance the appearance quality and motion accuracy of generated human videos. • We plan to release our processed version of the DNA-Rendering dataset, which we believe will benefit future research in this area.show more

MrNeRF

24,729 次观看 • 1 年前

Create a short film like this in just 1... minute with GPT Image 2.0 + Seedance 2.0. GPT Image 2.0 can naturally combine multiple photos into one single image, while Seedance 2.0 can use that image as a reference to automatically separate the scenes, generate a coherent video sequence, and add suitable background music. This workflow greatly improves the overall creative efficiency. When using this method, simply provide the merged image as a reference for Seedance 2.0 and briefly describe each scene with a simple prompt. This can significantly increase the success rate of the final video. All of the above was created on GPT Image Prompt: Seedance Prompt:show more

Midjourney Sref and prompt Library

40,572 次观看 • 2 个月前

Self-Calibrating Gaussian Splatting for Large Field of View Reconstruction... Note: Check below for full video. Abstract (cited): "In this paper, we present a self-calibrating framework that jointly optimizes camera parameters, lens distortion, and 3D Gaussian representations, enabling accurate and efficient scene reconstruction. Our technique is particularly effective for high-quality scene reconstruction from large field-of-view (FOV) imagery taken with wide-angle lenses, allowing the scene to be modeled from a smaller number of images. We introduce a novel method for modeling complex lens distortions using a hybrid network that combines invertible residual networks with explicit grids. This design effectively regularizes the optimization process, achieving greater accuracy than conventional camera models. Additionally, we propose a cubemap-based resampling strategy to support large FOV images without sacrificing resolution or introducing distortion artifacts. Our method is compatible with the fast rasterization of Gaussian Splatting, adaptable to a wide variety of camera lens distortions, and demonstrates state-of-the-art performance on both synthetic and real-world datasets."show more

MrNeRF

17,206 次观看 • 1 年前

Create a 3D model from a single image, set... of images or a text prompt in < 1 minute 😮‍💨 This new AI paper called CAT3D shows us that it’ll keep getting easier to produce 3D models from 2D images — whether it’s a sparser real world 3D scan (a few photos instead of hundreds) or your favorite 2D image generator like Midjourney (just an image). How does this magic work? “This architecture is similar to video diffusion models, but with camera pose embeddings for each image instead of time embeddings. The generated views are passed into a robust 3D reconstruction pipeline to create the 3D representation (Zip-NeRF or 3DGS)”show more

Bilawal Sidhu

92,792 次观看 • 2 年前

MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers paper... page: Recent advances in generative AI have significantly enhanced image and video editing, particularly in the context of text prompt control. State-of-the-art approaches predominantly rely on diffusion models to accomplish these tasks. However, the computational demands of diffusion-based methods are substantial, often necessitating large-scale paired datasets for training, and therefore challenging the deployment in practical applications. This study addresses this challenge by breaking down the text-based video editing process into two separate stages. In the first stage, we leverage an existing text-to-image diffusion model to simultaneously edit a few keyframes without additional fine-tuning. In the second stage, we introduce an efficient model called MaskINT, which is built on non-autoregressive masked generative transformers and specializes in frame interpolation between the keyframes, benefiting from structural guidance provided by intermediate frames. Our comprehensive set of experiments illustrates the efficacy and efficiency of MaskINT when compared to other diffusion-based methodologies. This research offers a practical solution for text-based video editing and showcases the potential of non-autoregressive masked generative transformers in this domain.show more

25,449 次观看 • 2 年前

📢GeomHair: Reconstruction of Hair Strands from Colorless 3D Scans📢... We present a novel method to reconstruct hair strands from colorless 3D scans by extracting orientation cues directly from the mesh surface geometry by finding local characteristic lines and from shaded renderings using a neural 2D line detector. We enhance the reconstruction with a diffusion prior trained on synthetic hair data and adapted to each scan using a tailored text prompt, allowing us to recover both simple and complex hairstyles without relying on color input. To support further research, we also introduce Strands400, the largest publicly available dataset of 3D hair strand reconstructions from real-world scans of 400 different people, featuring complicated hairstyles, such as ponytails and buns. 🌍 📷 Great work by Rachmadio Noval L. Artem Sevastopolsky Egor Zakharov @ness_prisshow more

Matthias Niessner

12,490 次观看 • 1 年前

Video diffusion models have strong implicit representations of 3D... shape, material, and lighting, but controlling them with language is cumbersome, and control is critical for artists and animators. GenLit connects these implicit representations with a continuous 5D control signal describing the direction and intensity of a point light source. This enables single-image near-field relighting of an image using a video diffusion model. We use a ControlNet-like approach and show that, with a small amount of synthetic data, GenLit generalizes to complex real-world images. Given a single image and the 5D lighting signal, GenLit creates a video of a moving light source that is inside the scene. It moves around and behind scene objects, producing effects such as shading, cast shadows, secularities, and interreflections with a realism that is hard to obtain with traditional inverse rendering methods. GenLit shows that it is possible to get continuous control over implicit physical processes within a video model. I think this is just the beginning and promises to make such models much more practical for creators. Shrisha Bharadwaj will present today at SIGGRAPH Asia Room: S423/S424, Level 4 @ 13:50 on 15 of Dec.show more

Michael Black

22,182 次观看 • 7 个月前

Today, we are releasing Stable Video Diffusion, our first... show more

Stability AI

1,024,498 次观看 • 2 年前

Depth Any Video with Scalable Synthetic Data AI physicists... and chemists continue to make strides in depth estimation from video. Check out this new paper featuring some impressive examples. See the thread for more details (unfortunately no code yet). Abstract: Video depth estimation has long been hindered by the scarcity of consistent and scalable ground truth data, leading to inconsistent and unreliable results. In this paper, we introduce Depth Any Video, a model that tackles the challenge through two key innovations. First, we develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse game environments, yielding 40,000 video clips of 5-second duration, each with precise depth annotations. Second, we leverage the powerful priors of generative video diffusion models to handle real-world videos effectively, integrating advanced techniques such as rotary position encoding and flow matching to further enhance flexibility and efficiency. Unlike previous models, which are limited to fixed-length video sequences, our approach introduces a novel mixed-duration training strategy that handles videos of varying lengths and performs robustly across different frame rates 0 - even on single frames. At inference, we propose a depth interpolation method that enables our model to infer high-resolution video depth across sequences of up to 150 frames. Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency.show more

MrNeRF

27,428 次观看 • 1 年前

Segment Any 3D Gaussians paper page: Interactive 3D segmentation... in radiance fields is an appealing task since its importance in 3D scene understanding and manipulation. However, existing methods face challenges in either achieving fine-grained, multi-granularity segmentation or contending with substantial computational overhead, inhibiting real-time interaction. In this paper, we introduce Segment Any 3D GAussians (SAGA), a novel 3D interactive segmentation approach that seamlessly blends a 2D segmentation foundation model with 3D Gaussian Splatting (3DGS), a recent breakthrough of radiance fields. SAGA efficiently embeds multi-granularity 2D segmentation results generated by the segmentation foundation model into 3D Gaussian point features through well-designed contrastive training. Evaluation on existing benchmarks demonstrates that SAGA can achieve competitive performance with state-of-the-art methods. Moreover, SAGA achieves multi-granularity segmentation and accommodates various prompts, including points, scribbles, and 2D masks. Notably, SAGA can finish the 3D segmentation within milliseconds, achieving nearly 1000x acceleration compared to previous SOTA.show more

69,542 次观看 • 2 年前

We’re excited to introduce Text-to-LoRA: a Hypernetwork that generates... task-specific LLM adapters (LoRAs) based on a text description of the task. Catch our presentation at #ICML2025! Paper: Code: Biological systems are capable of rapid adaptation, given limited sensory cues. For example, our human visual system can quickly adapt and tune its light sensitivity to our surroundings. While modern LLMs exhibit a wide variety of capabilities and knowledge, they remain rigid when adding task-specific capabilities. Traditionally, customizing these models requires gathering large datasets and performing often expensive, time-consuming fine-tuning for specific applications. To bypass these limitations, Text-to-LoRA (T2L) meta-learns a “hypernetwork” that takes in a text description of a desired task, as a prompt, and generates a task-specific LoRA that performs well on the task. In our experiments, we show that T2L can encode hundreds of existing LoRA adapters. While the compression is lossy, T2L maintains the performance of task-specifically tuned LoRA adapters. We also show that T2L can even generalize to unseen tasks given a natural language description of the tasks. Importantly, Text-to-LoRA is parameter-efficient. It generates LoRAs in a single, inexpensive step, based solely on a simple text description of the task. This approach is a step towards dramatically lowering the technical and computational barriers, allowing non-technical users to specialize foundation models using plain language, rather than needing deep technical expertise or large compute resources.show more

Sakana AI

403,145 次观看 • 1 年前

Human Hair Reconstruction with Strand-Aligned 3D Gaussians Contributions (cited):... – We propose a new 3D line lifting scheme that uses a modified 3DGS reconstruction technique to lift 2D orientation maps into a 3D field while also providing refinement of the camera parameters; – We introduce a dual representation of hair strand polylines and 3D Gaussians to achieve differentiable rasterization of hair strands and leverage photometric constraints for strand-based hair reconstruction; – Based on these components, we propose a coarse-to-fine optimization method for prior-guided hair reconstruction that leverages both latent and explicit representations of the hairstyle.show more

MrNeRF

106,525 次观看 • 1 年前

The $NEIRO Solana community is honoured to introduce: The... Bronze Neiro Statue 🐕 A gift to かぼすママ To keep up the ongoing tradition set forth by Doge, we will hand craft a similar statue as the one built to forever honor Kabosu. We expect nothing in return for this gift. It’s a mere token of our admiration and love for everything that this legacy represents. As a community, we have dared to dream beyond the ordinary, and our vision has led us to a remarkable decision for us to gift Atsuko Sato. - A hand sculpted bronze statue in honor of her new family member, Neiro. This could be something she accepts as a gift or placed in a park in Japan to honour her and the Legacy that surrounds Atsuko, Neiro and Kabosu. On behalf of Neiro community We send our love and best wishes to Atsuko and Neiro as they start this new journey together paw in hand. 🫶 Learn more:show more

NEIRO

175,933 次观看 • 1 年前

Is Google taking initial steps to enhance Street View?... For some reason, Street View seems stuck in technology that feels outdated. I wonder if we'll see such improvements on the product side. Also, note how much better it performs in all aspects compared to Zip-NeRF in their presented material. It offers more details and fewer artifacts. Great work! "LODGE: Level-of-Detail Large-Scale Gaussian Splatting with Efficient Rendering" Contributions: • We propose a novel LOD representation for 3DGS which, unlike previous methods [27, 28, 17], does not recompute the list of used Gaussians at each frame. This allows for acceleration and compaction, enabling the rendering of large-scale scenes even on mobile devices. • We design a strategy to automatically select optimal hyperparameters for splitting LODs, whereas most other methods require manual tuning of hyperparameters for each 3D scene. • To further accelerate rendering, we split the scene into chunks and pre-compute sets of active Gaussians per chunk. • Finally, we introduce a novel opacity interpolation scheme to produce visually pleasing rendering and eliminate artifacts when transitioning between chunks.show more

MrNeRF

62,564 次观看 • 1 年前

Beyond generating high-fidelity visuals, we wanted to test the... limits of what Nano Banana Pro can do. We worked with design partners Porto Rocha to build out a hypothetical brand called YOYOYO to see how the model would handle the task. Here’s what we found: 🎨Brand consistency: Across logos, colors, and typography, the model maintained a strict, cohesive brand identity (even for wildly diverse concepts) 🛍️Environmental realism: We asked to see the products in storefront and studio mockups. It nailed accurate lighting, shadows, and physical proportions - even when upscaled for massive retail displays 🪀Spatial accuracy: We tested spatial volumes for physical packaging. The generated proportions were so precise that we were able to 3D-print the functional yo-yo How have you been pushing the limits of Nano Banana Pro? Let us know in the replies below!show more

Google AI

123,192 次观看 • 3 个月前

What happens next? We took a photo with Lens... and asked Nano Banana to generate the next 5 seconds of a scene. While it’s not predicting the future in a literal sense, the model draws upon its vast world knowledge to generate a plausible outcome in each of these examples. It's a fun peek at how our latest AI image editing model can be used for creative forecasting and storytelling. Nano Banana is now live in Search – accessible through Lens and AI Mode – so you can try this out for yourself!show more

Rajan Patel

74,826 次观看 • 9 个月前

Milestone unlocked 🔓🔑. Today, we announced positive topline results... from Emerge, the first Phase 3 study of our pharmaceutical LSD for major depressive disorder (MDD). These results mark a significant step forward in our mission to forge a new era of psychiatry, one with novel treatments that target underlying cause of disease, not just symptoms. For patients who may have run out of options, these results matter. We’re energized by the momentum ahead as we continue pursuing potential new treatment options for the millions of people living with a mental health disorder. Learn more about this significant step:show more

Definium Therapeutics

18,790 次观看 • 28 天前

Live Cam