Wonderland: Navigating 3D Scenes from a Single Image
Contributions:
• First, we introduce a representation for controllable 3D generation by leveraging the generative priors from camera-guided video diffusion models. Unlike image models, video diffusion models are trained on extensive video datasets. This enables them to capture comprehensive spatial relationships within scenes across multiple views and embed a form of "3D awareness" in their latent space, which allows us to maintain 3D consistency in novel view synthesis.
• Second, to achieve controllable novel view generation, we empower video models with precise control over specified camera motions. We introduce a novel dual-branch conditioning mechanism that effectively incorporates desired diverse camera trajectories into the video diffusion model. This enables expansion of a single image into a multi-view consistent capture of a 3D scene with precise pose control.
• Third, to achieve efficient 3D reconstruction, we directly transform video latents into 3DGS. We propose a novel latent-based large reconstruction model (LaLRM) that lifts video latents to 3D in a feed-forward manner. With this design, during inference, our model directly predicts 3DGS from a single input image, effectively aligning the generation and reconstruction tasks (and bridging image space and 3D space) through the video latent space. Compared with reconstructing scenes from images, the video latent space offers a 256× spatial-temporal reduction while retaining essential and consistent 3D structural details. Such a high degree of compression is crucial, as it allows the LaLRM to handle a wider range of 3D scenes within the reconstruction framework, with the same memory constraints.

MrNeRF
52,801 views • 1 year ago
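
As a rough illustration of the 256× figure in the Wonderland post above: one way such a number can arise is from a video autoencoder that downsamples 4× in time and 8× along each spatial axis (4 × 8 × 8 = 256). Those three factors are an assumption made for this sketch, not something the post states, and the class name, method names, and clip size below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentCompression:
    """Per-axis downsampling of a video autoencoder.

    The default factors are ASSUMED for illustration; the post only
    quotes the overall 256x spatial-temporal reduction.
    """
    temporal: int = 4  # assumption: 4x fewer frames in latent space
    height: int = 8    # assumption: 8x downsampling along height
    width: int = 8     # assumption: 8x downsampling along width

    @property
    def factor(self) -> int:
        """Overall reduction in the number of spatio-temporal positions."""
        return self.temporal * self.height * self.width

    def latent_shape(self, frames: int, height: int, width: int) -> tuple:
        """Latent grid size for a clip, rounding up on each axis."""
        def ceil_div(n: int, d: int) -> int:
            return -(-n // d)

        return (
            ceil_div(frames, self.temporal),
            ceil_div(height, self.height),
            ceil_div(width, self.width),
        )


if __name__ == "__main__":
    comp = LatentCompression()
    # Hypothetical clip size, chosen so every axis divides evenly.
    frames, height, width = 48, 480, 720
    t, h, w = comp.latent_shape(frames, height, width)
    pixel_positions = frames * height * width
    latent_positions = t * h * w
    print("overall factor:", comp.factor)
    print("latent grid:", (t, h, w))
    print("position ratio:", pixel_positions // latent_positions)
```

The count here covers grid positions only. Latent channels are usually wider than RGB, so the saving in raw memory is smaller than the position ratio suggests.
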
DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion
TL;DR: Create 3/4DGS from Video Diffusion
Note: Some first inference code released (not all yet).
Contributions (cited):
• We present DimensionX, a novel framework for generating photorealistic 3D and 4D scenes from only a single image using controllable video diffusion.
• We propose ST-Director, which decouples the spatial and temporal priors in video diffusion models by learning (spatial and temporal) dimension-aware modules with our curated datasets. We further enhance the hybrid-dimension control with a training-free composition approach according to the essence of video diffusion denoising process.
• To bridge the gap between video diffusion and real-world scenes, we design a trajectory-aware mechanism for 3D generation and an identity-preserving denoising approach for 4D generation, enabling more realistic and controllable scene synthesis.
• Extensive experiments manifest that our DimensionX delivers superior performance in video, 3D, and 4D generation compared with baseline methods.

MrNeRF
17,028 views • 1 year ago
Create a 3D model from a single image, set of images or a text prompt in < 1 minute 😮💨 This new AI paper called CAT3D shows us that it’ll keep getting easier to produce 3D models from 2D images, whether it’s a sparser real world 3D scan (a few photos instead of hundreds) or your favorite 2D image generator like Midjourney (just an image). How does this magic work? “This architecture is similar to video diffusion models, but with camera pose embeddings for each image instead of time embeddings. The generated views are passed into a robust 3D reconstruction pipeline to create the 3D representation (Zip-NeRF or 3DGS)”

Bilawal Sidhu
92,760 views • 2 years ago
Human Hair Reconstruction with Strand-Aligned 3D Gaussians
Contributions (cited):
– We propose a new 3D line lifting scheme that uses a modified 3DGS reconstruction technique to lift 2D orientation maps into a 3D field while also providing refinement of the camera parameters;
– We introduce a dual representation of hair strand polylines and 3D Gaussians to achieve differentiable rasterization of hair strands and leverage photometric constraints for strand-based hair reconstruction;
– Based on these components, we propose a coarse-to-fine optimization method for prior-guided hair reconstruction that leverages both latent and explicit representations of the hairstyle.

MrNeRF
106,497 views • 1 year ago
🚀New paper out - We present Video-MSG (Multimodal Sketch Guidance), a novel planning-based training-free guidance method for T2V models, improving control of spatial layout and object trajectories.
🔧 Key idea:
• Generate a Video Sketch: a spatio-temporal plan with background, foreground, and motion in the pixel space.
• Encode this structure directly into the latent space of the diffusion model during generation, which does not require fine-tuning or additional memory during inference. 🧵

Jialu Li
35,060 views • 1 year ago
DroneSplat: 3D Gaussian Splatting for Robust 3D Reconstruction from In-the-Wild Drone Imagery
Abstract: Drones have become essential tools for reconstructing wild scenes due to their outstanding maneuverability. Recent advances in radiance field methods have achieved remarkable rendering quality, providing a new avenue for 3D reconstruction from drone imagery. However, dynamic distractors in wild environments challenge the static scene assumption in radiance fields, while limited view constraints hinder the accurate capture of underlying scene geometry. To address these challenges, we introduce DroneSplat, a novel framework designed for robust 3D reconstruction from in-the-wild drone imagery. Our method adaptively adjusts masking thresholds by integrating local-global segmentation heuristics with statistical approaches, enabling precise identification and elimination of dynamic distractors in static scenes. We enhance 3D Gaussian Splatting with multi-view stereo predictions and a voxel-guided optimization strategy, supporting high-quality rendering under limited view constraints. For comprehensive evaluation, we provide a drone-captured 3D reconstruction dataset encompassing both dynamic and static scenes. Extensive experiments demonstrate that DroneSplat outperforms both 3DGS and NeRF baselines in handling in-the-wild drone imagery.

MrNeRF
21,340 views • 1 year ago
📢We introduce “RefFusion”, a novel inpainting method for scenes reconstructed using 3D Gaussian Splatting. 🔗 TLDR: we personalize an image diffusion model to a given reference image and distill its knowledge to 3D through score distillation sampling.

Ashkan Mirzaei
34,683 views • 2 years ago
We are excited to introduce Stable Fast 3D, Stability AI’s latest breakthrough in 3D asset generation technology. This innovative model transforms a single input image into a detailed 3D asset in just 0.5 seconds, setting a new standard for speed and quality in the field of 3D reconstruction! Alongside this release, we’ve also published a technical report that highlights how we achieve fast inference speeds with reduced baked illumination and material parameters. 👾You can learn more and access the report here:

Stability AI
438,327 views • 1 year ago
🚀 Introducing GenLit – Reformulating Single-Image Relighting as Video Generation! We leverage video diffusion models to perform realistic near-field relighting from just a single image. No explicit 3D reconstruction or ray tracing required! No intermediate graphics buffers, directly in the pixel space! 📄 Dive into the paper: 🎥 Project page & demos: 🛠 Code coming soon! #GenerativeAI #ComputerVision #Relighting #DiffusionModels #Graphics 🧵 1/5

Haven Feng @ CVPR
22,427 views • 1 year ago
Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models
Contributions:
• We introduce Diffuman4D, a novel diffusion model that generates spatio-temporally consistent and high-resolution (1024p) human videos from sparse-view video inputs.
• We propose a sliding iterative denoising mechanism that enhances both the spatial and temporal consistency of generated long-term videos while maintaining efficient inference.
• We design a human pose conditioning scheme to enhance the appearance quality and motion accuracy of generated human videos.
• We plan to release our processed version of the DNA-Rendering dataset, which we believe will benefit future research in this area.

MrNeRF
24,729 views • 11 months ago
3d world abilities of AI models keep getting better, this is @Kling_AI 1.5 It can move around a space in 3d and you can steer its camera movements All based on a single input image

@levelsio
112,201 views • 1 year ago
Chop the gradients ✂️! We found that truncating decoder gradients in latent video diffusion to a fixed window allows us to finetune on videos with pixel-wise perceptual losses without running out of memory. Pixel losses have been essential for image generation and reconstruction, but until now, they haven't scaled to long-duration, high-resolution video diffusion due to recursive activation accumulation in causal decoders, leading to OOM during training 💥📉. Project: Video diffusion models can do a lot more 🚀 when you can backprop the decoder! Post-process neural rendered scenes, super-resolve videos, harmonize lighting in controlled synthetic driving scenes, and inpaint videos, all in a single step ⚡ with a quick finetune from a standard diffusion model.

Felix Heide
28,323 views • 2 months ago
NVIDIA AI Released DiffusionRenderer: An AI Model for Editable, Photorealistic 3D Scenes from a Single Video
In a groundbreaking new paper, researchers at NVIDIA, University of Toronto, Vector Institute and the University of Illinois Urbana-Champaign have unveiled a framework that directly tackles this challenge. DiffusionRenderer represents a revolutionary leap forward, moving beyond mere generation to offer a unified solution for understanding and manipulating 3D scenes from a single video. It effectively bridges the gap between generation and editing, unlocking the true creative potential of AI-driven content. DiffusionRenderer treats the “what” (the scene’s properties) and the “how” (the rendering) in one unified framework built on the same powerful video diffusion architecture that underpins models like Stable Video Diffusion. Read full article here: Paper: GitHub Page:

Marktechpost AI Dev News ⚡
104,741 views • 11 months ago
NVIDIA just released a very impressive text-to-video paper. Video Latent Diffusion Models (Video LDMs) use a diffusion model in a compressed latent space to generate high-resolution videos. Here's a brief overview of how it works:
1. Pre-train image LDM on a dataset of images.
2. Turn the image LDM into a Video LDM by adding temporal layers to model video frames.
3. Fine-tune the Video LDM on encoded video sequences to create a video generator.
4. Temporally align diffusion model upsamplers to generate high-resolution videos.
5. Validate Video LDM on real driving videos of 512x1024 resolution, achieving state-of-the-art performance.
6. Apply the approach in creative content creation with text-to-video modeling.
Paper: Project:

Lior Alexander
158,539 views • 3 years ago
the fact that i can take an image of a room and turn it into a 3d model in one shot is actually insane this took like 30 seconds from image to 3d model

Jan
131,884 views • 7 months ago
✨ Made a new mini feature on Photo AI: [ Grab from 3d model ] So the problem is we're at that stage in time (typical for AI) where image-to-3d models are not good enough but are fun to play with, but we know they'll be good enough in 1-2 years With [ Make 3d model ] you already can turn any Photo AI pic into a 3d model but it still looks hyper clunky and deformed, but it works! One cool idea I had to make that more useful and made now: Let people make a 3d model then change the view of it with the 3d viewer, then press [ o ] and it grabs a frame of the 3d That image you can then [ Remix ] (img2img), and it becomes a real photo again and that in turn you can then turn into a video again with [ Make video ] So that essentially gives you a fully freeform camera position control to take photos with One thing I need to fix is the background/skybox, I kinda need to take the original photo and remove the person and just get the background for the 3d model viewer, in this case it should be white, but it's a start!

@levelsio
119,210 views • 11 months ago
(1/2) Check out "𝐏𝐨𝐥𝐲𝐃𝐢𝐟𝐟: Generating 3D Polygonal Meshes with Diffusion Models"! Our model operates directly on the polygons of 3D meshes and generates novel shapes as output through an iterative diffusion process.

Matthias Niessner
57,940 views • 2 years ago
(1/3) Can we turn text-to-image models into photorealistic 3D generators? ViewDiff (#CVPR2024) produces realistic, multi-view consistent images of real-world 3D objects in authentic surroundings. Website Video How does it work?

Matthias Niessner
34,751 views • 2 years ago
The Stable Video Diffusion model just dropped 🔥 The new model supports:
– Text-to-Video
– Image-to-Video
– 14 or 25 frames at 576 x 1024
– Multi-View Generation
– Frame Interpolation
– 3D Scene Understanding
– Camera Control via LoRA
Paper: Code: SVD model: SVD-XT model:

Dreaming Tulpa 🥓👑
457,688 views • 2 years ago
Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation
paper page:
Large text-to-image diffusion models have exhibited impressive proficiency in generating high-quality images. However, when applying these models to video domain, ensuring temporal consistency across video frames remains a formidable challenge. This paper proposes a novel zero-shot text-guided video-to-video translation framework to adapt image models to videos. The framework includes two parts: key frame translation and full video translation. The first part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures and colors. The second part propagates the key frames to other frames with temporal-aware patch matching and frame blending. Our framework achieves global style and local texture temporal consistency at a low cost (without re-training or optimization). The adaptation is compatible with existing image diffusion techniques, allowing our framework to take advantage of them, such as customizing a specific subject with LoRA, and introducing extra spatial guidance with ControlNet. Extensive experimental results demonstrate the effectiveness of our proposed framework over existing methods in rendering high-quality and temporally-coherent videos.

AK
375,080 views • 3 years ago