Introducing "Diffusion with Forward Models", a model that can generate diverse, real 3D scenes from a single image, trained with images w/o any 3D data! 1/n
88,712 views • 2 years ago • via X (Twitter)

Work done with @_atewari, Tianwei Yin, @GCazenavette, & @eigenstate, collaborating with Fredo Durand, Bill Freeman, Josh Tenenbaum, at my Scene Representation Group @MIT_CSAIL. Ayush and I have been working on this for more than a year - he did amazing work here!! 2/n

Conventional, non-probabilistic models such as pixelNeRF that reconstruct a 3D scene from a single image generate blurry results for any parts of the scene that were not observed in the input image. 3/n

As a diffusion model, our model instead parameterizes the *distribution* of 3D scenes that are consistent with a single image, and can thus sample plausible 3D scenes in the form of radiance fields! 4/n

Recent diffusion models for novel view synthesis (GeNVS, SparseFusion, etc.) learn to sample from the distribution of *novel views* given context images. However, that is not what we are generally interested in. We want to directly sample from the distribution of 3D scenes! 5/n

This is difficult, b/c we never observe ground-truth 3D scenes - we only observe 2D images! We propose a new diffusion model that can nevertheless learn to directly generate 3D scenes, by integrating the differentiable renderer into each denoising step. 6/n
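
A minimal sketch of what "integrating the renderer into each denoising step" can look like, under loud assumptions: the toy linear `forward_model` stands in for a differentiable renderer, the linear `denoiser` stands in for a neural network, and the noise schedule, names, and dimensions are all hypothetical and not taken from the paper's code. The point it illustrates is only that the denoiser outputs the hidden signal, while the loss is computed on the rendered observation, so no ground-truth signal is ever needed.

```python
import numpy as np

rng = np.random.default_rng(0)
SIGNAL_DIM, OBS_DIM = 8, 4

# Hypothetical lossy forward model: a fixed linear projection standing in
# for a differentiable renderer (hidden signal -> lower-dimensional observation).
A = rng.normal(size=(OBS_DIM, SIGNAL_DIM))

def forward_model(signal):
    return A @ signal

def denoiser(noisy_obs, context_obs, t, W):
    """Toy stand-in for a network: predicts the hidden signal, not the observation."""
    features = np.concatenate([noisy_obs, context_obs, [t]])
    return W @ features

def training_loss(W, target_obs, context_obs, t):
    """One denoising step with the forward model inside it."""
    alpha = 1.0 - t  # toy noise schedule, assumed for illustration
    noise = rng.normal(size=target_obs.shape)
    noisy_obs = np.sqrt(alpha) * target_obs + np.sqrt(1.0 - alpha) * noise
    signal_hat = denoiser(noisy_obs, context_obs, t, W)  # estimate the hidden signal
    obs_hat = forward_model(signal_hat)                  # "render" the estimate
    # Supervision happens in observation space only.
    return float(np.mean((obs_hat - target_obs) ** 2))

hidden_signal = rng.normal(size=SIGNAL_DIM)  # never shown to the model
target_obs = forward_model(hidden_signal)
context_obs = target_obs + 0.1 * rng.normal(size=OBS_DIM)  # stand-in for a second view
W = 0.1 * rng.normal(size=(SIGNAL_DIM, 2 * OBS_DIM + 1))
loss = training_loss(W, target_obs, context_obs, t=0.5)
print(f"observation-space loss: {loss:.4f}")
```

In a real setup the parameters would be trained by backpropagating this loss through the forward model with an autodiff framework; NumPy is used here only to keep the sketch self-contained.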

This enables us to solve a truly long-standing problem that I've attempted again and again over the years: Given just a single image, we can directly sample hundreds of 3D scenes consistent with that image - no post-processing (=Score Distillation) necessary!! 7/n

This works on *real-world* scenes in RealEstate10k and Co3D, and significantly outperforms score-distillation-based approaches! This is the first time that any 3D generative model trained with images can sample from the distribution of such complex 3D scenes! 8/n

The samples are *truly* diverse. Note that each sample here is a full radiance field, from which you could - at any point - extract the point cloud. And they vary widely in the unobserved regions! 9/n

It turns out that there is a whole class of problems, often referred to as "Stochastic Inverse Problems", where we are interested in modeling signals observed only through lossy forward models. 10/n
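
One way to write this setting down, with notation assumed here for illustration and not taken from the thread: a hidden signal $S$ (e.g. a 3D scene) is only ever seen through a known forward model with parameters $\phi$ (e.g. a renderer and a camera pose). Because the forward model is lossy, many signals explain the same observation, so the goal is to sample from a distribution over signals instead of predicting a single one.

```latex
O = \mathrm{forward}(S, \phi), \qquad
\text{goal: sample } S \sim p\left(S \mid O^{\mathrm{ctxt}}\right)
\text{ while training only on observations } O .
```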

In the paper, we prototype two more applications to make this point: sampling from the distributions over plausible motions of an image, trained end-to-end from video, and probabilistic GAN inversion! 11/n

However, there is a whole wealth of problems across science and engineering that require probabilistic inversion of a known forward model! 12/n

To wrap up - we think that this is a significant step forward not only for generative modeling, but also for self-supervised training of 3D foundation models. Generating plausible 3D scenes means that our model receives plausible gradients for unobserved regions! 13/n

We'd also like to highlight concurrent work by our friends at Oxford VGG, Viewset Diffusion, which has some related ideas and looks great! 14/n

More to come, stay tuned! 15/n

You can watch me talk about the paper here:

Code is out now:
