An Object is Worth 64x64 Pixels: Generating 3D Object via Image Diffusion

We introduce a new approach for generating realistic 3D models with UV maps through a representation termed "Object Images." This approach encapsulates surface geometry, appearance, and patch structures within a 64x64 pixel image, effectively converting complex 3D shapes into a more manageable 2D format. By doing so, we address the challenges of both geometric and semantic irregularity inherent in polygonal meshes. This method allows us to use image generation models, such as Diffusion Transformers, directly for 3D shape generation. Evaluated on the ABO dataset, our generated shapes with patch structures achieve point cloud FID comparable to recent 3D generative models, while naturally supporting PBR material generation.
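To make the idea of storing geometry in a small image concrete, here is a minimal sketch of reading a point cloud back out of such an image. The channel layout (x, y, z, occupancy), the 0.5 threshold, and the function name `object_image_to_points` are assumptions made for illustration; they are not taken from the paper or its released code.

```python
import numpy as np


def object_image_to_points(img: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Return an (N, 3) array of xyz positions for the occupied pixels.

    Assumed layout (hypothetical): img has shape (H, W, 4) with
    channels 0..2 = xyz surface position, channel 3 = occupancy.
    """
    if img.ndim != 3 or img.shape[2] != 4:
        raise ValueError("expected an (H, W, 4) array")
    occupied = img[..., 3] > threshold   # (H, W) boolean mask
    return img[..., :3][occupied]        # keep xyz only where occupied


# Toy 64x64 image: one flat 16x16 patch lying in the z = 0 plane.
img = np.zeros((64, 64, 4), dtype=np.float32)
u, v = np.meshgrid(np.linspace(0, 1, 16), np.linspace(0, 1, 16), indexing="ij")
img[8:24, 8:24, 0] = u
img[8:24, 8:24, 1] = v
img[8:24, 8:24, 3] = 1.0                 # mark the patch as occupied

points = object_image_to_points(img)
print(points.shape)
```

Because the representation is an ordinary fixed-size array, an image generator can produce it directly, and pixels outside any patch are simply masked out when the shape is recovered.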

AK

509,279 subscribers

66,435 views • 1 year ago • via X (Twitter)

Science & Technology, Education

9 comments

Jaakko Lehtinen • 1 year ago

Classic Geometry Images make a comeback?

Darwin Miller • 1 year ago

SO NO CODE?

Xin Yu (Andy) • 1 year ago

what? Amazing

Genia Cheskidova • 1 year ago

It's a smart idea, I love it.

Nicolas • 1 year ago

It's a great idea!

Raviv Wolfe • 1 year ago

Absolutely fascinating the new ways these tools continue to be applied to problem solving

Power Of AI • 1 year ago

So cool 👍🏻👏🏻

Chazz Gold • 1 year ago

I “think” I understand what I just read

𝕏ingguang Yan • 1 year ago

Thanks for the post! The code and data is now live at:

Related videos

V3D: Video Diffusion Models are Effective 3D Generators. Automatic 3D generation has recently attracted widespread attention. Recent methods have greatly accelerated the generation speed, but usually produce less-detailed objects due to limited model capacity or 3D data. Motivated by recent advancements in video diffusion models, we introduce V3D, which leverages the world simulation capacity of pre-trained video diffusion models to facilitate 3D generation. To fully unleash the potential of video diffusion to perceive the 3D world, we further introduce a geometrical consistency prior and extend the video diffusion model to a multi-view consistent 3D generator. Benefiting from this, the state-of-the-art video diffusion model could be fine-tuned to generate 360-degree orbit frames surrounding an object given a single image. With our tailored reconstruction pipelines, we can generate high-quality meshes or 3D Gaussians within 3 minutes. Furthermore, our method can be extended to scene-level novel view synthesis, achieving precise control over the camera path with sparse input views. Extensive experiments demonstrate the superior performance of the proposed approach, especially in terms of generation quality and multi-view consistency.

AK

31,997 views • 2 years ago

MVDream: Multi-view Diffusion for 3D Generation. We propose MVDream, a multi-view diffusion model that is able to generate geometrically consistent multi-view images from a given text prompt. By leveraging image diffusion models pre-trained on large-scale web datasets and a multi-view dataset rendered from 3D assets, the resulting multi-view diffusion model can achieve both the generalizability of 2D diffusion and the consistency of 3D data. Such a model can thus be applied as a multi-view prior for 3D generation via Score Distillation Sampling, where it greatly improves the stability of existing 2D-lifting methods by solving the 3D consistency problem. Finally, we show that the multi-view diffusion model can also be fine-tuned under a few-shot setting for personalized 3D generation, i.e., the DreamBooth3D application, where the consistency can be maintained after learning the subject identity.

AK

294,442 views • 2 years ago

DreamCraft3D: Hierarchical 3D Generation with Bootstrapped Diffusion Prior. We present DreamCraft3D, a hierarchical 3D content generation method that produces high-fidelity and coherent 3D objects. We tackle the problem by leveraging a 2D reference image to guide the stages of geometry sculpting and texture boosting. A central focus of this work is to address the consistency issue that existing works encounter. To sculpt geometries that render coherently, we perform score distillation sampling via a view-dependent diffusion model. This 3D prior, alongside several training strategies, prioritizes geometry consistency but compromises texture fidelity. We further propose Bootstrapped Score Distillation to specifically boost the texture. We train a personalized diffusion model, Dreambooth, on the augmented renderings of the scene, imbuing it with 3D knowledge of the scene being optimized. The score distillation from this 3D-aware diffusion prior provides view-consistent guidance for the scene. Notably, through an alternating optimization of the diffusion prior and 3D scene representation, we achieve mutually reinforcing improvements: the optimized 3D scene aids in training the scene-specific diffusion model, which offers increasingly view-consistent guidance for 3D optimization. The optimization is thus bootstrapped and leads to substantial texture boosting. With tailored 3D priors throughout the hierarchical generation, DreamCraft3D generates coherent 3D objects with photorealistic renderings, advancing the state of the art in 3D content generation.

AK

161,530 views • 2 years ago

LLaVA-3D A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D-awareness for 3D scene understanding has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we employ a simple yet effective representation, 3D Patch, which connects 2D CLIP patch features with their corresponding positions in 3D space. By integrating the 3D Patches into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D image understanding and 3D scene understanding. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains comparable 2D image understanding and vision-language conversation capabilities with LLaVA.

AK

41,736 views • 1 year ago

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models. Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

AK

83,657 views • 3 years ago

📢WorldAgents: 3D worlds only from 2D image models - without any training! We propose an agentic approach with a Director (VLM) to plan the scene, a Generator (Flux or NanoBanana) for new views, and a Verifier (VLM) for selection / 3D consistency. -> High-fidelity 3D worlds from a single text prompt. What's remarkable: our agents find consistent views from 2D image models to obtain 3D-consistent worlds; this shows that image models contain world priors - agents just need to find them! Great work by Ziya Erkoç Angela Dai

Matthias Niessner

18,976 views • 4 months ago

Large-scale 3D Scene Generation (all scenes are real-time rendered)!! Physically-grounded generative data without hallucinations is the missing link for robot learning and testing at scale. We introduce a method that directly generates large-scale 3D driving scenes with accurate geometry, allowing for causal view synthesis and generation with object permanence and explicit 3D geometry. This also allows for extreme trajectory extrapolation without failure! We also show that we can build fully data-driven simulators for end-to-end learning with this approach. Project: with the amazing team of Julian Ost, Amogh Joshi , Andrea Ramazzina, Maximilian Bömer, Mario Bijelic.

Felix Heide

27,779 views • 10 months ago

Blended-NeRF: Zero-Shot Object Generation and Blending in Existing Neural Radiance Fields. Editing a local region or a specific object in a 3D scene represented by a NeRF is challenging, mainly due to the implicit nature of the scene representation. Consistently blending a new realistic object into the scene adds an additional level of difficulty. We present Blended-NeRF, a robust and flexible framework for editing a specific region of interest in an existing NeRF scene, based on text prompts or image patches, along with a 3D ROI box. Our method leverages a pretrained language-image model to steer the synthesis towards a user-provided text prompt or image patch, along with a 3D MLP model initialized on an existing NeRF scene to generate the object and blend it into a specified region in the original scene. We allow local editing by localizing a 3D ROI box in the input scene, and seamlessly blend the content synthesized inside the ROI with the existing scene using a novel volumetric blending technique. To obtain natural-looking and view-consistent results, we leverage existing and new geometric priors and 3D augmentations for improving the visual fidelity of the final result. We test our framework both qualitatively and quantitatively on a variety of real 3D scenes and text prompts, demonstrating realistic multi-view consistent results with much flexibility and diversity compared to the baselines. Finally, we show the applicability of our framework for several 3D editing applications, including adding new objects to a scene, removing/replacing/altering existing objects, and texture conversion.

AK

62,768 views • 3 years ago

📢 Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation Got only one or a few images and wondering if recovering the 3D environment is a reconstruction or generation problem? Why not do it with a generative reconstruction model! We show that a camera-conditioned video diffusion model can be transformed into a generative reconstruction model that directly outputs a high-quality 3D Gaussian Splatting representation through self-distillation, without requiring real-world training data. Check out our results in the video (wait for dynamic scenes in the second half!) : Project Page: Code and Models: Paper:

Sherwin Bahmani

66,590 views • 10 months ago

SuperGSeg: Open-Vocabulary 3D Segmentation with Structured Super-Gaussians

Contributions:
• We propose SuperGSeg: a 3D segmentation method with neural Gaussians, designed to learn hierarchical instance segmentation features from 2D foundation models.
• We introduce the concept of Super-Gaussian, a novel representation that integrates hierarchical instance segmentation features, enabling the embedding of high-dimensional language features. This approach addresses previously unfeasible challenges in representing complex scenes with rich semantic details.
• Extensive experiments on the LERF-OVS and ScanNet datasets demonstrate the effectiveness of the proposed method, achieving significant improvements in open-vocabulary 3D object-level and scene-level semantic segmentation. It shows particular strength in capturing fine-grained scene details and dense pixel semantic segmentation tasks for the first time.

MrNeRF

13,594 views • 1 year ago

Phoenix GenAI’s new generative AI model Image-to-3D is now online. Create high fidelity 3D models importable to Unreal Engine 5 or Unity simply by sending an original image into PhoenixLLM. Better yet, generate the source image using Phoenix GenAI’s Flux and feed it into Image-to-3D. This release marks yet another upgrade of GenAI’s arsenal of capabilities, getting it ready for multi-workflow GenAI agents, in which users will be able to combine text-to-image, image-to-prompt, text-to-video, text-to-3D, and image-to-3D into complex multi-step workflows with simple commands via PhoenixLLM. Image-to-3D is yet another addition to Phoenix’s Vertical AI Solutions for gaming, content, and metaverse. Users are able to use it as a Phoenix-native alternative to SkyNet AI Marketplace’s Tripo Integration earlier this year. #Phoenix $PHB

Phoenix AI

30,853 views • 1 year ago

In Prompt Engineering for Vision Models, taught by Abby Jacques Verre and Caleb Kaiser of Comet , you’ll learn how to prompt and fine-tune vision models for personalized image generation, image editing, object detection and segmentation. The prompts you'll use for vision models could be text, point coordinates, or bounding boxes, depending on the model. You'll also learn to tune hyperparameters to shape the output. Models you'll use include Segment-Anything Model (SAM), OWL-ViT, and Stable Diffusion. You'll also learn to fine-tune Stable Diffusion to generate personalized images (say, an image of a specific person), using a handful of images for training. As an example of a multi-step workflow, you'll use OWL-ViT to detect an object based on a text prompt, then pass the bounding box to SAM to create a segmentation mask, and input that mask into Stable Diffusion to replace the original object with a new one based on a text prompt. Controlling vision models can be tricky; this course will teach prompting and fine-tuning techniques to get precise control over their output. Get started here:

Andrew Ng

151,198 views • 2 years ago

📢Pixel3DMM: Versatile Screen-Space Priors for Single-Image 3D Face Reconstruction📢 -> highly accurate face reconstruction by training powerful VITs via surface normals and UV-coordinates estimation. The geometric cues from our 2D foundation model backbone constrain the 3DMM parameters, which allows us to achieve remarkable reconstruction accuracy - works for both single image and videos! In addition, we introduce a new 3D face reconstruction benchmark that evaluates both neutral and posed face geometry. 🌍 📷 Great work by Simon Giebenhain Tobias Kirschstein Martin Rünz Lourdes Agapito

Matthias Niessner

62,269 views • 1 year ago

📢 IntrinsiX: High-Quality PBR Generation using Image Priors 📢 From text input, we generate renderable PBR maps! Next to editable image generation, our predictions can be distilled into room-scale scenes using SDS for large-scale PBR texture generation. We first train separate LoRA modules for the intrinsic properties of albedo, rough/metal, normal. Then, we introduce cross-intrinsic attention using a rerendering loss with importance-weighted light sampling to enable coherent PBR generation. Our method outperforms text -> image -> PBR methods both in generalization and quality, since directly generating PBR maps does not suffer from the inherent ambiguity of intrinsic image decomposition. In addition, our design choice facilitates SDS-based PBR texture distillation. 🌍 🎥 Great work by Peter Kocsis, Lukas Höllein

Matthias Niessner

21,891 views • 1 year ago

Collaborative Score Distillation for Consistent Visual Synthesis. Generative priors of large-scale text-to-image diffusion models enable a wide range of new generation and editing applications on diverse visual modalities. However, when adapting these priors to complex visual modalities, often represented as multiple images (e.g., video), achieving consistency across a set of images is challenging. In this paper, we address this challenge with a novel method, Collaborative Score Distillation (CSD). CSD is based on Stein Variational Gradient Descent (SVGD). Specifically, we propose to consider multiple samples as "particles" in the SVGD update and combine their score functions to distill generative priors over a set of images synchronously. Thus, CSD facilitates seamless integration of information across 2D images, leading to a consistent visual synthesis across multiple samples. We show the effectiveness of CSD in a variety of tasks, encompassing the visual editing of panorama images, videos, and 3D scenes. Our results underline the competency of CSD as a versatile method for enhancing inter-sample consistency, thereby broadening the applicability of text-to-image diffusion models.

AK

33,500 views • 3 years ago

Adobe.. has released a tool that combines Generative Gaussian Splats with a Diffusion layer and it's not all over the internet? WHAT IS GOING ON :D I had to test this out ofc! Substance 3D viewer, the new 3d viewer just released by Adobe, not only supports viewing of a large, large list of 3d formats, but it also supports generating of 3d models as Gaussian Splats. Combine that with the built-in 3d to image functionality and you basically have at least a few of the steps I've been doing for the last year or so using multiple tools. Exciting stuff! The new Photoshop Beta also supports the 3D viewer and you can import 3d files directly in photoshop as smart objects linked to the viewer. The elements you prompt in based on your models, can be easily exported to your clipboard without backgrounds making it very easy to use this in your normal image editing workflows. I have a suspicion that we'll see much more of this in other Adobe tools and I'm very curious to see this used for things like this in more 3d tools than "just" Adobe Neo (which is fantastic btw). #adobe #art #gaussiansplatting

Martin Nebelong

280,005 views • 1 year ago

Here's a look behind the scenes of how I used a 3d model, generated from an image, as a "driver" for ai animation. In the near future we'll see more 3d tools that support these types of workflows, with much more emphasis on enhancing user input instead of just letting the ai do everything. I used the 3d model to generate keyframes that I then animated in Luma Dream. By using a 3d model with a diffusion "layer" done with Krea, I got quite high quality frames with a relatively high consistency as well since the AI didn't have to "hallucinate" that much. I upscaled and detailed the frames using Magnific. Using semi-detailed 3d models as drivers for gen ai is very powerful, and it allows us to use models that are sculpted in a more free-flowing type of workflow that doesn't rely so much on high surface detailing or very time-consuming finish, but instead relies more on gestural 3d sculpting. #art #ai

Martin Nebelong

48,200 views • 2 years ago