正在加载视频...

视频加载失败

加载此视频时出现问题。这可能是由于临时网络问题，或视频可能不可用。

📢 Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation Got only one or a few images and wondering if recovering the 3D environment is a reconstruction or generation problem? Why not do it with a generative reconstruction model! We show that a camera-conditioned video diffusion model can... show more

Sherwin Bahmani

1,388 subscribers

66,590 次观看 • 9 个月前 •via X (Twitter)

科学技术新闻政治教育

Anya Rossi• Live Now

Private livecam show

0 条评论

暂无评论

原始帖子的评论将显示在这里

相关视频

MVDream: Multi-view Diffusion for 3D Generation paper page: propose MVDream, a multi-view diffusion model that is able to generate geometrically consistent multi-view images from a given text prompt. By leveraging image diffusion models pre-trained on large-scale web datasets and a multi-view dataset rendered from 3D assets, the resulting multi-view diffusion model can achieve both the generalizability of 2D diffusion and the consistency of 3D data. Such a model can thus be applied as a multi-view prior for 3D generation via Score Distillation Sampling, where it greatly improves the stability of existing 2D-lifting methods by solving the 3D consistency problem. Finally, we show that the multi-view diffusion model can also be fine-tuned under a few shot setting for personalized 3D generation, i.e. DreamBooth3D application, where the consistency can be maintained after learning the subject identity.

MVDream: Multi-view Diffusion for 3D Generation paper page: propose MVDream, a multi-view diffusion model that is able to generate geometrically consistent multi-view images from a given text prompt. By leveraging image diffusion models pre-trained on large-scale web datasets and a multi-view dataset rendered from 3D assets, the resulting multi-view diffusion model can achieve both the generalizability of 2D diffusion and the consistency of 3D data. Such a model can thus be applied as a multi-view prior for 3D generation via Score Distillation Sampling, where it greatly improves the stability of existing 2D-lifting methods by solving the 3D consistency problem. Finally, we show that the multi-view diffusion model can also be fine-tuned under a few shot setting for personalized 3D generation, i.e. DreamBooth3D application, where the consistency can be maintained after learning the subject identity.

AK

294,442 次观看 • 2 年前

"YoNoSplat: You Only Need One Model for Feedforward 3D Gaussian Splatting" TL;DR: a unified 3D Gaussian splatting model that reconstructs high-quality scene geometry and camera poses from unposed/uncalibrated images in a single forward pass.

"YoNoSplat: You Only Need One Model for Feedforward 3D Gaussian Splatting" TL;DR: a unified 3D Gaussian splatting model that reconstructs high-quality scene geometry and camera poses from unposed/uncalibrated images in a single forward pass.

Alexandre Morgand

15,035 次观看 • 5 个月前

Current 3D generative models are slow and low quality. We present GRM, a large-scale model that reconstructs 3D Gaussians in 0.1s and generates high-quality 3D assets from text or single images in a few seconds. Demo: 1/4

Current 3D generative models are slow and low quality. We present GRM, a large-scale model that reconstructs 3D Gaussians in 0.1s and generates high-quality 3D assets from text or single images in a few seconds. Demo: 1/4

Gordon Wetzstein

19,210 次观看 • 2 年前

📢📢GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction📢📢 Reconstructing high-fidelity 3D scenes from sparse RGB input is hard. It needs a strong 3D prior! We reformulate multi-view scene reconstruction as conditional 3D generation over overlapping spatial chunks, lifting posed image features into a generative shape prior via 3D conditioning. As an example prior, we build on Trellis2, and train it such that its reconstruction is pixel aligned and matches from all views. GenRecon achieves unprecedented reconstruction quality from any sparse RGB input sequence, even from a phone capture. The reconstruction also includes PBR materials which facilitates relighting and virtual object insertion. Amazing work by Katharina Schmid, Nicolas von Lützow, Jozef, Angela Dai

📢📢GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction📢📢 Reconstructing high-fidelity 3D scenes from sparse RGB input is hard. It needs a strong 3D prior! We reformulate multi-view scene reconstruction as conditional 3D generation over overlapping spatial chunks, lifting posed image features into a generative shape prior via 3D conditioning. As an example prior, we build on Trellis2, and train it such that its reconstruction is pixel aligned and matches from all views. GenRecon achieves unprecedented reconstruction quality from any sparse RGB input sequence, even from a phone capture. The reconstruction also includes PBR materials which facilitates relighting and virtual object insertion. Amazing work by Katharina Schmid, Nicolas von Lützow, Jozef, Angela Dai

Matthias Niessner

18,069 次观看 • 1 个月前

Spatio-Temporal Reconstruction Model for Large-Scale Outdoor Scenes Contributions: • We propose STORM, the first feed-forward, self-supervised method for fast and accurate reconstruction of dynamic 3D scenes from sparse, multi-timestep, posed camera images. • Our bottom-up framework aggregates and transforms per-frame 3D Gaussian Splats into a cohesive scene representation, enabling self-supervised motion estimation. Furthermore, we introduce motion tokens that capture common motion primitives and regularize motion predictions, facilitating dynamic motion group segmentation without explicit motion or correspondence supervision. • We present several enhancements for in-the-wild scenarios, including sky modeling, camera exposure inconsistency handling, large novel-view extrapolation, and fine-grained human motions reconstruction, making STORM well-suited for real-world applications.

Spatio-Temporal Reconstruction Model for Large-Scale Outdoor Scenes Contributions: • We propose STORM, the first feed-forward, self-supervised method for fast and accurate reconstruction of dynamic 3D scenes from sparse, multi-timestep, posed camera images. • Our bottom-up framework aggregates and transforms per-frame 3D Gaussian Splats into a cohesive scene representation, enabling self-supervised motion estimation. Furthermore, we introduce motion tokens that capture common motion primitives and regularize motion predictions, facilitating dynamic motion group segmentation without explicit motion or correspondence supervision. • We present several enhancements for in-the-wild scenarios, including sky modeling, camera exposure inconsistency handling, large novel-view extrapolation, and fine-grained human motions reconstruction, making STORM well-suited for real-world applications.

MrNeRF

53,292 次观看 • 1 年前

V3D Video Diffusion Models are Effective 3D Generators Automatic 3D generation has recently attracted widespread attention. Recent methods have greatly accelerated the generation speed, but usually produce less-detailed objects due to limited model capacity or 3D data. Motivated by recent advancements in video diffusion models, we introduce V3D, which leverages the world simulation capacity of pre-trained video diffusion models to facilitate 3D generation. To fully unleash the potential of video diffusion to perceive the 3D world, we further introduce geometrical consistency prior and extend the video diffusion model to a multi-view consistent 3D generator. Benefiting from this, the state-of-the-art video diffusion model could be fine-tuned to generate 360degree orbit frames surrounding an object given a single image. With our tailored reconstruction pipelines, we can generate high-quality meshes or 3D Gaussians within 3 minutes. Furthermore, our method can be extended to scene-level novel view synthesis, achieving precise control over the camera path with sparse input views. Extensive experiments demonstrate the superior performance of the proposed approach, especially in terms of generation quality and multi-view consistency

V3D Video Diffusion Models are Effective 3D Generators Automatic 3D generation has recently attracted widespread attention. Recent methods have greatly accelerated the generation speed, but usually produce less-detailed objects due to limited model capacity or 3D data. Motivated by recent advancements in video diffusion models, we introduce V3D, which leverages the world simulation capacity of pre-trained video diffusion models to facilitate 3D generation. To fully unleash the potential of video diffusion to perceive the 3D world, we further introduce geometrical consistency prior and extend the video diffusion model to a multi-view consistent 3D generator. Benefiting from this, the state-of-the-art video diffusion model could be fine-tuned to generate 360degree orbit frames surrounding an object given a single image. With our tailored reconstruction pipelines, we can generate high-quality meshes or 3D Gaussians within 3 minutes. Furthermore, our method can be extended to scene-level novel view synthesis, achieving precise control over the camera path with sparse input views. Extensive experiments demonstrate the superior performance of the proposed approach, especially in terms of generation quality and multi-view consistency

AK

31,997 次观看 • 2 年前

DreamCraft3D: Hierarchical 3D Generation with Bootstrapped Diffusion Prior paper page: present DreamCraft3D, a hierarchical 3D content generation method that produces high-fidelity and coherent 3D objects. We tackle the problem by leveraging a 2D reference image to guide the stages of geometry sculpting and texture boosting. A central focus of this work is to address the consistency issue that existing works encounter. To sculpt geometries that render coherently, we perform score distillation sampling via a view-dependent diffusion model. This 3D prior, alongside several training strategies, prioritizes the geometry consistency but compromises the texture fidelity. We further propose Bootstrapped Score Distillation to specifically boost the texture. We train a personalized diffusion model, Dreambooth, on the augmented renderings of the scene, imbuing it with 3D knowledge of the scene being optimized. The score distillation from this 3D-aware diffusion prior provides view-consistent guidance for the scene. Notably, through an alternating optimization of the diffusion prior and 3D scene representation, we achieve mutually reinforcing improvements: the optimized 3D scene aids in training the scene-specific diffusion model, which offers increasingly view-consistent guidance for 3D optimization. The optimization is thus bootstrapped and leads to substantial texture boosting. With tailored 3D priors throughout the hierarchical generation, DreamCraft3D generates coherent 3D objects with photorealistic renderings, advancing the state-of-the-art in 3D content generation.

DreamCraft3D: Hierarchical 3D Generation with Bootstrapped Diffusion Prior paper page: present DreamCraft3D, a hierarchical 3D content generation method that produces high-fidelity and coherent 3D objects. We tackle the problem by leveraging a 2D reference image to guide the stages of geometry sculpting and texture boosting. A central focus of this work is to address the consistency issue that existing works encounter. To sculpt geometries that render coherently, we perform score distillation sampling via a view-dependent diffusion model. This 3D prior, alongside several training strategies, prioritizes the geometry consistency but compromises the texture fidelity. We further propose Bootstrapped Score Distillation to specifically boost the texture. We train a personalized diffusion model, Dreambooth, on the augmented renderings of the scene, imbuing it with 3D knowledge of the scene being optimized. The score distillation from this 3D-aware diffusion prior provides view-consistent guidance for the scene. Notably, through an alternating optimization of the diffusion prior and 3D scene representation, we achieve mutually reinforcing improvements: the optimized 3D scene aids in training the scene-specific diffusion model, which offers increasingly view-consistent guidance for 3D optimization. The optimization is thus bootstrapped and leads to substantial texture boosting. With tailored 3D priors throughout the hierarchical generation, DreamCraft3D generates coherent 3D objects with photorealistic renderings, advancing the state-of-the-art in 3D content generation.

AK

161,530 次观看 • 2 年前

3D scene reconstructions by NVIDIA. ArtiFixer - repairs artifacts and extends sparse views via Wan 2.1. - high-fidelity inpainting in occluded regions - gens hundreds of consistent frames in a single pass - 3D Gaussian Splatting for navigable scene reconstruction Makes the 3D environment look photorealistic and fully navigable for VR/AR. It basically turns a broken 3D model into a polished, professional scene.

3D scene reconstructions by NVIDIA. ArtiFixer - repairs artifacts and extends sparse views via Wan 2.1. - high-fidelity inpainting in occluded regions - gens hundreds of consistent frames in a single pass - 3D Gaussian Splatting for navigable scene reconstruction Makes the 3D environment look photorealistic and fully navigable for VR/AR. It basically turns a broken 3D model into a polished, professional scene.

Wildminder

15,711 次观看 • 1 个月前

🚀Turn Single Image into 3D Human🚀 #GeneMAN is a generalizable single-image 3D human reconstruction framework that turns in-the-wild images into high-quality 3D humans with ease 🔗Project: 📜Paper: 🧑‍💻Code:

🚀Turn Single Image into 3D Human🚀 #GeneMAN is a generalizable single-image 3D human reconstruction framework that turns in-the-wild images into high-quality 3D humans with ease 🔗Project: 📜Paper: 🧑‍💻Code:

Ziwei Liu

26,953 次观看 • 1 年前

Synthesizing worlds with video diffusion models is often inconsistent — moving the camera back and forth leads to different scenes. We propose 🌐𝗪𝗼𝗿𝗹𝗱𝗠𝗲𝗺, a memory-based approach that ensures consistent world simulation without relying on explicit 3D reconstruction.

Synthesizing worlds with video diffusion models is often inconsistent — moving the camera back and forth leads to different scenes. We propose 🌐𝗪𝗼𝗿𝗹𝗱𝗠𝗲𝗺, a memory-based approach that ensures consistent world simulation without relying on explicit 3D reconstruction.

Xingang Pan

19,413 次观看 • 1 年前

📢📢 𝐀𝐯𝐚𝐭𝟑𝐫 📢📢 Avat3r creates high-quality 3D head avatars from just a few input images in a single forward pass with a new dynamic 3DGS reconstruction model. Video: Project: Our core idea is to make Gaussian Reconstruction Models animatable. We find that a simple cross-attention to an expression code sequence is already sufficient to model complex facial expressions. We then incorporate position maps from DUSt3R and feature maps from Sapiens to facilitate the prediction task. While DUSt3R's position maps act as a pixel-aligned initialization for the Gaussians' positions, the Sapiens feature maps help the cross-view transformer to match corresponding image tokens in the 4 input images. One major challenge in creating a 3D head avatar from smartphone images comes from inconsistent facial expressions when the subject could not remain perfectly static during the capture. We eliminate this static requirement by simply showing our model input images with different facial expressions during training. This technique makes our model robust to inconsistent input images later on. Finally, we show that despite the model has been trained with 4 input images, one can even create a 3D head avatar when only a single image is available. To achieve this, we employ a pre-trained 3D GAN to lift the single image to 3D and then render the 4 input images for our model. This allows us to create 3D head avatars from single images and even highly out-of-distribution examples like AI generated faces, paintings or statues. Great work by Tobias Kirschstein from his internship at Meta with Javier Romero, Artem Sevastopolsky, and Shunsuke Saito

Matthias Niessner

74,698 次观看 • 1 年前

SAM 3D enables accurate 3D reconstruction from a single image, supporting real-world applications in editing, robotics, and interactive scene generation. Matt, a SAM 3D researcher, explains how the two-model design makes this possible for both people and complex environments. 🔗 Read the SAM 3D Objects research paper: 🔗 Read the SAM 3D Body research paper:

SAM 3D enables accurate 3D reconstruction from a single image, supporting real-world applications in editing, robotics, and interactive scene generation. Matt, a SAM 3D researcher, explains how the two-model design makes this possible for both people and complex environments. 🔗 Read the SAM 3D Objects research paper: 🔗 Read the SAM 3D Body research paper:

AI at Meta

17,858 次观看 • 8 个月前

(1/N) Will this be the BERT/GPT moment for 3D vision？ Finally, unsupervised pre-training for 3D works. Led by Qitao Zhao , we present E-RayZer — a fully self-supervised 3D reconstruction model that: 🔥Matches or surpasses supervised methods like VGGT 👀Learns transferable 3D representations, outperforming CroCo, VideoMAE, and DINO 📈Scales with more unlabeled data A new recipe for scalable 3D foundation models.

(1/N) Will this be the BERT/GPT moment for 3D vision？ Finally, unsupervised pre-training for 3D works. Led by Qitao Zhao , we present E-RayZer — a fully self-supervised 3D reconstruction model that: 🔥Matches or surpasses supervised methods like VGGT 👀Learns transferable 3D representations, outperforming CroCo, VideoMAE, and DINO 📈Scales with more unlabeled data A new recipe for scalable 3D foundation models.

Hanwen Jiang

58,093 次观看 • 7 个月前

We’re open-sourcing HY-World 2.0, a multimodal world model that generates, reconstructs, and simulates interactive *3D worlds* from text, images, and videos. Outputs can be integrated into game engines and embodied simulation pipelines. Key highlights: 🔹 One-click world generation Turn text or image into interactive 3D worlds automatically. 🔹 Pipeline-ready 3D outputs Editable 3D worlds for Unity and Unreal Engine, with standard 3D exports including mesh, 3DGS, and point clouds. 🔹 Unified world model system One model family for world generation and reconstruction across synthetic and real-world scenes. 🔹 Interactive character mode Explore generated 3D worlds in real time with physics-aware movement and collision support. ✨ Apply for access: 🔗 GitHub: 🤗 Hugging Face: 📄 Technical Report:

We’re open-sourcing HY-World 2.0, a multimodal world model that generates, reconstructs, and simulates interactive 3D worlds from text, images, and videos. Outputs can be integrated into game engines and embodied simulation pipelines. Key highlights: 🔹 One-click world generation Turn text or image into interactive 3D worlds automatically. 🔹 Pipeline-ready 3D outputs Editable 3D worlds for Unity and Unreal Engine, with standard 3D exports including mesh, 3DGS, and point clouds. 🔹 Unified world model system One model family for world generation and reconstruction across synthetic and real-world scenes. 🔹 Interactive character mode Explore generated 3D worlds in real time with physics-aware movement and collision support. ✨ Apply for access: 🔗 GitHub: 🤗 Hugging Face: 📄 Technical Report:

Tencent Hy

370,675 次观看 • 3 个月前

HunyuanWorld-Voyager is here and fully open-source! The world’s first ultra-long-range world model with native 3D reconstruction, redefining AI-driven spatial intelligence for VR, gaming, and simulations. ✅Direct 3D Output: Exports point cloud videos to 3D formats without tools like COLMAP, enabling instant 3D application use. ✅Innovative 3D Memory: Introduces a scalable world caching mechanism, ensuring geometric consistency across any camera trajectory. ✅Top-Ranked Performance: #1 on Stanford’s WorldScore, excelling in video generation and 3D reconstruction benchmarks.( Built on HunyuanWorld 1.0, Voyager blends video generation with 3D modeling, delivering camera-controlled, high-fidelity RGB-D sequences. Control scenes via keyboard or joystick for unmatched 3D consistency. Explore now: 🌐Project Page: 🔗GitHub: 🤗HuggingFace: 📝Technical Details:

HunyuanWorld-Voyager is here and fully open-source! The world’s first ultra-long-range world model with native 3D reconstruction, redefining AI-driven spatial intelligence for VR, gaming, and simulations. ✅Direct 3D Output: Exports point cloud videos to 3D formats without tools like COLMAP, enabling instant 3D application use. ✅Innovative 3D Memory: Introduces a scalable world caching mechanism, ensuring geometric consistency across any camera trajectory. ✅Top-Ranked Performance: #1 on Stanford’s WorldScore, excelling in video generation and 3D reconstruction benchmarks.( Built on HunyuanWorld 1.0, Voyager blends video generation with 3D modeling, delivering camera-controlled, high-fidelity RGB-D sequences. Control scenes via keyboard or joystick for unmatched 3D consistency. Explore now: 🌐Project Page: 🔗GitHub: 🤗HuggingFace: 📝Technical Details:

Tencent Hy

198,289 次观看 • 10 个月前

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models paper page: github: Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models paper page: github: Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

AK

83,657 次观看 • 3 年前

🚀 Our NeurIPS '24 work, Large Spatial Model (LSM), is here! LSM performs semantic 3D reconstruction in just 0.1s, processing unposed data via feed-forward 3D reconstruction. 👉It leverages large-scale 3D datasets with minimal annotations, defining a 3D latent space. We are continuously exploring how this explicit 3D representation can further enhance reasoning and robotic learning. 🔗 Try our online Gradio demo with your own data at #NeurIPS2024 #3DReconstruction

🚀 Our NeurIPS '24 work, Large Spatial Model (LSM), is here! LSM performs semantic 3D reconstruction in just 0.1s, processing unposed data via feed-forward 3D reconstruction. 👉It leverages large-scale 3D datasets with minimal annotations, defining a 3D latent space. We are continuously exploring how this explicit 3D representation can further enhance reasoning and robotic learning. 🔗 Try our online Gradio demo with your own data at #NeurIPS2024 #3DReconstruction

Zhiwen(Aaron) Fan

43,871 次观看 • 1 年前

New open-source 3D world-generation model. I'm rendering a couple of worlds in the video, so check it out. You'll find the GitHub and the Hugging Face links to the model below. This is a multi-modal world model that you can use for a bunch of things: • To generate new worlds • To reconstruct worlds • To simulate 3D interactive worlds from a prompt, images, or a video You can edit the 3D outputs in Unity and Unreal Engine (they export as meshes, 3DGS files, and point clouds). You can also generate 3D characters in the world and walk around. Pretty fun stuff!

New open-source 3D world-generation model. I'm rendering a couple of worlds in the video, so check it out. You'll find the GitHub and the Hugging Face links to the model below. This is a multi-modal world model that you can use for a bunch of things: • To generate new worlds • To reconstruct worlds • To simulate 3D interactive worlds from a prompt, images, or a video You can edit the 3D outputs in Unity and Unreal Engine (they export as meshes, 3DGS files, and point clouds). You can also generate 3D characters in the world and walk around. Pretty fun stuff!

Santiago

65,446 次观看 • 3 个月前

📢📢 𝐏𝐞𝐫𝐜𝐇𝐞𝐚𝐝: 𝐏𝐞𝐫𝐜𝐞𝐩𝐭𝐮𝐚𝐥 𝐇𝐞𝐚𝐝 𝐌𝐨𝐝𝐞𝐥 𝐟𝐨𝐫 𝐒𝐢𝐧𝐠𝐥𝐞-𝐈𝐦𝐚𝐠𝐞 𝟑𝐃 𝐇𝐞𝐚𝐝 𝐑𝐞𝐜𝐨𝐧𝐬𝐭𝐫𝐮𝐜𝐭𝐢𝐨𝐧 & 𝐄𝐝𝐢𝐭𝐢𝐧𝐠📢📢 PercHead reconstructs realistic 3D heads from a single image and enables disentangled 3D editing via geometric controls and style inputs from images or text. At its core is a generalized 3D head decoder trained with perceptual supervision from DINOv2 and SAM 2.1. We find that our new perceptual loss formulation improves reconstruction fidelity compared to commonly-used methods such as LPIPS. Our trained reconstruction model is able to generate 3D-consistent heads from a single input image. Even with challenging side-view inputs, the model robustly infers missing regions for a coherent, high-fidelity output. In addition, our architecture seamlessly adapts to downstream tasks: by swapping the encoder, we can transform the model into a disentangled 3D editing pipeline. In this scenario, we can control geometry through - potentially hand-drawn - segmentation maps, and condition style via image or text prompt. We also provide an interactive GUI to enable the exploration of our editing pipeline. 🌍 📽️ Great work by Antonio Oroz and Tobias Kirschstein

📢📢 𝐏𝐞𝐫𝐜𝐇𝐞𝐚𝐝: 𝐏𝐞𝐫𝐜𝐞𝐩𝐭𝐮𝐚𝐥 𝐇𝐞𝐚𝐝 𝐌𝐨𝐝𝐞𝐥 𝐟𝐨𝐫 𝐒𝐢𝐧𝐠𝐥𝐞-𝐈𝐦𝐚𝐠𝐞 𝟑𝐃 𝐇𝐞𝐚𝐝 𝐑𝐞𝐜𝐨𝐧𝐬𝐭𝐫𝐮𝐜𝐭𝐢𝐨𝐧 & 𝐄𝐝𝐢𝐭𝐢𝐧𝐠📢📢 PercHead reconstructs realistic 3D heads from a single image and enables disentangled 3D editing via geometric controls and style inputs from images or text. At its core is a generalized 3D head decoder trained with perceptual supervision from DINOv2 and SAM 2.1. We find that our new perceptual loss formulation improves reconstruction fidelity compared to commonly-used methods such as LPIPS. Our trained reconstruction model is able to generate 3D-consistent heads from a single input image. Even with challenging side-view inputs, the model robustly infers missing regions for a coherent, high-fidelity output. In addition, our architecture seamlessly adapts to downstream tasks: by swapping the encoder, we can transform the model into a disentangled 3D editing pipeline. In this scenario, we can control geometry through - potentially hand-drawn - segmentation maps, and condition style via image or text prompt. We also provide an interactive GUI to enable the exploration of our editing pipeline. 🌍 📽️ Great work by Antonio Oroz and Tobias Kirschstein

Matthias Niessner

18,855 次观看 • 8 个月前