
Personalized 3D Generative Avatars from a Single Portrait

Contributions:
1. Generate a personalized 3D avatar from a reference portrait image with controllable facial attributes.
2. Create high-quality synthetic 2D video datasets with diverse attribute editing from a reference portrait image.
3. Use latent space regularization with face morphing supervision...

MrNeRF

14,597 subscribers

20,507 views • 1 year ago • via X (Twitter)

Science & Technology, Art


3 comments

MrNeRF • 1 year ago

Paper: Project:

Erick Ram • 1 year ago

Finally I'll know what I would look like with silver hair and a decent silver beard.

MrNeRF • 1 year ago

Haha, true :)

Related videos

Want to create an avatar from a single image? FlexAvatar is a transformer model that creates a full 360°, high-quality, and expressive 3D head avatar from just a single portrait image in minutes.

Real-time Demo: FlexAvatar's lightweight architecture allows both animation and rendering in real time, enabling interactive user experiences. To create a new 3D head avatar, only one image is required, e.g., from a webcam. The final avatar is ready after 2 minutes.

Architecture: Under the hood, FlexAvatar adopts a transformer-based encoder-decoder design. The encoder maps the input image onto a latent avatar space, while the decoder produces 3D Gaussian attribute maps by incorporating the animation signal via cross-attention. The model learns all facial animations directly from the data without relying on pre-built 3D face models. This equips the avatars with realistic facial expressions. The internal avatar latent space can also be used to integrate additional observations of a person via fitting. This enables use cases where more than one image of a person is available, e.g., from a phone scan of the person.

We train jointly on 2D monocular videos and multi-view data. However, in monocular videos, the animation signal leaks the target viewpoint, causing the model to produce incomplete 3D heads. We call this phenomenon entanglement of driving signal and target viewpoint. To prevent entanglement, we introduce bias sinks. These are learnable tokens that indicate whether a training sample stems from a monocular or a multi-view dataset. During training, the model learns to produce incomplete 3D heads only when the monocular token is present. During inference, FlexAvatar then always uses the multi-view token, for which the model has learned to produce complete 3D heads. This simple design makes it possible to combine the generalizability of monocular data with the quality of multi-view data.

FlexAvatar summary:
- Input: Single image, phone scan, or monocular video
- Output: Full 360° head avatar
- Expressive animations
- Real-time rendering and animation
- Generalization to any portrait
- Create a new avatar in 2 minutes
- Use bias sinks to combine 2D and 3D data

🏠 🌍 🎥 Great work by Tobias Kirschstein and Simon Giebenhain!

Matthias Niessner

95,991 views • 7 months ago
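
The bias-sink idea from the FlexAvatar post above can be illustrated with a small sketch. This is not the authors' code: the names `bias_sinks` and `add_bias_sink`, the token size, and the choice to prepend the token to the sequence are all assumptions made for illustration. The post only states that one learnable token per dataset type exists and that inference always uses the multi-view token.

```python
import random

MONOCULAR, MULTI_VIEW = 0, 1
TOKEN_DIM = 4  # toy size, chosen only for this sketch

# Hypothetical bias-sink tokens, one per dataset type.
# Random values stand in for parameters that would be learned in training.
rng = random.Random(0)
bias_sinks = {
    MONOCULAR: [rng.gauss(0.0, 0.02) for _ in range(TOKEN_DIM)],
    MULTI_VIEW: [rng.gauss(0.0, 0.02) for _ in range(TOKEN_DIM)],
}


def add_bias_sink(tokens, source, training):
    """Prepend a bias-sink token to a token sequence.

    During training the token matches the dataset the sample came from.
    At inference the multi-view token is always used, whatever the source.
    (Prepending is an assumption; the post does not say how the token is fed in.)
    """
    key = source if training else MULTI_VIEW
    return [bias_sinks[key]] + tokens


avatar_tokens = [[0.1] * TOKEN_DIM, [0.2] * TOKEN_DIM]
train_seq = add_bias_sink(avatar_tokens, MONOCULAR, training=True)
infer_seq = add_bias_sink(avatar_tokens, MONOCULAR, training=False)
```

In this sketch the only difference between training and inference is which token is selected, which mirrors the post's description of the design as simple.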

Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors

paper page:

We present Magic123, a two-stage coarse-to-fine approach for generating high-quality, textured 3D meshes from a single unposed image in the wild using both 2D and 3D priors. In the first stage, we optimize a neural radiance field to produce a coarse geometry. In the second stage, we adopt a memory-efficient differentiable mesh representation to yield a high-resolution mesh with a visually appealing texture. In both stages, the 3D content is learned through reference view supervision and novel views guided by a combination of 2D and 3D diffusion priors. We introduce a single trade-off parameter between the 2D and 3D priors to control exploration (more imaginative) and exploitation (more precise) of the generated geometry. Additionally, we employ textual inversion and monocular depth regularization to encourage consistent appearances across views and to prevent degenerate solutions, respectively. Magic123 demonstrates a significant improvement over previous image-to-3D techniques, as validated through extensive experiments on synthetic benchmarks and diverse real-world images.


AK

305,663 views • 3 years ago
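
To make the "single trade-off parameter" in the Magic123 post above concrete, here is a minimal sketch of one way a single weight could balance two guidance losses. The convex blend and the name `combined_prior_loss` are assumptions for illustration. The post does not give the actual formula, and the paper may weight the two priors differently.

```python
def combined_prior_loss(loss_2d, loss_3d, trade_off):
    """Blend a 2D-prior loss and a 3D-prior loss with one weight in [0, 1].

    trade_off = 1.0 uses only the 2D prior, trade_off = 0.0 only the 3D prior.
    This convex blend is an illustrative assumption, not the paper's formula.
    """
    if not 0.0 <= trade_off <= 1.0:
        raise ValueError("trade_off must be in [0, 1]")
    return trade_off * loss_2d + (1.0 - trade_off) * loss_3d


balanced = combined_prior_loss(loss_2d=2.0, loss_3d=4.0, trade_off=0.5)
```

Exposing one knob rather than two independent weights keeps the exploration versus exploitation choice on a single axis, which is the property the post highlights.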

📢📢 𝐀𝐯𝐚𝐭𝟑𝐫 📢📢 Avat3r creates high-quality 3D head avatars from just a few input images in a single forward pass with a new dynamic 3DGS reconstruction model. Video: Project:

Our core idea is to make Gaussian Reconstruction Models animatable. We find that a simple cross-attention to an expression code sequence is already sufficient to model complex facial expressions. We then incorporate position maps from DUSt3R and feature maps from Sapiens to facilitate the prediction task. While DUSt3R's position maps act as a pixel-aligned initialization for the Gaussians' positions, the Sapiens feature maps help the cross-view transformer to match corresponding image tokens in the 4 input images.

One major challenge in creating a 3D head avatar from smartphone images comes from inconsistent facial expressions when the subject cannot remain perfectly static during the capture. We eliminate this static requirement by simply showing our model input images with different facial expressions during training. This technique makes our model robust to inconsistent input images later on.

Finally, we show that although the model was trained with 4 input images, one can even create a 3D head avatar when only a single image is available. To achieve this, we employ a pre-trained 3D GAN to lift the single image to 3D and then render the 4 input images for our model. This allows us to create 3D head avatars from single images and even highly out-of-distribution examples like AI-generated faces, paintings, or statues.

Great work by Tobias Kirschstein from his internship at Meta with Javier Romero, Artem Sevastopolsky, and Shunsuke Saito

Matthias Niessner

74,763 views • 1 year ago
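
The Avat3r post above relies on cross-attention from image tokens to an expression code sequence. The following is a generic single-head scaled dot-product cross-attention in plain Python, without the learned projections a real model would have. It is a textbook sketch, not Avat3r's implementation, and the toy token values are made up.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (no learned projections).

    Each query attends over all keys; the output is the weighted sum of values.
    Returns (outputs, attention_weights).
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out = [
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy data: image tokens act as queries, expression codes as keys and values.
image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
expression_codes = [[1.0, 0.0], [0.0, 1.0]]
attended, weights = cross_attention(image_tokens, expression_codes, expression_codes)
```

Each image token ends up with a mixture of expression codes, so the animation signal can influence every token without changing the sequence length.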

📢 WorldAgents: 3D worlds only from 2D image models, without any training! We propose an agentic approach with a Director (VLM) to plan the scene, a Generator (Flux or NanoBanana) for new views, and a Verifier (VLM) for selection / 3D consistency. -> High-fidelity 3D worlds from a single text prompt. What's remarkable: our agents find consistent views from 2D image models to obtain 3D-consistent worlds; this shows that image models contain world priors, and agents just need to find them! Great work by Ziya Erkoç and Angela Dai


Matthias Niessner

18,976 views • 4 months ago
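
The Director / Generator / Verifier split in the WorldAgents post above can be pictured as a small control loop. Everything below is hypothetical: the function `build_world`, the best-of-N selection, and the stub agents are assumptions that stand in for the VLM and image-model calls, which the post does not specify.

```python
def build_world(prompt, director, generator, verifier, candidates_per_view=3):
    """Plan views, generate several candidates per view, keep the best-scored one.

    director(prompt) -> list of view descriptions
    generator(prompt, view, accepted, seed) -> candidate image
    verifier(image, accepted) -> consistency score (higher is better)
    """
    accepted = []
    for view in director(prompt):
        candidates = [
            generator(prompt, view, accepted, seed)
            for seed in range(candidates_per_view)
        ]
        best = max(candidates, key=lambda image: verifier(image, accepted))
        accepted.append(best)
    return accepted


# Stub agents so the loop can run without any model.
def stub_director(prompt):
    return ["front", "left", "right"]


def stub_generator(prompt, view, accepted, seed):
    return {"view": view, "seed": seed}


def stub_verifier(image, accepted):
    return image["seed"]  # pretend higher seeds are more 3D-consistent


world = build_world("a cozy cabin", stub_director, stub_generator, stub_verifier)
```

Passing the already accepted views to both the generator and the verifier is what would let new views be checked against earlier ones in a real system.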

📢📢 𝐏𝐞𝐫𝐜𝐇𝐞𝐚𝐝: 𝐏𝐞𝐫𝐜𝐞𝐩𝐭𝐮𝐚𝐥 𝐇𝐞𝐚𝐝 𝐌𝐨𝐝𝐞𝐥 𝐟𝐨𝐫 𝐒𝐢𝐧𝐠𝐥𝐞-𝐈𝐦𝐚𝐠𝐞 𝟑𝐃 𝐇𝐞𝐚𝐝 𝐑𝐞𝐜𝐨𝐧𝐬𝐭𝐫𝐮𝐜𝐭𝐢𝐨𝐧 & 𝐄𝐝𝐢𝐭𝐢𝐧𝐠📢📢 PercHead reconstructs realistic 3D heads from a single image and enables disentangled 3D editing via geometric controls and style inputs from images or text. At its core is a generalized 3D head decoder trained with perceptual supervision from DINOv2 and SAM 2.1. We find that our new perceptual loss formulation improves reconstruction fidelity compared to commonly-used methods such as LPIPS. Our trained reconstruction model is able to generate 3D-consistent heads from a single input image. Even with challenging side-view inputs, the model robustly infers missing regions for a coherent, high-fidelity output. In addition, our architecture seamlessly adapts to downstream tasks: by swapping the encoder, we can transform the model into a disentangled 3D editing pipeline. In this scenario, we can control geometry through - potentially hand-drawn - segmentation maps, and condition style via image or text prompt. We also provide an interactive GUI to enable the exploration of our editing pipeline. 🌍 📽️ Great work by Antonio Oroz and Tobias Kirschstein


Matthias Niessner

18,855 views • 8 months ago

Google presents Genie: Generative Interactive Environments

We introduce Genie, the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos. The model can be prompted to generate an endless variety of action-controllable virtual worlds described through text, synthetic images, photographs, and even sketches. At 11B parameters, Genie can be considered a foundation world model. It comprises a spatiotemporal video tokenizer, an autoregressive dynamics model, and a simple and scalable latent action model. Genie enables users to act in the generated environments on a frame-by-frame basis despite training without any ground-truth action labels or other domain-specific requirements typically found in the world model literature. Further, the resulting learned latent action space facilitates training agents to imitate behaviors from unseen videos, opening the path for training generalist agents of the future.


AK

684,372 views • 2 years ago

📢 Intrinsic Image Fusion for Multi-View 3D Material Reconstruction 📢

We combine generative material priors with inverse path tracing:
1) define a parametric texture space
2) fuse monocular predictions across views into consistent textures
3) optimize low-dimensional parameters for physically-grounded reconstructions.

The results are relightable PBR textures for 3D scenes: check out the result on a real-world 3D scan from the ScanNet++ dataset! 🌍 🎥 Great work by Peter Kocsis and Lukas Höllein!


Matthias Niessner

23,087 views • 7 months ago

Another use case with Nano Banana Pro (Video is fully narrated so don't forget to turn on the sound 🔊) Image to 3D Model For 3D Printing 1 - Generate an image of a vinyl toy character. I generated a character called Peelbert. 2 - Generate multiple outfits for Peelbert while retaining consistency. Nano Banana Pro's consistency and understanding of outfit changes is superb. 3 - Use Hunyuan 3.0 to do image to 3D model 4 - Save as STL and 3D Print. Nano Banana Pro to the real world 👍


Travis Davids

44,133 views • 8 months ago

Phoenix GenAI’s new generative AI model Image-to-3D is now online. Create high fidelity 3D models importable to Unreal Engine 5 or Unity simply by sending an original image into PhoenixLLM. Better yet, generate the source image using Phoenix GenAI’s Flux and feed it into Image-to-3D. This release marks yet another upgrade of GenAI’s arsenal of capabilities, getting it ready for multi-workflow GenAI agents, in which users will be able to combine text-to-image, image-to-prompt, text-to-video, text-to-3D, and image-to-3D into complex multi-step workflows with simple commands via PhoenixLLM. Image-to-3D is yet another addition to Phoenix’s Vertical AI Solutions for gaming, content, and metaverse. Users are able to use it as a Phoenix-native alternative to SkyNet AI Marketplace’s Tripo Integration earlier this year. #Phoenix $PHB


Phoenix AI

30,853 views • 1 year ago

ScenarioControl 🚗🛣️ - Scenario Generation from a single Dashcam Image 📸 or Text Prompt 💬!! Excited to introduce a new vision-language control mechanism for learned driving scenario generation. Given a single dashcam image or a scene prompt, we generate a full scene layout 🧩 and temporally consistent rollouts, including map 🗺️, agents 🚗, and ego video 🛣️. ScenarioControl enables direct, fine-grained control over layout and traffic while preserving realism. It operates in a vectorized latent space with a new cross-global control mechanism that fuses vision-language inputs with scene structure. It interfaces seamlessly with generative video models! Project: Super fun project by Lili Gao, Yanbo Xu, William Koch, Samuele Ruffino, Luke Rowe, Behdad Chalaki, Dmitriy Rivkin, Julian Ost, Roger Girgis, and Mario Bijelic.


Felix Heide

22,331 views • 3 months ago

LLaVA-3D A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D-awareness for 3D scene understanding has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we employ a simple yet effective representation, 3D Patch, which connects 2D CLIP patch features with their corresponding positions in 3D space. By integrating the 3D Patches into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D image understanding and 3D scene understanding. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains comparable 2D image understanding and vision-language conversation capabilities with LLaVA.


AK

41,736 views • 1 year ago
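
The "3D Patch" in the LLaVA-3D post above ties a 2D patch feature to a position in 3D space. A minimal sketch of that pairing follows, assuming a depth value and pinhole intrinsics are available for each patch center and that the position is injected by adding an embedding to the feature. The helper names, the toy embedding, and the additive fusion are assumptions for illustration, not details taken from the post.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Standard pinhole back-projection of pixel (u, v) with depth to camera space."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def make_3d_patch(feature, position, embed):
    """Combine a 2D patch feature with an embedding of its 3D position.

    Element-wise addition is an assumption made for this sketch.
    """
    pos_emb = embed(position)
    return [f + p for f, p in zip(feature, pos_emb)]


def toy_embed(position):
    """Hypothetical stand-in for a learned position embedding."""
    x, y, z = position
    return [x, y, z, x + y + z]


# A patch centered on the principal point, 2 units in front of the camera.
point = backproject(u=320.0, v=240.0, depth=2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
patch = make_3d_patch([0.5, 0.5, 0.5, 0.5], point, toy_embed)
```

Because the feature keeps its original dimensionality, tokens built this way could be consumed by the same model that handles ordinary 2D patch tokens, which fits the unified 2D and 3D architecture the post describes.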

🚀 Announcing NeRSemble 3D Head Avatar Benchmark v2

Version 2 of the NeRSemble 3D Head Avatar Benchmark systematically evaluates several aspects of 3D head avatar creation. Our goal is to drive progress toward more realistic, robust, and generalizable avatar methods.

🔬 Benchmark Tasks
The NeRSemble Benchmark v2 features three core challenges:
- Dynamic Novel View Synthesis
- Monocular FLAME-driven Avatar Creation (updated)
- Single-view 3D Face Reconstruction (new)

👉 Explore the online leaderboard and submission system:

🆕 What's new?
1. New task: Single-view 3D Face Reconstruction
Given a single portrait image, reconstruct an accurate 3D mesh either showing the input expression or a fully neutral one. Unlike prior benchmarks, the NeRSemble benchmark emphasizes diverse and challenging facial expressions, better reflecting real scenarios. For technical details, see the Pixel3DMM paper.
2. Updated task: Monocular FLAME-driven Avatar Creation
We have improved the FLAME tracking that is used for both avatar creation from the monocular videos and avatar driving on the hidden test sequences. The updated benchmark task has:
- more stable torso tracking
- more expressive lip closures during speech
- improved mouth tracking for challenging facial expressions

We hope that these improvements to the benchmark help drive the field forward.

🏆 CVPR 2026 Workshop & Prizes
The NeRSemble benchmark will be featured at the CVPR 2026 Workshop on Photo-realistic 3D Head Avatars. Participants in the new and updated tasks have the opportunity to win:
- 🎁 RTX 5080 GPUs (sponsored by NVIDIA)
- 🎤 15-minute oral presentation at the workshop

⏰ Submission Deadline: May 26, 2026

Reach out to the amazing Tobias Kirschstein and Simon Giebenhain for more details :)

Matthias Niessner

29,954 views • 3 months ago

1/2 Meet Wan2.7-Video: The Comprehensive Model for Controllable Video Storytelling! From single clips to full-scale narrative direction, we’ve built more than just a generator. We’ve built a director’s suite:
• Multimodal control over performance and style via text, image, audio, and video.
• Character customization with up to 5 reference inputs and voice profiles.
• Video editing with simple, intuitive instructions.
• Full-stack creative toolkit: generation, editing, cloning, restyling, continuation, and more.
• Sustained improvements in visual fidelity, motion stability, and prompt adherence.


Wan

25,571,943 views • 3 months ago

DisCo: Disentangled Control for Referring Human Dance Generation in Real World paper page: Generative AI has made significant strides in computer vision, particularly in image/video synthesis conditioned on text descriptions. Despite the advancements, it remains challenging especially in the generation of human-centric content such as dance synthesis. Existing dance synthesis methods struggle with the gap between synthesized content and real-world dance scenarios. In this paper, we define a new problem setting: Referring Human Dance Generation, which focuses on real-world dance scenarios with three important properties: (i) Faithfulness: the synthesis should retain the appearance of both human subject foreground and background from the reference image, and precisely follow the target pose; (ii) Generalizability: the model should generalize to unseen human subjects, backgrounds, and poses; (iii) Compositionality: it should allow for composition of seen/unseen subjects, backgrounds, and poses from different sources. To address these challenges, we introduce a novel approach, DISCO, which includes a novel model architecture with disentangled control to improve the faithfulness and compositionality of dance synthesis, and an effective human attribute pre-training for better generalizability to unseen humans. Extensive qualitative and quantitative results demonstrate that DISCO can generate high-quality human dance images and videos with diverse appearances and flexible motions.


AK

161,453 views • 3 years ago

[SIGGRAPH '25] TeGA: Texture Space Gaussian Avatars for High-Resolution Dynamic Head Modeling

Note: On the left that's a 3DGS rendering!

Contributions:
1. We propose a simple approach for rigging 3D Gaussians within the continuous tangent space of 3DMM face models, allowing Gaussians to move freely across mesh triangles.
2. We propose a novel CNN-based deformation model that is agnostic to the number of 3D Gaussians, naturally enabling adaptive densification of the representation to improve detail where most needed, with expression-dependent shading.
3. We show significant improvements over baseline SOTA methods and demonstrate the ability to render even extreme close-up images at high quality.


MrNeRF

29,010 views • 1 year ago

🇨🇳 Another great Chinese model: OmniHuman-1.5 from ByteDance. It turns 1 image plus a voice track into expressive avatar video by pairing a System 1 and System 2 inspired planner with a Diffusion Transformer, and produces coherent motion for over 1 minute with moving camera and multi-character scenes.

Most avatar models move to the beat of the audio but miss meaning, so gestures feel generic and emotions feel shallow. The fix here is a Multimodal LLM planner that listens to the speech and drafts a structured plan describing intent, emotions, beats, and high-level actions, which gives the motion engine clear semantic targets instead of only rhythm.

The motion engine is a Multimodal Diffusion Transformer that fuses the plan with audio, the single reference image, and optional text prompts, then synthesizes continuous body, face, and head motion that matches both words and tone. A key trick is a Pseudo Last Frame, a synthetic target that summarizes the next expected state, which stabilizes fusion across modalities and keeps motion consistent over long spans.

From just 1 image and speech, the system outputs speaking avatars with synchronized lips, context-aware gestures, and continuous camera movement, and it also supports multi-character interactions without manual choreography. Reported results show strong lip sync accuracy, high video quality, natural motion, and close match to text prompts, and the same setup works on nonhuman characters too.

Rohan Paul

63,859 views • 11 months ago

I’ve dreamt of creating a tool that could animate anyone with any motion from just ONE image… and now it’s a reality! 🎉 Super excited to introduce updated 3DHM: Synthesizing Moving People with 3D Control. 🕺💃 3DHM can generate human videos from a single real or synthetic human image. #Animation #GenAI #AI #3DHM ✨ The magic of 3D control? Turning 2D pixels into lifelike, animated humans. 🎥 Check out our demo (and Merry Christmas)! Paper: Github: Webpage: Proudly working with the great Junming (Leo) Chen, Yossi Gandelsman, Alyosha Efros and Jitendra MALIK 😃 Kindly note: This video is intended solely for research purposes and is not authorized for commercial use.


Boyi Li

52,482 views • 1 year ago

Testing Grok Imagine 1.5's abilities with the burst frame technique (a brainstorm exercise in which you use a reference image to create a bunch of new scenes, locations, and characters in the same visual style, from which you can then extract key frames to create new scenes). This one is for an imaginary Italian Giallo thriller. The image quality is quite good in Imagine 1.5 - a little too good - I had to grunge it up a bit for a more vintage look. Seedance 2 gives you a more rhythmic burst frame video with shorter snippets - Grok Imagine gave each snippet a little more room to breathe.


Christopher Gwinn | Grindhouse Glitch

17,869 views • 1 month ago