正在加载视频...

视频加载失败

加载此视频时出现问题。这可能是由于临时网络问题，或视频可能不可用。

📢Announcing our 3D head avatar benchmark📢 Two tasks with hidden test sets: - Dynamic Novel View Synthesis on Heads - Monocular FLAME-driven Head Avatar Reconstruction Our goal is to make research on 3D head avatars more comparable and ultimately increase the realism of digital humans. The benchmark studies distinct... phenomena of 3D head avatar creation, such as extreme facial expressions, slow motion captures of shaking long hair, or complicated light reflection and refraction patterns of glasses. The two benchmark tasks assess two core desiderata of 3D avatars: While the novel view synthesis challenge focuses on best possible rendering quality of complex moving scenes, the avatar animation challenge is concerned with how well a driving signal is translated into an avatar. Evaluations are light-weight and consist of diverse video recordings from the popular NeRSemble dataset with a hidden test set. Participation in the benchmark is therefore straight-forward and requires only 5 reconstructions per task. Leaderboard and benchmark submission: Benchmark data access and toolkit: Great work by Tobias Kirschstein Simon Giebenhainshow more

Matthias Niessner

48,069 subscribers

28,075 次观看 • 1 年前 •via X (Twitter)

Anya Rossi• Live Now

Private livecam show

0 条评论

暂无评论

原始帖子的评论将显示在这里

相关视频

Gaussian Head Avatar: Ultra High-fidelity Head Avatar via Dynamic Gaussians paper page: Creating high-fidelity 3D head avatars has always been a research hotspot, but there remains a great challenge under lightweight sparse view setups. In this paper, we propose Gaussian Head Avatar represented by controllable 3D Gaussians for high-fidelity head avatar modeling. We optimize the neutral 3D Gaussians and a fully learned MLP-based deformation field to capture complex expressions. The two parts benefit each other, thereby our method can model fine-grained dynamic details while ensuring expression accuracy. Furthermore, we devise a well-designed geometry-guided initialization strategy based on implicit SDF and Deep Marching Tetrahedra for the stability and convergence of the training procedure. Experiments show our approach outperforms other state-of-the-art sparse-view methods, achieving ultra high-fidelity rendering quality at 2K resolution even under exaggerated expressions.

AK

65,847 次观看 • 2 年前

📢📢𝐍𝐞𝐑𝐒𝐞𝐦𝐛𝐥𝐞 𝐯𝟐 𝐃𝐚𝐭𝐚𝐬𝐞𝐭 𝐑𝐞𝐥𝐞𝐚𝐬𝐞📢📢 Head captures of 7.1MP from 16 cameras at 73fps: * More recordings (425 people) * Better color calibration * Convenient download scripts The new version of our dataset adds 156 participants for a total of 425 different people. In its entirety, the dataset provides now 65 million images from over 15 hours of diverse human facial expression performances. We improved the color consistency of the recorded images with a better color calibration procedure. As a result, 3D reconstructions with images from the NeRSemble dataset should become better and look more realistic. Finally, we made it much easier to download the recordings with our new download repository. It now just takes a single command to download all frontal hair shake videos of all participants or to download all recordings of a single participant. Check it out: Awesome work by Tobias Kirschstein Simon Giebenhain !!!

📢📢𝐍𝐞𝐑𝐒𝐞𝐦𝐛𝐥𝐞 𝐯𝟐 𝐃𝐚𝐭𝐚𝐬𝐞𝐭 𝐑𝐞𝐥𝐞𝐚𝐬𝐞📢📢 Head captures of 7.1MP from 16 cameras at 73fps: * More recordings (425 people) * Better color calibration * Convenient download scripts The new version of our dataset adds 156 participants for a total of 425 different people. In its entirety, the dataset provides now 65 million images from over 15 hours of diverse human facial expression performances. We improved the color consistency of the recorded images with a better color calibration procedure. As a result, 3D reconstructions with images from the NeRSemble dataset should become better and look more realistic. Finally, we made it much easier to download the recordings with our new download repository. It now just takes a single command to download all frontal hair shake videos of all participants or to download all recordings of a single participant. Check it out: Awesome work by Tobias Kirschstein Simon Giebenhain !!!

Matthias Niessner

12,089 次观看 • 1 年前

📢Pix2NPHM: Learning to Regress NPHM Reconstructions From a Single Image📢 We directly regress neural parametric head models (NPHMs) from a single image — fast, stable, and significantly more expressive than classical 3DMMs such as FLAME. Face tracking & 3D reconstruction are often limited by the representational capacity of PCA-based face models. By lifting NPHMs to a first-class reconstruction primitive, we enable more accurate geometry, richer expressions, and finer animation control. Pix2NPHM obtains fast and reliable NPHM reconstructions on real-world data. Inference-time optimization against surface normals and canonical point maps can further increase fidelity. Key to successful and generalized training of our ViT-based network are: (1) large-scale registration of existing 3D head datasets, and (2) self-supervised training on vast in-the-wild 2D video datasets using pseudo ground-truth surface normals. Finally, we show that geometry-aware pretraining on pixel-aligned reconstruction tasks significantly outperforms generic visual pretraining (e.g., DINO-style features) in terms of generalization. 🌍 🎥 Great work by Simon Giebenhain, Tobias Kirschstein, Liam Schoneveld, Davide Davoli, Zhe Chen

📢Pix2NPHM: Learning to Regress NPHM Reconstructions From a Single Image📢 We directly regress neural parametric head models (NPHMs) from a single image — fast, stable, and significantly more expressive than classical 3DMMs such as FLAME. Face tracking & 3D reconstruction are often limited by the representational capacity of PCA-based face models. By lifting NPHMs to a first-class reconstruction primitive, we enable more accurate geometry, richer expressions, and finer animation control. Pix2NPHM obtains fast and reliable NPHM reconstructions on real-world data. Inference-time optimization against surface normals and canonical point maps can further increase fidelity. Key to successful and generalized training of our ViT-based network are: (1) large-scale registration of existing 3D head datasets, and (2) self-supervised training on vast in-the-wild 2D video datasets using pseudo ground-truth surface normals. Finally, we show that geometry-aware pretraining on pixel-aligned reconstruction tasks significantly outperforms generic visual pretraining (e.g., DINO-style features) in terms of generalization. 🌍 🎥 Great work by Simon Giebenhain, Tobias Kirschstein, Liam Schoneveld, Davide Davoli, Zhe Chen

Matthias Niessner

37,850 次观看 • 7 个月前

Wonderland: Navigating 3D Scenes from a Single Image Contributions: • First, we introduce a representation for controllable 3D generation by leveraging the generative priors from camera-guided video diffusion models. Unlike image models, video diffusion models are trained on extensive video datasets. This enables them to capture comprehensive spatial relationships within scenes across multiple views and embed a form of "3D awareness" in their latent space, which allows us to maintain 3D consistency in novel view synthesis. • Second, to achieve controllable novel view generation, we empower video models with precise control over specified camera motions. We introduce a novel dual-branch conditioning mechanism that effectively incorporates desired diverse camera trajectories into the video diffusion model. This enables expansion of a single image into a multi-view consistent capture of a 3D scene with precise pose control. • Third, to achieve efficient 3D reconstruction, we directly transform video latents into 3DGS. We propose a novel latent-based large reconstruction model (LaLRM) that lifts video latents to 3D in a feed-forward manner. With this design, during inference, our model directly predicts 3DGS from a single input image, effectively aligning the generation and reconstruction tasks—and bridging image space and 3D space—through the video latent space. Compared with reconstructing scenes from images, the video latent space offers a 256× spatial-temporal reduction while retaining essential and consistent 3D structural details. Such a high degree of compression is crucial, as it allows the LaLRM to handle a wider range of 3D scenes within the reconstruction framework, with the same memory constraints.

Wonderland: Navigating 3D Scenes from a Single Image Contributions: • First, we introduce a representation for controllable 3D generation by leveraging the generative priors from camera-guided video diffusion models. Unlike image models, video diffusion models are trained on extensive video datasets. This enables them to capture comprehensive spatial relationships within scenes across multiple views and embed a form of "3D awareness" in their latent space, which allows us to maintain 3D consistency in novel view synthesis. • Second, to achieve controllable novel view generation, we empower video models with precise control over specified camera motions. We introduce a novel dual-branch conditioning mechanism that effectively incorporates desired diverse camera trajectories into the video diffusion model. This enables expansion of a single image into a multi-view consistent capture of a 3D scene with precise pose control. • Third, to achieve efficient 3D reconstruction, we directly transform video latents into 3DGS. We propose a novel latent-based large reconstruction model (LaLRM) that lifts video latents to 3D in a feed-forward manner. With this design, during inference, our model directly predicts 3DGS from a single input image, effectively aligning the generation and reconstruction tasks—and bridging image space and 3D space—through the video latent space. Compared with reconstructing scenes from images, the video latent space offers a 256× spatial-temporal reduction while retaining essential and consistent 3D structural details. Such a high degree of compression is crucial, as it allows the LaLRM to handle a wider range of 3D scenes within the reconstruction framework, with the same memory constraints.

MrNeRF

52,849 次观看 • 1 年前

📢GeomHair: Reconstruction of Hair Strands from Colorless 3D Scans📢 We present a novel method to reconstruct hair strands from colorless 3D scans by extracting orientation cues directly from the mesh surface geometry by finding local characteristic lines and from shaded renderings using a neural 2D line detector. We enhance the reconstruction with a diffusion prior trained on synthetic hair data and adapted to each scan using a tailored text prompt, allowing us to recover both simple and complex hairstyles without relying on color input. To support further research, we also introduce Strands400, the largest publicly available dataset of 3D hair strand reconstructions from real-world scans of 400 different people, featuring complicated hairstyles, such as ponytails and buns. 🌍 📷 Great work by Rachmadio Noval L. Artem Sevastopolsky Egor Zakharov @ness_pris

📢GeomHair: Reconstruction of Hair Strands from Colorless 3D Scans📢 We present a novel method to reconstruct hair strands from colorless 3D scans by extracting orientation cues directly from the mesh surface geometry by finding local characteristic lines and from shaded renderings using a neural 2D line detector. We enhance the reconstruction with a diffusion prior trained on synthetic hair data and adapted to each scan using a tailored text prompt, allowing us to recover both simple and complex hairstyles without relying on color input. To support further research, we also introduce Strands400, the largest publicly available dataset of 3D hair strand reconstructions from real-world scans of 400 different people, featuring complicated hairstyles, such as ponytails and buns. 🌍 📷 Great work by Rachmadio Noval L. Artem Sevastopolsky Egor Zakharov @ness_pris

Matthias Niessner

12,490 次观看 • 1 年前

Apple just trained a 3D Gaussian head reconstruction model on 10,000+ subjects. Feed-forward. No test-time optimization. New identity in, reconstructed Gaussian head out. The UV-parameterized Gaussian representation decouples the number of Gaussians from the number and resolution of input images, making it practical to train with many high resolution views. And the heads are not just static either: text-conditioned identity generation, plus blendshape-driven latent animation across identities. We've been building in the 3D Gaussian Splatting space for a while. The gap between "research demo" and "works on real people at scale" is closing fast.

Apple just trained a 3D Gaussian head reconstruction model on 10,000+ subjects. Feed-forward. No test-time optimization. New identity in, reconstructed Gaussian head out. The UV-parameterized Gaussian representation decouples the number of Gaussians from the number and resolution of input images, making it practical to train with many high resolution views. And the heads are not just static either: text-conditioned identity generation, plus blendshape-driven latent animation across identities. We've been building in the 3D Gaussian Splatting space for a while. The gap between "research demo" and "works on real people at scale" is closing fast.

KIRI Engine - 3D Scanner App

12,181 次观看 • 2 个月前

Microsoft presents Windows Agent Arena Evaluating Multi-Modal OS Agents at Scale discuss: Large language models (LLMs) show remarkable potential to act as computer agents, enhancing human productivity and software accessibility in multi-modal tasks that require planning and reasoning. However, measuring agent performance in realistic environments remains a challenge since: (i) most benchmarks are limited to specific modalities or domains (e.g. text-only, web navigation, Q&A, coding) and (ii) full benchmark evaluations are slow (on order of magnitude of days) given the multi-step sequential nature of tasks. To address these challenges, we introduce the Windows Agent Arena: a reproducible, general environment focusing exclusively on the Windows operating system (OS) where agents can operate freely within a real Windows OS and use the same wide range of applications, tools, and web browsers available to human users when solving tasks. We adapt the OSWorld framework (Xie et al., 2024) to create 150+ diverse Windows tasks across representative domains that require agent abilities in planning, screen understanding, and tool usage. Our benchmark is scalable and can be seamlessly parallelized in Azure for a full benchmark evaluation in as little as 20 minutes. To demonstrate Windows Agent Arena's capabilities, we also introduce a new multi-modal agent, Navi. Our agent achieves a success rate of 19.5% in the Windows domain, compared to 74.5% performance of an unassisted human. Navi also demonstrates strong performance on another popular web-based benchmark, Mind2Web. We offer extensive quantitative and qualitative analysis of Navi's performance, and provide insights into the opportunities for future research in agent development and data generation using Windows Agent Arena.

Microsoft presents Windows Agent Arena Evaluating Multi-Modal OS Agents at Scale discuss: Large language models (LLMs) show remarkable potential to act as computer agents, enhancing human productivity and software accessibility in multi-modal tasks that require planning and reasoning. However, measuring agent performance in realistic environments remains a challenge since: (i) most benchmarks are limited to specific modalities or domains (e.g. text-only, web navigation, Q&A, coding) and (ii) full benchmark evaluations are slow (on order of magnitude of days) given the multi-step sequential nature of tasks. To address these challenges, we introduce the Windows Agent Arena: a reproducible, general environment focusing exclusively on the Windows operating system (OS) where agents can operate freely within a real Windows OS and use the same wide range of applications, tools, and web browsers available to human users when solving tasks. We adapt the OSWorld framework (Xie et al., 2024) to create 150+ diverse Windows tasks across representative domains that require agent abilities in planning, screen understanding, and tool usage. Our benchmark is scalable and can be seamlessly parallelized in Azure for a full benchmark evaluation in as little as 20 minutes. To demonstrate Windows Agent Arena's capabilities, we also introduce a new multi-modal agent, Navi. Our agent achieves a success rate of 19.5% in the Windows domain, compared to 74.5% performance of an unassisted human. Navi also demonstrates strong performance on another popular web-based benchmark, Mind2Web. We offer extensive quantitative and qualitative analysis of Navi's performance, and provide insights into the opportunities for future research in agent development and data generation using Windows Agent Arena.

AK

19,684 次观看 • 1 年前

FastMap: Revisiting Dense and Scalable Structure from Motion "FASTMAP, a redesigned SfM framework, achieves fast, high-accuracy dense structure from motion. On large scenes with thousands of images, FASTMAP is up to one to two orders of magnitude faster than GLOMAP and COLMAP. ... Importantly, FASTMAP achieves efficiency improvements while keeping comparable performance. Extensive experiments on eight datasets demonstrate pose estimation accuracy and novel view synthesis quality close to GLOMAP and COLMAP. " Contributions: 1. For all the iterative nonlinear optimization problems involved, we design algorithms such that the computational complexity of each iteration is only linear in the number of image pairs, not keypoint pairs or 3D points. This includes replacing the traditional bundle adjustment [50] present in previous SfM frameworks with a novel re-weighting epipolar adjustment algorithm, which is much more efficient. 2. Throughout the entire framework, we formulate as many steps as possible as GPU-friendly dense tensor operations. This allows us to implement the entire method in PyTorch [39], which provides seamless GPU acceleration.

FastMap: Revisiting Dense and Scalable Structure from Motion "FASTMAP, a redesigned SfM framework, achieves fast, high-accuracy dense structure from motion. On large scenes with thousands of images, FASTMAP is up to one to two orders of magnitude faster than GLOMAP and COLMAP. ... Importantly, FASTMAP achieves efficiency improvements while keeping comparable performance. Extensive experiments on eight datasets demonstrate pose estimation accuracy and novel view synthesis quality close to GLOMAP and COLMAP. " Contributions: 1. For all the iterative nonlinear optimization problems involved, we design algorithms such that the computational complexity of each iteration is only linear in the number of image pairs, not keypoint pairs or 3D points. This includes replacing the traditional bundle adjustment [50] present in previous SfM frameworks with a novel re-weighting epipolar adjustment algorithm, which is much more efficient. 2. Throughout the entire framework, we formulate as many steps as possible as GPU-friendly dense tensor operations. This allows us to implement the entire method in PyTorch [39], which provides seamless GPU acceleration.

MrNeRF

15,233 次观看 • 1 年前

If you think OpenAI Sora is a creative toy like DALLE, ... think again. Sora is a data-driven physics engine. It is a simulation of many worlds, real or fantastical. The simulator learns intricate rendering, "intuitive" physics, long-horizon reasoning, and semantic grounding, all by some denoising and gradient maths. I won't be surprised if Sora is trained on lots of synthetic data using Unreal Engine 5. It has to be! Let's breakdown the following video. Prompt: "Photorealistic closeup video of two pirate ships battling each other as they sail inside a cup of coffee." - The simulator instantiates two exquisite 3D assets: pirate ships with different decorations. Sora has to solve text-to-3D implicitly in its latent space. - The 3D objects are consistently animated as they sail and avoid each other's paths. - Fluid dynamics of the coffee, even the foams that form around the ships. Fluid simulation is an entire sub-field of computer graphics, which traditionally requires very complex algorithms and equations. - Photorealism, almost like rendering with raytracing. - The simulator takes into account the small size of the cup compared to oceans, and applies tilt-shift photography to give a "minuscule" vibe. - The semantics of the scene does not exist in the real world, but the engine still implements the correct physical rules that we expect. Next up: add more modalities and conditioning, then we have a full data-driven UE that will replace all the hand-engineered graphics pipelines.

If you think OpenAI Sora is a creative toy like DALLE, ... think again. Sora is a data-driven physics engine. It is a simulation of many worlds, real or fantastical. The simulator learns intricate rendering, "intuitive" physics, long-horizon reasoning, and semantic grounding, all by some denoising and gradient maths. I won't be surprised if Sora is trained on lots of synthetic data using Unreal Engine 5. It has to be! Let's breakdown the following video. Prompt: "Photorealistic closeup video of two pirate ships battling each other as they sail inside a cup of coffee." - The simulator instantiates two exquisite 3D assets: pirate ships with different decorations. Sora has to solve text-to-3D implicitly in its latent space. - The 3D objects are consistently animated as they sail and avoid each other's paths. - Fluid dynamics of the coffee, even the foams that form around the ships. Fluid simulation is an entire sub-field of computer graphics, which traditionally requires very complex algorithms and equations. - Photorealism, almost like rendering with raytracing. - The simulator takes into account the small size of the cup compared to oceans, and applies tilt-shift photography to give a "minuscule" vibe. - The semantics of the scene does not exist in the real world, but the engine still implements the correct physical rules that we expect. Next up: add more modalities and conditioning, then we have a full data-driven UE that will replace all the hand-engineered graphics pipelines.

Jim Fan

6,182,281 次观看 • 2 年前

BREAKING: Anthropic just dropped Opus 4.8—and it is a MONSTER We've been testing for about a week Every 📧 and our verdict is they could've just called it Opus 5, it's that good. Here's our vibe check: - Beats GPT-5.5 on Senior Engineer bench. On our toughest benchmark Opus 4.8 scores a 63—a hair higher than GPT-5.5's score of 62, and a full 30 points higher than Opus 4.7. It tackled a ground-up rewrite of a production codebase, and actually built something that works. HOWEVER: Coding performance varied a lot at different reasoning levels. We recommend using it on xhigh for best results. - Incredibly good writer. Opus 4.8 scored a 79.6 on our writing benchmark—measuring models on real-world writing tasks we do all of the time like essay writing, promo email writing, and more. It beats GPT-5.5 by 6 points. It produces well-written prose with fewer "AI-isms". It's also very good at writing in your voice given the right context. HOWEVER: Writing performance also varied with reasoning levels. Medium reasoning had higher incidence of AI-isms—we found best results with high. - Beast at knowledge work. Opus 4.8 is very good at general knowledge work tasks like report creation, research and more. It produced the best PowerPoint one-shot we've ever seen on our deck generation benchmark. - Emotionally intelligent, willing to question the frame. I've also found it to be quite good at talking through psychological or interpersonal issues. It has a high EQ, and it's also good at not glazing and helping to expand your perspective. Its thought process feels extremely rich and dynamic. THE BAD: These days a model is only as good as its harness, and Codex is still a far superior harness to the Claude Desktop app. This has kept me using Codex + GPT-5.5 as my daily driver, but I am flipping back and forth a lot more between Codex and Claude. Anthropic is back baby! Read the rest on Every 📧:

BREAKING: Anthropic just dropped Opus 4.8—and it is a MONSTER We've been testing for about a week Every 📧 and our verdict is they could've just called it Opus 5, it's that good. Here's our vibe check: - Beats GPT-5.5 on Senior Engineer bench. On our toughest benchmark Opus 4.8 scores a 63—a hair higher than GPT-5.5's score of 62, and a full 30 points higher than Opus 4.7. It tackled a ground-up rewrite of a production codebase, and actually built something that works. HOWEVER: Coding performance varied a lot at different reasoning levels. We recommend using it on xhigh for best results. - Incredibly good writer. Opus 4.8 scored a 79.6 on our writing benchmark—measuring models on real-world writing tasks we do all of the time like essay writing, promo email writing, and more. It beats GPT-5.5 by 6 points. It produces well-written prose with fewer "AI-isms". It's also very good at writing in your voice given the right context. HOWEVER: Writing performance also varied with reasoning levels. Medium reasoning had higher incidence of AI-isms—we found best results with high. - Beast at knowledge work. Opus 4.8 is very good at general knowledge work tasks like report creation, research and more. It produced the best PowerPoint one-shot we've ever seen on our deck generation benchmark. - Emotionally intelligent, willing to question the frame. I've also found it to be quite good at talking through psychological or interpersonal issues. It has a high EQ, and it's also good at not glazing and helping to expand your perspective. Its thought process feels extremely rich and dynamic. THE BAD: These days a model is only as good as its harness, and Codex is still a far superior harness to the Claude Desktop app. This has kept me using Codex + GPT-5.5 as my daily driver, but I am flipping back and forth a lot more between Codex and Claude. Anthropic is back baby! Read the rest on Every 📧:

Dan Shipper 📧

353,763 次观看 • 1 个月前

I've seen a lot of animatics lately that are super detailed, essentially viewport previews of the final shot. But an animatic at it's core doesn't need to be anything fancy. Its main purpose is to plan and test the timing, camera angles, movement, and overall composition. With animatics you have to keep in mind the following: Pre-visualization: It allows to see a basic version of the animation or scene before committing to detailed work. Timing and pacing: An animatic helps identify how long each scene or shot should last, ensuring that the timing feels right. Planning: It helps with layout, camera angles, and transitions. By visualizing the shots, the team can ensure that the framing and overall design of the scenes work well in 3D space. Efficiency: It allows the team to test and fix any potential issues early on, like awkward movements or awkward pacing, before spending time on high-quality rendering or complex animation. In short, an animatic helps in conceptualizing the final animation by giving a low-res, rough version of the scenes, which guides the entire production process in terms of visual design, timing, and storytelling.

I've seen a lot of animatics lately that are super detailed, essentially viewport previews of the final shot. But an animatic at it's core doesn't need to be anything fancy. Its main purpose is to plan and test the timing, camera angles, movement, and overall composition. With animatics you have to keep in mind the following: Pre-visualization: It allows to see a basic version of the animation or scene before committing to detailed work. Timing and pacing: An animatic helps identify how long each scene or shot should last, ensuring that the timing feels right. Planning: It helps with layout, camera angles, and transitions. By visualizing the shots, the team can ensure that the framing and overall design of the scenes work well in 3D space. Efficiency: It allows the team to test and fix any potential issues early on, like awkward movements or awkward pacing, before spending time on high-quality rendering or complex animation. In short, an animatic helps in conceptualizing the final animation by giving a low-res, rough version of the scenes, which guides the entire production process in terms of visual design, timing, and storytelling.

Voxyde

22,244 次观看 • 1 年前

🇨🇳 Another great Chinese Model, OmniHuman-1.5 from ByteDance Turns 1 image plus a voice track into expressive avatar video by pairing a System 1 and System 2 inspired planner with a Diffusion Transformer, Produces coherent motion for over 1 minute with moving camera and multi character scenes. Most avatar models move to the beat of the audio but miss meaning, so gestures feel generic and emotions feel shallow. The fix here is a Multimodal LLM planner that listens to the speech and drafts a structured plan describing intent, emotions, beats, and high level actions, which gives the motion engine clear semantic targets instead of only rhythm. The motion engine is a Multimodal Diffusion Transformer that fuses the plan with audio, the single reference image, and optional text prompts, then synthesizes continuous body, face, and head motion that matches both words and tone. A key trick is a Pseudo Last Frame, a synthetic target that summarizes the next expected state, which stabilizes fusion across modalities and keeps motion consistent over long spans. From just 1 image and speech, the system outputs speaking avatars with synchronized lips, context aware gestures, and continuous camera movement, and it also supports multi character interactions without manual choreography. Reported results show strong lip sync accuracy, high video quality, natural motion, and close match to text prompts, and the same setup works on nonhuman characters too.

Rohan Paul

63,859 次观看 • 11 个月前

Video diffusion models have strong implicit representations of 3D shape, material, and lighting, but controlling them with language is cumbersome, and control is critical for artists and animators. GenLit connects these implicit representations with a continuous 5D control signal describing the direction and intensity of a point light source. This enables single-image near-field relighting of an image using a video diffusion model. We use a ControlNet-like approach and show that, with a small amount of synthetic data, GenLit generalizes to complex real-world images. Given a single image and the 5D lighting signal, GenLit creates a video of a moving light source that is inside the scene. It moves around and behind scene objects, producing effects such as shading, cast shadows, secularities, and interreflections with a realism that is hard to obtain with traditional inverse rendering methods. GenLit shows that it is possible to get continuous control over implicit physical processes within a video model. I think this is just the beginning and promises to make such models much more practical for creators. Shrisha Bharadwaj will present today at SIGGRAPH Asia Room: S423/S424, Level 4 @ 13:50 on 15 of Dec.

Video diffusion models have strong implicit representations of 3D shape, material, and lighting, but controlling them with language is cumbersome, and control is critical for artists and animators. GenLit connects these implicit representations with a continuous 5D control signal describing the direction and intensity of a point light source. This enables single-image near-field relighting of an image using a video diffusion model. We use a ControlNet-like approach and show that, with a small amount of synthetic data, GenLit generalizes to complex real-world images. Given a single image and the 5D lighting signal, GenLit creates a video of a moving light source that is inside the scene. It moves around and behind scene objects, producing effects such as shading, cast shadows, secularities, and interreflections with a realism that is hard to obtain with traditional inverse rendering methods. GenLit shows that it is possible to get continuous control over implicit physical processes within a video model. I think this is just the beginning and promises to make such models much more practical for creators. Shrisha Bharadwaj will present today at SIGGRAPH Asia Room: S423/S424, Level 4 @ 13:50 on 15 of Dec.

Michael Black

22,182 次观看 • 7 个月前

The AI hunt for alien life has just begun. Welcome to ThousandsWorlds, a wild new dataset from researchers at Oxford/Cambridge++, for detecting faint signatures in the atmospheres of potentially habitable exoplanets. This is the first step towards finding life beyond earth. The plan is basically: 1) scan the galaxy for as many potentially habitable planets as possible 2) detect the gases in their atmospheres with powerful telescopes like JWST 3) infer from these gases whether life is present or not. ThousandWorlds is a benchmark for emulating these exoplanet climates: 1760 simulations across 5 GCMs, 8 planet parameters, and atmospheric variables on a 32 x 64 x 10 latitude-longitude-pressure grid. It includes three nested benchmark subsets, two evaluation protocols, and eight released baseline methods. incredible work from Miles Cranmer and many more 👽👽👽

The AI hunt for alien life has just begun. Welcome to ThousandsWorlds, a wild new dataset from researchers at Oxford/Cambridge++, for detecting faint signatures in the atmospheres of potentially habitable exoplanets. This is the first step towards finding life beyond earth. The plan is basically: 1) scan the galaxy for as many potentially habitable planets as possible 2) detect the gases in their atmospheres with powerful telescopes like JWST 3) infer from these gases whether life is present or not. ThousandWorlds is a benchmark for emulating these exoplanet climates: 1760 simulations across 5 GCMs, 8 planet parameters, and atmospheric variables on a 32 x 64 x 10 latitude-longitude-pressure grid. It includes three nested benchmark subsets, two evaluation protocols, and eight released baseline methods. incredible work from Miles Cranmer and many more 👽👽👽

Georgia Channing

18,751 次观看 • 1 个月前

Introducing Kaleido💮 from AI at Meta — a universal generative neural rendering engine for photorealistic, unified object and scene view synthesis. Kaleido is built on a simple but powerful design philosophy: 3D perception is a form of visual common sense. Following this idea, we formulate rendering purely as a sequence-to-sequence generation problem, successfully unifying neural rendering with the architecture principles behind modern language and video models. Unlike traditional neural rendering methods, Kaleido learns 3D purely in a data-driven way, without explicit 3D representations or structures. It acquires spatial understanding directly through large-scale video pretraining, then multi-view 3D data finetuning, inspired by how LLMs acquire textual common sense from large corpora before specialising in domains like coding. Through extensive ablations, we progressively modernised the architecture design and training strategies and tackled key scaling challenges in sequence-to-sequence generative rendering, arriving at a design that’s simple, versatile, and scalable. Kaleido significantly outperforms prior generative models in few-view settings, and remarkably is the first zero-shot generative method matches InstantNGP-level rendering quality in multi-view settings. We view Kaleido also as an alternative step towards world modeling that flexibly spans a spectrum of “realities": with many views, it faithfully reconstructs grounded reality; with fewer views, it imagines plausible unseen details. 🔗 Explore more results and paper:

Introducing Kaleido💮 from AI at Meta — a universal generative neural rendering engine for photorealistic, unified object and scene view synthesis. Kaleido is built on a simple but powerful design philosophy: 3D perception is a form of visual common sense. Following this idea, we formulate rendering purely as a sequence-to-sequence generation problem, successfully unifying neural rendering with the architecture principles behind modern language and video models. Unlike traditional neural rendering methods, Kaleido learns 3D purely in a data-driven way, without explicit 3D representations or structures. It acquires spatial understanding directly through large-scale video pretraining, then multi-view 3D data finetuning, inspired by how LLMs acquire textual common sense from large corpora before specialising in domains like coding. Through extensive ablations, we progressively modernised the architecture design and training strategies and tackled key scaling challenges in sequence-to-sequence generative rendering, arriving at a design that’s simple, versatile, and scalable. Kaleido significantly outperforms prior generative models in few-view settings, and remarkably is the first zero-shot generative method matches InstantNGP-level rendering quality in multi-view settings. We view Kaleido also as an alternative step towards world modeling that flexibly spans a spectrum of “realities": with many views, it faithfully reconstructs grounded reality; with fewer views, it imagines plausible unseen details. 🔗 Explore more results and paper:

Shikun Liu

22,332 次观看 • 9 个月前

🌀 #Live3D #Live2D #Live2DWIP 🌀 Today, I’d like to share two of the three methods I use to create joint twisting in my Live2D-based 2D Pseudo-3D concept. On the right is the Glue Joint： The basic idea is to use Glue to connect two pseudo-3D ArtMeshes. By taking advantage of Glue’s weight settings, it can create an effect similar to a 3D joint. Once the weights are set up, it can move very freely with a relatively low amount of work. However, it also shares the same weakness as 3D joints: when rotated at large angles, it can produce unpleasant twisting and distortion. On the left is the Hand-Bent Joint： The basic idea is to add bending Keyforms directly to a Warp Deformer, then manually sculpt the ideal deformation. This method can create the most detailed and refined joints, depending entirely on how the rigger designs them. The downside is that it requires much more work. Every additional degree of freedom requires a significant amount of manual adjustment. These two joint methods have almost opposite characteristics, so in actual Live3D work, I choose the most suitable method depending on the situation. For example, in this hand model, I used Hand-Bent Joints for the arm twisting and fingers, while the wrist uses a Glue Joint.✨

🌀 #Live3D #Live2D #Live2DWIP 🌀 Today, I’d like to share two of the three methods I use to create joint twisting in my Live2D-based 2D Pseudo-3D concept. On the right is the Glue Joint： The basic idea is to use Glue to connect two pseudo-3D ArtMeshes. By taking advantage of Glue’s weight settings, it can create an effect similar to a 3D joint. Once the weights are set up, it can move very freely with a relatively low amount of work. However, it also shares the same weakness as 3D joints: when rotated at large angles, it can produce unpleasant twisting and distortion. On the left is the Hand-Bent Joint： The basic idea is to add bending Keyforms directly to a Warp Deformer, then manually sculpt the ideal deformation. This method can create the most detailed and refined joints, depending entirely on how the rigger designs them. The downside is that it requires much more work. Every additional degree of freedom requires a significant amount of manual adjustment. These two joint methods have almost opposite characteristics, so in actual Live3D work, I choose the most suitable method depending on the situation. For example, in this hand model, I used Hand-Bent Joints for the arm twisting and fingers, while the wrist uses a Glue Joint.✨

📐Hephaestus📏Live2D匠人魂

18,119 次观看 • 1 个月前

🚨 CHINESE SCIENTISTS JUST INVENTED 3D PRINTING THAT CREATES OBJECTS IN 0.6 SECONDS USING ONLY LIGHT. Researchers at Tsinghua University have developed a new method called DISH (Digital Incoherent Synthesis of Holographic light fields) that can print complex millimeter-scale objects almost instantly. Instead of slowly building layer by layer, the system fires thousands of precisely patterned light images from multiple angles into a still vat of liquid resin. Where the light overlaps, the resin instantly hardens into a solid 3D object. The entire process takes just 0.6 seconds. Why this matters: • It’s currently the fastest volumetric 3D printing method ever demonstrated • Achieves extremely fine detail features thinner than a human hair • The resin stays completely still, so there’s no vibration or distortion • It can work with watery (low-viscosity) resins, making it suitable for biological applications • The team has already printed complex structures like blood vessel-like tubes and even a tiny bust of a historical figure The deeper implication: Traditional 3D printing has always been limited by speed and the need to move either the print head or the resin. This approach removes both constraints by using light itself as the sculptor. Because it can print directly into still liquid (and potentially onto living tissue), it opens new possibilities in bioprinting, medical devices, and rapid manufacturing. If the technology can be scaled beyond millimeter sizes, it could fundamentally change how we think about making physical objects turning “print” from a slow process into something closer to instantaneous fabrication. We’re moving from “layer by layer” to “all at once.” How do you think instant volumetric 3D printing like this could change medicine, manufacturing, or everyday life if it becomes widely available? Follow for more frontier manufacturing and materials science breakthroughs.

🚨 CHINESE SCIENTISTS JUST INVENTED 3D PRINTING THAT CREATES OBJECTS IN 0.6 SECONDS USING ONLY LIGHT. Researchers at Tsinghua University have developed a new method called DISH (Digital Incoherent Synthesis of Holographic light fields) that can print complex millimeter-scale objects almost instantly. Instead of slowly building layer by layer, the system fires thousands of precisely patterned light images from multiple angles into a still vat of liquid resin. Where the light overlaps, the resin instantly hardens into a solid 3D object. The entire process takes just 0.6 seconds. Why this matters: • It’s currently the fastest volumetric 3D printing method ever demonstrated • Achieves extremely fine detail features thinner than a human hair • The resin stays completely still, so there’s no vibration or distortion • It can work with watery (low-viscosity) resins, making it suitable for biological applications • The team has already printed complex structures like blood vessel-like tubes and even a tiny bust of a historical figure The deeper implication: Traditional 3D printing has always been limited by speed and the need to move either the print head or the resin. This approach removes both constraints by using light itself as the sculptor. Because it can print directly into still liquid (and potentially onto living tissue), it opens new possibilities in bioprinting, medical devices, and rapid manufacturing. If the technology can be scaled beyond millimeter sizes, it could fundamentally change how we think about making physical objects turning “print” from a slow process into something closer to instantaneous fabrication. We’re moving from “layer by layer” to “all at once.” How do you think instant volumetric 3D printing like this could change medicine, manufacturing, or everyday life if it becomes widely available? Follow for more frontier manufacturing and materials science breakthroughs.

TheNewPhysics

347,458 次观看 • 1 个月前

Two weeks ago I fixed one of my teeth with algorithms I wrote a couple of years ago! I got hooked by 3D scanning when I started to work for a software shop in Zurich that was programming 3D computational geometry algorithms for denture scanning to produce crowns (and more). Back then, a typical reconstruction pipeline was like: scan the patient’s teeth using an intraoral scanner, reconstruct the surface mesh, design the restoration digitally, and finally mill the crown out of ceramic. We were working mostly with point clouds and meshes, but it wasn’t just math, it was craftsmanship translated into a digital process. Every micron mattered. You could literally see how a good algorithm meant a better fit in someone’s mouth. Gaussian Splatting isn’t about surface reconstruction, it’s about appearance reconstruction. It doesn’t care about explicit topology, it captures how light interacts with the scene. In a sense, it’s the opposite philosophy of the dental world: instead of modeling what the object is, it models how the object looks. 3D Gaussian Splatting enables applications like training self driving cars, teaching robots to understand their environment, creating virtual worlds, or monitoring real sites. It represents scenes as millions of small Gaussians rendered in real time without the need for meshes or textures. Coming from a world where precision geometry was everything, this shift felt natural. It’s still about reconstruction, but with a different goal: not manufacturing a perfect object, but reproducing how the world actually looks. Two weeks ago I got my first dental crown, made with the same software, reconstruction algorithms, and Swiss precision I once helped develop. I haven’t worked there in two years, but sitting in that chair and seeing the process from the other side was a proud moment. It reminded me why I love this field.

Two weeks ago I fixed one of my teeth with algorithms I wrote a couple of years ago! I got hooked by 3D scanning when I started to work for a software shop in Zurich that was programming 3D computational geometry algorithms for denture scanning to produce crowns (and more). Back then, a typical reconstruction pipeline was like: scan the patient’s teeth using an intraoral scanner, reconstruct the surface mesh, design the restoration digitally, and finally mill the crown out of ceramic. We were working mostly with point clouds and meshes, but it wasn’t just math, it was craftsmanship translated into a digital process. Every micron mattered. You could literally see how a good algorithm meant a better fit in someone’s mouth. Gaussian Splatting isn’t about surface reconstruction, it’s about appearance reconstruction. It doesn’t care about explicit topology, it captures how light interacts with the scene. In a sense, it’s the opposite philosophy of the dental world: instead of modeling what the object is, it models how the object looks. 3D Gaussian Splatting enables applications like training self driving cars, teaching robots to understand their environment, creating virtual worlds, or monitoring real sites. It represents scenes as millions of small Gaussians rendered in real time without the need for meshes or textures. Coming from a world where precision geometry was everything, this shift felt natural. It’s still about reconstruction, but with a different goal: not manufacturing a perfect object, but reproducing how the world actually looks. Two weeks ago I got my first dental crown, made with the same software, reconstruction algorithms, and Swiss precision I once helped develop. I haven’t worked there in two years, but sitting in that chair and seeing the process from the other side was a proud moment. It reminded me why I love this field.

MrNeRF

290,140 次观看 • 8 个月前

We’re excited to introduce ShinkaEvolve: An open-source framework that evolves programs for scientific discovery with unprecedented sample-efficiency. Blog: Code: Like AlphaEvolve and its variants, our framework leverages LLMs to find state-of-the-art solutions to complex problems, but using orders of magnitude fewer resources! Many evolutionary AI systems are powerful but act like brute-force engines, burning thousands of samples to find good solutions. This makes discovery slow and expensive. We took inspiration from the efficiency of nature. ‘Shinka’ (進化) is Japanese for evolution, and we designed our system to be just as resourceful. On the classic circle packing optimization problem, ShinkaEvolve discovered a new state-of-the-art solution using only 150 samples. This is a big leap in efficiency compared to previous methods that required thousands of evaluations. We applied ShinkaEvolve to a diverse set of hard problems with real-world applications: 1/ AIME Math Reasoning: It evolved sophisticated agentic scaffolds that significantly outperform strong baselines, discovering an entire Pareto frontier of solutions trading performance for efficiency. 2/ Competitive Programming: On ALE-Bench (a benchmark for NP-Hard optimization problems), ShinkaEvolve took the best existing agent's solutions and improved them, turning a 5th place solution on one task into a 2nd place leaderboard rank in a competitive programming competition. 3/ LLM Training: We even turned ShinkaEvolve inward to improve LLMs themselves. It tackled the open challenge of designing load balancing losses for Mixture-of-Experts (MoE) models. It discovered a novel loss function that leads to better expert specialization and consistently improves model performance and perplexity. ShinkaEvolve achieves its remarkable sample-efficiency through three key innovations that work together: (1) an adaptive parent sampling strategy to balance exploration and exploitation, (2) novelty-based rejection filtering to avoid redundant work, and (3) a bandit-based LLM ensemble that dynamically picks the best model for the job. By making ShinkaEvolve open-source and highly sample-efficient, our goal is to democratize access to advanced, open-ended discovery tools. Our vision for ShinkaEvolve is to be an easy-to-use companion tool to help scientists and engineers with their daily work. We believe that building more efficient, nature-inspired systems is key to unlocking the future of AI-driven scientific research. We are excited to see what the community builds with it! Learn more in our technical report:

We’re excited to introduce ShinkaEvolve: An open-source framework that evolves programs for scientific discovery with unprecedented sample-efficiency. Blog: Code: Like AlphaEvolve and its variants, our framework leverages LLMs to find state-of-the-art solutions to complex problems, but using orders of magnitude fewer resources! Many evolutionary AI systems are powerful but act like brute-force engines, burning thousands of samples to find good solutions. This makes discovery slow and expensive. We took inspiration from the efficiency of nature. ‘Shinka’ (進化) is Japanese for evolution, and we designed our system to be just as resourceful. On the classic circle packing optimization problem, ShinkaEvolve discovered a new state-of-the-art solution using only 150 samples. This is a big leap in efficiency compared to previous methods that required thousands of evaluations. We applied ShinkaEvolve to a diverse set of hard problems with real-world applications: 1/ AIME Math Reasoning: It evolved sophisticated agentic scaffolds that significantly outperform strong baselines, discovering an entire Pareto frontier of solutions trading performance for efficiency. 2/ Competitive Programming: On ALE-Bench (a benchmark for NP-Hard optimization problems), ShinkaEvolve took the best existing agent's solutions and improved them, turning a 5th place solution on one task into a 2nd place leaderboard rank in a competitive programming competition. 3/ LLM Training: We even turned ShinkaEvolve inward to improve LLMs themselves. It tackled the open challenge of designing load balancing losses for Mixture-of-Experts (MoE) models. It discovered a novel loss function that leads to better expert specialization and consistently improves model performance and perplexity. ShinkaEvolve achieves its remarkable sample-efficiency through three key innovations that work together: (1) an adaptive parent sampling strategy to balance exploration and exploitation, (2) novelty-based rejection filtering to avoid redundant work, and (3) a bandit-based LLM ensemble that dynamically picks the best model for the job. By making ShinkaEvolve open-source and highly sample-efficient, our goal is to democratize access to advanced, open-ended discovery tools. Our vision for ShinkaEvolve is to be an easy-to-use companion tool to help scientists and engineers with their daily work. We believe that building more efficient, nature-inspired systems is key to unlocking the future of AI-driven scientific research. We are excited to see what the community builds with it! Learn more in our technical report:

Sakana AI

359,537 次观看 • 10 个月前