We are introducing Hunyuan3D-Part: an open-source part-level 3D shape generation model that outperforms all existing open- and closed-source models. Highlights: 🔹P3-SAM: The industry's first native 3D part segmentation model. 🔹X-Part: A part generation model that achieves state-of-the-art results in controllability and shape quality. Key features: 1️⃣Eliminates the use of 2D... SAM during training, relying solely on a large-scale dataset with 3.7 million shapes and clean part annotations. 2️⃣Introduces a new automated segmentation pipeline in 3D without user intervention. 3️⃣Implements a diffusion-based part decomposition pipeline utilizing both geometry and semantic clues. Code: Weights: Tech reports: 🔸P3-SAM: → Paper: → Project page: 🔸X-Part: → Paper: → Project page: Try it now: → (Light version) Hugging Face demo: → (Full version) Hunyuan3D Studio:

Tencent Hy

39,340 subscribers

72,522 views • 10 months ago • via X (Twitter)

Science & Technology, Education

Related videos

We're thrilled to release & open-source Hunyuan3D World Model 1.0! This model enables you to generate immersive, explorable, and interactive 3D worlds from just a sentence or an image. It's the industry's first open-source 3D world generation model, compatible with CG pipelines for full editability & simulation. Set to transform game development, VR, digital content creation and so on. Get started now👇🏻 Project Page： Try it now： Github： Hugging Face：

Tencent Hy

1,230,418 views • 1 year ago

📢 Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation Got only one or a few images and wondering if recovering the 3D environment is a reconstruction or generation problem? Why not do it with a generative reconstruction model! We show that a camera-conditioned video diffusion model can be transformed into a generative reconstruction model that directly outputs a high-quality 3D Gaussian Splatting representation through self-distillation, without requiring real-world training data. Check out our results in the video (wait for dynamic scenes in the second half!) : Project Page: Code and Models: Paper:

Sherwin Bahmani

66,590 views • 10 months ago

✨We are excited to open-source Tencent HY-Motion 1.0, a billion-parameter text-to-motion model built on the Diffusion Transformer (DiT) architecture and flow matching. Tencent HY-Motion 1.0 empowers developers and individual creators alike by transforming natural language into high-fidelity, fluid, and diverse 3D character animations, delivering exceptional instruction-following capabilities across a broad range of categories. The generated 3D animation assets can be seamlessly integrated into typical 3D animation pipelines.🎮🎥 Highlights: 🔹Billion-Scale DiT: Successfully scaled flow-matching DiT to 1B+ parameters, setting a new ceiling for instruction-following capability and generated motion quality. 🔹Full-Stage Training Strategy: The industry’s first motion generation model featuring a complete Pre-training → SFT → RL loop to optimize physical plausibility and semantic accuracy. 🔹Comprehensive Category Coverage: Features 200+ motion categories across 6 major classes—the most comprehensive in the industry, curated via a meticulous data pipeline. 🌐Project Page: 🔗Github: 🤗Hugging Face: 📄Technical report:

Tencent Hy

328,493 views • 6 months ago

DreamCraft3D: Hierarchical 3D Generation with Bootstrapped Diffusion Prior paper page: We present DreamCraft3D, a hierarchical 3D content generation method that produces high-fidelity and coherent 3D objects. We tackle the problem by leveraging a 2D reference image to guide the stages of geometry sculpting and texture boosting. A central focus of this work is to address the consistency issue that existing works encounter. To sculpt geometries that render coherently, we perform score distillation sampling via a view-dependent diffusion model. This 3D prior, alongside several training strategies, prioritizes the geometry consistency but compromises the texture fidelity. We further propose Bootstrapped Score Distillation to specifically boost the texture. We train a personalized diffusion model, Dreambooth, on the augmented renderings of the scene, imbuing it with 3D knowledge of the scene being optimized. The score distillation from this 3D-aware diffusion prior provides view-consistent guidance for the scene. Notably, through an alternating optimization of the diffusion prior and 3D scene representation, we achieve mutually reinforcing improvements: the optimized 3D scene aids in training the scene-specific diffusion model, which offers increasingly view-consistent guidance for 3D optimization. The optimization is thus bootstrapped and leads to substantial texture boosting. With tailored 3D priors throughout the hierarchical generation, DreamCraft3D generates coherent 3D objects with photorealistic renderings, advancing the state-of-the-art in 3D content generation.

AK

161,530 views • 2 years ago

3D-LLM: Injecting the 3D World into Large Language Models paper page: Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.

AK

249,708 views • 3 years ago

Today we're announcing the open-source release of HunyuanVideo-Foley, our new end-to-end Text-Video-to-Audio (TV2A) framework for generating high-fidelity audio.🚀 This tool empowers creators in video production, filmmaking, and game development to generate professional-grade audio that precisely aligns with visual dynamics and semantic context, addressing key challenges in V2A generation.🔊 Key Innovations: 🔹Exceptional Generalization: Trained on a massive 100k-hour multimodal dataset, the model generates contextually aware soundscapes for a wide range of scenes, from natural landscapes to animated shorts. 🔹Balanced Multimodal Response: Our innovative multimodal diffusion transformer (MMDiT) architecture ensures the model balances video and text cues, generating rich, layered sound effects that capture every detail, from the main subject to subtle background elements. 🔹High-Fidelity Audio: Using a Representation Alignment (REPA) loss function and a powerful Audio VAE, we've improved generation stability and now produce professional-grade audio, free of noise and inconsistencies. HunyuanVideo-Foley achieves SOTA on multiple benchmarks, surpassing all open-source models in audio quality, visual-semantic alignment, and temporal alignment. 👉Try it now: 🌐Project Page: 🔗Code: 📄Technical Report: 🤗Hugging Face:

Tencent Hy

122,706 views • 11 months ago

Introducing CADGenBench: measure how well AI systems produce engineering-grade 3D parts! While current models can generate 3D parts, they are far from precise enough to build functional parts. We built a benchmark to systematically measure their capabilities on two tasks: 1. Generation from an engineering drawing of a part 2. Editing: given an existing STEP file and a requested change The benchmark is tool-agnostic. It makes no assumptions about how you build the model. You can vary the LLM, and you can vary the environment. Use build123d, Onshape, Autodesk, or a model without an LLM entirely. We open sourced the scoring engine and a reference baseline on top of build123d. A collaboration between Hugging Face and Mecado! Submission space: Code repository:

Michael Rabinovich

63,879 views • 1 month ago

🚀Introducing Hunyuan3D-PolyGen, our newly upgraded and industry-first art-grade 3D generative model. It brings effortless intelligent retopology, making AI-generated models ready for professional art pipelines. ✅ Superior Mesh Topology: Our self-developed mesh autoregressive model ensures higher-quality mesh topology that meets stringent art standards. ✅ Complex Object Modeling: Leveraging our high-compression BPT representation, we can generate models with 10K+ faces, enabling more complex geometry, higher topology precision, and better detail. ✅ Flexible Output: Supports both tri and quad meshes, meeting diverse pipeline requirements. Hunyuan3D-PolyGen enables direct application of AI-generated 3D assets in game development and significantly boosts artist modeling efficiency. It's a robust foundation for the future of 3D content creation. 👉Try it now:

Tencent Hy

161,189 views • 1 year ago

Prince of Persia – Community Project The Prince of Persia community has come together to create its own project following the cancellation of the Sands of Time remake! This is not a traditional remake, but a large-scale mod aiming to replace most 3D models with high-quality assets and introduce global illumination through path tracing technology. We’re working on an official website — stay tuned! Current Status: Early development⏳ About the Project: 🔹 Public and community-driven 🔹 Fully playable 🔹 Focused on next-gen visuals while preserving the original experience We are looking for contributors! 🔹 3D Modelers / Artists 🔹 Texture Artists 🔹 Programmers 🔹 RTX Remix experts This is a hobby project with no payment, focused on long-term development without strict deadlines. We’re looking for passionate people who enjoy creating and want to be part of something special✨ Join us on Discord and become part of the team: 👉🏻 Follow our progress, share feedback, or get involved in development!😉 #PrinceofPersia #ThePoPProject

PoP Universe

67,375 views • 3 months ago

KYRALL TURNS "I NEED THIS PART" INTO SECONDS, WITHOUT OPENING CAD you describe the part. the AI generates it. no SolidWorks. no Fusion. no hours of setup. why this matters to you: modeling a part in CAD is slow and technical. sketch, constrain, extrude, fillet — for every bracket, housing or fixture. and simple one-off parts eat the same ceremony every time. for an engineer, that bottleneck stalls the whole project. Kyrall skips it: → prompt → 3D part ready → exports to STEP and STL, the formats you already use → for when you need the geometry now, not in three days it doesn't compete with CAD suites on millimeter control. it competes on getting you to a usable part first. idea → part → keep building. link below :)

marcus

31,238 views • 18 days ago

Tracking Anything with Decoupled Video Segmentation paper page: Training data for video segmentation are expensive to annotate. This impedes extensions of end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary settings. To 'track anything' without training on video data for every individual task, we develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. Due to this design, we only need an image-level model for the target task (which is cheaper to train) and a universal temporal propagation model which is trained once and generalizes across tasks. To effectively combine these two modules, we use bi-directional propagation for (semi-)online fusion of segmentation hypotheses from different frames to generate a coherent segmentation. We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks including large-vocabulary video panoptic segmentation, open-world video segmentation, referring video segmentation, and unsupervised video object segmentation.

AK

305,667 views • 2 years ago

We've officially released and open-sourced HunyuanImage 2.1, our latest text-to-image model. The new model delivers on our commitment to balancing performance and quality. With native 2K image generation, HunyuanImage 2.1 is an advanced open-source text-to-image model.🎨 ✨ New in 2.1: 🔹Advanced Semantics: Supports ultra-long and complex prompts of up to 1000 tokens, and precisely controls the generation of multiple subjects in a single image. 🔹Precise Chinese and English Text Rendering with seamless image–text integration: The model naturally integrates text into images, making it suitable for a wide range of applications such as product covers, illustrations, and poster design. 🔹Rich Styles and High Aesthetic: Capable of generating images in various styles (including photorealistic portraits, comics, and vinyl figures), it delivers outstanding visual appeal and artistic quality. 🔹High-Quality Generation: Efficiently produces ultra-high-definition (2K) images in the same time other models take to generate a 1K image. HunyuanImage 2.1 uses two text encoders: a multimodal large language model (MLLM) to improve the model's image and text alignment capabilities, and a multi-language character-aware encoder to improve text rendering capabilities. The model is a single- and double-stream diffusion transformer with 17B parameters. We've also open-sourced the weights of the accelerated version with meanflow, which reduces inference steps from 100 to just 8, and PromptEnhancer, the first industrial-grade rewriting model that enhances your prompts for more nuanced and expressive image generation. Now, creators turn complex ideas, like posters with slogans or multi-panel comics, into visuals faster than ever. We’re just getting started. Stay tuned for our native multimodal image generation model coming soon. 🌐Website: 🔗Github: 🤗Hugging Face: ✨Hugging Face Demo:

Tencent Hy

89,257 views • 10 months ago

Introducing FoundationMotion. A large-scale, video-derived motion annotation dataset & auto-labeling pipeline + advanced models for motion understanding. Fully open-source: code, datasets, and models, free to use and build on. Understanding motion is core to physical reasoning, yet today’s leading models still struggle with simple spatial actions like “turn right” or “move up” or “flip the toast” - mainly due to the lack of large, fine-grained motion datasets. We present FoundationMotion, a fully automated pipeline that: • detects & tracks objects in videos • extracts trajectories • uses LLMs + frames to generate rich motion captions & QA pairs → creating large-scale, high-quality motion datasets at scale. After fine-tuning the open-source models Qwen and NVILA on our annotations, these models now outperform the closed-source Gemini-3-Flash and GPT-5.1 on spatial understanding tasks across autonomous driving, robotics, and everyday scenarios. 📜Paper: 🌐Webpage: 💻 Code: 🕸️Model: 📊 Dataset: 👉 Interactive Demo: Let’s move research forward together. FoundationMotion is also referred to as Wolf V2 🐺, the second chapter in the Wolf series:

Boyi Li

66,999 views • 7 months ago

Introducing Omnia Alpha. Audio-driven generation with full control over camera, motion, and background. Built to cross the uncanny valley without sacrificing creative control. This release is part of a larger vision. The future of world models is open and local. We're exploring how to bring our upcoming models to consumer hardware. America needs a leader in open source world models, and we're building toward that. Full version coming soon. Until then, show us what you can do with Omnia. Available now.

Hedra

3,105,305 views • 5 months ago

Blended-NeRF: Zero-Shot Object Generation and Blending in Existing Neural Radiance Fields paper page: Editing a local region or a specific object in a 3D scene represented by a NeRF is challenging, mainly due to the implicit nature of the scene representation. Consistently blending a new realistic object into the scene adds an additional level of difficulty. We present Blended-NeRF, a robust and flexible framework for editing a specific region of interest in an existing NeRF scene, based on text prompts or image patches, along with a 3D ROI box. Our method leverages a pretrained language-image model to steer the synthesis towards a user-provided text prompt or image patch, along with a 3D MLP model initialized on an existing NeRF scene to generate the object and blend it into a specified region in the original scene. We allow local editing by localizing a 3D ROI box in the input scene, and seamlessly blend the content synthesized inside the ROI with the existing scene using a novel volumetric blending technique. To obtain natural looking and view-consistent results, we leverage existing and new geometric priors and 3D augmentations for improving the visual fidelity of the final result. We test our framework both qualitatively and quantitatively on a variety of real 3D scenes and text prompts, demonstrating realistic multi-view consistent results with much flexibility and diversity compared to the baselines. Finally, we show the applicability of our framework for several 3D editing applications, including adding new objects to a scene, removing/replacing/altering existing objects, and texture conversion.

AK

62,768 views • 3 years ago

NVIDIA has published a paper on DREAMGEN – a powerful 4-step pipeline for generating synthetic data for humanoids that enables task and environment generalization.
- Step 1: Fine-tune a video generation model using a small number of human teleoperation videos
- Step 2: Prompt the fine-tuned model to turn a single real image into new AI-imagined videos
- Step 3: Automatically label actions in the generated videos
- Step 4: Train a robot AI model with the labeled synthetic dataset
This enabled humanoid robots to perform 22 novel behaviors – such as pouring, opening/closing articulated objects, and manipulating a variety of tools. The original teleoperation dataset only included pick-and-place tasks. This takes task extensibility to another level without requiring human teleoperation for every single task. The pipeline will be made open-source soon. Project page:

The Humanoid Hub

12,074 views • 1 year ago

ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation. We propose to model the 3D parameter as a random variable instead of a constant as in SDS and present variational score distillation (VSD), a principled particle-based variational framework to explain and address the aforementioned issues in text-to-3D generation. We show that SDS is a special case of VSD and leads to poor samples with both small and large CFG weights. In comparison, VSD works well with various CFG weights as ancestral sampling from diffusion models and simultaneously improves the diversity and sample quality with a common CFG weight (i.e., 7.5). We further present various improvements in the design space for text-to-3D such as distillation time schedule and density initialization, which are orthogonal to the distillation algorithm yet not well explored. Our overall approach, dubbed ProlificDreamer, can generate high rendering resolution (i.e., 512 × 512) and high-fidelity NeRF with rich structure and complex effects (e.g., smoke and drops). Further, initialized from NeRF, meshes fine-tuned by VSD are meticulously detailed and photo-realistic. Paper page:

AK

46,151 views • 3 years ago

Here are 10 AI video editor GitHub repos worth bookmarking:
1. Shotcut: Most actively maintained open source video editor in 2026. 14K stars. Cross-platform with AI-assisted features. Just shipped a new release April 30, 2026.
2. Kdenlive: The closest open source alternative to Adobe Premiere Pro. Multi-track editing, proxy editing, VST audio, and customizable workspace. Best for professional workflows.
3. OpenShot: The easiest entry point for beginners. Drag and drop, 400+ transitions, 3D titles, and AI-assisted trimming. 5,700 stars.
4. Blender: Not just 3D. Blender's video sequence editor and compositing pipeline is used in professional film production. 18,300 stars. Unmatched for VFX.
5. Recordly: Screen recorder with auto-zoom, cursor polish, webcam overlays, and styled frames built in. Built for demo videos and walkthroughs.
6. Wan2.1: Alibaba's open source text-to-video model. Cinema-grade 1080p generation. Apache 2.0. The gold standard for open source video generation in 2026.
7. HunyuanVideo: Tencent's 13B parameter open source video model. 11.9K stars. Handles 720p and 1080p with high temporal coherence.
8. CogVideoX: Apache 2.0 licensed. Loads natively via Hugging Face Diffusers. Strong prompt following and smooth frame transitions. Needs 16GB VRAM minimum. 12.5K stars.
9. Open-Sora: Most starred open source video generation project at 24K stars. Full training pipeline for $200K. Production-level output quality.
10. Mochi 1: Focused entirely on motion quality. The most natural-looking physics of any open source video model. Water, fabric, and human gestures without AI jitter. Apache 2.0.

Kanika

17,309 views • 1 month ago

Differentiable Blocks World: Qualitative 3D Decomposition by Rendering Primitives paper page: Given a set of calibrated images of a scene, we present an approach that produces a simple, compact, and actionable 3D world representation by means of 3D primitives. While many approaches focus on recovering high-fidelity 3D scenes, we focus on parsing a scene into mid-level 3D representations made of a small set of textured primitives. Such representations are interpretable, easy to manipulate and suited for physics-based simulations. Moreover, unlike existing primitive decomposition methods that rely on 3D input data, our approach operates directly on images through differentiable rendering. Specifically, we model primitives as textured superquadric meshes and optimize their parameters from scratch with an image rendering loss. We highlight the importance of modeling transparency for each primitive, which is critical for optimization and also enables handling varying numbers of primitives. We show that the resulting textured primitives faithfully reconstruct the input images and accurately model the visible 3D points, while providing amodal shape completions of unseen object regions. We compare our approach to the state of the art on diverse scenes from DTU, and demonstrate its robustness on real-life captures from BlendedMVS and Nerfstudio. We also showcase how our results can be used to effortlessly edit a scene or perform physical simulations.
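
The abstract above says primitives are modeled as superquadric meshes. As a rough illustration of what a superquadric is, the sketch below uses the standard textbook parametrization and inside-outside function; it is not the authors' implementation, and the function names, the parameter layout (`scale`, `eps`), and the pure-Python style are assumptions made here for clarity. Texturing, transparency, and the rendering loss are left out entirely.

```python
import math


def _signed_pow(value, exponent):
    """Return sign(value) * |value|**exponent, which keeps the shape symmetric."""
    return math.copysign(abs(value) ** exponent, value)


def superquadric_surface_point(eta, omega, scale, eps):
    """Point on a superquadric surface (hypothetical helper, standard formula).

    eta   in [-pi/2, pi/2]: latitude-like angle
    omega in [-pi, pi]:     longitude-like angle
    scale = (a1, a2, a3):   half-extents along x, y, z
    eps   = (e1, e2):       shape exponents (1, 1 gives an ellipsoid;
                            values near 0 give box-like shapes)
    """
    a1, a2, a3 = scale
    e1, e2 = eps
    x = a1 * _signed_pow(math.cos(eta), e1) * _signed_pow(math.cos(omega), e2)
    y = a2 * _signed_pow(math.cos(eta), e1) * _signed_pow(math.sin(omega), e2)
    z = a3 * _signed_pow(math.sin(eta), e1)
    return (x, y, z)


def superquadric_inside_outside(point, scale, eps):
    """Inside-outside function: < 1 inside, == 1 on the surface, > 1 outside."""
    x, y, z = point
    a1, a2, a3 = scale
    e1, e2 = eps
    xy_term = (abs(x / a1) ** (2.0 / e2) + abs(y / a2) ** (2.0 / e2)) ** (e2 / e1)
    z_term = abs(z / a3) ** (2.0 / e1)
    return xy_term + z_term
```

Sampling `eta` and `omega` on a regular grid and connecting neighbouring samples would give a mesh of the primitive; in an optimization setting the scale and shape exponents would be the parameters being fitted.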

AK

38,571 views • 3 years ago

🍺 LagerNVS (CVPR 2026) 🍺
LagerNVS is a generalizable, feed-forward, real-time Novel View Synthesis network which
- performs rendering in real time,
- generalizes to in-the-wild data,
- works with and without known source cameras,
- sets a new state-of-the-art among deterministic methods,
- can be paired with a diffusion decoder for generative extrapolation.
LagerNVS shows that 3D biases are useful for Novel View Synthesis but explicit 3D representations are not required to achieve them. We use 3D biases in (1) architecture design and (2) pre-training:
(1) In NVS with explicit 3D representations (3DGS, NeRF) reconstruction is typically difficult and slow, but rendering is much faster and simpler. We mimic this process in the network design: we use a large (1B params) encoder and a small, lightweight decoder (ViT-B). This allows increasing the network capacity while still achieving real-time rendering.
(2) The encoder, initialized from VGGT, was pre-trained with 3D reconstruction objectives, making the initial features 3D aware.
Both substantially improve performance.
Project page: Code: Paper: Models:
Work done with Jianyuan, Minghao Chen, Christian Rupprecht and Andrea Vedaldi

Stan Szymanowicz

31,651 views • 4 months ago