You can't 3D reconstruct glass from images... WRONG! Thanks to video diffusion, now just about anything is possible! Introducing Diffusion Knows Transparency (DKT).
Transparent and reflective objects usually break robot vision and photogrammetry pipelines because they don't follow the "solid object" rules standard cameras expect. DKT is a new AI model that repurposes the "internal physics engine" found in video generation models to solve this problem.
Researchers took a massive video diffusion model (WAN) and fine-tuned it on a custom-built synthetic dataset to turn it into a high-precision depth sensor. To train the AI, they built the first massive synthetic video library of transparent objects: 1.32 million frames of perfectly labeled glass and metal objects in motion. Without ever seeing a "real" labeled video of glass during training, the model (DKT) outperformed all previous specialized systems on real-world benchmarks (ClearPose, DREDS). They also created a "lightweight" 1.3B-parameter version that runs fast enough (0.17s per frame) to be used on actual robot hardware.
Two reasons I find this project important:
1. It further proves that synthetic data will be essential for training the next-generation vision models.
2. In real-world robotic tests, using DKT's depth maps nearly doubled the success rate of robot arms trying to pick up objects on tricky reflective or translucent surfaces. At-home robots will need to interact with these types of objects on a daily basis.
Check out the project page here: Code is LIVE!
#Computervision #Robotics #AI

Jonathan Stephens
17,712 views • 5 months ago
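The 0.17 s per frame figure quoted in the DKT post can be turned into a throughput number with simple arithmetic. The sketch below assumes frames are processed one after another at a constant cost; the 300-frame clip is a made-up example, not something from the post.

```python
# Figures quoted in the post; everything else here is illustrative.
SECONDS_PER_FRAME = 0.17


def frames_per_second(seconds_per_frame: float) -> float:
    """Convert per-frame latency into throughput."""
    if seconds_per_frame <= 0:
        raise ValueError("seconds_per_frame must be positive")
    return 1.0 / seconds_per_frame


def clip_latency(num_frames: int, seconds_per_frame: float) -> float:
    """Total time to process a clip sequentially, in seconds."""
    return num_frames * seconds_per_frame


if __name__ == "__main__":
    fps = frames_per_second(SECONDS_PER_FRAME)
    print(f"throughput: {fps:.1f} frames per second")
    # Hypothetical clip: 10 seconds recorded at 30 FPS.
    print(f"300-frame clip: {clip_latency(300, SECONDS_PER_FRAME):.0f} s")
```

At roughly 6 frames per second this is below typical camera frame rates, so a robot would likely run depth estimation on a subset of frames.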
Placing objects sounds simple… until robots have to do it. This method makes it simple, fast & reliable. [Github ⬇️]
Robotic object placement is tough, especially with stacking, hanging, or insertion. AnyPlace is a new two-stage method that uses only synthetic data and a vision-language model to teach robots where and how to place objects, even in the real world.
Why this works
✅ Finds the right spot with help from vision-language models
✅ Handles stacking, insertion, and hanging with no real-world training
✅ Trained on synthetic data using Blender and IsaacSim
✅ Works in the real world without fine-tuning
It shows that smart use of simulation and language models can make robotic placement tasks easier, faster, and more reliable.
Github: Paper:
Thank you for sharing, Animesh Garg!

Ilir Aliu - eu/acc
22,843 views • 1 year ago
Depth Any Video with Scalable Synthetic Data
AI physicists and chemists continue to make strides in depth estimation from video. Check out this new paper featuring some impressive examples. See the thread for more details (unfortunately no code yet).
Abstract: Video depth estimation has long been hindered by the scarcity of consistent and scalable ground truth data, leading to inconsistent and unreliable results. In this paper, we introduce Depth Any Video, a model that tackles the challenge through two key innovations. First, we develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse game environments, yielding 40,000 video clips of 5-second duration, each with precise depth annotations. Second, we leverage the powerful priors of generative video diffusion models to handle real-world videos effectively, integrating advanced techniques such as rotary position encoding and flow matching to further enhance flexibility and efficiency. Unlike previous models, which are limited to fixed-length video sequences, our approach introduces a novel mixed-duration training strategy that handles videos of varying lengths and performs robustly across different frame rates, even on single frames. At inference, we propose a depth interpolation method that enables our model to infer high-resolution video depth across sequences of up to 150 frames. Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency.

MrNeRF
27,428 views • 1 year ago
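The Depth Any Video abstract mentions flow matching without giving details. As a rough illustration only, here is the commonly used linear-path formulation: a training sample is an interpolation between noise and data, and the regression target is the constant velocity between them. Whether the paper uses exactly this path is an assumption, and all function names are made up for this sketch.

```python
def interpolate(noise, data, t):
    """Point on the straight path from noise (t=0) to data (t=1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def target_velocity(noise, data):
    """Velocity of the straight path; this is what the network regresses."""
    return [d - n for n, d in zip(noise, data)]


def euler_sample(noise, velocity_fn, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    noise, data = [0.0, 0.0], [2.0, 4.0]
    print(interpolate(noise, data, 0.5))
    # With a perfect velocity predictor, sampling recovers the data point.
    oracle = lambda x, t: target_velocity(noise, data)
    print(euler_sample(noise, oracle, num_steps=4))
```

In a real model the oracle is replaced by a trained network, and the vectors are video latents, not two-element lists.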
NVIDIA just released a very impressive text-to-video paper. Video Latent Diffusion Models (Video LDMs) use a diffusion model in a compressed latent space to generate high-resolution videos. Here's a brief overview of how it works:
1. Pre-train image LDM on a dataset of images.
2. Turn the image LDM into a Video LDM by adding temporal layers to model video frames.
3. Fine-tune the Video LDM on encoded video sequences to create a video generator.
4. Temporally align diffusion model upsamplers to generate high-resolution videos.
5. Validate Video LDM on real driving videos of 512x1024 resolution, achieving state-of-the-art performance.
6. Apply the approach in creative content creation with text-to-video modeling.
Paper: Project:

Lior Alexander
158,539 views • 3 years ago
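Step 2 of the Video LDM overview (adding temporal layers to an image model) can be pictured with a toy example. The sketch below assumes each frame is reduced to a single number, the pretrained image layer is an identity stand-in, and the temporal layer is a neighbour average whose output is blended with the per-frame output through a mixing weight `alpha`. The blend and every name here are illustrative assumptions, not the paper's code.

```python
def spatial_layer(video_features):
    """Stand-in for a pretrained per-frame (image) layer: identity here."""
    return list(video_features)


def temporal_layer(video_features):
    """Toy temporal layer: average each frame with its immediate neighbours."""
    mixed = []
    for t in range(len(video_features)):
        window = video_features[max(0, t - 1): t + 2]
        mixed.append(sum(window) / len(window))
    return mixed


def blend(spatial_out, temporal_out, alpha):
    """alpha=1 keeps the image model's output; alpha=0 uses only the temporal path."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return [alpha * s + (1.0 - alpha) * m for s, m in zip(spatial_out, temporal_out)]


if __name__ == "__main__":
    video = [0.0, 3.0, 6.0]          # one scalar feature per frame
    s = spatial_layer(video)
    m = temporal_layer(s)
    print(blend(s, m, alpha=1.0))    # behaves like the original image model
    print(blend(s, m, alpha=0.5))    # frames now share information over time
```

Starting with `alpha` at 1 means the video model initially reproduces the image model frame by frame, which is one reason to fine-tune from an image LDM instead of training from scratch.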
Introducing 📦𝗔𝗿𝘁𝗶𝗟𝗮𝘁𝗲𝗻𝘁🔧 (SIGGRAPH Asia 2025), a high-quality 3D diffusion model that explicitly models object articulation, paving the way for richer, more realistic assets in embodied AI and simulation:
– Generates fully articulated 3D objects
– Physically plausible joints & motion
– High-fidelity 3D Gaussian appearance
– Supports generation from a single real image
arXiv: Project: Code (coming soon):

Xingang Pan
11,473 views • 7 months ago
Video diffusion models have strong implicit representations of 3D shape, material, and lighting, but controlling them with language is cumbersome, and control is critical for artists and animators. GenLit connects these implicit representations with a continuous 5D control signal describing the direction and intensity of a point light source. This enables single-image near-field relighting of an image using a video diffusion model. We use a ControlNet-like approach and show that, with a small amount of synthetic data, GenLit generalizes to complex real-world images.
Given a single image and the 5D lighting signal, GenLit creates a video of a moving light source that is inside the scene. It moves around and behind scene objects, producing effects such as shading, cast shadows, specularities, and interreflections with a realism that is hard to obtain with traditional inverse rendering methods.
GenLit shows that it is possible to get continuous control over implicit physical processes within a video model. I think this is just the beginning and promises to make such models much more practical for creators.
Shrisha Bharadwaj will present today at SIGGRAPH Asia, Room S423/S424, Level 4, at 13:50 on Dec 15.

Michael Black
22,092 views • 6 months ago
Chop the gradients ✂️! We found that truncating decoder gradients in latent video diffusion to a fixed window allows us to finetune on videos with pixel-wise perceptual losses without running out of memory. Pixel losses have been essential for image generation and reconstruction, but until now, they haven't scaled to long-duration, high-resolution video diffusion due to recursive activation accumulation in causal decoders, leading to OOM during training 💥📉.
Project:
Video diffusion models can do a lot more 🚀 when you can backprop the decoder! Post-process neural rendered scenes, super-resolve videos, harmonize lighting in controlled synthetic driving scenes, and inpaint videos, all in a single step ⚡ with a quick finetune from a standard diffusion model.

Felix Heide
28,323 views • 2 months ago
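To give a feel for what truncating gradients to a fixed window means, here is a toy model, not the authors' method. It assumes a scalar causal recurrence `h[t] = decay * h[t-1] + z[t]` standing in for a causal decoder, with loss equal to the sum of all outputs, and computes the gradient of that loss with respect to each latent `z[s]` either fully or with backpropagation cut off after `window` frames.

```python
def latent_grads(num_frames, decay, window=None):
    """Gradient of sum_t h[t] w.r.t. each z[s], where h[t] = decay*h[t-1] + z[t].

    With window=None every output backpropagates to all earlier frames.
    With window=K each output only backpropagates through its last K frames,
    so the work (and stored activations) per output is bounded by K.
    """
    grads = [0.0] * num_frames
    for t in range(num_frames):
        earliest = 0 if window is None else max(0, t - window + 1)
        for s in range(earliest, t + 1):
            grads[s] += decay ** (t - s)
    return grads


if __name__ == "__main__":
    print(latent_grads(4, 0.5))            # full backprop
    print(latent_grads(4, 0.5, window=2))  # truncated to a 2-frame window
```

In this toy the truncated gradients are biased for early frames but stay close when `decay` is small, which is the usual trade-off with truncated backpropagation.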
We’ve seen humanoid robots walk around for a while, but when will they actually help with useful tasks in daily life? The challenge here is the diversity and complexity of real-world scenes. Our new work tackles this problem via 3D visuomotor policy learning. Using data from only 1 scene, our Improved 3D Diffusion Policy (iDP3) enables a full-sized humanoid robot to autonomously pick and place objects, pour water, and wipe tables in the wild open world. (And all these skills are useful, right?)
Web: Fully open-sourced code:

Yanjie Ze
75,194 views • 1 year ago
Check out this Stereo4D paper from Google DeepMind. It's a pretty clever approach to a persistent problem in computer vision -- getting good training data for how things move in 3D.
The key insight is using VR180 videos -- those stereo fisheye videos we launched back in 2017 for YouTubeVR. It was always clear that structured stereo datasets would be valuable for computer vision -- and we launched some powerful VR tools with it back in 2017 (link below). But what's the game changer now in 2024 is the scale -- they're providing 110K high quality clips :-) That's the kind of massive, real-world AI dataset that was just a dream back then!
They're using it to train this model called DynaDUSt3R that can predict both 3D structure and motion from video frames. Which means it tracks how objects move between frames while simultaneously reconstructing their 3D shape. And given we're dealing with real stereoscopic content, results are notably better than synthetic data, giving you a faithful rendition of the real world with a diverse set of subject matter.
It's one of those through lines when tackling a timeless mission like mapping the world or spatial computing -- VR content created for immersion becoming the foundation for teaching machines to understand how the world moves. Sometimes innovation chains together in unexpected ways!
Links to projects below ⛓️

Bilawal Sidhu
67,919 views • 1 year ago
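Why stereo footage is useful as 3D training data comes down to the classic disparity-to-depth relation, depth = focal length × baseline / disparity. The sketch below assumes an idealized, rectified pinhole stereo pair; VR180 footage is fisheye and would need rectification first, and the numbers used are made up for illustration, not taken from Stereo4D.

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth in metres for a rectified pinhole stereo pair.

    focal_px:     focal length in pixels
    baseline_m:   distance between the two camera centres in metres
    disparity_px: horizontal shift of the same point between left and right images
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive (zero means infinitely far)")
    return focal_px * baseline_m / disparity_px


if __name__ == "__main__":
    # Hypothetical rig: 1000 px focal length, 6.3 cm baseline.
    for disparity in (42.0, 21.0, 7.0):
        depth = depth_from_disparity(1000.0, 0.063, disparity)
        print(f"disparity {disparity:>4.0f} px -> depth {depth:.2f} m")
```

Smaller disparities map to larger depths, so distant points are the hardest to measure precisely from a fixed baseline.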
How can robots reliably place objects in diverse real-world tasks? 🤖🔍 Placement is tough: objects vary in shape and placement modes (such as stacking, hanging, and insertion), making it a challenging problem. We introduce AnyPlace, a two-stage method trained purely on synthetic data to predict diverse placement poses of unseen objects for real-world tasks. Read on for more👇

Animesh Garg
24,662 views • 1 year ago
NVIDIA AI Released DiffusionRenderer: An AI Model for Editable, Photorealistic 3D Scenes from a Single Video
In a groundbreaking new paper, researchers at NVIDIA, University of Toronto, Vector Institute and the University of Illinois Urbana-Champaign have unveiled a framework that directly tackles this challenge. DiffusionRenderer represents a revolutionary leap forward, moving beyond mere generation to offer a unified solution for understanding and manipulating 3D scenes from a single video. It effectively bridges the gap between generation and editing, unlocking the true creative potential of AI-driven content.
DiffusionRenderer treats the “what” (the scene’s properties) and the “how” (the rendering) in one unified framework built on the same powerful video diffusion architecture that underpins models like Stable Video Diffusion...
Read full article here: Paper: GitHub Page:
NVIDIA NVIDIA AI NVIDIAnewsroom NVIDIA AIDev

Marktechpost AI Dev News ⚡
104,741 views • 11 months ago
NVIDIA finally released Neuralangelo's source code! The model can turn videos from any device into detailed 3D structures, fully replicating buildings, sculptures, or other real-world objects or spaces virtually. Here's how it works:
The model takes a 2D video that shows an object or scene from multiple angles.
It selects frames from different viewpoints to understand depth, size, and shape.
The AI creates an initial 3D representation, similar to a sculptor shaping a subject.
The render is optimized to enhance details, like a sculptor refining texture.
The outcome is a 3D object or scene suitable for virtual reality, digital twins, or robotics.

Lior Alexander
478,001 views • 2 years ago
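The frame-selection step in the Neuralangelo summary can be approximated very simply. The sketch below picks evenly spaced frame indices from a video, on the assumption that a camera moving steadily around an object yields different viewpoints at different times. This is a generic preprocessing heuristic with made-up names, not Neuralangelo's actual selection logic.

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Return evenly spaced frame indices, always including first and last."""
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    if num_samples >= total_frames:
        return list(range(total_frames))
    if num_samples == 1:
        return [0]
    last = total_frames - 1
    # Integer arithmetic keeps the result deterministic (no float rounding).
    return [i * last // (num_samples - 1) for i in range(num_samples)]


if __name__ == "__main__":
    print(sample_frame_indices(100, 5))
```

A real pipeline would typically also drop blurry frames and estimate camera poses before reconstruction.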
✨ Made a new mini feature on Photo AI: [ Grab from 3d model ]
So the problem is we're at that stage in time (typical for AI) where image-to-3d models are not good enough but are fun to play with, but we know they'll be good enough in 1-2 years
With [ Make 3d model ] you already can turn any Photo AI pic into a 3d model but it still looks hyper clunky and deformed, but it works!
One cool idea I had to make that more useful and made now: Let people make a 3d model then change the view of it with the 3d viewer, then press [ o ] and it grabs a frame of the 3d
That image you can then [ Remix ] (img2img), and it becomes a real photo again and that in turn you can then turn into a video again with [ Make video ]
So that essentially gives you a fully freeform camera position control to take photos with
One thing I need to fix is the background/skybox, I kinda need to take the original photo and remove the person and just get the background for the 3d model viewer, in this case it should be white, but it's a start!

@levelsio
119,210 views • 11 months ago
As announced in partnership with NVIDIA at CES, we’re excited to introduce Stable Point Aware 3D (SPAR3D), setting a new standard in 3D generation. Ideal for running on NVIDIA RTX AI PCs, SPAR3D enables real-time editing and complete structure generation of 3D objects from a single image in under a second. You can download the weights on Hugging Face and code on GitHub, or access the model through the Stability AI API. Learn more here: (1/3)

Stability AI
181,402 views • 1 year ago
So remember that time machine I told you about? I managed to sneak out another video from my visit to the future. This time I captured video of a 3d artist creating assets for a short commercial. They sculpted the bird in the video with nothing but two controllers, which they used to manipulate generative matter in 3d. The AI model changed in realtime and adopted their style of work, based both on the work they did on the scene and on reference images they gave the system. Seeing the 3d model transform in realtime as they were working on it was incredible. When it started moving and reacted to their instructions, I knew it was time to return to the present :D #art #ai

Martin Nebelong
22,036 views • 2 years ago
AI in robotics gets all the attention right now, but sometimes the most interesting work is very practical. Viet built a small vision system that counts potatoes on a conveyor belt. No giant dataset. No huge model. Just a clear problem and a smart setup.
He used Ultralytics’ ObjectCounter, trained a tiny YOLO11 nano model, and because there was no potato dataset, he annotated a single frame with SAM 2 and trained from that. One frame. Still works across the whole video.
It is a good reminder that useful AI in industry often looks like this. Focused. Lightweight. Solves a real task. If you work in manufacturing or robotics, these small systems are usually the fastest wins. They save time, reduce errors, and do not need massive infrastructure.
Nice work, Viet. His projects:
---
Weekly robotics and AI insights. Subscribe free:

Ilir Aliu
1,674,698 views • 6 months ago
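The counting idea behind a conveyor-belt setup like the potato counter is small enough to sketch without any vision library. The code below does not use the Ultralytics API. It assumes a detector and tracker have already produced, for each track ID, the x-coordinate of the object's centre in successive frames, and it counts each object once when it crosses a vertical line from left to right. The data layout and names are hypothetical.

```python
def count_crossings(tracks: dict[int, list[float]], line_x: float) -> int:
    """Count tracks whose centre moves from left of line_x to on or past it.

    tracks maps a track ID to the x-coordinates of that object's centre,
    one entry per frame in which it was seen. Each track counts at most once.
    """
    count = 0
    for xs in tracks.values():
        crossed = any(prev < line_x <= cur for prev, cur in zip(xs, xs[1:]))
        if crossed:
            count += 1
    return count


if __name__ == "__main__":
    tracks = {
        1: [10.0, 40.0, 70.0],   # crosses the line between frames 2 and 3
        2: [20.0, 30.0, 45.0],   # never reaches the line
        3: [60.0, 80.0],         # first seen already past the line
    }
    print(count_crossings(tracks, line_x=50.0))
```

Counting per track ID, not per detection, is what keeps one potato from being counted in every frame it appears.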
Some random thoughts reading through the new RobbyAnt VLA paper:
- 20,000 hours of data across 9 robots!!
- damn, Chinese companies are going to trivially outscale the American ones on real robot data
- It's really cool they train on depth; it means they can handle transparent objects really well for example
- You can never tell how good these models are without trying them, since everyone trains on different robots, but from the results they show it does seem to clean up
- cross embodiment scaling laws are really cool
- if you have lots of robots do you need human video??

Chris Paxton
19,641 views • 4 months ago
Introducing Attio Objects 🚀
We know how hard it is to find a CRM that fits your unique business model. That's why we built Attio Objects – our powerful data model with custom objects that gives you complete flexibility to structure your CRM exactly how you need it.
Along with custom objects, we've also introduced new standard objects:
- Workspaces and Users objects for PLG businesses.
- A robust Deals object for sales-driven companies.
This is the culmination of a 4-year effort, with 3 years of work put in even before launching Attio. Since day one, we've been determined to solve the fundamental problem in the CRM space: the trade-off between power and time-to-value. If you wanted power and flexibility, your CRM would take forever to build and not work well with your stack. If you wanted speed, you'd need to use highly opinionated, inflexible software that doesn't really work for your business.
That ends today. With Attio, you no longer have to compromise. Build your CRM your way, fast. Iterate as you grow.
High-growth startups like Replicate, , and Modal and more are already using Attio's object architecture to perfectly match their businesses and accelerate their growth.
To get all the details, check out our blog post 👇

Attio
26,821 views • 2 years ago
🤖 NVIDIA’s Gr00t N1.5 is now available in LeRobot! This is the result of a great collaboration between the Hugging Face LeRobot team and NVIDIA Robotics!
Gr00t N1.5 highlights:
🦾 Cross-embodiment foundation model for robots
🧠 Multimodal inputs: vision, language, and proprioception
🪛 Tested on the Libero benchmark and real-world hardware tasks
🌍 Trained on real robot, synthetic, and internet-scale video data
⚙️ Flow matching action transformer for action prediction

LeRobot
115,194 views • 8 months ago
doodles AI beta. next week. we're building the tools for a new era of dynamic world-building. it starts with an image model that reimagines anything and everything through the doodles lens. this is the first iteration of many. as the product evolves, we'll introduce the ability to turn your generations into physical objects. video with sound and dialogue, realtime AR, and gaming are all on the roadmap. doodles AI aligns us with the speed and scale of the AI industry at large. our colourful world can now be plugged into new tech as it unfolds. create with us.

burnt toast
61,243 views • 3 months ago
This is some quietly impressive work on making video world models actually controllable in 4D space. VerseCrafter lets you take an input image, use something like Blender to animate the 3D camera path and object trajectories, then uses that to condition generation. Scribbling in 2D feels so crude in comparison.
The authors represent everything in a shared 4D world state - static background as a point cloud, moving objects as 3D gaussian trajectories. The gaussians are an interesting choice because they capture position, shape, and orientation probabilistically rather than forcing rigid bounding boxes or category-specific models like SMPL-X for human bodies.
They bolt this onto frozen Wan2.1 with a lightweight adapter, so they get a strong video prior. They also built a pipeline to auto-extract 4D annotations from real-world videos to train this puppy.
It doesn't look sexy yet, but IMO this is the interface video world models need - actual 3D authoring tools to exert control rather than crude scribbles and prompt incantations.

Bilawal Sidhu
25,802 views • 5 months ago