Robots can now reconstruct 3D scenes in real time from a single RGB camera. [📍 Projects page + paper]

No depth sensor. No retraining. 30 FPS.

Researchers at Imperial College London introduced KV-Tracker, a training-free method that makes heavy models like π³ and Depth Anything 3 fast enough for real-time tracking.

The idea is simple. These models use global self-attention, which is powerful but computationally expensive. KV-Tracker caches the key and value pairs from selected keyframes and reuses them for new frames. That cache becomes an implicit scene representation.

Result:
• Up to 30 FPS
• 10 to 15x speedup
• Accurate 6-DoF tracking on benchmarks like TUM RGB-D and 7-Scenes
• Works with monocular RGB only

It also supports object-level tracking with masks and allows saving the KV-cache for later reuse. For robotics, this reduces hardware constraints and moves real-time 3D perception closer to practical deployment.

Credit to Marwan Taher at Imperial’s Dyson Robotics Lab and many others who contributed to this!

📍 Save projects page + paper for later: Video:
If it matters in AI or Robotics you'll read it here first:

Ilir Aliu

53,814 subscribers

53,866 views • 2 months ago • via X (Twitter)

News and Politics, Science and Technology
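The caching idea in the post above can be illustrated with a toy single-head attention in plain Python. This is a minimal sketch, not KV-Tracker's actual code: the names (`KVCache`, `track_frame`), the 2x2 projection matrices, and the two-dimensional "tokens" are all hypothetical, and the new frame here attends only to the cached keyframe entries. It shows one point: keys and values are computed once per keyframe, so a new frame only has to compute its own queries.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    # Multiply matrix W (list of rows) by vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(query, keys, values):
    # Scaled dot-product attention for a single query vector.
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Hypothetical toy projection matrices (assumptions, not from the paper).
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.0], [0.0, 0.5]]
W_V = [[2.0, 0.0], [0.0, 2.0]]

class KVCache:
    """Stores keys and values computed once from keyframe tokens."""
    def __init__(self):
        self.keys = []
        self.values = []

    def add_keyframe(self, tokens):
        for t in tokens:
            self.keys.append(matvec(W_K, t))
            self.values.append(matvec(W_V, t))

def track_frame(cache, frame_tokens):
    # A new frame computes only its queries and reuses the cached K/V.
    return [attend(matvec(W_Q, t), cache.keys, cache.values) for t in frame_tokens]

cache = KVCache()
cache.add_keyframe([[1.0, 0.0], [0.0, 1.0]])   # one keyframe, two tokens
out = track_frame(cache, [[1.0, 0.0]])          # one new frame, one token
```

The cost of `track_frame` grows with the number of cached entries rather than with the full attention over every frame seen so far, which is the intuition behind the reported speedup.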

Related videos

You don’t need expensive hardware to build something that feels like real robotics. An Arduino, a servo, and an ultrasonic sensor can scan space and turn echoes into a simple radar view. It sweeps, measures distance, and visualizes it in real time. Cheap, simple, and a solid way to understand how machines sense the world. Limits show up fast, which is where better sensing and smarter software come in. Original build by SunFounder —— Weekly robotics and AI insights. Subscribe free:

Ilir Aliu

872,423 views • 3 days ago
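The sweep-and-measure loop in the Arduino radar post comes down to two small calculations: echo time to distance, then polar to Cartesian for plotting. A minimal Python sketch, assuming a speed of sound of roughly 343 m/s in room-temperature air; the function names and sample readings are hypothetical, and this is not the SunFounder code.

```python
import math

# Assumption: sound travels about 343 m/s, i.e. 0.0343 cm per microsecond.
SPEED_OF_SOUND_CM_PER_US = 0.0343

def echo_to_distance_cm(echo_us):
    # The pulse travels to the obstacle and back, so halve the round trip.
    return echo_us * SPEED_OF_SOUND_CM_PER_US / 2.0

def polar_to_xy(angle_deg, distance_cm):
    # Convert a servo angle and a range reading into plot coordinates.
    a = math.radians(angle_deg)
    return (distance_cm * math.cos(a), distance_cm * math.sin(a))

def sweep_to_points(readings):
    # readings: list of (servo angle in degrees, echo time in microseconds).
    return [polar_to_xy(angle, echo_to_distance_cm(echo)) for angle, echo in readings]

# Hypothetical sweep: three angles with made-up echo times.
readings = [(0, 1000.0), (90, 2000.0), (180, 1000.0)]
points = sweep_to_points(readings)
```

Plotting `points` as a scatter around the origin gives the familiar radar fan. The limits mentioned in the post show up here too: a wide ultrasonic beam means each point is really a blurry arc.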

Placing objects sounds simple… until robots have to do it. This method makes it simple, fast & reliable. [Github ⬇️]

Robotic object placement is tough, especially with stacking, hanging, or insertion. AnyPlace is a new two-stage method that uses only synthetic data and a vision-language model to teach robots where and how to place objects; even in the real world.

Why this works
✅ Finds the right spot with help from vision-language models
✅ Handles stacking, insertion, and hanging with no real-world training
✅ Trained on synthetic data using Blender and IsaacSim
✅ Works in the real world without fine-tuning

It shows that smart use of simulation and language models can make robotic placement tasks easier, faster, and more reliable.

Github: Paper:
Thank you for sharing Animesh Garg !

Ilir Aliu - eu/acc

22,843 views • 1 year ago

SAM 3D Body is a CVPR 2026 award candidate paper from AI at Meta. The model recovers a full 3D human body mesh from a single RGB image. You can run it automatically, or guide the reconstruction with masks and 2D keypoints. Thx to Niels Rogge for the awesome demo idea.

SkalskiP

71,881 views • 29 days ago

NVIDIA finally released Neuralangelo's source code! The model can turn videos from any device into detailed 3D structures, fully replicating buildings, sculptures, or other real-world objects or spaces virtually.

Here's how it works:
• The model takes a 2D video showing an object or scene from multiple angles.
• It selects frames from different viewpoints to understand depth, size, and shape.
• The AI creates an initial 3D representation, similar to a sculptor shaping a subject.
• The render is optimized to enhance details, like a sculptor refining texture.

The outcome is a 3D object or scene suitable for virtual reality, digital twins, or robotics.

Lior Alexander

478,025 views • 2 years ago

What if anatomy explorers felt alive? This 3D dog, including the skeleton, organs and rig, was generated with ai and reacts to the cursor in real time with head tracking and tail movement. - Used GPT Images 2 for consistency - Omma AI for 3D generation and code using Three.js Lmk if you’d like to try it!

Gábor Pribék

203,777 views • 1 month ago

Robotics keeps hitting the same wall. Single task RL works, but... it does not scale to hundreds of tasks or new embodiments. This new paper looks like a real step toward fixing that.

The team introduces MMBench, a benchmark with 200 tasks across many domains and robots, and Newt, a language conditioned world model trained online across all 200 tasks at once.

The simple idea behind Newt:
• The model learns from demos to get the right priors
• It trains across many tasks through online interaction
• It uses language to ground the goal
• It adapts fast when a new task shows up

What stood out to me:
✅ One model trained on 200 tasks at the same time
✅ Language conditioned control for both states and RGB
✅ Better data efficiency than strong baselines
✅ Strong open loop control
✅ Fast adaptation to new tasks and embodiments
✅ Full release of 200 checkpoints, 4000 demos, code, and benchmark

This is a good push toward general control instead of one model per task.

If you want the full paper: Project page:
Weekly robotics and AI insights. Subscribe free:

Ilir Aliu

70,090 views • 7 months ago

From seeing → to understanding → to following → now to finding.

In the last update, our robot learned how to recognize and follow people. SR Agentic now evolves from human-following → to goal-driven object search.

Give it a task like: → “find the white bottle on the table”

It will:
• break down the instruction via LLM task planning
• scan the environment in real-time
• localize the object in 3D space
• navigate and approach autonomously

Step by step, capability by capability, from perception → to tracking → to task execution in the physical world. This is how real-world agentic robotics is built.

Strike Robot

21,916 views • 2 months ago

NVIDIA AI Released DiffusionRenderer: An AI Model for Editable, Photorealistic 3D Scenes from a Single Video

In a groundbreaking new paper, researchers at NVIDIA, University of Toronto, Vector Institute and the University of Illinois Urbana-Champaign have unveiled a framework that directly tackles this challenge. DiffusionRenderer represents a revolutionary leap forward, moving beyond mere generation to offer a unified solution for understanding and manipulating 3D scenes from a single video. It effectively bridges the gap between generation and editing, unlocking the true creative potential of AI-driven content.

DiffusionRenderer treats the “what” (the scene’s properties) and the “how” (the rendering) in one unified framework built on the same powerful video diffusion architecture that underpins models like Stable Video Diffusion...

Read full article here: Paper: GitHub Page:
NVIDIA, NVIDIA AI, NVIDIAnewsroom, NVIDIA AIDev

Marktechpost AI Dev News ⚡

104,741 views • 11 months ago

GSTAR: Gaussian Surface Tracking and Reconstruction

Contributions:
• A new framework for tracking and reconstructing dynamic scenes, combining 3D Gaussians and meshes to effectively manage changes in topology.
• A method for Gaussian unbinding and surface re-meshing, allowing for the generation of new surfaces as topologies evolve.
• A method for handling large or fast deformations of surfaces between frames using scene flow warping.

Abstract (excerpt): However, tracking dynamic surfaces with 3D Gaussians remains challenging due to complex topology changes, such as surfaces appearing, disappearing, or splitting. To address these challenges, we propose GSTAR, a novel method that achieves photo-realistic rendering, accurate surface reconstruction, and reliable 3D tracking for general dynamic scenes with changing topology. Given multi-view captures as input, GSTAR binds Gaussians to mesh faces to represent dynamic objects. For surfaces with consistent topology, GSTAR maintains the mesh topology and tracks the meshes using Gaussians.

MrNeRF

22,698 views • 1 year ago

Most imitation learning policies break when the camera moves or the robot changes. NOT THIS ONE 👇 [📍 Bookmark for later]

A new 3D scene representation encoder tackles this by enabling zero-shot generalization to unseen embodiments and viewpoints… And it works with any IL algorithm.

The trick?
• Use a 2D foundation model to extract semantic features
• Lift them into 3D space for localization (not semantics)
• Condition the IL policy on this spatially grounded vector

Across 93 simulated and 6 real tasks, Adapt3R:
✅ Maintains IL performance on LIBERO & MimicGen benchmarks
✅ Outperforms DP3 and 3D Diffuser Actor in most settings
✅ Holds >80% success on LIBERO even with large camera rotations

Thanks for sharing this, Animesh Garg & Albert Wilcox!

📍 Paper: Website: Code:

Ilir Aliu

12,178 views • 10 months ago

Most robots still need markers, checkerboards, or long calibration rituals just to know where their arms are. Now it works from raw images in seconds.

roboreg is a markerless multi arm localization toolkit that plugs into ROS 2 and RViz. No special hardware. No custom setup. You toggle between robot descriptions and the system figures out the rest.

The idea is simple:
✅ Hand eye calibration from plain RGB or RGB D images
✅ Only three robot poses needed for millimeter accuracy
✅ Works with any ROS 2 compatible robot and camera
✅ Fully open source under Apache 2.0

It is powered by Hydra, a new marker free ICP variant that converges far more reliably than classical baselines and runs in under a second.

If you want to try it: roboreg: ROS 2 roboreg: Hydra paper:
pip install roboreg

More details and discussion on Open Robotics Discourse:

Ilir Aliu

18,406 views • 7 months ago

✨ Made a new mini feature on Photo AI: [ Grab from 3d model ]

So the problem is we're at that stage in time (typical for AI) where image-to-3d models are not good enough but are fun to play with, but we know they'll be good enough in 1-2 years

With [ Make 3d model ] you already can turn any Photo AI pic into a 3d model but it still looks hyper clunky and deformed, but it works!

One cool idea I had to make that more useful and made now: Let people make a 3d model then change the view of it with the 3d viewer, then press [ o ] and it grabs a frame of the 3d

That image you can then [ Remix ] (img2img), and it becomes a real photo again and that in turn you can then turn into a video again with [ Make video ]

So that essentially gives you a fully freeform camera position control to take photos with

One thing I need to fix is the background/skybox, I kinda need to take the original photo and remove the person and just get the background for the 3d model viewer, in this case it should be white, but it's a start!

@levelsio

119,210 views • 1 year ago

Palletizing in the real world! 📦🤖 How do you stack 65 unique SKUs on a pallet when they arrive in random order? Here’s how an on-the-fly algorithm solved it in a real logistics use case with only a single-digit buffer. Every placement was checked for stability, not just for itself, but for every other box it touched. The result? A rock-solid 2.05 m (6.5 ft) pallet. Robotics in logistics keeps improving. Hardware matters, but without smart software your robots won’t know what to do and you’ll waste money and time. Credit: Progressive Robotics

Ilir Aliu - eu/acc

31,718 views • 10 months ago

You can't 3D reconstruct glass from images... ...WRONG! Thanks to video diffusion, now just about anything is possible!

Introducing... Diffusion Knows Transparency (DKT)

Transparent and reflective objects usually break robot vision and photogrammetry pipelines because they don't follow the "solid object" rules standard cameras expect. DKT is a new AI model that repurposes the "internal physics engine" found in video generation models to solve this problem.

Researchers took a massive video diffusion model (WAN) and fine-tuned it using a custom-built synthetic dataset to turn it into a high-precision depth sensor. To train the AI, they built the first massive synthetic video library of transparent objects: 1.32 million frames of perfectly labeled glass and metal objects in motion.

Without ever seeing a "real" labeled video of glass during training, the model (DKT) outperformed all previous specialized systems on real-world benchmarks (ClearPose, DREDS). They created a "lightweight" 1.3B parameter version that runs fast enough (0.17s per frame) to be used on actual robot hardware.

Two reasons I find this project important:
1. It further proves that synthetic data will be essential for training the next generation vision models.
2. In real-world robotic tests, using DKT's depth maps nearly doubled the success rate of robot arms trying to pick up objects on tricky reflective or translucent surfaces. At-home robots will need to interact with these types of objects on a daily basis.

Check out the project page here: Code is LIVE!

#Computervision #Robotics #AI

Jonathan Stephens

17,712 views • 6 months ago

Pixel3DMM is a new method that can reconstruct 3D human faces from a single RGB image or multiple frames

Dreaming Tulpa 🥓👑

32,565 views • 1 year ago

AI in robotics gets all the attention right now, but sometimes the most interesting work is very practical. Viet built a small vision system that counts potatoes on a conveyor belt. No giant dataset. No huge model. Just a clear problem and a smart setup. He used Ultralytics’ ObjectCounter, trained a tiny YOLO11 nano model, and because there was no potato dataset, he annotated a single frame with SAM 2 and trained from that. One frame. Still works across the whole video. It is a good reminder that useful AI in industry often looks like this. Focused. Lightweight. Solves a real task. If you work in manufacturing or robotics, these small systems are usually the fastest wins. They save time, reduce errors, and do not need massive infrastructure. Nice work, Viet. His projects: —- Weekly robotics and AI insights. Subscribe free:

Ilir Aliu

1,674,814 views • 7 months ago
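The counting step in the potato post can be pictured as a line-crossing check over tracked object positions. This is a plain-Python sketch of that general idea only; it does not use or reproduce Ultralytics' ObjectCounter, and the function name, track IDs, and coordinates are made up for illustration.

```python
def count_crossings(tracks, line_x):
    """Count tracked objects whose centre moves across a vertical line.

    tracks: dict mapping a track ID to its x-positions over time
    line_x: x-coordinate of the (hypothetical) counting line on the belt
    """
    count = 0
    for positions in tracks.values():
        # Compare consecutive positions; count each object at most once.
        for prev, curr in zip(positions, positions[1:]):
            if prev < line_x <= curr:
                count += 1
                break
    return count

# Made-up centroid tracks, as a detector plus tracker might produce them.
tracks = {
    1: [10, 40, 70],   # crosses x = 50
    2: [20, 45],       # never reaches the line
    3: [30, 55, 80],   # crosses x = 50
}
total = count_crossings(tracks, line_x=50)
```

Counting per track ID rather than per detection is what keeps one potato from being counted in every frame it appears in.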

This work makes a humanoid robot do simple parkour moves by looking with a depth camera and choosing the right move on the fly. The big deal is that it turns lots of small human moves into long, real-time robot behavior, without hand-coding every transition or retraining for each new course. A humanoid robot is usually good at steady walking, but it often fails when it has to do fast moves like jumping up, vaulting, or rolling, and then keep going to the next obstacle. The hard part is that you cannot easily collect training data for every possible obstacle shape, distance, and mistake, so robots end up learning a few moves that only work in a narrow setup. This work starts from short clips of real human parkour moves, like stepping over, vaulting, climbing, and rolling. It uses motion matching, which is basically a smart “pick the next clip that fits best right now” search, to stitch those short clips into a long, smooth plan that looks like a human doing a whole course. Then it trains a controller with reinforcement learning (RL), which means the robot learns by trial and error to copy that plan while staying balanced and not falling. After training separate expert controllers for different moves, it compresses them into 1 controller that uses only onboard depth sensing and a simple “go this fast in this direction” command. In real tests on a Unitree G1 humanoid, it can clear multiple obstacles in a row, adapt when obstacles get moved, and climb a wall up to 1.25m.

Rohan Paul

37,121 views • 4 months ago
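The "pick the next clip that fits best right now" description of motion matching in the parkour post is, at its core, a nearest-neighbour search over clip features. A minimal Python sketch under loud assumptions: the two-number feature (obstacle height, approach speed), the clip names, and the values are invented for illustration, and real motion matching typically compares richer pose and trajectory features.

```python
import math

# Hypothetical clip library: clip name -> (obstacle height in m, approach speed in m/s).
CLIPS = {
    "step_over": (0.3, 1.0),
    "vault": (0.8, 2.5),
    "climb": (1.2, 1.5),
}

def distance(a, b):
    # Euclidean distance between two feature tuples.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pick_next_clip(state, clips=CLIPS):
    # Choose the clip whose features are closest to the current situation.
    return min(clips, key=lambda name: distance(state, clips[name]))

def plan_course(states):
    # Stitch a sequence of clips, one per upcoming obstacle.
    return [pick_next_clip(s) for s in states]

plan = plan_course([(0.25, 1.1), (0.9, 2.4), (1.25, 1.4)])
```

In the post's pipeline this stitched plan is only the reference; the RL-trained controller is what makes the robot follow it while staying balanced.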

🤖 Google DeepMind's robotics partnerships are some of the most exciting projects I've seen in my career! Everything from fine-tuning Gemma to run locally for robotics control, to using the Gemini APIs for real-time interactions, bounding box detection, and robotics simulations.

👩‍💻 Paige Bailey

23,581 views • 3 months ago

But the KV cache is created for each transformer layer. By sending each layer’s KV cache after it’s computed, we overlap communication with computation. We stream the KV cache and hide the network delay. We achieve a 4x speedup in prefill & 3x in decode, with 0 network delay.

EXO Labs

22,604 views • 8 months ago
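The overlap described in the EXO Labs post can be checked with a small timing model. This is a sketch under stated assumptions, not their implementation: per-layer compute and transfer times are made-up numbers, the link sends one layer's KV cache at a time, and the function names are hypothetical.

```python
def sequential_time(compute_ms, transfer_ms):
    # Compute every layer first, then send the whole KV cache.
    return sum(compute_ms) + sum(transfer_ms)

def streamed_time(compute_ms, transfer_ms):
    # Send each layer's KV cache as soon as it is computed,
    # while the next layer is already being computed.
    compute_done = 0.0
    link_free = 0.0
    for c, t in zip(compute_ms, transfer_ms):
        compute_done += c
        send_start = max(compute_done, link_free)  # wait for data and for the link
        link_free = send_start + t
    return link_free

# Hypothetical numbers: 4 layers, 10 ms compute and 8 ms transfer each.
compute = [10.0, 10.0, 10.0, 10.0]
transfer = [8.0, 8.0, 8.0, 8.0]

baseline = sequential_time(compute, transfer)   # 72.0 ms
streamed = streamed_time(compute, transfer)     # 48.0 ms
```

In this toy model the only transfer that stays visible is the last layer's. With many layers and per-layer transfer no slower than per-layer compute, that exposed share becomes small, which is the sense in which streaming hides the network delay.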

Depth Any Video with Scalable Synthetic Data

AI physicists and chemists continue to make strides in depth estimation from video. Check out this new paper featuring some impressive examples. See the thread for more details (unfortunately no code yet).

Abstract: Video depth estimation has long been hindered by the scarcity of consistent and scalable ground truth data, leading to inconsistent and unreliable results. In this paper, we introduce Depth Any Video, a model that tackles the challenge through two key innovations. First, we develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse game environments, yielding 40,000 video clips of 5-second duration, each with precise depth annotations. Second, we leverage the powerful priors of generative video diffusion models to handle real-world videos effectively, integrating advanced techniques such as rotary position encoding and flow matching to further enhance flexibility and efficiency. Unlike previous models, which are limited to fixed-length video sequences, our approach introduces a novel mixed-duration training strategy that handles videos of varying lengths and performs robustly across different frame rates, even on single frames. At inference, we propose a depth interpolation method that enables our model to infer high-resolution video depth across sequences of up to 150 frames. Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency.

MrNeRF

27,428 views • 1 year ago