Robots can now reconstruct 3D scenes in real time from a single RGB camera. [📍 Projects page + paper] No depth sensor. No retraining. 30 FPS. Researchers at Imperial College London introduced KV-Tracker, a training-free method that makes heavy models like π³ and Depth Anything 3 fast enough...

53,866 views • 2 months ago • via X (Twitter)

Related videos

You can't 3D reconstruct glass from images... WRONG! Thanks to video diffusion, now just about anything is possible! Introducing... Diffusion Knows Transparency (DKT). Transparent and reflective objects usually break robot vision and photogrammetry pipelines because they don't follow the "solid object" rules standard cameras expect. DKT is a new AI model that repurposes the "internal physics engine" found in video generation models to solve this problem. Researchers took a massive video diffusion model (WAN) and fine-tuned it on a custom-built synthetic dataset to turn it into a high-precision depth sensor. To train the AI, they built the first massive synthetic video library of transparent objects: 1.32 million frames of perfectly labeled glass and metal objects in motion. Without ever seeing a "real" labeled video of glass during training, the model (DKT) outperformed all previous specialized systems on real-world benchmarks (ClearPose, DREDS). They also created a "lightweight" 1.3B-parameter version that runs fast enough (0.17 s per frame) to be used on actual robot hardware. Two reasons I find this project important: 1. It further proves that synthetic data will be essential for training next-generation vision models. 2. In real-world robotic tests, using DKT's depth maps nearly doubled the success rate of robot arms trying to pick up objects on tricky reflective or translucent surfaces. At-home robots will need to interact with these types of objects on a daily basis. Check out the project page here: Code is LIVE! #Computervision #Robotics #AI

Jonathan Stephens

17,712 views • 6 months ago

This work makes a humanoid robot do simple parkour moves by looking with a depth camera and choosing the right move on the fly. The big deal is that it turns lots of small human moves into long, real-time robot behavior, without hand-coding every transition or retraining for each new course. A humanoid robot is usually good at steady walking, but it often fails when it has to do fast moves like jumping up, vaulting, or rolling, and then keep going to the next obstacle. The hard part is that you cannot easily collect training data for every possible obstacle shape, distance, and mistake, so robots end up learning a few moves that only work in a narrow setup. This work starts from short clips of real human parkour moves, like stepping over, vaulting, climbing, and rolling. It uses motion matching, which is basically a smart “pick the next clip that fits best right now” search, to stitch those short clips into a long, smooth plan that looks like a human doing a whole course. Then it trains a controller with reinforcement learning (RL), which means the robot learns by trial and error to copy that plan while staying balanced and not falling. After training separate expert controllers for different moves, it compresses them into one controller that uses only onboard depth sensing and a simple “go this fast in this direction” command. In real tests on a Unitree G1 humanoid, it can clear multiple obstacles in a row, adapt when obstacles get moved, and climb a wall up to 1.25 m high.

Rohan Paul

37,121 views • 4 months ago
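
The “pick the next clip that fits best right now” idea from the parkour post above can be sketched as a nearest-neighbour search over clip features. This is a minimal illustration, not the authors' implementation: the clip names, the two-number feature vector (obstacle height, forward speed), and the functions `motion_match` and `next_clip` are all hypothetical, and real motion-matching systems typically compare richer pose and trajectory features.

```python
import math

# Hypothetical toy database: each clip is summarized by
# (obstacle height in metres, forward speed in m/s).
CLIP_NAMES = ["step_over", "vault", "climb", "roll"]
CLIP_FEATURES = [(0.2, 1.0), (0.6, 2.0), (1.2, 0.5), (0.0, 2.5)]


def motion_match(current_features, clip_features):
    """Return the index of the clip whose features are closest to the query."""
    best_index, best_cost = -1, math.inf
    for index, features in enumerate(clip_features):
        # Euclidean distance serves as the matching cost (an assumption).
        cost = math.dist(current_features, features)
        if cost < best_cost:
            best_index, best_cost = index, cost
    return best_index


def next_clip(obstacle_height, speed):
    """Pick the name of the best-fitting clip for the current situation."""
    return CLIP_NAMES[motion_match((obstacle_height, speed), CLIP_FEATURES)]
```

Calling `next_clip` repeatedly as the observed obstacle and speed change would yield a sequence of clips, which is the stitched reference plan the post says the RL controller then learns to track.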

Depth Any Video with Scalable Synthetic Data. AI physicists and chemists continue to make strides in depth estimation from video. Check out this new paper featuring some impressive examples. See the thread for more details (unfortunately no code yet). Abstract: Video depth estimation has long been hindered by the scarcity of consistent and scalable ground truth data, leading to inconsistent and unreliable results. In this paper, we introduce Depth Any Video, a model that tackles the challenge through two key innovations. First, we develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse game environments, yielding 40,000 video clips of 5-second duration, each with precise depth annotations. Second, we leverage the powerful priors of generative video diffusion models to handle real-world videos effectively, integrating advanced techniques such as rotary position encoding and flow matching to further enhance flexibility and efficiency. Unlike previous models, which are limited to fixed-length video sequences, our approach introduces a novel mixed-duration training strategy that handles videos of varying lengths and performs robustly across different frame rates, even on single frames. At inference, we propose a depth interpolation method that enables our model to infer high-resolution video depth across sequences of up to 150 frames. Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency.

MrNeRF

27,428 views • 1 year ago