How do you teach a robot to handle complex, multi-step tasks without training it for each one? [Github ⬇️]

The team behind ReKep shows that robots can perform bimanual, in-the-wild tasks by reasoning over keypoint constraints that are generated on the fly using vision and language models. No task-specific data, no... environment modeling.

Why it matters
✅ Encodes tasks as simple Python functions over 3D keypoints
✅ Uses VLMs to generate keypoint constraints from instructions
✅ Plans and replans in real time with a 10 Hz perception-action loop
✅ Works for bimanual, multi-stage tasks without task-specific training

Built on open tools like SciPy and BEHAVIOR, ReKep brings reactive, general-purpose reasoning closer to real-world robot control.

Project website: Paper: Code: Walkthrough video:

Thank you, Wenlong Huang, for sharing 🫶
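The post describes tasks encoded as Python functions over 3D keypoints and mentions SciPy. Here is a minimal sketch of that idea, not ReKep's actual code. The keypoint coordinates, the `grasp_constraint` function, and the `solve_gripper_position` helper are hypothetical names and values invented for illustration, and using `scipy.optimize.minimize` with L-BFGS-B is an assumption about how such a constraint could be solved.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical scene: 3D keypoints in metres.
# Index 0 is the gripper tip (the point we control),
# index 1 is a cup handle, index 2 is the cup rim.
keypoints = np.array([
    [0.30, 0.00, 0.20],
    [0.50, 0.10, 0.05],
    [0.50, 0.05, 0.12],
])


def grasp_constraint(kp: np.ndarray) -> float:
    """Cost over keypoints: zero when the gripper tip sits on the handle."""
    return float(np.linalg.norm(kp[0] - kp[1]))


def solve_gripper_position(kp: np.ndarray, constraint) -> np.ndarray:
    """Find a gripper-tip position that minimises the squared constraint cost."""

    def cost(x: np.ndarray) -> float:
        candidate = kp.copy()
        candidate[0] = x  # only the gripper keypoint is free to move
        return constraint(candidate) ** 2

    result = minimize(cost, x0=kp[0], method="L-BFGS-B")
    return result.x


target = solve_gripper_position(keypoints, grasp_constraint)
print("gripper target:", np.round(target, 3))
```

In this sketch the constraint is ordinary Python, so a language model could in principle write a different function for each instruction while the solver stays unchanged.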

Ilir Aliu - eu/acc

6,252 subscribers

25,348 views • 1 year ago • via X (Twitter)

Science & Technology, News & Politics, Education

2 Comments

VentureMind AI • 1 year ago

INSANE 🔥

MightyBot • 1 year ago

🧠 Unified Search. Smarter Meetings. Effortless CRM. MightyBot is your AI agent platform for seamless workflows—record meetings, automate CRM updates, and find answers across apps in seconds. 🌟 Focus on what matters. We'll handle the grind.

Related Videos

What structural task representation enables multi-stage, in-the-wild, bimanual, reactive manipulation? Introducing ReKep: LVM to label keypoints & VLM to write keypoint-based constraints, solve w/ optimization for diverse tasks, w/o task-specific training or env models. 🧵👇

Wenlong Huang @ CVPR

190,836 views • 1 year ago

What if robots could learn real-world tasks from your perspective… without ever touching a robot? This is a system that trains robot policies using nothing but human-first, egocentric video data from smart glasses. No robots, no teleop, no sensors, just humans doing real tasks in the real world. Why it matters ✅ Learns robot policies from 20 minutes of human video; zero robot demos ✅ Generalizes to new objects, views, and even robot morphologies ✅ Uses 3D points for interpretable, spatially grounded learning ✅ Deploys directly to real-world robots with strong zero-shot success Thank you, Vincent Liu, for sharing!!! Learn more here: 🔗 Paper: 🌐 Website: 📍 BOOKMARK FOR LATER

Ilir Aliu - eu/acc

10,509 views • 1 year ago

The robot is learning several novel tasks instantly, after just ONE demonstration each... Instant Policy makes it possible: no extra training, no weight updates, just pure in-context learning. It just got accepted at ICLR 2025, and it’s changing how robots learn. With just a single demo, a robot can pick up a new task and start performing it right away. Why this is a big deal: ✅ Learns tasks instantly with just one or a few demonstrations ✅ Improves over time as more demonstrations are given ✅ Uses simulation-based training with “pseudo-demonstrations” for scalability ✅ Can transfer skills across different robots and even follow language-defined tasks It brings in-context learning to robotics, opening up new possibilities for flexible, real-world automation. You can try it yourself: code and weights are available at • • • • Thank you to Edward Johns, Director of the Robot Learning Lab at Imperial College for sharing their work! 🙏

Ilir Aliu

46,285 views • 1 year ago

A new robot policy just cleaned up a kitchen it had never seen before [watch what happens. paper included ⬇️] Pi-0.5 builds on top of Pi-0 and shows how smart co-training with diverse data can unlock real generalization in the home. It doesn’t just learn from one setup but adapts to many, including homes it’s never seen. What it does ✅ Handles new homes without training in them ✅ Follows complex language instructions ✅ Cleans, places dishes, handles spills ✅ Matches in-home training models using cross-embodiment and web data Robots that understand tasks and adapt to new spaces are finally within reach. More in the blog: Read the paper ⬇️ Physical Intelligence, co-founded by UC Berkeley professor Sergey Levine, is a robotics startup developing general-purpose AI foundation models that enable robots to perform a wide variety of real-world tasks with human-like adaptability, recently raising $400 million to advance this vision.

Ilir Aliu - eu/acc

18,120 views • 1 year ago

Vision-language models perform diverse tasks via in-context learning. Time for robots to do the same! Introducing In-Context Robot Transformer (ICRT): a robot policy that learns new tasks by prompting with robot trajectories, without any fine-tuning. [1/N]

Max Fu

40,392 views • 1 year ago

Code Arena can handle image inputs for agentic web dev tasks, reasoning through multi-step problems and using tools along the way. Watch how it works with Aryan Vichare in this clip. Find a link to the full walkthrough on how Code Arena creates sites and apps from images in thread.

Arena.ai

19,933 views • 2 months ago

How precise can a robot get… without ever seeing the object move? Turns out, just a geometric model is enough. A robot hit 73 precise 3D goals in a row, using only a geometric model. No demonstrations. No imitation learning. No fancy vision models.

The secret? Sample where the robot could make contact. Let local contact-implicit control take over from there. That’s it.
✅ Real-time performance
✅ Natural contact making and breaking
✅ Works on tricky tasks like pushing and rotating

This approach mixes global planning with local MPC. Fast. Accurate. Sample-efficient. Thanks for sharing, Michael Posa! 📍Watch it in action: Paper:

Ilir Aliu

19,012 views • 11 months ago

Legged Locomotion… meets Skateboarding [Paper ⬇️] Most robot movement models either rely on fixed patterns or struggle to handle complex changes. DHAL (Discrete-time Hybrid Automata Learning) takes a different approach: using reinforcement learning to teach robots when and how to switch movements in real-time: ✅ Learns when to switch between different motions without pre-labeled data ✅ Handles complex, high-dimensional movements like a quadrupedal robot on a skateboard ✅ Uses a multi-critic architecture to improve contact-based motion control ✅ Works in both simulation and real-world environments with strong results It proves that robots can learn movement transitions on their own, without predefined rules. Paper: Thanks to Hang Liu for bringing this to my attention!

Ilir Aliu - eu/acc

41,894 views • 1 year ago

Physical Intelligence's π₀, a general-purpose robot foundation model that combines Internet-scale vision-language pretraining with robot interaction data to execute tasks. They aim "to develop foundation models that can control any robot to perform any task" Autonomous demos:

The Humanoid Hub

196,055 views • 1 year ago

Scaling vision-language-action (VLA) models to high-DoF dexterous hands has long been a "holy grail" challenge due to the high-dimensional action space and data scarcity. As a wrap up of the year 2025, we are releasing GR-Dexter, a holistic hardware-model-data framework for generalist manipulation on a bimanual dexterous-hand robot. This is the first VLA system to achieve: ✅ High-DoF Control: Managing a 56-DoF bimanual system (21-DoF per hand). ✅ Long-Horizon Tasks with tool use: Vacuuming, bread serving with tongs, and table decluttering. ✅ Open-World Generalization: Robust performance with unseen objects and abstract instructions. Project page: ArXiv:

Xiao Ma

93,641 views • 5 months ago

Maaz (agentic arc) built OSAP to stop the constant tab-switching between Slack, GitHub, and Notion. It uses GLM-5.1 as a reasoning layer to handle multi-step tasks across different apps, with persistent memory (HydraDB) to keep track of your specific workflow habits.

Z.ai

48,311 views • 2 months ago

We are excited to share new experiments with AgiBot @AgiBot_zhiyuan on multi-task, multi-embodiment VLAs! With one model that can perform many tasks with both two-finger grippers and multi-fingered hands, we take another step toward one model for all robots and tasks.

Physical Intelligence

75,761 views • 1 year ago

🚀 Introducing KumoRFM, the world’s first Relational Foundation Model purpose-built for enterprise prediction tasks! KumoRFM reasons over complex relational data to deliver instant, accurate, in-context predictions, with no task-specific model training required. A true game-changer for solving key business problems like: ✅ Product recommendations ✅ Fraud detection ✅ Customer retention 🔗 Explore KumoRFM: 📄 Read the paper: 💡 Learn more: #AI #EnterpriseAI #RelationalAI #FoundationModels #MachineLearning #KumoRFM #PredictiveAI

Jure Leskovec

63,601 views • 1 year ago

A team tested Pi0, Pi0 Fast, Gr00t, and ACT on real robot arms in manufacturing tasks. (🔖 Bookmark this for later!)

The task was precise: place thin rectangular frames from a messy stack into a holder. The team fine-tuned each model on 100 real trajectories and compared training time, inference speed, motion quality, and success rates. ⬇️ Here’s a breakdown of what they found.

Pi0 (Original)
✅ Strongest overall performance in precise pick-and-place
✅ High success rate even in edge cases
✅ Longest training time (~11 hours, ~$30 per run)
✅ Inference time of 80 ms causes short pauses between actions
Despite delays, it handles complex scenarios well… solid for high-precision tasks, but slow to train.

Gr00t
✅ Trains fast (~2 hours, ~$5 per run)
✅ Performs almost as well as Pi0 on large-object tasks
✅ Struggles with fine precision; random movement in some trials
✅ More training didn’t fix jitter or random offsets
Best suited for tasks where exact precision isn’t critical. Not ready for manufacturing-grade accuracy without more tuning.

Pi0 Fast
✅ Promised faster training, but results were underwhelming
✅ Training at 6 hours still showed low success rates
✅ Inference was slower than expected
✅ Not reliable for generalizing even slightly new tasks
Currently too unstable for real-world deployment. Doesn’t live up to the “Fast” name yet.

ACT (Baseline)
✅ 200MB model: lightweight, but limited
✅ Struggles with stacked objects or ambiguous scenes
✅ Success rates around 70% in best-case setups
✅ Can’t match newer models on precision or generalization
Still a solid baseline, but clearly a generation behind in robustness.

🚨 Extra Notes
All newer models share a common issue:
• Inference takes longer than a frame (80 ms vs 33 ms), so robots “pause” between chunks.
• This results in jittery movements, but not a dealbreaker unless tasks are time-sensitive.

Language-conditioned tasks also fell short: after training on two labeled tasks, the model couldn’t generalize to a third unseen combination using only text prompts.

✅ The good news? These models adapt well to new robot arms with quick fine-tuning.
❌ The bad news? There’s still no plug-and-play solution for improving performance after deployment.

Reinforcement learning or DAgger-style data collection during real-world operation may be the next big step, something many teams in robotics are actively working on.
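The "80 ms vs 33 ms" note can be made concrete with a little arithmetic. This is a rough sketch under stated assumptions, not a measurement from the test: the 30 Hz control rate is inferred from the 33 ms frame time, inference is assumed to block execution, and the chunk length of 50 actions is a hypothetical value. The function names are made up for illustration.

```python
import math


def frames_spanned(inference_ms: float, control_hz: float) -> int:
    """Whole control frames that elapse while one inference call runs."""
    frame_ms = 1000.0 / control_hz
    return math.ceil(inference_ms / frame_ms)


def idle_fraction(inference_ms: float, control_hz: float, chunk_len: int) -> float:
    """Share of wall-clock time spent waiting if inference blocks execution.

    Each cycle is one inference call followed by chunk_len executed actions.
    """
    frame_ms = 1000.0 / control_hz
    execute_ms = chunk_len * frame_ms
    return inference_ms / (inference_ms + execute_ms)


# Assumed values: 80 ms inference (from the post), 30 Hz control, 50-action chunk.
print("frames spanned per inference:", frames_spanned(80.0, 30.0))
print("idle fraction:", round(idle_fraction(80.0, 30.0, 50), 4))
```

Under these assumptions each inference call covers more than two control frames, which matches the description of a visible pause between chunks.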

Ilir Aliu - eu/acc

21,703 views • 1 year ago

Robots might learn better from video than from language! 📼 Most Vision-Language-Action (VLA) models learn what to do from text, but still struggle with how things move in the real world. That makes them data-hungry and slow to train. mimic video takes a different route. Instead of grounding robot control in text, it grounds it in video, using large pre-trained video models that already capture physical motion and dynamics. The idea is straightforward: let the video model handle “what will happen next,” and let a smaller control model focus only on turning that visual plan into robot actions. The result is big gains in practice. Robots trained this way need 10× less data, converge twice as fast, and perform better on both simulated benchmarks and real bimanual manipulation tasks. If robots can “imagine” motion using video, control becomes a much simpler problem. Shoutout to Jonas Pai, Liam Achenbach, Oier Mees, Elvis Nava and the rest of the team! Here's the project page: ~~ ♻️ Join the weekly robotics newsletter, and never miss any news →

Lukas Ziegler

49,864 views • 5 months ago

Money is the coordination layer for agents. As multi-agent systems take on real tasks, they will need to quote, budget, pay, and settle as part of their reasoning loop. We are building this reasoning infrastructure in the open, for everyone.

Sentient

14,712 views • 4 months ago

What happens when you combine voice, vision, and reasoning on-device? 🤔 Gemma 4 + a vision-language agent (VLA) running on NVIDIA Jetson Orin Nano shows how compact hardware can now handle real-world AI tasks using today’s open models—no cloud required. Get started:

NVIDIA Robotics

25,655 views • 1 month ago

Excited to finally share Generative Value Learning (GVL), my Google DeepMind project on extracting universal value functions from long-context VLMs via in-context learning! We discovered a simple method to generate zero-shot and few-shot values for 300+ robot tasks and 50+ datasets using SOTA VLMs like Gemini (Try out the demo on our website on your robot video today!) I worked a lot on leveraging foundation models as guidance for robots in my PhD, and to me, this result forges a new frontier in how we can use foundation models for robot learning, given its broad applicability independent of embodiment and task types. Quite excited about how we can build on this work as a community!

Jason Ma

98,090 views • 1 year ago