🤖 Introducing InternVLA-A1, now fully open-sourced!

Many VLA models follow instructions well in static scenes but struggle in dynamic environments (conveyor belts, rotating platforms, multi-robot setups). Why? They see the present but can't imagine the future.

InternVLA-A1's solution: unify perception, imagination, and action in one model:
✅ Scene understanding: image + text → task parsing
✅ Task imagination: predict future frames → reason about dynamics
✅ Guided control: execute actions steered by visual foresight

Powered by InternData-A1, a large-scale, high-quality simulated dataset, InternVLA-A1 stays robust under complex backgrounds, lighting, and distractions.

🔥 See it in action:
1️⃣ High-speed conveyor: track, predict, and stably grasp or flip packages
2️⃣ Rotating platform: task-aware recognition and precise pick-up of diverse items

📊 Outperforms π0 and GR00T N1.5 on general manipulation benchmarks!

✨ Model, data, and code are all open! Models: Datasets: GitHub:

ModelScope

8,830 subscribers

38,016 views • 5 months ago • via X (Twitter)
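
To make the "see the present but can't imagine the future" point concrete, here is a minimal sketch of why foresight matters on a conveyor. It is not InternVLA-A1's method (the post describes learned future-frame prediction); it only assumes a constant belt speed and a fixed delay between observation and gripper closure. The names `predict_position` and `grasp_error_without_foresight` are hypothetical.

```python
def predict_position(x0: float, velocity: float, latency: float) -> float:
    """Where an object seen at x0 (m) will be after `latency` seconds,
    assuming it moves at a constant `velocity` (m/s) along the belt."""
    return x0 + velocity * latency


def grasp_error_without_foresight(velocity: float, latency: float) -> float:
    """Miss distance (m) if the gripper aims at the observed position
    instead of the predicted one."""
    return abs(velocity * latency)


# Hypothetical numbers: a package seen at 0.20 m on a 0.5 m/s belt,
# with 0.3 s between the camera frame and the gripper closing.
target = predict_position(0.20, 0.5, 0.3)
miss = grasp_error_without_foresight(0.5, 0.3)
```

A policy that reacts only to the current frame pays the `miss` distance on every grasp, and that distance grows with belt speed.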

Similar Videos

Introducing FoundationMotion. A large-scale, video-derived motion annotation dataset & auto-labeling pipeline + advanced models for motion understanding. Fully open-source: code, datasets, and models, free to use and build on. Understanding motion is core to physical reasoning, yet today’s leading models still struggle with simple spatial actions like “turn right” or “move up” or “flip the toast” - mainly due to the lack of large, fine-grained motion datasets. We present FoundationMotion, a fully automated pipeline that: • detects & tracks objects in videos • extracts trajectories • uses LLMs + frames to generate rich motion captions & QA pairs → creating large-scale, high-quality motion datasets at scale. After fine-tuning the open-source models Qwen and NVILA on our annotations, these models now outperform the closed-source Gemini-3-Flash and GPT-5.1 on spatial understanding tasks across autonomous driving, robotics, and everyday scenarios. 📜Paper: 🌐Webpage: 💻 Code: 🕸️Model: 📊 Dataset: 👉 Interactive Demo: Let’s move research forward together. FoundationMotion is also referred to as Wolf V2 🐺, the second chapter in the Wolf series:

Boyi Li

66,307 views • 5 months ago
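
The trajectory-to-caption step can be illustrated with a toy rule. This is only a sketch: the post says FoundationMotion uses LLMs plus frames to write captions, whereas the hypothetical `describe_motion` below is a hand-written heuristic over one tracked object's centre points, assuming image coordinates where y grows downward.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in pixels; y grows downward


def describe_motion(trajectory: List[Point], min_shift: float = 5.0) -> str:
    """Return a coarse motion label from the first and last tracked points."""
    if len(trajectory) < 2:
        return "stationary"
    dx = trajectory[-1][0] - trajectory[0][0]
    dy = trajectory[-1][1] - trajectory[0][1]
    # Ignore jitter smaller than `min_shift` pixels.
    if max(abs(dx), abs(dy)) < min_shift:
        return "stationary"
    # Label by the dominant axis of displacement.
    if abs(dx) >= abs(dy):
        return "move right" if dx > 0 else "move left"
    return "move down" if dy > 0 else "move up"
```

Labels like these are the kind of fine-grained spatial signal the post says is missing from existing datasets. A real pipeline would need far richer descriptions than four directions.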

Excited to announce GR00T N1, the world’s first open foundation model for humanoid robots! We are on a mission to democratize Physical AI. The power of general robot brain, in the palm of your hand - with only 2B parameters, N1 learns from the most diverse physical action dataset ever compiled and punches above its weight: - Real humanoid teleoperation data. - Large-scale simulation data: we are open-sourcing 300K+ trajectories! - Neural trajectories: we apply SOTA video generation models to “hallucinate” new synthetic data that features accurate physics in pixels. Using Jensen’s words, “systematically infinite data”! - Latent actions: we develop novel algorithms to extract action tokens from in-the-wild human videos and neural generated videos. GR00T N1 is a single end-to-end neural net, from photons to actions: - Vision-Language Model (System 2) that interprets the physical world through vision and language instructions, enabling robots to reason about their environment and instructions, and plan the right actions. - Diffusion Transformer (System 1) that “renders” smooth and precise motor actions at 120 Hz, executing the latent plan made by System 2. We deploy N1 on GR1 robot, 1X Neo robot, and a large collection of simulation benchmarks. N1 achieves up to +30% boost in diverse manipulation tasks for household and industrial settings. While humanoid robots are the main focus of N1, our model also supports cross-embodiment. We finetune it to work on the $110 HuggingFace LeRobot SO100 robot arm! Open robot brain runs on open hardware. Sounds just right. Let’s solve robotics, together, one token at a time. Links to our Whitepaper, Github repo, HuggingFace model, and open dataset page in the thread: 🧵

Jim Fan

465,559 views • 1 year ago
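
The System 2 / System 1 split amounts to a dual-rate control loop. The sketch below simulates only the scheduling: the 120 Hz figure comes from the post, while the 10 Hz planner rate, the function name `run_dual_rate`, and the string "plans" are assumptions made for illustration, not GR00T N1 internals.

```python
FAST_HZ = 120  # System 1 action rate quoted in the post
SLOW_HZ = 10   # assumed System 2 rate; the post does not state it


def run_dual_rate(ticks: int):
    """Simulate `ticks` fast-loop steps.

    Returns (number of plan updates, list of (tick, plan) actions)."""
    ticks_per_plan = FAST_HZ // SLOW_HZ
    plan = None
    plan_updates = 0
    actions = []
    for t in range(ticks):
        # Slow loop: refresh the latent plan every `ticks_per_plan` ticks.
        if t % ticks_per_plan == 0:
            plan = f"plan@{t}"  # stand-in for the vision-language model output
            plan_updates += 1
        # Fast loop: emit one action per tick, conditioned on the latest plan.
        actions.append((t, plan))
    return plan_updates, actions
```

With these rates, one simulated second yields 120 actions driven by 10 plan refreshes, so each plan is reused for 12 consecutive actions.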

How do you teach a robot to handle complex, multi-step tasks, without training it for each one? [Github ⬇️] The team behind ReKep shows that robots can perform bimanual, in-the-wild tasks by reasoning over keypoint constraints: Generated on the fly using vision and language models. No task-specific data, no environment modeling. Why it matters ✅ Encodes tasks as simple Python functions over 3D keypoints ✅ Uses VLMs to generate keypoint constraints from instructions ✅ Plans and replans in real time with a 10 Hz perception-action loop ✅ Works for bimanual, multi-stage tasks without task-specific training Built on open tools like SciPy and BEHAVIOR, ReKep brings reactive, general-purpose reasoning closer to real-world robot control. Project website: Paper: Code: Walkthrough video: Thank you, Wenlong Huang for sharing 🫶

Ilir Aliu - eu/acc

25,348 views • 1 year ago
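
"Simple Python functions over 3D keypoints" can look like the sketch below. It is not taken from the ReKep code base: the function name, the keypoint indices, and the pouring scenario are hypothetical. It shows the general shape of a constraint that returns a cost an optimizer could drive to zero.

```python
import math
from typing import Sequence, Tuple

Keypoint = Tuple[float, float, float]  # (x, y, z) in metres


def spout_above_cup(
    keypoints: Sequence[Keypoint],
    spout: int = 0,
    cup: int = 1,
    height: float = 0.10,
) -> float:
    """Cost is 0 when the spout keypoint sits `height` metres directly
    above the cup keypoint, and grows with the distance from that target."""
    cx, cy, cz = keypoints[cup]
    target = (cx, cy, cz + height)
    return math.dist(keypoints[spout], target)
```

Re-evaluating a cost like this on fresh keypoints at every perception step is what lets a planner react when the cup is moved mid-task.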

Scaling vision-language-action (VLA) models to high-DoF dexterous hands has long been a "holy grail" challenge due to the high-dimensional action space and data scarcity. As a wrap up of the year 2025, we are releasing GR-Dexter, a holistic hardware-model-data framework for generalist manipulation on a bimanual dexterous-hand robot. This is the first VLA system to achieve: ✅ High-DoF Control: Managing a 56-DoF bimanual system (21-DoF per hand). ✅ Long-Horizon Tasks with tool use: Vacuuming, bread serving with tongs, and table decluttering. ✅ Open-World Generalization: Robust performance with unseen objects and abstract instructions. Project page: ArXiv:

Xiao Ma

93,641 views • 5 months ago

NVIDIA Cosmos Reason 2 is here. 🥳 An open, highly accurate reasoning vision language model for physical AI, featuring: ✅ Improved spatio-temporal understanding and timestamp precision ✅ Flexible deployment with 2B and 8B model sizes ✅ Long-context reasoning with up to 256K tokens ✅ Expanded visual perception across complex environments We also have new Cosmos releases: Predict 2.5, Transfer 2.5, and the NVIDIA GR00T N1.6 robot foundation model. 📗Read our technical blog: 🤗 Download Cosmos Reason 2 on Hugging Face:

NVIDIA AI Developer

45,677 views • 5 months ago

Can robots self-improve by collecting data autonomously? 🤖 Introducing SOAR: a system for large-scale autonomous data collection 🚀 and autonomous improvement 📈 of a multi-task language-conditioned policy in diverse scenes without human intervention.

Paul Zhou

47,667 views • 1 year ago

Introducing Ψ₀, an open foundation model for universal humanoid loco-manipulation. 🏆 Outperforms GR00T N1.6 by 40%+ overall success rate 📉 Uses only ~10% of the pre-training data 📦 Fully open-source: model, data, code, and deployment pipeline 1/10

Yue Wang

18,623 views • 2 months ago

🚀 Unitree open-sources UnifoLM-WBT-Dataset — a high-quality real-world humanoid robot whole-body teleoperation (WBT) dataset for open environments. 🥳Publicly available since March 5, 2026, the dataset will continue to receive high-frequency rolling updates. It aims to establish the most comprehensive real-world humanoid robot dataset in terms of scenario coverage, task complexity, and manipulation diversity. 👉 Explore the dataset here:

Unitree

5,629,556 views • 2 months ago

💥 A 450M model just beat bigger VLAs on real robot tasks, and it’s 100% open source [📍 bookmark for later] Came across SmolVLA, a new vision-language-action model for robotics that’s compact, fast, and trained entirely on open community datasets from LeRobot via Hugging Face. What stood out to me is how it matches or outperforms much larger models like ACT using noisy, real-world community data instead of giant private datasets. Why it’s worth a look ✅ 26% performance boost from pretraining on open-source data ✅ Runs on consumer hardware, even a MacBook ✅ 30% faster responses with async inference and smart architecture tweaks ✅ Strong results across Meta-World, LIBERO, SO100, and SO101 ✅ Fully open source: weights, code, training pipeline, eval stack They also introduced smart efficiency tricks like using fewer visual tokens, pulling outputs from mid-layer, and separating perception from action to make it all run fast. SmolVLA is a strong case for what can happen when the robotics community shares data and builds in the open. Definitely worth keeping an eye on.

Ilir Aliu - eu/acc

17,353 views • 9 months ago

Today, we're joined by Sergey Levine, associate professor at UC Berkeley EECS and co-founder of Physical Intelligence to discuss π0 (pi-zero), a general-purpose robotic foundation model. We dig into the model architecture, which pairs a vision language model (VLM) with a diffusion-based action expert, and the model training "recipe," emphasizing the roles of pre-training and post-training with a diverse mixture of real-world data to ensure robust and intelligent robot learning. We review the data collection approach, which uses human operators and teleoperation rigs, the potential of synthetic data and reinforcement learning in enhancing robotic capabilities, and much more. We also introduce the team’s new FAST tokenizer, which opens the door to a fully Transformer-based model and significant improvements in learning and generalization. Finally, we cover the open-sourcing of π0 and future directions for their research. 🎧 / 🎥 Listen or watch the full episode on our page: 📖 CHAPTERS =============================== 00:00 - Introduction 2:14 - Physical Intelligence 3:47 - Key challenges in robotic learning 6:13 - Reinforcement learning in π0 and robotic foundation models 8:36 - π0 VLM model architecture 15:33 - π0 model recipe 18:39 - Pre-training dataset 22:47 - Post-training 24:23 - Laundry folding demo 31:32 - Scaling laws on π0 model 34:57 - FAST 40:26 - Open sourcing π0 43:37 - Other robot types 46:27 - Future directions

The TWIML AI Podcast

19,942 views • 1 year ago

The World Model as NEO's Cognitive Core 1X has revealed a major AI development where the NEO humanoid can translate any natural language prompt into robotic action. It demonstrates this capability even for novel tasks, objects, and environments not found in its robot dataset. - the 1X World Model is trained on internet-scale human interaction videos and fine-tuned with robot data to ground its understanding in physics and in NEO's embodiment - from a simple voice or text prompt, the world model generates a visualization of future actions - a built-in inverse dynamics model then translates these into precise motor movements for NEO

The Humanoid Hub

68,453 views • 4 months ago
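
The last step, turning a predicted future into motor commands, can be sketched with a finite-difference stand-in. This is not 1X's inverse dynamics model, which the post describes as part of a learned world model working from generated video. The hypothetical `inverse_dynamics` below assumes the future is already given as a list of joint-position vectors sampled every `dt` seconds.

```python
from typing import List


def inverse_dynamics(states: List[List[float]], dt: float) -> List[List[float]]:
    """Joint-velocity commands that carry the robot from each predicted
    state to the next one, assuming states are `dt` seconds apart."""
    if dt <= 0:
        raise ValueError("dt must be positive")
    return [
        [(after - before) / dt for before, after in zip(s0, s1)]
        for s0, s1 in zip(states, states[1:])
    ]


# Hypothetical two-joint trajectory sampled every 0.5 s.
commands = inverse_dynamics([[0.0, 0.0], [0.5, -0.25], [1.0, -0.25]], dt=0.5)
```

A sequence of N predicted states yields N - 1 commands, one per transition between consecutive states.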

Not the flashiest demos, but what’s under the hood represents a foundational shift for general-purpose robotics. World models are the next-gen foundation of Physical AI, not the VLM backbones found in typical VLAs. DreamZero is a 14B-parameter World Action Model (WAM) by NVIDIA that treats robotics as a joint video-and-action prediction task. Unlike traditional Vision-Language-Action (VLA) models that map images directly to motor commands, DreamZero leverages a pretrained video diffusion backbone to predict future world states and actions simultaneously. - achieves 2× better zero-shot generalization to unseen tasks and environments compared to state-of-the-art VLAs. - learns effectively from heterogeneous, non-repetitive data (500 hours), breaking the need for thousands of repeated demonstrations. - adapts to new robot embodiments with just 30 minutes of play data. - enables 7Hz closed-loop control via system optimizations and "DreamZero-Flash," making high-capacity diffusion models viable for real-time use.

The Humanoid Hub

34,090 views • 4 months ago

We've officially released and open-sourced HunyuanImage 2.1, our latest text-to-image model. The new model delivers on our commitment to balancing performance and quality. With native 2K image generation, HunyuanImage 2.1 is an advanced open-source text-to-image model.🎨 ✨ New in 2.1: 🔹Advanced Semantics: Supports ultra-long and complex prompts of up to 1000 tokens, and precisely controls the generation of multiple subjects in a single image. 🔹Precise Chinese and English Text Rendering with seamless image–text integration: The model naturally integrates text into images, making it suitable for a wide range of applications such as product covers, illustrations, and poster design. 🔹Rich Styles and High Aesthetic: Capable of generating images in various styles (including photorealistic portraits, comics, and vinyl figures), it delivers outstanding visual appeal and artistic quality. 🔹High-Quality Generation: Efficiently produces ultra-high-definition (2K) images in the same time other models take to generate a 1K image. HunyuanImage 2.1 uses two text encoders: a multimodal large language model (MLLM) to improve the model's image and text alignment capabilities, and a multi-language character-aware encoder to improve text rendering capabilities. The model is a single- and double-stream diffusion transformer with 17B parameters. We've also open-sourced the weights of the accelerated version with meanflow, which reduces inference steps from 100 to just 8, and PromptEnhancer, the first industrial-grade rewriting model that enhances your prompts for more nuanced and expressive image generation. Now, creators turn complex ideas, like posters with slogans or multi-panel comics, into visuals faster than ever. We’re just getting started. Stay tuned for our native multimodal image generation model coming soon. 🌐Website: 🔗Github: 🤗Hugging Face: ✨Hugging Face Demo:

Tencent Hy

89,257 views • 9 months ago

Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale blog: Large-scale generative models such as GPT and DALL-E have revolutionized natural language processing and computer vision research. These models not only generate high fidelity text or image outputs, but are also generalists which can solve tasks not explicitly taught. In contrast, speech generative models are still primitive in terms of scale and task generalization. In this paper, we present Voicebox, the most versatile text-guided generative model for speech at scale. Voicebox is a non-autoregressive flow-matching model trained to infill speech, given audio context and text, trained on over 50K hours of speech that are neither filtered nor enhanced. Similar to GPT, Voicebox can perform many different tasks through in-context learning, but is more flexible as it can also condition on future context. Voicebox can be used for mono or cross-lingual zero-shot text-to-speech synthesis, noise removal, content editing, style conversion, and diverse sample generation. In particular, Voicebox outperforms the state-of-the-art zero-shot TTS model VALL-E on both intelligibility (5.9% vs 1.9% word error rates) and audio similarity (0.580 vs 0.681) while being up to 20 times faster.

AK

429,143 views • 3 years ago
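
The intelligibility figures are word error rates, where lower is better. As a reference for how that metric is usually computed, here is a minimal sketch using word-level edit distance. The function name is an assumption, and real evaluations typically normalize casing and punctuation before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For example, dropping one word from a six-word reference gives a rate of 1/6, roughly 16.7%.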

Physics AI 🌊 Jensen just unveiled the NVIDIA Isaac GR00T Reference Humanoid Robot at GTC Taipei: the world’s first open-source humanoid robot reference design. Built on Jetson Thor edge computing and the Isaac GR00T open platform, it features a Unitree H2 Plus body with Sharpa five-finger dexterous hands and 75 degrees of freedom. Even before Physical AI truly explodes, NVIDIA has built a complete end-to-end full-stack infrastructure: >chips (Jetson Thor + Blackwell), >synthetic data (GR00T-Dreams), >world models (Cosmos), >foundation models (GR00T N-series VLA), >high-fidelity simulation platforms (Isaac Sim + Isaac Lab + Omniverse). Dozens of humanoid robot companies, including Agility Robotics, Agile Robots, Boston Dynamics, Figure, 1X, NEURA Robotics, XPENG Robotics, and more, are already running large-scale daily simulation training on this platform, rapidly iterating perception, reasoning, and actions in virtual environments and speeding up the jump from lab prototypes to real factory deployment. By providing an open reference design and full-stack tools, NVIDIA has dramatically lowered the barrier, letting every robotics team join the Physical AI wave with low cost and high efficiency.

CyberRobo

17,092 views • 9 days ago

Robots might learn better from video than from language! 📼 Most Vision-Language-Action (VLA) models learn what to do from text, but still struggle with how things move in the real world. That makes them data-hungry and slow to train. mimic video takes a different route. Instead of grounding robot control in text, it grounds it in video, using large pre-trained video models that already capture physical motion and dynamics. The idea is straightforward: let the video model handle “what will happen next,” and let a smaller control model focus only on turning that visual plan into robot actions. The result is big gains in practice. Robots trained this way need 10× less data, converge twice as fast, and perform better on both simulated benchmarks and real bimanual manipulation tasks. If robots can “imagine” motion using video, control becomes a much simpler problem. Shoutout to Jonas Pai, Liam Achenbach, Oier Mees, Elvis Nava and the rest of the team! Here's the project page:

Lukas Ziegler

49,864 views • 5 months ago

Dreamer 4 takes world models to a new level - training multi-task agents fully in imagination while never touching the real environment. Powered by a novel shortcut forcing method, it delivers lightning-fast, accurate predictions and crushes benchmarks, beating OpenAI’s VPT with 100x less data. This breakthrough makes scalable, versatile AI agents far more realistic and unlocks new frontiers in learning complex tasks.

Chubby♨️

56,924 views • 8 months ago

I’m thrilled to announce that we just released GraspGen, a multi-year project we have been cooking at NVIDIA Robotics 🚀 GraspGen: A Diffusion-Based Framework for 6-DOF Grasping Grasping is a foundational challenge in robotics 🤖 — whether for industrial picking or general-purpose humanoids. VLA + real data collection is all the rage now but is expensive and scales poorly for this task. For every new gripper and/or scene, you’ll have to recollect the dataset in this paradigm for the best perf. 💡Key Idea: Since grasping is such a well-defined task in simulation - why can’t we just scale synthetic data generation and train a generative model for grasping? By embracing modularity and standardized grasp formats, we can make this a turnkey technology that works zero-shot for multiple settings. GraspGen is a modular framework for diffusion-based 6-DOF grasp generation that scales across embodiment types, observability conditions, clutter, task complexity. Key Features: ✅ Multi-embodiment support: suction, parallel-jaw, and multi-fingered grippers ✅ Generalization to partial + complete 3D point clouds ✅ Generalization to single-objects + cluttered scenes ✅ Modular design uses other robotics modules and foundation models (SAM2, cuRobo, FoundationStereo, FoundationPose). This allows GraspGen to focus on only one thing - grasp generation ✅ Training recipe: grasp discriminator is trained with On-Generator data from the diffusion model - so that it learns to correct the mistakes (if any) of the diffusion generator ✅ Real-time performance (~20 Hz) before any GPU acceleration; low memory footprint 📊 Results: • SOTA on the FetchBench [Han et al. CoRL 2024] benchmark • Zero-shot sim-to-real transfer on unknown objects and cluttered scenes • Dataset of 53M simulated grasps across 8K objects from Objaverse 📄 arXiv: 🌐 Website: 💻 Code: A huge thank you to everyone involved in this journey — excited to see what the community builds on top of it! 
Joint work with Clemens Eppner, Balakumar Sundaralingam, Yu-Wei, Jun Yamada, Wentao Yuan, and other collaborators #robotics #diffusionmodels #physicalAI #simtoreal

Adithya Murali

23,756 views • 10 months ago

Current 3D generative models are slow and low quality. We present GRM, a large-scale model that reconstructs 3D Gaussians in 0.1s and generates high-quality 3D assets from text or single images in a few seconds. Demo: 1/4

Gordon Wetzstein

19,189 views • 2 years ago

CosmicMan: A Text-to-Image Foundation Model for Humans. We present CosmicMan, a text-to-image foundation model specialized for generating high-fidelity human images. Unlike current general-purpose foundation models that are stuck in the dilemma of inferior quality and

AK

46,778 views • 2 years ago