🤖 Introducing InternVLA-A1 — now fully open-sourced! Many VLA models follow instructions well in static scenes… but struggle in dynamic environments (conveyor belts, rotating platforms, multi-robot setups). Why? They see the present—but can’t imagine the future. InternVLA-A1 solution: unify perception, imagination, and action in one model: ✅ Scene understanding:...

38,016 views • 5 months ago • via X (Twitter)

Similar Videos

Excited to announce GR00T N1, the world’s first open foundation model for humanoid robots! We are on a mission to democratize Physical AI.

The power of general robot brain, in the palm of your hand - with only 2B parameters, N1 learns from the most diverse physical action dataset ever compiled and punches above its weight:
- Real humanoid teleoperation data.
- Large-scale simulation data: we are open-sourcing 300K+ trajectories!
- Neural trajectories: we apply SOTA video generation models to “hallucinate” new synthetic data that features accurate physics in pixels. Using Jensen’s words, “systematically infinite data”!
- Latent actions: we develop novel algorithms to extract action tokens from in-the-wild human videos and neural generated videos.

GR00T N1 is a single end-to-end neural net, from photons to actions:
- Vision-Language Model (System 2) that interprets the physical world through vision and language instructions, enabling robots to reason about their environment and instructions, and plan the right actions.
- Diffusion Transformer (System 1) that “renders” smooth and precise motor actions at 120 Hz, executing the latent plan made by System 2.

We deploy N1 on GR1 robot, 1X Neo robot, and a large collection of simulation benchmarks. N1 achieves up to +30% boost in diverse manipulation tasks for household and industrial settings.

While humanoid robots are the main focus of N1, our model also supports cross-embodiment. We finetune it to work on the $110 HuggingFace LeRobot SO100 robot arm! Open robot brain runs on open hardware. Sounds just right.

Let’s solve robotics, together, one token at a time. Links to our Whitepaper, Github repo, HuggingFace model, and open dataset page in the thread: 🧵

Jim Fan

465,559 views • 1 year ago

Today, we're joined by Sergey Levine, associate professor at UC Berkeley EECS and co-founder of Physical Intelligence to discuss π0 (pi-zero), a general-purpose robotic foundation model. We dig into the model architecture, which pairs a vision language model (VLM) with a diffusion-based action expert, and the model training "recipe," emphasizing the roles of pre-training and post-training with a diverse mixture of real-world data to ensure robust and intelligent robot learning. We review the data collection approach, which uses human operators and teleoperation rigs, the potential of synthetic data and reinforcement learning in enhancing robotic capabilities, and much more. We also introduce the team’s new FAST tokenizer, which opens the door to a fully Transformer-based model and significant improvements in learning and generalization. Finally, we cover the open-sourcing of π0 and future directions for their research.

🎧 / 🎥 Listen or watch the full episode on our page:

📖 CHAPTERS
00:00 - Introduction
2:14 - Physical Intelligence
3:47 - Key challenges in robotic learning
6:13 - Reinforcement learning in π0 and robotic foundation models
8:36 - π0 VLM model architecture
15:33 - π0 model recipe
18:39 - Pre-training dataset
22:47 - Post-training
24:23 - Laundry folding demo
31:32 - Scaling laws on π0 model
34:57 - FAST
40:26 - Open sourcing π0
43:37 - Other robot types
46:27 - Future directions

The TWIML AI Podcast

19,942 views • 1 year ago

We've officially released and open-sourced HunyuanImage 2.1, our latest text-to-image model. The new model delivers on our commitment to balancing performance and quality. With native 2K image generation, HunyuanImage 2.1 is an advanced open-source text-to-image model.🎨

✨ New in 2.1:
🔹 Advanced Semantics: Supports ultra-long and complex prompts of up to 1000 tokens, and precisely controls the generation of multiple subjects in a single image.
🔹 Precise Chinese and English Text Rendering with seamless image–text integration: The model naturally integrates text into images, making it suitable for a wide range of applications such as product covers, illustrations, and poster design to meet the needs of various fields.
🔹 Rich Styles and High Aesthetic: Capable of generating images in various styles (including photorealistic portraits, comics, and vinyl figures), it delivers outstanding visual appeal and artistic quality.
🔹 High-Quality Generation: Efficiently produces ultra-high-definition (2K) images in the same time other models take to generate a 1K image.

HunyuanImage 2.1 uses two text encoders: a multimodal large language model (MLLM) to improve the model's image and text alignment capabilities, and a multi-language character-aware encoder to improve text rendering capabilities. The model is a single- and double-stream diffusion transformer with 17B parameters.

We've also open-sourced the weights of the accelerated version with meanflow, which reduces inference steps from 100 to just 8, and PromptEnhancer, the first industrial-grade rewriting model that enhances your prompts for more nuanced and expressive image generation.

Now, creators turn complex ideas, like posters with slogans or multi-panel comics, into visuals faster than ever. We’re just getting started. Stay tuned for our native multimodal image generation model coming soon.

🌐Website: 🔗Github: 🤗Hugging Face: ✨Hugging Face Demo:

Tencent Hy

89,257 views • 9 months ago

I’m thrilled to announce that we just released GraspGen, a multi-year project we have been cooking at NVIDIA Robotics 🚀

GraspGen: A Diffusion-Based Framework for 6-DOF Grasping

Grasping is a foundational challenge in robotics 🤖, whether for industrial picking or general-purpose humanoids. VLA + real data collection is all the rage now but is expensive and scales poorly for this task. For every new gripper and/or scene, you’ll have to recollect the dataset in this paradigm for the best perf.

💡 Key Idea: Since grasping is such a well-defined task in simulation, why can’t we just scale synthetic data generation and train a generative model for grasping? By embracing modularity and standardized grasp formats, we can make this a turnkey technology that works zero-shot for multiple settings.

GraspGen is a modular framework for diffusion-based 6-DOF grasp generation that scales across embodiment types, observability conditions, clutter, task complexity.

Key Features:
✅ Multi-embodiment support: suction, parallel-jaw, and multi-fingered grippers
✅ Generalization to partial + complete 3D point clouds
✅ Generalization to single-objects + cluttered scenes
✅ Modular design uses other robotics modules and foundation models (SAM2, cuRobo, FoundationStereo, FoundationPose). This allows GraspGen to focus on only one thing: grasp generation
✅ Training recipe: grasp discriminator is trained with On-Generator data from the diffusion model, so that it learns to correct the mistakes (if any) of the diffusion generator
✅ Real-time performance (~20 Hz) before any GPU acceleration; low memory footprint

📊 Results:
• SOTA on the FetchBench [Han et al. CoRL 2024] benchmark
• Zero-shot sim-to-real transfer on unknown objects and cluttered scenes
• Dataset of 53M simulated grasps across 8K objects from Objaverse

📄 arXiv: 🌐 Website: 💻 Code:

A huge thank you to everyone involved in this journey; excited to see what the community builds on top of it!

Joint work with Clemens Eppner, Balakumar Sundaralingam, Yu-Wei, Jun Yamada, Wentao Yuan and other collaborators

#robotics #diffusionmodels #physicalAI #simtoreal

Adithya Murali

23,756 views • 10 months ago