We are releasing 4M-21 with a permissive license, including its source code and trained models. It's a pretty effective multimodal model that solves 10s of tasks & modalities. See the demo code, sample results, and the tokenizers of diverse modalities on the website. IMO, the multitask learning aspect of...

Amir Zamir

5,870 subscribers

69,564 views • 2 years ago • via X (Twitter)

6 comments

Amir Zamir • 2 years ago

shoutout to @roman__bachmann, @oguzhanthefatih, @dmizrahi_ who led the work, along with @aligarjani, @mingfei_gao, David Griffiths, @hujm99, @afshin_dn, @zamir_ar.

Shikun Liu • 2 years ago

Great work! And thanks for open-sourcing to the community. :)

Isaac Kargar • 2 years ago

No audio input?

Amir Zamir • 2 years ago

It’s a matter of data. Otherwise IMO the method will work as-is.

Lele • 2 years ago

massive

Puneet (Linkedin Top Voice | AI and Data speaker) • 2 years ago

@zamir_ar You guys are nailing it! #multimodal #framework

Related videos

Big thanks to AK for highlighting our work! LEO marks our pioneering step towards building an embodied generalist agent that can really comprehend the 3D world! 🚀Leveraging LLMs, we train LEO with real and synthetic 3D data across a diverse spectrum of tasks. It's thrilling to see LEO surpass current state-of-the-art SOTA methods in most benchmarked tasks, all under a single, unified model. 🔥 #Generalist_Agent

Siyuan Huang

22,710 views • 2 years ago

VITA Towards Open-Source Interactive Omni Multimodal LLM discuss: The remarkable multimodal capabilities and interactive experience of GPT-4o underscore their necessity in practical applications, yet open-source models rarely excel in both areas. In this paper, we introduce VITA, the first-ever open-source Multimodal Large Language Model (MLLM) adept at simultaneous processing and analysis of Video, Image, Text, and Audio modalities, and meanwhile has an advanced multimodal interactive experience. Starting from Mixtral 8x7B as a language foundation, we expand its Chinese vocabulary followed by bilingual instruction tuning. We further endow the language model with visual and audio capabilities through two-stage multi-task learning of multimodal alignment and instruction tuning. VITA demonstrates robust foundational capabilities of multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of both unimodal and multimodal benchmarks. Beyond foundational capabilities, we have made considerable progress in enhancing the natural multimodal human-computer interaction experience. To the best of our knowledge, we are the first to exploit non-awakening interaction and audio interrupt in MLLM. VITA is the first step for the open-source community to explore the seamless integration of multimodal understanding and interaction. While there is still lots of work to be done on VITA to get close to close-source counterparts, we hope that its role as a pioneer can serve as a cornerstone for subsequent research.

AK

23,958 views • 2 years ago

Happy to share what I’ve been working on since joining Genesis! GENE-26.5 is a one-of-a-kind, robotics-native multimodal foundation model that learns from diverse, in-the-wild data across modalities and outputs actions enabling a 54-DoF robot system to perform the most dexterous, long-horizon manipulation tasks to date—approaching human-level capability. This is the result of innovations across the full stack—data collection and processing, robot systems, model architecture, training strategies, and scalable evaluation infrastructure.

Zu Wang

19,493 views • 2 months ago

[1/2] We’ve released the code for #pix2pixturbo and #CycleGANTurbo. These conditional GANs are able to adapt a text-to-image model such as SD-Turbo for both paired and unpaired image translation with a single step (0.11 sec on A100 and 0.29 sec on A6000). Try our code and the Gradio demo. Paper: Code: Demo: This is a joint work with Gaurav Parmar (the leading author), Taesung Park, and Srinivasa Narasimhan. This work shows that a pre-trained one-step model can be easily adapted to conditional GANs frameworks for downstream image editing and synthesis tasks. #Edges2Cats

Jun-Yan Zhu

36,488 views • 2 years ago

If we train VLAs to respond to diverse multimodal prompts, then we can steer them better: [grasp the carrot]/[move to x,y,z]/[put the carrot on the plate]. With many levels of detail, powerful VLMs can step in and steer the model to success much more often! More below 👇

Sergey Levine

21,049 views • 5 months ago

The most effective AI systems don't rely on a single model. Frontier models provide state-of-the-art performance for complex tasks, while routers automatically select lightweight, open-source models for simpler jobs to optimize accuracy, latency, and cost. Learn more:

NVIDIA AI

12,152 views • 4 months ago

Small Language Models (SML) are the future of AI. "Small" (SML) instead of "Large" (LLM). These small models are highly specialized models with superhuman abilities on specific tasks. Here are two techniques to build these models: • Spectrum • Model Merging I give you a short introduction in the attached video, but here is a quick summary: Spectrum helps us identify the most relevant layers to solve one specific task. We can ignore everything else and focus on fine-tuning these layers. Using Spectrum, we can fine-tune models in a heartbeat. Model Merging combines multiple models into a unique, much better model than any of the individual input models. You can also combine models specialized in different tasks and get a model with multiple abilities. This is the state of the art of productizing models. It's what Arcee.ai's platform does behind the scenes. Arcee collaborated with me on this post and is sponsoring it. There are three main steps to produce a model for your particular use case: 1. You create a dataset by uploading your data. 2. You train a model. At this step, Arcee uses Spectrum and Model Merging to produce a highly specialized model for your task. 3. You can deploy that model to any environment you want. Three important notes: • Training process is 2x faster and 2x cheaper than regular fine-tuning. • Resultant models are smaller and have higher accuracy. • They create these specialized models from open-source models. Check this site so you can fully appreciate how this works: If you want to fine-tune an open-source model, consider Arcee's platform. This is the state of the art.

Santiago

164,162 views • 2 years ago

For the first time in human history, we are teaching a Foundation Model to master the diverse tasks of medicinal chemists, biologists, and computational scientists all in one place. In our latest collaboration with Liquid AI, we are moving away from fragmented, specialized tools toward a single, super-intelligent model. What surprised me most? This model isn't just performing at reasonable levels—it has started outperforming specialist models across physics-based tasks, imaging, and longitudinal data. Why this changes everything: -Synergy over Specialization: Fine-tuning on specific tasks has unlocked unexpected capabilities in synergetic areas, opening a new frontier in multimodal AI research. -Zero-Shot Potential: We are building a model that can perform out-of-scope tasks, moving us closer to an "AI deity" for drug discovery. -Quality First: The goal isn't just to bypass regulations to save time; it’s about using these synergies to develop better, more effective drugs. We are no longer just looking at linear regression or simple text; we are looking at the future of how humanity fights disease. #LiquidAI #InsilicoMedicine #GenerativeAI #DrugDiscovery #DeepTech #BiotechInnovation

Alex Zhavoronkov, PhD (aka Aleksandrs Zavoronkovs)

10,544 views • 4 months ago

Today we’re releasing V-JEPA, a method for teaching machines to understand and model the physical world by watching videos. This work is another important step towards Yann LeCun’s outlined vision of AI models that use a learned understanding of the world to plan, reason and accomplish complex tasks. Details ➡️ We're releasing a collection of V-JEPA vision models trained with a feature prediction objective using self-supervised learning. The models are able to understand and predict what is going on in a video, even with limited information. It learns by predicting missing or obscured parts of a video in its internal feature space. Unlike generative approaches that fill in missing pixels, this flexible approach enables up to 6x improvements in training and sample efficiency. The models were pre-trained on entirely unlabeled data, and a small amount of labeled data can be used to train a task-specific prediction head on top after pre-training. Our results show that, using a frozen backbone, our top V-JEPA models achieve 82.0% on Kinetics-400, 72.2% on Something-Something-v2 and 77.9% on ImageNet1K — competitive with or exceeding previous leading video models. We believe that this work is an important milestone on the path to advancing machine intelligence.

AI at Meta

703,853 views • 2 years ago

Working on multimodal instruction tuning and finding it hard to scale? Building Web/GUI agents but data is too narrow? Introducing 🚀MultiUI: 7.3M multimodal instructions from 1M webpage UIs, offering diverse data to boost text-rich visual understanding. Key takeaways: 🌟WebUI-trained models show major gains in visual web understanding and agent tasks. 💻 🌟Models also generalize well to non-UI tasks like DocVQA/OCR. 📄 How it works: We generate multimodal instructions with a text LLM using structured text from webpage accessibility trees. We then pair them with UI screenshots, to train multimodal models. Homepage: Paper: Dataset: Model: Congrats to the student lead Junpeng Liu and the team Tianyue Ou Yifan Song Yuxiao Qu Chenyan Xiong Wenhu Chen Graham Neubig ! More details are in the following threads ⬇️

Xiang Yue

57,699 views • 1 year ago

Can vision-language-action (VLA) models generalize to diverse OOD tasks and align with customized objectives? 🤔 🚀 We introduce GRAPE, a plug-and-play algorithm to generalize robot policies via preference alignment. GRAPE unfolds three benefits to boost the generalizability of VLAs: 👉1. GRAPE aligns VLAs on a trajectory level and endows the model with the ability for global decision-making, instead of merely cloning behavior; 👉2. GRAPE implicitly models reward from both successful and failed trials to boost generalizability to diverse tasks; 👉3. GRAPE adopts a scalable preference synthesis algorithm to rank trajectories with preferences that align with arbitrary objectives. Our experiments on a diverse array of real-world and simulated robotic tasks reveal: 1⃣GRAPE enhances the performance of state-of-the-art VLA models, increasing success rates on in-domain and unseen manipulation tasks by 51.79% and 60.36%; 2⃣GRAPE is versatile to be aligned with diverse objectives and reduce collision rates by 44.31% or rollout length by 11.15% when aligning towards safer or more efficient manipulation policy, respectively. Check out our full project for more details: 🔥 Paper: 🔥 Project: 🔥 Code:

Huaxiu Yao

19,988 views • 1 year ago

MIT PhD student Alex Zhang reveals the scaling result where a model trained on short tasks generalizes to problems 100x longer for free: "If you're very clever about the design of your harness or how you use the language model, you can almost get scaling gains for free." "If you train a model naively, there's no tricks. It's just the same way you train a model on these RL environments. You just roll it out, and then you just get some reward." "If you train it on only short tasks, like only tasks that are 10,000 tokens long, and then you were to run it on a similar domain, but at a million tokens, or 10 million tokens, or 100,000 tokens, it generalizes really, really well. If you look at it compared to even the base transformer, you get way better generalization properties." "When the model uses an RLM (Recursive Language Model) after it's trained on these short tasks, it will see some kind of trajectory of actions that it does. Between these two problems of different lengths, the RLM learns to see them as almost the same problem." "Token for token, they're almost the same. You can describe it in code. In one code setting, maybe the for loop is a little bigger, but it's the same kind of code and it derives the constants from the data. There's no hard coding, so they literally look the same." alex zhang

MTS

99,784 views • 12 days ago

Happy to share our new work on Navigation World Models! 🔥🔥 Navigation is a fundamental skill of agents with visual-motor capabilities. We train a single World Model across multiple environments and diverse agent data. w/ Gaoyue Zhou, Danny Tran, trevordarrell and Yann LeCun.

Amir Bar

83,539 views • 1 year ago

We are releasing Carbon: a crazy fast DNA model. Carbon is 275x faster than the next best model. So fast you can process the whole human genome on a single GPU in <2 days. Here are the tricks we used: When modelling DNA sequences, a lot of the performance comes down to tokenizing the sequences in a smart way. BPE tokenizers struggle because there is no whitespace, and character-level tokenizers (a character is called a base in DNA) waste a lot of compute on too many tokens. Carbon is built with a unique tokenizer: we split sequences in chunks of 6 bases, but during both training and inference we can work with single-base resolution. That's similar to having word tokens but resolving them at the character level. All possible thanks to the DNA tokens' unique structure. The architecture combined with the tokenizer makes the model 275x faster than the previous SoTA (Evo2) at this size. We built an interactive demo so you can explore how the model can generate DNA sequences, investigate the structure of genes, predict the effect of mutations, generate and fold proteins and even reconstruct parts of the tree of life.

Leandro von Werra

404,957 views • 2 months ago

JARVIS-VLA just dropped on Hugging Face Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse obtain VLA models in Minecraft that can follow human instructions on over 1k different atomic tasks, including crafting, smelting, cooking, mining, and killing. experiments demonstrate that post-training on non-trajectory tasks leads to a significant 40% improvement over the best agent baseline on a diverse set of atomic tasks. Furthermore, demonstrate that approach surpasses traditional imitation learning-based policies in Minecraft, achieving state-of-the-art performance.

AK

60,243 views • 1 year ago

Super excited to share the last paper of my PhD: "Hallucination in World Models is Predictable and Preventable"✨ We train a 350M-param generative world model on a large dataset w/ 210 tasks and show that we can predict *when* hallucination happens and use that to fix it! 🧵1/n

Nicklas Hansen

54,573 views • 1 month ago

Idefics3-Llama is out! 💥 It's a multimodal model based on Llama 3.1 that accepts arbitrary number of interleaved images with text with a huge context window (10k tokens!) 😍 Link to demo and model in the next one 😏

merve

28,014 views • 2 years ago

Many of you asked for code & weights for π₀, we are happy to announce that we are releasing π₀ and pre-trained checkpoints in our new openpi repository! We tested the model on a few public robots, and we include code for you to fine-tune it yourself.

Physical Intelligence

441,382 views • 1 year ago

here's what i vibecoded today: punchingface 🥊 an app to make Hugging Face models fight each other on coding and canvas challenges, built with qwen3.6 35b a3b in 24 hours! benchmarks numbers don't mean anything anymore, we need a way to visualize what the models are actually capable of, and canvas are one the best way to showcase it imo. why? because one single error in the code and everything breaks it shows the differences in a matter of seconds, way easier than manually reviewing the quality of the code produced for a complex project; much needed in the space with all the new finetunes dropping everyday! i did a little demo here with qwen3.6 vs qwopus glm 18b merged, the frankenstein model from Kyle Hessling the winner is clear here, qwen is crazy good and has nothing to prove. that said, qwopus 18b isn't terrible at all; the result isn’t the prettiest to the eye, but hey… it works! i've seen so many models just output a blank page (completely non-working code) so this is already a win frankenstein talks and thinks but he needs some extra brain surgery 🧠 results were expected (it's a very experimental model) but love the effort in the 18b direction from jackrong and kyle! the app was entirely vibecoded with qwen3.6, i didn't edit a single file manually. i can say with confidence that it really has the intelligence of claude sonnet 4.5 at a speed of 125tok/s on an rtx 5080 which models should i make fight next?

left curve dev

19,114 views • 3 months ago

Collaborative Score Distillation for Consistent Visual Synthesis paper page: Generative priors of large-scale text-to-image diffusion models enable a wide range of new generation and editing applications on diverse visual modalities. However, when adapting these priors to complex visual modalities, often represented as multiple images (e.g., video), achieving consistency across a set of images is challenging. In this paper, we address this challenge with a novel method, Collaborative Score Distillation (CSD). CSD is based on the Stein Variational Gradient Descent (SVGD). Specifically, we propose to consider multiple samples as "particles" in the SVGD update and combine their score functions to distill generative priors over a set of images synchronously. Thus, CSD facilitates seamless integration of information across 2D images, leading to a consistent visual synthesis across multiple samples. We show the effectiveness of CSD in a variety of tasks, encompassing the visual editing of panorama images, videos, and 3D scenes. Our results underline the competency of CSD as a versatile method for enhancing inter-sample consistency, thereby broadening the applicability of text-to-image diffusion models.

AK

33,500 views • 3 years ago