Introducing PAN: MBZUAI’s New World Model for Interactive Intelligence

Developed by MBZUAI’s Institute of Foundation Models, PAN is built for simulation, prediction, and agentic reasoning. Unlike traditional video generators that only output frames, PAN maintains a persistent internal state that evolves when guided with natural language.

Its Generative... Latent Prediction architecture combines:
• A latent encoder to capture the world state
• A dynamics module that evolves that state step-by-step
• A video diffusion decoder that visualizes outcomes

By decoding at every step using a causal sliding-window diffusion process, PAN stays grounded in real-world physics and maintains long-horizon continuity, a leap beyond single-shot models.

Evaluated on action fidelity, long-horizon stability, and simulative planning, PAN delivers state-of-the-art performance compared to open models and rivals leading commercial systems.

For robotics, autonomy, and decision support, PAN is a foundation for the next wave of intelligent, foresight-driven AI.

MBZUAI

16,478 subscribers

98,725 views • 8 months ago • via X (Twitter)
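
The encode, evolve, decode loop described in the PAN post above can be illustrated with a toy rollout. This is only a sketch under loud assumptions: `encode`, `step_dynamics`, `decode`, and `rollout` are hypothetical stand-ins written for this example, the "latent state" is a plain list of floats, and the decoder averages a window of recent latents instead of running a diffusion model. Nothing here is PAN's actual code or API.

```python
from collections import deque

# Toy stand-ins only: none of these functions are PAN's real components.

def encode(observation: list[float]) -> list[float]:
    """Latent encoder stand-in: map an observation to a latent state."""
    return [x * 0.5 for x in observation]

def step_dynamics(state: list[float], action: str) -> list[float]:
    """Dynamics stand-in: evolve the latent state given a language action."""
    shift = (len(action) % 5) * 0.1  # hypothetical "embedding" of the action text
    return [x + shift for x in state]

def decode(window: list[list[float]]) -> list[float]:
    """Decoder stand-in: render one 'frame' from a window of recent latents."""
    n = len(window)
    return [sum(column) / n for column in zip(*window)]

def rollout(observation: list[float], actions: list[str], window_size: int = 3):
    """Evolve a persistent latent state and decode a frame at every step."""
    state = encode(observation)
    # Causal sliding window: only the current and most recent past latents.
    window = deque([state], maxlen=window_size)
    frames = []
    for action in actions:
        state = step_dynamics(state, action)
        window.append(state)
        frames.append(decode(list(window)))
    return state, frames

final_state, frames = rollout([2.0, 4.0], ["go", "turn"])
print(len(frames), final_state)
```

The point of the sketch is the control flow: the state persists across steps, and each decoded frame depends only on a bounded window of past latents, which is one plausible reading of "causal sliding-window" decoding.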

Related videos

🔥 Really excited to see the release of the PAN world model, a project I have been working on over the past years. PAN is a general world model capable of simulating physical, agentic, and nested worlds, synthesizing infinite interactive experiences for training AI agents. Building on top of pretrained LLMs and video diffusion models, PAN connects language, perception, action, and latent thoughts, for long-horizon simulation and reasoning. PAN shows overwhelming performance gains over JEPA-2, Cosmos-2, and other prior models. More in the thread👇 ... 1/

Zhiting Hu

31,195 views • 8 months ago

Not the flashiest demos, but what’s under the hood represents a foundational shift for general-purpose robotics. World models are the next-gen foundation of Physical AI, not the VLM backbones found in typical VLAs.

DreamZero is a 14B-parameter World Action Model (WAM) by NVIDIA that treats robotics as a joint video-and-action prediction task. Unlike traditional Vision-Language-Action (VLA) models that map images directly to motor commands, DreamZero leverages a pretrained video diffusion backbone to predict future world states and actions simultaneously.

- achieves 2× better zero-shot generalization to unseen tasks and environments compared to state-of-the-art VLAs.
- learns effectively from heterogeneous, non-repetitive data (500 hours), breaking the need for thousands of repeated demonstrations.
- adapts to new robot embodiments with just 30 minutes of play data.
- enables 7Hz closed-loop control via system optimizations and "DreamZero-Flash," making high-capacity diffusion models viable for real-time use.

The Humanoid Hub

35,204 views • 5 months ago

Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution paper page: Text-based diffusion models have exhibited remarkable success in generation and editing, showing great promise for enhancing visual content with their generative prior. However, applying these models to video super-resolution remains challenging due to the high demands for output fidelity and temporal consistency, which is complicated by the inherent randomness in diffusion models. Our study introduces Upscale-A-Video, a text-guided latent diffusion framework for video upscaling. This framework ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into U-Net and VAE-Decoder, maintaining consistency within short sequences; globally, without training, a flow-guided recurrent latent propagation module is introduced to enhance overall video stability by propagating and fusing latent across the entire sequences. Thanks to the diffusion paradigm, our model also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation, enabling a trade-off between fidelity and quality. Extensive experiments show that Upscale-A-Video surpasses existing methods in both synthetic and real-world benchmarks, as well as in AI-generated videos, showcasing impressive visual realism and temporal consistency.

AK

32,849 views • 2 years ago

Today we're introducing Gen-4, our new series of state-of-the-art AI models for media generation and world consistency. Gen-4 is a significant step forward for fidelity, dynamic motion and controllability in generative media. Gen-4 Image-to-Video is rolling out today to all paid plans and Enterprise customers. 1/8

Runway

736,776 views • 1 year ago

Introducing LDA, a latent world action foundation model that, for the first time, unifies the utilization of heterogeneous embodied data across simulation and reality, humans and robots, and varying levels of action quality and annotation. By breaking long-standing data silos in embodied intelligence, LDA enables the field, much like GPT did for language, to benefit continuously from scaling data, marking the transition into a new era of scalable learning. #Galbot #Robotics #Innovation #AI #Technology #Humanoid #WorldModel

Galbot

38,131 views • 3 months ago

Today we’re releasing V-JEPA, a method for teaching machines to understand and model the physical world by watching videos. This work is another important step towards Yann LeCun’s outlined vision of AI models that use a learned understanding of the world to plan, reason and accomplish complex tasks. Details ➡️ We're releasing a collection of V-JEPA vision models trained with a feature prediction objective using self-supervised learning. The models are able to understand and predict what is going on in a video, even with limited information. It learns by predicting missing or obscured parts of a video in its internal feature space. Unlike generative approaches that fill in missing pixels, this flexible approach enables up to 6x improvements in training and sample efficiency. The models were pre-trained on entirely unlabeled data, and a small amount of labeled data can be used to train a task-specific prediction head on top after pre-training. Our results show that, using a frozen backbone, our top V-JEPA models achieve 82.0% on Kinetics-400, 72.2% on Something-Something-v2 and 77.9% on ImageNet1K — competitive with or exceeding previous leading video models. We believe that this work is an important milestone on the path to advancing machine intelligence.

AI at Meta

703,801 views • 2 years ago
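
The V-JEPA post above contrasts predicting missing video content in feature space with filling in missing pixels. A toy sketch of that idea follows. Everything in it is an assumption made for illustration: `encode_patch`, `predict_masked`, and `feature_loss` are hypothetical names, the "encoder" is a hand-written two-number summary rather than a learned network, and the mean absolute error is just one simple choice of feature-space loss.

```python
def encode_patch(patch: list[float]) -> list[float]:
    """Toy encoder: summarize a patch as (mean, value range)."""
    return [sum(patch) / len(patch), max(patch) - min(patch)]

def predict_masked(context_features: list[list[float]]) -> list[float]:
    """Toy predictor: guess the masked feature as the mean of visible ones."""
    n = len(context_features)
    return [sum(f[i] for f in context_features) / n for i in range(2)]

def feature_loss(patches: list[list[float]], masked_index: int) -> float:
    """Mean absolute error between predicted and target features.

    The loss compares features, never raw pixel values.
    """
    target = encode_patch(patches[masked_index])
    context = [encode_patch(p) for i, p in enumerate(patches) if i != masked_index]
    predicted = predict_masked(context)
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)

patches = [[0.0, 2.0], [1.0, 3.0], [2.0, 4.0]]
print(feature_loss(patches, masked_index=1))
print(feature_loss(patches, masked_index=0))
```

In a real system both the encoder and the predictor would be trained networks; the sketch only shows where the loss is computed.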

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models paper page: github: Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

AK

83,657 views • 3 years ago
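
The first stage described in the LLM-grounded Diffusion post above, where an LLM turns a prompt into bounding boxes with per-box descriptions, implies some layout structure that the second stage can consume. A minimal sketch of such a structure is below. The JSON format, the `Box` fields, the 512-pixel canvas, and `parse_layout` are all assumptions invented for this example; the post does not specify the actual output format.

```python
import json
from dataclasses import dataclass

@dataclass
class Box:
    description: str  # per-object caption produced by the layout stage
    x: int
    y: int
    width: int
    height: int

def parse_layout(llm_text: str, canvas: int = 512) -> list[Box]:
    """Parse a hypothetical JSON layout and reject boxes outside the canvas."""
    boxes = []
    for item in json.loads(llm_text):
        box = Box(item["description"], *item["box"])
        inside = (
            box.x >= 0
            and box.y >= 0
            and box.x + box.width <= canvas
            and box.y + box.height <= canvas
        )
        if not inside:
            raise ValueError(f"box for {box.description!r} leaves the canvas")
        boxes.append(box)
    return boxes

# Hypothetical stage-one output for the prompt "an apple to the left of a cup".
llm_text = (
    '[{"description": "a red apple", "box": [40, 200, 150, 150]},'
    ' {"description": "a white cup", "box": [300, 180, 160, 200]}]'
)
layout = parse_layout(llm_text)
print([b.description for b in layout])
```

Validating the layout before image generation is a design choice of this sketch: a text-only stage can emit boxes that do not fit the image, and catching that early is cheaper than a failed generation.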

At Avalon we are building "Real-time creating" - the ability to generate gameplay ready persistent worlds prompted from text. While others are building real-time video world models, Avalon is building real-time world generation inside a fully playable, persistent multiplayer engine. Internally running at 3840×2180 at 60 FPS. Built on Unreal Engine. Multiplayer by default. Persistent by default. Gameplay-ready by default. This is not a video latent replay. Not a simulation of interaction. It is a real 3D world with physics, logic, and authoritative multiplayer state. Avalon is trained on proprietary Avalon interaction data and powered by a hybrid system that combines language understanding, 3D model generation, procedural systems, and structured gameplay logic synthesis. Players can walk through a live world and generate environments, assets, mechanics, and entirely new gameplay modes using natural language. We accomplish this through a combination of 3D model generation, game logic generation based on our proprietary systems, and AI driven world creation. While other players are inside it. Changes persist instantly. State is synchronized in real time. Creation happens inside the world, not outside of it. Describe a biome. Spawn a civilization. Create a survival mode. Build a dungeon crawler. Launch a new game inside the world. Avalon interprets intent and integrates it directly into the live multiplayer environment. This is not a world model predicting video. This is a gameplay engine that understands language. If you can describe it, you can build it. And others can walk into it instantly.

AVALON

62,403 views • 5 months ago

Our vision is for AI that uses world models to adapt in new and dynamic environments and efficiently learn new skills. We’re sharing V-JEPA 2, a new world model with state-of-the-art performance in visual understanding and prediction. V-JEPA 2 is a 1.2 billion-parameter model, trained on video, that can enable zero-shot planning in robots—allowing them to plan and execute tasks in unfamiliar environments. Learn more about V-JEPA 2 ➡️ As we continue working toward our goal of achieving advanced machine intelligence (AMI), we’re also releasing three new benchmarks for evaluating how well existing models can reason about the physical world from video. Learn more and download the new benchmarks ➡️

AI at Meta

310,120 views • 1 year ago

Glad that our work “Inference-Time Enhancement of Generative Robot Policies via Predictive World Modeling”, led by Han Qi, has been accepted to IEEE Robotics and Automation Letters! 🎉

We propose Generative Predictive Control (GPC): sample action proposals from a pretrained diffusion policy (“look back”), roll them out with a diffusion-based action-conditioned video world model (“look forward”), then rank or optimize the actions using either a learned reward model or VLM preferences.

Conceptually, this is trajectory optimization / MPC with hybrid sampling + gradient optimization, interpreted through modern diffusion priors and video world models.

Interestingly, we first posted the paper on arXiv in Feb 2025, when action-conditioned video world models for planning were still rare; now this direction is rapidly gaining traction.

Still many open questions, e.g.,
• how to avoid local minima in planning
• what representations work best for world models
• how to balance physics priors vs. data-driven learning

Paper:

Heng Yang

18,994 views • 4 months ago
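
The sample, roll out, and rank loop described in the GPC post above can be sketched on a one-dimensional toy problem. This is an illustration under stated assumptions, not the paper's implementation: `sample_proposals` stands in for the pretrained diffusion policy with uniform random actions, `world_model_rollout` stands in for the video world model with simple addition, `reward` is a hand-written distance to a goal, and only the ranking variant is shown (the gradient-based optimization mentioned in the post is left out).

```python
import random

def sample_proposals(rng: random.Random, k: int) -> list[list[float]]:
    """Policy stand-in ("look back"): K candidate three-step action sequences."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(k)]

def world_model_rollout(position: float, actions: list[float]) -> float:
    """World-model stand-in ("look forward"): predict the final position."""
    for action in actions:
        position += action
    return position

def reward(predicted_position: float, goal: float) -> float:
    """Reward stand-in: closer to the goal is better (maximum is 0)."""
    return -abs(predicted_position - goal)

def plan(position: float, goal: float, k: int = 16, seed: int = 0):
    """Rank sampled proposals by predicted reward and return the best one."""
    rng = random.Random(seed)
    proposals = sample_proposals(rng, k)
    scored = [
        (reward(world_model_rollout(position, p), goal), p) for p in proposals
    ]
    best_score, best_actions = max(scored, key=lambda pair: pair[0])
    return best_actions, best_score

actions, score = plan(position=0.0, goal=1.0)
print(actions, score)
```

Raising `k` trades compute for a better chance that some proposal lands near the goal, which is the usual trade-off in sampling-based planning.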

Meet Amazon Nova 2 reasoning models. Nova has introduced its next generation of foundation models that deliver frontier intelligence with industry-leading price performance.

1/ Amazon Nova 2 Lite is a fast, cost-effective model designed for everyday AI tasks.
2/ Amazon Nova 2 Pro (preview) is Nova’s most intelligent model, built for complex AI workloads.

These models give developers control over thinking budgets, native support for agentic workflows, and a one-million-token context window for richer interactions.

Amazon Web Services

3,958,154 views • 7 months ago

Introducing FLUX-mimic, a next-generation Video-Action Model for general purpose dexterity, developed in partnership with Black Forest Labs. Late last year we published mimic-video and introduced Video-Action Models (VAM): a new family of robotics foundation models built on top of video generation models. We showed that robot control reduces to visual prediction, and that robot capability is downstream of improvements in video modeling accuracy. The obvious implication was that advances in the video modeling frontier would directly translate to increased capabilities in end-to-end robot learning. FLUX-mimic is that thesis at frontier scale: We've applied our VAM architecture to the strongest video backbone available today, FLUX 3 from Black Forest Labs, and trained it on data from our own robots and wearables. General-purpose dexterity, running on a single GPU on premises. Because the model already understands world dynamics, it needs far fewer demonstrations to learn a new task. This is game-changing for our mission to deploy robots to factory floors, where industrial robot data is scarce and expensive to collect. We're now testing and deploying FLUX-mimic with manufacturing leaders like Audi USA, on complex, multi-step manipulation long considered impossible for conventional automation.

mimic

108,954 views • 3 days ago

Yann LeCun beautifully explains how the architecture and principles used to train LLMs cannot be extended to teach AI real-world intelligence.

In 1 line: LLMs excel where intelligence equals sequence prediction over symbols. Real-world intelligence requires learned world models, abstraction, causality, and action planning under uncertainty, which current next-token training does not provide.

He says current LLMs learn by predicting the next token. That objective works very well when the task itself can be reduced to manipulating discrete symbols and sequences. Math, physics problem solving on paper, and coding fit this pattern because success largely comes from searching and composing the right sequences of symbols, equations, or program tokens. With enough data and scale, these models get very good at that kind of structured sequence prediction.

Real-world intelligence is different. The physical world is continuous, noisy, uncertain, and high dimensional. To act in it, a system needs internal models that capture objects, dynamics, causality, constraints from the body, and the outcomes of actions over time. Humans and animals build abstract representations from rich sensory streams, then make predictions in that abstract space, not at the raw pixel level. That is why a child can learn intuitive physics, plan multi-step actions, and adapt quickly in new situations with little data.

His claim about saturation follows from this gap. Scaling token prediction keeps improving symbol manipulation tasks like math and code, but it hits limits on embodied reasoning and common sense because text alone does not provide the right learning signals for world models. Predicting the next word cannot efficiently teach contact forces, affordances, occlusion, friction, or how actions change the state of the environment.

For that, he argues we need architectures that learn abstractions from sensory data and predict futures in abstract latent spaces, then use those predictions to plan actions toward goals with built-in guardrails.

--- From 'Pioneer Works' YT Channel (link in comment)

Rohan Paul

104,460 views • 7 months ago

The next frontier of autonomous driving is unlocked by reasoning models. NVIDIA Alpamayo brings together open AI models with reasoning capabilities, closed-loop simulation tools, and massive real-world driving datasets. Alpamayo 1 is a vision–language–action model that explains its own decisions through explicit reasoning traces, enabling trustworthy, humanlike decision-making. Together with NVIDIA’s Physical AI dataset and AlpaSim simulation, Alpamayo provides the tools and scale required to enable level 4 autonomous vehicles. ▶️ Watch now:

NVIDIA DRIVE

35,324 views • 6 months ago

Today is a good day for open science. As part of our continued commitment to the growth and development of an open ecosystem, today at Meta FAIR we’re announcing four new publicly available AI models and additional research artifacts to inspire innovation in the community and help advance AI in a responsible way. More in the video from Joelle Pineau.

What we’re releasing:
🦎 Meta Chameleon: 7B & 34B language models that support mixed-modal input and text-only outputs.
🪙 Meta Multi-Token Prediction: pretrained language models for code completion using Multi-Token Prediction.
🎼 Meta JASCO: generative text-to-music models capable of accepting various conditioning inputs for greater controllability. Paper available today with a pretrained model coming soon.
🗣️ Meta AudioSeal: an audio watermarking model that we believe is the first designed specifically for the localized detection of AI-generated speech, available under a commercial license.
📝 Additional RAI artifacts: including research, data and code to measure and improve the representation of geographical and cultural preferences and diversity in AI systems.

We believe that access to state-of-the-art AI creates opportunities for everyone – not just a small handful of Big Tech companies. We’re excited to share this work and to see how the community learns, iterates and builds using this technology.

Details and access to everything released by FAIR today ➡️

AI at Meta

380,751 views • 2 years ago

An interactive world model developed by NVIDIA in collaboration with academic partners.
- DreamDojo turns egocentric human video data into physical intelligence.
- Human data is more scalable than robotics data but lacks action labels.
- To solve this, a dedicated action model extracts latent actions by identifying physics and motion deltas between frames.

Training
- A massive 44k hours of video data are used for pre-training.
- Post-training on small-scale robot datasets maps human physics to specific robot embodiments.
- An additional distillation stage converts the model into an autoregressive, few-step diffusion model, enabling real-time, action-controllable simulation.

Primary Use Cases
- Live Teleoperation: Controlling a robot inside a world simulation in real time.
- Model-based Planning: Previewing and curating the best actions for improved success.
- Policy Evaluation: Testing robot policies in realistic, out-of-distribution scenarios.

Everything that's open-sourced: weights, code, post-training dataset, eval set, and details to reproduce.

The Humanoid Hub

11,575 views • 5 months ago

🚀 We’re excited to announce LingBot-VA, a new state-of-the-art robot policy model from Robbyant! LingBot-VA is built on a causal, autoregressive video-action world model for generalist robot control. Highlights: (1) First unified autoregressive video-action world model for robot control (2) Low-latency inference with a new asynchronous execution pipeline (3) SOTA on RoboTwin (92.9%, the first result above 90%) and LIBERO (98.5%) (4) +20% over π0.5 on challenging real-world long-horizon & high-precision tasks

Yinghao Xu

46,962 views • 5 months ago

New short course: Build Long-Context AI Apps with Jamba. Learn about state space models (SSMs), which have emerged as an alternative to transformers! Specifically, Jamba is a hybrid transformer-Mamba architecture that combines strengths of the transformer with ideas from SSMs. This course is built with AI21 Labs and taught by Chen Wang and Chen Almagor.

The transformer architecture is computationally expensive when handling very long input contexts. But there's an alternative called Mamba, a selective state space model that can process very long contexts with a much lower computational cost. However, researchers found that the pure Mamba architecture underperforms in understanding the context, and gives lower-quality responses. To overcome this, AI21 developed the Jamba model, which combines Mamba's computational efficiency with the transformer's attention mechanism to help with the output quality.

In this course, you’ll learn about how state space models, and Jamba, work. You’ll also learn how to prompt Jamba, use it to process long documents, and build long-context RAG apps.
- Learn how Jamba combines transformer and state space model architectures to achieve high performance and quality
- Use the AI21 SDK, with an example of prompting over a large 200k-token annual financial report of Nvidia
- Use Jamba for tool-calling, with hands-on examples from calling simple arithmetic calculations to a function that returns quarterly company financial reports
- Learn how training for long context is done, and the metrics used for its evaluation
- Create a RAG app using the AI21 Conversational RAG tool and build your own RAG pipeline that uses Jamba and LangChain

By the end of this course, you'll learn how to build applications that can handle context as long as an entire book. Please sign up here:

Andrew Ng

77,792 views • 1 year ago

Tencent presents GameGen-O: Open-world Video Game Generation

We introduce GameGen-O, the first diffusion transformer model tailored for the generation of open-world video games. This model facilitates high-quality, open-domain generation by simulating a wide array of game engine features, such as innovative characters, dynamic environments, complex actions, and diverse events. Additionally, it provides interactive controllability, thus allowing for gameplay simulation.

The development of GameGen-O involves a comprehensive data collection and processing effort from scratch. We collect and build the first Open-World Video Game Dataset (OGameData), amassing extensive data from over a hundred next-generation open-world games and employing a proprietary data pipeline for efficient sorting, scoring, filtering, and decoupled captioning. This robust and extensive OGameData forms the foundation of our model's training process.

GameGen-O undergoes a two-stage training process, consisting of foundation model pretraining and instruction tuning. In the first phase, the model is pre-trained on OGameData via text-to-video and video continuation, endowing GameGen-O with the capability for open-domain video game generation. In the second phase, the pre-trained model is frozen and fine-tuned using a trainable InstructNet, which enables the production of subsequent frames based on multimodal structural instructions. This whole training process gives the model the ability to generate and interactively control content.

In summary, GameGen-O represents a notable initial step forward in the realm of open-world video game generation via generative models. It underscores the potential of generative models to serve as an alternative to rendering techniques, efficiently combining creative generation with interactive capabilities.

AK

367,088 views • 1 year ago

KWAME TURE: 'AFRICANS WANT CONTINENTAL UNITY!' As Ghanaian President and Prime Minister Kwame Nkrumah once famously stated, 'Pan-Africanism or perish!' When Pan-Africanists talk about the need to fight for a unified Pan-African state with a single government and universal citizenship for all Africans, we are often told that we are not being realistic. But, let's slow down and think about it. Nowhere in the world is the desire for continent-wide unity greater than on the African continent, as Kwame Ture remarks in this clip. Every country you go to in Africa, you can meet Pan-Africanists. Ture, a founding member of the All-African People's Revolutionary Party, was born in Trinidad and organised in the United States before moving to Africa. The masses understand that, after 500 years of enslavement, kidnapping, colonialism, theft, genocide, terrorism and neo-colonialism, the only way we can build a new future is if Africans across the world come together as one. What do you think of Ture's remarks? Let us know in the comments.

African Stream

18,464 views • 1 year ago