Large Language Diffusion with Masking (LLaDA) models are here - and their generation looks so fucking dope! 🤯 True to Yann LeCun's vision: ditch the auto-regressive bits and approximate the language distribution via Maximum Likelihood Estimation! So cool to watch the model denoise text from tokens in real time! ...

Vaibhav (VB) Srivastav

52,235 subscribers

21,410 views • 1 year ago • via X (Twitter)

Science & Technology
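
The denoising the post describes can be illustrated with a toy sketch. This is not LLaDA's actual code: `toy_mask_predictor`, `denoise`, and the linear unmasking schedule are hypothetical names and choices made for illustration, and the "model" here simply copies tokens from a known target sentence with random confidences. The part it tries to show is the loop itself: start from a fully masked sequence, predict every masked position at once, commit only the most confident predictions, and leave the rest masked for the next round.

```python
import random

MASK = "[MASK]"


def toy_mask_predictor(tokens, target, rng):
    """Hypothetical stand-in for a trained mask predictor.

    Returns a (predicted_token, confidence) pair for every position. Here the
    prediction is simply copied from `target` and the confidence is random.
    """
    return [(target[i], rng.random()) for i in range(len(tokens))]


def denoise(target, steps, seed=0):
    """Unmask a fully masked sequence over `steps` rounds; return every state."""
    rng = random.Random(seed)
    n = len(target)
    tokens = [MASK] * n
    history = [list(tokens)]
    for step in range(1, steps + 1):
        preds = toy_mask_predictor(tokens, target, rng)
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        # Linear schedule: after round `step`, n * step // steps tokens are revealed.
        revealed_goal = n * step // steps
        to_commit = revealed_goal - (n - len(masked))
        # Commit the most confident predictions; the rest stay masked.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:to_commit]:
            tokens[i] = preds[i][0]
        history.append(list(tokens))
    return history


if __name__ == "__main__":
    sentence = "the cat sat on the mat".split()
    for state in denoise(sentence, steps=4):
        print(" ".join(state))
```

Running the script prints one line per round, with masks filling in out of left-to-right order, which is roughly what the video shows. A real model would replace `toy_mask_predictor` with a bidirectional Transformer forward pass over the whole sequence.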

12 Comments

Vaibhav (VB) Srivastav • 1 year ago

Check out the demo here:

Vaibhav (VB) Srivastav • 1 year ago

Model checkpoints here:

AssemblyAI • 1 year ago

Announcing: Our most advanced speech-to-text model goes beyond accuracy to capture the real-world complexity of human conversation and deliver reliable, source-of-truth audio data. Explore Universal-2 updates 👇

Chaithanya Kumar • 1 year ago

@ylecun @Stardust_nds check this out, buddy: something we have been discussing. Also, LLaDA.

HosseinAgha • 1 year ago

@ylecun This is interesting! Finally, a new architecture for LLMs. I don't think this solves any of @ylecun's concerns with Transformer-based autoregressive LMs: no world model, no video understanding, etc.

luis • 1 year ago

@ylecun Omg, I think this idea is perfect for making the answers more precise. 🥵 Hello, 100% precision

marko. • 1 year ago

@ylecun Since you have to compute the whole maximum possible response length every time, what does this mean for VRAM requirements when deploying these models?

Futurist Avenue • 1 year ago

@ylecun How does this stack up with Inception?

AI at Meta • 1 year ago

Llama has now been downloaded over 1 Billion times! A note to: The researchers at Meta training these models — and those building on the research in other labs. The developers and enthusiasts on r/LocalLlama, @huggingface and more; experimenting with new models and creating derivatives. The small startups and big enterprises alike who are creating a new wave of AI-powered products, built with Llama. The global AI community. Your actions speak louder than words, thank you for making it abundantly clear — a billion times over — that open source AI is how we'll create the next wave of world changing technologies, together. 🦙❤️

Hunyuan • 1 year ago

Coming soon: HunYuan-T1, the first ultra-large Mamba-powered reasoning model! Stay tuned! 🚀

AK • 1 year ago

Bytedance just dropped DAPO on Hugging Face: An Open-Source LLM Reinforcement Learning System at Scale

Jeremy Howard • 1 year ago

Announcing fasttransform: a Python lib that makes data transformations reversible/extensible. No more writing inverse functions to see what your model sees. Debug pipelines by actually looking at your data. Built on multi-dispatch. Work w/ @R_Dimm

Related Videos

Introducing The Matrix, a foundation world model for generating infinite-length, hyper-realistic videos with real-time, frame-level control:
- Infinite-length video generation
- 720p high-quality rendering
- Real-time, frame-level control at 16 FPS
- Generalization to real-world video control
🔗Blog: 📄Paper: 💻Code & Playable Demo: Coming soon!
Key Innovation: A brand new technique called the shift-window denoise process model, enabling auto-regressive generation for diffusion and consistency models in real-time.
Special thanks to project leader Ruili Feng and the entire Matrix team for their dedication and hard work over the year-long project.

Hongyang Zhang

178,322 views • 1 year ago

Diffusion language models are SO FAST!! A new startup, Inception Labs, has released Mercury Coder, "the first commercial-scale diffusion large language model." It's 5-10x faster than current-gen LLMs, providing high-quality responses at low costs. And you can try it now!

Tanishq, Ph.D. at ICML

354,254 views • 1 year ago

Google presents AudioPaLM: A Large Language Model That Can Speak and Listen paper page: We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt.

AK

290,517 views • 3 years ago

We've officially released and open-sourced HunyuanImage 2.1, our latest text-to-image model. The new model delivers on our commitment to balancing performance and quality. With native 2K image generation, HunyuanImage 2.1 is an advanced open-source text-to-image model.🎨 ✨ New in 2.1: 🔹Advanced Semantics: Supports ultra-long and complex prompts of up to 1000 tokens, and precisely controls the generation of multiple subjects in a single image. 🔹Precise Chinese and English Text Rendering with seamless image-text integration: The model naturally integrates text into images, making it suitable for a wide range of applications such as product covers, illustrations, and poster design to meet the needs of various fields. 🔹Rich Styles and High Aesthetic: Capable of generating images in various styles (including photorealistic portraits, comics, and vinyl figures), it delivers outstanding visual appeal and artistic quality. 🔹High-Quality Generation: Efficiently produces ultra-high-definition (2K) images in the same time other models take to generate a 1K image. HunyuanImage 2.1 uses two text encoders: a multimodal large language model (MLLM) to improve the model's image and text alignment capabilities, and a multi-language character-aware encoder to improve text rendering capabilities. The model is a single- and double-stream diffusion transformer with 17B parameters. We've also open-sourced the weights of the accelerated version with meanflow, which reduces inference steps from 100 to just 8, and PromptEnhancer, the first industrial-grade rewriting model that enhances your prompts for more nuanced and expressive image generation. Now, creators turn complex ideas, like posters with slogans or multi-panel comics, into visuals faster than ever. We're just getting started. Stay tuned for our native multimodal image generation model coming soon. 🌐Website: 🔗Github: 🤗Hugging Face: ✨Hugging Face Demo:

Tencent Hy

89,257 views • 10 months ago

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models paper page: github: Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

AK

83,657 views • 3 years ago

MolmoAct2 runs zero-shot on an SO-ARM101, no training required 🦾 Ai2's open Action Reasoning Model is now in LeRobot with the full lifecycle covered: fine-tuning (full or LoRA), evaluation, and real-robot deployment. It pairs a Molmo2-ER vision-language backbone with a flow-matching action expert to turn images, language, and proprioceptive state into action chunks. The best part: ready-made checkpoints ship with calibration correction baked in, so you can point it at your SO-100/101 and just watch it work. Inference fits in ~12GB at bf16, and LoRA fine-tuning runs on a single 24GB GPU. Big thanks to the Ai2 team for building this in the open.

LeRobot

19,891 views • 14 days ago

Thanks to the Latent Consistency Model (LCM), we're nearing real-time image diffusion. I've made a simple MJPEG server for generation stream using diffusers img2img pipeline. It's really fun to play with it. Can't wait for the ControlNet version. try it:

Radamés Ajna

231,859 views • 2 years ago

“The only Chairman with the known English……….The Chinese, Japanese and Indians are using their language to succeed, so I’m also using the twi language.” — Chairman Wontumi responds to being hailed for his proficiency in the English language.

SIKAOFFICIAL🦍

87,753 views • 1 year ago

Autoregressive LLMs are officially on notice. Run the Gemma 4 26B diffusion GGUF with llama.cpp.
Google just dropped DiffusionGemma-26B, and it completely flips how we generate text. Instead of predicting words one by one, it generates 256 tokens in parallel using bi-directional attention. It's like Stable Diffusion, but for language. The model starts with random text "noise" and iteratively refines and self-corrects the entire block in real time to fix formatting and reasoning errors on the fly.
Since it's a Mixture of Experts (MoE) that only activates 3.8B parameters during inference, it fits perfectly on consumer hardware. You can run the Q4_K_M quant with an 18GB VRAM budget on a single RTX 3090 or RTX 4090 with exceptional throughput. Tested on Ubuntu 22 with CUDA 13.1 using the cutting-edge experimental llama.cpp branch.
Here is how to compile and run it with the live terminal denoising visualizer:
    # 1. Clone & check out the experimental PR (#24423)
    git clone && cd llama.cpp
    git fetch origin pull/24423/head:diffusiongemma && git checkout diffusiongemma
    # 2. Build with CUDA support
    cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=native
    cmake --build build -j $(nproc) --config Release --target llama-diffusion-cli
    # 3. Run with live visual denoising (llama.cpp flags)
    ./build/bin/llama-diffusion-cli \
        -m /path/to/diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
        -ngl 99 -cnv -n 2048 --diffusion-visual
Watch the video below to see the live --diffusion-visual canvas iteratively denoising the prompt output in real time. The guide and Unsloth's Hugging Face GGUF model links are in the comments below! Is autoregressive generation officially legacy tech? Let me know what you think.

Alok

52,656 views • 1 month ago

MotionGPT: Human Motion as a Foreign Language paper page: Though the advancement of pre-trained large language models unfolds, the exploration of building a unified model for language and other multi-modal data, such as motion, remains challenging and untouched so far. Fortunately, human motion displays a semantic coupling akin to human language, often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training that can enhance the performance of motion-related tasks becomes feasible. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks. Specifically, we employ the discrete vector quantization for human motion and transfer 3D motion into motion tokens, similar to the generation process of word tokens. Building upon this "motion vocabulary", we perform language modeling on both motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT with a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performances on multiple motion tasks including text-driven motion generation, motion captioning, motion prediction, and motion in-between.

AK

125,319 views • 3 years ago

I got to "play" a world model in real life. The Google DeepMind folks set up a crazy demo for Genie. You select glowing orbs to represent your scene and character. It loads the world in the model, and you navigate with joysticks like a video game 🕹️

Justine Moore

60,142 views • 2 months ago

A quick video about how Come-from-Beyond discovered that you can actually break any Large Language Model by trolling it with complex questions. It is called a "Zero Delta" exploit and all LLM models are susceptible to it. I managed to recreate this on Grok and the video shows the result. With regards to all LLMs out there from $QUBIC - building the real AGI. Qubic My YouTube Channel:

retrodrive ⛏

13,625 views • 1 year ago

Layer3 is teaching the language of web3. 🧑‍🏫 We spent some time with to learn about how he and the Layer3 team are helping users start their journey in web3 and earn tokens onchain:

MetaMask 🦊

29,154 views • 1 year ago

Two weeks ago, Figure's CEO announced the end of their partnership with OpenAI and promised to show the in-house AI development – here we are: Helix is a model that unifies perception, language and learned control to execute humanoid tasks by following natural language prompts.

The Humanoid Hub

69,442 views • 1 year ago

Add near real-time voice translation to your apps with Gemini 3.5 Live Translate via the Gemini Live API. 🎙️ Watch how the model handles live broadcast ingestion and translation with continuous speech-to-speech streaming (S2ST) and synced transcripts, letting users tune into global radio broadcasts in their native language.

Google AI Developers

3,543,687 views • 1 month ago

Announcing the open-source release of Qwen3-VL! A powerful vision-language model that can operate GUIs, code charts from mockups, and recognize "everything" from daily life to specialized fields. Highlights: 🔹 Precise event location in videos up to 2 hours long. 🔹 OCR language support boosted from 19 to 32, with major gains on rare characters and tilted text. 🔹 Supports a native 256K context length, expandable to 1M tokens. 🔹 Achieves leading accuracy in risk detection in real-world scenarios. Available on ModelScope, HuggingFace, GitHub, and integrated into Alibaba Cloud Model Studio. Try it today!

Tongyi Lab

59,937 views • 10 months ago

The fine-tuning dashboard makes it easy to visualize how model performance shifts across checkpoints. Watch @promptshant and @tsautory use it to dial in precision vs recall for a nuanced legal classification problem. Watch the full demo from Build Hours:

OpenAI Developers

30,403 views • 1 year ago

🎶 Lyria RealTime is a new experimental interactive music generation model that allows anyone to interactively create, control and perform music in real time. Available via the Gemini API and you can try the demo app on Google AI Studio.

Google AI Developers

25,739 views • 1 year ago