Large Language Diffusion with Masking (LLaDA) models are here, and their generation looks so fucking dope! 🤯 True to Yann LeCun's vision: ditch the autoregressive bits and approximate the language distribution via maximum likelihood estimation! So cool to watch the model denoise text from tokens in real time!

Vaibhav (VB) Srivastav

43,377 subscribers

21,394 views • 1 year ago • via X (Twitter)

Science & Technology
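
The denoising the post describes can be illustrated with a toy loop. This is only a sketch of the general idea behind masked diffusion sampling (start fully masked, predict every masked position at once, keep the most confident guesses, repeat), not LLaDA's actual implementation. The names `toy_predict` and `denoise` are hypothetical, and the "model" here simply looks up a fixed target sentence and assigns random confidences.

```python
import random

MASK = "[MASK]"

def toy_predict(seq, target, rng):
    """Hypothetical stand-in for the mask predictor.

    Returns a (token, confidence) guess for every masked position.
    A real model would run a bidirectional transformer over the
    whole sequence here instead of reading from `target`.
    """
    return {i: (target[i], rng.random()) for i, tok in enumerate(seq) if tok == MASK}

def denoise(target, steps, seed=0):
    rng = random.Random(seed)
    seq = [MASK] * len(target)          # start from a fully masked response
    history = [list(seq)]
    for step in range(steps):
        guesses = toy_predict(seq, target, rng)
        if not guesses:
            break
        # Reveal an equal share of the remaining positions per step,
        # keeping the highest-confidence guesses and leaving the rest masked.
        remaining_steps = steps - step
        k = -(-len(guesses) // remaining_steps)  # ceiling division
        best = sorted(guesses, key=lambda i: guesses[i][1], reverse=True)[:k]
        for i in best:
            seq[i] = guesses[i][0]
        history.append(list(seq))
    return seq, history

target = "diffusion models fill in masked tokens in parallel".split()
final, history = denoise(target, steps=4)
for frame in history:
    print(" ".join(frame))
```

Each printed frame has fewer `[MASK]` entries than the one before, which is roughly what the video shows: tokens appear out of order across the whole response rather than left to right.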

12 comments

Vaibhav (VB) Srivastav • 1 year ago

Check out the demo here:

Vaibhav (VB) Srivastav • 1 year ago

Model checkpoints here:

AssemblyAI • 1 year ago

Announcing: Our most advanced speech-to-text model goes beyond accuracy to capture the real-world complexity of human conversation and deliver reliable, source-of-truth audio data. Explore Universal-2 updates 👇

Chaithanya Kumar • 1 year ago

@ylecun @Stardust_nds check this out buddy, something that we have been discussing. Also LLaDA

HosseinAgha • 1 year ago

@ylecun This is interesting! Finally a new architecture for LLMs. I don't think this solves any of the @ylecun concerns with transformer based Auto Regressive LMs. No world model. No video understanding, etc.

luis • 1 year ago

@ylecun Omg , I think this idea is perfect to make the answers more precise, 🥵 hello 100 precision

marko. • 1 year ago

@ylecun Since you have to compute the whole maximum possible response length every time, what does this mean for VRAM requirements when deploying these models?

Futurist Avenue • 1 year ago

@ylecun How does this stack up with Inception?

AI at Meta • 1 year ago

Llama has now been downloaded over 1 Billion times! A note to: The researchers at Meta training these models — and those building on the research in other labs. The developers and enthusiasts on r/LocalLlama, @huggingface and more; experimenting with new models and creating derivatives. The small startups and big enterprises alike who are creating a new wave of AI-powered products, built with Llama. The global AI community. Your actions speak louder than words, thank you for making it abundantly clear — a billion times over — that open source AI is how we'll create the next wave of world changing technologies, together. 🦙❤️

Hunyuan • 1 year ago

Coming soon: HunYuan-T1, the first ultra-large Mamba-powered reasoning model! Stay tuned! 🚀

AK • 1 year ago

Bytedance just dropped DAPO on Hugging Face: An Open-Source LLM Reinforcement Learning System at Scale

Jeremy Howard • 1 year ago

Announcing fasttransform: a Python lib that makes data transformations reversible/extensible. No more writing inverse functions to see what your model sees. Debug pipelines by actually looking at your data. Built on multi-dispatch. Work w/ @R_Dimm

Related videos

Introducing The Matrix --- a foundation world model for generating infinite-length, hyper-realistic videos with real-time, frame-level control: - Infinite-length video generation - 720p high-quality rendering - Real-time, frame-level control at 16 FPS - Generalization to real-world video control 🔗Blog: 📄Paper: 💻Code & Playable Demo: Coming soon! Key Innovation: A brand new technique called the shift-window denoise process model, enabling auto-regressive generation for diffusion and consistency models in real-time. Special thanks to project leader Ruili Feng and the entire Matrix team for their dedication and hard work over the year-long project.

Hongyang Zhang

178,322 views • 1 year ago

Diffusion language models are SO FAST!! A new startup, Inception Labs, has released Mercury Coder, "the first commercial-scale diffusion large language model" It's 5-10x faster than current gen LLMs, providing high-quality responses at low costs. And you can try it now!

Tanishq Mathew Abraham, Ph.D.

354,178 views • 1 year ago

Google presents AudioPaLM: A Large Language Model That Can Speak and Listen paper page: introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt.

AK

290,517 views • 3 years ago

We've officially released and open-sourced HunyuanImage 2.1, our latest text-to-image model. The new model delivers on our commitment to balancing performance and quality. With native 2K image generation, HunyuanImage 2.1 is an advanced open-source text-to-image model.🎨 ✨ New in 2.1: 🔹Advanced Semantics: Supports ultra-long and complex prompts of up to 1000 tokens, and precisely controls the generation of multiple subjects in a single image. 🔹Precise Chinese and English Text Rendering with seamless image-text integration: The model naturally integrates text into images, making it suitable for a wide range of applications such as product covers, illustrations, and poster design to meet the needs of various fields. 🔹Rich Styles and High Aesthetic: Capable of generating images in various styles (including photorealistic portraits, comics, and vinyl figures), it delivers outstanding visual appeal and artistic quality. 🔹High-Quality Generation: Efficiently produces ultra-high-definition (2K) images in the same time other models take to generate a 1K image. HunyuanImage 2.1 uses two text encoders: a multimodal large language model (MLLM) to improve the model's image and text alignment capabilities, and a multi-language character-aware encoder to improve text rendering capabilities. The model is a single- and double-stream diffusion transformer with 17B parameters. We've also open-sourced the weights of the accelerated version with meanflow, which reduces inference steps from 100 to just 8, and PromptEnhancer, the first industrial-grade rewriting model that enhances your prompts for more nuanced and expressive image generation. Now, creators turn complex ideas, like posters with slogans or multi-panel comics, into visuals faster than ever. We're just getting started. Stay tuned for our native multimodal image generation model coming soon. 🌐Website: 🔗Github: 🤗Hugging Face: ✨Hugging Face Demo:

Tencent Hy

89,257 views • 9 months ago

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models paper page: github: Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

AK

83,657 views • 2 years ago

try out the Gradio Demo for AudioLDM: Text-to-Audio Generation with Latent Diffusion Models on Hugging Face demo:

AK

82,137 views • 3 years ago

Thanks to the Latent Consistency Model (LCM), we're nearing real-time image diffusion. I've made a simple MJPEG server for generation stream using diffusers img2img pipeline. It's really fun to play with it. Can't wait for the ControlNet version. try it:

Radamés Ajna

231,832 views • 2 years ago

This week, grounding DINO 1.5 was released It is a new model that uses text prompts to detect objects from videos and images in real-time Examples & demo to try below:

Allen T.

56,015 views • 2 years ago

“The only Chairman with the known English……….The Chinese, Japanese and Indians are using their language to succeed, so I’m also using the twi language.” — Chairman Wontumi responds to being hailed for his proficiency in the English language.

SIKAOFFICIAL🦍

87,753 views • 1 year ago

Autoregressive LLMs are officially on notice. Run the Gemma 4 26B diffusion GGUF with llama.cpp.

Google just dropped DiffusionGemma-26B, and it completely flips how we generate text. Instead of predicting words one by one, it generates 256 tokens in parallel using bidirectional attention. It's like Stable Diffusion, but for language: the model starts with random text "noise" and iteratively refines and self-corrects the entire block in real time to fix formatting and reasoning errors on the fly.

Since it's a Mixture of Experts (MoE) that only activates 3.8B parameters during inference, it fits perfectly on consumer hardware. You can run the Q4_K_M quant with an 18GB VRAM budget on a single RTX 3090 or RTX 4090 with exceptional throughput. Tested on Ubuntu 22 with CUDA 13.1 using the cutting-edge experimental llama.cpp branch.

Here is how to compile and run it with the live terminal denoising visualizer:

# 1. Clone & check out the experimental PR (#24423)
git clone && cd llama.cpp
git fetch origin pull/24423/head:diffusiongemma && git checkout diffusiongemma

# 2. Build with CUDA support
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=native
cmake --build build -j $(nproc) --config Release --target llama-diffusion-cli

# 3. Run with live visual denoising (llama.cpp flags)
./build/bin/llama-diffusion-cli \
  -m /path/to/diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
  -ngl 99 -cnv -n 2048 --diffusion-visual

Watch the video below to see the live --diffusion-visual canvas iteratively denoising the prompt output in real time. The guide and Unsloth's Hugging Face GGUF model links are in the comments below! Is autoregressive generation officially legacy tech? Let me know what you think.

Alok

52,656 views • 10 days ago
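
The 18GB VRAM budget quoted in the post above can be sanity-checked with back-of-the-envelope arithmetic. This is only a rough sketch: the figure of about 4.85 bits per weight for Q4_K_M is an assumption (the effective rate varies by model), and `weight_gb` is a hypothetical helper. One point worth keeping in mind is that an MoE model normally still needs all of its weights resident in memory, so the estimate uses the 26B total rather than the 3.8B active parameters.

```python
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

total_params = 26e9      # total parameter count, taken from the post
bits_per_weight = 4.85   # ASSUMPTION: rough average for Q4_K_M, varies by model
budget_gb = 18.0         # VRAM budget quoted in the post

size_gb = weight_gb(total_params, bits_per_weight)
headroom_gb = budget_gb - size_gb  # left for activations, buffers, CUDA context

print(f"weights: about {size_gb:.1f} GB, headroom: about {headroom_gb:.1f} GB")
```

Under these assumptions the weights alone take up most of the budget, leaving only a couple of GB for everything else, so the quoted figure is plausible but tight.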

We built a live multilingual, multi-person video call with Gemini 3.5 Live Translate on LiveKit. Everyone picks their language, speaks naturally, and hears each other in real time in their language of choice. Watch the demo and check out the open source repo:

LiveKit

21,051 views • 11 days ago

MotionGPT: Human Motion as a Foreign Language paper page: Though the advancement of pre-trained large language models unfolds, the exploration of building a unified model for language and other multi-modal data, such as motion, remains challenging and untouched so far. Fortunately, human motion displays a semantic coupling akin to human language, often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training that can enhance the performance of motion-related tasks becomes feasible. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks. Specifically, we employ the discrete vector quantization for human motion and transfer 3D motion into motion tokens, similar to the generation process of word tokens. Building upon this "motion vocabulary", we perform language modeling on both motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT with a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performances on multiple motion tasks including text-driven motion generation, motion captioning, motion prediction, and motion in-between.

AK

125,319 views • 3 years ago

I got to "play" a world model in real life. The Google DeepMind folks set up a crazy demo for Genie. You select glowing orbs to represent your scene and character. It loads the world in the model, and you navigate with joysticks like a video game 🕹️

Justine Moore

58,504 views • 1 month ago

A quick video about how Come-from-Beyond discovered that you can actually break any Large Language Model by trolling it with complex questions. It is called a "Zero Delta" exploit and all LLM models are susceptible to it. I managed to recreate this on Grok and the video shows the result. With regards to all LLMs out there from $QUBIC - building the real AGI. Qubic My YouTube Channel:

retrodrive ⛏

13,625 views • 1 year ago

Two weeks ago, Figure's CEO announced the end of their partnership with OpenAI and promised to show the in-house AI development – here we are: Helix is a model that unifies perception, language and learned control to execute humanoid tasks by following natural language prompts.

The Humanoid Hub

69,415 views • 1 year ago

Add near real-time voice translation to your apps with Gemini 3.5 Live Translate via the Gemini Live API. 🎙️ Watch how the model handles live broadcast ingestion and translation with continuous speech-to-speech streaming (S2ST) and synced transcripts, letting users tune into global radio broadcasts in their native language.

Google AI Developers

20,516 views • 8 days ago

Layer3 is teaching the language of web3. 🧑‍🏫 We spent some time with Brandon to learn about how he and the Layer3 team are helping users start their journey in web3 and earn tokens onchain:

MetaMask.eth 🦊

28,962 views • 1 year ago

🚧 We are working on StyleTTS2 model training. 🏗 I think it's time to show some results from the training... 🔥 The model license is the same as the repo (MIT) 🎙 StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models 🔥 Jupyter Notebook 🥳 Thanks to Yinghao Aaron Li ❤ Cong Han ❤ Vinay S Raghavan ❤ Gavin Mischler ❤ Nima Mesgarani ❤ 📄paper: 🧬code: 🗃dataset: 📦model: please try it 🐣

camenduru

11,532 views • 2 years ago

Announcing the open-source release of Qwen3-VL! A powerful vision-language model that can operate GUIs, code charts from mockups, and recognize "everything" from daily life to specialized fields. Highlights: 🔹 Precise event location in videos up to 2 hours long. 🔹 OCR language support boosted from 19 to 32, with major gains on rare characters and tilted text. 🔹 Supports a native 256K context length, expandable to 1M tokens. 🔹 Achieves leading accuracy in risk detection in real-world scenarios. Available on ModelScope, HuggingFace, GitHub, and integrated into Alibaba Cloud Model Studio. Try it today!

Tongyi Lab

59,913 views • 9 months ago