(1/5) FP4 hardware is here, but 4-bit attention still kills model quality, blocking true end-to-end FP4 serving. To fix that, we propose Attn-QAT, the first systematic study of quantization-aware training for attention. The result: FP4 attention quality is comparable to BF16 attention with 1.1x–1.5x higher throughput than SageAttention3 on...

Hao AI Lab

6,447 subscribers

37,506 views • 2 months ago • via X (Twitter)

Related videos

🚀 Excited to release LongLive 2.0! 🎬 An end-to-end infrastructure for long video generation, with FP4 and parallelism at the core of both training and inference. ⚡45.7 FPS generation speed on 5B model⚡ ✨ LongLive 2.0 supports real-video training, few-step distillation, multi-shot training/inference, sequence-parallel acceleration, NVFP4 KV cache, and async VAE decoding deployment. 🧩 To our knowledge, this is the first open-source 4-bit long video generation infra that covers both training and inference. 🙌 Welcome to check it out, try it, and share feedback! 🔗 Code: 📰 Paper: 🎥 Demo: #LongVideoGeneration #VideoGeneration #Realtime #AIInfra #EfficientAI #FP4 #Parallel #NVIDIA

Yukang Chen

58,223 views • 1 month ago

(1/5) 5 seconds of video. 1.8 seconds of generation. One NVIDIA GeForce RTX 5090 on FastVideo. 🤯🚀 - FastWan-QAD, a new family of video generation models - Trained with FastVideo's Quantization-Aware Distillation (QAD) recipe. - Powered by FastVideo, we push a single NVIDIA GeForce RTX 5090 to its absolute limit: generating a 5-second 480P video in 1.8s end-to-end! 📜 Blog: 💻 Code: 💽 Model:

Hao AI Lab

11,761 views • 5 days ago

🎥 Video DiTs are painfully slow: HunyuanVideo takes 16 min to generate a 5s 720P video on H100. 🤯 Announcing Sliding Tile Attention (STA): * Accelerate 3D full attention (FA3) by up to 10x * Slash the end-to-end time from 16 --> 5 mins * NO extra training. NO quality loss! 🚀 Can you tell which videos are generated by the original HunyuanVideo, and which by STA? 👀 Blog:

Hao AI Lab

58,003 views • 1 year ago

Introducing SubQ - a major breakthrough in LLM intelligence. It is the first model built on a fully sub-quadratic sparse-attention architecture (SSA), and the first frontier model with a 12 million token context window which is: - 52x faster than FlashAttention at 1MM tokens - Less than 5% the cost of Opus Transformer-based LLMs waste compute by processing every possible relationship between words (standard attention). Only a small fraction actually matter. Subquadratic finds and focuses only on the ones that do. That's nearly 1,000x less compute and a new way for LLMs to scale.

Alexander Whedon

13,120,791 views • 1 month ago

FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling paper:

AK

12,778 views • 2 months ago

WIP: First attempt to speed up prefill for Flash-MoE. Original repo did token-by-token without streamed experts. Added: Batched linear attention + batched full attention (Flash Attention style) with custom Metal kernels. Without experts: 6.2x faster prefill (11 -> 68 tok/s) With experts at full-attn layers only: 1.9x faster (11 -> 20.5 tok/s) — same output quality Qwen3.5-397B, 4-bit, 209GB, M5 Max 128GB 1/3

Anemll

19,572 views • 3 months ago

"When it comes to the jobs we create, we don’t just focus on the quantity, we pay more attention to the quality. So much so that the quality of a single job I create for one person is worth more than 50 jobs in other places." — Prof. Isa Ali Ibrahim, CON

B_SARKY

18,810 views • 1 month ago

New short course: Attention in Transformers: Concepts and Code in PyTorch. Last week we released a course on how LLM transformers work. This week, go deeper and learn about the technical ideas behind the attention mechanism, and see how to code it in PyTorch. This course is built with Joshua Starmer, Founder and CEO of StatQuest. The attention mechanism was a breakthrough that led to transformers, the architecture powering large language models like ChatGPT. Transformers, introduced in the 2017 paper "Attention is All You Need" by Vaswani and others, took off because of their highly scalable design. In this course, you’ll learn how the attention mechanism, a key element of transformer-based LLMs, works and implement it in PyTorch. You'll develop deep intuition about building reliable, functional, and scalable AI applications. What you will do: - Understand the evolution of the attention mechanism, a key breakthrough that led to transformers. - Learn the relationships between word embeddings, positional embeddings, and attention. - Learn about the Query, Key, and Value matrices, and how to produce and use them in attention. - Walk through the math required to calculate self-attention and masked self-attention to learn why and how they work. - Understand the difference between self-attention and masked self-attention and how one is used in the encoder to build context-aware embeddings and the other is used in the decoder for generative outputs. - Learn the details of the encoder-decoder architecture, cross-attention, and multi-head attention and how they are all incorporated into a transformer. - Use PyTorch to code a class that implements self-attention, masked self-attention, and multi-head attention. There are lots of exciting technical details in this course. Please sign up here:

Andrew Ng

132,135 views • 1 year ago

Model house construction. The attention to detail on this model is incredible.

HOW THINGS WORK

65,113 views • 2 months ago

Fr. Sam Sawyer: Trump wants to be the center of attention all the time. Pope Leo is pulling our attention away from the president and putting our attention back on God, on Jesus, on the call to peace and the message of the Gospel. Trump seems to have trouble with that.

FactPost

35,952 views • 2 months ago

We asked Lulu Cheng Meservey about her advice for founders in the age of AI. "The next era of Going Direct is Attention." "Founders work hard to get attention but often don't do anything with it." "Hold on to your attention. Turn it into money."

TBPN

24,571 views • 1 year ago

Sarkaari camera quality is making it hard to pay attention to the Quad Foreign Ministers briefing

Shashank Mattoo

35,885 views • 1 month ago

Pay close attention to the end

SuperTate

17,945 views • 6 months ago

.Glinert 🇺🇸 🏭 (Co-Founder & CEO of Sphere Semi) on how they're using AI to change how chips are designed: “CPUs and GPUs get all the attention, but analog is a third of the trillion-dollar chip industry and it’s critical for communications, warfare, and more.” “We think human chip design is coming to an end. In analog, it’s still done by hand and we’re putting that to an end.”

TBPN

58,658 views • 9 months ago

Boulder Colorado Pearl Street Mall Attack seems staged. That guy at the end of the video is supposed to be the guy and nobody is even paying attention to him, lol And the sounds/screams sound like they were dubbed in from a studio. Credit BlueJay on TG for bringing this to my attention. #bouldercolorado

Zadok 💫

19,206 views • 1 year ago

This is what is going on behind the scene Pay attention and watch till the end

iamlpt_forex

62,015 views • 2 months ago

🚨 Attention Leftists, the high road has come to an end!

Catarina Senora Gatita

333,050 views • 9 months ago

Last week, we launched "Attention in Transformers: Concepts and Code in PyTorch" instructed by Joshua Starmer! In this course, you'll: ✅ Learn how the attention mechanism in LLMs helps convert base token embeddings into rich context-aware embeddings. ✅ Understand the Query, Key, and Value matrices, what they are for, how to produce them, and how to use them in attention. ✅ Learn the difference between self-attention, masked self-attention, and cross-attention, and how multi-head attention scales the algorithm. 🔗 Enroll for free:

DeepLearning.AI

36,832 views • 1 year ago

ATTENTION: We are crowdsourcing research for our upcoming film on Bengal. If you want to contribute to the making of history, this is your chance. I’ve asked an important question at the end, so watch till the end and please answer. And don’t forget to share.

Vivek Ranjan Agnihotri

139,311 views • 1 year ago

While everyone is taking 3-4 years for a single movie, bro cooked 7 & half hr first quality footage in 16 months and got all the nation wide attention

Legend Prabhas Fan 🇮🇳

839,075 views • 3 months ago