We’re thrilled to open-source TriAttention! 🚀
🦞 Deploy OpenClaw (32B LLM) on a single 24GB RTX 4090 locally
💻 Full code open-source & vLLM-ready for one-click deployment
⚡️ 2.5× faster inference speed & 10.7× less KV cache memory usage
TriAttention is a novel KV cache compression method built on rigorous...

Yukang Chen

1,488 subscribers

197,064 views • 2 months ago • via X (Twitter)

Education, Science and Technology

Related videos

I added KV caching and INT8 KV quantization to our transformer inference, improving throughput by 35x. All of this was done from scratch in Rust + CUDA, on top of a homemade ML framework.
On a 4-token prompt with 252 generated tokens:
- Original: 0.76 tok/s
- KV cache fp32: 27.21 tok/s
- KV cache int8 (quantized): 27.29 tok/s
Try it out yourself here:
In practice:
- KV caching gave us about a 35x end-to-end speedup
- INT8 KV cache kept roughly the same speed as fp32 but cut KV cache memory by 3.78x
FP32 cache used 4.5 MB in this run while the INT8 cache used only 1.19 MB.
This simple change to inference created a huge impact on performance. To learn more about the KV cache and other optimizations like this, check out the blog at

Reese Chong

52,588 views • 2 months ago
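The headline ratios in the post above can be re-derived from the figures it quotes. The following is only a small arithmetic sketch: the variable names and the `ratio` helper are illustrative, and the remark about the gap to an ideal 4x is an assumption rather than something the post states.

```python
def ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, guarding against a zero denominator."""
    if denominator == 0:
        raise ValueError("denominator must be non-zero")
    return numerator / denominator


# Figures quoted in the post (4-token prompt, 252 generated tokens).
baseline_tps = 0.76     # tok/s without a KV cache
fp32_tps = 27.21        # tok/s with an fp32 KV cache
int8_tps = 27.29        # tok/s with an INT8 KV cache
fp32_cache_mb = 4.5     # KV cache size in fp32
int8_cache_mb = 1.19    # KV cache size in INT8

speedup_fp32 = ratio(fp32_tps, baseline_tps)      # about 35.8
speedup_int8 = ratio(int8_tps, baseline_tps)      # about 35.9
memory_ratio = ratio(fp32_cache_mb, int8_cache_mb)  # about 3.78

# fp32 stores 4 bytes per value and INT8 stores 1, so 4x is the ceiling.
# Assumption (not stated in the post): the shortfall from 4x is overhead
# such as quantization scales stored alongside the INT8 values.
ideal_ratio = 4 / 1

print(f"fp32 speedup: {speedup_fp32:.1f}x")
print(f"int8 speedup: {speedup_int8:.1f}x")
print(f"memory saving: {memory_ratio:.2f}x (ideal {ideal_ratio:.0f}x)")
```

Both speedups round to the "about 35x" quoted in the post, and the memory ratio matches the quoted 3.78x.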

Someone built a free and better alternative to Claude that runs 100% locally.
→ works with any LLM (Claude, GPT, Gemini, vLLM)
→ beats it on deep research
→ has Cowork-like capabilities
→ 50+ connectors out of the box
→ deploy in literally one command
100% open source. MIT license. 28k stars.

How To Prompt

39,043 views • 1 month ago

Turn any GitHub repository into LLM-ready text! Simply replace "hub" with "ingest" in a GitHub URL and receive a prompt-friendly text ingest for LLMs.
Gitingest is 100% open-source and provides:
- Directory structure
- A brief summary of the project
- The entire content as LLM-ready text
Plus, it comes with a nice Python package and you can run the UI locally! Stay tuned, I'm working on something really cool with this! ✨ Link to the GitHub repo in the next tweet!
_____
Find me → Akshay 🚀 ✔️ For more insights and tutorials on ML and AI Engineering!

Akshay 🚀

191,310 views • 1 year ago
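The URL rule described in the Gitingest post above (replace "hub" with "ingest" in a GitHub URL) can be sketched as a small helper. This is a minimal illustration under stated assumptions: the function name `to_gitingest_url` and the `owner/repo` path are hypothetical, and the helper only rewrites the host name; it does not call Gitingest or use its Python package.

```python
from urllib.parse import urlsplit, urlunsplit


def to_gitingest_url(github_url: str) -> str:
    """Rewrite a github.com URL by replacing "hub" with "ingest" in the host."""
    parts = urlsplit(github_url)
    host = parts.netloc.lower()
    # Only plain github.com URLs are handled in this sketch.
    if host != "github.com":
        raise ValueError(f"not a github.com URL: {github_url!r}")
    new_host = host.replace("hub", "ingest", 1)  # github.com -> gitingest.com
    return urlunsplit(
        (parts.scheme, new_host, parts.path, parts.query, parts.fragment)
    )


# "owner/repo" is a placeholder path, not a real repository.
print(to_gitingest_url("https://github.com/owner/repo"))
# prints "https://gitingest.com/owner/repo"
```

Restricting the rewrite to the host avoids accidentally changing a repository name that happens to contain "hub".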

Turn any open-source LLM into a reasoning powerhouse! Using reinforcement fine-tuning, you can add reasoning abilities to any LLM, even without a labelled dataset. Step-by-step explanation with code:

Akshay 🚀

50,423 views • 1 year ago

We’re thrilled to open-source LabClaw, the Skill Operating Layer for LabOS, by the Stanford-Princeton team. One command turns any OpenClaw agent into a full AI Co-Scientist. Demo: Dragon Shrimp Army reporting for duty 🦞🔬 #AIforScience #OpenClaw

AI4Science Catalyst

445,326 views • 3 months ago

What compression looks like on vLLM. Same Gemma 4 31B. Red Hat AI's quantized version runs at nearly 2x tokens/sec, half the memory, 99%+ accuracy retained. Open source. Quantized with LLM Compressor. Links in comments. 🙏 Sawyer Bowerman for the 2-minute demo.

Red Hat AI

34,136 views • 2 months ago

Sentra just killed Google Research's TurboQuant. SpectralQuant — 5.95× KV cache compression on Mistral 7B at +7.5% perplexity overhead. TurboQuant at the same compression: +22%. 3× less degradation. 15-second calibration. One per-model, then drop-in for any HuggingFace LLM, ViT, ESM, AlphaFold Evoformer, or VideoMAE. Check out the findings and how the mechanism works below. ↓

Ashwin Gopinath

59,026 views • 1 month ago

🚀 Excited to release LongLive 2.0! 🎬 An end-to-end infrastructure for long video generation, with FP4 and parallelism at the core of both training and inference. ⚡45.7 FPS generation speed on 5B model⚡ ✨ LongLive 2.0 supports real-video training, few-step distillation, multi-shot training/inference, sequence-parallel acceleration, NVFP4 KV cache, and async VAE decoding deployment. 🧩 To our knowledge, this is the first open-source 4-bit long video generation infra that covers both training and inference. 🙌 Welcome to check it out, try it, and share feedback! 🔗 Code: 📰 Paper: 🎥 Demo: #LongVideoGeneration #VideoGeneration #Realtime #AIInfra #EfficientAI #FP4 #Parallel #NVIDIA

Yukang Chen

57,091 views • 1 month ago

Introducing MiniCPM 4.1-8B: First Open-Source Reasoning LLM with Trainable Sparse Attention ✅ Strong Reasoning Capability: Surpasses similar-sized models on 15 tasks! ✅ Fast Generation: 3x decoding speedup for reasoning ✅ Efficient Architecture: Trainable sparse attention, frequency-ranked speculative decoding Download Models: Huggingface: Github: Technical Report: #AI #MiniCPM #LLM #OpenBMB #ArtificialIntelligence #MachineLearning

OpenBMB

19,236 views • 9 months ago

Just built an MCP for Ghidra. Now basically any LLM (Claude, Gemini, local...) can Reverse Engineer malware for you. With the right prompting, it automates a *ton* of tedious tasks. One-shot markups of entire binaries with just a click. Open source, on Github now.

LaurieWired

284,481 views • 1 year ago

LLM inference speed with vs. without KV caching: (learn how and why it works below)

Daily Dose of Data Science

59,218 views • 2 months ago

LLM inference speed with vs. without KV caching: (learn how and why it works below)

Avi Chawla

394,936 views • 3 months ago

We shipped OpenClaw for Windows 🦞
- Free with your LLM API keys
- Custom skills from ClawHub
- File system access
- Browser control
- Local models (soon)
- Open source
Inspired by Peter Steinberger 🦞

atomicbot.ai

84,723 views • 3 months ago

📢 It's the open source news heard 'round the world: The GitHub Copilot Chat extension for Visual Studio Code is now fully open source under the MIT license. 🎉 Check out the repo. ⬇️

GitHub

27,773 views • 11 months ago

We’ve been exploring how to build a library of open source UI patterns for agents on mobile, starting with Live Activities for observability. It’s available on GitHub and compatible with 🦞OpenClaw.

Aaron Abentheuer

189,254 views • 3 months ago

Seedance 2 just dropped as open-source! OK, mostly...
Helios - 14B model for minute-scale vid gen.
- High-quality, fast
- No KV-cache, no quantization, no TinyVAE, just pure architectural efficiency
- T2V/I2V/V2V ready

Wildminder

25,117 views • 3 months ago

Introducing DuoAttention: Our new framework slashes both memory and latency for long-context LLMs without sacrificing performance! By applying full KV cache only to critical heads, we achieve: ⚡ 2.55x memory reduction ⚡ 2.18x decoding speedup ⚡ 3.3M tokens on a single A100 GPU

Guangxuan Xiao

31,023 views • 1 year ago

Build agents that can actually do real-world tasks! Agent Reinforcement Trainer (ART) is a framework to train multi-step LLM agents for real-world tasks using GRPO. Just a few lines of code. No manual rewards needed. vLLM + Unsloth combined 🚀 100% open-source.

Akshay 🚀

38,297 views • 4 months ago

We open-sourced QeRL (Quantization-enhanced Reinforcement Learning)! 🧠 4-bit quantized RL training 💪 Train a 32B LLM on a single H100 GPU ⚙️ 1.7× faster overall training 🎯 Accuracy on par with bfloat16 🔥 Supports NVFP4 quantization format. Moreover, we show that quantization helps exploration in RL training. Paper: Code: #NVIDIA #AIResearch #ReinforcementLearning #Quantization #LLM #EfficientAI

Yukang Chen

69,720 views • 8 months ago

can you chat privately with a cloud llm—*without* sacrificing speed? excited to release minions secure chat: an open-source protocol for end-to-end encrypted llm chat with <1% latency overhead (even @ 30B+ params!). cloud providers can’t peek—messages decrypt only inside a secure gpu enclave, where inference stays fully confidential 🤯 links + code in comments👇

Avanika Narayan

79,190 views • 1 year ago