Reasoning LLMs generate very long chains-of-thought, so even small quantization errors add up. With AWQ, Qwen3-4B drops 71.0 → 68.2 on MMLU-Pro (~4% relative loss). 😬 ParoQuant fixes this! It keeps only the critical rotation pairs and fuses everything into a single kernel. Recovers most of the lost reasoning...
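The relative loss quoted in the post can be checked with a couple of lines of arithmetic. This is only a sketch of that calculation; the function name `relative_drop` is a label chosen here, not something from the post.

```python
def relative_drop(baseline: float, quantized: float) -> float:
    """Return the accuracy lost as a fraction of the baseline score."""
    return (baseline - quantized) / baseline


# MMLU-Pro scores quoted in the post for Qwen3-4B before and after AWQ.
drop = relative_drop(71.0, 68.2)
print(f"{drop:.1%}")  # about 3.9%, which the post rounds to ~4%
```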

Zhijian Liu

6,667 subscribers

171,119 views • 5 months ago • via X (Twitter)

Education Science & Technology

Related Videos

Which LLM reasons best when it doesn't have all the information? Enter LLM Poker Arena to find out. It's a poker-playing benchmark where top reasoning models play Texas Hold'em against each other. Claude Opus 4.5, GPT-5.2, Gemini 2.5 Pro, and Grok 4 all sit at the same table and play full tournaments to see who finishes with the chips. Poker makes different demands on reasoning: a player has to balance probabilistic reasoning, opponent modeling, and decision-making under uncertainty. That makes it an interesting evaluation, because it tests reasoning under incomplete information, something most coding benchmarks do not capture.
In these tournaments, the rules are:
- Each LLM starts with $1,000 in chips
- Small and big blinds start at $25 / $50
- Blinds double every 3 minutes
- All models run in their reasoning or thinking modes
After the first 5 tournaments:
- Claude Opus 4.5 with Thinking has 3 wins
- GPT-5.2 has 2 wins
- Grok 4 and Gemini 2.5 Pro have 0 wins
Early results suggest Claude performs quite well at poker too, though five tournaments is a very small sample. Planning to run many more tournaments, publish the full benchmark data, and add a prediction market on top of it. Thanks for the suggestion clipz. Much more coming as part of Poker Cities! This was built on Replit using their AI integrations, which made it straightforward to connect Claude, GPT, and Gemini. What model do you think wins after 100 tournaments?
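To get a feel for how quickly the blind schedule in the rules escalates, here is a minimal sketch. It assumes the blinds double exactly at each 3-minute boundary, starting from $25 / $50; the function `blinds_at` and its parameters are illustrative names, not part of the benchmark's code.

```python
def blinds_at(elapsed_minutes: float,
              small_blind: int = 25,
              big_blind: int = 50,
              level_minutes: int = 3) -> tuple[int, int]:
    """Return (small, big) blinds after `elapsed_minutes` of play."""
    if elapsed_minutes < 0:
        raise ValueError("elapsed_minutes must be non-negative")
    # Number of completed levels; each completed level doubles the blinds.
    level = int(elapsed_minutes // level_minutes)
    factor = 2 ** level
    return small_blind * factor, big_blind * factor


for minute in (0, 3, 6, 9, 12):
    print(minute, blinds_at(minute))
```

Under this reading of the rule, the big blind reaches $800 at minute 12, which is already close to the $1,000 starting stack, so tournaments are forced to resolve quickly.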

Anshul Dhawan

32,192 views • 6 months ago

timelapse #85 (27.5 hrs):
- currently cant rely on any other coding models except grok code fast 1 + grok 4 fast (for complex reasoning grok 4 fast is 20 cents for 1M tokens)
- wrote qwen3-next trainer entirely from scratch to make it more manageable
- each piece completely done by grok-code-fast-1 in cursor as it seems to handle this task pretty well without the grok 4 fast reasoning
- take on smaller problems and complete them quickly (makes it easier with 400 toks/sec over the api)
- got distributed fp8 qwen3-next trainer running at 0.8 seconds per step on 8xH100s (still need to finish checkpoint loading logic)
- perfect timing as the fp8 version of qwen3-next drops as im writing this
- ill be in LA in 2 days (will visit SF mid way through as well)
- 12.5% margarita
- steak dinner with family
- gained intuition on FlashAttention in very long context settings
- caught up w/ Kearm h/eng and Arnie Ramesh

Elliot Arledge

283,820 views • 10 months ago

Cerebras inference is very fast. So fast that it changes how we think about configuring our LLMs for voice agent use cases. Kimi K2.6 is a 1T parameter reasoning model that Cerebras serves at 650 - 1,000 tokens per second (end-to-end throughput), with time to first token metrics as low as 150ms (latency). These numbers are two to three times faster than other similarly capable models.
The biggest lever we get from this kind of speed is that we can use the model in reasoning mode, and still have excellent "time to first non-thinking token." This solves a big pain point we have in 2026 for voice agent use cases. Almost all recent innovation in post-training has focused on making models good at reasoning ("test time compute"). This is great, but it makes the user-facing model latency much, much slower, which is a problem for conversational voice agents.
We can run Kimi K2.6 with reasoning turned on, and get responses faster than other models produce with reasoning disabled. On my 30-turn voice agent benchmark, Kimi K2.6 with reasoning enabled ties GPT 5.1 and Haiku 4.5 with reasoning disabled, and is still about 200ms faster!
On my primary task agent benchmark, Kimi K2.6 is now the #2 model. It ranks just behind Gemini 3.5 Flash in "high" reasoning mode, and tied with GLM 5, Sonnet 4.6, and GPT 5.4 with reasoning set to "low." But Kimi K2.6 completes each turn in the agent loop in under 500ms. The other four models are all at least 3x slower. (Models only qualify for this benchmark if they can complete task turns at a P50 <4s.)
A couple of other things that this speed buys us, for production voice agents:
- Tool calls happen fast enough that we don't have to work around tool call latency in our pipeline design.
- We can prompt the model to output structured data at the beginning of a response, followed by plain text for voice generation. This opens up possibilities like asking the model to do complex classification/generation tasks that influence the rest of the pipeline. For example, the model could create a detailed style prompt for a steerable TTS model, for each individual conversation turn.
And, of course, you can use Kimi K2.6 with reasoning turned off. Cerebras calls this "instant" mode. Here's a video of a Cerebras Kimi K2.6 voice agent with voice-to-voice response time, measured at the client, under 500ms. This is the true response latency as perceived by the user, including all network and audio codec overhead, transcription and turn detection, Kimi K2.6 token generation, and voice generation. 500ms is, effectively, instant. So the Cerebras naming for this mode is apropos. :-)
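The "structured data first, plain text after" idea from the post could look roughly like the sketch below. The response format is an assumption made for illustration (a single-line JSON object, a newline, then the text to speak), and `split_response` and the `tts_style` key are hypothetical names, not part of any Cerebras or Kimi API.

```python
import json


def split_response(raw: str) -> tuple[dict, str]:
    """Split a model response into (metadata, spoken_text).

    Assumes the first line is a JSON object and the rest is plain text.
    """
    header, _, body = raw.partition("\n")
    try:
        metadata = json.loads(header)
    except json.JSONDecodeError:
        # No valid header: treat the whole response as text to speak.
        return {}, raw.strip()
    if not isinstance(metadata, dict):
        # Valid JSON but not an object (e.g. a bare number): same fallback.
        return {}, raw.strip()
    return metadata, body.strip()


raw = '{"tts_style": "warm, unhurried"}\nSure, I can help you rebook that flight.'
metadata, spoken = split_response(raw)
print(metadata["tts_style"])
print(spoken)
```

In a streaming pipeline the header could be parsed as soon as the first newline arrives, so the style prompt would be available before the spoken text has finished generating.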

kwindla

40,319 views • 2 months ago

offers full-stack privacy across queries, data, payments and devices / hardware:
1. The best open-source models (e.g. Kimi K2.5) served in secure enclaves (TEEs)
2. The best closed-source models (e.g. GPT-5.2 Pro) with queries pooled into proxy servers and PII anonymized with a specialized Silo model
3. The only available implementation of private Deep Research for secure reasoning
4. Fully private crypto payments via Zcash or discounted payments with FAI
5. A hardware offering to run everything on-prem from your own home with 288 GB GPU RAM (Silo Box)
All with encrypted sync between devices so users can uniformly + privately access their AI sessions

Freysa

16,647 views • 4 months ago

OpenAI just announced API access to o1 (advanced reasoning model) yesterday. I'm delighted to announce today a new short course, Reasoning with o1, built with OpenAI, and taught by Colin Jarvis, Head of AI Solutions at OpenAI, to show you how to use this effectively!
Unlike previous language models which generate output directly, o1 “thinks before it responds,” and generates many reasoning tokens before returning a more thoughtful and accurate response. It is great at complex reasoning -- including planning for agentic workflows, coding, and domain-specific reasoning in fields like law and STEM. But how you should use it is quite different from other LLMs. I think o1 will be a game changer for many AI applications; and in this course, you'll learn how to use it effectively.
In detail, you’ll:
- Learn to recognize what tasks o1 is suited for, and when to use a smaller model, or combine o1 with a smaller model
- Understand the new principles of prompting reasoning models: Be simple and direct; no explicit chain-of-thought required; use structure; show rather than tell
- Implement multi-step orchestration in which o1 plans, and hands tasks over to gpt-4o-mini to execute specific steps; this illustrates a design pattern to optimize intelligence (accuracy) and cost
- Use o1 for a coding task to build a new application, edit existing code, and test performance by running a coding competition between o1-mini and GPT-4o
- Use o1 for image understanding and learn how it performs better with a "hierarchy of reasoning," in which it incurs the latency and cost upfront, preprocessing the image and indexing it with rich details so it can be used for Q&A later
- Learn a technique called meta-prompting, in which you use o1 to improve your prompts. Using a customer support evaluation set, you'll iteratively use o1 to modify a prompt to improve performance
You'll also learn about how OpenAI used reinforcement learning to produce a model that uses "test-time compute" to improve performance. I think you'll find this course enjoyable and valuable. Please sign up for it here:
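The plan-then-execute orchestration pattern mentioned in the course outline can be sketched without any API calls. Everything here is an assumption for illustration: `planner` stands in for a call to a reasoning model such as o1, `executor` for a call to a smaller model such as gpt-4o-mini, and the fake functions exist only so the sketch runs offline. It is not the course's code.

```python
from typing import Callable


def orchestrate(task: str,
                planner: Callable[[str], list[str]],
                executor: Callable[[str], str]) -> list[str]:
    """Ask the planner for steps, then run each step with the executor."""
    steps = planner(task)
    return [executor(step) for step in steps]


# Stand-ins so the sketch runs without network access or API keys.
def fake_planner(task: str) -> list[str]:
    return [f"{task}: step {i}" for i in (1, 2, 3)]


def fake_executor(step: str) -> str:
    return f"done({step})"


results = orchestrate("draft release notes", fake_planner, fake_executor)
print(results)
```

The design choice is that the expensive model is called once per task while the cheap model is called once per step, which is where the cost saving in this pattern would come from.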

Andrew Ng

357,661 views • 1 year ago

🚀 New Paper: Pixel Reasoner 🧠🖼️ How can Vision-Language Models (VLMs) perform chain-of-thought reasoning within the image itself? We introduce Pixel Reasoner, the first open-source framework that enables VLMs to “think in pixel space” through curiosity-driven reinforcement learning.
Current VLMs reason only in text: even when grounded in rich images or videos, their logical steps are verbalized in natural language. This restricts their ability to interrogate visual evidence and demonstrate how conclusions are drawn.
🔍 So we ask: What if we could make VLMs "show their work" by reasoning directly in the pixel space? Inspired by GPT-o3’s "think-in-image" ability, we propose a framework where VLMs use interactive visual operations (zoom, select-frame, highlight) to reason through complex visual inputs.
To do this, we design a two-stage training process:
1. Instruction tuning with synthesized visual reasoning traces.
2. Reinforcement learning with curiosity-driven reward to balance exploration between pixel and text reasoning.
✨ With this, Pixel Reasoner achieves near-SoTA performance on many information-rich multimodal benchmarks:
📊 84% on InfographicsVQA
🧠 84% on V* benchmark
🧩 74% on TallyQA-Complex
It also achieves strong accuracy of 68% on MVBench (a video benchmark).
Website: Paper: Code: Demo: (coming soon)

Wenhu Chen

82,829 views • 1 year ago

The DeepSeek-R1 paper is a gem! Highly encourage everyone to read it. It's clear that LLM reasoning capabilities can be learned in different ways. RL, if applied correctly and at scale, can lead to some really powerful and interesting scaling and emergent properties. There is more to RL than meets the eye! Here is my breakdown of the paper along with a few tests: The multi-stage training might not make sense initially, but it provides clues on optimizations that we can continue to tap into. Data quality is still very important for enhancing the usability of the LLM. Unlike other reasoning LLMs, DeepSeek-R1's training recipe and weights are open so we can build on top of it. This opens up exciting research opportunities. About the attached clip: the previous preview model wasn't able to solve this task. DeepSeek-R1 can solve this and many other tasks that o1 can solve. It's a very good model for coding and math.

elvis

140,692 views • 1 year ago

New Course: Reinforcement Fine-Tuning LLMs with GRPO! Learn to use reinforcement learning to improve your LLM performance in this short course, built in collaboration with Predibase by Rubrik, and taught by Travis Addair, its Co-Founder and CTO, and Arnav Garg, its Senior Engineer and Machine Learning Lead.
Reasoning models have been one of the most important developments in LLMs. Reinforcement Fine-Tuning (RFT) uses rewards to encourage LLMs to find solutions to multi-step reasoning tasks such as solving math problems and debugging code - without needing pre-existing training examples like in traditional supervised fine-tuning.
Group Relative Policy Optimization (GRPO) is a reinforcement fine-tuning algorithm gaining rapid adoption. Developed by the DeepSeek team and used to train the R1 reasoning model, GRPO uses reward functions that you can write in Python to assign rewards to model responses. It’s beneficial for tasks with verifiable outcomes and can work well even with fewer than 100 training examples. It can also significantly improve the reasoning ability of smaller LLMs, making applications faster and more cost effective.
In this course, you’ll take a technical deep dive into RFT with GRPO. You’ll learn to build reward functions that you can use in the GRPO training process to guide an LLM toward better performance on multi-step reasoning tasks.
In detail, you’ll:
- Learn when reinforcement fine-tuning is a better fit than supervised fine-tuning, especially for tasks involving multi-step reasoning or limited labeled data.
- Understand how GRPO uses programmable reward functions as a more scalable alternative to the human feedback required for other reinforcement learning algorithms, such as RLHF and DPO.
- Frame the Wordle game as a reinforcement fine-tuning problem and see how an LLM can learn to plan, analyze feedback, and improve its strategy over time.
- Design reward functions that power the reinforcement fine-tuning process.
- Learn techniques for evaluating more subjective tasks, such as rating the quality of a text summary, using an LLM as a judge.
- Understand why reward hacking happens and how to avoid it by adding penalty functions to discourage undesirable behaviors.
- Learn the four key components of the loss calculation in the GRPO algorithm: token probability distribution ratios, advantages, clipping, and KL-divergence.
- Launch reinforcement fine-tuning jobs using Predibase’s hosted training services.
By the end of this course, you’ll be able to build and fine-tune LLMs using reinforcement learning to improve reasoning without relying on large labeled datasets or subjective human feedback. Please sign up here:
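The pieces named in the course outline (a Python reward function, group-relative advantages, a probability ratio with clipping, and a KL term) can be sketched in a few small functions. This is a per-token illustration under stated assumptions, not the course's or Predibase's code: `wordle_reward` is a toy reward invented here, and a full GRPO loss would also average over tokens and responses and weight the KL term with a coefficient.

```python
import math
from statistics import mean, pstdev


def wordle_reward(guess: str, secret: str) -> float:
    """Toy reward: fraction of letters in the correct position."""
    if len(guess) != len(secret):
        return 0.0  # malformed guess earns nothing
    return sum(g == s for g, s in zip(guess, secret)) / len(secret)


def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its own group's mean and std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """PPO-style clipped surrogate for one token."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)


def kl_penalty(logp_policy: float, logp_ref: float) -> float:
    """Non-negative per-token KL estimate between policy and reference."""
    d = logp_ref - logp_policy
    return math.exp(d) - d - 1.0


# One "group": several sampled responses to the same prompt.
guesses = ["crane", "crate", "slate", "brick"]
rewards = [wordle_reward(g, "crate") for g in guesses]
advantages = group_advantages(rewards)
print(rewards)
print([round(a, 2) for a in advantages])
```

Because advantages are computed relative to the other responses in the same group, no separate value model is needed; a response is pushed up only if it scored better than its siblings.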

Andrew Ng

86,457 views • 1 year ago

Today, we’re launching Gemma 4, our most intelligent open models to date. Built with the same breakthrough technology as Gemini 3, Gemma 4 brings advanced reasoning to your personal hardware and devices. Here’s what Gemma 4 unlocks for developers:
- Intelligence-per-parameter: Our 31B (Dense) and 26B (MoE) models deliver state-of-the-art performance for their size, outcompeting models 20x their size on Arena.ai
- Commercial flexibility: Released under a permissive Apache 2.0 license for complete developer flexibility and digital sovereignty
- Agentic workflows: Native support for function-calling and structured JSON output allows you to build reliable, autonomous agents
- Multimodal edge AI: The E2B and E4B models bring native vision, audio, and low latency to mobile and IoT devices
- Long-context reasoning: Up to 256K context windows allow you to process entire repositories or large documents in a single prompt
Whether you're building global applications in 140+ languages or local-first AI code assistants, Gemma 4 is built to be your foundation. Explore in Google AI Studio or download the weights on Hugging Face, Kaggle, and Ollama.
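On the application side, function calling with structured JSON output usually comes down to parsing the model's output and dispatching to a registered function. The sketch below assumes a hypothetical call shape of `{"name": ..., "arguments": {...}}`; that shape, the `get_weather` tool, and the `dispatch` helper are illustrative and are not taken from Gemma's documentation.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool; a real one would call a weather service."""
    return f"Sunny in {city}"


# Registry of functions the model is allowed to call.
TOOLS = {"get_weather": get_weather}


def dispatch(model_output: str) -> str:
    """Parse a JSON tool call and run the matching registered function."""
    call = json.loads(model_output)
    name = call["name"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**call.get("arguments", {}))


print(dispatch('{"name": "get_weather", "arguments": {"city": "Lisbon"}}'))
```

Keeping an explicit registry means the model can only trigger functions the developer has opted in, which matters once an agent runs autonomously.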

Google AI

1,669,580 views • 3 months ago

For two decades, our community has been at the heart of every adventure, every story, and every challenge we’ve taken on 💗 Regardless of whether you’ve been with us since the very beginning or only just recently joined the community — you’re all part of our legacy. So thank you, stay strong, stay awesome, and keep the passion going! 💪✨ A big part of this community is also our moderation team. So, as we continue celebrating the 20 years we've spent together, we asked them to share their stories 👇

CD PROJEKT RED

143,256 views • 1 year ago

lol he knows about his fight vs the mic and was reasoning out “tomorrow, tomorrow, we’ll see if I end up fighting with him. Do you know why we fought last time? It was because it had rhinestones here, and every time I plugged in the microphone, this part was really hard to pull open. Though normally it’s already not easy to pull apart, so after it started, I hurried to plug in the mic, got a bit anxious, and the mic bumped into his, so it took a while to fix. I was a bit angry then. But this time, today, I’ll get everything ready in advance, so when we go on stage, it’ll be more at ease. Yes.”

jam 🐟

10,826 views • 9 months ago

New course: Transformers in Practice. You'll get a practical view of how transformer-based LLMs work, so you can reason about their behavior, diagnose problems like slow inference, and make smarter decisions about deployment. This course is built in partnership with AMD and taught by Sharon Zhou. You'll see how transformers generate text one token at a time, how the model decides which earlier words matter most when predicting the next one, and how techniques like quantization speed up inference on GPUs. This is not a video-only course; interactive visualizations throughout let you play with these concepts and build intuition that sticks.
Skills you'll gain:
- Understand why LLMs hallucinate, and how RAG and chain-of-thought shape what they generate
- Look inside the model to see how attention and layers combine to predict the next token
- Diagnose inference bottlenecks and learn the techniques that speed up transformers on GPUs
Join and understand what's really happening inside your LLMs:
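How a model "decides which earlier words matter most" can be illustrated with scaled dot-product attention weights for a single query. This is a minimal pure-Python sketch with made-up two-dimensional vectors; it is not material from the course, and real models use learned projections, many heads, and batched tensor operations.

```python
import math


def attention_weights(query: list[float], keys: list[list[float]]) -> list[float]:
    """Softmax of scaled dot products between one query and each key."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    # Subtract the max score before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vectors: the first key points the same way as the query,
# the second is orthogonal to it, the third points the opposite way.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights = attention_weights(query, keys)
print([round(w, 3) for w in weights])
```

The earlier token whose key aligns best with the query receives the largest weight, so its value vector contributes most to the next-token prediction.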

Andrew Ng

120,728 views • 2 months ago

built and deployed an AI Ad Analysis Tool on my vibe marketing resource site today... steps: 1) I've managed $150m+ in campaigns so used my knowledge to have Manus build out a super detailed scope using weights for various factors 2) loaded the scope into Replit and it basically one shotted the app! it uses reasoning models from OpenAI 3) went back and forth with Replit agent to update the overall design and UX 4) now we have tools, super cool. did some testing and everything is working with beehiiv Give it a try and lmk what you think... THIS IS VIBE MARKETING. Stack: - Replit - Manus - OpenAI - beehiiv

The Boring Marketer

17,832 views • 1 year ago

Thanksgiving-week treat: an epic conversation on Frontier AI with Łukasz Kaiser, co-author of “Attention Is All You Need” (Transformers) and leading research scientist at OpenAI working on GPT-5.1-era reasoning models.
00:00 – Cold open and intro
01:29 – “AI slowdown” vs a wild week of new frontier models
08:03 – Low-hanging fruit, infra, RL training and better data
11:39 – What is a reasoning model, in plain language
17:02 – Chain-of-thought and training the thinking process with RL
21:39 – Łukasz’s path: from logic and France to Google and Kurzweil
24:20 – Inside the Transformer story and what “attention” really means
28:42 – From Google Brain to OpenAI: culture, scale and GPUs
32:49 – What’s next for pre-training, GPUs and distillation
37:29 – Can we still understand these models? Circuits, sparsity and black boxes
39:42 – GPT-4 → GPT-5 → GPT-5.1: what actually changed
42:40 – Post-training, safety and teaching GPT-5.1 different tones
46:16 – How long should GPT-5.1 think? Reasoning tokens and jagged abilities
47:43 – The five-year-old’s dot puzzle that still breaks frontier models
52:22 – Generalization, child-like learning and whether reasoning is enough
53:48 – Beyond Transformers: ARC, LeCun’s ideas and multimodal bottlenecks
56:10 – GPT-5.1 Codex Max, long-running agents and compaction
1:00:06 – Will foundation models eat most apps? The translation analogy and trust
1:02:34 – What still needs to be solved, and where AI might go next

Matt Turck

168,007 views • 8 months ago

New LLMs that control UIs! ByteDance Research releases UI-TARS, a fine-tuned GUI agent that integrates reasoning and action capabilities into a single vision-language model. Think of computer use but open. 👀 TL;DR: 3️⃣ Available in 3 sizes: 2B, 7B, and 72B parameters 🧠 Trained Qwen2-VL models with SFT & DPO 🥇 72B version achieves 82.8% on VisualWebBench (beating GPT-4 and Claude) 🏆 Achieves state-of-the-art results on 10+ GUI agent benchmarks 💡 Reasons before taking an action 🧑🏻‍💻 Can click, long-press, type, scroll, open apps, navigate back/home, and wait 🤗 Released under Apache 2.0 on Hugging Face

Philipp Schmid

48,170 views • 1 year ago

The chaos inside of a tornado debris field is unfathomable. Some tornadoes can be so strong and sit in one spot for so long that they shred everything in their path into small pieces. In 2020 when I accidentally shot this tornado in slow-mo at 240 FPS, I originally was a little irritated with myself. However, it's the only 240 FPS piece of footage I have ever shot of a tornado, and it offers a unique and rare look into the chaos of a monster. So much going on inside of the debris field of this tornado near Scarth, MB. This tornado I call the Green Screen tornado. Many people accused me of using a green screen when shooting this video. These days I would be calling it the AI Tornado.

Aaron Jayjack

66,629 views • 8 months ago

PrismML Releases Bonsai 27B: 1-bit and Ternary Builds of Qwen3.6-27B Hitting 89.5% of FP16 at 3.9GB. No new pretrain. No higher-precision escape hatches. No multi-GPU rig. Here's how it works. 👇 1: Codes, not floats Every weight becomes a code, with one shared FP16 scale per group of 128. Ternary is {−1, 0, +1}, binary is {−1, +1}. Sharing the scale across 128 weights keeps its cost at 16/128 = 0.125 bits. → Ternary: log2(3) + 16/128 ≈ 1.71 bits/weight → 5.9GB → Binary: 1 + 16/128 = 1.125 bits/weight → 3.9GB 2: Post-training, not from scratch No BitNet-style low-bit pretrain. It starts from off-the-shelf Qwen3.6-27B, architecture unchanged. The representation runs end to end across embeddings, attention projections, MLP projections, and the LM head. → 9.4× (ternary) and 14.2× (binary) vs the 54GB FP16 baseline 3: Labels are not bit-widths Conventional low-bit builds are mixed-precision by construction. The advertised name describes the most-compressed tensors, not the model. → Q4_K_XL, labeled "4-bit," is really 5.2 bits/weight at 17.6GB → IQ2_XXS, labeled "2-bit," is really 2.8 bits/weight at 9.4GB 4: Fitting a phone is two budgets iOS caps a single app near half of RAM, so a 12GB iPhone exposes ~6GB. The KV cache grows on top. Hybrid attention at ~75% linear means only 16 of 64 layers cache. → 4-bit KV: 4.3GB at 262K context, down from 17.2GB → 11.0 tok/s on iPhone 17 Pro Max 5: The numbers (15 benchmarks, thinking mode) → Ternary: 80.49 avg at 5.9GB — 94.6% of FP16 → 1-bit: 76.11 avg at 3.9GB — 89.5% of FP16 → IQ2_XXS falls to 57.5 on AIME26 while still scoring 88.93 on MMLU-Redux The key takeaway: 27B-class reasoning without the 54GB checkpoint — group-wise ternary and binary codes, an end-to-end low-bit language stack, 4-bit KV, on one phone. Full analysis: Repo: Model weight: Technical details: PrismML
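
The bits-per-weight arithmetic in point 1 can be checked with a few lines of Python. This is only a sketch of the arithmetic quoted in the post, not PrismML's code: the function names are hypothetical, and the ternary figure assumes ideal packing at log2(3) bits per code, as the post's formula does. The compression ratios are taken against a 16-bit baseline per weight.

```python
import math

GROUP_SIZE = 128   # weights sharing one scale (from the post)
SCALE_BITS = 16    # one FP16 scale per group (from the post)
FP16_BITS = 16     # baseline cost per weight

def bits_per_weight(levels: int) -> float:
    """Cost of one code with `levels` values plus the amortized group scale.

    Hypothetical helper; assumes codes are packed at the ideal log2(levels) bits.
    """
    return math.log2(levels) + SCALE_BITS / GROUP_SIZE

def compression_vs_fp16(levels: int) -> float:
    """How many times smaller than FP16 the representation is, per weight."""
    return FP16_BITS / bits_per_weight(levels)

ternary_bpw = bits_per_weight(3)   # codes in {-1, 0, +1}
binary_bpw = bits_per_weight(2)    # codes in {-1, +1}

print(f"ternary: {ternary_bpw:.2f} bits/weight, {compression_vs_fp16(3):.1f}x vs FP16")
print(f"binary:  {binary_bpw:.3f} bits/weight, {compression_vs_fp16(2):.1f}x vs FP16")
```

The ratios come out to roughly 9.4× and 14.2×, matching the figures in point 2, which suggests those figures are computed from bits per weight rather than from the rounded file sizes.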

Marktechpost AI

31,860 views • 15 days ago

I have been testing DeepSeek-V4-Pro with the Pi coding agent. I am mindblown by how well it works out of the box. A few notes: I spent a few hours building an LLM wiki with an agent powered entirely by DeepSeek-V4-Pro on Fireworks AI inference. This is the first time I feel like there is an open-weight model that can reason at the level of Claude and Codex. And it does this in a cost-effective way with support for 1M context length. To be clear, I am using DeepSeek-V4-Pro inside of Pi without any special configuration. It works out of the box. It's exciting that there is a model that can just be plugged into a basic harness like Pi, and it just works. I've never seen that before. Most models require lots of configuration and setup. DeepSeek-V4-Pro is clearly good at agentic coding (probably the best among the open-weight models), but the model is also great on knowledge-intensive tasks where reasoning matters. The agent pulled agentic engineering best practices from different company docs (Anthropic, OpenAI, Google, Stripe, Meta, Modal, DeepSeek, Mistral, Cohere), searched and digested Reddit and HN threads, summarized arXiv papers, and surfaced trending GitHub repos. Then it distilled everything into actionable tips across categories. I love the wiki it built. The quality is really good. Here is a snapshot of what the wiki looks like: DeepSeek-V4-Pro handled the task without breaking stride. Multi-step research queries, code generation for scaffolding, context-heavy reasoning across disparate sources. For coding specifically, this is the first open-weight model that genuinely feels like a Codex or Claude Code experience. It is comparable in both capability and actual multi-turn agentic work. What made the loop feel so responsive was Fireworks' inference speed (the fastest in the market) and the fact that they actually validate models at the systems level before shipping. No corrupted reasoning traces. Just fast, reliable iteration. 
The hybrid CSA and HCA attention design cuts KV cache to just 10% and inference FLOPs by nearly 4x at 1M-token context. This is what makes the agent loop actually fast and cheap enough to run in practice. For devs who've been watching open-weight models close the gap but haven't found one that actually delivers in practice, this is the closest I've seen. Try it here:

elvis

59,803 views • 3 months ago