
Reasoning LLMs generate very long chains-of-thought, so even small quantization errors add up. With AWQ, Qwen3-4B drops 71.0 → 68.2 on MMLU-Pro (~4% relative loss). 😬 ParoQuant fixes this! It keeps only the critical rotation pairs and fuses everything into a single kernel. Recovers most of the lost reasoning...
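The "~4% relative loss" figure can be checked from the two scores quoted above. A minimal sketch; the helper name `relative_loss` is hypothetical and only the two MMLU-Pro numbers come from the post:

```python
def relative_loss(baseline: float, quantized: float) -> float:
    """Return the drop from baseline to quantized as a percentage of baseline."""
    return (baseline - quantized) / baseline * 100.0


baseline_score = 71.0  # Qwen3-4B on MMLU-Pro, as quoted in the post
awq_score = 68.2       # same model after AWQ quantization, as quoted in the post

absolute_drop = baseline_score - awq_score
print(f"absolute drop: {absolute_drop:.1f} points")
print(f"relative loss: {relative_loss(baseline_score, awq_score):.1f}%")
```

The absolute drop is 2.8 points, which is about 3.9% of the 71.0 baseline, consistent with the rounded "~4%" in the post.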

170,758 views • 3 months ago • via X (Twitter)


Related videos

🔥 Battle for the top reasoning LLM intensifies! The QwQ-32B-Preview is a very good reasoning LLM. Full video of my tests here:

Summary of my findings and thoughts:

It was able to solve a couple of hard math problems, so it looks very promising for maths. It didn’t do so well on my coding task (generating a bash script). Going by the results reported on LiveCodeBench, it has room for improvement.

One thing that’s become very clear to me is that the reasoning capabilities of these LLMs are significantly closing the gap between open and closed-source models. The competition is now going to be on a different level, focused on which model produces the most efficient, optimized, accurate, and fastest reasoning steps, beyond just accurate responses. That's what developers will care about. Traditional benchmarks are not going to be good enough for this.

On that note, it's getting harder to assess these models, especially the consistency, efficiency, and quality of their reasoning steps. After experimenting with this model, I realized that the reasoning paths are not fully optimized, and a lot more optimization needs to happen before these models are used in production settings. There might be a need to build some type of native, efficient self-assessment or self-reflection capability that prevents these reasoning LLMs from going in loops or producing unnecessarily lengthy sequences.

I also noticed that this model, at least in the HF demo, doesn’t separate the reasoning from the response. I think that actually hurts the performance of the model. On the other hand, o1 and R1 do that really well. In addition, I believe the training on reasoning is hurting the performance of the LLM in other areas such as helpfulness (check the code example in the video).

Something that’s necessary at the moment is validating or evaluating the quality of the reasoning chains and figuring out a better strategy to optimize them. Current methods are probably not sufficient to solve this problem, but that's where innovation will come next.

I recognize that this is a first effort, so kudos to the Qwen team on this release. These issues highlight the importance of transparency with reasoning LLMs. We need to know how the model was trained, and with what exact data or optimization strategy. Understanding that will enable researchers and developers to build better intuition and improve the reasoning capabilities and components at a faster rate. There is an opportunity for someone or a company to build a truly open reasoning LLM.

The race is on! I will continue to track the state of the art in reasoning LLMs and report my takes and observations here. Stay tuned for more.

elvis

14,740 views • 1 year ago

Which LLM reasons best when it doesn't have all the information? Enter LLM Poker Arena to find out. It's a poker-playing benchmark where top reasoning models play Texas Hold'em against each other. Claude Opus 4.5, GPT-5.2, Gemini 2.5 Pro, and Grok 4 all sit at the same table and play full tournaments to see who finishes with the chips.

Poker is very different when it comes to reasoning. A player has to balance probabilistic reasoning and opponent modeling, and make decisions under uncertainty. Poker is an interesting evaluation because it tests reasoning under incomplete information, something most coding benchmarks do not capture.

In these tournaments the rules are:
- Each LLM starts with $1,000 in chips
- Small and big blinds start at $25 / $50
- Blinds double every 3 minutes
- All models run in their reasoning or thinking modes

After the first 5 tournaments:
- Claude Opus 4.5 with Thinking has 3 wins
- GPT-5.2 has 2 wins
- Grok 4 and Gemini 2.5 Pro have 0 wins

Early results suggest Claude performs quite well at poker too, although five tournaments is a very small sample. Planning to run many more tournaments, publish the full benchmark data, and add a prediction market on top of it. Thanks for the suggestion clipz. Much more coming as part of Poker Cities!!

This was built on Replit using their AI integrations, which made it straightforward to connect Claude, GPT, and Gemini.

What model do you think wins after 100 tournaments?
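To see how quickly the blind schedule in the rules above escalates, here is a minimal sketch. It assumes blinds double at exact 3-minute boundaries counted from the start of the tournament; the function name `blinds_at` and that timing convention are illustrative assumptions, not details from the post.

```python
def blinds_at(elapsed_minutes: float,
              small: int = 25,
              big: int = 50,
              double_every: float = 3.0) -> tuple[int, int]:
    """Return (small blind, big blind) after the given number of minutes.

    Assumption: blinds double at each full `double_every` interval,
    counted from the start of the tournament.
    """
    level = int(elapsed_minutes // double_every)  # completed doubling intervals
    factor = 2 ** level
    return small * factor, big * factor


for minute in (0, 3, 6, 9, 12, 15):
    sb, bb = blinds_at(minute)
    print(f"minute {minute:>2}: blinds ${sb} / ${bb}")
```

Under this assumption the big blind reaches $1,600 at the 15-minute mark, more than the $1,000 starting stack, so the structure forces short, aggressive tournaments.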

Anshul Dhawan

31,651 views • 4 months ago