
Hao AI Lab
@haoailab • 6,267 subscribers
Hao AI Lab at UCSD. Our mission is to democratize large machine learning models, algorithms, and their underlying systems.

When Ilya Sutskever explained why next-word prediction leads to intelligence, he offered a metaphor: if you can piece together the clues and deduce the criminal’s name on the last page, you have a real understanding of the story. 🕵️‍♂️ Inspired by that idea, we turned to Ace Attorney to test AI's reasoning. It’s the perfect stage: the AI plays a detective who collects clues, exposes contradictions, and uncovers the truth. We put the latest top AI models (GPT-4.1, Gemini 2.5 Pro, Llama-4 Maverick, and more) to the test in Ace Attorney, to see if they could shout Objection! ⚖️, turn the case around, and uncover the truth behind the lies.
Hao AI Lab • 998,910 views • 1 year ago

(1/N) We're launching Dreamverse. Most AI video models take minutes to generate a 5 s 1080p clip. In 4.5 seconds, we can generate 30 s 1080p clips on a single GPU. Our videos generate faster than you can watch them: stop waiting on prompts and start directing scenes live. 🕹️Demo: 📑 Blog: Welcome to the era of vibe-directing 👇
Hao AI Lab • 89,012 views • 2 months ago

Claude-3.7 was tested on Pokémon Red, but what about more real-time games like Super Mario 🍄🌟? We threw AI gaming agents into LIVE Super Mario games and found Claude-3.7 outperformed other models with simple heuristics. 🤯 Claude-3.5 is also strong, but less capable of planning complex maneuvers. Gemini-1.5-pro and GPT-4o perform less well.
Hao AI Lab • 234,336 views • 1 year ago

(1/5) FP4 hardware is here, but 4-bit attention still kills model quality, blocking true end-to-end FP4 serving. To fix that, we propose Attn-QAT, the first systematic study of quantization-aware training for attention. The result: FP4 attention quality is comparable to BF16 attention with 1.1x–1.5x higher throughput than SageAttention3 on an RTX 5090 and 1.39x speedup over FlashAttention-4 on a B200. Blog: Code: Checkpoints:
Hao AI Lab • 36,987 views • 1 month ago

(1/N) Content creators have been stuck with costly and slow video generation APIs for far too long. We couldn’t take it anymore.😅😭 FastVideo’s new real-time inference stack has the fastest 1080p TI2AV pipeline ever.😍🚀🚀 Our optimized LTX-2.3 pipeline creates 5-second 1080p videos with audio in 4.55 s, on a single GPU! 3.9x faster than the next fastest option. 🕹️Live demo: 📜Blog:
Hao AI Lab • 29,272 views • 2 months ago

🔥 Pokémon Red is becoming a go-to benchmark for testing advanced AIs such as Gemini. But is Pokémon Red really a good eval? We study this problem and identify three issues: 1️⃣ Navigation tasks are too hard. 2️⃣ Combat control is too simple. 3️⃣ Raising a strong Pokémon team is slow and expensive as an eval. We find that most of the problems are not fundamental to games themselves, but to how they have been used. We believe game-as-an-eval remains a compelling and underutilized evaluation strategy. We introduce Lmgame Bench to standardize game-as-an-eval. More details and findings in our blog post:
Hao AI Lab • 69,004 views • 11 months ago

🎥 Frustrated by Sora's credit limits? Still waiting for Veo 2? 🚀 Open-source video DiTs are actually on par. We introduce FastVideo, an open-source stack for fast video generation with SoTA open models. We support Mochi and Hunyuan with 8x faster inference: a 5-second 720p video in 62 seconds.
Hao AI Lab • 69,521 views • 1 year ago

🎥 Video DiTs are painfully slow: HunyuanVideo takes 16 min to generate a 5 s 720p video on an H100. 🤯 Announcing Sliding Tile Attention (STA):
* Accelerates 3D full attention (FA3) by up to 10x
* Slashes the end-to-end time from 16 to 5 min
* NO extra training. NO quality loss!
🚀 Can you tell which videos are generated by the original HunyuanVideo, and which by STA? 👀 Blog:
Hao AI Lab • 58,003 views • 1 year ago

🔧🤖 A new wave of open-source LLMs like DeepSeek-R1-0528 and Qwen3-235B-A22B is leveling up with stronger agentic performance. We test them in head-to-head gameplay: the upgraded DeepSeek-R1-0528 outsmarts strong reasoning models like o4-mini across several games and nearly matches SOTA performance on Tetris, going toe-to-toe with o3. ✨🧠 Check out how R1 manages to clear lines in Tetris while other models still struggle 👇
Hao AI Lab • 36,265 views • 1 year ago

LLaMA-4 Maverick performs well on reasoning benchmarks and ranks 2nd on the Chatbot Arena, yet its true performance remains controversial. What if we put it in a transparent gaming environment? 🎮 Our benchmark tells a different story...🤔 Will true intelligence shine through play? Let’s find out 👇
Hao AI Lab • 39,050 views • 1 year ago

Phoenix Wright: Ace Attorney is a popular visual novel known for its complex storytelling and courtroom drama. Like a detective novel, it challenges players to connect clues and evidence to expose contradictions and reveal the true culprit. In our setup, models are tested on the intense cross-examination stage. Each model must spot contradictions and present the correct evidence to challenge witness testimony. Each level grants 5 lives, allowing limited tolerance for mistakes.
Hao AI Lab • 29,983 views • 1 year ago

[1/5] Did you know that random gameplay in 2048 can reach a 128 tile, and that only 1% of human games reach a 2048 tile? Check out how today’s top AI models compare! ⚖️ For top reasoning models, the results were wild: only Claude-3.7 (with reasoning) and o1 managed to outperform random moves, reaching a 256 tile in 114 and 116 steps respectively! 😱
Hao AI Lab • 26,104 views • 1 year ago
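For readers who want to see what the random-move baseline in 2048 looks like, here is a minimal sketch. It assumes the standard rules (a 4x4 board, each new tile a 2 with probability 0.9 and a 4 otherwise) and is not the harness used in the benchmark. The function names (`slide_row_left`, `move`, `spawn_tile`, `play_random_game`) are hypothetical, chosen only for illustration.

```python
import random

SIZE = 4  # assumed standard 4x4 board


def slide_row_left(row):
    """Slide one row to the left, merging each equal adjacent pair once."""
    tiles = [v for v in row if v != 0]
    merged = []
    i = 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            merged.append(tiles[i] * 2)  # merge the pair into one tile
            i += 2
        else:
            merged.append(tiles[i])
            i += 1
    return merged + [0] * (len(row) - len(merged))


def move(board, direction):
    """Return a new board after a move (0=left, 1=right, 2=up, 3=down)."""
    if direction in (2, 3):
        board = [list(col) for col in zip(*board)]  # transpose
    if direction in (1, 3):
        board = [row[::-1] for row in board]  # mirror
    board = [slide_row_left(row) for row in board]
    if direction in (1, 3):
        board = [row[::-1] for row in board]  # undo mirror
    if direction in (2, 3):
        board = [list(col) for col in zip(*board)]  # undo transpose
    return board


def spawn_tile(board, rng):
    """Place a 2 (90%) or a 4 (10%) on a random empty cell, in place."""
    empty = [(r, c) for r in range(SIZE) for c in range(SIZE) if board[r][c] == 0]
    r, c = rng.choice(empty)
    board[r][c] = 2 if rng.random() < 0.9 else 4


def play_random_game(rng):
    """Play uniformly random legal moves until stuck; return (max tile, steps)."""
    board = [[0] * SIZE for _ in range(SIZE)]
    spawn_tile(board, rng)
    spawn_tile(board, rng)
    steps = 0
    while True:
        candidates = [move(board, d) for d in range(4)]
        legal = [b for b in candidates if b != board]  # moves that change the board
        if not legal:
            break
        board = rng.choice(legal)
        spawn_tile(board, rng)
        steps += 1
    return max(max(row) for row in board), steps


if __name__ == "__main__":
    rng = random.Random(0)
    results = [play_random_game(rng) for _ in range(100)]
    print("best tile over 100 random games:", max(tile for tile, _ in results))
```

Running many games and looking at the distribution of the largest tile gives a rough reference point for what "outperforming random moves" means; the exact numbers depend on the seed and on the rule assumptions above.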

You might have heard top reasoning models now match AIME gold medalists in 2025 🏅, but watch them crumble in box-pushing Sokoban (倉庫番) from the 80s! 🧩 Again, we put top reasoning models into the game. o3-mini (medium) took the crown, reaching level 4 before getting tangled up with just two boxes. 😵‍💫 Claude-3.7-thinking managed two levels, and Deepseek-R1 cleared one level. Gemini-2.0-flash-thinking solved none.
Hao AI Lab • 23,727 views • 1 year ago

🚨 New Challenger: GROK joins the Game Arena Benchmark! We evaluated Grok3-mini-beta: thinking on four games: 🧩 2048 | 🧱 Sokoban | 🍬 Candy Crush | 🎮 Phoenix Wright With fast progress, it’s already comparable to top models like OpenAI’s O1, previous O3-mini, and Gemini-2.5-Pro, ranked 🥇 1st in 2048 and 🥈 2nd in Sokoban. But the Grok3-beta Thinking API is still unreleased, leaving an unanswered question: Can it match or surpass O3’s dominant reasoning performance? For now, let’s take a closer look at how Grok3-mini-beta: thinking performs in gameplay so far. 👇
Hao AI Lab • 19,487 views • 1 year ago

This week, we tested the 3 latest models in our Game Arena Benchmark: → O3 → O4-mini → Gemini 2.5 Flash Across 4 games (Phoenix Wright, Sokoban, Candy Crush, and 2048), O3 dominated the zero-shot leaderboard, ranking #1 or #2 in nearly every task and outperforming previous SOTA models like O3-mini and Gemini 2.5 Flash. 🔥 Beyond our customized tests, we took on a real challenge: Sokoban (1989), the classic, unforgiving original. 🗣️ Many say O3 shows strong image reasoning, but does it really? Let’s see what our Game Arena Benchmark reveals. 👇
Hao AI Lab • 14,636 views • 1 year ago

New results just dropped 🥳! We have integrated GPT-4.5 and Gemini-2.0-flash into our gaming agents and tested them on Super Mario Bros. ⚔️ GPT-4.5 struggles due to high latency, while Gemini-2.0-flash performs significantly better than Gemini-1.5-pro, on par with Claude-3.5. Enjoy! 🎮
Hao AI Lab • 15,499 views • 1 year ago