Arena.ai

@arena • 199,690 subscribers

Where AI meets the real world. Formerly LMArena. We measure and advance the frontier of AI through community-driven evaluation. We’re hiring → https://t.co/XBZCrseaWF

Shorts

Arena Trends: Text-to-Image, Jan 2026 – Apr 2026 For most of the year, Google DeepMind and OpenAI traded the top spot within a tight margin - GPT-Image vs. Nano Banana - with the rest of the field clustered below 1,200. Today, GPT-Image-2 breaks away with a score of 1,512, 242 points ahead of #2 Google. The frontier continues to move.

514,344 views

US vs China update. Stanford's AI Index put the US–China gap at 2.7%. Here's what two years of real-world use from the Text Arena shows. Gap three years ago: +278. Today: +29. Anthropic's Claude Opus 4.6 Thinking vs. Baidu's ERNIE for Developers Ernie 5.1 at the top. The US has never lost #1, but the race keeps closing.

58,444 views

We decided to take Paul Jankura’s Claude Opus 4.5 out for a test drive vs. the current #1 ranking model in Code Arena: Gemini 3 Pro. Same prompt, different outputs. Let’s take a look. Remember, your votes drive the leaderboards. We’ll see how Claude Opus 4.5 stacks up in the coming days! Check out some of the comparisons, like how Claude Opus 4.5 handled the “Pyramids of Giza” prompt, in thread. 🧵

88,977 views

📊Arena Trend update for August 2024 - Feb 2025: After a few DeepSeek jumps last month, xAI leaps forward to the top of the leaderboard. The AI race continues! 📈 animation credit: Peter Gostev

154,425 views

Arena leaderboards now include Price and Context. Price is shown as input / output cost per 1M tokens, and Context shows the maximum context window. Compare Arena scores based on what matters for your use case.

28,517 views
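As a rough illustration of how an input / output price per 1M tokens translates into the cost of one request, here is a minimal sketch. The function name `request_cost_usd` and the prices and token counts are hypothetical, chosen only for the example; they are not taken from the Arena leaderboard.

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one request, given prices quoted in USD per 1M tokens."""
    input_cost = (input_tokens / 1_000_000) * input_price_per_m
    output_cost = (output_tokens / 1_000_000) * output_price_per_m
    return input_cost + output_cost


# Hypothetical model priced at $3.00 input / $15.00 output per 1M tokens,
# handling a request with 200k input tokens and 50k output tokens.
cost = request_cost_usd(200_000, 50_000, 3.00, 15.00)
print(f"${cost:.2f}")
```

Because output tokens are often priced higher than input tokens, the same model can look cheap or expensive depending on how output-heavy the workload is.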

Created by Gemini 3 Pro in one shot!

36,569 views

📈Arena Trends Update: We pulled Arena scores for the Top 10 labs in Text for the past 6 months (Sept 2025–Feb 2026), and the competitive spread is shifting again. With tighter confidence intervals and new entries in the mix, the frontier continues to shift. Stay tuned for more insights as we dive deeper into the top open models for February later this week. Let us know what you found the most surprising in the comments. 👇

23,239 views

🍌 Thousands of new people jumped into Image Arena Battle mode this week - our intern can barely keep up! What happens in Battle mode? 🧵 We partner directly with model providers to give you early access to cutting-edge models still in development, often before you can try them anywhere else. These pre-release models are tested in Battle mode. Details in the thread 👇

16,913 views

Videos

Big news from Kimi.ai this week. Check out Kimi K3 head-to-head with Fable 5 on identical prompts in Frontend Code Arena.

113,567 views • 2 days ago

Arena reached a $100M annual revenue run rate just 8 months after launching our evaluation product. We started as a research project at UC Berkeley with a simple mission: measure AI progress through real-world use. As AI shifts from chatbots to agents taking on longer, higher-stakes work, the problem matters more than ever. Today, Arena measures real-world AI utility with a community of tens of millions. With Agent Arena, we’re evaluating long-running agents on complex, real-world tasks - how they use tools, adapt to feedback, recover from errors, and accomplish goals set by humans. We are excited to keep deepening our work in agentic evaluations. Here’s Anastasios Nikolas Angelopoulos on what this milestone means and where we go from here:

206,878 views • 20 days ago

Introducing Agent Mode: Agentic AI is now measured in the Arena. Agent Mode can do deep research, create reports, generate images, build websites, debug code, and more. It completes more complex tasks by using tools like web search, bash in a sandbox environment, image generation, file writing, and asking follow-up questions. Frontier models are waiting for you in Agent Mode to take on real-world tasks. GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro, and top open models. Test them yourself.

344,089 views • 1 month ago

🚀Introducing Code Arena: the next generation of live coding evals for frontier AI models. Built to test how models plan, scaffold, debug, and build real web apps step-by-step. Try Claude, GPT-5, GLM-4.6 and Gemini in Code Arena today!

328,087 views • 8 months ago

GPT-5.2-high by OpenAI is off to a strong start in the Code Arena. ⚡️ If you’re new here: Code Arena is where AI models build full web apps, tools, and interactive sites — all from a single prompt. Watch the video to see GPT-5.2-high in action, then try your own prompt and reply with your creation below. ⬇️

246,337 views • 7 months ago

📢We’re excited to share that we’ve raised $100M in seed funding to support LMArena and continue our research on reliable AI. Led by a16z and UC Investments (University of California), we're proud to have the support of those that believe in both the science and the mission. We’re focused on building a neutral, open, community-driven platform that helps the world understand and improve the performance of AI models on real queries from real users. Also, big news is coming next week!👀 We're relaunching LMArena with a whole new look built directly with community feedback from the ground up 🧱 Link in thread.

435,742 views • 1 year ago

You can smell a big model. Not the parameter count. Not the benchmark score. It's that feeling when something is actually reasoning. Not just pattern matching. We call it "big model smell."

111,988 views • 3 months ago

5 patterns in Text Arena's price–performance Pareto frontier since 2023:
1. GPT-4-level quality is now ~500x lower cost.
   - From a ~$50 blended price per million tokens in 2023 to ~$0.10 today.
2. The higher-price end is both better and lower-priced since 2023.
   - The leading Arena score has climbed ~170 points (1,330 → 1,500), while the price of the higher-end frontier models dropped from ~$50 to ~$20 per million tokens.
3. The low-cost end gained the most.
   - Under $0.20 per million tokens, the best available model went from ~1,000 Arena score in 2023 to ~1,440 today.
4. The low-cost/top performance gap has nearly closed.
   - In 2023, sub-$0.20 models trailed the leader by ~350 Arena points. Today, ~60.
5. The cast has rotated quite a bit.
   - OpenAI set the 2023–24 benchmark.
   - AI at Meta strengthened the low-cost end in 2024.
   - Google DeepMind drove the 2025 jump.
   - Anthropic holds the peak in 2026.
   - xAI and Chinese labs like DeepSeek AI, Z.ai, Kimi.ai, Xiaomi MiMo, and Qwen are continuing to push the mid-price frontier.

58,133 views • 1 month ago
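To make the idea of a price–performance Pareto frontier from the post above concrete, here is a minimal sketch. It treats a model as being on the frontier when no other model is both at most as expensive and scores at least as high. The `Model` type, the `pareto_frontier` function, and all model names, prices, and scores are hypothetical placeholders, not Arena data or Arena's own method.

```python
from typing import List, NamedTuple


class Model(NamedTuple):
    name: str
    price: float  # assumed blended USD price per 1M tokens
    score: float  # assumed Arena-style score (higher is better)


def pareto_frontier(models: List[Model]) -> List[Model]:
    """Return models not dominated on (lower price, higher score)."""
    # Sort cheapest first; for equal prices, put the higher score first.
    ordered = sorted(models, key=lambda m: (m.price, -m.score))
    frontier: List[Model] = []
    best_score = float("-inf")
    for m in ordered:
        # A model joins the frontier only if it beats every model
        # that costs the same or less.
        if m.score > best_score:
            frontier.append(m)
            best_score = m.score
    return frontier


# Hypothetical data for illustration only.
models = [
    Model("model-a", 0.10, 1300),
    Model("model-b", 0.20, 1440),
    Model("model-c", 1.00, 1400),   # dominated by model-b
    Model("model-d", 20.00, 1500),
    Model("model-e", 50.00, 1330),  # dominated by model-b and model-d
]
for m in pareto_frontier(models):
    print(m.name, m.price, m.score)
```

Sorting by price first means a single pass is enough: each model only needs to be compared against the best score seen among cheaper or equally priced models.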

We put the top three Code Arena models head-to-head: Opus 4.5 Thinking 32k, Opus 4.5, and Gemini 3 Pro. They’re just 20 points apart. Same tough prompts, different results. Here’s what stood out. Remember, your votes drive the rankings. Watch how these contenders move on the leaderboard as more votes come in. Check out the comparisons in the thread below. 🧵

163,889 views • 7 months ago

How does the #1 open Text Arena model hold up in agentic coding tasks? We tested GLM-5 in Code Arena with head-to-head SVG prompts vs. top frontier AI models. What do you think? Scores for Z.ai 's GLM-5 in Code Arena coming soon. Test out GLM-5 for yourself and get voting.

115,220 views • 5 months ago

The NEW LMArena is officially live! 🎉 ✨ New Logo! ⚡️ Better, faster UI/UX for chat and leaderboard 📱 Mobile optimized 💬 Chat history 🧭 Clearer leaderboard navigation 🤖 Many modalities in one place: vision, image, and more coming soon Try it now at lmarena dot ai! (Link in 🧵)

266,923 views • 1 year ago

We’ve challenged Claude Opus 4.6 by Anthropic with our hardest 3D prompts, and it did not disappoint.

98,764 views • 5 months ago

We are excited to release the weights of Vicuna-13B. 🔥 Run it with a single GPU on your own machine! Get the weights: Web UI demo: Command line demo: see below

549,335 views • 3 years ago

🚨 BIG NEWS: An announcement from our intern… Introducing 🎬 Video Arena!

151,804 views • 11 months ago

The Image Arena is buzzing 👀 OpenAI’s GPT-image-1.5 is live and already shaking up the leaderboard. Watch it in action below, then try your own prompt and share what you create 👇🎨

77,574 views • 7 months ago

🚨🍌Big Reveal: who was "Nano Banana?" The anonymous model, “nano-banana,” that caught the world's attention with its ability to follow complex instructions, preserve character identity, and maintain contextual details was: Gemini-2.5-Flash-Image-Preview by Google DeepMind 🍌✨
- Now ranked #1 on the Image Edit Arena
- Also ranked #1 for Text-to-Image
In two weeks, “nano-banana” has driven over 5 million votes to the Image Edit Arena. With 2.5M+ votes for this model, it is the highest number of votes any model has received, with the largest Elo score lead (171) in Arena history. Congrats to the Google DeepMind team on this incredible milestone in image edit and generation. 👏

106,535 views • 10 months ago
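For a sense of what an Elo score lead like the 171 points mentioned above means in head-to-head terms, here is a minimal sketch using the standard Elo expected-score formula. It assumes the conventional base-10 logistic curve with a 400-point scale and ignores ties; Arena's actual rating procedure may differ, and the function name `expected_win_rate` is hypothetical.

```python
def expected_win_rate(score_gap: float, scale: float = 400.0) -> float:
    """Expected win probability of the higher-rated model under standard Elo.

    score_gap is the leader's score minus the opponent's score.
    """
    return 1.0 / (1.0 + 10.0 ** (-score_gap / scale))


# Score gaps quoted in the posts on this page.
for gap in (29, 171, 242):
    print(f"gap {gap:>3}: expected win rate {expected_win_rate(gap):.1%}")
```

Under these assumptions a 171-point lead corresponds to the leader being preferred in roughly 73% of direct comparisons, while a gap of zero gives exactly 50%.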

How much better is Claude Opus 4.6 by Anthropic vs. past models? We compared Opus 4.6 to Opus 4.5 on a set of challenging SVG generations in Code Arena:

58,591 views • 5 months ago

⚠️WARNING: offensive content ahead. Introducing RedTeam Arena with Bad Words—our first game. You've got 60 seconds to break the model to say the bad word. The faster, the better. (Collaboration with @elder_plinus and the awesome BASI 🐍 community.) Link to the site below👇

188,506 views • 1 year ago

🚨BIG NEWS: 🎬 Video Arena is now live on the web! Test out Veo 3.1, Sora 2, Seedance v1.5 Pro, Kling 2.6 Pro, Wan 2.5 & more. What started last summer as a small Discord bot experiment has grown into a rigorous way to measure and understand how frontier video models perform with real-world use. Thank you to our wonderful community for all the feedback! Today, we’re opening up access by making it available on the web. 🎥 Generate videos with 15 different frontier AI models and compare them head-to-head. 📊 Vote for the best output to power the leaderboards.

61,930 views • 5 months ago

An anonymous image model appeared on Arena on Aug 12, 2025 and quickly became the most-voted model in Arena's history. The codename: Nano Banana. It was later revealed to be built on Google Gemini and publicly released on Aug 26, 2025. We sat down with Lead Engineer Yue to break down why it stood out. Watch the full video on YouTube, link in thread 👇

44,172 views • 4 months ago