Interactive Reasoning Benchmarks are the next step in frontier evaluations.

Hear Greg Kamradt share why measuring human-like intelligence requires multi-turn environments, including a sneak peek of ARC-AGI-3.

Want to help us build interactive evaluations? We're hiring.

ARC Prize

21,395 subscribers

26,218 views • 1 year ago • via X (Twitter)

Science & Technology, Education

8 Comments

ARC Prize • 1 year ago

Calling Python Game Developers to help us create fun and challenging mini-games. This is a contract position for a remote game development role.

Required Skills:
* Strong Python
* 2 years of game development experience

Email [email protected] with your portfolio.

ARC Prize • 1 year ago

This presentation was originally given at @aiDotEngineer on June 5, 2025. Slides:

UserInterface • 2 years ago

Unveiling the Future of Prompt Engineering for Better AI Interactions #tech

Alex Zhang • 1 year ago

Not entirely the same since we're not crafting tasks, but we (me + @OfirPress) are also interested in benchmarking the progress of a single agent / model across multiple games. In the video it's mentioned that we want to avoid data leakage (e.g. in Pokemon) and that this is a factor in why Gemini Plays Pokemon succeeds. This is probably true (although it's hard to rigorously prove) but arguably is not the primary issue here. I wouldn't be surprised if you hand-crafted a fake version of Pokemon Blue and the Gemini Plays Pokemon scaffold was able to solve it. I'd wager that the reason why Gemini Plays Pokemon finishes the game while Claude Plays Pokemon gets stuck has less to do with Gemini > Claude or more data leakage, and more to do with the design of their scaffolds. We also see this in our VideoGameBench paper, where minimizing the available scaffold leads to frequent "stuck" behavior regardless of which frontier VLM you use. Super excited about this effort though, and perhaps deploying similar agents on this new game benchmark and VideoGameBench will give us more perspective on where we are with embodied agents :)

shawn swyx wang • 1 year ago

@GregKamradt ominous

Chris • 1 year ago

@GregKamradt So exciting how you guys are already on ARC AGI 3. Do you think that will be the last one before we hit AGI? 👀

vmal • 1 year ago

@GregKamradt why arc agi has it wrong

Yehyun • 1 year ago

@GregKamradt This is a needed benchmark since it will also represent how well these systems track their long-term memory

Related Videos

ARC-AGI-3 Preview Event Recap

Greg Kamradt steps through our Interactive Reasoning Benchmark thesis
* Why static benchmarks fall short measuring agentic capabilities
* The ARC Prize approach to creating interactive benchmarks

ARC Prize

22,248 views • 11 months ago

ARC-AGI is redefining how to measure progress on the path to AGI - focusing on reasoning, generalization, and adaptability instead of memorization or scale. At NeurIPS 2025, YC's Diana sat down with ARC Prize President Greg Kamradt to find out why most AI benchmarks fail, how ARC-AGI reveals the limits of today’s models, and why measuring intelligence may be harder than building it.

00:11 - What ARC Prize is and why it exists
00:38 - François Chollet’s definition of AGI
01:48 - What ARC-AGI Actually Tests
02:25 - When LLMs Failed the ARC Benchmark
03:38 - ARC-AGI Becomes the Standard
04:49 - False Positives in AI Progress
06:06 - The Evolution of ARC-AGI
08:55 - Measuring Intelligence beyond just accuracy
10:25 - What happens if a model solves ARC-AGI?

Y Combinator

98,369 views • 7 months ago

Hiring RL Engineer! Started off as a curious project at Lossfunk to push the boundaries of LLMs in social reasoning - we are now building RL environments, data, and benchmarks to simulate more real-world scenarios. If you want to train SoTA RL models over multi-GPUs (H200s/B200s) to unlock next AI frontier, this is for you.

Satpal Singh Rathore

46,010 views • 11 months ago

François Chollet on the ARC Prize and how we get to AGI. At AI Startup School in San Francisco.

00:00 - The Falling Cost of Compute
00:57 - Deep-Learning’s Scaling Era & Benchmarks
01:59 - The ARC Benchmark
03:02 - The 2024 Shift to Test-Time Adaptation
05:01 - What Is Intelligence?
07:12 - Why Benchmarks Matter (and Mislead)
08:57 - ARC 1 Exposes Scaling Limits
10:58 - ARC 2: Compositional Reasoning Arrives
12:55 - Humans vs. Models on ARC2
14:58 - Previewing ARC3 & Interactive Agency
17:00 - Kaleidoscopic Hypothesis and Abstractions
22:00 - Type 1 vs. Type 2 Abstractions
26:00 - Discrete Program Search & Inventive AI
29:00 - Fusing Intuition with Symbolic Reasoning
32:00 - Building AGI Through Meta-Learning Systems

Y Combinator

231,796 views • 1 year ago

This release is fucking huge. It's one of the biggest updates to LMArena this year! Code Arena is our next generation of coding evaluations, beginning with web development tasks. Here you can use models to build interactive websites and share them with your friends. The links are persistent, so you can e.g. build a game and play it whenever you want. Here watch two models -- Claude Haiku and Grok-Code-Fast -- compete to build a galaxy. In this case, I liked the "star-wars" effect of Grok!

Anastasios Nikolas Angelopoulos

38,032 views • 8 months ago

François Chollet has spent years asking a different question than most of the AI world. Instead of scaling what already works, he’s trying to understand what intelligence actually is and how to build it from first principles. In this episode of the Lightcone Podcast, he traces that path from his early work on deep learning to the creation of the ARC Prize, and the launch of ARC V3, a new benchmark designed to measure something deeper than performance: the ability to learn, adapt, and reason efficiently in entirely new environments. He explains why today’s systems may be hitting limits, what recent breakthroughs really mean, and why reaching true general intelligence may require a fundamentally different approach.

00:00 - AGI by 2030?
00:31 - Introducing Ndea: A New Path Beyond Deep Learning
01:08 - A New ML Paradigm
01:30 - Replacing neural nets with compact symbolic programs
03:04 - Why Ndea Isn’t Competing With Coding Agents
05:20 - Why Everyone Might Be Wrong About Scaling LLMs
07:22 - Why Coding Agents Suddenly Work So Well
08:50 - The Limits of LLMs in Non-Verifiable Domains
10:48 - What AGI Actually Means (And Why Most Definitions Are Wrong)
13:30 - Why Deep Learning Hits a Wall
14:00 - ARC’s Origin Story
18:20 - ARC Benchmarks Explained: From V1 to V3
22:49 - The RL Loop Powering Coding Agents Today
27:03 - ARC-AGI V3: Measuring “Agentic Intelligence”
31:14 - Inside the ARC Game Studio
35:31 - Could AGI Fit in 10,000 Lines of Code?
44:01 - Building Ndea: From Idea to Compounding Research Stack
46:46 - The Future of ARC: Benchmarks That Evolve With AI
47:21 - Why There’s Still Huge Opportunity for New AI Paradigms
53:37 - How to Build a Breakout Open Source Project - Lessons From Keras
56:39 - Advice For How To Think About AI

Y Combinator

151,442 views • 4 months ago

With the unchecked race to build smarter-than-human AI intensifying, humanity is on track to almost certainly lose control. In "Keep The Future Human", FLI Executive Director Anthony Aguirre explains why we must close the 'gates' to AGI - and instead develop beneficial, safe Tool AI built to serve us, not replace us. We're at a crossroads: continue down this dangerous path, or choose a future where AI enhances human potential, rather than threatening it. 🔗 Read Anthony's full "Keep The Future Human" essay - or explore the interactive summary - at the link in the replies:

Future of Life Institute

33,071 views • 1 year ago

Today we are announcing Genie 3, a general purpose world model by Google DeepMind that can generate dynamic, interactive environments with a single text prompt. World models are AI that understand facets of the world (like Veo's knowledge of intuitive physics or Genie's mastery of new environments), and serve as a key stepping stone on the path to AGI. Genie 3 is our first world model to allow interaction in real-time, while also improving consistency and realism compared to Genie 2. Learn more ➡️

Google AI

102,615 views • 11 months ago

Port Alpha has a home. We're proud to bring our next-generation shipyard to our home state, and to join the community in Brownsville, Texas. Hear from our Co-founder and CEO Dino Mavrookas and our Head of Manufacturing John Morgan on why we chose Brownsville, and how we're rebuilding American shipbuilding's future. We're hiring — come build with us:

Saronic

72,704 views • 10 days ago

We raised $28M seed from Threshold Ventures, AIX Ventures, and NVentures (Nvidia's venture capital arm) —alongside 10+ unicorn founders and top AI researchers— to build reasoning models that generate real-time simulations and games. Models are bottlenecked by practical simulations that can act as Reinforcement Learning environments. Human self-expression is bounded by tools that let us create alternate realities. At Moonlake, we are building a future where anyone can create interactive worlds, bring their child-like wonder to life, learn within them, and most importantly, share experiences with people we care about. More in 🧵

Moonlake

1,113,275 views • 9 months ago

See how on-screen text becomes spoken audio in real time using Telnyx Text-to-Speech. Want to build smarter voice experiences? From AI companions to wellness apps, interactive kiosks, and news readers, real-time TTS is quietly powering the next wave of human-tech interaction. Explore why real-time TTS is becoming a must-have layer in modern digital experiences: #TextToSpeech #VoiceAI #RealTimeVoice #Accessibility #UXDesign

Telnyx

17,058 views • 7 months ago

Today I'm excited to introduce Hark, a new artificial intelligence lab building the most advanced, personal intelligence in the world.

We've been in stealth for 8 months, assembling one of the greatest AI and hardware teams on the planet.

I want to explain why I started Hark and what we're focused on.

I've spent the last 3 years working on the hardest AI challenge imaginable: giving AI a humanoid body. On the digital side, I've been using all the existing LLM chatbots - and I have to say, they feel incredibly dumb to me.

AGI, in the limit, should feel like a sci-fi movie. It should be able to listen and talk. It should have persistent memory and be highly personalized. It should see and touch the world. But we're far from this today.

We are crafting a new interface to AGI: intelligence that lets you offload your mental workload into a system that begins to think like you and sometimes ahead of you.

We started Hark with one goal: build the world's most advanced personal intelligence - paired with next-generation hardware designed to serve as a universal interface between humans and machines.

Brett Adcock

1,412,573 views • 4 months ago

I've recently been spending time with the ChatGPT team on shipping new experiences in ChatGPT! Our team's goal is simple – bring the incredible benefits of AI to everyone globally. We're making week by week progress and here are a few new improvements to share, all live right now:
- new interactive beautiful charts
- ability to edit your writing in full screen and save to your library
- a table of contents for when your chats get really long
- editing messages with attachments (finally!)
- [for plus/pro users] long press on 'send' to select the model's intelligence / effort level
- [on iOS] typing into the chat box now feels more responsive

We're listening to what you're interested in, let us know what to build in comments!

Adam Fry

218,728 views • 1 month ago

Sam Altman just said in his new interview that a new AI architecture is coming that will be a massive upgrade, just like Transformers were over Long Short-Term Memory. He also said that the current class of frontier models is now powerful enough to help us research these ideas. His advice is to use the current AI to help you find that next giant step forward.

---

From the 'TreeHacks' YT channel (link in comment)

Rohan Paul

658,077 views • 4 months ago

Imagine a world where your voice is the universal interface. No screens, no buttons, just seamless, human-like conversations. This is the future we're creating at PlayAI. Today, we are announcing our latest foundational voice model, Play Dialog, a multi-turn conversational voice model trained to converse like humans. And a $21m seed funding to fuel our research and product development efforts. Join us in building the future of conversational voice ai.

PlayAI

46,742 views • 1 year ago

When data agents fail, they often fail silently - giving confident-sounding answers that are wrong, and it can be hard to figure out what caused the failure. "Building and Evaluating Data Agents" is a new short course created with Snowflake and taught by Anupam Datta and Josh Reini that teaches you to build data agents with comprehensive evaluation built in.

Skills you'll gain:
- Build reliable LLM data agents using the Goal-Plan-Action framework and runtime evaluations that catch failures mid-execution
- Use OpenTelemetry tracing and evaluation infrastructure to diagnose exactly where agents fail and systematically improve performance
- Orchestrate multi-step workflows across web search, SQL, and document retrieval in LangGraph-based agents

The result: visibility into every step of your agent's reasoning, so if something breaks, you have a systematic approach to fix it.

Sign up to get started:

Andrew Ng

101,930 views • 10 months ago

New course: Nvidia's NeMo Agent Toolkit: Making Agents Reliable, taught by Brian McBrayer 🐬 from NVIDIA.

Many teams struggle to turn agent demos into reliable systems that are ready for production. This short course teaches you to harden agentic workflows into reliable systems using Nvidia's open-source NeMo Agent Toolkit (NAT).

Whether you built your agent in raw Python or using a framework like LangGraph or CrewAI, NAT provides building blocks for observability, evaluation, and deployment that turn proofs-of-concept into production-ready systems. NAT makes it easy to troubleshoot and optimize agent performance with execution traces, systematic evaluations, and CI/CD integration.

Skills you'll gain:
- Build configuration-driven agent workflows with REST APIs and minimal code
- Add observability with tracing to visualize agent reasoning and debug performance bottlenecks
- Create systematic evaluations using gold-standard datasets to measure and improve agent reliability
- Deploy multi-agent systems with authentication, rate limiting, and professional web interfaces
- Orchestrate agents from different frameworks to collaborate on complex tasks

Join and learn how to turn agent demos into reliable systems!

Andrew Ng

63,773 views • 7 months ago

Today, we're launching Good Start Labs w/ $3.6M from amazing investors including Inovia Capital & General Catalyst.

My whole life I've been learning from games. Over the past five years, I've dreamt about how AI learn with me.

Today we're launching LOL Arena, the first AI benchmark for humor, informed by millions of human votes. We are also launching Diplomacy Arena, ranking strategy, betrayal, and prompt impact across models.

In the coming years we hope to lead at the intersection of Gen AI & Games and define what it means to do alignment via entertainment, ensuring everyone can share their voice and help AI become a tool that really is custom built to help bring our dreams to life.

If that inspires you, join us! We're hiring.

Here's what we're shipping today: 🧵

alex duffy

99,542 views • 9 months ago

Demis Hassabis's new interview:

"Society needs to hear that because we don't have long to prepare for what that means. We are standing in the foothills of the singularity now ... which is AGI. I believe that we are only a few years away from that, maybe around 2030, plus or minus a year."
~ Demis Hassabis, Co-Founder and CEO of Google DeepMind

"It is going to be enormously profound, I think. The future, in my view, is still to be written. But these next few years are going to be very critical as to which way that will go, and how we collectively want that to look."

---

IMO, the real disruption is not whether AGI arrives exactly in 2030, plus or minus a year, but whether institutions can adapt, as in a post-AGI world technology will change much faster than human systems can respond. Schools still train people for stable professions, companies still organize work around human bottlenecks, and governments still regulate after harm becomes visible. AGI, if it arrives anywhere near the frontier-lab timelines, compresses that lag into a dangerous gap.

---

From the "Stanford Graduate School of Business" YouTube channel (link in comment)

Rohan Paul

70,642 views • 1 month ago

"Everyone says AI evaluations are important, so let's actually build one live from scratch."

Here's my new episode with Aman (Arize) where we build AI evals for a customer support agent live, including:
✅ Creating the eval criteria
✅ Labeling the golden dataset
✅ Aligning LLM judges with human scores

Some insights from Aman:
1. PMs must do manual labeling themselves. "I never found it useful to outsource human evals to contractors. The PM has to be in the spreadsheet to maintain good judgment."
2. Define what good/average/bad looks like on criteria like accuracy and tone upfront. This becomes your rubric for consistent evaluation across your team.
3. Make sure your LLM judges align with your human scores before you scale. Test the judges on a few dozen cases first and aim for a match rate of 80% or higher.

📌 Watch now:
Also available on: Spotify: Apple: Newsletter:

Peter Yang

37,177 views • 11 months ago