How does an AI model actually learn to see? 🤖 Learn about the tech behind native multimodality, how models reason over visual data like documents and video, and the future of proactive AI assistants with Logan Kilpatrick and Gemini Model Behavior Product Lead, Ani Baddepudi. ↓ Timestamps: 01:12 Why... Gemini is natively multimodal 02:23 The technology behind multimodal models 05:15 Video understanding with Gemini 2.5 09:25 Deciding what to build next 13:23 Building new product experiences with multimodal AI 17:15 The vision for proactive assistants 24:13 Improving video usability with variable FPS and frame tokenization 27:35 What’s next for Gemini’s multimodal development 31:47 Deep dive on Gemini’s document understanding capabilities 37:56 The teamwork and collaboration behind Gemini 40:56 What’s next with model behavior

Google AI

2,339,108 subscribers

58,703 views • 1 year ago • via X (Twitter)

Science and Technology • Education

Comments: 11

Google AI • 1 year ago

@AniBaddepudi Watch the full episode here:

Mobile Scanner • 1 year ago

Scan any documents, convert images into text, PDF files, etc. 👍

Fabio Lauria • 1 year ago

@OfficialLoganK @AniBaddepudi AI’s ability to process multimodal data is captivating. It transforms how we interact with technology, bridging gaps between visual perception and reasoning. Excited for the insights from this discussion. #AIFuture

Reji Modiyil • 1 year ago

@OfficialLoganK @AniBaddepudi @GoogleAI, the blending of ai and visual data opens incredible possibilities for innovation.

Cheatify • 1 year ago

@OfficialLoganK @AniBaddepudi @GoogleAI, the evolution of ai vision is fascinating – excited to dive deeper into this topic.

AIMEME • 1 year ago

@OfficialLoganK @AniBaddepudi "AI models learn to see through a combination of advanced technology and continuous learning, paving the way for proactive AI assistants in the future."

Smart AI Stash • 1 year ago

@OfficialLoganK @AniBaddepudi Can’t wait for AI to start critiquing my interior design choices: ‘I can see this is a living room, but why did you choose that couch?’ 😅

^innerly • 1 year ago

@OfficialLoganK @AniBaddepudi this ain’t just code, it’s a glimpse at us living next to ai not just staring at screens but actually vibing with the damn thing

Roark Syntax • 1 year ago

@OfficialLoganK @AniBaddepudi Neat. #RoarkSyntax

abdelhadi • 1 year ago

@OfficialLoganK @AniBaddepudi Like so i can come back

Confident Security • 1 year ago

@OfficialLoganK @AniBaddepudi Fascinating topic—just remember that when a model “sees,” it also remembers unless we design for ephemerality. Teaching AI vision should come with equal lessons in how to forget.

Related videos

✨ What makes Gemini 2.5 Pro stand out? In this Release Notes episode, Sr. Product Manager Logan Kilpatrick and Gemini Product Lead Tulsee Doshi break down its reasoning, coding, and multimodal strengths, plus a 1M token long context. ↓ Timecodes: 1:05 Gemini 2.5 launch overview 3:19 Academic evals vs. vibe checks 6:19 The jump to 2.5 7:51 Coordinating cross-stack improvements 11:48 Role of pre/post-training vs. test-time compute 13:21 Shipping Gemini 2.5 15:29 Embedded safety process 17:28 Multimodal reasoning with Gemini 2.5 18:55 Benchmark deep dive 22:07 What’s next for Gemini 24:49 Dynamic thinking in Gemini 2.5 25:37 The team effort behind the launch

Google AI Developers

228,118 views • 1 year ago

Learn about Google’s new SOTA image model, Gemini 2.5 Flash, its key capabilities, and what’s next on the roadmap with some of the team behind the model, Nicole Brichtova, Kaushik Shivakumar, Mostafa Dehghani, and Robert Riachi, with Logan Kilpatrick. Timecodes: 0:37 New model introduction 01:21 Demo: Image editing 03:44 Text rendering capabilities 04:44 Beyond human preference evals 06:44 Text rendering as a proxy for quality 08:38 Positive transfer between modalities 11:25 Demo: multi-turn, context aware image generation 13:54 Pixel-perfect editing and character consistency 15:51 Interleaved image generation 17:59 Specialized vs. native models 19:52 Understanding nuanced prompts 20:59 User feedback shaping model development 22:37 Improvements in character consistency 24:17 More natural looking images from team collaboration 26:41 What’s next for image generation models

Google AI Developers

31,150 views • 11 months ago

Explore state-of-the-art multimodal prompting in our new short course Large Multimodal Model Prompting with Gemini, taught by Erwin Huizenga in collaboration with Google Cloud. One interesting insight from this course: with multimodal models, prompt structure matters significantly. Placing text inputs, such as a patient's medical history, before image inputs, like an X-ray, can enhance the model's ability to contextualize and interpret visual data effectively. In other contexts, such as image captioning, you may get better results by putting the image first. Multimodal models behave differently than text-only LLMs, and effective prompting for models varies depending on the model you’re using. In this course you’ll learn how to effectively prompt Gemini models. Gemini's multimodal capabilities also enable new approaches in AI application development, for example: - The Gemini library handles various video formats (MP4, MOV, MPEG), streamlining applications using these formats. - Large context window (up to 1 million tokens) enables processing of extensive content, like analyzing multiple 50-minute videos simultaneously. - Function calling feature integrates real-time data (e.g., current exchange rates) into model responses. The course demonstrates building multimodal applications with real-world examples including document analyzers that reason across text and graphs simultaneously, video content extractors that find and timestamp specific information from multiple hours of footage, and automated expense report systems processing receipt images while cross-referencing company policies. Sign up here:

Andrew Ng

74,060 views • 1 year ago

Further tinkering with my little French tutor app. This version is using the Gemini Multimodal Live API. The speech understanding in Gemini is quite something. In this video you can see Gemini correcting my pronunciation. (Very patiently.) The language tutor use case really highlights the strengths of a next-generation speech model like Gemini. This is 90 lines of Pipecat AI code, and uses WebRTC for super low-latency, super reliable network transport.

kwindla

13,005 views • 1 year ago

Google just took the lead... Gemini 3 + Nano Banana Pro have captured the world's attention. Let's talk about: > what these models do > best ways to use both models > how other people are using them TIME STAMPS 00:00 Introduction 01:05 Deep Dive into Gemini 3 01:50 Real World Examples of Gemini 3 04:33 Gemini 3 in Google AI Studio 06:33 Building a Landing Page with Gemini 3 09:17 Using Gemini 3 in Cursor 10:47 Interactive Water Cycle Simulation 12:30 Exploring Google Gemini's Capabilities 13:53 Introduction to Nano Banana 15:37 Creating Infographics with Nano Banana 19:08 Nano Banana in Krea Nodes 23:41 Future of AI Models

Riley Brown

33,706 views • 8 months ago

MiniMax is building with next-gen AGI, offering foundation AI models and revolutionizing content creation with AI-powered video, music, chat products. With Alibaba's one-stop multimodal data platform, MiniMax can save the complexity of handling multimodal data work with the AI-ready data foundation, and boost streamlined innovation on it. Feifei Li, SVP and President of International Business at Alibaba Cloud Intelligence Group, shares how the right tech stack can elevate AI from potential to powerful. 🚀

Alibaba Group

305,831 views • 7 months ago

Explore the Live API’s new audio capabilities in this episode of Release Notes with Shrestha Basu Mallick and Logan Kilpatrick. From native audio output to proactive dialogue, learn how you can build more natural and engaging multimodal AI applications. Timecodes: 0:00 Intro 01:18 Live API Overview 03:36 Why audio is a special modality 05:07 Speed vs. precision in audio 06:17 Controllable and promptable TTS 08:31 What developers are building with the Live API 11:14 URL context and async calling features 15:02 Proactive audio and affective dialog 16:55 Addressing developer feedback 21:54 Live API roadmap 23:49 The role of long context 24:57 What’s next for the Live API 26:41 State of the AI audio market 30:10 Advice for developers getting started with the Live API 31:16 Live API demo 38:10 Demo wrap up and closing

Google AI Developers

24,239 views • 11 months ago

In the latest episode of Release Notes, Gemini's Dave Citron joins Logan Kilpatrick to deep dive into some of the latest Gemini updates. 🎙️ Learn more about Gemini with personalization, Canvas, Audio Overviews, Deep Research, and more: Timestamps: 0:00 Introduction 0:59 Recent Gemini app launches 2:00 Introducing Canvas 5:12 Canvas in action 8:46 Canvas examples 12:02 Enhanced capabilities with Thinking Models 15:12 Deep Research in action 20:27 The future of agentic experiences 22:12 Deep Research and Audio Overviews 24:11 Personalization in Gemini app 27:50 Personalization in action 29:58 How personalization works: user data and privacy 32:30 The future of personalization

Google Gemini

119,472 views • 1 year ago

A LOT happened last week. ICYMI, Koray Kavukcuoglu (CTO of Google DeepMind and Chief AI Architect of Google) and Logan Kilpatrick discuss Gemini 3, the state of AI, and where we are on the path to AGI. Chapters: 0:00 - Intro 2:00 - Gemini 3 launch reception 4:16 - Continuous progress and innovation 6:47 - Key areas for Gemini improvement 11:45 - Product scaffolding for model improvement 13:56 - Chief AI architect role 17:04 - Engineering mindset and collaboration 18:37 - Future growth areas for Gemini 20:33 - From research to engineering mindset 23:22 - The rise of generative media 27:22 - Nano Banana Pro capabilities 29:31 - Towards unified model checkpoints 36:26 - Organizing for AI success 38:26 - Balancing exploration and scaling 41:40 - DeepMind's collaborative culture 45:21 - Innovating at Google 48:37 - Closing

Google AI

179,851 views • 8 months ago

Exclusive: Google just released two upgraded Gemini 1.5 models, 1.5-Pro-002 and 1.5-Flash-002 — achieving new, state-of-the-art performance across math. I sat down with Logan Kilpatrick to discuss the models, AI agents, AGI, and more. Timestamps: 00:00 Intro 01:01 Google rolls out two new Gemini models 2:18 What makes the new models so unique 3:40 Math improvements 8:20 Examples of Gemini 1.5 in the real world 10:19 Future problems AI could solve 12:54 Advantages of Gemini 1.5 16:50 Where beginners can get started 19:01 Advice to thrive in the new age of AI 22:02 Turning notes into podcasts 26:30 AI agents: Definitions and what they can do 31:22 What’s the final form for agents? 33:26 Proactive AI agent systems 36:00 Context windows in the agent era 41:01 AGI’s definition in Logan’s eyes 42:05 The current bottlenecks towards AGI

Rowan Cheung

269,830 views • 1 year ago

Really happy to see the interest around our “Hands-on with Gemini” video. In our developer blog yesterday, we broke down how Gemini was used to create it. We gave Gemini sequences of different modalities — image and text in this case — and had it respond by predicting what might come next. Devs can try similar things when access to Pro opens on 12/13 🚀. The knitting demo used Ultra⚡ All the user prompts and outputs in the video are real, shortened for brevity. The video illustrates what the multimodal user experiences built with Gemini could look like. We made it to inspire developers. When you’re building an app, you can get similar results (there’s always some variability with LLMs) by prompting Gemini with an instruction that allows the user to "configure" the behavior of the model, like inputting “you are an expert in science …” before a user can engage in the same kind of back and forth dialogue. Here’s a clip of what this looks like in AI Studio with Gemini Pro. We’ve come a long way since Flamingo 🦩 & PALI, looking forward to seeing what people build with it!

Oriol Vinyals

180,966 views • 2 years ago

Andrew Ng on how startups can build faster with AI, at AI Startup School in San Francisco. 00:31 - The Importance of Speed in Startups 01:13 - Opportunities in the AI Stack 02:06 - The Rise of Agent AI 04:52 - Concrete Ideas for Faster Execution 08:56 - Rapid Prototyping and Engineering 17:06 - The Role of Product Management 21:23 - The Value of Understanding AI 22:33 - Technical Decisions in AI Development 23:26 - Leveraging Gen AI Tools for Startups 24:05 - Building with AI Building Blocks 25:26 - The Importance of Speed in Startups 26:41 - Addressing AI Hype and Misconceptions 37:35 - AI in Education: Current Trends and Future Directions 39:33 - Balancing AI Innovation with Ethical Considerations 41:27 - Protecting Open Source and the Future of AI

Y Combinator

852,857 views • 1 year ago

New generative UI experiences in the Google Gemini and AI Mode in Search use Gemini 3’s multimodal understanding capabilities to make entire user interfaces for you from scratch — dynamically delivering what you need in a visual and interactive way. Google Research software engineer Yaniv Leviathan shares more about the process of building generative UI (hint: so much fun that the team couldn't stop playing with it) and how he's been using it so far.

Google

138,435 views • 8 months ago

(1/5) Gemini 3, our most intelligent model, is landing in Google Search today – starting with AI Mode. Excited that this is the first time we’re shipping a new Gemini model in Search on day one! 🚀 In Search, Gemini 3 with generative layouts will make it easy to get a rich understanding of anything on your mind. It has state-of-the-art reasoning, deep multimodal understanding and advanced agentic capabilities. That allows the model to shine when you ask it to explain advanced concepts or ideas – it reasons and can code interactive visuals in real-time. It can tackle your toughest questions like advanced science.

Robby Stein

94,877 views • 8 months ago

We’re expanding the Gemini API File Search tool 🔍 with 3 new updates that enable developers to more easily build multimodal RAG systems with enhanced precision: + Multimodal Support: By leveraging our Gemini Embedding 2 model, File Search can now reason across image and text simultaneously. + Custom Metadata Filtering: Bring structure to unstructured data by tagging files with custom key-value labels. This pre-filters your data and boosts search speed. + Exact citations: File Search can now capture and return the exact source (down to the page number) for every piece of information indexed. See multimodal File Search in action with our example app in Google AI Studio. Chat with your entire image and doc library, ask questions, and trace answers back to the source:

Google AI Developers

108,622 views • 2 months ago

Join our VP (Drastic) Research, Gemini co-Tech Lead Oriol Vinyals and our podcast host Hannah Fry as they discuss the evolution of our AI models, from AlphaGo to Gemini. They also cover agentic capabilities and why giving AI access to tools could lead to a new era of problem-solving. Listen now ↓ Timecodes: 00:00 Intro 02:30 Games and early AI agents 04:28 Weights 09:27 Architectures and a digital brain 10:24 Agentic behaviour 13:31 Digital body 14:09 Scaling 19:02 Data 20:59 Complex understanding and knowledge 25:14 Post training challenges 30:43 Reasoning 33:11 Planning 34:19 Systems 2 37:00 Memory 40:54 Gemini and agentic capabilities

Google DeepMind

57,351 views • 1 year ago

Why AI Progress Suddenly Feels Real - my conversation with Yann Dubois, who co-leads the Post-Training Frontiers team at OpenAI 00:00 - Intro 01:30 - Why recent AI progress feels like a step function 04:13 - Model reliability & the emotional rollercoaster of shipping GPT-5.5 07:33 - How OpenAI structures vertical and horizontal teams 09:49 - Improving model efficiency and test-time compute 12:32 - Yann's journey from Switzerland to OpenAI 15:37 - Reasoning in 2026: Real-world utility vs verifiable rewards 18:34 - GPT-5.5 Thinking vs Pro: Scaling test-time compute 20:09 - How reasoning models become more efficient 23:23 - Pre-training scaling and overcoming the data wall 27:03 - Multimodal data, synthetic data, and embodied AI 31:05 - Demystifying mid-training and post-training 37:21 - Does RL create new capabilities in AI? 38:53 - The challenges and frontier of scaling RL 43:09 - Is building AI models a craft or a strict science 48:21 - How AI models generalize across different domains 54:18 - How reinforcement learning cures AI hallucinations 56:04 - Negative generalization and conflicting instructions 58:05 - Can RL scale to law, medicine, and the broader economy? 1:00:19 - The evaluation bottleneck and Model as a Judge 1:04:21 - Continuous AI progress & continual learning 1:08:49 - Will foundation models eat the agent harness 1:11:23 - Why startups should focus on the last mile of AI

Matt Turck

100,966 views • 2 months ago

From rewriting Google’s search stack in the early 2000s to reviving sparse trillion-parameter models and co-designing TPUs with frontier ML research, Jeff Dean has quietly shaped nearly every layer of the modern AI stack. As Chief AI Scientist at Google and a driving force behind Gemini, Jeff has lived through multiple scaling revolutions from CPUs and sharded indices to multimodal models that reason across text, video, and code. We sat down with Jeff to unpack what it really means to “own the Pareto frontier,” why distillation is the quiet force behind every generation of faster, cheaper models, how energy not FLOPs is becoming the true constraint on AI compute, what it takes to co-design hardware and models 2–6 years into the future, why unified multimodal systems will outperform specialized ones, what it was like leading the charge to unify all of Google’s AI teams, and his prediction that deeply personalized models with access to your full digital context will redefine what useful AI looks like. Jeff Dean Google DeepMind Google

Latent.Space

529,014 views • 5 months ago

Get the inside story on the development of Gemini's coding capabilities. Listen as the product and research leads for Gemini share their philosophy on what makes a great coding model, the impact of "vibe coding," and the future of programming languages with Logan Kilpatrick, Connie Fan and Danny Tarlow. Timecodes: 0:00 Intro 1:10 Defining Early Coding Goals 6:23 Ingredients of a Great Coding Model 9:28 Adapting to Developer Workflows 11:40 The Rise of Vibe Coding 14:43 Code as a Reasoning Tool 17:20 Code as a Universal Solver 20:47 Evaluating Coding Models 24:30 Leveraging Internal Googler Feedback 26:52 Winning Over AI Skeptics 28:04 Performance Across Programming Languages 33:05 The Future of Programming Languages 36:16 Strategies for Large Codebases 41:06 Hill Climbing New Benchmarks 42:46 Short-Term Improvements 44:42 Model Style and Taste 47:43 2.5 Pro’s Breakthrough 51:06 Early AI Coding Experiences 56:19 Specialist vs. Generalist Models

Google AI Developers

65,474 views • 1 year ago