
Philipp Schmid
@_philschmid • 80,578 subscribers
Agents & Gemini API, MTS @GoogleDeepMind | prev: Tech Lead at @huggingface, AWS ML Hero 🤗 Sharing my own views and AI News 🧑🏻💻 https://t.co/7IosdlO6RA

Released today: Gemini 3.1 Flash TTS 🔊 A text-to-speech model you actually direct. Add [whispers] and it whispers. Add [shouting] and it shouts. Mid-sentence. E.g., "[asmr] Hey there, [deep and loud] TURN THIS UP, [asmr] how can I help you?" Now available in Google AI Studio & Gemini API.
Philipp Schmid • 117,922 views • 1 month ago
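
To preview which stretch of text each inline tag governs before sending a prompt, a small helper can split the string on the bracketed tags. This is only a sketch: `split_by_style` and the `"neutral"` default are hypothetical names, and it assumes a tag applies until the next tag appears, which the post implies but does not spell out.

```python
import re

# Matches an inline direction such as [whispers] or [deep and loud].
TAG = re.compile(r"\[([^\]]+)\]")


def split_by_style(prompt, default="neutral"):
    """Return (style, text) pairs, assuming a tag holds until the next tag."""
    segments = []
    style = default
    pos = 0
    for match in TAG.finditer(prompt):
        text = prompt[pos:match.start()].strip()
        if text:
            segments.append((style, text))
        style = match.group(1).strip()
        pos = match.end()
    tail = prompt[pos:].strip()
    if tail:
        segments.append((style, tail))
    return segments


if __name__ == "__main__":
    demo = "[asmr] Hey there, [deep and loud] TURN THIS UP, [asmr] how can I help you?"
    for style, text in split_by_style(demo):
        print(f"{style:>14}: {text}")
```

The helper never calls the TTS model; it only makes the mid-sentence style switches visible.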

What if you could build an agent with its own computer in a single API call? At my Google I/O talk, I showed how to use Gemini Managed Agents and the new Interactions API to give your AI a secure, hosted Linux sandbox to execute code and manage its own memory.
Philipp Schmid • 21,116 views • 12 days ago

Gemini 3 in Chrome! ✨ Gemini can now auto browse, see multiple tabs and take actions on your behalf. We’re releasing it first to Google AI Pro and Ultra subscribers in the US. Also new: - Side panel, history recall + chat history + @ tab mentions. - Direct Nano Banana integration for image creation/editing. - Deeper integration with Gmail, Calendar, YouTube via Google Gemini
Philipp Schmid • 62,158 views • 4 months ago

Playing Pokemon with LLMs became a benchmark, but what about Super Mario? Turns out Google DeepMind Gemini 2.0 Flash can play Super Mario in real-time thanks to its low latency, multimodal input and long context! 🎮 Gemini 2.0 Flash receives screenshots of the game and then generates Python code (PyAutoGUI commands) for each screenshot to control the game for a short period (the next 1 or 2 seconds). Add a while loop to it and you have it play in real-time. 🤯
Philipp Schmid • 153,060 views • 1 year ago
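
The screenshot-to-code loop described above can be sketched roughly as follows. This is not the original script: `play`, `capture`, `ask_model`, and `run_code` are hypothetical names, and the model call and screen capture are passed in as plain callables so the loop itself runs without a game, an API key, or PyAutoGUI installed.

```python
import time


def play(capture, ask_model, run_code, max_steps=None, pause=0.0):
    """Repeat: grab a screenshot, ask the model for code, run that code.

    capture()            -> a screenshot object (assumed; e.g. a PIL image)
    ask_model(shot, log) -> a string of generated control code
    run_code(code)       -> executes the generated code
    """
    log = []
    step = 0
    # The post's "while loop": unbounded unless max_steps is given.
    while max_steps is None or step < max_steps:
        shot = capture()
        code = ask_model(shot, log)
        run_code(code)
        log.append(code)
        step += 1
        if pause:
            time.sleep(pause)
    return log


if __name__ == "__main__":
    # Stand-ins so the sketch runs anywhere; a real setup would capture the
    # screen and send the image to the model instead.
    fake_frames = iter(["frame-1", "frame-2"])
    history = play(
        capture=lambda: next(fake_frames),
        ask_model=lambda shot, log: f"# move right after seeing {shot}",
        run_code=print,
        max_steps=2,
    )
    print(len(history), "steps played")
```

Passing earlier generated code back in as `log` is one assumed way to use the long context the post mentions. Executing model-generated code directly is risky outside a sandbox.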

This is not a joke! 🐬 Excited to share DolphinGemma, the first audio-to-audio model for dolphin communication! Yes, a model that predicts tokens of dolphin speech! > DolphinGemma is the first LLM trained specifically to understand dolphin language patterns. > Leverages 40 years of data from Dr. Denise Herzing's unique collection > Works like text prediction, trying to "complete" dolphin whistles and sounds > Uses wearable hardware (Google Pixel 9) to capture and analyze sounds in the field. > DolphinGemma is designed to be fine-tuned with new data > Weights coming soon! Research like this is why I love AI even more! ♥️
Philipp Schmid • 112,155 views • 1 year ago

Gemini 2.5 Flash can control a browser! Excited to share Gemini Browser Agent, a simple Python script example showing how to use Google DeepMind Gemini 2.5 Flash and Browser Use as a general assistant! 🤯 Usage Examples: 1⃣ Single Query Mode: `python scripts/gemini-browser-use.py --url --query "Summarize the key features of Gemini 2.5 Flash."` 2⃣ Interactive Mode: Start an interactive session, optionally with a starting URL. `python scripts/gemini-browser-use.py` Command-line options: --model: The Gemini model to use (default: gemini-2.5-flash-preview-04-17) --headless: Run the browser in headless mode --url: Starting URL for the browser to navigate to before processing the query --query: Run a single query and exit (instead of interactive mode) Time to build a replication of Manus and OpenAI Operator powered by Gemini 2.5. Code below ⬇️
Philipp Schmid • 105,308 views • 1 year ago
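
The four command-line options listed above map naturally onto `argparse`. The following is a reconstruction from the option descriptions only, not the script's actual source; `build_parser` and `pick_mode` are hypothetical names, and the browser and model wiring is left out.

```python
import argparse


def build_parser():
    """CLI matching the options described in the post (assumed layout)."""
    parser = argparse.ArgumentParser(description="Gemini browser agent (sketch)")
    parser.add_argument(
        "--model",
        default="gemini-2.5-flash-preview-04-17",
        help="The Gemini model to use",
    )
    parser.add_argument(
        "--headless",
        action="store_true",
        help="Run the browser in headless mode",
    )
    parser.add_argument(
        "--url",
        default=None,
        help="Starting URL to open before processing the query",
    )
    parser.add_argument(
        "--query",
        default=None,
        help="Run a single query and exit (instead of interactive mode)",
    )
    return parser


def pick_mode(args):
    """Single-query mode when --query is given, interactive otherwise."""
    return "single-query" if args.query else "interactive"


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"model={args.model} headless={args.headless} mode={pick_mode(args)}")
```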

This is the easiest way to chat with a complete GitHub repository! 👀 Replace "github" with "gitingest" in the URL, and you get the whole repo as a single string. Paste it into AI Studio and use Google DeepMind Gemini 2.0 Flash Million Token Context video to ask questions! 🤯 Here is how you can chat with the whole Python SDK docs. ⬇️
Philipp Schmid • 97,475 views • 1 year ago
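
The URL swap can also be done programmatically. A minimal sketch, assuming the repository link is a plain `github.com` URL; `to_gitingest` is a hypothetical helper name and the repository in the usage line is a placeholder.

```python
from urllib.parse import urlsplit, urlunsplit


def to_gitingest(url):
    """Swap the github.com host for gitingest.com, keeping path and query."""
    parts = urlsplit(url)
    if parts.netloc.lower() not in ("github.com", "www.github.com"):
        raise ValueError(f"not a github.com URL: {url!r}")
    return urlunsplit(parts._replace(netloc="gitingest.com"))


if __name__ == "__main__":
    # Placeholder repository, for illustration only.
    print(to_gitingest("https://github.com/example-org/example-repo"))
```

Parsing the URL instead of calling `str.replace` avoids touching a "github" that happens to appear in the repository path.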

We will enter a new era for vibe coding! The new Gemini 2.5 Pro can now zero-shot full single-page applications and complete responsive mobile games, and convert UI screenshots precisely to working code. Can’t wait to see what Cursor, GitHub, @windsurf, bolt.new, v0, Replit, Cline will enable with this! Here are my top vibe tests so far: “Build a Healthcare CRM to streamline patient communication, appointment scheduling, record management, and billing processes…..”
Philipp Schmid • 79,782 views • 1 year ago

The easiest way to build an MCP Server using Google DeepMind Gemini 2.5 Pro and get started! 1. Use Gitingest to get all the code and docs from the FastMCP repo 2. Download the code into a txt file 3. Go to AI Studio, upload the file, define what kind of MCP Server you want to build 4. Gemini 2.5 Pro builds it for you.
Philipp Schmid • 75,947 views • 1 year ago

Full circle of Gemini API Features: Gemini Deep Research > Gemini 3 Flash (Agent) + Subagents (Nano Banana, Gemini 3.1 Flash TTS) + Skills (HeyGen HyperFrame) > 16 Parallel Variants. No manual script. No video editing software. Just a single prompt to Gemini 3 Flash orchestrating subagents with HyperFrames SKILL to create 16 distinct videos! 🚀 [video: Swiss print, A high-contrast, light-mode technical manual aesthetic with a firm, auditor-like voice.]
Philipp Schmid • 12,518 views • 1 month ago

New Example! We built a fullstack open “Deep Research” quickstart for Google DeepMind Gemini 2.5. It dynamically searches the web, reflects on results, and delivers comprehensive answers with citations in a nice UI with streaming! Built using React and LangChain LangGraph! 🚀 TL;DR: 🔄 Agent iteratively loops through research and reflection until it gathers sufficient information. 🔍 Dynamic query generation, web research via Gemini native Google Search tool, and reflective reasoning. 🧠 Supports different search efforts (low, medium, high) for width and depth of search 🛠️ React frontend, LangGraph backend, Tailwind CSS + Shadcn UI components. 🐳 Easily run locally or deploy with Docker. 📄 Answers include citations from gathered web sources.
Philipp Schmid • 55,535 views • 1 year ago
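
The research-and-reflection loop from the TL;DR can be sketched in plain Python. This is an illustration under assumptions, not the quickstart's LangGraph code: `research`, `generate_queries`, `search`, and `reflect` are hypothetical names, the model and search calls are injected as callables, and the loop budgets per effort level are made-up values. Only search depth is modeled here, not width.

```python
# Assumed loop budgets per effort level; the quickstart's real values may differ.
MAX_LOOPS = {"low": 1, "medium": 3, "high": 10}


def research(question, generate_queries, search, reflect, effort="medium"):
    """Search, reflect, and follow up until the sources look sufficient.

    generate_queries(question) -> list of search queries
    search(query)              -> list of sources for that query
    reflect(question, sources) -> (is_sufficient, follow_up_queries)
    """
    sources = []
    queries = generate_queries(question)
    for _ in range(MAX_LOOPS[effort]):
        for query in queries:
            sources.extend(search(query))
        sufficient, follow_ups = reflect(question, sources)
        if sufficient or not follow_ups:
            break
        queries = follow_ups
    return sources


if __name__ == "__main__":
    # Toy stand-ins: stop once two sources have been gathered.
    found = research(
        "What is LangGraph?",
        generate_queries=lambda q: [q],
        search=lambda q: [f"source for: {q}"],
        reflect=lambda q, s: (len(s) >= 2, ["a follow-up query"]),
    )
    print(found)
```

The returned sources are what a final answering step would cite.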

What if you could talk to your Telegram bot and it actually talked back? See how a voice-enabled Telegram bot was built with the Gemini Interactions API in ~400 lines of Python. Send a voice note in any language; Gemini understands the audio and replies with text and a spoken voice message. Uses: - Gemini 3.1 Flash Lite for reasoning, 3.1 Flash TTS for speech - Interactions API handles multi-turn memory server-side - Native audio input, no transcription step needed - Deploys to Cloud Run with scale-to-zero Awesome work by Thor 雷神 ⚡️. 🤗
Philipp Schmid • 11,143 views • 1 month ago

Excited to share a Google DeepMind Gemini 2.0 Flash Image Generation and Editing Quickstart. We built a Next.js reference app showing how to use the new image editing feature of Gemini 2.0 Flash. Demo to test ⬇️ > Generate images from text prompts using Gemini 2.0 Flash > Or upload an image and edit it using prompts
Philipp Schmid • 52,406 views • 1 year ago

New LLMs that control UIs! ByteDance Research releases UI-TARS, a fine-tuned GUI agent that integrates reasoning and action capabilities into a single vision-language model. Think of computer use but open. 👀 TL;DR: 3️⃣ Available in 3 sizes: 2B, 7B, and 72B parameters 🧠 Trained Qwen2-VL models with SFT & DPO 🥇 72B version achieves 82.8% on VisualWebBench (beating GPT-4 and Claude) 🏆 Achieves state-of-the-art results on 10+ GUI agent benchmarks 💡 Reasons before taking an action 🧑🏻‍💻 Can click, long press, type, scroll, open app, navigate back/home, wait 🤗 Released under Apache 2.0 on Hugging Face
Philipp Schmid • 48,157 views • 1 year ago

Gmail is entering the Gemini era with new features powered by Gemini 3. AI overviews in your inbox! - Ask natural language questions. - Generate personalized context-aware replies. - Advanced grammar, tone, and style checks. - AI Inbox highlights to-dos from important messages.
Philipp Schmid • 12,729 views • 4 months ago