Philipp Schmid

@_philschmid • 99,986 subscribers

Agents & Gemini API, MTS @GoogleDeepMind | prev: Tech Lead at @huggingface, AWS ML Hero 🤗 Sharing my own views and AI News 🧑🏻‍💻 https://t.co/7IosdlO6RA

Shorts

We published a skill for Omni Flash so you can bootstrap video editing into your agent:
```
npx skills add google-gemini/gemini-skills --skill gemini-omni-flash-api
```
The skill covers the core workflows:
- Text to video
- Image references to video
- First frame to video
- Conversational video editing

54,060 views

MANUS AI: HYPE VS. REALITY 🔍 Yichao 'Peak' Ji (co-founder of Manus) confirmed rumors:
✅ Built on Anthropic Claude Sonnet, not their own foundation model
✅ Has access to 29 tools and uses the open-source Browser Use for browser control
✅ User communicates with the executor agent, not the planner or other agents
✅ Each user gets an isolated sandbox environment
✅ Outperforms OpenAI Deep Research on the GAIA benchmark
Building AI products doesn't require training your own foundation models. We're probably just scratching the surface of what existing models can do with the right tooling and integration!

202,928 views

Holy Shit Gemini 2.5 Pro Exp 0-shot @levelsio flight simulator: “In pure three.js, without downloading any assets or textures, create a flight simulator game where i can fly an airplane. Make sure it runs in the browser.”

165,632 views

You haven’t tried Google AI Studio yet?👀 We made it simpler! When you come to AIS for the first time, you will have a Default Gemini Project & API Key waiting for you! This should reduce time to first prompt, and help you start building faster! Give it a try!

72,649 views

Gemini Diffusion ~1000 tokens per second!⚡Text Diffusion doing bouncing balls. ⚽️

56,674 views

Gemini 3.1 Flash-Lite can generate and imagine websites on the fly while you browse. Each click leads to a newly generated site. See how it envisions "facebook in 2004" 🌐🔦 Link below to test. ⬇️

18,940 views

Character Consistency with Google Veo 3 now in Gemini API! 🤯 Use images as the starting frame to keep character consistency! Here is a Python script showing how to make consistent viral videos, like the ones you see on TikTok or YouTube Shorts:
1. Based on an idea, it generates a series of scene prompts using Gemini 2.5.
2. Generates an image based on the first scene using Imagen 3.
3. For each scene prompt, Veo 3 (fast) generates a video clip.
4. Uses Gemini 2.0 image editing to make sure the starting image fits each scene.
5. Combines the individual video clips into a single final video using MoviePy.
Veo 3 starts at $0.75 / second and Veo 3 Fast at $0.40 / second with audio. 📹 🔉
Prompt: “A realistic energy drink commercial for athletes.”

15,863 views
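
To make the per-second pricing in the post above concrete, here is a minimal cost estimate for a multi-clip video. The two rates ($0.75 and $0.40 per second) come from the post; the clip count and clip length are hypothetical values chosen only for illustration.

```python
from decimal import Decimal

# Per-second prices quoted in the post (USD, with audio).
RATE_HIGH = Decimal("0.75")
RATE_LOW = Decimal("0.40")


def estimate_cost(num_clips: int, seconds_per_clip: int, rate: Decimal) -> Decimal:
    """Return the total generation cost for num_clips clips of equal length."""
    return rate * num_clips * seconds_per_clip


# Hypothetical project: 4 scenes of 8 seconds each (assumed, not from the post).
NUM_CLIPS = 4
SECONDS_PER_CLIP = 8

for label, rate in (("$0.75/s", RATE_HIGH), ("$0.40/s", RATE_LOW)):
    total = estimate_cost(NUM_CLIPS, SECONDS_PER_CLIP, rate)
    print(f"{label}: {NUM_CLIPS} x {SECONDS_PER_CLIP}s = ${total}")
```

`Decimal` is used instead of floats so the dollar amounts stay exact.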

No plans to slow down! 🤝

10,864 views

Google Gemini 2.5 Pro Exp: “Write a p5.js script that simulates 25 particles in a vacuum space of a cylindrical container, bouncing within its boundaries. Use different colors for each ball and ensure they leave a trail showing their movement. Add a slow rotation of the container to give better view of what's going on in the scene. Make sure to create proper collision detection and physic rules to ensure particles remain in the container. Add an external spherical container. Add a slow zoom in and zoom out effect to the whole scene.” AK

13,644 views

Videos

Released today: Gemini 3.1 Flash TTS 🔊 A text-to-speech model you actually direct. Add [whispers] and it whispers. Add [shouting] and it shouts. Mid-sentence. E.g. "[asmr] Hey there, [deep and loud] TURN THIS UP, [asmr] how can I help you?" Now available in Google AI Studio & Gemini API.

118,138 views • 3 months ago
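
To show how the inline bracket tags in the example above partition a prompt, here is a small sketch that splits a tagged string into (style, text) segments. This is only an illustration of the tag layout, not how the model or the Gemini API processes the prompt; the function name and the `neutral` default style are assumptions.

```python
import re
from typing import List, Tuple

# Matches an inline direction such as [whispers] or [deep and loud].
TAG_PATTERN = re.compile(r"\[([^\]]+)\]")


def split_by_style_tags(prompt: str, default_style: str = "neutral") -> List[Tuple[str, str]]:
    """Split a prompt into (style, text) pairs; each tag applies until the next tag."""
    segments: List[Tuple[str, str]] = []
    style = default_style
    pos = 0
    for match in TAG_PATTERN.finditer(prompt):
        text = prompt[pos:match.start()].strip()
        if text:
            segments.append((style, text))
        style = match.group(1).strip()
        pos = match.end()
    tail = prompt[pos:].strip()
    if tail:
        segments.append((style, tail))
    return segments


example = "[asmr] Hey there, [deep and loud] TURN THIS UP, [asmr] how can I help you?"
for style, text in split_by_style_tags(example):
    print(f"{style!r}: {text!r}")
```

A helper like this can be handy for checking which direction applies to which words before the tagged string is sent unchanged.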

Google Colab CLI and Skills are out. Full Colab runtimes from your terminal.
- GPU/TPU provisioning (colab --gpu A100)
- Remote script execution (colab exec)
- Interactive console/REPL access
- Built-in agent skill
Tell your agent "fine-tune Gemma 3 1B on this dataset" and it provisions a GPU, runs the training, and downloads the adapter weights. Fully automatic.

47,292 views • 1 month ago

Yesterday we launched computer use in Gemini 3.5 Flash with browser, mobile, and desktop environments. I put together a quickstart for how to control an Android phone.
1. Single script to install the emulator from the terminal.
2. Basic agent loop with the Interactions API using `adb` to control the phone.
3. Connects to remote devices too (`adb connect :5555`).
4. The same pattern works for iOS with, e.g., simctl.

29,563 views • 23 days ago
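
The agent loop described in the Android quickstart post above might look roughly like the sketch below. It is not the quickstart's code: the action dictionary format and the `decide` callback (a stand-in for the model call) are hypothetical. The `adb shell input` and `adb exec-out screencap -p` commands are standard adb usage.

```python
import subprocess
from typing import Callable, List, Optional


def adb_command(action: dict, serial: Optional[str] = None) -> List[str]:
    """Translate a hypothetical action dict into an adb argument list."""
    base = ["adb"] + (["-s", serial] if serial else [])
    kind = action["type"]
    if kind == "tap":
        return base + ["shell", "input", "tap", str(action["x"]), str(action["y"])]
    if kind == "swipe":
        coords = [str(action[k]) for k in ("x1", "y1", "x2", "y2")]
        return base + ["shell", "input", "swipe"] + coords
    if kind == "text":
        # `input text` expects spaces to be written as %s.
        return base + ["shell", "input", "text", action["text"].replace(" ", "%s")]
    raise ValueError(f"unsupported action type: {kind}")


def capture_screen(serial: Optional[str] = None) -> bytes:
    """Grab a PNG screenshot from the device (requires adb and a device)."""
    base = ["adb"] + (["-s", serial] if serial else [])
    result = subprocess.run(base + ["exec-out", "screencap", "-p"],
                            capture_output=True, check=True)
    return result.stdout


def run_agent(goal: str,
              decide: Callable[[str, bytes, list], dict],
              take_screenshot: Callable[[], bytes],
              execute: Callable[[List[str]], None],
              serial: Optional[str] = None,
              max_steps: int = 10) -> list:
    """Observe, ask the model for one action, act; stop on 'done' or max_steps."""
    history: list = []
    for _ in range(max_steps):
        screenshot = take_screenshot()
        action = decide(goal, screenshot, history)
        if action["type"] == "done":
            break
        execute(adb_command(action, serial))
        history.append(action)
    return history


# Dry run with a scripted "model" and no device: commands are printed, not run.
scripted = iter([
    {"type": "tap", "x": 540, "y": 1200},
    {"type": "text", "text": "hello world"},
    {"type": "done"},
])
run_agent("open search and type a greeting",
          decide=lambda goal, shot, hist: next(scripted),
          take_screenshot=lambda: b"",
          execute=lambda cmd: print(" ".join(cmd)))
```

For a real device, `take_screenshot=capture_screen` and `execute=lambda cmd: subprocess.run(cmd, check=True)` would replace the dry-run callbacks.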

Hey Gemini make a website presenting yourself using the skill below. (Gemini 3.1 Pro Preview) + Google AI Studio + design-taste-frontend skill.

87,034 views • 4 months ago

Playing Pokemon with LLMs became a benchmark, but what about Super Mario? Turns out Google DeepMind Gemini 2.0 Flash can play Super Mario in real-time due to its low latency, multimodal input, and long context! 🎮 Gemini 2.0 Flash receives screenshots of the game and then generates Python code (PyAutoGUI commands) for each screenshot to control the game for a short period (the next 1 or 2 seconds). Add a while loop to it and you have it play in real-time. 🤯

153,067 views • 1 year ago
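
The screenshot-then-act loop from the Super Mario post above can be sketched as follows. This version simplifies the described setup: instead of executing model-generated PyAutoGUI code, a stand-in planner returns (key, hold seconds) pairs, and the capture and key-press functions are injected so the sketch runs without a game or the PyAutoGUI package. All function names here are hypothetical.

```python
from typing import Callable, List, Tuple

KeyAction = Tuple[str, float]  # (key name, seconds to hold the key)


def play(capture: Callable[[], bytes],
         plan: Callable[[bytes], List[KeyAction]],
         press: Callable[[str, float], None],
         max_frames: int,
         window_s: float = 1.0) -> List[KeyAction]:
    """For each screenshot, run the planned key presses within a short time window."""
    executed: List[KeyAction] = []
    for _ in range(max_frames):
        frame = capture()            # one screenshot per iteration
        budget = window_s            # the plan only covers the next short period
        for key, hold in plan(frame):
            hold = min(hold, budget)  # clip actions that overrun the window
            if hold <= 0:
                break
            press(key, hold)
            executed.append((key, hold))
            budget -= hold
    return executed


# Stand-ins: a real setup would take a screenshot, send it to the model,
# and hold keys down via a GUI automation library.
def fake_capture() -> bytes:
    return b"frame"


def fake_plan(frame: bytes) -> List[KeyAction]:
    return [("right", 0.75), ("a", 0.5)]


log = play(fake_capture, fake_plan,
           press=lambda key, hold: print(f"hold {key} for {hold}s"),
           max_frames=2)
```

Capping each plan to a fixed window keeps the loop close to real time: a stale plan is dropped and a fresh screenshot is taken.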

Gemini 3 in Chrome! ✨ Gemini can now auto browse, see multiple tabs, and take actions on your behalf. We’re releasing it first to Google AI Pro and Ultra subscribers in the US. Also new:
- Side panel, History recall + chat history + @ tabs mentions.
- Direct Nano Banana integration for image creation/editing.
- Deeper integration with Gmail, Calendar, YouTube via Google Gemini

62,222 views • 5 months ago

Gemini 3.5 Live Translate! We just shipped a real-time babel fish.
- 70+ languages, 2,000+ language pairs.
- Natural translated speech, works in noisy environments.
- Stays in sync with the speaker, no lag, no awkward pauses.
- Auto-detects the language being spoken.
Available today in Google Translate (Android & iOS), the Gemini API (Public Preview), and Google Meet (Private Preview). I genuinely think this is the beginning of the end of language barriers. Anyone can now speak and understand anyone.

17,419 views • 1 month ago

What if you could build an agent with its own computer in a single API call? At my Google I/O talk, I showed how to use Gemini Managed Agents and the new Interactions API to give your AI a secure, hosted Linux sandbox to execute code and manage its own memory.

21,726 views • 1 month ago

This is not a joke! 🐬 Excited to share DolphinGemma, the first audio-to-audio model for dolphin communication! Yes, a model that predicts tokens of dolphin speech!
> DolphinGemma is the first LLM trained specifically to understand dolphin language patterns.
> Leverages 40 years of data from Dr. Denise Herzing's unique collection
> Works like text prediction, trying to "complete" dolphin whistles and sounds
> Uses wearable hardware (Google Pixel 9) to capture and analyze sounds in the field.
> DolphinGemma is designed to be fine-tuned with new data
> Weights coming soon!
Research like this is why I love AI even more! ♥️

112,244 views • 1 year ago

Gemini 2.5 Flash can control a browser! Excited to share Gemini Browser Agent, a simple Python script example of how to use Google DeepMind Gemini 2.5 Flash and Browser Use to act as a general assistant! 🤯
Usage Examples:
1⃣ Single Query Mode: `python scripts/gemini-browser-use.py --url --query "Summarize the key features of Gemini 2.5 Flash."`
2⃣ Interactive Mode: Start an interactive session, optionally with a starting URL. `python scripts/gemini-browser-use.py`
Command-line options:
--model: The Gemini model to use (default: gemini-2.5-flash-preview-04-17)
--headless: Run the browser in headless mode
--url: Starting URL for the browser to navigate to before processing the query
--query: Run a single query and exit (instead of interactive mode)
Time to build a replication of Manus and OpenAI Operator powered by Gemini 2.5. Code below ⬇️

105,308 views • 1 year ago
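
The command-line options listed in the Gemini Browser Agent post above could be declared with `argparse` roughly as follows. This is a sketch of the interface as described, not the script's actual source; the help strings paraphrase the post, and `select_mode` is a hypothetical helper.

```python
import argparse

# Default model name as quoted in the post.
DEFAULT_MODEL = "gemini-2.5-flash-preview-04-17"


def build_parser() -> argparse.ArgumentParser:
    """Declare the four options described in the post."""
    parser = argparse.ArgumentParser(description="Gemini browser agent (sketch)")
    parser.add_argument("--model", default=DEFAULT_MODEL,
                        help="The Gemini model to use")
    parser.add_argument("--headless", action="store_true",
                        help="Run the browser in headless mode")
    parser.add_argument("--url", default=None,
                        help="Starting URL to open before processing the query")
    parser.add_argument("--query", default=None,
                        help="Run a single query and exit instead of interactive mode")
    return parser


def select_mode(args: argparse.Namespace) -> str:
    """Single-query mode when --query is given, otherwise interactive."""
    return "single" if args.query else "interactive"


# Parse a fixed argument list so the example does not depend on sys.argv.
demo_args = build_parser().parse_args(
    ["--headless", "--query", "Summarize the key features of Gemini 2.5 Flash."]
)
print(select_mode(demo_args), demo_args.model, demo_args.headless)
```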

I asked Google DeepMind Gemini 3.1 Pro to watch the launch video of Cursor SDK and create a production script. Then tasked it to re-create the video 1:1 with Remotion 0-shot. Video Understanding capabilities.🔥Original Cursor video in answer thread.

23,463 views • 2 months ago

This is the easiest way to chat with a complete GitHub repository!👀 Replace "github" with "gitingest" in the URL, and you get the whole repo as a single string. Paste it into AI Studio and use Google DeepMind Gemini 2.0 Flash with its million-token context to ask questions! 🤯 Here is how you can chat with the whole Python SDK docs. ⬇️

97,475 views • 1 year ago
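
The URL swap from the gitingest post above is simple enough to script. Below is a minimal sketch that rewrites only the host part of a GitHub URL; the function name and the example repository URL are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def to_gitingest_url(github_url: str) -> str:
    """Swap the github.com host for gitingest.com, keeping path and query intact."""
    parts = urlsplit(github_url)
    if parts.netloc.lower() not in ("github.com", "www.github.com"):
        raise ValueError(f"not a GitHub URL: {github_url}")
    return urlunsplit(parts._replace(netloc="gitingest.com"))


# Hypothetical repository URL, used only to show the rewrite.
print(to_gitingest_url("https://github.com/example-org/example-repo"))
```

Replacing the host through `urlsplit` avoids accidentally rewriting a path or repository name that happens to contain the word "github".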

We will enter a new era for vibe coding! The new Gemini 2.5 Pro can now zero-shot full single-page applications and complete responsive mobile games, and convert UI screenshots precisely to working code. Can’t wait to see what Cursor, GitHub, @windsurf, bolt.new, v0, Replit, Cline will enable with this! Here are my top vibe tests so far: “Build a Healthcare CRM to streamline patient communication, appointment scheduling, record management, and billing processes…”

79,810 views • 1 year ago

Google Gemini 2.5 Pro Exp: “write a p5.js program that shows a ball bouncing inside a spinning hexagon. The ball should be affected by gravity and friction, and it must bounce off the rotating walls realistically, add sliders to adjust parameters.” Chubby♨️

81,994 views • 1 year ago

The easiest way to build an MCP Server using Google DeepMind Gemini 2.5 Pro and get started!
1. Use Gitingest to get all the code and docs from the FastMCP repo
2. Download the code into a txt file
3. Go to AI Studio, upload the file, define what kind of MCP Server you want to build
4. Gemini 2.5 Pro builds it for you.

75,968 views • 1 year ago

New Example! We built a fullstack open “Deep Research” quickstart for Google DeepMind Gemini 2.5. It dynamically searches the web, reflects on results, and delivers comprehensive answers with citations in a nice UI with streaming! Built using React and LangChain LangGraph! 🚀
TL;DR:
🔄 Agent iteratively loops through research and reflection until it gathers sufficient information.
🔍 Dynamic query generation, web research via Gemini native Google Search tool, and reflective reasoning.
🧠 Supports different search efforts (low, medium, high) for width and depth of search
🛠️ React frontend, LangGraph backend, Tailwind CSS + Shadcn UI components.
🐳 Easily run locally or deploy with Docker.
📄 Answers include citations from gathered web sources.

55,535 views • 1 year ago
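
The research-and-reflection loop described in the Deep Research post above can be sketched in plain Python. This is not the quickstart's LangGraph code: `search`, `reflect`, and `answer` are hypothetical callbacks standing in for the model and search calls, and the loop budgets per effort level are assumed numbers chosen only for illustration.

```python
from typing import Callable, Dict, List, Tuple

Finding = Tuple[str, str]  # (source URL, extracted note)

# Assumed mapping from effort level to maximum research loops (illustrative).
EFFORT_BUDGET: Dict[str, int] = {"low": 1, "medium": 3, "high": 10}


def deep_research(question: str,
                  search: Callable[[str], List[Finding]],
                  reflect: Callable[[str, List[Finding]], List[str]],
                  answer: Callable[[str, List[Finding]], dict],
                  effort: str = "medium") -> dict:
    """Search, reflect on gaps, and repeat until reflect returns no follow-ups."""
    queries = [question]
    findings: List[Finding] = []
    for _ in range(EFFORT_BUDGET[effort]):
        for query in queries:
            findings.extend(search(query))
        queries = reflect(question, findings)  # follow-up queries for open gaps
        if not queries:                         # enough information gathered
            break
    return answer(question, findings)


# Stand-in callbacks so the sketch runs without any API access.
def fake_search(query: str) -> List[Finding]:
    return [(f"https://example.com/{abs(hash(query)) % 1000}", f"note on {query}")]


def fake_reflect(question: str, findings: List[Finding]) -> List[str]:
    return [f"{question} (details)"] if len(findings) < 2 else []


def fake_answer(question: str, findings: List[Finding]) -> dict:
    return {"text": f"Answer to: {question}",
            "citations": [source for source, _ in findings]}


result = deep_research("What is new in Gemini 2.5?", fake_search, fake_reflect, fake_answer)
print(result["text"], len(result["citations"]), "citations")
```

With the stand-ins above, the medium effort level stops after two rounds because the reflection step reports no remaining gaps, while the low level stops at its budget of one round.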

PURE INSANITY! Here is a 5 minute long compilation showcasing the craziest things people are generating with Google DeepMind VEO 3. 🤯 You won't believe your eyes! Sound on🔊 [source: reddit r/singularity]

54,115 views • 1 year ago

Full circle of Gemini API Features: Gemini Deep Research > Gemini 3 Flash (Agent) + Subagents (Nano Banana, Gemini 3.1 Flash TTS) + Skills (HeyGen HyperFrames) > 16 Parallel Variants. No manual script. No video editing software. Just a single prompt to Gemini 3 Flash orchestrating subagents with HyperFrames SKILL to create 16 distinct videos! 🚀 [video: Swiss print, A high-contrast, light-mode technical manual aesthetic with a firm, auditor-like voice.]

13,555 views • 3 months ago

New LLMs that control UIs! ByteDance Research releases UI-TARS, a fine-tuned GUI agent that integrates reasoning and action capabilities into a single vision-language model. Think of computer use but open. 👀
TL;DR:
3️⃣ Available in 3 sizes: 2B, 7B, and 72B parameters
🧠 Trained Qwen2-VL models with SFT & DPO
🥇 72B version achieves 82.8% on VisualWebBench (beating GPT-4 and Claude)
🏆 Achieves state-of-the-art results on 10+ GUI agent benchmarks
💡 Reasons before taking an action
🧑🏻‍💻 Can click, long press, type, scroll, open app, navigate back/home, wait
🤗 Released under Apache 2.0 on Hugging Face

48,157 views • 1 year ago