🥳 Introducing MiniCPM-o 4.5: the first full-duplex omni-modal LLM in the open-source community 🎬🎙️ 🔥 Key Highlights: • Full-duplex Omni-modal Live Streaming: The model can see, listen, and speak simultaneously in a real-time conversation without mutual blocking • Proactive Interaction: Moving beyond reactive QA to proactive interaction, such as...

OpenBMB

8,681 subscribers

397,802 views • 5 months ago •via X (Twitter)

Comedy Education Science & Technology

Related Videos

MiniCPM-o 4.5: Seeing, Listening, and Speaking — All at Once. 👁️👂🗣️ ✨Beyond traditional turn-taking, we’ve built a Native Full-Duplex engine that allows a 9B model to see, listen, and speak in one concurrent, non-blocking stream. Watch how it masters real-world complexity in real-time: 🔔 Proactive Auditory Interaction: Interrupts itself to alert you when it hears a "Ding!" while reading cards. 🎨 Temporal Flow Tracking: Follows your pen in real-time, narrating and "mind-reading" your drawing as you sketch. 🍎 Omni-Perception: Scans groceries & identifies prices on the fly. ✨Why it’s a category-leader: 📌Performance: Surpasses GPT-4o and Gemini 2.0 Pro on OpenCompass (Avg. 77.6). 📌Architecture: End-to-end fusion of SigLip2, Whisper, and CosyVoice2 on a Qwen3-8B base. 📌Efficiency: Full-duplex live streaming now runs locally on PCs via llama.cpp-omni. The era of "Wait-and-Response" AI is over. Proactive, real-time intelligence is now open-source. 🚀Experience it on Hugging Face: 🔗 #MiniCPM #Omnimodal #FullDuplex #EdgeAI #OpenSource #ComputerVision

OpenBMB

115,462 views • 5 months ago

🚀 Introducing MiniCPM-V 2.6! 🔥 1. Surpassing GPT-4V in single image, multi-image and video understanding 📸🎥 2. Outperforms GPT-4o mini and Gemini 1.5 on OpenCompass 🏆 3. Real-time video analysis on iPad 📱💨 Try out the best on-device multimodal LLM here! 👑 GitHub: Huggingface: #MLLM #MiniCPM

OpenBMB

196,359 views • 2 years ago

🚀 Excited to announce the technical report of MiniCPM-o 4.5! MiniCPM-o 4.5 transitions #AI interaction from traditional turn-based processing to a real-time, native full-duplex stream-based paradigm. 🌊 The Omni-Flow Framework: Instead of traditional VAD-based workarounds, we introduce the #Omni-#Flow framework. This unified stream paradigm aligns video, audio, and text on a synchronized millisecond timeline. • Native Full-Duplex: Simultaneous perception and response. • Proactive Interaction: Natively manages turn-taking without external VAD, supports proactive reminding. 📉 9B Scale, SOTA Performance: MiniCPM-o 4.5 demonstrates SOTA multimodal intelligence at its scale: • Multimodal Benchmarks: Comparable to #Gemini 2.5 Flash on MMBench EN (87.6) and MathVista (80.1). • Streaming Evaluation: 54.4% win rate on LiveSports-3K-CC, surpassing specialized models. 💻 The Ultimate Edge AI: Fully Functional without Network Connection. We are providing one-click installers for Windows (12 GB VRAM, RTX 5070) and macOS (M1-M5 Max/ M5 Pro). • Local API Support: Deploy your own inference server to integrate native full-duplex into custom apps. • Free Access: We are offering free community API services for exploration. • 100% Private: Your data never leaves your machine. Deploy in under 10 minutes. 🛠️👇 👐 Join the Open Future: The weights are open. The protocol is public. 📄 Technical Report: 💻 GitHub: 🤗 HuggingFace: 🌐 Web Demo: #MiniCPMo #OpenSourceAI #EdgeAI #MachineLearning #ComputerVision #LLM

OpenBMB

147,824 views • 3 months ago

💥 Introducing MiniCPM-o 2.6: an 8B-size, GPT-4o-level omni model that runs on device ✨ Highlights: ~Matches GPT-4o-202405 in vision, audio and multimodal live streaming ~End-to-end real-time bilingual audio conversation ~Voice cloning & emotion control ~Advanced OCR & video understanding ~Offline iPad-compatible multimodal live streaming 🔗 Try it out: GitHub: HF: Demo:

OpenBMB

97,762 views • 1 year ago

This Chinese open source AI agent is the first thing that’s made me say wow out loud. It can see. It can listen. It can talk. All at the same time. MiniCPM-O 4.5 does full duplex omni-modal chat. Like a phone call with vision. → Interrupt it → It interrupts you → It reacts live Save this video, you’ll realize local AI just caught up. Want the SOP? DM me. 💬

Julian Goldie SEO

217,719 views • 5 months ago

🚀 Introducing MiniCPM-V 4.5 8B: pushing the boundary of multimodal AI! ～ SOTA VL Capability: Surpasses GPT-4o, Gemini 2.0 Pro, Qwen2.5-VL 72B on OpenCompass! ～ "Eagle Eye" Video: 96x visual token compression for high refresh rate and long video understanding ～ Controllable Hybrid Fast/Deep Thinking ～ Strong OCR & Doc Parsing: Surpasses GPT-4o & Gemini 2.5 on OmniDocBench Get ready for the future of multimodal AI 👉 Huggingface｜ Github｜ Gradio｜ #AI #MiniCM #GPT #Gemini #OpenBMB #ArtificialIntelligence #MachineLearning

OpenBMB

25,136 views • 11 months ago

🚀 MiniCPM enters the physical world — enabling robots to understand, remember, and act. We open-source MiniCPM-Robot, our first embodied AI model series, including: 🤖 MiniCPM-RobotManip — a 1.5B general-purpose Vision-Language-Action (VLA) model for robotic manipulation. 🐕 MiniCPM-RobotTrack — a compact model for real-world target tracking. ⚡ PhyAI — a high-performance inference framework built for embodied models. Together, they bring efficient, practical, and open embodied intelligence closer to real-world robots. ⭐ GitHub: 🤗 MiniCPM-RobotManip: 🤗 MiniCPM-RobotTrack:

OpenBMB

327,126 views • 12 days ago

Introducing MiniCPM 4.1-8B: First Open-Source Reasoning LLM with Trainable Sparse Attention ✅ Strong Reasoning Capability: Surpasses similar-sized models on 15 tasks! ✅ Fast Generation: 3x decoding speedup for reasoning ✅ Efficient Architecture: Trainable sparse attention, frequency-ranked speculative decoding Download Models: Huggingface: Github: Technical Report: #AI #MiniCPM #LLM #OpenBMB #ArtificialIntelligence #MachineLearning

OpenBMB

19,236 views • 10 months ago

New open Omni model released! 👀 OpenBMB MiniCPM-o 2.6 is a new 8B-parameter, any-to-any multimodal model that can understand vision, speech, and language and runs on edge devices like phones and tablets. TL;DR: 🧠 8B total parameters (SigLip-400M + Whisper-300M + ChatTTS-200M + Qwen2.5-7B) 🔥 Outperforms GPT-4V on visual tasks with 70.2 average score on OpenCompass 🎙️ Best-in-class bilingual speech capabilities with real-time conversation and voice cloning 🎬 Supports multimodal streaming with continuous video/audio processing 📱 Runs on iPads and phones and supports 30+ languages 🖼️ Processes images up to 1.8M pixels (1344x1344) with OCR capabilities 🛠️ Easy integration with popular frameworks (llama.cpp, vLLM, Gradio) 🤗 Commercial friendly (< 1 million DAU) and available on Hugging Face

Philipp Schmid

19,409 views • 1 year ago

🤔 The world’s best small models? We immediately compared Mistral-3-8B with our previous-gen model, MiniCPM-4.1 (both in thinking mode) 😂 The findings are compelling: ✅ MiniCPM is still ~2x faster, maintaining a massive speed lead ✅ It remains a full generation ahead in capabilities (excluding math/code) For developers prioritizing efficiency and speed, MiniCPM is undeniably the world's best small model.

OpenBMB

260,271 views • 7 months ago

VITA: Towards Open-Source Interactive Omni Multimodal LLM. The remarkable multimodal capabilities and interactive experience of GPT-4o underscore their necessity in practical applications, yet open-source models rarely excel in both areas. In this paper, we introduce VITA, the first-ever open-source Multimodal Large Language Model (MLLM) adept at simultaneous processing and analysis of Video, Image, Text, and Audio modalities, and meanwhile has an advanced multimodal interactive experience. Starting from Mixtral 8x7B as a language foundation, we expand its Chinese vocabulary followed by bilingual instruction tuning. We further endow the language model with visual and audio capabilities through two-stage multi-task learning of multimodal alignment and instruction tuning. VITA demonstrates robust foundational capabilities of multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of both unimodal and multimodal benchmarks. Beyond foundational capabilities, we have made considerable progress in enhancing the natural multimodal human-computer interaction experience. To the best of our knowledge, we are the first to exploit non-awakening interaction and audio interrupt in MLLM. VITA is the first step for the open-source community to explore the seamless integration of multimodal understanding and interaction. While there is still lots of work to be done on VITA to get close to closed-source counterparts, we hope that its role as a pioneer can serve as a cornerstone for subsequent research.

AK

23,958 views • 1 year ago

🚀 Introducing AgentCPM-Explore: The First Open-Source 4B-Agent Model to Conquer GAIA & Complex Real-World Tasks! 🤗 Hugging Face: 🔗 GitHub: ✨ Key Highlights: ✅ SOTA Agentic Performance: Sets a new benchmark for 4B-scale agent models, outperforming all peers, surpassing 8B models, and rivaling select 30B+ and closed-source LLMs. 🧠 Deep Research Capability: Excels at long-horizon reasoning, supports 100+ turns of autonomous interaction with multi-source cross-validation, human-like self-correction, and dynamic tool use + strategy adaptation, just like a real researcher! 🔓 Full-Stack Open Source: We’re open-sourcing the entire end-to-end agent stack, not just the model! Empower your own innovations with - AgentRL: Asynchronous reinforcement learning framework - AgentDock: Secure, extensible tool sandbox - AgentToLeaP: A one-click evaluation platform for agent tool-learning capabilities - Full training data pipeline & reproducible workflows #AgentCPM #OpenSourceAI #AgenticAI #AI #GAIA #LLM #OpenBMB #AIAgents #HuggingFace

OpenBMB

13,996 views • 6 months ago

Introducing Proactor, the world’s first proactive AI agent — context-aware, memory-augmented, and acting in real time before you ask. It doesn’t wait for prompts. It joins your conversation to deliver real-time transcription and live summary, instantly analyzing the discussion to uncover potential needs and actionable tasks — whether explicitly stated or merely implied — before you fully recognize them. In the moment, it acts on sparked curiosity, proactively researching topics as they arise and immediately executing on tasks identified within the flow. Get real-time insights, answers, meeting notes, and tasks handled — before you even ask. Whether in university lectures and other Education settings, guiding Students, boosting Productivity at work, or chiming in during engineering chats to surface helpful code, Proactor always stays two steps ahead. Try Proactor to experience a proactive AI agent, and check out what it can do in the thread below.

Proactor

478,424 views • 1 year ago

NVIDIA just removed one of the biggest friction points in Voice AI. PersonaPlex-7B is an open-source, full-duplex conversational model. Free, open source (MIT), with open model weights on Hugging Face 🤗 Links to repo and weights in 🧵↓ The traditional ASR → LLM → TTS pipeline forces rigid turn-taking. It’s efficient, but it never feels natural. PersonaPlex-7B changes that. This NVIDIA model can listen and speak at the same time. It runs directly on continuous audio tokens with a dual-stream transformer, generating text and audio in parallel instead of passing control between components. That unlocks: → instant back-channel responses → interruptions that feel human → real conversational rhythm Persona control is fully zero-shot! If you’re building low-latency assistants or support agents, this is a big step forward 🔥

Charly Wargnier

565,353 views • 6 months ago

Google I/O. Some thoughts: the model seems to be multimodal in, but not multimodal out. Imagen-3 and music gen models are still detached from Gemini as standalone components. Merging all modality I/O natively is the inevitable future: - enables tasks like "use a more robotic voice", "speak 2x faster", "edit this image iteratively", and "generate consistent comic strips". - does not lose information across modal boundaries, e.g. emotion and background sound. - opens up new in-context capabilities. You can teach the model to combine different senses in novel ways with few-shot examples. GPT-4o doesn't do it perfectly, but it gets the form factor correct. In the words of Andrej's LLM-as-OS analogy: we need the model to natively support as many file extensions as possible. One thing Google is doing right: they are finally making serious efforts to integrate AI into the search box. I sense the agent flow: planning, real-time browsing, and multimodal input, all from the landing page. Google's strongest moat is distribution. Gemini doesn't have to be the best model to be the most used one in the world.

Jim Fan

209,643 views • 2 years ago

I think people are still sleeping on realtime voice driven consumer experiences.. Releasing my latest side project geared towards kids: Upon A Voice ✨ You can watch my daughter interact with it in the video below but the highlights are as follows: - generates a "live" storybook based on who your child is and how they look. - the story is completely driven by voice and what the kid wants to see in the book. - images are generated in real time keeping the character consistent. - at the end, you get a full PDF that you can print and hand to your child to read as a bedtime story Live now at uponavoice dot com with a BYO Gemini key!

Nikunj Kothari

27,809 views • 21 days ago

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output. We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large-vision language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely a 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. This long-context capability allows IXC-2.5 to excel in tasks requiring extensive input and output contexts. Compared to its previous 2.0 version, InternLM-XComposer-2.5 features three major upgrades in vision-language comprehension: (1) Ultra-High Resolution Understanding, (2) Fine-Grained Video Understanding, and (3) Multi-Turn Multi-Image Dialogue. In addition to comprehension, IXC-2.5 extends to two compelling applications using extra LoRA parameters for text-image composition: (1) Crafting Webpages and (2) Composing High-Quality Text-Image Articles. IXC-2.5 has been evaluated on 28 benchmarks, outperforming existing open-source state-of-the-art models on 16 benchmarks. It also surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks.

AK

66,472 views • 2 years ago

Which LLM reasons best when it doesn't have all the information? Enter LLM Poker Arena to find out. It's a poker-playing benchmark where top reasoning models play Texas Hold'em against each other. Claude Opus 4.5, GPT-5.2, Gemini 2.5 Pro, and Grok 4 all sit at the same table and play full tournaments to see who finishes with the chips. Poker is very different when it comes to reasoning. A player has to balance probabilistic reasoning and opponent modeling, and make decisions under uncertainty. Poker is an interesting evaluation because it tests reasoning under incomplete information, something most coding benchmarks do not capture. In these tournaments the rules are: - Each LLM starts with $1,000 in chips - Small and big blinds start at $25 / $50 - Blinds double every 3 minutes - All models run in their reasoning or thinking modes After the first 5 tournaments: - Claude Opus 4.5 with Thinking has 3 wins - GPT-5.2 has 2 wins - Grok 4 and Gemini 2.5 Pro have 0 wins Early results suggest Claude performs quite well at poker as well. Also, five is a very small sample size. Planning to run many more tournaments, publish the full benchmark data and add a prediction market on top of it. Thanks for the suggestion clipz. Much more coming as part of Poker Cities! This was built on Replit using their AI integrations, which made it straightforward to connect Claude, GPT, and Gemini. What model do you think wins after 100 tournaments?

Anshul Dhawan

32,192 views • 6 months ago

Crypto is overwhelming. High yields are appealing but you’ll need two wallets, three bridges and 15 transactions to secure them. Assuming you can even source those yields. It can be as stressful and time-intensive as a full-time job. At least, it used to be. Introducing Cod3x. The solution to all the friction points for user-blockchain interaction. Cod3x employs a novel Agentic Interface capable of servicing all mass-market users' needs. You need only communicate intent and our network of lightweight AI agents will surface strategies, build transactions and manage queries on your behalf. All in a single swipe. This is all in service of true financial freedom. The freedom to decide on an outcome and instantly receive all the knowledge and tooling necessary to make it happen. The freedom to step away from your PC and enjoy your life and family without thinking about finances. Over the coming months, we will be onboarding strategic partners and firms that are committed to offering our vision of financial freedom to everyone. Expect more details and announcements as we release features leading up to our full launch later this year. Follow us on X to be an early tester for our Agentic Interface and data visualization tools. Get involved in the conversation by joining us on Discord: See you there.

Cod3x | Win More Trades

89,460 views • 2 years ago

From lab to open-source: A new milestone for AI-driven education. 🎓 🤗 We’ve been closely following the MAIC project at Tsinghua University, and we’re thrilled to see it now open-sourced as #OpenMAIC. ✨ This isn't just another chatbot; it takes Multi-Agent orchestration to the next level by building a fully interactive classroom where AI instructors and peers collaborate in real-time. What makes it technically impressive: 🛠️ Complex Orchestration: Leveraging #LangGraph to manage spontaneous interactions—like #AI students "raising hands" during a live lecture. 🧠 Structured Planning: A dedicated "Plan Agent" that transforms raw PDFs into coherent, logically sequenced pedagogical flows. 💻 Beyond Text: A masterclass in GenUI implementation, featuring synchronized TTS, laser pointers, and real-time whiteboard demonstrations. 🥳 If you’re building complex, multi-modal #Agent workflows, this repo is a treasure trove of engineering insights. 🖥️Explore the project: 📰 Read the research:

OpenBMB

152,659 views • 4 months ago