🥳 Introducing MiniCPM-o 4.5: the first full-duplex omni-modal LLM in the open-source community 🎬🎙️

🔥 Key Highlights:
• Full-duplex Omni-modal Live Streaming: The model can see, listen, and speak simultaneously in a real-time conversation without mutual blocking
• Proactive Interaction: Moving beyond reactive QA to performing proactive interaction, such as...

397,416 views • 4 months ago • via X (Twitter)

Related videos

🚀 Excited to announce the technical report of MiniCPM-o 4.5! MiniCPM-o 4.5 transitions #AI interaction from traditional turn-based processing to a real-time, native full-duplex, stream-based paradigm.

🌊 The Omni-Flow Framework
Instead of traditional VAD-based workarounds, we introduce the Omni-Flow framework. This unified stream paradigm aligns video, audio, and text on a synchronized millisecond timeline.
• Native Full-Duplex: Simultaneous perception and response.
• Proactive Interaction: Natively manages turn-taking without external VAD and supports proactive reminders.

📉 9B Scale, SOTA Performance
MiniCPM-o 4.5 demonstrates SOTA multimodal intelligence at its scale:
• Multimodal Benchmarks: Comparable to #Gemini 2.5 Flash on MMBench EN (87.6) and MathVista (80.1).
• Streaming Evaluation: 54.4% win rate on LiveSports-3K-CC, surpassing specialized models.

💻 The Ultimate Edge AI: Fully Functional Without a Network Connection
We are providing one-click installers for Windows (12G VRAM, RTX 5070) and macOS (M1-M5 Max / M5 Pro).
• Local API Support: Deploy your own inference server to integrate native full-duplex into custom apps.
• Free Access: We are offering free community API services for exploration.
• 100% Private: Your data never leaves your machine.
Deploy in under 10 minutes. 🛠️👇

👐 Join the Open Future
The weights are open. The protocol is public.
📄 Technical Report:
💻 GitHub:
🤗 HuggingFace:
🌐 Web Demo:

#MiniCPMo #OpenSourceAI #EdgeAI #MachineLearning #ComputerVision #LLM

OpenBMB

147,650 views • 1 month ago

VITA: Towards Open-Source Interactive Omni Multimodal LLM

The remarkable multimodal capabilities and interactive experience of GPT-4o underscore their necessity in practical applications, yet open-source models rarely excel in both areas. In this paper, we introduce VITA, the first-ever open-source Multimodal Large Language Model (MLLM) adept at simultaneous processing and analysis of Video, Image, Text, and Audio modalities, while also offering an advanced multimodal interactive experience. Starting from Mixtral 8x7B as a language foundation, we expand its Chinese vocabulary, followed by bilingual instruction tuning. We further endow the language model with visual and audio capabilities through two-stage multi-task learning of multimodal alignment and instruction tuning. VITA demonstrates robust foundational capabilities in multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of both unimodal and multimodal benchmarks. Beyond foundational capabilities, we have made considerable progress in enhancing the natural multimodal human-computer interaction experience. To the best of our knowledge, we are the first to exploit non-awakening interaction and audio interrupt in an MLLM. VITA is the first step for the open-source community to explore the seamless integration of multimodal understanding and interaction. While there is still a lot of work to be done on VITA to get close to its closed-source counterparts, we hope that its role as a pioneer can serve as a cornerstone for subsequent research.

AK

23,958 views • 1 year ago