Demo2: Multimodal Interactive Hybrid Agent

Qwen

219,596 subscribers

27,423 views • 24 days ago • via X (Twitter)

Related videos

Hunyuan-GameCraft High-dynamic Interactive Game Video Generation with Hybrid History Condition

AK

91,044 views • 1 year ago

Seeing, Listening, Remembering, and Reasoning A Multimodal Agent with Long-Term Memory

AK

19,923 views • 10 months ago

demo2 is LIVE

Léonard Lemaitre

527,567 views • 15 days ago

can AI write engaging news that people can trust? introducing ✨Data2Story: a data journalist agent. give it raw data, it generate a verifiable, multimodal article. 🔍verifiable: every claim is evidence-grounded, traces back to data, code, or a cited source. 🔮multimodal: the article is a generative UI — images, videos, audio, interactive charts. not just readable, but trustworthy and playable. 🧵1/N

Kevin Lin

25,183 views • 9 days ago

Demo2：Audio-Visual Vibe Coding

Qwen

210,634 views • 2 months ago

Our AI design agent is absolutely on fire 🤯 What if you get an UX agent can design, wireframe, remix? Gemini's true multimodal ability unleashed Try it on

Jason Zhou

80,710 views • 7 months ago

Stagehand Agent just got way more powerful. We just added: • Agent Hybrid mode (DOM + vision) • Streaming output • Automatic web search (with Brave) Try the new agent with experimental flag.

Stagehand

16,695 views • 5 months ago

Avatar Forcing: Real-time interactive head avatars for natural conversation, and it's not WAN. - handles both speaking & active listening + multimodal responsiveness; - 7x faster than baselines.

Wildminder

58,506 views • 5 months ago

Multimodal Reasoning AI Agents are here with Gemini 2.0 Flash Thinking I built a multimodal AI agent that can reason and understand images using gemini flash reasoning LLM. 100% Opensource Code with step-by-step tutorial.

Shubham Saboo

36,596 views • 1 year ago

Gemini 3 Flash can analyze high-fidelity images and use multimodal reasoning to determine next steps. See how the model understands complex visuals and generates layers of interactive elements.

Google AI Developers

47,280 views • 5 months ago

I built an automated AI design team with multi-agents. It has 3 multimodal AI agents working together as a team: 1. Visual Design Agent 2. UX Analysis Agent 3. Market Analysis Agent 100% Opensource Code with step-by-step tutorial.

Shubham Saboo

108,393 views • 1 year ago

grep is multimodal now. performs better then apple photos and gives you and your agent perfect search. just run npm install -g @mixedbread/mgrep

Aamir

49,631 views • 7 months ago

I built a multimodal AI Coding Agent team with multi-agents. It has 3 AI agents working together as a team to generate and execute the code: 1. Coding Agent using o-3 mini 2. Vision Agent using Gemini 3. Code Execution Agent using o-3 mini and E2B 100% Opensource Code.

Shubham Saboo

42,269 views • 1 year ago

VITA Towards Open-Source Interactive Omni Multimodal LLM discuss: The remarkable multimodal capabilities and interactive experience of GPT-4o underscore their necessity in practical applications, yet open-source models rarely excel in both areas. In this paper, we introduce VITA, the first-ever open-source Multimodal Large Language Model (MLLM) adept at simultaneous processing and analysis of Video, Image, Text, and Audio modalities, and meanwhile has an advanced multimodal interactive experience. Starting from Mixtral 8x7B as a language foundation, we expand its Chinese vocabulary followed by bilingual instruction tuning. We further endow the language model with visual and audio capabilities through two-stage multi-task learning of multimodal alignment and instruction tuning. VITA demonstrates robust foundational capabilities of multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of both unimodal and multimodal benchmarks. Beyond foundational capabilities, we have made considerable progress in enhancing the natural multimodal human-computer interaction experience. To the best of our knowledge, we are the first to exploit non-awakening interaction and audio interrupt in MLLM. VITA is the first step for the open-source community to explore the seamless integration of multimodal understanding and interaction. While there is still lots of work to be done on VITA to get close to close-source counterparts, we hope that its role as a pioneer can serve as a cornerstone for subsequent research.

AK

23,958 views • 1 year ago

Excited to release: AgentUI > a fresh chat interface - natively multi-agent > agents coordinate via reports and figures > plug+play any open/closed model as sub-agent > agents specialise in code, web search, multimodal... Try it here:

Leandro von Werra

41,036 views • 3 months ago

🔥1-min Interactive Video Generation with Multimodal Control🔥 Towards *long-context world model*, #LongVie is an end-to-end autoregressive framework for controllable ultra-long video generation - Page: - Paper: . Thanks AK

Ziwei Liu

14,150 views • 10 months ago

II-Agent turns any prompt into an interactive classroom. The future of education is already here. Watch it in action.

Intelligent Internet

91,082 views • 1 year ago

I updated my interactive subagents to free up the main agent to be interactive as well (basically /btw but just a normal continuation) and the subagent asynchronously returns its result to the starting session

Daniel Griesser

28,991 views • 3 months ago

From idea → content → distribution. OptimAI Persona Agent handles the entire workflow: Voice-accurate posts and threads Multimodal generation (text, image, audio, video) Native scheduling across major platforms One-click setup via Chrome extension A social media agent, not just a tool. Download:

OptimAI Network

10,107 views • 3 months ago