Introducing ambientGPT: an open-source, multimodal foundation model GUI for macOS. Run GPT-4o and open-source models with full ambient knowledge of your screen. Foundation models have long been confined to the browser. With ambientGPT, your screen context is inferred directly as part of the query, ensuring you never need to...

192,753 views • 2 years ago • via X (Twitter)

10 comments

Siddharth Sharma • 2 years ago

Unlike OpenAI’s desktop app, where you must provide a screenshot or upload a file, ambientGPT parses the context from your screen automatically. We also provide the ability to run secure local models like Gemma and Phi-3 multimodal from our interface. Because of the local model sizes, at least 16 GB of RAM is recommended. This was made possible by Apple’s MLX library; shoutout to @awnihannun and @reach_vb!

Siddharth Sharma • 2 years ago

ambientGPT is open-source, and we plan to integrate vLLM and Ollama to provide more extensive inference hosting options within our multimodal GUI. We also aim to release ambientGPT on the Apple App Store soon.

Siddharth Sharma • 2 years ago

Thanks to @mihiranan (Mr. Clutch) for his help with the demo once again!

Sambhav Gupta • 2 years ago

Very cool, but the ChatGPT Mac app will be able to see everything on screen once the update is released.

Ishan Khare • 2 years ago

So what does “full ambient knowledge of your screen” entail privacy-wise? Why would anyone trust this if unwanted data is being captured indefinitely and sent to OpenAI?

Soham Konar • 2 years ago

Great work! Does it work across multiple monitors? It could be a game changer if so.

whistle • 2 years ago

Was waiting for something exactly like this!! What does the token usage look like?

Saïd Aitmbarek • 2 years ago

Looks awesome, would love to showcase ambient on

Salman • 2 years ago

Looks super useful. How does it determine context if you have multiple screens?

Rahul bansal 👀 • 2 years ago

This is cool. I built a tool for talking to models via voice.

Related videos

VITA: Towards Open-Source Interactive Omni Multimodal LLM discuss: The remarkable multimodal capabilities and interactive experience of GPT-4o underscore their necessity in practical applications, yet open-source models rarely excel in both areas. In this paper, we introduce VITA, the first-ever open-source Multimodal Large Language Model (MLLM) adept at simultaneously processing and analyzing Video, Image, Text, and Audio modalities, while also offering an advanced multimodal interactive experience. Starting from Mixtral 8x7B as a language foundation, we expand its Chinese vocabulary and then apply bilingual instruction tuning. We further endow the language model with visual and audio capabilities through two-stage multi-task learning of multimodal alignment and instruction tuning. VITA demonstrates robust foundational capabilities in multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of unimodal and multimodal benchmarks. Beyond foundational capabilities, we have made considerable progress in enhancing the natural multimodal human-computer interaction experience. To the best of our knowledge, we are the first to exploit non-awakening interaction and audio interrupt in an MLLM. VITA is the first step for the open-source community to explore the seamless integration of multimodal understanding and interaction. While much work remains before VITA approaches its closed-source counterparts, we hope that its role as a pioneer can serve as a cornerstone for subsequent research.

AK

23,958 views • 1 year ago