Meet MoshiVis🎙️🖼️, the first open-source real-time speech model that can talk about images! It sees, understands, and talks about images, naturally and out loud. Voice interaction with a compact model endowed with visual understanding opens up new applications, from audio description for the visually impaired to visual access...

kyutai

26,288 subscribers

47,924 views • 1 year ago • via X (Twitter)

Health and Wellness • Education • Science and Technology

Comments: 11

kyutai • 1 year ago

🧠 How it works
MoshiVis builds on Moshi, our speech-to-speech LLM, now enhanced with vision. 206M lightweight parameters on top of a frozen Moshi give it the power to discuss images while still remaining real-time on consumer-grade hardware. For example, on a Mac mini M4, MoshiVis adds only ~7 ms per step to the ~45 ms per step of the base model, staying well below the 80 ms threshold for our audio codec. That means fluid, live, and multimodal conversations with Moshi on your own device!
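
The latency claim above can be checked with simple arithmetic. The following is a minimal sketch using only the approximate figures quoted in the comment; the constant and function names are illustrative and are not taken from the MoshiVis code. An 80 ms budget per step corresponds to 12.5 steps per second.

```python
# Approximate figures quoted in the comment (Mac mini M4).
# All names below are illustrative, not part of any MoshiVis API.
BASE_STEP_MS = 45.0      # per-step latency of the base Moshi model
VISION_EXTRA_MS = 7.0    # extra per-step latency added by MoshiVis
FRAME_BUDGET_MS = 80.0   # real-time threshold quoted for the audio codec


def realtime_headroom_ms(base_ms: float, extra_ms: float, budget_ms: float) -> float:
    """Return the milliseconds left in the per-step budget."""
    return budget_ms - (base_ms + extra_ms)


step_ms = BASE_STEP_MS + VISION_EXTRA_MS
headroom_ms = realtime_headroom_ms(BASE_STEP_MS, VISION_EXTRA_MS, FRAME_BUDGET_MS)
steps_per_second = 1000.0 / FRAME_BUDGET_MS

print(f"step: {step_ms:.0f} ms, headroom: {headroom_ms:.0f} ms")
print(f"budget allows {steps_per_second} steps per second")
```

With these numbers the combined step takes about 52 ms, leaving roughly 28 ms of slack per step on that hardware.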

kyutai • 1 year ago

🧰 Fully open-source
We’re releasing a detailed preprint, along with model weights and a first-of-its-kind benchmark dataset for spoken visual question answering:
📄 Preprint
🧠 Speech Benchmarks
🧾 Model weights
🧪 Inference code in PyTorch, MLX, and Rust

kyutai • 1 year ago

If you want to work on cutting-edge research, join our non-profit AI lab in Paris 🇫🇷 Thanks to Iliad Group, CMA-CGM Group, Schmidt Sciences — and the open-source community.

AssemblyAI • 1 year ago

Announcing: Our most advanced speech-to-text model goes beyond accuracy to capture the real-world complexity of human conversation and deliver reliable, source-of-truth audio data. Explore Universal-2 updates 👇

ZOHEB • 1 year ago

More lightweight! Sweet

Utopic e/λ • 1 year ago

share the repo with us ❤

Danish Khan • 1 year ago

Tried it out, very good performance for everyday tasks!

Frédéric H. (E/ACC) • 1 year ago

Ridiculous. Your AI is still not able to speak other language than English and doesn't even work decently. What a shame.

haareblond • 1 year ago

wow this is very cool!

Mogomra (e/acc) • 1 year ago

Holey moley, it's the AI from "Start-up" in the real world!!

X_Learning969 • 1 year ago

Local not working. Despite enabling mic and trying out the web ui it does not speak. Wonder if it can even hear me.

Related videos

1. Meta’s open-sourced multisensory model Meta is back (again!) with yet another exciting open-source project. Introducing ImageBind, a new AI research model that understands and combines text, audio, visual, movement, thermal, AND depth data.

Rowan Cheung

173,984 views • 3 years ago

Today we are launching VE2 — a brand new model for Visual Electric. It can produce hyper realistic photos, accurate text, and generate images 2.5x faster. This is the biggest update ever to Visual Electric. 🔊 SOUND ON

Colin Dunn

41,435 views • 1 year ago

🔉 Introducing SAM Audio, the first unified model that isolates any sound from complex audio mixtures using text, visual, or span prompts. We’re sharing SAM Audio with the community, along with a perception encoder model, benchmarks and research papers, to empower others to explore new forms of expression and build applications that were previously out of reach. 🔗 Learn more:

AI at Meta

1,248,823 views • 6 months ago

π0.7 handles diverse prompts that don't just say what to do, but also how to do it, including rich language and multimodal information, such as visual subgoal images. At test time, these images can be produced by a lightweight world model.

Physical Intelligence

33,306 views • 2 months ago

Segment Anything Model 2 (SAM 2) is a foundation model from Meta FAIR for promptable visual segmentation in images & videos. Available now for anyone to build on for free, open source under an Apache license. Try the demo ➡️

AI at Meta

97,733 views • 1 year ago

Long-horizon visual goals remain surprisingly hard for robot manipulation. We introduce Act2Goal, a goal-conditioned policy that uses a visual world model to reason about progress toward a goal, and practice it autonomously in the real world.

Jianlan Luo

95,100 views • 5 months ago

Introducing ChatGPT Images 2.0 A state-of-the-art image model that can take on complex visual tasks and produce precise, immediately usable visuals, with sharper editing, richer layouts, and thinking-level intelligence. Video made with ChatGPT Images

OpenAI

12,872,881 views • 2 months ago

Check it out. You can talk to Visual Studio Visual Studio Code! 👂

Daniel Kelly

23,966 views • 2 years ago

Gemma 3 understands images, text, and video - all at once. In this deep dive, learn how the model integrates multiple sources and performs a range of tasks from answering questions about documents to describing visual scenes in detail. Explore why multimodality matters.

Google AI Developers

41,128 views • 9 months ago

Oh my…FlashLabs releases Chroma 1.0, first open-source real-time speech-to-speech model with personalized voice cloning. Native speech-to-speech with <150ms latency & voice cloning from seconds of audio. Finally, an open alternative to OpenAI Realtime.

Alvaro Cintas

46,151 views • 5 months ago

co-seeing art with machine: an interactive demo where i browse the visual responses of a multimodal llm to prompts about the painting, from “who’s in here?” and to “what’s unusual about this painting?” the model replies not just with words, but with visual annotations as well.

Kat ⊷ the Poet Engineer

37,359 views • 1 year ago

Got the vibe but not the words? Starting today, you can simply show or tell AI Mode about what you're looking for and get rich, visual results to explore. ✨ This experience works by bringing together Search’s world-class visual understanding from Lens and Image search with Gemini 2.5’s advanced multimodal capabilities. When you show AI Mode what you’re searching for it uses our new “visual search fan-out” technique to get a deeper understanding of what’s in an image (including subtle details and secondary objects), and runs multiple queries in the background. This helps it understand the visual context and nuance of your request to deliver relevant visual responses. You can also tell AI Mode what vibe you’re going for (no image needed). For example, ask AI Mode for bedroom design inspo and go back and forth to refine your vision. We’re rolling out this experience in English in the U.S. starting this week. Try it out and let us know what you think!

Google AI

68,179 views • 8 months ago

Apple's open-source on-device AI model instantly turns images into scenes, and Vision Pro owners can try it out in the app Splat Studio. Details here:

UploadVR

14,409 views • 5 months ago

Introducing... Amica! 👩‍🦰🤖 Amica is an open source interface for interactive communication with 3D characters with voice synthesis, speech recognition, visual understanding, and an emotion system. 🔗

Arbius

41,769 views • 2 years ago

New open Omni model released! 👀 OpenBMB MiniCPM-o 2.6 is a new 8B-parameter, any-to-any multimodal model that can understand vision, speech, and language and runs on edge devices like phones and tablets.
TL;DR:
🧠 8B total parameters (SigLip-400M + Whisper-300M + ChatTTS-200M + Qwen2.5-7B)
🔥 Outperforms GPT-4V on visual tasks with 70.2 average score on OpenCompass
🎙️ Best-in-class bilingual speech capabilities with real-time conversation and voice cloning
🎬 Supports multimodal streaming with support for continuous video/audio processing
📱 Runs on iPads and phones and supports 30+ languages
🖼️ Processes images up to 1.8M pixels (1344x1344) with OCR capabilities
🛠️ Easy integration with popular frameworks (llama.cpp, vLLM, Gradio)
🤗 Commercial friendly (< 1 mio DAU) and available on Hugging Face

Philipp Schmid

19,409 views • 1 year ago

Meet FLUX Kontext – the new context-aware image generation and editing model. We’ve teamed up with Black Forest Labs to drop the model on OpenArt today! You can try out now for free. And we’ve built a 50-page visual guidebook packed with detailed examples, pro tips & secret tricks you won’t find anywhere else for Flux Kontext. Find it in the first comment. Get ready for the new Kontext era. If you find the book helpful, remember to like and repost. #fluxkontext #openart #kontext

OpenArt

12,911 views • 1 year ago

We're releasing Hibiki-Zero, a new real-time and multilingual speech translation model that can translate 🇫🇷French, 🇪🇸Spanish, 🇵🇹Portuguese and 🇩🇪German to English: accurate, low-latency, high audio quality, with voice transfer. And best of all: open-source.

kyutai

80,660 views • 4 months ago

The latest Visual Studio Code release brings auto model selection (preview) - a way to have the best model picked for you based on current capacity and performance. Currently being rolled out to all GitHub Copilot users in Visual Studio Code starting with individuals. Learn more:

Visual Studio Code

67,645 views • 9 months ago

Viggle’s Mic 2.0 is here! With our upgraded model, take precise control over how a character talks and moves, in one shot. Turn images into videos simply with an audio or text prompt for speech. Or take it further—add a motion source to drive the character’s movements—action and voice, perfectly aligned. Try it free on our web, or on our iOS/Android apps.👇

ViggleAI

36,121 views • 1 year ago

Live visual descriptions can aid blind people in understanding their surroundings with autonomy and independence. In our #UIST2024 work, we present WorldScribe that generates automated visual descriptions that are adaptive to the users’ contexts in real-time, in the real world.

Anhong Guo

16,130 views • 1 year ago