Microsoft just killed voice subscriptions. 🤯 They quietly open-sourced VibeVoice, and it’s a total industry disruptor. Transcribe hour-long meetings or generate 90 minutes of natural, multi-speaker speech, all locally on your hardware for free. It handles 50+ languages and complex speaker tracking without the monthly fees. The era of...

Vaibhav Sisinty

155,487 subscribers

22,056 views • 2 months ago • via X (Twitter)

Related videos

MICROSOFT OPEN SOURCED A 7B PARAMETER MODEL THAT TRANSCRIBES 60 MINUTES OF AUDIO IN A SINGLE PASS, and it's completely free.
VIBEVOICE ASR: no chunking, no context loss, full speaker diarization baked in.
Not just speech to text. Not a basic wrapper. Who spoke, when they spoke, exactly what they said, all in one shot.
And it handles the hard stuff too: 50+ languages, custom hotwords, long form audio that breaks every other tool. The model doesn't know what "context window" means, apparently.
Available on macOS and Windows right now. Free to use. Free to fine tune. Free to build on.

Rahul

1,370,539 views • 2 months ago

🚨 JUST IN: MICROSOFT just open sourced a VOICE AI THAT TRANSCRIBES 60 MINUTES OF AUDIO in a single pass. 100% FREE.
It knows who spoke. It knows when they spoke. It knows exactly what they said. All in one shot. No chunking. No context loss.
It's called VibeVoice. Not a transcription tool. Not a basic speech to text wrapper. A frontier voice AI family with ASR, TTS, and real time streaming. All open source. All free.
Here's what it actually does 👇
VibeVoice ASR - Speech Recognition:
→ Processes 60 minutes of continuous audio in a single pass
→ Never slices audio into chunks so global context is never lost
→ Identifies WHO spoke, WHEN they spoke and WHAT they said simultaneously
→ Supports customized hotwords for domain specific accuracy
→ Works in 50+ languages natively
→ Already adopted by Hugging Face Transformers library
→ Already being built on by the open source community BY PEOPLE WHO HAD NO IDEA THIS LEVEL OF ACCURACY WAS ALREADY FREE.
VibeVoice TTS - Text to Speech:
→ Generates up to 90 minutes of speech in a single pass
→ Supports up to 4 distinct speakers in one conversation
→ Natural turn taking and speaker consistency throughout
→ Expressive speech that captures emotional nuances
→ Supports English, Chinese and multiple other languages
VibeVoice Realtime - Streaming TTS:
→ Only 300 millisecond first audible latency
→ Streams text input in real time
→ 0.5B parameters so it actually deploys anywhere
→ Robust long form generation up to 10 minutes
→ Lightweight enough for production use today
The core innovation nobody is talking about: Most voice AI models slice long audio into short chunks. Every time they slice, they lose context. Speaker tracking breaks. Semantic coherence breaks. Accuracy drops. VibeVoice uses continuous speech tokenizers running at an ultra low frame rate of 7.5 Hz. This preserves audio fidelity while dramatically boosting computational efficiency. The entire 60 minutes stays in context. Nothing gets lost. Nobody gets misidentified.
The numbers:
→ VibeVoice ASR 7B - available now on Hugging Face
→ VibeVoice Realtime 0.5B - try it on Colab right now
→ 50+ supported languages
→ 11 distinct English voice styles
→ 9 multilingual speaker voices
→ Already integrated into Hugging Face Transformers
→ Finetuning code now available
The wildest part? A voice powered input method called Vibing just built itself on top of VibeVoice ASR. Available on macOS and Windows right now. The open source community is already shipping products on top of this.
100% Open Source. Free to use. Free to fine tune. Free to build on.
🔖 Save this before your competitors find it first. 👇
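As a rough sanity check on the 7.5 Hz figure quoted in the post, the sketch below counts how many tokenizer frames a recording would produce. It assumes frames accumulate linearly with duration and that each frame occupies one position in the model's context; the post states neither, and the 50 Hz comparison rate is a hypothetical baseline chosen only for illustration, not a number from the post.

```python
def frames_for(duration_minutes: float, frame_rate_hz: float) -> int:
    """Number of frames produced for a recording of the given length."""
    return round(duration_minutes * 60 * frame_rate_hz)


# Frame rate quoted in the post for VibeVoice's speech tokenizers.
VIBEVOICE_FRAME_RATE_HZ = 7.5

# Hypothetical higher-rate tokenizer, used only as a point of comparison.
BASELINE_FRAME_RATE_HZ = 50.0

asr_frames = frames_for(60, VIBEVOICE_FRAME_RATE_HZ)       # 60 minutes of input audio
tts_frames = frames_for(90, VIBEVOICE_FRAME_RATE_HZ)       # 90 minutes of generated speech
baseline_frames = frames_for(60, BASELINE_FRAME_RATE_HZ)   # same hour at the assumed baseline

print(f"60 min at 7.5 Hz: {asr_frames} frames")
print(f"90 min at 7.5 Hz: {tts_frames} frames")
print(f"60 min at 50 Hz:  {baseline_frames} frames")
```

Under these assumptions an hour of audio comes to 27,000 frames, which is the kind of sequence length a long-context model can plausibly hold in one pass.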

Kanika

220,523 views • 2 months ago

BOOM! Microsoft just released an upgraded VibeVoice Large ~10B Text to Speech model - MIT licensed 🔥 > Generate multi-speaker podcasts in minutes ⚡ > Works blazingly fast on ZeroGPU with H200 (FREE) Try it out today!

Vaibhav (VB) Srivastav

89,549 views • 9 months ago

Did xAI just mass-murder the entire voice AI industry? 🤯 Grok just launched two voice APIs. Speech-to-Text and Text-to-Speech. Built on the same stack powering Tesla cars and Starlink support. And priced at 10x cheaper than ElevenLabs. Speech-to-Text: $0.10/hr batch. $0.20/hr streaming. Text-to-Speech: $4.20 per million characters. 25+ languages. Real-time streaming. Speaker diarization. Already outperforming ElevenLabs, Deepgram, and AssemblyAI on word error rate. TTS ships with expressive tags like [laugh] and [sigh]. Voices that don't sound like robots reading a script. ElevenLabs spent years building a voice AI company. xAI built voice AI for cars and satellites.
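To make the quoted prices concrete, here is a minimal cost-estimate sketch. The three rates are copied from the post and are not independently verified; the function names and the assumption of simple linear billing (no minimums, rounding, or volume tiers) are hypothetical.

```python
# Rates as quoted in the post (USD); treated here as assumptions.
STT_BATCH_USD_PER_HOUR = 0.10
STT_STREAMING_USD_PER_HOUR = 0.20
TTS_USD_PER_MILLION_CHARS = 4.20


def stt_cost(hours: float, streaming: bool = False) -> float:
    """Estimated speech-to-text cost, assuming purely linear per-hour billing."""
    rate = STT_STREAMING_USD_PER_HOUR if streaming else STT_BATCH_USD_PER_HOUR
    return hours * rate


def tts_cost(characters: int) -> float:
    """Estimated text-to-speech cost, assuming purely linear per-character billing."""
    return characters / 1_000_000 * TTS_USD_PER_MILLION_CHARS


print(f"100 h batch transcription:    ${stt_cost(100):.2f}")
print(f"10 h streaming transcription: ${stt_cost(10, streaming=True):.2f}")
print(f"500,000 characters of TTS:    ${tts_cost(500_000):.2f}")
```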

Vaibhav Sisinty

24,522,838 views • 2 months ago

🚨BREAKING: Frontdesk just quietly released a free AI workforce that automatically calls, texts, emails, and remembers ALL your customers. The workers reply 24/7 across all channels. They intelligently reach out at the right time. And since they remember every customer, they all share a custom CRM built just for them. Thousands of people are cancelling their GoHighLevel and Hubspot subscriptions and letting these AI workers run on autopilot. Try it here:

Hasan Toor

75,368 views • 3 months ago

Tencent just launched its answer to Claude Computer Use 🤯 QClaw is a personal AI agent that runs locally on your device, sets up in 3 minutes, and takes commands from WhatsApp or Telegram. Built on open-source OpenClaw. Free beta.

Alvaro Cintas

20,202 views • 2 months ago

Audio tools got 3 major upgrades 🔥 Voice Clone, Multi-Speaker Voiceovers and Change Voice → Create reusable custom voices → Add 2 speakers to generate natural dialogues → Instantly replace the voice in any audio Powered by ElevenLabs and Gemini Available now on Freepik

Magnific

16,021 views • 4 months ago

Grok Voice Agent API is now one of the most advanced voice AIs in the world It ranks #1 on BigBench Audio for speech reasoning Grok Voice already powers voice mode across Grok apps and runs in millions of Tesla vehicles Conversations feel so natural, expressive, fluid, and instantly responsive - like talking to a real person, not a machine Try it now on Grok, in Tesla, and across the growing ecosystem of apps and experiences powered by xAI

X Freeze

4,171,862 views • 5 months ago

This is wild. Hume AI just dropped Octave 2. Ultra realistic text-to-speech AI model in 10+ languages with multi-speaker + voice cloning. 100% AI 8 wild examples + how to try:

Min Choi

212,441 views • 8 months ago

Text-to-speech is moving way too fast. Just a few days ago, I tweeted about PersonaPlex-7B, NVIDIA's new open source TTS. And today, Qwen just open-sourced Qwen3-TTS 🤯 It’s a revolutionary text-to-speech model built for control. Not just about generating speech, but about shaping how it sounds directly from language. You can guide the pace, the tone, and the expressiveness straight from text, without touching audio graphs or hand-tuning parameters. That’s the real shift! What makes Qwen3-TTS stand out is how practical it already is: → voice cloning from just a few seconds of audio → voice creation without any reference sample → support for 10 languages out of the box → end-to-end latency down to ~97ms → works in both streaming and non-streaming setups The models come in two sizes (0.6B and 1.7B), so you can trade off quality and hardware cost depending on your setup. You can work with curated voices, designed voices, or cloned ones, and it integrates cleanly with vLLM for production use. It also ships as a simple Python package you can pip install. If you’re building real-time voice systems, this removes a lot of friction! 100% free and open source. I put the repo in the 🧵↓

Charly Wargnier

59,144 views • 5 months ago

VoiceBox just did what everyone thought was impossible. It might have killed ElevenLabs overnight. No subscription. No cloud. No credits. Runs locally on your machine. Clones a voice from 3 seconds of audio. Powered by Qwen 3 TTS. Apache 2.0 license. Commercial use allowed. And your voice data never leaves your computer. For years, if you wanted high quality voice cloning, ElevenLabs was the default. Now there’s a free open-source alternative that sounds shockingly close. Plus it has a built-in multi-track editor, recording, transcription with Whisper, and a local API. This isn’t just another TTS tool. It’s a full voice production studio on your desktop. Open source AI is catching up fast. And this changes the game for creators and businesses. Would you trust the cloud with your voice… or keep it local? 👇

Julian Goldie SEO

59,113 views • 4 months ago

I just automated the entire process of creating hyperrealistic consistent AI character datasets. Completely free on your own hardware. Everything runs locally using free, open-source models in ComfyUI. FULL TUTORIAL & FREE WORKFLOWS👇

Mickmumpitz

32,115 views • 8 months ago

Your 90-min video has 15 clips hiding in it Instant Highlights V2 finds them Prompt-based search, face tracking, multi-speaker handling, and captions Translate into 175+ languages and upscale to 4K in the same workflow RT + comment "Highlight" for free credits (must follow)

HeyGen

1,212,930 views • 2 months ago

Create OpenAI like API for newly launched Llama-3.1 and deploy it locally on your computer in just 2 minutes (100% free and without internet):

Shubham Saboo

529,188 views • 1 year ago

the era of agentic AI is here Google just open sourced Gemma 3, you can run it on a single GPU laptop or even phone - agentic AI workflows - 4 Sizes – 1B, 4B, 12B, 27B - up to 128K tokens - speaks 140+ languages - hit 1338 ELO on LMArena try & download for free, links 👇

el.cine

199,500 views • 1 year ago

BUILD 🔥: Microsoft is preparing new image and voice models for the announcement on June 2. > MAI Voice 2, a multilingual model supporting 15 new languages and a wider emotional range (check voice samples in the article) > MAI Transcribe 1.5, a new model for speech-to-text use cases. > MAI Image 2.5, already announced last week, is now available on LM Arena in preview. Compared to MAI Image 2, it supports file uploads and can be used for image editing.

🚨 AI News | TestingCatalog

46,260 views • 26 days ago

The Microsoft Designer app is now generally available. It's an AI-powered tool that turns your words into designs… free on web and mobile, and available in over 80 languages. Try it out here:

Yusuf Mehdi

27,796 views • 1 year ago

Alibaba just open-sourced the world's FIRST AI dubbing model that handles multi-speaker scenes. It's called Fun-CineForge and it nails what every other dubbing model fails at: Lip-sync. Emotion. Voice stability. Timing. Across multiple characters. Not a single open-source model could do this before. Here's how it works: ↓ Tongyi Lab Alibaba Group

Guri Singh

620,634 views • 3 months ago

It’s honestly next level 🤯 #Veo3 handles voices insanely well — and in multiple languages! I tested it in French, and as a native speaker… I can confirm: it slaps 🔥

Pierrick Chevallier | IA

17,815 views • 1 year ago

How did a tiny, scrappy team build one of the most powerful AI voice models? In a deep dive with Sesame CTO Ankit Kumar and a16z's Anjney Midha, we explore how Sesame is pushing the boundaries of AI conversation, why it open-sourced its speech generation model, and the power of small teams to outdo much larger AI labs on product focus. A part of their secret: a relentless focus on real-time, natural conversations over raw intelligence, and a deep commitment to voice, personality, and user experience. By opening up its speech generation model, Sesame is paving the way for even more breakthroughs in AI-native conversation 👇

a16z

29,880 views • 1 year ago