MICROSOFT OPEN SOURCED A 7B PARAMETER MODEL THAT TRANSCRIBES 60 MINUTES OF AUDIO IN A SINGLE PASS, and it's completely free. VIBEVOICE ASR: no chunking, no context loss, full speaker diarization baked in. Not just speech to text, not a basic wrapper: who spoke, when they spoke, exactly what they...

1,370,539 views • 2 months ago • via X (Twitter)

Related videos

🚨 JUST IN: MICROSOFT just open sourced a VOICE AI THAT TRANSCRIBES 60 MINUTES OF AUDIO in a single pass. 100% FREE.

It knows who spoke. It knows when they spoke. It knows exactly what they said. All in one shot. No chunking. No context loss.

It's called VibeVoice. Not a transcription tool. Not a basic speech to text wrapper. A frontier voice AI family with ASR, TTS, and real time streaming. All open source. All free.

Here's what it actually does 👇

VibeVoice ASR - Speech Recognition:
→ Processes 60 minutes of continuous audio in a single pass
→ Never slices audio into chunks so global context is never lost
→ Identifies WHO spoke, WHEN they spoke and WHAT they said simultaneously
→ Supports customized hotwords for domain specific accuracy
→ Works in 50+ languages natively
→ Already adopted by Hugging Face Transformers library
→ Already being built on by the open source community BY PEOPLE WHO HAD NO IDEA THIS LEVEL OF ACCURACY WAS ALREADY FREE.

VibeVoice TTS - Text to Speech:
→ Generates up to 90 minutes of speech in a single pass
→ Supports up to 4 distinct speakers in one conversation
→ Natural turn taking and speaker consistency throughout
→ Expressive speech that captures emotional nuances
→ Supports English, Chinese and multiple other languages

VibeVoice Realtime - Streaming TTS:
→ Only 300 millisecond first audible latency
→ Streams text input in real time
→ 0.5B parameters so it actually deploys anywhere
→ Robust long form generation up to 10 minutes
→ Lightweight enough for production use today

The core innovation nobody is talking about:

Most voice AI models slice long audio into short chunks. Every time they slice, they lose context. Speaker tracking breaks. Semantic coherence breaks. Accuracy drops.

VibeVoice uses continuous speech tokenizers running at an ultra low frame rate of 7.5 Hz. This preserves audio fidelity while dramatically boosting computational efficiency. The entire 60 minutes stays in context. Nothing gets lost. Nobody gets misidentified.
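To put the 7.5 Hz figure in perspective, here is a rough back-of-the-envelope sketch in Python. The frame rate and the 60 and 90 minute durations come from the post. The helper name `frames_for` and the 50 Hz comparison rate are assumptions made up for illustration, not figures from VibeVoice or any other model, and the post does not say how many tokens each frame maps to.

```python
# Rough arithmetic only: how many tokenizer frames cover a long recording.
# FRAME_RATE_HZ is the rate quoted in the post; everything else is illustrative.
FRAME_RATE_HZ = 7.5

# Hypothetical higher rate, chosen only for comparison (an assumption).
COMPARISON_RATE_HZ = 50.0


def frames_for(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Return the number of frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


asr_frames = frames_for(60)  # 60 minutes of ASR input
tts_frames = frames_for(90)  # 90 minutes of TTS output
comparison_frames = frames_for(60, COMPARISON_RATE_HZ)

print(f"60 min at 7.5 Hz: {asr_frames} frames")
print(f"90 min at 7.5 Hz: {tts_frames} frames")
print(f"60 min at 50 Hz (assumed): {comparison_frames} frames")
```

At 7.5 Hz an hour of audio is 27,000 frames, which is why a low frame rate makes it plausible to keep the whole recording in a single context. The exact sequence length the model sees depends on details the post does not give.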
The numbers:
→ VibeVoice ASR 7B - available now on Hugging Face
→ VibeVoice Realtime 0.5B - try it on Colab right now
→ 50+ supported languages
→ 11 distinct English voice styles
→ 9 multilingual speaker voices
→ Already integrated into Hugging Face Transformers
→ Finetuning code now available

The wildest part? A voice powered input method called Vibing just built itself on top of VibeVoice ASR. Available on macOS and Windows right now. The open source community is already shipping products on top of this.

100% Open Source. Free to use. Free to fine tune. Free to build on.

🔖 Save this before your competitors find it first. 👇

Kanika

220,523 views • 2 months ago

NVIDIA JUST DROPPED A FREE AI MODEL THAT READS PDFS, WATCHES VIDEOS, LISTENS TO AUDIO, AND UNDERSTANDS YOUR SCREEN SIMULTANEOUSLY. Not one at a time. ALL AT ONCE. In a single pass.

It is called Nemotron 3 Nano Omni and it runs 9 times faster than every other multimodal model currently available.

Think about what that actually means for how you work. Right now you are switching between tools constantly. One tool for transcribing your call recordings. A different tool for analyzing your client PDFs. Another tool for processing your training videos. A separate workflow for understanding what is happening on your screen. Four tools. Four contexts. Four different outputs you have to manually synthesize into one decision.

Nemotron 3 Nano Omni does all of it in one model. One pass. One output.

The use cases that just got dramatically simpler:
Meeting recordings where you need the transcript, the visual context, and the document references all analyzed together.
Training videos where the audio, the slides, and the on-screen demonstrations all feed into one coherent summary.
Client PDFs where you need the document content cross-referenced against your screen data and your call notes simultaneously.
Sales call transcripts analyzed alongside the proposals and the CRM data in one unified pass.

This is not a marginal improvement on existing multimodal models. It is a 9x speed increase on a capability that was already changing how people work.

Free. From NVIDIA. Available right now.

Bookmark this before everyone catches on. Follow CyrilXBT for every AI capability shift the moment it drops.

CyrilXBT

37,523 views • 1 month ago