MICROSOFT OPEN SOURCED A 7B PARAMETER MODEL THAT TRANSCRIBES 60 MINUTES OF AUDIO IN A SINGLE PASS and it's completely free
VIBEVOICE ASR
no chunking, no context loss, full speaker diarization baked in
not just speech to text..not a basic wrapper
who spoke, when they spoke, exactly what they...

Rahul

117,692 subscribers

1,370,539 views • 2 months ago • via X (Twitter)

Education • Health & Wellness • Science & Technology

Related videos

🚨 JUST IN: MICROSOFT just open sourced a VOICE AI THAT TRANSCRIBES 60 MINUTES OF AUDIO in a single pass. 100% FREE.

It knows who spoke. It knows when they spoke. It knows exactly what they said. All in one shot. No chunking. No context loss.

It's called VibeVoice. Not a transcription tool. Not a basic speech to text wrapper. A frontier voice AI family with ASR, TTS, and real time streaming. All open source. All free. Here's what it actually does 👇

VibeVoice ASR - Speech Recognition:
→ Processes 60 minutes of continuous audio in a single pass
→ Never slices audio into chunks so global context is never lost
→ Identifies WHO spoke, WHEN they spoke and WHAT they said simultaneously
→ Supports customized hotwords for domain specific accuracy
→ Works in 50+ languages natively
→ Already adopted by Hugging Face Transformers library
→ Already being built on by the open source community BY PEOPLE WHO HAD NO IDEA THIS LEVEL OF ACCURACY WAS ALREADY FREE.

VibeVoice TTS - Text to Speech:
→ Generates up to 90 minutes of speech in a single pass
→ Supports up to 4 distinct speakers in one conversation
→ Natural turn taking and speaker consistency throughout
→ Expressive speech that captures emotional nuances
→ Supports English, Chinese and multiple other languages

VibeVoice Realtime - Streaming TTS:
→ Only 300 millisecond first audible latency
→ Streams text input in real time
→ 0.5B parameters so it actually deploys anywhere
→ Robust long form generation up to 10 minutes
→ Lightweight enough for production use today

The core innovation nobody is talking about: Most voice AI models slice long audio into short chunks. Every time they slice, they lose context. Speaker tracking breaks. Semantic coherence breaks. Accuracy drops. VibeVoice uses continuous speech tokenizers running at an ultra low frame rate of 7.5 Hz. This preserves audio fidelity while dramatically boosting computational efficiency. The entire 60 minutes stays in context. Nothing gets lost. Nobody gets misidentified.

The numbers:
→ VibeVoice ASR 7B - available now on Hugging Face
→ VibeVoice Realtime 0.5B - try it on Colab right now
→ 50+ supported languages
→ 11 distinct English voice styles
→ 9 multilingual speaker voices
→ Already integrated into Hugging Face Transformers
→ Finetuning code now available

The wildest part? A voice powered input method called Vibing just built itself on top of VibeVoice ASR. Available on macOS and Windows right now. The open source community is already shipping products on top of this.

100% Open Source. Free to use. Free to fine tune. Free to build on. 🔖 Save this before your competitors find it first. 👇
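The 7.5 Hz frame-rate claim in the post above can be sanity-checked with simple arithmetic. The sketch below is only an illustration: it assumes one tokenizer frame per step, the helper `frames_for` is a hypothetical name, and the 50 Hz comparison rate is an assumed figure for contrast, not a number from the post.

```python
def frames_for(minutes: float, frame_rate_hz: float) -> int:
    """Return how many tokenizer frames cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


VIBEVOICE_HZ = 7.5    # frame rate stated in the post
COMPARISON_HZ = 50.0  # assumed higher rate, for illustration only

asr_frames = frames_for(60, VIBEVOICE_HZ)          # 60-minute ASR input
tts_frames = frames_for(90, VIBEVOICE_HZ)          # 90-minute TTS output
comparison_frames = frames_for(60, COMPARISON_HZ)  # same hour at 50 Hz

print(asr_frames)         # 27000
print(tts_frames)         # 40500
print(comparison_frames)  # 180000
```

Under these assumptions an hour of audio is about 27,000 frames, which is why a low frame rate makes it plausible to keep a whole recording in one context window.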

Kanika

220,523 views • 2 months ago

Microsoft did it again! Speech AI models have a major limitation. They slice long recordings into tiny chunks, lose track of who's speaking, and forget all context halfway through. This is exactly what Microsoft's VibeVoice solves. It's an open-source family of frontier voice AI models for both speech recognition and speech generation. Here's what it can do:

> VibeVoice-ASR processes up to 60 minutes of audio in a single pass. No chunking. It outputs structured transcriptions with who spoke, when they spoke, and what they said.
> You can feed it custom hotwords like names, technical jargon, or domain-specific terms. The model uses them to significantly improve accuracy on specialized content.
> VibeVoice-TTS generates up to 90 minutes of multi-speaker speech with up to 4 distinct speakers. Natural turn-taking, emotional expression, all in one pass.
> VibeVoice-Realtime is a 0.5B streaming TTS model with ~300ms first-audio latency. Small enough to deploy practically anywhere.

All of this is powered by continuous speech tokenizers running at just 7.5 Hz. This ultra-low frame rate preserves audio quality while making long sequences computationally feasible. I have shared the link to the GitHub repo in the replies!

Akshay 🚀

45,182 views • 2 months ago

Microsoft just killed voice subscriptions. 🤯 They quietly open-sourced VibeVoice, and it’s a total industry disruptor. Transcribe hour-long meetings or generate 90 minutes of natural, multi-speaker speech, all locally on your hardware for free. It handles 50+ languages and complex speaker tracking without the monthly fees. The era of expensive voice apps is officially over. I share updates like these in my free AI community on WhatsApp. Join here

Vaibhav Sisinty

22,056 views • 2 months ago

MICROSOFT DROPPED A 4B PARAMETER MODEL THAT TURNS ONE IMAGE INTO A 3D ASSET IN 3 SECONDS and it's open source
TRELLIS.2
fully textured, physically accurate 3D models with PBR textures out of the box
not a rough mesh..not a placeholder
roughness, metallic, opacity
the kind of detail that makes things look real under any lighting
and it handles the weird stuff too..open surfaces, hollow interiors, geometry that breaks every other tool
the model doesn't know the word "limitation" apparently
demo is live on hugging face right now

Vaishnavi

353,468 views • 2 months ago

🚨 ElevenLabs just killed every transcription tool on the market. They dropped Scribe v2 Realtime and it's not just another speech-to-text model. This thing processes audio in real-time with zero lag, handles multiple speakers, and actually understands context not just words. Here's how with real examples: Full thread 🧵

Hasan Toor

220,414 views • 7 months ago

Hermes Agent is now free to run 24/7. No paid model needed. You can plug in: → Qwen 3.6 Plus → OwlAlpha via OpenRouter Both have 1M token context windows. That means your agent can research, write, find leads, handle tasks, and build workflows without burning through paid API credits. The best part? When one free model hits limits, just switch to the other. Free AI agents are getting scary fast.

Julian Goldie SEO

21,609 views • 1 month ago

French President Macron: I do believe in free speech. But what does it mean? Free speech means I will listen to you, you will listen to me, and we are in an equal relationship. A lot of people defending free speech do it based on algorithms without any transparency, with a lot of bias, and with their own political agenda. It's not free speech. And when people clearly help hate speech, racist speech to be spread all over the place, it's not about free speech. It's a jungle. I really believe in free speech based on respect and transparency.

Open Source Intel

63,018 views • 4 months ago

French President Macron: I do believe in free speech. But what does it mean? Free speech means I will listen to you, you will listen to me, and we are in an equal relationship. A lot of people defending free speech do it based on algorithms without any transparency, with a lot of bias, and with their own political agenda. It's not free speech. And when people clearly help hate speech, racist speech to be spread all over the place, it's not about free speech. It's a jungle. I really believe in free speech based on respect and transparency.

Clash Report

786,406 views • 4 months ago

DONALD LUSKIN: EUROPEAN LEADERS ARE TRYING TO TAKE AWAY ELON’S FREE SPEECH IN PLAIN SIGHT “For the President of France to say that Elon Musk is interfering directly in elections because he is speaking his opinion is just ridiculous. It's a terrible distortion of the French language. It's not interference to speak, it is exercising your right to free speech. The leaders of Europe who are attacking Musk for free speech are, by doing so, proving his point. His point is that they are tyrannical leaders who are taking away people's free speech, and they're trying to take away his in plain sight. They're not even ashamed of it. They're just hoping you won't notice the breathtaking hypocrisy. This is what it means to be a totalitarian. You get to decide who has free speech. And what a coincidence! The people who get free speech are the people who agree with you. The people who don't are the people who disagree with you. Now, it's not that they don't love Elon Musk and 𝕏 and all the other social media platforms, they love to tax them. They just don't like to let the free speech happen there.” Source: Fox News, January 7, 2025

Mario Nawfal

54,605 views • 1 year ago

NVIDIA JUST DROPPED A FREE AI MODEL THAT READS PDFS, WATCHES VIDEOS, LISTENS TO AUDIO, AND UNDERSTANDS YOUR SCREEN SIMULTANEOUSLY. Not one at a time. ALL AT ONCE. In a single pass. It is called Nemotron 3 Nano Omni and it runs 9 times faster than every other multimodal model currently available. Think about what that actually means for how you work. Right now you are switching between tools constantly. One tool for transcribing your call recordings. A different tool for analyzing your client PDFs. Another tool for processing your training videos. A separate workflow for understanding what is happening on your screen. Four tools. Four contexts. Four different outputs you have to manually synthesize into one decision. Nemotron 3 Nano Omni does all of it in one model. One pass. One output. The use cases that just got dramatically simpler: Meeting recordings where you need the transcript, the visual context, and the document references all analyzed together. Training videos where the audio, the slides, and the on-screen demonstrations all feed into one coherent summary. Client PDFs where you need the document content cross-referenced against your screen data and your call notes simultaneously. Sales call transcripts analyzed alongside the proposals and the CRM data in one unified pass. This is not a marginal improvement on existing multimodal models. It is a 9x speed increase on a capability that was already changing how people work. Free. From NVIDIA. Available right now. Bookmark this before everyone catches on. Follow CyrilXBT for every AI capability shift the moment it drops.

CyrilXBT

37,523 views • 1 month ago

BOOM! Microsoft just released an upgraded VibeVoice Large ~10B Text to Speech model - MIT licensed 🔥 > Generate multi-speaker podcasts in minutes ⚡ > Works blazingly fast on ZeroGPU with H200 (FREE) Try it out today!

Vaibhav (VB) Srivastav

89,549 views • 9 months ago

Excited about the launch of Amazon Nova Sonic, our new speech-to-speech model that helps make AI voice applications feel remarkably natural. It's designed to understand not just what people say, but how they say it – working with tone, style, and conversation flow including pauses and interruptions. Nova Sonic delivers speech understanding and generation through a single, unified model, making it easier for builders to develop voice applications that maintain important context and nuance for customer service, AI agents, and other use cases across industries. It’s available in Amazon Bedrock now. Look forward to seeing what teams build with Nova Sonic!

Andy Jassy

155,772 views • 1 year ago

CHICAGO RIGHT NOW: "What they are seeing in Chicago is wrong. They know they are free citizens in a free country. A free country that we have spent hundreds of years to perfect." Americans will not be intimidated.

Lincoln Square

51,889 views • 8 months ago

Introducing Build That Idea A no-code platform to launch and monetize custom AI agents in 60 seconds Available to everyone for free

Okara

1,059,476 views • 11 months ago

When someone on the left gets fired for what they said, it's: "Donald Trump wants to take away free speech". When someone on the right gets fired for what they said, it’s just and makes sense. “They deserved it.” This isn’t about being principled, they simply will use any tool to attempt to make Trump and the right look bad. That’s the goal.

Jeffery Mead

1,190,014 views • 9 months ago

Gemini 3.1 Flash Live just dropped and it's available with LiveKit today. This is the first Gemini 3 native audio model on the Live API. Better instruction following, improved tool calling, reduced speaker drift, and support for 70+ languages. Audio in, audio out. No text conversion in between.

LiveKit

40,277 views • 3 months ago

💥Bill Maher SCHOOLS Canadian Comedian Tom Green on why Canada *DOESN'T* have free speech like America🇺🇸 GREEN: "We have freedom of speech … There are hate speech laws in Canada — so there are things you can't say ... Who would want to say that sh*t anyways?" MAHER: "That's not what free speech means … You don’t know what free speech is." "The Supreme Court ruled in our country that the Nazis could march in Skokie, Illinois — which they were marching in because it is a community of a lot of Holocaust survivors ... That is what free speech is about."

Jason Cohen 🇺🇸

1,442,608 views • 9 months ago

The ACLU used to defend free speech and the First Amendment. Now it's silently watching people get arrested in other countries for what they posted or said in America. What a shame.

House Judiciary GOP 🇺🇸🇺🇸🇺🇸

38,060 views • 4 months ago

OpenAI's S2S preview is polished but it still thinks in steps. Speech → text → model → text → speech. That's not how humans converse. Introducing Hydra. A native speech-to-speech model that doesn't wait for turn-taking, doesn't flatten emotion into text, and doesn't break when you interrupt it mid-sentence. Hydra reasons asynchronously, speaks and listens simultaneously, and preserves emotion because it never leaves the audio domain. It's still in beta, but the shift is obvious. If you want early access, the link is in the comments. Here's a preview of what that looks like -

Sudarshan Kamath

328,731 views • 3 months ago