Introducing Scribe: the most accurate Speech to Text model. It has the highest accuracy on benchmarks, outperforming previous state-of-the-art models such as Gemini 2.0 and OpenAI Whisper v3. It’s now the leading model for English, Spanish, Italian, and many more. With support for 99 languages, speaker diarization, character-level...

ElevenLabs

183,712 subscribers

464,392 views • 1 year ago • via X (Twitter)

Science & Technology

11 Comments

ElevenLabs • 1 year ago

It achieves the highest accuracy for the most common languages. And it significantly improves the performance of previously underserved languages such as Serbian, Cantonese, and Gujarati.

ElevenLabs • 1 year ago

Learn more about the benchmarking and features in our blog post:

ElevenLabs • 1 year ago

We have a low-latency version of Scribe coming soon, extending Scribe to real-time use cases.

ElevenLabs • 1 year ago

Scribe is available today in both our UI and API. It’s priced at $0.40 per hour of input audio, with an additional 50% discount available for the next 6 weeks. Sign up for an account here:

ElevenLabs • 1 year ago

Hear from @flavioschneide, one of the lead researchers behind the launch.

ElevenLabs • 1 year ago

Join us next week for a virtual event with the team behind the launch:

AssemblyAI • 1 year ago

Our speech-to-text models are the most accurate on the market with top rankings across industry benchmarks.
- The highest accuracy rates: up to 95%
- Up to 30% fewer hallucinations than other leaders
- Low latency: 63 minutes converts in 35 seconds
Try via API for free today 👇

Lex Fridman • 1 year ago

Awesome!

Kenrik March • 1 year ago

Not hating, but feedback, so hopefully you take this correctly: I want locally run models. Whisper can do this and even run in near real time on mobile devices. Also, $0.40 per hour is sky-high compared to compute unless it’s wildly inefficient. Open-source previous-generation models.

Max Rovensky • 1 year ago

Yeah, but have you considered that I can run Whisper locally instead of paying you?

Alex Balfanz • 1 year ago

whoa.

Related Videos

Eleven v3 (alpha) is the most expressive Text to Speech model. v3 introduces: • Multi-speaker dialogue with contextual awareness • Support for 70+ languages, up from 33 in v2 • Audio tags such as [excited], [sighs], [laughing], and [whispers]

ElevenLabs

39,346 views • 1 year ago

Introducing Eleven v3 (alpha) - the most expressive Text to Speech model ever. Supporting 70+ languages, multi-speaker dialogue, and audio tags such as [excited], [sighs], [laughing], and [whispers]. Now in public alpha and 80% off in June.

ElevenLabs

1,956,187 views • 1 year ago

Introducing Scribe v2 Realtime – the most accurate real-time Speech to Text model. Built for voice agents, meeting notetakers, and live applications, it transcribes in 150ms across 90+ languages, including English, French, German, Italian, Spanish, Portuguese, Hindi, and Japanese. Available today by API and through ElevenLabs Agents.

ElevenLabs

317,246 views • 7 months ago

We pioneered the first ultra-realistic Text to Speech model, and recently launched the world's most accurate Speech to Text model, Scribe. But we're not stopping there. Today, we're taking one small step for man, and one giant leap for man's best friend... with Text to Bark.

ElevenLabs

291,227 views • 1 year ago

Google presents AudioPaLM: A Large Language Model That Can Speak and Listen. Paper page: We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt.

AK

290,517 views • 3 years ago

Introducing Gemini 3.1 Flash TTS 🗣️, our latest text to speech model with scene direction, speaker level specificity, audio tags, more natural + expressive voices, and support for 70 different languages. Available via our new audio playground in AI Studio and in the Gemini API!

Logan Kilpatrick

800,265 views • 2 months ago

We are launching the Eleven v3 (alpha) API. Built for async use cases, Eleven v3 (alpha) delivers the most expressive Text to Speech model: - Dialogue mode, unlimited amount of speakers - 70+ languages - Enhanced voice and emotional control with [audio tags]

ElevenLabs

8,188,608 views • 10 months ago

Today we are announcing Avalon – the best transcription model in the world for prompting AIs. It outperforms the previous state-of-the-art models like ElevenLabs' Scribe and OpenAI's Whisper.

Aqua Voice

472,132 views • 10 months ago

xAI just dropped 2 absolute monsters: Grok Speech-to-Text + Grok Text-to-Speech APIs Same battle-tested stack that runs Grok Voice, millions of Tesla cars, and Starlink support. - STT: Real-time WebSocket, speaker diarization, 25+ languages, word-level timestamps. - TTS: Super natural, expressive voices with tags for laugh, sigh, whisper, emphasis, pause; batch + streaming. Put voice AI on the list of things Grok already 100%'d xAI, Grok

Mario Nawfal

164,842 views • 2 months ago

Think you know Gemini? 🤔 Think again. Meet Gemini 2.5: our most intelligent model 💡 The first release is Pro Experimental, which is state-of-the-art across many benchmarks - meaning it can handle complex problems and give more accurate responses. Try it now →

Google DeepMind

1,109,319 views • 1 year ago

Today we’re introducing Scribe v2: the most accurate transcription model ever released. While Scribe v2 Realtime is optimized for ultra low latency and agents use cases, Scribe v2 is built for batch transcription, subtitling, and captioning at scale.

ElevenLabs

556,667 views • 5 months ago

Can you accurately transcribe fast speech? Tested ElevenLabs' new Speech-to-Text model (Scribe) with Eminem's "Rap God" (4.28 words/sec!) & it nailed it. Great quality and supports 99+ languages.

Addy Osmani

108,722 views • 1 year ago

Introducing SeamlessM4T, the first all-in-one, multilingual multimodal translation model. This single model can perform tasks across speech-to-text, speech-to-speech, text-to-text translation & speech recognition for up to 100 languages depending on the task. Details ⬇️

AI at Meta

592,671 views • 2 years ago

Introducing our new Turbo 2.5 model - Hindi, French, Spanish, Mandarin and 27 other languages just got 3x faster. This unlocks high-quality low-latency conversational AI for nearly 80% of the world. For the first time, we support Vietnamese, Hungarian and Norwegian text to speech. And English is now 25% faster compared to Turbo v2.

ElevenLabs

50,875 views • 1 year ago

Here are the best practices for using Eleven v3 (alpha) - the most expressive Text to Speech model.

ElevenLabs

43,692 views • 1 year ago

This is the best and fastest speech-to-text model in the world: • 23.2 seconds to process 30 minutes of audio • 93.3% accuracy • Diarization support to detect multiple speakers • Trained on 12.5 million hours of multilingual data I tried it out and it's pretty impressive:

Santiago

66,397 views • 9 months ago

Introducing Eleven Music. The highest quality AI music model. - Complete control over genre, style, and structure - Multi-lingual, including English, Spanish, German, Japanese and more - Edit the sound and lyrics of individual sections or the whole song

ElevenLabs

1,745,847 views • 10 months ago

Introducing Universal-1: Our most powerful and accurate Speech AI model yet. ✅ Trained on 12.5M hours of multilingual speech data ✅ 13.5% more accurate than models like Whisper ✅ Up to 30% fewer hallucinations than seq2seq models ✅ Just 38 seconds to process 1 hour of audio

AssemblyAI

23,511,258 views • 2 years ago

Announcing 𝐕𝐨𝐢𝐜𝐞𝐂𝐫𝐚𝐟𝐭🪄 SotA for both speech editing and zero-shot text-to-speech, Outperforming VALL-E, XTTS-v2, etc. VoiceCraft works on in-the-wild data such as movies, random videos and podcasts We fully open source it at

Puyuan Peng

160,395 views • 2 years ago

Did xAI just mass-murder the entire voice AI industry? 🤯 Grok just launched two voice APIs. Speech-to-Text and Text-to-Speech. Built on the same stack powering Tesla cars and Starlink support. And priced at 10x cheaper than ElevenLabs. Speech-to-Text: $0.10/hr batch. $0.20/hr streaming. Text-to-Speech: $4.20 per million characters. 25+ languages. Real-time streaming. Speaker diarization. Already outperforming ElevenLabs, Deepgram, and AssemblyAI on word error rate. TTS ships with expressive tags like [laugh] and [sigh]. Voices that don't sound like robots reading a script. ElevenLabs spent years building a voice AI company. xAI built voice AI for cars and satellites.

Vaibhav Sisinty

24,520,024 views • 2 months ago