new extremely fast text-to-audio model

Dreaming Tulpa 🥓👑

51,267 subscribers

68,825 views • 1 year ago • via X (Twitter)

10 comments

Dreaming Tulpa 🥓👑 • 1 year ago

this is TangoFlux, a new text-to-audio model that can generate 30 seconds of 44.1kHz audio in just 3.7 seconds on a single A40 GPU project page: code: demo:

Bytescribe • 1 year ago

Introducing Vehrbal, the AI that converts audio into SOAP notes! Say goodbye to wasted time and hello to effortless note-taking. Experience the power of fast, simple, and efficient with Vehrbal today.

Nim Eshed 𝕏🦋 • 1 year ago

Do you know if there is a good tool to duplicate a voice?

Dreaming Tulpa 🥓👑 • 1 year ago

not sure how it has evolved, but check out tortoise and t5-tts

michielh.eth • 1 year ago

I'm checking it now, need more ebooks in audio format.

Dreaming Tulpa 🥓👑 • 1 year ago

don’t think this is gonna hold up for it but report back!

Russ Shimon • 1 year ago

Cool. Going to check it out.

Dreaming Tulpa 🥓👑 • 1 year ago

enjoy

Allar Haltsonen • 1 year ago

☄️☄️

aivrar • 1 year ago

Doesn't seem to make great music, still promising though for general sound effects.

Related videos

NEW: Kokoro 82M - APACHE 2.0 licensed, Text to Speech model, trained on < 100 hours of audio 🔥

Vaibhav (VB) Srivastav

330,034 views • 1 year ago

Realtime AI Conversations are here! Introducing PlayHT 2.0 Turbo ⚡️ Our new blazing fast Conversational AI Text-to-Speech model with <300ms latency! ✅ Input text streaming from LLMs ✅ Output audio streaming ✅ Clone any voice & accent Try here -

PlayAI

392,536 views • 2 years ago

Wow! New Speech to Speech model - Fish Agent v0.1 3B by Fish Audio 🔥 > Trained on 700K hours of multilingual audio > Continue-pretrained version of Qwen-2.5-3B-Instruct for 200B audio & text tokens > Zero-shot voice cloning > Text + audio input/ Audio output > Ultra-fast inference w/ 200ms TTFA > Models on the Hub & Finetuning code on its way! 🚀 What an amazing time to be alive 🤗

Vaibhav (VB) Srivastav

66,963 views • 1 year ago

Gemini 3.1 Flash TTS is our most controllable text-to-speech model yet. With new Audio Tags, you can easily direct vocal style, delivery, and pace through text commands. 🧵

Google DeepMind

468,982 views • 2 months ago

🔊Introducing Voxtral TTS: our new frontier open-weight model for natural, expressive, and ultra-fast text-to-speech 🎭Realistic, emotionally expressive speech. 🌍Supports 9 languages and accurately captures diverse dialects. ⚡Very low latency for time-to-first-audio. 🔄Easily adaptable to new voices

Mistral AI

938,328 views • 3 months ago

I've got access to the new Open model! It's not DeepSeek V4, but it's something powerful and fast! Here Space Invaders 0-shot, sprites and audio included, clearly prompt was extremely detailed on my side 🔥 🚀 Prompt below.

Ivan Fioravanti ᯅ

11,441 views • 3 months ago

Stability AI just dropped Stable Audio Open Small on Hugging Face Fast Text-to-Audio Generation with Adversarial Post-Training

AK

55,100 views • 1 year ago

Introducing GPT-4o, our new model which can reason across text, audio, and video in real time. It's extremely versatile, fun to play with, and is a step towards a much more natural form of human-computer interaction (and even human-computer-computer interaction):

Greg Brockman

4,359,357 views • 2 years ago

Bark Text-to-Audio Model Full Text Input: "Why was six afraid of seven?" Ignore Bark's "I'm done with this input" token and tell Bark to just keep generating more audio anyway.

Jonathan Fly 👾

461,816 views • 3 years ago

Designing an Encoder for Fast Personalization of Text-to-Image Models TL;DR: use an encoder to personalize a text-to-image model to new concepts with a single image and 5-15 tuning steps abs: project page:

AK

165,158 views • 3 years ago

Underrated strategy on how to monetize new channels extremely fast 👇🏼

wannercashcow

26,068 views • 2 months ago

VideoComposer brings ControlNet guidance to text and video-to-video. The model lets you combine multiple modalities like text, sketch, style and even motion to drive video generation. The results look extremely good.

Dreaming Tulpa 🥓👑

18,045 views • 3 years ago

Starting today you can try our new foundation research model for audio generation. The demo includes Zero shot TTS, Text to sound effects, Infilling and more! Try Audiobox ➡️

AI at Meta

515,618 views • 2 years ago

Say hello to GPT-4o, our new flagship model which can reason across audio, vision, and text in real time: Text and image input rolling out today in API and ChatGPT with voice and video in the coming weeks.

OpenAI

22,806,198 views • 2 years ago

Can you accurately transcribe fast speech? Tested ' new Speech-to-Text model (Scribe) with Eminem's "Rap God" (4.28 words/sec!) & it nailed it. Great quality and supports 99+ languages.

Addy Osmani

108,722 views • 1 year ago

🔉 Introducing SAM Audio, the first unified model that isolates any sound from complex audio mixtures using text, visual, or span prompts. We’re sharing SAM Audio with the community, along with a perception encoder model, benchmarks and research papers, to empower others to explore new forms of expression and build applications that were previously out of reach. 🔗 Learn more:

AI at Meta

1,249,200 views • 6 months ago

New text and image to video generation AI model Open-Sora-Plan-v1.3.0

AK

51,838 views • 1 year ago

OK, this is insane.. Alibaba just dropped 4 new AI models at Apsara 2025, and they’re wild: → a 1 trillion parameter LLM → a vision model that codes from images → an omni-model for text/audio/video → and a new Wan 2.5 preview for video + audio gen more details below:👇

Hamza Khalid

26,707 views • 9 months ago

Experimenting with OpenAI's new Text to Speech model 💬 Punctuation is powerful here 🤯

Miguel | AP

202,867 views • 2 years ago

This is a big day. Meta is open-sourcing AudioCraft. You can now generate incredible music and sounds with a single prompt. It includes the most performant Generative AI Model (audio) on the market, the "Llama" of Audio. The research framework contains the weights and code of these models: ▸ MusicGen: controllable text-to-music model. ▸ AudioGen: text-to-sound model. ▸ EnCodec: high fidelity neural audio codec. ▸ Multi Band Diffusion: An EnCodec compatible decoder using diffusion. This is going to tremendously speed up audio research 👏

Lior Alexander

231,604 views • 2 years ago