Are you using OpenAI's Whisper for speech recognition and finding the timestamps are out of sync? Just dropped: WhisperX, with accurate word-level timestamps from force-aligning Whisper's output with wav2vec 2.0 🧵 [1/n]
78,290 views • 3 years ago • via X (Twitter)
11 Comments

🧵[2/n] @openAI’s Whisper shows impressive transcription performance, but often the corresponding timestamps are out of sync by several seconds.

🧵[3/n] However, phoneme-based models such as Wav2Vec2.0 produce much more accurate timestamps. WhisperX leverages these models by running forced alignment on the Whisper transcription to generate word-level timestamps.
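
For readers new to forced alignment, here is a minimal sketch of the idea, not WhisperX's actual code. It assumes we already have per-frame log-probabilities from an acoustic model (here a hand-made toy matrix) and a known token sequence from the transcript, and it finds the best monotonic assignment of frames to tokens with dynamic programming. The function names, the toy data, and the 20 ms frame length are all assumptions for illustration; real CTC alignment also handles a blank token, which is left out here.

```python
import math


def forced_align(log_probs, tokens):
    """Return one (first_frame, last_frame) span per token.

    log_probs[t][v] is the log-probability of vocabulary item v at frame t.
    tokens is the known transcript as a list of vocabulary indices.
    Simplified: every frame belongs to exactly one token, no blank symbol.
    """
    T, N = len(log_probs), len(tokens)
    if N == 0 or T < N:
        raise ValueError("need at least one frame per token")

    NEG = -math.inf
    # score[t][j]: best total log-prob with frame t assigned to token j
    score = [[NEG] * N for _ in range(T)]
    # back[t][j]: 0 if frame t-1 was also on token j, 1 if it was on token j-1
    back = [[0] * N for _ in range(T)]
    score[0][0] = log_probs[0][tokens[0]]

    for t in range(1, T):
        for j in range(min(t + 1, N)):
            stay = score[t - 1][j]
            move = score[t - 1][j - 1] if j > 0 else NEG
            if move > stay:
                best, back[t][j] = move, 1
            else:
                best, back[t][j] = stay, 0
            score[t][j] = best + log_probs[t][tokens[j]]

    # Walk back from the last frame on the last token to recover the spans.
    spans = [[None, None] for _ in range(N)]
    j = N - 1
    for t in range(T - 1, -1, -1):
        if spans[j][1] is None:
            spans[j][1] = t
        spans[j][0] = t
        if back[t][j] == 1:
            j -= 1
    return [tuple(s) for s in spans]


def spans_to_times(spans, frame_sec=0.02):
    """Convert frame spans to (start, end) seconds; 20 ms frames are assumed."""
    return [(round(s * frame_sec, 3), round((e + 1) * frame_sec, 3)) for s, e in spans]


# Toy example: 6 frames, 3-item vocabulary, transcript is tokens 0, 1, 2.
hi, lo = math.log(0.8), math.log(0.1)
emissions = [
    [hi, lo, lo],
    [hi, lo, lo],
    [lo, hi, lo],
    [lo, hi, lo],
    [lo, hi, lo],
    [lo, lo, hi],
]
spans = forced_align(emissions, [0, 1, 2])
times = spans_to_times(spans)
print(spans)  # [(0, 1), (2, 4), (5, 5)]
print(times)  # [(0.0, 0.04), (0.04, 0.1), (0.1, 0.12)]
```

Because the transcript is fixed, the search only decides where each token starts and ends, which is why the timestamps can be tighter than those from a model that is also deciding what was said.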

🧵[4/n] The result is word-level timestamp output. See more examples and try it yourself at

🧵[5/n] Of course, it would be better if a single model did everything. One way would be a teacher-student setup, where Whisper learns to output wav2vec's aligned timestamps. If @OpenAI open-sourced the training data and script, it would be cool to try this :)

@philipvollet @OpenAI Awesome!

@OpenAI Any way to get this working for a musician who struggles with aligning vocals to videos?

@OpenAI Do you mean aligning lyrics to the audio? You can feed the lyrics to the align function in the code, although aligning over such a long sequence could be tricky.

@OpenAI Big bro killing it 🤝

@OpenAI @memdotai mem it

@OpenAI Thanks !
