
At Standard Intelligence we’ve been researching scalable cross-modality learning. We’re excited to share some early results in the form of 𝗵𝗲𝗿𝘁𝘇-𝗱𝗲𝘃, an open-source, first-of-its-kind base model for full-duplex conversational audio. 1/

178,771 views • 1 year ago • via X (Twitter)

Comments: 10

Standard Intelligence • 1 year ago

Hertz-dev is an 8.5B parameter transformer trained on 20 million unique hours of high-quality audio data. We’ve released checkpoints and code for both mono and full-duplex generation on our website under the Apache license.

Standard Intelligence • 1 year ago

Hertz-dev is a base model, without fine-tuning, RLHF, or instruction-following behavior. It can be fine-tuned by researchers for almost 𝘢𝘯𝘺 audio modeling task, from live translation to classification.

Standard Intelligence • 1 year ago

Base models excel at faithfully modeling their training set, and accurate maps come from contact with reality. From the world’s largest dataset of high-quality real-world conversational audio, hertz-dev learned human-like speech patterns such as pauses and emotional inflections.

Standard Intelligence • 1 year ago

Hertz-dev has an 80ms theoretical average latency and benchmarks at 120ms real-world latency on a single RTX 4090, 1.5-2x lower than the previous state of the art. Low latency is necessary for natural audio, and we're proud to move the field in this direction.

Standard Intelligence • 1 year ago

We’re currently training a scaled, 70B parameter version of Hertz, and we’ll be expanding to more modalities in the future. We’re excited to see what the research community builds on top of this model.

jian • 1 year ago

This is impressive! Seems like the training dataset is mostly podcasts? And FYI, I believe there’s also a full-duplex vision/audio model out there, would be interested in learning more about the implementation!

Standard Intelligence • 1 year ago

cool project! would love to see our base model used in projects like this one

pranav ⠕ • 1 year ago

i love small business sunday

Standard Intelligence • 1 year ago

small. business. sunday.

Nicholas Charette • 1 year ago

so happy we got this out. base models are very important research artifacts to have publicly available, and i'm glad to help ensure that they exist further into the timeline:)
