At Standard Intelligence we’ve been researching scalable cross-modality learning. We’re excited to share some early results in the form of hertz-dev, an open-source, first-of-its-kind base model for full-duplex conversational audio. 1/

Standard Intelligence

10,198 subscribers

178,771 views • 1 year ago • via X (Twitter)

Science & Technology

Comments: 10

Standard Intelligence • 1 year ago

Hertz-dev is an 8.5B parameter transformer trained on 20 million unique hours of high-quality audio data. We’ve released checkpoints and code for both mono and full-duplex generation on our website under the Apache license.

Standard Intelligence • 1 year ago

Hertz-dev is a base model, without fine-tuning, RLHF, or instruction-following behavior. It can be fine-tuned by researchers for almost any audio modeling task, from live translation to classification.

Standard Intelligence • 1 year ago

Base models excel at faithfully modeling their training set, and accurate maps come from contact with reality. From the world’s largest dataset of high-quality real-world conversational audio, hertz-dev learned human-like speech patterns such as pauses and emotional inflections.

Standard Intelligence • 1 year ago

Hertz-dev has an 80 ms theoretical average latency, and benchmarks at 120 ms real-world latency on a single RTX 4090, 1.5-2x lower than the previous state of the art. Low latency is necessary for natural audio, and we're proud to move the field in this direction.

Standard Intelligence • 1 year ago

We’re currently training a scaled, 70B parameter version of Hertz, and we’ll be expanding to more modalities in the future. We’re excited to see what the research community builds on top of this model.

jian • 1 year ago

This is impressive! Seems like the training dataset is mostly podcasts? And FYI, I believe there’s also a full-duplex vision/audio model out there; I would be interested in learning more about the implementation!

Standard Intelligence • 1 year ago

cool project! would love to see our base model used in projects like this one

pranav ⠕ • 1 year ago

i love small business sunday

Standard Intelligence • 1 year ago

small. business. sunday.

Nicholas Charette • 1 year ago

so happy we got this out. base models are very important research artifacts to have publicly available, and i'm glad to help ensure that they exist further into the timeline:)

Related videos

Opanarchy Marketplace: Alpha Release. We’re releasing the first alpha version of our robotic marketplace, an early look at what we’ve been building. Open to feedback and improvements.

Opanarchy

36,230 views • 7 months ago

NVIDIA just removed the biggest friction point in Voice AI 🤯 They've open-sourced PersonaPlex-7B, a full-duplex conversational model that can listen and speak at the same time. Instead of waiting for you to finish talking, it uses a dual-stream architecture to process and respond in real-time. 100% Open-Source and Free.

Simplifying AI

278,640 views • 4 months ago

I'm *very* excited to share something we’ve been working on for a while now... It's a set of AI agents tailor-made for founders and built around each founder's specific workflows. Here’s an early look at some of the things the agents can do:

Ran Aroussi

30,167 views • 1 year ago

We’re excited to share new research on Project AMIE (Articulate Medical Intelligence Explorer), our conversational medical AI! 💬 We're exploring its potential to assist clinicians in primary & specialty care. 🔗

Google for Health

38,430 views • 1 year ago

Today, we’re launching Explore, an entirely new way to interact with your data. We’ve been testing Querio Explore with customers of all sizes for months, and we’re excited to share it with the world!

rami

10,636 views • 1 year ago

Meet Moshiko and Moshika, the open source Moshi models 📖🟢. Moshi is a 7B text-audio model, capable of doing full-duplex conversations: it can listen and speak at any time. Plus, its inner text monologue improves the generation 💬 All on device🧑‍💻 🔎

Alexandre Défossez

131,302 views • 1 year ago

NVIDIA just removed one of the biggest friction points in Voice AI. PersonaPlex-7B is an open-source, full-duplex conversational model. Free, open source (MIT), with open model weights on Hugging Face 🤗 Links to repo and weights in 🧵↓ The traditional ASR → LLM → TTS pipeline forces rigid turn-taking. It’s efficient, but it never feels natural. PersonaPlex-7B changes that. This NVIDIA model can listen and speak at the same time. It runs directly on continuous audio tokens with a dual-stream transformer, generating text and audio in parallel instead of passing control between components. That unlocks: → instant back-channel responses → interruptions that feel human → real conversational rhythm Persona control is fully zero-shot! If you’re building low-latency assistants or support agents, this is a big step forward 🔥

Charly Wargnier

564,424 views • 4 months ago

🚀 Today we’re launching Manus Academy (Early Bird) — an open-source learning platform to prepare professionals for the future of work. A clear, practical path to using AI for real tasks, not just Q&A.

Manus

58,074 views • 5 months ago

We have been hard at work at the studio and are excited to share a reel of some of our web work. Excited for 2025!

Studio Null

27,283 views • 1 year ago

Today, we’re launching Orpheus, an open-source TTS model that exceeds the capabilities of both open and closed-source models such as ElevenLabs and OpenAI! (1/6)

Elias

629,456 views • 1 year ago

We’ve been cooking and we’re excited to share this 👩‍🍳. Meet Moniebook, an all-in-one platform to help businesses with payments, inventory management and tracking their sales by Moniepoint. Here’s a look at the process of putting this together, and why we’re excited about what it means for the millions of businesses banking with us 🚀 #Moniepoint #Moniebook #Finance

Moniepoint Group

71,054 views • 6 months ago

Base is beginning to explore a network token. We’re in the early phases of exploration, and don’t have any specifics to share around timing, design, or governance. We’re committed to bringing the community along with us, and building in the open.

Base

3,821,946 views • 9 months ago

Excited to share a peek of what I’ve been working on. We at Sesame believe voice is key to unlocking a future where computers are lifelike. Here’s an early preview you can try! 👇 We’ll be open sourcing a model, and yes… we’re building hardware! 🧵

Justin Alvey

452,433 views • 1 year ago

After operating in stealth for the last 18 months, we’re excited today to finally show the world what we’ve been working on. We believe we’re on a path to physical AGI with the launch of our brand new foundation model, the Direct Video Action (DVA) model.

Jagdeep Singh

254,984 views • 3 months ago

1/ We’ve been deep in build mode for months, and we’re excited to show you what we've been working on. Introducing the next chapter of Solana Mobile: the Solana Seeker 🧵👇

Seeker | Solana Mobile

1,799,340 views • 1 year ago

We’re excited to share our latest work published today in Nature Biotechnology: Protein2PAM, an AI model that enables the rapid design of CRISPR editors with new PAM recognition. And we’re making the model freely available for research and commercial use:

Profluent

16,511 views • 4 months ago

Today mimic and friends are excited to share mimic-video, a new class of Video-Action Model that elevates video model backbones as first class citizens for robot learning!

Elvis Nava

85,675 views • 5 months ago

VITA: Towards Open-Source Interactive Omni Multimodal LLM. The remarkable multimodal capabilities and interactive experience of GPT-4o underscore their necessity in practical applications, yet open-source models rarely excel in both areas. In this paper, we introduce VITA, the first-ever open-source Multimodal Large Language Model (MLLM) adept at simultaneous processing and analysis of Video, Image, Text, and Audio modalities, and meanwhile has an advanced multimodal interactive experience. Starting from Mixtral 8x7B as a language foundation, we expand its Chinese vocabulary followed by bilingual instruction tuning. We further endow the language model with visual and audio capabilities through two-stage multi-task learning of multimodal alignment and instruction tuning. VITA demonstrates robust foundational capabilities of multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of both unimodal and multimodal benchmarks. Beyond foundational capabilities, we have made considerable progress in enhancing the natural multimodal human-computer interaction experience. To the best of our knowledge, we are the first to exploit non-awakening interaction and audio interrupt in MLLM. VITA is the first step for the open-source community to explore the seamless integration of multimodal understanding and interaction. While there is still lots of work to be done on VITA to get close to closed-source counterparts, we hope that its role as a pioneer can serve as a cornerstone for subsequent research.

AK

23,958 views • 1 year ago

I’m excited to deliver my first State of the Commonwealth on January 17th! 2023 was all about making Massachusetts more affordable. In 2024, we’re going to keep after it. We’ve got some awesome things in store for this year that I can’t wait to share with you. Tune in at #MASOTC

Governor Maura Healey

23,665 views • 2 years ago

🎉Excited to share a fun little hardware project we’ve been working on. GELLO is an intuitive and low cost teleoperation device for robot arms that costs less than $300. We've seen the importance of data quality in imitation learning. Our goal is to make this more accessible 1/n

Philipp Wu

162,325 views • 2 years ago