good morning /v1/chat/completions
This is a test we ran overnight on TensorRT-LLM with modified kernels, serving a custom 1B parameter model we trained for a customer.
~200ms end-to-end latency (not TTFB, full request).
Beats their current Cerebras stack on latency and quality.

Sam Hogan 🇺🇸

22,674 subscribers

43,655 views • 5 months ago • via X (Twitter)
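
The post above distinguishes full-request latency from time to first byte (TTFB). As a rough illustration of that distinction only, here is a minimal Python sketch that times both for any streamed response. The function name `measure_stream_latency` and the `fake_stream` generator are hypothetical stand-ins, not part of the author's benchmark; a real test would pass the chunk iterator returned by a streaming `/v1/chat/completions` client.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_stream_latency(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float]:
    """Return (ttfb_seconds, total_seconds) for one streamed response.

    Assumes `chunks` is lazy, so the request effectively starts when
    iteration begins. TTFB is None if the stream yields nothing.
    """
    start = clock()
    ttfb: Optional[float] = None
    for _ in chunks:
        if ttfb is None:
            # First chunk arrived: this is the time to first byte.
            ttfb = clock() - start
    # Stream exhausted: this is the full-request (end-to-end) latency.
    total = clock() - start
    return ttfb, total


def fake_stream():
    """Hypothetical stand-in for a streaming response (no network)."""
    time.sleep(0.01)
    yield "Hello"
    time.sleep(0.02)
    yield " world"


if __name__ == "__main__":
    ttfb, total = measure_stream_latency(fake_stream())
    print(f"TTFB: {ttfb * 1000:.0f} ms, full request: {total * 1000:.0f} ms")
```

The full-request figure is always at least as large as TTFB, which is why the two should not be compared directly across providers.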

Related videos

Really proud to share something we’ve been working on for a while: Magenta RealTime 2 (MTR2), a live music model that is highly interactive (MIDI, audio, text, lots of parameters) and low-latency (~200ms end-to-end), and runs locally on a MacBook!

Ilaria Manco

44,633 views • 1 month ago

Today we're releasing Not Diamond… The world’s most powerful AI model router. Not Diamond maximizes LLM output quality by automatically recommending the best LLM on every request at lower cost and latency. And it takes <5m to set up. Watch this to see how to start using it:

Tomas Hernando Kofman

75,806 views • 1 year ago

Multi-LoRA is in private preview on Cerebras Inference. Deploy one base model alongside a library of LoRA adapters. Switch between them per request, with no reloading, no separate deployments, and no latency cost. Available now for dedicated endpoint users. Reach out to your account rep to get access.

Cerebras

21,168 views • 1 month ago

can you chat privately with a cloud llm—*without* sacrificing speed? excited to release minions secure chat: an open-source protocol for end-to-end encrypted llm chat with <1% latency overhead (even @ 30B+ params!). cloud providers can’t peek—messages decrypt only inside a secure gpu enclave, where inference stays fully confidential 🤯 links + code in comments👇

Avanika Narayan

79,190 views • 1 year ago

Cerebras inference is very fast. So fast that it changes how we think about configuring our LLMs for voice agent use cases.
Kimi K2.6 is a 1T parameter reasoning model that Cerebras serves at 650-1,000 tokens per second (end-to-end throughput), with time to first token metrics as low as 150ms (latency). These numbers are two to three times faster than other similarly capable models.
The biggest lever we get from this kind of speed is that we can use the model in reasoning mode, and still have excellent "time to first non-thinking token."
This solves a big pain point we have in 2026 for voice agent use cases. Almost all recent innovation in post-training has focused on making models good at reasoning ("test time compute"). This is great, but it makes the user-facing model latency much, much slower, which is a problem for conversational voice agents.
We can run Kimi K2.6 with reasoning turned on, and get responses faster than other models produce with reasoning disabled. On my 30-turn voice agent benchmark, Kimi K2.6 with reasoning enabled ties GPT 5.1 and Haiku 4.5 with reasoning disabled, and is still about 200ms faster!
On my primary task agent benchmark, Kimi K2.6 is now the #2 model. It ranks just behind Gemini 3.5 Flash in "high" reasoning mode, and tied with GLM 5, Sonnet 4.6, and GPT 5.4 with reasoning set to "low." But Kimi K2.6 completes each turn in the agent loop in under 500ms. The other four models are all at least 3x slower. (Models only qualify for this benchmark if they can complete task turns at a P50 <4s.)
A couple of other things that this speed buys us, for production voice agents:
- Tool calls happen fast enough that we don't have to work around tool call latency in our pipeline design.
- We can prompt the model to output structured data at the beginning of a response, followed by plain text for voice generation. This opens up possibilities like asking the model to do complex classification/generation tasks that influence the rest of the pipeline. For example, the model could create a detailed style prompt for a steerable TTS model, for each individual conversation turn.
And, of course, you can use Kimi K2.6 with reasoning turned off. Cerebras calls this "instant" mode.
Here's a video of a Cerebras Kimi K2.6 voice agent with voice-to-voice response time, measured at the client, under 500ms. This is the true response latency as perceived by the user, including all network and audio codec overhead, transcription and turn detection, Kimi K2.6 token generation, and voice generation. 500ms is, effectively, instant. So the Cerebras naming for this mode is apropos. :-)

kwindla

40,319 views • 1 month ago

30 hours. One agent run. Sonnet 4.5 handled a full-day coding job end-to-end with lower latency and fewer handoffs.

Superhuman AI

124,502 views • 8 months ago

We shipped LiveKit Turn Detector v1. Instead of reading transcripts, it listens to speech directly, combining semantic and acoustic cues into one end-of-turn prediction. The result: high accuracy, low latency—the best model we tested across 14 languages. Available on LiveKit Cloud.

LiveKit

11,142 views • 17 days ago

Meet Lightning V2! The fastest TTS model with just 100ms latency (ttfb)! Supports 16 languages, custom voices, and costs only $0.05 per 10K chars. Enterprise-ready. Ultra-fast. Built for scale.

smallest.ai

120,909 views • 1 year ago

compute has to be distributed and personalized to minimize latency
as AI scales, the economy is increasingly latency-sensitive
only $amd can solve that problem at scale, end-to-end
long $amd since $4.2 and I’m betting on it becoming a $5T company in ~5 years

Antonio Linares

64,103 views • 2 months ago

NBC livestreamed mid-flight using Starlink on a United Airlines flight. It's amazing we can now livestream high-quality video with low latency on a plane, all thanks to reusable rockets. 😅

Robin

1,381,344 views • 8 months ago

"I used a billion tokens this week. I'm not even in the top 100 Codex users at OpenAI." We sat down with jason (creator of Instructor, now on OpenAI's Developer Experience team) to talk about how zero-latency inference is changing the way engineers work.

Cerebras

93,023 views • 2 months ago

End to End Speech models are on fire - LLAMA-OMNI 8B - Apache licensed! 🔥
> Speech Encoder - Whisper Large v3
> LLM backbone - Llama 3.1 8B Instruct
> Speech Decoder - HuBERT (UnitY)
> Simultaneously generate Speech + Text
> Less than 250 ms latency
> Trained in less than 3 days on 4x GPUs
> Used 200K instruct pairs
> Model checkpoints on the Hub 🤗
> Space incoming!
GG! I'm here for this trend! 🐐

Vaibhav (VB) Srivastav

47,921 views • 1 year ago

Kyutai released their Streaming Text to Speech model, ~2B param model, ultra low latency (220ms), CC-BY-4.0 license 🔥 Trained on 2.5 Million Hours of audio, it can serve up to 32 users w/ less than 350ms latency on a SINGLE L40 🤯 Incredible release by kyutai folks, go check out their hugging face page now!

Vaibhav (VB) Srivastav

93,512 views • 1 year ago

"One of the biggest misconceptions" Cerebras CFO Bob Komin pushes back on the small-models narrative. "We serve all models, and there is no limit to the size of the models that we can serve. Today, we're serving trillion parameter models. We're serving trillion parameter models that are internal for OpenAI today. We are currently running OpenAI 5.4 and 5.5 with them."

Deirdre Bosa

84,373 views • 1 month ago

Microsoft silently updated OmniParser on the hub 👀 60% faster than v1 - sub-second latency on a 4090! "OmniParser is a general screen parsing tool, which interprets/converts UI screenshot to structured format, to improve existing LLM based UI agent." Bonus: you can try it out for free!

Vaibhav (VB) Srivastav

95,750 views • 1 year ago

Finally got to ~zero input-to-display latency on macOS. It has taken us a _lot_ of work to get here! This is a slow-motion capture of a 120 Hz screen running a playbit program that draws a rectangle where the cursor is. The slight lag you're seeing is not from rendering latency but from "old" mouse position data (we receive it from the macOS window server in an event queue). Remember that the mouse cursor is drawn with dedicated hardware (a dedicated GPU plane), so dragging a rectangle is the ultimate input-to-display latency test.

Rasmus Andersson

213,116 views • 18 days ago

🚨 This is Warzone on the highest fps/ lowest latency settings 🚨 It is really hard to lose a straight up gunfight when you are on this low of latency. Feels similar to the "host advantage" we all used to want back in the console days.

Kibbs

300,553 views • 2 years ago

Congrats to LemonSlice on their $10.5M seed! They built the world's first interactive talking AI video model—a face layer for voice agents. Trained on a custom, 20B-parameter video diffusion transformer, streaming at 20fps on a single GPU. Infinite-length video generation with no error accumulation.

Y Combinator

50,846 views • 6 months ago

We work, we hold on, and we keep moving through a tunnel where no light is visible yet. And we know that darkness at the end of the tunnel means a turn, not the end. The most important thing is to keep going. Good will win.

Anton Gerashchenko

70,798 views • 5 months ago

LATENCY is back with their first mini album, 'LATE O' CLOCK', serving with their self-titled single.
LATENCY consists of cignature members Jeewon, Haeun & Semi, LOONA's HyunJin and professional guitarist Fingerstylish.

nugu promoter

84,286 views • 3 months ago