Been looking into token optimization and model routing. I think it’s a super obvious optimization to tackle both cost + demand on inference. Here’s a small post about different techniques and methods.

Sean Geng

4,079 subscribers

16,441 views • 1 month ago • via X (Twitter)
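
The post mentions model routing as a way to cut inference cost. As a rough illustration only (not the author's method), here is a minimal sketch of one common routing idea: send each request to the cheapest model whose quality is good enough for the task. The model names, prices, and quality scores below are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str                  # hypothetical model label
    cost_per_1k_tokens: float  # hypothetical price, arbitrary currency units
    quality: float             # hypothetical quality score in [0, 1]


# Assumed catalogue; real prices and scores would come from your own evals.
MODELS = [
    ModelOption("small", 0.1, 0.60),
    ModelOption("medium", 0.5, 0.80),
    ModelOption("large", 2.0, 0.95),
]


def route(required_quality: float, models=MODELS) -> ModelOption:
    """Pick the cheapest model that meets the required quality.

    If no model is good enough, fall back to the highest-quality one.
    """
    eligible = [m for m in models if m.quality >= required_quality]
    if not eligible:
        return max(models, key=lambda m: m.quality)
    return min(eligible, key=lambda m: m.cost_per_1k_tokens)


def estimated_cost(model: ModelOption, tokens: int) -> float:
    """Estimated spend for a request of the given token count."""
    return model.cost_per_1k_tokens * tokens / 1000


if __name__ == "__main__":
    for need in (0.5, 0.7, 0.9):
        choice = route(need)
        print(need, choice.name, estimated_cost(choice, 2000))
```

In practice the required quality would be estimated per request (for example by a classifier or heuristics on the prompt), which this sketch leaves out.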

Related Videos

$NBIS cofounder Roman Chernin describes how their recent acquisitions of Eigen AI and Clarifai were all about speed, incredible talent, and acceleration: "The philosophy is very simple. We need to build so many things, and we need to move so fast, that we're always looking for people who can accelerate us. It should be exceptional talent, and/or something that has a great adoption." "Our two recent acquisitions [were] two teams that work on inference optimization. A big part of our business is how efficiently we convert GPUs into tokens. And these two teams — Eigen AI and Clarifai — one is focused on model optimization, the engine of inference. How you run specific models and all the techniques around spec decoding, quantization, and so on." "And the other is system optimization. All the routing, KV caching, and orchestration across the big cluster of compute and so on." "We have a very strong internal team working on inference. But we felt that we needed to move faster, bring more capabilities. Because the market is so fast."

TBPN

30,486 views • 1 month ago

The defining differentiator in AI right now isn't model performance, it's cost per token. Because in this market, efficiency is what turns demand into actual growth. Roman Chernin of Nebius outlines why inference economics are becoming the central challenge, as companies operating on thin margins need infrastructure that can scale without eroding profitability. Token Factory reflects that shift, focusing on orchestration, optimization, and full-stack control to make AI systems economically viable.

Six Five Media

294,803 views • 2 months ago

[ Model Optimization ] Here's a comparison video of the Female Dancer's second optimization (the third one). Her legs are no longer thick as they were in the first optimization, and her chest size has been reduced to match the original model. ※ Source:

Identity V | News

55,821 views • 5 months ago

As AI labs race to train and deploy new frontier models, existing models become more affordable with better tokenomics. ✨ "Everybody's trying to get to the next frontier. And every time they get to the next frontier, the last generation AI tokens, the cost starts to decline about a factor of 10x every year," said NVIDIA CEO Jensen Huang in a recent keynote. Model optimization techniques such as speculative decoding and multi-token prediction, combined with inference serving platforms like NVIDIA Dynamo on NVIDIA Blackwell NVL72 systems, enable AI factories to boost throughput by 10x with one-tenth of the cost per token. Learn more about AI factory tokenomics ➡️

NVIDIA AI

16,053 views • 5 months ago

"I don't really think about brand optimization, to be frank, as is obvious from my tweets, which are often self-inflicted wounds." Elon Musk

Tesla Owners Silicon Valley

1,056,862 views • 1 year ago

Crusoe Managed Inference is now available. Run model inference faster, easier, and more cost-effectively. Large-context AI solutions demand fast outputs, massive throughput, and resilient scaling. Crusoe Managed Inference delivers breakthrough speed & reduced input-token spend.

Crusoe

12,096,815 views • 7 months ago

Every AI product running in production depends on an inference system someone had to engineer, optimize and more. Scheduling, batching, routing, cost per token. This is a craft. The Inference Frontier Program spotlights the builders behind that work. 💡 Watch the video and nominate a team:

Nebius

54,224 views • 3 months ago

The Token Company (YC W26) builds LLM input optimization to lower costs, reduce latency, and improve accuracy. Congrats on the launch, otso!

Y Combinator

27,542 views • 4 months ago

In case you missed it, we recently launched "Post-training of LLMs," a short course where you'll: ✅ Understand when and why to use post-training methods like Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Online Reinforcement Learning. ✅ Learn the concepts underlying the three post-training methods of SFT, DPO, and Online RL, their common use-cases, and how to curate high-quality data to effectively train a model using each method. ✅ Download a pre-trained model and implement post-training pipelines to turn a base model into an instruct model, change the identity of a chat assistant, and improve a model’s math capabilities. Learn more and enroll for free:

DeepLearning.AI

16,771 views • 11 months ago

[ Comparison Video ] Truth & Inference Series Optimization ▸ Photographer A-Tier Costume – "D.M." Source:

Identity V | News

167,443 views • 2 months ago

the 3 phases of AI sticker shock: 1. board asks “what's our AI strategy?” 2. company says “everyone use AI for everything” 3. the token bill arrives, the crackdown begins ...and model routing starts to look pretty obvious. Matan Grinberg: “There are different models that are good at different tasks. They have different trade-offs between cost, quality, and speed.”

Deirdre Bosa

31,903 views • 1 month ago

🚨 Forget LIDAR. The Robbyant team just dropped a streaming 3D model that reconstructs scenes live, at ~20 FPS, over long sequences. One single camera. Runs in real time. Open-source. Entirely end-to-end. NO iterative optimization tricks and no post-processing cleanup steps! It outperforms both existing streaming approaches and several offline methods. 100% Free and open-source. Repo, paper and model weights in 🧵↓

Charly Wargnier

35,350 views • 7 days ago

New Course: Post-training of LLMs

Learn to post-train and customize an LLM in this short course, taught by Banghua Zhu, Assistant Professor at the University of Washington and co-founder of @NexusflowX.

Training an LLM to follow instructions or answer questions has two key stages: pre-training and post-training. In pre-training, it learns to predict the next word or token from large amounts of unlabeled text. In post-training, it learns useful behaviors such as following instructions, tool use, and reasoning.

Post-training transforms a general-purpose token predictor, trained on trillions of unlabeled text tokens, into an assistant that follows instructions and performs specific tasks. Because it is much cheaper than pre-training, it is practical for many more teams to incorporate post-training methods into their workflows.

In this course, you’ll learn three common post-training methods, Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Online Reinforcement Learning (RL), and how to use each one effectively. With SFT, you train the model on pairs of input and ideal output responses. With DPO, you provide both a preferred (chosen) and a less preferred (rejected) response and train the model to favor the preferred output. With RL, the model generates an output, receives a reward score based on human or automated feedback, and is updated to improve performance.

You’ll learn the basic concepts, common use cases, and principles for curating high-quality data for effective training. Through hands-on labs, you’ll download a pre-trained model from Hugging Face and post-train it using SFT, DPO, and RL to see how each technique shapes model behavior.

In detail, you’ll:
- Understand what post-training is, when to use it, and how it differs from pre-training.
- Build an SFT pipeline to turn a base model into an instruct model.
- Explore how DPO reshapes behavior by minimizing a contrastive loss that penalizes poor responses and reinforces preferred ones.
- Implement a DPO pipeline to change the identity of a chat assistant.
- Learn online RL methods such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), and how to design reward functions.
- Train a model with GRPO to improve its math capabilities using a verifiable reward.

Post-training is one of the most rapidly developing areas of LLM training. Whether you’re building a high-accuracy context-specific assistant, fine-tuning a model's tone, or improving task-specific accuracy, this course will give you experience with the most important techniques shaping how LLMs are post-trained today.

Please sign up here:

Andrew Ng

125,146 views • 1 year ago

Learn to optimize RAG for cost and performance in our new short course, Prompt Compression and Query Optimization, created with MongoDB and taught by Richmond Alake. This course teaches you to combine traditional database capabilities with vector search using MongoDB for RAG.

You'll learn these techniques:
- Vector search: For semantic matching of user queries
- Filtering using metadata: Pre- and post-filtering to narrow search results
- Projections: Selecting only necessary fields to minimize data returned
- Boosting: Reranking results to improve relevance
- Prompt compression: Using a small LLM to compress context, significantly reducing token count and processing costs

These methods address scaling, performance, and security challenges in large-scale RAG applications. You can sign up here:

Andrew Ng

71,699 views • 2 years ago

Understanding OpenAI o1: Noam Brown on integrating reasoning into the model. Takeaways: - Avoid MCTS and current paradigm of using processes outside of the model during inference - Think about how to directly integrate reasoning into the model architecture

Casper Hansen

313,291 views • 1 year ago

$NBIS co-founder Roman Chernin explained why the next AI bottleneck is running inference efficiently at scale. That’s why Nebius is building Token Factory around model optimization, orchestration & agentic AI deployment instead of acting like a basic GPU rental business.

Shay Boloor

129,547 views • 1 month ago

bro casually walks and explains 5 GPU performance optimization methods for LLMs. one of the most simple and intuitive explanations for beginners.

ℏεsam

941,888 views • 5 months ago

Jim Cramer on Leopold Aschenbrenner’s $NBIS stake “When this man touches something, it turns to gold.” The next AI bottleneck is efficient inference at scale which is why Nebius Token Factory is built around optimization, orchestration and agentic deployment rather than basic GPU rental.

Shay Boloor

289,153 views • 1 month ago

[ Model Optimization ] Here's a comparison video of the Mercenary's optimization. He now has hair underneath his hood, he's no longer a bald bean. ※ Source:

Identity V | News

146,320 views • 5 months ago