LM Studio 0.3.10 is here with 🔮 Speculative Decoding! This provides inferencing speedups, in some cases 2x or more, with no degradation in quality.
- Works for both GGUF/llama.cpp and MLX models!
- Easily experiment with different draft models
- Visualize accepted draft token % rate
- Works in...

LM Studio

50,007 subscribers

73,791 views • 1 year ago • via X (Twitter)

Gaming Science & Technology
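The speculative decoding idea in the LM Studio post above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not LM Studio's or llama.cpp's implementation: `target_model` and `draft_model` are hypothetical toy functions, decoding is greedy only, and the acceptance rate is computed as accepted draft tokens divided by drafted tokens, which is one plausible reading of the "accepted draft token %" the post mentions. A real engine would verify all draft tokens in a single batched forward pass rather than one call per token.

```python
from typing import Callable, List, Tuple

# A "model" here is just a function from a token context to the next token (greedy).
Model = Callable[[List[int]], int]


def target_model(ctx: List[int]) -> int:
    # Hypothetical toy "large" model: deterministic function of the context.
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_model(ctx: List[int]) -> int:
    # Hypothetical toy "small" model: agrees with the target except when
    # the context length is a multiple of 4.
    guess = target_model(ctx)
    return (guess + 1) % 50 if len(ctx) % 4 == 0 else guess


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    # Baseline: one model call per generated token.
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int, int]:
    # Returns (tokens, accepted_draft_tokens, drafted_tokens).
    out = list(prompt)
    goal = len(prompt) + n_new
    accepted = drafted = 0
    while len(out) < goal:
        # 1. The draft model proposes k tokens, each conditioned on its own guesses.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        drafted += len(proposal)
        # 2. The target checks the proposal left to right.
        for tok in proposal:
            expected = target(out)
            if tok != expected:
                out.append(expected)  # first mismatch: keep the target's token
                break
            out.append(tok)
            accepted += 1
            if len(out) >= goal:
                break
        else:
            # Every draft token was accepted: the target's next token comes for free.
            out.append(target(out))
    return out[:goal], accepted, drafted


if __name__ == "__main__":
    prompt = [1, 2, 3]
    tokens, accepted, drafted = speculative_decode(target_model, draft_model, prompt, 20)
    print("same as target-only decoding:", tokens == greedy_decode(target_model, prompt, 20))
    print(f"accepted draft tokens: {accepted}/{drafted}")
```

Because every token that is kept equals what the target model would have produced for that context, the output matches target-only greedy decoding, which is the sense in which the post can claim no degradation in quality. The speedup depends on how often the draft model's guesses are accepted.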

Related Videos

New Llama.cpp UI is a blessing for the local AI world 🌎 - Blazing fast, beautiful, and private (ofc) - Use 150,000+ GGUF models in a super slick UI - Drop in PDFs, images, or text documents - Branch and edit conversations anytime - Parallel chats and image processing - Math and code rendering - Constrained generation with JSON schema supported Easy setup + open-source + community-built 🔥

Victor M

161,184 views • 8 months ago

Batching for vision models is now available in Beta with our latest MLX engine update 👾 The updated engine also brings major improvements to caching for faster inference overall. Turn on Developer Mode, choose the beta runtime channel, and select LM Studio MLX v1.8.1.

LM Studio

47,794 views • 2 months ago

Jan Desktop v0.7.7 is live 💛 This update brings native MLX support on macOS, a broader UX and UI refresh across the app, and better support for developer workflows. You can now upload files in Projects, use the local API server with both local and remote models, and work more smoothly with tools like Claude Code and other CLIs. Update your Jan or download the latest version at

👋 Jan

28,378 views • 5 months ago

THIS AI IS WILD chatgpt, claude and gemini and 3 more models in one app, you can chat with them at the same time, one shared brain that knows you. no more switching models. no more losing context. try here:

Farhan

20,548 views • 1 month ago

Devin for Terminal is a local agent that works with all frontier models, including Opus 4.7, GPT 5.5, and SWE-1.6. You can switch model mid-session, or handoff to Devin in the cloud.

Cognition

10,129,242 views • 3 months ago

After months of work, and with the help of our awesome community, we're excited to finally share LM Studio 0.3.0! 🎉 🔥 What's new: - Built-in Chat with Documents, 100% offline - OpenAI-like 'Structured Outputs' API with any local model - Total UI revamp (with dark/light/sepia themes) - Load & serve multiple LLMs *on the local network* - Available in 7 languages! 🌎🌍🌏 - Download any supported model from Hugging Face - Update LLM runtimes (llama.cpp) separately from the app ... and tons more goodies! Let us know how you like it! 👾🤝

LM Studio

142,600 views • 1 year ago

This symmetric diffusion paper at ICLR is nice (simple idea in retrospect): SymmCD: Symmetry-Preserving Crystal Generation with Diffusion Models We'd actually implemented this idea internally at Orbital, and it works nicely even for very large crystal structures:

Mark Neumann

18,188 views • 1 year ago

we sped up distributed inference by up to 5x with decentralized speculative decoding (dsd). many don't realize that AI models normally generate text one single word at a time, waiting for the network after every word. speculative decoding changes this by using a "guess & confirm" system, similar to autocomplete. how it's done: 1. draft locally (the guess) instead of waiting for the network, a tiny, fast model on your device guesses the next few words instantly. 2. confirm remotely (the check) the massive remote model doesn't generate from scratch; it just checks the draft. it looks at the guesses in a batch and says "yes, yes, no." you get multiple words in the time it usually takes to get one. 3. adaptive logic dsd is smart. if the topic is creative, it lets the draft flow loose. if the topic is math or code, it checks more strictly. it balances speed and precision automatically so your inference feels almost instant. find out more: paper: blog:

Parallax

45,425 views • 6 months ago
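The "yes, yes, no" check in the Parallax post above can be sketched as simple bookkeeping over network round trips. This is a hedged illustration, not the method from their paper: `tokens_from_round` and `round_trips` are hypothetical helpers, and the sketch assumes that each remote check also returns one token from the remote model (the correction at the first "no", or a bonus token when every guess passes), as in standard speculative decoding. The post itself does not say this.

```python
from typing import List, Tuple


def tokens_from_round(verdicts: List[bool]) -> int:
    # Count the accepted prefix of the draft: everything before the first "no".
    accepted = 0
    for ok in verdicts:
        if not ok:
            break
        accepted += 1
    # Assumption: the remote model contributes one token of its own per check.
    return accepted + 1


def round_trips(rounds: List[List[bool]]) -> Tuple[int, int]:
    # Returns (tokens produced, network round trips used).
    tokens = sum(tokens_from_round(v) for v in rounds)
    return tokens, len(rounds)


if __name__ == "__main__":
    # One check that answers "yes, yes, no".
    print(tokens_from_round([True, True, False]))
    # Two checks: "yes, yes, no" and then "yes, yes, yes".
    print(round_trips([[True, True, False], [True, True, True]]))
```

Without drafting, each token would cost one round trip, so the ratio of tokens to round trips is a rough proxy for the latency saved when the network dominates the cost.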

Nano Banana 2 Lite 🤝 Gemini Omni Flash Now you can use the Interactions API to build with both models. We created this e-commerce applet in Google AI Studio that turns static product shots into cinematic videos. Here’s how it works: 1⃣ Upload a product photo and set the vibe 2⃣ Nano Banana 2 Lite generates new, brand-accurate assets 3⃣ Gemini Omni Flash renders those images into a high-quality video

Google

40,316 views • 29 days ago

Peyton Havard’s change-up is of note for the draft, 4” carry with 15.3” run. Among RHP who threw 100+ CH to RHH, Havard ranked 1st in wOBA allowed, 5th in xwOBA, and 2nd in whiff rate (57.1%). Also threw 3 pitches with 30%+ whiff this year

Azad Earl

15,938 views • 1 year ago

Couldn’t have been more impressed with NDSU Football OL Grey Zabel in the Panini Senior Bowl 1-on-1s. Here are three good reps from him at three different positions (C, LG, RG) Strong case for IOL1 in this draft, and a top 50 pick

Trevor Sikkema

374,086 views • 1 year ago

Today I'm launching a very early release of MLX Model Manager, a Swift package for quickly and easily adding LLM/VLMs with just a couple of lines of code into your macOS/iOS applications for local and private inferencing. This is built on the amazing work of the MLX team & what they have done with MLX Model Manager unifies the work they have done with MLXLLM as well as MLXVLM into one package with added abstractions to simplify the usage. I've also added support for Google's new Paligemma2 Vision Model. You can see a demonstration in the video below! Huge shoutout to Awni Hannun for the creation of MLX, David Koski on leading the charge for MLXVLM and Prince Canuma for the creation of MLX-VLM. Three massive pillars in the MLX community among many others.

Kunal Batra

104,815 views • 1 year ago

My dual RTX PRO 6000 setup is currently training a Draft model for Qwen 3.6 27B! 🔥 I'm taking the paper DeepSeek dropped on 6/26 and going for a super ambitious application to the 27B scale. Thanks to my homelab, I was able to dive straight in — I read the paper and immediately started experimenting. The amount I've learned has been insane: - How memory bandwidth bottlenecks speed and clever ways to hack around it - Methods to train the draft model and boost its accuracy - Mechanisms to reference tokens all the way back to the previous one to skyrocket draft acceptance rates - The impact of Attention vs. GateDeltaNet on speculative decoding performance and how to handle those differences - The unique approaches and trade-offs of MTP, Dflash, JetSpec, and DSpark I could go on forever, but just from speculative decoding alone I've learned so much. The 27B architecture feels way more DSpark-native than JetSpec, so once draft training finishes, I'm going all-in with DSpark! My goal is to beat existing speculative decoding speeds outright — no task-specific shortcuts or cheating, pure general improvement. If you're into this kind of research, I'd love to hear your thoughts, impressions, and any suggestions — please reply! 🚀

Hikari∣LocalLLM⚡

56,314 views • 1 month ago

Introducing Jan v3, with updates to Jan Desktop v0.7.6 💛 Jan v3 is our first v3 model, a 4B base built for fine-tuning and fast local use, with stronger math and coding. This release also includes a small Jan Desktop update, starting with a UI refresh as we move toward a more unified Jan experience. Try it: - Jan v3 is available in Jan Desktop and at - Get the latest desktop app at Model: - Jan-v3-4B: - Jan-v3-4B-GGUF: Thanks Qwen for the base model and Georgi Gerganov for llama.cpp 💛

👋 Jan

53,900 views • 6 months ago

We’re unlocking the #NPU with #LiteRT to deliver high-performance AI that stays cool and fast. 🧠⚡️ Real-world impact and performance at scale: 🔷 Google Meet: Ultra-HD segmentation video effects models for pro-quality backgrounds 🔷 Epic Games: 30 FPS real-time MetaHuman facial animation on Android 🔷 Argmax: 2x speedup in speech-to-text with industry-leading latency 🔷 Google AI Edge Portal: cross-device benchmarking with NPU support Explore how it works →

Google for Developers

53,735 views • 3 months ago

Businesses, you can now* accept payments directly from agents—without, or with, a human in the loop—using Stripe machine payments. It works for both cards and stablecoins via Machine Payments Protocol or x402. Add to your existing integration in a single prompt. *No more waitlist, get started =>

Jeff Weinstein

12,498 views • 2 months ago

Run Gemma 4 26b MTP on 8 GB VRAM GPUs at 25+ tokens/second. Flags included! local llm space is moving at terminal velocity. only 3 days ago google released gemma 4 26b a4b qat quants. more efficient than before, ran on 8gb vram at 20 tok/sec. and now just a few hours ago, mainline llama.cpp merged a massive update and we just shattered our own record. decode throughput went 25-40% up on the same 8 GB VRAM setup! Before MTP: 20 tps -> After MTP: 28 tps! llama.cpp just officially merged PR #23398 ("add Gemma4 MTP"), bringing native Multi-Token Prediction (MTP) support to Gemma 4 models. By running speculative drafting on the same 8GB VRAM RTX 4060 setup, my decode throughput on a 64k context instantly leaped to a blistering 25–27 tokens/sec, that's a 25-30% increase with the same hardware. Here is the architectural catch you need to know: Unlike the Qwen 3.5 and 3.6 series, which bake the MTP heads directly into the base GGUF, the Gemma 4 MTP head is not built in. You must download a separate, specialized MTP drafter GGUF (the assistant model) to act as the speculator. (I've dropped the download link in the replies). copy and try the exact flags: -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf --spec-type draft-mtp --spec-draft-n-max 6 --spec-draft-p-min 0.7 --spec-draft-model gemma-4-26b-A4B-it-assistant-Q4_0.gguf -c 64000 -v n-max 4 and p-min 0.7 is also worth checking out. benchmark on your setup and workflow. if you have a single 8 gb vram nvidia rtx 4060, 3060, 3070, 2080, 2070, grab the MTP drafter GGUF link in the comments and try it yourself. Check it out even if you have a smaller or a larger gpu, such as a single rtx 3090, 4090, 3060, 2060. MTP works for all gemma 4 sizes such as gemma 4 12b, gemma 4 31b etc. but remember to grab the correct mtp draft assistant models respectively. what are you benchmarking today

Alok

200,913 views • 1 month ago
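The throughput figures in the Gemma 4 MTP post above can be checked with a short calculation. This is only arithmetic on the numbers quoted in the post (20 tps before, 25 to 28 tps after); `percent_increase` is a hypothetical helper name, and no new benchmark data is implied.

```python
def percent_increase(before: float, after: float) -> float:
    # Relative gain of `after` over `before`, in percent.
    return (after - before) * 100 / before


if __name__ == "__main__":
    baseline_tps = 20  # "Before MTP" figure quoted in the post
    for after_tps in (25, 27, 28):  # "After MTP" figures quoted in the post
        gain = percent_increase(baseline_tps, after_tps)
        print(f"{baseline_tps} -> {after_tps} tps: {gain:.0f}% increase")
```

With the post's numbers, 20 to 28 tps is a 40% gain, and the 25 to 27 tps range on the 64k context corresponds to gains of 25% to 35% over the 20 tps baseline.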

Happy GPT 5.6 day! We dream of fighting disease with this kind of intelligence. In that spirit, here is a first preview of Cell Cinema, the future cell token for AI models. And a first for humanity: we recorded ferroptosis label-free, in real-time. A huge feat for biosciences.

Precigenetics

286,984 views • 20 days ago

THIS IS ONE SHOTTED. ONE PROMPT. ONE RESULT. ☠️ Never ever imagined creating such slick 3D effects with Three.js would be this simple. I vibe coded with the latest Gemini Pro in Google AI Studio, and it’s honestly impressive how far these Gemini models have come in handling 3D. It feels like no other AI models come close to producing results like this :) Live: Code:

The Bugged Dev

19,564 views • 3 months ago