Depth Anything 3 now runs as pure C++/ggml. No Python, no PyTorch, no CUDA toolkit at inference, just one self-contained GGUF. It's faster than PyTorch on CPU and ties it on GPU. The CPU win came from the last place I'd have looked. Quantized GGUF on Hugging Face 🤗...

Ettore Di Giacinto

3,343 subscribers

33,985 views • 6 days ago • via X (Twitter)

Related Videos

parakeet.cpp: native C++/ggml inference for NVIDIA AI Developer's Parakeet, one of the best speech-to-text models out there, from the LocalAI team. Every Parakeet model (TDT/CTC/RNNT/hybrid + cache-aware streaming), byte-for-byte identical output to NeMo, now running anywhere with no Python and even a bit faster, on CPU and GPU. Quantized GGUF on Hugging Face 🤗 Huge thanks to Georgi Gerganov for ggml and to NVIDIA AI Developer for releasing Parakeet! 🧵

Ettore Di Giacinto

55,426 views • 25 days ago

Watching llama.cpp do 40 tok/s inference of the 7B model on my M2 Max, with 0% CPU usage, and using all 38 GPU cores. Congratulations Georgi Gerganov! This is a triumph.

Nat Friedman

1,764,074 views • 3 years ago

Just dropped on HF: NeuTTS Air
Next-gen on-device TTS that matches cloud-level quality while staying fully open source.
> Real-time speech synthesis on CPU/GPU
> 3-second voice cloning, no cloud or data upload
> Compact: under 200 MB, runs on mobile and edge devices
> Multilingual and expressive
> Developed by Neuphonic, optimized for speed and fidelity

steven

72,273 views • 8 months ago

pip install spectralquant ✂️
Up to 6.62x KV cache compression for LLMs and transformers. Same model. Faster outputs. Smaller KV cache.
Try now (2 mins):
- KV cache integration via Hugging Face's DynamicCache
- Three presets: 5.95x (paper), 6.55x (validated), 6.68x (edge)
- Mistral 7B / Qwen 2.5 7B / Llama 3.1 8B verified
- Pure PyTorch + future CUDA kernel support
- Auto-calibration from a bundled corpus
📰 Paper: 💻 Code, quickstart, and benchmarks:
#LLM #Inference #PyTorch #OpenSource #MachineLearning #LLM #KVCache #Inference

ani

16,583 views • 25 days ago

We are building “Open Source Nano Banana for Video” - here is open source demo v0.1 We are open sourcing Lucy Edit, the first foundation model for text-guided video editing! Get the model on Hugging Face 🤗, API on @FAL, and nodes on ComfyUI 🧵

Decart

413,676 views • 9 months ago

Microsoft killed the GPU mafia 🤯 They finally open-sourced their 1-bit LLM inference framework called bitnet.cpp. It lets you run 100B parameter models on your local CPU without GPUs. - 6.17x faster inference - 82.2% less energy on CPUs 100% Open Source.

Oliver Prompts

1,627,505 views • 4 months ago

Holy shit... Microsoft open sourced an inference framework that runs a 100B parameter LLM on a single CPU. It's called BitNet. And it does what was supposed to be impossible.
No GPU. No cloud. No $10K hardware setup. Just your laptop running a 100-billion parameter model at human reading speed.
Here's how it works: Every other LLM stores weights in 32-bit or 16-bit floats. BitNet uses 1.58 bits. Weights are ternary: just -1, 0, or +1. That's it. No floats. No expensive matrix math. Pure integer operations your CPU was already built for.
The result:
- 100B model runs on a single CPU at 5-7 tokens/second
- 2.37x to 6.17x faster than llama.cpp on x86
- 82% lower energy consumption on x86 CPUs
- 1.37x to 5.07x speedup on ARM (your MacBook)
- Memory drops by 16-32x vs full-precision models
The wildest part: Accuracy barely moves. BitNet b1.58 2B4T, their flagship model, was trained on 4 trillion tokens and benchmarks competitively against full-precision models of the same size. The quantization isn't destroying quality. It's just removing the bloat.
What this actually means:
- Run AI completely offline. Your data never leaves your machine
- Deploy LLMs on phones, IoT devices, edge hardware
- No more cloud API bills for inference
- AI in regions with no reliable internet
The model supports ARM and x86. Works on your MacBook, your Linux box, your Windows machine. 27.4K GitHub stars. 2.2K forks. Built by Microsoft Research. 100% Open Source. MIT License.

Guri Singh

2,180,357 views • 3 months ago
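
The ternary-weight idea described in the BitNet post above can be sketched in a few lines of Python. This is only an illustration, not bitnet.cpp's actual kernels: the names `ternary_quantize` and `ternary_dot` are hypothetical, the absmean-style scale is an assumption the post does not spell out, and activations are left as floats for simplicity. It shows where the "1.58 bits" figure comes from (three states per weight, log2(3)) and why a dot product against ternary weights needs only adds and subtracts.

```python
import math


def ternary_quantize(weights, eps=1e-8):
    """Map float weights to {-1, 0, +1} plus one shared scale.

    Assumption: the scale is the mean absolute weight (absmean-style).
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def ternary_dot(quantized, scale, activations):
    """Dot product with ternary weights: adds and subtracts only,
    followed by a single multiply by the shared scale."""
    acc = 0.0
    for q, x in zip(quantized, activations):
        if q == 1:
            acc += x
        elif q == -1:
            acc -= x
        # q == 0 contributes nothing
    return acc * scale


# Three possible states per weight carry log2(3) bits of information.
bits_per_weight = math.log2(3)

q, s = ternary_quantize([0.9, -0.05, -1.2, 0.4])
y = ternary_dot(q, s, [1.0, 2.0, 3.0, 4.0])
print(q, round(bits_per_weight, 3), y)
```

Small weights collapse to 0 and large ones saturate at ±1, so the only per-weight information left is the sign or "off", which is what lets the inner loop skip multiplications.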

Step 4 to achieve truly serverless GPUs for AI inference: skip over unserializable inference engine setup steps like CUDA graph capture and Torch compilation by stacking GPU snapshots and CPU snapshots.

Charles 🎉 Frye

17,452 views • 1 month ago

I don’t have the BRAM for a depth buffer so it’s time to get the memory controller up and running for this fjord torus. 100% FPGA logic 3D pipeline, no CPU.

Ian Hanschen

29,249 views • 4 months ago

Someone just built a desktop app that generates 3D models from images and runs 100% locally. It's called Modly. It runs entirely on your GPU, no cloud, no API bills. Just drop an image and get a 3D mesh. 100% Open Source.

How To Prompt

222,682 views • 2 months ago

Llama 2: Now on Hugging Chat 🤗🦙 Try out the 70B Chat model for free with super fast inference, web search, and powered by open-source tools! 👉

Hugging Face

403,558 views • 2 years ago

New vs Old! Epyc 64-Core CPU vs DEC VAX! I like to compare weird things on an even playing field, and so I installed NetBSD 10.1 on both a VAX 4000-705A and on an Epyc 8534P, 64-core CPU with 128GB of RAM. The 1993 VAX runs at 112MHz (pretty fast for a VAX!) Then I set them both to building the NetBSD source tree. Epyc on the top, VAX on the bottom. And this VAX is at least 50x as fast as the original VAX 11/780!

Dave W Plummer

21,590 views • 1 year ago

We’re incredibly excited to launch Open Benchmarks Grants, a new program committing $3M in grants to fund new open source benchmarks advancing agentic AI. We’re partnering up with Hugging Face, Together AI, Prime Intellect, Factory HQ, Harbor Framework, and PyTorch to support academic and open-source teams as they define and advance the AI frontier.

Snorkel AI

664,965 views • 4 months ago

Meigen MultiTalk @gradio demo is available on Hugging Face 🤗 Duplicate on L40S for personal and unlimited inference, enjoy! *Compatible with multi-GPU too 😉

Sylvain Filoni

18,055 views • 1 year ago

Our CUDA-native genome generation stack can now generate a whole synthetic genome in just under 20 minutes on eight H200 GPUs from lium.io. That's roughly 440x faster than the traditional CPU-bound workflows. This opens the door to population-scale synthetic genome generation for benchmarking, variant calling evaluation, and genomic AI.

Minos

15,421 views • 5 days ago

Trying out realtime video-to-depth on an iPhone 15 using Depth Anything V2.

Tim Field

21,835 views • 2 years ago

You can now use Qwen3-VL in Jan. Find the GGUF model on Hugging Face, click "Use this model" and select Jan, or copy the model link and paste it into Jan Hub. Thanks Qwen 🧡

👋 Jan

44,150 views • 7 months ago

First fully ML-framework-free 3D Gaussian Splatting implementation in LichtFeld Studio.
I’ve completed the migration of the full training pipeline to a custom CUDA-based tensor library. No PyTorch, no LibTorch, no autograd. Every gradient is implemented by hand, either through CUDA kernels or minimal abstractions on top. This makes it the first full training setup for 3D Gaussian Splatting with zero dependencies on existing ML frameworks.
It’s not just about independence, it's about control! We now manage every byte of GPU memory, which opens the door to tighter optimization and finer performance tuning. The framework footprint is minimal, without pulling in gigabytes of ML runtime code that was never designed for real-time or graphics-driven applications.
A few modules, such as the metrics and 3DGUT interfaces, are still being ported, and some operations are temporarily naïve, so performance is not yet on par with master. But this refactor lays the groundwork for:
- A fully self-contained binary
- Fine-grained memory optimization
- Easier experimentation without the weight of an ML stack
We’re getting close.

MrNeRF

50,487 views • 7 months ago

Y'all need to use VoiceInk for voice dictation. It's fast, 100% local, open source, compilable from the GitHub repo, and no subscription fee. Replaced Wisprflow with it and it's just as fast and accurate. 🔊 Sound on!

Ben Holmes

41,468 views • 8 months ago

For those who care. No power for more than 10 hours now. I finished my workout and came to have a warcoffee; this place runs on generators. Kyiv stands. Lives. Repairs its wounds.

Yaroslava

70,368 views • 8 months ago