`transformers` + `torchao` quantization + `torch.compile` for faster inference and lower memory usage 🔥 Demo of "meta-llama/Meta-Llama-3.1-8B-Instruct" with 4-bit weight-only quantization:

Marc Sun

1,507 subscribers

24,515 views • 1 year ago • via X (Twitter)

Science & Technology Education
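
To make "4-bit weight-only" concrete, here is a minimal sketch of group-wise 4-bit quantization in plain Python. It does not call `transformers`, `torchao`, or `torch.compile`, and it is not the code from the demo. The function names, the group size of 8, and the nominal 8 billion parameter count are assumptions chosen for illustration only.

```python
import random

GROUP_SIZE = 8  # illustrative value, not taken from the post


def quantize_int4(weights, group_size=GROUP_SIZE):
    """Map floats to 4-bit codes (0..15) with one scale and minimum per group."""
    codes, scales, mins = [], [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / 15 or 1.0  # constant group: any non-zero scale works
        scales.append(scale)
        mins.append(lo)
        codes.extend(min(15, max(0, round((w - lo) / scale))) for w in group)
    return codes, scales, mins


def dequantize_int4(codes, scales, mins, group_size=GROUP_SIZE):
    """Rebuild approximate floats; activations stay in full precision (weight-only)."""
    return [mins[i // group_size] + c * scales[i // group_size]
            for i, c in enumerate(codes)]


def pack_int4(codes):
    """Store two 4-bit codes per byte, padding with a zero code if needed."""
    padded = codes + [0] * (len(codes) % 2)
    return bytes((padded[i] << 4) | padded[i + 1] for i in range(0, len(padded), 2))


def weight_bytes(n_params, bits):
    """Bytes needed for the weights alone, ignoring scales and other overhead."""
    return n_params * bits // 8


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(64)]
codes, scales, mins = quantize_int4(weights)
restored = dequantize_int4(codes, scales, mins)
packed = pack_int4(codes)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(len(packed), "bytes hold", len(weights), "quantized weights")
print("largest reconstruction error:", max_error)
print("16-bit weights:", weight_bytes(8_000_000_000, 16) / 1e9, "GB")
print("4-bit weights: ", weight_bytes(8_000_000_000, 4) / 1e9, "GB")
```

With a nominal 8 billion parameters, the weights alone shrink from about 16 GB at 16-bit to about 4 GB at 4-bit, before counting the per-group scales and minimums, activations, and the KV cache. The reconstruction error per weight is bounded by half of that group's scale, which is why smaller groups trade extra metadata for better accuracy.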

Related Videos

Mac Studio M2 Ultra is best Apple Silicon ever (so far!) 🔥75 t/s and 0 throttling 🔥 Even after 20 mins of uninterrupted token generation with Apple MLX and mlx-community/Meta-Llama-3.1-8B-Instruct-8bit model!

Ivan Fioravanti ᯅ

11,041 views • 1 year ago

Wild. Llama 4 Maverick just went Beast Mode 🔥 SambaNova partnered with AI at Meta to deliver the fastest Llama 4 inference on the market. Even the mighty Groq Inc can’t keep up! Let’s dive in ↓

Charly Wargnier

36,463 views • 1 year ago

smolagents is now available in ai-gradio pip install ai-gradio[smolagents]==0.2.1 then simply do import gradio as gr import ai_gradio gr.load( name='smolagents:meta-llama/Llama-3.1-8B-Instruct', src=ai_gradio.registry).launch()

AK

18,539 views • 1 year ago

We are thrilled to be a launch partner for Meta Llama 3. Experience Llama 3 now at up to 350 tokens per second for Llama 3 8B and up to 150 tokens per second for Llama 3 70B, running in full FP16 precision on the Together API! 🤯

Together AI

88,229 views • 2 years ago

Introducing DeepThought-8B: Transparent reasoning model built on LLaMA-3.1 with test-time compute scaling. - JSON-structured thought chains & controllable inference paths. - ~16GB VRAM, competitive w/ 70B models. - Open model weights, and inference scripts.

Ruliad

219,315 views • 1 year ago

Llama 3.3 70B is live on AkashChat. The latest state-of-the-art multilingual AI model released by Meta is as performant as Llama 3.1 405B at a fraction of the size. Try it today:

Akash Network

16,463 views • 1 year ago

pip install spectralquant ✂️ Up to 6.62x KV cache compression for LLMs and transformers. Same model. Faster outputs. Smaller KV cache. Try now (2 mins): - KV cache integration via Hugging Face's DynamicCache - Three presets: 5.95x (paper), 6.55x (validated), 6.68x (edge) - Mistral 7B / Qwen 2.5 7B / Llama 3.1 8B verified - Pure PyTorch + future CUDA kernel support - Auto-calibration from a bundled corpus 📰 Paper: 💻 Code, quickstart, and benchmarks: #LLM #Inference #PyTorch #OpenSource #MachineLearning #LLM #KVCache #Inference

ani

16,583 views • 24 days ago

Meta just announced that Code Llama is now free for both research and commercial use. This might be the strongest competitor to ChatGPT: ▸ Can generate, explain, and debug your code ▸ Handles inputs of up to 100,000 tokens ▸ Free for research + commercial use ▸ Outperforms most open models ▸ Comes in 7B, 13B, and 34B ▸ Supports Python, C++, Java, PHP, Typescript (Javascript), C#, and Bash Available in: ▸ Foundation base models (Code Llama) ▸ Python specializations (Code Llama - Python) ▸ Instruction-following models (Code Llama - Instruct)

Lior Alexander

196,384 views • 2 years ago

.ollama is playing with AI at Meta Llama 4 Scout! 🤯 a perfect opportunity to test Ollama's giant super computer ✈️✈️✈️

ollama

113,514 views • 1 year ago

Llama 3.2 is the latest open-source AI model from Meta, released only a few hours ago. Here is the 3B parameter model running on Akash Chat at 165 tokens/second, powered by NVIDIA A100s on Akash. Try Llama 3.2 for free, no sign-in required:

Akash Network

37,087 views • 1 year ago

Microsoft just released a 1-bit LLM with 2B parameters that can run on CPUs like Apple M2. BitNet b1.58 2B4T outperforms fp LLaMA 3.2 1B while using only 0.4GB memory versus 2GB and processes tokens 40% faster. 100% open source.

Shubham Saboo

260,049 views • 1 year ago

How much faster is the new MacBook Pro for AI inference? M4 Max is 27% faster with 72 tok/sec compared to 56 tok/sec of the M3 Max with MLX running Gemma 2 9B (4bit). The 27% speedup is the same with Llama-3.2-1b, Llama-3.2-3b and others. Next up: EXO Labs M4 cluster.

Alex Cheema - e/acc

527,894 views • 1 year ago

Multimodal Meta AI is rolling out widely on Ray-Ban Meta starting today! It's a huge advancement for wearables & makes using AI more interactive & intuitive. Excited to share more on our multimodal work w/ Meta AI (& Llama 3), stay tuned for more updates coming soon.

Ahmad Al-Dahle

176,223 views • 2 years ago

AGI at home Running DeepSeek R1 across my 7 M4 Pro Mac Minis and 1 M4 Max MacBook Pro. Total unified memory = 496GB. Uses EXO Labs distributed inference with 4-bit quantization. Next goal is fp8 (requires >700GB)

Alex Cheema

1,934,687 views • 1 year ago

The Meta Llama 3 Hackathon is this weekend in SF with @Cerebral_Valley! Get on the list ➡️ What to expect • Two days of building alongside the best hackers in AI • Hands on support from the Llama team • Talks from some of the top names in the industry

AI at Meta

50,762 views • 2 years ago

AI agents have arrived on Bittensor. Be the first to launch one. It's fast and easy. TAO AGENTS - Smart contracts on Openτensor Foundaτion - Decentralized inference by Eternal AI - AI Model by AI at Meta (Llama 3.3) You can issue an instantly tradable token for your agent, too.

Eternal AI

36,552 views • 1 year ago

The first natively trained 1-bit model: BitNet 2B. Trained on 4 trillion tokens, it can run on CPUs like Apple M2. Native 1.58-bit weights and 8-bit activations (W158A8). Outperforms LLaMA & close to Qwen 2.5 1.5B while using only 0.4GB memory versus 2GB and processes tokens 40%

Md Ismail Šojal 🕷️

43,647 views • 3 months ago

The easiest way to use this new model is through HuggingChat with the link below. Just create a free account and select the model “nvidia/Llama-3.1-Nemotron-70B-Instruct-HF”. And you're ready to start chatting!

Paul Couvert

81,620 views • 1 year ago

You can now run inference directly on the Llama 4 Hugging Face model page – powered by Together AI!

Together AI

21,489 views • 1 year ago

MLX Swift LLM example works with: - Mistral / Llama - Phi-2 - Qwen 1.5 - Starcoder 2 Quick-start: Qwen 1.5 0.5B runs pretty fast in 16-bit on my iPhone 14, no quantization needed:

Awni Hannun

30,441 views • 2 years ago