`transformers` + `torchao` quantization + `torch.compile` for faster inference and lower memory usage 🔥 Demo of "meta-llama/Meta-Llama-3.1-8B-Instruct" with 4-bit weight-only quantization:

Marc Sun

1,507 subscribers

24,515 views • 1 year ago • via X (Twitter)

Science & Technology, Education
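
To make the "4-bit weight-only" idea in the post above concrete, here is a minimal pure-Python sketch of group-wise 4-bit quantization and the weight memory it implies. It is an illustration under stated assumptions, not `torchao`'s actual kernels or packing layout: the function names, the asymmetric scheme with one scale and one minimum per group, the group size of 128, the 32 bits of metadata per group, and the rounded count of 8 billion parameters are all hypothetical choices made for this example.

```python
# Toy sketch of group-wise 4-bit weight-only quantization (pure Python).
# Assumption: asymmetric quantization, one scale and one minimum per group.
# This is NOT torchao's implementation; all names here are illustrative.


def quantize_group(weights):
    """Map a group of floats to integer codes in [0, 15] plus (scale, minimum)."""
    lo, hi = min(weights), max(weights)
    # 4 bits give 16 levels, so 15 steps span the range; fall back to 1.0
    # when every weight in the group is identical (range of zero).
    scale = (hi - lo) / 15 or 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Recover approximate float weights from 4-bit codes."""
    return [lo + c * scale for c in codes]


def weight_gib(n_params, bits, group_size=None, meta_bits=32):
    """Approximate weight storage in GiB.

    meta_bits is the assumed per-group overhead (scale + minimum).
    Activations and the KV cache are not counted: only weights are quantized.
    """
    total_bits = n_params * bits
    if group_size:
        total_bits += (n_params / group_size) * meta_bits
    return total_bits / 8 / 2**30


if __name__ == "__main__":
    group = [-1.0, -0.5, 0.0, 0.25, 1.0]
    codes, scale, lo = quantize_group(group)
    print("codes:", codes)
    print("restored:", dequantize_group(codes, scale, lo))

    n = 8e9  # rounded parameter count, an assumption for this estimate
    print("16-bit weights (GiB):", weight_gib(n, 16))
    print("4-bit weights (GiB):", weight_gib(n, 4, group_size=128))
```

Under these assumptions the weights shrink from roughly 15 GiB at 16 bits to roughly 4 GiB at 4 bits, and each restored weight is off by at most half a quantization step. "Weight-only" means activations stay in higher precision, so the saving applies to the stored weights rather than to every tensor used during inference.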

Related videos

Mac Studio M2 Ultra is best Apple Silicon ever (so far!) 🔥75 t/s and 0 throttling 🔥 Even after 20 mins of uninterrupted token generation with Apple MLX and mlx-community/Meta-Llama-3.1-8B-Instruct-8bit model!

Ivan Fioravanti ᯅ

11,041 views • 1 year ago

Wild. Llama 4 Maverick just went Beast Mode 🔥 SambaNova partnered with AI at Meta to deliver the fastest Llama 4 inference on the market. Even the mighty Groq Inc can’t keep up! Let’s dive in ↓

Charly Wargnier

36,463 views • 1 year ago

smolagents is now available in ai-gradio: `pip install ai-gradio[smolagents]==0.2.1`, then simply do `import gradio as gr`, `import ai_gradio`, `gr.load(name='smolagents:meta-llama/Llama-3.1-8B-Instruct', src=ai_gradio.registry).launch()`

AK

18,539 views • 1 year ago

We are thrilled to be a launch partner for Meta Llama 3. Experience Llama 3 now at up to 350 tokens per second for Llama 3 8B and up to 150 tokens per second for Llama 3 70B, running in full FP16 precision on the Together API! 🤯

Together AI

88,229 views • 2 years ago

Introducing DeepThought-8B: Transparent reasoning model built on LLaMA-3.1 with test-time compute scaling. - JSON-structured thought chains & controllable inference paths. - ~16GB VRAM, competitive w/ 70B models. - Open model weights, and inference scripts.

Ruliad

219,315 views • 1 year ago

Llama 3.3 70B is live on AkashChat. The latest state-of-the-art multilingual AI model released by Meta is as performant as Llama 3.1 405B at a fraction of the size. Try it today:

Akash Network

16,463 views • 1 year ago

Meta just announced that Code Llama is now free for both research and commercial use. This might be the strongest competitor to ChatGPT: ▸ Can generate, explain, and debug your code ▸ Handles inputs of up to 100,000 tokens ▸ Free for research + commercial use ▸ Outperforms most open models ▸ Comes in 7B, 13B, and 34B ▸ Supports Python, C++, Java, PHP, Typescript (Javascript), C#, and Bash Available in: ▸ Foundation base models (Code Llama) ▸ Python specializations (Code Llama - Python) ▸ Instruction-following models (Code Llama - Instruct)

Lior Alexander

196,384 views • 2 years ago

.ollama is playing with AI at Meta Llama 4 Scout! 🤯 a perfect opportunity to test Ollama's giant super computer ✈️✈️✈️

ollama

113,514 views • 1 year ago

Llama 3.2 is the latest open-source AI model from Meta, released only a few hours ago. Here is the 3B parameter model running on Akash Chat at 165 tokens/second, powered by NVIDIA A100s on Akash. Try Llama 3.2 for free, no sign-in required:

Akash Network

37,087 views • 1 year ago

Microsoft just released a 1-bit LLM with 2B parameters that can run on CPUs like Apple M2. BitNet b1.58 2B4T outperforms fp LLaMA 3.2 1B while using only 0.4GB memory versus 2GB and processes tokens 40% faster. 100% open source.

Shubham Saboo

260,049 views • 1 year ago

How much faster is the new MacBook Pro for AI inference? M4 Max is 27% faster with 72 tok/sec compared to 56 tok/sec of the M3 Max with MLX running Gemma 2 9B (4bit). The 27% speedup is the same with Llama-3.2-1b, Llama-3.2-3b and others. Next up: EXO Labs M4 cluster.

Alex Cheema - e/acc

527,894 views • 1 year ago

Multimodal Meta AI is rolling out widely on Ray-Ban Meta starting today! It's a huge advancement for wearables & makes using AI more interactive & intuitive. Excited to share more on our multimodal work w/ Meta AI (& Llama 3), stay tuned for more updates coming soon.

Ahmad Al-Dahle

176,223 views • 2 years ago

AGI at home Running DeepSeek R1 across my 7 M4 Pro Mac Minis and 1 M4 Max MacBook Pro. Total unified memory = 496GB. Uses EXO Labs distributed inference with 4-bit quantization. Next goal is fp8 (requires >700GB)

Alex Cheema

1,934,582 views • 1 year ago

The Meta Llama 3 Hackathon is this weekend in SF with @Cerebral_Valley! Get on the list ➡️ What to expect • Two days of building alongside the best hackers in AI • Hands on support from the Llama team • Talks from some of the top names in the industry

AI at Meta

50,762 views • 2 years ago

AI agents have arrived on Bittensor. Be the first to launch one. It's fast and easy. TAO AGENTS - Smart contracts on Openτensor Foundaτion - Decentralized inference by Eternal AI - AI Model by AI at Meta (Llama 3.3) You can issue an instantly tradable token for your agent, too.

Eternal AI

36,552 views • 1 year ago

The first natively trained 1-bit model: BitNet 2B. Trained on 4 trillion tokens. Can run on CPUs like Apple M2. Native 1.58-bit weights and 8-bit activations (W1.58A8). Outperforms LLaMA & close to Qwen 2.5 1.5B while using only 0.4GB memory versus 2GB and processes tokens 40%

Md Ismail Šojal 🕷️

43,617 views • 3 months ago

You can now run inference directly on the Llama 4 Hugging Face model page – powered by Together AI!

Together AI

21,489 views • 1 year ago

The easiest way to use this new model is through HuggingChat with the link below. Just create a free account and select the model “nvidia/Llama-3.1-Nemotron-70B-Instruct-HF”. And you're ready to start chatting!

Paul Couvert

81,620 views • 1 year ago

MLX Swift LLM example works with: - Mistral / Llama - Phi-2 - Qwen 1.5 - Starcoder 2 Quick-start: Qwen 1.5 0.5B runs pretty fast in 16-bit on my iPhone 14, no quantization needed:

Awni Hannun

30,441 views • 2 years ago

🚨| NEW: Speed was spotted in his suite wearing the brand-new Meta glasses 🔥🔥

Speedy HQ

31,933 views • 4 months ago