`transformers` + `torchao` quantization + `torch.compile` for faster inference and lower memory usage 🔥 Demo of "meta-llama/Meta-Llama-3.1-8B-Instruct" with 4-bit weight-only quantization.
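A minimal sketch of the setup the post describes, assuming the string-style `TorchAoConfig("int4_weight_only", ...)` form that the `transformers` docs showed around the time of the post; newer releases may expect a torchao config object instead. The helper names `estimate_weight_gb` and `load_int4_compiled`, the `group_size=128` value, and the round 8e9 parameter count are illustrative assumptions, not taken from the demo. The estimate covers weights only and ignores quantization scales, activations, and the KV cache.

```python
def estimate_weight_gb(num_params: float, bits_per_weight: int) -> float:
    """Rough weight-only footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def load_int4_compiled(model_id: str = "meta-llama/Meta-Llama-3.1-8B-Instruct"):
    """Hypothetical helper: load the model with int4 weight-only quantization and compile it."""
    # Heavy imports stay local so the estimate above runs without a GPU stack.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

    # Assumption: string quant type plus group_size, as in older transformers docs.
    quant_config = TorchAoConfig("int4_weight_only", group_size=128)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        quantization_config=quant_config,
    )
    # A static KV cache keeps tensor shapes fixed, which helps torch.compile avoid recompiling.
    model.generation_config.cache_implementation = "static"
    model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
    return tokenizer, model


if __name__ == "__main__":
    params = 8e9  # assumed round figure for an "8B" model
    print(f"bf16 weights: ~{estimate_weight_gb(params, 16):.1f} GB")
    print(f"int4 weights: ~{estimate_weight_gb(params, 4):.1f} GB")
```

With the returned objects, generation would follow the usual pattern: tokenize a prompt, move the inputs to `model.device`, and call `model.generate(**inputs, max_new_tokens=...)`. The first few calls are typically slow while compilation warms up.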

Marc Sun

1,507 subscribers

24,515 views • 1 year ago • via X (Twitter)

Science, Technology & Education

Related videos

Mac Studio M2 Ultra is the best Apple Silicon ever (so far!) 🔥 75 t/s and 0 throttling 🔥 even after 20 mins of uninterrupted token generation with Apple MLX and the mlx-community/Meta-Llama-3.1-8B-Instruct-8bit model!

Ivan Fioravanti ᯅ

11,041 views • 1 year ago

Wild. Llama 4 Maverick just went Beast Mode 🔥 SambaNova partnered with AI at Meta to deliver the fastest Llama 4 inference on the market. Even the mighty Groq Inc can’t keep up! Let’s dive in ↓

Charly Wargnier

36,463 views • 1 year ago

smolagents is now available in ai-gradio: `pip install ai-gradio[smolagents]==0.2.1`, then simply do `import gradio as gr`, `import ai_gradio`, and `gr.load(name='smolagents:meta-llama/Llama-3.1-8B-Instruct', src=ai_gradio.registry).launch()`

AK

18,528 views • 1 year ago

We are thrilled to be a launch partner for Meta Llama 3. Experience Llama 3 now at up to 350 tokens per second for Llama 3 8B and up to 150 tokens per second for Llama 3 70B, running in full FP16 precision on the Together API! 🤯

Together AI

88,229 views • 2 years ago

Introducing DeepThought-8B: Transparent reasoning model built on LLaMA-3.1 with test-time compute scaling. - JSON-structured thought chains & controllable inference paths. - ~16GB VRAM, competitive w/ 70B models. - Open model weights and inference scripts.

ruliad

218,138 views • 1 year ago

Llama 3.3 70B is live on AkashChat. The latest state-of-the-art multilingual AI model released by Meta is as performant as Llama 3.1 405B at a fraction of the size. Try it today:

Akash Network

16,463 views • 1 year ago

Meta just announced that Code Llama is now free for both research and commercial use. This might be the strongest competitor to ChatGPT: ▸ Can generate, explain, and debug your code ▸ Handles inputs of up to 100,000 tokens ▸ Free for research + commercial use ▸ Outperforms most open models ▸ Comes in 7B, 13B, and 34B ▸ Supports Python, C++, Java, PHP, Typescript (Javascript), C#, and Bash. Available in: ▸ Foundation base models (Code Llama) ▸ Python specializations (Code Llama - Python) ▸ Instruction-following models (Code Llama - Instruct)

Lior Alexander

196,380 views • 2 years ago

.ollama is playing with AI at Meta Llama 4 Scout! 🤯 a perfect opportunity to test Ollama's giant super computer ✈️✈️✈️

ollama

113,130 views • 1 year ago

Llama 3.2 is the latest open-source AI model from Meta, released only a few hours ago. Here is the 3B parameter model running on Akash Chat at 165 tokens/second, powered by NVIDIA A100s on Akash. Try Llama 3.2 for free, no sign-in required:

Akash Network

37,087 views • 1 year ago

Microsoft just released a 1-bit LLM with 2B parameters that can run on CPUs like the Apple M2. BitNet b1.58 2B4T outperforms the full-precision LLaMA 3.2 1B while using only 0.4GB of memory versus 2GB, and it processes tokens 40% faster. 100% open source.

Shubham Saboo

260,049 views • 1 year ago

How much faster is the new MacBook Pro for AI inference? M4 Max is 27% faster with 72 tok/sec compared to 56 tok/sec of the M3 Max with MLX running Gemma 2 9B (4bit). The 27% speedup is the same with Llama-3.2-1b, Llama-3.2-3b and others. Next up: EXO Labs M4 cluster.

Alex Cheema - e/acc

527,894 views • 1 year ago

AGI at home: running DeepSeek R1 across my 7 M4 Pro Mac Minis and 1 M4 Max MacBook Pro. Total unified memory = 496GB. Uses EXO Labs distributed inference with 4-bit quantization. Next goal is fp8 (requires >700GB)

Alex Cheema

1,934,459 views • 1 year ago

Multimodal Meta AI is rolling out widely on Ray-Ban Meta starting today! It's a huge advancement for wearables & makes using AI more interactive & intuitive. Excited to share more on our multimodal work w/ Meta AI (& Llama 3), stay tuned for more updates coming soon.

Ahmad Al-Dahle

176,223 views • 2 years ago

The Meta Llama 3 Hackathon is this weekend in SF with @Cerebral_Valley! Get on the list ➡️ What to expect • Two days of building alongside the best hackers in AI • Hands-on support from the Llama team • Talks from some of the top names in the industry

AI at Meta

50,762 views • 2 years ago

AI agents have arrived on Bittensor. Be the first to launch one. It's fast and easy. TAO AGENTS - Smart contracts on Openτensor Foundaτion - Decentralized inference by Eternal AI - AI Model by AI at Meta (Llama 3.3) You can issue an instantly tradable token for your agent, too.

Eternal AI

36,552 views • 1 year ago

The first natively trained 1-bit model: BitNet 2B. Trained on 4 trillion tokens, it can run on CPUs like the Apple M2. Native 1.58-bit weights and 8-bit activations (W1.58A8). Outperforms LLaMA & comes close to Qwen 2.5 1.5B while using only 0.4GB memory versus 2GB and processes tokens 40%

Md Ismail Šojal 🕷️

43,617 views • 3 months ago

You can now run inference directly on the Llama 4 Hugging Face model page – powered by Together AI!

Together AI

21,489 views • 1 year ago

The easiest way to use this new model is through HuggingChat with the link below. Just create a free account and select the model “nvidia/Llama-3.1-Nemotron-70B-Instruct-HF”. And you're ready to start chatting!

Paul Couvert

81,620 views • 1 year ago

MLX Swift LLM example works with: - Mistral / Llama - Phi-2 - Qwen 1.5 - Starcoder 2 Quick-start: Qwen 1.5 0.5B runs pretty fast in 16-bit on my iPhone 14, no quantization needed:

Awni Hannun

30,441 views • 2 years ago

🚨| NEW: Speed was spotted in his suite wearing the brand-new Meta glasses 🔥🔥

Speedy HQ

31,933 views • 3 months ago