`transformers` + `torchao` quantization + `torch.compile` for faster inference and lower memory usage 🔥 Demo of "meta-llama/Meta-Llama-3.1-8B-Instruct" quantized to 4-bit weight-only.
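To make the memory claim concrete, here is a minimal sketch of the arithmetic behind 4-bit weight-only quantization, followed by a hedged loading recipe. Several things here are assumptions not stated in the post: the parameter count is taken as a nominal 8e9, the estimate ignores the per-group scales and zero points that quantization adds, and the helper names `weight_memory_gb` and `load_int4_compiled` are hypothetical. The `TorchAoConfig("int4_weight_only", group_size=128)` string form follows the `transformers` documentation from around the time of the post; newer releases may expect a `torchao` config object instead, so check it against your installed version.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: "8B" is taken at face value; the real checkpoint is slightly larger.
NOMINAL_PARAMS = 8e9

bf16_gb = weight_memory_gb(NOMINAL_PARAMS, 16)  # 16-bit baseline
int4_gb = weight_memory_gb(NOMINAL_PARAMS, 4)   # 4-bit weight-only


def load_int4_compiled(
    model_id: str = "meta-llama/Meta-Llama-3.1-8B-Instruct",
    group_size: int = 128,  # assumption: a commonly used group size
):
    """Hypothetical helper: load the model with torchao int4 weight-only
    quantization and compile its forward pass. Requires a CUDA GPU, access
    to the gated checkpoint, and the transformers + torchao packages."""
    # Imported lazily so the arithmetic above runs without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

    quant_config = TorchAoConfig("int4_weight_only", group_size=group_size)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        quantization_config=quant_config,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # A static KV cache keeps tensor shapes fixed, which helps torch.compile.
    model.generation_config.cache_implementation = "static"
    model.forward = torch.compile(
        model.forward, mode="reduce-overhead", fullgraph=True
    )
    return model, tokenizer


if __name__ == "__main__":
    print(f"bf16 weights: ~{bf16_gb:.1f} GB")
    print(f"int4 weights: ~{int4_gb:.1f} GB")
```

Under these assumptions the weights shrink by roughly a factor of four. Actual usage is somewhat higher because of quantization metadata, activations, and the KV cache, and the first few generations after compilation are slow while `torch.compile` warms up.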

Marc Sun

1,507 subscribers

24,515 views • 1 year ago • via X (Twitter)

Science and Technology, Education

Related videos

Mac Studio M2 Ultra is best Apple Silicon ever (so far!) 🔥75 t/s and 0 throttling 🔥 Even after 20 mins of uninterrupted token generation with Apple MLX and mlx-community/Meta-Llama-3.1-8B-Instruct-8bit model!

Ivan Fioravanti ᯅ

11,041 views • 1 year ago

Wild. Llama 4 Maverick just went Beast Mode 🔥 SambaNova partnered with AI at Meta to deliver the fastest Llama 4 inference on the market. Even the mighty Groq Inc can’t keep up! Let’s dive in ↓

Charly Wargnier

36,463 views • 1 year ago

smolagents is now available in ai-gradio: `pip install ai-gradio[smolagents]==0.2.1`, then simply do `import gradio as gr`, `import ai_gradio`, and `gr.load(name='smolagents:meta-llama/Llama-3.1-8B-Instruct', src=ai_gradio.registry).launch()`

AK

18,539 views • 1 year ago

We are thrilled to be a launch partner for Meta Llama 3. Experience Llama 3 now at up to 350 tokens per second for Llama 3 8B and up to 150 tokens per second for Llama 3 70B, running in full FP16 precision on the Together API! 🤯

Together AI

88,229 views • 2 years ago

Introducing DeepThought-8B: Transparent reasoning model built on LLaMA-3.1 with test-time compute scaling. - JSON-structured thought chains & controllable inference paths. - ~16GB VRAM, competitive w/ 70B models. - Open model weights, and inference scripts.

Ruliad

219,315 views • 1 year ago

Llama 3.3 70B is live on AkashChat. The latest state-of-the-art multilingual AI model released by Meta is as performant as Llama 3.1 405B at a fraction of the size. Try it today:

Akash Network

16,463 views • 1 year ago

pip install spectralquant ✂️ Up to 6.62x KV cache compression for LLMs and transformers. Same model. Faster outputs. Smaller KV cache. Try now (2 mins): - KV cache integration via Hugging Face's DynamicCache - Three presets: 5.95x (paper), 6.55x (validated), 6.68x (edge) - Mistral 7B / Qwen 2.5 7B / Llama 3.1 8B verified - Pure PyTorch + future CUDA kernel support - Auto-calibration from a bundled corpus 📰 Paper: 💻 Code, quickstart, and benchmarks: #LLM #Inference #PyTorch #OpenSource #MachineLearning #LLM #KVCache #Inference

ani

16,560 views • 21 days ago

Meta just announced that Code Llama is now free for both research and commercial use. This might be the strongest competitor to ChatGPT: ▸ Can generate, explain, and debug your code ▸ Handles inputs of 100,000 tokens ▸ Free for research + commercial use ▸ Outperforms most open models ▸ Comes in 7B, 13B, and 34B ▸ Supports Python, C++, Java, PHP, Typescript (Javascript), C#, and Bash Available in: ▸ Foundation base models (Code Llama) ▸ Python specializations (Code Llama - Python) ▸ Instruction-following models (Code Llama - Instruct)

Lior Alexander

196,384 views • 2 years ago

.ollama is playing with AI at Meta Llama 4 Scout! 🤯 a perfect opportunity to test Ollama's giant super computer ✈️✈️✈️

ollama

113,514 views • 1 year ago

Llama 3.2 is the latest open-source AI model from Meta, released only a few hours ago. Here is the 3B parameter model running on Akash Chat at 165 tokens/second, powered by NVIDIA A100s on Akash. Try Llama 3.2 for free, no sign-in required:

Akash Network

37,087 views • 1 year ago

Microsoft just released a 1-bit LLM with 2B parameters that can run on CPUs like Apple M2. BitNet b1.58 2B4T outperforms fp LLaMA 3.2 1B while using only 0.4GB memory versus 2GB and processes tokens 40% faster. 100% open source.

Shubham Saboo

260,049 views • 1 year ago

How much faster is the new MacBook Pro for AI inference? M4 Max is 27% faster with 72 tok/sec compared to 56 tok/sec of the M3 Max with MLX running Gemma 2 9B (4bit). The 27% speedup is the same with Llama-3.2-1b, Llama-3.2-3b and others. Next up: EXO Labs M4 cluster.

Alex Cheema - e/acc

527,894 views • 1 year ago

Multimodal Meta AI is rolling out widely on Ray-Ban Meta starting today! It's a huge advancement for wearables & makes using AI more interactive & intuitive. Excited to share more on our multimodal work w/ Meta AI (& Llama 3), stay tuned for more updates coming soon.

Ahmad Al-Dahle

176,223 views • 2 years ago

AGI at home Running DeepSeek R1 across my 7 M4 Pro Mac Minis and 1 M4 Max MacBook Pro. Total unified memory = 496GB. Uses EXO Labs distributed inference with 4-bit quantization. Next goal is fp8 (requires >700GB)

Alex Cheema

1,934,687 views • 1 year ago

The Meta Llama 3 Hackathon is this weekend in SF with @Cerebral_Valley! Get on the list ➡️ What to expect • Two days of building alongside the best hackers in AI • Hands on support from the Llama team • Talks from some of the top names in the industry

AI at Meta

50,762 views • 2 years ago

AI agents have arrived on Bittensor. Be the first to launch one. It's fast and easy. TAO AGENTS - Smart contracts on Openτensor Foundaτion - Decentralized inference by Eternal AI - AI Model by AI at Meta (Llama 3.3) You can issue an instantly tradable token for your agent, too.

Eternal AI

36,552 views • 1 year ago

The first natively trained 1-bit model: BitNet 2B. Trained on 4 trillion tokens. Can run on CPUs like Apple M2. Native 1.58-bit weights and 8-bit activations (W158A8). Outperforms LLaMA & close to Qwen 2.5 1.5B while using only 0.4GB memory versus 2GB and processes tokens 40%

Md Ismail Šojal 🕷️

43,647 views • 3 months ago

You can now run inference directly on the Llama 4 Hugging Face model page – powered by Together AI!

Together AI

21,489 views • 1 year ago

The easiest way to use this new model is through HuggingChat with the link below. Just create a free account and select the model “nvidia/Llama-3.1-Nemotron-70B-Instruct-HF”. And you're ready to start chatting!

Paul Couvert

81,620 views • 1 year ago

MLX Swift LLM example works with: - Mistral / Llama - Phi-2 - Qwen 1.5 - Starcoder 2 Quick-start: Qwen 1.5 0.5B runs pretty fast in 16-bit on my iPhone 14, no quantization needed:

Awni Hannun

30,441 views • 2 years ago