Apple FastVLM-7B: Efficient Vision Encoding for Vision Language Models. Larger variants using the Qwen2-7B LLM outperform recent works like Cambrian-1-8B while using a single image encoder with a 7.9x faster TTFT (time to first token). Vibe coding a video captioning app with it in anycoder.

AK

433,003 subscribers

60,588 views • 9 months ago • via X (Twitter)

Science & Technology Education

Related Videos

Apple released FastVLM, so I tried vibe coding a video captioning AI app with it. It took 5 prompts to get a working app in anycoder, and I deployed it on Hugging Face. FastVLM is 85x faster and 3.4x smaller than comparable sized VLMs. The deployed app works 100% locally in your browser, powered by transformers.js and WebGPU.

AK

42,677 views • 9 months ago

Apple just released and open-sourced FastVLM! FastVLM is a lightning-fast vision-language model that combines rapid image and text understanding with efficient on-device performance. 100% Open Source

Sumanth

43,685 views • 9 months ago

Day 88 of vibe coding my AI video editor with SwiftUI. It's insane how Apple provides everything, like removing the background of a video without using a single token.

Meng To

14,152 views • 26 days ago

Structured Output from Multipage PDF with Sparrow (Qwen2 Vision LLM and MLX). I explain how multipage PDFs are handled in Sparrow to extract structured data in a single call.

Andrej Baranovskij

30,645 views • 1 year ago

Introducing Meta Perception Encoder: a vision encoder setting new standards in image & video tasks. It excels in zero-shot classification & retrieval, surpassing existing models. Learn more about Meta Perception Encoder, read the research paper, and download the code and dataset

AI at Meta

74,531 views • 1 year ago

Inference on your head: Mistral 7B (4-bit quantized) running locally on Apple Vision Pro

Joseph Semrai

239,718 views • 2 years ago

Instant AutoGPT Launcher. I wrote a launcher for AutoGPT today. Play with LLM agents using a ComfyUI-like node-based interface. It even works with a local LLM via ollama (shown in the video). Works on all platforms (Windows, Linux, Mac). Install with 1 click. Use with 0 clicks.

cocktail peanut

11,405 views • 1 year ago

Planning with Reasoning using Vision Language World Model

AK

26,274 views • 9 months ago

Reasoning on Apple Vision Pro with Apple MLX and DeepSeek R1 Qwen 7B 4bit! 🔥 14 tokens per sec! 🔥 Note: sorry for the shaking footage, I was excited to see it running 😂

Ivan Fioravanti ᯅ

24,324 views • 1 year ago

#M5StackNew 🎊 The LLM630 Compute Kit is an #AI large language model (#LLM) inference development kit. Powered by the #Axera #AX630C SoC with a 3.2 TOPS NPU, it delivers efficient AI inference for tasks like computer vision (CV) and LLM processing.

M5Stack

16,383 views • 1 year ago

Why pay $3500 for the Apple Vision Pro? Control a robot with your hands using the phospho dev kit

Pierre-Louis Biojout (PLB)

39,794 views • 1 year ago

NEW VIDEO - What Using Apple Vision Pro is Actually Like! Full 35 minutes in-depth video:

Marques Brownlee

2,563,944 views • 2 years ago

Drawing on flat surfaces with Logitech Muse on Apple Vision Pro. And yes, it has haptic feedback whenever you start drawing or touching a window (using the TouchDesk app).

Brad Lynch

16,712 views • 7 months ago

SpatialGen just announced Zeus, a dedicated hardware system for live Apple Immersive Video streaming. It supports live 16K immersive video encoding with 90 FPS streaming. This is a huge step toward more live immersive experiences on Apple Vision Pro.

Spatial Insider

13,487 views • 1 month ago

I just built a RAG Agent with web search using Cohere's ⌘R 7B model. 100% Opensource Code with step-by-step tutorial.

Shubham Saboo

47,032 views • 1 year ago

MiniCPM-V 4.5 achieves an average score of 77.0 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks. With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-latest and Gemini-2.0 Pro, and strong open-source models like Qwen2.5-VL 72B. Powered by a new unified 3D-Resampler over images and videos, MiniCPM-V 4.5 can now achieve a 96x compression rate for video tokens, where six 448x448 video frames can be jointly compressed into 64 video tokens (normally 1,536 tokens for most MLLMs). Vibe coding a Video Chat AI app with MiniCPM-V-4.5 in anycoder.

AK

19,012 views • 9 months ago

I built a multimodal AI Coding Agent team with multi-agents. It has 3 AI agents working together as a team to generate and execute the code: 1. Coding Agent using o-3 mini, 2. Vision Agent using Gemini, 3. Code Execution Agent using o-3 mini and E2B. 100% Opensource Code.

Shubham Saboo

42,269 views • 1 year ago

Designing an Encoder for Fast Personalization of Text-to-Image Models. TL;DR: use an encoder to personalize a text-to-image model to new concepts with a single image and 5-15 tuning steps. abs: project page:

AK

165,158 views • 3 years ago

I turned my living room into a full-on Blockbuster with Apple Vision Pro! Posters. Shelves. Trailers. The works. The app is called ReelRoom and I love it! If you miss the Friday night video store vibe, this brings it all back.

Justin Ryan ᯅ

52,920 views • 1 year ago