Can a VLM see without a vision encoder? We trained one for $100, inspired by Gemma 4 12B.
Latency on an M3 Pro MacBook:
- 112 ms -> 1.1 ms for the image path
- 30% lower end-to-end image+LLM
The architecture is just: patchify the image -> linear projection with pos...

Andi Marafioti

7,396 subscribers

58,819 views • 5 days ago • via X (Twitter)

Science & Technology
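
The "patchify the image -> linear projection" step named in the post can be sketched roughly as below. This is a minimal illustration, not the author's implementation: the post is cut off after "pos...", so only the patchify and projection steps are shown, and the patch size, hidden width, random weights, and the function names `patchify` and `embed_patches` are all hypothetical.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C).
    """
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be multiples of the patch size")
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, C) -> (gh, gw, patch, patch, C) -> (gh * gw, patch * patch * C)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)


def embed_patches(image: np.ndarray, weight: np.ndarray, bias: np.ndarray, patch: int) -> np.ndarray:
    """Project each flattened patch into the LLM input space with one matmul."""
    return patchify(image, patch) @ weight + bias


# Hypothetical sizes, chosen only to keep the example small.
PATCH, HIDDEN = 16, 64
rng = np.random.default_rng(0)
image = rng.random((64, 96, 3), dtype=np.float32)  # stand-in for a real image
weight = (rng.standard_normal((PATCH * PATCH * 3, HIDDEN)) * 0.02).astype(np.float32)
bias = np.zeros(HIDDEN, dtype=np.float32)

tokens = embed_patches(image, weight, bias, PATCH)
print(tokens.shape)  # (24, 64): a 4 x 6 grid of patches, one embedding each
```

In a trained model the weight matrix would be learned together with the language model; here it is random, so only the shapes are meaningful.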

Related Videos

i just ran Google's brand new Unsloth Gemma 4 12B dense GGUF on my RTX 4060 using llama.cpp + CUDA 13.2
21 tokens per second. on a budget consumer GPU. locally. no API. no cloud. no subscription.
and the benchmarks are absolutely cooked
# first let's talk architecture, because this is genuinely different
every multimodal model you've used has a frozen vision encoder + frozen audio encoder + LLM backbone glued together
Gemma 4 12B is different: it's a single decoder-only transformer. that's it.
vision? raw 48×48 pixel patches → one matmul → projected directly into the LLM
audio? raw 16kHz signal sliced into 40ms frames → linear projection → same LLM input space
no encoder tax. no latency penalty. no fragmented memory
to put the encoder savings in perspective:
old Gemma 4 26B approach:
- 550M param vision encoder (frozen)
- 300M param audio encoder (frozen)
- LLM backbone
Gemma 4 12B:
- 35M param vision embedder (a single matmul)
- no audio encoder at all
- LLM backbone handles EVERYTHING
550M → 35M for vision alone. that's roughly a 15x reduction
this is why gemma-4-12b-it-Q4_K_M.gguf is just 6.6 GB, and it has 256K native context
# Benchmarks:
- AIME 2026 (math olympiad): 77.5%
- GPQA Diamond (expert science): 78.8%
- LiveCodeBench v6 (real code): 72%
- Codeforces ELO: 1659
- MMLU Pro: 77.2%
- MATH-Vision: 79.7%
- BigBench Extra Hard: 53%
inference → llama.cpp, LM Studio, vLLM, SGLang
llama.cpp flags: -m "gemma-4-12b-it-Q4_K_M.gguf" -ngl 99 -c 8000 -v --port 8080
Available on huggingface now! Link below

Alok

277,107 views • 20 days ago
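
The audio path described in the post above (a raw 16 kHz signal sliced into 40 ms frames, then a linear projection) works out to 640 samples per frame. The sketch below illustrates only that arithmetic and the shapes involved. It assumes non-overlapping frames and drops any trailing partial frame, neither of which the post specifies, and the hidden width, random weights, and the name `frame_audio` are hypothetical.

```python
import numpy as np

SAMPLE_RATE = 16_000  # Hz, as stated in the post
FRAME_MS = 40         # frame duration in milliseconds, as stated in the post
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000  # 640 samples per frame


def frame_audio(signal: np.ndarray, frame_len: int = FRAME_LEN) -> np.ndarray:
    """Slice a 1-D signal into non-overlapping frames of frame_len samples.

    Assumption: a trailing partial frame is dropped rather than padded.
    """
    n_frames = signal.shape[0] // frame_len
    return signal[: n_frames * frame_len].reshape(n_frames, frame_len)


HIDDEN = 64  # hypothetical embedding width, kept small for the example
rng = np.random.default_rng(0)
audio = rng.standard_normal(SAMPLE_RATE + 100).astype(np.float32)  # just over 1 s of noise
audio_weight = (rng.standard_normal((FRAME_LEN, HIDDEN)) * 0.02).astype(np.float32)

frames = frame_audio(audio)
audio_tokens = frames @ audio_weight  # one matmul maps each frame to one embedding
print(frames.shape, audio_tokens.shape)  # (25, 640) (25, 64)
```

At this frame length one second of audio becomes 25 input positions, which is the kind of sequence-length cost the post's "no encoder tax" framing trades against.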

OpenAI Image generation is now available on TypingMind I can finally create ghibli images on my own app now 😂 Edit images just by talking to the LLM (any LLM, not just OpenAI models) Enable the "GPT Image Editor" plugin to enjoy!

Tony Dinh

14,007 views • 1 year ago

Snap presents MoA Mixture-of-Attention for Subject-Context Disentanglement in Personalized Image Generation We introduce a new architecture for personalization of text-to-image diffusion models, coined Mixture-of-Attention (MoA). Inspired by the Mixture-of-Experts

AK

47,384 views • 2 years ago

P-Image-Upscale is the fastest and cheapest image upscaler in the world, supporting outputs up to 128 MP in under 1 second.
A couple of weeks ago we released p-image-upscale, and it now works better than ever. It’s the fastest image upscaler in the world, supporting outputs up to 128 MP while keeping pricing simple and predictable:
- $0.005/image for 1–4 MP
- $0.01/image for 4–8 MP
- $0.02/image for 8–16 MP
- $0.04/image for 16–32 MP
- $0.06/image for 32–64 MP
- $0.12/image for 64–128 MP
That means you can go from low-res input to production-ready output with extreme speed, while preserving detail and keeping costs easy to understand.
Available on: Pruna AI | each::labs | inference.sh | Replicate | Runware | Segmind | WaveSpeedAI | Wiro
If you’re building workflows where image quality, price, latency, and scale all matter, p-image-upscale is built for you.

Pruna AI

23,405 views • 28 days ago

We finally got a sneak peek at how Apple Vision Pro environments are made. The upper image shows the view as originally captured. The lower image represents the final composition, made up of two locations combined to create a seamless world that can be viewed in 360.

Nathie

21,340 views • 4 months ago

3. Modify image Use this prompt to modify the image you have chosen: "Modify image [1] with seed [1470033597]: add a parrot on her shoulder" Dall-E 3 identifies the image and makes the changes for you! Tips and tricks: - You can generate as many variations as you like. - It's also possible to remove an element from the image using the same method. - Sometimes the image isn't 100% identical, but still looks very similar.

Paul Couvert

44,072 views • 2 years ago

Each line on that image represents a projection through an object at a different viewing angle. Here's what you see when you project it back: you can get some idea of what the original image was, but it's very blurry.

Scott Manley

21,719 views • 1 year ago

made an app that guesses where you are in the world with just a picture using image embeddings trained on street view data first time using swiftui, consumer apps in general, TestFlight below

Surya

9,599,974 views • 1 year ago

From product image to video with just one tool - Dzine. As you may have noticed, this is one of my favorite tools. It is also very underrated, as probably 50% of my tutorials include some workflow. I was testing the new image-to-video option today, and I love it. Step-by-step guide in comments 🔽 I can do 95% of a workflow without switching between apps: image generation, image-to-image with style reference, background removal, background generation, and 2-frame image-to-video. The only other app I have been using for this video is CapCut, so that I can stitch it together.

Teodora P L

28,523 views • 1 year ago

Create a 3D model from a single image, set of images or a text prompt in < 1 minute 😮‍💨 This new AI paper called CAT3D shows us that it’ll keep getting easier to produce 3D models from 2D images — whether it’s a sparser real world 3D scan (a few photos instead of hundreds) or your favorite 2D image generator like Midjourney (just an image). How does this magic work? “This architecture is similar to video diffusion models, but with camera pose embeddings for each image instead of time embeddings. The generated views are passed into a robust 3D reconstruction pipeline to create the 3D representation (Zip-NeRF or 3DGS)”

Bilawal Sidhu

92,760 views • 2 years ago

🍌 Just released NanoBanana Pro LoRA Dataset Generator with fal Finally an easy way to create training datasets for: • Flux 2 • Z-Image • Qwen Image Edit • Any image-to-image model ✨ Uses Nano Banana Pro API on fal 🌐 100% browser-based - no server needed ⚡ Parallel generation for speed 🔗 Try it live: 💻 Code:

Lovis Odin

30,072 views • 6 months ago

let’s create the most dank image library for tap.fun. i quickly vibed together a frontend so everyone can submit images. i’ll drop some cash for the help: - $50 for the most unique meme - $50 for the craziest pic - top 20 will get added to the taplab contributor tg here’s what i’m looking for: images that can transform any image into a unique new one. think memes, crazy visuals, unique outfits, weird energy, funny shit. no text. single image only (not a grid). multiple characters in one pic is totally fine. how to submit: 1. go to 2. submit your image or meme with a name 3. download the image and post it as a comment here so we can see it

Will Mexi

11,376 views • 5 months ago

A single image generated by GPT Image 2, combined with Seedance 2.0, can already achieve this kind of effect Below is the Image 2 prompt👇

Ai Arainz

28,450 views • 1 month ago

the fact that i can take an image of a room and turn it into a 3d model in one shot is actually insane this took like 30 seconds from image to 3d model

Jan

131,884 views • 7 months ago

With the image editor of Grok Imagine, you can easily change the viewpoint of an image simply by requesting it. Here, a single image was created using the prompt, and then I requested different viewpoints.

Déborah

7,828,251 views • 7 months ago

Create a short film like this in just 1 minute with GPT Image 2.0 + Seedance 2.0. GPT Image 2.0 can naturally combine multiple photos into one single image, while Seedance 2.0 can use that image as a reference to automatically separate the scenes, generate a coherent video sequence, and add suitable background music. This workflow greatly improves the overall creative efficiency. When using this method, simply provide the merged image as a reference for Seedance 2.0 and briefly describe each scene with a simple prompt. This can significantly increase the success rate of the final video. All of the above was created on GPT Image Prompt: Seedance Prompt:

Midjourney Sref and prompt Library

40,572 views • 1 month ago

We just shipped vision for v0 — generate React, Tailwind, and Shadcn UI from an image! If you’re still waitlisted, reply here with a use case for v0 vision.

Max Leiter

130,718 views • 2 years ago

The cleanest image transitions you'll see today (inspired by Genie AI 🏄‍♂️)

saint

37,193 views • 8 days ago