
A peanut-sized Chinese model just dethroned Gemini at reading documents.

GLM-OCR is a 0.9B parameter vision-language model. It scores 94.62 on OmniDocBench V1.5, ranking #1 overall. For context, it outperforms models 100x its size. 100% open-source.

It works in two stages:
1. A layout engine detects every region in a document.
2. Each region gets read in parallel.

The model predicts multiple tokens per step instead of one. That's what makes it so fast at small size.

It handles things most OCR tools struggle with:
> Complex tables and nested layouts
> Handwritten text and stamps
> Math formulas and code blocks
> Mixed image-and-text documents

You can run it locally through Ollama. It fits on edge devices with limited compute.

Every expensive OCR API just got a free competitor.

Jafar Najafov

60,966 subscribers

13,630 views • 2 months ago •via X (Twitter)
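
The two-stage flow described in the post (detect regions first, then read each region in parallel) can be sketched roughly as below. This is only an illustration of the pattern, not GLM-OCR's actual code or API: `Region`, `detect_layout`, `read_region`, and `parse_document` are hypothetical names, and both stages are replaced by trivial stand-ins so the example runs without any model.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """One detected block of a page (hypothetical structure)."""
    order: int    # reading-order index assigned by the layout stage
    kind: str     # e.g. "text", "table", "formula"
    content: str  # stand-in for the cropped image of this region


def detect_layout(page: list[tuple[str, str]]) -> list[Region]:
    """Stage 1 stand-in: a real layout engine would return bounding boxes."""
    return [Region(i, kind, content) for i, (kind, content) in enumerate(page)]


def read_region(region: Region) -> str:
    """Stage 2 stand-in: a real system would run the OCR model on the crop."""
    return f"[{region.kind}] {region.content}"


def parse_document(page: list[tuple[str, str]], max_workers: int = 4) -> str:
    """Detect regions, read them concurrently, and join them in reading order."""
    regions = detect_layout(page)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order, so reading order is kept
        # even if individual regions finish at different times.
        texts = list(pool.map(read_region, regions))
    return "\n".join(texts)


if __name__ == "__main__":
    print(parse_document([("text", "Invoice 2024"), ("table", "qty | price")]))
```

Because each region is read independently, the second stage parallelizes cleanly; the only coordination needed is keeping the layout stage's reading order when the pieces are joined.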


Related Videos

Fine-tune DeepSeek-OCR on your own language! (100% local) DeepSeek-OCR is a 3B-parameter vision model that achieves 97% precision while using 10× fewer vision tokens than text-based LLMs. It handles tables, papers, and handwriting without killing your GPU or budget. Why it matters: Most vision models treat documents as massive sequences of tokens, making long-context processing expensive and slow. DeepSeek-OCR uses context optical compression to convert 2D layouts into vision tokens, enabling efficient processing of complex documents. The best part? You can easily fine-tune it for your specific use case on a single GPU. I used Unsloth to run this experiment on Persian text and saw an 88.26% improvement in character error rate. ↳ Base model: 149% character error rate (CER) ↳ Fine-tuned model: 60% CER (57% more accurate) ↳ Training time: 60 steps on a single GPU Persian was just the test case. You can swap in your own dataset for any language, document type, or specific domain you're working with. I've shared the complete guide in the next tweet - all the code, notebooks, and environment setup ready to run with a single click. Everything is 100% open-source!


Akshay 🚀

126,036 views • 7 months ago
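
A base-model character error rate above 100%, as quoted in the post, is possible because CER is the character-level edit distance divided by the length of the reference text, so a model that emits far more characters than the reference can exceed 1.0. A minimal sketch of that standard definition follows; the function names `edit_distance` and `cer` are illustrative and are not taken from the post's notebook or from Unsloth.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # delete ref_char
                            curr[j - 1] + 1,      # insert hyp_char
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # A hypothesis much longer than the reference pushes CER past 100%.
    print(f"{cer('abc', 'abcxyzw'):.2%}")
```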

Mistral OCR 4 turned a handwritten calculus exam into clean LaTeX! We gave it a photo of a hand-written exam page. The model read the handwriting and rebuilt every formula into structured digital text. Output: Time: 5.1s · Cost: $0.09. Formulas came through exactly right - the hard part was nailed. The graph, unfortunately, it didn’t redraw. But that’s the telling part: most OCR tools just dump the text and quietly drop the figure. OCR 4 caught the plot, boxed it, and tagged it as a chart. It doesn’t get redrawn, but it gets read and accounted for.


atomic.chat

410,690 views • 4 days ago

Announcing llama-ocr – a free + open source OCR tool! It takes documents (images for now) & outputs markdown, and does really well for complex receipts, PDFs with tables/charts, etc. Powered by Llama 3.2 vision on Together AI & available on npm today!


Hassan

443,826 views • 1 year ago

ByteDance just dropped an OCR model that reads documents just like humans. This 0.3B model analyzes page layout first, then parses elements in parallel. 100% open-source.


Unwind AI

101,055 views • 8 months ago

Everyone is sleeping on this new OCR model! Datalab's Chandra topped independent benchmarks and beat the previous best dots-ocr. - Supports 40+ languages - Extracts complex texts, tables, formulas easily I tested on Ramanujan's handwritten letter from 1913. 100% open-source.


Avi Chawla

158,440 views • 6 months ago

Today Meta released "Code Llama", a large language model fine-tuned for coding tasks. It's publicly available and can be used for commercial use! It outperforms GPT 3.5 and you can even run it locally on your Macbook using Ollama.


Marcel Pociot 🧪

49,806 views • 2 years ago

NVIDIA just made AI detect objects 10x faster by deleting one step. It's called LocateAnything, and it removes the biggest bottleneck no one else was fixing in vision-language models. Normally a model builds each bounding box one coordinate token at a time. 100 objects means thousands of tokens before an answer. NVIDIA scrapped that: their Parallel Box Decoding predicts the whole box in a single forward pass, as one atomic unit. → 12.7 boxes/sec on one H100 → 10x faster than Qwen3-VL → +3.8% F1 on LVIS, accuracy up, not down → 3B params, runs on one consumer GPU Treating the box as one unit keeps its coordinates tied together, which is why accuracy climbed instead of falling. One model handles detection, GUI grounding, OCR, and document understanding, ready for computer-use agents, robotics, and document pipelines. 100% open source, weights, code, demo, and paper all live.


Alvaro Cintas

158,239 views • 1 day ago

6. Apple Notes It's fast, syncs perfectly, and does EXACTLY what it needs to. The OCR for handwritten notes is magic, and the way you can scan documents now makes it indispensable. After trying Notion and Obsidian, I'm back to Apple Notes.


Denislav Jeliazkov

520,082 views • 1 year ago

JUST IN: Meta AI introduces Voicebox, an all-in-one generative speech model. Voicebox is an impressive breakthrough! It could do for speech what other models like GPT-3 and Stable Diffusion have done for text and images. Some key details: - Voicebox can synthesize speech across 6 languages - It's a general-purpose model that can perform tasks it wasn't trained on. It can perform noise removal, content editing, style conversion, and more - Supports in-context text-to-speech synthesis and cross-lingual style transfer - It's 20x faster than current models and outperforms single-purpose models through in-context learning


elvis

88,512 views • 3 years ago

We have started taking DYNA-1, our dexterous robust VLA model, to conferences and showcasing it for hours on end! The model ran for 3 days, 8 hours each day, at #HITEC2025 3 weeks ago with a 99.9% overall success rate (dropped 1 towel on day 2). No intervention, it just works :)


Dyna Robotics

55,097 views • 1 year ago

1/ Gemini 2.5 is here, and it’s our most intelligent AI model ever. Our first 2.5 model, Gemini 2.5 Pro Experimental is a state-of-the-art thinking model, leading in a wide range of benchmarks – with impressive improvements in enhanced reasoning and coding and now #1 on Arena by a significant margin. With a model this intelligent, we wanted to get it to people as quickly as possible. Find it on Google AI Studio and in the Gemini app for Gemini Advanced users now – and in Vertex in the coming weeks. This is the start of a new era of thinking models – and we can’t wait to see where things go from here.


Sundar Pichai

864,102 views • 1 year ago

GPT-4o level multimodal LLM running on your phone. MiniCPM-V 4.5 outperforms GPT-4o, Gemini-2.0 Pro, and Qwen2.5-VL 72B on vision and language AI tasks. It can even understand videos and perform OCR in 30+ languages. And it's 100% open source.


Shubham Saboo

17,947 views • 9 months ago

💻 Agent mode just went local! Today, Mistral AI released Devstral, an Apache-2.0, 24B-parameter model trained for tool calling in real-world software development environments. It reached #1 on SWE-bench for open-source models, but more importantly, in our real-world testing it has proven to be capable of navigating and autonomously editing codebases while running entirely on a laptop. To try it out, just add `ollama/devstral` in Continue. See below for a quickstart with ollama👇


Continue

39,217 views • 1 year ago

Building RAG is easy. Parsing real, unstructured data is the hard part. Most tools fail when documents get complicated. RAGFlow by InfiniFlow makes the entire process visual and flawless 🔥 It is an (open-source!) engine built specifically to find the exact needle in a data haystack, even across literally unlimited tokens. The platform comes packed with: → "Quality in, quality out" parsing for highly complex formats → Multiple recall paired with fused re-ranking → A built-in Python and JavaScript code executor for agents → An orchestrable ingestion pipeline Here's why it stands out: 1️⃣ Structural Understanding Instead of just scraping text, it handles tables across pages, scanned copies, slides, and Excel sheets natively using deep document understanding. 2️⃣ Grounded Citations Every answer is verifiable. The UI highlights the exact chunks used, allowing you to trace any response directly back to the source material. 3️⃣ Enterprise Synchronization Keep your context constantly updated with native data sync from Google Drive, Notion, Discord, and Confluence. Stop letting bad document parsing ruin your RAG systems. Best part? It's 100% Free and open-source. Link to the repo in 🧵↓


Charly Wargnier

19,131 views • 3 months ago

GROK AI TURNS FILES INTO INSIGHT IN SECONDS—NO FILTERS, NO FUSS Grok 3 doesn’t just chat—it dissects. Upload documents, code, images, or data and it rips through the noise with speed and clarity. Unlike rivals that stumble over images or limit what they’ll read, Grok identifies objects, summarizes PDFs, debugs code, and even deciphers visual content like anime frames or brand logos. It's not just OCR—it’s cognition. Need a breakdown of a dense technical spec? Grok explains it. Got a mystery file? Grok reveals it. With zero fluff and minimal censorship, Grok treats your uploads like a mission, not a formality. If your work lives in files, Grok's the analyst you've been waiting for—fast, sharp, and finally unchained. Source: Content Beta

Mario Nawfal

48,611 views • 1 year ago

🚨 Alibaba just open sourced a GUI agent that lives inside your webpage and controls it with natural language. It's called Page Agent and it's not a browser extension. It's pure JavaScript no Python, no Puppeteer, no headless browser, no screenshots. Just one script tag and your web app understands natural language. Here's what it actually does: → Embed it with a single tag or npm install → Control any web interface with plain English commands → Text-based DOM manipulation no OCR, no vision models needed → Bring your own LLM (GPT, Claude, Qwen, anything) → Ships a built-in UI with human-in-the-loop support → Turn 20-click ERP/CRM workflows into one sentence → Optional Chrome extension for multi-tab agent tasks → Works on any web app SaaS, admin panels, internal tools Companies are charging $30/month for AI copilots built on this exact idea. This is 3 lines of code. Your users. Your interface. The AI copilot layer for every web app just got open sourced. 1.6K stars. 100% Open Source. (Link in the comments)


Ihtesham Ali

134,969 views • 3 months ago

🚨Breaking: Tencent Hunyuan just dropped Hunyuan-A13B, the first open-source hybrid reasoning model, which supports switching between fast and slow thinking modes. - 256K context window - Advanced agentic tool calling capabilities Did a quick test with a front-end question, and it performed well. Overall, a strong model given its size. More details and how to try👇


AshutoshShrivastava

13,672 views • 1 year ago

✨ Made a new mini feature on Photo AI: [ Grab from 3d model ] So the problem is we're at that stage in time (typical for AI) where image-to-3d models are not good enough but are fun to play with, but we know they'll be good enough in 1-2 years With [ Make 3d model ] you already can turn any Photo AI pic into a 3d model but it still looks hyper clunky and deformed, but it works! One cool idea I had to make that more useful and made now: Let people make a 3d model then change the view of the it with the 3d viewer, then press [ o ] and it grabs a frame of the 3d That image you can then [ Remix ] (img2img), and it becomes a real photo again and that in turn you can then turn into a video again with [ Make video ] So that essentially gives you a fully freeform camera position control to take photos with One thing I need to fix is the background/skybox, I kinda need to take the original photo and remove the person and just get the background for the 3d model viewer, in this case it should be white, but it's a start!


@levelsio

119,210 views • 1 year ago

Big moment for text-to-speech. Qwen just open-sourced a text-to-speech model that lets you clone voices, design new ones, and control speech using natural language. Let me explain what I mean: You can literally tell it "speak in a cheerful tone with slight nervousness," and it actually does that. No complex audio engineering needed. What makes this special: - 3-second voice cloning - Covers 10 languages: English, German, French, and more - Latency as low as 97ms for real-time applications - Supports both streaming and non-streaming generation The model comes in two sizes (0.6B and 1.7B parameters), so you can pick based on your hardware and quality needs. Three modes to work with: 1. Custom Voice: Use pre-built premium voices with instruction-based style control 2. Voice Design: Describe the voice you want in plain English (or Chinese), and the model creates it 3. Voice Clone: Provide a 3-second reference audio and clone that voice The best part? It integrates with vLLM for production deployment and has a simple Python package you can pip install. I've shared a link to the GitHub repo in the next tweet.


Akshay 🚀

31,216 views • 5 months ago

Hailuo 02 is now on Poe! This new video model stands out for its ability to handle real-world physics and motion where other leading-edge models struggle. It supports text and image prompts, and generates videos that are 6 seconds long. (1/2)


Poe

14,654 views • 1 year ago