A peanut-sized Chinese model just dethroned Gemini at reading documents.

GLM-OCR is a 0.9B-parameter vision-language model. It scores 94.62 on OmniDocBench V1.5, ranking #1 overall. For context, it outperforms models 100x its size. 100% open-source.

It works in two stages:
1. A layout engine detects every region in a document.
2. Each region gets read in parallel.

The model predicts multiple tokens per step instead of one. That's what makes it so fast at small size.

It handles things most OCR tools struggle with:
> Complex tables and nested layouts
> Handwritten text and stamps
> Math formulas and code blocks
> Mixed image-and-text documents

You can run it locally through Ollama. It fits on edge devices with limited compute. Every expensive OCR API just got a free competitor.
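The two-stage flow described in the post (detect regions first, then read each region independently) can be sketched roughly as below. This is only an illustration of the control flow, not GLM-OCR's actual code: `detect_layout`, `read_region`, `Region`, and the dict-based "page" are all hypothetical stand-ins for a real layout detector and recognition model.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Region:
    kind: str    # e.g. "title", "table", "formula"
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates


def detect_layout(page):
    """Stage 1 (hypothetical stub): return the regions found on a page.

    A real layout engine would run a detector on the page image; here the
    "page" is a dict that already lists its regions.
    """
    return [Region(kind, bbox) for kind, bbox in page["regions"]]


def read_region(page, region):
    """Stage 2 (hypothetical stub): recognise the content of one region."""
    return page["text"][region.bbox]


def parse_page(page, max_workers=4):
    regions = detect_layout(page)
    # Regions do not depend on each other, so they can be read concurrently.
    # pool.map() yields results in input order, which keeps reading order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        texts = list(pool.map(lambda r: read_region(page, r), regions))
    return [(r.kind, t) for r, t in zip(regions, texts)]


demo_page = {
    "regions": [("title", (0, 0, 100, 20)), ("table", (0, 30, 100, 80))],
    "text": {(0, 0, 100, 20): "Invoice", (0, 30, 100, 80): "| qty | price |"},
}
print(parse_page(demo_page))
```

Because stage 2 treats every region as an independent job, throughput scales with the number of regions that can be processed at once, which is presumably part of why a small model can still be fast on full pages.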

Jafar Najafov

60,966 subscribers

13,630 views • 2 months ago • via X (Twitter)

Related videos

Fine-tune DeepSeek-OCR on your own language! (100% local) DeepSeek-OCR is a 3B-parameter vision model that achieves 97% precision while using 10× fewer vision tokens than text-based LLMs. It handles tables, papers, and handwriting without killing your GPU or budget. Why it matters: Most vision models treat documents as massive sequences of tokens, making long-context processing expensive and slow. DeepSeek-OCR uses context optical compression to convert 2D layouts into vision tokens, enabling efficient processing of complex documents. The best part? You can easily fine-tune it for your specific use case on a single GPU. I used Unsloth to run this experiment on Persian text and saw an 88.26% improvement in character error rate. ↳ Base model: 149% character error rate (CER) ↳ Fine-tuned model: 60% CER (57% more accurate) ↳ Training time: 60 steps on a single GPU Persian was just the test case. You can swap in your own dataset for any language, document type, or specific domain you're working with. I've shared the complete guide in the next tweet - all the code, notebooks, and environment setup ready to run with a single click. Everything is 100% open-source!
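A character error rate above 100%, like the 149% quoted for the base model, is possible because CER is usually defined as edit distance divided by the length of the reference text, so a model that emits lots of extra characters can rack up more errors than there are reference characters. A minimal sketch under that common definition (the post does not say exactly how its CER was computed, so treat the definition and the helper names as assumptions):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edits needed per reference character."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(before, after):
    """Fraction by which an error rate dropped, relative to where it started."""
    return (before - after) / before


print(cer("ab", "xxxxx"))              # more edits than reference characters
print(relative_reduction(1.49, 0.60))  # rounded figures from the post
```

With the rounded figures from the post, going from 149% to 60% is a drop of 89 percentage points, or roughly a 60% relative reduction. The 88.26% and 57% quoted in the post presumably come from unrounded values or a slightly different definition.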

Akshay 🚀

126,036 views • 7 months ago

Announcing llama-ocr – a free + open source OCR tool! It takes documents (images for now) & outputs markdown, and does really well for complex receipts, PDFs with tables/charts, etc. Powered by Llama 3.2 vision on Together AI & available on npm today!

Hassan

443,823 views • 1 year ago

ByteDance just dropped an OCR model that reads documents just like humans. This 0.3B model analyzes page layout first, then parses elements in parallel. 100% open-source.

Unwind AI

101,042 views • 8 months ago

Everyone is sleeping on this new OCR model! Datalab's Chandra topped independent benchmarks and beat the previous best dots-ocr. - Supports 40+ languages - Extracts complex texts, tables, formulas easily I tested on Ramanujan's handwritten letter from 1913. 100% open-source.

Avi Chawla

158,386 views • 6 months ago

Today Meta released "Code Llama", a large language model fine-tuned for coding tasks. It's publicly available and can be used for commercial use! It outperforms GPT 3.5 and you can even run it locally on your Macbook using Ollama.

Marcel Pociot 🧪

49,806 views • 2 years ago

6. Apple Notes It's fast, syncs perfectly, and does EXACTLY what it needs to. The OCR for handwritten notes is magic, and the way you can scan documents now makes it indispensable. After trying Notion and Obsidian, I'm back to Apple Notes.

Denislav Jeliazkov

520,082 views • 1 year ago

JUST IN: Meta AI introduces Voicebox, an all-in-one generative speech model. Voicebox is an impressive breakthrough! It could do for speech what other models like GPT-3 and Stable Diffusion have done for text and images. Some key details: - Voicebox can synthesize speech across 6 languages - It's a general-purpose model that can perform tasks it wasn't trained on. It can perform noise removal, content editing, style conversion, and more - Supports in-context text-to-speech synthesis and cross-lingual style transfer - It's 20x faster than current models and outperforms single-purpose models through in-context learning

elvis

88,506 views • 3 years ago

We have started taking DYNA-1, our dexterous robust VLA model, to conferences and showcasing it for hours on end! The model ran for 3 days, 8 hours each day, at #HITEC2025 3 weeks ago with a 99.9% overall success rate (dropped 1 towel on day 2). No intervention, it just works :)

Dyna Robotics

55,097 views • 11 months ago

GPT-4o-level multimodal LLM running on your phone. MiniCPM-V 4.5 outperforms GPT-4o, Gemini-2.0 Pro, and Qwen2.5-VL 72B on vision and language AI tasks. It can even understand videos and perform OCR in 30+ languages. And it's 100% open-source.

Shubham Saboo

17,947 views • 9 months ago

1/ Gemini 2.5 is here, and it’s our most intelligent AI model ever. Our first 2.5 model, Gemini 2.5 Pro Experimental, is a state-of-the-art thinking model, leading in a wide range of benchmarks – with impressive improvements in enhanced reasoning and coding and now #1 on Arena by a significant margin. With a model this intelligent, we wanted to get it to people as quickly as possible. Find it on Google AI Studio and in the Gemini app for Gemini Advanced users now – and in Vertex in the coming weeks. This is the start of a new era of thinking models – and we can’t wait to see where things go from here.

Sundar Pichai

864,057 views • 1 year ago

💻 Agent mode just went local! Today, Mistral AI released Devstral, an Apache-2.0, 24B-parameter model trained for tool calling in real-world software development environments. It reached #1 on SWE-bench for open-source models, but more importantly, in our real-world testing it has proven to be capable of navigating and autonomously editing codebases while running entirely on a laptop. To try it out, just add `ollama/devstral` in Continue. See below for a quickstart with ollama👇

Continue

39,217 views • 1 year ago

Building RAG is easy. Parsing real, unstructured data is the hard part. Most tools fail when documents get complicated. RAGFlow by InfiniFlow makes the entire process visual and flawless 🔥 It is an (open-source!) engine built specifically to find the exact needle in a data haystack, even across literally unlimited tokens. The platform comes packed with: → "Quality in, quality out" parsing for highly complex formats → Multiple recall paired with fused re-ranking → A built-in Python and JavaScript code executor for agents → An orchestrable ingestion pipeline Here's why it stands out: 1️⃣ Structural Understanding Instead of just scraping text, it handles tables across pages, scanned copies, slides, and Excel sheets natively using deep document understanding. 2️⃣ Grounded Citations Every answer is verifiable. The UI highlights the exact chunks used, allowing you to trace any response directly back to the source material. 3️⃣ Enterprise Synchronization Keep your context constantly updated with native data sync from Google Drive, Notion, Discord, and Confluence. Stop letting bad document parsing ruin your RAG systems. Best part? It's 100% Free and open-source. Link to the repo in 🧵↓

Charly Wargnier

19,131 views • 2 months ago

GROK AI TURNS FILES INTO INSIGHT IN SECONDS—NO FILTERS, NO FUSS Grok 3 doesn’t just chat—it dissects. Upload documents, code, images, or data and it rips through the noise with speed and clarity. Unlike rivals that stumble over images or limit what they’ll read, Grok identifies objects, summarizes PDFs, debugs code, and even deciphers visual content like anime frames or brand logos. It's not just OCR—it’s cognition. Need a breakdown of a dense technical spec? Grok explains it. Got a mystery file? Grok reveals it. With zero fluff and minimal censorship, Grok treats your uploads like a mission, not a formality. If your work lives in files, Grok's the analyst you've been waiting for—fast, sharp, and finally unchained. Source: Content Beta

Mario Nawfal

48,611 views • 1 year ago

🚨 Alibaba just open sourced a GUI agent that lives inside your webpage and controls it with natural language. It's called Page Agent and it's not a browser extension. It's pure JavaScript: no Python, no Puppeteer, no headless browser, no screenshots. Just one script tag and your web app understands natural language. Here's what it actually does: → Embed it with a single tag or npm install → Control any web interface with plain English commands → Text-based DOM manipulation: no OCR, no vision models needed → Bring your own LLM (GPT, Claude, Qwen, anything) → Ships a built-in UI with human-in-the-loop support → Turn 20-click ERP/CRM workflows into one sentence → Optional Chrome extension for multi-tab agent tasks → Works on any web app: SaaS, admin panels, internal tools. Companies are charging $30/month for AI copilots built on this exact idea. This is 3 lines of code. Your users. Your interface. The AI copilot layer for every web app just got open sourced. 1.6K stars. 100% Open Source. (Link in the comments)

Ihtesham Ali

134,969 views • 3 months ago

🚨Breaking: Tencent Hunyuan just dropped Hunyuan-A13B: first open-source hybrid reasoning model, which supports switching between fast and slow thinking modes. - 256K context window - Advanced agentic tool calling capabilities. Did a quick test with a front-end question; it performed well. Overall, a strong model given its size. More details and how to try👇

AshutoshShrivastava

13,672 views • 11 months ago

✨ Made a new mini feature on Photo AI: [ Grab from 3d model ] So the problem is we're at that stage in time (typical for AI) where image-to-3d models are not good enough but are fun to play with, but we know they'll be good enough in 1-2 years With [ Make 3d model ] you already can turn any Photo AI pic into a 3d model but it still looks hyper clunky and deformed, but it works! One cool idea I had to make that more useful and made now: Let people make a 3d model then change the view of it with the 3d viewer, then press [ o ] and it grabs a frame of the 3d That image you can then [ Remix ] (img2img), and it becomes a real photo again and that in turn you can then turn into a video again with [ Make video ] So that essentially gives you a fully freeform camera position control to take photos with One thing I need to fix is the background/skybox, I kinda need to take the original photo and remove the person and just get the background for the 3d model viewer, in this case it should be white, but it's a start!

@levelsio

119,210 views • 11 months ago

Hailuo 02 is now on Poe! This new video model stands out for its ability to handle real-world physics and motion where other leading-edge models struggle. It supports text and image prompts, and generates videos that are 6 seconds long. (1/2)

Poe

14,654 views • 11 months ago

Big moment for text-to-speech. Qwen just open-sourced a text-to-speech model that lets you clone voices, design new ones, and control speech using natural language. Let me explain what I mean: You can literally tell it "speak in a cheerful tone with slight nervousness," and it actually does that. No complex audio engineering needed. What makes this special: - 3-second voice cloning - Covers 10 languages: English, German, French, and more - Latency as low as 97ms for real-time applications - Supports both streaming and non-streaming generation The model comes in two sizes (0.6B and 1.7B parameters), so you can pick based on your hardware and quality needs. Three modes to work with: 1. Custom Voice: Use pre-built premium voices with instruction-based style control 2. Voice Design: Describe the voice you want in plain English (or Chinese), and the model creates it 3. Voice Clone: Provide a 3-second reference audio and clone that voice The best part? It integrates with vLLM for production deployment and has a simple Python package you can pip install. I've shared a link to the GitHub repo in the next tweet.

Akshay 🚀

31,216 views • 4 months ago

Wow, Mistral has released its new model tailor-made for AI code assistants. Codestral 25.01 (that's its name) is debuting at #1 on the LMsys copilot arena leaderboard 🔥 You can already use it for free in Continue (100% open-source) for VS Code.

Paul Couvert

83,390 views • 1 year ago

Holy shit... Microsoft open sourced an inference framework that runs a 100B parameter LLM on a single CPU. It's called BitNet. And it does what was supposed to be impossible. No GPU. No cloud. No $10K hardware setup. Just your laptop running a 100-billion parameter model at human reading speed. Here's how it works: Every other LLM stores weights in 32-bit or 16-bit floats. BitNet uses 1.58 bits. Weights are ternary: just -1, 0, or +1. That's it. No floats. No expensive matrix math. Pure integer operations your CPU was already built for. The result: - 100B model runs on a single CPU at 5-7 tokens/second - 2.37x to 6.17x faster than llama.cpp on x86 - 82% lower energy consumption on x86 CPUs - 1.37x to 5.07x speedup on ARM (your MacBook) - Memory drops by 16-32x vs full-precision models. The wildest part: Accuracy barely moves. BitNet b1.58 2B4T, their flagship model, was trained on 4 trillion tokens and benchmarks competitively against full-precision models of the same size. The quantization isn't destroying quality. It's just removing the bloat. What this actually means: - Run AI completely offline. Your data never leaves your machine - Deploy LLMs on phones, IoT devices, edge hardware - No more cloud API bills for inference - AI in regions with no reliable internet. The model supports ARM and x86. Works on your MacBook, your Linux box, your Windows machine. 27.4K GitHub stars. 2.2K forks. Built by Microsoft Research. 100% Open Source. MIT License.
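The "1.58 bits" figure is log2(3), the information content of a weight that can take one of three values. A rough sketch of how a float weight vector could be mapped to {-1, 0, +1} follows. It uses an absmean-style scale (divide by the mean absolute weight, round, clip), which is my understanding of the scheme in the BitNet b1.58 paper; the post itself does not describe the method, so treat the function names and the exact recipe as assumptions, not as the framework's API.

```python
import math


def ternary_quantize(weights, eps=1e-8):
    """Map float weights to {-1, 0, +1} plus one shared float scale.

    Assumed recipe: scale by the mean absolute weight, round to the nearest
    integer, then clip to the range [-1, 1].
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    quantized = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Approximate reconstruction of the original weights."""
    return [q * scale for q in quantized]


# Information content of a three-valued weight, in bits.
bits_per_weight = math.log2(3)

weights = [0.9, -0.05, 0.4, -1.2]
q, s = ternary_quantize(weights)
print(q, s)
print(dequantize(q, s))
print(bits_per_weight)
```

With weights restricted to -1, 0, and +1, a dot product needs no multiplications: each input is added, subtracted, or skipped, which is the sense in which the post talks about cheap integer-style arithmetic.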

Guri Singh

2,180,357 views • 3 months ago