A peanut-sized Chinese model just dethroned Gemini at reading documents.

GLM-OCR is a 0.9B parameter vision-language model. It scores 94.62 on OmniDocBench V1.5, ranking #1 overall. For context, it outperforms models 100x its size. 100% open-source.

It works in two stages.
1. A layout engine detects every region in a document.
2. Each region gets read in parallel.

The model predicts multiple tokens per step instead of one. That's what makes it so fast at small size.

It handles things most OCR tools struggle with:
> Complex tables and nested layouts
> Handwritten text and stamps
> Math formulas and code blocks
> Mixed image-and-text documents

You can run it locally through Ollama. It fits on edge devices with limited compute.

Every expensive OCR API just got a free competitor.

AlphaSignal

16,380 subscribers

91,979 views • 3 months ago • via X (Twitter)
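
The two-stage flow described in the post (detect regions, then read each region in parallel) can be sketched roughly as follows. This is a minimal illustration, not GLM-OCR's actual code: `Region`, `detect_layout`, `read_region`, and `parse_document` are hypothetical names, and both stages are stubbed so the control flow runs without a model.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    kind: str      # e.g. "text", "table", "formula"
    order: int     # reading order assigned by the layout stage
    payload: str   # stand-in for the cropped image pixels


def detect_layout(page: list[str]) -> list[Region]:
    """Stage 1 (stub): pretend each input string is one detected region."""
    return [Region(kind="text", order=i, payload=p) for i, p in enumerate(page)]


def read_region(region: Region) -> str:
    """Stage 2 (stub): a real system would run the recognition model here."""
    return region.payload.upper()


def parse_document(page: list[str], workers: int = 4) -> str:
    regions = detect_layout(page)
    # Regions are independent of each other, so they can be read concurrently.
    # Executor.map yields results in input order, which keeps reading order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = list(pool.map(read_region, regions))
    return "\n".join(texts)


result = parse_document(["title", "body paragraph", "table cell"])
print(result)
```

The point of the split is that once the layout stage has fixed the reading order, the per-region work has no dependencies, so throughput scales with the number of workers (or, on a GPU, with batch size).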

News & Politics, Education, Science & Technology

Related videos

Fine-tune DeepSeek-OCR on your own language! (100% local)

DeepSeek-OCR is a 3B-parameter vision model that achieves 97% precision while using 10× fewer vision tokens than text-based LLMs. It handles tables, papers, and handwriting without killing your GPU or budget.

Why it matters: Most vision models treat documents as massive sequences of tokens, making long-context processing expensive and slow. DeepSeek-OCR uses context optical compression to convert 2D layouts into vision tokens, enabling efficient processing of complex documents.

The best part? You can easily fine-tune it for your specific use case on a single GPU.

I used Unsloth to run this experiment on Persian text and saw an 88.26% improvement in character error rate.
↳ Base model: 149% character error rate (CER)
↳ Fine-tuned model: 60% CER (57% more accurate)
↳ Training time: 60 steps on a single GPU

Persian was just the test case. You can swap in your own dataset for any language, document type, or specific domain you're working with.

I've shared the complete guide in the next tweet - all the code, notebooks, and environment setup ready to run with a single click.

Everything is 100% open-source!

Akshay 🚀

126,122 views • 8 months ago
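
A character error rate above 100%, like the 149% quoted in the post, is possible because CER divides the edit distance by the reference length, and a model that emits far too much text can need more edits than the reference has characters. The sketch below uses the standard Levenshtein definition; the function names are illustrative and the post does not say which CER implementation it used. The 149% and 60% inputs look rounded, so the results will not reproduce the post's 88.26% or 57% exactly.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete from ref
                            curr[j - 1] + 1,      # insert into ref
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate relative to the reference length."""
    return edit_distance(ref, hyp) / len(ref)


def absolute_drop(before: float, after: float) -> float:
    """Improvement measured in CER points."""
    return before - after


def relative_reduction(before: float, after: float) -> float:
    """Improvement measured as a fraction of the starting CER."""
    return (before - after) / before


# A hypothesis much longer than the reference pushes CER past 1.0 (100%).
print(f"CER of a runaway output: {cer('abc', 'xxxxxxx'):.2f}")

# The post's rounded figures: 149% before fine-tuning, 60% after.
print(f"absolute drop: {absolute_drop(1.49, 0.60):.2f}")
print(f"relative reduction: {relative_reduction(1.49, 0.60):.3f}")
```

Which of the two improvement measures a headline number refers to matters a lot when the starting CER is above 100%, so it is worth stating explicitly when reporting fine-tuning results.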

Unlimited-OCR is a 3B parameter model that parses entire 100-page PDFs in one shot: no page chunking, no lost context.

+ 32K context window, reads the whole document in a single pass
+ 93% on standard parsing benchmarks, +6 over baseline
+ Under 0.11 error rate past page 40, where every other OCR tool falls apart
+ Multilingual out of the box, runs locally via Transformers, Ollama, llama.cpp, Docker
+ Textract, Google Vision, and Azure Doc Intelligence charge $1.50–$15 per 1,000 pages

1.9M downloads on HuggingFace and most people have never heard of it.

This runs on your machine. For free. Forever.

Model link:

0xMarioNawfal

83,780 views • 7 days ago

China open-sourced a peanut-sized OCR that parses entire 100-page PDFs in one shot.

It's called Unlimited-OCR. Only 3B params. Runs locally.

Every other OCR tool chops your doc into pages and loses the thread. This one reads the whole thing in a single pass.

→ One-shot "long-horizon" parsing (32K context window)
→ Multilingual, out of the box
→ 93% on the standard parsing benchmark (+6 over baseline)
→ <0.11 error rate past 40 pages
→ Runs 100% locally on your own hardware
→ Works with Transformers, vLLM, SGLang, Docker, Ollama, llama.cpp

Traditional cloud OCR (Textract, Google Vision, Azure Doc Intelligence) costs $1.50–$15 per 1,000 pages. This runs on your machine. For free. Forever.

Baidu built it explicitly to push DeepSeek-OCR one step further. Already at 1.9M downloads on Hugging Face and most people have no idea it exists yet.

100% open source.

Superman

1,083,623 views • 8 days ago

Baidu just open-sourced an OCR model that reads entire 40-page documents in one shot.

It's called Unlimited-OCR. 3 billion parameters but only 500 million active during inference. Runs 100% locally on your machine.

Why this matters: traditional OCR tools chop documents page by page. Tables that span two pages break. Reading order gets lost. Cross-page context disappears.

Unlimited-OCR processes the whole document at once. 32K context window. Text, formulas, tables, reading order all preserved across pages. Output comes out as clean structured Markdown.

→ 93% accuracy on the standard benchmark. +6 points over the baseline.
→ Error rate stays below 0.11 even past 40 pages.
→ Multilingual out of the box.
→ 2.12 million downloads on Hugging Face last month. 14,600 GitHub stars.

For context: Amazon Textract, Google Cloud Vision, and Azure Document Intelligence all charge per page. This runs locally for free.

Vaibhav Sisinty

409,153 views • 7 days ago

NVIDIA just made AI detect objects 10x faster by deleting one step.

It's called LocateAnything, and it removes the biggest bottleneck no one else was fixing in vision-language models.

Normally a model builds each bounding box one coordinate token at a time. 100 objects means thousands of tokens before an answer. NVIDIA scrapped that: their Parallel Box Decoding predicts the whole box in a single forward pass, as one atomic unit.

→ 12.7 boxes/sec on one H100
→ 10x faster than Qwen3-VL
→ +3.8% F1 on LVIS, accuracy up, not down
→ 3B params, runs on one consumer GPU

Treating the box as one unit keeps its coordinates tied together, which is why accuracy climbed instead of falling.

One model handles detection, GUI grounding, OCR, and document understanding, ready for computer-use agents, robotics, and document pipelines.

100% open source, weights, code, demo, and paper all live.

Alvaro Cintas

201,044 views • 29 days ago

I'm deleting every paid document extraction tool because of this.

Zipstack built Unstract and it turns any PDF, scan, or image into clean structured JSON using LLMs you already pay for.

You point it at a document (an invoice, a bank statement, a KYC form, a tax return, a claims form) and it automatically pulls the fields you asked for and hands back a JSON object ready to drop straight into your database.

The difference from every other document extraction tool is the setup. Most tools want a model per vendor or regex per template. This works from a single natural language prompt that handles every variation.

→ Extracts structured JSON from PDFs, scans, and images
→ Lets you write extraction schemas in plain English through Prompt Studio
→ Deploys as a REST API or ETL pipeline in minutes
→ Ships an MCP server so Claude and other agents can extract documents directly
→ Connects to S3, GCS, Snowflake, BigQuery, Postgres, and every major destination
→ LLM-agnostic: bring OpenAI, Anthropic, Bedrock, Gemini, Mistral, Ollama, or any provider you already use

Basically: you have thousands of documents nobody has time to read. Unstract turns your documents into clean, structured JSON that you can run in production.

This is what document extraction should have been all along.

Hasan Toor

36,409 views • 4 days ago

You waste hours downloading models that don't run on your hardware. This tool fixes that in one command.

It's called llmfit. It scans your RAM, CPU, and GPU, then scores every model in its catalog for fit, speed, and quality, so you know what runs before you download anything.

→ Detects MoE architectures correctly, most tools treat them as dense and get it wrong
→ Recommends the best quantization for your exact hardware
→ Covers hundreds of models across dozens of providers

100% Free. Open Source.

Simplifying AI

46,554 views • 8 days ago
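
The kind of "will it fit" check the post describes can be approximated with back-of-the-envelope arithmetic. The sketch below is not llmfit's scoring logic, which the post does not describe: `weights_gib`, `fits`, and the 1.2 overhead factor are assumptions chosen for illustration.

```python
def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits(params_billion: float, bits_per_weight: float,
         available_gib: float, overhead: float = 1.2) -> bool:
    """Rough fit check. `overhead` is an assumed allowance for the KV cache
    and runtime buffers, not a measured value."""
    return weights_gib(params_billion, bits_per_weight) * overhead <= available_gib


# A 7B model on a machine with 8 GiB free, at two quantization levels.
for bits in (16, 4):
    size = weights_gib(7, bits)
    print(f"7B @ {bits}-bit: {size:.2f} GiB, fits in 8 GiB: {fits(7, bits, 8)}")

# For a mixture-of-experts model, memory is typically driven by TOTAL
# parameters (all experts are loaded), while per-token compute follows the
# ACTIVE parameters. Sizing it by active parameters alone underestimates RAM.
```

Treating an MoE model as if only its active parameters needed memory is one way a naive estimate can go wrong, which may be what the post means by tools that "treat them as dense and get it wrong".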

Building RAG is easy. Parsing real, unstructured data is the hard part.

Most tools fail when documents get complicated. RAGFlow by InfiniFlow makes the entire process visual and flawless 🔥

It is an (open-source!) engine built specifically to find the exact needle in a data haystack, even across literally unlimited tokens.

The platform comes packed with:
→ "Quality in, quality out" parsing for highly complex formats
→ Multiple recall paired with fused re-ranking
→ A built-in Python and JavaScript code executor for agents
→ An orchestrable ingestion pipeline

Here's why it stands out:

1️⃣ Structural Understanding
Instead of just scraping text, it handles tables across pages, scanned copies, slides, and Excel sheets natively using deep document understanding.

2️⃣ Grounded Citations
Every answer is verifiable. The UI highlights the exact chunks used, allowing you to trace any response directly back to the source material.

3️⃣ Enterprise Synchronization
Keep your context constantly updated with native data sync from Google Drive, Notion, Discord, and Confluence.

Stop letting bad document parsing ruin your RAG systems.

Best part? It's 100% Free and open-source.

Link to the repo in 🧵↓

Charly Wargnier

19,220 views • 4 months ago
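
"Multiple recall paired with fused re-ranking" means several retrievers each return a ranked list, and those lists are merged into one. The post does not say how RAGFlow fuses them, so the sketch below swaps in reciprocal rank fusion (RRF), a common and simple fusion method, purely as an illustration. The document IDs and the constant `k = 60` are assumptions.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each document scores 1 / (k + rank) per list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["a", "b", "c"]   # e.g. from a full-text search
vector_hits = ["b", "c", "d"]    # e.g. from an embedding search
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents found by both retrievers ("b" and "c" here) rise above documents found by only one, even if the single-retriever hit was ranked first in its own list.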

🚨 Alibaba just open sourced a GUI agent that lives inside your webpage and controls it with natural language.

It's called Page Agent and it's not a browser extension. It's pure JavaScript: no Python, no Puppeteer, no headless browser, no screenshots. Just one script tag and your web app understands natural language.

Here's what it actually does:
→ Embed it with a single tag or npm install
→ Control any web interface with plain English commands
→ Text-based DOM manipulation: no OCR, no vision models needed
→ Bring your own LLM (GPT, Claude, Qwen, anything)
→ Ships a built-in UI with human-in-the-loop support
→ Turn 20-click ERP/CRM workflows into one sentence
→ Optional Chrome extension for multi-tab agent tasks
→ Works on any web app: SaaS, admin panels, internal tools

Companies are charging $30/month for AI copilots built on this exact idea. This is 3 lines of code. Your users. Your interface.

The AI copilot layer for every web app just got open sourced.

1.6K stars. 100% Open Source.

(Link in the comments)

Ihtesham Ali

135,384 views • 4 months ago

✨ Made a new mini feature on Photo AI: [ Grab from 3d model ]

So the problem is we're at that stage in time (typical for AI) where image-to-3d models are not good enough but are fun to play with, but we know they'll be good enough in 1-2 years

With [ Make 3d model ] you already can turn any Photo AI pic into a 3d model but it still looks hyper clunky and deformed, but it works!

One cool idea I had to make that more useful and made now:

Let people make a 3d model then change the view of it with the 3d viewer, then press [ o ] and it grabs a frame of the 3d

That image you can then [ Remix ] (img2img), and it becomes a real photo again and that in turn you can then turn into a video again with [ Make video ]

So that essentially gives you a fully freeform camera position control to take photos with

One thing I need to fix is the background/skybox, I kinda need to take the original photo and remove the person and just get the background for the 3d model viewer, in this case it should be white, but it's a start!

@levelsio

119,210 views • 1 year ago

Big moment for text-to-speech.

Qwen just open-sourced a text-to-speech model that lets you clone voices, design new ones, and control speech using natural language.

Let me explain what I mean: You can literally tell it "speak in a cheerful tone with slight nervousness," and it actually does that. No complex audio engineering needed.

What makes this special:
- 3-second voice cloning
- Covers 10 languages: English, German, French, and more
- Latency as low as 97ms for real-time applications
- Supports both streaming and non-streaming generation

The model comes in two sizes (0.6B and 1.7B parameters), so you can pick based on your hardware and quality needs.

Three modes to work with:
1. Custom Voice: Use pre-built premium voices with instruction-based style control
2. Voice Design: Describe the voice you want in plain English (or Chinese), and the model creates it
3. Voice Clone: Provide a 3-second reference audio and clone that voice

The best part? It integrates with vLLM for production deployment and has a simple Python package you can pip install.

I've shared a link to the GitHub repo in the next tweet.

Akshay 🚀

31,249 views • 6 months ago

Holy shit... Microsoft open sourced an inference framework that runs a 100B parameter LLM on a single CPU.

It's called BitNet. And it does what was supposed to be impossible.

No GPU. No cloud. No $10K hardware setup. Just your laptop running a 100-billion parameter model at human reading speed.

Here's how it works:

Every other LLM stores weights in 32-bit or 16-bit floats. BitNet uses 1.58 bits. Weights are ternary: just -1, 0, or +1. That's it. No floats. No expensive matrix math. Pure integer operations your CPU was already built for.

The result:
- 100B model runs on a single CPU at 5-7 tokens/second
- 2.37x to 6.17x faster than llama.cpp on x86
- 82% lower energy consumption on x86 CPUs
- 1.37x to 5.07x speedup on ARM (your MacBook)
- Memory drops by 16-32x vs full-precision models

The wildest part: Accuracy barely moves.

BitNet b1.58 2B4T, their flagship model, was trained on 4 trillion tokens and benchmarks competitively against full-precision models of the same size. The quantization isn't destroying quality. It's just removing the bloat.

What this actually means:
- Run AI completely offline. Your data never leaves your machine
- Deploy LLMs on phones, IoT devices, edge hardware
- No more cloud API bills for inference
- AI in regions with no reliable internet

The model supports ARM and x86. Works on your MacBook, your Linux box, your Windows machine.

27.4K GitHub stars. 2.2K forks. Built by Microsoft Research.

100% Open Source. MIT License.

Guri Singh

2,180,357 views • 4 months ago
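
The "1.58 bits" figure is the information content of a three-valued weight, log2(3). The sketch below shows that number and why ternary weights turn a dot product into additions and subtractions. It is an illustration only, not BitNet's kernels: it assumes absmean scaling (divide by the mean absolute weight, round, clip to [-1, 1]), keeps activations as plain floats, and the function names are made up.

```python
import math

# Information content of one weight that can take three values.
BITS_PER_TERNARY_WEIGHT = math.log2(3)  # about 1.585


def quantize_ternary(weights: list[float], eps: float = 1e-8) -> tuple[list[int], float]:
    """Map float weights to {-1, 0, +1} plus one shared scale (assumed absmean)."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def ternary_dot(quantized: list[int], activations: list[float], scale: float) -> float:
    """Dot product with ternary weights: no multiplications inside the loop."""
    acc = 0.0
    for q, x in zip(quantized, activations):
        if q == 1:
            acc += x
        elif q == -1:
            acc -= x
        # q == 0 contributes nothing and is skipped
    return acc * scale  # a single multiply by the shared scale at the end


q, scale = quantize_ternary([0.9, -0.05, -1.1, 0.4])
y = ternary_dot(q, [1.0, 2.0, 3.0, 4.0], scale)
print(q, round(scale, 4), round(y, 4))

# Idealized weight storage for 100B ternary weights, ignoring packing overhead.
ideal_gb = 100e9 * BITS_PER_TERNARY_WEIGHT / 8 / 1e9
print(f"{BITS_PER_TERNARY_WEIGHT:.3f} bits/weight, about {ideal_gb:.1f} GB for 100B weights")
```

Real implementations pack several ternary values into each byte and use integer activations, so actual memory and speed depend on the packing format and kernel, not just on the log2(3) bound.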

MiniMax M3 just dropped: their first natively multimodal model.

So I ran it through my form-filling test. (The model has to place each element at the right pixel position on a blank form image, not type into a field.)

Verdict: it got everything on the paper.

> Name, DOB, ID, gender, marital status, nationality, email, phone, address, postal code, all there.
> Best character spacing I've seen yet: it actually calculates the gap between each character, clean across the DOB and number boxes
> A few fields slightly misaligned, but every piece of data made it onto the form

The reasoning chain is the interesting part: it does the easy fields first, then works into the tight one-char-per-box fields, reasoning through y-coordinates, baselines, and label clearance in obsessive detail.

The cost: 40:33 and 126.7k output tokens. That's a long think, but it's MiniMax's first multimodal model, and it nailed the content.

stevibe

27,383 views • 1 month ago

we sped up distributed inference by up to 5x with decentralized speculative decoding.

many don't realize that AI models normally generate text one single word at a time, waiting for the network after every word. speculative decoding changes this by using a "guess & confirm" system, similar to autocomplete.

how it's done:

1. draft locally (the guess)
instead of waiting for the network, a tiny, fast model on your device guesses the next few words instantly.

2. confirm remotely (the check)
the massive remote model doesn't generate from scratch; it just checks the draft. it looks at the guesses in a batch and says "yes, yes, no." you get multiple words in the time it usually takes to get one.

3. adaptive logic
dsd is smart. if the topic is creative, it lets the draft flow loose. if the topic is math or code, it checks more strictly. it balances speed and precision automatically so your inference almost feels instant.

find out more:
paper:
blog:

Parallax

45,425 views • 6 months ago
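
The "guess & confirm" loop from steps 1 and 2 can be sketched with two toy next-token functions. This is plain greedy speculative decoding under simplifying assumptions, not the decentralized variant or the adaptive logic the post describes: `target_next`, `draft_next`, and the integer "tokens" are made up so the example runs without any model.

```python
def target_next(ctx: list[int]) -> int:
    """Toy stand-in for the large remote model (always correct by definition)."""
    return (ctx[-1] * 3 + 1) % 10


def draft_next(ctx: list[int]) -> int:
    """Toy stand-in for the small local model: agrees except after token 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] * 3 + 1) % 10


def greedy_decode(prompt: list[int], n_new: int) -> list[int]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))  # one remote round trip per token
    return out


def speculative_decode(prompt: list[int], n_new: int, k: int = 4) -> tuple[list[int], int]:
    out, round_trips = list(prompt), 0
    while len(out) - len(prompt) < n_new:
        # 1. Draft locally: guess k tokens without contacting the target.
        ctx, draft = list(out), []
        for _ in range(k):
            token = draft_next(ctx)
            draft.append(token)
            ctx.append(token)
        # 2. Confirm remotely: a real system checks all k guesses in one
        #    batched pass; here that pass is counted as one round trip.
        round_trips += 1
        ctx = list(out)
        for token in draft:
            expected = target_next(ctx)
            ctx.append(expected)      # on a mismatch this is the correction
            if token != expected:
                break                 # discard the rest of the draft
        else:
            ctx.append(target_next(ctx))  # all accepted: one bonus token
        out = ctx
    return out[: len(prompt) + n_new], round_trips


tokens, trips = speculative_decode([1], n_new=12)
print(tokens, trips)
```

Because every accepted token is one the target would have produced anyway, the output matches target-only decoding exactly; the saving comes from needing fewer round trips than tokens.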

🚨 JUST IN: CHINA just released an AI EMPLOYEE that works 24/7 on its own. 100% OPEN SOURCE.

It researches, codes, builds websites, creates slide decks, and generates videos. All by itself. All on your computer.

It's called DeerFlow.

You give it a task. It makes a plan, spins up its own team of sub-agents, and gets to work. You come back and there's a finished deliverable waiting. Not a draft. Not a summary. The actual thing.

Not a chatbot. Not a research assistant. An AI with its own computer that works while you sleep.

Here's what it does on its own:
→ Spawns multiple sub-agents in parallel, each tackling a different piece of your task, then combines everything into one finished output
→ Writes real code, runs it, reads the results, and fixes its own mistakes without asking you once
→ Builds slide decks, websites, full research reports, and data dashboards from scratch
→ Remembers you across sessions. Your writing style. Your tech stack. Your preferences. Gets better every time.
→ Reads files you upload, works with them inside its own filesystem, hands you clean finished outputs
→ Searches the web, runs commands, calls any tool you plug in

Here's how it thinks:

You give one instruction. The lead agent makes a plan. Sub-agents fan out and work in parallel. Results come back. Everything gets synthesized. You get a deliverable.

A single research task might split into a dozen sub-agents, each exploring a different angle, then converge into one finished website with generated visuals.

Here's the wildest part:

DeerFlow 2.0 launched on February 28th 2026 and hit number 1 on all of GitHub Trending the same day.

Version 2.0 was a complete rewrite. Zero shared code with version 1. Because users kept using it for things the team never intended. Data pipelines. Dashboards. Entire content workflows. The community told them what it needed to become. So they burned it down and rebuilt it.

22.7K GitHub stars. 2.7K forks. Built by ByteDance.

100% Open Source. MIT License.

Kanika

737,284 views • 4 months ago
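
The plan, fan-out, and synthesis loop described in the DeerFlow post can be sketched roughly as follows. This is a minimal illustration of the pattern only, not DeerFlow's code or API: `make_plan`, `run_subagent`, `synthesize`, and `lead_agent` are hypothetical names, and the sub-agent is a stub that a real system would replace with model or tool calls.

```python
import asyncio

# Hypothetical names throughout; this is NOT DeerFlow's API.

def make_plan(task: str) -> list[str]:
    """Lead-agent step: split one instruction into independent subtasks."""
    angles = ["background", "data", "visuals"]  # assumed example angles
    return [f"{task}: {angle}" for angle in angles]

async def run_subagent(subtask: str) -> str:
    """Stand-in for a sub-agent; a real one would call a model or tools."""
    await asyncio.sleep(0)  # yield control so sub-agents can interleave
    return f"[done] {subtask}"

def synthesize(results: list[str]) -> str:
    """Fan-in step: merge the sub-agent outputs into one deliverable."""
    return "\n".join(results)

async def lead_agent(task: str) -> str:
    subtasks = make_plan(task)
    # Fan out: run all sub-agents concurrently.
    # asyncio.gather returns results in the order the subtasks were given.
    results = await asyncio.gather(*(run_subagent(s) for s in subtasks))
    return synthesize(list(results))

if __name__ == "__main__":
    print(asyncio.run(lead_agent("market report")))
```

Keeping the plan, the workers, and the merge step as separate functions means the number of sub-agents can change without touching the orchestration code.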

THIS SHELF OF MAC MINIS REPLACES $4,080 A YEAR IN AI SUBSCRIPTIONS
00:02 the camera pans across a shelf of stacked Mac minis and the trick is obvious: that silent little farm runs the models you rent every month.
most people pay 7 companies for AI and use 3 of the tools. they forget the rest on the credit card and call it a stack.
the Mac mini M4 ends that. one shared memory pool means a $599 box runs 7B and 8B models faster than Windows machines that cost twice as much.
ollama pull, one command. open webui in one docker line. point Claude Code at localhost and it just works.
it draws 10 to 30 watts, sits silent next to a router, and runs 24/7 for $3 a month in power.
it pays back a $20 ChatGPT Plus sub in 3 months, then saves you $4,000 a year while the frontier still rents you compute.
every month you wait is another $340 gone for compute that fits on a shelf.

Fokki

12,933 views • 1 month ago
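
The cost figures in the Mac mini post can be checked with a few lines of arithmetic. The sketch below uses only the post's numbers ($4,080 a year, a $599 box, $3 a month in power, 10 to 30 watts). The electricity price passed to `monthly_power_cost` is an assumption, and the function names are illustrative.

```python
# Figures taken from the post.
SUBSCRIPTIONS_PER_YEAR = 4080  # dollars per year
BOX_PRICE = 599                # dollars, one Mac mini
POWER_PER_MONTH = 3            # dollars per month, as stated in the post

monthly_subscriptions = SUBSCRIPTIONS_PER_YEAR / 12  # the post's "$340 a month"

def payback_months(box_price: float, monthly_saving: float,
                   monthly_power: float) -> float:
    """Months until the box price is recovered by cancelled subscriptions."""
    net = monthly_saving - monthly_power
    if net <= 0:
        raise ValueError("power costs exceed the subscription savings")
    return box_price / net

def monthly_power_cost(watts: float, price_per_kwh: float,
                       hours: float = 24 * 30) -> float:
    """Electricity cost of running continuously for a 30-day month."""
    return watts / 1000 * hours * price_per_kwh

if __name__ == "__main__":
    print(monthly_subscriptions)
    print(payback_months(BOX_PRICE, monthly_subscriptions, POWER_PER_MONTH))
    # 0.15 dollars per kWh is an assumed price, not a figure from the post.
    print(monthly_power_cost(30, 0.15))
```

`payback_months` can be called with any monthly subscription total, so the same check works for a single plan as well as for the full bundle.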

there is so much real data just sitting in the open right now it's almost funny. four years of starlight on every star, a NASA archive that's been free for over a decade, detectors still recording the sky tonight, and barely anyone has a net pointed at any of it. so i pointed one. this is me pulling the planet data, the data loading is the boring part. the net i built to read it, the wall it hit, and what that taught me about where AI goes next, that's the full story, and it drops tonight. the data's public, the tools are free, the box fits on a desk. what's stopping you. you can just do things anon.

Sudo su

60,445 views • 1 month ago

Google Translate is cooked after this.
A developer built a local AI translation engine that runs 40 languages entirely on your own laptop. It's called LibreTranslate.
No API key. No usage limits. No sending your documents to Google's servers. You install it once. It runs forever.
Here's what it handles:
→ Paste text. Translated instantly.
→ Drop in a file. Outputs the translated version.
→ Point it at a URL. Returns the page in your language.
→ Build it into your own app via its local REST API.
The speed is not the story. The privacy is.
Google Translate reads every sentence you paste into it. Legal contracts. Medical records. Internal emails. Client documents. Every word goes to their servers and stays there.
LibreTranslate runs entirely offline. Nothing leaves your machine. Ever.
The numbers:
→ 40 languages supported
→ Runs on CPU -- no GPU needed
→ Self-hosted in under 5 minutes
→ REST API built in for developers
→ 10K+ stars on GitHub
100% open source. MIT licensed. Price: $0.
Google charges nothing for Translate either but it charges you something else.
GitHub:

Rimsha Bhardwaj

89,223 views • 1 month ago
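
The "local REST API" point in the LibreTranslate post can be illustrated with a short client. This is a sketch under assumptions: it presumes a LibreTranslate server is already running at `http://localhost:5000` (adjust `BASE_URL` for your setup) and that the `en` and `de` language models are installed. It uses the `/translate` endpoint with the `q`, `source`, `target`, and `format` fields; the helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # assumption: locally hosted instance

def build_translate_request(text: str, source: str, target: str,
                            base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request for the /translate endpoint."""
    payload = {"q": text, "source": source, "target": target, "format": "text"}
    return urllib.request.Request(
        f"{base_url}/translate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def translate(text: str, source: str = "en", target: str = "de",
              base_url: str = BASE_URL) -> str:
    """Send the request and return the translated string."""
    req = build_translate_request(text, source, target, base_url)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["translatedText"]

if __name__ == "__main__":
    # Requires a running local server; raises URLError otherwise.
    print(translate("Nothing leaves your machine."))
```

Separating request construction from the network call keeps the payload easy to inspect without a server running.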

✨ Grok's new Imagine video model also comes with an Edit model.
We know edit models for images: you submit an image and write a prompt for what to change. But this is the first time I've seen a proper edit model for video.
And it kinda works. Not great yet, but it does something.
Here I had to remove the old name "Nomad List" in the video for my site. First it said "Go nomad -> Nomad List", so I prompted it "remove the text Nomad List. do not change anything else". It didn't remove it, but it replaced it with just "Go nomad" again. Okay, good enough.
Useful, because otherwise I'd have to scour my backups for the original video in Final Cut Pro, and this is faster.
One thing you see is that it also changes the pattern on the door, but that's okay for now if I fade it in.

@levelsio

73,183 views • 5 months ago

Gemini Omni doesn’t just allow you to render text more accurately — but to create it in sync with your visuals. 🎥 Choose your type, placement, animation, exposure and more. Prompt for this video: Word by word, one word on the screen at a time: did, you, know, that, this, model, can, do, pretty, good, text!? each word appears with a different animated style, perfect pacing to a rhythm, sizzle reel.

Google

152,052 views • 1 month ago