Daniel van Strien

@vanstriendaniel • 5,732 subscribers

Machine Learning Librarian @huggingface 🤗 I like datasets.

Shorts

Is olmOCR-bench getting close to saturation? Top score is now 85.9%. Yesterday Datalab took #1 with chandra-ocr-2. A year ago, the best was 79%. Visualised the race to get there using Hugging Face leaderboard data.

16,908 views

Videos

Can Google's DiffusionGemma help fix broken OCR? In theory, denoising tokens in parallel could work better for OCR correction, since the whole context is seen upfront. Pointed it at 19th-century newspaper OCR. It corrected better than the autoregressive baseline, at ~8x the speed.

36,105 views • 10 days ago

Built a 2.5MB in-browser image classifier in an evening with Claude Code. In 2022, I hand-labelled ~1,900 pages from historical books for a project that never really shipped. The dataset sat on Hugging Face for 3 years. This week I pointed Claude Code at it: "make me a browser-deployable model which can run using transformers.js." A few hours later: trained, quantised, running entirely client-side. It finds illustrated pages in digitised books, e.g., engravings, photographs, maps, and diagrams. Paste a IIIF manifest, see where they appear across hundreds of pages. No server required, so perfect for institutions that can't afford to pay to host big models! Old data work paying off!

47,657 views • 6 months ago

465 people. 122 languages. 58,185 annotations! FineWeb-C v1 is complete! Communities worldwide have built their own educational quality datasets, proving that we don't need to wait for big tech to support languages. Huge thanks to all who contributed!

20,108 views • 11 months ago

Inspired by Hugging Face's official MCP server, I built my own to expose my semantic search API for the HF ecosystem! Features AI-powered search, parameter analysis via safetensors, and tools to find similar models/datasets. Try: "Find non maths reasoning datasets from 2025"!

15,147 views • 1 year ago
