
Daniel van Strien
@vanstriendaniel • 5,732 subscribers
Machine Learning Librarian @huggingface 🤗 I like datasets.

Can Google's DiffusionGemma help fix broken OCR? In theory, denoising tokens in parallel could work better for OCR correction, since the whole context is seen upfront. I pointed it at 19th-century newspaper OCR. It corrected better than the autoregressive baseline, at ~8x the speed.
Daniel van Strien • 36,105 views • 10 days ago
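The post does not say how "corrected better" was measured. One common yardstick for OCR correction is character error rate (CER): the edit distance between the output and a ground-truth transcription, divided by the length of the ground truth. The sketch below is an assumption about how such a comparison could be scored, not the author's evaluation code; the function names and the sample strings are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate of hypothesis against a ground-truth reference."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(hypothesis, reference) / len(reference)


if __name__ == "__main__":
    # Hypothetical sample: a ground-truth line, the raw OCR, and a corrected version.
    reference = "the quick brown fox"
    raw_ocr = "tbe qu1ck brovvn fox"
    corrected = "the quick brown fox"
    print(f"raw CER:       {cer(raw_ocr, reference):.3f}")
    print(f"corrected CER: {cer(corrected, reference):.3f}")
```

A lower CER after correction than before means the model helped. Scoring two correctors against the same reference lines gives a like-for-like comparison.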

Built a 2.5MB image classifier that runs in the browser, in an evening, with Claude Code. In 2022, I hand-labelled ~1,900 pages from historical books for a project that never really shipped. The dataset sat on Hugging Face for 3 years. This week I pointed Claude Code at it: "make me a browser-deployable model which can run using transformers.js." A few hours later: trained, quantised, running entirely client-side. It finds illustrated pages in digitised books, e.g. engravings, photographs, maps and diagrams. Paste a IIIF manifest, see where they appear across hundreds of pages. No server required, so perfect for institutions that can't afford to pay to host big models! Old data work paying off!
Daniel van Strien • 47,657 views • 6 months ago

465 people. 122 languages. 58,185 annotations! FineWeb-C v1 is complete! Communities worldwide have built their own educational quality datasets, proving that we don't need to wait for big tech to support languages. Huge thanks to all who contributed!
Daniel van Strien • 20,108 views • 11 months ago

Inspired by Hugging Face's official MCP server, I built my own to expose my semantic search API for the HF ecosystem! Features AI-powered search, parameter analysis via safetensors, and tools to find similar models/datasets. Try: "Find non maths reasoning datasets from 2025"!
Daniel van Strien • 15,147 views • 1 year ago
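For readers wondering what "parameter analysis via safetensors" can look like in practice: a safetensors file starts with an 8-byte little-endian header length, followed by a JSON header listing each tensor's dtype, shape, and byte offsets. That makes it possible to count parameters without loading any weights. The sketch below uses only the Python standard library. It is an illustration of the file format, not the code behind the author's MCP server, and the function names are hypothetical.

```python
import json
import struct
from math import prod


def parse_header(blob: bytes) -> dict:
    """Decode the JSON header from the leading bytes of a safetensors file."""
    (header_len,) = struct.unpack("<Q", blob[:8])  # unsigned 64-bit, little-endian
    return json.loads(blob[8:8 + header_len])


def read_header(path: str) -> dict:
    """Read only the header of a safetensors file on disk; tensor data is never touched."""
    with open(path, "rb") as f:
        prefix = f.read(8)
        (header_len,) = struct.unpack("<Q", prefix)
        return parse_header(prefix + f.read(header_len))


def count_parameters(header: dict) -> int:
    """Sum element counts over all tensors, skipping the optional __metadata__ entry."""
    return sum(
        prod(info["shape"])  # prod([]) is 1, which is right for a scalar tensor
        for name, info in header.items()
        if name != "__metadata__"
    )


if __name__ == "__main__":
    import sys

    # Usage: python count_params.py model.safetensors
    print(count_parameters(read_header(sys.argv[1])))
```

Because only the header is read, this works on multi-gigabyte checkpoints in milliseconds. The same header can also be fetched from a remote file with an HTTP range request.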