
Daniel van Strien
@vanstriendaniel • 5,732 subscribers
Machine Learning Librarian @huggingface 🤗 I like datasets.

Can Google's DiffusionGemma help fix broken OCR? In theory, denoising tokens in parallel could suit OCR correction better than left-to-right generation, since the model sees the whole context upfront. I pointed it at 19th-century newspaper OCR. It corrected better than the autoregressive baseline, at ~8x the speed.
Daniel van Strien • 36,105 views • 10 days ago

In an evening with Claude Code, I built a 2.5MB image classifier that runs in the browser. In 2022, I hand-labelled ~1,900 pages from historical books for a project that never really shipped. The dataset sat on Hugging Face for 3 years. This week I pointed Claude Code at it: "make me a browser-deployable model which can run using transformers.js." A few hours later it was trained, quantised, and running entirely client-side. It finds illustrated pages in digitised books, e.g. engravings, photographs, maps, and diagrams. Paste a IIIF manifest and see where they appear across hundreds of pages. No server is required, so it's perfect for institutions that can't afford to host big models! Old data work paying off!
Daniel van Strien • 47,657 views • 6 months ago
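For readers curious about the "paste a IIIF manifest" step, here is a rough sketch of how page images could be listed from a manifest before being passed to a classifier. It assumes the IIIF Presentation API 2.x layout (sequences, canvases, images). The function name `canvas_image_urls` and the example.org URLs are hypothetical, and this is not the code used in the video.

```python
import json


def canvas_image_urls(manifest: dict) -> list[tuple[str, str]]:
    """Return (canvas label, image URL) pairs from a IIIF Presentation 2.x manifest."""
    pairs = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                # In Presentation 2.x the image lives under image["resource"]["@id"].
                url = image.get("resource", {}).get("@id")
                if url:
                    pairs.append((str(canvas.get("label", "")), url))
    return pairs


# Hypothetical two-page manifest, trimmed to the fields used above.
sample = json.loads("""
{
  "@type": "sc:Manifest",
  "sequences": [{"canvases": [
    {"label": "p. 1",
     "images": [{"resource": {"@id": "https://example.org/iiif/p1/full/full/0/default.jpg"}}]},
    {"label": "p. 2",
     "images": [{"resource": {"@id": "https://example.org/iiif/p2/full/full/0/default.jpg"}}]}
  ]}]
}
""")

pages = canvas_image_urls(sample)
for label, url in pages:
    print(label, url)
```

Each URL could then be fetched and classified page by page; manifests using Presentation API 3.0 nest images differently and would need a separate branch.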

Inspired by Hugging Face's official MCP server, I built my own to expose my semantic search API for the HF ecosystem! It features AI-powered search, parameter analysis via safetensors, and tools to find similar models and datasets. Try: "Find non maths reasoning datasets from 2025"!
Daniel van Strien • 15,147 views • 1 year ago