
Ben Burtenshaw
@ben_burtenshaw • 7,889 subscribers
community MLE 🤗 @huggingface gh/hf username: burtenshaw anon feedback: https://t.co/FfvYvFRaWS
Videos

here's a hands-on guide to setting up multi-agent autoresearch by Andrej Karpathy. uses open models. works with codex, claude, open code.
- uses 5 agents, each with a configuration, specific tools, roles, and permissions (see repo)
- a researcher agent searches papers on hf papers and creates hypotheses
- a planner agent maintains an experiment plan and log
- workers take the hypotheses and update the scripts, then start a hf job with a gpu to run the script
- a reporter agent monitors these jobs and reports events and metrics to a TrackioApp dashboard
I ran this for 4 hours and the agents ran 32 jobs; they improved on the baseline by a small margin. check out everything I learnt in thread.
Ben Burtenshaw • 90,496 views • 2 months ago
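
The division of labour in the post above can be sketched roughly as follows. This is a minimal illustration, not the code from the repo: the `Agent` class, the `run_round` function, and every tool name (`search_papers`, `launch_job`, and so on) are hypothetical labels invented here, and no real job is launched. It only shows the idea of giving each role its own tool allow-list and passing one hypothesis through researcher, planner, worker, and reporter.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """One agent: a role plus the tools it is permitted to call."""
    role: str
    tools: frozenset

    def use(self, tool: str) -> str:
        # Permission check: an agent may only call tools on its own allow-list.
        if tool not in self.tools:
            raise PermissionError(f"{self.role} may not use {tool}")
        return f"{self.role}:{tool}"


@dataclass
class ExperimentLog:
    """Stand-in for the experiment plan and log kept by the planner."""
    entries: list = field(default_factory=list)


def run_round(hypothesis: str, log: ExperimentLog) -> list:
    # All tool names below are hypothetical placeholders, not real APIs.
    researcher = Agent("researcher", frozenset({"search_papers"}))
    planner = Agent("planner", frozenset({"update_plan"}))
    worker = Agent("worker", frozenset({"edit_script", "launch_job"}))
    reporter = Agent("reporter", frozenset({"watch_job", "log_metrics"}))

    steps = [
        researcher.use("search_papers"),  # find papers, propose a hypothesis
        planner.use("update_plan"),       # record it in the plan
        worker.use("edit_script"),        # change the training script
        worker.use("launch_job"),         # start a GPU job for the script
        reporter.use("watch_job"),        # monitor the running job
        reporter.use("log_metrics"),      # report metrics to the dashboard
    ]
    log.entries.append((hypothesis, steps))
    return steps
```

Keeping the allow-list on the agent means a reporter that tries to call `launch_job` fails with `PermissionError` instead of quietly starting a GPU job.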

Introducing the context course: a free course on doing ML with agent context. You will learn how to train models, optimize inference, and build datasets, all by defining harness context with `SKILLS.md`, Plugins, MCP, Subagents, and Hooks.
The course includes:
- Weekly live AMA on YouTube
- Weekly practical projects for ML with context
- Instructions in Pi, Codex, Claude, and Opencode
- Tutorials and guides on fundamentals
- Interactive Quizzes
Learn to give AI agents the right knowledge, tools, and structure to actually get work done: Skills, MCP servers, plugins, multi-agent workflows, and building an agent from scratch. Join here:
Ben Burtenshaw • 16,905 views • 1 month ago

Eval scores in 2026 are broken. MMLU at 91%+, GSM8K at 94%+, yet models still can't handle basic multi-step tasks. And reported scores don't even agree across model cards, papers, and platforms.
We just shipped Community Evals on Hugging Face:
- Benchmark datasets now host live leaderboards (MMLU-Pro, GPQA, HLE)
- Scores live in model repos as versioned YAML
- Anyone can submit evals to any model via PR without merging
- Verified badges for reproducible runs via Inspect AI
This won't fix saturation or stop test set contamination. But it makes the game visible: what was evaluated, how, when, and by whom. Done trusting black-box leaderboards. Time to decentralize evals.
Ben Burtenshaw • 19,259 views • 4 months ago

still experimenting with LoRA based on the Thinking Machines configuration and just implemented it in colab. In this notebook I set up a fine-tune of Qwen/Qwen3-0.6B on the OpenR1-Math dataset with a LoRA rank of 1. with this setup you can get the same reward accuracy as full fine-tuning, at a fraction of the VRAM usage.
Ben Burtenshaw • 25,624 views • 8 months ago
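
To give a feel for why a rank of 1 is so cheap, here is a small worked calculation. It is a sketch under stated assumptions, not the notebook's code: the 1024 x 1024 matrix size is an illustrative choice, and the helper names are made up for this example. LoRA freezes a weight matrix W and trains a low-rank update B @ A, so the trainable parameter count per matrix is rank * (d_in + d_out) instead of d_in * d_out.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole matrix is fine-tuned."""
    return d_in * d_out


# Illustrative dimension only (an assumption, not read from the model config).
d = 1024

rank1 = lora_params(d, d, rank=1)  # 1 * (1024 + 1024) = 2048
full = full_params(d, d)           # 1024 * 1024 = 1048576
ratio = full // rank1              # 512

print(f"rank-1 LoRA: {rank1} trainable params per matrix")
print(f"full fine-tune: {full} trainable params per matrix")
print(f"reduction: {ratio}x")
```

Gradients and optimizer state scale with the number of trainable parameters, which is where most of the saving comes from. The frozen weights and the activations still take up memory, so the overall VRAM reduction is smaller than the per-matrix ratio suggests.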