Ben Burtenshaw

@ben_burtenshaw • 8,915 subscribers

community MLE 🤗 @huggingface gh/hf username: burtenshaw anon feedback: https://t.co/FfvYvFRaWS

Videos

don't worry. I've set up the logos so we're all included. play below

792,406 views • 7 days ago

here's a hands-on guide to setting up multi-agent autoresearch by Andrej Karpathy. uses open models. works with codex, claude, open code.
- uses 5 agents, each with its own configuration, tools, role, and permissions (see repo)
- a researcher agent searches papers on hf papers and creates hypotheses
- a planner agent maintains an experiment plan and log
- workers take the hypotheses and update the scripts, then start a hf job with a gpu to run the script
- a reporter agent monitors these jobs and reports events and metrics to a TrackioApp dashboard

I ran this for 4 hours and the agents ran 32 jobs. they improved on the baseline by a small margin. check out everything I learnt in thread.

90,641 views • 3 months ago
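A minimal sketch of the loop described in the post above, to make the division of roles concrete. Every name here (`Plan`, `Experiment`, `run_round`, the `run_job` callable) is hypothetical and not taken from the repo. The stubs stand in for the real paper search, the GPU job launch, and the dashboard reporting.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Experiment:
    hypothesis: str
    status: str = "planned"         # becomes "finished" once the job returns
    metric: Optional[float] = None  # filled in by the worker


@dataclass
class Plan:
    """Planner state: the experiment plan plus an append-only log."""
    experiments: list = field(default_factory=list)
    log: list = field(default_factory=list)


def researcher(topics):
    # Stub: the real agent searches papers and writes hypotheses.
    return [f"try {topic}" for topic in topics]


def planner(plan, hypotheses):
    # Record each hypothesis as a planned experiment and log the decision.
    for hypothesis in hypotheses:
        plan.experiments.append(Experiment(hypothesis))
        plan.log.append(f"planned: {hypothesis}")


def worker(experiment, run_job: Callable[[str], float]):
    # Stub: the real worker edits the script and launches a GPU job.
    # `run_job` is an assumed callable that returns the job's final metric.
    experiment.metric = run_job(experiment.hypothesis)
    experiment.status = "finished"


def reporter(plan):
    # Stub: the real reporter pushes events and metrics to a dashboard.
    return {e.hypothesis: e.metric for e in plan.experiments if e.status == "finished"}


def run_round(topics, run_job):
    plan = Plan()
    planner(plan, researcher(topics))
    for experiment in plan.experiments:
        worker(experiment, run_job)
    return reporter(plan)
```

Calling `run_round(["a", "b"], run_job=my_launcher)` runs one researcher, planner, worker, reporter pass. Keeping the planner's state in a single `Plan` object mirrors the post's "experiment plan and log", though the actual agents presumably coordinate through files and tools rather than in-process calls.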

DeepSeek-V4 dropped. 1M context. 10x smaller KV cache. First open model where the context window and the agentic post-training meet.

49,900 views • 3 months ago

Introducing the context course: a free course on doing ML with agent context. You will learn how to train models, optimize inference, and build datasets, all by defining harness context with `SKILLS.md`, Plugins, MCP, Subagents, and Hooks.

The course includes:
- Weekly live AMA on YouTube
- Weekly practical projects for ML with context
- Instructions in Pi, Codex, Claude, and Opencode
- Tutorials and guides on fundamentals
- Interactive quizzes

Learn to give AI agents the right knowledge, tools, and structure to actually get work done. Skills, MCP servers, plugins, multi-agent workflows, and building an agent from scratch. Join here:

16,905 views • 2 months ago

Eval scores in 2026 are broken. MMLU at 91%+, GSM8K at 94%+, yet models still can't handle basic multi-step tasks. And reported scores don't even agree across model cards, papers, and platforms.

We just shipped Community Evals on Hugging Face:
- Benchmark datasets now host live leaderboards (MMLU-Pro, GPQA, HLE)
- Scores live in model repos as versioned YAML
- Anyone can submit evals to any model via PR without merging
- Verified badges for reproducible runs via Inspect AI

This won't fix saturation or stop test set contamination. But it makes the game visible. What was evaluated, how, when, and by whom. Done trusting black-box leaderboards. Time to decentralize evals.

19,259 views • 5 months ago

still experimenting with LoRA based on the Thinking Machines configuration and just implemented it in Colab. In this notebook I set up a fine-tune of Qwen/Qwen3-0.6B on the OpenR1-Math dataset with a LoRA rank of 1. with this setup you can get the same reward accuracy as full fine-tuning, at a fraction of the VRAM usage.

25,624 views • 10 months ago
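To put the "fraction of the VRAM" claim in the post above into numbers, here is back-of-the-envelope arithmetic for a single linear layer. A LoRA adapter of rank r on a d_in × d_out weight trains r · (d_in + d_out) parameters instead of d_in · d_out. The 1024 × 1024 layer shape is an illustrative assumption, not a figure taken from the notebook, and the helper names are hypothetical.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def full_trainable_params(d_in: int, d_out: int) -> int:
    """Parameters in the dense weight matrix that full fine-tuning updates."""
    return d_in * d_out


d_in, d_out = 1024, 1024  # assumed layer shape, for illustration only
lora = lora_trainable_params(d_in, d_out, rank=1)
full = full_trainable_params(d_in, d_out)
print(lora, full, full // lora)  # 2048 1048576 512
```

The saving comes mainly from gradients and optimizer state, which are only kept for the trainable parameters. The frozen base weights still have to sit in memory, so total VRAM does not shrink by the same factor.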
