
Junxian He
@junxian_he • 6,340 subscribers
Assist. Prof @hkust. NLP/ML PhD @LTIatCMU.
Videos

🚀 We are excited to introduce the Tool Decathlon (Toolathlon), a benchmark for language agents on diverse, complex, and realistic tool use.
⭐️ 32 applications and 600+ tools based on real-world software environments
⭐️ Execution-based, reliable evaluation
⭐️ Realistic, covering daily and professional scenarios
Toolathlon reveals significant shortcomings of SOTA LLMs in realistic tool-use tasks, where Claude Sonnet 4.5 achieves a 38.6% success rate. It also indicates a clear gap between open-source and leading proprietary models.
Check our blog: GitHub: Paper: 🧵⬇️
Junxian He • 43,599 views • 7 months ago