Akshit

@akshitwt • 3,327 subscribers

ml @cambridge_uni. previously @precogatiiith, @iiit_hyderabad. futurebound.

introducing a new, very fun LLM benchmark: the Game-of-Life Bench!

the rules are simple: given an 8x8 grid following Conway's game of life rules, the goal is to create an initial pattern with at most 32 live cells that lasts the largest number of turns before dying out or repeating.

some results to highlight (with caveats detailed below):
- gpt 5.1 lasts the longest with a 106 step run
- claude models are really bad at this! they refuse to reason about this task and score < 25 points
- deepseek r1 is the best open model with 102 steps

why? because i wanted to create a benchmark that has (i think) no practicality, but is still fun to look at, cheap, and still measures something interesting. i am also a big fan of the game of life. its absurdly simple rules leading to intractability is extremely cool to me.

also, i saw a lot of work with LLMs trying to "predict" the next state in Conway's game of life. i think game-of-life bench is more fun because it's pretty open ended and only asks the LLM for the initial state. i also think this could be an RL env? but idk why you would ever train on this task haha

i don't think this is a "serious" benchmark because it doesn't measure anything practical, but i still think it's a hard benchmark, exactly because you can't predict what happens with your initial state many turns into the future. this is why i was initially expecting all LLMs to be bad at it, but it turns out some are clearly better than others (the ordering may surprise you!)

reminder: this is still a work-in-progress.
(1) i am gpu-poor so could only do 10 runs for each model, even though the total running cost is relatively low. maybe with some more credits i can run more seeds for each model.
(2) i handpicked models which i think are at the frontier right now, plus some others that were on my mind. so, if you'd like to see a model on here, let me know.
(3) i currently only do an 8x8 grid because i thought that by itself would be pretty hard for current LLMs, but of course we can increase grid sizes!
(4) the coolest thing is, i don't think we can calculate the max possible number of states you can go through without repeating (yay undecidability!), so this is essentially a no-ceiling task, which is pretty cool!

again, i did this mostly out of a desire to make LLMs do something fun. if this keeps me entertained for a few more days, i'd likely release a blog post on it. if it keeps me entertained for a week (and someone sponsors me), i'll put more work into it :P

lastly, this is fully open sourced, so feel free to run this on your own!
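to make the scoring rule concrete, here is a minimal sketch of how a submitted pattern could be scored. this is not the benchmark's actual code: the names `step` and `lifetime` are made up for illustration, and it assumes cells outside the 8x8 grid are always dead (no wrap-around) and that the score is the number of steps taken until the grid is empty or a previously seen state shows up again.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

Cell = Tuple[int, int]  # (row, col)
SIZE = 8        # grid side length, as in the post
MAX_CELLS = 32  # cap on live cells in the initial pattern, as in the post


def step(live: FrozenSet[Cell], size: int = SIZE) -> FrozenSet[Cell]:
    """Apply one Game of Life step. Assumption: cells off the grid stay dead."""
    counts: Dict[Cell, int] = {}
    for r, c in live:
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr == 0 and dc == 0:
                    continue
                nr, nc = r + dr, c + dc
                if 0 <= nr < size and 0 <= nc < size:
                    counts[(nr, nc)] = counts.get((nr, nc), 0) + 1
    # Birth on exactly 3 neighbours; survival on 2 or 3.
    return frozenset(
        cell for cell, n in counts.items()
        if n == 3 or (n == 2 and cell in live)
    )


def lifetime(initial: Iterable[Cell], size: int = SIZE,
             max_cells: int = MAX_CELLS) -> int:
    """Count steps until the grid dies out or revisits an earlier state.

    Assumed scoring rule (hypothetical, not taken from the benchmark repo).
    The loop always ends because a finite grid has finitely many states.
    """
    live = frozenset(initial)
    if len(live) > max_cells:
        raise ValueError(f"pattern has {len(live)} cells, limit is {max_cells}")
    if any(not (0 <= r < size and 0 <= c < size) for r, c in live):
        raise ValueError("pattern has cells outside the grid")
    seen = {live}
    steps = 0
    while live:
        live = step(live, size)
        steps += 1
        if live in seen:  # repeated state: a still life or an oscillator
            break
        seen.add(live)
    return steps


if __name__ == "__main__":
    blinker = [(3, 2), (3, 3), (3, 4)]
    print(lifetime(blinker))
```

under these assumptions, an LLM's answer is just a list of up to 32 `(row, col)` pairs passed to `lifetime`, and a higher return value is better. a different boundary rule (for example a wrap-around grid) would change the scores.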

13,722 views