Andrej Karpathy's banner

Andrej Karpathy

@karpathy • 3,335,655 subscribers

I like training large deep neural nets.

Shorts

I had the same thought so I've been playing with it in nanochat. E.g. here's 8 agents (4 claude, 4 codex), with 1 GPU each running nanochat experiments (trying to delete logit softcap without regression). The TLDR is that it doesn't work and it's a mess... but it's still very pretty to look at :) I tried a few setups: 8 independent solo researchers, 1 chief scientist giving work to 8 junior researchers, etc. Each research program is a git branch, each scientist forks it into a feature branch, git worktrees for isolation, simple files for comms, skip Docker/VMs for simplicity atm (I find that instructions are enough to prevent interference). Research org runs in tmux window grids of interactive sessions (like Teams) so that it's pretty to look at, see their individual work, and "take over" if needed, i.e. no -p. But ok the reason it doesn't work so far is that the agents' ideas are just pretty bad out of the box, even at highest intelligence. They don't think carefully though experiment design, they run a bit non-sensical variations, they don't create strong baselines and ablate things properly, they don't carefully control for runtime or flops. (just as an example, an agent yesterday "discovered" that increasing the hidden size of the network improves the validation loss, which is a totally spurious result given that a bigger network will have a lower validation loss in the infinite data regime, but then it also trains for a lot longer, it's not clear why I had to come in to point that out). They are very good at implementing any given well-scoped and described idea but they don't creatively generate them. But the goal is that you are now programming an organization (e.g. a "research org") and its individual agents, so the "source code" is the collection of prompts, skills, tools, etc. and processes that make it up. E.g. a daily standup in the morning is now part of the "org code". And optimizing nanochat pretraining is just one of the many tasks (almost like an eval). Then - given an arbitrary task, how quickly does your research org generate progress on it?

I had the same thought so I've been playing with it in nanochat. E.g. here's 8 agents (4 claude, 4 codex), with 1 GPU each running nanochat experiments (trying to delete logit softcap without regression). The TLDR is that it doesn't work and it's a mess... but it's still very pretty to look at :) I tried a few setups: 8 independent solo researchers, 1 chief scientist giving work to 8 junior researchers, etc. Each research program is a git branch, each scientist forks it into a feature branch, git worktrees for isolation, simple files for comms, skip Docker/VMs for simplicity atm (I find that instructions are enough to prevent interference). Research org runs in tmux window grids of interactive sessions (like Teams) so that it's pretty to look at, see their individual work, and "take over" if needed, i.e. no -p. But ok the reason it doesn't work so far is that the agents' ideas are just pretty bad out of the box, even at highest intelligence. They don't think carefully though experiment design, they run a bit non-sensical variations, they don't create strong baselines and ablate things properly, they don't carefully control for runtime or flops. (just as an example, an agent yesterday "discovered" that increasing the hidden size of the network improves the validation loss, which is a totally spurious result given that a bigger network will have a lower validation loss in the infinite data regime, but then it also trains for a lot longer, it's not clear why I had to come in to point that out). They are very good at implementing any given well-scoped and described idea but they don't creatively generate them. But the goal is that you are now programming an organization (e.g. a "research org") and its individual agents, so the "source code" is the collection of prompts, skills, tools, etc. and processes that make it up. E.g. a daily standup in the morning is now part of the "org code". And optimizing nanochat pretraining is just one of the many tasks (almost like an eval). Then - given an arbitrary task, how quickly does your research org generate progress on it?

1,642,943 views

Test time compute cat 🐈‍⬛

Test time compute cat 🐈‍⬛

251,207 views

Videos

Anya Rossi

sweetdream.ai

SweetDream.ai•Sponsored•Livecam

Watch Anya Live

Anya is streaming live right now! Join her private show and enjoy exclusive content.

Exclusive private shows

1.2k viewers online

Private Show

Join now for exclusive access

Free preview available • Premium content

I'm playing around with generative AI tools and stitching them together into visual stories. Here I took the first few sentences of Pride and Prejudice and made it into a video. The gen stack used for this one: - Anthropic Claude took the first chapter, generated the scenes and the individual prompts to to the image generator. - Ideogram took the prompts and generate the images - Luma took the images and animated them - for narration - VEED | AI Video Creation to stitch it together (Many of these choices are just what I happened to use for this one while exploring a bunch of things). Anyway honestly it was pretty messy and there is a ton of copy pasting between all of the tools, and even this little video with 3 scenes took me about an hour. There is a huge storytelling opportunity here for whoever can make this convenient. Who is building the first 100% AI-native movie maker?

I'm playing around with generative AI tools and stitching them together into visual stories. Here I took the first few sentences of Pride and Prejudice and made it into a video. The gen stack used for this one: - Anthropic Claude took the first chapter, generated the scenes and the individual prompts to to the image generator. - Ideogram took the prompts and generate the images - Luma took the images and animated them - for narration - VEED | AI Video Creation to stitch it together (Many of these choices are just what I happened to use for this one while exploring a bunch of things). Anyway honestly it was pretty messy and there is a ton of copy pasting between all of the tools, and even this little video with 3 scenes took me about an hour. There is a huge storytelling opportunity here for whoever can make this convenient. Who is building the first 100% AI-native movie maker?

Andrej Karpathy

608,746 views • 2 years ago

We're vibing this nice Sunday morning. Added more functionality. Using the approx 3500kcal ~= 1lb of fat, we now show a really cool animated ring that fills up to 3500 in either +/- direction, and completing the circle adds it on the bottom. So e.g. 3 green circles = 3lb lighter, in theory :). 3 conversations were used: Refactor the AppStorage to be better / cleaner and shuffle elements around a bit Clamp the display to always be in range [-3500, 3500], which is 1lb of fat, and show lb of fat as circles on bottom Making the calorie counter have a nice ring that fills up

We're vibing this nice Sunday morning. Added more functionality. Using the approx 3500kcal ~= 1lb of fat, we now show a really cool animated ring that fills up to 3500 in either +/- direction, and completing the circle adds it on the bottom. So e.g. 3 green circles = 3lb lighter, in theory :). 3 conversations were used: Refactor the AppStorage to be better / cleaner and shuffle elements around a bit Clamp the display to always be in range [-3500, 3500], which is 1lb of fat, and show lb of fat as circles on bottom Making the calorie counter have a nice ring that fills up

Andrej Karpathy

356,066 views • 1 year ago

Amazing text to music generations from Suno , could easily see these taking over leaderboards. Personal favorite: this song I fished out of their Discord a few months ago, "Return to Monkey", which has been stuck in my head since :D [00:57] I wanna return to monkey, I wanna be wild and free, I wanna return to monkey, modern life is not for me. No more emails, no more bills, no more endless strife, Just the sound of the river, the hearbeat of life 😂

Amazing text to music generations from Suno , could easily see these taking over leaderboards. Personal favorite: this song I fished out of their Discord a few months ago, "Return to Monkey", which has been stuck in my head since :D [00:57] I wanna return to monkey, I wanna be wild and free, I wanna return to monkey, modern life is not for me. No more emails, no more bills, no more endless strife, Just the sound of the river, the hearbeat of life 😂

Andrej Karpathy

553,876 views • 2 years ago

August 1, 2024: The Music Video Fun hack just stitching up gen AI tools :), in this case to create a music video for today. - copy paste the entire WSJ front page into Claude - ask it to generate multiple scenes and give visual descriptions for them - copy paste scene descriptions into image generator (Ideogram here) - copy paste generated images into Runway Gen 3 Alpha to make each image into a 10-second video - ask Claude to generate lyrics that depict that day - copy paste lyrics into Suno to generate music - stitch things up in iMovie :D :D :D

August 1, 2024: The Music Video Fun hack just stitching up gen AI tools :), in this case to create a music video for today. - copy paste the entire WSJ front page into Claude - ask it to generate multiple scenes and give visual descriptions for them - copy paste scene descriptions into image generator (Ideogram here) - copy paste generated images into Runway Gen 3 Alpha to make each image into a 10-second video - ask Claude to generate lyrics that depict that day - copy paste lyrics into Suno to generate music - stitch things up in iMovie :D :D :D

Andrej Karpathy

416,475 views • 1 year ago

I gave a talk at GPU MODE workshop last week on llm.c - the origin story of llm.c - being naked in the world without PyTorch and having to re-invent Array, Autograd, Device, Dtype, Compile, Distributed - how to port a PyTorch layer to 1) explicit PyTorch - and then to 2) write the backward pass - 3) port forward & backward pass to C - 4) string all the layers together - achieving one file of C with no dependencies that compiles and runs ~instantly, where all memory is pre-planned and allocated a single time, fully deterministic, portable code that can run on a potato or a von Neumann probe - how most of llm.c was built at 1am-7am in a water villa porch in Maldives and why this is the recommended way to develop software - convert all of it to run in CUDA on GPU in fp32 - port matmul to cuBLAS - port attention to cuDNN flash-attention - introduce bfloat16 mixed precision - introduce many more optimizations and features like kernel fusions, Packed128, stochastic rounding, full determinism - add multi-GPU training, NCCL, sharded optimizer - add multi-node with MPI or file system or socket - reproduce GPT-2 (1.6B) on one 8XH100 node in 24 hours for $672 in llm.c, achieving (at the time) 29% less memory, 19% faster training that PyTorch nightly, and much faster compile & run - how open source development attracts Avengers from the internet - port to training Llama 3 imminent (branch exists) - many other notable forks - last thought: how software abstractions like Python/PyTorch and everything else really exist only because humans are finite in knowledge, IQ and attention, and how with increasing AI capability LLMs may export custom binaries like llm.c for any application directly, tearing apart and refactoring all abstractions as needed. More links in reply

I gave a talk at GPU MODE workshop last week on llm.c - the origin story of llm.c - being naked in the world without PyTorch and having to re-invent Array, Autograd, Device, Dtype, Compile, Distributed - how to port a PyTorch layer to 1) explicit PyTorch - and then to 2) write the backward pass - 3) port forward & backward pass to C - 4) string all the layers together - achieving one file of C with no dependencies that compiles and runs ~instantly, where all memory is pre-planned and allocated a single time, fully deterministic, portable code that can run on a potato or a von Neumann probe - how most of llm.c was built at 1am-7am in a water villa porch in Maldives and why this is the recommended way to develop software - convert all of it to run in CUDA on GPU in fp32 - port matmul to cuBLAS - port attention to cuDNN flash-attention - introduce bfloat16 mixed precision - introduce many more optimizations and features like kernel fusions, Packed128, stochastic rounding, full determinism - add multi-GPU training, NCCL, sharded optimizer - add multi-node with MPI or file system or socket - reproduce GPT-2 (1.6B) on one 8XH100 node in 24 hours for $672 in llm.c, achieving (at the time) 29% less memory, 19% faster training that PyTorch nightly, and much faster compile & run - how open source development attracts Avengers from the internet - port to training Llama 3 imminent (branch exists) - many other notable forks - last thought: how software abstractions like Python/PyTorch and everything else really exist only because humans are finite in knowledge, IQ and attention, and how with increasing AI capability LLMs may export custom binaries like llm.c for any application directly, tearing apart and refactoring all abstractions as needed. More links in reply

Andrej Karpathy

336,280 views • 1 year ago

Yay, llama2.c can now load and inference the Meta released models! :) E.g. here inferencing the smallest 7B model at ~3 tokens/s on 96 OMP threads on a cloud Linux box. Still just CPU, fp32, one single .c file of 500 lines: expecting ~300 tok/s tomorrow :)

Yay, llama2.c can now load and inference the Meta released models! :) E.g. here inferencing the smallest 7B model at ~3 tokens/s on 96 OMP threads on a cloud Linux box. Still just CPU, fp32, one single .c file of 500 lines: expecting ~300 tok/s tomorrow :)

Andrej Karpathy

409,515 views • 3 years ago

No more content to load