🧵1/2 Andrej Karpathy at GPU MODE workshop on llm.c! 👇Full Video

16,177 views • 1 year ago • via X (Twitter)

Comments: 8

tetsuo.ai 💹🧲 • 1 year ago

LLMs in simple, pure C/CUDA: Neural Networks: Zero to Hero: Full Video:

UserInterface • 4 years ago

Need Professional Video Production, Music Videos, Commercials, Graphic Design, or Photo Retouching? We will take your project from concept to completion. #services #creative #DMV

Daniel O'Leary • 1 year ago

$TETSUO 💪

Bui Dinh Ngoc • 1 year ago

This is so cool!

tetsuo.ai 💹🧲 • 1 year ago

yeah it is!

maxwellsdemon⏳ • 1 year ago

he mentions at the end of the talk i think using LLMs as intermediate compilers which generate application specific llm.c files to accelerate workloads instead of relying on cuda/ptx (or high level apis like triton)

Jared / eacc • 1 year ago

this is awesome

Jared / eacc • 1 year ago

how can i explain these to simple people?.....

Related videos

I gave a talk at GPU MODE workshop last week on llm.c
- the origin story of llm.c
- being naked in the world without PyTorch and having to re-invent Array, Autograd, Device, Dtype, Compile, Distributed
- how to port a PyTorch layer to 1) explicit PyTorch
- and then to 2) write the backward pass
- 3) port forward & backward pass to C
- 4) string all the layers together
- achieving one file of C with no dependencies that compiles and runs ~instantly, where all memory is pre-planned and allocated a single time, fully deterministic, portable code that can run on a potato or a von Neumann probe
- how most of llm.c was built at 1am-7am in a water villa porch in Maldives and why this is the recommended way to develop software
- convert all of it to run in CUDA on GPU in fp32
- port matmul to cuBLAS
- port attention to cuDNN flash-attention
- introduce bfloat16 mixed precision
- introduce many more optimizations and features like kernel fusions, Packed128, stochastic rounding, full determinism
- add multi-GPU training, NCCL, sharded optimizer
- add multi-node with MPI or file system or socket
- reproduce GPT-2 (1.6B) on one 8XH100 node in 24 hours for $672 in llm.c, achieving (at the time) 29% less memory, 19% faster training than PyTorch nightly, and much faster compile & run
- how open source development attracts Avengers from the internet
- port to training Llama 3 imminent (branch exists)
- many other notable forks
- last thought: how software abstractions like Python/PyTorch and everything else really exist only because humans are finite in knowledge, IQ and attention, and how with increasing AI capability LLMs may export custom binaries like llm.c for any application directly, tearing apart and refactoring all abstractions as needed.
More links in reply

Andrej Karpathy

335,861 views • 1 year ago