I recently gave a talk at Anthropic on automating alignment research, covering the content from my (just released) essay on the topic. Video here and on YouTube; link to transcript in thread.
28,039 views • 1 year ago • via X (Twitter)
Comments: 5

Transcript here:

You're leaving out the biggest and most realistic failure mode: the humans doing the work and controlling what happens not being aligned or committed, and instead using your research to accelerate capabilities so they can become immortal or make more money.

Looks interesting and deeply important; looking forward to watching it

The answer to your title slide is “no”.

I can't access the full text of this, but from a glance at the abstract, to me it looks like it's going to be an over-broad conclusion, and is going to imply that e.g. we can't expect an image classifier to correctly classify cat pictures it hasn't seen before, and/or cat pictures that are slightly outside the training distribution, because there are an infinite/suitably-large number of grue-like functions it could be implementing where it gets that classification wrong. And I expect the argument to go wrong for the same reason that it goes wrong for cat classification, and for grue-like induction problems in general (for example, in human science) -- namely, it's going to neglect something related to priors, simplicity, inductive biases, etc. (But also: alignment doesn't require that we know the exact function an LLM implements -- we just need to know enough about how it generalizes to be confident it will behave well on the inputs we care about.)
