I recently gave a talk at Anthropic on automating alignment research, covering the content from my (just released) essay on the topic. Video here and on youtube, link to transcript in thread.

28,039 views • 1 year ago • via X (Twitter)

Comments: 5

Joe Carlsmith • 1 year ago

Transcript here:

Holly ⏸️ Elmore • 1 year ago

You're leaving out the biggest and most realistic failure mode: the humans doing the work/controlling what happens not being aligned or committed, and using your research to accelerate capabilities to be immortal or make them more money instead.

Roman Hauksson • 1 year ago

Looks interesting and deeply important; looking forward to watching it

Marcus Arvan • 1 year ago

The answer to your title slide is “no”.

Joe Carlsmith • 1 year ago

I can't access the full text of this, but from a glance at the abstract, to me it looks like it's going to be an over-broad conclusion, and is going to imply that e.g. we can't expect an image classifier to correctly classify cat pictures it hasn't seen before, and/or cat pictures that are slightly outside the training distribution, because there are an infinite/suitably-large number of grue-like functions it could be implementing where it gets that classification wrong. And I expect the argument to go wrong for the same reason that it goes wrong for cat classification, and for grue-like induction problems in general (for example, in human science) -- namely, it's going to neglect something related to priors, simplicity, inductive biases, etc. (But also: alignment doesn't require that we know the exact function an LLM implements -- we just need to know enough about how it generalizes to be confident it will behave well on the inputs we care about.)
