This layer in DINO-v2 dedicates about half its attention mass to a single operation. Each of the sixteen heads independently learns the same circuit to perform this task. What is this all-important operation? The “no-op”. That’s right, we’re spending half our computation to do… absolutely nothing.

97,147 views • 1 year ago • via X (Twitter)

10 comments

Rudy Gilman • 1 year ago

The “no-op” acts as an attention sink. It allows a head to not move anything if it doesn’t want to. The circuit works by attending strongly to a single spatial location and then moving a “nothing vector” in the value projection. Each head uses one or two of its QK pairwise detectors to attend to the no-op position. This is the threshold. If other QK matches want to count for anything in the softmax they have to exceed this threshold.
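
A minimal sketch of the threshold effect described above, in NumPy. The logit values and the helper name `attention_with_sink` are illustrative assumptions, not measurements from DINO-v2. It shows that a content match only takes meaningful attention mass once its logit exceeds the sink's logit.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def attention_with_sink(content_logits, sink_logit):
    """Return (sink_weight, content_weights) for a single query.

    Hypothetical helper: the sink logit is prepended to the content logits
    so that index 0 of the softmax output is the no-op position.
    """
    logits = np.concatenate(([sink_logit], content_logits))
    weights = softmax(logits)
    return weights[0], weights[1:]

SINK_LOGIT = 6.0  # assumed value, chosen only for illustration

weak_matches = np.array([1.0, 2.0, 0.5])    # all below the sink logit
strong_match = np.array([1.0, 9.0, 0.5])    # one logit above the sink

sink_weak, _ = attention_with_sink(weak_matches, SINK_LOGIT)
sink_strong, content_strong = attention_with_sink(strong_match, SINK_LOGIT)

print(f"sink weight with weak matches:  {sink_weak:.3f}")
print(f"sink weight with strong match:  {sink_strong:.3f}")
print(f"weight taken by strong match:   {content_strong[1]:.3f}")
```

With only weak matches, nearly all of the mass lands on the sink. Once a single logit clears the sink's value, that match takes over the row.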

Rudy Gilman • 1 year ago

If we look at the K projection, we see that one channel looks for the no-op token. The matching Q channel fires all the time, thus creating a coupled QK detector that promotes attention to the no-op location regardless of Q location. We can see that it’s a “nothing vector” by looking at the size of the vector in the V projection.
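
The coupled detector can be mimicked with synthetic tensors. This is a toy sketch under stated assumptions: the shapes, the channel index `DETECTOR_CH`, the position `NOOP_POS`, and all magnitudes are made up for illustration and are not taken from the actual DINO-v2 weights.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, head_dim = 6, 8
NOOP_POS = 2      # hypothetical index of the no-op token
DETECTOR_CH = 0   # hypothetical channel carrying the coupled QK detector

# Small random background activity in every channel.
q = rng.normal(scale=0.1, size=(num_tokens, head_dim))
k = rng.normal(scale=0.1, size=(num_tokens, head_dim))
v = rng.normal(size=(num_tokens, head_dim))

q[:, DETECTOR_CH] = 4.0          # Q channel fires for every query position
k[:, DETECTOR_CH] = 0.0
k[NOOP_POS, DETECTOR_CH] = 4.0   # K channel fires only at the no-op token
v[NOOP_POS] *= 0.01              # "nothing vector": tiny value projection

# Standard scaled dot-product attention, one row per query.
scores = q @ k.T / np.sqrt(head_dim)
scores -= scores.max(axis=-1, keepdims=True)
attn = np.exp(scores)
attn /= attn.sum(axis=-1, keepdims=True)

noop_attention = attn[:, NOOP_POS]         # attention paid to the no-op, per query
value_norms = np.linalg.norm(v, axis=-1)   # size of each token's value vector

print("attention to no-op per query:", np.round(noop_attention, 3))
print("value norms per token:       ", np.round(value_norms, 3))
```

Every query row sends most of its mass to the no-op position, and the value vector there is far smaller than the others, so very little is actually moved.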

Rudy Gilman • 1 year ago

The “no-op” isn’t a new finding. Clark et al. saw it in BERT, where it attends to the SEP token. Kobayashi points out that you need to weight your attn patterns by the magnitude of your V projection if you want a real sense of what’s being moved. Anthropic reiterated Kobayashi’s point in their transformer circuits work. I was just surprised by the sheer mass of attention we’re spending on the no-op, especially when this attention matrix is the memory bottleneck!
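
A simplified sketch of weighting attention by value magnitude. It assumes the plain L2 norm of each value vector is a good enough proxy and ignores any output projection, so it is an approximation of the idea rather than the exact published method. The function name and the numbers are hypothetical.

```python
import numpy as np

def norm_weighted_attention(attn, values):
    """Scale each attention weight by the L2 norm of the value vector it moves.

    attn:   (num_queries, num_keys) attention weights, rows sum to 1
    values: (num_keys, head_dim) value vectors
    """
    value_norms = np.linalg.norm(values, axis=-1)  # shape (num_keys,)
    return attn * value_norms                      # broadcasts across queries

# One query over three keys; key 0 plays the role of the no-op token.
attn = np.array([[0.5, 0.3, 0.2]])
values = np.array([
    [0.01, 0.0],   # near-zero "nothing vector"
    [3.0, 4.0],    # norm 5
    [0.0, 5.0],    # norm 5
])

weighted = norm_weighted_attention(attn, values)
share = weighted / weighted.sum(axis=-1, keepdims=True)

print("raw attention:      ", attn[0])
print("norm-weighted share:", np.round(share[0], 3))
```

The raw pattern puts half the mass on key 0, yet after weighting by value norm its share of what is moved falls below one percent.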

Rudy Gilman • 1 year ago

For the record I’ve never liked the softmax. Feels coercive. Activations in linear / conv layers are voluntary, and indeed many neurons choose to not fire most of the time. But with softmax attention we’re forcing the model to develop this complex “no-op” apparatus just to get what other layers enjoy naturally: the ability to not fire when the pattern it’s looking for isn’t there!
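
To make the contrast concrete, here is a small sketch comparing a ReLU unit with a softmax row when nothing matches. The pre-activation values are invented for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    exp = np.exp(x - np.max(x))
    return exp / exp.sum()

# Hypothetical pre-activations where no pattern matches: all strongly negative.
no_match = np.array([-5.0, -5.0, -5.0, -5.0])

relu_out = relu(no_match)        # every unit is free to stay silent
softmax_out = softmax(no_match)  # the row is still forced to sum to 1

print("ReLU output:   ", relu_out, "sum =", relu_out.sum())
print("softmax output:", softmax_out, "sum =", softmax_out.sum())
```

The ReLU output is all zeros, while the softmax spreads its mass evenly over the keys even though none of them matched. Without somewhere harmless to put that mass, the head has to move something.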

Mingjie Sun • 1 year ago

Check out our paper. In the last section, we explain this phenomenon on DINO-v2 with registers.

Rudy Gilman • 1 year ago

This is a great reference, thank you! Definitely worth the read.

Zach Nussbaum • 1 year ago

I’m pretty sure there are ViTs trained with registers/sinks based on that finding!

Rudy Gilman • 1 year ago

Yes, I suspect these are related to @TimDarcet's registers! More evidence for this is that the no-op detectors also fire somewhat for the CLS token, which we know has global information.

uɐɥdǝʇS • 1 year ago

wasn't this supposed to be a fix for some of this or am I misunderstanding?

Rudy Gilman • 1 year ago

Yes, this definitely seems connected to the phenomenon of registers. You are understanding correctly!
