This layer in DINO-v2 dedicates about half its attention mass to a single operation. Each of the sixteen heads independently learns the same circuit to perform this task. What is this all-important operation? The “no-op”. That’s right, we’re spending half our computation to do… absolutely nothing.
97,147 views • 1 year ago • via X (Twitter)
10 comments

The “no-op” acts as an attention sink. It allows a head to not move anything if it doesn’t want to. The circuit works by attending strongly to a single spatial location and then moving a “nothing vector” in the value projection. Each head uses one or two of its QK pairwise detectors to attend to the no-op position. The resulting no-op logit sets the threshold: if other QK matches want to count for anything in the softmax, they have to exceed it.
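To make the threshold idea concrete, here is a minimal sketch in plain Python. The logit values and the helper name `sink_mass` are made up for illustration and are not measured from DINO-v2. It just shows that a fixed sink logit soaks up almost all the attention until some other match exceeds it.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sink_mass(sink_logit, patch_logits):
    """Fraction of attention that lands on the no-op (sink) position."""
    return softmax([sink_logit] + list(patch_logits))[0]

SINK_LOGIT = 6.0  # hypothetical constant logit toward the no-op position

# No patch matches strongly: the sink wins and the head does "nothing".
weak = sink_mass(SINK_LOGIT, [0.5, 1.0, 0.2, 0.8])

# One patch exceeds the threshold: attention moves off the sink.
strong = sink_mass(SINK_LOGIT, [0.5, 9.0, 0.2, 0.8])

print(f"sink mass, weak matches:  {weak:.3f}")
print(f"sink mass, strong match:  {strong:.3f}")
```

Raising or lowering `SINK_LOGIT` in this toy moves the bar that a real match has to clear.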

If we look at the K projection, we see that one channel looks for the no-op token. The matching Q channel fires all the time, thus creating a coupled QK detector that promotes attention to the no-op location regardless of Q location. We can see that it’s a “nothing vector” by looking at the size of the vector in the V projection.
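A toy version of that coupled detector, again as a sketch under assumptions: the matrices below are hand-written, channel 0 is arbitrarily chosen as the hypothetical “no-op channel”, and the usual 1/sqrt(d) scaling is left out for brevity. The K channel is large only at the no-op token, the Q channel is constant everywhere, and the V vector at the no-op token is tiny.

```python
import math

SINK = 0  # hypothetical index of the no-op token

# 4 tokens x 3 channels. Channel 0 of K fires only at the no-op token.
K = [[8.0,  0.1, -0.2],
     [0.0,  1.0,  0.3],
     [0.0, -0.5,  1.2],
     [0.0,  0.7, -0.9]]

# Channel 0 of Q is "always on", whatever the query position is.
Q = [[1.0,  0.2,  0.1],
     [1.0, -0.6,  0.4],
     [1.0,  0.9, -0.3],
     [1.0,  0.0,  0.8]]

# The value vector at the no-op token is close to zero: a "nothing vector".
V = [[0.01, -0.02, 0.01],
     [1.0,   0.5, -0.7],
     [-0.4,  0.9,  0.6],
     [0.8,  -0.3,  1.1]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Attention pattern: one softmax row per query position.
attn = [softmax([dot(q, k) for k in K]) for q in Q]

# Where does each query attend most, and how big is each value vector?
top_key = [row.index(max(row)) for row in attn]
v_norms = [math.sqrt(dot(v, v)) for v in V]

print("top key per query:", top_key)
print("value norms:", [round(n, 3) for n in v_norms])
```

Every query position ends up attending mostly to the no-op token, and the vector it would move from there is close to zero.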

The “no-op” isn’t a new finding. Clark et al. saw it in BERT, where it attends to the SEP token. Kobayashi points out that you need to weight your attention patterns by the magnitude of your V projection if you want a real sense of what’s being moved. Anthropic reiterated Kobayashi’s point in their transformer circuits work. I was just surprised by the sheer mass of attention we’re spending on the no-op, especially when this attention matrix is the memory bottleneck!
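A rough sketch of that norm weighting, with made-up numbers. This is a simplification: it scales each attention weight by the norm of the value vector only and ignores the output projection, and the renormalization at the end is my own choice for readability. The function name `effective_weights` is hypothetical.

```python
import math

def effective_weights(attn_row, values):
    """Scale each attention weight by the norm of the value vector it moves,
    then renormalize so the result sums to 1."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in values]
    weighted = [a * n for a, n in zip(attn_row, norms)]
    total = sum(weighted)
    return [w / total for w in weighted]

# Hypothetical attention row: half the mass sits on the no-op token (index 0).
attn_row = [0.5, 0.2, 0.2, 0.1]

# Hypothetical value vectors: near-zero at the no-op token, unit norm elsewhere.
values = [[0.01, 0.0, 0.01],
          [0.6,  0.8, 0.0],
          [0.0,  1.0, 0.0],
          [1.0,  0.0, 0.0]]

eff = effective_weights(attn_row, values)
print("raw attention:      ", attn_row)
print("norm-weighted share:", [round(w, 3) for w in eff])
```

In this toy the no-op token holds half of the raw attention but only a percent or so of what actually gets moved.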

For the record I’ve never liked the softmax. Feels coercive. Activations in linear / conv layers are voluntary, and indeed many neurons choose to not fire most of the time. But with softmax attention we’re forcing the model to develop this complex “no-op” apparatus just to get what other layers enjoy naturally: the ability to not fire when the pattern it’s looking for isn’t there!

Check out our paper. In the last section, we explain this phenomenon on DINO-v2 with registers.

This is a great reference, thank you! Definitely worth the read.

i’m pretty sure there are vits trained with registers/sinks based on that finding!

Yes, I suspect these are related to @TimDarcet's registers! More evidence for this is that the no-op detectors also fire somewhat for the CLS token, which we know has global information.

wasn't this supposed to be a fix for some of this or am I misunderstanding?

Yes this definitely seems connected to the phenomenon of registers, you are understanding correctly!
