
Rudy Gilman
@rgilman33 • 3,109 subscribers
substrate agnostic

DINO-v3 has a single high-magnitude channel on its residual pathway: channel 416. Turning off this one channel changes DINO's entire output by 50-80%. For context, turning off a random channel has an effect of less than one percent. The model builds up channel 416 in its last two layer-scale operations, using a single high-magnitude weight in each op to drastically ramp up the channel's magnitude. This channel doesn't depend on the input: every image gets the same constant overlay. After bringing channel 416 up to a value of about ten thousand, DINO-v3 scales it down to almost nothing in the final layer-norm, removing it without a trace.
Rudy Gilman • 85,749 views • 9 months ago
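To make the build-up-then-erase pattern concrete, here is a toy sketch in plain Python. It is not DINO-v3's code. The 8-channel width, the channel index `BIG`, the branch values, and the near-zero layer-norm weight on that channel are all assumptions chosen for illustration, and the layer-norm bias is left out.

```python
import math

def layer_scale(x, gamma):
    """Layer-scale: multiply each channel by its own learned gain."""
    return [g * v for g, v in zip(gamma, x)]

def layer_norm(x, weight, eps=1e-6):
    """Layer-norm over channels, then a per-channel weight (bias omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [w * (v - mean) * inv_std for w, v in zip(weight, x)]

BIG = 3  # hypothetical stand-in for channel 416 in an 8-channel toy model

# Made-up branch output; only channel BIG carries a large-ish constant.
branch = [0.5, -1.0, 0.25, 10.0, 0.75, -0.5, 1.0, -0.25]

# One high-magnitude layer-scale weight ramps up the chosen channel.
gamma = [1.0] * len(branch)
gamma[BIG] = 1000.0
resid = layer_scale(branch, gamma)  # channel BIG now sits at ten thousand

# Assumed: the final layer-norm has a tiny weight on that same channel.
ln_weight = [1.0] * len(branch)
ln_weight[BIG] = 1e-4
out = layer_norm(resid, ln_weight)

print("residual value on big channel:", resid[BIG])
print("output value on big channel:  ", out[BIG])
```

In this toy, the big channel also dominates the layer-norm statistics, so it sets the scale of every other channel before it disappears from the output.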

The VAE used in SDXL has extremely high-magnitude "splotches" in its latents. The individual neurons in these splotches fire with magnitudes close to a million. These aren't some accident of training or initialization. The model creates them for a specific reason: to circumvent the group-norm operations.
Rudy Gilman • 104,780 views • 1 year ago
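One way a huge outlier can get around group-norm, sketched in plain Python. This is an illustration under assumptions, not an analysis of the SDXL VAE itself: the group size, the values, and the helper names are made up, and the affine weight and bias are omitted. Ordinarily group-norm erases the overall magnitude of a group. With a splotch in the group, the splotch sets the statistics, so the magnitude of everything else passes through.

```python
import math

def group_norm(group, eps=1e-6):
    """Normalize one group to zero mean and unit variance (no affine params)."""
    mean = sum(group) / len(group)
    var = sum((v - mean) ** 2 for v in group) / len(group)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in group]

def spread(values):
    """Range of the values, used here as a simple magnitude measure."""
    return max(values) - min(values)

signal = [1.0, 2.0, 3.0, 4.0]            # made-up "ordinary" activations
louder = [10.0 * v for v in signal]      # same pattern, ten times larger
SPLOTCH = 1e6                            # one outlier, magnitude from the post

# Without the splotch: both inputs normalize to (almost) the same output,
# so the 10x difference in magnitude is erased.
plain_a = group_norm(signal)
plain_b = group_norm(louder)

# With the splotch: the outlier dominates mean and variance, so the ordinary
# entries keep their relative magnitude after normalization.
splotch_a = group_norm(signal + [SPLOTCH])[:4]
splotch_b = group_norm(louder + [SPLOTCH])[:4]

ratio_plain = spread(plain_b) / spread(plain_a)
ratio_splotch = spread(splotch_b) / spread(splotch_a)
print("magnitude ratio after group-norm, no splotch:  ", ratio_plain)
print("magnitude ratio after group-norm, with splotch:", ratio_splotch)
```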

This layer in DINO-v2 dedicates about half its attention mass to a single operation. Each of the sixteen heads independently learns the same circuit to perform this task. What is this all-important operation? The “no-op”. That’s right, we’re spending half our computation to do… absolutely nothing.
Rudy Gilman • 97,122 views • 1 year ago
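A toy sketch of how an attention head can spend mass on doing nothing. The mechanism here is an assumption, not something the post spells out: the head parks attention on a "sink" token whose value vector is zero, so that share of the attention mass contributes nothing to the output. Token count, scores, and values are made up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, values):
    """Single-query attention: softmax weights, then weighted sum of values."""
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, out

# Token 0 is the hypothetical sink: its value vector is all zeros.
values = [[0.0, 0.0], [1.0, 2.0], [3.0, -1.0]]

# Sink score log(2) vs. two tokens at score 0: the sink gets half the mass.
half_weights, half_out = attend([math.log(2.0), 0.0, 0.0], values)

# A much larger sink score turns the whole head into a no-op.
noop_weights, noop_out = attend([20.0, 0.0, 0.0], values)

print("weights with half the mass on the sink:", half_weights)
print("head output with half the mass on the sink:", half_out)
print("head output when the sink takes nearly everything:", noop_out)
```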

The majority of features in this layer of Siglip-2 are multimodal. I'd expected some multimodality, but was surprised that two-thirds of the neurons I tested bind together their visual and linguistic features. This neuron fires for images of mustaches and for the word "mustache".
Rudy Gilman • 18,518 views • 1 year ago