Rudy Gilman

@rgilman33 • 3,140 subscribers

substrate agnostic

Shorts

The attention layers in the VAEs for FLUX, Stable Diffusion 3.5, and SDXL don't do anything. You can ablate them with almost no effect. At first I thought they might be involved in some clever circuitry, maybe moving global information, but no, they're just flailing around doing nothing.

88,587 views

Half the channels in the last layer of the SDXL VAE are disabled intentionally by the model itself. I've never seen this type of dead neuron before. It's like a form of apoptosis, where the model kills off a part of itself so the rest may thrive. This is the final conv layer, where we produce the output RGB. Notice that half the neurons never contribute. The weights on the ignored channels are tiny.

59,955 views
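
One way to check a claim like this is to compare the weight norm attached to each input channel of the final conv. The sketch below is a minimal illustration on made-up weights, not the SDXL VAE's actual parameters. The helper names `channel_norms` and `find_dead_channels` and the 1% cutoff are assumptions chosen for the example.

```python
import math
import random

def channel_norms(weight):
    """L2 norm of the weights attached to each input channel.

    `weight` is a nested list shaped [out_channels][in_channels][kernel_taps],
    i.e. a conv kernel with its spatial dimensions flattened.
    """
    n_in = len(weight[0])
    norms = []
    for c in range(n_in):
        sq = sum(w * w for out in weight for w in out[c])
        norms.append(math.sqrt(sq))
    return norms

def find_dead_channels(weight, rel_threshold=0.01):
    """Input channels whose weight norm is below rel_threshold * the largest norm."""
    norms = channel_norms(weight)
    cutoff = rel_threshold * max(norms)
    return [c for c, n in enumerate(norms) if n < cutoff]

# Toy stand-in for a final conv: 3 output channels (RGB), 8 input channels,
# 3x3 kernel. Even-numbered input channels get tiny weights to mimic the
# "disabled" half described above (hypothetical data).
rng = random.Random(0)
toy_weight = [
    [
        [rng.gauss(0, 1) * (1e-4 if c % 2 == 0 else 1.0) for _ in range(9)]
        for c in range(8)
    ]
    for _ in range(3)
]

dead = find_dead_channels(toy_weight)
print(dead)
```

Small weights alone don't prove a channel is unused, since a tiny weight can still matter if the incoming activation is very large, so comparing the weight-times-activation contribution is a sensible follow-up.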

The SDXL VAE models a substantial amount of noise. Things we can't even see. It meticulously encodes the noise, uses precious bottleneck capacity to store it, then faithfully reconstructs it in the decoder. I grabbed what I thought was a simple black vector circle on a white background, but the VAE latched on to a veritable Navajo quilt of background noise. I was absolutely flummoxed until I realized what was going on! A lot of capacity is being used on things we don't need to model at all. The suspected culprit is group norm.

52,110 views

One of the reasons I like distillation is that you can use it to combine the strengths of different models. RADIO from Pavlo Molchanov and the team at NVIDIA has the smooth, rich feature maps of DINO-v2 and it can read like SigLIP / CLIP. RADIO even uses TimDarcet's registers to consolidate global information and keep the spatial features clean! I think we'll see lots more of this approach in the future. Here's the smoothness you get from distilling DINO-v2 and adding registers. No spatial artifacts.

43,046 views

The later features in DINO-v2 are more abstract and semantically meaningful than I'd expected from the training objectives. This neuron responds only to hugs. Nothing else, just hugs.

34,378 views

This is SigLIP-2's dedicated DEI neuron. It fires for LGBTQ and indigenous flags, BLM imagery, and especially mixed-race groups doing happy things together (e.g. business meetings, jumping triumphantly, reaching summits, etc.).

23,092 views

A challenge for ML diagnosticians: The patient, Segment-Anything-Model-2 (SAM-2), is presenting with two prominent symptoms: Symptom 1) Extremely high-magnitude tokens spread evenly across spatial dimensions.

25,936 views

Videos

DINO-v3 has a single high-magnitude channel on its residual pathway, channel 416. Turning off this single channel changes DINO's entire output by 50-80%. For context, turning off a random channel has an effect of less than one percent. The model builds up channel 416 in its last two layer-scale operations, using a single high-magnitude weight in each op to drastically ramp up channel 416's magnitude. This channel doesn't depend on the input: every image gets the same constant overlay. After bringing channel 416 up to a value of about ten thousand, DINO-v3 then scales it down in the final layer-norm to almost nothing, removing it without a trace.

85,984 views • 10 months ago
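
A single-channel ablation like the one described above can be sketched as: zero one channel, rerun the downstream computation, and report the relative change in the output. The toy below is not DINO-v3. It uses random tokens with one constant high-magnitude channel and a plain layer norm as the only downstream op, which are assumptions made for illustration, so the numbers are not comparable to the 50-80% figure. It shows the measurement itself, and that a large constant channel can dominate layer-norm statistics.

```python
import math
import random

def layer_norm(token, eps=1e-6):
    """Normalize one token (a list of channel values) to zero mean, unit variance."""
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)
    return [(v - mean) / math.sqrt(var + eps) for v in token]

def ablation_effect(tokens, channel):
    """Relative L2 change in the layer-normed output when `channel` is zeroed."""
    diff_sq = 0.0
    base_sq = 0.0
    for token in tokens:
        ablated = list(token)
        ablated[channel] = 0.0
        before = layer_norm(token)
        after = layer_norm(ablated)
        diff_sq += sum((a - b) ** 2 for a, b in zip(after, before))
        base_sq += sum(b * b for b in before)
    return math.sqrt(diff_sq / base_sq)

# Hypothetical toy activations: 16 tokens, 32 channels, unit-scale noise.
rng = random.Random(0)
n_tokens, n_channels, big_channel = 16, 32, 5
tokens = [[rng.gauss(0, 1) for _ in range(n_channels)] for _ in range(n_tokens)]
for token in tokens:
    token[big_channel] = 1000.0  # constant, input-independent, high magnitude

effect_big = ablation_effect(tokens, big_channel)
effect_other = ablation_effect(tokens, 7)  # an ordinary channel for comparison
print(f"ablate high-magnitude channel: {effect_big:.2f}")
print(f"ablate ordinary channel:       {effect_other:.4f}")
```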

The VAE used in SDXL has extremely high-magnitude "splotches" in its latents. The individual neurons in these blobs fire with magnitudes of close to a million. These aren't some accident of training or initialization—the model creates these high-magnitude splotches for a specific reason: to circumvent the group-norm operations.

104,849 views • 1 year ago
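
To make the group-norm point concrete, here is a small sketch of a general property of group norm, offered as one plausible reading rather than a reconstruction of what the SDXL VAE actually does. Group norm normally erases the overall scale of a group's activations. If a few huge values share the group, they set the mean and variance, and the scale of everything else passes through almost untouched. The function names, group size, and outlier count are assumptions chosen for the example.

```python
import random
import statistics

def group_norm(values, eps=1e-6):
    """Normalize one group of activations to zero mean and unit variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / (std + eps) for v in values]

def normalized_spread(activations, n_outliers=0, outlier_value=1e6):
    """Std of the ordinary activations after group norm, with optional
    high-magnitude outliers sharing the same group."""
    group = list(activations) + [outlier_value] * n_outliers
    normed = group_norm(group)
    return statistics.pstdev(normed[: len(activations)])

rng = random.Random(0)
acts = [rng.gauss(0, 1) for _ in range(1000)]
doubled = [2 * a for a in acts]

# Without outliers, doubling the input is normalized away (ratio close to 1).
ratio_plain = normalized_spread(doubled) / normalized_spread(acts)
# With a few huge values in the group, the outliers set the statistics, so the
# doubling survives normalization (ratio close to 2).
ratio_outlier = normalized_spread(doubled, 10) / normalized_spread(acts, 10)
print(round(ratio_plain, 3), round(ratio_outlier, 3))
```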

This layer in DINO-v2 dedicates about half its attention mass to a single operation. Each of the sixteen heads independently learns the same circuit to perform this task. What is this all-important operation? The “no-op”. That’s right, we’re spending half our computation to do… absolutely nothing.

97,147 views • 1 year ago

SDXL-turbo isn't given positional information—so it makes its own. You can see the positional grid forming in the first few blocks, starting from the borders and rippling inwards, carried by successive layers of 3x3 convs.

68,684 views • 1 year ago
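
A toy version of the border effect can be sketched as follows, assuming the convs use zero padding (the post doesn't say which padding is used, so that is my assumption). A 3x3 mean filter stands in for a learned conv. Starting from a perfectly constant feature map, each layer lets the border signal reach one pixel further inwards.

```python
def box_conv3x3(image):
    """3x3 mean filter with zero padding, a stand-in for a learned 3x3 conv."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:  # outside the image counts as zero
                        total += image[y][x]
            out[i][j] = total / 9.0
    return out

def affected_depth(image, tol=1e-12):
    """Along the middle row, count how many pixels in from the left border
    differ from the centre value."""
    mid = len(image) // 2
    centre = image[mid][mid]
    row = image[mid]
    depth = 0
    while depth < mid and abs(row[depth] - centre) > tol:
        depth += 1
    return depth

# A constant 15x15 feature map carries no positional information at all.
feature_map = [[1.0] * 15 for _ in range(15)]
depths = []
for _ in range(4):
    feature_map = box_conv3x3(feature_map)
    depths.append(affected_depth(feature_map))
print(depths)  # border influence grows by one pixel per layer: [1, 2, 3, 4]
```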

Do diffusion models want Perlin-like noise? SDXL-turbo spends the first blocks simply creating Perlin-like noise of different frequencies, amplitudes and orientations. The model ignores the prompt conditioning during these layers.

59,764 views • 1 year ago

The majority of features in this layer of SigLIP-2 are multimodal. I'd expected some multimodality but was surprised that two-thirds of the neurons I tested bind together their visual and linguistic features. This neuron fires for images of mustaches and for the word "mustache".

18,519 views • 1 year ago

Introducing darkspark, a GUI for your neural network. It traces your PyTorch code and brings up a visual representation for you to interact with. We have a hosted gallery of popular model architectures pre-traced and ready to explore. Here’s stable-diffusion-v1.5.

16,783 views • 1 year ago

Through the length of SDXL-Turbo you can see the image forming in the activations like an object taking shape in the afternoon clouds. Like David emerging from a block of granite. At two-thirds through the model you can see the image clearly. It's spatially crisp, refined. Then in the last third of the model something strange but predictable happens: the model translates this well-formed image back into noise. By the output of the U-net, the activations are again completely uninterpretable. Why does this happen? Because we're asking the model to predict the noise rather than the clean image. This is awkward. Like Michelangelo visualizing the chips of granite surrounding the David rather than the David itself. Like asking a friend to look at a cloud and imagine not a shape, but the extra bits of cloud that when removed will leave a shape remaining.

11,641 views • 1 year ago
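
The relationship between predicting the noise and predicting the clean image can be written down directly. The sketch below assumes the standard DDPM-style forward process, x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps. Whether SDXL-Turbo's sampler uses exactly this form is an assumption on my part, and the pixel values and `alpha_bar` are toy numbers. Under that assumption, a noise prediction and a clean-image prediction carry the same information, just rearranged.

```python
import math

def add_noise(x0, eps, alpha_bar):
    """Forward process: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]

def x0_from_eps(x_t, eps_pred, alpha_bar):
    """Recover the clean-image estimate implied by a noise prediction."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(xt - b * e) / a for xt, e in zip(x_t, eps_pred)]

x0 = [0.5, -1.0, 0.25]   # toy "clean image" pixels
eps = [0.3, 0.1, -0.7]   # toy noise sample
alpha_bar = 0.2          # hypothetical cumulative noise-schedule value

x_t = add_noise(x0, eps, alpha_bar)
recovered = x0_from_eps(x_t, eps, alpha_bar)  # using a perfect noise prediction
print(recovered)
```

With a perfect noise prediction the clean image comes back up to floating-point error, which is why the model has to hold the well-formed image internally even though its output looks like noise.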
