Loading video...
Video Failed to Load
Text-to-image diffusion transformer models learn to align text and image representations as a byproduct of their conditional denoising task. By taking the dot product between the text and image representations of a DiT model (like Flux 2), you can create rich saliency maps.
94,065 views • 5 months ago •via X (Twitter)
0 Comments
No comments available
Comments from the original post will appear here

