正在加载视频...
视频加载失败
Text-to-image diffusion transformer models learn to align text and image representations as a byproduct of their conditional denoising task. By taking the dot product between the text and image representations of a DiT model (like Flux 2), you can create rich saliency maps.
94,065 次观看 • 6 个月前 •via X (Twitter)
0 条评论
暂无评论
原始帖子的评论将显示在这里

