Do 3D reconstruction transformers really need a billion parameters, or are most of those layers just doing the same thing over and over? Introducing Déjà View: a single transformer block, looped K times, that matches or beats models 8–10× its size with lower compute. 🧵
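The weight-sharing idea in the post (one block applied K times instead of K distinct blocks) can be illustrated with a toy sketch. This is not the Déjà View architecture: the residual block, the dimensions, and every name below (`make_block`, `apply_block`, `count_params`) are hypothetical, chosen only to show how looping one block cuts the parameter count by a factor of K. In this toy both variants perform the same number of block applications, so it says nothing about the compute savings the post reports.

```python
import random

def make_block(dim, seed):
    """Create one toy residual block: a dim x dim weight matrix plus a bias vector."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]
    bias = [0.0] * dim
    return {"weight": weight, "bias": bias}

def apply_block(block, x):
    """Residual update: x + relu(W x + b)."""
    out = []
    for i, row in enumerate(block["weight"]):
        pre = sum(w * v for w, v in zip(row, x)) + block["bias"][i]
        out.append(x[i] + max(0.0, pre))
    return out

def run(blocks, x):
    """Apply the blocks in order; a repeated entry reuses the same weights."""
    for block in blocks:
        x = apply_block(block, x)
    return x

def count_params(blocks):
    """Count parameters, counting a shared block only once."""
    unique = {id(b): b for b in blocks}
    return sum(
        len(b["weight"]) * len(b["weight"][0]) + len(b["bias"])
        for b in unique.values()
    )

DIM, K = 8, 6  # hypothetical sizes for illustration

shared = make_block(DIM, seed=0)
looped = [shared] * K                                  # one block, looped K times
stacked = [make_block(DIM, seed=s) for s in range(K)]  # K separate blocks

x0 = [1.0] * DIM
looped_out = run(looped, x0)
stacked_out = run(stacked, x0)

looped_params = count_params(looped)
stacked_params = count_params(stacked)
print(looped_params, stacked_params)
```

Both models have depth K, but the looped one stores a single block's weights, so the stacked variant holds exactly K times as many parameters.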

Tobias Fischer

1,031 subscribers

92,921 views • 1 month ago • via X (Twitter)

Science and Technology

Related videos

What are we doing people, is it just me, or does the foolery need to stop. What’s the difference than going on a field with an actual base and work on the same thing? Are we creating robots or athletes?

Austin Ford Sr.

448,948 views • 11 months ago

Wonderland: Navigating 3D Scenes from a Single Image Contributions: • First, we introduce a representation for controllable 3D generation by leveraging the generative priors from camera-guided video diffusion models. Unlike image models, video diffusion models are trained on extensive video datasets. This enables them to capture comprehensive spatial relationships within scenes across multiple views and embed a form of "3D awareness" in their latent space, which allows us to maintain 3D consistency in novel view synthesis. • Second, to achieve controllable novel view generation, we empower video models with precise control over specified camera motions. We introduce a novel dual-branch conditioning mechanism that effectively incorporates desired diverse camera trajectories into the video diffusion model. This enables expansion of a single image into a multi-view consistent capture of a 3D scene with precise pose control. • Third, to achieve efficient 3D reconstruction, we directly transform video latents into 3DGS. We propose a novel latent-based large reconstruction model (LaLRM) that lifts video latents to 3D in a feed-forward manner. With this design, during inference, our model directly predicts 3DGS from a single input image, effectively aligning the generation and reconstruction tasks—and bridging image space and 3D space—through the video latent space. Compared with reconstructing scenes from images, the video latent space offers a 256× spatial-temporal reduction while retaining essential and consistent 3D structural details. Such a high degree of compression is crucial, as it allows the LaLRM to handle a wider range of 3D scenes within the reconstruction framework, with the same memory constraints.

MrNeRF

52,801 views • 1 year ago

Create a 3D model from a single image, set of images or a text prompt in < 1 minute 😮‍💨 This new AI paper called CAT3D shows us that it’ll keep getting easier to produce 3D models from 2D images — whether it’s a sparser real world 3D scan (a few photos instead of hundreds) or your favorite 2D image generator like Midjourney (just an image). How does this magic work? “This architecture is similar to video diffusion models, but with camera pose embeddings for each image instead of time embeddings. The generated views are passed into a robust 3D reconstruction pipeline to create the 3D representation (Zip-NeRF or 3DGS)”

Bilawal Sidhu

92,792 views • 2 years ago

Show-o One Single Transformer to Unify Multimodal Understanding and Generation discuss: We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and generation. Unlike fully autoregressive models, Show-o unifies autoregressive and (discrete) diffusion modeling to adaptively handle inputs and outputs of various and mixed modalities. The unified model flexibly supports a wide range of vision-language tasks including visual question-answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed-modality generation. Across various benchmarks, it demonstrates comparable or superior performance to existing individual models with an equivalent or larger number of parameters tailored for understanding or generation. This significantly highlights its potential as a next-generation foundation model.

AK

124,048 views • 1 year ago

The US "diet industry" makes over $70 billion a year... Yet Americans are fatter than ever. Why? They sell you useless fad diets and supplements that don't work. Here are the only 8 diet rules you need for fat loss (bookmark this): 🧵 1. Eat 150g+ of protein a day

Chace Chambers

156,764 views • 7 months ago

Conducting road safety audits, scheduling maintenance, or planning logistics routes used to require combing through imagery or, worse, driving miles of roads yourself. Not anymore. New AI-powered data layers in Google Earth are changing that. Pulling from billions of Google Street View images, these layers now spot infrastructure assets like stop signs, speed limit signs, and more to map infrastructure for you. By combining these assets into a single project and diving into Street View you can locate and validate assets on Google Earth in seconds. These layers are now available for Professional & Professional Advanced customers on web and Android, with coverage expanding over the coming weeks.

Google Earth

19,822 views • 1 month ago

The US "diet industry" makes over $70 billion a year... Yet Americans are fatter than ever. Why? They sell you useless fad diets and supplements that don't work. Here are the only 8 diet rules you need for fat loss (bookmark this): 🧵 1. Eat 150g+ of protein a day

Chace Chambers

642,492 views • 10 months ago

Self-Calibrating Gaussian Splatting for Large Field of View Reconstruction Note: Check below for full video. Abstract (cited): "In this paper, we present a self-calibrating framework that jointly optimizes camera parameters, lens distortion, and 3D Gaussian representations, enabling accurate and efficient scene reconstruction. Our technique is particularly effective for high-quality scene reconstruction from large field-of-view (FOV) imagery taken with wide-angle lenses, allowing the scene to be modeled from a smaller number of images. We introduce a novel method for modeling complex lens distortions using a hybrid network that combines invertible residual networks with explicit grids. This design effectively regularizes the optimization process, achieving greater accuracy than conventional camera models. Additionally, we propose a cubemap-based resampling strategy to support large FOV images without sacrificing resolution or introducing distortion artifacts. Our method is compatible with the fast rasterization of Gaussian Splatting, adaptable to a wide variety of camera lens distortions, and demonstrates state-of-the-art performance on both synthetic and real-world datasets."

MrNeRF

17,206 views • 1 year ago

A VIEW OF FOUR STORMS A moving satellite image released by the US National Oceanic & Atmospheric Administration (NOAA) reveals the view from above last Nov. 11 over 8 hours of 4 typhoons that hit or are set to hit the Philippines this week: Marce, Nika, #OfelPH, and #PepitoPH. (Courtesy: CSU/CIRA & NOAA/NESDIS) | via Anjo Bagaoisan

ABS-CBN News

1,139,184 views • 1 year ago

This is... not a remotely accurate description of what the 2023 AI executive order did? Undersecretary Emil Michael: "If you remember the Biden executive order on AI, which was this crazy executive order that limited the amount of compute any model company could do and was essentially grandfathering in a small number of AI companies that they were gonna designate as the winners, and everyone else was out" It's not true that the EO limited the compute that AI companies could do. What it did do was require companies who were training models above a certain very high compute threshold (10^26 FLOP or 10^23 FLOP for models trained primarily on biological sequence data) to notify the government and share what testing and red teaming they were doing for certain national security risks. People are free to dislike the Biden AI EO! But it seems good to factually describe what the policy said.

Nathan Calvin

58,745 views • 4 months ago

There is no such thing as "lower abs". Your rectus abdominis is one continuous muscle. You cannot contract the bottom half on its own, and no exercise on earth isolates a region that does not anatomically exist. So all the things sold to you as "lower ab" builders are actually just glorified hip flexor exercises: - Leg raises - Reverse crunches - Flutter kicks - Scissor kicks - Mountain climbers - The dreaded "lower ab burnout circuit" None of them is carving a separate lower section. They train the same single muscle every other ab exercise trains, just with more hip flexor and more wishful thinking. The reason you cannot see your lower abs is not a missing exercise. It is the layer of fat sitting over them, which collects there last and leaves there last. So the actual programme is dull and two-part: - Get into a calorie deficit and strip the fat that is hiding them - Build the abs themselves with weighted crunches, 4-6 reps, adding load over time, like any other muscle That's it. There is no lower-ab secret. There is just a visible-ab one, and it lives in the kitchen and under a loaded cable.

Sama Hoole

27,193 views • 27 days ago

On Wednesday, Brandon Johnson was asked whether he is running for re-election. He said: “Why are you mad at me for doing what the people of Chicago elected me to do? I’ve kept every single promise. I have not lied to the people of Chicago.” Did Mayor Johnson just lie when he said that? Do you feel he has lied to the people of Chicago? Do you feel he had a mandate for his progressive agenda when Chicago voter turnout was about 35%, and he was elected with roughly 18% of the vote—or just over 50% of those who voted?

Reporter William J. Kelly #thatreporter

76,738 views • 5 months ago

Once she stepped out of that car towards him with a gun, that is not self-defense. She wanted a fight, and she killed an unarmed man at Walmart. All she had to do was stay in her car, or leave. He should have done the same. But you can't just kill someone over a parking spot and call it self-defense.

The SCIF

95,197 views • 23 days ago

Astronomers just watched a star explode - and saw its insides exposed. For the first time in history, scientists got a direct look inside a star at the moment it went supernova - revealing inner layers that had, until now, only existed in theory. A massive star 2.2 billion light-years away reached the end of its life and exploded in a brilliant burst of light. But something was off. When researchers analyzed the spectrum of light from the explosion, they didn't see the usual lighter elements like hydrogen, helium, or oxygen. Instead, they saw silicon. Sulphur. Argon. Elements normally buried deep inside a star's core. This wasn't supposed to be possible. According to stellar models, massive stars - those at least eight times the mass of our Sun - are layered like onions. Their cores are packed with heavy elements like iron, while progressively lighter layers of silicon, oxygen, and carbon sit above. Hydrogen and helium form the outermost shells. These outer layers usually obscure everything underneath. Astronomers believe the star violently ejected its outer layers in the final stages of life – not just the hydrogen and helium, but even the middle shells that hide the deeper interior. It’s possible that extreme instability in stars more than 100 times the mass of our Sun could cause this kind of shedding. While similar “pre-explosion outbursts” have been seen in other stars, this is the first time they’ve exposed the inner structure so clearly. The supernova was first detected by the Zwicky Transient Facility in California. Within 24 hours, astronomers triggered rapid follow-up observations with Hawaii’s Keck Observatory and captured the light signature before the explosion faded. That speed was critical. Supernovae evolve quickly, sometimes over just a few hours, and once the star’s material expands and cools, the deeper layers disappear from view. Read the study: Schulze, Steve, et al. “Extremely Stripped Supernova Reveals a Silicon and Sulfur Formation Site.” Nature. Credit: Keck Observatory/Adam Makarenko

Black Hole

26,579 views • 11 months ago

Transformer by hand ✍️ ~ 6 steps walkthrough below Open the hood of a transformer and the parts list is overwhelming: embeddings, positional encoding, attention weighting, self-attention, cross-attention, multi-head attention, layer norm, skip connections, softmax, linear, Nx, shifted right, query, key, value, masking. Which of those actually make the car run? Two of them. Attention weighting and the feed-forward network. Everything else is an enhancement to make it run faster and longer, which is how we got from a car to a truck, and to the word "large" in large language model. So I drew and calculated those two parts entirely by hand. Goal: push five features through one transformer block, filling in every cell yourself. 1. Given Five positions of input features, arriving from the previous block. 2. Attention matrix Let us feed all five features to a query-key module (QK) and read back an attention weight matrix, A. The details of that module are a post of their own. 3. Attention weighting We multiply the input features by A to get the attention weighted features, Z. Still five positions. The effect is to combine features *across positions*, horizontally: X1 becomes X1 + X2, X2 becomes X2 + X3, and so on. 4. First layer Let us feed all five weighted features into the first layer of the FFN. Multiply by the weights and biases. This time the combining happens *across feature dimensions*, vertically, and each feature grows from 3 numbers to 4. Note that every position goes through the same weight matrix. That is what "position-wise" means. 5. ReLU We cross out the negatives. They become zeros. 6. Second layer Let us bring it back down: 4 dimensions to 3. The output feeds the next block, which has a completely separate set of parameters, and the whole thing runs again. You have just calculated a transformer block by hand. ✍️ The takeaway: the two parts are doing two different jobs, and neither one alone is enough. 
Attention mixes *across positions*, so a feature can see its neighbours. The FFN mixes *across feature dimensions*, so each position can think about itself. Horizontal, then vertical. Then that pattern repeats N times, each block with its own separate set of weights. That is the Nx from the list up top, and that is what makes the transformer run. 💾 Save this post! #AIbyHand #Transformers #DeepLearning
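The six steps can also be traced in a few lines of plain Python. This is a minimal sketch, not the post's own worksheet: the input values, the attention matrix `A` (given directly rather than computed by a QK module, with the last position keeping only itself), and the FFN weights and biases are all made-up numbers chosen so the arithmetic stays easy to follow.

```python
def matvec(W, v, b):
    """Multiply matrix W by vector v and add bias b."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def relu(v):
    """Cross out the negatives: they become zeros."""
    return [max(0, x) for x in v]

def transformer_block(X, A, W1, b1, W2, b2):
    """Attention weighting across positions, then a position-wise FFN."""
    n, dim = len(X), len(X[0])
    # Step 3: Z[i] is a weighted sum of input positions (mixing across positions).
    Z = [[sum(A[i][j] * X[j][d] for j in range(n)) for d in range(dim)]
         for i in range(n)]
    # Steps 4-6: the same weights are applied to every position (position-wise),
    # growing 3 -> 4, applying ReLU, then shrinking 4 -> 3.
    Y = [matvec(W2, relu(matvec(W1, z, b1)), b2) for z in Z]
    return Z, Y

# Step 1: five positions, three numbers each (hypothetical values).
X = [[1, 0, 2], [0, 1, 1], [2, 1, 0], [1, 1, 1], [0, 2, 1]]

# Step 2: attention matrix, assumed given. Position i attends to itself and
# to position i+1, matching "X1 becomes X1 + X2"; the last row is an assumption.
A = [
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
]

# FFN parameters (hypothetical): first layer 3 -> 4, second layer 4 -> 3.
W1 = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, -1, 0]]
b1 = [0, 0, -1, 0]
W2 = [[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1]]
b2 = [1, 0, 0]

Z, Y = transformer_block(X, A, W1, b1, W2, b2)
print("Z =", Z)
print("Y =", Y)
```

Stacking N such blocks, each with its own `W1`, `b1`, `W2`, `b2`, gives the Nx from the parts list.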

Tom Yeh

25,559 views • 8 days ago

I'm still playing with 3D storyboards for Seedance 2.0. Even when I ask for a 2D video, using a 3D storyboard often pushes the model toward a 3D look. I haven't tested it extensively yet but there definitely seems to be a bias. So 2D storyboards work better for 2D videos, while 3D storyboards are better suited for 3D or photorealistic outputs. Another thing I found is that my video prompt is already very detailed. If I remove the storyboard section and generate the video without attaching a storyboard, the model still follows the same beats. The main difference is the environment. With a storyboard, it tends to recreate an environment that closely matches the storyboard. I also think storyboards help with certain poses and compositions. If you're curious about the prompts, I've included them below.

Kōda

40,627 views • 21 days ago

🇨🇳 Another great Chinese model, OmniHuman-1.5 from ByteDance. It turns 1 image plus a voice track into expressive avatar video by pairing a System 1 and System 2 inspired planner with a Diffusion Transformer, and produces coherent motion for over 1 minute with moving camera and multi-character scenes. Most avatar models move to the beat of the audio but miss meaning, so gestures feel generic and emotions feel shallow. The fix here is a Multimodal LLM planner that listens to the speech and drafts a structured plan describing intent, emotions, beats, and high-level actions, which gives the motion engine clear semantic targets instead of only rhythm. The motion engine is a Multimodal Diffusion Transformer that fuses the plan with audio, the single reference image, and optional text prompts, then synthesizes continuous body, face, and head motion that matches both words and tone. A key trick is a Pseudo Last Frame, a synthetic target that summarizes the next expected state, which stabilizes fusion across modalities and keeps motion consistent over long spans. From just 1 image and speech, the system outputs speaking avatars with synchronized lips, context-aware gestures, and continuous camera movement, and it also supports multi-character interactions without manual choreography. Reported results show strong lip sync accuracy, high video quality, natural motion, and close match to text prompts, and the same setup works on nonhuman characters too.

Rohan Paul

63,859 views • 10 months ago

Single vs Multi-head Attention by hand ✍️ Resize matrices yourself 👉 The most important fact about multi-head attention: it has the same parameter count as single-head attention. The difference is purely structural: same total Wqkv weights, partitioned into smaller q–k–v triples. Look at the two diagrams below. Both Wqkv matrices have the same height: same number of weight rows, same number of parameters. What changes is how that single tall block is sliced. • Left. One head. The full Wqkv produces one big QKV: a tall Q (36 rows), a tall K, a tall V. One scoring computation runs over those full-width tensors. • Right. 3 heads. The same-height Wqkv is sliced into 3 smaller q–k–v triples, each 12 rows tall. 3 scoring computations run in parallel, each a thinner version of the left. The compute trade-off, kind of. Same Wqkv weights. Multi-head runs the attention scoring S = Kᵀ × Q once per head, so the dot-product count multiplies by H. • Single-head: seq × seq = 40² = 1600 dot products • Multi-head: seq × seq × H = 40² × 3 = 4800 dot products (3×) But each multi-head dot product is narrower: its inner dimension is head_dim instead of H × head_dim. So when you count actual scalar multiplications, the totals are equal: • Single-head: seq² × (H × head_dim) = 40² × 36 = 57600 • Multi-head: seq² × H × head_dim = 40² × 3 × 12 = 57600 Same FLOPs. Multi-head buys you H independent attention patterns at no extra weight cost and no extra arithmetic cost: it's the same total compute, sliced into H finer-grained heads.
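The counting argument is easy to check with a short script. This is a sketch that covers only the scoring step S = Kᵀ × Q described in the post. The function names are hypothetical, and `input_dim` (the width of the features fed into Wqkv) is an assumed value, since the post does not state it.

```python
def score_cost(seq_len, num_heads, head_dim):
    """Return (dot products, scalar multiplications) for attention scoring."""
    dot_products = seq_len * seq_len * num_heads  # one seq x seq score grid per head
    scalar_mults = dot_products * head_dim        # each dot product spans head_dim terms
    return dot_products, scalar_mults

def qkv_param_count(input_dim, num_heads, head_dim):
    """Weights in Wqkv: three projections (Q, K, V), each num_heads * head_dim rows tall."""
    return 3 * (num_heads * head_dim) * input_dim

SEQ = 40
INPUT_DIM = 36  # assumption for illustration only

single_dots, single_mults = score_cost(SEQ, num_heads=1, head_dim=36)
multi_dots, multi_mults = score_cost(SEQ, num_heads=3, head_dim=12)

single_params = qkv_param_count(INPUT_DIM, num_heads=1, head_dim=36)
multi_params = qkv_param_count(INPUT_DIM, num_heads=3, head_dim=12)

print(single_dots, multi_dots)      # dot-product counts differ by a factor of H
print(single_mults, multi_mults)    # scalar multiplications match
print(single_params, multi_params)  # parameter counts match
```

The equality holds for any split where num_heads × head_dim stays fixed, because both totals depend only on that product.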

Tom Yeh

35,448 views • 2 months ago