Added context to my tiny diffusion model to enable sequential generation of longer outputs! Currently the context is a quarter of the sequence length (seq_len=256, context_len=64).

I have a theory that the lower the semantic value per token, the worse the “curse of parallel decoding” is. With parallel decoding, we independently predict multiple tokens in one step. With the sentence “My poker hand was a _ _”, two valid predictions are “two pair” and “straight flush”. Because each token prediction is independent, though, we can end up with a nonsensical output like “two flush”.

This seems to be exacerbated by low semantic value per token, since you now need more tokens to express the same concept. Instead of independently predicting two tokens, we might need to predict 10 (which is of course much harder). The model's output is currently noticeably worse than nanoGPT's (at a similar size), and I believe this is a main reason.

I’ll try adding confidence-aware parallel decoding (from NVIDIA’s Fast-dLLM paper) and other tricks and see how much they improve generation quality.
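The poker example can be made concrete with a toy simulation. This is only a sketch and not tiny-diffusion's code: the 50/50 joint distribution and the helper names (`marginal`, `parallel_decode`, `sequential_decode`, `invalid_rate`) are assumptions made up for illustration. It compares sampling both blanks independently from their marginals with sampling the second blank conditioned on the first.

```python
import random
from collections import Counter

# Toy joint distribution over the two blanks (assumed 50/50 for illustration).
JOINT = {("two", "pair"): 0.5, ("straight", "flush"): 0.5}


def marginal(position):
    """Marginal distribution of the token at a single position."""
    probs = Counter()
    for tokens, p in JOINT.items():
        probs[tokens[position]] += p
    return dict(probs)


def sample(dist, rng):
    """Draw one token from a {token: probability} dictionary."""
    tokens, weights = zip(*dist.items())
    return rng.choices(tokens, weights=weights, k=1)[0]


def parallel_decode(rng):
    """Fill both blanks in one step, each sampled independently."""
    return (sample(marginal(0), rng), sample(marginal(1), rng))


def sequential_decode(rng):
    """Fill the first blank, then the second conditioned on the first."""
    first = sample(marginal(0), rng)
    conditional = {t[1]: p for t, p in JOINT.items() if t[0] == first}
    total = sum(conditional.values())
    conditional = {tok: p / total for tok, p in conditional.items()}
    return (first, sample(conditional, rng))


def invalid_rate(decode, trials=10_000, seed=0):
    """Fraction of decoded pairs that are not a valid poker hand."""
    rng = random.Random(seed)
    bad = sum(decode(rng) not in JOINT for _ in range(trials))
    return bad / trials


if __name__ == "__main__":
    print("parallel  :", invalid_rate(parallel_decode))
    print("sequential:", invalid_rate(sequential_decode))
```

Under these assumptions, independent sampling produces a mixed hand like “two flush” about half the time, while conditioning on the first token never does. With 10 tokens per concept instead of two, the product of marginals has many more ways to go wrong.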

Nathan Barry

2,103 subscribers

89,040 views • 7 months ago • via X (Twitter)

Science & Technology

Related Videos

Most recent diffusion language model research (that I’ve seen) seems to be using masking as the noising process. It looks like, however, most closed-source models (Google Gemini Diffusion and possibly Inception Labs’ Mercury) use a different noising process, where instead of masking tokens, they replace them with different tokens (either with a random token or a semantically similar token). I wondered how they were getting such high throughput with the latter noising process, since I believed that optimizing inference with KV cache approximation would be more difficult (for various reasons).

I visualized this noising process with tiny-diffusion and compared it to normal unmasking, and was very surprised to see how fast the generation “settles” into a reasonable output, and then only slightly refines afterwards, requiring far fewer steps in total. Unmasking (where tokens are never remasked, the typical implementation) is inherently limited in generation speed: decoding more tokens per step leads to more errors, because we sample from per-token marginal distributions rather than from the joint distribution over those tokens. The token replacement noising process seems to have a much different set of characteristics. Because we sample each token per step, every token makes “progress” towards the final output each iteration (in addition to potentially giving other tokens more information in future steps).

Generally, masking has outperformed other noising processes, which is probably why most research has focused on it (using smaller models). But the paper referred to in the retweet shows that random replacement as a noising process may scale better as model size increases. Big labs might have noticed these results much earlier (due to having drastically more training resources and being able to test larger models), which may explain the discrepancy in the choice of noising process.

I’m gonna test this with larger models, since tiny-diffusion only has 10M parameters.
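To make the difference between the two noising processes concrete, here is a minimal sketch of the forward (corruption) step for each. It is an illustration under assumptions, not tiny-diffusion's implementation and not what any closed-source model is known to do: the function names, the `<mask>` string, and the uniform choice of replacement token are all hypothetical.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol


def mask_noise(tokens, t, rng):
    """Masking: each token becomes MASK with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def replace_noise(tokens, t, vocab, rng):
    """Replacement: each token is swapped for a uniformly random
    vocabulary token with probability t (it may match the original)."""
    return [rng.choice(vocab) if rng.random() < t else tok for tok in tokens]


if __name__ == "__main__":
    rng = random.Random(0)
    sentence = "my poker hand was a straight flush".split()
    vocab = sorted(set(sentence) | {"two", "pair", "full", "house"})
    for t in (0.25, 0.5, 1.0):
        print(t, mask_noise(sentence, t, rng))
        print(t, replace_noise(sentence, t, vocab, rng))
```

With masking, a corrupted position is marked explicitly, so the denoiser only has to fill in the masks. With replacement, every position could be noise, so the denoiser has to re-predict every token at each step, which matches the observation above that every token makes progress each iteration.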

Nathan Barry

40,331 views • 5 months ago

We are excited to introduce Mercury, the first commercial-grade diffusion large language model (dLLM)! dLLMs push the frontier of intelligence and speed with parallel, coarse-to-fine text generation.

Inception

1,911,030 views • 1 year ago

The Hidden Language of Diffusion Models paper page: tackle the challenge of understanding concept representations in text-to-image models by decomposing an input text prompt into a small set of interpretable elements. This is achieved by learning a pseudo-token that is a sparse weighted combination of tokens from the model's vocabulary, with the objective of reconstructing the images generated for the given concept. Applied over the state-of-the-art Stable Diffusion model, this decomposition reveals non-trivial and surprising structures in the representations of concepts. For example, we find that some concepts such as "a president" or "a composer" are dominated by specific instances (e.g., "Obama", "Biden") and their interpolations. Other concepts, such as "happiness" combine associated terms that can be concrete ("family", "laughter") or abstract ("friendship", "emotion"). In addition to peering into the inner workings of Stable Diffusion, our method also enables applications such as single-image decomposition to tokens, bias detection and mitigation, and semantic image manipulation

AK

41,746 views • 3 years ago

With the launch of Sky, the upgraded tokens are introduced: • $SKY: The upgraded version of MKR and governance token of the Sky ecosystem. Each MKR can be upgraded to 24,000 SKY. • $USDS (Sky Dollar): The upgraded version of DAI, USDS is a stablecoin that gives access to native token rewards.

Sky

65,887 views • 1 year ago

as an intro to mechanistic interpretability, i decided to look into the formation of induction heads, which are circuits that allow LLMs to perform in-context learning by searching for previous occurrences of a sequence to predict the next token. to form these circuits, i trained attention-only transformers to repeat varying sequence lengths of random tokens. by randomizing the sequence length, i prevented the models from relying on rote memorization, forcing them to instead develop a generalizable circuit. during this dive, i recorded some really cool findings and saw some interesting visual patterns emerging:

Xander Chin

26,486 views • 4 months ago

Excited to be partnering with Token Sports Global as their official ambassador! This app is about to change the way we all play and enjoy cricket. 🏏 Follow @tokensportsgbl & use my code SIB for exclusive updates to be in with a chance of winning up to $5000 in tokens.

BeefyBotham

23,121 views • 1 year ago

Just went over an audit of a very large Fortune 500 firm and the use of OpenClaw. I advised this client to track and isolate everything. Most listened; some did not. Unfortunately one employee using 5 MacMinis had racked up $13,000 of token use in 4 days! The output was minimal and low quality. I have about 200 audits to do in the “wow OpenClaw, MacMini” fiasco. But I can tell you, few have seen a big return on investment. Now don’t get me wrong, these can be powerful tools. The issue is AI influencers have turned many really smart folks into “like and subscribe” zombies assuming that real work is getting done. It isn’t. Not for the price paid, even with local models, the way most folks are using this. It is one reason Mr. Grok and myself formed The Zero-Human Company to show that there is a way to do this. We will open source this to save millions of dollars of burnt tokens. It is one reason I invented JouleWork. How else can you monitor real output?

Brian Roemmele

72,158 views • 1 month ago

Width and body turn are really important in the backswing. Many ams think they have a full turn only because the club gets to parallel but in reality, they barely get any turn. It’s a lot of “fake” turn. First swing is a demonstration of how it can look like the club gets to parallel but with small body turn. When I “unfold” my arms, the club is only at hip length. The second is the opposite. Some think I’m trying to keep a short backswing when in actuality my body is turning really full with width and no wrist cock. I can get to “parallel” really easily with some wrist cock.

Michael S. Kim

577,209 views • 1 year ago

Blades of the Guardians episode two is crazy. There are many battles, but I would like to draw attention to this one. We also have hand to hand combat, and it's just as good. Strong choreography with great use of the 3D camera, and the variety of techniques is impressive.

Oleksandr

13,820 views • 1 year ago

🚨BRAD GARLINGHOUSE CONFIRMS #XRP WILL REACH THE TOP SPOT!! ONCE XRP SURPASSES BTC AND CLAIMS THE #1 POSITION, IT'S EXPECTED TO REACH A MINIMUM OF $29.37 PER COIN!! ‼️ THE LEADING DEFI TOKENS ON THE XRP LEDGER ARE POISED FOR A MASSIVE RISE!! IMAGINE A MARKET CAP OF $1.65 TRILLION! THE CTF TOKEN, WITH ONLY $20 BILLION IN MARKET VALUE, COULD SOAR FROM $0.80 TO $748.50 PER COIN!! CTF TOKEN HAS A CIRCULATING SUPPLY OF JUST 119 MILLION!! HOW CAN A SUPPLY SHORTAGE NOT SPUR A SURGE!!! THE CTF TOKEN IS DESTINED FOR A HUGE UPSWING!!! Trade CTF token here: Trade CTF token on MEXC: Official Website:

JackTheRippler ©️

156,114 views • 1 year ago

Vintage Camera Experts 🎥 We appreciate all the camera operator experts and enthusiasts that have been reaching out. Ideally, we would love a robust army of imagers helping us gather as much imagery as possible - from different angles, distances, and altitudes with a variety of equipment. Two of the primary challenges for our camera operators right now are digital interference and the extreme speed and movement of the objects. For the former, there seems to be something about the phenomenon that interferes with digital camera technology. Because of this we are looking to also add a vintage camera suite. There’s lots of evidence that suggests analog imagery of the phenomenon produces better results in some circumstances. In regards to the speed and movement challenge, please see the below video clip from episode two. This is what we are dealing with. Note the speed and movement of the Tic Tacs at 100% vs. 15%. If you feel you’re up to the challenge and have something of value to add, please continue to reach out and offer solutions. 🙏🇺🇸

jakebarber

76,478 views • 1 year ago

PAYING PER MODEL IS THE DUMBEST THING IN TECH RIGHT NOW i was paying 3x what i needed to for AI inference the grid lets you buy a quality spec instead of a specific model.. it routes every request in real time to the cheapest option that qualifies swap one url and your code keeps working exactly the same openai-compatible, one line to switch, 200M free tokens to start

Robin Delta

15,729 views • 21 days ago

🚨BLACKROCK IS DEVELOPING AN #XRP ETF, BUT THE CEO REMAINS TIGHT-LIPPED ABOUT IT. THIS MAY BE BECAUSE BLACKROCK IS IN THE MIDST OF LOADING UP ON XRP! THE XRPL DEFI SPACE IS ABOUT TO EXPLODE! CTF TOKEN, A TOP DEFI ASSET ON #mXRPL, COULD SOAR FROM $0.97 TO $748.50 PER TOKEN!! WITH A LIMITED SUPPLY OF JUST 119 MILLION AND PARTNERSHIPS WITH GIANTS LIKE AMAZON AND WALMART, CTF TOKEN IS ON THE VERGE OF A MASSIVE SUPPLY SHOCK!! CTF TOKEN TRADING LINK: CTF TOKEN TRADING LINK ON MEXC: Official Site:

JackTheRippler ©️

160,832 views • 1 year ago

Amazon’s machine learning model collects 300 million data points per season and can now predict which players are likely to blitz before the snap. This is a look into the future of broadcasting.

Joe Pompliano

688,455 views • 2 years ago

The infinity mirror is a configuration of two or more parallel or angled mirrors, which are arranged to create a series of smaller and smaller reflections that appear to recede to infinity [📹 internationalfactory]

Massimo

3,310,837 views • 2 years ago

alright we are not announcing this publicly on AssetDash until tomorrow but $UFD is a perfect case study so here we go many of you are using Whale Watch to discover new coins daily (thank you) and we wanted to release a feature that allows YOU to sell the AssetDash audience on your favorite memes now - anyone on AssetDash can upgrade a token to Pro and add two key aspects to the token page: 1. The Bull Case - 3 to 4 bullet points on why you are bullish on the meme plus additional context 2. The Meme Gallery - a gallery of pictures of the meme TLDR: this is a great way to work for your bags! Upgrade your favorite token here ->

Matias | Moby 🐳

71,873 views • 1 year ago

understand AI like this: if your input is dumb, your output will be dumb, and if your input is creative, your output will be creative. this is it. so, you need to provide about 70% of the context from your own brain to the ai to get the best outputs.

ViralOps

128,487 views • 4 months ago