Nathan Barry

@nathanrs • 2,822 subscribers

I like to work on cool things. AI stuff @zeddotdev, prev. @Apple, @zfellows

Shorts

I found out the other day that any compression tool can be contorted to do language modeling. Turns out gzip can generate text that somewhat *resembles* Shakespeare. Short write-up linked below.

252,284 views
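
A minimal sketch of the compression-as-language-model idea, not the code from the linked write-up. It assumes Python's `zlib` (the same DEFLATE algorithm gzip uses) and a greedy rule: append whichever character makes the compressed corpus-plus-text smallest. The function names `compressed_size`, `next_char`, and `generate` are made up for illustration.

```python
import zlib


def compressed_size(text: str) -> int:
    """Bytes zlib needs for `text` (a stand-in for running gzip on it)."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def next_char(corpus: str, text: str, alphabet: str) -> str:
    """Greedy pick: the character that is cheapest to compress after corpus + text.

    Sizes are whole bytes, so ties are common; min() breaks them by alphabet order.
    """
    base = corpus + text
    return min(alphabet, key=lambda c: compressed_size(base + c))


def generate(corpus: str, prompt: str, n_chars: int) -> str:
    """Extend `prompt` by `n_chars` characters drawn from the corpus alphabet."""
    alphabet = "".join(sorted(set(corpus)))
    text = prompt
    for _ in range(n_chars):
        text += next_char(corpus, text, alphabet)
    return text


if __name__ == "__main__":
    corpus = "to be or not to be that is the question " * 20
    print(generate(corpus, "to be", 20))
```

A character that continues a pattern the compressor has already seen costs fewer extra bytes, which is why the compressed size works as a rough stand-in for negative log-probability. DEFLATE only looks back about 32 KB, so a sketch like this only "sees" the tail of a long corpus.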

I noticed that Apple Notes has a UI similar to the AI chat apps, so I turned it into a Claude/ChatGPT frontend. Use any LLM API to chat right inside Apple Notes.

43,411 views

Most recent diffusion language model research (that I’ve seen) uses masking as the noising process. However, most closed-source models (Google Gemini Diffusion and possibly Inception Labs’ Mercury) appear to use a different noising process: instead of masking tokens, they replace them with other tokens (either random tokens or semantically similar ones). I wondered how they were getting such high throughput with the latter process, since I believed that optimizing inference with KV cache approximation would be more difficult (for various reasons). I visualized this noising process with tiny-diffusion and compared it to normal unmasking, and was very surprised to see how fast the generation “settles” into a reasonable output and then only slightly refines afterwards, requiring far fewer steps in total. Unmasking (where tokens are never remasked, the typical implementation) is inherently limited in generation speed: decoding more tokens per step leads to more errors, because we sample each token independently from its own marginal distribution rather than from the joint distribution over all tokens. The token replacement noising process seems to have a very different set of characteristics. Because we resample every token at each step, every token makes “progress” towards the final output each iteration (in addition to *potentially* giving other tokens more information in future steps). Generally, masking has outperformed other noising processes, which is probably why most research (using smaller models) has focused on it. But the paper referred to in the retweet shows that random replacement as a noising process may scale better as model size increases. Big labs might have noticed these results much earlier (due to having drastically more training resources and being able to test larger models), which may explain the discrepancy in the choice of noising process.
I’m gonna test this with larger models, since tiny-diffusion only has 10M parameters.

40,440 views
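
A small sketch of the two noising processes discussed in the post above, plus a toy calculation of why decoding several tokens at once from per-token marginals causes errors. This is not tiny-diffusion's code. The `MASK` symbol, the function names, and the choice of uniform random replacement (rather than semantically similar tokens) are all assumptions made for illustration.

```python
import math
import random
from itertools import product

MASK = "_"  # hypothetical mask symbol, chosen for this sketch


def mask_noise(tokens, t, rng):
    """Masking: each token independently becomes MASK with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def replace_noise(tokens, t, vocab, rng):
    """Replacement: each token independently becomes a uniformly random vocab token with probability t."""
    return [rng.choice(vocab) if rng.random() < t else tok for tok in tokens]


def independent_sampling_error(joint):
    """Chance that sampling every position from its own marginal gives a sequence
    the joint never produces. `joint` maps equal-length strings to positive probabilities."""
    length = len(next(iter(joint)))
    marginals = [{} for _ in range(length)]
    for seq, p in joint.items():
        for i, sym in enumerate(seq):
            marginals[i][sym] = marginals[i].get(sym, 0.0) + p
    error = 0.0
    for combo in product(*(m.items() for m in marginals)):
        seq = "".join(sym for sym, _ in combo)
        if seq not in joint:
            error += math.prod(p for _, p in combo)
    return error


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = list("to be or not to be")
    vocab = sorted(set(tokens))
    print("".join(mask_noise(tokens, 0.5, rng)))
    print("".join(replace_noise(tokens, 0.5, vocab, rng)))
    # Only "ab" and "ba" are valid, yet independent sampling also yields "aa" and "bb".
    print(independent_sampling_error({"ab": 0.5, "ba": 0.5}))  # 0.5
```

In the toy joint, each position on its own is a fair coin between `a` and `b`, so decoding both positions in one step is wrong half the time, while decoding one position and conditioning on it is never wrong.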

Videos

Rewrote tiny-diffusion to be 3x smaller! Went from 951 lines to just 364, all contained in one file. As simple as possible, but not simpler. I also added a tiny GPT implementation as a comparison (312 lines, inspired by Andrej Karpathy). The two implementations are ~80% identical. The model architecture, training loop, tokenization, etc., only differ in 19 lines of code. The main differences are contained within two functions (generate and get_batch). The reason to include the GPT implementation was to show how similar autoregressive LMs are to diffusion LMs on an architectural level. Only *1* line of code in the architecture needs to be modified to support masked language diffusion instead of next-token prediction (by disabling causal masking). Link to the repo is in the comments.

161,578 views • 6 months ago
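
To illustrate the "one line" difference mentioned in the post above, here is a dependency-free sketch of attention weights with a `causal` switch. It is not taken from the repo. The function name `attention_weights` and the plain-list representation are assumptions, and a real model would do this with tensors.

```python
import math


def attention_weights(scores, causal):
    """Row-wise softmax over a square score matrix.

    causal=True:  position i may only attend to positions j <= i (GPT-style).
    causal=False: every position attends to every position (masked diffusion).
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # This condition is the only place where the two model types differ.
        row = [scores[i][j] if (not causal or j <= i) else -math.inf for j in range(n)]
        peak = max(row)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


if __name__ == "__main__":
    scores = [[0.0] * 3 for _ in range(3)]
    for row in attention_weights(scores, causal=True):
        print(row)
    for row in attention_weights(scores, causal=False):
        print(row)
```

In PyTorch, the analogous switch would be the `is_causal` argument of `torch.nn.functional.scaled_dot_product_attention`. Whether tiny-diffusion uses that particular call is an assumption.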

tiny-diffusion, but Japanese! I wonder how logographic languages (Japanese, Chinese, etc.) compare to phonetic/alphabetic languages in generation quality and speed with character-level tokenizers. The main difference is the semantic value per token. Fewer tokens are needed to express an idea, which leads to fewer AR and diffusion steps. My main question is how it would affect the entropy of the output distributions. Lower entropy benefits parallel decoding. I could see arguments on both sides. One main benefit is that you have fewer opportunities to mangle words, leading to fewer obvious mistakes.

54,185 views • 6 months ago
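
For the entropy question in the post above, a short sketch of how the entropy of an output distribution can be measured, and one possible heuristic for using it in parallel decoding. The helper `confident_positions` and its threshold are hypothetical and do not describe what tiny-diffusion does.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits of a probability distribution given as a list."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def confident_positions(dists, max_bits):
    """Indices whose predicted distribution has entropy at or below `max_bits`.

    A hypothetical rule: only these positions would be decoded in the current step.
    """
    return [i for i, dist in enumerate(dists) if entropy_bits(dist) <= max_bits]


if __name__ == "__main__":
    dists = [
        [1.0, 0.0],  # certain: 0 bits
        [0.5, 0.5],  # coin flip: 1 bit
        [0.9, 0.1],  # fairly confident: about 0.47 bits
    ]
    print([round(entropy_bits(d), 2) for d in dists])
    print(confident_positions(dists, max_bits=0.5))
```

The lower the entropy at a position, the less its sampled token depends on what gets chosen elsewhere, which is the sense in which low entropy helps when many positions are decoded in the same step.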
