Nathan Barry

@nathanbarrydev • 2,103 subscribers

Man in the Arena Allocator. Prev @Apple, CS + Math @UTAustin, @zfellows

Shorts

Added context to my tiny diffusion model to enable sequential generation of longer outputs! Currently the context is a quarter of the sequence length (seq_len=256, context_len=64). I have a theory that the lower the semantic value per token, the worse the “curse of parallel decoding” gets. With parallel decoding, we independently predict multiple tokens in one step. For the sentence “My poker hand was a ___ ___”, two valid completions are “two pair” and “straight flush”. But because each token prediction is independent, we can end up with a nonsensical output like “two flush”. This seems to be exacerbated by low semantic value per token, since you then need more tokens to express the same concept: instead of independently predicting two tokens, we might need to predict 10 (which is of course much harder). The model's output is currently noticeably worse than nanoGPT's (similar size), and I believe this is a main reason. I’ll try adding confidence-aware parallel decoding (from NVIDIA’s Fast-dLLM paper) and other tricks and see how much they improve generation quality.
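To make the poker example concrete, here is a minimal toy sketch in Python. Everything in it is an illustrative assumption rather than code from the actual model: the two-phrase `joint` distribution, the helper names, and the `threshold` value are all made up. It shows how sampling each blank from its own marginal puts probability on “two flush”, how quickly the chance of a coherent phrase drops as the phrase gets longer, and a simplified threshold rule loosely in the spirit of confidence-aware decoding (not the Fast-dLLM algorithm itself).

```python
from itertools import product

# Hypothetical "true" distribution over the two blanks (an assumption for illustration).
joint = {("two", "pair"): 0.5, ("straight", "flush"): 0.5}

def marginals(joint):
    """Per-position distributions, which is all an independent predictor sees."""
    first, second = {}, {}
    for (a, b), p in joint.items():
        first[a] = first.get(a, 0.0) + p
        second[b] = second.get(b, 0.0) + p
    return first, second

def independent_product(first, second):
    """Distribution you actually sample from when both blanks are filled independently."""
    return {(a, b): pa * pb
            for (a, pa), (b, pb) in product(first.items(), second.items())}

def valid_probability(k, n):
    """Chance of a coherent phrase when k equally likely phrases of n tokens
    share no token at any position and every token is sampled independently."""
    return k * (1.0 / k) ** n

def threshold_decode(joint, threshold=0.9):
    """Simplified rule: commit both blanks at once only if both are confident;
    otherwise commit the first blank and re-predict the second given it."""
    first, second = marginals(joint)
    a, pa = max(first.items(), key=lambda kv: kv[1])
    b, pb = max(second.items(), key=lambda kv: kv[1])
    if pa >= threshold and pb >= threshold:
        return (a, b)
    # Condition the second blank on the committed first token.
    total = sum(p for (x, _), p in joint.items() if x == a)
    conditional = {y: p / total for (x, y), p in joint.items() if x == a}
    return (a, max(conditional.items(), key=lambda kv: kv[1])[0])

first, second = marginals(joint)
parallel = independent_product(first, second)
invalid_mass = sum(p for pair, p in parallel.items() if pair not in joint)

print(parallel[("two", "flush")])   # mass on a nonsensical hand
print(invalid_mass)                 # total mass on invalid hands
print(valid_probability(2, 2), valid_probability(2, 10))
print(threshold_decode(joint))
```

Under these toy assumptions, half of the independently sampled hands are invalid with two tokens, and with ten tokens per phrase only about 1 in 512 samples stays coherent, which is the intuition behind low semantic value per token making things worse. The threshold rule trades one extra step for a phrase that is always in the valid set.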

89,040 views