Attention

MLA vs GQA and their KV-cache tradeoffs.

At decode time the dominant memory cost is the KV cache, which grows with every token. GQA shrinks it by sharing a few full key/value heads across many query heads, while MLA shrinks it further by caching only a compressed latent and reconstructing keys and values on the fly. Pick a method and a variant to see how the tradeoff plays out.