What is the KV cache?
Recently we’ve seen researchers and engineers scale transformer-based models to hundreds of billions of parameters. The transformer architecture is exactly what made this possible, thanks to its sequence parallelism (here is an introduction to the transformer architecture). However, while it certainly enables an efficient training procedure, the same cannot be said about the inference process.

Background

Recall the definition of attention given in the “Attention Is All You Need” paper:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
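To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention for a single head, without masking or batching. The `softmax` and `attention` helpers and the toy shapes are illustrative assumptions, not code from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_q, seq_k) similarity scores
    weights = softmax(scores, axis=-1)   # attention weights over the keys
    return weights @ V                   # (seq_q, d_v) weighted sum of values

# Toy example: a sequence of 4 tokens with head dimension 8.
rng = np.random.default_rng(0)
seq_len, d_k, d_v = 4, 8, 8
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))
V = rng.standard_normal((seq_len, d_v))
print(attention(Q, K, V).shape)  # (4, 8)
```

Note that during training all query positions are processed in parallel in one matrix multiplication, which is the sequence parallelism mentioned above; at inference time, tokens are instead generated one at a time, which is where the K and V matrices become relevant for caching.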