What is the KV cache?

Recently we’ve seen researchers and engineers scale transformer-based models to hundreds of billions of parameters. The transformer architecture is exactly what made this possible, thanks to its sequence parallelism (here is an introduction to the transformer architecture). However, while it certainly enables an efficient training procedure, the same cannot be said about inference. Background: Recall the definition of Attention given in the “Attention Is All You Need” paper:...
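For context, the definition the excerpt alludes to is the scaled dot-product attention from “Attention Is All You Need”; the formula below is reproduced from that paper rather than from the truncated excerpt itself:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
$$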

September 18, 2023

Seq2Seq models and the Attention mechanism

The path followed in this post is: sequence-to-sequence models $\rightarrow$ Neural Turing Machines $\rightarrow$ attentional interfaces $\rightarrow$ transformers. This post is dense with material, but I tried to keep it as simple as possible without losing important details! Disclaimer: these notes are for the most part a collection of concepts taken from the slides of the ‘Artificial Neural Networks and Deep Learning’ course at the Polytechnic of Milan, the book ‘Deep Learning’ (Goodfellow et al., 2016), and some other online resources....

December 23, 2019