Cottention: Linear Transformers With Cosine Attention
Attention mechanisms, particularly softmax attention, have been instrumental in the success of transformer-based models such as GPT. However, the quadratic memory complexity of softmax attention with respect to sequence length poses significant challenges for processing longer sequences. We introduce Cottention, a novel attention mechanism that replaces the softmax operation with cosine similarity. By leveraging the properties of cosine similarity and rearranging the attention equation, Cottention achieves native linear memory complexity with respect to sequence length, making it inherently more memory-efficient than softmax attention. We demonstrate that Cottention can be reformulated as a recurrent neural network (RNN) with a finite hidden state, allowing for constant memory usage during inference. We evaluate Cottention on both the bidirectional BERT and causal GPT tasks, demonstrating comparable performance to softmax attention while significantly reducing memory requirements. To ensure efficient computation, we develop a custom CUDA kernel for Cottention. Our results show that Cottention is a promising alternative to softmax attention, enabling the processing of longer sequences without sacrificing performance, due to its native linear memory complexity and ability to maintain a constant memory footprint during inference.
💡 Research Summary
The paper introduces “Cottention,” a novel attention mechanism that replaces the softmax operation in transformers with cosine similarity, thereby achieving linear memory complexity with respect to sequence length. The authors start by observing that the quadratic memory and time cost of standard softmax attention stems from the need to compute the full Q Kᵀ matrix and then apply a per‑row softmax. By first L2‑normalizing the query (Q) and key (K) matrices row‑wise (denoted N(Q) and N(K)), the cosine similarity can be expressed as N(Q) · N(K)ᵀ. This formulation decouples the normalization from the matrix multiplication, allowing the computation order to be rearranged. Specifically, the product N(K)ᵀ · V can be computed first, producing an intermediate of size d × d (where d is the per‑head key dimension). Multiplying this intermediate by N(Q) yields the final output, reducing intermediate memory from O(s²) to O(d²) when the sequence length s exceeds d.
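The rearrangement described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the paper's implementation: it shows that L2-normalizing Q and K and then reassociating the matrix product gives the same output while only ever materializing a d × d intermediate instead of an s × s one (the causal/masked case needs extra care and is omitted here).

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Row-wise L2 normalization, the N(.) operator in the text.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
s, d = 128, 16  # sequence length, per-head dimension
Q, K, V = (rng.standard_normal((s, d)) for _ in range(3))

Qn, Kn = l2_normalize(Q), l2_normalize(K)

# Quadratic form: materializes the full s x s cosine-similarity matrix.
out_quadratic = (Qn @ Kn.T) @ V

# Linear form: compute the d x d intermediate N(K)^T V first,
# then multiply by N(Q). Intermediate memory is O(d^2), not O(s^2).
intermediate = Kn.T @ V
out_linear = Qn @ intermediate

assert intermediate.shape == (d, d)
assert np.allclose(out_quadratic, out_linear)
```

Because matrix multiplication is associative, both orderings are exactly equivalent; the saving comes purely from never forming the s × s similarity matrix, which pays off whenever s > d.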
A key challenge is training stability. Unlike softmax, whose rows sum to 1, the raw cosine similarity matrix has entries in [−1, 1], so its row sums can reach a magnitude of s, leading to large activations and gradients. To mitigate this, the authors introduce a learned scalar m for each head, passed through a sigmoid σ(m) and used to divide the similarity matrix by s^{σ(m)}. Since σ(m) lies in (0, 1), this scaling keeps the row sums within a bounded range, stabilizing training.