On the Scalability of Multi-Head Attention Mechanisms

Dr. Elena Voss, Stanford Computational Linguistics Lab — May 2026

The quadratic complexity of self-attention with respect to sequence length remains a central bottleneck in deploying large language models at scale. Recent work on sparse attention patterns, kernelized approximations, and state-space models has shown promise, but the theoretical lower bounds remain poorly understood.

In this paper, we present a novel analysis of the information-theoretic limits of approximate attention, proving that for sequences of length n, any algorithm that computes attention to within ε-error under the L2 norm requires Ω(n log n) time. Our proof relies on a reduction from the set disjointness problem in communication complexity.
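For concreteness, the claim can be written as follows. The abstract only says "ε-error under the L2 norm", so the choice of an absolute (rather than relative) error criterion and the exact norm convention below are illustrative assumptions, not the paper's stated definitions:

```latex
A(Q,K,V) \;=\; \operatorname{softmax}\!\left(\tfrac{QK^{\top}}{\sqrt{d}}\right)V,
\qquad Q,K,V \in \mathbb{R}^{n \times d},
```
```latex
\bigl\|\hat{A} - A(Q,K,V)\bigr\|_{2} \le \varepsilon
\;\;\Longrightarrow\;\;
\text{computing } \hat{A} \text{ takes } \Omega(n \log n) \text{ time,}
```

where Â is the output of any (possibly randomized) approximation algorithm on worst-case inputs.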

We further demonstrate that our lower bound is tight by exhibiting a family of randomized algorithms that achieve O(n log n) time complexity while maintaining ε-approximation guarantees. The key insight is a recursive low-rank decomposition that exploits the exponential decay of attention weights with token distance.
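To make the idea of a recursive low-rank decomposition concrete, the sketch below splits the score matrix into blocks, keeps blocks near the diagonal exact, and replaces distant blocks with a rank-r surrogate. The block size, rank schedule, and the use of a truncated SVD on explicitly formed blocks are assumptions made for readability; this is not the paper's algorithm, and a practical implementation would construct the low-rank factors of distant blocks without materializing them.

```python
"""Illustrative sketch of a recursive low-rank attention approximation."""
import numpy as np


def exact_attention(Q, K, V):
    """Reference O(n^2) softmax attention."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)
    S -= S.max(axis=1, keepdims=True)          # numerical stability
    P = np.exp(S)
    return (P @ V) / P.sum(axis=1, keepdims=True)


def _low_rank(M, r):
    """Best rank-r approximation of M via truncated SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]


def recursive_lowrank_attention(Q, K, V, block=64, rank=4):
    """Approximate attention by recursively splitting the (query, key)
    index ranges: blocks near the diagonal are computed exactly, distant
    blocks are replaced by a rank-`rank` surrogate, reflecting the
    assumption that attention weights decay with token distance."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    S = np.empty((n, n))

    def fill(q0, q1, k0, k1):
        if (q1 - q0) <= block or (k1 - k0) <= block:
            # Small block: compute exact scaled scores.
            S[q0:q1, k0:k1] = Q[q0:q1] @ K[k0:k1].T * scale
            return
        if k0 >= q1 or q0 >= k1:
            # Distant block: cheap low-rank surrogate of the score block.
            S[q0:q1, k0:k1] = _low_rank(Q[q0:q1] @ K[k0:k1].T * scale, rank)
            return
        # Overlapping block: recurse into the four quadrants.
        qm, km = (q0 + q1) // 2, (k0 + k1) // 2
        for qa, qb in ((q0, qm), (qm, q1)):
            for ka, kb in ((k0, km), (km, k1)):
                fill(qa, qb, ka, kb)

    fill(0, n, 0, n)
    S -= S.max(axis=1, keepdims=True)
    P = np.exp(S)
    return (P @ V) / P.sum(axis=1, keepdims=True)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 512, 32
    Q, K, V = rng.standard_normal((3, n, d))
    ref = exact_attention(Q, K, V)
    approx = recursive_lowrank_attention(Q, K, V)
    print("max abs error:", np.abs(ref - approx).max())
```

Under the stated decay assumption, the distant blocks carry little mass, so the rank-r surrogates perturb the softmax only slightly while the exact work is confined to blocks near the diagonal.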

Empirical results on the Pile, C4, and FineWeb-Edu datasets confirm that our approximate attention mechanism preserves perplexity within 0.3% of exact attention while reducing FLOPs by 47% on sequences of 32K tokens.

Keywords: attention mechanisms, lower bounds, communication complexity, kernel methods, sparse transformers, state-space duality


Submitted to NeurIPS 2026. Preprint available on arXiv.

Contact: evoss@stanford-cs.edu | Funding: NSF Grant #CNS-2247841