Working through the original Transformer paper with my own annotations and re-derivations. What they got right, what I still find weird.
tags: transformers, attention, nlp, seminal, ai
"We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely."
The paper arrived in 2017 like a theorem already proven — the kind of result that, in retrospect, feels inevitable. NeurIPS 2017. 100,000+ citations. One of those papers where the title is also the thesis. But reading it carefully, you notice it is not as clean as the legend makes it. There are contingent choices, footnoted hedges, engineering compromises dressed as principles.
That is not a criticism. It is the nature of papers that change fields.
Let Q, K ∈ ℝⁿˣᵈᵏ and V ∈ ℝⁿˣᵈᵛ. The scaled dot-product attention is: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
The query attends to every key, weighted by similarity. The values are then mixed according to those weights. It is, at its core, a soft dictionary lookup — and that framing makes the architecture surprisingly legible.
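The soft-lookup framing translates almost line for line into code. A minimal NumPy sketch (the function name, shapes, and random inputs are my own, not the paper's):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) query-key alignments
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys, row by row
    return weights @ V                              # mix values by those weights

rng = np.random.default_rng(0)
n, d_k = 4, 8
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (4, 8): one mixed value vector per query position
```

The √d_k scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into near-one-hot saturation.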
Instead of computing one attention function, the authors project into h parallel subspaces:
MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) W^O
where headᵢ = Attention(Q Wᵢ^Q, K Wᵢ^K, V Wᵢ^V)
Each attention head operates in a subspace of dimension d_k = d_model / h. With d_model = 512 and h = 8, each head sees a 64-dimensional projection. The full model's expressivity is the union of these 8 views.
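One way to see the subspace split concretely: slice each full d_model × d_model projection into h column blocks of width d_k, run attention per block, and concatenate. A NumPy sketch under those assumptions (the slicing layout and small init scale are my choices, equivalent in shape to the paper's per-head matrices):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head(Q, K, V, W_q, W_k, W_v, W_o, h):
    """Split d_model into h heads of width d_k = d_model // h, attend, concat."""
    n, d_model = Q.shape
    d_k = d_model // h
    heads = []
    for i in range(h):
        # per-head projections: column slice i of each big projection matrix
        q = Q @ W_q[:, i * d_k:(i + 1) * d_k]
        k = K @ W_k[:, i * d_k:(i + 1) * d_k]
        v = V @ W_v[:, i * d_k:(i + 1) * d_k]
        heads.append(softmax(q @ k.T / np.sqrt(d_k)) @ v)
    return np.concatenate(heads, axis=-1) @ W_o  # (n, d_model)

rng = np.random.default_rng(1)
n, d_model, h = 4, 512, 8
Q = K = V = rng.standard_normal((n, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) * 0.02
                      for _ in range(4))
out = multi_head(Q, K, V, W_q, W_k, W_v, W_o, h)
print(out.shape)  # (4, 512)
```

With d_model = 512 and h = 8, each head attends in its own 64-dimensional projection before W^O recombines the views.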
The positional encoding choice. They use fixed sinusoidal encodings and claim the model may extrapolate to sequence lengths longer than those seen in training. It mostly does not. RoPE and ALiBi came later and work demonstrably better. Why sinusoidal? Because it was interpretable, differentiable, and available in 2017.
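For reference, the encodings themselves are trivial to construct. A sketch following the paper's formula PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...) (function name and argument names are mine; d_model assumed even):

```python
import numpy as np

def sinusoidal_pe(n_positions, d_model):
    """Fixed sinusoidal positional encodings, one (d_model,) row per position."""
    pos = np.arange(n_positions)[:, None]          # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model/2) frequency index
    angles = pos / (10000.0 ** (2 * i / d_model))  # geometric wavelength ladder
    pe = np.empty((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dims: cosine
    return pe

pe = sinusoidal_pe(100, 512)
print(pe.shape)  # (100, 512)
```

The wavelengths form a geometric progression from 2π up to 10000·2π, so nearby dimensions encode position at different resolutions.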
The paper also contains no ablation of its normalization placement. Pre-LN versus Post-LN matters enormously for training stability — yet the authors use Post-LN without discussion, as though it were the only possibility. Later work (Wang et al., 2019) showed Pre-LN trains more stably. Almost everyone uses Pre-LN now.
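The two placements differ by one line. A minimal sketch with an unlearned layer norm and a stand-in ReLU sublayer (both simplifications are mine, just to make the orderings visible):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean, unit variance (no learned scale/shift)."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def post_ln(x, sublayer):
    # Vaswani et al.: LayerNorm(x + Sublayer(x)) -- the norm sits on the residual path
    return layer_norm(x + sublayer(x))

def pre_ln(x, sublayer):
    # later convention: x + Sublayer(LayerNorm(x)) -- the residual stream stays untouched
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
d = 512
W = rng.standard_normal((d, d)) * 0.02
ffn = lambda t: np.maximum(t @ W, 0.0)  # stand-in sublayer: one ReLU projection
x = rng.standard_normal((4, d))
a, b = post_ln(x, ffn), pre_ln(x, ffn)
```

In Post-LN every residual sum passes through a normalization, so gradients must cross a norm at every layer; in Pre-LN the identity path from output to input is unobstructed, which is the usual explanation for its stabler training.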
There is a reading of attention that Byrne would have appreciated: purely geometric. The query Q and key K live in the same space. Their dot product measures alignment — how much does this query point toward this key?
The attention matrix A = softmax(QKᵀ / √d_k) is a row-stochastic matrix whose (i,j) entry encodes how much position i "looks at" position j. Reading the matrix row by row gives you a complete picture of information flow.
The values V are then mixed by A — each output position is a weighted average of all value vectors, with weights determined by query-key alignment.
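The stochastic structure is easy to check numerically: every row of A sums to 1 (each query distributes one unit of attention), while the columns generally do not. A quick NumPy check (random inputs, my own shapes):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d_k = 5, 16
Q, K = rng.standard_normal((n, d_k)), rng.standard_normal((n, d_k))

scores = Q @ K.T / np.sqrt(d_k)
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)   # softmax over each row

print(A.sum(axis=1))  # each row sums to 1.0
print(A.sum(axis=0))  # column sums are unconstrained
```

Softmax normalizes per query, not per key — so a single key can absorb attention from many queries, or from none.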
The reasons Transformers work are separable from the specific choices. Sinusoidal PE, post-LN, learned QKV projections — these were reasonable for 2017 and have largely been improved since. The core architectural intuition holds up brilliantly.
The real contribution is not the specific architecture. It is the demonstration that sequence-to-sequence tasks can be solved without recurrence at all. Everything else followed from that.
Rating: 10/10 — changed everything
My annotated implementation: GitHub
✦ memory · ☽ night · ∞ loops · ❧ margins · ◆ proof
a personal library in perpetual arrangement · MMXXVI