Working through the original Transformer paper with my own annotations and re-derivations. What they got right, what I still find weird.
tags: transformers, attention, nlp, seminal, ai
"We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely."
The paper arrived in 2017 like a theorem already proven — the kind of result that, in retrospect, feels inevitable. NeurIPS 2017. 100,000+ citations. One of those papers where the title is also the thesis. But reading it carefully, you notice it is not as clean as the legend makes it. There are contingent choices, footnoted hedges, engineering compromises dressed as principles.
That is not a criticism. It is the nature of papers that change fields.
Let Q, K ∈ ℝⁿˣᵈᵏ and V ∈ ℝⁿˣᵈᵛ. The scaled dot-product attention is: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
The query attends to every key, weighted by similarity. The values are then mixed according to those weights. It is, at its core, a soft dictionary lookup — and that framing makes the architecture surprisingly legible.
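The soft-lookup framing translates almost line for line into code. A minimal NumPy sketch (the function name, shapes, and random inputs are my own, not the paper's):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) query-key alignments
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys, row by row
    return weights @ V                              # mix values by those weights

rng = np.random.default_rng(0)
n, d_k = 4, 8
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (4, 8): one mixed value vector per query position
```

The √d_k scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into near-one-hot saturation.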
Instead of computing one attention function, the authors project into h parallel subspaces:
MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) W^O
where headᵢ = Attention(Q Wᵢ^Q, K Wᵢ^K, V Wᵢ^V)
Each attention head operates in a subspace of dimension d_k = d_model / h. With d_model = 512 and h = 8, each head sees a 64-dimensional projection. The full model's expressivity is the union of these 8 views.
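One way to see the subspace split concretely: slice each full d_model × d_model projection into h column blocks of width d_k, run attention per block, and concatenate. A NumPy sketch under those assumptions (the slicing layout and small init scale are my choices, equivalent in shape to the paper's per-head matrices):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head(Q, K, V, W_q, W_k, W_v, W_o, h):
    """Split d_model into h heads of width d_k = d_model // h, attend, concat."""
    n, d_model = Q.shape
    d_k = d_model // h
    heads = []
    for i in range(h):
        # per-head projections: column slice i of each big projection matrix
        q = Q @ W_q[:, i * d_k:(i + 1) * d_k]
        k = K @ W_k[:, i * d_k:(i + 1) * d_k]
        v = V @ W_v[:, i * d_k:(i + 1) * d_k]
        heads.append(softmax(q @ k.T / np.sqrt(d_k)) @ v)
    return np.concatenate(heads, axis=-1) @ W_o  # (n, d_model)

rng = np.random.default_rng(1)
n, d_model, h = 4, 512, 8
Q = K = V = rng.standard_normal((n, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) * 0.02
                      for _ in range(4))
out = multi_head(Q, K, V, W_q, W_k, W_v, W_o, h)
print(out.shape)  # (4, 512)
```

With d_model = 512 and h = 8, each head attends in its own 64-dimensional projection before W^O recombines the views.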
The positional encoding choice. They use fixed sinusoidal encodings and claim the model may extrapolate to sequence lengths longer than those seen in training. It mostly does not. RoPE and ALiBi came later and work demonstrably better. Why sinusoidal? Because it was interpretable, differentiable, and available in 2017.
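For reference, the encodings themselves are trivial to construct. A sketch following the paper's formula PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...) (function name and argument names are mine; d_model assumed even):

```python
import numpy as np

def sinusoidal_pe(n_positions, d_model):
    """Fixed sinusoidal positional encodings, one (d_model,) row per position."""
    pos = np.arange(n_positions)[:, None]          # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model/2) frequency index
    angles = pos / (10000.0 ** (2 * i / d_model))  # geometric wavelength ladder
    pe = np.empty((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dims: cosine
    return pe

pe = sinusoidal_pe(100, 512)
print(pe.shape)  # (100, 512)
```

The wavelengths form a geometric progression from 2π up to 10000·2π, so nearby dimensions encode position at different resolutions.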
The paper also contains no ablation of its normalization placement. Pre-LN versus Post-LN matters enormously for training stability — yet the authors use Post-LN without discussion, as though it were the only possibility. Later work (Wang et al., 2019) showed Pre-LN trains more stably. Almost everyone uses Pre-LN now.
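The two placements differ by one line. A minimal sketch with an unlearned layer norm and a stand-in ReLU sublayer (both simplifications are mine, just to make the orderings visible):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean, unit variance (no learned scale/shift)."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def post_ln(x, sublayer):
    # Vaswani et al.: LayerNorm(x + Sublayer(x)) -- the norm sits on the residual path
    return layer_norm(x + sublayer(x))

def pre_ln(x, sublayer):
    # later convention: x + Sublayer(LayerNorm(x)) -- the residual stream stays untouched
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
d = 512
W = rng.standard_normal((d, d)) * 0.02
ffn = lambda t: np.maximum(t @ W, 0.0)  # stand-in sublayer: one ReLU projection
x = rng.standard_normal((4, d))
a, b = post_ln(x, ffn), pre_ln(x, ffn)
```

In Post-LN every residual sum passes through a normalization, so gradients must cross a norm at every layer; in Pre-LN the identity path from output to input is unobstructed, which is the usual explanation for its stabler training.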
There is a reading of attention that Byrne would have appreciated: purely geometric. The query Q and key K live in the same space. Their dot product measures alignment — how much does this query point toward this key?
The attention matrix A = softmax(QKᵀ / √d_k) is a row-stochastic matrix whose (i,j) entry encodes how much position i "looks at" position j. Reading the matrix row by row gives you a complete picture of information flow.
The values V are then mixed by A — each output position is a weighted average of all value vectors, with weights determined by query-key alignment.
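The stochastic structure is easy to check numerically: every row of A sums to 1 (each query distributes one unit of attention), while the columns generally do not. A quick NumPy check (random inputs, my own shapes):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d_k = 5, 16
Q, K = rng.standard_normal((n, d_k)), rng.standard_normal((n, d_k))

scores = Q @ K.T / np.sqrt(d_k)
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)   # softmax over each row

print(A.sum(axis=1))  # each row sums to 1.0
print(A.sum(axis=0))  # column sums are unconstrained
```

Softmax normalizes per query, not per key — so a single key can absorb attention from many queries, or from none.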
The reasons Transformers work are separable from the specific choices. Sinusoidal PE, post-LN, learned QKV projections — these were reasonable for 2017 and have largely been improved since. The core architectural intuition holds up brilliantly.
The real contribution is not the specific architecture. It is the demonstration that sequence-to-sequence tasks can be solved without recurrence at all. Everything else followed from that.
Rating: 10/10 — changed everything
My annotated implementation: GitHub
✦ memory · ☽ night · ∞ loops · ❧ margins · ◆ proof
a personal library in perpetual arrangement · MMXXVI