Attention & Transformers
How models decide what to focus on
Take your time with this one. The interactive parts are here to help you test the idea, not rush through it.
Pause and experiment as you go.
Before We Begin
What we are learning today
Transformers replace step-by-step recurrence with attention. Instead of forcing information to travel through a long chain of time steps, each token can directly examine other tokens, assign weights to them, and build a context-aware representation in parallel.
How this lesson fits
This module explains the modern language-model stack from the inside out. Students see how words become vectors, how attention lets models choose context dynamically, and how large-scale next-token training turns those ingredients into systems that can write, summarize, and answer questions.
The big question
How can a machine represent meaning, decide which context matters, and then generate fluent language one token at a time?
Why You Should Care
Transformers are the backbone of modern language models and many other state-of-the-art systems. Understanding attention gives students a concrete picture of how models preserve context, resolve references, and scale to large datasets.
Where this is used today
- Machine translation and search systems that require strong context modeling
- BERT-style encoders used for retrieval, ranking, and language understanding tasks
- Scientific models such as AlphaFold that borrow attention mechanisms beyond text
Think of it like this
Imagine annotating a paragraph while answering a question. You do not reread every word with equal intensity. You glance across the whole passage, identify the most relevant phrases, and give those more mental weight while interpreting the sentence.
Easy mistake to make
Attention is not consciousness or human-like focus. It is a mathematical weighting mechanism that tells the model which positions should influence the current token more strongly.
By the end, you should be able to:
- Explain attention as weighted interaction across tokens rather than as vague focus
- Describe queries, keys, and values at a high level and what role each plays
- Explain why transformers handle long-range context better than older recurrent sequence models
Think about this first
In the sentence 'The animal did not cross the street because it was tired,' what does 'it' refer to, and which other words in the sentence help you settle that reference?
Words we will keep using
From Text to Vectors
A transformer doesn't read like you do. First, it chops text into "tokens" (pieces of words). Then it turns those tokens into ID numbers, and finally into lists of coordinates (vectors). Only then can it start doing math on meaning.
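The pipeline above can be sketched in a few lines of NumPy. The vocabulary, the suffix-stripping "tokenizer", and the embedding values below are toy assumptions for illustration, not those of any real model:

```python
import numpy as np

# Toy vocabulary: word pieces, where "##" marks a continuation piece.
vocab = {"think": 0, "##ing": 1, "machine": 2, "##s": 3}

def tokenize(word):
    # Toy subword split: strip a known suffix if the stem is in the vocab.
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and word[:-len(suffix)] in vocab:
            return [word[:-len(suffix)], "##" + suffix]
    return [word]

# Text -> tokens -> IDs
tokens = tokenize("thinking") + tokenize("machines")
ids = [vocab[t] for t in tokens]

# IDs -> vectors: a lookup into a (vocab_size, d_model) embedding table.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 3))  # d_model = 3 here
vectors = embedding_table[ids]                      # one 3-d vector per token

print(tokens)         # ['think', '##ing', 'machine', '##s']
print(ids)            # [0, 1, 2, 3]
print(vectors.shape)  # (4, 3)
```

Only after this last step does the model have something it can multiply, add, and compare.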
The Encoder Block
This is the engine room. Every token looks at every other token to figure out context, then processes that information through a private neural network. It does this dozens of times in a row.
The "social" step. Each word asks: "Who else in this sentence helps explain me?"
The "thinking" step. The word digests what it learned from its neighbors and updates its own meaning.
The stabilizer. These connections keep the signal clean so the network doesn't crash during training.
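The three steps above can be combined into one function. This is a deliberately simplified single-head sketch in NumPy (random toy weights, no multi-head split, no learned layer-norm parameters), not a faithful implementation of any particular library:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # The stabilizer: rescale each token's vector to zero mean, unit spread.
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def encoder_block(x, Wq, Wk, Wv, W1, W2):
    # 1. The "social" step: every token attends to every other token.
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    attn_out = softmax(scores) @ V
    x = layer_norm(x + attn_out)           # residual connection + norm
    # 2. The "thinking" step: a private feed-forward net per token.
    ffn_out = np.maximum(0, x @ W1) @ W2   # ReLU MLP
    return layer_norm(x + ffn_out)         # residual connection + norm

rng = np.random.default_rng(0)
d = 4
x = rng.normal(size=(3, d))                # 3 tokens, d_model = 4
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
W1, W2 = rng.normal(size=(d, 8)), rng.normal(size=(8, d))
out = encoder_block(x, Wq, Wk, Wv, W1, W2)
print(out.shape)  # (3, 4): same shape in and out, so blocks can be stacked
```

Because the output has the same shape as the input, dozens of these blocks can be stacked in a row, exactly as the text describes.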
Scaled Dot-Product Attention: Step by Step
This is the secret sauce. A token asks a question (Query), matches it against others (Key), and if they match, it absorbs information (Value). It's a soft, fuzzy lookup table.
Tokens: Thinking, Machines, with d_k = 3 (simplified). Each token's embedding is linearly projected into Q, K, V:
Each token's new representation is a weighted blend of all value vectors.
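The two-token lookup above, written out in NumPy. The Q, K, V numbers are made up for illustration (the real values come from learned projections of the embeddings), but the mechanics are the standard scaled dot-product:

```python
import numpy as np

# Hypothetical projections for the tokens "Thinking" and "Machines", d_k = 3.
Q = np.array([[1.0, 0.0, 1.0],    # query for "Thinking"
              [0.0, 1.0, 1.0]])   # query for "Machines"
K = np.array([[1.0, 1.0, 0.0],    # keys
              [0.0, 1.0, 1.0]])
V = np.array([[0.5, 0.0, 0.5],    # values
              [0.0, 0.5, 0.5]])

d_k = Q.shape[-1]
scores = Q @ K.T / np.sqrt(d_k)            # how well each query matches each key
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)  # softmax: each row sums to 1
output = weights @ V                       # weighted blend of all value vectors

print(weights.round(2))  # one row of attention weights per token
print(output.round(2))   # each token's new, context-aware representation
```

The softmax is what makes this a "soft, fuzzy" lookup: instead of picking one matching key, every key contributes in proportion to its match score.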
Interactive Visualizations
"The animal did not cross the street because it was too tired"
Click a purple word to see its attention distribution. Darker orange = higher weight.
Watch how one word pulls information from the words that help explain it. That is the practical power of attention.
Transformers vs RNNs
| Property | RNN/LSTM | Transformer |
|---|---|---|
| Parallelism | ✗ Token-by-token | ✓ All tokens at once |
| Long-range | ⚠ Vanishing gradient | ✓ Direct O(1) path between any pair |
| Training speed | ✗ Slow on GPU | ✓ Highly parallelisable |
| Scalability | ⚠ Plateaus early | ✓ Scales; the foundation of LLMs |
The big story is simple: transformers handle long-range relationships and parallel training much better, which is why they became the foundation for modern LLMs.
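The contrast in the table can be seen directly in code. This toy sketch (random weights, tiny dimensions) shows why: the RNN must run a loop where step t waits for step t-1, so a signal from the first token survives only by passing through every hidden-state update, while attention connects any pair of tokens in one parallel computation:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4
x = rng.normal(size=(T, d))  # T tokens, d-dimensional

# RNN: a sequential loop. T dependent steps; token 0's influence on the
# final state must survive T-1 tanh-squashed updates.
Wh = rng.normal(size=(d, d)) * 0.1
Wx = rng.normal(size=(d, d)) * 0.1
h = np.zeros(d)
for t in range(T):
    h = np.tanh(h @ Wh + x[t] @ Wx)

# Attention: one parallel computation. Every token interacts with every
# other token through a single weighted sum, an O(1)-length path.
scores = x @ x.T / np.sqrt(d)
w = np.exp(scores - scores.max(-1, keepdims=True))
w /= w.sum(-1, keepdims=True)
ctx = w @ x

print(h.shape, ctx.shape)  # (4,) (6, 4)
```

The loop is inherently serial; the matrix products are exactly the kind of work GPUs parallelise well, which is the practical reason transformers train so much faster at scale.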