
πŸ”Attention & Transformers

How models decide what to focus on

Take your time with this one. The interactive parts are here to help you test the idea, not rush through it.

35 min · Explore at your own pace

Before We Begin

What we are learning today

Transformers replace step-by-step recurrence with attention. Instead of forcing information to travel through a long chain of time steps, each token can directly examine other tokens, assign weights to them, and build a context-aware representation in parallel.

How this lesson fits

This module explains the modern language-model stack from the inside out. Students see how words become vectors, how attention lets models choose context dynamically, and how large-scale next-token training turns those ingredients into systems that can write, summarize, and answer questions.

The big question

How can a machine represent meaning, decide which context matters, and then generate fluent language one token at a time?

  • Explain how text is converted into numeric representations without losing the idea of meaning
  • Interpret attention as a selective context mechanism rather than as magic
  • Describe the workflow of large language models from pretraining through generation

Why You Should Care

Transformers are the backbone of modern language models and many other state-of-the-art systems. Understanding attention gives students a concrete picture of how models preserve context, resolve references, and scale to large datasets.

Where this is used today

  • ✓ Machine translation and search systems that require strong context modeling
  • ✓ BERT-style encoders used for retrieval, ranking, and language understanding tasks
  • ✓ Scientific models such as AlphaFold that borrow attention mechanisms beyond text

Think of it like this

Imagine annotating a paragraph while answering a question. You do not reread every word with equal intensity. You glance across the whole passage, identify the most relevant phrases, and give those more mental weight while interpreting the sentence.

Easy mistake to make

Attention is not consciousness or human-like focus. It is a mathematical weighting mechanism that tells the model which positions should influence the current token more strongly.
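To make "weighting mechanism" concrete, here is a minimal sketch of the softmax step that turns raw relevance scores into attention weights. The scores are made-up numbers for illustration, not taken from any model:

```python
import math

def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Toy relevance scores for three context positions (arbitrary values):
scores = [2.0, 0.5, -1.0]
weights = softmax(scores)
print(weights)  # the highest score receives the largest weight
```

The model never "chooses" a position; it just mixes all of them, with higher-scoring positions contributing more.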

By the end, you should be able to say:

  • Explain attention as weighted interaction across tokens rather than as vague focus
  • Describe queries, keys, and values at a high level and what role each plays
  • Explain why transformers handle long-range context better than older recurrent sequence models

Think about this first

In the sentence 'The animal did not cross the street because it was tired,' what does 'it' refer to, and which other words in the sentence help you settle that reference?

Words we will keep using

attention · token · query · key · value

From Text to Vectors

A transformer doesn't read like you do. First, it chops text into "tokens" (pieces of words). Then it turns those tokens into ID numbers, and finally into lists of coordinates (vectors). Only then can it start doing math on meaning.

Raw text: "Hello world"
→ Tokens: Hello · world
→ Token IDs: 7592 · 2088
→ Embeddings: [0.2, -0.5, 0.8] · [0.9, 0.3, -0.2]
→ + Positional Encoding: [0.3, -0.4, 0.9] · [0.8, 0.4, -0.1]
→ Encoder ×N (MHA + FFN, repeated)
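The pipeline above can be sketched in a few lines. This is a toy mock-up, not a real tokenizer: the whitespace split, the two-entry vocabulary, and the pseudo-random embedding table are all stand-ins (the two IDs are simply copied from the diagram), whereas real systems learn a subword vocabulary and an embedding matrix during training.

```python
import random

# Toy vocabulary mapping token strings to IDs. Real tokenizers (BPE, WordPiece)
# learn a vocabulary of subword pieces; these two entries are illustrative only.
vocab = {"Hello": 7592, "world": 2088}

def tokenize(text):
    # Stand-in for subword tokenization: a plain whitespace split.
    return text.split()

def embed(token_id, dim=3):
    # Stand-in for a learned embedding table: a deterministic
    # pseudo-random vector per token ID.
    rng = random.Random(token_id)
    return [round(rng.uniform(-1, 1), 2) for _ in range(dim)]

tokens = tokenize("Hello world")
ids = [vocab[t] for t in tokens]
vectors = [embed(i) for i in ids]
print(tokens, ids, vectors)
```

Only after this last step does the model have something it can do linear algebra on.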

The Encoder Block

Input Embeddings + Positional Encoding → Multi-Head Self-Attention → Add & LayerNorm → Feed-Forward Network (×4 wider) → Add & LayerNorm → Output
(residual connections skip around both the attention and feed-forward sub-layers)

This is the engine room. Every token looks at every other token to figure out context, then processes that information through a position-wise feed-forward network. Stacking these blocks repeats the process dozens of times in a row.

Multi-Head Self-Attention
The "social" step. Each word asks: "Who else in this sentence helps explain me?"
Feed-Forward Network
The "thinking" step. The word digests what it learned from its neighbors and updates its own meaning.
Residual + LayerNorm
The stabilizer. These connections keep the signal clean so the network doesn't crash during training.
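Putting those three pieces together, a single encoder block can be sketched as follows. This is a minimal single-head version with arbitrary toy sizes and random weights, assuming NumPy; real implementations add multiple heads, biases, and learned LayerNorm parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_tokens = 8, 4  # toy sizes, not a production model's dimensions

def layer_norm(x, eps=1e-5):
    # The stabilizer: normalize each token vector to zero mean, unit variance.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def softmax_rows(s):
    s = s - s.max(-1, keepdims=True)  # shift for numerical stability
    e = np.exp(s)
    return e / e.sum(-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # The "social" step: single-head scaled dot-product self-attention.
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    return softmax_rows(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def encoder_block(x, p):
    # Sub-layer 1: attention, wrapped in a residual connection + LayerNorm.
    x = layer_norm(x + self_attention(x, p["Wq"], p["Wk"], p["Wv"]))
    # Sub-layer 2: the "thinking" step, a position-wise feed-forward network
    # with a x4 wider hidden layer, also wrapped in residual + LayerNorm.
    h = np.maximum(0.0, x @ p["W1"])  # ReLU
    return layer_norm(x + h @ p["W2"])

p = {name: rng.normal(scale=0.5, size=shape) for name, shape in {
    "Wq": (d_model, d_model), "Wk": (d_model, d_model), "Wv": (d_model, d_model),
    "W1": (d_model, 4 * d_model), "W2": (4 * d_model, d_model),
}.items()}
x = rng.normal(size=(n_tokens, d_model))
out = encoder_block(x, p)
print(out.shape)  # the (tokens, d_model) shape is preserved
```

Because the output shape matches the input shape, blocks can be stacked ×N, which is exactly what the diagram's "repeated" arrow means.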

Scaled Dot-Product Attention β€” Step by Step

This is the secret sauce. A token asks a question (Query), matches it against others (Key), and if they match, it absorbs information (Value). It's a soft, fuzzy lookup table.

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
Q (Query): "What am I looking for?"
K (Key): "What do I contain?"
V (Value): "Here is my content."

Tokens: Thinking, Machines; d_k = 3 (simplified). Each token's embedding is linearly projected into Q, K, V:

Q (Queries)
  Thinking: [0.90, 0.20, 0.50]
  Machines: [0.30, 0.80, 0.40]

K (Keys)
  Thinking: [0.70, 0.40, 0.60]
  Machines: [0.20, 0.90, 0.30]

V (Values)
  Thinking: [0.50, 0.80, 0.20]
  Machines: [0.70, 0.10, 0.90]

Scores = QKᵀ ÷ √3
  Thinking: [0.58, 0.29]
  Machines: [0.44, 0.52]

Attention = softmax(Scores)
  Thinking: [0.57, 0.43]
  Machines: [0.48, 0.52]

Output = Attention × V
  Thinking: [0.59, 0.50, 0.50]
  Machines: [0.60, 0.44, 0.56]

Each token's new representation is a weighted blend of all value vectors.
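You can reproduce the worked example end to end in plain Python. The Q, K, V numbers below are the ones from the tables above; rounding to two decimals recovers the same Scores, Attention, and Output rows.

```python
import math

def softmax(row):
    exps = [math.exp(v) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def matmul(A, B):
    # Plain-Python matrix multiply: A is m x k, B is k x n.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Q, K, V projections from the worked example (d_k = 3).
Q = [[0.90, 0.20, 0.50], [0.30, 0.80, 0.40]]
K = [[0.70, 0.40, 0.60], [0.20, 0.90, 0.30]]
V = [[0.50, 0.80, 0.20], [0.70, 0.10, 0.90]]

d_k = 3
K_T = [list(col) for col in zip(*K)]                              # transpose K
scores = [[v / math.sqrt(d_k) for v in row] for row in matmul(Q, K_T)]
weights = [softmax(row) for row in scores]                        # attention
output = matmul(weights, V)                                       # blend values

for token, row in zip(["Thinking", "Machines"], output):
    print(token, [round(v, 2) for v in row])  # matches the Output table above
```

Note how "Thinking" ends up closer to its own value vector (weight 0.57) while "Machines" blends the two values almost evenly: the weighting, not any hard selection, is what mixes context into each token.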

Interactive Visualizations

"The animal did not cross the street because it was too tired"


Watch how one word pulls information from the words that help explain it. That is the practical power of attention.

Transformers vs RNNs

Property       | RNN/LSTM                | Transformer
Parallelism    | ❌ Token-by-token       | ✅ All tokens at once
Long-range     | ⚠️ Vanishing gradient   | ✅ Direct O(1) path between any pair
Training speed | 🐢 Slow on GPU          | 🚀 Highly parallelisable
Scalability    | ⚠️ Plateaus early       | ✅ Scales → foundation of LLMs

The big story is simple: transformers handle long-range relationships and parallel training much better, which is why they became the foundation for modern LLMs.
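A toy way to see the long-range row of the table: in a recurrence, the first token's influence on a later state shrinks geometrically with distance, while attention connects any two positions directly. The decay and weight constants below are arbitrary illustrative values, not measurements from a trained model.

```python
def rnn_influence(n, decay=0.5):
    # Influence of token 0 on the hidden state after n steps of the
    # toy recurrence h = decay * h + x: it fades geometrically.
    return decay ** n

def attention_influence(n, weight=0.25):
    # With attention, position n attends to token 0 directly, so the
    # connection strength does not depend on distance at all.
    return weight

for n in (1, 10, 100):
    print(n, rnn_influence(n), attention_influence(n))
```

By 100 steps the recurrent signal is numerically negligible, while the attention path is exactly as strong as at distance 1; this distance-independent access is the "Direct O(1) path" in the table.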