💬 Large Language Models
Predicting the next token at scale
Take your time with this one. The interactive parts are here to help you test the idea, not rush through it.
Pause and experiment as you go.
Before We Begin
What we are learning today
Large language models scale the transformer recipe dramatically: more data, more parameters, more compute, and more opportunities to learn patterns in how language is structured. Their core training objective is still simple next-token prediction, but at scale that objective can produce useful capabilities in reasoning, summarization, coding, and dialogue.
How this lesson fits
This module explains the modern language-model stack from the inside out. Students see how words become vectors, how attention lets models choose context dynamically, and how large-scale next-token training turns those ingredients into systems that can write, summarize, and answer questions.
The big question
How can a machine represent meaning, decide which context matters, and then generate fluent language one token at a time?
Why You Should Care
Students encounter LLMs everywhere now. They need a precise mental model of both their power and their limits: why fluent generation can look intelligent, why prompting matters, and why confident mistakes are an expected part of the system.
Where this is used today
- ✓ General-purpose assistants such as ChatGPT, Claude, and Gemini
- ✓ Code-generation and autocomplete tools that assist software development
- ✓ Summarization, drafting, search augmentation, tutoring, and customer-support workflows
Think of it like this
Imagine an improv performer who has read an enormous library and is unusually good at continuing patterns. They can often produce convincing, helpful responses, but they are still generating the next plausible continuation rather than consulting a guaranteed source of truth.
Easy mistake to make
An answer that sounds polished or confident is not automatically true. Fluency and correctness often travel together, but they are never the same thing.
By the end, you should be able to:
- Explain next-token prediction in plain language without understating its importance
- Connect pretraining, prompting, alignment, and fine-tuning as parts of one larger workflow
- Discuss why fluent output, useful reasoning, and factual reliability are related but distinct qualities
Think about this first
Why might a system that repeatedly predicts the next token learn not only grammar, but also facts, style, structure, and patterns of reasoning from the text it has seen?
Words we will keep using
Large Language Models
An LLM is basically a super-powered autocomplete. It reads text, learns the patterns, and predicts what comes next. That sounds simple, but do it enough times on enough data, and "predicting the next word" starts to look a lot like reasoning.
P(x₁, x₂, ..., xₙ) = ∏ₜ₌₁ⁿ P(xₜ | x₁, ..., xₜ₋₁)
That training objective sounds small, but it turns out to be enough to teach the model a surprising amount about language and pattern structure.
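The product formula above can be traced with a toy model. The conditional probabilities below are invented for illustration, and each token is conditioned only on its immediate predecessor to keep the sketch small (a real LLM conditions on the whole preceding context):

```python
import math

# Toy next-token model: P(next token | previous token).
# All probabilities here are made up for illustration.
cond_prob = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "ran": 0.3},
}

def sequence_log_prob(tokens):
    """Chain rule: log P(x1..xn) = sum over t of log P(x_t | x_{t-1})."""
    logp = 0.0
    prev = "<s>"
    for tok in tokens:
        logp += math.log(cond_prob[prev][tok])
        prev = tok
    return logp

p = math.exp(sequence_log_prob(["the", "cat", "sat"]))
print(round(p, 3))  # 0.6 * 0.5 * 0.7 = 0.21
```

Summing log-probabilities instead of multiplying raw probabilities is the standard trick for avoiding numerical underflow on long sequences.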
Next-Token Prediction Demo
When a chatbot writes a poem, it isn't "thinking" of the whole poem at once. It's just rolling the dice on the very next word, over and over again. The magic is that the probabilities are so well-tuned that the result makes sense.
Context (last 2 words):
Next token probabilities (T-adjusted):
Low temperature makes the model more conservative. High temperature spreads probability more widely and makes outputs less predictable.
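One common way to implement that knob is to raise each probability to the power 1/T and renormalise, which is equivalent to dividing the logits by T before the softmax. A small sketch with made-up probabilities:

```python
import random

def temperature_adjust(probs, T):
    """Raise each probability to 1/T and renormalise.
    Equivalent to dividing logits by T before the softmax.
    T < 1 sharpens the distribution (more conservative);
    T > 1 flattens it (less predictable outputs)."""
    weights = [p ** (1.0 / T) for p in probs]
    total = sum(weights)
    return [w / total for w in weights]

# Made-up next-token distribution for some context.
tokens = ["mat", "rug", "moon"]
probs = [0.5, 0.3, 0.2]

for T in (0.5, 1.0, 2.0):
    adjusted = temperature_adjust(probs, T)
    print(T, [round(p, 3) for p in adjusted])

# Sample the next token from a temperature-adjusted distribution.
next_token = random.choices(tokens, weights=temperature_adjust(probs, 0.8))[0]
```

At T = 1 the distribution is unchanged; as T → 0 sampling approaches greedy argmax, and large T approaches uniform choice.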
Key Concepts
The Reading Phase (pretraining). The model reads enormous amounts of internet text to learn grammar, facts, and reasoning patterns. It is self-supervised: the text itself provides the training signal.
The Manners Phase (alignment and fine-tuning). Humans step in to teach the model how to be helpful, harmless, and follow instructions.
The "Just Ask" Phase (prompting). You don't need to retrain the model to teach it a new trick: just show it an example in the prompt.
The Surprise Factor (emergence). When models get big enough, they suddenly learn to do things (like coding) that they weren't explicitly trained for.
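The "Just Ask" idea can be made concrete. Below is a minimal sketch of a few-shot prompt for sentiment labelling; the reviews and labels are invented, and no particular model or API is assumed — the point is that the task is taught entirely inside the prompt:

```python
# Hypothetical few-shot prompt: the task (sentiment labelling) is taught
# in-context, with no retraining of the model.
examples = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I want those two hours of my life back.", "negative"),
]
query = "The plot dragged, but the acting saved it."

prompt = "Label the sentiment of each review.\n\n"
for review, label in examples:
    prompt += f"Review: {review}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

print(prompt)
```

The model's job is then just next-token prediction: the most plausible continuation after the final "Sentiment:" is a label in the same style as the examples.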
Scaling Laws
Here is the weirdest thing about LLMs: they are incredibly predictable. If you add more data or make the model bigger, the loss falls at a very specific, mathematical rate. We can forecast a model's perplexity (how surprised it is by held-out text; lower is better) before we even build it.
The exact curve is less important than the message: progress often looked smooth and predictable as models got larger.
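These smooth curves are typically fit as power laws of the form L(N) = (N_c / N)^α, where N is parameter count. The constants below are illustrative placeholders of roughly the right magnitude, not fitted values from any particular paper:

```python
def predicted_loss(n_params, n_c=8.8e13, alpha=0.076):
    """Power-law scaling sketch: loss falls as (N_c / N)^alpha.
    n_c and alpha are illustrative placeholders, not published fits."""
    return (n_c / n_params) ** alpha

# Bigger models -> lower predicted loss, at a smooth, predictable rate.
for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.3f}")
```

The practical upshot: before training a large model, labs fit this curve on small models and extrapolate to decide whether the big run is worth the compute.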
The Scaling Timeline
2017 · Attention Is All You Need — the transformer architecture breakthrough
2018 · BERT — bidirectional pre-training, fine-tuning paradigm
2019 · GPT-2 — autoregressive language model, initially deemed "too dangerous to release"
2020 · GPT-3 — few-shot learning, emergent capabilities at scale
2022 · ChatGPT — RLHF alignment, conversational AI goes mainstream
2023 · GPT-4 and open-weight models — multimodal reasoning, open-source competition begins in earnest
2024 · Long context (1M+ tokens), efficiency push, open-source closes the gap
2025 · Test-time compute scaling, chain-of-thought reasoning models, DeepSeek disrupts cost assumptions globally
Reasoning-native by default, multimodal as baseline, efficiency over raw parameter count becomes the metric
NLP Evaluation Metrics
Scoring generated text is harder than scoring a yes/no answer. There is usually more than one acceptable sentence, so NLP uses a family of metrics that each capture only part of quality.
Interactive BLEU / ROUGE Calculator
Green = matched in reference · Red = unmatched
Example output for one hypothesis/reference pair:

| Metric | Score | Basis | Detail |
|---|---|---|---|
| BLEU-1 | 71.4 | unigram precision | BP = 1.000 |
| BLEU-2 | 59.8 | + bigram precision (geometric mean) | |
| ROUGE-1 | 76.9 | unigram F1 | R = 83%, P = 71% |
| ROUGE-2 | 54.5 | bigram F1 | R = 60%, P = 50% |
BLEU precision bias: try making the hypothesis much shorter than the reference. BLEU-1 stays high but BLEU-2 drops — the brevity penalty partially corrects for this.
ROUGE recall bias: now make the hypothesis much longer. ROUGE recall stays high because it measures how much of the reference you covered, regardless of extra words.
Both fail on paraphrase: replace a reference word with a perfect synonym. Both scores drop — neither metric understands meaning, only surface overlap.
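To make the overlap mechanics concrete, here is a minimal sketch of BLEU-1 (with the brevity penalty omitted) and ROUGE-1 over whitespace tokens. Real implementations such as sacreBLEU add smoothing, tokenisation rules, multi-reference support, and the brevity penalty:

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap(hyp, ref, n):
    """Clipped n-gram matches: each hypothesis n-gram counts at most
    as often as it appears in the reference."""
    h, r = ngram_counts(hyp, n), ngram_counts(ref, n)
    return sum(min(count, r[gram]) for gram, count in h.items())

def bleu1(hyp, ref):
    # Unigram precision: matched hypothesis tokens / hypothesis length.
    return overlap(hyp, ref, 1) / len(hyp)

def rouge1(hyp, ref):
    # Unigram F1 from recall (vs reference) and precision (vs hypothesis).
    m = overlap(hyp, ref, 1)
    if m == 0:
        return 0.0
    recall, precision = m / len(ref), m / len(hyp)
    return 2 * precision * recall / (precision + recall)

ref = "the cat sat on the mat".split()
hyp = "the cat sat on a mat".split()
print(round(bleu1(hyp, ref), 3))   # 0.833 — 5 of 6 hypothesis unigrams match
print(round(rouge1(hyp, ref), 3))  # 0.833
```

Swapping "mat" for the synonym "rug" in the hypothesis would drop both scores, illustrating the paraphrase problem: these metrics see surface overlap, not meaning.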
| Metric | What it measures | Formula core | Best for | Weakness |
|---|---|---|---|---|
| Perplexity | How surprised the model is by a test corpus | exp(−(1/N) Σ log P(xₜ|x<t)) | LM comparison | Not comparable across tokenisations |
| BLEU | N-gram precision of hypothesis vs reference(s) | BP · exp(Σ wₙ log pₙ) | Machine translation | Precision-biased; punishes paraphrase |
| ROUGE-N | N-gram recall of reference covered by hypothesis | matched-ngrams / ref-ngrams | Summarisation | Recall-biased; padding inflates score |
| ROUGE-L | Longest Common Subsequence F1 | F1 of LCS/len(hyp) and LCS/len(ref) | Summarisation, fluency | Ignores non-contiguous order quality |
| METEOR | Alignment F1 with synonym matching & stemming | F_mean · (1 − penalty) | Translation; handles paraphrase | Language-dependent synonym tables |
| BERTScore | Token cosine similarity via contextual BERT embeddings | P·R·F1 over embedding matches | Any generation; semantic quality | Expensive; model-dependent |
| chrF | Character n-gram F-score — no word boundary dependence | F1 over char-ngrams | Morphologically rich languages | Less interpretable than word-level |
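The perplexity row can be sanity-checked in a few lines. The per-token probabilities below are invented; the point is that a model assigning high probability to what actually comes next earns a low perplexity:

```python
import math

def perplexity(token_probs):
    """PPL = exp(-(1/N) * sum_t log P(x_t | x_<t)),
    matching the formula in the table above."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

confident = [0.9, 0.8, 0.95, 0.85]   # model rarely surprised
uncertain = [0.2, 0.1, 0.3, 0.15]    # model often surprised

print(round(perplexity(confident), 2))
print(round(perplexity(uncertain), 2))
```

A perplexity of k roughly means the model is as surprised, on average, as if it were choosing uniformly among k tokens at each step — which is also why the table warns that perplexity is not comparable across different tokenisations.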
BLEU vs ROUGE in one sentence
BLEU cares more about how much of your output matches the reference. ROUGE cares more about how much of the reference you managed to cover.
The paraphrase problem
If two sentences mean the same thing but use different words, simple overlap metrics can score them unfairly. That is a major weakness to keep in mind.
Perplexity ≠ quality
A model can be good at predicting text patterns and still say false, unsafe, or unhelpful things. Fluent output is not the same as trustworthy output.