💬 Large Language Models
Predicting the next token at scale
Take your time with this one. The interactive parts are here to help you test the idea, not rush through it.
Pause and experiment as you go.
Before We Begin
What we are learning today
Large language models scale the transformer recipe dramatically: more data, more parameters, more compute, and more opportunities to learn patterns in how language is structured. Their core training objective is still simple next-token prediction, but at scale that objective can produce useful capabilities in reasoning, summarization, coding, and dialogue.
How this lesson fits
This module explains the modern language-model stack from the inside out. Students see how words become vectors, how attention lets models choose context dynamically, and how large-scale next-token training turns those ingredients into systems that can write, summarize, and answer questions.
The big question
How can a machine represent meaning, decide which context matters, and then generate fluent language one token at a time?
Why You Should Care
Students encounter LLMs everywhere now. They need a precise mental model of both their power and their limits: why fluent generation can look intelligent, why prompting matters, and why confident mistakes are an expected part of the system.
Where this is used today
- ✓ General-purpose assistants such as ChatGPT, Claude, and Gemini
- ✓ Code-generation and autocomplete tools that assist software development
- ✓ Summarization, drafting, search augmentation, tutoring, and customer-support workflows
Think of it like this
Imagine an improv performer who has read an enormous library and is unusually good at continuing patterns. They can often produce convincing, helpful responses, but they are still generating the next plausible continuation rather than consulting a guaranteed source of truth.
Easy mistake to make
An answer that sounds polished or confident is not automatically true. Fluency and correctness often travel together, but they are never the same thing.
By the end, you should be able to:
- Explain next-token prediction in plain language without understating its importance
- Connect pretraining, prompting, alignment, and fine-tuning as parts of one larger workflow
- Discuss why fluent output, useful reasoning, and factual reliability are related but distinct qualities
Think about this first
Why might a system that repeatedly predicts the next token learn not only grammar, but also facts, style, structure, and patterns of reasoning from the text it has seen?
Words we will keep using
Large Language Models
An LLM is basically a super-powered autocomplete. It reads text, learns the patterns, and predicts what comes next. That sounds simple, but do it enough times on enough data, and "predicting the next word" starts to look a lot like reasoning.
P(x₁, x₂, ..., xₙ) = ∏ₜ₌₁ⁿ P(xₜ | x₁, ..., xₜ₋₁)
That training objective sounds small, but it turns out to be enough to teach the model a surprising amount about language and pattern structure.
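The product formula above can be traced with a toy model. The conditional probabilities below are invented for illustration, and each token is conditioned only on its immediate predecessor to keep the sketch small (a real LLM conditions on the whole preceding context):

```python
import math

# Toy next-token model: P(next token | previous token).
# All probabilities here are made up for illustration.
cond_prob = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "ran": 0.3},
}

def sequence_log_prob(tokens):
    """Chain rule: log P(x1..xn) = sum over t of log P(x_t | x_{t-1})."""
    logp = 0.0
    prev = "<s>"
    for tok in tokens:
        logp += math.log(cond_prob[prev][tok])
        prev = tok
    return logp

p = math.exp(sequence_log_prob(["the", "cat", "sat"]))
print(round(p, 3))  # 0.6 * 0.5 * 0.7 = 0.21
```

Summing log-probabilities instead of multiplying raw probabilities is the standard trick for avoiding numerical underflow on long sequences.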
Next-Token Prediction Demo
When a chatbot writes a poem, it isn't "thinking" of the whole poem at once. It's just rolling the dice on the very next word, over and over again. The magic is that the probabilities are so well-tuned that the result makes sense.
Context (last 2 words):
Next token probabilities (T-adjusted):
Low temperature makes the model more conservative. High temperature spreads probability more widely and makes outputs less predictable.
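One common way to implement that knob is to raise each probability to the power 1/T and renormalise, which is equivalent to dividing the logits by T before the softmax. A small sketch with made-up probabilities:

```python
import random

def temperature_adjust(probs, T):
    """Raise each probability to 1/T and renormalise.
    Equivalent to dividing logits by T before the softmax.
    T < 1 sharpens the distribution (more conservative);
    T > 1 flattens it (less predictable outputs)."""
    weights = [p ** (1.0 / T) for p in probs]
    total = sum(weights)
    return [w / total for w in weights]

# Made-up next-token distribution for some context.
tokens = ["mat", "rug", "moon"]
probs = [0.5, 0.3, 0.2]

for T in (0.5, 1.0, 2.0):
    adjusted = temperature_adjust(probs, T)
    print(T, [round(p, 3) for p in adjusted])

# Sample the next token from a temperature-adjusted distribution.
next_token = random.choices(tokens, weights=temperature_adjust(probs, 0.8))[0]
```

At T = 1 the distribution is unchanged; as T → 0 sampling approaches greedy argmax, and large T approaches uniform choice.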
Key Concepts
The Reading Phase (pretraining). The model reads enormous amounts of internet text to learn grammar, facts, and reasoning patterns. It is self-supervised: the text itself provides the training signal.
The Manners Phase (alignment and fine-tuning). Humans step in to teach the model how to be helpful, harmless, and follow instructions.
The "Just Ask" Phase (prompting). You don't need to retrain the model to teach it a new trick: just show it an example in the prompt.
The Surprise Factor (emergence). When models get big enough, they suddenly learn to do things (like coding) that they weren't explicitly trained for.
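The "Just Ask" idea can be made concrete. Below is a minimal sketch of a few-shot prompt for sentiment labelling; the reviews and labels are invented, and no particular model or API is assumed — the point is that the task is taught entirely inside the prompt:

```python
# Hypothetical few-shot prompt: the task (sentiment labelling) is taught
# in-context, with no retraining of the model.
examples = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I want those two hours of my life back.", "negative"),
]
query = "The plot dragged, but the acting saved it."

prompt = "Label the sentiment of each review.\n\n"
for review, label in examples:
    prompt += f"Review: {review}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

print(prompt)
```

The model's job is then just next-token prediction: the most plausible continuation after the final "Sentiment:" is a label in the same style as the examples.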
Scaling Laws
Here is the weirdest thing about LLMs: they are incredibly predictable. If you add more data or make the model bigger, the loss falls at a very specific, mathematical rate. We can forecast a model's perplexity (how surprised it is by held-out text; lower is better) before we even build it.
The exact curve is less important than the message: progress often looked smooth and predictable as models got larger.
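These smooth curves are typically fit as power laws of the form L(N) = (N_c / N)^α, where N is parameter count. The constants below are illustrative placeholders of roughly the right magnitude, not fitted values from any particular paper:

```python
def predicted_loss(n_params, n_c=8.8e13, alpha=0.076):
    """Power-law scaling sketch: loss falls as (N_c / N)^alpha.
    n_c and alpha are illustrative placeholders, not published fits."""
    return (n_c / n_params) ** alpha

# Bigger models -> lower predicted loss, at a smooth, predictable rate.
for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.3f}")
```

The practical upshot: before training a large model, labs fit this curve on small models and extrapolate to decide whether the big run is worth the compute.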
The Scaling Timeline
2017 · Attention Is All You Need — the transformer architecture breakthrough
2018 · BERT — bidirectional pre-training, fine-tuning paradigm
2019 · GPT-2 — autoregressive language model, initially deemed "too dangerous to release"
2020 · GPT-3 — few-shot learning, emergent capabilities at scale
2022 · ChatGPT — RLHF alignment, conversational AI goes mainstream
2023 · GPT-4 and open-weight models — multimodal reasoning, open-source competition begins in earnest
2024 · Long context (1M+ tokens), efficiency push, open-source closes the gap
2025 · Test-time compute scaling, chain-of-thought reasoning models, DeepSeek disrupts cost assumptions globally
Reasoning-native by default, multimodal as baseline, efficiency over raw parameter count becomes the metric
NLP Evaluation Metrics
Scoring generated text is harder than scoring a yes/no answer. There is usually more than one acceptable sentence, so NLP uses a family of metrics that each capture only part of quality.
Interactive BLEU / ROUGE Calculator
Green = matched in reference · Red = unmatched
Example output for one hypothesis/reference pair:

| Metric | Score | Basis | Detail |
|---|---|---|---|
| BLEU-1 | 71.4 | unigram precision | BP = 1.000 |
| BLEU-2 | 59.8 | + bigram precision (geometric mean) | |
| ROUGE-1 | 76.9 | unigram F1 | R = 83%, P = 71% |
| ROUGE-2 | 54.5 | bigram F1 | R = 60%, P = 50% |
BLEU precision bias: try making the hypothesis much shorter than the reference. BLEU-1 stays high but BLEU-2 drops — the brevity penalty partially corrects for this.
ROUGE recall bias: now make the hypothesis much longer. ROUGE recall stays high because it measures how much of the reference you covered, regardless of extra words.
Both fail on paraphrase: replace a reference word with a perfect synonym. Both scores drop — neither metric understands meaning, only surface overlap.
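To make the overlap mechanics concrete, here is a minimal sketch of BLEU-1 (with the brevity penalty omitted) and ROUGE-1 over whitespace tokens. Real implementations such as sacreBLEU add smoothing, tokenisation rules, multi-reference support, and the brevity penalty:

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap(hyp, ref, n):
    """Clipped n-gram matches: each hypothesis n-gram counts at most
    as often as it appears in the reference."""
    h, r = ngram_counts(hyp, n), ngram_counts(ref, n)
    return sum(min(count, r[gram]) for gram, count in h.items())

def bleu1(hyp, ref):
    # Unigram precision: matched hypothesis tokens / hypothesis length.
    return overlap(hyp, ref, 1) / len(hyp)

def rouge1(hyp, ref):
    # Unigram F1 from recall (vs reference) and precision (vs hypothesis).
    m = overlap(hyp, ref, 1)
    if m == 0:
        return 0.0
    recall, precision = m / len(ref), m / len(hyp)
    return 2 * precision * recall / (precision + recall)

ref = "the cat sat on the mat".split()
hyp = "the cat sat on a mat".split()
print(round(bleu1(hyp, ref), 3))   # 0.833 — 5 of 6 hypothesis unigrams match
print(round(rouge1(hyp, ref), 3))  # 0.833
```

Swapping "mat" for the synonym "rug" in the hypothesis would drop both scores, illustrating the paraphrase problem: these metrics see surface overlap, not meaning.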
| Metric | What it measures | Formula core | Best for | Weakness |
|---|---|---|---|---|
| Perplexity | How surprised the model is by a test corpus | exp(−(1/N) Σ log P(xₜ|x<t)) | LM comparison | Not comparable across tokenisations |
| BLEU | N-gram precision of hypothesis vs reference(s) | BP · exp(Σ wₙ log pₙ) | Machine translation | Precision-biased; punishes paraphrase |
| ROUGE-N | N-gram recall of reference covered by hypothesis | matched-ngrams / ref-ngrams | Summarisation | Recall-biased; padding inflates score |
| ROUGE-L | Longest Common Subsequence F1 | F1 of LCS/len(hyp) and LCS/len(ref) | Summarisation, fluency | Ignores non-contiguous order quality |
| METEOR | Alignment F1 with synonym matching & stemming | F_mean · (1 − penalty) | Translation; handles paraphrase | Language-dependent synonym tables |
| BERTScore | Token cosine similarity via contextual BERT embeddings | P·R·F1 over embedding matches | Any generation; semantic quality | Expensive; model-dependent |
| chrF | Character n-gram F-score — no word boundary dependence | F1 over char-ngrams | Morphologically rich languages | Less interpretable than word-level |
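The perplexity row can be sanity-checked in a few lines. The per-token probabilities below are invented; the point is that a model assigning high probability to what actually comes next earns a low perplexity:

```python
import math

def perplexity(token_probs):
    """PPL = exp(-(1/N) * sum_t log P(x_t | x_<t)),
    matching the formula in the table above."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

confident = [0.9, 0.8, 0.95, 0.85]   # model rarely surprised
uncertain = [0.2, 0.1, 0.3, 0.15]    # model often surprised

print(round(perplexity(confident), 2))
print(round(perplexity(uncertain), 2))
```

A perplexity of k roughly means the model is as surprised, on average, as if it were choosing uniformly among k tokens at each step — which is also why the table warns that perplexity is not comparable across different tokenisations.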
BLEU vs ROUGE in one sentence
BLEU cares more about how much of your output matches the reference. ROUGE cares more about how much of the reference you managed to cover.
The paraphrase problem
If two sentences mean the same thing but use different words, simple overlap metrics can score them unfairly. That is a major weakness to keep in mind.
Perplexity ≠ quality
A model can be good at predicting text patterns and still say false, unsafe, or unhelpful things. Fluent output is not the same as trustworthy output.