RNNs & LSTMs
Neural networks with memory
Take your time with this one. The interactive parts are here to help you test the idea, not rush through it.
Pause and experiment as you go.
Before We Begin
What we are learning today
Recurrent neural networks process sequences one step at a time while carrying a hidden state that summarizes what has been seen so far. LSTMs extend this idea with gates that regulate what information should be kept, updated, or forgotten, making longer-range dependencies easier to handle.
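The "one step at a time, carrying a hidden state" idea fits in a few lines. Below is a minimal sketch of a single RNN cell (sizes, weight names, and the random toy sequence are all illustrative, not a library API): the same weights are applied at every step, and the hidden vector `h` is the learned summary of everything seen so far.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # Blend the new input with the carried-forward state.
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

h = np.zeros(hidden_size)                     # empty memory before the sequence
for x_t in rng.normal(size=(5, input_size)):  # a toy 5-step sequence
    h = rnn_step(x_t, h)                      # same weights reused at every step
print(h.shape)  # (4,)
```

Note that nothing except `h` is passed between steps: that single vector has to stand in for the entire history.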
How this lesson fits
Some data points only make sense when you know what came before them. This module studies models built for ordered information such as language, audio, weather, and time series, where sequence and memory matter as much as the current input.
The big question
How can a model represent the past well enough to make a strong decision about what is happening now or what should happen next?
Why You Should Care
This lesson shows how sequence modeling moved from explicit probabilistic tables toward learned memory systems. It also explains why older recurrent models struggled and what engineering ideas were introduced to make them more usable.
Where this is used today
- Language modeling tasks that predict the next word or classify a sentence
- Time-series forecasting in finance, operations, and sensor monitoring
- Music and audio generation, where the next event depends on prior rhythm and structure
Think of it like this
Reading a paragraph is a good analogy. Each new sentence changes your understanding, but you do not rebuild that understanding from zero every time. You carry forward a mental state that keeps the important context alive.
Easy mistake to make
LSTMs do not remember everything forever. They are simply better than basic RNNs at preserving the information that training suggests is worth keeping.
By the end, you should be able to say:
- Explain hidden state as a learned summary of previous time steps
- Describe why vanilla RNNs struggle with long-range dependencies and vanishing gradients
- Explain how LSTM gates regulate remembering, writing, and forgetting information
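The vanishing-gradient claim in the second bullet can be seen numerically. Backpropagating through many tanh steps multiplies many Jacobians whose norms sit below 1, so the gradient signal shrinks geometrically with distance. A toy sketch, with illustrative sizes and weight scale:

```python
import numpy as np

rng = np.random.default_rng(1)
hidden_size, steps = 8, 50
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))

# Run the recurrence forward and record each hidden state.
h = np.zeros(hidden_size)
states = []
for _ in range(steps):
    h = np.tanh(W_hh @ h + rng.normal(size=hidden_size))
    states.append(h)

# Backpropagate a unit gradient from the last step toward the first.
grad = np.ones(hidden_size)
norms = []
for h_t in reversed(states):
    # Jacobian of one tanh step, transposed: W^T diag(1 - h_t^2)
    grad = W_hh.T @ ((1 - h_t**2) * grad)
    norms.append(np.linalg.norm(grad))

print(norms[0], norms[-1])  # gradient norm near the end vs near the start
```

The gradient reaching the earliest steps is many orders of magnitude smaller than the gradient near the end, which is exactly why vanilla RNNs struggle to learn long-range dependencies.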
Think about this first
Why is the last word in a long sentence often impossible to interpret correctly if you have forgotten the earlier context? Give an example where the beginning changes the meaning of the end.
Words we will keep using
Recurrent Neural Networks & LSTMs
Standard neural networks have amnesia: they treat every input as brand new. RNNs have a memory. They read one word at a time, carrying a "thought" forward that summarizes everything they've seen so far.
RNN: Simple, but it forgets quickly. Good for short sentences, bad for paragraphs.
LSTM: The pro version. It has special "gates" that let it choose what to remember and what to forget, so it can track ideas over long distances.
LSTM Gates
"Should I throw away this old memory?"
"Is this new information worth saving?"
"What is the new content I might add?"
"What should I tell the next layer right now?"
This is the real memory line of the LSTM. It is designed to carry useful information farther through time.
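The four gate questions combine into one update. Here is a minimal sketch of a single LSTM step (sizes and weight names are illustrative, not a library API); each gate is a small learned function of the previous hidden state and the current input, and the cell state `c` is the memory line the gates protect.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_h = 3, 4
# One weight matrix per gate, each acting on the concatenation [h_prev; x_t].
W_f, W_i, W_g, W_o = (rng.normal(scale=0.1, size=(n_h, n_h + n_in)) for _ in range(4))
b_f = np.zeros(n_h); b_i = np.zeros(n_h); b_g = np.zeros(n_h); b_o = np.zeros(n_h)

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z + b_f)   # forget gate: should I throw away this old memory?
    i = sigmoid(W_i @ z + b_i)   # input gate: is this new information worth saving?
    g = np.tanh(W_g @ z + b_g)   # candidate values: the new content I might add
    o = sigmoid(W_o @ z + b_o)   # output gate: what to tell the next layer now
    c = f * c_prev + i * g       # cell state: the real memory line
    h = o * np.tanh(c)
    return h, c

h = c = np.zeros(n_h)
for x_t in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x_t, h, c)
print(h.shape, c.shape)
```

The key design choice is the cell-state update `c = f * c_prev + i * g`: because it is additive rather than a repeated matrix multiplication, gradients can flow along it much farther before fading.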
HMMs and RNNs: what is the actual difference?
Students often learn HMMs and RNNs as if they belong to different worlds, but they are actually related. Both keep a hidden state that summarizes the past. The big difference is how that state is represented and updated.
What they fundamentally share
Both models rely on the same core idea: the current hidden state should summarize the important parts of the past. That means the next step can be computed from the previous state plus the current input, instead of storing the whole history directly.
In an HMM, the next hidden state comes from a probability table.
In an RNN, the next hidden state comes from learned weights instead of a small probability table.
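Those two sentences can be shown side by side. In this sketch (all numbers illustrative), the HMM update is a row lookup in a fixed stochastic matrix, while the RNN update multiplies by learned weights and applies a nonlinearity:

```python
import numpy as np

# HMM: the hidden state is a distribution over K discrete states;
# the transition is a fixed stochastic matrix (each row sums to 1).
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
p = np.array([1.0, 0.0])   # start certain we are in state 0
p_next = p @ A             # next-state distribution: a table lookup
print(p_next)              # [0.9 0.1]

# RNN: the hidden state is a continuous vector; the "transition"
# is a learned weight matrix plus a nonlinearity.
rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.5, size=(2, 2))
h = np.array([1.0, 0.0])
h_next = np.tanh(W_hh @ h)  # arbitrary real-valued update
print(h_next)
```

The HMM's rows must sum to 1, so its expressiveness is capped by the table size K; the RNN's weights face no such constraint, which is the flexibility the rest of this section discusses.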
The three biggest differences
| Axis | HMM | RNN |
|---|---|---|
| Hidden state type | Discrete: a point mass on one of K states | Continuous: a dense vector of arbitrary real values |
| Transition mechanism | Lookup in a fixed stochastic matrix (rows sum to 1) | Learned weight matrix plus a nonlinearity |
| Inference at test time | Required: a forward pass or Viterbi must marginalise over K hidden states | None: the state is deterministic; just compute the forward pass, and h_t is the state |
| Learning algorithm | Baum-Welch (EM), taking expectations over hidden states | Backprop through time (BPTT), gradients through the unrolled graph |
The key insight
A good way to think about it is this: HMMs use a tidy probability table for transitions, while RNNs replace that table with flexible learned weights. That extra flexibility is why RNNs can model richer patterns.
The cost of that flexibility is interpretability. In an HMM, the hidden state can often be named clearly. In an RNN, the hidden state is a vector, so the meaning is spread across many numbers at once.
Why HMMs need inference
In an HMM, you never directly see the hidden state. So at each step you must reason over several possibilities and keep track of their probabilities.
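That bookkeeping over possibilities is exactly the forward algorithm: at every step, each candidate hidden state is re-weighted by the transition table and by how well it explains the current observation. A minimal sketch with made-up numbers:

```python
import numpy as np

# Toy 2-state HMM with 2 observation symbols (all probabilities illustrative).
A = np.array([[0.7, 0.3],    # transition probabilities between hidden states
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],    # B[state, symbol]: emission probabilities
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])    # initial state distribution

obs = [0, 1, 1]              # an observed symbol sequence

# Forward pass: alpha[k] = P(observations so far, hidden state = k).
alpha = pi * B[:, obs[0]]
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]   # marginalise over the previous state

print(alpha.sum())           # probability of the whole observation sequence
```

Notice that `alpha` is a vector of probabilities over candidate states, maintained at every step; this is the extra inference work an RNN never has to do.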
Why RNNs need no inference
In an RNN, the hidden state is just computed directly. There is no extra uncertainty calculation over several candidate states. You simply run the network forward.
The continuous spectrum
You can think of these models as one family with increasing flexibility. As you move right, the state representation becomes richer and the model becomes better at handling complex sequence patterns.
GRU: A Simpler Alternative
A GRU is a lighter version of an LSTM. It uses fewer moving parts, but still tries to control memory with gates.
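"Fewer moving parts" concretely means two gates instead of three and no separate cell state. A minimal sketch of one GRU step (sizes and weight names are illustrative, not a library API):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_h = 3, 4
W_z, W_r, W_h = (rng.normal(scale=0.1, size=(n_h, n_h + n_in)) for _ in range(3))

def gru_step(x_t, h_prev):
    zx = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ zx)   # update gate: how much of the state to rewrite
    r = sigmoid(W_r @ zx)   # reset gate: how much old state to consult
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]))  # candidate state
    return (1 - z) * h_prev + z * h_tilde   # blend old memory with new content

h = np.zeros(n_h)
for x_t in rng.normal(size=(5, n_in)):
    h = gru_step(x_t, h)
print(h.shape)
```

The hidden state doubles as the memory line here, so a GRU trades a little of the LSTM's control for fewer parameters and a simpler update.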
Modern trend: Transformers now dominate many language tasks, but RNN-style models still matter when streaming, low latency, or limited hardware is important.