Sequence Models · Intermediate

🔄 RNNs & LSTMs

Neural networks with memory

Take your time with this one. The interactive parts are here to help you test the idea, not rush through it.

30 min · Explore at your own pace

Before We Begin

What we are learning today

Recurrent neural networks process sequences one step at a time while carrying a hidden state that summarizes what has been seen so far. LSTMs extend this idea with gates that regulate what information should be kept, updated, or forgotten, making longer-range dependencies easier to handle.

How this lesson fits

Some data points only make sense when you know what came before them. This module studies models built for ordered information such as language, audio, weather, and time series, where sequence and memory matter as much as the current input.

The big question

How can a model represent the past well enough to make a strong decision about what is happening now or what should happen next?

  • Explain why sequence order changes meaning even when the same items are present
  • Compare probabilistic sequence models with neural sequence models
  • Track hidden state, memory, and context as they move across time steps

Why You Should Care

This lesson shows how sequence modeling moved from explicit probability tables toward learned memory systems. It also explains why older recurrent models struggled with long sequences and which engineering ideas made them more usable.

Where this is used today

  • ✓ Language modeling tasks that predict the next word or classify a sentence
  • ✓ Time-series forecasting in finance, operations, and sensor monitoring
  • ✓ Music and audio generation, where the next event depends on prior rhythm and structure

Think of it like this

Reading a paragraph is a good analogy. Each new sentence changes your understanding, but you do not rebuild that understanding from zero every time. You carry forward a mental state that keeps the important context alive.

Easy mistake to make

LSTMs do not remember everything forever. They are simply better than basic RNNs at preserving the information that training suggests is worth keeping.

By the end, you should be able to say:

  • Explain hidden state as a learned summary of previous time steps
  • Describe why vanilla RNNs struggle with long-range dependencies and vanishing gradients
  • Explain how LSTM gates regulate remembering, writing, and forgetting information

Think about this first

Why is the last word in a long sentence often impossible to interpret correctly if you have forgotten the earlier context? Give an example where the beginning changes the meaning of the end.

Words we will keep using

sequence · hidden state · gate · memory · vanishing gradient

Recurrent Neural Networks & LSTMs

Standard neural networks have amnesia: they treat every input as brand new. RNNs have a memory. They read one word at a time, carrying a "thought" forward that summarizes everything they've seen so far.

Vanilla RNN
h_t = \tanh(W_h h_{t-1} + W_x x_t + b)

Simple, but it forgets quickly. Good for short sentences, bad for paragraphs.
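To make the update rule concrete, here is a minimal NumPy sketch of one vanilla RNN step applied across a short sequence. The weights are random stand-ins for learned parameters, and the sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_x = 4, 3  # hidden and input sizes (illustrative)

# Randomly initialized stand-ins for learned parameters
W_h = rng.standard_normal((d_h, d_h)) * 0.1
W_x = rng.standard_normal((d_h, d_x)) * 0.1
b = np.zeros(d_h)

def rnn_step(h_prev, x_t):
    """One vanilla RNN step: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

# Read the sequence one step at a time, carrying the hidden state forward
h = np.zeros(d_h)
for x_t in rng.standard_normal((5, d_x)):
    h = rnn_step(h, x_t)

print(h.shape)  # (4,)
```

Note that the whole history is compressed into the single vector `h`; nothing else is carried forward, which is exactly why information fades over long sequences.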

LSTM (Long Short-Term Memory)

The pro version. It has special "gates" that let it choose what to remember and what to forget, so it can track ideas over long distances.

LSTM Gates

Forget gate: f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)
"Should I throw away this old memory?"
Input gate: i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)
"Is this new information worth saving?"
Candidate: \tilde{c}_t = \tanh(W_g [h_{t-1}, x_t] + b_g)
"What is the new content I might add?"
Output gate: o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)
"What should I tell the next layer right now?"
Cell state update:
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t

This is the real memory line of the LSTM. It is designed to carry useful information farther through time.

Hidden state:
h_t = o_t \odot \tanh(c_t)
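The four gate equations and the two state updates can be sketched directly in NumPy. This is a minimal single-step LSTM cell with randomly initialized stand-ins for the learned weight matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
d_h, d_x = 4, 3
d_z = d_h + d_x  # every gate reads the concatenation [h_{t-1}, x_t]

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# One matrix per gate (random stand-ins for learned values)
W_f, W_i, W_g, W_o = (rng.standard_normal((d_h, d_z)) * 0.1 for _ in range(4))
b_f = b_i = b_g = b_o = np.zeros(d_h)

def lstm_step(h_prev, c_prev, x_t):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z + b_f)   # forget gate: keep or discard old memory
    i = sigmoid(W_i @ z + b_i)   # input gate: is the new content worth saving?
    g = np.tanh(W_g @ z + b_g)   # candidate content to add
    o = sigmoid(W_o @ z + b_o)   # output gate: what to expose right now
    c = f * c_prev + i * g       # cell state: the "memory line"
    h = o * np.tanh(c)           # hidden state passed to the next layer
    return h, c

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.standard_normal((5, d_x)):
    h, c = lstm_step(h, c, x_t)
```

The key design choice is visible in the `c` update: the cell state is modified additively, gated by `f` and `i`, rather than squashed through a fresh nonlinearity each step. That additive path is what lets gradients and information travel farther through time.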

Interactive State Trace

[Interactive widget: drag to change the input sequence and trace how the hidden state evolves step by step.]

HMMs and RNNs β€” what is the actual difference?

Students often learn HMMs and RNNs as if they belong to different worlds, but they are actually related. Both keep a hidden state that summarizes the past. The big difference is how that state is represented and updated.

What they fundamentally share

Both models rely on the same core idea: the current hidden state should summarize the important parts of the past. That means the next step can be computed from the previous state plus the current input, instead of storing the whole history directly.

HMM β€” state update
P(z_t \mid z_{t-1}) = A_{z_{t-1},\, z_t}

In an HMM, the next hidden state comes from a probability table.

RNN β€” state update
h_t = \tanh(W_h h_{t-1} + W_x x_t + b)

In an RNN, the next hidden state comes from learned weights instead of a small probability table.
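The two update rules can be put side by side in a few lines of NumPy. The matrices below are tiny, hand-picked toy values, assumed only for illustration:

```python
import numpy as np

# HMM: the "state update" is a row lookup in a stochastic matrix A.
A = np.array([[0.9, 0.1],
              [0.3, 0.7]])   # each row sums to 1
z_prev = 0                   # discrete hidden state: an integer index
p_next = A[z_prev]           # distribution over the next hidden state

# RNN: the state update is a learned affine map plus a nonlinearity.
W_h = np.array([[0.5, -0.2],
                [0.1,  0.4]])
W_x = np.array([[1.0],
                [0.0]])
h_prev = np.zeros(2)         # continuous hidden state: a real vector
x_t = np.array([1.0])
h_next = np.tanh(W_h @ h_prev + W_x @ x_t)
```

The contrast is the whole point: the HMM state is an index into a table, while the RNN state is a vector produced by a matrix multiply, so the RNN can represent far richer summaries of the past.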

The three biggest differences

| Axis | HMM | RNN |
| --- | --- | --- |
| Hidden state type | Discrete: z_t ∈ {1, …, K}, a point mass on one of K states | Continuous: h_t ∈ R^d, a dense vector of arbitrary real values |
| Transition mechanism | Lookup table A ∈ R^{K×K}, a fixed stochastic matrix whose rows sum to 1 | Learned weight matrix W_h ∈ R^{d×d}, an arbitrary real matrix plus a nonlinearity |
| Inference at test time | Required: Viterbi or the forward algorithm must marginalise over K hidden states | None: the state is deterministic; just compute the forward pass, and h_t IS the state |
| Learning algorithm | Baum-Welch (EM), taking expectations over hidden states | Backprop through time (BPTT), with gradients through the unrolled graph |

The key insight

A good way to think about it is this: HMMs use a tidy probability table for transitions, while RNNs replace that table with flexible learned weights. That extra flexibility is why RNNs can model richer patterns.

The cost of that flexibility is interpretability. In an HMM, the hidden state can often be named clearly. In an RNN, the hidden state is a vector, so the meaning is spread across many numbers at once.

Why HMMs need inference

In an HMM, you never directly see the hidden state. So at each step you must reason over several possibilities and keep track of their probabilities.

Why RNNs need no inference

In an RNN, the hidden state is just computed directly. There is no extra uncertainty calculation over several candidate states. You simply run the network forward.
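The difference shows up clearly in code. The HMM must carry a probability over every hidden state at each step (the forward algorithm), while the RNN just computes its state. Here is a toy HMM forward pass, with invented transition, emission, and initial probabilities:

```python
import numpy as np

# Toy HMM: 2 hidden states, 2 observation symbols (values are illustrative).
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])   # transition probabilities
B = np.array([[0.8, 0.2],
              [0.3, 0.7]])   # emission probabilities P(obs | state)
pi = np.array([0.5, 0.5])    # initial state distribution

obs = [0, 1, 1]              # observed symbol sequence

# Forward algorithm: maintain a belief over ALL hidden states at each step,
# because the true hidden state is never observed.
alpha = pi * B[:, obs[0]]
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]

print(alpha.sum())  # P(observations), marginalized over every hidden path
```

An RNN has no analogue of `alpha`: its hidden state is a single deterministic vector, so a plain forward pass replaces this marginalization entirely.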

The continuous spectrum

HMM → Soft HMM (continuous states) → Linear RNN (no nonlinearity) → Vanilla RNN → LSTM / GRU

You can think of these models as one family with increasing flexibility. As you move right, the state representation becomes richer and the model becomes better at handling complex sequence patterns.

GRU β€” Simpler Alternative

A GRU is a lighter version of an LSTM. It uses fewer moving parts, but still tries to control memory with gates.
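To show what "fewer moving parts" means, here is a minimal NumPy sketch of one common GRU formulation (two gates and a single state vector, versus the LSTM's four gates plus a separate cell state). The weights are random stand-ins, and note that GRU sign conventions vary between references:

```python
import numpy as np

rng = np.random.default_rng(2)
d_h, d_x = 4, 3

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# One matrix per gate over the concatenation [h_{t-1}, x_t]
W_z, W_r, W_n = (rng.standard_normal((d_h, d_h + d_x)) * 0.1 for _ in range(3))

def gru_step(h_prev, x_t):
    z = sigmoid(W_z @ np.concatenate([h_prev, x_t]))      # update gate
    r = sigmoid(W_r @ np.concatenate([h_prev, x_t]))      # reset gate
    n = np.tanh(W_n @ np.concatenate([r * h_prev, x_t]))  # candidate state
    return (1 - z) * h_prev + z * n  # one state vector, no separate cell

h = np.zeros(d_h)
for x_t in rng.standard_normal((5, d_x)):
    h = gru_step(h, x_t)
```

The update gate `z` does double duty here, playing the roles of both the LSTM's forget and input gates, which is where most of the parameter savings come from.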

Modern trend: Transformers now dominate many language tasks, but RNN-style models still matter when streaming, low latency, or limited hardware is important.