🎮 Reinforcement Learning
Learning by trying, failing, and improving
Take your time with this one. The interactive parts are here to help you test the idea, not rush through it.
Pause and experiment as you go.
Before We Begin
What we are learning today
Reinforcement learning studies decision-making over time. An agent observes a state, chooses an action, receives a reward, and must learn a policy that maximizes future return rather than just immediate payoff. That makes the problem fundamentally strategic.
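This observe-act-reward loop can be sketched in a few lines. The environment here is made up for illustration: a number line from 0 to 3 where the agent starts at 0 and the only reward (+1) arrives at the far end, so the payoff is delayed.

```python
import random

# A minimal sketch of the agent-environment loop, using a hypothetical
# environment: a number line from 0 to 3, reward only at the end.

def step(state, action):
    """Apply an action (-1 or +1); return (next_state, reward, done)."""
    next_state = min(3, max(0, state + action))  # stay on the line
    if next_state == 3:
        return next_state, 1.0, True             # delayed reward at the goal
    return next_state, 0.0, False                # no immediate payoff

state, total_return, done = 0, 0.0, False
while not done:
    action = random.choice([-1, 1])              # a purely random policy
    state, reward, done = step(state, action)
    total_return += reward

print(total_return)  # 1.0 once the goal is finally reached
```

A random policy does eventually stumble to the goal here; the point of learning is to get there on purpose, and faster.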
How this lesson fits
This module looks beyond the standard supervised-learning workflow. You will explore systems that learn from delayed rewards and systems that train collaboratively while keeping raw data distributed, which introduces the real-world constraints of strategy, privacy, and deployment.
The big question
How can AI systems keep improving in realistic environments where feedback is delayed, data is sensitive, and decisions have long-term consequences?
Why You Should Care
This lesson expands your idea of what learning can look like. Not all settings come with labeled answers; some require experimentation, patience, and balancing short-term gain against long-term success.
Where this is used today
- ✓Game-playing systems such as AlphaGo that plan ahead over long action sequences
- ✓Robotics problems where agents learn control strategies through repeated interaction
- ✓Optimization tasks such as datacenter cooling or traffic-signal timing with long-term rewards
Think of it like this
It is like learning a game where some actions look good at first but trap you later, while other actions feel inefficient now but set you up to win. Progress depends on understanding consequences across many steps, not just reacting to the present moment.
Easy mistake to make
It's easy to assume reinforcement learning is just random trial and error forever. It isn't: strong RL systems structure their feedback and use accumulated experience to become progressively less random over time.
By the end, you should be able to:
- Identify the agent, environment, state, action, reward, and policy in a reinforcement-learning setup
- Explain Q-values as estimates of long-term usefulness rather than immediate reward
- Describe why exploration and exploitation must be balanced during learning
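The last objective, balancing exploration and exploitation, is commonly handled with an epsilon-greedy rule: usually pick the best-known move, but occasionally try a random one. A sketch, with made-up Q-values:

```python
import random

def choose_action(q_values, epsilon=0.1):
    """Epsilon-greedy selection: explore with probability epsilon,
    otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.choice(list(q_values))       # explore: random move
    return max(q_values, key=q_values.get)         # exploit: best-known move

q = {"up": 1.2, "down": -0.3, "left": 0.0, "right": 2.5}  # hypothetical scores
print(choose_action(q, epsilon=0.0))  # "right" when we never explore
```

Setting epsilon to 0 too early locks the agent into whatever it happens to know; keeping it too high wastes moves on actions already known to be bad.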
Think about this first
If you were coaching a robot through a maze, would it be enough to reward only the final success? What intermediate signals might help it learn faster without giving away the whole solution?
Words we will keep using
Reinforcement Learning
Reinforcement Learning is "learning by doing." No one gives the agent an answer key. It has to try things, fail, get a reward (or a penalty), and figure out the rules on its own. It's how you learned to ride a bike.
Q-Learning
Q-Learning is basically a cheat sheet. The agent keeps a table of every possible situation and writes down a score for every possible move. Good move? Score goes up. Bad move? Score goes down.
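That cheat-sheet idea fits in one update rule: nudge the score for a situation-move pair toward the reward you got plus the best score available from where you landed. A sketch with invented state names:

```python
from collections import defaultdict

# The "cheat sheet": a table of scores (Q-values) per situation and move.
# alpha is how fast scores change; gamma discounts future reward.
alpha, gamma = 0.5, 0.9
Q = defaultdict(lambda: {"left": 0.0, "right": 0.0})

def update(state, action, reward, next_state):
    """One Q-learning update after taking `action` in `state`."""
    best_next = max(Q[next_state].values())   # best score from where we landed
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])

update("near_goal", "right", 10.0, "goal")    # good move: score goes up
update("near_trap", "left", -5.0, "trap")     # bad move: score goes down

print(Q["near_goal"]["right"])  # 5.0
print(Q["near_trap"]["left"])   # -2.5
```

Because each update pulls in the best score of the *next* state, good outcomes slowly propagate backwards through the table, which is how early moves get credit for late rewards.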
Gridworld Q-Learning Demo
🤖 Agent starts at (0,0). 🏆 Goal at (4,4) → +10. 💀 Trap at (3,3) → -5. 🧱 Walls block movement. Arrows show best action per cell.
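The demo above can be sketched in code. The layout matches the description (goal at (4,4) worth +10, trap at (3,3) worth -5), though walls are omitted for brevity; after training, the greedy action in each cell plays the role of the demo's arrows.

```python
import random

# Sketch of the gridworld demo: 5x5 grid, goal (4,4) = +10, trap (3,3) = -5.
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}
GOAL, TRAP, SIZE = (4, 4), (3, 3), 5
alpha, gamma, epsilon = 0.5, 0.9, 0.2

Q = {(x, y): {a: 0.0 for a in ACTIONS} for x in range(SIZE) for y in range(SIZE)}

def step(state, action):
    """Move one cell (clamped to the grid); return (next_state, reward, done)."""
    dx, dy = ACTIONS[action]
    ns = (min(SIZE - 1, max(0, state[0] + dx)),
          min(SIZE - 1, max(0, state[1] + dy)))
    if ns == GOAL:
        return ns, 10.0, True
    if ns == TRAP:
        return ns, -5.0, True
    return ns, 0.0, False

for episode in range(2000):
    state, done = (0, 0), False
    while not done:
        if random.random() < epsilon:                     # explore
            action = random.choice(list(ACTIONS))
        else:                                             # exploit
            action = max(Q[state], key=Q[state].get)
        ns, reward, done = step(state, action)
        best_next = 0.0 if done else max(Q[ns].values())
        Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
        state = ns

print(max(Q[(4, 3)], key=Q[(4, 3)].get))  # "down": step straight into the goal
```

Note how the cell next to the trap learns a low score for stepping into it even though nothing labeled that move as bad; the penalty alone shaped the table.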