Q-Learning on FrozenLake from scratch using an epsilon-greedy exploration strategy

Goal of the Project

Build a tabular Q-learning agent that learns to navigate the FrozenLake environment using an epsilon-greedy exploration strategy.

What is FrozenLake?

FrozenLake is a 4×4 grid of 16 states, numbered 0 to 15 row by row. The agent starts at cell 0 and must reach the goal at cell 15. Holes at cells 5, 7, 11, and 12 end the episode with zero reward. There are four actions, one per direction: left, down, right, and up.
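
The layout described above can be written down as a small sketch. It assumes the standard 4×4 map from Gymnasium's FrozenLake-v1 (S = start, F = frozen, H = hole, G = goal) and its action numbering; the helper names `to_state` and `to_row_col` are made up for illustration.

```python
# Assumed layout: the default 4x4 map of Gymnasium's FrozenLake-v1.
LAKE_MAP = ["SFFF",
            "FHFH",
            "FFFH",
            "HFFG"]
N_COLS = 4

# Assumed action numbering (the one Gymnasium uses).
ACTIONS = {0: "left", 1: "down", 2: "right", 3: "up"}


def to_state(row, col):
    """Convert a (row, col) grid position to a state index 0..15."""
    return row * N_COLS + col


def to_row_col(state):
    """Convert a state index back to its (row, col) grid position."""
    return divmod(state, N_COLS)


# Flatten the map row by row so that string index == state index.
flat = "".join(LAKE_MAP)
START = flat.index("S")
GOAL = flat.index("G")
HOLES = [i for i, ch in enumerate(flat) if ch == "H"]
```

Reading the holes off the map this way gives the same cells listed above (5, 7, 11, 12), which is a quick check that the row-by-row numbering is the right one.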

Q-learning
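
As a rough sketch, the standard tabular Q-learning update moves Q(s, a) toward the target r + γ · max Q(s', ·). The learning rate `alpha`, the discount `gamma`, their default values, and the function name `q_update` are assumptions for illustration; none of them are fixed by these notes.

```python
def q_update(q_table, state, action, reward, next_state, done,
             alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the new value.

    q_table is a list of rows, one row of action values per state.
    alpha (learning rate) and gamma (discount) are assumed defaults.
    """
    # A terminal transition has no future value to bootstrap from.
    best_next = 0.0 if done else max(q_table[next_state])
    td_target = reward + gamma * best_next
    td_error = td_target - q_table[state][action]
    q_table[state][action] += alpha * td_error
    return q_table[state][action]


# 16 states x 4 actions, all zeros at the start of training.
q_table = [[0.0] * 4 for _ in range(16)]
```

Because every reward except the goal is zero, useful values first appear next to the goal and spread backward one step at a time as more episodes are played.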

Epsilon-Greedy Exploration

Why the need?

Here's the dilemma. The Q-table starts at zero everywhere. If we always pick the action with the highest Q-value (greedy), every action is tied at first, so the choice is arbitrary and we never systematically discover anything. Worse, once we've found any path that yields reward, we'll repeat it forever and never find out whether a better path exists.

Solution

The fix is to act randomly some of the time. Epsilon-greedy does this with a single parameter, ε: with probability ε we choose a random action, and with probability 1 - ε we choose the greedy action (the one with the highest value in the Q-table).
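
A minimal sketch of that selection rule, using only the standard library. The function name `epsilon_greedy` and the random tie-breaking among equally good actions are assumptions, not something these notes prescribe.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick an action index given one state's row of Q-values.

    With probability epsilon: explore (uniform random action).
    Otherwise: exploit (an action with the highest Q-value).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    # Assumption: break ties at random so an all-zero row does not
    # always resolve to the same action.
    best_actions = [a for a, q in enumerate(q_row) if q == best]
    return rng.choice(best_actions)
```

For example, `epsilon_greedy([0.0, 0.3, 0.1, 0.0], epsilon=0.0)` always returns action 1, while `epsilon=1.0` ignores the Q-values entirely.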

At the start of training, set ε = 1.0, meaning fully random. At this point every Q-value is 0 and the Q-table holds nothing useful, so acting randomly costs nothing. Then, after each episode, decrease ε by multiplying it by a fixed decay rate strictly between 0 and 1 (typically 0.995; a rate of exactly 1 would never decay). By the end, ε sits at a small floor value such as 0.01: the agent is about 99% greedy but still pokes around occasionally.
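
The schedule can be sketched as follows, using the values mentioned above (start 1.0, decay 0.995, floor 0.01). The constant and function names are hypothetical, and decaying once per episode is an assumption about when the multiplication happens.

```python
EPS_START = 1.0   # fully random at the beginning
EPS_DECAY = 0.995 # multiplicative decay, applied once per episode
EPS_FLOOR = 0.01  # minimum exploration that is always kept


def decay_epsilon(epsilon, decay=EPS_DECAY, floor=EPS_FLOOR):
    """Shrink epsilon by one decay step, never going below the floor."""
    return max(floor, epsilon * decay)


def epsilon_schedule(num_episodes, start=EPS_START):
    """Return the epsilon value used in each of num_episodes episodes."""
    schedule = []
    epsilon = start
    for _ in range(num_episodes):
        schedule.append(epsilon)
        epsilon = decay_epsilon(epsilon)
    return schedule
```

In a training loop, `decay_epsilon` would be called once at the end of every episode, and the current ε passed to whatever action-selection rule is in use.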

The agent gradually shifts from exploring to exploiting. At the start, when the Q-table holds nothing useful, ε = 1.0 lets us explore through pure randomness. As training progresses we learn more and more about which paths work, and we should use that knowledge. A small amount of residual exploration still lets the agent break out of a suboptimal habit.

The two key things ε-decay controls are how fast you shift to exploitation (the decay rate) and how much minimum exploration you always preserve (the floor). Decay too fast and the agent commits to bad habits before it's had enough experience. Decay too slow and you waste time exploring even when you already know what's good.
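
To get a feel for how much the decay rate matters, a rough sketch can count how many per-episode multiplications it takes for ε to fall from 1.0 to the floor. The helper name `episodes_until_floor` is hypothetical, and the rates other than 0.995 are arbitrary comparison points.

```python
def episodes_until_floor(decay, start=1.0, floor=0.01):
    """Count decay steps until epsilon first reaches or drops below the floor."""
    if not 0.0 < decay < 1.0:
        # A rate of 1 (or more) never decays, so the loop would not end.
        raise ValueError("decay must be strictly between 0 and 1")
    epsilon, steps = start, 0
    while epsilon > floor:
        epsilon *= decay
        steps += 1
    return steps


for rate in (0.9, 0.995, 0.999):
    print(rate, episodes_until_floor(rate))
```

With a floor of 0.01, a rate of 0.9 reaches the floor after 44 episodes, 0.995 after 919, and 0.999 after 4603. If the whole run lasts only 1,000 episodes, the first leaves almost no time to explore and the last never finishes shifting to exploitation.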