Summary – Reinforcement Learning
Episode Walkthrough (Step-by-Step Example)
A. Agent starts at position 0.
- State: 0
- Chooses action: 'right'
- Moves to next_state: 1
- Reward: -0.1 (not the goal yet)
- Current Q-value: q_table[0]['right'] → 0.0
- Best Q-value in next_state (1): max(q_table[1].values()) → 0.0
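For concreteness, here is a minimal sketch of the setup this walkthrough assumes: three positions (0–2), the goal at position 2, a step reward of -0.1, and a goal reward of +1. The names step, ACTIONS, and GOAL are illustrative, not from any particular library.

ACTIONS = ['left', 'right']
GOAL = 2

# Q-table: one dict of action-values per state, all initialized to 0.0.
q_table = {state: {action: 0.0 for action in ACTIONS} for state in range(3)}

def step(state, action):
    # Move one position left or right, clamped to the 0–2 range.
    next_state = max(0, min(2, state + (1 if action == 'right' else -1)))
    # +1 for reaching the goal, a small penalty otherwise.
    reward = 1.0 if next_state == GOAL else -0.1
    return next_state, reward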
Let’s update the Q-value using the formula:
Q(state, action) = Q(state, action)
                 + learning_rate * (reward + discount * max(Q(next_state)) - Q(state, action))
Plug in the numbers:
old_value = 0.0
reward = -0.1
learning_rate = 0.1
discount = 0.9
next_max = 0.0
new_value = 0.0 + 0.1 * (-0.1 + 0.9 * 0.0 - 0.0) = 0.1 * (-0.1) = -0.01
So we update:
q_table[0]['right'] = -0.01
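The same update, written as a small Python function (a sketch; q_update is an illustrative name, operating on the q_table from the setup above):

def q_update(q_table, state, action, reward, next_state,
             learning_rate=0.1, discount=0.9):
    # One Q-learning step: nudge the old estimate toward
    # reward + discount * (best value reachable from next_state).
    old_value = q_table[state][action]
    next_max = max(q_table[next_state].values())
    q_table[state][action] = old_value + learning_rate * (reward + discount * next_max - old_value)

# Step A: state 0, action 'right', reward -0.1, next_state 1
q_update(q_table, 0, 'right', -0.1, 1)
print(q_table[0]['right'])  # ≈ -0.01 (up to float rounding)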
B. Agent's Next Move
- State: 1
- Chooses action 'right' → moves to 2 (the goal!)
- Reward: +1
- Q-table update:
old_value = 0.0
reward = 1
learning_rate = 0.1
discount = 0.9
next_max = max(q_table[2].values()) = 0.0
new_value = 0.0 + 0.1 * (1 + 0.9 * 0 - 0.0) = 0.1 * (1.0) = 0.1
So we update:
q_table[1]['right'] = 0.1
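Reusing the q_update sketch from step A, this second update is a single call:

# Step B: state 1, action 'right', reward +1, next_state 2 (the goal)
q_update(q_table, 1, 'right', 1.0, 2)
print(q_table[1]['right'])  # 0.1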
C. Final Q-Table After This Episode:
q_table = {
0: {'left': 0.0, 'right': -0.01},
1: {'left': 0.0, 'right': 0.1},
2: {'left': 0.0, 'right': 0.0},
}
D. Repeating Over Episodes
If the agent repeats this over many episodes (100 or more), it gradually learns:
- At position 0, going right eventually pays off, because the discounted value of position 1 propagates back into Q(0, 'right').
- At position 1, going right leads directly to the goal.
- So it builds up Q-values that reflect the best action in each state (see the training-loop sketch below).
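Putting the pieces together, a minimal training loop might look like the sketch below. It reuses the step and q_update sketches from above; the epsilon-greedy action selection and the episode count of 200 are illustrative choices, not prescribed by the walkthrough.

import random

def choose_action(q_table, state, epsilon=0.1):
    # Epsilon-greedy: explore a random action with probability epsilon,
    # otherwise exploit the action with the highest current Q-value.
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(q_table[state], key=q_table[state].get)

for episode in range(200):
    state = 0
    while state != GOAL:
        action = choose_action(q_table, state)
        next_state, reward = step(state, action)
        q_update(q_table, state, action, reward, next_state)
        state = next_state

With these values, q_table[1]['right'] approaches 1.0, and q_table[0]['right'] approaches -0.1 + 0.9 * 1.0 = 0.8, the discounted value of reaching the goal one step later.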
Reinforcement Learning – Core Concepts in Reinforcement Learning