Summary – Reinforcement Learning

Episode Walkthrough (Step-by-Step Example)

A. Agent Starts at Position 0

  1. State: 0
  2. Chooses action: 'right'
  3. Moves to next_state: 1
  4. Reward: -0.1 (not the goal yet)
  5. Current Q-value: q_table[0]['right'] → 0.0
  6. Best Q-value in next_state (1): max(q_table[1].values()) → 0.0
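For concreteness, here is a minimal Python sketch of the setup this walkthrough assumes: a three-position corridor (states 0, 1, 2) where state 2 is the goal, 'left'/'right' actions, a reward of +1 at the goal and -0.1 otherwise. The step helper is a hypothetical name introduced here for illustration.

GOAL = 2
ACTIONS = ['left', 'right']

# Q-table: every state-action value starts at 0.0.
q_table = {state: {action: 0.0 for action in ACTIONS} for state in range(3)}

def step(state, action):
    # Move one cell left or right, clamped to the corridor ends.
    next_state = max(0, state - 1) if action == 'left' else min(GOAL, state + 1)
    # Assumed reward scheme: +1 for reaching the goal, -0.1 otherwise.
    reward = 1.0 if next_state == GOAL else -0.1
    return next_state, reward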

Let’s update the Q-value using the formula:

Q(state, action) = Q(state, action)
                + learning_rate * (reward + discount * max(Q(next_state))
                - Q(state, action))
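As a sketch, the same rule written as a small Python helper (the name update_q is ours, not from the original; the learning_rate and discount defaults match the numbers used below):

def update_q(q_table, state, action, reward, next_state,
             learning_rate=0.1, discount=0.9):
    # Tabular Q-learning update, line for line the formula above.
    old_value = q_table[state][action]
    next_max = max(q_table[next_state].values())
    new_value = old_value + learning_rate * (reward + discount * next_max - old_value)
    q_table[state][action] = new_value
    return new_value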

Plug in the numbers:

old_value = 0.0
reward = -0.1
learning_rate = 0.1
discount = 0.9
next_max = 0.0


new_value  = 0.0 + 0.1 * (-0.1 + 0.9 * 0.0 - 0.0)
           = 0.1 * (-0.1)
           = -0.01

So we update:

q_table[0]['right'] = -0.01
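With the update_q sketch above, the same number falls out (up to floating-point rounding):

new_value = update_q(q_table, state=0, action='right', reward=-0.1, next_state=1)
print(round(new_value, 2))  # -0.01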

B. Agent's Next Move

  1. State: 1
  2. Chooses action 'right' → moves to 2 (the goal!)
  3. Reward: +1
  4. Q-table update:

old_value = 0.0
reward = 1
learning_rate = 0.1
discount = 0.9
next_max = max(q_table[2].values()) = 0.0


new_value  = 0.0 + 0.1 * (1 + 0.9 * 0 - 0.0)
           = 0.1 * (1.0)
           = 0.1

So we update:

q_table[1]['right'] = 0.1
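The same check for this step:

new_value = update_q(q_table, state=1, action='right', reward=1.0, next_state=2)
print(round(new_value, 2))  # 0.1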

C. Final Q-Table After This Episode:

q_table = {
    0: {'left': 0.0, 'right': -0.01},
    1: {'left': 0.0, 'right': 0.1},
    2: {'left': 0.0, 'right': 0.0},
}
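As a quick sanity check, printing the table after the two update_q calls above reproduces these values (floating point shows -0.01 with a tiny rounding tail):

print(q_table)
# {0: {'left': 0.0, 'right': -0.010000000000000002},
#  1: {'left': 0.0, 'right': 0.1},
#  2: {'left': 0.0, 'right': 0.0}}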

D. Repeating Over Episodes

If the agent repeats this loop for 100+ episodes, it will gradually learn the following (see the training-loop sketch after this list):

  • At position 0, going right → eventually leads to reward.
  • At position 1, going right → leads straight to the goal.
  • So it builds up Q-values that reflect the best action in each state.
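A minimal sketch of that training loop, reusing the step and update_q helpers from above. The epsilon-greedy exploration strategy, the epsilon=0.1 and episodes=200 values, and the per-episode step cap are illustrative choices, not from the original.

import random

def train(episodes=200, epsilon=0.1):
    q = {state: {a: 0.0 for a in ACTIONS} for state in range(3)}
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # safety cap on episode length
            if state == GOAL:
                break
            # Epsilon-greedy: usually exploit the best-known action, sometimes explore.
            if random.random() < epsilon:
                action = random.choice(ACTIONS)
            else:
                action = max(q[state], key=q[state].get)
            next_state, reward = step(state, action)
            update_q(q, state, action, reward, next_state)
            state = next_state
    return q

print(train())
# After enough episodes, 'right' carries the higher Q-value in states 0 and 1.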

Core Concepts in Reinforcement Learning