Basic Math Concepts – Reinforcement Learning

Math Topic     What You Need It For
Probability    Handling randomness, exploration
Algebra        Updating Q-values, value functions
Arithmetic     Calculating rewards, scaling values
Functions      Defining policies and value functions
Sequences      Modeling steps, episodes, and learning over time
Expectation    Calculating average reward
Discounting    Valuing future rewards less than immediate ones

1. What is Algebra in RL?

In RL, algebra just means using symbols to represent changing values, like:

  • Q(s, a) = the value of taking action a in state s
  • r = reward
  • α (alpha) = learning rate
  • γ (gamma) = discount factor

We use these symbols to update our knowledge as we learn more.

2. What’s the Actual Formula?

This is the main Q-learning formula:

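Q(s, a) = Q(s, a) + α ⋅ (r + γ ⋅ max_a′ Q(s′, a′) − Q(s, a))

Here s′ is the next state, and the max is taken over the actions a′ available in s′. The "=" is an assignment: the new value on the left replaces the old Q(s, a) used on the right.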

Let’s break it down in simple language:

Part of Formula        What It Means
Q(s, a)                What we currently think the value is for doing action a in state s
r                      What reward we actually got
max Q(s′, a′)          The best value we expect from the next state s′, over all actions a′
α (learning rate)      How fast we want to update our guess
γ (discount factor)    How much we care about future rewards
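
To make the pieces of the formula concrete, here is a minimal Python sketch of a single update. The function name `q_learning_update`, the dictionary-based Q-table, and the state and action labels are assumptions chosen for illustration, not part of the original notes. Unknown entries are assumed to start at 0.0.

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state, alpha, gamma):
    """Apply one Q-learning update in place and return the new Q(state, action)."""
    # Best value we currently expect from the next state (0.0 if nothing is known yet).
    best_next = max(q_table[next_state].values(), default=0.0)
    # Target: the reward we actually got plus the discounted best next value.
    target = reward + gamma * best_next
    # Move the old estimate a fraction alpha of the way toward the target.
    q_table[state][action] += alpha * (target - q_table[state][action])
    return q_table[state][action]


# Hypothetical Q-table: state -> {action: value}; missing entries default to 0.0.
q_table = defaultdict(lambda: defaultdict(float))
q_table["B"]["left"] = 1.0
q_table["B"]["right"] = 2.0

new_value = q_learning_update(
    q_table, "A", "left", reward=0.0, next_state="B", alpha=0.5, gamma=0.5
)
print(new_value)  # 0.5
```

With these made-up numbers the best next value is 2.0, the target is 0 + 0.5 ⋅ 2.0 = 1.0, and the estimate moves halfway from 0.0 to 1.0.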

Real Example (with numbers)

Let’s say:

  • We’re in state 0
  • We take action ‘right’
  • We move to state 1
  • We get a reward r = 1
  • Our current Q-table says:
    • Q(0, ‘right’) = 0.2
    • max Q(1, *) = 0.5
  • Learning rate α = 0.1
  • Discount γ = 0.9

Now plug into the formula:

Q(0, ‘right’) = 0.2 + 0.1 ⋅ (1 + 0.9 ⋅ 0.5 − 0.2)

Step-by-step:

1. Calculate the reward plus the discounted best next value (the first part of the parentheses):

1+0.9⋅0.5=1+0.45=1.45

2. Subtract current value:

1.45−0.2=1.25

3. Multiply by the learning rate:

0.1⋅1.25=0.125

4. Add to the old value:

0.2+0.125=0.325

So, we update Q(0, ‘right’) to 0.325.
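
The same four steps can be checked in a few lines of Python. This is only a sketch that mirrors the arithmetic above; the variable names (`target`, `error`, `step`) are illustrative labels, not terms from the notes.

```python
# Values from the worked example.
q_old = 0.2       # current Q(0, 'right')
reward = 1.0      # reward r
best_next = 0.5   # max Q(1, *)
alpha = 0.1       # learning rate
gamma = 0.9       # discount factor

target = reward + gamma * best_next   # step 1: 1 + 0.9 * 0.5
error = target - q_old                # step 2: subtract the current value
step = alpha * error                  # step 3: multiply by the learning rate
q_new = q_old + step                  # step 4: add to the old value

print(f"target={target:.3f} error={error:.3f} step={step:.3f} new Q={q_new:.3f}")
# target=1.450 error=1.250 step=0.125 new Q=0.325
```

The values are formatted to three decimals because floating-point arithmetic can leave tiny rounding differences in the last digits.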

3. What is Discounting?

Discounting means that rewards we get right now are more valuable than rewards we get later.

Simple Analogy

Imagine someone says:

  • “I’ll give you $10 today”
  • OR “I’ll give you $10 next year”

Most people would say: “I want $10 now!”

Because:

  • We can use it now.
  • It’s more certain.
  • We can invest it or buy snacks today.

Same idea in Reinforcement Learning!

The Discount Factor (γ)

  • Written as: gamma, or γ
  • A number between 0 and 1
  • Controls how much the agent cares about the future

γ Value    Agent Behavior
0          Only cares about immediate rewards
0.9        Cares about future rewards a lot
1.0        Treats all future rewards as equal
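
One way to see the three behaviors in the table is to print the weight γ^k that a reward receives when it arrives k steps from now. The helper name `discount_weights` is an assumption made for this sketch.

```python
def discount_weights(gamma, n_steps):
    """Return the weight gamma**k applied to a reward received k steps from now."""
    return [gamma ** k for k in range(n_steps)]


for gamma in (0.0, 0.9, 1.0):
    weights = discount_weights(gamma, 4)
    print(gamma, [round(w, 3) for w in weights])
# 0.0 [1.0, 0.0, 0.0, 0.0]
# 0.9 [1.0, 0.9, 0.81, 0.729]
# 1.0 [1.0, 1.0, 1.0, 1.0]
```

With γ = 0 every reward after the immediate one gets weight 0, with γ = 0.9 the weights shrink gradually, and with γ = 1.0 they never shrink.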

How it Works (with numbers)

Suppose we get:

  • +1 now
  • +1 after 1 step
  • +1 after 2 steps

And γ = 0.9

Total discounted reward:

1+0.9⋅1+0.9^2⋅1=1+0.9+0.81=2.71

The later rewards are worth less because they’re discounted.
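
The calculation above can be written as a short Python function. This is a sketch under the assumption that the rewards arrive as a list ordered by time step; the name `discounted_return` is illustrative.

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, weighting the reward at step k by gamma**k."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


total = discounted_return([1, 1, 1], gamma=0.9)
print(round(total, 2))  # 2.71
```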

Why Use Discounting?

  1. To prefer faster rewards.
  2. To help the agent plan ahead, but not forever.
  3. To avoid getting stuck chasing long-delayed rewards that may never come.
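
The "plan ahead, but not forever" point can be illustrated numerically. As a sketch, assume the agent receives +1 at every step and γ = 0.9: the discounted total keeps growing but levels off near 1 / (1 − γ) = 10, so very distant rewards add almost nothing. The helper name `partial_total` is hypothetical.

```python
def partial_total(gamma, n_steps):
    """Discounted total of a +1 reward received at each of the first n_steps steps."""
    return sum(gamma ** k for k in range(n_steps))


gamma = 0.9
limit = 1 / (1 - gamma)  # value the total approaches when gamma < 1

for n_steps in (10, 50, 200):
    print(n_steps, round(partial_total(gamma, n_steps), 3))
print("limit:", round(limit, 3))
```

With γ = 1.0 there is no such limit: the total simply equals the number of steps.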

Reinforcement Learning – Visual Roadmap