Basic Math Concepts – Reinforcement Learning
| Math Topic | What You Need It For |
|---|---|
| Probability | Handling randomness, exploration |
| Algebra | Updating Q-values, value functions |
| Arithmetic | Calculating rewards, scaling values |
| Functions | Defining policies and value functions |
| Sequences | Modeling steps, episodes, and learning over time |
| Expectation | Calculating average reward |
| Discounting | Valuing future rewards less than immediate ones |
1. What is Algebra in RL?
In RL, algebra just means using symbols to represent changing values, like:
- Q(s, a) = the value of taking action a in state s
- r = reward
- α (alpha) = learning rate
- γ (gamma) = discount factor
We use these symbols to update our knowledge as we learn more.
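If you like to think in code, here is a tiny sketch of those symbols as plain Python variables (the names q_table, reward, alpha, and gamma are just illustrative choices, not from any specific library):

```python
# Illustrative sketch: the RL "algebra" symbols as ordinary Python variables.
# Q(s, a) is stored in a table keyed by (state, action).
q_table = {
    (0, "right"): 0.2,   # Q(s=0, a='right')
    (1, "left"): 0.5,    # Q(s=1, a='left')
}

reward = 1.0   # r: the reward we just received
alpha = 0.1    # α: learning rate (how fast we update)
gamma = 0.9    # γ: discount factor (how much we value the future)
```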
2. What’s the Actual Formula?
This is the main Q-learning formula:
Q(s, a) = Q(s, a) + α ⋅ ( r + γ ⋅ max Q(s′, a′) − Q(s, a) )
Let’s break it down in simple language:
| Part of Formula | What it Means |
|---|---|
| Q(s, a) | What we currently think the value is of taking action a in state s |
| r | The reward we actually got |
| max Q(s′, a′) | The best value we expect from the next state, taken over all actions a′ we could take there |
| α (learning rate) | How fast we update our estimate |
| γ (discount factor) | How much we care about future rewards |
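To make the formula concrete, here is a minimal Python sketch of one Q-learning update; the dictionary-based q_table and the function name q_learning_update are assumptions made for illustration, not part of any particular library:

```python
def q_learning_update(q_table, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """One Q-learning step: Q(s,a) += α · (r + γ · max_a' Q(s',a') − Q(s,a))."""
    # Best value we expect from the next state, over all actions we could take there.
    best_next = max(q_table.get((next_state, a), 0.0) for a in actions)
    current = q_table.get((state, action), 0.0)
    # How far off our current estimate was (the "temporal-difference error").
    td_error = reward + gamma * best_next - current
    q_table[(state, action)] = current + alpha * td_error
    return q_table[(state, action)]
```

The only moving parts are exactly the symbols from the table: the current estimate, the reward, the best next value, α, and γ.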
Real Example (with numbers)
Let’s say:
- We’re in state 0
- We take action ‘right’
- We move to state 1
- We get a reward r = 1
- Our current Q-table says:
- Q(0, ‘right’) = 0.2
- max Q(1, *) = 0.5
- Learning rate α = 0.1
- Discount γ = 0.9
Now plug into the formula:
Q(0, ’right’) = 0.2 + 0.1 ⋅ (1 + 0.9 ⋅ 0.5 − 0.2)
Step-by-step:
1. Calculate the reward plus the discounted future value:
1 + 0.9 ⋅ 0.5 = 1 + 0.45 = 1.45
2. Subtract the current Q-value:
1.45 − 0.2 = 1.25
3. Multiply by the learning rate:
0.1 ⋅ 1.25 = 0.125
4. Add the result to the old value:
0.2 + 0.125 = 0.325
So, we update Q(0, ‘right’) to 0.325.
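You can check the same arithmetic with a few lines of Python, plugging in the numbers from the example above:

```python
# Values from the worked example above.
q_current = 0.2      # Q(0, 'right')
max_q_next = 0.5     # max Q(1, *)
reward = 1.0         # r
alpha = 0.1          # learning rate
gamma = 0.9          # discount factor

target = reward + gamma * max_q_next   # 1 + 0.45 = 1.45
td_error = target - q_current          # 1.45 − 0.2 = 1.25
q_new = q_current + alpha * td_error   # 0.2 + 0.125 = 0.325
print(round(q_new, 3))                 # 0.325
```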
3. What is Discounting?
Discounting means that rewards we get right now are more valuable than rewards we get later.
Simple Analogy
Imagine someone says:
- “I’ll give you $10 today”
- OR “I’ll give you $10 next year”
Most people would say: “I want $10 now!”
Because:
- We can use it now.
- It’s more certain.
- We can invest it or buy snacks today.
The same idea applies in Reinforcement Learning!
The Discount Factor (γ)
- Written as: gamma, or γ
- A number between 0 and 1
- Controls how much the agent cares about the future
| γ Value | Agent Behavior |
|---|---|
| 0 | Only cares about immediate rewards |
| 0.9 | Cares a lot about future rewards |
| 1.0 | Treats future rewards as just as valuable as immediate ones (no discounting) |
How it Works (with numbers)
Suppose we get:
- +1 now
- +1 after 1 step
- +1 after 2 steps
And γ = 0.9
Total discounted reward:
1 + 0.9 ⋅ 1 + 0.9² ⋅ 1 = 1 + 0.9 + 0.81 = 2.71
The later rewards are worth less because they’re discounted.
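The same calculation can be written in a few lines of Python, using the rewards from the example; trying a few γ values also reproduces the behaviors from the table above (the helper name discounted_return is just an illustrative choice):

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, discounting each one by gamma raised to its delay."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [1, 1, 1]   # +1 now, +1 after 1 step, +1 after 2 steps

print(round(discounted_return(rewards, 0.9), 2))   # 2.71 -> later rewards shrink
print(round(discounted_return(rewards, 0.0), 2))   # 1.0  -> only the immediate reward counts
print(round(discounted_return(rewards, 1.0), 2))   # 3.0  -> every reward counts fully
```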
Why Use Discounting?
- To prefer faster rewards.
- To help the agent plan ahead, but not forever.
- To avoid getting stuck chasing long-delayed rewards that may never come.
Reinforcement Learning – Visual Roadmap