Core Concepts in Reinforcement Learning
1. Agent:
The learner or decision maker.
Example: A robot, a game character, a self-driving car, or even a bot trying to spell the word “hi”.
2. Environment:
The world the agent lives in and interacts with.
Example: A maze, a road, a typing game, or a grid.
3. State (S):
A snapshot of the environment at a particular time. It tells the agent where it is or what’s going on right now.
Examples:
- In a maze: position (x, y)
- In a typing game: current_word = ‘h’, position = 1
- In chess: full board layout
The agent uses the state to decide what to do next.
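As a rough sketch, a state can be as small as a tuple or a dictionary; the names and values below are purely illustrative:

```python
# A few hypothetical state representations (variable names are illustrative)
maze_state = (2, 3)                                   # (x, y) position in a grid
typing_state = {"current_word": "h", "position": 1}   # progress toward typing "hi"
# A chess state would be much richer, e.g. an 8x8 description of the full board
```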
4. Action (A):
What the agent can do in a state.
Examples:
- In a grid: move ‘up’, ‘down’, ‘left’, ‘right’
- In a typing game: choose letter ‘a’, ‘b’, ‘c’, …
- In a robot: turn left, pick up, move forward
The agent picks an action, hoping it leads to something good.
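A sketch of what these action sets might look like as plain Python lists (names are illustrative):

```python
# Hypothetical action sets for the examples above
grid_actions = ["up", "down", "left", "right"]
typing_actions = list("abcdefghijklmnopqrstuvwxyz")   # pick one letter per step
robot_actions = ["turn_left", "pick_up", "move_forward"]
```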
5. Reward (R):
The score the agent gets after taking an action.
Examples:
- Reached the goal? +1
- Took a wrong turn? -0.1
- Typed the correct letter? +1
- Took too long? -0.1
The goal of the agent is to maximize the total reward over time.
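One common way to express this in code is a small reward function; the signature and numbers below are just an illustration of the examples above:

```python
def reward(next_state, goal):
    """Toy reward: +1 for reaching the goal, a small penalty for every other step.
    The exact numbers are a design choice, not a fixed rule."""
    if next_state == goal:
        return 1.0    # reached the goal
    return -0.1       # wrong turn or wasted time
```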
6. Policy (π):
The strategy the agent follows.
It maps each state to the best action. We can think of it as the agent’s “brain”: what it believes is the best thing to do in each situation.
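In the simplest tabular setting, a policy can be nothing more than a lookup table from state to action; a minimal sketch, assuming a tiny corridor world with the goal at state 3:

```python
# Hypothetical deterministic policy: in every state, the best move is "right"
policy = {0: "right", 1: "right", 2: "right"}

def act(state):
    return policy[state]   # the agent's "brain": state in, action out
```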
7. Episode:
One complete run from start to end.
Examples:
- From starting point to goal in a maze
- From empty string to correctly typing “hi”
- From start of a game to game over
After each episode, the environment resets, and the agent can try again.
8. Step / Time Step:
One single move: (State → Action → Reward → Next State)
A sequence of steps, from start to finish, makes up one episode.
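A single step is often recorded as a transition tuple; a sketch using maze-style states (all values illustrative):

```python
# One time step stored as a transition: (state, action, reward, next_state)
transition = ((1, 1), "right", -0.1, (2, 1))

# An episode is then just the sequence of transitions from start to the end state
episode = [transition, ((2, 1), "up", 1.0, (2, 2))]
```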
9. Value Function (V):
Tells us how good it is to be in a particular state.
“If I’m here, what’s my expected total reward in the future?”
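In a tabular setting this is just one number per state; the values below are made up purely for illustration:

```python
# A hypothetical value table for a small corridor: states closer to the goal
# are worth more.
V = {0: 0.73, 1: 0.81, 2: 0.9}   # "if I'm here, how much total reward lies ahead?"
```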
10. Q-Value (Q-function):
Tells us how good it is to take a specific action in a specific state.
“If I’m at position 1 and I go right, how much reward can I expect?”
That’s why we use a Q-table: it stores one of these values for every state-action pair.
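A minimal sketch of such a Q-table as a nested dictionary (states and numbers are illustrative):

```python
# A hypothetical Q-table: one entry per (state, action) pair
Q = {
    0: {"left": 0.6, "right": 0.81},
    1: {"left": 0.73, "right": 0.9},
    2: {"left": 0.81, "right": 1.0},
}

best_action = max(Q[1], key=Q[1].get)   # "right": the most promising action in state 1
```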
11. Exploration vs Exploitation:
The agent has to choose between:
- Exploration (trying something new)
- Exploitation (choosing what it already thinks is best)
Example: Should the agent try a new action that might be better? Or stick to what worked last time?
This is why we sometimes pick a random action (explore).
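A common way to implement this trade-off is epsilon-greedy selection; a minimal sketch, assuming the nested-dictionary Q-table from above:

```python
import random

def choose_action(Q, state, actions, epsilon=0.1):
    """Epsilon-greedy selection: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(actions)                 # explore: try something new
    return max(actions, key=lambda a: Q[state][a])    # exploit: best known action
```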
12. Learning Rate (α):
How much the agent updates its estimates after each new experience.
- High = learns fast (but might be unstable)
- Low = learns slow (but safer)
13. Discount Factor (γ):
How much future rewards matter compared to immediate rewards.
- γ = 0 → Only care about now
- γ close to 1 → Care about long-term rewards
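Both numbers appear together in the tabular Q-learning update; a sketch, again assuming the nested-dictionary Q-table from above:

```python
def update_q(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update.
    alpha (learning rate): how far to move toward the new estimate.
    gamma (discount factor): how much the best future value counts."""
    best_next = max(Q[next_state].values())           # value of the best next action
    target = reward + gamma * best_next               # what this step suggests Q should be
    Q[state][action] += alpha * (target - Q[state][action])   # move part of the way there
```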
How it all fits together
In each episode, the agent:
- Starts in a state
- Picks an action
- Gets a reward
- Moves to a next state
- Updates its Q-table
- Repeats until the episode ends
Over many episodes, it gets better and better at making decisions!
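A bare-bones sketch of that loop, using a made-up one-dimensional corridor environment (everything here is illustrative, not a reference implementation):

```python
import random

# Toy environment: states 0..3, goal at 3, actions move left or right
ACTIONS = ["left", "right"]
GOAL = 3

def step(state, action):
    next_state = min(max(state + (1 if action == "right" else -1), 0), GOAL)
    reward = 1.0 if next_state == GOAL else -0.1
    return next_state, reward, next_state == GOAL     # (next state, reward, done?)

Q = {s: {a: 0.0 for a in ACTIONS} for s in range(GOAL + 1)}
alpha, gamma, epsilon = 0.1, 0.9, 0.1

for episode in range(500):
    state = 0                                         # start of a new episode
    done = False
    while not done:
        if random.random() < epsilon:                 # explore
            action = random.choice(ACTIONS)
        else:                                         # exploit
            action = max(ACTIONS, key=lambda a: Q[state][a])
        next_state, reward, done = step(state, action)
        best_next = max(Q[next_state].values())
        Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
        state = next_state                            # move on to the next step
```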
Reinforcement Learning – Basic Math Concepts