Core Concepts in Reinforcement Learning
1. Agent:
The learner or decision maker.
Example: A robot, a game character, a self-driving car, or even a bot trying to spell the word “hi”.
2. Environment:
The world the agent lives in and interacts with.
Example: A maze, a road, a typing game, or a grid.
3. State (S):
A snapshot of the environment at a particular time. It tells the agent where it is or what’s going on right now.
Examples:
- In a maze: position (x, y)
- In a typing game: current_word = ‘h’, position = 1
- In chess: full board layout
The agent uses the state to decide what to do next.
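As a rough sketch, a state can be as small as a tuple or a dictionary; the names and values below are purely illustrative:

```python
# A few hypothetical state representations (variable names are illustrative)
maze_state = (2, 3)                                   # (x, y) position in a grid
typing_state = {"current_word": "h", "position": 1}   # progress toward typing "hi"
# A chess state would be much richer, e.g. an 8x8 description of the full board
```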
4. Action (A):
What the agent can do in a state.
Examples:
- In a grid: move ‘up’, ‘down’, ‘left’, ‘right’
- In a typing game: choose letter ‘a’, ‘b’, ‘c’, …
- In a robot: turn left, pick up, move forward
The agent picks an action, hoping it leads to something good.
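A sketch of what these action sets might look like as plain Python lists (names are illustrative):

```python
# Hypothetical action sets for the examples above
grid_actions = ["up", "down", "left", "right"]
typing_actions = list("abcdefghijklmnopqrstuvwxyz")   # pick one letter per step
robot_actions = ["turn_left", "pick_up", "move_forward"]
```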
5. Reward (R):
The score the agent gets after taking an action.
Examples:
- Reached the goal? +1
- Took a wrong turn? -0.1
- Typed the correct letter? +1
- Took too long? -0.1
The goal of the agent is to maximize the total reward over time.
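One common way to express this in code is a small reward function; the signature and numbers below are just an illustration of the examples above:

```python
def reward(next_state, goal):
    """Toy reward: +1 for reaching the goal, a small penalty for every other step.
    The exact numbers are a design choice, not a fixed rule."""
    if next_state == goal:
        return 1.0    # reached the goal
    return -0.1       # wrong turn or wasted time
```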
6. Policy (π):
The strategy the agent follows.
It maps each state to the best action. We can think of it as the agent’s “brain”: what it believes is the best thing to do in each situation.
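In the simplest tabular setting, a policy can be nothing more than a lookup table from state to action; a minimal sketch, assuming a tiny corridor world with the goal at state 3:

```python
# Hypothetical deterministic policy: in every state, the best move is "right"
policy = {0: "right", 1: "right", 2: "right"}

def act(state):
    return policy[state]   # the agent's "brain": state in, action out
```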
7. Episode:
One complete run from start to end.
Examples:
- From starting point to goal in a maze
- From empty string to correctly typing “hi”
- From start of a game to game over
After each episode, the environment resets, and the agent can try again.
8. Step / Time Step:
One single move: (State → Action → Reward → Next State)
A sequence of steps, from start to finish, makes up one episode.
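A single step is often recorded as a transition tuple; a sketch using maze-style states (all values illustrative):

```python
# One time step stored as a transition: (state, action, reward, next_state)
transition = ((1, 1), "right", -0.1, (2, 1))

# An episode is then just the sequence of transitions from start to the end state
episode = [transition, ((2, 1), "up", 1.0, (2, 2))]
```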
9. Value Function (V):
Tells us how good it is to be in a particular state.
“If I’m here, what’s my expected total reward in the future?”
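In a tabular setting this is just one number per state; the values below are made up purely for illustration:

```python
# A hypothetical value table for a small corridor: states closer to the goal
# are worth more.
V = {0: 0.73, 1: 0.81, 2: 0.9}   # "if I'm here, how much total reward lies ahead?"
```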
10. Q-Value (Q-function):
Tells us how good it is to take a specific action in a specific state.
“If I’m at position 1 and I go right, how much reward can I expect?”
That’s why we use a Q-table: it stores one of these values for every state-action pair.
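A minimal sketch of such a Q-table as a nested dictionary (states and numbers are illustrative):

```python
# A hypothetical Q-table: one entry per (state, action) pair
Q = {
    0: {"left": 0.6, "right": 0.81},
    1: {"left": 0.73, "right": 0.9},
    2: {"left": 0.81, "right": 1.0},
}

best_action = max(Q[1], key=Q[1].get)   # "right": the most promising action in state 1
```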
11. Exploration vs Exploitation:
The agent has to choose between:
- Exploration (trying something new)
- Exploitation (choosing what it already thinks is best)
Example: Should the agent try a new action that might be better? Or stick to what worked last time?
This is why we sometimes pick a random action (explore).
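A common way to implement this trade-off is epsilon-greedy selection; a minimal sketch, assuming the nested-dictionary Q-table from above:

```python
import random

def choose_action(Q, state, actions, epsilon=0.1):
    """Epsilon-greedy selection: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(actions)                 # explore: try something new
    return max(actions, key=lambda a: Q[state][a])    # exploit: best known action
```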
12. Learning Rate (α):
How much the agent updates its estimates after each new experience.
- High = learns fast (but might be unstable)
- Low = learns slow (but safer)
13. Discount Factor (γ):
How much future rewards matter compared to immediate rewards.
- γ = 0 → Only care about now
- γ close to 1 → Care about long-term rewards
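Both numbers appear together in the tabular Q-learning update; a sketch, again assuming the nested-dictionary Q-table from above:

```python
def update_q(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update.
    alpha (learning rate): how far to move toward the new estimate.
    gamma (discount factor): how much the best future value counts."""
    best_next = max(Q[next_state].values())           # value of the best next action
    target = reward + gamma * best_next               # what this step suggests Q should be
    Q[state][action] += alpha * (target - Q[state][action])   # move part of the way there
```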
How it all fits together
In each episode, the agent:
- Starts in a state
- Picks an action
- Gets a reward
- Moves to a next state
- Updates its Q-table
- Repeats until the episode ends
Over many episodes, it gets better and better at making decisions!
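A bare-bones sketch of that loop, using a made-up one-dimensional corridor environment (everything here is illustrative, not a reference implementation):

```python
import random

# Toy environment: states 0..3, goal at 3, actions move left or right
ACTIONS = ["left", "right"]
GOAL = 3

def step(state, action):
    next_state = min(max(state + (1 if action == "right" else -1), 0), GOAL)
    reward = 1.0 if next_state == GOAL else -0.1
    return next_state, reward, next_state == GOAL     # (next state, reward, done?)

Q = {s: {a: 0.0 for a in ACTIONS} for s in range(GOAL + 1)}
alpha, gamma, epsilon = 0.1, 0.9, 0.1

for episode in range(500):
    state = 0                                         # start of a new episode
    done = False
    while not done:
        if random.random() < epsilon:                 # explore
            action = random.choice(ACTIONS)
        else:                                         # exploit
            action = max(ACTIONS, key=lambda a: Q[state][a])
        next_state, reward, done = step(state, action)
        best_next = max(Q[next_state].values())
        Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
        state = next_state                            # move on to the next step
```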
Reinforcement Learning – Basic Math Concepts