Reinforcement Learning Example with Simple Python
1. Goal
We’ll create a game where:
- An agent (a player) sits on a line of 5 positions (numbered 0 to 4).
- The goal is to reach position 4 and collect the reward.
- The agent can move left or right.
- Reward for reaching the goal (position 4): +1
- Penalty for every step that does not reach the goal: -0.1
We’ll use Q-learning, a simple RL algorithm.
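The environment is a single line of positions; the agent starts at 0 and tries to walk to 4:

    [0] - [1] - [2] - [3] - [4]
    start                   goal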
2. Concepts Used
- States: positions 0 to 4
- Actions: ‘left’ or ‘right’
- Q-table: a table that stores a value for each action in each state; the best action in a state is the one with the highest value
- Exploration vs. exploitation: with a small probability the agent tries a random action (explore), otherwise it picks the best-known action (exploit)
- Learning: updating the Q-table after every step, using the rule shown below
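The heart of Q-learning is the update rule the code applies after every step. Here is a minimal sketch of a single update, using the same learning rate and discount factor as the full program below (the next_max value is assumed purely for illustration):

    # One Q-learning update, shown in isolation:
    # Q(s, a) <- Q(s, a) + lr * (reward + gamma * max_a' Q(s', a') - Q(s, a))
    learning_rate = 0.1
    discount_factor = 0.9
    old_value = 0.0   # current Q(s, a)
    reward = -0.1     # the step penalty from our game
    next_max = 0.5    # best Q-value in the next state (assumed for illustration)
    new_value = old_value + learning_rate * (reward + discount_factor * next_max - old_value)
    print(round(new_value, 3))  # 0.035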
Beginner-Friendly Code with Penalties
import random

# Environment
states = [0, 1, 2, 3, 4]
actions = ['left', 'right']
goal_state = 4

# Q-table
q_table = {}
for state in states:
    q_table[state] = {'left': 0.0, 'right': 0.0}

# Hyperparameters
learning_rate = 0.1
discount_factor = 0.9
exploration_rate = 0.2
episodes = 100

# Training
for episode in range(episodes):
    state = 0  # start at the beginning
    while state != goal_state:
        # Choose action (explore or exploit)
        if random.random() < exploration_rate:
            action = random.choice(actions)
        else:
            action = max(q_table[state], key=q_table[state].get)
        # Take action
        if action == 'right':
            next_state = min(state + 1, goal_state)
        else:
            next_state = max(state - 1, 0)
        # Reward logic
        if next_state == goal_state:
            reward = 1  # goal reward
        else:
            reward = -0.1  # penalty for every non-goal step
        # Q-learning update
        old_value = q_table[state][action]
        next_max = max(q_table[next_state].values())
        new_value = old_value + learning_rate * (reward + discount_factor * next_max - old_value)
        q_table[state][action] = new_value
        state = next_state  # move to next state

# Show Q-table after training
print("Learned Q-table with penalties:")
for s in q_table:
    print(f"State {s}: {q_table[s]}")
# Testing
print("\nTesting learned policy:")
state = 0
steps = [state]
max_test_steps = 20  # safety cap in case the policy hasn't fully converged
while state != goal_state and len(steps) <= max_test_steps:
    action = max(q_table[state], key=q_table[state].get)
    if action == 'right':
        state = min(state + 1, goal_state)
    else:
        state = max(state - 1, 0)
    steps.append(state)
print("Steps taken:", steps)
3. Output Explanation
After training, the program prints the Q-values for each action in each state. During testing, it prints the sequence of steps the agent takes to reach the goal using the learned policy.
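With the -0.1 step penalty, moving right is always better than moving left, so once training has converged the greedy policy walks straight from 0 to 4 and the test prints:

    Steps taken: [0, 1, 2, 3, 4]

The exact Q-values differ from run to run because exploration is random.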
4. A Real-Life-Inspired Example Using Text-Based Computation
Idea: Word Typing Practice (Text-Based Computation)
Let’s imagine:
- The agent’s job is to type letters to match a target word (like ‘hi’).
- At each step, it chooses a letter.
- If the letter is correct at that position, it gets a reward.
- If the letter is wrong, it gets a penalty.
- When the full word is correctly typed, it gets a big reward.
This is a kind of sequence generation task, like spelling or autocomplete.
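In the code below, a state is the pair (position, string typed so far). For example, after correctly typing ‘h’, the agent is in state (1, 'h') and must learn that ‘i’ is the best next letter.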
Python Code: Typing the Word ‘hi’
import random
# Target word to "type"
target_word = "hi"
max_steps = len(target_word)
# All possible letters to choose from
alphabet = list("abcdefghijklmnopqrstuvwxyz")
# Q-table: state = (position, string typed so far), action = letter
q_table = {}

# Initialize Q-table with zeros. A wrong letter never advances the
# position, so the states visited along the correct path are exactly
# (pos, target_word[:pos]); any other state is added lazily during training.
for pos in range(max_steps):
    q_table[(pos, target_word[:pos])] = {letter: 0.0 for letter in alphabet}
# Hyperparameters
learning_rate = 0.1
discount_factor = 0.95
exploration_rate = 0.3
episodes = 200

# Training loop
for episode in range(episodes):
    current_string = ""
    position = 0
    while position < max_steps:
        state = (position, current_string)
        # Choose letter (action)
        if state not in q_table:
            q_table[state] = {letter: 0.0 for letter in alphabet}
        if random.random() < exploration_rate:
            action = random.choice(alphabet)
        else:
            action = max(q_table[state], key=q_table[state].get)
        # Check reward
        correct_letter = target_word[position]
        if action == correct_letter:
            reward = 1  # correct letter
            current_string += action
            position += 1
        else:
            reward = -0.5  # wrong letter, don't move forward
        # Get next state
        next_state = (position, current_string)
        if next_state not in q_table:
            q_table[next_state] = {letter: 0.0 for letter in alphabet}
        # Q-learning update
        old_value = q_table[state][action]
        next_max = max(q_table[next_state].values())
        new_value = old_value + learning_rate * (reward + discount_factor * next_max - old_value)
        q_table[state][action] = new_value
# Testing the learned policy
print(f"\nTesting learned policy to type the word '{target_word}':")
typed = ""
position = 0
attempts = 0
max_attempts = 100  # safety cap so a wrong greedy choice can't loop forever
while position < max_steps and attempts < max_attempts:
    attempts += 1
    state = (position, typed)
    if state in q_table:
        action = max(q_table[state], key=q_table[state].get)
    else:
        action = random.choice(alphabet)
    print(f"Position {position} - Typed: '{typed}' → Choosing: '{action}'")
    if action == target_word[position]:
        typed += action
        position += 1
    else:
        print("Wrong letter, retrying...")
print(f"\nFinal typed word: {typed}")
What’s Happening Here?
- The agent learns to type a word one letter at a time.
- It gets +1 for correct letters, -0.5 for wrong ones.
- Through trial and error, it figures out how to spell the word correctly.
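You can see this directly by inspecting the learned values at the starting state. After the training loop above, a check like the following shows that ‘h’ ends up with the highest Q-value (once training has converged):

    # Inspect the learned Q-values at the start state (position 0, nothing typed)
    start_values = q_table[(0, "")]
    best = max(start_values, key=start_values.get)
    print(f"Best first letter: {best}")  # 'h' once training has converged
    print(f"Q-value for 'h': {start_values['h']:.3f}")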
Why This Helps Understanding
- It’s like learning to type, spell, or predict text: all real-life tasks.
- Easy to see what the agent is doing, and why certain actions are rewarded or penalized.
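To experiment, change the target word at the top of the script. As a rough, untuned guess, longer words need more training, since the agent has one more state to learn per letter:

    # Hypothetical settings for a longer word (not tuned)
    target_word = "cat"
    max_steps = len(target_word)
    episodes = 1000  # more letters to learn, so give it more practice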
