A Reinforcement Learning Example in Simple Python

1. Goal

We’ll create a game where:

  • An agent (a player) is on a line of 5 positions (0 to 4).
  • The goal is to reach position 4 and collect a reward.
  • The agent can move left or right.
  • Reward for reaching the goal (position 4): +1
  • Penalty for every step that doesn’t reach the goal: -0.1 (a minimal step function encoding these rules is sketched below)
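
These rules fit in a tiny helper function. A minimal sketch (the `step` helper is illustrative only; the training code later inlines the same logic):

GOAL_STATE = 4

def step(state, action):
    # Move right or left along the line, clamped to the ends [0, 4].
    if action == 'right':
        next_state = min(state + 1, GOAL_STATE)
    else:
        next_state = max(state - 1, 0)
    # +1 for reaching the goal, -0.1 for any other step.
    reward = 1 if next_state == GOAL_STATE else -0.1
    return next_state, reward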

We’ll use Q-learning, a simple RL algorithm.

2. Concepts Used

  • States: positions 0 to 4
  • Actions: ‘left’ or ‘right’
  • Q-table: a table of estimated values, one per (state, action) pair; the best action in a state is the one with the highest value
  • Learning: updating the Q-table after each step (the exact update rule is shown just below)
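
Q-learning revolves around a single update rule, applied after every step:

    Q(state, action) ← Q(state, action) + learning_rate × (reward + discount_factor × max Q(next_state, ·) - Q(state, action))

The learning rate controls how quickly new experience overwrites old estimates, and the discount factor controls how much future reward counts relative to immediate reward. The code below applies exactly this formula.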

Beginner-Friendly Code with Penalties

import random

# Environment
states = [0, 1, 2, 3, 4]
actions = ['left', 'right']
goal_state = 4

# Q-table
q_table = {}
for state in states:
    q_table[state] = {'left': 0.0, 'right': 0.0}

# Hyperparameters
learning_rate = 0.1
discount_factor = 0.9
exploration_rate = 0.2
episodes = 100

# Training
for episode in range(episodes):
    state = 0  # Start at the beginning
    while state != goal_state:
        # Choose action (explore or exploit)
        if random.random() < exploration_rate:
            action = random.choice(actions)
        else:
            action = max(q_table[state], key=q_table[state].get)
        
        # Take action
        if action == 'right':
            next_state = min(state + 1, goal_state)
        else:
            next_state = max(state - 1, 0)
        
        # Reward logic
        if next_state == goal_state:
            reward = 1  # goal reward
        else:
            reward = -0.1  # penalty for every non-goal step
        
        # Q-learning update
        old_value = q_table[state][action]
        next_max = max(q_table[next_state].values())
        new_value = old_value + learning_rate * (reward + discount_factor * next_max - old_value)
        q_table[state][action] = new_value
        
        state = next_state  # move to next state

# Show Q-table after training
print("Learned Q-table with penalties:")
for s in q_table:
    print(f"State {s}: {q_table[s]}")

# Testing
print("\nTesting learned policy:")
state = 0
steps = [state]
while state != goal_state:
    action = max(q_table[state], key=q_table[state].get)
    if action == 'right':
        state = min(state + 1, goal_state)
    else:
        state = max(state - 1, 0)
    steps.append(state)

print("Steps taken:", steps)


3. Output Explanation

After training, the script prints the learned Q-values for each action in each state. During testing, it prints the sequence of positions the agent moves through to reach the goal under the learned policy. With this reward structure, moving right is optimal in every state, so the test should report the positions 0 through 4 in order.
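
If you would rather see the policy itself than raw Q-values, you can reduce the table to the best action per state. A minimal sketch (the `policy` dict is illustrative, not part of the listing above):

# Derive the greedy policy (best action per state) from the learned Q-table.
policy = {s: max(q_table[s], key=q_table[s].get) for s in q_table}
print("Greedy policy:", policy)
# Note: the goal state's values are never updated, so its entry is arbitrary.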

4. A Real-Life-Inspired Example Using Text-Based Computation

Idea: Word Typing Practice (Text-Based Computation)

Let’s imagine:

  • The agent’s job is to type letters to match a target word (like ‘hi’).
  • At each step, it chooses a letter.
  • If the letter is correct at that position, it gets a reward.
  • If the letter is wrong, it gets a penalty.
  • When the full word has been typed correctly, the episode ends (in the code below, the reward comes from each correct letter; there is no separate completion bonus).

This is a kind of sequence generation task, like spelling or autocomplete.

Python Code: Typing the Word ‘hi’

import random

# Target word to "type"
target_word = "hi"
max_steps = len(target_word)

# All possible letters to choose from
alphabet = list("abcdefghijklmnopqrstuvwxyz")

# Q-table: state = (position, string_typed_so_far), action = letter.
# The table starts empty; states are added lazily inside the training
# loop the first time they are visited.
q_table = {}

# Hyperparameters
learning_rate = 0.1
discount_factor = 0.95
exploration_rate = 0.3
episodes = 200

# Training loop
for episode in range(episodes):
    current_string = ""
    position = 0
    
    while position < max_steps:
        state = (position, current_string)
        
        # Choose letter (action)
        if state not in q_table:
            q_table[state] = {letter: 0.0 for letter in alphabet}
        
        if random.random() < exploration_rate:
            action = random.choice(alphabet)
        else:
            action = max(q_table[state], key=q_table[state].get)
        
        # Check reward
        correct_letter = target_word[position]
        if action == correct_letter:
            reward = 1  # correct letter
            current_string += action
            position += 1
        else:
            reward = -0.5  # wrong letter, don't move forward
        
        # Get next state
        next_state = (position, current_string)
        if next_state not in q_table:
            q_table[next_state] = {letter: 0.0 for letter in alphabet}
        
        # Q-learning update
        old_value = q_table[state][action]
        next_max = max(q_table[next_state].values())
        new_value = old_value + learning_rate * (reward + discount_factor * next_max - old_value)
        q_table[state][action] = new_value

# Testing the learned policy
print(f"\nTesting learned policy to type the word '{target_word}':")
typed = ""
position = 0
while position < max_steps:
    state = (position, typed)
    if state in q_table:
        action = max(q_table[state], key=q_table[state].get)
    else:
        action = random.choice(alphabet)
    print(f"Position {position} - Typed: '{typed}' → Choosing: '{action}'")
    if action == target_word[position]:
        typed += action
        position += 1
    else:
        print("Wrong letter, retrying...")

print(f"\nFinal typed word: {typed}")


What’s Happening Here?

  • The agent learns to type a word one letter at a time.
  • It gets +1 for correct letters and -0.5 for wrong ones.
  • Through trial and error, it figures out how to spell the word correctly; you can point the same code at other words, as sketched below.
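
To experiment, change these lines at the top of the script and rerun training. Longer words have more states, so the episode count below is a rough guess rather than a tuned value:

# Retrain on a longer word: more positions means more states to learn.
target_word = "cat"
max_steps = len(target_word)
episodes = 1000  # rough guess; longer words usually need more episodes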

Why This Helps Understanding

  • It’s like learning to type, spell, or predict text: real-life tasks.
  • Easy to see what the agent is doing, and why certain actions are rewarded or penalized.

Reinforcement Learning – Brainstorming Session