Reinforcement Learning Example in Simple Python
1. Goal
We’ll create a game where:
- An agent (a player) is on a line of 5 positions (from 0 to 4).
- The goal is to reach position 4 and collect a reward.
- The agent can move left or right.
- Reward for reaching the goal (position 4): +1
- Penalty for every step that’s not the goal: -0.1
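Before looking at the full script, here is a minimal sketch of those rules as a single step function (the step helper is just an illustration we add here; the full code below inlines the same logic):

def step(state, action, goal_state=4):
    # Move one position left or right, clamped to the line [0, 4]
    if action == 'right':
        next_state = min(state + 1, goal_state)
    else:
        next_state = max(state - 1, 0)
    # +1 for reaching the goal, -0.1 for every other step
    reward = 1 if next_state == goal_state else -0.1
    done = next_state == goal_state
    return next_state, reward, done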
We’ll use Q-learning, a simple RL algorithm.
2. Concepts Used
- States: positions 0 to 4
- Actions: ‘left’ or ‘right’
- Q-table: a table that stores a value for each (state, action) pair; the best action in a state is the one with the highest value
- Learning: updating the Q-table based on what happens
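The whole algorithm rests on one update rule: Q(s, a) ← Q(s, a) + α × (reward + γ × max Q(next state) − Q(s, a)), where α is the learning rate and γ is the discount factor. As a sketch, here is the same rule in Python (the helper name update_q is ours, for illustration; the script below applies the rule inline):

def update_q(q_table, state, action, reward, next_state,
             learning_rate=0.1, discount_factor=0.9):
    # Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    old_value = q_table[state][action]
    next_max = max(q_table[next_state].values())
    q_table[state][action] = old_value + learning_rate * (
        reward + discount_factor * next_max - old_value)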
Beginner-Friendly Code with Penalties
import random

# Environment
states = [0, 1, 2, 3, 4]
actions = ['left', 'right']
goal_state = 4

# Q-table
q_table = {}
for state in states:
    q_table[state] = {'left': 0.0, 'right': 0.0}

# Hyperparameters
learning_rate = 0.1
discount_factor = 0.9
exploration_rate = 0.2
episodes = 100

# Training
for episode in range(episodes):
    state = 0  # Start at the beginning
    while state != goal_state:
        # Choose action (explore or exploit)
        if random.random() < exploration_rate:
            action = random.choice(actions)
        else:
            action = max(q_table[state], key=q_table[state].get)

        # Take action
        if action == 'right':
            next_state = min(state + 1, goal_state)
        else:
            next_state = max(state - 1, 0)

        # Reward logic
        if next_state == goal_state:
            reward = 1  # goal reward
        else:
            reward = -0.1  # penalty for every non-goal step

        # Q-learning update
        old_value = q_table[state][action]
        next_max = max(q_table[next_state].values())
        new_value = old_value + learning_rate * (reward + discount_factor * next_max - old_value)
        q_table[state][action] = new_value

        state = next_state  # move to next state

# Show Q-table after training
print("Learned Q-table with penalties:")
for s in q_table:
    print(f"State {s}: {q_table[s]}")

# Testing
print("\nTesting learned policy:")
state = 0
steps = [state]
while state != goal_state:
    action = max(q_table[state], key=q_table[state].get)
    if action == 'right':
        state = min(state + 1, goal_state)
    else:
        state = max(state - 1, 0)
    steps.append(state)
print("Steps taken:", steps)
3. Output Explanation
After training, the script prints the learned Q-values for each action in each state. During testing, it prints the sequence of positions the agent visits while following the learned policy to the goal.
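If you would rather see the policy than raw numbers, a small helper of our own (assuming the q_table produced by the script above) prints the greedy action for each state:

def print_policy(q_table, goal_state=4):
    # For each state, show the action with the highest Q-value
    for state, values in q_table.items():
        if state == goal_state:
            continue  # no action needed at the goal
        best = max(values, key=values.get)
        print(f"State {state}: go {best}")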
4. A Real-Life-Inspired Example Using Text-Based Computation
Idea: Word Typing Practice (Text-Based Computation)
Let’s imagine:
- The agent’s job is to type letters to match a target word (like ‘hi’).
- At each step, it chooses a letter.
- If the letter is correct at that position, it gets a reward.
- If the letter is wrong, it gets a penalty.
- When the full word is correctly typed, it gets a big reward.
This is a kind of sequence generation task, like spelling or autocomplete.
Python Code: Typing the Word ‘hi’
import random

# Target word to "type"
target_word = "hi"
max_steps = len(target_word)

# All possible letters to choose from
alphabet = list("abcdefghijklmnopqrstuvwxyz")

# Q-table: state = (position, current_string), action = letter.
# States are added lazily the first time they are visited, so we
# don't need to enumerate every possible string up front.
q_table = {}

def ensure_state(state):
    # Initialize Q-values for a state the first time we see it
    if state not in q_table:
        q_table[state] = {letter: 0.0 for letter in alphabet}

# Hyperparameters
learning_rate = 0.1
discount_factor = 0.95
exploration_rate = 0.3
episodes = 200

# Training loop
for episode in range(episodes):
    current_string = ""
    position = 0
    while position < max_steps:
        state = (position, current_string)
        ensure_state(state)

        # Choose letter (action): explore or exploit
        if random.random() < exploration_rate:
            action = random.choice(alphabet)
        else:
            action = max(q_table[state], key=q_table[state].get)

        # Check reward
        correct_letter = target_word[position]
        if action == correct_letter:
            reward = 1  # correct letter
            current_string += action
            position += 1
            if position == max_steps:
                reward += 5  # big reward for completing the word (bonus size chosen arbitrarily)
        else:
            reward = -0.5  # wrong letter, don't move forward

        # Q-learning update
        next_state = (position, current_string)
        ensure_state(next_state)
        old_value = q_table[state][action]
        next_max = max(q_table[next_state].values())
        new_value = old_value + learning_rate * (reward + discount_factor * next_max - old_value)
        q_table[state][action] = new_value

# Testing the learned policy
print(f"\nTesting learned policy to type the word '{target_word}':")
typed = ""
position = 0
attempts = 0
while position < max_steps and attempts < 100:  # cap attempts so a bad policy can't loop forever
    attempts += 1
    state = (position, typed)
    if state in q_table:
        action = max(q_table[state], key=q_table[state].get)
    else:
        action = random.choice(alphabet)
    print(f"Position {position} - Typed: '{typed}' → Choosing: '{action}'")
    if action == target_word[position]:
        typed += action
        position += 1
    else:
        print("Wrong letter, retrying...")

print(f"\nFinal typed word: {typed}")
What’s Happening Here?
- The agent learns to type a word one letter at a time.
- It gets +1 for each correct letter, -0.5 for a wrong one, and a +5 bonus when the word is complete.
- Through trial and error, it figures out how to spell the word correctly.
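To make the trial and error concrete, here is the discounted return of the optimal run for 'hi' under the reward values above (a worked example, using the same discount_factor as the script):

# Optimal run for "hi": +1 for 'h', then +1 for 'i' plus the +5 completion bonus
rewards = [1, 1 + 5]
discount_factor = 0.95
G = sum(r * discount_factor ** t for t, r in enumerate(rewards))
print(G)  # 1 + 0.95 * 6 ≈ 6.7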
Why This Helps Understanding
- It’s like learning to type, spell, or predict text: all real-life tasks.
- Easy to see what the agent is doing, and why certain actions are rewarded or penalized.
Reinforcement Learning – Brainstorming Session