Decision Tree example with Simple Python
1. Problem Statement:
We want to predict the rent of a house based on one feature: Size in square feet.
We split the data on a size threshold, the way a real decision tree does internally, implementing everything by hand.
Sample Data (Size vs Rent)
# Sample data: [Size in sq ft, Rent]
data = [
    [600, 12000],
    [850, 15000],
    [1100, 18000],
    [1500, 22000],
    [1800, 25000],
    [2100, 27000]
]
Step-by-Step Code
# Step 1: Split function to divide data based on a threshold
def split_data(data, threshold):
    left = []
    right = []
    for row in data:
        if row[0] <= threshold:
            left.append(row)
        else:
            right.append(row)
    return left, right

# Step 2: Calculate average rent (used as prediction in each leaf)
def average_rent(group):
    total = sum(row[1] for row in group)
    return total / len(group) if group else 0

# Step 3: Calculate total squared error for a split
def squared_error(groups):
    total_error = 0
    for group in groups:
        if not group:
            continue
        avg = average_rent(group)
        for row in group:
            total_error += (row[1] - avg) ** 2
    return total_error

# Step 4: Find the best split point
def find_best_split(data):
    best_threshold = None
    lowest_error = float('inf')
    for row in data:
        threshold = row[0]
        groups = split_data(data, threshold)
        error = squared_error(groups)
        if error < lowest_error:
            lowest_error = error
            best_threshold = threshold
    return best_threshold

# Step 5: Build the decision tree (only 1 split level for simplicity)
def build_tree(data):
    threshold = find_best_split(data)
    left, right = split_data(data, threshold)
    return {
        "threshold": threshold,
        "left_value": average_rent(left),
        "right_value": average_rent(right)
    }

# Step 6: Predict rent for new size input
def predict(tree, size):
    if size <= tree["threshold"]:
        return tree["left_value"]
    else:
        return tree["right_value"]

# Run the decision tree
tree = build_tree(data)

# Predict rent for a few new house sizes
test_sizes = [700, 1600, 2000]
for size in test_sizes:
    predicted = predict(tree, size)
    print(f"Predicted rent for {size} sq ft: ₹{int(predicted)}")
Output
Predicted rent for 700 sq ft: ₹15000
Predicted rent for 1600 sq ft: ₹24666
Predicted rent for 2000 sq ft: ₹24666
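The tree returned by build_tree is just a small dictionary, so you can also inspect the learned split directly (the values below are what the code above produces; the right leaf is a repeating decimal, truncated here):

print(tree)
# {'threshold': 1100, 'left_value': 15000.0, 'right_value': 24666.666...}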
What This Code Does:
- It finds the best place to split the data (threshold).
- It predicts average rent for smaller vs larger houses.
- It builds a simple 1-level decision tree — just like a basic flowchart.
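As an optional sanity check (this assumes scikit-learn is installed and is not part of the hand-rolled example), the same one-level tree can be fit with DecisionTreeRegressor. scikit-learn typically places its threshold at the midpoint between neighbouring sizes (1300 rather than 1100), but the resulting partition and leaf averages match:

from sklearn.tree import DecisionTreeRegressor

X = [[row[0]] for row in data]  # feature: size
y = [row[1] for row in data]    # target: rent

model = DecisionTreeRegressor(max_depth=1).fit(X, y)
for size in [700, 1600, 2000]:
    print(size, model.predict([[size]])[0])
# 700 15000.0
# 1600 24666.666...
# 2000 24666.666...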
2. Multilevel Decision Tree Goal
We’ll keep splitting the data into smaller groups recursively until:
- Each group is small enough, or
- The values inside a group are almost the same (i.e., no need to split more).
Updated Python Code (Multi-Level Tree)
# Sample data: [Size in sq ft, Rent]
data = [
    [600, 12000],
    [850, 15000],
    [1100, 18000],
    [1500, 22000],
    [1800, 25000],
    [2100, 27000]
]

# Split data based on a size threshold
def split_data(data, threshold):
    left = []
    right = []
    for row in data:
        if row[0] <= threshold:
            left.append(row)
        else:
            right.append(row)
    return left, right

# Average rent in a group
def average_rent(group):
    total = sum(row[1] for row in group)
    return total / len(group) if group else 0

# Calculate total squared error for a split
def squared_error(groups):
    total_error = 0
    for group in groups:
        if not group:
            continue
        avg = average_rent(group)
        for row in group:
            total_error += (row[1] - avg) ** 2
    return total_error

# Find best split threshold
def find_best_split(data):
    best_threshold = None
    lowest_error = float('inf')
    for row in data:
        threshold = row[0]
        groups = split_data(data, threshold)
        error = squared_error(groups)
        if error < lowest_error:
            lowest_error = error
            best_threshold = threshold
    return best_threshold

# Stop conditions
def is_pure(group):
    rents = [row[1] for row in group]
    return max(rents) - min(rents) < 1000  # difference less than ₹1000

def build_tree(data, depth=0, max_depth=3, min_samples=2):
    if len(data) <= min_samples or depth >= max_depth or is_pure(data):
        return average_rent(data)
    threshold = find_best_split(data)
    left_data, right_data = split_data(data, threshold)
    return {
        "threshold": threshold,
        "left": build_tree(left_data, depth + 1, max_depth, min_samples),
        "right": build_tree(right_data, depth + 1, max_depth, min_samples)
    }

# Prediction function
def predict(tree, size):
    if not isinstance(tree, dict):
        return tree
    if size <= tree["threshold"]:
        return predict(tree["left"], size)
    else:
        return predict(tree["right"], size)

# Build the multi-level tree
tree = build_tree(data)

# Display predictions
test_sizes = [700, 950, 1300, 1600, 2000]
for size in test_sizes:
    predicted = predict(tree, size)
    print(f"Predicted rent for {size} sq ft: ₹{int(predicted)}")
Sample Output
Predicted rent for 700 sq ft: ₹16500
Predicted rent for 950 sq ft: ₹16500
Predicted rent for 1300 sq ft: ₹22000
Predicted rent for 1600 sq ft: ₹26000
Predicted rent for 2000 sq ft: ₹26000
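A quick way to see the effect of max_depth: capping it at 1 reproduces the one-split tree from section 1 (this simply reuses build_tree from the code above):

shallow = build_tree(data, max_depth=1)
print(shallow)
# {'threshold': 1100, 'left': 15000.0, 'right': 24666.666...}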
What’s New Here?
- Multi-level splits: The tree keeps splitting until values are “similar enough” or max depth is reached.
- Tree as nested dictionaries: Each node contains either a split or a final value (leaf).
- Recursive prediction: The predict() function follows the branches down the tree.
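Because the tree is nothing but nested dictionaries, a tiny helper can pretty-print its structure (print_tree is an illustrative add-on, not part of the code above):

def print_tree(node, depth=0):
    # Internal nodes are dicts; leaves are plain numbers
    indent = "    " * depth
    if not isinstance(node, dict):
        print(f"{indent}Predict ₹{int(node)}")
    else:
        print(f"{indent}[size <= {node['threshold']}?]")
        print_tree(node["left"], depth + 1)   # rows with size <= threshold
        print_tree(node["right"], depth + 1)  # rows with size > threshold

print_tree(tree)
# [size <= 1100?]
#     [size <= 600?]
#         Predict ₹12000
#         Predict ₹16500
#     [size <= 1500?]
#         Predict ₹22000
#         Predict ₹26000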
Visual Idea
              [size <= 1100?]
              /             \
   [size <= 600?]       [size <= 1500?]
      /      \             /       \
  ₹12000   ₹16500       ₹22000    ₹26000
3. Extending Decision Tree Regression to handle multiple features, such as:
- Size (numeric)
- Location (categorical: ‘urban’ or ‘suburban’)
- Furnished (categorical: True or False)
Sample Data (with Multiple Features)
Each row:
[size, location, furnished, rent]
data = [
    [600, "suburban", False, 12000],
    [850, "urban", True, 16000],
    [1100, "urban", False, 18000],
    [1500, "suburban", True, 21000],
    [1800, "urban", True, 25000],
    [2100, "urban", False, 27000]
]
Multi-Feature Decision Tree (Code)
# Helper: Average rent of a group
def average_rent(group):
    total = sum(row[-1] for row in group)
    return total / len(group) if group else 0

# Helper: Total squared error
def squared_error(groups):
    total_error = 0
    for group in groups:
        if not group:
            continue
        avg = average_rent(group)
        for row in group:
            total_error += (row[-1] - avg) ** 2
    return total_error

# Function to split dataset
# Note: Python bools are a subclass of int, so furnished (True/False)
# is handled by the numerical branch; comparing with <= still separates
# furnished from unfurnished rows correctly.
def split_data(data, feature_index, value):
    left = []
    right = []
    for row in data:
        if isinstance(value, (int, float)):  # numerical split
            if row[feature_index] <= value:
                left.append(row)
            else:
                right.append(row)
        else:  # categorical split
            if row[feature_index] == value:
                left.append(row)
            else:
                right.append(row)
    return left, right

# Find best split (across all features)
def find_best_split(data, feature_indices):
    best_feature = None
    best_value = None
    lowest_error = float('inf')
    for index in feature_indices:
        values = set(row[index] for row in data)
        for val in values:
            groups = split_data(data, index, val)
            error = squared_error(groups)
            if error < lowest_error:
                best_feature = index
                best_value = val
                lowest_error = error
    return best_feature, best_value

# Check if group is pure
def is_pure(group):
    rents = [row[-1] for row in group]
    return max(rents) - min(rents) < 1000

# Recursive tree builder
def build_tree(data, depth=0, max_depth=3, min_samples=2):
    if len(data) <= min_samples or depth >= max_depth or is_pure(data):
        return average_rent(data)
    feature_indices = [0, 1, 2]  # size, location, furnished
    feature, value = find_best_split(data, feature_indices)
    left_data, right_data = split_data(data, feature, value)
    return {
        "feature": feature,
        "value": value,
        "left": build_tree(left_data, depth + 1, max_depth, min_samples),
        "right": build_tree(right_data, depth + 1, max_depth, min_samples)
    }

# Prediction function
def predict(tree, row):
    if not isinstance(tree, dict):
        return tree
    feature = tree["feature"]
    value = tree["value"]
    if isinstance(value, (int, float)):
        if row[feature] <= value:
            return predict(tree["left"], row)
        else:
            return predict(tree["right"], row)
    else:
        if row[feature] == value:
            return predict(tree["left"], row)
        else:
            return predict(tree["right"], row)

# Training the tree
tree = build_tree(data)

# Testing predictions
test_data = [
    [700, "urban", True],
    [1600, "suburban", False],
    [2000, "urban", False]
]
for test in test_data:
    rent = predict(tree, test)
    print(f"Predicted rent for {test}: ₹{int(rent)}")
Sample Output
Predicted rent for [700, 'urban', True]: ₹17000
Predicted rent for [1600, 'suburban', False]: ₹26000
Predicted rent for [2000, 'urban', False]: ₹26000
Explanation
- It now chooses between numeric and categorical features at each split.
- At each level, the tree:
  - Finds the best feature & value (e.g., location == 'urban' or size <= 1100); the inspection helper after this list shows every candidate it weighs
  - Creates two child nodes (left/right)
  - Recursively builds them until pure/small/depth-limited
- Prediction follows the same logic: compare the feature → go left/right.
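The split search can be made visible with a small inspection helper (inspect_splits is a hypothetical add-on, reusing split_data and squared_error from the code above), which prints the total squared error of every candidate split at a node:

def inspect_splits(data, feature_indices, names=("size", "location", "furnished")):
    # Try every (feature, value) pair the tree builder would consider
    for index in feature_indices:
        for value in sorted(set(row[index] for row in data)):
            groups = split_data(data, index, value)
            print(f"{names[index]} vs {value!r}: error = {squared_error(groups):,.0f}")

inspect_splits(data, [0, 1, 2])

On this dataset, size <= 1100 has the lowest error at the root, and size keeps winning at the deeper nodes as well, so the learned tree never actually branches on location or furnished.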
Real Tree Logic Example
Written as plain if/else logic, the tree learned from this data is:

if size <= 1100:
    if size <= 600:
        return 12000   # ₹12000
    else:
        return 17000   # ₹17000
else:
    if size <= 1500:
        return 21000   # ₹21000
    else:
        return 26000   # ₹26000

(A categorical test such as location == "urban" would appear here whenever it gave a lower squared error than any size threshold.)
Decision Tree (Text Visualization)
[Feature: size <= 1100?]
├── Yes:
│   └── [Feature: size <= 600?]
│       ├── Yes → Predict ₹12000
│       └── No → Predict ₹17000
└── No:
    └── [Feature: size <= 1500?]
        ├── Yes → Predict ₹21000
        └── No → Predict ₹26000
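The earlier print_tree idea extends to multi-feature nodes (again an illustrative add-on; FEATURE_NAMES and this version of print_tree are not part of the original code):

FEATURE_NAMES = ["size", "location", "furnished"]

def print_tree(node, depth=0):
    indent = "    " * depth
    if not isinstance(node, dict):
        print(f"{indent}Predict ₹{int(node)}")
        return
    value = node["value"]
    # bool is a subclass of int in Python, so check str/bool first
    # to label the split as an equality test rather than a threshold
    op = "==" if isinstance(value, (str, bool)) else "<="
    print(f"{indent}[{FEATURE_NAMES[node['feature']]} {op} {value}?]")
    print_tree(node["left"], depth + 1)
    print_tree(node["right"], depth + 1)

print_tree(tree)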
Decision Tree Regression – Basic Math Concepts