Decision Tree Example with Simple Python
1. Problem Statement:
We want to predict the rent of a house based on one feature: Size in square feet.
We manually split the data using a threshold, like a real decision tree.
Sample Data (Size vs Rent)
# Sample data: [Size in sq ft, Rent]
data = [
    [600, 12000],
    [850, 15000],
    [1100, 18000],
    [1500, 22000],
    [1800, 25000],
    [2100, 27000]
]
Step-by-Step Code
# Step 1: Split function to divide data based on a threshold
def split_data(data, threshold):
    left = []
    right = []
    for row in data:
        if row[0] <= threshold:
            left.append(row)
        else:
            right.append(row)
    return left, right
# Step 2: Calculate average rent (used as prediction in each leaf)
def average_rent(group):
    total = sum(row[1] for row in group)
    return total / len(group) if group else 0
# Step 3: Calculate total squared error for a split
def squared_error(groups):
    total_error = 0
    for group in groups:
        if not group:
            continue
        avg = average_rent(group)
        for row in group:
            total_error += (row[1] - avg) ** 2
    return total_error
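# Quick check of Step 3, worked out by hand from the sample data: splitting
# at 850 puts rents 12000 and 15000 on the left (mean 13500, error 4,500,000)
# and the other four on the right (mean 23000, error 46,000,000):
# >>> squared_error(split_data(data, 850))
# 50500000.0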
# Step 4: Find the best split point
def find_best_split(data):
    best_threshold = None
    lowest_error = float('inf')
    for row in data:
        threshold = row[0]
        groups = split_data(data, threshold)
        error = squared_error(groups)
        if error < lowest_error:
            lowest_error = error
            best_threshold = threshold
    return best_threshold
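# Among all candidate thresholds, 1100 gives the lowest total error
# (about 30,670,000, versus 50,500,000 at 850):
# >>> find_best_split(data)
# 1100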
# Step 5: Build the decision tree (only 1 split level for simplicity)
def build_tree(data):
    threshold = find_best_split(data)
    left, right = split_data(data, threshold)
    return {
        "threshold": threshold,
        "left_value": average_rent(left),
        "right_value": average_rent(right)
    }
# Step 6: Predict rent for new size input
def predict(tree, size):
    if size <= tree["threshold"]:
        return tree["left_value"]
    else:
        return tree["right_value"]
# Run the decision tree
tree = build_tree(data)
# Predict rent for a few new house sizes
test_sizes = [700, 1600, 2000]
for size in test_sizes:
    predicted = predict(tree, size)
    print(f"Predicted rent for {size} sq ft: ₹{int(predicted)}")
Output
Predicted rent for 700 sq ft: ₹15000
Predicted rent for 1600 sq ft: ₹24666
Predicted rent for 2000 sq ft: ₹24666
What This Code Does:
- It finds the best place to split the data (threshold).
- It predicts average rent for smaller vs larger houses.
- It builds a simple 1-level decision tree — just like a basic flowchart.
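As a quick cross-check, the same one-split model can be reproduced with scikit-learn's DecisionTreeRegressor. This is a minimal sketch, assuming scikit-learn is installed; sklearn places its threshold at a midpoint between neighbouring sizes (here between 1100 and 1500), which yields the same two groups and therefore the same leaf averages.
from sklearn.tree import DecisionTreeRegressor

X = [[row[0]] for row in data]  # one-column feature matrix: the sizes
y = [row[1] for row in data]    # target values: the rents

model = DecisionTreeRegressor(max_depth=1)  # depth 1 = a single split, like build_tree above
model.fit(X, y)
print(model.predict([[700], [1600], [2000]]))  # expected roughly [15000, 24666.67, 24666.67]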
2. Multilevel Decision Tree Goal
We’ll keep splitting the data into smaller groups recursively, until:
- Each group is small enough, or
- The values inside a group are almost the same (i.e., no need to split more).
Updated Python Code (Multi-Level Tree)
# Sample data: [Size in sq ft, Rent]
data = [
    [600, 12000],
    [850, 15000],
    [1100, 18000],
    [1500, 22000],
    [1800, 25000],
    [2100, 27000]
]
# Split data based on a size threshold
def split_data(data, threshold):
    left = []
    right = []
    for row in data:
        if row[0] <= threshold:
            left.append(row)
        else:
            right.append(row)
    return left, right
# Average rent in a group
def average_rent(group):
    total = sum(row[1] for row in group)
    return total / len(group) if group else 0
# Calculate total squared error for a split
def squared_error(groups):
    total_error = 0
    for group in groups:
        if not group:
            continue
        avg = average_rent(group)
        for row in group:
            total_error += (row[1] - avg) ** 2
    return total_error
# Find best split threshold
def find_best_split(data):
    best_threshold = None
    lowest_error = float('inf')
    for row in data:
        threshold = row[0]
        groups = split_data(data, threshold)
        error = squared_error(groups)
        if error < lowest_error:
            lowest_error = error
            best_threshold = threshold
    return best_threshold
# Stop conditions
def is_pure(group):
    rents = [row[1] for row in group]
    return max(rents) - min(rents) < 1000  # difference less than ₹1000
def build_tree(data, depth=0, max_depth=3, min_samples=2):
    if len(data) <= min_samples or depth >= max_depth or is_pure(data):
        return average_rent(data)
    threshold = find_best_split(data)
    left_data, right_data = split_data(data, threshold)
    return {
        "threshold": threshold,
        "left": build_tree(left_data, depth + 1, max_depth, min_samples),
        "right": build_tree(right_data, depth + 1, max_depth, min_samples)
    }
# Prediction function
def predict(tree, size):
    if not isinstance(tree, dict):
        return tree
    if size <= tree["threshold"]:
        return predict(tree["left"], size)
    else:
        return predict(tree["right"], size)
# Build the multi-level tree
tree = build_tree(data)
# Display predictions
test_sizes = [700, 950, 1300, 1600, 2000]
for size in test_sizes:
    predicted = predict(tree, size)
    print(f"Predicted rent for {size} sq ft: ₹{int(predicted)}")
Sample Output
Predicted rent for 700 sq ft: ₹16500
Predicted rent for 950 sq ft: ₹16500
Predicted rent for 1300 sq ft: ₹22000
Predicted rent for 1600 sq ft: ₹26000
Predicted rent for 2000 sq ft: ₹26000
What’s New Here?
- Multi-level splits: The tree keeps splitting until values are “similar enough” or max depth is reached.
- Tree as nested dictionaries: Each node contains either a split or a final value (leaf).
- Recursive prediction: The predict() function follows the branches down the tree.
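Because the tree is just nested dictionaries, a small helper can print its structure. The helper below is a sketch (it is not part of the listing above); its output matches the shape sketched under Visual Idea:
# Hypothetical helper: walk the nested dicts and print one node per line
def print_tree(node, indent=""):
    if not isinstance(node, dict):  # leaf: an average rent
        print(f"{indent}Predict ₹{int(node)}")
    else:
        print(f"{indent}[size <= {node['threshold']}?]")
        print_tree(node["left"], indent + "    ")
        print_tree(node["right"], indent + "    ")

print_tree(tree)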
Visual Idea
                [size <= 1100?]
               /               \
      [size <= 600?]       [size <= 1500?]
       /         \           /         \
   ₹12000     ₹16500     ₹22000     ₹26000
3. Extending Decision Tree Regression to Multiple Features
We extend the regression tree to handle multiple features:
- Size (numeric)
- Location (categorical: 'urban' or 'suburban')
- Furnished (categorical: True or False)
Sample Data (with Multiple Features)
Each row:
[size, location, furnished, rent]
data = [
    [600, "suburban", False, 12000],
    [850, "urban", True, 16000],
    [1100, "urban", False, 18000],
    [1500, "suburban", True, 21000],
    [1800, "urban", True, 25000],
    [2100, "urban", False, 27000]
]
Multi-Feature Decision Tree (Code)
# Helper: Average rent of a group
def average_rent(group):
    total = sum(row[-1] for row in group)
    return total / len(group) if group else 0
# Helper: Total squared error
def squared_error(groups):
    total_error = 0
    for group in groups:
        if not group:
            continue
        avg = average_rent(group)
        for row in group:
            total_error += (row[-1] - avg) ** 2
    return total_error
# Function to split dataset
def split_data(data, feature_index, value):
    left = []
    right = []
    for row in data:
        # Note: bool is a subclass of int in Python, so exclude it here and
        # treat True/False as categorical values in the else branch
        if isinstance(value, (int, float)) and not isinstance(value, bool):  # numerical split
            if row[feature_index] <= value:
                left.append(row)
            else:
                right.append(row)
        else:  # categorical split (strings and booleans)
            if row[feature_index] == value:
                left.append(row)
            else:
                right.append(row)
    return left, right
# Find best split (across all features)
def find_best_split(data, feature_indices):
    best_feature = None
    best_value = None
    lowest_error = float('inf')
    for index in feature_indices:
        # sorted() keeps tie-breaking deterministic across runs
        values = sorted(set(row[index] for row in data))
        for val in values:
            groups = split_data(data, index, val)
            error = squared_error(groups)
            if error < lowest_error:
                best_feature = index
                best_value = val
                lowest_error = error
    return best_feature, best_value
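# With this sample data, size is so strongly tied to rent that the best
# first split is on feature 0 (size) rather than location or furnished:
# >>> find_best_split(data, [0, 1, 2])
# (0, 1100)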
# Check if group is pure
def is_pure(group):
    rents = [row[-1] for row in group]
    return max(rents) - min(rents) < 1000
# Recursive tree builder
def build_tree(data, depth=0, max_depth=3, min_samples=2):
    if len(data) <= min_samples or depth >= max_depth or is_pure(data):
        return average_rent(data)
    feature_indices = [0, 1, 2]  # size, location, furnished
    feature, value = find_best_split(data, feature_indices)
    left_data, right_data = split_data(data, feature, value)
    return {
        "feature": feature,
        "value": value,
        "left": build_tree(left_data, depth + 1, max_depth, min_samples),
        "right": build_tree(right_data, depth + 1, max_depth, min_samples)
    }
# Prediction function
def predict(tree, row):
    if not isinstance(tree, dict):
        return tree
    feature = tree["feature"]
    value = tree["value"]
    # Same bool-aware check as in split_data
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        if row[feature] <= value:
            return predict(tree["left"], row)
        else:
            return predict(tree["right"], row)
    else:
        if row[feature] == value:
            return predict(tree["left"], row)
        else:
            return predict(tree["right"], row)
# Training the tree
tree = build_tree(data)
# Testing predictions
test_data = [
    [700, "urban", True],
    [1600, "suburban", False],
    [2000, "urban", False]
]
for test in test_data:
    rent = predict(tree, test)
    print(f"Predicted rent for {test}: ₹{int(rent)}")
Sample Output
Predicted rent for [700, 'urban', True]: ₹17000
Predicted rent for [1600, 'suburban', False]: ₹26000
Predicted rent for [2000, 'urban', False]: ₹26000
Explanation
- It now chooses between numeric and categorical features at each split.
- At each level, the tree:
  - Finds the best feature & value (e.g., location == 'urban' or size <= 1100)
  - Creates two child nodes (left/right)
  - Recursively builds them until the group is pure, small, or depth-limited
- Prediction follows the same logic: compare the feature, then go left or right.
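Printing the learned tree makes the chosen splits visible. On this tiny dataset every split lands on feature 0 (size), because size alone reduces the squared error more than any location or furnished split; the leaf values below were worked out by hand from the sample data:
print(tree)
# {'feature': 0, 'value': 1100,
#  'left': {'feature': 0, 'value': 600, 'left': 12000.0, 'right': 17000.0},
#  'right': {'feature': 0, 'value': 1500, 'left': 21000.0, 'right': 26000.0}}
# (wrapped here for readability; print() emits it on one line)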
Real Tree Logic Example
Written out as plain if/else rules, the learned tree is:
if size <= 1100:
    if size <= 600:
        return 12000
    else:
        return 17000
else:
    if size <= 1500:
        return 21000
    else:
        return 26000
Decision Tree (Text Visualization)
[size <= 1100?]
├── Yes:
│   └── [size <= 600?]
│       ├── Yes → Predict ₹12000
│       └── No → Predict ₹17000
└── No:
    └── [size <= 1500?]
        ├── Yes → Predict ₹21000
        └── No → Predict ₹26000
Decision Tree Regression – Basic Math Concepts
