Decision Tree example with Simple Python
1. Problem Statement:
We want to predict the rent of a house based on one feature: Size in square feet.
We split the data on a size threshold, the way a real decision tree does internally, implementing everything by hand.
Sample Data (Size vs Rent)
# Sample data: [Size in sq ft, Rent]
data = [
    [600, 12000],
    [850, 15000],
    [1100, 18000],
    [1500, 22000],
    [1800, 25000],
    [2100, 27000]
]
Step-by-Step Code
# Step 1: Split function to divide data based on a threshold
def split_data(data, threshold):
    left = []
    right = []
    for row in data:
        if row[0] <= threshold:
            left.append(row)
        else:
            right.append(row)
    return left, right

# Step 2: Calculate average rent (used as prediction in each leaf)
def average_rent(group):
    total = sum(row[1] for row in group)
    return total / len(group) if group else 0

# Step 3: Calculate total squared error for a split
def squared_error(groups):
    total_error = 0
    for group in groups:
        if not group:
            continue
        avg = average_rent(group)
        for row in group:
            total_error += (row[1] - avg) ** 2
    return total_error

# Step 4: Find the best split point
def find_best_split(data):
    best_threshold = None
    lowest_error = float('inf')
    for row in data:
        threshold = row[0]
        groups = split_data(data, threshold)
        error = squared_error(groups)
        if error < lowest_error:
            lowest_error = error
            best_threshold = threshold
    return best_threshold

# Step 5: Build the decision tree (only 1 split level for simplicity)
def build_tree(data):
    threshold = find_best_split(data)
    left, right = split_data(data, threshold)
    return {
        "threshold": threshold,
        "left_value": average_rent(left),
        "right_value": average_rent(right)
    }

# Step 6: Predict rent for new size input
def predict(tree, size):
    if size <= tree["threshold"]:
        return tree["left_value"]
    else:
        return tree["right_value"]

# Run the decision tree
tree = build_tree(data)

# Predict rent for a few new house sizes
test_sizes = [700, 1600, 2000]
for size in test_sizes:
    predicted = predict(tree, size)
    print(f"Predicted rent for {size} sq ft: ₹{int(predicted)}")
Output
Predicted rent for 700 sq ft: ₹15000
Predicted rent for 1600 sq ft: ₹24666
Predicted rent for 2000 sq ft: ₹24666
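The tree returned by build_tree is just a small dictionary, so you can also inspect the learned split directly (the values below are what the code above produces; the right leaf is a repeating decimal, truncated here):

print(tree)
# {'threshold': 1100, 'left_value': 15000.0, 'right_value': 24666.666...}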
What This Code Does:
- It finds the best place to split the data (threshold).
- It predicts average rent for smaller vs larger houses.
- It builds a simple 1-level decision tree — just like a basic flowchart.
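As an optional sanity check (this assumes scikit-learn is installed and is not part of the hand-rolled example), the same one-level tree can be fit with DecisionTreeRegressor. scikit-learn typically places its threshold at the midpoint between neighbouring sizes (1300 rather than 1100), but the resulting partition and leaf averages match:

from sklearn.tree import DecisionTreeRegressor

X = [[row[0]] for row in data]  # feature: size
y = [row[1] for row in data]    # target: rent

model = DecisionTreeRegressor(max_depth=1).fit(X, y)
for size in [700, 1600, 2000]:
    print(size, model.predict([[size]])[0])
# 700 15000.0
# 1600 24666.666...
# 2000 24666.666...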
2. Multilevel Decision Tree Goal
We’ll keep splitting the data into smaller groups recursively until:
- Each group is small enough, or
- The values inside a group are almost the same (i.e., no need to split more).
Updated Python Code (Multi-Level Tree)
# Sample data: [Size in sq ft, Rent]
data = [
    [600, 12000],
    [850, 15000],
    [1100, 18000],
    [1500, 22000],
    [1800, 25000],
    [2100, 27000]
]

# Split data based on a size threshold
def split_data(data, threshold):
    left = []
    right = []
    for row in data:
        if row[0] <= threshold:
            left.append(row)
        else:
            right.append(row)
    return left, right

# Average rent in a group
def average_rent(group):
    total = sum(row[1] for row in group)
    return total / len(group) if group else 0

# Calculate total squared error for a split
def squared_error(groups):
    total_error = 0
    for group in groups:
        if not group:
            continue
        avg = average_rent(group)
        for row in group:
            total_error += (row[1] - avg) ** 2
    return total_error

# Find best split threshold
def find_best_split(data):
    best_threshold = None
    lowest_error = float('inf')
    for row in data:
        threshold = row[0]
        groups = split_data(data, threshold)
        error = squared_error(groups)
        if error < lowest_error:
            lowest_error = error
            best_threshold = threshold
    return best_threshold

# Stop conditions
def is_pure(group):
    rents = [row[1] for row in group]
    return max(rents) - min(rents) < 1000  # difference less than ₹1000

def build_tree(data, depth=0, max_depth=3, min_samples=2):
    if len(data) <= min_samples or depth >= max_depth or is_pure(data):
        return average_rent(data)
    threshold = find_best_split(data)
    left_data, right_data = split_data(data, threshold)
    return {
        "threshold": threshold,
        "left": build_tree(left_data, depth + 1, max_depth, min_samples),
        "right": build_tree(right_data, depth + 1, max_depth, min_samples)
    }

# Prediction function
def predict(tree, size):
    if not isinstance(tree, dict):
        return tree
    if size <= tree["threshold"]:
        return predict(tree["left"], size)
    else:
        return predict(tree["right"], size)

# Build the multi-level tree
tree = build_tree(data)

# Display predictions
test_sizes = [700, 950, 1300, 1600, 2000]
for size in test_sizes:
    predicted = predict(tree, size)
    print(f"Predicted rent for {size} sq ft: ₹{int(predicted)}")
Sample Output
Predicted rent for 700 sq ft: ₹16500
Predicted rent for 950 sq ft: ₹16500
Predicted rent for 1300 sq ft: ₹22000
Predicted rent for 1600 sq ft: ₹26000
Predicted rent for 2000 sq ft: ₹26000
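A quick way to see the effect of max_depth: capping it at 1 reproduces the one-split tree from section 1 (this simply reuses build_tree from the code above):

shallow = build_tree(data, max_depth=1)
print(shallow)
# {'threshold': 1100, 'left': 15000.0, 'right': 24666.666...}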
What’s New Here?
- Multi-level splits: The tree keeps splitting until values are “similar enough” or max depth is reached.
- Tree as nested dictionaries: Each node contains either a split or a final value (leaf).
- Recursive prediction: The predict() function follows the branches down the tree.
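Because the tree is nothing but nested dictionaries, a tiny helper can pretty-print its structure (print_tree is an illustrative add-on, not part of the code above):

def print_tree(node, depth=0):
    # Internal nodes are dicts; leaves are plain numbers
    indent = "    " * depth
    if not isinstance(node, dict):
        print(f"{indent}Predict ₹{int(node)}")
    else:
        print(f"{indent}[size <= {node['threshold']}?]")
        print_tree(node["left"], depth + 1)   # rows with size <= threshold
        print_tree(node["right"], depth + 1)  # rows with size > threshold

print_tree(tree)
# [size <= 1100?]
#     [size <= 600?]
#         Predict ₹12000
#         Predict ₹16500
#     [size <= 1500?]
#         Predict ₹22000
#         Predict ₹26000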
Visual Idea
              [size <= 1100?]
              /             \
   [size <= 600?]       [size <= 1500?]
      /      \             /       \
  ₹12000   ₹16500       ₹22000    ₹26000
3. Extending Decision Tree Regression to handle multiple features, such as:
- Size (numeric)
- Location (categorical: ‘urban’ or ‘suburban’)
- Furnished (categorical: True or False)
Sample Data (with Multiple Features)
Each row:
[size, location, furnished, rent]
data = [
    [600, "suburban", False, 12000],
    [850, "urban", True, 16000],
    [1100, "urban", False, 18000],
    [1500, "suburban", True, 21000],
    [1800, "urban", True, 25000],
    [2100, "urban", False, 27000]
]
Multi-Feature Decision Tree (Code)
# Helper: Average rent of a group
def average_rent(group):
    total = sum(row[-1] for row in group)
    return total / len(group) if group else 0

# Helper: Total squared error
def squared_error(groups):
    total_error = 0
    for group in groups:
        if not group:
            continue
        avg = average_rent(group)
        for row in group:
            total_error += (row[-1] - avg) ** 2
    return total_error

# Function to split dataset
# Note: Python bools are a subclass of int, so furnished (True/False)
# is handled by the numerical branch; comparing with <= still separates
# furnished from unfurnished rows correctly.
def split_data(data, feature_index, value):
    left = []
    right = []
    for row in data:
        if isinstance(value, (int, float)):  # numerical split
            if row[feature_index] <= value:
                left.append(row)
            else:
                right.append(row)
        else:  # categorical split
            if row[feature_index] == value:
                left.append(row)
            else:
                right.append(row)
    return left, right

# Find best split (across all features)
def find_best_split(data, feature_indices):
    best_feature = None
    best_value = None
    lowest_error = float('inf')
    for index in feature_indices:
        values = set(row[index] for row in data)
        for val in values:
            groups = split_data(data, index, val)
            error = squared_error(groups)
            if error < lowest_error:
                best_feature = index
                best_value = val
                lowest_error = error
    return best_feature, best_value

# Check if group is pure
def is_pure(group):
    rents = [row[-1] for row in group]
    return max(rents) - min(rents) < 1000

# Recursive tree builder
def build_tree(data, depth=0, max_depth=3, min_samples=2):
    if len(data) <= min_samples or depth >= max_depth or is_pure(data):
        return average_rent(data)
    feature_indices = [0, 1, 2]  # size, location, furnished
    feature, value = find_best_split(data, feature_indices)
    left_data, right_data = split_data(data, feature, value)
    return {
        "feature": feature,
        "value": value,
        "left": build_tree(left_data, depth + 1, max_depth, min_samples),
        "right": build_tree(right_data, depth + 1, max_depth, min_samples)
    }

# Prediction function
def predict(tree, row):
    if not isinstance(tree, dict):
        return tree
    feature = tree["feature"]
    value = tree["value"]
    if isinstance(value, (int, float)):
        if row[feature] <= value:
            return predict(tree["left"], row)
        else:
            return predict(tree["right"], row)
    else:
        if row[feature] == value:
            return predict(tree["left"], row)
        else:
            return predict(tree["right"], row)

# Training the tree
tree = build_tree(data)

# Testing predictions
test_data = [
    [700, "urban", True],
    [1600, "suburban", False],
    [2000, "urban", False]
]
for test in test_data:
    rent = predict(tree, test)
    print(f"Predicted rent for {test}: ₹{int(rent)}")
Sample Output
Predicted rent for [700, 'urban', True]: ₹17000
Predicted rent for [1600, 'suburban', False]: ₹26000
Predicted rent for [2000, 'urban', False]: ₹26000
Explanation
- It now chooses between numeric and categorical features at each split.
- At each level, the tree:
  - Finds the best feature & value (e.g., location == 'urban' or size <= 1100); the inspection helper after this list shows every candidate it weighs
  - Creates two child nodes (left/right)
  - Recursively builds them until pure/small/depth-limited
- Prediction follows the same logic: compare the feature → go left/right.
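The split search can be made visible with a small inspection helper (inspect_splits is a hypothetical add-on, reusing split_data and squared_error from the code above), which prints the total squared error of every candidate split at a node:

def inspect_splits(data, feature_indices, names=("size", "location", "furnished")):
    # Try every (feature, value) pair the tree builder would consider
    for index in feature_indices:
        for value in sorted(set(row[index] for row in data)):
            groups = split_data(data, index, value)
            print(f"{names[index]} vs {value!r}: error = {squared_error(groups):,.0f}")

inspect_splits(data, [0, 1, 2])

On this dataset, size <= 1100 has the lowest error at the root, and size keeps winning at the deeper nodes as well, so the learned tree never actually branches on location or furnished.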
Real Tree Logic Example
Written as plain if/else logic, the tree learned from this data is:

if size <= 1100:
    if size <= 600:
        return 12000   # ₹12000
    else:
        return 17000   # ₹17000
else:
    if size <= 1500:
        return 21000   # ₹21000
    else:
        return 26000   # ₹26000

(A categorical test such as location == "urban" would appear here whenever it gave a lower squared error than any size threshold.)
Decision Tree (Text Visualization)
[Feature: size <= 1100?]
├── Yes:
│   └── [Feature: size <= 600?]
│       ├── Yes → Predict ₹12000
│       └── No → Predict ₹17000
└── No:
    └── [Feature: size <= 1500?]
        ├── Yes → Predict ₹21000
        └── No → Predict ₹26000
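The earlier print_tree idea extends to multi-feature nodes (again an illustrative add-on; FEATURE_NAMES and this version of print_tree are not part of the original code):

FEATURE_NAMES = ["size", "location", "furnished"]

def print_tree(node, depth=0):
    indent = "    " * depth
    if not isinstance(node, dict):
        print(f"{indent}Predict ₹{int(node)}")
        return
    value = node["value"]
    # bool is a subclass of int in Python, so check str/bool first
    # to label the split as an equality test rather than a threshold
    op = "==" if isinstance(value, (str, bool)) else "<="
    print(f"{indent}[{FEATURE_NAMES[node['feature']]} {op} {value}?]")
    print_tree(node["left"], depth + 1)
    print_tree(node["right"], depth + 1)

print_tree(tree)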
Decision Tree Regression – Basic Math Concepts