KNN Regression Example in Simple Python

1. We’ll simulate the house price prediction example.

Problem Setup:

We already know the prices of 5 houses along with their sizes (in square feet). We’ll predict the price of a new house (e.g., 1400 sqft) using K=3 nearest neighbors.

# Sample known data: (House size, Price)
known_data = [
    (1000, 50),   # 1000 sqft, ₹50 lakhs
    (1500, 65),   # 1500 sqft, ₹65 lakhs
    (1200, 54),   # 1200 sqft, ₹54 lakhs
    (1800, 75),   # 1800 sqft, ₹75 lakhs
    (1100, 52)    # 1100 sqft, ₹52 lakhs
]

# Step 1: Define the new house size we want to predict
new_house_size = 1400

# Step 2: Compute distances from the new house to all known houses
def compute_distance(house_size1, house_size2):
    return abs(house_size1 - house_size2)  # Since it's 1D, just take absolute difference

# Step 3: Get distances with data
distances = []
for size, price in known_data:
    dist = compute_distance(size, new_house_size)
    distances.append((dist, price))

# Step 4: Sort by distance
distances.sort(key=lambda x: x[0])

# Step 5: Take the top K neighbors
K = 3
top_k = distances[:K]

# Step 6: Predict price by averaging their prices
total_price = sum(price for _, price in top_k)
predicted_price = total_price / K

# Output
print(f"Predicted price for house size {new_house_size} sqft (K={K}) is: ₹{predicted_price} lakhs")

Output Example:

Predicted price for house size 1400 sqft (K=3) is: ₹57.0 lakhs

What’s Happening Behind the Scenes:

  1. Distance is measured (here, just |known_size – new_size|).
  2. Top 3 most similar houses are picked.
  3. Their prices are averaged for the prediction.
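
To see those three steps with the actual numbers, here is a short, self-contained trace of what the script above computes (same data, no new assumptions):

known_data = [(1000, 50), (1500, 65), (1200, 54), (1800, 75), (1100, 52)]
nearest = sorted((abs(size - 1400), price) for size, price in known_data)[:3]
print(nearest)                                 # [(100, 65), (200, 54), (300, 52)]
print(sum(price for _, price in nearest) / 3)  # (65 + 54 + 52) / 3 = 57.0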

2. Extending the previous simulation to multi-feature KNN Regression

Problem:

We have past house data with 3 features:

  1. Size (sqft)
  2. Number of bedrooms
  3. Location score (say, 1–10 where 10 = best locality)

Our goal is to predict the price for a new house using KNN Regression with these 3 features and K = 3.

Sample Data (each tuple: [size, bedrooms, location_score], price)

# Past house data
known_data = [
    ([1000, 2, 6], 50),   # ₹50 lakhs
    ([1500, 3, 9], 65),   # ₹65 lakhs
    ([1200, 2, 5], 54),   # ₹54 lakhs
    ([1800, 4, 8], 75),   # ₹75 lakhs
    ([1100, 2, 4], 52)    # ₹52 lakhs
]

# New house to predict: [size, bedrooms, location_score]
new_house = [1400, 3, 7]

# Step 1: Define distance function for multi-feature (Euclidean distance)
def euclidean_distance(vec1, vec2):
    return sum((a - b) ** 2 for a, b in zip(vec1, vec2)) ** 0.5

# Step 2: Compute distances
distances = []
for features, price in known_data:
    dist = euclidean_distance(features, new_house)
    distances.append((dist, price))

# Step 3: Sort by distance
distances.sort(key=lambda x: x[0])

# Step 4: Take top K
K = 3
top_k = distances[:K]

# Step 5: Predict price by averaging top K prices
total_price = sum(price for _, price in top_k)
predicted_price = total_price / K

# Output
print(f"Predicted price for house {new_house} (K={K}): ₹{predicted_price} lakhs")

Example Output:

Predicted price for house [1400, 3, 7] (K=3): ₹57.0 lakhs

What You Just Did:

  • We treated each house as a vector in 3D space.
  • We measured closeness using Euclidean distance.
  • We picked K=3 closest houses.
  • We averaged their prices to get the prediction.
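
For a concrete feel of those distances, here is the same calculation traced step by step (self-contained, same data as above). Because the features are not yet scaled, the size difference dominates every distance, which is exactly what the next section addresses:

known_data = [([1000, 2, 6], 50), ([1500, 3, 9], 65), ([1200, 2, 5], 54),
              ([1800, 4, 8], 75), ([1100, 2, 4], 52)]
new_house = [1400, 3, 7]

def euclidean_distance(vec1, vec2):
    return sum((a - b) ** 2 for a, b in zip(vec1, vec2)) ** 0.5

for features, price in known_data:
    print(features, round(euclidean_distance(features, new_house), 2), price)
# distances ≈ 400.0, 100.02, 200.01, 400.0, 300.02
# nearest three: 100.02 -> 65, 200.01 -> 54, 300.02 -> 52
# prediction = (65 + 54 + 52) / 3 = 57.0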

3. Enhancing our KNN Regression with two important upgrades:

What We’ll Add:

  1. Feature Normalization
    Because “Size (1000–2000)” is on a very different scale than “Bedrooms (1–5)” or “Location score (1–10)” (see the quick check right after this list).
  2. Feature Weights
    Maybe Size is more important than Location score — so we can give it higher weight during distance calculation.
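
To make the scale problem concrete before the full script: with raw, unscaled features, a 100 sqft size difference outweighs even large differences in bedrooms and location. A quick check, using the same euclidean_distance function as before and an exaggerated hypothetical comparison house:

def euclidean_distance(vec1, vec2):
    return sum((a - b) ** 2 for a, b in zip(vec1, vec2)) ** 0.5

new_house = [1400, 3, 7]
print(euclidean_distance([1500, 3, 9], new_house))  # ≈ 100.02 (only 100 sqft bigger, same bedrooms)
print(euclidean_distance([1400, 5, 1], new_house))  # ≈ 6.32 (hypothetical: 2 extra bedrooms, terrible location, yet "closer")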

Enhanced KNN Regression (with Normalization + Weights)

# Sample data: [Size, Bedrooms, Location Score], Price in Lakhs
known_data = [
    ([1000, 2, 6], 50),
    ([1500, 3, 9], 65),
    ([1200, 2, 5], 54),
    ([1800, 4, 8], 75),
    ([1100, 2, 4], 52)
]

new_house = [1400, 3, 7]  # Size, Bedrooms, Location

# Step 1: Normalize Features (min-max scaling)
def normalize(data, new_point):
    # Transpose to work column-wise
    features = list(zip(*[d[0] for d in data]))  # each feature column separately
    new_norm = []
    norm_data = []

    for i in range(len(features)):
        col = features[i]
        min_val = min(col)
        max_val = max(col)
        range_val = max_val - min_val if max_val != min_val else 1  # avoid divide by zero

        # Normalize existing data
        norm_col = [(x - min_val) / range_val for x in col]
        for j, val in enumerate(norm_col):
            if i == 0:
                norm_data.append([val])
            else:
                norm_data[j].append(val)

        # Normalize new_point
        new_norm.append((new_point[i] - min_val) / range_val)

    return norm_data, new_norm

# Normalize
normalized_features, normalized_new_house = normalize(known_data, new_house)

# Step 2: Assign weights for features: [Size, Bedrooms, Location]
feature_weights = [0.6, 0.3, 0.1]

# Step 3: Compute weighted Euclidean distances
def weighted_distance(v1, v2, weights):
    return sum(weights[i] * (v1[i] - v2[i]) ** 2 for i in range(len(v1))) ** 0.5

# Combine normalized data with original prices
normalized_data = list(zip(normalized_features, [price for _, price in known_data]))

# Step 4: Calculate distances
distances = []
for features, price in normalized_data:
    dist = weighted_distance(features, normalized_new_house, feature_weights)
    distances.append((dist, price))

# Step 5: Sort by distance and get top K
distances.sort(key=lambda x: x[0])
K = 3
top_k = distances[:K]

# Step 6: Predict price
predicted_price = sum(price for _, price in top_k) / K

# Output
print(f"Predicted price for house {new_house} with weights {feature_weights}: ₹{predicted_price:.2f} lakhs")

Example Output:

Predicted price for house [1400, 3, 7] with weights [0.6, 0.3, 0.1]: ₹57.00 lakhs

Why This Matters:

  • Normalization ensures a fair comparison between features.
  • Weights let us tell the algorithm which features we trust more (see the quick experiment right after this list).
    • E.g., Size = 60% influence, Bedrooms = 30%, Location = 10%
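
As a quick experiment, the sketch below re-runs only the distance-and-average part with two different weight vectors. The normalized feature values are hard-coded here; they are exactly what the normalize() step above produces for this dataset. The location-heavy weights [0.1, 0.1, 0.8] are just an illustrative choice:

# Normalized features produced by normalize() above, paired with prices
normalized_data = [([0.0, 0.0, 0.4], 50), ([0.625, 0.5, 1.0], 65), ([0.25, 0.0, 0.2], 54),
                   ([1.0, 1.0, 0.8], 75), ([0.125, 0.0, 0.0], 52)]
normalized_new_house = [0.5, 0.5, 0.6]

def weighted_distance(v1, v2, weights):
    return sum(weights[i] * (v1[i] - v2[i]) ** 2 for i in range(len(v1))) ** 0.5

K = 3
for weights in ([0.6, 0.3, 0.1], [0.1, 0.1, 0.8]):
    dists = sorted((weighted_distance(f, normalized_new_house, weights), p) for f, p in normalized_data)
    print(weights, round(sum(p for _, p in dists[:K]) / K, 2))
# [0.6, 0.3, 0.1] 57.0   (size-heavy weights)
# [0.1, 0.1, 0.8] 63.33  (location-heavy weights pick different neighbors and raise the prediction)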

4. The logic of picking nearest neighbors in KNN Regression with a real-world story

Story: The Cake Lover’s Neighborhood
Imagine someone new in town, and today is his birthday. He is craving cake — but he wants the best one. He asks: “Where can I find a good cake, just like the one I love — soft, not too sweet, with lots of cream?”

He doesn’t want random suggestions. So he goes to the local community board and posts his cake preferences:

  • Softness: 9/10
  • Sweetness: 4/10
  • Creaminess: 8/10

Now, people from the area who’ve had cakes before reply with their experience of local bakeries.

Each person’s review looks like: “At SweetTooth Bakery, I had a cake — Softness: 8, Sweetness: 5, Creaminess: 9. I’d rate it 9/10 overall!”

Now he has a bunch of cake reviews, each with three features and one rating.

So what does he do?

Logic to Pick the “Nearest Neighbors”

He calculates how similar each review is to his dream cake preferences.

  1. For every review, he measures how close its description is to his desired cake.
    1. Think of each cake as a point in 3D space: (softness, sweetness, creaminess).
    2. He compares the distance between his dream cake and each reviewed cake.
  2. The ones closest to his preferences (i.e., the shortest distance in that 3D space) are his nearest neighbors.
  3. He picks, say, the 3 nearest reviews (K=3). These are his trusted cake tasters!
  4. He then averages their ratings to guess: “A cake with my taste is likely to be around 8.7/10!”

This is exactly what KNN Regression does.
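
Here is a minimal sketch of the cake story as KNN Regression. Only the SweetTooth review (8, 5, 9, rated 9/10) comes from the story; the other reviews are made-up placeholders just to have something to search through:

# Each review: ([softness, sweetness, creaminess], overall rating out of 10)
reviews = [
    ([8, 5, 9], 9.0),   # SweetTooth Bakery (from the story)
    ([9, 3, 8], 8.5),   # hypothetical review
    ([7, 4, 7], 8.0),   # hypothetical review
    ([5, 9, 3], 6.0),   # hypothetical review
    ([4, 8, 2], 5.5)    # hypothetical review
]
dream_cake = [9, 4, 8]  # his preferences: softness 9, sweetness 4, creaminess 8

def euclidean_distance(vec1, vec2):
    return sum((a - b) ** 2 for a, b in zip(vec1, vec2)) ** 0.5

K = 3
nearest = sorted((euclidean_distance(f, dream_cake), rating) for f, rating in reviews)[:K]
print(sum(r for _, r in nearest) / K)  # 8.5 with this toy data, close to the story's "around 8.7/10"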

So What Is “Distance”?

Distance is a mathematical way of saying “how similar”. The most common way is the Euclidean distance – like drawing a straight line between two points in space.

For 3 features, it’s:

Distance = √[(s1 – s2)^2 + (w1 – w2)^2 + (c1 – c2)^2]

Where:

  • s = softness
  • w = sweetness
  • c = creaminess
  • (1 = his taste, 2 = a bakery’s cake)

The smaller the distance, the closer the match.
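
For example, comparing his dream cake (softness 9, sweetness 4, creaminess 8) with the SweetTooth cake from the review above (8, 5, 9):

Distance = √[(9 – 8)^2 + (4 – 5)^2 + (8 – 9)^2] = √(1 + 1 + 1) = √3 ≈ 1.73

That is a very small distance, so the SweetTooth review is a strong candidate to be one of his nearest neighbors.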

What is Euclidean Distance?

It’s the same distance we use in KNN when we want to know: “How close is this data point to another?”

Story: Two Friends in a Park

Imagine two friends, Amit and Rahul, walking in a large rectangular park.

  • The park has a grid with bench numbers labeled like coordinates.
  • Amit is sitting at Bench (3, 4).
  • Rahul is sitting at Bench (7, 1).

Now, Amit calls Rahul and says: “Hey, how far are you from me? Let’s meet in the middle!”

But Rahul doesn’t want to walk around the paths — he wants to know the shortest straight-line distance, “as the crow flies”.

How do they calculate the shortest path?

They use the idea of a right-angled triangle!

  1. The difference in the horizontal direction (x-axis):
    • 7 − 3 = 4
  2. The difference in the vertical direction (y-axis):
    • 4 − 1 = 3
  3. Now, imagine a triangle with:
    • Base = 4 units
    • Height = 3 units
  4. The straight-line distance between them is the hypotenuse.

Using Pythagoras’ theorem:

Distance = √[(4)^2 + (3)^2] = √(16 + 9) = √25 = 5
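
The same calculation in Python, using only the standard library (math.dist computes exactly this straight-line distance between two points):

import math

amit = (3, 4)
rahul = (7, 1)
print(math.dist(amit, rahul))    # 5.0, the straight-line distance between the two benches
print(math.hypot(7 - 3, 4 - 1))  # 5.0, the same Pythagoras calculation as above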

KNN Regression – Dataset Suitability Checklist