KNN Regression example with Simple Python
1. We’ll simulate the house price prediction example.
Problem Setup:
We already know the prices of 5 houses along with their sizes (in square feet). We’ll predict the price of a new house (e.g., 1400 sqft) using its K=3 nearest neighbors.
# Sample known data: (House size, Price)
known_data = [
    (1000, 50),  # 1000 sqft, ₹50 lakhs
    (1500, 65),  # 1500 sqft, ₹65 lakhs
    (1200, 54),  # 1200 sqft, ₹54 lakhs
    (1800, 75),  # 1800 sqft, ₹75 lakhs
    (1100, 52)   # 1100 sqft, ₹52 lakhs
]

# Step 1: Define the new house size we want to predict
new_house_size = 1400

# Step 2: Compute distances from the new house to all known houses
def compute_distance(house_size1, house_size2):
    return abs(house_size1 - house_size2)  # Since it's 1D, just take the absolute difference

# Step 3: Get distances with data
distances = []
for size, price in known_data:
    dist = compute_distance(size, new_house_size)
    distances.append((dist, price))

# Step 4: Sort by distance
distances.sort(key=lambda x: x[0])

# Step 5: Take the top K neighbors
K = 3
top_k = distances[:K]

# Step 6: Predict price by averaging their prices
total_price = sum(price for _, price in top_k)
predicted_price = total_price / K

# Output
print(f"Predicted price for house size {new_house_size} sqft (K={K}) is: ₹{predicted_price} lakhs")
Output Example:
Predicted price for house size 1400 sqft (K=3) is: ₹57.0 lakhs
What’s Happening Behind the Scenes:
- Distance is measured (here, just |known_size – new_size|).
- Top 3 most similar houses are picked.
- Their prices are averaged for the prediction.
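To make this concrete with the data above: the distances to the 1400 sqft house are 400, 100, 200, 400 and 300. The three nearest houses are the 1500 sqft (₹65 lakhs), 1200 sqft (₹54 lakhs) and 1100 sqft (₹52 lakhs) ones, so the prediction is (65 + 54 + 52) / 3 = ₹57 lakhs.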
2. Extending the previous simulation to multi-feature KNN Regression
Problem:
We have past house data with 3 features:
- Size (sqft)
- Number of bedrooms
- Location score (say, 1–10 where 10 = best locality)
Our goal is to predict the price for a new house using KNN Regression with these 3 features and K = 3.
Sample Data (each tuple: [size, bedrooms, location_score], price)
# Past house data
known_data = [
    ([1000, 2, 6], 50),  # ₹50 lakhs
    ([1500, 3, 9], 65),  # ₹65 lakhs
    ([1200, 2, 5], 54),  # ₹54 lakhs
    ([1800, 4, 8], 75),  # ₹75 lakhs
    ([1100, 2, 4], 52)   # ₹52 lakhs
]

# New house to predict: [size, bedrooms, location_score]
new_house = [1400, 3, 7]

# Step 1: Define distance function for multi-feature data (Euclidean distance)
def euclidean_distance(vec1, vec2):
    return sum((a - b) ** 2 for a, b in zip(vec1, vec2)) ** 0.5

# Step 2: Compute distances
distances = []
for features, price in known_data:
    dist = euclidean_distance(features, new_house)
    distances.append((dist, price))

# Step 3: Sort by distance
distances.sort(key=lambda x: x[0])

# Step 4: Take top K
K = 3
top_k = distances[:K]

# Step 5: Predict price by averaging top K prices
total_price = sum(price for _, price in top_k)
predicted_price = total_price / K

# Output
print(f"Predicted price for house {new_house} (K={K}): ₹{predicted_price} lakhs")
Example Output:
Predicted price for house [1400, 3, 7] (K=3): ₹57.0 lakhs
What You Just Did:
- We treated each house as a vector in 3D space.
- We measured closeness using Euclidean distance.
- We picked K=3 closest houses.
- We averaged their prices to get the prediction.
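If you want to sanity-check the hand-rolled version, scikit-learn ships a ready-made KNeighborsRegressor that follows the same recipe (Euclidean distance, K neighbors, plain average). A minimal sketch, assuming scikit-learn is installed, could look like this:

# Minimal cross-check using scikit-learn (assumption: scikit-learn is installed).
# KNeighborsRegressor uses Euclidean distance by default and averages the K neighbors' targets,
# mirroring the hand-rolled logic above.
from sklearn.neighbors import KNeighborsRegressor

X = [[1000, 2, 6], [1500, 3, 9], [1200, 2, 5], [1800, 4, 8], [1100, 2, 4]]
y = [50, 65, 54, 75, 52]

model = KNeighborsRegressor(n_neighbors=3)
model.fit(X, y)

print(model.predict([[1400, 3, 7]]))  # same three neighbors, same averaged price as above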
3. Enhancing our KNN Regression with two important upgrades:
What We’ll Add:
- Feature Normalization: because “Size” (1000–2000) is on a very different scale than “Bedrooms” (1–5) or “Location score” (1–10).
- Feature Weights: maybe Size is more important than Location score, so we can give it a higher weight during the distance calculation.
Enhanced KNN Regression (with Normalization + Weights)
# Sample data: [Size, Bedrooms, Location Score], Price in Lakhs
known_data = [
    ([1000, 2, 6], 50),
    ([1500, 3, 9], 65),
    ([1200, 2, 5], 54),
    ([1800, 4, 8], 75),
    ([1100, 2, 4], 52)
]

new_house = [1400, 3, 7]  # Size, Bedrooms, Location

# Step 1: Normalize features (min-max scaling)
def normalize(data, new_point):
    # Transpose to work column-wise
    features = list(zip(*[d[0] for d in data]))  # each feature column separately
    new_norm = []
    norm_data = []
    for i in range(len(features)):
        col = features[i]
        min_val = min(col)
        max_val = max(col)
        range_val = max_val - min_val if max_val != min_val else 1  # avoid divide by zero

        # Normalize existing data
        norm_col = [(x - min_val) / range_val for x in col]
        for j, val in enumerate(norm_col):
            if i == 0:
                norm_data.append([val])
            else:
                norm_data[j].append(val)

        # Normalize new_point
        new_norm.append((new_point[i] - min_val) / range_val)
    return norm_data, new_norm

# Normalize
normalized_features, normalized_new_house = normalize(known_data, new_house)

# Step 2: Assign weights for features: [Size, Bedrooms, Location]
feature_weights = [0.6, 0.3, 0.1]

# Step 3: Compute weighted Euclidean distances
def weighted_distance(v1, v2, weights):
    return sum(weights[i] * (v1[i] - v2[i]) ** 2 for i in range(len(v1))) ** 0.5

# Combine normalized data with original prices
normalized_data = list(zip(normalized_features, [price for _, price in known_data]))

# Step 4: Calculate distances
distances = []
for features, price in normalized_data:
    dist = weighted_distance(features, normalized_new_house, feature_weights)
    distances.append((dist, price))

# Step 5: Sort by distance and get top K
distances.sort(key=lambda x: x[0])
K = 3
top_k = distances[:K]

# Step 6: Predict price
predicted_price = sum(price for _, price in top_k) / K

# Output
print(f"Predicted price for house {new_house} with weights {feature_weights}: ₹{predicted_price:.2f} lakhs")
Example Output:
Predicted price for house [1400, 3, 7] with weights [0.6, 0.3, 0.1]: ₹57.00 lakhs
Why This Matters:
- Normalization ensures fair comparison between features.
- Weights let us tell the algorithm which features we trust more.
- E.g., Size = 60% influence, Bedrooms = 30%, Location = 10%
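For example, with min-max scaling on the data above, the new house’s size becomes (1400 − 1000) / (1800 − 1000) = 0.5, its bedroom count becomes (3 − 2) / (4 − 2) = 0.5, and its location score becomes (7 − 4) / (9 − 4) = 0.6, so all three features sit on the same 0–1 scale before the weights are applied.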
4. The logic of picking nearest neighbors in KNN Regression with a real-world story
Story: The Cake Lover’s Neighborhood
Imagine someone new in town, and today is his birthday. He is craving cake, but he wants the best one. He asks: “Where can I find a good cake, just like the one I love: soft, not too sweet, with lots of cream?”
He doesn’t want random suggestions, so he goes to the local community board and posts his cake preferences:
- Softness: 9/10
- Sweetness: 4/10
- Creaminess: 8/10
Now, people from the area who’ve had cakes before reply with their experience of local bakeries.
Each person’s review looks like: “At SweetTooth Bakery, I had a cake with Softness: 8, Sweetness: 5, Creaminess: 9. I’d rate it 9/10 overall!”
Now he has a bunch of cake reviews, each with three features and one rating.
So what does he do?
Logic to Pick the “Nearest Neighbors”
He calculates how similar each review is to his dream cake preferences.
- For every review, he measures how close their description is to his desired cake.
- Think of each cake as a point in 3D space: (softness, sweetness, creaminess).
- He compares the distance between his dream cake and each reviewed cake.
- The ones closest to his preferences (i.e., shortest distance in that 3D space) are his nearest neighbors.
- He picks, say, the 3 nearest reviews (K=3). These are his trusted cake tasters!
- He then averages their ratings to guess: “A cake with my taste is likely to be around 8.7/10!”
This is exactly what KNN Regression does.
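As a rough sketch of the same logic in Python (the bakery reviews below are made up purely for illustration; only the SweetTooth numbers come from the story):

# Illustrative only: made-up reviews as ([softness, sweetness, creaminess], overall rating)
reviews = [
    ([8, 5, 9], 9.0),   # the SweetTooth Bakery review from the story
    ([9, 3, 8], 8.5),   # hypothetical
    ([6, 8, 4], 6.0),   # hypothetical
    ([9, 4, 7], 8.5),   # hypothetical
    ([5, 9, 3], 5.5),   # hypothetical
]

dream_cake = [9, 4, 8]  # his preferences: softness, sweetness, creaminess

def euclidean_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Rank reviews by how close each cake is to the dream cake, then average the top K ratings
K = 3
nearest = sorted(reviews, key=lambda r: euclidean_distance(r[0], dream_cake))[:K]
predicted_rating = sum(rating for _, rating in nearest) / K
print(f"Expected rating for his kind of cake: {predicted_rating:.1f}/10")  # 8.7/10 with these made-up numbers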
So What Is “Distance”?
Distance is a mathematical way of saying “how similar”. The most common way is the Euclidean distance: like drawing a straight line between two points in space.
For 3 features, it’s:
Distance = √[(s1 − s2)^2 + (w1 − w2)^2 + (c1 − c2)^2]
Where:
- s = softness
- w = sweetness
- c = creaminess
- (1 = his taste, 2 = a bakery’s cake)
The smaller the distance, the closer the match.
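Plugging in the numbers from the story: the distance between his dream cake (softness 9, sweetness 4, creaminess 8) and the SweetTooth cake (8, 5, 9) is √[(9 − 8)^2 + (4 − 5)^2 + (8 − 9)^2] = √3 ≈ 1.73, which makes it a very close match.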
What is Euclidean Distance?
It’s the same distance we use in KNN when we want to know: “How close is this data point to another?”
Story: Two Friends in a Park
Imagine two friends, Amit and Rahul, are walking in a large rectangular park.
- The park has a grid with bench numbers labeled like coordinates.
- Amit is sitting at Bench (3, 4).
- Rahul is sitting at Bench (7, 1).
Now, Amit calls Rahul and says: “Hey, how far are you from me? Let’s meet in the middle!”
But Rahul doesn’t want to walk around the paths — he wants to know the shortest straight-line distance, “as the crow flies”.
How do they calculate the shortest path?
They use the idea of a right-angled triangle!
- The difference in the horizontal direction (x-axis):
- 7 − 3 = 4
- The difference in the vertical direction (y-axis):
- 4 − 1 = 3
- Now, imagine a triangle with:
- Base = 4 units
- Height = 3 units
- The straight-line between them is the hypotenuse.
Using Pythagoras’ theorem:
Distance = √(4^2 + 3^2) = √(16 + 9) = √25 = 5
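As a quick check, Python’s standard library (math.dist, available in Python 3.8+) computes the same straight-line distance:

import math

# Euclidean distance between Amit's bench (3, 4) and Rahul's bench (7, 1)
print(math.dist((3, 4), (7, 1)))  # 5.0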
KNN Regression – Dataset Suitability Checklist