Lasso Regression Example with Simple Python

1. Goal:

Alex wants to predict how much a house will sell for, based on things like:

  • Size
  • Number of rooms
  • Distance from the city
  • Garden size
  • Fancy kitchen features
  • … and 50 more things

But Alex doesn’t know which ones actually matter.

Part 1: The Backpack Analogy

Alex is going on a hike. He has 50 items to pack, but can only carry 10.
Each item adds weight, and only some of them are actually useful.

He thinks: “Let me try packing everything, then slowly remove whatever doesn’t help me survive.”

This is Lasso Regression — we try all features (variables), but penalize the unhelpful ones and eventually drop them.

Part 2: How Lasso Works

Lasso helps Alex:

  • Start with all features (carry everything)
  • Check each one’s usefulness in predicting house price
  • Punish useless features by adding a cost (a “penalty”)
  • If the feature is not helpful, its weight (importance) shrinks toward zero
  • If it becomes zero, it’s like throwing the item out of the bag

So, Lasso = Linear Prediction + Penalty for Carrying Extra Items

2. Math (in Simple Words)

Let’s say Alex predicts the house price using:

    price = w₁ ⋅ size + w₂ ⋅ rooms + w₃ ⋅ garden + b

Where w₁, w₂, and w₃ are weights showing how important each factor is. Without regularization, Alex just finds the weights that best match the data. But with Lasso, we say:

“Don’t just match the data. Make sure you’re not carrying useless things.”

So we add a penalty to the formula:

Loss = Error + λ (|w₁| + |w₂| + |w₃| + …)

  • Error → How wrong the prediction is
  • λ (lambda) → How strict Alex is about dropping useless items
  • |w| → Absolute value of the weight (penalty added for just using it)
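
To make the formula above concrete, here is a minimal Python sketch of that loss (the function name and the example numbers are only for illustration, not part of a real fitting routine):

# Lasso loss = prediction error (MSE) + lambda * sum of absolute weights
def lasso_loss(y_true, y_pred, weights, lam):
    n = len(y_true)
    mse = sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / n
    l1_penalty = lam * sum(abs(w) for w in weights)
    return mse + l1_penalty

# Two hypothetical models with the same prediction error:
# the one carrying more / larger weights pays a bigger penalty
print(lasso_loss([500, 700], [510, 690], weights=[0.4, 12.0], lam=1.0))       # 112.4
print(lasso_loss([500, 700], [510, 690], weights=[0.4, 12.0, 9.0], lam=1.0))  # 121.4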

3. Why Each Step Matters

Step | What Happens | Why It’s Done
Start with all features | Alex uses every detail to guess house prices | Assumes all information might help
Measure prediction error | Compares predicted vs. actual prices | To learn from mistakes
Add penalty for using too many features | Charges a fee for every feature | Forces Alex to be efficient
Shrink small weights | If a feature is not useful, its weight goes down | Eventually gets dropped
Some weights become zero | Useless features are completely ignored | Only key items are kept in the backpack

4. Real-Life Use Case: House Price Prediction

Initial model: Uses 50+ features

After Lasso:

Keeps only the useful ones like:

  • Square footage
  • Location rating
  • Year built

Drops noisy ones like:

  • Number of plants in garden
  • Distance to donut shop

Why? Because those dropped features don’t consistently help in making predictions across different houses.

5. Second Use Case: Disease Risk Detection

Imagine predicting if someone will get diabetes using 100 health indicators.

Lasso finds:

  • Glucose level, BMI, age → very predictive
  • Eye color, number of siblings → not predictive

Lasso keeps the important ones and discards the noise.
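
Here is an optional sketch of that idea on purely synthetic data (no real medical data; it assumes NumPy and scikit-learn are installed, and that only the first three of 100 made-up indicators actually influence the target):

# Synthetic illustration: 100 "health indicators", only 3 of them truly matter
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))        # 200 synthetic patients, 100 indicators
true_coef = np.zeros(100)
true_coef[:3] = [5.0, 3.0, 2.0]        # pretend only indicators 0-2 are predictive
y = X @ true_coef + rng.normal(scale=0.5, size=200)

model = Lasso(alpha=0.1).fit(X, y)
print("Indicators kept (non-zero weights):", int(np.sum(model.coef_ != 0)))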

6. Why some features are useful and others are not (with a real-world example)

Real-Life Scenario: Predicting House Price

Alex is trying to predict the price of a house. Suppose he has 3 features (variables):

  • Area (sq.ft) → likely to impact price
  • Number of Rooms → also likely to matter
  • Number of Plants in the Garden → probably not useful

And he has some data:

Area (x₁) | Rooms (x₂) | Plants (x₃) | Price (y)
1000 | 3 | 20 | 500,000
1500 | 4 | 25 | 700,000
1200 | 3 | 30 | 550,000
1800 | 5 | 18 | 800,000

7. Step-by-Step: Without Regularization (Normal Linear Regression)

We try to fit a line (or plane) like:

Predicted Price = w₁ ⋅ Area + w₂ ⋅ Rooms + w₃ ⋅ Plants + b

We find weights w₁, w₂, w₃ to minimize error (difference between predicted and actual price). This is done using Mean Squared Error (MSE):

MSE = (1/n) Σ (Predicted − Actual)²

So far, every variable tries to reduce the error — even if it’s only doing a little.
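
As a quick sketch, the MSE for the table above can be computed with an arbitrary, made-up set of weights (these values are not fitted; they only show the mechanics):

# Data from the table above: [Area, Rooms, Plants] and price in dollars
X = [[1000, 3, 20], [1500, 4, 25], [1200, 3, 30], [1800, 5, 18]]
y = [500_000, 700_000, 550_000, 800_000]

# Made-up weights, purely to illustrate the calculation
w1, w2, w3, b = 400.0, 20_000.0, 100.0, 0.0

mse = sum(
    ((w1 * x1 + w2 * x2 + w3 * x3 + b) - price) ** 2
    for (x1, x2, x3), price in zip(X, y)
) / len(y)
print(f"MSE with these made-up weights: {mse:,.0f}")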

Problem?
Even a useless feature like “plants” might slightly reduce the error. But that doesn’t mean it’s actually important.

Step-by-Step: Now Add Lasso Regularization

Lasso modifies the loss function to:

Loss = MSE + λ (|w₁| + |w₂| + |w₃|)

Now it’s not just about minimizing error — we’re penalizing each feature.

Now Let’s Compare Features

Let’s say after training:

  • w₁ = 300 → Area (impact is strong)
  • w₂ = 15,000 → Rooms (also strong)
  • w₃ = 5 → Plants (very small impact)

Insight:

  • Area and Rooms are needed to reduce error substantially.
  • Plants only reduced the error slightly — but it adds to the penalty.

Now, the optimizer (Lasso) thinks:

“The plants feature isn’t helping enough to justify the penalty. Let me set w₃ = 0 and drop it.”
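
A rough numeric sketch of that decision, using the illustrative weights above and made-up error and λ values (none of these numbers come from a real fit):

lam = 1_000                  # made-up penalty strength
mse_keep_plants = 1_000_000  # pretend error when w3 = 5 (plants kept)
mse_drop_plants = 1_000_400  # pretend error when w3 = 0 (fit is barely worse)

loss_keep = mse_keep_plants + lam * (abs(300) + abs(15_000) + abs(5))
loss_drop = mse_drop_plants + lam * (abs(300) + abs(15_000) + abs(0))

print(loss_keep, loss_drop)  # 16305000 16300400 -> dropping plants gives the lower total loss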

8. Visual Interpretation:

Feature | Contribution to Prediction | Lasso Penalty | Worth Keeping?
Area | High | Medium | Yes
Rooms | High | Medium | Yes
Plants in Garden | Low | Still adds a penalty | No

So, Lasso forces a trade-off:
“Only keep a feature if it helps a lot — enough to outweigh the cost.”

Real-Life Explanation:

Imagine you’re paying rent for each feature you use.

  • Area and Rooms give big returns → pay the rent.
  • Plants give very little → not worth keeping.

9. Final Prediction Model:

After Lasso, the model becomes:

Price = 300 ⋅ Area + 15,000 ⋅ Rooms + 0 ⋅ Plants + b

Plants are completely eliminated — and this simplifies the model.
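
As a small sketch, the pruned model can be written as a plain function (the weights are the illustrative values above; the intercept b is just a placeholder):

# Pruned model: the plants feature no longer appears in the prediction at all
def predict_price(area, rooms, b=0.0):
    # illustrative weights from the walkthrough above; b is a placeholder intercept
    return 300 * area + 15_000 * rooms + b

print(predict_price(area=1200, rooms=3))  # 405000.0 before adding any intercept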

10. Predict house prices using:

  • Area (sq.ft)
  • Rooms
  • Plants in garden (intentionally noisy)

Dataset

# [Area, Rooms, Plants] → Features
X = [
    [1000, 3, 20],
    [1500, 4, 25],
    [1200, 3, 30],
    [1800, 5, 18]
]

# Corresponding house prices (in $1,000s)
y = [500, 700, 550, 800]

Step-by-step Lasso Logic in Python (from scratch)

# Initialize weights and bias
w = [0.0, 0.0, 0.0]  # w1: area, w2: rooms, w3: plants
b = 0.0

alpha = 0.000001  # learning rate (kept small because the features are not scaled)
lambda_ = 0.1     # L1 penalty strength
epochs = 1000

n = len(X)

for epoch in range(epochs):
    dw = [0.0, 0.0, 0.0]
    db = 0.0

    # Compute gradients
    for i in range(n):
        x1, x2, x3 = X[i]
        y_pred = w[0]*x1 + w[1]*x2 + w[2]*x3 + b
        error = y_pred - y[i]

        dw[0] += error * x1
        dw[1] += error * x2
        dw[2] += error * x3
        db += error

    # Average gradients
    dw = [d / n for d in dw]
    db /= n

    # Add L1 penalty to gradients
    for j in range(3):
        if w[j] > 0:
            dw[j] += lambda_
        elif w[j] < 0:
            dw[j] -= lambda_
        # if w[j] == 0 → no penalty change

    # Update weights and bias
    for j in range(3):
        w[j] -= alpha * dw[j]
    b -= alpha * db

    if epoch % 100 == 0:
        print(f"Epoch {epoch}: Weights = {w}, Bias = {b:.2f}")

print("\n Final Model:")
print(f"Price = {w[0]:.2f} * Area + {w[1]:.2f} * Rooms + {w[2]:.2f} * Plants + {b:.2f}")

What We’ll Observe

  • w[0] (Area): will become the largest, since area has the biggest impact on price
  • w[1] (Rooms): will also grow, though it stays numerically small because room counts (3 to 5) are tiny compared to square footage
  • w[2] (Plants): will stay small or end up very close to zero

Sample Output (illustrative; your exact numbers will depend on the learning rate, λ, and number of epochs)

Epoch 0: Weights = [0.28, 0.01, 0.005], Bias = 0.20

Epoch 900: Weights = [0.31, 0.012, 0.0001], Bias = 1.90

Final Model:
Price = 0.31 * Area + 0.012 * Rooms + 0.0001 * Plants + 1.90

→ Notice how plants’ weight is almost 0. That’s Lasso kicking in, realizing it doesn’t add value.

Realization

  • The model learns that Area and Rooms contribute most to reducing error.
  • Plants barely help, but they add a cost (λ × |w|), so it’s better to drop them (shrink to 0).
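
For comparison, here is an optional sketch of the same idea using scikit-learn’s built-in Lasso (the exact coefficients depend on the alpha you pick and on feature scaling, so treat the comments as expectations, not guarantees):

# Optional cross-check with scikit-learn (pip install scikit-learn)
from sklearn.linear_model import Lasso

X = [[1000, 3, 20], [1500, 4, 25], [1200, 3, 30], [1800, 5, 18]]
y = [500, 700, 550, 800]  # prices in $1,000s, as in the from-scratch version

model = Lasso(alpha=1.0, max_iter=10_000)  # alpha plays the role of lambda
model.fit(X, y)

print("Weights:", model.coef_)        # expect the Plants weight to be at or near zero
print("Intercept:", model.intercept_)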

Lasso Regression – Basic Math Concepts