Deep Learning with Simple Python

We are simulating the XOR function (i.e., the neural network learns that 0 XOR 1 = 1, 1 XOR 1 = 0, etc.). XOR is not linearly separable, so:

  • A purely linear model (no hidden layer) cannot solve it at all, and a small shallow network (1 hidden layer) can be fiddly to train on it.
  • A deep neural network (2+ hidden layers) handles it more easily, because each layer adds another non-linear combination of features.

1. Activation Function: Sigmoid

import math

def sigmoid(x):
    # Squashes any real number into the range (0, 1)
    return 1 / (1 + math.exp(-x))

def sigmoid_derivative(x):
    # Derivative of sigmoid, written in terms of sigmoid itself
    sx = sigmoid(x)
    return sx * (1 - sx)

Explanation:

  • sigmoid(x) maps input into (0, 1) range → creates non-linearity
  • Derivative is needed for backpropagation (learning step)

    In deep learning, non-linearity + multiple layers help learn complex patterns like XOR.
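
A few sample values make the squashing behaviour concrete (a quick sketch using the two functions above):

for x in (-4, -1, 0, 1, 4):
    # Output stays strictly between 0 and 1; the derivative peaks at x = 0 (value 0.25)
    print(x, round(sigmoid(x), 3), round(sigmoid_derivative(x), 3))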

2. The Data: XOR Truth Table

data = [
    ([0, 0], 0),
    ([0, 1], 1),
    ([1, 0], 1),
    ([1, 1], 0)
]

This is the classic XOR problem: no purely linear model can separate these four points, which is exactly why hidden layers and non-linear activations are needed; a quick check is sketched below.
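
The brute-force sketch below (an illustration, not part of the original program) searches a coarse grid of linear rules over the data above and finds none that classifies all four rows correctly:

# Try every linear rule of the form  w1*x1 + w2*x2 + b > 0  on a coarse grid
steps = [i / 4 for i in range(-8, 9)]   # candidate values from -2.0 to 2.0
found = any(
    all((w1 * x[0] + w2 * x[1] + b > 0) == bool(y) for x, y in data)
    for w1 in steps for w2 in steps for b in steps
)
print("linear separator found:", found)  # prints False: XOR is not linearly separable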

3. Layer Initialization

import random

def init_layer(input_size, output_size):
    # One weight row per output neuron, plus one bias per output neuron
    weights = [[random.uniform(-1, 1) for _ in range(input_size)] for _ in range(output_size)]
    biases = [random.uniform(-1, 1) for _ in range(output_size)]
    return weights, biases

Explanation:

  • Randomly initializes weights and biases
  • For example, from input (2 values) → hidden1 (3 neurons): you’ll get a matrix of size (3×2)
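
A quick way to confirm those shapes (a small sketch that only inspects the returned lists):

# Weights from a 2-value input to a 3-neuron hidden layer
w, b = init_layer(2, 3)
print(len(w), len(w[0]))  # 3 rows × 2 columns: one weight row per hidden neuron
print(len(b))             # 3 biases: one per hidden neuron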

4. Forward Pass Function

def dot_product(weights, inputs, bias):
    # Weighted sum plus bias for each neuron (one weight row per neuron)
    return [sum(w * i for w, i in zip(weight, inputs)) + b
            for weight, b in zip(weights, bias)]

def forward(inputs, weights1, bias1, weights2, bias2, weights3, bias3):
    z1 = dot_product(weights1, inputs, bias1)
    a1 = [sigmoid(x) for x in z1]

    z2 = dot_product(weights2, a1, bias2)
    a2 = [sigmoid(x) for x in z2]

    z3 = dot_product(weights3, a2, bias3)
    a3 = [sigmoid(x) for x in z3]

    return z1, a1, z2, a2, z3, a3

Explanation:

  • Inputs flow through 3 layers:
    • Layer 1: input → hidden1 (3 neurons)
    • Layer 2: hidden1 → hidden2 (3 neurons)
    • Layer 3: hidden2 → output (1 neuron)
  • z = weighted sum + bias
  • a = activation output (after sigmoid)
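
A minimal check of the forward pass on one untrained network (a sketch; the exact numbers depend on the random initialization):

w1, b1 = init_layer(2, 3)
w2, b2 = init_layer(3, 3)
w3, b3 = init_layer(3, 1)
z1, a1, z2, a2, z3, a3 = forward([0, 1], w1, b1, w2, b2, w3, b3)
print(len(a1), len(a2), len(a3))  # 3, 3, 1 activations per layer
print(round(a3[0], 3))            # untrained output, typically somewhere near 0.5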

5. Backpropagation and Training Loop

def train(epochs=10000, lr=0.5):
    w1, b1 = init_layer(2, 3)
    w2, b2 = init_layer(3, 3)
    w3, b3 = init_layer(3, 1)

Explanation:

  • We create three layers of weights and biases: input→hidden1, hidden1→hidden2, hidden2→output

6. Training Over Epochs

for epoch in range(epochs):
    total_error = 0
    for x, y in data:
        z1, a1, z2, a2, z3, a3 = forward(x, w1, b1, w2, b2, w3, b3)

For every epoch (iteration), we go through all examples and calculate:

  • z1, a1: first layer
  • z2, a2: second layer
  • z3, a3: final output

7. Error and Delta Computation

        error = y - a3[0]
        delta3 = error * sigmoid_derivative(z3[0])

        delta2 = [delta3 * w3[0][i] * sigmoid_derivative(z2[i]) for i in range(3)]  # w3 has a single row of 3 weights

        delta1 = [sum(delta2[j] * w2[j][i] for j in range(3)) * sigmoid_derivative(z1[i]) for i in range(3)]

Explanation:

  • Calculates error between prediction and actual
  • delta3: How wrong the output neuron is
  • delta2: How much each hidden2 neuron contributed to error
  • delta1: Same logic for hidden1 neurons

    This is classic backpropagation: the chain rule is applied one layer at a time, from the output back to the input.
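
The chain rule can be sanity-checked numerically. The sketch below assumes the squared-error loss E = 0.5 × (y − a3)² that the update rule in the next step implies, and compares delta3 with a finite-difference estimate of −dE/dz3:

z3_val, y_val, eps = 0.3, 1.0, 1e-6
analytic = (y_val - sigmoid(z3_val)) * sigmoid_derivative(z3_val)  # same form as delta3
numeric = -(0.5 * (y_val - sigmoid(z3_val + eps)) ** 2
            - 0.5 * (y_val - sigmoid(z3_val - eps)) ** 2) / (2 * eps)
print(round(analytic, 6), round(numeric, 6))  # the two values agree closely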

8. Updating Weights and Biases

# Second hidden layer: weights from hidden1 activations (a1) into hidden2
for i in range(3):
    for j in range(3):
        w2[i][j] += lr * delta2[i] * a1[j]
    b2[i] += lr * delta2[i]

Each weight is updated by: new_weight = old_weight + learning_rate × delta × input activation (and each bias by learning_rate × delta).

The same pattern is applied to the w1 and w3 weights (the w2 case is shown above); see the sketch below.
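
For completeness, here is a sketch of those matching updates for the output layer (w3, b3) and the first hidden layer (w1, b1), following the same pattern and the shapes produced by init_layer:

# Output layer: one neuron with three incoming weights from hidden2
for i in range(3):
    w3[0][i] += lr * delta3 * a2[i]
b3[0] += lr * delta3

# First hidden layer: three neurons, each with two incoming weights from the input x
for i in range(3):
    for j in range(2):
        w1[i][j] += lr * delta1[i] * x[j]
    b1[i] += lr * delta1[i]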

9. Final Prediction

print("\nFinal predictions:")
for x, _ in data:
    _, _, _, _, _, a3 = forward(x, w1, b1, w2, b2, w3, b3)
    print(f"{x} → {round(a3[0], 3)}")

After training, we run each XOR input through the network and print the predicted value.
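
Because the snippets above are fragments, here is one way to put them together into a runnable train() (a sketch, not the author's exact script: the squared-error tracking, the progress print, and returning the learned parameters are assumptions added so the prediction loop above can run):

def train(epochs=10000, lr=0.5):
    w1, b1 = init_layer(2, 3)
    w2, b2 = init_layer(3, 3)
    w3, b3 = init_layer(3, 1)

    for epoch in range(epochs):
        total_error = 0
        for x, y in data:
            z1, a1, z2, a2, z3, a3 = forward(x, w1, b1, w2, b2, w3, b3)

            # Error and deltas (backpropagation)
            error = y - a3[0]
            total_error += error ** 2  # assumption: track squared error per epoch
            delta3 = error * sigmoid_derivative(z3[0])
            delta2 = [delta3 * w3[0][i] * sigmoid_derivative(z2[i]) for i in range(3)]
            delta1 = [sum(delta2[j] * w2[j][i] for j in range(3))
                      * sigmoid_derivative(z1[i]) for i in range(3)]

            # Gradient-style updates, layer by layer
            for i in range(3):
                w3[0][i] += lr * delta3 * a2[i]
            b3[0] += lr * delta3
            for i in range(3):
                for j in range(3):
                    w2[i][j] += lr * delta2[i] * a1[j]
                b2[i] += lr * delta2[i]
            for i in range(3):
                for j in range(2):
                    w1[i][j] += lr * delta1[i] * x[j]
                b1[i] += lr * delta1[i]

        if epoch % 1000 == 0:
            print(f"epoch {epoch}, total error {total_error:.4f}")  # assumed progress log

    return w1, b1, w2, b2, w3, b3

w1, b1, w2, b2, w3, b3 = train()  # makes the learned weights available to the prediction loop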

10. What If We Use a Shallow Neural Network (1 Hidden Layer)?

If we change the code to remove one hidden layer, like this:

# Only one hidden layer
w1, b1 = init_layer(2, 3)
w2, b2 = init_layer(3, 1)

# forward pass would skip z2, a2

We will observe:

  • Training becomes noticeably less reliable: depending on the random initialization and learning rate, the outputs may stay stuck near 0.5 for some inputs
  • A single small hidden layer can still represent XOR in principle, but it gets only one chance to build the non-linear intermediate representation that “flips” the pattern
  • With no hidden layer at all (a purely linear model), XOR cannot be learned no matter how long we train

    (A sketch of the single-hidden-layer forward pass follows.)
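
Concretely, a minimal sketch of that forward pass, reusing dot_product and sigmoid from earlier; backpropagation would then need only two deltas (output and hidden):

def forward_shallow(inputs, weights1, bias1, weights2, bias2):
    # Input → single hidden layer → output
    z1 = dot_product(weights1, inputs, bias1)
    a1 = [sigmoid(x) for x in z1]
    z2 = dot_product(weights2, a1, bias2)
    a2 = [sigmoid(x) for x in z2]
    return z1, a1, z2, a2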

11. Conceptual Summary: Deep vs Shallow for XOR

Point              | Shallow NN                          | Deep NN
Layers             | 1 hidden layer                      | 2+ hidden layers
Can solve XOR?     | Yes, but training can be unreliable | Reliably
Feature extraction | Weak                                | Hierarchical
Real-world use     | Simple, low-level pattern tasks     | All modern AI (e.g., GPT, image models)

12. Real-Life Analogy: Mail Sorting in a Post Office

Imagine we’re running a modern post office and we want to separate spam emails from real emails:

Shallow Learning: One-Layer Classifier

We hire an assistant who looks at only the subject line of an email.

He’s trained like this:

  • If email contains the word “lottery”, mark as spam
  • If email contains “urgent”, mark as spam
  • If email contains “meeting”, mark as real

But…

What goes wrong?

  • “Meeting about lottery campaign” → Gets misclassified
  • “URGENT! Team Lunch Today” → Gets marked spam even though it’s valid

Why shallow learning fails:

  • Only works on simple rules
  • Can’t understand context or deeper relationships
  • Misses patterns in word combinations, email sender, tone, etc.

Deep Learning: Multi-Layer System

Now, we upgrade the post office with:

  • Layer 1: Extract keywords and sender info
  • Layer 2: Analyze tone, frequency of words, punctuation
  • Layer 3: Determine meaning and intent (e.g., sarcasm, urgency)
  • Final layer: Decide if it’s spam or not

Now it can correctly:

  • Understand “lottery meeting for marketing” is not spam
  • Detect “Hello friend, claim your prize” as spam due to pattern
  • Learn from examples, not just rules

Now Let’s Map This to Code

We’ll simulate a very simplified version of spam detection, where we use a deep network to learn patterns across 2 features:

  • contains_offer (0 or 1)
  • is_from_known_sender (0 or 1)

Dataset (Real-World Like):

# [contains_offer, is_from_known_sender] → is_spam
data = [
    ([1, 0], 1),  # Offer from unknown sender → spam
    ([0, 1], 0),  # No offer from known sender → not spam
    ([1, 1], 0),  # Offer from known sender → likely not spam
    ([0, 0], 1),  # No offer, unknown sender → suspicious → spam
]
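
A minimal way to reuse the pipeline on this data (a sketch, assuming the train() variant sketched earlier, which reads the global data list and returns its learned parameters):

# Retrain the same 3-layer network on the spam-like dataset defined above
w1, b1, w2, b2, w3, b3 = train()
for x, _ in data:
    _, _, _, _, _, a3 = forward(x, w1, b1, w2, b2, w3, b3)
    print(f"{x} → spam score ≈ {round(a3[0], 3)}")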

Real Interpretation of Each Layer:

  • Layer 1: Finds basic features (e.g., offer, known sender)
  • Layer 2: Combines features (e.g., “Offer + Known Sender”)
  • Layer 3: Learns spam vs non-spam from combinations

What Happens Without Deep Layers?

With only one layer:

  • The network might say:
    • Offer = spam (blindly)
    • Known sender = not spam (blindly)

→ It can’t learn combination logic like: “An offer from a known sender is not necessarily spam.”

Deep Learning Helps Because:

  • It stacks multiple decision points
  • Allows the network to build abstract ideas, like:
    • “trustworthiness”
    • “intended tone”
    • “frequency of spammy words”

Deep Learning with Neural Networks – Basic Math Concepts