Deep Learning with Simple Python

We are simulating the XOR function (i.e., the neural network learns that 0 XOR 1 = 1, 1 XOR 1 = 0, etc.). XOR is not linearly separable, so:

  • A purely linear model (no hidden layer) cannot solve it at all, and a small shallow network (1 hidden layer) can be fiddly to train on it.
  • A deep neural network (2+ hidden layers) handles it more easily, because each layer adds another non-linear combination of features.

1. Activation Function: Sigmoid

import math

def sigmoid(x):
    # Squashes any real number into the range (0, 1)
    return 1 / (1 + math.exp(-x))

def sigmoid_derivative(x):
    # Derivative of sigmoid, written in terms of sigmoid itself
    sx = sigmoid(x)
    return sx * (1 - sx)

Explanation:

  • sigmoid(x) maps input into (0, 1) range → creates non-linearity
  • Derivative is needed for backpropagation (learning step)

    In deep learning, non-linearity + multiple layers help learn complex patterns like XOR.
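
A few sample values make the squashing behaviour concrete (a quick sketch using the two functions above):

for x in (-4, -1, 0, 1, 4):
    # Output stays strictly between 0 and 1; the derivative peaks at x = 0 (value 0.25)
    print(x, round(sigmoid(x), 3), round(sigmoid_derivative(x), 3))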

2. The Data: XOR Truth Table

data = [
    ([0, 0], 0),
    ([0, 1], 1),
    ([1, 0], 1),
    ([1, 1], 0)
]

This is the classic XOR problem: no purely linear model can separate these four points, which is exactly why hidden layers and non-linear activations are needed; a quick check is sketched below.
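
The brute-force sketch below (an illustration, not part of the original program) searches a coarse grid of linear rules over the data above and finds none that classifies all four rows correctly:

# Try every linear rule of the form  w1*x1 + w2*x2 + b > 0  on a coarse grid
steps = [i / 4 for i in range(-8, 9)]   # candidate values from -2.0 to 2.0
found = any(
    all((w1 * x[0] + w2 * x[1] + b > 0) == bool(y) for x, y in data)
    for w1 in steps for w2 in steps for b in steps
)
print("linear separator found:", found)  # prints False: XOR is not linearly separable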

3. Layer Initialization

import random

def init_layer(input_size, output_size):
    # One weight row per output neuron, plus one bias per output neuron
    weights = [[random.uniform(-1, 1) for _ in range(input_size)] for _ in range(output_size)]
    biases = [random.uniform(-1, 1) for _ in range(output_size)]
    return weights, biases

Explanation:

  • Randomly initializes weights and biases
  • For example, from input (2 values) → hidden1 (3 neurons): you’ll get a matrix of size (3×2)
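
A quick way to confirm those shapes (a small sketch that only inspects the returned lists):

# Weights from a 2-value input to a 3-neuron hidden layer
w, b = init_layer(2, 3)
print(len(w), len(w[0]))  # 3 rows × 2 columns: one weight row per hidden neuron
print(len(b))             # 3 biases: one per hidden neuron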

4. Forward Pass Function

def dot_product(weights, inputs, bias):
    # Weighted sum plus bias for each neuron (one weight row per neuron)
    return [sum(w * i for w, i in zip(weight, inputs)) + b
            for weight, b in zip(weights, bias)]

def forward(inputs, weights1, bias1, weights2, bias2, weights3, bias3):
    z1 = dot_product(weights1, inputs, bias1)
    a1 = [sigmoid(x) for x in z1]

    z2 = dot_product(weights2, a1, bias2)
    a2 = [sigmoid(x) for x in z2]

    z3 = dot_product(weights3, a2, bias3)
    a3 = [sigmoid(x) for x in z3]

    return z1, a1, z2, a2, z3, a3

Explanation:

  • Inputs flow through 3 layers:
    • Layer 1: input → hidden1 (3 neurons)
    • Layer 2: hidden1 → hidden2 (3 neurons)
    • Layer 3: hidden2 → output (1 neuron)
  • z = weighted sum + bias
  • a = activation output (after sigmoid)
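
A minimal check of the forward pass on one untrained network (a sketch; the exact numbers depend on the random initialization):

w1, b1 = init_layer(2, 3)
w2, b2 = init_layer(3, 3)
w3, b3 = init_layer(3, 1)
z1, a1, z2, a2, z3, a3 = forward([0, 1], w1, b1, w2, b2, w3, b3)
print(len(a1), len(a2), len(a3))  # 3, 3, 1 activations per layer
print(round(a3[0], 3))            # untrained output, typically somewhere near 0.5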

5. Backpropagation and Training Loop

def train(epochs=10000, lr=0.5):
    w1, b1 = init_layer(2, 3)
    w2, b2 = init_layer(3, 3)
    w3, b3 = init_layer(3, 1)

Explanation:

  • We create three layers of weights and biases: input→hidden1, hidden1→hidden2, hidden2→output

6. Training Over Epochs

for epoch in range(epochs):
    total_error = 0
    for x, y in data:
        z1, a1, z2, a2, z3, a3 = forward(x, w1, b1, w2, b2, w3, b3)

For every epoch (iteration), we go through all examples and calculate:

  • z1, a1: first layer
  • z2, a2: second layer
  • z3, a3: final output

7. Error and Delta Computation

        error = y - a3[0]
        delta3 = error * sigmoid_derivative(z3[0])

        delta2 = [delta3 * w3[0][i] * sigmoid_derivative(z2[i]) for i in range(3)]  # w3 has a single row of 3 weights

        delta1 = [sum(delta2[j] * w2[j][i] for j in range(3)) * sigmoid_derivative(z1[i]) for i in range(3)]

Explanation:

  • Calculates error between prediction and actual
  • delta3: How wrong the output neuron is
  • delta2: How much each hidden2 neuron contributed to error
  • delta1: Same logic for hidden1 neurons

    This is classic backpropagation: the chain rule is applied one layer at a time, from the output back to the input.
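
The chain rule can be sanity-checked numerically. The sketch below assumes the squared-error loss E = 0.5 × (y − a3)² that the update rule in the next step implies, and compares delta3 with a finite-difference estimate of −dE/dz3:

z3_val, y_val, eps = 0.3, 1.0, 1e-6
analytic = (y_val - sigmoid(z3_val)) * sigmoid_derivative(z3_val)  # same form as delta3
numeric = -(0.5 * (y_val - sigmoid(z3_val + eps)) ** 2
            - 0.5 * (y_val - sigmoid(z3_val - eps)) ** 2) / (2 * eps)
print(round(analytic, 6), round(numeric, 6))  # the two values agree closely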

8. Updating Weights and Biases

# Second hidden layer: weights from hidden1 activations (a1) into hidden2
for i in range(3):
    for j in range(3):
        w2[i][j] += lr * delta2[i] * a1[j]
    b2[i] += lr * delta2[i]

Each weight is updated by: new_weight = old_weight + learning_rate × delta × input activation (and each bias by learning_rate × delta).

The same pattern is applied to the w1 and w3 weights (the w2 case is shown above); see the sketch below.
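
For completeness, here is a sketch of those matching updates for the output layer (w3, b3) and the first hidden layer (w1, b1), following the same pattern and the shapes produced by init_layer:

# Output layer: one neuron with three incoming weights from hidden2
for i in range(3):
    w3[0][i] += lr * delta3 * a2[i]
b3[0] += lr * delta3

# First hidden layer: three neurons, each with two incoming weights from the input x
for i in range(3):
    for j in range(2):
        w1[i][j] += lr * delta1[i] * x[j]
    b1[i] += lr * delta1[i]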

9. Final Prediction

print("\nFinal predictions:")
for x, _ in data:
    _, _, _, _, _, a3 = forward(x, w1, b1, w2, b2, w3, b3)
    print(f"{x} → {round(a3[0], 3)}")

After training, we run each XOR input through the network and print the predicted value.
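
Because the snippets above are fragments, here is one way to put them together into a runnable train() (a sketch, not the author's exact script: the squared-error tracking, the progress print, and returning the learned parameters are assumptions added so the prediction loop above can run):

def train(epochs=10000, lr=0.5):
    w1, b1 = init_layer(2, 3)
    w2, b2 = init_layer(3, 3)
    w3, b3 = init_layer(3, 1)

    for epoch in range(epochs):
        total_error = 0
        for x, y in data:
            z1, a1, z2, a2, z3, a3 = forward(x, w1, b1, w2, b2, w3, b3)

            # Error and deltas (backpropagation)
            error = y - a3[0]
            total_error += error ** 2  # assumption: track squared error per epoch
            delta3 = error * sigmoid_derivative(z3[0])
            delta2 = [delta3 * w3[0][i] * sigmoid_derivative(z2[i]) for i in range(3)]
            delta1 = [sum(delta2[j] * w2[j][i] for j in range(3))
                      * sigmoid_derivative(z1[i]) for i in range(3)]

            # Gradient-style updates, layer by layer
            for i in range(3):
                w3[0][i] += lr * delta3 * a2[i]
            b3[0] += lr * delta3
            for i in range(3):
                for j in range(3):
                    w2[i][j] += lr * delta2[i] * a1[j]
                b2[i] += lr * delta2[i]
            for i in range(3):
                for j in range(2):
                    w1[i][j] += lr * delta1[i] * x[j]
                b1[i] += lr * delta1[i]

        if epoch % 1000 == 0:
            print(f"epoch {epoch}, total error {total_error:.4f}")  # assumed progress log

    return w1, b1, w2, b2, w3, b3

w1, b1, w2, b2, w3, b3 = train()  # makes the learned weights available to the prediction loop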

10. What If We Use a Shallow Neural Network (1 Hidden Layer)?

If we change the code to remove one hidden layer, like this:

# Only one hidden layer
w1, b1 = init_layer(2, 3)
w2, b2 = init_layer(3, 1)

# forward pass would skip z2, a2

We will observe:

  • Training becomes noticeably less reliable: depending on the random initialization and learning rate, the outputs may stay stuck near 0.5 for some inputs
  • A single small hidden layer can still represent XOR in principle, but it gets only one chance to build the non-linear intermediate representation that “flips” the pattern
  • With no hidden layer at all (a purely linear model), XOR cannot be learned no matter how long we train

    (A sketch of the single-hidden-layer forward pass follows.)
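
Concretely, a minimal sketch of that forward pass, reusing dot_product and sigmoid from earlier; backpropagation would then need only two deltas (output and hidden):

def forward_shallow(inputs, weights1, bias1, weights2, bias2):
    # Input → single hidden layer → output
    z1 = dot_product(weights1, inputs, bias1)
    a1 = [sigmoid(x) for x in z1]
    z2 = dot_product(weights2, a1, bias2)
    a2 = [sigmoid(x) for x in z2]
    return z1, a1, z2, a2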

11. Conceptual Summary: Deep vs Shallow for XOR

Point              | Shallow NN                          | Deep NN
Layers             | 1 hidden layer                      | 2+ hidden layers
Can solve XOR?     | Yes, but training can be unreliable | Reliably
Feature extraction | Weak                                | Hierarchical
Real-world use     | Simple, low-level pattern tasks     | All modern AI (e.g., GPT, image models)

12. Real-Life Analogy: Mail Sorting in a Post Office

Imagine we’re running a modern post office and we want to separate spam emails from real emails:

Shallow Learning: One-Layer Classifier

We hire an assistant who looks at only the subject line of an email.

He’s trained like this:

  • If email contains the word “lottery”, mark as spam
  • If email contains “urgent”, mark as spam
  • If email contains “meeting”, mark as real

But…

What goes wrong?

  • “Meeting about lottery campaign” → Gets misclassified
  • “URGENT! Team Lunch Today” → Gets marked spam even though it’s valid

Why shallow learning fails:

  • Only works on simple rules
  • Can’t understand context or deeper relationships
  • Misses patterns in word combinations, email sender, tone, etc.

Deep Learning: Multi-Layer System

Now, we upgrade the post office with:

  • Layer 1: Extract keywords and sender info
  • Layer 2: Analyze tone, frequency of words, punctuation
  • Layer 3: Determine meaning and intent (e.g., sarcasm, urgency)
  • Final layer: Decide if it’s spam or not

Now it can correctly:

  • Understand “lottery meeting for marketing” is not spam
  • Detect “Hello friend, claim your prize” as spam due to pattern
  • Learn from examples, not just rules

Now Let’s Map This to Code

We’ll simulate a very simplified version of spam detection, where we use a deep network to learn patterns across 2 features:

  • contains_offer (0 or 1)
  • is_from_known_sender (0 or 1)

Dataset (Real-World Like):

# [contains_offer, is_from_known_sender] → is_spam
data = [
    ([1, 0], 1),  # Offer from unknown sender → spam
    ([0, 1], 0),  # No offer from known sender → not spam
    ([1, 1], 0),  # Offer from known sender → likely not spam
    ([0, 0], 1),  # No offer, unknown sender → suspicious → spam
]
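
A minimal way to reuse the pipeline on this data (a sketch, assuming the train() variant sketched earlier, which reads the global data list and returns its learned parameters):

# Retrain the same 3-layer network on the spam-like dataset defined above
w1, b1, w2, b2, w3, b3 = train()
for x, _ in data:
    _, _, _, _, _, a3 = forward(x, w1, b1, w2, b2, w3, b3)
    print(f"{x} → spam score ≈ {round(a3[0], 3)}")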

Real Interpretation of Each Layer:

  • Layer 1: Finds basic features (e.g., offer, known sender)
  • Layer 2: Combines features (e.g., “Offer + Known Sender”)
  • Layer 3: Learns spam vs non-spam from combinations

What Happens Without Deep Layers?

With only one layer:

  • The network might say:
    • Offer = spam (blindly)
    • Known sender = not spam (blindly)

→ It can’t learn combination logic like: “An offer from a known sender is not necessarily spam.”

Deep Learning Helps Because:

  • It stacks multiple decision points
  • Allows the network to build abstract ideas, like:
    • “trustworthiness”
    • “intended tone”
    • “frequency of spammy words”

Deep Learning with Neural Networks – Basic Math Concepts