Xavier Initialization example with Simple Python
1. Here’s a small toy network, a scaled-down stand-in for a digit classifier, with its weights set by Xavier initialization:
```python
import random
import math

# Xavier Initialization Function
def xavier_init(n_in, n_out):
    limit = math.sqrt(6 / (n_in + n_out))
    return [random.uniform(-limit, limit) for _ in range(n_in)]

# Activation function: ReLU
def relu(x):
    return max(0, x)

# Simulating one hidden layer with Xavier
input_size = 5
hidden_size = 4
output_size = 3

# Simulated input
input_vector = [0.5, -0.3, 0.8, -0.1, 0.2]

# Xavier-initialized weights for hidden layer
hidden_weights = [xavier_init(input_size, hidden_size) for _ in range(hidden_size)]

# Forward pass to hidden layer
hidden_output = []
for neuron_weights in hidden_weights:
    activation = sum(i * w for i, w in zip(input_vector, neuron_weights))
    hidden_output.append(relu(activation))

print("Hidden Layer Output:", hidden_output)

# Xavier-initialized weights for output layer
output_weights = [xavier_init(hidden_size, output_size) for _ in range(output_size)]

# Forward pass to output layer
output = []
for neuron_weights in output_weights:
    activation = sum(h * w for h, w in zip(hidden_output, neuron_weights))
    output.append(relu(activation))

print("Final Output:", output)
```
Output:
```
Hidden Layer Output: [0, 0.23176983099229365, 0.07018244771986898, 0.19024374061554133]
Final Output: [0.009534419317138415, 0, 0]
```
What this does:
- Initializes weights using Xavier
- Simulates forward pass through 1 hidden layer and 1 output layer
- Uses ReLU activation for simplicity
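Because the weights are drawn at random and no seed is fixed, the exact numbers above will differ on every run. One quick sanity check (a small sketch I've added, reusing the same xavier_init function) is to draw many weights and compare their empirical variance with the theoretical value 2 / (n_in + n_out) that the Xavier uniform range is built to achieve:

```python
import random
import math

# Same Xavier (Glorot) uniform initializer as in the example above
def xavier_init(n_in, n_out):
    limit = math.sqrt(6 / (n_in + n_out))
    return [random.uniform(-limit, limit) for _ in range(n_in)]

random.seed(42)  # fix the seed so this check is repeatable

n_in, n_out = 5, 4

# Draw many weights and measure their spread
samples = [w for _ in range(10_000) for w in xavier_init(n_in, n_out)]
mean = sum(samples) / len(samples)
variance = sum((w - mean) ** 2 for w in samples) / len(samples)

print("Empirical variance        :", round(variance, 4))
print("Theoretical 2/(n_in+n_out):", round(2 / (n_in + n_out), 4))
```

For n_in = 5 and n_out = 4 both printed values should land near 2/9 ≈ 0.222.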
2. New Use Case: Predicting Customer Churn from Behavior Data
Imagine we’re working at a telecom company, and we want to predict whether a customer is likely to leave (churn) based on their behavior — call duration, internet usage, complaint frequency, etc.
Without Proper Initialization
If we randomly initialize weights poorly:
- The model either becomes too confident too early (weights too large)
- Or doesn’t learn useful patterns (weights too small)
- This leads to sluggish learning, unstable predictions, or stuck gradients, as the sketch below demonstrates
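To see both failure modes in numbers, here is an illustrative sketch (not part of the original example) that pushes a random signal through several purely linear layers; the weight scales 1.0 and 0.01 are arbitrary stand-ins for "too large" and "too small":

```python
import random

random.seed(0)

def forward_layer(inputs, n_out, scale):
    # One fully connected (linear) layer; weights drawn uniformly from (-scale, scale)
    outputs = []
    for _ in range(n_out):
        weights = [random.uniform(-scale, scale) for _ in range(len(inputs))]
        outputs.append(sum(i * w for i, w in zip(inputs, weights)))
    return outputs

def run(scale, n_layers=6, width=50):
    x = [random.uniform(-1, 1) for _ in range(width)]
    for layer in range(n_layers):
        x = forward_layer(x, width, scale)
        avg_magnitude = sum(abs(v) for v in x) / len(x)
        print(f"scale={scale:<5} layer {layer + 1}: avg |activation| = {avg_magnitude:.3g}")
    print()

run(scale=1.0)    # weights too large -> activations grow layer after layer
run(scale=0.01)   # weights too small -> activations shrink toward zero
```

With the wide range the average activation magnitude grows several-fold at every layer, while the narrow range collapses it toward zero, which is exactly what starves the gradients during backpropagation.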
Why Xavier Helps Here
When you use Xavier Initialization:
- Inputs like “minutes used”, “data consumption”, and “customer age” keep a consistent variance as they pass through the layers (checked numerically in the sketch after this list).
- The gradients don’t vanish or explode.
- Your churn prediction model becomes stable and converges faster.
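That stability claim can be checked numerically. The sketch below is my own addition, not from the original walkthrough: it uses tanh because the Glorot derivation assumes activations that are roughly linear around zero, and it counts units with |output| > 0.95 as near-saturated, since a saturated tanh unit passes back almost no gradient:

```python
import random
import math

random.seed(1)

WIDTH, DEPTH = 100, 5

def xavier(n_in, n_out):
    limit = math.sqrt(6 / (n_in + n_out))
    return [random.uniform(-limit, limit) for _ in range(n_in)]

def naive(n_in, n_out):
    # Plain uniform(-1, 1) weights; n_out is ignored, kept only for a matching signature
    return [random.uniform(-1, 1) for _ in range(n_in)]

def tanh_layer(inputs, n_out, init):
    # One fully connected tanh layer; `init` draws one neuron's incoming weights
    outputs = []
    for _ in range(n_out):
        weights = init(len(inputs), n_out)
        z = sum(i * w for i, w in zip(inputs, weights))
        outputs.append(math.tanh(z))
    return outputs

def std(values):
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

for name, init in [("naive uniform(-1, 1)", naive), ("xavier", xavier)]:
    x = [random.uniform(-1, 1) for _ in range(WIDTH)]
    print(name)
    for depth in range(DEPTH):
        x = tanh_layer(x, WIDTH, init)
        saturated = sum(abs(a) > 0.95 for a in x) / len(x)
        print(f"  layer {depth + 1}: std = {std(x):.3f}, near-saturated units = {saturated:.0%}")
```

With plain uniform(-1, 1) weights a large share of units saturate within the first layer or two, whereas the Xavier-initialized network keeps a moderate spread and essentially no saturated units at any depth.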
Step-by-Step with Churn Use Case
- Input Features (e.g., call_minutes, data_gb, complaints, tenure, bill_amount) → these go to the input layer (say, size = 5)
- Hidden Layer Processing → Xavier Initialization keeps the signal consistent
- Output Layer predicts:
  - 1 if the customer will churn
  - 0 if the customer stays
Updated Python Simulation (Simplified Churn Use Case)
```python
import random
import math

# Xavier Initialization Function
def xavier_init(n_in, n_out):
    limit = math.sqrt(6 / (n_in + n_out))
    return [random.uniform(-limit, limit) for _ in range(n_in)]

# ReLU Activation Function
def relu(x):
    return max(0, x)

# Simulated input for a customer
# [call_minutes, data_gb, complaints, tenure_months, bill_amount]
customer_input = [300, 2.5, 0, 12, 650]  # raw example values (scaled in practice; see note below)

# Layer sizes for the hidden layer (5 inputs → 4 hidden neurons)
input_size = len(customer_input)
hidden_size = 4
output_size = 1

# Initialize weights using Xavier
hidden_weights = [xavier_init(input_size, hidden_size) for _ in range(hidden_size)]

# Forward pass to hidden layer
hidden_output = []
for weights in hidden_weights:
    activation = sum(i * w for i, w in zip(customer_input, weights))
    hidden_output.append(relu(activation))

# Xavier init for output layer (4 → 1)
output_weights = xavier_init(hidden_size, output_size)

# Final output activation (raw churn score)
output_activation = sum(h * w for h, w in zip(hidden_output, output_weights))

print("Churn Prediction Score:", output_activation)
```
Output:
```
Churn Prediction Score: 231.72790823422997
```
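That score comes out so large because the forward pass consumes raw values such as 300 call minutes and a 650 bill. In practice the features would be scaled before they reach the network; here is a minimal min-max scaling sketch, where the feature ranges are made-up assumptions purely for illustration:

```python
# Hypothetical feature ranges, assumed here purely for illustration
feature_ranges = {
    "call_minutes":  (0, 1000),
    "data_gb":       (0, 50),
    "complaints":    (0, 10),
    "tenure_months": (0, 72),
    "bill_amount":   (0, 2000),
}

customer_input = [300, 2.5, 0, 12, 650]  # same raw values as above

# Min-max scaling: every feature is mapped into the 0..1 range
normalized_input = [
    (value - low) / (high - low)
    for value, (low, high) in zip(customer_input, feature_ranges.values())
]

print("Normalized input:", [round(v, 3) for v in normalized_input])
```

Feeding normalized_input into the forward pass above (instead of the raw values) keeps every activation, and therefore the churn score, in a small and comparable range.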
Required Math (Mapped to Churn)
Concept | Why It Matters for Churn |
---|---|
Variance Propagation | Prevents unstable output between layers |
Xavier Formula | Ensures weights match layer sizes |
ReLU | Handles sparse or zero-valued inputs (e.g., complaints = 0) |
Linear Combinations | Models real-world influence of features (e.g., high bill → churn) |
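The "Variance Propagation" row can be made concrete with the rule of thumb from the Glorot paper: for one neuron, Var(z) ≈ n_in · Var(w) · Var(x). With Xavier's Var(w) = 2 / (n_in + n_out), the output variance stays close to the input variance whenever the layer sizes are comparable. The sketch below (my addition) checks this for the 5 → 4 hidden layer used above, assuming unit-variance inputs:

```python
import random
import math

random.seed(7)

n_in, n_out = 5, 4          # the hidden layer used in the churn example
var_w = 2 / (n_in + n_out)  # variance of Xavier-uniform weights
var_x = 1.0                 # assume unit-variance inputs

# Glorot rule of thumb for one neuron: Var(z) ~= n_in * Var(w) * Var(x)
print("Predicted output variance:", round(n_in * var_w * var_x, 3))

# Empirical check: simulate many pre-activations z = sum(w_i * x_i)
limit = math.sqrt(6 / (n_in + n_out))
zs = []
for _ in range(20_000):
    x = [random.gauss(0, math.sqrt(var_x)) for _ in range(n_in)]
    w = [random.uniform(-limit, limit) for _ in range(n_in)]
    zs.append(sum(xi * wi for xi, wi in zip(x, w)))

mean_z = sum(zs) / len(zs)
print("Empirical output variance:", round(sum((z - mean_z) ** 2 for z in zs) / len(zs), 3))
```

Both printed values land near 10/9 ≈ 1.11, so the signal leaving the layer keeps roughly the same scale as the signal entering it.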
Summary (with Updated Use Case)
Step | Action | Example |
---|---|---|
1. | Collect customer features | [minutes, data, complaints, tenure, bill] |
2. | Use Xavier Initialization for weights | to avoid exploding or vanishing values |
3. | Run a forward pass | Hidden → Output (churn score) |
4. | Use for binary prediction | Output > threshold → Churn |
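Step 4 in the table thresholds the raw output. One minimal way to do that (my sketch; the original code stops at the raw score) is to squash the score with a sigmoid so it can be read as a probability, then compare it against a cutoff such as 0.5:

```python
import math

def sigmoid(z):
    # Squashes any real-valued score into the (0, 1) range
    return 1 / (1 + math.exp(-z))

def predict_churn(score, threshold=0.5):
    probability = sigmoid(score)
    return probability, int(probability > threshold)  # 1 = churn, 0 = stays

# Using the score printed by the churn simulation above
probability, label = predict_churn(231.72790823422997)
print(f"Churn probability: {probability:.4f}, predicted label: {label}")
```

Note that the unscaled score of roughly 231 saturates the sigmoid at 1.0, which is one more reason to normalize the inputs as discussed earlier.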
3. Why use Xavier Initialization instead of just random initialization?
The Problem with Random Initialization
If you just assign weights like this:
```python
weights = [random.uniform(-1, 1) for _ in range(n)]
```
it might work… but in deeper networks or even simple multi-layer setups:
- Too large weights → Activations explode → gradients explode → training breaks
- Too small weights → Activations shrink → gradients vanish → network stops learning
This means the model might:
- Get stuck
- Not converge properly
- Learn very slowly
- Give poor or unstable predictions
Why Xavier Initialization Is Better
Xavier (Glorot) Initialization balances the scale of weights using the number of inputs and outputs.
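Concretely, the variant used throughout this post draws each weight from a uniform distribution U(-limit, limit) with limit = sqrt(6 / (n_in + n_out)); a Gaussian variant with standard deviation sqrt(2 / (n_in + n_out)) is also widely used, and both give the weights a variance of 2 / (n_in + n_out). A compact sketch of the two:

```python
import random
import math

def xavier_uniform(n_in, n_out):
    # Glorot/Xavier uniform: U(-limit, limit) with limit = sqrt(6 / (n_in + n_out))
    limit = math.sqrt(6 / (n_in + n_out))
    return [random.uniform(-limit, limit) for _ in range(n_in)]

def xavier_normal(n_in, n_out):
    # Glorot/Xavier normal: N(0, std^2) with std = sqrt(2 / (n_in + n_out))
    std = math.sqrt(2 / (n_in + n_out))
    return [random.gauss(0, std) for _ in range(n_in)]

print("Uniform sample:", [round(w, 3) for w in xavier_uniform(5, 4)])
print("Normal sample :", [round(w, 3) for w in xavier_normal(5, 4)])
```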
What It Really Does (Plain English):
Issue | Random Init | Xavier Init |
---|---|---|
Activations get too big or small | Yes | Balanced |
Gradients vanish or explode | Often | Controlled |
Learning stability | Poor | Strong |
Faster convergence | No | Yes |
Works with tanh/relu activations | Unpredictable | Well-suited |
Simple Analogy:
Imagine we’re pouring water from a jug (the input) into multiple cups (the outputs). If we pour too much, the cups overflow (activations explode). If we pour too little, the cups stay almost empty (activations vanish). Xavier helps pour just the right amount, evenly and fairly.
Real-World Impact (Customer Churn Example)
Feature | Poor Init | Xavier Init |
---|---|---|
High call minutes | Might be drowned or overamplified | Balanced signal |
Frequent complaints | Could lead to erratic jumps | Smooth influence |
Bill too high | Might be ignored | Consistently weighted |
Gradient backpropagation | Might die or explode | Stable flow |
Summary:
- Xavier Initialization = Balanced Learning
- Keeps output variance similar to input variance
- Avoids vanishing or exploding gradients
- Helps the network learn faster, converge better, and generalize more
- Especially important when the network has multiple layers
Xavier Initialization Applicability in Neural Networks – Basic Math Concepts