Xavier Initialization example with Simple Python
1. Here’s a small toy network, a scaled-down stand-in for a digit classifier, with its weights set by Xavier initialization:
```python
import random
import math

# Xavier Initialization Function
def xavier_init(n_in, n_out):
    limit = math.sqrt(6 / (n_in + n_out))
    return [random.uniform(-limit, limit) for _ in range(n_in)]

# Activation function: ReLU
def relu(x):
    return max(0, x)

# Simulating one hidden layer with Xavier
input_size = 5
hidden_size = 4
output_size = 3

# Simulated input
input_vector = [0.5, -0.3, 0.8, -0.1, 0.2]

# Xavier-initialized weights for hidden layer
hidden_weights = [xavier_init(input_size, hidden_size) for _ in range(hidden_size)]

# Forward pass to hidden layer
hidden_output = []
for neuron_weights in hidden_weights:
    activation = sum(i * w for i, w in zip(input_vector, neuron_weights))
    hidden_output.append(relu(activation))

print("Hidden Layer Output:", hidden_output)

# Xavier-initialized weights for output layer
output_weights = [xavier_init(hidden_size, output_size) for _ in range(output_size)]

# Forward pass to output layer
output = []
for neuron_weights in output_weights:
    activation = sum(h * w for h, w in zip(hidden_output, neuron_weights))
    output.append(relu(activation))

print("Final Output:", output)
```
Output:
```
Hidden Layer Output: [0, 0.23176983099229365, 0.07018244771986898, 0.19024374061554133]
Final Output: [0.009534419317138415, 0, 0]
```
What this does:
- Initializes weights using Xavier
- Simulates forward pass through 1 hidden layer and 1 output layer
- Uses ReLU activation for simplicity
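Because the weights are drawn at random and no seed is fixed, the exact numbers above will differ on every run. One quick sanity check (a small sketch I've added, reusing the same xavier_init function) is to draw many weights and compare their empirical variance with the theoretical value 2 / (n_in + n_out) that the Xavier uniform range is built to achieve:

```python
import random
import math

# Same Xavier (Glorot) uniform initializer as in the example above
def xavier_init(n_in, n_out):
    limit = math.sqrt(6 / (n_in + n_out))
    return [random.uniform(-limit, limit) for _ in range(n_in)]

random.seed(42)  # fix the seed so this check is repeatable

n_in, n_out = 5, 4

# Draw many weights and measure their spread
samples = [w for _ in range(10_000) for w in xavier_init(n_in, n_out)]
mean = sum(samples) / len(samples)
variance = sum((w - mean) ** 2 for w in samples) / len(samples)

print("Empirical variance        :", round(variance, 4))
print("Theoretical 2/(n_in+n_out):", round(2 / (n_in + n_out), 4))
```

For n_in = 5 and n_out = 4 both printed values should land near 2/9 ≈ 0.222.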
2. New Use Case: Predicting Customer Churn from Behavior Data
Imagine we’re working at a telecom company, and we want to predict whether a customer is likely to leave (churn) based on their behavior — call duration, internet usage, complaint frequency, etc.
Without Proper Initialization
If we randomly initialize weights poorly:
- The model either becomes too confident too early (weights too large)
- Or doesn’t learn useful patterns (weights too small)
- This leads to sluggish learning, unstable predictions, or stuck gradients, as the sketch below demonstrates
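To see both failure modes in numbers, here is an illustrative sketch (not part of the original example) that pushes a random signal through several purely linear layers; the weight scales 1.0 and 0.01 are arbitrary stand-ins for "too large" and "too small":

```python
import random

random.seed(0)

def forward_layer(inputs, n_out, scale):
    # One fully connected (linear) layer; weights drawn uniformly from (-scale, scale)
    outputs = []
    for _ in range(n_out):
        weights = [random.uniform(-scale, scale) for _ in range(len(inputs))]
        outputs.append(sum(i * w for i, w in zip(inputs, weights)))
    return outputs

def run(scale, n_layers=6, width=50):
    x = [random.uniform(-1, 1) for _ in range(width)]
    for layer in range(n_layers):
        x = forward_layer(x, width, scale)
        avg_magnitude = sum(abs(v) for v in x) / len(x)
        print(f"scale={scale:<5} layer {layer + 1}: avg |activation| = {avg_magnitude:.3g}")
    print()

run(scale=1.0)    # weights too large -> activations grow layer after layer
run(scale=0.01)   # weights too small -> activations shrink toward zero
```

With the wide range the average activation magnitude grows several-fold at every layer, while the narrow range collapses it toward zero, which is exactly what starves the gradients during backpropagation.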
Why Xavier Helps Here
When you use Xavier Initialization:
- Inputs like “minutes used”, “data consumption”, and “customer age” keep a consistent variance as they pass through the layers (checked numerically in the sketch after this list).
- The gradients don’t vanish or explode.
- Your churn prediction model becomes stable and converges faster.
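That stability claim can be checked numerically. The sketch below is my own addition, not from the original walkthrough: it uses tanh because the Glorot derivation assumes activations that are roughly linear around zero, and it counts units with |output| > 0.95 as near-saturated, since a saturated tanh unit passes back almost no gradient:

```python
import random
import math

random.seed(1)

WIDTH, DEPTH = 100, 5

def xavier(n_in, n_out):
    limit = math.sqrt(6 / (n_in + n_out))
    return [random.uniform(-limit, limit) for _ in range(n_in)]

def naive(n_in, n_out):
    # Plain uniform(-1, 1) weights; n_out is ignored, kept only for a matching signature
    return [random.uniform(-1, 1) for _ in range(n_in)]

def tanh_layer(inputs, n_out, init):
    # One fully connected tanh layer; `init` draws one neuron's incoming weights
    outputs = []
    for _ in range(n_out):
        weights = init(len(inputs), n_out)
        z = sum(i * w for i, w in zip(inputs, weights))
        outputs.append(math.tanh(z))
    return outputs

def std(values):
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

for name, init in [("naive uniform(-1, 1)", naive), ("xavier", xavier)]:
    x = [random.uniform(-1, 1) for _ in range(WIDTH)]
    print(name)
    for depth in range(DEPTH):
        x = tanh_layer(x, WIDTH, init)
        saturated = sum(abs(a) > 0.95 for a in x) / len(x)
        print(f"  layer {depth + 1}: std = {std(x):.3f}, near-saturated units = {saturated:.0%}")
```

With plain uniform(-1, 1) weights a large share of units saturate within the first layer or two, whereas the Xavier-initialized network keeps a moderate spread and essentially no saturated units at any depth.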
Step-by-Step with Churn Use Case
- Input Features (e.g., call_minutes, data_gb, complaints, tenure, bill_amount) → these go to the input layer (say, size = 5)
- Hidden Layer Processing → Xavier Initialization keeps the signal consistent
- Output Layer predicts:
  - 1 if the customer will churn
  - 0 if the customer stays
Updated Python Simulation (Simplified Churn Use Case)
```python
import random
import math

# Xavier Initialization Function
def xavier_init(n_in, n_out):
    limit = math.sqrt(6 / (n_in + n_out))
    return [random.uniform(-limit, limit) for _ in range(n_in)]

# ReLU Activation Function
def relu(x):
    return max(0, x)

# Simulated input for a customer
# [call_minutes, data_gb, complaints, tenure_months, bill_amount]
customer_input = [300, 2.5, 0, 12, 650]  # raw example values (scaled in practice; see note below)

# Layer sizes for the hidden layer (5 inputs → 4 hidden neurons)
input_size = len(customer_input)
hidden_size = 4
output_size = 1

# Initialize weights using Xavier
hidden_weights = [xavier_init(input_size, hidden_size) for _ in range(hidden_size)]

# Forward pass to hidden layer
hidden_output = []
for weights in hidden_weights:
    activation = sum(i * w for i, w in zip(customer_input, weights))
    hidden_output.append(relu(activation))

# Xavier init for output layer (4 → 1)
output_weights = xavier_init(hidden_size, output_size)

# Final output activation (raw churn score)
output_activation = sum(h * w for h, w in zip(hidden_output, output_weights))

print("Churn Prediction Score:", output_activation)
```
Output:
```
Churn Prediction Score: 231.72790823422997
```
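That score comes out so large because the forward pass consumes raw values such as 300 call minutes and a 650 bill. In practice the features would be scaled before they reach the network; here is a minimal min-max scaling sketch, where the feature ranges are made-up assumptions purely for illustration:

```python
# Hypothetical feature ranges, assumed here purely for illustration
feature_ranges = {
    "call_minutes":  (0, 1000),
    "data_gb":       (0, 50),
    "complaints":    (0, 10),
    "tenure_months": (0, 72),
    "bill_amount":   (0, 2000),
}

customer_input = [300, 2.5, 0, 12, 650]  # same raw values as above

# Min-max scaling: every feature is mapped into the 0..1 range
normalized_input = [
    (value - low) / (high - low)
    for value, (low, high) in zip(customer_input, feature_ranges.values())
]

print("Normalized input:", [round(v, 3) for v in normalized_input])
```

Feeding normalized_input into the forward pass above (instead of the raw values) keeps every activation, and therefore the churn score, in a small and comparable range.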
Required Math (Mapped to Churn)
Concept | Why It Matters for Churn |
---|---|
Variance Propagation | Prevents unstable output between layers |
Xavier Formula | Ensures weights match layer sizes |
ReLU | Handles sparse or zero-valued inputs (e.g., complaints = 0) |
Linear Combinations | Models real-world influence of features (e.g., high bill → churn) |
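The "Variance Propagation" row can be made concrete with the rule of thumb from the Glorot paper: for one neuron, Var(z) ≈ n_in · Var(w) · Var(x). With Xavier's Var(w) = 2 / (n_in + n_out), the output variance stays close to the input variance whenever the layer sizes are comparable. The sketch below (my addition) checks this for the 5 → 4 hidden layer used above, assuming unit-variance inputs:

```python
import random
import math

random.seed(7)

n_in, n_out = 5, 4          # the hidden layer used in the churn example
var_w = 2 / (n_in + n_out)  # variance of Xavier-uniform weights
var_x = 1.0                 # assume unit-variance inputs

# Glorot rule of thumb for one neuron: Var(z) ~= n_in * Var(w) * Var(x)
print("Predicted output variance:", round(n_in * var_w * var_x, 3))

# Empirical check: simulate many pre-activations z = sum(w_i * x_i)
limit = math.sqrt(6 / (n_in + n_out))
zs = []
for _ in range(20_000):
    x = [random.gauss(0, math.sqrt(var_x)) for _ in range(n_in)]
    w = [random.uniform(-limit, limit) for _ in range(n_in)]
    zs.append(sum(xi * wi for xi, wi in zip(x, w)))

mean_z = sum(zs) / len(zs)
print("Empirical output variance:", round(sum((z - mean_z) ** 2 for z in zs) / len(zs), 3))
```

Both printed values land near 10/9 ≈ 1.11, so the signal leaving the layer keeps roughly the same scale as the signal entering it.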
Summary (with Updated Use Case)
Step | Action | Example |
---|---|---|
1. | Collect customer features | [minutes, data, complaints, tenure, bill] |
2. | Use Xavier Initialization for weights | to avoid exploding or vanishing values |
3. | Run a forward pass | Hidden → Output (churn score) |
4. | Use for binary prediction | Output > threshold → Churn |
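Step 4 in the table thresholds the raw output. One minimal way to do that (my sketch; the original code stops at the raw score) is to squash the score with a sigmoid so it can be read as a probability, then compare it against a cutoff such as 0.5:

```python
import math

def sigmoid(z):
    # Squashes any real-valued score into the (0, 1) range
    return 1 / (1 + math.exp(-z))

def predict_churn(score, threshold=0.5):
    probability = sigmoid(score)
    return probability, int(probability > threshold)  # 1 = churn, 0 = stays

# Using the score printed by the churn simulation above
probability, label = predict_churn(231.72790823422997)
print(f"Churn probability: {probability:.4f}, predicted label: {label}")
```

Note that the unscaled score of roughly 231 saturates the sigmoid at 1.0, which is one more reason to normalize the inputs as discussed earlier.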
3. Why use Xavier Initialization instead of just random initialization?
The Problem with Random Initialization
If you just assign weights like this:
```python
weights = [random.uniform(-1, 1) for _ in range(n)]
```
it might work… but in deeper networks or even simple multi-layer setups:
- Too large weights → Activations explode → gradients explode → training breaks
- Too small weights → Activations shrink → gradients vanish → network stops learning
This means the model might:
- Get stuck
- Not converge properly
- Learn very slowly
- Give poor or unstable predictions
Why Xavier Initialization Is Better
Xavier (Glorot) Initialization balances the scale of weights using the number of inputs and outputs.
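Concretely, the variant used throughout this post draws each weight from a uniform distribution U(-limit, limit) with limit = sqrt(6 / (n_in + n_out)); a Gaussian variant with standard deviation sqrt(2 / (n_in + n_out)) is also widely used, and both give the weights a variance of 2 / (n_in + n_out). A compact sketch of the two:

```python
import random
import math

def xavier_uniform(n_in, n_out):
    # Glorot/Xavier uniform: U(-limit, limit) with limit = sqrt(6 / (n_in + n_out))
    limit = math.sqrt(6 / (n_in + n_out))
    return [random.uniform(-limit, limit) for _ in range(n_in)]

def xavier_normal(n_in, n_out):
    # Glorot/Xavier normal: N(0, std^2) with std = sqrt(2 / (n_in + n_out))
    std = math.sqrt(2 / (n_in + n_out))
    return [random.gauss(0, std) for _ in range(n_in)]

print("Uniform sample:", [round(w, 3) for w in xavier_uniform(5, 4)])
print("Normal sample :", [round(w, 3) for w in xavier_normal(5, 4)])
```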
What It Really Does (Plain English):
Issue | Random Init | Xavier Init |
---|---|---|
Activations get too big or small | Yes | Balanced |
Gradients vanish or explode | Often | Controlled |
Learning stability | Poor | Strong |
Faster convergence | No | Yes |
Works with tanh/relu activations | Unpredictable | Well-suited |
Simple Analogy:
Imagine we’re pouring water from a jug (the input) into multiple cups (the outputs). If we pour too much, the cups overflow (activations explode). If we pour too little, the cups stay almost empty (activations vanish). Xavier helps pour just the right amount, evenly and fairly.
Real-World Impact (Customer Churn Example)
Feature | Poor Init | Xavier Init |
---|---|---|
High call minutes | Might be drowned or overamplified | Balanced signal |
Frequent complaints | Could lead to erratic jumps | Smooth influence |
Bill too high | Might be ignored | Consistently weighted |
Gradient backpropagation | Might die or explode | Stable flow |
Summary:
- Xavier Initialization = Balanced Learning
- Keeps output variance similar to input variance
- Avoids vanishing or exploding gradients
- Helps the network learn faster, converge better, and generalize more
- Especially important when the network has multiple layers
Xavier Initialization Applicability in Neural Networks – Basic Math Concepts