Xavier Initialization Example with Simple Python

1. Here’s a small example network (one hidden layer, one output layer) using Xavier initialization:

import random
import math

# Xavier (Glorot) uniform initialization:
# returns the n_in incoming weights for one neuron, drawn from
# U(-limit, limit) with limit = sqrt(6 / (n_in + n_out))
def xavier_init(n_in, n_out):
    limit = math.sqrt(6 / (n_in + n_out))
    return [random.uniform(-limit, limit) for _ in range(n_in)]

# Activation function: ReLU
def relu(x):
    return max(0, x)

# Simulating one hidden layer with Xavier
input_size = 5
hidden_size = 4
output_size = 3

# Simulated input
input_vector = [0.5, -0.3, 0.8, -0.1, 0.2]

# Xavier initialized weights for hidden layer
hidden_weights = [xavier_init(input_size, hidden_size) for _ in range(hidden_size)]

# Forward pass to hidden layer
hidden_output = []
for neuron_weights in hidden_weights:
    activation = sum(i*w for i, w in zip(input_vector, neuron_weights))
    hidden_output.append(relu(activation))

print("Hidden Layer Output:", hidden_output)

# Xavier initialized weights for output layer
output_weights = [xavier_init(hidden_size, output_size) for _ in range(output_size)]

# Forward pass to output layer
output = []
for neuron_weights in output_weights:
    activation = sum(h*w for h, w in zip(hidden_output, neuron_weights))
    output.append(relu(activation))

print("Final Output:", output)

Output:

Hidden Layer Output: [0, 0.23176983099229365, 0.07018244771986898, 0.19024374061554133]
Final Output: [0.009534419317138415, 0, 0]
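
Note that the weights are drawn at random, so the exact numbers will differ on every run. If you want repeatable output, a minimal option is to seed Python's random module before the weights are created:

random.seed(42)  # any fixed seed makes the weight draws (and therefore the outputs) repeatable

Re-running the script after adding this line near the top will then print the same hidden and final outputs each time.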

What this does:

  • Initializes weights using Xavier
  • Simulates a forward pass through one hidden layer and one output layer
  • Uses ReLU activation for simplicity

2. New Use Case: Predicting Customer Churn from Behavior Data

Imagine we’re working at a telecom company, and we want to predict whether a customer is likely to leave (churn) based on their behavior — call duration, internet usage, complaint frequency, etc.

Without Proper Initialization

If we initialize weights with poorly scaled random values:

  • The model either becomes too confident too early (weights too large)
  • Or doesn’t learn useful patterns (weights too small)
  • This leads to sluggish learning, unstable predictions, or stuck gradients

Why Xavier Helps Here

When you use Xavier Initialization:

  • The variance of signals built from inputs like “minutes used”, “data consumption”, and “customer age” is roughly preserved from layer to layer.
  • The gradients don’t vanish or explode.
  • Your churn prediction model trains stably and converges faster (see the variance check sketched below).
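
To see the variance claim in plain Python, here is a minimal sketch. It reuses the same xavier_init as above and assumes standardized (zero-mean, unit-variance) toy inputs, which is an assumption of this illustration rather than real churn data: it draws many Xavier-initialized neurons and compares the spread of their pre-activations with the spread of the inputs.

import random
import math
import statistics

def xavier_init(n_in, n_out):
    limit = math.sqrt(6 / (n_in + n_out))
    return [random.uniform(-limit, limit) for _ in range(n_in)]

n_in, n_out = 5, 4
input_samples, pre_activations = [], []

for _ in range(20000):
    x = [random.gauss(0, 1) for _ in range(n_in)]   # toy standardized features
    w = xavier_init(n_in, n_out)                    # fresh Xavier draw each time
    input_samples.extend(x)
    pre_activations.append(sum(xi * wi for xi, wi in zip(x, w)))

print("Input variance:         ", statistics.pvariance(input_samples))
print("Pre-activation variance:", statistics.pvariance(pre_activations))

For these layer sizes (5 → 4) the pre-activation variance comes out close to the input variance, which is the “signal is preserved across layers” property described above.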

Step-by-Step with Churn Use Case

  1. Input Features (e.g., call_minutes, data_gb, complaints, tenure, bill_amount)
    → These go to the input layer (say, size = 5)
  2. Hidden Layer Processing
    → Xavier Initialization keeps signal consistent
  3. Output Layer predicts:
    • 1 if customer will churn
    • 0 if customer stays

Updated Python Simulation (Simplified Churn Use Case)

import random
import math

# Xavier Initialization Function
def xavier_init(n_in, n_out):
    limit = math.sqrt(6 / (n_in + n_out))
    return [random.uniform(-limit, limit) for _ in range(n_in)]

# ReLU Activation Function
def relu(x):
    return max(0, x)

# Simulated input for a customer
# [call_minutes, data_gb, complaints, tenure_months, bill_amount]
customer_input = [300, 2.5, 0, 12, 650]  # example raw feature values (normalize these in practice; see the sketch below the output)

# Layer sizes (5 inputs → 4 hidden neurons → 1 output)
input_size = len(customer_input)
hidden_size = 4
output_size = 1

# Initialize weights using Xavier
hidden_weights = [xavier_init(input_size, hidden_size) for _ in range(hidden_size)]

# Forward pass to hidden layer
hidden_output = []
for weights in hidden_weights:
    activation = sum(i*w for i, w in zip(customer_input, weights))
    hidden_output.append(relu(activation))

# Xavier init for output layer (4 → 1)
output_weights = xavier_init(hidden_size, output_size)

# Final output activation (raw churn score, before any squashing or thresholding)
output_activation = sum(h*w for h, w in zip(hidden_output, output_weights))

print("Churn Prediction Score:", output_activation)

Output:

Churn Prediction Score: 231.72790823422997
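
The score is this large because the feature values above are raw (300 minutes, a 650 bill), not normalized. Here is a minimal sketch of min-max scaling the features into a 0–1 range before the forward pass; the minimum/maximum ranges are made-up placeholder values for illustration, not figures from a real dataset:

# Hypothetical feature ranges, made up purely for this illustration
feature_ranges = [
    (0, 600),    # call_minutes
    (0, 20),     # data_gb
    (0, 10),     # complaints
    (0, 72),     # tenure_months
    (0, 1000),   # bill_amount
]

customer_input = [300, 2.5, 0, 12, 650]

# Min-max scale every feature into the [0, 1] range
normalized_input = [(value - lo) / (hi - lo)
                    for value, (lo, hi) in zip(customer_input, feature_ranges)]

print("Normalized input:", normalized_input)

Running the same hidden/output forward pass on normalized_input keeps the churn score in a much smaller range, which makes scores comparable between customers and easier to threshold.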

Required Math (Mapped to Churn)

Concept               | Why It Matters for Churn
----------------------|--------------------------------------------------------------------
Variance Propagation  | Prevents unstable output between layers
Xavier Formula        | Ensures weights match the layer sizes (worked example below)
ReLU                  | Handles sparse or zero-valued inputs (e.g., complaints = 0)
Linear Combinations   | Models real-world influence of features (e.g., high bill → churn)
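
As a worked instance of the Xavier formula: the hidden layer above has n_in = 5 and n_out = 4, so limit = sqrt(6 / (5 + 4)) = sqrt(0.667) ≈ 0.82, and every hidden-layer weight is drawn uniformly from [-0.82, 0.82]. For the output layer (4 → 1), limit = sqrt(6 / 5) ≈ 1.10.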

Summary (with Updated Use Case)

  1. Collect customer features → [minutes, data, complaints, tenure, bill]
  2. Use Xavier Initialization for the weights → avoids exploding or vanishing values
  3. Run a forward pass → Hidden → Output (churn score)
  4. Use the score for binary prediction → output > threshold → churn (see the sketch below)
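
Here is a minimal sketch of step 4, reusing output_activation from the churn forward pass above. The sigmoid squashing and the 0.5 cutoff are assumptions for this illustration (the code above stops at the raw score); a sigmoid is a common choice for binary outputs:

import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

churn_probability = sigmoid(output_activation)   # squash the raw score into (0, 1)
threshold = 0.5                                  # hypothetical cutoff, tune for your data

prediction = 1 if churn_probability > threshold else 0
print("Churn" if prediction == 1 else "Stays")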

3. Why use Xavier Initialization instead of just random initialization?

The Problem with Random Initialization

If you just assign weights like this:

weights = [random.uniform(-1, 1) for _ in range(n)]

it might work… but in deeper networks or even simple multi-layer setups:

  • Too large weights → Activations explode → gradients explode → training breaks
  • Too small weights → Activations shrink → gradients vanish → network stops learning

This means the model might:

  • Get stuck
  • Not converge properly
  • Learn very slowly
  • Give poor or unstable predictions

Why Xavier Initialization Is Better

Xavier (Glorot) Initialization balances the scale of weights using the number of inputs and outputs.

What It Really Does (Plain English):

Issue                              | Random Init   | Xavier Init
-----------------------------------|---------------|-------------
Activations get too big or small   | Yes           | Balanced
Gradients vanish or explode        | Often         | Controlled
Learning stability                 | Poor          | Strong
Faster convergence                 | No            | Yes
Works with tanh/ReLU activations   | Unpredictable | Well-suited
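
To check these claims concretely, the sketch below pushes the same toy input through a stack of identically sized layers twice: once with the plain uniform(-1, 1) weights from the snippet earlier in this section, and once with Xavier weights. The layer width and depth are arbitrary choices for illustration, and tanh is used because it is the activation Xavier was originally derived for:

import random
import math

def xavier_init(n_in, n_out):
    limit = math.sqrt(6 / (n_in + n_out))
    return [random.uniform(-limit, limit) for _ in range(n_in)]

def naive_init(n_in, n_out):
    return [random.uniform(-1, 1) for _ in range(n_in)]

def forward_stack(x, init_fn, width=50, depth=10):
    # Push x through `depth` layers of `width` tanh neurons,
    # drawing fresh weights for every neuron with init_fn.
    for _ in range(depth):
        x = [math.tanh(sum(xi * wi for xi, wi in zip(x, init_fn(width, width))))
             for _ in range(width)]
    return x

random.seed(0)
x0 = [random.gauss(0, 1) for _ in range(50)]   # toy standardized input

for name, init_fn in [("uniform(-1, 1)", naive_init), ("Xavier", xavier_init)]:
    out = forward_stack(list(x0), init_fn)
    mean_abs = sum(abs(v) for v in out) / len(out)
    print(f"{name:>15}: mean |activation| after 10 layers = {mean_abs:.4f}")

With uniform(-1, 1) weights the tanh units saturate almost immediately (activations pinned near ±1, the regime where tanh gradients vanish), while the Xavier run keeps activations in a moderate range, matching the table above.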

Simple Analogy:

Imagine we’re pouring water from a jug (the input) into multiple cups (the outputs). If we pour too much, the cups overflow (activations explode). If we pour too little, the cups stay almost empty (activations vanish). Xavier helps pour just the right amount, evenly and fairly.

Real-World Impact (Customer Churn Example)

Feature                   | Poor Init                         | Xavier Init
--------------------------|-----------------------------------|-----------------------
High call minutes         | Might be drowned or overamplified | Balanced signal
Frequent complaints       | Could lead to erratic jumps       | Smooth influence
Bill too high             | Might be ignored                  | Consistently weighted
Gradient backpropagation  | Might die or explode              | Stable flow

Summary:

Xavier Initialization = Balanced Learning

  • Keeps output variance similar to input variance
  • Avoids vanishing or exploding gradients
  • Helps the network learn faster, converge better, and generalize better
  • Especially important when the network has multiple layers

Xavier Initialization Applicability in Neural Networks – Basic Math Concepts