LeCun Initialization Applicability in Neural Networks

1. What is LeCun Initialization?

LeCun Initialization is a weight initialization method optimized for activation functions like sigmoid or tanh, which are common in shallow networks or non-ReLU settings.

Key Point:

  • Helps control the variance of activations across layers.
  • Ensures gradients don’t vanish or explode during training.

It works by initializing the weights W from a normal distribution with zero mean and a variance that shrinks with the number of inputs:

W ~ N(0, 1/n_in)

Where:

  • n_in = number of input units (fan-in) to the neuron
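
A minimal NumPy sketch of that rule (the helper name lecun_normal_init is just for illustration):

    import numpy as np

    def lecun_normal_init(n_in, n_out, rng=None):
        # LeCun normal: draw weights from N(0, 1/n_in)
        if rng is None:
            rng = np.random.default_rng(0)
        std = np.sqrt(1.0 / n_in)
        return rng.normal(loc=0.0, scale=std, size=(n_in, n_out))

    W = lecun_normal_init(n_in=5, n_out=8)
    print(W.std())   # roughly sqrt(1/5) ≈ 0.45
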
Scenario                    | Use LeCun Initialization? | Why?
Activation = tanh / sigmoid | Yes                       | Controls signal size to prevent vanishing
Small networks (1-2 layers) | Good                      | Keeps gradient flow stable
Using ReLU                  | No                        | Use He initialization instead

Real-World Use Case: Stock Price Prediction

Imagine predicting the next-day closing price using past 5 days’ prices (sliding window). For a simple shallow neural network using tanh activations, LeCun initialization helps maintain balanced gradients and activations.
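
A minimal sketch of that setup (the layer sizes, toy prices, and normalization below are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(42)

    def lecun(n_in, n_out):
        # LeCun normal: N(0, 1/n_in)
        return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

    # Toy 5-day sliding window of closing prices, normalized to a small range
    window = np.array([105.0, 107.0, 106.0, 108.0, 110.0])
    x = (window - window.mean()) / window.std()

    # Shallow tanh network: 5 inputs -> 8 hidden units -> 1 output
    W1, b1 = lecun(5, 8), np.zeros(8)
    W2, b2 = lecun(8, 1), np.zeros(1)

    h = np.tanh(x @ W1 + b1)      # hidden activations stay in tanh's steep region
    pred = h @ W2 + b2            # predicted next-day price (in normalized units)
    print(pred)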

1. What is Vanishing Gradient?

It happens during backpropagation, where we adjust weights using gradients:

w ← w − η · (∂L/∂w)      (η = learning rate, L = loss)

But if the gradient (the partial derivative ∂L/∂w) is very small, this update becomes nearly zero.
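
Plugging in made-up numbers makes the problem obvious:

    w = 0.80          # current weight
    lr = 0.01         # learning rate (eta)
    grad = 0.0005     # a "vanished" gradient
    w_new = w - lr * grad
    print(w_new)      # 0.799995, essentially no change to the weight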

2. Why do tanh/sigmoid cause vanishing gradients?

Both tanh and sigmoid have flat regions in their curves where their derivatives are very close to zero.

tanh(x):

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)),   with derivative   tanh'(x) = 1 − tanh(x)^2

As |x| increases, tanh saturates to -1 or +1 → derivative ≈ 0.

sigmoid(x):

σ(x) = 1 / (1 + e^(−x)),   with derivative   σ'(x) = σ(x)(1 − σ(x))

Same story: if the input is far from 0, the derivative becomes tiny (σ'(x) peaks at only 0.25, at x = 0).
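
A quick numerical check of both derivatives (the helper names are just for illustration):

    import numpy as np

    def dtanh(x):
        return 1.0 - np.tanh(x) ** 2          # tanh'(x)

    def dsigmoid(x):
        s = 1.0 / (1.0 + np.exp(-x))
        return s * (1.0 - s)                  # sigma'(x)

    for x in [0.0, 0.5, 2.0, 5.0]:
        print(f"x={x:4.1f}  tanh'={dtanh(x):.4f}  sigmoid'={dsigmoid(x):.4f}")
    # tanh' falls from 1.0 at x=0 to about 0.0002 at x=5;
    # sigmoid' never exceeds 0.25 and also collapses for large |x|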

Consequence:

If we’re using many layers:

  • Gradients get multiplied repeatedly by small numbers (like 0.01 or 0.1)
  • They shrink exponentially
  • This leads to almost no learning in early layers
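
A back-of-the-envelope illustration of that shrinkage (the per-layer factors are made up):

    # Each layer contributes a small derivative factor during backprop
    layer_factors = [0.1, 0.1, 0.1, 0.1, 0.1]   # 5 layers, roughly 0.1 each

    grad = 1.0
    for f in layer_factors:
        grad *= f
    print(grad)   # about 1e-05: the signal reaching layer 1 is ~100,000x smaller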

3. In Stock Price Prediction – Why It Hurts

Suppose we’re using a 3-layer feedforward network to predict tomorrow’s stock price based on 5 past days.

What happens?

  • We feed in prices: [105, 107, 106, 108, 110]
  • Data passes through 3 layers using tanh activation
  • During backpropagation, suppose gradients become:
    • Layer 3: 0.05
    • Layer 2: 0.01
    • Layer 1: 0.0005
  • So, Layer 1 (close to input) learns almost nothing — this means:
    • Our network cannot understand deeper patterns across days
    • It becomes biased toward recent values, not sequence trends

4. How LeCun Initialization Helps

LeCun initialization ensures that weights are small enough to avoid entering those “saturated” zones of tanh/sigmoid.

Let’s explain it visually:

Imagine a tanh curve:

  x:        -3         0        +3
  tanh(x):  -0.995     0        +0.995

If input to tanh is:

  • Too big (say +5) → output ≈ 1 → derivative ≈ 0 → no learning
  • Well-controlled (say 0.5) → output ≈ 0.46 → derivative ≈ 0.79 → healthy learning

LeCun idea:

Ensures that the signal starts small, centered around 0, inside the steep part of the tanh/sigmoid curve.

Result:

  • Activations don’t saturate
  • Gradients stay useful
  • Early layers learn meaningful features
  • Overall training becomes smoother and faster
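
A small experiment along those lines: push random inputs through a stack of tanh layers and compare how large the activations get under a naive N(0, 1) init versus LeCun's N(0, 1/n_in). The depth, width, and data here are arbitrary:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(1000, 100))      # 1000 random inputs with 100 features

    def mean_abs_activation(x, n_layers=5, lecun=True):
        h = x
        for _ in range(n_layers):
            n_in = h.shape[1]
            std = np.sqrt(1.0 / n_in) if lecun else 1.0
            W = rng.normal(0.0, std, size=(n_in, n_in))
            h = np.tanh(h @ W)
        return np.abs(h).mean()

    print("naive N(0,1) init:", mean_abs_activation(x, lecun=False))  # close to 1: saturated
    print("LeCun init:       ", mean_abs_activation(x, lecun=True))   # stays clearly below 1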

5. Real Stock Use Case Walkthrough

  • We feed in 5-day windows of stock prices
  • A 3-layer tanh-based network predicts the next price
  • Compare random init vs LeCun init (see the sketch at the end of this section)

With random init:

  • Early layer gradients die out
  • Model learns only surface-level patterns
  • Slower convergence

With LeCun init:

  • Gradients flow better
  • Layer 1 learns how past prices shape trends
  • Better generalization to unseen sequences
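
One rough way to make that comparison concrete is to look at the tanh derivative factor, 1 − tanh(z)^2, at each layer, since that factor scales the gradient on its way back to layer 1. The sketch below uses made-up layer widths and random data in place of real price windows:

    import numpy as np

    rng = np.random.default_rng(7)

    def mean_tanh_derivative_per_layer(stds, widths=(5, 32, 32, 32), n_samples=1000):
        # Forward a batch through 3 tanh layers; record mean(1 - tanh(z)^2) per layer
        h = rng.normal(size=(n_samples, widths[0]))   # stand-in for normalized 5-day windows
        factors = []
        for i, std in enumerate(stds):
            W = rng.normal(0.0, std, size=(widths[i], widths[i + 1]))
            h = np.tanh(h @ W)
            factors.append(round(float(np.mean(1.0 - h ** 2)), 3))
        return factors

    naive = mean_tanh_derivative_per_layer([1.0, 1.0, 1.0])
    lecun = mean_tanh_derivative_per_layer([np.sqrt(1 / 5), np.sqrt(1 / 32), np.sqrt(1 / 32)])
    print("naive N(0,1) init:", naive)   # small factors: gradients shrink quickly
    print("LeCun init:       ", lecun)   # factors stay well away from zero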

Concept               | Description
Vanishing Gradient    | Gradients shrink as they are backpropagated through the layers
Why It Happens        | Sigmoid/tanh derivatives ≈ 0 for large inputs
Effect on Stock Model | Early layers (closer to past days) don't learn, hurting sequence learning
LeCun Helps Because   | It scales weights to keep activations in the active gradient zone
Real Benefit          | Stable learning in tanh/sigmoid-based networks

LeCun Initialization Applicability in Neural Networks – LeCun Initialization Example with Simple Python