The Critical Role of Nonlinearity
Here is a fact that might seem counterintuitive at first: if you remove all activation functions from a neural network and just stack linear layers, you get a completely useless deep network — one that is equivalent to a single linear layer.
A linear function is one you can describe with a straight line (or flat plane in higher dimensions): output = (some constant) × input + (another constant). Multiplying a matrix by a vector is a linear operation. Adding a bias is linear. Composing two linear operations together is still linear — that's the key property. No matter how many times you compose them, you never leave the "flat surface" world.
Let's prove it. Suppose you have two linear layers. Layer 1 computes $h = W_1 x + b_1$ and Layer 2 computes $y = W_2 h + b_2$. Substitute:

$y = W_2 (W_1 x + b_1) + b_2 = (W_2 W_1) x + (W_2 b_1 + b_2)$

where:
- $W_1$: weights of layer 1
- $W_2$: weights of layer 2
- $b_1$: bias of layer 1
- $b_2$: bias of layer 2

That is $y = W x + b$ for $W = W_2 W_1$ and $b = W_2 b_1 + b_2$ — a single linear transformation. Stack 100 layers: still one linear transformation. You have gained nothing. The network can only represent linear functions of the input, no matter how deep.
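The collapse is easy to verify numerically. A minimal sketch (the layer sizes and random seed are arbitrary choices of mine), matching the $W_1, b_1, W_2, b_2$ names in the derivation above:

```python
# Two stacked linear layers produce exactly the same output as the
# single collapsed layer W = W2 W1, b = W2 b1 + b2.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                                # input vector
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)  # layer 1
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)  # layer 2

two_layers = W2 @ (W1 @ x + b1) + b2                  # depth-2 "network"

W, b = W2 @ W1, W2 @ b1 + b2                          # collapsed weights
one_layer = W @ x + b                                 # depth-1 equivalent

assert np.allclose(two_layers, one_layer)             # identical (up to float error)
```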
The nonlinearity introduced by activation functions is what gives neural networks their power. It lets the composition of layers represent nonlinear functions, which is the whole point.
Sigmoid

$\sigma(z) = \frac{1}{1 + e^{-z}}$

where:
- $z$: pre-activation weighted sum
- $e$: Euler's number, approximately 2.718

Output range: (0, 1) — can be interpreted as a probability.
Pros: smooth, bounded, historically well-understood.
Cons:
- Saturates. For large $|z|$, $\sigma(z) \approx 1$ or $\sigma(z) \approx 0$, and $\sigma'(z) \approx 0$. In plain language: training needs a useful 'nudge' telling each weight which way to move. When the derivative is almost zero, that nudge disappears and learning slows to a crawl. Think of it like a volume dial that's been turned all the way up — you can barely hear the difference if you turn it just a little more.
- Not zero-centered. Outputs are always in (0, 1), so all weight updates in the next layer are the same sign, leading to inefficient zig-zag optimization.
Use case: output layer for binary classification. Avoid in hidden layers.
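The saturation effect is easy to see numerically. A quick illustrative check (helper names are mine), using the identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$:

```python
# The sigmoid's gradient peaks at 0.25 (at z = 0) and collapses toward
# zero as |z| grows -- the vanishing-gradient problem in miniature.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z={z:5.1f}  sigma={sigmoid(z):.5f}  grad={sigmoid_grad(z):.6f}")
```

By $z = 10$ the gradient is on the order of $10^{-5}$: weights feeding a saturated sigmoid barely move.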
Tanh

$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$

where:
- $z$: pre-activation value
- $e$: Euler's number

Output range: (-1, 1) — zero-centered.
Pros: zero-centered, so weight updates can be positive or negative.
Cons: still saturates at ±1, so gradients still vanish for large $|z|$. The max derivative of tanh is 1.0 (vs sigmoid's 0.25), so gradients survive longer, but not indefinitely.
Use case: hidden layers of RNNs. Rarely used in modern feedforward networks.
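The 1.0-vs-0.25 comparison can be checked directly with the closed-form derivatives $1 - \tanh^2(z)$ and $\sigma(z)(1 - \sigma(z))$; a small sketch (helper names are mine):

```python
# tanh passes a stronger gradient near z = 0 than sigmoid (1.0 vs 0.25),
# but both derivatives vanish once the unit saturates.
import math

def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

print(tanh_grad(0.0), sigmoid_grad(0.0))   # peak gradients: 1.0 vs 0.25
print(tanh_grad(5.0), sigmoid_grad(5.0))   # both nearly zero when saturated
```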
ReLU

$\mathrm{ReLU}(z) = \max(0, z)$

where:
- $z$: pre-activation weighted sum

Output range: $[0, \infty)$. Not bounded above.
Derivative: $1$ if $z > 0$, $0$ if $z < 0$. (At exactly $z = 0$ the derivative is technically undefined, but frameworks define it as 0 — this almost never matters in practice.)
Pros:
- For positive inputs, gradient is exactly 1. No vanishing gradient in that region.
- Computationally trivial: just a comparison and a clamp.
- Sparse activations: roughly half the neurons output 0 at any time, which has regularizing effects.
Cons — the dying ReLU problem: if a neuron's inputs are consistently negative, its output is always 0 and its gradient is always 0. The weights receive no signal. This can happen if the learning rate is too high, pushing weights into a regime where the neuron is always off. Once that happens, the neuron may stay silent forever unless something else shifts it back into the positive region.
Use case: default for hidden layers in most modern feedforward and convolutional networks.
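The dying-ReLU failure mode can be staged in a toy setup (entirely my own construction: a single neuron whose bias has been pushed far negative, as an oversized update might do):

```python
# Once a neuron's pre-activation is negative for every input, both its
# output and its gradient are identically zero -- gradient descent has
# nothing to work with, so the neuron cannot revive itself.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)

rng = np.random.default_rng(1)
x = rng.normal(size=(100, 5))   # 100 samples, 5 features
w = rng.normal(size=5)
b = -50.0                       # bias pushed far negative by a bad update

z = x @ w + b                   # pre-activation for every sample
print(relu(z).max())            # the neuron never fires
print(relu_grad(z).sum())       # zero gradient signal across all samples
```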
(An aside on sigmoid: scaling its input gives $\sigma(t \cdot z)$, where $t$ controls sharpness. As $t \to \infty$ it approaches the step function, but the derivative then becomes 0 everywhere except at the origin, leaving gradient-based training nothing to use.)
Leaky ReLU

$\mathrm{LeakyReLU}(z) = \max(\alpha z, z)$

where:
- $\alpha$: small slope for negative inputs, typically 0.01
- $z$: pre-activation value

A small tweak: instead of completely zeroing negative inputs, multiply them by a small slope $\alpha = 0.01$. This eliminates the dying ReLU problem — even very negative neurons still receive a gradient of $\alpha = 0.01$. The improvement is not always dramatic, but the cost is near-zero.
Parametric ReLU (PReLU) makes $\alpha$ a learnable parameter, letting the network decide how much of the negative activations to let through.
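A minimal sketch of the tweak (assuming $\alpha = 0.01$ as above; helper names are mine):

```python
# Leaky ReLU keeps a small gradient alive on the negative side, so a
# neuron that goes quiet can still be pulled back by gradient descent.
import numpy as np

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

z = np.array([-5.0, -0.5, 0.5, 5.0])
print(leaky_relu(z))        # negative inputs are scaled by alpha, not zeroed
print(leaky_relu_grad(z))   # gradient is alpha on the negative side, never 0
```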
GELU

$\mathrm{GELU}(z) = z \cdot \Phi(z)$

where:
- $\Phi$: standard normal cumulative distribution function
- $z$: pre-activation value

The GELU (Gaussian Error Linear Unit) is defined as $\mathrm{GELU}(z) = z \cdot \Phi(z)$, where $\Phi$ is the standard normal CDF. Intuitively, it scales each input by the probability that a standard normal variable falls below it: a smooth, probabilistic version of ReLU's hard gate.
GELU slightly outperforms ReLU on many benchmarks. It is the default activation in BERT, GPT-2, GPT-3, and most modern transformer architectures.
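A sketch of the exact form using the standard library's `math.erf`, via the identity $\Phi(z) = \tfrac{1}{2}\left(1 + \mathrm{erf}(z / \sqrt{2})\right)$:

```python
# GELU behaves like a smooth ReLU: near-zero output for negative
# inputs, close to the identity for large positive ones, with no hard
# corner at z = 0.
import math

def gelu(z):
    phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
    return z * phi

for z in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    print(f"{z:5.1f} -> {gelu(z):.4f}")
```

In practice, frameworks also ship a faster tanh-based approximation of GELU; either form works for the comparisons in this section.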
Choosing the Right Activation
| Context | Activation |
|---|---|
| Output, binary classification | sigmoid |
| Output, multi-class | softmax |
| Output, regression | none (linear) |
| Hidden layers, most networks | ReLU or GELU |
| Hidden layers, RNNs | tanh |