Activation Functions: Sigmoid, Tanh, ReLU, and Beyond

Visual comparison of activation function shapes, their derivatives, and where each one is used in modern networks.


Quick refresher

Function composition and derivatives

Composing two functions f(g(x)) means applying g first, then f. The chain rule says the derivative is f'(g(x)) times g'(x). Activation functions are applied elementwise after a linear transformation.

Example

ReLU(z) = max(0, z).

For z = 3: ReLU(3) = 3.

For z = -2: ReLU(-2) = 0.
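Here is a minimal NumPy sketch of the same idea: a linear transformation followed by an elementwise ReLU, plus the local derivative the chain rule needs. The weights and input are made-up values for illustration.

```python
import numpy as np

# Minimal sketch: g(x) = Wx + b (linear), then f = ReLU elementwise.
# Weights, bias, and input are illustrative values, not from the lesson.
W = np.array([[1.0, -2.0],
              [0.5,  1.5]])
b = np.array([0.0, -1.0])
x = np.array([3.0, -2.0])

z = W @ x + b          # pre-activation: the linear part g(x)
a = np.maximum(0, z)   # ReLU applied elementwise: max(0, z)

# Chain rule ingredient: d a_i / d z_i is 1 where z_i > 0, else 0.
local_grad = (z > 0).astype(float)
print(z, a, local_grad)  # [ 7.  -2.5] [7. 0.] [1. 0.]
```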

The Critical Role of Nonlinearity

Here is a fact that might seem counterintuitive at first: if you remove all activation functions from a neural network and just stack linear layers, you get a completely useless deep network — one that is equivalent to a single linear layer.

A linear function is one you can describe with a straight line (or flat plane in higher dimensions): output = (some constant) × input + (another constant). Multiplying a matrix by a vector is a linear operation. Adding a bias is linear. Composing two linear operations together is still linear — that's the key property. No matter how many times you compose them, you never leave the "flat surface" world.

Let's prove it. Suppose you have two linear layers. Layer 1 computes $a = W_1 x + b_1$ and Layer 2 computes $\text{output} = W_2 a + b_2$. Substitute:

$$\text{output} = W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2)$$

where $W_1$ and $W_2$ are the weight matrices of layers 1 and 2, and $b_1$ and $b_2$ are their biases.

That is $Wx + c$ with $W = W_2 W_1$ and $c = W_2 b_1 + b_2$: a single linear transformation. Stack 100 layers: still one linear transformation. You have gained nothing. The network can only represent linear functions of the input, no matter how deep.
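You can check this collapse numerically. A minimal NumPy sketch, with randomly drawn weights standing in for trained ones:

```python
import numpy as np

# Two linear layers equal one linear layer with W = W2 @ W1
# and c = W2 @ b1 + b2. Shapes here are illustrative.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x + b1) + b2
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_layers, one_layer))  # True
```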

The nonlinearity of the activation function is what gives neural networks their power. It lets the composition of layers represent nonlinear functions, which is the whole point.

Sigmoid

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where $z$ is the pre-activation weighted sum and $e$ is Euler's number ($\approx 2.718$).

Output range: (0, 1) — can be interpreted as a probability.

Pros: smooth, bounded, historically well-understood.

Cons:

  • Saturates. For large $|z|$, $\sigma(z) \approx 0$ or $1$, and $\sigma'(z) \approx 0$. In plain language: training needs a useful 'nudge' telling each weight which way to move. When the derivative is almost zero, that nudge disappears and learning slows to a crawl. Think of it like a volume dial that's been turned all the way up — you can barely hear the difference if you turn it just a little more.
  • Not zero-centered. Outputs are always in (0, 1), so all weight updates in the next layer are the same sign, leading to inefficient zig-zag optimization.

Use case: output layer for binary classification. Avoid in hidden layers.
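A quick NumPy sketch makes the saturation concrete: evaluating $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ at a few sample points shows the gradient collapsing as $z$ grows.

```python
import numpy as np

# Sigmoid and its derivative sigma'(z) = sigma(z) * (1 - sigma(z)),
# evaluated at a few illustrative points.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [0.0, 2.0, 5.0, 10.0]:
    s = sigmoid(z)
    print(f"z={z:5.1f}  sigmoid={s:.5f}  derivative={s * (1 - s):.2e}")
# At z = 10 the derivative is ~4.5e-05: almost no learning signal.
```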

Tanh

$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

where $z$ is the pre-activation value and $e$ is Euler's number.

Output range: (-1, 1), centered at zero.

Pros: zero-centered, so weight updates can be positive or negative.

Cons: still saturates at ±1. Still has vanishing gradients for large $|z|$. The max derivative of tanh is 1.0 (vs sigmoid's 0.25), so gradients survive longer, but not indefinitely.
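A two-line check of those peak derivatives, using the identities $\tanh'(z) = 1 - \tanh^2(z)$ and $\sigma'(z) = \sigma(z)(1 - \sigma(z))$:

```python
import numpy as np

# Peak gradients at z = 0: tanh preserves gradient magnitude
# four times better than sigmoid at its best point.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.0
print(1 - np.tanh(z) ** 2)            # 1.0  (tanh's max derivative)
print(sigmoid(z) * (1 - sigmoid(z)))  # 0.25 (sigmoid's max derivative)
```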

Use case: hidden layers of RNNs. Rarely used in modern feedforward networks.

ReLU

$$\text{ReLU}(z) = \max(0, z)$$

where $z$ is the pre-activation weighted sum.

Output range: $[0, +\infty)$. Not bounded above.

Derivative: 1 if $z > 0$, 0 if $z < 0$. (At exactly $z = 0$ the derivative is technically undefined, but frameworks define it as 0 — this almost never matters in practice.)

Pros:

  • For positive inputs, gradient is exactly 1. No vanishing gradient in that region.
  • Computationally trivial: just a comparison and a clamp.
  • Sparse activations: roughly half the neurons output 0 at any time, which has regularizing effects.

Cons — the dying ReLU problem: if a neuron's inputs are consistently negative, its output is always 0 and its gradient is always 0. The weights receive no signal. This can happen if the learning rate is too high, pushing weights into a regime where the neuron is always off. Once that happens, the neuron may stay silent forever unless something else shifts it back into the positive region.
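A minimal sketch of a dead neuron, using made-up pre-activations that happen to be all negative:

```python
import numpy as np

# If every pre-activation a neuron sees is negative, both its output
# and its gradient are zero, so its weights never get updated.
z = np.array([-3.2, -0.7, -5.1, -1.4])  # consistently negative (illustrative)

output = np.maximum(0, z)        # all zeros
grad = (z > 0).astype(float)     # all zeros: no signal reaches the weights
print(output, grad)              # [0. 0. 0. 0.] [0. 0. 0. 0.]
```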

Use case: default for hidden layers in most modern feedforward and convolutional networks.

Interactive demo: Sigmoid vs. Step Function

The sigmoid becomes σ(t·z), where t controls sharpness. As t → ∞ it approaches the step function — but the derivative is lost (the gradient is 0 everywhere except at the jump).
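A small sketch of that trade-off: by the chain rule, the derivative of $\sigma(tz)$ with respect to $z$ is $t\,\sigma(tz)(1 - \sigma(tz))$, and it collapses away from zero as $t$ grows.

```python
import numpy as np

# Sharpening sigma(t * z): the curve approaches a step, while the
# gradient t * s * (1 - s) vanishes away from z = 0.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.5  # a fixed point away from the jump (illustrative)
for t in [1, 4, 32, 256]:
    s = sigmoid(t * z)
    print(f"t={t:3d}  sigma(t*z)={s:.4f}  d/dz={t * s * (1 - s):.2e}")
```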

Leaky ReLU

$$\text{LeakyReLU}(z) = \max(\alpha z,\; z)$$

where $\alpha$ is a small slope for negative inputs (typically 0.01) and $z$ is the pre-activation value.

A small tweak: instead of completely zeroing negative inputs, multiply them by $\alpha = 0.01$. This eliminates the dying ReLU problem — even very negative neurons still receive a gradient of 0.01. The improvement is not always dramatic, but the cost is near-zero.

Parametric ReLU (PReLU) makes α\alpha a learnable parameter, letting the network decide how much to allow negative activations through.
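A minimal NumPy sketch of Leaky ReLU with the fixed $\alpha = 0.01$ variant (PReLU would instead treat alpha as a trainable parameter):

```python
import numpy as np

# Leaky ReLU: negative inputs keep a small slope, so the gradient
# there is alpha instead of zero.
ALPHA = 0.01

def leaky_relu(z):
    return np.where(z > 0, z, ALPHA * z)

def leaky_relu_grad(z):
    return np.where(z > 0, 1.0, ALPHA)

z = np.array([-3.0, -0.5, 0.0, 2.0])
print(leaky_relu(z))       # [-0.03  -0.005  0.     2.   ]
print(leaky_relu_grad(z))  # [0.01  0.01  0.01  1.  ]
```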

GELU

$$\text{GELU}(z) = z \cdot \Phi(z)$$

where $\Phi$ is the standard normal cumulative distribution function and $z$ is the pre-activation value.

The Gaussian Error Linear Unit scales each input $z$ by $\Phi(z)$, the probability that a standard normal variable falls below $z$ — a smooth, probabilistic version of ReLU's hard gate.

GELU slightly outperforms ReLU on many benchmarks. It is the default activation in BERT, GPT-2, GPT-3, and most modern transformer architectures.
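A sketch of exact GELU using Python's math.erf, via the identity $\Phi(z) = \tfrac{1}{2}\left(1 + \operatorname{erf}(z/\sqrt{2})\right)$:

```python
import math

# Exact GELU through the standard normal CDF.
def gelu(z):
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

for z in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    print(f"z={z:5.1f}  GELU(z)={gelu(z):8.4f}")
# Unlike ReLU, small negative inputs are scaled down smoothly rather
# than cut to zero: GELU(-0.5) is about -0.154, not 0.
```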

Choosing the Right Activation

Context                          Activation
Output, binary classification    sigmoid
Output, multi-class              softmax
Output, regression               none (linear)
Hidden layers, most networks     ReLU or GELU
Hidden layers, RNNs              tanh

Quiz


Without activation functions, stacking multiple linear layers is equivalent to...