The Problem with ReLU's Hard Zero
ReLU is elegant: f(x) = max(0, x). It's fast, it prevents vanishing gradients for positive activations, and it introduces useful sparsity. But its hard zero cutoff creates two problems:
Dying neurons: If a neuron's pre-activation is always negative (which can happen after a large gradient update), its output is always zero, its gradient is always zero, and it never updates again. For large networks, a significant fraction of neurons can permanently die.
Non-smooth gradient: The derivative of ReLU has a discontinuity at zero (jumping from 0 to 1). While not mathematically catastrophic, smooth activations empirically provide better gradient flow, especially in very deep networks.
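To make the dying-neuron failure concrete, here is a minimal sketch (the weight value and random inputs are invented purely for illustration): a neuron whose pre-activations are all negative produces zero outputs and receives exactly zero gradient, so it can never recover.

import torch
import torch.nn.functional as F

# A toy neuron whose weight was pushed strongly negative by a bad update.
w = torch.tensor([-3.0], requires_grad=True)
x = torch.rand(8, 1)                 # positive inputs, so every pre-activation x*w is negative
out = F.relu(x * w)                  # ReLU zeroes every output
out.sum().backward()
print(out.squeeze())                 # all zeros
print(w.grad)                        # tensor([0.]) -- zero gradient, the neuron never updates again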
Modern activations keep ReLU's core character — pass positive inputs, suppress negative ones — while softening the boundary.
GELU: Gaussian Error Linear Unit
GELU was popularized by NLP models (BERT uses it; GPT uses it). The idea: instead of a hard gate that asks "is x positive?" use a soft gate that asks "how likely is x to exceed a sample of standard Gaussian noise?"

GELU(x) = x · Φ(x)

where:
- Φ(x): the standard normal cumulative distribution function (CDF) — the fraction of a bell curve to the left of x
- x: the input activation

What Φ(x) means in plain language: Imagine you're looking at a standard bell curve (the normal distribution). Φ(x) is simply the fraction of the curve that lies to the left of the value x. So Φ(0) = 0.5 means exactly half the bell curve is to the left of zero. Φ(2) ≈ 0.977 means about 97.7% of the curve is to the left of 2. The further right you go, the closer Φ(x) gets to 1. The further left, the closer it gets to 0. GELU uses this as a smooth, probabilistic gate: "how confident are we that this activation should be treated as positive?"

Φ(x) ranges from 0 to 1 and is exactly 0.5 at x = 0. So GELU behaves like this:
- Large positive x (x=3): Φ(3) ≈ 0.999, GELU(3) ≈ 3 × 1 = 3 (passes through)
- At x=0: GELU(0) = 0 × 0.5 = 0
- Small negative x (x=-0.5): Φ(-0.5) ≈ 0.31, GELU(-0.5) ≈ -0.155 (small negative value, not zero!)
- Large negative x (x=-3): Φ(-3) ≈ 0.001, GELU(-3) ≈ -0.004 (nearly zero)
The key insight: mildly negative inputs are not hard-zeroed — a small fraction of their value passes through, keeping the gradient alive. Only strongly negative inputs are suppressed.
A fast approximation used in practice:

GELU(x) ≈ 0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 · x^3)))
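A quick numerical check of both forms, assuming PyTorch; the approximate='tanh' flag of F.gelu needs a reasonably recent version (roughly 1.12 or later):

import math
import torch
import torch.nn.functional as F

x = torch.tensor([-3.0, -0.5, 0.0, 0.5, 3.0])

# Exact GELU: x * Phi(x), with Phi written via the error function
exact = x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

# Tanh approximation from above
approx = 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

print(torch.allclose(exact, F.gelu(x)))                       # True
print(torch.allclose(approx, F.gelu(x, approximate='tanh')))  # True
print((exact - approx).abs().max())                           # tiny, on the order of 1e-3 or smaller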
Swish (SiLU): Self-Gated Activation
Swish uses the sigmoid function as its own gate:

Swish(x) = x · σ(x)

where:
- σ(x): the sigmoid function, 1/(1+e^{-x})
- x · σ(x) is also called SiLU (Sigmoid Linear Unit)

The value σ(x) acts as a smooth gate on x. Like GELU, Swish is smooth at zero and allows small negative values to influence the output. Unlike GELU, it doesn't require the normal CDF — just a sigmoid, which is fast.
Key property — non-monotone: Swish has a small region of negative values for x slightly below zero (the minimum is at roughly x ≈ -1.28). This non-monotonicity is unusual for activation functions and appears to help in practice by allowing the network more expressivity.
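A small sketch that locates this minimum numerically with a grid scan (the grid bounds and resolution are arbitrary):

import torch

x = torch.linspace(-5.0, 5.0, 100001)
swish = x * torch.sigmoid(x)          # Swish(x) = x * sigma(x)
i = torch.argmin(swish)
print(x[i].item(), swish[i].item())   # roughly x ≈ -1.28, minimum value ≈ -0.28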
GLU: Gated Linear Unit
Gated Linear Units take the gating idea further. The input is split into two equal halves, and one half gates the other:

GLU([a, b]) = a ⊙ σ(b)

where:
- a: the content half of the input vector
- b: the gate half of the input vector
- σ: the sigmoid function
- ⊙: elementwise multiplication

and the input x of dimension 2d is split into a and b, each of dimension d.

The sigmoid gate σ(b) learns which content features of a should pass — values near 1 open the gate, values near 0 close it. This is a soft, learned version of the attention mechanism (though simpler and cheaper).
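Written out by hand, the split-and-gate is just a chunk plus an elementwise product; it should match PyTorch's nn.GLU, which also appears in the code section below (the tensor shape here is arbitrary):

import torch

x = torch.randn(4, 8)                                  # dimension 2d = 8
a, b = x.chunk(2, dim=-1)                              # content half a, gate half b, each of dimension d = 4
out = a * torch.sigmoid(b)                             # GLU: sigmoid(b) gates a elementwise
print(torch.allclose(out, torch.nn.GLU(dim=-1)(x)))    # True
print(out.shape)                                       # torch.Size([4, 4])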
Variants:
- GEGLU: replace σ in the gate with GELU: a ⊙ GELU(b)
- SwiGLU: replace σ with Swish: a ⊙ Swish(b). Used in LLaMA, PaLM, Gemini.
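The variants change only the gate's nonlinearity. A quick sketch of all three on the same split (variable names are just illustrative):

import torch
import torch.nn.functional as F

x = torch.randn(4, 8)
a, b = x.chunk(2, dim=-1)

glu    = a * torch.sigmoid(b)   # original GLU
geglu  = a * F.gelu(b)          # GEGLU: GELU as the gate
swiglu = a * F.silu(b)          # SwiGLU: Swish (SiLU) as the gate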
Comparing Activation Values at Key Points
| x | ReLU | GELU | Swish | Note |
|---|---|---|---|---|
| −2.0 | 0.00 | −0.05 | −0.24 | Smooth activations let some signal through |
| −0.5 | 0.00 | −0.15 | −0.19 | Mildly negative: small nonzero output |
| 0.0 | 0.00 | 0.00 | 0.00 | All three pass zero through |
| 0.5 | 0.50 | 0.35 | 0.31 | Positive: small values slightly reduced |
| 2.0 | 2.00 | 1.95 | 1.76 | Large positive: nearly identical |
Practical Recommendations
- Transformers / LLMs: GELU (standard) or SwiGLU (state-of-the-art)
- CNNs for vision: ReLU still works well; GELU or Swish as drop-in replacements
- General MLP: ReLU is a solid default; swap to Swish for a potential small gain
- Memory-constrained: ReLU (fastest); GELU/Swish add minor overhead
Code in PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
# Built-in activations
print(F.gelu(x)) # GELU
print(F.silu(x)) # Swish (SiLU)
print(F.relu(x)) # ReLU for comparison
# GLU: expects even-dimensional input, splits along last dim
x2d = torch.randn(4, 8) # batch of 4, dimension 8
glu = nn.GLU(dim=-1)
out = glu(x2d) # output shape: (4, 4)
# SwiGLU in a feedforward layer
class SwiGLUFFN(nn.Module):
    """Feedforward block with a SwiGLU gate."""

    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)  # gate projection (passed through Swish)
        self.w2 = nn.Linear(d_model, d_ff)  # content projection
        self.w3 = nn.Linear(d_ff, d_model)  # output projection back to d_model

    def forward(self, x):
        # SwiGLU: Swish(w1 x) gates w2 x, then project back down
        return self.w3(F.silu(self.w1(x)) * self.w2(x))
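# Usage sketch (the dimensions below are just illustrative): this module drops in
# where a transformer block's standard two-layer FFN would go.
ffn = SwiGLUFFN(d_model=512, d_ff=2048)
y = ffn(torch.randn(2, 16, 512))   # (batch, seq_len, d_model)
print(y.shape)                     # torch.Size([2, 16, 512])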