Neural Networks
Lesson 9 ⏱ 10 min

Modern activations: GELU, Swish, GLU

Beyond ReLU - Smooth Activations and Gating

Why smooth activations improve gradient flow compared to ReLU's hard zero, deriving GELU from the Gaussian CDF, the self-gating property of Swish, and how Gated Linear Units extend the gating idea.

Quick refresher

ReLU activation function

ReLU(x) = max(0, x). For positive inputs, it passes the value through unchanged (gradient = 1). For negative inputs, it outputs zero (gradient = 0). This sparsity is efficient but creates 'dead neurons' — neurons that receive only negative inputs and stop learning entirely.

Example

For x = 2.0: ReLU(2) = 2, gradient = 1.

For x = -0.5: ReLU(-0.5) = 0, gradient = 0.

For a neuron that always receives negative input after training starts, its gradient is always 0 and it can never recover — the 'dying ReLU' problem.
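A quick way to see both the values and the gradients from the example is to run them through autograd (a minimal sketch in PyTorch):

```python
import torch

# The example above, checked with autograd: the negative input
# produces zero output AND zero gradient, so it cannot learn.
x = torch.tensor([2.0, -0.5], requires_grad=True)
y = torch.relu(x)
y.sum().backward()

print(y)       # tensor([2., 0.])
print(x.grad)  # tensor([1., 0.])
```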

The Problem with ReLU's Hard Zero

ReLU is elegant: f(x) = max(0, x). It's fast, it prevents vanishing gradients for positive activations, and it introduces useful sparsity. But its hard zero cutoff creates two problems:

Dying neurons: If a neuron's pre-activation is always negative (which can happen after a large gradient update), its output is always zero, its gradient is always zero, and it never updates again. For large networks, a significant fraction of neurons can permanently die.

Non-smooth gradient: The derivative of ReLU has a discontinuity at zero (jumping from 0 to 1). While not mathematically catastrophic, smooth activations empirically provide better gradient flow, especially in very deep networks.

Modern activations keep ReLU's core character — pass positive inputs, suppress negative ones — while softening the boundary.

GELU: Gaussian Error Linear Unit

GELU was introduced for NLP tasks and is used in both BERT and the GPT family. The idea: instead of a hard gate that asks "is x positive?", weight x by the probability that a sample from a standard Gaussian falls below x.

What Φ(x) means in plain language: Imagine you're looking at a standard bell curve (the normal distribution). Φ(x) is simply the fraction of the curve that lies to the left of the value x. So Φ(0) = 0.5 means exactly half the bell curve is to the left of zero. Φ(2) ≈ 0.977 means about 97.7% of the curve is to the left of 2. The further right you go, the closer Φ(x) gets to 1. The further left, the closer it gets to 0. GELU uses this as a smooth, probabilistic gate: "how confident are we that this activation should be treated as positive?"

GELU(x) = x · Φ(x)

where:
  • Φ(x): the standard normal cumulative distribution function (CDF), the fraction of a bell curve to the left of x
  • x: the input activation

The CDF Φ(x) ranges from 0 to 1 and is exactly 0.5 at x = 0. So GELU:

  • Large positive x (x=3): Φ(3) ≈ 0.999, GELU(3) ≈ 3 × 0.999 ≈ 3 (passes through)
  • At x=0: GELU(0) = 0 × 0.5 = 0
  • Small negative x (x=-0.5): Φ(-0.5) ≈ 0.31, GELU(-0.5) ≈ -0.154 (small negative value, not zero!)
  • Large negative x (x=-3): Φ(-3) ≈ 0.001, GELU(-3) ≈ -0.004 (nearly zero)

The key insight: mildly negative inputs are not hard-zeroed — they pass through a small fraction, keeping gradient alive. Only strongly negative inputs are suppressed.
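These values can be checked with a few lines of plain Python; Φ is available through the error function as Φ(x) = 0.5 · (1 + erf(x/√2)):

```python
import math

def phi(x):
    """Standard normal CDF: fraction of the bell curve left of x."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu(x):
    """Exact GELU: x weighted by the probability a Gaussian sample is below x."""
    return x * phi(x)

for v in [3.0, 0.0, -0.5, -3.0]:
    print(f"x={v:+.1f}  Phi={phi(v):.3f}  GELU={gelu(v):+.4f}")
```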

A fast approximation used in practice:

GELU(x) ≈ 0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715x³)))
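A quick sanity check that the tanh approximation tracks the exact erf-based form (a plain-Python sketch):

```python
import math

def gelu_exact(x):
    # Exact GELU via the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Tanh approximation: 0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3)))
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

for v in [-3.0, -0.5, 0.5, 3.0]:
    print(f"x={v:+.1f}  exact={gelu_exact(v):+.5f}  tanh={gelu_tanh(v):+.5f}")
```

The two agree to about three decimal places across the whole range, which is why the approximation was common before fast erf kernels were widespread.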

Swish (SiLU): Self-Gated Activation

Swish uses the sigmoid function as its own gate:

Swish(x) = x · σ(x)

where:
  • σ(x): the sigmoid function, 1/(1 + e^(-x))
  • Swish is also called SiLU (Sigmoid Linear Unit)

The value σ(x) ∈ (0, 1) acts as a smooth gate on x. Like GELU, Swish is smooth at zero and allows small negative values to influence the output. Unlike GELU, it doesn't require the normal CDF, just a sigmoid, which is fast.

Key property — non-monotone: Swish takes small negative values for x slightly below zero, reaching its minimum near x ≈ -1.28 (where Swish(x) ≈ -0.28). This non-monotonicity is unusual for activation functions and appears to help in practice by giving the network more expressivity.
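The minimum can be located numerically with a coarse grid search (a sketch; the grid step limits the precision):

```python
import math

def swish(x):
    # x * sigmoid(x), written as a single expression
    return x / (1.0 + math.exp(-x))

# Scan the negative region on a 0.001-spaced grid to find the minimum
xs = [i / 1000.0 for i in range(-3000, 1)]
x_min = min(xs, key=swish)
print(x_min, swish(x_min))  # minimum near x = -1.278, value near -0.278
```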

GLU: Gated Linear Unit

Gated Linear Units take the gating idea further. The input is split into two equal halves, and one half gates the other:

GLU(x) = x₁ ⊙ σ(x₂)

where:
  • x₁: the content half of the input vector
  • x₂: the gate half of the input vector
  • σ: the sigmoid function
  • ⊙: elementwise multiplication

The input x of dimension 2d is split into x₁ and x₂, each of dimension d.

The sigmoid of the gate half x₂ learns which content features should pass — values near 1 open the gate, values near 0 close it. This is a soft, learned gating mechanism, loosely related to attention but simpler and cheaper.
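A tiny numeric example makes the split-and-gate idea concrete; it matches PyTorch's built-in `F.glu`, which likewise treats the first half as content and the second as gate:

```python
import torch
import torch.nn.functional as F

# Hand-built GLU on a single vector of dimension 2d = 4 (so d = 2)
x = torch.tensor([1.0, -2.0, 5.0, -5.0])
x1, x2 = x.chunk(2)             # content half, gate half
manual = x1 * torch.sigmoid(x2)

# sigmoid(5) ~ 0.993 opens the first gate; sigmoid(-5) ~ 0.007 closes the second
print(manual)                   # approximately tensor([ 0.9933, -0.0134])

# Same result from the built-in
print(F.glu(x, dim=-1))
```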

Variants:

  • GEGLU: replace σ in the gate with GELU: GEGLU(x) = x₁ ⊙ GELU(x₂)
  • SwiGLU: replace σ with Swish: SwiGLU(x) = x₁ ⊙ Swish(x₂). Used in LLaMA, PaLM, Gemini.
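All three variants differ only in the gate's nonlinearity, so they can be written as one-liners on the two halves (a sketch; note that `F.silu` is Swish):

```python
import torch
import torch.nn.functional as F

x = torch.randn(6)              # dimension 2d = 6, so d = 3
x1, x2 = x.chunk(2)             # content half, gate half

glu    = x1 * torch.sigmoid(x2)  # GLU
geglu  = x1 * F.gelu(x2)         # GEGLU
swiglu = x1 * F.silu(x2)         # SwiGLU (silu = Swish)
print(glu.shape, geglu.shape, swiglu.shape)  # each torch.Size([3])
```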

Comparing Activation Values at Key Points

| x    | ReLU | GELU  | Swish | Note                                       |
|------|------|-------|-------|--------------------------------------------|
| −2.0 | 0.00 | −0.05 | −0.24 | Smooth activations let some signal through |
| −0.5 | 0.00 | −0.15 | −0.19 | Mildly negative: small nonzero output      |
| 0.0  | 0.00 | 0.00  | 0.00  | All three pass zero through                |
| 0.5  | 0.50 | 0.35  | 0.31  | Positive: small values slightly reduced    |
| 2.0  | 2.00 | 1.95  | 1.76  | Large positive: nearly identical           |
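These values can be reproduced with PyTorch's built-in activations (`F.gelu` defaults to the exact erf-based form):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
for name, y in [("ReLU", F.relu(x)), ("GELU", F.gelu(x)), ("Swish", F.silu(x))]:
    print(name, [round(v, 2) for v in y.tolist()])
```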

Practical Recommendations

  • Transformers / LLMs: GELU (standard) or SwiGLU (state-of-the-art)
  • CNNs for vision: ReLU still works well; GELU or Swish as drop-in replacements
  • General MLP: ReLU is a solid default; swap to Swish for a potential small gain
  • Memory-constrained: ReLU (fastest); GELU/Swish add minor overhead

Code in PyTorch

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

# Built-in activations
print(F.gelu(x))    # GELU
print(F.silu(x))    # Swish (SiLU)
print(F.relu(x))    # ReLU for comparison

# GLU: expects even-dimensional input, splits along last dim
x2d = torch.randn(4, 8)  # batch of 4, dimension 8
glu = nn.GLU(dim=-1)
out = glu(x2d)  # output shape: (4, 4)

# SwiGLU in a feedforward layer
class SwiGLUFFN(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_model, d_ff)
        self.w3 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.w3(F.silu(self.w1(x)) * self.w2(x))

Quiz

1 / 3

GELU(x) = x · Φ(x) where Φ is the standard normal CDF. For large positive x (e.g., x=5), what does GELU output approximately?