Building the Intuition
We need a function — one that takes any real number and maps it to a value strictly between 0 and 1. Requirements:
- Smooth: differentiable everywhere (required for gradient descent)
- Monotone: larger $z$ → larger output (more evidence for class 1 → higher probability)
- Saturating: approach 0 as $z \to -\infty$ and approach 1 as $z \to +\infty$
- Symmetric: output exactly $0.5$ at $z = 0$ (no input signal = maximum uncertainty)
The sigmoid function converts any real number to a probability between 0 and 1. It's the original neural network activation and still appears in output layers for binary classification — from spam filters to medical risk scoring systems.
One natural construction: start with $e^z$ (always positive), then normalize it: $\frac{e^z}{1 + e^z}$ is always in $(0, 1)$. Multiply top and bottom by $e^{-z}$ and you get the classic form:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
- $\sigma$ - the sigmoid (logistic) function. Smoothly maps any real number to the open interval $(0, 1)$
- $e$ - Euler's number ≈ 2.718 - base of the natural exponential
- $z$ - the input - usually the linear score w·x + b
This is the sigmoid function (also called the logistic function). It's S-shaped, smooth, and bounded.
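To make the algebra concrete, here's a quick numerical check (a sketch of my own, not from the original text) that the normalized form and the classic form are the same function:

```python
import numpy as np

# The two forms from the construction above:
#   normalized: e^z / (1 + e^z)
#   classic:    1 / (1 + e^{-z})
# The second is the first with top and bottom multiplied by e^{-z},
# so they should agree at every point.
z = np.linspace(-6.0, 6.0, 13)
normalized = np.exp(z) / (1.0 + np.exp(z))
classic = 1.0 / (1.0 + np.exp(-z))
print(np.allclose(normalized, classic))  # True
```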
Key Values
Let's verify it does what we want:
| $z$ | $e^{-z}$ | $\sigma(z)$ | Interpretation |
|---|---|---|---|
| −5 | ≈ 148.4 | ≈ 0.0067 | Nearly certain class 0 |
| −2 | ≈ 7.39 | ≈ 0.12 | Likely class 0 |
| 0 | 1 | 0.5 | Maximum uncertainty |
| 2 | ≈ 0.135 | ≈ 0.88 | Likely class 1 |
| 5 | ≈ 0.0067 | ≈ 0.9933 | Nearly certain class 1 |
As $z \to +\infty$: $e^{-z} \to 0$, so $\sigma(z) \to 1$. As $z \to -\infty$: $e^{-z} \to \infty$, so $\sigma(z) \to 0$.
The sigmoid never actually reaches 0 or 1 — the model never expresses absolute certainty. That's appropriate: any finite amount of evidence should leave some residual doubt.
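A quick numerical look (my own sketch, not part of the original) shows the saturation, plus one practical caveat: in exact arithmetic σ(z) never reaches 1, but in float64 the result rounds to exactly 1.0 once z is large enough (around z ≈ 37).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# In exact arithmetic sigmoid(z) < 1 for every finite z...
for z in [5, 10, 20]:
    print(z, sigmoid(z))   # ~0.9933, ~0.99995, ~0.999999998: approaching 1, never reaching it
# ...but float64 has finite precision, so far enough out the result rounds to exactly 1.0
print(40, sigmoid(40))     # 1.0
```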
The Derivative and Why It Matters
Using the quotient rule, the derivative of $\sigma(z)$ has a beautifully compact form:

$$\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$$

- $\sigma'(z)$ - derivative of sigmoid at $z$ - how fast the sigmoid changes at this point
- $\sigma(z)$ - sigmoid output at $z$
The derivative at any point is just the output times one minus the output. Let's check the range:
- Maximum derivative: occurs at $z = 0$, where $\sigma(0) = 0.5$. Maximum value: $0.5 \times 0.5 = 0.25$. The steepest the sigmoid ever gets is a slope of 0.25.
- At $z = 5$: $\sigma'(5) \approx 0.993 \times 0.007 \approx 0.007$. Nearly flat.
- At $z = 10$: $\sigma'(10) \approx 0.00005$. Essentially zero.
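If you want to sanity-check the closed form $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, here is a small sketch (my addition; `numeric_deriv` is a hypothetical helper, not from the original) comparing it against a centered finite-difference estimate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    # Closed form: sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def numeric_deriv(f, z, h=1e-6):
    # Hypothetical helper: centered finite difference, an independent slope estimate
    return (f(z + h) - f(z - h)) / (2.0 * h)

for z in [0.0, 2.0, 5.0]:
    print(z, sigmoid_deriv(z), numeric_deriv(sigmoid, z))
# Both columns agree closely: 0.25 at z=0, ~0.105 at z=2, ~0.0066 at z=5
```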
Interpreting Sigmoid: Log-Odds
When we use sigmoid in logistic regression, the linear output $z$ has a specific probabilistic meaning. The value $z$ is the log-odds (logit) of class 1:

$$z = \log\frac{P(y = 1 \mid x)}{P(y = 0 \mid x)}, \qquad \sigma(z) = P(y = 1 \mid x)$$

- $z$ - linear score - the log-odds that the example is class 1
- $\sigma(z)$ - probability of class 1 given input x
This gives logistic regression its probabilistic interpretation: $z$ measures "how much evidence for class 1," and $\sigma(z)$ converts that evidence score into a calibrated probability.
When $z > 0$: $\sigma(z) > 0.5$ → predict class 1. When $z < 0$: $\sigma(z) < 0.5$ → predict class 0. When $z = 0$: exactly on the boundary, 50/50 uncertainty.
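To tie the two views together, here's a short sketch (my own illustration; the `logit` helper name is mine, not from the original) showing that the log-odds is exactly the inverse of the sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    # Hypothetical helper: the log-odds log(p / (1 - p)), the inverse of sigmoid
    return np.log(p / (1.0 - p))

p = 0.88
z = logit(p)
print(z)            # ~1.99: the evidence score behind an 88% probability
print(sigmoid(z))   # ~0.88: sigmoid undoes the logit, recovering the probability
```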
Code: The Sigmoid in Python
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Key values
print(f"{'z':>4} σ(z)")
for z in [-5, -2, 0, 2, 5]:
    print(f"{z:4d} {sigmoid(z):.4f}")
# -5 0.0067 (nearly certain class 0)
#  0 0.5000 (maximum uncertainty)
#  5 0.9933 (nearly certain class 1)

# Derivative: σ'(z) = σ(z) · (1 − σ(z))
def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)

print(f"\nMax derivative at z=0: {sigmoid_deriv(0):.4f}")  # 0.2500
print(f"Derivative at z=5: {sigmoid_deriv(5):.6f}")        # 0.006648 — already small
print(f"Derivative at z=10: {sigmoid_deriv(10):.8f}")      # 0.00004540 — near zero

# In a 10-layer network: (0.25)^10 ≈ 0.000001 — the vanishing gradient problem
```
Scaled by a sharpness parameter t, the sigmoid becomes σ(t·z). As t → ∞ it approaches the step function, but it loses its useful derivative: the slope is 0 everywhere except at the jump, so there is nothing left for gradient descent to follow.
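As a final illustration (again my own sketch, not from the original text), increasing the sharpness t pushes σ(t·z) toward a hard 0/1 step:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Sharpened sigmoid: sigmoid(t * z). Larger t -> closer to a hard 0/1 step at z = 0.
z = np.array([-1.0, -0.1, 0.0, 0.1, 1.0])
for t in [1, 5, 50]:
    print(f"t={t:3d}:", np.round(sigmoid(t * z), 3))
# t=  1: roughly [0.269 0.475 0.5   0.525 0.731]
# t=  5: roughly [0.007 0.378 0.5   0.622 0.993]
# t= 50: roughly [0.    0.007 0.5   0.993 1.   ]
# The transition sharpens, but the slope away from z = 0 collapses toward zero,
# while the maximum slope at z = 0 grows as t/4.
```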