Building the Intuition
We need a function — one that takes any real number and maps it to a value strictly between 0 and 1. Requirements:
- Smooth: differentiable everywhere (required for gradient descent)
- Monotone: larger $z$ → larger output (more evidence for class 1 → higher probability)
- Saturating: approach 0 as $z \to -\infty$ and approach 1 as $z \to +\infty$
- Symmetric: output exactly $0.5$ at $z = 0$ (no input signal = maximum uncertainty)
The sigmoid function converts any real number to a probability between 0 and 1. It's the original neural network activation and still appears in output layers for binary classification — from spam filters to medical risk scoring systems.
One natural construction: start with $e^z$ (always positive), then normalize it: $\frac{e^z}{1 + e^z}$ is always in $(0, 1)$. Multiply top and bottom by $e^{-z}$ and you get the classic form:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
- $\sigma$ - the sigmoid (logistic) function. Smoothly maps any real number to the open interval $(0, 1)$
- $e$ - Euler's number ≈ 2.718 - base of the natural exponential
- $z$ - the input - usually the linear score w·x + b
This is the sigmoid function (also called the logistic function). It's S-shaped, smooth, and bounded.
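To make the algebra concrete, here's a quick numerical check (a sketch of my own, not from the original text) that the normalized form and the classic form are the same function:

```python
import numpy as np

# The two forms from the construction above:
#   normalized: e^z / (1 + e^z)
#   classic:    1 / (1 + e^{-z})
# The second is the first with top and bottom multiplied by e^{-z},
# so they should agree at every point.
z = np.linspace(-6.0, 6.0, 13)
normalized = np.exp(z) / (1.0 + np.exp(z))
classic = 1.0 / (1.0 + np.exp(-z))
print(np.allclose(normalized, classic))  # True
```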
Key Values
Let's verify it does what we want:
| $z$ | $e^{-z}$ | $\sigma(z)$ | Interpretation |
|---|---|---|---|
| −5 | ≈ 148.4 | ≈ 0.0067 | Nearly certain class 0 |
| −2 | ≈ 7.39 | ≈ 0.12 | Likely class 0 |
| 0 | 1 | 0.5 | Maximum uncertainty |
| 2 | ≈ 0.135 | ≈ 0.88 | Likely class 1 |
| 5 | ≈ 0.0067 | ≈ 0.9933 | Nearly certain class 1 |
As $z \to +\infty$: $e^{-z} \to 0$, so $\sigma(z) \to 1$. As $z \to -\infty$: $e^{-z} \to \infty$, so $\sigma(z) \to 0$.
The sigmoid never actually reaches 0 or 1 — the model never expresses absolute certainty. That's appropriate: any finite amount of evidence should leave some residual doubt.
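A quick numerical look (my own sketch, not part of the original) shows the saturation, plus one practical caveat: in exact arithmetic σ(z) never reaches 1, but in float64 the result rounds to exactly 1.0 once z is large enough (around z ≈ 37).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# In exact arithmetic sigmoid(z) < 1 for every finite z...
for z in [5, 10, 20]:
    print(z, sigmoid(z))   # ~0.9933, ~0.99995, ~0.999999998: approaching 1, never reaching it
# ...but float64 has finite precision, so far enough out the result rounds to exactly 1.0
print(40, sigmoid(40))     # 1.0
```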
The Derivative and Why It Matters
Using the quotient rule, the derivative of $\sigma(z)$ has a beautifully compact form:

$$\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$$

- $\sigma'(z)$ - derivative of sigmoid at $z$ - how fast the sigmoid changes at this point
- $\sigma(z)$ - sigmoid output at $z$
The derivative at any point is just the output times one minus the output. Let's check the range:
- Maximum derivative: occurs at $z = 0$, where $\sigma(0) = 0.5$. Maximum value: $0.5 \times 0.5 = 0.25$. The steepest the sigmoid ever gets is a slope of 0.25.
- At $z = 5$: $\sigma'(5) \approx 0.993 \times 0.007 \approx 0.007$. Nearly flat.
- At $z = 10$: $\sigma'(10) \approx 0.00005$. Essentially zero.
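If you want to sanity-check the closed form $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, here is a small sketch (my addition; `numeric_deriv` is a hypothetical helper, not from the original) comparing it against a centered finite-difference estimate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    # Closed form: sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def numeric_deriv(f, z, h=1e-6):
    # Hypothetical helper: centered finite difference, an independent slope estimate
    return (f(z + h) - f(z - h)) / (2.0 * h)

for z in [0.0, 2.0, 5.0]:
    print(z, sigmoid_deriv(z), numeric_deriv(sigmoid, z))
# Both columns agree closely: 0.25 at z=0, ~0.105 at z=2, ~0.0066 at z=5
```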
Interpreting Sigmoid: Log-Odds
When we use sigmoid in logistic regression, the linear output $z$ has a specific probabilistic meaning. The value $z$ is the log-odds (logit) of class 1:

$$z = \log\frac{P(y = 1 \mid x)}{P(y = 0 \mid x)}, \qquad \sigma(z) = P(y = 1 \mid x)$$

- $z$ - linear score - the log-odds that the example is class 1
- $\sigma(z)$ - probability of class 1 given input x
This gives logistic regression its probabilistic interpretation: $z$ measures "how much evidence for class 1," and $\sigma(z)$ converts that evidence score into a calibrated probability.
When $z > 0$: $\sigma(z) > 0.5$ → predict class 1. When $z < 0$: $\sigma(z) < 0.5$ → predict class 0. When $z = 0$: exactly on the boundary, 50/50 uncertainty.
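To tie the two views together, here's a short sketch (my own illustration; the `logit` helper name is mine, not from the original) showing that the log-odds is exactly the inverse of the sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    # Hypothetical helper: the log-odds log(p / (1 - p)), the inverse of sigmoid
    return np.log(p / (1.0 - p))

p = 0.88
z = logit(p)
print(z)            # ~1.99: the evidence score behind an 88% probability
print(sigmoid(z))   # ~0.88: sigmoid undoes the logit, recovering the probability
```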
Code: The Sigmoid in Python
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Key values
print(f"{'z':>4} σ(z)")
for z in [-5, -2, 0, 2, 5]:
    print(f"{z:4d} {sigmoid(z):.4f}")
# -5 0.0067 (nearly certain class 0)
#  0 0.5000 (maximum uncertainty)
#  5 0.9933 (nearly certain class 1)

# Derivative: σ'(z) = σ(z) · (1 − σ(z))
def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)

print(f"\nMax derivative at z=0: {sigmoid_deriv(0):.4f}")  # 0.2500
print(f"Derivative at z=5: {sigmoid_deriv(5):.6f}")        # 0.006648 — already small
print(f"Derivative at z=10: {sigmoid_deriv(10):.8f}")      # 0.00004540 — near zero

# In a 10-layer network: (0.25)^10 ≈ 0.000001 — the vanishing gradient problem
```
Scaled by a sharpness parameter t, the sigmoid becomes σ(t·z). As t → ∞ it approaches the step function, but it loses its useful derivative: the slope is 0 everywhere except at the jump, so there is nothing left for gradient descent to follow.
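As a final illustration (again my own sketch, not from the original text), increasing the sharpness t pushes σ(t·z) toward a hard 0/1 step:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Sharpened sigmoid: sigmoid(t * z). Larger t -> closer to a hard 0/1 step at z = 0.
z = np.array([-1.0, -0.1, 0.0, 0.1, 1.0])
for t in [1, 5, 50]:
    print(f"t={t:3d}:", np.round(sigmoid(t * z), 3))
# t=  1: roughly [0.269 0.475 0.5   0.525 0.731]
# t=  5: roughly [0.007 0.378 0.5   0.622 0.993]
# t= 50: roughly [0.    0.007 0.5   0.993 1.   ]
# The transition sharpens, but the slope away from z = 0 collapses toward zero,
# while the maximum slope at z = 0 grows as t/4.
```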