Classification
Lesson 2 ⏱ 12 min

The sigmoid function


The Sigmoid - Squashing the Real Line into (0,1)

Construction of the sigmoid from first principles, key values at z = 0, ±2, ±5, the elegant derivative formula, and why the vanishing gradient problem emerges at saturation.

⏱ ~7 min

🧮 Quick refresher

Exponential function e^x

e ≈ 2.718 is Euler's number. e^x grows rapidly for positive x and approaches 0 for large negative x. e^0 = 1.

Example

e^1 ≈ 2.72.

e^2 ≈ 7.39.

e^(-1) ≈ 0.37.

e^(-large number) ≈ 0.
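
If you want to sanity-check these values yourself, here is a quick NumPy sketch (nothing lesson-specific, just the refresher numbers):

import numpy as np

# Verify the refresher values
print(np.exp(1))     # 2.718... (Euler's number)
print(np.exp(2))     # 7.389...
print(np.exp(-1))    # 0.3678...
print(np.exp(0))     # 1.0
print(np.exp(-50))   # ~1.9e-22: effectively 0 for large negative inputs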

Building the Intuition

We need a function f: ℝ → (0, 1) — one that takes any real number z and maps it to a value strictly between 0 and 1. Requirements:

  • Smooth: differentiable everywhere (required for gradient descent)
  • Monotone: larger z → larger output (more evidence for class 1 → higher probability)
  • Saturating: approach 0 as z → −∞ and approach 1 as z → +∞
  • Symmetric: f(0) = 0.5 (no input signal = maximum uncertainty)

The sigmoid function converts any real number to a probability between 0 and 1. It's the original neural network activation and still appears in output layers for binary classification — from spam filters to medical risk scoring systems.

One natural construction: start with e^z (always positive), then normalize it: e^z / (e^z + 1) is always in (0, 1). Multiply top and bottom by e^{-z} and you get the classic form:

\sigma(z) = \frac{1}{1 + e^{-z}}

σ - the sigmoid (logistic) function; smoothly maps any real number to the open interval (0, 1)
e - Euler's number ≈ 2.718, base of the natural exponential
z - the input, usually the linear score w·x + b

This is the sigmoid (also called the logistic function). It's S-shaped, smooth, and bounded.
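
The algebra above is easy to verify numerically. A minimal sketch (variable names are ours) checking that the normalized-exponential form and the classic form agree:

import numpy as np

z = np.linspace(-10, 10, 9)
normalized = np.exp(z) / (np.exp(z) + 1)   # e^z / (e^z + 1)
classic = 1.0 / (1.0 + np.exp(-z))         # 1 / (1 + e^{-z})
print(np.allclose(normalized, classic))    # True: same function, two spellings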

Key Values

Let's verify it does what we want:

  z     e^{-z}     σ(z)        Interpretation
 -5     148        ≈ 0.007     Nearly certain class 0
 -2     7.39       ≈ 0.12      Likely class 0
  0     1          0.5         Maximum uncertainty
  2     0.135      ≈ 0.88      Likely class 1
  5     0.007      ≈ 0.993     Nearly certain class 1

As z → +∞: e^{-z} → 0, so σ(z) → 1. As z → −∞: e^{-z} → +∞, so σ(z) → 0.

The sigmoid never actually reaches 0 or 1 — the model never expresses absolute certainty. That's appropriate: any finite amount of evidence should leave some residual doubt.
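
One practical footnote (an observation about float64 arithmetic, not part of the math): in code, the output can round to exactly 1.0 once z is large enough, even though the true value never gets there.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(36.0))        # 0.9999999999999998: still strictly below 1
print(sigmoid(38.0))        # 1.0 in float64; e^{-38} falls below machine epsilon
print(sigmoid(38.0) < 1.0)  # False: floating point, not the math, hit the ceiling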

The Derivative and Why It Matters

Using the quotient rule, the derivative of σ has a beautifully compact form:

\sigma'(z) = \sigma(z) \cdot \bigl(1 - \sigma(z)\bigr)
σ'(z) - derivative of sigmoid at z; how fast the sigmoid changes at this point
σ(z) - sigmoid output at z
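
For the curious, the derivation is short. Write σ(z) = (1 + e^{-z})^{-1}, differentiate, and regroup:

\sigma'(z) = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z) \cdot \bigl(1 - \sigma(z)\bigr)

using the fact that 1 − σ(z) = e^{-z} / (1 + e^{-z}).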

The derivative at any point is just the output times one minus the output. Let's check the range:

  • Maximum derivative: occurs at z = 0 where σ(0) = 0.5. Maximum value: 0.5 × 0.5 = 0.25. The steepest the sigmoid ever gets is a slope of 0.25.
  • At z = 5: σ'(5) ≈ 0.993 × 0.007 ≈ 0.007. Nearly flat.
  • At z = 10: σ'(10) ≈ 0.00005. Essentially zero.

This flatness at saturation is where the vanishing gradient problem comes from: a saturated sigmoid passes almost no gradient back, and since even the maximum slope is 0.25, stacking sigmoid layers multiplies gradients by at most 0.25 per layer, shrinking them exponentially with depth.

Interpreting Sigmoid: Log-Odds

When we use sigmoid in logistic regression, the linear output z = w·x + b has a specific probabilistic meaning. The value z is the log-odds (logit):

z = \log\frac{P(y=1 \mid \mathbf{x})}{P(y=0 \mid \mathbf{x})} \implies P(y=1 \mid \mathbf{x}) = \sigma(z)

z - the linear score; the log-odds that the example is class 1
P(y=1 | x) - probability of class 1 given input x

This gives logistic regression its probabilistic interpretation: z measures "how much evidence for class 1," and σ converts that evidence score into a calibrated probability.

When z > 0: σ(z) > 0.5 → predict class 1. When z < 0: σ(z) < 0.5 → predict class 0. When z = 0: exactly on the boundary, 50/50 uncertainty.
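
The relationship is invertible: given a probability, the log of its odds recovers z. A small sketch (the logit helper is our own illustration, defined below, not imported from a library):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    # inverse of sigmoid: probability -> log-odds
    return np.log(p / (1.0 - p))

p = 0.88                                 # "likely class 1" from the table above
z = logit(p)
print(f"log-odds: {z:.2f}")              # ~1.99: odds are 0.88/0.12 ≈ 7.3, log(7.3) ≈ 1.99
print(f"round trip: {sigmoid(z):.2f}")   # 0.88, back where we started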

Code: The Sigmoid in Python

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Key values
print(f"{'z':>4}   σ(z)")
for z in [-5, -2, 0, 2, 5]:
    print(f"{z:4d}   {sigmoid(z):.4f}")
# -5   0.0067  (nearly certain class 0)
#  0   0.5000  (maximum uncertainty)
#  5   0.9933  (nearly certain class 1)

# Derivative: σ'(z) = σ(z) · (1 − σ(z))
def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)

print(f"\nMax derivative at z=0:  {sigmoid_deriv(0):.4f}")    # 0.2500
print(f"Derivative at z=5:      {sigmoid_deriv(5):.6f}")      # 0.006648 — already small
print(f"Derivative at z=10:     {sigmoid_deriv(10):.8f}")     # 0.00004540 — near zero

# In a 10-layer network: (0.25)^10 ≈ 0.000001 — the vanishing gradient problem
Interactive: Sigmoid vs. Step Function

In the interactive, the sigmoid becomes σ(t·z), where t controls sharpness. As t → ∞ it approaches the step function, but the useful derivative is lost (gradient = 0 everywhere except at the jump).
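
A quick numeric sketch of the same idea (the t values and the sample points ±0.5 are arbitrary choices here):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# sigmoid(t * z) sharpens toward a step as t grows
for t in [1, 5, 100]:
    print(t, sigmoid(t * 0.5), sigmoid(t * -0.5))
# t=1:   0.62 / 0.38   (soft)
# t=5:   0.92 / 0.08   (sharper)
# t=100: ~1.0 / ~0.0   (step-like, and the gradient at ±0.5 is ~0)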

Quiz


What is σ(0)?