Classification
Lesson 3 ⏱ 12 min

Logistic regression


Logistic Regression - Linear Score into Probability

The two-step prediction pipeline (linear score then sigmoid), a concrete numerical example, and why logistic regression is still widely used in industry despite its simplicity.

⏱ ~7 min


Quick refresher

Sigmoid function

σ(z) = 1/(1+e^(-z)) squashes any real number to (0,1). σ(0)=0.5. Large positive z → σ(z)→1. Large negative z → σ(z)→0.

Example

σ(2) ≈ 0.88 (88% confidence class 1).

σ(-3) ≈ 0.047 (only 4.7% chance class 1).
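
A quick sketch of those values in Python (the same sigmoid helper reappears in the full code example further down):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # 0.5
print(sigmoid(2))    # ≈ 0.881, strong evidence for class 1
print(sigmoid(-3))   # ≈ 0.047, strong evidence for class 0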

The Model

The model combines two things you already know:

  1. The linear combination \mathbf{w} \cdot \mathbf{x} + b (same as linear regression)
  2. The sigmoid function \sigma(z) = 1/(1+e^{-z})

Logistic regression is the foundation of classification. Despite its simplicity, it's deployed in production at massive scale — as a fast baseline, an interpretable component, or the final layer of a deep network.

The full prediction is:

\hat{y} = \sigma(\mathbf{w} \cdot \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w} \cdot \mathbf{x} + b)}}

  • \hat{y} - predicted probability of class 1 for input \mathbf{x}
  • \sigma - sigmoid function; maps the linear score to a probability in (0, 1)
  • \mathbf{w} - weight vector; one weight per input feature
  • b - bias; shifts the decision threshold

Think of it as two sequential steps:

  • Step 1 — Linear score: z = \mathbf{w} \cdot \mathbf{x} + b. This is "how much evidence for class 1" as an unbounded real number. Positive z means more evidence for class 1; negative means more evidence for class 0.
  • Step 2 — Probability: \hat{y} = \sigma(z). Squeeze that evidence score into a valid probability in (0, 1).

The output represents P(y = 1 \mid \mathbf{x}): the probability that example \mathbf{x} belongs to class 1.

Making a Prediction

Once you have \hat{y} \in (0, 1), classify by thresholding at 0.5:

  • If \hat{y} > 0.5: predict class 1 (equivalently: z > 0)
  • If \hat{y} \leq 0.5: predict class 0 (equivalently: z \leq 0)

The 0.5 threshold is a sensible default but is not sacred. In a medical diagnostic test, you might lower the threshold to 0.3 — it's better to investigate more potential cases (tolerate false positives) than to miss true positives. The threshold is a design decision made after training, separate from the training objective.
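
As a small sketch, the threshold can simply be a parameter of the final classification step (the 0.42 probability and 0.3 threshold below are illustrative, not values from this lesson's example):

def classify(prob, threshold=0.5):
    # Turn a predicted probability into a 0/1 label at a chosen threshold
    return int(prob > threshold)

p = 0.42                              # hypothetical predicted probability
print(classify(p))                    # 0 with the default 0.5 threshold
print(classify(p, threshold=0.3))     # 1 with a lowered, more cautious threshold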

Concrete Example

Consider classifying whether a student passes an exam based on hours studied (x_1) and hours slept (x_2). After training, suppose the model has learned:

z = 0.8 \cdot x_1 + 0.5 \cdot x_2 + (-3.0)

  • w_1 = 0.8 - weight for study hours; studying helps significantly
  • w_2 = 0.5 - weight for sleep; sleep helps moderately
  • b = -3.0 - bias; most students fail without adequate preparation

For a student who studied 4 hours and slept 7 hours:

z = 0.8 \times 4 + 0.5 \times 7 - 3.0 = 3.7, \quad \hat{y} = \sigma(3.7) \approx 0.976

The model predicts a 97.6% chance of passing. Since 0.976 > 0.5, it predicts class 1 (pass).

A student who studied 1 hour and slept 5 hours:

z = 0.8 \times 1 + 0.5 \times 5 - 3.0 = 0.3, \quad \hat{y} = \sigma(0.3) \approx 0.574

The model only barely predicts passing: 0.574 is just above the 0.5 threshold, so it is much less confident.

The Decision Boundary

The decision boundary is where \hat{y} = 0.5, which means z = 0, which means:

\mathbf{w} \cdot \mathbf{x} + b = 0

That's a hyperplane — a line in 2D, a plane in 3D, a flat (p-1)-dimensional surface in p dimensions. This linearity is both logistic regression's strength (interpretable, fast) and its limitation (it can't capture curved boundaries).
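
Using the exam-example weights from above, you can solve the boundary equation for x_2 to see exactly where the prediction flips (a small sketch; the sample x_1 values are arbitrary):

import numpy as np

w = np.array([0.8, 0.5])
b = -3.0

# On the boundary: 0.8*x1 + 0.5*x2 - 3.0 = 0  =>  x2 = (3.0 - 0.8*x1) / 0.5
study_hours = np.array([0.0, 1.0, 2.0, 3.0])
sleep_hours = (-b - w[0] * study_hours) / w[1]

for x1_val, x2_val in zip(study_hours, sleep_hours):
    print(f"boundary point: study = {x1_val:.1f} h, sleep = {x2_val:.1f} h")
# Students above this line are predicted to pass, students below it to fail.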

Training

During training, we adjust \mathbf{w} and b to minimize a loss function. For logistic regression we use cross-entropy loss — the next lesson explains why MSE doesn't work here and derives the cross-entropy formula. Gradient descent iterates: compute the loss, compute the gradient \nabla L, update \mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla L.
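
For intuition, here is a minimal sketch of that loop on a hypothetical toy dataset, using the cross-entropy gradient (y_hat - y) · x that the next lesson derives:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy data: [study_hours, sleep_hours] -> pass (1) / fail (0)
X = np.array([[4.0, 7.0], [1.0, 5.0], [0.5, 4.0], [6.0, 8.0]])
y = np.array([1.0, 0.0, 0.0, 1.0])

w = np.zeros(2)   # start with zero weights
b = 0.0
alpha = 0.1       # learning rate

for _ in range(5000):
    y_hat = sigmoid(X @ w + b)            # forward pass: predicted probabilities
    grad_w = X.T @ (y_hat - y) / len(y)   # gradient of cross-entropy loss w.r.t. w
    grad_b = np.mean(y_hat - y)           # gradient w.r.t. b
    w -= alpha * grad_w                   # update: w <- w - alpha * grad
    b -= alpha * grad_b

print("learned w:", w, " b:", b)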

Code: Logistic Regression in Python

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Trained weights for the exam example
w = np.array([0.8, 0.5])   # [study_hours, sleep_hours]
b = -3.0                    # bias

# Full prediction pipeline: ŷ = σ(w · x + b)
def predict_proba(x):
    z = np.dot(w, x) + b   # Step 1: linear score (log-odds)
    return sigmoid(z)       # Step 2: squash to probability in (0, 1)

# Student 1: 4h study, 7h sleep
x1 = np.array([4.0, 7.0])
z1 = np.dot(w, x1) + b
p1 = predict_proba(x1)
print(f"Student 1: z = {z1:.1f},  P(pass) = {p1:.3f}  →  {'pass' if p1 > 0.5 else 'fail'}")
# z = 3.7,  P(pass) = 0.976  →  pass

# Student 2: 1h study, 5h sleep
x2 = np.array([1.0, 5.0])
z2 = np.dot(w, x2) + b
p2 = predict_proba(x2)
print(f"Student 2: z = {z2:.1f},  P(pass) = {p2:.3f}  →  {'pass' if p2 > 0.5 else 'fail'}")
# z = 0.3,  P(pass) = 0.574  →  pass (barely)

# Batch prediction using matrix multiplication
X = np.array([[4.0, 7.0],
              [1.0, 5.0],
              [0.5, 4.0],
              [6.0, 8.0]])
probs = sigmoid(X @ w + b)   # X @ w computes the batch dot product
for i, p in enumerate(probs):
    print(f"  Student {i+1}: P(pass) = {p:.3f}  →  {'pass' if p > 0.5 else 'fail'}")


Quiz

1 / 3

In logistic regression ŷ = σ(w·x + b), what does the sigmoid function do?