Classification
Lesson 3 ⏱ 12 min

Logistic regression


Logistic Regression - Linear Score into Probability

The two-step prediction pipeline (linear score then sigmoid), a concrete numerical example, and why logistic regression is still widely used in industry despite its simplicity.

⏱ ~7 min


Quick refresher

Sigmoid function

σ(z) = 1/(1+e^(-z)) squashes any real number to (0,1). σ(0)=0.5. Large positive z → σ(z)→1. Large negative z → σ(z)→0.

Example

σ(2) ≈ 0.88 (88% confidence class 1).

σ(-3) ≈ 0.047 (only 4.7% chance class 1).
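
A quick sketch of those values in Python (the same sigmoid helper reappears in the full code example further down):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # 0.5
print(sigmoid(2))    # ≈ 0.881, strong evidence for class 1
print(sigmoid(-3))   # ≈ 0.047, strong evidence for class 0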

The Model

The model combines two things you already know:

  1. The linear combination \mathbf{w} \cdot \mathbf{x} + b (same as linear regression)
  2. The sigmoid function \sigma(z) = 1/(1+e^{-z})

Logistic regression is the foundation of classification. Despite its simplicity, it's deployed in production at massive scale — as a fast baseline, an interpretable component, or the final layer of a deep network.

The full prediction is:

\hat{y} = \sigma(\mathbf{w} \cdot \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w} \cdot \mathbf{x} + b)}}

  • \hat{y} - predicted probability of class 1 for input \mathbf{x}
  • \sigma - sigmoid function; maps the linear score to a probability in (0, 1)
  • \mathbf{w} - weight vector; one weight per input feature
  • b - bias; shifts the decision threshold

Think of it as two sequential steps:

  • Step 1 — Linear score: z = \mathbf{w} \cdot \mathbf{x} + b. This is "how much evidence for class 1" as an unbounded real number. Positive z means more evidence for class 1; negative means more evidence for class 0.
  • Step 2 — Probability: \hat{y} = \sigma(z). Squeeze that evidence score into a valid probability in (0, 1).

The output represents P(y = 1 \mid \mathbf{x}): the probability that example \mathbf{x} belongs to class 1.

Making a Prediction

Once you have \hat{y} \in (0, 1), classify by thresholding at 0.5:

  • If \hat{y} > 0.5: predict class 1 (equivalently: z > 0)
  • If \hat{y} \leq 0.5: predict class 0 (equivalently: z \leq 0)

The 0.5 threshold is a sensible default but is not sacred. In a medical diagnostic test, you might lower the threshold to 0.3 — it's better to investigate more potential cases (tolerate false positives) than to miss true positives. The threshold is a design decision made after training, separate from the training objective.
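
As a small sketch, the threshold can simply be a parameter of the final classification step (the 0.42 probability and 0.3 threshold below are illustrative, not values from this lesson's example):

def classify(prob, threshold=0.5):
    # Turn a predicted probability into a 0/1 label at a chosen threshold
    return int(prob > threshold)

p = 0.42                              # hypothetical predicted probability
print(classify(p))                    # 0 with the default 0.5 threshold
print(classify(p, threshold=0.3))     # 1 with a lowered, more cautious threshold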

Concrete Example

Consider classifying whether a student passes an exam based on hours studied (x_1) and hours slept (x_2). After training, suppose the model has learned:

z = 0.8 \cdot x_1 + 0.5 \cdot x_2 + (-3.0)

  • w_1 = 0.8 - weight for study hours; studying helps significantly
  • w_2 = 0.5 - weight for sleep; sleep helps moderately
  • b = -3.0 - bias; most students fail without adequate preparation

For a student who studied 4 hours and slept 7 hours:

z = 0.8 \times 4 + 0.5 \times 7 - 3.0 = 3.7, \quad \hat{y} = \sigma(3.7) \approx 0.976

The model predicts a 97.6% chance of passing. Since 0.976 > 0.5, it predicts class 1 (pass).

A student who studied 1 hour and slept 5 hours:

z = 0.8 \times 1 + 0.5 \times 5 - 3.0 = 0.3, \quad \hat{y} = \sigma(0.3) \approx 0.574

The model only barely predicts passing: 0.574 is just above the 0.5 threshold, so it is much less confident.

The Decision Boundary

The decision boundary is where \hat{y} = 0.5, which means z = 0, which means:

\mathbf{w} \cdot \mathbf{x} + b = 0

That's a hyperplane — a line in 2D, a plane in 3D, a flat (p-1)-dimensional surface in p dimensions. This linearity is both logistic regression's strength (interpretable, fast) and its limitation (it can't capture curved boundaries).
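
Using the exam-example weights from above, you can solve the boundary equation for x_2 to see exactly where the prediction flips (a small sketch; the sample x_1 values are arbitrary):

import numpy as np

w = np.array([0.8, 0.5])
b = -3.0

# On the boundary: 0.8*x1 + 0.5*x2 - 3.0 = 0  =>  x2 = (3.0 - 0.8*x1) / 0.5
study_hours = np.array([0.0, 1.0, 2.0, 3.0])
sleep_hours = (-b - w[0] * study_hours) / w[1]

for x1_val, x2_val in zip(study_hours, sleep_hours):
    print(f"boundary point: study = {x1_val:.1f} h, sleep = {x2_val:.1f} h")
# Students above this line are predicted to pass, students below it to fail.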

Training

During training, we adjust \mathbf{w} and b to minimize a loss function. For logistic regression we use cross-entropy loss — the next lesson explains why MSE doesn't work here and derives the cross-entropy formula. Gradient descent iterates: compute the loss, compute the gradient \nabla L, update \mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla L.
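
For intuition, here is a minimal sketch of that loop on a hypothetical toy dataset, using the cross-entropy gradient (y_hat - y) · x that the next lesson derives:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy data: [study_hours, sleep_hours] -> pass (1) / fail (0)
X = np.array([[4.0, 7.0], [1.0, 5.0], [0.5, 4.0], [6.0, 8.0]])
y = np.array([1.0, 0.0, 0.0, 1.0])

w = np.zeros(2)   # start with zero weights
b = 0.0
alpha = 0.1       # learning rate

for _ in range(5000):
    y_hat = sigmoid(X @ w + b)            # forward pass: predicted probabilities
    grad_w = X.T @ (y_hat - y) / len(y)   # gradient of cross-entropy loss w.r.t. w
    grad_b = np.mean(y_hat - y)           # gradient w.r.t. b
    w -= alpha * grad_w                   # update: w <- w - alpha * grad
    b -= alpha * grad_b

print("learned w:", w, " b:", b)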

Code: Logistic Regression in Python

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Trained weights for the exam example
w = np.array([0.8, 0.5])   # [study_hours, sleep_hours]
b = -3.0                    # bias

# Full prediction pipeline: ŷ = σ(w · x + b)
def predict_proba(x):
    z = np.dot(w, x) + b   # Step 1: linear score (log-odds)
    return sigmoid(z)       # Step 2: squash to probability in (0, 1)

# Student 1: 4h study, 7h sleep
x1 = np.array([4.0, 7.0])
z1 = np.dot(w, x1) + b
p1 = predict_proba(x1)
print(f"Student 1: z = {z1:.1f},  P(pass) = {p1:.3f}  →  {'pass' if p1 > 0.5 else 'fail'}")
# z = 3.7,  P(pass) = 0.976  →  pass

# Student 2: 1h study, 5h sleep
x2 = np.array([1.0, 5.0])
z2 = np.dot(w, x2) + b
p2 = predict_proba(x2)
print(f"Student 2: z = {z2:.1f},  P(pass) = {p2:.3f}  →  {'pass' if p2 > 0.5 else 'fail'}")
# z = 0.3,  P(pass) = 0.574  →  pass (barely)

# Batch prediction using matrix multiplication
X = np.array([[4.0, 7.0],
              [1.0, 5.0],
              [0.5, 4.0],
              [6.0, 8.0]])
probs = sigmoid(X @ w + b)   # X @ w computes the batch dot product
for i, p in enumerate(probs):
    print(f"  Student {i+1}: P(pass) = {p:.3f}  →  {'pass' if p > 0.5 else 'fail'}")


Quiz

1 / 3

In logistic regression ŷ = σ(w·x + b), what does the sigmoid function do?