The Model
The model combines two things you already know:
- The linear combination (same as linear regression)
- The sigmoid function
Logistic regression is the foundation of classification. Despite its simplicity, it's deployed in production at massive scale — as a fast baseline, an interpretable component, or the final layer of a deep network.
The full prediction is:

$$\hat{y} = \sigma(w \cdot x + b)$$

where:
- $\hat{y}$ - predicted probability of class 1 for input $x$
- $\sigma$ - sigmoid function - maps the linear score to a probability in $(0, 1)$
- $w$ - weight vector - one weight per input feature
- $b$ - bias - shifts the decision threshold
Think of it as two sequential steps:
- Step 1 — Linear score: $z = w \cdot x + b$. This is "how much evidence for class 1" as an unbounded real number. Positive means more evidence for class 1; negative means more evidence for class 0.
- Step 2 — Probability: $\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$. Squeeze that evidence score into a valid probability in $(0, 1)$.
The output $\hat{y}$ represents $P(y = 1 \mid x)$: the probability that example $x$ belongs to class 1.
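To see the squashing concretely, here is a minimal sketch (plain NumPy; the evidence scores are arbitrary values chosen for illustration) of how the sigmoid maps any real-valued score into $(0, 1)$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary evidence scores: very negative, mildly negative, zero, mildly positive, very positive
for z in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"z = {z:6.1f}  ->  sigmoid(z) = {sigmoid(z):.4f}")

# z =  -10.0  ->  sigmoid(z) = 0.0000
# z =   -2.0  ->  sigmoid(z) = 0.1192
# z =    0.0  ->  sigmoid(z) = 0.5000
# z =    2.0  ->  sigmoid(z) = 0.8808
# z =   10.0  ->  sigmoid(z) = 1.0000
```

No matter how large or small the evidence score gets, the output stays strictly between 0 and 1, which is what lets us read it as a probability.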
Making a Prediction
Once you have $\hat{y}$, classify by thresholding at 0.5:
- If $\hat{y} > 0.5$: predict class 1 (equivalently: $z > 0$)
- If $\hat{y} \le 0.5$: predict class 0 (equivalently: $z \le 0$)
The 0.5 threshold is a sensible default but is not sacred. In a medical diagnostic test, you might lower the threshold to 0.3 — it's better to investigate more potential cases (tolerate false positives) than to miss true positives. The threshold is a design decision made after training, separate from the training objective.
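To illustrate that the threshold is a post-training knob, here is a small sketch (the probabilities and the 0.3 cutoff are made-up values for illustration) comparing two thresholds over the same predictions:

```python
import numpy as np

# Hypothetical predicted probabilities from an already-trained model
probs = np.array([0.20, 0.35, 0.55, 0.90])

default = (probs > 0.5).astype(int)    # standard threshold
cautious = (probs > 0.3).astype(int)   # lower threshold: flag more potential positives

print("P(class 1):   ", probs)      # [0.2  0.35 0.55 0.9 ]
print("threshold 0.5:", default)    # [0 0 1 1]
print("threshold 0.3:", cautious)   # [0 1 1 1]  <- the 0.35 case is now flagged
```

The model and its probabilities are unchanged; only the decision rule applied on top of them moves.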
Concrete Example
Classifying whether a student passes an exam based on hours studied ($x_1$) and hours slept ($x_2$). After training, suppose the model learns:
- $w_1 = 0.8$ - weight for study hours - studying helps significantly
- $w_2 = 0.5$ - weight for sleep - sleep helps moderately
- $b = -3.0$ - bias - most students fail without adequate preparation
For a student who studied 4 hours and slept 7 hours:
- linear combination: $z = 0.8 \times 4 + 0.5 \times 7 - 3.0 = 3.2 + 3.5 - 3.0 = 3.7$
- predicted probability: $\hat{y} = \sigma(3.7) \approx 0.976$
The model predicts a 97.6% chance of passing. Since 0.976 > 0.5, it predicts class 1 (pass).
A student who studied 1 hour and slept 5 hours:
- linear combination: $z = 0.8 \times 1 + 0.5 \times 5 - 3.0 = 0.8 + 2.5 - 3.0 = 0.3$
- predicted probability: $\hat{y} = \sigma(0.3) \approx 0.574$
The model only barely predicts passing, since 0.574 is just above the 0.5 threshold, and it is much less confident.
The Decision Boundary
The decision boundary is where $\hat{y} = 0.5$, which means $z = 0$, which means:
- $w \cdot x + b = 0$ - the decision boundary - a linear equation defining a flat surface in feature space
That's a hyperplane — a line in 2D, a plane in 3D, a flat $(p-1)$-dimensional surface in $p$ dimensions. This linearity is both logistic regression's strength (interpretable, fast) and its limitation (can't capture curved boundaries).
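To make the boundary concrete, here is a small sketch (reusing the exam weights from above) that solves $0.8 x_1 + 0.5 x_2 - 3.0 = 0$ for the sleep hours on the boundary and checks which side each student lands on:

```python
import numpy as np

w = np.array([0.8, 0.5])   # [study_hours, sleep_hours]
b = -3.0

# Boundary: 0.8*x1 + 0.5*x2 - 3.0 = 0  =>  x2 = (3.0 - 0.8*x1) / 0.5
for study in [0.0, 2.0, 4.0]:
    sleep_on_boundary = (3.0 - 0.8 * study) / 0.5
    print(f"study = {study:.0f}h -> boundary at sleep = {sleep_on_boundary:.1f}h")
# study = 0h -> boundary at sleep = 6.0h
# study = 2h -> boundary at sleep = 2.8h
# study = 4h -> boundary at sleep = -0.4h

# The sign of z tells you which side of the line a point is on
for x in [np.array([4.0, 7.0]), np.array([1.0, 5.0])]:
    z = np.dot(w, x) + b
    print(f"x = {x} -> z = {z:.1f} -> {'class 1 side' if z > 0 else 'class 0 side'}")
```

Both example students have $z > 0$, so both sit on the class-1 side of the line, which matches the probabilities computed above.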
Training
During training, we adjust $w$ and $b$ to minimize a loss function. For logistic regression we use cross-entropy loss — the next lesson explains why MSE doesn't work here and derives the cross-entropy formula. Gradient descent iterates: compute the loss, compute the gradient of the loss with respect to $w$ and $b$, update $w \leftarrow w - \alpha \nabla_w L$ and $b \leftarrow b - \alpha \nabla_b L$ (with learning rate $\alpha$).
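Here is a minimal training sketch under those definitions. The tiny dataset is made up for illustration, and the cross-entropy gradient (the average of (prediction - label) times the inputs) is stated without derivation, since the next lesson derives it:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up training data: [study_hours, sleep_hours] and pass/fail labels
X = np.array([[0.5, 4.0], [1.0, 5.0], [2.0, 6.0], [4.0, 7.0], [5.0, 8.0], [6.0, 8.0]])
y = np.array([0, 0, 0, 1, 1, 1])

w = np.zeros(2)   # start with no evidence either way
b = 0.0
lr = 0.1          # learning rate

for step in range(5000):
    y_hat = sigmoid(X @ w + b)      # predicted probabilities for all examples
    error = y_hat - y               # gradient of cross-entropy w.r.t. the linear score z
    grad_w = X.T @ error / len(y)   # average gradient for the weights
    grad_b = error.mean()           # average gradient for the bias
    w -= lr * grad_w
    b -= lr * grad_b

print("learned w:", w.round(2), "learned b:", round(b, 2))
print("P(pass):", sigmoid(X @ w + b).round(3))   # high for the passing students, low otherwise
```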
Code: Logistic Regression in Python
import numpy as np
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
# Trained weights for the exam example
w = np.array([0.8, 0.5]) # [study_hours, sleep_hours]
b = -3.0 # bias
# Full prediction pipeline: ŷ = σ(w · x + b)
def predict_proba(x):
    z = np.dot(w, x) + b  # Step 1: linear score (log-odds)
    return sigmoid(z)     # Step 2: squash to probability in (0, 1)
# Student 1: 4h study, 7h sleep
x1 = np.array([4.0, 7.0])
z1 = np.dot(w, x1) + b
p1 = predict_proba(x1)
print(f"Student 1: z = {z1:.1f}, P(pass) = {p1:.3f} → {'pass' if p1 > 0.5 else 'fail'}")
# z = 3.7, P(pass) = 0.976 → pass
# Student 2: 1h study, 5h sleep
x2 = np.array([1.0, 5.0])
z2 = np.dot(w, x2) + b
p2 = predict_proba(x2)
print(f"Student 2: z = {z2:.1f}, P(pass) = {p2:.3f} → {'pass' if p2 > 0.5 else 'fail'}")
# z = 0.3, P(pass) = 0.574 → pass (barely)
# Batch prediction using matrix multiplication
X = np.array([[4.0, 7.0],
              [1.0, 5.0],
              [0.5, 4.0],
              [6.0, 8.0]])
probs = sigmoid(X @ w + b) # X @ w computes the batch dot product
for i, p in enumerate(probs):
    print(f" Student {i+1}: P(pass) = {p:.3f} → {'pass' if p > 0.5 else 'fail'}")
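In practice you would rarely hand-set the weights; as a hedged sketch (the training data below is made up, and scikit-learn is assumed to be installed), scikit-learn's LogisticRegression fits $w$ and $b$ for you:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up training data: [study_hours, sleep_hours] -> pass/fail
X_train = np.array([[0.5, 4.0], [1.0, 5.0], [2.0, 6.0], [4.0, 7.0], [5.0, 8.0], [6.0, 8.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X_train, y_train)

print("learned w:", clf.coef_[0])        # one weight per feature
print("learned b:", clf.intercept_[0])   # bias
print("P(pass):", clf.predict_proba(np.array([[4.0, 7.0]]))[:, 1])  # probability of class 1
```

Note that scikit-learn applies L2 regularization by default, so the learned weights will not exactly match the hand-set values used in the exam example above.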
Interactive example
Coming soon: adjust weights and bias to move the decision boundary and see predicted probabilities update in real time.