Framing the Problem
Lesson 2 ⏱ 12 min

What does learning mean?


What Machine Learning Actually Does

Demystifying ML as parameter optimization - a model is a parameterized function, training is minimizing a loss, and gradient descent is the search algorithm.

⏱ ~8 min

🧮

Quick refresher

Functions and parameters

A function f(x) maps input x to output f(x). Parameters are numbers that change how the function behaves without changing its structure.

Example

f(x) = mx + b.

For m=2, b=3: f(4) = 11.

For m=5, b=0: f(4) = 20.

Same structure, different parameters, different behavior.
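To see the same idea in code - a minimal sketch, with illustrative values:

```python
def f(x, m, b):
    # Same structure every time; the parameters m and b decide the behavior.
    return m * x + b

print(f(4, m=2, b=3))  # 11
print(f(4, m=5, b=0))  # 20
```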

Demystifying Machine Learning

"Machine learning" sounds magical - computers gaining wisdom. Let us cut through the mystique. Machine learning is optimization. Finding numbers that minimize a score of wrongness. That is it.

No magic. No intuition. No understanding in the human sense. Just efficient search through a space of possible parameter values for the configuration that makes the fewest mistakes on training data.

Once you accept this framing, a lot of ML becomes much less confusing. Why do models fail? The optimization found the wrong minimum, the loss function did not capture what you wanted, or the training data did not reflect reality. No mystery - just math doing exactly what it was told.

A Model Is a Parameterized Function

A model takes input and produces output. Unlike a fixed function, a model has parameters - adjustable numbers that shape its behavior.

The standard notation captures this:

\hat{y} = f(\mathbf{x}; \boldsymbol{\theta})

  • x - input feature vector for one example
  • θ (theta) - all learnable parameters: weights, biases, everything adjustable
  • f - the model architecture; defines what family of functions is possible
  • ŷ - the prediction: the estimated output for input x

Breaking it down:

  • The x is the input
  • The θ (theta) is every learnable parameter - the semicolon means "parameterized by"
  • The ŷ (y-hat) is the prediction; the hat symbol means "estimated"

For linear regression: \hat{y} = \mathbf{w} \cdot \mathbf{x} + b, with parameters w (weights) and b (bias).

For a two-layer neural network: f(\mathbf{x}; \mathbf{W}_1, \mathbf{b}_1, \mathbf{W}_2, \mathbf{b}_2) - weight matrices and bias vectors per layer.

The key insight: we are not searching for a single fixed function. We are searching through a family of functions - one per possible θ - for the member that fits the data best.
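A minimal sketch of this idea in code (NumPy; the shapes, values, and ReLU activation are assumptions for illustration, not part of the lesson):

```python
import numpy as np

def linear_model(x, theta):
    # y_hat = w . x + b, with theta packing (w, b)
    w, b = theta
    return np.dot(w, x) + b

def two_layer_model(x, theta):
    # y_hat = W2 @ relu(W1 @ x + b1) + b2, with theta = (W1, b1, W2, b2)
    W1, b1, W2, b2 = theta
    hidden = np.maximum(0.0, W1 @ x + b1)  # ReLU, assumed here for illustration
    return W2 @ hidden + b2

x = np.array([1.0, 2.0])
theta_a = (np.array([0.5, -1.0]), 0.0)
theta_b = (np.array([2.0, 0.3]), 1.0)
print(linear_model(x, theta_a))  # -1.5: one member of the family
print(linear_model(x, theta_b))  # 3.6: another member, same structure
```

Changing θ picks a different member of the same family of functions; the architecture f stays fixed.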

The Loss Function

How do we know if parameters are good? We compare predictions to ground truth using a loss function:

L(y, \hat{y}) \geq 0, \quad L(y, y) = 0

  • y - the true label for this example
  • ŷ - the model prediction for this example
  • L - the loss value; lower means the prediction is closer to the truth

Two common choices:

  • Regression: L = (y - \hat{y})^2 - squared error; large mistakes are penalized heavily
  • Classification: L = -\log \hat{y}_{\text{correct}} - cross-entropy; confident wrong predictions get very high loss

The total training loss averages over all nn examples:

\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{n} \sum_{i=1}^{n} L\left(y_i, f(\mathbf{x}_i; \boldsymbol{\theta})\right)

  • n - number of training examples
  • y_i - true label for example i
  • ŷ_i = f(x_i; θ) - model prediction for example i
  • θ - model parameters being optimized

This single number - the average loss over all training data - is our compass. Higher means the current parameters are doing worse; lower means they are doing better.
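As a sketch, the same quantities in code (NumPy; the function names are illustrative, not from the lesson):

```python
import numpy as np

def squared_error(y, y_hat):
    # Regression loss: large mistakes are penalized heavily
    return (y - y_hat) ** 2

def cross_entropy(probs, correct_class):
    # Classification loss: -log of the probability assigned to the true class
    return -np.log(probs[correct_class])

def average_loss(theta, xs, ys, model, loss):
    # The compass: mean loss of model(x, theta) over all training examples
    return np.mean([loss(y, model(x, theta)) for x, y in zip(xs, ys)])
```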

The Loss Landscape

Imagine parameters as coordinates on a map. With two parameters (w_1, w_2), the loss defines a 3D surface: height at each point equals the loss there.

Training aims to find the lowest point - the valley - in this surface. For linear regression, the landscape is a smooth bowl with one global minimum. For neural networks, it is a high-dimensional terrain with many valleys, ridges, and saddle points.
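A sketch of what computing that landscape means for linear regression on toy data (the data-generating parameters and grid range are assumptions chosen for illustration):

```python
import numpy as np

# Toy data from y = 2*x1 - 1*x2 plus noise (assumed just for this picture)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=100)

# Height of the landscape at each grid point (w1, w2) = mean squared error there
w1_grid, w2_grid = np.meshgrid(np.linspace(-4, 4, 50), np.linspace(-4, 4, 50))
loss_surface = np.zeros_like(w1_grid)
for i in range(w1_grid.shape[0]):
    for j in range(w1_grid.shape[1]):
        w = np.array([w1_grid[i, j], w2_grid[i, j]])
        loss_surface[i, j] = np.mean((X @ w - y) ** 2)

# For linear regression this surface is a smooth bowl; its lowest cell sits near (2, -1)
i, j = np.unravel_index(loss_surface.argmin(), loss_surface.shape)
print(w1_grid[i, j], w2_grid[i, j])
```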

Training = Optimization

Before the notation: \arg\min_{\boldsymbol{\theta}} means "find the value of \boldsymbol{\theta} that makes the expression that follows as small as possible." The star in \boldsymbol{\theta}^* marks the winner - the optimal parameter setting. The summation inside is the average training loss you have already seen.

Training is solving an optimization problem:

\boldsymbol{\theta}^* = \underset{\boldsymbol{\theta}}{\arg\min}\; \frac{1}{n} \sum_{i=1}^{n} L\left(y_i, f(\mathbf{x}_i; \boldsymbol{\theta})\right)

  • θ* - optimal parameters: the values that minimize average training loss
  • argmin - the argument (θ) that minimizes the expression that follows
  • n - number of training examples

In plain English: find the parameter values θ* that minimize average loss across all training examples.
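A minimal sketch of that search for a linear model with squared-error loss, using plain gradient descent (the data, learning rate, and step count are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w, true_b = np.array([1.5, -2.0, 0.5]), 0.3
y = X @ true_w + true_b + 0.05 * rng.normal(size=200)

w, b = np.zeros(3), 0.0   # start somewhere in parameter space
lr = 0.1                  # learning rate (assumed)

for step in range(500):
    y_hat = X @ w + b
    error = y_hat - y
    grad_w = 2 * X.T @ error / len(y)   # gradient of mean squared error w.r.t. w
    grad_b = 2 * error.mean()           # gradient w.r.t. b
    w -= lr * grad_w                    # step downhill
    b -= lr * grad_b

print(w, b)  # lands close to the parameters that generated the data
```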

Everything in ML serves this search:

  • Architecture - constrains which functions f can represent
  • Loss function - defines what "correct" means numerically
  • Optimizer - the search algorithm (gradient descent and its variants)
  • Regularization - prevents finding solutions that fit training data but fail on new data

This optimization framing is the mental model that unifies the entire field.

Interactive: Gradient Descent on a Non-Convex Function

This function has two local minima — one near x ≈ -1.3 (deeper) and one near x ≈ 1.3. Where gradient descent ends up depends on the starting point and learning rate.
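A minimal sketch of the same behavior. The interactive's exact function is not given, so f below is a stand-in non-convex function with two valleys (an assumption), one deeper than the other:

```python
def f(x):
    # Stand-in non-convex function with two valleys
    return x**4 - 3.4 * x**2 + 0.3 * x

def f_prime(x):
    # Its derivative: the slope that gradient descent follows
    return 4 * x**3 - 6.8 * x + 0.3

def gradient_descent(x0, lr=0.01, steps=200):
    x = x0
    for _ in range(steps):
        x -= lr * f_prime(x)  # move against the slope
    return x

# Different starting points settle into different valleys
print(gradient_descent(-2.0))  # ends near the deeper minimum (negative x)
print(gradient_descent(2.0))   # ends near the shallower minimum (positive x)
```

Starting on the right side of the ridge, the optimizer never sees the deeper valley on the left - exactly the dependence on initialization described above.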

Why This Framing Helps

When something in ML confuses you - why this architecture? why this loss? why this training trick? - come back to optimization. Ask: what problem in the search is this solving?

Dropout? Noise that prevents the optimizer from overfitting to specific gradient patterns. Batch normalization? Reshapes the loss landscape to be smoother and easier to descend. Adam optimizer? Adapts step size per parameter to handle curvature differences across dimensions.

No magic. All math.

Quiz

1 / 3

In the notation ŷ = f(x; θ), what does θ represent?