Demystifying Machine Learning
"Machine learning" sounds magical - computers gaining wisdom. Let us cut through the mystique. Machine learning is optimization. Finding numbers that minimize a score of wrongness. That is it.
No magic. No intuition. No understanding in the human sense. Just efficient search through a space of possible parameter values for the configuration that makes the fewest mistakes on training data.
Once you accept this framing, a lot of ML becomes much less confusing. Why do models fail? The optimization found the wrong minimum, the loss function did not capture what you wanted, or the training data did not reflect reality. No mystery - just math doing exactly what it was told.
A Model Is a Parameterized Function
A model takes input and produces output. Unlike a fixed function, a model has parameters - adjustable numbers that shape its behavior.
The standard notation captures this:

ŷ = f(x; θ)

- x - input feature vector for one example
- θ - all learnable parameters - weights, biases, everything adjustable
- f - the model architecture - defines what family of functions is possible
- ŷ - the prediction - estimated output for input x
Breaking it down:
- The x is the input
- The θ (theta) is every learnable parameter - the semicolon means "parameterized by"
- The ŷ (y-hat) is the prediction; the hat symbol means "estimated"
For linear regression: ŷ = w·x + b, with parameters w (weights) and b (bias).
For a two-layer neural network: ŷ = W₂ σ(W₁x + b₁) + b₂ - weight matrices and bias vectors per layer, with a nonlinearity σ between them.
The key insight: we are not searching for a single fixed function. We are searching through a family of functions - one per possible θ - for the member that fits the data best.
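The "family of functions" idea is easy to see in code. A minimal sketch in Python (NumPy), where the input and the two candidate θ values are made up for illustration:

```python
import numpy as np

def f(x, theta):
    """A linear model: one member of the family y_hat = w . x + b.

    theta bundles every learnable parameter (here: weights w and bias b).
    Changing theta selects a different function from the same family.
    """
    w, b = theta
    return np.dot(w, x) + b

x = np.array([1.0, 2.0])
theta_a = (np.array([0.5, -0.5]), 0.0)   # one candidate parameter setting
theta_b = (np.array([2.0, 1.0]), -1.0)   # another member of the family

print(f(x, theta_a))  # -0.5
print(f(x, theta_b))  # 3.0
```

Same architecture f, same input x - different θ, different function.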
The Loss Function
How do we know if parameters are good? We compare predictions to ground truth using a loss function:

L(y, ŷ)

- y - the true label for this example
- ŷ - the model prediction for this example
- L(y, ŷ) - loss value - lower means the prediction is closer to truth
Two common choices:
- Regression: L = (y - ŷ)² - squared error; large mistakes are penalized heavily
- Classification: L = -[y log ŷ + (1 - y) log(1 - ŷ)] - cross-entropy; confident wrong predictions get very high loss
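Both losses are one-liners. A minimal sketch, with made-up numbers:

```python
import math

def squared_error(y, y_hat):
    # Regression loss: large mistakes dominate because the error is squared.
    return (y - y_hat) ** 2

def cross_entropy(y, y_hat):
    # Binary classification loss; y is 0 or 1, y_hat is a predicted probability.
    # A confident wrong prediction (y=1, y_hat near 0) makes -log(y_hat) blow up.
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(squared_error(3.0, 2.5))   # 0.25
print(cross_entropy(1, 0.9))     # ~0.105 - confident and right: low loss
print(cross_entropy(1, 0.01))    # ~4.605 - confident and wrong: high loss
```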
The total training loss averages over all examples:

L(θ) = (1/n) Σᵢ L(yᵢ, ŷᵢ)

- n - number of training examples
- yᵢ - true label for example i
- ŷᵢ - model prediction for example i
- θ - model parameters being optimized
This single number - the average loss over all training data - is our compass. Higher means the current parameters are doing worse; lower means they are doing better.
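A sketch of that compass for a hypothetical 1-D linear model on toy data (values invented for illustration):

```python
# Average squared-error loss over a toy dataset for a 1-D linear model
# y_hat = w*x + b. Data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

def avg_loss(w, b):
    # The single number that tells us how good (w, b) is on the training set.
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

print(avg_loss(0.0, 0.0))  # 21.0 - bad parameters, high loss
print(avg_loss(2.0, 1.0))  # 0.0  - the true parameters, zero loss
```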
The Loss Landscape
Imagine parameters as coordinates on a map. With two parameters (θ₁, θ₂), the loss defines a 3D surface: height at each point equals the loss there.
Training aims to find the lowest point - the valley - in this surface. For linear regression, the landscape is a smooth bowl with one global minimum. For neural networks, it is a high-dimensional terrain with many valleys, ridges, and saddle points.
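One way to see the bowl is to evaluate the average loss on a grid of (w, b) values; the toy data and grid ranges below are invented for illustration:

```python
import numpy as np

# Sample the loss landscape of 1-D linear regression y_hat = w*x + b
# on a grid of (w, b) values; the surface is a smooth bowl with one minimum.
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = 2.0 * xs + 1.0                 # toy data from y = 2x + 1

ws = np.linspace(0.0, 4.0, 81)
bs = np.linspace(-1.0, 3.0, 81)
W, B = np.meshgrid(ws, bs)

# Loss at every grid point: mean over examples of (y - (w*x + b))^2
L = np.mean((ys[None, None, :] - (W[..., None] * xs + B[..., None])) ** 2,
            axis=-1)

i, j = np.unravel_index(np.argmin(L), L.shape)
print(W[i, j], B[i, j])             # ~ 2.0 1.0 - the bottom of the bowl
```

Plotting L as a surface over (w, b) would show the single smooth valley described above.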
Training = Optimization
Training is solving an optimization problem. Before the notation: argmin_θ means "find the value of θ that makes the following expression as small as possible." The star in θ* marks the winner - the optimal parameter setting. The summation inside is the average training loss you have already seen.

θ* = argmin_θ (1/n) Σᵢ L(yᵢ, f(xᵢ; θ))
- θ* - optimal parameters - the values that minimize average training loss
- argmin_θ - the argument θ (theta) that minimizes the following expression
- n - number of training examples
In plain English: find the parameter values θ* that minimize average loss across all training examples.
Everything in ML serves this search:
- Architecture - constrains which functions f can represent
- Loss function - defines what "correct" means numerically
- Optimizer - the search algorithm (gradient descent and its variants)
- Regularization - prevents finding solutions that fit training data but fail on new data
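The search itself can be sketched as plain gradient descent on the linear-regression loss - a minimal illustration on invented toy data, not how production optimizers are implemented:

```python
# Gradient descent on the average squared-error loss of y_hat = w*x + b.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]           # toy data from y = 2x + 1
n = len(xs)

w, b = 0.0, 0.0                     # arbitrary starting point in parameter space
lr = 0.02                           # learning rate (step size)

for _ in range(5000):
    # Gradients of the average loss with respect to w and b
    grad_w = sum(-2 * x * (y - (w * x + b)) for x, y in zip(xs, ys)) / n
    grad_b = sum(-2 * (y - (w * x + b)) for x, y in zip(xs, ys)) / n
    w -= lr * grad_w                # step downhill in each coordinate
    b -= lr * grad_b

print(round(w, 3), round(b, 3))     # ~ 2.0 1.0 - the loss-minimizing parameters
```

Every refinement in the list above - better architectures, losses, optimizers, regularizers - is a modification of some piece of this loop.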
This framing - training as optimization - is the mental model that unifies the entire field.
(Figure: a one-dimensional loss curve with two local minima - one near x ≈ -1.3 (deeper) and one near x ≈ 1.3. Where gradient descent ends up depends on the starting point and learning rate.)
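The start-point dependence is easy to reproduce with a stand-in double-well curve (not the exact function from the figure): f(x) = x⁴ - 2x², whose two minima sit at x = ±1.

```python
def f_prime(x):
    # Derivative of the double-well f(x) = x**4 - 2*x**2 (minima at x = -1, +1).
    # A stand-in curve for illustration, not the figure's exact function.
    return 4 * x ** 3 - 4 * x

def descend(x, lr=0.05, steps=200):
    # Plain gradient descent: step downhill from the starting point x.
    for _ in range(steps):
        x -= lr * f_prime(x)
    return x

print(round(descend(-0.5), 3))  # -1.0 - starts left of zero, falls into the left well
print(round(descend(0.5), 3))   # 1.0  - starts right of zero, falls into the right well
```

Same algorithm, same learning rate - the starting point alone decides which minimum is found.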
Why This Framing Helps
When something in ML confuses you - why this architecture? why this loss? why this training trick? - come back to optimization. Ask: what problem in the search is this solving?
Dropout? Noise that prevents the optimizer from overfitting to specific gradient patterns. Batch normalization? Reshapes the loss landscape to be smoother and easier to descend. Adam optimizer? Adapts step size per parameter to handle curvature differences across dimensions.
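As one example, the Adam update itself is only a few lines. A minimal sketch of the standard update rule, applied to a toy quadratic with very different curvature per dimension (the quadratic and all hyperparameter values here are illustrative):

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    # Per-parameter step sizes adapted by running averages of the
    # gradient (m) and squared gradient (v).
    m = [b1 * mi + (1 - b1) * g for mi, g in zip(m, grad)]
    v = [b2 * vi + (1 - b2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1 - b1 ** t) for mi in m]   # bias correction for early steps
    v_hat = [vi / (1 - b2 ** t) for vi in v]
    theta = [p - lr * mh / (math.sqrt(vh) + eps)
             for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v

# Minimize f(x, y) = x**2 + 100*y**2: curvature differs 100x between dimensions,
# yet the normalized update takes similar-sized steps in both.
theta, m, v = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
for t in range(1, 1001):
    grad = [2 * theta[0], 200 * theta[1]]
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)  # both coordinates driven close to 0
```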
No magic. All math.