Demystifying Machine Learning
"Machine learning" sounds magical - computers gaining wisdom. Let us cut through the mystique. Machine learning is optimization. Finding numbers that minimize a score of wrongness. That is it.
No magic. No intuition. No understanding in the human sense. Just efficient search through a space of possible parameter values for the configuration that makes the fewest mistakes on training data.
Once you accept this framing, a lot of ML becomes much less confusing. Why do models fail? The optimization found the wrong minimum, the loss function did not capture what you wanted, or the training data did not reflect reality. No mystery - just math doing exactly what it was told.
A Model Is a Parameterized Function
A model takes input and produces output. Unlike a fixed function, a model has parameters - adjustable numbers that shape its behavior.
The standard notation captures this:

ŷ = f(x; θ)

- x - input feature vector for one example
- θ - all learnable parameters - weights, biases, everything adjustable
- f - the model architecture - defines what family of functions is possible
- ŷ - the prediction - estimated output for input x
Breaking it down:
- The x is the input
- The θ (theta) is every learnable parameter - the semicolon means "parameterized by"
- The ŷ (y-hat) is the prediction; the hat symbol means "estimated"
For linear regression: ŷ = w·x + b, with parameters w (weights) and b (bias).
For a two-layer neural network: ŷ = W₂ σ(W₁x + b₁) + b₂ - weight matrices and bias vectors per layer, with a nonlinearity σ between them.
The key insight: we are not searching for a single fixed function. We are searching through a family of functions - one per possible θ - for the member that fits the data best.
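The "family of functions" idea is easy to see in code. A minimal sketch in Python (NumPy), where the input and the two candidate θ values are made up for illustration:

```python
import numpy as np

def f(x, theta):
    """A linear model: one member of the family y_hat = w . x + b.

    theta bundles every learnable parameter (here: weights w and bias b).
    Changing theta selects a different function from the same family.
    """
    w, b = theta
    return np.dot(w, x) + b

x = np.array([1.0, 2.0])
theta_a = (np.array([0.5, -0.5]), 0.0)   # one candidate parameter setting
theta_b = (np.array([2.0, 1.0]), -1.0)   # another member of the family

print(f(x, theta_a))  # -0.5
print(f(x, theta_b))  # 3.0
```

Same architecture f, same input x - different θ, different function.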
The Loss Function
How do we know if parameters are good? We compare predictions to ground truth using a loss function:

L(y, ŷ)

- y - the true label for this example
- ŷ - the model prediction for this example
- L(y, ŷ) - loss value - lower means the prediction is closer to truth
Two common choices:
- Regression: L = (y - ŷ)² - squared error; large mistakes are penalized heavily
- Classification: L = -[y log ŷ + (1 - y) log(1 - ŷ)] - cross-entropy; confident wrong predictions get very high loss
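Both losses are one-liners. A minimal sketch, with made-up numbers:

```python
import math

def squared_error(y, y_hat):
    # Regression loss: large mistakes dominate because the error is squared.
    return (y - y_hat) ** 2

def cross_entropy(y, y_hat):
    # Binary classification loss; y is 0 or 1, y_hat is a predicted probability.
    # A confident wrong prediction (y=1, y_hat near 0) makes -log(y_hat) blow up.
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(squared_error(3.0, 2.5))   # 0.25
print(cross_entropy(1, 0.9))     # ~0.105 - confident and right: low loss
print(cross_entropy(1, 0.01))    # ~4.605 - confident and wrong: high loss
```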
The total training loss averages over all examples:

L(θ) = (1/n) Σᵢ L(yᵢ, ŷᵢ)

- n - number of training examples
- yᵢ - true label for example i
- ŷᵢ - model prediction for example i
- θ - model parameters being optimized
This single number - the average loss over all training data - is our compass. Higher means the current parameters are doing worse; lower means they are doing better.
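A sketch of that compass for a hypothetical 1-D linear model on toy data (values invented for illustration):

```python
# Average squared-error loss over a toy dataset for a 1-D linear model
# y_hat = w*x + b. Data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

def avg_loss(w, b):
    # The single number that tells us how good (w, b) is on the training set.
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

print(avg_loss(0.0, 0.0))  # 21.0 - bad parameters, high loss
print(avg_loss(2.0, 1.0))  # 0.0  - the true parameters, zero loss
```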
The Loss Landscape
Imagine parameters as coordinates on a map. With two parameters (θ₁, θ₂), the loss defines a 3D surface: height at each point equals the loss there.
Training aims to find the lowest point - the valley - in this surface. For linear regression, the landscape is a smooth bowl with one global minimum. For neural networks, it is a high-dimensional terrain with many valleys, ridges, and saddle points.
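One way to see the bowl is to evaluate the average loss on a grid of (w, b) values; the toy data and grid ranges below are invented for illustration:

```python
import numpy as np

# Sample the loss landscape of 1-D linear regression y_hat = w*x + b
# on a grid of (w, b) values; the surface is a smooth bowl with one minimum.
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = 2.0 * xs + 1.0                 # toy data from y = 2x + 1

ws = np.linspace(0.0, 4.0, 81)
bs = np.linspace(-1.0, 3.0, 81)
W, B = np.meshgrid(ws, bs)

# Loss at every grid point: mean over examples of (y - (w*x + b))^2
L = np.mean((ys[None, None, :] - (W[..., None] * xs + B[..., None])) ** 2,
            axis=-1)

i, j = np.unravel_index(np.argmin(L), L.shape)
print(W[i, j], B[i, j])             # ~ 2.0 1.0 - the bottom of the bowl
```

Plotting L as a surface over (w, b) would show the single smooth valley described above.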
Training = Optimization
Training is solving an optimization problem. Before the notation: argmin_θ means "find the value of θ that makes the following expression as small as possible." The star in θ* marks the winner - the optimal parameter setting. The summation inside is the average training loss you have already seen.

θ* = argmin_θ (1/n) Σᵢ L(yᵢ, f(xᵢ; θ))
- θ* - optimal parameters - the values that minimize average training loss
- argmin_θ - the argument θ (theta) that minimizes the following expression
- n - number of training examples
In plain English: find the parameter values θ* that minimize average loss across all training examples.
Everything in ML serves this search:
- Architecture - constrains which functions f can represent
- Loss function - defines what "correct" means numerically
- Optimizer - the search algorithm (gradient descent and its variants)
- Regularization - prevents finding solutions that fit training data but fail on new data
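The search itself can be sketched as plain gradient descent on the linear-regression loss - a minimal illustration on invented toy data, not how production optimizers are implemented:

```python
# Gradient descent on the average squared-error loss of y_hat = w*x + b.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]           # toy data from y = 2x + 1
n = len(xs)

w, b = 0.0, 0.0                     # arbitrary starting point in parameter space
lr = 0.02                           # learning rate (step size)

for _ in range(5000):
    # Gradients of the average loss with respect to w and b
    grad_w = sum(-2 * x * (y - (w * x + b)) for x, y in zip(xs, ys)) / n
    grad_b = sum(-2 * (y - (w * x + b)) for x, y in zip(xs, ys)) / n
    w -= lr * grad_w                # step downhill in each coordinate
    b -= lr * grad_b

print(round(w, 3), round(b, 3))     # ~ 2.0 1.0 - the loss-minimizing parameters
```

Every refinement in the list above - better architectures, losses, optimizers, regularizers - is a modification of some piece of this loop.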
This framing - training as optimization - is the mental model that unifies the entire field.
(Figure: a one-dimensional loss curve with two local minima - one near x ≈ -1.3 (deeper) and one near x ≈ 1.3. Where gradient descent ends up depends on the starting point and learning rate.)
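The start-point dependence is easy to reproduce with a stand-in double-well curve (not the exact function from the figure): f(x) = x⁴ - 2x², whose two minima sit at x = ±1.

```python
def f_prime(x):
    # Derivative of the double-well f(x) = x**4 - 2*x**2 (minima at x = -1, +1).
    # A stand-in curve for illustration, not the figure's exact function.
    return 4 * x ** 3 - 4 * x

def descend(x, lr=0.05, steps=200):
    # Plain gradient descent: step downhill from the starting point x.
    for _ in range(steps):
        x -= lr * f_prime(x)
    return x

print(round(descend(-0.5), 3))  # -1.0 - starts left of zero, falls into the left well
print(round(descend(0.5), 3))   # 1.0  - starts right of zero, falls into the right well
```

Same algorithm, same learning rate - the starting point alone decides which minimum is found.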
Why This Framing Helps
When something in ML confuses you - why this architecture? why this loss? why this training trick? - come back to optimization. Ask: what problem in the search is this solving?
Dropout? Noise that prevents the optimizer from overfitting to specific gradient patterns. Batch normalization? Reshapes the loss landscape to be smoother and easier to descend. Adam optimizer? Adapts step size per parameter to handle curvature differences across dimensions.
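As one example, the Adam update itself is only a few lines. A minimal sketch of the standard update rule, applied to a toy quadratic with very different curvature per dimension (the quadratic and all hyperparameter values here are illustrative):

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    # Per-parameter step sizes adapted by running averages of the
    # gradient (m) and squared gradient (v).
    m = [b1 * mi + (1 - b1) * g for mi, g in zip(m, grad)]
    v = [b2 * vi + (1 - b2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1 - b1 ** t) for mi in m]   # bias correction for early steps
    v_hat = [vi / (1 - b2 ** t) for vi in v]
    theta = [p - lr * mh / (math.sqrt(vh) + eps)
             for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v

# Minimize f(x, y) = x**2 + 100*y**2: curvature differs 100x between dimensions,
# yet the normalized update takes similar-sized steps in both.
theta, m, v = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
for t in range(1, 1001):
    grad = [2 * theta[0], 200 * theta[1]]
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)  # both coordinates driven close to 0
```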
No magic. All math.