Why Notation Feels Hard (And Why It Isn't)
Math notation is a compression format. It's like texting abbreviations - "lol," "brb," "imo" look like gibberish until you learn the code, and then they're perfectly clear. The problem isn't that notation is hard. The problem is that most textbooks throw it at you without translating it first.
This lesson translates it. By the end, you'll be able to look at something like:

ŷᵢ = f(xᵢ; θ)
...and read it like a sentence. Let's go symbol by symbol.
Subscripts: Picking Items From a List
A subscript is the little number or letter written below and to the right of a variable. It means "the i-th one" - an index into a collection.
Think of apartment numbers. If your building has units 1, 2, 3, ..., n, then unit i is the i-th apartment. Now substitute "weight" for "apartment":
- w₁ = the first weight
- w₂ = the second weight
- wᵢ = the i-th weight
When you see wᵢ for i = 1, 2, ..., n, it means "there are n weights, and wᵢ stands for any one of them."
In ML, you'll constantly see:
- xᵢ = the i-th training example
- yᵢ = the true label for the i-th example
- ŷᵢ = the model's prediction for the i-th example
```python
weights = [0.4, -0.2, 0.9, 0.1]  # a list of weights: w₁, w₂, w₃, w₄

# Math notation:  w₃ (1-based, the 3rd weight)
# Python index:   weights[2] (0-based, index 2 = third item)
w3_math = weights[2]   # → 0.9 (math's w₃)
w_last = weights[-1]   # → 0.1 (last weight, w₄)
```
Superscripts: Two Different Jobs
Superscripts (above and to the right) have two uses, and context always tells you which applies.
Job 1 - Exponentiation. x² means x squared. x³ means x cubed.
Job 2 - Layer labels in neural networks. In a multi-layer network, a⁽²⁾ might mean "the activations in layer 2." The parentheses around the superscript usually signal this usage. So x² = x · x, but x⁽²⁾ = "x from layer 2." Different things.
When you see a superscript: if it's inside parentheses, it's probably a layer index. If it's a bare number, it's probably an exponent.
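In code, the two jobs look nothing alike. Here's a quick sketch - the list-of-lists layout for layer activations is just one illustrative way to store them:

```python
x = 3.0
x_squared = x ** 2          # bare superscript x²: exponentiation → 9.0

# Parenthesized superscript a⁽²⁾: a layer index. One simple way to
# mirror it is a list of per-layer activations (values are made up).
activations = [
    [0.5, 0.1],   # a⁽⁰⁾: input layer
    [0.7, 0.3],   # a⁽¹⁾: layer 1
    [0.9, 0.2],   # a⁽²⁾: layer 2
]
layer_2 = activations[2]    # "the activations in layer 2" - not a square
```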
Greek Letters: Your New Vocabulary
ML papers love Greek letters. Here's the cheat sheet you actually need:
Parameters and knobs
θ (theta) = all model parameters. Instead of listing every weight and bias, you say "θ represents everything the model learned."
α (alpha) = the learning rate. In the gradient descent update w ← w − α ∂L/∂w, α controls how big a step you take.
ε (epsilon) = a tiny number to avoid division by zero. You'll see it in the Adam optimizer as a constant like 1e-8.
Functions and aggregation
σ (lowercase sigma) = the sigmoid function σ(z) = 1 / (1 + e⁻ᶻ). Outputs values between 0 and 1.
Σ (uppercase sigma) = summation. "Add all of these up." This is so common it gets its own lesson - coming right up.
μ (mu) = the mean (average) of a distribution.
Penalty strength
λ (lambda) = regularization strength. Controls how harsh the penalty for large weights is.
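To make these concrete, here's how a few of them tend to look in plain Python. This is a minimal sketch - the variable names are mine, not any library's API:

```python
import math

def sigmoid(z):
    # σ(z) = 1 / (1 + e⁻ᶻ): squashes any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

values = [1.0, 2.0, 3.0]
total = sum(values)         # Σ: "add all of these up"
mu = total / len(values)    # μ: the mean
eps = 1e-8                  # ε: keeps the division below safe
shares = [v / (total + eps) for v in values]
```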
Reading Equations Left to Right
Equations are sentences. Read them that way. Take the gradient descent update:

w ← w − α ∂L/∂w

- w - the weight parameter being updated
- α - the learning rate, which sets the step size
- ∂L/∂w - the gradient of the loss with respect to w, the direction of steepest increase
Left to right: "the new value of w becomes the old value of w, minus α times the gradient of L." The arrow ← means "assign" or "update." This is one of the most important equations in ML - and it's just a sentence about subtracting a scaled gradient.
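Here's that sentence as running code, with a toy loss L(w) = (w − 3)² chosen so the gradient is easy to check by hand - the loss is illustrative, not from any particular model:

```python
w = 5.0         # current weight
alpha = 0.1     # α, the learning rate

def grad_L(w):
    # ∂L/∂w for the toy loss L(w) = (w - 3)², i.e. 2(w - 3)
    return 2.0 * (w - 3.0)

w = w - alpha * grad_L(w)   # w ← w − α ∂L/∂w
print(w)                    # 4.6 - one step toward the minimum at w = 3
```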
Hat Notation: "This Is an Estimate"
When you see a hat (^) over a variable, it means "this is an estimate of the thing without the hat." So ŷᵢ is the model's estimate of the true label yᵢ - the prediction, not the ground truth.
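In code, the hat usually just becomes a name like y_hat - a convention, not a rule:

```python
y = 1.0              # yᵢ: the true label
y_hat = 0.92         # ŷᵢ: the model's estimate of y
error = y - y_hat    # the residual: how far the estimate missed
```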
The Semicolon: Separating Input From Parameters
You'll often see notation like f(x; θ). The semicolon separates two kinds of things:
- Left of the semicolon: what varies during inference - x, the input
- Right of the semicolon: what's fixed once the model is trained - θ, the parameters
It says: "This function takes input x and its behavior is shaped by θ. When you're using the trained model, θ is baked in. When you're training, you're adjusting θ."
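One way to see the split in Python is a toy linear model where θ is just a (slope, intercept) pair - my choice here, purely for illustration:

```python
def f(x, theta):
    # f(x; θ): output depends on the input x and the parameters θ
    slope, intercept = theta
    return slope * x + intercept

theta = (2.0, 1.0)          # θ, fixed once training is done
prediction = f(3.0, theta)  # at inference time only x varies → 7.0
```

Once training is done, you could even bake θ in with functools.partial(f, theta=theta), leaving a function that truly takes only x.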
Putting It Together
Let's read one complete expression:

ŷᵢ = f(xᵢ; θ)

- ŷᵢ - the model's predicted output for example i
- xᵢ - the i-th training input
- θ - all model parameters
Translation: "For the i-th training example, the model's prediction is obtained by passing the i-th input through the function f, whose behavior is controlled by parameters θ."
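As a last sketch, here's that expression applied across a small training set, reusing the toy linear model idea from above (all numbers invented):

```python
def f(x, theta):
    # f(x; θ) as a toy linear model with θ = (slope, intercept)
    slope, intercept = theta
    return slope * x + intercept

xs = [1.0, 2.0, 3.0]                  # x₁, x₂, x₃: training inputs
theta = (2.0, 1.0)                    # θ: all model parameters
y_hats = [f(x, theta) for x in xs]    # ŷᵢ = f(xᵢ; θ) → [3.0, 5.0, 7.0]
```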
That's it. Once you know the code, equations stop being walls of symbols and start being sentences. Keep this page bookmarked - when you hit notation that confuses you, come back here first.