Why Notation Feels Hard (And Why It Isn't)
Math notation is a compression format. It's like texting abbreviations - "lol," "brb," "imo" look like gibberish until you learn the code, and then they're perfectly clear. The problem isn't that notation is hard. The problem is that most textbooks throw it at you without translating it first.
This lesson translates it. By the end, you'll be able to look at something like:

ŷᵢ = f(xᵢ; θ)
...and read it like a sentence. Let's go symbol by symbol.
Subscripts: Picking Items From a List
A subscript is the little number or letter written below and to the right of a variable. It means "the i-th one" - an index into a collection.
Think of apartment numbers. If your building has units 1, 2, 3, ..., n, then unit i is the i-th apartment. Now substitute "weight" for "apartment":
- w₁ = the first weight
- w₂ = the second weight
- wᵢ = the i-th weight
When you see wᵢ for i = 1, 2, ..., n, it means "there are n weights, and wᵢ stands for any one of them."
In ML, you'll constantly see:
- xᵢ = the i-th training example
- yᵢ = the true label for the i-th example
- ŷᵢ = the model's prediction for the i-th example
```python
weights = [0.4, -0.2, 0.9, 0.1]  # a list of weights: w₁, w₂, w₃, w₄

# Math notation:  w₃ (1-based, the 3rd weight)
# Python index:   weights[2] (0-based, index 2 = third item)
w3_math = weights[2]   # → 0.9 (math's w₃)
w_last = weights[-1]   # → 0.1 (last weight, w₄)
```
Superscripts: Two Different Jobs
Superscripts (above and to the right) have two uses, and context always tells you which applies.
Job 1 - Exponentiation. x² means x squared. x³ means x cubed.
Job 2 - Layer labels in neural networks. In a multi-layer network, a⁽²⁾ might mean "the activations in layer 2." The parentheses around the superscript usually signal this usage. So x² = x · x, but x⁽²⁾ = "x from layer 2." Different things.
When you see a superscript: if it's inside parentheses, it's probably a layer index. If it's a bare number, it's probably an exponent.
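In code, the two jobs look nothing alike. Here's a quick sketch - the list-of-lists layout for layer activations is just one illustrative way to store them:

```python
x = 3.0
x_squared = x ** 2          # bare superscript x²: exponentiation → 9.0

# Parenthesized superscript a⁽²⁾: a layer index. One simple way to
# mirror it is a list of per-layer activations (values are made up).
activations = [
    [0.5, 0.1],   # a⁽⁰⁾: input layer
    [0.7, 0.3],   # a⁽¹⁾: layer 1
    [0.9, 0.2],   # a⁽²⁾: layer 2
]
layer_2 = activations[2]    # "the activations in layer 2" - not a square
```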
Greek Letters: Your New Vocabulary
ML papers love Greek letters. Here's the cheat sheet you actually need:
Parameters and knobs
θ (theta) = all model parameters. Instead of listing every weight and bias, you say "θ represents everything the model learned."
α (alpha) = the learning rate. In the gradient descent update w ← w − α ∂L/∂w, α controls how big a step you take.
ε (epsilon) = a tiny number to avoid division by zero. You'll see it in the Adam optimizer as a constant like 1e-8.
Functions and aggregation
σ (lowercase sigma) = the sigmoid function σ(z) = 1 / (1 + e⁻ᶻ). Outputs values between 0 and 1.
Σ (uppercase sigma) = summation. "Add all of these up." This is so common it gets its own lesson - coming right up.
μ (mu) = the mean (average) of a distribution.
Penalty strength
λ (lambda) = regularization strength. Controls how harsh the penalty for large weights is.
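To make these concrete, here's how a few of them tend to look in plain Python. This is a minimal sketch - the variable names are mine, not any library's API:

```python
import math

def sigmoid(z):
    # σ(z) = 1 / (1 + e⁻ᶻ): squashes any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

values = [1.0, 2.0, 3.0]
total = sum(values)         # Σ: "add all of these up"
mu = total / len(values)    # μ: the mean
eps = 1e-8                  # ε: keeps the division below safe
shares = [v / (total + eps) for v in values]
```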
Reading Equations Left to Right
Equations are sentences. Read them that way. Take the gradient descent update:

w ← w − α ∂L/∂w

- w - the weight parameter being updated
- α - the learning rate, which sets the step size
- ∂L/∂w - the gradient of the loss with respect to w, the direction of steepest increase
Left to right: "the new value of w becomes the old value of w, minus α times the gradient of L." The arrow ← means "assign" or "update." This is one of the most important equations in ML - and it's just a sentence about subtracting a scaled gradient.
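Here's that sentence as running code, with a toy loss L(w) = (w − 3)² chosen so the gradient is easy to check by hand - the loss is illustrative, not from any particular model:

```python
w = 5.0         # current weight
alpha = 0.1     # α, the learning rate

def grad_L(w):
    # ∂L/∂w for the toy loss L(w) = (w - 3)², i.e. 2(w - 3)
    return 2.0 * (w - 3.0)

w = w - alpha * grad_L(w)   # w ← w − α ∂L/∂w
print(w)                    # 4.6 - one step toward the minimum at w = 3
```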
Hat Notation: "This Is an Estimate"
When you see a hat (^) over a variable, it means "this is an estimate of the thing without the hat." So ŷᵢ is the model's estimate of the true label yᵢ - the prediction, not the ground truth.
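In code, the hat usually just becomes a name like y_hat - a convention, not a rule:

```python
y = 1.0              # yᵢ: the true label
y_hat = 0.92         # ŷᵢ: the model's estimate of y
error = y - y_hat    # the residual: how far the estimate missed
```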
The Semicolon: Separating Input From Parameters
You'll often see notation like f(x; θ). The semicolon separates two kinds of things:
- Left of the semicolon: what varies during inference - x, the input
- Right of the semicolon: what's fixed once the model is trained - θ, the parameters
It says: "This function takes input x and its behavior is shaped by θ. When you're using the trained model, θ is baked in. When you're training, you're adjusting θ."
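One way to see the split in Python is a toy linear model where θ is just a (slope, intercept) pair - my choice here, purely for illustration:

```python
def f(x, theta):
    # f(x; θ): output depends on the input x and the parameters θ
    slope, intercept = theta
    return slope * x + intercept

theta = (2.0, 1.0)          # θ, fixed once training is done
prediction = f(3.0, theta)  # at inference time only x varies → 7.0
```

Once training is done, you could even bake θ in with functools.partial(f, theta=theta), leaving a function that truly takes only x.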
Putting It Together
Let's read one complete expression:

ŷᵢ = f(xᵢ; θ)

- ŷᵢ - the model's predicted output for example i
- xᵢ - the i-th training input
- θ - all model parameters
Translation: "For the i-th training example, the model's prediction is obtained by passing the i-th input through the function f, whose behavior is controlled by parameters θ."
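As a last sketch, here's that expression applied across a small training set, reusing the toy linear model idea from above (all numbers invented):

```python
def f(x, theta):
    # f(x; θ) as a toy linear model with θ = (slope, intercept)
    slope, intercept = theta
    return slope * x + intercept

xs = [1.0, 2.0, 3.0]                  # x₁, x₂, x₃: training inputs
theta = (2.0, 1.0)                    # θ: all model parameters
y_hats = [f(x, theta) for x in xs]    # ŷᵢ = f(xᵢ; θ) → [3.0, 5.0, 7.0]
```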
That's it. Once you know the code, equations stop being walls of symbols and start being sentences. Keep this page bookmarked - when you hit notation that confuses you, come back here first.