Framing the Problem
Lesson 1 ⏱ 10 min

From data to numbers


Turning Data Into Numbers: Pixels, Words, and Tables

How images become pixel arrays, text becomes embedding vectors, and tabular data becomes a design matrix - with worked examples for each data type.

⏱ ~7 min


Quick refresher

Vectors and matrices

A vector is an ordered list of numbers. A matrix is a 2D grid of numbers. In ML, a dataset of n examples with p features is an n×p matrix X.

Example

10 people described by height and weight → a 10×2 matrix.
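A minimal NumPy sketch of this refresher example (the height/weight values are made up for illustration):

```python
import numpy as np

# 10 people, each described by (height_cm, weight_kg) -> a 10x2 matrix
X = np.array([
    [170, 68], [182, 80], [165, 59], [158, 52], [175, 74],
    [190, 92], [168, 63], [172, 70], [180, 78], [160, 55],
])
print(X.shape)  # (10, 2): n=10 examples, p=2 features
```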

Everything Starts as Numbers

ML algorithms understand exactly one language: numbers. Not images. Not text. Not audio. Numbers. The first step in any ML project is converting whatever data you have into vectors of numbers.

This sounds limiting. It is actually a superpower. Once data is a vector, every mathematical tool we have built - derivatives, matrix operations, optimization - becomes available.

This conversion is the translation between the real world and the mathematical universe your model lives in.

Images as Numbers

A grayscale image is a 2D grid of numbers. Each pixel holds a value from 0 (black) to 255 (white). A 28 × 28 pixel image is a 28 × 28 matrix. Flatten it row by row: 28 × 28 = 784 numbers in a 1D vector.
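This flattening step can be sketched in NumPy (random pixel values stand in for a real image):

```python
import numpy as np

# A toy 28x28 grayscale "image": pixel values 0 (black) to 255 (white)
image = np.random.randint(0, 256, size=(28, 28))

# Flatten row by row into a 1D feature vector of length 784
x = image.flatten()
print(x.shape)  # (784,)
```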

d = H × W × C

H — image height in pixels
W — image width in pixels
C — color channels: 1 for grayscale, 3 for RGB
d — total feature count after flattening

A color (RGB) image has three channels. A 224 × 224 color image becomes 224 × 224 × 3 = 150,528 numbers when flattened. Neural networks can work with either the 3D tensor (CNNs, which preserve spatial layout) or the flat vector (dense networks, which treat all pixels independently).
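A short sketch of the d = H × W × C count for a color image, using an all-zero placeholder tensor:

```python
import numpy as np

# d = H x W x C: feature count after flattening a color image
H, W, C = 224, 224, 3
image = np.zeros((H, W, C), dtype=np.uint8)   # RGB image as a 3D tensor
x = image.reshape(-1)                         # flat vector for a dense network
print(x.size)  # 150528 = 224 * 224 * 3
```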

This preserved spatial layout is why CNNs outperform flat networks for image data.

Text as Numbers

Text does not come pre-numbered. You choose a representation.

Bag of words: count how many times each vocabulary word appears. A sentence becomes a vector the length of the vocabulary, mostly zeros. Simple and fast, but loses word order entirely.

One-hot encoding: each word maps to a vector of all zeros except a single 1 at the word's index. "cat" might be index 247 in a 10,000-word vocabulary. Problem: every word is equally distant from every other word, which is clearly wrong.

Embeddings encode semantic similarity as geometric closeness in vector space.

Each word maps to a dense vector like [0.2, -0.7, 0.1, …] with 100-1000 dimensions. This is what word2vec, GloVe, and transformer embeddings produce. We cover embeddings in detail later in the course.
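The bag-of-words and one-hot representations above can be sketched in a few lines of Python (toy 6-word vocabulary, chosen for illustration):

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat", "dog"]   # toy vocabulary
index = {word: i for i, word in enumerate(vocab)}

def bag_of_words(sentence):
    # Count how many times each vocabulary word appears (word order is lost)
    v = np.zeros(len(vocab), dtype=int)
    for word in sentence.lower().split():
        if word in index:
            v[index[word]] += 1
    return v

def one_hot(word):
    # All zeros except a single 1 at the word's index
    v = np.zeros(len(vocab), dtype=int)
    v[index[word]] = 1
    return v

print(bag_of_words("The cat sat on the mat"))  # [2 1 1 1 1 0]
print(one_hot("cat"))                          # [0 1 0 0 0 0]
```

Note that every one-hot vector is the same distance from every other one, which is exactly the weakness embeddings fix.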

Tabular Data

If your data is already in a spreadsheet - customer records, sensor readings, financial data - you are most of the way there. Two things to address:

Numbers: use as-is, but scale them (standardize or normalize) to put all features on comparable scales.

Categories: never assign arbitrary numbers to unordered categories. NYC=1, LA=2, Chicago=3 implies LA is the midpoint of NYC and Chicago - which is meaningless. Use one-hot encoding: a binary column per category, with a 1 indicating which value applies.
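Both steps can be sketched in NumPy (the ages and cities are made-up values):

```python
import numpy as np

# Numeric feature: standardize so features end up on comparable scales
ages = np.array([25.0, 40.0, 31.0, 58.0])
ages_std = (ages - ages.mean()) / ages.std()   # mean 0, std 1

# Categorical feature: one binary column per category, never arbitrary codes
cities = ["NYC", "LA", "Chicago", "NYC"]
categories = sorted(set(cities))               # ['Chicago', 'LA', 'NYC']
city_one_hot = np.array(
    [[1 if c == cat else 0 for cat in categories] for c in cities]
)
print(city_one_hot)
# [[0 0 1]
#  [0 1 0]
#  [1 0 0]
#  [0 0 1]]
```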

The Feature Vector

Every training example becomes a fixed-length vector of numbers — the feature vector x:

x = [x_1, x_2, …, x_p]

x_j — value of feature j for this example
p — number of features (the vector length)

Example: a loan applicant: x = [34, 62000, 7, 720, 3] (age, income, years employed, credit score, accounts).

Every example must have the same length p. You cannot feed a 5-feature vector to a model trained on 10.
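The loan-applicant vector from the example above, as a quick sketch:

```python
import numpy as np

# One loan applicant as a fixed-length feature vector:
# (age, income, years employed, credit score, accounts)
x = np.array([34, 62000, 7, 720, 3])
print(len(x))  # p = 5 features
```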

The Dataset Matrix

Think of it as a spreadsheet: each row is one example (a photo, a customer, a sensor reading) and each column is one feature (pixel value, age, temperature). That spreadsheet is X — and every ML algorithm reads data in exactly this form.

Stack feature vectors as rows and you get the data matrix X with shape n × p:

XRn×p,yRn\mathbf{X} \in \mathbb{R}^{n \times p}, \qquad \mathbf{y} \in \mathbb{R}^{n}
XX
data matrix, shape n times p
nn
number of examples (rows)
pp
number of features (columns)
yy
label vector, one true output per example

The label vector holds one ground-truth output per example.

This n × p matrix X is the universal input format: linear regression, neural networks, SVMs, decision trees — all expect rows as examples and columns as features.
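Stacking feature vectors into X can be sketched in NumPy, reusing loan-applicant-style rows (the second and third applicants are made-up values):

```python
import numpy as np

# Stack feature vectors as rows to form the n x p data matrix X,
# with one ground-truth label per row in y
x1 = np.array([34, 62000, 7, 720, 3])
x2 = np.array([29, 48000, 3, 690, 2])
x3 = np.array([51, 91000, 20, 780, 5])

X = np.stack([x1, x2, x3])   # shape (n, p) = (3, 5)
y = np.array([1, 0, 1])      # label vector: one output per example

print(X.shape, y.shape)  # (3, 5) (3,)
```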

Interactive example

Convert raw data of different types into the X matrix - see how each format becomes numbers


Quiz

1 / 3

A 28×28 pixel grayscale image, when flattened into a feature vector, has how many features?