Framing the Problem
Lesson 1 ⏱ 10 min

From data to numbers


Turning Data Into Numbers: Pixels, Words, and Tables

How images become pixel arrays, text becomes embedding vectors, and tabular data becomes a design matrix - with worked examples for each data type.

⏱ ~7 min


Quick refresher

Vectors and matrices

A vector is an ordered list of numbers. A matrix is a 2D grid of numbers. In ML, a dataset of n examples with p features is an n×p matrix X.

Example

10 people described by height and weight → a 10×2 matrix.
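A minimal NumPy sketch of this refresher example (the height/weight values are made up for illustration):

```python
import numpy as np

# 10 people, each described by (height_cm, weight_kg) -> a 10x2 matrix
X = np.array([
    [170, 68], [182, 80], [165, 59], [158, 52], [175, 74],
    [190, 92], [168, 63], [172, 70], [180, 78], [160, 55],
])
print(X.shape)  # (10, 2): n=10 examples, p=2 features
```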

Everything Starts as Numbers

ML algorithms understand exactly one language: numbers. Not images. Not text. Not audio. Numbers. The first step in any ML project is converting whatever data you have into vectors of numbers.

This sounds limiting. It is actually a superpower. Once data is a vector, every mathematical tool we have built - derivatives, matrix operations, optimization - becomes available.

This conversion is the translation between the real world and the mathematical universe your model lives in.

Images as Numbers

A grayscale image is a 2D grid of numbers. Each pixel holds a value from 0 (black) to 255 (white). A 28 × 28 pixel image is a 28 × 28 matrix. Flatten it row by row: 28 × 28 = 784 numbers in a 1D vector.
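This flattening step can be sketched in NumPy (random pixel values stand in for a real image):

```python
import numpy as np

# A toy 28x28 grayscale "image": pixel values 0 (black) to 255 (white)
image = np.random.randint(0, 256, size=(28, 28))

# Flatten row by row into a 1D feature vector of length 784
x = image.flatten()
print(x.shape)  # (784,)
```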

d = H × W × C

H — image height in pixels
W — image width in pixels
C — color channels: 1 for grayscale, 3 for RGB
d — total feature count after flattening

A color (RGB) image has three channels. A 224 × 224 color image becomes 224 × 224 × 3 = 150,528 numbers when flattened. Neural networks can work with either the 3D tensor (CNNs, which preserve spatial layout) or the flat vector (dense networks, which treat all pixels independently).
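A short sketch of the d = H × W × C count for a color image, using an all-zero placeholder tensor:

```python
import numpy as np

# d = H x W x C: feature count after flattening a color image
H, W, C = 224, 224, 3
image = np.zeros((H, W, C), dtype=np.uint8)   # RGB image as a 3D tensor
x = image.reshape(-1)                         # flat vector for a dense network
print(x.size)  # 150528 = 224 * 224 * 3
```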

This preserved spatial layout is why CNNs outperform flat networks for image data.

Text as Numbers

Text does not come pre-numbered. You choose a representation.

Bag of words: count how many times each vocabulary word appears. A sentence becomes a vector the length of the vocabulary, mostly zeros. Simple and fast, but loses word order entirely.

One-hot encoding: each word maps to a vector of all zeros except a single 1 at the word's index. "cat" might be index 247 in a 10,000-word vocabulary. Problem: every word is equally distant from every other word, which is clearly wrong.

Embeddings encode semantic similarity as geometric closeness in vector space.

Each word maps to a dense vector like [0.2, -0.7, 0.1, …] with 100-1000 dimensions. This is what word2vec, GloVe, and transformer embeddings produce. We cover embeddings in detail later in the course.
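The bag-of-words and one-hot representations above can be sketched in a few lines of Python (toy 6-word vocabulary, chosen for illustration):

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat", "dog"]   # toy vocabulary
index = {word: i for i, word in enumerate(vocab)}

def bag_of_words(sentence):
    # Count how many times each vocabulary word appears (word order is lost)
    v = np.zeros(len(vocab), dtype=int)
    for word in sentence.lower().split():
        if word in index:
            v[index[word]] += 1
    return v

def one_hot(word):
    # All zeros except a single 1 at the word's index
    v = np.zeros(len(vocab), dtype=int)
    v[index[word]] = 1
    return v

print(bag_of_words("The cat sat on the mat"))  # [2 1 1 1 1 0]
print(one_hot("cat"))                          # [0 1 0 0 0 0]
```

Note that every one-hot vector is the same distance from every other one, which is exactly the weakness embeddings fix.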

Tabular Data

If your data is already in a spreadsheet - customer records, sensor readings, financial data - you are most of the way there. Two things to address:

Numbers: use as-is, but scale them (standardize or normalize) to put all features on comparable scales.

Categories: never assign arbitrary numbers to unordered categories. NYC=1, LA=2, Chicago=3 implies LA is the midpoint of NYC and Chicago - which is meaningless. Use one-hot encoding: a binary column per category, with a 1 indicating which value applies.
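Both steps can be sketched in NumPy (the ages and cities are made-up values):

```python
import numpy as np

# Numeric feature: standardize so features end up on comparable scales
ages = np.array([25.0, 40.0, 31.0, 58.0])
ages_std = (ages - ages.mean()) / ages.std()   # mean 0, std 1

# Categorical feature: one binary column per category, never arbitrary codes
cities = ["NYC", "LA", "Chicago", "NYC"]
categories = sorted(set(cities))               # ['Chicago', 'LA', 'NYC']
city_one_hot = np.array(
    [[1 if c == cat else 0 for cat in categories] for c in cities]
)
print(city_one_hot)
# [[0 0 1]
#  [0 1 0]
#  [1 0 0]
#  [0 0 1]]
```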

The Feature Vector

Every training example becomes a fixed-length vector of numbers — the feature vector x:

x = [x_1, x_2, …, x_p]

x_j — value of feature j for this example
p — number of features (the vector length)

Example: a loan applicant: x = [34, 62000, 7, 720, 3] (age, income, years employed, credit score, accounts).

Every example must have the same length p. You cannot feed a 5-feature vector to a model trained on 10.
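The loan-applicant vector from the example above, as a quick sketch:

```python
import numpy as np

# One loan applicant as a fixed-length feature vector:
# (age, income, years employed, credit score, accounts)
x = np.array([34, 62000, 7, 720, 3])
print(len(x))  # p = 5 features
```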

The Dataset Matrix

Think of it as a spreadsheet: each row is one example (a photo, a customer, a sensor reading) and each column is one feature (pixel value, age, temperature). That spreadsheet is X — and every ML algorithm reads data in exactly this form.

Stack feature vectors as rows and you get the data matrix X with shape n × p:

XRn×p,yRn\mathbf{X} \in \mathbb{R}^{n \times p}, \qquad \mathbf{y} \in \mathbb{R}^{n}
XX
data matrix, shape n times p
nn
number of examples (rows)
pp
number of features (columns)
yy
label vector, one true output per example

The label vector holds one ground-truth output per example.

This n × p matrix X is the universal input format: linear regression, neural networks, SVMs, decision trees — all expect rows as examples and columns as features.
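Stacking feature vectors into X can be sketched in NumPy, reusing loan-applicant-style rows (the second and third applicants are made-up values):

```python
import numpy as np

# Stack feature vectors as rows to form the n x p data matrix X,
# with one ground-truth label per row in y
x1 = np.array([34, 62000, 7, 720, 3])
x2 = np.array([29, 48000, 3, 690, 2])
x3 = np.array([51, 91000, 20, 780, 5])

X = np.stack([x1, x2, x3])   # shape (n, p) = (3, 5)
y = np.array([1, 0, 1])      # label vector: one output per example

print(X.shape, y.shape)  # (3, 5) (3,)
```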

Interactive example

Convert raw data of different types into the X matrix - see how each format becomes numbers


Quiz

1 / 3

A 28×28 pixel grayscale image, when flattened into a feature vector, has how many features?