

The Dot Product: The Operation That Powers Every Neuron

The algebraic definition, the geometric meaning as alignment, the weighted-sum interpretation, and how every linear model prediction is a dot product.


Quick refresher

Vectors

A vector is an ordered list of numbers. You add vectors element-wise and multiply by scalars element-wise. The magnitude of [a, b] is √(a²+b²).

Example

[1, 2] + [3, 4] = [4, 6].

2·[1, 2] = [2, 4].

||[3, 4]|| = 5.
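
A quick NumPy check of the same three operations (this snippet is only an illustration; the values mirror the examples above):

import numpy as np

a = np.array([1, 2])
b = np.array([3, 4])

print(a + b)                    # → [4 6]   element-wise addition
print(2 * a)                    # → [2 4]   scalar multiplication
print(np.linalg.norm([3, 4]))   # → 5.0     magnitude √(3² + 4²)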

The Operation That Powers Every Neuron

The dot product is the single most important operation in machine learning. Every neuron in every neural network computes a dot product. It is how linear regression makes predictions. It is how attention works in transformers. It is how similarity is measured in recommendation systems. Learn this well - you will see it everywhere.

What Is a Dot Product?

Given two vectors of the same length, the dot product multiplies corresponding elements and sums them up:

\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n

  • a_i - i-th element of vector a
  • b_i - i-th element of vector b
  • n - vector dimension

The result is a single number - a scalar - NOT another vector. Two vectors in, one number out.

Example 1: [1, 2, 3] · [4, 5, 6] = 1·4 + 2·5 + 3·6 = 4 + 10 + 18 = 32

Example 2: [2, -1, 3] · [1, 4, 2] = 2·1 + (-1)·4 + 3·2 = 2 - 4 + 6 = 4

Notice the negative component: a negative weight times a positive feature subtracts from the total. This is what lets dot products represent trade-offs - some features push the output up, some push it down.
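
To make the recipe explicit, here is a minimal plain-Python sketch of the same two examples (the dot helper below is just for illustration; NumPy's built-in version appears further down):

def dot(a, b):
    # multiply corresponding elements, then sum them up
    return sum(x * y for x, y in zip(a, b))

print(dot([1, 2, 3], [4, 5, 6]))    # → 32
print(dot([2, -1, 3], [1, 4, 2]))   # → 4   (the -1 weight subtracts from the total)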

Interactive: Dot Product - drag the vector tips. Example state: a = [1.8, 1.2], b = [0.8, 1.8], giving a · b = 3.60 at an angle of 32.3° - positive, because the vectors point in similar directions.

The dot product measures alignment. In ML, it's how a neuron "scores" its input — high positive = strong match, near zero = no signal, negative = opposite.

The Geometric Meaning

There is another formula for the same operation:

\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \, \|\mathbf{b}\| \, \cos\theta

  • θ - angle between the two vectors
  • ||a|| - magnitude of vector a
  • ||b|| - magnitude of vector b

This reveals what the dot product actually measures: how much the two vectors "agree" in direction.

  • θ = 0° (parallel, same direction): cos(0°) = 1. Maximum agreement.
  • θ = 90° (perpendicular): cos(90°) = 0. Zero agreement - the vectors carry no information about each other.
  • θ = 180° (opposite directions): cos(180°) = -1. Maximum disagreement.
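
Rearranging the geometric formula gives θ = arccos(a · b / (||a|| ||b||)). A small NumPy sketch that recovers the angle, using the same vectors as the interactive example above:

import numpy as np

a = np.array([1.8, 1.2])
b = np.array([0.8, 1.8])

cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
theta = np.degrees(np.arccos(cos_theta))
print(f"a · b = {np.dot(a, b):.2f}")   # → 3.60  (positive: similar directions)
print(f"angle = {theta:.1f}°")         # → 32.3°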

Dot Products as Weighted Sums

The most useful framing for ML: when w is a vector of weights and x is a vector of features, the dot product is a weighted sum:

\mathbf{w} \cdot \mathbf{x} = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n

  • w_i - weight for feature i, i.e. how much influence that feature has
  • x_i - value of feature i

Each weight decides how much influence each feature has. Positive weight: feature pushes output up. Negative weight: feature pushes output down. Zero weight: feature is ignored entirely.

This is how linear regression makes predictions:

\hat{y} = \mathbf{w} \cdot \mathbf{x} + b

  • w - weight vector, learned from data
  • x - feature vector, one input example
  • b - bias term, a learned offset

The prediction IS the dot product of weights and features. Training finds the weight vector such that w · x_i + b ≈ y_i for all training examples.
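
A minimal sketch of that prediction (the weights, bias, and feature values below are made-up numbers, not a trained model):

import numpy as np

w = np.array([0.4, -0.2, 1.5])   # weight vector (hypothetical learned values)
x = np.array([3.0, 10.0, 1.0])   # feature vector (one hypothetical input example)
b = 0.5                          # bias term (hypothetical)

y_hat = np.dot(w, x) + b         # the prediction is a dot product plus bias
print(f"{y_hat:.2f}")            # → 1.20  (0.4·3 - 0.2·10 + 1.5·1 + 0.5)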

This is how every neuron works:

A single neuron computes z = w · x + b (a dot product plus bias), then applies an activation function: a = σ(z). The neuron is, at its heart, a learned weighted sum.
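
The same computation as a single neuron with a sigmoid activation, again with made-up weights and inputs:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.7, -1.2, 0.3])   # hypothetical learned weights
x = np.array([1.0, 0.5, 2.0])    # one input vector
b = 0.1                          # hypothetical bias

z = np.dot(w, x) + b             # weighted sum: the dot product plus bias
a = sigmoid(z)                   # activation
print(f"z = {z:.2f}, a = {a:.4f}")   # → z = 0.80, a = 0.6900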

Properties You Will Use

\mathbf{a} \cdot \mathbf{b} = \mathbf{b} \cdot \mathbf{a} \qquad \text{(commutative)}

(c\,\mathbf{a}) \cdot \mathbf{b} = c\,(\mathbf{a} \cdot \mathbf{b}) \qquad \text{(scalar factors pull out; } c \text{ is any scalar constant)}

\mathbf{a} \cdot \mathbf{a} = \|\mathbf{a}\|^2 \qquad \text{(self dot product = squared length, for any vector } \mathbf{a}\text{)}

That last one is useful: if you want the squared magnitude of a gradient (for gradient clipping), compute g · g instead of squaring the magnitude separately.
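
A quick numerical check of these properties, plus the squared-gradient-norm trick (the gradient values here are arbitrary):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

print(np.dot(a, b) == np.dot(b, a))                      # True  (commutative)
print(np.isclose(np.dot(2 * a, b), 2 * np.dot(a, b)))    # True  (scalar factors pull out)
print(np.isclose(np.dot(a, a), np.linalg.norm(a) ** 2))  # True  (self dot product = squared length)

# Squared gradient norm for clipping, without a separate norm-then-square step
g = np.array([0.3, -1.2, 0.8])   # hypothetical gradient
print(np.dot(g, g))              # → 2.17  (0.09 + 1.44 + 0.64)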

Dot Products in Attention

The dot product is also the heart of the attention mechanism in transformers. Given a query vector q and a set of key vectors k_i, the attention scores are:

\text{score}(\mathbf{q}, \mathbf{k}_i) = \frac{\mathbf{q} \cdot \mathbf{k}_i}{\sqrt{d_k}}

  • q - query vector
  • k_i - i-th key vector
  • d_k - dimension of the key vectors, used for scaling

High dot product means the query "aligns with" that key - the token is relevant. This single operation determines how much each token attends to every other token in a transformer.

import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Dot product: three equivalent ways
print(np.dot(a, b))      # → 32   (1·4 + 2·5 + 3·6)
print(a @ b)             # → 32   (@ is the matmul operator)
print(sum(a * b))        # → 32   (element-wise multiply then sum)

# Cosine similarity: direction only
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cos similarity: {cos_sim:.4f}")   # → 0.9746 (nearly parallel)

# Attention score for one query–key pair
q = np.array([0.5, 0.3, -0.2])
k = np.array([0.4, 0.8, 0.1])
d_k = len(q)
score = np.dot(q, k) / np.sqrt(d_k)
print(f"attention score: {score:.4f}")


Quiz

Question 1 of 3: What is [1, 2, 3] · [4, 5, 6]?