

The Dot Product: The Operation That Powers Every Neuron

The algebraic definition, the geometric meaning as alignment, the weighted-sum interpretation, and how every linear model prediction is a dot product.


Quick refresher

Vectors

A vector is an ordered list of numbers. You add vectors element-wise and multiply by scalars element-wise. The magnitude of [a, b] is √(a²+b²).

Example

[1, 2] + [3, 4] = [4, 6].

2·[1, 2] = [2, 4].

||[3, 4]|| = 5.
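
A quick NumPy check of the same three operations (this snippet is only an illustration; the values mirror the examples above):

import numpy as np

a = np.array([1, 2])
b = np.array([3, 4])

print(a + b)                    # → [4 6]   element-wise addition
print(2 * a)                    # → [2 4]   scalar multiplication
print(np.linalg.norm([3, 4]))   # → 5.0     magnitude √(3² + 4²)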

The Operation That Powers Every Neuron

The dot product is the single most important operation in machine learning. Every neuron in every neural network computes a dot product. It is how linear regression makes predictions. It is how attention works in transformers. It is how similarity is measured in recommendation systems. Learn this well - you will see it everywhere.

What Is a Dot Product?

Given two vectors of the same length, the dot product multiplies corresponding elements and sums them up:

\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n

  • a_i - i-th element of vector a
  • b_i - i-th element of vector b
  • n - vector dimension

The result is a single number - a scalar - NOT another vector. Two vectors in, one number out.

Example 1: [1, 2, 3] · [4, 5, 6] = 1·4 + 2·5 + 3·6 = 4 + 10 + 18 = 32

Example 2: [2, -1, 3] · [1, 4, 2] = 2·1 + (-1)·4 + 3·2 = 2 - 4 + 6 = 4

Notice the negative component: a negative weight times a positive feature subtracts from the total. This is what lets dot products represent trade-offs - some features push the output up, some push it down.
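
To make the recipe explicit, here is a minimal plain-Python sketch of the same two examples (the dot helper below is just for illustration; NumPy's built-in version appears further down):

def dot(a, b):
    # multiply corresponding elements, then sum them up
    return sum(x * y for x, y in zip(a, b))

print(dot([1, 2, 3], [4, 5, 6]))    # → 32
print(dot([2, -1, 3], [1, 4, 2]))   # → 4   (the -1 weight subtracts from the total)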

Interactive: Dot Product - drag the vector tips. Example state: a = [1.8, 1.2], b = [0.8, 1.8], giving a · b = 3.60 at an angle of 32.3° - positive, because the vectors point in similar directions.

The dot product measures alignment. In ML, it's how a neuron "scores" its input — high positive = strong match, near zero = no signal, negative = opposite.

The Geometric Meaning

There is another formula for the same operation:

\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \, \|\mathbf{b}\| \, \cos\theta

  • θ - angle between the two vectors
  • ||a|| - magnitude of vector a
  • ||b|| - magnitude of vector b

This reveals what the dot product actually measures: how much the two vectors "agree" in direction.

  • θ = 0° (parallel, same direction): cos(0°) = 1. Maximum agreement.
  • θ = 90° (perpendicular): cos(90°) = 0. Zero agreement - the vectors carry no information about each other.
  • θ = 180° (opposite directions): cos(180°) = -1. Maximum disagreement.
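
Rearranging the geometric formula gives θ = arccos(a · b / (||a|| ||b||)). A small NumPy sketch that recovers the angle, using the same vectors as the interactive example above:

import numpy as np

a = np.array([1.8, 1.2])
b = np.array([0.8, 1.8])

cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
theta = np.degrees(np.arccos(cos_theta))
print(f"a · b = {np.dot(a, b):.2f}")   # → 3.60  (positive: similar directions)
print(f"angle = {theta:.1f}°")         # → 32.3°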

Dot Products as Weighted Sums

The most useful framing for ML: when w is a vector of weights and x is a vector of features, the dot product is a weighted sum:

\mathbf{w} \cdot \mathbf{x} = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n

  • w_i - weight for feature i, i.e. how much influence that feature has
  • x_i - value of feature i

Each weight decides how much influence each feature has. Positive weight: feature pushes output up. Negative weight: feature pushes output down. Zero weight: feature is ignored entirely.

This is how linear regression makes predictions:

\hat{y} = \mathbf{w} \cdot \mathbf{x} + b

  • w - weight vector, learned from data
  • x - feature vector, one input example
  • b - bias term, a learned offset

The prediction IS the dot product of weights and features. Training finds the weight vector such that w · x_i + b ≈ y_i for all training examples.
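
A minimal sketch of that prediction (the weights, bias, and feature values below are made-up numbers, not a trained model):

import numpy as np

w = np.array([0.4, -0.2, 1.5])   # weight vector (hypothetical learned values)
x = np.array([3.0, 10.0, 1.0])   # feature vector (one hypothetical input example)
b = 0.5                          # bias term (hypothetical)

y_hat = np.dot(w, x) + b         # the prediction is a dot product plus bias
print(f"{y_hat:.2f}")            # → 1.20  (0.4·3 - 0.2·10 + 1.5·1 + 0.5)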

This is how every neuron works:

A single neuron computes z = w · x + b (a dot product plus bias), then applies an activation function: a = σ(z). The neuron is, at its heart, a learned weighted sum.
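
The same computation as a single neuron with a sigmoid activation, again with made-up weights and inputs:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.7, -1.2, 0.3])   # hypothetical learned weights
x = np.array([1.0, 0.5, 2.0])    # one input vector
b = 0.1                          # hypothetical bias

z = np.dot(w, x) + b             # weighted sum: the dot product plus bias
a = sigmoid(z)                   # activation
print(f"z = {z:.2f}, a = {a:.4f}")   # → z = 0.80, a = 0.6900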

Properties You Will Use

\mathbf{a} \cdot \mathbf{b} = \mathbf{b} \cdot \mathbf{a} \qquad \text{(commutative)}

(c\,\mathbf{a}) \cdot \mathbf{b} = c\,(\mathbf{a} \cdot \mathbf{b}) \qquad \text{(scalar factors pull out; } c \text{ is any scalar constant)}

\mathbf{a} \cdot \mathbf{a} = \|\mathbf{a}\|^2 \qquad \text{(self dot product = squared length, for any vector } \mathbf{a}\text{)}

That last one is useful: if you want the squared magnitude of a gradient (for gradient clipping), compute g · g instead of squaring the magnitude separately.
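
A quick numerical check of these properties, plus the squared-gradient-norm trick (the gradient values here are arbitrary):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

print(np.dot(a, b) == np.dot(b, a))                      # True  (commutative)
print(np.isclose(np.dot(2 * a, b), 2 * np.dot(a, b)))    # True  (scalar factors pull out)
print(np.isclose(np.dot(a, a), np.linalg.norm(a) ** 2))  # True  (self dot product = squared length)

# Squared gradient norm for clipping, without a separate norm-then-square step
g = np.array([0.3, -1.2, 0.8])   # hypothetical gradient
print(np.dot(g, g))              # → 2.17  (0.09 + 1.44 + 0.64)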

Dot Products in Attention

The dot product is also the heart of the attention mechanism in transformers. Given a query vector q and a set of key vectors k_i, the attention scores are:

\text{score}(\mathbf{q}, \mathbf{k}_i) = \frac{\mathbf{q} \cdot \mathbf{k}_i}{\sqrt{d_k}}

  • q - query vector
  • k_i - i-th key vector
  • d_k - dimension of the key vectors, used for scaling

High dot product means the query "aligns with" that key - the token is relevant. This single operation determines how much each token attends to every other token in a transformer.

import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Dot product: three equivalent ways
print(np.dot(a, b))      # → 32   (1·4 + 2·5 + 3·6)
print(a @ b)             # → 32   (@ is the matmul operator)
print(sum(a * b))        # → 32   (element-wise multiply then sum)

# Cosine similarity: direction only
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cos similarity: {cos_sim:.4f}")   # → 0.9746 (nearly parallel)

# Attention score for one query–key pair
q = np.array([0.5, 0.3, -0.2])
k = np.array([0.4, 0.8, 0.1])
d_k = len(q)
score = np.dot(q, k) / np.sqrt(d_k)
print(f"attention score: {score:.4f}")


Quiz

Question 1 of 3: What is [1, 2, 3] · [4, 5, 6]?