The Operation That Powers Every Neuron
The dot product is the single most important operation in machine learning. Every neuron in every neural network computes a dot product. It is how linear regression makes predictions. It is how attention works in transformers. It is how similarity is measured in recommendation systems. Learn this well - you will see it everywhere.
What Is a Dot Product?
Given two vectors of the same length, the dot product multiplies corresponding elements and sums them up:
$$a \cdot b = \sum_{i=1}^{n} a_i b_i$$
- $a_i$ - i-th element of vector a
- $b_i$ - i-th element of vector b
- $n$ - vector dimension
The result is a single number - a scalar - NOT another vector. Two vectors in, one number out.
Example 1:
$[1, 2, 3] \cdot [4, 5, 6] = 1 \cdot 4 + 2 \cdot 5 + 3 \cdot 6 = 4 + 10 + 18 = 32$
Example 2:
$[2, -1, 3] \cdot [1, 4, 2] = 2 \cdot 1 + (-1) \cdot 4 + 3 \cdot 2 = 2 - 4 + 6 = 4$
Notice the negative component: a negative weight times a positive feature subtracts from the total. This is what lets dot products represent trade-offs - some features push the output up, some push it down.
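As a sanity check, here is a minimal sketch of the same computation in Python - a hand-rolled loop (the helper name dot is just for illustration) next to NumPy's built-in:
import numpy as np

def dot(a, b):
    # Multiply corresponding elements, then sum - exactly the definition above
    assert len(a) == len(b), "vectors must have the same length"
    return sum(x * y for x, y in zip(a, b))

print(dot([1, 2, 3], [4, 5, 6]))      # 32, matches Example 1
print(dot([2, -1, 3], [1, 4, 2]))     # 4, the negative component subtracts
print(np.dot([1, 2, 3], [4, 5, 6]))   # 32, NumPy agrees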
⬆ Positive - vectors point in similar directions
→ Near zero - vectors are roughly perpendicular
⬇ Negative - vectors point in opposite directions
The dot product measures alignment. In ML, it's how a neuron "scores" its input — high positive = strong match, near zero = no signal, negative = opposite.
The Geometric Meaning
There is another formula for the same operation:
$$a \cdot b = \|a\|\,\|b\|\cos\theta$$
- $\theta$ - angle between the two vectors
- $\|a\|$ - magnitude of vector a
- $\|b\|$ - magnitude of vector b
This reveals what the dot product actually measures: how much the two vectors "agree" in direction.
- $\theta = 0°$ (parallel, same direction): $\cos\theta = 1$, so $a \cdot b = \|a\|\,\|b\|$. Maximum agreement.
- $\theta = 90°$ (perpendicular): $\cos\theta = 0$, so $a \cdot b = 0$. Zero agreement - the vectors carry no information about each other.
- $\theta = 180°$ (opposite directions): $\cos\theta = -1$, so $a \cdot b = -\|a\|\,\|b\|$. Maximum disagreement.
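The two formulas give the same number. A quick check, using the vectors [1, 2, 3] and [4, 5, 6] from the code example at the end of this section:
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

algebraic = np.dot(a, b)                                   # multiply and sum: 32.0

cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
theta = np.arccos(cos_theta)                               # angle between a and b, in radians
geometric = np.linalg.norm(a) * np.linalg.norm(b) * np.cos(theta)

print(algebraic, geometric)      # both ≈ 32.0
print(np.degrees(theta))         # ≈ 12.9 degrees - nearly parallel, hence the large positive value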
Dot Products as Weighted Sums
The most useful framing for ML: when $w$ is a vector of weights and $x$ is a vector of features, the dot product is a weighted sum:
$$w \cdot x = \sum_{i=1}^{n} w_i x_i$$
- $w_i$ - weight for feature i - how much influence that feature has
- $x_i$ - value of feature i
Each weight decides how much influence each feature has. Positive weight: feature pushes output up. Negative weight: feature pushes output down. Zero weight: feature is ignored entirely.
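A toy example of a weighted sum - the feature names and weights below are invented purely for illustration, not taken from any real model:
import numpy as np

# One input example: [size in square meters, age in years, distance to center in km]
x = np.array([120.0, 35.0, 4.0])

# Weights: size pushes the score up, age pushes it down, distance is ignored
w = np.array([2.0, -1.5, 0.0])

score = np.dot(w, x)    # 2.0*120 + (-1.5)*35 + 0.0*4 = 240 - 52.5 + 0 = 187.5
print(score)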
This is how linear regression makes predictions:
$$\hat{y} = w \cdot x + b$$
- $w$ - weight vector - learned from data
- $x$ - feature vector - one input example
- $b$ - bias term - learned offset
The prediction IS the dot product of weights and features. Training finds the weight vector $w$ such that $\hat{y} \approx y$ for all training examples.
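In code, the whole prediction step is one line. A minimal sketch with made-up parameter values:
import numpy as np

w = np.array([0.8, -0.3, 1.2])   # learned weight vector (values invented for illustration)
b = 0.5                          # learned bias
x = np.array([2.0, 1.0, 0.5])    # one input example

y_hat = np.dot(w, x) + b         # prediction = dot product of weights and features, plus bias
print(y_hat)                     # 1.6 - 0.3 + 0.6 + 0.5 = 2.4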
This is how every neuron works:
A single neuron computes $z = w \cdot x + b$ (dot product plus bias), then applies an activation function: $\text{output} = \sigma(z)$. The neuron is, at its heart, a learned weighted sum.
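A minimal sketch of one neuron, with ReLU chosen arbitrarily as the activation and made-up weights:
import numpy as np

def neuron(x, w, b):
    z = np.dot(w, x) + b         # pre-activation: the learned weighted sum plus bias
    return np.maximum(z, 0.0)    # ReLU activation; any other activation slots in the same way

w = np.array([0.5, -1.0, 2.0])
b = -0.1
x = np.array([1.0, 0.5, 0.25])

print(neuron(x, w, b))           # ReLU(0.5 - 0.5 + 0.5 - 0.1) = 0.4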
Properties You Will Use
$$a \cdot b = b \cdot a$$
$$(c\,a) \cdot b = c\,(a \cdot b)$$
$$a \cdot a = \|a\|^2$$
- $c$ - any scalar constant
- $a$, $b$ - any vectors of the same dimension
That last one is useful: if you want the squared magnitude of a gradient $g$ (for gradient clipping), compute $g \cdot g$ instead of computing the norm and then squaring it separately.
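A quick numerical check of these properties, plus the gradient trick (the vectors are arbitrary):
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
c = 2.5

print(np.isclose(np.dot(a, b), np.dot(b, a)))             # True: order does not matter
print(np.isclose(np.dot(c * a, b), c * np.dot(a, b)))     # True: scalars factor out
print(np.isclose(np.dot(a, a), np.linalg.norm(a) ** 2))   # True: a·a is the squared magnitude

g = np.array([0.1, -0.4, 0.2])   # a made-up gradient
print(np.dot(g, g))              # ≈ 0.21, squared magnitude without a separate norm-then-square step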
Dot Products in Attention
The dot product is also the heart of the attention mechanism in transformers. Given a query vector $q$ and a set of key vectors $k_1, \dots, k_n$, the attention scores are:
$$\text{score}_i = \frac{q \cdot k_i}{\sqrt{d_k}}$$
- $q$ - query vector
- $k_i$ - i-th key vector
- $d_k$ - dimension of key vectors - used for scaling
High dot product means the query "aligns with" that key - the token is relevant. This single operation determines how much each token attends to every other token in a transformer.
import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
# Dot product: three equivalent ways
print(np.dot(a, b)) # → 32 (1·4 + 2·5 + 3·6)
print(a @ b) # → 32 (@ is the matmul operator)
print(sum(a * b)) # → 32 (element-wise multiply then sum)
# Cosine similarity: direction only
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cos similarity: {cos_sim:.4f}") # → 0.9746 (nearly parallel)
# Attention score for one query–key pair
q = np.array([0.5, 0.3, -0.2])
k = np.array([0.4, 0.8, 0.1])
d_k = len(q)
score = np.dot(q, k) / np.sqrt(d_k)
print(f"attention score: {score:.4f}")
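Extending the last snippet from one key to several, with a softmax turning raw scores into attention weights (the numbers are arbitrary):
import numpy as np

q = np.array([0.5, 0.3, -0.2])                # one query
K = np.array([[0.4, 0.8, 0.1],                # three keys, one per row
              [0.9, -0.1, 0.3],
              [-0.5, 0.2, 0.7]])
d_k = q.shape[0]

scores = K @ q / np.sqrt(d_k)                 # one scaled dot product per key
weights = np.exp(scores) / np.exp(scores).sum()   # softmax: scores -> attention weights

print(scores.round(4))                        # [ 0.2425  0.2078 -0.1905]
print(weights.round(4), weights.sum())        # positive weights that sum to 1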
Interactive example
Attention score demo - adjust query and key vectors to see how dot products create attention patterns
Coming soon