What Is a Distribution?
A single probability - "the chance of heads is 0.5" - describes one event. A distribution describes all events at once: the probability of every possible value a random variable could take.
Give me the distribution and I can tell you: how likely any outcome is, the average outcome, the spread, and how rare extreme values are.
In machine learning, distributions are everywhere: the Gaussian shows up in weight initialization and regularization, the Bernoulli underlies binary classification, and the categorical distribution is what a softmax layer produces. Understanding distributions means understanding what your model is actually outputting and what assumptions are baked into your loss function.
Uniform Distribution
The simplest distribution: every value in a range is equally likely.
Notation: $X \sim \mathrm{Uniform}(a, b)$ means $X$ is equally likely to be any value between $a$ and $b$.
For a discrete uniform distribution - like a fair die with $n$ faces - each value has probability $1/n$. For a continuous uniform distribution over $[a, b]$, any interval of length $\ell$ has probability $\ell/(b-a)$.
In ML: uniform distributions appear in weight initialization schemes and in random shuffling of training data. Some initialization strategies (like LeCun uniform) sample weights uniformly from $[-\sqrt{3/n_{\text{in}}},\ \sqrt{3/n_{\text{in}}}]$, where $n_{\text{in}}$ is the number of inputs to the layer.
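Here is a minimal NumPy sketch of LeCun uniform initialization under that rule (the layer sizes are made up for illustration):

```python
import numpy as np

# LeCun uniform: sample from [-limit, limit] with limit = sqrt(3 / n_in),
# which gives each weight variance limit**2 / 3 = 1 / n_in.
n_in, n_out = 256, 128   # hypothetical layer dimensions
limit = np.sqrt(3.0 / n_in)
W = np.random.uniform(low=-limit, high=limit, size=(n_in, n_out))

print(W.var(), 1.0 / n_in)   # both ≈ 0.0039
```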
The Gaussian (Normal) Distribution
The most important distribution in all of statistics. You cannot go far in probability, statistics, or ML without it.
Notation: $X \sim \mathcal{N}(\mu, \sigma^2)$
- Parameter $\mu$: mean - where the bell is centered
- Parameter $\sigma^2$: variance - how spread out the bell is
- Parameter $\sigma$: standard deviation $= \sqrt{\sigma^2}$, in the same units as $X$
The probability density function:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
- $\mu$ - mean - center of the bell curve
- $\sigma$ - standard deviation - width of the bell
The $(x - \mu)^2$ in the exponent means: values far from the mean have exponentially smaller probability density. The bell falls off symmetrically in both directions.
The 68-95-99.7 rule: for any Gaussian, about 68% of the probability mass lies within $1\sigma$ of the mean, about 95% within $2\sigma$, and about 99.7% within $3\sigma$.
The standard normal $\mathcal{N}(0, 1)$ has mean 0 and standard deviation 1. Any Gaussian can be standardized: $z = (x - \mu)/\sigma$. The z-score tells you how many standard deviations above or below the mean a value is.
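The rule is easy to verify numerically. A quick check with scipy.stats (a sketch, not part of the original text):

```python
from scipy import stats

# Mass within k standard deviations of the mean for any Gaussian:
# P(|X - mu| < k*sigma) = cdf(k) - cdf(-k) under the standard normal.
for k in (1, 2, 3):
    mass = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"within {k} sigma: {mass:.4f}")
# → 0.6827, 0.9545, 0.9973
```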
Interactive example
Gaussian distribution explorer - drag mu and sigma to see how the bell shape changes
Coming soon
Why Gaussians Appear Everywhere
The central limit theorem (CLT): if you add together many independent random effects, each small compared to the total, the sum tends toward a Gaussian distribution, regardless of the distribution of the individual effects.
Intuition: Human height is determined by thousands of genetic, nutritional, and environmental factors. Each one nudges height up or down by a tiny amount, more or less independently. Add them together. The CLT says the result should be approximately Gaussian - and indeed, heights follow a bell curve.
The same logic applies to measurement errors, stock price daily changes, and many physical measurements.
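You can watch the CLT happen in a few lines of NumPy. Sums of uniform draws - individually flat, nothing like a bell - come out approximately Gaussian (the sample counts here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Each of 100,000 trials sums 30 independent U(0,1) draws.
# U(0,1) has mean 0.5 and variance 1/12, so each sum has
# mean 30 * 0.5 = 15 and std sqrt(30/12) ≈ 1.58.
sums = rng.uniform(size=(100_000, 30)).sum(axis=1)
print(sums.mean(), sums.std())   # ≈ 15.0, ≈ 1.58

# Fraction within 1 std of the mean - close to the Gaussian 68%:
z = (sums - sums.mean()) / sums.std()
print((np.abs(z) < 1).mean())    # ≈ 0.68
```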
What the CLT does NOT say: it does not promise that everything is Gaussian. It requires independence and that no single factor dominates, and not all distributions are Gaussian - images, text, and many real-world signals are definitely not bell curves. But Gaussian assumptions are often a reasonable first approximation and make the math tractable.
Five ML Uses of the Gaussian
1. Weight initialization: $W \sim \mathcal{N}(0, \sigma^2)$, with $\sigma^2$ on the order of $1/n_{\text{in}}$ (e.g. $2/n_{\text{in}}$ in He initialization)
- $\sigma^2$ - initial weight variance
- $n_{\text{in}}$ - number of inputs to the layer
Starting near zero prevents activations from exploding before training begins. Using a distribution (rather than all zeros) breaks the symmetry between neurons, so they can learn different features.
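As a sketch, He-style Gaussian initialization in NumPy (the layer shape is hypothetical):

```python
import numpy as np

n_in, n_out = 512, 256                 # hypothetical layer shape
sigma = np.sqrt(2.0 / n_in)            # He initialization (ReLU layers)
W = np.random.normal(loc=0.0, scale=sigma, size=(n_in, n_out))

print(W.std())                         # ≈ sqrt(2/512) ≈ 0.0625
```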
2. MSE loss assumes Gaussian noise:
If you believe your targets satisfy $y = f(x) + \epsilon$ where $\epsilon \sim \mathcal{N}(0, \sigma^2)$, then maximizing the likelihood of your training data gives exactly least-squares regression. MSE "assumes" Gaussian errors.
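To see why, write the negative log-likelihood of one target under that noise model, using the Gaussian PDF above:

$$-\log p(y \mid x) = \frac{(y - f(x))^2}{2\sigma^2} + \log\!\left(\sigma\sqrt{2\pi}\right)$$

The second term does not depend on $f$, so maximizing the likelihood over the dataset is exactly minimizing the sum of squared errors.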
3. Batch normalization:
Forces each layer's activations toward $\mathcal{N}(0, 1)$ during training, keeping distributions stable across layers and making training much faster.
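A minimal sketch of the normalization step (the full layer also applies a learned scale and shift, omitted here):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize each feature (column) to mean 0, variance 1 over the batch.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

acts = np.random.normal(loc=3.0, scale=4.0, size=(64, 10))  # shifted, stretched
normed = batch_norm(acts)
print(normed.mean(axis=0).round(3))   # ≈ 0 for every feature
print(normed.std(axis=0).round(3))    # ≈ 1 for every feature
```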
4. Variational autoencoders (VAEs):
The latent space is explicitly Gaussian. The encoder maps each input to a distribution $\mathcal{N}(\mu, \sigma^2)$ rather than a single point. Sampling from this distribution produces diverse outputs.
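Sampling is typically done with the reparameterization trick. A hedged NumPy sketch, where mu and sigma stand in for encoder outputs:

```python
import numpy as np

# Hypothetical encoder outputs for one input:
mu = np.array([0.5, -1.2])     # latent mean
sigma = np.array([0.3, 0.8])   # latent standard deviation

# Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I).
# All randomness lives in eps, so gradients can flow through mu and sigma.
eps = np.random.normal(size=mu.shape)
z = mu + sigma * eps
print(z)                       # a different latent sample on each call
```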
5. L2 regularization = Gaussian prior:
From a Bayesian perspective, L2 regularization (adding $\lambda \|w\|^2$ to the loss) is equivalent to assuming a zero-mean Gaussian prior on the weights. It mathematically encodes the belief that "weights should be small."
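The correspondence is one line of algebra. Take the log of a zero-mean Gaussian prior $p(w) \propto \exp(-\|w\|^2 / 2\tau^2)$:

$$-\log p(w) = \frac{\|w\|^2}{2\tau^2} + \text{const}$$

so maximizing the posterior (likelihood times prior) is minimizing the usual loss plus a penalty proportional to $\|w\|^2$ - exactly L2 regularization, with $\lambda$ set by the prior width $\tau$.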
```python
import numpy as np
from scipy import stats

# Sample from common distributions
n = 1000
normal_samples = np.random.normal(loc=0, scale=1, size=n)     # N(0,1)
uniform_samples = np.random.uniform(low=0, high=1, size=n)    # U(0,1)
bernoulli_samples = np.random.binomial(n=1, p=0.3, size=n)    # Bernoulli(0.3)

# Compute stats
print(f"N(0,1) — mean: {normal_samples.mean():.3f}, std: {normal_samples.std():.3f}")
print(f"U(0,1) — mean: {uniform_samples.mean():.3f}, std: {uniform_samples.std():.3f}")
print(f"Bernoulli(0.3) — mean: {bernoulli_samples.mean():.3f}")

# Gaussian PDF: the density (not a probability) at x=1 under N(0,1)
p = stats.norm.pdf(1.0, loc=0, scale=1)
print(f"pdf(1.0) under N(0,1) ≈ {p:.4f}")  # → 0.2420

# Standardize a value: (x - mean) / std
x, mu, sigma = 72.0, 68.0, 5.0
z_score = (x - mu) / sigma
print(f"z-score: {z_score:.2f}")  # → 0.80
```
Interactive example
Distribution comparison - compare Uniform, Gaussian, and other distributions side by side
Coming soon