Probability Distributions: Uniform, Gaussian, and Why They Appear in ML

Uniform and Gaussian distributions, the 68-95-99.7 rule, the Central Limit Theorem, and five concrete ML applications of the Gaussian.


Quick refresher

Expected value and variance

E[X] = Σxᵢ·P(xᵢ) - the probability-weighted average. Var(X) = E[(X-μ)²] - expected squared deviation from the mean. Standard deviation σ = √Var(X).

Example

Fair die: E[roll] = 3.5.

Var[roll] = E[X²] - 3.5² = 15.17 - 12.25 = 2.92.

σ ≈ 1.71.
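
A quick numerical check of those numbers, as a small sketch in NumPy:

import numpy as np

# Outcomes of a fair six-sided die, each with probability 1/6
faces = np.arange(1, 7)
p = np.full(6, 1/6)

mean = np.sum(faces * p)                # E[X] = 3.5
var = np.sum((faces - mean)**2 * p)     # Var(X) = E[(X - mu)^2] ≈ 2.92
print(mean, var, np.sqrt(var))          # 3.5  2.917  1.708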

What Is a Distribution?

A single probability - "the chance of heads is 0.5" - describes one event. A distribution describes all events at once: the probability of every possible value a random variable could take.

Give me the distribution and I can tell you: how likely any outcome is, the average outcome, the spread, and how rare extreme values are.

In machine learning, distributions are everywhere: the Gaussian shows up in weight initialization and regularization, the Bernoulli underlies binary classification, and the categorical distribution is what a softmax layer produces. Understanding distributions means understanding what your model is actually outputting and what assumptions are baked into your loss function.

Uniform Distribution

The simplest distribution: every value in a range is equally likely.

Notation: X ~ Uniform(a, b) means X is equally likely to be any value between a and b.

For a discrete uniform distribution - like a fair die with k faces - each value has probability 1/k. For a continuous uniform distribution over [0, 1], any interval of length L has probability L.

In ML: uniform distributions appear in weight initialization schemes and in random shuffling of training data. Some initialization strategies (like LeCun uniform) sample weights uniformly from [-√(1/n), √(1/n)], where n is the number of inputs to the layer.
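
As a sketch of the idea (not any particular framework's implementation; the helper name lecun_uniform is just for illustration):

import numpy as np

rng = np.random.default_rng(0)

def lecun_uniform(n_in, n_out):
    # Sample weights uniformly from [-sqrt(1/n_in), +sqrt(1/n_in)]
    limit = np.sqrt(1.0 / n_in)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W = lecun_uniform(n_in=256, n_out=128)
print(W.min(), W.max())   # both within ±sqrt(1/256) = ±0.0625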

The Gaussian (Normal) Distribution

The most important distribution in all of statistics. You cannot go far in probability, statistics, or ML without it.

Notation: X ~ N(μ, σ²)

  • Parameter μ: mean - where the bell is centered
  • Parameter σ²: variance - how spread out the bell is
  • Parameter σ: standard deviation = √σ², in the same units as X

The probability density function:

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

where μ is the mean (the center of the bell curve) and σ is the standard deviation (the width of the bell).

The (x - μ)² in the exponent means: values far from the mean have exponentially smaller probability. The bell falls off symmetrically in both directions.

The 68-95-99.7 rule: for any Gaussian, about 68% of the probability mass lies within 1σ of the mean, about 95% within 2σ, and about 99.7% within 3σ.

The standard normal N(0, 1) has mean 0 and standard deviation 1. Any Gaussian can be standardized: z = (x - μ)/σ. The z-score tells you how many standard deviations above or below the mean a value is.
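
You can check the 68-95-99.7 rule directly from the Gaussian CDF; a small sketch with scipy.stats:

from scipy import stats

# Probability mass within k standard deviations of the mean, for any Gaussian
for k in (1, 2, 3):
    mass = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"within {k} sigma: {mass:.3f}")   # 0.683, 0.954, 0.997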

Interactive example (coming soon): Gaussian distribution explorer - drag μ and σ to see how the bell shape changes.

Why Gaussians Appear Everywhere

The Central Limit Theorem (CLT): if you add together many independent random effects, each small compared to the total, the sum tends toward a Gaussian distribution, regardless of the distribution of the individual effects.

Intuition: Human height is determined by thousands of genetic, nutritional, and environmental factors. Each one nudges height up or down by a tiny amount, more or less independently. Add them together. The CLT says the result should be approximately Gaussian - and indeed, heights follow a bell curve.

The same logic applies to measurement errors, stock price daily changes, and many physical measurements.

What the CLT does NOT say: that everything is Gaussian. It requires independence and that no single effect dominates, and many real-world signals - images, text, and more - are definitely not bell curves. But Gaussian assumptions are often a reasonable first approximation and make the math tractable.
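
A quick way to see the CLT in action, as a sketch using Uniform(0, 1) summands (any distribution with finite variance would do):

import numpy as np

rng = np.random.default_rng(0)

# Each row: the sum of 50 independent Uniform(0, 1) draws
sums = rng.uniform(0, 1, size=(100_000, 50)).sum(axis=1)

# The sums cluster around 25 with a roughly Gaussian shape
print(sums.mean(), sums.std())           # ≈ 25.0, ≈ sqrt(50/12) ≈ 2.04
inside = np.abs(sums - sums.mean()) < sums.std()
print(inside.mean())                     # ≈ 0.68, as the 68-95-99.7 rule predicts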

Five ML Uses of the Gaussian

1. Weight initialization:

W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}}}\right) \quad \text{(He initialization for ReLU)}

where n_in is the number of inputs to the layer, so the initial weight variance is σ²_init = 2/n_in.

Starting near zero prevents activations from exploding before training begins. Using a distribution (rather than all zeros) breaks the symmetry between neurons, so they can learn different features.
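
A minimal sketch of He initialization in NumPy (deep learning frameworks provide this built in; the helper name he_normal is just for illustration):

import numpy as np

rng = np.random.default_rng(0)

def he_normal(n_in, n_out):
    # Zero-mean Gaussian with variance 2 / n_in, i.e. std = sqrt(2 / n_in)
    std = np.sqrt(2.0 / n_in)
    return rng.normal(loc=0.0, scale=std, size=(n_in, n_out))

W = he_normal(n_in=512, n_out=256)
print(W.std())   # ≈ sqrt(2/512) ≈ 0.0625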

2. MSE loss assumes Gaussian noise:

If you believe your targets are y = f(x) + ε where ε ~ N(0, σ²), then maximizing the likelihood of your training data gives exactly least-squares regression. MSE "assumes" Gaussian errors.
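
One line of algebra makes the connection explicit: under that noise assumption, the negative log-likelihood of a single example is

-\log p(y \mid x) = \frac{(y - f(x))^2}{2\sigma^2} + \log\left(\sigma\sqrt{2\pi}\right)

The second term does not depend on f, so maximizing the likelihood over the dataset is the same as minimizing Σᵢ (yᵢ - f(xᵢ))², i.e. the mean squared error.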

3. Batch normalization:

Forces each layer's activations toward N(0, 1) during training, keeping distributions stable across layers and making training much faster.
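
A minimal sketch of the normalization step (real batch norm also tracks running statistics for inference; gamma and beta stand in for the learned per-feature scale and shift):

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch_size, features). Normalize each feature to roughly N(0, 1)...
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # ...then let the network rescale and shift with learned gamma, beta
    return gamma * x_hat + beta

x = np.random.default_rng(0).normal(3.0, 10.0, size=(64, 8))
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # ≈ 0 and ≈ 1 per feature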

4. Variational autoencoders (VAEs):

The latent space is explicitly Gaussian. The encoder maps each input to a distribution N(μ, σ²) rather than a single point. Sampling from this distribution produces diverse outputs.
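
The sampling step is usually written with the reparameterization trick so gradients can flow through it. A minimal sketch, assuming the encoder has produced mu and log_var for one input:

import numpy as np

rng = np.random.default_rng(0)

def sample_latent(mu, log_var):
    # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, 1)
    sigma = np.exp(0.5 * log_var)
    eps = rng.normal(size=mu.shape)
    return mu + sigma * eps

# Pretend the encoder produced these for one input
mu, log_var = np.array([0.2, -1.0]), np.array([0.0, -2.0])
print(sample_latent(mu, log_var))   # a different latent vector on every call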

5. L2 regularization = Gaussian prior:

From a Bayesian perspective, L2 regularization (adding λ‖w‖² to the loss) is equivalent to assuming a Gaussian prior N(0, 1/(2λ)) on the weights. It mathematically encodes the belief that "weights should be small."
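
The correspondence is one line of algebra: with each weight drawn from N(0, 1/(2λ)), the negative log of the prior is

-\log p(\mathbf{w}) = \lambda \lVert \mathbf{w} \rVert^2 + \text{const}

so adding the negative log-prior to the loss is exactly the L2 penalty.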

import numpy as np
from scipy import stats

# Sample from common distributions
n = 1000
normal_samples   = np.random.normal(loc=0, scale=1, size=n)   # N(0,1)
uniform_samples  = np.random.uniform(low=0, high=1, size=n)   # U(0,1)
bernoulli_samples = np.random.binomial(n=1, p=0.3, size=n)    # Bernoulli(0.3)

# Compute stats
print(f"N(0,1) — mean: {normal_samples.mean():.3f}, std: {normal_samples.std():.3f}")
print(f"U(0,1) — mean: {uniform_samples.mean():.3f}, std: {uniform_samples.std():.3f}")
print(f"Bernoulli(0.3) mean: {bernoulli_samples.mean():.3f}")

# Gaussian PDF: density at x=1 under N(0,1) (a density value, not a probability)
p = stats.norm.pdf(1.0, loc=0, scale=1)
print(f"pdf(1.0) under N(0,1) ≈ {p:.4f}")   # → 0.2420

# Standardize a value: (x - mean) / std
x, mu, sigma = 72.0, 68.0, 5.0
z_score = (x - mu) / sigma
print(f"z-score: {z_score:.2f}")       # → 0.80

Interactive example (coming soon): Distribution comparison - compare Uniform, Gaussian, and other distributions side by side.

Quiz (question 1 of 3)

In a Gaussian distribution N(μ=0, σ=1), where is most of the probability mass?