Probability Distributions: Uniform, Gaussian, and Why They Appear in ML

Uniform and Gaussian distributions, the 68-95-99.7 rule, the Central Limit Theorem, and five concrete ML applications of the Gaussian.


Quick refresher

Expected value and variance

E[X] = Σxᵢ·P(xᵢ) - the probability-weighted average. Var(X) = E[(X-μ)²] - expected squared deviation from the mean. Standard deviation σ = √Var(X).

Example

Fair die: E[roll] = 3.5.

Var[roll] = E[X²] - 3.5² = 15.17 - 12.25 = 2.92.

σ ≈ 1.71.
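
A quick numerical check of those numbers, as a small sketch in NumPy:

import numpy as np

# Outcomes of a fair six-sided die, each with probability 1/6
faces = np.arange(1, 7)
p = np.full(6, 1/6)

mean = np.sum(faces * p)                # E[X] = 3.5
var = np.sum((faces - mean)**2 * p)     # Var(X) = E[(X - mu)^2] ≈ 2.92
print(mean, var, np.sqrt(var))          # 3.5  2.917  1.708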

What Is a Distribution?

A single probability - "the chance of heads is 0.5" - describes one event. A distribution describes all events at once: the probability of every possible value a random variable could take.

Give me the distribution and I can tell you: how likely any outcome is, the average outcome, the spread, and how rare extreme values are.

In machine learning, distributions are everywhere: the Gaussian shows up in weight initialization and regularization, the Bernoulli underlies binary classification, and the categorical distribution is what a softmax layer produces. Understanding distributions means understanding what your model is actually outputting and what assumptions are baked into your loss function.

Uniform Distribution

The simplest distribution: every value in a range is equally likely.

Notation: X ~ Uniform(a, b) means X is equally likely to be any value between a and b.

For a discrete uniform distribution - like a fair die with k faces - each value has probability 1/k. For a continuous uniform distribution over [0, 1], any interval of length L has probability L.

In ML: uniform distributions appear in weight initialization schemes and in random shuffling of training data. Some initialization strategies (like LeCun uniform) sample weights uniformly from [-√(1/n), √(1/n)], where n is the number of inputs to the layer.
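
As a sketch of the idea (not any particular framework's implementation; the helper name lecun_uniform is just for illustration):

import numpy as np

rng = np.random.default_rng(0)

def lecun_uniform(n_in, n_out):
    # Sample weights uniformly from [-sqrt(1/n_in), +sqrt(1/n_in)]
    limit = np.sqrt(1.0 / n_in)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W = lecun_uniform(n_in=256, n_out=128)
print(W.min(), W.max())   # both within ±sqrt(1/256) = ±0.0625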

The Gaussian (Normal) Distribution

The most important distribution in all of statistics. You cannot go far in probability, statistics, or ML without it.

Notation: X ~ N(μ, σ²)

  • Parameter μ: mean - where the bell is centered
  • Parameter σ²: variance - how spread out the bell is
  • Parameter σ: standard deviation = √σ², in the same units as X

The probability density function:

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

where μ is the mean (the center of the bell curve) and σ is the standard deviation (the width of the bell).

The (x - μ)² in the exponent means: values far from the mean have exponentially smaller probability. The bell falls off symmetrically in both directions.

The 68-95-99.7 rule: for any Gaussian, about 68% of the probability mass lies within 1σ of the mean, about 95% within 2σ, and about 99.7% within 3σ.

The standard normal N(0, 1) has mean 0 and standard deviation 1. Any Gaussian can be standardized: z = (x - μ)/σ. The z-score tells you how many standard deviations above or below the mean a value is.
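
You can check the 68-95-99.7 rule directly from the Gaussian CDF; a small sketch with scipy.stats:

from scipy import stats

# Probability mass within k standard deviations of the mean, for any Gaussian
for k in (1, 2, 3):
    mass = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"within {k} sigma: {mass:.3f}")   # 0.683, 0.954, 0.997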

Interactive example (coming soon): Gaussian distribution explorer - drag μ and σ to see how the bell shape changes.

Why Gaussians Appear Everywhere

The Central Limit Theorem (CLT): if you add together many independent random effects, each small compared to the total, the sum tends toward a Gaussian distribution, regardless of the distribution of the individual effects.

Intuition: Human height is determined by thousands of genetic, nutritional, and environmental factors. Each one nudges height up or down by a tiny amount, more or less independently. Add them together. The CLT says the result should be approximately Gaussian - and indeed, heights follow a bell curve.

The same logic applies to measurement errors, stock price daily changes, and many physical measurements.

What the CLT does NOT say: that everything is Gaussian. It requires independence and that no single effect dominates, and many real-world signals - images, text, and more - are definitely not bell curves. But Gaussian assumptions are often a reasonable first approximation and make the math tractable.
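
A quick way to see the CLT in action, as a sketch using Uniform(0, 1) summands (any distribution with finite variance would do):

import numpy as np

rng = np.random.default_rng(0)

# Each row: the sum of 50 independent Uniform(0, 1) draws
sums = rng.uniform(0, 1, size=(100_000, 50)).sum(axis=1)

# The sums cluster around 25 with a roughly Gaussian shape
print(sums.mean(), sums.std())           # ≈ 25.0, ≈ sqrt(50/12) ≈ 2.04
inside = np.abs(sums - sums.mean()) < sums.std()
print(inside.mean())                     # ≈ 0.68, as the 68-95-99.7 rule predicts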

Five ML Uses of the Gaussian

1. Weight initialization:

W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}}}\right) \quad \text{(He initialization for ReLU)}

where n_in is the number of inputs to the layer, so the initial weight variance is σ²_init = 2/n_in.

Starting near zero prevents activations from exploding before training begins. Using a distribution (rather than all zeros) breaks the symmetry between neurons, so they can learn different features.
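
A minimal sketch of He initialization in NumPy (deep learning frameworks provide this built in; the helper name he_normal is just for illustration):

import numpy as np

rng = np.random.default_rng(0)

def he_normal(n_in, n_out):
    # Zero-mean Gaussian with variance 2 / n_in, i.e. std = sqrt(2 / n_in)
    std = np.sqrt(2.0 / n_in)
    return rng.normal(loc=0.0, scale=std, size=(n_in, n_out))

W = he_normal(n_in=512, n_out=256)
print(W.std())   # ≈ sqrt(2/512) ≈ 0.0625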

2. MSE loss assumes Gaussian noise:

If you believe your targets are y = f(x) + ε where ε ~ N(0, σ²), then maximizing the likelihood of your training data gives exactly least-squares regression. MSE "assumes" Gaussian errors.
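
One line of algebra makes the connection explicit: under that noise assumption, the negative log-likelihood of a single example is

-\log p(y \mid x) = \frac{(y - f(x))^2}{2\sigma^2} + \log\left(\sigma\sqrt{2\pi}\right)

The second term does not depend on f, so maximizing the likelihood over the dataset is the same as minimizing Σᵢ (yᵢ - f(xᵢ))², i.e. the mean squared error.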

3. Batch normalization:

Forces each layer's activations toward N(0, 1) during training, keeping distributions stable across layers and making training much faster.
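
A minimal sketch of the normalization step (real batch norm also tracks running statistics for inference; gamma and beta stand in for the learned per-feature scale and shift):

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch_size, features). Normalize each feature to roughly N(0, 1)...
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # ...then let the network rescale and shift with learned gamma, beta
    return gamma * x_hat + beta

x = np.random.default_rng(0).normal(3.0, 10.0, size=(64, 8))
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # ≈ 0 and ≈ 1 per feature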

4. Variational autoencoders (VAEs):

The latent space is explicitly Gaussian. The encoder maps each input to a distribution N(μ, σ²) rather than a single point. Sampling from this distribution produces diverse outputs.
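
The sampling step is usually written with the reparameterization trick so gradients can flow through it. A minimal sketch, assuming the encoder has produced mu and log_var for one input:

import numpy as np

rng = np.random.default_rng(0)

def sample_latent(mu, log_var):
    # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, 1)
    sigma = np.exp(0.5 * log_var)
    eps = rng.normal(size=mu.shape)
    return mu + sigma * eps

# Pretend the encoder produced these for one input
mu, log_var = np.array([0.2, -1.0]), np.array([0.0, -2.0])
print(sample_latent(mu, log_var))   # a different latent vector on every call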

5. L2 regularization = Gaussian prior:

From a Bayesian perspective, L2 regularization (adding λ‖w‖² to the loss) is equivalent to assuming a Gaussian prior N(0, 1/(2λ)) on the weights. It mathematically encodes the belief that "weights should be small."
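
The correspondence is one line of algebra: with each weight drawn from N(0, 1/(2λ)), the negative log of the prior is

-\log p(\mathbf{w}) = \lambda \lVert \mathbf{w} \rVert^2 + \text{const}

so adding the negative log-prior to the loss is exactly the L2 penalty.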

import numpy as np
from scipy import stats

# Sample from common distributions
n = 1000
normal_samples   = np.random.normal(loc=0, scale=1, size=n)   # N(0,1)
uniform_samples  = np.random.uniform(low=0, high=1, size=n)   # U(0,1)
bernoulli_samples = np.random.binomial(n=1, p=0.3, size=n)    # Bernoulli(0.3)

# Compute stats
print(f"N(0,1) — mean: {normal_samples.mean():.3f}, std: {normal_samples.std():.3f}")
print(f"U(0,1) — mean: {uniform_samples.mean():.3f}, std: {uniform_samples.std():.3f}")
print(f"Bernoulli(0.3) mean: {bernoulli_samples.mean():.3f}")

# Gaussian PDF: density at x=1 under N(0,1) (a density value, not a probability)
p = stats.norm.pdf(1.0, loc=0, scale=1)
print(f"pdf(1.0) under N(0,1) ≈ {p:.4f}")   # → 0.2420

# Standardize a value: (x - mean) / std
x, mu, sigma = 72.0, 68.0, 5.0
z_score = (x - mu) / sigma
print(f"z-score: {z_score:.2f}")       # → 0.80

Interactive example (coming soon): Distribution comparison - compare Uniform, Gaussian, and other distributions side by side.

Quiz (question 1 of 3)

In a Gaussian distribution N(μ=0, σ=1), where is most of the probability mass?