Language Models
Lesson 1 ⏱ 12 min

Language modeling: predicting the next token

Video coming soon

What Is a Language Model?

Walks through the chain rule of probability and shows how a language model assigns likelihoods to sentences using n-gram and neural approaches

⏱ ~6 min

🧮

Quick refresher

Transformer Architecture

Transformers use self-attention to process sequences in parallel. Each token attends to every other token, allowing the model to capture long-range dependencies without recurrence. Stacked transformer blocks build progressively richer representations.

Example

When you read 'The bank by the river was muddy,' you use 'river' and 'muddy' to figure out that 'bank' means a riverbank — transformers do this via learned attention weights.

Every time your phone finishes your sentence, a language model is running

Your keyboard's autocomplete, every chatbot response, the search suggestion that appeared before you finished typing — all of these are powered by one deceptively simple idea: assign a probability to every possible sequence of words. The model that does this best wins.

Understanding language models from first principles is the key to understanding GPT, Gemini, and everything else in the modern NLP landscape.

The core task: probability over sequences

A language model defines a probability distribution over sequences. Given the sentence "The cat sat on the mat," a language model should assign it a higher probability than "The sat cat on mat the."

The key insight is that we can factor any joint probability using the chain rule:

In plain language: instead of asking "how likely is this entire sentence?", we ask a chain of smaller questions — "given these first words, how likely is the next one?" — and multiply all those smaller answers together.

P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1})

  • P: probability of the full sequence
  • w_t: the t-th token
  • T: sequence length

This turns the hard problem of modeling an entire sequence into a series of simpler problems: given everything so far, what comes next? Each factor is a conditional probability that the model learns. The sketch below makes the factorization concrete.
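A minimal Python sketch of the chain rule in action; the conditional probabilities are made-up illustrative numbers, not the output of any real model:

```python
import math

# Made-up conditional probabilities for "the cat sat on the mat":
# P("the"), P("cat" | "the"), P("sat" | "the cat"), and so on.
factors = [0.11, 0.03, 0.24, 0.18, 0.31, 0.29]

# Chain rule: the probability of the whole sentence is the product of the factors.
sentence_prob = math.prod(factors)
print(sentence_prob)  # ≈ 1.28e-05
```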

Measuring quality: perplexity

How do we compare two language models? We use perplexity, the inverse of the geometric mean of the per-token probabilities.

In plain language: perplexity answers "on average, how many equally plausible choices did the model feel it was picking between at each word?" Lower is better — a model with perplexity 5 is far more confident (and accurate) than one with perplexity 200.

PP = \exp\left(-\frac{1}{T} \sum_{t=1}^{T} \log P(w_t \mid w_1, \ldots, w_{t-1})\right)

  • PP: perplexity
  • T: number of tokens
  • P(w_t | w_1, …, w_{t-1}): the model's probability for the t-th token given the preceding tokens

The logarithm here converts tiny probability numbers (like 0.001) into manageable negative numbers. Taking the average, negating, and exponentiating gives the perplexity score.

A perplexity of 100 means the model is, on average, as confused as if it had to choose uniformly among 100 equally likely next tokens. Perplexity of 10 is much better — fewer surprises. GPT-4 achieves single-digit perplexity on many English benchmarks.

Worked example

Suppose a model assigns these probabilities to a 3-token sequence:

  • P("The") = 0.1
  • P("cat" | "The") = 0.2
  • P("sat" | "The cat") = 0.5

Then:

  • log probs: −2.303, −1.609, −0.693
  • Average: (−2.303 − 1.609 − 0.693) / 3 = −1.535
  • Perplexity: exp(1.535) ≈ 4.64

The model needed on average about 4–5 guesses per token — quite good for a 3-token sentence.
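The same arithmetic in a few lines of Python, reusing the probabilities from the worked example above:

```python
import math

# Per-token probabilities from the worked example.
token_probs = [0.1, 0.2, 0.5]

# Average the log-probabilities, negate, and exponentiate.
avg_log_prob = sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(-avg_log_prob)
print(round(perplexity, 2))  # 4.64
```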

N-gram models: the classical approach

Before neural networks, the dominant approach was the n-gram model. A trigram model estimates:

P(w_t \mid w_1, \ldots, w_{t-1}) \approx P(w_t \mid w_{t-2}, w_{t-1})

  • w_t: current token
  • w_{t-1}: previous token
  • w_{t-2}: token two steps back

Training is simple: count how often each trigram appears in a large text corpus, then divide by the count of the bigram prefix.
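A minimal sketch of that counting recipe in plain Python; the toy corpus and the trigram_prob helper are illustrative, not part of any library:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat slept on the mat".split()

# Count every trigram and every bigram prefix in the corpus.
trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigram_counts = Counter(zip(corpus, corpus[1:]))

def trigram_prob(w_back2, w_back1, w):
    """Estimate P(w | w_back2, w_back1) by relative frequency."""
    prefix = bigram_counts[(w_back2, w_back1)]
    if prefix == 0:
        return 0.0  # unseen prefix: real systems apply smoothing or backoff here
    return trigram_counts[(w_back2, w_back1, w)] / prefix

print(trigram_prob("the", "cat", "sat"))  # 0.5: "the cat" is followed by "sat" once and "slept" once
```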

The critical weakness: n-gram models have no memory beyond their window. A trigram model cannot know that the word "Paris" appeared two paragraphs ago and is relevant to the current sentence. It also suffers from severe data sparsity: most long n-grams never appear in the training data, requiring smoothing and backoff hacks.

Neural language models

Neural LMs replace the count table with a neural network. The network takes the context (all previous tokens) as input and outputs a probability distribution over the vocabulary.

The hidden state is the key quantity: a dense vector that summarizes everything the model has read. The output layer maps this to one logit per vocabulary item, then a softmax converts the logits to probabilities.
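A minimal NumPy sketch of that output step, with made-up dimensions (hidden size 8, vocabulary of 5 tokens) and a random hidden state standing in for a real network:

```python
import numpy as np

hidden_size, vocab_size = 8, 5                   # toy dimensions
rng = np.random.default_rng(0)

h = rng.normal(size=hidden_size)                 # hidden state summarizing the context
W = rng.normal(size=(vocab_size, hidden_size))   # output projection
b = np.zeros(vocab_size)

logits = W @ h + b                               # one score per vocabulary item
probs = np.exp(logits - logits.max())            # softmax, shifted for numerical stability
probs /= probs.sum()

print(probs, probs.sum())                        # a distribution over the vocabulary, sums to 1.0
```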

RNNs (2013–2017) used sequential hidden states. They were better than n-grams but still struggled with very long contexts — vanishing gradients limited their effective memory.

Transformers (2017–present) replaced sequential processing with self-attention, allowing every token to directly attend to every earlier token. There is no fixed window — a transformer trained on 4096 tokens can, in principle, use any of those 4096 tokens to inform its prediction. This is why transformer-based LMs are so much more powerful.
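To make the mechanism concrete, here is a bare-bones NumPy sketch of scaled dot-product self-attention with a causal mask; real transformers add learned per-layer projections, multiple heads, and positional information, so treat this purely as an illustration:

```python
import numpy as np

seq_len, d = 4, 8                          # toy sequence length and model dimension
rng = np.random.default_rng(1)
x = rng.normal(size=(seq_len, d))          # one embedding per token

# Projections (random here) map embeddings to queries, keys, and values.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv

scores = Q @ K.T / np.sqrt(d)              # how strongly each token attends to each other token
mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)
scores[mask] = -np.inf                     # causal mask: a token may not look at later tokens

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the visible tokens

output = weights @ V                       # each row is a context-aware mix of value vectors
print(output.shape)                        # (4, 8)
```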

Why language modeling is the universal pre-training objective

Here's the remarkable thing: by simply training a model to predict the next word on a massive text corpus, the model is forced to learn grammar, facts, reasoning patterns, and even some world knowledge — because all of these help it predict text better.

What to remember

  • A language model is a probability distribution over token sequences, factored via the chain rule.
  • Perplexity measures how surprised the model is; lower is better.
  • N-gram models are fast but context-limited.
  • Neural LMs, especially transformers, can leverage arbitrarily long context through self-attention.
  • Predicting the next word is a surprisingly powerful pre-training task.

Interactive example

N-gram vs Neural LM prediction comparison

Coming soon

Quiz

1 / 3

A language model assigns probability 0.01 to a 100-token sentence. What is its perplexity on that sentence?