In "A Mathematical Theory of Communication" (1948) and the follow-up paper "Prediction and Entropy of Printed English" (1951), Claude Shannon applied information theory to natural language, treating English text as a stochastic process and estimating its entropy rate. Shannon showed that while the maximum entropy of English (assuming 27 equiprobable characters including space) is log₂(27) ≈ 4.76 bits per character, the actual entropy rate, accounting for letter frequencies, digram statistics, word patterns, and long-range dependencies, is approximately 1.0 to 1.5 bits per character.
Entropy Rate and Redundancy
Maximum (27 equiprobable characters): H₀ = log₂ 27 ≈ 4.76 bits/char
First-order (letter frequencies): H₁ ≈ 4.03 bits/char
Second-order (digrams): H₂ ≈ 3.32 bits/char
Shannon's estimate of the true entropy rate: h ≈ 1.0–1.5 bits/char
Redundancy: R = 1 − h/H₀ ≈ 68–79%
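These figures can be reproduced with a short sketch. The entropy-rate values are the ones quoted above, not new measurements:

```python
import math

# Maximum entropy of a 27-symbol alphabet (26 letters + space),
# assuming all symbols are equiprobable.
H0 = math.log2(27)  # ≈ 4.76 bits/char

def redundancy(h, h_max):
    """R = 1 - h / H0: the fraction of the text that is predictable."""
    return 1 - h / h_max

# Shannon's estimated entropy rate of ~1.0-1.5 bits/char implies:
print(f"H0 = {H0:.2f} bits/char")       # prints H0 = 4.75 bits/char
print(f"{redundancy(1.5, H0):.0%}")     # prints 68%
print(f"{redundancy(1.0, H0):.0%}")     # prints 79%
```

The two endpoints of Shannon's range are what produce the 68–79% redundancy band quoted above.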
The redundancy of English, the gap between its maximum entropy and its actual entropy rate expressed as a proportion of the maximum, is remarkably high: roughly 70–80%. This means that about three-quarters of the characters in English text are statistically predictable from context. Shannon demonstrated this dramatically with his guessing game: human subjects predicted the next letter of a text far better than chance, and the distribution of their guesses yielded upper and lower bounds on the entropy rate.
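The upper bound from the guessing game can be sketched as follows: the entropy of the guess-rank distribution (the probability that the correct letter is the subject's first guess, second guess, and so on) upper-bounds the entropy rate of the source, because an ideal predictor's rank sequence is a lossless recoding of the text. The distribution below is illustrative, not Shannon's experimental data:

```python
import math

# Hypothetical guess-rank distribution: q[i] is the probability that a
# subject's (i+1)-th guess is the correct next letter. Illustrative only.
q = [0.58, 0.12, 0.07, 0.05, 0.04, 0.03, 0.03, 0.02, 0.02, 0.04]

def rank_entropy_upper_bound(q):
    """Entropy (bits) of the guess-rank distribution.

    Since the ranks are a deterministic recoding of the text given the
    predictor, this entropy upper-bounds the source's entropy rate."""
    return -sum(p * math.log2(p) for p in q if p > 0)

print(f"upper bound ≈ {rank_entropy_upper_bound(q):.2f} bits/char")
```

With this made-up distribution the bound comes out near 2 bits/char; Shannon's actual subjects, on longer contexts, pushed the bound considerably lower.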
Surprisal and Reading
Modern psycholinguistics has operationalized Shannon's framework through the concept of surprisal: the negative log probability of a word given its context, −log₂ P(wₙ | w₁, ..., w_{n-1}). Building on Hale (2001) and Levy (2008), eye-tracking and self-paced reading experiments have shown that word-by-word reading times increase approximately linearly with surprisal, providing direct evidence that the human language processor is sensitive to the statistical structure quantified by entropy.
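Surprisal is straightforward to compute under any probabilistic language model. A minimal sketch using a maximum-likelihood bigram model over a toy corpus (both the corpus and the model are illustrative, not from the cited studies):

```python
import math
from collections import Counter

# Toy corpus; the bigram counts stand in for a real language model.
corpus = "the cat sat on the mat and the cat saw the dog".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def surprisal(word, prev):
    """Surprisal -log2 P(word | prev) under a max-likelihood bigram model."""
    p = bigrams[(prev, word)] / unigrams[prev]
    return -math.log2(p)

# "cat" follows "the" in 2 of 4 occurrences: P = 0.5, surprisal = 1 bit.
print(surprisal("cat", "the"))   # prints 1.0
print(surprisal("dog", "the"))   # prints 2.0
```

In a reading-time study, values like these would be the predictor regressed against per-word fixation or reading durations.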
Modern large language models (LLMs) provide increasingly tight estimates of language entropy. As models improve at predicting the next word, their cross-entropy loss (an upper bound on the true entropy rate) decreases. Current state-of-the-art models achieve cross-entropy values that approach Shannon's original estimates from above, suggesting that Shannon's guessing game provided a remarkably accurate bound on the entropy of English despite the rudimentary methods available in 1951.
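Why the approach is from above follows from the identity H(p, q) = H(p) + D_KL(p‖q) ≥ H(p): cross-entropy exceeds true entropy by exactly the model's divergence from the source. A small numeric sketch with hypothetical distributions:

```python
import math

# Hypothetical true next-word distribution p and two model estimates q.
p      = [0.5, 0.25, 0.125, 0.125]
q_weak = [0.25, 0.25, 0.25, 0.25]   # uniform model
q_good = [0.45, 0.27, 0.14, 0.14]   # closer fit to p

def entropy(p):
    """H(p) = -sum p log2 p, in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum p log2 q = H(p) + KL(p||q) >= H(p)."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

# Better models drive cross-entropy down toward H(p), never below it.
print(entropy(p))                 # prints 1.75
print(cross_entropy(p, q_weak))   # prints 2.0
print(cross_entropy(p, q_good))   # slightly above 1.75
```

An LLM's per-token loss is the same quantity averaged over a corpus, which is why falling loss curves read as tightening entropy estimates.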
Applications in Cognitive Science
Entropy measures have become central to quantitative linguistics and cognitive science. Zipf's law — the inverse relationship between word frequency and rank — can be understood as a consequence of entropy maximization under constraints. The uniform information density hypothesis (Jaeger, 2010) proposes that speakers adjust their production to maintain a roughly constant information rate (bits per second), redistributing entropy across the utterance. This connects language production to Shannon's channel capacity: speakers avoid exceeding the listener's processing capacity by spreading information evenly.
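The constant-rate idea can be sketched numerically: dividing each word's surprisal by its spoken duration gives an information rate in bits per second, which the hypothesis predicts should be roughly flat across the utterance. All figures below are illustrative, not from Jaeger (2010):

```python
# Hypothetical per-word surprisals (bits) and spoken durations (seconds).
# Predictable words are short, surprising words are stretched out,
# which flattens the bits-per-second rate.
words      = ["the", "committee", "adjourned", "at", "noon"]
surprisals = [1.2,   9.5,         8.1,         2.0,  6.4]
durations  = [0.10,  0.72,        0.61,        0.15, 0.49]

rates = [s / d for s, d in zip(surprisals, durations)]
print([round(r, 1) for r in rates])  # all near ~13 bits/sec
```

On this made-up utterance the rate stays within a narrow band even though per-word surprisal varies by a factor of eight, which is the signature the hypothesis looks for in real production data.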
In clinical applications, entropy measures of language production have been used to detect cognitive decline in Alzheimer's disease and to characterize aphasia. Patients with language disorders show altered entropy profiles — reduced vocabulary diversity (lower word-level entropy) or disrupted syntactic predictability — providing quantitative biomarkers grounded in information theory.
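Word-level entropy as a diversity measure is simple to compute from a transcript. A sketch with hypothetical (non-clinical) examples:

```python
import math
from collections import Counter

def word_entropy(transcript):
    """Unigram (word-level) entropy of a transcript, in bits.

    Lower values indicate reduced vocabulary diversity: a speaker who
    reuses a few words heavily has a more peaked word distribution."""
    words = transcript.lower().split()
    counts = Counter(words)
    n = len(words)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical transcripts; not clinical data.
control = "she poured the tea while the kettle whistled on the stove"
patient = "the the thing the thing is the thing there"

print(word_entropy(control) > word_entropy(patient))  # prints True
```

Real clinical studies compute such measures over much longer samples and typically normalize for transcript length, since entropy estimates from short texts are biased downward.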