Mathematical Psychology

Entropy in Language

Shannon's application of entropy to English revealed that natural language has an entropy rate of roughly 1–1.5 bits per character, with substantial redundancy that enables error correction and prediction in communication.

H(X_n | X₁, ..., X_{n-1}) → h ≈ 1.0–1.5 bits/character

In his 1948 paper and subsequent work on the "Prediction and Entropy of Printed English" (1951), Claude Shannon applied information theory to natural language, treating English text as a stochastic process and estimating its entropy rate. Shannon calculated that while the maximum entropy of English (assuming 27 equiprobable characters including space) is log₂(27) ≈ 4.76 bits per character, the actual entropy rate — accounting for letter frequencies, digram statistics, word patterns, and long-range dependencies — is approximately 1.0 to 1.5 bits per character.

Entropy Rate and Redundancy

Language Entropy

Maximum entropy (27 symbols): H₀ = log₂(27) ≈ 4.76 bits/char
First-order (letter frequencies): H₁ ≈ 4.03 bits/char
Second-order (digrams): H₂ ≈ 3.32 bits/char
Shannon's estimate of true entropy rate: h ≈ 1.0–1.5 bits/char

Redundancy: R = 1 − h/H₀ ≈ 68–79%

The redundancy of English — the difference between its maximum entropy and its actual entropy rate, expressed as a proportion of the maximum — is remarkably high, roughly 70–80%. This means that about three-quarters of the characters in English text are statistically predictable from context. Shannon demonstrated this dramatically with his guessing game: human subjects predicted the next letter of a text far better than chance, and the distribution of their guess counts yielded upper and lower bounds on the entropy rate.
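The quantities above are straightforward to compute. A minimal Python sketch, estimating H₀ exactly and a first-order H₁ from a toy sample (the pangram string is an illustrative stand-in for a real corpus, so its H₁ will not match the table's 4.03 bits/char):

```python
from collections import Counter
from math import log2

def first_order_entropy(text: str) -> float:
    """H1: entropy of the single-character distribution, in bits/char."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * log2(c / n) for c in counts.values())

# 27-symbol alphabet (a-z plus space), as in Shannon's setup
H0 = log2(27)  # ≈ 4.76 bits/char

# Toy sample; a real estimate would use a large corpus
sample = "the quick brown fox jumps over the lazy dog "
H1 = first_order_entropy(sample)

# Redundancy R = 1 - h/H0, using Shannon's h ≈ 1.0-1.5 bits/char
for h in (1.0, 1.5):
    print(f"h = {h}: redundancy R = {1 - h / H0:.2f}")
```

Plugging in the endpoints of Shannon's range recovers the 68–79% figure quoted above.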

Surprisal and Reading

Modern psycholinguistics has operationalized Shannon's framework through the concept of surprisal: the negative log probability of a word given its context, −log₂ P(wₙ | w₁, ..., w_{n-1}). Hale (2001) and Levy (2008) demonstrated that word-by-word reading times in eye-tracking and self-paced reading experiments are linearly proportional to surprisal, providing direct evidence that the human language processor is sensitive to the statistical structure quantified by entropy.
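Surprisal can be computed from any probability model over words. The sketch below uses a tiny bigram model with add-one smoothing; the corpus and the smoothing scheme are illustrative assumptions, not the models used by Hale or Levy:

```python
from collections import Counter
from math import log2

def bigram_surprisal(corpus: list, sentence: list) -> list:
    """Surprisal -log2 P(w_n | w_{n-1}) under an add-one-smoothed bigram model."""
    vocab = set(corpus)
    pair_counts = Counter(zip(corpus, corpus[1:]))
    prev_counts = Counter(corpus[:-1])  # counts of words in the "context" slot
    V = len(vocab)
    surprisals = []
    for prev, word in zip(sentence, sentence[1:]):
        p = (pair_counts[(prev, word)] + 1) / (prev_counts[prev] + V)
        surprisals.append(-log2(p))
    return surprisals

corpus = "the cat sat on the mat the cat ran".split()

# A frequent continuation ("the cat") is less surprising than an unseen one ("the ran")
s_frequent = bigram_surprisal(corpus, "the cat".split())[0]
s_unseen = bigram_surprisal(corpus, "the ran".split())[0]
print(s_frequent, s_unseen)
```

Under the surprisal account, reading times on "cat" after "the" would be predicted to be shorter than on an unexpected continuation, in proportion to this difference in bits.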

Large Language Models and Entropy

Modern large language models (LLMs) provide increasingly precise estimates of language entropy. As models improve at predicting the next word, their cross-entropy loss (an upper bound on the true entropy rate) decreases. Current state-of-the-art models achieve cross-entropy values that approach Shannon's original estimates from above, suggesting that Shannon's guessing game provided a remarkably accurate bound on the entropy of English despite the rudimentary methods available in 1951.
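The bound itself is easy to verify: for any model distribution q, the cross-entropy −Σ p log₂ q is at least the true entropy H(p), with equality only when q = p. A small Python check on toy distributions (purely illustrative, not actual letter statistics):

```python
from math import log2

def entropy(p):
    """H(p) = -sum p log2 p, in bits."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum p log2 q: bits a code optimized for q spends on data from p."""
    return -sum(pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)

# True distribution p vs. two model estimates q
p = [0.5, 0.25, 0.125, 0.125]
q_poor = [0.25, 0.25, 0.25, 0.25]   # uniform model
q_good = [0.45, 0.30, 0.125, 0.125] # closer to p

H = entropy(p)                       # 1.75 bits
ce_poor = cross_entropy(p, q_poor)   # 2.0 bits
ce_good = cross_entropy(p, q_good)
print(H, ce_poor, ce_good)
```

As the model distribution gets closer to the truth, cross-entropy falls toward H(p) but never below it — which is why a shrinking LLM loss tightens the estimate of the entropy rate from above.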

Applications in Cognitive Science

Entropy measures have become central to quantitative linguistics and cognitive science. Zipf's law — the inverse relationship between word frequency and rank — can be understood as a consequence of entropy maximization under constraints. The uniform information density hypothesis (Jaeger, 2010) proposes that speakers adjust their production to maintain a roughly constant information rate (bits per second), redistributing entropy across the utterance. This connects language production to Shannon's channel capacity: speakers avoid exceeding the listener's processing capacity by spreading information evenly.

In clinical applications, entropy measures of language production have been used to detect cognitive decline in Alzheimer's disease and to characterize aphasia. Patients with language disorders show altered entropy profiles — reduced vocabulary diversity (lower word-level entropy) or disrupted syntactic predictability — providing quantitative biomarkers grounded in information theory.
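Word-level entropy as a diversity measure can be illustrated with a toy comparison; the two sample sentences below are invented for the sketch, not clinical data:

```python
from collections import Counter
from math import log2

def word_entropy(text: str) -> float:
    """Entropy of the word-frequency distribution, in bits/word."""
    words = text.lower().split()
    counts = Counter(words)
    n = len(words)
    return -sum(c / n * log2(c / n) for c in counts.values())

varied = "sunlight filtered through tall oaks while distant birds called softly"
repetitive = "the thing did the thing with the thing and the thing"

# A more diverse vocabulary yields higher word-level entropy
print(word_entropy(varied), word_entropy(repetitive))
```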

References

  1. Shannon, C. E. (1951). Prediction and entropy of printed English. The Bell System Technical Journal, 30(1), 50–64. doi:10.1002/j.1538-7305.1951.tb01366.x
  2. Levy, R. (2008). Expectation-based syntactic comprehension. Cognition, 106(3), 1126–1177. doi:10.1016/j.cognition.2007.05.006
  3. Hale, J. (2001). A probabilistic Earley parser as a psycholinguistic model. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics (pp. 1–8). doi:10.3115/1073336.1073357
  4. Jaeger, T. F. (2010). Redundancy and reduction: Speakers manage syntactic information density. Cognitive Psychology, 61(1), 23–62. doi:10.1016/j.cogpsych.2010.02.002
  5. Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley-Interscience. doi:10.1002/047174882X
