Positive words carry less information than negative words

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

We show that the frequency of word use is not only determined by the word length \cite{Zipf1935} and the average information content \cite{Piantadosi2011}, but also by its emotional content. We have analyzed three established lexica of affective word usage in English, German, and Spanish, to verify that these lexica have a neutral, unbiased, emotional content. Taking into account the frequency of word usage, we find that words with a positive emotional content are more frequently used. This lends support to Pollyanna hypothesis \cite{Boucher1969} that there should be a positive bias in human expression. We also find that negative words contain more information than positive words, as the informativeness of a word increases uniformly with its valence decrease. Our findings support earlier conjectures about (i) the relation between word frequency and information content, and (ii) the impact of positive emotions on communication and social links.

💡 Research Summary

The paper “Positive words carry less information than negative words” investigates how emotional valence interacts with word frequency and information content across three major languages (English, German, and Spanish). Using three well‑established affective lexicons (ANEW for English and Spanish, BAWL for German) that provide valence scores on a scale from –1 (negative) to +1 (positive), the authors first confirm that the lexicons themselves are emotionally neutral on average. They then combine these lexical resources with massive usage statistics derived from Google’s N‑gram corpus (approximately one trillion tokens) to examine real‑world frequencies of the listed words on the Internet.

The analysis proceeds in two main steps. First, the authors compute a frequency‑weighted average valence for each language. While the raw lexicon distributions are centered near zero, weighting by actual usage dramatically shifts the median toward positive values (≈ 0.3 for English, 0.2 for German, and 0.24 for Spanish). Wilcoxon signed‑rank tests confirm that these shifts are statistically significant (95 % confidence intervals: English 0.257 ± 0.032, German 0.167 ± 0.017, Spanish 0.287 ± 0.035). This empirical finding supports the Pollyanna hypothesis that human communication exhibits a systematic positivity bias.

Second, the authors assess the informational load of each word. They calculate self‑information I(w) = −log P(w) directly from the observed frequencies, and they extend this to higher‑order contexts by computing 2‑gram, 3‑gram, and 4‑gram information measures (I₂, I₃, I₄). Pearson correlation analyses reveal a robust negative relationship between valence and self‑information: ρ(v, I) = −0.368 (English), −0.325 (German), and −0.402 (Spanish), all significant at p < 0.001. In other words, words that are more negative carry higher information content because they occur less frequently. The same pattern holds, albeit with slightly weaker coefficients, for the higher‑order N‑gram information measures. To test robustness, the authors repeat the analysis using traditional book‑based corpora; the correlations persist, confirming that the effect is not an artifact of the web‑derived data.

These results extend classic linguistic regularities. Zipf’s law links word frequency to length, while Piantadosi et al. showed that word length scales with information content. The present study adds a third dimension: emotional valence. Positive words are not only shorter on average (as implied by higher frequency) but also convey less information per occurrence, whereas negative words, being rarer, are more informative. This asymmetry has implications for models of efficient communication, sentiment analysis, and the design of affect‑aware natural language processing systems.

The paper also critiques a recent study that claimed a bias in the number of positive versus negative words without finding a frequency‑valence relationship. The authors argue that the earlier work suffered from methodological issues such as uncontrolled Mechanical Turk sampling, potential acquiescence bias, and the use of “happiness” rather than the more neutral valence construct. By contrast, the current work relies on validated lexicons and large‑scale, unbiased frequency counts.

Limitations are acknowledged. The affective lexicons cover only single words, ignoring multi‑word expressions and contextual shifts in meaning. The Google N‑gram dataset reflects primarily written web content, which may differ from spoken language or informal social media. Valence ratings are inherently subjective and may vary across cultures. Future research directions include incorporating additional affective dimensions (arousal, dominance), employing deep neural language models to capture context‑dependent information, and extending the analysis to non‑Indo‑European languages and other media types.

In summary, the study provides strong empirical evidence that (1) positive words dominate everyday written communication across three major languages, confirming a positivity bias, and (2) negative words carry higher self‑information, indicating that they are more informative per token. These findings bridge information theory, psycholinguistics, and social psychology, offering a nuanced view of how emotional content shapes linguistic efficiency and social signaling.

Positive words carry less information than negative words

💡 Research Summary

Comments & Academic Discussion

Leave a Comment