Discovering Basic Emotion Sets via Semantic Clustering on a Twitter Corpus

A plethora of words are used to describe the spectrum of human emotions, but how many emotions are there really, and how do they interact? Over the past few decades, several theories of emotion have been proposed, each based around the existence of a set of ‘basic emotions’, and each supported by an extensive variety of research including studies in facial expression, ethology, neurology and physiology. Here we present research based on a theory that people transmit their understanding of emotions through the language they use surrounding emotion keywords. Using a labelled corpus of over 21,000 tweets, six of the basic emotion sets proposed in existing literature were analysed using Latent Semantic Clustering (LSC), evaluating the distinctiveness of the semantic meaning attached to the emotional label. We hypothesise that the more distinct the language used to express a certain emotion, the more distinct the perception (including proprioception) of that emotion, and thus the more ‘basic’ the emotion is. This allows us to select the dimensions best representing the entire spectrum of emotion. We find that Ekman’s set, arguably the most frequently used for classifying emotions, is in fact the most semantically distinct overall. Next, taking all analysed (that is, previously proposed) emotion terms into account, we determine the optimal semantically irreducible basic emotion set using an iterative LSC algorithm. Our newly-derived set (Accepting, Ashamed, Contempt, Interested, Joyful, Pleased, Sleepy, Stressed) generates a 6.1% increase in distinctiveness over Ekman’s set (Angry, Disgusted, Joyful, Sad, Scared). We also demonstrate how using LSC data can help visualise emotions. We introduce the concept of an Emotion Profile and briefly analyse compound emotions both visually and mathematically.

💡 Research Summary

The paper tackles the long‑standing question of how many basic emotions exist and how they can be identified from everyday language use. Using a labelled corpus of more than 21,000 tweets, the authors treat each emotion keyword as a semantic anchor and examine how distinct the surrounding language is for each emotion. They first evaluate six well‑known basic‑emotion sets from the literature (including those of Ekman, Plutchik, Power, and Lazarus) by applying Latent Semantic Clustering (LSC). LSC is implemented by constructing a TF‑IDF weighted term‑document matrix, reducing its dimensionality with Singular Value Decomposition, and then clustering the resulting vectors using cosine similarity as the similarity measure. A distinctiveness score is defined as the difference between average intra‑cluster similarity and average inter‑cluster similarity; higher scores indicate that the language associated with a label occupies a more isolated semantic region.
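The distinctiveness score described above can be illustrated with a minimal sketch. It assumes the tweets have already been turned into reduced vectors (the TF‑IDF and SVD steps are not shown) and that clusters are simply identified with emotion labels. The function names `cosine` and `distinctiveness` and the toy vectors are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def distinctiveness(vectors, labels):
    """Mean same-label similarity minus mean cross-label similarity.

    Assumption: each label is treated as one cluster; the paper's exact
    clustering and averaging scheme may differ.
    """
    intra, inter = [], []
    for i, j in combinations(range(len(vectors)), 2):
        sim = cosine(vectors[i], vectors[j])
        (intra if labels[i] == labels[j] else inter).append(sim)
    if not intra or not inter:
        raise ValueError("need at least one same-label and one cross-label pair")
    return sum(intra) / len(intra) - sum(inter) / len(inter)


# Toy 2-D vectors standing in for SVD-reduced tweet vectors (illustrative only).
vectors = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, 1.0)]
labels = ["joyful", "joyful", "sad", "sad"]
score = distinctiveness(vectors, labels)
```

In this toy case the two labels occupy orthogonal directions, so the score reaches its maximum; real tweet vectors would overlap and score lower.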

The results show that, of the six sets evaluated, Ekman’s five‑emotion set (Angry, Disgusted, Joyful, Sad, Scared) achieves the highest overall distinctiveness, indicating that these emotions are the most clearly separated in the language of the corpus. However, the authors argue that Ekman’s set does not fully capture the breadth of human affect. To address this, they introduce an iterative LSC algorithm that starts with all candidate emotion terms proposed in the literature (roughly thirty) and repeatedly removes the term with the lowest distinctiveness, recomputing the scores after each removal. The process stops when further deletions no longer improve the average distinctiveness.
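The elimination loop described above can be sketched as a greedy procedure. This is one possible reading rather than the authors' implementation: the per-term score used here (one minus the highest cosine similarity to any other term's centroid) is a stand-in assumption, and `prune_terms`, `separation_scores`, and the toy centroids (including the term "furious") are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prune_terms(terms, score_fn, min_size=2):
    """Drop the lowest-scoring term while doing so raises the average score."""
    current = list(terms)
    scores = score_fn(current)
    best_avg = sum(scores.values()) / len(current)
    while len(current) > min_size:
        weakest = min(current, key=scores.get)  # ties: first term in list order
        candidate = [t for t in current if t != weakest]
        cand_scores = score_fn(candidate)  # recompute after each removal
        cand_avg = sum(cand_scores.values()) / len(candidate)
        if cand_avg <= best_avg:  # no improvement: keep the current set
            break
        current, scores, best_avg = candidate, cand_scores, cand_avg
    return current, best_avg


# Toy centroids: "angry" and "furious" are deliberately indistinguishable.
centroids = {
    "angry": (1.0, 0.0, 0.0),
    "furious": (1.0, 0.0, 0.0),
    "sad": (0.0, 1.0, 0.0),
    "sleepy": (0.0, 0.0, 1.0),
}


def separation_scores(terms):
    """Stand-in per-term score: 1 - similarity to the nearest other term."""
    return {
        t: 1.0 - max(cosine(centroids[t], centroids[o]) for o in terms if o != t)
        for t in terms
    }


kept, avg = prune_terms(list(centroids), separation_scores)
```

Because the score function is passed in as an argument, the same loop could be reused with an intra‑ versus inter‑cluster distinctiveness measure in place of the stand-in used here.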

The algorithm converges on an eight‑term set: Accepting, Ashamed, Contempt, Interested, Joyful, Pleased, Sleepy, and Stressed. This new set yields a 6.1% increase in average distinctiveness over Ekman’s set, with particularly strong separation for socially evaluative emotions such as Ashamed and Contempt. The authors also present an “Emotion Profile” visualization, projecting each emotion’s semantic vector into a low‑dimensional space and displaying the relative positions of emotions. By treating compound emotions as linear combinations of basic‑emotion vectors, they demonstrate both visual and mathematical analysis of affective blends (e.g., anxiety + interest).
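The idea of a compound emotion as a linear combination of basic‑emotion vectors can be illustrated with a small sketch. It assumes that an emotion's profile is its cosine similarity to each basic‑emotion vector, which is only one plausible reading of the Emotion Profile concept. The vectors, weights, and the names `blend` and `profile` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def blend(vectors, weights):
    """Weighted sum of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(w * v[k] for v, w in zip(vectors, weights)) for k in range(dim)]


# Toy basic-emotion vectors (illustrative, not values from the paper).
basic = {
    "stressed": [1.0, 0.0, 0.0],
    "interested": [0.0, 1.0, 0.0],
    "sleepy": [0.0, 0.0, 1.0],
}

# A compound emotion built as an equal mix of two basic emotions.
compound = blend([basic["stressed"], basic["interested"]], [0.5, 0.5])

# Assumed profile: similarity of the compound to every basic emotion.
profile = {name: cosine(compound, vec) for name, vec in basic.items()}
```

Here the compound is equally similar to its two components and unrelated to the third, which is the pattern a profile plot of a blend would be expected to show.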

The study’s contributions are threefold. First, it provides a data‑driven metric for assessing how uniquely a language community uses words to convey each emotion, complementing physiological and facial‑expression approaches. Second, the iterative LSC method offers an objective way to derive a minimal, semantically irreducible set of basic emotions, reducing theoretical bias. Third, the Emotion Profile framework opens practical avenues for real‑time sentiment monitoring, affect‑aware recommendation systems, and richer human‑computer interaction designs.

Limitations are acknowledged: Twitter users are not demographically representative, tweets are short and often contain slang, hashtags, or emojis that introduce noise, and the manual labeling process inevitably carries some subjectivity. Future work is suggested to validate the findings on multilingual, longer‑form corpora (e.g., blogs, news articles) and to integrate the emotion profiles into live affect‑recognition pipelines.

In sum, by leveraging large‑scale natural language data and semantic clustering, the paper demonstrates a novel, reproducible approach to identifying basic emotions and visualizing their interrelations, thereby extending the toolkit available to psychologists, computational linguists, and affective computing researchers.