Click it or Leave it: Detecting and Spoiling Clickbait with Informativeness Measures and Large Language Models
Clickbait headlines degrade the quality of online information and undermine user trust. We present a hybrid approach to clickbait detection that combines transformer-based text embeddings with linguistically motivated informativeness features. We evaluate classical vectorizers, static word-embedding baselines, and large language model embeddings paired with tree-based classifiers. Our best-performing model, XGBoost over LLM embeddings augmented with 15 explicit features, achieves an F1-score of 91%, outperforming TF-IDF, Word2Vec, GloVe, LLM prompt-based classification, and feature-only baselines. The proposed feature set enhances interpretability by highlighting salient linguistic cues such as second-person pronouns, superlatives, numerals, and attention-oriented punctuation, enabling transparent and well-calibrated clickbait predictions. We release code and trained models to support reproducible research.
💡 Research Summary
The paper “Click it or Leave it: Detecting and Spoiling Clickbait with Informativeness Measures and Large Language Models” tackles the persistent problem of clickbait headlines that degrade information quality and erode user trust. The authors propose a hybrid detection framework that combines modern transformer‑based text embeddings with a set of 15 linguistically motivated informativeness features. Their contributions can be summarized as follows:
- Dataset Construction – Four publicly available English datasets are merged: two Kaggle clickbait corpora (32 k and 21 k articles), the Clickbait Challenge 2017 Twitter-post collection (≈ 39 k items), and the SemEval 2023 Clickbait Spoiling dataset (≈ 4 k items). After cleaning (English-only, duplicate removal, missing-value filtering) and harmonizing labels (binary conversion using a 0.5 threshold on the Clickbait Challenge scores), a unified corpus of roughly 90 000 headline–body pairs is created. The data are split, stratified by label, into 80 % training, 10 % validation, and 10 % test sets with a fixed random seed.
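The seed-fixed stratified split described above can be sketched in plain Python. This is a minimal illustration, not the paper's released code; the function name `stratified_split` and its default ratios are assumptions:

```python
import random

def stratified_split(items, labels, seed=42, ratios=(0.8, 0.1, 0.1)):
    """Partition (item, label) pairs into train/val/test while
    preserving each label's proportion in every split."""
    rng = random.Random(seed)  # fixed seed -> reproducible split
    by_label = {}
    for item, label in zip(items, labels):
        by_label.setdefault(label, []).append(item)
    train, val, test = [], [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_val = int(len(group) * ratios[1])
        train += [(x, label) for x in group[:n_train]]
        val += [(x, label) for x in group[n_train:n_train + n_val]]
        test += [(x, label) for x in group[n_train + n_val:]]
    return train, val, test
```

In practice one would likely reach for `sklearn.model_selection.train_test_split` with its `stratify` argument, applied twice to carve out the 10 % validation and 10 % test portions.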
- Informativeness Feature Set – The authors design 15 explicit features that capture stylistic and lexical signals associated with low information content: total word count, stop-word ratio, uppercase character ratio, frequency of superlative adjectives/adverbs, Flesch Reading Ease score, sentiment polarity, cosine similarity between title and article body, and others. An extended appendix lists 25 additional measures. These features are intended to quantify how much useful information a headline conveys, rather than relying on opaque neural representations.
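Several of these features can be computed with plain Python. The sketch below is illustrative only: the stop-word and superlative lists are placeholder subsets, and the paper's exact feature definitions may differ:

```python
import re

# Illustrative subsets, not the paper's actual lexicons
STOPWORDS = {"the", "a", "an", "is", "to", "of", "this", "that", "will", "you"}
SUPERLATIVES = {"best", "worst", "most", "least", "ever", "greatest"}

def informativeness_features(title: str) -> dict:
    """Compute a handful of the stylistic cues described above."""
    words = re.findall(r"[A-Za-z']+", title.lower())
    n_words = len(words)
    letters = [c for c in title if c.isalpha()]
    return {
        "word_count": n_words,
        "stopword_ratio": sum(w in STOPWORDS for w in words) / max(n_words, 1),
        "uppercase_ratio": sum(c.isupper() for c in letters) / max(len(letters), 1),
        "superlative_count": sum(w in SUPERLATIVES for w in words),
        "second_person": int(any(w in {"you", "your", "yours"} for w in words)),
        "numeral_count": len(re.findall(r"\d+", title)),
        "exclaim_or_question": title.count("!") + title.count("?"),
    }
```

Features such as Flesch Reading Ease, sentiment polarity, and title–body cosine similarity would require additional tooling (e.g. a readability library, a sentiment model, and sentence embeddings) and are omitted here.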
- Embedding Strategies – Multiple text representations are evaluated: classic TF-IDF vectors, static word embeddings (Word2Vec, GloVe), TF-IDF-weighted Word2Vec, and high-dimensional embeddings from OpenAI's large language models accessed via API. The LLM embeddings (1536-dimensional) are reduced with PCA to 200 dimensions to keep computational cost reasonable.
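The 1536 → 200 PCA reduction can be sketched via an SVD of the centered embedding matrix. This is a from-scratch illustration; the authors presumably use a standard PCA implementation such as `sklearn.decomposition.PCA(n_components=200)`:

```python
import numpy as np

def pca_reduce(embeddings: np.ndarray, n_components: int = 200) -> np.ndarray:
    """Project row-wise embeddings onto their top principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are principal directions, ordered by singular value
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T
```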
- Model Architecture – All representations are fed into tree-based classifiers, primarily XGBoost and Random Forest. The central hybrid model concatenates the reduced LLM embedding with the 15 handcrafted informativeness features before training an XGBoost classifier. This design leverages deep semantic information from the LLM while preserving interpretable stylistic cues.
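The concatenation step might look like the following sketch (`build_hybrid_matrix` is a hypothetical helper name, not from the paper):

```python
import numpy as np

def build_hybrid_matrix(llm_embeddings: np.ndarray, feature_rows: list) -> np.ndarray:
    """Concatenate reduced LLM embeddings with handcrafted feature dicts
    into a single design matrix (one row per headline)."""
    feature_names = sorted(feature_rows[0])  # fixed column order
    feats = np.array(
        [[row[name] for name in feature_names] for row in feature_rows],
        dtype=float,
    )
    return np.hstack([llm_embeddings, feats])
```

The resulting matrix would then be passed to the classifier, e.g. `xgboost.XGBClassifier().fit(X, y)`, so the tree ensemble can split on both semantic embedding dimensions and interpretable stylistic columns.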
- Experimental Results – Baselines (TF-IDF + XGBoost, Word2Vec + XGBoost, GloVe + XGBoost, RoBERTa fine-tuned detector, LLM embedding + XGBoost) achieve F1 scores ranging from 0.84 to 0.89. The feature-only model reaches 0.78, confirming that the handcrafted features are informative but insufficient alone. The proposed hybrid model attains an F1 of 0.91 on both validation and test sets, outperforming all baselines. An ablation study shows that removing either the LLM embeddings or the informativeness features degrades performance by roughly 3–5 percentage points, demonstrating their complementary nature.
- Interpretability Analysis – SHAP (SHapley Additive exPlanations) values are computed for the hybrid model. The most influential features are the presence of second-person pronouns ("you"), the ratio of superlatives, the proportion of uppercase characters, and the count of numerals. These align with known clickbait tactics (personalized appeals, exaggerated language, attention-grabbing punctuation). The authors argue that this level of transparency is rare in recent deep-learning-centric clickbait detectors.
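The paper uses SHAP for attribution. A lighter-weight way to sanity-check which columns a fitted model relies on is permutation importance, shown below as a dependency-free stand-in (this is not SHAP, just a simpler related diagnostic):

```python
import numpy as np

def permutation_importance(predict, X, y, rng=None):
    """Accuracy drop when each column is shuffled independently.
    A large drop means the model leans heavily on that feature."""
    if rng is None:
        rng = np.random.default_rng(0)
    base = np.mean(predict(X) == y)
    drops = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        rng.shuffle(Xp[:, j])  # destroy column j's signal
        drops.append(base - np.mean(predict(Xp) == y))
    return np.array(drops)
```

With the actual hybrid model, the analogous SHAP call would be along the lines of `shap.TreeExplainer(model).shap_values(X)`, which additionally gives signed, per-prediction attributions rather than a single global score.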
- Reproducibility and Future Work – All code, preprocessing scripts, and trained models are released under an open-source license. The authors acknowledge the English-only focus and propose extending the feature set to other languages, integrating real-time user-facing warning systems, and exploring multi-task learning that simultaneously detects clickbait and generates "spoilers" (summaries that reveal the withheld information).
In conclusion, the study demonstrates that a judicious combination of large‑language‑model embeddings with a small, well‑designed set of linguistic features can achieve state‑of‑the‑art clickbait detection while preserving model interpretability. This hybrid approach offers a practical pathway for deploying trustworthy clickbait filters in news aggregators, social platforms, and browser extensions.