Algorithmic Detection of Computer Generated Text
Computer generated academic papers have been used to expose a lack of thorough human review at several computer science conferences. We assess the problem of classifying such documents. After identifying and evaluating several quantifiable features of academic papers, we apply methods from machine learning to build a binary classifier. In tests with two hundred papers, the resulting classifier correctly labeled papers either as human written or as computer generated with no false classifications of computer generated papers as human and a 2% false classification rate for human papers as computer generated. We believe generalizations of these features are applicable to similar classification problems. While most current text-based spam detection techniques focus on the keyword-based classification of email messages, a new generation of unsolicited computer-generated advertisements masquerades as legitimate postings in online groups, message boards, and social news sites. Our results show that taking the formatting and contextual clues offered by these environments into account may be of central importance when selecting features with which to identify such unwanted postings.
💡 Research Summary
The paper addresses the growing problem of computer‑generated academic papers that have been used to expose lax peer‑review practices at several computer‑science conferences. Recognizing that traditional spam‑filtering techniques focus mainly on keyword matching and therefore struggle to detect sophisticated, structurally coherent forgeries, the authors propose a feature‑driven machine‑learning approach that leverages both formatting cues and contextual information inherent in scholarly documents.
To build and evaluate their system, the researchers assembled a balanced corpus of two hundred papers: one hundred genuine articles drawn from recent proceedings of reputable conferences, and one hundred synthetic papers produced with the SCIgen generator, carefully calibrated to match the length and topical distribution of the real set. For each document they extracted a comprehensive set of quantitative descriptors, grouped into several categories:
- Bibliographic metadata – number of authors, diversity of affiliations, publication year distribution.
- Citation network characteristics – total citation count, frequency of repeated author–year pairs, abnormal self‑citation patterns.
- Section‑header analysis – presence, ordering, and lexical similarity of standard headings (e.g., Introduction, Related Work, Methodology, Results, Conclusion).
- Sentence‑level statistics – average length, variance, proportion of unusually short or long sentences.
- Lexical diversity metrics – type‑token ratio, hapax legomena count, proportion of domain‑specific terminology.
- n‑gram frequency profiles – unigrams, bigrams, trigrams, with particular attention to nonsensical or overly repetitive sequences.
- Formula and figure density – ratio of LaTeX equations, tables, and figures to total text.
- Formatting consistency – line spacing, font usage, indentation patterns, and other typographic regularities.
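A few of the sentence-level and lexical descriptors above can be sketched in plain Python. This is an illustrative sketch only: the function name, the regex-based tokenization, and the naive sentence splitter are assumptions, not the authors' actual implementation.

```python
import re
from collections import Counter
from statistics import mean, pstdev

def extract_features(text):
    """Compute a handful of lexical-diversity and sentence-level
    descriptors similar to those listed above (illustrative only)."""
    # Naive sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s]
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    counts = Counter(tokens)
    sent_lens = [len(re.findall(r"[a-zA-Z']+", s)) for s in sentences]
    return {
        # Distinct word forms over total tokens.
        "type_token_ratio": len(counts) / len(tokens) if tokens else 0.0,
        # Words that occur exactly once.
        "hapax_count": sum(1 for c in counts.values() if c == 1),
        "mean_sentence_len": mean(sent_lens) if sent_lens else 0.0,
        "sentence_len_std": pstdev(sent_lens) if len(sent_lens) > 1 else 0.0,
    }
```

A real pipeline would compute one such dictionary per paper and stack the values into a feature matrix.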
All features were normalized (z‑score) before being fed into several classification algorithms, including linear Support Vector Machines (SVM), logistic regression, random forests, and a shallow multilayer perceptron. Model selection and hyper‑parameter tuning were performed with ten‑fold cross‑validation, so that the reported performance estimates were not inflated by overfitting.
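Assuming a scikit-learn setup (the paper does not specify tooling), the normalization-plus-linear-SVM pipeline with ten-fold cross-validation might look like the following sketch, with synthetic stand-in data in place of the real feature matrix:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the feature matrix X (papers x features)
# and labels y (1 = computer generated); real features go here.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 8)), rng.normal(3, 1, (100, 8))])
y = np.array([0] * 100 + [1] * 100)

# z-score normalization followed by a linear SVM; hyper-parameter
# tuning would happen inside the cross-validation loop.
model = make_pipeline(StandardScaler(), LinearSVC())
scores = cross_val_score(model, X, y, cv=10)  # ten-fold cross-validation
print(f"mean accuracy: {scores.mean():.3f}")
```

Fitting the scaler inside the pipeline ensures normalization statistics are computed only on each training fold, avoiding leakage into the held-out fold.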
The linear SVM emerged as the best performer, achieving an overall accuracy of 99% and an F1‑score of approximately 0.99. Crucially, the classifier produced zero false negatives: no computer‑generated paper was mistakenly labeled as human‑written. The false‑positive rate (human papers misidentified as generated) was a modest 2%, primarily affecting documents with atypically short abstracts or unconventional reference formatting, which weakened the citation‑pattern features. The Receiver Operating Characteristic (ROC) curve yielded an Area Under the Curve (AUC) of 0.998, confirming near‑perfect discriminative power.
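The headline numbers follow directly from the confusion counts on the 200-paper test set, taking "computer generated" as the positive class:

```python
# Confusion counts implied by the reported results on 200 papers
# (positive class = computer generated).
tp, fn = 100, 0      # all generated papers caught
tn, fp = 98, 2       # 2% of human papers flagged

accuracy = (tp + tn) / (tp + tn + fp + fn)
fpr = fp / (fp + tn)           # human papers misidentified as generated
fnr = fn / (fn + tp)           # generated papers missed

print(accuracy, fpr, fnr)      # -> 0.99 0.02 0.0
```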
Error analysis revealed that the few misclassifications stemmed from edge cases where the engineered features—especially those relying on citation regularity—were less informative. This insight suggests that augmenting the feature set with deeper semantic embeddings (e.g., BERT‑based sentence vectors) could further reduce false positives.
Beyond the immediate task of detecting fabricated conference submissions, the authors argue that their methodology generalizes to other domains plagued by automatically generated spam, such as unsolicited advertisements on message boards, deceptive posts on social‑news platforms, and even AI‑crafted fake news articles. By incorporating environment‑specific formatting cues (e.g., HTML tag patterns, markdown structures) alongside textual statistics, similar classifiers could be trained to flag malicious content across diverse online ecosystems.
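As one illustration of the environment-specific formatting cues mentioned above, HTML tag frequencies in a forum or message-board post could be turned into additional features using only the standard library; the class and the normalized-frequency profile here are assumptions for this sketch, not part of the paper:

```python
from html.parser import HTMLParser
from collections import Counter

class TagCounter(HTMLParser):
    """Count HTML start tags -- a simple formatting cue that could be
    appended to the feature vector for message-board or social-news posts."""
    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1

def tag_profile(html):
    """Return each tag's share of all start tags in the document."""
    parser = TagCounter()
    parser.feed(html)
    total = sum(parser.tags.values()) or 1
    return {tag: n / total for tag, n in parser.tags.items()}
```

Analogous profiles could be built for markdown structures or other environment-specific markup.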
The paper acknowledges several limitations. The synthetic corpus relies on SCIgen, which, while representative of early generation tools, does not capture the linguistic sophistication of modern large‑scale language models (e.g., GPT‑4). Consequently, the reported performance may not directly transfer to texts generated by state‑of‑the‑art neural networks. Additionally, the dataset size (200 papers) is modest, and broader validation on larger, more heterogeneous corpora is necessary to confirm scalability.
Future work outlined by the authors includes: (1) expanding the dataset to encompass a wider variety of generation techniques and real‑world spam sources; (2) integrating deep‑learning‑based representations with the handcrafted features to build a hybrid model that benefits from both interpretability and expressive power; (3) developing lightweight, real‑time detection pipelines suitable for deployment in conference management systems and online content moderation platforms; and (4) exploring adversarial robustness, ensuring that the classifier remains effective even as generators evolve to mimic human stylistic nuances.
In conclusion, the study demonstrates that a carefully engineered set of structural and contextual features, when combined with standard supervised learning algorithms, can reliably distinguish computer‑generated academic papers from authentic scholarly work. The near‑perfect detection rates reported underscore the potential of feature‑centric approaches as a practical countermeasure against automated text forgeries, offering valuable tools for preserving the integrity of scientific communication and for broader applications in digital content moderation.