A Lightweight Approach to Detection of AI-Generated Texts Using Stylometric Features

Reading time: 6 minutes

📝 Abstract

The growing number of AI-generated texts raises serious concerns. Most existing approaches to AI-generated text detection rely on fine-tuning large transformer models or building ensembles, which are computationally expensive and often generalize poorly across domains. Existing lightweight alternatives achieve significantly lower accuracy on large datasets. We introduce NEULIF, a lightweight approach that achieves the best performance in the lightweight detector class, requires no extensive computational power, and provides high detection accuracy. In our approach, a text is first decomposed into stylometric and readability features, which are then used for classification by a compact Convolutional Neural Network (CNN) or a Random Forest (RF). Evaluated on the Kaggle AI vs. Human corpus, our models achieve 97% accuracy (~0.95 F1) for the CNN and 95% accuracy (~0.94 F1) for the Random Forest, demonstrating high precision and recall, with ROC-AUC scores of 99.5% and 95%, respectively. The CNN (~25 MB) and Random Forest (~10.6 MB) models are orders of magnitude smaller than transformer-based ensembles and can run efficiently on standard CPU devices without sacrificing accuracy. This study also highlights the potential of such models for broader application across languages, domains, and streaming contexts, showing that simplicity, when guided by structural insights, can rival complexity in AI-generated content detection.


📄 Content

Rapid progress in generative AI has brought it to the front lines of modern society and technology. Most major, broadly available generative AI systems, including LLMs, are continuously trained on vast volumes of information collected from all accessible sources, mostly the Internet and other datasets. AI can generate content in many formats, including text, voice, music, images, video, software code, and even novel concepts. Generated content introduces slight variations, and possibly new features, that may be uncommon or even infeasible in the real world. This synthetic content is published to the Internet and later absorbed into datasets that are then used for further training of the same AI systems. Trained on datasets that include AI-generated content, generative systems produce a new generation of content that is partially based on knowledge obtained from earlier generated content. Each piece of AI-generated content may deviate from reality in small, sometimes unnoticeable, ways; in the long run these variations can accumulate into significant deviations from the real world, much like an evolutionary process.

In addition, modern LLMs may fabricate content that diverges from reality, a phenomenon known as "AI hallucinations." Recent studies have shown that the share of hallucinated content is quite high among popular LLMs, ranging from 17%-19% up to 45% [1]. If left without serious attention and appropriate correction, AI hallucinations can impose critical limitations on AI applications that negatively impact society and its progress: users may learn from content that presents a distorted and incorrect representation of the real world. Addressing all kinds of AI hallucinations is an important and enormous task. We decided to start with AI-generated texts and then expand our efforts to other types of AI-generated content.

The ability to differentiate AI-generated from human-written texts has become increasingly important with the widespread adoption of generative AI. Advanced Large Language Models (LLMs) such as ChatGPT (OpenAI), PaLM (Google), Gemini (Google), Claude (Anthropic), Grok (xAI), DeepSeek (DeepSeek AI) and others can produce fluent, human-like prose [2], making manual identification nearly impossible. Early detection attempts struggled with accuracy. For example, OpenAI’s GPT Output Detector achieved only 26% accuracy and misclassified human-written text as AI in 9% of cases before its withdrawal [3]. Recent studies, however, have reported substantially higher detection rates, often in the low 90% range [4][5][6]. Several online services now offer AI vs. human text classification (e.g., zeroGPT.com), though their reported accuracy remains quite limited.

Research in AI-generated text detection has explored two broad paradigms. Transformer-based classifiers leverage semantic and contextual cues via fine-tuned models such as BERT, RoBERTa, and DistilBERT, and can achieve F1 scores and accuracy in the range of 91%-99% in controlled settings [7][8][9][10][11][12]. However, these models require hundreds of millions of parameters, degrade significantly on unseen domains, and demand substantial computational resources [13]. Variants include zero-shot detectors like Binoculars, which uses perplexity ratios across two LLMs to flag AI text without fine-tuning [14], and ensembles of weak detectors such as Ghostbuster, which aggregate smaller model outputs to improve robustness [15].
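The perplexity-ratio idea behind zero-shot detectors can be illustrated with a minimal sketch. This is a simplification of Binoculars, which actually uses a cross-perplexity between the two models' token distributions; here we just compare plain perplexities computed from hypothetical per-token log-probabilities (the numbers below are made up for illustration):

```python
import math

def perplexity(log_probs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(log_probs) / len(log_probs))

# Hypothetical per-token log-probs of the same text under two LLMs.
observer_lp  = [-1.2, -0.8, -1.5, -0.9]    # "observer" model
performer_lp = [-0.3, -0.2, -0.4, -0.25]   # "performer" model finds the text easy

# Ratio of the two perplexities: text that one LLM finds far more
# predictable than a human baseline would is suspicious.
score = perplexity(observer_lp) / perplexity(performer_lp)
print(score)
```

A low score (both models agree the text is highly predictable) is the signal such detectors use to flag likely machine-generated text; the threshold must be calibrated on held-out data.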

Stylometric and hybrid approaches analyze linguistic features such as token statistics, syntactic complexity, readability indices, lexical diversity, and punctuation patterns. Early work demonstrated that simple feature-based classifiers can rival deep models on GPT-2/3 detection [16]. Pure stylometry methods, including Random Forests on 31 stylometric features, achieved up to 98% accuracy [17], while lightweight ensembles combining stylometric, POS, and entropy-based measures reached 85.5% accuracy with minimal computation [18]. Hybrid models that fuse handcrafted features with transformer embeddings, such as RoBERTa+E5 [7] and EssayDetect [19], achieved high accuracy in shared tasks. Approaches like T5LLMCipher leverage embedding clustering for better generalization to unseen models and domains [20]. Comprehensive surveys highlight key challenges, including out-of-distribution detection, adversarial robustness, and evaluation standards, pointing to the need for more generalizable detectors [21].
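To make the stylometric paradigm concrete, here is a minimal sketch of the kind of hand-crafted descriptors such methods compute. These four features are common illustrative choices (average sentence length, average word length, type-token ratio, punctuation density), not the exact feature set used by NEULIF or the cited works:

```python
import re
import string

def stylometric_features(text):
    """Compute a few illustrative stylometric/readability descriptors."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    n_words = len(words)
    return {
        "avg_sentence_len": n_words / max(len(sentences), 1),
        "avg_word_len": sum(map(len, words)) / max(n_words, 1),
        "type_token_ratio": len({w.lower() for w in words}) / max(n_words, 1),
        "punct_per_word": sum(c in string.punctuation for c in text) / max(n_words, 1),
    }

feats = stylometric_features("The cat sat. The cat ran, and then it slept!")
print(feats)
```

Vectors like this are cheap to compute on a CPU and become the input to a downstream classifier, which is what keeps this family of detectors lightweight.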

Most studies rely on the extensive use of transformers and other heavyweight AI technologies that demand significant computing power.
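The lightweight alternative can be sketched end to end: feature vectors like those above are fed to a Random Forest, as in the RF branch of the approach described here. The feature values and the scikit-learn configuration below are toy assumptions for illustration, not the paper's actual features, hyperparameters, or data:

```python
# Toy sketch: hand-crafted stylometric features -> Random Forest classifier.
from sklearn.ensemble import RandomForestClassifier

# Each row: [avg_sentence_len, type_token_ratio, punct_per_word] (synthetic).
X = [
    [18.0, 0.62, 0.11],  # human-like (toy values)
    [17.5, 0.58, 0.12],
    [22.0, 0.41, 0.05],  # AI-like (toy values)
    [21.0, 0.44, 0.06],
]
y = [0, 0, 1, 1]  # 0 = human-written, 1 = AI-generated

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)
print(clf.predict([[21.5, 0.43, 0.055]]))
```

Because the model is a small forest over a handful of scalar features, both training and inference run comfortably on a standard CPU, which is the efficiency argument this class of detectors makes.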

Gap & Contribution. Despite these advances, there remains a need for a lightweight solution that (a) exploits rich stylometric descriptors,

This content is AI-processed based on ArXiv data.
