The Statistical Signature of LLMs
Ortal Hadad¹, Edoardo Loru², Jacopo Nudo¹, Niccolò Di Marco³, Matteo Cinelli¹, Walter Quattrociocchi¹*

¹Department of Computer Science, Sapienza University of Rome.
²Department of Computer, Control and Management Engineering, Sapienza University of Rome.
³Department of Legal, Social, and Educational Sciences, Tuscia University.
*Corresponding author. E-mail: walter.quattrociocchi@uniroma1.com

Abstract

Large language models generate text through probabilistic sampling from high-dimensional distributions, yet how this process reshapes the structural statistical organization of language remains incompletely characterized. Here we show that lossless compression provides a simple, model-agnostic measure of statistical regularity that differentiates generative regimes directly from surface text. We analyze compression behavior across three progressively more complex information ecosystems: controlled human–LLM continuations, generative mediation of a knowledge infrastructure (Wikipedia vs. Grokipedia), and fully synthetic social interaction environments (Moltbook vs. Reddit). Across settings, compression reveals a persistent structural signature of probabilistic generation. In controlled and mediated contexts, LLM-produced language exhibits higher structural regularity and compressibility than human-written text, consistent with a concentration of output within highly recurrent statistical patterns. However, this signature shows scale dependence: in fragmented interaction environments the separation attenuates, suggesting a fundamental limit to surface-level distinguishability at small scales. This compressibility-based separation emerges consistently across models, tasks, and domains and can be observed directly from surface text without relying on model internals or semantic evaluation. Overall, our findings introduce a simple and robust framework for quantifying how generative systems reshape textual production, offering a structural perspective on the evolving complexity of communication.

Keywords: large language models, stochastic generation, lossless compression, synthetic text, information dynamics

1 Introduction

The rapid diffusion of large language models (LLMs) is reshaping how language is produced, accessed, and circulated [1]. Generative systems are now embedded in search engines, productivity tools, knowledge platforms, educational settings, and social media, increasingly mediating everyday interactions with information [2]. As a result, large volumes of online text are partially or entirely generated by probabilistic models, often without clear signaling or stable boundaries between human and synthetic content [3, 4]. This shift constitutes not only a technical challenge, but a structural transformation of information ecosystems, with implications for accountability, governance, and epistemic literacy [5, 6].

At the surface level, LLM outputs often appear coherent and stylistically similar to human writing [7]. This resemblance has fueled, on the one hand, inflated claims about near-human linguistic competence [8], and on the other, a large literature on detection, labeling, provenance, and watermarking of synthetic text [9–12].
Empirical evidence, however, shows that detection remains fragile in open-world settings, with performance degrading under domain shift, paraphrasing, translation, post-editing, and changes in model families [9, 10]. This brittleness suggests that the central issue is not classifier accuracy per se, but our limited understanding of the structural statistical properties that differentiate language produced under distinct generative regimes.

Human language arises from complex cognitive and social processes, shaped by intention, communicative goals, pragmatic constraints, contextual grounding, and uneven information exposure across individuals and communities [13–15]. LLM-generated text, by contrast, is produced through iterative sampling from conditional probability distributions learned from large corpora of prior language use [16, 17]. Even when outputs appear similar, the underlying generative mechanisms differ fundamentally. Recent work has highlighted this divergence from multiple angles. Loru et al. [18] show that LLMs can replicate human-like judgment outputs while relying on qualitatively different evaluative strategies, leading to surface alignment that masks deeper epistemic mismatches. Quattrociocchi et al. [6] formalize these mismatches as epistemological fault lines and introduce the notion of Epistemia, in which linguistic plausibility substitutes for epistemic evaluation. Complementarily, Nudo et al. [19] document generative exaggeration, whereby probabilistic generation amplifies statistical regularities in simulated discourse, producing reduced variability relative to human language.

Information-theoretic concepts provide a useful theoretical reference for comparing generative regimes in a model-agnostic setting. Here, in particular, we adopt an approach based on lossless compression. Compression exploits redundancies in symbol sequences, yielding shorter encodings for text that exhibits stronger structural regularity. This connection is classical in information theory [20] and underlies universal compression schemes such as Lempel–Ziv [21]. We use gzip, standardized in RFC 1952 [22], which operates directly on surface strings without access to model internals, semantic representations, or task-specific features, making it suitable for cross-domain comparisons in partially observable settings.

We apply this compression-based framework across three contexts of increasing realism: (i) controlled corpora where humans and LLMs produce text under comparable conditions; (ii) generative mediation of a large-scale knowledge infrastructure (Wikipedia vs. Grokipedia); and (iii) fully synthetic social interaction environments (Moltbook vs. Reddit). Compression reveals systematic differences in structural regularity between human and probabilistically generated language, as well as scale-dependent effects. Rather than proposing a detector of synthetic text or assessing semantic quality, we identify a structural signature of probabilistic language generation with implications across diverse settings.

2 State of the Art

The rapid diffusion of large language models (LLMs) has generated a large and growing literature on distinguishing human-authored text from machine-generated content, largely framed as a detection problem motivated by concerns over misinformation, academic integrity, and content moderation [23].
Approaches span supervised classifiers, zero-shot statistical methods, watermarking schemes, and analyses of robustness and evasion. Supervised detection methods typically rely on discriminative models trained on labeled corpora to separate human- and machine-generated text. While effective in controlled, in-domain settings, their performance often degrades under domain shift, changes in source models, and simple transformations such as paraphrasing or post-editing. Benchmark datasets such as HC3 [24] and its extensions [25] have been introduced to support evaluation across multiple domains and tasks, including settings where semantic invariance makes detection substantially harder.

Parallel work has explored zero-shot signals derived from language-model likelihoods, token ranks, and related statistics. These methods exploit the tendency of common decoding strategies to concentrate probability mass on high-likelihood continuations, leaving detectable statistical traces. GLTR visualizes token-rank distributions and aggregates simple statistics across sampling regimes [26], while related approaches use perplexity, entropy-adjacent measures, and variability indicators such as burstiness, trading simplicity and model-agnosticism for sensitivity to decoding parameters and post-processing. More recent methods introduce theoretically grounded zero-shot criteria based on properties of the probability landscape. DetectGPT estimates local curvature of the log-probability function via stochastic perturbations, showing that machine-generated text tends to occupy regions of negative curvature [11]. Fast-DetectGPT improves efficiency by replacing explicit perturbations with curvature proxies derived from conditional probabilities [27], with further work examining robustness to perturbation strategies and token-level choices [28]. An alternative line of work frames detection as a comparison between generative distributions rather than as a classification task. Binoculars contrasts likelihoods assigned by closely related language models, demonstrating accurate zero-shot detection without task-specific training data [29]. This perspective emphasizes structural differences between generative processes rather than purely discriminative cues. Beyond post-hoc detection, watermarking and provenance methods aim to embed detectable signals at generation time. Kirchenbauer et al. propose a watermarking scheme based on biased token sampling over a secret vocabulary subset [12]. While effective in cooperative settings, such approaches raise concerns regarding robustness, adversarial removal, and applicability to pre-existing content.

Across empirical evaluations, a recurring finding is the limited robustness of both supervised and zero-shot detectors in real-world conditions, with substantial false positives and false negatives and vulnerability to simple obfuscation strategies [9, 10]. Consistent with these limitations, OpenAI discontinued its own AI text classifier due to insufficient accuracy [30]. Alongside detection-focused approaches, a smaller literature explores lightweight methods based on structural statistical features that avoid large models at inference time. Compression-based classification techniques exploit the link between statistical regularity and compressibility [31, 32]. Berchtold et al. introduce a GZIP-KNN method combining lossless compression with nearest-neighbor classification [33], connecting to earlier work linking compression and linguistic structure [31, 34]; the sketch below illustrates the underlying idea.
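For concreteness, the following is a minimal Python sketch of the compressor-plus-nearest-neighbor idea behind [31, 33]. It is an illustration under assumptions of our own, not the published implementations: the helper names, the space-concatenation convention in the NCD, and the choice of k are all placeholders.

```python
import gzip
from collections import Counter

def c(s: str) -> int:
    """Compressed size of a string in bytes under gzip."""
    return len(gzip.compress(s.encode("utf-8")))

def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance [39]: small when x and y share structure."""
    cx, cy, cxy = c(x), c(y), c(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def knn_label(query: str, labeled: list[tuple[str, str]], k: int = 5) -> str:
    """Majority vote among the k labeled texts closest to the query under NCD."""
    nearest = sorted(labeled, key=lambda pair: ncd(query, pair[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

The appeal of such schemes is that the compressor itself supplies the similarity measure, so no model training or feature engineering is required at inference time.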
Our work builds on this perspective by using compression not as a detector, but as a signal of statistical regularity, examining its behavior across controlled corpora, generative mediation of knowledge infrastructures, and fully synthetic social interaction environments.

3 Methods

3.1 Token distributions and controlled statistical regimes

To examine how compression responds to statistical token regularity, we generate token sequences with varying levels of Shannon entropy. Let $W = \{w_1, \dots, w_n\}$ be a vocabulary and $T = (t_1, \dots, t_N)$, $t_i \in W$, a text of length $N$ composed of tokens drawn from $W$. The empirical token distribution of $T$ is defined as

$$p_T(w) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\{t_i = w\}, \tag{1}$$

where $\mathbb{1}\{t_i = w\}$ is the indicator function. Given a probability mass function $p : W \to [0, 1]$, we generate a synthetic sequence of tokens by independent sampling, $t_1, \dots, t_N \sim p$, after which the tokens are concatenated to form a text $T$.

To control the statistical concentration of an empirical token distribution, we consider a parametric family of distributions whose entropy is governed by a parameter $h \in [1/n, 1]$:

$$p^{(h)}(w_1) = h, \qquad p^{(h)}(w_i) = \frac{1-h}{n-1}, \quad i = 2, \dots, n. \tag{2}$$

The entropy of this distribution is

$$H\big(p^{(h)}\big) = -\left[ h \log h + (1-h) \log \frac{1-h}{n-1} \right], \tag{3}$$

and it is bounded at the extremes:

• If $h = 1$, then $p^{(1)}$ is a Dirac distribution and $H(p^{(1)}) = 0$.
• If $h = 1/n$, then $p^{(1/n)}$ is uniform over $W$ and $H(p^{(1/n)}) = \log n$.

Varying $h$ therefore defines a continuum of reference regimes ranging from highly uniform to strongly concentrated token distributions. These regimes provide an interpretable baseline for the empirical analyses presented in Figure 1, where compression behavior is evaluated across different levels of distributional concentration.

3.2 Compression-based structural regularity

From an information-theoretic perspective, lossless compression can be interpreted as an operational probe of effective statistical regularity. Universal coding schemes such as Lempel–Ziv adapt to recurring patterns in symbol sequences without requiring access to the underlying generative process. We therefore use compression as a model-agnostic measure of structural organization that can be applied consistently across heterogeneous text sources.

For each document $x$, we compute a compression-based measure using lossless compression. All texts are encoded in UTF-8 and processed as raw byte sequences without tokenization or normalization, ensuring that the analysis depends only on observable surface structure. We use the gzip algorithm with default settings. Given a document $x$, its compressed size $C(x)$ (in bytes) defines the compression ratio

$$R(x) = \frac{C(x)}{|x|}, \tag{4}$$

where $|x|$ denotes the uncompressed size in bytes. Lower values of $R(x)$ indicate higher compressibility and stronger sequential regularity.

To examine how structural organization accumulates with length, we compute prefix-based compression curves. For a byte sequence $x = (b_1, \dots, b_n)$, prefixes $x_{1:k} = (b_1, \dots, b_k)$ are constructed for increasing $k$, and $R(x_{1:k})$ is evaluated independently. A minimal sketch of both measures follows.
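Both quantities can be computed directly with the Python standard library; the sketch below is a minimal illustration, where the prefix step size and the placeholder file name are our own choices rather than settings from the paper.

```python
import gzip

def compression_ratio(data: bytes) -> float:
    """R(x) = C(x) / |x| (Eq. 4): gzip-compressed size over raw size, in bytes."""
    return len(gzip.compress(data)) / len(data)

def prefix_curve(data: bytes, step: int = 256) -> list[tuple[int, float]]:
    """R evaluated independently on growing prefixes x_{1:k} of the byte sequence."""
    return [(k, compression_ratio(data[:k])) for k in range(step, len(data) + 1, step)]

# Texts are read as raw UTF-8 bytes, with no tokenization or normalization.
raw = open("document.txt", "rb").read()  # placeholder path
print(compression_ratio(raw), prefix_curve(raw)[:3])
```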
This procedure captures how regularity emerges as longer-range dependencies become available to the compressor.

Alongside compression metrics, we compute a limited set of descriptive statistics, including normalized diversity measures and repetition-distance indicators, used solely to contextualize structural differences. These quantities are not interpreted as direct estimates of entropy, nor as proxies for semantic quality. All analyses are applied identically across human-written and machine-generated texts, without relying on model internals, likelihood scores, or semantic representations.

3.3 Datasets

We analyze three datasets comparing human-written and machine-generated language across progressively more realistic settings:

• Controlled Human–LLM Corpus. We use the publicly available Human-AI Parallel English Corpus [35]. For a set of human-written prompts ("Chunk 1"), the dataset provides both human continuations ("Chunk 2") and model-generated completions produced by six systems: GPT-4o, GPT-4o-mini, Llama 3.1 8B, Llama 3.1 70B, and the Instruct variants of the two Llama models. This setting enables direct comparison of human and LLM outputs under matched semantic and stylistic constraints.

• Generative Mediation of Knowledge: Wikipedia vs. Grokipedia. To study generative mediation within a knowledge infrastructure, we compare 9,279 Wikipedia pages with corresponding Grokipedia entries, where content is rewritten and expanded by a generative system [36]. Pages are selected based on high reference counts and revision activity to focus on entries with substantial rewriting. Analyses are conducted at the page level, with prefix-based measurements computed at the sentence level.

• Fully Synthetic Social Interaction: Moltbook vs. Reddit. We compare Reddit discussions with Moltbook, a platform populated by autonomous LLM agents generating posts and threaded comments [37]. Reddit data are restricted to pre-2018 posts to minimize the presence of AI-generated content. Moltbook data were collected between January 31 and February 10, 2026 (https://moltbook-observatory.sushant.info.np/export), and Reddit data were obtained via the Pushshift API [38]. For each platform, we sample 10,000 posts stratified by length into four categories: low, mid, high, and very high, following [15]. This stratified sampling preserves variability in textual length despite the highly skewed distribution of social media content toward short posts. The analyses reported here focus on the central mid and high length categories, resulting in a final dataset of 20,000 posts for each platform. To ensure comparability across platforms, identical preprocessing steps are applied, including removal of non-textual elements (e.g., URLs, emojis, markup) and normalization of whitespace and encoding.

4 Results

4.1 Controlled Human vs. LLM Corpus

In this first section, we introduce the intuition behind using compression as a tool to detect LLM-generated texts. Our primary goal is to establish an interpretable baseline linking token distributional concentration to compression behavior. To investigate this connection empirically, we rely on the methodology described in Section 3.1. In brief, we consider a parametric family of probability distributions (Eq. (2)) whose entropy is controlled by a parameter $h$. We generate 20 equally spaced values of $h$ in the interval $[1/n, 1]$ and sample synthetic texts from the corresponding distributions; a sketch of this sweep is given below.
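A minimal version of the sweep is sketched here. The vocabulary size $n = 1000$, sequence length $N = 5000$, and token spellings are placeholder choices for illustration; the paper's exact settings are not reproduced.

```python
import gzip
import numpy as np

def sample_text(h: float, n: int = 1000, N: int = 5000, seed: int = 0) -> str:
    """Draw N i.i.d. tokens from p^(h) (Eq. 2): P(w_1) = h, P(w_i) = (1-h)/(n-1)."""
    rng = np.random.default_rng(seed)
    probs = np.full(n, (1.0 - h) / (n - 1))
    probs[0] = h
    return " ".join(f"w{i}" for i in rng.choice(n, size=N, p=probs))

def entropy(h: float, n: int = 1000) -> float:
    """H(p^(h)) (Eq. 3); 0 nats at h = 1, log(n) nats at h = 1/n."""
    if h >= 1.0:
        return 0.0
    return -(h * np.log(h) + (1.0 - h) * np.log((1.0 - h) / (n - 1)))

for h in np.linspace(1 / 1000, 1.0, 20):          # 20 equally spaced values in [1/n, 1]
    data = sample_text(h).encode("utf-8")
    ratio = len(gzip.compress(data)) / len(data)  # compression ratio R (Eq. 4)
    print(f"h = {h:.3f}   H = {entropy(h):.3f}   R = {ratio:.3f}")
```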
For each generated text, we compute the compression ratio (Eq. (4)) at the corresponding entropy level. Additional implementation details are provided in the Methods section. The results of this analysis are shown in Figure 1(A), where we observe a positive relationship between the two quantities: higher vocabulary entropy is associated with higher compression ratios, that is, lower compressibility.

Given this result, it is of interest to locate the entropy and compression values obtained by human-written and LLM-generated texts. We do so by considering the Human-AI Parallel English Corpus. Because the compressibility of a text depends on its length, we first ensure that only human and LLM documents of comparable length are considered. Indeed, as presented in the inset of Figure 1(A), the word-count distributions of human and LLM documents in the dataset are considerably different: the former peaks at 479 words, while the latter shows much higher heterogeneity. For this reason, we restrict our analysis to documents between 466 and 489 words, which correspond, respectively, to the first and third quartiles of the distribution for human texts. However, all results are consistent when considering the full dataset.

The colored points and bars in Figure 1(A) locate humans and LLMs in terms of average entropy and compression. While entropy values are broadly comparable across the two groups, more substantial differences emerge in their compression ratios: on average, human-written texts achieve higher compression ratios than LLM-generated ones. This pattern is further highlighted in panel B, which displays the distribution of compression ratios in a balanced sample including 1,000 documents for each group. The distributions reveal that LLM-generated texts display lower median compression ratios than human-written texts. At the same time, they exhibit higher heterogeneity, particularly in the left tail of the distribution.

For reference, we also include a benchmark based on randomly generated documents, each constructed by sampling 479 words (the mode of the human document-length distribution) from the full vocabulary of the dataset, using both uniform and weighted strategies, with 1,000 documents generated for each. In the first setting, words are selected uniformly at random from the vocabulary. In the second, they are sampled according to the empirical word distribution estimated from the joint corpus of human and LLM texts. As expected, these random sequences exhibit substantially lower compressibility than both human- and LLM-generated texts, with uniform sampling yielding the highest ratio values. Their near-uniform word usage leads to high entropy and minimal exploitable structure, making them inherently less compressible. A sketch of this baseline construction is given below.
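The baseline can be built in a few lines; in this sketch the toy corpus string is a placeholder for the joint human–LLM corpus from which the empirical word distribution is estimated.

```python
import gzip
import random
from collections import Counter

def random_document(word_counts: Counter, n_words: int = 479,
                    weighted: bool = False, seed: int = 0) -> str:
    """Sample n_words from the vocabulary, uniformly or by empirical frequency."""
    rng = random.Random(seed)
    vocab = list(word_counts)
    weights = [word_counts[w] for w in vocab] if weighted else None
    return " ".join(rng.choices(vocab, weights=weights, k=n_words))

corpus = Counter("the joint corpus of human and llm texts ...".split())  # placeholder
for weighted in (False, True):
    doc = random_document(corpus, weighted=weighted).encode("utf-8")
    print(weighted, len(gzip.compress(doc)) / len(doc))
```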
Finally, Panel C reports the differences in compression ratios between human- and LLM-generated texts as document length increases. For each document, we compute the compression ratio incrementally: first considering only the first sentence, then the first two sentences, and so on. We then compute the average and IQR of these distributions, grouping by the number of sentences. As shown in the figure, after roughly 20 sentences, increasingly high textual regularity and redundancy are reflected in the compression ratio curve for LLMs. In contrast, for humans, the metric stabilizes, indicating that human-generated documents do not exhibit increasing regularity as the number of sentences grows; instead, compressibility remains roughly constant throughout the entire text.

Fig. 1: (A) Relationship between vocabulary entropy and compression ratio for texts generated from word distributions with fixed entropy. The colored points show the average values for humans and LLMs; the bars indicate one standard deviation from the mean. The inset displays the density distribution of document length (number of words) for human-written and LLM-generated texts. (B) Distribution of compression ratios for human-written, LLM-generated, and randomly generated texts. Higher compression ratios correspond to lower compressibility. Each group contains 1,000 documents. (C) Average compression ratio of human-written and LLM-generated texts as a function of the number of sentences in the text. The shaded area shows the interquartile range of the distribution of compression ratios. For LLM-generated text, compressibility increases with length, unlike human text.

4.1.1 Discriminative Strength of Compression Features

Building on the compression differences observed above, we examine additional measures that quantify the statistical regularity of a sequence; a sketch of several of these features follows the list.

• Conditional compression. This metric approximates the incremental compression cost of the second half of a document given the first, providing a proxy for context-dependent predictability. It is computed by taking the difference $C(x+y) - C(x)$ and normalizing it by the length in bytes of $y$, i.e., dividing by $|y|$.

• Average and slope of prefix compression curves. For each document, we measure the compression ratio as more characters are progressively included. We then compute the average over the range and estimate its linear slope, which characterize the average compressibility of prefixes and the rate at which structural regularity stabilizes as the sequence grows.

• Word-order contribution metrics. To assess the specific contribution of word order to compressibility, we generate a shuffled version of each document by randomly permuting words within sentence boundaries. We then compare the original and shuffled texts using two measures: (i) the compression ratio gap, defined as $R(x_{\text{shuffled}}) - R(x)$, and (ii) the Normalized Compression Distance (NCD) [39].

• Normalized character and word entropy. Complementing compression-based measures, we compute normalized entropy at the character and word levels to measure how evenly characters and words are distributed in the text.

• Type-Token Ratio (TTR). We compute the type-token ratio, the proportion of unique words in a text, as a standard measure of lexical diversity and vocabulary richness.

• Repetition distance features. For each word, we compute the average and standard deviation of the number of words separating consecutive occurrences. This quantifies the frequency and regularity of repetitions, reflecting structural redundancy.
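The sketch below illustrates several of these features. The half-split convention for conditional compression follows the definition above, while the punctuation-based sentence splitter and the lower-casing are simplifying assumptions of ours rather than the exact pipeline.

```python
import gzip
import random
import re
from collections import Counter
from math import log

def C(b: bytes) -> int:
    """gzip-compressed size in bytes."""
    return len(gzip.compress(b))

def conditional_compression(doc: str) -> float:
    """(C(x + y) - C(x)) / |y|, with x and y the first and second halves of the text."""
    b = doc.encode("utf-8")
    x, y = b[: len(b) // 2], b[len(b) // 2 :]
    return (C(x + y) - C(x)) / len(y)

def shuffle_within_sentences(doc: str, seed: int = 0) -> str:
    """Permute words inside each sentence, destroying order but keeping frequencies."""
    rng = random.Random(seed)
    shuffled = []
    for sentence in re.split(r"(?<=[.!?])\s+", doc):
        words = sentence.split()
        rng.shuffle(words)
        shuffled.append(" ".join(words))
    return " ".join(shuffled)

def compression_ratio_gap(doc: str) -> float:
    """R(x_shuffled) - R(x): the contribution of word order to compressibility."""
    b = doc.encode("utf-8")
    s = shuffle_within_sentences(doc).encode("utf-8")
    return C(s) / len(s) - C(b) / len(b)

def normalized_word_entropy(doc: str) -> float:
    """Shannon entropy of the word distribution, normalized by log(vocabulary size)."""
    counts = Counter(doc.lower().split())
    total = sum(counts.values())
    H = -sum((c / total) * log(c / total) for c in counts.values())
    return H / log(len(counts)) if len(counts) > 1 else 0.0

def type_token_ratio(doc: str) -> float:
    """Proportion of unique words: a standard lexical-diversity measure."""
    words = doc.lower().split()
    return len(set(words)) / len(words)
```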
In Figure 2, we show the distribution of these metrics across human- and LLM-generated text from the Human-AI Parallel Corpus. This comparison reveals notable differences, even between different model families. GPT models, in particular, display higher conditional compression, lower prefix variability, and sharply peaked distributional profiles, indicating that their sequences stabilize quickly and remain strongly constrained by prior context. Although GPT outputs show comparatively high lexical diversity, they remain more compressible overall, implying that redundancy arises from structural regularity rather than simple word repetition. Llama 3 models occupy an intermediate regime between GPT and human text, while instruction tuning increases variability and shifts distributions closer to the human baseline without eliminating the separation.

Fig. 2: Distribution of structural and compression-based features across human-written and LLM-generated texts in the Human–AI Parallel Corpus.

To further evaluate the discriminative power of the proposed compression-based features, we trained a Histogram-based Gradient Boosting Classification Tree [40] using only the features described above; a minimal training sketch is given below. We consider three classification settings of increasing granularity. In the most fine-grained setting, we train a seven-class classifier to distinguish between Human, GPT-4o, GPT-4o Mini, Llama 3 70B, Llama 3 70B Instruct, Llama 3 8B, and Llama 3 8B Instruct. The classifier achieves an overall accuracy of 0.65 (macro F1 = 0.65). GPT-family models and human text are identified more reliably (F1 scores between 0.75 and 0.82), while base Llama models are more frequently confused (F1 scores between 0.46 and 0.66). Notably, this performance is comparable to prior detection work [35] reporting 66% accuracy using 60 lexical, grammatical, and rhetorical features. This suggests that metrics related to statistical regularity can already capture much of the separable signal. When collapsing labels into Human, GPT, and Llama, accuracy rises to 0.93 (macro F1 = 0.90), indicating that compression features strongly encode family-level generative signatures. Finally, for the binary Human vs. LLM task, the classifier achieves 0.93 accuracy (macro F1 = 0.88). LLM outputs are detected with very high precision and recall (F1 = 0.96), while human texts remain the more challenging class (F1 = 0.80).
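A minimal training sketch with scikit-learn's HistGradientBoostingClassifier [40] follows. The random feature matrix is a stand-in for the per-document feature vectors described above, so the printed scores will be at chance; real features are needed to reproduce the reported accuracies.

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Placeholder: one row per document, one column per structural feature
# (conditional compression, prefix mean/slope, shuffle gap, NCD, entropies,
# TTR, repetition distances). Replace with the measured values.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = rng.choice(["human", "llm"], size=2000)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
clf = HistGradientBoostingClassifier(random_state=0)  # default hyperparameters
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))  # per-class and macro F1
```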
For model explainability, we perform a SHAP analysis on the binary classifier, with the aim of quantifying the impact of each feature on pushing a prediction toward either class. The results, shown in Fig. 3, reveal that compression-based features and lexical-statistical measures contribute most strongly to the model's decision boundary.

Fig. 3: Global feature importance based on mean absolute Shapley values for the Gradient Boosting Classifier. Bars indicate the average magnitude of each feature's contribution to the predicted probability of the Human class across the test set.

4.2 Generative Mediation of Knowledge: Wikipedia vs. Grokipedia

While the previous section established compression as a proof of concept under controlled generation, here we examine its behavior in a setting where human-authored content is rewritten and expanded by LLMs.

Wikipedia and Grokipedia provide parallel versions of the same encyclopedic content, with Grokipedia consisting of LLM-mediated rewritings of original Wikipedia pages. This setting allows us to examine how compression behaves when human-curated knowledge is partially reshaped through probabilistic generation, and whether such mediation introduces measurable structural differences in the resulting text.

Consistent with the approach used for the Human–AI Parallel Corpus and described in the Methods, we start by computing the compression ratio through an incremental prefix analysis, where documents are decomposed into sentences and progressively extended one sentence at a time. Prefix values are aggregated into 20 uniformly spaced bins (minimum 100 observations) and summarized by mean and interquartile range.

Fig. 4: (A) Average compression ratio of Wikipedia and Grokipedia page texts as a function of the number of sentences in the text. The shaded area represents the interquartile range of the compression ratio distribution. (B) Distribution of Conditional Compression Ratio, Normalized Word-Level Entropy, Mean Repetition Distance, and Repetition Distance Variability across texts from Wikipedia and Grokipedia pages.

Figure 4(A) shows that differences in compression ratios begin to emerge after a certain number of sentences.
Around 50 sentences, the interquartile ranges of the two distributions show diminishing overlap, suggesting that statistical regularities accumulate as text length increases and recurrent structures become more exploitable by compression. As the prefix length approaches 500 sentences, the observed gap progressively narrows. Beyond a certain threshold, this reduction can in part be attributed to the way Grok operates. In particular, Wikipedia pages are predominantly rewritten by Grok in the initial part of the page [41]. This portion is also the most frequently accessed by users [42], which likely incentivizes more intensive revisions at the beginning of the document. As the prefix extends to include portions of text where LLM intervention is comparatively limited, the proportion of human-written text increases and the difference between human- and LLM-generated text gradually disappears.

Figure 4(B) reports the distribution of four relevant metrics. Conditional compression is slightly lower for Grokipedia, indicating reduced incremental compression cost in LLM-mediated text. At the same time, normalized word-level entropy is higher, suggesting broader lexical dispersion. In contrast, Wikipedia texts exhibit more frequent word repetition with lower variance. These patterns differ from those observed in the controlled completion setting, likely reflecting both the longer document structure and the different prompting conditions underlying generative mediation. Rewriting human-generated text with LLMs, particularly in an encyclopedic setting, may therefore alter lexical statistics while preserving a lower overall compression rate.

Employing the same model specification and features used for the Human–AI Parallel Corpus, we obtain a binary classification accuracy of 0.85 on this dataset (macro F1 = 0.85). Wikipedia articles are identified with high recall (0.94; F1 = 0.86), while Grokipedia pages reach F1 = 0.84, with higher precision (0.93) but lower recall (0.76).

4.3 Fully Synthetic Social Interaction: Moltbook vs. Reddit

We extend the analysis to a fully synthetic social setting, comparing human-generated discussions on Reddit with agent-driven interactions on Moltbook, where all content is produced by LLM-based agents. This scenario allows us to examine how compression behaves when conversational dynamics themselves are generated probabilistically rather than emerging from human interaction. Specifically, here we employ the individual post as the unit of analysis. Compression ratios are computed following the same procedure described in the Methods and used for the other two datasets.

Figure 5(A) reports the average compression ratio as a function of the number of sentences. In contrast to the two settings analyzed above, differences between Moltbook and Reddit persist only over a limited range of post lengths, and the gap between the two narrows after approximately 25 sentences. Furthermore, over the shorter range of post lengths considered here, LLM-generated content shows a slightly larger compression ratio, corresponding to lower regularity. Figure 5(B) shows that Moltbook interactions exhibit higher lexical diversity, reflected in a rightward shift in unique word ratio, while maintaining comparable overall compressibility.
In contrast, Reddit displays higher normalized compression distance and slightly stronger conditional compression, indicating greater sensitivity to sequential ordering and local predictability. The prefix ratio trend remains largely similar across corpora, suggesting comparable incremental redundancy dynamics.

These results suggest that fully synthetic interaction environments do not simply amplify the structural signatures observed in the previous sections. Instead, the emergence of compression-based differences appears to depend also on the scale and structure of the generated text. Since the texts in this case are short and fragmented, we cannot observe the structural regularity that tends to emerge as the length of a text increases. Furthermore, we should also consider the role of the prompting conditions underlying Moltbook generation. Prompts aimed toward conversational text generation may encourage LLMs to adopt a more casual, context-local, and human-like discourse style that deviates from their baseline generation. We hypothesize that such conditioning attenuates the emergence of the systematic long-range regularities that become detectable in extended informational texts, and alters stylistic features in such a way that generated messages effectively show lower regularity and thus lower compressibility.

To complement the analysis presented above, we apply the same model specification and feature set used in the previous sections. On this task, we obtain a binary classification accuracy of 0.88 (macro F1 = 0.88). Performance is highly balanced across classes: Reddit posts are identified with precision and recall of 0.88 (F1 = 0.88), while Moltbook posts achieve precision 0.88 and recall 0.87 (F1 = 0.88).

Fig. 5: (A) Average compression ratio of Moltbook and Reddit comments as a function of the number of sentences in the text. The shaded area represents the interquartile range of the compression ratio distribution. (B) Distribution of Normalized Compression Distance, Prefix Ratio Trend, Unique Word Ratio, and Conditional Compression Ratio across Moltbook and Reddit posts.

5 Conclusions

In this work, we examine whether lossless compression can serve as an operational proxy for identifying structural statistical differences between human language and text produced or mediated by large language models. We show that compression-based analysis offers a simple and interpretable way to study how generative systems reshape linguistic structure across tasks and environments.

Across three progressively more complex settings, i.e., controlled continuation tasks, generative mediation of encyclopedic knowledge, and fully synthetic social interaction, we observe that compression consistently captures differences in structural organization between human and LLM-generated language.
In controlled continuation tasks, we find that compression reveals a clear separation between human- and LLM-generated text, particularly as a greater number of sentences is included in the analysis. This indicates that probabilistic generation concentrates sequences into more regular and redundant patterns. In generative mediation of encyclopedic content, compression reveals how rewriting reorganizes structure while lexical dispersion increases, highlighting a decoupling between vocabulary diversity and sequential regularity. In fully synthetic conversational environments, however, compression-based separation is observed only over a short range of text lengths. Short conversational units and this particular generative regime may limit the accumulation of long-range statistical dependencies.

Our findings indicate that compression captures a structural footprint of plausibility-driven language production. Importantly, this effect should not be interpreted as an evaluation of semantic quality or truthfulness. Compression reflects differences in the generative process rather than differences in meaning. Further, the observed patterns suggest that statistical regularization emerges gradually as generated text becomes longer and more structurally cohesive, rather than appearing uniformly.

As synthetic content becomes increasingly embedded within information ecosystems, understanding these structural shifts is essential for characterizing the dynamics of large-scale language production. Future work may explore how compression-derived signals interact with human post-editing, alternative decoding strategies, and hybrid human–machine production pipelines, as well as their implications for the governance of synthetic information environments.

References

[1] Burton, J.W., Lopez-Lopez, E., Hechtlinger, S., Rahwan, Z., Aeschbach, S., Bakker, M.A., Becker, J.A., Berditchevskaia, A., Berger, J., Brinkmann, L., et al.: How large language models can reshape collective intelligence. Nature Human Behaviour 8(9), 1643–1655 (2024)

[2] Wessel, M., Adam, M., Benlian, A., Majchrzak, A., Thies, F.: Generative AI and its transformative value for digital platforms. Journal of Management Information Systems 42(2), 346–369 (2025)

[3] Dugan, L., Ippolito, D., Kirubarajan, A., Shi, S., Callison-Burch, C.: Real or fake text?: Investigating human ability to detect boundaries between human-written and machine-generated text. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 12763–12771 (2023)

[4] Wu, J., Yang, S., Zhan, R., Yuan, Y., Chao, L.S., Wong, D.F.: A survey on LLM-generated text detection: Necessity, methods, and future directions. Computational Linguistics 51(1), 275–338 (2025)

[5] Bender, E.M., Gebru, T., McMillan-Major, A., Shmitchell, S.: On the dangers of stochastic parrots: Can language models be too big? In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 610–623. ACM (2021). https://doi.org/10.1145/3442188.3445922

[6] Quattrociocchi, W., Capraro, V., Perc, M.: Epistemological fault lines between human and artificial intelligence. arXiv (2025)

[7] Crothers, E.N., Japkowicz, N., Viktor, H.L.: Machine-generated text: A comprehensive survey of threat models and detection methods. IEEE Access 11, 70977–71002 (2023)

[8] Duncan, D.: Does ChatGPT have sociolinguistic competence? Journal of Computer-Assisted Linguistic Research 8, 51–75 (2024)
[9] Weber-Wulff, D., et al.: Testing of detection tools for AI-generated text. arXiv preprint arXiv:2306.15666 (2023)

[10] Perkins, M., Roe, J., Postma, D., Gür, E., Jayasuriya, T., Lin, H., Xie, J., Yuan, B.: Simple techniques to bypass GenAI text detectors: Implications for inclusive education. International Journal of Educational Technology in Higher Education (2024). https://doi.org/10.1186/s41239-024-00487-w

[11] Mitchell, E., Lee, Y., Khazatsky, A., Manning, C.D., Finn, C.: DetectGPT: Zero-shot machine-generated text detection using probability curvature. arXiv (2023)

[12] Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., Goldstein, T.: A watermark for large language models. In: Proceedings of the 40th International Conference on Machine Learning (ICML) (2023)

[13] Pinker, S.: Language, Cognition, and Human Nature: Selected Articles. Oxford University Press (2013)

[14] Ellis, N.C.: Essentials of a theory of language cognition. The Modern Language Journal 103, 39–60 (2019)

[15] Di Marco, N., Loru, E., Bonetti, A., Serra, A.O.G., Cinelli, M., Quattrociocchi, W.: Patterns of linguistic simplification on social media platforms over time. Proceedings of the National Academy of Sciences 121(50) (2024). https://doi.org/10.1073/pnas.2412105121

[16] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)

[17] Holtzman, A., Buys, J., Du, L., Forbes, M., Choi, Y.: The curious case of neural text degeneration. In: International Conference on Learning Representations (ICLR) (2020)

[18] Loru, E., Nudo, J., Di Marco, N., Santirocchi, A., Atzeni, R., Cinelli, M., Cestari, V., Rossi-Arnaud, C., Quattrociocchi, W.: The simulation of judgment in LLMs. Proceedings of the National Academy of Sciences 122(42), 2518443122 (2025). https://doi.org/10.1073/pnas.2518443122

[19] Nudo, J., Pandolfo, M.E., Loru, E., Samory, M., Cinelli, M., Quattrociocchi, W.: Generative exaggeration in LLM social agents: Consistency, bias, and toxicity. arXiv (2025). Also published in: Online Social Networks and Media 51, 100344 (2026). https://doi.org/10.1016/j.osnem.2025.100344

[20] Cover, T.M., Thomas, J.A.: Elements of Information Theory, 2nd edn. Wiley-Interscience (2006)

[21] Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Transactions on Information Theory 23(3), 337–343 (1977). https://doi.org/10.1109/TIT.1977.1055714

[22] Deutsch, P.: GZIP file format specification version 4.3. RFC 1952 (1996). https://doi.org/10.17487/RFC1952

[23] Wu, J., et al.: A survey on LLM-generated text detection. arXiv preprint arXiv:2310.14724 (2023)

[24] Guo, B., et al.: How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection. arXiv preprint arXiv:2301.07597 (2023)

[25] Su, Z., Wu, X., Zhou, W., Ma, G., Hu, S.: HC3 Plus: A semantic-invariant human ChatGPT comparison corpus. arXiv preprint arXiv:2309.02731 (2023)

[26] Gehrmann, S., Strobelt, H., Rush, A.: GLTR: Statistical detection and visualization of generated text. In: Proceedings of ACL (2019). https://aclanthology.org/P19-3019/

[27] Bao, G., Zhao, Y., Teng, Z., Yang, L., Zhang, Y.: Fast-DetectGPT: Efficient zero-shot detection of machine-generated text via conditional probability curvature. In: International Conference on Learning Representations (ICLR) (2024). https://arxiv.org/abs/2310.05130
[28] Liu, S., et al.: Does DetectGPT fully utilize perturbation? Bridging and integrating metric-based and fine-tuned detectors. In: Proceedings of ACL (2024). https://aclanthology.org/2024.acl-long.103/

[29] Hans, A., Schwarzschild, A., Cherepanova, V., Kazemi, H., Saha, A., Goldblum, M., Geiping, J., Goldstein, T.: Spotting LLMs with Binoculars: Zero-shot detection of machine-generated text. In: ICML (2024). https://arxiv.org/abs/2401.12070

[30] OpenAI: New AI classifier for indicating AI-written text (2023). Updated July 20, 2023: tool discontinued due to low accuracy

[31] Jiang, Z., Yang, M., Tsirlin, M., Tang, R., Dai, Y., Lin, J.: "Low-resource" text classification: A parameter-free classification method with compressors. In: Findings of the Association for Computational Linguistics: ACL 2023, pp. 6810–6828 (2023). https://doi.org/10.18653/v1/2023.findings-acl.426

[32] Loru, E., Di Marco, N., Cinelli, M., Quattrociocchi, W.: A compression-based approach to detecting automated and coordinated behavior on social media. ACM Transactions on Knowledge Discovery from Data 20(2), 1–25 (2026). https://doi.org/10.1145/3778356

[33] Berchtold, M., Mitrović, S., Andreoletti, D., Puccinelli, D., Ayoub, O.: Detecting ChatGPT-generated text with GZIP-KNN: A no-training, low-resource approach. In: Proceedings of the 7th International Conference on Natural Language and Speech Processing (ICNLSP 2024), Trento, Italy, pp. 25–33 (2024)

[34] Mahoney, M.: Text compression as a test for natural language processing. Proceedings of the AAAI Workshop on Text Compression (1999)

[35] Reinhart, A., Markey, B., Laudenbach, M., Pantusen, K., Yurko, R., Weinberg, G., Brown, D.W.: Do LLMs write like humans? Variation in grammatical and rhetorical styles. Proceedings of the National Academy of Sciences 122(8), 2422455122 (2025). https://doi.org/10.1073/pnas.2422455122

[36] Hadad, O., Loru, E., Nudo, J., Bonetti, A., Cinelli, M., Quattrociocchi, W.: Wikipedia and Grokipedia: A comparison of human and generative encyclopedias. arXiv preprint arXiv:2602.05519 (2026)

[37] Moltbook Observatory Export. https://moltbook-observatory.sushant.info.np/export. Accessed: 2026-02-10

[38] Baumgartner, J., Zannettou, S., Keegan, B., Squire, M., Blackburn, J.: The Pushshift Reddit dataset. In: Proceedings of the International AAAI Conference on Web and Social Media, vol. 14, pp. 830–839 (2020)

[39] Li, M., Chen, X., Li, X., Ma, B., Vitányi, P.M.B.: The similarity metric. IEEE Transactions on Information Theory 50(12), 3250–3264 (2004). https://doi.org/10.1109/TIT.2004.838101

[40] HistGradientBoostingClassifier — scikit-learn.org. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html. Accessed 12-02-2026

[41] Triedman, H., Mantzarlis, A.: What did Elon change? A comprehensive analysis of Grokipedia. arXiv preprint arXiv:2511.09685 (2025)

[42] Lamprecht, D., Helic, D., Strohmaier, M.: Quo vadis? On the effects of Wikipedia's policies on navigation. Proceedings of the International AAAI Conference on Web and Social Media 9(5), 64–66 (2021). https://doi.org/10.1609/icwsm.v9i5.14699