Probing the topological properties of complex networks modeling short written texts

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

In recent years, graph theory has been widely employed to probe several language properties. More specifically, the so-called word adjacency model has been proven useful for tackling several practical problems, especially those relying on textual stylistic analysis. The most common approach to treat texts as networks has simply considered either large pieces of texts or entire books. This approach has certainly worked well – many informative discoveries have been made this way – but it raises an uncomfortable question: could there be important topological patterns in small pieces of texts? To address this problem, the topological properties of subtexts sampled from entire books were probed. Statistical analyses performed on a dataset comprising 50 novels revealed that most of the traditional topological measurements are stable for short subtexts. When the performance of the authorship recognition task was analyzed, it was found that a proper sampling yields a discriminability similar to the one found with full texts. Surprisingly, the support vector machine classification based on the characterization of short texts outperformed the one performed with entire books. These findings suggest that a local topological analysis of large documents might improve their global characterization. Most importantly, it was verified, as a proof of principle, that short texts can be analyzed with the methods and concepts of complex networks. As a consequence, the techniques described here can be extended in a straightforward fashion to analyze texts as time-varying complex networks.


💡 Research Summary

The paper investigates whether complex‑network‑based analyses, which have traditionally been applied to whole books or large text fragments, can also yield reliable information when applied to short excerpts. Using the word‑adjacency model, each distinct word (after stop‑word removal, lemmatization and part‑of‑speech disambiguation) becomes a node, and an undirected edge is created whenever two words appear consecutively in the text. The authors split 50 Portuguese novels into non‑overlapping subtexts of various lengths (500, 1 000, 2 000, and 5 000 words) and constructed a separate network for each subtext.
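The word‑adjacency construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' actual pipeline: it assumes the tokens have already been preprocessed (stop words removed, words lemmatized), and the function name and toy token list are hypothetical.

```python
def build_word_adjacency_network(tokens):
    """Build an undirected word-adjacency network: each distinct
    (preprocessed) word is a node, and an edge links any two words
    that appear consecutively in the token sequence."""
    nodes = set(tokens)
    edges = set()
    for a, b in zip(tokens, tokens[1:]):
        if a != b:  # skip self-loops from immediately repeated words
            edges.add(frozenset((a, b)))
    return nodes, edges

# toy example with already-lemmatized, stop-word-free tokens
tokens = ["network", "model", "text", "network", "measure"]
nodes, edges = build_word_adjacency_network(tokens)
```

Representing each edge as a `frozenset` makes the network undirected for free: the pair ("model", "network") and ("network", "model") collapse to the same edge.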

Six topological measurements were extracted: clustering coefficient (C), average neighbor degree (kₙ), accessibility (α, an entropy‑based measure of the reachability of nodes within h‑step random walks), average shortest‑path length (l), betweenness centrality (B), and intermittency (I, quantifying the burstiness of word occurrences). For each subtext length the authors computed the mean and standard deviation of these metrics across all samples, assessing their stability. The results show that for subtexts of about 1 000–2 000 words the metrics are virtually indistinguishable from those obtained on the full books; the variability is low enough to guarantee statistical reliability. In particular, α and B, which capture quasi‑local and global structural information, remain robust even in relatively short fragments.
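Two of the six measurements, the clustering coefficient and the average neighbor degree, are simple enough to compute directly from an edge list. The sketch below is illustrative only (the paper's accessibility, betweenness, and intermittency measures need more machinery); all function names and the toy network are hypothetical.

```python
def build_adjacency(edges):
    """Map each node to its set of neighbours."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbours that are themselves linked."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u in nbrs for w in nbrs if u < w and w in adj[u])
    return 2 * links / (k * (k - 1))

def average_neighbor_degree(adj, v):
    """Mean degree of v's neighbours."""
    nbrs = adj[v]
    return sum(len(adj[u]) for u in nbrs) / len(nbrs)

# toy network: a triangle (a, b, c) plus one pendant node d
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
adj = build_adjacency(toy_edges)
```

Averaging such per-node values over a subtext's network, and then looking at the mean and standard deviation across subtexts, mirrors the stability analysis the authors report.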

To test the practical relevance of these findings, the authors performed an authorship‑recognition experiment. Each subtext was represented by the six topological features, and four supervised classifiers—k‑nearest neighbours (kNN), C4.5 decision trees, naïve Bayes, and support‑vector machines (SVM)—were trained and evaluated using 10‑fold cross‑validation. The SVM consistently achieved the highest accuracy, and, remarkably, the classification based on short subtexts (especially those of 1 000–2 000 words) outperformed the classification based on the entire books by 2–3 percentage points. This suggests that local topological signatures are more discriminative of an author’s stylistic fingerprint than global averages, possibly because they capture subtle, author‑specific patterns that are diluted when the whole text is aggregated. The other classifiers also performed reasonably well, but none matched the SVM’s advantage.
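The evaluation protocol above, where feature vectors per subtext are fed to a classifier under 10-fold cross-validation, can be sketched with stdlib Python. To keep the example dependency-free, it uses a 1-nearest-neighbour classifier (kNN is one of the four classifiers the authors tried) as a stand-in for the SVM, and the "feature vectors" are synthetic, not real topological measurements.

```python
import random
from statistics import mean

def kfold_indices(n, k, seed=42):
    """Shuffle indices and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_predict(train_X, train_y, x):
    """1-nearest-neighbour prediction under squared Euclidean distance."""
    dist = lambda a: sum((ai - xi) ** 2 for ai, xi in zip(a, x))
    return min(zip(train_X, train_y), key=lambda pair: dist(pair[0]))[1]

def cross_val_accuracy(X, y, k=10):
    """Average held-out accuracy over k folds."""
    accs = []
    for fold in kfold_indices(len(X), k):
        held = set(fold)
        tr_X = [v for i, v in enumerate(X) if i not in held]
        tr_y = [c for i, c in enumerate(y) if i not in held]
        hits = sum(knn_predict(tr_X, tr_y, X[i]) == y[i] for i in fold)
        accs.append(hits / len(fold))
    return mean(accs)

# synthetic "topological feature" vectors for two hypothetical authors
X = [(0.1 + 0.01 * i, 0.2) for i in range(20)] + \
    [(0.8 + 0.01 * i, 0.9) for i in range(20)]
y = ["author_A"] * 20 + ["author_B"] * 20
```

In the real experiment each row of `X` would hold the six measurements (C, kₙ, α, l, B, I) for one subtext, and `y` the author label.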

Beyond static authorship tasks, the authors argue that the same sampling approach can be extended to time‑varying texts such as news feeds, social‑media streams, or dialogue transcripts. By sliding a window over a continuous text stream and constructing a sequence of adjacency networks, one can monitor the evolution of topological measures, thereby detecting topic shifts, stylistic changes, or emotional dynamics in real time.
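The sliding-window idea can be sketched as a generator that emits one adjacency network per window position; topological measures computed on each snapshot would then form a time series. This is a minimal sketch under the summary's description, with hypothetical names and a toy token stream.

```python
def adjacency_edges(tokens):
    """Undirected edges between consecutive distinct tokens."""
    return {frozenset(p) for p in zip(tokens, tokens[1:]) if p[0] != p[1]}

def sliding_networks(tokens, window, step):
    """Yield (start, nodes, edges) for each window over the token
    stream, so topological measures can be tracked as the text evolves."""
    for start in range(0, len(tokens) - window + 1, step):
        chunk = tokens[start:start + window]
        yield start, set(chunk), adjacency_edges(chunk)

# toy token stream standing in for a preprocessed news/social-media feed
stream = "a b c a d b e c a f".split()
snapshots = list(sliding_networks(stream, window=4, step=2))
```

The overlap between consecutive windows (`window - step` tokens here) controls how smoothly the topological time series evolves versus how quickly it reacts to change.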

In conclusion, the study demonstrates that (i) traditional complex‑network measurements are stable for short textual fragments, (ii) short‑text network representations can be more effective than full‑text representations for stylometric classification, and (iii) the methodology opens the door to dynamic, fine‑grained analyses of textual data. These insights have practical implications for stylometry, plagiarism detection, digital humanities, and any NLP application where only limited text is available or where temporal resolution is required.

