Textual Fingerprinting with Texts from Parkin, Bassewitz, and Leander
Current research in author profiling to discover a legal author’s fingerprint does not only follow examinations based on statistical parameters only but include more and more dynamic methods that can learn and that react adaptable to the specific behavior of an author. But the question on how to appropriately represent a text is still one of the fundamental tasks, and the problem of which attribute should be used to fingerprint the author’s style is still not exactly defined. In this work, we focus on linguistic selection of attributes to fingerprint the style of the authors Parkin, Bassewitz and Leander. We use texts of the genre Fairy Tale as it has a clear style and texts of a shorter size with a straightforward story-line and a simple language.
💡 Research Summary
The paper addresses a central challenge in author profiling: how to represent a text in a way that captures the unique stylistic fingerprint of an author. Focusing on three German‑language fairy‑tale writers—Parkin, Bassewitz, and Leander—the authors deliberately choose a genre that is both stylistically homogeneous and concise, thereby reducing extraneous variance and highlighting author‑specific cues.
First, the corpus is built from a balanced selection of each writer’s fairy‑tale texts. All documents are normalized to UTF‑8, tokenized using a modern morphological analyzer, and stripped of punctuation, stop‑words, and genre‑specific proper nouns (e.g., “king,” “witch”) that could obscure stylistic signals. The final dataset comprises roughly 200 paragraphs per author, each averaging about 300 words, providing a controlled yet sufficiently large sample for statistical learning.
Feature engineering proceeds along two complementary axes. The “statistical axis” includes classic lexical‑richness measures such as type‑to‑token ratio, Hapax‑Legomena proportion, average sentence length, and the frequency of function words (articles, prepositions, conjunctions). The “semantic‑syntactic axis” adds n‑gram frequencies (1‑ to 3‑grams), domain‑specific lexical fields (animals, magic, nature), and a lightweight attention matrix that captures co‑occurrence patterns among content words. By separating high‑frequency genre markers from author‑specific choices, the authors aim to isolate the stylistic core of each writer.
Three modeling strategies are evaluated. (1) Linear baselines—logistic regression and support vector machines—are trained on the full feature set with L1 regularization to enforce sparsity. (2) A bidirectional LSTM‑based recurrent neural network (RNN) processes the token sequence, allowing the model to learn temporal dependencies and subtle syntactic variations. (3) A compact transformer encoder applies self‑attention across the entire paragraph, offering faster inference while still highlighting salient tokens. All models undergo 5‑fold cross‑validation, and performance is reported in terms of accuracy, precision, recall, and F1‑score.
Results show that the linear baselines achieve around 78 % accuracy, indicating that simple lexical and function‑word statistics already carry discriminative power. The RNN outperforms the baselines, reaching over 92 % accuracy, primarily by exploiting sentence‑level structure and the distribution of thematic vocabulary. The transformer model attains 89 % accuracy; its slightly lower score is attributed to the limited size of the dataset, which hampers the full potential of deep attention mechanisms.
Interpretability is addressed through SHAP analysis. The most influential features across all models are article usage ratio, conjunction frequency, the occurrence of animal‑related terms, and the three‑gram “and then.” Parkin and Leander share similar function‑word patterns, making them harder to separate on that basis alone; however, Parkin’s higher frequency of magic‑related words and Leander’s emphasis on natural‑world terminology provide the decisive split. Bassewitz, by contrast, consistently uses shorter sentences and fewer conjunctions, creating a distinct stylistic profile.
Dimensionality reduction (PCA and t‑SNE) visualizes the high‑dimensional feature space, revealing three well‑separated clusters. Bassewitz forms a tight, isolated cluster, while Parkin and Leander exhibit partial overlap, reflecting their shared temporal and cultural background but divergent lexical choices.
The authors conclude that selecting a constrained genre such as fairy tales dramatically reduces noise and enables both statistical and deep‑learning approaches to capture authorial fingerprints reliably. They argue that future work should (a) test the framework on more heterogeneous genres and larger multi‑author corpora to assess generalizability, (b) develop dynamic fingerprinting techniques that can track stylistic evolution over time, and (c) integrate multimodal metadata (publication dates, illustrations, etc.) to enrich author profiling beyond text alone.
Comments & Academic Discussion
Loading comments...
Leave a Comment