BARD10: A New Benchmark Reveals Significance of Bangla Stop-Words in Authorship Attribution

Notice: This research summary and analysis were generated automatically using AI. For authoritative details, please refer to the original arXiv paper.

This research presents a comprehensive investigation into Bangla authorship attribution. It introduces BARD10 (Bangla Authorship Recognition Dataset of 10 authors), a new balanced benchmark corpus of blog and opinion prose from ten contemporary Bangla authors, and systematically analyzes the impact of stop-word removal on classical and deep learning models to uncover the stylistic significance of Bangla stop-words. Four representative classifiers were assessed under uniform preprocessing on both BARD10 and the existing benchmark corpus BAAD16 (Bangla Authorship Attribution Dataset of 16 authors): an SVM (Support Vector Machine), Bangla BERT (Bidirectional Encoder Representations from Transformers), XGBoost, and an MLP (Multilayer Perceptron). On both datasets the classical TF-IDF + SVM baseline outperformed all other models, attaining a macro-F1 score of 0.997 on BAAD16 and 0.921 on BARD10, while Bangla BERT lagged by as much as five points. The study reveals that BARD10 authors are highly sensitive to stop-word pruning, while BAAD16 authors remain comparatively robust, highlighting a genre-dependent reliance on stop-word signatures. Error analysis showed that high-frequency function words transmit authorial signatures that are attenuated by transformer models. Three insights emerge: Bangla stop-words serve as essential stylistic indicators; carefully tuned classical ML models prove effective under short-text constraints; and BARD10 bridges formal literature and contemporary web discourse, offering a reproducible benchmark for future long-context or domain-adapted transformers.


💡 Research Summary

This paper introduces BARD10, a newly curated benchmark for Bangla authorship attribution that complements the existing BAAD16 dataset. While BAAD16 comprises literary, journalistic, and essay texts from 16 authors, BARD10 focuses on contemporary web‑based prose—blog posts, opinion pieces, and informal commentary—written by ten modern Bangla authors. The authors collected roughly 5,200 documents, each averaging about 180 words, and applied a uniform preprocessing pipeline that includes tokenization, normalization, and two experimental conditions: (1) retaining Bangla stop‑words and (2) removing them using a curated list of approximately 150 high‑frequency functional words.
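The two experimental conditions can be sketched as a small preprocessing function. This is a minimal illustration, not the paper's pipeline: the stop-word set below is a handful of common Bangla function words standing in for the curated ~150-word list, and tokenization is reduced to whitespace splitting after stripping the danda.

```python
# Illustrative sample of common Bangla function words; the paper uses
# a curated list of roughly 150 high-frequency words (not shown here).
BANGLA_STOPWORDS = {"এবং", "কিন্তু", "তবে", "যে", "এই", "ও"}

def tokenize(text: str) -> list[str]:
    """Whitespace tokenization after stripping the Bangla danda."""
    return text.replace("।", " ").split()

def preprocess(text: str, remove_stopwords: bool) -> list[str]:
    """Return tokens under one of the two experimental conditions."""
    tokens = tokenize(text)
    if remove_stopwords:
        tokens = [t for t in tokens if t not in BANGLA_STOPWORDS]
    return tokens

sample = "আমি এবং তুমি কিন্তু একসাথে যাব।"
kept = preprocess(sample, remove_stopwords=False)    # condition (1)
pruned = preprocess(sample, remove_stopwords=True)   # condition (2)
```

Running both conditions over the corpus yields the paired datasets on which the ablation results below are computed.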

Four representative classifiers were evaluated under identical training‑validation splits: (i) a linear Support Vector Machine with TF‑IDF features, (ii) XGBoost, (iii) a two‑layer Multilayer Perceptron, and (iv) Bangla BERT, a transformer pre‑trained on Bangla corpora. Hyper‑parameter tuning was performed via five‑fold cross‑validation, and performance was measured using macro‑F1, overall accuracy, and per‑class recall.
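The strongest baseline, TF-IDF + linear SVM with five-fold cross-validated tuning scored by macro-F1, can be sketched with scikit-learn. The paper does not publish code, so the n-gram range and the C grid here are illustrative assumptions, not the reported settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# TF-IDF features feeding a linear SVM; n-gram range and sublinear tf
# are illustrative choices, not the paper's reported configuration.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),
    ("svm", LinearSVC()),
])

# Five-fold cross-validated hyper-parameter search, scored by macro-F1
# to match the paper's evaluation protocol. The C grid is hypothetical.
search = GridSearchCV(
    pipeline,
    param_grid={"svm__C": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="f1_macro",
)

# Usage (author labels are integers or strings, one per document):
# search.fit(train_texts, train_author_labels)
# print(search.best_params_, search.best_score_)
```

The same `scoring="f1_macro"` setting applies unchanged when swapping in XGBoost or an MLP, which keeps the four models comparable under identical splits.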

Results on the established BAAD16 benchmark show that the classical TF‑IDF + SVM approach nearly saturates the task, achieving a macro‑F1 of 0.997, with XGBoost and MLP trailing only slightly (≈0.985). Bangla BERT, despite its deep architecture, reaches 0.952, indicating a modest gap. On the newly introduced BARD10, the same TF‑IDF + SVM remains the top performer with a macro‑F1 of 0.921, while XGBoost and MLP obtain 0.889 and 0.874 respectively, and Bangla BERT drops further to 0.873.

The most striking finding emerges from the stop‑word ablation study. Removing stop‑words from BARD10 consistently degrades all models, with macro‑F1 reductions ranging from 3 to 7 percentage points; SVM falls from 0.921 to 0.862, and XGBoost from 0.889 to 0.823. In contrast, the same ablation on BAAD16 yields negligible performance changes, and Bangla BERT even shows a slight improvement. This divergence is attributed to genre differences: BARD10’s informal web texts rely heavily on high‑frequency functional words (the Bangla equivalents of “and”, “but”, “however”) that encode subtle authorial habits such as preferred conjunction placement or habitual discourse markers. BAAD16’s formal literary style, however, conveys author identity primarily through lexical choice and syntactic structure, making it less sensitive to stop‑word manipulation.

Error analysis reveals that transformer models tend to down‑weight or effectively ignore frequent stop‑words during embedding, thereby losing the fine‑grained stylistic signal that TF‑IDF captures directly via term frequency. Moreover, the short‑text nature of both corpora (average ≤150 tokens) favors high‑dimensional linear models that avoid over‑fitting, whereas deep models suffer from limited data for learning robust contextual representations.
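A toy example makes the mechanism concrete: two "authors" who use identical content words but different conjunction habits are separable under TF-IDF only while the function words remain. The documents below are invented English stand-ins for Bangla stop-word habits, not data from the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy "authors": same content words, different conjunction habit.
author_a = "the cat and the dog and the bird"
author_b = "the cat but the dog but the bird"

# Retaining stop-words: the "and"/"but" columns keep the two apart.
kept = TfidfVectorizer().fit_transform([author_a, author_b])
sim_kept = cosine_similarity(kept[0], kept[1])[0, 0]

# Removing them: both documents collapse to "cat dog bird" and become
# indistinguishable (cosine similarity of exactly 1).
pruned = TfidfVectorizer(stop_words=["and", "but", "the"]).fit_transform(
    [author_a, author_b])
sim_pruned = cosine_similarity(pruned[0], pruned[1])[0, 0]
```

With stop-words retained the cosine similarity stays well below 1, so a linear classifier over these features can still tell the two habits apart; after removal the stylistic signal is gone entirely.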

From these observations the authors draw three key insights: (1) Bangla stop‑words are potent stylistic markers, especially in contemporary, conversational domains; (2) model selection must be aligned with data scale and genre—classical TF‑IDF + SVM remains a strong baseline for short, informal texts, while transformer‑based approaches need stop‑word‑preserving preprocessing or specialized token‑level training to compete; (3) BARD10 provides a reproducible, domain‑rich benchmark for future work on long‑context or domain‑adapted transformers, multi‑modal author attribution, and targeted vocabulary augmentation. The paper concludes by suggesting extensions such as incorporating metadata, exploring domain adaptation techniques, and designing stop‑word‑aware pre‑training objectives to further boost authorship attribution performance in Bangla.

