Large language models (LLMs) can produce text that closely resembles human writing. This capability raises concerns about misuse, including disinformation and content manipulation. Detecting AI-generated text is essential to maintain authenticity and prevent malicious applications. Existing research has addressed detection in multiple languages, but the Bengali language remains largely unexplored. Bengali's rich vocabulary and complex structure make distinguishing human-written and AI-generated text particularly challenging. This study investigates five transformer-based models: XLM-RoBERTa-Large, mDeBERTaV3-Base, BanglaBERT-Base, IndicBERT-Base and MultilingualBERT-Base. Zero-shot evaluation shows that all models perform near chance levels (around 50% accuracy), highlighting the need for task-specific fine-tuning. Fine-tuning significantly improves performance, with XLM-RoBERTa, mDeBERTa and MultilingualBERT achieving around 91% on both accuracy and F1-score. IndicBERT performs comparatively weakly, indicating that it benefits less from fine-tuning on this task. This work advances AI-generated text detection in Bengali and establishes a foundation for building robust systems to counter AI-generated content.
Recent advances in large language models (LLMs) have enabled machines to generate text that closely resembles human writing, powering tasks such as summarization, translation, paraphrasing, creative writing and code generation. While models like ChatGPT and GPT-4 [1] showcase impressive capabilities, they also raise concerns over misuse for misinformation, propaganda and large-scale content manipulation. Public APIs make these tools widely accessible, heightening risks and underscoring the need for reliable detection methods. Misleading websites hosting AI-written articles highlight the urgency of robust systems to identify machine-generated text. This study addresses this challenge by focusing on detecting AI-generated news and paraphrased content in Bengali.
Detecting AI-generated content is essential to maintain information integrity and trust in digital communications. Prior research has explored differentiating human-written and AI-generated text in English. A research gap persists in the Bengali language due to its complex grammar, extensive vocabulary and large alphabet. These linguistic characteristics make distinguishing between human- and machine-generated Bengali text particularly challenging. [2] proposed a hybrid BiLSTM-SVM model to classify Bengali text as either human-written or ChatGPT-paraphrased, achieving an accuracy of 82.83%.
In this work, we address the question: “Can transformer-based models surpass traditional deep learning in detecting human-written versus ChatGPT-paraphrased Bengali text, and which model shows the greatest improvement?” We investigate transformer-based models for classifying human-written and ChatGPT-paraphrased Bengali text. The BanglaTextDistinguish dataset [2] is used, comprising 6,644 instances from newspapers, social media and textbooks, with AI-generated paraphrases produced using GPT-3.5. Five transformer-based models, namely XLM-RoBERTa-Large [3], mDeBERTaV3-Base [4], BanglaBERT-Base [5], IndicBERT-Base [6] and MultilingualBERT-Base [7], are assessed in zero-shot settings and fine-tuned for the classification task. We present a comparative analysis of our transformer models against recent deep learning and machine learning approaches. The study aims to determine the most effective model and enhance AI-generated text detection for the Bengali language.
Our key contributions are as follows:
• We present the first study using transformer-based models to distinguish between human-written and ChatGPT-paraphrased Bengali text.

The paper is structured as follows. Section II reviews existing work on AI-generated text detection and related studies in text processing. Section III provides background relevant to this research. Section IV describes the proposed methodology, including zero-shot classification and fine-tuning strategies with different models. Section V presents the experimental results with detailed performance analysis. Section VI concludes the paper and highlights possible directions for future research in Bengali AI-generated text detection.
Recent studies have explored AI-generated text detection using transformers, ensemble models and information-theoretic methods. Domain adaptation, multilingual setups and low-resource approaches such as contrastive learning, residual subspace methods and hybrid deep learning classifiers have also shown strong results in English and other languages. However, challenges remain with paraphrasing, domain shifts, dataset limitations and cross-lingual generalization.
[8] used information-theoretic methods with GPT-2, GPT-3.5-Turbo, LLaMA, RoBERTa and Logistic Regression, achieving 97% AUROC. [9] applied a BERT-based model to 1,378 texts, reaching 97.71% accuracy, while [10] proposed SeqXGPT with log probabilities from GPT2-xl, GPT-J and LLaMA, reporting macro-F1 above 95%. Ensemble methods in [11] and [12] achieved ROC-AUC up to 0.975 and macro-F1 of 97.9%, and [13] improved detection accuracy from 62.9% to 72.5% using adaptive ensembles. [14] combined linguistic and statistical features with transformers, reaching 99.73% accuracy on HC3-English.
[15] introduced ConDA with RoBERTa for contrastive domain adaptation, improving performance by 31.7%, while [16] used residual subspaces for cross-domain robustness gains of 14%. [17] proposed EAGLE with domain adversarial and contrastive learning, showing strong generalization on TuringBench. A hybrid POS-tagged BiLSTM in [18] reached 88% accuracy and [19] reported F1-scores up to 99% using XGBoost, Random Forest and MLP across four languages.
For Bengali, [2] developed a BiLSTM-SVM achieving 82.83% accuracy. [20] proposed ABERT with 99.09% accuracy and reduced parameters, while [21] showed human paraphrasing increased TPR but reduced AUROC. [22] combined DistilBERT with post-processing, achieving 87.5% accuracy on SemEval-2024 and [23] used AraELECTRA and AraBERT for dialectal Arabic, finding rephrased text particularly challenging.
We explore five transformer-based models for detecting AI-paraphrased Bengali text. XLM-RoBERTa-Large [3] is a multilingual model trained on 100 languages and designed for strong cross-lingual transfer. mDeBERTaV3-Base [4] improves on DeBERTa with refined attention mechanisms and training efficiency, performing well across languages. BanglaBERT-Base [5] is a monolingual model trained on large Bengali corpora, making it effective for Bengali tasks. IndicBERT-Base [6] is a multilingual model optimized for 12 Indian languages, including Bengali, which makes it efficient in low-resource settings. MultilingualBERT-Base [7] is an early multilingual model trained on 104 languages, serving as a strong baseline for multilingual text classification. Together, these models offer complementary strengths for AI-generated text detection: cross-lingual transfer, language-specific representation and resource-efficient training.
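For concreteness, the snippet below shows how such checkpoints could be loaded for binary classification with the Hugging Face transformers library. The repository identifiers are the publicly available checkpoints commonly associated with these models; they are our assumption, since the paper cites the models rather than specific repositories.

```python
# Minimal sketch: loading the five candidate models for binary sequence
# classification (human-written vs. AI-generated). Checkpoint identifiers
# are assumed Hugging Face repository names, not confirmed by the paper.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINTS = {
    "XLM-RoBERTa-Large": "xlm-roberta-large",
    "mDeBERTaV3-Base": "microsoft/mdeberta-v3-base",
    "BanglaBERT-Base": "csebuetnlp/banglabert",
    "IndicBERT-Base": "ai4bharat/indic-bert",
    "MultilingualBERT-Base": "bert-base-multilingual-cased",
}

def load(name: str):
    ckpt = CHECKPOINTS[name]
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    # num_labels=2: human-written vs. AI-generated
    model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=2)
    return tokenizer, model
```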
Accuracy represents the percentage of correctly classified samples among all samples and indicates overall classification performance. Precision measures the proportion of true positive predictions among all predicted positives and reflects the reliability of positive detections. Recall calculates the proportion of true positives identified among all actual positives and indicates the sensitivity of the model. F1 Score (Binary) is the harmonic mean of precision and recall in binary classification and balances false positives and false negatives. F1 Score (Macro) averages F1 scores across all classes equally and reflects balanced performance in multi-class settings. AUROC measures the ability of the model to distinguish between classes across thresholds and indicates overall discrimination capability. A larger AUROC shows stronger separability, while a smaller AUROC shows weaker performance. Brier Score measures the accuracy of probabilistic predictions and indicates calibration quality. A smaller Brier Score reflects better calibration and reliability, while a larger score reflects poor calibration.
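In standard notation, with TP, TN, FP and FN the confusion-matrix counts, $C$ the number of classes, $y_i \in \{0,1\}$ the true label and $\hat{p}_i$ the predicted positive-class probability of sample $i$, these metrics are defined as:

$$\mathrm{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}, \quad \mathrm{Precision} = \frac{TP}{TP+FP}, \quad \mathrm{Recall} = \frac{TP}{TP+FN},$$

$$\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad \mathrm{F1_{macro}} = \frac{1}{C}\sum_{c=1}^{C} \mathrm{F1}_c, \quad \mathrm{Brier} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{p}_i - y_i\right)^2.$$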
The BanglaTextDistinguish [2] dataset is employed in this study, which contains Bengali texts sourced from newspapers, textbooks and social media. Newspaper and textbook sentences represent formal writing, while social media provides informal usage, creating a balanced mix of linguistic styles. A total of 3,322 sentences were collected and paraphrased with GPT-3.5, resulting in 6,644 samples. Sentence lengths follow a normal distribution with an average of 43.54 words and a standard deviation of 23.17. Duplicate entries are removed to eliminate redundancy. Null instances are deleted to ensure data integrity, leaving 6,640 valid samples. Class labels are transformed into integers through label encoding for computational processing. The dataset is divided into training, validation and testing sets at a 60:20:20 ratio to support unbiased evaluation. Each subset is reformatted according to the input requirements of transformer models, including tokenization, attention mask generation and special token insertion, ensuring compatibility with both multilingual and Bengali-specific architectures.
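A minimal sketch of this preprocessing pipeline is shown below; the file name, column names ("text", "label") and label strings are assumptions for illustration, as the dataset's exact schema is not specified here.

```python
# Sketch: deduplication, null removal, label encoding and a stratified
# 60:20:20 split. File name, column names and label strings are assumed.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("bangla_text_distinguish.csv")          # assumed file name
df = df.drop_duplicates().dropna(subset=["text", "label"])

# Label encoding: map class names to integers.
df["label"] = df["label"].map({"human": 0, "ai": 1})     # assumed label strings

# 60% train, then split the remaining 40% evenly into validation and test.
train_df, temp_df = train_test_split(
    df, test_size=0.4, stratify=df["label"], random_state=42)
val_df, test_df = train_test_split(
    temp_df, test_size=0.5, stratify=temp_df["label"], random_state=42)
```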
The zero-shot experiments use the BanglaTextDistinguish [2] dataset containing human-written and GPT-3.5 paraphrased sentences. Each input sentence is tokenized according to model requirements, with special tokens, attention masks, truncation and padding applied to match the maximum sequence length. The XLM-RoBERTa-Large-xnli and mDeBERTaV3-Base-xnli models are loaded via the Hugging Face pipeline for zero-shot classification. BanglaBERT-Base, IndicBERT-Base and MultilingualBERT-Base models are used to generate contextual embeddings by mean pooling the last hidden state for each token. Label embeddings are precomputed from Bengali descriptions of the “Human-written” and “AI-generated” classes. Cosine similarity between text embeddings and label embeddings determines the predicted label, while the AI-generated probability is recorded. The models return predicted labels, similarity scores and class probabilities. Figure 1 shows the proposed methodology of our work.
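The sketch below illustrates both zero-shot routes under stated assumptions: the xnli pipeline checkpoint name, the Bengali label descriptions and the softmax over similarities are placeholders we introduce for illustration, since the exact checkpoints and prompts are not given in the text.

```python
# Sketch of the two zero-shot routes. Checkpoint names and Bengali label
# descriptions are illustrative assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer, pipeline

# Route 1: NLI-based zero-shot pipeline (assumed xnli checkpoint name).
zsc = pipeline("zero-shot-classification", model="joeddav/xlm-roberta-large-xnli")
out = zsc("এটি একটি উদাহরণ বাক্য।", candidate_labels=["মানুষের লেখা", "এআই দ্বারা তৈরি"])

# Route 2: embedding similarity for the encoder-only models.
tok = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")  # assumed checkpoint
enc = AutoModel.from_pretrained("csebuetnlp/banglabert").eval()

def embed(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**batch).last_hidden_state            # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # exclude padding
    return (hidden * mask).sum(1) / mask.sum(1)            # mean pooling

# Precomputed label embeddings from Bengali class descriptions (placeholders).
label_emb = embed(["মানুষের লেখা বাক্য", "কৃত্রিম বুদ্ধিমত্তা দ্বারা তৈরি বাক্য"])

def predict(sentence):
    sims = F.cosine_similarity(embed([sentence]), label_emb)  # shape (2,)
    probs = torch.softmax(sims, dim=0)                        # pseudo-probabilities
    return int(probs.argmax()), float(probs[1])               # label, P(AI-generated)
```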
In this section, we present the experimental results of both zero-shot inference models and fine-tuned models.
Zero-shot transformer models showed limited effectiveness in detecting AI-generated Bengali paraphrases. BanglaBERT-Base and MultilingualBERT-Base achieved around 50% accuracy with very high recall (99.55% and 99.70%) but only moderate precision, while IndicBERT-Base reached 50.34% accuracy and 92.47% recall, indicating some misclassification of human text. XLM-RoBERTa-Large-xnli and mDeBERTaV3-Base-xnli performed worst, with accuracy below 50% and recall as low as 6.33%. F1 scores were low for these two models but higher (65-67%) for BanglaBERT-Base, IndicBERT-Base and MultilingualBERT-Base. Overall, zero-shot models can detect some AI-generated content but lack task-specific training for reliable classification. Table II summarizes the performance metrics.
XLM-RoBERTa-Large-xnli achieved a macro F1 of 45.00%, AUROC of 49.03% and a Brier Score of 29.17%, indicating moderate discrimination but weak calibration. mDeBERTaV3-Base-xnli performed worst (F1 37.74%, AUROC 45.61%), while BanglaBERT-Base had a lower F1 (33.88%) but the best calibration (Brier 25.26%). IndicBERT-Base balanced classes better (F1 39.66%) but with poor calibration (Brier 37.83%) and MultilingualBERT-Base recorded the highest AUROC (58.06%) with an F1 of 34.24%. Zero-shot models lacked robustness, showing trade-offs between discrimination, balance and calibration. Table III reports these metrics.
The fine-tuned transformer models exhibited significantly improved performance in detecting AI-generated Bengali paraphrases compared to zero-shot models. XLM-RoBERTa-Large achieved the highest accuracy of 91.50% and precision of 95.84%, with an F1 score of 91.07%, demonstrating robust identification of AI-generated text while maintaining low false positives. mDeBERTaV3-Base closely followed with an accuracy of 91.35%, precision of 94.06% and F1 score of 91.06%, reflecting balanced and reliable classification. MultilingualBERT-Base showed slightly lower accuracy at 90.82% but achieved the highest recall of 90.96%, indicating strong detection of positive instances. BanglaBERT-Base and IndicBERT-Base displayed moderate performance, with accuracies of 88.26% and 74.25% respectively, highlighting some limitations in identifying AI-generated paraphrases. Overall, these results underscore that fine-tuning transformer models substantially enhances classification performance on the BanglaTextDistinguish dataset. Table IV summarizes the detailed performance metrics.
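The fine-tuning hyperparameters are not listed above, so the sketch below assumes a standard Hugging Face Trainer setup; the learning rate, batch size and epoch count are illustrative values, not the authors' settings.

```python
# Hypothetical fine-tuning sketch (continues the preprocessing sketch above).
# Hyperparameter values are illustrative, not the authors' configuration.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

ckpt = "xlm-roberta-large"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

train_ds = Dataset.from_pandas(train_df).map(tokenize, batched=True)
val_ds = Dataset.from_pandas(val_df).map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="checkpoints",
    learning_rate=2e-5,              # illustrative
    per_device_train_batch_size=16,  # illustrative
    num_train_epochs=3,              # illustrative
)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=val_ds)
trainer.train()
```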
XLM-RoBERTa-Large led the results with an F1 (Macro) score of 91.48%, AUROC of 96.87% and a low Brier Score of 8.03%, reflecting both high classification accuracy and well-calibrated probability estimates. mDeBERTaV3-Base followed closely with an F1 (Macro) of 91.34%, AUROC of 96.39% and Brier Score of 7.72%. MultilingualBERT-Base also performed robustly, achieving an F1 (Macro) of 90.82%, AUROC of 96.77% and the lowest Brier Score of 7.06%, indicating excellent probability calibration. BanglaBERT-Base showed slightly lower performance (F1: 88.25%, AUROC: 94.53%, Brier Score: 10.28%), whereas IndicBERT-Base lagged behind with an F1 (Macro) of 74.09%, AUROC of 79.67% and the highest Brier Score of 20.77%, suggesting comparatively weaker detection capability and calibration. Overall, fine-tuning on the Bengali AI-paraphrased text dataset significantly improved classification performance, discrimination ability and probability calibration over zero-shot approaches. Table V presents the detailed metrics for all fine-tuned models.

Figure 2 presents the confusion matrices of the five fine-tuned models. XLM-RoBERTa-Large achieves strong performance with very low misclassification for both human and AI texts. mDeBERTaV3-Base also performs well, with slightly more errors than XLM-RoBERTa-Large but still balanced across both classes. BanglaBERT-Base shows a moderate increase in misclassification, with higher confusion for AI texts than for human texts. IndicBERT-Base performs relatively poorly, with substantial misclassification for both human and AI texts, indicating weaker discriminative ability. MultilingualBERT-Base achieves robust performance, with low errors in both classes and the most balanced results across human and AI detection.

The Detection Error Tradeoff (DET) curves in Figure 4 show the error tradeoff across the five fine-tuned models. XLM-RoBERTa-Large demonstrates strong performance with a smooth downward-sloping curve, indicating low error rates at optimal thresholds. mDeBERTaV3-Base shows a similar pattern with slightly higher error rates than XLM-RoBERTa-Large, but still maintains a balanced tradeoff between false positives and false negatives. BanglaBERT-Base presents a very steep curve with minimal error at optimal thresholds, suggesting strong separability but a less smooth error decline. IndicBERT-Base performs comparatively weaker, with a slower decline and higher error rates across most thresholds, reflecting limited discriminative ability. MultilingualBERT-Base achieves robust results with a sharp curve and very low error rates at optimal thresholds, comparable to XLM-RoBERTa-Large.

The reliability (calibration) curves in Figure 5 compare how predicted probabilities align with true probabilities across the five models. XLM-RoBERTa-Large shows significant overconfidence at low to mid probabilities and a sharp deviation before converging near high probabilities, indicating poor calibration despite good discrimination. mDeBERTaV3-Base fluctuates heavily around the diagonal, suggesting unstable probability estimates but less systematic bias. BanglaBERT-Base follows the diagonal more closely, though it slightly underestimates mid-range probabilities and overestimates at the high end, reflecting moderate calibration. IndicBERT-Base exhibits a smoother curve with consistent underconfidence at lower probabilities, indicating better reliability but less decisive predictions.
MultilingualBERT-Base remains closer to the diagonal than XLM-RoBERTa-Large, with mild oscillations, reflecting relatively balanced calibration but some local misalignments.
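Such diagnostics can be reproduced from a fine-tuned model's test-set probabilities with scikit-learn; the sketch below continues the earlier snippets and is one plausible way to generate curves of the kind shown in Figures 4 and 5.

```python
# Sketch: AUROC, Brier Score, DET and reliability curves from test-set
# predictions (reuses trainer, tokenize and test_df from earlier sketches).
import matplotlib.pyplot as plt
import numpy as np
import torch
from datasets import Dataset
from sklearn.calibration import calibration_curve
from sklearn.metrics import DetCurveDisplay, brier_score_loss, roc_auc_score

test_ds = Dataset.from_pandas(test_df).map(tokenize, batched=True)
logits = trainer.predict(test_ds).predictions                      # shape (N, 2)
y_prob = torch.softmax(torch.tensor(logits), dim=1)[:, 1].numpy()  # P(AI-generated)
y_true = np.array(test_ds["label"])

print("AUROC:", roc_auc_score(y_true, y_prob))
print("Brier:", brier_score_loss(y_true, y_prob))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
DetCurveDisplay.from_predictions(y_true, y_prob, ax=ax1)   # error tradeoff
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)
ax2.plot(prob_pred, prob_true, marker="o")                 # reliability curve
ax2.plot([0, 1], [0, 1], linestyle="--")                   # perfect calibration
ax2.set_xlabel("Mean predicted probability")
ax2.set_ylabel("Observed fraction of positives")
plt.show()
```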
[2] introduced the BanglaTextDistinguish dataset for detecting human-generated and AI-paraphrased Bengali text, showcasing results from several prominent models. We compare our results with theirs. Table VI presents this comparison.
Among all models compared on the BanglaTextDistinguish dataset, XLM-RoBERTa-Large achieved the strongest performance with an accuracy of 91.50%, precision of 95.84% and an F1 score of 91.07%, indicating highly reliable detection of AI-paraphrased text.
We present the first systematic study on detecting AI-generated Bengali text using transformer-based models. Zero-shot evaluation of XLM-RoBERTa-Large, mDeBERTaV3-Base, BanglaBERT-Base, IndicBERT-Base and MultilingualBERT-Base shows near-chance performance, while fine-tuning raises accuracy and F1 scores to around 91% for XLM-RoBERTa, mDeBERTa and MultilingualBERT, with IndicBERT performing notably worse. Comparisons with BiLSTM-SVM and other baselines confirm the superiority of fine-tuned transformers on the BanglaTextDistinguish dataset. Future work should expand the dataset, explore cross-lingual transfer, improve robustness to unseen generation methods, develop lightweight models for real-time use and integrate linguistic or semantic cues to enhance detection reliability.