MultiBanAbs: A Comprehensive Multi-Domain Bangla Abstractive Text Summarization Dataset


📝 Original Info

  • Title: MultiBanAbs: A Comprehensive Multi-Domain Bangla Abstractive Text Summarization Dataset
  • ArXiv ID: 2511.19317
  • Date: 2025-11-24
  • Authors: The author information stated in the paper was not provided.

📝 Abstract

This study developed a new Bangla abstractive summarization dataset to generate concise summaries of Bangla articles from diverse sources. Most existing studies in this field have concentrated on news articles, where journalists usually follow a fixed writing style. While such approaches are effective in limited contexts, they often fail to adapt to the varied nature of real-world Bangla texts. In today's digital era, a massive amount of Bangla content is continuously produced across blogs, newspapers, and social media. This creates a pressing need for summarization systems that can reduce information overload and help readers understand content more quickly. To address this challenge, we developed a dataset of over 54,000 Bangla articles and summaries collected from multiple sources, including blogs such as Cinegolpo and newspapers such as Samakal and The Business Standard. Unlike single-domain resources, our dataset spans multiple domains and writing styles. It offers greater adaptability and practical relevance. To establish strong baselines, we trained and evaluated several deep learning and transfer learning models on this dataset, including LSTM, BanglaT5-small, and mT5-small. The results highlight its potential as a benchmark for future research in Bangla natural language processing. This dataset provides a solid foundation for building robust summarization systems and helps expand NLP resources for low-resource languages.

💡 Deep Analysis

📄 Full Content

Abstractive text summarization in Bangla has been the subject of many previous works, most of which rely on news articles, whose summaries are typically composed by journalists according to traditional reporting conventions. The current research proposes an approach developed on a more diversified dataset that includes writers and contributors beyond journalists, such as bloggers, content creators, and general users.

This dataset consolidates a substantial volume of information from a wide range of sources. Given the extensive availability and long-term archives maintained by newspaper websites, a large number of articles have been collected from The Business Standard (around 12,000) and Samakal (about 42,000). Owing to public access limitations and the restricted archiving range of Bangla blog sites, only about 700 posts have been gathered from Cinegolpo.

Additionally, transformer-based models such as BanglaT5-small and mT5-small have been applied to the abstractive summarization task. These models are based on the sequence-to-sequence transformer architecture, which enables effective capture of contextual relationships within Bangla text. Both BanglaT5-small and mT5-small have been fine-tuned on the constructed dataset to generate high-quality abstractive summaries, demonstrating strong performance compared to traditional recurrent approaches. Their application further highlights the adaptability and advancement of neural architectures in Bangla text summarization.

Model evaluation has been performed using standard quantitative metrics, including ROUGE and BLEU, which measure the quality of generated summaries by assessing their overlap with reference summaries. These metrics quantify how closely predicted summaries align with ground-truth summaries and have been widely adopted in natural language processing research. To sum up, the contributions are as follows:

- The first multi-domain Bangla text summarization dataset is introduced. It captures diverse writing styles and patterns.
- The dataset is the largest multi-domain collection to date. It has 54,620 articles and their summaries from three different sources.
- Strong baselines are built using deep learning models. The results are comparable to or better than those of state-of-the-art models.
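To make the overlap idea behind the ROUGE metric mentioned above concrete, here is a minimal sketch of ROUGE-N precision, recall, and F1. It is not the evaluation code used in the study: the function names `ngrams` and `rouge_n` are illustrative, and whitespace tokenization is an assumption that real Bangla evaluation pipelines may replace with a language-aware tokenizer.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n(reference, candidate, n=1):
    """Compute ROUGE-N precision, recall, and F1 for one summary pair.

    Assumption: both texts are tokenized by whitespace.
    """
    ref_counts = Counter(ngrams(reference.split(), n))
    cand_counts = Counter(ngrams(candidate.split(), n))

    # Clipped overlap: each n-gram counts at most as often as it
    # appears in both the reference and the candidate.
    overlap = sum((ref_counts & cand_counts).values())
    ref_total = sum(ref_counts.values())
    cand_total = sum(cand_counts.values())

    recall = overlap / ref_total if ref_total else 0.0
    precision = overlap / cand_total if cand_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    reference = "ঢাকায় আজ ভারী বৃষ্টি হয়েছে"
    candidate = "ঢাকায় আজ বৃষ্টি হয়েছে"
    print(rouge_n(reference, candidate, n=1))
    print(rouge_n(reference, candidate, n=2))
```

Recall rewards a generated summary for covering the reference, precision penalizes extra material, and F1 balances the two. BLEU follows a related n-gram overlap idea but is precision-oriented and adds a brevity penalty, which this sketch does not implement.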

The study of Bangla text summarization has evolved significantly over the past decades, with early research primarily focusing on extractive techniques and limited datasets. One of the pioneering efforts by Islam et al. [1] introduced “Bhasa,” a search engine and summarizer for Unicode Bangla text. This approach integrated modules such as tokenization, keyword search, and summary generation and represented one of the first attempts at combining search engine capabilities with text summarization for Bangla.

Most Bangla text summarization research has used datasets from newspapers, where articles are mostly written by journalists and follow a uniform style. The BWSD [13] contains 1,100 web-sourced Bangla news articles with summaries, and MASBA [14] offers multi-level summaries of Bangla news articles. These datasets lack diversity in language, style, and vocabulary. To address this, this study builds a more varied dataset that combines newspapers, blogs, and business sources. It captures richer Bangla usage and enables models to generalize better across different text types.

A total of 54,620 articles have been collected from three different sources to capture diverse writing styles. Samakal is a major Bangla newspaper where professional journalists write formal news articles. Cinegolpo is a blogging platform with informal, story-like content on movies, series, and dramas. The Business Standard provides financial, business, and economic content written by professionals.

To collect the data efficiently, web crawlers have been built for each source. The crawlers found article links, extracted text and summaries, removed ads and incomplete entries, and stored the cleaned data in a uniform format. The final dataset includes 41,675 articles from Samakal, 12,255 from The Business Standard, and 690 from Cinegolpo. By combining these sources, the dataset captures a wide range of linguistic patterns and writing styles. This makes it large, diverse, and well-structured, and suitable for preprocessing, model training, and testing summarization systems. Table 1 summarizes the articles collected from the different sources for the dataset.

[...] at 153 words, the 50th at 210 words, the 75th at 307 words, and the 95th at 634 words. For summaries, lengths varied from a single word to a maximum of 63 words (489 characters). The average summary length was 30.41 words, with a median of 29 words and a standard deviation of 11.55 words. Only 25 summaries (0.05 percent) contained fewer than 10 words, showing that most summaries provided enough content. The 25th, 50th, 75th, and 95th percentiles were 21, 29, 38, and 52 words, respectively. On average, articles were 9.61 times longer than th[...]
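The length statistics reported above (mean, median, standard deviation, percentiles, and the article-to-summary length ratio) can be reproduced on any list of article and summary pairs with a short script. The following is a minimal sketch, not the authors' analysis code. It assumes whitespace tokenization, a nearest-rank percentile, a population standard deviation, and a ratio computed as the mean of per-pair ratios; the source does not state which conventions were used, and all helper names are hypothetical.

```python
import math
import statistics

# Per-source article counts as reported in the text.
SOURCE_COUNTS = {
    "Samakal": 41_675,
    "The Business Standard": 12_255,
    "Cinegolpo": 690,
}


def word_count(text):
    """Count whitespace-separated tokens (assumed tokenization)."""
    return len(text.split())


def percentile(sorted_values, p):
    """Nearest-rank percentile of a sorted, non-empty list (assumed method)."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def length_summary(texts):
    """Summarize word-length statistics for a non-empty list of texts."""
    lengths = sorted(word_count(t) for t in texts)
    return {
        "min": lengths[0],
        "max": lengths[-1],
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        # Population standard deviation; the sample version is statistics.stdev.
        "stdev": statistics.pstdev(lengths),
        "percentiles": {p: percentile(lengths, p) for p in (25, 50, 75, 95)},
    }


def mean_length_ratio(articles, summaries):
    """Mean of per-pair article/summary word-count ratios (assumed definition).

    Pairs whose summary is empty are skipped to avoid division by zero.
    """
    ratios = [
        word_count(a) / word_count(s)
        for a, s in zip(articles, summaries)
        if word_count(s) > 0
    ]
    return statistics.mean(ratios)


if __name__ == "__main__":
    # The three per-source counts add up to the reported dataset size.
    print(sum(SOURCE_COUNTS.values()))  # 54620
```

Running `length_summary` separately on the article and summary columns gives the two sets of statistics. Note that the mean of per-pair ratios and the ratio of mean lengths generally differ, so the chosen definition matters when comparing against the reported figure.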

Reference

This content is AI-processed based on open access ArXiv data.
