Classification of Scientific Papers With Big Data Technologies
Data volumes that cannot be processed by conventional data storage and analysis systems are referred to as Big Data. The term also covers the new technologies developed to store, process, and analyze such large amounts of data. Automatic retrieval of information about the contents of large document collections produced by different sources, identification of research fields and topics, extraction of document abstracts, and discovery of patterns are among the problems studied in the field of big data. In this study, the Naive Bayes classification algorithm is run on a data set of scientific articles to automatically determine the classes to which these documents belong. We have developed an efficient system that can analyze Turkish scientific documents with a distributed document classification algorithm running on a Cloud Computing infrastructure. The Apache Mahout library is used in the study. The servers required for classifying and clustering distributed documents are
💡 Research Summary
The paper presents a complete end‑to‑end system for automatically classifying Turkish scientific articles using big‑data technologies and cloud computing. The authors begin by describing the challenges posed by the rapid growth of scholarly documents, which exceed the capacity of traditional storage and single‑machine analysis tools. To address this, they deploy a Hadoop cluster on Google Cloud Platform (initially three nodes, later extended to four, five, and six nodes) and leverage Apache Mahout’s implementation of the Naive Bayes algorithm.
Data acquisition is performed by crawling PDFs from the TÜBİTAK DergiPark repository, resulting in roughly 48 000 documents. Each PDF is processed with Apache Tika to extract raw text, and then passed through Zemberek, a Turkish natural‑language processing library, for tokenization, stemming, and stop‑word removal. The cleaned texts are temporarily stored in MongoDB before being uploaded to the Hadoop Distributed File System (HDFS), which provides replication and fault tolerance.
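The cleaning stage can be illustrated with a simplified sketch. Zemberek itself is a Java library, so the snippet below only mimics its tokenization and stop-word filtering using the Python standard library; the `preprocess` function and the tiny stop-word list are illustrative assumptions, not the authors' code (Zemberek additionally performs morphological stemming, which is omitted here).

```python
import re

# Illustrative subset of Turkish stop words (Zemberek ships a full list).
STOP_WORDS = {"ve", "ile", "bir", "bu", "da", "de", "için"}

def preprocess(text: str) -> list[str]:
    """Lowercase, tokenize, and drop stop words, roughly mirroring
    the Tika -> Zemberek cleaning step described in the summary."""
    tokens = re.findall(r"\w+", text.lower(), flags=re.UNICODE)
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("Bu makale ve veri için bir sınıflandırma"))
# → ['makale', 'veri', 'sınıflandırma']
```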
The classification pipeline follows Mahout’s standard workflow:
1. `mahout seqdirectory` converts the text directory into sequence files.
2. `mahout seq2sparse` creates TF-IDF weighted sparse vectors.
3. `mahout split` randomly partitions the dataset into 60 % training and 40 % testing subsets.
4. `mahout trainnb` trains a Naive Bayes model on the training vectors, producing a model file and a label index.
5. `mahout testnb` evaluates the model on the test set, generating a confusion matrix and a suite of performance metrics.
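To make the vectorization step concrete, here is a minimal TF-IDF computation in plain Python. Mahout's `seq2sparse` actually runs this as MapReduce jobs over sequence files and supports several weighting variants; the sketch below uses the basic tf × log(N/df) formula, and the toy corpus is invented for illustration.

```python
import math
from collections import Counter

def tfidf(corpus: list[list[str]]) -> list[dict[str, float]]:
    """Per-document TF-IDF weights: tf * log(N / df)."""
    n = len(corpus)
    df = Counter()                      # document frequency per term
    for doc in corpus:
        df.update(set(doc))
    vectors = []
    for doc in corpus:
        tf = Counter(doc)               # raw term frequency
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

docs = [["veri", "madencilik"], ["veri", "analiz"], ["hukuk", "analiz"]]
vecs = tfidf(docs)
# "veri" appears in 2 of 3 documents, so its weight is log(3/2) ≈ 0.405
```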
On the three‑node cluster, the authors train on 28 599 documents and test on 19 066, achieving an overall accuracy of 89.77 %, a Kappa statistic of 0.8611, weighted precision of 0.9222, weighted recall of 0.8977, and a weighted F1‑score of 0.9064. They also report per‑class statistics, showing that categories such as engineering, law, life sciences, medicine, and social sciences are distinguished with varying degrees of success.
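The reported metrics all follow directly from the confusion matrix. The 2×2 matrix below is a made-up example (the paper's matrix covers five classes and is not reproduced here); the snippet only shows how accuracy, the Kappa statistic, and weighted precision/recall are derived from such a matrix.

```python
# Hypothetical confusion matrix: rows = true class, columns = predicted.
cm = [[40, 10],
      [5, 45]]

n = sum(sum(row) for row in cm)                                # 100
accuracy = sum(cm[i][i] for i in range(2)) / n                 # (40+45)/100 = 0.85

# Cohen's kappa: (p_o - p_e) / (1 - p_e), with p_e from the marginals.
row_tot = [sum(r) for r in cm]                                 # true-class supports
col_tot = [sum(cm[i][j] for i in range(2)) for j in range(2)]  # predicted totals
p_e = sum(row_tot[k] * col_tot[k] for k in range(2)) / n**2
kappa = (accuracy - p_e) / (1 - p_e)

# Precision/recall per class, then weighted by true-class support.
prec = [cm[k][k] / col_tot[k] for k in range(2)]
rec = [cm[k][k] / row_tot[k] for k in range(2)]
w_prec = sum(prec[k] * row_tot[k] for k in range(2)) / n
w_rec = sum(rec[k] * row_tot[k] for k in range(2)) / n         # equals accuracy
```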
Scalability experiments demonstrate that increasing the number of nodes reduces the total processing time (from 114 577 ms with four nodes to 90 721 ms with six nodes) while slightly improving accuracy (up to 91.48 %). To explore alternative execution engines, the authors repeat the workflow using Mahout’s Spark integration (spark‑trainnb and spark‑testnb). The Spark‑based run yields a comparable accuracy of 91.48 % and confirms that in‑memory processing can alleviate I/O bottlenecks inherent to the MapReduce model.
The discussion highlights several key insights:
- Naive Bayes, despite its strong independence assumption, remains highly effective for high‑dimensional, sparse text data when combined with robust preprocessing.
- Language‑specific preprocessing (Zemberek) is crucial for Turkish, where morphological richness can otherwise inflate feature space and degrade performance.
- The system’s modular design allows straightforward substitution of other classifiers (e.g., SVM, logistic regression) or feature representations (e.g., LDA, Word2Vec) for future work.
- Class imbalance (some categories contain far fewer documents) leads to lower recall for minority classes, suggesting the need for resampling techniques or cost‑sensitive learning.
- Operational considerations such as cloud cost management, data security, and real‑time streaming classification are identified as areas for further research.
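The first insight above can be demonstrated with a minimal multinomial Naive Bayes classifier over bag-of-words counts. This is a from-scratch illustration with Laplace smoothing on a made-up two-class toy corpus, not the Mahout implementation (which also offers a complementary Naive Bayes variant).

```python
import math
from collections import Counter, defaultdict

def train_nb(docs: list[tuple[str, list[str]]]):
    """Collect class priors and per-class token counts."""
    class_docs = Counter(label for label, _ in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab, len(docs)

def predict(model, tokens):
    """Pick the class maximizing log P(c) + sum log P(t|c), Laplace-smoothed."""
    class_docs, word_counts, vocab, n = model
    best, best_lp = None, -math.inf
    for label, ndocs in class_docs.items():
        total = sum(word_counts[label].values())
        lp = math.log(ndocs / n)
        for t in tokens:
            lp += math.log((word_counts[label][t] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

corpus = [
    ("engineering", ["devre", "sinyal", "devre"]),
    ("law",         ["mahkeme", "kanun", "dava"]),
]
model = train_nb(corpus)
print(predict(model, ["sinyal", "devre"]))  # → engineering
```

Even with tiny, sparse counts the smoothed model separates the classes, which is the property that makes Naive Bayes a strong baseline for high-dimensional text.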
In conclusion, the study demonstrates that a combination of open‑source big‑data frameworks (Hadoop, Mahout, Spark) and cloud resources can efficiently process and accurately classify large collections of Turkish scientific literature. The achieved accuracy above 90 % validates the feasibility of deploying such pipelines in production environments for tasks like automated indexing, recommendation, and trend analysis. Future directions include integrating deep‑learning models, addressing label imbalance, and extending the system to support continuous ingestion and real‑time inference.