Title: High-Dimensional Data Processing: Benchmarking Machine Learning and Deep Learning Architectures in Local and Distributed Environments
ArXiv ID: 2512.10312
Date: 2025-12-11
Authors: Julian Rodriguez, Piotr Lopez, Emiliano Lerma, Rafael Medrano, Jacobo Hernandez
📝 Abstract
This document reports the sequence of practices and methodologies implemented during the Big Data course. It details the workflow beginning with the processing of the Epsilon dataset through group and individual strategies, followed by text analysis and classification with RestMex and movie feature analysis with IMDb. Finally, it describes the technical implementation of a distributed computing cluster with Apache Spark on Linux using Scala.
📄 Full Content
High-Dimensional Data Processing: Benchmarking
Machine Learning and Deep Learning Architectures
in Local and Distributed Environments
José Julián Rodríguez Gutiérrez, Piotr Enriquevitch Lopez Chernyshov, Emiliano Lerma García,
Rafael Medrano Vazquez, and Jacobo Hernández Varela
Licenciatura en Ingeniería de Datos e Inteligencia Artificial
División de Ingenierías Campus Irapuato-Salamanca
Guanajuato, México
{jj.rodriguez.gutierrez, pe.lopezchernyshov, e.lermagarcia, r.medranovazquez, j.hernandezvarela}@ugto.mx
Index Terms—Big Data, Spark, Epsilon, RestMex, IMDb,
Scala, Clustering, Classification, Regression
I. INTRODUCTION
The present study is developed as an integrated big data project encompassing three complementary analyses across different domains. The first case study, centered on the Epsilon dataset, served as an introduction to massive data handling through the implementation of a Multi-Layer Perceptron (MLP) with a 2000-128-128-2 architecture, trained on 100,000 instances with 2,000 features. Using PyTorch with GPU acceleration (CUDA), the model achieved 88.98% accuracy after 100 training epochs, with a batch size of 128 and Adam optimization (learning rate = 10⁻⁵, weight decay = 10⁻⁴). The implementation included regularization through Batch Normalization and Dropout (p = 0.8), reducing the validation loss from 0.6847 to 0.2670. This exercise established the methodological foundations for efficient processing of high-dimensionality datasets.
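The Adam update with the stated hyperparameters (learning rate 10⁻⁵, weight decay 10⁻⁴) can be sketched for a single scalar parameter. This is a minimal illustration, not the actual training code; the β and ε defaults are assumptions carried over from torch.optim.Adam, which folds weight decay into the gradient as an L2 penalty.

```python
import math

def adam_step(w, grad, m, v, t, lr=1e-5, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=1e-4):
    """One Adam update for a single scalar parameter (illustrative)."""
    g = grad + weight_decay * w          # L2 weight decay folded into gradient
    m = beta1 * m + (1 - beta1) * g      # first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g  # second-moment estimate
    m_hat = m / (1 - beta1 ** t)         # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# One step from w = 1.0 with gradient 0.5 (toy values):
w, m, v = adam_step(1.0, 0.5, 0.0, 0.0, t=1)
```

After bias correction, the first step moves the parameter by roughly the learning rate, which is why Adam's early updates are well-scaled regardless of gradient magnitude.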
The second analysis, focused on the Rest-Mex dataset, implemented a complete pipeline for polarity classification of Mexican tourist reviews. Text preprocessing techniques were applied, including tokenization, stopword removal, and lemmatization, followed by vectorization with CountVectorizer or TF-IDF. The supervised classification model categorized reviews into three classes (Positive/Negative/Neutral), addressing the inherent imbalance of the tourist dataset through class weighting techniques.
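One common way to derive such class weights is the "balanced" heuristic (as in scikit-learn's class_weight="balanced"), which weights each class inversely to its frequency. A minimal sketch, assuming that heuristic; the label counts below are toy values, not the actual Rest-Mex distribution:

```python
from collections import Counter

def balanced_class_weights(labels):
    """'Balanced' weights: n_samples / (n_classes * count_c) per class,
    mirroring scikit-learn's class_weight="balanced" heuristic."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

# Imbalanced toy sample: 6 Positive, 3 Negative, 1 Neutral
weights = balanced_class_weights(["Pos"] * 6 + ["Neg"] * 3 + ["Neu"])
```

The rarest class (Neutral here) receives the largest weight, so its misclassifications contribute proportionally more to the training loss.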
Finally, the IMDb dataset analysis represents the project's culmination, integrating deep textual analysis through TF-IDF (5,000 features with HashingTF and minDocFreq = 3) over descriptions, titles, and metadata of 85,855 films. An intelligent contextual imputation system was implemented based on director, genre, actor, and writer averages, significantly reducing missing values in critical variables. An XGBoost regressor optimized through grid search (36 hyperparameter combinations) with 3-fold cross-validation achieved an RMSE of 0.6001 and an R² of 0.79 for continuous rating prediction. The optimal hyperparameters found were maxDepth = 12, eta = 0.03, numRound = 500, and minChildWeight = 3.
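The hashed TF-IDF step (5,000 features, minDocFreq = 3) can be illustrated with a stdlib sketch of the hashing trick plus a smoothed IDF. Note that Spark's HashingTF uses MurmurHash3 while this sketch uses Python's built-in hash, so the bucket indices are illustrative only; the formulas mirror Spark's idf = log((m + 1) / (df + 1)) with the minDocFreq cutoff.

```python
import math
from collections import Counter

def hashing_tf(tokens, num_features=5000):
    """Term frequencies via the hashing trick (cf. Spark's HashingTF):
    each token is mapped to one of num_features buckets and counted."""
    return dict(Counter(hash(tok) % num_features for tok in tokens))

def idf(docs_tokens, num_features=5000, min_doc_freq=3):
    """Smoothed IDF per bucket, dropping buckets seen in fewer than
    min_doc_freq documents (cf. Spark IDF's minDocFreq parameter)."""
    n = len(docs_tokens)
    df = Counter()
    for toks in docs_tokens:
        df.update({hash(t) % num_features for t in toks})  # one count per doc
    return {i: math.log((n + 1) / (c + 1))
            for i, c in df.items() if c >= min_doc_freq}

# Toy corpus: "movie" appears in all 4 documents, the rest in one each.
docs = [["movie", "great"], ["movie"], ["movie", "bad"], ["movie", "classic"]]
tf = hashing_tf(["movie", "movie", "great"])
weights = idf(docs)
```

Buckets for the rare tokens fall below minDocFreq and get no IDF weight, which is how the cutoff prunes noisy low-frequency terms before the regressor sees them.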
Sentiment analysis of descriptions revealed a distribution
of 46.83% neutral, 32.65% positive, and 20.52% negative,
information that was incorporated as an additional feature. Exploratory analysis identified Christopher Nolan (8.22), Satyajit
Ray (8.02), and Hayao Miyazaki (8.01) as the directors with
the highest average ratings, while Film-Noir (6.64), Biography
(6.62), and History (6.54) emerged as the best-rated genres.
This methodological progression—from binary classification in Epsilon, through sentiment analysis in Rest-Mex, to continuous rating prediction in IMDb—demonstrates the
versatility of machine learning techniques applied to different
scales and data types, establishing a robust framework for big
data analysis across multiple domains.
II. STATE OF THE ART
A. Stochastic Gradient Descent for Large-Scale SVM (Epsilon)
In the field of large-scale binary classification, Shalev-Shwartz et al. (2011) developed Pegasos, a stochastic sub-gradient descent algorithm for SVMs that proved especially efficient for massive datasets. The algorithm requires Õ(1/(λϵ)) iterations to reach ϵ accuracy, where each iteration operates on a single training example. This efficiency makes it particularly suitable for large-scale text classification problems.
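Pegasos' per-example step can be sketched in a few lines: at iteration t it shrinks w toward zero and adds the hinge-loss sub-gradient only when the sampled example violates the margin, with step size η_t = 1/(λt). This is a toy stdlib implementation of the algorithm's core loop, omitting the optional projection step; all data and parameter values below are illustrative.

```python
import random

def pegasos(data, lam=0.01, iters=1000, seed=0):
    """Pegasos: stochastic sub-gradient SVM training on the primal.

    data: list of (x, y) pairs with x a list of floats and y in {-1, +1}.
    Each iteration touches a single randomly drawn training example.
    """
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    for t in range(1, iters + 1):
        x, y = rng.choice(data)
        eta = 1.0 / (lam * t)                       # step size 1/(lambda * t)
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        w = [(1 - eta * lam) * wi for wi in w]      # regularization shrinkage
        if margin < 1:                              # margin violation:
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

# Linearly separable toy data: label follows the sign of the first coordinate.
toy = [([1.0, 0.2], 1), ([0.8, -0.1], 1),
       ([-1.0, 0.3], -1), ([-0.9, -0.2], -1)]
w = pegasos(toy, lam=0.1, iters=2000)
```

Because each step reads one example, no kernel matrix is ever materialized, which is the property the text highlights for high-dimensional data like Epsilon.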
Pegasos’ approach works directly on the primal objective,
avoiding the need to maintain the complete kernel matrix
in memory, which represents a significant advantage when
working with high-dimensionality datasets like Epsilon. Reported experiments show speedups of up to an order of
magnitude over conventional SVM methods on datasets with
sparse features, achieving accuracies above 85% in binary
classification problems with millions of examples. The work
establishes that the algorithm’s convergence does not directly
depend on th