Efficient Machine Learning for Big Data: A Review
With emerging technologies and their associated devices, it is predicted that a massive amount of data will be created in the next few years; in fact, as much as 90% of current data was created in the last couple of years, a trend that will continue for the foreseeable future. Sustainable computing studies the process by which computer engineers and scientists design computers and associated subsystems efficiently and effectively, with minimal impact on the environment. However, current intelligent machine-learning systems are performance driven: the focus is on predictive/classification accuracy, based on known properties learned from the training samples. For instance, most machine-learning-based nonparametric models are known to require high computational cost in order to find the global optimum. As the learning task scales to a large dataset, the number of hidden nodes within the network increases significantly, which eventually leads to an exponential rise in computational complexity. This paper therefore reviews the theoretical and experimental data-modeling literature in large-scale, data-intensive fields, relating to: (1) model efficiency, including computational requirements in learning and the structure and design of data-intensive areas; and (2) new algorithmic approaches with minimal memory and processing requirements, aimed at reducing computational cost while maintaining or improving predictive/classification accuracy and stability.
💡 Research Summary
The paper “Efficient Machine Learning for Big Data: A Review” provides a comprehensive synthesis of recent research aimed at reducing the computational and memory burdens of machine learning (ML) models in the era of massive data generation. It begins by highlighting the unprecedented data explosion—approximately 90 % of today’s data have been created in the last few years—and argues that conventional, accuracy‑centric ML approaches are increasingly unsustainable from both an energy‑consumption and hardware‑resource perspective. The authors organize the literature into two complementary pillars: (1) algorithmic techniques that make the models themselves more efficient, and (2) system‑level strategies that restructure data handling and training pipelines to lower overall resource demand.
Algorithmic Model Efficiency
The review surveys a wide range of model‑compression methods. Pruning (both weight‑based and structural) can eliminate 70‑90 % of parameters while keeping accuracy loss below 1 % on benchmarks such as CIFAR‑10 and ImageNet. Quantization reduces 32‑bit floating‑point weights to 8‑bit or even 4‑bit integers, cutting memory bandwidth by up to fourfold and enabling 2‑3× speed‑ups on modern accelerators; post‑training calibration and mixed‑precision training are discussed as ways to mitigate accuracy degradation. Knowledge distillation transfers the soft output distribution of a large “teacher” network to a compact “student” model, preserving generalization performance with far fewer layers—a technique now standard in NLP (e.g., DistilBERT) and vision. Kernel‑approximation strategies such as the Nyström method and random feature mappings replace the O(N²) kernel matrix with O(N·m) computations (m ≪ N), making non‑parametric models like SVMs and Gaussian Processes scalable to millions of samples.
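To make the kernel-approximation idea concrete, here is a minimal sketch of random Fourier features (Rahimi and Recht's random feature mapping, one of the methods the paragraph mentions). It approximates an RBF kernel matrix with an explicit m-dimensional feature map, replacing the O(N²) kernel computation with O(N·m) work. All array sizes and the `gamma` value are illustrative choices, not parameters from the paper.

```python
import numpy as np

def random_fourier_features(X, n_features=2000, gamma=0.1, seed=0):
    """Map X (n x d) to an m-dimensional space whose inner products
    approximate the RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Frequencies sampled from the kernel's Fourier transform (a Gaussian).
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

X = np.random.default_rng(1).normal(size=(500, 10))
Z = random_fourier_features(X)

# Z @ Z.T approximates the full N x N kernel matrix at O(N * m) cost.
K_approx = Z @ Z.T
K_exact = np.exp(-0.1 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
```

A linear model trained on `Z` then behaves like a kernel machine trained with the full Gram matrix, which is what makes SVMs and Gaussian processes tractable at this scale.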
Data‑Structure and Pipeline Optimisation
On the data side, the authors emphasize dimensionality‑reduction techniques. Random projection, grounded in the Johnson‑Lindenstrauss lemma, preserves pairwise distances while shrinking feature dimensionality by one to two orders of magnitude, which is especially valuable for sparse text or log data. Streaming and online learning frameworks—mini‑batch SGD, Adam, RMSProp—allow models to be updated incrementally without loading the entire dataset into memory, a necessity for IoT and sensor‑driven applications. The paper also details I/O‑aware pipeline engineering: memory‑mapped files, asynchronous data augmentation, sharding, and caching reduce data‑transfer bottlenecks and lower communication overhead in distributed training.
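The random-projection technique above can be sketched in a few lines. Per the Johnson-Lindenstrauss lemma, projecting n points through a scaled Gaussian matrix to k = O(log n / ε²) dimensions preserves pairwise distances within a (1 ± ε) factor. The dimensions below are illustrative assumptions, not figures from the paper.

```python
import numpy as np

def gaussian_random_projection(X, k, seed=0):
    """Project n points from d dimensions down to k dimensions.
    Scaling by 1/sqrt(k) makes squared distances unbiased estimates."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(X.shape[1], k)) / np.sqrt(k)
    return X @ R

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10_000))           # 100 points in 10,000 dims
Y = gaussian_random_projection(X, k=1_000)   # one order of magnitude smaller

# Pairwise distances survive the projection up to a small distortion.
d_orig = np.linalg.norm(X[0] - X[1])
d_proj = np.linalg.norm(Y[0] - Y[1])
ratio = d_proj / d_orig
```

Because the projection matrix is data independent, it can be generated on the fly and applied to streaming batches, which is why it pairs naturally with the online-learning setting described above.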
Sustainability‑Centric Evaluation
A distinctive contribution is the proposal of a “sustainable computing” evaluation framework that extends beyond FLOPs and memory footprints to include energy consumption (kWh), carbon emissions (gCO₂‑eq), and operational cost (cloud‑instance pricing). Empirical results show that quantized‑and‑pruned models can achieve up to 60 % energy savings and a 45 % reduction in cloud GPU expenses while maintaining comparable predictive performance.
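The three sustainability metrics the framework proposes are straightforward to derive from measured power draw and wall-clock time. The sketch below shows the arithmetic; the power figure, carbon intensity, and hourly price are hypothetical placeholders for illustration, not values reported in the paper.

```python
# Hypothetical measurements for one training run (assumptions, not paper data):
power_kw = 0.300          # average device power draw in kW (e.g., one GPU)
hours = 24.0              # wall-clock training time
carbon_intensity = 400.0  # grid-dependent gCO2-eq emitted per kWh
price_per_hour = 2.50     # cloud-instance price in USD per hour

energy_kwh = power_kw * hours                # energy consumption (kWh)
emissions_g = energy_kwh * carbon_intensity  # carbon emissions (gCO2-eq)
cost_usd = hours * price_per_hour            # operational cost (USD)
```

Reporting these three numbers alongside FLOPs and accuracy is what lets a 60% energy saving or a 45% cost reduction be claimed on a like-for-like basis.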
Future Directions
The review identifies three promising research avenues. First, AutoML for efficiency—joint hyper‑parameter search and compression—could automate the discovery of optimal lightweight architectures. Second, hardware‑software co‑design, especially with ASICs or FPGAs tailored for low‑precision arithmetic, promises further gains. Third, incorporating environmental metrics directly into loss functions would make training objectives explicitly sustainability‑aware.
In summary, the paper convincingly argues that achieving efficiency in big‑data machine learning is not a peripheral concern but a core requirement for the field’s long‑term viability. By systematically cataloguing algorithmic compression, data‑pipeline optimisation, and sustainability‑focused evaluation, the review equips researchers and practitioners with a clear roadmap for building high‑performing, resource‑conscious ML systems.