A Comparative Study of Custom CNNs, Pre-trained Models, and Transfer Learning Across Multiple Visual Datasets

Annoor Sharara Akhand
University of Dhaka
sharara99@gmail.com

arXiv:2601.02246v1 [cs.CV] 5 Jan 2026

Abstract

Convolutional Neural Networks (CNNs) are a standard approach for visual recognition due to their capacity to learn hierarchical representations from raw pixels. In practice, practitioners often choose among (i) training a compact custom CNN from scratch, (ii) using a large pre-trained CNN as a fixed feature extractor, and (iii) performing transfer learning via partial or full fine-tuning of a pre-trained backbone. This report presents a controlled comparison of these three paradigms across five real-world image classification datasets spanning road-surface defect recognition, agricultural variety identification, fruit/leaf disease recognition, pedestrian walkway encroachment recognition, and unauthorized vehicle recognition. Models are evaluated using accuracy and macro F1-score, complemented by efficiency metrics including training time per epoch and parameter counts. The results show that transfer learning consistently yields the strongest predictive performance, while the custom CNN provides an attractive efficiency–accuracy trade-off, especially when compute and memory budgets are constrained.

1 Introduction

Computer vision has undergone a profound transformation over the past decade, largely driven by advances in deep learning and the widespread adoption of Convolutional Neural Networks (CNNs). Unlike traditional vision pipelines that rely on hand-crafted features such as SIFT or HOG, CNNs learn hierarchical feature representations directly from raw pixel data through end-to-end optimization [12, 11]. This ability to jointly learn low-level, mid-level, and high-level visual abstractions has enabled CNNs to achieve state-of-the-art performance across a broad spectrum of tasks, including image classification, object detection, semantic segmentation, and visual scene understanding.
The success of CNNs has been further amplified by the availability of large-scale annotated datasets and increased computational power, particularly through GPU acceleration. Benchmark datasets such as ImageNet have played a pivotal role in demonstrating the scalability of deep CNNs and their capacity to generalize across diverse visual categories [4]. Architectures trained on such datasets learn reusable visual primitives, including edges, textures, object parts, and compositional patterns, which form the basis for knowledge transfer across tasks and domains [28]. As a result, CNNs are now routinely deployed in real-world applications ranging from autonomous driving and infrastructure monitoring to precision agriculture and medical imaging.

Despite these advances, the practical deployment of CNN-based systems presents several design challenges. One of the most consequential decisions concerns how a CNN should be trained for a given task. In practice, three dominant paradigms are commonly employed: training a custom CNN architecture from scratch, using a pre-trained CNN as a fixed feature extractor, and applying transfer learning through fine-tuning of a pre-trained model. Each paradigm embodies distinct assumptions about data availability, computational resources, and the degree of domain similarity between training and deployment environments.

Training a CNN from scratch provides full control over architectural design and allows the model to be explicitly tailored to the characteristics of a target dataset and deployment constraints. Compact custom CNNs are often preferred in scenarios where memory footprint, inference latency, or energy consumption are critical considerations, such as embedded or edge computing environments.
Moreover, recent studies have shown that carefully designed mid-sized CNNs can achieve competitive performance when combined with modern regularization techniques such as batch normalization and global average pooling [10, 13]. However, training from scratch typically requires substantial labeled data and careful optimization to avoid overfitting, particularly when the task exhibits high intra-class variability.

An alternative strategy is to leverage large pre-trained CNNs, such as VGG-16 or ResNet, as fixed feature extractors. In this setting, the convolutional layers are frozen and only a task-specific classifier head is trained on the target dataset [22]. This approach significantly reduces training time and mitigates overfitting when labeled data is limited. Nevertheless, freezing the feature extractor can restrict the model's ability to adapt to domain-specific visual patterns, especially when the target domain differs substantially from the source domain in terms of texture, color distribution, or scene composition [18].

Transfer learning via fine-tuning has emerged as a powerful compromise between training from scratch and using frozen pre-trained features. By initializing a network with pre-trained weights and selectively fine-tuning higher-level layers, the model can retain general-purpose visual knowledge while adapting its representations to the target task [28]. Empirical evidence consistently shows that fine-tuning improves convergence speed and predictive accuracy, particularly for small and medium-sized datasets [26]. However, fine-tuning increases computational cost and introduces additional hyperparameter sensitivity, requiring careful selection of learning rates, regularization strategies, and the depth of unfrozen layers.

Importantly, predictive performance alone does not fully capture the practical utility of a CNN-based system.
Computational efficiency, training time, and model size play a crucial role in determining whether a model can be realistically deployed in production settings. Large pre-trained architectures often contain tens or hundreds of millions of parameters, resulting in high memory consumption and slower inference [8, 27]. In contrast, custom CNNs with significantly fewer parameters may offer favorable trade-offs between accuracy and efficiency, particularly in applications that require frequent retraining or real-time processing.

Motivated by these considerations, this work presents a systematic comparative study of three CNN-based learning paradigms: a custom CNN trained entirely from scratch, a pre-trained CNN used as a fixed feature extractor, and a transfer learning model with fine-tuning. The comparison is conducted across multiple real-world image classification datasets that span diverse application domains, including road surface damage recognition [16], agricultural crop and fruit analysis [17, 5], and urban scene understanding. These datasets collectively capture challenges such as class imbalance, background clutter, illumination variation, and high inter-class visual similarity.

All models are evaluated under identical experimental conditions, including consistent data splits, preprocessing pipelines, training hyperparameters, and evaluation metrics. Performance is assessed using both accuracy and macro F1-score to account for class imbalance, while computational efficiency is evaluated through training time per epoch and model complexity.

By analyzing performance trends across datasets and learning paradigms, this study aims to provide practical, evidence-based guidance for selecting CNN training strategies under varying data and resource constraints. Rather than advocating a single universally optimal approach, the objective of this work is to illuminate the trade-offs inherent in different CNN paradigms.
The results of this study seek to answer a central practical question faced by researchers and practitioners alike: when does the additional computational cost of transfer learning yield sufficient performance gains to justify its use, and under what conditions can a carefully designed custom CNN provide a more efficient and competitive alternative?

2 Related Work

The development of Convolutional Neural Networks (CNNs) has fundamentally reshaped the field of computer vision by enabling end-to-end learning of hierarchical visual representations directly from raw pixel data. Early CNN-based systems demonstrated the feasibility of learning spatially localized features using convolution and pooling operations, achieving strong performance on document recognition and digit classification tasks [12]. However, the widespread adoption of CNNs for large-scale visual recognition only became practical with the availability of large annotated datasets and increased computational power, culminating in the success of deep architectures on the ImageNet benchmark [11, 4].

2.1 Deep CNN Architectures

Following the breakthrough performance of AlexNet, subsequent research focused on improving depth, representational capacity, and optimization stability of CNN architectures. The VGG family of networks demonstrated that increasing depth through the systematic use of small 3 × 3 convolutional filters could significantly improve recognition accuracy, albeit at the cost of increased parameter count and computational complexity [22]. While VGG-style networks remain influential due to their architectural simplicity and transferability, their large memory footprint motivates the exploration of more efficient alternatives.

Residual learning, introduced through ResNet architectures, addressed the degradation problem associated with very deep networks by enabling identity-based skip connections [6].
This innovation allowed CNNs to scale to hundreds of layers while maintaining stable optimization behavior. DenseNet architectures further extended this idea by encouraging feature reuse through dense connectivity patterns, improving parameter efficiency and gradient flow [9]. These architectural advances underscore the importance of structural inductive biases in deep CNN design.

In parallel, a line of research has focused on computational efficiency and deployment feasibility. Architectures such as MobileNet, MobileNetV2, and EfficientNet employ depthwise separable convolutions, inverted residuals, and compound scaling to reduce parameter counts and floating-point operations while preserving accuracy [8, 20, 27]. These models are particularly relevant for edge computing scenarios, where resource constraints limit the applicability of large-scale CNNs.

2.2 Regularization and Architectural Design Choices

Effective training of deep CNNs relies heavily on regularization and architectural design choices. Batch normalization has become a standard component of modern CNNs, mitigating internal covariate shift and enabling higher learning rates and faster convergence [10]. Dropout has been widely adopted as a stochastic regularization technique to prevent co-adaptation of neurons and reduce overfitting, particularly in fully connected layers [23].

Global average pooling (GAP) has emerged as an effective alternative to large fully connected layers, significantly reducing parameter counts while improving generalization by enforcing spatial correspondence between feature maps and class predictions [13]. These design principles are particularly relevant for custom CNNs trained from scratch, where parameter efficiency and training stability are critical considerations.
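As a back-of-envelope illustration of the parameter savings GAP provides, the sketch below compares a flatten-plus-fully-connected head against a GAP head for an assumed 512-channel, 7 × 7 final feature map and 10 classes (all sizes here are illustrative, not taken from any specific model in this report):

```python
# Head parameter counts for an assumed 512-channel, 7x7 final feature map
# and 10 output classes (illustrative sizes, not a specific model here).

def fc_head_params(channels, h, w, num_classes):
    # Flatten to channels*h*w features, then one linear layer (weights + biases).
    return channels * h * w * num_classes + num_classes

def gap_head_params(channels, num_classes):
    # Global average pooling is parameter-free; only the linear layer remains.
    return channels * num_classes + num_classes

print(fc_head_params(512, 7, 7, 10))  # 250890
print(gap_head_params(512, 10))       # 5130
```

Averaging each feature map to a single value removes the spatial dimensions before classification, which is what shrinks the head by roughly 49× in this configuration.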
2.3 Transfer Learning in Visual Recognition

Transfer learning has become a dominant paradigm in computer vision, especially in settings where labeled data is limited or training from scratch is computationally expensive. The central premise of transfer learning is that CNNs trained on large-scale datasets learn general-purpose visual features that can be reused across tasks [18]. Empirical studies have shown that lower layers of CNNs tend to capture generic features such as edges and textures, while higher layers become increasingly task-specific [28].

Two primary transfer learning strategies are commonly employed: using a pre-trained CNN as a fixed feature extractor, and fine-tuning some or all of the network layers on the target dataset. Using frozen features offers computational efficiency and reduces overfitting risk but may limit adaptability to new domains. Fine-tuning allows the model to adjust its representations to domain-specific characteristics and typically yields superior performance, particularly when moderate amounts of labeled data are available [26]. The effectiveness of fine-tuning has been demonstrated across numerous domains, including medical imaging, remote sensing, and agriculture, where domain shift relative to ImageNet is often substantial. However, fine-tuning introduces additional hyperparameter sensitivity, requiring careful control of learning rates and regularization to prevent catastrophic forgetting or overfitting [18].

2.4 CNNs in Road and Infrastructure Monitoring

CNN-based approaches have been widely explored for road surface condition assessment and infrastructure monitoring. Road damage detection and classification using smartphone imagery has been shown to be a cost-effective and scalable alternative to traditional inspection methods [16]. Such datasets often exhibit significant variability in lighting conditions, camera viewpoints, and background clutter, posing challenges for robust visual recognition.
Transfer learning has proven particularly effective in this domain, as pre-trained CNNs provide strong initialization for handling diverse visual patterns.

2.5 CNNs in Agriculture and Plant Phenotyping

Agricultural applications represent another important area where CNNs have demonstrated substantial impact. Image-based plant disease detection and crop variety classification benefit from CNNs' ability to capture fine-grained texture and color variations [17]. Subsequent studies have shown that deep CNNs can outperform traditional machine learning approaches across multiple crop species and disease categories, even under real-world imaging conditions [5]. However, agricultural datasets often suffer from class imbalance, limited sample sizes, and environmental variability, making the choice of training paradigm (scratch training versus transfer learning) particularly consequential.

2.6 Efficiency–Accuracy Trade-offs

While transfer learning and large pre-trained models often achieve superior accuracy, their computational cost and memory requirements can limit practical deployment. Several studies emphasize the importance of evaluating CNN models not only in terms of predictive performance but also with respect to efficiency metrics such as training time, inference latency, and parameter count [8, 27]. Custom CNNs, when carefully designed, can offer competitive performance at a fraction of the computational cost, making them attractive for real-world systems with strict resource constraints.

2.7 Positioning of This Work

Building upon these prior studies, the present work conducts a systematic and controlled comparison of three CNN paradigms (custom CNNs trained from scratch, frozen pre-trained CNNs, and fine-tuned transfer learning models) across multiple real-world datasets.
Unlike studies that focus on a single domain or architecture, this work emphasizes cross-dataset consistency, controlled experimental conditions, and joint evaluation of performance and efficiency. By situating the analysis across diverse application domains, this study aims to provide practical insights into when transfer learning is essential and when a well-designed custom CNN can serve as a viable and efficient alternative.

3 Datasets

This study evaluates five datasets: Auto-RickshawImageBD, FootpathVision, RoadDamageBD, MangoImageBD, and PaddyVarietyBD, spanning both structured (agriculture varieties) and unconstrained street-scene imagery. For each dataset, we use fixed train/validation/test splits and apply identical preprocessing and augmentation across all models [24, 15, 1, 7, 25].

3.1 Dataset Overview

Table 1 summarizes key dataset characteristics used in our experiments.

Table 1: Dataset statistics (update counts if needed).

Dataset               #Classes (C)  Train  Val   Test
Auto-RickshawImageBD  2             932    266   133
FootpathVision        2             867    248   124
RoadDamageBD          2             315    90    45
MangoImageBD          15            3992   1141  570
PaddyVarietyBD        35            9800   2800  1400

3.2 Preprocessing

All images are resized to a fixed resolution (e.g., 224 × 224) to match the input size expected by the custom CNN and the VGG backbone. Pixel values are normalized to match the pre-training statistics for ImageNet when using VGG [22]. For the custom CNN trained from scratch, the same normalization is applied to keep preprocessing consistent.

3.3 Data Augmentation

To improve generalization under real-world variance, we use standard augmentation techniques [21]:

• random horizontal flipping (when label semantics permit),
• random resized crops / spatial jitter,
• mild color jitter (brightness/contrast/saturation),
• optional random rotation within a small range.

Augmentation is applied only to training images, not validation or test sets.

3.4 Class Imbalance and Sampling

Real datasets often exhibit class imbalance.
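One common mitigation is class-weighted cross-entropy, which up-weights mistakes on rare classes. The stdlib-only sketch below uses inverse-frequency weighting on a toy 90/10 label distribution; both the distribution and the exact weighting formula n/(C · count) are illustrative assumptions, not values from this study's code:

```python
import math
from collections import Counter

# Inverse-frequency class weights for weighted cross-entropy. The toy
# 90/10 label distribution and the n/(C*count) formula are illustrative
# assumptions, not values from this study's datasets.
labels = [0] * 90 + [1] * 10
counts = Counter(labels)
n, num_classes = len(labels), len(counts)
weights = {c: n / (num_classes * counts[c]) for c in counts}
# weights: class 0 -> ~0.56 (down-weighted), class 1 -> 5.0 (up-weighted)

def weighted_ce(probs, y):
    # Weighted cross-entropy for a single sample: -w_y * log p(y).
    return -weights[y] * math.log(probs[y])

# The same mistake (p(true class) = 0.1) costs ~9x more on the minority class.
print(weighted_ce([0.9, 0.1], 1) / weighted_ce([0.1, 0.9], 0))
```

In PyTorch, the same idea is expressed by passing a per-class `weight` tensor to `torch.nn.CrossEntropyLoss`.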
In addition to reporting macro F1-score (which is more sensitive to minority classes), we optionally use class-weighted cross-entropy or balanced sampling for stability. All reported results in Table 3 are based on the same training protocol for all models to ensure fairness.

4 Methodology

4.1 Problem Formulation

Each dataset is treated as a multi-class classification problem. Let (x_i, y_i) denote an image and its label, where y_i ∈ {1, . . . , C}. A model f_θ outputs logits z ∈ R^C, and class probabilities are given by the softmax:

    p(y = c \mid x) = \frac{\exp(z_c)}{\sum_{k=1}^{C} \exp(z_k)}.    (1)

The primary training objective is the cross-entropy loss:

    \mathcal{L}_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \log p(y_i \mid x_i).    (2)

4.2 Models Evaluated

We evaluate three CNN paradigms representing distinct trade-offs between complexity, cost, and accuracy.

4.2.1 Custom CNN (Trained from Scratch)

The custom CNN is a compact VGG-inspired architecture designed for efficiency while preserving hierarchical feature learning. It consists of four convolutional stages with progressive channel expansion:

• Stage 1: two 3 × 3 convolutional layers with 64 channels,
• Stage 2: two 3 × 3 convolutional layers with 128 channels,
• Stage 3: three 3 × 3 convolutional layers with 256 channels,
• Stage 4: three 3 × 3 convolutional layers with 512 channels.

Each convolution is followed by batch normalization [10] and ReLU activation. A 2 × 2 max pooling layer reduces spatial resolution after each stage. Following the feature extractor, global average pooling (GAP) [13] aggregates spatial features, and a small classifier head produces class logits:

• fully connected layer 512 → 256 with ReLU,
• dropout (p = 0.5) [23],
• final linear layer 256 → C.

This model is substantially smaller than VGG-16 while retaining strong representational capacity.

Layer-wise summary. Table 2 provides a compact layer summary (illustrative; adapt if your implementation differs).

Table 2: Custom CNN architecture summary (illustrative).
Block    Layers                                    Output Channels
Stage 1  Conv–BN–ReLU ×2, MaxPool                  64
Stage 2  Conv–BN–ReLU ×2, MaxPool                  128
Stage 3  Conv–BN–ReLU ×3, MaxPool                  256
Stage 4  Conv–BN–ReLU ×3, MaxPool                  512
Head     GAP, FC(512 → 256), Dropout, FC(256 → C)  —

4.2.2 Pre-trained CNN (VGG-16 as Frozen Feature Extractor)

A standard VGG-16 model pre-trained on ImageNet [22, 4] is used as a fixed feature extractor. All convolutional parameters are frozen, and only a lightweight classifier head is trained on the target dataset. This approach is computationally efficient and reduces overfitting risk, but may under-adapt when the target domain differs significantly from ImageNet.

4.2.3 Transfer Learning CNN (VGG-16 Fine-tuning)

The transfer learning model uses the same VGG-16 initialization but allows fine-tuning of the final convolutional block and classifier head. Fine-tuning improves domain adaptation [28] and often yields better accuracy than freezing features, at the cost of additional compute.

4.3 Training Setup

All models are trained under identical experimental conditions to ensure fairness.

Optimization. We train for 20 epochs using stochastic gradient descent (SGD) with momentum 0.9 [2] and base learning rate 0.01. A cosine decay learning rate schedule is used [14]. Cross-entropy loss is used for all tasks.

Regularization. We use batch normalization [10], dropout [23], and data augmentation [21]. Early stopping is not used in the primary comparison; instead, all methods run for the same number of epochs.

Implementation details. Experiments can be implemented in PyTorch [19] or Keras [3]. To support reproducibility, random seeds should be fixed, and preprocessing should be identical across runs.

5 Evaluation Metrics

5.1 Accuracy

Accuracy measures the fraction of correct predictions:

    \text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\hat{y}_i = y_i].    (3)

5.2 Macro F1-score

Macro F1 averages F1 across classes, treating each class equally and highlighting performance on minority classes.
For class c, define precision and recall:

    \text{Precision}_c = \frac{TP_c}{TP_c + FP_c}, \quad \text{Recall}_c = \frac{TP_c}{TP_c + FN_c}.    (4)

Then

    F1_c = \frac{2 \cdot \text{Precision}_c \cdot \text{Recall}_c}{\text{Precision}_c + \text{Recall}_c}, \quad \text{Macro F1} = \frac{1}{C} \sum_{c=1}^{C} F1_c.    (5)

5.3 Efficiency Metrics

We report training time per epoch (seconds) and parameter counts. Parameter counts help estimate memory footprint and deployment feasibility; for example, VGG-16 has ∼138M parameters [22], which can be prohibitive on edge devices.

6 Results

6.1 Classification Performance

Table 3 summarizes performance across datasets. Transfer learning achieves the best accuracy and macro F1 on all datasets in this comparison. Figure 1 illustrates accuracy and macro F1 across datasets.

Figure 1: Test accuracy and macro F1-score comparison across datasets.

Table 3: Performance comparison across datasets.

Dataset               Model Type         Accuracy (%)  Macro F1
Auto-RickshawImageBD  Custom CNN         89.8          0.90
                      Pre-trained CNN    89.1          0.89
                      Transfer Learning  92.1          0.93
FootpathVision        Custom CNN         91.9          0.91
                      Pre-trained CNN    92.6          0.92
                      Transfer Learning  94.6          0.95
RoadDamageBD          Custom CNN         94.2          0.93
                      Pre-trained CNN    95.0          0.94
                      Transfer Learning  96.8          0.96
MangoImageBD          Custom CNN         90.9          0.91
                      Pre-trained CNN    90.2          0.90
                      Transfer Learning  93.4          0.94
PaddyVarietyBD        Custom CNN         91.8          0.92
                      Pre-trained CNN    93.1          0.93
                      Transfer Learning  95.0          0.95

6.2 Training Time

Table 4 compares training time per epoch.

Table 4: Training time (seconds) per epoch for different CNN approaches across datasets.

Dataset               VGG-16 (Frozen)  Custom CNN  Transfer Learning
Auto-RickshawImageBD  15.9             10.9        22.6
FootpathVision        19.1             13.1        27.2
RoadDamageBD          18.5             12.6        26.4
MangoImageBD          16.8             11.5        24.1
PaddyVarietyBD        17.2             11.8        24.8

Figure 2 provides a graphical view of the time comparison.

6.2.1 Model Size and Memory Footprint

Model size affects deployability. VGG-16 is large (around 138M parameters), whereas efficient architectures can be significantly smaller [22, 8, 27]. Table 5 provides a template for reporting parameter counts and checkpoint sizes.
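The ~138M figure can be cross-checked from VGG-16's published configuration (13 convolutional layers with 3 × 3 kernels, followed by three fully connected layers). In practice one would sum `p.numel()` over `model.parameters()` in PyTorch; the arithmetic sketch below reproduces the count directly:

```python
# VGG-16 parameter count from its published layer configuration:
# 13 conv layers (3x3 kernels) followed by FC-4096, FC-4096, FC-1000.
conv_cfg = [(3, 64), (64, 64), (64, 128), (128, 128),
            (128, 256), (256, 256), (256, 256),
            (256, 512), (512, 512), (512, 512),
            (512, 512), (512, 512), (512, 512)]

# Each conv layer: in_ch * out_ch * 3 * 3 weights, plus out_ch biases.
conv_params = sum(cin * cout * 3 * 3 + cout for cin, cout in conv_cfg)

# The FC layers take the flattened 512x7x7 feature map as input.
fc_params = ((512 * 7 * 7) * 4096 + 4096) + (4096 * 4096 + 4096) + (4096 * 1000 + 1000)

print(conv_params + fc_params)  # 138357544, i.e. ~138M
```

Note that over 100M of these parameters sit in the first fully connected layer alone, which is precisely the layer that GAP-style heads eliminate.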
Figure 2: Training time comparison across datasets and model types.

Table 5: Model size comparison (fill with exact counts from your code).

Model                Parameters (M)  Checkpoint Size (MB)
Custom CNN           9–10 (approx.)  36
VGG-16 (frozen)      138 (approx.)   276
VGG-16 (fine-tuned)  138 (approx.)   552

7 Discussion

Across all datasets, transfer learning achieves the highest accuracy and macro F1-score, consistent with prior evidence that pre-trained representations combined with domain adaptation improve downstream performance [28, 18]. Gains are particularly pronounced in datasets with higher intra-class variability and domain shift relative to ImageNet.

The custom CNN demonstrates that a mid-sized architecture with batch normalization and GAP can remain competitive while reducing training cost and parameter count. This is practically useful when training must be performed frequently (e.g., iterative dataset updates), when GPUs are limited, or when models must run on resource-constrained devices.

Pre-trained VGG-16 used as a frozen feature extractor provides a reasonable baseline that often improves over scratch training, but it is consistently outperformed by fine-tuning. This suggests that representational reuse alone may be insufficient when texture, color statistics, and imaging conditions differ from the source domain.

7.1 When to Prefer Each Paradigm

• Custom CNN: preferred when compute and memory are limited, when inference must be fast, or when deployment targets edge devices.
• Frozen pre-trained CNN: preferred when labeled data is limited and training time must be minimal, while still benefiting from ImageNet features.
• Transfer learning: preferred when maximizing accuracy is the priority and additional compute is acceptable.

8 Limitations

This study focuses on a single pre-trained backbone (VGG-16) and a single custom CNN design. Additional backbones (ResNet, EfficientNet) could shift the efficiency–accuracy frontier [6, 27].
Further, hyperparameters are kept consistent across datasets for fairness; dataset-specific tuning may yield higher absolute performance for all methods.

9 Ethical and Practical Considerations

Vision systems deployed in real settings should be evaluated under realistic distribution shifts (lighting, camera devices, backgrounds). For street-scene datasets, privacy concerns may arise if faces or license plates are visible; appropriate anonymization and responsible data handling are recommended.

10 Conclusion

This report presented a systematic comparison of a custom CNN, a pre-trained CNN used as a fixed feature extractor, and a transfer learning CNN with fine-tuning across five visual classification datasets. Transfer learning consistently delivered the strongest performance, benefiting from rich pre-trained representations and domain-specific adaptation. However, the custom CNN achieved competitive results at substantially lower computational cost, highlighting its practical value for efficiency-constrained deployments. Overall, the findings support a resource-aware model selection strategy: use transfer learning when accuracy is paramount, but prefer compact custom CNNs when compute, memory, and training time are critical constraints.

References

[1] Sarder Iftekhar Ahmed, Muhammad Ibrahim, Md. Nadim, Md. Mizanur Rahman, Maria Mehjabin Shejunti, Taskeed Jabid, and Md. Sawkat Ali. MangoLeafBD: A comprehensive image dataset to classify diseased and healthy mango leaves. Data in Brief, 47:108941, April 2023. [Online; accessed 2026-01-04].

[2] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT 2010, pages 177–186. Springer, 2010.

[3] François Chollet et al. Keras. https://keras.io, 2015.

[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.

[5] Konstantinos P. Ferentinos. Deep learning models for plant disease detection and diagnosis. Computers and Electronics in Agriculture, 145:311–318, 2018.

[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

[7] Rasel Hossen, Diptajoy Mistry, Mushiur Rahman, Waki As Sami Atikur Rahman Hridoy, Sajib Saha, and Muhammad Ibrahim. Road damage and manhole detection using deep learning for smart cities: A polygonal annotation approach. https://arxiv.org/abs/2510.03797, October 2025.

[8] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

[9] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4700–4708, 2017.

[10] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), pages 448–456, 2015.

[11] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2012.

[12] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[13] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. In International Conference on Learning Representations (ICLR), 2014.

[14] Ilya Loshchilov and Frank Hutter.
SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR), 2017.

[15] Afia Lubaina, Md. Ehsan Uddoula Pahlowan, Tahshina Islam Munni, and Muhammad Ibrahim. FootpathVision: A comprehensive image dataset and deep learning baselines for footpath encroachment detection. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5577201.

[16] Hiroya Maeda, Yoshihide Sekimoto, Tadashi Seto, Takehiro Kashiyama, and Hiroshi Omata. Road damage detection and classification using deep neural networks with smartphone images. In Proceedings of the IEEE International Conference on Big Data (Big Data), pages 3845–3850, 2018.

[17] Sharada P. Mohanty, David P. Hughes, and Marcel Salathé. Using deep learning for image-based plant disease detection. Frontiers in Plant Science, 7:1419, 2016.

[18] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.

[19] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

[20] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4510–4520, 2018.

[21] Connor Shorten and Taghi M. Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6(60), 2019.

[22] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[23] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[24] Sudipto Das Sukanto, Diponker Roy, Fahim Shakil, Nirjhar Singha, Abdullah Asik, Aniket Joarder, Mridha Md Nafis Fuad, and Muhammad Ibrahim. Detecting unauthorized vehicles using deep learning for smart cities: A case study on Bangladesh. https://arxiv.org/abs/2510.26154, October 2025.

[25] Md Tahsin, Muhammad Ibrahim, Anika Tabassum Nafisa, Maksura Binte Rabbani Nuha, Mehrab Islam Arnab, Md. Hasanul Ferdaus, Mohammad Manzurul Islam, Mohammad Rifat Ahmmad Rashid, Taskeed Jabid, Md. Sawkat Ali, and Nishat Tasnim Niloy. PaddyVarietyBD: Classifying paddy variations of Bangladesh with a novel image dataset. Data in Brief, 60:111514, June 2025. [Online; accessed 2026-01-04].

[26] Nima Tajbakhsh, Jae Y. Shin, Suryakanth R. Gurudu, Ryan T. Hurst, Christopher B. Kendall, Michael B. Gotway, and Jianming Liang. Convolutional neural networks for medical image analysis: Full training or fine tuning? IEEE Transactions on Medical Imaging, 35(5):1299–1312, 2016.

[27] Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning (ICML), pages 6105–6114, 2019.

[28] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems (NeurIPS), 2014.