A Comparative Study of Custom CNNs, Pre-trained Models, and Transfer Learning Across Multiple Visual Datasets
📝 Original Paper Info
- Title: A Comparative Study of Custom CNNs, Pre-trained Models, and Transfer Learning Across Multiple Visual Datasets
- ArXiv ID: 2601.02246
- Date: 2026-01-05
- Authors: Annoor Sharara Akhand
📝 Abstract
Convolutional Neural Networks (CNNs) are a standard approach for visual recognition due to their capacity to learn hierarchical representations from raw pixels. In practice, practitioners often choose among (i) training a compact custom CNN from scratch, (ii) using a large pre-trained CNN as a fixed feature extractor, and (iii) performing transfer learning via partial or full fine-tuning of a pre-trained backbone. This report presents a controlled comparison of these three paradigms across five real-world image classification datasets spanning road-surface defect recognition, agricultural variety identification, fruit/leaf disease recognition, pedestrian walkway encroachment recognition, and unauthorized vehicle recognition. Models are evaluated using accuracy and macro F1-score, complemented by efficiency metrics including training time per epoch and parameter counts. The results show that transfer learning consistently yields the strongest predictive performance, while the custom CNN provides an attractive efficiency–accuracy trade-off, especially when compute and memory budgets are constrained.
💡 Summary & Analysis
1. **Systematic Comparison of Three Learning Paradigms**: This study systematically compares three CNN-based learning paradigms to provide a deeper understanding of their practical usability in different settings.
2. **Generalization Through Diverse Datasets**: The use of varied real-world datasets, from road damage recognition to agricultural crop analysis, supports the applicability of the models across diverse environments.
3. **Balancing Efficiency and Performance**: Both accuracy and computational efficiency are considered in order to identify an optimal learning strategy under given resource constraints.
📄 Full Paper Content (ArXiv Source)
Computer vision has undergone a profound transformation over the past decade, largely driven by advances in deep learning and the widespread adoption of Convolutional Neural Networks (CNNs). Unlike traditional vision pipelines that rely on hand-crafted features such as SIFT or HOG, CNNs learn hierarchical feature representations directly from raw pixel data through end-to-end optimization. This ability to jointly learn low-level, mid-level, and high-level visual abstractions has enabled CNNs to achieve state-of-the-art performance across a broad spectrum of tasks, including image classification, object detection, semantic segmentation, and visual scene understanding.
The success of CNNs has been further amplified by the availability of large-scale annotated datasets and increased computational power, particularly through GPU acceleration. Benchmark datasets such as ImageNet have played a pivotal role in demonstrating the scalability of deep CNNs and their capacity to generalize across diverse visual categories. Architectures trained on such datasets learn reusable visual primitives, including edges, textures, object parts, and compositional patterns, which form the basis for knowledge transfer across tasks and domains. As a result, CNNs are now routinely deployed in real-world applications ranging from autonomous driving and infrastructure monitoring to precision agriculture and medical imaging.
Despite these advances, the practical deployment of CNN-based systems presents several design challenges. One of the most consequential decisions concerns how a CNN should be trained for a given task. In practice, three dominant paradigms are commonly employed: training a custom CNN architecture from scratch, using a pre-trained CNN as a fixed feature extractor, and applying transfer learning through fine-tuning of a pre-trained model. Each paradigm embodies distinct assumptions about data availability, computational resources, and the degree of domain similarity between training and deployment environments.
Training a CNN from scratch provides full control over architectural design and allows the model to be explicitly tailored to the characteristics of a target dataset and deployment constraints. Compact custom CNNs are often preferred in scenarios where memory footprint, inference latency, or energy consumption are critical considerations, such as embedded or edge computing environments. Moreover, recent studies have shown that carefully designed mid-sized CNNs can achieve competitive performance when combined with modern regularization techniques such as batch normalization and global average pooling. However, training from scratch typically requires substantial labeled data and careful optimization to avoid overfitting, particularly when the task exhibits high intra-class variability.
An alternative strategy is to leverage large pre-trained CNNs, such as VGG-16 or ResNet, as fixed feature extractors. In this setting, the convolutional layers are frozen and only a task-specific classifier head is trained on the target dataset. This approach significantly reduces training time and mitigates overfitting when labeled data is limited. Nevertheless, freezing the feature extractor can restrict the model’s ability to adapt to domain-specific visual patterns, especially when the target domain differs substantially from the source domain in terms of texture, color distribution, or scene composition.
Transfer learning via fine-tuning has emerged as a powerful compromise between training from scratch and using frozen pre-trained features. By initializing a network with pre-trained weights and selectively fine-tuning higher-level layers, the model can retain general-purpose visual knowledge while adapting its representations to the target task. Empirical evidence consistently shows that fine-tuning improves convergence speed and predictive accuracy, particularly for small and medium-sized datasets. However, fine-tuning increases computational cost and introduces additional hyperparameter sensitivity, requiring careful selection of learning rates, regularization strategies, and the depth of unfrozen layers.
Importantly, predictive performance alone does not fully capture the practical utility of a CNN-based system. Computational efficiency, training time, and model size play a crucial role in determining whether a model can be realistically deployed in production settings. Large pre-trained architectures often contain tens or hundreds of millions of parameters, resulting in high memory consumption and slower inference. In contrast, custom CNNs with significantly fewer parameters may offer favorable trade-offs between accuracy and efficiency, particularly in applications that require frequent retraining or real-time processing.
Motivated by these considerations, this work presents a systematic comparative study of three CNN-based learning paradigms: a custom CNN trained entirely from scratch, a pre-trained CNN used as a fixed feature extractor, and a transfer learning model with fine-tuning. The comparison is conducted across multiple real-world image classification datasets that span diverse application domains, including road surface damage recognition, agricultural crop and fruit analysis, and urban scene understanding. These datasets collectively capture challenges such as class imbalance, background clutter, illumination variation, and high inter-class visual similarity.
All models are evaluated under identical experimental conditions, including consistent data splits, preprocessing pipelines, training hyperparameters, and evaluation metrics. Performance is assessed using both accuracy and macro F1-score to account for class imbalance, while computational efficiency is evaluated through training time per epoch and model complexity. By analyzing performance trends across datasets and learning paradigms, this study aims to provide practical, evidence-based guidance for selecting CNN training strategies under varying data and resource constraints.
Rather than advocating a single universally optimal approach, the objective of this work is to illuminate the trade-offs inherent in different CNN paradigms. The results of this study seek to answer a central practical question faced by researchers and practitioners alike: when does the additional computational cost of transfer learning yield sufficient performance gains to justify its use, and under what conditions can a carefully designed custom CNN provide a more efficient and competitive alternative?
Related Work
The development of Convolutional Neural Networks (CNNs) has fundamentally reshaped the field of computer vision by enabling end-to-end learning of hierarchical visual representations directly from raw pixel data. Early CNN-based systems demonstrated the feasibility of learning spatially localized features using convolution and pooling operations, achieving strong performance on document recognition and digit classification tasks. However, the widespread adoption of CNNs for large-scale visual recognition only became practical with the availability of large annotated datasets and increased computational power, culminating in the success of deep architectures on the ImageNet benchmark.
Deep CNN Architectures
Following the breakthrough performance of AlexNet, subsequent research focused on improving depth, representational capacity, and optimization stability of CNN architectures. The VGG family of networks demonstrated that increasing depth through the systematic use of small $`3\times3`$ convolutional filters could significantly improve recognition accuracy, albeit at the cost of increased parameter count and computational complexity. While VGG-style networks remain influential due to their architectural simplicity and transferability, their large memory footprint motivates the exploration of more efficient alternatives.
Residual learning, introduced through ResNet architectures, addressed the degradation problem associated with very deep networks by enabling identity-based skip connections. This innovation allowed CNNs to scale to hundreds of layers while maintaining stable optimization behavior. DenseNet architectures further extended this idea by encouraging feature reuse through dense connectivity patterns, improving parameter efficiency and gradient flow. These architectural advances underscore the importance of structural inductive biases in deep CNN design.
In parallel, a line of research has focused on computational efficiency and deployment feasibility. Architectures such as MobileNet, MobileNetV2, and EfficientNet employ depthwise separable convolutions, inverted residuals, and compound scaling to reduce parameter counts and floating-point operations while preserving accuracy. These models are particularly relevant for edge computing scenarios, where resource constraints limit the applicability of large-scale CNNs.
Regularization and Architectural Design Choices
Effective training of deep CNNs relies heavily on regularization and architectural design choices. Batch normalization has become a standard component of modern CNNs, mitigating internal covariate shift and enabling higher learning rates and faster convergence. Dropout has been widely adopted as a stochastic regularization technique to prevent co-adaptation of neurons and reduce overfitting, particularly in fully connected layers.
Global average pooling (GAP) has emerged as an effective alternative to large fully connected layers, significantly reducing parameter counts while improving generalization by enforcing spatial correspondence between feature maps and class predictions. These design principles are particularly relevant for custom CNNs trained from scratch, where parameter efficiency and training stability are critical considerations.
Transfer Learning in Visual Recognition
Transfer learning has become a dominant paradigm in computer vision, especially in settings where labeled data is limited or training from scratch is computationally expensive. The central premise of transfer learning is that CNNs trained on large-scale datasets learn general-purpose visual features that can be reused across tasks. Empirical studies have shown that lower layers of CNNs tend to capture generic features such as edges and textures, while higher layers become increasingly task-specific.
Two primary transfer learning strategies are commonly employed: using a pre-trained CNN as a fixed feature extractor, and fine-tuning some or all of the network layers on the target dataset. Using frozen features offers computational efficiency and reduces overfitting risk but may limit adaptability to new domains. Fine-tuning allows the model to adjust its representations to domain-specific characteristics and typically yields superior performance, particularly when moderate amounts of labeled data are available.
The effectiveness of fine-tuning has been demonstrated across numerous domains, including medical imaging, remote sensing, and agriculture, where domain shift relative to ImageNet is often substantial. However, fine-tuning introduces additional hyperparameter sensitivity, requiring careful control of learning rates and regularization to prevent catastrophic forgetting or overfitting.
CNNs in Road and Infrastructure Monitoring
CNN-based approaches have been widely explored for road surface condition assessment and infrastructure monitoring. Road damage detection and classification using smartphone imagery has been shown to be a cost-effective and scalable alternative to traditional inspection methods. Such datasets often exhibit significant variability in lighting conditions, camera viewpoints, and background clutter, posing challenges for robust visual recognition. Transfer learning has proven particularly effective in this domain, as pre-trained CNNs provide strong initialization for handling diverse visual patterns.
CNNs in Agriculture and Plant Phenotyping
Agricultural applications represent another important area where CNNs have demonstrated substantial impact. Image-based plant disease detection and crop variety classification benefit from CNNs’ ability to capture fine-grained texture and color variations. Subsequent studies have shown that deep CNNs can outperform traditional machine learning approaches across multiple crop species and disease categories, even under real-world imaging conditions. However, agricultural datasets often suffer from class imbalance, limited sample sizes, and environmental variability, making the choice of training paradigm (scratch training versus transfer learning) particularly consequential.
Efficiency–Accuracy Trade-offs
While transfer learning and large pre-trained models often achieve superior accuracy, their computational cost and memory requirements can limit practical deployment. Several studies emphasize the importance of evaluating CNN models not only in terms of predictive performance but also with respect to efficiency metrics such as training time, inference latency, and parameter count. Custom CNNs, when carefully designed, can offer competitive performance at a fraction of the computational cost, making them attractive for real-world systems with strict resource constraints.
Positioning of This Work
Building upon these prior studies, the present work conducts a systematic and controlled comparison of three CNN paradigms (custom CNNs trained from scratch, frozen pre-trained CNNs, and fine-tuned transfer learning models) across multiple real-world datasets. Unlike studies that focus on a single domain or architecture, this work emphasizes cross-dataset consistency, controlled experimental conditions, and joint evaluation of performance and efficiency. By situating the analysis across diverse application domains, this study aims to provide practical insights into when transfer learning is essential and when a well-designed custom CNN can serve as a viable and efficient alternative.
Datasets
This study evaluates five datasets: Auto-RickshawImageBD, FootpathVision, RoadDamageBD, MangoImageBD, and PaddyVarietyBD, spanning both structured imagery (agricultural varieties) and unconstrained street-scene imagery. For each dataset, we use fixed train/validation/test splits and apply identical preprocessing and augmentation across all models.
Dataset Overview
Table 1 summarizes key dataset characteristics used in our experiments.
| Dataset | #Classes ($`C`$) | Train | Val | Test |
|---|---|---|---|---|
| Auto-RickshawImageBD | 2 | 932 | 266 | 133 |
| FootpathVision | 2 | 867 | 248 | 124 |
| RoadDamageBD | 2 | 315 | 90 | 45 |
| MangoImageBD | 15 | 3992 | 1141 | 570 |
| PaddyVarietyBD | 35 | 9800 | 2800 | 1400 |
Dataset statistics.
Preprocessing
All images are resized to a fixed resolution (e.g., $`224\times224`$) to match the input size expected by the custom CNN and the VGG backbone. Pixel values are normalized to match the pre-training statistics for ImageNet when using VGG. For the custom CNN trained from scratch, the same normalization is applied to keep preprocessing consistent.
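As a concrete illustration, the following is a minimal torchvision sketch of this preprocessing, assuming a standard `Resize`/`ToTensor`/`Normalize` composition; the exact pipeline in the original experiments may differ.

```python
from torchvision import transforms

# ImageNet channel statistics, applied identically across all three
# paradigms so that preprocessing stays consistent (as described above).
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

eval_transform = transforms.Compose([
    transforms.Resize((224, 224)),                      # fixed input resolution
    transforms.ToTensor(),                              # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),  # ImageNet statistics
])
```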
Data Augmentation
To improve generalization under real-world variance, we use standard augmentation techniques:
- random horizontal flipping (when label semantics permit),
- random resized crops / spatial jitter,
- mild color jitter (brightness/contrast/saturation),
- optional random rotation within a small range.
Augmentation is applied only to training images, not validation or test sets.
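A possible training-only pipeline implementing the operations listed above is sketched below; the specific magnitudes (crop scale, jitter strengths, rotation range) are illustrative assumptions, not values taken from the paper.

```python
from torchvision import transforms

# Training-only augmentation; validation and test sets use the
# deterministic preprocessing shown earlier.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # random crops / spatial jitter
    transforms.RandomHorizontalFlip(p=0.5),               # only when label semantics permit
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomRotation(degrees=10),                # small-range rotation
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
```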
Class Imbalance and Sampling
Real datasets often exhibit class imbalance. In addition to reporting macro F1-score (which is more sensitive to minority classes), we optionally use class-weighted cross-entropy or balanced sampling for stability. All reported results in Table 3 are based on the same training protocol for all models to ensure fairness.
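One common way to realize class-weighted cross-entropy is inverse-frequency weighting, sketched below; this particular weighting scheme is an assumption, as the paper does not specify one.

```python
import torch
import torch.nn as nn

def weighted_ce_loss(class_counts):
    # class_counts: number of training images per class.
    # Weight each class by its inverse frequency, normalized so that
    # the average weight is roughly 1.
    counts = torch.tensor(class_counts, dtype=torch.float)
    weights = counts.sum() / (len(counts) * counts)
    return nn.CrossEntropyLoss(weight=weights)

# Example: a binary dataset with a 4:1 class imbalance.
criterion = weighted_ce_loss([800, 200])
```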
Methodology
Problem Formulation
Each dataset is treated as a multi-class classification problem. Let $`(x_i, y_i)`$ denote an image and its label, where $`y_i \in \{1,\dots,C\}`$. A model $`f_\theta`$ outputs logits $`z \in \mathbb{R}^C`$, and class probabilities are given by the softmax:
\begin{equation}
p(y=c \mid x) = \frac{\exp(z_c)}{\sum_{k=1}^{C} \exp(z_k)}.
\end{equation}
The primary training objective is the cross-entropy loss:
\begin{equation}
\mathcal{L}_{CE} = -\frac{1}{N}\sum_{i=1}^{N} \log p(y_i \mid x_i).
\end{equation}
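The two equations above can be checked numerically: PyTorch's built-in cross-entropy applies exactly this softmax-then-negative-log-likelihood computation to raw logits. The tensors below are toy values for illustration only.

```python
import torch
import torch.nn.functional as F

# Toy logits z for N=2 samples and C=3 classes, with labels y.
z = torch.tensor([[2.0, 0.5, -1.0],
                  [0.1, 1.5,  0.3]])
y = torch.tensor([0, 1])

# Manual computation following the equations: softmax, then the mean
# negative log-probability of the true class.
p = torch.exp(z) / torch.exp(z).sum(dim=1, keepdim=True)
manual_ce = -torch.log(p[torch.arange(2), y]).mean()

builtin_ce = F.cross_entropy(z, y)  # same formula applied to raw logits
assert torch.allclose(manual_ce, builtin_ce)
```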
Models Evaluated
We evaluate three CNN paradigms representing distinct trade-offs between complexity, cost, and accuracy.
Custom CNN (Trained from Scratch)
The custom CNN is a compact VGG-inspired architecture designed for efficiency while preserving hierarchical feature learning. It consists of four convolutional stages with progressive channel expansion:
- Stage 1: two $`3\times3`$ convolutional layers with 64 channels,
- Stage 2: two $`3\times3`$ convolutional layers with 128 channels,
- Stage 3: three $`3\times3`$ convolutional layers with 256 channels,
- Stage 4: three $`3\times3`$ convolutional layers with 512 channels.
Each convolution is followed by batch normalization and ReLU activation. A $`2\times2`$ max pooling layer reduces spatial resolution after each stage. Following the feature extractor, global average pooling (GAP) aggregates spatial features, and a small classifier head produces class logits:
- fully connected layer $`512 \rightarrow 256`$ with ReLU,
- dropout ($`p=0.5`$),
- final linear layer $`256 \rightarrow C`$.
This model is substantially smaller than VGG-16 while retaining strong representational capacity.
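A minimal PyTorch sketch of this architecture follows, assuming $`3\times3`$ convolutions with padding 1 and a $`224\times224`$ RGB input; details such as padding and initialization are assumptions where the text does not specify them.

```python
import torch.nn as nn

def conv_stage(in_ch, out_ch, n_convs):
    # One stage: n_convs Conv-BN-ReLU blocks followed by 2x2 max pooling.
    layers = []
    for i in range(n_convs):
        layers += [
            nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        ]
    layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

class CustomCNN(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            conv_stage(3, 64, 2),     # Stage 1
            conv_stage(64, 128, 2),   # Stage 2
            conv_stage(128, 256, 3),  # Stage 3
            conv_stage(256, 512, 3),  # Stage 4
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),  # global average pooling
            nn.Flatten(),
            nn.Linear(512, 256),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        return self.head(self.features(x))
```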
Layer-wise summary.
Table 2 provides a compact, illustrative layer summary.
| Block | Layers | Output Channels |
|---|---|---|
| Stage 1 | Conv–BN–ReLU $`\times2`$, MaxPool | 64 |
| Stage 2 | Conv–BN–ReLU $`\times2`$, MaxPool | 128 |
| Stage 3 | Conv–BN–ReLU $`\times3`$, MaxPool | 256 |
| Stage 4 | Conv–BN–ReLU $`\times3`$, MaxPool | 512 |
| Head | GAP, FC(512$`\rightarrow`$256), Dropout, FC(256$`\rightarrow C`$) | — |
Custom CNN architecture summary (illustrative).
Pre-trained CNN (VGG-16 as Frozen Feature Extractor)
A standard VGG-16 model pre-trained on ImageNet is used as a fixed feature extractor. All convolutional parameters are frozen, and only a lightweight classifier head is trained on the target dataset. This approach is computationally efficient and reduces overfitting risk, but may under-adapt when the target domain differs significantly from ImageNet.
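A sketch of this setup using torchvision is shown below; the classifier head (a 256-unit hidden layer with dropout) is an assumption chosen to mirror the custom CNN's head, not a detail confirmed by the paper.

```python
import torch.nn as nn
from torchvision import models

def frozen_vgg16(num_classes):
    # Load ImageNet weights and freeze every convolutional parameter.
    model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    for p in model.features.parameters():
        p.requires_grad = False
    # Replace the ImageNet classifier with a lightweight task-specific head.
    model.classifier = nn.Sequential(
        nn.Linear(512 * 7 * 7, 256),
        nn.ReLU(inplace=True),
        nn.Dropout(p=0.5),
        nn.Linear(256, num_classes),
    )
    return model
```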
Transfer Learning CNN (VGG-16 Fine-tuning)
The transfer learning model uses the same VGG-16 initialization but allows fine-tuning of the final convolutional block and classifier head. Fine-tuning improves domain adaptation and often yields better accuracy than freezing features, at the cost of additional compute.
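Building on the frozen setup above, fine-tuning can be sketched by unfreezing only the last convolutional block; the `features[24:]` indexing assumes torchvision's VGG-16 layout, where the fifth (final) block starts at index 24.

```python
def finetuned_vgg16(num_classes):
    # Reuse the frozen configuration, then unfreeze the final conv block
    # (conv5_1 through conv5_3 plus the last max pool); the classifier
    # head is already trainable.
    model = frozen_vgg16(num_classes)
    for p in model.features[24:].parameters():
        p.requires_grad = True
    return model
```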
Training Setup
All models are trained under identical experimental conditions to ensure fairness.
Optimization.
We train for 20 epochs using stochastic gradient descent (SGD) with momentum $`0.9`$ and base learning rate $`0.01`$. A cosine decay learning rate schedule is applied, and cross-entropy loss is used for all tasks.
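The schedule described above maps directly onto PyTorch's optimizer and scheduler APIs, as in the sketch below; the helper name `make_optimizer` is hypothetical.

```python
import torch
import torch.nn as nn

def make_optimizer(model: nn.Module, epochs: int = 20):
    # Only parameters with requires_grad=True are optimized, which keeps
    # the frozen VGG-16 features out of the update step.
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(trainable, lr=0.01, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler
```

The scheduler is stepped once per epoch so that the learning rate follows a cosine decay over the 20-epoch budget.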
Regularization.
We use batch normalization, dropout, and data augmentation. Early stopping is not used in the primary comparison; instead, all methods run for the same number of epochs.
Implementation details.
Experiments can be implemented in PyTorch or Keras. To support reproducibility, random seeds should be fixed, and preprocessing should be identical across runs.
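A typical seed-fixing helper for PyTorch experiments, offered as a sketch of the reproducibility practice recommended above:

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    # Fix random seeds across Python, NumPy, and PyTorch (CPU and GPU).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Deterministic cuDNN kernels trade some speed for reproducibility.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```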
Evaluation Metrics
Accuracy
Accuracy measures the fraction of correct predictions:
\begin{equation}
\text{Accuracy} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}[\hat{y}_i = y_i].
\end{equation}
Macro F1-score
Macro F1 averages F1 across classes, treating each class equally and highlighting performance on minority classes. For class $`c`$, define precision and recall:
\begin{equation}
\text{Precision}_c = \frac{TP_c}{TP_c + FP_c}, \quad
\text{Recall}_c = \frac{TP_c}{TP_c + FN_c}.
\end{equation}
Then
\begin{equation}
F1_c = \frac{2\cdot \text{Precision}_c \cdot \text{Recall}_c}{\text{Precision}_c + \text{Recall}_c},
\quad
\text{Macro F1} = \frac{1}{C}\sum_{c=1}^{C} F1_c.
\end{equation}
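Both metrics are available in scikit-learn; the snippet below evaluates them on toy labels, with `average="macro"` matching the unweighted per-class mean defined above.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy integer class labels standing in for test-set predictions.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

acc = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted per-class mean
print(f"Accuracy: {acc:.3f}, Macro F1: {macro_f1:.3f}")
```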
Efficiency Metrics
We report training time per epoch (seconds) and parameter counts. Parameter counts help estimate memory footprint and deployment feasibility; for example, VGG-16 has $`\sim`$138M parameters, which can be prohibitive on edge devices.
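Parameter counts of the kind reported here can be computed directly from a model's tensors, as in this small helper (a sketch, not code from the paper):

```python
import torch.nn as nn

def count_parameters(model: nn.Module):
    # Return (total, trainable) parameter counts in millions. For the
    # frozen VGG-16, only the classifier head counts as trainable.
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total / 1e6, trainable / 1e6
```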
Results
Classification Performance
Table 3 summarizes performance across datasets. Transfer learning achieves the best accuracy and macro F1 on all datasets in this comparison.
| Dataset | Model Type | Accuracy (%) | Macro F1 |
|---|---|---|---|
| Auto-RickshawImageBD | Custom CNN | 89.8 | 0.90 |
| Auto-RickshawImageBD | Pre-trained CNN | 89.1 | 0.89 |
| Auto-RickshawImageBD | Transfer Learning | 92.1 | 0.93 |
| FootpathVision | Custom CNN | 91.9 | 0.91 |
| FootpathVision | Pre-trained CNN | 92.6 | 0.92 |
| FootpathVision | Transfer Learning | 94.6 | 0.95 |
| RoadDamageBD | Custom CNN | 94.2 | 0.93 |
| RoadDamageBD | Pre-trained CNN | 95.0 | 0.94 |
| RoadDamageBD | Transfer Learning | 96.8 | 0.96 |
| MangoImageBD | Custom CNN | 90.9 | 0.91 |
| MangoImageBD | Pre-trained CNN | 90.2 | 0.90 |
| MangoImageBD | Transfer Learning | 93.4 | 0.94 |
| PaddyVarietyBD | Custom CNN | 91.8 | 0.92 |
| PaddyVarietyBD | Pre-trained CNN | 93.1 | 0.93 |
| PaddyVarietyBD | Transfer Learning | 95.0 | 0.95 |
Figure 1 illustrates accuracy and macro F1 across datasets.
Training Time
Table 4 compares training time per epoch.
| Dataset | VGG-16 (Frozen) | Custom CNN | Transfer Learning |
|---|---|---|---|
| Auto-RickshawImageBD | 15.9 | 10.9 | 22.6 |
| FootpathVision | 19.1 | 13.1 | 27.2 |
| RoadDamageBD | 18.5 | 12.6 | 26.4 |
| MangoImageBD | 16.8 | 11.5 | 24.1 |
| PaddyVarietyBD | 17.2 | 11.8 | 24.8 |
Training time (seconds) per epoch for different CNN approaches across datasets.
Figure 2 provides a graphical view of the time comparison.
Model Size and Memory Footprint
Model size affects deployability. VGG-16 is large (around 138M parameters), whereas efficient architectures can be significantly smaller. Table 5 provides a template for reporting parameter counts and checkpoint sizes.
| Model | Parameters (M) | Checkpoint Size (MB) |
|---|---|---|
| Custom CNN | 9–10 (approx.) | 36 |
| VGG-16 (frozen) | 138 (approx.) | 276 |
| VGG-16 (fine-tuned) | 138 (approx.) | 552 |
Model size comparison (approximate).
Discussion
Across all datasets, transfer learning achieves the highest accuracy and macro F1-score, consistent with prior evidence that pre-trained representations combined with domain adaptation improve downstream performance. Gains are particularly pronounced in datasets with higher intra-class variability and domain shift relative to ImageNet.
The custom CNN demonstrates that a mid-sized architecture with batch normalization and GAP can remain competitive while reducing training cost and parameter count. This is practically useful when training must be performed frequently (e.g., iterative dataset updates), when GPUs are limited, or when models must run on resource-constrained devices.
Pre-trained VGG-16 used as a frozen feature extractor provides a reasonable baseline that often improves over scratch training, but it is consistently outperformed by fine-tuning. This suggests that representational reuse alone may be insufficient when texture, color statistics, and imaging conditions differ from the source domain.
When to Prefer Each Paradigm
- Custom CNN: preferred when compute and memory are limited, when inference must be fast, or when deployment targets edge devices.
- Frozen pre-trained CNN: preferred when labeled data is limited and training time must be minimal, while still benefiting from ImageNet features.
- Transfer learning: preferred when maximizing accuracy is the priority and additional compute is acceptable.
Limitations
This study focuses on a single pre-trained backbone (VGG-16) and a single custom CNN design. Additional backbones (ResNet, EfficientNet) could shift the efficiency–accuracy frontier. Further, hyperparameters are kept consistent across datasets for fairness; dataset-specific tuning may yield higher absolute performance for all methods.
Ethical and Practical Considerations
Vision systems deployed in real settings should be evaluated under realistic distribution shifts (lighting, camera devices, backgrounds). For street-scene datasets, privacy concerns may arise if faces or license plates are visible; appropriate anonymization and responsible data handling are recommended.
Conclusion
This report presented a systematic comparison of a custom CNN, a pre-trained CNN used as a fixed feature extractor, and a transfer learning CNN with fine-tuning across five visual classification datasets. Transfer learning consistently delivered the strongest performance, benefiting from rich pre-trained representations and domain-specific adaptation. However, the custom CNN achieved competitive results at substantially lower computational cost, highlighting its practical value for efficiency-constrained deployments. Overall, the findings support a resource-aware model selection strategy: use transfer learning when accuracy is paramount, but prefer compact custom CNNs when compute, memory, and training time are critical constraints.
📊 Figures
Figure 1: Accuracy and macro F1 across datasets.
Figure 2: Training time per epoch across datasets.