Self-informed neural network structure learning
We study the problem of large-scale, multi-label visual recognition with a large number of possible classes. We propose a method for augmenting a trained neural network classifier with auxiliary capacity in a manner designed to significantly improve upon an already well-performing model, while minimally impacting its computational footprint. Using the predictions of the network itself as a descriptor for assessing visual similarity, we define a partitioning of the label space into groups of visually similar entities. We then augment the network with auxiliary hidden-layer pathways that connect only to these groups of label units. We report a significant improvement in mean average precision on a large-scale object recognition task with the augmented model, while increasing the number of multiply-adds by less than 3%.
💡 Research Summary
The paper tackles the challenge of improving large‑scale, multi‑label visual recognition systems without incurring a prohibitive computational cost. Starting from a high‑performing convolutional network (an Inception‑style model with two 4096‑unit fully‑connected layers), the authors propose a systematic way to add “specialist” capacity that is targeted at groups of labels the base model finds visually similar. The key insight is to use the model’s own predictions on a held‑out dataset as a proxy for visual similarity. By collecting the top‑K (K = 100) predictions for each image and comparing them to ground‑truth labels, a confusion matrix A is built that records how often label i is confused with label j. A symmetric similarity matrix B = AᵀA is then constructed and fed to spectral clustering (Ng et al., 2002) to partition the 17 000‑class label space into a small number of clusters (e.g., six or thirteen). An alternative version uses only co‑detections (model predictions against model predictions) to form A, eliminating the need for ground‑truth.
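The confusion-matrix-plus-spectral-clustering step described above can be sketched in a few lines. This is a minimal numpy-only illustration, not the authors' implementation: the label count and toy predictions are made up, and the spectral clustering (in the spirit of Ng et al., 2002) uses a small hand-rolled k-means with deterministic farthest-point initialization rather than a library routine.

```python
import numpy as np

def build_confusion_matrix(topk_preds, gt_labels, n_labels):
    """A[i, j] counts how often ground-truth label i appears alongside
    predicted label j in an image's top-K prediction list."""
    A = np.zeros((n_labels, n_labels))
    for preds, gts in zip(topk_preds, gt_labels):
        for i in gts:
            for j in preds:
                A[i, j] += 1
    return A

def spectral_clusters(B, n_clusters):
    """Spectral clustering on a symmetric similarity matrix B: embed each
    label via the top eigenvectors of the normalized affinity, then run a
    tiny k-means in that embedding (farthest-point init for determinism)."""
    d = B.sum(axis=1)
    d[d == 0] = 1.0
    D = np.diag(1.0 / np.sqrt(d))
    L = D @ B @ D                              # normalized affinity
    _, vecs = np.linalg.eigh(L)                # eigenvalues ascending
    X = vecs[:, -n_clusters:]                  # top-eigenvector embedding
    X = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    centers = [X[0]]                           # farthest-point initialization
    for _ in range(1, n_clusters):
        d2 = np.min(((X[:, None, :] - np.array(centers)[None]) ** 2).sum(-1), axis=1)
        centers.append(X[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(50):                        # Lloyd iterations
        assign = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(n_clusters):
            if np.any(assign == c):
                centers[c] = X[assign == c].mean(axis=0)
    return assign

# Toy example: labels {0,1} are mutually confusable, as are {2,3}.
topk = [[0, 1], [1, 0], [2, 3], [3, 2]]
gt   = [[0], [1], [2], [3]]
A = build_confusion_matrix(topk, gt, n_labels=4)
B = A.T @ A                                    # symmetric similarity matrix
labels = spectral_clusters(B, n_clusters=2)    # groups {0,1} and {2,3}
```

The co-detection variant mentioned above would simply pass the model's own top-K lists in place of `gt_labels` when building `A`.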
For each discovered cluster, a “specialist” sub‑network is added in parallel to the existing fully‑connected stack. Each specialist consists of two 512‑unit ReLU layers and connects only to the output units belonging to its cluster via a dedicated weight matrix. During training, the original network’s weights (except the final classifier) are frozen; only the new specialist weights and the classifier layer are learned. This forces the added capacity to focus on the discriminative nuances within each cluster rather than re‑learning the generic features already captured by the base model. The authors note that a full fine‑tune of the whole network is possible but was not explored in their experiments.
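The specialist topology can be summarized with a small forward-pass sketch. All sizes here are hypothetical stand-ins (the real model uses 17 k labels and 4096-unit layers), and the weights are random placeholders: the point is only the connectivity, i.e. that each two-layer ReLU pathway adds logits solely to its own cluster's output units on top of the frozen generalist head.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# Hypothetical sizes for illustration only.
n_labels, feat_dim, spec_dim = 12, 64, 8
clusters = [np.array([0, 1, 2, 3]),
            np.array([4, 5, 6, 7, 8]),
            np.array([9, 10, 11])]

# Frozen generalist classifier: shared feature -> all label logits.
W_gen = rng.standard_normal((n_labels, feat_dim)) * 0.01

# One specialist per cluster: two small ReLU layers plus a dedicated
# output matrix that touches only that cluster's label units.
specialists = []
for idx in clusters:
    W1 = rng.standard_normal((spec_dim, feat_dim)) * 0.01
    W2 = rng.standard_normal((spec_dim, spec_dim)) * 0.01
    W_out = rng.standard_normal((len(idx), spec_dim)) * 0.01
    specialists.append((idx, W1, W2, W_out))

def forward(x):
    logits = W_gen @ x                       # generalist prediction
    for idx, W1, W2, W_out in specialists:
        h = relu(W2 @ relu(W1 @ x))          # specialist pathway
        logits[idx] += W_out @ h             # only its cluster's units
    return logits

x = rng.standard_normal(feat_dim)            # shared base-network feature
y = forward(x)
```

During training, only the `specialists` weights and the classifier layer would receive gradients; the base network producing `x` stays frozen, as the summary describes.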
Experiments are conducted on the internal Google JFT dataset (≈100 M images, 17 k classes). The baseline model achieves 36.80 % mean average precision (mAP) at top‑50 predictions with 1.52 billion multiply‑adds (MACs). Adding six specialist heads derived from the ground‑truth confusion matrix raises mAP to 39.41 % while increasing MACs to 1.56 billion (+2.6 %). Randomly permuting labels across the same six heads yields only 32.97 % mAP, confirming that the clustering is essential. Using thirteen heads derived from co‑detection clustering yields 38.07 % mAP (1.60 B MACs, +5.3 %), again outperforming a random baseline (32.13 %). In both configurations the method improves performance for only a few percent of extra computation, making it suitable for latency‑sensitive production environments.
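The compute overheads quoted above follow from one line of arithmetic against the 1.52 B-MAC baseline:

```python
# Sanity-check the reported compute overheads against the 1.52 B-MAC baseline.
base_macs = 1.52e9

overhead_6_heads = 100 * (1.56e9 - base_macs) / base_macs   # six specialist heads
overhead_13_heads = 100 * (1.60e9 - base_macs) / base_macs  # thirteen heads

print(f"6 heads:  +{overhead_6_heads:.1f}% MACs")   # +2.6% MACs
print(f"13 heads: +{overhead_13_heads:.1f}% MACs")  # +5.3% MACs
```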
A preliminary cross‑dataset evaluation maps a subset of JFT classes to the 1 000‑class ImageNet benchmark (≈660 overlapping classes). Although not directly comparable to ImageNet‑trained models, the results suggest that the specialist‑augmented network retains its gains on a different test distribution.
In the discussion, the authors position their work relative to mixture‑of‑experts and knowledge‑distillation literature. Unlike classic mixtures of experts, they do not learn a gating network; instead, the gating information is extracted a priori from the already‑trained “generalist” model. Compared to distillation, which uses soft targets to train a single compact model, this approach adds capacity after distillation could even be applied, offering a complementary avenue for fine‑grained improvements.
Future directions include adaptive allocation of specialist capacity based on cluster size or data abundance, inserting specialist pathways at earlier convolutional layers to capture more localized visual cues, and iterating the clustering‑augmentation cycle to discover hierarchical partitions of the label space. The authors also highlight the connection to conditional computation (Bengio, 2013), where only relevant subnetworks are activated at inference time, pointing toward scalable architectures that combine massive overall capacity with low‑cost selective evaluation.
Overall, the paper presents a pragmatic, data‑driven method for augmenting deep visual classifiers: by letting the model “self‑inform” which labels are confusable, it creates targeted specialist modules that boost accuracy with minimal extra computation, a valuable contribution for large‑scale, real‑world vision systems.