Fisher Kernel for Deep Neural Activations

Compared to image representations based on low-level local descriptors, deep neural activations of Convolutional Neural Networks (CNNs) are richer in mid-level representation but poorer in geometric invariance. In this paper, we present a straightforward framework for better image representation by combining the two approaches. To take advantage of both representations, we propose an efficient method to extract a large number of multi-scale dense local activations from a pre-trained CNN. We then aggregate the activations with the Fisher kernel framework, modified with a simple scale-wise normalization that is essential to make it suitable for CNN activations. Replacing the direct use of a single activation vector with our representation yields significant performance improvements: +17.76 (Acc.) on MIT Indoor 67 and +7.18 (mAP) on PASCAL VOC 2007. The results suggest that our proposal can serve as a primary image representation for better performance in visual recognition tasks.


💡 Research Summary

The paper addresses the long‑standing trade‑off between the rich semantic content of deep convolutional neural network (CNN) activations and the geometric invariance of traditional local‑descriptor‑based representations. While a single global activation vector (e.g., from FC6 or FC7) captures high‑level information, it is vulnerable to scale, translation, and occlusion. Conversely, Bag‑of‑Words, VLAD, and Fisher Vector (FV) built on dense SIFT descriptors are robust to such transformations but lack discriminative power. To combine the strengths of both worlds, the authors propose a two‑step framework. First, they replace the fully‑connected layers of a pre‑trained CNN with equivalent convolutional layers, allowing dense extraction of local activations at multiple spatial resolutions in a single forward pass. This yields thousands of activation vectors per image (up to 4,410) with only ~0.5 s processing time, as shown in Table 1. Second, they aggregate these multi‑scale activations using a Fisher kernel enhanced by a simple yet crucial scale‑wise ℓ2‑normalization followed by average pooling—termed Multi‑Scale Pyramid Pooling (MPP). Each scale’s activations are PCA‑reduced, encoded into a Fisher vector with a GMM, normalized, and then averaged across scales. The scale‑wise normalization prevents the overwhelming influence of the many fine‑scale descriptors, a problem observed when naïvely pooling all descriptors together. Experiments on three benchmark datasets—MIT Indoor 67 (scene classification), PASCAL VOC 2007 (object classification), and Oxford‑102 Flowers (fine‑grained classification)—demonstrate substantial gains over baseline CNN representations, average‑pooled CNN features, and VLAD‑based multi‑scale methods. Notably, the method improves top‑1 accuracy on MIT Indoor 67 by 17.76 % and mean average precision on PASCAL VOC 2007 by 7.18 %. 
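The pooling step above can be sketched in code. The snippet below is a minimal illustration, not the authors' implementation: it uses a simplified first-order Fisher vector (gradients with respect to GMM means only, under a diagonal-covariance GMM), and the function names `fisher_vector_means` and `multi_scale_pyramid_pool` are placeholders invented here. The key point it demonstrates is MPP's ordering: each scale's descriptors are encoded and ℓ2-normalized *separately* before averaging, so a scale with many fine-grained descriptors cannot dominate the pooled representation.

```python
import numpy as np

def fisher_vector_means(descriptors, means, covs, priors):
    """Simplified first-order Fisher vector: gradients w.r.t. GMM means only.

    descriptors: (N, D) local activations for one scale (after PCA reduction)
    means, covs: (K, D) diagonal-covariance GMM parameters; priors: (K,)
    Returns a (K*D,) encoding vector.
    """
    N = descriptors.shape[0]
    # Posterior responsibilities gamma_{nk} of each Gaussian for each descriptor
    log_prob = -0.5 * (((descriptors[:, None, :] - means[None]) ** 2) / covs[None]).sum(-1)
    log_prob += np.log(priors)[None] - 0.5 * np.log(covs).sum(-1)[None]
    log_prob -= log_prob.max(axis=1, keepdims=True)  # numerical stability
    gamma = np.exp(log_prob)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # Average gradient w.r.t. the means, whitened by the standard deviations
    diff = (descriptors[:, None, :] - means[None]) / np.sqrt(covs)[None]
    fv = (gamma[:, :, None] * diff).sum(axis=0) / (N * np.sqrt(priors)[:, None])
    return fv.ravel()

def multi_scale_pyramid_pool(per_scale_descriptors, means, covs, priors):
    """MPP: encode each scale, normalize per scale, then average across scales."""
    fvs = []
    for desc in per_scale_descriptors:
        fv = fisher_vector_means(desc, means, covs, priors)
        fv = np.sign(fv) * np.sqrt(np.abs(fv))   # power normalization
        fv /= (np.linalg.norm(fv) + 1e-12)       # scale-wise l2 normalization
        fvs.append(fv)
    return np.mean(fvs, axis=0)                  # average pooling over scales
```

Because every scale contributes exactly one unit-norm Fisher vector to the average, a dense scale with thousands of activations carries the same weight as a coarse scale with a handful, which is precisely the imbalance that naïve pooling of all descriptors suffers from.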
Additional analyses reveal that the performance boost stems more from the proposed pooling strategy than from the choice of Fisher versus VLAD encoding. The authors also show that object confidence maps can be derived without bounding‑box supervision, hinting at potential applications in weakly‑supervised detection. In summary, the work presents an efficient, scalable, and generic image representation that leverages deep activations while preserving the geometric robustness of traditional local descriptors, making it a strong candidate for a wide range of visual recognition tasks.
