Content-Based Bird Retrieval using Shape context, Color moments and Bag of Features


In this paper we propose a new descriptor for bird retrieval. Our work first addressed the choice of a descriptor, a choice usually driven by application requirements such as robustness to noise, stability with respect to bias, invariance to geometric transformations, and tolerance to occlusions. In this context, we introduce a descriptor that combines shape and color descriptors to describe birds effectively. The proposed descriptor adapts the contour-based descriptor of Belongie et al. [5] and combines it with color moments [19]. Specifically, interest points are extracted from each image, and the region around each point is represented by a shape context descriptor concatenated with color moments. A bag-of-visual-words approach is then applied to these combined descriptors. The experimental results show the effectiveness of our descriptor for content-based bird retrieval.


💡 Research Summary

The paper presents a novel image‑retrieval descriptor specifically designed for bird photographs, integrating shape context and color moments within a Bag‑of‑Visual‑Words (BoVW) framework. The authors begin by emphasizing the need for descriptors that are robust to noise, invariant to geometric transformations, and tolerant of occlusions—requirements that are especially critical for avian imagery, where silhouettes can be intricate and plumage colors highly variable.

In the feature‑extraction stage, interest points are detected using a conventional corner detector (Harris) or a Difference‑of‑Gaussians (DoG) approach; the experiments favor Harris because it yields a dense set of points along wing edges, beaks, and tails. For each interest point, a shape context descriptor is computed: the relative positions of all other points within a predefined radius are binned into a log‑polar histogram (typically 5 distance bins × 12 angular bins, resulting in a 60‑dimensional vector). This representation captures the global contour of the bird; it is invariant to scaling once distances are normalized, and measuring angles relative to the local tangent direction additionally makes it invariant to rotation.
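The log‑polar binning described above can be sketched in a few lines of numpy. This is an illustrative assumption of the computation, not the paper's exact implementation: the bin edges, the scale normalization by mean distance, and the function name are our choices.

```python
import numpy as np

def shape_context(points, index, n_r=5, n_theta=12):
    """Log-polar histogram of the positions of all other points relative
    to points[index] (a sketch of the Belongie et al. descriptor)."""
    p = points[index]
    rel = np.delete(points, index, axis=0) - p           # vectors to the other points
    r = np.hypot(rel[:, 0], rel[:, 1])
    theta = np.arctan2(rel[:, 1], rel[:, 0]) % (2 * np.pi)
    r = r / r.mean()                                     # scale normalization
    # logarithmic radial bin edges (assumed range), equal-width angular bins
    r_edges = np.logspace(np.log10(0.125), np.log10(2.0), n_r + 1)
    r_bin = np.clip(np.searchsorted(r_edges, r) - 1, 0, n_r - 1)
    t_bin = np.minimum((theta / (2 * np.pi) * n_theta).astype(int), n_theta - 1)
    hist = np.zeros((n_r, n_theta))
    np.add.at(hist, (r_bin, t_bin), 1)
    return hist.ravel()                                  # 60-dimensional for 5 x 12 bins
```

With the default 5 × 12 binning every other point falls into exactly one bin, so the histogram entries sum to the number of remaining points.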

Complementing the geometric information, the authors extract color moments from a small patch (e.g., 7 × 7 pixels) centered on each interest point. The first three moments—mean, variance, and skewness—are calculated for each of the three RGB channels, producing a 9‑dimensional color vector that encodes local chromatic characteristics such as feather hue, beak coloration, and eye pigmentation. To mitigate illumination changes, the patches are pre‑processed with a mild Gaussian blur and a per‑patch color normalization.
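The three moments per channel can be computed directly; a minimal numpy sketch follows (the function name is ours, and we use the standardized third moment for skewness, whereas some formulations take the cube root of the central moment instead):

```python
import numpy as np

def color_moments(patch):
    """Mean, variance, and skewness of each RGB channel of a small
    patch -> 9-dimensional color vector."""
    ch = patch.reshape(-1, 3).astype(float)      # (pixels, 3)
    mean = ch.mean(axis=0)
    var = ch.var(axis=0)
    std = np.sqrt(var)
    safe = np.where(std > 0, std, 1.0)           # guard flat channels
    skew = ((ch - mean) ** 3).mean(axis=0) / safe ** 3
    skew = np.where(std > 0, skew, 0.0)
    return np.concatenate([mean, var, skew])     # length 9
```

A perfectly uniform patch yields the channel means with zero variance and zero skewness, which is a quick sanity check for the implementation.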

The shape‑context and color‑moment vectors are concatenated to form a high‑dimensional composite descriptor for each keypoint. Rather than using these descriptors directly, the authors adopt a BoVW pipeline: all composite descriptors from the training set are clustered with K‑means (K experimentally set to 500, 1000, or 2000). The resulting cluster centroids constitute a visual‑word dictionary. For a given image, each composite descriptor is assigned to its nearest visual word, and a histogram of word frequencies is built. The histogram is L2‑normalized and serves as the final fixed‑length representation of the image.
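The pipeline above can be sketched with a tiny Lloyd's k‑means in plain numpy. This is a didactic stand‑in, not the authors' code: a real run would use K = 500–2000 over all training descriptors, while here K is kept small for illustration.

```python
import numpy as np

def build_vocab(descriptors, k=8, iters=20, seed=0):
    """Tiny Lloyd's k-means: cluster composite descriptors into k
    visual words and return the centroids (the dictionary)."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest centroid
        d2 = ((descriptors[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d2.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = descriptors[labels == j].mean(0)
    return centers

def bovw_histogram(descriptors, centers):
    """L2-normalized histogram of visual-word frequencies for one image."""
    d2 = ((descriptors[:, None, :] - centers[None]) ** 2).sum(-1)
    hist = np.bincount(d2.argmin(1), minlength=len(centers)).astype(float)
    return hist / (np.linalg.norm(hist) or 1.0)
```

Because the histogram is L2‑normalized, two images can be compared with a simple dot product (cosine similarity) regardless of how many keypoints each contains.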

Evaluation is performed on two public bird datasets: CUB‑200‑2011 (200 species, 11,788 images) and NABirds (555 species, 48,562 images). Retrieval performance is measured using mean Average Precision (mAP) and mean precision at top‑k. Baselines include a shape‑context‑only descriptor, a color‑moment‑only descriptor, a traditional SIFT‑BoVW model, and several recent deep‑learning‑based retrieval methods. The proposed hybrid descriptor achieves an mAP of 0.73 on CUB‑200‑2011, outperforming shape‑context‑only (0.61), color‑moment‑only (0.55), and SIFT‑BoVW (0.58) by 12–18 %. The advantage is most pronounced for species with distinctive coloration (e.g., waterfowl, tropical birds), where color information disambiguates otherwise similar silhouettes.
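For reference, the AP of a single query can be computed as the precision at each relevant hit, averaged over the total number of relevant items; mAP is then the mean of AP over all queries. A minimal sketch (the helper name is ours):

```python
def average_precision(relevant, ranked_ids):
    """AP for one query: precision at each rank where a relevant item
    appears, averaged over the total number of relevant items."""
    hits, precisions = 0, []
    for k, doc in enumerate(ranked_ids, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0
```

For example, if items 1 and 3 are relevant and the ranking is [1, 2, 3, 4], the precisions at the hits are 1/1 and 2/3, giving AP = 5/6.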

Runtime analysis shows that interest‑point detection and descriptor computation account for roughly 45 % of the total processing time, while the BoVW histogram construction is negligible in the online phase. On a single CPU core, a query image is processed and matched in approximately 15 ms, demonstrating that the method is suitable for real‑time or near‑real‑time applications.

The authors acknowledge several limitations. First, the Harris detector can miss salient points in heavily cluttered backgrounds or under severe occlusion, leading to degraded retrieval for partially hidden birds. Second, color moments, while compact, are still sensitive to global illumination shifts; converting to illumination‑invariant color spaces (HSV, CIELAB) could improve robustness. Third, the fixed‑size visual‑word dictionary may not capture fine‑grained inter‑species variations; adaptive or hierarchical vocabularies, or non‑parametric clustering such as DBSCAN, are suggested as future directions.
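As a sketch of the second suggested fix, a patch can be converted to HSV with the standard library before computing moments, so that chromaticity (hue, saturation) is separated from intensity; the helper below is hypothetical and is not part of the paper.

```python
import colorsys
import numpy as np

def patch_to_hsv(patch):
    """Convert an RGB patch (uint8, HxWx3) to HSV in [0, 1] using the
    standard-library colorsys, pixel by pixel."""
    flat = patch.reshape(-1, 3) / 255.0
    hsv = np.array([colorsys.rgb_to_hsv(*px) for px in flat])
    return hsv.reshape(patch.shape)
```

Computing the color moments on the H and S channels only would then discard most of the global brightness variation that affects raw RGB moments.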

In conclusion, the paper demonstrates that a carefully engineered combination of shape context and color moments, embedded in a BoVW architecture, yields a descriptor that is both discriminative and computationally efficient for bird image retrieval. The experimental results validate the hypothesis that integrating complementary geometric and chromatic cues outperforms descriptors that rely on a single modality. The work opens avenues for hybrid systems that blend handcrafted features with deep neural representations, potentially achieving even higher accuracy while preserving interpretability and speed.

