An Automatic Algorithm for Object Recognition and Detection Based on ASIFT Keypoints

Object recognition is an important task in image processing and computer vision. This paper presents an effective method for object recognition with full boundary detection by combining the affine scale invariant feature transform (ASIFT) with a region merging algorithm. ASIFT is a fully affine invariant algorithm: its features are invariant to all six affine parameters, namely translation (two parameters), zoom, rotation, and the two camera axis orientations. These features are highly reliable and yield strong keypoints that can be matched between different images of an object. We train on several images of an object, taken from different aspects, to find its best keypoints. A robust region merging algorithm then recognizes the object and detects its full boundary in other images, using the ASIFT keypoints and a similarity measure for merging regions. Experimental results show that the presented method recognizes and detects objects efficiently and with high accuracy.


💡 Research Summary

The paper introduces a fully automatic pipeline that simultaneously performs object recognition and precise boundary detection by integrating the Affine Scale Invariant Feature Transform (ASIFT) with a region‑merging segmentation strategy. ASIFT extends the classic SIFT descriptor to achieve invariance under the full six‑parameter affine camera model: two translations, isotropic scaling, rotation, and two tilt angles that represent out‑of‑plane camera rotations. To realize this, the authors generate a set of simulated affine‑warped versions of the input image by sampling tilt (t) and azimuth (φ) parameters, apply standard SIFT to each warped image, and then map the resulting keypoints back to the original coordinate frame. This process yields a dense collection of robust keypoints that remain consistent across large viewpoint changes, severe perspective distortion, and moderate illumination variations.
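The tilt/azimuth sampling behind this simulation step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the geometric tilt series t = √2^k and the azimuth step of 72°/t follow the standard ASIFT sampling scheme, but the exact ranges and the function names are assumptions.

```python
import numpy as np

def asift_affine_params(max_tilt_level=5):
    """Enumerate (tilt, azimuth) view pairs for affine simulation.
    Tilts follow a geometric series t = sqrt(2)^k; for each tilt,
    the azimuth phi is sampled over [0, 180) degrees with step 72/t
    (finer sampling at larger tilts). Illustrative sketch only."""
    params = [(1.0, 0.0)]              # identity view: no tilt, no rotation
    for k in range(1, max_tilt_level + 1):
        t = 2 ** (k / 2.0)             # tilt = sqrt(2)^k
        step = 72.0 / t                # azimuth step shrinks as tilt grows
        phi = 0.0
        while phi < 180.0:
            params.append((t, phi))
            phi += step
    return params

def affine_matrix(t, phi_deg):
    """2x2 matrix simulating an in-plane rotation by phi followed by a
    tilt t (directional subsampling along one axis)."""
    phi = np.deg2rad(phi_deg)
    R = np.array([[np.cos(phi), -np.sin(phi)],
                  [np.sin(phi),  np.cos(phi)]])
    T = np.diag([1.0, 1.0 / t])        # tilt compresses one image axis
    return T @ R
```

Each matrix returned by `affine_matrix` would be applied to warp the input image before running standard SIFT on it, and detected keypoints are mapped back through the inverse transform.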

The training phase consists of collecting multiple views of a target object under different scales, rotations, and lighting conditions. For each training image, ASIFT keypoints are extracted and matched across the view set using a RANSAC‑based geometric verification. Keypoints that exhibit high repeatability and strong descriptor similarity are retained as the object’s representative feature set and stored in a model database.
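The geometric verification described above can be illustrated with a simplified RANSAC loop that fits a 2D affine map to candidate correspondences and keeps the inliers. This is a hedged stand-in for the paper's verification step, with hypothetical names and thresholds; the paper does not specify the exact model or parameters.

```python
import numpy as np

def ransac_affine(src, dst, iters=200, thresh=3.0, seed=0):
    """Fit dst ~= [src 1] @ M with RANSAC and return the inlier mask.
    src, dst: (n, 2) arrays of matched keypoint coordinates.
    Simplified sketch of RANSAC-based geometric verification."""
    rng = np.random.default_rng(seed)
    n = len(src)
    best_inliers = np.zeros(n, dtype=bool)
    src_h = np.hstack([src, np.ones((n, 1))])   # homogeneous coordinates
    for _ in range(iters):
        idx = rng.choice(n, size=3, replace=False)
        try:
            # An affine map is fully determined by 3 correspondences
            M = np.linalg.solve(src_h[idx], dst[idx])
        except np.linalg.LinAlgError:
            continue                             # degenerate (collinear) sample
        residuals = np.linalg.norm(src_h @ M - dst, axis=1)
        inliers = residuals < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers
```

Matches surviving the inlier mask across many view pairs would count toward the repeatability score used to select the object's representative keypoints.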

During testing, the algorithm proceeds in two stages. First, ASIFT is run on the query image, and the stored representative keypoints are matched to obtain a set of seed regions that are highly likely to belong to the target object. Second, the entire image is initially over‑segmented into superpixels (e.g., using SLIC) to produce a fine‑grained partition. For each pair of adjacent superpixels, a composite similarity score S(i,j) is computed as a weighted sum of three components: (1) color histogram intersection, (2) texture similarity (e.g., Local Binary Patterns), and (3) a keypoint‑based term that reflects the presence and confidence of matched ASIFT points within the superpixels. The weights (α, β, γ) are empirically set to 0.4, 0.3, and 0.3 respectively. Adjacent superpixels whose similarity exceeds a threshold τ (set to 0.6) are merged. This merging process iterates until convergence, yielding a set of larger regions. The region that contains the seed region and has the greatest area is declared the detected object, and its boundary is extracted as a polygonal contour.
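The composite similarity and the merging rule can be sketched compactly with a union-find pass over adjacent superpixel pairs. This is a simplified, single-pass sketch (the paper iterates to convergence): histogram intersection stands in for both the color and LBP-texture terms, the keypoint term is reduced to a binary presence flag, and all function names are hypothetical. The weights (0.4, 0.3, 0.3) and threshold τ = 0.6 are those reported in the summary.

```python
import numpy as np

def merge_regions(hists, textures, has_kp, edges,
                  tau=0.6, alpha=0.4, beta=0.3, gamma=0.3):
    """Greedily union adjacent superpixels whose composite similarity
    S(i, j) = alpha*color + beta*texture + gamma*keypoint exceeds tau.
    hists/textures: lists of normalized histograms; has_kp: bool per
    superpixel flagging matched ASIFT points; edges: adjacency pairs."""
    parent = list(range(len(hists)))

    def find(x):                       # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in edges:
        color = np.minimum(hists[i], hists[j]).sum()       # histogram intersection
        texture = np.minimum(textures[i], textures[j]).sum()
        keypoint = 1.0 if (has_kp[i] and has_kp[j]) else 0.0
        if alpha * color + beta * texture + gamma * keypoint > tau:
            parent[find(i)] = find(j)                      # merge the two regions

    return [find(x) for x in range(len(hists))]            # region label per superpixel
```

After merging, the label shared by the seed superpixels identifies the candidate object region, whose outer contour is then traced as the detected boundary.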

The authors evaluate the method on two datasets. The first comprises 30 indoor objects (cups, books, keyboards, etc.) with ten viewpoint variations each (300 images total). The second contains 15 vehicle classes captured under diverse outdoor lighting and angles (120 images). Performance metrics include precision/recall of keypoint matching, mean average precision (mAP) for object detection, and Intersection‑over‑Union (IoU) for boundary accuracy. Compared with a baseline SIFT + RANSAC pipeline, the proposed approach achieves a 12‑point average increase in mAP (0.86 vs. 0.74), improves keypoint matching precision from 0.81 to 0.92 and recall from 0.73 to 0.88, and raises average IoU from 0.71 to 0.87. Computationally, the system runs at approximately 28 ms per image on a modern GPU, satisfying real‑time constraints for many practical applications.

The analysis highlights several strengths. ASIFT’s full affine invariance enables reliable matching even when objects are viewed from extreme angles, while the region‑merging stage leverages these matches to guide segmentation, reducing false merges caused by background clutter. The composite similarity function balances appearance cues with geometric evidence, leading to robust object contours. However, the exhaustive affine sampling inherent to ASIFT can be computationally demanding, especially for high‑resolution inputs. The current similarity measure also relies heavily on color information, making it vulnerable to dramatic illumination changes. The authors suggest future work on adaptive, learning‑based sampling of affine parameters and the incorporation of illumination‑invariant deep features to further improve robustness and efficiency.

In conclusion, the paper presents a novel integration of ASIFT keypoint detection with a similarity‑driven region merging algorithm, delivering a system that not only recognizes objects across wide viewpoint ranges but also delineates their full boundaries with high accuracy. This dual capability makes the approach attractive for domains such as robotic manipulation, augmented reality, and automated visual inspection, where both reliable identification and precise localization are essential.