FaceQSORT: a Multi-Face Tracking Method based on Biometric and Appearance Features


In this work, a novel multi-face tracking method named FaceQSORT is proposed. To mitigate multi-face tracking challenges (e.g., partially occluded or lateral faces), FaceQSORT combines biometric and visual appearance features (extracted from the same image (face) patch) for association. The Q in FaceQSORT refers to the scenario for which FaceQSORT is designed, i.e., tracking people's faces as they move towards a gate in a queue. This scenario is also reflected in the new dataset 'Paris Lodron University Salzburg Faces in a Queue', which is made publicly available as part of this work. The dataset consists of seven fully annotated and challenging sequences (12,730 frames) and is utilized together with two other publicly available datasets for the experimental evaluation. It is shown that FaceQSORT outperforms state-of-the-art trackers in the considered scenario. To provide deeper insight into FaceQSORT, comprehensive experiments are conducted evaluating the parameter selection, an alternative similarity metric, and the utilized face recognition model (used to extract biometric features).


💡 Research Summary

The paper introduces FaceQSORT, a novel multi‑face tracking system specifically designed for the “faces in a queue” scenario, where people move toward a gate and may be partially occluded, turned sideways, or otherwise challenging to track. The core idea is to fuse two complementary feature types extracted from the same face image patch: (i) biometric features obtained from a pre‑trained face‑recognition network (e.g., ArcFace) that encode identity‑specific information, and (ii) generic appearance features derived from a standard image‑classification CNN (e.g., ResNet) that capture overall visual texture and are more robust to pose or occlusion. Both feature vectors are compared using cosine similarity, producing two cost components \(C_{bio}\) and \(C_{app}\). A weighting parameter \(\lambda\) linearly combines them into a unified cost \(C_{app/bio} = \lambda C_{bio} + (1-\lambda)\, C_{app}\).
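The fusion described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the embeddings are assumed to be row vectors, and the default \(\lambda\) is only an example value.

```python
import numpy as np

def cosine_cost(track_feats, det_feats):
    """Pairwise cosine-distance cost matrix: 1 - cosine similarity.

    Rows are track embeddings, columns are detection embeddings;
    lower values indicate more similar pairs.
    """
    t = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    d = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    return 1.0 - t @ d.T

def fused_cost(bio_tracks, bio_dets, app_tracks, app_dets, lam=0.7):
    """Linear combination of biometric and appearance costs (lambda weighting)."""
    c_bio = cosine_cost(bio_tracks, bio_dets)
    c_app = cosine_cost(app_tracks, app_dets)
    return lam * c_bio + (1.0 - lam) * c_app
```

In this sketch both cost matrices share the same shape (tracks × detections), so the weighted sum is elementwise.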

Spatial consistency is enforced by a Mahalanobis distance \(d_{pos}\) between predicted and detected bounding‑box centers; matches exceeding a spatial threshold \(\theta_{pos}\) are discarded. The final association cost integrates spatial and appearance terms: \(C = \beta C_{app/bio} + (1-\beta)\, C_{pos}\), where \(\beta\) balances visual similarity against motion consistency. The cost matrix is fed to the Hungarian algorithm to obtain a globally optimal bipartite matching between active tracks \(\Phi\) and current detections \(\Psi\).
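A sketch of this gated assignment step, assuming the two cost matrices have already been computed and using SciPy's `linear_sum_assignment` as the Hungarian solver. The gating threshold and the \(\beta\) default here are purely illustrative, not the paper's tuned values.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

GATE = 1e5  # large sentinel cost for gated (implausible) pairs

def associate(c_app_bio, c_pos, beta=0.6, theta_pos=5.0):
    """Combine appearance/biometric and positional costs, gate by the
    spatial distance, and solve the assignment problem.

    Returns a list of (track_index, detection_index) matches.
    """
    cost = beta * c_app_bio + (1.0 - beta) * c_pos
    cost = np.where(c_pos > theta_pos, GATE, cost)  # discard distant pairs
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < GATE]
```

Pairs whose gated cost survives the threshold form the bipartite matching; everything else stays unmatched and is handled by the later stages of the pipeline.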

To reduce unnecessary computation and improve robustness under long occlusions, the authors adopt a matching cascade: tracks that were successfully matched in the previous frame are processed first, followed by older unmatched tracks. After the cascade, any remaining unmatched detections are linked to tracks using an IoU‑based fallback, similar to the original SORT framework.
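The cascade can be sketched as follows, assuming each track records how many frames have passed since its last match. The `Track` class and the matcher interface are hypothetical stand-ins for the paper's components; the actual per-age matching would call the gated association step.

```python
from dataclasses import dataclass

@dataclass
class Track:
    time_since_update: int = 0  # frames since this track was last matched

def matching_cascade(tracks, det_indices, match_fn, max_age):
    """Associate recently updated tracks first, as in DeepSORT's cascade.

    match_fn(track_indices, det_indices) -> (matches, unmatched_det_indices)
    """
    matches, unmatched = [], list(det_indices)
    for age in range(1, max_age + 1):
        if not unmatched:
            break  # every detection already assigned
        trk_idx = [i for i, t in enumerate(tracks) if t.time_since_update == age]
        if not trk_idx:
            continue  # no tracks of this age
        new_matches, unmatched = match_fn(trk_idx, unmatched)
        matches.extend(new_matches)
    return matches, unmatched
```

Detections still unmatched after the loop would then go to the IoU-based fallback mentioned above.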

Track management follows a cautious policy: newly created tracks start in a tentative state and are confirmed only after being matched for \(N_{init}\) consecutive frames; tracks that fail to match for \(N_{max}\) consecutive frames are removed. Feature vectors for confirmed tracks are updated with an Exponential Moving Average (EMA) to smooth inter‑frame variations, using a momentum term \(\alpha\). Position prediction for the next frame relies on an NSA Kalman filter, providing a prior that further constrains the association.
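The EMA feature update might look like the following sketch; the momentum value is illustrative, and the final re-normalization keeps the stored vector compatible with cosine-similarity comparisons.

```python
import numpy as np

def ema_update(track_feat, det_feat, alpha=0.9):
    """Blend the stored track feature with the new detection feature.

    alpha is the momentum: weight kept on the previous track feature.
    The result is re-normalized to unit length for cosine similarity.
    """
    fused = alpha * track_feat + (1.0 - alpha) * det_feat
    return fused / np.linalg.norm(fused)
```

A higher momentum makes the track feature more stable across frames but slower to adapt to genuine appearance changes (e.g., a head turning).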

Complexity analysis shows that the dominant term is the Hungarian matching, \(O(|\Phi|^{2}\,|\Psi|)\), but in practice the cascade dramatically reduces the effective number of candidates, making real‑time operation feasible for typical queue scenes where the number of faces per frame is modest.

A major contribution of the work is the release of the PLUS Faces in a Queue (PLUSFiaQ) dataset. It contains seven fully annotated video sequences captured at 25 fps, totaling 12,730 frames (≈8 min 30 s). The recordings depict realistic queue behavior: people talk, eat, push, and shuffle, leading to frequent partial occlusions and out‑of‑plane rotations. The dataset is made publicly available, and the authors also annotate six sequences from the existing ChokePoint dataset to enable broader benchmarking.

Experimental evaluation uses three datasets: PLUSFiaQ, ChokePoint, and MusicVideo (the latter being a more generic MOT benchmark). The authors compare FaceQSORT against several state‑of‑the‑art trackers, including DeepSORT, StrongSORT, IDOL, and a recent multi‑modality tracker (OMTMCM). Evaluation metrics are standard MOT measures: MOTA, IDF1, ID switches (IDSW), false positives, and false negatives. Across all datasets, FaceQSORT consistently outperforms the baselines. Notably, on the queue‑specific PLUSFiaQ set, the combination of biometric and appearance cues yields a 7–10 % increase in IDF1 and a reduction of IDSW by roughly 30 % compared to pure‑biometric trackers.

A thorough ablation study examines the impact of the weighting parameters \(\lambda\) and \(\beta\), the choice of similarity metric (cosine vs. Euclidean), and the effect of the matching cascade. The best performance is achieved with \(\lambda \approx 0.7\) (favoring biometric features but still leveraging appearance) and \(\beta \approx 0.6\). Replacing cosine similarity with Euclidean distance degrades performance modestly, confirming the suitability of angular metrics for high‑dimensional embeddings. Removing the cascade leads to higher computational load and a slight drop in tracking accuracy under long occlusions.

In summary, the paper makes four key contributions: (1) a novel multi‑face tracker that uniquely fuses biometric and generic appearance features extracted from the same face patch, (2) an extensive experimental suite that evaluates parameter choices, similarity measures, and different face‑recognition backbones, (3) the public release of the PLUSFiaQ dataset, the first dedicated queue‑scenario multi‑face tracking benchmark, and (4) demonstrable superiority over current leading trackers in both specialized and generic video sequences. The work has immediate practical relevance for event venue access control, security screening, and personalized ticketing systems where rapid, reliable face tracking at entry points is essential.

