Hierarchical Instance Tracking to Balance Privacy Preservation with Accessible Information

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

We propose a novel task, hierarchical instance tracking, which entails tracking all instances of predefined categories of objects and parts while maintaining their hierarchical relationships. We introduce the first benchmark dataset supporting this task, consisting of 2,765 unique entities that are tracked in 552 videos and belong to 40 categories (across objects and parts). Evaluation of seven variants of four models tailored to our novel task reveals that the new dataset is challenging. Our dataset is available at https://vizwiz.org/tasks-and-datasets/hierarchical-instance-tracking/.


💡 Research Summary

The paper introduces a new computer‑vision task called Hierarchical Instance Tracking (HIT), which simultaneously tracks all instances of predefined object categories and their semantic parts while preserving the hierarchical relationship between objects and parts. The authors argue that this task is essential for privacy‑preserving video services, such as assistive tools for blind users or remote‑consultation platforms, where an incorrect segmentation could unintentionally expose sensitive information (e.g., names, addresses, credit‑card numbers).

To enable research on HIT, the authors construct the first publicly available benchmark, BIV‑Priv‑HIT. It consists of 552 videos captured by 26 blind photographers, each about 25–30 seconds long, yielding a total of 11,165 annotated frames. Within these frames, 2,765 unique entities are tracked: 537 whole objects and 2,228 parts. The taxonomy comprises 40 semantic categories (16 object types and 24 part types) organized into three privacy‑risk tiers: personally identifiable information (PII), quasi‑PII, and sensitive non‑identifying information. Annotations are provided as “masklets” – per‑frame segmentation masks linked across time with unique instance IDs. The dataset is split 60 %/15 %/25 % for training, validation, and testing, ensuring that videos from the same photographer never appear in multiple splits.
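The annotation scheme described above — per-frame masks linked across time by instance IDs, with parts tied to a parent object — can be sketched as a simple data structure. This is an illustrative schema, not the dataset's actual file format; the class and field names (`Masklet`, `parent_id`, etc.) are assumptions for exposition.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical schema for one tracked entity ("masklet"): per-frame masks
# linked across time by a unique instance ID. Parts reference their parent
# object so the object-part hierarchy is preserved over the whole video.
@dataclass
class Masklet:
    instance_id: int
    category: str                      # one of the 40 object/part categories
    parent_id: Optional[int] = None    # None for whole objects; set for parts
    masks: dict = field(default_factory=dict)  # frame index -> binary mask

def check_hierarchy(masklets: list) -> bool:
    """Invariant: every part must reference an existing whole-object masklet."""
    objects = {m.instance_id for m in masklets if m.parent_id is None}
    return all(m.parent_id in objects
               for m in masklets if m.parent_id is not None)

# Example: a card (object) with a text region annotated as one of its parts.
card = Masklet(instance_id=1, category="credit_card")
name = Masklet(instance_id=2, category="name_text", parent_id=1)
print(check_hierarchy([card, name]))  # True
```

A structure like this makes the hierarchy explicit, which is exactly what flat per-frame mask formats in existing tracking benchmarks lack.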

The authors compare BIV‑Priv‑HIT against five hierarchical image‑segmentation datasets (ADE20K, PACO‑Ego4D, PACO‑LVIS, PartImageNet, PASCAL‑Part) and three video‑tracking datasets (DAVIS, YouTube‑VIS, SA‑V). BIV‑Priv‑HIT is unique in three respects: (1) it provides semantic labels for both objects and parts, (2) it tracks the hierarchical decomposition of objects (object + nested parts) across time, and (3) its videos are substantially longer (average 27.9 s) than those in existing tracking benchmarks. Moreover, parts in this dataset often contain text, have elongated or line‑like shapes, and exhibit smoother boundaries, which differentiates them from typical part‑segmentation datasets.

For baseline evaluation, the authors adapt four state‑of‑the‑art models—two video object segmentation (VOS) models (STM, XMem), a video instance segmentation (VIS) model (Mask2Former‑based), and a hierarchical image‑segmentation model (MaskFormer‑Hier)—into seven variants that can handle HIT. Because none of these models natively support simultaneous object‑and‑part labeling or hierarchical identity preservation, the authors employ ad‑hoc workarounds such as multiple inference passes and post‑hoc mask association. They introduce a new metric, Hierarchical Tracking Accuracy (HIT‑Acc), in addition to standard mAP and IDF1.
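One way to realize the "post-hoc mask association" workaround mentioned above is to run separate inference passes for objects and parts, then attach each part mask to whichever object mask contains most of it. The sketch below is an illustrative containment heuristic, not the authors' exact procedure; the function names and the 0.5 threshold are assumptions.

```python
import numpy as np

def containment(part: np.ndarray, obj: np.ndarray) -> float:
    """Fraction of the part's pixels that fall inside the object mask."""
    part_area = part.sum()
    return (part & obj).sum() / part_area if part_area else 0.0

def associate_parts(part_masks, object_masks, thresh=0.5):
    """Assign each part to the best-containing object for one frame.

    Returns {part index: object index or None}; None means no object
    contains enough of the part, so it stays unassociated.
    """
    assignment = {}
    for i, p in enumerate(part_masks):
        scores = [containment(p, o) for o in object_masks]
        best = int(np.argmax(scores)) if scores else None
        assignment[i] = best if best is not None and scores[best] >= thresh else None
    return assignment
```

Heuristics like this are brittle under occlusion and fragmentation, which is consistent with the poor part-tracking results reported next.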

Results show that all baselines perform poorly, especially on part tracking. Overall mAP values stay below 20 %, and the HIT‑Acc for parts is often under 15 %. Object tracking is modestly better but still far from practical utility. The models also suffer from inefficiency: the need for multiple passes dramatically increases inference time, making real‑time deployment infeasible. Error analysis reveals that small, text‑rich parts are frequently missed or fragmented, and occlusions quickly cause identity switches, which would directly compromise privacy‑preserving applications.
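Identity switches of the kind described above are precisely what the standard IDF1 metric penalizes: each switch converts correctly tracked detections into identity false positives and false negatives. The paper's HIT‑Acc metric is not reproduced here, but the standard IDF1 formula from multi-object tracking evaluation is:

```python
def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """IDF1 = 2*IDTP / (2*IDTP + IDFP + IDFN).

    IDTP counts detections matched under the optimal identity assignment;
    identity switches shrink IDTP while inflating IDFP and IDFN, so
    occlusion-induced switches drag this score down directly.
    """
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0

# A tracker that keeps identity for 80 of 100 ground-truth detections:
print(idf1(80, 20, 20))  # 0.8
```

For a privacy-preserving application, a low IDF1 means an obfuscation mask can silently detach from the sensitive region it was meant to cover.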

The authors conclude that current VOS/VIS architectures are ill‑suited for HIT because they (a) forbid a pixel from belonging to multiple semantic categories, (b) lack built‑in hierarchical reasoning, and (c) require external mechanisms to maintain temporal identity across object‑part hierarchies. They suggest future directions such as graph‑based hierarchical representations, multi‑task learning that jointly predicts objects and parts, uncertainty estimation for privacy‑critical parts, and human‑in‑the‑loop verification.

In summary, this work defines a novel, socially impactful vision task, provides the first large‑scale benchmark that fills a critical gap in privacy‑aware video analytics, and demonstrates that existing models struggle on this challenge. By releasing BIV‑Priv‑HIT and the associated evaluation protocol, the authors invite the community to develop more sophisticated, efficient, and privacy‑respectful algorithms that can benefit assistive technologies, remote communication, robotics, and any application where fine‑grained part understanding must coexist with strong privacy guarantees.

