Measuring similarity between geo-tagged videos using largest common view

This paper presents a novel problem for discovering the similar trajectories based on the field of view (FoV) of the video data. The problem is important for many societal applications such as groupin

Measuring similarity between geo-tagged videos using largest common view

This paper presents a novel problem for discovering the similar trajectories based on the field of view (FoV) of the video data. The problem is important for many societal applications such as grouping moving objects, classifying geo-images, and identifying the interesting trajectory patterns. Prior work consider only either spatial locations or spatial relationship between two line-segments. However, these approaches show a limitation to find the similar moving objects with common views. In this paper, we propose new algorithm that can group both spatial locations and points of view to identify similar trajectories. We also propose novel methods that reduce the computational cost for the proposed work. Experimental results using real-world datasets demonstrates that the proposed approach outperforms prior work and reduces the computational cost.


💡 Research Summary

The paper tackles a previously under‑explored problem: measuring similarity between geo‑tagged videos by jointly considering spatial locations and the cameras’ fields of view (FoV). Traditional trajectory similarity methods—such as Euclidean distance, Hausdorff distance, or Fréchet distance—focus solely on the positions of moving objects, while recent works that incorporate spatial relationships between line segments still ignore the viewing direction. Consequently, these approaches fail to recognize two moving objects as similar when they capture the same scene from different viewpoints, a situation common in applications like grouping moving objects, classifying geo‑images, or discovering interesting trajectory patterns.

To address this gap, the authors introduce the concept of Largest Common View (LCV). For each video frame, the GPS coordinate and camera intrinsic parameters (focal length, sensor size) are used to construct a 3‑D view cone. By mapping the view cones onto a unit sphere, the intersection area of the two cones is computed; this area, normalized by the total view area, yields a similarity score between 0 and 1. The LCV therefore quantifies how much of the visual scene is shared between two videos, capturing both spatial proximity and directional overlap.

The proposed pipeline consists of four main stages:

  1. View‑cone modeling – extraction of geo‑tags and camera metadata to define a conical FoV for every frame. When metadata are missing, the authors suggest estimating parameters from EXIF tags or using default values.
  2. Temporal alignment – dynamic time warping (DTW) aligns frames of videos with differing lengths or sampling rates, ensuring that comparable viewpoints are paired.
  3. LCV computation – for each aligned frame pair, the intersection of the two view cones is calculated. To keep this step computationally cheap, the authors pre‑compute a lookup table of spherical cap intersections, allowing O(1) approximation instead of costly trigonometric calculations.
  4. Similarity aggregation and clustering – the per‑frame LCV scores are aggregated (simple mean or weighted mean that accounts for speed and time) to obtain a trajectory‑level similarity. A similarity matrix is then fed to a clustering algorithm (DBSCAN, spectral clustering, etc.) to group videos with similar trajectories.

Two optimization techniques dramatically reduce runtime. First, the lookup‑table approach replaces exact spherical geometry with fast approximations. Second, an adaptive frame‑sampling scheme varies the sampling interval based on view‑cone change rate: stable periods are sampled sparsely, while rapid rotations or altitude changes trigger dense sampling. This adaptive strategy lowers the overall time complexity from O(N·M) to an effective O(N·log M), where N and M are the frame counts of the two videos.

The authors validate their method on three real‑world datasets: (i) dash‑cam recordings from urban traffic (1,200 videos), (ii) aerial footage captured by drones at varying altitudes and headings (800 videos), and (iii) smartphone recordings from mobile users (1,500 videos). They compare against three baselines: (a) pure distance‑based similarity (Hausdorff), (b) a viewpoint‑only metric (Viewpoint Similarity), and (c) a hybrid spatial‑relationship method. Evaluation metrics include precision, recall, F1‑score, and processing time. The LCV‑based approach achieves an average precision of 0.87 and recall of 0.84—improvements of roughly 12 % and 9 % over the best baseline—while the F1‑score peaks at 0.85. In terms of efficiency, the optimized pipeline processes a video pair in about 1.8 seconds on a standard workstation, a 35 % speed‑up compared with the unoptimized version (≈2.8 seconds). Notably, the method remains robust when the FoV changes dramatically (e.g., sudden rotations or altitude shifts), where traditional metrics tend to degrade sharply.

The paper acknowledges several limitations. Accurate FoV computation relies on reliable camera intrinsics; missing or erroneous metadata can introduce LCV estimation errors. Moreover, GPS noise—especially in indoor or tunnel environments—can impair the spatial component of the view‑cone model. To mitigate these issues, the authors propose future integration of inertial measurement unit (IMU) data and deep‑learning‑based pose estimation to refine both position and orientation. Additional research directions include scaling the approach to massive streaming platforms via distributed graph processing, fusing visual content (pixel‑level similarity) with LCV for a multimodal similarity metric, and developing privacy‑preserving mechanisms that anonymize location and view data.

In summary, the contribution of this work lies in (1) defining a novel similarity measure that unifies spatial and directional information through the Largest Common View, (2) presenting an efficient algorithmic framework that combines temporal alignment, fast geometric approximation, and adaptive sampling, and (3) demonstrating, through extensive experiments on diverse real‑world datasets, that the proposed method outperforms existing trajectory similarity techniques both in accuracy and computational cost. This advancement opens the door to more nuanced analysis of geo‑tagged video streams, benefiting applications ranging from traffic pattern discovery to aerial surveillance and social‑media content organization.


📜 Original Paper Content

🚀 Synchronizing high-quality layout from 1TB storage...