Fine-Grained Instance-Level Sketch-Based Video Retrieval

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Existing sketch-analysis work studies sketches depicting static objects or scenes. In this work, we propose a novel cross-modal retrieval problem of fine-grained instance-level sketch-based video retrieval (FG-SBVR), where a sketch sequence is used as a query to retrieve a specific target video instance. Compared with sketch-based still-image retrieval and coarse-grained category-level video retrieval, this is more challenging because both visual appearance and motion must be matched simultaneously at a fine-grained level. We contribute the first FG-SBVR dataset with rich annotations. We then introduce a novel multi-stream multi-modality deep network to perform FG-SBVR under both strongly and weakly supervised settings. The key component of the network is a relation module, designed to prevent model over-fitting given scarce training data. We show that this model significantly outperforms a number of existing state-of-the-art models designed for video analysis.


💡 Research Summary

The paper introduces a novel cross‑modal retrieval task called Fine‑Grained Instance‑Level Sketch‑Based Video Retrieval (FG‑SBVR). Unlike traditional sketch‑based image retrieval, which deals with static objects, FG‑SBVR requires matching a sequence of free‑hand sketches to a specific video instance, demanding simultaneous alignment of fine‑grained appearance details (pose, clothing, hairstyle) and motion cues (jumps, spins, glides). To enable research on this challenging problem, the authors construct the first FG‑SBVR dataset: 528 high‑definition figure‑skating video clips (total duration 3,546 s, average length 6.7 s) paired with 1,448 multi‑page sketches. Sketches are stored in SVG format, contain an average of 102 strokes, and are divided into a “skater” component and a single motion‑vector component. Each video clip is described by 1‑9 sketch pages (average 2.7), and detailed frame‑to‑sketch page correspondence is annotated for strong‑supervision experiments.

The core technical contribution is a multi‑stream, multi‑modality deep neural network. Both video and sketch modalities are processed by two parallel streams: an Appearance stream that extracts static visual features, and a Motion stream that captures dynamic information. Within each stream, a triplet ranking loss aligns sketch and video embeddings in a joint space. This explicit separation contrasts with prevailing 3D‑CNN approaches that conflate spatial and temporal cues, and it proves essential given the large modality gap and limited training data.
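The triplet ranking loss used within each stream can be sketched as follows. This is a minimal, illustrative implementation in plain Python (not the authors' code): it assumes fixed-length embedding vectors and a hand-picked margin, and penalizes a sketch embedding that is not at least `margin` closer to its matching video embedding than to a non-matching one.

```python
def triplet_ranking_loss(sketch, pos_video, neg_video, margin=0.2):
    """Hinge-style triplet ranking loss on embedding vectors.

    Pulls the matching (positive) video embedding closer to the sketch
    embedding than the non-matching (negative) one, by at least `margin`,
    using squared Euclidean distance. The margin value is illustrative.
    """
    d_pos = sum((s - p) ** 2 for s, p in zip(sketch, pos_video))
    d_neg = sum((s - n) ** 2 for s, n in zip(sketch, neg_video))
    return max(0.0, margin + d_pos - d_neg)

# Toy 4-D embeddings: the positive video is much closer to the sketch,
# so the margin is already satisfied and the loss is zero.
sketch = [1.0, 0.0, 0.0, 0.0]
pos    = [0.9, 0.1, 0.0, 0.0]
neg    = [0.0, 1.0, 0.0, 0.0]
loss = triplet_ranking_loss(sketch, pos, neg)
```

In the paper's network one such loss is applied per stream (appearance and motion), so each stream learns its own joint sketch-video embedding space.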

To further mitigate data scarcity, the authors embed a Relation Module inspired by meta‑learning few‑shot techniques. The module receives a pair of sketch‑clip embeddings and outputs a non‑linear similarity score, effectively providing richer supervision than the triplet loss alone. This design enables effective learning even with a few hundred training pairs.
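The relation module's core idea can be illustrated with a tiny stand-alone sketch, assuming the usual relation-network recipe: concatenate a sketch-clip embedding pair and pass it through a small learned MLP that outputs a similarity score in (0, 1). The weights, sizes, and names below are hypothetical; in the actual model they would be learned jointly with the embedding streams.

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def relation_score(sketch_emb, clip_emb, W1, b1, w2, b2):
    """Illustrative relation module: a two-layer MLP on the
    concatenated sketch/clip embedding pair, with a sigmoid
    output giving a non-linear similarity score in (0, 1)."""
    x = list(sketch_emb) + list(clip_emb)
    h = relu([s + b for s, b in zip(matvec(W1, x), b1)])
    z = sum(w * v for w, v in zip(w2, h)) + b2
    return 1.0 / (1.0 + math.exp(-z))

# Tiny fixed (untrained) weights for 2-D embeddings, i.e. a 4-D input.
W1 = [[0.5, -0.2, 0.1, 0.3],
      [-0.1, 0.4, 0.2, -0.3]]
b1 = [0.0, 0.0]
w2 = [0.6, -0.4]
b2 = 0.0
s = relation_score([1.0, 0.0], [1.0, 0.0], W1, b1, w2, b2)
```

Because the similarity is a learned non-linear function of the pair rather than a fixed distance, it adds a second, richer training signal on top of the triplet loss.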

Training is explored under two regimes. In the strongly supervised setting, the exact page‑to‑frame alignment is used to guide the network, allowing precise temporal matching. In the weakly supervised setting, no intra‑video alignment is provided; instead, each video clip is treated as a bag of frames and each sketch sequence as a bag of pages, and a Multiple‑Instance Learning (MIL) framework is employed. Experiments demonstrate that the proposed model outperforms a range of state‑of‑the‑art video analysis baselines (I3D, C3D, 3D‑ResNet) and fine‑grained sketch‑based image retrieval models. Gains of 10‑15 percentage points in mean Average Precision (mAP) and Top‑K accuracy are reported, with the Relation Module contributing an additional ~8 pp improvement in the weakly supervised scenario.
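The weakly supervised bag-of-pages/bag-of-frames matching can be sketched as below. This is a simplified MIL-style illustration, not the paper's exact formulation: with no page-to-frame alignment available, a natural aggregator (assumed here) is to score a sketch-clip pair by the best pairwise similarity between any page embedding and any frame embedding.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def bag_similarity(page_embs, frame_embs):
    """MIL-style bag score: the sketch sequence is a bag of page
    embeddings, the clip a bag of frame embeddings; max pairwise
    cosine similarity is used as the (illustrative) aggregator."""
    return max(cosine(p, f) for p in page_embs for f in frame_embs)

# Two sketch pages vs. two clip frames (2-D toy embeddings):
pages  = [[1.0, 0.0], [0.0, 1.0]]
frames = [[0.0, 1.0], [1.0, 1.0]]
best = bag_similarity(pages, frames)
```

The strongly supervised regime replaces this max-over-pairs scoring with direct losses on the annotated page-to-frame correspondences.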

The paper’s contributions can be summarized as: (1) defining the FG‑SBVR problem, (2) releasing a richly annotated, fine‑grained sketch‑video dataset, (3) proposing a dual‑stream network that explicitly separates appearance and motion, augmented by a Relation Module to prevent over‑fitting, (4) demonstrating the feasibility of both strong and weak supervision via MIL, and (5) providing extensive empirical evidence of superior performance over existing video retrieval and sketch‑based methods.

Overall, this work establishes a solid benchmark for cross‑modal video retrieval using sketches, offers a practical solution to the data‑scarcity challenge, and opens avenues for future extensions to other domains (e.g., sports, daily activities) and additional modalities such as text or audio.

