Web Video Categorization based on Wikipedia Categories and Content-Duplicated Open Resources

This paper presents a novel approach to web video categorization that leverages Wikipedia categories (WikiCs) and open resources describing the same content as the video, i.e., content-duplicated open resources (CDORs). Current approaches collect CDORs within only one or a few media forms and ignore CDORs of other forms. We explore all these resources by utilizing WikiCs and commercial search engines. Given a web video, its discriminative Wikipedia concepts are first identified and classified. Then a textual query is constructed, from which CDORs are collected. Based on these CDORs, we propose to categorize web videos in the space spanned by WikiCs rather than that spanned by raw tags. Experimental results demonstrate the effectiveness of both the proposed CDOR collection method and the WikiC voting categorization algorithm. In addition, the categorization model built on both WikiCs and CDORs achieves better performance than models built on only one of them, as well as a state-of-the-art approach.

💡 Research Summary

The paper addresses the problem of automatically assigning categories to web videos by combining two complementary sources of semantic information: Wikipedia categories (WikiCs) and Content‑Duplicated Open Resources (CDORs). Traditional video categorization methods rely mainly on the video’s own metadata (titles, descriptions, tags) or on visual/audio features, and when they do incorporate external resources they typically limit themselves to a single media type (e.g., textual documents). This narrow focus leads to two major drawbacks: (1) the metadata are often noisy, incomplete, or deliberately spam‑filled, and (2) valuable duplicate content that exists in other modalities (news articles, blogs, other videos, image captions, etc.) is ignored.

The authors propose a three‑stage framework that overcomes these limitations. First, they extract discriminative Wikipedia concepts from the video’s textual metadata. By mapping these concepts onto a pre‑defined set of Wikipedia categories, they obtain a clean, hierarchical semantic representation that is less prone to the noise of raw tags. Second, the identified concepts are used to formulate a textual query, which is submitted to a commercial search engine. The query retrieves a diverse collection of CDORs that describe the same underlying content as the target video, regardless of the media form. Each retrieved resource is then processed to extract its own Wikipedia concepts and the associated WikiCs. Third, the authors introduce a “WikiC voting” algorithm that aggregates the category votes from all CDORs. Votes are weighted according to factors such as the search rank of the resource, the domain’s trustworthiness, and the frequency of the category within the CDOR set. The weighted votes are summed in the WikiC space, and the categories with the highest scores are assigned to the video.
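The voting stage described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Cdor` structure, the reciprocal-rank weight, the `trust` score, and the multiplicative combination of the three factors are all assumptions chosen only to make the idea concrete.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Cdor:
    """One retrieved resource (hypothetical structure, not from the paper)."""
    rank: int              # 1-based position in the search results
    trust: float           # assumed domain trust score in [0, 1]
    categories: List[str]  # WikiCs extracted from this resource


def wikic_vote(cdors: List[Cdor], top_k: int = 1) -> List[Tuple[str, float]]:
    """Aggregate weighted category votes over all CDORs.

    Assumed weighting: (1 / rank) * trust * (share of CDORs that
    mention the category). The paper's actual weights may differ.
    """
    # Count how many CDORs mention each category (once per resource).
    freq = defaultdict(int)
    for c in cdors:
        for cat in set(c.categories):
            freq[cat] += 1

    # Sum the weighted votes in the WikiC space.
    scores = defaultdict(float)
    n = len(cdors)
    for c in cdors:
        rank_weight = 1.0 / c.rank
        for cat in set(c.categories):
            scores[cat] += rank_weight * c.trust * (freq[cat] / n)

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


if __name__ == "__main__":
    cdors = [
        Cdor(rank=1, trust=1.0, categories=["Music"]),
        Cdor(rank=2, trust=0.5, categories=["Music", "Sports"]),
        Cdor(rank=3, trust=1.0, categories=["Sports"]),
    ]
    print(wikic_vote(cdors, top_k=2))
```

Multiplying the factors means that a low-ranked or low-trust resource contributes little even when its categories are frequent. An additive combination would be an equally plausible reading of the description.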

The experimental evaluation uses a large YouTube‑derived dataset comprising over ten thousand videos annotated with twenty high‑level categories. Four baselines are compared against the proposed method: (1) a WikiC‑only model, (2) a CDOR‑only model that does not use Wikipedia concepts, (3) a state‑of‑the‑art BERT‑based text classifier, and (4) a conventional CDOR collection approach limited to a single text source. Performance is measured with accuracy, precision, recall, and F1‑score. The proposed approach consistently outperforms all baselines, achieving improvements of 7–12 percentage points on average across the metrics. The gains are especially pronounced in multi‑label scenarios, where the hierarchical nature of WikiCs helps resolve label ambiguity and the aggregation of multiple CDORs provides richer contextual clues.
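For reference, precision, recall, and F1 in a multi-label setting can be computed as in the sketch below. The summary does not say which averaging scheme was used, so micro-averaging is an assumption here, and the function name `micro_prf` is hypothetical.

```python
from typing import List, Set, Tuple


def micro_prf(true_sets: List[Set[str]],
              pred_sets: List[Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over multi-label predictions."""
    tp = fp = fn = 0
    for truth, pred in zip(true_sets, pred_sets):
        tp += len(truth & pred)   # labels correctly predicted
        fp += len(pred - truth)   # labels predicted but not annotated
        fn += len(truth - pred)   # annotated labels that were missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    truth = [{"Music"}, {"Sports", "News"}]
    pred = [{"Music"}, {"Sports"}]
    print(micro_prf(truth, pred))
```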

Key contributions of the work are: (1) a novel use of Wikipedia’s structured category system to create a robust semantic space for video labeling, (2) an automated pipeline that leverages general‑purpose search engines to harvest CDORs across heterogeneous media types, and (3) the WikiC voting mechanism that effectively fuses heterogeneous external evidence into a single, interpretable categorization decision.

The authors also discuss limitations. The reliance on external search engines introduces latency and potential scalability concerns for real‑time applications. Wikipedia categories may lag behind emerging trends or slang, which can affect the coverage of newly emerging video topics. Moreover, the CDOR collection process could be biased toward domains that are more searchable or SEO‑optimized, potentially skewing the category distribution.

Future research directions include developing lightweight, domain‑specific indexing solutions to reduce query latency, extending Wikipedia categories automatically through community‑driven or machine‑learning‑based updates, and incorporating multimodal signals (visual features, audio transcripts, subtitles) alongside textual CDORs to further improve categorization robustness. The paper demonstrates that integrating structured knowledge bases with diverse open‑world resources yields a powerful and flexible approach to web video categorization, advancing the state of the art beyond purely metadata‑driven or single‑source methods.
