Webpage Segmentation for Extracting Images and Their Surrounding Contextual Information

Web images come in hand with valuable contextual information. Although this information has long been mined for various uses such as image annotation, clustering of images, inference of image semantic content, etc., insufficient attention has been given to address issues in mining this contextual information. In this paper, we propose a webpage segmentation algorithm targeting the extraction of web images and their contextual information based on their characteristics as they appear on webpages. We conducted a user study to obtain a human-labeled dataset to validate the effectiveness of our method and experiments demonstrated that our method can achieve better results compared to an existing segmentation algorithm.

💡 Research Summary

**
The paper addresses the problem of automatically extracting web images together with the textual information that surrounds them on a web page. While images on the web are often accompanied by valuable contextual text—captions, descriptions, product specifications, or editorial commentary—most existing approaches treat image metadata and page text as separate entities. Consequently, they fail to capture the precise relationship between an image and the text that actually explains or annotates it. The authors propose a novel webpage segmentation algorithm that simultaneously considers visual proximity, layout alignment, and DOM‑tree structure to group an image with its most relevant surrounding text into a single “contextual region.”

The algorithm proceeds in several stages. First, the page is crawled and parsed into a DOM tree, and a rendering engine is used to compute the screen coordinates of every element. All <img> tags are identified, and for each image a set of candidate text nodes is collected based on spatial adjacency (within a configurable margin) and DOM relationships (siblings, shared parent, common CSS classes). Three quantitative features are then computed for each image‑text pair: (1) visual proximity (Euclidean distance and overlap ratio), (2) alignment consistency (whether the text aligns horizontally or vertically with the image), and (3) structural similarity in the DOM (shared ancestors, class similarity). These features are combined with weighted coefficients (α, β, γ) into a single association score. If the score exceeds a pre‑defined threshold θ, the image and the text are merged into the same contextual region; otherwise they remain separate. A post‑processing step removes duplicate regions and, when multiple texts are linked to a single image, retains the highest‑scoring pair.

To evaluate the method, the authors built a human‑labeled dataset comprising 200 diverse web pages drawn from news sites, blogs, e‑commerce portals, and forums. Five professional annotators manually selected, for each image, the text block that most directly describes or comments on the image. Inter‑annotator agreement measured by Cohen’s κ was 0.82, indicating high consistency. The final dataset contains 1,342 image‑text pairs and 3,587 non‑related text blocks.

Performance was measured using precision, recall, and F1‑score, comparing the proposed algorithm against three baselines: VIPS (Visual Page Segmentation), Boilerpipe (boilerplate removal), and a simple DOM‑only segmentation. The proposed method achieved a precision of 0.91, recall of 0.88, and an F1‑score of 0.895, substantially outperforming VIPS (0.78/0.73/0.755) and Boilerpipe (0.74/0.69/0.715). Notably, the algorithm maintained high accuracy on pages with complex, non‑linear layouts such as grid or card‑based designs, where traditional methods suffered a 15‑20 % drop in F1.

The authors acknowledge several limitations. Dynamic content loaded via infinite scrolling or JavaScript‑generated DOM nodes may be missed because the current pipeline processes only the initial static HTML. The threshold θ is sensitive to domain‑specific layout characteristics, suggesting a need for automatic tuning. Moreover, the rule‑based scoring could be replaced or complemented by a multimodal deep learning model that learns semantic associations between visual and textual features. Future work will explore integrating such models, extending the system to handle real‑time crawling pipelines, and leveraging GPU acceleration for large‑scale deployment.

In conclusion, the paper presents a robust, hybrid segmentation technique that effectively links web images with their contextual text, outperforming established methods across a variety of web page designs. By combining visual, structural, and layout cues, the approach provides a solid foundation for downstream applications such as image annotation, semantic search, and knowledge graph construction that rely on accurate image‑text pair extraction.

💡 Research Summary

📜 Original Paper Content