An Agent based Approach towards Metadata Extraction, Modelling and Information Retrieval over the Web
Web development is a challenging research area because of its creativity and complexity. A key open challenge in web technology is presenting data in a machine-readable and machine-processable format so that it can be used for knowledge-based information extraction and maintenance. Currently, full-text queries cannot return optimized results, because no existing mechanism can fully extract the semantics of a full-text query and then look up the corresponding knowledge-based information.
💡 Research Summary
The paper addresses a long‑standing limitation of conventional web search: the inability of keyword‑based engines to capture the full semantic intent of user queries and to retrieve information that is organized in a machine‑readable form. To overcome this, the authors propose an agent‑based framework that automatically extracts metadata from web pages, models this metadata using ontologies, and enables semantic information retrieval. The architecture consists of four cooperating agents.
The Crawler Agent performs web harvesting while analyzing HTML structure, meta‑tags, and embedded micro‑data. Unlike traditional crawlers, it also examines visual layout cues to identify hidden semantic units such as image captions or list items. The Extraction Agent processes the harvested content through a natural‑language‑processing pipeline that includes tokenization, part‑of‑speech tagging, named‑entity recognition, and semantic‑role labeling. It converts identified concepts and relationships into RDF triples, mapping them to a domain‑independent ontology schema.
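As a rough illustration of the metadata-to-triples step described above, the following Python sketch reads only the `<title>` and `<meta>` tags of a page and emits (subject, predicate, object) tuples. It is a minimal sketch under stated assumptions, not the paper's Extraction Agent: the NLP pipeline (tagging, entity recognition, role labeling) is omitted, and the names `MetaTagExtractor`, `META_TO_DC`, and `page_to_triples`, as well as the mapping from meta-tag names to Dublin Core terms, are hypothetical.

```python
from html.parser import HTMLParser


class MetaTagExtractor(HTMLParser):
    """Collects the <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = attr_map.get("name")
            content = attr_map.get("content")
            # Skip meta tags that lack either attribute (e.g. charset tags).
            if name and content:
                self.meta[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Assumed mapping from meta-tag names to Dublin Core terms (illustrative).
META_TO_DC = {
    "author": "dc:creator",
    "description": "dc:description",
    "keywords": "dc:subject",
}


def page_to_triples(url, html):
    """Return (subject, predicate, object) triples for one page."""
    parser = MetaTagExtractor()
    parser.feed(html)
    triples = []
    title = parser.title.strip()
    if title:
        triples.append((url, "dc:title", title))
    for name, predicate in META_TO_DC.items():
        value = parser.meta.get(name)
        if not value:
            continue
        if name == "keywords":
            # One triple per comma-separated keyword.
            for keyword in value.split(","):
                if keyword.strip():
                    triples.append((url, predicate, keyword.strip()))
        else:
            triples.append((url, predicate, value))
    return triples
```

In a setup like the one the summary describes, tuples of this shape would then be serialized as RDF and handed to the modeling stage; here they stay as plain Python tuples to keep the sketch dependency-free.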
The Modeling Agent aligns these triples with existing ontologies (e.g., FOAF, Dublin Core, Schema.org) and, when necessary, dynamically creates new classes and properties. Consistency checking and deduplication are performed to ensure high‑quality metadata. Finally, the Retrieval Agent interprets user queries expressed in natural language, transforms them into ontology‑based query graphs, and executes them against a SPARQL engine that searches the metadata repository. This enables relation‑based matching, allowing complex queries such as “smartphones released after 2021 with battery capacity over 4000 mAh” to be answered accurately.
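To make relation-based matching concrete, the sketch below answers the smartphone query from the paragraph above over a tiny in-memory triple list. It is an illustration only: the data, the `ex:` property names (`ex:releaseYear`, `ex:batteryCapacityMah`), and the helper functions are all hypothetical, and the SPARQL string shows what an equivalent query might look like under the same assumed vocabulary; it is not executed here and is not taken from the paper.

```python
# Hypothetical product triples; the ex: vocabulary is illustrative only.
TRIPLES = [
    ("ex:phoneA", "rdf:type", "ex:Smartphone"),
    ("ex:phoneA", "ex:releaseYear", 2022),
    ("ex:phoneA", "ex:batteryCapacityMah", 5000),
    ("ex:phoneB", "rdf:type", "ex:Smartphone"),
    ("ex:phoneB", "ex:releaseYear", 2020),
    ("ex:phoneB", "ex:batteryCapacityMah", 4500),
    ("ex:phoneC", "rdf:type", "ex:Smartphone"),
    ("ex:phoneC", "ex:releaseYear", 2023),
    ("ex:phoneC", "ex:batteryCapacityMah", 3500),
    ("ex:tabletD", "rdf:type", "ex:Tablet"),
    ("ex:tabletD", "ex:releaseYear", 2022),
    ("ex:tabletD", "ex:batteryCapacityMah", 8000),
]

# What the same query might look like in SPARQL (assumed vocabulary, not run).
EQUIVALENT_SPARQL = """
PREFIX ex: <http://example.org/>
SELECT ?phone WHERE {
  ?phone a ex:Smartphone ;
         ex:releaseYear ?year ;
         ex:batteryCapacityMah ?battery .
  FILTER(?year > 2021 && ?battery > 4000)
}
"""


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def value_of(triples, subject, predicate):
    """Return the first object for (subject, predicate), or None."""
    hits = match(triples, s=subject, p=predicate)
    return hits[0][2] if hits else None


def find_smartphones(triples, released_after, min_battery_mah):
    """Smartphones released after a year with battery above a threshold."""
    results = []
    for subject, _, _ in match(triples, p="rdf:type", o="ex:Smartphone"):
        year = value_of(triples, subject, "ex:releaseYear")
        battery = value_of(triples, subject, "ex:batteryCapacityMah")
        if year is None or battery is None:
            continue
        if year > released_after and battery > min_battery_mah:
            results.append(subject)
    return results


print(find_smartphones(TRIPLES, released_after=2021, min_battery_mah=4000))
```

The point of the example is that each constraint is checked against a typed property of the entity rather than against words in the page text, which is what a keyword match over "2021" or "4000 mAh" cannot do.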
Implementation uses the JADE (Java Agent DEvelopment Framework) for inter‑agent communication via ACL messages, and Apache Jena for RDF storage and SPARQL processing. Experiments were conducted on five domains—academic, e‑commerce, medical, travel, and news—covering more than 10,000 web pages. Evaluation metrics included extraction accuracy, ontology‑mapping precision, and retrieval precision/recall. Compared with baseline full‑text search, the proposed system achieved an average increase of 18 % in precision and 22 % in recall, while maintaining an average metadata generation latency of 1.3 seconds, demonstrating suitability for real‑time applications.
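For reference, retrieval precision and recall of the kind reported above can be computed per query as in the following sketch. This uses the standard set-based definitions; the function name and the sample document IDs are illustrative, and the summary does not state exactly how the paper averaged these figures across queries or domains.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    true_positives = len(retrieved_set & relevant_set)
    # Define both metrics as 0.0 when their denominator is empty.
    precision = true_positives / len(retrieved_set) if retrieved_set else 0.0
    recall = true_positives / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Made-up document IDs for one query, purely for illustration.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d1", "d2", "d5"]
precision, recall = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```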
The paper’s contributions are threefold: (1) a modular, agent‑based design that enhances scalability and maintainability; (2) an automated pipeline that produces high‑quality, ontology‑aligned metadata without extensive manual effort; (3) a semantic retrieval engine capable of handling complex, multi‑constraint queries. Limitations include higher crawling costs for highly dynamic single‑page applications and the need for occasional expert validation when extending ontologies. Future work proposes integrating reinforcement‑learning agents to continuously improve metadata quality and coupling the system with a blockchain‑based distributed ledger to guarantee metadata integrity and provenance. In sum, the framework advances the machine‑readability of web content and provides a robust foundation for knowledge‑based services, AI applications, and next‑generation intelligent information systems.
📜 Original Paper Content