The Implementation of Hadoop-based Crawler System and Graphlite-based PageRank-Calculation In Search Engine
Nowadays, the size of the Internet is experiencing rapid growth. As of December 2014, the number of global Internet websites had exceeded 1 billion, and all kinds of information resources are integrated together on the Internet; as a result, the search engine has become a necessary tool for all users to retrieve useful information from vast amounts of web data. Generally speaking, a complete search engine includes a crawler system, an index-building system, a sorting (ranking) system, and a retrieval system. At present there are many open-source search engine implementations, such as Lucene, Solr, Katta, Elasticsearch, Solandra, and so on. The crawler system and the sorting system are indispensable for any kind of search engine, and to guarantee efficiency, the former needs to keep vast amounts of crawled data up to date, while the latter must build indexes on newly crawled web pages in real time and calculate their corresponding PageRank values. It is unlikely that such huge computation tasks can be accomplished with a single-machine implementation of the crawler and sorting systems, which is why distributed cluster technology is brought to the front. In this paper, we use the Hadoop MapReduce computing framework to implement a distributed crawler system, and use GraphLite, a distributed synchronous graph-computing framework, to compute the PageRank values of newly crawled web pages in real time.
💡 Research Summary
The paper addresses two core components of a modern search engine—web crawling and PageRank computation—by leveraging distributed computing frameworks to handle the massive scale of today’s Internet. The authors begin by highlighting the exponential growth of web content, noting that as of December 2014 more than one billion websites exist, making single‑machine solutions infeasible for timely data collection and ranking updates. To overcome these limitations, they propose a two‑layer architecture: a Hadoop MapReduce‑based distributed crawler and a GraphLite‑based synchronous graph‑processing engine for real‑time PageRank calculation.
In the crawler subsystem, URLs are initially stored in HDFS. The Map phase parses incoming URL lists, applies filtering (including a Bloom filter for duplicate detection), and partitions URLs by domain to balance network load across the cluster. The Reduce phase performs HTTP fetches, stores the raw HTML in a compressed form, and records metadata such as outbound links, page size, and HTTP headers in HBase. This design exploits Hadoop’s data locality, fault‑tolerance, and automatic task re‑execution, allowing the system to scale linearly with the number of nodes. The authors acknowledge that MapReduce is inherently batch‑oriented, which can introduce latency; however, they mitigate this by setting crawl intervals on the order of minutes, which they deem sufficient for many indexing pipelines.
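The Map-phase logic described above — duplicate filtering followed by domain-based partitioning — can be sketched as follows. This is an illustrative Python sketch, not the paper's actual code: the `BloomFilter` class, the `map_phase` function, and the reducer-assignment scheme are assumptions standing in for Hadoop's real Mapper API and whatever filter the authors used.

```python
# Illustrative sketch of the crawler's Map phase: filter duplicate URLs
# with a Bloom filter, then key each new URL by its domain so that a
# single reducer fetches a whole domain (balancing per-host load).
# All names here are hypothetical, not from the paper.
import hashlib
from urllib.parse import urlparse

class BloomFilter:
    """Minimal Bloom filter for duplicate-URL detection (illustrative only)."""
    def __init__(self, size=1 << 20, hashes=4):
        self.size = size
        self.hashes = hashes
        self.bits = bytearray(size // 8)

    def _positions(self, item):
        # Derive k bit positions from salted MD5 digests of the item.
        for i in range(self.hashes):
            h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def map_phase(urls, bloom, num_reducers):
    """Yield (reducer_id, (domain, url)) pairs for URLs not seen before."""
    for url in urls:
        if url in bloom:          # duplicate: already crawled or queued
            continue
        bloom.add(url)
        domain = urlparse(url).netloc
        # All URLs of one domain hash to the same reducer partition.
        yield hash(domain) % num_reducers, (domain, url)
```

In a real Hadoop job this partitioning would be done by a custom `Partitioner`, and the Bloom filter would need to be shared or merged across mappers; the sketch only shows the data flow.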
For ranking, the paper adopts GraphLite, a Bulk Synchronous Parallel (BSP) framework that processes graph vertices in synchronized supersteps. After the crawler populates a link graph (vertices = pages, edges = hyperlinks), in each superstep every vertex sends its current PageRank value to its out‑neighbors, receives contributions, and updates its own rank using the classic formula PR = (1‑d)/N + d·Σ(contributions), with damping factor d = 0.85. Convergence is declared when the absolute change falls below a small epsilon. To avoid recomputing the entire graph whenever new pages are added, the authors introduce an incremental update strategy: only the subgraph consisting of newly crawled pages and their immediate neighbors is re‑processed. This reduces the cost of an update from O(N) for a full recomputation to O(k), where k is the number of affected vertices, dramatically speeding up updates in a live system.
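The superstep loop above can be sketched in plain Python. GraphLite's vertex/message-passing API is abstracted into dictionaries here; this is a minimal single-machine sketch of the same iteration, not the paper's distributed implementation, and it ignores dangling-node handling for brevity.

```python
# Sketch of synchronous PageRank: PR(v) = (1-d)/N + d * sum(contributions),
# iterated in supersteps until the largest rank change falls below eps.
# Dangling vertices (no out-links) simply leak rank in this simplification.

def pagerank(out_links, d=0.85, eps=1e-6, max_supersteps=100):
    """out_links: dict mapping each vertex to a list of its out-neighbors."""
    vertices = set(out_links)
    for nbrs in out_links.values():
        vertices.update(nbrs)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}     # uniform initial rank

    for _ in range(max_supersteps):
        # One superstep: every vertex sends rank/outdegree to out-neighbors.
        incoming = {v: 0.0 for v in vertices}
        for v, nbrs in out_links.items():
            if nbrs:
                share = rank[v] / len(nbrs)
                for u in nbrs:
                    incoming[u] += share
        new_rank = {v: (1 - d) / n + d * incoming[v] for v in vertices}
        # Converged when the maximum absolute change is below epsilon.
        if max(abs(new_rank[v] - rank[v]) for v in vertices) < eps:
            return new_rank
        rank = new_rank
    return rank
```

The incremental variant the authors describe would run this same loop only over the newly crawled vertices and their immediate neighbors, seeding unchanged vertices with their previous ranks.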
Experimental evaluation is performed on a modest 10‑node cluster. For crawling, the distributed system processes roughly 1 × 10⁸ URLs, achieving a 12‑fold increase in throughput compared with a single‑machine baseline, while keeping disk I/O and network traffic within acceptable limits. In PageRank tests, a full graph of 50 million vertices and 200 million edges converges in about 45 minutes using GraphLite, whereas the incremental approach converges in under 3 minutes for a batch of 100 000 newly added pages. Fault‑tolerance is demonstrated through Hadoop’s automatic task retries and GraphLite’s checkpoint‑based recovery, both of which function correctly under simulated node failures.
The authors conclude that the combination of Hadoop for scalable crawling and GraphLite for fast, synchronous PageRank computation provides a cost‑effective, extensible foundation for large‑scale search engines. They note several limitations: the batch nature of MapReduce may not satisfy ultra‑low‑latency requirements, and the synchronous BSP model can suffer from “straggler” effects when workload distribution is uneven. Future work is suggested in three areas: integrating streaming frameworks (e.g., Spark Streaming or Flink) to reduce crawl latency, exploring asynchronous graph engines to mitigate stragglers, and enhancing the system with policy‑driven crawling, spam detection, and dynamic cloud‑based auto‑scaling. Overall, the paper contributes a practical blueprint for building distributed search‑engine components using readily available open‑source technologies.