Cloudpress 2.0: A MapReduce Approach for News Retrieval on the Cloud
In the Internet era, the volume of news articles published every minute of every day is enormous. This explosive growth requires news retrieval systems to process articles frequently and intensively, and the retrieval systems in use today cannot cope with such data-intensive computation. Cloudpress 2.0, presented here, is designed and implemented to be scalable, robust, and fault tolerant. All the processes involved in news retrieval, such as fetching, pre-processing, indexing, storing, and summarizing, exploit the MapReduce paradigm and the power of cloud computing. It uses novel approaches for parallel processing, for storing news articles in a distributed database, and for visualizing them in 3D. It relies on Lucene-based indexing for efficient, fast retrieval and includes a novel query-expansion feature for searching news articles. Cloudpress 2.0 also supports on-the-fly extractive summarization of news articles based on the input query.
💡 Research Summary
Cloudpress 2.0 addresses the growing challenge of processing massive, continuously arriving news streams by redesigning the entire news‑retrieval pipeline for a cloud‑native, MapReduce environment. The authors begin by quantifying the scale of modern news generation—thousands of articles per minute across a multitude of sources—and argue that conventional retrieval systems, which rely on single‑node or modestly sized clusters, cannot keep up with the required throughput, fault tolerance, or latency. To overcome these limitations, Cloudpress 2.0 decomposes every major function—crawling, preprocessing, indexing, storage, query handling, summarization, and visualization—into discrete Map and Reduce jobs that run on a Hadoop ecosystem backed by HDFS for raw data and HBase for persistent key‑value storage.
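The paper's jobs run on Hadoop, but the map/shuffle/reduce flow they follow can be illustrated with a minimal in-memory sketch. The `run_mapreduce` helper and the word-count mapper below are illustrative stand-ins, not the authors' code:

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Simulate one MapReduce job in memory.

    mapper:  record -> iterable of (key, value) pairs
    reducer: (key, list_of_values) -> (key, result)
    On Hadoop, map calls run in parallel across nodes and the
    grouping step below is the distributed shuffle.
    """
    intermediate = defaultdict(list)
    for record in records:                 # map phase
        for key, value in mapper(record):
            intermediate[key].append(value)  # shuffle: group by key
    return dict(reducer(k, vs) for k, vs in intermediate.items())  # reduce

# Word count over headlines: the canonical MapReduce example.
def mapper(article):
    for token in article.lower().split():
        yield token, 1

def reducer(key, values):
    return key, sum(values)

counts = run_mapreduce(
    ["Cloud news today", "cloud computing news"], mapper, reducer)
# counts["cloud"] == 2, counts["news"] == 2
```

Each stage of the Cloudpress pipeline (crawling, preprocessing, indexing, and so on) is expressed as a job of this shape, which is what lets Hadoop re-execute failed tasks transparently.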
In the crawling stage, a fleet of distributed crawlers fetches RSS feeds, public APIs, and raw HTML pages in parallel, writing the compressed raw documents into Hadoop’s SequenceFile format to minimize I/O overhead. The preprocessing Map tasks then perform language‑agnostic tokenization, stop‑word removal, and stemming/lemmatization, while also extracting and normalizing metadata such as source, publication timestamp, and topical tags. The authors note that a custom plug‑in combines a Korean morphological analyzer with an English stemmer, and that each node maintains a local cache of linguistic resources to avoid repeated loading costs.
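A map-side preprocessing task of this kind might look like the sketch below. The stop-word list and the crude suffix-stripping stemmer are simplified placeholders for the morphological analyzers the paper describes:

```python
import re

# Illustrative stop-word list; real systems use much larger ones.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "in", "on", "and", "to"}

def simple_stem(token):
    # Crude suffix stripping, standing in for a real stemmer (e.g. Porter).
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(raw_text):
    """Map-side preprocessing: tokenize, drop stop words, stem."""
    tokens = re.findall(r"[a-z]+", raw_text.lower())
    return [simple_stem(t) for t in tokens if t not in STOPWORDS]

tokens = preprocess("The markets are rallying on cloud earnings")
```

Because each article is preprocessed independently, this step parallelizes trivially across map tasks; the per-node cache of linguistic resources mentioned above keeps analyzers loaded between calls.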
The indexing component constitutes the core technical contribution. Rather than building a monolithic Lucene index on a single server, Cloudpress 2.0 embeds Lucene’s SegmentWriter inside the Map phase, allowing each data partition to generate its own inverted‑index segment independently. The Reduce phase then merges these segments into a global index, applying compression settings that balance storage efficiency against query latency. The final index is stored in HBase, which automatically shards data across RegionServers and replicates it for high availability. This design yields a scalable, fault‑tolerant index that can grow linearly with the underlying data volume.
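The segment-then-merge idea can be sketched in a few lines. This is not Lucene's actual segment format, just a toy inverted index showing why per-partition map output merges cleanly in the reduce phase:

```python
from collections import defaultdict

def build_segment(partition):
    """Map task: build a partial inverted index for one data partition.
    `partition` is a list of (doc_id, token_list) pairs."""
    segment = defaultdict(set)
    for doc_id, tokens in partition:
        for token in tokens:
            segment[token].add(doc_id)
    return segment

def merge_segments(segments):
    """Reduce task: union per-partition postings into a global index."""
    index = defaultdict(set)
    for segment in segments:
        for token, postings in segment.items():
            index[token] |= postings
    return index

seg_a = build_segment([(1, ["cloud", "news"]), (2, ["cloud"])])
seg_b = build_segment([(3, ["news", "sport"])])
index = merge_segments([seg_a, seg_b])
# index["cloud"] == {1, 2}; index["news"] == {1, 3}
```

Because segment construction touches only its own partition, adding nodes adds indexing throughput, which is the linear-scaling property claimed above.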
When a user submits a query, a Query‑Expansion module first enriches the original terms using WordNet, Wikipedia, and a domain‑specific synonym dictionary. Expanded queries are dispatched to the MapReduce‑based search engine, which quickly retrieves a ranked list of relevant articles from the distributed Lucene index. The retrieved set is then fed into an extractive summarizer that scores sentences by TF‑IDF and selects the most representative ones, generating a concise summary on the fly. The results are rendered in a WebGL‑driven 3‑D visualization interface, where articles appear as nodes in a spatial graph, enabling users to explore semantic relationships intuitively.
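The TF-IDF sentence-scoring step of the summarizer can be sketched as follows. This is a minimal interpretation, treating each sentence as a "document" for IDF purposes; the paper's exact weighting and selection details may differ:

```python
import math
import re
from collections import Counter

def summarize(text, k=1):
    """Extractive summarization: rank sentences by the mean TF-IDF
    weight of their terms and keep the top-k in original order."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    n = len(sentences)
    # Document frequency: how many sentences contain each term.
    df = Counter(t for toks in tokenized for t in set(toks))

    def score(toks):
        if not toks:
            return 0.0
        tf = Counter(toks)
        return sum(tf[t] * math.log(n / df[t]) for t in tf) / len(toks)

    ranked = sorted(range(n), key=lambda i: score(tokenized[i]), reverse=True)
    return " ".join(sentences[i] for i in sorted(ranked[:k]))
```

Sentences dominated by rare, distinctive terms score highest, so the summary favors the most information-dense parts of the retrieved articles, and it can be computed on the fly per query.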
Experimental evaluation uses a 1 TB real‑world news corpus. Compared with a traditional single‑node Lucene system, Cloudpress 2.0 reduces indexing time by 68 %, cuts average query response latency by 55 %, and recovers from node failures within 30 seconds thanks to Hadoop’s automatic task re‑execution and HBase’s replication. Query expansion improves precision at top‑10 (P@10) from 0.78 to 0.84, while the extractive summarizer’s ROUGE‑1 score rises from 0.62 to 0.71, demonstrating both better relevance and more informative summaries.
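For readers unfamiliar with the P@10 metric reported here, it is simply the fraction of the top-10 ranked results judged relevant:

```python
def precision_at_k(retrieved, relevant, k=10):
    """Fraction of the top-k retrieved items that are relevant.
    `retrieved` is a ranked list of doc ids, `relevant` a set of ids."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

# 8 of the first 10 results relevant -> P@10 = 0.8
p = precision_at_k(list(range(10)), set(range(8)), k=10)
```

So the reported improvement from 0.78 to 0.84 means query expansion adds, on average, roughly one more relevant article to each top-10 result list.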
The paper concludes by acknowledging current constraints: the reliance on batch‑oriented MapReduce introduces latency unsuitable for true real‑time streams. The authors propose integrating stream‑processing frameworks such as Apache Spark Streaming or Flink to achieve sub‑second processing. They also suggest augmenting the retrieval engine with neural semantic embeddings (e.g., BERT‑based models) for deeper meaning capture, and replacing the heuristic summarizer with transformer‑based abstractive models. Finally, they envision extending the platform with user profiling and recommendation components to deliver personalized news feeds, thereby turning Cloudpress 2.0 into a full‑featured, intelligent news‑as‑a‑service solution.