Vectorization of Large Amounts of Raster Satellite Images in a Distributed Architecture Using HIPI

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

The vectorization process focuses on grouping the pixels of a raster image into raw line segments and forming lines, polylines, or polygons. To vectorize massive raster images while addressing resource and performance problems, we use HIPI, a distributed image processing interface based on the MapReduce approach. Apache Hadoop is placed at the core of the framework. To realize such a system, we first define the mapper function and then its input and output formats. In this paper, mappers convert raster mosaics into their vector counterparts. Reduce functions are not needed for vectorization. Vector representations of raster images are expected to give better performance in distributed computations by reducing the negative effects of the bandwidth problem, and a horizontal scalability analysis is carried out.


💡 Research Summary

The paper presents a distributed architecture for vectorizing large collections of raster satellite images by leveraging the Hadoop MapReduce framework together with the Hadoop Image Processing Interface (HIPI). The authors argue that traditional raster processing of massive remote‑sensing datasets suffers from prohibitive memory, storage, and network bandwidth requirements, and that converting raster data into vector representations can alleviate these bottlenecks.

The system design places Apache Hadoop at its core. Raster images are first packaged into a HIPI Image Bundle (HIB), which stores many image files as a single logical unit on HDFS. A custom InputFormat reads the HIB and feeds each image (or image tile) to a mapper. No reducer is employed because each image can be processed independently; the mapper performs the complete vectorization pipeline.
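The map-only design described above can be illustrated with a minimal sketch. It is not Hadoop or HIPI code: `run_map_only` and `count_foreground` are hypothetical names, a thread pool stands in for the cluster, and a foreground-pixel count stands in for the vectorization mapper. The point is only that each record is processed independently, with no shuffle or reduce step. In an actual Hadoop job, a map-only run is typically configured by setting the number of reduce tasks to zero.

```python
from concurrent.futures import ThreadPoolExecutor


def run_map_only(records, mapper, workers=4):
    """Apply mapper to every (key, value) record independently.

    There is no shuffle and no reduce step: the output of each mapper
    call is written straight to the result list, in input order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda kv: mapper(*kv), records))


def count_foreground(key, tile):
    # Hypothetical stand-in for the vectorization mapper: it only counts
    # non-zero pixels so that the example stays small.
    return key, sum(1 for row in tile for v in row if v)


# Two tiny "tiles" keyed by name (assumed toy data, not from the paper).
records = [
    ("tile_a", [[0, 1], [1, 1]]),
    ("tile_b", [[0, 0], [0, 0]]),
]
results = run_map_only(records, count_foreground)
print(results)
```

Because no mapper depends on another mapper's output, adding workers changes only how the records are scheduled, not the result.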

Inside the mapper, the workflow is as follows:

  • the incoming FloatImage is converted to an OpenCV matrix;
  • the matrix is transformed to grayscale;
  • an automatic Otsu threshold (graythresh) creates a binary image;
  • small objects below a 300‑pixel area are removed;
  • a 3×3 structuring element is used for morphological opening and closing to suppress noise;
  • holes are filled with imfill;
  • a larger area filter (10 000 px) discards residual noise, leaving only significant objects;
  • the external contours of these objects are extracted, and the contour points are stored as polylines or polygons.

The resulting vector data is serialized back into a FloatImage, encoded with JPEG, and written to HDFS.
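A few of the steps above (Otsu thresholding, area filtering, and outline extraction) can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the authors' OpenCV-based mapper: it skips the morphological opening/closing and hole filling, uses 4-connectivity, and returns an unordered set of boundary pixels rather than a traced polygon. All function names are hypothetical.

```python
from collections import deque


def otsu_threshold(gray):
    """Return the 8-bit level that maximizes between-class variance."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def neighbours(y, x):
    return ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))


def components(binary):
    """Yield each 4-connected foreground region as a set of (row, col)."""
    h, w = len(binary), len(binary[0])
    seen = set()
    for r in range(h):
        for c in range(w):
            if not binary[r][c] or (r, c) in seen:
                continue
            region, queue = set(), deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                region.add((y, x))
                for ny, nx in neighbours(y, x):
                    inside = 0 <= ny < h and 0 <= nx < w
                    if inside and binary[ny][nx] and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            yield region


def boundary(region):
    """Return the region pixels that touch a non-region neighbour, sorted."""
    return sorted(p for p in region
                  if any(n not in region for n in neighbours(*p)))


def vectorize(gray, min_area):
    """Threshold, drop regions smaller than min_area, return their outlines."""
    t = otsu_threshold(gray)
    binary = [[v > t for v in row] for row in gray]
    return [boundary(reg) for reg in components(binary)
            if len(reg) >= min_area]


# Assumed toy image: dark background, one 3x3 bright object, one noise pixel.
gray = [[10] * 7 for _ in range(7)]
for y in range(1, 4):
    for x in range(1, 4):
        gray[y][x] = 200
gray[5][5] = 200

outlines = vectorize(gray, min_area=4)
print(len(outlines), outlines[0])
```

In the toy image the isolated bright pixel is discarded by the area filter, and the outline of the 3×3 object contains every pixel except its interior centre. A real implementation would use the area thresholds quoted above and an ordered contour-tracing routine.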

The authors evaluated the approach on Landsat‑8 mosaics of size 7 000 × 7 000 pixels. Experiments were conducted on a two‑node and a four‑node Hadoop cluster (Intel i7‑3610QM, 8 GB RAM, Ubuntu 3.13, Hadoop 2.6.0, Java 1.7). For comparison, the same algorithm was executed in MATLAB on a single workstation. Two test scenarios were considered: processing 3 images (N = 3) and 20 images (N = 20). Execution times (in seconds) were:

  • N = 3: MATLAB = 67, 2‑node cluster = 56, 4‑node cluster = 54
  • N = 20: MATLAB = 582, 2‑node cluster = 315, 4‑node cluster = 261

These results indicate horizontal scalability, most visibly for the larger workload. At N = 20, the two‑node cluster is about 1.8 times faster than the single‑machine MATLAB baseline (582 s vs. 315 s) and the four‑node cluster about 2.2 times faster (582 s vs. 261 s); doubling the cluster from two to four nodes cuts processing time by roughly 17%. At N = 3 the gains are small (67 s vs. 56 s and 54 s), and the extra two nodes save only about 2 s. Moreover, the vectorized outputs are substantially smaller than the original raster files, which reduces network traffic and storage consumption during subsequent analysis.
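The speed-up figures implied by the reported timings can be recomputed with a short script. The times are copied from the list above; the dictionary layout and the helper names are assumptions made for illustration.

```python
# Execution times in seconds, copied from the reported results.
times = {
    3: {"matlab": 67, "2-node": 56, "4-node": 54},
    20: {"matlab": 582, "2-node": 315, "4-node": 261},
}


def speedup(baseline_s, other_s):
    """Return how many times faster the second run is than the baseline."""
    return baseline_s / other_s


def summarize(times):
    """Compute rounded speed-up ratios for each workload size N."""
    rows = {}
    for n, t in times.items():
        rows[n] = {
            "2-node vs MATLAB": round(speedup(t["matlab"], t["2-node"]), 2),
            "4-node vs MATLAB": round(speedup(t["matlab"], t["4-node"]), 2),
            "4-node vs 2-node": round(speedup(t["2-node"], t["4-node"]), 2),
        }
    return rows


summary = summarize(times)
for n, row in summary.items():
    print(f"N = {n}: {row}")
```

For N = 20 this gives ratios of about 1.85, 2.23, and 1.21 respectively; for N = 3 all three ratios stay below 1.25.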

The paper situates its contribution within a body of related work that either uses Hadoop for remote‑sensing image preprocessing (often converting images to text or binary formats before analysis) or extends HIPI to support additional image formats. The novelty here lies in performing full raster‑to‑vector conversion within the mapper, eliminating the need for a reducer, and demonstrating practical performance gains on real satellite data.

Limitations are acknowledged: the current implementation handles only single‑band grayscale Landsat‑8 data; extending to multispectral or higher‑resolution sensors would require additional preprocessing steps. The reliance on OpenCV via JNI introduces some CPU‑bound overhead, suggesting that GPU acceleration could further improve throughput. Finally, because reducers are omitted, global post‑processing tasks such as merging adjacent polygon boundaries or enforcing topological consistency are not addressed. Future work is proposed to incorporate such global operations, support multi‑band imagery, explore GPU‑based kernels, and integrate dynamic resource scaling in cloud environments.

In conclusion, the study validates that HIPI‑enabled Hadoop clusters can efficiently vectorize massive raster satellite collections, offering a viable pathway for scalable remote‑sensing analytics, object extraction, and web‑GIS services. The research was funded by the Turkish Scientific and Technological Research Council (TÜBİTAK) under project EEEAG 2‑15E189.

