Lossless compression of simulated radio interferometric visibilities
Context. Processing radio interferometric data often requires storing forward-predicted model data. In direction-dependent calibration, these data may have a volume an order of magnitude larger than the original data. Existing lossy compression techniques work well for observed, noisy data, but cause issues in calibration when applied to forward-predicted model data.
Aims. To reduce the volume of forward-predicted model data, we present a lossless compression method for noiseless data, called Simulated Signal Compression (Sisco), that integrates seamlessly with existing workflows. We show that Sisco can be combined with baseline-dependent averaging for further size reduction.
Methods. Sisco decomposes complex floating-point visibility values, uses polynomial extrapolation in time and frequency to predict values, groups bytes for efficient encoding, and compresses the residuals using the Deflate algorithm. We evaluate Sisco with various extrapolation functions on diverse LOFAR, MeerKAT, and MWA datasets. Implemented as an open-source Casacore storage manager, it can be used directly by any observatory that makes use of this format.
Results. We find that a combination of linear and quadratic prediction yields optimal compression, reducing noiseless forward-predicted model data to 24% of its original volume on average. Compression varies by dataset, ranging from 13% for smooth data to 38% for less predictable data. Pure noise data are reduced only to 84% of their original size, owing to the unpredictability of such data. With the current implementation, the achieved throughput of 534 MB/s is mostly dominated by I/O on our testing platform, although the processor remains busy during compression and decompression. Finally, we discuss a possible extension to a lossy algorithm.
💡 Research Summary
The paper addresses a growing bottleneck in modern radio interferometry: the storage of forward‑predicted model visibilities, which are required in calibration and deconvolution pipelines and can be an order of magnitude larger than the observed data when direction‑dependent calibration is performed. Unlike observed visibilities, model data are noiseless and exhibit smooth variations in both time and frequency. Existing lossy compression schemes (e.g., Dysco, FITS‑Rice, MGARD) are tuned for Gaussian‑distributed noisy data and introduce unacceptable artefacts when applied to model visibilities, making them unsuitable for calibration‑critical workflows.
To solve this problem the authors introduce Simulated Signal Compression (Sisco), a lossless compression framework specifically designed for floating‑point complex visibility data. Sisco operates in four stages. First, it predicts each complex visibility value using two‑dimensional polynomial extrapolation across time and frequency. The implementation supports 0th‑ to 3rd‑order polynomials, as well as hybrid schemes such as linear prediction in time combined with quadratic prediction in frequency. The authors find that a combination of linear and quadratic extrapolation yields the best trade‑off between prediction accuracy and computational overhead.
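The prediction step can be illustrated with a short sketch (not the paper's implementation): an order-p polynomial extrapolation assumes the (p+1)-th finite difference of the series vanishes, which yields the familiar predictors "repeat last value" (order 0), 2x₁ − x₂ (linear), and 3x₁ − 3x₂ + x₃ (quadratic) over the most recent samples. The function name below is illustrative:

```python
import math
import numpy as np

def predict_next(samples, order):
    """Polynomial extrapolation: predict the next value from the last
    order+1 samples, assuming the (order+1)-th finite difference is zero.
    order 0 repeats the last value; 1 is linear; 2 is quadratic."""
    coeffs = [(-1) ** (k + 1) * math.comb(order + 1, k)
              for k in range(1, order + 2)]
    recent = samples[-1:-(order + 2):-1]  # last order+1 samples, newest first
    return sum(c * s for c, s in zip(coeffs, recent))

# A quadratic series is predicted exactly by order-2 extrapolation:
t = np.arange(6, dtype=np.float64)
x = 0.5 * t**2 + 2.0 * t + 1.0
print(predict_next(x[:5], order=2), x[5])  # both 23.5
```

An order-p predictor is exact for polynomials up to degree p; for smooth model visibilities the residuals therefore shrink rapidly with order, at the cost of amplifying any noise that is present.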
Second, the predicted value is subtracted from the original value to form a residual. This subtraction is performed on the mantissa after aligning the exponents of the predicted and original numbers, using integer arithmetic to guarantee exact reversibility. The authors emphasize that floating‑point subtraction alone would not be invertible for extreme cases where the predicted value dominates the original, potentially truncating the mantissa. By shifting the exponent of the predicted value toward that of the data, the residual remains representable and lossless.
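The paper's exact exponent-alignment scheme is not reproduced here, but the reversibility requirement it satisfies can be demonstrated with a minimal stand-in: taking the residual as a modular difference of raw bit patterns in integer arithmetic, which round-trips exactly no matter how far off the prediction is (the function names are illustrative, not Sisco's API):

```python
import struct

def f32_bits(x: float) -> int:
    """Raw 32-bit pattern of a float32 (sign, exponent, mantissa)."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b: int) -> float:
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]

def residual(value: float, predicted: float) -> int:
    """Integer difference of bit patterns, modulo 2**32: small when the
    prediction is close, but exactly invertible in every case, unlike a
    floating-point subtraction, which may round the result."""
    return (f32_bits(value) - f32_bits(predicted)) & 0xFFFFFFFF

def reconstruct(res: int, predicted: float) -> float:
    return bits_f32(res + f32_bits(predicted))

# Round the inputs to float32 first, then verify the exact round trip:
v = bits_f32(f32_bits(1.2345678e-3))
p = bits_f32(f32_bits(1.2345671e-3))
assert reconstruct(residual(v, p), p) == v       # exact round trip
assert reconstruct(residual(v, 1e30), 1e30) == v  # even for a wild prediction
```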
Third, the residual stream—comprising sign, mantissa, and exponent bytes—is reordered into five separate groups: one for exponent bytes and four for the individual mantissa bytes (the first mantissa byte also carries the sign bit). This grouping clusters statistically similar bytes together, which dramatically improves the effectiveness of the subsequent entropy coder.
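The effect of the regrouping can be sketched generically with four byte planes of a 32-bit residual (Sisco itself uses five groups, with the sign folded into the first mantissa byte):

```python
import numpy as np

def group_bytes(residuals: np.ndarray) -> list[bytes]:
    """Transpose a stream of 32-bit residuals into four byte planes, so
    that statistically similar bytes (e.g. high-order bytes that are
    mostly zero for small residuals) sit together for the entropy coder."""
    raw = residuals.astype("<u4").view(np.uint8).reshape(-1, 4)
    return [raw[:, i].tobytes() for i in range(4)]

def ungroup_bytes(planes: list[bytes]) -> np.ndarray:
    """Inverse transform: interleave the planes back into 32-bit words."""
    n = len(planes[0])
    raw = np.empty((n, 4), dtype=np.uint8)
    for i, plane in enumerate(planes):
        raw[:, i] = np.frombuffer(plane, dtype=np.uint8)
    return raw.reshape(-1).view("<u4").copy()

res = np.array([3, 1, 7, 2, 0, 5], dtype=np.uint32)  # small residuals
planes = group_bytes(res)
assert planes[1] == planes[2] == planes[3] == b"\x00" * 6  # upper planes trivial
assert (ungroup_bytes(planes) == res).all()
```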
Fourth, each group is compressed independently using the Deflate algorithm via the libdeflate library. Deflate combines LZ77‑style dictionary encoding with Huffman coding, offering a good balance between speed and compression ratio. The library provides tunable compression levels (1–12); the authors report that level 6 delivers the best overall performance on their test platform, achieving an average throughput of 534 MB s⁻¹, limited primarily by I/O rather than CPU.
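The benefit of compressing the groups separately can be seen with Python's standard-library zlib, which implements the same Deflate format as libdeflate, on synthetic residuals that each fit in a single byte (as a good predictor would produce):

```python
import zlib
import numpy as np

# Synthetic small residuals: only the low byte of each 32-bit word is used.
rng = np.random.default_rng(0)
residuals = rng.integers(0, 256, size=4096, dtype=np.uint32)
raw = residuals.astype("<u4").tobytes()

# Deflate on the interleaved stream vs. one stream per byte plane:
interleaved = len(zlib.compress(raw, level=6))
planes = residuals.astype("<u4").view(np.uint8).reshape(-1, 4)
grouped = sum(len(zlib.compress(planes[:, i].tobytes(), level=6))
              for i in range(4))
print(interleaved, grouped)  # the grouped streams are smaller in total
```

In the grouped case the three high-order planes are all zeros and collapse to a few bytes each, while in the interleaved stream Deflate must spend extra symbols encoding the recurring zero runs between the significant bytes.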
Sisco is delivered as a Casacore storage manager plugin, meaning it integrates seamlessly with the Measurement Set (MS) format used throughout the radio‑astronomy software ecosystem. No changes to existing pipelines are required: DP3 and WSClean can write Sisco‑compressed data simply by setting a configuration flag, and any Casacore‑based reader will automatically decompress the data on the fly. The memory footprint per baseline is minimal, enabling on‑the‑fly compression even for large arrays.
The authors evaluate Sisco on a diverse set of simulated datasets from LOFAR, MeerKAT, and the Murchison Widefield Array (MWA). Compression performance varies with data smoothness. For “smooth” model data (slowly varying in time/frequency) the method reduces the data volume to as little as 13 % of the original size; for less predictable data the reduction is around 38 %. Pure noise data, which lack any predictable structure, compress only to 84 % of the original size, confirming that the algorithm exploits the intrinsic smoothness of model visibilities. When combined with baseline‑dependent averaging (BDA), an additional reduction of roughly 30 % is achieved, demonstrating that Sisco can be part of a multi‑stage data‑reduction strategy.
The paper also discusses future extensions. The authors outline a pathway toward a controlled‑loss version of Sisco, where a user‑specified error bound would allow further size reductions at the cost of negligible scientific impact. They suggest incorporating weighted extrapolation (using visibility weights) and multi‑grid hierarchical prediction to improve accuracy for data with more complex structures. GPU acceleration of the prediction and residual‑formation steps is identified as a promising avenue to increase throughput for the massive data rates expected from the Square Kilometre Array (SKA).
In summary, Sisco provides a practical, open‑source, lossless compression solution tailored to the unique characteristics of simulated interferometric visibilities. By leveraging lightweight polynomial prediction, careful mantissa‑exponent handling, strategic byte grouping, and a mature Deflate implementation, it reduces data to ~24 % of its original size on average (i.e., a 76 % size reduction) while maintaining full scientific fidelity and requiring only modest computational resources. This makes it a valuable tool for current and next‑generation radio observatories facing ever‑growing data volumes.