Data Smashing
Investigation of the underlying physics or biology from empirical data requires a quantifiable notion of similarity - when do two observed data sets indicate nearly identical generating processes, and when do they not? The discriminating characteristics to look for in data are often determined by heuristics designed by experts, e.g., distinct shapes of “folded” lightcurves may be used as “features” to classify variable stars, while determination of pathological brain states might require a Fourier analysis of brainwave activity. Finding good features is non-trivial. Here, we propose a universal solution to this problem: we delineate a principle for quantifying similarity between sources of arbitrary data streams, without a priori knowledge, features or training. We uncover an algebraic structure on a space of symbolic models for quantized data, and show that such stochastic generators may be added and uniquely inverted; and that a model and its inverse always sum to the generator of flat white noise. Therefore, every data stream has an anti-stream: data generated by the inverse model. Similarity between two streams, then, is the degree to which one, when summed to the other’s anti-stream, mutually annihilates all statistical structure to noise. We call this data smashing. We present diverse applications, including disambiguation of brainwaves pertaining to epileptic seizures, detection of anomalous cardiac rhythms, and classification of astronomical objects from raw photometry. In our examples, the data smashing principle, without access to any domain knowledge, meets or exceeds the performance of specialized algorithms tuned by domain experts.
💡 Research Summary
The paper introduces “Data Smashing,” a universal, feature‑free method for quantifying similarity between arbitrary data streams. The approach begins by quantizing continuous‑valued signals into symbolic sequences over a finite alphabet. Each symbolic stream is assumed to be generated by a hidden probabilistic finite‑state automaton (PFSA) that is stationary, ergodic, and has a finite number of states. The authors show that the space of such PFSA models possesses an Abelian group structure: for any model G there exists a unique inverse model G⁻¹ such that G + G⁻¹ yields the identity element W, which is the single‑state PFSA that produces flat white noise (FWN) – a uniform i.i.d. symbol stream.
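As a concrete illustration of the quantization step, the sketch below (Python; the quantile‑based binning and the choice of three symbols are illustrative assumptions, not the paper’s prescription) maps a continuous signal onto a finite alphabet:

```python
import random

def quantize(signal, n_symbols=3):
    """Map a continuous-valued signal to symbols 0..n_symbols-1.

    Bin edges are empirical quantiles, so each symbol occurs with
    roughly equal frequency -- one simple choice; the paper leaves
    the quantization scheme to the practitioner.
    """
    s = sorted(signal)
    edges = [s[(i * len(s)) // n_symbols] for i in range(1, n_symbols)]
    return [sum(x >= e for e in edges) for x in signal]

random.seed(0)
signal = [random.gauss(0.0, 1.0) for _ in range(1000)]
stream = quantize(signal)  # symbolic stream over {0, 1, 2}
```

Everything downstream (inversion, summation, annihilation) then operates on such symbol streams rather than on the raw signal.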
Data smashing exploits this algebraic structure directly on observed sequences, without constructing explicit PFSA models. Four elementary stream operations are defined: (1) Independent Stream Copy – generates an independent replica of a stream by sampling from FWN and retaining the original’s symbols where the two agree; (2) Stream Inversion – creates an anti‑stream by reading independent copies of the original in lockstep and emitting, whenever the symbols read are all distinct, the single symbol that is missing, thereby inverting the statistics so that rare substrings become common and common ones rare; (3) Stream Summation – reads two streams in lockstep and outputs a symbol at positions where their current symbols coincide; (4) Deviation from FWN – computes a scalar ζ(s, L) that measures how far a symbolic sequence s deviates from uniform randomness, by summing, over all substrings up to length L, the L1 distance between the empirical next‑symbol distribution and the uniform distribution, weighted inversely by substring length. The maximum substring length L grows only logarithmically with the stream length, so that the empirical substring counts retain sufficient statistical confidence.
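Two of these operations can be sketched directly. The code below (Python; the weighting and the `max_len` cutoff are simplified choices for illustration, not the paper’s exact estimator) implements stream summation and a deviation‑from‑FWN score in the spirit of ζ:

```python
import random
from collections import Counter

def stream_sum(s, t):
    # Stream summation: read both streams in lockstep and keep a
    # symbol only at positions where the two streams agree.
    return [a for a, b in zip(s, t) if a == b]

def deviation_from_fwn(s, alphabet, max_len=3):
    # Simplified deviation from flat white noise: for each context
    # length l < max_len, compare the empirical next-symbol
    # distribution after every observed context with the uniform
    # distribution (L1 distance), weighting shorter contexts more.
    total = 0.0
    k = len(alphabet)
    for l in range(max_len):
        ctx, ctx_next = Counter(), Counter()
        for i in range(len(s) - l):
            c = tuple(s[i:i + l])
            ctx[c] += 1
            ctx_next[(c, s[i + l])] += 1
        for c, n in ctx.items():
            l1 = sum(abs(ctx_next[(c, a)] / n - 1 / k) for a in alphabet)
            total += l1 / (l + 1)  # inverse-length weighting (one choice)
    return total

random.seed(0)
noise = [random.randrange(2) for _ in range(5000)]   # approximates FWN
periodic = [0, 1] * 2500                             # strong structure
# deviation_from_fwn scores the periodic stream far above the noise
```

A stream close to flat white noise scores near zero, while any predictable structure (here, strict alternation) pushes the score up; this residual score is what data smashing reads off after annihilation.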
The distance between two hidden generators G and H is defined as d(G, H) = ζ(G + H⁻¹), the deviation from FWN of the sum of G and the inverse of H. Because the group operation and inverse are well defined, d satisfies the axioms of a metric (non‑negativity, symmetry, triangle inequality). Crucially, the authors prove that d can be estimated solely from the observed streams s and t, by computing ζ on the sum of t with the anti‑stream of s, without ever reconstructing G or H.
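The annihilation behind this metric is easiest to see for memoryless (single‑state) generators, where the group inverse simply inverts the symbol probabilities. The sketch below (Python; deliberately restricted to the i.i.d. case, so it sidesteps the general PFSA construction and samples the anti‑stream from the known inverse distribution) sums a biased stream with its anti‑stream and recovers near‑uniform output:

```python
import random
from collections import Counter

def inverse_dist(p):
    # Group inverse of a single-state generator: probabilities
    # proportional to 1/p, so that model + inverse = uniform (FWN).
    w = [1.0 / q for q in p]
    z = sum(w)
    return [x / z for x in w]

def stream_sum(s, t):
    # Stream summation: keep a symbol where the two streams agree.
    return [a for a, b in zip(s, t) if a == b]

rng = random.Random(1)
alphabet = [0, 1, 2]
p = [0.6, 0.3, 0.1]                                   # biased hidden generator
n = 200_000
s = rng.choices(alphabet, weights=p, k=n)                    # stream
anti = rng.choices(alphabet, weights=inverse_dist(p), k=n)   # anti-stream

summed = stream_sum(s, anti)
counts = Counter(summed)
freqs = [counts[a] / len(summed) for a in alphabet]
# freqs lands close to the uniform [1/3, 1/3, 1/3]: the bias is annihilated
```

In the full method the anti‑stream is built from the observed sequence itself, via the copy and inversion operations, so the distance is estimated without ever fitting p explicitly.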
Empirical validation spans three domains: (i) EEG recordings for epileptic seizure detection (495 excerpts, 23.6 s each, sampled at 173.61 Hz). After a simple three‑symbol quantization of signal derivatives, data smashing achieved 98.9 % classification accuracy, matching or exceeding specialist methods. (ii) Cardiac rhythm analysis, where abnormal beats were identified with high sensitivity and specificity using only raw inter‑beat intervals. (iii) Astronomical photometry, where raw light curves of variable stars were clustered correctly without any period‑finding or shape‑based features. In all cases, the method required no domain‑specific preprocessing, feature engineering, or labeled training data, yet performed on par with or better than state‑of‑the‑art, task‑specific pipelines.
The paper also discusses limitations. Deterministic or near‑deterministic processes (e.g., perfectly periodic signals) do not admit a finite‑state stochastic model and thus fall outside the method’s scope. The technique assumes statistical independence between streams; strong cross‑stream correlations can bias the annihilation result. Computational cost grows with the square of alphabet size and linearly with stream length, so very high‑resolution quantizations or ultra‑high‑speed streams may require dimensionality reduction or sampling.
In summary, Data Smashing provides a mathematically grounded, model‑agnostic similarity metric based on the concept of statistical annihilation. By constructing anti‑streams and measuring residual randomness, it enables clustering, classification, and anomaly detection across heterogeneous data types without any handcrafted features or supervised learning, offering a powerful tool for automated scientific discovery.