Creating a level playing field for all symbols in a discretization
In time series analysis research there is a strong interest in discrete representations of real-valued data streams. One approach that emerged over a decade ago and is still considered state of the art is the Symbolic Aggregate Approximation (SAX) algorithm. This discretization algorithm was the first symbolic approach to map a real-valued time series to a symbolic representation guaranteed to lower-bound Euclidean distance. This paper concerns the SAX assumption that the data are highly Gaussian and the use of the standard normal curve to choose the partitions that discretize the data. Generally, and certainly in its canonical form, SAX chooses partitions on the standard normal curve that give each symbol in a finite alphabet an equal probability of occurring. This procedure appears valid because a time series is normalized before the rest of the SAX algorithm is applied. However, there is a caveat to this equi-probability assumption, introduced by the intermediate step of Piecewise Aggregate Approximation (PAA). We show in this paper that applying PAA does alter the distribution of the data: the standard deviation shrinks by an amount that depends on the number of points used to create each PAA segment and on the degree of auto-correlation within the series. Data that exhibit statistically significant auto-correlation are less affected by this shrinkage. As the standard deviation contracts, the mean remains the same, but the distribution is no longer standard normal, so partitions based on the standard normal curve no longer yield equal symbol probabilities.
💡 Research Summary
The paper revisits a fundamental assumption underlying the Symbolic Aggregate Approximation (SAX) framework: that after z‑normalization the time‑series data follow a standard normal distribution, allowing the use of equal‑probability breakpoints to map Piecewise Aggregate Approximation (PAA) values to symbols. The authors demonstrate that the PAA step itself alters the statistical properties of the series. By averaging over w consecutive points, the variance of each PAA coefficient of an uncorrelated series is reduced by a factor of 1/w relative to the original variance. Consequently, while the mean remains unchanged, the distribution of PAA coefficients is no longer standard normal; its standard deviation shrinks as the segment length grows, with the contraction moderated by the degree of autocorrelation present in the series.
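This variance reduction is easy to check numerically. A minimal sketch (not from the paper; the `paa` helper and the segment length `w = 8` are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def paa(series, n_segments):
    """Piecewise Aggregate Approximation: mean of each equal-length segment."""
    return series.reshape(n_segments, -1).mean(axis=1)

n, w = 1_000_000, 8          # series length and points per PAA segment (assumed)
x = rng.standard_normal(n)   # z-normalized white noise, i.i.d. N(0, 1)

coeffs = paa(x, n // w)
print(coeffs.std())          # ≈ 1/sqrt(8) ≈ 0.354, not 1
```

For white noise the PAA coefficients have standard deviation 1/√w, which is exactly the contraction the paper describes.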
Through both theoretical derivation and extensive empirical evaluation on synthetic and real‑world datasets, the study shows that series with strong autocorrelation retain more of their original variance after PAA, whereas weakly autocorrelated or near‑white‑noise series experience a pronounced contraction of variance. This contraction invalidates the equal‑probability breakpoint scheme originally derived from the standard normal CDF. As a result, the lower‑bounding property of SAX—guaranteeing that the symbolic distance never underestimates the true Euclidean distance—becomes unreliable, and the symbolic representation loses its intended uniform symbol frequency.
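The role of autocorrelation can be illustrated with an AR(1) process (a sketch under assumed parameters; the `ar1` generator and the chosen φ values are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def paa(series, n_segments):
    """Piecewise Aggregate Approximation: mean of each equal-length segment."""
    return series.reshape(n_segments, -1).mean(axis=1)

def ar1(n, phi, rng):
    """AR(1) process x_t = phi * x_{t-1} + noise, z-normalized afterwards."""
    x = np.empty(n)
    x[0] = rng.standard_normal()
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.standard_normal()
    return (x - x.mean()) / x.std()

n, w = 200_000, 8
for phi in (0.0, 0.5, 0.9):
    s = paa(ar1(n, phi, rng), n // w)
    print(f"phi={phi:.1f}  PAA std={s.std():.3f}")
```

The printed standard deviations grow with φ: near-white noise collapses toward 1/√w, while a strongly autocorrelated series retains most of its unit variance after PAA.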
The authors propose two practical remedies. The first is a post‑PAA re‑normalization step that rescales the PAA coefficients to zero mean and unit variance before applying the standard SAX breakpoints. While this restores the variance, it may distort the intrinsic dynamics of the series. The second, more principled approach constructs breakpoints directly from the empirical distribution of the PAA coefficients. By estimating the empirical cumulative distribution function (ECDF) of the PAA values, one can compute quantiles that ensure each symbol occurs with (approximately) equal probability, regardless of the underlying distribution or autocorrelation structure.
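A minimal sketch of the ECDF-based breakpoint idea, contrasted with the canonical Gaussian breakpoints (alphabet size 4 and segment length 10 are assumed for illustration; only the quantile construction reflects the paper's proposal):

```python
import numpy as np

rng = np.random.default_rng(2)

def paa(series, n_segments):
    """Piecewise Aggregate Approximation: mean of each equal-length segment."""
    return series.reshape(n_segments, -1).mean(axis=1)

# Canonical SAX breakpoints for alphabet size 4 (quartiles of N(0, 1)).
gaussian_breakpoints = np.array([-0.6745, 0.0, 0.6745])

n, w, alphabet_size = 100_000, 10, 4
x = rng.standard_normal(n)            # z-normalized white-noise series
coeffs = paa(x, n // w)

# ECDF-based breakpoints: empirical quantiles of the PAA coefficients.
qs = np.arange(1, alphabet_size) / alphabet_size
ecdf_breakpoints = np.quantile(coeffs, qs)

def symbolize(values, breakpoints):
    """Map each value to a symbol index via its breakpoint interval."""
    return np.searchsorted(breakpoints, values)

for name, bps in (("gaussian", gaussian_breakpoints),
                  ("ecdf", ecdf_breakpoints)):
    counts = np.bincount(symbolize(coeffs, bps), minlength=alphabet_size)
    print(name, counts / counts.sum())
```

Because the PAA coefficients have standard deviation well below 1, the Gaussian breakpoints funnel almost all values into the two inner symbols, while the ECDF breakpoints recover roughly equal symbol frequencies by construction.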
Experimental results indicate that both remedies reduce the distance error introduced by variance shrinkage, but the ECDF‑based breakpoint method consistently yields better performance in classification and clustering tasks, especially for series with low autocorrelation. The paper also discusses the computational overhead of dynamic breakpoint computation and suggests efficient approximations for streaming scenarios.
In conclusion, the work highlights a previously overlooked source of bias in SAX—variance reduction during PAA—and provides concrete strategies to restore the equal‑probability assumption. By adapting the breakpoint selection to the actual distribution of PAA coefficients, practitioners can preserve the theoretical guarantees of SAX while improving empirical performance across a wide range of time‑series analysis applications. Future research directions include automated detection of when variance shrinkage is significant, integration with adaptive windowing techniques, and extension of the approach to multivariate symbolic representations.