An Empirical Evaluation of Similarity Measures for Time Series Classification


Time series are ubiquitous, and a measure to assess their similarity is a core part of many computational systems. In particular, the similarity measure is the most essential ingredient of time series clustering and classification systems. Because of this importance, countless approaches to estimate time series similarity have been proposed. However, there is a lack of comparative studies using empirical, rigorous, quantitative, and large-scale assessment strategies. In this article, we provide an extensive evaluation of similarity measures for time series classification following the aforementioned principles. We consider 7 different measures coming from alternative measure 'families', and 45 publicly-available time series data sets coming from a wide variety of scientific domains. We focus on out-of-sample classification accuracy, but in-sample accuracies and parameter choices are also discussed. Our work is based on rigorous evaluation methodologies and includes the use of powerful statistical significance tests to derive meaningful conclusions. The obtained results show the equivalence, in terms of accuracy, of a number of measures, but with one single candidate outperforming the rest. Such findings, together with the followed methodology, invite researchers in the field to adopt more consistent evaluation criteria and to make more informed decisions regarding the baseline measures to which new developments should be compared.


💡 Research Summary

The paper presents a rigorous, large‑scale empirical comparison of time‑series similarity measures in the context of classification. Seven representative measures are evaluated: Euclidean distance (a lock‑step L2 norm), Fourier‑coefficient (FC) distance, auto‑regressive (AR) model‑based distance, and four elastic measures, namely Dynamic Time Warping (DTW), Edit Distance on Real sequences (EDR), Time‑Warped Edit Distance (TWED), and the Minimum Jump Cost (MJC) dissimilarity. A random baseline is also included for reference.

A total of 45 publicly available time‑series datasets from the UCR/UEA archives are used, covering a wide range of domains (finance, medicine, speech, image, sensor, environmental, human activity, synthetic, etc.). For each dataset, a 1‑Nearest‑Neighbour (1‑NN) classifier is employed, which is the standard protocol for assessing similarity measures in time‑series classification. All hyper‑parameters (e.g., DTW’s Sakoe‑Chiba window size, EDR’s matching threshold, the number of Fourier coefficients θ, the AR order η) are tuned via an inner 5‑fold cross‑validation on the training split. The outer evaluation uses a 10‑fold cross‑validation to obtain unbiased test accuracies. Both training and test accuracies are reported to highlight over‑fitting tendencies.
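As a rough sketch of this protocol: a 1‑NN classifier only needs a pairwise distance function, and the inner cross‑validation reduces to a grid search over candidate parameter values. The function names and fold‑splitting details below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def nn1_classify(train_X, train_y, query, dist):
    """Label a query series with the class of its nearest training series."""
    dists = [dist(x, query) for x in train_X]
    return train_y[int(np.argmin(dists))]

def tune_parameter(train_X, train_y, dist_family, candidates, n_folds=5):
    """Pick the parameter value (e.g. a DTW window size) with the best
    inner cross-validation 1-NN accuracy on the training split."""
    rng = np.random.default_rng(0)
    folds = np.array_split(rng.permutation(len(train_X)), n_folds)
    best_param, best_acc = None, -1.0
    for p in candidates:
        dist = dist_family(p)          # distance parameterized by p
        correct = 0
        for k in range(n_folds):
            val = folds[k]
            trn = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            for i in val:
                pred = nn1_classify([train_X[j] for j in trn],
                                    [train_y[j] for j in trn],
                                    train_X[i], dist)
                correct += (pred == train_y[i])
        acc = correct / len(train_X)
        if acc > best_acc:
            best_param, best_acc = p, acc
    return best_param
```

The outer 10‑fold evaluation would then wrap this tuning step, refitting the parameter on each outer training split before measuring test accuracy.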

Statistical significance is assessed using a Friedman test across all datasets, followed by a Nemenyi post‑hoc test to identify groups of measures that are not significantly different. Pairwise Wilcoxon signed‑rank tests supplement the analysis for specific dataset comparisons.
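This testing pipeline can be sketched with SciPy. The accuracy table below is made up purely for illustration; the Nemenyi critical difference is computed from the standard formula CD = q_α·sqrt(k(k+1)/(6N)) with a tabulated Studentized‑range value.

```python
import numpy as np
from scipy import stats

# Rows: datasets, columns: similarity measures (hypothetical accuracies).
acc = np.array([
    [0.71, 0.80, 0.82, 0.79],
    [0.65, 0.77, 0.78, 0.75],
    [0.90, 0.93, 0.94, 0.92],
    [0.55, 0.68, 0.70, 0.66],
    [0.80, 0.85, 0.86, 0.83],
])
n, k = acc.shape

# Friedman test: do the measures differ across datasets?
chi2, p = stats.friedmanchisquare(*acc.T)

# Pairwise Wilcoxon signed-rank test between two measures.
w, p_pair = stats.wilcoxon(acc[:, 1], acc[:, 2])

# Nemenyi critical difference on average ranks (alpha = 0.05, k = 4).
q_alpha = 2.569  # tabulated Studentized-range value for 4 groups
cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * n))
ranks = stats.rankdata(-acc, axis=1).mean(axis=0)  # rank 1 = most accurate

print(f"Friedman chi2={chi2:.2f}, p={p:.4f}")
print(f"Wilcoxon p={p_pair:.4f}")
print("average ranks:", ranks, "critical difference:", round(cd, 2))
```

Two measures whose average ranks differ by less than the critical difference are grouped as statistically indistinguishable in the Nemenyi diagram.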

Key findings:

  1. Overall superiority of DTW – DTW achieves the highest mean test accuracy and belongs to the top‑ranked group in the Nemenyi diagram.
  2. Near‑equivalence of EDR and TWED – Both elastic edit‑based distances perform statistically indistinguishably from DTW on many datasets, especially those with substantial noise or temporal distortions.
  3. Viable low‑cost alternatives – Euclidean distance and Fourier‑coefficient distance, while computationally cheap, obtain competitive accuracies on short or low‑variability series, making them attractive for real‑time or resource‑constrained scenarios.
  4. Sensitivity of model‑based measures – AR‑based distances are highly dependent on the chosen order η; sub‑optimal η leads to poorer performance than even the simple Euclidean baseline.
  5. Training vs. testing gap – DTW and EDR exhibit minimal over‑fitting, with training and test accuracies closely aligned. In contrast, FC and AR sometimes show large gaps, indicating susceptibility to over‑fitting when hyper‑parameters are aggressively tuned.
  6. Random baseline confirmation – All evaluated measures significantly outperform the random baseline, confirming the validity of the experimental setup.
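For concreteness, a minimal DTW implementation with an optional Sakoe‑Chiba band (the window constraint tuned in the experiments) might look as follows. This is a textbook dynamic‑programming formulation, not the authors' code.

```python
import numpy as np

def dtw(a, b, window=None):
    """DTW distance between 1-D series a and b, with an optional
    Sakoe-Chiba band of half-width `window` (None = unconstrained)."""
    n, m = len(a), len(b)
    w = max(window, abs(n - m)) if window is not None else max(n, m)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        lo, hi = max(1, i - w), min(m, i + w)
        for j in range(lo, hi + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # extend the cheapest of: insertion, deletion, match
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return np.sqrt(D[n, m])
```

With `window=0` and equal‑length inputs, the band collapses the warping path to the diagonal and DTW reduces to the Euclidean distance, which is one way to see why a well‑tuned window interpolates between lock‑step and fully elastic matching.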

The authors argue that future research proposing new similarity measures should adopt DTW (or an equally performing elastic measure such as EDR/TWED) as the primary baseline. When computational efficiency is paramount, Euclidean or Fourier‑based distances are justified alternatives. The study also stresses the necessity of internal cross‑validation for hyper‑parameter selection and the use of robust non‑parametric statistical tests to substantiate performance claims.
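As an example of such a cheap alternative, one common formulation of the Fourier‑coefficient distance compares only the first θ non‑trivial coefficients of the discrete Fourier transform; the exact definition used in the paper may differ from this sketch.

```python
import numpy as np

def fc_distance(a, b, theta):
    """Euclidean distance between the first `theta` non-trivial Fourier
    coefficients of two equal-length series (a common FC formulation)."""
    A = np.fft.rfft(a)[1:theta + 1]  # skip the DC component
    B = np.fft.rfft(b)[1:theta + 1]
    return float(np.linalg.norm(A - B))
```

Truncating to θ coefficients acts as a low‑pass filter, so the measure is robust to high‑frequency noise and costs O(L log L) per series for the transform, after which each comparison is O(θ).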

Limitations include the exclusive focus on 1‑NN classification; the behavior of the measures with other classifiers (e.g., SVM, Random Forest) remains unexplored. Moreover, the impact of preprocessing steps (normalization, dimensionality reduction) and variable series lengths warrants further investigation. Future work could explore hybrid combinations of elastic and feature‑based distances, or integrate deep‑learning embeddings with traditional measures.

In summary, this work delivers the most comprehensive, statistically sound benchmark of time‑series similarity measures to date, providing clear guidance on which measures to use as baselines and under which circumstances alternative, cheaper measures may be preferable.

