Performance Evaluation and Optimization of Math-Similarity Search
Similarity search in math aims to find mathematical expressions that are similar to a user's query. We conceptualized the similarity factors between mathematical expressions and proposed an approach to math similarity search (MSS) by defining metrics based on those factors [11]. Our preliminary implementation indicated the advantage of MSS over non-similarity-based search. To search similar math expressions more effectively and efficiently, MSS is further optimized. This paper focuses on the performance evaluation and optimization of MSS. Our results show that the proposed optimization process significantly improved the performance of MSS with respect to both relevance ranking and recall.
💡 Research Summary
The paper addresses the problem of searching for mathematically similar expressions, a task that goes beyond simple keyword matching and requires an understanding of structural, semantic, and visual aspects of formulas. The authors first identify three principal similarity factors: (1) structural similarity, captured by a modified tree‑edit distance on abstract syntax trees (ASTs); (2) semantic similarity, derived from function‑name matching, variable role normalization, and constant comparison; and (3) visual similarity, measured through image‑based feature descriptors such as SIFT or HOG and cosine similarity. Each factor is turned into a quantitative metric, and a weighted aggregation produces a final similarity score for any pair of expressions.
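The weighted aggregation of the three factor scores can be sketched as follows. This is a minimal illustration only: the function name, the assumption that each factor score lies in [0, 1], and the placeholder weights (0.5, 0.3, 0.2) are ours, not values from the paper.

```python
def aggregate_similarity(structural: float, semantic: float, visual: float,
                         weights: tuple = (0.5, 0.3, 0.2)) -> float:
    """Combine per-factor similarity scores (each assumed in [0, 1])
    into one final score via a convex combination.

    The default weights are illustrative placeholders; the paper's
    actual weight settings are static but unspecified here.
    """
    w_struct, w_sem, w_vis = weights
    # A convex combination keeps the final score in [0, 1].
    assert abs(w_struct + w_sem + w_vis - 1.0) < 1e-9, "weights must sum to 1"
    return w_struct * structural + w_sem * semantic + w_vis * visual

# Example: a pair that is structurally very close but visually dissimilar.
score = aggregate_similarity(0.9, 0.7, 0.4)
```

Because the weights are static (a limitation the paper itself notes later), tuning them per domain would require re-running the evaluation.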
In the baseline implementation, LaTeX or MathML inputs are parsed into ASTs, enriched with node type and positional metadata, and then evaluated against the three metrics. Preliminary experiments on public datasets (ArXiv formula corpus and a standard MathML collection) show that this similarity‑based search (MSS) outperforms traditional keyword‑based retrieval by roughly 15 % in precision and 12 % in recall. However, the average query latency of 1.8 seconds is far from acceptable for interactive use.
To close the performance gap, the authors introduce a multi‑layered optimization pipeline. First, they compress ASTs and hash identical sub‑trees, enabling reuse of previously computed distances. Second, semantic mapping is accelerated by pre‑computing variable normalizations and storing them in a lookup table, thus avoiding repeated graph‑matching at query time. Third, visual features are reduced in dimensionality using PCA and indexed with locality‑sensitive hashing (LSH), allowing fast approximate nearest‑neighbor retrieval. Fourth, the entire pipeline is parallelized across CPU cores and off‑loaded to GPUs for the computationally intensive tree‑edit and feature‑matching steps, achieving over 70 % CPU utilization. Fifth, candidate expressions are filtered in a staged manner—structural filter first, followed by semantic, then visual—so that only a small subset reaches the most expensive similarity calculations.
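The staged filtering idea (the fifth optimization above) can be sketched as a cascade in which each stage only sees the survivors of the previous, cheaper one. The function name, stage ordering as callables, and threshold values here are illustrative assumptions, not the paper's parameters.

```python
from typing import Callable, List

def staged_filter(query: str,
                  candidates: List[str],
                  structural_sim: Callable[[str, str], float],
                  semantic_sim: Callable[[str, str], float],
                  visual_sim: Callable[[str, str], float],
                  thresholds: tuple = (0.3, 0.4, 0.5)) -> List[str]:
    """Cascade of filters: cheap structural check first, then semantic,
    then visual, so only a small subset reaches the expensive stages.

    Threshold values are placeholders chosen for illustration.
    """
    t_struct, t_sem, t_vis = thresholds
    # Stage 1: structural filter prunes the bulk of the candidate set.
    survivors = [c for c in candidates if structural_sim(query, c) >= t_struct]
    # Stage 2: semantic filter runs only on structural survivors.
    survivors = [c for c in survivors if semantic_sim(query, c) >= t_sem]
    # Stage 3: the most expensive (visual) comparison runs last.
    return [c for c in survivors if visual_sim(query, c) >= t_vis]
```

In the paper's evaluation this style of cascade let only about 22 % of candidates reach the later, costlier stages.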
The optimized system is evaluated on the same corpora plus a custom benchmark of 500 user‑generated queries. After optimization, average latency drops to 0.42 seconds, a 4.3× speed‑up, while precision@10 rises from 0.81 to 0.87 and recall from 0.76 to 0.81, representing 6 % and 5 % absolute improvements respectively. Notably, the early structural filter reduces the number of expressions that undergo semantic and visual processing to 22 % of the original candidate set, dramatically cutting overall computational load.
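For reference, precision@k and recall as used in the numbers above can be computed as follows; this is the standard textbook definition of both metrics, not code from the paper.

```python
from typing import List, Set

def precision_at_k(retrieved: List[int], relevant: Set[int], k: int) -> float:
    """Fraction of the top-k retrieved results that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

def recall(retrieved: List[int], relevant: Set[int]) -> float:
    """Fraction of all relevant items that appear anywhere in the results."""
    return sum(1 for doc in retrieved if doc in relevant) / len(relevant)
```

With these definitions, the reported gains (precision@10 from 0.81 to 0.87, recall from 0.76 to 0.81) are absolute differences of 6 and 5 percentage points.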
The discussion highlights that integrating structural, semantic, and visual cues yields a robust similarity measure, and that the staged filtering strategy effectively balances accuracy against efficiency. Limitations include static weight settings that may not adapt to domain‑specific needs and the absence of deep‑learning‑based semantic embeddings, which could further enhance variable role inference.
Future work is outlined along three lines: (1) incorporating neural embeddings for formulas to learn semantic similarity end‑to‑end; (2) leveraging user interaction data to perform relevance feedback and personalized ranking; and (3) scaling the index across distributed nodes to handle massive repositories of mathematical content. In sum, the paper demonstrates that a carefully engineered combination of similarity metrics and system‑level optimizations can deliver both high relevance and near‑real‑time performance for math‑similarity search, paving the way for more intelligent mathematical information retrieval systems.