ASSESS: A Semantic and Structural Evaluation Framework for Statement Similarity

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Despite significant strides in statement autoformalization, a critical gap remains in the development of automated evaluation metrics capable of assessing formal translation quality. Existing metrics often fail to balance semantic and structural information: string-based methods neglect semantics, whereas proof-based approaches offer no graded similarity when proofs fail. To address these issues, we introduce ASSESS (A Semantic and Structural Evaluation Framework for Statement Similarity), which captures syntactic structure by transforming formal statements into operator trees and computes a real-valued similarity score using our novel TransTED (Transformation Tree Edit Distance) Similarity metric, which incorporates semantic transformations. For rigorous validation, we present EPLA (Evaluating Provability and Likeness for Autoformalization), a benchmark comprising 1,247 expert-annotated formal statement pairs derived from miniF2F and ProofNet, distinctively labeled for both semantic provability and structural likeness. Experiments on the EPLA benchmark demonstrate that TransTED Similarity surpasses existing methods, achieving state-of-the-art accuracy and Kappa score. The benchmark dataset, code, and detailed experimental results are available at https://github.com/XiaoyangLiu-sjtu/ASSESS.


💡 Research Summary

The paper addresses a critical gap in the evaluation of automatically formalized mathematical statements: existing metrics either ignore semantics (string‑based scores such as BLEU/ROUGE) or provide only a binary verdict (proof‑based methods) and thus cannot give graded feedback when a proof fails. To bridge this gap, the authors propose ASSESS, a two‑stage framework that captures both the syntactic structure and the semantic equivalence of formal statements.

In the first stage, each Lean formal statement is parsed with the Lean Language Server into an Operator Tree (OPT). Operators become internal nodes, their arguments become ordered children, and a placeholder token is appended to every non‑leaf node to disambiguate operator roles. Parentheses are omitted because the tree topology already encodes precedence. This representation yields a compact, structure‑preserving encoding that is robust to superficial textual variations.
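The operator-tree representation can be illustrated with a minimal sketch. The paper's pipeline parses Lean statements via the Lean Language Server; the toy parser below instead takes a pre-parsed prefix expression (a hypothetical, simplified input format chosen for illustration) and shows how operators become internal nodes, arguments become ordered children, and a placeholder child is appended to non-leaf nodes.

```python
from dataclasses import dataclass, field

@dataclass
class OptNode:
    label: str                          # operator name or atom
    children: list = field(default_factory=list)

    def size(self) -> int:
        """Number of nodes in the subtree rooted here."""
        return 1 + sum(c.size() for c in self.children)

def to_tree(expr):
    """Build an operator tree from a prefix expression such as
    ('+', 'a', 'b'); atoms are plain strings. This is a stand-in
    for the Lean Language Server parse described in the summary."""
    if isinstance(expr, str):
        return OptNode(expr)
    op, *args = expr
    node = OptNode(op, [to_tree(a) for a in args])
    # Placeholder token on every non-leaf node, as in the paper,
    # to disambiguate an operator's role.
    node.children.append(OptNode("#"))
    return node

# a + b and b + a share the same topology but differ in leaf order;
# parentheses never appear, since precedence lives in the tree shape.
t1 = to_tree(("+", "a", "b"))
t2 = to_tree(("+", "b", "a"))
print(t1.size())  # 4: '+', 'a', 'b', and the placeholder '#'
```

Because the tree encodes precedence directly, textually different but structurally identical statements map to the same tree, which is the robustness property the summary highlights.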

The second stage introduces TransTED Similarity, a novel distance metric built on top of the classic Tree Edit Distance (TED). While TED measures the minimum cost of node insertions, deletions, and relabelings, it treats all edits equally and therefore penalizes differences between semantically equivalent expressions (e.g., a + b vs. b + a). TransTED augments TED with a curated set of semantic transformations derived from logical implication relationships. The authors formalize a "transformation" as a mapping from one statement to a logically implied one (e.g., i = j ⇒ f(i) = f(j)). They then define a new pseudometric d* that must (1) never exceed the original TED distance and (2) be monotone under these transformations. By solving a linear-programming problem they prove the existence of a unique maximal d* satisfying both constraints; this maximal function is defined as TransTED. In practice, because only a finite set of transformations is implemented, the algorithm computes an upper bound of the theoretical value.
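The interplay between edit costs and transformations can be sketched as follows. This is not the paper's algorithm: it uses a simplified ordered-tree edit distance (match roots, then edit the child sequences) rather than full Zhang-Shasha TED, and a tiny hand-written transformation set standing in for the curated rules. Trees are nested tuples, e.g. `("+", ("a",), ("b",))`.

```python
from functools import lru_cache

# Hypothetical transformation set: pairs of subtrees that a curated
# semantic rule (here, commutativity of +) maps into each other at
# zero cost. The real system derives these from logical implications.
TRANSFORMS = {
    (("+", ("a",), ("b",)), ("+", ("b",), ("a",))),
}

def size(t):
    return 1 + sum(size(c) for c in t[1:])

@lru_cache(maxsize=None)
def dist(t1, t2):
    """Simplified tree edit distance with semantic transformations:
    a transformation caps the cost of the pair it covers at zero."""
    if (t1, t2) in TRANSFORMS or (t2, t1) in TRANSFORMS:
        return 0
    relabel = 0 if t1[0] == t2[0] else 1
    return relabel + seq_dist(t1[1:], t2[1:])

def seq_dist(xs, ys):
    """Levenshtein-style alignment over child subtrees; deleting or
    inserting a whole subtree costs its size."""
    if not xs:
        return sum(size(y) for y in ys)
    if not ys:
        return sum(size(x) for x in xs)
    return min(
        dist(xs[0], ys[0]) + seq_dist(xs[1:], ys[1:]),  # align
        size(xs[0]) + seq_dist(xs[1:], ys),             # delete
        size(ys[0]) + seq_dist(xs, ys[1:]),             # insert
    )

t1 = ("+", ("a",), ("b",))
t2 = ("+", ("b",), ("a",))
print(dist(t1, t2))  # 0: the commutativity transformation applies
```

Without the transformation, plain edit distance would charge two relabels for swapping `a` and `b`; with it, the pair is free, mirroring how TransTED refuses to penalize semantically equivalent statements. Because only the listed transformations are checked, the computed value upper-bounds the theoretical maximal d*, as the summary notes.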

TransTED Similarity is obtained by normalizing the distance d* by the size of the larger tree, yielding a real-valued score in [0, 1], where 1 indicates structural identity up to the implemented transformations.
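The normalization step itself is a one-liner; the sketch below follows the description above, with the exact clamping convention being an assumption on our part.

```python
def transted_similarity(d_star: float, size1: int, size2: int) -> float:
    """Turn a TransTED distance into a similarity score by dividing
    by the larger tree's size. Clamping at 0 is our assumption for
    the (rare) case where d_star exceeds the larger tree size."""
    score = 1.0 - d_star / max(size1, size2)
    return max(0.0, score)

print(transted_similarity(0, 4, 4))  # 1.0: equal (or transformation-equal) trees
print(transted_similarity(2, 4, 4))  # 0.5
```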

