Excision Score: Evaluating Edits with Surgical Precision
Many tasks revolve around editing a document, whether code or text. We formulate the revision similarity problem to unify a wide range of machine learning evaluation problems whose goal is to assess a revision to an existing document. We observe that revisions usually change only a small portion of an existing document, so the existing document and its immediate revisions share a majority of their content. We formulate five adequacy criteria for revision similarity measures, designed to align them with human judgement. We show that popular pairwise measures, like BLEU, fail to meet these criteria because their scores are dominated by the shared content: they report high similarity between two revisions when humans would assess them as quite different. This is a fundamental flaw we address. We propose a novel static measure, Excision Score (ES), which computes the longest common subsequence (LCS) to remove content that the existing document shares with the ground-truth and predicted revisions, before comparing only the remaining divergent regions. This is analogous to a surgeon creating a sterile field to focus on the work area. We use an approximation to reduce the standard cubic LCS computation to quadratic time. In code-editing evaluation, where static measures are often used as a cheap proxy for passing tests, we demonstrate that ES surpasses existing measures. When aligned with test execution on HumanEvalFix, ES improves over its nearest competitor, SARI, by 12% Pearson correlation and by >21% over standard measures like BLEU. The key criterion is invariance to shared context: when we perturb HumanEvalFix with increased shared context, ES's improvement over SARI increases to 20% and to >30% over standard measures. ES also handles corner cases that other measures do not, such as correctly aligning moved code blocks and appropriately rewarding matching insertions or deletions.
💡 Research Summary
The paper addresses a fundamental flaw in existing similarity metrics for revision tasks, where the majority of a document is shared between the original source and its revisions. Traditional pairwise measures such as BLEU, ROUGE, METEOR, and chrF are dominated by this shared context, leading them to assign near‑perfect scores even when the actual edit is completely incorrect. To remedy this, the authors formalize the “revision similarity” problem as a three‑way alignment among the original document (O), a reference revision (A), and a hypothesis revision (B). They introduce the concepts of conserved columns (identical across all three sequences) and divergent regions (clusters of columns where at least one sequence differs).
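A toy sketch of that three-way view (illustrative only: the example strings are invented, and the cubic dynamic program below is not the paper's alignment procedure, which relies on a faster approximation) recovers the conserved columns as the tokens common to O, A, and B:

```python
def lcs3(o, a, b):
    """Conserved columns: tokens common to O, A, and B, in order, recovered
    with the straightforward cubic three-sequence LCS dynamic program."""
    n, m, k = len(o), len(a), len(b)
    dp = [[[0] * (k + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            for l in range(1, k + 1):
                if o[i - 1] == a[j - 1] == b[l - 1]:
                    dp[i][j][l] = dp[i - 1][j - 1][l - 1] + 1
                else:
                    dp[i][j][l] = max(dp[i - 1][j][l], dp[i][j - 1][l], dp[i][j][l - 1])
    # Backtrack: whenever all three tokens match, they form a conserved column.
    conserved, (i, j, l) = [], (n, m, k)
    while i and j and l:
        if o[i - 1] == a[j - 1] == b[l - 1]:
            conserved.append(o[i - 1])
            i, j, l = i - 1, j - 1, l - 1
        elif dp[i][j][l] == dp[i - 1][j][l]:
            i -= 1
        elif dp[i][j][l] == dp[i][j - 1][l]:
            j -= 1
        else:
            l -= 1
    return conserved[::-1]

O = "x = sort(items) ; return x".split()          # original document
A = "x = quick_sort(items) ; return x".split()    # reference revision
B = "x = merge_sort(items) ; return x".split()    # hypothesis revision
print(lcs3(O, A, B))  # ['x', '=', ';', 'return', 'x']
```

Everything outside the conserved columns falls into a divergent region; in this toy example the single divergent region holds `sort(items)` / `quick_sort(items)` / `merge_sort(items)`.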
Based on this alignment, they propose five adequacy criteria that any revision‑similarity metric should satisfy: (1) reward matching edits, (2) penalize mismatching edits, (3) be invariant to shared context, (4) be origin‑variant (scores should change when the original document changes while A and B stay fixed), and (5) reward mismatching edits that are nonetheless semantically equivalent. Their analysis shows that most popular metrics violate several of these criteria, especially invariance to shared context.
The core contribution is the Excision Score (ES), a static, task‑agnostic, interpretable, and lightweight metric. ES first computes the Longest Common Subsequence (LCS) between O and each revision to excise the shared content, then compares only the remaining divergent regions using n‑gram overlap. To keep computation tractable, the authors adopt a quadratic‑time approximation of the cubic‑time LCS algorithm.
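A minimal sketch of the excision idea follows (not the authors' implementation): Python's `difflib.SequenceMatcher` stands in for the LCS alignment, tokens each revision shares with the original are dropped, and only the leftover divergent tokens are scored, here with a simple unigram F1. The real Excision Score's alignment, n-gram order, and weighting differ, the helper names and example are invented, and this toy ignores pure deletions, which the real measure rewards.

```python
from difflib import SequenceMatcher
from collections import Counter

def excise(original, revision):
    """Return the tokens of `revision` NOT matched against `original`
    by an LCS-style alignment (difflib's matching blocks approximate LCS)."""
    sm = SequenceMatcher(a=original, b=revision, autojunk=False)
    kept, prev_end = [], 0
    for _, b_start, size in sm.get_matching_blocks():
        kept.extend(revision[prev_end:b_start])  # tokens outside matched blocks
        prev_end = b_start + size
    return kept

def excision_score_sketch(original, reference, hypothesis):
    """Toy revision-similarity score: unigram F1 over the divergent tokens only."""
    ref_div = Counter(excise(original, reference))
    hyp_div = Counter(excise(original, hypothesis))
    if not ref_div and not hyp_div:
        return 1.0  # neither revision introduces any new tokens
    overlap = sum((ref_div & hyp_div).values())
    precision = overlap / max(sum(hyp_div.values()), 1)
    recall = overlap / max(sum(ref_div.values()), 1)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

# Toy example: only the operator in one line of a shared function changes.
original = "def add(a, b):\n    return a - b".split()
reference = "def add(a, b):\n    return a + b".split()
hypothesis = "def add(a, b):\n    return a * b".split()
print(excision_score_sketch(original, reference, hypothesis))  # 0.0: the edits disagree
print(excision_score_sketch(original, reference, reference))   # 1.0: the edits match
```

Because the unchanged signature and surrounding code are excised before scoring, padding the example with more shared lines leaves the score untouched, which is exactly the shared-context invariance that saturates BLEU-style measures.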
Empirical evaluation focuses on code‑editing tasks, using the HumanEvalFix benchmark. Treating test execution (pass/fail) as the ground truth, ES achieves a Pearson correlation with execution outcomes that is 12% higher than the best existing static metric (SARI) and more than 21% higher than standard measures such as BLEU, ROUGE, and METEOR. When the dataset is perturbed to increase shared context, ES's advantage grows to 20% over SARI and exceeds 30% over the standard measures, confirming its robustness to the very issue that plagues other metrics. Additional experiments demonstrate that ES correctly handles edge cases such as moved code blocks, matching insertions and deletions, and semantically equivalent alternatives (e.g., different sorting algorithms).
The authors argue that while dynamic, execution‑based metrics (e.g., pass@k) remain essential for assessing functional correctness, static metrics like ES are indispensable for large‑scale model evaluation, clustering of revisions, and scenarios where execution is infeasible or costly. By focusing exclusively on the divergent regions, ES provides a more faithful proxy for human judgment, enabling faster and more interpretable assessment of LLM‑driven edit assistants across both natural language and code domains.