A Reference-Free Metric for Evaluating the Conciseness of Large Language Model Responses

Reading time: 5 minutes

📝 Abstract

Large language models (LLMs) frequently generate responses that are lengthy and verbose, filled with redundant or unnecessary details. This diminishes clarity and user satisfaction, and it increases costs for model developers, especially with well-known proprietary models that charge based on the number of output tokens. In this paper, we introduce a novel reference-free metric for evaluating the conciseness of responses generated by LLMs. Our method quantifies non-essential content without relying on gold-standard references and calculates the average of three calculations: i) a compression ratio between the original response and an LLM abstractive summary; ii) a compression ratio between the original response and an LLM extractive summary; and iii) word-removal compression, where an LLM removes as many non-essential words as possible from the response while preserving its meaning, with the number of tokens removed indicating the conciseness score. Experimental results demonstrate that our proposed metric identifies redundancy in LLM outputs, offering a practical tool for automated evaluation of response brevity in conversational AI systems without the need for ground-truth human annotations.


📄 Content

ConCISE: A Reference-Free Conciseness Evaluation Metric for LLM-Generated Answers

Seyed Mohssen Ghafari¹, Ronny Kol¹, Juan C. Quiroz¹, Nella Luan¹, Monika Patial¹, Chanaka Rupasinghe¹, Herman Wandabwa¹ and Luiz Pizzato¹

¹Commonwealth Bank of Australia, Sydney, Australia

Keywords: Large Language Models, Evaluation Metrics, Conciseness, Natural Language Processing, Reference-free Evaluation

ProActLLM: Proactive Conversational Information Seeking with Large Language Models, November 14, 2025, Coex, Seoul, South Korea (co-located with CIKM 2025)

*Corresponding author: seyedmohssen.ghafari@cba.com.au (S. M. Ghafari); ronny.kol@cba.com.au (R. Kol); juan.quirozaguilera@cba.com.au (J. C. Quiroz); nella.luan@cba.com.au (N. Luan); Monika.Patial@cba.com.au (M. Patial); Chanaka.Rupasinghe@cba.com.au (C. Rupasinghe); Herman.Wandabwa@cba.com.au (H. Wandabwa); Luiz.Pizzato1@cba.com.au (L. Pizzato)

© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

[Figure 1: ConCISE Architecture]

1. Introduction

As large language models (LLMs) [1, 2] are increasingly used to answer questions and engage in dialogue, the quality of their responses becomes critical. In many applications, brief and clear answers are preferred [3]. However, LLMs often produce overly verbose, long-winded responses containing redundant or irrelevant information [3]. A response that is thorough but lengthy may overwhelm users, while one that is brief but lacks detail may fail to meet their needs. Thus, conciseness, providing the shortest answer that still covers the necessary content, is a desirable property of LLM outputs.

Conciseness is rarely measured directly by existing evaluation metrics. Traditional metrics for text generation (e.g., BLEU or ROUGE) depend on reference texts and focus on lexical overlap or content coverage, which do not capture verbosity [4]. Recent work has explored reference-free metrics for other quality aspects [4]. For example, the RAGAS framework introduces reference-free metrics for retrieval-augmented question answering, allowing automated evaluation without ground-truth answers. Inspired by such approaches, we seek a metric that quantifies conciseness by detecting non-essential content in an answer. We leverage LLM capabilities to simulate human judgments of brevity in a reference-free manner.

We propose ConCISE, a new conciseness metric that operates without gold-standard answers. Our approach quantifies non-essential content without relying on gold-standard references and calculates the average of three calculations: i) a compression ratio between the original response and an LLM abstractive summary; ii) a compression ratio between the original response and an LLM extractive summary; and iii) word-removal compression, where an LLM removes as many non-essential words as possible from the response while preserving its meaning, with the number of tokens removed indicating the conciseness score. We apply ConCISE to the WikiEval dataset (a set of Wikipedia-based questions) and verify that our approach effectively measures the conciseness of an LLM-generated response.

The contributions of this paper are as follows: 1) we propose a novel reference-free metric for evaluating the conciseness of responses generated by LLMs; 2) we conduct experiments to demonstrate the effectiveness of the new metric and its level of alignment with human judgement; 3) to the best of our knowledge, this metric is one of the first evaluation mechanisms that can assess an LLM's output based on its length without requiring any gold-standard reference.
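The excerpt describes ConCISE as the average of three compression-based calculations but does not give the exact formula. The sketch below is a minimal, hypothetical illustration of how such a score could be assembled, assuming whitespace-split tokens, a compression ratio defined as compressed length over original length, and a plain mean of the three ratios (a ratio near 1.0 would mean little could be removed, i.e. the response was already concise). The function names and the toy "LLM outputs" are illustrative, not the authors' implementation; in the actual metric the three compressed versions would come from LLM calls.

```python
def compression_ratio(original: str, compressed: str) -> float:
    """Fraction of the original token count retained after compression
    (assumption: tokens are whitespace-split words)."""
    orig_tokens = original.split()
    comp_tokens = compressed.split()
    if not orig_tokens:
        return 0.0
    return len(comp_tokens) / len(orig_tokens)


def concise_score(response: str,
                  abstractive_summary: str,
                  extractive_summary: str,
                  word_removed_version: str) -> float:
    """Hypothetical ConCISE-style score: the mean of the three
    compression ratios named in the paper (abstractive summary,
    extractive summary, and word-removal compression)."""
    ratios = [
        compression_ratio(response, abstractive_summary),
        compression_ratio(response, extractive_summary),
        compression_ratio(response, word_removed_version),
    ]
    return sum(ratios) / len(ratios)


# Toy example with hand-written stand-ins for the three LLM outputs:
response = ("The capital of France is Paris which as you may know "
            "is a very famous city")
abstractive = "Paris is the capital of France"
extractive = "The capital of France is Paris"
word_removed = "The capital of France is Paris"

score = concise_score(response, abstractive, extractive, word_removed)
# Each compressed version keeps 6 of 16 tokens, so score = 0.375,
# flagging the original response as padded with non-essential words.
```

A low average ratio indicates that most of the response could be compressed away, i.e. the response was verbose; a score near 1.0 indicates the response was already close to its minimal form.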

This content is AI-processed based on arXiv data.
