Automatic Simplification of Common Vulnerabilities and Exposures Descriptions
Understanding cyber security is increasingly important for individuals and organizations. However, much of the information related to cyber security can be difficult to understand for those unfamiliar with the topic. In this study, we investigate how large language models (LLMs) could be utilized for automatic text simplification (ATS) of Common Vulnerabilities and Exposures (CVE) descriptions. Automatic text simplification has been studied in several contexts, such as medical, scientific, and news texts, but not yet in the rapidly changing and complex domain of cyber security. We created a baseline for cyber security ATS and a test dataset of 40 CVE descriptions, evaluated by two groups of cyber security experts in two survey rounds. We found that while out-of-the-box LLMs can make text appear simpler, they struggle with meaning preservation. Code and data are available at https://version.aalto.fi/gitlab/vehomav1/simplification_nmi.
💡 Research Summary
This paper investigates the use of large language models (LLMs) for automatically simplifying descriptions of Common Vulnerabilities and Exposures (CVE), aiming to make technical security information accessible to non‑experts. The authors first assembled a test set of 40 CVE entries selected from the 2025 CVElistV5 repository, after manually removing non‑natural language fragments such as code snippets. Two main modeling approaches were evaluated. The first employed the commercial GPT‑4o model via Azure OpenAI API, initially simplifying each CVE sentence‑by‑sentence and later refining the output at the document level after a first round of human assessment. The second approach built an open‑source pipeline called GemmaAgent, which uses a 4‑billion‑parameter Gemma model together with three specialized agents: (1) a term‑extraction agent based on the AITSecNER named‑entity recognizer that isolates security‑relevant concepts (e.g., malware, tactics, tools), (2) a retrieval‑augmented generation (RAG) agent that consults a curated cybersecurity lexicon to produce concise explanations for each extracted term, and (3) a simplification agent that rewrites the original CVE description using the term explanations as scaffolding.
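The three-stage structure of the GemmaAgent pipeline can be sketched in a few lines of Python. This is an illustrative skeleton only: the NER model, lexicon, and LLM are replaced with toy stand-ins, and all function and variable names are hypothetical rather than taken from the paper's code.

```python
# Sketch of a GemmaAgent-style pipeline: term extraction, lexicon-backed
# explanation (RAG), then simplification. All components are toy stand-ins.

def extract_terms(text, known_terms):
    """Stand-in for the AITSecNER-based term-extraction agent."""
    return [t for t in known_terms if t.lower() in text.lower()]

def explain_terms(terms, lexicon):
    """Stand-in for the RAG agent: look each term up in a curated lexicon."""
    return {t: lexicon.get(t, "no entry found") for t in terms}

def simplify(text, explanations):
    """Stand-in for the simplification agent. A real system would prompt
    an LLM with the original text plus the term explanations as context."""
    glossary = "; ".join(f"{t}: {e}" for t, e in explanations.items())
    return f"{text} (Plain-language notes: {glossary})"

def pipeline(cve_description, known_terms, lexicon):
    terms = extract_terms(cve_description, known_terms)
    explanations = explain_terms(terms, lexicon)
    return simplify(cve_description, explanations)

if __name__ == "__main__":
    lexicon = {"SQL injection": "inserting malicious database commands via user input"}
    desc = "The login form is vulnerable to SQL injection."
    print(pipeline(desc, list(lexicon), lexicon))
```

The key design point the sketch preserves is that the term explanations are produced *before* simplification and passed to the final agent as scaffolding, rather than asking a single model to simplify in one shot.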
Evaluation combined automatic metrics and expert human judgments. Automatic scores included D‑SARI (a document‑level adaptation of SARI), Flesch‑Kincaid Grade Level (FKGL), and three meaning‑preservation measures: BERTScore, MeaningBERT, and Sentence‑BERT (using the nomic‑embed‑text‑v1.5 model). D‑SARI values were low across the board (0.09–0.14), reflecting the limited availability of reference simplifications. GPT‑4o achieved the greatest reduction in FKGL, dropping from 12.45 (original) to 9.49, mainly by producing longer texts with more sentences but fewer syllables per word. In meaning preservation, the semi‑synthetic human‑crafted simplifications and the GemmaAgent pipeline performed best, with GemmaAgent slightly surpassing GPT‑4o on MeaningBERT.
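The FKGL metric used above is a standard formula over sentence length and syllable counts. The sketch below computes it with a crude vowel-group syllable heuristic and naive sentence splitting; the actual tooling the paper used is not specified here, so this is purely illustrative.

```python
import re

def count_syllables(word):
    """Crude heuristic: count groups of consecutive vowels (min. 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fkgl(text):
    """Flesch-Kincaid Grade Level:
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59.
    Sentence splitting here is naive; real evaluations use NLP tooling."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

print(round(fkgl("The attacker can execute arbitrary code. Update the software now."), 2))
```

A drop from 12.45 to 9.49 corresponds roughly to moving from college-entry reading level to early high-school level; note the formula rewards shorter words and shorter sentences, which is why longer outputs with simpler vocabulary can still lower the score.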
Human evaluation was conducted in two rounds using a 3‑point Likert scale on two statements: (1) “The simplification is easier to understand than the original,” and (2) “The simplification preserves the meaning of the original.” The first round involved 12 cybersecurity experts; any simplification receiving ≥80 % “agree” and no “disagree” responses was deemed high‑quality and excluded from the second round. The second round, with 10 experts, compared the initial and revised simplifications and asked whether the second version was of higher quality. Only five of the 40 CVEs passed the first round unchanged; the remaining 35 required revision, highlighting the difficulty of preserving meaning while simplifying. Notably, GPT‑4o sometimes altered precise numeric details (e.g., rounding version numbers) unless explicitly instructed to keep them unchanged, and it occasionally refused to simplify sentences that triggered safety guardrails (e.g., descriptions of exploit mechanisms).
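The first-round acceptance rule described above (at least 80% "agree" responses and zero "disagree" responses) can be expressed as a simple predicate. The function name and rating encoding below are illustrative, not from the study's materials.

```python
def passes_first_round(ratings, agree_threshold=0.8):
    """ratings: list of responses on the 3-point scale,
    e.g. ["agree", "neutral", "disagree"].
    A simplification passes if at least 80% of experts agree
    and no expert disagrees."""
    if not ratings or "disagree" in ratings:
        return False
    return ratings.count("agree") / len(ratings) >= agree_threshold

# With 12 experts, 10 "agree" (83%) and 2 "neutral" passes;
# a single "disagree" fails regardless of the agree share.
```

Under this rule, only 5 of the 40 CVE simplifications passed, which is why the remaining 35 went through revision and a second evaluation round.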
The discussion emphasizes several key findings. First, out‑of‑the‑box LLMs can make text appear simpler but often sacrifice factual accuracy, especially for domain‑specific terminology and numeric data. Second, incorporating term explanations via a RAG component (as in GemmaAgent) improves meaning preservation but does not necessarily lower readability metrics, and it may increase the number of named entities because the model explicitly expands on them. Third, automatic metrics alone are insufficient to capture the nuanced trade‑off between simplicity and fidelity; expert human judgment remains essential. The authors suggest future work should focus on building larger, human‑validated reference corpora, refining prompt engineering to protect critical details, and developing controllable LLM frameworks that can balance multiple objectives (simplicity, factuality, brevity). They also recommend pilot deployments with target non‑technical audiences to assess real‑world utility. Overall, the study provides a baseline for cybersecurity text simplification, demonstrates the promise and limits of current LLMs, and outlines a roadmap for more reliable, user‑centered simplification systems.