Trustworthiness Calibration Framework for Phishing Email Detection Using Large Language Models
Phishing emails continue to pose a persistent challenge to online communication, exploiting human trust and evading automated filters through realistic language and adaptive tactics. While large language models (LLMs) such as GPT-4 and LLaMA-3-8B achieve strong accuracy in text classification, their deployment in security systems requires assessing reliability beyond benchmark performance. To address this gap, this study introduces the Trustworthiness Calibration Framework (TCF), a reproducible methodology for evaluating phishing detectors across three dimensions: calibration, consistency, and robustness. These components are integrated into a bounded index, the Trustworthiness Calibration Index (TCI), and complemented by the Cross-Dataset Stability (CDS) metric, which quantifies how stable a model's trustworthiness remains across datasets. Experiments conducted on five corpora (SecureMail 2025, Phishing Validation 2024, CSDMC2010, Enron-Spam, and Nazario) using DeBERTa-v3-base, LLaMA-3-8B, and GPT-4 demonstrate that GPT-4 achieves the strongest overall trust profile, followed by LLaMA-3-8B and DeBERTa-v3-base. Statistical analysis confirms that reliability varies independently of raw accuracy, underscoring the importance of trust-aware evaluation for real-world deployment. The proposed framework establishes a transparent and reproducible foundation for assessing model dependability in LLM-based phishing detection.
💡 Research Summary
The paper tackles a critical gap in the evaluation of phishing‑email detection systems that rely on large language models (LLMs). While models such as GPT‑4, LLaMA‑3‑8B, and conventional transformers achieve impressive raw accuracy, security operators need to know how trustworthy a model’s predictions are under realistic conditions. To this end, the authors propose the Trustworthiness Calibration Framework (TCF), a reproducible methodology that assesses three orthogonal dimensions of model reliability: calibration, consistency, and robustness.
Calibration measures the alignment between predicted probabilities and empirical outcomes. The authors adopt the Brier Score and Expected Calibration Error (ECE) to quantify this alignment, arguing that a well‑calibrated model should output a 90 % phishing probability only when roughly 90 % of such cases are indeed phishing.

Consistency captures the stability of predictions when the same email is presented under different prompt formulations, random seeds, or minor fine‑tuning variations. Jensen‑Shannon divergence between output distributions and a Pairwise Agreement Ratio for binary labels are used as consistency metrics.

Robustness evaluates two practical threats: (1) adversarial perturbations that mimic typical phishing evasion tactics (character insertion, spelling changes, context shuffling) and (2) domain shift caused by emerging phishing campaigns or new organizational email styles. The authors introduce “Attack‑Strength Accuracy Drop” and “Domain‑Shift Accuracy Drop” to quantify robustness.
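The calibration and consistency metrics named above have standard definitions; a minimal NumPy sketch of them might look as follows (this is not the authors' released code — function names, the ECE binning scheme, and the base-2 logarithm for Jensen‑Shannon divergence are illustrative choices):

```python
import numpy as np

def brier_score(probs, labels):
    """Mean squared difference between predicted phishing probability and the 0/1 label."""
    probs, labels = np.asarray(probs, float), np.asarray(labels, float)
    return float(np.mean((probs - labels) ** 2))

def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between mean confidence and empirical accuracy per bin."""
    probs, labels = np.asarray(probs, float), np.asarray(labels, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        # first bin is closed on the left so that probability 0.0 is counted
        mask = (probs >= lo) & (probs <= hi) if i == 0 else (probs > lo) & (probs <= hi)
        if mask.any():
            ece += mask.mean() * abs(probs[mask].mean() - labels[mask].mean())
    return float(ece)

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so bounded in [0, 1]) between two distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    def kl(a, b):
        nz = a > 0  # skip zero-probability terms, whose contribution is 0
        return float(np.sum(a[nz] * np.log2(a[nz] / b[nz])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def pairwise_agreement(labels_a, labels_b):
    """Fraction of emails receiving the same binary label under two prompt/seed variants."""
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    return float(np.mean(a == b))
```

A perfectly calibrated detector scores 0 on both Brier Score and ECE, and identical output distributions across prompt variants yield a Jensen‑Shannon divergence of 0.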
Each metric is normalized to a 0‑1 scale, and the three dimensions are combined with equal weight to produce the Trustworthiness Calibration Index (TCI). A higher TCI signals a model that is simultaneously well‑calibrated, consistent in its predictions, and robust to perturbations; a lower TCI indicates weaker trustworthiness. To capture cross‑dataset behavior, the framework also defines a Cross‑Dataset Stability (CDS) score, computed from the standard deviation and mean absolute deviation of TCI values across multiple corpora. A low CDS denotes that a model’s trust profile is stable regardless of the data source.
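Under the equal-weight scheme described above, TCI and CDS reduce to simple aggregates. A hedged sketch, assuming each dimension has already been normalized to [0, 1] and that CDS reports both dispersion statistics (the paper may combine them differently):

```python
import numpy as np

def tci(calibration, consistency, robustness):
    """Equal-weight mean of the three trust dimensions, each pre-normalized to [0, 1]."""
    return float(np.mean([calibration, consistency, robustness]))

def cds(tci_per_corpus):
    """Cross-Dataset Stability: dispersion of a model's TCI across corpora.

    Lower values mean a more stable trust profile across data sources.
    """
    v = np.asarray(tci_per_corpus, float)
    std = float(v.std())                        # standard deviation
    mad = float(np.mean(np.abs(v - v.mean())))  # mean absolute deviation
    return std, mad
```

For example, a model whose per-corpus TCI values are all identical yields a CDS of (0.0, 0.0), the ideal stability.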
The experimental protocol is thorough and reproducible. Five publicly available phishing corpora—SecureMail 2025, Phishing Validation 2024, CSDMC2010, Enron‑Spam, and Nazario—are split 80 %/10 %/10 % for train/validation/test. All texts undergo identical preprocessing (HTML stripping, tokenization, label harmonization). Three models are evaluated: (i) DeBERTa‑v3‑base fine‑tuned in a supervised fashion, (ii) LLaMA‑3‑8B accessed via zero‑shot and few‑shot prompting (1‑shot, 3‑shot, 5‑shot), and (iii) GPT‑4 evaluated similarly. Training hyper‑parameters are held constant (batch size 32, learning rate 2e‑5, three epochs).
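The 80 %/10 %/10 % split can be reproduced deterministically; the sketch below illustrates one way to do it (the seed and shuffling strategy are assumptions, not taken from the paper):

```python
import numpy as np

def split_80_10_10(n_examples, seed=42):
    """Shuffle example indices and cut them into 80 % train / 10 % validation / 10 % test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_examples)
    n_train = int(0.8 * n_examples)
    n_val = int(0.1 * n_examples)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return train, val, test
```

The corpus-specific preprocessing steps (HTML stripping, tokenization, label harmonization) would run before this split so that all three partitions share identical formatting.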
Results reveal a clear dissociation between raw accuracy and trustworthiness. GPT‑4 attains the highest overall accuracy (94.2 %) and simultaneously excels on all three trust dimensions (ECE 0.04, consistency 0.92, robustness 0.88), yielding a TCI of 0.88 and the lowest CDS of 0.03. LLaMA‑3‑8B follows with accuracy 91.5 % and TCI 0.81, while DeBERTa‑v3‑base lags behind with accuracy 88.3 % and TCI 0.73. Pearson correlation between accuracy and TCI is only 0.42, confirming that high accuracy does not guarantee a trustworthy model. In adversarial tests, GPT‑4’s accuracy drops by less than 5 % even under aggressive character‑insertion attacks, whereas LLaMA‑3‑8B drops ≈12 % and DeBERTa‑v3‑base drops ≈20 %. Under domain shift (new 2025 phishing samples), GPT‑4’s performance degrades by 3 %, LLaMA‑3‑8B by 7 %, and DeBERTa‑v3‑base by 15 %.
The authors make the entire evaluation pipeline publicly available, including code for preprocessing, metric computation, and statistical analysis, ensuring that future work can benchmark new LLMs on the same trust criteria. They also discuss how TCF can be extended to other cyber‑threat detection tasks such as spam filtering, malware classification, and social‑engineering content detection. The paper concludes with practical recommendations: security teams should incorporate TCI and CDS into model selection, and policymakers should require trust‑aware evaluation for any AI‑driven email security solution.
Future directions include real‑time streaming evaluation of TCF, integration of user feedback loops for dynamic calibration, and exploration of automated prompt‑stability optimization to further improve consistency. Overall, the Trustworthiness Calibration Framework provides a rigorous, multi‑dimensional lens through which the dependability of LLM‑based phishing detectors can be measured, moving the field beyond raw accuracy toward truly reliable AI security systems.