Phishing Email Detection Using Large Language Models


Email phishing is one of the most prevalent and globally consequential vectors of cyber intrusion. As email security systems increasingly incorporate Large Language Model (LLM) applications, they face evolving phishing threats that exploit the models' fundamental architectures. Current LLMs require substantial hardening before deployment in email security systems, particularly against coordinated multi-vector attacks that exploit architectural vulnerabilities. This paper proposes LLM-PEA, an LLM-based framework for detecting phishing email attacks across multiple attack vectors, including prompt injection, text refinement, and multilingual attacks. We evaluate three frontier LLMs (GPT-4o, Claude Sonnet 4, and Grok-3) under a comprehensive set of prompting designs to assess their feasibility, robustness, and limitations against phishing email attacks. Our empirical analysis shows that LLMs can detect phishing emails with over 90% accuracy, while also highlighting that LLM-based detection systems can be undermined by adversarial manipulation, prompt injection, and multilingual attacks. Our findings provide critical insights for deploying LLM-based phishing detection in real-world settings where attackers exploit multiple vulnerabilities in combination.


💡 Research Summary

The paper introduces LLM‑PEA, a comprehensive evaluation framework designed to assess the robustness of large language models (LLMs) when used for phishing‑email detection in realistic, multi‑vector threat environments. Recognizing that modern email security solutions increasingly embed LLMs for text classification, the authors argue that existing security evaluations are fragmented—typically focusing on a single vulnerability such as adversarial perturbations, prompt‑injection, or multilingual performance. LLM‑PEA unifies these dimensions, enabling systematic testing of three state‑of‑the‑art LLMs—GPT‑4o, Claude Sonnet 4, and Grok‑3—against five carefully constructed dataset configurations: (1) a balanced 50/50 safe‑vs‑phishing set, (2) an imbalanced 90/10 set reflecting real‑world traffic, (3) an adversarial set generated by semantic‑preserving paraphrases, (4) a prompt‑injection set employing six distinct jailbreak templates, and (5) a multilingual set translated into Bangla, Chinese, and Hindi.

The framework’s pipeline consists of (i) email ingestion and normalization, (ii) an adversarial attack generation module that applies seven manipulation strategies (re‑phrasing, instruction injection, context manipulation, authority impersonation, confidence bypass, logical contradiction, technical exploitation) plus multilingual obfuscation, and (iii) a downstream LLM decision module that performs the final classification. Three prompting styles are evaluated: a Structured Prompt that enumerates five detection criteria, a Zero‑Shot Prompt that provides minimal instruction, and a Chain‑of‑Thought (CoT) Prompt that forces step‑by‑step reasoning before output.
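The three prompting styles could be sketched as follows. This is a hypothetical illustration only: the exact prompt wording and the five detection criteria used in the paper are not reproduced here, and `build_prompt` is an assumed helper, not the authors' implementation.

```python
# Illustrative sketch of the three prompting styles; criteria list is assumed,
# not taken from the paper.
STRUCTURED_CRITERIA = [
    "sender legitimacy",
    "urgency or threat language",
    "suspicious links or attachments",
    "requests for credentials or payment",
    "grammar and formatting anomalies",
]

def build_prompt(style: str, email_body: str) -> str:
    """Return a phishing-classification prompt in one of three styles."""
    if style == "structured":
        # Enumerates explicit detection criteria for the model to apply.
        criteria = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(STRUCTURED_CRITERIA))
        return (
            "Classify the email as PHISHING or SAFE using these criteria:\n"
            f"{criteria}\n\nEmail:\n{email_body}\nAnswer:"
        )
    if style == "zero_shot":
        # Minimal instruction, no criteria.
        return f"Is the following email phishing? Answer PHISHING or SAFE.\n\nEmail:\n{email_body}"
    if style == "cot":
        # Forces step-by-step reasoning before the final label.
        return (
            "Think step by step about who sent this email, what it asks for, "
            "and whether any element is deceptive. Then answer PHISHING or SAFE.\n\n"
            f"Email:\n{email_body}"
        )
    raise ValueError(f"unknown prompt style: {style}")
```

In a full pipeline, the returned string would be sent to the downstream LLM decision module and the PHISHING/SAFE token parsed from its response.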

Baseline results on the balanced set show high raw accuracy: GPT‑4o 95 %, Claude Sonnet 4 94 %, Grok‑3 88 %. However, prompt design dramatically influences performance. Structured prompts yield an average F1 of 0.657, Zero‑Shot prompts improve to 0.793, and CoT prompts achieve the highest individual scores (up to 0.865) but with greater variance across models. In the imbalanced scenario, Zero‑Shot prompting attains the best F1 (0.864), indicating that rigid templates can hinder detection of minority phishing instances.
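For reference, the F1 scores compared above combine precision and recall into a single figure. The counts in this worked example are hypothetical, not drawn from the paper:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall over phishing (positive) class."""
    precision = tp / (tp + fp)  # fraction of phishing alerts that were correct
    recall = tp / (tp + fn)     # fraction of real phishing emails caught
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 80 phishing emails caught, 10 false alarms, 20 missed.
print(round(f1_score(tp=80, fp=10, fn=20), 3))  # 0.842
```

Because F1 ignores true negatives, it is a more informative metric than raw accuracy on the imbalanced 90/10 configuration, where a classifier that labels everything "safe" would still score 90% accuracy.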

Adversarial robustness testing reveals model‑specific weaknesses. After semantic‑preserving paraphrasing, GPT‑4o and Grok‑3 maintain 100 % detection, while Claude Sonnet 4’s accuracy drops by 12.7 % (24 misclassifications out of 189). Prompt‑injection experiments, using 1,134 crafted variants, show that the models frequently succumb to injected instructions, especially under the Structured Prompt, confirming that overly explicit system prompts can themselves become attack surfaces.
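The reported 12.7 % drop for Claude Sonnet 4 is consistent with the raw counts, assuming the drop is simply the misclassification rate over the adversarial set:

```python
# Sanity check: 24 misclassifications out of 189 adversarial samples.
misclassified, total = 24, 189
drop = misclassified / total * 100
print(round(drop, 1))  # 12.7
```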

Multilingual evaluation uncovers a severe performance gap: despite the same underlying LLMs, detection accuracy falls sharply across Bangla, Chinese, and Hindi samples, with phishing prevalence as low as 5 % in the test set. The authors attribute this to training‑data bias, tokenization differences, and reduced exposure to low‑resource languages.

From these findings, the authors draw several practical hardening recommendations for deploying LLM‑based email security: (1) implement rigorous prompt sanitization and normalization to mitigate injection attacks; (2) conduct continuous red‑team testing that combines adversarial, injection, and multilingual scenarios; (3) fine‑tune or augment models with language‑specific data to close cross‑lingual gaps; (4) embed internal safety mechanisms such as token‑level filters, system‑prompt isolation, and controlled output parsing; and (5) prefer flexible prompting strategies (Zero‑Shot or CoT) over overly prescriptive templates when robustness is paramount.
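Recommendation (1) might be realized, in its simplest form, as a normalization pass that runs before email text is interpolated into any prompt. The patterns below are illustrative assumptions, not a list specified by the paper, and a production filter would need far broader coverage:

```python
import re

# Hypothetical sketch of prompt sanitization: collapse whitespace tricks and
# strip common injection markers before the text reaches the classifier prompt.
INJECTION_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"disregard the above",
    r"system prompt",
]

def sanitize(email_body: str) -> str:
    """Normalize an email body and redact obvious injection phrases."""
    text = " ".join(email_body.split())  # collapse whitespace/newline obfuscation
    for pat in INJECTION_PATTERNS:
        text = re.sub(pat, "[removed]", text, flags=re.IGNORECASE)
    return text
```

Such pattern matching only addresses the crudest injections; the paper's stronger recommendations (system-prompt isolation, controlled output parsing) target attacks that no blocklist can anticipate.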

In conclusion, while LLMs demonstrate superior baseline phishing detection compared to traditional classifiers, their security posture is fragile under multi‑vector attacks. LLM‑PEA provides a reproducible, end‑to‑end benchmark that reveals compound vulnerabilities missed by single‑dimension tests. The paper underscores the necessity of holistic hardening and ongoing evaluation before LLMs can be trusted as autonomous components in production email security pipelines.

