Eliciting Least-to-Most Reasoning for Phishing URL Detection
Phishing continues to be one of the most prevalent attack vectors, making accurate classification of phishing URLs essential. Recently, large language models (LLMs) have demonstrated promising results in phishing URL detection; however, the reasoning capabilities that enable such performance remain underexplored. To this end, we propose a Least-to-Most prompting framework for phishing URL detection. In particular, we introduce an “answer sensitivity” mechanism that guides Least-to-Most’s iterative approach to enhance reasoning and yield higher prediction accuracy. We evaluate our framework on three URL datasets and four state-of-the-art LLMs, comparing against a one-shot approach and a supervised model. Our framework outperforms the one-shot baseline while achieving performance comparable to that of the supervised model, despite requiring significantly less training data. Furthermore, our in-depth analysis highlights how the iterative reasoning enabled by Least-to-Most, and reinforced by our answer sensitivity mechanism, drives these performance gains. Overall, we show that this simple yet powerful prompting strategy consistently improves over one-shot prompting and approaches supervised performance while requiring minimal training or few-shot guidance. Our experimental setup can be found in our GitHub repository: github.sydney.edu.au/htri0928/least-to-most-phishing-detection.
💡 Research Summary
The paper addresses the problem of phishing URL detection by leveraging the reasoning capabilities of large language models (LLMs) through a novel prompting strategy called “Least‑to‑Most” combined with an “answer sensitivity” mechanism. Traditional approaches for LLM‑based phishing detection have relied on one‑shot prompting or Chain‑of‑Thought (CoT) prompting, which produce a single output without explicitly exposing the model’s intermediate reasoning steps. In contrast, Least‑to‑Most decomposes the classification task into a sequence of sub‑questions, each of which asks the model to provide a percentage estimate of how likely the URL is phishing (0 % = benign, 100 % = phishing). The model iterates over these sub‑questions until the estimated likelihood crosses a user‑defined upper or lower threshold (e.g., ≥80 % for phishing, ≤20 % for benign). If the thresholds are never crossed after a maximum of ten iterations, the URL is conservatively labeled as phishing.
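The decision loop described above can be sketched in a few lines. Note this is a minimal illustration, not the authors' implementation: `ask_llm` is a hypothetical callback standing in for the paper's sub-question prompts, and the 80 %/20 % thresholds and ten-iteration cap are the example values given in the text.

```python
# Sketch of the Least-to-Most loop with answer sensitivity.
# `ask_llm(url, step, history)` is a hypothetical helper that poses the
# next sub-question to the LLM and returns its phishing-likelihood
# estimate as a percentage in [0, 100].

def classify_url(url, ask_llm, upper=80, lower=20, max_iters=10):
    """Iterate sub-questions until the likelihood crosses a threshold."""
    history = []  # likelihood estimates across iterations
    for step in range(1, max_iters + 1):
        likelihood = ask_llm(url, step, history)
        history.append(likelihood)
        if likelihood >= upper:      # confident enough: phishing
            return "phishing", history
        if likelihood <= lower:      # confident enough: benign
            return "benign", history
    # Thresholds never crossed: conservatively label as phishing.
    return "phishing", history
```

A stub callback whose estimate rises across iterations (as in the true-positive trajectories the analysis describes) would terminate as soon as the upper threshold is crossed, returning the full likelihood history for inspection.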
The authors evaluate this framework on three publicly available URL datasets—HP, EBBU, and ISCX—using four state‑of‑the‑art LLMs: Gemma‑3‑12B, Llama‑3.1‑8B, GPT‑4.1, and Gemini‑2.5‑Flash. For each dataset they sample a balanced set of 1,000 URLs (500 benign, 500 phishing) and repeat the experiment five times, reporting the mean F1 score. Two baselines are considered: (1) a one‑shot prompting approach previously shown to work for phishing URL detection, and (2) a supervised model called URLTran, a BERT‑based classifier fine‑tuned on large labeled URL corpora.
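The evaluation protocol above (balanced 1,000-URL samples, five repeated runs, mean F1) can be sketched as follows. The helper names are illustrative assumptions, not taken from the paper's code, and the F1 computation treats "phishing" as the positive class.

```python
import random

def f1_score(y_true, y_pred, positive="phishing"):
    """F1 with `positive` as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def balanced_sample(benign, phishing, n_per_class=500, seed=0):
    """Draw a balanced sample (500 benign + 500 phishing per the paper)."""
    rng = random.Random(seed)
    sample = [(u, "benign") for u in rng.sample(benign, n_per_class)]
    sample += [(u, "phishing") for u in rng.sample(phishing, n_per_class)]
    rng.shuffle(sample)
    return sample

def mean_f1(run_scores):
    """Average F1 over repeated runs (the paper reports the mean of five)."""
    return sum(run_scores) / len(run_scores)
```

In the paper's setup, one would draw a fresh balanced sample per run (five seeds per dataset), classify each URL, and report `mean_f1` over the five per-run F1 scores.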
Results show that Least‑to‑Most consistently outperforms the one‑shot baseline, achieving an average F1 of 0.9040 versus 0.8726 for one‑shot—a gain of roughly 0.03 across all models and datasets. The best‑performing LLM, Gemini‑2.5‑Flash, reaches an average F1 of 0.9621 under Least‑to‑Most, which is only 0.028 below the supervised URLTran’s reported 0.99 F1. Other models also benefit: Gemma‑3‑12B improves from 0.8534 (one‑shot) to 0.8872 (Least‑to‑Most) on the HP dataset, and GPT‑4.1 shows marginal but consistent gains across all three datasets.
A deeper analysis focuses on the role of iterations and answer sensitivity. Correct predictions typically require only a few iterations (1–3), while incorrect predictions tend to involve more varied iteration counts. However, a notable subset of correct predictions emerges only after “outlier” numbers of iterations (e.g., 5–7). By tracking the sensitivity value across iterations, the authors demonstrate that many true positives start with a low phishing likelihood (e.g., 25 %) and gradually rise above the upper threshold by the final iteration, thereby correcting an initial misclassification. Conversely, true negatives often start high and fall below the lower threshold. This dynamic adjustment illustrates how the answer‑sensitivity mechanism guides the model toward a more confident decision.
Table 2 further quantifies the comparative advantage: for almost every model‑dataset pair, the number of URLs correctly classified uniquely by Least‑to‑Most exceeds those uniquely correct under one‑shot, while the number of unique errors is lower. Even for the weakest LLM (Llama‑3.1‑8B), Least‑to‑Most reduces errors by over one hundred compared to one‑shot, underscoring the robustness of the approach across model capacities.
The paper’s contributions are threefold: (1) introducing a Least‑to‑Most prompting framework with answer sensitivity for phishing URL detection, (2) empirically demonstrating that this method narrows the performance gap between few‑shot LLMs and fully supervised classifiers, and (3) providing an extensive analysis of how iterative reasoning and sensitivity thresholds improve model confidence and error correction. The authors acknowledge limitations, notably the need to manually set sensitivity thresholds and the increased inference cost due to multiple iterations. Future work is suggested in automatic threshold tuning, cost‑effective iteration control, and integrating additional URL metadata (WHOIS, SSL certificates) or multimodal cues to further boost detection performance.
In summary, the study shows that a carefully engineered prompting strategy can unlock the latent reasoning abilities of LLMs, enabling them to achieve near‑supervised performance on phishing URL detection without requiring large labeled training sets. This finding has practical implications for security operations where rapid deployment and interpretability are essential, and it opens avenues for applying similar iterative prompting techniques to other cybersecurity classification tasks.