LLM-FS: Zero-Shot Feature Selection for Effective and Interpretable Malware Detection

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Feature selection (FS) remains essential for building accurate and interpretable detection models, particularly in high-dimensional malware datasets. Conventional FS methods such as Extra Trees, Variance Threshold, tree-based models, Chi-Squared tests, ANOVA, Random Selection, and Sequential Attention rely primarily on statistical heuristics or model-driven importance scores, often overlooking the semantic context of features. Motivated by recent progress in LLM-driven FS, we investigate whether large language models (LLMs) can guide feature selection in a zero-shot setting, using only feature names and task descriptions, as a viable alternative to traditional approaches. We evaluate multiple LLMs (GPT-5.0, GPT-4.0, Gemini-2.5, etc.) on the EMBOD dataset (a fusion of the EMBER and BODMAS benchmark datasets), comparing them against established FS methods across several classifiers, including Random Forest, Extra Trees, MLP, and KNN. Performance is assessed using accuracy, precision, recall, F1, AUC, MCC, and runtime. Our results demonstrate that LLM-guided zero-shot feature selection achieves performance competitive with traditional FS methods while offering additional advantages in interpretability, stability, and reduced dependence on labeled data. These findings position zero-shot LLM-based FS as a promising alternative strategy for effective and interpretable malware detection, paving the way for knowledge-guided feature selection in security-critical applications.


💡 Research Summary

The paper introduces LLM‑FS, a zero‑shot feature‑selection framework that leverages large language models (LLMs) to rank and select features for malware detection without requiring any labeled data beyond the feature names and a brief task description. The authors motivate the work by pointing out the limitations of conventional feature‑selection (FS) techniques—filter methods (e.g., variance threshold, χ², ANOVA), wrapper methods (e.g., sequential feature selection), and embedded methods (e.g., tree‑based importance, LASSO). These traditional approaches rely heavily on statistical heuristics or model‑driven importance scores, are often unstable across runs, and demand substantial labeled data, which can be scarce or costly in security contexts.
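To make the three families of conventional FS methods concrete, the following is a minimal sketch (not from the paper) of filter and embedded baselines using scikit-learn's standard utilities on toy stand-in data; the dataset and column counts here are illustrative only.

```python
# Sketch of conventional FS baselines: filter methods (variance threshold,
# chi-squared, ANOVA) and an embedded method (tree-based importance).
import numpy as np
from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2, f_classif
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
X = rng.random((500, 20))          # toy stand-in for malware feature vectors
y = rng.integers(0, 2, size=500)   # 0 = benign, 1 = malware

# Filter methods: purely statistical heuristics, no trained model required
X_var = VarianceThreshold(threshold=0.01).fit_transform(X)
X_chi = SelectKBest(chi2, k=10).fit_transform(X, y)      # chi2 needs non-negative X
X_anova = SelectKBest(f_classif, k=10).fit_transform(X, y)

# Embedded method: importance scores learned by a tree ensemble
forest = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, y)
top10 = np.argsort(forest.feature_importances_)[::-1][:10]
```

Note how the filter methods need only summary statistics of `X` (and labels `y` for chi-squared/ANOVA), whereas the embedded method requires fitting a full model; both depend on labeled data, which is the dependence the zero-shot LLM approach aims to relax.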

Dataset and Experimental Setup
To evaluate their proposal, the authors construct a new benchmark called EMBOD, which fuses the static‑analysis EMBER dataset with the dynamic‑behavior BODMAS dataset, yielding roughly 200 features spanning API call frequencies, file metadata, opcode statistics, and network activity. For each feature they compute a set of descriptive statistics: global mean, variance, median, min, max, inter‑quartile range, as well as class‑conditional means, standard deviations, and the mean difference between malware and benign samples (Δµ). These statistics are embedded into a structured prompt together with a concise task description (“classify whether a given file is malware or benign”).
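The per-feature statistics described above can be sketched as follows; this is an illustrative reimplementation, not the authors' code, and the dictionary keys are hypothetical names for the quantities listed in the text.

```python
import numpy as np

def feature_statistics(X, y, names):
    """Per-feature descriptive statistics of the kind fed into the LLM prompt:
    global mean/variance/median/min/max/IQR, class-conditional means and
    standard deviations, and the malware-benign mean difference (delta_mu)."""
    stats = {}
    for j, name in enumerate(names):
        col = X[:, j]
        mal, ben = col[y == 1], col[y == 0]   # class-conditional slices
        q1, q3 = np.percentile(col, [25, 75])
        stats[name] = {
            "mean": col.mean(), "variance": col.var(),
            "median": np.median(col), "min": col.min(), "max": col.max(),
            "iqr": q3 - q1,
            "mean_malware": mal.mean(), "std_malware": mal.std(),
            "mean_benign": ben.mean(), "std_benign": ben.std(),
            "delta_mu": mal.mean() - ben.mean(),  # Δµ between classes
        }
    return stats
```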

Four LLMs are queried in a zero‑shot manner: GPT‑5.0, GPT‑4.0, GPT‑4.0‑mini, and Gemini‑2.5. Each model receives the same prompt for every feature and returns a scalar importance score in the range
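A structured prompt of the kind described can be sketched as below; the exact wording and field layout are not given in the summary, so this template and the `build_prompt` helper are hypothetical illustrations of how the statistics and task description might be combined for one feature.

```python
def build_prompt(name, s,
                 task="classify whether a given file is malware or benign"):
    """Hypothetical zero-shot prompt for a single feature, combining the task
    description with the feature's descriptive statistics (exact format is an
    assumption, not taken from the paper)."""
    return (
        f"Task: {task}.\n"
        f"Feature: {name}\n"
        f"Global: mean={s['mean']:.4f}, variance={s['variance']:.4f}, "
        f"median={s['median']:.4f}, min={s['min']:.4f}, "
        f"max={s['max']:.4f}, iqr={s['iqr']:.4f}\n"
        f"Malware class: mean={s['mean_malware']:.4f}, std={s['std_malware']:.4f}\n"
        f"Benign class: mean={s['mean_benign']:.4f}, std={s['std_benign']:.4f}\n"
        f"Mean difference (delta_mu): {s['delta_mu']:.4f}\n"
        "Return a single scalar importance score for this feature."
    )

# Example with made-up statistics for one feature
example_stats = {"mean": 0.42, "variance": 0.05, "median": 0.40,
                 "min": 0.0, "max": 1.0, "iqr": 0.3,
                 "mean_malware": 0.61, "std_malware": 0.12,
                 "mean_benign": 0.28, "std_benign": 0.09,
                 "delta_mu": 0.33}
prompt = build_prompt("api_call_freq_CreateRemoteThread", example_stats)
```

Each LLM would then be queried with one such prompt per feature, and the returned scores used to rank and select the top features.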

