Efficient and Adaptable Detection of Malicious LLM Prompts via Bootstrap Aggregation

Notice: This research summary and analysis were automatically generated using AI. For absolute accuracy, please refer to the original arXiv source.

Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding, reasoning, and generation. However, these systems remain susceptible to malicious prompts that induce unsafe or policy-violating behavior through harmful requests, jailbreak techniques, and prompt injection attacks. Existing defenses face fundamental limitations: black-box moderation APIs offer limited transparency and adapt poorly to evolving threats, while white-box approaches using large LLM judges impose prohibitive computational costs and require expensive retraining for new attacks. Current systems force designers to choose between performance, efficiency, and adaptability. To address these challenges, we present BAGEL (Bootstrap AGgregated Ensemble Layer), a modular, lightweight, and incrementally updatable framework for malicious prompt detection. BAGEL employs a bootstrap-aggregation and mixture-of-experts-inspired ensemble of fine-tuned models, each specialized on a different attack dataset. At inference, BAGEL uses a random forest router to identify the most suitable ensemble member, then applies stochastic selection to sample additional members for prediction aggregation. When new attacks emerge, BAGEL updates incrementally by fine-tuning a small prompt-safety classifier (86M parameters) and adding the resulting model to the ensemble. BAGEL achieves an F1 score of 0.92 by selecting just 5 ensemble members (430M parameters), outperforming the OpenAI Moderation API and ShieldGemma, which require billions of parameters. Performance remains robust after nine incremental updates, and BAGEL provides interpretability through its router’s structural features. Our results show that ensembles of small fine-tuned classifiers can match or exceed billion-parameter guardrails while offering the adaptability and efficiency required for production systems.


💡 Research Summary

The paper introduces BAGEL (Bootstrap Aggregated Ensemble Layer), a novel framework for detecting malicious prompts directed at large language models (LLMs). Existing defenses fall into two categories: black‑box moderation APIs (e.g., OpenAI Moderation API, Perspective) that lack transparency and adapt poorly to new attacks, and white‑box approaches that employ large LLM judges, which achieve high detection accuracy but are computationally prohibitive for real‑time, low‑latency deployments. BAGEL seeks to reconcile the three competing desiderata of performance, efficiency, and adaptability by constructing a modular ensemble of lightweight, fine‑tuned classifiers and an interpretable routing mechanism.

Core Architecture

  • Base Classifier: Each ensemble member (“prompt‑cop”) is a fine‑tuned version of Prompt Guard 2, an 86‑million‑parameter binary safety model. This keeps per‑model inference cheap and enables the simultaneous evaluation of multiple members.
  • Bootstrap‑Inspired Diversity: Unlike classic bagging, where each model is trained on a different bootstrap sample of the same dataset, BAGEL trains each prompt‑cop on a distinct attack dataset (e.g., jailbreak, prompt injection, direct harmful requests). This dataset‑level diversification yields experts that specialize in different threat families, improving overall robustness.
  • Mixture‑of‑Experts Routing: At inference time a random‑forest router examines structural features of the incoming prompt (token length, special‑character ratios, presence of role‑playing cues, etc.) and predicts the most suitable prompt‑cop. The router’s decision tree structure provides transparent feature‑importance scores, allowing operators to understand why a prompt is flagged.
  • Stochastic Subset Selection: After the router’s top choice is identified, BAGEL randomly samples an additional k prompt‑cops (the paper uses k = 4, yielding a total of five models per query). The final maliciousness score is obtained by aggregating the binary predictions (majority vote or averaged probability). This dual‑routing strategy mirrors a mixture‑of‑experts with “safety‑in‑depth”: if the primary expert fails on a novel pattern, the stochastic peers can compensate.
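The routing-plus-stochastic-selection pipeline above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the feature set, the `router` and `experts` interfaces, and the majority-vote aggregation are all assumptions based on the description.

```python
# Minimal sketch of BAGEL-style inference. All names and the exact
# feature set are illustrative assumptions, not the paper's code.
import random
from collections import Counter

def extract_features(prompt: str) -> list[float]:
    """Structural features of the kind the router is said to use."""
    n = max(len(prompt), 1)
    return [
        len(prompt.split()),                                   # token-length proxy
        sum(not c.isalnum() and not c.isspace() for c in prompt) / n,  # special-char ratio
        float("ignore" in prompt.lower()),                     # instruction-override cue
        float("act as" in prompt.lower()),                     # role-playing cue
    ]

def bagel_predict(prompt, router, experts, k=4, seed=None):
    """Route to the best expert, sample k peers, majority-vote.

    router:  callable mapping a feature vector -> expert index
    experts: list of callables, each prompt -> 0 (benign) / 1 (malicious)
    """
    rng = random.Random(seed)
    top = router(extract_features(prompt))                     # primary expert
    peers = rng.sample([i for i in range(len(experts)) if i != top], k)
    votes = [experts[i](prompt) for i in [top] + peers]
    return int(Counter(votes).most_common(1)[0][0])            # majority vote
```

With k = 4 and one routed expert, each query touches five models, matching the five-member (430 M-parameter) configuration reported in the paper.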

Training and Incremental Updates
When a new attack vector emerges, the defender simply fine‑tunes another Prompt Guard 2 model on the fresh dataset, adds it to the ensemble, and retrains the random‑forest router (and optionally the decision threshold). No full‑system retraining is required, dramatically reducing the computational cost of updates. The authors demonstrate nine sequential updates, each adding a new dataset, while the overall F1 score never falls below 0.92.
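The update protocol is simple enough to state as pseudocode-like Python. The interfaces here (`finetune`, `router_trainer`) are hypothetical stand-ins for the paper's fine-tuning of an 86 M-parameter Prompt Guard 2 checkpoint and the random-forest refit:

```python
# Sketch of BAGEL's incremental update loop (assumed interfaces).
def incremental_update(ensemble, router_trainer, new_dataset, finetune):
    """Add one expert for a newly observed attack family.

    finetune:       callable, dataset -> fitted prompt-safety classifier
    router_trainer: callable, ensemble -> retrained random-forest router
    """
    new_expert = finetune(new_dataset)   # only the new small model is trained
    ensemble.append(new_expert)          # existing experts are untouched
    router = router_trainer(ensemble)    # cheap: refit the router only
    return ensemble, router
```

Because only the new classifier and the router are (re)trained, the cost of each update is independent of how many attack families the system already covers.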

Empirical Evaluation

  • Datasets: Nine large‑scale, publicly available malicious‑prompt corpora covering jailbreaks, prompt injections, and direct harmful requests (total >1 M samples).
  • Metrics: Attack Success Rate (ASR), False Positive Rate (FPR), and F1 score.
  • Results: Using only five prompt‑cops (effective parameter count 430 M), BAGEL achieves an F1 of 0.922, ASR = 0.095, and FPR = 0.066. This outperforms OpenAI Moderation API and ShieldGemma, both of which rely on multi‑billion‑parameter models. The ensemble’s performance remains stable across the nine incremental updates, confirming its adaptability.
  • Interpretability: Feature‑importance analysis of the random‑forest router highlights cues such as “ignore system instructions”, “role‑play directives”, and “long code block insertion” as strong indicators of malicious intent, aligning with known jailbreak and injection patterns.
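For readers cross-checking the reported numbers, the three metrics relate to standard confusion-matrix counts as below. Interpreting ASR as the fraction of malicious prompts that evade detection is an assumption on our part, though it is consistent with the reported figures:

```python
# Relating the reported metrics to confusion-matrix counts.
# ASR is interpreted here as the miss rate on malicious prompts (assumption).
def detection_metrics(tp, fp, tn, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                         # equals 1 - ASR under this reading
    f1 = 2 * precision * recall / (precision + recall)
    asr = fn / (fn + tp)                            # malicious prompts that slip through
    fpr = fp / (fp + tn)                            # benign prompts wrongly flagged
    return f1, asr, fpr
```

For example, 90 detected and 10 missed malicious prompts alongside 5 false alarms out of 100 benign prompts yields F1 ≈ 0.92, ASR = 0.10, FPR = 0.05, close to the paper's reported operating point.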

Strengths

  1. Efficiency: The total inference footprint (430 M parameters) is an order of magnitude smaller than monolithic safety models, enabling low‑latency deployment on commodity hardware.
  2. Adaptability: Incremental addition of new prompt‑cops avoids expensive end‑to‑end retraining, allowing rapid response to emerging threats.
  3. Transparency: The random‑forest router provides interpretable decision paths and feature importance, facilitating auditability and policy refinement.
  4. Robustness: The combination of expert routing and stochastic ensemble voting yields “safety‑in‑depth”, reducing vulnerability to any single model’s blind spots.

Limitations and Future Work

  • Feature Dependence: The router’s effectiveness hinges on the engineered structural features; entirely novel prompt formats (e.g., multimodal prompts mixing images and text) may evade detection.
  • Binary Classification Scope: BAGEL currently outputs a benign/malicious label. Extending to multi‑class attack taxonomy could improve downstream response (e.g., applying different mitigation strategies).
  • Memory Footprint: Although lightweight compared to billion‑parameter guards, each 86 M model still imposes a non‑trivial memory load for ultra‑resource‑constrained edge devices. Future research could explore knowledge distillation to sub‑10 M models or parameter‑efficient adapters.
  • Scalability of Routing: As the ensemble grows, the random‑forest may need re‑training to maintain optimal routing decisions; hierarchical routing or learned gating networks could be investigated.

Conclusion
BAGEL demonstrates that a carefully engineered ensemble of small, dataset‑specialized classifiers, coupled with an interpretable routing and stochastic selection mechanism, can deliver state‑of‑the‑art malicious‑prompt detection with a fraction of the computational budget required by existing solutions. Its incremental update protocol and transparent decision logic make it a practical candidate for production‑grade LLM safety pipelines, especially in environments where latency, cost, and rapid adaptation to evolving adversarial techniques are paramount.

