WADBERT: Dual-channel Web Attack Detection Based on BERT Models
Web attack detection is the first line of defense for securing web applications, designed to preemptively identify malicious activities. Deep learning-based approaches are increasingly popular because they automatically learn complex patterns and extract semantic features from HTTP requests, achieving superior detection performance. However, existing methods are less effective at embedding irregular HTTP requests, and often fail to model unordered parameters or to trace an attack back to the responsible parameters. In this paper, we propose an effective web attack detection model named WADBERT. It achieves high detection accuracy while enabling precise identification of malicious parameters. To this end, we first employ Hybrid Granularity Embedding (HGE) to generate fine-grained embeddings for the URL and payload parameters. Then, URLBERT and SecBERT are used to extract their respective semantic features. Next, the parameter-level features extracted by SecBERT are fused through a multi-head attention mechanism, yielding a comprehensive payload feature. Finally, the concatenated URL and payload features are fed into a linear classifier to obtain the final detection result. Experimental results on the CSIC2010 and SR-BH2020 datasets validate the efficacy of WADBERT, which achieves F1-scores of 99.63% and 99.50% respectively, significantly outperforming state-of-the-art methods.
💡 Research Summary
Web applications are constantly targeted by a wide range of attacks such as SQL injection, XSS, and command injection. Detecting malicious HTTP requests before they reach the application is therefore a critical defensive measure. Existing detection approaches—rule‑based, classic machine‑learning with handcrafted features, and deep learning models based on CNNs, RNNs, or generic Transformers—suffer from three fundamental drawbacks. First, URLs and request payloads contain many non‑standard symbols, mixed‑case tokens, and URL‑encoded fragments that are poorly handled by standard sub‑word tokenizers (BPE, WordPiece) designed for natural language. Second, HTTP payload parameters are inherently unordered, yet most deep models treat them as ordered sequences, making them vulnerable to simple parameter permutation attacks. Third, while many models achieve high overall detection accuracy, they provide little insight into which specific parameters are responsible for a malicious verdict, limiting their usefulness for incident response and remediation.
To address these issues, the authors propose WADBERT, a dual‑channel architecture that processes the URL and the payload parameters separately, enriches their representations with a novel Hybrid Granularity Embedding (HGE), and fuses the resulting features using a multi‑head attention mechanism that respects the unordered nature of parameters and yields interpretable attention weights.
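The claim that attention-based fusion "respects the unordered nature of parameters" follows from a general property: self-attention without positional encodings is permutation-equivariant, so any pooled output over parameter features is invariant to parameter order. The following NumPy sketch illustrates this with a single attention head and toy dimensions; the function name, pooling choice, and sizes are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(params: np.ndarray, Wq, Wk, Wv) -> np.ndarray:
    """Single-head self-attention over parameter-level feature vectors,
    followed by mean pooling. With no positional encodings, the result
    does not depend on the order in which parameters appear."""
    Q, K, V = params @ Wq, params @ Wk, params @ Wv
    scores = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return (scores @ V).mean(axis=0)  # order-invariant payload feature

rng = np.random.default_rng(0)
d = 8                                  # toy feature dimension (assumption)
params = rng.normal(size=(3, d))       # three parameter-level features
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

out1 = attention_pool(params, Wq, Wk, Wv)
out2 = attention_pool(params[[2, 0, 1]], Wq, Wk, Wv)  # permuted parameters
assert np.allclose(out1, out2)         # same fused feature either way
```

This is exactly why reordering payload parameters, a simple evasion against sequence models, does not change the fused payload feature here.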
Hybrid Granularity Embedding (HGE)
HGE combines the strengths of sub‑word embeddings and character‑level modeling. An input string is first tokenized with WordPiece, producing a token sequence. In parallel, the raw character sequence is embedded and fed into a bidirectional GRU. For each token, forward and backward differential representations are computed from the GRU hidden states at the token’s start and end positions. These differential vectors capture fine‑grained character variations (e.g., case changes, URL‑encoded symbols, camel‑case identifiers). After a linear projection that aligns them with the WordPiece embedding space, the character‑level vectors are summed with the original WordPiece embeddings, yielding a hybrid token embedding that preserves both semantic sub‑word information and detailed character patterns. This design enables robust handling of symbol‑dense strings such as “%27”, “1%2B=1%20”, or “getUserName”.
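The mechanics above can be sketched in a few dozen lines of NumPy. For brevity this sketch substitutes a plain tanh RNN for the paper's bidirectional GRU, hashes characters into a toy embedding table, and hand-picks a WordPiece-style split; all names, dimensions, and the random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                        # embedding dim (assumption)

def rnn_states(chars: np.ndarray, W, U) -> np.ndarray:
    """Hidden states h_0..h_T of a simple tanh RNN over character
    embeddings (a stand-in for the paper's GRU; h_0 is the zero state)."""
    h = np.zeros(d)
    states = [h]
    for c in chars:
        h = np.tanh(W @ c + U @ h)
        states.append(h)
    return np.array(states)

def hge_embedding(text, spans, wp_emb, char_emb, Wf, Uf, Wb, Ub, P):
    """Hybrid Granularity Embedding sketch. `spans` gives each token's
    (start, end) character offsets and `wp_emb` its sub-word embeddings.
    Differential vectors are taken from forward/backward RNN states at
    the token boundaries, projected by P, and added to wp_emb."""
    chars = char_emb[[ord(c) % 128 for c in text]]   # char embeddings
    fwd = rnn_states(chars, Wf, Uf)                  # left-to-right pass
    bwd = rnn_states(chars[::-1], Wb, Ub)[::-1]      # right-to-left pass
    hybrid = []
    for (s, e), w in zip(spans, wp_emb):
        diff = np.concatenate([fwd[e] - fwd[s],      # forward differential
                               bwd[s] - bwd[e]])     # backward differential
        hybrid.append(w + P @ diff)                  # project, align, sum
    return np.array(hybrid)

text = "getUserName"
spans = [(0, 3), (3, 7), (7, 11)]                    # toy split: get|User|Name
wp_emb = rng.normal(size=(3, d))
char_emb = rng.normal(size=(128, d))
Wf, Uf, Wb, Ub = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
P = rng.normal(size=(d, 2 * d)) * 0.1
out = hge_embedding(text, spans, wp_emb, char_emb, Wf, Uf, Wb, Ub, P)

out_lower = hge_embedding(text.lower(), spans, wp_emb, char_emb,
                          Wf, Uf, Wb, Ub, P)
# The character-level differentials make the embedding case-sensitive:
# out and out_lower differ even with identical sub-word embeddings.
```

The case-sensitivity check at the end mirrors the paper's motivation: a pure sub-word tokenizer would lower-case or split "getUserName" and lose the character-level signal that the differential vectors recover.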
Dual‑Channel Feature Extraction
URL Channel: The URL is normalized (duplicate slashes collapsed, characters lower-cased) and prefixed with the HTTP method (GET, POST, etc.), since the method often correlates with malicious behavior. The normalized URL is tokenized, wrapped with BERT's special [CLS] and [SEP] tokens, and fed into URLBERT to extract the URL feature.
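The normalization step described above is simple enough to sketch directly; the function name and the exact order of operations are illustrative assumptions, not taken from the paper's code.

```python
import re

def normalize_url(method: str, url: str) -> str:
    """Normalize a request URL as described: collapse duplicate slashes,
    lower-case the string, and prefix the HTTP method so the tokenizer
    sees it as part of the input."""
    url = re.sub(r"/{2,}", "/", url)   # collapse duplicate slashes
    url = url.lower()                  # case normalization
    return f"{method.upper()} {url}"   # prepend the HTTP method

# Example (path borrowed from the CSIC2010 dataset's web-shop app):
# normalize_url("get", "/Tienda1//publico/anadir.jsp")
# → "GET /tienda1/publico/anadir.jsp"
```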