Seeing the Unseen: Rethinking Illicit Promotion Detection with In-Context Learning
Authors: Sangyi Wu, Junpu Guo, Xianghang Mi
Sangyi Wu∗, Junpu Guo∗, Xianghang Mi∗†
∗University of Science and Technology of China, Hefei, China
†Monash University, Melbourne, Australia

Abstract—Illicit online promotion is a persistent, cross-platform threat that continuously evolves to evade detection. Existing moderation systems, however, remain tethered to platform-specific supervision and static taxonomies—a reactive paradigm that struggles to generalize across domains, adapt to emerging categories, or uncover novel threats before they proliferate. This paper presents a systematic study of In-Context Learning (ICL) as a unified framework for illicit promotion detection across heterogeneous platforms. Through rigorous analysis of prompt design, we establish that properly configured ICL achieves performance comparable to fine-tuned models using 22x fewer labeled examples. More importantly, we demonstrate three capabilities that fundamentally shift the moderation paradigm: (1) seeing the unseen—ICL generalizes to entirely new illicit categories without any category-specific demonstrations, incurring a performance drop of less than 6% for over half of the 12 evaluated categories; (2) autonomous discovery—a novel two-stage pipeline distills over 2,900 free-form labels into coherent taxonomies, surfacing eight previously undocumented illicit categories including usury and illegal immigration; and (3) cross-platform generalization—deployed on 200,000 real-world samples from search engines and Twitter without any platform adaptation, ICL achieves 92.6% accuracy, with 61.8% of its uniquely flagged samples corresponding to borderline or obfuscated promotional content missed by existing detectors.
Our findings position ICL as a new paradigm for content moderation—one that combines the precision of specialized classifiers with uniquely powerful capabilities: cross-platform generalization without adaptation, autonomous discovery of emerging threats, and dramatic reductions in labeled data requirements. By shifting from static, taxonomy-driven supervision to inference-time reasoning, ICL offers a path toward moderation systems that are not just reactive but proactively adaptive. The source code of this work is available at https://github.com/ChaseSecurity/illicit-icl.

(Corresponding author: xianghangmi@gmail.com)

I. INTRODUCTION

The internet's underground economy thrives on a persistent, shape-shifting threat: illicit online promotion. From drug trafficking and counterfeit goods to illegal gambling and financial fraud, this content permeates search engines, social media, and messaging apps [10, 29, 38, 13, 36]. While its societal harm is undeniable, what makes this threat particularly insidious is its cross-platform, adaptive nature. Adversaries do not confine themselves to a single service; they strategically distribute their operations—using search engine optimization (SEO) to attract victims, social media for amplification, and encrypted messaging for final transactions [33, 12, 28, 38]. This constant migration exploits the structural and linguistic gaps between platforms, creating a moving target that renders traditional, platform-specific defenses brittle [27, 7, 13].

For years, machine learning-based moderation has fought this battle with a reactive playbook: train a classifier on labeled data for a specific platform, then hope it holds up. But in the open world of online threats, this approach is failing. The emergence of new illicit categories, intentional linguistic obfuscation, and distribution shifts across platforms all cause the performance of fine-tuned models to degrade precipitously.
The result is a costly, unsustainable cycle of continuous data annotation, retraining, and manual intervention—a paradigm ill-suited for a problem defined by constant evolution. What we need is not just a better classifier, but a fundamental shift: from a closed-world assumption of fixed categories to an open-world capability for generalization and discovery.

In-Context Learning (ICL) with Large Language Models (LLMs) offers precisely this paradigm shift [3]. By conditioning on a handful of examples provided in a prompt, ICL enables models to learn new tasks at inference time without any parameter updates. This unlocks the potential for a unified, platform-agnostic moderation framework that can leverage the vast semantic knowledge learned during pretraining to adapt on the fly. However, deploying ICL in the adversarial, multilingual, and heterogeneous landscape of illicit promotion is not straightforward. It forces us to confront a series of open questions. How do we design prompts that are robust to adversarial text? Can ICL truly generalize to entirely new, unseen categories of illicit activity without retraining? And can we move beyond simple classification to use ICL as a tool for autonomously discovering emerging threats from raw, unlabeled data?

This paper provides the first systematic study to answer these questions, transforming ICL from a promising technique into a practical framework for next-generation content moderation. Our investigation spans controlled experiments to large-scale, real-world deployments, with four primary contributions.

A blueprint for effective ICL in security. Through rigorous analysis of prompt configurations—demonstration quantity, selection strategies, and label design—we uncover the inductive biases and critical configurations that make ICL robust for adversarial content.
We show that with optimal design, ICL achieves performance comparable to fine-tuned models using 22x fewer labeled examples, and that explicit semantic labels are essential—without them, the false positive rate jumps from 8% to 42%.

The ability to see the unseen. In a stringent open-world evaluation, we demonstrate that ICL generalizes effectively to entirely new illicit categories without any category-specific examples, relying instead on shared semantic characteristics of illicit intent. For more than half of the 12 evaluated categories, excluding the target category from demonstrations incurs a performance drop of less than 6% relative to the category-included setting.

Autonomous discovery of emerging threats. Moving beyond static taxonomies, we design a novel two-stage pipeline where ICL first generates open-ended category labels for unlabeled data, then self-consolidates them into coherent, novel threat taxonomies. This process distills over 2,900 free-form labels into meaningful clusters and autonomously surfaces eight distinct new illicit categories—including usury, illegal immigration, and software piracy—absent from prior studies.

Validation in the wild. On large-scale, cross-platform data (200,000 samples from search engines and Twitter), our ICL framework outperforms platform-specific baselines without any adaptation, achieving 92.6% accuracy. Crucially, 61.8% of samples uniquely flagged by ICL correspond to valid borderline or heavily obfuscated promotional content missed by existing detectors, positioning ICL as a powerful high-recall discovery funnel for hybrid moderation systems.

Overall, this work reframes illicit promotion detection as an adaptive, open-world challenge and demonstrates that in-context learning provides a viable framework for addressing it.
By shifting from static, taxonomy-driven supervision toward inference-time reasoning, we pave the way for moderation systems that can keep pace with the very threats they aim to stop. To support reproducibility and facilitate future research, we release our code, prompt templates, and curated benchmark datasets at https://github.com/ChaseSecurity/illicit-icl under permissive licensing.

II. RELATED WORKS

A. Illicit Online Promotion

Illicit online promotion represents a persistent and evolving threat where miscreants leverage the internet's vast reach to promote illicit goods and services. To maximize visibility and traffic, adversaries have developed a diverse arsenal of techniques targeting search engines, legitimate web infrastructure, and online social networks.

Early research into illicit promotion largely focused on blackhat search engine optimization (SEO). Wu and Davison [33] identified the use of link farms, where densely connected networks of spam pages are created to manipulate graph-based ranking algorithms. As search engines improved their defenses, attackers evolved to adopt cloaking techniques, which involve serving benign content to search engine crawlers while displaying malicious content to human visitors. Wang et al. [27] characterized this dynamic, and Invernizzi et al. [7] further formalized it, developing systems to detect when machines browse a different web than humans.

A parallel strategy involves compromising legitimate, high-reputation websites to exploit their domain authority. Leontiadis et al. [10] analyzed search-redirection attacks, demonstrating how compromised high-value sites (e.g., .edu domains) were injected with code to divert search traffic to illicit online pharmacies. Building on this, Liao et al.
[13] proposed detection methods based on identifying the semantic inconsistency between such injected promotional content and the victim site's original context.

Beyond direct compromise, adversaries abuse public infrastructure to host and promote content. Liao et al. [12] uncovered long-tail SEO spam hosted on reputable cloud platforms, which used doorway pages to monetize traffic through affiliate programs. Similarly, the abuse of local business listings has become a critical vector. Huang et al. [6] and Wang et al. [29] investigated how miscreants poison local search services (e.g., Google Maps) with fake listings to promote affiliate fraud and illicit drugs. More recently, Lin et al. [14] demonstrated adversarial Wiki search poisoning, where attackers manipulate open-collaboration platforms to boost the ranking of illicit content. Distinct from these injection-based methods, Wu et al. [34] identified reflected search poisoning, a technique where attackers abuse legitimate URL reflection schemes on benign websites to index illicit promotion texts without compromising the server.

With the rise of social media, illicit promotion has expanded beyond search engines. Wang et al. [28] performed a large-scale analysis of illicit promotion on Twitter, identifying millions of posts across diverse categories such as data leakage, pornography, and contraband. Yang et al. [36] examined the ecosystem of illegal online gambling, revealing a complex profit chain involving promotion via social apps, blackhat SEO, and third-party payment abuse. Furthermore, modern illicit promotion often exhibits complex cross-platform behaviors. Zha et al. [38] revealed a cross-platform referral traffic ecosystem, where drug dealers utilize video platforms like TikTok and YouTube to attract potential buyers before redirecting them to encrypted messaging apps or storefronts on other platforms to consummate the sale.
Collectively, these studies illustrate an arms race where illicit promotion strategies continuously shift across platforms and modalities. The inherent heterogeneity of these distribution channels poses a significant and ongoing challenge for traditional, static detection systems.

B. In-Context Learning

Large Language Models (LLMs) have demonstrated a remarkable emergent ability known as In-Context Learning (ICL), which allows models to perform novel tasks by conditioning on a few input-output pairs provided in the prompt, without updating any model parameters [3]. Unlike traditional supervised fine-tuning, ICL enables rapid adaptation to downstream tasks through inference alone, fundamentally shifting the paradigm of NLP system deployment. ICL is distinct from related concepts like prompt learning [16] and standard few-shot learning [30]. Its defining characteristics offer several key advantages: it supports interpretable human-LLM interaction through natural language demonstrations [3]; its analogical reasoning mechanism aligns with human cognitive processes [32]; and its training-free nature significantly reduces the adaptation cost for deploying large models in real-world scenarios [25].

To maximize the effectiveness of ICL, the selection and presentation of demonstration examples are critical. Empirical research has established several best practices. While increasing the number of demonstrations generally improves performance [2], the marginal gains often diminish beyond a certain point [3, 19]. More importantly, regarding demonstration selection, dynamically retrieving customized demonstrations tailored to each specific query has yielded significant performance gains over the traditional approach of using a fixed set of few-shot demonstrations for all queries [18, 37].
To enable this, unsupervised selection methods have been developed, utilizing metrics such as k-nearest neighbor retrieval based on sentence embeddings [15, 20], mutual information [24], perplexity [4], and various information-theoretic scores [35, 11]. Beyond the selection and quantity of examples, the order in which they appear also significantly impacts model behavior, with performance being highly sensitive to prompt ordering [17] and a recency bias often observed in model predictions [41].

The flexibility of ICL has driven its adoption in content moderation and security applications, where data scarcity and evolving threats pose challenges for supervised models. Wang and Chang [31] utilized generative prompt-based inference for zero-shot toxicity detection, demonstrating advantages in handling long-tail phenomena. Recent frameworks have integrated ICL into more comprehensive moderation pipelines. For example, LLM-Mod [9] leverages LLMs to identify rule violations in online communities, employing multi-step prompting to enhance reasoning about guidelines. Similarly, Zhang et al. [39] proposed bootstrapping LLMs via a Decision-Tree-of-Thought prompting strategy to detect toxic content and distill rationales into smaller models. In the realm of factuality and fairness, Zhang et al. [40] introduced a unified checking framework that combines fact generation with ethical classification prompts. While these works demonstrate the efficacy of ICL for specific moderation tasks, our work extends this paradigm to the detection of illicit online promotion, a distinct challenge characterized by cross-platform distribution and adversarial adaptation.

III. EXPERIMENTAL SETUP

A. Datasets

To comprehensively evaluate the efficacy of in-context learning (ICL) for detecting and understanding illicit online promotion, we construct two purpose-built datasets: a Binary Dataset for illicit vs.
benign classification and a fine-grained Multiclass Dataset for identifying specific illicit categories. These datasets are derived from two large-scale, real-world sources to ensure diversity and practical relevance.

Data Sources. The data employed in this study draw upon two existing research initiatives examining illicit promotion across distinct online platforms, thereby capturing varied distribution channels and linguistic contexts. The first source consists of Illicit Promotion Texts (IPTs) collected via Reflected Search Poisoning (RSP) techniques, as detailed in the work by Wu et al. [34]. This dataset comprises over 11 million entries spanning 14 illicit categories—including drug trading, data theft, and hacking services—generated by manipulating high-ranking websites to poison search engine results. It thus represents a covert, web-based promotional vector. The second source derives from a study of illicit promotion within Online Social Networks (OSNs), specifically Twitter, as presented by Wang et al. [28]. This collection contains 12 million Posts of Illicit Promotion (PIPs) across 10 illicit categories such as drugs, gambling, and weapon sales, posted in five major natural languages. It reflects overt promotional activities within public social media discourse.

Dataset Integration and Category Alignment. The original sources used different categorization schemas. To create a coherent multiclass dataset, we performed a semantic correlation and unification of categories across the two works. This process consolidated overlapping categories (e.g., "Illegal Drug Sales" from IPTs and "drug" from PIPs) and resolved naming discrepancies, resulting in a unified taxonomy of 12 distinct illicit categories. The mapping is detailed in Table I.

TABLE I: TAXONOMY UNIFICATION MAPPING.
Alignment of heterogeneous categories from Twitter and search engine datasets into the unified taxonomy used in this study.

Unified Category | Source: Twitter [28]  | Source: Search Engine [34]
porn             | porn                  | Illegal Sex
surrogacy        | surrogacy             | Illegal Surrogacy
gambling         | gambling              | Gambling
drug             | drug                  | Illegal Drug Sales
weapon           | weapon                | Illegal Weapon Sales
data-theft       | data leakage          | Data Theft
money-laundry    | money-laundry         | Money Laundering
advertisement    | crowdturfing          | Black Hat SEO & Adv.
counterfeit      | fake document         | Fake Account; Fake Certificate; Counterfeit Goods
hacking          | –                     | Hacking Service
fraud            | –                     | Financial Fraud
others           | harassment            | Others

Balanced Dataset Construction. To mitigate potential model bias, a balanced dataset was constructed for the binary classification task distinguishing illicit from benign content. The original data comprised 9,352 illicit and 4,395 benign instances collected from two distinct sources. Stratified resampling was applied to achieve both class balance and source diversity. The resulting Binary Dataset consists of 5,600 samples, maintaining an equal distribution across labels and sources. Specifically, the dataset contains 2,800 illicit and 2,800 benign samples, with each label containing 1,400 samples from Twitter and 1,400 from search engine data. Then, it was divided into a training set and a test set in a 4:1 ratio.

For the multiclass categorization task, a separate balanced dataset was assembled across 12 unified illicit categories and one benign class. Each of these 13 categories was resampled to contain 500 instances, yielding a total of 6,500 samples. To preserve source diversity where feasible, a 1:1 source ratio was targeted; however, for categories underrepresented in a particular source (such as "hacking", which was absent from Twitter), samples were drawn exclusively from the available source.
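The balancing scheme described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: each sample is represented as a dict with hypothetical `label`, `source`, and `text` fields, and we draw an equal number of samples per (label, source) cell.

```python
import random

def balanced_resample(samples, per_cell, seed=0):
    """Stratified resampling: draw up to `per_cell` samples for every
    (label, source) combination, then shuffle the balanced pool."""
    rng = random.Random(seed)
    cells = {}
    for s in samples:
        # Group samples by their (label, source) stratum.
        cells.setdefault((s["label"], s["source"]), []).append(s)
    balanced = []
    for _, pool in sorted(cells.items()):
        # If a cell is underrepresented, take everything it has,
        # mirroring the fallback described for categories like "hacking".
        balanced.extend(rng.sample(pool, min(per_cell, len(pool))))
    rng.shuffle(balanced)
    return balanced
```

With per_cell = 1,400 over two labels and two sources, this yields the 5,600-sample Binary Dataset described above.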
The final Multiclass Dataset contains 3,264 samples from Twitter and 3,236 from search engines. Linguistically, the dataset is diverse, predominantly composed of Chinese (68.3%) and English (19.9%), with additional representation from Japanese, Korean, Thai, and other languages.

B. Models

To ensure a comprehensive and representative evaluation of In-Context Learning (ICL) capabilities across different model architectures and scales, we select a diverse suite of modern, open-source decoder-only large language models (LLMs). The models are: Llama [5] (llama-3.1-8B), Mistral [8] (mistral-7b-instruct-v0.2), Phi3-Small [1] (phi-3-small-128k-instruct), Phi3-Mini [1] (phi-3-mini-128k-instruct), Qwen [21] (qwen2.5-7b-instruct), and Gemma [26] (gemma-2b-it).

IV. ANALYSIS OF PROMPT CONFIGURATIONS

Before deploying ICL for large-scale detection, it is essential to understand how different prompt engineering factors interact with the specific nuances of illicit promotion. In this section, we systematically deconstruct the ICL configuration to identify the optimal setup. We examine four critical dimensions: the quantity of demonstrations, selection strategies, the underlying model architecture (instruction tuning), and the semantic influence of label verbalization. These analyses serve as the foundation for the performance evaluations in subsequent sections.

A. Impact of Demonstration Quantity

A fundamental question in deploying in-context learning (ICL) is determining the optimal number of demonstration examples. Providing too few may fail to adequately convey the task, while too many may introduce noise, increase computational cost, and potentially exceed the model's effective context window. To investigate this, we evaluated all models with shot counts k ∈ {0, 2, 4, 8, 16, 32, 64, 128} across binary and multiclass tasks.
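For concreteness, a k-shot classification prompt of the kind swept here can be assembled as follows. The template wording and function name are illustrative assumptions, not the paper's exact prompt.

```python
def build_icl_prompt(demos, query, labels=("benign", "illicit")):
    """Assemble a k-shot ICL prompt: a task instruction, k labeled
    demonstrations, then the query left open for the model to label.
    `demos` is a list of (text, label) pairs."""
    lines = [
        "Classify each text as one of: " + ", ".join(labels) + ".",
        "",
    ]
    for text, label in demos:
        lines.append(f"Text: {text}\nLabel: {label}\n")
    # The query is appended with an empty label slot for the model to fill.
    lines.append(f"Text: {query}\nLabel:")
    return "\n".join(lines)
```

Varying the length of `demos` from 0 to 128 reproduces the k-shot sweep, with k = 0 reducing to a zero-shot instruction-only prompt.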
The demonstrations were selected randomly from the training dataset of each task, and we experiment with 10 seeds to calculate averaged scores.

Model checkpoints: https://huggingface.co/meta-llama/Llama-3.1-8B, https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2, https://huggingface.co/microsoft/Phi-3-small-128k-instruct, https://huggingface.co/microsoft/Phi-3-mini-128k-instruct, https://huggingface.co/Qwen/Qwen2.5-7B-Instruct, https://huggingface.co/google/gemma-2b-it

During our experiments, we observed that Gemma could not reliably perform classification, failing to output the predefined labels with a failure rate as high as 58.94%. Consequently, we excluded it from further experiments and analysis. For similar reasons, Phi3-mini was also omitted from the multiclass classification task.

Binary Classification. As illustrated in Fig. 1a, most models benefit from increasing the number of shots in binary classification, though the extent and saturation point of improvement vary. Models such as Mistral, Llama, and Qwen show steady gains as k rises from 0 to 32, after which performance plateaus. In contrast, the Phi3 family exhibits less stable behavior. Phi3-small and Phi3-mini display noticeable fluctuations under few-shot conditions, indicating greater sensitivity to the choice of demonstrations. Although they improve with more examples, their progress is less consistent and saturates earlier than other models. Notably, a sharp performance jump occurs between 16 and 32 shots, suggesting that a minimum amount of contextual supervision is needed to activate their classification capability. Overall, binary classification tasks can be effectively learned with a moderate number of demonstrations (e.g., 16–32 shots), beyond which additional examples yield diminishing returns.

Multiclass Classification. Fig. 1b reveals a distinctly different trend for multiclass classification.
Unlike in the binary setting, all models exhibit a stronger and more consistent reliance on the number of shots, with no clear plateau even at 128 shots. This suggests that multiclass tasks place greater demands on in-context learning, requiring more examples to adequately capture inter-class decision boundaries. Among the evaluated models, Llama and Mistral again outperform others across all shot counts, demonstrating nearly linear improvement as k increases. Phi3-small shows moderate gains but consistently lags behind, reflecting its limited ability to generalize across multiple classes from context alone. Qwen performs notably worse in low-shot settings and improves more gradually, although its gap narrows with more shots. These findings highlight that multiclass classification is substantially more sensitive to the number of in-context examples than binary classification. While adding shots consistently improves accuracy, the lack of a clear saturation point implies that practical deployment must balance accuracy gains against increased inference costs and context-length constraints.

B. Demonstration Selection Strategies

While increasing the number of demonstrations generally improves model performance, relying solely on random selection overlooks a critical factor: the relevance and quality of each demonstration to the target query. Intuitively, a few highly pertinent examples should offer more actionable context than numerous irrelevant ones. To investigate this systematically, we compare three demonstration selection strategies:

• Random: Baseline selection without considering query content.
• Lexical-based: Employs the BM25 algorithm [23], a bag-of-words retrieval function that ranks demonstrations based on lexical overlap with the query.
• Semantic-based: Utilizes a pre-trained Sentence-Transformer [22] model (paraphrase-multilingual-MiniLM-L12-v2) to encode both queries and demonstrations into dense vector embeddings.
Demonstrations are ranked by their cosine similarity to the query in this semantic space. (The Sentence-Transformer checkpoint is https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2.)

Fig. 1. Impact of Demonstration Quantity. F1-scores of different model families as a function of the number of shots for (a) binary and (b) multiclass classification tasks. (Due to the context length limit of Mistral, the 128-shot experiment could not be conducted.)

Binary Classification. Fig. 2a illustrates the impact of different demonstration selection strategies on binary classification performance. Overall, both lexical-based and semantic-based retrieval consistently outperform random selection across almost all shot settings. This gap is particularly pronounced in the low-shot regime (k ≤ 8), where random selection exhibits high variance and unstable performance, as reflected by the larger error bars. In contrast, retrieval-based strategies provide more reliable and informative contextual signals, leading to faster performance gains with fewer demonstrations. Between the two retrieval approaches, semantic-based selection generally achieves the best or comparable performance relative to lexical-based selection, especially as the number of shots increases. This suggests that capturing semantic similarity beyond surface-level lexical overlap is beneficial for selecting more informative demonstrations. As k grows larger (e.g., k ≥ 32), the performance gap between strategies gradually narrows, indicating that the negative impact of suboptimal demonstration selection can be partially mitigated by increasing the number of examples.

Multiclass Classification. As shown in Fig. 2b, the advantages of informed demonstration selection become even more pronounced in multiclass classification.
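The semantic-based strategy can be sketched as follows. For self-containment this toy version substitutes a bag-of-words embedding for the Sentence-Transformer encoder used in the paper, but the ranking logic (cosine similarity to the query, keep top k) is the same.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; in the paper's setup this would be
    a multilingual Sentence-Transformer encoding instead."""
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity over sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_demonstrations(query, pool, k):
    """Semantic-based selection: rank the demonstration pool
    (a list of (text, label) pairs) by similarity to the query
    and keep the k most similar."""
    q = embed(query)
    ranked = sorted(pool, key=lambda d: cosine(q, embed(d[0])), reverse=True)
    return ranked[:k]
```

The lexical-based variant would replace `cosine` over embeddings with a BM25 score over the same pool; the random baseline simply samples k demonstrations without looking at the query.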
Both lexical-based and semantic-based strategies significantly outperform random selection across all shot counts, with a substantial margin that persists even at higher values of k. Random selection struggles to provide sufficient class-discriminative information, resulting in consistently lower F1 scores and slower improvement as more shots are added. Semantic-based retrieval again demonstrates slightly stronger or comparable performance relative to lexical-based retrieval, particularly in the mid-to-high shot range. This highlights the importance of semantic alignment between the query and selected demonstrations when dealing with more complex decision boundaries involving multiple classes. Overall, these results indicate that demonstration selection strategy plays a critical role in in-context learning, and that semantically informed retrieval is an effective and robust choice for improving classification performance.

Fig. 2. Efficacy of Demonstration Selection Strategies. Comparison of Random, Lexical (BM25), and Semantic (Embedding-based) selection strategies across varying shot counts using the Mistral model.

C. Instruction Tuning and Model Alignment

As established, the effectiveness of in-context learning is significantly influenced by both the number and the selection of demonstrations. However, these configuration choices act upon a more fundamental element: the underlying large language model (LLM) itself. A critical architectural distinction among LLMs is whether they have undergone instruction tuning, a supervised fine-tuning process designed to align the model's outputs with human instructions and desired response formats. To isolate the impact of this property, we conduct a controlled comparison between the instruction-tuned and non-instruction-tuned (base) variants of three model families: Llama, Mistral, and Qwen.
All experiments employ 32 demonstrations selected via semantic retrieval. (Mistral was selected for Fig. 2 because it performs best; see Appendix A for details.)

As illustrated in Fig. 3, the instruction-tuned variants consistently outperform their base counterparts across all three families. For both Llama and Mistral, instruction tuning yields clear and stable gains in F1 score, indicating that alignment with human instructions substantially enhances the model's capacity to follow classification prompts and generate well-structured label outputs in the ICL setting. While Qwen exhibits lower overall performance compared to the other models, instruction tuning still provides a noticeable improvement. This suggests that the benefits of instruction tuning are generalizable across model families, even when overall classification capability is constrained by other architectural or training factors.

These results underscore that instruction tuning is a critical enabler of effective in-context learning. Beyond improving raw accuracy, instruction-tuned models show better adherence to task specifications and output constraints, thereby reducing ambiguity in label generation. Consequently, while demonstration quantity and selection strategy are important operational factors, the choice of an instruction-tuned backbone model plays a foundational role in achieving reliable and robust ICL performance.

Fig. 3. Instruction Tuning Impact. Performance comparison between Instruction-Tuned and Base variants across Llama, Mistral, and Qwen families (32-shot, semantic retrieval).

D. Label Necessity and Semantic Verbalization

With the optimal demonstration strategy and backbone model established, we now turn to the final component of the prompt: the label verbalization. While instruction-tuned models are better at following rules, the semantic clarity of those rules (i.e., the labels) remains a critical variable.
This section investigates two fundamental aspects of label usage: whether explicit labels are indispensable, and how the semantic content of the label words themselves influences task learning.

The Necessity of Explicit Labels. To determine whether LLMs can infer task rules from input examples alone, we compare three experimental settings for the Mistral model: zero-shot, 32-shot without labels, and 32-shot with labels. As summarized in Fig. 4, explicit labels prove critical. Removing labels from the demonstrations led to a significant performance drop relative to the labeled few-shot setting, and performance was even worse than the zero-shot baseline. Notably, the False Positive Rate (FPR) increased sharply in the no-label condition. This suggests that without explicit labels, the model cannot reliably infer the intended classification rule from the demonstration texts alone. Instead, it appears to default to a biased strategy, frequently predicting the positive (illicit) class. These results confirm that labels are not merely helpful but essential in the ICL framework for content moderation. They provide the definitive supervisory signal that anchors the model's interpretation of the demonstration context, preventing degenerate learning outcomes.

Fig. 4. Necessity of Explicit Labels. Impact of removing labels from demonstrations compared to Zero-shot and Standard (Labeled) Few-shot baselines on FPR and F1 Score.

Sensitivity to Label Verbalization. Given that labels are necessary, we further examine whether their semantic meaning matters. We experiment with three verbalization schemes for the Mistral model: Original (benign/illicit), Inverted (illicit/benign), and Abstract (0/1 or A/B). The results reveal a pronounced difference between binary and multiclass tasks. For binary classification (Fig. 5a), in low-shot regimes (k < 16), the model relies heavily on its pre-existing semantic knowledge.
The original labels, whose meanings align with the task, perform best, while semantically contradictory (inverted) or meaningless (abstract) labels perform poorly. As the number of shots increases (k >= 32), this pattern shifts. The model begins to learn primarily from the input-label mapping within the demonstrations rather than from the intrinsic meaning of the label words. Consequently, the performance of inverted and abstract labels converges with, or even slightly surpasses, that of the original labels. The simplicity of abstract symbols may reduce cognitive load, making the mapping rule easier to learn from abundant examples.

The pattern for multiclass classification (Fig. 5b) is markedly different. Here, semantically meaningful original labels consistently and substantially outperform abstract labels across all shot counts. Although abstract labels still improve over zero-shot performance, they fail to match original labels. Multiclass classification requires disambiguating multiple fine-grained categories. Semantic labels (e.g., 'porn', 'gambling') provide direct, pre-learned conceptual anchors that help the model quickly associate demonstration content with the correct class. Abstract labels lack this semantic scaffolding, forcing the model to learn a completely arbitrary and more complex mapping from scratch, which proves inefficient within limited context.

Fig. 5. Sensitivity to Label Verbalization. Comparison of Original (semantic), Inverted, and Abstract (symbolic) label mappings for (a) binary and (b) multiclass tasks.

V. COMPARATIVE EFFICIENCY AND ROBUSTNESS ANALYSIS

Having identified the optimal configuration for ICL, a fundamental question remains: is this inference-only paradigm truly competitive with traditional training-based methods? In this section, we move beyond configuration to a rigorous benchmarking analysis.
We specifically evaluate ICL's data efficiency against parameter-efficient fine-tuning (LoRA) and probe its mechanistic robustness against positional biases and noise, establishing the theoretical and practical grounds for its deployment.

A. Data Efficiency: ICL vs. Parameter-Efficient Fine-Tuning

A key advantage of In-Context Learning (ICL) lies in its ability to adapt to downstream tasks without updating model parameters, relying solely on task demonstrations provided within the input prompt. This property makes ICL particularly attractive in scenarios where computational resources or labeled data are limited. In contrast, parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), explicitly modify a subset of model parameters using supervised training data. This section seeks to answer a practical and fundamental question: how does the performance of ICL, scaled by the number of in-context demonstrations (k-shot), compare with that of LoRA fine-tuning, scaled by the number of training samples (K)?

To ensure a fair comparison, we conduct all experiments using Mistral as the base model for both approaches. For ICL, we adopt the best-performing configuration, including instruction-tuned checkpoints and semantic retrieval for demonstration selection. For LoRA fine-tuning, we adopt a standard parameter-efficient configuration: rank r = 16, scaling factor alpha = 16, and a dropout rate of 0.05 on the LoRA layers. The model is optimized using AdamW with a learning rate of 2e-4 and trained for multiple epochs.

Table II summarizes the performance comparison on binary classification. Overall, the results reveal a clear data-efficiency advantage of ICL. Notably, ICL with 200 demonstrations achieves an F1-score of 0.943 and an accuracy of 0.944, closely matching the performance of a LoRA-tuned model trained on 4,480 labeled examples.
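To make the "parameter-efficient" framing concrete, the following sketch computes the trainable-parameter budget implied by the LoRA configuration above (r = 16, alpha = 16). The layer dimensions are hypothetical, chosen only for illustration; the paper does not list the exact projection sizes being adapted.

```python
# Sketch of the LoRA parameter budget implied by the Sec. V-A config
# (r = 16, alpha = 16). Layer shapes below are assumed, not from the paper.

def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Trainable parameters when a d x k weight W is frozen and its
    update is factored as B @ A, with B: d x r and A: r x k."""
    return r * (d + k)

def full_params(d: int, k: int) -> int:
    """Parameters of the frozen full-rank weight matrix."""
    return d * k

# Example: a 4096 x 4096 projection (assumed size).
d = k = 4096
r, alpha = 16, 16            # update is scaled by alpha / r = 1.0 here

print(full_params(d, k))                # 16777216 frozen weights
print(lora_trainable_params(d, k, r))   # 131072 trainable (< 1%)
```

The same r(d + k) vs. d*k arithmetic explains why LoRA fine-tuning is cheap per step yet, as Table II shows, still needs far more labeled samples than ICL to reach comparable F1.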
This match corresponds to a reduction of more than 22x in the number of task-specific supervised samples. Moreover, ICL consistently exhibits a lower false positive rate (FPR), indicating more precise and conservative decision boundaries.

TABLE II
DATA EFFICIENCY COMPARISON. Performance benchmarking of ICL (k-shot) versus LoRA fine-tuning (trained on K samples) for binary classification.

Method  K/k    Epochs  Prec.   Rec.    F1      FPR     Acc.
PEFT    280    1       0.5632  0.9900  0.7166  0.7142  0.6220
               2       0.6035  0.9918  0.7473  0.6148  0.6738
               3       0.7692  0.9750  0.8538  0.3007  0.8310
        560    1       0.5627  0.9994  0.7176  0.7283  0.6180
               2       0.7385  0.9899  0.8433  0.3393  0.8187
               3       0.7724  0.9893  0.8665  0.2702  0.8535
        1120   1       0.8298  0.9834  0.8998  0.1872  0.8947
               2       0.8477  0.9830  0.9087  0.1689  0.9042
               3       0.8580  0.9871  0.9167  0.1565  0.9126
        2240   1       0.8817  0.9854  0.9304  0.1266  0.9279
               2       0.8666  0.9862  0.9218  0.1468  0.9178
               3       0.8743  0.9848  0.9262  0.1322  0.9242
        4480   1       0.9105  0.9886  0.9479  0.0914  0.9474
               2       0.9174  0.9884  0.9516  0.0824  0.9516
               3       0.8546  0.9924  0.9183  0.1561  0.9152
ICL     128    -       0.9232  0.9554  0.9390  0.0732  0.9405
        200    -       0.9382  0.9474  0.9428  0.0584  0.9444

The comparison further highlights the substantial overhead associated with fine-tuning in low-data regimes. When trained on small datasets (K = 280 or 560), LoRA models struggle to achieve competitive performance, with F1-scores remaining around 0.86 even under carefully selected epoch settings. In contrast, as shown in Section IV-A, ICL reaches a similar performance level using only 32-64 demonstrations, underscoring its immediate applicability in data-scarce settings where collecting labeled samples is costly or impractical.

Another critical distinction emerges in terms of training stability. The performance of LoRA is highly sensitive to the choice of training epochs, particularly when the amount of training data is limited.
For larger datasets, we observe clear signs of overfitting, such as the performance degradation at epoch 3 for K = 4480. ICL, by design, avoids this issue entirely: it requires no iterative optimization or hyperparameter tuning, and its behavior is directly controlled by the number and composition of demonstrations. This property makes ICL substantially more predictable and robust in deployment scenarios.

The advantage of ICL becomes even more pronounced in complex multiclass classification tasks. As shown in Fig. 6, a LoRA model fine-tuned on the full multiclass training set (K = 5,200) is outperformed by an ICL configuration using only 64 demonstrations. This result suggests that for fine-grained categorization tasks, where learning nuanced and class-specific decision boundaries is critical, the dynamic and query-adaptive reasoning enabled by ICL is inherently more effective than the static parameter updates produced by fine-tuning on limited data. By presenting carefully selected, semantically aligned examples at inference time, ICL can better navigate complex semantic spaces and capture subtle distinctions among illicit categories.

Fig. 6. Multiclass Efficiency Benchmark. Comparing 64-shot ICL against LoRA fine-tuning with increasing training sample sizes (K) and epochs.

B. Robustness to Prompt Ordering and Positional Bias

The data efficiency advantage makes ICL a powerful alternative to fine-tuning. However, for reliable deployment in production systems, a method must also exhibit robustness; that is, its performance should not be brittle to innocuous variations in input formatting. A particular concern for ICL is its potential sensitivity to the order of elements within the dynamic prompt. This section systematically evaluates the robustness of ICL against two such variations: the order of demonstration examples and the order of labels in the demonstrations.

Sensitivity to Demonstration Ordering.
In a dynamically constructed few-shot prompt, the sequence of demonstration examples is typically arbitrary. To assess whether this order influences model predictions, we conducted experiments in which we held the set of k demonstrations constant but randomly shuffled their sequence multiple times, measuring the variation in performance (standard deviation of F1-score) for the Mistral model on both binary and multiclass classification tasks.

Fig. 7 reports the standard deviation of F1-score across multiple random permutations of the same demonstration set for different numbers of shots. Overall, the results indicate that demonstration ordering has a measurable but diminishing impact on model performance as the number of shots increases. For binary classification, the sensitivity to demonstration order is most pronounced in low-shot settings. With only 4 and 8 demonstrations, the standard deviation reaches 0.0307 and 0.0482 respectively, indicating substantial variability induced solely by reordering examples. However, this sensitivity decreases rapidly as more demonstrations are provided: at 16 shots, the standard deviation drops to 0.0151, and it declines further to 0.0062 at 64 shots. This trend suggests that a larger demonstration set provides a more stable and redundant contextual signal, mitigating the influence of individual example positions.

In contrast, multiclass classification exhibits a more moderate and consistent sensitivity across shot counts. The standard deviation remains around 0.018 for 4 to 16 shots and decreases to 0.0132 at 32 shots, before increasing slightly to 0.0161 at 64 shots. Compared to the binary setting, this relatively stable pattern implies that multiclass performance variability is less dominated by the ordering of individual demonstrations and more influenced by the inherent complexity of the task and class diversity.
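The measurement protocol just described (fixed demonstration set, repeated shuffles, standard deviation of the resulting scores) can be sketched as follows. The `evaluate_f1` callable is a hypothetical stand-in for one full ICL evaluation run of the model; the mock used in the usage example exists only to make the sketch runnable.

```python
# Sketch of the ordering-sensitivity protocol: same k demonstrations,
# multiple random permutations, report the std. dev. of F1 across them.
# `evaluate_f1` is a hypothetical stand-in for a full ICL evaluation run.
import random
import statistics

def ordering_f1_std(demos, evaluate_f1, n_perms=10, seed=0):
    """Shuffle a fixed demonstration set n_perms times and measure how
    much F1 varies purely due to example order."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_perms):
        perm = demos[:]       # keep the demonstration *set* fixed...
        rng.shuffle(perm)     # ...vary only its order
        scores.append(evaluate_f1(perm))
    return statistics.stdev(scores)

# Toy usage: a mock evaluator whose score depends weakly on ordering.
demos = [("text %d" % i, "illicit" if i % 2 else "benign") for i in range(8)]
mock_eval = lambda perm: 0.90 + 0.001 * (perm[0][1] == "illicit")
print(round(ordering_f1_std(demos, mock_eval, n_perms=20), 4))
```

An order-insensitive evaluator yields a standard deviation of exactly zero, which is the baseline the reported 0.0062-0.0482 values should be read against.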
Taken together, these findings suggest that while demonstration ordering can introduce non-negligible variance in few-shot ICL, particularly in low-shot binary classification, its impact becomes marginal as the number of demonstrations increases. From a practical perspective, this implies that for moderate to large k, ICL systems can safely ignore demonstration order without sacrificing robustness, whereas in extremely low-shot regimes, controlling or averaging over multiple prompt orderings may be beneficial.

Fig. 7. Robustness to Demonstration Ordering. Standard deviation of F1-scores across random permutations of the same demonstration set, illustrating stability improvements as shot count increases.

Sensitivity to Label Ordering. Beyond the ordering of demonstrations themselves, we examine a more subtle source of bias: whether the model's predictions are influenced by the positional distribution of labels within the prompt, independent of their semantic relevance. Specifically, we investigate whether the model may implicitly favor labels that appear more frequently or more recently in the demonstration sequence, thereby introducing systematic distortions in its decision behavior.

We manipulate the global arrangement of class labels across the entire demonstration set while keeping the underlying examples unchanged. Concretely, all demonstrations belonging to the same class are grouped together and placed either at the beginning or the end of the prompt. As shown in Fig. 8a, this global reordering leads to a noticeable degradation in accuracy when the number of shots becomes large (k >= 32), compared to the default mixed-label order. This effect suggests that even when demonstrations are relevant, a coherent and imbalanced label signal can bias the model's implicit decision boundary, particularly when accumulated over many examples.

More strikingly, we observe a strong positional or recency bias in the model's predictions.
As illustrated in Fig. 8b, the label that appears most frequently in close proximity to the query exerts a disproportionate influence on the final output. When illicit demonstrations are placed near the end of the prompt, the model becomes more inclined to predict the illicit label, resulting in a substantially higher false positive rate. Conversely, positioning benign demonstrations closest to the query significantly suppresses false positives, albeit at the risk of reduced recall for illicit cases.

This phenomenon, which is also observed in a controlled multiclass setting, indicates that the model assigns greater weight to local contextual evidence near the query than to globally aggregated information. Such behavior suggests that ICL-based classifiers do not rely solely on abstract class semantics but are sensitive to positional cues within the prompt, which can systematically skew predictions if not carefully controlled.

Fig. 8. Analysis of Positional Label Bias. (a) Accuracy degradation under clustered label arrangements compared to mixed order. (b) Recency bias: the false positive rate fluctuates significantly depending on whether the label nearest to the query is 'illicit' or 'benign'.

C. Contextual Information Retrieval Capabilities

Building upon the demonstrated advantages of ICL in terms of data efficiency and robustness, we now turn to a more mechanistic question: how effectively can an LLM identify and exploit relevant information within a complex, and potentially noisy, in-context prompt? To this end, we design a Needle-in-a-Haystack experiment that explicitly probes the model's ability to retrieve a single, highly relevant demonstration from a prompt dominated by irrelevant context. This setting allows us to disentangle genuine in-context retrieval from superficial pattern matching or majority-label biases.
In the experimental setup, for each target query in the test set, we construct a prompt containing n demonstrations, where n is in {2, 4, 8, 16, 32, 64, 128}. Exactly one demonstration is an identical copy of the target query paired with its ground-truth label (the needle), while the remaining n - 1 demonstrations are randomly sampled from the training set and are semantically unrelated to the query (the haystack). The position of the needle within the demonstration sequence is uniformly randomized to avoid positional artifacts. The model is then asked to classify the target query, as in the standard ICL setting. If the model perfectly retrieves and copies the label from the needle, accuracy should remain 100% regardless of n. Any drop in accuracy thus directly measures the model's failure to isolate the relevant signal amid increasing noise.

The results across four model families are shown in Fig. 9. Mistral, Llama, and Phi3-small show strong noise resilience, maintaining near-perfect accuracy even with 128 demonstrations. This indicates a robust capacity to attend to the relevant signal despite substantial distraction. In contrast, Qwen shows pronounced sensitivity to the increasing size of the haystack. While its accuracy initially rises with more demonstrations, it declines noticeably at larger n values, suggesting difficulty in consistently identifying the needle when overwhelmed by irrelevant content.

These findings provide direct empirical evidence that ICL performance is not solely a function of demonstration quantity, but critically depends on the model's ability to selectively retrieve and prioritize relevant input information. Models with stronger contextual retrieval capabilities are better equipped to scale ICL to longer prompts without suffering from information dilution, while models that struggle in this setting are more susceptible to noise.
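The prompt construction for this experiment can be sketched in a few lines. The demonstration texts and distractor pool below are illustrative placeholders; the real needle and haystack come from the paper's test and training sets.

```python
# Sketch of Needle-in-a-Haystack prompt construction: one correctly
# labeled copy of the query (the needle) at a uniformly random position
# among n - 1 randomly sampled, unrelated demonstrations (the haystack).
# All example texts/labels here are hypothetical placeholders.
import random

def build_needle_prompt(query, query_label, distractor_pool, n, rng):
    """Return a list of n (text, label) demonstrations containing exactly
    one needle at a uniformly random position."""
    assert n >= 1 and len(distractor_pool) >= n - 1
    demos = rng.sample(distractor_pool, n - 1)      # the haystack
    demos.insert(rng.randrange(n), (query, query_label))  # the needle
    return demos

rng = random.Random(42)
pool = [("unrelated text %d" % i, "benign") for i in range(200)]
demos = build_needle_prompt("buy cloned cards here", "illicit", pool, 16, rng)
assert len(demos) == 16
assert demos.count(("buy cloned cards here", "illicit")) == 1
```

Because the needle's position is drawn uniformly, any accuracy drop at large n measures retrieval failure rather than the positional bias examined in Section V-B.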
Overall, the Needle-in-a-Haystack experiment highlights a fundamental but often overlooked aspect of ICL: its reliance on implicit, model-dependent retrieval mechanisms over the input context. This insight underscores the importance of model selection when deploying ICL in real-world applications with long or noisy contexts.

Fig. 9. Contextual Retrieval Capability (Needle-in-a-Haystack). Accuracy of models in retrieving a single relevant demonstration amid an increasing number of irrelevant "haystack" examples.

VI. DEPLOYMENT IN OPEN-WORLD SCENARIOS

The preceding analyses demonstrated that ICL is not only data-efficient but, when utilizing capable models, exhibits strong resilience to noise and irrelevant context. Leveraging these robust retrieval capabilities, we now advance to the most challenging phase: deploying ICL in open-world scenarios. In this section, we evaluate whether ICL can translate its mechanistic strengths into practical utility, specifically testing its ability to generalize to unseen categories, autonomously discover novel threats, and operate across heterogeneous platforms without adaptation.

A. Generalization to Unseen Illicit Categories

A practical content moderation system must be resilient not only to known forms of illicit activity, but also to novel and previously unseen promotion categories that continuously emerge in online ecosystems. One of the key promises of ICL is its ability to extrapolate from a limited set of demonstrations to new but semantically related tasks without parameter updates. To evaluate this capability rigorously, we design a controlled experiment simulating a deployment scenario in which the system encounters content from a completely unseen illicit category. We frame this task as a binary classification problem with labels benign and illicit, using the instruction-tuned Mistral model.
Each prompt contains 64 demonstrations, evenly split between benign and illicit examples, retrieved randomly to avoid the category-specific bias introduced by semantic search. For each of the 12 illicit categories, we construct three evaluation settings: (i) Zero-shot: no demonstrations are provided; (ii) Excluded: demonstrations include benign samples and illicit samples from all categories except the target; (iii) Included: demonstrations additionally contain a few samples from the target category. In all cases, the test set consists exclusively of samples from the target category, ensuring that accuracy directly reflects the model's ability to generalize to that category (with precision fixed at 1).

Results are summarized in Table III. Several consistent patterns emerge. First, using ICL demonstrations that exclude the target category yields performance improvements over the zero-shot baseline for the majority of categories. In more than half of the cases, the performance drop incurred by excluding the target category is less than 6% relative to the category-included setting, indicating that ICL can effectively generalize from other illicit categories to unseen ones.

Second, while including demonstrations from the target category unsurprisingly produces the best overall performance, the gains are often modest compared to the excluded setting. This suggests that much of the discriminative capability arises from shared semantic and structural characteristics across illicit promotion types, rather than from category-specific memorization. Notably, categories such as fraud, data-theft, and counterfeit exhibit substantial improvements over zero-shot inference even when excluded from the demonstrations, highlighting ICL's strong cross-category generalization.
In contrast, a small subset of categories (e.g., advertisement and surrogacy) show larger gaps between the included and excluded settings, reflecting higher semantic divergence from other illicit categories and thus a greater reliance on category-specific cues.

These findings provide empirical evidence that ICL can generalize beyond seen categories by leveraging high-level shared patterns. Even without target-category demonstrations, the model often infers malicious intent from contextual similarities learned from other categories. This capability is particularly valuable in real-world deployment scenarios, where new forms of illicit promotion frequently appear before labeled data can be collected. Overall, this experiment shows that ICL exhibits meaningful extrapolative generalization across illicit categories, supporting its suitability as a flexible and adaptive moderation approach.

TABLE III
GENERALIZATION PERFORMANCE OF ICL ON UNSEEN ILLICIT CATEGORIES. Accuracy is reported under three settings: zero-shot, category-included demonstrations, and category-excluded demonstrations. The testing set of each row contains only samples from the target category.

Category          Zero-shot  Included  Excluded  Gain over   Gap vs.
                                                 Zero-shot   Included
Hacking           0.9011     0.9457    0.9550    +0.0539     +0.0093
Others            0.8861     0.9080    0.9080    +0.0218      0.0000
Drug              0.9103     0.9733    0.9558    +0.0454     -0.0176
Counterfeit       0.6828     0.8816    0.8571    +0.1743     -0.0244
Data-theft        0.6916     0.9574    0.9225    +0.2309     -0.0349
Porn              0.8497     0.8643    0.8221    -0.0276     -0.0421
Fraud             0.5124     0.9378    0.8780    +0.3657     -0.0597
Benign            0.8789     0.8426    0.7778    -0.1011     -0.0648
Gambling          0.5988     0.8035    0.6711    +0.0723     -0.1324
Weapon            0.6600     0.9393    0.7934    +0.1334     -0.1459
Money-laundering  0.8485     0.9118    0.7595    -0.0890     -0.1523
Surrogacy         0.7500     0.9880    0.7640    +0.0140     -0.2240
Advertisement     0.7263     0.9177    0.6266    -0.0998     -0.2911

B. Autonomous Discovery of Emerging Threats

Motivation. A persistent challenge in online content moderation is the rapid emergence of novel illicit goods and services. In practice, any fixed classification taxonomy inevitably results in detection gaps, leaving novel threats uncaught. Therefore, enabling a system to autonomously discover and categorize previously unknown forms of illicit promotion from unlabeled data is crucial for proactive defense.

In principle, this problem could be addressed through periodic taxonomy updates driven by manual inspection. However, such expert-driven processes are slow, labor-intensive, and difficult to scale in the face of high-throughput online platforms. A natural automated alternative is unsupervised clustering over textual representations. Unfortunately, our empirical evaluation shows that conventional clustering methods (e.g., DBSCAN on word-level or sentence-level embeddings) are ill-suited for this domain. These approaches tend to produce overly coarse semantic groupings that lack discriminative granularity. For example, distinct illicit activities such as hacking, counterfeit trading, and fraud are merged into a single cluster, while categories like surrogacy, gambling, and pornography are conflated into another.
This failure arises because such methods rely on general lexical or distributional similarity and fail to capture the nuanced distinctions between fine-grained illicit categories. The resulting clusters are thus too vague to support actionable threat intelligence. These limitations highlight the need for an alternative approach that preserves the scalability of automation while incorporating the contextual and domain-aware reasoning typically associated with human analysis.

Methodology. To address this challenge, we propose a two-stage pipeline that leverages the generative and reasoning capabilities of LLMs to perform both category discovery and conceptual consolidation.

Stage 1: ICL-based Category Annotation. Rather than forcing content into a fixed label set, we prompt an LLM to operate as a human-like content analyst. Given an unlabeled promotional text, the model is asked to generate a concise, free-form category name that describes the core illicit activity being promoted. This step is implemented using a few-shot ICL configuration with the Mistral model. We explore two prompt designs to examine the impact of prior structural constraints:

• Anchored Prompting. The prompt provides a list of known illicit categories as references, while explicitly allowing the model to propose a new category name if none of the existing ones is suitable.
• Open-Ended Prompting. The prompt provides no predefined categories and simply instructs the model to propose a category name that best captures the illicit activity described in the text. This design minimizes anchoring bias and encourages unconstrained conceptualization.

Stage 2: LLM-powered Conceptual Clustering. The first stage produces a large set of heterogeneous, free-form category strings (e.g., luxury-goods-counterfeit, live-streaming-sex). To transform this unstructured output into a usable taxonomy, we perform a second LLM invocation.
The model is provided with the complete list of generated category names and instructed to group them based on their underlying illegal activities, assign a concise and representative cluster name to each group, and produce a brief semantic summary. This step leverages the LLM's ability to reason over conceptual similarity rather than relying on superficial lexical overlap, enabling coherent consolidation of semantically related but lexically diverse labels.

Results. The impact of prompt design is immediately reflected in the diversity of generated category names. The anchored prompting strategy yields 197 unique category labels, whereas the open-ended strategy produces 2,909 unique labels, demonstrating a substantially richer exploration of the category space when prior constraints are removed.

More importantly, qualitative analysis of the final clustered outputs reveals the pipeline's capacity for genuine category discovery. The anchored pipeline, while constrained by existing knowledge, identifies two coherent novel clusters absent from the original taxonomy. In contrast, the open-ended pipeline discovers eight distinct new illicit categories, including emerging or niche threats such as usury, illegal immigration, and software piracy. These categories are not only semantically coherent but also practically meaningful, aligning well with real-world moderation concerns.

Overall, this experiment demonstrates that ICL can be effectively repurposed from a static classification mechanism into a dynamic discovery engine. By decoupling label generation from predefined schemas and subsequently imposing structure through LLM-based conceptual clustering, our two-stage design enables open-world exploration while maintaining organizational clarity. This approach shifts the role of human analysts from exhaustive manual discovery to efficient validation and oversight.
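The two-stage pipeline can be summarized by the following skeleton, which assumes a generic `llm(prompt) -> str` completion function. The prompt wording is illustrative only and is not the paper's exact template; passing a list of known categories yields the anchored variant, while passing none yields the open-ended variant.

```python
# Skeleton of the two-stage discovery pipeline. `llm` is a hypothetical
# stand-in for a Mistral completion call; prompt text is illustrative.

def stage1_annotate(texts, llm, known_categories=None):
    """Stage 1: elicit one free-form category name per text.
    known_categories != None gives the 'anchored' variant; None is open-ended."""
    labels = []
    for text in texts:
        anchor = ""
        if known_categories:
            anchor = ("Known categories: " + ", ".join(known_categories) +
                      ". Propose a new name if none fits. ")
        prompt = (anchor + "Name the core illicit activity promoted in the "
                  "following text with one concise category name.\nText: " + text)
        labels.append(llm(prompt).strip().lower())
    return labels

def stage2_cluster(labels, llm):
    """Stage 2: a single LLM call that groups the free-form labels by
    underlying illegal activity and names each cluster."""
    unique = sorted(set(labels))
    prompt = ("Group these category names by the underlying illegal activity; "
              "give each group a representative name and a one-line summary:\n"
              + "\n".join(unique))
    return llm(prompt)

# Toy usage with a mock LLM that always returns one fixed answer.
mock_llm = lambda p: "counterfeit-goods"
print(stage1_annotate(["replica watches, DM me"], mock_llm))
```

The key design point reproduced here is the decoupling: Stage 1 never constrains generation to a fixed schema, and Stage 2 imposes structure afterward over the accumulated label set.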
This substantially accelerates the threat intelligence lifecycle and moves content moderation systems closer to truly adaptive, future-proof operation.

C. Cross-Platform Evaluation in the Wild

We evaluate the configured ICL pipeline under real operational conditions, using large-scale, unlabeled data collected from heterogeneous online platforms. This setting represents the most challenging scenario in our study: highly imbalanced class distributions, platform-specific linguistic noise, and the absence of ground-truth supervision. Our goal is not merely to benchmark raw accuracy, but to assess whether ICL offers unique practical advantages that are difficult to replicate with conventional, platform-specific detectors.

Experimental Setup. We collected 100,000 unlabeled samples each from search engine results and Twitter tweets. The optimized ICL pipeline, utilizing the Mistral model with 32-shot semantic retrieval, was applied without platform-specific modifications. Performance was evaluated through manual auditing based on standard industry practices, with 1,000 randomly selected predictions reviewed for each platform. As baselines, we compared ICL against two state-of-the-art, platform-specific detectors: (i) a high-efficiency Random Forest classifier for search engine data, and (ii) a fine-tuned, transformer-based multilingual model for Twitter data.

Cross-Platform Generalization on Mixed Data. To directly assess cross-platform generalization, we construct a curated, balanced test set that equally mixes samples from search engine and Twitter sources. As shown in Table IV, ICL achieves the best overall performance across all evaluation metrics, with an accuracy of 92.59% and an F1-score of 92.31%. In contrast, both platform-specific baselines suffer substantial degradation when applied outside their native domains.
The search-engine Random Forest classifier, optimized for efficiency and lexical patterns in query data, performs poorly on Twitter content, resulting in a sharp drop in both recall and overall accuracy. Similarly, the Twitter-specific fine-tuned transformer model, while achieving high precision, exhibits a pronounced recall loss on search engine samples, reflecting its limited exposure to non-social-media distributions during training.

Importantly, ICL's superior performance is achieved without any platform-aware adaptation or retraining. A single, unified model and prompting strategy is applied identically across both data sources. This contrasts fundamentally with the platform-specific detectors, whose performance is tightly coupled to their training distributions and degrades under distribution shift. These results demonstrate that ICL functions as a genuinely platform-agnostic classifier, capable of leveraging semantic reasoning over in-context demonstrations to bridge heterogeneous data sources.

TABLE IV
CROSS-PLATFORM GENERALIZATION. Performance comparison between ICL and platform-specific baselines (Random Forest and fine-tuned transformer) on a balanced, mixed-source test set.

Classifier        Accuracy  Precision  Recall  F1-score  FPR
ICL (Ours)        0.9259    0.9111     0.9354  0.9231    0.0826
Random Forest 1   0.7330    0.7202     0.7162  0.7182    0.2517
Transformer 2     0.8339    0.8728     0.7613  0.8133    0.1003
1 This is search engine-specific. 2 This is Twitter-specific.

Behavior on Wild Data. When deployed on fully unlabeled wild data, ICL demonstrates a distinct operational profile that differs markedly from platform-specific detectors. On search engine data, ICL achieves an F1-score of 83.89%, closely approaching the performance of the specialized search engine classifier, despite operating without any platform-specific adaptation or retraining.
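The 32-shot semantic retrieval step that the deployed pipeline relies on amounts to a top-k nearest-neighbor search over embeddings. The sketch below illustrates it end to end; the `embed` function is a hypothetical stand-in (here a trivial bag-of-words vector so the code runs), whereas the actual system would use a sentence encoder.

```python
# Sketch of the semantic-retrieval step behind the 32-shot deployment
# prompts: embed the query, rank the labeled pool by cosine similarity,
# keep the top k. `embed` is a toy stand-in for a real sentence encoder.
import math
from collections import Counter

def embed(text):
    return Counter(text.lower().split())  # toy bag-of-words "embedding"

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_demonstrations(query, pool, k=32):
    """Return the k labeled (text, label) examples most similar to query."""
    q = embed(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, embed(ex[0])), reverse=True)
    return ranked[:k]

pool = [("cheap replica watches for sale", "illicit"),
        ("weather forecast for tomorrow", "benign"),
        ("buy counterfeit watches now", "illicit")]
top = retrieve_demonstrations("counterfeit watches", pool, k=2)
print([label for _, label in top])  # → ['illicit', 'illicit']
```

Because retrieval is purely similarity-based, the identical procedure serves search engine snippets and tweets alike, which is what makes the zero-adaptation deployment above possible.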
On Twitter data, ICL exhibits high-recall but lower-precision behavior, achieving a recall of 92.41% and a precision of 68.73%, which leads to an elevated false positive rate compared to the Twitter-specific model. This performance pattern does not indicate a failure of the approach, but rather reflects a fundamental difference in design philosophy. Platform-specific detectors are typically optimized for high-precision filtering under relatively stable, closed-world distributions. In contrast, ICL prioritizes semantic coverage and adaptability, aggressively surfacing content that weakly matches known illicit patterns but may correspond to emerging, obfuscated, or borderline promotional behaviors.

To further investigate this phenomenon, we conduct a comparative analysis of samples uniquely flagged by each system. Manual inspection reveals that 61.8% of the samples labeled as illicit exclusively by ICL correspond to either clearly illicit or borderline promotional content, often involving implicit language, novel phrasing, or cross-domain references. In comparison, among samples flagged only by the Twitter-specific classifier, this proportion drops to 45.7%, with the remaining cases largely attributable to platform-specific artifacts or overly narrow lexical cues.

Taken together, these results position ICL as a complementary paradigm rather than a replacement for traditional classifiers. Its strengths lie in cross-platform generalization, zero-adaptation deployment, and the discovery of emerging illicit behaviors, capabilities that are inherently difficult to achieve with fine-tuned, platform-specific models. In practice, the most effective deployment strategy is a hybrid system in which ICL operates as a flexible, high-recall discovery engine, continuously feeding novel patterns and labeled data to efficient downstream classifiers.
Such a design enables adaptive, future-proof content moderation in the face of rapidly evolving online threats.

VII. DISCUSSION

A. Implications for Practical Content Moderation Systems

While ICL demonstrates strong data efficiency and cross-platform generalization, its deployment entails trade-offs, including higher inference latency and lower precision on noisy data. Consequently, we position ICL not as a standalone replacement for specialized classifiers, but as the core of a tiered, hybrid architecture.

In scenarios where labeled data are scarce, costly, or delayed, ICL can function effectively as a primary detector. Our analysis confirms that ICL achieves competitive performance using up to 22 times fewer labeled samples than fine-tuned models. This allows for immediate operational coverage while structured data-collection pipelines are being established, reducing dependence on large-scale annotation efforts and accelerating response to emerging promotion tactics.

For platforms with mature moderation systems, ICL is best deployed in parallel with traditional classifiers as a high-recall discovery funnel. Its strength in generalizing from limited demonstrations enables it to flag ambiguous, novel, or intentionally obfuscated content that may evade static rule-based or fine-tuned models. These high-value samples can then be reviewed by human experts to generate labeled data for updating lightweight downstream models, creating a virtuous cycle of model improvement.

In summary, this synergistic architecture combines the agility and generalizing power of ICL with the efficiency and precision of specialized classifiers, establishing a more robust and adaptive defense against the continuously evolving challenge of illicit online promotion.

B. Limitations and Future Work

Restriction to Textual Modality. Our evaluation focuses primarily on text-based illicit promotion.
In real-world ecosystems, illicit promotion increasingly adopts multimodal strategies, combining text with images, videos, URLs, and encoded symbols to evade detection. While ICL exhibits strong adaptability in the textual domain, its effectiveness for multimodal illicit promotion detection remains unexplored. Extending the proposed framework to multimodal large language models and integrating visual and structural signals constitutes an important avenue for future work.

Dependence on Demonstration Quality and Diversity. Although ICL shows strong generalization to unseen illicit categories, its performance remains dependent on the quality and diversity of the demonstration pool. When demonstrations are biased, outdated, or semantically narrow, the model may inherit these limitations and exhibit reduced robustness. Future work could investigate adaptive demonstration management strategies, such as continuously refreshing the demonstration repository through active learning or uncertainty-driven sampling, in order to maintain coverage over emerging threat patterns.

Inference Latency and Cost. While ICL eliminates training costs, utilizing long-context prompts incurs significant computational overhead during inference compared to lightweight fine-tuned classifiers. This latency trade-off limits the direct applicability of high-shot ICL for real-time, high-throughput filtering. Future work should investigate prompt compression techniques or knowledge distillation, where ICL models serve as "teachers" to generate rationale-augmented training data for smaller, latency-efficient student models.

C. Data and Code Release

To support reproducibility and facilitate future research, we plan to release the data and code associated with this study, subject to ethical and legal constraints.
Specifically, we will release: (i) the full implementation of our ICL pipeline, including prompt templates, semantic retrieval strategies, and evaluation scripts; and (ii) anonymized and de-identified versions of the curated benchmark datasets used in controlled experiments. All released resources are available and will be continuously updated at https://github.com/ChaseSecurity/illicit-icl. We hope that these resources will enable the community to replicate our findings, explore alternative prompt designs, and extend ICL-based approaches to new domains of online security.

VIII. CONCLUSION

In this paper, we investigated in-context learning as a practical and adaptive approach for illicit promotion detection across online platforms. Through extensive experiments, we showed that ICL exhibits robust cross-platform generalization, meaningful resilience to unseen categories, and a unique capacity for autonomous threat discovery, capabilities that are difficult to achieve with traditional fine-tuning-based systems. At the same time, our analysis highlights important limitations, including sensitivity to prompt construction and reduced precision on highly noisy data. These findings underscore that ICL should not be viewed as a universal replacement for specialized classifiers, but rather as a complementary paradigm with distinct strengths.

By reframing illicit content detection as an open-world, evolving problem, our work demonstrates that ICL enables a shift from static classification toward adaptive, discovery-driven moderation. We believe this perspective opens new directions for future research, including more principled prompt optimization, integration with retrieval systems, and tighter coupling between ICL-driven discovery and downstream learning.
APPENDIX A
THE SELECTION OF MODEL

We compared the performance of different models under semantic-based retrieval and found that Mistral performed the best across all tasks, as shown in Figure 10. Therefore, we selected Mistral as the baseline model for all subsequent experiments.

Fig. 10. Model Selection Basis. Performance comparison of all candidate models on (a) binary classification and (b) multiclass classification, justifying the selection of Mistral as the primary backbone for subsequent experiments due to its superior multiclass performance.

APPENDIX B
EXAMPLE PROMPTS

Binary Classification Task Prompt Template

You will act as a content detector to decide if a given text is illicit promotion, i.e., involving the promotion of illicit goods and services. If it is, label it as 'illicit', otherwise, label it as 'benign'. First, you will be given a set of message examples along with their labels. Then, you will answer the category of the last message from the following options:
- 'benign': Content that does not contain illicit promotion
- 'illicit': Content that contains illicit promotion
Return only the name of the category, and nothing else.
==
Query: {demonstration text}
Answer: {demonstration label}
==
{...}
==
Query: {query text}
Answer:

Multiclass Classification Task Prompt Template

You will act as a content detector. First, you will be given a set of message examples along with their labels.
Then, you will answer the category of the last message from the following options:
- 'benign': Content that does not contain illicit promotion
- 'porn': Content that is sexually explicit or pornographic in nature
- 'surrogacy': Content related to surrogate mothers and gestational surrogacy
- 'gambling': Content related to gambling or betting activities
- 'drug': Content related to illegal drug use, sales, or promotions
- 'data-theft': Content involving the theft or illegal use of data, identity theft, or similar activities
- 'money-laundry': Content involving the promotion of or recruitment for money laundering activities
- 'counterfeit': Content related to fake goods, forged certificates, or false accounts
- 'advertisement': Content related to illegal marketing and black hat SEO
- 'weapon': Content related to weapons, including sales, manufacturing, or usage
- 'fraud': Content related to fraudulent activities and scams
- 'hacking': Content related to hacking, cybersecurity threats, or development of unlawful programs
- 'others': Content that does not fit into any of the above categories
Return only the name of the category, and nothing else.
==
Query: {demonstration text}
Answer: {demonstration label}
==
{...}
==
Query: {query text}
Answer:
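Both templates share the same skeleton: an instruction, "=="-separated labeled demonstrations, and a trailing unlabeled query. A minimal builder sketch follows; the instruction text is abbreviated, and representing demonstrations as (text, label) pairs is an assumption of this sketch rather than our released implementation:

```python
def build_prompt(instruction: str, demonstrations, query: str) -> str:
    """Assemble an ICL prompt in the template format shown above."""
    blocks = [instruction]
    for text, label in demonstrations:   # labeled in-context examples
        blocks.append(f"Query: {text}\nAnswer: {label}")
    blocks.append(f"Query: {query}\nAnswer:")   # unlabeled query last
    return "\n==\n".join(blocks)

prompt = build_prompt(
    "You will act as a content detector. ... Return only the name of "
    "the category, and nothing else.",
    [("free casino credits, sign up now", "gambling"),
     ("weather forecast for Hefei", "benign")],
    "cheap replica watches, DM for catalog",
)
```

The resulting string ends with an empty "Answer:" field, so the model's next tokens are read directly as the predicted label.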