Semi-Supervised Learning for Large Language Models Safety and Content Moderation

Reading time: 6 minutes

📝 Original Info

  • Title: Semi-Supervised Learning for Large Language Models Safety and Content Moderation
  • ArXiv ID: 2512.21107
  • Date: 2025-12-24
  • Authors: Eduard Ștefan Dinuță, Iustin Sîrbu, Traian Rebedea

📝 Abstract

Safety for Large Language Models (LLMs) has been an ongoing research focus since their emergence and is even more relevant nowadays with the increasing capacity of those models. Currently, there are several guardrails in place for all public LLMs and multiple proposed datasets for training safety classifiers. However, training these safety classifiers relies on large quantities of labeled data, which can be problematic to acquire, prone to labeling errors, or often include synthetic data. To address these issues, we suggest a different approach: utilizing semi-supervised learning techniques, which leverage both labeled and unlabeled data, to improve the performance on the safety task. We analyze the improvements that these techniques can offer for both prompts given to Large Language Models and the responses to those requests. Moreover, since augmentation is the central part of semi-supervised algorithms, we demonstrate the importance of using task-specific augmentations, which significantly increase the performance when compared to general-purpose augmentation techniques.


📄 Full Content

Semi-Supervised Learning for Large Language Models Safety and Content Moderation

Eduard Ștefan Dinuță¹, Iustin Sîrbu¹·², Traian Rebedea¹·³
¹National University of Science and Technology Politehnica Bucharest, ²Renius Technologies, ³NVIDIA
eduarddinuta3@gmail.com, iustin.sirbu@upb.ro, traian.rebedea@upb.ro

Abstract

Safety for Large Language Models (LLMs) has been an ongoing research focus since their emergence and is even more relevant nowadays with the increasing capacity of those models. Currently, there are several guardrails in place for all public LLMs and multiple proposed datasets for training safety classifiers. However, training these safety classifiers relies on large quantities of labeled data, which can be problematic to acquire, prone to labeling errors, or often include synthetic data. To address these issues, we suggest a different approach: utilizing semi-supervised learning techniques, which leverage both labeled and unlabeled data, to improve the performance on the safety task. We analyze the improvements that these techniques can offer for both prompts given to Large Language Models and the responses to those requests. Moreover, since augmentation is the central part of semi-supervised algorithms, we demonstrate the importance of using task-specific augmentations, which significantly increase the performance when compared to general-purpose augmentation techniques.

1 Introduction

Currently, all commercial Large Language Models (LLMs) have some kind of guardrails against malicious prompts and responses, either by moderating the model itself or by adding another system on top of it that deals with the safety aspect. The problem is that this approach suffers from the same shortcomings as many other Machine Learning systems: gathering the high-quality annotated data that the models depend on.

In general, data annotation is a cumbersome and costly task and can also lead to inaccuracies, either due to the subjective nature of the task or to errors coming from deferring it to AI models. To solve this issue, we propose using semi-supervised learning (SSL). This approach leverages a small labeled dataset alongside a vast amount of unlabeled data from the real world, achieving better results at a reduced annotation cost. Building on this idea, this paper represents a starting point with two main contributions:

• We perform an analysis of several state-of-the-art semi-supervised learning algorithms in the context of LLM safety. We focus on two categories: prompt harmfulness and response harmfulness, as we want to ensure that the models do not comply with malicious requests, but also do not offer harmful replies to benign questions.

• We introduce a new, task-specific augmentation technique and show that it can significantly improve performance by comparing classic augmentation methods, such as backtranslation, with our custom LLM-generated augmentations that focus on the safety aspect.

2 Related Work

Multiple studies have been conducted in the field of LLM safety, aiming to create either high-quality datasets and well-defined risk categories, or innovative training methods and models that are more capable of resisting a large set of attacks. WildGuard [4] introduces both a model and a large training dataset of 86,759 samples across 13 risk subcategories to act as an LLM moderation tool with broad coverage, especially for adversarial attacks, which have proven to be a serious problem. Its test set contains 5K high-quality human-annotated examples covering broad risk scenarios. Aegis 2.0 [3] proposes a taxonomy that classifies safety risks into 12 top-level categories and 9 fine-grained ones. The Aegis dataset is fully open source, usable commercially, and does not rely on synthetic data, providing 34K samples of human-LLM interactions carefully annotated by both humans and an LLM jury. While 85% of the WildGuard training set is generated using GPT, Aegis collects its data from public sources, which makes it a suitable dataset for commercial use.

When it comes to safety training techniques, multiple approaches have been proposed. A first category involves inducing safety-related behaviors directly into the model's weights during training, a process known as alignment. One popular technique is Reinforcement Learning from Human Feedback (RLHF) [11], which combines a reward system with human evaluation. However, even after alignment, many LLMs remain prone to responding to unsafe queries, hence post-training methods have been developed. Some of these methods fine-tune a dedicated safety classifier, such as Llama Guard [8], or provide fully programmable frameworks, such as NeMo Guardrails [13]. Other approaches include ensemble methods, such as the multi-judge architecture introduced in Aegis 2.0 [3], and safety agents such as GuardAgent [19], which get access to specifications, logs, and tools to enforce safety rules and constantly monitor and filter interactions.

3 Approach

3.1 Baseli
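To make the semi-supervised idea concrete, here is a minimal self-training (pseudo-labeling) sketch: fit a classifier on the small labeled set, predict labels for unlabeled prompts, and fold back in only the high-confidence predictions. This is one common SSL scheme, not necessarily the algorithm the paper evaluates; the nearest-centroid bag-of-words classifier, the `safe`/`unsafe` labels, and the confidence threshold are all illustrative assumptions:

```python
import math
from collections import Counter

def featurize(text):
    """Bag-of-words features (illustrative stand-in for a real encoder)."""
    return Counter(text.lower().split())

def centroid(feature_list):
    """Average the word counts of all examples sharing a label."""
    total = Counter()
    for f in feature_list:
        total.update(f)
    n = len(feature_list)
    return {w: c / n for w, c in total.items()}

def cosine(a, b):
    dot = sum(a.get(w, 0.0) * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def predict(text, centroids):
    """Return (label, confidence) via nearest-centroid similarity."""
    f = featurize(text)
    scores = {label: cosine(f, c) for label, c in centroids.items()}
    label = max(scores, key=scores.get)
    total = sum(scores.values()) or 1.0
    return label, scores[label] / total  # normalized similarity as confidence

def pseudo_label_round(labeled, unlabeled, threshold=0.7):
    """One self-training round: pseudo-label confident unlabeled examples.

    labeled:   list of (text, label) pairs
    unlabeled: list of texts
    Returns the labeled set extended with confident pseudo-labels.
    """
    labels = {lab for _, lab in labeled}
    centroids = {
        lab: centroid([featurize(t) for t, l in labeled if l == lab])
        for lab in labels
    }
    newly = []
    for text in unlabeled:
        label, conf = predict(text, centroids)
        if conf >= threshold:
            newly.append((text, label))
    return labeled + newly
```

In a real pipeline the classifier would be a fine-tuned transformer, the round would be repeated (often combined with consistency regularization over augmented views), and the threshold tuned on a validation split; the sketch only shows where the unlabeled safety data enters the loop.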

…(Full text truncated)…


Reference

This content is AI-processed based on ArXiv data.
