📝 Original Info
- Title: Semi-Supervised Learning for Large Language Models Safety and Content Moderation
- ArXiv ID: 2512.21107
- Date: 2025-12-24
- Authors: Eduard Ștefan Dinuță, Iustin Sîrbu, Traian Rebedea
📝 Abstract
Safety for Large Language Models (LLMs) has been an ongoing research focus since their emergence and is even more relevant nowadays with the increasing capacity of those models. Currently, there are several guardrails in place for all public LLMs and multiple proposed datasets for training safety classifiers. However, training these safety classifiers relies on large quantities of labeled data, which can be problematic to acquire, prone to labeling errors, or often include synthetic data. To address these issues, we suggest a different approach: utilizing semi-supervised learning techniques, which leverage both labeled and unlabeled data, to improve the performance on the safety task. We analyze the improvements that these techniques can offer for both prompts given to Large Language Models and the responses to those requests. Moreover, since augmentation is the central part of semi-supervised algorithms, we demonstrate the importance of using task-specific augmentations, which significantly increase the performance when compared to general-purpose augmentation techniques.
📄 Full Content
Semi-Supervised Learning for Large Language Models Safety
and Content Moderation
Eduard Ștefan Dinuță¹, Iustin Sîrbu¹,², Traian Rebedea¹,³
1National University of Science and Technology Politehnica Bucharest
2Renius Technologies, 3NVIDIA
eduarddinuta3@gmail.com, iustin.sirbu@upb.ro, traian.rebedea@upb.ro
Abstract
Safety for Large Language Models (LLMs) has been an ongoing research focus since their emergence
and is even more relevant nowadays with the increasing capacity of those models. Currently, there
are several guardrails in place for all public LLMs and multiple proposed datasets for training safety
classifiers. However, training these safety classifiers relies on large quantities of labeled data, which
can be problematic to acquire, prone to labeling errors, or often include synthetic data. To address
these issues, we suggest a different approach: utilizing semi-supervised learning techniques, which
leverage both labeled and unlabeled data, to improve the performance on the safety task. We analyze
the improvements that these techniques can offer for both prompts given to Large Language Models
and the responses to those requests. Moreover, since augmentation is the central part of semi-
supervised algorithms, we demonstrate the importance of using task-specific augmentations, which
significantly increase the performance when compared to general-purpose augmentation techniques.
1 Introduction
Currently, all commercial Large Language Models (LLMs) have some kind of guardrails against malicious prompts
and responses, either by moderating the model itself or adding another system on top of it that deals with the safety
aspect. The problem is that this approach suffers from the same shortcomings as many other Machine Learning systems:
gathering high-quality annotated data, which the models depend on. In general, data annotation is a cumbersome
and costly task and can also lead to inaccuracies, either due to the subjective nature of the task or errors coming
from deferring it to AI models. To solve this issue, we propose using semi-supervised learning (SSL). This approach
leverages a small labeled dataset alongside a vast amount of unlabeled data from the real world, achieving better results
at a reduced annotation cost. Building on this idea, this paper represents a starting point with two main contributions:
• We perform an analysis on several state-of-the-art semi-supervised learning algorithms in the context of LLM
safety. We focus on two categories: prompt harmfulness and response harmfulness, as we want to ensure that
the models do not comply with malicious requests, but also do not offer harmful replies to benign questions.
• We introduce a new, task-specific augmentation technique and show that it can significantly improve
performance by comparing classic augmentation methods, such as backtranslation, with our custom
LLM-generated augmentations that focus on the safety aspect.
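To make the semi-supervised idea concrete, the following is a minimal, self-contained sketch of confidence-thresholded pseudo-labeling (in the style of methods such as FixMatch), where unlabeled prompts are only added to the training pool when the current classifier is sufficiently confident. The scoring function, threshold value, and all names here are illustrative assumptions, not the paper's actual method.

```python
# Toy sketch of confidence-thresholded pseudo-labeling for a safety task.
# `toy_safety_score` is a hypothetical stand-in for a trained classifier's
# P(unsafe | text); a real system would use model predictions instead.

TAU = 0.95  # confidence threshold for accepting a pseudo-label (assumed value)

def toy_safety_score(text: str) -> float:
    """Hypothetical stand-in for a classifier's probability of 'unsafe'."""
    unsafe_markers = ("attack", "exploit", "weapon")
    hits = sum(marker in text.lower() for marker in unsafe_markers)
    return min(0.5 + 0.25 * hits, 1.0)

def pseudo_label(unlabeled: list[str], tau: float = TAU) -> list[tuple[str, int]]:
    """Keep only unlabeled examples the current model is confident about,
    pairing each with its predicted (pseudo) label: 1 = unsafe, 0 = safe."""
    selected = []
    for text in unlabeled:
        p_unsafe = toy_safety_score(text)
        confidence = max(p_unsafe, 1.0 - p_unsafe)
        if confidence >= tau:
            selected.append((text, int(p_unsafe >= 0.5)))
    return selected

unlabeled_pool = [
    "How do I exploit this weapon attack vector?",  # scored with high confidence
    "What is the capital of France?",               # low confidence, dropped
]
print(pseudo_label(unlabeled_pool))
```

In a full training loop, the selected pseudo-labeled pairs would be mixed with the small labeled set (often after augmentation) and the model retrained, repeating until the unlabeled pool is exhausted or performance plateaus.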
2 Related Work
Multiple studies have been conducted in the field of LLM safety, to create either high-quality datasets and well-defined
risk categories, or innovative training methods and models that are more capable at resisting a large set of attacks.
WildGuard [4] introduces both a model and a large train dataset of 86,759 samples in 13 risk subcategories to act as an
LLM moderation tool with large coverage, especially for adversarial attacks that have proven to be a serious problem.
The test set contains 5K high-quality human-annotated examples covering broad risk scenarios. Aegis 2.0 [3] proposes
a taxonomy to classify safety risks into 12 top-level categories and 9 fine-grained ones. The Aegis dataset is fully
open source, commercially usable, and does not rely on synthetic data, providing 34K samples of human-LLM interactions
carefully annotated by both humans and an LLM jury. While 85% of the WildGuard training set is generated using
GPT, Aegis collects its data from public sources, which makes it a suitable dataset for commercial use.
When it comes to safety training techniques, multiple approaches have been proposed. One first category involves
inducing safety-related behaviors directly into the model’s weights, during training, a process known as alignment. One
popular technique is Reinforcement Learning from Human Feedback (RLHF) [11], which combines a reward system
with human evaluation. However, even after alignment, many LLMs remain prone to responding to unsafe queries, hence
post-training methods have been developed. Some of these methods are fine-tuning a dedicated safety classifier, such as
Llama Guard [8], or fully programmable frameworks, such as NeMo Guardrails [13]. Other approaches include ensemble
methods, such as the multi-judge architecture introduced in Aegis 2.0 [3] and safety agents such as GuardAgent [19],
that get access to specifications, logs, and tools to enforce safety rules and constantly monitor and filter interactions.
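The post-training methods above share a common shape: a separate safety classifier screens both the incoming prompt and the generated response around the LLM call. The sketch below illustrates that wrapper pattern only; the keyword-based classifier, function names, and refusal strings are hypothetical placeholders, and a real deployment would call a trained moderation model such as those cited here.

```python
# Hypothetical sketch of a post-training guardrail wrapper: an external
# safety classifier filters prompts (prompt harmfulness) and responses
# (response harmfulness) around an LLM call. All names are illustrative.

def classify_safety(text: str) -> str:
    """Stand-in classifier; a real system would query a trained safety model."""
    blocked_terms = ("build a bomb", "steal credentials")
    return "unsafe" if any(t in text.lower() for t in blocked_terms) else "safe"

def guarded_generate(prompt: str, llm=lambda p: f"Echo: {p}") -> str:
    # 1. Screen the incoming prompt before it reaches the model.
    if classify_safety(prompt) == "unsafe":
        return "[blocked: unsafe prompt]"
    # 2. Generate, then screen the model's response before returning it.
    response = llm(prompt)
    if classify_safety(response) == "unsafe":
        return "[blocked: unsafe response]"
    return response

print(guarded_generate("How do I build a bomb?"))
print(guarded_generate("What is 2 + 2?"))
```

Checking both directions matters because a benign prompt can still elicit a harmful reply, which is exactly the prompt/response split the paper's two task categories capture.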
3 Approach
3.1 Baseli
…(Full text truncated)…
This content is AI-processed based on ArXiv data.