
๐Ÿ“ Original Info

  • Title:
  • ArXiv ID: 2512.12501
  • Date:
  • Authors: Unknown

๐Ÿ“ Abstract

Generative Artificial Intelligence (AI) has created unprecedented opportunities for creative expression, education, and research. Text-to-image systems such as DALL·E, Stable Diffusion, and Midjourney can now convert ideas into visuals within seconds, but they also present a dual-use dilemma, raising critical ethical concerns: amplifying societal biases, producing high-fidelity disinformation, and violating intellectual property. This paper introduces SafeGen, a framework that embeds ethical safeguards directly into the text-to-image generation pipeline, grounding its design in established principles for Trustworthy AI. SafeGen integrates two complementary components: BGE-M3, a fine-tuned text classifier that filters harmful or misleading prompts, and Hyper-SD, an optimized diffusion model that produces high-fidelity, semantically aligned images. Built on a curated multilingual (English-Vietnamese) dataset and a fairness-aware training process, SafeGen demonstrates that creative freedom and ethical responsibility can be reconciled within a single workflow. Quantitative evaluations confirm its effectiveness, with Hyper-SD achieving IS = 3.52, FID = 22.08, and SSIM = 0.79, while BGE-M3 reaches an F1-Score of 0.81. An ablation study further validates the importance of domain-specific fine-tuning for both modules. Case studies illustrate SafeGen's practical impact in blocking unsafe prompts, generating inclusive teaching materials, and reinforcing academic integrity.

Full Content

The rapid proliferation of generative Artificial Intelligence (AI) has transformed the landscape of creative production, education, and scientific research. Among the most visible advances are text-to-image systems such as DALL·E, Stable Diffusion, and Midjourney [1,2]. These models can translate abstract concepts into vivid imagery within seconds, offering new opportunities for teaching and learning. For educators, they provide a means to visualize difficult theories, design tailored illustrations, and stimulate student engagement. For researchers, they open avenues for creative experimentation, data visualization, and cross-disciplinary collaboration.

Yet these possibilities are inseparable from a set of profound ethical concerns. Because text-to-image models are trained on massive, uncurated datasets harvested from the Internet, they frequently inherit and amplify existing societal biases, producing images that reinforce stereotypes of gender, race, or cultural identity [3,4]. At the same time, their ability to generate photorealistic but fabricated content raises the specter of large-scale disinformation and deepfake production, undermining trust in digital media [5]. Questions of intellectual property add another layer of complexity: models may inadvertently reproduce copyrighted elements without attribution or permission [6]. These risks are particularly acute in academic settings, where originality, accountability, and integrity are foundational values [7].

Attempts to address these problems through post-hoc safety mechanisms have proven inadequate. Commercial systems must walk a fine line between protecting users and preserving creative freedom. In practice, most solutions fall short. Stable Diffusion’s initial safety filter, for example, was criticized for its narrow scope: it targeted sexually explicit material but ignored violent or disturbing imagery [8]. Conversely, Midjourney’s stricter filters were soon shown to be vulnerable to adversarial prompt attacks, in which slight modifications to input text successfully bypassed restrictions [12]. Current systems thus rely heavily on after-the-fact filtering, a brittle approach that is easily circumvented. What is needed instead is a framework that embeds ethical safeguards into the design of the generative process itself. This call for proactive solutions aligns with a broader international conversation on the governance of AI. Over the past decade, scholars and policy bodies, including the European Union, the OECD, and Floridi and Cowls, have articulated converging principles for Trustworthy AI [9,10,11]. Across these initiatives, several themes recur: the imperative of fairness and inclusion, the obligation to prevent harm, the need for transparency and interpretability, the demand for accountability and human oversight, and the requirement of robustness in both technical and social contexts. These principles provide not only a normative compass but also a practical foundation for guiding the deployment of generative AI in sensitive environments such as education and research.

Generative AI directly tests the resilience of these principles. Bias amplification challenges fairness; deepfake production threatens non-maleficence; opaque safety filters undermine transparency; and the lack of clear governance weakens accountability. Academic contexts further demand attention to the integrity of scholarship, since the misuse of generative models risks eroding standards of originality and responsible authorship. Bridging the gap between principle and practice therefore requires concrete technical frameworks that operationalize ethical safeguards at the core of generative pipelines.

In response to this need, we introduce SafeGen, a framework that integrates ethical safeguards directly into text-to-image generation. SafeGen is grounded in the principles of Trustworthy AI and is designed specifically for educational and research contexts. The framework incorporates two core components: (i) BGE-M3, a fine-tuned classifier that proactively screens prompts for ethically problematic content before image synthesis; and (ii) Hyper-SD, an optimized diffusion model that produces high-quality, semantically aligned images while respecting ethical constraints. Together, these components demonstrate that creative freedom and ethical responsibility need not be mutually exclusive, but can be reconciled within a unified workflow.

In parallel, other research has explored integrating safety mechanisms directly into text-to-image models. For instance, Li et al.’s system (also called SafeGen) fine-tunes a diffusion model’s internal layers to eliminate sexually explicit visual concepts [27]. Similarly, Schramowski et al.’s Safe Latent Diffusion method removes or suppresses inappropriate image content during the generation process without requiring additional training [28]. However, such methods typically focus on a narrow subset of unsafe content (primarily NSFW material) or address only one aspect of the problem. By contrast, our SafeGen targets a broader spectrum of ethical risks (from bias and hate to misinformation and academic misconduct) and employs a dual-module strategy, combining prompt filtering with bias-aware generation, to provide a more comprehensive solution tailored to educational and research environments.

SafeGen is grounded in widely recognized frameworks for Trustworthy AI, including the EU’s Ethics Guidelines for Trustworthy AI [10], OECD AI Principles [11], UNESCO’s Recommendation on the Ethics of AI [24], and academic works such as Floridi and Cowls’ unified framework [9]. From these foundations, we adapt five actionable pillars tailored to the academic use of generative AI:

  1. Fairness, Non-Discrimination, and Inclusion: Rooted in the EU’s fairness principle [10] and the OECD’s human rights emphasis [11], this pillar targets algorithmic bias. SafeGen aims to ensure equitable outcomes for all users, reducing stereotypes and reinforcing inclusion [9].

  2. Prevention of Harm and Promotion of Well-Being: Combining the EU’s safety requirements [10], the OECD’s focus on societal well-being [11], and UNESCO’s “do no harm” principle [24], this pillar emphasizes preventing harmful content, including misinformation, violence, or hate speech, while promoting constructive educational use.

  3. Transparency and Explainability: Drawing on the transparency requirements shared across these frameworks [9,10,11], this pillar highlights the need for systems to remain intelligible to users. SafeGen not only provides clear explanations when prompts are blocked but also communicates the model’s capabilities and limitations. This connects the framework to the broader agenda of Explainable and Interpretable AI, which emphasizes human-understandable justifications for AI decisions and promotes user trust [23,24].

  4. Accountability and Human Oversight: Following EU and OECD requirements for accountability and oversight [10,11], this pillar establishes governance mechanisms, documentation standards, and human-in-the-loop processes. It ensures that AI decisions can be audited, challenged, and corrected when necessary [24].

  5. Robustness, Security, and Academic Integrity: Inspired by the robustness principle of the EU and OECD [10,11], this pillar extends beyond technical resilience against adversarial attacks. It also encompasses social robustness, ensuring that SafeGen supports the preservation of academic standards such as originality, integrity, and responsible authorship [25].

Together, these five pillars provide a normative foundation that aligns SafeGen with international principles for Trustworthy AI while adapting them to the specific challenges of generative models in academic environments.

SafeGen is designed as a practical instantiation of the five ethical pillars established in Section 2. Its architecture ensures that fairness, harm prevention, transparency, accountability, and robustness are embedded not as optional add-ons but as integral design choices throughout the system.

The core principle of SafeGen is that ethical safeguards must precede and accompany the generative process. To this end, the framework integrates two complementary modules:

- BGE-M3 Classifier: A fine-tuned Transformer-based model that proactively screens prompts, filtering out harmful, biased, or misleading inputs. This directly reflects the pillars of Fairness and Prevention of Harm, ensuring that outputs are not compromised by problematic inputs.
- Hyper-SD Generator: A customized diffusion model, adapted from Stable Diffusion [13], fine-tuned to deliver semantically faithful and high-quality images. It incorporates fairness-aware optimization to minimize bias in outputs while upholding academic integrity. This supports the pillars of Fairness and Robustness.

Together, these modules establish a dual safeguard: BGE-M3 enforces ethical integrity at the input stage, while Hyper-SD guarantees responsible and accurate synthesis. Both modules are built on the Transformer architecture [12] and employ subword tokenization methods such as Byte Pair Encoding (BPE) [6], which enhance robustness across diverse vocabularies and languages.
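The paper does not include reference code, but the two-stage flow described above can be sketched as follows. This is a minimal illustration assuming Hugging Face-style APIs; the checkpoint names ("safegen/prompt-filter", "safegen/hyper-sd-finetuned") and the "harmful" label are placeholders, not the authors' released artifacts.

```python
# Minimal sketch of SafeGen's two-stage flow: screen the prompt first,
# generate only if it passes. Checkpoint IDs below are hypothetical.
import torch
from transformers import pipeline
from diffusers import StableDiffusionPipeline

# Stage 1: prompt screening with a fine-tuned text classifier
# (the paper fine-tunes BGE-M3; this Hub ID is a placeholder).
prompt_filter = pipeline("text-classification", model="safegen/prompt-filter")

# Stage 2: image synthesis with a fine-tuned diffusion model
# (the paper uses a Hyper-SD variant of Stable Diffusion; ID is a placeholder).
generator = StableDiffusionPipeline.from_pretrained(
    "safegen/hyper-sd-finetuned", torch_dtype=torch.float16
).to("cuda")

def safe_generate(prompt: str):
    """Return an image for safe prompts, or an explanation for blocked ones."""
    verdict = prompt_filter(prompt)[0]
    if verdict["label"] == "harmful":  # label name is assumed, not from the paper
        # Transparency pillar: explain the refusal instead of failing silently.
        return None, f"Prompt blocked (confidence {verdict['score']:.2f})."
    image = generator(prompt).images[0]
    return image, "OK"
```

The key design point mirrored here is that the classifier gates the generator, so unsafe inputs never reach the diffusion stage.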

SafeGen’s components are explicitly aligned with the five ethical pillars:

- Classifier (BGE-M3): For multilingual robustness, the classifier leverages BERT [14], RoBERTa [15], and XLM-RoBERTa [16]. To address class imbalance (≈ 9:1 safe vs. harmful prompts), SafeGen employs balanced-class batching and Class-Balanced Focal Loss, ensuring sensitivity to rare but ethically significant signals (a sketch of this loss appears after this list). This reflects the principles of Fairness and Prevention of Harm.
- Generator (Hyper-SD): Hyper-SD extends the latent diffusion framework [13], trained with fairness-aware fine-tuning to mitigate representational bias. The model denoises latent representations to produce high-resolution, semantically aligned images. Training used a batch size of 128, a learning rate of 5e-5, and 100 epochs, demonstrating Robustness and accountability in design.
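As a concrete illustration of the imbalance handling named in the classifier bullet, the sketch below implements Class-Balanced Focal Loss in PyTorch, following the standard effective-number formulation; the hyperparameter values (beta, gamma) are illustrative defaults, since the paper does not report them.

```python
import torch
import torch.nn.functional as F

def class_balanced_focal_loss(logits, targets, samples_per_class,
                              beta=0.9999, gamma=2.0):
    """Class-Balanced Focal Loss sketch for the ~9:1 safe/harmful imbalance.

    samples_per_class: e.g. [730_000, 100_000] for safe vs. harmful prompts.
    beta, gamma: illustrative defaults; the paper does not report its values.
    """
    counts = torch.as_tensor(samples_per_class, dtype=torch.float32,
                             device=logits.device)
    # Effective-number class weights: w_c = (1 - beta) / (1 - beta ** n_c),
    # normalized so the weights sum to the number of classes.
    weights = (1.0 - beta) / (1.0 - beta ** counts)
    weights = weights / weights.sum() * len(samples_per_class)

    ce = F.cross_entropy(logits, targets, reduction="none")  # -log p_t per sample
    p_t = torch.exp(-ce)                                     # prob. of the true class
    focal = (1.0 - p_t) ** gamma * ce                        # down-weight easy examples
    return (weights[targets] * focal).mean()

# Example: a batch of 4 prompts, 2 classes (0 = safe, 1 = harmful)
logits = torch.randn(4, 2)
targets = torch.tensor([0, 0, 0, 1])
loss = class_balanced_focal_loss(logits, targets,
                                 samples_per_class=[730_000, 100_000])
```

Rare harmful prompts thus receive both a larger class weight and a focal boost when misclassified, which is the behavior the Fairness and Prevention of Harm pillars require of the filter.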

Ethical design also depends on carefully curated data, which anchors SafeGen’s commitment to inclusivity, transparency, and academic integrity:

- Image Data and Captions: Six categories (Animals, Cars, Bicycles, Motorbikes, Flowers, and Humans/DeepFashion) yielded over 65,000 images. Because many lacked captions, the vit-gpt2-image-captioning model was selected for automatic annotation after outperforming other baselines (Table 1); a captioning sketch follows this list. This enhances Transparency by providing interpretable textual grounding for visual data.
- Ethical and Multilingual Text Corpus: Captions were translated into Vietnamese, with Google Translate achieving the highest evaluation scores (Table 2). This reduces linguistic inequity and supports the pillar of Fairness, Non-Discrimination, and Inclusion [17]. For training the BGE-M3 classifier, an ~830,000-sample corpus was constructed from diverse sources, including Vietnamese fake news datasets [18], Vietnamese legal documents [?], and English corpora for toxic comments and biased news [2,?]. After preprocessing, the dataset contained 730,000 safe and 100,000 harmful samples. The inclusion of legal documents provided “clean” normative language, reinforcing Accountability and Academic Integrity by sharpening the classifier’s ability to distinguish acceptable from problematic text.
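A minimal sketch of the automatic captioning step is shown below. It assumes the widely used `nlpconnect/vit-gpt2-image-captioning` checkpoint on the Hugging Face Hub; the paper names the model family but not the exact checkpoint ID.

```python
# Attach an English caption to every uncaptioned training image
# (checkpoint ID is an assumption; the paper names only "vit-gpt2-image-captioning").
from transformers import pipeline

captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

def caption_images(image_paths):
    """Return a {path: caption} mapping for the uncaptioned images."""
    captions = {}
    for path in image_paths:
        result = captioner(path)  # e.g. [{"generated_text": "a red flower in a garden"}]
        captions[path] = result[0]["generated_text"]
    return captions
```

The resulting English captions are what the pipeline then translates into Vietnamese to build the multilingual corpus described above.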

SafeGen was evaluated not only for its technical accuracy but also for its ability to uphold the ethical safeguards outlined in the five pillars. Experiments therefore assessed both performance benchmarks and the framework’s effectiveness in filtering harmful prompts, promoting fairness, and preserving academic integrity.

- Text Classification: BGE-M3 and PhoBERT-base-v2 were fine-tuned on a curated ~830,000-sample corpus. To address the ≈ 9:1 imbalance between safe and harmful samples, balanced-class sampling and Class-Balanced Focal Loss were adopted, reflecting the pillars of Fairness and Prevention of Harm. Training details are shown in Table 3.
- Text-to-Image Generation: Hyper-SD and mini-SD were trained with a batch size of 128, a learning rate of 5e-5, and 100 epochs. Deployment was carried out on an NVIDIA A40 GPU, wrapped in a web application using Python and Gradio [19], ensuring Transparency and Accessibility; a minimal Gradio wrapper is sketched below.
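The deployment described above can be approximated with a short Gradio wrapper like the one below; the `safegen_pipeline` module and `safe_generate` helper are hypothetical names carried over from the earlier pipeline sketch.

```python
import gradio as gr

# Reuses the two-stage `safe_generate` helper sketched earlier; the module
# name is hypothetical, not part of the paper's released code.
from safegen_pipeline import safe_generate

def serve(prompt: str):
    image, message = safe_generate(prompt)
    return image, message

demo = gr.Interface(
    fn=serve,
    inputs=gr.Textbox(label="Prompt (English or Vietnamese)"),
    outputs=[gr.Image(label="Generated image"), gr.Textbox(label="Status")],
    title="SafeGen demo",
)

if __name__ == "__main__":
    demo.launch()  # the paper reports serving the app on an NVIDIA A40 GPU
```

Returning a status message alongside the image (or in place of it, when a prompt is blocked) is one simple way to honor the Transparency pillar in the user interface.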

Performance was measured using widely adopted benchmarks:

- Classification: Accuracy and F1-Score captured reliability under class imbalance.
- Generation: Three complementary metrics assessed image outputs (a computation sketch follows this list):
  • Inception Score (IS): Higher scores reflect improved quality and diversity.
  • Fréchet Inception Distance (FID): Lower values indicate closer alignment to real image distributions.
  • Structural Similarity Index (SSIM): Values closer to 1 reflect higher perceptual similarity and robustness.
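These three metrics can be computed, for example, with torchmetrics as sketched below; the paper does not state its evaluation library, and the dummy tensors here are only for shape illustration (real evaluation needs many more images).

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore
from torchmetrics.image import StructuralSimilarityIndexMeasure

# Dummy uint8 image batches standing in for real and generated samples.
real = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)

# Inception Score: higher = better quality and diversity.
inception = InceptionScore()
inception.update(fake)
is_mean, is_std = inception.compute()

# FID: lower = generated distribution closer to the real one.
# feature=64 keeps this toy example stable; 2048 is the usual choice.
fid = FrechetInceptionDistance(feature=64)
fid.update(real, real=True)
fid.update(fake, real=False)
fid_value = fid.compute()

# SSIM: closer to 1 = higher perceptual similarity (expects float inputs).
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
ssim_value = ssim(fake.float() / 255.0, real.float() / 255.0)

print(f"IS = {is_mean:.2f}, FID = {fid_value:.2f}, SSIM = {ssim_value:.2f}")
```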

- Classification: Fine-tuning significantly improved both models (Table 4). BGE-M3 achieved the strongest performance (Accuracy = 0.8215, F1 = 0.8145), validating its reliability for detecting ethically problematic prompts and supporting Prevention of Harm.
- Generation: Hyper-SD outperformed the baseline and mini-SD models across all metrics (Table 5). With IS = 3.52, FID = 22.08, and SSIM = 0.79, it demonstrated robust, semantically faithful generation aligned with Fairness and Academic Integrity.

These quantitative metrics also directly correspond to SafeGen’s ethical design pillars. Specifically, the classifier’s F1-Score reflects the system’s ability to uphold Prevention of Harm through prompt filtering, while the generator’s FID and SSIM scores indicate Robustness and Fairness in image synthesis. Together, these results quantitatively support SafeGen’s alignment with the ethical principles introduced in Section 2.

To evaluate the contribution of each component:

- Generation: Fine-tuned Hyper-SD achieved superior results compared to its base version and external baselines (Table 6), showing that domain-specific fine-tuning is critical for Robustness.
- Classification: Replacing BGE-M3 with alternative encoders caused notable performance drops (Table 7), confirming its role in supporting Fairness and Accountability.

The experimental results confirm that SafeGen successfully combines technical robustness with ethical safeguards. Each quantitative indicator introduced earlier (F1, FID, SSIM) is thus interpreted not only as a measure of technical accuracy but also as evidence of adherence to specific ethical pillars (Prevention of Harm, Fairness, and Robustness), demonstrating that SafeGen’s ethical framework is empirically grounded rather than purely conceptual. Fine-tuning on ethically curated, multilingual datasets proved essential; off-the-shelf models were insufficient for nuanced filtering and high-fidelity generation in the Vietnamese academic context. This finding supports broader concerns in the literature regarding the risks of relying on large but generic “stochastic parrots” [17].

SafeGen demonstrates practical benefits for education and research. It enables the generation of safe and inclusive teaching materials, supports the visualization of complex concepts, and mitigates risks of disinformation or fabricated content. At the same time, the results highlight enduring challenges. The tension between safety and censorship remains unresolved [20]. Although proactive filtering reduces harmful outputs, sustainable trust will require complementary governance frameworks.

Future development should focus on enhancing Transparency and Accountability. Standardized reporting tools such as Model Cards [21] and Datasheets for Datasets [22] can document performance limitations, dataset composition, and bias sources. Likewise, incorporating Explainable AI (XAI) techniques [23] could transform rejection events into pedagogical opportunities by clarifying why specific prompts are blocked.

Within academia, SafeGen plays a dual role: enabling creativity while safeguarding academic integrity. By constraining unsafe or deceptive uses of text-to-image models, it addresses concerns about fabricated data, plagiarism, and misuse in student work [7,25]. Yet, technical safeguards must operate in tandem with institutional policies and clear user guidelines. Together, these measures can ensure that generative AI is not only innovative but also responsible and trustworthy.

This paper introduced SafeGen, a framework that embeds ethical safeguards directly into the text-to-image generation pipeline. By combining a proactive prompt classifier (BGE-M3) with a fine-tuned diffusion generator (Hyper-SD), SafeGen illustrates that technical performance and ethical responsibility can be integrated within a single system.

Experiments confirmed the framework’s effectiveness: BGE-M3 reliably filtered harmful prompts, and Hyper-SD produced high-fidelity, semantically aligned images. These results highlight the value of domain-specific fine-tuning on ethically curated, multilingual data. Beyond metrics, SafeGen’s evaluations demonstrate alignment with the five ethical pillars: Fairness, Prevention of Harm, Transparency, Accountability, and Academic Integrity.

SafeGen thus offers a roadmap for developing responsible generative AI systems tailored for educational and research settings. Future work will expand on three fronts: continuous bias auditing and adversarial testing to ensure robustness, standardized governance tools such as Model Cards for transparency, and closer integration with institutional policies to reinforce academic integrity. In advancing these directions, SafeGen contributes to a vision of generative AI that is not only creative but also trustworthy, inclusive, and aligned with scholarly values.

References

This content is AI-processed based on open access ArXiv data.
