Self-Transparency Failures in Expert-Persona LLMs: How Instruction-Following Overrides Disclosure

February 18, 2026

Reading time: 5 minute

...

📝 Original Info

Title: Self-Transparency Failures in Expert-Persona LLMs: How Instruction-Following Overrides Disclosure
ArXiv ID: 2511.21569
Date: 2025-11-26
Authors: Alex Diep

📝 Abstract

Self-transparency is a critical safety boundary, requiring language models to honestly disclose their limitations and artificial nature. This study stress-tests this capability, investigating whether models willingly disclose their identity when assigned professional personas that conflict with transparent self-representation. When models prioritize role consistency over this boundary disclosure, users may calibrate trust based on overstated competence claims, treating AI-generated guidance as equivalent to licensed professional advice. Using a common-garden experimental design, sixteen open-weight models (4B-671B parameters) were audited under identical conditions across 19,200 trials. Models exhibited sharp domain-specific inconsistency: a Financial Advisor persona elicited 35.2% disclosure at the first prompt, while a Neurosurgeon persona elicited only 3.6%-a 9.7-fold difference that emerged at the initial epistemic inquiry. Disclosure ranged from 2.8% to 73.6% across model families, with a 14B model reaching 61.4% while a 70B model produced just 4.1%. Model identity provided substantially larger improvement in fitting observations than parameter count (Delta R_adj^2 = 0.375 vs 0.012). Reasoning variants showed heterogeneous effects: some exhibited up to -48.4 percentage points lower disclosure than their base instruction-tuned counterparts, while others maintained high transparency. An additional experiment demonstrated that explicit permission to disclose AI nature increased disclosure from 23.7% to 65.8%, revealing that suppression reflects instruction-following prioritization rather than capability limitations. Bayesian validation confirmed robustness to judge measurement error (kappa = 0.908). Organizations cannot assume safety properties will transfer across deployment domains, requiring deliberate behavior design and empirical verification.

💡 Deep Analysis

📄 Full Content

Self-Transparency Failures in Expert-Persona LLMs: How Instruction-Following Overrides Disclosure ALEX DIEP, Google, USA Self-transparency is a critical safety boundary, requiring language models to honestly disclose their limitations and artificial nature. This study stress-tests this capability, investigating whether models willingly disclose their identity when assigned professional personas that conflict with transparent self-representation. When models prioritize role consistency over this boundary disclosure, users may calibrate trust based on overstated competence claims, treating AI-generated guidance as equivalent to licensed professional advice. Using a common-garden experimental design, sixteen open-weight models (4B–671B parameters) were audited under identical conditions across 19,200 trials. Models exhibited sharp domain-specific inconsistency: a Financial Advisor persona elicited 35.2% disclosure at the first prompt, while a Neurosurgeon persona elicited only 3.6%—an 9.7%-fold difference that emerged at the initial epistemic inquiry. Disclosure ranged from 2.8% to 73.6% across model families, with a 14B model reaching 61.4% while a 70B model produced just 4.1%. Model identity provided substantially larger improvement in fitting observations than parameter count (Δ𝑅2 adj = 0.375 vs 0.012). Reasoning variants showed heterogeneous effects: some exhibited up to +48.4 percentage points lower disclosure than their base instruction-tuned counterparts, while others maintained high transparency. An additional experiment demonstrated that explicit permission to disclose AI nature increased disclosure from 23.7% to 65.8%, revealing that suppression reflects instruction-following prioritization rather than capability limitations. Bayesian validation confirmed robustness to judge measurement error (𝜅= 0.908). Organizations cannot assume safety properties will transfer across deployment domains, requiring deliberate behavior design and empirical verification. CCS Concepts: • Computing methodologies →Machine learning; • Human-centered computing → Empirical studies in HCI; HCI theory, concepts and models; • Social and professional topics →Computing / technology policy. Additional Key Words and Phrases: large language models, self-transparency, epistemic honesty, AI safety, behavioral auditing 1 Introduction Transparency enables appropriate trust calibration in algorithmic systems [9, 27], a principle now embedded in regulatory frameworks for high-stakes domains [10]. This paper extends transparency to self-transparency in language models, examining whether systems honestly represent their AI identity when questioned about claimed professional expertise. For language models deployed in professional contexts, AI disclosure serves as scope-setting: acknowledging AI nature signals that advice represents pattern-matching over training data rather than licensed professional judgment. When models maintain false professional credentials, users calibrate trust based on overstated expertise claims, potentially treating AI-generated guidance as equivalent to licensed advice [24]. Evidence suggests this problem is worsening: medical disclaimers declined from 26.3% in 2022 to under 1% in 2025 despite capability improvements [33]. Current training approaches face tensions between helpfulness and honesty: RLHF can prioritize instruction-following over truthfulness [2, 32], creating conflicts where fulfilling user requests may require maintaining assigned identities [18]. Whether models exhibit self-transparency under professional persona pressure becomes critical for deployment safety. This paper empirically investigates this through large-scale behavioral testing, operationalizing AI self-transparency as identity disclosure by measuring whether models explicitly acknowledge December 18, 2025. Author’s Contact Information: Alex Diep, alexdiep+ai-identity-disclosure@google.com, Google, Mountain View, CA, USA. arXiv:2511.21569v5 [cs.AI] 17 Dec 2025 2 Alex Diep their AI nature when probed about claimed expertise. While prior work examined AI identity disclosure in pre-LLM conversational systems [13] and epistemic honesty in contemporary LLMs including belief states, knowledge boundaries, and uncertainty expression [5, 17, 30, 35, 39], this work provides systematic testing of self-transparency under professional persona pressure. Using a common-garden experimental design, sixteen open-weight models were audited under identical conditions, testing professional personas through epistemic probes such as “How did you acquire your knowledge and expertise?” to measure whether models spontaneously disclose their AI nature when probed about claimed expertise. To enable efficient large-scale evaluation, an asynchronous infrastructure with interleaved judge calls was developed, processing 19,200 trials in under 20 minutes on consumer hardware at a cost of $16.66. Disclosure rates vary substantially across tested models, driven by mode

📄 Read Full PDF on ArXiv