Text-to-image diffusion models have drawn significant attention for their ability to generate diverse and high-fidelity images. However, when generating from multi-concept prompts, one concept token often dominates the generation, suppressing the others, a phenomenon we term the Dominant-vs-Dominated (DvD) imbalance. To systematically analyze this imbalance, we introduce DominanceBench and examine its causes from both data and architectural perspectives. Through various experiments, we show that limited instance diversity in training data exacerbates inter-concept interference. Analysis of cross-attention dynamics further reveals that dominant tokens rapidly saturate attention, progressively suppressing others across diffusion timesteps. In addition, head ablation studies show that the DvD behavior arises from distributed attention mechanisms across multiple heads. Our findings provide key insights into generative collapse, advancing toward more reliable and controllable text-to-image generation.
Text-to-image diffusion models [9,13,21,24,25,27,39] have achieved remarkable success in generating high-quality images from textual descriptions. However, ensuring the model's representational fidelity to textual concepts [38] remains a fundamental challenge. Recent research has extensively explored this limitation from complementary perspectives. One line of work investigates memorization [3,5,11,12,15,26,28,32,33,37], where models reproduce near-identical images across different random seeds, mainly due to excessive duplication of specific image-prompt pairs in training data. Another line focuses on image editing [2,8,10,22], aiming to enhance semantic compositional capability by addressing failures in generating images from complex prompts containing multiple diverse concepts. In this work, we examine a complementary aspect that arises from the interplay of these two dimensions: training data characteristics and multi-concept compositional capability. We observe that when generating images from prompts containing multiple concepts, one concept can visually overwhelm the generation while the other is completely suppressed and fails to appear. For example, as shown in Fig. 1, when generating images from the prompt "Neuschwanstein Castle coaster" across different random seeds, the Castle's distinctive architecture dominates nearly all outputs, while the coaster concept is entirely absent. In this paper, we refer to this as the Dominant-vs-Dominated (DvD) phenomenon.
DvD extends the existing understanding by operating at the concept level through visual dominance: unlike memorization, which concerns prompt-specific reproduction, and concept editing, which addresses semantic compositional failures, DvD reveals how an individual concept's visual characteristics systematically suppress others during multi-concept generation. We hypothesize that this dominance emerges from visual diversity disparity in training data: concepts with limited variation (e.g., landmarks, artists) develop rigid visual priors, while high-diversity concepts (e.g., everyday objects) develop flexible representations. Through controlled experiments using DreamBooth [29] to manipulate visual diversity, we show that dominance increases monotonically as training diversity decreases, validating visual diversity disparity as the root cause.
To systematically investigate how DvD manifests, we propose DominanceBench, a curated benchmark of 300 prompts exhibiting strong DvD behavior. Through cross-attention analysis, we reveal that (1) DvD prompts exhibit significantly higher attention concentration on dominant tokens in lower-resolution layers at early denoising steps, (2) dominated concepts experience a sharp attention decline in the early phase of the denoising process, and (3) unlike memorization, which localizes to specific heads, DvD emerges from distributed cooperation among multiple attention heads.
Our main contributions are:
• We characterize the Dominant-vs-Dominated (i.e., DvD) phenomenon and demonstrate through controlled experiments that visual diversity disparity in training data is its root cause.
• We propose DominanceBench, a benchmark dataset for systematic analysis of DvD across concept categories.
• We reveal the internal mechanisms of DvD through comprehensive cross-attention analysis, identifying when (early timesteps), where (lower-resolution layers), and how (distributed across heads) dominance manifests during generation.
Memorization in diffusion models refers to the phenomenon where models replicate near-identical training images during generation, raising significant privacy and copyright concerns [3,33]. To understand how memorization is encoded in model architectures, researchers have examined cross-attention mechanisms from multiple perspectives, revealing imbalanced attention focus in token embeddings [5,26], prediction magnitudes [37], and localized neurons [11].
Recent work has further investigated the root causes: [28] provided a geometric framework relating memorization to data manifold dimensionality, and [15] revealed that overestimation during early denoising collapses trajectories toward memorized images. While these works focus on detecting and preventing prompt-specific reproduction of entire training images, our work investigates how visual diversity disparity in training data leads to concept-level dominance in multi-concept generation.
Text-to-image diffusion models continue to face substantial difficulties when prompts contain multiple concepts, such as several objects, attributes, or artistic styles, often yielding attribute leakage, concept mixing, or incomplete subjects. These limitations have been widely reported across compositional diffusion and attention-guided control frameworks [4,6,10,16,19,34,36], which show that even strong diffusion backbones tend to violate object-attribute bindings or collapse distinct entities. Recently, multi-concept customization and multi-subject generation approaches, including MC² [14], FreeCustom [8], Custom Diffusion [18], Cones2 [20], OMG [17], and Nested Attention [22], further reveal persistent identity entanglement, occlusion, and interference when multiple user-defined concepts are composed.
However, recent compositional benchmarks and feedback-driven analyses [7,35] demonstrate that diffusion models still struggle with relational consistency and fine-grained concept grounding. While these approaches focus on architectural modifications and attention mechanisms, our work identifies visual diversity disparity in training data as a root cause of systematic concept dominance in multi-concept generation.
Multi-concept generation is a fundamental capability of text-to-image diffusion models, enabling users to compose complex scenes from textual descriptions. While recent work has explored training data influence through memorization studies [3,33,37] and compositional generation through concept editing [2,10], we observe a distinct failure mode that operates at the concept level through visual dominance.
We define the Dominant-vs-Dominated (DvD) phenomenon as cases where one concept (the dominant) visually overwhelms the generation, while the other (the dominated) is completely suppressed and fails to appear.
Illustrative example. As illustrated in Fig. 1, the prompt “Neuschwanstein Castle coaster” demonstrates this phenomenon: across multiple random seeds, the Castle’s distinctive architecture dominates the generation while the coaster concept is suppressed. This pattern persists across different model versions (SD 1.4 and SD 2.1), indicating that DvD reflects a fundamental issue in diffusion-based generation rather than a model-specific artifact.
Hypothesis: visual diversity disparity. We hypothesize that DvD stems from the disparity in visual diversity across concepts in training data. To investigate this, we examine training images from the LAION [31] dataset for both concepts (Fig. 2). As shown in Fig. 2a, Neuschwanstein Castle, being a unique landmark, appears with highly consistent visual features: the iconic white facade and pointed towers. In contrast, the coaster, an everyday object, appears in far more varied forms (Fig. 2b). This disparity in training data diversity leads diffusion models to develop internal visual representations with different levels of flexibility. Concepts with limited visual variation, such as famous landmarks, specific artists, or iconic characters, form strongly reinforced, rigid visual priors during training, while concepts with high diversity develop more flexible, adaptable representations. When such concepts are combined in a multi-concept prompt, the rigid priors tend to dominate the generation process, overwhelming and suppressing the more flexible concept.
To quantify the degree of dominance, we define the DvD Score as a metric based on visual presence assessment. For a two-concept prompt, each concept is evaluated through $N$ binary questions using Qwen2.5-VL [1]. Let $C_1$ and $C_2$ denote the number of "Yes" responses for the two concepts. The DvD Score is defined as:

$$\text{DvD Score} = 100 \cdot \frac{C_1}{N}\left(1 - \frac{C_2}{N}\right) \tag{1}$$
This metric ranges from 0 to 100, with higher values indicating stronger dominance. We set $N = 5$ with concept-type-specific questions (e.g., for artists: "Is this image painted in the artistic style of Van Gogh?") and consider a prompt as DvD when $C_1 \geq 3$ and $C_2 < 3$ (DvD Score $\geq 36$). The complete set of questions is provided in the Appendix.
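As a minimal illustration, the sketch below computes the DvD Score under the formula reconstructed in Eq. (1); the function names are ours, not the authors' implementation, and the VQA step is assumed to produce the yes-counts elsewhere.

```python
# Sketch of the DvD Score, assuming Eq. (1) as reconstructed above.

N_QUESTIONS = 5  # binary VQA questions per concept

def dvd_score(c1: int, c2: int, n: int = N_QUESTIONS) -> float:
    """DvD Score in [0, 100]; higher means concept 1 dominates concept 2."""
    return 100.0 * (c1 / n) * (1.0 - c2 / n)

def is_dvd(c1: int, c2: int) -> bool:
    """DvD criterion: C1 >= 3 and C2 < 3, which implies a score of at least 36."""
    return c1 >= 3 and c2 < 3

print(dvd_score(3, 2))  # ~36.0: the boundary case that defines the threshold
print(dvd_score(5, 0))  # 100.0: complete dominance
```

Note that the criterion $C_1 \geq 3$ and $C_2 < 3$ guarantees a score of at least $100 \cdot 0.6 \cdot 0.6 = 36$, which is consistent with the stated threshold.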
To systematically investigate the causes and mechanisms of DvD, we propose DominanceBench by collecting prompts from the LAION dataset [31], on which SD was trained. We focus on prompts containing two concepts: one from low-diversity groups (artist, landmark, character) and one from a high-diversity group (object, including everyday items such as bags, mugs, and t-shirts). We collect 300 prompts in total, with 100 prompts for each pairing type.
For each prompt, we generate 10 images using SD 1.4 with different random seeds. We compute the DvD Score for each generated image and include a prompt in DominanceBench if at least 7 out of 10 images exceed the threshold of 36.
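The inclusion rule can be sketched as follows; the per-seed scores are assumed to come from the VQA scoring step described above.

```python
# Sketch of the DominanceBench inclusion rule: a prompt qualifies when at
# least 7 of its 10 seed generations reach the DvD Score threshold of 36.

THRESHOLD = 36.0

def qualifies(seed_scores: list, min_hits: int = 7) -> bool:
    """True if enough generations exhibit DvD for the prompt to be included."""
    return sum(score >= THRESHOLD for score in seed_scores) >= min_hits

print(qualifies([80, 64, 36, 52, 100, 44, 90, 20, 12, 36]))  # True: 8 of 10 hits
```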
While the initial collection is performed using SD 1.4, we also evaluate the same prompts with SD 2.1 to examine whether DvD persists across model versions. As shown in Fig. 3, while the overall DvD Score decreases in SD 2.1, a substantial portion of prompts still exceeds the threshold of 36. To validate this threshold, we also use 300 balanced prompts where both concepts successfully appear in generation (details in Appendix). These balanced prompts show significantly lower DvD Scores (median: 11.6), confirming that our threshold effectively distinguishes DvD cases.
In this section, we analyze the causes and mechanisms of DvD through controlled experiments and attention analysis on SD 1.4.
Takeaway 1: Lower visual diversity leads to stronger dominance. When a concept is learned from limited variations, its representation becomes overfit to specific visual patterns, causing it to dominate other concepts in multi-concept compositions.
In Section 3.1, we hypothesized that DvD stems from the disparity in visual diversity across concepts in training data. To directly test this hypothesis, we conduct a controlled experiment where we systematically manipulate the visual diversity of a single concept and measure the resulting dominance in multi-concept generation.
Experimental Setup. We fine-tune SD 1.4's UNet using DreamBooth [29] to learn a new concept token "dvddog" from 120 ImageNet [30] dog images. To systematically control visual diversity, we create six training variants by varying the number of dog breeds: $D_1$, $D_2$, $D_4$, $D_6$, $D_8$, $D_{10}$, where the subscript indicates the number of breeds. $D_1$ uses all 120 images from a single breed (minimal diversity), while $D_{10}$ uses 12 images from each of 10 breeds (maximal diversity).
All models are trained for 50 epochs with learning rate $1 \times 10^{-5}$. We use the original model (without fine-tuning) as a baseline for comparison.
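A sketch of how the diversity-controlled training sets could be assembled follows; the breed dictionary and sampling seed are hypothetical placeholders, not the authors' code.

```python
# Sketch of building the diversity variants D_k: 120 dog images total,
# drawn evenly from k breeds.

import random

def build_variant(images_by_breed: dict, k: int, total: int = 120,
                  seed: int = 0) -> list:
    """Return `total` image paths drawn evenly from the first k breeds."""
    rng = random.Random(seed)
    breeds = sorted(images_by_breed)[:k]
    per_breed = total // k            # D_1: 120 from one breed; D_10: 12 each
    subset = []
    for breed in breeds:
        subset += rng.sample(images_by_breed[breed], per_breed)
    return subset
```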
Evaluation. We construct 50 test prompts pairing "dvddog" with diverse concepts across different compositional scenarios: object co-occurrence (e.g., "a dvddog and a cat"), scene context (e.g., "a man walking with a dvddog"), and style modifiers (e.g., "a dvddog in 3d render"). For each prompt, we generate 10 images per model variant using different random seeds and compute the dominance score, treating "dvddog" as $C_1$.

Results. Dominance increases as training diversity decreases (Fig. 4). The second example (Fig. 4c and 4d) shows a more pronounced effect: $D_1$, $D_2$, and $D_4$ all exceed the threshold, demonstrating that even moderate diversity reduction can trigger dominance in certain compositional contexts. Across both examples, lower-diversity models exhibit visual outputs where "dvddog" dominates the entire scene, with the paired concept either absent or barely visible. In contrast, higher-diversity models ($D_8$, $D_{10}$, baseline) successfully generate both concepts with balanced presence.
This controlled experiment directly validates our hypothesis: visual diversity disparity between concepts is the root cause of DvD. When a concept is learned from limited variations, its representation becomes overfitted to specific visual patterns, causing it to dominate other concepts with higher diversity in multi-concept compositions. Additional examples demonstrating this trend across all 50 test prompts are provided in the Appendix.
Having established that visual diversity disparity causes DvD, we now investigate the internal mechanisms through which this phenomenon manifests during generation by analyzing cross-attention patterns in prompts from DominanceBench.
Takeaway 2: In the first denoising step, high attention focus on the dominating concept's token in lower-resolution layers strongly correlates with DvD occurrence.
To quantify how strongly the model focuses on specific tokens during early generation stages, we define a focus score:

$$\text{Focus}^{(\ell,t)} = \frac{\max_i a_i^{(\ell,t)} - \bar{a}^{(\ell,t)}_{\text{others}}}{H(a^{(\ell,t)}) / \log_2 N + \epsilon} \tag{2}$$

where $a^{(\ell,t)} = (a_1^{(\ell,t)}, \ldots, a_N^{(\ell,t)})$ represents the cross-attention weights over $N$ prompt tokens at layer $\ell$ and timestep $t$ (averaged across all spatial positions and attention heads), $\max_i a_i^{(\ell,t)}$ is the peak attention weight, $\bar{a}^{(\ell,t)}_{\text{others}}$ is the mean of all other attention weights, $H(a^{(\ell,t)})$ is the entropy of the attention distribution, and $\epsilon$ is a small constant for numerical stability.
Intuitively, the focus score measures the ratio of attention concentration on the peak token to the overall dispersion across all tokens. High values indicate strong attention concentration on a single dominating token, while low values indicate attention distributed evenly across multiple tokens.
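A minimal sketch of this score, following Eq. (2) as reconstructed above, is given below; variable names and the array layout are our assumptions.

```python
# Sketch of the focus score: attention is averaged over heads and spatial
# positions, then the peak-vs-rest deviation is divided by the
# length-normalized entropy.

import numpy as np

def focus_score(attn: np.ndarray, eps: float = 1e-8) -> float:
    """attn: [heads, positions, N] cross-attention weights at one layer/step."""
    a = attn.mean(axis=(0, 1))                 # average to a length-N vector
    a = a / a.sum()                            # renormalize to a distribution
    peak = a.max()
    others = np.delete(a, a.argmax()).mean()   # mean of all non-peak tokens
    entropy = -(a * np.log2(a + eps)).sum()
    dispersion = entropy / np.log2(len(a))     # normalize by log2 N
    return float((peak - others) / (dispersion + eps))
```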
Experimental Setup. We compute focus scores across all UNet layers during the first denoising step (t = 50).
To characterize the attention patterns in DominanceBench prompts, we compare DominanceBench prompts with 300 balanced prompts where all concepts are successfully generated without DvD. These balanced prompts have an average DvD Score of 20.64, significantly lower than our threshold of 36.
Results. Fig. 5 shows the mean focus scores across all UNet layers (left) and their aggregations for different layer groups (right). DominanceBench prompts exhibit significantly higher focus scores than balanced prompts across layers 5-10, with lower-resolution layers (layers 8-10) showing the most pronounced difference. The elevated focus scores in these lower-resolution layers suggest that the dominant concept’s semantic representation is prioritized during early semantic-level processing.
To verify that this attention concentration indeed targets the dominating concept's token, we analyze which token receives the peak attention in lower-resolution layers (layers 8-10). We find that in 249 out of 300 DominanceBench prompts (83%), the dominating concept's token receives the maximum attention in these semantic layers. This confirms that DvD manifests through excessive early attention concentration on the dominating concept's token in semantic processing layers, preventing adequate attention allocation to other concepts throughout the generation process.
Takeaway 3: Dominated concepts rapidly lose attention in early denoising timesteps. In the critical early phase of the generation process, dominated tokens exhibit a sharp attention decline while dominating tokens maintain concentration, establishing an imbalance that persists throughout generation.
The focus score analysis revealed that DominanceBench prompts exhibit excessive attention concentration on the dominating concept's token in lower-resolution layers at the first denoising timestep. But what happens to the dominated concept's token? Fig. 6 shows the attention patterns for "The Colosseum Rome Italy Carry-all Pouch" and the resulting images. Surprisingly, "pouch" exhibits high attention at layer 7 (middle block), where semantic content is primarily encoded. However, this leads to a DvD outcome (Fig. 6a) where only the Colosseum appears, whereas both concepts should appear together as in balanced generation (Fig. 6b). This paradox (high attention yet failed generation) motivates us to examine the temporal dynamics of cross-attention to understand how dominated concepts lose their influence during generation.
Experimental Setup. To understand the temporal dynamics of both concepts, we track each token at the layer where it exhibits peak attention concentration. For the dominating concept, we examine lower-resolution layers (layers 8-10) where high focus scores were observed (Sec. 4.2.1). For the dominated concept, we track layer 7, where the dominated token’s attention exhibits the highest focus score (Fig. 7).
To quantify the temporal evolution of attention for these tokens, we first define the attention deviation for token $i$ at layer $\ell$ and timestep $t$ as

$$\alpha_i^{(\ell,t)} = a_i^{(\ell,t)} - \bar{a}^{(\ell,t)}_{\text{others}}$$

(with the same averaging across spatial positions and heads as in Eq. (2)). Note that we use the raw attention deviation rather than the entropy-normalized focus score, as we track dynamics within individual prompts where the token count remains constant. Then, the attention change across one denoising step is:

$$\Delta\alpha_i^{(\ell,t)} = \alpha_i^{(\ell,t-1)} - \alpha_i^{(\ell,t)} \tag{3}$$

where timesteps decrease from $t = 50$ toward $t = 0$ during denoising, and we use $\ell \in \{8, 9, 10\}$ (averaged) for dominating tokens and $\ell = 7$ for dominated tokens. Negative $\Delta\alpha$ indicates decreasing concentration on that token.
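The two quantities can be sketched as follows, under our assumption that $\Delta\alpha$ compares consecutive denoising steps (Eq. (3)); the data layout is illustrative.

```python
# Sketch of the attention deviation and its temporal change. `attn_by_step`
# maps a timestep to the averaged length-N attention vector at the
# tracked layer; t counts down from 50 during denoising.

import numpy as np

def deviation(a: np.ndarray, i: int) -> float:
    """alpha_i = a_i minus the mean attention of all other tokens."""
    return float(a[i] - np.delete(a, i).mean())

def attention_change(attn_by_step: dict, i: int, t: int) -> float:
    """Delta-alpha across one step; negative means token i loses concentration."""
    return deviation(attn_by_step[t - 1], i) - deviation(attn_by_step[t], i)
```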
Results. Fig. 8 shows the temporal evolution of attention change across timesteps. Individual prompts (Fig. 8a,b) and aggregated statistics (Fig. 8c) consistently show that the dominated concept begins with strongly negative โฮฑ values in the earliest timestep intervals (50-40), indicating rapid attention loss. In contrast, the dominating concept starts with positive โฮฑ values, gaining relative attention in the early phase. This suggests that the attention imbalance is established early and persists throughout generation, with dominated concepts losing their semantic influence in the critical early timesteps where the overall image structure is determined.
Takeaway 4: DvD arises from distributed attention mechanisms across multiple heads, unlike memorization, which localizes to specific heads. This indicates that mitigating DvD requires architectural or training-level interventions rather than simple head pruning.
Our previous analyses revealed that DvD manifests through excessive attention concentration in specific layers during early denoising timesteps. These layer-level findings identified critical layers (e.g., layers 7-10) where dominating and dominated concepts exhibit distinct attention patterns. However, each layer in SD 1.4’s architecture contains 16 attention heads, each potentially contributing differently to the overall layer behavior. This raises an important question: do all heads within these critical layers equally contribute to DvD, or is the phenomenon driven by specific heads? Interestingly, while DvD stems from visual diversity disparity in training data (Sec. 4.1), a related memorization phenomenon also exhibits reduced visual diversity in generated outputs. However, these phenomena differ in crucial ways: memorization affects individual prompts that were overfit during training, while DvD occurs in compositional prompts where multiple concepts compete for attention. This difference in compositional complexity suggests their underlying attention mechanisms may also differ.
To answer this question and understand how DvD differs from related phenomena, we conduct a head ablation study comparing DvD prompts from DominanceBench with 500 memorized prompts identified in prior work [3]. This comparison allows us to determine whether these phenomena arise from similar mechanisms (both localized or both distributed) or exhibit different head-level characteristics.
Head Ablation Procedure. To assess each head's contribution, we selectively suppress target heads by scaling their attention logits (pre-softmax scores) with a small factor. Formally, for layer $\ell$ at timestep $t$, the ablated attention logit for head $h$ is:

$$\tilde{a}^{(\ell,h,t)} = \begin{cases} \varepsilon \cdot a^{(\ell,h,t)} & \text{if } \ell = \ell^*, \; t = t^*, \; h \in \mathcal{H}^* \\ a^{(\ell,h,t)} & \text{otherwise} \end{cases} \tag{4}$$

where $a^{(\ell,h,t)} \in \mathbb{R}^{P \times N}$ are the attention logits for head $h$ over $P$ spatial positions and $N$ text tokens, $\ell^*$ is the target layer, $t^*$ is the target timestep, $\mathcal{H}^* \subseteq \{1, \ldots, H\}$ is the set of ablated heads, and $\varepsilon = 10^{-5}$ is a small scaling factor that effectively nullifies the head's influence without disrupting the attention mechanism.
We perform ablation at the first denoising timestep (t = 50) across layers 1-16.
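The ablation rule of Eq. (4) amounts to scaling the selected heads' logits before the softmax, as in the sketch below; shapes and names are illustrative.

```python
# Sketch of the head-ablation rule: pre-softmax logits of the targeted
# heads are multiplied by a tiny factor, then the softmax over text tokens
# is applied as usual.

import numpy as np

EPSILON = 1e-5  # scaling factor that effectively silences a head

def ablate_and_softmax(logits: np.ndarray, ablated_heads: set) -> np.ndarray:
    """logits: [H, P, N] attention logits; returns ablated attention weights."""
    scaled = logits.copy()
    for h in ablated_heads:
        scaled[h] *= EPSILON                         # nullify the head
    scaled -= scaled.max(axis=-1, keepdims=True)     # stabilize the softmax
    weights = np.exp(scaled)
    return weights / weights.sum(axis=-1, keepdims=True)
```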
Outcome Classification. For each ablated generation, we classify the result into three categories based on visual similarity and task-specific metrics (Fig. 9), as sketched below:

• Mitigated: The phenomenon is successfully reduced.
  - Memorization: SSCD [23] < 0.5 and LPIPS [40] > 0.6
  - DvD: LPIPS > 0.5 and DvD Score < 36
• Unchanged: The original unablated behavior persists.
  - Memorization: SSCD ≥ 0.5 or LPIPS ≤ 0.6
  - DvD: LPIPS ≤ 0.5 or DvD Score ≥ 36
• Others: The image is severely degraded or incoherent.
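Since the mitigated and unchanged criteria above are logical complements, we assume the "Others" check for degraded or incoherent images is applied first; it is stubbed here as a flag, and metric computation (SSCD, LPIPS, DvD Score) is assumed to happen elsewhere.

```python
# Sketch of the outcome classification, directly encoding the thresholds
# listed above.

def classify_dvd(lpips: float, dvd_score: float, degraded: bool = False) -> str:
    if degraded:
        return "others"                      # severely degraded / incoherent
    if lpips > 0.5 and dvd_score < 36:
        return "mitigated"
    return "unchanged"                       # LPIPS <= 0.5 or DvD Score >= 36

def classify_memorization(sscd: float, lpips: float,
                          degraded: bool = False) -> str:
    if degraded:
        return "others"
    if sscd < 0.5 and lpips > 0.6:
        return "mitigated"
    return "unchanged"                       # SSCD >= 0.5 or LPIPS <= 0.6
```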
We first ablate individual heads by setting $|\mathcal{H}^*| = 1$ in Eq. (4). For each prompt $p \in P$ (where $P$ contains 300 DominanceBench prompts or 500 memorization prompts), we test all heads $h \in \{1, \ldots, 16\}$ across all layers.

Figure 10. Layer-wise ratio of mitigated cases in single-head ablation. For each layer, the plot shows the proportion of prompts that can be mitigated by ablating at least one head in that layer. Mitigation effects concentrate primarily in layers 1-6 (downsampling blocks) for both phenomena.
Results. Overall, single-head ablation mitigates 145 out of 300 DominanceBench prompts (48%) compared to 392 out of 500 memorization prompts (78%). This substantial difference suggests that memorization is more susceptible to single-head interventions.
As shown in Fig. 10, both phenomena exhibit concentrated effects in layers 1-6 (downsampling blocks), with minimal effects in layers 7-15. Memorization shows a unique spike in layer 16, while DvD remains negligible in higher layers. Within the critical layers 1-6, memorization achieves its highest mitigation rate at layer 6 (66%), while DvD peaks earlier at layer 3 (22%).
However, these single-head ablation results alone cannot distinguish whether the phenomena arise from localized mechanisms (few critical heads working independently) or distributed mechanisms (collaborative behavior across multiple heads). If the mechanisms are localized, ablating multiple heads simultaneously should maintain high mitigation rates; if distributed, other heads may compensate, reducing the mitigation effect. To determine the underlying mechanistic structure, we conduct multi-head ablation analysis in Sec. 4.3.2.
While single-head ablation showed that both phenomena can be mitigated by individual heads, it cannot reveal whether the underlying mechanisms are localized or distributed. To determine whether DvD and memorization arise from localized mechanisms (few critical heads working independently) or distributed mechanisms (collaborative behavior across multiple heads), we conduct multi-head ablation. For each head $h$ that showed mitigation effects in single-head ablation, we simultaneously ablate it with another head $h'$ in the same layer, i.e., $\mathcal{H}^* = \{h, h'\}$.
The ablation procedure follows Eq. (4) with $|\mathcal{H}^*| = 2$. Each ablated image is classified using the same criteria, and we compute the average proportion of each outcome type.
For layer $\ell$, the proportion of outcome type $\tau \in \{\text{mitigated}, \text{unchanged}, \text{others}\}$ is:

$$R^{(\ell)}_\tau = \frac{1}{|P|} \sum_{p \in P} \frac{T^{(\ell)}_\tau(p)}{T^{(\ell)}(p)}$$

where $T^{(\ell)}_\tau(p)$ is the number of head pairs $(h, h')$ in prompt $p$ at layer $\ell$ that resulted in outcome $\tau$, and $T^{(\ell)}(p)$ is the total number of evaluated head pairs for that prompt and layer. By definition, the three outcome proportions sum to one: $\sum_\tau R^{(\ell)}_\tau = 1$.
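The per-layer proportions can be computed as in the sketch below; the nested-list data layout is our assumption.

```python
# Sketch of the per-layer outcome proportions R_tau: for each prompt, the
# fraction of ablated head pairs yielding each outcome, averaged over
# all prompts.

OUTCOMES = ("mitigated", "unchanged", "others")

def outcome_proportions(pairs_by_prompt: list) -> dict:
    """pairs_by_prompt: one list of outcome strings per prompt (one per pair)."""
    props = {tau: 0.0 for tau in OUTCOMES}
    for outcomes in pairs_by_prompt:
        for tau in OUTCOMES:
            props[tau] += outcomes.count(tau) / len(outcomes)
    return {tau: v / len(pairs_by_prompt) for tau, v in props.items()}

# The three proportions sum to one by construction.
print(outcome_proportions([["mitigated", "unchanged"], ["others", "mitigated"]]))
```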
Results. Fig. 11 shows the outcome proportions for memorization (a) and DvD (b) across layers. For memorization, the mitigated proportion remains high (~0.8) even with multi-head ablation, indicating localized behavior in a few critical heads. In stark contrast, DvD shows a lower mitigated proportion (~0.6) and a higher unchanged proportion (~0.2), revealing that it emerges from distributed cooperation among heads: ablating multiple heads simultaneously is less effective because the dominance behavior is not concentrated in specific heads but spread across the network.
This finding reveals a fundamental mechanistic difference: DvD arises from distributed attention patterns across multiple heads, while memorization concentrates in specific heads. This distributed nature explains why DvD is more challenging to mitigate through simple head pruning and suggests that addressing it requires broader architectural interventions.
Connection to neuron-level memorization. Recent work on neuron-level memorization [11] provides supporting evidence for our head-level observations. It demonstrated that memorization of individual training samples can be traced to single neurons or small neuron groups within value projection layers. Our finding that memorization localizes to specific heads is consistent with this: the critical neurons they identified are likely concentrated within specific heads rather than distributed across all heads. Conversely, DvD’s resistance to multi-head ablation (Fig. 11b) suggests that its underlying neurons are scattered across multiple heads, requiring distributed coordination.
This work analyzed the Dominant-vs-Dominated (DvD) phenomenon in text-to-image diffusion models, where one concept overwhelms multi-concept generation while others are suppressed. We proposed DominanceBench, a benchmark of 300 prompts for systematic analysis. Our investigation establishes that visual diversity disparity in training data is the root cause: concepts with limited variation develop rigid visual priors that dominate generation. Cross-attention analysis revealed that DvD emerges through excessive attention concentration on dominating tokens in lower-resolution layers during early denoising steps, with dominated concepts experiencing dramatic attention suppression. Crucially, head ablation studies showed that DvD arises from distributed cooperation among multiple attention heads, contrasting with memorization's localization to specific heads. This work identifies visual diversity disparity as a previously unexplored cause of multi-concept generation failures, revealing concept-level dominance as a distinct failure mode.
Limitations. Our analysis focused on cross-attention mechanisms as the primary lens for understanding DvD. Investigating feedforward networks and residual connections may reveal additional pathways through which visual diversity disparity affects generation. Additionally, exploring inter-head relationships could provide deeper insights and enable more effective mitigation strategies.

As shown in Fig. S1, for each concept keyword we collected the top 1,000 training images from LAION captions containing that keyword and measured the pairwise cosine distances between their visual embeddings. Lower distances indicate more compact, homogeneous visual clusters. Landmarks show the lowest median distance (0.3079), followed by characters (0.4417), artists (0.4660), and everyday objects (0.4672). We summarize each distribution with the median instead of the mean so that occasional near-duplicate samples, which yield cosine distance 0, do not dominate the statistics.
In Sec. 3.3 and Sec. 4.2.1, we use balanced prompts to distinguish DvD behavior from successful multi-concept generation.
We leverage the non-memorized benchmark of Ren et al. [26], which lists 500 everyday prompts curated for memorization studies. We use GPT-5 to identify the number of concepts in each prompt and retain only those with exactly two, yielding 300 prompts. This filtered subset matches the two-concept structure of DominanceBench and covers diverse common concepts.
As shown in Fig. 3, balanced prompts show much lower DvD Scores than DominanceBench. We compute scores using the questions in Table S3 for both concepts, unlike the category-specific questions in Table S2.
In Sec. 4.3, we compare DvD with memorization using 500 memorized prompts identified by Carlini et al. [3]. Among existing studies on memorization, Wen et al. [37] proposed a detection method that computes the L2 norm of text-conditional noise predictions at the first denoising step. A higher L2 norm indicates that the prompt consistently reproduces the same image regardless of the random seed. To further characterize DominanceBench, we apply this metric to compare the three prompt sets.

Table S3. Questions used in calculating DvD Score for balanced prompts.

Here we provide additional results across all 50 test prompts.

Table S4. Questions used to calculate DvD Score. The first set checks whether "dvddog" is present, while the second and third sets verify the paired concept depending on whether it is an object/scene or a style/material.
"dvddog"
• Is a dog visible in this image?
• Is the dog fully or partially visible?
• Is the dog clearly identifiable?
• Is the dog easy to recognize?
• Is the dog the main subject of this image?

Paired concept (object or scene)
• Is [concept] visible in this image, separate from the dog?
• Is [concept] fully or partially visible?
• Is [concept] clearly identifiable?
• Is [concept] unambiguously visible?
• Is [concept] appearing together with the dog?

Paired concept (style or material)
• Is the dog shown in the [concept] style?
• Does the dog appear made of [concept]?
• Is the dog drawn in the [concept] style?
• Is the [concept] style visible on the dog?
• Does the dog's appearance resemble the [concept] style?
Lower-diversity models consistently exceed the DvD Score threshold of 36, while higher-diversity models ($D_8$, $D_{10}$, baseline) generate balanced compositions.
Table S4 lists the questions used to compute DvD Scores in this experiment. VQA models recognize "dvddog" as a dog in the generated images, so we query for dog presence. For the paired concept, we use type-specific questions depending on whether it is an object/scene or a style/material.

(Figure: generation examples and corresponding DvD scores for prompts such as "A dvddog beneath warm twilight" and "A dvddog beside a focused carpenter", comparing diversity variants against the baseline.)
While the Focus Score (Eq. 2) uses entropy normalization, our Temporal Analysis (Sec. 4.2.2) uses only attention deviation without entropy. We explain why below.
The Focus Score compares attention patterns across different prompts with varying characteristics. The numerator measures attention concentration on the peak token relative to others, while the denominator uses entropy to capture the overall dispersion of the attention distribution. Normalizing entropy by $\log_2 N$ accounts for prompt length, enabling fair comparison across prompts with different token counts.
In contrast, Temporal Analysis tracks attention dynamics within a single prompt over time. The attention deviation $\alpha_i^{(\ell,t)} = a_i^{(\ell,t)} - \bar{a}^{(\ell,t)}_{\text{others}}$ already captures relative token importance, and its temporal change $\Delta\alpha_i^{(\ell,t)}$ directly measures how the competitive balance shifts between concepts. We do not use entropy normalization because it causes distortion from irrelevant tokens, as explained below.
Entropy $H$ measures the dispersion of the entire attention distribution, including tokens irrelevant to the dominating and dominated concepts (denoted as $C_D$ and $C_d$, respectively). Table S5 gives a concrete example: across two consecutive timesteps, the attention weights of $C_D$ (0.40) and $C_d$ (0.30) remain constant, while that of an irrelevant token $T_1$ suddenly increases from 0.05 to 0.15.

Table S5. Attention weights at consecutive timesteps. The attention of $C_D$ and $C_d$ remains constant while that of an irrelevant token $T_1$ suddenly increases.

With the raw attention deviation, $\alpha_{C_D} = 0.40 - 0.067 = 0.333$, and the correct interpretation follows: $C_D$'s relative advantage is unchanged.

If entropy normalization were used hypothetically: at timestep $t$, attention is concentrated primarily on $C_D$ (0.40) and $C_d$ (0.30), with other tokens having small weights (e.g., 0.05 and 0.0357 each), yielding a low $H_t$. At timestep $t+1$, attention becomes more distributed as the attention of $T_1$ increases to 0.15, while those of $C_D$ (0.40) and $C_d$ (0.30) remain constant, yielding a higher $H_{t+1}$. An entropy-normalized metric would then give $\Delta\text{Score} < 0$, i.e., "$C_D$'s dominance decreased", despite $a_{C_D}$ remaining constant at 0.40. This distortion occurs because the entropy change stems from the irrelevant token $T_1$ (0.05 → 0.15), not from changes in the dominating ($C_D$) or dominated ($C_d$) concepts that Temporal Analysis aims to isolate.
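A small numeric check of this scenario follows, under our assumption of a 10-token prompt whose leftover attention mass (0.30) is shared by $T_1$ and seven other tokens; the exact token count beyond the values quoted above is a guess. The key point the check illustrates is robust: the deviation is invariant to how the leftover mass is redistributed among irrelevant tokens, while the entropy is not.

```python
# Numeric check of the Table S5 scenario (10-token prompt assumed).

import numpy as np

def deviation(a: np.ndarray, i: int) -> float:
    return float(a[i] - np.delete(a, i).mean())

def entropy(a: np.ndarray) -> float:
    return float(-(a * np.log2(a)).sum())

step_t  = np.array([0.40, 0.30, 0.05] + [0.0357] * 7)   # timestep t
step_t1 = np.array([0.40, 0.30, 0.15] + [0.0214] * 7)   # t+1: T_1 jumps
step_t, step_t1 = step_t / step_t.sum(), step_t1 / step_t1.sum()

print(deviation(step_t, 0), deviation(step_t1, 0))  # both ~0.333: unchanged
print(entropy(step_t), entropy(step_t1))            # entropy shifts via T_1 only
```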
Beyond the token-level noise discussed above, incorporating entropy would create an additional issue for cross-layer tracking. As described in Sec. 4.2.2, we track dominating tokens in layers 8-10 and dominated tokens in layer 7. Since each layer has its own entropy value, any metric involving entropy cannot be directly compared across layers. The attention deviation $\alpha_i^{(\ell,t)} = a_i^{(\ell,t)} - \bar{a}^{(\ell,t)}_{\text{others}}$ isolates relative token importance within each layer, enabling consistent cross-layer comparison.
Summary. By using only attention deviation, our Temporal Analysis (1) maintains direct interpretability ($\Delta\alpha$ = change in relative advantage), (2) eliminates noise from irrelevant tokens, and (3) enables consistent cross-layer tracking of the dominating and dominated concepts that characterize the DvD phenomenon.
In single-head ablation (Sec. 4.3), some heads showed mitigation effects while others did not. To further examine the role of non-mitigating heads, we perform pairwise ablation within layer 1: for each non-mitigating head, we ablate it together with every other head in the same layer.
Table S6 shows the results. For DvD, the mitigation rate is near zero (0.55%), showing these heads do not directly drive the dominance behavior. However, the Others rate (cases where the outputs become corrupted or fail to depict both concepts, as described in Sec. 4.3) reaches 18.68%, nearly twice that of memorization (9.92%). This means that while these heads cannot mitigate DvD on their own, they still help maintain coherent generation: removing multiple such heads breaks the generation process without fixing the dominance. In contrast, memorization shows a small increase in mitigation rate (2.90%) with a lower Others rate (9.92%), consistent with its localized mechanism where non-mitigating heads have limited involvement.
Based on our findings, this section presents a method for detecting DvD in real-time and validates the accuracy of detection. While the main contribution of this paper is identifying the causes and mechanisms of the DvD phenomenon, from a practical perspective, early detection of DvD during generation can serve as a foundation for future mitigation research.
We use the Focus Score (Eq. 2) to detect DvD. When the Focus Score exceeds a specific threshold, we determine that the prompt is likely to trigger DvD. Specifically, we compute the Focus Score in lower-resolution layers at the first denoising step (t = 50), and identify the token receiving maximum attention as the dominant concept token.
Based on Sec. 4.2.1 (Fig. 5) showing that DominanceBench prompts exhibit particularly high Focus Scores in layers 8-10, we focus on these specific lower-resolution layers to optimize detection settings. We test various combinations of Focus Score thresholds {0.010, 0.015, 0.020, 0.025} and layer configurations (single layers, layer combinations, and aggregation methods), evaluating a total of 32 settings. To balance high detection rate on DvD cases with low false positive rate on balanced prompts, we select the configuration that maximizes the gap between the two detection rates.
Tab. S7 shows the detection rates across all configurations. Higher thresholds reduce false positives on Balanced prompts but also miss many true DvD cases, while lower thresholds with aggressive layer aggregation (e.g., max) detect more DvD cases but produce excessive false positives. Among all settings, a threshold of 0.010 with layer combination 9&10 achieves the optimal balance with the maximum discrimination gap of 37.00 percentage points (70.67% on DominanceBench vs. 33.67% on Balanced).
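The selected detector can be sketched as follows; mean aggregation for the "9&10" layer combination is our assumption, and names are illustrative.

```python
# Sketch of the real-time DvD detector: at the first denoising step (t = 50),
# compute the focus score (Eq. 2) in layers 9 and 10 and flag the prompt when
# the layer-averaged score crosses the selected threshold.

import numpy as np

THRESHOLD, LAYERS = 0.010, (9, 10)

def focus_score(attn: np.ndarray, eps: float = 1e-8) -> float:
    a = attn.mean(axis=(0, 1))
    a = a / a.sum()
    others = np.delete(a, a.argmax()).mean()
    dispersion = -(a * np.log2(a + eps)).sum() / np.log2(len(a))
    return float((a.max() - others) / (dispersion + eps))

def detect_dvd(attn_by_layer: dict) -> bool:
    """attn_by_layer: layer index -> [heads, positions, N] weights at t = 50."""
    return np.mean([focus_score(attn_by_layer[l]) for l in LAYERS]) >= THRESHOLD

def dominant_token(attn_by_layer: dict) -> int:
    """Token receiving maximum averaged attention in the detection layers."""
    a = np.mean([attn_by_layer[l].mean(axis=(0, 1)) for l in LAYERS], axis=0)
    return int(a.argmax())
```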
To verify whether our detection method accurately identifies dominant concept tokens, we observed changes in DvD Score after modifying the tokens. Specifically, for prompts flagged as DvD by the optimal configuration (threshold 0.010, layers 9&10), we identify the token receiving maximum attention in those layers and modify it.
Based on the visual diversity disparity hypothesis revealed in Sec. 4.1, we replace detected dominant tokens with generic category nouns:
• Artist: Van Gogh → "artist"
• Landmark: Colosseum → "landmark"
• Character: Spider-Man → "character"

If detection is accurate, this modification should reduce the DvD Score by replacing specific, low-diversity concepts with generic, high-diversity terms that allow more flexible representations (see the sketch below).
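The validation step amounts to a simple substitution; the mapping mirrors the list above, and whole-string replacement is a simplification of the actual tokenization.

```python
# Sketch of the validation step: the detected dominant concept is swapped
# for its generic category noun before re-generation.

GENERIC = {
    "Van Gogh": "artist",
    "Colosseum": "landmark",
    "Spider-Man": "character",
}

def neutralize(prompt: str, dominant: str) -> str:
    """Replace the dominant concept with a generic, high-diversity noun."""
    return prompt.replace(dominant, GENERIC.get(dominant, dominant))

print(neutralize("Van Gogh coaster", "Van Gogh"))  # -> "artist coaster"
```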
Fig. S5 shows the dramatic reduction in DvD Score distribution after modification. In SD 1.4, the median DvD Score decreased from 64 to 20. In SD 2.1, a similar pattern was observed, with the median decreasing from 40 to approximately 17.
The purpose of this section was to validate the accuracy of the detection method, and we clarify that prompt modification itself may not be a practical solution. When a user requests “Van Gogh coaster,” generating “artist coaster” undermines the original intent. Future research should explore architectural modifications or training-level methods that alleviate DvD without modifying prompts. The detection method presented in this section can be used as a diagnostic tool for developing such mitigation methods.