Comparative analysis of the nucleotide composition biases in exons and introns of human genes

The nucleotide composition of human genes is analyzed, with special emphasis on transcription-related strand asymmetries. Such asymmetries may be associated with different mutational rates and may arise from two principal factors. The first is transcription-coupled repair, and the second is the selective pressure related to optimization of translation efficiency. The former factor affects both coding and noncoding regions of a gene, while the latter applies only to the coding regions. Compositional asymmetries calculated at the third position of a codon in coding regions (exons) are compared with those in noncoding regions (introns, UTR, upstream and downstream) of human genes. It is shown that the keto-skew (excess of the frequencies of G and T nucleotides over the frequencies of A and C nucleotides in the same strand) is most pronounced in intronic regions, less pronounced in coding regions, and near zero in untranscribed regions. The keto-skew correlates with the level of gene expression in germ-line cells in both introns and exons. We propose to use the results of our analysis to estimate the contribution of different evolutionary factors to the transcription-related compositional biases.

💡 Research Summary

The paper presents a systematic investigation of nucleotide composition biases in human genes, focusing on strand asymmetries that are linked to transcription. Two principal evolutionary forces are considered: transcription‑coupled DNA repair (TCR), which influences both coding and non‑coding regions, and selective pressure for optimizing translation efficiency, which acts only on coding sequences. To disentangle the contributions of these forces, the authors compute the “keto‑skew”, defined as the excess of guanine (G) and thymine (T) over adenine (A) and cytosine (C) on the same DNA strand, in several genomic compartments: the third codon position (the wobble site) of exons, introns, untranslated regions (5′‑UTR and 3′‑UTR), and flanking intergenic sequences upstream and downstream of transcription units.
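The keto‑skew definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names `keto_skew` and `third_codon_positions` are hypothetical, and normalizing by the total number of A, C, G and T bases is an assumption that follows from reading "excess of frequencies" as a difference of relative frequencies.

```python
from collections import Counter


def keto_skew(seq: str) -> float:
    """Return (G + T - A - C) / (G + T + A + C) for a single strand.

    Assumption: symbols other than A, C, G, T (e.g. N) are ignored.
    """
    counts = Counter(seq.upper())
    keto = counts["G"] + counts["T"]    # keto bases
    amino = counts["A"] + counts["C"]   # amino bases
    total = keto + amino
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return (keto - amino) / total


def third_codon_positions(cds: str) -> str:
    """Return every third base (positions 3, 6, 9, ...) of an in-frame coding sequence."""
    return cds[2::3]


# Toy inputs, invented for illustration only.
intron_like = "GGTTAC"
coding_like = "ATGGCTTAA"

intron_skew = keto_skew(intron_like)
exon_skew = keto_skew(third_codon_positions(coding_like))
```

For exons the skew is taken over third codon positions only, which is why the coding sequence is subsampled before counting; for introns and flanking regions the whole sequence is used.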

The analysis reveals a clear hierarchy of keto‑skew values. Intronic regions display the strongest positive skew (average ≈ +0.12), indicating a pronounced over‑representation of G and T. Exonic third‑position sites show a moderate skew (average ≈ +0.07), while non‑transcribed flanking regions are essentially neutral (≈ 0). This pattern suggests that transcription itself introduces strand‑specific mutational pressures that are only partially corrected by TCR, leading to an accumulation of excess G and T on one strand of actively transcribed regions. The near‑zero skew in untranscribed DNA indicates that the bias is not a genome‑wide compositional artifact but is tied to transcriptional activity.

A second major finding is the positive correlation between keto‑skew and gene expression levels measured in germ‑line cells. Genes with high germ‑line expression exhibit larger skew values in both introns (correlation coefficient r ≈ 0.45) and exons (r ≈ 0.38). This relationship supports the hypothesis that transcription frequency amplifies the TCR‑related asymmetry: the more a gene is transcribed, the greater the exposure of its DNA to transcription‑associated damage and the subsequent strand‑biased repair processes. The slightly stronger correlation in introns may reflect the absence of counter‑balancing selective forces that operate in coding regions.
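A correlation of this kind could be computed as sketched below. The summary does not say which correlation coefficient was used, so the Pearson coefficient here is an assumption, the helper name `pearson_r` is hypothetical, and the per‑gene values are invented toy numbers rather than data from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-gene values: intronic keto-skew and germ-line expression level.
intron_skews = [0.02, 0.05, 0.09, 0.13]
germline_expression = [1.0, 3.0, 2.0, 8.0]

r_intron = pearson_r(intron_skews, germline_expression)
```

The same function would be applied separately to the exonic third‑position skews to obtain the second coefficient.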

The authors propose a quantitative framework for partitioning the observed compositional bias into contributions from the two evolutionary forces. TCR is modeled as a universal driver of positive keto‑skew in any transcribed DNA, whereas translation‑efficiency selection imposes a dampening effect specifically on coding sequences, especially at synonymous third‑position sites where codon usage can be optimized without altering the amino‑acid sequence. Consequently, the net skew observed in exons represents the sum of a TCR‑induced positive component and a translation‑selection‑induced negative component, yielding an intermediate value compared with introns.
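As a rough illustration of the additive picture described above, the selection component can be estimated by subtraction. This rests on two assumptions that the source does not state: that introns carry only the TCR component, and that this component is the same in exons and introns. The symbols S_TCR and S_sel are introduced here for illustration, and the numbers are the approximate averages quoted earlier in this summary.

```latex
\begin{align}
S_{\text{exon}} &= S_{\text{TCR}} + S_{\text{sel}}, \qquad S_{\text{intron}} \approx S_{\text{TCR}} \\
S_{\text{sel}} &\approx S_{\text{exon}} - S_{\text{intron}} \approx 0.07 - 0.12 = -0.05
\end{align}
```

Under these assumptions, the translation‑selection component would be negative and a little under half the size of the TCR component.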

Beyond the immediate descriptive results, the study highlights the utility of keto‑skew as a diagnostic metric for assessing the relative strength of transcription‑related mutational processes versus translational selection across the genome. By applying this metric to other eukaryotic species, researchers could compare the balance of these forces in diverse evolutionary contexts. Moreover, integrating keto‑skew data with tissue‑specific expression profiles, mutation burden analyses, or disease‑association studies could reveal whether regions with unusually high skew are hotspots for pathogenic mutations or markers of genomic instability.

In summary, the paper demonstrates that (1) transcription‑associated strand asymmetries are most evident in introns, (2) coding regions exhibit a moderated bias due to opposing selective pressures for efficient translation, and (3) the magnitude of the bias scales with gene expression in germ‑line cells. These insights provide a refined perspective on how mutational mechanisms and functional constraints jointly shape nucleotide composition in the human genome, and they establish keto‑skew as a valuable tool for future evolutionary and medical genomics investigations.

📜 Original Paper Content