A Binary Representation of the Genetic Code
Louis R. Nemzer1*
1Department of Chemistry and Physics, Halmos College of Natural Sciences and Oceanography, Nova Southeastern University, Davie, Florida, United States of America
*Corresponding author E-mail: lnemzer@nova.edu (LRN)
“The virtue of binary is that it’s the simplest possible way of representing numbers. Anything else is more complicated.” - George Whitesides
Abstract
This article introduces a novel binary representation of the canonical genetic code based on both
the structural similarities of the nucleotides and the physicochemical properties of the
encoded amino acids. Each of the four mRNA bases is assigned a unique 2-bit identifier, so that
the 64 triplet codons are each indexed by a 6-bit label. The ordering of the bits reflects the
hierarchical organization manifested by the DNA replication/repair and tRNA translation systems.
In this system, transition and transversion mutations are naturally expressed as binary
operations, and the severities of the different point mutations can be analyzed. Using a principal
component analysis, it is shown that the physicochemical properties of amino acids related to
protein folding also correlate with certain bit positions of their respective labels. Thus, the
likelihood that a point mutation is conservative, and therefore less likely to change protein
functionality, can be estimated.
Author Summary
This work introduces a new method of representing the genetic code using a binary system that reflects the relationships between nucleotide structures, as well as the amino acids they code for. Significant correlations are revealed between particular bits and the properties of the encoded amino acids. This paper also explores the way mutations can be classified as Boolean operations, and ranks the severity of the amino acid substitutions they cause. Thus, the binary labels are not arbitrary, but rather, have definite physiological meanings. This paper demonstrates a fruitful analogy between information represented as binary bits and the canonical genetic code. This connects the fields of information theory with molecular biology, since the inherent redundancy of the genetic code, an important source of error correction in the protein translation mechanism, relates to the risk of genetic disorders.
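The idea of 2-bit base labels, 6-bit codon labels, and mutations as Boolean operations can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's actual table: the particular assignment (C=00, U=01, A=10, G=11), the names `BASE_BITS`, `codon_label`, `label_codon`, and `mutate`, and the plain left-to-right concatenation of bits are all hypothetical choices made here. The paper's own bit ordering is hierarchical and may differ.

```python
# Illustrative 2-bit labels (an assumption, not necessarily the paper's table).
# High bit: 0 = pyrimidine, 1 = purine. Low bit: 0 = amino, 1 = keto.
BASE_BITS = {"C": 0b00, "U": 0b01, "A": 0b10, "G": 0b11}
BITS_BASE = {bits: base for base, bits in BASE_BITS.items()}

# XOR masks acting on the two bits of a single base.
TRANSITION = 0b01        # stays within ring class: C<->U, A<->G
TRANSVERSION_MK = 0b10   # changes ring class, keeps amino/keto: C<->A, U<->G
TRANSVERSION_SW = 0b11   # flips both bits: C<->G, U<->A


def codon_label(codon: str) -> int:
    """Concatenate the three 2-bit base labels into one 6-bit integer."""
    label = 0
    for base in codon:
        label = (label << 2) | BASE_BITS[base]
    return label


def label_codon(label: int) -> str:
    """Invert codon_label: decode a 6-bit integer back into a codon."""
    return "".join(BITS_BASE[(label >> shift) & 0b11] for shift in (4, 2, 0))


def mutate(codon: str, position: int, mask: int) -> str:
    """Apply a point mutation as an XOR on one codon position (0, 1, or 2)."""
    shift = 2 * (2 - position)
    return label_codon(codon_label(codon) ^ (mask << shift))


print(format(codon_label("AUG"), "06b"))   # 100111
print(mutate("AUG", 2, TRANSITION))        # AUA (G -> A, a transition)
print(mutate("AUG", 0, TRANSVERSION_SW))   # UUG (A -> U, a transversion)
```

With this particular assignment, every transition is a flip of the low bit alone, and every transversion flips the high bit, so the class of a point mutation can be read directly from the XOR of the two labels.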
Introduction
Because of its central role in biological information processing, the canonical genetic code - which
maps DNA codons onto corresponding amino acids - has been closely scrutinized for underlying
symmetries. In this article, a novel binary representation of the code is introduced that accounts
for both the chemical structures of the nucleotides themselves and the physicochemical
properties of the amino acids encoded. An accurate evaluation of the adaptive advantage of the
code, which is robust to many point mutations and mRNA/tRNA mispairings, must consider not
only the relatedness of amino acids separated by a single letter mutation, but also the probability
of such a change or mispairing occurring in the first place based on the similarities of the
nucleotides.
The primary contribution of this research to the existing literature is quantitative support
for its novel binary classification system. That is, the choice of binary labels and the order
of the bits have meaningful relationships with both the chemical structures of the nucleotides
themselves and the amino acids corresponding to the codons in which they appear, as
demonstrated with physicochemical data. This stands in contrast with many previous studies,
which fixated on using the degree of degeneracy in the third letter as the primary or sole
metric and, more crucially, treated the nucleotides as interchangeable labels for group theory
analysis. This had the effect of strongly deemphasizing, or obliterating entirely, the physical
reality of these biomolecules and their physicochemical similarities.
One of the first dichotomous divisions of the genetic code was due to theoretical physicist Yuri
Rumer [1], [2], who noticed complete third-letter degeneracy in exactly half of the codon
quartets (those of the form NCN or SKN; refer to Figures 1 and 2 for nucleotide abbreviations).
More recent generalizations [3], [4], [5] explored additional ways to bisect the genetic table.
However, these works remained focused on classification according to the metric of third-letter
degeneracy, along with hidden symmetries revealed by transformation rules involving the
nucleotides. These rules show patterns under the interchange of nucleotides [6], while the
current research takes account of the actual chemical properties of the nucleotides, as well as
the amino acids they encode.
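Rumer's observation can be checked directly against the standard codon table. The sketch below enumerates the 16 codon quartets (codons sharing their first two letters), finds those whose four codons all encode the same amino acid, and compares them with the quartets matching NCN or SKN. It uses the standard IUPAC ambiguity codes (N = any base, S = C or G, K = G or U); the helper names `quartets` and `fully_degenerate` are introduced here for illustration only.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code with codons enumerated in UCAG order ('*' = stop).
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# IUPAC ambiguity codes needed for the two patterns in the text.
IUPAC = {"N": "UCAG", "S": "CG", "K": "GU", "C": "C"}


def quartets(pattern: str) -> set:
    """Expand the first two letters of a pattern such as 'NCN' into prefixes."""
    return {a + b for a in IUPAC[pattern[0]] for b in IUPAC[pattern[1]]}


def fully_degenerate(prefix: str) -> bool:
    """True if all four codons sharing this prefix encode the same amino acid."""
    return len({CODE[prefix + third] for third in BASES}) == 1


rumer = quartets("NCN") | quartets("SKN")
degenerate = {a + b for a, b in product(BASES, repeat=2) if fully_degenerate(a + b)}

print(sorted(rumer))
print(rumer == degenerate, len(degenerate), "of 16 quartets")
```

The two sets coincide and contain 8 of the 16 quartets, which is the "exactly half" noted in the text.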
Modern computing [7], which is built on the foundation of a binary system, pro