A Binary Representation of the Genetic Code

📝 Abstract

This article introduces a novel binary representation of the canonical genetic code based on both the structural similarities of the nucleotides and the physicochemical properties of the encoded amino acids. Each of the four mRNA bases is assigned a unique 2-bit identifier, so that each of the 64 triplet codons is indexed by a 6-bit label. The ordering of the bits reflects the hierarchical organization manifested by the DNA replication/repair and tRNA translation systems. In this system, transition and transversion mutations are naturally expressed as binary operations, and the severities of the different point mutations can be analyzed. A principal component analysis shows that the physicochemical properties of amino acids related to protein folding also correlate with certain bit positions of their respective labels. Thus, the likelihood that a point mutation is conservative, and therefore less likely to change protein functionality, can be estimated.

📄 Content

Louis R. Nemzer1*

1Department of Chemistry and Physics, Halmos College of Natural Sciences and Oceanography, Nova Southeastern University, Davie, Florida, United States of America

*Corresponding author E-mail: lnemzer@nova.edu (LRN)

“The virtue of binary is that it’s the simplest possible way of representing numbers. Anything else is more complicated.” - George Whitesides

Author Summary
This work introduces a new method of representing the genetic code using a binary system that reflects the relationships between nucleotide structures, as well as the amino acids they code for. Significant correlations are revealed between particular bits and the properties of the encoded amino acids. This paper also explores the way mutations can be classified as Boolean operations, and ranks the severity of the amino acid substitutions they cause. Thus, the binary labels are not arbitrary but have definite physiological meanings. This paper demonstrates a fruitful analogy between information represented as binary bits and the canonical genetic code. This connects information theory with molecular biology, since the inherent redundancy of the genetic code, an important source of error correction in the protein translation mechanism, relates to the risk of genetic disorders.
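To make the idea of mutations as Boolean operations concrete, here is a minimal Python sketch. The specific 2-bit assignment is an assumption chosen for illustration and is not taken from the paper: the high bit marks pyrimidine (0) versus purine (1), and the low bit marks strong (C, G) versus weak (U, A) pairing. The helper names `codon_label`, `label_codon`, `mutate`, and `classify` are likewise hypothetical, and the bits are simply concatenated in reading order, which may differ from the hierarchical ordering the paper uses.

```python
# Hypothetical 2-bit assignment (an illustrative assumption, not the paper's):
#   high bit: 0 = pyrimidine (C, U), 1 = purine (G, A)
#   low bit:  0 = strong (C, G),     1 = weak (U, A)
BASE_BITS = {"C": 0b00, "U": 0b01, "G": 0b10, "A": 0b11}
BITS_BASE = {bits: base for base, bits in BASE_BITS.items()}


def codon_label(codon: str) -> int:
    """Concatenate the 2-bit base codes into a 6-bit integer label."""
    label = 0
    for base in codon:
        label = (label << 2) | BASE_BITS[base]
    return label


def label_codon(label: int) -> str:
    """Invert codon_label: decode a 6-bit label back into a codon string."""
    return "".join(BITS_BASE[(label >> shift) & 0b11] for shift in (4, 2, 0))


def mutate(codon: str, position: int, mask: int) -> str:
    """Apply a point mutation by XOR-ing a 2-bit mask at position 0, 1, or 2."""
    shift = 2 * (2 - position)
    return label_codon(codon_label(codon) ^ (mask << shift))


def classify(a: str, b: str) -> str:
    """Classify a base substitution from the XOR of the two base codes."""
    diff = BASE_BITS[a] ^ BASE_BITS[b]
    if diff == 0:
        return "identical"
    # Under this assignment, only the low bit changes in a transition.
    return "transition" if diff == 0b01 else "transversion"


if __name__ == "__main__":
    print(format(codon_label("AUG"), "06b"))
    print(mutate("AUG", 2, 0b01), classify("G", "A"))
```

Under this assumed encoding, a transition (purine to purine, or pyrimidine to pyrimidine) flips only the low bit, while any change to the high bit is a transversion, so the mutation class can be read directly from the XOR mask.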

Introduction
Because of its central role in biological information processing, the canonical genetic code, which maps codons onto their corresponding amino acids, has been closely scrutinized for underlying symmetries. In this article, a novel binary representation of the code is introduced that accounts for both the chemical structures of the nucleotides themselves and the physicochemical properties of the amino acids encoded. An accurate evaluation of the adaptive advantage of the code, which is robust to many point mutations and mRNA/tRNA mispairings, must consider not only the relatedness of amino acids separated by a single-letter mutation, but also the probability of such a change or mispairing occurring in the first place, based on the similarities of the nucleotides.
The primary contribution of this work to the existing literature is quantitative support for its novel binary classification system. That is, the choice of binary labels and the order of the bits have meaningful relationships with both the chemical structures of the nucleotides and the amino acids corresponding to the codons in which they appear, as demonstrated with physicochemical data. This stands in contrast with many previous studies, which fixated on the degree of degeneracy in the third letter as the primary or sole metric and, more crucially, treated the nucleotides as interchangeable labels for group theory analysis. This had the effect of strongly deemphasizing, or obliterating entirely, the physical reality of these biomolecules and their physicochemical similarities.
One of the first dichotomous divisions of the genetic code was due to theoretical physicist Yuri Rumer [1], [2], who noticed complete third-letter degeneracy in exactly half of the codon quartets (those of the form NCN or SKN; refer to Figures 1 and 2 for nucleotide abbreviations). More recent generalizations [3], [4], [5] explored additional ways to bisect the genetic table. However, these works remained focused on classification according to third-letter degeneracy, along with hidden symmetries revealed by transformation rules involving the nucleotides. These rules show patterns under the interchange of nucleotides [6], while the current research takes account of the actual chemical properties of the nucleotides, as well as of the amino acids they encode.
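Rumer's observation can be checked directly against the standard codon table. The sketch below is illustrative rather than part of the original analysis: it assumes the abbreviations follow the IUPAC convention (N = any base, S = C or G, K = G or U), and the function names are hypothetical. It builds the canonical code from the usual compact amino acid string, with `*` marking stop codons, and compares the quartets with complete third-letter degeneracy against the NCN and SKN patterns.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order (third base varies fastest); '*' = stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# IUPAC ambiguity codes assumed for the patterns NCN and SKN.
STRONG = set("CG")  # S
KETO = set("GU")    # K


def fully_degenerate_quartets() -> set:
    """First-two-letter prefixes whose four codons share one amino acid."""
    return {
        b1 + b2
        for b1, b2 in product(BASES, repeat=2)
        if len({CODE[b1 + b2 + b3] for b3 in BASES}) == 1
    }


def rumer_pattern() -> set:
    """Prefixes matching NCN (second base C) or SKN (S first, K second)."""
    return {
        b1 + b2
        for b1, b2 in product(BASES, repeat=2)
        if b2 == "C" or (b1 in STRONG and b2 in KETO)
    }


if __name__ == "__main__":
    print(sorted(fully_degenerate_quartets()))
    print(fully_degenerate_quartets() == rumer_pattern())
```

Both sets contain the same eight of the sixteen quartets, which is the "exactly half" split described above.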
Modern computing, [7] which is built on the foundation of a binary system, pro

This content is AI-processed based on ArXiv data.