Introduction: While the origin and evolution of proteins remain mysterious, advances in evolutionary genomics and systems biology are facilitating the historical exploration of the structure, function and organization of proteins and proteomes. Molecular chronologies are series of time events describing the history of biological systems and subsystems and the rise of biological innovations. Together with time-varying networks, these chronologies provide a window into the past.
Areas covered: Here, we review molecular chronologies and networks built with modern methods of phylogeny reconstruction. We discuss how chronologies of structural domain families uncover the explosive emergence of metabolism, the late rise of translation, the co-evolution of ribosomal proteins and rRNA, and the late development of the ribosomal exit tunnel, events that coincided with a tendency to shorten folding time. Evolving networks describe the early emergence of domains and a late big bang of domain combinations.
Expert opinion: Two processes, folding and recruitment, appear central to the evolutionary progression. The former increases protein persistence. The latter fosters diversity. Chronologically, protein evolution mirrors folding by combining supersecondary structures into domains, developing translation machinery to facilitate folding speed and stability, and enhancing structural complexity by establishing long-distance interactions in novel structural and architectural designs.
A proteome represents the entire collection of proteins that is encoded in the genome of an organism. This potential set of gene products can be either inferred from genome sequences or determined experimentally [1]. For example, a genomic catalog of thousands of deep RNA sequencing experiments of the human genome revealed a total of ~320,000 transcripts encoded by ~42,000 genes, ~20,000 of which are translated into proteins [2]. However, not all genes of a genome may be expressed by a cell, tissue or organism at a given time and under a given environmental condition. A mass spectrometric analysis of protein levels encoded by over 12,000 human genes across 201 samples of 32 normal human tissues revealed that ~85% of proteins were present in all tissues [3]. The relative abundance of these common proteins, together with the presence or absence of a minority protein set, helped explain biological processes that require the interplay of multiple tissues. Tissue-specific distribution of enzymes revealed a coordinated control of metabolic reactions, while tissue-enriched proteins provided insights into phenotypes of genetic diseases. Mass spectrometric measurements allow visualization of proteomes with unprecedented quantitative detail, which can be recalibrated to estimate the number of proteins per unit cell volume [4]. About 2-4 million proteins per cubic micron populate the cells of varied organisms. A typical bacterium (e.g. Escherichia coli) contains about 3 million proteins, while a typical mammalian cell contains over a billion, four orders of magnitude more. This reality highlights how crowded the molecular environment of the cell is.
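The per-cell protein counts quoted above follow from multiplying a protein density by a cell volume. The short Python sketch below illustrates that arithmetic. The density range of 2-4 million proteins per cubic micron is taken from the text; the cell volumes (about 1 cubic micron for an E. coli-like bacterium and about 3,000 cubic microns for a mammalian cell) are illustrative assumptions that are not given in the text, and the function name `proteins_per_cell` is hypothetical.

```python
# Back-of-envelope estimate: proteins per cell = protein density x cell volume.
# The density range (2-4 million proteins per cubic micron) comes from the text.
# The cell volumes below are ASSUMED, illustrative values, not from the text.

DENSITY_RANGE_PER_UM3 = (2e6, 4e6)  # proteins per cubic micron (low, high)

ASSUMED_CELL_VOLUMES_UM3 = {
    "bacterium (E. coli-like)": 1.0,   # assumption: ~1 cubic micron
    "mammalian cell": 3000.0,          # assumption: ~3,000 cubic microns
}


def proteins_per_cell(volume_um3, density_range=DENSITY_RANGE_PER_UM3):
    """Return (low, high) estimates of the protein count for a cell volume."""
    low, high = density_range
    return low * volume_um3, high * volume_um3


if __name__ == "__main__":
    for name, volume in ASSUMED_CELL_VOLUMES_UM3.items():
        low, high = proteins_per_cell(volume)
        print(f"{name}: {low:.1e} to {high:.1e} proteins")
```

Under these assumed volumes, the bacterial estimate brackets the ~3 million proteins cited for E. coli, and the mammalian estimate lands in the billions. The result scales linearly with the assumed volume, so different cell sizes shift the estimate proportionally.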
Most proteins are both structured and intrinsically flexible. Their polypeptide chains fold into compact atomic three-dimensional (3D) arrangements that are organized around structural, functional and evolutionary modules. These modules are recurrently present in various molecular contexts. The human genome, for example, embodies ~1,500 distinct combinations of folded structures [5]. Conversely, a substantial number of proteins lack typical structure. They represent intrinsically disordered proteins, molecules that lack significant constraints on internal degrees of freedom of the polypeptide chain. Intrinsic disorder is also present in structured proteins in the form of intrinsically disordered regions [6]. These regions exhibit highly dynamic conformations that resemble random coils, molten globules or flexible linkers. Disordered regions are often evolutionarily conserved and needed for molecular recognition, regulation and signaling (e.g. [7]). Extensive surveys of protein modules and intrinsic disorder with advanced computational methods have, for example, generated protein taxonomies for the classification of the protein world. Similarly, the distribution of protein structure and intrinsic disorder in organisms provides the necessary tools to understand the origin and evolution of proteomes.
Here we explore how developments in evolutionary genomics and systems biology are helping us understand the origin and evolution of proteins and proteomes. We start by addressing protein structural complexity, building on what has already been reviewed [8,9]. Proteins are highly organized entities that fold into structures that exhibit many levels (layers) of structural organization. These structures are highly diverse and have been the subject of exhaustive classification. We focus on chronologies that describe how protein modules are becoming structured in the protein world and how they originate and spread in evolving and increasingly complex proteomes. Our aim is to prompt critical thinking that could help untangle the processes of molecular emergence and diversification that are operating on our planet.
Proteins are biological macromolecules with properties of nanoparticles. Typical 5-500 kDa proteins have diameters that range from 2 to 10 nm [10]. They are made up of one or more relatively long polypeptides that fold into globular, fibrous or membrane forms. Their name [from the Greek πρώτειος (prōteios) and πρῶτος (prōtos), meaning “primary” and “first”] suggests a primordial origin, which, as we will discuss below, does justice to their very early emergence. A polypeptide is a single linear heteropolymer chain of amino acids covalently bonded together by peptide bonds that link the carboxyl and amino groups of adjacent amino acid residues. Proteins synthesized by the ribosomal machinery of the cell typically contain polypeptides of more than 20 amino acid residues in length. Shorter molecules are generally synthesized by the non-ribosomal peptide synthetase (NRPS) machinery and are called ‘peptides’. The typical polypeptide lengths of proteins range from tens of amino acids to thousands [11][12][13], with mean values of 329, 365 and 532 amino acids for proteomes of the Archaea, Bacteria and Eukarya superkingdoms, respectively [14]. Small proteins have been overlooked in genome