arXiv:0707.2076v2 [q-bio.MN] 17 Aug 2007
Node Similarity Within Subgraphs of Protein Interaction Networks
Orion Penner,1 Vishal Sood,1, 2 Gabriel Musso,3 Kim Baskerville,4 Peter Grassberger,1, 2 and Maya Paczuski1
1Complexity Science Group, University of Calgary, Calgary, Alberta T2N 1N4, Canada
2Institute for Biocomplexity and Informatics, University of Calgary, Calgary, Alberta T2N 1N4, Canada
3Department of Medical Genetics and Microbiology, University of Toronto, Toronto, Ontario M5S 3E1, Canada
4Perimeter Institute for Theoretical Physics, Waterloo, Ontario N2L 2Y5, Canada
(Dated: November 13, 2018)
We propose a biologically motivated quantity, twinness, to evaluate local similarity between nodes
in a network. The twinness of a pair of nodes is the number of connected, labeled subgraphs of size
n in which the two nodes possess identical neighbours. The graph animal algorithm is used to
estimate twinness for each pair of nodes (for subgraph sizes n = 4 to n = 12) in four different
protein interaction networks (PINs). These include an Escherichia coli PIN and three Saccharomyces
cerevisiae PINs – each obtained using state-of-the-art high throughput methods. In almost all
cases, the average twinness of node pairs is vastly higher than expected from a null model obtained
by switching links. For all n, we observe a difference in the ratio of type A twins (which are
unlinked pairs) to type B twins (which are linked pairs) distinguishing the prokaryote E. coli from
the eukaryote S. cerevisiae. Interaction similarity is expected due to gene duplication, and whole
genome duplication paralogues in S. cerevisiae have been reported to co-cluster into the same
complexes. Indeed, we find that these paralogous proteins are over-represented as twins compared to
pairs chosen at random. These results indicate that twinness can detect ancestral relationships
from currently available PIN data.
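To make the definition of twinness concrete, the following is a minimal brute-force sketch, not the graph animal algorithm used in the paper (which estimates twinness rather than enumerating exhaustively). It assumes that "subgraph" means the induced subgraph on a connected set of n nodes, and that two nodes are twins in that subgraph when their neighbour sets inside it agree once the pair itself is excluded. The function names `connected_node_sets`, `twinness`, and `twin_type` are hypothetical, and the network is assumed to be given as a dictionary mapping each node to the set of its neighbours.

```python
from itertools import combinations


def connected_node_sets(adj, n):
    """Return all node sets of size n that induce a connected subgraph of adj.

    Grows sets one neighbouring node at a time; every connected set of size
    k + 1 can be reached this way from some connected set of size k.
    """
    frontier = {frozenset([v]) for v in adj}
    for _ in range(n - 1):
        frontier = {
            nodes | {w}
            for nodes in frontier
            for u in nodes
            for w in adj[u]
            if w not in nodes
        }
    return frontier


def twinness(adj, n):
    """Count, for each node pair, the size-n subgraphs in which they are twins.

    Assumption: neighbours are compared inside the induced subgraph, and the
    two nodes of the pair are excluded from each other's neighbour sets.
    """
    counts = {}
    for nodes in connected_node_sets(adj, n):
        for a, b in combinations(sorted(nodes), 2):
            if (adj[a] & nodes) - {a, b} == (adj[b] & nodes) - {a, b}:
                counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts


def twin_type(adj, a, b):
    """Type A twins are unlinked pairs; type B twins are linked pairs."""
    return "B" if b in adj[a] else "A"
```

Exhaustive enumeration of this kind grows very quickly with n and is only practical for small graphs or small subgraph sizes, which is why a sampling estimate is needed for n up to 12 on full PINs.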
PACS numbers: 87.14.Ee, 02.70.Uu, 87.10.+e, 89.75.Fb, 89.75.Hc
I. INTRODUCTION
Proteins constitute the machinery that carries out cellular processes by forming stable or transitory complexes with
each other – organized perhaps into a web of overlapping modules. Information about this complex system can be
condensed into a protein interaction network (PIN), which is a graph where nodes are proteins and links are measured
or inferred pairwise binding interactions in a cell. Major efforts over the years devoted to resolving protein interactions
have employed both small-scale and large-scale techniques. High-throughput methods, such as yeast two-hybrid (Y2H)
and tandem affinity purification (TAP), have recently generated vast amounts of protein interaction data [1, 2, 3],
allowing PINs from different organisms, experiments, research teams, etc. to be compared.
A basic statistical feature of any network is its degree distribution, P(k), for the number of links, k, connected to
a node. In this respect, a variety of networks have been shown [4] to deviate decisively from a random graph, where
the degree distribution is Poisson. In fact, early work suggested that degree distributions for PINs were power-law or
scale-free [5]. However, as demonstrated in Fig. 1, degree distributions for recently obtained PINs are neither power-
law nor particularly stable across different state-of-the-art constructions for the same organism – here the budding
yeast S. cerevisiae. Note that all of the data sets studied here are based on the TAP-MS technique, except for Batada
et al., which is a compilation of data obtained from a number of different techniques.
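As an illustration of the comparison described above, the sketch below computes an empirical degree distribution P(k) from an edge list and the Poisson probabilities expected for a random graph with the same mean degree. The function names are hypothetical, and the network is assumed to be a simple undirected graph given as a list of node pairs; nodes with no links do not appear in such a list and are therefore not counted.

```python
import math
from collections import Counter


def degree_distribution(edges):
    """Return {k: P(k)}, the fraction of nodes that have exactly k links."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n_nodes = len(degree)
    histogram = Counter(degree.values())
    return {k: count / n_nodes for k, count in sorted(histogram.items())}


def mean_degree(p_k):
    """Mean number of links per node for a distribution {k: P(k)}."""
    return sum(k * p for k, p in p_k.items())


def poisson_pmf(k, mean):
    """Poisson probability of degree k, the random-graph expectation."""
    return math.exp(-mean) * mean ** k / math.factorial(k)
```

Comparing `degree_distribution(edges)[k]` with `poisson_pmf(k, mean_degree(...))` over the observed values of k gives a quick view of how far a network departs from the random-graph expectation.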
Despite the empirical inconsistency presented by PIN degree distributions, similar local structures can stand out
when each network is compared to a null model where links are switched while retaining the original degree sequence [6,
7]. Subgraphs that are significantly over-abundant are referred to as motifs, while subgraphs that are significantly
under-represented are referred to as anti-motifs [6]. It has been reported that proteins within motifs are more conserved
than other proteins [8]. In PINs, dense subgraphs containing 3 or 4 nodes are motifs, while tree-like subgraphs are
anti-motifs [9].
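The link-switching null model can be sketched as follows. This is a minimal illustration of degree-preserving switching under stated assumptions, not the exact procedure of Refs. [6, 7]: two links (a, b) and (c, d) are picked at random and rewired to (a, d) and (c, b), and the move is rejected if it would create a self-loop or a duplicate link. The names `switch_links` and `degrees`, the `seed` parameter, and the attempt cap are all hypothetical choices made for this example.

```python
import random


def degrees(edges):
    """Return {node: degree} for an undirected edge list."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


def switch_links(edges, n_swaps, seed=0):
    """Randomize a simple undirected graph while keeping every node's degree.

    Assumes the input has no self-loops or duplicate links and that node
    labels are mutually comparable (e.g. all integers or all strings).
    """
    rng = random.Random(seed)
    edge_list = [tuple(sorted(e)) for e in edges]
    edge_set = set(edge_list)
    done, attempts = 0, 0
    while done < n_swaps and attempts < 100 * n_swaps:  # cap is arbitrary
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:  # pick one of the two possible rewirings
            c, d = d, c
        if len({a, b, c, d}) < 4:
            continue  # would create a self-loop
        new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        if new1 in edge_set or new2 in edge_set:
            continue  # would create a duplicate link
        edge_set -= {edge_list[i], edge_list[j]}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = new1, new2
        done += 1
    return edge_list
```

Counting a given subgraph in the original network and in many such randomized copies then indicates whether that subgraph is over-abundant (a motif) or under-represented (an anti-motif) relative to the null model.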
Complementary to the search for motifs, graph clustering algorithms have been applied to identify components
or complexes in PINs. By construction, these components tend to contain a high density of links but are weakly
connected to the rest of the network (see e.g. Refs. [10, 11]). Complexes identified in this manner can contain up
to 100 or more proteins, with ongoing debate [11] as to their biological significance. However, biological processes
such as signal transduction, cell-fate regulation, transcription, and translation typically involve a few tens of proteins.
In previous work [10], mesoscale clusters of 5-25 proteins have also been identified using graph clustering algorithms.
These clusters were matched with groups of proteins known to form complex macromolecular structures, or modules
of proteins that participate i
…(Full text truncated)…