# Graphlets

Graphlets in mathematics are induced subgraph isomorphism classes in a graph,[1][2] i.e. two graphlet occurrences are isomorphic, whereas two graphlets are non-isomorphic. Graphlets differ from network motifs in a statistical sense, network motifs are defined as over- or under-represented graphlets with respect to some random graph null model.

## Graphlet-based network properties

### Relative graphlet frequency distance

RGF-distance compares the frequencies of the appearance of all 3-5-node graphlets in two networks.[1] Let Ni(G) be the number of graphlets of type ${\displaystyle i}$ (${\displaystyle i\in \{1,\ldots ,29\}}$) in network G, and let ${\displaystyle T(G)=\sum _{i=1}^{29}N_{i}(G)}$ be the total number of graphlets of G. The "similarity" between two graphs should be independent of the total number of nodes or edges, and should depend only upon the differences between relative frequencies of graphlets. Thus, relative graphlet frequency distance D(G,H) between two graphs G and H is defined as:
${\displaystyle D(G,H)=\sum _{i=1}^{29}|F_{i}(G)-F_{i}(H)|}$,
where ${\displaystyle F_{i}(G)=-\log(N_{i}(G)/T(G))}$. The logarithm of the graphlet frequency is used because frequencies of different graphlets can differ by several orders of magnitude and the distance measure should not be entirely dominated by the most frequent graphlets.

### Graphlet degree distribution agreement

GDD-agreement generalizes the notion of the degree distribution to the spectrum of graphlet degree distributions (GDDs) in the following way.[2] The degree distribution measures the number of nodes of degree k in graph G, i.e., the number of nodes "touching" k edges, for each value of k. Note that an edge is the only graphlet with two nodes. GDDs generalize the degree distribution to other graphlets: they measure for each 2-5-node graphlet Gi, ${\displaystyle i=0,1,...,29}$, such as a triangle or a square, the number of nodes "touching" k graphlets Gi at a particular node. A node at which a graphlet is "touched" is topologically relevant, since it allows us to distinguish between nodes "touching", for example, a three node path at an end node or at the middle node. This is summarized by automorphism orbits (or just orbits, for brevity): by taking into account the "symmetries" between nodes of a graphlet, there are 73 different orbits across all 2-5-node graphlets (see [Pržulj, 2007][2] for details).

For each orbit j, one needs to measure the jth GDD, dGj(k), i.e., the distribution of the number of nodes in G "touching" the corresponding graphlet at orbit j k times. Clearly, the degree distribution is the 0th GDD. dGj(k) is scaled as ${\displaystyle S_{G}^{j}(k)={\frac {d_{G}^{j}(k)}{k}}}$ to decrease the contribution of larger degrees in a GDD and then normalized with respect to its total area ${\displaystyle T_{G}^{j}=\sum _{k=1}^{\infty }S_{G}^{j}(k)}$ giving the "normalized distribution" ${\displaystyle N_{G}^{j}(k)={\frac {S_{G}^{j}(k)}{T_{G}^{j}}}}$.

The jth GDD-agreement compares the jth GDDs of two networks. For two networks G and H and a particular orbit j, the "distance" Dj(G,H) between their normalized jth GDDs is:
${\displaystyle D^{j}(G,H)={\frac {1}{\sqrt {2}}}\left(\sum _{k=1}^{\infty }[N_{G}^{j}(k)-N_{H}^{j}(k)]^{2}\right)^{\frac {1}{2}}}$.

The distance is between 0 and 1, where 0 means that G and H have identical jth GDDs, and 1 means that their jth GDDs are far away. Next, Dj(G,H) is reversed to obtain the jth GDD-agreement:
${\displaystyle A^{j}(G,H)=1-D^{j}(G,H)}$, for ${\displaystyle j\in \{0,1,\ldots ,72\}}$.

The total GDD-agreement between two networks G and H is the arithmetic or the geometric average of the jth GDD-agreements over all j, i.e.,
${\displaystyle A_{arith}(G,H)={\frac {1}{73}}\sum _{j=0}^{72}A^{j}(G,H)}$,
and
${\displaystyle A_{geo}(G,H)=\left(\prod _{j=0}^{72}A^{j}(G,H)\right)^{\frac {1}{73}}}$,
respectively. GDD-agreement is scaled to always be between 0 and 1, where 1 means that two networks are identical with respect to this property. (See [Pržulj, 2007][2] for details.)

### Graphlet degree vectors (signatures) and signature similarities

This method generalizes the degree of a node, which counts the number of edges that the node touches, into the vector of graphlet degrees, or graphlet degree signature, counting the number of graphlets that the node touches at a particular orbit, for all graphlets on 2 to 5 nodes.[3] The resulting vector of 73 coordinates is the signature of a node that describes the topology of node's neighborhood and captures its interconnectivities out to a distance of 4 (see [Milenković and Pržulj, 2008][3] for details). The graphlet degree signature of a node provides a highly constraining measure of local topology in its vicinity and comparing the signatures of two nodes provides a highly constraining measure of local topological similarity between them.

The signature similarity[3] is computed as follows. For a node u in graph G, ui denotes the ith coordinate of its signature vector, i.e., ui is the number of times node u is touched by an orbit i in G. The distance Di(u,v) between the ith orbits of nodes u and v is defined as:
${\displaystyle D_{i}(u,v)=w_{i}\times {\frac {|\log(u_{i}+1)-\log(v_{i}+1)|}{\log(\max\{u_{i},v_{i}\}+2)}}}$,
where wi is the weight of orbit i that accounts for dependencies between orbits (see [Milenković and Pržulj, 2008][3] for details). The total distance D(u,v) between nodes u and v is defined as:
${\displaystyle D(u,v)={\frac {\sum _{i=0}^{72}D_{i}}{\sum _{i=0}^{72}w_{i}}}}$.
The distance D(u,v) is in [0, 1), where distance 0 means that signatures of nodes u and v are identical. Finally, the signature similarity, S(u,v), between nodes u and v is:
${\displaystyle S(u,v)=1-D(u,v)}$.
Clearly, a higher signature similarity between two nodes corresponds to a higher topological similarity between their extended neighborhoods (out to distance 4).

## Application of graphlet-based network properties

RGF-distance and GDD-agreement were used to evaluate the fit of various network models to real-world networks and to discover a new, well-fitting, geometric random graph model for protein-protein interaction networks,[1][2] as well as other types of biological networks, such as protein structure networks, also called residue interaction graphs.[4] These graphlet-based network properties are implemented in GraphCrunch, a software tool for large network analyses and modeling.[5] Alternatively, a parallel implementation is provided in PGD, a software library for computing graphlet-based network properties in large and massive networks.[6][7]

Graphlet degree vectors (signatures) and signature similarities were applied to biological networks to identify groups (or clusters) of topologically similar nodes in a network and predict biological properties of yet uncharacterized nodes based on known biological properties of characterized nodes. Specifically, they were applied to protein function prediction,[3] cancer gene identification,[8] and discovery of pathways underlying certain biological processes, such as melanogenesis[8] or protein degradation.[9] Additionally, GRAph ALigner (GRAAL), a global network alignment method, used graphlet degree vectors and signature similarities to produce topological alignments of biological networks, without using any information external to network topology.

## References

1. ^ a b c Pržulj N, Corneil DG, Jurisica I: Modeling Interactome, Scale-Free or Geometric?, Bioinformatics 2004, 20(18):3508-3515.
2. Pržulj N, Biological Network Comparison Using Graphlet Degree Distribution, Bioinformatics 2007, 23:e177-e183.
3. Tijana Milenković and Nataša Pržulj, Uncovering Biological Network Function via Graphlet Degree Signatures, Cancer Informatics 2008, 6:257–273.
4. ^ Tijana Milenković, Ioannis Filippis, Michael Lappe, and Nataša Pržulj, Optimized Null Model for Protein Structure Networks, 2009, PLoS ONE 4(6): e5967.
5. ^ Tijana Milenković, Jason Lai, and Nataša Pržulj, GraphCrunch: a tool for large network analyses, BMC Bioinformatics 2008, 9:70. Highly accessed.
6. ^ Ahmed, N. K.; Neville, J.; Rossi, R. A.; Duffield, N. (2015-11-01). "Efficient Graphlet Counting for Large Networks". 2015 IEEE International Conference on Data Mining. pp. 1–10. doi:10.1109/ICDM.2015.141. ISBN 978-1-4673-9504-5. S2CID 11482293.
7. ^ Ahmed, Nesreen K.; Neville, Jennifer; Rossi, Ryan A.; Duffield, Nick G.; Willke, Theodore L. (2016-06-27). "Graphlet decomposition: framework, algorithms, and applications". Knowledge and Information Systems. 50 (3): 689–722. arXiv:1506.04322. doi:10.1007/s10115-016-0965-5. ISSN 0219-1377. S2CID 254130395.
8. ^ a b Tijana Milenković, Vesna Memisević, Anand K. Ganesan, and Nataša Pržulj, Systems-level Cancer Gene Identification from Protein Interaction Network Topology Applied to Melanogenesis-related Interaction Networks, Journal of the Royal Society Interface 2009, doi:10.1098/rsif.2009.0192.
9. ^ Cortnie Guerrero, Tijana Milenković, Nataša Pržulj, Peter Kaiser, Lan Huang, Characterization of the Yeast Proteasome Interaction Network by QTAX-Based Tag-Team Mass Spectrometry and Protein Interaction Network Analysis, PNAS 2008, 105(36): 13333–13338.