LoTo – Help Section – Computational Biology Lab (DLab)

LoTo – Local Topology Comparison of Directed Networks

Table of Contents
1. Method used in LoTo. 1.1. Metrics to quantify network similarity.
2. Inputs. 2.1. Network file formats accepted by LoTo. 2.2. Graph formats description. Cytoscape json (cyjs). Graph Modelling Language (gml). Graphml. Simple Interaction File (sif). Tab Separated Values (tsv). Cytoscape XGMML (xml). 2.3. Other Inputs. Threshold. Case. Job name. User email.
3. Outputs. 3.1. Output Page. 3.2. Output Files. 3.2.1. Results File. 3.2.2. Network Files.
4. How to cite LoTo.
5. Contact.

1. Method used in LoTo

Gene regulatory networks (GRNs) are abstract representations of gene regulation. In a GRN, vertices (nodes) represent genes, and the connections among them represent regulatory interactions.
A regulatory interaction exists if the product of a gene, e.g. a Transcription Factor (TF), regulates the expression of a target gene; these networks are therefore directed. Network differential
analysis is a technique that identifies topological variations between different realizations of the same network. In doing so, it aims to indicate how the regulation of gene expression varies
and thus to find the causes behind alterations in gene expression levels.

GRNs, like other directed networks, are composed of basic building units: small induced subgraphs called graphlets.

Graphlets denote local interconnectivity patterns that describe functional associations between nodes. LoTo compares the topology of two networks by quantifying the similarity between
graphlets formed by the same nodes in the two networks. Formally, it compares two versions of the same network, reference versus compared, that differ solely in the edges connecting the nodes.
LoTo uses all graphlets formed by three nodes with at least two true connections (Fig. 1). LoTo can also characterize a single network by reporting
the occurrence of each graphlet in it.

Figure 1: All possible three-node graphlets used in LoTo. Edges are directed and thus indicate the direction of the regulation they represent.
Black edges are true interactions, and red edges are false, i.e., non-existing regulations.
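
As a rough check of the graphlet set described above, the following Python sketch enumerates every directed pattern on three nodes and groups the patterns that differ only by relabeling the nodes. It assumes that "at least two true connections" means that at least two of the three node pairs are linked in some direction; under that assumption it yields 13 types, which agrees with the graphlet types 1 to 13 listed in the results table. The function names are illustrative and are not part of LoTo, and the numbering of the types in Fig. 1 is not reproduced here.

```python
from itertools import permutations, product

# The six possible directed edges among three node positions 0, 1 and 2.
EDGE_SLOTS = [(0, 1), (1, 0), (0, 2), (2, 0), (1, 2), (2, 1)]


def canonical(edges):
    """Return a signature that is identical for all relabelings of the same pattern."""
    best = None
    for perm in permutations(range(3)):
        relabeled = {(perm[u], perm[v]) for u, v in edges}
        signature = tuple(1 if slot in relabeled else 0 for slot in EDGE_SLOTS)
        if best is None or signature < best:
            best = signature
    return best


def connected_pairs(edges):
    """Count node pairs joined by at least one edge, in either direction."""
    return len({frozenset(edge) for edge in edges})


def enumerate_graphlet_types():
    """Collect the distinct three-node patterns with at least two connected pairs."""
    types = set()
    for bits in product([0, 1], repeat=len(EDGE_SLOTS)):
        edges = {slot for slot, bit in zip(EDGE_SLOTS, bits) if bit}
        if connected_pairs(edges) >= 2:  # assumed reading of "two true connections"
            types.add(canonical(edges))
    return types


graphlet_types = enumerate_graphlet_types()
print(len(graphlet_types))  # 13
```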

1.1 Metrics to quantify network similarity.

LoTo implements quantitative metrics that treat the existence or absence of graphlets as a binary classification problem. We also created a metric, named REC, to measure the REConstruction
rate between a graphlet found in the reference network and the connectivity pattern formed by the same nodes in the second network. REC quantifies the similarity between the
connectivity of the same three nodes forming a graphlet in the reference (A) and in the compared (B) networks. To do so, true and false edges are transformed into numerical values: 1 for true edges and 0
for false ones. With m being the total number of possible edges between the nodes forming the graphlet (6 for triplets of nodes in directed networks) and $a_{i}$ and $b_{i}$ the same edge in the two compared triplets
of nodes, the reconstruction rate is calculated using the equation below. REC ranges from 0 to 1, with 0 indicating total disagreement and 1 a perfect match.

$REC=1-(\frac{1}{m}\sum_{i=1}^{m}|a_{i}-b_{i}|)$
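
The REC equation can be illustrated with a short Python sketch. It assumes each network is given as a set of (source, target) tuples holding only the true edges; the function name `rec` and the gene names are hypothetical and chosen for illustration only.

```python
from itertools import permutations


def rec(triplet, edges_a, edges_b):
    """Reconstruction rate for one triplet of nodes.

    edges_a and edges_b are sets of (source, target) tuples with the true
    edges of the reference (A) and compared (B) networks.
    """
    slots = list(permutations(triplet, 2))  # the m = 6 possible directed edges
    m = len(slots)
    # a_i and b_i are 1 for a true edge and 0 for a false one.
    differences = sum(abs(int(s in edges_a) - int(s in edges_b)) for s in slots)
    return 1 - differences / m


reference = {("geneA", "geneB"), ("geneA", "geneC")}
compared = {("geneA", "geneB"), ("geneB", "geneC")}

# Two of the six edge slots differ (geneA->geneC and geneB->geneC),
# so REC = 1 - 2/6.
rec_value = rec(("geneA", "geneB", "geneC"), reference, compared)
print(rec_value)
```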

REC for all graphlets in which a node participates, N, can be averaged to measure the local topology similarity for that node in the two compared networks.
This metric is called REC Graphlet Degree (RGD), and can be calculated as follows:

$RGD=\frac{1}{N}\sum_{1}^{N}\left(1-\frac{1}{m}\sum_{i=1}^{m}|a_{i}-b_{i}|\right)$

RGD can be used to identify nodes that exhibit a variation in their connectivity and in their neighborhood of target genes. A value of RGD below 1 indicates a node with variations in its local topology;
the lower the value, the larger the change. RGD also gives a way to determine the subnetworks showing the topological variation, i.e. all those graphlets in which the node participates. Because graphlets are used,
these subnetworks reveal relationships with nodes further away than direct neighbours.
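
A minimal Python sketch of RGD follows. It assumes that the graphlets averaged over are those in which the node participates in the reference network, and that a triplet counts as a graphlet when at least two of its node pairs are connected in some direction. Networks are sets of (source, target) tuples of true edges; all function names are hypothetical.

```python
from itertools import combinations, permutations


def rec(triplet, edges_a, edges_b):
    """Reconstruction rate of one triplet (see the REC equation)."""
    slots = list(permutations(triplet, 2))
    differences = sum(abs(int(s in edges_a) - int(s in edges_b)) for s in slots)
    return 1 - differences / len(slots)


def is_graphlet(triplet, edges):
    """Assumed criterion: at least two node pairs of the triplet are connected."""
    pairs = {frozenset(s) for s in permutations(triplet, 2) if s in edges}
    return len(pairs) >= 2


def rgd(node, nodes, edges_a, edges_b):
    """Average REC over the reference graphlets in which `node` participates."""
    others = sorted(n for n in nodes if n != node)
    values = [
        rec((node, u, v), edges_a, edges_b)
        for u, v in combinations(others, 2)
        if is_graphlet((node, u, v), edges_a)
    ]
    return sum(values) / len(values) if values else None


nodes = {"A", "B", "C", "D"}
reference = {("A", "B"), ("A", "C"), ("A", "D")}
compared = {("A", "B"), ("A", "C")}  # the edge A->D is missing

# A participates in three reference graphlets with REC 1, 5/6 and 5/6.
rgd_a = rgd("A", nodes, reference, compared)
# D participates in two reference graphlets, both with REC 5/6.
rgd_d = rgd("D", nodes, reference, compared)
```

In this toy example both values are below 1, which flags A and D as nodes whose local topology changed, while the value for D is the lower of the two because every graphlet it belongs to contains the missing edge.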

All other metrics used in LoTo are commonly employed in binary classification problems and are calculated from a confusion matrix. Rows in a confusion matrix contain the examples in each predicted
class, while columns contain the examples in each actual class. These examples are usually named True Positive (TP) if the predicted class and the actual class are both positive; False Positive (FP) if the
predicted class is positive but the actual class is negative; False Negative (FN) if the example is predicted as negative but is actually positive; and True Negative (TN) if the example is predicted as negative
and is an actual negative. In LoTo, instead of building confusion matrices for the existence or absence of single connections between pairs of nodes, they are constructed for the existence or absence of
graphlets. Therefore, TP graphlets are triplets of nodes that form the same type of graphlet in both compared networks; a triplet of nodes that forms a different type of graphlet in the compared
network is a FP graphlet for the graphlet type formed in the compared network and a FN graphlet for the type formed in the reference network; while TN graphlets are triplets of nodes that do not form
that type of graphlet in either of the two networks.
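
The graphlet-level confusion matrix can be sketched in Python as shown below. This is an illustration under stated assumptions, not LoTo's implementation: graphlet types are identified by a relabeling-invariant signature rather than by the numbering of Fig. 1, a triplet that forms a graphlet in only one network is counted as FN (reference only) or FP (compared only) for that type, and TN for a type is taken as the number of triplets that are not TP, FP or FN for it. All names are hypothetical.

```python
from collections import Counter
from itertools import combinations, permutations


def graphlet_type(triplet, edges):
    """Signature of the pattern induced by `triplet`, or None if it is no graphlet."""
    induced = [s for s in permutations(triplet, 2) if s in edges]
    if len({frozenset(s) for s in induced}) < 2:
        return None  # fewer than two connected node pairs
    best = None
    for order in permutations(triplet):
        index = {node: i for i, node in enumerate(order)}
        signature = tuple(sorted((index[u], index[v]) for u, v in induced))
        if best is None or signature < best:
            best = signature
    return best


def graphlet_confusion(nodes, ref_edges, cmp_edges):
    """Return {graphlet type: {"TP", "FP", "FN", "TN"}} over all node triplets."""
    tp, fp, fn = Counter(), Counter(), Counter()
    n_triplets = 0
    for triplet in combinations(sorted(nodes), 3):
        n_triplets += 1
        t_ref = graphlet_type(triplet, ref_edges)
        t_cmp = graphlet_type(triplet, cmp_edges)
        if t_ref is not None and t_ref == t_cmp:
            tp[t_ref] += 1
        else:
            if t_ref is not None:
                fn[t_ref] += 1
            if t_cmp is not None:
                fp[t_cmp] += 1
    return {
        t: {"TP": tp[t], "FP": fp[t], "FN": fn[t],
            "TN": n_triplets - tp[t] - fp[t] - fn[t]}
        for t in set(tp) | set(fp) | set(fn)
    }


nodes = {"A", "B", "C", "D"}
ref_edges = {("A", "B"), ("A", "C"), ("A", "D")}
cmp_edges = {("A", "B"), ("B", "C"), ("A", "D")}
confusion = graphlet_confusion(nodes, ref_edges, cmp_edges)

# Type of "one regulator with two targets" and of a "regulatory chain".
out_star = graphlet_type(("A", "B", "C"), ref_edges)
chain = graphlet_type(("A", "B", "C"), cmp_edges)
```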

2. Inputs

LoTo takes as input one or two network files. In the case of a single network, LoTo reports all graphlets found in it and the number of graphlets in which each node participates
(node graphlet degree). This single network file can be binary (edges are true or false, i.e., 1 or 0) or a network in which edges have an associated probability value.
This probability reflects the likelihood that an edge is a true edge. When two networks are used, they must be formed by the same nodes or at least share a subset of the nodes.
In this case, the networks can also have probability values associated with their edges, or else be binary networks. In networks that are not binary, i.e., those with a probability associated with each pair
of nodes, true connections are defined as those with a probability value higher than the user-selected threshold, and edges that are either below the threshold or not present in the file are
considered false.
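
The thresholding rule described above can be sketched in a few lines of Python. The sketch assumes the weighted network is held as a dictionary mapping (source, target) tuples to probabilities; the function name `true_edges` is hypothetical.

```python
def true_edges(weighted_edges, threshold=0.0):
    """Keep only the edges whose probability is strictly higher than the threshold.

    Edges at or below the threshold, and edges absent from the dictionary,
    are treated as false simply by not being part of the returned set.
    """
    return {edge for edge, probability in weighted_edges.items()
            if probability > threshold}


weighted = {
    ("geneA", "geneB"): 0.92,
    ("geneA", "geneC"): 0.40,
    ("geneB", "geneC"): 0.0,
}

binary_default = true_edges(weighted)        # default threshold of 0
binary_strict = true_edges(weighted, 0.5)    # user-selected threshold
```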

2.1. Network file formats accepted by LoTo.

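Currently, LoTo supports the following graph formats: Cytoscape json (cyjs), Graph Modelling Language (gml), Graphml, Simple Interaction File (sif), Tab Separated Values (tsv) and Cytoscape XGMML (xml or xgmml).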

Moreover, LoTo supports tar.gz (tgz) and zip files, provided that the archive contains a single file in one of the formats mentioned above.

2.2. Graph formats description.

Cytoscape json (cyjs)

The Cytoscape json format for graphs defines nodes with their “id”, “shared_name”, “selected”, “SUID” and “name” attributes.
This format also describes edges with their “id”, “source”, “target”, “interaction”, “selected”, “shared_interaction”, “shared_name”, “SUID” and “name” attributes.

An example file in cyjs format can be downloaded by clicking Here.

For more information please refer to: http://wiki.cytoscape.org/Cytoscape_3/UserManual/CytoscapeJs

Graph Modelling Language (GML)

Graph Modelling Language (GML) is a hierarchical ASCII-based file format. It defines each node with “id” and “label” attributes, and describes each edge with the following attributes: “source”, “target” and “interaction”.

An example file in gml format can be downloaded by clicking Here.

For more information please refer to: http://www.fim.uni-passau.de/index.php?id=17297&L=1 or
http://en.wikipedia.org/wiki/Graph_Modelling_Language

Graphml

Graphml is an XML-based file format for graphs. It defines each node by its “id”. Edges in this format are described by their “source”, “target” and “data” attributes.

An example file in graphml format can be downloaded by clicking Here.

For more information please refer to: http://en.wikipedia.org/wiki/GraphML

Simple Interaction File (SIF)

Simple Interaction File (SIF) is a format that only specifies nodes and interactions. The first column indicates the source node, the second column describes the connection between the source and target nodes, and the third column gives the target node (one or more). In the second column, 1 represents a connection and 0 represents no connection.

An example file in sif format can be downloaded by clicking Here.

For more information please refer to: http://cytoscape.org/manual/Cytoscape2_8Manual.html#SIF Format
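
To make the SIF column layout concrete, here is a minimal Python sketch of a reader. It assumes whitespace-separated fields, an interaction column containing only 0 or 1 as described above, and possibly several targets per line; the function name `parse_sif` and the gene names are hypothetical, and this is not the parser used by LoTo.

```python
def parse_sif(text):
    """Map each (source, target) pair in a SIF document to its 0/1 interaction."""
    edges = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # blank line or a node without interactions
        source, interaction, targets = fields[0], fields[1], fields[2:]
        for target in targets:
            edges[(source, target)] = int(interaction)
    return edges


sif_text = """geneA 1 geneB geneC
geneB 0 geneC
"""
sif_edges = parse_sif(sif_text)
```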

Tab Separated Values (TSV)

Tab Separated Values (TSV) is the native format of LoTo. In this format, the first column indicates the id of the source node, the second column the id of the target node and the third column the interaction between the source and target nodes. In LoTo, the third column can be 0 if there is no regulatory interaction between the source and target nodes, 1 if there is a regulatory interaction, or a probability of existence for the interaction in the range [0,1].

An example file in tsv format can be downloaded by clicking Here.
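
The three-column TSV layout can be read with Python's standard csv module, as in the sketch below. It assumes a file without a header line and rejects values outside [0,1]; the function name `parse_tsv` and the gene names are hypothetical.

```python
import csv
import io


def parse_tsv(text):
    """Map each (source, target) pair to its interaction value in [0, 1]."""
    edges = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or incomplete lines
        source, target, value = row[0], row[1], float(row[2])
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"interaction value out of range: {value}")
        edges[(source, target)] = value
    return edges


tsv_text = "geneA\tgeneB\t1\ngeneA\tgeneC\t0.35\ngeneB\tgeneC\t0\n"
tsv_edges = parse_tsv(tsv_text)
```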

Cytoscape XGMML (XML or XGMML)

XGMML is a format based on the GML definition. This format allows each node and edge to be described with many attributes, including diverse colour codes that can be employed for a better
network visualization.

An example file in this format can be downloaded by clicking Here (xml) or Here (xgmml).

For more information please refer to: Cytoscape file format manual.

2.3 Other inputs.

Threshold: This number is used to define true edges in networks whose edges are associated with a probability. The threshold has a default value of 0, but it can be set to any number within the [0,1] range. All interactions defined in the inferred network with a probability value above the threshold are defined as true connections.

Case: LoTo has two use cases, defined as “case 1” and “case 2”. In “case 1”, the reference network contains all true and false edges,
and therefore only those interactions present in the reference network are considered in the comparison.
In “case 2”, the reference network contains only the true edges, and those interactions that are absent from the network are considered false edges.

Job name: The user can specify a job name to identify each query submitted to LoTo.

User email: If the user enters a valid email address, LoTo sends an email as soon as the job is finished. If the user does not enter a valid email address, we
advise keeping the LoTo page open until the output page is produced.

3. Outputs.

The output of LoTo consists of two parts: the output page and the downloadable files.

3.1. Output Page.

The first lines in the output page show the file names of the reference and inferred networks (only the reference network in the case of single network analysis), the user-defined threshold and the case used in the comparison. Following these lines, the user will find a brief table showing the results of the comparison of two networks or of the single network analysis.

Right under the table, there is a button that if pressed will display the network resulting from the comparison using the vis.js plugin. A description of the networks displayed can be found below.

For the single network analysis the results table reports the occurrence of each graphlet type in the network. For the comparisons, the table has a row for each graphlet type (1 to 13), another for the same metrics calculated for all graphlets (all) and the last row that shows the same metrics calculated for the existence of single edges (gl). The metrics shown in each column are (from left to right in the table):

#T	Occurrence of graphlets in the reference network.
#I	Occurrence of graphlets in the compared network.
TP	Number of triplets of nodes forming the same type of graphlet in both the reference network and the second network.
FP	Occurrence of graphlets in the compared network whose nodes form a different graphlet type in the reference network.
TN	Number of all possible graphlets that could be formed by all triplets of nodes and are not classified as TP, FP or FN.
FN	Occurrence of graphlets in the reference network whose nodes form a different graphlet type in the second network.
Recall (R)	$R=\frac{TP}{TP+FN}$
Precision (P)	$P=\frac{TP}{TP+FP}$
False Positive Rate (FPR)	$FPR=\frac{FP}{FP+TN}$
F1 score	$F1=\frac{2PR}{P+R}$
Matthews Correlation Coefficient (MCC)	$MCC=\frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP+FN)(TN+FP)(TN+FN)}}$
Accuracy (ACC)	$ACC=\frac{TP + TN}{TP+FP+TN+FN}$
Reconstruction Rate (REC)
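
The count-based metrics in this table can be computed from the four confusion-matrix counts as in the Python sketch below. It is an illustration rather than LoTo's code: FPR uses the standard definition FP/(FP+TN), undefined ratios are returned as NaN (an assumption about how empty denominators should be handled), and the function names and the example counts are hypothetical.

```python
import math


def _ratio(numerator, denominator):
    """Divide, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def basic_metrics(tp, fp, tn, fn):
    """Recall, precision, FPR, F1, MCC and accuracy from confusion counts."""
    recall = _ratio(tp, tp + fn)
    precision = _ratio(tp, tp + fp)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "R": recall,
        "P": precision,
        "FPR": _ratio(fp, fp + tn),
        "F1": _ratio(2 * precision * recall, precision + recall),
        "MCC": _ratio(tp * tn - fp * fn, mcc_denominator),
        "ACC": _ratio(tp + tn, tp + fp + tn + fn),
    }


# Made-up counts for one graphlet type, for illustration only.
metrics = basic_metrics(tp=8, fp=2, tn=85, fn=5)
for name, value in metrics.items():
    print(f"{name}\t{value:.4f}")
```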

3.2. Output Files.

3.2.1. Results File:

The output file contains a more detailed description of the comparison. The first lines also specify the names of the input files, the case and the threshold used. Following these lines, there is a table very similar to that of the server output page but with several more metrics calculated. The metrics only present in this table are:

True Negative Rate (TNR, specificity)	$TNR=\frac{TN}{TN+FP}$
Negative Predictive Value (NPV)	$NPV=\frac{TN}{TN+FN}$
False Discovery Rate (FDR)	$FDR=\frac{FP}{TP+FP}=1-P$
False Negative Rate (FNR, miss rate)	$FNR=\frac{FN}{FN+TP}$
Informedness	$Informedness=R+TNR-1$
Markedness	$Markedness=P+NPV-1$
Jaccard index (J)	$J=\frac{TP}{FP+FN+TP}$
Sorensen-Dice (SD)	$SD=\frac{TP}{TP+\frac{FN+FP}{2}}$
Kulczynski 1 (K1)	$K1=\frac{TP}{FN+FP}$
Kulczynski 2 (K2)	$K2=0.5\left[\frac{TP}{TP+FN}+\frac{TP}{TP+FP}\right]$
Otsuka (O)	$O=\frac{TP}{\sqrt{(TP+FN)(TP+FP)}}$
Correlation ratio (C)	$C=\frac{TP^{2}}{(TP+FN)(TP+FP)}$
Hamming distance (H)	$H=FP+FN$
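
The additional metrics of the results file follow directly from the same four counts. The Python sketch below is illustrative only: it returns NaN for undefined ratios (an assumption), and the function names and example counts are hypothetical.

```python
import math


def _ratio(numerator, denominator):
    """Divide, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def extended_metrics(tp, fp, tn, fn):
    """Metrics reported only in the results file, from confusion counts."""
    recall = _ratio(tp, tp + fn)
    precision = _ratio(tp, tp + fp)
    tnr = _ratio(tn, tn + fp)
    npv = _ratio(tn, tn + fn)
    return {
        "TNR": tnr,
        "NPV": npv,
        "FDR": _ratio(fp, tp + fp),
        "FNR": _ratio(fn, fn + tp),
        "Informedness": recall + tnr - 1,
        "Markedness": precision + npv - 1,
        "J": _ratio(tp, fp + fn + tp),
        "SD": _ratio(tp, tp + (fn + fp) / 2),
        "K1": _ratio(tp, fn + fp),
        "K2": 0.5 * (recall + precision),
        "O": _ratio(tp, math.sqrt((tp + fn) * (tp + fp))),
        "C": _ratio(tp ** 2, (tp + fn) * (tp + fp)),
        "H": fp + fn,
    }


# Made-up counts for one graphlet type, for illustration only.
extended = extended_metrics(tp=8, fp=2, tn=85, fn=5)
```

Note that with these definitions FDR equals 1 - P and the Sorensen-Dice value coincides with the F1 score, which gives two quick consistency checks on any implementation.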

The next table shows, for each TF-encoding gene in the reference network, the number of graphlets of each type in which it participates, the total number of graphlets in which the node participates (node graphlet degree) and, when two networks were compared, the RGD and the F1 for all graphlets that were used to calculate RGD. TF-encoding nodes are those nodes from which at least one edge originates.

The following table indicates the same attributes, but in this case for non-TF-encoding genes in the reference network.

Below, there are two tables with the participation of each TF and non-TF encoding genes in graphlets but for the inferred network.

In the last section of the output file there are three lists of graphlets sorted by graphlet type. The first list shows those graphlets that are present in both networks (TP graphlets). The second list shows those graphlets that are only present in the reference network (FN graphlets). The last list shows graphlets that are only present in the inferred network (FP graphlets). These lists can be used to determine the subnetwork formed by all graphlets in which a certain node participates.

3.2.2. Network Files:

LoTo also generates two types of network files that can be used to visualize the results of the analysis or the comparison. These files are an xgmml network file for easy loading and visualization in Cytoscape 3.X, and two table files that describe the nodes and the edges. Both the xgmml and the table files contain the same information, including node and edge coloring schemes indicating the results of the comparison and several other node and edge attributes. The information contained in the files varies depending on whether a single network is used as input or two networks are compared.

Single network analysis:

Node attributes:
Node id (name): string identifying each node in the network.
Node type: TF for source nodes or nTF for nodes that are not the source of any edge.
Node color: hexadecimal color code for the node. TFs, i.e. nodes from which edges originate (source nodes), are colored orange and target nodes (non-TFs) are colored blue.
Node graphlet degree: Total number of graphlets in which the node participates.

Edge attributes:
Edge id (name): string identifying each edge in the network. The name used is Source_name(1)Target_name.
Edge source: source node name.
Edge target: target node name.
Interaction: since edges shown in the network are only the true ones, this attribute is always set to 1.
Edge color: hexadecimal color code for the edge, all edges are colored in black.
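
As an illustration of the single-network attributes listed above, the following Python sketch derives the node type and the edge id from a set of true edges. It assumes that the edge id is built literally as the source name, the string "(1)" and the target name; the function name `describe_network` and the gene names are hypothetical.

```python
def describe_network(edges):
    """Return node types (TF or nTF) and edge ids for a set of true edges."""
    sources = {source for source, _ in edges}
    nodes = sources | {target for _, target in edges}
    # A node is a TF if it is the source of at least one edge.
    node_type = {node: ("TF" if node in sources else "nTF")
                 for node in sorted(nodes)}
    # Assumed naming scheme: Source_name(1)Target_name.
    edge_ids = [f"{source}(1){target}" for source, target in sorted(edges)]
    return node_type, edge_ids


node_type, edge_ids = describe_network({("geneA", "geneB"), ("geneA", "geneC")})
```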

Network comparison:

Node attributes:
Node id (name): string identifying each node in the network.
Node type: TF for nodes that are the source of at least one edge in any of the two networks or nTF for nodes that are not the source of any edge.
Presence in reference: this attribute is set to A if the node is absent in the reference network or to P if the node is present.
Presence in compared: this attribute is set to A if the node is absent in the compared network or to P if the node is present.
Node class: according to their presence in the two compared networks, nodes are TP (True Positive) when they are present in both networks; FN (False Negative) when they are present in the reference network and absent in the compared; and FP (False Positive) when they are absent in the reference network and present in the compared.
Node color: hexadecimal color code for the node, nodes are colored according to their class and type attributes.
TFs present in both networks (TP) are Orange colored
TFs present only in reference network (FN) are Pink colored
TFs present only in compared network (FP) are Yellow colored
non TFs present in both networks (TP) are Blue colored
non TFs present only in reference network (FN) are Purple colored
non TFs present only in compared network (FP) are White colored
Node graphlet degree in reference: Total number of graphlets in which the node participates in the reference network.
Node graphlet degree in compared: Total number of graphlets in which the node participates in the compared network.
Node RGD: average REC for all graphlets in which a node participates in the reference network.
Node F1: F1 for all graphlets in which a node participates in the reference network.

Edge attributes:
Edge id (name): string identifying each edge in the network. The name used is Source_name(1)Target_name.
Edge source: source node name.
Edge target: target node name.
Interaction: since edges shown in the network are only the true ones, this attribute is always set to 1.
Presence in reference: this attribute is set to A if the edge is absent in the reference network or to P if the edge is present.
Presence in compared: this attribute is set to A if the edge is absent in the compared network or to P if the edge is present.
Source presence: this attribute is set to PP if the source node is present in both networks, PA if the source node is only present in the reference network and to AP if it is present only in the compared network.
Target presence: this attribute is set to PP if the target node is present in both networks, PA if the target node is only present in the reference network and to AP if it is present only in the compared network.
Edge class: edges are TP when present in both networks, FN when present only in the reference network and FP when present only in the compared network.
Edge color: hexadecimal color code for the edge, edges are colored according to their class.
TP Edges are Black colored
FN Edges are Pink colored
FP Edges are Yellow colored
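
The edge class and presence attributes can be derived with simple set operations, as in this Python sketch. Networks are assumed to be sets of (source, target) tuples of true edges; color names are used instead of hexadecimal codes because the exact codes are not given here, and the function name `classify_edges` is hypothetical.

```python
EDGE_COLORS = {"TP": "black", "FN": "pink", "FP": "yellow"}


def classify_edges(ref_edges, cmp_edges):
    """Return class, presence flags and color for every edge in either network."""
    attributes = {}
    for edge in sorted(ref_edges | cmp_edges):
        in_ref, in_cmp = edge in ref_edges, edge in cmp_edges
        if in_ref and in_cmp:
            edge_class = "TP"
        elif in_ref:
            edge_class = "FN"
        else:
            edge_class = "FP"
        attributes[edge] = {
            "presence_in_reference": "P" if in_ref else "A",
            "presence_in_compared": "P" if in_cmp else "A",
            "class": edge_class,
            "color": EDGE_COLORS[edge_class],
        }
    return attributes


edge_attributes = classify_edges(
    {("geneA", "geneB"), ("geneA", "geneC")},
    {("geneA", "geneB"), ("geneB", "geneC")},
)
```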

4. How to cite LoTo.

At this time the article is being reviewed. For more details send an email to ajmm@dlab.cl.

5. Contact.

If you have any questions or problems using LoTo, you can send an email to ajmm@dlab.cl.