conet documentationpsbweb05.psb.ugent.be/conet/documents/conetdoc.pdf · label. the edge...

49
CoNet documentation Documentation for the Cytoscape plugin CoNet. CoNet computes significant cooccurrence or mutual exclusion between items (matrix rows), whose presence/absence or abundance was observed repeatedly (matrix columns) and visualize the result as a network. In case of questions, contact Karoline Faust: This document has been generated from html with prince.

Upload: others

Post on 06-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

CoNet documentation

Documentation for the Cytoscape plugin CoNet.

CoNet computes significant cooccurrence or mutual exclusion between items (matrixrows), whose presence/absence or abundance was observed repeatedly (matrix columns)and visualize the result as a network.

In case of questions, contact Karoline Faust:

This document has been generated from html with prince.

www.princexml.com
Prince - Non-commercial License
This document was created with Prince, a great way of getting web content onto paper.
Page 2: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

Index

1. Main menu help2. Network loading menu help3. Data menu help4. Preprocessing and filter menu help5. Methods menu help6. Merge menu help7. Randomization menu help8. Configuration menu help9. Settings loading/saving menu help

Command line help

References, resources and related tools

Page 3: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

Main menu.

CoNet is a versatile tool to compute co-occurrences and mutual exclusions betweenitems. A basic run looks something like this:

1. Open the Data menu and select the input matrix file.2. Open the Methods menu and select the methods to execute and their thresholds.3. Open the Merge menu and specify how the results of the different methods

should be combined4. Click the GO button

When the GO button is clicked, a progress menu appears allowing to cancel the networkconstruction job. It disappears upon completion of the job. During computation, nofurther jobs can be launched.In more advanced runs, randomizations can be carried out (Randomization menu).

Demo

The "Demo" button computes co-occurrences on an example data set (the oral samplescollected in Costello et al., Science, 2009). Running the demo does not change the currentconfiguration of CoNet, neither is the demo influenced by CoNet's current configuration.The demo data and settings accompany the CoNet distribution.

Command line call

CoNet is accompanied by a command line tool, which is recommended if thousands ofrandomizations have to be carried out on large matrices. The button "Generate commandline call" displays a window that shows the command line call corresponding to thecurrent setting of CoNet. Note that Rserve-related configuration cannot be set viacommand line directly, but need to be specified in the (optional) configuration file of theCoNet command line tool.

Settings loading/saving

CoNet settings can also be saved to or loaded from a file via the button "Settings loading/saving". This saves the user from repeating a tedious CoNet configuration.

Network loading

The Network loading menu allows displaying networks generated with the CoNetcommand line tool or with other network inference tools that generate adjacencymatrices. Select the file to be displayed in the menu and then click GO.

Page 4: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

Network display

Result networks are displayed with a number of attributes.Network attribute:The Comment stores details of the network generation as well as the command line call.

Node attributes:• Label The name of the node that is displayed on the node.• abundance Each node corresponds to a row in the input matrix. The abundance

is the row sum. When the input matrix was processed in any way, this refers tothe row sum in the processed matrix.

• samplecount The samplecount is the number of values of the corresponding rowthat are neither zero nor missing values. When the input matrix was processed inany way, this refers to the number of occurrences in the processed matrix.

• degree The number of links of the node.• posdegree The number of positive links of the node.• negdegree The number of negative links of the node.• unknowndegree The number of links of the node having unknown interaction

type.• impliesdegree The number of links of a node contributed by association mining.• oriID The identifier of the row to which this node corresponds.• canonicalName An attribute set by Cytoscape that stores the node identifier.• isafeature This attribute is only set in case features are provided and allows to

differentiate between feature nodes and other nodes.• matrix_number This attribute is only set if two input matrices are provided and

indicates the origin of the node from matrix 1 or 2.• shiftgroup This attribute is only set if a lag is specified and groups all shifted

versions of one row together.Additional node attributes are added when metadata are provided, such as the lineage orthe node group.

Edge attributes:• Label The edge identifier.• cooc_method The co-occurrence method that contributed this edge. There may

be more than one co-occurrence method.• weight The edge weight. When no randomization is run, this is the score of the

co-occurrence method. In case of several methods per edge, the weightrepresents the combined score.

• interactionType The edge type. Can be either co-presence (positive interaction)or mutual exclusion (negative interaction). Co-presence edges are colored ingreen, mutual exclusion edges in red.

• method_number The number of methods supporting an edge. This attribute isnot set for multi-graphs.

Page 5: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

• method_scores The scores of methods supporting an edge. This attribute is notset for multi-graphs.

• methodname_scores The method name versus its score for all methodssupporting an edge. This attribute is not set for multi-graphs.

• methodname_interactiontype The method name versus the interaction type(positive, negative or unknown) it predicted. This attribute is not set for multi-graphs.

• interaction An attribute set by Cytoscape that refers to the source of the edge.Here, it is set to the same value as cooc_method.

• canonicalName An attribute set by Cytoscape that stores the edge identifier.When the network was randomized, the following edge attributes are added:

• pval The p-value computed from the random score distribution.• qval The q-value computed from the p-value when benjaminihochberg or

bonferroni are selected as multiple testing correction strategies.• sig The significance computed from the p-value when compute_eval is selected

as multiple testing correction strategy.• methodname_pval The method name versus its p-value for all methods

supporting an edge. This attribute is not set for multi-graphs.• randdistribMean The mean of the random score distribution.• randdistribMedian The median of the random score distribution.• randdistribSD The standard deviation of the random score distribution.• randdistribNumNonNaNScores The number of iterations that did not result in

missing values.• nulldistribmean When both permutation and bootstrap distribution were

computed, the mean of the permutation distribution.• nulldistribmedian When both permutation and bootstrap distribution were

computed, the median of the permutation distribution.• nulldistribsd When both permutation and bootstrap distribution were computed,

the standard deviation of the permutation distribution.• oriScore The value of the edge weight before randomization.

When the network is a randomized merged multigraph, the following edge attribute isadded:

• pval-MERGE_STRATEGY The p-value computed by merging measure-specific p-values with the selected merging strategy (e.g. pval-brown-merge).

In case of a merged multigraph, multiple-testing-corrected q-values are computed fromthe merged p-values. The q-values are also stored under the "weight" attribute.

So when you do not randomize, your edge attribute of choice to weight edges is "weight",when you randomize, it is "pval", when you merge p-values, it is pval-MERGE_STRATEGY and when you correct for multiple testing, it is either "qval" (forbonferroni and benjaminihochberg) or "sig" (for E-value computation).

Page 6: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

Limitation

CoNet expects small-sized (hundreds of rows) to medium-sized (thousands of rows)matrices as input. If filtering steps are selected that reduce the matrix size, larger matricescan be provided as well. If you want to infer networks from matrices with more than fivethousand rows, CoNet is not a suitable tool for you. You might however combine it withother tools by first clustering a huge input matrix with an external tool and thensubmitting aggregated clusters to CoNet. For matrices with a size close to CoNet's limit,it is best to avoid selecting slow inference methods such as mutual information orKendall's correlation and to carry out the inference on command line.

Error report

No tool is without bugs. CoNet errors usually cause an error report to be generated,which you can send if you think it's relevant. You can also check the cytoscape-generatedlog file "output.log" in the Cytoscape application folder.

Page 7: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

Load an input network

With this menu, you can load networks in two formats currently not supported byCytoscape. Select the network file of your choice and click the GO button in the mainmenu to load the network.

Network types

GDL networks

GDL networks can be generated by CoNet on command line. They contain node andedge attributes and are the most convenient way of loading a network computed withCoNet on command line into Cytoscape.

Adjacency matrices

Adjacency matrices are N x N matrices, where N is the number of rows in the inputmatrix. A non-zero entry in an adjacency matrix represents an edge (for instance, if A isan adjacency matrix and A(i,j)>0, then taxon i is connected to taxon j by an edge). Thevalue of the non-zero entry represents the score of the edge. For instance, SparCC outputsan adjacency matrix with correlation scores and a p-value matrix, both of which can beloaded via this menu.

Load

Select a GDL network or an adjacency matrix file by clicking on "Open file". To undo afile selection, click "Open file" again and then "Cancel". This will clear the file selection.If you want to load an adjacency matrix, you can optionally load a p-value matrix. The p-value matrix is assumed to have the same order of rows and columns as the adjacencymatrix. Thus, the p-value matrix provides p-values for the edges stored in the adjacencymatrix.

Adjacency matrix options

The scores in the adjacency matrix can be filtered using the lower and upper thresholdtext fields. If no thresholds are provided, all edges coded by the adjacency matrix are readin.The p-values in the p-value matrix can also be filtered by providing a threshold via the P-value threshold field. If this field is left empty, edges with p-values above 0.05 arediscarded.Finally, p-values for adjacency matrices can be multiple-testing corrected, using

Page 8: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

Bonferroni or Benjamini-Hochberg multiple testing correction. If multiple-testingcorrection is applied, the p-value threshold refers to adjusted p-values.

Page 9: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

Load an input matrix

Open input matrix

To select the input matrix from your local file system, click "Select file" and browse thefile tree. The first time you select a file or a folder after launching CoNet, it may take abit of time before your file tree is visualized. After having selected a file, click "Open".The file path is displayed in the field above the "Select file" button. If you want to clearthe selected file, click "Select File", then "Cancel". Alternatively, you can paste the fullpath of your file into the text field below the "Select file" button.You can optionally load a second input matrix. In this case, co-occurrence is computedbetween all cross-matrix row combinations (and never between two rows belonging tothe same matrix). However, the first and second matrix are expected to have an equalnumber of columns. This results in a bipartite network with two node types, each nodetype corresponding to one input matrix.Note that metadata have to include all rows of the first and second input matrix. Forinstance, the first input matrix could represent Eukaryotes and the second Bacteriasampled for the same number of locations. A combined metadata file could then providethe lineages of both Eukaryotes and Bacteria. The delimiter and transpose optionsexplained below apply to both input matrices. The biom file options also refer to bothmatrices (that is, both matrices have to be in the same input format).When two matrices are read in, each of their rows is assigned a group attribute called"matrix_number", with different values for the first and second matrix. This groupattribute can be entered into the "Metadata and Feature loading menu", to use groupoptions, especially group-wise normalization (see below).Note that if the second matrix is a table generated from a biom file, its lineages will notbe parsed.

Matrix property computation

If you enable "Compute matrix info" and push the GO button, some basic properties ofthe input matrix are displayed along with configuration tips. For instance, the row andcolumn number, minimal and maximal column sums as well as total possible edgenumber is displayed. The total possible edge number is computed according to theformula: n*(n-1)/2, where n is the row number (note that this formula is not valid forassociation rule mining). Matrix properties are computed after selected preprocessing andfilter steps have been applied.

Parsing options

By default, strength of co-occurrence is computed between rows. If it should becomputed between columns, activate the "Transpose data matrix" check box. If the

Page 10: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

matrix transposing is enabled, feature columns should match the columns of thetransposed matrix (i.e. the rows of the input matrix) and metadata identifier should referto the rows of the transposed matrix.

The columns of the input matrix are by default expected to be tab-delimited. However,the column separator can be changed by entering the desired delimiter string in the textfield below "Change default column delimiter".

The matrix can be optionally given as a taxon count table generated from a biom file. Toindicate that the taxon table was generated from a biom file, activate the "Table obtainedfrom biom file" check box. A biom file to table converter is available at http://biom-format.org/. Tables generated from biom files are accepted with or without taxonomy.When lineages are provided in the format:k__KINGDOM;p__PHYLUM;c__CLASS;o__ORDER;f__FAMILY;g__GENUS;s__SPECIES,then lineage information is automatically parsed from the table (overriding anypreviously set metadata), which allows the usage of metadata-dependent CoNet optionssuch as higher-level taxon assignment or the suppression of relationships between OTUsbelonging to the same genus. Taxon nodes will then provide values for the phylum, class,order, family, genus and species attribute, although these values may be set to none.The taxonomy is also parsed automatically from tables where the lineage is included inthe row name (example for a row name: k__Bacteria; p__Bacteroidetes; c__Bacteroidia;o__Bacteroidales; f__Prevotellaceae; g__Prevotella). Note however that the taxon counttable should not contain features, i.e. non-taxon rows. Note also that the taxon count tableformat, if enabled, is assumed to be valid also for the second input matrix, if one is given.If two matrices are provided, lineages will be parsed only from the first.

From version 1.1.beta onwards, CoNet compiled for Cytoscape 3.X can also parse biomfiles directly. You can indicate that your data matrix is a biom file by enabling the "Biomfile" check box. Note that only HDF5 biom files are supported. HDF5 biom files containHDF5 in the first line. The alternative biom format, JSON, is human-readable, whereasHDF5 biom files are not. The biom converter available from http://biom-format.org/ canconvert JSON into HDF5 and vice versa. When the biom file provides lineages under the"taxonomy" attribute, they are parsed automatically, as for the biom-derived taxon tables.As above, the lineage format:k__KINGDOM;p__PHYLUM;c__CLASS;o__ORDER;f__FAMILY;g__GENUS;s__SPECIESis assumed. Note that the HDF5 biom file import is not supported on command line.

Input matrix format

In general, each row should appear only once in your matrix, i.e. each row name shouldbe unique. Lines preceded by # are treated as comment lines and omitted. The inputmatrix can be a pure numeric matrix with tab-delimited entries, such as the smallexample below:

Page 11: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

9.4 5.4 10.113.5 4.4 11.914.3 13.4 8.9

It is also possible to specify a matrix with row names and without column names. Therow name column is expected to be the first column and row names should not be purelynumeric or blank. Example:

taxon1 1.2 4.4 3.2 0.0 1.1taxon2 2.2 4.3 1.1 0.0 1.2taxon3 1.3 2.1 3.1 2.3 1.1

The matrix can also provide row and column names. In this case, it is expected that thecolumn names form the first row and that the row name column has itself a column name.Row and column names should not be purely numeric or blank. A small exampleillustrates this:

colnames col1 col2 col3 col4tax1 1.2 1.0 3.2 0.0tax2 2.2 2.3 1.4 0.0tax3 1.3 2.1 3.1 2.3tax4 1.2 1.4 2.1 2.4tax5 2.3 2.1 4.0 3.1tax6 2.5 2.4 2.3 1.3

Note that OTU tables generated from biom files are also accepted (check out the "Parsingoptions").

Treatment of missing values

Missing values have to be indicated using NaN. For example:

9.4 NaN 10.113.5 4.4 11.914.3 13.4 NaN

Page 12: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

Treatment of special characters

Special characters in row names (or in the first metadata column, which represents rownames, see below) are replaced by a dash to avoid confusion with CoNet's own specialcharacters. However, the original row names are stored as values of the node attribute"oriId".

Matrix type

The matrix can be of three different types. An abundance matrix has continuous values, acount matrix has integer values and an incidence matrix has only two values representingpresence (1) and absence (0). Methods like the hypergeometric distribution andassociation rule mining can only be applied to the incidence matrix and will produce anerror if applied to an abundance matrix.

Time series data

CoNet's methods are not tailored to deal with time series. However, CoNet allows tocompute lagged similarities via the option "Specify maximum lag". If a value larger thanzero is set for this option, input data are treated as time series. For each time series,shifted versions are generated up to the given maximum shift. This allows to detectassociations between two species with similar, but shifted abundances. Take for examplethe following time series observed for species A and B:

speciesA 1 2 3 4 5speciesB 2 3 4 5 6

If a lag of 1 is specified, then CoNet would generate rows

speciesA-0 1 2 3 4speciesA-1 2 3 4 5speciesB-0 2 3 4 5speciesB-1 3 4 5 6

If missing values are treated, the shifted rows would be generated as follows:

speciesA-0 1 2 3 4 5speciesA-1 2 3 4 5 NaNspeciesB-0 2 3 4 5 6speciesB-1 3 4 5 6 NaN

CoNet does not compute auto-correlations, e.g. no similarity is computed betweenspeciesA-0 and speciesA-1. In addition, similarities between the same shift (larger than

Page 13: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

zero) are also not computed, e.g. the combination speciesA-1, speciesB-1 is omitted.Beware that specifying a lag greatly increases the runtime as well as the risk of falsepositives.

Metadata and Feature Loading

If you click on the "Metadata and features menu", a sub-menu will be opened. The fourfunctions of this sub-menu are explained below:

1. Metadata Metadata are row attributes. For instance, if the input matrix rowsrepresent species and we want to associate each species with its genus, we canupload a metadata matrix consisting of two tab-delimited columns: the firstlisting the row names of the input matrix and the second listing the genera. Morethan one attribute per row can be up-loaded, by providing a matrix with severaltab-delimited columns. The attribute names have to be given in the order inwhich they appear in the metadata matrix in the text field below "Enter metadatacolumn names". If several attribute names are given, they have to be separated bya slash. For instance, if we want to load values for the attributes "lineage" and"taxon" on the rows, we need to up-load a matrix with 3 columns (row names,lineage attribute values and taxon attribute values) and specify the two attributenames separated by slash in the metadata name input field. Note that metadatawill not be loaded if the input matrix is in a biom or biom-derived format.One column of the metadata file may assign group memberships. For instance, acolumn assigning body sites to each row groups rows body-site-wise. If a groupattribute has been specified, parent-child relationships between taxa acrossdifferent groups are allowed and column-wise normalization and renormalization(see randomization) are carried out group-wise.A second group attribute can be provided, which does not have any of theeffects of the first group attribute. Instead, specifying this group attributeprevents intra-group links, i.e. links between nodes sharing the same value forthis group attribute are forbidden. This is useful to prevent links between tooclosely related taxa.

2. Lineage The lineage is a specific row attribute that provides the taxonomicclassification, from most generic to most specific, e.g. Bacteria--Bacteroidetes--Bacteroidia--Bacteroidales--Prevotellaceae--Prevotella. The two dashes ('--')between the taxonomic level are the lineage separator, which can be changedusing option lineage separator. The lineage column metadata name is fixed to"lineage" the taxon column name to "taxon". It is recommended that, if bothlineage and taxon are provided, the lineage includes the taxon itself as last entry,e.g. if the taxon is OTU-12345, the lineage could be Bacteria--Bacteroidetes--OTU-12345.When a matrix consists of taxa on the same level and the lineages are given inthe metadata, higher-level taxa rows can be automatically assigned by summinglower-level taxon rows. The option explore links between higher-level taxaenables automatic assignment of higher taxon rows from the lineages, which

Page 14: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

allows to compute correlations between higher-level taxa. This option howeverassumes that all taxa in the input matrix are on the same taxonomic level and thattaxonomic levels have different names (Beware: Actinobacteria is the name ofthe class and the phylum, so the name would need a modification such asActinobacteria_class to differentiate between class and phylum). It is advisable touse this option together with the parent-child exclusion filter. Note that if higher-level taxon assignment is combined with a group attribute, it is better to definethe groups via the metadata file and not to give two separate matrices.

3. Row combination filters. Specifying metadata allows to apply filters duringnetwork building that are listed here:

1. The parent-child exclusion filter prevents the formation of linksbetween taxa that have a parent-child relationship, such as inPasteurellaceae versus Pasteurella. This has an impact on the run-timeand the (non-edge-wise) p-value calculation. To apply this filter, alineage and a taxon attribute are expected (with metadata column names"lineage" and "taxon" respectively). The taxon is expected to correspondto the last entry of the lineage. For instance, one line in the metadata file(with a lineage and a taxon column) could look like this:

id1 Bacteria--Actinobacteria Actinobacteria

2. The Within-group relationships only filter can be applied to preventlinks between items belonging to different groups. It requires the firstgroup attribute to be set.

3. The Between-group relationships only filter can be applied to preventlinks between items belonging to the same group. It requires the firstgroup attribute to be set.

4. Features Features are additional rows that differ from the other rows of the inputmatrix. For instance, if the matrix describes count data of species, the featurescould capture physical properties such as the temperature or pH. In the network,feature nodes are displayed with another shape than the other nodes. Featuresshould be set separately, since they need to be excluded from some operations,such as column-wise normalization.The feature matrix is expected to be tab-delimited and to contain continuous datawhen the input matrix is of count or abundance type or of binary data when theinput matrix is an incidence (i.e. presence/absence) matrix. Otherwise, the rulesfor input matrix parsing also apply to feature matrix parsing, i.e. row and columnnames should not be purely numeric, special characters will be treated and linespreceded by # are interpreted as comment lines and skipped. It should not containnegative values, because these cannot be treated correctly by Kullback-Leiblerand other measures that interpret standardized abundances as probabilities (if youonly plan to use correlations, negative values are fine). In addition, constantfeatures (i.e. those that have the same value across all samples) should also beavoided. If transpose is enabled, the feature matrix is transposed, i.e. columnsare parsed as rows and vice versa. If match samples is enabled, feature samples

Page 15: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

are matched to input samples. Samples not present in the feature matrix will bediscarded from the input matrix and vice versa. In addition, samples in thefeature matrix are ordered in the same way as the input matrix. Thus, this optioncan be activated if the feature samples are in a different order than those of theinput matrix. Note that matching is carried out after transposing, if both optionsare enabled. Note that the feature matrix is expected to have the same samples inthe same order as the input matrix (after transposing), except when matching isenabled.

The following feature-specific filters are provided:1. Activating the first filter excludes features from association rule mining.2. Activating the second filter keeps only features in the network that are

supported by Spearman correlation.

Save output network

Saving the resulting network to a given location is an option that allows saving results informats not supported by Cytoscape. If no output file is selected, the network isvisualized in Cytoscape without being saved to a file.

To specify an output file location, select first a folder via button "Select folder", thenclick on the folder into which the output file should be saved and click "Choose". Theselected folder will be displayed in the text area below this button. Then type a file namein the text area below "Type file name". To undo the specification of an output file,remove the file name or push the button "Select folder" and then push Cancel.

Network formats

The GDL (GraphDataLinker) format is a custom format that stores networks along withtheir node and edge attributes. In contrast to all Cytoscape-supported formats, it preservesmulti-edges. GDL networks can be loaded into Cytoscape via a button in the main menu.The "tab" format is a tab-delimited custom format that consists of two parts: a node partlisting all nodes and their attribute values, one by line, and an edge part listing all edgesand their attribute values, one by line. Here is a small example:

;NODES genus

a Escherichia

b Yersinia

c Salmonella

Page 16: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

d Mycoplasma

;ARCS score

a b 1.1

b c 2.4

c a 3

The other formats are supported by the following tools:• VisML This is the input format of the network and pathway analysis platform

VisANT.• Dot This is the input format of the command line graph visualization software

GraphViz.

Output network properties

Each output network has a "Comment" and a "CALL" entry. The comment is a longstring that lists the details of network generation, whereas the CALL string precedes thecommand line call that was used to generate the network.

Page 17: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

Preprocess and filter input matrix

Input filtering

In some situations, it is useful to filter the input matrix. For instance, if the matrixrepresents bacterial abundances, noise may contribute much more than biological reasonsto variances in low-abundant bacteria abundances. In addition, low-abundant speciescontain many zero values, which are ambiguous: they mean either that the taxon does notoccur at all in the sample or that it was below detection limit. Zeros require a specialtreatment: either by "smoothing" (e.g. Witten-Bell smoothing) which tries to "guess"values from the remaining data or by pseudo-counts. If a row is dominated by zero entriesand missing values, it is best discarded. Below, a number of filters are listed that allowdiscarding problematic rows. These filters can also be combined. In this case, filters arecarried out in the order of their selection.

1. row_minsum The sum of the values of each row should be equal to or largerthan the specified minimum. To specify the minimum, write the value in the textfield next to the check box and press enter.

2. row_minocc Each row should have at least the specified number of non-zero,non-missing values.

3. col_minsum The sum of the values of each column should be equal to or largerthan the specified minimum. This is useful to prevent zero-sum columns and toimpose sufficient sequencing depth before down-sampling.

4. col_minocc Each column should have at least the specified number of non-zero,non-missing values.

5. minzeropairs Each row pair should have at least the specified number of non-double-zero value pairs when computing its similarity score. This filter avoidscomputing high scores based on double absences.

Keep sum of filtered rows

Instead of simply discarding filtered rows, sum them into a single row that is kept forfurther processing steps. Already filtered and summed taxon rows can also be providedby the user, with the row name: summed-nonfeat-rows This special row will be ignoredduring network inference. Note that features are excluded from summing. If groups areprovided, summing is carried out group-wise.

Standardization

If the columns of the matrix represent different samples, standardization over the samplesis necessary. A number of standardization strategies are available, which can be appliedto columns and/or to rows. Features are excluded from standardization. If a groupattribute has been specified (see data menu help), column-wise standardization strategies

Page 18: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

are carried out group-wise. Note that rows or columns with zero sum will be removedfrom the input matrix. If both filtering and standardization is specified, filtering is carriedout first, followed by standardization. If more than one standardization method isselected, standardization methods are carried out in the order of their selection.

1. col_norm Each column is divided by its sum, converting abundances in column-wise proportions.

2. col_downsample Each column is down-sampled such that its sum is equal to thesum of the column with the smallest sum. This option is only available for countmatrices. Note that down-sampling should be preceded by sufficiently stringentfiltering, since otherwise down-sampling will introduce many zero counts. Notethat all-zero rows will be removed after down-sampling.

3. row_stand Standardize rows such that each row x is transformed as follows:x_stand = (x-mean(x))/sd(x).

4. row_stand_robust Standardize rows as in row_stand, but estimate the mean bythe median and the standard deviation with the interquartile range normalized by(qnorm(0.75) - qnorm(0.25) ~ 1.35), where qnorm is the quantile function of thestandard normal distribution (robust standardization follows an R-script byJacques van Helden).

5. row_norm Each row is divided by its sum, converting abundances in row-wiseproportions.

6. row_downsample Each row is down-sampled such that its sum is equal to thesum of the row with the smallest sum. This option is only available for countmatrices. Note that all-zero columns will be removed after down-sampling.

7. log2 Take the logarithm to basis 2 of each matrix entry.

To incidence conversion

When an abundance or count matrix is given, but incidence matrix methods (such as thehypergeometric distribution) shall be applied as well to the input matrix, the matrix canbe converted into an incidence matrix for these methods. The following methods areavailable for conversion (features are excluded from the conversion and subsequentanalysis steps):

1. user Each matrix value equal to or above the threshold is considered as presence(and set to one) and each value below is considered as absence (and set to zero).

2. quantiles The row value equal to or above the given row-specific quantile isconsidered as presence (and set to one) and the row value below is considered asabsence (and set to zero). Quantiles should be set between 0 and 1. For instance,if the quantile was set to 0.75, a row value is considered as present only if it isequal to or above the third row quantile.

3. eval Each row value is transformed into a z-score by subtracting the row meanfrom it and dividing it by the row standard deviation. With the normaldistribution, each z-score is converted into a p-value, which is multiplied by thecolumn number to obtain a multiple-test corrected E-value. An E-value below orequal to the given threshold is considered as presence, else as absence.

Page 19: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

Output filtering

Edges can represent positive ("copresence") or negative ("mutual exclusion")relationships between the items. The output filter allows keeping only copresence or onlymutual exclusion edges.

Page 20: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

Select measures and specify theirthresholds

A set of methods to infer relationships between items can be selected. In order to select ameasure of interest, activate its corresponding check box.

Correlations

Correlation measures all have a range of [-1,+1], with -1 for the strongest negativerelationship (anti-correlation), 0 for the neutral case and +1 for the strongest positiverelationship. Briefly stated, Pearson is assuming a linear relationship between the data,whereas Spearman and Kendall are rank-based and thus do not assume linearity. It shouldbe pointed out that Pearson, Spearman and Kendall are not defined in case the standarddeviation is zero. Thus, if two taxa have constant abundances, they will not be linked bythese correlation measures. For large matrices, the computation of Kendall can take aprohibitive amount of time and should better be carried out on command line. For theformulas and in-depth discussions of Pearson, Spearman and Kendall, we refer to therespective Wiki pages: Pearson, Spearman, KendallThe cosine correlation represents the angle between two vectors and is the dot productdivided by the product of the vector norms. It is defined as: cosine(x,y) = sum(x_i * y_i) /(||x|| * ||y||)Note that correlations are sensitive to the so-called double-zero problem (Legendre &Legendre), that is a vector pair with many matched zeros will receive a higher score thanthe same vector pair without them. Since the interpretation of zeros is often ambiguous,this is a serious drawback when dealing with sparse data.In addition, correlation results may be strongly biased when applied to normalized data(see Aitchison 2003). Renormalization (see the randomization menu) can reduce thisbias.

Fisher's Z-score for correlations

The Fisher's Z-score allows computing p-values for correlations without randomizing thedata. It is computed following the implementation in QIIME. 50 samples arerecommended as a minimum to apply the Fisher Z-transformation. First, the correlationscore r is Fisher Z-transformed:z=0.5*log((1+abs(r))/(1-abs(r)))z=z*sqrt(n-3), where n is the sample numberNext, the p-value is computed from the Z-transformed correlation score as the area underthe normal distribution starting from z. Note that Fisher's Z-score cannot be used togetherwith randomization. To enable Fisher's Z-score, click the Fisher's Z-score check box.Edges with Z-score p-values above the given threshold p-value are discarded. This means

Page 21: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

that edges are filtered according to two criteria: the correlation threshold and the p-valuethreshold. P-values can be multiple-testing corrected by selecting a multiple-testingcorrection procedure from the Multiple-test correction box. The multiple-testingcorrection methods are explained here. Note that in contrast to Benjamini-Hochbergcorrection on p-values obtained from randomization, p-values for all allowed edges arecomputed and taken into account during multiple testing correction.

Similarities

Similarities range from 0 to 1 or infinity. The higher their score, the more similar are twoobjects.

• The Steinhaus similarity is the inverse of the Bray Curtis dissimilarity and isdefined asS(x,y) = 2W/(A+B), where A = sum of vector x, B = sum of vector y and W =sum of minimal values, i.e. min(xi,yi)It is bounded between 0 and 1.

• The mutual information has been defined and implemented in a variety ofways. However, different estimations of mutual information give different results(Fernandes & Gloor). The default in CoNet is a fast implementation of mutualinformation by Jean-Sebastien Lerat. A re-implementation of the ARACNEmatlab script, which depends on a parameter (the Gauss kernel width), is alsoavailable. With Rserve enabled, one can also select the implementation that ispart of the minet package in R (see config menu). When computing mutualinformation with minet or the default, abundance matrices or normalizedmatrices require a discretization step, which can be selected in the config menu.Note that the discretization step computes the bin number as the square root ofthe sample number, thus mutual information with discretization requires asufficiently large sample number. Whereas extreme values in most measures areindicative of co-presence at one end and mutual exclusion at the other, anextremely low mutual information (close to zero) signals independence, whereasstrong co-presence and mutual exclusion patterns receive likewise high mutualinformation values. Thus, when mutual information is combined with the "Topand Bottom" option explained below, the double number of top edges is returnedinstead of top and bottom edges. Note that for the same reason, mutualinformation cannot assign the interaction type.

• The variance of log-ratios was introduced by Aitchison and is a measure ofdissimilarity between two components across compositions, i.e. two speciesacross samples. It is defined as:D(x,y)=var(log(xi/yi))Aitchison suggested a transformation that scales the variance of log-ratiosbetween 0 and 1:S(x,y) = 1-exp(-sqrt(D(x,y)))This transformation converts the measure into a similarity, with 0 correspondingto a lack of proportional relationship and 1 perfect proportional relationship.

Page 22: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

Note that a pseudocount is added to deal with zero values. However, the varianceof log ratios depends on the selected pseudocount, thus variance of log ratioscomputed on different zero-containing data are only comparable if the samepseudocount was used.

• The distance correlation (also known as Brownian correlation) is a similaritymeasure in the range of 0 to 1. In contrast to correlation measures, 0 flagsstatistical independence. Like mutual information, an interaction type cannot beassigned with this measure. For this reason, if this measure is used together withthe "Top and Bottom" option, the double number of top edges is returned. Thedistance correlation has been implemented following the implementation in the Rpackage energy (DCOR). For more information, see the Wiki entry.

• The Hilbert-Schmidt independence criterion (HSIC) is another generalmeasure of dependency (see this Wiki entry). As in the case of mutualinformation and distance correlation, no interaction type can be assigned and theoption "Top and Bottom" returns simply the double number of top edges. AHSIC of zero flags independence, and the higher the HSIC, the stronger thedependence.

Dissimilarities

Dissimilarities range from 0 to 1 or infinity. The higher their score, the more dissimilarare two objects. A distance (or metric) has to fulfill the following criteria: 1) It shouldnever be negative, 2) It should be zero only if objects are identical, 3) It should besymmetric, e.g. d(x,y)=d(y,x) and 4) It should fulfill the triangle inequality (d(x,z) <=d(x,y)+d(y,z)). Dissimilarities do not meet the fourth criterion.

• Euclidean distance is defined asD(x,y) = sqrt(SUMi(xi - yi)^2) and goes from 0 to infinity.Each row is divided by its sum, such that it adds up to one.

• Bray Curtis dissimilarity This is the inverse of the Steinhaus similaritydescribed above and is bounded between 0 and 1. The formula is: D(x,y) = 1 -2W/(A+B), where A = sum(x), B = sum(y) and W = sum(min(xi,yi))Each row is divided by its sum, such that it adds up to one.

• Hellinger distance is computed as follows:

D(x,y) = (1/sqrt(2))*sqrt(SUMi=1(sqrt(xi) - sqrt(yi))2), where x and y are

supposed to sum to oneEach row is divided by its sum, such that it adds up to one. The Hellingerdistance is closely related to the Kullback-Leibler divergence.

• Kullback-Leibler dissimilarityD(x,y) = SUMi=1(xi*log(xi/yi)+yi*log(yi/xi)), where x and y have beenstandardized to sum to one.Note that a pseudocount is added to deal with zero values.

• Jensen-Shannon dissimilarityD(x,y) = 1/2*SUMi=1(xi*log(xi/M) + yi*log(yi/M)), where M = (xi+yi)/2 and

Page 23: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

where x and y have been standardized to sum to one.Note that a pseudocount is added to deal with zero values.

Incidence methods

In case the matrix is of type "incidence" (i.e. only contains presence/absence values) or aconversion to the incidence type has been specified in the preprocessing menu, a numberof methods specific to incidence matrices are available.

• The hypergeometric distribution is often used to compute the p-value of theoverlap between two sets, taking into account the set sizes. The one-tailedversion of the hypergeometric test is also known as Fisher's exact test. Here, thep-values of the hypergeometric distribution are multiple-test adjusted using theE-value correction (that is multiplication of p-values with the test number) andthen converted into significances (sig=-log10(p-value)). The higher thesignificance, the smaller the likelihood that the predicted edge is due to chance.Thus, the significance behaves like a similarity measure. Strictly speaking, it isnot a similarity, since significances can be negative, but the significancethreshold is usually set to zero or larger.

• The Jaccard distance also measures set overlap and is defined for two sets Aand B as:D(A,B) = 1 - size(intersection(A,B))/size(union(A,B))The Jaccard distance ranges from 0 to 1 (see Wiki entry).

• Association rule mining (not supported for Windows) enumerates logicalrules in presence/absence data sets. These rules can span any item number, butdue to multiple-testing and run-time issues, one should restrict association rulemining to low item numbers. Numerous filters have been developed to retainonly rules of interest. CoNet wraps Christian Borgelt's implementation of theapriori algorithm (Agrawal et al.). For the meaning of the filters, we refer usersto the documentation of apriori. By default, if confidence is not selected as afilter, it is set to 95, whereas support (if not selected) is set to 10. Usingassociation rule mining requires apriori to be installed (see configuration).Randomizations are carried out on the values of the method selected as secondfilter. Note that association rule mining is currently the only method that returns adirected network.

Network inference with Minet

Minet is an R package that implements three popular network inference algorithms usedin genetic regulatory network inference, namely CLR (Faith et al.), ARACNE (Margolinet al.) and MRNET (Meyer et al. 2007). These algorithms work on the basis of asimilarity matrix, where the similarity is usually mutual information. Note that using acorrelation measure instead requires the data to be normally distributed. Using minetrequires Rserve to be enabled. Minet's strategies for mutual information estimation and

Page 24: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

discretization (required for abundance or normalized matrices) can be set in theconfiguration menu. Note that by default, mutual information is not computed in minet,but using ARACNE's algorithm. The default can be changed in the config menu.

Threshold setting

In order to set method-specific thresholds, two options are available:• Manually: The user can set manually the threshold via the sliders and text fields

next to the measures. Text fields are up-dated upon slider movement and viceversa. In this mode, only one threshold per method can be set, e.g. it is notpossible to specify a lower and upper threshold of -0.4 and 0.6 for Pearson.

• Automatically: Click the "Automatic threshold setting" button on the bottom ofthe Methods menu to open the Threshold setting menu. The user can specify acertain edge number or quantile in the text field below the "Edge selection"choice. CoNet will then set thresholds automatically such that the correspondingedge number is included in the output network. The quantile is expected to be inthe range of 0 to 1. For example, if the quantile is set to 0.05, the threshold willinclude the 5% top-scoring edges.If the edges with lowest scores should be included as well, activate the Top andbottom check box. This option is of interest to capture exclusion for measuresother than correlation and likely results in different thresholds for the upper- andlower-ranking edges, only one of which is displayed in the manual thresholdsetting field next to the method. This option will be ignored for measures thatcannot support it (such as mutual information). For these measures, an equivalentnumber of top edges is selected instead. Make sure to disable "Top and Bottom"and to clear the edge selection parameter to disable automated threshold setting.If Force intersection is enabled, thresholds are set such that resulting networkhas the selected number of intersection edges. This option can be used incombination with network merge strategy "intersection", to obtain an intersectionnetwork of given size (see Merge menu help). Note that this option enforces theensemble approach stringently, i.e. it will discard edges strongly supported byonly one method. "Force intersection" is therefore best applied with a largerinitial edge number.Automatically set thresholds can be optionally saved to a file called"thresholds.txt" in a user-selected folder. To select the folder, click the button"Select folder", click on a folder in the file tree and then click "Choose". In thefolder selection mode, files in the file tree are not clickable. The threshold file,besides the thresholds, also contains a correlation matrix with Pearsoncorrelations between the full score distributions of the selected methods, statisticson the full score distributions as well as the correction factor and degree offreedom for Brown's method.Note that it is always possible to store the thresholds as part of a CoNet settingsfile (see Settings loading/saving) after completion of the network computationtask. With such a settings file, there is no need to recompute the thresholds.

Page 25: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

However, the guessing parameter should be deleted (such that the field isempty) before the current settings are saved or the threshold guessing line inthe settings file should be removed, to avoid recomputing the thresholds.When loading lower and upper thresholds from a settings file, only one of themwill be displayed for each measure in the interface, however both will be appliedduring network inference.

Visualize score distributions

Score distributions of selected measures can be visualized by selecting an export folderand activating the "Export distribution" check box in the Threshold setting menu. Toselect an export folder, click the "Select" button and click on a folder, then click"Choose". The score distributions will be exported into a single pdf file called"score_distributions.pdf" in the selected folder. When this option is activated, pushing the"GO" button in the main menu will not result in a network, but instead the user is askedwhether to open the pdf file with the distribution plots. Pdf files are displayed using theAdobe Acrobat Viewer. In case the user confirms opening the pdf file, Acrobat Viewerasks whether the user accepts the License Agreement and upon acceptance displays thepdf file. When score visualization is enabled, previously set thresholds are lost.

Note that threshold guessing and score visualization are not supported for minet,hypergeometric distribution and association rule mining.

Page 26: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

Merge networks and edge scores

To combine the output of several methods, different strategies can be considered, whichare listed below. Independently of the merge strategy choice, the method(s) supporting alink between two items are always listed in the resulting network as edge attribute values.

Multigraph

When the "multi-graph" checkbox is enabled, outputs of different methods are notcombined, but displayed in a multi-graph, i.e. if more than one method supports a linkbetween two items, those items are linked by more than one edge. If "multi-graph" is notenabled, the default score merge strategy is carried out instead. "Multi-graph" is currentlyby default enabled.

Score merge

When different methods support the same item pair, their scores can be combined intoone score in different ways. Thus, the score merge is carried out edge-wise. It is onlycarried out when more than one network inference method has been selected. Note that incase of restoring networks from randomization files, the contributing methods and theirindividual scores cannot be restored.

1. scoremethodnum Simply put the number of supporting methods as score.2. scoremean First, distances (D) are converted to similarities (S) by: S =

maximum_observed_value - D. Each method-specific score is then convertedinto a percentage with respect to its maximal observed score, even for boundedmeasures. Finally, the combined score is obtained as the mean over the modifiedmethod scores.

3. scoresum As for scoremean, but instead of the mean the sum is computed.

Network merge

The results of the different measures can be considered as alternative networks, whichcan be merged in various ways (with and without enabling "multi-graph"). The followingoptions are available:

1. union The network union keeps all edges contributed by all methods.2. intersection The network intersection keeps only edges supported by all

methods.3. separate Separate networks are instantiated for each method. If this option is

selected, no score merging is carried out.4. majority Only edges supported by a majority of methods are kept.5. minsupport An edge is only kept if supported at least by the specified method

number (see input field below). The default minimal support is 2.

Page 27: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

The minimum number of methods that should support a link is only taken intoaccount if the network merge strategy is set to minsupport.

Page 28: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

P-value assignment and multiple-testingcorrection

Background

For a matrix with n rows and a symmetric measure, n*(n-1)/2 tests are carried out duringnetwork inference. Thus, with increasing row number, it is more likely to find significantrelationships by chance alone. There are various approaches to correct for this multipletesting issue, most of which rely on the adjustment of p-values. The p-value describes theprobability that a relationship is inferred due to chance, thus the smaller the p-value, thehigher the significance of a relationship. P-values can be computed from a random scoredistribution. In order to calculate such a "background" edge score distribution given anull hypothesis, random data are generated a selected number of times.

Random score distribution options

Randomization routine

First, the randomization routine needs to be chosen. The default "none" disables anyrandomization test. The "edgeScores" routine generates a set of edge-specific scoredistributions, whereas the routine "lallich" is an implementation of BS_FD (bootstrap-based false discovery) routine and is recommended for association rule mining. Thisroutine takes a false discoveries number parameter, which is set to one. Note that the"lallich" routine ignores the resampling parameter (which is fixed to bootstrap for thisroutine) and requires the significance level as parameter (which can be provided via thep-value threshold, e.g. 0.05). In addition, it does not keep bottom edges (i.e. edges withextremely low similarity or high distance scores, see option "Top and bottom" in themethods menu). Note also that the "lallich" routine computes global score distributionsfor each method. In contrast to the edge-specific routine, the "lallich" routine does notassign p-values to the edges. Instead, it adjusts the original thresholds and discards edgeswith scores below the adjusted thresholds. The adjusted thresholds are stored in thenetwork comment attribute. Since "lallich" does not assign p-values, the multiple-testingcorrection options are not applicable.Note that only the "edgeScore" routine allows computing edge-specific p-values. The p-values are one-sided, that is a very high p-value is indicative of mutual exclusion (e.g. fora dissimilarity, the score was much higher than expected at random). P-values above 0.5are converted by subtracting the p-value from one prior to multiple test correction.However, if the p-value does not agree with the initial interaction type (e.g. a high p-value for a co-presence edge), the edge is discarded. P-values are computed as follows:

• For bootstrap, the p-value of the expected null value (e.g. 0 for Pearson) underthe normal distribution is computed, using the mean and standard deviation of

Page 29: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

the bootstrap distribution as parameters of the normal distribution. In short, wecompute how likely it is to observe the null value given the distribution aroundthe observed score. In case the measure has no proper null value (which is thecase for unbounded measures), the null value is obtained by shuffling the vectorpair.

• For shuffling (permutation), the p-value is computed distribution-free bycounting the number of times the observed edge score is smaller/larger(depending on the measure) than the random edge scores, using the formula p-value = (r+1)/(n+1), where n is the number of randomized scores and r thenumber of randomized scores smaller/larger than the observed score.

• If permutation and bootstrap are combined (see "Loading null distributions"),the p-value of the null value under the normal distribution is computed, using themean and standard deviation of the bootstrap distribution as parameters of thenormal distribution. The null value is the mean of the null distribution.

Resampling

By default, the matrix is randomized by permuting the order of columns row-wise("shuffle_rows"), which preserves the row sum. Alternative re-sampling procedures arethe permutation of row order column-wise ("shuffle_cols"), to permute rows as well ascolumns ("shuffle_both"), or to bootstrap.

Iteration number

The iteration number specifies how often the randomization routine is carried out, i.e.how many values the random score distribution contains.

P-value merging

If the computation of a multi-edge network and edge-wise randomizations are enabled,method-specific p-values of multi-edges connecting the same node pair can be mergedusing the strategies listed below. Note that p-value filtering and multiple-test correction,if selected, are carried out after the merge, i.e. on the merged p-values.

• Brown Brown's method works like Fisher-merge (see formula below), but

divides the result by a correction factor before calculating the Chi2 p-value. Thecorrection factor takes the correlation of measure scores into account, see Brown1975 for details. The correction factor is calculated when calculating thresholdsfor a given edge number or quantile, so Brown's method cannot be used withmanually set thresholds.

• Fisher-merge Formula: Chi2=-2 SUMi=1k(log(pi))

The p-value of the merged edge can then be computed from a Chi2 distributionwith 2*k degrees of freedom. See the Wiki entry for more information onFisher's method.

Page 30: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

• Simes Keep the minimum p-value.• mean Keep the mean of the p-values.• geometric_mean Keep the geometric mean of the p-values.• median Keep the median of the p-values.• max Keep the maximum p-value.

Note that p-values of edges filtered out initially are missing for the p-value merge. InBrown's and Fisher's method, these missing p-values are implicitly set to one, but theyare not considered in any other p-value merge method. The best is to enable the Forceintersection option in the Threshold setting sub menu of the Methods menu when using ap-value merge option.

Renormalization

This option is recommended when the input matrix is normalized column-wise andcorrelation measures are selected. Renormalization (Sathirapongsasuti*, Faust* et al.,PLoS Comp Bio 8(7): e1002606, 2012) is a strategy that has been empirically shown tocounter the compositionality bias. The compositionality bias affects correlation measuresand is introduced when normalizing across samples (see Aitchison). The principle is torepeat sample-wise normalization for each item pair in each randomization round, suchthat the effect of all other items on this pair are captured. Renormalization is onlycompatible with the "shuffle_rows" resampling routine. Attention: Renormalization isvery time-consuming and best run on command line. It is not applied to measures knownto be robust to compositional bias (Bray Curtis, Hellinger, Kullback-Leibler, Variance oflog-ratios) or to incidence matrix measures. Note that association rule mining cannot becombined with renormalization. Renormalization is only possible for column-wisenormalization by column sum division. Note that features are excluded fromrenormalization. If groups were specified, renormalization is carried out group-wise.When combining renormalization with pre-processing, permutation and renormalizationis carried out on the matrix resulting from all of these pre-processing steps. Withoutrenormalization, resampling is carried out on the original input matrix and thepreprocessing steps are repeated for each iteration.Permutation with renormalization combined with bootstrapping corresponds to the ccrepep-value computation method.

Variance pooling

When permutation and bootstrap distributions are combined, their variances can bepooled to account for different iteration numbers using the following formula: pooled_sd= sqrt((var(permutation) + var(bootstrap))/2), which computes the mean variance of thebootstrap and permutation distribution. The standard deviation of the pooled varianceinstead of the bootstrap standard deviation is then used to compute the combined p-value.To enable variance pooling, activate the "Pool variances" check box.

Page 31: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

Discarding unstable edges

Some edges might be only significant due to outliers and have much lower scores oncethe outliers are removed. Edge-specific bootstrapping results in a confidence interval foreach edge. If the observed edge score is not within the 2.5 and 97.5 percentile of thebootstrap distribution, it can be considered as exceptionally high (due to outliers) and theedge is removed.

P-value options

Once p-values have been obtained from the randomization routine, they can be multiple-test adjusted using various approaches. For each approach, a lower and optionally anupper threshold for the edge p-value can be applied. All edges with p-values above thelower threshold (defaults to 0.05) are discarded.

• E-value The E-value is calculated by multiplying the p-value with the number oftests. An E-value of one means that one predicted relationship is expected to bedue to chance. The E-value is converted into a significance by computing -log10(E-value).

• Bonferroni The Bonferroni-corrected p-value is obtained by dividing thethreshold p-value by the number of tests and retaining only those p-values thatare below the adjusted threshold.

• Benjamini-Hochberg The method by Benhamini and Hochberg (BH) adjusts thefalse discovery rate (FDR), which controls the expected number of false positivesamong all significant relationships. Note that BH is computed only on theinitially selected edges (instead of on all allowed edges) and may therefore beover-optimistic. When applied to multigraphs, BH is carried out with theYekutieli correction for dependent p-values. The dependency arises becausemultiple measures may support the same relationship. For multigraphs withmerged p-values or normal graphs, BH is carried out without dependencycorrection. See this Wiki entry for more details.

Saving and re-using randomized scores

Since the repeated computation of random scores can be extremely time-consuming, it ispossible to save the originally inferred network along with its random scores. To do so,select a folder by clicking the "Select folder" button on the bottom left side of the menu("Save"). This will open the file tree, where you can click a folder and then click"Choose". Below the folder selection button, there is a text field "Specify file name",where you can input the name of the file into which scores are to be saved. Then click the"Save randomizations to file" checkbox.The network can later be restored by selecting the file with the randomization results viathe "Open file" button in the middle right of the menu ("Load" box). To clear a selectedrandomization file, click the "Open file" button again and push "Cancel".

Page 32: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

Note that the CoNet settings should match the settings used to generate the randomizationfile (except for the p-value thresholds or metadata). For saving settings, click the"Settings loading/saving" button in the main menu.Loading a network from a randomization file requires at least the following settings:

• the input matrix• the randomization routine used to generate the data• the resampling method• the selected measures and for the "lallich" randomization routine their thresholds• the p-value threshold• if initial thresholds were set with "topbottom" enabled (see Methods menu): the

upper p-value threshold• if selected during randomization: multigraph and p-value merge strategy

and the following options can be enabled:• if the resampling method is bootstrap, unstable edges can be filtered• if a null distribution has been generated with permutation, it can be loaded

together with the bootstrap distribution (see below)• multiple-test correction of p-values

Loading null distributions

Optionally, edge-specific p-values can be computed from two distributions: the nulldistribution generated by shuffling and the bootstrap distribution, from which aconfidence interval can be derived. The p-value is then obtained from both distributionsas described above. In order to compute p-values by combining two randomizations, thefollowing steps are necessary:

1. Configure and run CoNet to compute the null distribution (using one of theshuffling options and, optionally, renormalization). The null distribution shouldbe saved to a file as explained above (using the "Save" box).

2. Configure and run CoNet to compute the bootstrap distribution and to setcombined p-values. The previously calculated null distribution can be set in the"Load null distributions" box.

3. For future runs, both distributions can be reloaded; the null distribution via "Loadnull distribution" and the bootstrap distribution via "Load randomization file".Thus, there is no need for re-computing any of these distributions when changingthe p-value treatment.

Page 33: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

Configure CoNet

CoNet by default runs without Rserve, but enabling Rserve allows the usage of moreadvanced implementations of some measures, especially the minet package for mutualinformation.

Rserve configuration

By default, Rserve is disabled. Enable it by activating the "Enable Rserve" check box(assuming you have installed and started the Rserve server outside of CoNet). The hostand port of the Rserve server can be specified in the host and port text fields.

Rserve installation

Rserve is a server that allows to connect to R from within other programming languages.If you do not have access to an Rserve server, you can install and run it locally on yourmachine as follows:

1. Start R (it can be obtained from http://www.r-project.org/)2. In the R console, type:

install.packages("Rserve")

3. After installation of the Rserve package, load it with:

library(Rserve)

4. Then start the Rserve server with the command:

Rserve(args="--no-save")

5. Rserve is now running locally on your machine with default parameters(host=127.0.0.1, port=6311)

More detailed information on Rserve installation is given here.

Implementation selection

For mutual information (MI), the choice of implementation affects the results. Threeimplementations are offered, the default is the implementation by Jean Sebastien Lerat(option jsl). Alternatively, MI can be computed with a re-implementation of theARACNE implementation (the original Matlab code is available here) which has oneparameter, the standard deviation of the Gaussian kernel (which can be set as one of theglobal constants). However, ARACNE cannot deal with missing values. Last but not

Page 34: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

least, MI can be computed using the minet package in R. This requires Rserve to beenabled. Minet is by default run with the "mi.shrink" estimator. Discretization strategyand estimator can be selected in the Minet settings.

Association rule mining binary

CoNet wraps the apriori association mining algorithm developed by Agrawal et al. andimplemented by Christian Borgelt. To use apriori in combination with CoNet, pleasedownload it from here and indicate the directory in which you placed the binary. This canbe done by clicking "Select folder", then click the folder of choice and finally click"Choose". Alternatively, you can add the binary to your path environment variable (/usr/bin or /usr/local/bin). Note that association rule mining is not supported for Windows.

Missing value treatment

Missing values can either be ignored (the default) or treated by omission("pairwise_omit"). In the latter case, if a score is computed between two matrix rows, allvalue pairs that contain a missing value are omitted. Non-pairwise computations such asthe ARACNE implementation of mutual information do not handle missing values andcan therefore not used in combination with "pairwise_omit". Pair-wise omission ofmissing values may shorten the vectors to such an extend that the computed score ismeaningless. To prevent this, a minimum number of missing-value free pairs can bespecified, which should however not exceed the number of columns in the input matrix.

Global constants

The Gauss kernel width is a parameter of the ARACNE implementation of mutualinformation. The pseudo-count is added to zeros in measures like Kullback-Leibler andJensen-Shannon dissimilarity to prevent negative infinity to occur. Note that the varianceof log ratios and to a lesser extent Kullback-Leibler are sensitive to the value of thepseudocount, thus the same pseudocount needs to be selected to compare their scoresacross different data sets. However, pseudocounts are adjusted automatically when theyare larger than the smallest non-zero value in the matrix. Thus, it is better to compareKullback-Leibler and variance of log ratio networks not on the score level, but on the p-value level.

Mutual information settings

Set minet default value for mutual information estimation (see the minet R packagedocumentation for details). In addition, the discretization method for both minet andmutual information as implemented in Jean Sebastien Lerat's library can be selected. Theselected discretization method is only applied if the matrix is an abundance matrix or if a

Page 35: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

count matrix has been transformed into a continuous matrix by a normalization step. Thenumber of bins is always determined by taking the square root of the column number.

System

LoggingThe log level determines the verbosity of CoNet, ranging from "fatal" (only messagesleading to a crash are recorded) to "debug" (most verbose level). CoNet logs to the"output.log" file in Cytoscape's root folder. By default, the "fatal" level is enabled to keepthe log file size small. In case errors occur, the log level can be increased.

Speed-upFrom version 1.0b1 onwards, CoNet comes with a major re-implementation of its core tomake it faster. However, it is possible to switch back to the old implementation bydisabling the speed-up. The re-implementation covers the measures Pearson, Spearman,Kendall, Hellinger, Euclid, Steinhaus, Bray Curtis, Kullback-Leibler and variance ofLog-ratios and adds the Hilbert-Schmidt independence criterion (which is not availablewith the old core). The re-implementation also includes renormalization for thesemeasures. When speed-up is disabled, the following implementations are used:Spearman Apache commons.mathPearson JSC (java statistical classes)Kendall JSC (java statistical classes)Hellinger difference in formula: normalization factor 1/sqrt(2) not used

For Kullback-Leibler, the score range (but not the edge ranks) changed for versions 1.0b1to 1.0b3, such that old threshold files cannot be re-used with these versions. However,from version 1.0b3.1 onwards, the Kullback-Leibler score range is the same as for thealpha version. From version 1.0b1 to 1.0b5, the re-implemented Spearman did not treatties correctly. In addition, it gave wrong results when run with missing value treatmentenabled. This has been fixed from 1.0b6 onwards. From version 1.0b2 onwards, mutualinformation and Jensen-Shannon dissimilarity are also available (with the old core, onlythe ARACNE implementation of mutual information is available and Jensen-Shannon isnot supported). Pairwise count/abundance measures not covered by the speed-up, whichinclude logged Euclidean and Chi-Square distance, are removed. Note that the speed-updoes not cover any of the incidence measures.

Note on reproducibilityNetworks constructed with the beta and the alpha version of CoNet (using Kullback-Leibler, Spearman, Pearson and Bray Curtis as well as the full pipeline, e.g. p-valuescomputed from both renormalized permutations and bootstraps) do not overlap perfectly.The Jaccard indices of edge overlap vary, but are typically between 0.5 and 0.6 (theJaccard index ranges from 0 for no overlap to 1 for 100% overlap). The main reason forthis difference is the treatment of pseudocounts needed for Kullback-Leibler: in the beta

Page 36: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

version, it is ensured that pseudocounts are not larger than 1/100th of the smallest valuein the matrix. Networks also vary slightly when re-computed with the same version ofCoNet, if p-values are computed by randomization (typically this variation should not bemuch larger than 5% in terms of Jaccard index of edge overlap). To reproduce resultsfrom the alpha version, the alpha version can be downloaded from CoNet's web page(click the About button for an active link to CoNet's web page).

Page 37: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

Load or save CoNet settings.

This menu allows you to save your current CoNet settings, so if you want to re-do anexperiment after having closed CoNet, you do not need to click again all the buttonsinvolved.

Load CoNet settings

Select a configuration file by clicking on "Select file". Then, click the "Apply settings inselected file" button to replace the current CoNet setting by the setting in the file. Allprevious settings will be lost!

Save CoNet settings

Clicking on "Select folder" will open the file browser. Select the folder into which youwant to save the settings file by clicking on it, then click "Choose". The path of theselected folder will appear below the "Select folder" button. Finally, enter the name of thesettings file in the text field below "Enter file name". Clicking the "Save current settingsto file" button will save the current settings to the file of chosen name in the selectedfolder.

Restore CoNet default settings

If you want to start "from scratch", click the button "Set default", and all parametervalues in CoNet will be re-set to their default values. All previous settings will be lost!

Page 38: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

CoNet on command line

Using CoNet on command line - step by step tutorial

When a high number of iterations is requested (especially with renormalization enabled),CoNet is best run on command line. To ease the command line call of CoNet, theCytoscape plugin allows generating the command line call from the current settings.

Prerequisites

This tutorial demonstrates how to run CoNet on command line. It assumes that you havedownloaded and unzipped the CoNet.zip file. It also assumes that you have Java RuntimeEnvironment 1.6 or higher installed on your machine (if Cytoscape 2.8 runs on yourcomputer, you don't need to worry about this).

Optional: Set the Classpath

Optionally, you can add CoNet.jar located in the lib folder of CoNet to your class path.Java will find the jar by looking up this class path variable. Check this tutorial on how todo this on different systems. Example for MacOS and UNIX-based systems (oncommand line):

export LIB=/Users/me/Documents/CoNet/libexport CLASSPATH=${CLASSPATH}:${LIB}/CoNet.jar

Example for Windows (in command prompt window):

set CLASSPATH=%CLASSPATH%;C:\Users\me\Documents\CoNet\lib\CoNet.jar;

Tutorial steps

1. Start Cytoscape and open the CoNet plugin. Load the demo settings (the file islocated in the demo folder of the CoNet directory).

2. Open CoNet's Data menu and select "Costello_2009_oral.txt" located in thedemo folder as input matrix.

3. Click the "Generate command line call" button in CoNet's main menu. This willopen another window with some text in it.

4. Copy the line(s) of the text starting with "java". These lines represent thecommands you want to carry out. They call CoNet's command line version withthe parameters you have selected or loaded in the Cytoscape plugin.

5. If you did not set the class path, paste the command into an editor and replace"java" by "java -cp C:\Users\me\Documents\CoNet\lib\CoNet.jar" in Windowsand by "java -cp /Users/me/Documents/CoNet/lib/CoNet.jar" in MacOS/UNIX.

Page 39: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

For either OS, please do not copy the example paths, but give the path in whichyour CoNet.jar is located. Also note that the -cp option is placed after "java" butbefore the call to the CoNet main class, like this: "java -cpPATH_TO_MY_CONET_JARbe.ac.vub.bsb.cooccurrence.cmd.CooccurrenceAnalyser"Copy the modified command.

6. Open a command shell. In Windows, you can do so by clicking "Start", then"Run...". A window will open into which you write "cmd". This will open thecommand prompt window. Alternatively, you can also open the "Commandprompt" program located in "Accessories". In MacOS, you can open the terminalapplication located in /Applications/Utilities.

7. In the command shell, go to the demo folder of the CoNet directory using thecommand "cd". Example for MacOS:

cd /Users/me/Documents/CoNet/demo

Example for Windows:

cd C:\Users\me\Documents\CoNet\demo

8. Paste the command. On MacOS/UNIX, you can add an ampersand (&) at the endof the command to send it to the background.

9. Push Enter to start the execution of the command.10. After a few seconds, a network file starting with "cooccurrence" and ending with

".gdl" should have been generated in the demo folder.11. You can load this network into Cytoscape by selecting it using the "Load" button

in CoNet's main menu and then pushing "GO".

Command line tips and tricks

Command line help

You can get a short and a long version of the command line help. For the short version,please type:

java be.ac.vub.bsb.cooccurrence.cmd.CooccurrenceAnalyser -h

For the long version, type:

java be.ac.vub.bsb.cooccurrence.cmd.CooccurrenceAnalyser -H

Runtime memory

You can increase java runtime memory with option -Xmx. Example:

Page 40: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

java -Xmx2000M be.ac.vub.bsb.cooccurrence.cmd.CooccurrenceAnalyser -h

Alias

In MacOS/UNIX, you can set an alias to shorten the CoNet call on command line. Forthis, add the line below to your bash shell configuration file:

alias conet="java be.ac.vub.bsb.cooccurrence.cmd.CooccurrenceAnalyser"

Then you can call the program as:

conet -h

Running CoNet on command line with a configurationfile

Some options of the CoNet Cytoscape plugin are given to CoNet via a configuration file.These concerns the following options (the corresponding command line option is given inbrackets):

• RserveHost (host)• RservePort (port)• LineageSeparator (lineage_separator)• MiImplementation (mi_implementation)• NoRserveDependency (no_rserve)• PseudoCounts (pseudocount)• Poolvar (poolvar)• DisableSpeedup (disable_speedup)

In addition, a number of other variables can be set via the configuration file (check thecommand line help for more details).Here's an example for a configuration file:

############ CONET CONFIG ############## rserve configrserve_host=127.0.0.1rserve_port=6311# phylogenetic lineagelineage_separator=--# mutual information computationmi_implementation=minetminet_r_batch=true# access to Rno_rserve=falseno_r=false# p-value computation

Page 41: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

poolvar=false# speed-up disableddisable_speedup=true

This configuration file indicates that CoNet makes use of Rserve at the specified host andport. It then specifies the lineage separator string (which is needed for taxon metadataspecifying phylogenetic lineages). Furthermore, it enables mutual informationcomputation in minet and, since R is needed for minet usage, allows calls to R viacommand line and via Rserve. In addition, the configuration switches off the CoNetspeed-up, so that CoNet uses the previous (slower) implementation of various similaritymeasures (disable_speedup).

Example for CoNet with configuration file

Here is an example of running CoNet command line with a configuration file. Theexample assumes that Rserve is running and minet is installed in R. Please copy thecommand into one line.

java be.ac.vub.bsb.cooccurrence.cmd.CooccurrenceAnalyser-Z CoNetConfig.txt --method ensemble--input /CoNet/demo/Costello_2009_oral.txt --matrixtype abundance--ensemblemethods correl_pearson/dist_bray/dist_kullbackleibler/sim_mutInfo --minetdisc equalfreq--format gdl --nantreatment pairwise_omit--nantreatmentparam 5 --networkmergestrategy union--stand col_norm --multigraph--ensembleparams correl_pearson~upperThreshold=0.72/correl_pearson~lowerThreshold=-0.6/dist_bray~upperThreshold=0.89/dist_bray~lowerThreshold=0.25/dist_kullbackleibler~upperThreshold=7.16/dist_kullbackleibler~lowerThreshold=0.45/sim_mutInfo~lowerThreshold=0.6--filter row_minocc --filterparameter 5.0--output cooccurrenceNetworkDemo.gdl

where CoNetConfig.txt is a file located in the directory where the command is carriedout. CoNetConfig.txt has the following content:

############ CONET CONFIG ############## rserve configno_rserve=falserserve_host=127.0.0.1rserve_port=6311# mutual information computation with minetmi_implementation=minet

Page 42: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

Advanced users: Submitting CoNet jobs to a SGEcluster (MacOS/UNIX)

CoNet can send jobs to a SunGridEngine (SGE) cluster (command line option -b). Thisfeature is needed when thousands of permutation iterations need to be carried out. Anumber of options in the configuration file allow to manage the cluster submissioncapabilities of CoNet (see list below). Most important is the job_num option, whichdefines how iterations are split into jobs. For example, if 1,000 permutations need to becarried out, a job_num of 100 will result in the submission of 100 jobs, each run with 10iterations. In consequence, 100 temporary score files will be created (having 10 lineseach), which will be merged into a final randomization score file after completion of alljobs. When submitting jobs to a SGE cluster, the lib_dir, jar_file, temp_dir, queue,job_num and memory options should all be set, and it is advisable to set bothkeep_tmpscores and no_rserve to true. Here's an example configuration:

############ CONET CONFIG #############no_rserve=trueno_r=true####### cluster configuration# location of the CoNet jarlib_dir=/path/to/my/conet/lib/folder/# name of the CoNet jarjar_file=CoNet.jar# temporary score files are stored heretmp_dir=/path/to/my/temp/folder# SGE queue namequeue=all.q# memory allocated to java (via option -Xmx)memory=4000# number of jobs into which iterations are splittedjob_num=100# keep the temporary score fileskeep_tmpscores=true# dry run: test run that does not submit jobs to the clusterdry_run=false# do not keep launcher scriptskeep_scripts=false

When the CoNet command exits before completion of the jobs, temporary score files canbe concatenated after completion of the jobs using command line option --restorefromscorefolder

Page 43: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

List of cluster options

Here is the list of all cluster-related options, to be set via the configuration file.• lib_dir = location of the CoNet jar file (not including the jar file name)• jar_file = the name of the jar file• tmp_dir = location of temporary score files• keep_tmpscores = if true, temporary score files are kept in the temp directory• job_num = the number of jobs submitted to the cluster (the number of iterations

run in each job is determined automatically from the number of requestediterations)

• queue = the name of the queue• memory = the memory in MB allocated to the job (defaults to 1000)• user_cmd = a bash command carried out before the job is started (e.g.

user_cmd=module load java). The user command can include additional SGEdirectives (e.g. user_cmd=#$ -m ae -M [email protected]). Severalcommands can be given by emulating the new line separator with three questionmarks (e.g. user_cmd=. /etc/profile.d/modules.sh???module load java).

• launch_dir = directory in which cluster submission is launched (defaults to thecurrent directory)

• keep_scripts = if true, job submission scripts are kept• dry_run = if true, job submission scripts are created but not launched• quiet_run = if true, output and error streams will be directed to /dev/null, so the

creation of job-specific log files is suppressed

Advanced users: Parallelizing CoNet (MacOS/UNIX)

CoNet randomization can be split in several jobs to speed it up. Conet supports thisparallelization in two ways: SGE-cluster submission (see above) and user-madewrappers. If you would like to run several CoNet jobs in parallel using your ownwrapper, the following options are of interest to you: You can specify -f with value"randscore". This causes CoNet to write random scores instead of a network to theoutput. The original network can be provided via option -I. If it is provided (and -f is setto randscore), the original network will not be recomputed, but read in from the file givenvia -I. If the file given via -I does not exist yet, the original network will be exported tothis file. Thus, you can set up parallelization in the following way: The first step is tocompute the original network scores:

java -Xmx2000m -cp CoNet.jarbe.ac.vub.bsb.cooccurrence.cmd.CooccurrenceAnalyser-i input.txt -f randscore -E correl_spearman/dist_bray --method ensemble--ensembleparamfile thresholds.txt --multigraph --pvaluemerge brown-F rand --iterations 1 -g 0.05 --resamplemethod shuffle_rows-I oriscores.txt -K edgeScores --scoreexport--output randomScores.0 > oriscores.log &

Page 44: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

Next, you can create N jobs (where i is the job index going from 1 to N), by launchingthe following command N times:

java -Xmx2000m -cp CoNet.jarbe.ac.vub.bsb.cooccurrence.cmd.CooccurrenceAnalyser-i input.txt -f randscore -E correl_spearman/dist_bray --method ensemble--ensembleparamfile thresholds.txt --multigraph --pvaluemerge brown-F rand --iterations 10 -g 0.05 --resamplemethod shuffle_rows-I oriscores.txt -K edgeScores --scoreexport--output randomScores.i > randscores_i.log &

Finally, you can merge all separately generated random score files (randomScores.i with ifrom 1 to N) by appending them to the original score file (oriscores.txt), thus creatingyour final permutation or bootstrap score file. If all your random score files are in onedirectory and if their names start with "randomScores.", you can let CoNet do the finalmerge. For this, point to the random score directory by setting tmp_dir in theconfiguration file to this directory and add option --restorefromscorefolder on commandline. An example command could look like this:

java -Xmx2000m -cp CoNet.jarbe.ac.vub.bsb.cooccurrence.cmd.CooccurrenceAnalyser-i input.txt -f gdl -E correl_spearman/dist_bray --method ensemble--ensembleparamfile thresholds.txt --multigraph --pvaluemerge brown-F rand --iterations 100 -g 0.05 --resamplemethod shuffle_rows-I oriscores.txt -K edgeScores --restorefromscorefolder-Z CoNetConfig.txt --output network.gdl > restore.log &

Advanced users: Ensemble Pipeline Bash Script(MacOS/UNIX)

In the cmd folder of the CoNet distribution, there is an example for a bash script that runsthe steps of the ensemble part of the pipeline published in PLoS Computational Biology8, e1002606. The example bash script uses input data from the third CoNet tutorial. Itassumes that it is located in a folder that has two sub-folders "Input" and "Output", whereinput files (input matrix and metadata) are located in the input folder and the output fileswill be written to the output folder. Don't forget to give the script execution permission(chmod 755 cooc.sh).The bash script runs all required network construction steps in one go, that iscomputation of initial thresholds, generation of renormalized permutation and bootstrapscores and final network construction. The renormalized permutation score computationis the longest step, it takes around 5 minutes on an 8GB RAM machine.You can run a first test by setting PERMUT, BOOT and RESTORE to false and enablingCOMPUTE_THRESHOLDS and TEST instead.If you want to enable CLUSTER, make sure you have the SGE cluster management

Page 45: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

system installed. Then, add required cluster options to the CoNetConfig.txt andCoNetConfigBoot.txt configuration files. Both configuration files should specify thelocation of the CoNet jar file (lib_dir and jar_file) as well as the requested job number(job_num) and memory (memory). The CoNetConfig.txt configuration file should inaddition point to the location of the directory where temporary permutation score fileswill be saved (tmp_dir). Likewise, the CoNetConfigBoot.txt configuration file shouldpoint to a different location, where temporary bootstrap score files will be saved. Notethat in case you keep temporary score files (keep_tmpscores=true), you can restorerandom score distributions from these files using option --restorefromscorefolder later on.

Page 46: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

References, resources and related tools

• CoNet Manual References• CoNet Resources• Links to related tools

CoNet Manual References

1. Agrawal, R., Imielinski, T. & Swami, A. "Mining Association Rules betweenSets of Items in Large Databases" in ACM SIGMOD Conference (eds. Buneman,P. & Jajodia, S.) 207-216 (ACM Press, 1993).

2. Aitchison, J., "A Concise Guide to Compositional Data Analysis" in CDAWorkshop Girona (2003).

3. Brown, M.B., "A Method for Combining Non-Independent, One-Sided Tests ofSignificance." Biometrics 31, 987-992 (1975).

4. Costello, E.K. et al. "Bacterial Community Variation in Human Body HabitatsAcross Space and Time." Science 326, 1694-1697 (2009).

5. Ellson, J., Gansner, E.R., Koutsofios, E., North, S.C. & Woodhull, G. in GRAPHDRAWING SOFTWARE (Springer-Verlag, 2003).

6. Faith J.J., et al. "Large-scale mapping and validation of Escherichia colitranscriptional regulation from a compendium of expression profiles." PLoSBiol, 5:54-66 (2007).

7. Hu, Z. et al. "VisANT 3.5: multi-scale network visualization, analysis andinference based on the gene ontology." Nucleic Acids Research 37, W115-W121(2009).

8. Lallich S., Teytaud O., & Prudhomme E. "Statistical inference and data mining:false discoveries control." 17th Compstat Symposium of the IASC:325-336(2006).

9. Legendre, P. & Legendre, L. "Numerical ecology" (Elsevier Science B.V.,Amsterdam, 1983).

10. Margolin, A.A. et al. "ARACNE: An Algorithm for the Reconstruction of GeneRegulatory Networks in a Mammalian Cellular Context." BMC Bioinformatics,1-15 (2006)

11. Meyer, P.E., Lafitte, F. & Bontempi, G. "minet: A R/Bioconductor Package forInferring Large Transcriptional Networks Using Mutual Information." BMCBioinformatics, 1-10 (2009)

12. Meyer P.E., et al. "Information-theoretic inference of large transcriptionalregulatory networks." EUROSIP J. Bioinform. Syst. Biol. 79879 (2007).

CoNet Resources

Below, resources are listed of which CoNet makes use.

Page 47: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

• apriori http://www.borgelt.net/apriori.htmlassociation rule mining implementation

• Apache Commons Math http://commons.apache.org/math/java mathematics and statistics library

• ARACNE matlab script http://icbp.stanford.edu/software/FastPairMI/fast implementation of mutual information

• BiomIO https://github.com/jladau/BiomIOBiom HDF5 parser

• Colt http://acs.lbl.gov/software/colt/statistical computing with java

• Java statistical classes (jsc) http://www.jsc.nildram.co.uk/statistical computing with java

• minet http://www.bioconductor.org/packages/release/bioc/html/minet.htmlnetwork inference, computation of mutual information

• Rserve http://www.rforge.net/Rserve/java-R bridge

Links to related tools

Tools are listed in alphabetical order by their name.• ARACNE (standalone) http://wiki.c2b2.columbia.edu/califanolab/index.php/

Software/ARACNEGene regulatory network inference algorithm, Califano lab

• BANJO (standalone and Java library) http://www.cs.duke.edu/~amink/software/banjo/Banjo is a software application and framework for structure learning of static anddynamic Bayesian networks, Alexander J. Hartemink and colleagues

• BNW (web server) http://compbio.uthsc.edu/BNW/sourcecodes/home.phpBayesian Network Webserver for Biological Network Modeling, by Jesse D.Ziebarth, Anindya Bhattacharya and Yan Cui

• CCLasso (R package) https://github.com/huayingfang/CCLassoCorrelation Inference for Compositional Data through Lasso, by Fang HuayingIncludes an implementation of SparCC

• ccrepe (R package) http://huttenhower.sph.harvard.edu/ccrepeR package designed to detect significant correlations in compositional data,Huttenhower labCoNet also implements the ccrepe approach to p-value computation (combiningpermutation with renormalization and bootstrap)

• compare-profiles (RSAT tool suite, command line tool) http://rsat.ulb.ac.be/rsatCompares all pairs of rows in a matrix using one of several measures, by Jacquesvan Helden

• Community Analyzer (standalone) http://metagenomics.atc.tcs.com/Community_Analyzer/

Page 48: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

Inter-microbial interactions across metagenomes, TATA Consultancy Services,Bio-Sciences Division

• corpcor (R package) http://strimmerlab.org/software/corpcor/R package with shrinkage estimator for covariance matrix, Strimmer lab

• Cyni Toolbox (Cytoscape plugin) http://apps.cytoscape.org/apps/cynitoolboxCytoscape Network Inference Toolbox puts together several tools that allowinferring networks from bio data, Benno Schwikowski and Oriol Guitart Pla

• EcoSimR (R package) https://cran.r-project.org/web/packages/EcoSimR/Given a site by species interaction matrix, users can make inferences aboutspecies interactions, by Nick Gotelli, Edmund Hart and Aaron Ellison

• ExpressionCorrelation (Cytoscape plugin) http://www.baderlab.org/Software/ExpressionCorrelationSimilarity network from either the genes or conditions in an expression matrix,Bader lab

• Fast Local Similarity Analysis (standalone) http://hallam.microbiology.ubc.ca/fastLSA/install/index.htmlDirected similarity network from time series data, Hallam lab

• Extended Local Similarity Analysis (LSA) (standalone and part of theGalaxy pipeline) http://meta.usc.edu/softs/lsa/Directed similarity network from time series data, Sun lab

• GeneNet (R package) http://strimmerlab.org/software/genenet/index.htmlGeneNet is an R package for learning high-dimensional dependency networksfrom genomic data (e.g. gene association networks), Strimmer lab

• MENA (web server) http://ieg2.ou.edu/MENAMolecular Ecological Network Analysis Pipeline, Zhou lab

• MetaNet (MATLAB) http://biostatistics.csmc.edu/MetaNet/Software tool for network construction and biologically significant moduledetection, by Zhenqiu Liu, Shili Lin and Steven Piantadosi

• MINE (Java and R) http://www.exploredata.netMaximal Information-based Nonparametric Exploration, by David and YakirReshef

• MInt (R package) https://cran.r-project.org/web/packages/MInt/index.htmlLearns direct microbe-microbe interaction networks using a Poissonmultivariate-normal hierarchical model with an L1 penalized precision matrix, bySurojit Biswas

• MLRR (MATLAB) https://biostatistics.csmc.edu/mlrr/ multilevel regularizedregression method that simultaneously identifies taxa and constructs networks,by Zhenqiu Liu

• MONET (Cytoscape plugin and web service) http://monet.kisti.re.krRegulatory network inference algorithm based on Bayesian network learning, byPhil Hyoun Lee and Doheon Lee

• NetCutter (standalone) http://muller.group.ifom-ieo-campus.it/Co-occurrence networks identification and analysis, by Heiko Mueller andFrancesco Mancuso

Page 49: CoNet documentationpsbweb05.psb.ugent.be/conet/documents/CoNetDoc.pdf · Label. The edge identifier. • cooc_method. The co-occurrence method that contributed this edge. There may

• Picante (R package) http://picante.r-forge.r-project.org/Phylogeny and trait diversity, community null models, by Peter Cowan, MatthewHelmus and Steven Kembel

• REBACCA (R code) http://faculty.wcas.northwestern.edu/~hji403/REBACCA.htmSignificant co-occurrence patterns by finding sparse solutions to a system with adeficient rank, by Yuguang Ban, Lingling An and Hongmei Jiang

• SparCC (standalone) https://bitbucket.org/yonatanf/sparccPython module for computing correlations in compositional data, by YonatanFriedman

• sspairs (R package) https://github.com/mjwestgate/sppairsDesigned to calculate the degree of spatial association/co-occurrence betweenspecies from presence/absence data, by Martin Westgate and Peter Lane

• SPIEC-EASI (R package) http://bonneaulab.bio.nyu.edu/software.html#spieceasiSparse InversE Covariance estimation for Ecological Association Inference, byZachary Kurtz, Bonneau labIncludes an implementation of SparCC

• WGCNA (R package) http://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/R package for weighted correlation network analysis, by Peter Langfelder andSteve Horvath