
magine Documentation, Release 0.1a1

James C. Pino

Aug 26, 2020

Contents

1 Table of contents
    1.1 Installation
    1.2 Data
    1.3 Tutorial
    1.4 MAGINE Modules Reference

2 Indices and tables

Python Module Index

Index


Welcome to MAGINE's documentation. MAGINE was created to help organize and analyze modern high-throughput data. Specifically, we designed it for multi-sample (time series, drug dose, experimental conditions) and multi-omics platform (RNAseq, ph-silac, silac, label-free, metabolomics) data. The tools are designed to organize and explore raw data:

• Organize data

• Automate enrichment analysis

• Enable sample series enrichment exploration

• Integrate network and enrichment analysis

MAGINE environment.

MAGINE has four main modules

• Data

• Enrichment

• Networks

• Tools

Our Data classes are built to organize and facilitate exploration of both the raw data and the analysis. The data class is the central structure that enables this.


CHAPTER 1

Table of contents

1.1 Installation

1. Install Anaconda

Our recommended approach is to use Anaconda, which is a distribution of Python containing most of the numeric and scientific software needed to get started. If you are a Mac or Linux user, have used Python before and are comfortable using pip to install software, you may want to skip this step and use your existing Python installation.

Anaconda has a simple graphical installer which can be downloaded from https://www.anaconda.com/distribution/#download-section - select your operating system and download the Python 3.7 version. The default installer options are usually appropriate.

2. Open a terminal

We will install most packages with conda:

$ conda create -n magine_env python=3.7
$ source activate magine_env
$ conda config --add channels conda-forge
$ conda install jinja2 statsmodels networkx graphviz
$ conda install -c marufr python-igraph

Windows users: Please download and install igraph and pycairo using the wheel files provided by Christoph Gohlke, found at https://www.lfd.uci.edu/~gohlke/ . Assuming 64-bit Windows, download python_igraph-0.7.1.post6-cp37-cp37m-win_amd64.whl from https://www.lfd.uci.edu/~gohlke/pythonlibs/#python-igraph and pycairo-1.18.0-cp37-cp37m-win_amd64.whl from https://www.lfd.uci.edu/~gohlke/pythonlibs/#pycairo

$ pip install pycairo-1.18.0-cp37-cp37m-win_amd64.whl
$ pip install python_igraph-0.7.1.post6-cp37-cp37m-win_amd64.whl

3. Install MAGINE

Installation is straightforward with git and pip. Type the following in a terminal:

$ git clone https://github.com/LoLab-VU/magine
$ cd magine
$ pip install -r requirements.txt
$ export PYTHONPATH=`pwd`:$PYTHONPATH

4. Start Python and MAGINE

If you installed Python using Anaconda on Windows, search for and select jupyter notebook from your Start Menu. Otherwise, open a terminal and type jupyter notebook.

Open a new notebook to get a Python prompt. Type import magine to try loading magine. If no error messages appear and the next Python prompt appears, you have succeeded in installing magine!

1.1.1 Documentation

The manual is available online at http://magine.readthedocs.io.

1.2 Data

This demonstrates the basic format for MAGINE's data.

• Identifier column : use HGNC for gene names and HMDB for metabolites

• label : used to modify the identifier column. For proteins, we tag with any PTM or provide a suffix for the experimental method.

• species_type : gene or metabolite

• significant : Boolean flag to specify if this is a significant species.

• fold_change : scalar fold-change value. It is expected on a linear scale (not log2); it can be converted to log2 later if desired.

• p_value : only needed if you want to use it in post analysis.

• source : used to group data later. We use it to tag which experimental platform produced the measurement.

• sample_id : provide the time point or condition. These can be chained together for more complicated systems.
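As a rough illustration of the layout described above, the following sketch builds a tiny table with pandas. The identifiers appear elsewhere in this manual, but every numeric value and flag is invented for illustration, and the column check at the end is a convenience written for this example, not a MAGINE function.

```python
import pandas as pd

# Columns listed in the format description above.
REQUIRED_COLUMNS = [
    "identifier", "label", "species_type", "significant",
    "fold_change", "p_value", "source", "sample_id",
]

# Hypothetical rows: the values are made up purely to show the layout.
rows = [
    ("LMNA", "LMNA_lf", "gene", True, 2.4, 0.003, "label_free", "01hr"),
    ("VDAC1", "VDAC1_lf", "gene", False, -1.2, 0.410, "label_free", "01hr"),
    ("HMDB0036114", "(-)-3-Thujone", "metabolite", True, 1.6, 0.021, "C18", "06hr"),
]
example = pd.DataFrame(rows, columns=REQUIRED_COLUMNS)

# Simple sanity check (not part of MAGINE): are any expected columns missing?
missing = [c for c in REQUIRED_COLUMNS if c not in example.columns]
print(example)
print("missing columns:", missing)
```

Each row is one measurement of one species in one sample, so a species measured at four time points occupies four rows.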

1.2.1 Data management

Tools to process, organize, and query data. The classes are derived from pandas.DataFrame, meaning everything you can do with pandas you can do with MAGINE.

BaseData is the core DataFrame. We provide functions that are commonly used. This class is used by both “Sample” and “EnrichmentResult”.



1.3 Tutorial

[1]: from IPython.display import display
     %matplotlib inline
     import matplotlib.pyplot as plt

[2]: import pandas as pd
     import seaborn as sns
     import numpy as np

1.3.1 ExperimentalData

Since MAGINE is built for multi-sample, multi-omics data, it is no surprise that the data is the most important aspect. Here we show how to use the ExperimentalData class.

[3]: # load the experimental data
     from magine.data.experimental_data import load_data_csv

     exp_data = load_data_csv('Data/norris_et_al_2017_cisplatin_data.csv.gz', low_memory=False)

C:\Users\James\miniconda3\envs\magine_37\lib\site-packages\ipykernel_launcher.py:4: DeprecationWarning:

load_data_csv will be removed in a future version of MAGINE. Use load_data instead.

[4]: help(exp_data)

Help on ExperimentalData in module magine.data.experimental_data object:

class ExperimentalData(builtins.object)
 |  ExperimentalData(data_file)
 |
 |  Manages all experimental data
 |
 |  Methods defined here:
 |
 |  __getitem__(self, name)
 |
 |  __init__(self, data_file)
 |      Parameters
 |      ----------
 |      data_file : str, pandas.DataFrame
 |          Name of file, generally csv.
 |          If provided a str, the file will be read in as a pandas.DataFrame
 |
 |  __setattr__(self, name, value)
 |      Implement setattr(self, name, value).
 |
 |  create_summary_table(self, sig=False, index='identifier', save_name=None, plot=False, write_latex=False)
 |      Creates a summary table of data.
 |
 |
 |      Parameters
 |      ----------
 |      sig: bool
 |          Flag to summarize significant species only
 |      save_name: str
 |          Name to save csv and .tex file
 |      index: str
 |          Index for counts
 |      plot: bool
 |          If you want to create a plot of the table
 |      write_latex: bool
 |          Create latex file of table
 |
 |
 |      Returns
 |      -------
 |      pandas.DataFrame
 |
 |  get_measured_by_datatype(self)
 |      Returns dict of species per data type
 |
 |      Returns
 |      -------
 |      dict
 |
 |  subset(self, species, index='identifier')
 |      Parameters
 |      ----------
 |      species : list, str
 |          List of species to create subset dataframe from
 |      index : str
 |          Index to filter based on provided 'species' list
 |
 |      Returns
 |      -------
 |      magine.data.experimental_data.Species
 |
 |  volcano_analysis(self, out_dir, use_sig_flag=True, p_value=0.1, fold_change_cutoff=1.5)
 |      Creates a volcano plot for each experimental method
 |
 |      Parameters
 |      ----------
 |      out_dir: str, path
 |          Path to where the output figures will be saved
 |      use_sig_flag: bool
 |          Use significant flag of data
 |      p_value: float, optional
 |          p value criteria for significant
 |          Will not be used if use_sig_flag
 |      fold_change_cutoff: float, optional
 |          fold change criteria for significant
 |          Will not be used if use_sig_flag
 |
 |      Returns
 |      -------
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |
 |  __dict__
 |      dictionary for instance variables (if defined)
 |
 |  __weakref__
 |      list of weak references to the object (if defined)
 |
 |  compounds
 |      Only compounds in data
 |
 |      Returns
 |      -------
 |      Sample
 |
 |  exp_methods
 |      List of source columns
 |
 |  genes
 |      All data tagged with gene
 |
 |      Includes protein and RNA.
 |
 |      Returns
 |      -------
 |
 |  proteins
 |      Protein level data
 |
 |      Tagged with "gene" identifier that is not RNA
 |
 |      Returns
 |      -------
 |
 |  rna
 |      RNA level data
 |
 |      Tagged with "RNA"
 |
 |      Returns
 |      -------
 |
 |  sample_ids
 |      List of sample_ids
 |
 |  species
 |      Returns data in Sample format
 |
 |      Returns
 |      -------
 |      Sample

Getting counts from data

[5]: display(exp_data.create_summary_table())
     display(exp_data.create_summary_table(sig=True))
     display(exp_data.create_summary_table(sig=True, index='label'))

sample_id    01hr   06hr   24hr   48hr  Total Unique Across
source
C18           522    227    653    685                 1402
HILIC         471    605    930    613                 1504
label_free   2766   2742   2551   2261                 3447
ph_silac     2608   3298   3384   3236                 5113
rna_seq     18741  19104  19992      -                20642
silac        2923   3357   3072   3265                 4086



[6]: exp_data.species.head(5)

[6]:
   identifier              label species_type  fold_change  p_value  \
0       HOXD1       HOXD1_rnaseq      protein  -520.256762  0.00102
1     MIR7704     MIR7704_rnaseq      protein  -520.256762  0.00102
2  AC078814.1  AC078814.1_rnaseq      protein   -76.022260  0.00102
3       PPM1H       PPM1H_rnaseq      protein   -76.022260  0.00102
4       PLCH1       PLCH1_rnaseq      protein   -17.888990  0.00102

   significant sample_id   source
0         True      06hr  rna_seq
1         True      06hr  rna_seq
2         True      06hr  rna_seq
3         True      06hr  rna_seq
4         True      06hr  rna_seq

Filter by category (experimental method)

The .species attribute aggregates all data.

MAGINE uses the species_type and source columns to split data into compounds, genes (species_type == gene), rna (species_type == gene, source == rna), or proteins (species_type == gene, source != rna). Each subset can be accessed as an attribute of the same name, for example:

[7]: exp_data.genes.head(5)


   significant sample_id   source
0         True      06hr  rna_seq
1         True      06hr  rna_seq
2         True      06hr  rna_seq
3         True      06hr  rna_seq
4         True      06hr  rna_seq

[8]: exp_data.compounds.head(5)

[8]:
         identifier                                              label  \
128152  HMDB0036114                                      (-)-3-Thujone
128153  HMDB0001320    (13E)-11a-Hydroxy-9,15-dioxoprost-13-enoic acid
128154  HMDB0012113               (22Alpha)-hydroxy-campest-4-en-3-one
128155  HMDB0010361  (23S)-23,25-dihdroxy-24-oxovitamine D3 23-(bet...
128156  HMDB0011644        (24R)-Cholest-5-ene-3-beta,7-alpha,24-triol

       species_type  fold_change       p_value  significant sample_id source
128152  metabolites          1.6  2.100000e-02         True      06hr    C18
128153  metabolites         88.8  5.800000e-12         True      24hr    C18
128154  metabolites        100.0  9.500000e-04         True      48hr  HILIC
128155  metabolites       -100.0  1.000000e-12         True      48hr    C18
128156  metabolites          1.6  7.400000e-05         True      01hr    C18
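The category split described in this section can be mimicked with boolean masks in plain pandas. The sketch below only illustrates the idea: the three rows are invented, and the rule used to recognise RNA rows (a source value of rna_seq) is an assumption, not taken from the MAGINE source.

```python
import pandas as pd

# Hypothetical three-row table with only the columns needed for the split.
df = pd.DataFrame({
    "identifier": ["TP63", "LIMS1", "HMDB0036114"],
    "species_type": ["gene", "gene", "metabolite"],
    "source": ["rna_seq", "label_free", "C18"],
})

is_gene = df["species_type"] == "gene"
is_rna = df["source"] == "rna_seq"  # assumption: RNA rows are identified by this source value

genes = df[is_gene]               # protein and RNA together
rna = df[is_gene & is_rna]        # RNA only
proteins = df[is_gene & ~is_rna]  # gene-level rows that are not RNA
compounds = df[~is_gene]          # everything else, i.e. metabolites

# Grouping on the source column gives one sub-table per experimental method.
by_source = {name: group for name, group in df.groupby("source")}

print(rna["identifier"].tolist(), proteins["identifier"].tolist(), compounds["identifier"].tolist())
print(sorted(by_source))
```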

Similarly, we can also filter the data by source using .name, where name is any value in the source column. We can get a list of these by printing exp_data.exp_methods.

[9]: # prints all the available exp_methods
     exp_data.exp_methods

[9]: ['rna_seq', 'ph_silac', 'label_free', 'silac', 'C18', 'HILIC']

[10]: # filters to only the 'label_free'
      exp_data.label_free.shape


[10]: (13085, 8)

[11]: exp_data.label_free.head(5)

[11]:
       identifier       label species_type  fold_change  p_value  significant  \
102446      LIMS1    LIMS1_lf      protein        12.42  0.00003         True
102447    SMARCE1  SMARCE1_lf      protein        -2.49  0.00030         True
102448       HEXA     HEXA_lf      protein         6.42  0.00060         True
102449      SRSF1    SRSF1_lf      protein        -3.21  0.00060         True
102450      SF3B1    SF3B1_lf      protein        -1.57  0.00130         True

       sample_id      source
102446      01hr  label_free
102447      01hr  label_free
102448      01hr  label_free
102449      01hr  label_free
102450      01hr  label_free

[12]: exp_data.HILIC.head(5)

[12]:
         identifier                                        label species_type  \
128154  HMDB0012113         (22Alpha)-hydroxy-campest-4-en-3-one  metabolites
128157  HMDB0011644  (24R)-Cholest-5-ene-3-beta,7-alpha,24-triol  metabolites
128162  HMDB0012114                    (3S)-3,6-Diaminohexanoate  metabolites
128164  HMDB0012114                    (3S)-3,6-Diaminohexanoate  metabolites
128166  HMDB0012115                 (3S,5S)-3,5-Diaminohexanoate  metabolites

        fold_change   p_value  significant sample_id source
128154        100.0  0.000950         True      48hr  HILIC
128157          1.7  0.000072         True      24hr  HILIC
128162         -1.9  0.000030         True      06hr  HILIC
128164         -3.0  0.002000         True      24hr  HILIC
128166         -1.9  0.000030         True      06hr  HILIC

Significant filter

The significant column is mapped to the .sig property.

[13]: exp_data.rna_seq.sig.head(5)





Filter data to up or down regulated species.

For enrichment analysis, we will want to access up-regulated and down-regulated species.

[14]: exp_data.rna_seq.up.head(10)

[14]:
       identifier                 label species_type  fold_change   p_value  \
13           DLX2           DLX2_rnaseq      protein     2.874358  0.001020
18         RETSAT         RETSAT_rnaseq      protein     2.325934  0.001020
21        SLC52A1        SLC52A1_rnaseq      protein     2.871869  0.001020
24          OTUD3          OTUD3_rnaseq      protein     1.821775  0.001020
35  RP11-209D14.2  RP11-209D14.2_rnaseq      protein     1.819533  0.025204
58         ZNF554         ZNF554_rnaseq      protein     2.309691  0.004153
59           FZD9           FZD9_rnaseq      protein     1.812798  0.001020
71           SBK1           SBK1_rnaseq      protein     1.806427  0.002689
88          PPM1D          PPM1D_rnaseq      protein     1.803186  0.001020
92         ZNF425         ZNF425_rnaseq      protein     2.846581  0.001020

    significant sample_id   source
13         True      06hr  rna_seq
18         True      06hr  rna_seq
21         True      06hr  rna_seq
24         True      06hr  rna_seq
35         True      06hr  rna_seq
58         True      06hr  rna_seq
59         True      06hr  rna_seq
71         True      06hr  rna_seq
88         True      06hr  rna_seq
92         True      06hr  rna_seq

[15]: exp_data.rna_seq.down.head(10)

[15]:
     identifier                label species_type  fold_change   p_value  \
0         HOXD1         HOXD1_rnaseq      protein  -520.256762  0.001020
1       MIR7704       MIR7704_rnaseq      protein  -520.256762  0.001020
2    AC078814.1    AC078814.1_rnaseq      protein   -76.022260  0.001020
3         PPM1H         PPM1H_rnaseq      protein   -76.022260  0.001020
4         PLCH1         PLCH1_rnaseq      protein   -17.888990  0.001020
5  RP11-639F1.1  RP11-639F1.1_rnaseq      protein   -17.888990  0.001020
6          TP63          TP63_rnaseq      protein   -12.355659  0.001020
7        JARID2        JARID2_rnaseq      protein    -7.891502  0.001020
8          GLI2          GLI2_rnaseq      protein    -5.389009  0.001020
9        MAP3K5        MAP3K5_rnaseq      protein    -4.262353  0.001893

   significant sample_id   source
0         True      06hr  rna_seq
1         True      06hr  rna_seq
2         True      06hr  rna_seq
3         True      06hr  rna_seq
4         True      06hr  rna_seq
5         True      06hr  rna_seq
6         True      06hr  rna_seq
7         True      06hr  rna_seq
8         True      06hr  rna_seq
9         True      06hr  rna_seq
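The .up and .down filters can be thought of as boolean selections on the fold_change column. The sketch below reproduces that idea in plain pandas. It assumes that up-regulated means a significant, positive fold change and down-regulated a significant, negative one; this matches the output shown above but is not taken from the MAGINE source. The GENE_X row is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "identifier": ["DLX2", "HOXD1", "GENE_X"],           # GENE_X is a made-up placeholder
    "fold_change": [2.874358, -520.256762, 1.1],
    "significant": [True, True, False],
})

# Assumed definitions: sign of fold_change, restricted to significant rows.
up = df[df["significant"] & (df["fold_change"] > 0)]
down = df[df["significant"] & (df["fold_change"] < 0)]

print(up["identifier"].tolist())
print(down["identifier"].tolist())
```

Keeping the two directions separate is useful for enrichment analysis, where up- and down-regulated species are usually submitted as separate lists.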


Extracting by sample (time point)

[16]: for i in exp_data.sample_ids:
          print(i)
          display(exp_data[i].head(5))

01hr

         identifier                label species_type  fold_change   p_value  \
19160         GRIK4         GRIK4_rnaseq      protein    77.555651  0.019824
19161  GRIK4_3p_UTR  GRIK4_3p_UTR_rnaseq      protein    77.555651  0.019824
19162    AP001187.9    AP001187.9_rnaseq      protein   -25.455050  0.019824
19163        MIR192        MIR192_rnaseq      protein   -25.455050  0.019824
19164      MIR194-2      MIR194-2_rnaseq      protein   -25.455050  0.019824


06hr

   identifier              label species_type  fold_change  p_value  \
0       HOXD1       HOXD1_rnaseq      protein  -520.256762  0.00102
1     MIR7704     MIR7704_rnaseq      protein  -520.256762  0.00102
2  AC078814.1  AC078814.1_rnaseq      protein   -76.022260  0.00102
3       PPM1H       PPM1H_rnaseq      protein   -76.022260  0.00102
4       PLCH1       PLCH1_rnaseq      protein   -17.888990  0.00102


24hr

      identifier            label species_type  fold_change   p_value  \
37960       LHX3      LHX3_rnaseq      protein   202.225343  0.005180
37961   C17orf67  C17orf67_rnaseq      protein     2.571464  0.000123
37962       ALX1      ALX1_rnaseq      protein    -2.572587  0.000123
37963    MIR7844   MIR7844_rnaseq      protein     2.573033  0.009349
37964      TMCC3     TMCC3_rnaseq      protein     2.573033  0.009349


48hr

      identifier                     label species_type  fold_change  p_value  \
58025       TNS3    TNS3_1188_1197_phsilac      protein    -3.837129    0.049
58026    SIPA1L3  SIPA1L3_S(ph)158_phsilac      protein    -5.119600    0.049
58027       TNS3     TNS3_Y(ph)780_phsilac      protein    -4.986421    0.049
58028       FGD6     FGD6_S(ph)554_phsilac      protein    -3.900705    0.049
58029       GPN1     GPN1_S(ph)312_phsilac      protein     2.901199    0.049

       significant sample_id    source
58025         True      48hr  ph_silac
58026         True      48hr  ph_silac
58027         True      48hr  ph_silac
58028         True      48hr  ph_silac
58029         True      48hr  ph_silac

Pivot table to get table across time

[17]: exp_data.label_free.pivoter(
          convert_to_log=False,
          index='identifier',
          columns='sample_id',
          values=['fold_change', 'p_value']
      ).head(10)

[17]:
           fold_change                        p_value                    \
sample_id         01hr   06hr   24hr   48hr      01hr     06hr     24hr
identifier
A2M           1.040000  1.140  51.93  11.58  0.514800  0.44370  0.24260
AACS         -1.100000  3.740    NaN    NaN  0.281800  0.26950      NaN
AAGAB         1.000000 -1.150   1.46  -2.03  0.968100  0.39240  0.84450
AAK1          1.320000  1.590    NaN   1.72  0.715800  0.18110      NaN
AAMP         -1.200000 -1.460   1.85   1.78  0.836800  0.55420  0.13640
AAR2               NaN -1.690    NaN    NaN       NaN  0.96510      NaN
AARS          0.326667 -0.035  -1.44  -3.12  0.299867  0.62425  0.46725
AARS2         1.170000    NaN    NaN    NaN  0.253000      NaN      NaN
AARSD1        1.210000  4.070  -2.05    NaN  0.459700  0.49160  0.78440
AASDHPPT     -0.330000  1.020   1.07  -1.11  0.709600  0.81160  0.45290

sample_id      48hr
identifier
A2M         0.11130
AACS            NaN
AAGAB       0.09760
AAK1        0.95660
AAMP        0.32460
AAR2            NaN
AARS        0.00045
AARS2           NaN
AARSD1          NaN
AASDHPPT    0.00070

Note that the pivot table above contains NaN values. These arise because not every species was measured at every time point. We can easily check how many species are found in all 4 of our label-free experiments.

[18]: print(len(exp_data.label_free.present_in_all_columns(
          index='identifier',
          columns='sample_id',
      ).id_list))

Number in index went from 3447 to 1819
1819


This shows that out of the 3447 unique species measured in label-free proteomics, only 1819 were measured in all time points. What one can do with this information is dependent on the analysis. For now, we will keep using the full dataset.
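The completeness check above can be approximated in plain pandas by counting, for each identifier, how many distinct sample_ids it appears in. This is a sketch of the idea only, not MAGINE's implementation; the rows reuse a few fold-change values from the pivot table shown earlier.

```python
import pandas as pd

# Long-format rows: A2M is measured at all four time points,
# AACS at two of them and AARS2 at only one.
df = pd.DataFrame({
    "identifier": ["A2M"] * 4 + ["AACS"] * 2 + ["AARS2"],
    "sample_id": ["01hr", "06hr", "24hr", "48hr", "01hr", "06hr", "01hr"],
    "fold_change": [1.04, 1.14, 51.93, 11.58, -1.10, 3.74, 1.17],
})

n_samples = df["sample_id"].nunique()
samples_per_species = df.groupby("identifier")["sample_id"].nunique()

# Keep only identifiers that were observed in every sample.
complete = samples_per_species[samples_per_species == n_samples].index
filtered = df[df["identifier"].isin(complete)]

print(sorted(filtered["identifier"].unique()))
```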

Visualization

Volcano plots

[19]: exp_data.label_free.volcano_plot();

[20]: exp_data.label_free.volcano_by_sample(sig_column=True);



[21]: exp_data.label_free.plot_histogram();

Plotting subset of species

We provide a few plotting interfaces to explore subsets of the data. Basically, you create a list of species and provide it to the function. It filters the data based on these species and then returns the results.


Time series using plotly and matplotlib

[22]: exp_data.label_free.plot_species(['LMNA', 'VDAC1'], plot_type='plotly')


[23]: exp_data.label_free.plot_species(['LMNA', 'VDAC1'], plot_type='matplotlib');

Heatplots

[24]: exp_data.label_free.heatmap(
          ['LMNA', 'VDAC1'],
          figsize=(6,4),
          linewidths=0.01
      );

Notice that the above plot doesn't show any of the modifiers of LMNA (no _s(ph)22_lf). This is because the default index to pivot plots is the identifier column. You can set the label column for plotting by passing index='label' to the function. Note, if you want to filter the data using the more generic 'identifier' column, you just specify that with subset_index='identifier'.

[25]: exp_data.label_free.heatmap(
          ['LMNA', 'VDAC1'],
          subset_index='identifier',
          index='label',
          figsize=(6,4),
          linewidths=0.01
      );

Examples

Here are a few examples of how all the above commands can be chained together to create plots with varying degrees of criteria.

Query 1:

Heatmap of label-free proteomics species that change significantly in at least 3 time points.

[26]: lf_sig = exp_data.label_free.require_n_sig(
          index='label',
          columns='sample_id',
          n_sig=3
      )
      lf_sig.heatmap(
          convert_to_log=True,
          cluster_row=True,
          index='label',
          values='fold_change',
          columns='sample_id',
          annotate_sig=True,
          figsize=(8, 12),
          div_colors=True,
          num_colors=21,
          linewidths=0.01
      );


Query 2:

Changes that happen at all 3 timepoints for RNA-seq.

[27]: exp_data.rna.require_n_sig(n_sig=3, index='label').plot_species(plot_type='plotly');




Query 3:

• Time series plot of proteins that are consistently up-regulated, and heatmap of proteins that are consistently down-regulated, at 3 time points.

[28]: exp_data.proteins.up.require_n_sig(n_sig=3, index='label').plot_species(plot_type='matplotlib');
      exp_data.proteins.down.require_n_sig(n_sig=3, index='label').heatmap(index='label', cluster_row=True);


Query 4:

Clustered heatmap of SILAC data

[29]: exp_data.silac.heatmap(linewidths=0.01, index='label',cluster_row=True, min_sig=2, figsize=(12,18));



Extending to other plots

Since our exp_data is built off a pandas.DataFrame, we can use other packages that take that data format. Seaborn is one such tool that provides some very nice plots.

[30]: label_free = exp_data.label_free.copy()
      label_free.log2_normalize_df(column='fold_change', inplace=True)

      g = sns.PairGrid(label_free,
                       x_vars=('sample_id'),
                       y_vars=('fold_change', 'p_value'),
                       hue='source',
                       aspect=3.25, height=3.5)
      g.map(
          sns.violinplot,
          palette="pastel",
          split=True,
          order=label_free.sample_ids
      );


Venn diagram comparisons between measurements

[31]: from magine.plotting.venn_diagram_maker import create_venn2, create_venn3

      lf = exp_data.label_free.sig.id_list
      silac = exp_data.silac.sig.id_list
      phsilac = exp_data.ph_silac.sig.id_list
      hilic = exp_data.HILIC.sig.id_list
      rplc = exp_data.C18.sig.id_list

      create_venn2(hilic, rplc, 'HILIC', 'RPLC');



[32]: create_venn3(lf, silac, phsilac, 'LF', 'SILAC', 'ph-SILAC');
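The regions of a two-set Venn diagram correspond to ordinary set operations, so the counts behind such a figure can be checked by hand. The sketch below uses a handful of HMDB identifiers that appear in the HILIC and C18 tables earlier in this tutorial; it is an illustrative subset, not the full significant lists.

```python
# Small illustrative subsets of identifiers from the HILIC and C18 (RPLC) tables.
hilic = {"HMDB0012113", "HMDB0011644", "HMDB0012114", "HMDB0012115"}
rplc = {"HMDB0036114", "HMDB0001320", "HMDB0010361", "HMDB0011644"}

only_hilic = hilic - rplc   # left crescent of the diagram
shared = hilic & rplc       # overlap
only_rplc = rplc - hilic    # right crescent

print(len(only_hilic), len(shared), len(only_rplc))
print(sorted(shared))
```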

1.3.2 Networks

Create data driven network

[33]: from magine.networks.network_generator import build_network
      import magine.networks.utils as utils
      import networkx as nx
      import os

2019-09-12 15:48:17.518 - magine - INFO - Logging started on MAGINE
2019-09-12 15:48:17.519 - magine - INFO - Log entry time offset from UTC: -7.00 hours


[34]: if not os.path.exists('Data/cisplatin_network.p'):
          network = build_network(
              seed_species=exp_data.species.sig.id_list,   # genes seed species
              all_measured_list=exp_data.species.id_list,  # all data measured
              use_biogrid=True,       # expand with biogrid
              use_hmdb=True,          # expand with hmdb
              use_reactome=True,      # expand with reactome
              use_signor=True,        # expand with signor
              trim_source_sink=True,  # remove all source and sink nodes not measured
              save_name='Data/cisplatin_network'
          )
      else:
          # Load the network, note that it is returned above but for future use
          # we will use load in
          network = nx.read_gpickle('Data/cisplatin_network.p')

      utils.add_data_to_graph(network, exp_data)
      print("Saving network")
      # write to GML for cytoscape or other program
      nx.write_gml(
          network,
          os.path.join('Data', 'cisplatin_network_w_attributes.gml')
      )

      # write to gpickle for fast loading in python
      nx.write_gpickle(
          network,
          os.path.join('Data', 'cisplatin_based_network.p'),
      )

Saving network

Explore subgraphs of network

[35]: from magine.networks.subgraphs import Subgraph
      from magine.networks.visualization import draw_igraph, draw_graphviz, draw_mpl, draw_cyjs
      net_sub = Subgraph(network)

[36]: print(len(network.nodes()))
      print(len(network.edges()))

13308
181300

[37]: bax_n = net_sub.neighbors('BAX', upstream=True, downstream=True)

[38]: # display_graph(bax_n)
      draw_igraph(bax_n, bbox=[400, 400], node_size=25, inline=True)

[38]:

[39]: draw_mpl(bax_n, layout='fdp', scale=3, node_size=100, font_size=12);



[40]: draw_graphviz(bax_n, 'fdp')

[41]: expand = net_sub.expand_neighbors(bax_n, nodes='CASP3', downstream=True)

[42]: draw_igraph(expand,
                  bbox=[800, 800],
                  node_font_size=12,
                  font_size=8,
                  node_size=12,
                  inline=True,
                  layout='graphopt')

[42]:

[43]: draw_graphviz(expand, 'sfdp', width=500)

[44]: draw_graphviz(expand, 'twopi', width=500)



1.3.3 Enrichment analysis

[45]: from magine.enrichment.enrichr import Enrichr

[46]: e = Enrichr()

[47]: label_free_enrichment = e.run_samples(
          exp_data.label_free.sig.up_by_sample,
          exp_data.label_free.sig.sample_ids,
          gene_set_lib='Reactome_2016'
      )

[48]: label_free_enrichment.head(10)

[48]:
                                            term_name  rank   p_value    z_score  combined_score  \
18     metabolism of fat-soluble vitamins_hsa-6806667    19  0.037587  26.143791       85.780009
19                             metabolism_hsa-1430728    20  0.047930   2.795248        8.491991
25                      l1cam interactions_hsa-373760    26  0.069654  13.888889       37.003049
28                cell-cell communication_hsa-1500931    29  0.093902  10.178117       24.076406
29    metabolism of vitamins and cofactors_hsa-196854    30  0.105465   9.009009       20.264628
42            mrna splicing - major pathway_hsa-72163     2  0.005673  17.559263       90.816629
43                            mrna splicing_hsa-72172     3  0.006522  16.339869       82.230741
46  processing of capped intron-containing pre-mrn...     6  0.011454  12.191405       54.488181
59                   mrna 3'-end processing_hsa-72187    19  0.042493  23.068051       72.858353
60  post-elongation processing of intron-containin...    20  0.042493  23.068051       72.858353

    adj_p_value                   genes  n_genes             db  significant sample_id
18          1.0                  VKORC1        1  Reactome_2016        False      01hr
19          1.0  ACSL3,HEXA,RRM1,VKORC1        4  Reactome_2016        False      01hr
25          1.0                   NRCAM        1  Reactome_2016        False      01hr
28          1.0                   LIMS1        1  Reactome_2016        False      01hr
29          1.0                  VKORC1        1  Reactome_2016        False      01hr
42          1.0           EIF4A3,HNRNPC        2  Reactome_2016        False      06hr
43          1.0           EIF4A3,HNRNPC        2  Reactome_2016        False      06hr
46          1.0           EIF4A3,HNRNPC        2  Reactome_2016        False      06hr
59          1.0                  EIF4A3        1  Reactome_2016        False      06hr
60          1.0                  EIF4A3        1  Reactome_2016        False      06hr

[49]: label_free_enrichment.term_name = label_free_enrichment.term_name.str.split('_').str.get(0)
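The cell above strips the Reactome identifier suffix from each term name by splitting on the underscore and keeping the first piece. The following standalone sketch shows the same pandas string operations on three term names taken from the table above. The rsplit variant is a suggestion of this example, not something the tutorial uses.

```python
import pandas as pd

terms = pd.Series([
    "metabolism_hsa-1430728",
    "l1cam interactions_hsa-373760",
    "mrna splicing_hsa-72172",
])

# Same operation as in the tutorial: split on "_" and keep the first piece.
clean = terms.str.split("_").str.get(0)

# Alternative that splits only on the last underscore, in case a term name
# itself contains one.
clean_rsplit = terms.str.rsplit("_", n=1).str.get(0)

print(clean.tolist())
```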

[50]: label_free_enrichment.head(10)

[50]:
                                            term_name  rank   p_value    z_score  combined_score  \
18                 metabolism of fat-soluble vitamins    19  0.037587  26.143791       85.780009
19                                         metabolism    20  0.047930   2.795248        8.491991
25                                 l1cam interactions    26  0.069654  13.888889       37.003049
28                            cell-cell communication    29  0.093902  10.178117       24.076406
29               metabolism of vitamins and cofactors    30  0.105465   9.009009       20.264628
42                      mrna splicing - major pathway     2  0.005673  17.559263       90.816629
43                                      mrna splicing     3  0.006522  16.339869       82.230741
46    processing of capped intron-containing pre-mrna     6  0.011454  12.191405       54.488181
59                             mrna 3'-end processing    19  0.042493  23.068051       72.858353
60  post-elongation processing of intron-containin...    20  0.042493  23.068051       72.858353

    adj_p_value                   genes  n_genes             db  significant sample_id
18          1.0                  VKORC1        1  Reactome_2016        False      01hr
19          1.0  ACSL3,HEXA,RRM1,VKORC1        4  Reactome_2016        False      01hr
25          1.0                   NRCAM        1  Reactome_2016        False      01hr
28          1.0                   LIMS1        1  Reactome_2016        False      01hr
29          1.0                  VKORC1        1  Reactome_2016        False      01hr
42          1.0           EIF4A3,HNRNPC        2  Reactome_2016        False      06hr
43          1.0           EIF4A3,HNRNPC        2  Reactome_2016        False      06hr
46          1.0           EIF4A3,HNRNPC        2  Reactome_2016        False      06hr
59          1.0                  EIF4A3        1  Reactome_2016        False      06hr
60          1.0                  EIF4A3        1  Reactome_2016        False      06hr

[51]: label_free_enrichment.heatmap(
          min_sig=1,
          figsize=(4,16),
          linewidths=0.01,
          cluster_by_set=True
      );


[52]: label_free_enrichment_slim = label_free_enrichment.remove_redundant(level='dataframe')

Number of rows went from 72 to 27

[53]: label_free_enrichment_slim.heatmap(
          min_sig=1,
          figsize=(4,12),
          linewidths=0.01,
          cluster_by_set=True
      );

[54]: display(sorted(label_free_enrichment_slim.term_name.unique()))

['amino acid transport across the plasma membrane',
 'antigen processing-cross presentation',
 'basigin interactions',
 'binding and uptake of ligands by scavenger receptors',
 'cargo concentration in the er',
 'caspase-mediated cleavage of cytoskeletal proteins',
 'cleavage of growing transcript in the termination region',
 'copi-mediated anterograde transport',
 'formation of atp by chemiosmotic coupling',
 'golgi-to-er retrograde transport',
 'hdl-mediated lipid transport',
 'initiation of nuclear envelope reformation',
 'integrin cell surface interactions',
 'l1cam interactions',
 'metabolism of fat-soluble vitamins',
 'mitochondrial protein import',
 'mrna splicing - major pathway',
 'n-glycan trimming in the er and calnexin/calreticulin cycle',
 'nephrin interactions',
 'regulation of complement cascade',
 'regulation of insulin secretion',
 'respiratory electron transport',
 'response to elevated platelet cytosolic ca2+',
 'srp-dependent cotranslational protein targeting to membrane',
 'syndecan interactions',
 'vitamin c (ascorbate) metabolism',
 'xbp1(s) activates chaperone genes']

For a select term, we can extract out the species of interest to visualize.

[55]: exp_data.label_free.heatmap(
          label_free_enrichment.sig.term_to_genes('caspase-mediated cleavage of cytoskeletal proteins'),
          subset_index='identifier',
          index='label',
          cluster_row=True,
          rank_index=True,
          min_sig=2,
          linewidths=0.01,
          figsize=(2, 4),
      );

[56]: exp_data.label_free.heatmap(
          label_free_enrichment.sig.term_to_genes('tp53 regulates metabolic genes'),
          subset_index='identifier',
          index='label',
          cluster_row=True,
          rank_index=True,
          min_sig=2,
          linewidths=0.01,
          figsize=(2,6),
      );

No terms match subset

[57]: ph_silac_enrichment = e.run_samples(
          exp_data.ph_silac.sig.up_by_sample,
          exp_data.ph_silac.sig.sample_ids,
          gene_set_lib='Reactome_2016'
      )

[58]: ph_silac_enrichment.term_name = ph_silac_enrichment.term_name.str.split('_').str.get(0)

[59]: ph_silac_enrichment.heatmap(
          min_sig=2,
          figsize=(4,24),
          linewidths=0.01,
          cluster_by_set=True
      );


[60]: ph_silac_enrichment_slim = ph_silac_enrichment.remove_redundant(level='dataframe')

      ph_silac_enrichment_slim.heatmap(
          min_sig=3,
          figsize=(4,16),
          linewidths=0.01,
          cluster_by_set=True
      );

Number of rows went from 315 to 84

[61]: exp_data.ph_silac.heatmap(
          ph_silac_enrichment.sig.term_to_genes('apoptosis'),
          subset_index='identifier',
          index='label',
          cluster_row=True,
          rank_index=True,
          min_sig=2,
          linewidths=0.01,
          figsize=(2,12),
      );


1.4 MAGINE Modules Reference

1.4.1 Data management

Tools to process, organize, and query data. The classes are derived from pandas.DataFrame, meaning everything you can do with pandas you can do with MAGINE.

BaseData is the core DataFrame. We provide functions that are commonly used. This class is used by both “Sample” and “EnrichmentResult”.

1.4.2 BaseData

class magine.data.base.BaseData(*args, **kwargs)
Bases: pandas.core.frame.DataFrame

This class is derived from pd.DataFrame

heatmap(subset=None, subset_index=None, convert_to_log=True, y_tick_labels='auto', cluster_row=False, cluster_col=False, cluster_by_set=False, index=None, values=None, columns=None, annotate_sig=True, figsize=(8, 12), div_colors=True, linewidths=0, num_colors=21, rank_index=False, min_sig=0)

Creates heatmap of data, providing pivot and formatting.

Parameters

subset [list or str] Will filter to only contain a provided list. If a str, will filter based on .contains(subset)

subset_index [str] Index for subset list to match against

convert_to_log [bool] Convert values to log2 scale

y_tick_labels [str] Column of values, default = ‘auto’

cluster_row [bool]

cluster_col [bool]

cluster_by_set [bool] Clusters by gene set, only used in EnrichmentResult derived class

index [str] Index of heatmap, will be ‘row’ variables

values [str] Values to display in heatmap

columns [str] Value that will be used as columns

annotate_sig [bool] Add ‘+’ annotation to not ‘significant=True’ column

figsize [tuple] Figure size to pass to matplotlib

div_colors [bool] Use colors that are divergent (red to blue, instead of shades of blue)

num_colors [int] How many colors to include on color bar

linewidths [float] line width between individual cols and rows

rank_index [bool] Rank index alphabetically

min_sig [int] Minimum number of significant ‘index’ across samples. Can be used to remove rows that are not significant across any sample.

Returns

matplotlib.figure

log2_normalize_df(column='fold_change', inplace=False)
Convert “fold_change” column to log2.

Does so by taking log2 of all positive values and -log2 of all negative values.

Parameters

column [str] Column to convert

inplace [bool] Whether to apply log2 in place or return a new dataframe
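The conversion described for log2_normalize_df can be written compactly with NumPy. The sketch below is one reading of that description, not MAGINE's code: it interprets "-log2 of all negative values" as the negated log2 of the absolute value, so that a fold change of -4 becomes -2.

```python
import numpy as np
import pandas as pd

# Hypothetical linear-scale fold changes, where negative values denote decreases.
fold_change = pd.Series([2.0, -4.0, 8.0, -1.0])

# Signed log2: keep the sign, take log2 of the magnitude.
signed_log2 = np.sign(fold_change) * np.log2(fold_change.abs())

print(signed_log2.tolist())
```

A fold change of exactly 0 has no defined logarithm, so such values would need to be handled separately before applying this transform.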

pivoter(convert_to_log=False, columns=’sample_id’, values=’fold_change’, index=None,fill_value=None, min_sig=0)

Pivot data on provided axis.

Parameters

convert_to_log [bool] Convert values column to log2

index [str] Index for pivot table

columns [str] Columns to pivot

values [str] Values of pivot table

fill_value [float, optional] Fill pivot table nans with

min_sig [int] Required number of significant terms to keep in a row, default 0
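For readers more familiar with pandas, the pivot performed by pivoter resembles DataFrame.pivot_table. The sketch below uses plain pandas on a few fold-change values from the tutorial's label-free table. It is an analogy, not MAGINE's implementation, and the choice of pivot_table (which averages duplicate entries by default) is an assumption of this example.

```python
import pandas as pd

# Long-format rows; AAR2 has no measurement at 01hr.
df = pd.DataFrame({
    "identifier": ["A2M", "A2M", "AACS", "AACS", "AAR2"],
    "sample_id": ["01hr", "06hr", "01hr", "06hr", "06hr"],
    "fold_change": [1.04, 1.14, -1.10, 3.74, -1.69],
})

# One row per identifier, one column per sample; missing pairs become NaN.
wide = df.pivot_table(index="identifier", columns="sample_id", values="fold_change")

# fill_value replaces the missing pairs, analogous to the fill_value parameter above.
wide_filled = df.pivot_table(
    index="identifier", columns="sample_id", values="fold_change", fill_value=0
)

print(wide)
print(wide_filled)
```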

present_in_all_columns(columns='sample_id', index=None, inplace=False)
Require index to be present in all columns

Parameters

columns [str] Columns to consider

index [str, list] The column with which to filter by counts

inplace [bool] Filter in place or return a copy of the filtered data

Returns

new_data [BaseData]

require_n_sig(columns='sample_id', index=None, n_sig=3, inplace=False, verbose=False)
Filter index to have at least n_sig significant species.

Parameters

columns [str] Columns to consider

index [str, list] The column with which to filter by counts

n_sig [int] Number of terms required to not be filtered

inplace [bool] Filter in place or return a copy of the filtered data

verbose [bool]

Returns

new_data [BaseData]
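The filtering idea behind require_n_sig can be sketched in plain pandas: count, per index value, the number of distinct columns values in which it is flagged significant, then keep the index values that reach the threshold. The function name, labels and data below are hypothetical, and the counting rule is an assumption based on the parameter descriptions above.

```python
import pandas as pd

# Hypothetical data: P1_lf is significant at three time points, P2_lf at one.
df = pd.DataFrame({
    "label": ["P1_lf"] * 4 + ["P2_lf"] * 4,
    "sample_id": ["01hr", "06hr", "24hr", "48hr"] * 2,
    "significant": [True, True, True, False, True, False, False, False],
})


def require_n_sig_sketch(data, index="label", columns="sample_id", n_sig=3):
    """Keep rows whose `index` value is significant in at least `n_sig` distinct `columns` values."""
    sig_counts = data[data["significant"]].groupby(index)[columns].nunique()
    keep = sig_counts[sig_counts >= n_sig].index
    return data[data[index].isin(keep)]


filtered = require_n_sig_sketch(df, n_sig=3)
print(sorted(filtered["label"].unique()))
```

Note that this sketch keeps every row of a retained label, including its non-significant time points, which is what a heatmap across all samples would need.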

sig
    terms with significant flag

1.4.3 Species data

class magine.data.experimental_data.Sample(*args, **kwargs)

Bases: magine.data.base.BaseData

Provides tools for subsets of data types

by_sample

List of significantly flagged species by sample

down

Return down-regulated species

down_by_sample

List of down-regulated species by sample

exp_methods

List of sample_ids in data

id_list

Set of species identifiers

label_list

Set of species labels

plot_all(html_file_name, out_dir='out', plot_type='plotly', run_parallel=False)

Creates a plot of all metabolites

Parameters

html_file_name [str] filename to save html of all plots

out_dir [str, path] Directory that will contain all proteins

plot_type [str] plotly or matplotlib output

run_parallel [bool] Create the plots in parallel

plot_histogram(save_name=None, y_range=None, out_dir=None)

Plots a histogram of data

Parameters

save_name [str] Name of figure

out_dir [str, path] Path to location to save figure

y_range [array_like] range of data

plot_pie_sig_ratio(save_name=None, ax=None, fig=None, figsize=None)

Parameters

save_name [str]

ax [matplotlib.axes, optional]

fig [matplotlib.figure]

figsize [tuple] Size of figure

plot_species(species_list=None, subset_index=None, save_name=None, out_dir=None, title=None, plot_type='plotly', image_format='png')

Create scatter plot of species list

Parameters

species_list [list] list of compounds

subset_index [list] Column to filter based on species_list

save_name [str] Name of html output file

out_dir [str] Location to place plots



title [str] Title for HTML page

plot_type [str] Type of plot outputs, can be “plotly” or “matplotlib”

image_format [str] pdf or png, only used if plot_type=”matplotlib”

Returns

matplotlib.Figure or plotly.Figure

sample_ids

List of sample_ids in data

subset(species=None, index=’identifier’, sample_ids=None, exp_methods=None)

Parameters

species [list, str] List of species to create subset dataframe from

index [str] Index to filter based on provided ‘species’ list

sample_ids [str, list] List or string to filter sample

exp_methods [str, list] List or string to filter sample

Returns

magine.data.experimental_data.Species

up

Return up-regulated species

up_by_sample

List of up-regulated species by sample

volcano_by_sample(save_name=None, p_value=0.1, out_dir=None, fold_change_cutoff=1.5, y_range=None, x_range=None, sig_column=False)

Creates a figure of subplots of provided experimental method

Parameters

save_name [str] name to save figure

out_dir [str, directory] Location to save figure

sig_column [bool, optional] Whether to use the significant flags of the data

p_value [float, optional] Criterion for significance

fold_change_cutoff [float, optional] Criterion for significance

y_range [array_like] upper and lower bounds of plot in y direction

x_range [array_like] upper and lower bounds of plot in x direction

volcano_plot(save_name=None, out_dir=None, sig_column=False, p_value=0.1, fold_change_cutoff=1.5, x_range=None, y_range=None)

Create a volcano plot of data

Parameters










Returns

matplotlib.Figure

class magine.data.experimental_data.ExperimentalData(data_file)

Bases: object

Manages all experimental data

compounds

Only compounds in data

Returns

Sample

create_summary_table(sig=False, index='identifier', save_name=None, plot=False, write_latex=False)

Creates a summary table of data.

Parameters

sig [bool] Flag to summarize significant species only

save_name [str] Name to save csv and .tex file

index [str] Index for counts

plot [bool] If you want to create a plot of the table

write_latex [bool] Create latex file of table

Returns

pandas.DataFrame

exp_methods

List of source columns

genes

All data tagged with gene

Includes protein and RNA.

get_measured_by_datatype()

Returns dict of species per data type

Returns

dict

proteins

Protein level data

Tagged with “gene” identifier that is not RNA

rna

RNA level data

Tagged with “RNA”

sample_ids

List of sample_ids



species

Returns data in Sample format

Returns

Sample

subset(species, index=’identifier’)

Parameters

species [list, str] List of species to create subset dataframe from

index [str] Index to filter based on provided ‘species’ list

Returns

magine.data.experimental_data.Species

volcano_analysis(out_dir, use_sig_flag=True, p_value=0.1, fold_change_cutoff=1.5)

Creates a volcano plot for each experimental method

Parameters

out_dir [str, path] Path to where the output figures will be saved

use_sig_flag [bool] Use significant flag of data

p_value [float, optional] p-value criterion for significance. Not used if use_sig_flag is True

fold_change_cutoff [float, optional] Fold change criterion for significance. Not used if use_sig_flag is True

1.4.4 Network Generators

magine.networks.network_generator module

magine.networks.network_generator.build_network(seed_species, species='hsa', save_name=None, all_measured_list=None, trim_source_sink=False, use_reactome=True, use_hmdb=False, use_biogrid=True, use_signor=True, verbose=False)

Construct a network from a list of gene names.

Parameters

seed_species [list] list of genes to construct network

save_name [str, optional] output name to save network. Will save one before and after ID conversion

species [str] species of proteins (‘hsa’: human, ‘mmu’:murine)

all_measured_list [list] list of all species that should be considered in network

use_reactome [bool] Add ReactomeFunctionalInteraction reaction to network

use_biogrid [bool] Add BioGrid reaction to network

use_hmdb [bool] Add HMDB reaction to network all_measured_list

use_signor [bool] Add SIGNOR reaction to network



trim_source_sink [bool, optional] Remove source and sink nodes if they are not measured in network

verbose [bool]

Returns

networkx.DiGraph
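The trim_source_sink option can be illustrated with a small sketch on a plain edge list. Everything here is an assumption for illustration: the `trim_source_sink` helper is hypothetical, it works on (source, target) tuples rather than a networkx graph, and it repeats the trimming until no unmeasured source or sink is left, which is only one possible reading of the parameter description.

```python
def trim_source_sink(edges, measured):
    """Drop unmeasured nodes that only act as a source or a sink."""
    edges = set(edges)
    while True:
        sources = {u for u, _ in edges}
        targets = {v for _, v in edges}
        # A source has no incoming edge; a sink has no outgoing edge.
        removable = {
            node for node in sources | targets
            if node not in measured
            and (node not in targets or node not in sources)
        }
        if not removable:
            return edges
        edges = {
            (u, v) for u, v in edges
            if u not in removable and v not in removable
        }


edges = [("A", "B"), ("B", "C"), ("C", "D"), ("X", "B")]
trimmed = trim_source_sink(edges, measured={"A", "B", "C"})
```

In this toy network, D (an unmeasured sink) and X (an unmeasured source) are removed, while the measured chain A, B, C is kept.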

magine.networks.network_generator.create_background_network(save_name='background_network', fresh_download=False, verbose=True, create_overlap=False)

Parameters

save_name [str] Name of the network

fresh_download [bool] Download a fresh copy of the databases

verbose [bool] Print information about the databases

create_overlap [bool] Creates a figure comparing the databases

Returns

nx.DiGraph

magine.networks.network_generator.expand_by_db(starting_network, expansion_source,measured_list, verbose=False)

Add reference network to main network

Parameters

starting_network [nx.DiGraph]

expansion_source [nx.DiGraph]

measured_list [list_like]

verbose [bool]

Returns

new_graph [nx.DiGraph]

magine.networks.annotated_set module

magine.networks.subgraphs module

Subpackages

The following subpackages download and parse data.

Network Databases

Database downloads.

MAGINE downloads network information from

• Reactome Functional Interactions

• HMDB

• BioGrid

• KEGG

• Signor

With the exception of KEGG and HMDB, all databases are downloaded and processed with pandas. KEGG is downloaded using Bioservices. HMDB is downloaded in XML format and processed using the lxml parser.



magine.networks.databases.load_reactome_fi()

Load reactome functional interaction network

Returns

pandas.DataFrame

magine.networks.databases.download_reactome_fi()

Downloads reactome functional interaction network

magine.networks.databases.load_biogrid_network(fresh_download=False)

Parameters

fresh_download [bool] Download a fresh copy from biogrid

Returns

nx.DiGraph

magine.networks.databases.download_biogrid()

magine.networks.databases.load_signor(fresh_download=False)

Load SIGNOR network

Parameters

fresh_download [bool] Download fresh network

verbose [bool]

Returns

nx.DiGraph

magine.networks.databases.download_signor()

magine.networks.databases.load_kegg_mappings(species, fresh_download=False)

Load mappings of kegg_pathway_id to nodes and nodes to kegg_pathway_id

Parameters

species [str] Species type, currently ‘hsa’ is the only species with automatic name conversion

fresh_download [bool] Download KEGG fresh

Returns

dict, dict
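The two returned dictionaries are inverses of each other. A minimal sketch of that relationship, using hand-written pathway memberships chosen only for illustration (the real function downloads them from KEGG) and a hypothetical `invert` helper:

```python
from collections import defaultdict

# Illustrative memberships, not actual KEGG content.
pathway_to_nodes = {
    "hsa04210": {"BAX", "BCL2", "CASP3"},
    "hsa04115": {"TP53", "BAX"},
}


def invert(mapping):
    """Turn {pathway: nodes} into {node: pathways}."""
    inverted = defaultdict(set)
    for pathway, nodes in mapping.items():
        for node in nodes:
            inverted[node].add(pathway)
    return dict(inverted)


node_to_pathways = invert(pathway_to_nodes)
```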

magine.networks.databases.load_kegg(species='hsa', fresh_download=False)

Loads all KEGG pathways as a single network

Parameters

species [str] Default ‘hsa’

fresh_download [bool] Download a fresh copy from KEGG

Returns

nx.DiGraph

magine.networks.databases.load_hmdb_network(fresh_download=False)

Create HMDB network containing all metabolite-protein interactions

Parameters

fresh_download [bool] Download fresh copy from HMDB



verbose [bool]

Returns

nx.DiGraph

Visualization tools

1.4.5 Enrichment Module

enrichR module

class magine.enrichment.enrichr.Enrichr(verbose=False)

Bases: object

print_valid_libs()

Print a list of all available libraries EnrichR has to offer.

run(list_of_genes, gene_set_lib=’GO_Biological_Process_2017’)

Parameters

list_of_genes [list_like] List of genes using HGNC gene names

gene_set_lib [str or list] Name of gene set library. To print options use Enrichr.print_valid_libs

Returns

df [EnrichmentResult] Results from enrichR

Examples

>>> import pandas as pd
>>> from magine.enrichment.enrichr import Enrichr
>>> pd.set_option('display.max_colwidth', 40)
>>> pd.set_option('display.precision', 3)
>>> e = Enrichr()
>>> df = e.run(['BAX', 'BCL2', 'CASP3', 'CASP8'], gene_set_lib='Reactome_2016')
>>> print(df[['term_name','combined_score']].head(5))  #doctest: +NORMALIZE_WHITESPACE
                                 term_name  combined_score
0  intrinsic pathway for apoptosis hsa ...       11814.410
1               apoptosis hsa r-hsa-109581        2365.141
2  programmed cell death hsa r-hsa-5357801        2313.527
3  caspase-mediated cleavage of cytoske...       10944.261
4  caspase activation via extrinsic apo...        4245.542

run_samples(sample_lists, sample_ids, gene_set_lib='GO_Biological_Process_2017', save_name=None, create_html=False, out_dir=None, run_parallel=False, exp_data=None, pivot=False)

Run enrichment analysis on a list of samples.

Parameters

sample_lists [list_like] List of lists of genes for enrichment analysis

sample_ids [list] list of ids for the provided sample list



gene_set_lib [str, list] Type of gene set, refer to Enrichr.print_valid_libs

save_name [str, optional] if provided it will save a file as a pivoted table with the term_ids vs sample_ids

create_html [bool] Creates html of output with plots of species across sample

out_dir [str] If create_html, it will place all html plots into this directory

run_parallel [bool] If create_html, it will create plots using multiprocessing

exp_data [magine.data.ExperimentalData] Must be provided if create_html=True

pivot [bool]

Returns

EnrichmentResult

Examples

>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> from magine.enrichment.enrichr import Enrichr
>>> pd.set_option('display.max_colwidth', 40)
>>> pd.set_option('display.precision', 3)
>>> samples = [['BAX', 'BCL2', 'CASP3', 'CASP8'], ['ATR', 'ATM', 'TP53', 'CHEK1']]
>>> sample_ids = ['apoptosis', 'dna_repair']
>>> e = Enrichr()
>>> df = e.run_samples(samples, sample_ids, gene_set_lib='Reactome_2016')
>>> print(df[['term_name','combined_score']].head(5))  #doctest: +NORMALIZE_WHITESPACE
                                 term_name  combined_score
0  intrinsic pathway for apoptosis hsa ...       11814.410
1               apoptosis hsa r-hsa-109581        2365.141
2  programmed cell death hsa r-hsa-5357801        2313.527
3  caspase-mediated cleavage of cytoske...       10944.261
4  caspase activation via extrinsic apo...        4245.542

>>> df.filter_multi(rank=10, inplace=True)
>>> df['term_name'] = df['term_name'].str.split('_').str.get(0)
>>> fig = df.sig.heatmap(figsize=(6, 6), linewidths=.05)

Using an ExperimentalData instance, we can run enrichR for all the databases with a simple wrapper.

magine.enrichment.enrichr.run_enrichment_for_project(exp_data, project_name, databases=None, output_path=None)

Parameters

exp_data [magine.data.experimental_data.ExperimentalData]

project_name [str]

databases [list]

output_path [str] Location to save all individual enrichment output files created.

We also provide some tools to clean up and standardize enrichR's output.



Functions to cleanup enrichR term names

Warning: these functions are in progress and not fully tested!

magine.enrichment.enrichr.clean_term_names(row)

magine.enrichment.enrichr.clean_lincs(df)

Cleans the lincs databases term_names from enrichR.

Parameters

df

magine.enrichment.enrichr.clean_drug_pert_geo(df)

magine.enrichment.enrichr.clean_tf_names(data)

Cleans transcription factors databases by removing everything after ‘_’.

Parameters

data [pd.DataFrame]
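The clean-up rule for transcription factor terms, dropping everything after the first ‘_’, amounts to a single string split. The sketch below applies it to made-up term names; `strip_after_underscore` is a hypothetical helper, and the real function operates on a DataFrame rather than a list.

```python
def strip_after_underscore(term_name):
    """Keep only the part of the term name before the first underscore."""
    return term_name.split("_", 1)[0]


# Illustrative term names, not taken from an actual enrichR library.
names = ["TP53_ChIP-Seq_human", "MYC_ENCODE", "STAT3"]
cleaned = [strip_after_underscore(name) for name in names]
```

Names without an underscore pass through unchanged.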

Download reference databases

magine.enrichment.enrichr.get_background_list(lib_name)

Return reference list for given gene reference set

Parameters

lib_name [str]

EnrichmentResult

The results from enrichR are returned in an EnrichmentResult.

class magine.enrichment.enrichment_result.EnrichmentResult(*args, **kwargs)

Bases: magine.data.base.BaseData



all_genes_from_df()

Returns all genes from gene columns in a set

Returns

set

calc_dist(level=’datafame’)

dist_matrix(figsize=(8, 8), level='dataframe')

Create a distance matrix of all term similarity

Parameters

figsize [tuple] Size of figure

level [str, {‘dataframe’, ‘each’}] How to treat term_name to genes. ‘dataframe’ compresses all genes from all sample_ids into the same term. ‘each’ treats each term_name individually.

Returns

matplotlib.Figure

filter_based_on_words(words, inplace=False)

Filter term_name based on key terms

Parameters

words [list, str] List of words to use to keep rows in dataframe

inplace [bool] Filter the dataframe in place or return filtered copy

Returns

pandas.DataFrame

filter_multi(p_value=None, combined_score=None, db=None, sample_id=None, category=None, rank=None, inplace=False)

Filters an enrichment array.

This is an aggregate function that allows ones to filter an entire dataframe with a single function call.

Parameters

p_value [float] filters all values less than or equal

combined_score [float] filters all values greater than or equal

db [str, list]

sample_id [str, list]

category [str, list]

rank [int]

inplace [bool] Filter inplace

Returns

new_data [EnrichmentResult]
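The idea of applying several thresholds in one call can be sketched as below. This is not the library's code: `filter_terms` is a hypothetical helper, and it assumes that p_value keeps rows whose `adj_p_value` is at or below the threshold and that combined_score keeps rows at or above it.

```python
rows = [
    {"term_name": "apoptosis", "adj_p_value": 0.001, "combined_score": 2365.1},
    {"term_name": "dna repair", "adj_p_value": 0.2, "combined_score": 900.0},
    {"term_name": "cell cycle", "adj_p_value": 0.01, "combined_score": 50.0},
]


def filter_terms(rows, p_value=None, combined_score=None):
    """Apply each threshold that was provided; skip the ones left as None."""
    kept = rows
    if p_value is not None:
        kept = [r for r in kept if r["adj_p_value"] <= p_value]
    if combined_score is not None:
        kept = [r for r in kept if r["combined_score"] >= combined_score]
    return kept


kept = filter_terms(rows, p_value=0.05, combined_score=100)
kept_names = [r["term_name"] for r in kept]
```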

filter_rows(column, options, inplace=False)

Filters a pandas dataframe given a column and filter selection.

Parameters

column [str]

options [str, list] Can be a single entry or a list



inplace [bool] Filter inplace

Returns

pd.DataFrame

find_similar_terms(term, level='sample', remove_subset=True)

Calculates similarity of all other terms to given term

Parameters

term [str]

level [str] Sample or dataframe level, flattens all terms to one set of genes

remove_subset [bool] If any term is a subset of the other term, a score of 1 will be used instead of the Jaccard index.

Returns

pd.DataFrame
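The similarity score described by remove_subset can be sketched with Python sets: the Jaccard index is the size of the intersection divided by the size of the union, and a subset relationship is scored as 1. The `similarity` helper is hypothetical and only illustrates the scoring rule, not how the method builds its gene sets.

```python
def similarity(genes_a, genes_b, remove_subset=True):
    """Jaccard index of two gene sets, with an optional subset shortcut."""
    a, b = set(genes_a), set(genes_b)
    if not a or not b:
        return 0.0
    if remove_subset and (a <= b or b <= a):
        # One term is fully contained in the other.
        return 1.0
    return len(a & b) / len(a | b)


overlap = similarity({"BAX", "BCL2", "CASP3"}, {"BAX", "BCL2", "CASP8"})
subset = similarity({"BAX", "BCL2"}, {"BAX", "BCL2", "CASP3"})
plain = similarity({"BAX", "BCL2"}, {"BAX", "BCL2", "CASP3"},
                   remove_subset=False)
```

The first pair shares 2 of 4 genes, giving 0.5; the second pair scores 1 with the shortcut and 2/3 without it.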

remove_redundant(threshold=0.75, verbose=False, level='sample', sort_by='combined_score', inplace=False)

Calculate similarity between all term sets and removes redundant terms.

Parameters

threshold [float, default 0.75]

verbose [bool, default False] Print similarity scores and removed terms.

level [{‘sample’, ‘dataframe’}, default ‘sample’] Level to filter dataframe. ‘sample’ will pivot the dataframe and filter each group of ‘sample_id’ individually. ‘dataframe’ will merge all genes that share the same ‘term_name’.

sort_by [{‘combined_score’, ‘rank’, ‘adj_p_value’, ‘n_genes’}, default ‘combined_score’] Keyword to sort the dataframe. The scoring starts at the top term and compares it to all the lower terms.

inplace [bool] Filter the dataframe in place or return filtered copy

Returns

pandas.DataFrame
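The procedure described above (sort the terms, start at the top term, compare against the lower ones, drop those that are too similar) can be sketched as a greedy loop. This is an assumption-laden illustration: the helper names are hypothetical, terms are given as (name, score, gene set) tuples, similarity is the plain Jaccard index, and a term is dropped when its similarity to an already kept term is at or above the threshold.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)


def remove_redundant(terms, threshold=0.75):
    """Greedily keep high-scoring terms that are not too similar to kept ones."""
    ranked = sorted(terms, key=lambda term: term[1], reverse=True)
    kept = []
    for name, score, genes in ranked:
        if all(jaccard(genes, kept_genes) < threshold
               for _, _, kept_genes in kept):
            kept.append((name, score, genes))
    return [name for name, _, _ in kept]


terms = [
    ("apoptosis", 2365.1, {"BAX", "BCL2", "CASP3", "CASP8"}),
    ("programmed cell death", 2313.5,
     {"BAX", "BCL2", "CASP3", "CASP8", "TP53"}),
    ("dna repair", 900.0, {"ATR", "ATM", "TP53", "CHEK1"}),
]
unique = remove_redundant(terms, threshold=0.75)
```

Here "programmed cell death" shares 4 of 5 genes with "apoptosis" (Jaccard 0.8) and is removed, while "dna repair" has no overlap with it and is kept.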

show_terms_below(term, level='dataframe', threshold=0.7, remove_subset=True)

Find terms that were removed by remove_redundant

Parameters

term [str]

level [str]

threshold [float]

remove_subset [bool]

Returns

EnrichmentResult

term_to_genes(term)

Get set of genes of provided term(s)

Parameters



term [str, list]

Returns

set

term_to_genes_dict(term_list=None)

Parameters

term_list [list]

Returns

OrderedDict

unique_terms(threshold=0.75, verbose=False, level=’dataframe’)

Parameters

threshold [float]

verbose [bool]

level [str, {‘dataframe’, ‘each’}]

This can be saved just like a pandas.DataFrame and loaded in using

magine.enrichment.enrichment_result.load_enrichment_csv(file_name, **args)

Load data into EnrichmentResult data class

Parameters

file_name [str]

Returns

EnrichmentResult

1.4.6 Plotting tools

Generate Heatmaps

magine.plotting.heatmaps.cluster_distance_mat(dist_mat, names, figsize=(8, 8))

Parameters

dist_mat [np.array] Distance matrix array.

names [list_like] Names of ticks for distance matrix

figsize [tuple] Size of figure, passed to matplotlib

magine.plotting.heatmaps.heatmap_by_terms(data, term_labels, term_sets, colors=None, min_sig=None, convert_to_log=False, y_tick_labels='auto', columns='sample_id', index='identifier', values='fold_change', linewidths=0, cluster_row=False, cluster_col=False, div_colors=False, num_colors=21, figsize=None, annotate_sig=False, **kwargs)

Parameters

data [pd.DataFrame]



term_labels [list_like] List of labels for grouping

term_sets [list_like] List of list like that create the terms

colors [list_like] Colors for plotting, if not provided it will be created

convert_to_log [bool]

y_tick_labels [list_like]

columns [str] Name of columns of df for pivot

index [str] Name of index of df for pivot

values [str] Name of values of df for pivot

cluster_col [bool] Cluster the data using seaborn.clustermap

cluster_row [bool] Cluster rows

div_colors [bool] Use divergent colors for plotting

figsize [tuple] Size of figure, passed to matplotlib/seaborn

num_colors [int] Number of colors for color bar

annotate_sig [bool] Add ‘*’ annotation to plot for significant changed terms

linewidths [float or None] Add white line between plots

min_sig [int] Minimum number of significant ‘index’ across samples. Can be used to remove rows that are not significant across any sample.

Returns

plt.Figure

magine.plotting.heatmaps.heatmap_from_array(data, convert_to_log=False, y_tick_labels='auto', cluster_row=False, cluster_col=False, columns='sample_id', index='term_name', values='combined_score', div_colors=False, num_colors=7, figsize=(6, 4), rank_index=False, annotate_sig=False, linewidths=0.0, cluster_by_set=False, min_sig=0)

Parameters

data [magine.data.base.BaseData]

convert_to_log [bool] Convert fold_change column to log2 scale

y_tick_labels [list_like]

columns [str] Name of columns of df for pivot

index [str] Name of index of df for pivot

values [str] Name of values of df for pivot

cluster_col [bool] Cluster the data using seaborn.clustermap

cluster_row [bool] Cluster the data using seaborn.clustermap

div_colors [bool] Use divergent colors for plotting



figsize [tuple] Size of figure, passed to matplotlib/seaborn

rank_index [bool] Order by index.

num_colors [int] Number of colors for color bar

annotate_sig [bool] Add ‘*’ annotation to plot for significant changed terms

linewidths [float or None] Add white line between plots

cluster_by_set [bool] Cluster by gene set column. Only works for enrichment_array

min_sig [int] Minimum number of significant ‘index’ across samples. Can be used to remove rows that are not significant across any sample.

Returns

plt.Figure

magine.plotting.species_plotting module

magine.plotting.species_plotting.plot_dataframe(exp_data, html_filename, out_dir='proteins', plot_type='plotly', run_parallel=False)

Creates

Parameters

exp_data [magine.BaseData.]

html_filename [str]

out_dir [str, path] Directory that will contain all proteins

plot_type [str] plotly or matplotlib output

run_parallel [bool] create plots in parallel

magine.plotting.species_plotting.plot_genes_by_ont(data, list_of_terms, save_name, out_dir=None, exp_data=None, run_parallel=False, plot_type='plotly')

Creates a figure for each GO term in data

BaseData should be a result of running calculate_enrichment. This function creates a plot of all proteins per term if the term is significant, the size of the reference set is larger than 5, and the total number of species measured is less than 100.

Parameters

data [pandas.DataFrame] previously ran enrichment analysis

list_of_terms [list_list]

save_name [str] name to save file

out_dir [str] output path for file

exp_data [magine.ExperimentalData] data to plot

run_parallel [bool] To run in parallel using pathos.multiprocessing

plot_type [str] plotly or matplotlib



Returns

out_array [dict] dict where keys are pointers to figure locations

magine.plotting.species_plotting.plot_species(df, species_list=None, save_name='test', out_dir=None, title=None, plot_type='plotly', image_format='pdf', close_plots=False)

Parameters

df [pandas.DataFrame] magine formatted dataframe

species_list [list] List of genes to be plotted

save_name [str] Filename to be saved as

out_dir [str] Path for output to be saved

title [str] Title of plot, useful when list of genes corresponds to a GO term

plot_type [str] Use plotly to generate html output or matplotlib to generate pdf

image_format [str] pdf or png, only used if plot_type=”matplotlib”

close_plots [bool] Close plot after making, use when creating lots of plots in parallel.

magine.plotting.species_plotting.write_table_to_html(data, save_name='index', out_dir=None, run_parallel=False, exp_data=None, plot_type='matplotlib')

Creates a html table of plots of genes for each ontology term.

Parameters

data [magine.enrichment.enrichment_result.EnrichmentResult]

save_name [str] name of html output file

out_dir [str, optional] output path for all plots

run_parallel [bool] Create plots in parallel

exp_data [magine.data.ExperimentalData]

plot_type [str {‘matplotlib’, ‘plotly’}]

magine.plotting.venn_diagram_maker module

magine.plotting.venn_diagram_maker.create_venn2(list1, list2, label1, label2, save_name=None, title=None, image_format='png', ax=None)

Creates a Venn diagram for 2 lists

Parameters

list1 [list_like]

list2 [list_like]

label1 [str]

label2 [str]

save_name [str]



title [str]

image_format [str, optional] default png

ax [matplotlib.axes]
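The numbers a two-set Venn diagram displays are the sizes of three regions: items only in the first list, items only in the second, and items in both. A minimal sketch of that bookkeeping, with a hypothetical `venn2_counts` helper and made-up gene lists (the drawing itself is left to create_venn2):

```python
def venn2_counts(list1, list2):
    """Return (only in list1, only in list2, in both) as counts."""
    set1, set2 = set(list1), set(list2)
    return len(set1 - set2), len(set2 - set1), len(set1 & set2)


proteins = ["BAX", "BCL2", "CASP3"]
rna = ["CASP3", "TP53"]
only_proteins, only_rna, shared = venn2_counts(proteins, rna)
```

Duplicates within a list are collapsed because the inputs are converted to sets first.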

magine.plotting.venn_diagram_maker.create_venn3(list1, list2, list3, label1, label2, label3, save_name=None, image_format='png', title=None, ax=None, colors=('g', 'r', 'b'))

Creates a Venn diagram for 3 lists

Parameters

list1 [list_like]

list2 [list_like]

list3 [list_like]

label1 [str]

label2 [str]

label3 [str]

save_name [str]

image_format [str] default png

title [str]

ax [matplotlib.axes]

magine.plotting.volcano_plots module

magine.plotting.volcano_plots.add_volcano_plot(fig_axis, section_0, section_1, section_2)

Adds a volcano plot to a fig axis

Parameters

fig_axis [plt.Figure.axes]

section_0 [pd.DataFrame]



magine.plotting.volcano_plots.create_mask(data, use_sig=True, p_value=0.1, fold_change_cutoff=1.5)

Creates a mask for volcano plots.

# Visual example of volcano plot
# section 0 are significant criteria

#    0    #    1    #    0    #
#         #         #         #
###############################
#         #         #         #
#    2    #    2    #    2    #
#         #         #         #
###############################

Parameters

data [pd.DataFrame]

use_sig [bool]

p_value [float] p_value threshold



fold_change_cutoff [float] fold change threshold
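The three sections in the diagram can be expressed as a small classification rule. The sketch below is one interpretation, not the library's code: `volcano_section` is a hypothetical helper, a point counts as passing when its p-value is at or below p_value and the absolute fold change is at or above fold_change_cutoff, and the use_sig option is ignored.

```python
def volcano_section(fold_change, p, p_value=0.1, fold_change_cutoff=1.5):
    """Assign a point to section 0, 1 or 2 of the volcano plot."""
    if p > p_value:
        return 2  # below the p-value line, regardless of fold change
    if abs(fold_change) >= fold_change_cutoff:
        return 0  # passes both criteria (upper left or upper right)
    return 1  # passes the p-value but not the fold change cutoff


# (fold_change, p_value) pairs used only for illustration.
points = [(3.0, 0.01), (-2.0, 0.05), (1.1, 0.01), (4.0, 0.5)]
sections = [volcano_section(fc, p) for fc, p in points]
```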

magine.plotting.volcano_plots.save_plot(fig, save_name, out_dir=None, image_type='png')

Saves fig

Parameters

fig [plt.Figure] Figure to be saved

save_name [str] output file name

out_dir [str, optional] output path

image_type [str, optional] output file type, e.g. “png” or “pdf”

magine.plotting.volcano_plots.volcano_plot(data, save_name=None, out_dir=None, sig_column=False, p_value=0.1, fold_change_cutoff=1.5, x_range=None, y_range=None)

Create a volcano plot of data

Creates a volcano plot of data type provided

Parameters

data [pandas.DataFrame] data to create volcano plots from








1.4.7 Other useful tools

magine.html_templates.html_tools module

magine.html_templates.html_tools.create_yadf_filters(table)

magine.html_templates.html_tools.format_ploty(text, save_name)

Parameters

text [str] html code to embed in file

save_name [str] html output filename

magine.html_templates.html_tools.write_filter_table(table, save_name)

Parameters

table [pandas.DataFrame]

save_name [str]



ID Mapping

Interface to mapping between species IDs.

We constructed two classes for mapping IDs:

1. GeneMapper

2. ChemicalMapper

Currently supported options

1. Genes: KEGG, HGNC, Uniprot, Entrez, ensembl_gene_id

2. Metabolites/compounds: kegg_id, name, accession, chebi_id, chemspider_id, biocyc_id, synonyms, pubchem_compound_id, protein_associations, inchikey, iupac_name, ontology, drugbank_id, chemical_formula, smiles, metlin_id, average_molecular_weight

class magine.mappings.ChemicalMapper(fresh_download=False)

Bases: object

Convert chemical species across various ids.

Database was created using HMDB

check_synonym_dict(term, format_name)

Checks hmdb database for synonyms and returns formatted name

Parameters

term [str]

format_name [str]

Returns

dict

Examples

>>> cm = ChemicalMapper()
>>> cm.check_synonym_dict(term='dodecene', format_name='main_accession')
['HMDB0000933', 'HMDB0059874']

chem_name_to_hmdb

convert_kegg_nodes(network)

Maps network from kegg to gene names

Parameters

network [nx.DiGraph]

Returns

dict

drugbank_to_hmdb

hmdb_accession_to_main

hmdb_main_to_protein

hmdb_to_kegg

hmdb_to_protein



print_info()

Print information about the dataframe

valid_columns = ['kegg_id', 'name', 'accession', 'chebi_id', 'inchikey', 'chemspider_id', 'biocyc_id', 'synonyms', 'iupac_name', 'pubchem_compound_id', 'protein_associations', 'ontology', 'drugbank_id', 'chemical_formula', 'smiles', 'metlin_id', 'average_molecular_weight', 'secondary_accessions']

class magine.mappings.GeneMapper(species='hsa')

Bases: object

Mapping class between common gene ids

Database was created by pulling data from NCBI, UNIPROT, HGNC

check_synonym_dict(term, format_name)

Checks hmdb database for synonyms and returns formatted name

Parameters

term [str]

format_name [str]

Returns

dict

convert_kegg_nodes(network, species='hsa')

Convert kegg ids to HGNC gene symbol.

Parameters

network [nx.DiGraph]

species [str {‘hsa’}] Main support for humans only.

Returns

kegg_to_gene_name, kegg_short [dict, dict]

gene_name_to_alias_name

gene_name_to_ensembl

gene_name_to_kegg

gene_name_to_uniprot

kegg_to_gene_name

kegg_to_hugo(genes, species='hsa')

Converts all KEGG names to HGNC

Parameters

genes [list]

species [str]

Returns

dict

kegg_to_symbol_through_uniprot(unknown_genes)

kegg_to_uniprot

ncbi_to_symbol

uniprot_to_gene_name

uniprot_to_kegg



Database Mapping

Interface to downloading ID mapping databases.

Databases supported

UniProt: https://www.uniprot.org/

NCBI: https://www.ncbi.nlm.nih.gov/

HMDB: http://www.hmdb.ca/

magine.mappings.databases.load_hgnc()

magine.mappings.databases.load_ncbi()

magine.mappings.databases.load_uniprot()

class magine.mappings.databases.HMDB

Bases: object

Downloads and processes HMDB metabolites database

http://www.hmdb.ca/

download_db(fresh_download)

Parse HMDB to pandas.DataFrame

load_db(fresh_download=False)



CHAPTER 2

Indices and tables

• genindex

• modindex

• search


Python Module Index

magine.html_templates.html_tools

magine.mappings

magine.mappings.databases

magine.networks.databases

magine.networks.network_generator

magine.plotting.heatmaps

magine.plotting.species_plotting

magine.plotting.venn_diagram_maker

magine.plotting.volcano_plots

63


64 Python Module Index

Index

Aadd_volcano_plot() (in module mag-

ine.plotting.volcano_plots), 56all_genes_from_df() (mag-

ine.enrichment.enrichment_result.EnrichmentResultmethod), 49

BBaseData (class in magine.data.base), 39build_network() (in module mag-

ine.networks.network_generator), 44by_sample (magine.data.experimental_data.Sample

attribute), 40

Ccalc_dist() (magine.enrichment.enrichment_result.EnrichmentResult

method), 50check_synonym_dict() (mag-

ine.mappings.ChemicalMapper method),58

check_synonym_dict() (mag-ine.mappings.GeneMapper method), 59

chem_name_to_hmdb (mag-ine.mappings.ChemicalMapper attribute),58

ChemicalMapper (class in magine.mappings), 58clean_drug_pert_geo() (in module mag-

ine.enrichment.enrichr), 49clean_lincs() (in module mag-

ine.enrichment.enrichr), 49clean_term_names() (in module mag-

ine.enrichment.enrichr), 49clean_tf_names() (in module mag-

ine.enrichment.enrichr), 49cluster_distance_mat() (in module mag-

ine.plotting.heatmaps), 52compounds (magine.data.experimental_data.ExperimentalData

attribute), 43

convert_kegg_nodes() (mag-ine.mappings.ChemicalMapper method),58

convert_kegg_nodes() (mag-ine.mappings.GeneMapper method), 59

create_background_network() (in module mag-ine.networks.network_generator), 45

create_mask() (in module mag-ine.plotting.volcano_plots), 56

create_summary_table() (mag-ine.data.experimental_data.ExperimentalDatamethod), 43

create_venn2() (in module mag-ine.plotting.venn_diagram_maker), 55

create_venn3() (in module mag-ine.plotting.venn_diagram_maker), 56

create_yadf_filters() (in module mag-ine.html_templates.html_tools), 57

Ddist_matrix() (mag-


down (magine.data.experimental_data.Sample at-tribute), 40

down_by_sample (mag-ine.data.experimental_data.Sample attribute),41

download_biogrid() (in module mag-ine.networks.databases), 46

download_db() (magine.mappings.databases.HMDBmethod), 60

download_reactome_fi() (in module mag-ine.networks.databases), 46

download_signor() (in module mag-ine.networks.databases), 46

drugbank_to_hmdb (mag-ine.mappings.ChemicalMapper attribute),58

65


EEnrichmentResult (class in mag-

ine.enrichment.enrichment_result), 49Enrichr (class in magine.enrichment.enrichr), 47exp_methods (magine.data.experimental_data.ExperimentalData

attribute), 43exp_methods (magine.data.experimental_data.Sample

attribute), 41expand_by_db() (in module mag-

ine.networks.network_generator), 45ExperimentalData (class in mag-

ine.data.experimental_data), 43

Ffilter_based_on_words() (mag-


filter_multi() (mag-ine.enrichment.enrichment_result.EnrichmentResultmethod), 50

filter_rows() (mag-ine.enrichment.enrichment_result.EnrichmentResultmethod), 50

find_similar_terms() (mag-ine.enrichment.enrichment_result.EnrichmentResultmethod), 51

format_ploty() (in module mag-ine.html_templates.html_tools), 57

Ggene_name_to_alias_name (mag-

ine.mappings.GeneMapper attribute), 59gene_name_to_ensembl (mag-

ine.mappings.GeneMapper attribute), 59gene_name_to_kegg (mag-

ine.mappings.GeneMapper attribute), 59gene_name_to_uniprot (mag-

ine.mappings.GeneMapper attribute), 59GeneMapper (class in magine.mappings), 59genes (magine.data.experimental_data.ExperimentalData

attribute), 43get_background_list() (in module mag-

ine.enrichment.enrichr), 49get_measured_by_datatype() (mag-

ine.data.experimental_data.ExperimentalDatamethod), 43

Hheatmap() (magine.data.base.BaseData method), 39heatmap_by_terms() (in module mag-

ine.plotting.heatmaps), 52heatmap_from_array() (in module mag-

ine.plotting.heatmaps), 53

HMDB (class in magine.mappings.databases), 60hmdb_accession_to_main (mag-

ine.mappings.ChemicalMapper attribute),58

hmdb_main_to_protein (mag-ine.mappings.ChemicalMapper attribute),58

hmdb_to_kegg (magine.mappings.ChemicalMapperattribute), 58

hmdb_to_protein (mag-ine.mappings.ChemicalMapper attribute),58

Iid_list (magine.data.experimental_data.Sample at-

tribute), 41

Kkegg_to_gene_name (mag-

ine.mappings.GeneMapper attribute), 59kegg_to_hugo() (magine.mappings.GeneMapper

method), 59kegg_to_symbol_through_uniprot() (mag-

ine.mappings.GeneMapper method), 59kegg_to_uniprot (magine.mappings.GeneMapper

attribute), 59

L

label_list (magine.data.experimental_data.Sample attribute), 41

load_biogrid_network() (in module magine.networks.databases), 46

load_db() (magine.mappings.databases.HMDB method), 60

load_enrichment_csv() (in module magine.enrichment.enrichment_result), 52

load_hgnc() (in module magine.mappings.databases), 60

load_hmdb_network() (in module magine.networks.databases), 46

load_kegg() (in module magine.networks.databases), 46

load_kegg_mappings() (in module magine.networks.databases), 46

load_ncbi() (in module magine.mappings.databases), 60

load_reactome_fi() (in module magine.networks.databases), 45

load_signor() (in module magine.networks.databases), 46

load_uniprot() (in module magine.mappings.databases), 60

log2_normalize_df() (magine.data.base.BaseData method), 39

M

magine.html_templates.html_tools (module), 57

magine.mappings (module), 58

magine.mappings.databases (module), 60

magine.networks.databases (module), 45

magine.networks.network_generator (module), 44

magine.plotting.heatmaps (module), 52

magine.plotting.species_plotting (module), 54

magine.plotting.venn_diagram_maker (module), 55

magine.plotting.volcano_plots (module), 56

N

ncbi_to_symbol (magine.mappings.GeneMapper attribute), 59

P

pivoter() (magine.data.base.BaseData method), 40

plot_all() (magine.data.experimental_data.Sample method), 41

plot_dataframe() (in module magine.plotting.species_plotting), 54

plot_genes_by_ont() (in module magine.plotting.species_plotting), 54

plot_histogram() (magine.data.experimental_data.Sample method), 41

plot_pie_sig_ratio() (magine.data.experimental_data.Sample method), 41

plot_species() (in module magine.plotting.species_plotting), 55

plot_species() (magine.data.experimental_data.Sample method), 41

present_in_all_columns() (magine.data.base.BaseData method), 40

print_info() (magine.mappings.ChemicalMapper method), 58

print_valid_libs() (magine.enrichment.enrichr.Enrichr method), 47

proteins (magine.data.experimental_data.ExperimentalData attribute), 43

R

remove_redundant() (mag-

require_n_sig() (magine.data.base.BaseData method), 40

rna (magine.data.experimental_data.ExperimentalData attribute), 43

run() (magine.enrichment.enrichr.Enrichr method), 47

run_enrichment_for_project() (in module magine.enrichment.enrichr), 48

run_samples() (magine.enrichment.enrichr.Enrichr method), 47

S

Sample (class in magine.data.experimental_data), 40

sample_ids (magine.data.experimental_data.ExperimentalData attribute), 43

sample_ids (magine.data.experimental_data.Sample attribute), 42

save_plot() (in module magine.plotting.volcano_plots), 57

show_terms_below() (mag-

sig (magine.data.base.BaseData attribute), 40

species (magine.data.experimental_data.ExperimentalData attribute), 43

subset() (magine.data.experimental_data.ExperimentalData method), 44

subset() (magine.data.experimental_data.Sample method), 42

T

term_to_genes() (mag-

term_to_genes_dict() (magine.enrichment.enrichment_result.EnrichmentResult method), 52

U

uniprot_to_gene_name (magine.mappings.GeneMapper attribute), 59

uniprot_to_kegg (magine.mappings.GeneMapper attribute), 59

unique_terms() (mag-

up (magine.data.experimental_data.Sample attribute), 42

up_by_sample (magine.data.experimental_data.Sample attribute), 42

V

valid_columns (magine.mappings.ChemicalMapper attribute), 59

volcano_analysis() (magine.data.experimental_data.ExperimentalData method), 44

volcano_by_sample() (magine.data.experimental_data.Sample method), 42

volcano_plot() (in module magine.plotting.volcano_plots), 57

volcano_plot() (magine.data.experimental_data.Sample method), 42

W

write_filter_table() (in module magine.html_templates.html_tools), 57

write_table_to_html() (in module magine.plotting.species_plotting), 55

