analysis of large groups of genes petri toronen [email protected]
TRANSCRIPT
Analysis of large groups of genes
Petri [email protected]
Content
• PART I: Analysis of groups of genes with functional classes
– Do I need to analyze large gene groups?
– Functional classification of genes
– Over-representation analysis methods
– Gene Set Enrichment Analysis (GSEA) methods
– What comes out?
– Further applications
– When these methods will fail
– Conclusion
Content
• PART II: Analysis of gene groups with gene networks
– Why gene networks
– What are different gene/protein interactions
– Where can we get the data
– Usage of interaction datasets
– Flaws of interaction datasets
– Conclusions
PART I: Analysis of groups of genes with functional classes
Content
• Do I need to analyze large gene groups?
• Functional classification of genes
• Over-representation analysis methods
• Gene Set Enrichment Analysis (GSEA) methods
• What comes out?
• Further applications
• When these methods will fail
• Conclusion
Do I need to analyze large gene groups?
• Biology is becoming more and more high-throughput
• Understanding a list of hundreds of genes (N*100) is hard
• Errors in data
• Popular in the literature; reviewers might require it
• These methods can ease the data analysis
Examples of large-scale datasets
• Gene expression datasets
• Large-scale genome-wide association studies (GWAS)
• Environmental microbial samples (metagenomics)
• Comparing two or more genomes for missing genes (phylogenetic footprinting)
Errors in data
• All high-throughput methods have been shown to also generate erroneous results
• Technical biases, sample preparation, dyes, different lab workers
• Variances between laboratories, sample preparation in the lab
• Biological variation between the individuals, samples
Errors in data
=> An interesting gene might not be a reliable observation!
Thinking about the data analysis
• Biologists usually look at the obtained genes
• But genes are often not the main interest!
• Usually it is the biological processes, the functions, that we are interested in
• Active genes can vary while the same process is active (Linghu et al. 2009; Efron & Tibshirani 2007; Subramanian et al. 2005)
http://arxiv.org/pdf/math/0610667.pdf
http://www.biomedcentral.com/content/pdf/gb-2009-10-9-r91.pdf
Thinking about the data analysis
• Monitoring genes requires detailed knowledge of gene functions
– Genes can have multiple functions
– A secondary function might go unnoticed
• Objectivity
– We easily find what we expected to find
Over-representation analysis
• Motivation for over-representation analysis
• Over-representation analysis methods
• What over-representation analysis always requires
• What the results look like
Typical analysis pipeline
Gene expression analysis:
1. Generate the GE data
2. Pre-processing (normalization etc.)
3. Define differentially expressed genes
Over-representation analysis:
4. Collect functional annotations for all the genes
5. Compare functional classes with the analyzed gene group. Use selected statistical score
6. Select over-represented biological processes
Information on gene functions
• Gene functions can be represented as categorizations:
– A gene belongs to a category when it has that function
– A gene can belong to many categories
– Some categories can lie within each other
• Smaller, more precise categories are nested within a larger category
• Different levels of information on gene function
Information on gene function
• Where do we get the functional categorization?
• We use existing functional classifications from databases that provide them
– Biological Processes
– Molecular Functions …
• A laboratory can also generate its own functional classifications
[Figure: genes (G1, G2, G3, G4) linked to functional classes (Cell cycle, Apoptosis, Neurogenesis, Cell death, ATPase activity)]
Information on the functions can be obtained, for example, from a genome database (MIPS, SGD)
[Figure: genes (G1-G4) linked to functional classes (T1-T5)]
The obtained association data can be turned into a binary matrix: 1 indicates association and 0 no association

      T1   T2   T3   …
G1    1    0    1    …
G2    0    1    0    …
G3    0    0    0    …
G4    0    0    0    …
…     …    …    …    …
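The conversion from gene-to-class associations into a binary matrix can be sketched in a few lines of Python. This is only an illustration: the function name `build_annotation_matrix` and the toy annotations are assumptions, chosen to mirror the small matrix on this slide.

```python
def build_annotation_matrix(annotations):
    """Turn {gene: set of classes} into (genes, classes, 0/1 matrix)."""
    genes = sorted(annotations)
    classes = sorted({c for labels in annotations.values() for c in labels})
    # One row per gene, one column per class: 1 = association, 0 = none.
    matrix = [[1 if c in annotations[g] else 0 for c in classes] for g in genes]
    return genes, classes, matrix

# Hypothetical toy annotations (G3 and G4 have no annotated class here).
annotations = {
    "G1": {"T1", "T3"},
    "G2": {"T2"},
    "G3": set(),
    "G4": set(),
}
genes, classes, matrix = build_annotation_matrix(annotations)
print("  ", classes)
for gene, row in zip(genes, matrix):
    print(gene, row)
```

Real annotation matrices are very sparse, so a sparse representation (sets per class) is usually more practical than a dense list of lists.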
Functional annotation standards
• Gene Ontology (GO)
• Reactome (Biochemical Pathways)
• KEGG (Kyoto Encyclopedia of Genes and Genomes) (not freely available anymore)
• SwissProt keywords
• MIPS FunCat (Functional Catalogue)
• Molecular Signatures Database (MSigDB, MIT)
=> You can use many of these simultaneously
Benefits of Gene Ontology (GO)
• Many gene features are covered
– Biological Processes
– Molecular Function
– Cellular Component
• Sophisticated hierarchical structure
– Both detailed and broad categories are included in the structure
• Most used, popular standard
• Many organisms are covered
• Information source is reported
– GO Evidence Codes
Drawbacks of GO
• Unstable structure
– New classes appear and old ones disappear
• Most annotations are not manually evaluated*
– Some annotations are bound to be wrong
• Gene function might vary, for example between tissues*
• Genes are roughly categorized into classes*
– In reality genes are more or less relevant to some function / pathway
– A gene from a pathway can up- or down-regulate the synthesis
*The problem is not specific to GO. It applies to all functional classifications
Motivation of the over-representation analysis
• Aim is to summarize gene functions observed in the analyzed group of genes
• It would be intuitive to report the most common functional classes from the group of genes
• However, the most common classes are often the most common also in the background
• We would rather report the most over-represented classes
Over-representation analysis methods
• Over-representation analysis looks for classes that are more frequent in the gene group than in the background
• Most popular tests are based on sampling without replacement
[Figure: a sample of balls drawn from the whole data]
Sampling w/o replacement answers the question:
In how many ways can 8 balls be selected from the whole data so that two of them are white and the rest are black?
Sampling without replacement can be turned into a p-value using Fisher’s exact test (or the hypergeometric test).
This is done by summing the probability of the observed outcome (6 black balls) with the probabilities of the more extreme outcomes (7 and 8 black balls). This is a standard method available in many web tools.
[Figure: the probabilities of every possible outcome in the example] The p-value for the previous sample (6 black balls) is obtained by summing the three leftmost bars. p-value: 0.621
http://en.wikipedia.org/wiki/Hypergeometric_distribution
http://en.wikipedia.org/wiki/Fisher%27s_exact_test
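The summation over the observed and more extreme outcomes can be written directly with binomial coefficients. The following is a minimal sketch using only the Python standard library. The urn composition in the usage lines (20 balls, 12 of them black) is a hypothetical assumption, since the slide does not state the size of the whole data, so the printed value is not expected to match the p-value on the slide.

```python
from math import comb

def hypergeom_pmf(x, K, N, L):
    """P(exactly x class members) in a sample of K items drawn without
    replacement from N items, of which L are class members."""
    return comb(L, x) * comb(N - L, K - x) / comb(N, K)

def hypergeom_upper_tail(x, K, N, L):
    """P(X >= x): the observed outcome plus all more extreme outcomes."""
    return sum(hypergeom_pmf(i, K, N, L) for i in range(x, min(K, L) + 1))

# Hypothetical urn: 20 balls, 12 black; we draw 8 and observe 6 black.
p_value = hypergeom_upper_tail(6, K=8, N=20, L=12)
print(f"p-value = {p_value:.4f}")
```

For large datasets a log-space implementation (or a statistics library) is preferable, because the binomial coefficients become very large integers.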
Over-representation analysis methods
• Various methods used:
– Hypergeometric test
– Binomial test
– Chi Square test (bad)
• Less used methods reporting the same:
– Jaccard correlation
– Log-likelihood ratio, a.k.a. G-statistics (Mutual Information)
http://en.wikipedia.org/wiki/Binomial_test
http://en.wikipedia.org/wiki/G-test
http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test
http://en.wikipedia.org/wiki/Jaccard_index
What statistics calculate
• Hypergeometric test is based on sampling w/o replacement
• Binomial test is based on sampling with replacement
– This model holds when the dataset is much larger than the sample
– Binomial test is often used to approximate the hypergeometric test
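How close the binomial approximation gets can be checked numerically. The sketch below compares the two upper-tail p-values for a sample that is small relative to the dataset. All numbers are hypothetical and the function names are illustrative.

```python
from math import comb

def hypergeom_upper_tail(x, K, N, L):
    """P(X >= x) when sampling K of N items without replacement (L class members)."""
    return sum(comb(L, i) * comb(N - L, K - i)
               for i in range(x, min(K, L) + 1)) / comb(N, K)

def binomial_upper_tail(x, K, p):
    """P(X >= x) when sampling K items with replacement, class probability p."""
    return sum(comb(K, i) * p**i * (1 - p)**(K - i) for i in range(x, K + 1))

# Hypothetical case: dataset of 100000 items, 10 % class members, sample of 20.
N, L, K, x = 100_000, 10_000, 20, 5
p_hyper = hypergeom_upper_tail(x, K, N, L)
p_binom = binomial_upper_tail(x, K, L / N)
print(p_hyper, p_binom)  # nearly identical because N is much larger than K
```

When the gene group is a sizeable fraction of the whole dataset the two values drift apart, and the hypergeometric test is the appropriate one.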
What statistics calculate
• Log likelihood ratio (LLR, G-test) is based on the ratio of likelihoods between two models:
– Maximum Likelihood model
– Null model
It approximates the binomial test
• Chi Square test approximates LLR. Chi Square test is popular but it is not recommended.
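For a single class, both statistics can be computed from the observed count X, the group size K and the background probability P. The sketch below uses the standard goodness-of-fit forms of the G-statistic and the Pearson chi-square statistic with two outcomes (class member or not). The example numbers are hypothetical.

```python
from math import log

def g_statistic(X, K, P):
    """G = 2 * sum(observed * ln(observed / expected)) over the two outcomes."""
    total = 0.0
    for observed, expected in ((X, K * P), (K - X, K * (1 - P))):
        if observed > 0:  # a zero count contributes nothing to the sum
            total += observed * log(observed / expected)
    return 2.0 * total

def chi_square_statistic(X, K, P):
    """Pearson chi-square: sum((observed - expected)^2 / expected)."""
    return sum((observed - expected) ** 2 / expected
               for observed, expected in ((X, K * P), (K - X, K * (1 - P))))

# Hypothetical example: 30 class members in a group of 100, background P = 0.2.
print(g_statistic(30, 100, 0.2), chi_square_statistic(30, 100, 0.2))
```

Both statistics are compared against a chi-square distribution with one degree of freedom. They differ most when the expected counts are small, which is a typical situation with small functional classes.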
What statistics calculate
• Jaccard correlation (JC) is different. It is the ratio between:
– the size of the intersection (∩) between functional class and gene group (number of genes in both sets)
– the size of the union (∪) between functional class and gene group (number of genes in either set)
• JC = |A ∩ B| / |A ∪ B|
• JC does not tell how significant the result is
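With Python sets the ratio is a one-liner. The gene names below are made up for illustration.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

gene_group = {"G1", "G2", "G3"}              # hypothetical analyzed gene group
functional_class = {"G2", "G3", "G4", "G5"}  # hypothetical functional class
print(jaccard(gene_group, functional_class))  # 2 shared genes / 5 genes in total
```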
Over-representation analysis methods: What values methods need
• Hypergeometric test (= Fisher’s exact test, hypergeometric p-value)
– X: Number of class members in the gene group
– K: Size of the gene group
– N: Size of the whole dataset
– L: Number of class members in the whole dataset
• Binomial p-value, log-likelihood test (= G-statistics), Chi Square test
– X: Number of class members in the gene group
– K: Size of the gene group
– P: Probability of class members in the background (P = L/N)
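Putting the four values together, a complete over-representation run over several classes might look like the sketch below. It uses the symbols X, K, N and L from this slide. The function names, the toy background and the toy classes are assumptions made for illustration, and the expected count and its standard deviation are the usual hypergeometric moments.

```python
from math import comb, sqrt

def expected_and_std(K, N, L):
    """Mean and standard deviation of X under the hypergeometric null."""
    p = L / N
    variance = K * p * (1 - p) * (N - K) / (N - 1) if N > 1 else 0.0
    return K * p, sqrt(variance)

def over_representation(gene_group, background, classes):
    """Score every class against the gene group; rows sorted by p-value."""
    background = set(background)
    group = set(gene_group) & background
    N, K = len(background), len(group)
    rows = []
    for name, members in classes.items():
        members = set(members) & background
        L = len(members)
        X = len(members & group)
        # Hypergeometric upper tail: P(at least X class members in the group).
        p = sum(comb(L, i) * comb(N - L, K - i)
                for i in range(X, min(K, L) + 1)) / comb(N, K)
        expected, std = expected_and_std(K, N, L)
        rows.append({"class": name, "p": p, "L": L, "X": X,
                     "expected": expected, "std": std})
    return sorted(rows, key=lambda row: row["p"])

# Hypothetical toy data: 20 background genes, two classes, a group of 5 genes.
background = [f"g{i}" for i in range(20)]
classes = {"c1": background[:5], "c2": background[5:15]}
group = background[:4] + ["g10"]
rows = over_representation(group, background, classes)
for row in rows:
    print(row)
```

Restricting both the gene group and the classes to the background set keeps N well defined, which matters for the result.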
What the results look like
• Sorted lists of significant functional classes
• Visualization of classes in the structure of the functional class hierarchy
Two examples of sorted class lists
Example 1:
functional class                      -log10(P)   P          class size   observed   expected   STD of expected
TRANSCRIPTION                         57.94446    0          601          282        131.9268   8.822874
mRNA transcription                    31.11666    0          444          196        97.46342   7.897141
nuclear organization                  27.23162    0          648          245        142.2439   9.044816
mRNA synthesis                        22.65499    0          343          151        75.29268   7.1128
rRNA transcription                    16.981      0          98           60         21.5122    4.01593
transcriptional control               16.73279    0          271          118        59.48781   6.428959
rRNA processing                       11.33614    0          58           37         12.73171   3.115542
mRNA processing (splicing)            8.325967    0          66           36         14.48781   3.31793
rRNA synthesis                        6.903157    0          37           23         8.121951   2.499256
tRNA transcription                    5.381251    0.000004   72           33         15.80488   3.461119
general transcription activities     4.479288    0.000033   59           27         12.95122   3.141631

Example 2:
functional class                      -log10(P)   P          class size   observed   expected   STD of expected
mitochondrial organization            33.52572    0          303          88         23.64878   4.373248
respiration                           9.47456     0          68           23         5.307317   2.181689
PROTEIN SYNTHESIS                     6.225534    0.000001   299          47         23.33659   4.348312
ribosomal proteins                    5.816049    0.000002   173          32         13.50244   3.402624
ENERGY                                4.188443    0.000065   169          28         13.19024   3.365997
assembly of protein complexes         3.630167    0.000234   86           17         6.712195   2.44426
CELLULAR ORGANIZATION                 2.494072    0.003206   1806         157        140.9561   5.879028
mRNA processing (splicing)            1.933813    0.011646   66           11         5.15122    2.150264
regulation of phosphate utilization   1.70902     0.019543   8            3          0.62439    0.75764

(class size = size of the class, observed = obs. number of genes, expected = expected number of genes)
Graphical output
• Classes with high over-representation can be visualized using the GO structure
• Well-scoring classes are highlighted
Better pipeline: Permutations
1. Collect functional annotations for all the genes
2. Compare functional classes with the analyzed gene group. Use selected statistical score
3. Select over-represented biological processes
4. Permute (mix) the annotations randomly across the genes and repeat steps 2-3 with the permuted annotations
5. Repeat the permutation N*100 times
6. Compare the results with correct data to the permuted data. Report only classes that have a better score than the permutations
Tolvanen et al. 2009 http://www.sciencedirect.com/science/article/pii/S1532046408001445
Permutations
• Sanity check: What results can be obtained from random data?
• First method (row permutation, gene permutation):
• Repeat the following steps, say 200 times:
– Permute the functional annotations
– Rerun the analysis
– Store the results
• Compare the results from true data to the permuted data
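The row permutation loop can be sketched as follows. This is a simplified illustration, not the procedure of any cited paper: it tracks only the best (smallest) hypergeometric p-value over all classes and reports how often permuted annotations reach a result at least as good. The function names and the toy data are hypothetical.

```python
import random
from math import comb

def upper_tail(X, K, N, L):
    """Hypergeometric P(at least X class members in a group of K)."""
    return sum(comb(L, i) * comb(N - L, K - i)
               for i in range(X, min(K, L) + 1)) / comb(N, K)

def best_p(group, annotation):
    """Smallest over-representation p-value over all classes."""
    N, K = len(annotation), len(group)
    classes = {}
    for gene, labels in annotation.items():
        for label in labels:
            classes.setdefault(label, set()).add(gene)
    return min(upper_tail(len(m & group), K, N, len(m)) for m in classes.values())

def row_permutation_pvalue(group, annotation, n_perm=200, seed=0):
    """Fraction of permutations whose best p-value matches or beats the real one."""
    rng = random.Random(seed)
    genes = sorted(annotation)
    observed = best_p(group, annotation)
    hits = 0
    for _ in range(n_perm):
        shuffled = genes[:]
        rng.shuffle(shuffled)
        # Each gene receives the annotation set of a randomly chosen gene.
        permuted = {g: annotation[s] for g, s in zip(genes, shuffled)}
        if best_p(group, permuted) <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical toy data: 40 genes, class A = g0..g4, class B = g20..g29.
annotation = {f"g{i}": set() for i in range(40)}
for i in range(5):
    annotation[f"g{i}"].add("A")
for i in range(20, 30):
    annotation[f"g{i}"].add("B")
group = {f"g{i}" for i in range(5)}
print(row_permutation_pvalue(group, annotation))
```

Shuffling whole annotation sets between genes keeps the class sizes and the overlaps between classes intact, so only the link between genes and annotations is randomized.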
Permutations
• Second method (column permutation, sample permutation):
• Repeat the following steps, say 200 times:
– Permute the labels of (gene expression) samples
– Rerun the (gene expression) data analysis
– Take the significantly regulated genes
– Rerun the over-representation analysis with regulated genes
• For each class, compare the results from true data to the permuted data
• Heavier but still the preferred permutation idea
• Might not always be applicable
[Figure: Permutations. Row randomization permutes the class data; column randomization permutes the sample labels of the expression data]
Permutations
• Tian et al. 2005, PNAS (http://www.pnas.org/content/102/38/13544.long)
• Efron & Tibshirani, 2007, Annals of Applied Statistics (http://arxiv.org/pdf/math/0610667.pdf)
• Goeman & Buhlmann, 2007, Bioinformatics (http://arxiv.org/pdf/math/0610667.pdf)
• Törönen et al. 2009, BMC Bioinf. (http://www.biomedcentral.com/1471-2105/10/307)
Some say one is better than the other. Tian et al. and I propose: test both and expect the worse result to be true.
Multiple Testing Problem
• We run over-representation analysis with many thousands of classes. What can go wrong?
• Think of throwing dice.
– With one die the probability of getting a 6 is 1/6
– With 10 dice the probability of getting at least one 6 is much larger (1 - (5/6)^10 ≈ 0.84)
• The same phenomenon occurs with functional classes when we look at the best-scoring classes.
• This needs to be corrected
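The dice arithmetic generalizes to any number of independent tests, which shows how quickly chance findings accumulate. A minimal sketch (the 0.05 threshold and the class counts are illustrative assumptions, and real functional classes are not independent of each other):

```python
def prob_at_least_one(p_single, n_tests):
    """Probability of at least one 'hit' in n independent tests."""
    return 1 - (1 - p_single) ** n_tests

print(round(prob_at_least_one(1 / 6, 10), 3))  # ten dice, at least one six: 0.838

# The same effect with a p-value threshold of 0.05 and a growing number of classes.
for n_classes in (1, 10, 100, 1000):
    print(n_classes, prob_at_least_one(0.05, n_classes))
```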
Multiple Testing Problem
• Several alternatives for correcting the phenomenon
• These require the best obtained results to be better than what could be expected at random from the set of N experiments.
– Holm’s correction
– Bonferroni correction
– False Discovery Rate (FDR)
– …others…
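The three named corrections can each be written as a short function that turns raw p-values into adjusted p-values. This is a plain-Python sketch of the textbook procedures, with the Benjamini-Hochberg method used as the FDR variant (an assumption, since the slide does not say which FDR procedure is meant). The example p-values are made up.

```python
def bonferroni(pvals):
    """Multiply every p-value by the number of tests (capped at 1)."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def holm(pvals):
    """Step-down Holm correction: smallest p times m, next times m-1, ..."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted, running_max = [0.0] * m, 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max  # keep adjusted values monotone
    return adjusted

def benjamini_hochberg(pvals):
    """Step-up FDR adjustment: p * m / rank, made monotone from the largest p."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted, running_min = [0.0] * m, 1.0
    for k, i in enumerate(order):
        rank = m - k  # 1-based rank in ascending order
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.01, 0.04, 0.03, 0.005]  # hypothetical raw p-values of four classes
print(bonferroni(pvals))
print(holm(pvals))
print(benjamini_hochberg(pvals))
```

Bonferroni and Holm control the chance of any false positive, while FDR controls the expected share of false positives among the reported classes and is therefore less strict.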
Further ideas
Simple extensions to over-representation analysis:
• Search of clusters from hierarchical cluster tree
• Analysis of functional subgroups from the generated gene list
Analyzing the heterogeneity of the gene list
Pehkonen et al., BMC Bioinformatics 2005
[Figure: the gene list (G1-G8) linked to functional classes (T1-T10)]
Analyzing a gene list as one entity for over-represented gene classes is standard.
Problem:
The best scoring gene group tends to overwhelm the results.
Our question:
Can the gene list be grouped in a reasonable way using the functional information?
Aim:
Help to find the heterogeneous gene groups nested in the gene list
Proposed method
• Cluster the gene list using ontological data with Non-negative Matrix Factorization (NMF)
– shown to perform well on sparse binary matrices
– note that the clustering is only based on the functional annotations
– no expression data or sequence similarity is used
• Vary the cluster number and highlight the features that stay similar although the number of clusters changes
• Present the most over-represented classes for each cluster
Obtained graphical output
analyzed list was yeast genes that were important for the growth of yeast in H2O2 stress (Thorpe, PNAS, 2004)
Lines present binary correlation (0, no correlation, 1, clusters match exactly)
Each cluster shows functional classes that were over-represented in the original gene list and which are over-represented in the created cluster.
More details will be shown for this clustering
3 best classes for each cluster
<Cluster 1 size=11>
RNA ligase activity
tRNA ligase activity
ligase activity, forming aminoacyl-tRNA and related compounds
<Cluster 2 size=33>
mitochondrial genome maintenance
mitochondrion organization and biogenesis
mitochondrial chromosome
<Cluster 3 size=14>
mitochondrial membrane
inner membrane
mitochondrial inner membrane
<Cluster 4 size=31>
organellar ribosome
mitochondrial ribosome
mitochondrial matrix
<Cluster 5 size=20>
transcription regulator activity
nucleobase, nucleoside, nucleotide and nucleic acid metabolism
mediator complex
Competing methods: sorted class list
mitochondrion
mitochondrial matrix
organellar ribosome
mitochondrial ribosome
protein biosynthesis
organellar large ribosomal subunit
mitochondrial large ribosomal subunit
macromolecule biosynthesis
structural constituent of ribosome
ribosome
biosynthesis
structural molecule activity
ribonucleoprotein complex
protein metabolism
metabolism
etc…
Sorted class list gives an impression that the list includes only mitochondrial ribosome proteins
Summary of theme discovery
• We have used clustering to look for the heterogeneous gene groups from the list of genes
• sorted class lists have a tendency to highlight only the most enriched functional classes
• the user gets the impression that the gene group is homogeneous
• Method would be especially useful with the analysis of large gene lists
Motivation for next work
• clustering is a standard procedure in gene expression data analysis
• clusters are often monitored for over-representation of genes from the same (GO) functional class
• strong over-representation indicates an interesting clustering solution
But
• we usually face the problem of selecting the clustering method and parameters (cluster numbers etc.)
– these result in several clustering solutions
• Many clustering solutions create a massive analysis task
If a good/interesting cluster is considered to be one that produces a strong correlation to functional classes, then why not look for such a cluster in the first place?
Application to hierarchical clustering (Toronen BMC bioinf. 2004)
1. Cluster using hierarchical clustering
2. Calculate correlations for all the clusters with all the gene classes
3. Store the best correlation for each cluster
4. Select the clusters with best scores as output*
*analysis is heavily simplified
Selection of informative clusters with GO
Some top results:
ordinal cluster number   most enriched MIPS gene class   log-p-value   bonf. corr. log(p)   observed   clust. size   class size
1                        cytoplasmic ribosomes           102.5676      99.7780              78         159           108
2                        respiration chain complexes     61.0379       58.2483              28         41            31
3                        mitochondrion                   57.3513       54.5617              75         142           305
4                        AA metabolism                   55.4905       52.7009              75         259           171
5                        AA biosynthesis                 55.4849       52.6953              59         244           98
6                        mitochondrial ribosomes         46.2851       43.4955              32         101           46
7                        rRNA transcription              33.5036       30.7140              46         274           99
8                        cytoplasmic ribosomes           25.8871       23.0975              15         15            108
9                        nucleosomal protein complex     25.2346       22.4450              8          8             8
10                       26S proteasome                  23.1373       20.3477              12         21            28
Visualization
• green shows nucleus, nucleotide, cell cycle and differentiation
• red is energy production
• blue is protein synthesis
=> rough functional annotation of cluster tree branches
Further ideas of over-representation
• Evaluation using genes associated with some disease
• Evaluation using genes that are active in some treatment
• Connectivity Map project
– http://www.sciencemag.org/content/313/5795/1929.full
• Gene groups linked across species
– http://www.pnas.org/content/107/14/6544.abstract
When the over-representation analysis will fail
• Functional classifications generated from the same data type that they are used to evaluate
– Circular logic
– Need independent data sources
• Genes occurring multiple times in the dataset (*)
– An example: analysis of probes instead of genes
– Independence of observations is strongly violated
• Background set needs to be well defined (**)
* Toronen et al. http://www.biomedcentral.com/1471-2105/10/307
  Subramanian et al. http://www.pnas.org/content/102/43/15545.short
** Pehkonen et al. http://www.biomedcentral.com/1471-2105/6/162/
When the over-representation analysis will fail
• Exclude the genes that have no annotation **
• Selected group is defined badly ***
– Group is too small => relevant genes are left out
– Group is too large => biological signal is diluted
• If worried: do the permutation analysis
– Permutation is discussed later
** Pehkonen et al. http://www.biomedcentral.com/1471-2105/6/162/
   Kankainen et al. http://nar.oxfordjournals.org/content/34/suppl_2/W534.short
*** The solution to this comes later
Conclusion on over-representation analysis
• Gene level analysis can be unreliable
• Functional classes (gene sets) provide a more reliable level
• Results point towards relevant biological processes
• Results can be confirmed via permutation analysis
Gene Set Enrichment Analysis
• Why this instead of earlier over-representation methods
• Main idea
• Different statistics
• Different permutations
Typical pipelines
Pipeline 1:
1. Generate the GE data
2. Pre-processing (normalization etc.)
3. Define differentially expressed genes
4. Find over-represented biological processes
5. Draw biological conclusions
Pipeline 2:
1. Generate the GE data
2. Pre-processing (normalization etc.)
3. Define differentially expressed genes
4. Cluster selected genes
5. Draw biological conclusions
What can go wrong?
• Is the definition of differentially expressed genes always reasonable?
– datasets with large noise levels
– p-value thresholds
– sudden jump to signif. regulation
– genes with weak regulation
What can go wrong?
Analysis of data with one threshold:
a biological process with weak regulation goes unnoticed
Functional classes should be analyzed so that the signal level is taken into account!
Solution
(Following text uses term gene set instead of functional class. Terminology varies between fields!)
• Take all the genes to data analysis (not just differentially expressed genes)
• Use some scoring method to define important functional classes (Gene Sets)
• All the genes contribute to the result
• Take the strength of regulation of genes into the analysis
• Monitor the regulation of functional processes (not the genes)
Gene set analysis pipeline
Gene level:
1. Generate the GE data
2. Pre-processing (normalization etc.)
3. Define continuous diff. expr. score for genes
Gene set level:
4. Calculate a gene set score for each gene set (using pre-defined gene sets)
5. Generate permuted data and calculate the gene set score for each gene set
6. Look for gene sets that show stronger signal in real data than in permuted data
[Figure: expression data with sample labels, and class data]
Methods for gene set scoring
• Average based methods• Rank based methods• Other methods (omitted here)
• Classification based method (http://bioinformatics.oxfordjournals.org/content/24/21/2474.full.pdf+html)
• Generalized linear models (http://bioinformatics.oxfordjournals.org/content/20/1/93.short)
• Logistic Regression (http://bioinformatics.oxfordjournals.org/content/25/2/211.short)
Average based methods
• Associate the differential expression scores x to gene set analysis
– x can be average fold change or T-test score
• Take all the genes from the gene set S
• Calculate one of the vector norm scores for the gene set:
– (A) $\mathrm{Score}_S = \sum_{i \in S} x_i$
– (B) $\mathrm{Score}_S = \sum_{i \in S} x_i^2$
– (C) $\mathrm{Score}_S = \sum_{i \in S} |x_i|$
• Compare with permutations …
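The three scores and a simple permutation comparison can be sketched as below. The permutation used here draws random gene sets of the same size (a row randomization); this is one possible choice, not necessarily the one used in the cited papers. Function names and toy scores are hypothetical.

```python
import random

def score_sum(x):      # (A) sum of scores
    return sum(x)

def score_squares(x):  # (B) sum of squared scores
    return sum(v * v for v in x)

def score_abs(x):      # (C) sum of absolute scores
    return sum(abs(v) for v in x)

def gene_set_pvalue(scores, gene_set, score_fn, n_perm=1000, seed=0):
    """Compare the gene set score to scores of random sets of the same size."""
    rng = random.Random(seed)
    genes = sorted(scores)
    members = [g for g in genes if g in gene_set]
    observed = score_fn([scores[g] for g in members])
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(genes, len(members))
        if score_fn([scores[g] for g in sample]) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical scores: two strongly but oppositely regulated genes, 28 weak ones.
scores = {"g0": 2.0, "g1": -2.0}
scores.update({f"g{i}": 0.1 * ((i % 5) - 2) for i in range(2, 30)})
gene_set = {"g0", "g1"}
for fn in (score_sum, score_squares, score_abs):
    print(fn.__name__, gene_set_pvalue(scores, gene_set, fn))
```

In this toy case score A cancels out to zero and misses the set, while B and C pick it up, which illustrates why the scores suit different kinds of regulation.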
Average based methods
• What is good and where?
[Figure: three example cases; the scores marked for them are A, B or C, and B]
• There is no single method that would detect all cases
A: Tian et al. http://www.pnas.org/content/102/38/13544.short
   Irizarry et al. http://biostats.bepress.com/cgi/viewcontent.cgi?article=1185&context=jhubiostat
B: Irizarry et al., Kong et al. http://bioinformatics.oxfordjournals.org/content/22/19/2373.long
C: Dinu et al. http://www.biomedcentral.com/1471-2105/8/242
Average based methods
• Extensions:
– Split data into two halves (X > 0, X <= 0)
– Do the average analysis separately for both halves
– Report the half that shows the stronger signal
• Gene Set Analysis (http://arxiv.org/pdf/math/0610667.pdf)
Rank based methods
• Steps:
– order genes by differential expression*
– test every possible threshold in the ordered list
– look for over(/under)-representation of the gene set above the threshold
– select the threshold position with the strongest score
– compare the results to permutation
• Expression values are (often) discarded!
*Differential expression: average fold change, T-test score…
[Figure: gene expression data ordered into a list, with a threshold marking the analyzed subset, next to the analyzed gene classes. Black = class member, white = not a member]
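The threshold scan in the steps above can be sketched with a hypergeometric score evaluated at every position of the ordered list. This is a simplified illustration, not the implementation of any of the cited tools, and the gene names are hypothetical.

```python
from math import comb

def upper_tail(X, K, N, L):
    """Hypergeometric P(at least X class members among the top K genes)."""
    return sum(comb(L, i) * comb(N - L, K - i)
               for i in range(X, min(K, L) + 1)) / comb(N, K)

def best_threshold(ranked_genes, gene_set):
    """Scan every threshold; return (best p-value, threshold position, members above)."""
    N = len(ranked_genes)
    L = sum(1 for g in ranked_genes if g in gene_set)
    best = (1.0, 0, 0)
    X = 0
    for K, gene in enumerate(ranked_genes, start=1):
        if gene in gene_set:
            X += 1
        p = upper_tail(X, K, N, L)
        if p < best[0]:
            best = (p, K, X)
    return best

# Hypothetical ranked list of 20 genes; the gene set sits mostly at the top.
ranked = [f"g{i}" for i in range(20)]
gene_set = {"g0", "g1", "g2", "g15"}
print(best_threshold(ranked, gene_set))
```

Because the best of many thresholds is selected, the minimum p-value is optimistically biased and cannot be read as a p-value as such. This is why the final step compares the result to permutations.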
Rank based methods
• Scoring methods:
– Hypergeometric p-value (A)
– Kolmogorov-Smirnov test (KS) (B)
– modified KS (C)
• A: Iterative Group Analysis and others
– http://www.biomedcentral.com/1471-2105/5/34
– http://www.biomedcentral.com/1471-2105/10/48
• B: Old Gene Set Enrichment Analysis (GSEA)
– http://statgen.ncsu.edu/~dahlia/journalclub/S04/ng267.pdf
• C: New GSEA
– http://www.pnas.org/content/102/43/15545.short
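A Kolmogorov-Smirnov style score can be computed as a running sum over the ranked list. The sketch below is the plain, unweighted form: the sum steps up at class members and down at non-members, and the largest deviation from zero is the score. It is an illustration only and does not reproduce the exact statistic of either GSEA version (the modified KS additionally weights the steps by the expression scores).

```python
def ks_running_sum(ranked_genes, gene_set):
    """Largest deviation of (fraction of hits seen - fraction of misses seen)."""
    N = len(ranked_genes)
    L = sum(1 for g in ranked_genes if g in gene_set)
    if L == 0 or L == N:
        return 0.0
    hit_step, miss_step = 1.0 / L, 1.0 / (N - L)
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        running += hit_step if gene in gene_set else -miss_step
        if abs(running) > abs(best):
            best = running  # keep the sign: positive = enriched at the top
    return best

ranked = [f"g{i}" for i in range(10)]        # hypothetical ranked gene list
print(ks_running_sum(ranked, {"g0", "g1"}))  # set at the top of the list
print(ks_running_sum(ranked, {"g8", "g9"}))  # set at the bottom of the list
```

As with the other scores, significance is judged by comparing the observed value with values from permuted data.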
My brilliant proposal
• Combine two method groups:
– Order genes with diff. expr. scores
– Test every threshold position
– At each threshold calculate the difference (Diff)
• Scale the difference with STD and average estimates (Toronen et al. 2009)
– http://www.biomedcentral.com/1471-2105/10/307
• Get a Z-score scaling for the difference => Gene Set Z-score (GSZ)
$Z\text{-score} = (X - \mathrm{mean}) / \mathrm{STD}$
My brilliant proposal
• An over-representation (hypergeometric) score weighted with diff. expr. score
• GSZ compares the Diff to the mean and STD we obtain when the class is randomly distr. in the ordered list.
My brilliant proposal
• Many popular gene set scoring methods are variants of the GSZ method:
– hypergeometric testing
– Pearson correlation
– Max-Mean (Efron, Tibshirani)
– Random Sets (Newton et al.)
Comparing methods
• http://www.biomedcentral.com/1471-2105/10/47/
• http://www.biomedcentral.com/1471-2105/8/431/
• http://bioinformatics.oxfordjournals.org/content/28/11/1480.short
• http://www.biomedcentral.com/1471-2105/9/502
Summary of methods
• Some methods are good at detecting a consistent shift in the whole group. They are not good with non-coherent regulation.
– Average method
• Some methods are also good at detecting non-coherent regulation, if it is strong enough. These cannot detect mild consistent regulation.
– Average method with squared signal
– Most rank based methods
Do not use MIT GSEA programs
• Several comparisons indicate that it is weak
• Toronen et al.
– http://www.biomedcentral.com/1471-2105/10/307
• Irizarry et al.
– http://smm.sagepub.com/content/18/6/565.short
• Ackermann et al.
– http://www.biomedcentral.com/1471-2105/10/47/
• Dinu et al.
– http://www.biomedcentral.com/1471-2105/8/242
• … and others…
Permutations
• Needed to evaluate significance
• Two types:
– Row randomization: mix row labels (gene set / gene class)
– Column randomization: mix sample labels, used to calculate diff. expr.
• Column randomization preferred
[Figure: row randomization vs. column randomization]
Permutations
• Some propose that both should be used
– Toronen et al.
– Efron & Tibshirani
– Tian et al.
• Some errors can be detected by one permutation and some errors by other.
• It would seem safest to use both of them
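Column randomization can be sketched as follows: the sample labels are shuffled, the differential expression is recomputed from scratch, and the gene set score is compared with the score from the true labels. The mean difference between the two sample groups stands in for the differential expression score here, and the gene set score is the plain sum. Both are simplifying assumptions, and all names and data are hypothetical.

```python
import random

def mean_diff(values, labels):
    """Difference of group means (label 1 minus label 0) for one gene."""
    a = [v for v, lab in zip(values, labels) if lab == 1]
    b = [v for v, lab in zip(values, labels) if lab == 0]
    return sum(a) / len(a) - sum(b) / len(b)

def set_score(expr, labels, gene_set):
    """Sum of the differential expression scores over the gene set."""
    return sum(mean_diff(expr[g], labels) for g in gene_set)

def column_permutation_pvalue(expr, labels, gene_set, n_perm=500, seed=0):
    rng = random.Random(seed)
    observed = abs(set_score(expr, labels, gene_set))
    hits = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)  # mix the sample labels, keep the group sizes
        if abs(set_score(expr, shuffled, gene_set)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical data: 8 samples (4 + 4), two genes that follow the labels.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
expr = {
    "g1": [5.0, 5.0, 5.0, 5.0, 1.0, 1.0, 1.0, 1.0],
    "g2": [5.0, 5.0, 5.0, 5.0, 1.0, 1.0, 1.0, 1.0],
    "g3": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0],
}
print(column_permutation_pvalue(expr, labels, ["g1", "g2"]))
```

With 4 + 4 samples there are only 70 distinct labelings, so the smallest reachable p-value is limited. This is the practical meaning of needing enough samples for permutations.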
Extensions to Gene Set Analysis
• Principle: Continuous data vs. label data.
• siRNA data vs. gene IDs
• Linkage data vs. biological processes
• BLAST result list vs. descriptions
• BLAST result list vs. GO classes
• Correlation of gene expression with a query gene across large data vs. GO classes
Warnings
• Quality of gene expression data
• Use column permutations OR column and row permutations (depends on statistics)
• Enough samples for permutations
• Each gene should occur only once in the expression data
• Filter genes without annotations (with GO data)
• Quality of gene sets / annotations
• See discussion in Toronen et al. http://www.biomedcentral.com/1471-2105/10/307
Conclusion on Gene Set Enrichment Methods
• Extension of over-representation analysis to continuous data
• Two main method principles: Rank based and average based methods
• Methods report different signals• Goodness depends on the application
Analysis of Gene Networks
• Why Gene Networks
• How this links to large gene groups
• What are different gene/protein interactions
• Databases for interactions
• Is the interaction true
• Usage of interaction datasets
Why Gene Networks
• Previous slides represented genes as members of group
• This assumes that group members are similar and are equally important
• This can be considered true with protein complexes etc.
• However often biology is constructed from networks
Why gene networks
• Biological synthesis and degradation pathways
• Signal pathways (phosphorylation)
• Regulatory pathways to gene expression
Why Gene Networks
• In addition one can generate large scale datasets that represent gene – gene interactions
• These can be represented also as networks
Different Interactions
Different interactions represent totally different things. Keep this in mind.
• Physical protein – protein interaction
• Synthetic genetic interaction (epistasis)
• Literature gene interaction
Physical Protein Protein interaction
• Proteins bind physically to each other
– Protein complexes
– Signal transporting partners
• High throughput methods:
– Two-hybrid screening (http://en.wikipedia.org/wiki/Two-hybrid_screening)
– Tandem affinity purification (http://en.wikipedia.org/wiki/Tandem_affinity_purification)
Physical Protein Protein interaction
• Prediction directly from sequence has been attempted
• Domain – domain interaction
– Shoemaker & Panchenko 2007
– http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.0030043
• Criticism
– Ta & Holm, 2009
– http://www.sciencedirect.com/science/article/pii/S0006291X09019470
Genetic Interactions
• Disabling two genes at the same time has a ”surprising” effect
Surprising: Clearly weaker growth OR cancellation of weak growth observed with knock-outs of single genes
• High throughput in simple organisms
– Phenotype: growth
– http://www.nature.com/nmeth/journal/v4/n10/full/nmeth1098.html
– http://stke.sciencemag.org/cgi/content/abstract/sci;294/5550/2364
• Genes do not (necessarily) make physical interaction
• Interacting genes are often in backup pathways
Genetic vs. physical interaction
[Figure: Pathway A and Pathway B, two pathways that work as backup to each other. Blue lines = physical interactions, red lines = genetic interactions]
Regulatory Networks
• Systems regulating gene expression activity in the cell
• Consist of interactions between transcription factors, other proteins, metabolic compounds
• High throughput datasets: gene expression data generated from perturbed organisms / cell lines
– http://www.embl.de/predoccourse/2008/modules/genomics/journal_club/HughesCell2000.pdf
– http://stke.sciencemag.org/cgi/content/abstract/sci;313/5795/1929
– http://www.sciencedirect.com/science/article/pii/S0167779911000023
• Primary or secondary effect?
Literature interaction
• Co-occurrence of gene names in published articles
• Includes all the previous interaction types
• False positives are a serious problem => requires good statistics
Older articles
http://cms.dt.uh.edu/Faculty/ChenP/IR/paper1-ng-v28-p21.pdf
http://www.nature.com/ng/journal/v36/n7/full/ng0704-664.html
New articles
http://bioinformatics.oxfordjournals.org/content/25/12/1536.short
http://bioinformatics.oxfordjournals.org/content/24/13/i277.short
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0034480
False positive interactions
• All these interaction datasets also report false positive interactions
• A single interaction cannot be trusted
– Especially if both genes have many interactions
• A group of interactions is more reliable
– Many alternative paths between points
Measuring similarity in network
• Try to emphasise reliable groups of interactions over random interactions
• Try to emphasise cases where there are several alternative paths between two genes
• Similarity can also predict an unobserved interaction
http://nar.oxfordjournals.org/content/40/W1/W140.full
http://www.pnas.org/content/100/8/4372.full
http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000350
http://genome.cshlp.org/content/18/12/1991.full
http://alfred.cse.buffalo.edu/DBGROUP/bioinformatics/papers/chuan.pdf
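One simple way to reward alternative paths is to compare the neighbour sets of two genes: every shared neighbour is an alternative path of length two. The sketch below uses a Jaccard ratio of the neighbour sets. It is only one possible similarity measure, chosen for illustration, and is not the measure of any of the linked papers; the edge list is made up.

```python
def neighbor_similarity(edges, a, b):
    """Shared neighbours of a and b divided by all their neighbours."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    na = neighbors.get(a, set()) - {b}
    nb = neighbors.get(b, set()) - {a}
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0

# Hypothetical interactions: A and B share the partners C and D.
edges = [("A", "C"), ("B", "C"), ("A", "D"), ("B", "D"), ("A", "E")]
print(neighbor_similarity(edges, "A", "B"))  # no direct A-B edge, still similar
print(neighbor_similarity(edges, "C", "E"))
```

A and B score high although no direct interaction between them was observed, which is the sense in which similarity can suggest an unobserved interaction.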
Where is the data
• DIP (Database of Interacting Proteins)
– http://dip.doe-mbi.ucla.edu/dip/Main.cgi
• Biogrid
– http://thebiogrid.org/
• MINT (Molecular INTeraction database)
– http://mint.bio.uniroma2.it/mint/Welcome.do
• List of databases: http://proteome.wayne.edu/PIDBL.html
Use of interaction networks with large data
• Find interaction clusters from the gene lists
– Generate clusters from the selected gene group using the interaction data
– http://cms.dt.uh.edu/Faculty/ChenP/IR/paper1-ng-v28-p21.pdf
• Find correlations across two datasets
– http://bioinformatics.oxfordjournals.org/content/21/11/2730.full
– http://nar.oxfordjournals.org/content/31/21/6283.full
• Use the analyzed data to find interesting regions in the interaction graph
– http://bioinformatics.oxfordjournals.org/content/26/21/2713.full
Problems with interaction data
• False positive interactions
• Hidden circular logic
– Use of an interaction graph, generated from gene expression data, to evaluate gene expression data
• Combination of different types of interaction datasets might not be trivial
• Permutation analysis should help also here
Summary of interaction networks
• Popular data for evaluation of large-scale datasets
• Many types of interactions
• False positive interactions are a serious problem
Few words on writing essay
• Tips from Liisa:
– Select a topic/s that seems interesting
– Select key words and terms
– Use Google and Google Scholar
– Look for reviews, tutorials
• Google: Keyword1 Keyword2 tutorial OR guide OR review
• Keep an eye on publication year and number of citations
• Keep the essay focused