analysis of large groups of genes petri toronen [email protected]

Analysis of large groups of genes

Petri [email protected]

Content

• PART I: Analysis of groups of genes with functional classes
– Do I need to analyze large gene groups?
– Functional classification of genes
– Over-representation analysis methods
– Gene Set Enrichment Analysis (GSEA) methods
– What comes out?
– Further applications
– When these methods will fail
– Conclusion

• PART II: Analysis of gene groups with gene networks
– Why gene networks
– What are different gene/protein interactions
– Where can we get the data
– Usage of interaction datasets
– Flaws of interaction datasets
– Conclusions

PART I: Analysis of group of genes with functional classes


Do I need to analyze large gene groups?

• Biology is increasingly high-throughput
• Understanding a list of hundreds of genes is hard
• Errors in data
• Popular in the literature; reviewers might require it
• These methods can ease the data analysis

Examples of large-scale datasets

• Gene expression datasets
• Large-scale genome-wide association studies (GWAS)
• Environmental microbial samples (metagenomics)
• Comparing two or more genomes for missing genes (phylogenetic footprinting)

Errors in data

• All high-throughput methods have been shown to also generate erroneous results

• Technical biases, sample preparation, dyes, different lab workers

• Variances between laboratories, sample preparation in the lab

• Biological variation between the individuals, samples

Errors in data

=> An interesting gene might not be a reliable observation!

Thinking about the data analysis

• Biology usually looks at the obtained genes
• But genes are often not the main interest!
• Usually it is the biological processes, the functions, that we are interested in
• Active genes can vary while the same process is active (Linghu et al. 2009, Efron, Tibshirani 2007, Subramanian 2005)


http://arxiv.org/pdf/math/0610667.pdf

http://www.biomedcentral.com/content/pdf/gb-2009-10-9-r91.pdf


• Monitoring genes requires detailed knowledge of gene functions
• Genes can have multiple functions
– A secondary function might go unnoticed
• Objectivity
– We easily find what we expected to find

Over-representation analysis

• Motivation for over-representation analysis
• Over-representation analysis methods
• What over-representation analysis always requires
• What the results look like

Typical analysis pipeline

Gene expression analysis:
1. Generate the GE data
2. Pre-processing (normalization etc.)
3. Define differentially expressed genes

Over-representation analysis:
4. Collect functional annotations for all the genes
5. Compare functional classes with the analyzed gene group. Use the selected statistical score
6. Select over-represented biological processes

• Gene functions can be represented as categorizations:
– A gene belongs to a category when it has that function
– A gene can belong to many categories
– Some categories can lie within each other
• Smaller, more precise categories are nested within a larger category
• Different levels of information on gene function

Information on gene function

• Where do we get the functional categorization?


• We use existing functional classifications from databases that provide them

– Biological Processes
– Molecular Functions …

• Laboratory can also generate its own functional classifications

[Figure: genes (G1 to G4) linked to functional classes (cell cycle, apoptosis, neurogenesis, cell death, ATPase activity)]

Information on the functions can be obtained, for example, from a genome database (MIPS, SGD)

[Figure: genes (G1 to G4) linked to functional classes (T1 to T5)]

Obtained association data can be turned into a binary matrix: 1 indicates association and 0 no association

     T1  T2  T3  …
G1   1   0   1   …
G2   0   1   0   …
G3   0   0   0   …
G4   0   0   0   …
…    …   …   …   …
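As a minimal sketch of this step, the following Python snippet turns gene-to-class associations into a binary matrix. The function name `build_annotation_matrix` and the example associations are hypothetical, chosen only to illustrate the idea; real annotations would come from a database such as GO or MIPS.

```python
def build_annotation_matrix(annotations):
    """Turn a {gene: set of classes} mapping into a binary gene x class matrix."""
    genes = sorted(annotations)
    classes = sorted({c for members in annotations.values() for c in members})
    # 1 = gene is associated with the class, 0 = no association
    matrix = [[1 if c in annotations[g] else 0 for c in classes] for g in genes]
    return genes, classes, matrix


# Hypothetical associations, for illustration only
example = {
    "G1": {"T1", "T3"},
    "G2": {"T2"},
    "G3": set(),
    "G4": set(),
}
genes, classes, matrix = build_annotation_matrix(example)
for gene, row in zip(genes, matrix):
    print(gene, row)
```

Genes without any annotation end up as all-zero rows, which is worth checking before running an over-representation analysis.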

Functional annotation standards

• Gene Ontology (GO)
• Reactome (Biochemical Pathways)
• KEGG (Kyoto Encyclopedia of Genes and Genomes) (not freely available anymore)
• SwissProt keywords
• MIPS FunCat (Functional Catalogue)
• Molecular Signatures Database (MSigDB, MIT)

=> You can use many of these simultaneously

Benefits of Gene Ontology (GO)

• Many gene features are covered
– Biological Processes
– Molecular Function
– Cellular Component
• Sophisticated hierarchical structure
– Both detailed and broad categories are included in the structure
• Most used, popular standard
• Many organisms are covered
• Information source is reported
– GO Evidence Codes

Drawbacks of GO

• Unstable structure
– New classes appear and old ones disappear
• Most annotations are not manually evaluated*
– Some annotations are bound to be wrong
• Gene function might vary, for example between tissues*
• Genes are only roughly categorized into classes*
– In reality genes are more or less relevant to some function / pathway
– A gene from a pathway can up- or down-regulate the synthesis

*The problem is not specific to GO; it applies to all functional classifications

Motivation of the over-representation analysis

• Aim is to summarize gene functions observed in the analyzed group of genes

• It would be intuitive to report the most common functional classes from the group of genes
• However, the most common classes are often the most common also in the background
• We would rather want to report the most over-represented classes

Over-representation analysis methods

• Over-representation analysis looks for classes that are more frequent in the gene group than in the background

• Most popular tests are based on sampling without replacement

[Figure: a sample of balls drawn from the whole data]

Sampling w/o replacement answers the question:

In how many ways can 8 balls be selected from the whole data so that two of them are white and the rest are black?

The sampling without replacement can be turned into a p-value using Fisher’s exact test (or the hypergeometric test).

This is done by summing the probability of the observed outcome (6 black balls) with the probabilities of the more extreme outcomes (7 and 8 black balls). This is a standard test available in many web tools.

[Figure: the probabilities of every possible outcome in the example. The p-value for the previous sample (6 black balls) is obtained by summing the three leftmost bars. p-value: 0.621]


http://en.wikipedia.org/wiki/Hypergeometric_distribution

http://en.wikipedia.org/wiki/Fisher%27s_exact_test
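The summation described above can be sketched in a few lines of Python using only the standard library. The urn composition used below (20 balls, 12 of them black) is an assumption made for illustration; the slide does not state the composition of the whole data, so this sketch is not expected to reproduce the p-value quoted there. The function name `hypergeom_upper_tail` is likewise hypothetical.

```python
from math import comb


def hypergeom_upper_tail(x, k, n_total, l_class):
    """P(X >= x) when k items are drawn without replacement from n_total items,
    l_class of which belong to the class of interest."""
    upper = min(k, l_class)
    ways = sum(
        comb(l_class, i) * comb(n_total - l_class, k - i)
        for i in range(x, upper + 1)
    )
    return ways / comb(n_total, k)


# Assumed urn: 20 balls in total, 12 black; draw 8 and observe 6 black.
# The sum covers the observed outcome (6) and the more extreme ones (7 and 8).
p_value = hypergeom_upper_tail(x=6, k=8, n_total=20, l_class=12)
print(p_value)
```

In gene terms, `x` is the number of class members in the gene group, `k` the size of the gene group, `n_total` the size of the whole dataset and `l_class` the number of class members in the whole dataset.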

Over-representation analysis methods

• Various methods used:
– Hypergeometric test
– Binomial test
– Chi Square test (bad)
• Less used methods reporting the same:
– Jaccard correlation
– Log-likelihood ratio aka. G-statistics (Mutual Information)

http://en.wikipedia.org/wiki/Binomial_test

http://en.wikipedia.org/wiki/G-test

http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test

http://en.wikipedia.org/wiki/Jaccard_index

What statistics calculate

• Hypergeometric test is based on sampling w/o replacement
• Binomial test is based on sampling with replacement
– This model is approximately true when the dataset is much larger than the sample
– Binomial test is often used to approximate the hypergeometric test
• Log-likelihood ratio (LLR, G-test) is based on the ratio of likelihoods between two models:
– Maximum likelihood model
– Null model
It approximates the binomial test
• Chi Square test approximates LLR. Chi Square test is popular but it is not recommended.
• Jaccard correlation (JC) is different. It is the ratio between:
– the size of the intersection (∩) between the functional class and the gene group (number of genes in both sets)
– the size of the union (∪) between the functional class and the gene group (number of genes in either set)
• JC = |A ∩ B| / |A ∪ B|
• JC does not tell how significant the result is
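The Jaccard ratio is simple enough to write out directly. This is a small illustrative sketch; the function name `jaccard` and the gene identifiers are made up for the example.

```python
def jaccard(class_genes, group_genes):
    """Size of the intersection divided by the size of the union of two gene sets."""
    a, b = set(class_genes), set(group_genes)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Hypothetical functional class and gene group sharing two genes
functional_class = {"G1", "G2", "G3"}
gene_group = {"G2", "G3", "G4"}
print(jaccard(functional_class, gene_group))
```

The value depends only on the two sets, not on the size of the background, which is one way to see why it carries no significance information.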

Over-representation analysis methods: What values the methods need

• Hypergeometric test (= Fisher’s exact test, hypergeometric p-value)
– X: Number of class members in the gene group
– K: Size of the gene group
– N: Size of the whole dataset
– L: Number of class members in the whole dataset
• Binomial p-value, log-likelihood test (= G-statistics), Chi Square test
– X: Number of class members in the gene group
– K: Size of the gene group
– P: Probability of class members in the background (P = L/N)
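To show how the second group of methods uses only X, K and P, here is a hedged sketch of a binomial upper-tail p-value and a one-sample G statistic. The function names and the numbers in the usage lines are assumptions for illustration; the G statistic is written in its common goodness-of-fit form, G = 2 Σ observed · ln(observed / expected), over the two cells "class member" and "not a class member".

```python
from math import comb, log


def binomial_upper_tail(x, k, p):
    """P(X >= x) for k draws with replacement, each a class member with probability p."""
    return sum(comb(k, i) * p**i * (1 - p) ** (k - i) for i in range(x, k + 1))


def g_statistic(x, k, p):
    """Goodness-of-fit G statistic for x class members among k genes (requires 0 < p < 1)."""
    observed = (x, k - x)
    expected = (k * p, k * (1 - p))
    # Cells with an observed count of zero contribute nothing to the sum
    return 2 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)


# Hypothetical numbers: 12 class members in a group of 40 genes, background P = L/N = 0.1
print(binomial_upper_tail(12, 40, 0.1))
print(g_statistic(12, 40, 0.1))
```

Unlike the hypergeometric test, neither function needs N and L separately, only their ratio P.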

What the results look like

• Sorted lists of significant functional classes
• Visualization of classes in the structure of the functional class hierarchy

Two examples of sorted class lists

functional class | -log10(P) | P | size of the class | obs. number of genes | expected number of genes | STD of expected number
TRANSCRIPTION | 57.94446 | 0 | 601 | 282 | 131.9268 | 8.822874
mRNA transcription | 31.11666 | 0 | 444 | 196 | 97.46342 | 7.897141
nuclear organization | 27.23162 | 0 | 648 | 245 | 142.2439 | 9.044816
mRNA synthesis | 22.65499 | 0 | 343 | 151 | 75.29268 | 7.1128
rRNA transcription | 16.981 | 0 | 98 | 60 | 21.5122 | 4.01593
transcriptional control | 16.73279 | 0 | 271 | 118 | 59.48781 | 6.428959
rRNA processing | 11.33614 | 0 | 58 | 37 | 12.73171 | 3.115542
mRNA processing (splicing) | 8.325967 | 0 | 66 | 36 | 14.48781 | 3.31793
rRNA synthesis | 6.903157 | 0 | 37 | 23 | 8.121951 | 2.499256
tRNA transcription | 5.381251 | 0.000004 | 72 | 33 | 15.80488 | 3.461119
general transcription activities | 4.479288 | 0.000033 | 59 | 27 | 12.95122 | 3.141631

functional class | -log10(P) | P | size of the class | obs. number of genes | expected number of genes | STD of expected number
mitochondrial organization | 33.52572 | 0 | 303 | 88 | 23.64878 | 4.373248
respiration | 9.47456 | 0 | 68 | 23 | 5.307317 | 2.181689
PROTEIN SYNTHESIS | 6.225534 | 0.000001 | 299 | 47 | 23.33659 | 4.348312
ribosomal proteins | 5.816049 | 0.000002 | 173 | 32 | 13.50244 | 3.402624
ENERGY | 4.188443 | 0.000065 | 169 | 28 | 13.19024 | 3.365997
assembly of protein complexes | 3.630167 | 0.000234 | 86 | 17 | 6.712195 | 2.44426
CELLULAR ORGANIZATION | 2.494072 | 0.003206 | 1806 | 157 | 140.9561 | 5.879028
mRNA processing (splicing) | 1.933813 | 0.011646 | 66 | 11 | 5.15122 | 2.150264
regulation of phosphate utilization | 1.70902 | 0.019543 | 8 | 3 | 0.62439 | 0.75764

Graphical output

• classes with high over-representation can be visualized using the GO structure
• well-scoring classes are highlighted

Better pipeline: Permutations

Collect functional annotations for all the genes



Tolvanen et al. 2009 http://www.sciencedirect.com/science/article/pii/S1532046408001445



Permute (mix) the annotations randomly across the genes

Compare the results with correct data to the permuted data. Report only classes that have better score than the permutations

Repeat N*100 times

Permutations

• Sanity check: What results can be obtained from random data?

• First method (Row permutation, gene permutation):

• Repeat the following steps, say 200 times
– Permute the functional annotations
– Rerun the analysis
– Store the results

• Compare the results from true data to the permuted data
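The gene permutation loop can be sketched as follows. This is a simplified illustration under stated assumptions: the score is just the overlap count between the gene group and one class, and permuting the annotation of a single class across the genes is implemented as drawing a random gene set of the same size. The function names are hypothetical.

```python
import random


def count_overlap(gene_group, class_members):
    """Score used in this sketch: number of genes shared by the group and the class."""
    return len(set(gene_group) & set(class_members))


def gene_permutation_pvalue(gene_group, class_members, all_genes, n_perm=200, seed=0):
    """Empirical p-value from randomly reassigning the class label across the genes."""
    rng = random.Random(seed)
    all_genes = list(all_genes)
    class_size = len(set(class_members))
    observed = count_overlap(gene_group, class_members)
    hits = 0
    for _ in range(n_perm):
        permuted_class = rng.sample(all_genes, class_size)
        if count_overlap(gene_group, permuted_class) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p-value of zero
    return (hits + 1) / (n_perm + 1)


# Hypothetical data: 100 genes, a group of 10 that coincides with a class of 10
genes = [f"G{i}" for i in range(100)]
print(gene_permutation_pvalue(genes[:10], genes[:10], genes))
```

With 200 permutations the smallest p-value this sketch can report is 1/201, so more permutations are needed if stricter thresholds are used.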

Permutations

• Second method (column permutation, sample permutation):

• Repeat the following steps, say 200 times:
– Permute the labels of the (gene expression) samples
– Rerun the (gene expression) data analysis
– Take the significantly regulated genes
– Rerun the over-representation analysis with the regulated genes
• For each class, compare the results from the true data to the permuted data
• Heavier but still the preferred permutation idea
• Might not always be applicable
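A sample permutation run could look roughly like the sketch below. It rests on several simplifying assumptions that are not from the slides: two sample groups labelled 0 and 1, "significantly regulated" defined as an absolute difference of group means above a fixed threshold, and the over-representation score reduced to the number of regulated genes in the class. All names are hypothetical.

```python
import random
from statistics import mean


def regulated_genes(expr, labels, threshold):
    """Genes whose group means (label 1 vs. label 0) differ by more than threshold."""
    selected = set()
    for gene, values in expr.items():
        group1 = [v for v, lab in zip(values, labels) if lab == 1]
        group0 = [v for v, lab in zip(values, labels) if lab == 0]
        if abs(mean(group1) - mean(group0)) > threshold:
            selected.add(gene)
    return selected


def sample_permutation_pvalue(expr, labels, class_members, threshold, n_perm=200, seed=0):
    """Empirical p-value from shuffling the sample labels and redoing the whole analysis."""
    rng = random.Random(seed)
    class_members = set(class_members)
    observed = len(regulated_genes(expr, labels, threshold) & class_members)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # keeps the group sizes, mixes which sample is in which group
        if len(regulated_genes(expr, shuffled, threshold) & class_members) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical expression data: 3 treated (1) and 3 control (0) samples
labels = [1, 1, 1, 0, 0, 0]
expr = {"A": [5, 5, 5, 0, 0, 0], "B": [1, 0, 1, 0, 1, 0]}
print(sample_permutation_pvalue(expr, labels, {"A"}, threshold=3.0))
```

With three samples per group there are only 20 distinct labelings, which illustrates why enough samples are needed for this kind of permutation.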

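Permutations

[Figure: expression data matrix with sample labels, next to the class data matrix. Row randomization (Row rand.) acts on the class data; column randomization (Col. rand.) acts on the sample labels.]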

Permutations

• Tian et al. 2005, PNAS (http://www.pnas.org/content/102/38/13544.long)

• Efron & Tibshirani, 2007, Annals of Applied Statistics (http://arxiv.org/pdf/math/0610667.pdf)

• Goeman & Buhlmann, 2007, Bioinformatics (http://arxiv.org/pdf/math/0610667.pdf)

• Törönen et al. 2009, BMC Bioinf. (http://www.biomedcentral.com/1471-2105/10/307)

Some say one is better than the other. Tian et al. and I propose: test both and expect the worse result to be true.


Multiple Testing Problem

• We run over-representation analysis with many thousands of classes. What can go wrong?
• Think of throwing a die.
– With one die the probability of getting a 6 is 1/6
– With 10 dice the probability of getting at least one 6 is much larger (~0.8)
• The same phenomenon occurs with functional classes when we look at the best-scoring classes.

• This needs to be corrected
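The dice arithmetic follows from the complement rule: the probability of at least one success in n independent trials is 1 − (1 − p)^n. A short sketch (the function name is hypothetical, and treating the class tests as independent is a simplifying assumption, since functional classes overlap):

```python
def prob_at_least_one(p_single, n_trials):
    """Probability of at least one success in n independent trials."""
    return 1 - (1 - p_single) ** n_trials


# Ten dice, probability of at least one six
print(prob_at_least_one(1 / 6, 10))

# The same effect with p-values: 1000 classes tested at the 0.05 level
print(prob_at_least_one(0.05, 1000))
```

With thousands of classes, at least one "significant" class at the 0.05 level is practically guaranteed even in random data.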

Multiple Testing Problem

• Several alternatives for correcting the phenomenon
• These require the obtained best results to be better than what could be expected at random from the set of N experiments.
– Holm’s correction
– Bonferroni correction
– False Discovery Rate (FDR)
– … others …
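The three named corrections can be sketched as adjusted p-values in plain Python. For FDR the sketch uses the Benjamini-Hochberg procedure, which is one common choice and an assumption here, since the slide does not say which FDR method is meant. Function names and example p-values are illustrative.

```python
def bonferroni(pvalues):
    """Multiply every p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def holm(pvalues):
    """Holm's step-down correction: smallest p-value gets factor m, next m - 1, ..."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * pvalues[i]))
        adjusted[i] = running_max  # enforce monotone adjusted values
    return adjusted


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg FDR adjustment (step-up)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for k, i in enumerate(order):
        rank = m - k  # 1-based rank of this p-value in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four functional classes
pvals = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(pvals))
print(holm(pvals))
print(benjamini_hochberg(pvals))
```

Bonferroni and Holm control the chance of any false positive, while FDR controls the expected fraction of false positives among the reported classes, so FDR is usually the least strict of the three.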

Further ideas

Simple extensions to over-representation analysis:

• Search of clusters from hierarchical cluster tree

• Analysis of functional subgroups from the generated gene list

Analyzing the heterogeneity of the gene list

[Figure: the gene list (genes G1 to G8) linked to functional classes (T1 to T10)]

Pehkonen et al., BMC Bioinformatics 2005

Analyzing a gene list as one entity for over-represented gene classes is standard.

Problem:

The best scoring gene group tends to overwhelm the results.

Our question: Can the gene list be grouped in a reasonable way using the functional information?

Aim: Help to find the heterogeneous gene groups nested in the gene list.

Proposed method

• Cluster the gene list using ontological data with Non-negative Matrix factorization (NMF)

– shown to perform well on sparse binary matrices
– note that the clustering is based only on the functional annotations
– no expression data or sequence similarity is used

• Vary the used cluster number and highlight the features that stay similar although the number of clusters changes

• present the most over-represented classes for each cluster

Obtained graphical output

The analyzed list consisted of yeast genes that were important for the growth of yeast under H2O2 stress (Thorpe, PNAS, 2004)

Lines show binary correlation (0 = no correlation, 1 = clusters match exactly)

Each cluster shows functional classes that were over-represented in the original gene list and which are over-represented in the created cluster.

More details will be shown for this clustering

3 best classes for each cluster

<Cluster 1 size=11>

RNA ligase activity

tRNA ligase activity

ligase activity, forming aminoacyl-tRNA and related compounds

<Cluster 2 size=33>

mitochondrial genome maintenance

mitochondrion organization and biogenesis

mitochondrial chromosome

<Cluster 3 size=14>

mitochondrial membrane

inner membrane

mitochondrial inner membrane

<Cluster 4 size=31>

organellar ribosome

mitochondrial ribosome

mitochondrial matrix

<Cluster 5 size=20>

transcription regulator activity

nucleobase, nucleoside, nucleotide and nucleic acid metabolism

mediator complex

Competing methods: sorted class list

mitochondrion

mitochondrial matrix

organellar ribosome

mitochondrial ribosome

protein biosynthesis

organellar large ribosomal subunit

mitochondrial large ribosomal subunit

macromolecule biosynthesis

structural constituent of ribosome

ribosome

biosynthesis

structural molecule activity

ribonucleoprotein complex

protein metabolism

metabolism

etc…

Sorted class list gives an impression that the list includes only mitochondrial ribosome proteins

Summary of theme discovery

• We have used clustering to look for the heterogeneous gene groups from the list of genes

• sorted class lists have a tendency to highlight only the most enriched functional classes

• user gets an impression that the gene group is homogeneous

• Method would be especially useful with the analysis of large gene lists

Motivation for next work

• clustering is a standard procedure in gene expression data analysis
• clusters are often monitored for over-representation of genes from the same (GO) functional class
• strong over-representation indicates an interesting clustering solution

But
• we usually face the problem of selecting the clustering method and parameters (cluster numbers etc.)
– these result in several clustering solutions
• Many clustering solutions create a massive analysis task

If a good/interesting cluster is considered to be one that produces a strong correlation with functional classes, then why not look for such a cluster in the first place?

Application to hierarchical clustering (Toronen BMC bioinf. 2004)

1. Cluster using hierarchical clustering
2. Calculate correlations for all the clusters with all the gene classes
3. Store the best correlation for each cluster
4. Select the clusters with best scores as output*

*analysis is heavily simplified

Selection of informative clusters with GO

ordinal cluster number | most enriched MIPS gene class | log-p-value | bonf. corr. log(p) | observed | clust. size | class size
1 | cytoplasmic ribosomes | 102.5676 | 99.7780 | 78 | 159 | 108
2 | respiration chain complexes | 61.0379 | 58.2483 | 28 | 41 | 31
3 | mitochondrion | 57.3513 | 54.5617 | 75 | 142 | 305
4 | AA metabolism | 55.4905 | 52.7009 | 75 | 259 | 171
5 | AA biosynthesis | 55.4849 | 52.6953 | 59 | 244 | 98
6 | mitochondrial ribosomes | 46.2851 | 43.4955 | 32 | 101 | 46
7 | rRNA transcription | 33.5036 | 30.7140 | 46 | 274 | 99
8 | cytoplasmic ribosomes | 25.8871 | 23.0975 | 15 | 15 | 108
9 | nucleosomal protein complex | 25.2346 | 22.4450 | 8 | 8 | 8
10 | 26S proteasome | 23.1373 | 20.3477 | 12 | 21 | 28

Some top results

Visualization

• green shows nucleus, nucleotide, cell cycle and differentiation

• red is energy production
• blue is protein synthesis

=> rough functional annotation of cluster tree branches

Further ideas of over-representation

• Evaluation using genes associated with some disease

• Evaluation using genes that are active in some treatment

– Connectivity Map project
– http://www.sciencemag.org/content/313/5795/1929.full
• Gene groups linked across species
– http://www.pnas.org/content/107/14/6544.abstract

When the over-representation analysis will fail

• Functional classifications generated from the same data type that they are used to evaluate
– Circular logic
– Need independent data sources
• Genes occurring multiple times in the dataset (*)
– An example: analysis of probes instead of genes
– Independence of observations is strongly violated
• Background set needs to be well defined (**)

* Toronen et al. http://www.biomedcentral.com/1471-2105/10/307, Subramanian et al. http://www.pnas.org/content/102/43/15545.short
** Pehkonen et al. http://www.biomedcentral.com/1471-2105/6/162/


When the over-representation analysis will fail

• Exclude the genes that have no annotation (**)
• Selected group is defined badly (***)
– Group is too small => relevant genes are left out
– Group is too large => biological signal is diluted
• If worried: do the permutation analysis
– Permutation is discussed later

** Pehkonen et al. http://www.biomedcentral.com/1471-2105/6/162/, Kankainen et al. http://nar.oxfordjournals.org/content/34/suppl_2/W534.short
*** The solution to this comes later

Conclusion on over-representation analysis

• Gene level analysis can be unreliable
• Functional classes (gene sets) provide a more reliable level
• Results point towards relevant biological processes
• Results can be confirmed via permutation analysis

Gene Set Enrichment Analysis

• Why this instead of earlier over-representation methods

• Main idea
• Different statistics
• Different permutations

Typical pipelines

Pipeline 1:
– Generate the GE data
– Find over-represented biological processes
– Draw biological conclusions

Pipeline 2:
– Generate the GE data
– Cluster selected genes
– Draw biological conclusions

What can go wrong?

• Is the definition of differentially expressed genes always reasonable?
– datasets with large noise levels
– p-value thresholds
– sudden jump to signif. regulation
– genes with weak regulation

What can go wrong?

Analysis of data with one threshold: a biological process with weak regulation goes unnoticed.

Functional classes should be analyzed so that the signal level is taken into account!

Solution

(Following text uses term gene set instead of functional class. Terminology varies between fields!)

• Take all the genes to data analysis (not just differentially expressed genes)

• Use some scoring method to define important functional classes (Gene Sets)

– All the genes contribute to the result
– Take the strength of regulation of genes into the analysis
– Monitor the regulation of functional processes (not the genes)

Gene set analysis pipeline

Gene level:
– Generate the GE data
– Define continuous Diff. Expr. score for genes

Gene set level (uses pre-defined gene sets):
– Calculate a gene set score for each gene set
– Generate permuted data
– Calculate the gene set score for each gene set in the permuted data
– Look for gene sets that show stronger signal in real data than in permuted data

[Figure: expression data matrix with sample labels, next to the class data matrix]

Methods for gene set scoring

• Average based methods
• Rank based methods
• Other methods (omitted here)
– Classification based method (http://bioinformatics.oxfordjournals.org/content/24/21/2474.full.pdf+html)
– Generalized linear models (http://bioinformatics.oxfordjournals.org/content/20/1/93.short)
– Logistic Regression (http://bioinformatics.oxfordjournals.org/content/25/2/211.short)

Average based methods

• Bring the differential expression scores x into the gene set analysis
• x can be the average fold change or the T-test score
• Take all the genes from the gene set S
• Calculate one of the vector norm scores for the gene set
– Score_S = ∑_{i∈S} x_i (A)
– Score_S = ∑_{i∈S} (x_i)² (B)
– Score_S = ∑_{i∈S} |x_i| (C)
• Compare with permutations …
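The three scores A, B and C can be computed side by side, which also shows where they differ. A minimal sketch; the function name, dictionary keys and example scores are hypothetical.

```python
def gene_set_scores(x, gene_set):
    """Scores A, B and C for one gene set, given per-gene differential expression scores x."""
    values = [x[g] for g in gene_set if g in x]
    return {
        "A_sum": sum(values),
        "B_sum_of_squares": sum(v * v for v in values),
        "C_sum_of_abs": sum(abs(v) for v in values),
    }


# Hypothetical scores: one gene up, one gene down by the same amount
x = {"G1": 2.0, "G2": -2.0, "G3": 0.5}
print(gene_set_scores(x, {"G1", "G2"}))
```

In this example the opposite regulation cancels out in score A, while B and C still report a signal. The raw values are not comparable between gene sets of different sizes, which is one reason the comparison with permutations is needed.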

Average based methods

• What is good and where?

[Figure: panels labelled A, B or C, and B, marking which score suits which case]

• No single method would detect all cases

A: Tian et al. http://www.pnas.org/content/102/38/13544.short, Irizarry et al. http://biostats.bepress.com/cgi/viewcontent.cgi?article=1185&context=jhubiostat
B: Irizarry et al., Kong et al. http://bioinformatics.oxfordjournals.org/content/22/19/2373.long
C: Dinu et al. http://www.biomedcentral.com/1471-2105/8/242

Average based methods

• Extensions:
– Split data into two halves (X > 0, X <= 0)
– Do the average analysis separately for both halves
– Report the half that shows the stronger signal
• Gene Set Analysis (http://arxiv.org/pdf/math/0610667.pdf)

Rank based methods

• Steps:
– Order genes by differential expression*
– Test every possible threshold in the ordered list
– Look for over(/under)-representation of the gene set above the threshold
– Select the threshold position with the strongest score
– Compare the results to permutations
• Expression values are (often) discarded!

*Differential expression: average fold change, T-test score…

[Figure: gene expression data ordered by differential expression, with a threshold marking the analyzed subset, next to the analyzed gene classes. Black = class member, white = not a member.]
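The threshold scan in the steps above can be sketched with a hypergeometric score at each position. This is an illustration under assumptions, not the implementation of any of the published tools: it only looks for over-representation, and the function names are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(x, k, n_total, l_class):
    """P(X >= x) for k draws without replacement, l_class class members among n_total."""
    upper = min(k, l_class)
    ways = sum(
        comb(l_class, i) * comb(n_total - l_class, k - i)
        for i in range(x, upper + 1)
    )
    return ways / comb(n_total, k)


def best_threshold(ranked_genes, gene_set):
    """Scan every threshold in the ordered list; return (smallest p-value, its position)."""
    n_total = len(ranked_genes)
    members = set(gene_set) & set(ranked_genes)
    best = (1.0, 0)
    x = 0  # class members seen so far above the threshold
    for k, gene in enumerate(ranked_genes, start=1):
        if gene in members:
            x += 1
        p = hypergeom_upper_tail(x, k, n_total, len(members))
        if p < best[0]:
            best = (p, k)
    return best


# Hypothetical ordered list of 10 genes; the class occupies the top three positions
ranked = [f"G{i}" for i in range(1, 11)]
print(best_threshold(ranked, {"G1", "G2", "G3"}))
```

Because the best of many thresholds is selected, the minimum p-value is optimistically biased and should not be read as a p-value itself; this is why the last step compares the result to permutations.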

Rank based methods

• Scoring methods:
– Hypergeometric p-value (A)
– Kolmogorov-Smirnov test (KS) (B)
– Modified KS (C)
• A: Iterative Group Analysis and others
– http://www.biomedcentral.com/1471-2105/5/34
– http://www.biomedcentral.com/1471-2105/10/48
• B: Old Gene Set Enrichment Analysis (GSEA)
– http://statgen.ncsu.edu/~dahlia/journalclub/S04/ng267.pdf
• C: New GSEA
– http://www.pnas.org/content/102/43/15545.short

My brilliant proposal

• Combine the two method groups:
– Order genes with diff. expr. scores
– Test every threshold position
– At each threshold calculate
• Scale the difference with STD and average estimates (Toronen et al. 2009)
• http://www.biomedcentral.com/1471-2105/10/307
• Get a Z-score scaling for the difference => Gene Set Z-score (GSZ)

Z-score = (X − mean) / STD

My brilliant proposal

• An over-representation (hypergeometric) score weighted with diff. expr. score

• GSZ compares the Diff to the mean and STD we obtain when the class is randomly distr. in the ordered list.
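The idea of scaling a threshold statistic with the mean and STD obtained under random placement of the class can be illustrated with a small Monte Carlo sketch. This is explicitly not the published GSZ: the statistic used here (sum of the scores of class members above the threshold) and the sampling-based estimates of mean and STD are assumptions made for illustration, whereas Toronen et al. 2009 define their own statistic and estimates. All names are hypothetical.

```python
import random
from statistics import mean, pstdev


def threshold_z_score(scores, members, k, n_perm=1000, seed=0):
    """Z-score of the summed scores of class members among the top-k genes.

    scores:  differential expression scores, already sorted in decreasing order
    members: set of positions (indices into scores) that belong to the class
    k:       threshold position; positions 0 .. k-1 are above the threshold
    """
    rng = random.Random(seed)
    n = len(scores)

    def statistic(member_positions):
        return sum(scores[i] for i in member_positions if i < k)

    observed = statistic(members)
    # Null distribution: the class is placed at random positions in the ordered list
    null = [statistic(rng.sample(range(n), len(members))) for _ in range(n_perm)]
    std = pstdev(null)
    if std == 0:
        return 0.0
    return (observed - mean(null)) / std


# Hypothetical ordered scores; the class sits in the top three positions
scores = [5, 4, 3, 2, 1, 0, -1, -2, -3, -4]
print(threshold_z_score(scores, {0, 1, 2}, k=3))
```

A full method along these lines would evaluate the Z-score at every threshold position and then compare the result against permutations.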

My brilliant proposal

• Many popular gene set scoring methods are variants of the GSZ method:
– hypergeometric testing
– Pearson correlation
– Max-Mean (Efron, Tibshirani)
– Random Sets (Newton et al.)

Comparing methods

• http://www.biomedcentral.com/1471-2105/10/47/

• http://www.biomedcentral.com/1471-2105/8/431/

• http://bioinformatics.oxfordjournals.org/content/28/11/1480.short


Summary of methods

• Some methods are good at detecting a consistent shift in the whole group. They are not good with non-coherent regulation
– Average method
• Some methods are also good at detecting non-coherent regulation, if it is strong enough. These cannot detect mild consistent regulation.
– Average method with squared signal
– Most rank based methods

Do not use MIT GSEA programs

• Several comparisons point out that it is weak
– Toronen et al.
– Irizarry et al. http://smm.sagepub.com/content/18/6/565.short
– Ackermann et al. http://www.biomedcentral.com/1471-2105/10/47/
– Dinu et al. http://www.biomedcentral.com/1471-2105/8/242
– … and others…

Permutations

• Needed to evaluate significance
• Two types:
• Row randomization
– mix row labels (gene set / gene class)
• Column randomization
– mix sample labels, used to calculate diff. expr.
• Column randomization preferred

[Figure: row randomization (Row rand.) and column randomization (Col. rand.)]

Permutations

• Some propose that both should be used
– Toronen et al.
– Efron & Tibshirani
– Tian et al.

• Some errors can be detected by one permutation and some errors by other.

• It would seem safest to use both of them

Extensions to Gene Set Analysis

• Principle: Continuous data vs. label data.

• siRNA data vs. gene IDs
• Linkage data vs. biological processes
• BLAST result list vs. descriptions
• BLAST result list vs. GO classes
• Correlation of gene expression with a query gene across large data vs. GO classes

Warnings

• Quality of gene expression data
• Use column permutations OR column and row permutations (depends on statistics)
• Enough samples for permutations
• Each gene should occur only once in the expression data
• Filter genes without annotations (with GO data)
• Quality of gene sets / annotations
• See discussion in Toronen et al.


Conclusion on Gene Set Enrichment Methods

• Extension of over-representation analysis to continuous data

• Two main method principles: Rank based and average based methods

• Methods report different signals
• Goodness depends on the application

Analysis of Gene Networks

• Why Gene Networks
• How this links to large gene groups
• What are different gene/protein interactions
• Databases for interactions
• Is the interaction true
• Usage of interaction datasets

Why Gene Networks

• Previous slides represented genes as members of group

• This assumes that group members are similar and are equally important

• This can be considered true with protein complexes etc.

• However often biology is constructed from networks

Why gene networks

• Biological synthesis and degradation pathways
• Signal pathways (phosphorylation)
• Regulatory pathways to gene expression

Why Gene Networks

• In addition one can generate large scale datasets that represent gene – gene interactions

• These can be represented also as networks

Different Interactions

Different interactions represent totally different things. Keep this in mind.

• Physical protein – protein interaction
• Synthetic genetic interaction (epistasis)
• Literature gene interaction

Physical Protein Protein interaction

• Proteins bind physically to each other
– Protein complexes
– Signal transporting partners
• High throughput methods:
– Two-hybrid screening (http://en.wikipedia.org/wiki/Two-hybrid_screening)
– Tandem affinity purification (http://en.wikipedia.org/wiki/Tandem_affinity_purification)

Physical Protein Protein interaction

• Prediction directly from sequence has been attempted

• Domain – Domain interaction
– Shoemaker & Panchenko 2007
– http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.0030043
• Criticism
– Ta & Holm, 2009
– http://www.sciencedirect.com/science/article/pii/S0006291X09019470

Genetic Interactions

• Disabling two genes at the same time has a ”surprising” effect

Surprising: Clearly weaker growth OR cancellation of weak growth observed with knock-outs of single genes

• High throughput in simple organisms
– Phenotype: Growth
– http://www.nature.com/nmeth/journal/v4/n10/full/nmeth1098.html
– http://stke.sciencemag.org/cgi/content/abstract/sci;294/5550/2364

• Genes do not (necessarily) make physical interaction

• Interacting genes are often in backup pathways

Genetic vs. physical interaction

[Figure: Pathway A and Pathway B, two pathways that work as backup to each other. Blue lines = physical interactions, red lines = genetic interactions.]

Regulatory Networks

• Systems regulating gene expression activity in the cell
• Consist of interactions between transcription factors, other proteins, metabolic compounds
• High throughput datasets: gene expression data generated from perturbed organisms / cell lines
– http://www.embl.de/predoccourse/2008/modules/genomics/journal_club/HughesCell2000.pdf
– http://stke.sciencemag.org/cgi/content/abstract/sci;313/5795/1929
– http://www.sciencedirect.com/science/article/pii/S0167779911000023
• Primary or secondary effect?

Literature interaction

• Co-occurrence of gene names in published articles
• Includes all the previous interaction types
• False positives are a serious problem => requires good statistics

Older articles
http://cms.dt.uh.edu/Faculty/ChenP/IR/paper1-ng-v28-p21.pdf
http://www.nature.com/ng/journal/v36/n7/full/ng0704-664.html

New articles
http://bioinformatics.oxfordjournals.org/content/25/12/1536.short
http://bioinformatics.oxfordjournals.org/content/24/13/i277.short
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0034480

False positive interactions

• All these interaction datasets report also false positive interactions

• A single interaction cannot be trusted
– Especially if both genes have many interactions
• A group of interactions is more reliable
– Many alternative paths between points

Measuring similarity in network

• Try to emphasise reliable groups of interactions over random interactions
• Try to emphasise cases where there are several alternative paths between two genes
• Similarity can also predict an unobserved interaction

http://nar.oxfordjournals.org/content/40/W1/W140.full
http://www.pnas.org/content/100/8/4372.full
http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000350
http://genome.cshlp.org/content/18/12/1991.full
http://alfred.cse.buffalo.edu/DBGROUP/bioinformatics/papers/chuan.pdf
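One very simple way to reward alternative paths is to count the neighbours two genes share, since every shared neighbour is an alternative path of length two. The sketch below is only an illustration of that idea and is not the measure used in any of the articles linked above; the function name and the edge list are hypothetical.

```python
def shared_neighbor_similarity(edges, gene_a, gene_b):
    """Return (number of shared neighbours, Jaccard index of the two neighbourhoods)."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    # Leave the two genes themselves out of each other's neighbourhoods
    na = neighbors.get(gene_a, set()) - {gene_b}
    nb = neighbors.get(gene_b, set()) - {gene_a}
    shared = na & nb
    union = na | nb
    return len(shared), (len(shared) / len(union) if union else 0.0)


# Hypothetical interaction list: A and B are not linked directly but share C and D
edges = [("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("A", "E")]
print(shared_neighbor_similarity(edges, "A", "B"))
```

A pair with many shared neighbours but no observed edge, like A and B here, is the kind of case where similarity can suggest an unobserved interaction.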

Where is the data

• DIP (Database of Interacting Proteins)
– http://dip.doe-mbi.ucla.edu/dip/Main.cgi
• Biogrid
– http://thebiogrid.org/
• MINT (Molecular INTeraction database)
– http://mint.bio.uniroma2.it/mint/Welcome.do
• List of databases: http://proteome.wayne.edu/PIDBL.html

Use of interaction networks with large data

• Find interaction clusters from the gene lists
– Generate clusters from the selected gene group using the interaction data
– http://cms.dt.uh.edu/Faculty/ChenP/IR/paper1-ng-v28-p21.pdf
• Find correlations across two datasets
– http://bioinformatics.oxfordjournals.org/content/21/11/2730.full
– http://nar.oxfordjournals.org/content/31/21/6283.full
• Use the analyzed data to find interesting regions in the interaction graph
– http://bioinformatics.oxfordjournals.org/content/26/21/2713.full

Problems with interaction data

• False positive interactions
• Hidden circular logic
– Use of an interaction graph, generated from gene expression data, to evaluate gene expression data

• Combination of different types of interaction datasets might not be trivial

• Permutation analysis should help also here

Summary of interaction networks

• Popular data for evaluation of large-scale datasets

• Many types of interactions
• False positive interactions are a serious problem

Few words on writing essay

• Tips from Liisa:
• Select a topic/s that seems interesting
• Select key words and terms
• Use Google and Google Scholar
• Look for reviews, tutorials
– Google: Keyword1 Keyword2 tutorial OR guide OR review

• Keep eye on publication year and number of citations

• Keep the essay focused
