Module 2: Analyzing Gene Lists, Canadian Bioinformatics Workshops, www.bioinformatics.ca

Upload: harvey-gilbert

Post on 25-Dec-2015


Module 2: Analyzing Gene Lists

Canadian Bioinformatics Workshops

www.bioinformatics.ca



Module 2: Analyzing gene lists: over-representation analysis

Interpreting Genes from OMICS Studies

Quaid Morris


Overview

• The basics of over-representation analysis
• Lab #1
• Gene list statistics:
– A taxonomy of tests for over-representation
– Correcting for multiple tests
• Easy-to-use software tools for over-representation analysis
• Lab #2



Over-representation analysis (ORA) in a nutshell

• Given:
1. Gene list: e.g. RRP6, MRD1, RRP7, RRP43, RRP42 (yeast)
2. Gene annotations: e.g. Gene Ontology, transcription factor binding sites in promoter

• ORA question: Are any of the gene annotations surprisingly enriched in the gene list?

• Details:
– How to assess “surprisingly” (statistics)
– How to correct for repeating the tests


ORA example: Fisher’s exact test, a.k.a. the hypergeometric test

Background population: 500 black genes, 4500 red genes

Gene list: RRP6, MRD1, RRP7, RRP43, RRP42

Formal question: What is the probability of finding 4 or more black genes in a random sample of 5 genes?


ORA example: Fisher’s exact test

Background population: 500 black genes, 4500 red genes

Gene list: RRP6, MRD1, RRP7, RRP43, RRP42

[Figure: null distribution, with the P-value region marked]

Answer: P-value = 4.6 x 10^-4
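As a rough check of the answer above, the tail probability can be computed directly from binomial coefficients. This is a minimal sketch, assuming a background of 5000 genes in total (500 black, 4500 red), which is the reading consistent with N1 = 500 and N2 = 4500 used later for the T-test. The function name `hypergeom_tail` is ours and does not come from any tool named in these slides.

```python
from math import comb

def hypergeom_tail(k, list_size, n_black, n_total):
    """P(at least k black genes in a random draw of list_size genes)."""
    n_red = n_total - n_black
    all_draws = comb(n_total, list_size)
    # Count the draws that contain exactly i black genes, for i = k .. list_size.
    hits = sum(
        comb(n_black, i) * comb(n_red, list_size - i)
        for i in range(k, min(list_size, n_black) + 1)
    )
    return hits / all_draws

# 4 or more black genes in a gene list of 5 (assumed background: 5000 genes).
p_value = hypergeom_tail(k=4, list_size=5, n_black=500, n_total=5000)
print(f"P-value = {p_value:.1e}")
```

Under these assumptions the result agrees with the 4.6 x 10^-4 quoted on the slide to two significant figures.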


Important details

• To test for under-enrichment of “black”, test for over-enrichment of “red”.

• Need to choose the “background population” appropriately, e.g., if only a portion of the total gene complement is queried (or available for annotation), use only that population as background.

• To test for enrichment of more than one independent type of annotation (red vs black and circle vs square), apply Fisher’s exact test separately for each type. ***More on this later***


What have we learned?

• Over-representation analysis (ORA) detects surprising enrichment of gene annotations in a gene list.

• Fisher’s exact test is used for ORA of gene lists for a single type of annotation.

• The P-value for Fisher’s exact test
– is “the probability that a random draw of the same size as the gene list from the background population would produce the observed number of annotations in the gene list or more”,
– and depends on the size of both the gene list and the background population, as well as the # of black genes in the gene list and background.



Break for lab #1

• Try out an over-representation analysis using Fisher’s exact test

• Funspec: http://funspec.med.utoronto.ca/



Examples of sources of gene lists

[Figure, two panels, each producing a gene list:
• Clustering (axis: Genes). Source: Eisen et al. (1998) PNAS 95
• Thresholding a gene “score” (examples of gene scores; axes: Genes, Time). Source: Gerber et al. (2006) PNAS 103]


ORA using gene scores

[Figure: gene scores for the black and red genes, and the resulting gene score distributions]

Question: How likely are the differences between the two distributions due to chance?


ORA using the T-test

[Figure: gene score distributions for the black and red genes]

Formal question: Are the means of the two distributions significantly different?

Answer: Two-tailed T-test

Black: N1 = 500, mean m1 = 1.1, std s1 = 0.9

Red: N2 = 4500, mean m2 = 4.9, std s2 = 1.0

$$T = \frac{m_1 - m_2}{\sqrt{\frac{s_1^2}{N_1} + \frac{s_2^2}{N_2}}} = -88.5$$
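The T-statistic on this slide can be reproduced from the summary statistics alone. The sketch below assumes the second set of numbers on the slide (mean 4.9, std 1.0) belongs to the red genes, i.e. m2 and s2; the function name `t_statistic` is ours.

```python
from math import sqrt

def t_statistic(m1, s1, n1, m2, s2, n2):
    """Two-sample T-statistic with a separate variance term for each group."""
    return (m1 - m2) / sqrt(s1**2 / n1 + s2**2 / n2)

# Black genes: N1 = 500, m1 = 1.1, s1 = 0.9; red genes: N2 = 4500, m2 = 4.9, s2 = 1.0
t = t_statistic(m1=1.1, s1=0.9, n1=500, m2=4.9, s2=1.0, n2=4500)
print(f"T-statistic = {t:.1f}")
```

The sign is negative because the black mean is below the red mean. The two-tailed P-value depends only on the magnitude.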


ORA using the T-test

$$T = \frac{m_1 - m_2}{\sqrt{\frac{s_1^2}{N_1} + \frac{s_2^2}{N_2}}} = -88.5$$

[Figure: T-distribution (probability density vs. T-statistic), with the observed value of -88.5 marked to the left of 0; P-value = shaded area * 2]

Formal question: Are the means of the two distributions significantly different?


T-test caveats (also see next slide)

1. Assumes black and red gene score distributions are both approximately Gaussian (i.e. normal)
– Score distribution assumption is often true for:
• Log ratios from microarrays
– Score distribution assumption is rarely true for:
• Peptide counts, sequence tags (SAGE or NextGen sequencing), transcription factor binding site hits

2. Tests for the significance of the difference in means of two distributions, but does not test for other differences between the distributions.


Examples of inappropriate score distributions for T-tests

[Figure: three panels of probability density vs. gene score, captioned:]
• Gene scores are positive and have increasing density near zero, e.g. sequence counts
• Distributions with gene score outliers, or “heavy-tailed” distributions
• Bimodal “two-bumped” distributions

Solutions:
1) Robust test for difference of medians (WMW)
2) Direct test of difference of distributions (K-S)


Wilcoxon-Mann-Whitney (WMW) test, aka Mann-Whitney U-test, Wilcoxon rank-sum test

1) Rank gene scores, calculate RB, sum of ranks of black gene scores

Ranked gene scores (rank 1 = highest): 6.5, 5.6, 4.5, 3.2, 2.1, 1.7, 0.1, -0.5, -1.1, -2.5

N1 black gene scores: 3.2, 1.7, 6.5, 4.5, 0.1 (ranks 4, 6, 1, 3, 7)

N2 red gene scores: 2.1, 5.6, -1.1, -2.5, -0.5 (ranks 5, 2, 9, 10, 8)

RB = 21

[Figure: probability density vs. gene score for the two distributions]

Formal Question: Are the medians of the two distributions significantly different?


Wilcoxon-Mann-Whitney (WMW) test, aka Mann-Whitney U-test, Wilcoxon rank-sum test

RB = 21

Formal question: Are the medians of the two distributions significantly different?

2) Calculate Z-score:

$$Z = \frac{R_B - \frac{N_1 (N_1 + N_2 + 1)}{2}}{\sigma_U} = -1.4$$

where the subtracted term, labelled “mean rank” on the slide, is the expected value of RB.

3) Calculate P-value:

[Figure: normal distribution (probability density vs. Z), with Z = -1.4 marked to the left of 0; P-value = shaded area * 2]


WMW test details

• Described method is only applicable for large N1 and N2 and when there are no tied scores

• Note: WMW test calculates the significance of the difference of medians, T-test calculates the significance of the difference of means

• WMW test is robust to (a few) outliers

• $\sigma_U = \sqrt{N_1 N_2 (N_1 + N_2 + 1) / 12}$
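The three WMW steps can be traced on the ten example scores from the slides. This is a sketch only: it assumes the five black scores are the ones whose ranks sum to RB = 21, takes the standard deviation as the square root of N1 N2 (N1 + N2 + 1) / 12, and uses the normal approximation even though, as noted above, that approximation is meant for large N1 and N2.

```python
from math import erfc, sqrt

black = [3.2, 1.7, 6.5, 4.5, 0.1]    # N1 black gene scores
red = [2.1, 5.6, -1.1, -2.5, -0.5]   # N2 red gene scores

# Step 1: rank all scores together, rank 1 = highest (no ties in this example).
pooled = sorted(black + red, reverse=True)
rank = {score: i + 1 for i, score in enumerate(pooled)}
r_b = sum(rank[score] for score in black)

# Step 2: Z-score of RB relative to its expected value under the null.
n1, n2 = len(black), len(red)
mean_rank_sum = n1 * (n1 + n2 + 1) / 2
sigma_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
z = (r_b - mean_rank_sum) / sigma_u

# Step 3: two-tailed area under the standard normal distribution.
p_value = erfc(abs(z) / sqrt(2))

print(f"RB = {r_b}, Z = {z:.1f}, P-value = {p_value:.2f}")
```

With only five scores per group, an exact or tie-aware implementation from a statistics package would be the safer choice in practice.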


Kolmogorov-Smirnov (K-S) test

[Figure: probability density vs. gene score for the red and black distributions]

Question: Are the red and black distributions significantly different?

1) Calculate cumulative distributions of red and black

[Figure: cumulative distribution (cumulative probability, 0 to 1.0, vs. gene score) for red and black]



Kolmogorov-Smirnov (K-S) test

Formal question: Is the length of the largest difference between the “empirical distribution functions” statistically significant?

[Figure: cumulative distributions of red and black (cumulative probability, 0 to 1.0, vs. gene score), with the largest difference marked: Length = 0.4]
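The K-S “length” is the largest gap between the two empirical distribution functions. The sketch below computes it for the ten example scores used for the WMW test; these are not the data behind the figure (which gives 0.4), so the value differs. The function name `ks_statistic` is ours, and the P-value step is left out.

```python
def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical distribution functions."""
    def ecdf(sample, x):
        # Fraction of the sample that is less than or equal to x.
        return sum(1 for value in sample if value <= x) / len(sample)

    # The gap can only change at observed scores, so checking those is enough.
    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in sample_a + sample_b)

black = [3.2, 1.7, 6.5, 4.5, 0.1]
red = [2.1, 5.6, -1.1, -2.5, -0.5]
length = ks_statistic(black, red)
print(f"Length = {length:.1f}")
```

Unlike the rank sum, this statistic responds to any difference in shape between the two distributions, not only to a shift in location.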


WMW and K-S test caveats

• Neither test is as sensitive as the T-test, i.e. they require more data points to detect the same amount of difference, so use the T-test whenever it is valid.

• The K-S test and WMW can give you different answers: K-S detects a difference of distributions, WMW detects a difference of medians.

• Rare problem: Tied scores and small # of observations can be a problem for some implementations of the WMW test


Proper tests for different distributions

[Figure: three panels of probability density vs. gene score, captioned:]
• Gene scores are positive and have increasing density near zero, e.g. sequence counts
• Distributions with gene score outliers, or “heavy-tailed” distributions
• Bimodal “two-bumped” distributions

Recommended test (one label per panel, in slide order): WMW or K-S; K-S only; WMW or K-S


What have we learned?

• The T-test is not valid when one or both of the score distributions are not normal.

• If you need a “robust” test, or want to test for a difference of medians, use the WMW test.

• To test for an overall difference between two distributions, use the K-S test.


Other common tests and distributions

• Chi-squared (contingency table) test
– Useful if there are >2 values of annotation (e.g. red genes, black genes, and blue genes)
– Used as an approximation to Fisher’s exact test but is inaccurate for small gene lists

• Binomial test
– Tests if gene scores for red and black come from either N flips of the same coin or different coins.
– E.g. black genes are “expressed” in, on average, 5 out of 12 conditions and red genes are expressed in, on average, 2 out of 12 conditions; is the probability of being expressed significantly different for the black and red genes?
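To see why the chi-squared approximation is unreliable for small gene lists, it can be applied to the earlier five-gene example. This sketch assumes a background of 5000 genes (500 black, 4500 red) and a gene list containing 4 black genes and 1 red gene, and it uses the plain Pearson statistic without any continuity correction; the function name `chi_squared_2x2` is ours.

```python
from math import erfc, sqrt

def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic and P-value (1 d.o.f.) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = sum(
        (observed[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
        for i in range(2)
        for j in range(2)
    )
    # For 1 degree of freedom the chi-squared tail equals erfc(sqrt(stat / 2)).
    return stat, erfc(sqrt(stat / 2))

# Rows: in gene list / not in gene list; columns: black / red.
stat, p_value = chi_squared_2x2(4, 1, 496, 4499)
print(f"chi-squared = {stat:.2f}, P-value = {p_value:.1e}")
```

The expected count of black genes in the list is only 0.5, far too small for the approximation to hold. Under these assumptions the resulting P-value comes out orders of magnitude smaller than the 4.6 x 10^-4 quoted for the exact test.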



How to win the P-value lottery, part 1

Background population: 500 black genes, 4500 red genes

Random draws

… 7,834 draws later …

Expect a random draw with observed enrichment once every 1 / P-value draws
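A short calculation makes the lottery concrete. The sketch below assumes independent random draws and reuses the Fisher's exact test P-value of 4.6 x 10^-4 from the earlier example; the variable names are ours.

```python
p_value = 4.6e-4   # P-value of the observed enrichment (earlier example)
n_draws = 7834     # number of random draws shown on the slide

# On average, one draw in every 1 / P-value is at least as enriched as the observed list.
expected_wait = 1 / p_value

# Probability that at least one of n_draws random draws is that enriched.
prob_at_least_one = 1 - (1 - p_value) ** n_draws

print(f"Expected draws per hit: {expected_wait:.0f}")
print(f"P(at least one hit in {n_draws} draws) = {prob_at_least_one:.2f}")
```

With enough draws, an enrichment this strong is close to guaranteed to show up by chance alone.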


How to win the P-value lottery, part 2

Keep the gene list the same, evaluate different annotations

Observed draw: RRP6, MRD1, RRP7, RRP43, RRP42

[Figure: the same gene list (RRP6, MRD1, RRP7, RRP43, RRP42) shown with different annotations]


ORA tests need correction

From the Gene Ontology website:

Current ontology statistics: 25206 terms
• 14825 biological process
• 2101 cellular component
• 8280 molecular function


Simple P-value correction: Bonferroni

If M = # of annotations tested:

Corrected P-value = M x original P-value

Corrected P-value is greater than or equal to the probability that any single one of the observed enrichments could be due to random draws. The jargon for this correction is “controlling for the Family-Wise Error Rate (FWER)”.
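The Bonferroni correction is a one-line calculation. In the sketch below, capping the corrected value at 1 is our addition (a probability cannot exceed 1), and the function name `bonferroni` is ours.

```python
def bonferroni(p_value, n_tests):
    """Bonferroni-corrected P-value: M x original P-value, capped at 1."""
    return min(1.0, n_tests * p_value)

# The Fisher's exact test example, corrected for testing every GO term.
print(bonferroni(4.6e-4, 25206))

# The same P-value corrected for only 100 annotations.
print(bonferroni(4.6e-4, 100))
```

With all 25206 GO terms tested, the earlier P-value of 4.6 x 10^-4 is corrected all the way to 1, while with 100 tests it stays just below 0.05.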


Bonferroni correction caveats

• Bonferroni correction is very stringent and can “wash away” real enrichments.

• Often users are willing to accept a less stringent condition, the “false discovery rate” (FDR), which leads to a gentler correction when there are real enrichments.


False discovery rate (FDR)

• FDR is the expected proportion of the observed enrichments that are due to random chance.

• Compare to the Bonferroni correction, which controls the probability that any one of the observed enrichments is due to random chance.


Benjamini-Hochberg (B-H) FDR

If α is the desired FDR (i.e. level of significance), then choose the corresponding cutoff for the original P-values as follows:

1) Rank all “M” P-values, from largest (rank 1) to smallest (rank M)

2) Test each P-value against q = α x (M - Rank + 1) / M, e.g. let M = 100, α = 0.05:

Rank   P-value   q             Is P-value < q?
1      0.9       0.05 x 1.00   No
2      0.7       0.05 x 0.99   No
3      0.5       0.05 x 0.98   No
4      0.04      0.05 x 0.97   Yes
…      …         ...           …
M      0.005     0.05 x 0.01   No

3) The new P-value cutoff is the first P-value to pass the test.

P-value cutoff of 0.04 ensures FDR < 0.05
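The B-H steps above translate almost line for line into code. This sketch follows the slide's strict “<” comparison (textbook statements of the procedure usually allow equality), takes the missing symbol to be a desired FDR of 0.05 as in the slide's example, and uses made-up P-values for M = 10 annotations; the function name `bh_cutoff` is ours.

```python
def bh_cutoff(p_values, alpha=0.05):
    """Return the B-H cutoff for the original P-values, or None if nothing passes."""
    m = len(p_values)
    ranked = sorted(p_values, reverse=True)   # rank 1 = largest P-value
    for rank, p in enumerate(ranked, start=1):
        q = alpha * (m - rank + 1) / m
        if p < q:
            return p   # first P-value to pass the test
    return None

# Hypothetical P-values for M = 10 annotations (illustration only).
p_values = [0.9, 0.7, 0.5, 0.04, 0.02, 0.012, 0.008, 0.005, 0.001, 0.0005]
cutoff = bh_cutoff(p_values)
hits = [p for p in p_values if cutoff is not None and p <= cutoff]
print(f"cutoff = {cutoff}, hits = {len(hits)}")
```

In this made-up example 0.04 would pass an uncorrected 0.05 threshold but fails its q of 0.035, so the cutoff lands on the next P-value down the ranking.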


Reducing multiple test correction stringency

• The correction to the P-value threshold depends on the # of tests that you do, so, no matter what, the more tests you do, the more sensitive the test needs to be

• Can control the stringency by reducing the number of tests: e.g. use GO slim or restrict testing to the appropriate GO annotations.


What have we learned

• When testing multiple annotations, need to correct the P-values (or, equivalently, the significance threshold α) to avoid winning the P-value lottery.

• There are two types of corrections:
– Bonferroni controls the probability that any one test is due to random chance (aka FWER) and is very stringent
– B-H controls the FDR, i.e., the expected proportion of “hits” that are due to random chance

• Can control stringency by carefully choosing which annotation categories to test.



Funspec: Simple ORA for yeast

http://funspec.med.utoronto.ca/

Paste gene list here

Bonferroni correct? YES!

Choose sources of annotation

Caveats:
• yeast only
• last updated 2002


GoMiner, part 1

http://discover.nci.nih.gov/gominer

1. Click “web interface”

2. Upload names of background genes

3. Upload gene list

4. Choose organism

5. Choose evidence code (All or Level 1)


GoMiner, part 2

6. Restrict # of tests via category size

7. Restrict # of tests via GO hierarchy

8. Results emailed to this address, in a few minutes


DAVID, part 1 http://david.abcc.ncifcrf.gov/

Paste list here

Choose ID type

List type: list or background?

DAVID automatically detects organism


DAVID, part 2

http://david.abcc.ncifcrf.gov/


BINGO, an ORA Cytoscape plugin

http://www.psb.ugent.be/cbd/papers/BiNGO/index.htm

Links represent parent-child relationships in GO ontology

Colours represent significance of enrichment

Nodes represent GO categories


Other tools

• GSEA: Gene Set Enrichment Analysis
– http://www.broad.mit.edu/gsea/
– More complex tool that allows gene scores to be analyzed for enrichment
– Has extensive gene annotations available


What have we learned

• Web-based ORA tools for gene lists:
– Funspec:
• Easy tool for yeast, not maintained, uses GO annotations and some other annotations (e.g. protein complexes)
– GoMiner:
• Uses GO annotations, covers many organisms, needs a background set of genes

• Cytoscape-based ORA tools for gene lists:
– BINGO:
• Uses GO annotations, displays enrichment results graphically, and visually organizes related categories



Lab #2

• Use GoMiner to analyze a yeast gene list.
• Protocol:
– Step 1: Get list of all yeast genes from Biomart
• http://www.biomart.org/biomart/martview
– Step 2: Translate gene list IDs into gene symbols using Synergizer
• http://llama.med.harvard.edu/cgi/synergizer/translate
– Step 3: Do an enrichment analysis using GoMiner
• http://discover.nci.nih.gov/gominer/


Questions?