Post on 19-Dec-2015

The Statistical Significance of Max-gap Clusters

Rose Hoberman

David Sankoff

Dannie Durand

Gene Clustering for Functional Inference in Bacterial Genomes

The Use of Gene Clusters to Infer Functional Coupling, Overbeek et al., PNAS 96: 2896-2901, 1999.

Gene content and order are preserved

rearrangement, mutation

Similarity in gene content

Neither content nor order is strictly preserved

large scale duplication or speciation event

original genome

“Evolution of gene order conservation in prokaryotes”

Tamames, Genome Biology 2, 2001

Gene insertion/loss

Local rearrangement

Two Possible Questions

1. Given a set of genes that we believe are functionally related, determine if they cluster together spatially more than we would expect by chance

2. Identify all significantly conserved gene clusters as a starting point for making functional inferences

Question 1 is the reference set scenario; question 2 is whole genome comparison.

Reference Set Scenario

• Model of a genome
– G = 1, …, n; an ordered set of n unique genes
– assume genes do not overlap
– chromosome breaks ignored

• Reference gene scenario:
– m genes of interest (in red) are pre-specified
– want to find clusters of (a subset of) these genes

Whole Genome Scenario

Given: two genomes, G = 1, …, n and H = 1, …, n

Find all significant clusters of at least k homologs in close proximity in both genomes.

Outline

• What formalisms do we need to address these questions?
– Definitions: formulate a cluster definition
– Algorithms: identify clusters in real data
– Statistics: assess the significance of one or more clusters

• Reference set scenario
• Whole genome comparison
• Conclusion

Why develop a formal statistical model?

• Understand trends and verify that they match our expectations

• Choose parameters effectively

• Statistical tests for data analysis

Typically researchers use randomization tests to estimate statistical significance

Cluster Definitions

• An intuitive notion of a cluster is a group of genes
– occurring in close proximity
– neither gene content nor order is strictly conserved

• Algorithms and statistics require a formal definition.
– What properties are desirable?
– Do existing definitions have these properties?

Possible Cluster Parameters

– size: number of red genes in the cluster
  • Example: cluster size ≥ 3 (figure: size = 3 genes)

– length: number of genes between first and last red genes
  • Example: cluster length ≤ 6 (figure: length = 6)

– density: proportion of red genes (size/length)
  • Example: density ≥ 0.5 (figure: density = 6/11)

– compactness: maximum gap between adjacent red genes
  • Example: gap ≤ 4 genes
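The four parameters can be computed directly from the positions of the red genes. The sketch below is illustrative only: the function name `cluster_parameters` is hypothetical, positions are assumed to be 0-based indices into the genome, length is assumed to count the first and last red genes inclusively, and a gap is assumed to be the number of non-red genes between two neighbouring red genes.

```python
def cluster_parameters(red_positions):
    """Summarize a candidate cluster from the genome positions of its red genes.

    Assumptions (not stated on the slides): positions are integer indices,
    length is inclusive of both end genes, and a gap counts the non-red
    genes lying strictly between two neighbouring red genes.
    """
    pos = sorted(red_positions)
    if not pos:
        raise ValueError("need at least one red gene")
    size = len(pos)
    length = pos[-1] - pos[0] + 1          # span from first to last red gene
    density = size / length                # proportion of red genes in the span
    max_gap = max((b - a - 1 for a, b in zip(pos, pos[1:])), default=0)
    return {"size": size, "length": length, "density": density, "max_gap": max_gap}


# Six red genes spread over a span of eleven genes, i.e. density 6/11.
params = cluster_parameters([2, 3, 6, 7, 9, 12])
```

With these conventions the example has size 6, length 11, density 6/11 and a maximum gap of 2.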

Max-Gap Cluster

• Commonly used in analysis of genomic data

• Desirable properties
– Ensures minimum local density
– Extensible: doesn’t artificially limit cluster length
– Disjoint: clusters will not overlap

[Figure: adjacent red genes separated by a gap of at most g]
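In the reference set scenario, max-gap clusters in a single genome can be found with one left-to-right scan. The following is a minimal sketch under the assumption that a gap is the number of non-red genes between neighbouring red genes; the name `max_gap_clusters` is hypothetical and not taken from the talk.

```python
def max_gap_clusters(red_positions, g):
    """Split red gene positions into maximal runs whose neighbouring gaps are <= g.

    Assumption: the gap between two red genes is the number of other genes
    strictly between them (positions 4 and 7 have a gap of 2).
    """
    clusters = []
    current = []
    for p in sorted(red_positions):
        # Start a new cluster when the gap to the previous red gene exceeds g.
        if current and p - current[-1] - 1 > g:
            clusters.append(current)
            current = []
        current.append(p)
    if current:
        clusters.append(current)
    return clusters


clusters = max_gap_clusters([1, 2, 5, 12, 13, 20], g=2)
```

Because each red gene is placed in exactly one run, the resulting clusters are disjoint, and a run keeps growing as long as the next red gene is close enough, which reflects the extensibility property listed above.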

A Statistical Model

• Given
– a genome: G = 1, …, n unique genes
– a set of m reference genes
– a maximum-gap size g

• Null hypothesis:
– Random gene order

• Alternate hypotheses:
– Evolutionary history
– Functional selection

• We provide
– analytical and dynamic programming solutions
– to determine cluster significance exactly
– for the reference set scenario

Hoberman, Sankoff and Durand. In "Proceedings of the RECOMB Satellite Workshop on Comparative Genomics", J. Lagergren, ed., Lecture Notes in Bioinformatics, Springer Verlag, in press.

Hoberman, Sankoff, Durand. Submitted to RECOMB 2005.

Statistics of Max-Gap Gene Clusters

Test Statistic: Complete Clusters

The probability of observing all m reference genes in a max-gap cluster in G
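Under the random gene order null hypothesis, this probability can be obtained by direct counting. The sketch below is one straightforward way to do that and is not claimed to be the authors' formula: it counts the ways to fill the m−1 gaps with between 0 and g intervening genes each, and for a total of s intervening genes the cluster spans m+s genes and can start at n−m−s+1 positions. The function name `prob_complete_cluster` and the convention that a gap counts intervening genes are assumptions.

```python
from math import comb


def prob_complete_cluster(n, m, g):
    """P(all m reference genes form one max-gap cluster) when the m genes
    occupy a uniformly random set of positions among n.

    Assumption: a gap is the number of non-reference genes between two
    neighbouring reference genes, and every gap must be <= g.
    """
    if not (1 <= m <= n) or g < 0:
        raise ValueError("require 1 <= m <= n and g >= 0")
    max_total = min((m - 1) * g, n - m)   # most intervening genes that can fit
    # ways[s] = number of ways to fill the gaps so far with s genes in total
    ways = [1] + [0] * max_total
    for _ in range(m - 1):
        new = [0] * (max_total + 1)
        for s, w in enumerate(ways):
            if w:
                for extra in range(min(g, max_total - s) + 1):
                    new[s + extra] += w
        ways = new
    # A cluster spanning m + s genes has n - m - s + 1 possible start positions.
    favourable = sum(w * (n - m - s + 1) for s, w in enumerate(ways))
    return favourable / comb(n, m)


p_small = prob_complete_cluster(5, 2, 1)   # 7 of the 10 possible pairs qualify
```

For the parameter ranges on the slides, a call such as `prob_complete_cluster(1000, 50, 20)` runs quickly because the counts are kept as exact integers until the final division.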

Test Statistic: Incomplete Clusters

The probability of observing at least h of the m reference genes in a max-gap cluster in G
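For incomplete clusters, a randomization test gives a quick estimate of this probability. The sketch below is a Monte Carlo approximation, not the exact dynamic programming solution mentioned in the talk; the names `largest_max_gap_cluster` and `estimate_incomplete_pvalue`, the default number of trials, and the gap convention (intervening genes) are all assumptions.

```python
import random


def largest_max_gap_cluster(positions, g):
    """Size of the largest run of positions whose neighbouring gaps are <= g.

    Assumes at least one position; a gap counts the genes strictly between.
    """
    pos = sorted(positions)
    best = run = 1
    for a, b in zip(pos, pos[1:]):
        run = run + 1 if b - a - 1 <= g else 1
        best = max(best, run)
    return best


def estimate_incomplete_pvalue(n, m, g, h, trials=10000, seed=0):
    """Estimate P(at least h of m reference genes fall in one max-gap cluster)
    by placing the m genes uniformly at random among n positions."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        placement = rng.sample(range(n), m)
        if largest_max_gap_cluster(placement, g) >= h:
            hits += 1
    return hits / trials
```

An estimate of this kind cannot resolve probabilities much smaller than 1/trials, which is one practical reason to prefer an exact calculation when testing at a level such as α = 0.0001.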

[Figure: cluster significance, n = 1000, m = 50]

• n = number of genes in each genome
• m = number of genes shared between the two genomes
• g = maximum allowed gap size
• h = size of cluster (e.g. number of red genes)

[Figure: n = 500, h = m/2]

Significant Parameter Values (α = 0.0001)

[Figure: n = 500]

Whole genome comparison

Find all sets of genes that form max-gap clusters in both genomes.

[Figure: example cluster in two genomes, gap parameter g labeled 10 in each]

Properties of Max-Gap Clusters for Whole Genome Comparison

• Clusters are locally dense in both genomes

• Clusters are still guaranteed to be disjoint.

• The definition is symmetric with respect to genome

Most existing cluster algorithms are not symmetric!

If g = 2
• There is no valid max-gap cluster of size two or three
• There is a valid max-gap cluster of size four

Algorithms: Finding Max-Gap Clusters

• A consequence of this is that a greedy iterative approach will not find all max-gap clusters
– Specifically, larger clusters that don’t contain smaller ones will not be found
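The lack of nesting can be checked by brute force on a toy case. The example below is constructed for illustration and uses g = 0 rather than the slide's g = 2: genomes "abcd" and "cadb" share a chain of size four but none of size two or three. The function names are hypothetical, a gap is assumed to count intervening genes, and the code enumerates every gene set that is a max-gap chain in both genomes without filtering for maximality. It is exponential in the number of shared genes, so it is only a sanity check and not a substitute for an efficient algorithm.

```python
from itertools import combinations


def is_chain(genes, order, g):
    """True if `genes` form a max-gap chain in `order`, i.e. at most g other
    genes lie between any two neighbouring members (assumed gap convention)."""
    index = {gene: i for i, gene in enumerate(order)}
    pos = sorted(index[x] for x in genes)
    return all(b - a - 1 <= g for a, b in zip(pos, pos[1:]))


def shared_chains(genome_g, genome_h, g, min_size=2):
    """All gene sets of at least min_size that are chains in both genomes."""
    shared = sorted(set(genome_g) & set(genome_h))
    found = []
    for k in range(min_size, len(shared) + 1):
        for subset in combinations(shared, k):
            if is_chain(subset, genome_g, g) and is_chain(subset, genome_h, g):
                found.append(subset)
    return found


# Toy genomes: each character is one gene.
chains = shared_chains("abcd", "cadb", g=0)
```

Here the only set found is the full set of four genes, so a procedure that grows clusters from pairs or triples would have nothing to start from.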

There is an efficient divide-and-conquer algorithm to find all max-gap clusters (Bergeron et al, 2002)

Since algorithms are generally not stated formally in application papers, we don’t know whether people are actually getting what they think they’re getting

Work in Progress…

Statistics: Whole genome comparison

What is the probability that at least k genes form a max-gap cluster in both genomes?

Assuming identical gene content, the probability of finding a max-gap cluster of size at least k is always one!

An Example

Example: g = 1

• A cluster of size k does not necessarily contain a cluster of size k-1

• When gene content is identical, there will always be a cluster of size n

• Therefore, for all k, there will always be a cluster of size at least k

• Therefore, the probability of finding a cluster of size at least k is always one!

Relaxing the Assumption of Identical Gene Content

• Assume only m of the n genes in each genome are shared

• If the longest run of “non-shared” genes is less than g then we are still guaranteed to find a complete cluster
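This condition is easy to test for a given genome. The sketch below is illustrative: the name `complete_cluster_guaranteed` is hypothetical, and it assumes a gap counts the non-shared genes between neighbouring shared genes, so the check is "longest run ≤ g". If the gap is instead measured as a difference of positions, the same condition reads "less than g", as on the slide. A complete cluster exists when the check passes in both genomes.

```python
def complete_cluster_guaranteed(genome, shared, g):
    """True if all shared genes form one max-gap chain in this genome.

    Assumption: a gap is the number of non-shared genes between two
    neighbouring shared genes; runs before the first or after the last
    shared gene do not matter.
    """
    positions = [i for i, gene in enumerate(genome) if gene in shared]
    longest_run = max((b - a - 1 for a, b in zip(positions, positions[1:])), default=0)
    return longest_run <= g


# Each character is one gene; a, b and c are shared, x and y are not.
ok_with_g2 = complete_cluster_guaranteed("axbyyc", set("abc"), g=2)
ok_with_g1 = complete_cluster_guaranteed("axbyyc", set("abc"), g=1)
```

In the example the longest internal run of non-shared genes is two ("yy"), so the check passes for g = 2 and fails for g = 1.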

More generally…

Simulations of randomly ordered genomes show that large clusters may be very likely to occur merely by chance

Unexpected Statistical Trends

• There can be a significant probability of finding a cluster that includes all homologous gene pairs

• The significance of a cluster of size k can be less than that of a cluster of size k-1

• Probabilities are not monotonic

• Large clusters may not be significant

n = 1000, m = 250, g=20

Probability of a cluster of size 250 ~ 50%

Clusters Are Used in Many Other Applications

Inferring functional coupling of genes in bacteria (Overbeek et al 1999)

Recent polyploidy in Arabidopsis (Blanc et al 2003)

Sequence of the human genome (Venter et al 2001)

Duplications in Arabidopsis through comparison with rice (Vandepoele et al 2002)

Duplications in Eukaryotes (Vision et al 2000)

Identification of horizontal transfers (Lawrence and Roth 1996)

Evolution of gene order conservation in prokaryotes (Tamames 2001)

Ancient yeast duplication (Wolfe and Shields 1997)

Genomic duplication during early chordate evolution (McLysaght et al 2002)

Comparing rates of rearrangements (Coghlan and Wolfe 2002)

Genome rearrangements after duplication in yeast (Seoighe and Wolfe 1998)

Operon prediction in newly sequenced bacteria (Chen et al 2004)

Breakpoints as phylogenetic features (Blanchette et al 1999)...

Max-Gap Clusters are Especially Common

Formal statistical models allow us to
– understand trends and verify that they match our expectations
– choose parameters effectively
– conduct statistical tests for data analysis

Formal statistical models require
– a formal cluster definition
– a search procedure to find clusters

These issues are more complicated than they might seem!

Summary

Results: statistical tests of significance for max-gap clusters
• Reference set scenario
• Genome comparison (work in progress)

We need to
• explicitly consider the cluster properties we would like our definitions to satisfy
• rigorously evaluate whether our definition meets these requirements
• carefully prove that our search procedures match our stated definitions

Thank You
