
Defining Gene Clusters: 24 Ways of Looking at Mount Fuji

Anne Bergeron, UQAM

Dublin, September 19, 2005

7. Mt Fuji from the Foot


"It struck me that it would be good to take one thing in life and regard it from many viewpoints, ... " Roger Zelazny

The basic problem

Genome A

Genome B

Genome C

We start with a set of genomes, labeled by gene names, domains, or synteny blocks, and a similarity relation on those labels.

Highlighting a gene means selecting all labels that are similar.

Genes, or other types of signals, can appear in multiple copies in a genome, or even be missing. In this talk, the similarity relation is "given" and is an equivalence relation.

Genome A

Genome B

Genome C

The basic problem

We are interested in what happens when a set of genes is highlighted.

A set of genes : { }

Boring...

Genome A

Genome B

Genome C

The basic problem

Another set of genes: { }

Interesting?

Measures of surprise are studied by Durand, Haque, Hoberman, Sankoff, Raghupathy, etc.

The basic problem

Goal : Given a (big) set of genomes, automatically identify all potentially interesting sets of genes.

1. Mount Fuji from Owari

Towards formal models


What do labels stand for?

How many labels and genomes do we want to compare ?

What do we want to do with the resulting clusters ?

Towards formal models: Example 1

From: Eichler and Sankoff, Science (301:793-797), 2003

Definition of labels and similarity:

Large homology segments disrupted only by local micro-rearrangements.

A total of 281 synteny blocks, colored in the human genome by their mouse chromosome number.

Interesting features:

Chromosome X
Chromosome 17
Chromosome 20

Application:

Genome evolution

Towards formal models: Example 2

Definition of labels and similarity:

Gene annotations of chloroplasts.

Trachelium

Campanula

Adenophora

Symphyandra

Wahlenbergia

Merciera

Interesting features:

Rearrangements

Application:

Phylogeny

Towards formal models: Example 3

From: Pasek et al, Genome Research (15:867-874), 2005

Definition of labels and similarity:

PFAM Domain numbers labeling four bacterial genomes.

Interesting features:

Duplications
Insertions
Rearrangements

Application:

Operon identification

Towards formal models: Example 4

From: Pasek et al, Genome Research (15:867-874), 2005

Definition of labels and similarity:

PFAM Domain numbers labeling four bacterial genomes.

Application:

Identification of orthologs and/or duplicate segments.

With such a high E-value, the potential duplicate would have been missed by a comparison based on sequence similarity.

Towards formal models: Example 5

Definition of labels and similarity:

Large homology segments disrupted only by local micro-rearrangements.

Comparing 16 segments of the mouse and rat chromosome X.

Application:

Reconstructing ancestors

From: Bérard et al, WABI 2005

Mouse = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Rat = -4 -3 -2 1 -13 -15 14 -16 8 9 10 -11 12 5 6 7

2. Mt Fuji from a Teahouse at Yoshida

Down to earth details


Do we allow gaps ?

Do we allow rearrangements?

Do we allow duplicates and missing genes ?

Do we allow multiple genomes or self-comparison ?

How about "extensions" ?

Genome A

Genome B

Genome C

A set of genes: { }

Down to earth details : Model 1

No gaps, no duplications, any rearrangement.

Genome A

Genome B

Genome C

A set of genes: { }

No gaps, no duplications, any rearrangement.

What about this gene? Should we add it ?

Down to earth details : Model 1

Genome A

Genome B

Genome C

A set of genes: { }

No gaps, no duplications, any rearrangement.

What about this gene? Should we add it ?

Down to earth details : Model 1

Extension

Genome A

Genome B

Genome C

A set of genes: { }

No gaps, duplications, any rearrangement.

Genes not in the set

Down to earth details : Model 2

Genome A

Genome B

Genome C

A set of genes: { }

Gaps, no duplications, any rearrangement.

Down to earth details : Model 3

Genome A

Genome B

Genome C

A set of genes: { }

Gaps, missing/inserted genes, any rearrangement.

Down to earth details : Model 4

Genome A

Genome B

Genome C

A set of genes: { }

Gaps, missing genes, duplications, any rearrangement.

With gap size = 1, we get 4 occurrences.

Reducing the number of genes....

Down to earth details : Model 5

Genome A

Genome B

Genome C

A smaller set of genes: { }

... yields 5 occurrences.

Down to earth details : Model 5

24. Mount Fuji in a Summer Storm

A general framework

Given a gap g, an occurrence of S is a maximal run of genes of S, separated by gaps of at most g genes not in S, and that contains at least one of each gene of S.

A set S of genes: { }

A set of genes S is an extension of a set T, included in S, if each occurrence of T is contained in an occurrence of S.

S = { } is an extension of T= { }

[Figure: a chromosome with two occurrences of S, Occurrence #1 and Occurrence #2. Gaps inside an occurrence are ≤ g; gaps > g bound the occurrences.]
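The two definitions above can be made concrete with a minimal sketch. It assumes a chromosome is a Python list of gene labels. The function names `occurrences` and `is_extension` and the toy chromosome are illustrative and do not come from the talk, and "included in S" is read here as proper inclusion, which is an assumption.

```python
def occurrences(chromosome, genes, g):
    """Return inclusive (start, end) index pairs of the occurrences of `genes`.

    An occurrence is a maximal run of genes of the set, separated by gaps of
    at most g other genes, that contains every gene of the set at least once.
    """
    genes = set(genes)
    runs = []
    start = last = None
    for i, label in enumerate(chromosome):
        if label not in genes:
            continue
        if last is not None and i - last - 1 > g:
            # Gap too large: close the current run and open a new one.
            runs.append((start, last))
            start = i
        elif start is None:
            start = i
        last = i
    if start is not None:
        runs.append((start, last))
    # Keep only the runs that contain at least one copy of each gene.
    return [(s, e) for s, e in runs if genes <= set(chromosome[s:e + 1])]


def is_extension(chromosomes, s, t, g):
    """True if t is properly included in s and every occurrence of t
    lies inside an occurrence of s on the same chromosome."""
    s, t = set(s), set(t)
    if not t < s:  # assumption: proper inclusion
        return False
    for chrom in chromosomes:
        occ_s = occurrences(chrom, s, g)
        for a, b in occurrences(chrom, t, g):
            if not any(c <= a and b <= d for c, d in occ_s):
                return False
    return True


toy = list("abxcyyabc")  # hypothetical chromosome
print(occurrences(toy, "abc", 1))
print(is_extension([toy], "abc", "ab", 1))
```

On this toy chromosome, {a, b, c} extends {a, b} when g = 1 but not when g = 0, because the first copy of c is separated from a b by one foreign gene.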


Choices

• g = 0 or g > 0

When g = 0, the number of candidates is polynomial in the number of genes.

When g > 0, the number of candidates can be exponential in the number of genes.

A general framework

Even with g = 1, there are problems. For example, with g = 0, the sequence of genes:

a b c d e f

produces one potential cluster that contains both a and f. But with g = 1, there are 8 of them:

a b c d e f
a b c d f
a b c e f
a b d e f
a c d e f
a c e f
a b d f
a c d f

The number of these sequences grows in a Fibonacci progression!
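The count of 8 and the Fibonacci growth can be checked by brute force. The sketch below enumerates every subsequence that keeps the first and last gene and never skips more than g consecutive genes. The function name `gapped_clusters` is hypothetical, and the enumeration is exponential, so it is only meant for toy sizes.

```python
from itertools import combinations


def gapped_clusters(genes, g):
    """All subsequences of `genes` that keep the first and last gene and
    skip at most g consecutive genes between two kept ones."""
    n = len(genes)
    inner = range(1, n - 1)
    results = []
    for k in range(len(inner) + 1):
        for kept in combinations(inner, k):
            idx = (0, *kept, n - 1)
            if all(b - a - 1 <= g for a, b in zip(idx, idx[1:])):
                results.append("".join(genes[i] for i in idx))
    return results


print(len(gapped_clusters("abcdef", 0)))
print(sorted(gapped_clusters("abcdef", 1)))
# Counts for sequences of 2 to 8 genes with g = 1.
print([len(gapped_clusters("abcdefgh"[:n], 1)) for n in range(2, 9)])
```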

Choices

• g = 0 or g > 0

• Duplications or no duplications

Duplications usually mean an exponential number of candidates but, most of the time, are unavoidable.

Models without duplications are, nevertheless, useful in many situations.

A general framework

Choices

• g = 0 or g > 0

• Duplications or no duplications

• Three ways of filtering candidates

Filtering is mostly based on the properties of the extension relation.

If the number of candidates is low, filtering is not necessary, but it can be relevant.

For models with a huge number of candidates, filtering is a must.

A general framework

Choices

• g = 0 or g > 0

• Duplications or no duplications

• Three ways of filtering candidates

• Formal or heuristic

Formal models have inherent computational problems when applied to real data.

Heuristics will always be useful.


2 x 2 x 3 x 2 = 24

How convenient!

20. Mount Fuji from Inume Pass


Common intervals: Voluntary simplicity*


*Voluntary simplicity is a lifestyle considered by its adherents to be a sustainable, ecologically sensitive alternative to the typical, western consumerist lifestyle. [Ref. Wikipedia]

A (partial) list of credits:

Uno and Yagiura (2000)
Heber and Stoye (2001)
Bergeron, Heber and Stoye (2002)
Didier (2003)
Schmidt and Stoye (2004)
Figeac and Varré (2004)
Bérard, Bergeron and Chauve (2004)
Blin, Chauve and Fertin (2005)
Landau, Parida and Weizman (2005)
Tannier and Sagot (2005)
Bérard, Bergeron, Chauve and Paul (2005)
Bergeron, Chauve, de Montgolfier and Raffinot (2005)

Common intervals

• g = 0

Choices

• No duplications

• No filtering

• Formal

Genome A

Genome B

Genome C

The basic model of common intervals often yields a large number of 'uninteresting clusters'. However, filtering provides unusual information on whole genome organization.
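To give a feel for how many common intervals the basic model produces, here is a minimal sketch for the simplest setting: two genomes without duplications, one of them renumbered as the identity. It uses a naive quadratic scan, not the algorithms of the papers credited above, and the function name `common_intervals` is an illustrative choice. The data is the unsigned version of the mouse and rat chromosome X example from Bérard et al.

```python
def common_intervals(perm):
    """Common intervals (two or more genes) between `perm` and the identity.

    Positions i..j of `perm` form a common interval when the values they
    hold are consecutive integers, i.e. max - min == j - i.
    Returns inclusive (i, j) position pairs.
    """
    found = []
    n = len(perm)
    for i in range(n):
        lo = hi = perm[i]
        for j in range(i + 1, n):
            lo = min(lo, perm[j])
            hi = max(hi, perm[j])
            if hi - lo == j - i:
                found.append((i, j))
    return found


# Rat order of the 16 synteny blocks, with signs dropped.
rat = [abs(x) for x in [-4, -3, -2, 1, -13, -15, 14, -16, 8, 9, 10, -11, 12, 5, 6, 7]]
intervals = common_intervals(rat)
print(len(intervals))
print([rat[i:j + 1] for i, j in intervals if j - i >= 8])
```

Most of the intervals found this way are sub-runs of blocks such as 4 3 2 1 or 8 9 10 11 12, which is the kind of redundancy that filtering removes.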

Common intervals -> Strong Intervals

• g = 0

Choices

• No duplications

• Filtering

• Formal

Genome A

Genome B

Common intervals: s, t, u, v

Both t and u are two different extensions of the common interval s: Remove them.

Strong intervals: s, v

Strong Intervals

From: Bérard et al, WABI 2005

Mouse = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Rat = -4 -3 -2 1 -13 -15 14 -16 8 9 10 -11 12 5 6 7

This tree displays the strong intervals between the synteny blocks of the mouse and rat chromosomes X.

This kind of tree is known as a PQ-tree. Strong intervals possess a rich combinatorial structure that can be exploited from both the biological and the computational perspective.
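The nodes of such a tree can be recovered with a short sketch. It assumes the usual definition of a strong interval, namely a common interval that overlaps no other common interval, where two intervals overlap when they intersect and neither contains the other. Single genes are left out, the scan is naive, and the function names are illustrative.

```python
def common_intervals(perm):
    """Inclusive (i, j) pairs, i < j, whose values are consecutive integers."""
    found = []
    for i in range(len(perm)):
        lo = hi = perm[i]
        for j in range(i + 1, len(perm)):
            lo, hi = min(lo, perm[j]), max(hi, perm[j])
            if hi - lo == j - i:
                found.append((i, j))
    return found


def overlaps(a, b):
    """True if a and b intersect but neither contains the other."""
    (i, j), (k, l) = a, b
    return (i < k <= j < l) or (k < i <= l < j)


def strong_intervals(perm):
    """Common intervals that overlap no other common interval."""
    common = common_intervals(perm)
    return [c for c in common if not any(overlaps(c, d) for d in common)]


rat = [abs(x) for x in [-4, -3, -2, 1, -13, -15, 14, -16, 8, 9, 10, -11, 12, 5, 6, 7]]
for i, j in strong_intervals(rat):
    print(rat[i:j + 1])
```

Because strong intervals never overlap, they are either nested or disjoint, which is why they can be drawn as a tree.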

[Figure: tree of strong intervals. Root: 4 3 2 1 13 15 14 16 8 9 10 11 12 5 6 7. Inner nodes: 4 3 2 1; 13 15 14 16 8 9 10 11 12 5 6 7; 13 15 14 16; 15 14; 8 9 10 11 12; 5 6 7. The leaves are the 16 individual blocks.]

Strong Intervals : transforming a rat into a mouse

This tree provides guidelines to possible rearrangement scenarios that transform the rat chromosome into a mouse chromosome. These scenarios preserve all common intervals.

Intervals are first labeled (in red) with respect to their relative orientation.

Then all strong intervals that disagree with their parent are inverted, one per slide, in this order:

1
4 3 2 1
13
14
16
14 15
13 14 15 16
11
8 9 10 11 12
5 6 7
5 6 7 ... 14 15 16

The order of the blocks shown at the root changes as follows:

4 3 2 1 13 15 14 16 8 9 10 11 12 5 6 7
1 2 3 4 13 15 14 16 8 9 10 11 12 5 6 7
1 2 3 4 13 14 15 16 8 9 10 11 12 5 6 7
1 2 3 4 16 15 14 13 8 9 10 11 12 5 6 7
1 2 3 4 16 15 14 13 12 11 10 9 8 5 6 7
1 2 3 4 16 15 14 13 12 11 10 9 8 7 6 5
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
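The scenario shown on these slides can be replayed with a minimal sketch. It starts from the signed rat order given in the Bérard et al. example and applies the inversions named on the slides, in the same order. The helper `invert` is hypothetical, and the list of steps is transcribed from the slides rather than computed from the tree.

```python
def invert(perm, block):
    """Reverse the contiguous segment of `perm` that holds exactly the
    genes in `block`, and flip the sign of each gene in it."""
    genes = set(block)
    positions = [i for i, x in enumerate(perm) if abs(x) in genes]
    lo, hi = positions[0], positions[-1]
    if len(positions) != len(genes) or hi - lo + 1 != len(genes):
        raise ValueError("genes do not form a contiguous segment")
    return perm[:lo] + [-x for x in reversed(perm[lo:hi + 1])] + perm[hi + 1:]


rat = [-4, -3, -2, 1, -13, -15, 14, -16, 8, 9, 10, -11, 12, 5, 6, 7]

# Intervals inverted on successive slides (transcribed, not computed).
steps = [
    [1], [1, 2, 3, 4], [13], [14], [16], [14, 15], [13, 14, 15, 16],
    [11], [8, 9, 10, 11, 12], [5, 6, 7], list(range(5, 17)),
]

current = rat
for block in steps:
    current = invert(current, block)
    print(current)
```

Each inverted block is a strong interval or a single block, so no step breaks a common interval, and the last line printed is the mouse order 1 to 16.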

18. Mt Fuji from the Offing in Kanagawa

Domain Teams: The 'eXtreme' model

A (partial) list of credits:

Bergeron, Corteel and Raffinot (2002)
Luc, Risler, Bergeron and Raffinot (2003)
He and Goldwasser (2004)
Béal, Bergeron, Corteel and Raffinot (2004)
Pasek, Bergeron, Risler, Louis, Ollivier and Raffinot (2005)
Blin, Chauve and Fertin (2005)

Domain Teams

• g > 0

Choices

• Duplications

• Heavy filtering

• Formal

Genome A

Genome B

Remove them all!

[Figure: four candidate sets, each labeled "has an extension."]

Surviving teams:

Domain Teams : Example

67591 Domains
50078 Proteins
16 Chromosomes
Maximum gap: 3
16713 Domain Teams

Domain Teams : Example

From: Pasek et al, Genome Research (15:867-874), 2005

12. Mt Fuji from Lake Kawaguchi

The combinatorial beauty of nature

Does nature allow all possible rearrangements ?

Six domains can theoretically form 63 potential teams. If they are labelled as {a, b, c, d, e, f}, the possible teams with more than one member are:

{a, b}, {a, c}, {a, d}, {a, e}, {a, f}, {b, c}...
{a, b, c}, {a, b, d}, {a, b, e}, ...
...
{a, b, c, d, e, f}

For 6 domains, of the 63 possibilities, we found 35 teams that had at least two occurrences and no extension.
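The filter described here, keeping sets with at least two occurrences and no extension, can be sketched by brute force on a toy example. Everything below is an assumption made for illustration: the two six-label genomes, the function names, the restriction to sets with more than one member, and the exhaustive enumeration of subsets, which is exponential and is not the method used to obtain the results above.

```python
from itertools import combinations


def occurrences(chromosome, genes, g):
    """Inclusive (start, end) pairs of maximal runs of `genes`, with gaps of
    at most g other genes, that contain every gene of the set."""
    runs, start, last = [], None, None
    for i, label in enumerate(chromosome):
        if label not in genes:
            continue
        if last is not None and i - last - 1 > g:
            runs.append((start, last))
            start = i
        elif start is None:
            start = i
        last = i
    if start is not None:
        runs.append((start, last))
    return [(s, e) for s, e in runs if genes <= set(chromosome[s:e + 1])]


def surviving_teams(chromosomes, g, min_occurrences=2):
    """Sets of two or more labels with enough occurrences and no extension."""
    labels = sorted({x for chrom in chromosomes for x in chrom})
    subsets = [frozenset(c) for k in range(2, len(labels) + 1)
               for c in combinations(labels, k)]
    # Occurrences of every subset, tagged with the chromosome index.
    occ = {sub: [(n, s, e) for n, chrom in enumerate(chromosomes)
                 for s, e in occurrences(chrom, sub, g)]
           for sub in subsets}

    def extended_by(small, big):
        return small < big and all(
            any(n == m and s2 <= s1 and e1 <= e2 for m, s2, e2 in occ[big])
            for n, s1, e1 in occ[small])

    return [t for t in subsets
            if len(occ[t]) >= min_occurrences
            and not any(extended_by(t, big) for big in subsets)]


genome_a = list("abcxxd")  # hypothetical genomes over six labels
genome_b = list("cbayyd")
print(surviving_teams([genome_a, genome_b], g=1))
```

In this toy case the sets {a, b}, {a, c} and {b, c} each occur twice, but all three are extended by {a, b, c}, which is the only set that survives.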

The combinatorial beauty of nature

Promiscuous domains

Who are they?

PF00005 ABC transporter
PF00072 Response regulator receiver domain
PF00486 Transcriptional regulatory protein
PF00512 His Kinase A
PF00528 Binding-protein-dependent transport system inner membrane
PF00672 HAMP domain

21. Mount Fuji from the Totomi Mountains

The need for heuristics

• g > 0

Choices

• Duplications

• No filtering

• Heuristic

From: St-Onge, et al. Poster RECOMB CG 2005

Very reasonable approximations of the general model can be obtained efficiently (in a few minutes) in the case of very large scale comparisons.

The need for heuristics

An uncertainty principle

With the general model of gene clusters, it is impossible to predict simultaneously the computing time AND the properties of the output.

Credits

Cedric Chauve
Annie Chateau
Olivier Gingras
Yannick Gingras
André Levasseur
Jacqueline Rwirangira
Karine St-Onge

Marie-Pierre Béal, Informatique, Marne-la-Vallée
Sèverine Bérard, INRA, Toulouse
Mathieu Blanchette, McGill University
Sylvie Corteel, PRiSM, Versailles
Steffen Heber, Raleigh, USA
Hokusai Katsushika: 1760-1849
Nicolas Luc, Génome et informatique, Evry
Fabien de Montgolfier, LIAFA, Paris
Christophe Paul, LIRMM, Montpellier
Sophie Pasek, Génome et informatique, Evry
Jean-Loup Risler, Génome et informatique, Evry
Mathieu Raffinot, Laboratoire Poncelet, Moscow
Jens Stoye, Technische Fakultät, Bielefeld
