Post on 19-Dec-2015
The gene family play and the chromosomal theater
Todd Vision
Department of Biology
University of North Carolina at Chapel Hill
Outline
Large-scale duplication and loss of genes in the angiosperms
Looking into the future of plant phylogenomics
A case study in gene family demography
Duplication and functional divergence
Paul Franz, University of Amsterdam
Arabidopsis as a hub for plant comparative maps
genome sizes in angiosperms
[Figure: bar chart of genome sizes in megabases (y-axis 0-1000); bar labels 145, 262, 367, 367, 372, 415, 439, 473, 560, 622, 907]
data from Arumuganathan & Earle (1991) Plant Mol Biol Rep 9:208-218
Tomato-Arabidopsis synteny
Bancroft (2001) TIG 17, 89 after Ku et al (2000) PNAS 97, 9121
Duplicated genes in Arabidopsis
Modes of gene duplication
Tandem (T)
• unequal crossing-over
• mostly young
Dispersed (D)
• transposition
• all ages
Segmental (S)
• polyploidy
• all old
Paleotetraploidy?
The Arabidopsis Genome Initiative. 2000. Nature 408:796
Vision et al. (2000) Science 290:2114-7.
Microsynteny within blocks
distribution of dA
Problems
• proteins diverge at different rates
• high dA is difficult to estimate
Solution
• average dA within blocks
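The block-averaging step can be sketched as follows. This is a minimal illustration, assuming each duplicated gene pair carries a block identifier and a dA estimate; the function name `mean_da_per_block`, the input format, and the values are hypothetical and not taken from the original analysis.

```python
from collections import defaultdict
from statistics import mean

def mean_da_per_block(pairs):
    """Average dA over the gene pairs in each block.

    `pairs` is an iterable of (block_id, dA) tuples (hypothetical format).
    """
    by_block = defaultdict(list)
    for block_id, da in pairs:
        by_block[block_id].append(da)
    return {block_id: mean(values) for block_id, values in by_block.items()}

# Made-up dA values for two blocks
example_pairs = [
    ("block_1", 0.20), ("block_1", 0.40),
    ("block_2", 0.50), ("block_2", 0.70), ("block_2", 0.90),
]
block_means = mean_da_per_block(example_pairs)
```

Averaging over many pairs in a block damps the rate variation among individual proteins, which is the problem the slide names.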
[Figure: frequency (f, y-axis 0.00-0.10) of amino acid substitution (x-axis 0.0-1.0) for pairs in blocks and not in blocks]
discrete duplication events
[Figure: histogram of the frequency of blocks (y-axis 0-12) by amino acid substitution (x-axis 0.0-0.9), with labels A-F and a 0-200 Mya timescale]
[Figure: phylogeny of monocots (rice), Asterids (tomato), and Rosids (Arabidopsis); divergence times 110-160 Mya and 160-240 Mya]
the 2-4 complex (one ancestral segment broken up by 4 large inversions)
[Figure: dot plot of chromosome 2 (5.6 Mb; x-axis 1200-2800) against chromosome 4 (4.6 Mb; y-axis 2600-4200); labels 45, 52, 49, 54, 56]
[Figure: frequency histogram of Ka (x-axis 0-1)]
[Figure: frequency histogram of Ks (x-axis 0-5)]
coefficient of variation = 0.67
coefficient of variation = 0.53
Mayer et al. (2001) Genome Res. 11, 1167
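The coefficient of variation quoted above is the standard deviation divided by the mean. A minimal sketch with made-up values, assuming the sample standard deviation (the slides do not say which estimator was used):

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)

# Hypothetical Ks estimates for the gene pairs in one block
ks_values = [0.6, 0.8, 1.0, 1.2, 1.4]
cv = coefficient_of_variation(ks_values)
```

Because it is scale-free, the statistic allows the spread of Ka and Ks to be compared even though the two measures have different means.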
Rice-Arabidopsis microsynteny
Blanc, Hokamp, Wolfe (2003) Genome Res. 13, 137-144.
[Figure: gene trees of Arabidopsis and rice homologs showing the timing of duplication]
Block 37: duplication after Asterid-Rosid split
Block 57: duplication before monocot-dicot divergence
Raes, Vandepoele, Saeys, Simillion, Van de Peer (2003) J. Struct. Func. Genomics 3, 117-129
Divergence among duplicated genes in rice
Goff et al. (2002) Science 296: 92
Hidden syntenies
Simillion, Vandepoele, Van Montagu, Zabeau, Van de Peer (2002) PNAS 99, 13627
Interspecies comparison can reveal hidden syntenies
Vandepoele, Simillion, Van de Peer (2002) TIG 18, 606-608
Comparative mapping in a phylogenetic context
Major plant genome datasets
Family         Genus                            datasets available (genome, EST, map)
Aizoaceae      Mesembryanthemum crystallinum    X
Brassicaceae   Arabidopsis thaliana             X X X
               Brassica spp.                    X
Fabaceae       Glycine max                      X X
               Medicago truncatula              X X
               Phaseolus spp.                   X
Malvaceae      Gossypium arboreum               X X
Solanaceae     Capsicum annuum                  X
               Lycopersicon esculentum          X X
               Solanum tuberosum                X X
Poaceae        Hordeum vulgare                  X X
               Oryza sativa                     X X X
               Sorghum bicolor/propinquum       X X
               Triticum aestivum                X X
               Zea mays                         X X
Other          Beta vulgaris                    X
               Chlamydomonas reinhardtii        X X
               Pinus taeda                      X X
               Populus spp.                     X
               Prunus spp.                      X
Plant unigene datasets
species          TIGR      PlantGDB
barley           49885     74621
beet             na        13565
chlamydomonas    30296     na
citrus           na        4266
coffee           na        392
cotton           24350     27854
grape            49885     74621
iceplant         8455      8945
lettuce          21960     na
lotus            11025     na
maize            55063     71655
marchantia       na        1059
medicago         36976     43384
oat              na        361
onion            11726     na
pine             26882     24668
poplar           na        20935
potato           24275     24839
rice             60778     52156
rye              5199      5384
sorghum          33273     34363
soybean          67826     73946
sunflower        20520     na
tomato           31012     35725
wheat            109509    95949
+ Arabidopsis    27170
Wikström et al (2001) Proc R Soc Lond B 268, 2211
Plant phylogenomics: Phytome
The goal is to integrate
• Organismal phylogeny
• Gene family sequence, alignment, and phylogeny
• Genetic and physical maps
Some uses for Phytome
Starting with a chromosome segment
• Identify homologous segments
• Predict unobserved gene content (candidate QTL)
Starting with a gene family
• Resolve orthology/paralogy relationships
• Identify coevolving families
Starting with a species
• Explore lineage-specific diversification
• Guide comparative mapping wet-work
Homolog identification
Multiple sequence alignment
Protein sequence prediction
Protein family clustering
Phylogenetic inference
Unigene collections
Annotations
Phytome
Current pipeline
Lineage specific diversification
[Figure: gene family counts compared among Arabidopsis, Cotton, Medicago, Tomato, and Rice; extracted labels: 1033, 436173, 334, 696836, 715, 919]
152 genes are “single copy” in all four species
A tale of two sisters: the ARF and the Aux/IAA gene families
Modulate whole plant response to auxin
Interact via dimerization
• ARFs are transcription factors
• Aux/IAAs bind and repress ARFs in the absence of auxin
The chromosomal context
Diversification of ARFs
Diversification of the Aux/IAAs
Why the different patterns of diversification?
12% (ARF) vs 40% (Aux/IAA) segmental duplications
Presumably reflects differential retention
Possible explanations
• Dosage requirements
• Coevolution with other interacting genes
• Regional transcriptional regulation
Divergence of duplicated genes
[Figure: divergence in expression profile (y-axis) against age of duplication (x-axis)]
Duplicate pairs in yeast and human (Gu et al. 2002, Makova and Li 2003)
Approx. 50% of pairs diverge very rapidly
Proportion of divergent pairs increases with Ks and Ka
• Plateaus at Ka ~0.3 in human
In humans,
• Immune response genes over-represented among young, divergent pairs
• Distantly related pairs with conserved expression tend to be either ubiquitous or very tissue specific
Retention of duplicated genes
Nonfunctionalization, or loss of one copy
• The fate of most pairs
Neofunctionalization (NF)
• Positive selection on a new mutation can maintain the pair
Subfunctionalization (SF)
• Mutations that increase the specificity of duplicates can fix due to drift, provided that, combined, the two copies provide the functionality of the ancestral gene. Once SF happens, both copies are indispensable and are retained.
• One prediction of the model is that SF is more likely for tandem than for dispersed pairs (due to linkage)
Digital expression profiling
Massively Parallel Signature Sequencing (MPSS)
• Count occurrence of 17-20 bp mRNA signatures
• Cloning and sequencing is done on microbeads
• Similar to Serial Analysis of Gene Expression (SAGE)
“Bar-code” counting reduces concerns of
• cross-hybridization
• probe affinity
• background hybridization
Advantages
• Accurate counts of low expression genes
• Can distinguish expression profiles of duplicate genes
MPSS library construction
[Figure: schematic of library construction from poly(A) mRNA. Labeled steps:]
• extract mRNA from tissue
• Convert to cDNA
• Cut w/ Sau3A (GATC)
• Add linker
• 5’ - Add standard primer (added by cloning)
• 3’ - Add unique 32 bp tag and standard primer
• PCR
• Remove 3’ primer and expose single stranded unique tag (digest, 3’→5’ exonuclease)
• Anneal to beads coated with unique anti-tag (32 bp, complementary to tag on mRNA)
Brenner et al., PNAS 97:1665-70.
MPSS library construction
The result of the library construction is a set of microbeads. Each bead contains many DNA molecules, all derived from the 3’ end of a single transcript.
Beads are loaded in a monolayer on a microscope slide for the sequencing of 17 – 20 bp from the 5’ end.
Brenner et al., PNAS 97:1665-70.
Sort by FACS to remove ‘empty’ beads
MPSS Sequencing
[Figure: schematic of the sequencing cycle, with encoded adaptors (NNNX, NNXN, NXNN, XNNN) and decoder probes; cut-site labels 9 bp and 13 bp. Labeled steps:]
• Add adaptors
• Sequence by hybridization: 16 cycles for 4 bp
• Digest with Type IIS enzyme to uncover next 4 bases
• Repeat cycle: steps of four bases; overhang is shifted by four bases in each round
Brenner et al., Nat. Biotech. 18:630-4.
MPSS Sequencing
Each bead provides a signature of 17-20 bp
Tag #     Signature Sequence    # of Beads (Frequency)
1         GATCAATCGGACTTGTC     2
2         GATCGTGCATCAGCAGT     53
3         GATCCGATACAGCTTTG     212
4         GATCTATGGGTATAGTC     349
5         GATCCATCGTTTGGTGC     417
6         GATCCCAGCAAGATAAC     561
7         GATCCTCCGTCTTCACA     672
8         GATCACTTCTCTCATTA     702
9         GATCTACCAGAACTCGG     814
..        ..                    ..
30,285    GATCGGACCGATCGACT     2,935
Two sets of signatures are generated from each sample in different reading frames staggered by two bases
Total # of tags: >1,000,000
Classifying signatures
[Figure: gene model (ATG to TGA) with signature positions marked. Labels: typical signatures; potential alternative splicing or nested gene; potential alternative termination; potential un-annotated ORF; potential anti-sense transcript; anti-sense transcript or nested gene?; duplicated: expression may be from other site in genome]
Triangles refer to colors used on our web page:
Class 1 - in an exon, same strand as ORF.
Class 2 - within 500 bp after stop codon, same strand as ORF.
Class 3 - anti-sense of ORF (like Class 1, but on opposite strand).
Class 4 - in genome but NOT class 1, 2, 3, 5 or 6.
Class 5 - entirely within intron, same strand.
Class 6 - entirely within intron, anti-sense.
Grey = potential signature NOT expressed
Class 0 - signatures found in the expression libraries but not the genome.
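The class definitions can be expressed as a small decision function. The sketch below is a simplification under stated assumptions: coordinates run 5' to 3' along the ORF strand, a signature must lie entirely inside a feature to count, and only one gene is considered. The `GeneModel` and `Hit` types are hypothetical and are not part of the MPSS database.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class GeneModel:
    exons: List[Tuple[int, int]]  # sorted half-open (start, end) intervals
    stop: int                     # first position after the stop codon

@dataclass
class Hit:
    start: int
    end: int
    same_strand: bool             # True if on the same strand as the ORF

def classify_signature(hit: Optional[Hit], gene: GeneModel) -> int:
    """Assign a signature class (0-6) following the slide's definitions."""
    if hit is None:               # not found in the genome
        return 0

    def inside(lo, hi):
        return lo <= hit.start and hit.end <= hi

    in_exon = any(inside(s, e) for s, e in gene.exons)
    in_intron = any(inside(e1, s2)
                    for (_, e1), (s2, _) in zip(gene.exons, gene.exons[1:]))
    after_stop = inside(gene.stop, gene.stop + 500)

    if in_exon:
        return 1 if hit.same_strand else 3
    if after_stop and hit.same_strand:
        return 2
    if in_intron:
        return 5 if hit.same_strand else 6
    return 4                      # in the genome, but none of the above

# Hypothetical two-exon gene
gene = GeneModel(exons=[(0, 100), (200, 300)], stop=300)
exon_sense_class = classify_signature(Hit(10, 27, True), gene)
```

Signatures that span an exon boundary or that match several genes need extra rules that this sketch leaves out.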
Core Arabidopsis MPSS libraries
sequenced by Lynx for Blake Meyers, U. of Delaware
Library    Signatures sequenced    Distinct signatures
Root       3,645,414               48,102
Shoot      2,885,229               53,396
Flower     1,791,460               37,754
Callus     1,963,474               40,903
Silique    2,018,785               38,503
TOTAL      12,304,362              133,377
http://www.dbi.udel.edu/mpss
Query by
• Sequence
• Arabidopsis gene identifier
• chromosomal position
• BAC clone ID
• MPSS signature
• Library comparison
Site includes
• Library and tissue information
• FAQs and help pages
Genome-wide MPSS profile in Arabidopsis
Of the 29,084 gene models, 17,849 match unambiguous, expressed class 1 and/or 2 signatures
[Figure: MPSS profile along chromosomes I-V]
Dataset of duplicate pairs
Gene families of size two in Arabidopsis classified as
• Dispersed (280)
• Segmental (149)
• Tandem (63)
For each pair
• Measure similarity/distance in expression profile
• Estimate Ks and Ka
Expression distance
[Figure: axes labeled library 1, library 2, library 3]
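The slides do not give the distance formula, so the following is only one plausible reading of a normalized expression distance: scale each profile to unit length, then take the Euclidean distance between the scaled profiles. The function names are hypothetical; the example counts are taken from the gene-pair tables later in the talk.

```python
import math

def unit_profile(counts):
    """Scale an expression profile to unit length (assumed normalization)."""
    norm = math.sqrt(sum(c * c for c in counts))
    if norm == 0:
        raise ValueError("profile has no expression in any library")
    return [c / norm for c in counts]

def expression_distance(profile_a, profile_b):
    """Euclidean distance between unit-length profiles.

    Zero when the two profiles have the same shape, whatever their overall level.
    """
    return math.dist(unit_profile(profile_a), unit_profile(profile_b))

# Counts in callus, flower, leaf, root, silique
similar = expression_distance([0, 344, 0, 0, 0], [0, 288, 0, 0, 0])        # AT5G07230 vs AT5G62080
dissimilar = expression_distance([88, 56, 44, 52, 107], [0, 0, 0, 0, 1])   # AT1G71270 vs AT1G71300
```

Under this choice a pair that differs only in overall expression level has distance zero, so level differences would have to be examined separately.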
The number of genes with >5 ppm expression in a given number of libraries among the 984 genes in pairs analyzed and among all Arabidopsis genes with MPSS profiles.
Libraries    Genes in pairs    All genes
0            153 (15.5%)       4160 (23.3%)
1            124 (12.6%)       2643 (14.8%)
2            73 (7.4%)         1727 (9.6%)
3            93 (9.5%)         1777 (10.0%)
4            109 (11.1%)       1930 (10.8%)
5            432 (43.9%)       5612 (31.4%)
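A tabulation of this kind can be sketched as follows, assuming raw signature counts are first scaled to parts per million with the library totals listed for the core libraries. The per-gene counts and the function names are hypothetical.

```python
from collections import Counter

# Signatures sequenced per library (from the core library table)
LIBRARY_TOTALS = {
    "root": 3_645_414, "shoot": 2_885_229, "flower": 1_791_460,
    "callus": 1_963_474, "silique": 2_018_785,
}

def to_ppm(raw_counts):
    """Scale raw signature counts to parts per million of each library."""
    return {lib: 1e6 * n / LIBRARY_TOTALS[lib] for lib, n in raw_counts.items()}

def libraries_expressed(ppm, threshold=5.0):
    """Number of libraries in which expression exceeds the threshold."""
    return sum(1 for value in ppm.values() if value > threshold)

# Made-up raw counts for two genes
genes = {
    "gene_a": {"root": 40, "shoot": 10, "flower": 20, "callus": 0, "silique": 100},
    "gene_b": {"root": 0, "shoot": 0, "flower": 0, "callus": 0, "silique": 0},
}
breadth = {name: libraries_expressed(to_ppm(raw)) for name, raw in genes.items()}
breadth_table = Counter(breadth.values())  # libraries expressed -> number of genes
```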
Asymmetry in levels of expression among libraries within pairs
Symmetry of divergence
Type of Pair              A        B        C        D
Young
  Dispersed (Ks≤0.5)      14       61       8        6
                          15.7%    68.5%    9.0%     6.7%
  Tandem (Ks≤0.5)         8        29       10       9
                          14.3%    51.8%    17.9%    16.1%
Old
  Dispersed (Ks>0.5)      35       111      24       21
                          18.3%    58.1%    12.6%    11.0%
  Segmental (All)         31       104      7        7
                          20.8%    69.8%    4.7%     4.7%
A: Each copy has higher expression in at least one library
B: One copy has higher expression in all libraries that differ and at least two libraries differ
C: Copies differ in expression in only one library
D: Copies do not differ in expression in any libraries
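The four categories can be written as a short classifier. The slides do not state how a difference between copies was called in a library, so the rule below (at least a two-fold ratio and at least 10 counts apart) is a hypothetical stand-in, as are the function names. The example profiles are gene pairs listed in the tables later in the talk.

```python
def call_difference(x, y, min_fold=2.0, min_diff=10):
    """Return +1 if copy 1 is higher, -1 if copy 2 is higher, 0 if no call.

    The thresholds are assumptions made for illustration only.
    """
    high, low = max(x, y), min(x, y)
    if high - low < min_diff or high < min_fold * low:
        return 0
    return 1 if x > y else -1

def symmetry_class(profile_1, profile_2):
    """Classify a duplicate pair as A, B, C, or D from per-library calls."""
    calls = [call_difference(x, y) for x, y in zip(profile_1, profile_2)]
    differing = [c for c in calls if c != 0]
    if not differing:
        return "D"   # no library differs
    if len(differing) == 1:
        return "C"   # exactly one library differs
    if 1 in differing and -1 in differing:
        return "A"   # each copy is higher somewhere
    return "B"       # the same copy is higher in every differing library

# Counts in callus, flower, leaf, root, silique
class_a = symmetry_class([16, 169, 346, 0, 524], [349, 13, 41, 4, 135])  # AT3G26070 / AT3G26080
class_b = symmetry_class([88, 56, 44, 52, 107], [0, 0, 0, 0, 1])         # AT1G71270 / AT1G71300
```

Category A is the pattern expected when each copy has kept a different part of the ancestral expression domain, while category B is what a general reduction in the expression of one copy would produce.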
[Figure: normalized distance (y-axis 0-0.6) against nonsynonymous substitution (x-axis), for pair types D, S, T]
[Figure: normalized distance (y-axis 0-0.7) against synonymous substitution (x-axis), for pair types D, S, T]
dN = 0.48 + 0.37 KA, p<0.0001
[Figure: breadth of expression (y-axis 0-5) against nonsynonymous substitution (x-axis), for pair types D, S, T]
[Figure: total expression (y-axis 0-4000) against nonsynonymous substitution (x-axis), for pair types D, S, T]
Pairs with small Ks but dissimilar expression profiles.
Ks      Ka      dup   gene pair    callus   flower   leaf   root   silique
0.03    <0.01   D     AT1G80700    71       59       11     140    94
                      AT1G80980    0        0        1      8      17
0.17    0.05    T     AT2G46280    246      210      160    308    80
                      AT2G46290    28       29       1      29     16
0.20    0.06    T     AT2G15400    4        14       5      5      34
                      AT2G15430    42       128      14     136    18
0.22    0.05    D     AT1G36280    1        3        9      13     10
                      AT4G18440    40       87       69     69     51
0.26    0.05    T     AT1G71270    88       56       44     52     107
                      AT1G71300    0        0        0      0      1
0.27    0.07    T     AT3G13290    20       22       1      1      6
                      AT3G13300    246      245      72     192    77
0.27    0.10    T     AT1G29390    18       238      89     8      165
                      AT1G29395    0        63       5      0      36
0.27    0.06    T     AT3G26070    16       169      346    0      524
                      AT3G26080    349      13       41     4      135
0.28    0.13    D     AT3G56190    216      115      144    239    56
                      AT3G56450    15       0        6      4      1
Pairs with large Ks but similar expression profiles.
Ks      Ka      dup   gene pair    callus   flower   leaf   root   silique
0.87    0.28    T     AT3G16220    16       10       57     3      19
                      AT3G16230    21       12       35     13     13
0.89    0.13    D     AT3G03660    14       0        0      0      0
                      AT5G17810    71       0        0      0      0
0.95    0.29    D     AT2G41180    57       14       78     4      29
                      AT3G56710    75       15       39     3      14
0.97    0.28    D     AT1G31814    2        39       4      3      0
                      AT5G16320    0        55       10     19     8
0.98    0.23    D     AT5G07230    0        344      0      0      0
                      AT5G62080    0        288      0      0      0
0.99    0.26    D     AT3G22160    86       6        10     4      4
                      AT4G15120    34       2        0      0      0
A closing thought
1965
• The Ecological Theater and the Evolutionary Play, G. E. Hutchinson
2004
• The Chromosomal Theater and the Gene Family Play
Phylogenetics has a great deal to contribute to understanding the evolutionary interplay of genome structure and function
Dan Brown
Brandon Gaut
Steven Tanksley
Liqing Zhang
Jason Phillips
Dihui Lu
David Remington
Jason Reed
Tom Guilfoyle
Blake Meyers
NSF