Casey M. Bergman
Faculty of Life Sciences, University of Manchester
Inferring functional constraints on Drosophila noncoding DNA from patterns of sequence evolution.
Outline of Talk
• Noncoding DNA and Drosophila as a system
• Conserved noncoding sequences are selectively constrained.
• A framework for predicting enhancers and transcription factor motifs.
• Spatial constraints on conserved noncoding sequences
Higher organisms have a higher proportion of noncoding DNA
Bacteria: 15%
Yeast: 30%
Worm: 70%
Fly: 75%
The function of most noncoding DNA is unknown & unannotated
Bioinformatic & functional analysis of noncoding DNA ⇒
Genome organization
Transcriptional regulation
[Figure: map of the eve region with genes Mef2, CG15863, CG12130, CG1418, CG12133, Adam, CG12134, eve, TER94, Pka-R2 and CG12128; other labels: BS 1360, (A)n; legend entries: Exon, Enhancers, Transposable elements]
Goal: comprehensive functional annotation of noncoding sequences in Drosophila
Why is annotation of cis-regulatory sequences important?
• Better understand development
• Better understand mechanisms of transcription
• Provide material for forward genetics
• Provide material for evolutionary biology
• Generate data for systems biology
Why Drosophila as a model system?
~120 Mb of euchromatin
~15,000 genes
75% noncoding
Compact, deletion bias
“Pseudogenes” decay rapidly by deletion in Drosophila
Petrov and Hartl (1998) Mol. Biol. Evol. 15:293-302
Genes with complex expression have long intergenic regions in compact genomes
Nelson, Hersh & Carroll (2004) Genome Biology 5:R25
Long introns & intergenic regions have slower rates of sequence evolution in Drosophila
Halligan & Keightley (2006) Genome Research 16:875-884
A wealth of comparative genomic data exists for the genus Drosophila
http://species.flybase.net
http://rana.lbl.gov/drosophila
image from Pavel Tomancak (MPI-Dresden)
Thousands of candidate expression patterns: BDGP embryonic in situ database
http://www.fruitfly.org/cgi-bin/ex/insitu.pl
[Figure: genome browser view of the eve region (chromosome band 46C10, base positions 5034500 to 5041000). Tracks: Protein-Coding Genes from FlyBase; Non-Coding Genes from FlyBase; FlyReg: Drosophila DNase I Footprint Database (footprints for hb, kni, Kr, bcd, gt, ttk, prd, eve and unspecified factors); D.mel./D.yakuba/D.pseudoob./A.gambiae Multiz Alignments & phastCons Scores]
Systematic annotation of cis-regulatory data in Drosophila: FlyReg & REDfly databases
Bergman et al. (2005) Bioinformatics 21:1747-1749
Gallo et al. (2006) Bioinformatics 22:381-383
cis-regulatory annotation & systems biology
Ashburner & Bergman (2005) Genome Research 15:1661-1667
[Figure: diagram with gene labels shn, Abd-A, fkh, ko, Dll, dpp, mus209, tsh, bcd, salm, Antp, dl, Ubx, zen, kni, ftz, eve, hb, tll, Kr, Trl, grh, cad, h, en, gt, ttk]
ORegAnno: Open Regulatory Annotation
Montgomery et al. (2006) Bioinformatics 22:637-640
Outline of Talk
• Noncoding DNA and Drosophila as a system
• Conserved noncoding sequences are selectively constrained.
• A framework for predicting enhancers and transcription factor motifs.
• Spatial constraints on conserved noncoding sequences
Pattern of noncoding sequence evolution in Drosophila: the eve stripe 2 enhancer
[Figure: multi-species alignment (mel, sim, yak, ere, tak, ana, pse) with a conserved block and a 500 bp spacer marked]
Are conserved blocks functionally constrained or simply mutational cold spots?
Bergman & Kreitman (2001) Genome Research 11:1335-1345
Clark (2001) Genome Research 11:1319-1320
median: 19 bp
Using population genetics to test the mutational cold-spot hypothesis
If blocks are functionally constrained, we predict the following:
1. Excess of within-species mutations (polymorphism) in blocks relative to fixed differences
(“MK” test - blocks vs. spacers, polymorphism & divergence)
2. Excess of rare derived mutations in blocks relative to spacers
(Non-parametric test - blocks vs. spacers, frequency spectrum)
[Figure: schematic of divergence (div.) and polymorphism (π) along a block-spacer-block region]
[Figure: schematic histogram of the fraction of SNPs by derived allele frequency, in bins from 0-0.1 to 0.9-1.0, for blocks vs. spacers]
Harvesting data from GenBank using PDA: a pipeline to study polymorphism
Casillas & Barbadilla (2004) Nucl. Acids Res. 32:W166-W169
[Figure: PDA pipeline flowchart. Input: sequences from GenBank corresponding to the genus Drosophila. Numbered steps as labelled: get sequences & annotations (1a, stored in a MySQL database; 1b); group by species & gene; organize sequences into categories (minimum of 2 sequences per category); align with Muscle using the MSA parameters; alignment validation (alignments with scores, low-quality sequences excluded); sequence and alignment subgroups; read gene annotations and extract gene regions (gene, CDS, exon, intron, 5' UTR, 3' UTR, promoter; sequences, positions and orientations); Diversity Analysis Module (polymorphism, synonymous & non-synonymous polymorphisms, linkage disequilibrium, codon bias); web-based output, with alignments viewable in Jalview]
Summary of the polymorphism data sets
              Glinka (2003) + Ometto (2005)   Glinka (2003) + Ometto (2005)   Orengo (2004)
              African                         European                        European
Intronic      167                             173                             28
Intergenic    90                              93                              80
Total loci    257                             266                             108
# Alleles     11.7                            11.8                            12.7
bp block      30,683                          33,292                          28,721
bp spacer     79,317                          87,379                          47,590
Conserved blocks - UCSC phastCons track
[Figure: UCSC genome browser view of the eve region (chr2R, base positions 5489500 to 5491500). Tracks: FlyBase Protein-Coding Genes; FlyReg: Drosophila DNase I Footprint Database (footprints for Kr, bcd, gt, hb, ttk, prd, eve and unspecified factors); 7 Flies, Mosquito and Honeybee Multiz Alignments & phastCons Scores (d_simulans, d_yakuba, d_ananassae, d_pseudoobscura, d_virilis, d_mojavensis, a_gambiae, a_mellifera); PhastCons Conserved Elements, each labelled with a lod score (lod=11 to lod=465)]
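Single nucleotide polymorphisms & fixed differences are reduced in conserved blocks
[Figure: bar chart of the observed number of polymorphisms and fixed differences (y-axis 0 to 5,000) in blocks vs. spacers. Polymorphism: 437 in blocks, 3345 in spacers (labelled 66%). Divergence: 386 in blocks, 4901 in spacers (labelled 80%)]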
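The 66% and 80% labels appear consistent with a per-site reduction in blocks relative to spacers. The sketch below checks that arithmetic under one assumption that the slide does not state: that the counts come from the African data set, whose block and spacer lengths (30,683 bp and 79,317 bp) appear in the summary table. The function and variable names are illustrative only.

```python
# Counts are from the slide. The base-pair totals are ASSUMED to be those of
# the African data set in the summary table; the slide does not say so.
BLOCK_BP, SPACER_BP = 30_683, 79_317

counts = {
    "polymorphism": {"block": 437, "spacer": 3345},
    "divergence": {"block": 386, "spacer": 4901},
}


def percent_reduction(block_count, spacer_count,
                      block_bp=BLOCK_BP, spacer_bp=SPACER_BP):
    """Percent by which the per-site density in blocks falls below spacers."""
    block_density = block_count / block_bp
    spacer_density = spacer_count / spacer_bp
    return 100.0 * (1.0 - block_density / spacer_density)


reductions = {
    name: percent_reduction(c["block"], c["spacer"])
    for name, c in counts.items()
}

for name, value in reductions.items():
    print(f"{name}: {value:.1f}% lower per site in blocks than in spacers")
```

Under that assumption the two values round to the percentages shown on the slide; with a different data set the denominators, and so the percentages, would change.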
Blocks have an excess of point mutations within species relative to divergence between species
[Figure: bar chart of the polymorphism : divergence ratio (y-axis 0 to 1.5) in blocks vs. spacers]
        Block   Spacer
Poly    437     3345
Div.    386     4901
Chi-square: p < 5 x 10^-12
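The 2x2 comparison of polymorphism and divergence in blocks vs. spacers can be reproduced with a short calculation. This is a minimal sketch that assumes a Pearson chi-square without continuity correction; the slide does not say which variant was used, and the function names are illustrative.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def p_value_1df(chi2):
    """Upper-tail probability of a chi-square variate with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Counts from the slide.
poly_block, poly_spacer = 437, 3345
div_block, div_spacer = 386, 4901

chi2 = chi_square_2x2(poly_block, poly_spacer, div_block, div_spacer)
p = p_value_1df(chi2)

# Polymorphism : divergence ratio in each class (the quantity on the bar chart).
ratio_block = poly_block / div_block
ratio_spacer = poly_spacer / div_spacer

print(f"ratio in blocks  = {ratio_block:.2f}")
print(f"ratio in spacers = {ratio_spacer:.2f}")
print(f"chi-square = {chi2:.2f}, p = {p:.2e}")
```

The ratio is higher in blocks than in spacers, which is the direction predicted if blocks are under selective constraint rather than simply mutating more slowly.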
Blocks have an excess of rare SNPs relative to non-conserved spacers
[Figure: frequency distribution of derived allele frequency (DAF), in bins from 0.1 to 1.0, for SNPs in blocks vs. spacers]
KS test: p < 6 x 10^-11
Chi-square test: p < 1 x 10^-11
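To illustrate the kind of comparison a KS test makes between two frequency spectra, the sketch below computes the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical cumulative distributions. The derived allele frequencies are hypothetical toy values, not the data behind the slide, and the function names are illustrative.

```python
from bisect import bisect_right


def ecdf(sorted_values, x):
    """Empirical CDF: fraction of observations less than or equal to x."""
    return bisect_right(sorted_values, x) / len(sorted_values)


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: maximum absolute difference between ECDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)


# HYPOTHETICAL derived allele frequencies, chosen only for illustration.
block_daf = [0.05, 0.05, 0.1, 0.1, 0.2, 0.3]    # skewed toward rare alleles
spacer_daf = [0.1, 0.3, 0.5, 0.6, 0.8, 0.9]     # more intermediate and common

d_stat = ks_statistic(block_daf, spacer_daf)
print(f"KS statistic D = {d_stat:.3f}")
```

A large D with blocks shifted toward low frequencies is the pattern expected when weakly deleterious mutations in blocks are kept rare by selection.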
Outline of Talk
• Noncoding DNA and Drosophila as a system
• Conserved noncoding sequences are selectively constrained.
• A framework for predicting enhancers and transcription factor motifs.
• Spatial constraints on conserved noncoding sequences
Conserved noncoding sequences are clustered in complex cis-regulatory regions
[Figure: pairwise comparison of the dpp 3’ cis-regulatory region between D. melanogaster (positions 37350 to 43350) and D. virilis (positions 1020 to 8020); scale bar 1 Kb; annotations from Muller & Basler (2000) and Hepker et al. (1999)]
Conserved noncoding sequences are clustered in Drosophila
chi2 = 2040, d.f. = 30, p < 10^-6
Bergman et al. (2002) Genome Biology 3:0086.
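One generic way to ask whether elements are clustered is to count them in fixed windows and test whether the counts are more variable than expected under random placement. The sketch below uses a dispersion chi-square on hypothetical positions. It is an illustration of the idea only and is not claimed to be the analysis that produced the chi-square value on the slide; the window size, coordinates and function names are all assumptions.

```python
def window_counts(positions, region_length, window):
    """Count elements falling in each non-overlapping window of the region."""
    n_windows = region_length // window
    counts = [0] * n_windows
    for pos in positions:
        idx = pos // window
        if idx < n_windows:
            counts[idx] += 1
    return counts


def dispersion_chi_square(counts):
    """Index-of-dispersion test: sum((c - mean)^2) / mean, with n - 1 d.f.

    Under random (Poisson) placement the statistic is close to its degrees
    of freedom; clustering inflates it.
    """
    mean = sum(counts) / len(counts)
    chi2 = sum((c - mean) ** 2 for c in counts) / mean
    return chi2, len(counts) - 1


# HYPOTHETICAL start coordinates of conserved noncoding sequences.
cns_starts = [100, 150, 220, 300, 380, 5200, 5300, 5450]

counts = window_counts(cns_starts, region_length=8000, window=1000)
chi2, dof = dispersion_chi_square(counts)
print(f"counts per window: {counts}")
print(f"chi-square = {chi2:.1f} on {dof} d.f.")
```

In this toy example the elements fall into two windows, so the statistic is well above its degrees of freedom, which is the signature of clustering.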
A molecular interpretation of conservation in complex cis-regulatory regions
[Figure: schematic of Enhancer 1 and Enhancer 2. Legend: conserved noncoding sequences; spacer intervals; transcription factors]
A cluster of conserved noncoding sequences in the apterous region predicts a brain specific enhancer
Bergman et al. (2002) Genome Biology 3:0086.
[Figure: conservation plots (75% threshold) across positions 10k to 50k of the apterous region, in four panels, for the pairwise comparisons mel-ere, mel-pse, mel-vir and mel-ano. Labelled features: CNS cluster 1, HB, CNS cluster 2, ap-RA, CNS cluster 3, brain enhancer, muscle enhancer, vlc-RA. Legend: coding exon; conserved noncoding sequence; conserved regulatory sequence]
Clusters of conserved noncoding sequences contain over-represented binding site motifs
MEME
--------------------------------------------------------------------------------
Sequence name   Strand   Start   P-value    Site
-------------   ------   -----   --------   ----------
38              +        9       1.03e-05   CATTCATA TTTTTATGAG GCTGTTCCTT
4               +        15      1.03e-05   TTTGTTGCTC TTTTTATGAG TTTTTTCCAT
3               +        15      1.03e-05   TTTGTTGCTC TTTTTATGAG TTTTTTCCAT
14              +        10      1.41e-05   GGACGCGCC TTTTTATTGG TGCACCTTCG
13              +        10      1.41e-05   GGACGCGCC TTTTTATTGG TGCACCTTCG
...
--------------------------------------------------------------------------------
hunchback recognition motif, Stanojevic et al. (1989): TTTTTRNG
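Comparing discovered sites against a published consensus such as TTTTTRNG comes down to expanding the IUPAC codes and scanning each site. The sketch below does this with a regular expression for two of the site sequences listed in the MEME output; the helper names are illustrative, only the forward strand is scanned, and an exact-consensus match is a much stricter criterion than the position-specific scoring MEME itself uses.

```python
import re

# IUPAC nucleotide codes needed here (R = purine, Y = pyrimidine, N = any base).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Compile an IUPAC motif; the lookahead lets overlapping matches be found."""
    pattern = "".join(IUPAC[base] for base in motif)
    return re.compile("(?=(" + pattern + "))")


def scan(sequence, motif):
    """Return 0-based start positions of exact consensus matches (forward strand)."""
    return [m.start() for m in motif_to_regex(motif).finditer(sequence)]


HB_MOTIF = "TTTTTRNG"                    # Stanojevic et al. (1989), as on the slide
sites = ["TTTTTATGAG", "TTTTTATTGG"]     # site sequences from the MEME output

hits = {site: scan(site, HB_MOTIF) for site in sites}
for site, positions in hits.items():
    print(site, positions)
```

The first site matches the consensus at its first base, while the second differs from it only at the final position, which shows why a weight-matrix comparison is usually preferred to exact consensus matching.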
Enhancer prediction by clustering: conserved noncoding sequences vs. binding site prediction
Berman et al. (2002) Proc. Nat. Acad. Sci. 99:757
[Figure: two schematics, each labelled Enhancer and Motif]
Matching inferred motifs to functions & factors
Down, Bergman, Su & Hubbard (2007) PLoS Comp Biol 3:e7.
[Figure: inferred motifs labelled pannier and serpent]
Summary
• Drosophila is an excellent model system for understanding the function of noncoding DNA
• Conserved noncoding sequences are under selective constraints & are not mutational cold spots
• Conserved noncoding sequences are clustered into higher order units in complex cis-regulatory regions
• Combining conservation and over-representation can produce high quality cis-regulatory predictions in Drosophila
Acknowledgements
Marty Kreitman
David Huen, Michael Ashburner
Thomas Down - Wellcome Trust Sanger Institute
Sue Celniker, Gerry Rubin, Eddy Rubin
Nora Pierstorff - Cologne
Sonia Casillas - Barcelona